DNS and server failover

fyi: managed dns services - edgedirector.com

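Round robin DNS is a technique for distributing traffic across multiple servers: one name resolves to several addresses, and clients are spread among them. It does not check server health, however, so if a server fails, the clients directed to that server are not served.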
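The weakness of plain round robin can be sketched in a few lines of Python. This is an illustration only, not how any particular DNS server is implemented: the addresses come from the reserved documentation range, and the function name count_outcomes is hypothetical.

```python
from itertools import cycle

# Hypothetical server addresses (192.0.2.0/24 is reserved for documentation).
ADDRESSES = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]


def count_outcomes(addresses, down, queries):
    """Hand out addresses in strict rotation with no health checking.

    Returns (served, failed): how many queries were sent to a working
    server and how many were sent to a server listed in `down`.
    """
    rotation = cycle(addresses)
    served = failed = 0
    for _ in range(queries):
        address = next(rotation)
        if address in down:
            failed += 1  # the client was pointed at a dead server
        else:
            served += 1
    return served, failed


if __name__ == "__main__":
    # With one of three servers down, a third of the queries fail.
    print(count_outcomes(ADDRESSES, {"192.0.2.11"}, 9))
```

With one of three servers down, roughly a third of clients keep being sent to the failed address, which is the gap that failover monitoring closes.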

DNS failover builds on round robin DNS. A DNS failover service avoids disruption of service to clients by monitoring the health of the individual servers and adjusting the addresses it hands out accordingly. The servers may be configured as active-active or active-standby.

In active-active configurations, all servers participate in serving client requests while they are healthy. If the server health monitor detects that a server is unresponsive, that server's address is automatically taken out of the service rotation. When the failed server is detected as healthy again, its address is returned to the service rotation.
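The active-active behaviour described above can be modelled roughly as follows. This is a minimal sketch under stated assumptions: the class name ActiveActivePool and its methods are hypothetical, and the health monitor is represented only by calls to report(), not by real network probes.

```python
class ActiveActivePool:
    """Toy model of an active-active rotation driven by health reports."""

    def __init__(self, addresses):
        self.addresses = list(addresses)  # configured order is preserved
        self.healthy = set(addresses)     # every server starts in rotation

    def report(self, address, is_healthy):
        """Record the latest result from the (assumed) health monitor."""
        if address not in self.addresses:
            raise ValueError(f"unknown server: {address}")
        if is_healthy:
            self.healthy.add(address)      # recovered: back into rotation
        else:
            self.healthy.discard(address)  # unresponsive: out of rotation

    def rotation(self):
        """Addresses currently eligible to be handed out to clients."""
        return [a for a in self.addresses if a in self.healthy]


if __name__ == "__main__":
    pool = ActiveActivePool(["192.0.2.10", "192.0.2.11"])
    pool.report("192.0.2.11", False)
    print(pool.rotation())
    pool.report("192.0.2.11", True)
    print(pool.rotation())
```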

In active-standby configurations, one or more servers are marked as hot standby servers. When the health monitor detects a failure of an active server, one of the hot standby servers is automatically added to the pool of active servers eligible for service.
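A comparable sketch for active-standby follows. The names ActiveStandbyPool and mark_failed are hypothetical, and two details are assumptions rather than anything stated here: standbys are promoted in the order they were listed, and what happens when the failed server recovers is left out entirely.

```python
class ActiveStandbyPool:
    """Toy model of an active pool backed by hot standby servers."""

    def __init__(self, active, standby):
        self.active = list(active)    # servers currently eligible for service
        self.standby = list(standby)  # hot standbys, promoted in listed order
        self.failed = []              # servers taken out after a failure

    def mark_failed(self, address):
        """Remove a failed active server and promote one standby if available.

        Returns the promoted address, or None if nothing was promoted.
        """
        if address not in self.active:
            return None               # not an active server: nothing to do
        self.active.remove(address)
        self.failed.append(address)
        if not self.standby:
            return None               # no standby left to promote
        promoted = self.standby.pop(0)
        self.active.append(promoted)
        return promoted


if __name__ == "__main__":
    pool = ActiveStandbyPool(["192.0.2.10"], ["192.0.2.20"])
    print(pool.mark_failed("192.0.2.10"), pool.active)
```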

In either configuration, the server health monitor sends a notification to designated recipients for every failure event.

DNS-based failover and load balancing are especially well suited to creating redundancy across geographically distributed locations.

For the ultimate in uptime, combine DNS-based failover and load balancing with locally load-balanced clusters in different data centers. The local load balancer maintains uptime at the primary site and minimises the need to switch sites. When a switch does become necessary, DNS failover ensures that your other physical cluster is put into service.

Even a fairly unaggressive setup can achieve good results, as an independent experiment carried out by the owner of bestcrosswords.com shows.

Using a TTL of 30 minutes, he manually changed the DNS entries for his site to point at his backup server. After 30 minutes, 80 percent of users were seeing the new address; after 40 minutes, 90 percent were.

While this shows that some DNS caches do not respect the TTL, the majority are operated according to the rules. It is better to be able to serve 80 percent of users under real-life conditions than to abandon 100 percent of them. And, of course, the delays can be shortened by using a shorter TTL.

He posted the full results of his experiment here.
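The effect of the TTL on switch-over time can be estimated with simple arithmetic. The sketch below is an upper bound for caches that respect the TTL and says nothing about those that do not. The check interval and failure threshold are hypothetical monitor settings introduced for illustration; in the manual experiment above there was no detection step, so both are zero.

```python
def worst_case_switch_seconds(ttl, check_interval=0, failures_before_switch=0):
    """Upper bound on how long a TTL-respecting cache keeps the old address.

    ttl                    -- record TTL in seconds
    check_interval         -- assumed seconds between health checks
    failures_before_switch -- assumed consecutive failed checks needed
                              before the monitor changes the DNS answer
    """
    detection = check_interval * failures_before_switch
    # A cache may have fetched the old record just before the change,
    # so it can hold it for one full TTL after the switch is made.
    return detection + ttl


if __name__ == "__main__":
    # Manual change with a 30 minute TTL, as in the experiment.
    print(worst_case_switch_seconds(30 * 60) / 60, "minutes")
    # Hypothetical automated setup: 5 minute TTL, 60 s checks, 3 failures.
    print(worst_case_switch_seconds(5 * 60, 60, 3) / 60, "minutes")
```

Under these assumptions, cutting the TTL from 30 minutes to 5 reduces the bound from 30 minutes to 8, even after allowing 3 minutes for failure detection.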