automatic distributed server failover to hot standby

background

This configuration example covers failover from a main production server to a hot standby server using the edgedirector failover dns service.

To avoid publishing details that might apply to servers belonging to other parties, the domain name used in the example is example.com and the addresses are drawn from private ip ranges. This practice follows the recommendations of the IETF as published in various RFCs.

usage scenario

The operators of example.com want to ensure that the site is available even if the main data center goes offline. A duplicate site is hosted at an alternate data center. The alternate server should only be in service if the main server is not available. The site is published as example.com and www.example.com.

typical failover configuration

1. Create two address records for example.com, one for the main server ip address and one for the hot standby server ip address. Set the ttl of both records to a value between 60 and 600 seconds. Mark both with "failback on recovery". Mark the standby server with "hot standby".

2. Create one alias record for www.example.com pointed at example.com with a matching ttl. The www.example.com alias record will follow the example.com target automatically.

3. Create an http monitor test for each of the two address records. The test interval is automatically set to twenty percent of the underlying ttl, bounded to the range of 60 to 900 seconds.

4. Turn on alerts for each of the test monitors.

5. Create administrative access records for each of the servers: main.example.com pointed at the main server and fail.example.com pointed at the hot standby server. Administrators use these records to reach the correct server unambiguously by name, regardless of failover state. These records should not be monitored.

6. Publish the dns records using the red link at the top of the dns management page. The records will be available within two minutes.
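
The test interval rule in step 3 can be sketched as a small calculation. This is only an illustration of the stated rule (twenty percent of the ttl, kept within 60 to 900 seconds). The function name monitor_interval is hypothetical, and rounding down to whole seconds is an assumption, since the service's exact rounding is not documented here.

```python
def monitor_interval(ttl_seconds: int) -> int:
    """Return the monitor test interval, in seconds, for a given record ttl.

    Hypothetical helper: twenty percent of the ttl, clamped to 60..900 seconds.
    Rounding down to whole seconds is an assumption.
    """
    raw = ttl_seconds * 20 // 100      # twenty percent of the ttl
    return max(60, min(900, raw))      # keep the result within 60..900 seconds


# ttl values from the recommended 60 to 600 second range
for ttl in (60, 300, 600):
    print(f"ttl {ttl:>3} s -> test interval {monitor_interval(ttl)} s")
```

Under these assumptions, a ttl of 300 seconds or less gives the minimum interval of 60 seconds, and a ttl of 600 seconds gives an interval of 120 seconds.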

system operation

The monitoring service monitors the availability of services at all of the example.com address records marked for monitoring. The probes are executed simultaneously from multiple servers.

A record will be marked as unavailable if no valid response is returned to any of the probes. If any probe receives a valid response, the record is considered available for service. When a production address record is marked unavailable, the hot standby record is substituted.
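
The availability rule above amounts to a logical "any" over the probe results. A minimal sketch, assuming each probing server reports a simple true or false for "valid response received"; the function name record_available is hypothetical and is not part of the edgedirector service.

```python
from typing import Iterable


def record_available(probe_results: Iterable[bool]) -> bool:
    """Return True if at least one probe received a valid response.

    probe_results holds one boolean per probing server (hypothetical model).
    A record is unavailable only when every probe failed.
    """
    return any(probe_results)


# One probe location cannot reach the server, the other two can: still available.
print(record_available([False, True, True]))    # True
# No probe received a valid response: the record is marked unavailable.
print(record_available([False, False, False]))  # False
```

Requiring every probe to fail before a record is withdrawn means that a network problem local to one probing server does not trigger a failover on its own.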

An alert will be sent to the designated alert recipients.

The monitoring service continues to test all monitored example.com records. When the main server becomes available again, the records are modified to return the main address and the standby server is withdrawn from service.

additional information

If no records for example.com are available, an alias record pointing to hold.dxmx.org is returned. That server is hosted by edgedirector. It is configured to respond with a courtesy notification that example.com is not currently available and to suggest that the visitor try again in a few minutes.

Only address records can be configured for monitoring. All address records with the same name are part of a single group. All other records target address records by name and will follow them automatically.

Multiple address records can serve as production servers. In that case, all live records will be returned in dns answers. Additionally, multiple standby records can exist. However, no standby record will be returned unless no production record is available.
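
The selection rules described in this section can be sketched as follows. This is an illustrative model only, not the service's implementation: the AddressRecord class and select_answers function are hypothetical names, the addresses are private example values, and returning every available standby record (rather than just one) is an assumption.

```python
from dataclasses import dataclass
from typing import List

HOLD_ALIAS = "hold.dxmx.org"  # courtesy host returned when nothing is available


@dataclass(frozen=True)
class AddressRecord:
    """Hypothetical model of one monitored address record in a group."""
    address: str
    standby: bool = False    # True if marked "hot standby"
    available: bool = True   # latest monitoring result


def select_answers(records: List[AddressRecord]) -> List[str]:
    """Pick what to return for a group of address records with the same name."""
    production = [r.address for r in records if not r.standby and r.available]
    if production:
        return production    # all live production records are returned
    standby = [r.address for r in records if r.standby and r.available]
    if standby:
        return standby       # standby is used only when no production record is live
    return [HOLD_ALIAS]      # nothing available: alias to the hold server


group = [
    AddressRecord("10.0.0.1"),                 # main server
    AddressRecord("10.0.1.1", standby=True),   # hot standby server
]
print(select_answers(group))                   # ['10.0.0.1']

main_down = [
    AddressRecord("10.0.0.1", available=False),
    AddressRecord("10.0.1.1", standby=True),
]
print(select_answers(main_down))               # ['10.0.1.1']
```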

The ttl value impacts how quickly a site can be switched over when servers go offline and come back online. The recommended value is 300 seconds. Values over 600 seconds are not recommended due to slow failover response.
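
As a rough illustration of why the ttl matters, the time for visitors to reach the standby server can be estimated as one test interval (to detect the failure) plus one ttl (for cached answers to expire). This is a simplified estimate and not a figure published by the service. It assumes a single failed test round is enough to mark a record unavailable, ignores publication delay, and uses the hypothetical name switchover_estimate.

```python
def switchover_estimate(ttl_seconds: int) -> int:
    """Rough upper estimate, in seconds, of the time to switch visitors over.

    Assumptions (not from the service documentation): one failed test round
    triggers the switch, and resolvers honour the ttl exactly.
    """
    # test interval: twenty percent of the ttl, kept within 60..900 seconds
    interval = max(60, min(900, ttl_seconds * 20 // 100))
    return interval + ttl_seconds


for ttl in (60, 300, 600):
    print(f"ttl {ttl:>3} s -> up to about {switchover_estimate(ttl)} s")
```

With these assumptions, the recommended ttl of 300 seconds gives an estimate of about six minutes, while a ttl of 600 seconds gives about twelve.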

Sites may wish to consider combining global load balancing with failover. This makes better use of facilities which are already being paid for.

It is not required that the main and failover sites be identical in content, functionality or size. For example, the main site might be a colocated server while the failover site is on shared hosting.

A record can be withdrawn from service manually by using the control panel. This allows a physical site to be taken out of service for maintenance while the other server takes over serving visitors.

cloud configuration

Where there is a main site, a manually started backup site on cloud services, and a hot standby site, a modified configuration is required.

Designate the main site and cloud site as failover and the hot standby site as hot standby.

When the main site fails, the hot standby site will be put into service because the cloud site is also unavailable. Once the administrator has brought the cloud site online, the hot standby site will be withdrawn in favour of the cloud site.

When the main site comes back online, it will also be included in dns answers. As soon as the administrator quiesces the cloud service, this will be interpreted as a failure and the cloud service records will be withdrawn.

At that point, only the main server will be in service.
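
The cloud sequence above can be walked through with a small simulation. This is a sketch under stated assumptions: the main and cloud sites are modelled as ordinary (non-standby) records, the third site as hot standby, and the names IS_STANDBY, answers and timeline are hypothetical.

```python
from typing import Dict, List

# True marks the hot standby site; main and cloud are treated as production records.
IS_STANDBY: Dict[str, bool] = {"main": False, "cloud": False, "standby": True}


def answers(up: Dict[str, bool]) -> List[str]:
    """Return the sites included in dns answers for a given availability state."""
    live = [site for site, standby in IS_STANDBY.items() if not standby and up[site]]
    if live:
        return live
    # no production site is live, so fall back to any available standby site
    return [site for site, standby in IS_STANDBY.items() if standby and up[site]]


timeline = [
    ("normal operation",   {"main": True,  "cloud": False, "standby": True}),
    ("main site fails",    {"main": False, "cloud": False, "standby": True}),
    ("cloud site started", {"main": False, "cloud": True,  "standby": True}),
    ("main site recovers", {"main": True,  "cloud": True,  "standby": True}),
    ("cloud site stopped", {"main": True,  "cloud": False, "standby": True}),
]

for step, state in timeline:
    print(f"{step:<20} -> {answers(state)}")
```

The printed sequence is main, standby, cloud, main and cloud together, then main alone, which matches the order of events described in this section.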