global server load balancing with failover

background

This configuration example covers global server load balancing and failover of multiple distributed servers using the edgedirector advanced dns service.

To avoid publishing details that might apply to servers belonging to other parties, the domain name used in the example will be example.com and the addresses used will be private ip ranges. This practice is in accordance with the recommendations of the IETF as published in various RFCs.

usage scenario

example.com wants to ensure that visitors from around the globe are served by the closest server. Multiple servers are in place at geographically distributed data centres. When a particular server goes offline, the next best server must be returned in dns answers to avoid a complete service outage. The site is published as example.com and www.example.com.

typical global load balancing and failover configuration

The steps outlined in this configuration combine most of the elements found in the failover and load balancing configuration examples.

1. Create one example.com address record for each distributed server ip address. Set the ttl of all records to a value between 60 and 600 seconds. Mark all records with "failback on recovery". Do not mark any records with "hot standby".

2. Create one alias record for www.example.com pointed at example.com with a matching ttl. The www.example.com alias record will follow the example.com target automatically.

3. Create an http monitor test for each of the address records. The test interval will be automatically set to twenty percent of the underlying ttl, bounded to the range of 60 to 900 seconds.

4. Turn on alerts for each of the test monitors.

5. Set the geo coverage for each server according to physical location. Leave the global coverage enabled for all servers.

6. Create administrative access records for each of the servers. These records are used by administrators to access the correct server unambiguously by name.

7. Publish dns records using the red link at the top of the dns management page. The records will be available within two minutes.

system operation

The backend system generates multiple records matching the geodns parameters set by the dns administrator. These records are pushed to the public facing dns servers.

When queries arrive, the address record most closely matching the geographical location of the query source is selected as the answer record. Matching proceeds from the smallest qualifying geo coverage to the largest, so a record with narrower coverage is preferred over one with broader coverage. The visitor will in turn receive the dns answer and initiate a connection to the server determined to be closest according to these rules.

When queries arrive for the www.example.com alias, an address record is also returned in the answer packet. The address record is selected using the same rules as a direct query for an address record.

In the event of a server going offline, the corresponding record is withdrawn from service. Dns answers are then drawn from the remaining pool of available servers. Leaving the global dns coverage enabled for all records ensures that a valid answer is given, even if the closest server is unavailable.
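
The selection and failover behaviour described above can be sketched in Python. This is only an illustration under stated assumptions: the scope labels, the record table, and the select_record function are hypothetical, the addresses are private placeholders, and the tie-break between records with equal coverage (first in table order) is a guess, since the text does not specify it.

```python
from typing import Dict, List, Optional, Set

# Hypothetical address records for example.com: address -> coverage scopes.
# Every record keeps "global" coverage, as recommended in step 5.
RECORDS: Dict[str, Set[str]] = {
    "10.0.1.10": {"north-america", "global"},
    "10.0.2.10": {"europe", "global"},
    "10.0.3.10": {"asia", "global"},
}


def select_record(query_scopes: List[str],
                  records: Dict[str, Set[str]],
                  withdrawn: Set[str]) -> Optional[str]:
    """Pick an address by trying the query's scopes from smallest to largest.

    query_scopes lists the scopes containing the query source, narrowest
    first, for example ["de", "europe", "global"]. Withdrawn records are
    skipped.
    """
    for scope in query_scopes:
        for address, coverage in records.items():
            if address not in withdrawn and scope in coverage:
                return address
    return None  # no record is available at all


# A query from Germany normally gets the European server.
print(select_record(["de", "europe", "global"], RECORDS, set()))
# With the European server withdrawn, global coverage still yields an answer.
print(select_record(["de", "europe", "global"], RECORDS, {"10.0.2.10"}))
```

In this sketch the fallback is simply the first remaining record with global coverage; the real service may choose the next best server differently.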

The monitoring service monitors the availability of services at all of the example.com address records marked for monitoring. The probes are executed simultaneously from multiple servers.

A record will be marked as unavailable if no valid response is returned to any of the probes. If any probe receives a valid response, the record is considered available for service. When a production address record is marked unavailable, it is withdrawn from the pool of available answers.

An alert will be sent to the designated alert recipients.

The monitoring service continues to test all monitored example.com records. When a server becomes available again, the records are modified to return the corresponding record to the available pool.
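
The monitoring cycle can be summarised with a short Python sketch. The function names, the alert text, and the data shapes are hypothetical. The rule that a single valid probe response keeps a record available comes from the description above; treating a record with no probe results as unavailable is an assumption of the sketch.

```python
from typing import Dict, List, Set


def is_available(probe_results: List[bool]) -> bool:
    """A record is available if any probe received a valid response."""
    return any(probe_results)  # an empty list counts as unavailable here


def update_pool(pool: Set[str], results: Dict[str, List[bool]]) -> List[str]:
    """Withdraw or restore records in place and return alerts for withdrawals."""
    alerts = []
    for address, probes in results.items():
        up = is_available(probes)
        if not up and address in pool:
            pool.discard(address)  # withdraw the record from service
            alerts.append(f"{address} withdrawn")
        elif up and address not in pool:
            pool.add(address)      # return the record to the available pool
    return alerts


pool = {"10.0.1.10", "10.0.2.10"}
# The first server fails every probe; the second answers one probe of three.
alerts = update_pool(pool, {
    "10.0.1.10": [False, False, False],
    "10.0.2.10": [False, True, False],
})
print(sorted(pool), alerts)
# Later the first server answers again and rejoins the pool.
update_pool(pool, {"10.0.1.10": [True, True, True]})
print(sorted(pool))
```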

additional information

It is important to understand that dns queries will generally originate from the dns cache servers operated by the visitor's internet service provider. Some cache servers are geographically distant from the visitor, or their ip addresses do not bear a direct relationship with their geographical location. This limitation is unavoidable, so users must recognise and allow for some leakage, that is, some visitors being directed to a server other than the closest one.

Compared to failover alone, combining global load balancing and failover makes more efficient use of committed resources. When failover is used alone, the hot standby server incurs expenses even when it is not being used to serve visitors. In contrast, global load balancing ensures that all committed resources are employed for live production. Combining both feature sets allows sites to be constantly available even in the face of individual server outages.

The ttl value impacts how quickly a site can be switched over when servers go offline and come back online. The recommended value is 300 seconds. Values over 600 seconds are not recommended due to slow failover response.

A record can be withdrawn from service manually by using the control panel. This allows a physical site to be taken out of service for maintenance while the other servers take over serving visitors.

If no records for example.com are available, then an alias record is returned for hold.dxmx.org. That server is hosted by edgedirector. It is configured to respond with a courtesy notification stating that example.com is not currently available and suggesting that the visitor try again in a few minutes.
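
The last-resort behaviour can be sketched in Python as follows. The build_answer function and the "address" and "alias" labels are hypothetical; only the hold.dxmx.org target comes from the text above, and geo selection is left out to keep the sketch short.

```python
from typing import List, Tuple

HOLD_TARGET = "hold.dxmx.org"  # courtesy notification host named above


def build_answer(available: List[str]) -> Tuple[str, str]:
    """Return (record kind, value) for an example.com query."""
    if available:
        # Geo selection is omitted; the first available address is used.
        return ("address", available[0])
    # No server is available, so answer with the alias to the hold host.
    return ("alias", HOLD_TARGET)


print(build_answer(["10.0.1.10", "10.0.2.10"]))
print(build_answer([]))
```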