Friday, July 15, 2011

A simple but highly useful feature request for DNS

Most people believe that having two Windows domain controllers provides transparent failover, i.e., if one DC fails, the clients will automatically use the other. However, this is not true. The client will simply use the first DC returned by DNS. Similarly, if you use DNS to load-balance across multiple web servers and one of them fails, some clients will still be directed to it.
To fix the problem, there is a very simple solution: enhance the DNS server to perform a health check against the host that each resource record points to. For example, the administrator could specify the TCP port to connect to, as in the imaginary syntax below:
  www        A      1.1.1.1    80     ; return this record only if we can connect to its TCP port 80
  www        A      1.1.1.2    80
  www        A      1.1.1.3    80
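
As a rough illustration of the TCP check described above, here is a minimal Python sketch. It is not part of any real DNS server: the function names `tcp_port_open` and `healthy_records`, the 2-second timeout, and the representation of records as (ip, port) pairs are all assumptions made for this example.

```python
import socket


def tcp_port_open(ip, port, timeout=2.0):
    """Return True if a TCP connection to (ip, port) succeeds within the timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or unreachable host.
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_records(records, check=tcp_port_open):
    """Keep only the (ip, port) records whose health check passes.

    `records` is a hypothetical in-memory form of the imaginary zone syntax.
    """
    return [(ip, port) for ip, port in records if check(ip, port)]
```

A server built this way would answer a query for www with `healthy_records([("1.1.1.1", 80), ("1.1.1.2", 80), ("1.1.1.3", 80)])` instead of the full list, so a host that refuses connections on port 80 would simply drop out of the answer.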

Of course, the health check could be more general. For example, you could use a script:
  www        A      1.1.1.1    web-check.sh  ; return this record only if the script returns true

where the IP would be passed to that script as an argument for checking.
It works for domain controllers too:
  _ldap._tcp.dc._msdcs.foo.com.   SRV  1.1.1.1  dc-check.sh
  _ldap._tcp.dc._msdcs.foo.com.   SRV  1.1.1.2  dc-check.sh
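
The script-based check in the examples above could be run roughly like this. This is only a sketch under some assumptions: the function name `script_check` and the 5-second timeout are made up for illustration, and "returns true" is interpreted as the script exiting with status 0, as is conventional for shell scripts.

```python
import subprocess


def script_check(script, ip, timeout=5.0):
    """Run `script` with the IP as its only argument; exit status 0 means healthy."""
    try:
        result = subprocess.run([script, ip], timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # A missing, non-executable, or hung script is treated as a failed check.
        return False
    return result.returncode == 0
```

With this in place, a record such as the first SRV line would be returned only if `script_check("dc-check.sh", "1.1.1.1")` is true. Treating a broken or hung script as a failure is a design choice for this sketch; an administrator might prefer the opposite so that a buggy script does not empty the answer.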

Finally, one might ask: why implement this check in the DNS server instead of in the clients? The idea is that problems should be detected as early as possible to avoid bad effects downstream. In concrete terms, if a server is down but the DNS server (the broker) still refers clients to it, every one of those clients has to perform the health check itself. But if the DNS server performs the health check, the checking is done only once, saving a lot of trouble downstream.
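
To make the "checked only once" point concrete, here is one way a server might remember a check result for a short time and reuse it for every query that arrives in between. This is a sketch, not a description of any existing DNS server: the class name `CachedCheck`, the 10-second lifetime, and the injectable clock are assumptions for illustration.

```python
import time


class CachedCheck:
    """Wrap a health check so each target is probed at most once per `ttl` seconds."""

    def __init__(self, check, ttl=10.0, clock=time.monotonic):
        self._check = check
        self._ttl = ttl
        self._clock = clock
        self._cache = {}  # check arguments -> (expiry time, result)

    def __call__(self, *key):
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: reuse the earlier result
        result = self._check(*key)
        self._cache[key] = (now + self._ttl, result)
        return result
```

However many clients ask for the same name within the 10 seconds, the target host is probed once, and all of them get an answer that reflects that single probe.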