locked
Multi-Subnet Availability Group client DNS mismatch RRS feed

  • Question

  • I am having some issues with Multi-Subnet availability groups in SQL 2012.  I am able to successfully create the AV group and clients can connect to it (sometimes).  My issue stems from the fact that when creating the listener a DNS entry is created for each subnet that my AV group is in.  Since the DNS server uses a round-robin method to distribute IP addresses, when clients resolve the listener name to the IP address, the resolved address rotates between the DNS entries for the listener name.  As such I have my AV group in two subnets, so I have two DNS entries and when a client resolves the listener name it switches back and forth between the two DNS entries.  If the DNS entry returned is not the current active IP address, then the client connection attempt fails, and will continue to fail until the DNS is flushed or until the DNS lease expires.  As such 50% of the time when a client resolves the AV group listener name they will be unable to connect.

    This issue has not been consistent for all applications; in Microsoft Systems Center Operations Manager we never saw this issue; however for a couple of homegrown web applications we are seeing it consistently.

    Below I have detailed an example of our environment and the steps we execute to see the issue.

    Environment

    ·         Windows Domain (Windows 2008)

    o   Server A

    o   In the 192.168.1.0 subnet

    o   Windows 2008 R2

    o   SQL 2012

    ·         Server B

    o   In the 192.168.2.0 subnet

    o   Windows 2008 R2

    o   SQL 2012

    ·         Server C

    o   in the 192.168.3.0 subnet

    o   Windows 2003

    ·         Server A and B clustered together

    ·         Availability Group J

    o   containing a database for web application X

    ·         Listener K

    o   Is configured with an IP address in both subnet 192.168.1.0 and 192.168.2.0

    ·         Server C hosts web application X using IIS

    ·         The web.config for web application X is configured to use listener K to connect to the database

    Issue

    ·         Availability Group J is currently on Server A.

    ·         Through web browser login to web application X and generate report. (Success)

    ·         Ping Listener K which resolves to address in 192.168.1.0 subnet.

    ·         Perform manual failover of Availability Group J from Server A to Server B.

    ·         Through web browser login to web application X and generate report. (Failure)

    ·         Ping Listener K which resolves to address in 192.168.1.0 subnet.

    ·         Flush DNS cache

    ·         Through web browser login to web application X and generate report. (Success)

    ·         Ping Availability Group J which resolves to address in 192.168.2.0 subnet.

    ·         Perform manual failover of Availability Group J from Server B to Server A.

    ·         Through web browser login to web application X and generate report. (Failure)

    ·         Ping Availability Group J which resolves to address in 192.168.2.0 subnet.

    ·         Flush DNS cache

    ·         Through web browser login to web application X and generate report. (Success)

    Has anyone else run into this?  If so what was the solution that you came to?  Any assistance with this would be appreciated.

    Friday, December 16, 2011 2:04 PM

Answers

  • Hi!

    I can confirm this behaviour in some scenarios on my side as well. For .Net based clients the solution is simple: Microsoft has brought out an update package to support the new AG features, adding the "MultiSubNetFailover=True" parameter to the connection string. For legacy clients that do not use a Microsoft driver this is a problem though. The only workaround I was able to come up with so far is throwing out the listener and readding it manually via the cluster admin. In this scenario the cluster only registers the IP in DNS that is currently up. You still have to face with DNS client caches in this scenario, but it's better than nothing...

    @Justin, Santosh: Any comments from Microsoft on that?

    Lucifer

    • Marked as answer by g0ld2k Thursday, December 29, 2011 9:28 PM
    Tuesday, December 27, 2011 6:54 AM
  • Interesting to know, thank you.

    It took me a while, but I figured out where the difference in the VNN config is between the cluster generated VNN and the SQL generated Listener: The Network Name resource has a private property "RegisterAllProvidersIP", that SQL sets to true. If you change that to false via cluster.exe you have the same behaviour as with the cluster generated VNN but you still see it in SQL Management Studio.

    I will file a bug entry for this whole system with Microsoft. This is not really a good thing for real life applications...

    • Marked as answer by g0ld2k Thursday, December 29, 2011 9:28 PM
    • Unmarked as answer by g0ld2k Friday, December 30, 2011 2:11 PM
    • Marked as answer by g0ld2k Friday, December 30, 2011 2:11 PM
    Wednesday, December 28, 2011 6:28 AM

All replies

  • Hi!

    I can confirm this behaviour in some scenarios on my side as well. For .Net based clients the solution is simple: Microsoft has brought out an update package to support the new AG features, adding the "MultiSubNetFailover=True" parameter to the connection string. For legacy clients that do not use a Microsoft driver this is a problem though. The only workaround I was able to come up with so far is throwing out the listener and readding it manually via the cluster admin. In this scenario the cluster only registers the IP in DNS that is currently up. You still have to face with DNS client caches in this scenario, but it's better than nothing...

    @Justin, Santosh: Any comments from Microsoft on that?

    Lucifer

    • Marked as answer by g0ld2k Thursday, December 29, 2011 9:28 PM
    Tuesday, December 27, 2011 6:54 AM
  • Thanks Lucifer, I had tried configuring it that way as well (I saw the same results you did) however when setting up the listener through the Cluster Admin console the listener is not seen by SQL Server (as far as I can tell this is not a big deal, however I am concerned about possible issues).

    Also if you can use the 'MultiSubNetFailover=True' connection string parameter make sure you also use the 'Connection Timeout' parameter and set it to at least 60 seconds.  I found that without the 'Connection Timeout' parameter the connection times out each time when 'MultiSubNetFailover=True'.  This appears to be due to the default timeout of 15 seconds, so when I change the Timeout to 60 seconds I am able to establish a connection and failover to it without issue.

    ~Chris


    • Edited by g0ld2k Tuesday, December 27, 2011 2:55 PM Forgot something
    Tuesday, December 27, 2011 2:53 PM
  • Interesting to know, thank you.

    It took me a while, but I figured out where the difference in the VNN config is between the cluster generated VNN and the SQL generated Listener: The Network Name resource has a private property "RegisterAllProvidersIP", that SQL sets to true. If you change that to false via cluster.exe you have the same behaviour as with the cluster generated VNN but you still see it in SQL Management Studio.

    I will file a bug entry for this whole system with Microsoft. This is not really a good thing for real life applications...

    • Marked as answer by g0ld2k Thursday, December 29, 2011 9:28 PM
    • Unmarked as answer by g0ld2k Friday, December 30, 2011 2:11 PM
    • Marked as answer by g0ld2k Friday, December 30, 2011 2:11 PM
    Wednesday, December 28, 2011 6:28 AM
  • Lucifer,

    I appreciate your help with determining the differences between the VNN created by the cluster and SQL.  I played with it and confirmed that this does cause a single DNS entry to be maintained for the Listener and that the DNS entry is updated when an Availability Group failover occurs.

    I did run into one issue.  From my testing if you want to use the 'MultiSubNetFailover=True' connection string property and have a single DNS entry then 'MultiSubNetFailover=True' does not work.

    If the application opens a connection to the database and an Availability Group failover occurs, then the application can no longer connect to the database.  If you re-launch the application then the application appears to use the cached DNS entry and does not connect to the application (unless you flush the DNS or allow the DNS entry to expire).

    I assume that the 'MultiSubNetFailover=True' connection string property caches all IP addresses returned by DNS for a hostname, which allows it to not use the single currently cached IP address from DNS that Windows has but instead is able to connect to the currently active IP address.

    ~Chris

    Friday, December 30, 2011 2:27 PM