Dealing with Failover and Transient Errors on a Mirror, Always On or Azure

As you move your platform from running on a single database and single server to running on a highly available scenario, be that a Mirror, AlwaysOn, or a cloud based platform like Windows or SQL Azure, you will find out that highly available does not equal highly reliable! The service will still go down during failover and this will not be covered by your “up time SLA”.

The solution is unfortunately the need to re-code the application tier to detect these “transient” errors and retry the offending code – hoping that the service is now running again on another node.

I will explore three common scenarios that need fault retry logic:

  • SQL Mirror (SQL 2005+)
  • SQL Always On (SQL 2012)
  • Windows or SQL Azure

1) Dealing with Fault Retry Logic on a SQL Mirror

The sales blurb may claim that a SQL Mirror is zero down time with its “transparent” client redirection.  The truth is that during a failover you will loose your current connection, and a rollback of any uncommitted transactions will occur.

If you are happy that your application crashes for a little while (e.g. web users see say and error page), and that in progress transactions are lost, then this is ok. Otherwise you will need the application code to support Fault Retry Logic.

The good news is that there is some sample code blocks available on codeplex that can be used as a starting point if you are developing on dot.net

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  // The relevant error codes which are documented in http://support.microsoft.com/kb/109787</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">        // which seem to be worthwhile for retrying.  If errors occur which are not in this list they </pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">        // are simply rethrown.</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">        readonly static int[] RetryErrorNumbers = new int[] </pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">        { </pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">             1608,  // A network error was encountered while sending results to the front end</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">            10008,  // Bad token from SQL Server: datastream processing out of sync</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">            10010,  // Read from SQL Server failed</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">            10018,  // Error closing network connection</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">            10025,  // Write to SQL Server failed</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">            10053,  // Software caused connection abort</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">            10054,  // Connection reset by peer</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">            10058,  // Can't send after socket shutdown</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">            17824,  // Unable to write to server-side connection</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">            17825,  // Unable to close server-side connection</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">            17832,  // Unable to read login packet(s)            </pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">        };</pre>
<!-- /wp:preformatted -->

http://sqlfaultretryprovide.codeplex.com

If you are not developing on dot.net, then this code is still useful as you can read from the code the typical error codes that you need to look out for on failover. You can see the error numbers along with a nice KB article below.

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">   1:  /// <summary></pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">   2:          /// This method invokes the supplied delegate with retry logic wrapped around it.  No values are returned.  If the delegate raises </pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">   3:          /// recoverable SQL server or client errors, then the supplied delegate is reinvoked after a certain amount of delay</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">   4:          /// (which grows exponentially for each retry) until the retry limit is exceeded, at which point the exception</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">   5:          /// is rethrown.  Other exceptions are not caught and will be visible to callers.</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">   6:          /// </summary></pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">   7:          /// <param name="body">The delegate to invoke within the retry code.</param></pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">   8:          internal void WithMirrorRetry (RetryBody body)</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">   9:          {</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  10:              int retryCount = 0;</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  11:  &nbsp;</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  12:              while (true)</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  13:              {</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  14:                  try</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  15:                  {</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  16:                      body();</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  17:                      return;</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  18:                  }</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  19:                  catch (SqlException se)</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  20:                  {</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  21:                      if (!RetryErrorNumbers.Contains<int>(se.Number))</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  22:                          throw;</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  23:                      else</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  24:                          if (retryCount >= _maxRetries) throw;</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  25:                  }</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  26:                  //Sleep exponentially more each time through the retry loop</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  27:                  Thread.Sleep((int)Math.Pow(_delayBase, retryCount));</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  28:                  retryCount += 1;</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  29:              }</pre>
<!-- /wp:preformatted -->

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">  30:          }</pre>
<!-- /wp:preformatted -->

The sample code block uses a delegate to ensure facilitate the following algorithm:

a) Try method 
b) If it throws and exception, check to see if the exception number is one of the known errors associated with mirror failover (see above)
c) If it is a known error message then go to sleep and retry
d) each retry attempt takes exponentially longer until the max retries are reached

2) Dealing with Fault Retry Logic for SQL 2012 Always On

In theory SQL 2012 is similar needs a similar approach to that of the Mirror Fault Retry Provider (above). However there is now an expanded list of potential error numbers with new messages specific to Always On. A quick search on sys.messages shows some 306 new messages that mention replicas or availability groups

I haven’t managed to find a definitive list of which ones are relevant for detecting failover, but so far in my testing I am only hitting error codes 976 and 983 on top of the standard mirror ones listed in http://support.microsoft.com/kb/109787

Hopefully there will be an official KB article or code block from the patterns and practices group to help narrow down the list of error codes to use.

3) Dealing with Transient Errors in Windows Azure

Just to make matters more complex, environments like Windows Azure also have “transient” errors relating to either failover, changed connections or just the nature of the cloud.

You can read about the Transient Fault Handling Block and download it from this link

A Word of Warning on Application Transactions

If your application is using Microsoft DTC for transactions, or explicitly starting database transactions outside of a stored procedure, then the failover will cause the entire transaction to roll back. Your retry logic needs to do a lot more work than just retry the last statement after re-opening the connection. It will need to retry all statements, including the start of the transactions.

The sample code on transient errors, or fault retry will not help you here. You will need to ensure that retry happens for the whole batch of statements within the application. This seems to be a very common practice with Java development, so make sure the project includes a review of this risk.

Leave a Reply