Broadcast Radio Forums

Failover cluster


http://forums.broadcastradio.com/Topic9041.aspx

By Dan Morgan - Tuesday, September 25, 2018 11:03:13 AM

I'm using a 2-node failover cluster in Windows Server 2012 to provide file and SQL servers for Myriad.
During testing I can see that the cluster is failing over to the second node; however, Myriad still crashes, indicating that it has lost connection to the SQL/file servers.
This is obviously not the desired outcome. Is anyone running Myriad from a failover cluster, and can you suggest what I might be missing?
By Dan Morgan - Tuesday, September 25, 2018 11:46:43 AM

Peter Jarrett - Tuesday, September 25, 2018 11:39:33 AM
Don't worry, clustering is simple in theory, but VERY complicated in practice, as we found out when building v5! :)

With v4 and earlier, we maintained a persistent connection to the database on the SQL Server, which worked just great as long as the server stayed running - which of course is, ideally, 99.999% of the time! This persistent connection was the usual way of connecting to databases for SQL 2005 (which was the main version at the time!), plus it was essential for users still needing to run with the old Jet file-based databases.

But with clustering, when SQL-A fails, the cluster realises it needs to spin up SQL-B (which, yes, takes a short while), but once it's running, SQL-B has no record of the connections that SQL-A had before it died, so Myriad can no longer get any data over its existing connection.

With v5 we changed to a per-request connection system, which is how all modern DB systems expect connections to be made, and is essential for things like clustering to work. This would actually be very slow if a new connection had to be created every time one was needed, so in the middle there is a very clever SQL Connection Cache (managed by Windows itself) which keeps track of connections and re-uses them between requests, so performance stays incredibly fast. The cache is "cluster aware", so if the server dies it knows to kill off its contents and start working with the new server.
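
As a rough illustration of the per-request pattern described above, here is a minimal Python sketch. It is not Myriad's code and not the Windows connection cache itself. FakeServer, FakeConnection, ConnectionPool and demo_failover are all hypothetical names invented for this example, and the "servers" are simulated in memory so it runs without a database. It only shows the idea: borrow a connection per request, hand healthy ones back for re-use, and throw the cached ones away when the server goes down.

```python
import queue
from contextlib import contextmanager


class FakeServer:
    """Stand-in for one SQL Server node (hypothetical, simulation only)."""
    def __init__(self, name):
        self.name = name
        self.alive = True


class FakeConnection:
    """Stand-in for a database connection tied to a single node."""
    def __init__(self, server):
        self.server = server

    def query(self, sql):
        if not self.server.alive:
            raise ConnectionError(f"{self.server.name} is down")
        return f"{self.server.name}: {sql}"


class ConnectionPool:
    """Hands out one connection per request and re-uses healthy ones."""
    def __init__(self, connect):
        self._connect = connect        # callable that opens a new connection
        self._idle = queue.LifoQueue() # connections waiting to be re-used
        self.created = 0               # how many real connections were opened

    @contextmanager
    def connection(self):
        try:
            conn = self._idle.get_nowait()   # re-use a cached connection
        except queue.Empty:
            conn = self._connect()           # nothing cached: open a new one
            self.created += 1
        try:
            yield conn
        except ConnectionError:
            self.clear()                     # server gone: drop everything cached
            raise
        else:
            self._idle.put(conn)             # healthy: return it for re-use

    def clear(self):
        while True:
            try:
                self._idle.get_nowait()
            except queue.Empty:
                return


def demo_failover():
    """Simulate SQL-A dying and SQL-B taking over behind the same pool."""
    cluster = {"active": FakeServer("SQL-A")}
    pool = ConnectionPool(lambda: FakeConnection(cluster["active"]))

    with pool.connection() as conn:
        before = conn.query("SELECT 1")
    with pool.connection() as conn:          # re-uses the first connection
        conn.query("SELECT 1")
    reused = pool.created == 1

    cluster["active"].alive = False          # SQL-A fails
    cluster["active"] = FakeServer("SQL-B")  # cluster brings up SQL-B
    try:
        with pool.connection() as conn:      # stale connection to SQL-A
            conn.query("SELECT 1")
    except ConnectionError:
        pass                                 # that one request fails

    with pool.connection() as conn:          # next request gets a fresh connection
        after = conn.query("SELECT 1")
    return before, reused, after, pool.created
```

Note that the request in flight at the moment of failover still fails in this sketch. It is the next request that transparently lands on the new node, which is the difference from a single persistent connection that stays dead.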

Myriad v5 also has a fairly extensive multi-retry system baked into the code, so that if a SQL Server is running slowly (or, in the case of a cluster, taking a little while to spin up the new server), Myriad retries certain queries multiple times to try to get the data it needs to keep running. This is why v5 is much more tolerant when run on flaky hardware or over slow links that often have packet timeouts, as we see on large WAN systems in some overseas countries.
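
The retry idea can be sketched in a few lines of Python. Again, this is only an illustration under assumptions: with_retries is a hypothetical helper, and the attempt count, delays and backoff factor are made-up values, since the thread does not say which queries Myriad retries or how often.

```python
import time


def with_retries(operation, attempts=5, delay=0.5, backoff=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Only connection and timeout errors are retried; the wait between
    attempts grows by the backoff factor each time (assumed values).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay)       # give the new node time to come up
                delay *= backoff
    raise last_error               # still failing after every attempt
```

A query that fails while the cluster is failing over would then be wrapped as with_retries(lambda: run_query(sql)), where run_query is whatever function actually talks to the database (a placeholder name here). If the server never comes back, the last error is raised once the attempts are used up.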

Hope that makes sense?

It does make perfect sense - thanks for a very succinct explanation. So, in fact, it isn't possible to make v4 fault tolerant? If so, at least I now understand why... and we also have clustering in place for when v5 comes along (which I will now be making a case for!).
Thanks again, Peter!