Disaster Recovery: Overview of Hot Solutions for Sitecore 9.1 Content Delivery Environments

Have you ever experienced a blue screen of death on your computer? Probably yes, nobody is immune. IT systems that host a Sitecore website are not different and they are vulnerable to failures too. The impact of downtime can be devastating for a business, causing loss of revenue and loss of traffic, and damaging a company’s brand image as well.

While a software failure might get resolved in few hours, sometimes an hardware failure might require days to be fixed. Having the right disaster recovery plan in place will help minimize the application downtime and mitigate any loss.

Sitecore support for SQL Server scaling solutions

A disaster recovery solution can be hot, warm or cold, based on the risk of data loss and the expected recovery time that offers. A hot solution should provide a fast recovery with no data loss, while a cold solution would likely introduce a data loss and it might take hours to recover a website.

As of today, Sitecore supports only one SQL Server technology to implement a high availability scaled solution for a Sitecore 9.1 website: the Always On Availability Groups. The support for this technology started with Sitecore 8.2 and replaced the support for Database Mirroring, a SQL Server technology that Microsoft has deprecated and plans to remove in future versions of SQL Server.

Sitecore Disaster Recovery solution using Always On Availability Groups

An Always On Availability Group contains two sets of databases, a primary set and a secondary set, and can be configured to achieve high availability (HA) or read-scale. An HA availability group is a group of databases that fail over together, instead a read-scale availability group is a group of databases that are copied to other instances of SQL Server for read-only workload.

The secondary databases in an availability group are readable, but unfortunately the Sitecore platform requires connections to primary databases only. For this reason the usage of a single Availability Group is not enough to implement a disaster recovery solution, and instead a more complex solution is required, using a new feature introduced with SQL Server 2016, called Distributed Availability Group.

In order to implement a hot disaster recovery solution for a distributed system with two geographically distant datacenters, a Distributed Availability Group solution would require three availability groups: two HA availability groups, with each Sitecore website instance connected to their primary set of databases, and one distributed availability group to replicate the data between the two HA availability groups primary sets.

Disaster Recovery solution for a Sitecore Content Deliver Environment using a Distributed Availability Group.

This solution can be very expensive, considering that Always On Availability Groups technology requires a SQL Server Enterprise license for each availability group.

An alternative solution using a secondary publishing target

In any complex technical implementation, there is always a compromise to reach between cost and quality in order to achieve a desired goal. The same principle applies when you are trying to define the right disaster recovery plan for your company.

If a company cannot afford the expensive costs of a hot disaster recovery solution using Distributed Availability Groups, there is an alternative solution fully supported by Sitecore that involves the application logic: using a secondary publishing target.

Disaster Recovery solution for a Sitecore Content Delivery environment using a secondary publishing target.

The published content of the website, stored in the primary web database, gets fully replicated in a secondary web database, publishing the Sitecore items always on both primary and secondary publishing targets.

The core database and any other database (depending if your system is in CMS-only configuration mode or not) would be cold copies of their respective primary databases.

This solution is theoretically less robust that a solution based on a SQL Server technology, since it relies on the application layer to manage the data replication process, but definitely more affordable.

The following two sections describe how this solution can be improved, using the Sitecore Publishing Service and a simple customization in Sitecore to constrain the publishing process to the secondary publishing target.

Sitecore Publishing Service: a must!

In a disaster recovery architecture, the primary system and the disaster recovery system are usually hosted in geographically separate locations to ensure the availability of the secondary environment when the primary system fails. A geographically distributed architecture introduces a significant latency in any network communication between the primary system and the secondary system. This latency drastically affects the normal Sitecore publishing process and it might take from many minutes, to publish large content sections of a website, to even hours for a full site republish.

The usage of the Sitecore Publishing Service becomes a must if you want to eliminate the negative impact of the network latency. When publishing a large number of items, the Publishing Service minimizes the number of transactions performed with a SQL database, loading in memory the state of the source (items in the master database) and the destination (items in the publishing target database) at the beginning of the publishing job.

Always publish to the disaster recovery publishing target

In the Sitecore platform a content author has the ability to select which publishing target an item will be published to. In order to ensure that the Sitecore items are always published to the disaster recovery publishing target, its selection should not be left to a manual process, prone to human mistakes, but it should be automated and locked.

An easy way to implement this automation is customizing the javascript code in the Publish.js file executed in the Publish dialog form, simply selecting and disabling the Disaster Recovery publishing target option every time the Publish dialog loads. The Publish.js file can be found here in the website folder:

\sitecore\shell\Applications\Dialogs\Publish\Publish.js

The Sitecore Publish dialog with the Disaster Recovery publishing target always selected and disabled.

This is the code added to the Publish.js file:

// Select and disable the Disaster Recovery publishing target
$('#pb_7B5A52337B044DD88FFC9C5F565B54DE').attr('checked', true);
$('#pb_7B5A52337B044DD88FFC9C5F565B54DE').attr('disabled', true);

The selector used in the jQuery function is the ID of the Disaster Recovery publishing target item in Sitecore.

Conclusion

In this blog post I described two different approaches to architect a hot disaster recovery solution for Sitecore 9.1 Content Delivery environments. If you have any questions, or want to share your different implementation approach, please don’t hesitate to comment on this post.

Thank you for reading!

7 thoughts on “Disaster Recovery: Overview of Hot Solutions for Sitecore 9.1 Content Delivery Environments”

judgedice says:

February 13, 2019 at 12:24 pm

Love this… succinct, valid, and contains concrete solutions!

LikeLike

suresh8771 says:

January 8, 2020 at 11:46 pm

how do you suggest for an azure environment with Geo replication enabled database ( which includes coveo , solr,CD and CM , identity)

LikeLike

1. Alessandro Faniuolo says:
  
  January 9, 2020 at 6:39 am
  
  Hi suresh8771, this article on the Sitecore documentation explains the approach and the architecture of a hot high availability disaster recovery solution in Azure: https://doc.sitecore.com/developers/93/sitecore-experience-manager/en/the-hadr-hot-hot-process.html
  
  LikeLiked by 1 person
  
  1. suresh8771 says:
    
    January 13, 2020 at 11:14 pm
    
    Thanks really will help to understand scenarios and technique
    
    LikeLike
Prabhu (@prabhu2k) says:

December 14, 2020 at 6:44 am

Hi Alessandro,

Would an additional SOLR instance be required for hot disaster recovery?

How would xDb tracking activity be affected if we had to remain on the failover option for an extended period of time?

LikeLike

1. Alessandro Faniuolo says:
  
  December 14, 2020 at 8:44 am
  
  Hi Prabhu,
  
  For hot disaster recovery, I recommend to implement a Solr Cross Data Center Replication (CDCR) solution with a load-balancer that redirects traffic to the active Solr instance (https://lucene.apache.org/solr/guide/7_2/cross-data-center-replication-cdcr.html). I haven’t explored if xDb tracking would be affected, but I would expect the disaster recover instance to continue to track if the xConnect instance is available in a DR scenario.
  
  LikeLike
  
  1. Prabhu (@prabhu2k) says:
    
    December 14, 2020 at 11:08 pm
    
    Thank you for the reply Alessandro.
    
    Would CDCR SOLR will be working for On-Premise environments? Also for xConnect, if the disaster recovery data center in a different region, how about the performance lagging while DR connecting the same xConnect?
    
    LikeLike