13.12.2015

Disaster recovery

Ivan Vagunin

This post describes key concepts of disaster recovery for IT services. As an example I will use recovery of a SharePoint service, since that is what I mostly work with, but most of the concepts apply to any IT service.

Overview
You probably hear a lot about backups and recovery, but do you have a disaster recovery plan? In my experience fewer than 10% of customers have one. I don't mean they don't have backups; they do. But do they really know how to restore the service using those backups? Did they create detailed step-by-step instructions on how to do that? Do they test those instructions regularly? Well, as I said, the answer is NO in 90% of cases.

Yet most customers (I mean almost all of them) consider services like intranet and extranet critical for business (to a greater or lesser degree in each case, but still critical).
Having backups and assuming you can restore them in case of disaster is like learning to swim by being thrown into the water: sometimes it may work, but the risk of drowning is pretty high, right?

The truth is that you should fully understand what it takes to restore a SharePoint farm from backups, what kind of backups you have, and how old those backups are. So let's try to sort it out.

The key points you need to understand are: what a disaster, a backup, recovery, a recovery plan and high availability are, and how they differ from each other. But first, you should ask yourself: "Should I really care about all of that?" and the answer is "It depends…". It really does, so I'll try to point out some things to consider and you decide.

Disaster. Why should I care (RTO)?

Everyone probably understands what a disaster is, so I won't provide a long definition. Let's just agree that a disaster is a significant (from a time perspective) service outage. How long an outage must be to count as significant is unique to each service: e.g. an outage of a few hours is not a disaster for an intranet with 100 users, but it is for a large cloud service like SharePoint Online with millions of users. That is the first thing to consider when making your decision. If you have all the time in the world to restore your service, you don't really need a disaster recovery plan, or at least it can reside in your head and be like "I will just restore from the backup".

But what if you need to recover the service within a certain period of time? Even worse, you may have it written on paper in the form of a Service Level Agreement (SLA). In this case you may want to think of a reliable plan in advance. The time it takes to implement the plan is the first metric to consider (in disaster recovery terms it is called the Recovery Time Objective, or RTO). It's important that the RTO is shorter than the outage itself; otherwise there is no sense in implementing the plan, and you just wait until the service comes back online "by itself".
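
To make the RTO idea concrete, here is a minimal Python sketch that adds up the estimated duration of each recovery step and compares the total with a target RTO. The step names, the durations and the 8-hour target are hypothetical values chosen for illustration; they are not taken from any real plan.

```python
from datetime import timedelta

# Hypothetical recovery steps with estimated durations (illustrative values only).
RECOVERY_STEPS = [
    ("Provision replacement servers", timedelta(hours=2)),
    ("Install and patch SharePoint", timedelta(hours=3)),
    ("Restore databases from backup", timedelta(hours=4)),
    ("Reattach content and verify the service", timedelta(hours=1)),
]


def estimated_recovery_time(steps):
    """Sum the estimated duration of every step in the plan."""
    return sum((duration for _, duration in steps), timedelta())


def meets_rto(steps, rto):
    """Return True if the plan's estimated duration fits within the RTO."""
    return estimated_recovery_time(steps) <= rto


if __name__ == "__main__":
    rto = timedelta(hours=8)  # assumed target, e.g. taken from an SLA
    total = estimated_recovery_time(RECOVERY_STEPS)
    print(f"Estimated: {total}, RTO: {rto}, fits: {meets_rto(RECOVERY_STEPS, rto)}")
```

With these made-up numbers the plan needs 10 hours and misses the 8-hour target, which is exactly the kind of gap you want to discover before a disaster rather than during one.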

"But wait... what if a disaster never happens?" Well, that would be really nice, but before you take the risk, consider the following statistics (I'm not sure how accurate they are, but they look realistic):

60% of companies that lose their data will shut down within 6 months of the disaster
20% of small to medium businesses will suffer a major disaster causing loss of critical data every 5 years. (Source: Richmond House Group)
About 70% of business people have experienced (or will experience) data loss due to accidental deletion, disk or system failure, viruses, fire or some other disaster (Source: Carbonite, an online backup service)
7 out of 10 small firms that experience a major data loss go out of business within a year. (DTI/PriceWaterhouseCoopers)
30% of all businesses that have a major fire go out of business within a year, and 70% fail within five years. (Home Office Computing Magazine)

"So what can be the cause of all those horrible things?" Well, it's not that important from a recovery perspective, but it can affect your backup strategy. Let's check the statistics on causes of data loss. They may vary, but the major causes are (in descending order of risk):

1. Human error (Unfortunately there is no good protection against humans)
2. Viruses and damaging malware (Apparently virus protection is not good enough either, because of human errors)
3. Mechanical damage to the hard drive (It happens from time to time)
4. Power failures (Do you have a backup power source?)
5. Theft of computer (It is not likely that someone will steal your server, but still)
6. Spilling coffee, and other water damage (looks like a human error again)
7. Fire accidents and explosions (looks like this is also somehow connected with humans)
8. Asteroids, yeti and aliens stealing our data (well, not really, so you may take this risk)

The point is that it's not only about natural disasters (which are probably unlikely to happen). However, it's up to you whether to take the risk or not. If you are aware of the causes and consequences, you can make the right decisions. And the risks you are trying to address will affect your backup storage (for example, if you want your backups to survive a fire in the data center, you should store them in another physical location).

Backups (RPO)
"But I have my backups in a safe place, do I even need a disaster recovery plan?"

In most cases the answer is "Yes". A SharePoint farm (intranet or extranet) is not only about the data. Well, in some sense it is, but I mean it is not only about the data in the content databases (which you most likely back up); it is also about configuration data (in databases and on server disks), search index files, custom solutions, a consistent state of all machines in the farm and so on. So in this case (and in most other cases where you deal with complex software) you should be aware of what is backed up, when the last backup was taken and how to get a consistent set of backups of all components in the farm.
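
One way to keep track of "what is backed up and when" is a simple inventory check. The sketch below is only an illustration: the component names are hypothetical labels loosely based on the parts of a farm listed above, and treating a set as consistent when all backups fall within a chosen time window is my simplifying assumption, not a SharePoint rule.

```python
from datetime import datetime, timedelta

# Hypothetical component names, loosely based on the parts of a farm listed above.
REQUIRED_COMPONENTS = ["content_db", "config_db", "search_index", "custom_solutions"]


def missing_components(backup_times, required=REQUIRED_COMPONENTS):
    """Return the required components that have no backup at all."""
    return sorted(set(required) - set(backup_times))


def is_consistent_set(backup_times, max_spread):
    """Return True if all backups were taken within max_spread of each other.

    Assumption: "consistent" is approximated here by closeness in time.
    """
    times = list(backup_times.values())
    return bool(times) and max(times) - min(times) <= max_spread


if __name__ == "__main__":
    # Illustrative timestamps only.
    backups = {
        "content_db": datetime(2015, 12, 12, 23, 0),
        "config_db": datetime(2015, 12, 12, 23, 30),
        "search_index": datetime(2015, 12, 6, 2, 0),
    }
    print("Missing:", missing_components(backups))
    print("Consistent within 1 hour:", is_consistent_set(backups, timedelta(hours=1)))
```

In this made-up inventory the custom solutions have no backup at all and the search index backup is a week older than the databases, so the set would not restore to one consistent state.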

Another common question about backups is "How often should I take those backups?" Answering this question means answering how much data you can afford to lose. The instinctive answer is: "I can't lose a single bit of my precious data!" But before you rush to set up SQL availability groups or mirroring, consider how much it will cost. Say you own a small intranet with 500 users. If in a disaster you lose the changes made in the last 2 hours, the total effort lost will be about 500 hours (assuming only 50% of users were active during that time, or all users were active for 50% of it), which is about 3 man-months. On the other hand, consider how much effort it will take to set up and maintain fully synchronized databases for several years before a disaster happens. By considering how critical the data is and whether it can be recreated manually, you can find the backup strategy that gives you the most protection for the least investment. The amount of data you can afford to lose is the second metric of your recovery (in disaster recovery terms it is called the Recovery Point Objective, or RPO).
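
The arithmetic in the intranet example can be written down as a small Python sketch, which makes it easy to try other values. The function names are mine, and the figure of 160 working hours per month is an assumption used only to convert hours into months.

```python
def lost_effort_hours(users, rpo_hours, active_fraction):
    """Estimate the working hours lost when rpo_hours of changes disappear."""
    return users * rpo_hours * active_fraction


def effort_in_months(hours, hours_per_month=160.0):
    """Convert working hours to months (160 hours per month is an assumption)."""
    return hours / hours_per_month


if __name__ == "__main__":
    hours = lost_effort_hours(users=500, rpo_hours=2, active_fraction=0.5)
    print(f"Lost effort: {hours:.0f} hours, about {effort_in_months(hours):.1f} months")
```

Changing rpo_hours shows how the potential loss grows with the backup interval, and that figure can then be weighed against the cost of running more frequent backups or synchronized databases.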

Recovery (DRP)
Though backups are the key to preventing data loss, and you will most likely use them in your recovery process, they are not the only thing you need. First of all, recovery is not only about restoring data, it's about restoring the service itself. You may need to restore some service functions even if you don't have any backups (e.g. recreate your SharePoint farm from scratch).

The key thing is to have a plan for how to recover the service (a disaster recovery plan, or DRP). The plan is based on two important metrics: RTO and RPO (both explained above). The plan should be written down and kept in a safe place (along with everything else you need for recovery, e.g. installation media). The plan should be detailed enough that anyone (even someone who doesn't know the details of your intranet) can carry it out. And last but not least, the plan should be tested and updated on a regular basis.

Remember the statistics: 34% of companies fail to test their tape backups, and of those that do, 77% have found tape backup failures. The Schrodinger's backup statement is simple but important: the condition of any backup is unknown until a restore is attempted. So remember to test your disaster recovery plan, not only to check the backups but also because the system may change and the plan needs to be adapted to those changes.

High availability
Now that you know what disaster recovery (DR) is, we can quickly go through another similar (but not identical) concept: High Availability (HA). Simply put, High Availability means that your service is available to users no matter what happens (unlike DR, which means that you can recover your service within a given RTO). High availability means you should have a fully redundant deployment up and running all the time, and that costs money. Thus, as with the disaster recovery plan, you should take costs and benefits into account before implementing HA.

Creating an initial disaster recovery plan is one of the most important disaster recovery activities. Knowit Oy offers a 5-day SharePoint disaster recovery workshop to implement a disaster recovery plan that suits your needs.
