Modern solutions for meeting RPO/RTO?

What is RPO/RTO?

The question of RPO/RTO has caused some confusion for IT folks through the years. Let’s start with the traditional definition of a recovery point objective (RPO) from Wikipedia:

It is the maximum tolerable period in which data might be lost from an IT service due to a major incident. The RPO gives systems designers a limit to work to. For instance, if the RPO is set to 4 hours, then in practice, offsite mirrored backups must be continuously maintained – a daily offsite backup on tape will not suffice.

The second point is key: a daily offsite backup on tape will not suffice. It raises an important question about how companies manage recovery points in the real world.

Typical Scenario for Database Protection

Let’s use the example of a database system which requires nightly fulls and log backups every hour.

In a tape-based environment, it is difficult to meet aggressive RPOs such as 1 hour while also making good use of tape capacity. One reason is that from LTO-2 and LTO-3 onwards, tape capacities grew much larger than DLT, yet a log backup written to tape only survives a site failure if that tape travels offsite as soon as it is written.

This not only wastes a lot of tape space (with an associated cost for each tape) but, more importantly, carries a very high operational overhead. The process is labour-intensive, prone to errors, and can ultimately soak up many man-days for IT staff.

Management overhead can be hard to record without adequate time tracking. In many companies where tape is the dominant backup storage medium, it is common for a single member of staff to spend 3-4 man-days a week managing a relatively small environment with a single data centre.

Is RPO being met?

Most people know this is just not practical, and I would guess a lot of people are storing multiple log backups on a single tape across the day and waiting until at least the next day to have it collected.

Whether this is acceptable in terms of the RPO requirements is for the business to decide, but being absolutely rigorous, it is probably not.

In many cases, does the business even know what is going on?
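To see why the rigorous answer is probably "no", here is a minimal Python sketch of the worst-case exposure. The function names are hypothetical, and it assumes that a recovery point only counts once its copy has left the site and that the tape is collected up to 24 hours after the log backup was written.

```python
from datetime import timedelta

def worst_case_exposure(backup_interval: timedelta, offsite_delay: timedelta) -> timedelta:
    """Oldest the newest offsite recovery point can be when the site fails.

    Assumption: a copy is only safe once it is offsite, so the exposure is
    the gap between backups plus the time a copy waits before leaving site.
    """
    return backup_interval + offsite_delay

def meets_rpo(backup_interval: timedelta, offsite_delay: timedelta, rpo: timedelta) -> bool:
    return worst_case_exposure(backup_interval, offsite_delay) <= rpo

# Hourly log backups, tape collected the next day (assumed: up to 24 hours later)
exposure = worst_case_exposure(timedelta(hours=1), timedelta(hours=24))
print(exposure)  # the interval plus the collection delay, i.e. 25 hours
print(meets_rpo(timedelta(hours=1), timedelta(hours=24), rpo=timedelta(hours=1)))
```

Under these assumptions the hourly log backups make no difference to the site-failure case: the collection delay dominates the exposure.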

Smarter Tape deployment

Reducing cost while meeting business requirements requires a paradigm shift. In the first instance, that shift is tape being replaced in this process by disk, with a software layer on top to manage the data and where it should live.

We can talk about hybrid software/hardware solutions like Data Domain/NetBackup and Avamar, and pure software solutions (agnostic of underlying disk) like Commvault. I like Commvault as it is consistent with the software-defined data centre (SDDC) and allows de-coupling of hardware from software. The same is true of Veeam and many others, where you can deploy commodity hardware to meet your needs.

So I don’t advocate removal of tape; I advocate its removal from the primary protection cycle. Tape becomes a tactical tool, with an equivalent deployment model to SATA in hybrid disk solutions like Nimble, Nutanix or VMware’s VSAN.

What about Disk Footprint?

If we replace tape with disk without de-duplication, the footprint required would be impractical, so de-duplication is a fundamental requirement of any disk-based backup solution. Personally, I think that unless the environment is large enough, choosing a VM-centric solution like Veeam (which doesn’t offer support for physical servers) means you end up with 2 solutions.

As a backup practitioner, managing 2 solutions doesn’t feel right unless, as I said, the environment warrants a discrete solution for managing VMware/Hyper-V/Xen etc. Veeam obviously has a lot of good features for replication too, so if replication is a requirement then it is definitely in play. Having a single management solution for vSphere and Hyper-V is also a powerful feature.

With disk as a target, the log backup (let’s call it what it is, a recovery point) is on disk, so now we can manage it through its lifecycle with zero operational overhead.
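On the footprint point, a rough back-of-the-envelope sketch in Python shows why de-duplication matters. The model is deliberately simplistic and entirely an assumption: it ignores compression and log backups, and treats each retained full after the first as adding only the blocks changed that day. The 10 TB size, 30 retained fulls and 2% daily change rate are illustrative figures, not measurements.

```python
def raw_footprint_tb(full_tb: float, retained_fulls: int) -> float:
    """Disk needed if every retained full backup is stored in its entirety."""
    return full_tb * retained_fulls

def deduplicated_footprint_tb(full_tb: float, retained_fulls: int,
                              daily_change_rate: float) -> float:
    """Disk needed if only the first full plus daily changed blocks are stored.

    Assumption: changes on different days never overlap, so this is a
    pessimistic estimate for the de-duplicated case.
    """
    return full_tb * (1 + daily_change_rate * (retained_fulls - 1))

raw = raw_footprint_tb(10, 30)
dedup = deduplicated_footprint_tb(10, 30, daily_change_rate=0.02)
print(f"raw: {raw:.1f} TB, de-duplicated: {dedup:.1f} TB")
```

With these illustrative inputs the raw footprint is 30 times the size of the database, while the de-duplicated estimate stays under twice its size.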

We need to be able to automate replication and retention using policy-based controls that ideally can be mapped to a technical solution. We need to forget about calling backups backups and start referring to them as recovery points. Then we need a solution that can manage recovery points, through software.

So we now do this:

Get data onto disk medium to create recovery points, within the backup window, using disk and deduplication to optimise data movement and storage capacity.
Have background jobs moving secondary replicas to alternate sites, again leveraging disk deduplication to ensure optimal bandwidth utilization and fastest speed of transfer.
Use tape only to meet long-term RPO requirements.
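The three steps above can be expressed as a policy that software evaluates for each recovery point. The following Python sketch is hypothetical: the class, field and location names are mine and do not correspond to any particular product's configuration.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class RecoveryPointPolicy:
    replicate_within: timedelta  # deadline for the offsite disk replica
    disk_retention: timedelta    # how long copies stay on disk
    tape_retention: timedelta    # long-term retention on tape

def required_copies(age: timedelta, policy: RecoveryPointPolicy) -> set[str]:
    """Where a recovery point of the given age must exist under the policy."""
    copies: set[str] = set()
    if age <= policy.disk_retention:
        copies.add("primary-disk")
        if age >= policy.replicate_within:
            # By this age the background replication job must have finished
            copies.add("offsite-disk")
    elif age <= policy.tape_retention:
        copies.add("tape")
    return copies  # empty set: the recovery point has expired everywhere

policy = RecoveryPointPolicy(
    replicate_within=timedelta(hours=1),
    disk_retention=timedelta(days=30),
    tape_retention=timedelta(days=7 * 365),
)
print(required_copies(timedelta(hours=2), policy))
print(required_copies(timedelta(days=90), policy))
```

A scheduler could compare this required set against the copies that actually exist and raise replication, tape-out or expiry jobs for the difference, which is what removes the manual tape handling from the process.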

The real world of multi-TB recoveries

We need to start asking ourselves some realistic questions regarding what actually happens when an application fails.

Many people don’t like the argument that for large (multi-TB) database systems we may well never recover from a backup. A full restore takes too long, and so many linked applications and other systems would be left completely out of step that it could be impractical for the likes of a bank, for example. Also, even disk-based offsite backups of multi-TB systems could be at the end of a Gigabit link, which even at that speed just won’t be “fast enough” to perform a full recovery within the RTO.
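The Gigabit link point is easy to check with some arithmetic. This is a minimal Python sketch; the 10 TB database size and the 70% link efficiency are illustrative assumptions, and it ignores restore-side overheads such as rehydrating de-duplicated data or replaying logs.

```python
def restore_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_tb (decimal terabytes) over a link.

    efficiency is the assumed fraction of line rate actually achieved.
    """
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

# 10 TB over a 1 Gbit/s link at full line rate: roughly 22 hours
print(f"{restore_hours(10, 1.0):.1f} hours at line rate")
# The same transfer at an assumed 70% of line rate
print(f"{restore_hours(10, 1.0, efficiency=0.7):.1f} hours at 70% efficiency")
```

Even in the ideal case the transfer alone takes the best part of a day, before any application-level recovery starts.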

Most people will fix the problem at source if at all possible – fix-forward. This poses interesting questions regarding what needs to stay close to the source data for recovery purposes.

Using solutions like Commvault Auxiliary Copy, Data Replication and IntelliSnap allows you to build up a service catalog that defines the RPO/RTO each technical solution can offer, and apply it to different applications. This makes the whole area more manageable, and moves you away from point solutions (doubletake/wansync/rsync etc.) where the intellectual property is in the head of a single staff member. You can plug different solutions into your physical design, provided they meet your RPO/RTO requirements.

Data Classification and Service Catalog

I always say the starting point for this must be Data Classification: knowing which systems are most important and understanding the data within your estate. Only then can you know whether you are backing things up in the right place, e.g. maybe for your multi-TB systems you need a SAN snapshot at the production site and full backups at DR.

So defining a service catalog of solutions to meet the requirements of different RPOs and RTOs, and standardising on these solution sets, streamlines management and delivery of the entire process. This will ultimately allow a business- and application-centric approach to be taken with a small subset of standard methodologies for achieving RPOs.
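As an illustration of what such a catalog could look like in practice, here is a small Python sketch that picks the cheapest standard offering meeting an application's requirements. The entries, their RPO/RTO figures and the relative cost ranking are all hypothetical placeholders; real values would come out of your own data classification exercise.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

@dataclass(frozen=True)
class CatalogEntry:
    name: str
    rpo: timedelta       # worst-case data loss this offering can deliver
    rto: timedelta       # worst-case recovery time this offering can deliver
    relative_cost: int   # illustrative ranking only, 1 = cheapest

CATALOG = [
    CatalogEntry("nightly backup to disk, replicated to DR",
                 timedelta(hours=24), timedelta(hours=24), 1),
    CatalogEntry("hourly log backups to disk, replicated to DR",
                 timedelta(hours=1), timedelta(hours=8), 2),
    CatalogEntry("SAN snapshot at production plus full backups at DR",
                 timedelta(minutes=15), timedelta(hours=1), 3),
]

def cheapest_entry(required_rpo: timedelta, required_rto: timedelta) -> Optional[CatalogEntry]:
    """Cheapest catalog entry that meets both objectives, or None if none does."""
    candidates = [e for e in CATALOG
                  if e.rpo <= required_rpo and e.rto <= required_rto]
    return min(candidates, key=lambda e: e.relative_cost, default=None)

choice = cheapest_entry(timedelta(hours=4), timedelta(hours=12))
print(choice.name if choice else "no standard offering fits")
```

A result of None is useful in itself: it flags an application whose requirements fall outside the standard offerings and need a conversation with the business.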
