Planning For Business Continuity & Service Affecting Issues
Chief Information Security Officer, Kerv Digital
Published 24/10/22
The lessons of the last few years are clear. To keep their businesses running through Covid, workforces across the globe were forced to adapt quickly, and in some cases very dramatically. Does that mean this process should now begin to slow down? Absolutely not.
For those keen on laying out a roadmap for the future, in this blog our Chief Information Security Officer, Tony Leary, sets out the fundamentals that you’ll need to bear in mind.
First and foremost, always remember that planning for service-affecting issues is good practice and may keep you in business!
Pre-pandemic, disaster recovery and business continuity were topics many found contrived, and perhaps even pointless. But those organisations that had a documented and tested ‘work from home’ business continuity plan, perhaps to mitigate the loss of a key building, likely coped with the first Covid lockdown better than those that didn’t.
Time is of the Essence
Business continuity, in the form of ‘availability’, is part of the information security ‘CIA triad’ along with confidentiality and integrity, so it’s very much part of IT security architecture and governance practice. Availability is usually expressed as a percentage of uptime over a period, e.g. 99.9% measured monthly means that a service may have up to 0.1% (a bit under 45 minutes) of unplanned downtime a month. As availability is usually backed by a contractual agreement, IT suppliers must be confident that they can comfortably meet this figure for their service. Confident rather than certain, as availability is perhaps the most expensive risk to mitigate: component failures are usually mitigated by adding spare capacity that may only be used if a failure occurs.
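To make the arithmetic behind an availability figure concrete, here is a minimal Python sketch. The function name `allowed_downtime` and the 30-day month are assumptions made for illustration; a real contract defines its own measurement period and what counts as downtime.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the unplanned downtime an availability target permits over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    # The downtime budget is the share of the period NOT covered by the target.
    budget_seconds = period.total_seconds() * (100 - availability_pct) / 100
    return timedelta(seconds=budget_seconds)


# Assumption: a "month" is treated as 30 days for this illustration.
month = timedelta(days=30)
monthly_budget = allowed_downtime(99.9, month)
print(f"99.9% over 30 days allows about {monthly_budget.total_seconds() / 60:.1f} minutes of downtime")
```

Running the same function with 99.99% shows how quickly the budget shrinks with each extra ‘nine’, which is one reason higher targets cost more to meet.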
Breaking Down the Breakdowns
In common with every other aspect of IT, there is a plethora of initialisms. Besides availability, business continuity is often referred to as disaster recovery (DR) or service continuity. Some other terms that it’s good to be aware of are:
- RPO: Recovery Point Objective: in simple terms, this represents the maximum amount of data a service consumer is willing to lose if the service fails, so a 1hr RPO means up to an hour of data could be lost.
- RTO: Recovery Time Objective: how long it should take to recover a service following an incident that impacted its availability.
- MTPD: Maximum Tolerable Period of Disruption: the time an organisation can tolerate the loss of a service, given any other processes, such as rekeying in that 1hr of lost data, that may be necessary following service recovery.
The relationship between these three terms is typically RPO < RTO < MTPD. For example, while up to one hour of data may be lost (1 hour RPO), up to twenty-four hours may be needed to restore the whole system (24 hour RTO), and there may then be a further 12 hours allowed to key in data recorded elsewhere (perhaps even on paper) while the service was down, giving an MTPD of 24 + 12 = 36 hours.
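The worked example above can be sketched in a few lines of Python. The class `ContinuityTargets` and the field name `catch_up` are hypothetical, and treating MTPD as RTO plus catch-up time simply mirrors the example's arithmetic; in practice the MTPD is set by business stakeholders rather than derived.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ContinuityTargets:
    rpo: timedelta       # maximum data loss the consumer will accept
    rto: timedelta       # time allowed to restore the service
    catch_up: timedelta  # hypothetical: time to rekey data after restoration

    @property
    def mtpd(self) -> timedelta:
        # Mirrors the example above: restore time plus time to rekey lost data.
        return self.rto + self.catch_up

    def follows_typical_ordering(self) -> bool:
        # The typical relationship described above: RPO < RTO < MTPD.
        return self.rpo < self.rto < self.mtpd


targets = ContinuityTargets(
    rpo=timedelta(hours=1),
    rto=timedelta(hours=24),
    catch_up=timedelta(hours=12),
)
print(targets.mtpd, targets.follows_typical_ordering())
```

A check like this could sit alongside a service's design documentation, so that a proposed RTO that would breach the stakeholders' MTPD is flagged early.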
So where do these values come from? First, business stakeholders must provide the MTPD envelope that a service is required to operate within. Next would be any constraints from third parties and/or vendor technologies: enterprises rarely operate in isolation, and are often dependent on other, existing services or platforms. Once the service is built, but before it is ‘live’, testing is vital to prove that the requirements can be met.
It’s easy to see this is an area that is critical to understand for any new service. Quantifying customer risk appetite helps architects narrow down architectural options, whether they are building a service, or selecting one from a third party, who may be willing to commit to RPO/RTO figures in a contract.
The Cloud Continuity Conundrum
The emergence of cloud services has altered the IT and security landscape in lots of ways, so it shouldn’t be surprising that approaches to business continuity need to change too.
Cloud services are built from the ground up to be highly resilient and are obviously closely monitored by vendor support teams, so are likely to be more reliable than the majority of traditional, on-premise services that customers may run themselves.
Of the three main types of cloud product, infrastructure-as-a-service (IaaS) and platform-as-a-service (PaaS), which are based on discrete components, usually offer various options around resilience, whereas software-as-a-service (SaaS) is typically provided as a fully managed service and ‘sold as seen’ with only an availability SLA.
While there are pockets of RPO/RTO SLAs from Azure and AWS (usually for IaaS products) the vast majority of services only offer an availability SLA.
While you may not get a contracted RPO/RTO SLA from a cloud provider, there may be the possibility of assuring one yourself through testing, e.g. simulating a failover by redirecting or blocking network traffic, or by disabling components.
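One way to turn such a test into a recovery-time figure is to probe the service at regular intervals during the exercise and measure the longest gap between the first failed probe and the next healthy one. The sketch below assumes the probe results have already been collected as (timestamp, healthy) pairs; the function name `longest_outage` and the sample data are illustrative only, and the result is only as precise as the probe interval.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

Probe = Tuple[datetime, bool]  # (time of probe, did the service respond healthily?)


def longest_outage(probes: Iterable[Probe]) -> timedelta:
    """Longest span from a first failed probe to the next healthy probe.

    An outage still in progress at the final probe is not counted,
    because its end time is unknown.
    """
    longest = timedelta(0)
    outage_start: Optional[datetime] = None
    for timestamp, healthy in sorted(probes):
        if not healthy and outage_start is None:
            outage_start = timestamp  # outage begins
        elif healthy and outage_start is not None:
            longest = max(longest, timestamp - outage_start)  # outage ends
            outage_start = None
    return longest


# Illustrative data: one probe every 30 seconds during a failover test.
start = datetime(2022, 10, 24, 9, 0, 0)
results = [True, True, False, False, False, True, True]
probes = [(start + timedelta(seconds=30 * i), ok) for i, ok in enumerate(results)]
observed = longest_outage(probes)
print(f"Observed recovery time: {observed.total_seconds():.0f} seconds")
```

Repeating the exercise several times and taking the worst result gives a more defensible figure than a single run.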
This approach has its limits, however, as some services are so abstracted that they provide no way for a cloud consumer to force any kind of failover. Azure Functions, for example, even when deployed to a single availability zone (AZ) without resilience, cannot be deployed to an alternate AZ: that configuration is entirely hidden from the consumer. It’s therefore possible (and likely when using PaaS and SaaS) to build a service from cloud components that offer neither an RPO/RTO SLA nor the means to test the service to establish one manually.
Where does this leave industry-standard metrics such as RTO? Based on the services currently on offer, it’s fair to say that their relevance will fade as more services move to the cloud. Conversely, cloud providers that need both to attract on-premise hold-outs and to differentiate themselves may see an opportunity in providing RPO and RTO SLAs in the future. In the meantime, it’s vital for architects and stakeholders to take such constraints into account as early as possible in the project lifecycle.