Table of Contents
- Building an IT Disaster Recovery Plan That Actually Works
- 14-Step Disaster Recovery Plan
- Step 1: Supply Chain Considerations
- Step 2: Business Functions and Processes
- Step 3: Risk Assessment
- Step 4: Required IT Infrastructure
- Step 5: Business Impact Assessment
- Step 6: Financial Assessment
- Step 7: Backup Strategy
- Step 8: RPO (Recovery Point Objective)
- Step 9: RTO (Recovery Time Objective)
- Step 10: Cybersecurity Insurance
- Step 11: Emergency Response Team (ERT)
- Step 12: Disaster Recovery Team (DRT)
- Step 13: Communication and Roles
- Step 14: Testing
- Ongoing Maintenance
- Practical Steps for Small and Mid-Size Businesses
- How Stratify IT Can Help
- Frequently Asked Questions
Building an IT Disaster Recovery Plan That Actually Works
Most businesses don't discover the gaps in their disaster recovery plan until something goes wrong. By then, the cost of finding out is already accumulating. According to the ITIC 2024 Hourly Cost of Downtime Survey, over 90% of mid-size enterprises report losing more than $300,000 per hour during an outage, a figure that excludes legal exposure and regulatory penalties. For smaller businesses, the 2025 ITIC and Calyptix Security SMB survey found that 8% of respondents report downtime costs exceeding $25,000 per hour, enough to threaten the entire operation of a small business.
An effective IT disaster recovery plan isn't a binder on a shelf. It's a tested, role-assigned, regularly updated set of procedures that defines exactly what happens in the first minutes and hours of an incident, and how long it takes to return to normal operations. The steps below cover what goes into building one that holds up when it's tested for real.
14-Step Disaster Recovery Plan
Step 1: Supply Chain Considerations
Most businesses depend on third-party vendors for critical functions: cloud hosting, internet connectivity, line-of-business software, payroll processing. If any of those vendors goes down, your recovery plan needs to account for it. Document which vendors are in the critical path for each core operation, confirm their own recovery SLAs, and identify fallback options where they exist. A vendor outage you haven't planned for can extend your own downtime significantly.
Step 2: Business Functions and Processes
Not every business function carries the same weight. Map your operations and rank them: which systems, if unavailable for one hour, would halt revenue? Which could tolerate 24 hours of downtime with manageable impact? This tiering drives every downstream decision in the plan: what gets backed up most frequently, what gets restored first, and where to concentrate recovery resources.
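As a rough illustration of that tiering, the sketch below sorts a small inventory by how much downtime each function can tolerate. The function names and hour values are hypothetical placeholders; real figures have to come from the business owners of each process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessFunction:
    name: str
    max_tolerable_downtime_hours: float  # assumed input from business owners


def restore_order(functions):
    """Sort so the least downtime-tolerant function is restored first."""
    return sorted(functions, key=lambda f: f.max_tolerable_downtime_hours)


# Hypothetical inventory, for illustration only.
inventory = [
    BusinessFunction("file archive", 72),
    BusinessFunction("point of sale", 1),
    BusinessFunction("email", 24),
]

for rank, fn in enumerate(restore_order(inventory), start=1):
    print(rank, fn.name, f"{fn.max_tolerable_downtime_hours}h")
```

The same ordering can then feed backup frequency and restore priority, so the ranking is decided once rather than re-argued during an incident.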
Step 3: Risk Assessment
Identify the specific threats your business faces: ransomware, hardware failure, natural disaster, power outage, insider error, vendor failure. For each threat, assess likelihood and potential operational impact. Ransomware deserves particular attention because it's not just an external attack vector: compromised credentials can trigger an encryption event from inside the perimeter. Physical threats vary by geography: flood zones, hurricane corridors, and seismic regions each shift the probability weighting. Power instability is underrated; extended UPS and generator coverage is a recovery variable most SMBs don't account for until after an event. The Verizon 2025 Data Breach Investigations Report found that SMBs experience ransomware incidents at more than double the rate of large enterprises, making ransomware recovery a baseline requirement, not an edge case.
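One common way to turn "likelihood and impact" into a ranked list is a simple likelihood-times-impact score. The sketch below assumes a 1-to-5 scale for each, and the scores shown are invented for illustration, not measured values.

```python
def risk_score(likelihood, impact):
    """Likelihood and impact on an assumed 1-5 scale; higher is worse."""
    return likelihood * impact


# Hypothetical (likelihood, impact) ratings for illustration only.
threats = {
    "ransomware": (4, 5),
    "hardware failure": (3, 4),
    "power outage": (3, 3),
    "natural disaster": (1, 5),
}

ranked = sorted(threats.items(), key=lambda item: risk_score(*item[1]), reverse=True)

for name, (likelihood, impact) in ranked:
    print(f"{name}: {risk_score(likelihood, impact)}")
```

A multiplicative score is only one option; some teams prefer a qualitative matrix, and either approach works as long as it is applied consistently.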
Step 4: Required IT Infrastructure
Document the full IT environment: servers, workstations, network equipment, cloud services, SaaS platforms, and the dependencies between them. Identify single points of failure: a server with no redundant backup, a firewall with no failover, a critical application with no secondary instance. Also define your alternate site options now, not during an incident. Cloud-first organizations have more flexibility: Azure Site Recovery and AWS Elastic Disaster Recovery can replicate workloads to geographically separate regions with automated failover. Organizations running on-premises infrastructure need a defined secondary location: a colocation facility, a secondary office, or a hot, warm, or cold site arrangement with documented activation steps. The difference between those three matters: a hot site runs continuously in parallel (fastest recovery, highest cost); a warm site has systems pre-configured but not live; a cold site is physical space with power and connectivity where you'd rebuild from scratch.
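A dependency inventory makes single points of failure easy to list mechanically. The sketch below is a minimal version under stated assumptions: component names, instance counts, and the service map are all hypothetical, and "fewer than two instances" is used as a stand-in for "no redundancy".

```python
def single_points_of_failure(instances, depends_on):
    """Map each non-redundant component to the services that rely on it."""
    exposed = {}
    for service, components in depends_on.items():
        for component in components:
            # Assumption: fewer than two instances means no failover exists.
            if instances.get(component, 0) < 2:
                exposed.setdefault(component, []).append(service)
    return exposed


# Hypothetical environment for illustration only.
instances = {"firewall": 1, "file-server": 1, "domain-controller": 2, "isp-uplink": 2}
depends_on = {
    "accounting app": ["file-server", "domain-controller", "firewall"],
    "email": ["isp-uplink", "firewall"],
}

for component, services in single_points_of_failure(instances, depends_on).items():
    print(component, "->", ", ".join(services))
```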
Step 5: Business Impact Assessment
Simulate the operational and financial impact of a 1-hour, 4-hour, 24-hour, and 72-hour outage. For each tier, ask: What revenue is lost? What contractual obligations are missed? What compliance thresholds are crossed? Organizations under HIPAA, for example, face breach notification obligations within 60 days and potential penalties for impermissible disclosures, making the compliance timeline part of the BIA, not just the financial one.
Step 6: Financial Assessment
Quantify what downtime actually costs your business. Direct costs include lost revenue, idle staff time, and recovery expenses. Indirect costs include customer churn, reputational damage, and potential regulatory fines. The financial assessment should also justify the investment in DR infrastructure (cloud backup, redundant connectivity, DR-as-a-Service) by benchmarking those costs against the calculated cost of even one significant incident. A useful baseline: annual revenue divided by 2,080 business hours gives a per-hour revenue figure. Add hourly labor cost for idled employees, contract penalties for missed SLAs, and regulatory fine exposure. For most businesses, the result justifies significantly more investment than they're currently making. Executive sign-off on RTO and RPO targets belongs here too: those targets carry real cost implications, and the CFO needs to understand the tradeoff between a 4-hour RTO and a 24-hour one before approving the IT budget.
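The baseline calculation above can be sketched in a few lines. The revenue, headcount, labor rate, and penalty figures below are hypothetical inputs chosen for round numbers; only the 2,080-hour divisor comes from the text.

```python
BUSINESS_HOURS_PER_YEAR = 2_080  # 40 hours x 52 weeks, the baseline used above


def hourly_downtime_cost(annual_revenue, idled_employees, avg_hourly_labor_cost):
    """Per-hour revenue plus the labor cost of staff who cannot work."""
    revenue_per_hour = annual_revenue / BUSINESS_HOURS_PER_YEAR
    labor_per_hour = idled_employees * avg_hourly_labor_cost
    return revenue_per_hour + labor_per_hour


def outage_cost(hours, hourly_cost, contract_penalties=0.0, regulatory_exposure=0.0):
    """Total for one outage: time-based cost plus one-off penalties and fines."""
    return hours * hourly_cost + contract_penalties + regulatory_exposure


# Hypothetical business: $5.2M revenue, 20 idled staff at $40/hour.
hourly = hourly_downtime_cost(5_200_000, 20, 40)
for hours in (1, 4, 24, 72):
    print(f"{hours}h outage: ${outage_cost(hours, hourly):,.0f}")
```

This deliberately leaves out indirect costs such as churn and reputational damage, which are harder to quantify, so the result is better read as a lower bound.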
Step 7: Backup Strategy
The 3-2-1 backup rule is the baseline: three copies of data, on two different media types, with one copy offsite or in the cloud. For most businesses, this means a combination of local backup (NAS or on-premises backup appliance) and cloud-based backup through a service like Azure Backup, Veeam, or Datto. Cloud-based DR-as-a-Service options add the ability to spin up workloads in the cloud during an outage, reducing recovery time significantly compared to restore-from-tape approaches.
For organizations with aggressive RPO requirements or significant ransomware exposure, 3-2-1 is a floor, not a ceiling. Immutable backups (write-once copies that cannot be altered or deleted, even by a ransomware payload running with admin credentials) are now standard practice in higher-risk environments. Platforms like Veeam, Acronis, and Datto can enforce immutability on cloud-hosted backup repositories. Also review retention windows: 30-day retention on a database that holds 90 days of billable records is a gap. And if your backup procedures haven't been tested with an actual restore in the last 90 days, the data you think you have may not be what you'll get under pressure.
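The 3-2-1 rule and the immutability check can be expressed as a small audit over an inventory of copies. This is a sketch under assumptions: the `BackupCopy` fields and the sample inventory are hypothetical, and the production data is counted as one of the three copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str
    media_type: str
    offsite: bool
    immutable: bool = False


def meets_3_2_1(copies):
    """Three copies, two media types, at least one copy offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media_type for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


def has_immutable_copy(copies):
    """True if at least one copy is flagged as write-once."""
    return any(c.immutable for c in copies)


# Hypothetical inventory; production data is counted as the first copy.
copies = [
    BackupCopy("production server", "disk", offsite=False),
    BackupCopy("office NAS", "nas", offsite=False),
    BackupCopy("cloud repository", "object storage", offsite=True, immutable=True),
]

print("3-2-1 satisfied:", meets_3_2_1(copies))
print("immutable copy present:", has_immutable_copy(copies))
```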
Step 8: RPO (Recovery Point Objective)
The RPO defines how much data loss is acceptable; in practice, it determines how frequently backups need to run. A business with an RPO of four hours needs backups running at least every four hours. A business with an RPO of 15 minutes needs near-continuous replication. Set RPO before RTO: it determines your backup frequency and storage architecture, which in turn constrains what RTO is actually achievable. A law firm's document management system might tolerate a 2-hour RPO; a retail point-of-sale system may need a sub-30-minute one. Setting the RPO correctly requires input from the business side, not just IT: losing four hours of financial transactions has a very different impact than losing four hours of email.
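Whether a backup schedule actually honors an RPO can be checked against the history of completed backups: the largest gap between two successive backups is the worst-case data loss. The timestamps below are hypothetical and assume at least two completed backups are available.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times):
    """Largest gap between consecutive completed backups (needs two or more)."""
    ordered = sorted(backup_times)
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


# Hypothetical history: a 4-hour schedule where the 12:00 job was missed.
backup_times = [
    datetime(2025, 1, 6, 0, 0),
    datetime(2025, 1, 6, 4, 0),
    datetime(2025, 1, 6, 8, 0),
    datetime(2025, 1, 6, 14, 0),
]
rpo = timedelta(hours=4)

gap = worst_case_data_loss(backup_times)
print("worst-case loss:", gap, "| within RPO:", gap <= rpo)
```

Checking observed gaps rather than the configured schedule catches skipped or failed jobs, which a review of the schedule alone would not show.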
Step 9: RTO (Recovery Time Objective)
The RTO defines how long systems can be offline before the impact becomes unacceptable. A 24-hour RTO might be fine for a secondary file server; a 1-hour RTO for a point-of-sale system is a completely different technical requirement. RTOs drive decisions about recovery architecture (whether to use cold, warm, or hot standby systems) and should be validated against the financial assessment to confirm they're achievable within acceptable cost parameters.
Step 10: Cybersecurity Insurance
Cyber liability insurance covers costs that a DR plan can't: breach notification, forensic investigation, legal defense, regulatory fines, and ransom payments where covered. Insurers now require documented security controls (MFA, EDR, regular patching, and tested backup procedures) before issuing coverage. A well-documented DR plan with verified, tested backups can directly affect premium rates and coverage eligibility. Underwriters now ask for test records before binding coverage, so post-test documentation serves double duty as audit evidence and insurance leverage.
Step 11: Emergency Response Team (ERT)
The ERT handles the first phase of an incident: containing the damage, communicating internally, and initiating the DR plan. Define who is on the ERT, their contact information (including out-of-band contact methods that don't depend on systems that may be down), and their specific responsibilities. The ERT should have the authority to make decisions quickly, including taking systems offline, without waiting for approval chains that slow response. The first 15 minutes of a declared incident are mostly about containment: isolate affected systems, notify the incident commander, and preserve logs before anything gets overwritten. Don't reboot immediately: that reflex destroys forensic evidence and can reinfect a clean system from an unexamined one.
Step 12: Disaster Recovery Team (DRT)
The DRT owns the actual recovery process: restoring systems, validating data integrity, coordinating with vendors, and tracking progress against RTO targets. This is typically an IT-led team, but should include representation from business operations who can confirm when a restored system is actually functional for their purposes, not just technically online. Each step in the plan needs a named owner and a named backup. "IT will handle the server restoration" is not an assignment. "Alex Chen, Systems Administrator, will initiate the Veeam restore job; backup owner is Jordan Lee" is. During an actual incident, the person you assumed would lead the response may be unreachable.
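The "named owner and named backup" rule is easy to check mechanically if the plan's steps are kept in a structured form. The sketch below assumes a simple list of dictionaries with hypothetical `task`, `owner`, and `backup` keys; the first entry reuses the example assignment from the text.

```python
def assignment_problems(steps):
    """List steps that lack a named owner or a distinct named backup."""
    problems = []
    for step in steps:
        owner = step.get("owner", "").strip()
        backup = step.get("backup", "").strip()
        if not owner:
            problems.append(f"{step['task']}: no named owner")
        if not backup:
            problems.append(f"{step['task']}: no named backup")
        elif backup == owner:
            problems.append(f"{step['task']}: backup is the same person as the owner")
    return problems


# Hypothetical plan excerpt; the second step is deliberately incomplete.
plan_steps = [
    {"task": "Initiate Veeam restore job", "owner": "Alex Chen", "backup": "Jordan Lee"},
    {"task": "Validate restored data", "owner": "Alex Chen", "backup": ""},
]

for problem in assignment_problems(plan_steps):
    print(problem)
```

A check like this cannot tell whether "IT" is a person or a department, so it complements a human review of the assignments rather than replacing it.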
Step 13: Communication and Roles
Communication failures during an incident compound the technical ones. Define in advance: who notifies customers and when, who speaks to regulators if required, who updates employees, and what the escalation path looks like if the incident exceeds the initial team's ability to manage. Store communication templates and contact lists somewhere accessible when primary systems are down: a printed binder, an out-of-band cloud document, or both. Remote employees add another layer: define a communication channel that doesn't depend on your corporate network or email, both of which may be down. A group SMS thread or a tool like Slack on mobile data works; assuming everyone will check their work email during a network outage doesn't.
Step 14: Testing
An untested DR plan is a hypothesis. There are four levels of testing, and most organizations only do the easiest one.
- Tabletop exercises walk the team through a simulated scenario in a conference room. They are useful for surfacing process gaps and clarifying roles, but they don't verify whether backups actually restore.
- Walkthrough drills add documentation review to confirm procedures match current infrastructure.
- Functional simulations bring systems partially online in an isolated environment to verify restore procedures against real data.
- Full-scale failover tests take production workloads offline and run actual recovery from backup to production-ready state, measuring how close real RTO and RPO performance comes to documented targets.
Aim for tabletop exercises quarterly and a functional simulation or full-scale test at least annually. Organizations subject to CMMC Level 2 should note that NIST SP 800-171 requires the incident response capability to be tested (requirement 3.6.3); a C3PAO will review test schedules and completed exercise records during a formal assessment.
Every test produces findings. Log the actual RTO and RPO achieved, gaps between planned and actual recovery steps, any credentials or access paths that failed, and changes made to the plan as a result. This documentation improves the plan iteratively and provides evidence of due diligence for auditors and insurers.
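A test log entry can record targets alongside measured results so that misses surface automatically. The record structure and the sample values below are hypothetical; a real log would also capture failed credentials, access paths, and resulting plan changes as free text.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrTestResult:
    system: str
    target_rto: timedelta
    actual_rto: timedelta
    target_rpo: timedelta
    actual_rpo: timedelta

    def findings(self):
        """Describe any gap between documented targets and measured results."""
        notes = []
        if self.actual_rto > self.target_rto:
            notes.append(f"RTO missed by {self.actual_rto - self.target_rto}")
        if self.actual_rpo > self.target_rpo:
            notes.append(f"RPO missed by {self.actual_rpo - self.target_rpo}")
        return notes


# Hypothetical test: recovery took 6 hours against a 4-hour target.
result = DrTestResult(
    system="file server",
    target_rto=timedelta(hours=4),
    actual_rto=timedelta(hours=6),
    target_rpo=timedelta(hours=1),
    actual_rpo=timedelta(minutes=30),
)

print(result.system, result.findings())
```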
Ongoing Maintenance
A DR plan written in 2022 reflects the infrastructure, personnel, and threats of 2022. If it hasn't been updated since, it's already outdated.
Trigger-Based Reviews
Certain business events should automatically trigger a DRP review: a new SaaS platform added to the stack, a significant personnel change in IT or leadership, a merger or acquisition, a new compliance obligation, or a change in physical location. Don't wait for the annual cycle if material changes happen sooner. Maintain a change log in the DRP document so reviewers can see what changed and when.
Monitoring Integration
Connect DR readiness to your active monitoring environment. RMM platforms can alert on backup job failures, replication lag, or storage thresholds before they become recovery problems. If a backup job fails on Tuesday and no one notices until ransomware hits on Friday, the protection gap was visible; it just wasn't acted on. Backup health should surface in the same dashboards used to track system uptime and security events.
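The Tuesday-to-Friday gap described above is a staleness check: flag any job whose last successful run is older than an allowed age. The sketch below assumes the last-success timestamps have already been exported from the backup or RMM tool; the job names, dates, and 26-hour threshold are hypothetical, and no vendor API is used.

```python
from datetime import datetime, timedelta


def stale_backups(last_success, now, max_age):
    """Return job names whose most recent success is older than max_age."""
    return sorted(job for job, finished in last_success.items() if now - finished > max_age)


# Hypothetical export: one nightly job last succeeded on Tuesday 7 January.
last_success = {
    "sql-nightly": datetime(2025, 1, 7, 2, 0),
    "files-nightly": datetime(2025, 1, 10, 2, 0),
}
now = datetime(2025, 1, 10, 9, 0)  # Friday morning
max_age = timedelta(hours=26)  # assumed: nightly schedule plus a small grace period

for job in stale_backups(last_success, now, max_age):
    print("ALERT: no successful backup within", max_age, "for", job)
```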
Legal, Regulatory, and Contractual Requirements
HIPAA's Security Rule (45 CFR §164.308(a)(7)) requires covered entities to maintain a contingency plan that includes data backup, disaster recovery, and emergency mode operation procedures, and to test them. CMMC Level 2 maps to NIST SP 800-171, where requirements 3.6.1 and 3.6.2 cover an incident-handling capability that includes recovery, along with incident tracking and reporting; protection of backup data is addressed separately under 3.8.9. Review what your compliance framework requires and confirm your DR plan satisfies it, both in practice and in documentation.
Practical Steps for Small and Mid-Size Businesses
The 14-step plan above covers the formal DR framework. The following applies directly to businesses with limited IT staff, where the challenge is less about knowing what to do and more about actually getting it done.
Verify backups regularly. Backups that haven't been tested may not actually restore. Schedule monthly restore tests on a subset of backed-up data. Cloud-based backup solutions like Azure Backup, Acronis, or Datto provide automated backup verification, flagging failures before an incident rather than during one.
Layer cybersecurity controls. Ransomware is the leading cause of DR plan activation for SMBs. Reducing that risk requires layered defenses: DNS filtering to block malicious domains before connections are made, EDR tools like CrowdStrike or SentinelOne to catch malware that gets through, and MFA on all accounts with internet-facing access.
Establish alternate work capabilities. If your primary office becomes inaccessible, employees need to be able to work from somewhere else immediately. Cloud-based infrastructure such as Microsoft 365 and Azure Virtual Desktop allows staff to access systems from any device. Define in advance which staff need which access and test it before it's needed.
Train employees on their DR role. Employee training should cover how to recognize and report a potential incident, what not to do when an incident is suspected, and their specific responsibilities if the DR plan is activated. Internal tabletop exercises cover the response side; platforms like KnowBe4 cover the detection side.
How Stratify IT Can Help
Stratify IT helps businesses build and maintain disaster recovery programs that match their actual risk profile. We assess your current backup architecture, define RPO and RTO targets based on your operations, implement layered cybersecurity controls to reduce the likelihood of DR activation, and test recovery procedures so gaps are found in a controlled environment rather than during an incident. For organizations with HIPAA, CMMC, or SOC 2 requirements, we align the DR program to the specific control requirements those frameworks mandate.
If your current DR plan hasn't been tested recently, or doesn't exist yet, contact us to start with an IT assessment. We'll give you a clear picture of where you stand and what it would take to get to a defensible, tested recovery program.
Learn more about our disaster recovery and business continuity services to see the full range of what we offer. For the specific steps your team should take in the first 72 hours of a security incident, see our cyber incident response playbook.
Stratify IT: disaster recovery built around your business, not a template.
Frequently Asked Questions
Most practitioners recommend testing a DR plan with a full tabletop exercise at least annually, with component-level tests (failover, backup restoration, communication trees) run quarterly. The distinction matters: reviewing a document finds typos; testing finds the assumption that someone who left the company eighteen months ago is still your incident commander. Quarterly testing also catches configuration drift, where backups stop working silently because an application was updated and nobody updated the backup job.
Ownership of the DR plan typically falls to the IT director or a senior operations manager, but the real failure point is diffusion: when everyone is loosely responsible, nobody is specifically accountable. Designate one named individual with authority to call recovery procedures, make vendor escalation decisions, and halt normal operations if needed. That person needs a documented backup, because disasters have a habit of occurring when key people are traveling or unreachable.
Recovery Time Objective is how long you can tolerate being down; Recovery Point Objective is how much data loss you can absorb. Set RPO first, because it determines your backup frequency and storage architecture, which in turn constrains what RTO is actually achievable. A business that can only tolerate losing one hour of transactions needs near-continuous replication, and that infrastructure has real cost implications before you even get to the question of how fast you recover.
Ransomware does call for a different recovery approach than a hardware failure, and meaningfully so. It introduces a complication that a server failure doesn't: your backups may themselves be compromised if the attacker had network access long enough to reach them. Your plan should specify air-gapped or immutable backup copies that ransomware can't encrypt, a decision tree for whether to restore versus negotiate, and legal and insurance notification steps that have specific time windows under many cyber policies. Treating it like a standard hardware failure is a documented way to make the situation worse.
Most downtime cost calculations stop at lost revenue. The bigger surprises are recovery labor (IT staff working 60-hour weeks at overtime rates), contract penalties for missed SLAs, regulatory notification costs under HIPAA or state breach laws, and cyber insurance deductibles. Organizations with HIPAA obligations also face the clock on a 60-day notification window, which has its own legal and administrative cost. Getting the full number in front of leadership tends to change the investment conversation.
DRaaS and backup are not the same thing; they solve different problems. Backup protects your data. DRaaS goes further: it replicates your entire environment (servers, configurations, applications) and can spin up workloads in the cloud during an outage, cutting recovery time from days to hours. Platforms like Datto and Acronis offer DRaaS with defined RTO guarantees in their SLAs. For businesses where downtime directly stops revenue, DRaaS is worth evaluating alongside standard backup.
The first 15 minutes of an incident are mostly about containment and communication, not recovery. Isolate affected systems from the network to prevent spread, notify the incident commander and your managed IT provider if you have one, and preserve logs before anything gets overwritten. Don't immediately start rebooting things: that reflex destroys forensic evidence and can reinfect a clean system from an unexamined one. The plan itself should have a one-page laminated quick-reference card for exactly this window, because people don't read binders under pressure.
Remote workers complicate two things: access and communication. Your plan needs a defined out-of-band communication channel (a group SMS thread or a tool like Slack on mobile data) that doesn't depend on your corporate network or email, both of which may be down. It should also specify which remote employees have VPN or direct cloud access that bypasses office infrastructure entirely, since they may be able to continue operating while on-site staff cannot, and that asymmetry should be mapped in advance.
Certain events should trigger an immediate review: adding a major SaaS platform, a significant IT or leadership personnel change, a merger or acquisition, a new compliance obligation, or a physical office move. Beyond triggers, the plan should be reviewed annually at minimum. A change log inside the document helps reviewers see what's drifted. A plan that reflects last year's infrastructure is a plan that fails on this year's incident.