How to calculate your real RTO (not the one on paper)
A step-by-step method to measure a realistic RTO starting from a Business Impact Analysis. Spreadsheet template and the mistakes to avoid.
TL;DR
Calculating a realistic RTO needs three inputs: cost of downtime per hour, technical dependencies of the process and measured restore time from a drill. The resulting number is almost always 2-3× what management expects. Better to learn that before the disaster.
The 4-step method
1. Quantify downtime cost
For each critical process, calculate the damage per hour of downtime, accounting for:
- lost revenue (for processes that directly generate sales);
- internal productivity loss (people × hourly cost);
- SLA penalties to customers;
- reputational damage (qualitative but real).
Example: an 80-employee manufacturer has 50 people in production halted by a MES failure. That is ~€3,000/hour in staff cost plus ~€12,000/hour in lost production, so ~€15,000/hour in total.
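The quantifiable items above can be added up with a few lines of code. This is a minimal sketch, not a costing model: the function name and parameters are hypothetical, the €60/hour staff rate is an assumption chosen so that 50 people come to the ~€3,000/hour in the example, and reputational damage is left out because it is qualitative.

```python
def downtime_cost_per_hour(people_idle, hourly_cost_per_person,
                           lost_revenue_per_hour, sla_penalty_per_hour=0):
    """Hourly downtime cost in euros: idle staff + lost revenue + SLA penalties."""
    staff_cost = people_idle * hourly_cost_per_person
    return staff_cost + lost_revenue_per_hour + sla_penalty_per_hour


# MES failure example: 50 production staff halted.
# 60 EUR/hour per person is an assumed rate (50 x 60 = 3,000 EUR/hour).
cost = downtime_cost_per_hour(
    people_idle=50,
    hourly_cost_per_person=60,
    lost_revenue_per_hour=12_000,
)
print(f"Downtime cost: €{cost:,.0f}/hour")
```

Keeping each component as a separate argument makes it easy to see which one dominates for a given process.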
2. Map dependencies
Systems do not restart alone. To bring system X back you need:
- authentication (Active Directory, identity provider);
- internal DNS;
- storage and database;
- network (firewall, switches, routing);
- dependent software (e.g. ERP that requires CRM).
Draw the map, because systems must restart in the correct order. Your RTO is the sum of the restart times along the dependency chain, not the restart time of the single system.
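One way to turn the map into a restart order is a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The system names, the dependency map and the restart times are illustrative assumptions, and the restart is treated as strictly sequential, as described above.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists what must be up before it.
depends_on = {
    "active_directory": [],
    "database": ["active_directory"],
    "erp_app": ["database", "active_directory"],
    "clients": ["erp_app"],
}

# Illustrative restart times in minutes (replace with drill measurements).
restart_minutes = {
    "active_directory": 8,
    "database": 12,
    "erp_app": 6,
    "clients": 5,
}

# static_order() yields every system after all of its dependencies.
restart_order = list(TopologicalSorter(depends_on).static_order())
sequential_rto = sum(restart_minutes[system] for system in restart_order)

print(" -> ".join(restart_order))
print(f"Sequential restart time: {sequential_rto} min")
```

If the map contains a circular dependency, `static_order()` raises `graphlib.CycleError`, which is worth discovering on paper rather than during a restore.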
3. Measure, do not estimate
Any whiteboard estimate is 30-50% optimistic. The only reliable number comes from a timed drill:
- tabletop drill: a rough estimate, fine as a baseline;
- partial workload failover: more reliable number;
- full stack failover: the real number.
You typically start with a tabletop, then run a partial drill within 90 days and a full drill within 12 months.
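During a drill, timing each step separately gives you the per-system numbers the method needs. A minimal sketch of a step timer follows. The helper name `timed_step` and the step label are hypothetical, and the `sleep` call only stands in for the real restart procedure.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_step(name, log):
    """Append (name, elapsed seconds) to `log` when the step finishes."""
    start = time.monotonic()
    try:
        yield
    finally:
        # Recorded even if the step fails, so aborted attempts are visible too.
        log.append((name, time.monotonic() - start))


drill_log = []
with timed_step("restart database server", drill_log):
    time.sleep(0.01)  # placeholder for the real restart procedure

for name, seconds in drill_log:
    print(f"{name}: {seconds / 60:.2f} min")
```

`time.monotonic()` is used instead of wall-clock time so that a clock adjustment during the drill does not distort the measurement.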
4. Add the buffers
Three mandatory buffers on top of the measured number:
- +15% for "real-day surprises" (slower network, unavailable people);
- +30 minutes for the emergency call to support (problem recognition, DR authorisation);
- +10 minutes for post-restore checks before "reopening".
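The three buffers can be applied as a single formula. This is a sketch under the defaults listed above (15%, 30 minutes, 10 minutes). The function and parameter names are hypothetical, and the 60-minute input is an illustrative figure, not a measurement.

```python
def realistic_rto(measured_minutes, surprise_pct=15,
                  emergency_call_minutes=30, post_check_minutes=10):
    """Measured restart time plus the three buffers, in minutes."""
    buffered = measured_minutes * (100 + surprise_pct) / 100
    return buffered + emergency_call_minutes + post_check_minutes


# Illustrative input: 60 minutes measured in a drill.
print(f"Realistic RTO: {realistic_rto(60):.0f} min")
```

The percentage buffer scales with the measured time, while the two fixed buffers do not, so short restarts are dominated by human time.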
Numerical example
Take a company with a critical ERP:
- Restart cloud Active Directory: 8 minutes (measured in drill);
- Restart database server: 12 minutes;
- Restart ERP application server: 6 minutes;
- Client reconnection: 5 minutes;
- Technical total: 31 minutes.
- +15% buffer: 36 minutes.
- Problem recognition (15 min) + authorisation (10 min) + post-check (10 min): +35 minutes. Recognition and authorisation come to 25 minutes here, under the 30-minute allowance from step 4.
- Realistic RTO: ~70 minutes.
If the BIA says "max 30 minutes" there is a problem to solve before the disaster, not during.
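The same arithmetic can be scripted so that the gap against the BIA target is explicit. The sketch below reuses the figures from the example. The variable names are hypothetical, and rounding the buffered time to whole minutes is an assumption that matches the 36 minutes shown above.

```python
# Restart times from the example, in minutes.
steps = {
    "Restart cloud Active Directory": 8,
    "Restart database server": 12,
    "Restart ERP application server": 6,
    "Client reconnection": 5,
}
# Human time from the example, in minutes.
human_time = {"problem recognition": 15, "authorisation": 10, "post-check": 10}
bia_target = 30  # "max 30 minutes" from the BIA

technical_total = sum(steps.values())
buffered = round(technical_total * 115 / 100)  # +15% buffer, whole minutes
realistic = buffered + sum(human_time.values())
gap = realistic - bia_target

print(f"Technical total: {technical_total} min, with buffer: {buffered} min")
print(f"Realistic RTO: {realistic} min, BIA target: {bia_target} min")
if gap > 0:
    print(f"Over target by {gap} min: fix this before the disaster")
```

The exact sum is 71 minutes, which the example rounds to ~70.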
The spreadsheet we use
A simple 5-column structure:
| System | Target RTO | Measured restart | Buffer | Notes |
|---|---|---|---|---|
| Primary AD | 10 min | 8 min | +5 min | OK |
| ERP DB | 20 min | 18 min | +5 min | Borderline |
| ERP app | 10 min | 12 min | +5 min | Off target |
Review it monthly during plan reviews. Rows that are off target (the red rows) must be discussed.
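The Notes column can be filled in automatically. The article does not state the rule behind "OK", "Borderline" and "Off target", so the thresholds below are an assumption: a row is off target when the measured restart exceeds the target, and borderline when it reaches 90% of the target. That assumed rule happens to reproduce the three sample rows.

```python
rows = [
    # (system, target RTO in minutes, measured restart in minutes)
    ("Primary AD", 10, 8),
    ("ERP DB", 20, 18),
    ("ERP app", 10, 12),
]


def status(target, measured, borderline_pct=90):
    """Classify one row; the 90% borderline threshold is an assumed rule."""
    if measured > target:
        return "Off target"
    if measured * 100 >= borderline_pct * target:
        return "Borderline"
    return "OK"


for system, target, measured in rows:
    print(f"{system}: {status(target, measured)}")
```

Adjust `borderline_pct` to whatever margin your plan reviews treat as a warning.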
Three mistakes to avoid
- starting from the target: if management says "I want 15-minute RTO", do not take it as given. Measure first.
- ignoring human time: few restarts are fully automatic. Most need an admin to approve, verify and monitor. That time counts.
- forgetting external dependencies: software licenses, KMS, SaaS authentication. They have RTOs you do not control.
FAQ
Can I reduce RTO without more budget?
Yes, in two ways: (1) cut dependencies (e.g. local KMS cache), (2) automate the sequential restart with orchestration tooling.
Is RTO the same for every process?
No. Out of 20 processes, an SMB typically has 3-4 critical ones (RTO ≤ 30 min) and 6-8 important ones (≤ 4 hours); the rest can wait (≤ 24 hours).
How often should RTO be recalculated?
Annually at minimum, and after every large infrastructure change (migration, vendor swap, new systems).
To pair RTO measurement with RPO, read RTO vs RPO: differences. For drill methods that produce the real RTO, see DR drills.