Failover without touching DNS: the L2 dividend
Flipping DNS records under stress is the worst part of L3 DR. With an L2 tunnel you skip it entirely because the IP does not change.
TL;DR
Flipping DNS records under stress is the worst part of Layer 3 DR: TTL, propagation, system cache, app bugs. With an L2 tunnel you skip it entirely because the recovered VM IP is identical to the primary's.
The DNS-under-stress problem
In a standard Layer 3 DR:
- the cloud VM gets a new IP (e.g. 10.99.0.50);
- internal DNS records have to be updated (erp.local A 10.99.0.50);
- previous TTLs sit in cache until expiry (15 min - 24h);
- some clients have aggressive local caching;
- some apps DNS-lookup only at startup and keep the IP in cache.
Result: in the 30-90 minutes after failover some clients work, others do not, apparently at random. The worst experience for sysadmins under stress.
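The "apparently random" behaviour comes from each client's cache expiring at a different moment. A minimal sketch of that effect, assuming the DNS record is updated at minute 0 and that every client honours the TTL (client names, cache times and TTLs below are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    client: str           # hypothetical client name
    cached_at_min: float  # when the old answer was cached, in minutes relative to failover (negative = before)
    ttl_min: float        # TTL of the cached answer, in minutes

    def stale_for(self) -> float:
        """Minutes after failover during which this client still uses the old IP."""
        return max(0.0, self.cached_at_min + self.ttl_min)


def clients_still_stale(records, minutes_after_failover):
    """Names of clients whose cached answer has not yet expired."""
    return [r.client for r in records if r.stale_for() > minutes_after_failover]


# Hypothetical snapshot of three clients at the moment of failover.
records = [
    CachedRecord("laptop-01", cached_at_min=-10, ttl_min=15),
    CachedRecord("erp-client", cached_at_min=-30, ttl_min=240),
    CachedRecord("batch-host", cached_at_min=-5, ttl_min=60),
]

for t in (0, 10, 60, 240):
    print(t, clients_still_stale(records, t))
```

Apps that never re-resolve are not modelled here; for those the stale window lasts until the process restarts.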
The three DNS cache types
Resolver cache
Internal DNS servers (Active Directory DNS, BIND, etc.) cache responses for the TTL configured on each record, usually 1-4 hours.
OS cache
Windows: answers are cached until their TTL expires. Linux: cached only if nscd or systemd-resolved is running, and configurable. macOS: aggressive.
Application cache
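Many applications resolve a name once and keep the result far longer than the TTL allows. A JVM running with a security manager caches successful lookups for its whole lifetime by default, and .NET or Python services that hold long-lived connections or connection pools keep using the address resolved when the connection was opened. App restart = fresh lookup. Without restart = stuck on old IP.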
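The difference between an app that looks up the name once and one that looks it up on every connection can be sketched as follows. The class names and the 10.0.0.50 primary address are hypothetical, and the resolver is injected so the example runs without a real DNS server:

```python
import socket


def resolve(host: str) -> str:
    """Return the first IPv4 address the OS resolver gives for host."""
    infos = socket.getaddrinfo(host, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return infos[0][4][0]


class StartupResolvedClient:
    """Resolves the name once, at construction, and keeps the IP."""

    def __init__(self, host, resolver=resolve):
        self.address = resolver(host)

    def target(self) -> str:
        return self.address


class PerConnectClient:
    """Resolves the name again every time a target is needed."""

    def __init__(self, host, resolver=resolve):
        self.host = host
        self.resolver = resolver

    def target(self) -> str:
        return self.resolver(self.host)


# Fake zone standing in for internal DNS (hypothetical addresses).
zone = {"erp.local": "10.0.0.50"}


def lookup(host):
    return zone[host]


pinned = StartupResolvedClient("erp.local", resolver=lookup)
fresh = PerConnectClient("erp.local", resolver=lookup)

zone["erp.local"] = "10.99.0.50"  # record updated after a Layer 3 failover

print(pinned.target())  # still the address resolved at startup
print(fresh.target())   # the updated address
```

Only a restart of the first client would pick up the new record, which is the behaviour described above.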
The low-TTL workaround
Common practice: keep internal DNS TTLs low (5-15 min) for critical resources. Helps, but:
- loads the resolvers with more queries;
- does not solve app caching;
- requires rigorous DNS management policy.
It works, but it is fragile.
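The resolver load cost can be estimated with rough arithmetic. The sketch below assumes a worst case in which every client keeps every name warm, so each name is re-queried once per TTL per client; the client and name counts are hypothetical:

```python
def resolver_queries_per_second(clients: int, names: int, ttl_seconds: int) -> float:
    """Upper-bound steady-state query rate: one refresh per client, per name, per TTL."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return clients * names / ttl_seconds


# Hypothetical site: 500 clients, 20 critical names.
one_hour = resolver_queries_per_second(500, 20, 3600)
five_min = resolver_queries_per_second(500, 20, 300)

print(f"TTL 1 h:   {one_hour:.1f} queries/s")
print(f"TTL 5 min: {five_min:.1f} queries/s")
print(f"increase:  {five_min / one_hour:.0f}x")
```

Under this model the query rate scales inversely with the TTL, so dropping from one hour to five minutes multiplies resolver traffic for those names by twelve.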
The L2 advantage
With an L2 tunnel the problem does not exist:
- the recovered VM IP is the same as the primary's;
- clients find the IP in cache: correct;
- apps with hard-coded IPs: keep working;
- no DNS changes needed.
Time spent fixing DNS issues in emergency: zero.
When DNS is still needed
Even with an L2 tunnel, some cases require DNS:
- public services (e.g. webmail, corporate VPN outbound) — the public IP changes, so public DNS must update;
- failover between different Sefthy datacentres (rare);
- planned migrations.
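One way to see which records still need attention is to compare the addresses before and after failover. This is only an illustrative sketch: the record names and all addresses are hypothetical, and the public addresses come from the documentation ranges.

```python
def records_needing_update(current: dict[str, str], after_failover: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {name: (old_ip, new_ip)} for every record whose address changes."""
    changes = {}
    for name, old_ip in current.items():
        new_ip = after_failover.get(name)
        # Names missing from after_failover are not part of the recovery and are skipped.
        if new_ip is not None and new_ip != old_ip:
            changes[name] = (old_ip, new_ip)
    return changes


# Hypothetical records before failover.
current = {
    "erp.local": "10.0.0.50",
    "files.local": "10.0.0.60",
    "webmail.example.com": "203.0.113.10",
}

# With an L2 tunnel the internal addresses are unchanged; the public one is not.
after_l2_failover = {
    "erp.local": "10.0.0.50",
    "files.local": "10.0.0.60",
    "webmail.example.com": "198.51.100.10",
}

print(records_needing_update(current, after_l2_failover))
```

In this example only the public record shows up, matching the first case in the list above.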
FAQ
Can I disable aggressive client-side DNS caching?
Yes, but it is a fragile workaround and impacts performance. Better to use L2.
Do cloud-native apps (modern web apps) suffer from this?
Less, because they rely on dynamic service discovery. But for legacy apps the issue is real.
How much RTO does this save?
On average 15-45 minutes, depending on the legacy app mix.
For the L2 pillar, see "L2 tunnel for DR". For legacy apps, see "L2 and legacy apps".