Failover without touching DNS: the L2 dividend

Flipping DNS records under stress is the worst part of L3 DR. With an L2 tunnel you skip it entirely because the IP does not change.


TL;DR

Flipping DNS records under stress is the worst part of Layer 3 DR: TTLs, propagation delays, OS caches and application quirks. With an L2 tunnel you skip it entirely because the recovered VM's IP is identical to the primary's.

The DNS-under-stress problem

In a standard Layer 3 DR:

  1. the cloud VM gets a new IP (e.g. 10.99.0.50);
  2. internal DNS records have to be updated (erp.local A 10.99.0.50);
  3. the old records sit in caches until their TTL expires (15 min to 24 h);
  4. some clients cache aggressively on top of that;
  5. some apps resolve the name only at startup and keep that IP for as long as they run.

Result: for 30-90 minutes after failover, some clients work and others do not, seemingly at random. It is the worst possible experience for sysadmins already under stress.
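The "apparently at random" part follows from simple arithmetic: each client picks up the new record only when its own cached copy expires, and every client cached it at a different moment. Here is a minimal Python sketch of that effect. It assumes a single cache layer and a DNS record updated at the instant of failover; the function names and the sample numbers are illustrative, not measurements.

```python
def seconds_until_fresh(cached_age_s: int, ttl_s: int) -> int:
    """Seconds after failover before a client re-queries and sees the new IP.

    cached_age_s is how long before failover the client cached the old record.
    """
    return max(ttl_s - cached_age_s, 0)


def stale_fraction(cached_ages_s: list[int], ttl_s: int, elapsed_s: int) -> float:
    """Fraction of clients still holding the old IP elapsed_s after failover."""
    stale = [
        age for age in cached_ages_s
        if seconds_until_fresh(age, ttl_s) > elapsed_s
    ]
    return len(stale) / len(cached_ages_s)


# Hypothetical fleet: five clients that cached erp.local at different times,
# with a 1-hour TTL on the record.
ages = [0, 900, 1800, 2700, 3500]
for minutes in (10, 30, 60):
    print(minutes, "min after failover:", stale_fraction(ages, 3600, minutes * 60))
```

With these sample values, four of the five clients are still on the old IP after 10 minutes, two after 30 minutes, and none after an hour. Real environments stack resolver, OS and application caches, so the window is usually longer than a single-layer model suggests.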

The three DNS cache types

Resolver cache

Internal DNS servers (Active Directory DNS, BIND, etc.) cache responses for the TTL configured on the record, usually 1-4 hours.

OS cache

Windows: the DNS Client service caches records until their TTL expires. Linux: caching depends on whether nscd or systemd-resolved is running, and is configurable. macOS: caches aggressively.

Application cache

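Many applications hold on to a resolved address far longer than the TTL. Java caches successful lookups for the life of the JVM when a security manager is installed (and for a short interval by default otherwise); .NET and Python apps often keep pooled connections or a startup-time lookup for the life of the process. App restart = fresh lookup. Without restart = stuck on old IP.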
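To make the application-level behaviour concrete, here is a small Python sketch contrasting a client that resolves once at startup with one that resolves on every request. The two classes, the in-memory `records` table and the primary address 10.0.0.50 are hypothetical stand-ins for illustration; in a real app the `resolve` callable could be something like `socket.gethostbyname`.

```python
from typing import Callable

Resolver = Callable[[str], str]


class StartupResolvedClient:
    """Resolves the hostname once, at construction, and never again."""

    def __init__(self, host: str, resolve: Resolver) -> None:
        self.ip = resolve(host)

    def target(self) -> str:
        return self.ip


class PerRequestClient:
    """Resolves the hostname every time it needs the address."""

    def __init__(self, host: str, resolve: Resolver) -> None:
        self.host = host
        self.resolve = resolve

    def target(self) -> str:
        return self.resolve(self.host)


# Fake DNS zone standing in for the internal resolver (assumed values).
records = {"erp.local": "10.0.0.50"}


def fake_resolve(host: str) -> str:
    return records[host]


legacy = StartupResolvedClient("erp.local", fake_resolve)
modern = PerRequestClient("erp.local", fake_resolve)

# Layer 3 failover: the record is repointed to the DR address.
records["erp.local"] = "10.99.0.50"

print("startup-resolved client ->", legacy.target())  # still the old address
print("per-request client      ->", modern.target())  # follows the new record
```

The startup-resolved client keeps pointing at the old address until it is restarted, no matter how low the TTL is set.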

The low-TTL workaround

Common practice: keep internal DNS TTLs low (5-15 min) for critical resources. Helps, but:

  • loads the resolvers with more queries;
  • does not solve app caching;
  • requires rigorous DNS management policy.

It works, but it is fragile.
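The resolver-load cost can be estimated on the back of an envelope. The sketch below assumes the pessimistic case where every client re-resolves every critical name once per TTL; the function name and the fleet size are made up for illustration.

```python
def resolver_queries_per_second(clients: int, names: int, ttl_s: int) -> float:
    """Upper-bound query rate if each client re-resolves each name once per TTL."""
    return clients * names / ttl_s


# Hypothetical fleet: 500 clients, 20 critical internal names.
one_hour = resolver_queries_per_second(500, 20, 3600)
five_min = resolver_queries_per_second(500, 20, 300)

print(round(one_hour, 2))          # 2.78 queries/s with a 1-hour TTL
print(round(five_min, 2))          # 33.33 queries/s with a 5-minute TTL
print(round(five_min / one_hour))  # 12x more load on the resolvers
```

The absolute numbers are small for a modern resolver, but the multiplier applies to every record given a low TTL, and none of it helps with apps that never re-resolve.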

The L2 advantage

With an L2 tunnel the problem does not exist:

  • the recovered VM IP is the same as the primary's;
  • clients that find the IP in cache already have the correct one;
  • apps with hard-coded IPs: keep working;
  • no DNS changes needed.

Time spent fixing DNS issues in emergency: zero.
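Because the address does not change, a post-failover check can skip name resolution entirely and simply confirm that the same IP and port still answer. A minimal Python sketch of such a check follows; the helper name, the address and the port in the usage comment are assumptions, not part of any Sefthy tooling.

```python
import socket


def is_reachable(ip: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


# Example usage (hypothetical address and port of the recovered VM):
# is_reachable("10.0.0.50", 443)
```

A check like this verifies what cached clients and hard-coded configurations will actually hit: the unchanged IP, not the DNS name.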

When DNS is still needed

Even with an L2 tunnel, some cases require DNS:

  • public services (e.g. webmail, outbound corporate VPN): the public IP changes, so public DNS must be updated;
  • failover between different Sefthy datacentres (rare);
  • planned migrations.

FAQ

Can I disable aggressive client-side DNS caching?

Yes, but it is a fragile workaround and impacts performance. Better to use L2.

Do cloud-native apps (modern web apps) suffer from this?

Less, because they rely on dynamic service discovery. But for legacy apps the issue is real.

How much RTO does this save?

On average 15-45 minutes, depending on the legacy app mix.


Related reading: for the L2 pillar, see "L2 tunnel for DR"; for legacy apps, see "L2 and legacy apps".

Want to see Sefthy in action?

Same IP, same subnet, RTO in minutes. Try it free for 7 days or talk to one of our specialists.