
Analysis of ARP Cache Behavior and Failover Issues in a Dual‑Firewall Environment

This article analyzes a network outage in a dual‑firewall environment. It covers ARP cache state transitions, the difference in neighbor reachability detection between Linux 6.0 and 7.0, the Cisco NXOS ARP timeout, and how keepalived handles traffic while in standby, then explains how these factors combined to disrupt traffic and how the root cause was resolved.

Qunar Tech Salon

Author: Feng Yawei, NetOps at Qunar.com, with extensive network operations experience.

Fault Description: At 19:01, fw3 promoted itself to master while fw4 remained master, leaving both firewalls active. At 19:27, fw3 dropped out of the master role, which triggered the fault. The resulting ARP cache problem on server1 was resolved by clearing the ARP entries and sending a gratuitous ARP refresh.

Key Knowledge Points

ARP cache states in Linux (REACHABLE, STALE, DELAY, PROBE, FAILED) and their timers.

Neighbor unreachability detection via upper‑layer protocols (TCP ACK, ICMP reply) or unicast ARP probes.

Differences between Linux 6.0 and 7.0 handling of STALE entries: Linux 6.0 relies on upper‑layer feedback, while Linux 7.0 sends unicast ARP immediately.
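
The state transitions listed above can be sketched as a small table-driven state machine. This is a simplified model, not the kernel's implementation: the event names (`packet_sent`, `upper_layer_confirm`, and so on) are hypothetical labels chosen for this sketch, and timer values are left out.

```python
from enum import Enum


class State(Enum):
    REACHABLE = "REACHABLE"
    STALE = "STALE"
    DELAY = "DELAY"
    PROBE = "PROBE"
    FAILED = "FAILED"


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.REACHABLE, "reachable_timer_expired"): State.STALE,
    (State.STALE, "packet_sent"): State.DELAY,
    (State.DELAY, "upper_layer_confirm"): State.REACHABLE,
    (State.DELAY, "delay_timer_expired"): State.PROBE,
    (State.PROBE, "arp_reply"): State.REACHABLE,
    (State.PROBE, "probes_exhausted"): State.FAILED,
}


def step(state, event):
    """Return the next state for one event."""
    # Per the article, a gratuitous ARP with a new MAC marks an
    # existing entry STALE regardless of its previous state.
    if event == "gratuitous_arp_new_mac":
        return State.STALE
    # Pairs not in the table leave the entry unchanged.
    return TRANSITIONS.get((state, event), state)


def trace(events, start=State.REACHABLE):
    """Apply events in order and return the visited state names."""
    states = [start]
    for event in events:
        states.append(step(states[-1], event))
    return [s.value for s in states]
```

Under this model, `trace(["gratuitous_arp_new_mac", "packet_sent", "upper_layer_confirm"])` walks REACHABLE, STALE, DELAY, REACHABLE, the path on which upper‑layer feedback revalidates an entry without any ARP probe. By contrast, `trace(["packet_sent", "delay_timer_expired", "probes_exhausted"], start=State.STALE)` ends in FAILED.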

Cisco NXOS ARP timeout: once an entry is 18 min 50 s old, the switch sends a unicast ARP probe every 37 s, up to 10 times; if none is answered, the entry expires after ~25 min.
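
The timing figures above fit together arithmetically. The sketch below assumes the first probe goes out at 18 min 50 s and that the entry is removed one interval after the tenth unanswered probe. That reading is an assumption made here to reconcile the numbers, not something taken from Cisco documentation.

```python
from datetime import timedelta

# Values quoted in the article.
PROBE_START = timedelta(minutes=18, seconds=50)
PROBE_INTERVAL = timedelta(seconds=37)
MAX_PROBES = 10


def probe_offsets():
    """Entry age at which each unicast probe is sent (assumed schedule)."""
    return [PROBE_START + i * PROBE_INTERVAL for i in range(MAX_PROBES)]


def expiry_offset():
    """Entry age at which it expires if no probe is answered."""
    return PROBE_START + MAX_PROBES * PROBE_INTERVAL
```

With these values, `expiry_offset()` comes to exactly 25 minutes (18 min 50 s plus 10 × 37 s), which matches the roughly 25‑minute expiry.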

iptables creates conntrack entries for non‑TCP first packets when SNAT rules exist.

A firewall whose keepalived instance is in standby still accepts VIP‑directed packets when the destination MAC matches its NIC, and it continues to apply SNAT to them.

Gratuitous ARP handling: Linux updates existing entries to STALE; NXOS resets timers without creating new entries.

keepalived dual‑active mechanism and election fallback behavior.

Failure Process and Analysis

Initially, fw4 was the primary and held the VIP. fw3 then unexpectedly promoted itself to primary and sent a gratuitous ARP, which repointed the VIP entries to fw3's MAC. On both Linux 6.0 and 7.0, the ARP entry was updated to STALE and then returned to REACHABLE through neighbor detection, so traffic flowed through fw3.

When both firewalls detected the dual‑active condition, an election returned fw3 to standby. On Server2, the ARP entry became STALE; its subsequent unicast ARP probe to fw3 received no reply, so the entry moved to FAILED. That prompted a broadcast ARP, which fw4 answered, restoring Server2's traffic to fw4. On Server1, the entry remained STALE longer and then passed through DELAY back to REACHABLE while still holding fw3's MAC. Eventually the VIP ARP entry on the core switch timed out after 25 minutes, causing the core switch to query fw4 again.
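
When diagnosing a situation like this, it helps to see which MAC each server currently holds for the VIP and in which state. The sketch below parses lines in the format printed by `ip neigh show`. The sample addresses (a documentation‑range VIP and made‑up MACs standing in for fw3 and fw4) are hypothetical, as are the function names.

```python
def parse_neigh_line(line):
    """Parse one line of `ip neigh show` output into a dict."""
    tokens = line.split()
    entry = {"ip": tokens[0], "dev": None, "lladdr": None, "state": tokens[-1]}
    for key in ("dev", "lladdr"):
        if key in tokens:
            idx = tokens.index(key)
            if idx + 1 < len(tokens):
                entry[key] = tokens[idx + 1]
    return entry


def vip_points_to(lines, vip, expected_mac):
    """Return True if the neighbor entry for vip holds expected_mac."""
    for line in lines:
        if not line.strip():
            continue
        entry = parse_neigh_line(line)
        if entry["ip"] == vip:
            mac = entry["lladdr"]
            return mac is not None and mac.lower() == expected_mac.lower()
    return False


# Hypothetical sample: the VIP still resolves to the MAC standing in for fw3.
SAMPLE = [
    "192.0.2.1 dev eth0 lladdr 02:00:00:00:00:03 STALE",
    "192.0.2.9 dev eth0 FAILED",
]
FW3_MAC = "02:00:00:00:00:03"  # made-up MAC for fw3
FW4_MAC = "02:00:00:00:00:04"  # made-up MAC for fw4
```

Here `vip_points_to(SAMPLE, "192.0.2.1", FW4_MAC)` is false, the kind of mismatch that would indicate a server is still sending VIP traffic to the standby firewall.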

Root causes:

fw4 did not send a gratuitous ARP after winning the election, leaving stale VIP ARP information on the network.

Linux 6.0’s reliance on upper‑layer detection delayed detection of the VIP change.

The resolution was to ensure that the primary firewall sends a gratuitous ARP on every state change, and to understand how ARP state handling differs across kernel versions.
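
To make concrete what such an announcement contains, here is a minimal sketch that builds a request‑form gratuitous ARP frame by hand. It is illustrative only and is not how keepalived produces its announcements. The function names, the VIP, and the MAC are hypothetical, and `send_frame` is defined but never called because raw sockets require Linux and root privileges.

```python
import socket
import struct


def build_gratuitous_arp(vip, mac):
    """Return an Ethernet frame carrying a request-form gratuitous ARP."""
    src_mac = bytes.fromhex(mac.replace(":", ""))
    ip = socket.inet_aton(vip)
    broadcast = b"\xff" * 6
    # Ethernet header: broadcast destination, our MAC, EtherType 0x0806 (ARP).
    ethernet = broadcast + src_mac + struct.pack("!H", 0x0806)
    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # address lengths 6 and 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += src_mac + ip        # sender MAC and sender IP (the VIP)
    arp += b"\x00" * 6 + ip    # target MAC unset, target IP equals sender IP
    return ethernet + arp


def send_frame(interface, frame):
    """Send a raw frame (Linux only, needs root). Not called in this sketch."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((interface, 0))
        sock.send(frame)


# Hypothetical values: documentation-range VIP and a made-up MAC for fw4.
frame = build_gratuitous_arp("192.0.2.1", "02:00:00:00:00:04")
```

The defining property is that the sender IP and target IP are both the VIP, so every host that already has an entry for the VIP learns the announcing firewall's MAC.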


Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
