
Fixing Aerospike Cluster Outage: Network Glitches, Memory Limits, and Rebalancing

After a network disruption caused Paxos messages to be ignored and a memory shortage blocked data migration, the Aerospike cluster became unavailable. This article walks through the diagnostic logs and the step‑by‑step remediation: node restarts, memory tuning, and adding nodes to rebalance the load.

Xiaolei Talks DB

Aerospike is a highly available, horizontally scalable distributed database designed for large‑scale data workloads. This article analyzes a recent cluster incident, explains its root causes, and presents the resolution steps.

Problem Overview

1. The cluster reported availability below 50% and several nodes were marked unavailable.

2. Logging into an AS node with asadm showed that only 2 of the 5 nodes were alive, rendering the cluster unavailable.

3. Business services reported read/write failures.
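Why 2 of 5 nodes makes the cluster unavailable: Aerospike's clustering protocol needs a majority of the expected nodes to agree on membership before the cluster can serve traffic. A minimal sketch of that majority arithmetic (a deliberate simplification of the real Paxos-based protocol; the 2-of-5 figures are from this incident):

```shell
# Simplified quorum check: a strict majority of nodes must be alive.
total=5
alive=2
needed=$(( total / 2 + 1 ))   # majority threshold: 3 of 5
if [ "$alive" -ge "$needed" ]; then
  echo "quorum OK ($alive/$total)"
else
  echo "below quorum ($alive/$total, need $needed)"
fi
```

With only 2 nodes alive the check fails, which matches the unavailability observed in asadm.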

Root Cause Analysis

The issue stemmed from network instability, causing Paxos messages from other nodes to be ignored.

<code>Aug 20 2021 16:35:04 GMT+0800: WARNING (clustering): (clustering.c:4313) ignoring paxos accepted from node bb9dde810bf926c - it is not in acceptor list
Aug 20 2021 16:35:04 GMT+0800: WARNING (clustering): (clustering.c:4313) ignoring paxos accepted from node bb9658411bf926c - it is not in acceptor list</code>

The official documentation explains that unstable network conditions cause Paxos messages to be dropped.

Additional logs confirmed delayed heartbeat packets due to the same network problem.

<code>Aug 20 2021 16:35:06 GMT+0800: INFO (clustering): (clustering.c:7242) ignoring stale join request from node bb9d9e710bf926c - delay estimate 366(ms)</code>

Network cut‑over activities at the same time corroborated the timing of the issue.
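Where short-lived disruptions like this cut‑over are expected, the heartbeat tolerance can be raised dynamically so transient delays do not evict nodes. A sketch using Aerospike's dynamic heartbeat parameters (the values are illustrative, not what was applied during this incident):

```
asadm -e 'asinfo -v "set-config:context=network;heartbeat.interval=250"'
asadm -e 'asinfo -v "set-config:context=network;heartbeat.timeout=20"'
```

Here interval is in milliseconds and timeout is the number of missed intervals after which a node is declared dead; raising either makes membership more tolerant of delayed heartbeat packets, at the cost of slower failure detection.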

Remediation Steps

1. Restart Affected AS Services

All failed Aerospike processes were restarted. Because the cluster holds a large dataset, a cold start required loading data from disk and rebuilding primary‑key indexes, which took considerable time.

The restart alone did not resolve the problem; nodes repeatedly left and re‑joined the cluster, and the AS process eventually failed with an out‑of‑memory error:

<code>Aug 20 2021 16:35:04 GMT+0800: WARNING (arenax): (arenax_ee.c:436) could not allocate 1073741824-byte arena stage 58: Cannot allocate memory</code>

Memory inspection showed only ~2 GB free and essentially no available memory, so the 1 GiB contiguous arena stage needed for data migration could not be allocated.

<code>free -g
              total        used        free      shared  buff/cache   available
Mem:            125           2           2         119         120           0
Swap:             0           0           0</code>
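A lightweight guard is to parse `free -b` output and warn when "available" falls below the 1 GiB arena stage size from the warning log. A sketch against illustrative sample output (on a live node, replace the sample with `free -b` itself):

```shell
# Warn when "available" memory drops below the 1 GiB arena stage size.
# The sample output below is illustrative; on a live node use: free -b
sample='              total         used        free      shared  buff/cache   available
Mem:     134217728000 127000000000  2147483648  2147483648  5070244352   943718400
Swap:               0            0           0'
threshold=$(( 1024 * 1024 * 1024 ))                 # 1 GiB arena stage
available=$(printf '%s\n' "$sample" | awk '/^Mem:/ { print $7 }')
if [ "$available" -lt "$threshold" ]; then
  echo "WARN: only $available bytes available; 1 GiB arena allocation may fail"
fi
```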

Official documentation confirms that migration needs a contiguous memory region, which was unavailable.
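If physical-memory fragmentation is what prevents a large contiguous region from being found, `/proc/buddyinfo` shows how many free blocks remain at each allocation order; runs of zeros at the high orders mean free memory exists only in small pieces. A sketch that sums high-order free bytes from one zone line (the sample line is illustrative):

```shell
# Each buddyinfo line lists free block counts per order, order 0 first
# (field 5 on a "Node N, zone X ..." line). Sum bytes held in blocks of
# order >= 5; here every high-order count is zero: badly fragmented.
line='Node 0, zone   Normal    120     45     12      3      1      0      0      0      0      0      0'
echo "$line" | awk '{
  total = 0
  for (i = 10; i <= NF; i++)           # fields 10..NF are orders 5 and up
    total += $i * 2 ^ (i - 5) * 4096   # blocks * pages-per-block * 4 KiB page
  printf "high-order free: %d bytes\n", total
}'
```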

2. Proper Long‑Term Fix

Simply restarting nodes caused a cycle of joins and leaves without addressing the underlying availability issue. Two main problems were identified:

High cluster load left insufficient free memory.

Old nodes, once started, only exported data and refused incoming migrations.

To address the load, additional nodes were added to increase throughput.

Five new nodes were provisioned, allowing the cluster to handle reads/writes even if the original nodes remained down. Two migration scenarios were considered:

Accept possible data loss and let the new nodes form a fresh cluster.

Preserve existing data by gradually migrating it to the new nodes.

Memory‑related configuration parameters were tuned: the memory-size and high-water-memory-pct settings were reduced to reflect actual usage, preventing excessive write amplification.
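In `aerospike.conf` terms, that tuning corresponds to a namespace stanza along these lines (the namespace name and values are illustrative, not the incident's actual settings):

```
namespace mediav {
    memory-size 96G            # reduced to reflect actual usage
    high-water-memory-pct 60   # start evicting before allocations can fail
}
```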

Reduce migration threads to slow data movement:

<code>asadm -e 'asinfo -v "set-config:context=service;migrate-threads=1"'</code>

On the new nodes, allow incoming data but cap it at 4 concurrent incoming migrations:

<code>asadm -e 'asinfo -v "set-config:context=service;migrate-max-num-incoming=4"'</code>

On the old nodes, set incoming migrations to zero so they only export data:

<code>asadm -e 'asinfo -v "set-config:context=service;migrate-max-num-incoming=0"'</code>

As data migrated, old nodes freed memory, reducing load and increasing available memory.

After rebalancing, migration settings should be restored to appropriate defaults. Example default configuration:

<code>asinfo -v 'set-config:context=namespace;id=mediav;migrate-sleep=1'
asinfo -v 'set-config:context=service;migrate-threads=6'
asinfo -v 'set-config:context=service;migrate-max-num-incoming=8'
asinfo -v 'set-config:context=network;fabric.channel-bulk-recv-threads=8'</code>

Conclusion

The outage was caused by a network incident that led to Paxos messages being dropped, compounded by high cluster load and insufficient free memory. Once enough nodes had failed with out‑of‑memory errors or lost connectivity, the cluster fell below quorum and stopped serving traffic. Expanding the cluster and tuning the memory and migration parameters restored availability.

References:

Aerospike discussion on adding nodes

FAQ on high‑water disk percentage

Aerospike server log messages reference

Written by

Xiaolei Talks DB

Sharing daily database operations insights, from distributed databases to cloud migration. Author: Dai Xiaolei, with 10+ years of DB ops and development experience. Your support is appreciated.
