Operations · 14 min read

Improving Solr Search Stability and Performance in a High‑Traffic Personalization Service

This article describes how a team tackled stability and performance problems in a SolrCloud‑based search and recommendation stack serving 150,000 requests per minute, detailing root‑cause analysis, memory and GC tuning, replica configuration changes, and the resulting reductions in latency, resource usage, and operational complexity.

Architects Research Society
This is a short story about how we managed to overcome stability and performance issues in our search and relevance stack.

Context

Over the past ten months we have been working with the personalization and relevance team, serving personalized and relevant content to users via three public endpoints: Home Feed, Search, and the Related Items API. After a few months we faced the challenge of delivering the same quality of service in larger key markets while maintaining the performance and stability we had achieved in smaller regions.

We run SolrCloud (v7.7) on OpenShift in AWS, coordinated by ZooKeeper. At the time of writing, the API serves roughly 150,000 requests per minute and sends about 210,000 updates per hour to Solr in our largest region.

Baseline

After deploying Solr in our biggest market we ran load tests with internal tools, assuming Solr itself was well configured. The team therefore focused on improving client-side performance and raising Solr timeouts, accepting a slightly looser traffic-handling policy.

Post‑Migration

The service responded within acceptable latency and Solr clients performed well until circuit breakers opened due to timeouts caused by sporadic long-running replica responses. The issues we observed included:

High proportion of replicas entering recovery for long periods.

Replicas failing to reach the leader because they were too busy.

Leader nodes overloaded by indexing, querying, and replica synchronization, leading to shard crashes.

Suspicion falling on the index/update service, because throttling its traffic to Solr stopped replicas from entering recovery.

Frequent full garbage collections (old and young generations).

searcherExecutor threads consuming CPU alongside the garbage collector.

searcherExecutor threads throwing exceptions during cache warm-up (LRUCache.warm).

Response times increasing from ~30 ms to ~1500 ms.

IOPS on some Solr EBS volumes hitting 100%.

Problem Handling

Analysis

We identified several topics for deeper analysis.

Lucene Configuration

Apache Solr is a widely used search and ranking engine built on Lucene. Adjusting Lucene settings is possible but often requires sacrificing document structure, and the effort is rarely worthwhile.

Document and Disk Size

Assuming ~10 million documents at an average size of 2 KB, the raw data footprint comes to roughly 20 GB before replication and index overhead.
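This estimate is easy to reproduce with back-of-the-envelope arithmetic. In the sketch below, the document count and average size come from the article; the replication factor and the Lucene overhead multiplier are illustrative assumptions, not measured values.

```python
# Back-of-the-envelope index sizing.
# Document count and average size are from the article; the replication
# factor and overhead multiplier are illustrative assumptions.
NUM_DOCS = 10_000_000        # ~10 million documents
AVG_DOC_SIZE_KB = 2          # ~2 KB per document
REPLICATION_FACTOR = 3       # hypothetical: leader + two replicas
LUCENE_OVERHEAD = 1.15       # hypothetical ~15% for index structures

raw_gb = NUM_DOCS * AVG_DOC_SIZE_KB / 1024 / 1024
per_replica_gb = raw_gb * LUCENE_OVERHEAD
total_gb = per_replica_gb * REPLICATION_FACTOR

print(f"raw data:      {raw_gb:.1f} GB")
print(f"per replica:   {per_replica_gb:.1f} GB")
print(f"cluster total: {total_gb:.1f} GB")
```

Even rough numbers like these are useful: they bound how much of the index can realistically sit in the OS page cache on a given machine size.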

Sharding

Having multiple shards does not automatically make Solr more resilient; the slowest shard determines overall response time. Splitting documents across shards does, however, reduce per-shard cache and disk requirements and speed up indexing.

Index/Update Process

Our main market receives up to 210,000 updates per hour during peak traffic.

Zookeeper

ZooKeeper maintains the cluster state; frequent replica recoveries can cause that state to drift, leading to long-running recovery loops.

Memory Theory

RAM is a critical driver of Solr performance. Solr uses heap memory for Java objects, while the index files themselves are served through the OS file-system cache, which lives outside the heap. Proper heap sizing and GC configuration are therefore essential: on a 28 GB machine, for example, we might allocate an 18 GB heap and leave the rest to the page cache.
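On a box like the 28 GB example above, the heap/page-cache split is typically set in solr.in.sh. The fragment below is an illustrative sketch: the 18 GB heap mirrors the example, while the G1GC flags are common starting points rather than the exact flags we ran in production.

```sh
# solr.in.sh -- illustrative values, not the exact production configuration
SOLR_HEAP="18g"   # leaves ~10 GB of the 28 GB machine for the OS page cache

# Common G1GC starting points; tune the pause target against real traffic
GC_TUNE="-XX:+UseG1GC \
  -XX:MaxGCPauseMillis=250 \
  -XX:+ParallelRefProcEnabled"
```

Keeping a large slice of RAM outside the heap matters because Lucene reads index files through the OS page cache; an oversized heap starves that cache and pushes reads back to disk.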

Cache Evidence

We tuned caches based on Solr admin panel metrics:

queryResultCache hit rate: 0.01

filterCache hit rate: 0.43

documentCache hit rate: 0.01
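These hit rates suggested that only the filterCache was earning its keep. Cache sizes live in solrconfig.xml; the fragment below shows the relevant knobs with illustrative values (the size and autowarmCount figures are assumptions, not the exact numbers we deployed).

```xml
<!-- solrconfig.xml: illustrative sizes, not the exact production values -->
<filterCache class="solr.FastLRUCache"
             size="4096"
             initialSize="4096"
             autowarmCount="512"/>

<!-- Low hit rates suggest keeping these small and skipping warm-up -->
<queryResultCache class="solr.LRUCache"
                  size="256"
                  initialSize="256"
                  autowarmCount="0"/>
<documentCache class="solr.LRUCache"
               size="256"
               initialSize="256"
               autowarmCount="0"/>
```

Autowarming copies entries into a new searcher's caches on every commit, which is exactly where the LRUCache.warm exceptions surfaced; trimming autowarmCount on low-value caches shortens warm-up and reduces heap churn.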

GC and Heap

New Relic showed frequent circuit-breaker trips on our memory thresholds (a 20% memory threshold and a 10% GC-CPU threshold), indicating insufficient available memory.

searcherExecutor threads ran at high CPU while the heap sat near 99% utilization, and JMX/JConsole logs showed LRUCache.warm exceptions related to cache size and warm-up.

Disk Activity – AWS IOPS

AWS volume metrics confirmed the baseline observation: IOPS on several Solr EBS volumes were saturated at 100% during peak load.

Starting to Solve the Issues

Search Result Fault Tolerance

We ensured Solr replicas remained available for queries by adopting Solr 7's new replica types:

NRT replicas – the traditional near-real-time replicas, which index documents locally.

TLOG replicas – maintain a transaction log and copy the index from the leader via binary replication; they remain eligible to become leader.

PULL replicas – do not index at all; they only pull the index from the leader via binary replication and cannot become leader.

With this configuration, as long as a shard has a leader, its PULL replicas keep answering queries, improving reliability and reducing how often replicas enter recovery.
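Replica types are chosen when a collection is created (or when replicas are added) through the Collections API. The call below is a hypothetical sketch: the collection name, config name, and shard/replica counts are placeholders, not our actual topology.

```sh
# Hypothetical CREATE call: per shard, one TLOG replica (leader-eligible)
# plus two PULL replicas that only serve queries. Names and counts are
# placeholders, not the production setup.
curl "http://localhost:8983/solr/admin/collections?action=CREATE\
&name=items&numShards=2\
&tlogReplicas=1&pullReplicas=2\
&collection.configName=items_config"
```

Because PULL replicas never index and cannot be elected leader, they keep serving queries while the TLOG leader absorbs the indexing load.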

Adjusting Solr Memory

We estimated RAM requirements for 7 million documents (~3.8 TB of index) and, even after splitting into five shards, still needed ~3.4 TB in total, confirming the need for ample memory.

Cache Results

After re-tuning the caches, the filterCache hit rate improved dramatically:

queryResultCache hit rate: 0.01

filterCache hit rate: 0.99

documentCache hit rate: 0.02

GC Results

New Relic showed no old‑generation GC activity, preventing circuit‑breaker trips.

Disk Activity Results

Disk I/O improved significantly, and indexing throughput increased.

External Service Results

One service accessing Solr showed notable reductions in response time and error rate.

Adjusting Solr Cluster

In multi-shard mode a single failing replica can inflate response times, because the leader waits for every shard to answer. To mitigate this, we gradually reduced the number of nodes and shards and lowered the internal replication factor.

Conclusion

After weeks of investigation, testing, and tuning we eliminated the initial issues, reduced latency, simplified management by using fewer shards and replicas, achieved reliable full‑load indexing/update service, and cut AWS EC2 costs by roughly half.

Tags: java, performance, operations, scalability, cloud, search, solr
Written by

Architects Research Society

A daily treasure trove for architects, expanding your view and depth. We share enterprise, business, application, data, technology, and security architecture, discuss frameworks, planning, governance, standards, and implementation, and explore emerging styles such as microservices, event‑driven, micro‑frontend, big data, data warehousing, IoT, and AI architecture.
