Big Data · 22 min read

Design and Key Technologies of the 360 Search Engine for Billion‑Scale Web Retrieval

This article explains how 360 Search handles billions of daily crawls and hundred‑billion‑scale indexing by describing its overall architecture, core modules such as offline indexing and online retrieval, query analysis, relevance scoring, and the engineering techniques that enable efficient large‑scale web search.

Architecture Digest

360 Search is a flagship product of Qihoo 360, operating tens of thousands of servers to crawl up to one billion web pages per day and maintaining an index that covers hundreds of billions of high‑quality pages.

The article outlines the overall design of a hundred‑billion‑scale search engine in four modules: overall engine design, key technologies for billion‑scale computation, web‑index organization, and web retrieval and relevance.

It starts with the basic retrieval workflow: a user query is tokenized, terms are looked up in the inverted index, intersected to obtain a document list, and then ranked using features from both the forward and inverted indexes before being displayed on the front‑end.
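The workflow above can be sketched end to end in a few lines. This is a minimal single-machine illustration with invented documents, not 360 Search's implementation; a production engine shards both stores across many servers and uses far richer ranking features.

```python
from collections import defaultdict

# Hypothetical corpus to illustrate the workflow; contents are made up.
docs = {
    1: "distributed search engine design",
    2: "search engine relevance ranking",
    3: "distributed storage design",
}

inverted = defaultdict(list)   # inverted index: term -> sorted doc-ID posting list
forward = {}                   # forward index: doc ID -> token list
for doc_id, text in docs.items():
    tokens = text.split()      # stand-in for a real tokenizer
    forward[doc_id] = tokens
    for term in set(tokens):
        inverted[term].append(doc_id)

def retrieve(query):
    terms = query.split()
    # Look up each term's posting list, then intersect the lists.
    postings = [set(inverted.get(t, [])) for t in terms]
    candidates = set.intersection(*postings) if postings else set()
    # Rank with a trivial score: query-term occurrences from the forward index.
    scored = [(sum(forward[d].count(t) for t in terms), d) for d in candidates]
    return [d for _, d in sorted(scored, reverse=True)]

print(retrieve("search engine"))   # docs 1 and 2 both contain both terms
```

The split between the two stores mirrors their roles in the pipeline: the inverted index narrows the candidate set, the forward index supplies per-document features for ranking.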

The article explains the two main stores used in retrieval: the forward index (storing document attributes and token lists) and the inverted index (mapping terms to posting lists), and how they support efficient lookup.

The retrieval model is broken down into query analysis, resource selection, relevance calculation, and re‑search strategies, emphasizing the importance of understanding user intent, term weighting, and handling cases where initial results are insufficient.

Query analysis covers three aspects: determining tokenization granularity, analyzing term weights (including importance and directionality), and inferring user intent and timeliness.

Query strategy includes selecting the appropriate resource pool, determining the relevant document set (including partial term matches), computing relevance scores, and applying re‑search tactics such as expanding resources, adjusting token granularity, or rewriting queries.

Key technologies for billion‑scale processing are described in two parts: offline indexing (using HBase/HDFS, MapReduce, and Storm/Kafka for data storage, batch index creation, and real‑time updates) and online retrieval (distributed services, request broadcasting, load balancing, and the core intersection and relevance modules).

The offline indexing pipeline involves index partitioning, batch creation via MapReduce, and incremental updates to handle daily new data and rank‑related feature changes.
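The batch-creation step can be sketched as the classic MapReduce inverted-index pattern: the map phase emits (term, doc ID) pairs and the reduce phase collects them into posting lists. This single-process sketch only imitates the data flow; page IDs and contents are invented, and a real job would run over HBase/HDFS data across a cluster.

```python
from itertools import groupby
from operator import itemgetter

# Invented crawl output: (doc ID, page text) pairs.
pages = [(101, "web index design"), (102, "index compression"), (103, "web compression")]

def map_phase(doc_id, text):
    # Emit one (term, doc_id) pair per distinct term in the document.
    for term in set(text.split()):
        yield term, doc_id

def reduce_phase(pairs):
    # Group the shuffled pairs by term and build sorted posting lists.
    index = {}
    for term, group in groupby(sorted(pairs), key=itemgetter(0)):
        index[term] = sorted(doc_id for _, doc_id in group)
    return index

pairs = [p for doc_id, text in pages for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
print(index["compression"])   # [102, 103]
```

Incremental updates then only need to run this pattern over the day's new or changed documents and merge the resulting partial index into the existing partitions.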

Online retrieval relies on a distributed service framework that broadcasts queries to partitioned index shards, merges results, and performs load‑balanced processing, with intersection and basic relevance calculations at its core.
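The broadcast-and-merge step is a scatter-gather pattern: the query goes to every shard in parallel, each shard returns its local top-k, and the coordinator merges the partial lists. The sketch below assumes invented shard contents and scores; a real service would use RPC rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitioned index: each shard maps a term to (doc ID, score) hits.
shards = [
    {"search": [(1, 0.9), (4, 0.5)]},
    {"search": [(7, 0.8)]},
    {"search": [(9, 0.4), (12, 0.3)]},
]

def query_shard(shard, term, k):
    # Each shard independently returns its own local top-k hits.
    return sorted(shard.get(term, []), key=lambda hit: -hit[1])[:k]

def broadcast(term, k=3):
    # Scatter the query to all shards in parallel, then gather and merge.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda s: query_shard(s, term, k), shards)
    merged = [hit for part in partials for hit in part]
    return sorted(merged, key=lambda hit: -hit[1])[:k]

print(broadcast("search"))   # global top 3 across all shards
```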

The architecture diagram shows the flow from web crawling to storage in HBase, quality filtering, index partitioning, and both batch and real‑time index generation, followed by distributed query handling and result ranking.

Index organization details include forward‑index design for independent updates, handling sparse attributes with variable‑length blocks, and inverted‑index compression using block‑level encoding and segment metadata to enable fast range lookups.
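One common way to realize block-level encoding with segment metadata, which the article describes but does not specify in detail, is delta encoding plus base-128 varints: each block stores its first doc ID absolute and the rest as gaps, and a small metadata array of first-doc-IDs lets a range lookup jump straight to the right block. A sketch under those assumptions:

```python
import bisect

def varint_encode(n):
    # Base-128 varint: 7 payload bits per byte, high bit means "more bytes follow".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def compress_postings(doc_ids, block_size=4):
    # Split the sorted posting list into fixed-size blocks; store the first
    # doc ID of each block absolute, the rest as varint-encoded gaps.
    # meta holds each block's first doc ID -- the "segment metadata".
    meta, blocks = [], []
    for start in range(0, len(doc_ids), block_size):
        chunk = doc_ids[start:start + block_size]
        meta.append(chunk[0])
        buf = bytearray(varint_encode(chunk[0]))
        for prev, cur in zip(chunk, chunk[1:]):
            buf += varint_encode(cur - prev)
        blocks.append(bytes(buf))
    return meta, blocks

def decompress_block(block):
    # Decode the varints, then undo the delta encoding.
    vals, acc, shift = [], 0, 0
    for b in block:
        acc |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            vals.append(acc)
            acc, shift = 0, 0
    for i in range(1, len(vals)):
        vals[i] += vals[i - 1]
    return vals

def block_for(meta, target):
    # Binary-search the segment metadata for the block that may hold target.
    return max(bisect.bisect_right(meta, target) - 1, 0)

postings = [3, 7, 12, 20, 35, 40, 58, 90, 101]
meta, blocks = compress_postings(postings)
```

Only the one block that may contain a target doc ID needs to be decoded, which is what makes block-level compression compatible with fast lookups.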

The intersection model selects the shortest posting list, uses segment information to locate the appropriate block, and applies binary search with step‑size optimizations to accelerate document ID lookup.
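A standard realization of this "binary search with step-size optimization" is galloping (exponential) search: from the current position, double the probe step until the target is overshot, then binary-search only the bracketed range. The sketch below iterates the shorter list and gallops through the longer one; it is an illustration of the technique, not 360 Search's code.

```python
from bisect import bisect_left

def gallop_to(postings, target, lo):
    # Grow the step exponentially from position lo until we pass target,
    # then binary-search only the bracketed range.
    step, hi = 1, lo + 1
    while hi < len(postings) and postings[hi] < target:
        lo, hi, step = hi, hi + step, step * 2
    return bisect_left(postings, target, lo, min(hi + 1, len(postings)))

def intersect(a, b):
    # Drive the intersection from the shorter posting list; the gallop
    # position advances monotonically because both lists are sorted.
    if len(a) > len(b):
        a, b = b, a
    out, pos = [], 0
    for doc_id in a:
        pos = gallop_to(b, doc_id, pos)
        if pos < len(b) and b[pos] == doc_id:
            out.append(doc_id)
    return out
```

Galloping costs O(log d) per probe, where d is the distance actually skipped, so it is fast both when matches are dense and when one list is much longer than the other.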

Basic relevance scoring combines TF‑IDF (or BM25) weighting with proximity calculations that consider term adjacency and directionality to produce a final relevance score.
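The two ingredients can be sketched separately: a BM25 term weight and a toy proximity bonus that rewards adjacent term pairs, with extra weight when they appear in query order (the "directionality" mentioned above). This uses one common BM25 formulation; the article does not specify which variant or proximity formula 360 Search uses.

```python
import math

def bm25(tf, df, doc_len, avg_len, n_docs, k1=1.2, b=0.75):
    # One common BM25 formulation: IDF times a saturating,
    # length-normalized term-frequency component.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

def proximity_bonus(positions_a, positions_b, in_order_weight=2.0):
    # Toy proximity: each adjacent pair of the two terms adds to the bonus,
    # and pairs in query order (a immediately before b) count double.
    bonus = 0.0
    for pa in positions_a:
        for pb in positions_b:
            if abs(pb - pa) == 1:
                bonus += in_order_weight if pb == pa + 1 else 1.0
    return bonus
```

The final score would sum the per-term BM25 weights and add the proximity bonuses over query-term pairs, so documents where query terms occur close together and in order rank above documents that merely contain the terms.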

Additional related technologies mentioned are timeliness handling, cluster resource optimization, retrieval performance tuning, caching, system stability, and real‑time big‑data computation.

The article concludes with a summary of the overall design and key techniques used in the 360 Search engine.

distributed systems · big data · search engine · ranking · information retrieval · large-scale indexing
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
