From Text to Images: Building Multi‑Modal Product Search with Elasticsearch Serverless
This article walks through a complete multi-modal product search solution: it turns textual and visual product data into embeddings, uses dense, sparse, and hybrid models, applies vector similarity together with quantization techniques such as SQ and BBQ, and shows how Elasticsearch Serverless provides a cost-effective, auto-scaling backbone for end-to-end retrieval.
With the rapid rise of AI, users now expect search experiences that go beyond simple keyword matching. Traditional text-only search cannot handle scenarios such as finding a distinctive hair dryer from a photo taken in a hotel, or retrieving a pair of children's shorts by visual attributes like color and cartoon pattern. Multi-modal and cross-modal search closes these gaps by accepting queries as text, images, or natural-language descriptions.
1. Multi‑modal Product Search Solution
The solution consists of three logical layers:
Data Processing Layer: Structured product metadata (title, description, category, tags) is tokenized and indexed in a traditional text engine. Images are fed to a large vision model (e.g., a CNN-based captioning model) to generate descriptive captions, which are then tokenized and indexed alongside the text.
Embedding & Vector Engine: Both the textual captions and the raw images are converted into high-dimensional vectors (embeddings). Dense vectors capture semantic similarity, while sparse vectors retain exact term matching. The vectors are stored in a dedicated vector engine.
Fusion & Ranking Layer: Results from the text engine and the vector engine are merged. A rerank module combines the textual relevance score and the vector similarity score, and the final list is produced by Reciprocal Rank Fusion (RRF), which ranks documents by their positions in the individual result sets rather than by their raw scores.
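The fusion step can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the function name, the sample document IDs, and the smoothing constant k=60 are all assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document receives sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed smoothing constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the text engine and the vector engine.
text_hits = ["sku-3", "sku-1", "sku-7"]
vector_hits = ["sku-1", "sku-9", "sku-3"]

fused = reciprocal_rank_fusion([text_hits, vector_hits])
print(fused)
```

Documents that appear near the top of both lists (here sku-1 and sku-3) rise above documents that appear in only one, without any need to put the text and vector scores on a common scale.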
2. Core Embedding Technologies
Embedding converts unstructured data into machine‑readable vectors. Three model families are discussed:
Dense Model (e.g., Word2Vec, S-BERT, LLM-based encoders): produces dense vectors in which most dimensions are non-zero, capturing deep semantic relationships.
Sparse Model (e.g., BM25-style term weighting, SPLADE): generates high-dimensional sparse vectors with only a few non-zero entries, preserving exact term matches much like a bag-of-words approach.
Hybrid Model: outputs a dense vector and a sparse vector at the same time, combining semantic generalization with precise keyword matching. This hybrid representation is reported to outperform single-model baselines in benchmark tests.
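To make the difference between the representations concrete, the sketch below stores a dense vector as a list of floats and a sparse vector as a token-to-weight map, then scores a query against a document with each. The weighted sum in hybrid_score and its alpha parameter are assumptions for illustration only; the solution described in this article fuses results by rank rather than by a weighted score.

```python
def dense_dot(a, b):
    """Dot product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {token: weight} maps."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller map
    return sum(weight * b[token] for token, weight in a.items() if token in b)


def hybrid_score(query, doc, alpha=0.5):
    """Assumed weighted sum of the dense and sparse scores (illustrative only)."""
    dense = dense_dot(query["dense"], doc["dense"])
    sparse = sparse_dot(query["sparse"], doc["sparse"])
    return alpha * dense + (1 - alpha) * sparse


# Toy values; real embeddings have hundreds or thousands of dimensions.
query = {"dense": [0.6, 0.8], "sparse": {"hair": 1.0, "dryer": 1.5}}
doc = {"dense": [0.8, 0.6], "sparse": {"dryer": 2.0, "hotel": 0.5}}

score = hybrid_score(query, doc)
print(score)
```

Only the shared token "dryer" contributes to the sparse score, which is what preserves exact term matching, while every dimension contributes to the dense score.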
3. Vector Retrieval Fundamentals
Vector similarity is measured using several metrics:
Euclidean Distance (L2): straight-line distance in vector space; a smaller distance means higher similarity. Elasticsearch, for example, converts a distance d into a score with 1 / (1 + d^2), which maps it into the range (0, 1].
Dot Product: sum of element-wise products; when vectors are L2-normalized, the dot product equals cosine similarity.
Cosine Similarity: cosine of the angle between two vectors, ranging from -1 to 1; higher values indicate greater alignment.
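The three metrics are easy to check by hand. The following sketch uses only the Python standard library; the function names and sample vectors are chosen for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean (straight-line) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a non-zero vector to unit L2 length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def l2_score(a, b):
    """Map an L2 distance d to a score in (0, 1] via 1 / (1 + d^2)."""
    return 1.0 / (1.0 + l2_distance(a, b) ** 2)


a, b = [3.0, 4.0], [4.0, 3.0]
print(l2_distance(a, b))               # distance between the raw vectors
print(l2_score(a, b))                  # distance mapped to a bounded score
print(cosine(a, b))                    # angle-based similarity
print(dot(normalize(a), normalize(b))) # matches the cosine after normalization
```

The last two values agree, which is why many systems normalize vectors once at index time and then use the cheaper dot product at query time.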
4. Elasticsearch Vector Support
Elasticsearch now offers first-class vector capabilities:
dense_vector: stores dense vectors (float32 by default).
sparse_vector: stores high-dimensional sparse vectors efficiently.
semantic_text: an abstract type that automatically maps text to the appropriate vector representation via a configured inference model.
Inference API: calls external AI models (e.g., M2-Encoder, Qwen2-VL) at index or query time to produce vectors on the fly.
Ingest Pipeline: an inference processor (for example, one configured for text embedding) can convert incoming text fields to vectors automatically, simplifying data preparation.
KNN Search: native approximate nearest-neighbor API for fast vector similarity lookup on dense_vector fields.
Hybrid Search: combines match (text) and KNN (vector) in a single query; because the two scores are on different scales, fusing them requires careful balancing.
RRF (Reciprocal Rank Fusion): merges rankings from the text and vector recall sets by summing 1 / (k + rank) for each document across the sets, where k is a smoothing constant. Because it uses ranks rather than raw scores, it provides a robust final ordering.
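As a rough sketch of how these pieces fit together, the Python below builds an index mapping and a hybrid search request body as plain dictionaries. The field names (title, image_embedding), the parameter values, and the helper names are assumptions for illustration. The rrf retriever syntax shown is the one used by recent Elasticsearch releases, and availability on a given managed or serverless version should be checked before relying on it.

```python
import json


def product_mapping(dims):
    """Index mapping with one text field and one dense_vector field."""
    return {
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "image_embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                },
            }
        }
    }


def hybrid_search_body(query_text, query_vector, size=10):
    """Text match plus kNN recall, fused by reciprocal rank fusion."""
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    {"standard": {"query": {"match": {"title": query_text}}}},
                    {
                        "knn": {
                            "field": "image_embedding",
                            "query_vector": query_vector,
                            "k": size,
                            # More candidates per shard: better recall, more work.
                            "num_candidates": 10 * size,
                        }
                    },
                ],
                "rank_constant": 60,
                "rank_window_size": max(50, size),
            }
        },
        "size": size,
    }


# A 4-dimensional toy vector; real embeddings are much larger.
body = hybrid_search_body("red hair dryer", [0.1, 0.2, 0.3, 0.4], size=5)
print(json.dumps(body, indent=2))
```

The resulting dictionaries can be sent as JSON to the index-creation and search endpoints with any HTTP client or with the official Elasticsearch client.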
5. Performance Optimizations via Quantization
When dealing with billions of high‑dimensional vectors, memory consumption becomes a bottleneck. Two quantization techniques are highlighted:
Scalar Quantization (SQ): maps 32-bit float values to 8-bit (or 4-bit) integers, with the value range calibrated per segment, reducing memory by 4× (8-bit) to 8× (4-bit) while preserving most of the similarity information.
BBQ (Better Binary Quantization): goes further than SQ by reducing each dimension to roughly a single bit, achieving up to 95% memory reduction and enabling ten-billion-scale vector retrieval on a single Elasticsearch cluster. The trade-off is a modest recall loss, which can be mitigated by increasing the num_candidates parameter.
Example: a dataset of 10 billion 1024-dimensional float32 vectors (about 41 TB raw, or roughly 37 TiB) can be stored in ~1.8 TB after applying BBQ + HNSW indexing, shrinking the required compute nodes from 170 to 9.
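The idea behind scalar quantization, and the storage arithmetic behind figures like these, can be sketched as follows. This is a toy quantizer with a fixed, assumed value range; it is not Elasticsearch's implementation, which calibrates the range from the data. The byte counts cover the vector codes only and ignore HNSW graph and metadata overhead. Raw float32 storage is count × dimensions × 4 bytes, so roughly 37 TiB corresponds to about 10 billion 1024-dimensional vectors.

```python
def scalar_quantize(vector, lo, hi, bits=8):
    """Map floats in [lo, hi] to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    codes = []
    for x in vector:
        clipped = min(max(x, lo), hi)  # values outside the range are clamped
        codes.append(round((clipped - lo) * levels / (hi - lo)))
    return codes


def dequantize(codes, lo, hi, bits=8):
    """Approximate the original floats from their integer codes."""
    levels = (1 << bits) - 1
    return [lo + code * (hi - lo) / levels for code in codes]


def code_bytes(num_vectors, dims, bits_per_dim):
    """Bytes needed for the vector codes alone (no graph or metadata)."""
    return num_vectors * dims * bits_per_dim // 8


codes = scalar_quantize([-1.0, 0.5, 1.0], lo=-1.0, hi=1.0)
print(codes, dequantize(codes, lo=-1.0, hi=1.0))

n, dims = 10_000_000_000, 1024
for label, bits in [("float32", 32), ("int8", 8), ("int4", 4), ("1-bit", 1)]:
    size = code_bytes(n, dims, bits)
    print(f"{label}: {size / 1e12:.2f} TB ({size / 2**40:.2f} TiB)")
```

One bit per dimension alone would be a 32× reduction; the ~95% figure quoted above is somewhat lower, presumably because the stored index also carries correction values and graph structure.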
6. Best‑Practice Architecture with Alibaba Cloud
The end‑to‑end pipeline integrates two Alibaba Cloud products:
AI Search Open Platform: provides offline data services to extract product records from RDS, multi-modal vector services that invoke built-in models (e.g., M2-Encoder, Qwen2-VL), and an API to convert queries into vectors.
Elasticsearch Serverless: a fully managed, serverless Elasticsearch offering that handles indexing, vector storage, and query execution without any operational overhead.
Data flow:
Product data (ID, text, image) resides in RDS.
Offline jobs pull records, send text and images to the AI Search platform, which returns multi‑modal embeddings.
Embeddings and processed text are written to Elasticsearch Serverless using the appropriate field types (dense_vector, sparse_vector, or semantic_text).
At query time, the front‑end sends a text or image query; the AI Search platform vectorizes the query, which is then dispatched to Elasticsearch Serverless for multi‑route recall (text + vector).
Elasticsearch returns the top‑N results after RRF fusion, which are presented to the user.
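The offline indexing half of this flow can be sketched as below. The embed_product function is a hypothetical stand-in for the AI Search platform's embedding call (it returns a deterministic toy vector so the sketch runs offline), and the index and field names are assumptions. The output follows the newline-delimited JSON format of Elasticsearch's _bulk API: one action line followed by one document line per product.

```python
import json


def embed_product(text, image_bytes):
    """Hypothetical stand-in for the embedding service; returns a toy vector."""
    seed = len(text) + len(image_bytes)
    return [((seed + i) % 7) / 7.0 for i in range(4)]


def to_bulk_ndjson(index, products):
    """Build a _bulk request body: an action line, then a source line, per doc."""
    lines = []
    for product in products:
        action = {"index": {"_index": index, "_id": product["id"]}}
        source = {
            "title": product["title"],
            "image_embedding": embed_product(product["title"], product["image"]),
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical records as they might be pulled from RDS.
products = [
    {"id": "sku-1", "title": "Foldable travel hair dryer", "image": b"\x89PNG..."},
    {"id": "sku-2", "title": "Kids cartoon shorts, blue", "image": b"\xff\xd8..."},
]

payload = to_bulk_ndjson("products", products)
print(payload)
```

At query time the same embedding call would be applied to the user's text or image, and the resulting vector passed to the search request.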
7. Advantages of Elasticsearch Serverless
Zero Ops: No cluster management, version upgrades, or security patching required.
Resource-less: Users work with logical applications; capacity planning is unnecessary.
Built-in Monitoring: Out-of-the-box QPS, indexing traffic, and latency dashboards.
Pay-per-Use: Billing is measured in Compute Units (CU) per second, matching traffic spikes precisely.
Auto-Scaling: The platform automatically expands or contracts resources based on real-time load, and even adjusts index replicas and throttling thresholds.
Seamless AI Model Integration: All AI Search Open Platform models are accessible via Elasticsearch Inference API; custom models can be plugged in via simple API configuration.
Vector-Specific Optimizations: Automatic exclusion of vector fields from _source to save storage, one-click activation of int8 or BBQ quantization, and auto-pre-warming of HNSW graphs to eliminate cold-start latency.
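For comparison, the sketch below shows roughly what the vector-specific settings look like when written by hand in an open-source Elasticsearch mapping; the serverless product is described as applying equivalent settings automatically. The field name and helper name are assumptions, and the quantized index_options types depend on the Elasticsearch version in use.

```python
import json

# index_options types for dense_vector fields in recent Elasticsearch releases.
ALLOWED_INDEX_TYPES = {"hnsw", "int8_hnsw", "int4_hnsw", "bbq_hnsw"}


def quantized_vector_mapping(dims, index_type="int8_hnsw"):
    """Mapping that quantizes the vector field and keeps it out of _source."""
    if index_type not in ALLOWED_INDEX_TYPES:
        raise ValueError(f"unsupported index type: {index_type}")
    return {
        "mappings": {
            # Excluding the vector from _source saves storage, but the field
            # can then no longer be recovered from _source (e.g. for reindexing).
            "_source": {"excludes": ["image_embedding"]},
            "properties": {
                "image_embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                    "index_options": {"type": index_type},
                }
            },
        }
    }


mapping = quantized_vector_mapping(1024, "bbq_hnsw")
print(json.dumps(mapping, indent=2))
```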
8. Demo Overview
The article concludes with a video demo that walks through the complete workflow: extracting product data, generating multi‑modal embeddings via the AI Search platform, indexing them into Elasticsearch Serverless, and performing real‑time text‑or‑image queries that return accurate, ranked product results.
Overall, the combination of modern embedding techniques, efficient vector quantization, and a truly serverless Elasticsearch backend enables developers to build high‑performance, cost‑effective multi‑modal search systems without the traditional operational burden.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
