Understanding Embeddings and Vector Databases for LLM Applications
This article explains what embeddings and vector databases are and how embeddings are generated with models such as OpenAI's Ada. It covers why embeddings enable semantic search, how they help work around large language model token limits, and walks through a practical workflow for retrieving relevant document chunks using cosine similarity.
Vector databases and embeddings have become hot topics in the AI field. Companies such as Pinecone have raised significant funding, and firms like Shopify, Brex, and Hubspot already use these technologies in their AI applications.
An embedding is a multi‑dimensional array of numbers that can represent any item—text, music, video, etc. This article focuses on text embeddings.
Embeddings are created by sending text to an embedding model (e.g., OpenAI’s Ada), which returns a vector that can be stored for later use.
These vectors enable semantic search because they capture meaning, allowing similarity‑based queries such as finding related concepts like “man”, “king”, “woman”, and “queen” in a vector space.
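The "man/king/woman/queen" relationship can be made concrete with cosine similarity, the standard way of comparing two embedding vectors. Below is a minimal sketch using made-up 3-dimensional vectors (real embeddings from a model like Ada have around 1,536 dimensions); the numbers are purely illustrative:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only.
man = [1.0, 0.2, 0.0]
woman = [1.0, 0.9, 0.0]
king = [1.0, 0.2, 0.9]
queen = [1.0, 0.9, 0.9]

# The classic analogy: king - man + woman lands near queen.
analogy = [k - m + w for k, m, w in zip(king, man, woman)]
print(cosine_similarity(analogy, queen))  # very close to 1.0
print(cosine_similarity(analogy, man))    # noticeably lower
```

A similarity near 1.0 means two vectors point in almost the same direction, i.e. the texts they represent are semantically close.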
For a more intuitive illustration, imagine a child looking for similar toys (e.g., a toy car and a toy bus) based on the shared concept of transportation; this is semantic similarity.
Embeddings are especially valuable for large language models (LLMs) because LLMs have context-window limits (e.g., roughly 4k tokens for GPT-3.5 and up to 32k for GPT-4). By embedding large documents and retrieving only the most relevant chunks, we can stay within these limits.
A typical workflow is:
Split a large document (e.g., a PDF) into chunks.
Generate an embedding vector for each chunk using a model.
Store the vector and its associated text chunk in a database.
When a user asks a question, the query is also embedded, and cosine similarity is used to find the most relevant chunk vectors.
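The whole workflow above can be sketched in a few lines. To keep the example self-contained, the `embed` function below is a toy bag-of-words stand-in for a real embedding model (in practice you would call a model such as OpenAI's Ada here); the vocabulary and chunks are invented for illustration:

```python
import math
from collections import Counter

VOCAB = ["cat", "dog", "car", "bus", "road", "pet"]

def embed(text):
    # Stand-in for a real embedding model: a word-count vector
    # over a tiny fixed vocabulary.
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-3: split the document into chunks and store
# (vector, chunk) pairs.
chunks = [
    "the cat is a pet and the dog is a pet",
    "the car drives on the road",
    "the bus drives on the road",
]
store = [(embed(c), c) for c in chunks]

# Step 4: embed the query and rank chunks by cosine similarity.
def top_k(query, k=2):
    q = embed(query)
    ranked = sorted(store, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]

print(top_k("is a cat a pet"))
```

A dedicated vector database (e.g., Pinecone) plays the role of `store` here, with indexing structures that make this nearest-neighbor lookup fast at scale.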
Example data structure (simplified; each embedding vector is stored alongside its text chunk):

[
  { "vector": [1, 2, 3, 34], "text": "text chunk 1" },
  { "vector": [2, 3, 4, 56], "text": "text chunk 2" },
  { "vector": [4, 5, 8, 23], "text": "text chunk 3" },
  ...
]

After retrieving the top-k similar chunks, they are combined with a prompt and fed to the LLM, for instance:
Known context: text chunk 1, text chunk 2, text chunk 3.
User question: "What did they say about xyz?"
Please answer based on the given context.

If the LLM cannot answer from the context, it should respond honestly: "I cannot answer this question."
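Assembling the retrieved chunks and the user question into that final prompt is plain string formatting. A minimal sketch, where the template wording is illustrative rather than any fixed API:

```python
def build_prompt(chunks, question):
    # Combine the retrieved chunks and the user question into a
    # single prompt string for the LLM.
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Known context:\n"
        f"{context}\n\n"
        f'User question: "{question}"\n'
        "Please answer based only on the given context. "
        'If the context is insufficient, reply "I cannot answer this question."'
    )

prompt = build_prompt(
    ["text chunk 1", "text chunk 2", "text chunk 3"],
    "What did they say about xyz?",
)
print(prompt)
```

The instruction to admit "I cannot answer this question" is what keeps the model grounded in the retrieved context instead of guessing.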
This demonstrates how embeddings and vector search empower LLMs to provide chat‑like capabilities over arbitrary data sources, without being a form of fine‑tuning.