How Embeddings Power AI Knowledge Bases: From Theory to Practice
This article is a personal learning summary: it explains what embeddings are, how they capture semantic similarity, and how to use OpenAI's embedding API to transform a self-built knowledge base into vector representations for efficient search, retrieval, and question-answering.
What are embeddings
Embeddings capture the "relevance" of text, images, video, or other data types. They are commonly used for:
Search: How similar is a query to the main text?
Recommendation: How similar are two products?
Classification: How to classify text?
Clustering: How to identify trends?
Consider a simple example with three phrases:
"The cat chases the mouse"
"The kitten hunts rodents"
"I like ham sandwiches"
Your task is to group phrases with similar meanings. Humans can easily see that phrases 1 and 2 are almost identical in meaning, while phrase 3 is completely different.
Even though phrases 1 and 2 share no common words (except "the"), we need a way for a computer to understand their semantic similarity.
Human language
Humans use words and symbols to convey meaning, but isolated words often lack meaning without shared knowledge and experience. For example, the phrase "you should Google it" only makes sense if you know that Google is a search engine and is used as a verb.
Similarly, we need to train neural‑network models to understand human language. An effective model is trained on millions of examples to learn what each word, phrase, sentence, or paragraph may mean in different contexts.
How does this relate to embeddings?
How embeddings work
Embeddings compress discrete information (words, symbols) into continuous, distributed vectors. Imagine plotting the earlier phrases as points on a plane: phrases 1 and 2 would sit close together because their meanings are similar, while phrase 3 would lie far away because it is unrelated. A fourth phrase like "Sally ate Swiss cheese" would fall somewhere between phrase 3 (cheese can go on a sandwich) and phrase 1 (mice like Swiss cheese).
In this example we only have two dimensions (X and Y), but in practice many more dimensions are needed to capture the complexity of human language.
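To make "closeness" concrete, here is a toy illustration using made-up 2-D vectors for the three example phrases (the coordinates are invented for the sake of the sketch; real embeddings have hundreds or thousands of dimensions). Cosine similarity is a common way to compare such vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Invented 2-D "embeddings" for the three phrases
cat      = (0.9, 0.8)    # "The cat chases the mouse"
kitten   = (0.85, 0.75)  # "The kitten hunts rodents"
sandwich = (0.1, -0.6)   # "I like ham sandwiches"

print(cosine_similarity(cat, kitten))    # close to 1.0: near-identical meaning
print(cosine_similarity(cat, sandwich))  # much lower: unrelated meaning
```

The first two phrases point in almost the same direction, so their similarity is near 1, while the sandwich phrase scores far lower.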
OpenAI embeddings
OpenAI provides an API that generates embeddings for any text string using its language models. You supply any text (blog posts, documents, a company knowledge base) and receive a floating-point vector that represents the text's "meaning". Compared to the 2-D example, OpenAI's latest embedding model, text-embedding-ada-002, outputs vectors with 1536 dimensions.
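A minimal sketch of calling the embeddings endpoint, assuming the `openai` Python package (the v0.x interface) and an `OPENAI_API_KEY` environment variable; the model name comes from the article, everything else is illustrative:

```python
import os

def embed(text: str) -> list[float]:
    """Return the embedding vector for `text` (1536 dimensions for ada-002)."""
    import openai  # imported lazily so the sketch loads without the package installed
    openai.api_key = os.environ["OPENAI_API_KEY"]
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return resp["data"][0]["embedding"]

# Only attempt the network call when a key is actually configured
if os.environ.get("OPENAI_API_KEY"):
    vec = embed("The cat chases the mouse")
    print(len(vec))  # 1536 for text-embedding-ada-002
```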
Why use embeddings? The OpenAI text-davinci-003 model has a roughly 4,000-token limit covering both the prompt and the completion, so the entire prompt, including any knowledge-base content, must fit within that limit. If you want GPT-3 to answer questions based on a custom knowledge base, repeatedly sending the whole knowledge base would exceed the limit and waste tokens.
The typical solution is to generate embeddings for the knowledge‑base documents once, store them in a database, and then perform a two‑stage process at query time:
Query the embedding database to find the most relevant documents for the user's question.
Inject those documents as context into GPT‑3 so it can reference them in its answer.
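The first stage above boils down to a nearest-neighbor search over the stored vectors. A minimal sketch with toy 2-D placeholders (document names and coordinates are invented; a real system would query a vector database instead of a Python list):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, docs, k=2):
    """docs: list of (doc_id, embedding) pairs stored at build time."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy stored embeddings for three knowledge-base pages
docs = [
    ("pricing.md", [0.1, 0.9]),
    ("setup.md",   [0.8, 0.2]),
    ("faq.md",     [0.5, 0.5]),
]
print(top_k([0.9, 0.1], docs))  # ['setup.md', 'faq.md']
```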
This approach lets OpenAI return not only the existing documents but also synthesize a coherent answer that blends the retrieved information.
A possible workflow looks like this:
Pre‑process the knowledge base and generate embeddings for each document page.
Store the embeddings for later use.
Build a search page that prompts the user for input.
When a user submits a question, generate a one‑time embedding for the query and perform a similarity search against the stored embeddings.
Submit the retrieved embeddings and the user’s question to ChatGPT, then stream the final response back to the client.
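The final step of the workflow, injecting the retrieved documents into the completion prompt, might look like this; the template wording and function names are illustrative, not taken from any particular project:

```python
def build_prompt(question: str, context_sections: list[str]) -> str:
    """Assemble a completion prompt that grounds the answer in retrieved docs."""
    context = "\n---\n".join(context_sections)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "How do I reset my password?",
    ["Passwords can be reset from the account settings page."],
)
print(prompt)
```

The assembled string is then sent to the completion endpoint, which streams the answer back to the client.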
Real‑world embedding applications
The open‑source data‑processing platform Supabase offers AI search functionality in its documentation (see https://supabase.com/docs).
There is also an open-source project, nextjs-openai-doc-search, that lets you quickly set up an AI-powered documentation search; all components are open source and can be deployed with a single click.
The project processes Markdown documents, generates embeddings during each deployment, and performs similarity search at query time. Its implementation details are:
[Build] Pre-process the knowledge base (.mdx files) and generate embeddings.
[Build] Store embeddings in Postgres using pgvector.
[Run] Perform vector similarity search to find relevant content.
[Run] Inject the content into an OpenAI GPT-3 completion prompt and stream the response to the client.
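With pgvector, the similarity search in the [Run] step is a single SQL query ordered by a distance operator. A hypothetical sketch that only builds the query string (the table and column names `documents` and `embedding` are assumptions, not taken from the project's schema):

```python
def match_sections_sql(limit: int = 5) -> str:
    """Build a pgvector similarity query; <=> is pgvector's cosine-distance operator."""
    return (
        "select id, content "
        "from documents "
        "order by embedding <=> %(query_embedding)s "
        f"limit {limit};"
    )

print(match_sections_sql())
```

The query embedding is passed as a parameter at runtime, and the rows with the smallest cosine distance come back first.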
Steps 1 and 2 happen at build time (e.g., when Vercel builds your Next.js app). A generate-embeddings script carries out these tasks.
Steps 3 and 4 occur at runtime when a user submits a question.
Another project, deno-fresh-openai-doc-search, performs the same embedding generation during CI/CD.
Conclusion
By converting text into embedding vectors, you can automatically classify and tag documents in a self‑built knowledge base, organize files and resources more effectively, and combine embeddings with OpenAI's Q&A system to retrieve relevant information and generate answers, thereby enhancing knowledge accessibility and utilization.
Related documentation
https://platform.openai.com/docs/guides/embeddings
https://supabase.com/blog/openai-embeddings-postgres-vector
https://supabase.com/blog/chatgpt-supabase-docs
https://github.com/supabase-community/nextjs-openai-doc-search
https://github.com/supabase-community/deno-fresh-openai-doc-search
KooFE Frontend Team