How to Build Word Vectors from Scratch: A Step‑by‑Step Guide
This article explains the fundamentals of word vectors in NLP, walks through constructing them via co‑occurrence matrices and dimensionality reduction, demonstrates the process with a concrete example and Python code, and evaluates the resulting embeddings using cosine similarity.
Large language models have surged in popularity in recent years, but they build on a much older idea: word vectors, a fundamental technique in natural language processing (NLP) that represents words as multi-dimensional vectors in order to capture semantic information.
Word vectors rely on the idea that a word’s meaning can be inferred from its surrounding words (“You shall know a word by the company it keeps”).
Basic Concept of Word Vectors
Word vectors map words into a semantic space, allowing mathematical handling of similarity and relationships; the core idea is that surrounding words reveal a word’s meaning.
Mathematical Model
The construction process based on co‑occurrence consists of four steps:
Corpus preparation: collect a large text corpus from which co-occurrence counts are taken.
Context window: choose a window size (typically 5-10 words) that defines which neighboring words count as context.
Co-occurrence matrix: build a matrix whose rows are target words, whose columns are context words, and whose entries are co-occurrence frequencies.
Dimensionality reduction: apply an algorithm such as SVD or PCA to compress the high-dimensional matrix into low-dimensional word vectors.
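As a minimal illustration of the context-window step, the sketch below lists each target word together with its context words for a window of size 2 (it assumes lowercase, whitespace-tokenized text; the variable names are illustrative, not from the article):

```python
# Minimal sketch of the context-window step (assumes lowercase,
# whitespace-tokenized text; window = 2 words on each side).
tokens = "i love machine learning".split()
window = 2

pairs = []
for i, target in enumerate(tokens):
    # Clamp the window at the sentence boundaries
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    context = [tokens[j] for j in range(lo, hi) if j != i]
    pairs.append((target, context))

for target, context in pairs:
    print(target, "->", context)
```

Every (target, context) pair produced here would increment one cell of the co-occurrence matrix in the next step.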
Case Study
A small corpus is used to illustrate the steps.
Corpus preparation: the corpus consists of three sentences: <code>I love machine learning. Machine learning is fun. I love coding.</code> The vocabulary is I, love, machine, learning, is, fun, coding.
Context window: a window size of 2 (two words on each side of the target) is chosen.
Co-occurrence matrix: frequencies of word pairs appearing within the window are counted.
Dimensionality reduction: Singular Value Decomposition (SVD) compresses the matrix to two dimensions, yielding one 2-D vector per vocabulary word (the <code>word_vectors</code> matrix used in the evaluation code below).
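The construction steps for this toy corpus can be sketched as follows. This is an illustrative reimplementation, not the article's exact code: SVD components are unique only up to sign and ordering, so the resulting coordinates may differ in sign from the 2-D vectors quoted in the article. Tokenization (lowercasing, punctuation stripped) is an assumption.

```python
import numpy as np

# Sketch of steps 1-4 on the toy corpus (assumed tokenization:
# lowercase, punctuation stripped, whitespace split).
sentences = [
    "i love machine learning",
    "machine learning is fun",
    "i love coding",
]
vocab = ["i", "love", "machine", "learning", "is", "fun", "coding"]
index = {w: i for i, w in enumerate(vocab)}
window = 2

# Step 3: count co-occurrences within the window, sentence by sentence
M = np.zeros((len(vocab), len(vocab)))
for sentence in sentences:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                M[index[word], index[tokens[j]]] += 1

# Step 4: truncated SVD keeps the top-2 singular directions
U, S, Vt = np.linalg.svd(M)
word_vectors_2d = U[:, :2] * S[:2]   # one 2-D vector per vocabulary word
print(word_vectors_2d.shape)
```

Note that the co-occurrence matrix comes out symmetric here because every pair within the window is counted in both directions.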
To evaluate the vectors, cosine similarity between word pairs is computed:
<code>import numpy as np

# Reduced (2-D) word-vector matrix produced by the SVD step
word_vectors = np.array([
    [1.51499668, -1.4173672],
    [1.87698946, 1.68604424],
    [1.66865789, 0.19649234],
    [1.49526816, -0.93713897],
    [1.08538304, 0.17374271],
    [0.72099204, 0.30320536],
    [0.52440038, -0.66966223]
])
words = ["I", "love", "machine", "learning", "is", "fun", "coding"]

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors."""
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

pairs = [("love", "coding"), ("machine", "learning"), ("is", "fun")]
for pair in pairs:
    idx1 = words.index(pair[0])
    idx2 = words.index(pair[1])
    similarity = cosine_similarity(word_vectors[idx1], word_vectors[idx2])
    print(f"'{pair[0]}' and '{pair[1]}' cosine similarity: {similarity:.4f}")
</code>
The results show high similarity for "machine" and "learning" (0.7794) and for "is" and "fun" (0.9715), while "love" and "coding" have low similarity (-0.0675), reflecting the very small corpus.
The article suggests further study of neural-network-based methods such as Word2Vec, GloVe, and FastText, Transformer-based contextual models such as BERT, and techniques such as t-SNE or PCA for visualizing embeddings.
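For the visualization step, the PCA projection can be done with plain NumPy before handing 2-D coordinates to a plotting library. A minimal sketch follows; the 5-D embeddings here are random stand-ins, not trained vectors:

```python
import numpy as np

# PCA projection of embeddings to 2-D for plotting (sketch; the
# embeddings below are random stand-ins, not trained vectors).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(7, 5))       # 7 "words", 5-D each

# Center the data, then project onto the top-2 principal components
centered = embeddings - embeddings.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ Vt[:2].T               # (7, 2) scatter-plot coordinates
print(coords.shape)
```

The resulting <code>coords</code> array can be passed directly to a scatter-plot call, with each point labeled by its word.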
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".