How to Build Word Vectors from Scratch: A Step‑by‑Step Guide
This article explains the fundamentals of word vectors in NLP, walks through constructing them via co‑occurrence matrices and dimensionality reduction, demonstrates the process with a concrete example and Python code, and evaluates the resulting embeddings using cosine similarity.
Large language models have surged in popularity in recent years, but they build on a much older idea: word vectors, a fundamental technique in natural language processing (NLP) that represents words as multi-dimensional vectors in order to capture semantic information.
Word vectors rely on the idea that a word’s meaning can be inferred from its surrounding words (“You shall know a word by the company it keeps”).
Basic Concept of Word Vectors
Word vectors map words into a semantic space, allowing mathematical handling of similarity and relationships; the core idea is that surrounding words reveal a word’s meaning.
Mathematical Model
The construction process based on co‑occurrence consists of four steps:
Corpus preparation: collect a large text corpus from which co-occurrence counts are taken.
Context window: choose a window size (typically 5-10 words) that defines which neighboring words count as context.
Co-occurrence matrix: build a matrix whose rows are target words, whose columns are context words, and whose entries are co-occurrence frequencies.
Dimensionality reduction: apply an algorithm such as SVD or PCA to compress the high-dimensional matrix into low-dimensional word vectors.
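As a minimal illustration of the context-window step, the sketch below lists each target word together with its context words for a window of size 2 (it assumes lowercase, whitespace-tokenized text; the variable names are illustrative, not from the article):

```python
# Minimal sketch of the context-window step (assumes lowercase,
# whitespace-tokenized text; window = 2 words on each side).
tokens = "i love machine learning".split()
window = 2

pairs = []
for i, target in enumerate(tokens):
    # Clamp the window at the sentence boundaries
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    context = [tokens[j] for j in range(lo, hi) if j != i]
    pairs.append((target, context))

for target, context in pairs:
    print(target, "->", context)
```

Every (target, context) pair produced here would increment one cell of the co-occurrence matrix in the next step.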
Case Study
A small corpus is used to illustrate the steps.
Corpus preparation: the corpus consists of three sentences: <code>I love machine learning. Machine learning is fun. I love coding.</code> The vocabulary is I, love, machine, learning, is, fun, coding.
Context window: a window size of 2 (two words on each side of the target) is chosen.
Co-occurrence matrix: frequencies of word pairs appearing within the window are counted.
Dimensionality reduction: Singular Value Decomposition (SVD) compresses the matrix to two dimensions, yielding one 2-D vector per vocabulary word (the <code>word_vectors</code> matrix used in the evaluation code below).
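The construction steps for this toy corpus can be sketched as follows. This is an illustrative reimplementation, not the article's exact code: SVD components are unique only up to sign and ordering, so the resulting coordinates may differ in sign from the 2-D vectors quoted in the article. Tokenization (lowercasing, punctuation stripped) is an assumption.

```python
import numpy as np

# Sketch of steps 1-4 on the toy corpus (assumed tokenization:
# lowercase, punctuation stripped, whitespace split).
sentences = [
    "i love machine learning",
    "machine learning is fun",
    "i love coding",
]
vocab = ["i", "love", "machine", "learning", "is", "fun", "coding"]
index = {w: i for i, w in enumerate(vocab)}
window = 2

# Step 3: count co-occurrences within the window, sentence by sentence
M = np.zeros((len(vocab), len(vocab)))
for sentence in sentences:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                M[index[word], index[tokens[j]]] += 1

# Step 4: truncated SVD keeps the top-2 singular directions
U, S, Vt = np.linalg.svd(M)
word_vectors_2d = U[:, :2] * S[:2]   # one 2-D vector per vocabulary word
print(word_vectors_2d.shape)
```

Note that the co-occurrence matrix comes out symmetric here because every pair within the window is counted in both directions.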
To evaluate the vectors, cosine similarity between word pairs is computed:
<code>import numpy as np

# Reduced (2-D) word-vector matrix produced by the SVD step
word_vectors = np.array([
    [1.51499668, -1.4173672],
    [1.87698946, 1.68604424],
    [1.66865789, 0.19649234],
    [1.49526816, -0.93713897],
    [1.08538304, 0.17374271],
    [0.72099204, 0.30320536],
    [0.52440038, -0.66966223]
])
words = ["I", "love", "machine", "learning", "is", "fun", "coding"]

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors."""
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

pairs = [("love", "coding"), ("machine", "learning"), ("is", "fun")]
for pair in pairs:
    idx1 = words.index(pair[0])
    idx2 = words.index(pair[1])
    similarity = cosine_similarity(word_vectors[idx1], word_vectors[idx2])
    print(f"'{pair[0]}' and '{pair[1]}' cosine similarity: {similarity:.4f}")
</code>
The results show high similarity for "machine" and "learning" (0.7794) and for "is" and "fun" (0.9715), while "love" and "coding" have low similarity (-0.0675), reflecting the very small corpus.
The article suggests further study of neural-network-based methods such as Word2Vec, GloVe, and FastText, Transformer-based contextual models such as BERT, and techniques such as t-SNE or PCA for visualizing embeddings.
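For the visualization step, the PCA projection can be done with plain NumPy before handing 2-D coordinates to a plotting library. A minimal sketch follows; the 5-D embeddings here are random stand-ins, not trained vectors:

```python
import numpy as np

# PCA projection of embeddings to 2-D for plotting (sketch; the
# embeddings below are random stand-ins, not trained vectors).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(7, 5))       # 7 "words", 5-D each

# Center the data, then project onto the top-2 principal components
centered = embeddings - embeddings.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ Vt[:2].T               # (7, 2) scatter-plot coordinates
print(coords.shape)
```

The resulting <code>coords</code> array can be passed directly to a scatter-plot call, with each point labeled by its word.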
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".