Common Distance and Similarity Measures in Machine Learning and Data Mining
This article reviews the most frequently used distance and similarity formulas in machine learning and data mining, covering their definitions, mathematical properties, and practical examples, and noting when each metric is appropriate for measuring differences between data points or probability distributions.
In machine learning and data mining, quantifying the difference between individual data points is essential for similarity assessment, classification, and clustering; a distance function d(x,y) must satisfy non‑negativity, identity of indiscernibles, symmetry, and the triangle inequality.
The article enumerates a comprehensive list of common metrics, including Minkowski, Euclidean, Manhattan, Chebyshev, Mahalanobis, cosine similarity, Pearson correlation, Hamming, Jaccard, edit (Levenshtein), Dynamic Time Warping (DTW), and KL divergence.
1. Minkowski Distance
The Minkowski distance generalizes the Euclidean (p=2) and Manhattan (p=1) distances; as p approaches infinity it becomes the Chebyshev distance. When p<1 the function violates the triangle inequality, limiting its use. Data scaling (z-score standardization) is often required before applying this metric so that no single dimension dominates.
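The family described above can be sketched in a few lines; this is a minimal illustration (the function name `minkowski` is ours, not from the article), showing how p=1, p=2, and p=∞ recover Manhattan, Euclidean, and Chebyshev distances:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance: p=1 -> Manhattan, p=2 -> Euclidean, p=inf -> Chebyshev."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if np.isinf(p):
        return float(np.max(np.abs(x - y)))       # limit p -> inf
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, 1))       # Manhattan: 7.0
print(minkowski(a, b, 2))       # Euclidean: 5.0
print(minkowski(a, b, np.inf))  # Chebyshev: 4.0
```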
2. Mahalanobis Distance
Mahalanobis distance removes correlations and differing scales between dimensions: d(x,y) = sqrt((x−y)ᵀ Σ⁻¹ (x−y)), where Σ is the covariance matrix. Equivalently, whitening the data (e.g., via a Cholesky factorization of the inverse covariance matrix) transforms the space so that ordinary Euclidean distance in the whitened space accounts for the data distribution.
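A direct sketch of the definition above (the helper name `mahalanobis` is ours; with an identity covariance the metric reduces to plain Euclidean distance):

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance sqrt((x-y)^T Sigma^-1 (x-y))."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

# With identity covariance, this is just Euclidean distance.
print(mahalanobis([0, 0], [3, 4], np.eye(2)))  # 5.0

# With unequal variances, the large-variance dimension counts for less.
cov = np.array([[4.0, 0.0],
                [0.0, 1.0]])
print(mahalanobis([0, 0], [3, 4], cov))  # sqrt(9/4 + 16) ≈ 4.27
```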
3. Vector Inner Product and Cosine Similarity
The inner product measures similarity based on magnitude and direction; normalizing by vector lengths yields cosine similarity, which is invariant to vector magnitude and widely used in document and image similarity.
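Cosine similarity is just the inner product normalized by the two vector lengths; a minimal sketch (the function name is ours) demonstrating its invariance to magnitude:

```python
import numpy as np

def cosine_similarity(x, y):
    """Inner product normalized by vector lengths: cos of the angle between x and y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(cosine_similarity([1, 0], [2, 0]))  # 1.0: same direction, different magnitude
print(cosine_similarity([1, 0], [0, 1]))  # 0.0: orthogonal vectors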
4. Pearson Correlation Coefficient
The Pearson correlation coefficient is invariant to both translation and scaling; it measures the linear relationship between two variables and is commonly applied in recommendation systems.
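Pearson correlation is cosine similarity applied to mean-centered variables, which is what gives it translation invariance; a minimal sketch (function name ours) verifying both invariances:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: cosine similarity of the mean-centered variables."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Shifting and scaling do not change the correlation: corr(x, 2x + 10) = 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
print(pearson(x, 2 * x + 10))  # ≈ 1.0
```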
5. Hamming Distance and Jaccard Similarity
Hamming distance counts differing positions in equal‑length strings, while Jaccard similarity evaluates the overlap of sets (e.g., user‑item interactions) by dividing the size of the intersection by the size of the union.
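Both measures reduce to a few lines; a minimal sketch (function names and the example sets are ours):

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    assert len(s) == len(t), "Hamming distance requires equal-length strings"
    return sum(a != b for a, b in zip(s, t))

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

print(hamming("karolin", "kathrin"))                     # 3 differing positions
print(jaccard({"item1", "item2"}, {"item2", "item3"}))   # 1/3: one shared item of three total
```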
6. Edit (Levenshtein) Distance
Edit distance computes the minimum number of insertions, deletions, or substitutions required to transform one string into another, solved via dynamic programming.
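The dynamic program fills a table where cell (i, j) holds the edit distance between the first i characters of one string and the first j of the other; a minimal sketch:

```python
def levenshtein(s, t):
    """Minimum insertions, deletions, or substitutions to turn s into t (DP)."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                     # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                     # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```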
7. Dynamic Time Warping (DTW)
DTW aligns sequences that may be out of phase in time or speed, finding the optimal warping path while preserving order, also solved by dynamic programming.
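The DTW recurrence mirrors edit distance: each cell accumulates a local cost plus the cheapest of three order-preserving moves. A minimal sketch for 1-D sequences (function name and example sequences are ours):

```python
import math

def dtw(a, b):
    """Dynamic Time Warping distance between two 1-D sequences (DP)."""
    m, n = len(a), len(b)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],        # stretch b against a[i]
                                 D[i][j - 1],        # stretch a against b[j]
                                 D[i - 1][j - 1])    # advance both
    return D[m][n]

# The second sequence is a time-stretched copy of the first, so DTW cost is 0
# even though the sequences have different lengths.
print(dtw([1, 2, 3], [1, 1, 2, 2, 3, 3]))  # 0.0
```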
8. KL Divergence (Relative Entropy)
KL divergence measures the extra coding length incurred when using an approximate distribution q(x) instead of the true distribution p(x); it is asymmetric (D(p‖q) ≠ D(q‖p)) and underlies the cross-entropy losses used with softmax classifiers and logistic regression.
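A minimal sketch of discrete KL divergence (function name ours; terms with p(x)=0 contribute zero by convention), which also makes the asymmetry easy to check numerically:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)); requires q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                          # 0 * log(0) is taken as 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))   # positive since p != q
print(kl_divergence(q, p))   # a different value: KL is asymmetric
print(kl_divergence(p, p))   # 0.0 when the distributions match
```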
Additional related measures such as Chi‑Square test, mutual information, Spearman’s rank coefficient, Earth Mover’s Distance, and SimRank are mentioned for further exploration.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.