Common Distance and Similarity Measures in Machine Learning and Data Mining
This article reviews the most frequently used distance and similarity formulas in machine learning and data mining, covering their definitions, mathematical properties, and practical examples, and noting when each metric is appropriate for measuring differences between data points or probability distributions.
In machine learning and data mining, quantifying the difference between individual data points is essential for similarity assessment, classification, and clustering; a distance function d(x,y) must satisfy non‑negativity, identity of indiscernibles, symmetry, and the triangle inequality.
The article enumerates a comprehensive list of common metrics, including Minkowski, Euclidean, Manhattan, Chebyshev, Mahalanobis, cosine similarity, Pearson correlation, Hamming, Jaccard, edit (Levenshtein), Dynamic Time Warping (DTW), and KL divergence.
1. Minkowski Distance
The Minkowski distance generalizes the Euclidean (p=2) and Manhattan (p=1) distances; as p approaches infinity it becomes the Chebyshev distance. When p<1 the function violates the triangle inequality, limiting its use. Data scaling (z-score standardization) is often required before applying this metric so that no single dimension dominates.
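The family described above can be sketched in a few lines; this is a minimal illustration (the function name `minkowski` is ours, not from the article), showing how p=1, p=2, and p=∞ recover Manhattan, Euclidean, and Chebyshev distances:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance: p=1 -> Manhattan, p=2 -> Euclidean, p=inf -> Chebyshev."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if np.isinf(p):
        return float(np.max(np.abs(x - y)))       # limit p -> inf
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, 1))       # Manhattan: 7.0
print(minkowski(a, b, 2))       # Euclidean: 5.0
print(minkowski(a, b, np.inf))  # Chebyshev: 4.0
```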
2. Mahalanobis Distance
Mahalanobis distance removes correlations and differing scales between dimensions: d(x,y) = sqrt((x−y)ᵀ Σ⁻¹ (x−y)), where Σ is the covariance matrix. Equivalently, whitening the data (e.g., via a Cholesky factorization of the inverse covariance matrix) transforms the space so that ordinary Euclidean distance in the whitened space accounts for the data distribution.
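A direct sketch of the definition above (the helper name `mahalanobis` is ours; with an identity covariance the metric reduces to plain Euclidean distance):

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance sqrt((x-y)^T Sigma^-1 (x-y))."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

# With identity covariance, this is just Euclidean distance.
print(mahalanobis([0, 0], [3, 4], np.eye(2)))  # 5.0

# With unequal variances, the large-variance dimension counts for less.
cov = np.array([[4.0, 0.0],
                [0.0, 1.0]])
print(mahalanobis([0, 0], [3, 4], cov))  # sqrt(9/4 + 16) ≈ 4.27
```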
3. Vector Inner Product and Cosine Similarity
The inner product measures similarity based on magnitude and direction; normalizing by vector lengths yields cosine similarity, which is invariant to vector magnitude and widely used in document and image similarity.
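Cosine similarity is just the inner product normalized by the two vector lengths; a minimal sketch (the function name is ours) demonstrating its invariance to magnitude:

```python
import numpy as np

def cosine_similarity(x, y):
    """Inner product normalized by vector lengths: cos of the angle between x and y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(cosine_similarity([1, 0], [2, 0]))  # 1.0: same direction, different magnitude
print(cosine_similarity([1, 0], [0, 1]))  # 0.0: orthogonal vectors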
4. Pearson Correlation Coefficient
The Pearson correlation coefficient is invariant to both translation and scaling; it measures the linear relationship between two variables and is commonly applied in recommendation systems.
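Pearson correlation is cosine similarity applied to mean-centered variables, which is what gives it translation invariance; a minimal sketch (function name ours) verifying both invariances:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: cosine similarity of the mean-centered variables."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Shifting and scaling do not change the correlation: corr(x, 2x + 10) = 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
print(pearson(x, 2 * x + 10))  # ≈ 1.0
```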
5. Hamming Distance and Jaccard Similarity
Hamming distance counts differing positions in equal‑length strings, while Jaccard similarity evaluates the overlap of sets (e.g., user‑item interactions) by dividing the size of the intersection by the size of the union.
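Both measures reduce to a few lines; a minimal sketch (function names and the example sets are ours):

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    assert len(s) == len(t), "Hamming distance requires equal-length strings"
    return sum(a != b for a, b in zip(s, t))

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

print(hamming("karolin", "kathrin"))                     # 3 differing positions
print(jaccard({"item1", "item2"}, {"item2", "item3"}))   # 1/3: one shared item of three total
```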
6. Edit (Levenshtein) Distance
Edit distance computes the minimum number of insertions, deletions, or substitutions required to transform one string into another, solved via dynamic programming.
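The dynamic program fills a table where cell (i, j) holds the edit distance between the first i characters of one string and the first j of the other; a minimal sketch:

```python
def levenshtein(s, t):
    """Minimum insertions, deletions, or substitutions to turn s into t (DP)."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                     # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                     # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```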
7. Dynamic Time Warping (DTW)
DTW aligns sequences that may be out of phase in time or speed, finding the optimal warping path while preserving order, also solved by dynamic programming.
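The DTW recurrence mirrors edit distance: each cell accumulates a local cost plus the cheapest of three order-preserving moves. A minimal sketch for 1-D sequences (function name and example sequences are ours):

```python
import math

def dtw(a, b):
    """Dynamic Time Warping distance between two 1-D sequences (DP)."""
    m, n = len(a), len(b)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],        # stretch b against a[i]
                                 D[i][j - 1],        # stretch a against b[j]
                                 D[i - 1][j - 1])    # advance both
    return D[m][n]

# The second sequence is a time-stretched copy of the first, so DTW cost is 0
# even though the sequences have different lengths.
print(dtw([1, 2, 3], [1, 1, 2, 2, 3, 3]))  # 0.0
```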
8. KL Divergence (Relative Entropy)
KL divergence measures the extra coding length incurred when using an approximate distribution q(x) instead of the true distribution p(x); it is asymmetric (D(p‖q) ≠ D(q‖p)) and underlies the cross-entropy losses used with softmax classifiers and logistic regression.
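A minimal sketch of discrete KL divergence (function name ours; terms with p(x)=0 contribute zero by convention), which also makes the asymmetry easy to check numerically:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)); requires q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                          # 0 * log(0) is taken as 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))   # positive since p != q
print(kl_divergence(q, p))   # a different value: KL is asymmetric
print(kl_divergence(p, p))   # 0.0 when the distributions match
```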
Additional related measures such as Chi‑Square test, mutual information, Spearman’s rank coefficient, Earth Mover’s Distance, and SimRank are mentioned for further exploration.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.