
Scalable Time Series Similarity Search in Big Data: Partitioning, Dimensionality Reduction, and LSH Approaches

This article examines the challenges of performing time‑series similarity queries on massive datasets and presents three scalable solutions—partition‑based indexing, dimensionality‑reduction using MinHash, and a combined approach with Locality Sensitive Hashing—to reduce computation while preserving similarity accuracy.

JD Tech Talk

Time‑series data, sequences of values recorded over time, is ubiquitous and has exploded in volume with the rise of electronic monitoring and wearable devices. Large‑scale systems such as a 600 MW power plant generate tens of thousands of sensor readings per second.

A time‑series similarity query aims to retrieve the k sequences most similar to a given query q. While a brute‑force scan works for small datasets, it becomes infeasible for big data due to two main challenges: (1) the sheer number of sequences makes pairwise similarity computation prohibitively expensive, and (2) high dimensionality further increases the cost of each comparison.

To address the first challenge, the article proposes placing similar sequences into the same partition so that only the relevant partition needs to be scanned. For the second challenge, dimensionality reduction techniques are explored to lower computational cost while preserving similarity.

1. Problem Definition

A time‑series X of length n is represented as <x₁, x₂, …, xₙ> with binary values and equal time intervals. The database D contains many such series. Similarity is measured by the Jaccard index: for two series A and B, let a and b be the counts of non‑zero entries in A and B respectively, and c the count of positions where both are non‑zero; then Jaccard(A, B) = c / (a + b − c).
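As a minimal sketch, the Jaccard index over binary series can be computed directly from the definition above (the function name is illustrative, not from the article):

```python
def jaccard(a, b):
    """Jaccard index of two equal-length binary series.

    c = positions non-zero in both; (a + b - c) = positions non-zero in either.
    """
    assert len(a) == len(b)
    both = sum(1 for x, y in zip(a, b) if x and y)    # c
    either = sum(1 for x, y in zip(a, b) if x or y)   # a + b - c
    return both / either if either else 0.0

# jaccard([1, 0, 1, 1], [1, 1, 0, 1]) -> 2 / 4 = 0.5
```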

2. Partition‑Based Solution

The first idea is to partition the data space. A naïve uniform grid (splitting an n‑dimensional hypercube) can cause data hotspots. An improved method uses data‑aware partitioning, such as KD‑tree‑based splits, to ensure each cell contains roughly equal amounts of data. After partitioning, a query first determines its cell and only scans that partition, drastically reducing the amount of data examined. This yields an approximate result because the true top‑ k may reside in other partitions; additional filters can be applied when exact results are required.
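A data‑aware KD‑tree split can be sketched as a recursive median split along alternating dimensions, so that each leaf partition ends up with roughly the same number of series (names and leaf size here are hypothetical, not from the article):

```python
def kd_partition(points, depth=0, leaf_size=2):
    """Recursively split the data at the median of alternating dimensions.

    Returns nested (left, right) tuples; each leaf is a list of points
    holding roughly equal amounts of data, avoiding hotspots.
    """
    if len(points) <= leaf_size:
        return points  # leaf: one partition, scanned as a unit at query time
    dim = depth % len(points[0])               # cycle through dimensions
    points = sorted(points, key=lambda p: p[dim])
    mid = len(points) // 2                     # median split -> balanced sides
    return (kd_partition(points[:mid], depth + 1, leaf_size),
            kd_partition(points[mid:], depth + 1, leaf_size))
```

At query time, a query descends the same splits to find its leaf and scans only that partition, which is why the result is approximate: true top‑k neighbors may sit just across a split boundary.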

3. Dimensionality‑Reduction Solution

Reducing dimensionality while preserving similarity is tackled with the MinHash algorithm. For binary series A and B , a random permutation of indices is applied, and the first non‑zero index after permutation becomes the MinHash value. Repeating this process m times (with m ≪ n ) yields a signature vector [h₁, h₂, …, hₘ] for each series. The probability that two signatures match in a given position equals the Jaccard similarity of the original series, allowing similarity estimation in the low‑dimensional signature space.
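The MinHash construction above can be sketched with random permutations; the fraction of matching signature positions then estimates the Jaccard index (function names and the seed parameter are illustrative assumptions):

```python
import random

def minhash_signature(series, num_hashes=8, seed=0):
    """MinHash sketch of a binary series.

    For each of num_hashes random permutations of the indices, record the
    smallest permuted rank among the series' non-zero positions.
    """
    n = len(series)
    rng = random.Random(seed)
    sig = []
    for _ in range(num_hashes):
        perm = list(range(n))
        rng.shuffle(perm)                     # one random permutation
        sig.append(min(perm[i] for i in range(n) if series[i]))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

With m ≪ n signature entries, each comparison now costs O(m) instead of O(n), at the price of a sampling error that shrinks as m grows.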

4. Combined Partition and Dimensionality‑Reduction Solution

Even after MinHash, a full pairwise scan is still needed. By further hashing the signature vectors into buckets (Locality Sensitive Hashing, LSH), similar signatures are likely to fall into the same bucket. Each bucket acts as a partition; queries first locate the bucket of the query signature and only compare against candidates within that bucket, dramatically cutting down the number of similarity calculations. LSH is probabilistic, so it may produce false positives (dissimilar items in the same bucket) or false negatives (similar items in different buckets), but the trade‑off is acceptable for large‑scale scenarios.
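The bucketing step can be sketched with the standard LSH banding trick: split each signature into bands and hash each band, so two series become candidates if any band matches exactly (helper names and the band size are assumptions, not from the article):

```python
from collections import defaultdict

def lsh_buckets(signatures, band_size=2):
    """Hash each band of each signature into a bucket.

    signatures: {series_id: signature_vector}. Similar signatures are
    likely to share at least one (band_index, band_value) bucket.
    """
    buckets = defaultdict(list)
    for sid, sig in signatures.items():
        for start in range(0, len(sig), band_size):
            band = tuple(sig[start:start + band_size])
            buckets[(start, band)].append(sid)
    return buckets

def candidates(query_sig, buckets, band_size=2):
    """Collect ids sharing at least one band bucket with the query."""
    cand = set()
    for start in range(0, len(query_sig), band_size):
        band = tuple(query_sig[start:start + band_size])
        cand.update(buckets.get((start, band), []))
    return cand
```

Only the candidate set is re‑checked with exact similarity, which is where the false positives are filtered out; false negatives (similar series that never share a bucket) remain possible, matching the trade‑off described above.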

The article concludes by noting that the JUST team is building a distributed time‑series platform based on these techniques.

Tags: big data, LSH, time series, partitioning, similarity search, dimensionality reduction, MinHash
Written by

JD Tech Talk

Official JD Tech public account delivering best practices and technology innovation.
