How to Perform Fuzzy Queries on Encrypted Data: Approaches and Trade‑offs
This article examines why encrypted data is unfriendly to fuzzy search, categorises three implementation strategies—naïve, conventional, and advanced—analyses their advantages and disadvantages, and provides practical guidance and reference links for securely enabling fuzzy queries on encrypted fields.
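Encrypted data is not naturally compatible with fuzzy search: a sound cipher scrambles the whole value, so a fragment of the plaintext does not appear as a recognisable fragment of the ciphertext, and a LIKE '%partial%' condition on the encrypted column matches nothing useful. The solutions fall into three categories.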
Naïve ("Silly") Approaches
Load all encrypted records into memory, decrypt them, and perform fuzzy matching in application code.
Create a plaintext mapping table (tag table) for the ciphertext and query the tags.
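These methods work only for very small datasets, and the tag table also stores the very plaintext that encryption was meant to protect. For example, encrypting the phone number 13800138000 with DES and Base64-encoding the result yields HE9T75xNx6c5yLmS5l4r6Q==, which occupies 24 bytes. At that size, 1 million records take about 24 MB, 10 million about 240 MB, and 100 million about 2.4 GB for the ciphertext alone, before decryption buffers and per-object overhead, so loading everything into memory risks out-of-memory failures.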
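The 24-byte figure can be checked with a little arithmetic. The sketch below assumes PKCS#5 padding and Base64 output with no stored IV (the article does not state the mode or padding), and the function names are hypothetical. It only computes sizes; it does not perform DES encryption.

```python
import base64


def des_base64_length(plaintext_len: int, block_size: int = 8) -> int:
    """Length in characters of Base64(DES ciphertext), assuming PKCS#5 padding."""
    # PKCS#5 always appends between 1 and block_size bytes of padding.
    padded = (plaintext_len // block_size + 1) * block_size
    # Base64 emits 4 characters for every 3 input bytes, rounded up.
    return 4 * ((padded + 2) // 3)


def raw_footprint_bytes(records: int, plaintext_len: int = 11) -> int:
    """Bytes needed to hold the encoded ciphertext alone for `records` rows."""
    return records * des_base64_length(plaintext_len)


# The ciphertext quoted in the article decodes to two 8-byte DES blocks.
assert len(base64.b64decode("HE9T75xNx6c5yLmS5l4r6Q==")) == 16

for n in (1_000_000, 10_000_000, 100_000_000):
    print(f"{n:>11,} records -> {raw_footprint_bytes(n) / 1e6:,.0f} MB")
```

Real memory use will be higher, because each in-memory string or object carries overhead beyond its raw bytes.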
Conventional Approaches
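Implement encryption/decryption functions in the database and modify fuzzy-search conditions to decrypt before matching, e.g., decode(key) LIKE '%partial%'.
Tokenise the plaintext (for example, into fixed-length overlapping groups), encrypt each token, and store the encrypted tokens in an auxiliary column; at query time, encrypt the search term the same way and query using key LIKE '%partial%', where partial is the encrypted term.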
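The first method is easy to adopt, but it must decrypt every row, so it cannot use an index on the column, and it may suffer from algorithm mismatches between the application and the database. The second method adds storage overhead (encrypted tokens are larger than the plaintext) but lets the database match against stored ciphertext without decrypting each row, so it can take advantage of whatever indexing the database offers for the auxiliary column; it is generally recommended for most scenarios.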
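The token-based method can be sketched as follows. This is a minimal illustration, not a production design: the 4-character window, the key, the table layout, and all function names are assumptions. Because the Python standard library has no block cipher, HMAC-SHA256 stands in for the deterministic per-token encryption. That is enough for matching, since the search only compares tokens for equality, but unlike a real cipher it is not reversible.

```python
import hashlib
import hmac
import sqlite3

KEY = b"demo-key-do-not-reuse"  # hypothetical secret, for illustration only
TOKEN_LEN = 4                   # assumed window size


def protect(token: str) -> str:
    """Deterministic keyed transform of one token (HMAC stand-in for a cipher)."""
    return hmac.new(KEY, token.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def tokenise(text: str) -> list:
    """Split text into overlapping windows of TOKEN_LEN characters."""
    return [text[i:i + TOKEN_LEN] for i in range(len(text) - TOKEN_LEN + 1)]


def token_column(text: str) -> str:
    """Auxiliary-column value: protected tokens, each wrapped in commas."""
    return "," + ",".join(protect(t) for t in tokenise(text)) + ","


def build_demo_db() -> sqlite3.Connection:
    """In-memory table holding only the token column (real ciphertext omitted)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, phone_tokens TEXT)")
    phones = ["13800138000", "13900139000", "13800138111"]
    for row_id, phone in enumerate(phones, start=1):
        conn.execute("INSERT INTO users VALUES (?, ?)", (row_id, token_column(phone)))
    return conn


def search(conn: sqlite3.Connection, term: str) -> list:
    """Return ids of rows whose token column contains every token of the term."""
    if len(term) < TOKEN_LEN:
        raise ValueError(f"search term must have at least {TOKEN_LEN} characters")
    patterns = [f"%,{protect(t)},%" for t in tokenise(term)]
    where = " AND ".join(["phone_tokens LIKE ?"] * len(patterns))
    sql = f"SELECT id FROM users WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(sql, patterns)]


print(search(build_demo_db(), "0138"))  # [1, 3]
```

Two consequences of this design are worth noting. A search term shorter than the window cannot be matched at all, and a longer term is matched token by token, so a row containing all of the term's tokens in a different order would be a false positive that the application must filter out after decryption.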
Advanced ("Super‑God") Approaches
These solutions involve algorithmic research, such as designing new reversible encryption schemes that preserve order or using specialized structures like Bloom filters. References include Hill‑cipher based fuzzy encryption (FMES), Bloom‑filter‑enhanced searchable encryption, and Lucene‑based encrypted search.
They can offer a better balance of security and performance than the approaches above, but they require deep expertise and custom implementation.
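To give a feel for the Bloom-filter idea, the sketch below builds one small keyed Bloom filter per record from the record's character n-grams, then tests whether a search term might be contained. It is a simplified illustration under assumed parameters (256 bits, 3 hash functions, 3-character grams, hypothetical names), not the scheme from the referenced paper.

```python
import hashlib
import hmac

KEY = b"demo-key-do-not-reuse"  # hypothetical secret, for illustration only
FILTER_BITS = 256               # assumed filter size
NUM_HASHES = 3                  # assumed number of hash functions
GRAM_LEN = 3                    # assumed n-gram length


def grams(text: str) -> list:
    """Overlapping n-grams of the text."""
    return [text[i:i + GRAM_LEN] for i in range(len(text) - GRAM_LEN + 1)]


def positions(gram: str) -> list:
    """Bit positions for one n-gram, derived from keyed hashes."""
    out = []
    for i in range(NUM_HASHES):
        digest = hmac.new(KEY, bytes([i]) + gram.encode("utf-8"), hashlib.sha256).digest()
        out.append(int.from_bytes(digest[:4], "big") % FILTER_BITS)
    return out


def build_filter(text: str) -> int:
    """Bloom filter of the text's n-grams, stored as an integer bitmask."""
    bits = 0
    for gram in grams(text):
        for pos in positions(gram):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, term: str) -> bool:
    """True if every n-gram of the term is possibly present (no false negatives)."""
    if len(term) < GRAM_LEN:
        raise ValueError(f"search term must have at least {GRAM_LEN} characters")
    return all((bits >> pos) & 1 for gram in grams(term) for pos in positions(gram))


stored = build_filter("13800138000")  # kept alongside the ciphertext
print(might_contain(stored, "0138"))  # True
```

A Bloom filter never misses a term that is present, but it can report a match for a term that is absent, so candidate rows still need to be decrypted and verified by the application.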
Practical Recommendations
For most projects, the second conventional method (tokenisation + encrypted auxiliary column) provides a good trade‑off between security, storage cost, and query performance. If the organization has dedicated cryptography talent, exploring advanced schemes may be worthwhile.
Reference Links
Taobao encrypted field search: https://open.taobao.com/docV3.htm?docId=106213&docType=1
Alibaba encrypted field search: https://jaq-doc.alibaba.com/docs/doc.htm?treeId=1&articleId=106213&docType=1
Pinduoduo encrypted field search: https://open.pinduoduo.com/application/document/browse?idStr=3407B605226E77F2
JD encrypted field search: https://jos.jd.com/commondoc?listId=345
Database fuzzy‑search encryption methods: https://www.jiamisoft.com/blog/6542-zifushujumohupipeijiamifangfa.html
Bloom‑filter based searchable encryption: http://kzyjc.cnjournals.com/html/2019/1/20190112.htm
Lucene‑based encrypted fuzzy search: https://www.cnblogs.com/arthurqin/p/6307153.html
In summary, avoid naïve approaches, prefer the token‑based conventional method for most use‑cases, and consider advanced algorithmic solutions only when you have the necessary expertise.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.