File Upload, Download, and Keyword Search with Elasticsearch
This article demonstrates how to build a file management system that supports uploading, downloading, and keyword searching of documents (txt, pdf, word) with Elasticsearch, using plugins such as ingest-attachment, Kibana, and elasticsearch-head to preprocess the files, extract their text for precise keyword search, and highlight results with the IK analyzer. Elasticsearch was chosen because it provides powerful full‑text search capabilities behind a simple REST API.
Elasticsearch Overview
Elasticsearch is an open‑source search engine built on Lucene. It wraps Lucene in a RESTful interface, offers distributed storage, and supports plugins for extended functionality.
Key Plugins and Tools
Kibana – visual interface for building queries and visualizations.
elasticsearch-head – browser‑based UI for managing clusters and indices.
Development Environment
Install Elasticsearch, Kibana, and elasticsearch-head. Ensure the Kibana version matches the Elasticsearch version (e.g., Elasticsearch 7.9.1 with Kibana 7.9.1). The default ports are 9200 for Elasticsearch and 9100 for elasticsearch-head.
Core Problems
File Upload
Plain text files can be uploaded directly, but PDF and Word files contain extra metadata (images, tags) that must be pre‑processed. Elasticsearch 5.x+ provides an ingest node and the ingest‑attachment plugin to extract text from these formats.
Install the plugin:
./bin/elasticsearch-plugin install ingest-attachment

Define an ingest pipeline named attachment:
PUT /_ingest/pipeline/attachment
{
"description": "Extract attachment information",
"processors": [
{ "attachment": { "field": "content", "ignore_missing": true } },
{ "remove": { "field": "content" } }
]
}

Document Mapping
Create an index docwrite with mappings that include an attachment field and appropriate Chinese analyzers: ik_max_word for the file name and ik_smart for the extracted content:
PUT /docwrite
{
"mappings": {
"properties": {
"id": { "type": "keyword" },
"name": { "type": "text", "analyzer": "ik_max_word" },
"type": { "type": "keyword" },
"attachment": {
"properties": {
"content": { "type": "text", "analyzer": "ik_smart" }
}
}
}
}
}

Testing Upload
Files must be Base64‑encoded before indexing. Example Java code reads a file, encodes it, and uploads it via the high‑level REST client.
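As a standalone sanity check of the encoding step, the Base64 round trip can be sketched with just the JDK (the class and method names here are illustrative, not from the original code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class Base64RoundTrip {

    // Encode a file's bytes to the Base64 string that goes into the content field.
    static String encode(Path path) throws IOException {
        return Base64.getEncoder().encodeToString(Files.readAllBytes(path));
    }

    // Decode a stored Base64 string back to the original bytes, e.g. when serving a download.
    static byte[] decode(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "hello elasticsearch".getBytes());
        byte[] restored = decode(encode(tmp));
        System.out.println(new String(restored)); // prints "hello elasticsearch"
    }
}
```

Decoding is the essence of the download path: the stored string is turned back into the original file bytes. Note that the remove processor in the pipeline above drops the content field after text extraction, so the Base64 source must be kept (in ES or elsewhere) if downloads are required.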
public class FileObj {
    String id;      // file id
    String name;    // file name
    String type;    // pdf, word, txt
    String content; // Base64-encoded file content
    // getters and setters omitted for brevity
}

public FileObj readFile(String path) throws IOException {
    File file = new File(path);
    FileObj fileObj = new FileObj();
    fileObj.setName(file.getName());
    fileObj.setType(file.getName().substring(file.getName().lastIndexOf(".") + 1));
    // Read the whole file into memory and Base64-encode it (see the memory caveat below).
    byte[] bytes = Files.readAllBytes(file.toPath());
    fileObj.setContent(Base64.getEncoder().encodeToString(bytes));
    return fileObj;
}

public void upload(FileObj file) throws IOException {
    // Index into the docwrite index created above and run the attachment pipeline.
    IndexRequest indexRequest = new IndexRequest("docwrite");
    indexRequest.source(JSON.toJSONString(file), XContentType.JSON);
    indexRequest.setPipeline("attachment");
    IndexResponse response = client.index(indexRequest, RequestOptions.DEFAULT);
    System.out.println(response);
}

Keyword Search
Elasticsearch’s built‑in analyzers split Chinese text into single characters, which is often too granular. The ik analyzer provides two modes:
ik_max_word – finest‑grained segmentation, producing the maximum number of tokens.
ik_smart – coarser "smart" segmentation (e.g., "进口红酒" → "进口", "红酒").
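Assuming the IK plugin is installed, the two modes can be compared directly with the _analyze API (a sketch, runnable in the Kibana console):

```
GET /_analyze
{
"analyzer": "ik_smart",
"text": "进口红酒"
}
```

Swapping the analyzer to ik_max_word on the same text shows the finer-grained token set, which helps choose the right mode for each field.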
Search example using the ik_smart analyzer and highlighting matches:
GET /docwrite/_search
{
"query": {
"match": {
"attachment.content": {
"query": "实验一",
"analyzer": "ik_smart"
}
}
},
"highlight": {
"fields": { "attachment.content": {} },
"pre_tags": ["<em>"],
"post_tags": ["</em>"]
}
}

Multi‑File Testing
The demo was extended to upload an entire folder of mixed‑type documents, then searched via the elasticsearch‑head UI to verify that all files were indexed and searchable.
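The folder upload can be sketched as a small helper that lists only the file types the demo handles (FolderScanner and SUPPORTED are illustrative names; each returned path would then go through readFile and upload):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FolderScanner {
    // Extensions the demo's pipeline can extract text from.
    static final Set<String> SUPPORTED = Set.of("txt", "pdf", "doc", "docx");

    // Collect the files in a folder whose extension is supported.
    static List<Path> supportedFiles(Path folder) throws IOException {
        try (Stream<Path> entries = Files.list(folder)) {
            return entries.filter(p -> {
                String name = p.getFileName().toString();
                int dot = name.lastIndexOf('.');
                return dot >= 0 && SUPPORTED.contains(name.substring(dot + 1).toLowerCase());
            }).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // List the supported files in the given folder (or the current directory).
        for (Path p : supportedFiles(Path.of(args.length > 0 ? args[0] : ".")))
            System.out.println(p.getFileName());
    }
}
```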
Remaining Issues
The ingest‑attachment processor indexes only the first 100,000 extracted characters by default, so longer documents are truncated; handling larger documents requires additional configuration.
Reading whole files into memory can cause out‑of‑memory errors for very large files; streaming or chunked upload strategies are needed for production.
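For the truncation issue, the attachment processor accepts an indexed_chars option; a sketch of the pipeline with the limit removed (-1 means unlimited, at the cost of more memory during extraction):

```
PUT /_ingest/pipeline/attachment
{
"description": "Extract attachment information without the character limit",
"processors": [
{ "attachment": { "field": "content", "indexed_chars": -1, "ignore_missing": true } },
{ "remove": { "field": "content" } }
]
}
```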
Overall, the article provides a practical walkthrough for integrating Elasticsearch’s ingest pipeline, IK analyzer, and Java client to achieve reliable file storage and full‑text search.