Using Elasticsearch for File Upload, Indexing, and Keyword Search with Ingest Attachment Plugin
This article explains how to implement file upload, download, and precise keyword search for Word, PDF, and txt documents with Elasticsearch. It covers environment setup, ingest-attachment preprocessing, index mapping, Java code for uploading and querying, Chinese analysis with the IK analyzer, and highlighting of results.
The requirement is to support uploading and downloading files (Word, PDF, txt) and to enable precise keyword search within the file contents. Elasticsearch is chosen as the core search engine because it provides a simple REST API and powerful indexing capabilities.
Elasticsearch is an open-source search engine built on Apache Lucene. It wraps Lucene to offer distributed storage and RESTful APIs. Companion tools such as Kibana and elasticsearch-head provide visual interfaces for managing clusters.
Development environment: install Elasticsearch, elasticsearch-head, and Kibana. All three are essentially out-of-the-box tools, but their versions must match (e.g., Elasticsearch 7.9.1 with Kibana 7.9.1).
The core problems are file upload and keyword query. Plain text files are straightforward, but PDF and Word files contain extra metadata that must be stripped before indexing.
Elasticsearch 5.x+ offers an ingest node that can run a pipeline to preprocess documents. The ingest‑attachment plugin extracts text from binary files. Install it with:
./bin/elasticsearch-plugin install ingest-attachment

Define an ingest pipeline named attachment:
PUT /_ingest/pipeline/attachment
{
"description": "Extract attachment information",
"processors": [
{ "attachment": { "field": "content", "ignore_missing": true } },
{ "remove": { "field": "content" } }
]
}

Create an index with a mapping that includes the attachment field and uses the Chinese IK analyzer for full-text search:
PUT /docwrite
{
"mappings": {
"properties": {
"id": { "type": "keyword" },
"name": { "type": "text", "analyzer": "ik_max_word" },
"type": { "type": "keyword" },
"attachment": {
"properties": {
"content": { "type": "text", "analyzer": "ik_smart" }
}
}
}
}
}

Before indexing, files must be Base64-encoded because Elasticsearch stores JSON documents. Convert the file to Base64, place the encoded string in the content field, and send the document through the pipeline:
IndexRequest indexRequest = new IndexRequest("docwrite");
indexRequest.source(JSON.toJSONString(fileObj), XContentType.JSON);
indexRequest.setPipeline("attachment");
client.index(indexRequest, RequestOptions.DEFAULT);

Keyword search uses the IK analyzer to obtain meaningful tokens. The default standard tokenizer splits Chinese text into single characters, which is not desired. Installing the IK analyzer plugin provides two modes:
ik_max_word – splits into the maximum number of tokens.
ik_smart – splits according to common usage (e.g., "进口红酒" becomes "进口" and "红酒").
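The difference between the two modes is easiest to see with the _analyze API. A sketch, runnable in the Kibana Dev Tools console once the plugin is installed; per the example above, ik_smart should return the tokens 进口 and 红酒:

```
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "进口红酒"
}
```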
Install the IK analyzer:
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/.../elasticsearch-analysis-ik-7.9.1.zip

Search example using ik_smart and highlighting:
SearchSourceBuilder srb = new SearchSourceBuilder();
srb.query(QueryBuilders.matchQuery("attachment.content", "实验一").analyzer("ik_smart"));
HighlightBuilder hb = new HighlightBuilder();
HighlightBuilder.Field hf = new HighlightBuilder.Field("attachment.content");
hb.field(hf);
hb.preTags("<em>");   // tag inserted before each highlighted term
hb.postTags("</em>"); // tag inserted after each highlighted term
srb.highlighter(hb);
searchRequest.source(srb);

Java helper classes for file handling:
public class FileObj {
    String id;      // file id
    String name;    // file name
    String type;    // pdf, word, txt
    String content; // Base64-encoded content
    // getters and setters omitted
}
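For reference, JSON.toJSONString(fileObj) serializes FileObj into a flat JSON object. A dependency-free sketch of the same shape (class name and field values are made up for illustration; fastjson's exact field order may differ):

```java
import java.util.Base64;

public class FileObjJsonDemo {
    // Builds a JSON body of the shape the upload code sends to Elasticsearch.
    // The attachment pipeline reads the Base64 "content" field and replaces
    // it with an extracted "attachment" object.
    static String toJson(String id, String name, String type, byte[] raw) {
        String b64 = Base64.getEncoder().encodeToString(raw);
        return "{\"id\":\"" + id + "\",\"name\":\"" + name
             + "\",\"type\":\"" + type + "\",\"content\":\"" + b64 + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(toJson("1", "notes.txt", "txt", "hello".getBytes()));
    }
}
```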
public FileObj readFile(String path) throws IOException {
    File file = new File(path);
    FileObj obj = new FileObj();
    obj.setName(file.getName());
    obj.setType(file.getName().substring(file.getName().lastIndexOf('.') + 1));
    byte[] bytes = Files.readAllBytes(file.toPath());
    obj.setContent(Base64.getEncoder().encodeToString(bytes));
    return obj;
}
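The steps inside readFile can be exercised without an Elasticsearch cluster. A minimal self-contained sketch (class name hypothetical) that repeats the same steps on a temporary file and shows the Base64 round trip is lossless:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class ReadFileDemo {
    // Same rule readFile() uses to derive the "type" field from a file name.
    static String extension(String fileName) {
        return fileName.substring(fileName.lastIndexOf('.') + 1);
    }

    public static void main(String[] args) throws IOException {
        // A throwaway txt file stands in for an uploaded document.
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "实验一 keyword search".getBytes(StandardCharsets.UTF_8));

        // Base64-encode the raw bytes, as the upload code does.
        String content = Base64.getEncoder()
                               .encodeToString(Files.readAllBytes(tmp));

        // Decoding restores the original bytes, so nothing is lost in transit.
        String roundTrip = new String(Base64.getDecoder().decode(content),
                                      StandardCharsets.UTF_8);
        System.out.println(extension(tmp.getFileName().toString()) + ": " + roundTrip);
        Files.delete(tmp);
    }
}
```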
public void upload(FileObj file) throws IOException {
    IndexRequest req = new IndexRequest("docwrite");
    req.source(JSON.toJSONString(file), XContentType.JSON);
    req.setPipeline("attachment");
    client.index(req, RequestOptions.DEFAULT);
}

Search code using the IK analyzer:
SearchSourceBuilder srb = new SearchSourceBuilder();
srb.query(QueryBuilders.matchQuery("attachment.content", keyword).analyzer("ik_smart"));
searchRequest.source(srb);
SearchResponse resp = client.search(searchRequest, RequestOptions.DEFAULT);
for (SearchHit hit : resp.getHits()) {
    // process hit
}

Remaining challenges include the ingest-attachment processor truncating extracted text beyond its default of 100,000 characters (configurable via its indexed_chars setting) and high memory consumption when loading large files entirely into memory, which may require streaming or chunked processing in production.
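One way to tackle both truncation and memory pressure is to split a document's text into bounded pieces and index each piece as its own Elasticsearch document, tagged with the parent file's id. A minimal sketch of just the splitting step (class name hypothetical; the overlap-free split is a simplification):

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkDemo {
    // Split text into pieces no longer than maxChars so that each piece
    // stays under the indexing limit. 100_000 mirrors the ingest-attachment
    // default discussed above.
    static List<String> chunk(String text, int maxChars) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < text.length(); i += maxChars) {
            parts.add(text.substring(i, Math.min(text.length(), i + maxChars)));
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> parts = chunk("a".repeat(250_000), 100_000);
        System.out.println(parts.size()); // 3 pieces: 100k + 100k + 50k
    }
}
```

A real implementation would split on sentence or paragraph boundaries rather than fixed offsets, so a keyword is never cut in half across two chunks.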
IT Architects Alliance