File Upload, Download, and Keyword Search with Elasticsearch
This article demonstrates how to build a file management system that supports uploading, downloading, and keyword searching of documents (txt, pdf, word) with Elasticsearch, using plugins such as ingest-attachment, Kibana, and elasticsearch-head to preprocess the files, extract their text for precise keyword search, and highlight results with the IK analyzer. Elasticsearch was chosen because it provides powerful full‑text search capabilities behind a simple REST API.
Elasticsearch Overview
Elasticsearch is an open‑source search engine built on Lucene. It wraps Lucene in a RESTful interface, offers distributed storage, and supports plugins for extended functionality.
Key Plugins and Tools
Kibana – visual interface for building queries and visualizations.
elasticsearch-head – browser‑based UI for managing clusters and indices.
Development Environment
Install Elasticsearch, Kibana, and elasticsearch-head. Ensure the Kibana version matches the Elasticsearch version (e.g., Elasticsearch 7.9.1 with Kibana 7.9.1). The default ports are 9200 for Elasticsearch and 9100 for elasticsearch-head.
Core Problems
File Upload
Plain text files can be uploaded directly, but PDF and Word files contain extra metadata (images, tags) that must be pre‑processed. Elasticsearch 5.x+ provides an ingest node and the ingest‑attachment plugin to extract text from these formats.
Install the plugin:
./bin/elasticsearch-plugin install ingest-attachment

Define an ingest pipeline named attachment:
PUT /_ingest/pipeline/attachment
{
"description": "Extract attachment information",
"processors": [
{ "attachment": { "field": "content", "ignore_missing": true } },
{ "remove": { "field": "content" } }
]
}

Document Mapping
Create an index docwrite with mappings that include an attachment field and appropriate Chinese analyzers: ik_max_word for the file name and ik_smart for the extracted content:
PUT /docwrite
{
"mappings": {
"properties": {
"id": { "type": "keyword" },
"name": { "type": "text", "analyzer": "ik_max_word" },
"type": { "type": "keyword" },
"attachment": {
"properties": {
"content": { "type": "text", "analyzer": "ik_smart" }
}
}
}
}
}

Testing Upload
Files must be Base64‑encoded before indexing. Example Java code reads a file, encodes it, and uploads it via the high‑level REST client.
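As a standalone sanity check of the encoding step, the Base64 round trip can be sketched with just the JDK (the class and method names here are illustrative, not from the original code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class Base64RoundTrip {

    // Encode a file's bytes to the Base64 string that goes into the content field.
    static String encode(Path path) throws IOException {
        return Base64.getEncoder().encodeToString(Files.readAllBytes(path));
    }

    // Decode a stored Base64 string back to the original bytes, e.g. when serving a download.
    static byte[] decode(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "hello elasticsearch".getBytes());
        byte[] restored = decode(encode(tmp));
        System.out.println(new String(restored)); // prints "hello elasticsearch"
    }
}
```

Decoding is the essence of the download path: the stored string is turned back into the original file bytes. Note that the remove processor in the pipeline above drops the content field after text extraction, so the Base64 source must be kept (in ES or elsewhere) if downloads are required.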
public class FileObj {
    String id;      // file id
    String name;    // file name
    String type;    // pdf, word, txt
    String content; // Base64-encoded file content
    // getters and setters omitted for brevity
}

public FileObj readFile(String path) throws IOException {
    File file = new File(path);
    FileObj fileObj = new FileObj();
    fileObj.setName(file.getName());
    fileObj.setType(file.getName().substring(file.getName().lastIndexOf(".") + 1));
    // Read the whole file into memory and Base64-encode it (see the memory caveat below).
    byte[] bytes = Files.readAllBytes(file.toPath());
    fileObj.setContent(Base64.getEncoder().encodeToString(bytes));
    return fileObj;
}

public void upload(FileObj file) throws IOException {
    // Index into the docwrite index created above and run the attachment pipeline.
    IndexRequest indexRequest = new IndexRequest("docwrite");
    indexRequest.source(JSON.toJSONString(file), XContentType.JSON);
    indexRequest.setPipeline("attachment");
    IndexResponse response = client.index(indexRequest, RequestOptions.DEFAULT);
    System.out.println(response);
}

Keyword Search
Elasticsearch’s built‑in analyzers split Chinese text into single characters, which is often too granular. The ik analyzer provides two modes:
ik_max_word – finest‑grained segmentation, producing the maximum number of tokens.
ik_smart – coarser "smart" segmentation (e.g., "进口红酒" → "进口", "红酒").
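Assuming the IK plugin is installed, the two modes can be compared directly with the _analyze API (a sketch, runnable in the Kibana console):

```
GET /_analyze
{
"analyzer": "ik_smart",
"text": "进口红酒"
}
```

Swapping the analyzer to ik_max_word on the same text shows the finer-grained token set, which helps choose the right mode for each field.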
Search example using the ik_smart analyzer and highlighting matches:
GET /docwrite/_search
{
"query": {
"match": {
"attachment.content": {
"query": "实验一",
"analyzer": "ik_smart"
}
}
},
"highlight": {
"fields": { "attachment.content": {} },
"pre_tags": ["<em>"],
"post_tags": ["</em>"]
}
}

Multi‑File Testing
The demo was extended to upload an entire folder of mixed‑type documents, then searched via the elasticsearch‑head UI to verify that all files were indexed and searchable.
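The folder upload can be sketched as a small helper that lists only the file types the demo handles (FolderScanner and SUPPORTED are illustrative names; each returned path would then go through readFile and upload):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FolderScanner {
    // Extensions the demo's pipeline can extract text from.
    static final Set<String> SUPPORTED = Set.of("txt", "pdf", "doc", "docx");

    // Collect the files in a folder whose extension is supported.
    static List<Path> supportedFiles(Path folder) throws IOException {
        try (Stream<Path> entries = Files.list(folder)) {
            return entries.filter(p -> {
                String name = p.getFileName().toString();
                int dot = name.lastIndexOf('.');
                return dot >= 0 && SUPPORTED.contains(name.substring(dot + 1).toLowerCase());
            }).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // List the supported files in the given folder (or the current directory).
        for (Path p : supportedFiles(Path.of(args.length > 0 ? args[0] : ".")))
            System.out.println(p.getFileName());
    }
}
```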
Remaining Issues
The ingest‑attachment processor indexes only the first 100,000 extracted characters by default, so longer documents are truncated; handling larger documents requires additional configuration.
Reading whole files into memory can cause out‑of‑memory errors for very large files; streaming or chunked upload strategies are needed for production.
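For the truncation issue, the attachment processor accepts an indexed_chars option; a sketch of the pipeline with the limit removed (-1 means unlimited, at the cost of more memory during extraction):

```
PUT /_ingest/pipeline/attachment
{
"description": "Extract attachment information without the character limit",
"processors": [
{ "attachment": { "field": "content", "indexed_chars": -1, "ignore_missing": true } },
{ "remove": { "field": "content" } }
]
}
```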
Overall, the article provides a practical walkthrough for integrating Elasticsearch’s ingest pipeline, IK analyzer, and Java client to achieve reliable file storage and full‑text search.