Extracting Personal Information from PDF, DOC, DOCX, and TXT Files Using Apache Tika
This tutorial demonstrates how to use Apache Tika in a Java project to parse PDF, Word, and text documents and extract specific fields such as a name and an ID number. It lists the required Maven dependencies and provides sample code for the extraction.
This article explains how to extract key fields, such as a person's name and ID number, from several document formats (PDF, DOC, DOCX, TXT) using Apache Tika. The author has personally tested the step-by-step guide that follows.
1. Add Maven dependencies
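<!-- Apache Tika packages for parsing PDF, Word, and TXT -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.8.0</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.8.0</version>
</dependency>
<dependency>
    <groupId>org.apache.xmlbeans</groupId>
    <artifactId>xmlbeans</artifactId>
    <version>5.1.1</version>
</dependency>
2. Write the Java code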
package org.example.wordcontent;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Extract data from PDF, DOC, DOCX, and TXT files using Apache Tika.
 * Core jars: tika-core 2.8.0, tika-parsers-standard-package 2.8.0 (requires xmlbeans 5.1.1 for Word).
 * Assumes documents contain fields like:
 *   授权人(签字):张三               (authorizer signature: Zhang San)
 *   身份证号码: 322025199902256056    (ID number)
 */
public class TikaExtrator {

    // The parentheses around 签字 are matched literally (ASCII or full-width), so the
    // only capturing group is the name. Whitespace around the colon is tolerated.
    private static final Pattern NAME_PATTERN =
            Pattern.compile("授权人[((]签字[))]\\s*[::]\\s*([\\u4e00-\\u9fa5]+)");

    // 18-character ID (the final check character may be X) or a legacy 15-digit ID.
    private static final Pattern ID_PATTERN =
            Pattern.compile("身份证号码\\s*[::]\\s*(\\d{17}[\\dXx]|\\d{15})");

    public static void main(String[] args) {
        // Replace with an actual file path; this example reads a classpath resource.
        String resource = "综合信息查询授权书测试.docx";
        try (InputStream input = TikaExtrator.class.getClassLoader().getResourceAsStream(resource)) {
            if (input == null) {
                System.err.println("Resource not found on the classpath: " + resource);
                return;
            }
            String text = extractTextFromFile(input);
            System.out.println("text: " + text);
            System.out.println("Authorizer name: " + extractName(text));
            System.out.println("ID number: " + extractIdNumber(text));
        } catch (IOException | TikaException e) {
            e.printStackTrace();
        }
    }

    // Tika detects the document type and returns its plain-text content.
    // Note: parseToString truncates at 100,000 characters by default;
    // call tika.setMaxStringLength(-1) to lift the limit for long documents.
    private static String extractTextFromFile(InputStream inputStream) throws IOException, TikaException {
        Tika tika = new Tika();
        return tika.parseToString(inputStream);
    }

    // Returns the name after the "授权人(签字)" label, or "" if the label is absent.
    private static String extractName(String text) {
        Matcher matcher = NAME_PATTERN.matcher(text);
        return matcher.find() ? matcher.group(1) : "";
    }

    // Returns the ID number after the "身份证号码" label, or "" if the label is absent.
    private static String extractIdNumber(String text) {
        Matcher matcher = ID_PATTERN.matcher(text);
        return matcher.find() ? matcher.group(1) : "";
    }
}
3. Execution result
Running the program prints the extracted text, the name (e.g., 张三), and the ID number (e.g., 322025199902256056). The original article includes a screenshot of the console output.
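The field-matching step can be checked on its own, without Tika or a sample document, by feeding the patterns a plain string. The following is a minimal sketch under stated assumptions: the class name FieldExtractionSketch and the use of Optional are illustrative and not from the original article, and the patterns assume the labels may appear with either ASCII or full-width parentheses and colons, with optional whitespace before the value.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original article's code.
final class FieldExtractionSketch {

    // Label with literal parentheses (ASCII or full-width), a colon, then Chinese characters.
    private static final Pattern NAME = Pattern.compile(
            "授权人[((]签字[))]\\s*[::]\\s*([\\u4e00-\\u9fa5]+)");

    // 18-character ID (last character may be X) or a legacy 15-digit ID.
    private static final Pattern ID = Pattern.compile(
            "身份证号码\\s*[::]\\s*(\\d{17}[\\dXx]|\\d{15})");

    static Optional<String> extractName(String text) {
        return firstGroup(NAME, text);
    }

    static Optional<String> extractIdNumber(String text) {
        return firstGroup(ID, text);
    }

    // Returns capture group 1 of the first match, or empty if nothing matches.
    private static Optional<String> firstGroup(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Sample text in the shape the article assumes Tika would return.
        String sample = "授权人(签字):张三\n身份证号码: 322025199902256056\n";
        System.out.println(extractName(sample).orElse("(not found)"));     // prints 张三
        System.out.println(extractIdNumber(sample).orElse("(not found)")); // prints 322025199902256056
    }
}
```

Returning Optional instead of an empty string is a design choice made here so that a missing field cannot be mistaken for an empty value; returning "" as the article does works equally well for simple console output.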
Java Captain
Focused on Java technologies: SSM, the Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading; occasionally covers DevOps tools like Jenkins, Nexus, Docker, ELK; shares practical tech insights and is dedicated to full‑stack Java development.