Master jsoup: Real‑World Spring Boot 3 Examples for HTML Parsing

This tutorial walks through practical jsoup usage within Spring Boot 3, covering dependency setup, parsing HTML from strings, fragments, URLs or files, extracting titles, links, images, applying CSS selectors, modifying elements, and sanitizing content to prevent XSS attacks.

Spring Full-Stack Practical Cases

Apr 25, 2025

This article is part of a collection of 118 practical Spring Boot 3 cases. It introduces jsoup, a Java library that simplifies working with HTML and XML.

1. Introduction

jsoup provides an easy‑to‑use API for fetching URLs, parsing documents, and extracting and modifying content with DOM methods, CSS selectors and XPath selectors. It implements the WHATWG HTML5 specification and parses HTML into the same DOM that modern browsers produce.

WHATWG HTML5 specification: https://html.spec.whatwg.org/multipage/syntax.html

Its main capabilities:

Fetch and parse HTML from a URL, file or string.

Find and extract data with DOM traversal or CSS selectors.

Manipulate HTML elements, attributes and text.

Clean user‑submitted content against a safelist to prevent XSS attacks.

Output tidy HTML.

jsoup can handle malformed “tag soup” HTML and still produce a reasonable parse tree.
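
As a small illustration of the tag-soup handling described above, the sketch below parses markup in which neither <p> tag is closed. The class and method names (TagSoupDemo, paragraphs) are made up for this example; only Jsoup.parse and select come from jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical demo class, not part of the original article.
class TagSoupDemo {

    // Parses possibly malformed markup and returns the <p> elements directly inside a <div>.
    static Elements paragraphs(String malformedHtml) {
        Document doc = Jsoup.parse(malformedHtml);
        return doc.select("div > p");
    }

    public static void main(String[] args) {
        // Neither <p> is closed; the HTML5 parsing rules close each one implicitly.
        Elements ps = paragraphs("<div><p>One<p>Two</div>");
        for (Element p : ps) {
            System.out.println(p.text());
        }
    }
}
```

The parser follows the same recovery rules a browser would, so the two paragraphs end up as siblings inside the div rather than nested.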

2. Practical Cases

2.1 Dependency Management

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.18.3</version>
</dependency>

2.2 Parse HTML from a String

String html = """
    <html>
      <head><title>Parse String HTML Document</title></head>
      <body><p>Parsed HTML into a doc.</p></body>
    </html>
    """;
Document doc = Jsoup.parse(html);
Elements titleElement = doc.getElementsByTag("title");
System.err.printf("title: %s%n", titleElement);

Output:

title: <title>Parse String HTML Document</title>
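
Printing an Elements object emits the markup, as seen above. When only the text is needed, a sketch along the following lines may be more convenient; it relies on Document.title() and Element.text(), and the class name TitleTextDemo is an assumption made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of the original article.
class TitleTextDemo {

    // Returns the text of the <title> element (empty string if there is none).
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Returns the text of the first <p>, or an empty string when no paragraph exists.
    static String firstParagraphText(String html) {
        Element p = Jsoup.parse(html).selectFirst("p");
        return p == null ? "" : p.text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Parse String HTML Document</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        System.out.println(titleOf(html));
        System.out.println(firstParagraphText(html));
    }
}
```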

2.3 Parse HTML Fragment

String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
System.err.printf("body:%n%s%n", body);

Output:

<body><div><p>Lorem ipsum.</p></div></body>
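
One possible use of parseBodyFragment is normalizing a snippet before storing it. The sketch below is an assumption about how that might look: it turns off pretty-printing so the repaired fragment comes back on a single line. FragmentDemo and normalize are names invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class, not part of the original article.
class FragmentDemo {

    // Parses a body fragment and returns the repaired inner HTML of <body>.
    static String normalize(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        // Disable indentation and line breaks in the serialized output.
        doc.outputSettings().prettyPrint(false);
        return doc.body().html();
    }

    public static void main(String[] args) {
        // The <div> is never closed in the input; jsoup closes it in the output.
        System.out.println(normalize("<div><p>Lorem ipsum.</p>"));
    }
}
```

Because body().html() returns only the children of the body element, the html, head and body wrappers that jsoup adds around the fragment do not appear in the result.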

2.4 Load HTML Document

From URL:

Document document = Jsoup.connect("http://www.baidu.com").get();
System.err.println(document);
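
In practice a request usually needs a few settings before get() is called. The following sketch shows one way to configure the Connection; the user agent string, the 5 second timeout, and the names FetchDemo, configure and fetch are assumptions chosen for illustration.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class, not part of the original article.
class FetchDemo {

    // Builds a connection with explicit settings; no network traffic happens here.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; jsoup-demo)") // illustrative value
                .timeout(5_000)                                    // milliseconds
                .followRedirects(true);
    }

    // Executes a GET request and parses the response body into a Document.
    static Document fetch(String url) throws IOException {
        return configure(url).get();
    }

    public static void main(String[] args) throws IOException {
        Document doc = fetch("http://www.baidu.com");
        System.out.println(doc.title());
    }
}
```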

From File:

ClassPathResource resource = new ClassPathResource("templates/invoice.html");
Document document = Jsoup.parse(resource.getFile(), "utf-8");
System.err.println(document);
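
ClassPathResource.getFile() generally only works when the resource exists as a real file on disk, which is typically not the case inside a packaged Spring Boot jar. A stream-based variant is sketched below using Jsoup.parse(InputStream, charsetName, baseUri). To stay self-contained it reads from an in-memory stream; in a Spring application the stream would presumably come from resource.getInputStream(). StreamParseDemo and the example.com base URI are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class, not part of the original article.
class StreamParseDemo {

    // Parses HTML from any InputStream and closes the stream afterwards.
    // baseUri is used to resolve relative links; pass "" if that is not needed.
    static Document parse(InputStream in, String baseUri) throws IOException {
        try (in) {
            return Jsoup.parse(in, "UTF-8", baseUri);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "<html><head><title>Invoice</title></head><body></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        // With Spring: parse(new ClassPathResource("templates/invoice.html").getInputStream(), "")
        Document doc = parse(new ByteArrayInputStream(bytes), "https://example.com/");
        System.out.println(doc.title());
    }
}
```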

2.5 Retrieve Element Content

Get page title:

Document document = Jsoup.connect("http://www.baidu.com").get();
System.err.println(document.title());

Output: 百度一下，你就知道 (Baidu's page title, roughly "Baidu it and you'll know")

Get favicon:

Document document = Jsoup.connect("http://www.baidu.com").get();
Element element = document.head().select("link[href~=.*\\.(ico|png)]").first();
String favImage = null;
if (element == null) {
    element = document.head().select("meta[itemprop=image]").first();
    if (element != null) {
        favImage = element.attr("content");
    }
} else {
    favImage = element.attr("href");
}
System.err.println(favImage);

Output: https://www.baidu.com/favicon.ico

Get all links:

Document document = Jsoup.connect("http://www.baidu.com").get();
Elements links = document.select("a[href]");
for (Element link : links) {
    System.out.printf("text: %s, link : %s%n", link.text(), link.attr("href"));
}
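
The href values printed above are returned exactly as written in the page, so they may be relative. A sketch of resolving them against the document's base URI with Element.absUrl follows. It parses an inline string with an explicit base URI so it runs without network access; AbsoluteLinksDemo and the example.com URLs are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of the original article.
class AbsoluteLinksDemo {

    // Returns every link target in the document as an absolute URL.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            // absUrl resolves relative values against the base URI; absolute values pass through.
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href='/about'>About</a><a href='https://other.example/x'>X</a>";
        absoluteLinks(html, "https://example.com/").forEach(System.out::println);
    }
}
```

A document loaded with Jsoup.connect(...).get() already carries the fetched URL as its base URI, so absUrl can be called on its elements directly.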

Get all images:

Document document = Jsoup.connect("http://www.baidu.com").get();
Elements images = document.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
for (Element image : images) {
    // Arguments follow the order of the format string: src, width, height.
    System.out.printf("src : %s, width: %s, height: %s%n", image.attr("src"), image.attr("width"), image.attr("height"));
}

2.6 Use CSS Selectors

Document doc = Jsoup.connect("http://www.baidu.com").get();
Elements links = doc.select("a[href]");
Elements pngs = doc.select("img[src$=.png]");
Element masthead = doc.select("div.masthead").first();
Elements resultDivs = doc.select("h3.r > div");
Elements resultAs = resultDivs.select("a");

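jsoup supports most CSS selectors, including attribute, class, child (>) and pseudo selectors. Calling select on an Elements result, as in resultDivs.select("a"), searches within those elements only.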
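
The selectors above depend on the markup of the live page, so their results vary. To make the behavior easier to check, the sketch below runs the same kinds of queries against a small inline document. The markup and the names SelectorDemo and count are invented for this example.

```java
import org.jsoup.Jsoup;

// Hypothetical demo class, not part of the original article.
class SelectorDemo {

    // Small inline document so the selectors can be tried without network access.
    static final String HTML = """
            <div class="masthead">Header</div>
            <img src="logo.png"><img src="photo.jpg">
            <h3 class="r"><div><a href="/first">First</a></div></h3>
            <a href="/second">Second</a>
            """;

    // Returns how many elements in HTML match the given CSS query.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("a[href]"));        // every <a> with an href attribute
        System.out.println(count("img[src$=.png]")); // images whose src ends with .png
        System.out.println(count("div.masthead"));   // <div> with class "masthead"
        System.out.println(count("h3.r > div a"));   // links inside a div that is a child of h3.r
    }
}
```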

2.7 Modify Elements

String html = """
  <html>
    <head><title>Parse String HTML Document</title></head>
    <body><p>Parsed HTML into a doc.</p></body>
  </html>
  """;
Document doc = Jsoup.parse(html);
Element body = doc.body();
body.prepend("<p>First</p>");   // inserted as the first child of <body>
body.append("<p>Last</p>");     // inserted as the last child of <body>
System.err.println(doc);

Modify specific element content:

String html = """
    <html>
      <head><title>Parse String HTML Document</title></head>
      <body><p class="xxxooo">Parsed HTML into a doc.</p></body>
    </html>
    """;
Document doc = Jsoup.parse(html);
Element paragraph = doc.select("p.xxxooo").first();
paragraph.text("xxxooo pack...");   // replaces the paragraph's text content
System.err.println(doc);
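
Besides text and child nodes, attributes and classes can be changed in the same fluent style. The sketch below is one possible example: it marks every link as external and removes script elements. The attribute values and the names ModifyDemo and decorate are assumptions chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of the original article.
class ModifyDemo {

    // Adds attributes and a class to every link, then drops all <script> elements.
    static Document decorate(String html) {
        Document doc = Jsoup.parse(html);
        for (Element link : doc.select("a[href]")) {
            link.attr("rel", "noopener")      // set or overwrite an attribute
                .attr("target", "_blank")
                .addClass("external");        // append a CSS class
        }
        doc.select("script").remove();        // remove matched elements from the DOM
        return doc;
    }

    public static void main(String[] args) {
        String html = "<p><a href='https://example.com'>Example</a></p><script>alert(1)</script>";
        System.out.println(decorate(html).body().html());
    }
}
```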

2.8 Prevent XSS Attacks

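// Safelist.basic() keeps simple text formatting and links, and drops event handlers such as onclick.
String unsafe = "<p><a href='http://www.pack.com/' onclick='getCookies()'>Surprise</a></p>";
String safe = Jsoup.clean(unsafe, Safelist.basic());
System.err.println(safe);

Output:

<p><a href="http://www.pack.com/" rel="nofollow">Surprise</a></p>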
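
A few related calls are often useful next to Jsoup.clean. The sketch below wraps them in small helper methods: Safelist.none() to reduce input to plain text, and Jsoup.isValid to check whether input already conforms to a safelist. SanitizeDemo and its method names are assumptions, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical demo class, not part of the original article.
class SanitizeDemo {

    // Keeps basic formatting and links; removes scripts, event handlers and other unsafe markup.
    static String sanitize(String html) {
        return Jsoup.clean(html, Safelist.basic());
    }

    // Removes every tag and keeps only the text content.
    static String stripAll(String html) {
        return Jsoup.clean(html, Safelist.none());
    }

    // True when the input contains nothing that the basic safelist would remove.
    static boolean isAlreadySafe(String html) {
        return Jsoup.isValid(html, Safelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(sanitize("<p>Hi<script>alert(1)</script></p>"));
        System.out.println(stripAll("<b>bold</b> text"));
        System.out.println(isAlreadySafe("<p onclick='steal()'>ok</p>"));
    }
}
```

In a Spring Boot application, methods like these could sit in a service that cleans user-submitted content before it is stored or rendered.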
