Master jsoup: Real‑World Spring Boot 3 Examples for HTML Parsing
This tutorial walks through practical jsoup usage in Spring Boot 3: setting up the dependency; parsing HTML from strings, fragments, URLs, and files; extracting titles, links, and images; applying CSS selectors; modifying elements; and sanitizing content to prevent XSS attacks.
This article is part of a collection of 118 practical Spring Boot 3 examples. It introduces jsoup, a Java library that simplifies working with HTML and XML.
1. Introduction
jsoup provides an easy‑to‑use API for fetching URLs, parsing documents, and extracting and manipulating data with DOM methods, CSS selectors, and XPath. It implements the WHATWG HTML5 specification and parses HTML into the same DOM that modern browsers produce.
WHATWG HTML5 specification: https://html.spec.whatwg.org/multipage/syntax.html
With jsoup you can:
Fetch and parse HTML from a URL, file or string.
Find and extract data with DOM traversal or CSS selectors.
Manipulate HTML elements, attributes and text.
Clean user‑submitted content against a safelist to prevent XSS attacks.
Output tidy HTML.
jsoup can handle malformed “tag soup” HTML and still produce a reasonable parse tree.
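As a quick illustration of the tag-soup handling described above, the following minimal sketch parses HTML with unclosed paragraph tags and no html, head, or body elements. The class name TagSoupDemo and its helper method are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;

class TagSoupDemo {

    // Parses possibly malformed HTML and returns the text of every <p> jsoup recovers.
    static List<String> paragraphTexts(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").eachText();
    }

    public static void main(String[] args) {
        // Neither <p> is closed, and the surrounding document structure is missing.
        for (String text : paragraphTexts("<p>One<p>Two")) {
            System.out.println(text);
        }
    }
}
```

The parser closes the first paragraph when it meets the second opening tag, so two separate paragraph elements come back, just as a browser would build them.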
2. Practical Cases
2.1 Dependency Management
<code><dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.18.3</version>
</dependency></code>
2.2 Parse HTML from a String
<code>String html = """
<html>
<head><title>Parse String HTML Document</title></head>
<body><p>Parsed HTML into a doc.</p></body>
</html>
""";
Document doc = Jsoup.parse(html);
Elements titleElement = doc.getElementsByTag("title");
System.err.printf("title: %s%n", titleElement);
</code>
Output:
<code>title: <title>Parse String HTML Document</title></code>
2.3 Parse HTML Fragment
<code>String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
System.err.printf("body: \n%s%n", body);
</code>
Output (note that jsoup supplies the missing closing div tag):
<code><body><div><p>Lorem ipsum.</p></div></body></code>
2.4 Load HTML Document
From URL:
<code>Document document = Jsoup.connect("http://www.baidu.com").get();
System.err.println(document);
</code>
From File:
<code>ClassPathResource resource = new ClassPathResource("templates/invoice.html");
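// getInputStream() also works when the template is packaged inside a jar; getFile() does not
Document document = Jsoup.parse(resource.getInputStream(), "UTF-8", "");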
System.err.println(document);
</code>
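When loading from a URL, many sites expect a user agent, and it is usually wise to set an explicit timeout. The sketch below shows one way to configure the connection before sending it. The class name PageFetcher and the two constants are hypothetical values chosen for illustration, not recommended settings.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

class PageFetcher {

    // Hypothetical defaults for illustration; tune them for your application.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; DemoFetcher/1.0)";
    static final int TIMEOUT_MILLIS = 10_000;

    // Builds a configured connection without sending the request yet.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .timeout(TIMEOUT_MILLIS)
                .followRedirects(true);
    }

    // Sends a GET request and parses the response body into a Document.
    static Document fetch(String url) throws IOException {
        return prepare(url).get();
    }
}
```

Separating prepare from fetch keeps the configuration inspectable and lets a Spring bean reuse the same settings for every request.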
2.5 Retrieve Element Content
Get page title:
<code>Document document = Jsoup.connect("http://www.baidu.com").get();
System.err.println(document.title());
</code>
Output:
<code>百度一下,你就知道</code>
Get favicon:
<code>Document document = Jsoup.connect("http://www.baidu.com").get();
Element element = document.head().select("link[href~=.*\\.(ico|png)]").first();
String favImage = null;
if (element == null) {
element = document.head().select("meta[itemprop=image]").first();
if (element != null) {
favImage = element.attr("content");
}
} else {
favImage = element.attr("href");
}
System.err.println(favImage);
</code>
Output:
<code>https://www.baidu.com/favicon.ico</code>
Get all links:
<code>Document document = Jsoup.connect("http://www.baidu.com").get();
Elements links = document.select("a[href]");
for (Element link : links) {
System.out.printf("text: %s, link : %s%n", link.text(), link.attr("href"));
}
</code>
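Link attributes are often relative, such as /about. A minimal sketch of resolving them to absolute URLs with absUrl follows. It parses a string with an explicit base URI so it runs offline; documents fetched with Jsoup.connect already carry their base URI. The class name AbsoluteLinks and the example.com addresses are assumptions for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class AbsoluteLinks {

    // Returns every link target in the HTML, resolved against baseUri.
    static List<String> absoluteHrefs(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            // absUrl resolves relative values and leaves absolute ones unchanged.
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a><a href=\"https://other.example/x\">X</a>";
        absoluteHrefs(html, "https://example.com/docs/").forEach(System.out::println);
    }
}
```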
Get all images:
<code>Document document = Jsoup.connect("http://www.baidu.com").get();
Elements images = document.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
for (Element image : images) {
System.out.printf("src : %s, width: %s, height: %s%n", image.attr("src"), image.attr("width"), image.attr("height"));
}
</code>
2.6 Use CSS Selectors
<code>Document doc = Jsoup.connect("http://www.baidu.com").get();
Elements links = doc.select("a[href]");
Elements pngs = doc.select("img[src$=.png]");
Element masthead = doc.select("div.masthead").first();
Elements resultDivs = doc.select("h3.r > div");
Elements resultAs = resultDivs.select("a");
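</code>
jsoup supports most CSS selectors, plus a few extensions of its own such as :contains(text) and regular-expression attribute matching ([attr~=regex]).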
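To try selectors like these without a network call, you can run them against an inline document. The following sketch uses a small made-up page; the class name SelectorDemo and the sample markup are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;

class SelectorDemo {

    // A small hypothetical page to run selectors against.
    static Document sample() {
        String html = """
                <div class="masthead"><img src="logo.png"><img src="photo.jpg"></div>
                <h3 class="r"><a href="/first">First</a></h3>
                <h3 class="r"><a href="/second">Second</a></h3>
                <h3><a href="/other">Other</a></h3>
                """;
        return Jsoup.parse(html);
    }

    // Collects the href of each link that is a direct child of an h3 with class "r".
    static List<String> resultLinks(Document doc) {
        return doc.select("h3.r > a").eachAttr("href");
    }

    public static void main(String[] args) {
        Document doc = sample();
        // Attribute suffix match: only the .png image qualifies.
        System.out.println(doc.select("img[src$=.png]").size());
        // selectFirst returns the first match, or null when nothing matches.
        System.out.println(doc.selectFirst("div.masthead") != null);
        System.out.println(resultLinks(doc));
    }
}
```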
2.7 Modify Elements
<code>String html = """
<html>
<head><title>Parse String HTML Document</title></head>
<body><p>Parsed HTML into a doc.</p></body>
</html>
""";
Document doc = Jsoup.parse(html);
Element body = doc.body();
body.prepend("<p>First</p>");
body.append("<p>Last</p>");
System.err.println(doc);
</code>
Modify specific element content:
<code>String html = """
<html>
<head><title>Parse String HTML Document</title></head>
<body><p class="xxxooo">Parsed HTML into a doc.</p></body>
</html>
""";
Document doc = Jsoup.parse(html);
Element paragraph = doc.selectFirst("p.xxxooo");
paragraph.text("xxxooo pack...");
System.err.println(doc);
</code>
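Besides inserting nodes and replacing text, jsoup can change attributes and classes. The sketch below marks every link whose href starts with http as external. The class name AttributeEdits, the external CSS class, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AttributeEdits {

    // Adds rel="nofollow" and a CSS class to links whose href starts with "http".
    static String markExternal(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        for (Element link : doc.select("a[href^=http]")) {
            // attr(key, value) and addClass both return the element, so calls chain.
            link.attr("rel", "nofollow").addClass("external");
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/\">Site</a><a href=\"/local\">Local</a>";
        System.out.println(markExternal(html));
    }
}
```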
2.8 Prevent XSS Attacks
<code>String unsafe = "<p><a href='http://www.pack.com/' onclick='getCookies()'>惊喜</a></p>";
String safe = Jsoup.clean(unsafe, Safelist.basic());
System.err.println(safe);
</code>Output:
<code><p><a href="http://www.pack.com/" rel="nofollow">惊喜</a></p></code>
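Two related calls are worth knowing alongside clean. Jsoup.isValid reports whether input already conforms to a safelist, which suits rejecting a submission instead of silently rewriting it, and Safelist.none() strips all markup and keeps only text. A minimal sketch follows; the class name SanitizeDemo and the sample inputs are assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class SanitizeDemo {

    // True when the HTML contains only tags and attributes allowed by Safelist.basic().
    static boolean isSafe(String html) {
        return Jsoup.isValid(html, Safelist.basic());
    }

    // Removes every tag, leaving only the text content.
    static String toPlainText(String html) {
        return Jsoup.clean(html, Safelist.none());
    }

    public static void main(String[] args) {
        String unsafe = "<p><a href='http://example.com/' onclick='steal()'>Link</a></p>";
        System.out.println(isSafe(unsafe));                      // the onclick attribute is not allowed
        System.out.println(isSafe("<p>Hello <b>world</b></p>")); // only allowed tags
        System.out.println(toPlainText("<p>Hello <b>world</b></p><script>alert(1)</script>"));
    }
}
```

Validation and cleaning complement each other: validate when you want to give the user feedback, clean when you want to accept the input and store a safe version.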