Parsing HTML files in Java with the open source library jsoup

HTML is the core of the web. Every page you see on the Internet is HTML, whether it is generated dynamically by JavaScript, JSP, PHP, ASP, or some other web technology. Your browser parses the HTML and renders it for you. But what if you need to parse an HTML document in a Java program to find particular elements, tags, or attributes, or to check whether a specific element exists?

If you have been programming in Java for many years, you have probably parsed XML with a DOM or SAX parser, but you may well never have parsed HTML. Ironically, outside of servlets and other Java web technologies, Java applications rarely need to parse HTML documents. What's worse, although the JDK can download a page over HTTP (for example with java.net.HttpURLConnection), as far as I know it has no parser built for modern, real-world HTML. This is why many Java programmers, when they first have to parse an HTML file, start by Googling how to extract an HTML tag in Java.

When I needed this myself, I expected to find some open source library that could do the job, but I did not expect one as cool and full-featured as jsoup. It not only reads and parses HTML documents, but also lets you extract any element, its attributes, and its CSS classes from an HTML file, and modify them. With jsoup, you can do almost anything with an HTML document. We'll see examples of how to download and parse HTML from the Google home page or any other URL in Java.

What is the jsoup library

jsoup is an open source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, combining the best of DOM traversal, CSS selectors, and jQuery-style methods. jsoup implements the WHATWG HTML5 specification and parses HTML into the same DOM that modern browsers such as Chrome and Firefox produce. Here are some useful features of the jsoup library:

1. jsoup can fetch and parse HTML from a URL, a file, or a string.
2. jsoup can find and extract data using DOM traversal or CSS selectors.
3. You can use jsoup to modify HTML elements, attributes, and text.
4. jsoup can clean user-submitted content against a safelist (called a whitelist in older releases) to prevent XSS attacks.
5. jsoup can also output tidy HTML.
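
The features listed above can be sketched in a few lines. This is only an illustration, assuming jsoup 1.14.2 or later is on the classpath (that is where `Safelist` replaced the older `Whitelist` class). The class name `JsoupFeatures`, its helper methods, and the sample markup are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

// Hypothetical demo class; only the org.jsoup calls come from the library.
class JsoupFeatures {

    // Features 1 and 2: parse a string, then extract text with a CSS selector.
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element match = doc.selectFirst(cssQuery);
        return match == null ? null : match.text();
    }

    // Features 3 and 5: set an attribute on every match and return the body HTML.
    static String setAttribute(String html, String cssQuery, String name, String value) {
        Document doc = Jsoup.parse(html);
        doc.select(cssQuery).attr(name, value);
        return doc.body().html();
    }

    // Feature 4: keep only the tags and attributes allowed by the basic safelist.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String html = "<div id=\"main\"><p class=\"intro\">Hello, <b>jsoup</b>!</p>"
                + "<a href=\"/docs\">Docs</a></div>";
        System.out.println(firstText(html, "p.intro"));
        System.out.println(setAttribute(html, "a", "target", "_blank"));
        System.out.println(sanitize("<p>Hi<script>alert('xss')</script></p>"));
    }
}
```

On releases older than 1.14.2, the same cleaning call would take `Whitelist.basic()` instead.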

jsoup is designed to handle all the varieties of HTML found in the wild, from correct, valid documents to incomplete, invalid tag soup. Robustness is one of its core strengths.
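
As a small illustration of that robustness, the sketch below feeds jsoup a fragment in which no tag is ever closed. The class name `TagSoupDemo` and the input string are invented for this example; jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical demo class showing how unclosed tags are repaired.
class TagSoupDemo {

    // Neither <p> nor <b> is closed, and there is no <html>, <head>, or <body>.
    static Document parseBroken() {
        String broken = "<p>One<p>Two <b>bold";
        return Jsoup.parse(broken);
    }

    public static void main(String[] args) {
        Document doc = parseBroken();
        Elements paragraphs = doc.select("p");
        // The parser closes the first <p> when it meets the second one.
        System.out.println(paragraphs.size() + " paragraphs found");
        // Print the repaired body markup.
        System.out.println(doc.body().html());
    }
}
```

The parser builds a complete document around the fragment, so the usual selectors work even though the input was never valid HTML.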

Using jsoup for HTML parsing in Java

In this tutorial on Java HTML parsing, we will see three different examples of using jsoup to parse and traverse HTML in Java. In the first example, we will parse an HTML string written as a Java string literal. In the second example, we will download an HTML document from the web, and in the third example, we will load a sample HTML file, login.html, for parsing. This file is a simple HTML document: it has a title tag, and its body contains a div that holds a form. The form has input tags for the user name and password, as well as submit and reset buttons. It is correct, valid HTML, that is, all tags and attributes are closed properly. Here is our sample HTML file:
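
What follows is a minimal sketch of the three approaches, assuming jsoup is on the classpath. The `LOGIN_HTML` constant is a hypothetical stand-in for login.html, written only from the description above; its ids, form action, and title text are guesses, not the original file. The class and method names are likewise made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; only the org.jsoup calls come from the library.
class HtmlParserSketch {

    // Assumed stand-in for login.html: title, div, form, two inputs, two buttons.
    static final String LOGIN_HTML =
            "<!DOCTYPE html>\n"
          + "<html>\n<head><title>Login Page</title></head>\n"
          + "<body>\n<div id=\"login\">\n"
          + "<form action=\"login.do\">\n"
          + "Username: <input id=\"username\" type=\"text\"><br>\n"
          + "Password: <input id=\"password\" type=\"password\"><br>\n"
          + "<input id=\"submit\" type=\"submit\" value=\"Login\">\n"
          + "<input id=\"reset\" type=\"reset\" value=\"Reset\">\n"
          + "</form>\n</div>\n</body>\n</html>";

    // Example 1: parse HTML held in a Java string.
    static String titleOf(String html) {
        return Jsoup.parse(html).title();
    }

    // Example 2: download a page and parse it (needs network access).
    static String titleOfUrl(String url) throws IOException {
        return Jsoup.connect(url).get().title();
    }

    // Example 3: parse HTML from a file; a temp file stands in for login.html.
    static Document parseViaTempFile(String html) {
        try {
            Path tmp = Files.createTempFile("login", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                return Jsoup.parse(tmp.toFile(), "UTF-8");
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String literal = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        System.out.println("String title: " + titleOf(literal));

        try {
            System.out.println("URL title: " + titleOfUrl("https://www.google.com"));
        } catch (IOException e) {
            System.out.println("Download failed: " + e.getMessage());
        }

        Document login = parseViaTempFile(LOGIN_HTML);
        System.out.println("File title: " + login.title());
        // Walk every input inside the form and print its id and type.
        for (Element input : login.select("form input")) {
            System.out.println(input.id() + " -> " + input.attr("type"));
        }
    }
}
```

To parse a real file instead of the temp-file stand-in, pass its `File` object and character set to `Jsoup.parse` in the same way.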
