Java crawler: crawling JD's mobile phone search pages with HttpClient + Jsoup
1. Requirements and configuration
Requirement: crawl JD's mobile phone search pages, recording each phone's name, price, number of comments, and so on, to form a data table that can be used for real analysis.
The project uses Maven; log4j handles logging, and output goes to the console only.
The Maven dependencies are as follows (pom.xml):
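A minimal dependency block naming the three libraries the article relies on; the exact versions are assumptions (any recent 4.x HttpClient, 1.x Jsoup, and log4j 1.2 should behave the same):

<dependencies>
    <!-- HttpClient, for fetching pages -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.13</version>
    </dependency>
    <!-- Jsoup, for parsing HTML -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.11.3</version>
    </dependency>
    <!-- log4j, for console logging -->
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.17</version>
    </dependency>
</dependencies>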
The log4j configuration (log4j.properties) sends INFO-level and above messages to the console; no separate output file is configured:
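A minimal configuration matching that description (the pattern layout details are an assumption):

# INFO and above, console appender only
log4j.rootLogger=INFO, console

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} [%p] %c - %m%n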
2. Requirements analysis and code
2.1 Requirements analysis
The first step is to establish a connection between the client and the server, and obtain the HTML content on the web page through the URL.
The second step is to parse the HTML content and get the required elements.
The third step is to write the parsed content to a local text file, which other data analysis software can then read directly.
Following this analysis, four classes are created: GetHTML (fetches a page's HTML), ParseHTML (parses the HTML), WriteTo (writes the output file), and MainControl (the entry point). Each is described below. To keep the code as concise as possible, all exceptions are thrown from the methods rather than caught.
2.2 Code
2.2.1 GetHTML class
This class contains two methods: getH(String url), which fetches a page's HTML, and urlControl(String baseURL, int page), which drives the URLs. Since the crawl only covers the search results for one category of goods on JD, there is no need to traverse every URL on the page; it is enough to observe how the URL changes when turning pages and work out the pattern. Only the urlControl method is exposed externally, and the class holds a private logger attribute for logging: private static Logger log = Logger.getLogger(GetHTML.class);
getH(String url) fetches the HTML content of a single URL; a sketch follows.
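A minimal sketch, with the class skeleton included so it compiles; the User-Agent header and utf-8 charset are assumptions, not from the original:

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class GetHTML {

    // Fetch the HTML of one URL; exceptions are thrown upward, not caught
    public static String getH(String url) throws Exception {
        CloseableHttpClient client = HttpClients.createDefault();
        HttpGet get = new HttpGet(url);
        // A browser-like User-Agent makes the request less likely to be refused
        get.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        String html = EntityUtils.toString(client.execute(get).getEntity(), "utf-8");
        client.close();
        return html;
    }

    // urlControl(String baseURL, int page) follows below
}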
urlControl(String baseURL, int page) sets up the loop and fetches the data of multiple pages. Inspecting the elements shows that JD's search pages actually advance in odd steps: the page parameter takes the values 1, 3, 5, 7, and so on.
Watching how the address changes as you click through pages confirms that what actually changes is the value of the page parameter, so the address of the next page can be obtained by simple string splicing:
https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&cid2=653&cid3=655&page=3&s=47&click=0
https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&cid2=653&cid3=655&page=5&s=111&click=0
https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&cid2=653&cid3=655&page=7&s=162&click=0
Overall Code:
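A sketch of the whole class as described above; the hand-off to ParseHTML (section 2.2.2) and the log message wording are assumed wiring:

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.apache.log4j.Logger;

public class GetHTML {

    private static Logger log = Logger.getLogger(GetHTML.class);

    // Fetch the HTML of a single URL (detailed above)
    public static String getH(String url) throws Exception {
        CloseableHttpClient client = HttpClients.createDefault();
        HttpGet get = new HttpGet(url);
        get.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        String html = EntityUtils.toString(client.execute(get).getEntity(), "utf-8");
        client.close();
        return html;
    }

    // Loop over result pages; JD's page parameter runs 1, 3, 5, ...
    public static void urlControl(String baseURL, int page) throws Exception {
        for (int i = 0; i < page; i++) {
            String url = baseURL + (2 * i + 1); // splice the odd page number on
            log.info("fetching page " + (i + 1) + ": " + url);
            ParseHTML.parse(getH(url)); // hand the HTML to the parser
        }
    }
}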
2.2.2 ParseHTML class
This step uses the browser's element inspector to determine which tags hold the content to be crawled, then extracts them with Jsoup's CSS selectors.
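A sketch of the class, assuming the JD markup of the time: each result is an li.gl-item, with the name under .p-name em, the price under .p-price i, and the comment count under .p-commit a. These selectors must be checked against the live page; the WriteTo.write call matches the sketch in section 2.2.3:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseHTML {

    // Parse one result page and pass the extracted rows on for output
    public static void parse(String html) throws Exception {
        Document doc = Jsoup.parse(html);
        StringBuilder rows = new StringBuilder();
        int count = 0;
        for (Element item : doc.select("li.gl-item")) { // one <li> per product (assumed)
            String name = item.select(".p-name em").text();
            String price = item.select(".p-price i").text();
            String comments = item.select(".p-commit a").text();
            // Tab-separated: item number, name, price, number of comments
            rows.append(++count).append('\t').append(name).append('\t')
                .append(price).append('\t').append(comments).append('\n');
        }
        WriteTo.write(rows.toString());
    }
}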
2.2.3 WriteTo class
The method in this class writes the parsed content to a local file using plain Java I/O.
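A minimal sketch; the file name jd_phones.txt is an assumption, and the writer opens in append mode so successive pages accumulate in one file:

import java.io.BufferedWriter;
import java.io.FileWriter;

public class WriteTo {

    // Append the tab-separated rows to a local text file
    public static void write(String content) throws Exception {
        BufferedWriter writer = new BufferedWriter(new FileWriter("jd_phones.txt", true));
        writer.write(content);
        writer.close();
    }
}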
2.2.4 MainControl class
The main program supplies the base address and the number of pages to fetch, then calls the urlControl method of GetHTML to crawl the pages.
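A sketch of the entry point. The base address is the search URL from section 2.2.1 cut off just before the page number; the trailing s and click parameters are dropped, which the search page tolerates (an assumption):

public class MainControl {

    public static void main(String[] args) throws Exception {
        // keyword=%E6%89%8B%E6%9C%BA is the URL-encoded Chinese for "mobile phone"
        String baseURL = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA"
                + "&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&cid2=653&cid3=655&page=";
        GetHTML.urlControl(baseURL, 20); // crawl the first 20 result pages
    }
}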
3. Crawling results
The crawl covers the first 20 result pages.
3.1 Console output
3.2 File output
The file can be opened directly in Excel; the separator is a tab. The columns are item number, name, price, and number of comments.
4. Summary
HttpClient and Jsoup were used for this crawl, and they prove very efficient for simple requirements like this. All of the code could in fact go into a single class; splitting it across several classes just keeps the structure clear.