A Detailed Walkthrough of a Java Douban Movie Crawler: Growth Record of a Little Crawler (with Source Code)
I have used crawlers before, for example using Nutch to crawl from specified seeds and to search over the crawled data, and I have skimmed some of its source code. Nutch, of course, handles crawling very comprehensively and in great detail. Whenever I watched the crawled pages and processing messages scroll past on the screen, it felt like black magic. This time, while reviewing Spring MVC, I decided to build a small crawler of my own. It doesn't matter if it is simple or has a few small bugs; all I need is a program that can fetch the information I want starting from a given seed site. Whenever an exception shows up, I fix it: it might be a misused API, an abnormal HTTP response status, or a database read/write problem. Through this cycle of hitting exceptions and resolving them, JewelCrawler (my "son's" nickname) can now crawl data on its own, and it has also picked up a small skill: sentiment analysis based on the word2vec algorithm.
There may still be unknown exceptions waiting to be solved, and some performance work remains, such as the interaction with the database and data reads and writes. I am not going to put much more effort into that for now, though, so today I'll write a short summary. The first two articles focused mainly on features and results. This one covers how JewelCrawler came into being, and the code is on GitHub (the source code address is at the end of the article) for anyone interested (for communication and learning only; please do not use it for other purposes, and be considerate of Douban: a little more sincerity, a little less harm).
Environment introduction
Development tool: IntelliJ IDEA 14
Database: MySQL 5.5, plus the database management tool Navicat (used to connect to and query the database)
Language: Java
Jar package management: Maven
Version management: Git
Directory structure
Here:
com.ansj.vec is the Java implementation of the word2vec algorithm.
com.jackie.crawler.doubanmovie is the crawler implementation module, which contains several sub-packages.
Some of the packages are empty because those modules are not used yet.
The resource module stores configuration files and resource files, such as seed.properties.
The test module holds the unit tests.
Database configuration
1. Add the dependencies
JewelCrawler is managed with Maven, so you only need to add the corresponding dependencies to pom.xml.
2. Declare the data source bean
We need to declare the data source bean in beans.xml.
Note: the external configuration file jdbc.properties is bound here, and the specific data source parameters are read from that file.
If you encounter the error "SQL [insert into user (ID) values (?)]; Field 'name' doesn't have a default value", the INSERT did not supply a value for a NOT NULL column that has no default. The solution I used was to set the corresponding field of the table to auto-increment.
Problems encountered in parsing the page
To get the data you want, you need to parse the DOM structure of the page. The following errors came up along the way.
org.htmlparser.Node is not recognized
Solution: add the corresponding jar dependency
org.apache.http.HttpEntity is not recognized
Solution: add the corresponding jar dependency
These were only problems met along the way; in the end, page parsing is done with jsoup.
Slow downloads from the Maven repository
I used to rely on the default Maven central repository, and jar downloads were very slow. I don't know whether that was my network or something else. Later I found the Alibaba Cloud Maven repository online. After switching, downloads finish in seconds compared with before, and I strongly recommend it.
Find Maven's settings.xml file and add this mirror to it.
How to read files under the resource module
For example, reading the seed.properties file
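Reading a properties file from the resource module can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the class name SeedConfig, the key seed.url and the sample value are assumptions made for illustration, and the only thing taken from the article is the file name seed.properties. It relies on the fact that Maven copies the contents of src/main/resources onto the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper; the class and key names are illustrative only.
public class SeedConfig {

    // Parse key=value pairs from any stream, so the logic can be tried without a real file.
    public static Properties load(InputStream in) {
        Properties props = new Properties();
        try {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Look the file up on the classpath, e.g. loadFromClasspath("seed.properties").
    public static Properties loadFromClasspath(String name) {
        try (InputStream in = SeedConfig.class.getClassLoader().getResourceAsStream(name)) {
            if (in == null) {
                // getResourceAsStream returns null instead of throwing when the file is missing.
                throw new IllegalArgumentException("Resource not found on classpath: " + name);
            }
            return load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed file content, inlined so the sketch runs on its own.
        String sample = "seed.url=https://movie.douban.com/\n";
        Properties props = load(new ByteArrayInputStream(sample.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(props.getProperty("seed.url"));
    }
}
```

Checking for a null stream is worth the extra lines: a misspelled file name otherwise surfaces later as a NullPointerException inside Properties.load, which is harder to trace.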
About regular expressions
When using regular expressions, once you have a Matcher for your Pattern you must call the matcher's find method first and only then use the group method to get the matched substring. Calling group directly will not give you the result you want.
I looked at the source code of the Matcher class.
The reason is this: if you call group() directly without calling find() first, group() delegates to group(int group), and the body of that method starts by checking whether first < 0. That condition is true at this point, because the initial value of first is -1, so an exception (IllegalStateException) is thrown. If you call find() first, it eventually calls search(nextSearchIndex). Note that nextSearchIndex has been assigned from last, whose value is 0, and execution then jumps to the search method.
There, nextSearchIndex is passed in as from, and from is assigned to first in the method body. So after a successful find() call, first is no longer -1 and group() no longer throws. If find() returns false, first is reset to -1 and group() still throws.
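The find-before-group behavior can be reproduced with a small standalone sketch. The pattern and the URL below are assumptions chosen for illustration (a numeric id inside a /subject/<digits>/ path); they are not taken from the JewelCrawler source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative demo class; names and pattern are hypothetical.
public class MatcherFindDemo {

    // Assumed pattern: captures the digits in a path segment like /subject/1292052/.
    private static final Pattern SUBJECT_ID = Pattern.compile("/subject/(\\d+)/");

    // Correct order: find() first, then group(). Returns null when nothing matches.
    public static String extractId(String url) {
        Matcher m = SUBJECT_ID.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    // Calling group() on a fresh Matcher throws, because no match has been attempted yet.
    public static boolean groupWithoutFindThrows(String url) {
        Matcher m = SUBJECT_ID.matcher(url);
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String url = "https://movie.douban.com/subject/1292052/";
        System.out.println(extractId(url));              // prints 1292052
        System.out.println(groupWithoutFindThrows(url)); // prints true
    }
}
```

Guarding group() with the boolean result of find(), as extractId does, also covers the case where the input does not match at all.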
The source code has been uploaded to Baidu Netdisk: http://pan.baidu.com/s/1dFwtvNz
The problems covered above are rather fragmentary, since they are all notes taken while running into and solving individual issues. Other problems will come up in actual use. If you have any questions or suggestions, you are welcome to raise them ^ ^.
Finally, here is some of the data crawled so far
Record table
79032 records are stored, and 48471 web pages have been crawled
Movie table
So far, 2964 film and television works have been crawled
Comments table
29711 records have been crawled