Detailed explanation of Java Douban movie crawler – little crawler growth record (with source code)

2019-08-07 • Java

I have used crawlers before, for example using Nutch to crawl specified seeds and building search on top of the crawled data, and I have skimmed some of its source code. Nutch's treatment of crawling is, of course, very comprehensive and detailed. Whenever I watched the crawled page information and processing logs scroll past on the screen, it felt like black magic. This time, while reviewing Spring MVC, I took the opportunity to build a small crawler of my own. It doesn't matter if it is simple or has a few small bugs. All I need is for it to crawl the information I want starting from a given seed site. Whenever an exception came up, I solved it: it might be improper use of some API, an abnormal HTTP response status, or a database read/write problem. Through that cycle of hitting and fixing exceptions, jewelcrawler (son's nickname) became able to crawl data on its own, and it also picked up a small skill: sentiment analysis based on the word2vec algorithm.

There may still be unknown exceptions waiting to be solved, and some performance to optimize, such as the interaction with the database and data reads and writes. However, I won't be putting much more effort into it for now, so here is a simple summary for today. The first two articles mainly covered features and results. This article talks about how jewelcrawler was born, and the code is on GitHub (the source code address is at the end of the article) for anyone interested. It is for communication and learning only, so please do not use it for other purposes, and be considerate of Douban: more sincerity, less harm.

Environment introduction

Development tool: IntelliJ IDEA 14

Database: MySQL 5.5, plus the database management tool Navicat (used to connect to and query the database)

Language: Java

Dependency management: Maven

Version management: Git

Directory structure

In the project:

com.ansj.vec is the Java implementation of the word2vec algorithm

com.jackie.crawler.doubanmovie is the crawler implementation module, which also includes

[screenshot: sub-packages of the crawler module]

Some packages are empty because these modules are not yet used, including

[screenshot: empty packages]

The resource module stores configuration files and resource files, such as

[screenshot: files in the resource module]

The test module holds the unit tests

Database configuration

1. Add dependencies

Jewelcrawler is managed with Maven, so you only need to add the corresponding dependencies to pom.xml.

2. Declare the data source bean

The data source bean needs to be declared in beans.xml.

Note: the external configuration file jdbc.properties is bound here, and the specific data source parameters are read from that file.

If you encounter the error "SQL [insert into user (ID) values (?)]; Field 'name' doesn't have a default value", the solution is to set the corresponding field of the table to auto-increment.

Problems encountered in parsing the page

To get the data you want, you need to parse the DOM structure of the page. I ran into the following errors along the way.

org.htmlparser.Node cannot be resolved

Solution: add the corresponding jar dependency

org.apache.http.HttpEntity cannot be resolved

Solution: add the corresponding jar dependency

These were only problems encountered along the way. In the end, page parsing is done with jsoup.

Slow downloads from the Maven repository

I used to use the default Maven central repository, and downloading jar packages was very slow. I don't know whether that was my network or something else. Later I found the Alibaba Cloud Maven repository online. After switching to it, downloads finish in seconds compared with before. Strongly recommended.

Find Maven's settings.xml file and add this mirror to it.

A way to read files under the resource module

For example, reading the seed.properties file
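One common way to do this in a Maven project is to go through the class loader, since files under the resource module end up on the classpath. The following is only a minimal sketch under that assumption, not necessarily the code jewelcrawler uses. The class name ResourceReader and the key seedUrl are hypothetical. Only the file name seed.properties comes from this article.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

class ResourceReader {

    /** Loads key=value pairs from an already opened stream. */
    static Properties load(InputStream in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    /**
     * Looks a file up on the classpath and loads it as properties.
     * In a Maven project, files under src/main/resources are copied there.
     */
    static Properties loadFromClasspath(String name) throws IOException {
        ClassLoader loader = ResourceReader.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(name)) {
            if (in == null) {
                // getResourceAsStream returns null instead of throwing
                throw new IOException("Resource not found on classpath: " + name);
            }
            return load(in);
        }
    }

    public static void main(String[] args) throws IOException {
        Properties seed = loadFromClasspath("seed.properties");
        // "seedUrl" is a hypothetical key used only for illustration
        System.out.println(seed.getProperty("seedUrl"));
    }
}
```

The null check matters because a missing resource would otherwise only surface later as a NullPointerException inside Properties.load.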

About regular expressions

When using a regular expression, once the pattern is defined you must first call the Matcher's find method and only then use the group method to retrieve the matched substring. Calling group directly throws an exception instead of returning the result you want.
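The behaviour is easy to reproduce with the standard java.util.regex classes. In this small sketch the class name FindBeforeGroup, the helper names and the sample URL are made up for illustration. The pattern simply picks out the first run of digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindBeforeGroup {

    /** Returns the first run of digits in the text, or null if there is none. */
    static String firstNumber(String text) {
        Matcher matcher = Pattern.compile("\\d+").matcher(text);
        // find() must succeed before group() may be called
        return matcher.find() ? matcher.group() : null;
    }

    /** Calls group() without find() and reports whether the call was rejected. */
    static boolean groupWithoutFindThrows(String text) {
        Matcher matcher = Pattern.compile("\\d+").matcher(text);
        try {
            matcher.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // sample URL, for illustration only
        String url = "https://movie.douban.com/subject/1292052/";
        System.out.println(groupWithoutFindThrows(url)); // prints true
        System.out.println(firstNumber(url));            // prints 1292052
    }
}
```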

I looked at the source code of the Matcher class to find out why.

The reason is this: if you call group directly without calling find first, group() calls group(int group), and the body of that method starts by checking whether first < 0. That condition is true at this point, because the initial value of first is -1, so an exception is thrown. If you call find first, it eventually calls search(nextSearchIndex). Note that nextSearchIndex has been assigned from last, whose value is 0, and execution then jumps into the search method.

There, nextSearchIndex is passed in as from, and the method body assigns from to first. So after find has been called, first is no longer -1 and group no longer throws an exception (provided find actually found a match; if find returns false, group still throws).
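The state handling described above can be modelled in a few lines. This is a deliberately simplified toy, not the JDK source: TinyMatcher is a hypothetical class, it matches a literal string with String.indexOf instead of a compiled pattern, and it keeps only the first and last fields that the explanation mentions.

```java
class TinyMatcher {
    private final String literal; // stands in for a compiled pattern
    private final String text;
    private int first = -1;       // start of the current match, -1 means "no match"
    private int last = 0;         // end of the previous match, where the next search starts

    TinyMatcher(String literal, String text) {
        this.literal = literal;
        this.text = text;
    }

    boolean find() {
        int nextSearchIndex = last;
        return search(nextSearchIndex);
    }

    private boolean search(int from) {
        first = from; // first leaves its initial value of -1 here
        int start = text.indexOf(literal, from);
        if (start < 0) {
            first = -1; // a failed search goes back to the "no match" state
            return false;
        }
        first = start;
        last = start + literal.length();
        return true;
    }

    String group() {
        if (first < 0) {
            throw new IllegalStateException("No match found");
        }
        return text.substring(first, last);
    }

    public static void main(String[] args) {
        TinyMatcher matcher = new TinyMatcher("crawler", "jewelcrawler");
        try {
            matcher.group(); // first is still -1
        } catch (IllegalStateException e) {
            System.out.println("before find: " + e.getMessage());
        }
        if (matcher.find()) {
            System.out.println("after find: " + matcher.group());
        }
    }
}
```

Resetting first to -1 after a failed search is what makes group throw again when find returns false.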

The source code has been uploaded to Baidu Netdisk: http://pan.baidu.com/s/1dFwtvNz

The problems mentioned above are rather fragmentary. They are all notes written down while running into and solving problems, and other problems will come up in actual use. If you have any questions or suggestions, you are welcome to raise them ^ ^.

Finally, here is some of the data crawled so far

Record table

79032 records stored and 48471 web pages crawled

Movie table

2964 film and television works crawled so far

Comments table

29711 records crawled

That is all for this article. I hope it is helpful to your study.

The content of this article comes from the network collection of netizens. It is used as a learning reference. The copyright belongs to the original author.
