An example of building a general crawler framework based on Vert.x and RxJava 2
Recently I needed to monitor some data for work. There are many excellent crawler frameworks available, but I still decided to implement a complete crawler framework from scratch.
For the technology stack I did not build the project on Spring, but chose the more lightweight Vert.x. On the one hand Spring is rather heavyweight; on the other, Vert.x is a lightweight, high-performance, JVM-based framework that is event-driven and asynchronous, builds on the fully asynchronous Java server Netty, and extends it with many other features.
GitHub address: https://github.com/fengzhizi715/NetDiscovery
1. Functions of the crawler framework
The crawler framework consists of a SpiderEngine and Spiders. A SpiderEngine can manage multiple Spiders.
1.1 Spider
A Spider mainly contains several components: a downloader, a queue, a parser, a pipeline, and a pool of proxy IPs. The proxy pool is a separate project I wrote some time ago; because a crawler frequently needs to switch proxy IPs, I integrated it into the framework.
ProxyPool address: https://github.com/fengzhizi715/ProxyPool
The other four components are all interfaces, and the crawler framework has some implementations built in, for example several downloaders, including ones implemented with Vert.x's WebClient, HttpClient, OkHttp3, and Selenium. Developers can choose among them or develop a new downloader to suit their own situation.
The downloader's download method returns a Maybe<Response>.
Inside Spider, the subsequent processing is a chain of calls on this Maybe<Response> object: the Response is converted into a Page object, the Page is parsed, and after parsing a series of pipeline operations is performed.
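To make that flow concrete, here is a minimal sketch in RxJava 2. The Downloader interface follows the description above, while Request, Response, Page, Parser, and Pipeline are reduced to stand-in stubs so the example compiles; every method on them is an assumption for illustration, not the framework's actual API.

import io.reactivex.Maybe;

import java.util.List;

// Stand-in stubs for the framework's types, just so the sketch compiles.
class Request  { String url; Request(String url) { this.url = url; } }
class Response { String content = "<html>...</html>"; }
class Page     { Response response; Object resultItems; }
interface Parser   { void process(Page page); }
interface Pipeline { void process(Object resultItems); }

// The downloader contract described above: download() returns a
// Maybe<Response>, which emits one Response, completes empty, or errors.
interface Downloader {
    Maybe<Response> download(Request request);
}

public class SpiderFlowSketch {
    public static void main(String[] args) {
        Downloader downloader = request -> Maybe.fromCallable(Response::new); // fake fetch
        Parser parser = page -> page.resultItems = "parsed data";
        List<Pipeline> pipelines =
                List.of(items -> System.out.println("pipeline got: " + items));

        downloader.download(new Request("https://example.com"))
                .map(response -> {                    // Response -> Page
                    Page page = new Page();
                    page.response = response;
                    return page;
                })
                .map(page -> {                        // parse the Page
                    parser.process(page);
                    return page;
                })
                .subscribe(
                        page -> pipelines.forEach(p -> p.process(page.resultItems)),
                        Throwable::printStackTrace);
    }
}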
Using RxJava 2 here makes the whole crawler framework look more reactive :)
1.2 SpiderEngine
A SpiderEngine can contain multiple Spiders. Crawlers can be added to the SpiderEngine with addSpider(), and new Spiders can be created and registered with createSpider().
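A short sketch of both registration styles, assuming createSpider() registers and returns a new Spider; the import paths and fluent setters here are guesses, not verified against the project.

// Assumed import paths; check the project for the real package names.
import com.cv4j.netdiscovery.core.Spider;
import com.cv4j.netdiscovery.core.SpiderEngine;

public class RegisterSpiders {
    public static void main(String[] args) {
        SpiderEngine engine = SpiderEngine.create();

        // Style 1: build a Spider yourself, then register it.
        Spider spider = Spider.create().name("news");
        engine.addSpider(spider);

        // Style 2: let the engine create and register a Spider in one step.
        engine.createSpider("prices");

        engine.run();
    }
}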
If SpiderEngine's httpd(port) method is called, every Spider in the SpiderEngine can also be monitored over HTTP.
1.2.1 Get the status of a crawler
http://localhost:{port}/netdiscovery/spider/{spiderName}
Type: GET
1.2.2 Get the status of all crawlers in the SpiderEngine
http://localhost:{port}/netdiscovery/spiders/
Type: GET
1.2.3 Modify the status of a crawler
http://localhost:{port}/netdiscovery/spider/{spiderName}/status
Type: POST
Parameter description:
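As an illustration of the call shape, here is a hedged example using Java 11's built-in HttpClient; the JSON body and the status value 2 are hypothetical placeholders, so consult the project's README for the actual parameters.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hedged example: changing a spider's status over HTTP. The JSON body and
// the meaning of the "status" value are assumptions, not a documented contract.
public class SpiderStatusClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/netdiscovery/spider/news/status"))
                .header("Content-Type", "application/json")
                // Hypothetical body: a numeric code for the target status.
                .POST(HttpRequest.BodyPublishers.ofString("{\"status\": 2}"))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}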
2. An example of using the framework
Create a SpiderEngine, then create three Spiders; each crawler fetches one page at a fixed interval.
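A hedged reconstruction of that program; the fluent repeatRequest() method is an assumption based on the description above, and the spider names and target URLs are just placeholders.

import com.cv4j.netdiscovery.core.Spider;
import com.cv4j.netdiscovery.core.SpiderEngine;

public class TestSpiderEngine {
    public static void main(String[] args) {
        SpiderEngine engine = SpiderEngine.create();

        // Three spiders, each re-fetching its page every 10 seconds.
        // repeatRequest(intervalMillis, url) is an assumed method name.
        engine.addSpider(Spider.create().name("tony1")
                .repeatRequest(10000, "http://www.163.com"));
        engine.addSpider(Spider.create().name("tony2")
                .repeatRequest(10000, "http://www.baidu.com"));
        engine.addSpider(Spider.create().name("tony3")
                .repeatRequest(10000, "http://www.126.com"));

        engine.httpd(8080); // expose the monitoring endpoints described above
        engine.run();
    }
}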
After the above program has been running for a while, open http://localhost:8080/netdiscovery/spiders in a browser.
You can see the status of all three crawlers, returned as formatted JSON.
3. A case study
Blockchain has been getting a lot of attention recently, so I wrote a program that captures the prices of three cryptocurrencies in real time; the latest prices can be obtained by messaging my WeChat official account.
The program is already live: ask my official account and it returns the latest prices of these cryptocurrencies in real time.
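As a sketch of how the components above could serve this case, here is a hypothetical pipeline that prints each parsed price; the class name, the map-based process() method, and the "coin"/"price" keys are all invented for illustration.

import java.util.Map;

// Hypothetical pipeline for the price-monitoring crawler: it assumes the
// parser stored "coin" and "price" entries in the result items. A real
// implementation might forward them to the official account's backend instead.
public class PrintPricePipeline {
    public void process(Map<String, Object> resultItems) {
        System.out.println(resultItems.get("coin") + " price: " + resultItems.get("price"));
    }
}

Registered on the Spider, such a pipeline would run once per parsed page, receiving whatever the parser extracted.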
TODO
Summary
This crawler framework is just getting started, and I have drawn on many excellent crawler frameworks while building it. In the future I plan to add screenshot capture to the framework so that data can be extracted from images, possibly in combination with the cv4j framework. Before Chinese New Year, priority will be given to recognizing login CAPTCHAs in the crawler framework.