An example of building a general crawler framework based on Vert.x and RxJava 2

Recently I needed to monitor some data for business reasons. Although there are many excellent crawler frameworks on the market, I still decided to implement a complete crawler framework from scratch.

In terms of technology selection, I did not build the project on Spring but chose the more lightweight Vert.x. Spring is rather heavyweight, while Vert.x is a lightweight, high-performance framework for the JVM: it is event-driven and asynchronous, builds on the fully asynchronous Java network library Netty, and adds many other features on top.

GitHub address: https://github.com/fengzhizi715/NetDiscovery

1. Functions of the crawler framework

The crawler framework consists of a SpiderEngine and Spiders; a SpiderEngine can manage multiple Spiders.

1.1 Spider

A Spider mainly contains several components: the downloader, the queue, the parser, the pipeline, and a pool of proxy IPs. The proxy pool is a separate project I wrote some time ago; since proxy IPs frequently need to be rotated when running a crawler, I pulled it in here.

ProxyPool address: https://github.com/fengzhizi715/ProxyPool

The other four components are interfaces, and the crawler framework ships with several implementations of each. For example, it has multiple downloaders, including ones implemented with Vert.x's WebClient, HttpClient, OkHttp3, and Selenium. Developers can pick one of these or implement a new downloader to suit their own needs.

The downloader's download() method returns a Maybe<Response>.
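Maybe is RxJava 2's type for a source that emits either a single value or nothing, which fits a one-shot HTTP download well. As a rough sketch (the names and signatures here are my assumptions, not necessarily the framework's exact API), the four pluggable interfaces might look something like this:

    import io.reactivex.Maybe;

    // Placeholder model types so the sketch compiles; the framework's real
    // classes carry much more state (URL, headers, body, extracted fields, ...).
    class Request {}
    class Response {}
    class ResultItems {}
    class Page {
        ResultItems resultItems = new ResultItems();   // filled in by the Parser
    }

    // Rough sketches of the four pluggable components.
    interface Downloader {
        Maybe<Response> download(Request request);     // fetch a page asynchronously
    }

    interface Queue {
        void push(Request request);                    // enqueue a request to crawl
        Request poll(String spiderName);               // take the next request
    }

    interface Parser {
        void process(Page page);                       // extract data from the page
    }

    interface Pipeline {
        void process(ResultItems items);               // persist or forward results
    }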

Inside Spider, a series of subsequent chained calls is built on this Maybe<Response> object: the Response is converted into a Page object, the Page object is parsed, and after parsing a series of Pipeline operations is executed.

Using RxJava 2 here makes the whole crawler framework feel more reactive :)
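Continuing the sketch above, one crawl cycle might be wired up roughly as follows; the toPage() helper and the error handling are illustrative rather than the framework's actual code:

    import io.reactivex.schedulers.Schedulers;
    import java.util.List;

    // Illustrative wiring of one crawl cycle as a single reactive chain,
    // reusing the sketched component interfaces above.
    public class CrawlChain {

        public static void crawl(Downloader downloader, Request request,
                                 Parser parser, List<Pipeline> pipelines) {
            downloader.download(request)
                    .observeOn(Schedulers.io())
                    // 1. turn the raw HTTP response into a Page object
                    .map(CrawlChain::toPage)
                    // 2. parse the page, filling in its result items
                    .map(page -> { parser.process(page); return page; })
                    // 3. hand the parsed results to every pipeline in turn
                    .subscribe(
                            page -> pipelines.forEach(p -> p.process(page.resultItems)),
                            Throwable::printStackTrace);
        }

        // Hypothetical helper: the real framework populates the Page from the
        // Response's body, status code, charset, and so on.
        private static Page toPage(Response response) {
            return new Page();
        }
    }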

1.2 SpiderEngine

A SpiderEngine can contain multiple Spiders. Crawlers can be added to the SpiderEngine with addSpider(), or created and added in one step with createSpider().

If SpiderEngine's httpd(port) method is called, each Spider in the SpiderEngine can also be monitored over HTTP.

1.2.1 Get the status of a crawler

http://localhost:{port}/netdiscovery/spider/{spiderName}

Type: GET

1.2.2 Get the status of all crawlers in the SpiderEngine

http://localhost:{port}/netdiscovery/spiders/

Type: GET

1.2.3 Modify the status of a crawler

http://localhost:{port}/netdiscovery/spider/{spiderName}/status

Type: POST

Parameter Description:
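As a hedged illustration, assuming the engine listens on port 8080, a crawler named "spider1", and a JSON body of the form {"status": ...} (the exact status codes should be checked against the framework's documentation), these endpoints can be exercised with Java 11's built-in HTTP client:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SpiderMonitorClient {

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // 1.2.1: query the status of a single crawler ("spider1" is a
            // hypothetical spider name)
            HttpRequest get = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8080/netdiscovery/spider/spider1"))
                    .GET()
                    .build();
            System.out.println(client.send(get, HttpResponse.BodyHandlers.ofString()).body());

            // 1.2.3: change a crawler's status; the JSON payload is an assumption,
            // check the framework's documentation for the actual status codes
            HttpRequest post = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8080/netdiscovery/spider/spider1/status"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString("{\"status\": 2}"))
                    .build();
            System.out.println(client.send(post, HttpResponse.BodyHandlers.ofString()).statusCode());
        }
    }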

2. An example of using the framework

Create a SpiderEngine, then create three Spiders; each crawler crawls one page at a fixed interval.
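A minimal sketch of such a program follows. The addSpider() and httpd() calls come from the description above; the package paths, the Spider.create() fluent builder, and the repeatRequest()/initialDelay() methods are assumptions based on my reading of the project, and the names, URLs, and 10-second interval are purely illustrative:

    // The package paths and the repeatRequest()/initialDelay() methods below
    // are assumptions and may differ across versions of the framework.
    import com.cv4j.netdiscovery.core.Spider;
    import com.cv4j.netdiscovery.core.SpiderEngine;

    public class Example {

        public static void main(String[] args) {
            SpiderEngine engine = SpiderEngine.create();

            // Three crawlers, each re-fetching a single page at a fixed
            // interval; names, URLs, and the 10-second period are illustrative.
            engine.addSpider(Spider.create()
                    .name("spider1")
                    .repeatRequest(10000, "http://www.163.com")
                    .initialDelay(10000));

            engine.addSpider(Spider.create()
                    .name("spider2")
                    .repeatRequest(10000, "https://www.baidu.com")
                    .initialDelay(10000));

            engine.addSpider(Spider.create()
                    .name("spider3")
                    .repeatRequest(10000, "https://www.126.com")
                    .initialDelay(10000));

            engine.httpd(8080);   // expose the monitoring endpoints on port 8080
            engine.run();
        }
    }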

After the program above has run for a while, enter http://localhost:8080/netdiscovery/spiders in a browser.

We can see the status of the three crawlers, returned as formatted JSON.

Case

Recently, blockchain has been drawing more and more attention, so I built a program on top of this framework that captures the prices of three digital currencies in real time. The program is already online; the latest prices can be obtained by querying my WeChat official account.

TODO

Summary

This crawler framework is just getting started, and I consulted many excellent crawler frameworks while building it. Going forward, I plan to add screenshot capture to the framework so that data can be extracted from images, possibly in combination with the cv4j framework. Before Chinese New Year, priority will go to supporting recognition of login CAPTCHAs in the crawler framework.

That is the whole content of this article. I hope it is helpful to your study, and I hope you will support Programming Tips.
