A Simple Image Crawler Based on RxJava2

Since October this year, I have been feeding pictures into TensorFlow to train models, which requires a large number of pictures. At first I wrote only a small HttpClient program to fetch them. Later, to make it more general, I turned it into a simple image crawler. It can download a single picture, multiple pictures, all pictures on a web page, or all pictures on multiple web pages.

GitHub address: https://github.com/fengzhizi715/PicCrawler

This crawler uses HttpClient, RxJava2, and some Java 8 features. It supports simple customization, such as setting the User-Agent, Referer, and cookies.

1、 Download and install:

For Java projects built with Gradle, jcenter is not used by default, so the jcenter repository must be configured in the build.gradle of the corresponding module.

Gradle:

Maven:

2、 Usage:

2.1 Downloading a single picture

1. Basic usage

Here, timeout() sets the timeout of the network request. fileStrategy() specifies the directory where files are stored, the file format, and the strategy used to generate file names. repeat() sets how many times the request for the picture is repeated.
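
To make the roles of the timeout and the repeat count concrete, here is a minimal JDK-only sketch. It is not PicCrawler's code: `RepeatFetchSketch`, `Fetcher`, `HTTP`, and `fetchRepeatedly` are hypothetical names chosen for illustration, and I assume that a failed attempt is simply skipped.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

final class RepeatFetchSketch {

    /** Hypothetical abstraction so the repeat loop can be exercised without a network. */
    interface Fetcher {
        byte[] fetch(String url, int timeoutMillis) throws IOException;
    }

    /** JDK-only fetcher: the same timeout is applied to connecting and to reading. */
    static final Fetcher HTTP = (url, timeoutMillis) -> {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        try (InputStream in = conn.getInputStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            conn.disconnect();
        }
    };

    /** Requests the same URL `times` times and returns the body of every successful attempt. */
    static List<byte[]> fetchRepeatedly(Fetcher fetcher, String url, int timeoutMillis, int times) {
        List<byte[]> bodies = new ArrayList<>();
        for (int i = 0; i < times; i++) {
            try {
                bodies.add(fetcher.fetch(url, timeoutMillis));
            } catch (IOException e) {
                // Assumption: a failed or timed-out attempt is skipped and the loop continues.
            }
        }
        return bodies;
    }
}
```

Repeating a request is mainly useful for URLs that return a different picture each time, such as a verification code endpoint; for a static picture, a repeat count of 1 is enough.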

PicCrawler supports several file generation strategies, such as random file names, file names that count up from 1, and a specified file name.
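
The three naming strategies can be sketched roughly as follows. This is an illustration of the idea only, not PicCrawler's actual types: `FileNameSketch`, `NameStrategy`, and the three factory methods are hypothetical names.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicInteger;

final class FileNameSketch {

    /** Hypothetical interface: produces the next file name (without the directory). */
    interface NameStrategy {
        String nextName(String extension);
    }

    /** Random names: a fresh UUID for every file. */
    static NameStrategy random() {
        return ext -> UUID.randomUUID() + "." + ext;
    }

    /** Names that count up from 1: 1.png, 2.png, ... */
    static NameStrategy autoIncrement() {
        AtomicInteger counter = new AtomicInteger(0);
        return ext -> counter.incrementAndGet() + "." + ext;
    }

    /** A caller-specified name; every call returns the same name. */
    static NameStrategy fixed(String baseName) {
        return ext -> baseName + "." + ext;
    }
}
```

The counter is an AtomicInteger so that the incrementing strategy still hands out unique names when several downloads finish on different threads.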

The following figure shows the program downloading a verification code (CAPTCHA) image 200 times.

2. Download using RxJava

3. Download using RxJava and post-process the downloaded pictures

In the Consumer, you can perform follow-up processing on the downloaded files.

2.2 Downloading multiple pictures

2.3 Downloading all pictures of a web page

I used the program above to download the pictures on my Jianshu home page.

2.4 Downloading all pictures of multiple web pages

This downloads the pictures on my personal Jianshu home page and the pictures on Developer Headlines.

3、 Partial source code analysis

3.1 Downloading all pictures of a web page

The downloadWebPageImages() method downloads all the images found at a URL.

downloadWebPageImages() works in three steps: creating a network request, parsing the image paths contained in the page, and downloading those images.

In the first step, HttpClient is used to create the network request.

The second step converts the returned response to a String and uses jsoup to extract all image links.

jsoup is a Java HTML parser that can parse a URL or a piece of HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
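
To show what the second step has to produce, here is a deliberately simplified JDK-only stand-in. It uses a regular expression instead of jsoup, so it is not how PicCrawler does it and it will miss cases that a real HTML parser handles. `ImageLinkSketch` and `extractImageUrls` are hypothetical names. The part worth noticing is that relative image paths must be resolved against the page URL before they can be downloaded.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImageLinkSketch {

    // Matches src="..." or src='...' inside an <img ...> tag.
    // A simplification for illustration, not a real HTML parser.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns the absolute URL of every <img src> found in the HTML text. */
    static List<String> extractImageUrls(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            String src = m.group(2).trim();
            if (src.isEmpty()) {
                continue;
            }
            try {
                // Relative paths are resolved against the page URL; absolute URLs pass through.
                urls.add(base.resolve(src).toString());
            } catch (IllegalArgumentException e) {
                // Skip values that are not valid URIs (for example, ones containing spaces).
            }
        }
        return urls;
    }
}
```

In real code, a parser such as jsoup is the better choice, because it copes with malformed markup and unquoted attributes that a regular expression does not.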

The third step downloads these pictures using Java 8's CompletableFuture. CompletableFuture is a class added in Java 8 for asynchronous processing. Unlike a traditional Future, whose result can only be obtained by blocking on get() or polling, it supports callbacks and the composition of multiple asynchronous tasks.
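
A minimal sketch of this third step, assuming one asynchronous task per image URL on a fixed-size thread pool. `ParallelDownloadSketch`, `downloadAll`, and the `download` function are hypothetical; `download` stands in for "fetch the bytes and save them to disk", and the real implementation may be organized differently.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

final class ParallelDownloadSketch {

    /**
     * Runs `download` for every URL on a fixed thread pool and returns the results
     * in the same order as the input list.
     */
    static <R> List<R> downloadAll(List<String> urls, Function<String, R> download, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Start one asynchronous task per URL; starting them does not block the caller.
            List<CompletableFuture<R>> futures = urls.stream()
                    .map(url -> CompletableFuture.supplyAsync(() -> download.apply(url), pool))
                    .collect(Collectors.toList());
            // allOf completes once every task has completed; join() waits for that point
            // and throws CompletionException if any task failed.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }
}
```

Passing an explicit executor keeps blocking network I/O off the common ForkJoinPool, which is what supplyAsync uses when no executor is given.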

3.2 Downloading all pictures of multiple web pages

The downloadWebPageImages() method also accepts a List containing the addresses of multiple web pages.

Here, ParallelFlowable is used: calling parallel() on a Flowable converts it into a ParallelFlowable, so the pages can be processed in parallel.

Summary

PicCrawler is a simple image crawler that basically meets my current needs. If new requirements come up, I will keep adding features.

While building PicCrawler, I also wrote a ProxyPool library for obtaining a pool of usable proxies. It is also implemented with RxJava2.
