Java Learning: Web Crawlers

0x00 Preface

Having wrapped up the basics, I wrote a crawler as a practice project; there is a lot to learn from it.

0x01 Crawler Structure and Concepts

The more formal name for crawling is data collection; in English a crawler is often called a spider. It is a program that automatically collects data from the Internet. What a crawler does is simulate a normal network request, just like clicking a link on a website triggers a request.

Crawlers are also useful in penetration testing. For example, we may need to harvest a site's external links in batches, or collect the usernames and phone numbers of forum posters. Doing this by hand would be far too slow.

Generally speaking, the crawler workflow is: make the request, filter the response (that is, extract the data), and then store the extracted content.
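In code, that skeleton looks roughly like this (a minimal sketch using only the JDK; example.com, the <title> extraction, and the output file name are placeholders, and the rest of this article uses HttpClient and jsoup to do each step properly):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class CrawlerSkeleton {
    public static void main(String[] args) throws Exception {
        // 1. Request: fetch the raw HTML of a page
        HttpURLConnection conn = (HttpURLConnection) new URL("https://example.com/").openConnection();
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        }

        // 2. Extract: filter out the piece of data we care about (here, the <title> text)
        String page = html.toString();
        int start = page.indexOf("<title>");
        int end = page.indexOf("</title>");
        String title = (start != -1 && end > start) ? page.substring(start + "<title>".length(), end) : "";

        // 3. Store: persist the extracted data (here, append it to a local file)
        Files.write(Paths.get("titles.txt"),
                (title + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}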

0x02 Crawler Requests

Maven dependency:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.12</version>
</dependency>

We will use the Xianzhi forum (xz.aliyun.com) for the demonstrations.

GET request

package is.text;


import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class http1get {
    public static void main(String[] args) {
        CloseableHttpClient client = HttpClients.createDefault(); // create the HttpClient object
        HttpGet httpGet = new HttpGet("https://xz.aliyun.com/?page=1");  // create the GET request object
        CloseableHttpResponse response = null;
        try {
            response = client.execute(httpGet);   // send the GET request
            if (response.getStatusLine().getStatusCode() == 200) {
                String s = EntityUtils.toString(response.getEntity(), "utf-8");
                System.out.println(s);
                System.out.println(httpGet);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    response.close();   // guard against an NPE when the request itself failed
                }
                client.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}


Method resolution:

createDefault
public static CloseableHttpClient createDefault()
Creates a CloseableHttpClient instance with the default configuration.


GET request with parameters:

package is.text;


import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.net.URISyntaxException;

public class http1get {
    public static void main(String[] args) throws URISyntaxException {
        CloseableHttpClient client = HttpClients.createDefault(); // create the HttpClient object
        URIBuilder uriBuilder = new URIBuilder("https://xz.aliyun.com/");   // build the address with URIBuilder
        uriBuilder.setParameter("page", "2");    // set the query parameter
        HttpGet httpGet = new HttpGet(uriBuilder.build());  // create the GET request object
//        resulting request URL: https://xz.aliyun.com/?page=2
        CloseableHttpResponse response = null;
        try {
            response = client.execute(httpGet);   // send the GET request
            if (response.getStatusLine().getStatusCode() == 200) {
                String s = EntityUtils.toString(response.getEntity(), "utf-8");
                System.out.println(s);
                System.out.println(httpGet);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
                client.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

POST request


import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import org.apache.http.util.EntityUtils;


import java.io.IOException;


public class httppost {
    public static void main(String[] args) {
        CloseableHttpClient client = HttpClients.createDefault();
        HttpPost httpPost = new HttpPost("https://xz.aliyun.com/");
        CloseableHttpResponse response = null;
        try {
            response = client.execute(httpPost);
            String s = EntityUtils.toString(response.getEntity());
            System.out.println(s);
            System.out.println(httpPost);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

When no parameters are carried, GET and POST requests look almost identical. The only difference is the request object: GET uses HttpGet, while POST uses HttpPost.

POST request with parameters

package is.text;

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;

import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

import java.util.ArrayList;
import java.util.List;

public class httpparams {
    public static void main(String[] args) throws IOException {
        CloseableHttpClient client = HttpClients.createDefault(); // create the HttpClient object
        HttpPost httpPost = new HttpPost("https://xz.aliyun.com/");  // create the POST request object
        List<NameValuePair> params = new ArrayList<NameValuePair>();  // declare a list to hold the form parameters
        params.add(new BasicNameValuePair("page", "3"));
        UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "utf-8"); // create the form entity from the params
        httpPost.setEntity(formEntity);   // attach the form body to the POST request
        CloseableHttpResponse response = client.execute(httpPost);
        String s = EntityUtils.toString(response.getEntity());
        System.out.println(s);
        System.out.println(s.length());
        System.out.println(httpPost);
    }
}

Connection pool

Creating a new HttpClient for every request means objects are constantly created and destroyed. A connection pool avoids this overhead.

Create a connection pool object:


PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();

Common methods:

PoolingHttpClientConnectionManager class

public void setMaxTotal(int max)
        Sets the maximum total number of connections.

public void setDefaultMaxPerRoute(int max)
        Sets the maximum number of concurrent connections per host (route).

HttpClients class

Common methods:

createDefault()
        Creates a CloseableHttpClient instance with the default configuration.

custom()
        Creates a builder object for building customized CloseableHttpClient instances.

Connection pool example:

package is.text;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class PoolHttpGet {
    public static void main(String[] args) {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(100); // maximum total number of connections
        cm.setDefaultMaxPerRoute(100);   // maximum concurrent connections per host

        doGet(cm);
        doGet(cm);
    }

    private static void doGet(PoolingHttpClientConnectionManager cm) {
        // build a client that takes its connections from the pool instead of creating its own
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        HttpGet httpGet = new HttpGet("http://www.baidu.com");   // the URL needs a scheme, otherwise execute() fails
        try {
            CloseableHttpResponse response = httpClient.execute(httpGet);
            String s = EntityUtils.toString(response.getEntity(), "utf-8");
            System.out.println(s.length());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

HttpClient request configuration

package is.text;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class gethttp1params {
    public static void main(String[] args) throws IOException {
        CloseableHttpClient client = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet("http://www.baidu.com");
        RequestConfig config = RequestConfig.custom().setConnectTimeout(1000) // maximum time to establish the connection
                .setConnectionRequestTimeout(500) // maximum time to obtain a connection from the pool
                .setSocketTimeout(500).build();  // maximum time to wait for data transfer

        httpGet.setConfig(config);
        CloseableHttpResponse response = client.execute(httpGet);
        String s = EntityUtils.toString(response.getEntity());
        System.out.println(s);
    }
}

0x03 Data Extraction

jsoup

jsoup is a Java HTML parser that can parse HTML directly from a URL or from HTML text. It provides a very convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.

The main things jsoup can do are: parse HTML from a URL, file, or string; find and extract data using DOM traversal or CSS selectors; and manipulate HTML elements, attributes, and text.

Let's start with code that crawls the forum's title:

package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Test;
import java.net.URL;

public class JsoupTest1 {
    @Test
    public void testUrl() throws Exception {
        Document doc = Jsoup.parse(new URL("https://xz.aliyun.com/"), 10000); // request URL and timeout in ms
        String title = doc.getElementsByTag("title").first().text(); // get the content of the <title> tag
        System.out.println(title);
    }
}

Here, first() returns the first matched element and text() returns the element's text content.

Finding elements by DOM traversal

getElementById	find an element by its id
getElementsByTag	find elements by tag name
getElementsByClass	find elements by class name
getElementsByAttribute	find elements by attribute
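A minimal sketch of these four lookups against an inline HTML string (the HTML, id, and class names are invented just for the example):

package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DomMethodsSketch {
    public static void main(String[] args) {
        // invented HTML just to exercise the four lookup methods
        String html = "<div id='city_bj' class='city_con' data-flag='abc'>"
                + "<span class='s_name'>Beijing</span></div>";
        Document doc = Jsoup.parse(html);

        System.out.println(doc.getElementById("city_bj").text());                  // by id
        System.out.println(doc.getElementsByTag("span").first().text());           // by tag
        System.out.println(doc.getElementsByClass("s_name").first().text());       // by class
        System.out.println(doc.getElementsByAttribute("data-flag").first().attr("class")); // by attribute
    }
}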


Crawling a Xianzhi forum article

package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.junit.Test;

import java.io.File;
import java.io.IOException;
import java.net.URL;

public class HttpDomTest {
    @Test
    public void TestDom() throws IOException {
        Document doc = Jsoup.parse(new URL("https://xz.aliyun.com/t/8091"), 10000);
        String topicContent = doc.getElementById("topic_content").text();
        String title = doc.getElementsByClass("content-title").first().text();
        System.out.println("title: " + title);
        System.out.println("topic_content: " + topicContent);
    }
}

Crawling all articles across 10 pages

Getting data from an element:

1. Get the element's id
2. Get the element's className
3. Get the value of an attribute with attr
4. Get all attributes with attributes
5. Get the text content with text
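A small sketch of these accessors on a single element (the HTML is a made-up stand-in for a post link); after it comes the code that collects every article link on the front page:

package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ElementDataSketch {
    public static void main(String[] args) {
        // invented HTML standing in for a post link on the forum index
        String html = "<a id='link1' class='topic-title' href='/t/8091'>Sample post</a>";
        Element link = Jsoup.parse(html).getElementsByTag("a").first();

        System.out.println(link.id());          // id            -> link1
        System.out.println(link.className());   // class name    -> topic-title
        System.out.println(link.attr("href"));  // one attribute -> /t/8091
        System.out.println(link.attributes());  // all attributes
        System.out.println(link.text());        // text content  -> Sample post
    }
}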

package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.junit.Test;

import java.net.URL;
import java.util.List;

public class HttpDomTest10 {
    @Test
    public void xianzhi10test() throws Exception {
        String url = "https://xz.aliyun.com";
        Document doc = Jsoup.parse(new URL(url), 10000);
        Elements elements = doc.getElementsByClass("topic-title"); // every post title link on the page
        List<String> hrefs = elements.eachAttr("href");            // collect the href of each link
        for (String s : hrefs) {
            try {
                Document article = Jsoup.parse(new URL(url + s), 100000);
                String topicContent = article.getElementById("topic_content").text();
                String title = article.getElementsByClass("content-title").first().text();
                System.out.println(title);
                System.out.println(topicContent);
            } catch (Exception e) {
                System.out.println("Error crawling " + url + s + ": " + e);
            }
        }
    }
}


This successfully crawls the content of every article on the front page.

Since we can crawl one page, we can simply wrap the request in a for loop over the page numbers, and crawling 10 pages is done.

package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.junit.Test;

import java.net.URL;
import java.util.List;

public class HttpdomTEST2 {
    @Test
    public void xianzhi10test() throws Exception {
        String url = "https://xz.aliyun.com/";

        for (int i = 1; i <= 10; i++) {          // iterate over pages 1 to 10
            String requestUrl = "https://xz.aliyun.com/?page=" + i;
            Document doc = Jsoup.parse(new URL(requestUrl), 10000);
            Elements elements = doc.getElementsByClass("topic-title");
            List<String> hrefs = elements.eachAttr("href");
            for (String s : hrefs) {
                try {
                    Document article = Jsoup.parse(new URL(url + s), 100000);
                    String topicContent = article.getElementById("topic_content").text();
                    String title = article.getElementsByClass("content-title").first().text();
                    System.out.println(title);
                    System.out.println(topicContent);
                } catch (Exception e) {
                    System.out.println("Error crawling " + url + s + ": " + e);
                }
            }
        }
    }
}

The crawler collects the article links on each page and then requests every one of them. With a dozen or so articles per page across ten pages, that is a lot of requests; a single thread is far too slow, so we crawl with multiple threads.

Multi-threaded crawling with a configurable thread count and page count

Implementation class:

package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.net.URL;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class Climbimpl implements Runnable {
    private String url;
    private int pages;

    Lock lock = new ReentrantLock();

    public Climbimpl(String url, int pages) {
        this.url = url;
        this.pages = pages;
    }

    public void run() {
        // Note: locking the whole run() means the threads execute one after another,
        // and every thread crawls the same set of pages.
        lock.lock();

        System.out.println(this.pages);
        for (int i = 1; i <= this.pages; i++) {
            try {
                String requestUrl = this.url + "?page=" + i;
                Document doc = Jsoup.parse(new URL(requestUrl), 10000);
                Elements elements = doc.getElementsByClass("topic-title");
                List<String> hrefs = elements.eachAttr("href");
                for (String s : hrefs) {
                    try {
                        Document article = Jsoup.parse(new URL(this.url + s), 100000);
                        String topicContent = article.getElementById("topic_content").text();
                        String title = article.getElementsByClass("content-title").first().text();
                        System.out.println(title);
                        System.out.println(topicContent);
                    } catch (Exception e) {
                        System.out.println("Error crawling " + this.url + s + ": " + e);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        lock.unlock();
    }
}

main:

package Jsoup;
public class TestClimb {
    public static void main(String[] args) {
        int Threadlist_num = 50; // number of threads
        String url = "https://xz.aliyun.com/";  // target URL
        int pages = 10; // number of pages to crawl

        Climbimpl climbimpl = new Climbimpl(url, pages);
        for (int i = 0; i < Threadlist_num; i++) {
            new Thread(climbimpl).start();
        }
    }
}
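One thing worth noting: because run() holds the lock for its whole body, the 50 threads above actually execute one at a time, and each of them crawls the same 10 pages. An alternative (not from the original article, just a sketch reusing the same jsoup calls) is to submit one task per page to a fixed thread pool, so the pages are divided among the workers:

package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.net.URL;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PagePoolClimb {
    public static void main(String[] args) {
        String url = "https://xz.aliyun.com/";
        int pages = 10;                                             // pages to crawl
        ExecutorService pool = Executors.newFixedThreadPool(10);    // at most one worker per page

        for (int i = 1; i <= pages; i++) {
            final int page = i;
            pool.submit(() -> {
                try {
                    Document doc = Jsoup.parse(new URL(url + "?page=" + page), 10000);
                    Elements titles = doc.getElementsByClass("topic-title");
                    List<String> hrefs = titles.eachAttr("href");
                    for (String s : hrefs) {
                        Document article = Jsoup.parse(new URL(url + s), 10000);
                        System.out.println(article.getElementsByClass("content-title").first().text());
                        System.out.println(article.getElementById("topic_content").text());
                    }
                } catch (Exception e) {
                    System.out.println("Error crawling page " + page + ": " + e);
                }
            });
        }
        pool.shutdown();   // stop accepting new tasks; already-submitted pages keep running
    }
}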

The select() selector

tagname: find elements by tag, e.g. span
#id: find elements by ID, e.g. #city_bj
.class: find elements by class name, e.g. .class_a
[attribute]: find elements by attribute, e.g. [abc]
[attr=value]: find elements by attribute value, e.g. [class=s_name]

Code examples:

Find elements by tag:

Elements span = document.select("span");

Find an element by id:

String str = document.select("#city_bj").text();

Find elements by class name:

str = document.select(".class_a").text();

Find elements by attribute:

str = document.select("[abc]").text();

Find elements by attribute value:

str = document.select("[class=s_name]").text();

Tag + attribute combination:

str = document.select("span[abc]").text();

Any combination:

str = document.select("span[abc].s_name").text();

Find specific direct children of a parent element:

str = document.select(".city_con > ul > li").text();

Find all direct children of a parent element:

str = document.select(".city_con > *").text();
 

0x04 Conclusion

Java crawlers rely heavily on jsoup, which covers essentially everything a crawler needs for parsing and extracting data.
