Hands-on: Elasticsearch 7.6.1 + jsoup: crawling JD product data and using it

2021-11-29 • Java


Preface

Contents

1. Preliminary preparation

OK, now that the preparations are done, we can get to work!

2. Data preparation

Use jsoup to crawl data from jd.com

First, search for java on JD: https://search.jd.com/Search?keyword=java

Then search for vue: https://search.jd.com/Search?keyword=vue

Then search for apple: https://search.jd.com/Search?keyword=apple

From these three searches it can be observed that the URL has a fixed prefix: https://search.jd.com/Search?keyword= , which will be used later

Then open the developer tools (F12) and click the element picker in the upper left corner to locate the element that holds the product list

Based on these observations, the basic code for fetching and parsing the data can be written as follows
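The fixed prefix can be wrapped in a small URL builder. The following is only a minimal sketch, not code from the original project: the class name JdSearchUrl and the method forKeyword are hypothetical, and percent-encoding the keyword is an added precaution (assumed here so that keywords with spaces or Chinese characters still give a valid URL). It needs Java 10 or later for URLEncoder.encode(String, Charset).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of the original project): builds a JD search URL.
final class JdSearchUrl {

    // Fixed prefix observed in the three searches above.
    static final String BASE_URL = "https://search.jd.com/Search?keyword=";

    // Appends the keyword to the fixed prefix. The keyword is form-encoded,
    // so a space becomes '+' and non-ASCII characters become %XX sequences.
    static String forKeyword(String keyword) {
        return BASE_URL + URLEncoder.encode(keyword, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(forKeyword("java"));
        System.out.println(forKeyword("spring boot"));
    }
}
```

For a plain ASCII keyword such as java the result is identical to the URL typed in the browser; the encoding only matters for keywords containing spaces or non-ASCII characters.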

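import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

/**
 * @Author: Amg
 * @Date: Created in 15:13 2021/04/11
 * @Description: Entity class for product information
 */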
@Data
@NoArgsConstructor
@AllArgsConstructor
public class GoodsContent {

    private String price;

    private String title;

    private String img;
}


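import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.net.URL;
import java.util.ArrayList;
import java.util.List;

/**
 * @Author: Amg
 * @Date: Created in 15:27 2021/04/11
 * @Description: Uses jsoup to parse web page data and extract the product data
 */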
public class ParseHtmlUtil {

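    // This only targets JD search. To use a different URL, inspect that page's structure yourself and adapt the parsing accordingly.
    // Fixed URL prefix; only the search keyword needs to be appended later
    private static final String BASE_URL = "https://search.jd.com/Search?keyword=";

    // Written this way, only the first page of results is ever parsed. To fetch other pages, loop over the page parameter; interested readers can adapt this themselves
    // private static final String BASE_URL = "https://search.jd.com/Search?keyword=java&page=5";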
    
    /**
     * Searches by keyword and returns the parsed data
     * @param keyword   the search keyword
     * @return a list of GoodsContent objects
     */
    public static List<GoodsContent> parseHtml(String keyword) {

        String url = BASE_URL + keyword;

        List<GoodsContent> list = new ArrayList<>();
        try {

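            // Fetch the page as a Document object, with a timeout of 30 s
            final Document document = Jsoup.parse(new URL(url),30000);

            // With this object you can walk the page's tag structure and get the information you want, so the real work is analyzing the page structure

            // This is the element we located earlier, where JD keeps the product information
            final Elements elements = document.getElementById("J_goodsList").getElementsByTag("li");

            // Process the data further and pull out the fields we want
            for (Element element : elements) {

                final String price = element.getElementsByClass("p-price").eq(0).text();
                final String title = element.getElementsByClass("p-name").eq(0).text();
                // The image cannot be obtained by reading the img src attribute directly: a site like JD has so many images that it must lazy-load them, otherwise pages would load very slowly
                // Until the data has loaded, the page may hold only a static placeholder image, and the real image is rendered once the data arrives, so we read the attribute that actually stores the image path
                final String img = element.getElementsByTag("img").eq(0).attr("data-lazy-img");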

                GoodsContent goods = new GoodsContent();
                goods.setImg(img);
                goods.setTitle(title);
                goods.setPrice(price);

                list.add(goods);
            }

        } catch (Exception e) {
            System.err.println("Failed to crawl data, please check the cause...");
            e.printStackTrace();
        }

        return list;
    }
}
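The comment in ParseHtmlUtil mentions that other result pages can be fetched by looping over a page parameter. A minimal sketch of that idea, under stated assumptions, might look like the following. The class JdPageUrls and the method forPages are hypothetical names, and the sketch assumes the page parameter simply takes the numbers passed in; how JD actually maps those numbers to visible result pages is not covered in the article.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the original project): lists the search
// URLs for a range of result pages.
final class JdPageUrls {

    // Same fixed prefix as BASE_URL in ParseHtmlUtil.
    static final String BASE_URL = "https://search.jd.com/Search?keyword=";

    // Builds one URL per page number from firstPage to lastPage, inclusive.
    // Assumption: the "page" query parameter accepts these numbers as-is.
    static List<String> forPages(String keyword, int firstPage, int lastPage) {
        List<String> urls = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            urls.add(BASE_URL + keyword + "&page=" + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Prints the URLs for pages 1 to 3 of the "java" search.
        forPages("java", 1, 3).forEach(System.out::println);
    }
}
```

Each URL in the returned list could then be fetched and parsed the same way parseHtml handles the first page, with the results collected into a single list.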

3. Writing the business logic

At this point ES contains test data, and the test data-retrieval interface returns data

⚠️ Note that the title is now wrapped in a span tag

4. Front-end and back-end separation

With the interface and the data in place, the front end is built with Vue

Use Axios to send a request to the back-end API. To keep things simple, the page number is fixed at 1 and the page size at 10. The main purpose is to verify that it works
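The same check can also be made from the Java side without the front end. The sketch below only builds the request; it is an illustration under assumptions, not the project's code. The base address http://localhost:8080/search and the path layout keyword/pageNo/pageSize are hypothetical, because the article does not show the real endpoint, and the class name SearchApiRequest is made up. It uses java.net.http, available from Java 11.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical smoke-test helper (not part of the original project).
final class SearchApiRequest {

    // Assumed base address and path; replace with the project's real endpoint.
    static final String API_BASE = "http://localhost:8080/search";

    // Builds a GET request for the given keyword, page number, and page size.
    static HttpRequest build(String keyword, int pageNo, int pageSize) {
        URI uri = URI.create(API_BASE + "/" + keyword + "/" + pageNo + "/" + pageSize);
        return HttpRequest.newBuilder(uri).GET().build();
    }

    public static void main(String[] args) {
        // Page 1 with 10 results, matching the values used in the front end.
        HttpRequest request = build("java", 1, 10);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send it, the request can be passed to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) while the back end is running.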

Final result

5. Summary

The project itself is not complex and is easy to get started with. Of course, the more advanced features of ES are not fully covered here, and we still need to keep learning and exploring. If I find more interesting things, I will continue to update and share them

Finally, a word on how to get the source code

WeChat official account: Amg. Reply "ES actual combat" to get all the source code of the project

Thank you for watching. See you next time!

