Hands-on: Elasticsearch 7.6.1 + jsoup, crawling JD product data and putting it to use
Preface
1. Preliminary preparation
OK, now that the preparations are done, we can get to work!
2. Data preparation
Use jsoup to crawl data from jd.com
First, search for java on JD: https://search.jd.com/Search?keyword=java
Then search for vue: https://search.jd.com/Search?keyword=vue
Then search for apple: https://search.jd.com/Search?keyword=apple
Notice that the URL has a fixed prefix: https://search.jd.com/Search?keyword= , which will be used later
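As a minimal sketch of how that fixed prefix can be combined with a keyword: the class name `JdSearchUrl` and the method `forKeyword` are made up for illustration. The keyword is percent-encoded with the standard `URLEncoder`, so keywords containing spaces or Chinese characters still produce a valid URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the original project
class JdSearchUrl {
    // Fixed prefix observed in the three searches above
    static final String BASE_URL = "https://search.jd.com/Search?keyword=";

    // Appends the URL-encoded keyword to the fixed prefix
    static String forKeyword(String keyword) {
        try {
            return BASE_URL + URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(forKeyword("java"));
        System.out.println(forKeyword("spring boot"));
    }
}
```

Note that `URLEncoder` encodes a space as `+`, which is the form-style encoding commonly accepted in query strings.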
Then press F12 to open the developer tools and use the element picker (the arrow icon in the upper-left corner) to locate the product list
Based on these observations, here is the basic code for parsing the page and extracting the data
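import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

/**
 * @Author: Amg
 * @Date: Created in 15:13 2021/04/11
 * @Description: Entity class for product information
 */
@Data
@NoArgsConstructor
@AllArgsConstructor
public class GoodsContent {
    private String price;
    private String title;
    private String img;
}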
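import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.net.URL;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

/**
 * @Author: Amg
 * @Date: Created in 15:27 2021/04/11
 * @Description: Parse web page data with Jsoup to extract the product data
 */
public class ParseHtmlUtil {
    // This only targets JD search. To crawl a different URL, inspect that page's structure and adjust the selectors.
    // Fixed URL prefix; only the keyword needs to be appended
    private static final String BASE_URL = "https://search.jd.com/Search?keyword=";
    // Written this way, only the first page is parsed each time. To fetch other pages, loop over the page parameter; interested readers can adapt this themselves.
    // private static final String BASE_URL = "https://search.jd.com/Search?keyword=java&page=5";

    /**
     * Search by keyword and return the parsed data
     * @param keyword the search keyword
     * @return a list of GoodsContent
     */
    public static List<GoodsContent> parseHtml(String keyword) {
        List<GoodsContent> list = new ArrayList<>();
        try {
            // Encode the keyword so that spaces and non-ASCII characters are valid in the URL
            String url = BASE_URL + URLEncoder.encode(keyword, "UTF-8");
            // Fetch the page and parse it into a Document, with a 30 s timeout
            final Document document = Jsoup.parse(new URL(url), 30000);
            // The Document exposes the page's tag structure, so all that is left is to analyse that structure
            // This is the element we located earlier, where JD keeps the product information
            final Element goodsList = document.getElementById("J_goodsList");
            if (goodsList == null) {
                // The page layout changed or the request was redirected; nothing to parse
                System.err.println("Element J_goodsList not found in " + url);
                return list;
            }
            final Elements elements = goodsList.getElementsByTag("li");
            // Pull out the fields we want from each product item
            for (Element element : elements) {
                final String price = element.getElementsByClass("p-price").eq(0).text();
                final String title = element.getElementsByClass("p-name").eq(0).text();
                // Reading the img src directly does not work: a site like JD has so many images that it lazy-loads them,
                // otherwise the page would load very slowly. Until the data has loaded, the page may only hold a
                // placeholder image, so we read the attribute that stores the real image path instead.
                final String img = element.getElementsByTag("img").eq(0).attr("data-lazy-img");
                GoodsContent goods = new GoodsContent();
                goods.setImg(img);
                goods.setTitle(title);
                goods.setPrice(price);
                list.add(goods);
            }
        } catch (Exception e) {
            System.err.println("Failed to crawl data, please check the cause...");
            e.printStackTrace();
        }
        return list;
    }
}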
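The comments in the code note that only the first page is parsed and leave the paging loop to the reader. A rough sketch of that loop is shown below. It assumes the `page` query parameter from the commented-out URL and a keyword that is already URL-safe; `JdPageUrls` and `pageUrls` are hypothetical names, and how JD actually numbers its pages has not been verified here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original project
class JdPageUrls {
    static final String BASE_URL = "https://search.jd.com/Search?keyword=";

    // Builds one search URL per page, for pages 1..pageCount.
    // Assumes the keyword needs no further encoding (for example "java").
    static List<String> pageUrls(String keyword, int pageCount) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            urls.add(BASE_URL + keyword + "&page=" + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Each URL could be handed to a variant of parseHtml that accepts a full URL
        for (String url : pageUrls("java", 3)) {
            System.out.println(url);
        }
    }
}
```

When crawling several pages in a row, it is usually wise to pause between requests so the site is not hit too quickly.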
3. Writing the business logic
At this point ES contains test data, and the data-retrieval interface returns data when tested
⚠️ Note that the title now carries span tags
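The indexing and search code for this step is not shown in this section, so the following is only a sketch of the two request bodies that could sit behind it. It assumes the span tags come from ES highlighting and that the data goes through the standard REST endpoints (`POST /_bulk` and `POST /<index>/_search`). The index name `jd_goods`, the class `EsRequestBodies`, and the highlight style are hypothetical; the field names follow `GoodsContent`. The original project may well use the Java client instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that only builds JSON strings; it does not talk to ES
class EsRequestBodies {
    // Minimal JSON string escaping: quotes, backslashes and control characters
    static String esc(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Body for POST /_bulk: one action line plus one document line per item,
    // each terminated by a newline. Every item is {title, price, img}.
    static String bulkBody(String index, List<String[]> goods) {
        StringBuilder sb = new StringBuilder();
        for (String[] g : goods) {
            sb.append("{\"index\":{\"_index\":\"").append(esc(index)).append("\"}}\n");
            sb.append("{\"title\":\"").append(esc(g[0]))
              .append("\",\"price\":\"").append(esc(g[1]))
              .append("\",\"img\":\"").append(esc(g[2])).append("\"}\n");
        }
        return sb.toString();
    }

    // Body for POST /<index>/_search: match on title, page through results,
    // and wrap matched terms in span tags via highlighting.
    static String searchBody(String keyword, int pageNo, int pageSize) {
        int from = (pageNo - 1) * pageSize; // ES paging is offset-based
        return "{\"from\":" + from + ",\"size\":" + pageSize + ","
            + "\"query\":{\"match\":{\"title\":\"" + esc(keyword) + "\"}},"
            + "\"highlight\":{"
            + "\"pre_tags\":[\"<span style='color:red'>\"],"
            + "\"post_tags\":[\"</span>\"],"
            + "\"fields\":{\"title\":{}}}}";
    }

    public static void main(String[] args) {
        List<String[]> goods = new ArrayList<>();
        goods.add(new String[] {"Java book", "10.00", "//img.example/a.jpg"});
        System.out.print(bulkBody("jd_goods", goods));
        System.out.println(searchBody("java", 1, 10));
    }
}
```

With highlighting, ES returns the tagged fragments in a separate `highlight` section of each hit, so the back end would have to copy them over the plain `title` before returning the result. That would explain why the title reaches the front end with span tags in it.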
4. Front-end and back-end separation
With the interface and the data in place, the front end uses Vue
Use Axios to send requests to the back-end API. To keep things simple here, the page number is fixed at 1 and the page size at 10; the main goal is to verify that everything works
Final result
5. Summary
The project itself is not complex and is easy to get started with. Of course, the more advanced features of ES are not fully covered here, and there is still plenty to learn and explore. If I come across more interesting uses, I will keep updating and sharing them
Finally, a word on how to get the source code
WeChat official account: Amg. Reply "ES actual combat" to get the full source code of the project
Thank you for watching. See you next time!