Java – how to get content in a web crawl
Hi! I'm trying to implement this pseudocode for a spider algorithm that explores the web, and I need some ideas about the next step of the pseudocode: "use SpiderLeg to fetch content". I have a SpiderLeg class elsewhere with a method that gets all the URLs of a web page, but how do I use it in this class?
// Crawl the web and print out all URLs the spider visits.
public List<String> crawl(String url, String keyword) throws IOException {
    String currentUrl;
    // While the list of unvisited URLs is not empty...
    while (!unvisited.isEmpty()) {       // was: unvisited != null, which never becomes false
        // Take the next URL from the list and remove it so it is not revisited.
        currentUrl = unvisited.remove(0);
        // Using SpiderLeg to fetch content.
        SpiderLeg leg = new SpiderLeg();
    }
    return unvisited;
}
Cheers, I'll try! I tried this data structure without using a queue. It almost works, but the program doesn't stop when searching for certain words.
When it finds the word, it only displays a link to one web page, not all the specific URLs where the word occurs. Do you know if that can be done?
private static final int MAX_PAGES_TO_SEARCH = 10;
private Set<String> pagesVisited = new HashSet<String>();
private List<String> pagesToVisit = new LinkedList<String>();

public void crawl(String url, String searchWord) {
    while (this.pagesVisited.size() < MAX_PAGES_TO_SEARCH) {
        String currentUrl;
        SpiderLeg leg = new SpiderLeg();
        if (this.pagesToVisit.isEmpty()) {
            currentUrl = url;
            this.pagesVisited.add(url);
        } else {
            currentUrl = this.nextUrl();
        }
        leg.getHyperlink(currentUrl);
        boolean success = leg.searchForWord(searchWord);
        if (success) {
            System.out.println(String.format("**Success** Word %s found at %s", searchWord, currentUrl));
            break;
        }
        this.pagesToVisit.addAll(leg.getLinks());
    }
    System.out.println("\n**Done** Visited " + this.pagesVisited.size() + " web page(s)");
}
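To report every page that contains the word rather than stopping at the first hit, one option is to record matches in a list instead of `break`-ing. This is only a sketch: page contents and links are simulated with in-memory maps so it runs standalone; in the real crawler those lookups would be the SpiderLeg calls (`getHyperlink` / `searchForWord` / `getLinks`) from the snippet above.

```java
import java.util.*;

// Sketch: collect every page where the word appears, keep crawling until
// the page budget is exhausted. pageText and links are stand-ins for the
// network; the real code would fetch pages with SpiderLeg instead.
public class CollectMatches {
    static final int MAX_PAGES = 10;

    static List<String> findWord(String seed, String word,
                                 Map<String, String> pageText,
                                 Map<String, List<String>> links) {
        Set<String> visited = new LinkedHashSet<>();
        Queue<String> toVisit = new ArrayDeque<>(List.of(seed));
        List<String> matches = new ArrayList<>();
        while (!toVisit.isEmpty() && visited.size() < MAX_PAGES) {
            String url = toVisit.poll();
            if (!visited.add(url)) continue;          // skip already-visited pages
            if (pageText.getOrDefault(url, "").contains(word)) {
                matches.add(url);                     // record the hit, do not break
            }
            for (String next : links.getOrDefault(url, List.of())) {
                if (!visited.contains(next)) toVisit.add(next);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, String> text = Map.of("a", "crawler", "b", "spider crawler", "c", "nothing");
        Map<String, List<String>> web = Map.of("a", List.of("b", "c"));
        System.out.println(findWord("a", "crawler", text, web));  // prints [a, b]
    }
}
```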
Solution
The crawling algorithm is essentially breadth-first search. You maintain a queue of unvisited URLs; each time you visit a URL taken from the queue, you enqueue any unvisited URLs found by the HTML parser (SpiderLeg).
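The queue-plus-visited-set loop above can be sketched as follows. To keep the example self-contained, the web is simulated with an in-memory link graph; in your crawler the `links` lookup would instead be `leg.getHyperlink(url)` followed by `leg.getLinks()`.

```java
import java.util.*;

// Breadth-first crawl sketch: a FIFO queue of pages to visit and a set of
// pages already visited. The Map stands in for fetching a page's links.
public class BfsCrawl {
    static List<String> crawl(String seed, Map<String, List<String>> links) {
        Set<String> visited = new LinkedHashSet<>();
        Queue<String> toVisit = new ArrayDeque<>();
        toVisit.add(seed);
        while (!toVisit.isEmpty()) {
            String current = toVisit.poll();
            if (!visited.add(current)) continue;      // skip already-visited pages
            for (String next : links.getOrDefault(current, List.of())) {
                if (!visited.contains(next)) toVisit.add(next);
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c", "d"),
            "c", List.of("a"));
        System.out.println(crawl("a", web));          // prints [a, b, c, d]
    }
}
```

Using a real `Queue` (here `ArrayDeque`) instead of `List.get(0)` makes the FIFO intent explicit, and the visited set guarantees termination even when pages link back to each other.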
The conditions for adding a URL to the queue are up to you, but usually you limit the distance (number of link hops) between the URL and the seed URL as a stopping point, so you don't traverse the web forever. These rules can also take the topic you are searching for into account, so that you only add relevant URLs.
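One way to implement that distance cutoff is to enqueue each URL together with its hop count from the seed, and stop expanding once the count reaches a limit. Again a sketch over a simulated link graph; `MAX_DEPTH` is an illustrative parameter, not part of the original code.

```java
import java.util.*;

// Depth-limited BFS: each queued URL carries its distance from the seed,
// and links beyond MAX_DEPTH hops are never enqueued.
public class DepthLimitedCrawl {
    static final int MAX_DEPTH = 1;   // example cutoff: seed plus direct links

    static Set<String> crawl(String seed, Map<String, List<String>> links) {
        Set<String> visited = new LinkedHashSet<>();
        Queue<Map.Entry<String, Integer>> toVisit = new ArrayDeque<>();
        toVisit.add(Map.entry(seed, 0));
        while (!toVisit.isEmpty()) {
            Map.Entry<String, Integer> entry = toVisit.poll();
            String url = entry.getKey();
            int depth = entry.getValue();
            if (!visited.add(url)) continue;          // skip already-visited pages
            if (depth >= MAX_DEPTH) continue;         // stop expanding past the cutoff
            for (String next : links.getOrDefault(url, List.of())) {
                if (!visited.contains(next)) toVisit.add(Map.entry(next, depth + 1));
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "seed", List.of("one"),
            "one", List.of("two"),
            "two", List.of("three"));
        System.out.println(crawl("seed", web));       // prints [seed, one] — "two" is 2 hops away
    }
}
```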