Java – how to fetch page content in a web crawler

Hi! I'm trying to implement this pseudocode for a spider algorithm that explores the web, and I need some ideas for the next step of the pseudocode: "use SpiderLeg to fetch content". I have a SpiderLeg class in another file with a method that gets all the URLs of a web page, but I don't know how to use it from this class.

// method to crawl the web and print out all URLs that the spider visits
public List<String> crawl(String url, String keyword) throws IOException {
    String currentUrl;
    // while the list of unvisited URLs is not empty
    while (!unvisited.isEmpty()) {
        // take the next URL off the list
        currentUrl = unvisited.remove(0);
        // use SpiderLeg to fetch the content of that page
        SpiderLeg leg = new SpiderLeg();
    }
    return unvisited;
}

Cheers, I'll try that! I also tried the data structure below without using a queue. It almost works, but the program doesn't stop when searching for some words.

And when it does find the word, it only displays the link to that page, not all of the specific URLs where the word appears. Is there a way to do that?

private static final int MAX_PAGES_TO_SEARCH = 10;
private Set<String> pagesVisited = new HashSet<String>();
private List<String> pagesToVisit = new LinkedList<String>();

public void crawl(String url, String searchWord)
{
    while (this.pagesVisited.size() < MAX_PAGES_TO_SEARCH)
    {
        String currentUrl;
        SpiderLeg leg = new SpiderLeg();
        if (this.pagesToVisit.isEmpty())
        {
            currentUrl = url;
            this.pagesVisited.add(url);
        }
        else
        {
            currentUrl = this.nextUrl();
        }
        leg.getHyperlink(currentUrl);
        boolean success = leg.searchForWord(searchWord);
        if (success)
        {
            System.out.println(String.format("**Success** Word %s found at %s", searchWord, currentUrl));
            break;
        }
        this.pagesToVisit.addAll(leg.getLinks());
    }
    System.out.println("\n**Done** Visited " + this.pagesVisited.size() + " web page(s)");
}
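
A possible tweak for that last question, sketched under the assumption that SpiderLeg's getHyperlink, searchForWord and getLinks behave as used above and that nextUrl() returns the next unvisited URL and marks it visited: instead of breaking on the first match, keep crawling until the page limit is reached and record every URL where the word was found.

// Sketch only: collect every URL where the word appears instead of stopping at the first hit.
public List<String> crawlAll(String url, String searchWord)
{
    List<String> matches = new LinkedList<String>();
    this.pagesToVisit.add(url);
    while (this.pagesVisited.size() < MAX_PAGES_TO_SEARCH && !this.pagesToVisit.isEmpty())
    {
        String currentUrl = this.nextUrl();          // next unvisited URL (assumed to mark it visited)
        SpiderLeg leg = new SpiderLeg();
        leg.getHyperlink(currentUrl);                // fetch and parse the page
        if (leg.searchForWord(searchWord))
        {
            matches.add(currentUrl);                 // record the hit, but keep crawling
        }
        this.pagesToVisit.addAll(leg.getLinks());    // queue newly discovered links
    }
    System.out.println("**Done** Found the word on " + matches.size() + " page(s): " + matches);
    return matches;
}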

Solution

The crawling algorithm is essentially breadth-first search. You need to maintain a queue of unvisited URLs. Each time you take a URL off the queue and visit it, you queue any unvisited URLs that the HTML parser (SpiderLeg) finds in that page.

The conditions for adding a URL to the queue are up to you, but usually you want to limit the distance (number of links) between a URL and the seed URL as the stopping point, so you don't traverse the web forever. These rules can also encode what you are interested in searching for, so that you only add relevant URLs.
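
A minimal sketch of that breadth-first idea, assuming the asker's SpiderLeg class (getHyperlink fetches and parses a page, getLinks returns the URLs it found) and an illustrative MAX_DEPTH limit as the stopping rule:

import java.util.*;

// Breadth-first crawl: visit URLs level by level, up to a maximum distance from the seed.
public class Crawler
{
    private static final int MAX_DEPTH = 2;               // stop rule: hops away from the seed URL
    private Set<String> visited = new HashSet<String>();

    public void crawl(String seedUrl)
    {
        Queue<String> queue = new LinkedList<String>();
        Map<String, Integer> depth = new HashMap<String, Integer>();
        queue.add(seedUrl);
        depth.put(seedUrl, 0);
        visited.add(seedUrl);

        while (!queue.isEmpty())
        {
            String currentUrl = queue.remove();            // take the next queued URL
            SpiderLeg leg = new SpiderLeg();
            leg.getHyperlink(currentUrl);                  // fetch and parse the page
            System.out.println("Visited: " + currentUrl);

            if (depth.get(currentUrl) >= MAX_DEPTH)        // too far from the seed: don't expand further
                continue;

            for (String link : leg.getLinks())             // queue unvisited links one level deeper
            {
                if (visited.add(link))                     // Set.add returns false if already present
                {
                    depth.put(link, depth.get(currentUrl) + 1);
                    queue.add(link);
                }
            }
        }
        System.out.println("**Done** Visited " + visited.size() + " web page(s)");
    }
}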
