Java: searching for data on websites

I'm new to Java and have run into some problems.

The main idea is to connect to a website, collect information from it, and store that information in an array.

What I want the program to do is search the page, find a keyword, and store the content that follows the keyword.

At the bottom of DaniWeb's front page there is a section called "tag cloud", which is filled with tags (short words).

Tag cloud: "I want to store what is written here"

My idea is to first read the HTML of the page, then use Scanner and StringTokenizer to search the saved file for the keyword and the text that follows it, and then store that text in an array.

Is there a better or easier way?

Where do you suggest I look for examples?

This is what I have so far:

import java.io.*;
import java.net.*;
import java.util.*;

public class URLReader {

    public static void main(String[] args) throws Exception {

        URL dweb = new URL("http://www.daniweb.com/");
        URLConnection dw = dweb.openConnection();
        System.out.println("connected to daniweb");

        // Copy the page's HTML to OutFile.txt, line by line.
        // try-with-resources closes both streams even if an error occurs.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(dw.getInputStream()));
             PrintStream out = new PrintStream(new FileOutputStream("OutFile.txt"))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                out.println(inputLine);
            }
            System.out.println("printed text to outfile");
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Placeholder keyword: the original read it from a GUI field (txtSearch) that is not shown here
        String search = "keyword";

        // Scan the saved file word by word, looking for the keyword
        try (Scanner scan = new Scanner(new File("OutFile.txt"))) {
            while (scan.hasNextLine()) {
                String line = scan.nextLine();
                StringTokenizer st = new StringTokenizer(line);
                //still working
                while (st.hasMoreTokens()) {
                    String word = st.nextToken();
                    if (word.equals(search)) { // compare strings with equals(), not ==
                        // keyword found: store the tokens that follow here
                    } else {
                        // not the keyword: keep scanning
                    }
                }
            }
        } catch (IOException iox) {
            iox.printStackTrace();
        }
    }
}

Any help will be greatly appreciated!

Solution

I recommend jsoup. It will retrieve and parse the page for you.

On DaniWeb, each tag cloud link has the CSS class tagcloudlink, so you just need to tell jsoup to extract the text of every link with that class.

This is from memory plus some help from the jsoup website; I haven't tested it, but it should get you started:

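import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

// Fetch the page, select every <a> with class "tagcloudlink", and collect the link text
List<String> tags = new ArrayList<String>();
Document doc = Jsoup.connect("http://daniweb.com/").get();
Elements taglinks = doc.select("a.tagcloudlink");
for (Element link : taglinks) {
    tags.add(link.text());
}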
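
If adding jsoup as a dependency is not an option, the same idea can be roughly sketched with only the standard library. The example below is an illustration, not a tested scraper: the class name TagCloudExtractor, the method extractTags, and the sample HTML string are all made up for this sketch, and it assumes the class attribute is written with double quotes. Regular expressions are brittle on real-world HTML, so a proper parser such as jsoup is likely the safer choice for anything beyond a quick experiment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the text of <a> elements whose class
// attribute contains "tagcloudlink" out of an HTML string.
class TagCloudExtractor {

    // Matches <a ... class="... tagcloudlink ..." ...>text</a>.
    // Assumption: the class attribute value is enclosed in double quotes.
    private static final Pattern TAG_LINK = Pattern.compile(
        "<a\\b[^>]*\\bclass\\s*=\\s*\"[^\"]*\\btagcloudlink\\b[^\"]*\"[^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<String>();
        Matcher m = TAG_LINK.matcher(html);
        while (m.find()) {
            // Drop any nested markup inside the link and trim whitespace
            String text = m.group(1).replaceAll("<[^>]+>", "").trim();
            if (!text.isEmpty()) {
                tags.add(text);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for the downloaded page
        String sample = "<div>Tag cloud: "
            + "<a class=\"tagcloudlink\" href=\"/tags/java\">java</a> "
            + "<a class=\"tagcloudlink\" href=\"/tags/php\">php</a></div>";
        System.out.println(extractTags(sample)); // prints [java, php]
    }
}
```

To use it on a real page, the HTML read through the URLConnection in the question could be collected into a single String and passed to extractTags instead of being written to OutFile.txt first.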