Java – search for data in Web sites
I'm new to Java and have run into some problems.
The main idea is to connect to a website, collect information from it, and store it in an array.
What I want the program to do is search the page, find a keyword, and store the content that follows the keyword.
At the bottom of the daniweb front page there is a section called "tag cloud", which is filled with tags (short words):
Tag cloud: "I want to store what is written here"
My idea is to first read the HTML of the website, then use Scanner and StringTokenizer to search the saved file for the keyword and the text that follows it, and then store that text in an array.
Is there a better or easier way?
Where do you suggest I look for some examples?
This is what I have so far:
    import java.io.*;
    import java.net.*;
    import java.util.*;

    public class URLReader {
        public static void main(String[] args) throws Exception {
            URL dweb = new URL("http://www.daniweb.com/");
            URLConnection dw = dweb.openConnection();

            // Download the page and save its HTML to OutFile.txt.
            // (The original read from an undefined variable "hc"; the connection is "dw".)
            try (BufferedReader in = new BufferedReader(new InputStreamReader(dw.getInputStream()));
                 PrintStream out = new PrintStream(new FileOutputStream("OutFile.txt"))) {
                System.out.println("connected to daniweb");
                String inputLine;
                while ((inputLine = in.readLine()) != null) {
                    out.println(inputLine);
                }
                System.out.println("printed text to outfile");
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }

            // Keyword to look for. The original took it from a GUI field
            // (txtSearch.getText()); a command-line argument stands in for it here.
            String search = args.length > 0 ? args[0] : "tag";

            // Scan the saved file line by line and split each line into tokens.
            try (Scanner scan = new Scanner(new File("OutFile.txt"))) {
                while (scan.hasNextLine()) {
                    String line = scan.nextLine();
                    StringTokenizer st = new StringTokenizer(line);
                    while (st.hasMoreTokens()) {
                        String word = st.nextToken();
                        // Compare string contents with equals(); == compares references.
                        if (word.equals(search)) {
                            // still working: store the text that follows the keyword
                        }
                    }
                }
            } catch (IOException iox) {
                iox.printStackTrace();
            }
        }
    }
Any help will be greatly appreciated!
Solution
I recommend jsoup. It will retrieve and parse the page for you.
On daniweb, each tag cloud link has the CSS class tagcloudlink, so you just need to tell jsoup to extract the text of every link with that class.
This is from memory, plus some help from the jsoup website; I haven't tested it, but it should get you started:
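    import java.util.ArrayList;
    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    List<String> tags = new ArrayList<String>();
    // Fetch and parse the page (throws IOException if the request fails).
    Document doc = Jsoup.connect("http://daniweb.com/").get();
    // Select every <a> element whose class is tagcloudlink.
    Elements taglinks = doc.select("a.tagcloudlink");
    for (Element link : taglinks) {
        tags.add(link.text());
    }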
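If adding a library is not an option, the manual approach from the question can be approximated with a regular expression from the standard library instead of jsoup. This is only a rough sketch: the class name TagCloudRegex, the method extractTags, and the sample markup are made up for illustration, and it assumes each link's class attribute is double-quoted and the link text contains no nested tags. Regex matching on HTML is fragile, so a real parser such as jsoup is the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagCloudRegex {
    // Matches <a ... class="...tagcloudlink..." ...>text</a> and captures the link text.
    // Assumption: double-quoted class attribute, no nested tags inside the link.
    private static final Pattern TAG_LINK = Pattern.compile(
            "<a\\b[^>]*\\bclass=\"[^\"]*\\btagcloudlink\\b[^\"]*\"[^>]*>([^<]*)</a>",
            Pattern.CASE_INSENSITIVE);

    // Returns the trimmed text of every link carrying the tagcloudlink class.
    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG_LINK.matcher(html);
        while (m.find()) {
            tags.add(m.group(1).trim());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not copied from the real page.
        String html = "<div>Tag cloud: "
                + "<a class=\"tagcloudlink\" href=\"/tags/java\">java</a> "
                + "<a href=\"/about\">about</a> "
                + "<a class=\"tagcloudlink\" href=\"/tags/php\"> php </a></div>";
        System.out.println(extractTags(html)); // prints [java, php]
    }
}
```

The HTML string here could just as well be the page contents read through URLConnection, as in the question's code.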