Need help getting a website's HTML in Java
I took some code from the question "Java HttpURLConnection cutting off HTML", which is almost the same as other code I found for fetching HTML from websites in Java.
I want to get the HTML from one particular website (its URL appears in the code below).
But I keep getting garbage characters, although the same code works for http://www.google.com and any other website I try.
This is the code I'm using:
public static String PrintHTML(){
    URL url = null;
    try {
        url = new URL("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289");
    } catch (MalformedURLException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }
    HttpURLConnection connection = null;
    try {
        connection = (HttpURLConnection) url.openConnection();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    connection.setRequestProperty("User-Agent","Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6");
    try {
        System.out.println(connection.getResponseCode());
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String line;
    StringBuilder builder = new StringBuilder();
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    try {
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n");
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String html = builder.toString();
    System.out.println("HTML " + html);
    return html;
}
I don't understand why it doesn't work for the URL I mentioned above.
Any help would be appreciated.
Solution
That site incorrectly gzips the response regardless of the client's capabilities. Normally, a server should gzip the response only when the client has indicated that it supports this (by sending Accept-Encoding: gzip). You need to decompress the response using GZIPInputStream:
reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream()),"UTF-8"));
Note that I also passed the correct character set to the InputStreamReader constructor. Normally, you would extract it from the Content-Type header of the response.
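The two points above (decompressing the body and taking the charset from Content-Type) could be combined roughly as follows. This is only a minimal sketch, not the answer's original code: the class and method names (ResponseDecoder, charsetOf, ungzipIfNeeded, readBody) are hypothetical, and checking for the gzip magic bytes instead of always wrapping the stream is my own assumption, added so the same helper could also be used for sites that do not compress. The main method uses an in-memory gzipped string so it runs without network access.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; none of these names come from the original answer.
final class ResponseDecoder {

    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=UTF-8", or UTF-8 if none is present or usable.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed name: fall back to UTF-8
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Wraps the stream in a GZIPInputStream only if it starts with the
    // gzip magic bytes 0x1f 0x8b; otherwise returns it buffered but unchanged.
    static InputStream ungzipIfNeeded(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);
        int b1 = in.read();
        int b2 = in.read();
        in.reset();
        boolean gzipped = b1 == 0x1f && b2 == 0x8b;
        return gzipped ? new GZIPInputStream(in) : in;
    }

    // Reads the whole body, decompressing if needed, and decodes it
    // with the charset taken from the Content-Type header value.
    static String readBody(InputStream raw, String contentType) throws IOException {
        try (InputStream in = ungzipIfNeeded(raw)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), charsetOf(contentType));
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a gzipped response body in memory.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write("<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        }
        InputStream fake = new ByteArrayInputStream(compressed.toByteArray());
        System.out.println(readBody(fake, "text/html; charset=UTF-8"));
    }
}
```

With the connection from the question, the call would presumably look like ResponseDecoder.readBody(connection.getInputStream(), connection.getContentType()). Falling back to UTF-8 when the header names no charset is a guess that suits many pages but not all of them.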
For more hints, see also "How to use URLConnection to fire and handle HTTP requests?". If all you want is to parse or extract information from the HTML, I strongly recommend using an HTML parser such as Jsoup.