Need help getting a website's HTML in Java

I took some code from the question "Java HttpURLConnection cutting off HTML", and it is almost the same as the code I use to fetch HTML from websites in Java.

I want to get the HTML from one particular website (the URL is in the code below), but I keep getting garbage characters. The same code works fine with any other website, such as http://www.google.com.

This is the code I'm using:

public static String PrintHTML(){
    URL url = null;
    try {
        url = new URL("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289");
    } catch (MalformedURLException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }
    HttpURLConnection connection = null;
    try {
        connection = (HttpURLConnection) url.openConnection();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    connection.setRequestProperty("User-Agent","Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6");
    try {
        System.out.println(connection.getResponseCode());
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String line;
    StringBuilder builder = new StringBuilder();
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    try {
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n"); 
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String html = builder.toString();
    System.out.println("HTML " + html);
    return html;
}

I don't understand why it doesn't work for the URL mentioned above.

Any help would be appreciated.

Solution

The site incorrectly gzips the response regardless of what the client supports. Normally, a server should compress the response only when the client signals that it can handle it (via the Accept-Encoding: gzip request header). You need to decompress the response using GZIPInputStream.

reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream()),"UTF-8"));
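The line above always assumes a gzipped body, which would fail on a site that answers uncompressed. A slightly more defensive sketch is to look at the Content-Encoding response header first. The class name GzipAwareBody and the helper unwrap are hypothetical names used only for illustration, and the in-memory payload in main stands in for a real server response; readAllBytes assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipAwareBody {

    // Hypothetical helper: wraps the stream in a GZIPInputStream only when
    // the Content-Encoding header value says the body is gzipped.
    static InputStream unwrap(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    public static void main(String[] args) throws IOException {
        // Build a gzipped payload in memory to stand in for a server response.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write("<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        }
        InputStream body = unwrap(new ByteArrayInputStream(buffer.toByteArray()), "gzip");
        System.out.println(new String(body.readAllBytes(), StandardCharsets.UTF_8));
    }
}
```

With a real connection, the call would presumably look like unwrap(connection.getInputStream(), connection.getContentEncoding()), since URLConnection exposes the header through getContentEncoding().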

Note that I also passed the right character set (UTF-8) to the InputStreamReader constructor. Normally, you would extract it from the Content-Type header of the response.
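As a rough illustration of extracting the character set from the Content-Type header, the sketch below parses a header value such as "text/html; charset=ISO-8859-1". The class name ContentTypeCharset and the helper charsetOf are hypothetical, and the fallback to UTF-8 when no charset is given is an assumption, not something the header format mandates.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {

    // Hypothetical helper: pulls the charset parameter out of a Content-Type
    // value. Falls back to UTF-8 (an assumption) when the parameter is
    // missing or names a charset this JVM does not know.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                // Case-insensitive check for a "charset=" prefix.
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: use the fallback.
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/html"));
    }
}
```

With a real connection, the header value would come from connection.getContentType(), and the resulting Charset can be passed straight to the InputStreamReader constructor in place of the hard-coded "UTF-8".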

For more tips, see also "How to use URLConnection to fire and handle HTTP requests?". If all you want is to parse or extract information from the HTML, I strongly recommend using an HTML parser such as Jsoup.
