Java – checking for broken links
I am trying to use Java to find all broken links in web pages. This is the code:
    import java.net.HttpURLConnection;
    import java.net.URL;

    // The wrapper class name is arbitrary; the original snippet showed only the two methods.
    public class LinkChecker {

        private static boolean isLive(String link) {
            HttpURLConnection urlconn = null;
            try {
                URL url = new URL(link);
                urlconn = (HttpURLConnection) url.openConnection();
                urlconn.setConnectTimeout(10000);
                urlconn.setRequestMethod("GET");
                urlconn.connect();
                String redirlink = urlconn.getHeaderField("Location");
                System.out.println(urlconn.getHeaderFields());
                // Follow a redirect by hand if the server points somewhere else.
                if (redirlink != null && !url.toExternalForm().equals(redirlink))
                    return isLive(redirlink);
                else
                    return urlconn.getResponseCode() == HttpURLConnection.HTTP_OK;
            } catch (Exception e) {
                System.out.println(e.getMessage());
                return false;
            } finally {
                if (urlconn != null)
                    urlconn.disconnect();
            }
        }

        public static void main(String[] s) {
            String link = "http://www.somefakesite.net";
            System.out.println(isLive(link));
        }
    }
Code from http://nscraps.com/Java/146-program-code-broken-link-checker.htm.
This code reports an HTTP 200 status for every page, including broken ones. For example, for http://www.somefakesite.net/ it prints the following header fields:
{null=[HTTP/1.1 200 OK], Date=[Sun, 15 May 2011 18:51:29 GMT], Transfer-Encoding=[chunked], Keep-Alive=[timeout=4, max=100], Connection=[Keep-Alive], Content-Type=[text/html], Server=[Apache/2.2.15 (Win32) PHP/5.2.12], X-Powered-By=[PHP/5.2.9-1]}
These websites do not exist, so how can I get them classified as broken links?
Solution
Perhaps the problem is that many web servers and DNS providers detect those "broken" links and redirect you to their own "not found" page.
Test it against a URL that you know returns a 404 status code (one for which the browser displays its own error message).
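One way to run that test is to look at the raw response with automatic redirects switched off, so a 3xx answer is not silently followed. The following is only a minimal sketch: the class name StatusProbe, the helper names, and the placeholder URL are made up for illustration, and you would substitute a page you know to be missing.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class StatusProbe {

    /** Describes a response from its status code and Location header (no network access). */
    static String describe(int status, String location) {
        if (status >= 300 && status < 400) {
            return "redirect to " + location;
        }
        if (status == HttpURLConnection.HTTP_NOT_FOUND) {
            return "not found";
        }
        if (status == HttpURLConnection.HTTP_OK) {
            return "ok";
        }
        return "status " + status;
    }

    /** Requests the URL without following redirects and describes the first response. */
    static String probe(String link) throws IOException {
        // Assumes an http:// or https:// URL; other schemes would fail the cast.
        HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
        conn.setInstanceFollowRedirects(false); // keep 3xx responses visible
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        conn.setRequestMethod("GET");
        try {
            return describe(conn.getResponseCode(), conn.getHeaderField("Location"));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL: pass a page you know is missing as the first argument.
        String link = args.length > 0 ? args[0] : "http://example.com/no-such-page";
        System.out.println(link + " -> " + probe(link));
    }
}
```

If a page you know to be missing comes back as "ok" or as a redirect, the server or your DNS provider is probably masking the failure, and the status code alone will not be enough.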
Edit, in reply to the asker's comment (too long to post as a comment): I don't see a simple answer to your question, but there are several different types of failure:
- Redirected DNS failures (a URL that DNS cannot resolve, where you get redirected to another page): all such redirects will probably lead to the same page (served by your ISP/DNS provider), and you can check for that. Of course, if you use a different ISP/DNS provider, the page may be different. If you are not redirected, you will get a connection error.
- A server that has a valid DNS entry but is not working (for example, if google.com were down) should produce a connection error.
- Missing resources ("pages") on a working server are harder. A 404 tells you the link is broken, but if the server does not send one, there is not much more you can do. A redirect may help you flag the link as suspicious, but it should be checked manually afterwards, because redirects are not used only to catch missing links (for example, www.google.com redirecting to www.google.es).
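The three cases above could be sorted roughly like this. It is a sketch under assumptions, not a complete checker: the class LinkClassifier, the Result enum, and the catchAllPrefix parameter (the address of your provider's "not found" page, which you would have to find out yourself) are all hypothetical names introduced here.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.UnknownHostException;

class LinkClassifier {

    enum Result { OK, DNS_FAILURE, CONNECTION_ERROR, NOT_FOUND, SUSPICIOUS_REDIRECT, OTHER }

    /** Classifies an HTTP response from its status code and Location header (no network access). */
    static Result classifyResponse(int code, String location, String catchAllPrefix) {
        if (code >= 300 && code < 400) {
            boolean toCatchAll = location != null && catchAllPrefix != null
                    && location.startsWith(catchAllPrefix);
            // A redirect to the provider's known page counts as a DNS failure;
            // any other redirect is only flagged for manual review.
            return toCatchAll ? Result.DNS_FAILURE : Result.SUSPICIOUS_REDIRECT;
        }
        if (code == HttpURLConnection.HTTP_NOT_FOUND) return Result.NOT_FOUND;
        if (code == HttpURLConnection.HTTP_OK) return Result.OK;
        return Result.OTHER;
    }

    /** Classifies a failure that happened before any HTTP response arrived. */
    static Result classifyError(IOException e) {
        if (e instanceof UnknownHostException) return Result.DNS_FAILURE; // name did not resolve
        if (e instanceof MalformedURLException) return Result.OTHER;      // not a usable URL
        return Result.CONNECTION_ERROR; // refused, timed out, reset, and so on
    }

    /** Requests the link once, without following redirects, and classifies the outcome. */
    static Result check(String link, String catchAllPrefix) {
        HttpURLConnection conn = null;
        try {
            // Assumes an http:// or https:// URL; other schemes would fail the cast.
            conn = (HttpURLConnection) new URL(link).openConnection();
            conn.setInstanceFollowRedirects(false);
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            return classifyResponse(conn.getResponseCode(),
                    conn.getHeaderField("Location"), catchAllPrefix);
        } catch (IOException e) {
            return classifyError(e);
        } finally {
            if (conn != null) conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Hypothetical prefix of the ISP/DNS provider's "not found" page.
        String catchAll = "http://search.example-isp.invalid/";
        System.out.println(check("http://www.somefakesite.net", catchAll));
    }
}
```

A provider that answers with a normal 200 page directly, rather than with a redirect, would still come out as OK here, so this sketch does not cover that case. Links reported as SUSPICIOUS_REDIRECT still need the manual check mentioned above.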