Java – how to parse XML files containing BOMs?
I want to use JDOM to parse XML files from URLs. But when I try this:
SAXBuilder builder = new SAXBuilder();
builder.build(aUrl);
I get this exception:
Invalid byte 1 of 1-byte UTF-8 sequence.
I think this may be a BOM problem. So I looked at the source and saw a BOM at the beginning of the file. I tried reading from the URL with aUrl.openStream() and stripping the BOM with Commons IO's BOMInputStream, but to my surprise it did not detect any BOM. I also tried reading from the stream, writing the content to a local file, and parsing that file. I set the encoding of every InputStreamReader and OutputStreamWriter to UTF-8, but the file was full of garbage characters when I opened it.
I thought the problem was the encoding of the source URL. But when I open the URL in a browser, save the XML to a file, and read that file through the process described above, everything works fine.
I would appreciate any pointers to possible causes of this problem.
Solution
The HTTP server is sending the content gzip-compressed (Content-Encoding: gzip; if you don't know what this means, see http://en.wikipedia.org/wiki/HTTP_compression ). You therefore need to wrap aUrl.openStream() in a GZIPInputStream, which will decompress it for you. For example:
builder.build(new GZIPInputStream(aUrl.openStream()));
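To see why both symptoms in the question appear, here is a small self-contained sketch. The class name GzipBomDemo and the sample XML are made up for illustration, and the server is simulated in memory rather than contacted. It gzips a BOM-prefixed XML document and compares the leading bytes before and after decompression. A gzip stream starts with the magic bytes 0x1F 0x8B rather than the UTF-8 BOM 0xEF 0xBB 0xBF, so a BOM detector has nothing to find, and 0x8B is not a valid first byte of a UTF-8 sequence, which is consistent with the parser's complaint.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBomDemo {

    // Compress a byte array with gzip, standing in for what the server sends.
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(plain);
        }
        return buffer.toByteArray();
    }

    // Decompress gzip data, as GZIPInputStream would do for the URL stream.
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return buffer.toByteArray();
    }

    // Format the first `count` bytes as space-separated upper-case hex.
    static String hexPrefix(byte[] data, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count && i < data.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    // Illustrative sample: a UTF-8 BOM followed by a tiny XML document.
    static byte[] sampleXmlWithBom() {
        byte[] body = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] result = new byte[body.length + 3];
        result[0] = (byte) 0xEF;
        result[1] = (byte) 0xBB;
        result[2] = (byte) 0xBF;
        System.arraycopy(body, 0, result, 3, body.length);
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = gzip(sampleXmlWithBom());
        // prints "compressed starts with:   1F 8B"
        System.out.println("compressed starts with:   " + hexPrefix(wire, 2));
        // prints "decompressed starts with: EF BB BF"
        System.out.println("decompressed starts with: " + hexPrefix(gunzip(wire), 3));
    }
}
```

The BOM only becomes visible once the stream has been decompressed, which is also why a file saved from the browser parsed fine: the browser had already undone the compression.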
Edited to add, based on the comments below: if you don't know in advance whether the URL will be gzipped, you can write something like this:
private InputStream openStream(final URL url) throws IOException {
    final URLConnection cxn = url.openConnection();
    final String contentEncoding = cxn.getContentEncoding();
    if (contentEncoding == null)
        return cxn.getInputStream();
    else if (contentEncoding.equalsIgnoreCase("gzip")
            || contentEncoding.equalsIgnoreCase("x-gzip"))
        return new GZIPInputStream(cxn.getInputStream());
    else
        throw new IOException("Unexpected content-encoding: " + contentEncoding);
}
(Warning: untested.) Then use:
builder.build(openStream(aUrl));
This is basically the same as the above (aUrl.openStream() is explicitly documented as shorthand for aUrl.openConnection().getInputStream()), except that it checks the Content-Encoding header before deciding whether to wrap the stream in a GZIPInputStream.
See the documentation for java.net.URLConnection.
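If the Content-Encoding header turns out to be missing or unreliable, one alternative (not part of the approach above, just a sketch) is to peek at the first two bytes of the stream and wrap it only when they match the gzip magic number 0x1F 0x8B. The class GzipSniffer and the method maybeGunzip are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class GzipSniffer {

    // Hypothetical helper: returns a decompressing stream if `raw` starts with
    // the gzip magic bytes, otherwise a stream that yields the bytes unchanged.
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        byte[] head = new byte[2];

        // read() may return fewer bytes than requested, so loop until we have
        // two bytes or reach the end of the stream.
        int read = 0;
        while (read < 2) {
            int n = in.read(head, read, 2 - read);
            if (n == -1) {
                break;
            }
            read += n;
        }

        // Push the peeked bytes back so the consumer sees the whole stream.
        if (read > 0) {
            in.unread(head, 0, read);
        }

        boolean gzipped = read == 2
                && (head[0] & 0xFF) == 0x1F
                && (head[1] & 0xFF) == 0x8B;
        return gzipped ? new GZIPInputStream(in) : in;
    }

    public static void main(String[] args) throws IOException {
        // Uncompressed input passes through untouched.
        byte[] plain = "<root/>".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = maybeGunzip(new ByteArrayInputStream(plain))) {
            System.out.println((char) in.read()); // prints "<"
        }
    }
}
```

With JDOM this would be used as builder.build(GzipSniffer.maybeGunzip(aUrl.openStream())). Checking the header stays the cleaner option when the server reports it correctly; sniffing is only a fallback.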