Java – validation of XML document results in “invalid byte 1 of 1-byte UTF-8 sequence”
I use Probatron4J, which uses Saxon internally, to validate some XML files against Schematron stylesheets. In most cases this works fine, but occasionally processing crashes with the error in the title.
My research shows that this message usually indicates one of the following (in no particular order):
> Blatantly invalid data (for example, trying to read a ZIP file as if it were an XML file)
> The presence of a byte order mark
> Characters that are illegal in UTF-8
> A file that claims to be UTF-8 encoded but is not
None of these applies to the file I'm working on. I checked the input as a byte array during program execution: it contains no BOM and no non-ASCII characters.
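For reference, a check along these lines can be sketched as below. This is only an illustration, not the code I actually ran: the class and method names (ByteChecks, hasUtf8Bom, firstNonAscii) are made up for the example. The check is only meaningful if it is run on exactly the same byte array that is handed to the parser.

```java
import java.nio.charset.StandardCharsets;

class ByteChecks {
    // True if the array starts with the UTF-8 byte order mark EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // Index of the first byte outside the ASCII range 0..127, or -1 if there is none.
    static int firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            // Java bytes are signed, so values 0x80..0xFF show up as negative numbers.
            if (data[i] < 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not the real document.
        byte[] sample = "<p>Summary</p>".getBytes(StandardCharsets.US_ASCII);
        System.out.println("BOM present: " + hasUtf8Bom(sample));
        System.out.println("first non-ASCII index: " + firstNonAscii(sample));
    }
}
```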
Processing gets about a fifth of the way through my 30 KB document and then crashes on a humble English sentence. (By "humble" I mean that all bytes are between 32 (space) and 122 (lowercase z); in other words, standard keyboard characters.) The bytes of the problematic element are at the end of this post.
Strangely, the failing document was generated by deleting some elements from a larger document, and that larger document is handled cleanly by the same code.
I know the exception is thrown in the parse(InputSource input) method of an object implementing the org.xml.sax.XMLReader interface. According to the Javadoc, a SAXException is any SAX exception, possibly wrapping another exception.
Inspecting the exception in the debugger shows that there is no wrapped exception.
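To make the above concrete, here is a minimal sketch of how the wrapped exception can be inspected in code rather than in the debugger. It uses the JDK's default SAX parser, not the Probatron4J/Saxon setup, and the names ParseProbe and tryParse are made up for the example. Note that parse() also declares IOException, so an encoding failure does not have to arrive as a SAXException at all.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ParseProbe {
    // Parses the bytes and returns the innermost exception we can find,
    // or null if the document parsed cleanly.
    static Exception tryParse(byte[] xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            DefaultHandler handler = new DefaultHandler();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler); // DefaultHandler rethrows fatal errors
            reader.parse(new InputSource(new ByteArrayInputStream(xml)));
            return null;
        } catch (SAXException e) {
            // getException() returns the wrapped exception, or null if there is none.
            Exception wrapped = e.getException();
            return wrapped != null ? wrapped : e;
        } catch (IOException | ParserConfigurationException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        // Hypothetical input: 0xE9 is "é" in ISO-8859-1, which is not valid UTF-8 here.
        byte[] bad = "<a>\u00e9</a>".getBytes(StandardCharsets.ISO_8859_1);
        Exception e = tryParse(bad);
        System.out.println(e == null ? "parsed cleanly" : e.getClass().getName() + ": " + e.getMessage());
    }
}
```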
What are the possible causes of this error?
Edit:
[60,80,97,114,103,112,104,62,69,120,101,99,117,116,105,118,32,83,109,121,58,70,111,49,55,53,52,54,51,13,10,110,100,65,115,119,102,108,98,44,75,71,73,76,88,86,46,107,89,39,87,59,78,60,47,62]
The exception is thrown after the third occurrence of 109.
Solution
I have solved the problem. Although Java strings are sequences of UTF-16 code units internally, the getBytes() method of the String class produces bytes in the platform's default encoding unless you explicitly specify that you want UTF-8 (or another charset it understands).
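A minimal sketch of the difference, assuming a sample string with one non-ASCII character (the class name EncodingDemo and the helper toUtf8 are made up for the example). As far as I know, JDK 18 and later default to UTF-8, so the platform-default call only differs on older JDKs or when the default is overridden.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingDemo {
    // Always encodes as UTF-8, regardless of the platform default.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café"

        // Depends on the JVM's default charset; this is the call that caused the trouble.
        byte[] platform = text.getBytes();
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("platform bytes:  " + Arrays.toString(platform));

        // The same string in two explicit encodings.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(text);
        System.out.println(Arrays.toString(latin1)); // [99, 97, 102, -23]
        System.out.println(Arrays.toString(utf8));   // [99, 97, 102, -61, -87]
    }
}
```

The single byte -23 (0xE9) in the ISO-8859-1 output is exactly the kind of byte a UTF-8 parser rejects, because 0xE9 announces a multi-byte sequence that never follows.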
I'm not entirely sure how or why this solves the problem, because the bytes near the place where the exception is thrown (the bytes at the end of the question) are themselves valid UTF-8, but the change does appear to have fixed it.
The only explanation I can think of is that I missed an earlier invalid byte in the file, which corrupted the parser's state without causing an immediate crash. I am reading the bytes from a ByteArrayInputStream, so the parser may read a large block from the stream at once, which would advance the stream's pos field well past the position of the bad character.
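One way to test that theory, sketched here under the assumption that the whole document is available as a byte array, is to run the bytes through a strict UTF-8 decoder and ask where it first objects. This is independent of the XML parser and its buffering. The names Utf8Scan and firstInvalidOffset are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Scan {
    // Returns the offset at which a strict UTF-8 decoder reports an error,
    // or -1 if the whole array decodes cleanly.
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            // endOfInput = true, so a truncated trailing sequence is also reported.
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without errors
            }
            out.clear(); // overflow: discard decoded chars and keep going
        }
    }

    public static void main(String[] args) {
        // Hypothetical input: "<é>" encoded as ISO-8859-1 instead of UTF-8.
        byte[] sample = {60, (byte) 0xE9, 62};
        System.out.println("first invalid offset: " + firstInvalidOffset(sample));
    }
}
```

If this reports an offset well before the element where the parser gives up, the "earlier invalid byte" explanation becomes much more plausible.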