Java – validation of XML document results in “invalid byte 1 of 1-byte UTF-8 sequence”

2019-12-28 • Java

I use Probatron4J, which uses Saxon internally, to validate XML files against Schematron schemas. In most cases this works normally, but occasionally it crashes with the error in the title.

My research shows that this message usually indicates one of the following (in no particular order):

- Blatantly invalid data (for example, trying to read a zip file as if it were an XML file)
- The presence of a byte order mark (BOM)
- Characters that are illegal in UTF-8 in a file that claims to be UTF-8 encoded

None of this applies to the file I'm working on. I inspected the input as a byte array during program execution; it contains neither a BOM nor any non-ASCII characters.
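The check described above could be sketched roughly as follows. This is only an illustration, not the asker's actual code: the class `InputByteCheck` and its method names are hypothetical, and the sample string in `main` is invented for the demonstration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Probatron4J or Saxon.
final class InputByteCheck {

    // True if the array starts with the UTF-8 byte order mark EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // Index of the first byte outside the ASCII range 0..127, or -1 if there is none.
    static int firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] < 0) { // Java bytes are signed, so 0x80..0xFF are negative
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Invented sample: one accented character encoded as a single ISO-8859-1 byte.
        byte[] sample = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("BOM present: " + hasUtf8Bom(sample));
        System.out.println("First non-ASCII index: " + firstNonAscii(sample));
    }
}
```

A result of -1 from `firstNonAscii` together with `false` from `hasUtf8Bom` would match what the question reports for the failing input.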

Processing gets about a fifth of the way through my 30 KB document and then crashes on a humble English sentence (by "humble" I mean that all bytes are between 32 (space) and 122 (lowercase z), in other words standard keyboard characters). The bytes of the offending element are given at the end of this question.

Strangely, the failing document was generated by deleting some elements from a larger document, and that larger document is handled cleanly by the same code.

I know the exception is thrown in the parse(InputSource input) method of an object implementing the org.xml.sax.XMLReader interface. According to the Javadoc, a SAXException indicates "any SAX exception, possibly wrapping another exception".

Inspecting the exception in the debugger shows that it does not wrap another exception.
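For readers who want to do the same inspection in code rather than in the debugger, here is a minimal sketch. `SaxCauseChain` and `rootCause` are hypothetical names; the only library calls used are `SAXException.getException()` and `Throwable.getCause()`.

```java
import java.io.IOException;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only.
final class SaxCauseChain {

    // Follows embedded exceptions and causes until none is left, and returns the innermost one.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (true) {
            Throwable next = null;
            if (current instanceof SAXException) {
                // SAXException stores a wrapped exception that getException() exposes.
                next = ((SAXException) current).getException();
            }
            if (next == null) {
                next = current.getCause();
            }
            if (next == null || next == current) {
                return current;
            }
            current = next;
        }
    }

    public static void main(String[] args) {
        // Invented exceptions, constructed only to show the unwrapping.
        SAXException wrapped = new SAXException("outer", new IOException("inner"));
        System.out.println(rootCause(wrapped).getMessage());
    }
}
```

If `rootCause` returns the same object it was given, nothing is wrapped, which is the situation described in the question.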

What are the possible causes of this error?

Edit:

[60,80,97,114,103,112,104,62,69,120,101,99,117,116,105,118,32,83,109,121,58,70,111,49,55,53,52,54,51,13,10,110,100,65,115,119,102,108,98,44,75,71,73,76,88,86,46,107,89,39,87,59,78,60,47,62]

The exception is thrown after the third occurrence of 109 (lowercase m).

Solution

I have solved the problem. Java String objects are UTF-16 internally, and the getBytes() method of the String class produces bytes in the platform's default charset unless you explicitly specify that you want UTF-8 (or another charset it understands).
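A minimal sketch of the difference, assuming a string that contains at least one non-ASCII character (the question does not confirm that this was the case). `GetBytesDemo` and the sample text are invented for illustration. On JDK 18 and later the default charset is UTF-8 unless overridden, so the first two arrays may be identical there.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the sample XML is invented.
final class GetBytesDemo {

    // Always encodes as UTF-8, regardless of the platform default.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String xml = "<Paragraph>caf\u00e9</Paragraph>";

        byte[] platform = xml.getBytes();   // platform default charset
        byte[] explicit = utf8(xml);        // explicit UTF-8
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Default matches UTF-8: " + Arrays.equals(platform, explicit));
        // ISO-8859-1 encodes the accented character as one byte, UTF-8 as two.
        System.out.println("ISO-8859-1 length: " + latin1.length
                + ", UTF-8 length: " + explicit.length);
    }
}
```

Passing the charset explicitly makes the output independent of the machine the code runs on, which is why it is usually preferred when the bytes are handed to a parser that expects UTF-8.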

I'm not entirely sure how or why this solves the problem, because the bytes near the place where the exception is thrown (the bytes listed at the end of the question) are themselves valid UTF-8, but it does appear to be fixed.

The only explanation I can think of is that I missed an earlier invalid byte in the file, which corrupted things without causing an immediate crash. I am reading the bytes from a ByteArrayInputStream, so the parser may read a large block from the buffer at once, which would advance the stream's pos field well beyond the actual position of the bad character.
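One way to test that hypothesis would be to decode the whole byte array strictly and record where decoding first fails, independently of the XML parser. The sketch below uses `java.nio.charset.CharsetDecoder` for this; `Utf8Scan` and `firstInvalidOffset` are hypothetical names, and the sample bytes in `main` are invented.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper; not part of any library mentioned in the question.
final class Utf8Scan {

    // Returns the offset of the first byte that is not valid UTF-8, or -1 if all bytes decode.
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // On error the input position marks the start of the bad sequence.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without error
            }
            out.clear(); // output buffer full: discard decoded chars and continue
        }
    }

    public static void main(String[] args) {
        // Invented sample: "abc", then a lone 0xE9 byte (ISO-8859-1 e-acute), then "d".
        byte[] sample = {97, 98, 99, (byte) 0xE9, 100};
        System.out.println("First invalid offset: " + firstInvalidOffset(sample));
    }
}
```

If this reports an offset well before the element where the parser failed, that would support the idea that the parser noticed the problem only after buffering ahead.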
