Java: mixed encodings in HTML?

First of all, I'd like to thank you in advance for your help.

I am currently writing a web crawler that parses HTML content, strips the HTML tags, and then spell checks the extracted text.

Stripping the HTML tags and spell checking, using jsoup and the Google spell check API, has not caused any problems.

I can download the content from a URL, read it into a byte[], and finally convert it to a String so that it can be stripped and spell checked. However, I have a problem with character encoding.

For example, when parsing http://www.testwareinc.com/:

Original: We’ve expanded mobile web and mobile application testing services

... the page uses ISO-8859-1 according to its meta tag

ISO-8859-1 parse: Weve expanded mobile web and mobile application testing services

... then trying UTF-8

UTF-8 parse: We�ve expanded mobile web and mobile application testing services

Is it possible for the HTML of a web page to contain a mix of encodings? And how can that be detected?

Solution

It looks like the apostrophe is encoded as the single byte 0x92, which according to Wikipedia is an unassigned/private code point in ISO-8859-1.

From there, it looks like the browser falls back by assuming it is an unencoded 1-byte Unicode code point, U+0092 (Private Use Two), which appears to be rendered as an apostrophe. No, wait: since it is a single byte, it is more likely cp1252, where 0x92 is the right single quotation mark (U+2019). Browsers must have a fallback strategy based on the advertised code page, such as ISO-8859-1 -> cp1252.
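To make the difference concrete, here is a minimal sketch that decodes the same five bytes under the three charsets discussed. The class name `ApostropheDemo`, the helper `decode`, and the sample bytes are made up for illustration, and it assumes the JDK in use ships the `windows-1252` charset (common, but not one of the charsets Java guarantees).

```java
import java.nio.charset.Charset;

public class ApostropheDemo {
    // Hypothetical helper: decode the same bytes under a named charset.
    // new String(...) silently replaces malformed input with U+FFFD.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "We", then the single byte 0x92, then "ve" (illustrative sample data)
        byte[] raw = {'W', 'e', (byte) 0x92, 'v', 'e'};

        for (String name : new String[] {"ISO-8859-1", "UTF-8", "windows-1252"}) {
            String text = decode(raw, name);
            // Print the code point that the 0x92 byte turned into
            System.out.printf("%-12s -> U+%04X%n", name, (int) text.charAt(2));
        }
    }
}
```

With these bytes, ISO-8859-1 yields the invisible control character U+0092, UTF-8 yields the replacement character U+FFFD (a lone 0x92 is not valid UTF-8), and windows-1252 yields U+2019, the curly apostrophe.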

So there is no mix of encodings here but, as others have said, a broken document, combined with a fallback heuristic that sometimes helps and sometimes does not.
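A crawler can apply a similar heuristic itself. The following is only a sketch of one possible approach, not what any particular browser does: treat a declared ISO-8859-1 as windows-1252, decode strictly, and fall back to windows-1252 if the bytes are invalid for the declared charset. The names `FallbackDecoder`, `decodeStrict`, and `decodeLenient` are hypothetical, and the sketch assumes the declared charset name is one the JDK supports.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FallbackDecoder {
    // Assumption: windows-1252 is available in the running JDK.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode strictly: throw instead of silently inserting U+FFFD.
    static String decodeStrict(byte[] bytes, Charset cs) throws CharacterCodingException {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Hypothetical heuristic: honor the declared charset when the bytes are
    // valid for it, otherwise fall back to windows-1252.
    static String decodeLenient(byte[] bytes, String declared) {
        // Charset.forName throws if the name is illegal or unsupported.
        Charset cs = Charset.forName(declared);
        if (cs.equals(StandardCharsets.ISO_8859_1)) {
            cs = CP1252; // pages labelled ISO-8859-1 are often really cp1252
        }
        try {
            return decodeStrict(bytes, cs);
        } catch (CharacterCodingException e) {
            // Last resort: bytes undefined in cp1252 become U+FFFD here
            return new String(bytes, CP1252);
        }
    }

    public static void main(String[] args) {
        byte[] raw = {'W', 'e', (byte) 0x92, 'v', 'e'};
        System.out.println(decodeLenient(raw, "ISO-8859-1"));
        System.out.println(decodeLenient(raw, "UTF-8"));
    }
}
```

Strict decoding is what makes detection possible: the plain `new String(bytes, charset)` constructor never reports bad input, so a mismatch between the declared charset and the actual bytes would otherwise go unnoticed.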

If you are curious enough, you may want to dive into the Firefox or Chrome source code to see exactly what they do in this case.
