Java – can I determine whether the data is in English or Chinese?

Can you determine whether the data is in English or Chinese?

Solution

One way to do this is with statistical methods. English has a very distinctive character distribution, and the distribution of characters that follow a given character is also distinctive (a first-order model).

For example, if 'e' is the most common character, the text is very likely in a European language.
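
The frequency check described here can be sketched in Java roughly as follows. This is only a minimal illustration: the class name `LetterFrequency` and the method `mostCommonLetter` are hypothetical, and a real detector would compare the whole distribution against reference statistics rather than look at a single letter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of any library.
class LetterFrequency {

    /**
     * Returns the code point of the most frequent letter in the text
     * (case-insensitive), or -1 if the text contains no letters.
     * Ties are broken arbitrarily.
     */
    static int mostCommonLetter(String text) {
        Map<Integer, Integer> counts = new HashMap<>();
        text.codePoints()
            .filter(Character::isLetter)      // ignore digits, spaces, punctuation
            .map(Character::toLowerCase)      // count 'E' and 'e' together
            .forEach(cp -> counts.merge(cp, 1, Integer::sum));

        int best = -1;
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

Calling `LetterFrequency.mostCommonLetter(text)` on a reasonably long sample and comparing the result with `'e'` gives a first, rough hint. Short samples are too noisy for this to mean much.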

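A simpler way to make the distinction (though not 100% reliable) is to look at the Unicode code points of the characters, converting from the source character set if necessary. If the text contains characters with code points greater than 127, English is somewhat less likely, although English text can still contain symbols such as €. If many characters have code points in the thousands or tens of thousands, an East Asian language becomes increasingly likely; the main block of CJK ideographs, for example, spans U+4E00 to U+9FFF. Code points above 65535 do not guarantee Chinese, however: the supplementary planes hold rare CJK ideographs, but also emoji and other scripts.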
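
A code-point based check along these lines might look like the sketch below. Instead of comparing raw numeric values, it uses the standard `Character.UnicodeScript` API to classify each code point. The class `ScriptGuess`, its `Guess` enum, and the 0.3 threshold are assumptions made for illustration. Note that Latin script covers many languages besides English, and Han characters also appear in Japanese text, so the result is only a hint.

```java
// Hypothetical helper for illustration; not part of any library.
class ScriptGuess {

    enum Guess { ENGLISH, CHINESE, UNKNOWN }

    // Assumed cut-off: share of Han characters among Han + Latin characters.
    // The value is arbitrary and should be tuned on real data.
    static final double HAN_SHARE_THRESHOLD = 0.3;

    static Guess guess(String text) {
        int han = 0;
        int latin = 0;
        // Iterate by code point so characters above U+FFFF are handled correctly.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.HAN) {
                han++;
            } else if (script == Character.UnicodeScript.LATIN) {
                latin++;
            }
            // Digits, spaces and punctuation fall under COMMON and are ignored.
        }
        if (han + latin == 0) {
            return Guess.UNKNOWN;
        }
        double hanShare = (double) han / (han + latin);
        return hanShare >= HAN_SHARE_THRESHOLD ? Guess.CHINESE : Guess.ENGLISH;
    }
}
```

With this sketch, `ScriptGuess.guess("Hello, world")` yields `ENGLISH` and `ScriptGuess.guess("你好，世界")` yields `CHINESE`, while input consisting only of digits or punctuation yields `UNKNOWN`.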
