Java – can I determine whether the data is in English or Chinese?
Is there a way to determine whether a piece of text is in English or Chinese?
Solution
One way to do this is with statistical methods. English has a very distinctive letter frequency distribution, and an equally distinctive distribution of which letter follows which (a first-order model).
If 'e' is the most common letter, the text is very likely written in a European language.
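As a rough illustration of the frequency idea, the sketch below finds the most common letter in a string. The class and method names (`LetterFrequency`, `mostCommonLetter`) are made up for this example, and it only counts single letters; it does not build the first-order (letter-after-letter) model.

```java
import java.util.HashMap;
import java.util.Map;

final class LetterFrequency {

    /**
     * Returns the most frequent letter as a lower-cased code point,
     * or -1 if the text contains no letters. Ties are broken arbitrarily.
     */
    static int mostCommonLetter(String text) {
        Map<Integer, Integer> counts = new HashMap<>();
        text.codePoints()
            .filter(Character::isLetter)      // skip digits, spaces, punctuation
            .map(Character::toLowerCase)      // count 'E' and 'e' together
            .forEach(cp -> counts.merge(cp, 1, Integer::sum));

        int best = -1;
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

A result of `'e'` hints at a European language but says little on very short inputs, so this is best treated as one signal among several.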
A simpler check, though not 100% reliable, is to look at the Unicode code points of the characters (converting from another character set first if necessary). If the text contains characters with code points above 127, English becomes somewhat less likely, although English text can still contain symbols such as €. If many characters have code points in the thousands or tens of thousands (the main CJK ideograph block is U+4E00 to U+9FFF), an East Asian language becomes increasingly likely. Code points above 65535 are not a guarantee of Chinese, though: that range holds rare CJK ideographs, but also emoji and other scripts.
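The code point check can be sketched in Java with `Character.UnicodeScript`, which classifies a code point by script rather than by raw numeric range. The class `ScriptGuesser`, the `Guess` enum and the "more than half of the letters" threshold are assumptions made for this example, not part of any library. Note that the Han script is also used for Japanese kanji, and Latin covers far more than English, so this only tells the two scripts apart.

```java
final class ScriptGuesser {

    enum Guess { MOSTLY_LATIN, MOSTLY_HAN, UNKNOWN }

    /** Guesses the dominant script by counting Latin and Han letters. */
    static Guess guess(String text) {
        int latin = 0;
        int han = 0;
        int letters = 0;

        // Iterate by code point so characters above U+FFFF are handled correctly.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);

            if (!Character.isLetter(cp)) {
                continue; // ignore digits, punctuation, emoji, whitespace
            }
            letters++;

            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.HAN) {
                han++;
            } else if (script == Character.UnicodeScript.LATIN) {
                latin++;
            }
        }

        if (letters == 0) {
            return Guess.UNKNOWN;
        }
        // Assumed threshold: a script wins if it covers more than half of the letters.
        if (han * 2 > letters) {
            return Guess.MOSTLY_HAN;
        }
        if (latin * 2 > letters) {
            return Guess.MOSTLY_LATIN;
        }
        return Guess.UNKNOWN;
    }
}
```

For example, `ScriptGuesser.guess("Hello, world")` yields `MOSTLY_LATIN`, while a string made up only of emoji yields `UNKNOWN` even though its code points are above 65535.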