Detecting Chinese and English symbols and punctuation in Java

2019-09-21 • Java

This article shows how to detect Chinese and English symbols and punctuation in Java:

Method 1: use UnicodeBlock and UnicodeScript

In Java, the Character class handles most character-related functionality. Character in JDK 1.7 is implemented against Unicode version 6.0, so it helps to learn the common Unicode encodings first.

The Character.UnicodeBlock and Character.UnicodeScript types help us determine a character's type. A Unicode block is a basic unit of code point organization defined by the Unicode Consortium. Each block is one contiguous range of code points, and blocks never overlap. For example, we usually check whether a character's code point lies between 0x4E00 and 0x9FCC to decide whether it is a Chinese character, because that range belongs to a block reserved for Chinese characters (to be exact, CJK unified ideographs). This block is named CJK Unified Ideographs. The often quoted figure of 74,617 Chinese characters is the total across all CJK unified ideograph blocks, extensions included, as of Unicode 6.1; this basic block alone accounts for about 20,900 of them.
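The check described above can be sketched in two ways: by asking Character.UnicodeBlock which block a code point belongs to, or by comparing against the numeric range directly. This is a minimal sketch; the class name HanCheck and both method names are made up for illustration.

```java
// Hypothetical helper class; the names are illustrative only.
final class HanCheck {
    private HanCheck() {}

    // True if the code point lies in the CJK Unified Ideographs block.
    static boolean isChineseByBlock(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    // True if the code point lies in the range quoted in the text.
    static boolean isChineseByRange(int codePoint) {
        return codePoint >= 0x4E00 && codePoint <= 0x9FCC;
    }
}
```

Neither check covers the extension blocks (for example U+3400 in CJK Unified Ideographs Extension A), so rare characters need extra handling.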

Relationship between Unicode block and Unicode script:

1. A Unicode block is a simple numerical range (some blocks contain "blank" code points that have not been assigned characters yet).

2. Characters in one Unicode script may be scattered across multiple Unicode blocks.

3. Characters in one Unicode block may belong to multiple Unicode scripts.

In short, Unicode script classifies characters by the writing system they belong to, that is, by how they are used, while Unicode block classifies them by where they sit in the code space.
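The difference between blocks and scripts can be seen by querying both for a few code points. The sketch below assumes a hypothetical helper class BlockVersusScript; only Character.UnicodeBlock.of and Character.UnicodeScript.of come from the JDK (Java 7 and later).

```java
// Hypothetical helper class; the names are illustrative only.
final class BlockVersusScript {
    private BlockVersusScript() {}

    // Block lookup: which contiguous code point range contains the character.
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    // Script lookup: which writing system the character belongs to.
    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    // True if the character belongs to the Han script, whatever its block.
    static boolean isHan(int codePoint) {
        return scriptOf(codePoint) == Character.UnicodeScript.HAN;
    }
}
```

For example, U+4E2D and U+3400 are both Han but sit in different blocks (CJK Unified Ideographs and CJK Unified Ideographs Extension A), while U+3001 and U+3005 share the CJK Symbols and Punctuation block but have different scripts (Common and Han).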

Identifying Chinese punctuation marks

Chinese punctuation is found mainly in the following five Unicode blocks:

U+2000 General Punctuation (per mille sign, single quotation marks, double quotation marks, etc.)

U+3000 CJK Symbols and Punctuation (ideographic comma, ideographic full stop, book title marks, 〸, 〹, 〺, etc.; PS: do you know what the last three characters mean? :)

U+FF00 Halfwidth and Fullwidth Forms (greater than, less than, equals, brackets, exclamation mark, plus, minus, colon, semicolon, etc.)

U+FE30 CJK Compatibility Forms (mainly brackets used in vertical writing, the dashed line ﹉, the wavy line ﹏, etc.)

U+FE10 Vertical Forms (mainly vertical punctuation marks, etc.)
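Given these five blocks, a block-based check might look like the following. This is a minimal sketch: the class and method names are hypothetical, while the five Character.UnicodeBlock constants are standard JDK fields (VERTICAL_FORMS requires Java 7 or later).

```java
// Hypothetical helper class; the names are illustrative only.
final class ChinesePunctuation {
    private ChinesePunctuation() {}

    // True if the code point falls in one of the five blocks listed above.
    static boolean isChinesePunctuation(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.GENERAL_PUNCTUATION
            || block == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            || block == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
            || block == Character.UnicodeBlock.VERTICAL_FORMS;
    }

    // Counts the characters of a string that pass the check above.
    static int countChinesePunctuation(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (isChinesePunctuation(codePoint)) {
                count++;
            }
            i += Character.charCount(codePoint); // step over surrogate pairs
        }
        return count;
    }
}
```

Note that this test is block membership, not a true punctuation test: Halfwidth and Fullwidth Forms also contains fullwidth letters and digits, so a character such as Ａ (U+FF21) passes too. If that matters, the result can be combined with a check on Character.getType.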

Method 2: check the character range
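A range-based check compares the code point against numeric bounds instead of asking Character.UnicodeBlock. The sketch below is one possible form, assuming the ranges are simply the boundaries of the five blocks named under Method 1, plus the ASCII punctuation ranges for English symbols. The class and method names are hypothetical.

```java
// Hypothetical helper class; the names and range table are illustrative only.
final class PunctuationRanges {
    private PunctuationRanges() {}

    // Inclusive {start, end} bounds of the five blocks from Method 1.
    private static final int[][] CHINESE_PUNCTUATION_RANGES = {
        {0x2000, 0x206F}, // General Punctuation
        {0x3000, 0x303F}, // CJK Symbols and Punctuation
        {0xFF00, 0xFFEF}, // Halfwidth and Fullwidth Forms
        {0xFE30, 0xFE4F}, // CJK Compatibility Forms
        {0xFE10, 0xFE1F}, // Vertical Forms
    };

    static boolean isChinesePunctuation(int codePoint) {
        for (int[] range : CHINESE_PUNCTUATION_RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // ASCII punctuation and symbols: ! to /, : to @, [ to `, { to ~.
    static boolean isEnglishPunctuation(int codePoint) {
        return (codePoint >= 0x21 && codePoint <= 0x2F)
            || (codePoint >= 0x3A && codePoint <= 0x40)
            || (codePoint >= 0x5B && codePoint <= 0x60)
            || (codePoint >= 0x7B && codePoint <= 0x7E);
    }
}
```

Hard-coded ranges avoid any dependency on the JDK's Unicode version, at the cost of having to maintain the table by hand. As with the block-based check, the Chinese ranges also admit non-punctuation characters that happen to live in those blocks.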

