Java: how to filter chat messages by normalizing letterforms?

I'm filtering chat messages in a chat system, where I need to restrict strings to Latin-1 English. Users tend to get creative with their typing, for example writing

ßòógīě§

instead of

Boogies

Java has Unicode normalization methods that can strip diacritics, but I'm more interested in a way to normalize letter shapes into the English / Latin-1 character set.
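For reference, the diacritic-stripping approach typically looks something like the sketch below. It decomposes the text with `java.text.Normalizer` and then drops the combining marks. The class name `DiacriticStripper` and the method name `stripDiacritics` are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper, named for illustration only.
final class DiacriticStripper {

    // Decompose to NFD so accented letters split into base letter + combining
    // mark, then remove every combining mark (Unicode general category M).
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Escapes spell out: o-grave, o-acute, g, i-macron, e-caron.
        System.out.println(stripDiacritics("\u00F2\u00F3g\u012B\u011B"));
    }
}
```

With this, `stripDiacritics("ßòógīě§")` returns `ßoogie§`: the accented vowels are reduced to plain letters, but ß and § have no decomposition, so normalization alone leaves them in place.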

Are there any tables, libraries, or methods that map common non-Latin-1 Unicode characters to their nearest-looking form? For example:

ß -> B
§ -> S
¥ -> Y
¤ -> o
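
To make the idea concrete, a hand-rolled version of such a table might look like the sketch below. It is only an illustration: the class name `LookalikeFolder` and method name `fold` are hypothetical, the map holds just the four pairs listed above, and the policy of silently dropping anything that is neither mapped nor plain ASCII is an assumption, not a requirement.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, named for illustration only.
final class LookalikeFolder {

    // Visual look-alike table; only the four example pairs, not a complete list.
    private static final Map<Character, Character> LOOKALIKES = new HashMap<>();
    static {
        LOOKALIKES.put('\u00DF', 'B'); // sharp s
        LOOKALIKES.put('\u00A7', 'S'); // section sign
        LOOKALIKES.put('\u00A5', 'Y'); // yen sign
        LOOKALIKES.put('\u00A4', 'o'); // generic currency sign
    }

    static String fold(String input) {
        // Step 1: strip diacritics (decompose to NFD, drop combining marks).
        String base = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");

        // Step 2: map look-alikes, keep plain ASCII, drop everything else.
        StringBuilder out = new StringBuilder(base.length());
        for (int i = 0; i < base.length(); i++) {
            char c = base.charAt(i);
            Character mapped = LOOKALIKES.get(c);
            if (mapped != null) {
                out.append(mapped.charValue());
            } else if (c < 128) {
                out.append(c);
            }
            // Assumed policy: unmapped non-ASCII characters are discarded.
        }
        return out.toString();
    }
}
```

With these four mappings, `fold("ßòógīě§")` returns `BoogieS`. The hard part is populating the table for the full range of characters users might type, not the lookup itself.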

I suspect the answer is "no, it's too big a problem, just filter them all out", but I'm hoping otherwise.

Solution

I think your best choice is to use an OCR (optical character recognition) engine. After all, that's what you're after: a best-effort attempt to resolve glyphs into readable A-Z characters. (Remember to render the chat message to an image using the same font the chat client uses.)

Two OCR engines you can use from Java:

Asprise
Tesseract
