Java – how to filter chat messages by normalizing letter shapes?
I am filtering chat messages in a chat system, and I need to restrict the strings to Latin-1 English. Users tend to get creative with their characters, writing things such as
ßòógīě§
instead of
Boogies
Java has Unicode normalization methods that can be used to remove diacritics, but I am more interested in a way to normalize letter shapes into the English / Latin-1 character set.
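For reference, the diacritic-removal approach mentioned above can be sketched with `java.text.Normalizer` from the standard library. The class and method names (`DiacriticStripper`, `stripDiacritics`) are made up for illustration. The sketch decomposes the text to NFD and deletes the combining marks, which also shows the limitation: characters such as ß and § have no decomposition and pass through unchanged.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of any library.
class DiacriticStripper {

    // Decompose to NFD so that accented letters become base letter + combining
    // mark(s), then delete every combining mark (Unicode category M).
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // The sample string from the question, written with Unicode escapes so
        // the source file compiles regardless of its encoding.
        String sample = "\u00DF\u00F2\u00F3g\u012B\u011B\u00A7";
        // The accents are removed, but the sharp s and the section sign remain.
        System.out.println(stripDiacritics(sample));
    }
}
```

On the sample string this yields ßoogie§: the accents are gone, but the look-alike symbols are untouched, which is why normalization alone does not solve the problem.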
Are there any tables, libraries, or methods that map common Unicode characters outside Latin-1 to their nearest look-alike? For example:
ß -> B
§ -> S
¥ -> Y
¤ -> o
I suspect the answer is "no, it's too big, just filter them all out", but I can hope.
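A hand-rolled version of such a table might look like the sketch below. It is only an illustration under stated assumptions: the class name `LookalikeFolder`, the method `fold`, and the four table entries (taken from the examples in the question) are hypothetical, the table would have to be extended by hand, and anything that is neither plain ASCII nor in the table is simply dropped.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; the table is hand-picked, not exhaustive.
class LookalikeFolder {

    private static final Map<Character, String> LOOKALIKES = new HashMap<>();
    static {
        LOOKALIKES.put('\u00DF', "B"); // sharp s
        LOOKALIKES.put('\u00A7', "S"); // section sign
        LOOKALIKES.put('\u00A5', "Y"); // yen sign
        LOOKALIKES.put('\u00A4', "o"); // generic currency sign
    }

    static String fold(String input) {
        // NFD splits accented letters into base letter + combining mark(s).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            String mapped = LOOKALIKES.get(c);
            if (mapped != null) {
                out.append(mapped);      // known look-alike: substitute
            } else if (c < 0x80) {
                out.append(c);           // plain ASCII: keep as-is
            }
            // Everything else (combining marks, unmapped symbols) is dropped.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The sample string from the question, as Unicode escapes.
        System.out.println(fold("\u00DF\u00F2\u00F3g\u012B\u011B\u00A7"));
    }
}
```

With these four entries, the sample string from the question folds to "BoogieS". Dropping unmapped characters is the "filter them all out" fallback; a stricter filter could reject the whole message instead.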
Solution
I think your best option is to use an OCR (optical character recognition) engine. After all, that is what you are after: a best-effort reading of the glyphs as plain A-Z characters. (Remember to render the chat message to an image using the same font that the chat client uses.)
Two Java OCR libraries:
> Asprise
> Tesseract