Java – how to remove bad characters that are not valid UTF-8 for MySQL?
I have dirty data. Sometimes it contains characters like this. I use this data in a query:
WHERE a.address IN ('mydatahere')
For this character, I get an error.
How can I filter out such characters? I use Java.
Thank you.
Solution
When I ran into a problem like this, I used a Perl script to make sure the data was converted to valid UTF-8, using code like this:
use Encode;
binmode(STDOUT, ":utf8");
while (<>) {
    print Encode::decode('UTF-8', $_);
}
The script takes (possibly corrupted) UTF-8 on stdin and re-prints valid UTF-8 to stdout. Invalid characters are replaced with � (U+FFFD, the Unicode replacement character).
If you run this script on good UTF-8 input, the output should be identical to the input.
If the data is already in a database, it makes sense to use DBI to scan the table(s) and scrub all data with this approach, to make sure that everything is valid UTF-8.
This is the one-line Perl version of the same script:
perl -MEncode -e "binmode STDOUT,':utf8';while(<>){print Encode::decode 'UTF-8',\$_}" < bad.txt > good.txt
Edit: added a Java-only solution.
This is an example of how to do this in Java:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class UtfFix {
    public static void main(String[] args) throws CharacterCodingException {
        // Replace malformed or unmappable input instead of throwing.
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPLACE);
        decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);

        ByteBuffer bb = ByteBuffer.wrap(new byte[] {
            (byte) 0xD0, (byte) 0x9F, // 'П'
            (byte) 0xD1, (byte) 0x80, // 'р'
            (byte) 0xD0,              // corrupted UTF-8, was 'и'
            (byte) 0xD0, (byte) 0xB2, // 'в'
            (byte) 0xD0, (byte) 0xB5, // 'е'
            (byte) 0xD1, (byte) 0x82  // 'т'
        });

        CharBuffer parsed = decoder.decode(bb);
        // Prints "Пр�вет"; the replacement character U+FFFD may show up
        // as '?' on a console that cannot display it.
        System.out.println(parsed);
    }
}
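If you need this in more than one place, the same decoder setup can be wrapped in a small helper. The following is only a sketch: the class name Utf8Scrubber and the methods scrub and strip are made up for illustration and are not part of any library. scrub marks each invalid byte sequence with U+FFFD, as in the example above; strip uses CodingErrorAction.IGNORE to drop the invalid bytes instead, which may be closer to "filtering" if you do not want the replacement character in your query values.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative assumptions.
final class Utf8Scrubber {

    private Utf8Scrubber() {}

    // Decodes raw bytes as UTF-8, replacing each malformed sequence with U+FFFD.
    static String scrub(byte[] raw) {
        return decode(raw, CodingErrorAction.REPLACE);
    }

    // Decodes raw bytes as UTF-8, silently dropping each malformed sequence.
    static String strip(byte[] raw) {
        return decode(raw, CodingErrorAction.IGNORE);
    }

    private static String decode(byte[] raw, CodingErrorAction onError) {
        // A CharsetDecoder is stateful and not thread-safe,
        // so a fresh one is created for every call.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(onError)
                .onUnmappableCharacter(onError);
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE or IGNORE, but decode() declares it.
            throw new IllegalStateException(e);
        }
    }
}
```

As with the Perl script, valid UTF-8 input should come back unchanged, so it should be safe to run every value through scrub before using it in a query.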