Java modified UTF-8 strings in Python
I am interfacing with a Java application from Python. I need to be able to construct byte sequences that contain UTF-8 strings. Java's DataInputStream.readUTF() uses the modified UTF-8 encoding, which is not supported by Python (yet, at least).
Can anyone point me in the right direction for constructing Java modified UTF-8 strings in Python?
Update #1: To learn more about Java's modified UTF-8, see the readUTF method of the DataInput interface, on line 550 here or here in the Java SE docs.
Update #2: I am trying to interface with a third-party JBoss web application that uses this modified UTF-8 format, reading strings by calling DataInputStream.readUTF (sorry for any confusion with normal Java UTF-8 string operations).
Thank you in advance
Solution
You can ignore the modified UTF-8 encoding (MUTF-8) and just treat it as UTF-8. On the Python side, you can handle it like this:
1. Convert the string to normal UTF-8 and store the bytes in a buffer.
2. Write the 2-byte buffer length (the byte count, not the string length) as binary in big-endian order.
3. Write the entire buffer.
I did this in PHP, and Java didn't complain about my encoding at all (at least in Java 5).
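MUTF-8 is mainly used in JNI and other systems with null-terminated strings. The best-known difference from standard UTF-8 is how U+0000 is encoded: standard UTF-8 uses 1 byte (0x00), while MUTF-8 uses 2 bytes (0xC0 0x80). MUTF-8 also encodes characters above U+FFFF as a surrogate pair of two 3-byte sequences instead of a single 4-byte sequence. First, U+0000 should rarely appear in text anyway. Second, DataInputStream.readUTF() does not enforce the 2-byte form of U+0000, so it is happy to accept either. It does reject 4-byte sequences, though, so the plain UTF-8 shortcut is only safe for text that stays within the Basic Multilingual Plane.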
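EDIT: The Python code should look like this:

    import struct

    def writeUTF(data, text):
        # 'data' is a list of byte chunks; 'text' is the string to send.
        # (The parameter must not be named 'str', or str(length) below breaks.)
        utf8 = text.encode('utf-8')
        length = len(utf8)
        # 2-byte big-endian byte count, as readUTF() expects
        data.append(struct.pack('!H', length))
        # followed by the encoded bytes themselves
        fmt = '!' + str(length) + 's'
        data.append(struct.pack(fmt, utf8))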
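
If the text might contain U+0000 or characters above U+FFFF, plain UTF-8 no longer matches what the DataInput documentation describes as modified UTF-8. Below is a minimal Python 3 sketch of a stricter encoder. The names encode_mutf8, _three_bytes and frame_utf are hypothetical helpers made up for this illustration, not part of any library, and the code has not been tested against a real JBoss endpoint.

```python
import struct


def _three_bytes(unit):
    # Encode one 16-bit code unit in the 3-byte UTF-8 pattern.
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def encode_mutf8(text):
    # Hypothetical helper: encode a str as Java modified UTF-8.
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # U+0000 uses the 2-byte form so no 0x00 byte appears.
            out += b'\xc0\x80'
        elif cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            out += _three_bytes(cp)
        else:
            # Supplementary character: split into a UTF-16 surrogate
            # pair and encode each half as its own 3-byte sequence.
            cp -= 0x10000
            out += _three_bytes(0xD800 | (cp >> 10))
            out += _three_bytes(0xDC00 | (cp & 0x3FF))
    return bytes(out)


def frame_utf(text):
    # Hypothetical helper: 2-byte big-endian byte count + payload,
    # the layout DataInputStream.readUTF() reads.
    payload = encode_mutf8(text)
    if len(payload) > 0xFFFF:
        raise ValueError('encoded string exceeds 65535 bytes')
    return struct.pack('>H', len(payload)) + payload
```

For example, frame_utf("hi") returns b'\x00\x02hi', which is identical to the plain UTF-8 approach. The two only diverge for U+0000 and for characters outside the Basic Multilingual Plane.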