Java secure coding guide: strings and encoding
Introduction
String is the most commonly used Java type in everyday coding. Text comes in the languages of many regions, and even when Unicode is used throughout, it can be stored in different encoding forms such as UTF-8, UTF-16, and UTF-32.
What problems do we encounter when using character and string encoding? Let's have a look.
Creating strings from incomplete variable-length encoded characters
In Java, a String is a sequence of char values, and each char is a UTF-16 code unit. (Before Java 9 a String was literally backed by a char[]; newer versions use a more compact internal form, but the API is still UTF-16 based.)
StringBuilder and StringBuffer use the same char-based model.
So whenever we use InputStreamReader, OutputStreamWriter, or the String class to read, write, or build strings, a conversion between UTF-16 and another encoding is involved.
Let's look at the problems that can arise when converting from UTF-8 to UTF-16.
UTF-8 uses 1 to 4 bytes to represent a character, while UTF-16 uses 2 or 4 bytes.
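To make the size difference concrete, here is a small sketch that encodes a few sample strings both ways and counts the bytes. The class and method names (EncodingLengths, utf8Bytes, utf16Bytes) are made up for this illustration and are not part of the article's code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingLengths {
    // Number of bytes needed to encode s in UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to encode s in UTF-16 (big-endian, no byte-order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // U+0041, U+00E9, U+4E2D, and U+1F600 (written as a surrogate pair).
        String[] samples = { "A", "\u00e9", "\u4e2d", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.println("UTF-8: " + utf8Bytes(s) + " bytes, UTF-16: " + utf16Bytes(s) + " bytes");
        }
    }
}
```

The four samples need 1, 2, 3, and 4 bytes in UTF-8, but 2, 2, 2, and 4 bytes in UTF-16.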
What problems may arise from the conversion?
// Non-compliant: decodes each chunk of bytes as soon as it is read.
public String readByteWrong(InputStream inputStream) throws IOException {
    byte[] data = new byte[1024];
    int offset = 0;
    int bytesRead = 0;
    String str = "";
    while ((bytesRead = inputStream.read(data, offset, data.length - offset)) != -1) {
        // A multi-byte UTF-8 character may be split across two reads.
        str += new String(data, offset, bytesRead, "UTF-8");
        offset += bytesRead;
        if (offset >= data.length) {
            throw new IOException("Too much input");
        }
    }
    return str;
}
In the code above, each chunk of bytes is converted to a string as soon as it is read from the stream. Because UTF-8 is a variable-length encoding, a read may end in the middle of a multi-byte character. The partial byte sequence is then decoded on its own, and the resulting string is wrong.
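The effect can be reproduced without a stream. The sketch below decodes a three-byte UTF-8 character in two separate pieces, which is what happens when a read boundary falls inside the character. SplitUtf8Demo and decodeInTwoChunks are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class SplitUtf8Demo {
    // Decodes bytes[0..split) and bytes[split..end) separately and joins the results,
    // mimicking code that converts each chunk to a String as soon as it arrives.
    static String decodeInTwoChunks(byte[] bytes, int split) {
        String first = new String(bytes, 0, split, StandardCharsets.UTF_8);
        String second = new String(bytes, split, bytes.length - split, StandardCharsets.UTF_8);
        return first + second;
    }

    public static void main(String[] args) {
        byte[] bytes = "\u4e2d".getBytes(StandardCharsets.UTF_8); // three bytes: E4 B8 AD
        String whole = new String(bytes, StandardCharsets.UTF_8);
        String chunked = decodeInTwoChunks(bytes, 1);             // split inside the character
        System.out.println("same text: " + whole.equals(chunked));
        System.out.println("contains U+FFFD: " + chunked.contains("\uFFFD"));
    }
}
```

The String constructor replaces each malformed piece with the replacement character U+FFFD instead of throwing, so the corruption is silent.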
We need to do the following:
// Compliant: the reader decodes UTF-8 and holds back incomplete byte sequences.
public String readByteCorrect(InputStream inputStream) throws IOException {
    Reader r = new InputStreamReader(inputStream, "UTF-8");
    char[] data = new char[1024];
    int offset = 0;
    int charRead = 0;
    String str = "";
    while ((charRead = r.read(data, offset, data.length - offset)) != -1) {
        str += new String(data, offset, charRead);
        offset += charRead;
        if (offset >= data.length) {
            throw new IOException("Too much input");
        }
    }
    return str;
}
InputStreamReader converts the bytes it reads into char values, that is, it decodes UTF-8 into UTF-16, and it holds back any incomplete byte sequence until the remaining bytes arrive.
A character is therefore never split at a read boundary.
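Another way to avoid the problem, sketched below, is to collect all the bytes first and decode them only once at the end. This is an alternative to the reader-based version, not the article's own code; the class name ReadThenDecode, the method readAll, and the 1024-byte limit are assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class ReadThenDecode {
    static final int MAX_SIZE = 1024; // assumed limit, mirroring the buffer size used above

    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[256];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
            if (buffer.size() > MAX_SIZE) {
                throw new IOException("Too much input");
            }
        }
        // Decode only once, after every byte has arrived, so no character is split.
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Because decoding happens on the complete byte array, it does not matter where the individual reads happened to end.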
A char cannot represent every Unicode code point
Because a char is a UTF-16 code unit, a single char can directly represent the code points U+0000 to U+D7FF and U+E000 to U+FFFF.
Code points from U+10000 to U+10FFFF, however, are represented by two chars: a high surrogate in the range 0xD800 to 0xDBFF followed by a low surrogate in the range 0xDC00 to 0xDFFF.
In this case only the pair of chars is meaningful; either char on its own is meaningless.
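The following sketch shows a surrogate pair in practice, using U+1D11E (a musical symbol) as an example of a code point above U+FFFF. SurrogatePairDemo and fromCodePoint are hypothetical names used only here.

```java
// Hypothetical demo class, for illustration only.
class SurrogatePairDemo {
    // Builds a one-character string from any Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String clef = fromCodePoint(0x1D11E);
        System.out.println(clef.length());                             // number of char values
        System.out.println(clef.codePointCount(0, clef.length()));     // number of code points
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // first half of the pair
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));  // second half of the pair
    }
}
```

The string holds one code point but has length 2: its chars are 0xD834 and 0xDD1E, one from each surrogate range.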
Consider the following substring method, which is meant to find the first non-letter in the input string and return the substring that starts at that position.
public static String subStringWrong(String string) {
    char ch;
    int i;
    for (i = 0; i < string.length(); i += 1) {
        // Examines one char at a time, so a surrogate pair is tested as two halves.
        ch = string.charAt(i);
        if (!Character.isLetter(ch)) {
            break;
        }
    }
    return string.substring(i);
}
In this example we examine the chars of the string one at a time. If the string contains a character in the range U+10000 to U+10FFFF, each of its surrogate chars is tested on its own, and the method mistakenly concludes that the character is not a letter.
We can modify it as follows:
public static String subStringCorrect(String string) {
    int ch;
    int i;
    // Advance by one or two chars, depending on the size of the code point just read.
    for (i = 0; i < string.length(); i += Character.charCount(ch)) {
        ch = string.codePointAt(i);
        if (!Character.isLetter(ch)) {
            break;
        }
    }
    return string.substring(i);
}
We use String's codePointAt method to get the Unicode code point at each position, pass that code point to isLetter, and advance by Character.charCount(ch) so that a surrogate pair is skipped as a unit.
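The difference between the two checks can be seen on a string containing U+10400, a letter outside the range a single char can hold. The sketch below is only an illustration; LetterCheckDemo and its two methods are hypothetical names, and the code-point version uses String.codePoints() rather than the explicit loop shown above.

```java
// Hypothetical demo class, for illustration only.
class LetterCheckDemo {
    // Tests each char separately, so the halves of a surrogate pair are judged on their own.
    static boolean allLettersByChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLetter(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Tests whole code points, so a surrogate pair is judged as one character.
    static boolean allLettersByCodePoint(String s) {
        return s.codePoints().allMatch(cp -> Character.isLetter(cp));
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x10400)); // "a" followed by U+10400
        System.out.println(allLettersByChar(s));
        System.out.println(allLettersByCodePoint(s));
    }
}
```

The char-based check rejects the string, while the code-point-based check accepts it.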
Pay attention to the use of locale
To support internationalization, Java introduces the concept of a locale. Because some string operations depend on the locale, string conversions can produce unexpected results.
Consider the following example:
public void toUpperCaseWrong(String input) {
    if (input.toUpperCase().equals("JOKER")) {
        System.out.println("match!");
    }
}
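We expect the comparison to behave as it does in English. If the system's default locale is a different language, input.toUpperCase() may return an unexpected result. For example, in the Turkish locale the lowercase letter i is uppercased to İ (U+0130), so a word containing i no longer matches its ASCII uppercase form.
Fortunately, toUpperCase has an overload that takes a Locale, so the code can be changed as follows: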
public void toUpperCaseRight(String input) {
    if (input.toUpperCase(Locale.ENGLISH).equals("JOKER")) {
        System.out.println("match!");
    }
}
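As a concrete illustration of locale-dependent case mapping, the sketch below uppercases the word "title" under the English and Turkish locales. The word and the names LocaleCaseDemo and upper are chosen for this example only.

```java
import java.util.Locale;

// Hypothetical demo class, for illustration only.
class LocaleCaseDemo {
    // Uppercases s using the rules of the given locale.
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println(upper("title", Locale.ENGLISH).equals("TITLE")); // true
        System.out.println(upper("title", turkish).equals("TITLE"));        // false
    }
}
```

Under the Turkish locale the result is "T\u0130TLE", with a dotted capital I, so a comparison against "TITLE" fails on a machine whose default locale is Turkish unless the locale is passed explicitly.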
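DateFormat has a similar problem. Without arguments it uses the default locale, as in the first method below; passing a style and a locale, as in the second method, makes the output predictable:
// Non-compliant: the format depends on the default locale.
public void getDateInstanceWrong(Date date) {
    String myString = DateFormat.getDateInstance().format(date);
}
// Compliant: the style and the locale are explicit.
public void getDateInstanceRight(Date date) {
    String myString = DateFormat.getDateInstance(DateFormat.MEDIUM, Locale.US).format(date);
}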
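To see how much the locale changes the result, the following sketch formats the same Date with the same style under two locales. DateLocaleDemo and mediumDate are hypothetical names; the exact text printed depends on the JDK's locale data and the default time zone, so no output is shown here.

```java
import java.text.DateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical demo class, for illustration only.
class DateLocaleDemo {
    // Formats the date in MEDIUM style using the conventions of the given locale.
    static String mediumDate(Date date, Locale locale) {
        return DateFormat.getDateInstance(DateFormat.MEDIUM, locale).format(date);
    }

    public static void main(String[] args) {
        Date date = new Date(0L); // a fixed instant, so both calls format the same moment
        System.out.println(mediumDate(date, Locale.US));
        System.out.println(mediumDate(date, Locale.GERMANY));
    }
}
```

The two lines differ in field order and separators, which is why code that parses or compares formatted dates should not rely on the default locale.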
When comparing strings, we must consider the influence of locale.
Encoding format in file reading and writing
When we read and write files with InputStream and OutputStream, the data is handled as raw bytes, so no encoding conversion is involved.
However, if we use a Reader and a Writer on files, we must consider the file's encoding.
If the file is UTF-8 encoded and we read it as UTF-16, the result will certainly be wrong.
Consider the following example:
public void fileOperationWrong(String inputFile, String outputFile) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(inputFile));
    PrintWriter writer = new PrintWriter(new FileWriter(outputFile));
    int line = 0;
    while (reader.ready()) {
        line++;
        writer.println(line + ": " + reader.readLine());
    }
    reader.close();
    writer.close();
}
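We want to read the source file and write each line, prefixed with its line number, to a new file. Because FileReader and FileWriter are created without a charset here, they use the default charset, which may not match the file's actual encoding, so the output may be wrong.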
The above code can be modified as follows:
// StandardCharsets is in java.nio.charset.
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile), StandardCharsets.UTF_8));
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(outputFile), StandardCharsets.UTF_8));
Specifying the encoding explicitly ensures that the files are read and written correctly.
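The same line-numbering task can also be written with java.nio.file.Files, which takes the charset as an argument and works with try-with-resources. This is a sketch of one possible variant, not the article's code; LineNumberer and numberLines are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
class LineNumberer {
    // Copies input to output, prefixing each line with its line number.
    // Both files are treated as UTF-8, regardless of the default charset.
    static void numberLines(Path input, Path output) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             PrintWriter writer = new PrintWriter(
                     Files.newBufferedWriter(output, StandardCharsets.UTF_8))) {
            int lineNumber = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                writer.println(lineNumber + ": " + line);
            }
        }
    }
}
```

Looping until readLine() returns null avoids relying on ready(), and try-with-resources closes both files even if an exception is thrown.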
Do not encode non-character data as strings
We often need to encode binary data as a string, for example to store it in a database.
Binary data is represented as bytes, but as we saw above, not every byte sequence is a valid encoding of characters. Converting bytes that do not represent characters into a string can corrupt the data.
Take the following example:
public void convertBigIntegerWrong() {
    BigInteger x = new BigInteger("1234567891011");
    System.out.println(x);
    byte[] byteArray = x.toByteArray();
    // Decodes arbitrary bytes as text using the default charset.
    String s = new String(byteArray);
    byteArray = s.getBytes();
    x = new BigInteger(byteArray);
    System.out.println(x);
}
In this example we convert a BigInteger to a byte array (big-endian), then convert the byte array to a String, and finally convert the String back to a BigInteger.
Let's look at the results first:
1234567891011
80908592843917379
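The round trip failed: the second value differs from the first. With UTF-8 as the default charset, the array contains the byte 0xFB, which is not valid UTF-8, so it is replaced by U+FFFD during decoding and the original byte is lost.
Although the String constructor accepts a charset as its second argument, that does not make arbitrary bytes safe to decode. The charsets that every Java platform must support are US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16, and in an encoding such as UTF-8 many byte sequences are not valid characters. When decoding, the UTF-16 charset assumes big-endian byte order if no byte-order mark is present.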
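One way to notice such data before it is silently corrupted is to decode with a CharsetDecoder configured to report errors rather than replace them. The sketch below is an illustration under that assumption; StrictDecodeDemo, isValidUtf8, and roundTrip are hypothetical names.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class StrictDecodeDemo {
    // Returns true only if the bytes are a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Pushes the value's bytes through a UTF-8 String and back, as the faulty code does.
    static BigInteger roundTrip(BigInteger value) {
        String s = new String(value.toByteArray(), StandardCharsets.UTF_8);
        return new BigInteger(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        BigInteger x = new BigInteger("1234567891011");
        System.out.println(isValidUtf8(x.toByteArray())); // the bytes are not valid UTF-8
        System.out.println(x.equals(roundTrip(x)));       // so the round trip changes the value
    }
}
```

Both lines print false for this value: the bytes are rejected by the strict decoder, and the lenient round trip returns a different number.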
How to modify the above example?
public void convertBigIntegerRight() {
    BigInteger x = new BigInteger("1234567891011");
    String s = x.toString(); // convert to a string that can be stored safely
    byte[] byteArray = s.getBytes();
    String ns = new String(byteArray);
    x = new BigInteger(ns);
    System.out.println(x);
}
We can first convert the BigInteger to its decimal string representation with the toString method, and work with that string instead.
We can also use Base64 to encode the byte array as text without losing any data, as shown below:
public void convertBigIntegerWithBase64() {
    BigInteger x = new BigInteger("1234567891011");
    byte[] byteArray = x.toByteArray();
    String s = Base64.getEncoder().encodeToString(byteArray);
    byteArray = Base64.getDecoder().decode(s);
    x = new BigInteger(byteArray);
    System.out.println(x);
}
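Base64 is safe for any binary data, not only for this particular number. As a quick check, the sketch below round-trips all 256 possible byte values. Base64RoundTrip, toText, and fromText are hypothetical names used for this example.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper class, for illustration only.
class Base64RoundTrip {
    // Encodes arbitrary bytes as ASCII text.
    static String toText(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Recovers the original bytes from the text.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i; // every possible byte value, including ones that are not valid text
        }
        System.out.println(Arrays.equals(all, fromText(toText(all))));
    }
}
```

The encoded text contains only ASCII letters, digits, '+', '/', and '=' padding, so it survives storage in any text column or file.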
Code for this article:
learn-java-base-9-to-20/tree/master/security