Several ways to improve string performance by 10 times! (source code + principle analysis)

String is one of the most frequently used data types, so making string operations more efficient is one of the most effective ways to improve program performance.

We will start from the source code of String and work step by step toward the small goal of string optimization. This article not only shows how to use strings efficiently, but also explains the underlying reasons.

The topics covered in this article are shown in the figure below:

Before looking at how to optimize String, let's first understand its characteristics. After all, only by knowing both yourself and your enemy can you win every battle.

Properties of strings

To understand the characteristics of string, you must start with its source code, as shown below:

// Source code based on JDK 1.8
public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence {
    // The actual storage for the String's value
    private final char value[];
    // Cached hash code of the string
    private int hash;
    public String() {
        this.value = "".value;
    }
    public String(String original) {
        this.value = original.value;
        this.hash = original.hash;
    }
    // Other members omitted
}

From this source code, we can see that both the String class and its value[] field are declared final. value[] is the underlying structure that stores the string's characters, and final, as the name suggests, means it cannot be changed afterwards.

A class declared final cannot be inherited, that is, it cannot have subclasses, and a field declared final cannot be reassigned after construction. Because value[] is also private and String exposes no method that modifies its contents, a string cannot be modified once it is created.
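A minimal sketch of what immutability means in practice: methods such as toUpperCase() and concat() return a new String and leave the original untouched. The class name ImmutabilityDemo and the helper afterCalls are made up for illustration.

```java
public class ImmutabilityDemo {
    // Hypothetical helper: calls "modifying" methods and returns the original,
    // to show that the original string is not changed by them.
    public static String afterCalls(String original) {
        original.toUpperCase(); // result discarded; original is not altered
        original.concat("-8");  // same here
        return original;
    }

    public static void main(String[] args) {
        String original = "java";
        String joined = original.concat("-8"); // a new String object

        System.out.println(afterCalls(original)); // java
        System.out.println(joined);               // java-8
    }
}
```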

Why can't string be modified?

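Both the String class and its value[] field are declared final. This brings several benefits: an immutable string can be shared safely between threads, equal string literals can be reused from the string constant pool, and the hash code can be computed once and cached in the hash field.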

1. Do not use += on strings directly

From the above, we know that String is immutable, so we should avoid frequent use of += on strings.

Code before optimization:

public static String doAdd() {
    String result = "";
    for (int i = 0; i < 10000; i++) {
        result += (" i:" + i);
    }
    return result;
}

Some people may ask: my business logic requires this kind of concatenation, so how should I implement it?

The JDK provides two classes for string concatenation: StringBuffer and StringBuilder. StringBuilder is not thread-safe, while StringBuffer is. StringBuffer's append methods use the synchronized keyword to guarantee thread safety. The source code is as follows:

@Override
public synchronized StringBuffer append(CharSequence s) {
    toStringCache = null;
    super.append(s);
    return this;
}

Because of the synchronized modifier, StringBuffer's concatenation performance is lower than StringBuilder's.

So we use StringBuilder to implement the concatenation. The optimized code is as follows:

public static String doAppend() {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 10000; i++) {
        sb.append(" i:" + i);
    }
    return sb.toString();
}

Let's test the performance difference between the two methods through code:

public class StringTest {
    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            // String
            long st1 = System.currentTimeMillis(); // start time
            doAdd();
            long et1 = System.currentTimeMillis(); // end time
            System.out.println("String concatenation, elapsed time (ms): " + (et1 - st1));
            // StringBuilder
            long st2 = System.currentTimeMillis(); // start time
            doAppend();
            long et2 = System.currentTimeMillis(); // end time
            System.out.println("StringBuilder concatenation, elapsed time (ms): " + (et2 - st2));
            System.out.println();
        }
    }
    public static String doAdd() {
        String result = "";
        for (int i = 0; i < 10000; i++) {
            result += ("Java中文社群:" + i);
        }
        return result;
    }
    public static String doAppend() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("Java中文社群:" + i);
        }
        return sb.toString();
    }
}

The results of the above program are as follows:

The results show that performance differs greatly before and after optimization.

Next, let's think about a question: why does StringBuilder.append() perform better than +=? And why does the gap grow as the number of concatenations increases?

When we open the source code of StringBuilder, we can find the "little secret". The source code of its parent class, AbstractStringBuilder, is as follows:

abstract class AbstractStringBuilder implements Appendable,CharSequence {
    char[] value;
    int count;
    @Override
    public AbstractStringBuilder append(CharSequence s,int start,int end) {
        if (s == null)
            s = "null";
        if ((start < 0) || (start > end) || (end > s.length()))
            throw new IndexOutOfBoundsException(
                "start " + start + ",end " + end + ",s.length() "
                + s.length());
        int len = end - start;
        ensureCapacityInternal(count + len);
        for (int i = start,j = count; i < end; i++,j++)
            value[j] = s.charAt(i);
        count += len;
        return this;
    }
    // Other members omitted...
}

StringBuilder uses the char[] provided by its parent class as the actual storage for its value, and each append modifies this char[] array in place. The source code of StringBuilder.toString() is as follows:

@Override
public String toString() {
    // Create a copy, don't share the array
    return new String(value,count);
}

From the source code above, we can see that StringBuilder uses a char[] as its actual storage. Each append only needs to modify that char[] array, and a String is created only when toString() is called. A String, by contrast, cannot be modified once created, so every += has to create a new String and copy all the characters accumulated so far into it. That is why StringBuilder.append() performs much better than += on a String, and why the gap widens as the number of concatenations grows.
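As a rough sketch of where the cost of += comes from: with the JDK 8 compiler, each `result += piece` is compiled to roughly the code in concatOnce below. Newer JDKs use a different mechanism, but still produce a new String per statement. The class and method names here are illustrative only.

```java
public class ConcatSketch {
    // Roughly what `result += piece` turns into under the JDK 8 compiler:
    // a fresh StringBuilder, a copy of everything accumulated so far,
    // and a fresh String, on every single iteration.
    public static String concatOnce(String result, String piece) {
        return new StringBuilder().append(result).append(piece).toString();
    }

    public static String withPlusEquals(int n) {
        String result = "";
        for (int i = 0; i < n; i++) {
            result = concatOnce(result, " i:" + i);
        }
        return result;
    }

    // One builder reused for the whole loop: characters are appended in place.
    public static String withBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(" i:").append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Both approaches build the same text; only the amount of copying differs.
        System.out.println(withPlusEquals(100).equals(withBuilder(100))); // true
    }
}
```

With n iterations, the += version copies the accumulated text n times, while the builder version copies characters only when its internal array has to grow.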

2. Make good use of the intern() method

Making good use of String.intern() can effectively save memory and improve the running efficiency of strings. Let's first look at the definition and source code of intern():

/**
 * Returns a canonical representation for the string object.
 * <p>
 * A pool of strings, initially empty, is maintained privately by the
 * class {@code String}.
 * <p>
 * When the intern method is invoked, if the pool already contains a
 * string equal to this {@code String} object as determined by
 * the {@link #equals(Object)} method, then the string from the pool is
 * returned. Otherwise, this {@code String} object is added to the
 * pool and a reference to this {@code String} object is returned.
 * <p>
 * It follows that for any two strings {@code s} and {@code t},
 * {@code s.intern() == t.intern()} is {@code true}
 * if and only if {@code s.equals(t)} is {@code true}.
 * <p>
 * All literal strings and string-valued constant expressions are
 * interned. String literals are defined in section 3.10.5 of the
 * <cite>The Java&trade; Language Specification</cite>.
 *
 * @return  a string that has the same contents as this string, but is
 *          guaranteed to be from a pool of unique strings.
 */
public native String intern();

It can be seen that intern() is a native method. According to its Javadoc, when intern() is called, if the string constant pool already contains an equal string, a reference to the pooled string is returned directly. Otherwise, the string is first added to the pool, and then a reference to it is returned.

So when is the intern () method appropriate?

Twitter engineers once shared an example of using String.intern(). Every time a status message is published on Twitter, an address record is generated. Given the number of Twitter users at the time, the servers needed 32 GB of memory to store this address information.

public class Location {
    private String city;
    private String region;
    private String countryCode;
    private double longitude;
    private double latitude;
}

Since many users share the same address information, such as country, province and city, this part can be extracted into a separate class to reduce duplication. The code is as follows:

public class SharedLocation {

  private String city;
  private String region;
  private String countryCode;
}

public class Location {

  private SharedLocation sharedLocation;
  private double longitude;
  private double latitude;
}

After this optimization, the data size is reduced to about 20 GB. But that is still very large for data held in memory. What should we do?

Twitter engineers used String.intern() to reduce the storage size of the highly repetitive address information from 20 GB to a few hundred megabytes, thereby optimizing the storage of String objects.

The core code of the implementation is as follows:

SharedLocation sharedLocation = new SharedLocation();
sharedLocation.setCity(messageInfo.getCity().intern());
sharedLocation.setRegion(messageInfo.getRegion().intern());
sharedLocation.setCountryCode(messageInfo.getCountryCode().intern());
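To make the effect of the intern() calls concrete, here is a small self-contained sketch. The nested SharedLocation is a simplified stand-in for the class above with an assumed setter and getter, and fromRawCity is a hypothetical helper. It shows two equal city strings from different "messages" collapsing into one shared instance.

```java
public class InternSketch {
    // Simplified stand-in for the article's SharedLocation (accessors assumed).
    public static class SharedLocation {
        private String city;
        public void setCity(String city) { this.city = city; }
        public String getCity() { return city; }
    }

    // Hypothetical helper: builds a SharedLocation whose city is interned.
    public static SharedLocation fromRawCity(String rawCity) {
        SharedLocation loc = new SharedLocation();
        loc.setCity(rawCity.intern());
        return loc;
    }

    public static void main(String[] args) {
        // new String(...) forces two distinct objects with equal content,
        // standing in for values parsed from two different messages.
        String a = new String("Beijing");
        String b = new String("Beijing");
        System.out.println(a == b); // false: two separate objects

        SharedLocation first = fromRawCity(a);
        SharedLocation second = fromRawCity(b);
        System.out.println(first.getCity() == second.getCity()); // true: one shared instance
    }
}
```

Without intern(), each record would keep its own copy of the city string, which is where the duplicated memory comes from.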

Since JDK 1.7, the string constant pool has been moved into the heap, so intern() no longer copies the string. Instead, a reference to the first string encountered is added to the constant pool. intern() then only checks whether an equal string already exists in the constant pool and, if so, returns the reference held there.

This is equivalent to the following code:

String s1 = new String("Java中文社群").intern();
String s2 = new String("Java中文社群").intern();
System.out.println(s1 == s2);

The result of execution is: true

If you are wondering why we don't assign the value directly (using String s1 = "Java中文社群"), it is because this code is a simplification of the Twitter business code above, which obtains its strings from objects rather than from literals. For more about intern(), see the article "Don't ask me again how many objects new String creates! Let me prove it to you".

3. Use the split() method with caution

The reason we advise using split() with caution is that in most cases split() uses regular expressions. This way of splitting is not a problem in itself, but the performance of regular expressions is very unstable: improper use can cause backtracking problems, which may lead to high CPU usage.

For example, the following regular expression:

String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\\\/])+$";
String bugUrl = "http://www.apigo.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";
if (bugUrl.matches(badRegex)) {
    System.out.println("match!!");
} else {
    System.out.println("no match!!");
}

The execution effect is shown in the following figure:

The Java regular expression engine is implemented as an NFA (nondeterministic finite automaton). This kind of engine backtracks while matching characters, and once backtracking occurs, the time consumed can become very long, possibly minutes or even hours, depending on the number and complexity of the backtracking steps.

To better explain what backtracking is, we use the following example:

String text = "abbc";
String regex = "ab{1,3}c";

The purpose of the above example is simple: it matches a string that starts with a and ends with c, with one to three b characters in between.

The parsing process of NFA engine is as follows:
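The steps can be sketched as comments next to a runnable check. This is an informal trace of how a backtracking engine handles the example, not output produced by the JDK, and the class name BacktrackDemo is made up for illustration.

```java
import java.util.regex.Pattern;

public class BacktrackDemo {
    // Thin wrapper around the standard Pattern.matches(regex, input).
    public static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Trace for text "abbc" against "ab{1,3}c":
        //   1. 'a' in the pattern matches 'a' in the text.
        //   2. b{1,3} is greedy: it takes 'b', then 'b', then tries for a
        //      third 'b' but finds 'c', so that attempt fails.
        //   3. The engine backtracks: it keeps the two b's already matched
        //      and hands 'c' to the next part of the pattern.
        //   4. 'c' in the pattern matches 'c'. The whole match succeeds.
        System.out.println(matches("ab{1,3}c", "abbc"));   // true
        System.out.println(matches("ab{1,3}c", "abbbbc")); // false: four b's
        System.out.println(matches("ab{1,3}c", "ac"));     // false: no b
    }
}
```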

This is the normal matching process together with a simple case of backtracking. In the earlier URL example, once the engine gets past the host name into the path and query string, greedy matching makes it keep reading the rest of the string. In the end it finds no further dot, so it backs off one character at a time, which drives CPU usage up.

So we should use split() with caution. Where possible, String.indexOf() can be used in place of split() to split a string. If indexOf() really cannot meet the requirements, watch out for backtracking when using split().
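A minimal sketch of splitting with indexOf() instead of split(), assuming the delimiter is a single character. splitByChar is a hypothetical helper, not a JDK method. Unlike String.split(), it keeps trailing empty strings.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOfSplit {
    // Hypothetical helper: splits text on a single delimiter character
    // using indexOf/substring, with no regular expression involved.
    public static List<String> splitByChar(String text, char delimiter) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int pos;
        while ((pos = text.indexOf(delimiter, start)) != -1) {
            parts.add(text.substring(start, pos));
            start = pos + 1;
        }
        parts.add(text.substring(start)); // remainder after the last delimiter
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitByChar("city,region,countryCode", ','));
        // prints [city, region, countryCode]
    }
}
```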
