A large number of GCs in the company’s online virtual machines have led to a surge in STW and CPU — a process of cocooning and positioning
background
There is a Tomcat service on the company's online, in which about 8 microservices are merged and deployed. The reason why they are not deployed separately as other microservices is to save server resources. Moreover, these 8 services are edge services with low concurrency. Even if they are down, they will not affect the core business.
Because the concurrency is not high, two Tomcat are deployed online for load balancing.
This Tomcat has just been put on the production line and runs very smoothly. About one day later, the operation and maintenance colleagues reported that the two Tomcat nodes were hung up. New requests cannot be accepted. CPU soared to 100%.
Troubleshooting process I
After taking over the problem. First, I took a general look at the JVM monitoring at that time.
CPU is indeed high
Full GC starts from about 22:00 this hour. Full GC can be performed 10 times a minute at most
Minor GC is nearly 60 times per minute, which is equivalent to minor GC every second
This is also reflected from the use of the old age
Randomly analyze the CPU occupation of threads for online applications, and use the top - H - P PID command
You can see that the first four threads occupy a lot of CPU resources. Then jstack is performed, the thread stack information is pulled down, and the ID of the first four threads is converted to hexadecimal for search. No corresponding thread found. Therefore, it is judged that it is not caused by the application thread.
First conclusion
By monitoring the JVM at that time, it can be found that. Before 22 minutes of this hour, the system has maintained a relatively stable operation state, and the utilization rate of the heap is not high. However, after 22 minutes, after a large number of minor GC of the younger generation, the old age is quickly filled in a few minutes. Results in full GC. At the same time, the continuous occurrence of full GC leads to a large number of STW, the CPU is occupied by full GC threads, and the CPU soars. Due to STW and high CPU, a large number of application threads are blocked. At the same time, new requests keep coming in. Finally, the thread pool of Tomcat is full and can no longer respond to new requests. The snowball finally rolled big.
After analyzing the crime scene. The problem to be solved becomes:
What causes the elderly generation to fill up quickly?
Pull down the JVM parameters at that time
There are 4 g heaps in total, and the younger generation gives 2 g separately. According to the ratio, the allocation of each area of JVM memory is as follows:
Therefore, I began to suspect that there was a problem with the JVM parameter setting, which led to the rapid filling of the old age.
But in fact, this parameter is already the result of the previous optimization. The Eden area is set very large. Most of the objects generated by our methods are life and death objects, and most of them should be cleaned up in the younger generation. Only surviving objects enter the survivor area. Only after reaching the age or triggering the conditions for entering the old age can we enter the old generation. Basically, most of the objects in the old age should have been living objects. For example, static modified objects, always referenced caches, beans in spring containers, and so on.
I looked at the trigger conditions of garbage collection into the old age (after the official account was returned, I replied "JVM"). After obtaining the data on JVM memory allocation and recycling mechanism, it is found that this scenario should belong to the case that large objects directly enter the old age, that is, after the younger generation performs minor GC, the surviving objects are large enough to be put down in the survivor area, and then directly enter the older generation.
However, more than 90% of the objects in a minor GC should be unreferenced objects, and only a small number of objects can survive. Moreover, the concurrency of these services has not been high. Why does so much data survive after a minor GC.
Then I looked at the file generated by the jmap - histo command at that time
It is found that there are more than 9000 w examples of string objects, occupying more than 2G heap. There must be a problem. However, there are 8 applications in tomcat, which cannot be located by analyzing the code. You still have to start with the JVM.
Second conclusion
The program concurrency is not high, but within a few minutes, a large number of objects are generated in the Eden area, and these objects cannot be recycled by minor GC. Because they are too large, the mechanism of large objects directly entering the old age is triggered, and the old age will fill up quickly, resulting in the surge of full GC and subsequent CPU, resulting in the downtime of Tomcat.
The basic judgment is that there should be no problem with JVM parameters. It is likely that the problem lies in the continuous generation of objects that cannot be recycled by the application itself. But I can't locate the specific code location for the time being.
Troubleshooting process II
The next day, I looked at the JVM monitoring at that time and found that such a monitoring data was missed at that time
This is the usage rate in the old age after full GC. You can see. After full GC, the old age still occupies more than 80% of the space. Full GC can't clean up old objects at all. This shows that these objects in the old days are referenced by the program. So I can't clean it up. But when it is stable, the old age has maintained a pile of about 300m. From 22:00 this hour, it soared to nearly 2G. This must be abnormal. It further confirms my previous view. This is caused by unrecoverable objects generated by the application.
But I didn't dump the JVM heap at that time. So we can only wait to reproduce the problem again.
Finally, at more than 9 p.m., this problem reappeared, familiar formula and familiar taste.
Directly jmap - dump, after a long wait, a 4.2g heap snapshot file dump Hprof is compressed to obtain a 466m tar GZ file
Then download to the local and unzip.
Run the heap analysis tool jpprofile and load the dump Hprof file.
Then view the proportion size of all classes in the heap at that time
It is found that the string object that causes heap overflow exceeds 2 g as the result of jmap before, and cannot be recycled
Then look at the large object view and find that these string objects are all Java util. ArrayList refers to, that is, an ArrayList refers to more than 2G objects
Then view the referenced diagram, trace the source upward, and the source finally appears:
The ArrayList is referenced by a thread stack, and the thread stack information can directly locate the corresponding service and corresponding class. The specific service is the micro service media.
It seems to be approaching the truth!
Third conclusion
The reason for this frequent full GC is that the application generates a large amount of data that cannot be recycled, which eventually enters the elderly generation and finally fills the elderly generation. The specific location has been analyzed through the dump file of the JVM, pointing to the imagecombineutils of the media service The method getcomputedlines is generated. It is not known yet. The code needs to be analyzed in detail.
last
When you know the specific code location, go in and have a look. After a small partner's reminder, it is found that there is a problem with this code.
This code is a word splitting method. The specific code will not be pasted. There is a loop. Each loop will add a string object to an ArrayList. At a certain stage of the loop, the loop counter I will be reset. There is no problem under ordinary parameters. But under certain conditions. It will constantly reset the cycle counter I, resulting in an dead cycle.
The following is the simulation result. You can see that after running for a while, the ArrayList generates 322W objects, and most Stirng objects are null.
At this point, the truth comes out.
Final conclusion
Because a section of the microservice program media in some special scenarios leads to an endless loop, resulting in a large ArrayList. This leads to the rapid filling of the young generation, and then triggers the mechanism of large objects directly entering the old age, directly putting them into the old age. After the old age is filled. Trigger full GC. However, these ArrayLists are referenced by the GC root root and cannot be recycled. It can't be recycled. The old age is still full, and the full GC is triggered immediately. At the same time, because the elderly generation can not be recycled, the minor GC can not be cleaned, and the minor GC is carried out continuously. A large number of GCs cause the STW and CPU to soar, causing the application thread to jam and block, until finally the whole service cannot accept the request.