A memory leak caused by development: this time, ops doesn't take the blame

A while ago, our team set up an on-call rotation to take turns looking after our services, mainly handling alarm emails, troubleshooting bugs, and dealing with operational issues. Weekdays are fine, since I have to be at work anyway; but if something breaks on a weekend, the day is ruined.

I don't know whether it's because the company network is simply that large or because the network operations group isn't up to the job, but there is always something wrong with the network: if a switch isn't dropping off the net, a router is breaking down, plus timeouts of every kind. Our detection service is sensitive to all of this and always catches these small problems accurately, handing us a steady stream of "great" work.

More than once, teammates have joked about how to get out of this on-call arrangement, even about quietly stopping the detection service without anyone noticing (though nobody dares).

A few days ago I spent a weekend cleaning up one of these detection-service messes. This article will keep being revised, so feel free to keep following it.

1、 Problem

Network problem?

Starting a little past 7 p.m., I began receiving alarm emails continuously. They showed that several of the probed interfaces had timed out, and most of the execution stacks showed the request thread waiting to read the interface response.

I've seen plenty of errors with this kind of thread stack. The HTTP DNS timeout we set is 1 s, the connect timeout is 2 s, and the read timeout is 3 s. Errors like these mean the detection service sent the HTTP request normally, and the server received it, processed it, and responded normally, but the packet was lost somewhere in the hop-by-hop forwarding at the network layer, so the stack of the request thread stays at the point where it reads the interface response.
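For reference, here is a minimal sketch of how such timeouts might be wired up with plain HttpURLConnection. The class, the URL, and the exact API choices are my own assumptions for illustration, not the detection service's real code; note also that the JDK has no per-request DNS timeout, so the 1 s DNS limit would have to be enforced separately (for example by a custom resolver).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ProbeClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical probe target; the real interfaces are internal.
        URL url = new URL("http://example.com/api/health");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(2_000); // connect timeout: 2 s
        conn.setReadTimeout(3_000);    // read timeout: 3 s

        // If no byte arrives within 3 s, readLine() throws SocketTimeoutException.
        // While packets are being dropped but the timeout has not yet fired,
        // the thread stack sits here, waiting on the response.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

With a setup like this, a response that stalls mid-transfer leaves the thread parked exactly where the alarm stacks showed: inside the read of the response.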

The typical signature of this situation is that the corresponding log entry can be found on the server, and that log shows the server responded completely normally. By contrast, when the thread stack stays at the socket connect, the connection failed to be established in the first place and the server is entirely unaware of the request.

I noticed that one interface reported errors far more often than the others. This interface has to upload a 4 MB file to the server and, after a series of business-logic steps, gets back 2 MB of text data, while the other interfaces involve only simple business logic. My guess was that with so much data going up and down, a lost packet causing a timeout is simply much more likely.

Following this guess, I logged on to the server and searched the recent service logs with the request_id of the failing request. Sure enough, the interface had timed out because of network packet loss.

Of course, the leader wouldn't be satisfied with that: this conclusion needs someone to take the blame. So I quickly contacted the ops team and the network group to confirm the state of the network at the time. The network group replied that the switch in the machine room where our detection service runs is old, has some unknown forwarding bottleneck, and is being optimized. That actually put me at ease, so I posted a brief explanation in the department group chat and counted the task as done.

The problem erupts

I had assumed this on-call shift would get away with nothing worse than that small ripple. Then, a little after 8 p.m., alarm emails from all kinds of interfaces came flooding in, catching me off guard just as I was packing up for a single guy's quiet Sunday.

This time almost every interface was timing out, and our interfaces with heavy network I/O were timing out on every single probe. Could the whole machine room be at fault?

Checking again through the server and the monitoring, the metrics of every interface looked normal, and when I tested the interfaces myself they worked perfectly. Since the online service was not affected, I planned to stop the detection tasks through the detection service's own interface and then investigate at leisure.

But the request I sent to the interface that pauses detection tasks hung for a long time with no response. At that point I knew this was not going to be so simple.

2、 The fix

Memory leak

I quickly logged in to the detection server. First, the usual top / free / df trio of checks, and sure enough there were some anomalies.

The CPU utilization of our detection process was abnormally high, reaching 900%.

Our Java process does not do heavy CPU computation; under normal circumstances its CPU usage sits between 100% and 200%. When the CPU spikes like this, the process is either stuck in an infinite loop or doing a great deal of GC.

Checking the GC status of the Java process with jstat -gc <pid> [interval] showed that, sure enough, full GC was running as often as once per second.

With this many full GCs, it had to be a memory leak. So I preserved the thread-stack scene with jstack <pid> > jstack.log, preserved the heap scene with jmap -dump:format=b,file=heap.log <pid>, then restarted the detection service, and the alarm emails finally stopped.

jstat

jstat is a very powerful JVM monitoring tool. Its general usage is jstat [-options] <pid> interval, and the views it supports include -class, -compiler, -gc, -gccapacity, -gccause, -gcnew, -gcold, -gcutil and -printcompilation.

It is very helpful for pinning down JVM memory problems.
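Beyond running jstat from the shell, the same full-GC signal can also be confirmed from inside a JVM process through the standard management API. The sketch below only illustrates that API; the polling loop and output format are my own, not part of the detection service:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcWatcher {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // getName() is e.g. "PS MarkSweep" (old gen / full GC) or
                // "PS Scavenge" (young gen), depending on the collector in use.
                System.out.printf("%-20s count=%d time=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(1_000); // poll every second, roughly like `jstat -gc <pid> 1000`
        }
    }
}
```

A full-GC collector whose count grows by roughly one every second matches what jstat was showing here.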

3、 Investigation

Although the problem had been solved, to keep it from happening again the root cause still had to be found. Analyzing the thread stack is straightforward: check whether the number of threads is excessive and what most of the stacks are doing.
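For instance, a quick way to get that overview from the saved dump is to count how many threads sit in each state. This is a minimal sketch; the jstack.log file name follows the command above, everything else is my own illustration:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class JstackSummary {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> states = new TreeMap<>();
        // Each thread in a jstack dump has a line like:
        //   java.lang.Thread.State: RUNNABLE
        for (String line : Files.readAllLines(Paths.get("jstack.log"))) {
            line = line.trim();
            if (line.startsWith("java.lang.Thread.State:")) {
                String state = line.substring("java.lang.Thread.State:".length()).trim();
                states.merge(state, 1, Integer::sum);
            }
        }
        states.forEach((state, count) -> System.out.println(state + " -> " + count));
    }
}
```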

There were only 400-odd threads, nothing unusual there.

The thread states also looked normal. Next, analyze the heap file.

Downloading the heap dump file

The heap file is binary data and is very troublesome to inspect on the command line, and the tools Java provides for it are visual ones that cannot be used on the Linux server, so the file first has to be downloaded to a local machine.

Since we set the heap to 4 GB, the dumped heap file is also very large, and downloading it is genuinely painful, but we can compress it first.

gzip is a very powerful compression command. In particular, you can pass -1 through -9 to set the compression level: the larger the number, the higher the compression ratio and the longer it takes. -6 or -7 is recommended; -9 is simply too slow for too little extra gain, and in the time it costs you could have downloaded several more files.

Using MAT to analyze the JVM heap

MAT is a powerful tool for analyzing Java heap memory. Open our heap file with it (change the file suffix to .hprof first) and it will ask what kind of analysis to run; for this investigation, go straight for the Leak Suspects report.

The culprit was a map that stores, keyed by interface type, an ArrayList of the response results of every probe run, appended to after each detection so the results can be analyzed. Because the bean holding it is never recycled and this field has no cleanup logic, over the ten-plus days since the service was last restarted the map grew larger and larger until it filled the heap.
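To make the pattern concrete, here is a minimal sketch of the kind of code that produces exactly this leak. Every name in it is hypothetical; only the shape, a long-lived map of ever-growing ArrayLists with no clearing logic, comes from the analysis above:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical long-lived bean (e.g. a singleton in the detection service).
public class ProbeResultHolder {
    // Keyed by interface type; every probe run appends, nothing ever removes.
    private final Map<String, List<String>> resultsByType = new ConcurrentHashMap<>();

    public void record(String type, String responseBody) {
        // The leak: the lists only grow. With the service running for 10+ days,
        // this map eventually occupies the whole heap and full GC can reclaim nothing.
        resultsByType.computeIfAbsent(type, k -> new ArrayList<>()).add(responseBody);
    }

    public List<String> resultsFor(String type) {
        return resultsByType.getOrDefault(type, Collections.emptyList());
    }
}
```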

Once memory was full, no more memory could be allocated for the HTTP response data, so the threads were stuck at readLine indefinitely. And our interfaces with heavy I/O alarmed far more often, presumably because their larger responses needed more memory.

I raised a PR to the code owner, and the problem was solved.
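The article doesn't show the actual patch, but a plausible minimal fix for this pattern, assuming the accumulated results are only needed until the next analysis pass, is to let the analysis step drain the map instead of leaving the lists to grow forever:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical fixed version: results are handed over when the analysis runs,
// so the holder no longer grows without bound between restarts.
public class ProbeResultHolderFixed {
    private final Map<String, List<String>> resultsByType = new ConcurrentHashMap<>();

    public void record(String type, String responseBody) {
        resultsByType.computeIfAbsent(type, k -> new ArrayList<>()).add(responseBody);
    }

    // The analysis step takes ownership of the accumulated results and the
    // holder forgets them, so old response bodies become garbage-collectable.
    public List<String> drainResultsFor(String type) {
        List<String> drained = resultsByType.remove(type);
        return drained != null ? drained : Collections.emptyList();
    }
}
```

Other reasonable variants would be to cap the list size or keep results in a bounded cache; the essential point is that the long-lived bean must stop holding every response forever.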

4、 Summary

In fact, I should also reflect on my own part in this. The alarm emails at the very beginning already contained a telltale thread stack.

Seeing that thread stack, I didn't stop to think it through. TCP guarantees the integrity of the message, and the variable is not assigned until the whole message has been received, so that stack pointed to an obvious internal error of the service. Had I paid more attention, I could have found the real problem ahead of time; troubleshooting really can't afford to let any detail slip by.

Author: Pillow Book

Source: https://zhenbianshu.github.io

