The principle of online ANR monitoring on Android

Watchdog in Android

Implementation of Watchdog thread deadlock monitoring

For Watchdog to monitor an object for thread deadlock, the object must implement the monitor() method of the Watchdog.Monitor interface and then register itself by calling addMonitor(). ActivityManagerService is one example.
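The registration step can be illustrated with a small plain-Java sketch. It is not the framework source: WatchdogSketch, DemoService and checkAll() are hypothetical stand-ins, and the call counter exists only so the demo has something to observe.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the framework Watchdog; the names mirror the
// article, but this is plain Java written for illustration only.
class WatchdogSketch {
    // Same shape as Watchdog.Monitor: one method that should return promptly.
    interface Monitor {
        void monitor();
    }

    private final List<Monitor> monitors = new ArrayList<>();

    // Register an object whose lock should be checked.
    synchronized void addMonitor(Monitor monitor) {
        monitors.add(monitor);
    }

    // Call monitor() on every registered object. This blocks if any of the
    // objects' locks is held by another thread.
    void checkAll() {
        List<Monitor> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(monitors);
        }
        for (Monitor m : snapshot) {
            m.monitor();
        }
    }
}

// Hypothetical service; in the framework, ActivityManagerService plays this role.
class DemoService implements WatchdogSketch.Monitor {
    private int monitorCalls; // counter added only to make the demo observable

    @Override
    public void monitor() {
        // Acquiring and releasing the service lock is the whole check.
        synchronized (this) {
            monitorCalls++;
        }
    }

    synchronized int monitorCalls() {
        return monitorCalls;
    }
}
```

A caller would create the service, pass it to addMonitor(), and let the watchdog call checkAll() from its own thread.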

ActivityManagerService registers itself in this way so that Watchdog can monitor its object lock. The monitoring itself works as follows. Watchdog is a thread object; once started, it waits 30s, runs its checks, and repeats this in a continuous loop.

First, ActivityManagerService calls addMonitor() to add itself to mMonitorChecker, a member field of Watchdog. Fields like this one are initialized in the Watchdog constructor and added to the list of monitored objects, mHandlerCheckers (an ArrayList<HandlerChecker>). mMonitorChecker is an instance of the HandlerChecker class.

mMonitors in the HandlerChecker class is also a list of monitored objects; it holds every object that implements the Watchdog.Monitor interface. For objects that do not implement Watchdog.Monitor, a separate HandlerChecker is created and added to Watchdog's mHandlerCheckers list. When the Watchdog thread starts running, it traverses the mHandlerCheckers list and calls scheduleCheckLocked() on each HandlerChecker in turn.

The HandlerChecker class has several important members: mCompleted, which indicates whether the check finished within the allotted time; mStartTime, the time at which the check started; mHandler, the Handler of the monitored thread; and scheduleCheckLocked(), which starts a check of that thread and, naturally, sets mCompleted to false and records the start time. The monitoring principle is to post a task, namely the HandlerChecker itself, to the message queue of the monitored thread's Handler. The HandlerChecker task is then executed from the message queue maintained by that Handler. If the message queue is blocked by some other task, the HandlerChecker task cannot run in time, and once the allotted time has passed the monitored thread is considered stuck (either deadlocked or busy with a time-consuming task).
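The posting idea can be sketched in plain Java. This is only an approximation under stated assumptions: a single-thread executor stands in for the monitored thread's Handler and message queue, System.nanoTime() stands in for the framework clock, and CheckerSketch and CheckerDemo are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the HandlerChecker idea. The executor plays the role of the
// monitored thread's Handler/MessageQueue (an assumption for this demo).
class CheckerSketch implements Runnable {
    private final ExecutorService monitoredQueue;
    private boolean completed = true; // plays the role of mCompleted
    private long startNanos;          // plays the role of mStartTime

    CheckerSketch(ExecutorService monitoredQueue) {
        this.monitoredQueue = monitoredQueue;
    }

    // Like scheduleCheckLocked(): mark the check as pending and post ourselves.
    synchronized void scheduleCheck() {
        if (!completed) {
            return; // the previous check is still pending; keep its start time
        }
        completed = false;
        startNanos = System.nanoTime();
        monitoredQueue.execute(this);
    }

    // Runs on the monitored thread; getting here proves the queue is moving.
    @Override
    public void run() {
        synchronized (this) {
            completed = true;
        }
    }

    synchronized boolean isCompleted() {
        return completed;
    }

    // How long the current check has been pending, 0 if it has completed.
    synchronized long pendingMillis() {
        return completed ? 0 : TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }
}

class CheckerDemo {
    // Returns {completedWhileQueueBlocked, completedAfterRelease}.
    static boolean[] run() {
        ExecutorService queue = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        boolean[] result = new boolean[2];
        try {
            // Simulate a stuck task at the head of the queue.
            queue.execute(() -> {
                try { release.await(); } catch (InterruptedException ignored) { }
            });
            CheckerSketch checker = new CheckerSketch(queue);
            checker.scheduleCheck();
            Thread.sleep(100);
            result[0] = checker.isCompleted();
            release.countDown();
            queue.shutdown();
            queue.awaitTermination(2, TimeUnit.SECONDS);
            result[1] = checker.isCompleted();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            release.countDown();
            queue.shutdownNow();
        }
        return result;
    }
}
```

While the first task blocks the queue the checker stays pending; as soon as the queue drains, the checker runs and marks itself completed.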

When the HandlerChecker task runs, it first traverses the monitored objects in the mMonitors list and calls monitor() on each of them. A monitored object typically implements monitor() by doing nothing more than acquiring and releasing its own lock.
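Why an empty synchronized block is enough can be seen in a small experiment. The sketch below is plain Java with hypothetical names (LockedService, MonitorBlockDemo); one thread holds the service lock on purpose, and a second thread calling monitor() cannot return until the lock is released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical service whose monitor() only acquires and releases its own lock.
class LockedService {
    void monitor() {
        synchronized (this) {
            // Nothing to do: obtaining the lock at all is the proof of health.
        }
    }
}

class MonitorBlockDemo {
    // Returns {finishedWhileLockHeld, finishedAfterRelease}.
    static boolean[] run() {
        LockedService service = new LockedService();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch releaseLock = new CountDownLatch(1);
        CountDownLatch monitorDone = new CountDownLatch(1);
        boolean[] result = new boolean[2];

        // Simulates a thread that grabs the service lock and does not let go.
        Thread holder = new Thread(() -> {
            synchronized (service) {
                lockHeld.countDown();
                try { releaseLock.await(); } catch (InterruptedException ignored) { }
            }
        });
        // Simulates the checker task calling monitor().
        Thread checker = new Thread(() -> {
            service.monitor();
            monitorDone.countDown();
        });
        try {
            holder.start();
            lockHeld.await();
            checker.start();
            // monitor() cannot return while the lock is held.
            result[0] = monitorDone.await(200, TimeUnit.MILLISECONDS);
            releaseLock.countDown();
            result[1] = monitorDone.await(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            releaseLock.countDown();
        }
        return result;
    }
}
```

If the lock were never released, as in a real deadlock, monitor() would never return and the checker would never be marked as completed.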

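In other words, monitor() simply checks for a deadlock; once every monitor() call has returned, the check is complete and mCompleted is set to true. After scheduleCheckLocked() has been called on every checker, Watchdog starts its wait, which must last the full 30s. There is an implementation detail here: the wait sits inside a loop that recomputes the remaining time from SystemClock.uptimeMillis() and waits again until the whole 30s has elapsed.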
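A minimal sketch of that detail, a timed wait wrapped in a loop, is shown here in plain Java. It assumes System.nanoTime() as a stand-in for SystemClock.uptimeMillis(), uses a short interval so the demo finishes quickly, and TimedWait, poke() and TimedWaitDemo are hypothetical names. The poke() call imitates an early wakeup.

```java
import java.util.concurrent.TimeUnit;

class TimedWait {
    private final Object lock = new Object();

    // Waits until at least intervalMillis have elapsed, going back to wait
    // after any early wakeup. Returns how many times wait() returned.
    int waitFullInterval(long intervalMillis) {
        int wakeups = 0;
        synchronized (lock) {
            long start = System.nanoTime();
            long remaining = intervalMillis;
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Treated like any other early wakeup: recompute and wait again.
                }
                wakeups++;
                long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
                remaining = intervalMillis - elapsed;
            }
        }
        return wakeups;
    }

    // Wakes the waiting thread early, standing in for a spurious wakeup.
    void poke() {
        synchronized (lock) {
            lock.notifyAll();
        }
    }
}

class TimedWaitDemo {
    // Returns {elapsedMillis, wakeups} for a 400 ms wait that is poked after 50 ms.
    static long[] run() {
        TimedWait timedWait = new TimedWait();
        Thread poker = new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            timedWait.poke();
        });
        long start = System.nanoTime();
        poker.start();
        int wakeups = timedWait.waitFullInterval(400);
        long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return new long[] { elapsed, wakeups };
    }
}
```

Without the loop, the early wakeup would end the wait after roughly 50 ms; with it, the method returns only once the full interval has passed.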

When I first saw this code, I noticed that SystemClock.uptimeMillis() does not advance while the device is asleep, so I guessed the loop was there because the wait stops when the device sleeps. Suppose Watchdog has waited 15s when the device goes to sleep, and the device wakes up after 30 minutes of continuous sleep: does the wait return immediately? The answer is that normally it does not; the wait continues and returns only after the remaining 15s have passed. That left me puzzled, so I checked the documentation of the wait() method and finally found the explanation.

Generally speaking, a waiting thread can be woken by an explicit wake-up (notify or notifyAll), by an interrupt, or by the wait time expiring. A thread can also wake up without any of these happening, which is called a spurious wakeup. In practice the probability of a spurious wakeup is very low, but the program must still guard against it by checking the wake-up condition to tell a real wakeup from a spurious one, and by going back to waiting if it was spurious. Details like this do deserve attention in real development. The problem may never show up in 99% of cases, but in the remaining 1% it is very obscure and very hard to track down: it seems strange that a thread in the middle of a perfectly good wait was suddenly woken, and we might even start to doubt what we concluded earlier about wait and device sleep.

That is enough of a digression; back to the Watchdog mechanism. After waiting 30s, Watchdog calls evaluateCheckerCompletionLocked() to check how the monitored objects are doing.

It obtains the monitoring state of each HandlerChecker by calling that checker's getCompletionStateLocked() method.

Here the mCompleted flag distinguishes the state before and after the 30s wait. Before the wait, a HandlerChecker task is posted to the message queue of the monitored thread's Handler and mCompleted is set to false. After the 30s wait, mCompleted = true means the task ran in time, while mCompleted = false means the HandlerChecker has still not run. In that case Watchdog looks at how long the task has been pending, measured in awake time. If it is less than 30 seconds, Watchdog goes back to waiting and checks again. If it is between 30 and 60 seconds, Watchdog dumps some stack information and then waits and checks again. Once it exceeds 60 seconds, the situation is treated as an exception (either a deadlock or a task that has been running for too long). Watchdog then collects the relevant information, such as stack traces, kernel information and CPU information, generates a trace file, saves the information to the dropbox folder, and kills the process. At that point the monitoring cycle is over.
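The three time bands can be written down as a small classification function. The sketch below is plain Java; the constant names follow my recollection of the AOSP source and should be treated as assumptions, as should the class name CompletionState and the helper worst().

```java
class CompletionState {
    static final int COMPLETED = 0;   // the checker task ran
    static final int WAITING = 1;     // pending for less than half the timeout
    static final int WAITED_HALF = 2; // pending for at least half the timeout
    static final int OVERDUE = 3;     // pending for the full timeout or more

    // Classify one checker from its completed flag and pending time.
    static int of(boolean completed, long pendingMillis, long timeoutMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (pendingMillis < timeoutMillis / 2) {
            return WAITING;
        }
        if (pendingMillis < timeoutMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    // The watchdog acts on the worst state found across all checkers.
    static int worst(int[] states) {
        int worst = COMPLETED;
        for (int state : states) {
            worst = Math.max(worst, state);
        }
        return worst;
    }
}
```

With a 60-second timeout, a checker pending for 10 seconds is WAITING, one pending for 30 seconds is WAITED_HALF (the point at which stacks are dumped), and one pending for 60 seconds is OVERDUE.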

Implementation of Watchdog thread stall monitoring

As described earlier, Watchdog monitors a thread by posting a HandlerChecker to that thread's Handler, and the objects monitored for deadlock are kept in the HandlerChecker's mMonitors list. Every external call to addMonitor() therefore ends up adding the object to the monitor list of Watchdog's member field mMonitorChecker, so deadlock monitoring for all registered objects is handled by mMonitorChecker alone. Monitoring threads for time-consuming tasks is set up through the addThread() method.

addThread() simply creates a new HandlerChecker object for monitoring time-consuming tasks, and that object's mMonitors list is empty. When its task runs, no monitor() method is called and the mCompleted flag is set directly. To sum up: HandlerChecker is Watchdog's monitor, and it implements both deadlock monitoring and time-consuming-task monitoring. When a HandlerChecker has Monitor objects, it checks for deadlocks and time-consuming tasks at the same time; when it has none, it only checks for time-consuming tasks on the thread.
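The difference between the two registration paths can be shown structurally. The sketch below is single-threaded plain Java with hypothetical names (CheckerRegistry, Checker); a thread name stands in for the Handler that the real addThread() receives, and the call counter exists only for the demo.

```java
import java.util.ArrayList;
import java.util.List;

class CheckerRegistry {
    interface Monitor {
        void monitor();
    }

    static class Checker implements Runnable {
        final String name;
        final List<Monitor> monitors = new ArrayList<>(); // like mMonitors
        boolean completed = true;                          // like mCompleted
        int monitorCalls;                                  // demo-only counter

        Checker(String name) {
            this.name = name;
        }

        @Override
        public void run() {
            // With an empty list this loop does nothing and the flag is set directly.
            for (Monitor m : monitors) {
                m.monitor();
                monitorCalls++;
            }
            completed = true;
        }
    }

    private final List<Checker> checkers = new ArrayList<>();      // like mHandlerCheckers
    private final Checker monitorChecker = new Checker("monitor"); // like mMonitorChecker

    CheckerRegistry() {
        checkers.add(monitorChecker);
    }

    // Every monitor ends up in the one shared monitor checker.
    void addMonitor(Monitor monitor) {
        monitorChecker.monitors.add(monitor);
    }

    // Every added thread gets its own checker with no monitors.
    Checker addThread(String threadName) {
        Checker checker = new Checker(threadName);
        checkers.add(checker);
        return checker;
    }

    Checker monitorChecker() {
        return monitorChecker;
    }

    int checkerCount() {
        return checkers.size();
    }
}
```

Running the shared monitor checker calls every registered monitor(), while running a checker created by addThread() only flips its completed flag.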

Watchdog monitoring process

Now that we understand the Watchdog monitoring process, can we apply the same mechanism in our own projects to monitor, in real time, deadlocks on important threads in multi-threaded scenarios and ANRs on the main thread? Of course we can. In the framework, Watchdog's main job is to monitor whether the core system services are deadlocked or stuck, ActivityManagerService for example. If an exception occurs, Watchdog kills the process so that it restarts, which ensures that important system services can recover from such problems by restarting. Watchdog is in effect a last line of defense: it dumps the abnormal state in time and restores the process's running environment.

For deadlocks on important threads inside an application, the implementation can follow the same principle as Watchdog.

Monitoring an application for ANR stalls can also borrow from Watchdog, although the concrete implementation differs slightly. An ANR is raised after 5 seconds for an activity, 10 seconds for a broadcast and 20 seconds for a service. The four components all run on the main thread, however, so you can start monitoring with a timed wait, just as Watchdog does with its 30-second wait. Use the mCompleted flag to detect whether the task posted to the MessageQueue is stuck and has not run in time, and use mStartTime to compute how long the task has been pending. That pending time tells you whether other tasks in the MessageQueue are performing time-consuming operations. If it exceeds 5 seconds, there is a time-consuming task in the message queue and an ANR is likely. At that point you should dump and save the thread stack information promptly and report it to the backend for big-data analysis. Remember that the time must be measured only while the device is awake. While the device sleeps the MessageQueue is suspended, and that is neither a deadlock nor a stall.
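The approach just described can be approximated in plain Java. This is a sketch under stated assumptions, not the demo project's code: an Executor stands in for a Handler bound to the main thread, System.nanoTime() stands in for an awake-time clock such as SystemClock.uptimeMillis(), the thresholds are shortened so the demo runs quickly, and StallMonitor and StallMonitorDemo are hypothetical names.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongConsumer;

// Watchdog-style stall monitor running on its own thread.
class StallMonitor extends Thread {
    private final Executor target;          // stand-in for a main-thread Handler
    private final long checkIntervalMillis;
    private final long thresholdMillis;
    private final LongConsumer onStall;     // receives the pending time in ms
    private volatile boolean running = true;

    StallMonitor(Executor target, long checkIntervalMillis,
                 long thresholdMillis, LongConsumer onStall) {
        this.target = target;
        this.checkIntervalMillis = checkIntervalMillis;
        this.thresholdMillis = thresholdMillis;
        this.onStall = onStall;
        setDaemon(true);
    }

    void shutdown() {
        running = false;
    }

    @Override
    public void run() {
        while (running) {
            AtomicBoolean completed = new AtomicBoolean(false); // like mCompleted
            long start = System.nanoTime();                     // like mStartTime
            target.execute(() -> completed.set(true));          // the detection task
            boolean reported = false;
            do {
                pause(checkIntervalMillis);
                long pending = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
                if (!completed.get() && !reported && pending >= thresholdMillis) {
                    // A real monitor would capture the stalled thread's stack here.
                    onStall.accept(pending);
                    reported = true; // report each stalled detection task once
                }
            } while (running && !completed.get());
        }
    }

    private void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            running = false; // stop monitoring if the thread is interrupted
        }
    }
}

class StallMonitorDemo {
    // Runs a 400 ms task on the monitored thread with a 100 ms threshold and
    // returns how many stalls were reported.
    static int run() {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        AtomicInteger reports = new AtomicInteger();
        StallMonitor monitor = new StallMonitor(mainThread, 20, 100,
                pendingMs -> reports.incrementAndGet());
        try {
            monitor.start();
            mainThread.execute(() -> {
                try { Thread.sleep(400); } catch (InterruptedException ignored) { }
            });
            Thread.sleep(700);
            monitor.shutdown();
            monitor.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            monitor.shutdown();
            mainThread.shutdown();
        }
        return reports.get();
    }
}
```

On Android the callback is where the main thread's stack trace would be captured and queued for upload.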

Implementation and demo of online ANR monitoring based on the Watchdog mechanism

https://github.com/liuhongda/anrmonitor/tree/master/anrmonitor

Summary of the Watchdog mechanism

Each thread can have one Looper, and each Looper has one MessageQueue. You can therefore post a detection task to the MessageQueue and check whether it runs in time, which reveals whether tasks on that thread are stalling. The precondition is that the thread has created a Looper.

Watchdog must run on its own thread so that it can monitor other threads without being affected by them.

Online ANR monitoring built on the Watchdog mechanism is not 100% accurate. Take an ANR threshold of 5 seconds. A time-consuming task may finish just as the 5-second mark is reached, at which point the detection task starts running. If the Watchdog thread's wait expires while the detection task is still executing, Watchdog finds the task unfinished and reports an ANR, which is inaccurate. In the other direction, a 5-second ANR may occur while the Watchdog thread is still in its wait, so the stall and the wait are staggered. By the time the Watchdog thread starts its next wait, the ANR has already happened and the main thread may be back to normal, so the ANR information is never collected. An ANR can therefore be reliably caught and recorded only if the stall lasts at least twice as long as the Watchdog thread's wait, which for a 5-second ANR means a wait of 2.5 seconds. That is rather frequent in practice: as long as the device is awake, the Watchdog runs every 2.5 seconds, which risks extra power consumption.
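The factor of two can be checked with a small timeline model. The model is my own simplification, not something taken from the framework: the watchdog posts a detection task at time 0 and at every multiple of the wait time, and a stall counts as caught only if some detection task posted during the stall is still pending one wait later. DetectionWindow and its methods are hypothetical names; times are in milliseconds.

```java
class DetectionWindow {
    // True if a check finds a detection task still pending.
    // stallStart and stallLength describe the stall; wait is the watchdog's wait time.
    static boolean detected(long stallStart, long stallLength, long wait) {
        long stallEnd = stallStart + stallLength;
        // First detection task posted at or after the start of the stall.
        long post = ((stallStart + wait - 1) / wait) * wait;
        // It is found pending at post + wait only if the stall is still going.
        return stallEnd > post + wait;
    }

    // True if the stall is caught no matter where in the wait cycle it begins.
    static boolean alwaysDetected(long stallLength, long wait) {
        for (long start = 0; start < wait; start++) {
            if (!detected(start, stallLength, wait)) {
                return false;
            }
        }
        return true;
    }
}
```

Under this model a 5000 ms stall is always caught with a 2500 ms wait, but a stall that begins 1 ms after a detection task has run is missed with a 3000 ms wait.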

