Android ANR Resolution & Case Study of ANR Resolution Using AGC APM - Huawei Developers
1. Introduction to ANR
1.1 What Is ANR?
ANR is short for Application Not Responding. Such problems occur when the UI thread of an Android app is blocked for a long time.
In this case, a pop-up is displayed, providing an option for users to exit the app by force.
1.2 What Are the Types of ANR Problems?
Apps running on Android devices are monitored by the Activity Manager and Window Manager system services. Generally, the services report ANR problems in the following cases:
KeyDispatchTimeout (the most common type): Triggered when an input event fails to be processed within 5s, including key-pressing events and touch events.
Log keyword: InputDispatching Timeout
BroadcastTimeout: Triggered when BroadcastReceiver fails to respond within a specific period (10s for foreground broadcasts and 60s for background broadcasts).
Log keyword: Timeout of broadcast BroadcastRecord
ServiceTimeout: Triggered when Service fails to respond within a specific period (20s for foreground services and 200s for background services).
Log keyword: Timeout executing service
ContentProviderTimeout: Triggered when ContentProvider fails to respond within 10s.
Log keyword: Timeout publishing content providers
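The log keywords above can be used to triage a log line automatically. The following is a minimal sketch, not part of any Android or AGC API: the class name AnrTypeClassifier and its method are hypothetical, and the keyword table only contains the strings quoted in this article (plus the "Input dispatching timed out" wording used in Case 1). Matching is case-insensitive because the exact casing can differ between OS versions.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps the log keywords listed above to an ANR type name.
class AnrTypeClassifier {
    // Keyword (lower case) -> ANR type, in the order the article lists them.
    private static final Map<String, String> KEYWORDS = new LinkedHashMap<>();
    static {
        KEYWORDS.put("inputdispatching timeout", "KeyDispatchTimeout");
        KEYWORDS.put("input dispatching timed out", "KeyDispatchTimeout");
        KEYWORDS.put("timeout of broadcast broadcastrecord", "BroadcastTimeout");
        KEYWORDS.put("timeout executing service", "ServiceTimeout");
        KEYWORDS.put("timeout publishing content providers", "ContentProviderTimeout");
    }

    // Returns the ANR type whose keyword appears in the line, or "Unknown".
    static String classify(String logLine) {
        String lower = logLine.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : KEYWORDS.entrySet()) {
            if (lower.contains(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "Unknown";
    }

    public static void main(String[] args) {
        // The sample line is made up for illustration.
        System.out.println(classify("Reason: Timeout executing service com.example/.SyncService"));
    }
}
```

A lookup like this is only a first filter; the thread stacks still have to be read to find the actual cause.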
1.3 Why Do ANR Problems Occur?
Based on the analysis of a large number of ANR cases, we have identified the following typical ANR scenarios:
The main thread is blocked by another thread (proportion: 57%). The other thread calls sleep() or wait() while the main thread depends on it, so the main thread waits until it times out.
System resources are occupied (proportion: 14%). The app cannot obtain sufficient system resources because other processes occupy a large share of them (CPU, RAM, and I/O).
The main thread is stalled by time-consuming tasks (proportion: 9%). Typical causes are a large number of database reads and writes, unstable network conditions, and compute-intensive work.
2. Methodology for Solving ANR Problems
2.1 Overall Approach
1. Export ANR log information and check the name of the package or class in which ANR occurs, involved process ID, occurrence time, and cause type.
2. Check the system resource information, including the usage of system resources such as CPU, RAM, and I/O before and after ANR occurs.
3. Check the status of the main thread for any fault such as time-consuming operation and deadlock, and determine whether the ANR problem is caused by the app or OS.
4. Check whether the app is abnormal before the ANR problem occurs based on app logs or code.
2.2 Exporting ANR Logs
When an ANR problem occurs, the system collects ANR-related log information, CPU usage, and trace logs (how each thread is executed), generates a traces.txt file, and saves the file in the /data/anr/ directory.
Note: Each time a new ANR problem occurs, the previous ANR information is overwritten.
You can run the adb command to export the trace file to your computer.
Code:
adb root
adb shell ls /data/anr
adb pull /data/anr/<filename>
2.3 Reading Key Log Information
1. In the log information, locate information by specific keyword.
Keyword examples in the trace file:
Code:
09-24 15:20:20.211 1001 1543 1570 XXXXXXX: ANR in xxxxxx
09-24 15:20:20.211 1001 1543 1570 XXXXXXX: PID: xxxxx
09-24 15:20:20.211 1001 1543 1570 XXXXXXX: Reason: xxxxxx
The details are described as follows:
ANR in: Name of the package or class in which ANR occurs
PID: ID of the process in which ANR occurs
Reason: Cause of ANR such as keyDispatchingTimedOut
2. Check the CPU usage information.
Code:
09-24 15:20:20.211 1001 1543 1570 XXXXXX: CPU usage from xxx to xxx ago
xxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx
09-24 15:20:20.211 1001 1543 1570 XXXXXX: CPU usage from xxx to xxx later
xxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx
The details are described as follows:
ago: CPU usage before ANR occurs
later: CPU usage after ANR occurs
Note: Pay attention to TOTAL, user, kernel, and iowait to check the CPU usage.
2.4 Analysis
After analyzing the CPU usage, if the cause is still not clear, you need to further analyze the trace file. The file records each thread stack of the involved process before and after ANR occurs. Analyze the main thread stack information to check whether the app is abnormal before ANR occurs.
ANR problems vary depending on the scenario. Therefore, the solution here is provided for reference only.
3. ANR Troubleshooting
3.1 Difficulties
In most cases, users expect an app to respond promptly to their operations, such as button tapping, resource loading, and page redirection. When an ANR problem occurs, a pop-up stating that the app is not responding is not enough to meet that expectation.
However, ANR problems are difficult to deal with for several reasons:
1. ANR problems often occur on out-of-date devices or in poor network conditions, which tests can hardly cover.
2. Some ANR problems occur randomly and are difficult to locate. Detailed logs require an actual device on which the problem can be reproduced, and such a device is not always available.
3. Locating ANR problems is complex and affected by many factors, demanding much experience and expertise.
3.2 New Solution
In addition to traditional methods, third-party app monitoring platforms can be used to handle ANR problems.
To help locate and solve ANR problems and improve user experience, more and more service providers have started to develop app performance monitoring tools.
HUAWEI AppGallery Connect is one of them. It provides App Performance Management (APM) with minute-level app performance monitoring capabilities such as ANR analysis. With ANR monitoring of AppGallery Connect APM, you can:
1. Monitor ANR problems on the live network in real time and track their trends.
2. Collect and view ANR information online without reproducing the problems.
3. Locate and solve ANR problems quickly and systematically.
4. ANR Solving Cases
The following cases demonstrate how to locate typical ANR problems using the APM service.
4.1 Case 1: ANR Caused by a Deadlock
4.1.1 Finding the Problem
Sign in to AppGallery Connect, click My projects, find your project, and click your app on the project card. Then go to Quality > APM > ANR analysis. The page shows that the ANR-affected user rate of the top problem type reaches 16.67%, so this type of problem should be solved first.
4.1.2 Locating the Problem
Click the card of this problem type among the top problems. The ANR details page for this type of problem is displayed.
According to the user distribution pie chart, the ANR problem affects the most users when the app version is 2.0, the device model is HUAWEI VOG-AL10, or the OS version is 10.
In the Records table under the charts, find a record that meets the preceding conditions, and click View details.
4.1.2.1 Analyzing System Resource Status
According to the report, when the problem occurs, the CPU usage is 20%, the I/O usage is 0%, low memory does not occur, the allocated heap size is 26.50 MB, the used heap size is 8.69 MB, and the number of threads is 61. Such information indicates that system resource usage is normal.
ANR problems can be caused by insufficient system resources or incorrect code logic. The preceding system resource information rules out insufficient system resources, which leaves incorrect code logic as the likely cause. The analysis therefore needs to focus on verifying the code logic.
4.1.2.2 Checking the Main Thread Status to Find the Code Snippet That Causes ANR
For ANR problems caused by code logic, check the main thread stack and the thread status. On the Main thread stack info tab page, find the faulty stack. It shows that the main thread is waiting to acquire a lock when the problem occurs. It can therefore be concluded that the main thread is blocked because it keeps waiting for the lock. As a result, subsequent input events are not responded to, and an ANR problem of the Input dispatching timed out type is triggered.
Locate the code snippet of the ANR problem in the stack information. The deadlock occurs during the call of com.aiops.hiperformance.MainActivity.dispatchActivityDestroyed. Checking the code shows that the thread is stuck in the mLock.readLock().lock() call.
Search for the keyword mLock in the code. Only the MainActivity file contains the mLock.readLock().lock() code, so the abnormal code exists only in the MainActivity file and the fault scope can be narrowed down. When a lock is never released, a common reason is that an unexpected exception occurred before the release code was reached. Therefore, you need to check whether the app was abnormal before the problem occurred.
Find when the lock request started, then search for any exception that occurred before the deadlock. Switch to the ANR info tab page. It shows that the first element of the primary execution queue had already been waiting for 5.5s when the ANR was reported. The ANR occurred at 2020-09-27 09:48:27, so the lock request was made at about 2020-09-27 09:48:21.
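The time arithmetic in this step can be checked with a few lines of Java. This is only a sketch of the subtraction described above; the class and method names are hypothetical, and the two input values are the ones quoted from the ANR info tab page.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: subtracts the queue wait time from the ANR timestamp.
class LockTimeEstimator {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    // anrTime: timestamp of the ANR; waitedMillis: how long the first queue element had waited.
    static LocalDateTime estimateBlockStart(String anrTime, long waitedMillis) {
        return LocalDateTime.parse(anrTime, FORMAT).minus(Duration.ofMillis(waitedMillis));
    }

    public static void main(String[] args) {
        // 09:48:27 minus 5.5s falls within second 09:48:21.
        System.out.println(estimateBlockStart("2020-09-27 09:48:27", 5500));
    }
}
```

Logs from before the estimated time are the ones worth searching for exceptions.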
4.1.2.3 Viewing App Logs
Switch to the System logs tab page. Now we know that the lock obtaining action occurs at about 2020-09-27 09:48:21. Analyze logs generated before the time point for any exception that may lead to the failure to release the lock.
It is found that the system throws the OutofBoundsException exception at 09:48:18.365 and generates abnormal stack information. The exception indeed occurs in the MainActivity file, that is, the code scope we have narrowed down to. According to the stack information, the abnormal code is finally located.
It is found that an out-of-bounds error occurs when getShareDataInterceptor is called. As a result, mLock.readLock is not released. By now, we have completely found the cause: The lock resource fails to be released due to an exception, leading to the deadlock of the main thread.
4.1.3 Solving the Problem
To solve the problem and prevent similar problems, the following measures need to be taken:
1. Analyze the cause of the exception and modify the code to prevent out-of-bounds exceptions.
2. Catch such exceptions and run protection code so that the lock resource is released even when an exception is thrown.
3. Review other code that uses locks and add the same protection around lock release, so that lock resources are always released in a timely manner.
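A common way to implement measures 2 and 3 in Java is to release the lock in a finally block. The sketch below is an illustration under assumptions, not the app's actual code: only the mLock.readLock().lock() call comes from the case above, while the class SharedDataReader, its data array, and its methods are hypothetical.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical class; it only illustrates releasing a read lock in finally.
class SharedDataReader {
    private final ReentrantReadWriteLock mLock = new ReentrantReadWriteLock();
    private final int[] data = {10, 20, 30};

    // Reads one element while holding the read lock.
    int readAt(int index) {
        mLock.readLock().lock();
        try {
            return data[index]; // throws ArrayIndexOutOfBoundsException for a bad index
        } finally {
            // Runs on both the normal and the exceptional path, so the lock cannot leak.
            mLock.readLock().unlock();
        }
    }

    // Number of read locks currently held; 0 means nothing leaked.
    int heldReadLocks() {
        return mLock.getReadLockCount();
    }

    public static void main(String[] args) {
        SharedDataReader reader = new SharedDataReader();
        try {
            reader.readAt(5); // out-of-bounds on purpose
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Read failed, read locks still held: " + reader.heldReadLocks());
        }
    }
}
```

With this structure, an out-of-bounds error still surfaces to the caller, but other threads that later request the lock are not blocked by a lock that was never released.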
4.2 Case 2: ANR Caused by Insufficient I/O Resources
4.2.1 Locating the Problem
Go directly to the ANR details page in APM to locate the specific problem.
Experience with most ANR problems suggests that this ANR problem is caused by insufficient system resources. Therefore, the troubleshooting approach should be as follows: Locate the code that causes the ANR problem, and optimize the code so that ANR will not be triggered even if the I/O usage is high.
4.2.1.2 Finding the Cause by Checking the Main Thread Status
Switch to the Main thread stack info tab page and check the main thread code.
According to the main thread stack information, the main thread directly performs database operations. When the system I/O usage is high, these operations will cause main thread blocking. Locate the involved code snippet based on the stack information.
This confirms that the code accesses SQLite on the main thread. Experienced developers will know that the problem can be solved simply by performing the I/O operations in a worker thread.
4.2.1.3 Viewing App Logs
This step is not necessary as we have already located the cause of ANR.
4.2.2 Solving the Problem
Optimize the code as follows to prevent such ANR problems.
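The original code listing is not available here, so the following is only a sketch of the general technique: run the blocking query on a dedicated worker thread instead of the caller's thread. It uses plain java.util.concurrent so that it runs outside Android; the class BackgroundQueryRunner and its methods are hypothetical, and the Supplier stands in for the real SQLite access.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical helper: runs blocking I/O on a worker thread instead of the caller's thread.
class BackgroundQueryRunner {
    // One dedicated thread keeps database access serialized and off the main thread.
    private final ExecutorService ioExecutor = Executors.newSingleThreadExecutor();

    // Submits the query and returns immediately; the future completes on the worker thread.
    <T> CompletableFuture<T> runQuery(Supplier<T> query) {
        return CompletableFuture.supplyAsync(query, ioExecutor);
    }

    // Stops the worker thread once pending queries have finished.
    void shutdown() {
        ioExecutor.shutdown();
    }

    public static void main(String[] args) {
        BackgroundQueryRunner runner = new BackgroundQueryRunner();
        // The lambda is a placeholder for a slow database read.
        runner.runQuery(() -> "query ran on " + Thread.currentThread().getName())
              .thenAccept(System.out::println)
              .join();
        runner.shutdown();
    }
}
```

In an Android app, the result would typically be handed back to the main thread before touching the UI, for example through a Handler bound to the main Looper. The main thread itself should not call join() on the future, because that would block it again.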
4.3 Case 3: ANR Caused by the Main Thread Infinite Loop
4.3.1 Locating the Problem
Go to the ANR details page in APM to locate the specific problem.
4.3.1.1 Analyzing System Resource Status
According to the report, when the problem occurs, the CPU usage is 25%, the I/O usage is 0%, low memory does not occur, the allocated heap size is 18.01 MB, the used heap size is 8.08 MB, and the number of threads is 43. Such information indicates that system resource usage is normal.
Experience with most ANR problems suggests that this type of ANR problem is not caused by insufficient system resources, which leaves incorrect code logic as the likely cause. The analysis therefore needs to focus on verifying the code logic.
4.3.1.2 Finding the Cause by Checking the Main Thread Status
For ANR problems caused by code logic, check the main thread stack and the thread status. On the Main thread stack info tab page, find the faulty stack.
It is found that the main thread stack is blocked in getActivity and is in the SUSPENDED status. According to the stack information, the abnormal code is finally located.
After the analysis, it is suspected that an infinite loop occurs on the main thread. If an infinite loop occurs in an app, the CPU user-mode time of the app increases abnormally. Switch to the ANR info tab page and view the CPU usage information of each process.
It is found that the CPU usage of the app in user mode reaches 94%. Therefore, it is verified that an infinite loop occurs in the main thread, which causes the ANR problem.
4.3.1.3 Viewing App Logs
This step is not necessary as we have already located the cause of ANR.
4.3.2 Solving the Problem
Optimize the code as follows to prevent such ANR problems.
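The original fix is not shown here, so the following sketch only illustrates one general way to remove an unbounded loop: give the loop an explicit attempt limit and let the caller handle the "not found" case. Everything in it is hypothetical, including the class BoundedLookup and the assumption that the loop was repeatedly polling for an object.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper: polls a source a limited number of times instead of looping forever.
class BoundedLookup {
    // Returns the first non-null value, or Optional.empty() after maxAttempts tries.
    static <T> Optional<T> find(Supplier<T> source, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            T value = source.get();
            if (value != null) {
                return Optional.of(value);
            }
        }
        // Giving up keeps the calling thread responsive; the caller decides what to do next.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // This source never yields a value, which would spin forever without the limit.
        Optional<String> result = find(() -> null, 3);
        System.out.println("Found: " + result.isPresent());
    }
}
```

Even a bounded loop can be expensive on the main thread, so waiting for a state change is usually better expressed as a callback or moved to a worker thread.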
5. Case Summary
The preceding ANR problems are resolved with the help of HUAWEI AppGallery Connect APM, especially based on the ANR analysis reports and ANR records provided by the service.
On the ANR analysis page of AppGallery Connect APM, you can view ANR trends by dimensions such as app version, device model, and OS version, and analyze the trend and trigger conditions of a given type of ANR problem.
In addition, you can obtain more detailed device information, OS information, app information, and stack logs when ANR occurs based on the detailed problem occurrence records, and quickly locate the problem.
The picture is not very clear. I wish it could be enlarged.
Is there a way to change the global ANR timeout?
Hello there,
Device: Samsung Galaxy Tab S 8.4 WiFi (klimtwifi) SM-T700 with the LineageOS 17.1 & GAPS.
I noticed that there are the ANR (Application Not Responding) prompts on some newer apps and the study app that I will be using offline.
Can the ANR timeout be manually set? For example I found here;
const nsecs_t DEFAULT_INPUT_DISPATCHING_TIMEOUT = 5000 * 1000000LL; // 5 sec
Anyhow, does anyone know the location of the code that I would need to edit, to change the global timeout to 20 seconds for ANR?
I don't need it to do anything other than reading, typing, dictionaries and this ancient txt app, so resolving the underlying issue is not really worth it. I am happy to wait, but I don't want these annoying prompts; just like with an old Windows computer that would take a minute to load, I am in no rush. Hopefully this code is accessible on a rooted device.
I would greatly appreciate any help on the matter; I just posted here, as it seems to be the main post on ANR.
With many thanks.