At Esper, we specialize in building a platform and offering management tools for dedicated device solutions. But in this complex environment, our customers run into complex issues that require “above and beyond” expertise to get to the root causes and ultimately solve the problems in a commercially viable way. As an infrastructure provider for dedicated and fully managed solutions on both Android and iOS, we go much deeper than typical MDM vendors.
This is the first in a series of blog posts that provide case studies of actual issues we’ve encountered and solved for our customers. Names have been removed to protect the innocent (and, in some cases, the not-so-innocent).
Deep Cut #1: Android BSOD — The Black Screen of Death
Remote control of dedicated Android devices is among the most popular capabilities Esper customers use. One of our customers built a solution on hardware sourced from an ODM, combined with Esper’s Seamless Provisioning and an implementation of permissionless remote control (a corporate-owned, fully managed, kiosk-mode use case). Given the support model for their deployment, flawless remote control was an absolute requirement. Esper worked with the device maker to ensure they enabled both remote control and secure remote ADB (i.e., the ability to securely conduct a remote adb session directly with deployed devices).
However, once the customer prepared to roll out their deployment, they found a problem…
When attempting to view video playback running on the Esper-managed endpoint device via Esper’s remote control feature, the customer encountered a black screen on the Esper Console during the remote control session. It was very strange, as the endpoint device would operate just fine, and any actions taken by the operator in the Esper Console’s remote control session would pass through to the device (i.e., the control part worked to spec). However, all the operator saw was a black screen. Thus, the Android BSOD, remote edition!
This issue occurred only when the device played a video, and it rendered end-to-end remote control, a critical capability for our customer's support workflow, useless. Below, we dissect the problem, analyze the root cause, and show the path we took to solve it once and for all.
Initial Triage
The first step here was for us to reproduce the issue. Let’s break it down:
- The customer permitted us to access their tenant and work with a specified device exhibiting this issue.
- We were able to confirm a remote control session displayed a black screen only when a video was playing.
- We coordinated with the customer to try an experiment. Restarting the video player locally on the endpoint after establishing the remote control connection resulted in proper playback in the Esper Console for remote control, but when the looping video restarted, the screen went black again!
From there, the next logical step was to see if this was a DRM issue:
- This type of behavior is most commonly associated with playing protected content where the DRM implementation would redact the screen contents. We quickly ruled this out by inspecting the implementation and playing a non-DRM’ed video on a simple media player. The issue remained, confirming it was not a DRM issue.
Then, we had to see if there was some connection to the encoding mode:
- We started diagnosing the issue by switching between the hardware and software encoding modes Esper uses to deliver the screen rendering for remote control, to see if the issue was related to our screen delivery/rendering methods. Switching modes did not resolve the issue, effectively ruling out this part of the delivery chain as the cause.
With DRM and encoding ruled out, we had to determine if this behavior was specific to a video stream (e.g., versus capturing a screenshot):
- Esper also supports taking remote screenshots, which invokes a different code path than rendering an endpoint screen for a remote control session.
- So, we took a screenshot. The screenshot is captured locally on the device, so if it had worked as intended (capturing the entire screen exactly as it appears on the device), that would have suggested the issue was somewhere in our remote control implementation. However, screenshots taken during a remote control session with video playback also returned black images, which indicated that our remote control implementation was working as designed and the fault lay elsewhere.
Finally, we had to compare our solution to other popular remote control tools:
- To put the icing on the bug-swatting cake, we tried a popular remote control alternative to see if it exhibited the same issue. It did. This pointed to a more systemic problem, likely at the kernel level.
- (BTW, instead of paying separately for a remote control tool, just manage your devices with Esper to get remote control.)
The finger points at something on the device itself — outside of Esper’s code domain. Now it's time to get down and dirty.
Device-Level Investigation
We needed to get a logcat dump. Here’s the catch: we couldn’t obtain an actual sample device to work with in our lab! Fortunately, because this was an Esper-managed device with a straightforward repro case that we could drive remotely using Esper, we could also trigger generation of the logcat and fetch the file remotely. Since we support both bug reports and remote adb sessions, we fetched the bug report that contained the logcat file.
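For readers who want to follow along, a bug report's main text file groups its output into sections delimited by `------` header lines, and the logcat output sits under a section whose header starts with `SYSTEM LOG`. Below is a minimal sketch of pulling that section out, assuming that layout and a bug report already unpacked to a text file. The function name and file names are hypothetical and are not part of Esper's tooling.

```shell
# Print the logcat section of an unpacked bug report text file.
# Assumption: sections start with header lines beginning "------ ".
extract_system_log() {
    awk '
        /^------ SYSTEM LOG / { in_log = 1; next }   # section start: begin copying
        in_log && /^------ /  { in_log = 0 }         # next header line: stop copying
        in_log                { print }
    ' "$1"
}

# Example (hypothetical file names):
#   extract_system_log bugreport.txt > logcat.txt
```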
Upon reviewing the logcat, the following error messages were consistently observed during both remote control sessions and screenshot operations:
XX-XX 15:48:06.457 524 570 E SurfaceFlinger: captureScreen failed to readInt32: -2
XX-XX 15:48:06.457 524 570 W WindowManager: Unable to take screenshot of display 13
XX-XX 15:48:06.460 699 699 W DisplayController: Skipping Display Configuration change on non-added display
XX-XX 15:48:06.463 524 570 I WindowManager: Screen frozen for +19ms due to new-config
XX-XX 15:48:06.464 524 570 E SurfaceFlinger: captureLayers failed to readInt32: -2
These messages indicate a failure in SurfaceFlinger, the Android system service responsible for determining what appears on the screen. It is a key component of the Android graphics system: it accepts data buffers from multiple sources, composites the layers, and sends the final image to the display.
The errors show that the capture requests made on behalf of Esper’s remote control agent failed (both captureScreen and captureLayers report a status of -2), so no screen content was captured, resulting in a blank image.
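When triaging a similar symptom, a quick filter over the saved logcat can surface these entries without reading the whole dump. This is a minimal sketch, assuming the logcat has been saved to a text file and that the failure messages match the ones shown above. The function names are hypothetical.

```shell
# Print log lines that indicate a failed screen capture.
# Usage: find_capture_failures <logcat-file>
find_capture_failures() {
    grep -E 'SurfaceFlinger: capture(Screen|Layers) failed|WindowManager: Unable to take screenshot' "$1"
}

# Count how many capture failures the log contains.
count_capture_failures() {
    find_capture_failures "$1" | wc -l | tr -d ' '
}
```

A non-zero count that appears only while a video is playing would match the behavior described in this case.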
Root Cause Theory
Logcats are rich and thick with entries. Our crack engineer on the case sliced and diced the dump and teased out this spammy logline:
I [706051.617336@2] MUA: gpu_realloc: screen cap should not access the uvm buffer.
From this, we posited the issue arose due to a behavior in the OS kernel supplied by the silicon vendor, which blocks the screencap process from accessing the GPU’s Unified Video Memory (UVM) buffer when another process is accessing it. This non-standard behavior prevents multiple processes from accessing the buffer simultaneously, even in read-only mode.
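One way to surface a repetitive line like this in a large dump is to strip the per-line prefix and rank messages by how often they repeat. The following is a minimal sketch, assuming kernel-style lines shaped like the one above (a level letter, a bracketed timestamp, then the message). The function name is hypothetical.

```shell
# Rank log messages by how often they repeat.
# Assumption: lines look like "I [706051.617336@2] MUA: message text".
# Usage: top_repeated_messages <log-file> [limit]
top_repeated_messages() {
    file="$1"
    limit="${2:-10}"
    # Drop the level letter and bracketed timestamp so identical messages group together.
    sed -E 's/^[A-Z] \[[^]]*\] //' "$file" | sort | uniq -c | sort -rn | head -n "$limit"
}
```

Each output line is a repeat count followed by the message, most frequent first.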
The best way to verify this theory is to inspect the kernel source code, which is not readily available from the silicon vendor. However, our engineer scoured GitHub and found the kernel source posted by another not-to-be-named developer. In it, our engineer found this source code entry:
if (!enable_screencap && current->tgid == ud->pid && buffer->commit_display) {
    pr_err("screen cap should not access the uvm buffer.\n");
    return ERR_PTR(-ENODEV);
}
This snippet shows that the kernel deliberately logs an error and returns -ENODEV when the screen capture process tries to access the UVM buffer, unless `enable_screencap` is set.
Validation
The kernel module parameter `enable_screencap` can be toggled to allow screen capturing during video playback. Setting this parameter to 1 circumvents the error condition:
echo 1 > /sys/module/aml_media/parameters/enable_screencap
Changing this parameter resolves the black screen issue during remote control and screenshot operations.
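To confirm the change took effect, the parameter can be read back before and after writing it. This is a minimal sketch, assuming a shell with sufficient privileges on the device. The helper name is hypothetical, and the exact value the kernel prints back may differ by parameter type.

```shell
# Parameter path from the article; pass a different path to try this against a scratch file.
DEFAULT_PARAM=/sys/module/aml_media/parameters/enable_screencap

# Print "<path> = <value>", or report that the parameter is not exposed.
show_screencap_param() {
    param="${1:-$DEFAULT_PARAM}"
    if [ -r "$param" ]; then
        echo "$param = $(cat "$param")"
    else
        echo "$param is not readable on this device"
        return 1
    fi
}
```

Running `show_screencap_param` once before and once after the `echo 1 > ...` command shows whether the write was accepted.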
We figured it out! However, there is a drawback: the setting does not survive a reboot, so the parameter has to be written again every time the device starts. So, while it is helpful, it is not a production-level solution in itself.
The Permanent Solution
The best way to solve this issue at its root is for the device maker to make a source-level change to the kernel so that the `enable_screencap` parameter is set to 1 by default. In the end, it is as simple as that. But making such a change end-to-end takes time, and our customer has a deployment to scale and business to do!
So, we created a companion app for the customer that reapplies this fix after each reboot. The customer can use the Esper Cloud to install and update this app as needed and move ahead with the current device and OS build. In parallel, they can work with the device maker to make the required kernel change in new device deliveries and potentially address currently deployed devices via FOTA.
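The internals of the companion app are not covered here, but the step it has to repeat after every boot is small. Below is a minimal sketch of that step as a shell function, assuming it runs in a context that is allowed to write the parameter. The function name is hypothetical and this is not the app's actual code.

```shell
# Re-apply the screen capture setting; intended to run once after each boot.
# Optional argument: parameter path (defaults to the path from the article).
apply_screencap_fix() {
    param="${1:-/sys/module/aml_media/parameters/enable_screencap}"
    if [ ! -w "$param" ]; then
        # Parameter missing or not writable: report it and leave the device untouched.
        echo "cannot write $param; fix not applied" >&2
        return 1
    fi
    echo 1 > "$param"
}
```

Checking that the parameter is writable first keeps the step harmless on devices or OS builds where the kernel does not expose it, for example once the device maker ships the permanent kernel change.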
Conclusion
The black screen of death issue during video playback in a remote control session on Esper is rooted in the silicon vendor kernel's configuration of the UVM buffer. We came up with a novel solution to address the issue for currently deployed devices via a purpose-built app, and the customer engaged with the device maker for a permanent solution via an update to the Android OS build. This issue is not related to Esper but rather to the underlying device infrastructure.
This is what we do at Esper. We go deep, using our Android OS expertise to solve thorny problems for Enterprise customers running business-critical dedicated device fleets. Check us out.