At Esper, we specialize in building a platform and offering management tools for dedicated device solutions. But in this complex environment, our customers run into complex issues that require “above and beyond” expertise to get to the root causes and ultimately solve the problems in a commercially viable way. As an infrastructure provider for dedicated and fully managed solutions on both Android and iOS, we go much deeper than typical MDM vendors.
This is the second blog in our Deep Cuts series of posts that provide case studies of actual issues we’ve encountered and solved for our customers. Names have been removed to protect the innocent (and, in some cases, the not-so-innocent).
[The name Deep Cut is a play on a song that is considerably less popular and well-known than other songs on the same album or by the same artist.]
Deep Cut #2: Mystery Crash and Burn of Esper Foundation for Android on Flipped x86 Model!
Vertical market by vertical market, there’s a shift from Windows to Android. Some verticals have moved earlier than others, and at the customer level, the hardware refresh cycle is typically a headwind against this important OS change. The vast majority of the Windows installed base is running the x86 processor architecture (sorry Windows RT, not much play for you here).
At Esper, we have many customers who need to go to Android but can’t wait for the hardware refresh cycle. Constrained capital expenditure budgets prevent them from executing the hardware refresh cycle earlier. That’s why we offer what we call the Windows to Android flip — using Esper Foundation for Android x86 combined with our Seamless Provisioning, our customers can efficiently achieve a no-touch conversion from a Windows-based dedicated device system to Android.
These machines are generally not spring chickens, in many cases more than a decade old. All the customer needs is a few more years out of them, and then they are ready to go with modern hardware. With that comes an occasional set of challenges, like dealing with BIOS drift (yes, it exists: techs can individually tweak BIOS settings in the field). Yet even spring chickens can cause problems, as you shall soon see.
One such customer converted their entire POS fleet from Windows to Android to run a modern POS solution that only supports Android on the endpoints. With our support, the customer and their IT vendors were able to convert the entire fleet, which consisted of various x86 OEM devices, at their proper pace — this type of change does not happen overnight.
The solution had been running fine for months, and then suddenly the customer started seeing devices freeze during the bootup process, stuck on the Esper Foundation boot screen animation. It was completely reasonable to assume that this was a problem with Esper Foundation for Android: the OS was crashing on boot! We had not seen this type of issue before with Foundation in the wild. Oh no! So we dove in.
Initial Triage
We were initially not able to reproduce the issue in our Esper Lab, despite having the same hardware model and configuration in house. So we started with gathering detailed information from the customer:
- At first, the customer didn’t pay much attention because only an occasional device exhibited the issue. They took care of it by reinstalling the Esper Foundation for Android x86 OS, and most of the time after that, the devices operated to spec.
- The issue continued to show up slowly but steadily, about 2-3 devices a day, eventually reaching 300 POS devices in total. Now they were getting stressed, as these are front-line, revenue-driving devices! And they were finding that reinstalling Foundation did not always fix the issue. About 10% of the devices where the OS was reinstalled would continue to freeze on boot. Hmm…
- Because the customer was in the middle of rolling out Foundation updates, the issue was seen on more than one version of Foundation, which suggested a core issue with the Foundation OS. We’ll see.
- Strangely enough, the issue was happening on only one device model, and units of that model were typically less than two years old. So the idea that it was due to aging devices was out.
- Key point: These devices utilize a 128GB flash SSD. The drive model used is an industrial-grade part rated for 3K P/E (program/erase) cycles, meaning each NAND flash memory cell is rated to be erased and then written to 3,000 times. Beyond that, it is out of specification, and the memory cell may fail.
We then went back to the two units of the affected model that we had in our lab and did an initial investigation of the state of their SSDs.
- On the first one we did a reinstall of Foundation to duplicate the remediation they performed in the field. The OS installed successfully, and the system operated to specification.
- On the second one, before installing, we ran a file system consistency check on the SSD, specifically `sudo fsck -f /dev/sda`, and found that the dirty bit was set. In this context, a dirty bit is a flag indicating that the file system may contain changes that were never cleanly written out to the storage device, so its on-disk state may be inconsistent. This ties back to the rated P/E cycles.
- A dirty bit can be caused by an improper shutdown, drive failure, or a kernel issue.
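A check like the one above can be scripted so that it never modifies the drive under investigation. The following is a minimal sketch, assuming an ext-family file system and that `e2fsck` is installed on the machine doing the analysis. The function names `decode_fsck_exit` and `check_read_only` are illustrative and not part of any Esper tooling. The sketch relies on the exit status of `e2fsck` being a bitmask, as described in its man page.

```python
import subprocess

# Exit-status bits as listed in the e2fsck(8) man page.
FSCK_EXIT_BITS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def decode_fsck_exit(code: int) -> list[str]:
    """Translate an e2fsck exit status (a bitmask) into readable messages."""
    if code == 0:
        return ["no errors"]
    return [msg for bit, msg in FSCK_EXIT_BITS.items() if code & bit]


def check_read_only(partition: str) -> list[str]:
    """Run e2fsck against a partition without modifying it.

    -f forces a check even if the file system is marked clean;
    -n opens it read-only and answers "no" to every repair prompt.
    """
    result = subprocess.run(
        ["e2fsck", "-f", "-n", partition],
        capture_output=True,
        text=True,
        check=False,  # a non-zero status is data here, not an exception
    )
    return decode_fsck_exit(result.returncode)
```

Calling `check_read_only` with the path of an unmounted partition (as root) returns the decoded status; a result containing "file system errors left uncorrected" means the check found problems it was not allowed to repair.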
This warranted a deep dive into the SSD on the devices exhibiting this issue. We secured several of these drives from the customer to perform a thorough analysis.
SSD Investigation
With a subject SSD in hand, we ran the Linux utility `e2fsck`. It reported significant errors in two partitions: /data and /system. The /data partition stores user data, while the /system partition contains essential system files. The corruption in these partitions explains why the device fails to boot when the SSD is in use.
Interestingly, reinstalling Foundation temporarily resolved the issue, which led to the initial impression that the problem might be software-related. However, that was not the case.
Why Does Formatting Temporarily Help?
Upon deeper investigation, it was discovered that the temporary relief provided by reinstalling Foundation is due to how SSD firmware functions. Formatting causes the SSD firmware to remap all the blocks, altering the storage of each sector. For instance, a byte that was previously stored in block 1200 might be reassigned to block 500 after formatting. This process helps to balance the wear levels across all memory blocks, thereby extending the SSD's lifespan.
When the filesystem writes data post-formatting, it might use less deteriorated blocks, making the SSD temporarily functional. However, as soon as the operating system encounters a bad block again, the failure will recur. As the failing blocks vary, this failure could happen at any time, whether during boot-up or while running an application.
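To make the remapping idea concrete, here is a deliberately simplified toy model. Real SSD firmware is proprietary and far more sophisticated, so everything in this sketch is an assumption made for illustration: the class name, the "least-worn free block first" allocation policy, and the erase counts are all made up. It only shows why data can land on healthier blocks right after a format and still reach worn blocks again as more data is written.

```python
class ToyFlashTranslationLayer:
    """Toy model of logical-to-physical block remapping (not real firmware)."""

    def __init__(self, erase_counts: list[int], rated_cycles: int) -> None:
        self.erase_counts = list(erase_counts)  # wear per physical block
        self.rated_cycles = rated_cycles        # e.g. 3000 for a 3K P/E part
        self.mapping: dict[int, int] = {}       # logical block -> physical block

    def write(self, logical: int) -> int:
        """Place a logical block on the least-worn free physical block."""
        if logical in self.mapping:
            return self.mapping[logical]
        used = set(self.mapping.values())
        free = [p for p in range(len(self.erase_counts)) if p not in used]
        if not free:
            raise RuntimeError("no free physical blocks")
        physical = min(free, key=lambda p: self.erase_counts[p])
        self.erase_counts[physical] += 1  # each placement costs one P/E cycle
        self.mapping[logical] = physical
        return physical

    def format(self) -> None:
        """Model a format as discarding every mapping, freeing all blocks."""
        self.mapping.clear()

    def is_worn(self, logical: int) -> bool:
        """True if the logical block sits on a block past its rated cycles."""
        return self.erase_counts[self.mapping[logical]] > self.rated_cycles


if __name__ == "__main__":
    # Made-up wear levels: three blocks are already past 3,000 cycles.
    ftl = ToyFlashTranslationLayer([3100, 2900, 3200, 1500, 1200, 3050], 3000)
    ftl.mapping = {0: 0}  # before the format, logical block 0 is on a worn block
    print("worn before format:", ftl.is_worn(0))
    ftl.format()
    for logical in range(4):
        physical = ftl.write(logical)
        print(f"logical {logical} -> physical {physical}, worn: {ftl.is_worn(logical)}")
```

In this toy run the first few blocks written after the format land on lightly used physical blocks, but the fourth has nowhere to go except a block that is already past its rating, which mirrors the "works for a while, then fails again" pattern seen in the field.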
SMART Analysis Data
SMART (Self-Monitoring, Analysis, and Reporting Technology) is a monitoring system included in most storage devices. Using `smartctl`, a Linux utility for interfacing with storage device firmware, a SMART analysis of the investigated SSD was conducted. Here is a pruned version of the output from smartctl:
SMART Error Log Version: 1
ATA Error Count: 36646 (device log contains only the most recent five errors)
SMART Attributes Data Structure revision number: 0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 11801
12 Power_Cycle_Count 0x0012 100 100 000 Old_age Always - 59
167 Unknown_Attribute 0x0022 100 100 000 Old_age Always - 0
168 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 51993
169 Unknown_Attribute 0x0013 099 099 010 Pre-fail Always - 1114168
173 Unknown_Attribute 0x0012 200 200 000 Old_age Always - 27166165507420
175 Program_Fail_Count_Chip 0x0013 100 100 010 Pre-fail Always - 0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033 100 100 020 Pre-fail Always - 11039
192 Power-Off_Retract_Count 0x0012 100 100 000 Old_age Always - 55
194 Temperature_Celsius 0x0022 084 084 030 Old_age Always - 16 (Min/Max 11/53)
231 Unknown_SSD_Attribute 0x0033 010 010 005 Pre-fail Always - 250
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 4217023929984
234 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 281474976710655
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 7902295654
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 207306949
The results included several significant findings:
- ATA Error Count: The SSD had an ATA error count of 36,646, indicating failing hardware.
- Attribute 231 (Wear Level Indicator): This attribute indicates significant wear, with a normalized value of 10 on a scale from 0 to 100 (100 is best). The SSD vendor sets the failure threshold at 5, so the drive has very little margin left before it is considered failed.
- Power On Hours: The SSD had been powered on for 11,801 hours (roughly 1.35 years, or about 16 months, of constant usage), showing heavy usage.
- Total LBAs Written: A logical block address (LBA) identifies a block on the SSD for read/write operations. The SSD had approximately 7.9 billion LBAs written, translating to around 4TB of data (assuming the common 512-byte LBA size). Although the SSD is rated for about 40TB of writes, this amount still indicates heavy usage for a 128GB SSD.
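The arithmetic behind these findings can be reproduced from the raw SMART rows. The sketch below is illustrative only: it parses three rows copied from the output above and assumes that each LBA is 512 bytes, which is common but not confirmed for this drive. The names `SmartAttribute`, `parse_attribute`, and `summarize` are made up for this example.

```python
from dataclasses import dataclass

SECTOR_BYTES = 512  # assumption: the drive counts LBAs in 512-byte units

# Three rows copied from the smartctl output shown earlier.
ROWS = """\
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 11801
231 Unknown_SSD_Attribute 0x0033 010 010 005 Pre-fail Always - 250
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 7902295654
"""


@dataclass
class SmartAttribute:
    attr_id: int
    name: str
    value: int   # normalized value (higher is healthier)
    thresh: int  # vendor failure threshold
    raw: int     # raw counter


def parse_attribute(line: str) -> SmartAttribute:
    """Parse one row of the smartctl attribute table."""
    f = line.split()
    # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
    return SmartAttribute(
        attr_id=int(f[0]), name=f[1], value=int(f[3]),
        thresh=int(f[5]), raw=int(f[9]),
    )


def summarize(attrs: dict[int, SmartAttribute]) -> dict[str, float]:
    """Derive the headline numbers from attributes 9, 231, and 241."""
    hours = attrs[9].raw
    bytes_written = attrs[241].raw * SECTOR_BYTES
    return {
        "power_on_years": hours / (24 * 365),
        "tb_written": bytes_written / 1e12,
        "gb_written_per_day": bytes_written / 1e9 / (hours / 24),
        "wear_margin": attrs[231].value - attrs[231].thresh,
    }


attrs = {a.attr_id: a for a in map(parse_attribute, ROWS.splitlines())}
summary = summarize(attrs)

if __name__ == "__main__":
    for key, number in summary.items():
        print(f"{key}: {number:.2f}")
```

Under the 512-byte assumption this works out to about 4.05TB written over about 1.35 years of power-on time, or a little over 8GB per day, with only 5 points of margin left on attribute 231.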
The Solution
After thorough testing of the SSDs provided by the customer, it was clear that the issue was not related to Esper Foundation for Android. The data clearly shows that the SSD is near the end of its life. The temporary fix of reinstalling Foundation only delays the inevitable failure, which is expected to occur again soon after each reinstallation.
Regardless of its purchase date, the extensive use and data writes have significantly worn out the drive. This situation suggests that the customer's usage patterns involve high IO operations, accelerating the wear of the SSDs. The recommendation for the customer includes either modifying their application behavior to reduce IO or switching to drives with higher endurance specifications to better handle their workload. Unfortunately, for the devices already in the field, the only viable solution for the customer was to replace the SSDs on this particular model.
If the application set's behavior does not change, pay close attention to your SSD endurance requirements at the next hardware refresh!
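One way to see whether an application set is write-heavy is to sample the kernel's per-device counters. The following is a rough sketch, not a description of how this investigation was done. It assumes a Linux-based system where `/proc/diskstats` is readable (on a locked-down device this may require a root shell), and it reuses the device name `sda` from the `fsck` command earlier in this post. The function names are illustrative. In `/proc/diskstats`, the sectors-written counter is always expressed in 512-byte units.

```python
import time

DISKSTATS_SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def sectors_written(diskstats_text: str, device: str) -> int:
    """Return the cumulative sectors-written counter for one block device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields: major, minor, device name, then the per-device counters;
        # index 9 is "sectors written".
        if len(fields) > 9 and fields[2] == device:
            return int(fields[9])
    raise ValueError(f"device {device!r} not found")


def write_rate_mb_per_s(before: str, after: str, device: str, seconds: float) -> float:
    """Average write rate between two snapshots taken `seconds` apart."""
    delta = sectors_written(after, device) - sectors_written(before, device)
    return delta * DISKSTATS_SECTOR_BYTES / 1e6 / seconds


def sample_write_rate(device: str = "sda", seconds: float = 60.0) -> float:
    """Measure the live write rate by reading /proc/diskstats twice."""
    with open("/proc/diskstats") as f:
        before = f.read()
    time.sleep(seconds)
    with open("/proc/diskstats") as f:
        after = f.read()
    return write_rate_mb_per_s(before, after, device, seconds)
```

Running `sample_write_rate("sda", 60.0)` while the POS application is under a typical load gives an average in MB/s that can be extrapolated to writes per day and compared against the drive's rated endurance.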