"Solid State" media failures

    "Solid State" media failures

    Not wanting to limit this to actual "SSDs" but, rather, "non-rotating storage media" (including FLASH "soldered down")...

    What's been your experience with "flash" failures -- typically in tablets, phones (or even SSDs)? And, with the exception of SSDs (which can typically be replaced as an FRU), what recourse have you had to address said failure(s)?

    #2
    Re: "Solid State" media failures

    had my first SSD die on me last week.....it was an OCZ Trion 150, 240gb. I've never had a tablet SSD fail, but I'm not a tablet junkie.....I still prefer a laptop.


      #3
      Re: "Solid State" media failures

      Originally posted by Topcat View Post
      had my first SSD die on me last week.....it was an OCZ Trion 150, 240gb. I've never had a tablet SSD fail, but I'm not a tablet junkie.....I still prefer a laptop.
      By "die", did it simply stop working altogether? Or, did it's FTL prove incapable of coping with write wear-through? I.e., spinning rust should, theoretically, degrade gracefully as the grown defect table gets bigger (eventually impacting the capacity of the volume). The FTL should provide similar functionality for solid state media with the caveat that the entire medium will, eventually, fail.



        #4
        Re: "Solid State" media failures

        I'm not sure what caused its death...I switched it from one interface to another (ICH7 to an ICH10), and it never worked again. BIOS sees it, but it can't be read from or written to. Tried several utilities to gain access to its data, no go. I shut it off and pronounced it dead when DBAN said the time remaining to wipe it was 640 hours....


          #5
          Re: "Solid State" media failures

          Obviously you tried it on the original machine?
          Maybe a damaged SATA connector.



            #6
            Re: "Solid State" media failures

            Originally posted by stj View Post
            Obviously you tried it on the original machine?
            Maybe a damaged SATA connector.
            Yup, dead in that machine too. Connector/cabling is fine.


              #7
              Re: "Solid State" media failures

              Fine on the drive PCB?

              Reason I say this: SATA is a fast serial system -- too damned fast.
              Data is sent in small packets and corrupted ones are re-sent.
              The theory is that if the error rate isn't too high, the transfer rates are still impressive.

              If you get a bad signal from a poor cable or connector, the data still makes it through, but much more slowly -- potentially so slowly that anything more than a handful of bytes is going to time out.
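              As a rough back-of-the-envelope sketch (the link rate and error rates here are purely illustrative assumptions, not measurements), the effect of re-sent frames on useful throughput looks something like this:

              Code:
# Rough sketch: how retransmissions on a marginal SATA link erode throughput.
# The link rate and error rates below are illustrative assumptions.

LINK_RATE_MB_S = 600.0   # SATA III: 6 Gb/s less 8b/10b line coding ~= 600 MB/s

def effective_throughput(frame_error_rate: float) -> float:
    """Goodput when every corrupted frame must be re-sent.

    With independent errors, each good frame costs 1 / (1 - p) transmissions
    on average, so useful throughput scales by (1 - p).
    """
    return LINK_RATE_MB_S * (1.0 - frame_error_rate)

for p in (0.0, 0.01, 0.10, 0.50, 0.90):
    print(f"frame error rate {p:>4.0%}: ~{effective_throughput(p):6.1f} MB/s")

              Past some error rate the raw numbers stop mattering anyway -- individual commands start hitting their timeouts, which is the failure mode described above.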



                #8
                Re: "Solid State" media failures

                SSDs do fail; I've had friends' Kingston drives go bad. None of my own have, though I've had the odd SD card and USB drive fail. Personally I do not trust any SSD not from a tier-one manufacturer (Samsung, SanDisk, Intel and *maybe* Toshiba); manufacturers like Kingston are frequently swapping the parts used on their SSDs, so performance is inconsistent and lifespan is never guaranteed. See, for instance, the Kingston V300 debacle [1].

                Newer SSDs are moving to 8-level or 16-level flash, so the per-cell density is getting really high, but the trouble is, all flash memory is damaged by erase operations. And as the cells are written more often, their leakage increases, so they hold data less reliably over time.

                This is one reason flash memory is terrible for archival purposes. If it is a high-density SSD, don't expect it to retain data without power for more than 10 years or so. A powered SSD is a happy SSD, because the controller can remap the drive periodically, when under little load.

                One other significant factor is ambient temperature. When I was employed at a large set-top box manufacturer I tested a number of flash memory devices in STBs at temperature. The unit that was running at 40C had about half the cycle life of a unit running at 25C. So, if you can position your SSD so it runs cooler, that's better. One reason I really don't like M.2 drives is because they are located so close to the hot CPU, compared to a SATA drive. Higher temperatures at high cycle counts are the worst-case for flash memory.
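                If you want to play with that temperature effect, a toy derating model fitted to just that one observation (half the cycle life at 40C vs 25C) looks like this -- the rated cycle count is an invented placeholder, and real parts won't follow such a neat curve:

                Code:
# Toy derating model fitted to a single observation: cycle life at 40 C was
# roughly half that at 25 C.  Treating the effect as exponential in
# temperature (an assumption, not a datasheet formula), life halves every
# 15 C of rise.  The 3000-cycle rating at 25 C is an invented placeholder.

REF_TEMP_C = 25.0
REF_CYCLES = 3000.0
HALVING_DELTA_C = 15.0

def estimated_cycles(temp_c: float) -> float:
    """Estimated P/E cycle life at a given operating temperature."""
    return REF_CYCLES * 0.5 ** ((temp_c - REF_TEMP_C) / HALVING_DELTA_C)

for t in (25, 40, 55, 70):
    print(f"{t:3d} C: ~{estimated_cycles(t):6.0f} cycles")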

                [1] https://www.anandtech.com/show/7763/...er-micron-nand


                  #9
                  Re: "Solid State" media failures

                  Originally posted by tom66 View Post
                  If it is a high-density SSD, don't expect it to retain data without power for more than 10 years or so.
                  More like 10 months for some of the stacked-cell chips -- I've seen the datasheets!



                    #10
                    Re: "Solid State" media failures

                    Originally posted by Topcat View Post
                    I'm not sure what caused its death...I switched it from one interface to another (ICH7 to an ICH10), and it never worked again. BIOS sees it, but it can't be read from or written to. Tried several utilities to gain access to its data, no go. I shut it off and pronounced it dead when DBAN said the time remaining to wipe it was 640 hours....
                    But the failure was coincident with your actions? ESD? Board flexing breaking some (brittle) RoHS solder joints?

                    [Coincidences always leave me suspicious...]



                      #11
                      Re: "Solid State" media failures

                      Originally posted by tom66 View Post
                      SSDs do fail; I've had friends' Kingston drives go bad. None of my own have, though I've had the odd SD card and USB drive fail. Personally I do not trust any SSD not from a tier-one manufacturer (Samsung, SanDisk, Intel and *maybe* Toshiba); manufacturers like Kingston are frequently swapping the parts used on their SSDs, so performance is inconsistent and lifespan is never guaranteed. See, for instance, the Kingston V300 debacle.
                      What I'm interested in is whether the failures are typically catastrophic (e.g., TC's, upthread) or a gradual degradation in performance (longer erase/program times, more bad blocks to recover/restore, etc.)

                      Newer SSDs are moving to 8-level or 16-level flash, so the per-cell density is getting really high, but the trouble is, all flash memory is damaged by erase operations. And as the cells are written more often, their leakage increases, so they hold data less reliably over time.
                      And even read perturb events become more of a problem -- which leads to more erase/program cycles (to salvage the block(s) in which data is degrading but the block is not actually "failed"/unusable).

                      This is one reason flash memory is terrible for archival purposes. If it is a high-density SSD, don't expect it to retain data without power for more than 10 years or so. A powered SSD is a happy SSD, because the controller can remap the drive periodically, when under little load.
                      It's worth noting that the 10-year figure has been applied to almost every new medium. Yet, time has proven otherwise. E.g., I still have ancient 8" floppies that I can read, 9-track tape, and hundreds of off-the-shelf writable CD/DVD media.

                      OTOH, I have colleagues who complain that they can't read a CD that they wrote a few months earlier (PEBKaC).

                      An SSD can typically be removed/replaced leaving you with a usable piece of kit (sans SSD). OTOH, tablets, phones and other appliances usually have their solid state memory "soldered down". So, a failure in the media OR a failure in the FTL can result in bricking the device with no hope of salvage.



                        #12
                        Re: "Solid State" media failures

                        Originally posted by Curious.George View Post
                        What I'm interested in is whether the failures are typically catastrophic (e.g., TC's, upthread) or a gradual degradation in performance (longer erase/program times, more bad blocks to recover/restore, etc.)
                        In my experience, it's typically catastrophic because the failures begin at similar times across the drive.

                        Once a sector fails on an SSD, the drive will spend a long time attempting to recover it. This will lead to read latency climbing significantly and random sector failure will also likely cause issues with filesystems.

                        With our STBs, when a failure occurred in the onboard eMMC, the Linux kernel spent about 20 minutes spewing out messages on dmesg/serial terminal before I had a usable terminal. And it effectively became unusable because each sector read would be rejected after a 10sec delay from the drive controller.

                        Maybe it's possible to configure the kernel to behave more gracefully when this goes bad but AFAIK there is no way for the kernel to know the drive is bad - it just takes forever to read from...


                          #13
                          Re: "Solid State" media failures

                          More reasons to not trust SSDs with critical data.


                            #14
                            Re: "Solid State" media failures

                            Originally posted by TechGeek View Post
                            More reasons to not trust SSDs with critical data.
                            Another reason not to have critical data on any single medium *at all*. Backups, folks! Backups! Three backups, two different locations and at least one different type of medium. But if you're too lazy to do that, then at least use an online service e.g. BackBlaze.


                              #15
                              Re: "Solid State" media failures

                              Originally posted by tom66 View Post
                              In my experience, it's typically catastrophic because the failures begin at similar times across the drive.
                              So, you're effectively saying that the wear leveling is "practically ideal" and everything "wears out" at the same time...?

                              Once a sector fails on an SSD, the drive will spend a long time attempting to recover it. This will lead to read latency climbing significantly and random sector failure will also likely cause issues with filesystems.
                              But, now you're conflating the drive's failure with the application's expectations of it. I.e., an application that doesn't hammer on the drive wouldn't suffer as horrendous a fate.

                              Environments that load apps "once" from persistent store could stumble along with the user only noticing a startup delay when the app is initially loaded.

                              With our STBs, when a failure occurred in the onboard eMMC, the Linux kernel spent about 20 minutes spewing out messages on dmesg/serial terminal before I had a usable terminal. And it effectively became unusable because each sector read would be rejected after a 10sec delay from the drive controller.
                              I'd assume you would tune the driver to not wait as long for a retry, knowing the nature of the drive that it was talking to (i.e., don't use a driver tuned for use with "traditional media").
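                              On Linux, for a SATA/SCSI block device, that is just a one-line sysfs tweak -- the per-command timeout lives in /sys/block/<dev>/device/timeout (seconds, default 30). A minimal sketch (the device name and the 5-second value are arbitrary, and this knob doesn't apply to eMMC, which sits behind a different driver stack):

                              Code:
# Minimal sketch (Linux, needs root): shorten the per-command timeout on a
# SATA/SCSI block device so a dying drive fails fast instead of stalling
# every read for the default 30 s.  Device name and value are illustrative.

from pathlib import Path

def set_scsi_timeout(device: str, seconds: int) -> None:
    """Write the SCSI command timeout (seconds) for /dev/<device>."""
    Path(f"/sys/block/{device}/device/timeout").write_text(f"{seconds}\n")

if __name__ == "__main__":
    set_scsi_timeout("sda", 5)   # assumes the ailing SSD appears as /dev/sda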

                              Maybe it's possible to configure the kernel to behave more gracefully when this goes bad but AFAIK there is no way for the kernel to know the drive is bad - it just takes forever to read from...
                              I'm assuming (?) tablets and other devices with soldered down memory implement their own FTL and, as such, can (choose to) see more of what's happening inside the medium. By contrast, an SSD has a conventional interface that it tries to maintain that deliberately hides lots of these "medium specific" details.

                              Of course, the flip side of this (if it's indeed how these devices are designed) is that you're at the mercy of N different FTL implementations, each of which embodies considerable BFM. :<



                                #16
                                Re: "Solid State" media failures

                                Originally posted by tom66 View Post
                                Backups, folks! Backups! Three backups, two different locations and at least one different type of medium. But if you're too lazy to do that, then at least use an online service e.g. BackBlaze.
                                A lot of that depends on how long you expect to consider your data as "valuable". My archive goes back more than 40 years. A good bit of that stuff I really wouldn't cry about if it disappeared. But, the effort to sort out what REMAINS important to me, today, far exceeds the cost/effort to preserve it!

                                On-line services require a high-speed connection to move copies of archives around. And/or support for more advanced protocols (e.g., rsync) to verify their integrity against local copies (and vice versa). For example, it takes a fair bit of time to copy a TB image over GbE; imagine doing that with many TB!
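                                The arithmetic is sobering (assuming a sustained ~100 MB/s of useful throughput on GbE, which is already optimistic once protocol overhead and the far end's disks get involved):

                                Code:
# Back-of-the-envelope: time to move archive images at an assumed sustained
# 100 MB/s (wire-speed GbE is 125 MB/s before any protocol overhead).

def transfer_hours(size_tb: float, rate_mb_s: float = 100.0) -> float:
    """Hours to move size_tb terabytes at rate_mb_s megabytes per second."""
    return size_tb * 1e6 / rate_mb_s / 3600.0

for tb in (1, 10, 100):
    hours = transfer_hours(tb)
    print(f"{tb:4d} TB: ~{hours:7.1f} hours ({hours / 24:5.1f} days)")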

                                And, being a Cynic, I'm not sure I'd consider any of them "secure", given the number of break-ins/hacks and outright SALES of data that we hear about.



                                  #17
                                  Re: "Solid State" media failures

                                  Originally posted by Curious.George View Post
                                  So, you're effectively saying that the wear leveling is "practically ideal" and everything "wears out" at the same time...?
                                  No... some areas will wear out sooner, but the failures will be random in nature. So it is reasonably likely that several sectors will fail in a small span of time if the drive is used in a typical fashion (and the wear levelling works well).
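                                  A toy simulation makes the clustering obvious -- every number below (block count, endurance spread, levelling jitter) is invented, but the shape of the result isn't: once the average erase count approaches the endurance of the weakest blocks, failures pile up over a short span of writes.

                                  Code:
# Toy Monte Carlo: with good wear levelling every block sees nearly the same
# erase count, so block failures cluster near end of life.  All numbers here
# (block count, endurance distribution, levelling jitter) are invented.

import random

random.seed(1)
N_BLOCKS = 10_000
# Per-block endurance: mean 3000 P/E cycles with ~5 % spread (assumed).
endurance = [random.gauss(3000, 150) for _ in range(N_BLOCKS)]

def dead_blocks(avg_cycles: float, jitter: float = 0.02) -> int:
    """Blocks exhausted after an average of avg_cycles erases per block,
    with each block's actual count varying by +/- jitter (fraction)."""
    return sum(
        avg_cycles * random.uniform(1 - jitter, 1 + jitter) >= e
        for e in endurance
    )

for avg in (2500, 2800, 3000, 3200, 3400):
    print(f"avg {avg} erases/block: {dead_blocks(avg):5d} / {N_BLOCKS} dead")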

                                  Originally posted by Curious.George View Post
                                  But, now you're conflating the drive's failure with the application's expectations of it. I.e., an application that doesn't hammer on the drive wouldn't suffer as horrendous a fate.

                                  Environments that load apps "once" from persistent store could stumble along with the user only noticing a startup delay when the app is initially loaded.
                                  True - to a point - but there is no current way for SATA SSDs to indicate that they are becoming a bit "latent" and that you need to wait to read some sections. So the kernel (Windows, Linux, whatever) will keep hitting sectors and if it gets held up somewhere, the result will be random performance degradation with the user not being aware of the trigger.

                                  With our eMMC flash on our STBs the fault essentially was that the eMMC wouldn't mount correctly, but the application software didn't like this, so attempted to re-mount it frequently. Each mount attempt took far too long as it relied on a timeout, leading to the unit slowing down considerably.

                                  Originally posted by Curious.George View Post
                                  I'm assuming (?) tablets and other devices with soldered down memory implement their own FTL and, as such, can (choose to) see more of what's happening inside the medium. By contrast, an SSD has a conventional interface that it tries to maintain that deliberately hides lots of these "medium specific" details.
                                  You assume so, but it is not correct. Many smartphones and smart devices still use eMMC flash with the same defect we experienced.


                                    #18
                                    Re: "Solid State" media failures

                                    Originally posted by Curious.George View Post
                                    A lot of that depends on how long you expect to consider your data as "valuable". My archive goes back more than 40 years. A good bit of that stuff I really wouldn't cry about if it disappeared. But, the effort to sort out what REMAINS important to me, today, far exceeds the cost/effort to preserve it!

                                    On-line services require a high-speed connection to move copies of archives around. And/or support for more advanced protocols (e.g., rsync) to verify their integrity against local copies (and vice versa). For example, it takes a fair bit of time to copy a TB image over GbE; imagine doing that with many TB!
                                    So my solution was to back it all up using my 20Mbit (upstream) connection which took a while! About two months all in with it running in the background. But it did work.

                                    I don't keep the stuff on the cloud. It's only used as a backup method, the data is only there in case a failure occurs.

                                    Originally posted by Curious.George View Post
                                    And, being a Cynic, I'm not sure I'd consider any of them "secure", given the number of break-ins/hacks and outright SALES of data that we hear about.
                                    I use BackBlaze myself with an encryption key. The encryption key is partially written down on a piece of paper, which is stored in a fire safe in my detached garage and a second copy is stored in another secret location. The key is in two parts with the first part being a secret that I have remembered (just a memorable word or something like that), and the second part is written on that paper.

                                    Without that key the data is useless, it is encrypted on my PC and if I have a drive failure they will ship me a HDD with the encrypted data on it, which I can then recover using that key.

                                    If you are so concerned about data security you can trust AES256, it *will not* be broken with current technology and is likely to remain secure for at least the next 20 years.
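                                    For what it's worth, the two-part key idea can be done with nothing but the standard library -- derive the actual 256-bit key from the memorized word plus the string on the paper, so neither half alone is useful. This is just a sketch of the concept (the salt handling, iteration count and strings are placeholders), not how BackBlaze actually does it:

                                    Code:
# Sketch of the two-part key idea, standard library only: the real encryption
# key is derived from a memorized word plus a longer secret kept on paper,
# so neither half alone is enough.  Parameters and strings are placeholders;
# this is an illustration of the concept, not BackBlaze's actual scheme.

import hashlib
import os

def derive_key(memorized: str, written: str, salt: bytes) -> bytes:
    """Derive a 256-bit key from the two halves with PBKDF2-HMAC-SHA256."""
    secret = (memorized + written).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha256", secret, salt, iterations=600_000)

if __name__ == "__main__":
    salt = os.urandom(16)    # not secret; store it alongside the backup
    key = derive_key("memorable-word", "LONG-RANDOM-STRING-FROM-THE-PAPER", salt)
    print(key.hex())         # hand this to an AES-256 implementation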


                                      #19
                                      Re: "Solid State" media failures

                                      Originally posted by tom66 View Post
                                      True - to a point - but there is no current way for SATA SSDs to indicate that they are becoming a bit "latent" and that you need to wait to read some sections. So the kernel (Windows, Linux, whatever) will keep hitting sectors and if it gets held up somewhere, the result will be random performance degradation with the user not being aware of the trigger.
                                      Well, if the driver was smarter, it would be able to note the time required to service individual requests and compare these to historical norms. I do this in userland when I'm accessing the volumes in my archive. It pays off in spades for optical media where retries are costly (in terms of time) and where "SMART" data isn't really available.

                                      Without hacking the driver, it gives me diagnostics similar to those I have available in other devices I've designed (that just use NAND/NOR directly). There, I watch the device's actual performance against its "specified" worst-case performance to detect potential failures (before they become "double failures" and, thus, less detectable -- the second failure masking the first).

                                      While I don't rely on it as a predictor of drive failure, I use it to modify the schedule for "file verification" so that the other files on the physical volume are revisited sooner in case there IS a problem brewing.
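                                      In userland, the mechanism is as simple as it sounds -- keep a running average of per-read service time and flag anything that blows well past the historical norm. A minimal sketch (the smoothing factor, threshold and read size are arbitrary choices, not tuned values):

                                      Code:
# Minimal sketch of a userland latency watch: keep an exponentially weighted
# moving average of per-read service time and flag reads that run far beyond
# the historical norm.  Alpha, threshold and chunk size are arbitrary.

import time

class LatencyWatch:
    def __init__(self, alpha: float = 0.05, threshold: float = 5.0):
        self.alpha = alpha          # smoothing factor for the moving average
        self.threshold = threshold  # flag reads slower than threshold * average
        self.avg = None

    def observe(self, seconds: float) -> bool:
        """Record one read's duration; return True if it looks anomalous."""
        if self.avg is None:
            self.avg = seconds
            return False
        anomalous = seconds > self.threshold * self.avg
        self.avg = (1 - self.alpha) * self.avg + self.alpha * seconds
        return anomalous

def timed_read(path: str, watch: LatencyWatch, chunk: int = 1 << 20) -> bytes:
    """Read one chunk from a file, noting how long the medium took to serve it."""
    start = time.monotonic()
    with open(path, "rb") as f:
        data = f.read(chunk)
    if watch.observe(time.monotonic() - start):
        print(f"slow read from {path}; schedule early re-verification of this volume")
    return data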

                                      With our eMMC flash on our STBs the fault essentially was that the eMMC wouldn't mount correctly, but the application software didn't like this, so attempted to re-mount it frequently. Each mount attempt took far too long as it relied on a timeout, leading to the unit slowing down considerably.
                                      Sounds like a case of "generic" software applied to a very specific technology. An "impedance mismatch", of sorts.

                                      Many smartphones and smart devices still use eMMC flash with the same defect we experienced.
                                      I'd have assumed economies of scale made it cost effective to deal with their own FLASH management (instead of paying a vendor to do so). OTOH, at large scales, nearly everything becomes free so this may have been an easy bone to toss out.

                                      In my case (10K's), I'd rather have the cost savings AND the enhanced insight into the components' operation, as I can't just swap out a drive -- nor do I have a network of retail establishments (phone vendors) that can provide replacement devices on my behalf.



                                        #20
                                        Re: "Solid State" media failures

                                        Originally posted by tom66 View Post
                                        So my solution was to back it all up using my 20Mbit (upstream) connection which took a while! About two months all in with it running in the background. But it did work.
                                        My archive is in excess of 100 TB. I can access it at ~100 MB/s drive rates (i.e., GbE) so I don't think twice about verifying a copy's integrity or pulling down a few GB of data in case I might want to use it (discarding it if I opt not to).

                                        I use BackBlaze myself with an encryption key. The encryption key is partially written down on a piece of paper, which is stored in a fire safe in my detached garage and a second copy is stored in another secret location. The key is in two parts with the first part being a secret that I have remembered (just a memorable word or something like that), and the second part is written on that paper.

                                        Without that key the data is useless, it is encrypted on my PC and if I have a drive failure they will ship me a HDD with the encrypted data on it, which I can then recover using that key.

                                        If you are so concerned about data security you can trust AES256, it *will not* be broken with current technology and is likely to remain secure for at least the next 20 years.
                                        So, in case of a disaster on your end, and you want to retrieve your backup, you do so through the same 20Mb pipe, over the course of another few months? Hoping, all the while, that the company maintaining it hasn't changed their terms of service (or gone belly up or been hacked offline)?

                                        I'm old enough that if something MAJOR happened (house explosion), I'd only fret over the loss of RECENT financial and medical records. And, those are periodically updated on portable media to handle the MORE likely scenario of having to evacuate (fire, flood, terror incident, etc.).

                                        Yeah, I'd miss my music archives, book library, technical library, software archive, project history logs, etc. But, I'd also miss the various bits of equipment that I'd lost -- many of which are irreplaceable and/or essential to make use of the data (apps, source code) that was "lost". Recovering the archive from an offsite store would take months, anyway (getting a machine set up again that could access them -- and make use of them! -- and having the encryption key on my person when I abandoned the office!). Not likely to be the most pressing need I'd have! So, it's just as easy to treat them as disposable at that point and start over.

                                        [Finances and medical, however, have no convenient "reset" and their "need" can prove to be "immediate"!]
