Flash storage topics

By Jake Edge
June 6, 2018

At the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Jaegeuk Kim described some current issues for flash storage, especially with regard to Android. Kim is the F2FS developer and maintainer, and the filesystem-track session was ostensibly about that filesystem. In the end, though, the talk did not focus on F2FS and instead ranged over a number of problem areas for Android flash storage.

He started by noting that Universal Flash Storage (UFS) devices have high read/write speeds, but can also have high latency for some operations. For example, ext4 will issue a discard command but a UFS device might take ten seconds to process it. That leads the user to think that Android is broken, he said.

UFS devices have a "huge garbage-collection overhead". When garbage collection is needed, the performance of even sequential writes drops way down. That needs to be avoided, so UFS must be periodically given some time to do its garbage collection. But power is a more important consideration, so hibernating the device is prioritized, which does not leave much time for the device to do its garbage collection.

Amir Goldstein suggested doing garbage collection when the device is charging; he thought that should provide a reasonable solution. Kim said that Android currently declares a ten-minute idle time at 2am that is used to defragment the filesystem. It could perhaps also be used for garbage collection.

The solution to the discard performance problem should be fairly straightforward, he said. A kernel thread (kthread) could be added to issue discards asynchronously during idle time. Candidate blocks could be added to a list that would be processed by the kthread. There is a race condition if the block gets reallocated, however.
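The deferred-discard idea, including the reallocation race, can be illustrated with a small user-space model. This is only a sketch and not kernel code: the `DiscardQueue` class, its method names, and the per-block "generation" counter are all hypothetical, chosen to show one way a stale candidate could be detected before it is discarded.

```python
import threading
from collections import deque


class DiscardQueue:
    """Toy model of deferred discards; block numbers are plain integers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = deque()      # (block, generation) discard candidates
        self._generation = {}        # block -> allocation generation (assumed)
        self.discarded = []          # blocks that were actually "discarded"

    def free_block(self, block):
        # Record the block as a candidate at its current generation.
        with self._lock:
            gen = self._generation.get(block, 0)
            self._pending.append((block, gen))

    def allocate_block(self, block):
        # Reallocation bumps the generation, invalidating stale candidates.
        with self._lock:
            self._generation[block] = self._generation.get(block, 0) + 1

    def run_idle_pass(self):
        # Would be called by the background worker during idle time.
        while True:
            with self._lock:
                if not self._pending:
                    return
                block, gen = self._pending.popleft()
                if self._generation.get(block, 0) != gen:
                    continue         # block was reused; skip the discard
                self.discarded.append(block)   # stand-in for a real discard


q = DiscardQueue()
q.free_block(10)
q.free_block(11)
q.allocate_block(11)   # block 11 is reused before the idle pass runs
q.run_idle_pass()
```

In this model the check and the discard happen under the same lock, so a block that was reallocated after being queued is never discarded; a real implementation would need an equivalent guarantee against the allocator.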

Different UFS devices have different latencies for their cache-flush commands. Some vendors' devices have low latency but others have ten-second latencies for a single cache-flush command. Given that, it makes sense to batch cache-flush commands.
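Batching amortizes one expensive flush over many requests. The following is a minimal sketch of that idea under stated assumptions: `BatchedFlusher` and its callback interface are hypothetical, and `flush_device` is a placeholder for whatever actually issues the cache-flush command.

```python
class BatchedFlusher:
    """Toy model: many sync requests share one device cache flush."""

    def __init__(self, flush_device):
        self._flush_device = flush_device   # callable issuing one cache flush
        self._waiters = []                  # callbacks waiting for durability
        self.flush_count = 0

    def request_sync(self, on_durable):
        # Queue the request instead of flushing immediately.
        self._waiters.append(on_durable)

    def commit_batch(self):
        # One flush covers every request queued so far.
        if not self._waiters:
            return 0
        batch, self._waiters = self._waiters, []
        self._flush_device()
        self.flush_count += 1
        for on_durable in batch:
            on_durable()
        return len(batch)


completed = []
flusher = BatchedFlusher(flush_device=lambda: None)
for i in range(5):
    flusher.request_sync(lambda i=i: completed.append(i))
served = flusher.commit_batch()
```

Here five requests are satisfied by a single flush. The cost is added latency for the earliest request in each batch, so how long to wait before committing a batch is a tuning decision this sketch does not address.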

Filesystem encryption is mandatory for Android. It is present in ext4 and has also been added to F2FS. There is some hardware encryption code from Qualcomm that cannot be pushed upstream, however. Ted Ts'o said that it is "horrible code" that only works for ext4 ecryptfs or F2FS; no one has had time to clean it up for the mainline.

Kim would like to see the garbage collection on the device side get optimized. He would like to add a customized interface that can be called when it is time to do garbage collection. If the system can detect idle time, it can then initiate the garbage-collection process.

SQLite performance is another problem area. SQLite uses fsync() to ensure its data has gotten to storage. By default it uses a journal, so writes to the database end up requiring two writes and two fsync() calls (first for the journal and then to the final location). Two fsync() operations can be expensive and are not needed for F2FS because it is a copy-on-write filesystem. A feature has been added to SQLite to avoid one write and one fsync() by using F2FS atomic writes.
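The two-write, two-fsync() pattern can be shown with a deliberately simplified model. This is not SQLite's actual code or file format: the helper names are hypothetical, the whole file is journaled rather than individual pages, and details such as syncing the containing directory are omitted.

```python
import os
import tempfile

fsync_calls = 0


def synced_write(path, data):
    """Write data to path, then fsync() it; counts each fsync() issued."""
    global fsync_calls
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    fsync_calls += 1


def journaled_update(db_path, journal_path, new_contents):
    # Write 1 + fsync 1: save the old contents so the update can be undone.
    with open(db_path, "rb") as f:
        old = f.read()
    synced_write(journal_path, old)
    # Write 2 + fsync 2: put the new contents in their final location.
    synced_write(db_path, new_contents)
    # The journal is no longer needed once the update is durable.
    os.remove(journal_path)


with tempfile.TemporaryDirectory() as d:
    db = os.path.join(d, "toy.db")
    synced_write(db, b"old")
    fsync_calls = 0                      # count only the update itself
    journaled_update(db, db + "-journal", b"new")
    with open(db, "rb") as f:
        result = f.read()
```

Each update in this model costs two synchronous writes. An atomic-write facility in the filesystem would let the second write stand on its own, which is the saving the F2FS feature is aiming for.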

In order to reduce the latency of fsync() calls, he is looking at write barriers. He researched them and found that they had been removed long ago. Kent Overstreet said they were removed due to unclear semantics, especially for stacked filesystems. In that case, the stack would have to provide order guarantees for the BIOs all the way down the stack, which would be difficult to do and would defeat the purpose of some of the layers. Beyond that, it is impossible to test to make sure that has been done correctly.

But Kim said that the Android case would not involve device-mapper or other stacking; he is just trying to avoid the cache-flush command. Jan Kara suggested a new storage command, like "issue barrier", that would cause any I/O issued before the barrier to complete before any new I/O.
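The ordering guarantee such a command would provide can be modeled in a few lines. This is a sketch of the semantics only: no such command is defined here, and the `BARRIER` marker and `device_completion_order()` function are hypothetical. The toy device is free to reorder writes between barriers but never across one.

```python
import random

BARRIER = object()   # marker standing in for the hypothetical barrier command


def device_completion_order(submitted, rng):
    """Return one order in which a toy device might complete the writes."""
    completed, epoch = [], []
    for item in submitted:
        if item is BARRIER:
            rng.shuffle(epoch)       # free reordering inside an epoch
            completed.extend(epoch)
            epoch = []
        else:
            epoch.append(item)
    rng.shuffle(epoch)               # writes after the last barrier
    completed.extend(epoch)
    return completed


submitted = ["journal-1", "journal-2", BARRIER, "commit-record"]
order = device_completion_order(submitted, random.Random(0))
```

Whatever order the two journal writes complete in, the commit record always comes last. Note that the model says nothing about when the data becomes durable, only about ordering, which is why a barrier can be cheaper than a cache flush.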


Flash storage topics

Posted Jun 6, 2018 22:58 UTC (Wed) by Tobu (subscriber, #24111) [Link] (18 responses)

That "issue barrier" command would be a perfect fit for some databases. Ensuring a given order is a lot faster than ensuring everything has been persisted to disk, and can be a sufficient guarantee in distributed systems.

Flash storage topics

Posted Jun 7, 2018 17:55 UTC (Thu) by drh (guest, #65025) [Link] (16 responses)

A "write barrier" does indeed help things to run faster, though if you depend on a write barrier rather than fsync(), you lose Durability on a power outage. You still have Atomic, Consistent, and Isolated - the first three letters of ACID. But you lose the D.

On the other hand, many applications don't care so much about losing a little work during a power outage as long as everything comes back up in a sane state.

Flash storage topics

Posted Jun 7, 2018 19:15 UTC (Thu) by andresfreund (subscriber, #69562) [Link]

Not necessarily. You can still get durability with barriers by waiting for the barrier to complete. But IMHO barriers are more useful for things like flushing the write ahead log because you want to write out buffers, rather than WAL flushes due to commits. Usually you have to sync the log before writing out a buffer that's covered by not yet flushed entries. Barriers can make that a lot more efficient.

Flash storage topics

Posted Jun 7, 2018 20:52 UTC (Thu) by zlynx (guest, #2285) [Link] (14 responses)

Yes, agreed. ACI is quite fine for most applications. As long as the files are in one consistent state or another a bit of data loss is acceptable.

It doesn't happen very often anyway. Between laptops with batteries and desktops with UPS most of my data loss comes from kernel bugs these days.

Speaking of UPS, I never understood why someone would balk at spending $100 to protect a $1,000 computer. Bad power can cause nasty issues.

Flash storage topics

Posted Jun 7, 2018 22:10 UTC (Thu) by k8to (guest, #15413) [Link]

I would guess that, for an individually owned computer, it is because they don't understand the realities and/or because it's another thing to deal with that they don't really want to.

At an organizational level it's another thing to organize and do procurement and asset management for. Which is basically the same type of dynamic just playing out at a different scale, but here sometimes the paperwork and human hours for the 1000$ computer are larger than 1000$ (and relatedly the paperwork for the UPS could be similar to the computer).

It probably still makes sense in the long run, though maybe not if you go thoroughly in the cattle direction.

Flash storage topics

Posted Jun 8, 2018 15:18 UTC (Fri) by nix (subscriber, #2304) [Link] (12 responses)

> Speaking of UPS, I never understood why someone would balk at spending $100 to protect a $1,000 computer. Bad power can cause nasty issues.

But UPSes are *also* a source of unreliability. If you have only a couple of power flickers a decade (as the UK used to until it privatized a lot of its electricity network and started skimping on maintenance), a UPS *worsens* reliability rather than improving it.

Flash storage topics

Posted Jun 8, 2018 15:33 UTC (Fri) by zlynx (guest, #2285) [Link] (11 responses)

I suppose?

In my personal experience, while I have had UPS batteries go bad, which just happens every so many years, I haven't had a UPS actually fail. I have had more computer power supplies fail.

And as for utility power, I don't understand how the UK could possibly be as reliable as you say. Perhaps I'm used to more spread out rural areas of Colorado, but with power lines on poles combined with extremely high winds (tornadoes sometimes) and lightning, and heavy wet snow, well, power is just going to go out every now and then. I really don't see how maintenance could help.

Flash storage topics

Posted Jun 8, 2018 15:50 UTC (Fri) by karkhaz (subscriber, #99844) [Link] (7 responses)

The weather is a lot more temperate in the UK than it is in Colorado. Neither tornadoes nor hurricanes form here, and as for heavy snow---well, people freaked out last winter when London was belaboured with five glorious centimeters of it, most years we get none. Personally, I've not in my entire life seen a single power outage in London.

Flash storage topics

Posted Jun 8, 2018 16:42 UTC (Fri) by excors (subscriber, #95769) [Link] (1 responses)

https://www.ukpowernetworks.co.uk/power-cut/map has a handy map - right now it appears to be reporting 4 unplanned power cuts in London, affecting about 82 customers, so outages are not totally unheard of. (Unfortunately they don't seem to have an obvious way to see historical data. My vague memories are of somewhere between maybe 0.5 and 2 outages per year, living in places outside London.)

Flash storage topics

Posted Jun 14, 2018 17:30 UTC (Thu) by Wol (subscriber, #4433) [Link]

Admittedly I have only lived on the outskirts of London for pretty much my entire life, but there are - over that entire period - only three power outages that I can remember.

After heavy rain, there was a landslip at a chalk pit that took out the local substation.

At work, some robbers tried to blow their way into a bank vault, but in the process took out a major electricity supply cable.

Some thieves tried to steal a copper power line (250KVA, I think) and took out a small town.

Things like brownouts are pretty much unknown.

So yes, in Britain MOST people MOST of the time never experience a problem. The only people who will see any need for a UPS are people who live near an industrial area where their neighbours are dirtying the supply. Outside of that, supply is both good and reliable, and short outages are almost unknown. If there's a problem, it's either with the house supply itself, or in the cases I've mentioned above it's a major but localised problem - at one day, the first problem was the quickest of the three to be rectified. The second took a week, while the third left many homes without power for days ...

Cheers,
Wol

Flash storage topics

Posted Jun 9, 2018 10:53 UTC (Sat) by anselm (subscriber, #2796) [Link] (4 responses)

> Neither tornadoes nor hurricanes form here

The UK Met Office would beg to disagree on tornadoes:

> Around 30 tornadoes a year are reported in the UK. These are typically small and short-lived, but can cause structural damage if they pass over built-up areas.

Hurricanes in a literal sense don't occur in the UK a lot, but sometimes the “tail end” of a hurricane can end up in Britain as a destructive storm, like Hurricane Ophelia in October, 2017.

Given that overhead power lines are fairly common at least in rural areas of the UK, it would not be in the least surprising that storms (including hurricane remnants and tornadoes) caused occasional power outages.

Flash storage topics

Posted Jun 9, 2018 12:39 UTC (Sat) by mpr22 (subscriber, #60784) [Link] (3 responses)

The UK has the most tornadoes per year of any country in Europe, and more tornadoes per square kilometre per year than any country in the world except the Netherlands. (I think it might even have more tornadoes per square kilometre per year than the region of the USA known as "Tornado Alley".)

Flash storage topics

Posted Jun 9, 2018 13:44 UTC (Sat) by karkhaz (subscriber, #99844) [Link] (2 responses)

Cheers for these facts, I must admit to not knowing them despite enthusiastically following the hurricane season in the US.

One thing that occurs to me, though, is that the sources for "highest number of tornadoes recorded per area" seem to have very high population densities (Bangladesh is mentioned often, and the Netherlands and UK have the highest and third-highest densities of all the non-tiny European nations). This may be the reason that the reported numbers are so high for these countries: people are a lot more likely to see a tornado, even if it is too weak to cause significant damage, than in rural Tornado Alley. This is compounded by the fact that tornadoes are difficult to observe directly using radar, so tornado reports mostly come from people who have seen the tornado first-hand. And higher population densities lead to more frequent infrastructure that could suffer from noticeable damage, e.g. power lines, train tracks, etc.

Also I wonder if people being acclimatized to huge tornadoes in Tornado Alley leads to people reporting less of the smaller ones: a relatively benign tornado that would have somebody in the UK scrambling to phone the Met Office might just be ignored by their cousin over the pond.

Flash storage topics

Posted Jun 12, 2018 6:14 UTC (Tue) by k8to (guest, #15413) [Link] (1 responses)

I wonder what the standards are for what qualifies as a tornado.

In the United States northeast growing up, we had a number of minor twisters that no one thought to label "tornado". If it only uprooted 30 trees or so, it was "just a twister".

Flash storage topics

Posted Jun 12, 2018 6:34 UTC (Tue) by mpr22 (subscriber, #60784) [Link]

Wikipedia tells me that to qualify as a tornado, a weather phenomenon must involve a rotating wind column, reaching from ground level to the base of the overhead clouds, with surface wind speeds in excess of 40 mph (64 km/h).

Flash storage topics

Posted Jun 8, 2018 17:47 UTC (Fri) by nix (subscriber, #2304) [Link]

The power lines are only on poles in the UK in rural areas (villages and smaller): elsewhere they are usually underground and immune to most weather-related failures. Local stepdown transformers are not per-house, but cover a few blocks (so are much larger, much more expensive, and correspondingly much more *reliable*: they very rarely fail, and usually survive even direct lightning strikes). Every significant power station is required to be able to restart under its own steam, too, so we don't have cascading failures where big chunks of the grid go offline and can't come back on because they themselves need grid power.

Long-distance high-voltage cross-country lines and lines leaving rural power stations *are* up in the air, but it takes a hell of a lot of wind to knock down one of *those* monster pylons (again, it more or less never happens unless a tree falls on the lines, and they are usually routed away from trees for exactly that reason). It does happen, but because these are high-voltage lines there is almost always fallback from elsewhere in the grid if one is hit, unless there has been a *major* storm and numerous of them have been taken out at once. (Again, this affects major conurbations essentially never: distant rural Scotland, or single small towns that might have only one or two incoming lines, sure, but nothing larger.)

I am unhappy with the current state of the UK power network. I've had two flickers in the last decade, and checking my supplier's logs I see one five-minute outage! This is awful: in the decades of the 1990s and 2000s I had none at all. It's hard to justify a UPS with reliability like that. (Note: in the same time window we had a half-*day* water outage, when a farmer drove a plough through a 6in supply pipe and de-watered half the town...)

Flash storage topics

Posted Jun 28, 2018 11:32 UTC (Thu) by jospoortvliet (guest, #33164) [Link] (1 responses)

In most of north-west Europe powerlines are underground so power outages happen at most once every few years...

Flash storage topics

Posted Jul 7, 2018 19:53 UTC (Sat) by nix (subscriber, #2304) [Link]

In rural, particularly forested, parts of North Europe, the one-step-up-from-domestic local power lines come in overhead (because they have to, because running a power line under fifty miles of forest to serve a hamlet of twenty houses is ridiculous). When the trees grow and it snows hard enough (or gloopily enough), the power can go down and stay down for days or even weeks while kilometres of line are restrung. However, unlike in Puerto Rico -- or 1998 Auckland! -- it does not go down for months.

Flash storage topics

Posted Jun 14, 2018 15:36 UTC (Thu) by Wol (subscriber, #4433) [Link]

As someone who wants to implement a database, that "issue barrier" would be perfect.

At present, (a) user-space can not reason about the state of the disk, and (b) a "sync" is effectively a Denial-of-Service attack on all the other users of the system (and yourself).

If I can guarantee that certain writes hit the disk in a fixed order, then I can reason about the state of the disk, and write a robust app.

The problem is that POSIX explicitly only applies to a properly functioning system - it explicitly disclaims all liability if the system malfunctions, and things like databases need to be able to reason about a malfunctioning system.

Cheers,
Wol

Flash storage endurance

Posted Jun 8, 2018 5:17 UTC (Fri) by marcH (subscriber, #57642) [Link] (5 responses)

[Maybe slightly off-topic sorry but since the flash experts should hopefully flock here]

There's some empirical evidence that cheap eMMCs found in - you guessed - cheap Android phones wear out quickly, sometimes even sooner than when security issues there stopped receiving fixes.

On the other hand, there's an incredible number of Android sites and communities rooting and reviewing every single Android device under the sun and running many different benchmarks on them. Just no... storage endurance test ever? Why?

It wouldn't sound like a major feat to run some storage test designed to "break" flash storage as fast as possible thanks to smartly configured write amplification[*] and what not and to measure how many cycles it takes before the memory dies. Or is it hard and why? OK that would cost a sample device but the number of clicks for the corresponding review should hopefully offset that.

[*] https://lwn.net/Articles/428584/ Optimizing Linux with cheap flash drives Arnd Bergmann
https://www.bunniestudios.com/blog/?p=3554 On Hacking MicroSD Cards

> For example, ext4 will issue a discard command but a UFS device might take ten seconds to process it

LOL. Hey, why would you have half-decent latency requirements for components aimed at a market of purely interactive products! I doubt any eMMC was that bad, I mean at least not any brand new eMMC.

Flash storage endurance

Posted Jun 8, 2018 9:04 UTC (Fri) by excors (subscriber, #95769) [Link] (4 responses)

I think that number by itself wouldn't tell you anything useful - you'd need to know how much writing occurs during the expected usage of the device, to determine how many years the flash is likely to survive and whether it's a real problem.

Maybe some device manufacturers measure and optimise their IO, allowing themselves to choose a cheaper chip with lower endurance because they have confidence that it will be sufficient, whereas others don't care and have a higher-endurance chip that wears out quicker because they're constantly spamming it with log files and unnecessary caches and some process is calling sync() every 30ms. Simply comparing the raw endurance would give misleading results as to which device is better, and reviewing devices with misleading benchmarks is harmful since it forces manufacturers to optimise for those benchmarks rather than for users.

I suspect it's also hard to get meaningful measurements from a single device, because of the random nature of the failures. You might need to test a large number to get an accurate MTBF, and it seems impractical and a bit silly to buy a large number of phones just to test a chip that costs a few dollars.

Flash storage endurance

Posted Jun 8, 2018 14:06 UTC (Fri) by marcH (subscriber, #57642) [Link] (3 responses)

> harmful since it forces manufacturers to optimise for those benchmarks rather than for users.

This is already the nature of almost the entire industry except in this case. Yet I don't think anyone would like a benchmark-free world. The answer is rather better and more varied benchmark(s) that are harder to cheat. Considering the relative simplicity of storage interfaces (compared to say... GPUs!) designing such an endurance benchmark that models real-world usage quite reliably doesn't seem crazy. In fact isn't there some endurance benchmark already for less disposable storage products?

> You might need to test a large number to get an accurate MTBF, and it seems impractical and a bit silly to buy a large number of phones just to test a chip that costs a few dollars.

Fair enough. Then maybe the answer should be something like this:
https://ai.google/research/pubs/pub32774 "Failure Trends in a Large Disk Drive Population"
Maybe it's happening somewhere already.

Flash storage endurance

Posted Jun 8, 2018 15:33 UTC (Fri) by excors (subscriber, #95769) [Link] (2 responses)

I agree it would be good to have good benchmarks - I'm just concerned that bad benchmarks are worse than no benchmarks, and it seems hard to design good benchmarks for this. The problem isn't necessarily that people would cheat, it's that the marketing people would tell the engineers to spend effort on legitimately increasing a benchmark number that does not meaningfully improve the customer experience, at the expense of something more useful.

Measuring the endurance of a particular flash chip doesn't sound like it should be too difficult; just do a load of writes until you see IO failures or data loss, and maybe do something to see how effective any wear-levelling is, and compare against the vendor's endurance guarantees to make sure they're not lying. But if you want to know how that affects the lifetime of a phone, you need to know the behaviour of the software on that phone, and you need to know what memory chip it uses (which is non-trivial since a single model of phone might use parts from multiple vendors at once, for supply chain diversification, and change parts over time to reduce cost), and that's not something a typical phone review site could feasibly do. CPU/GPU benchmarks are much easier since the relevant software is provided by the benchmark itself, and the hardware is usually consistent across a phone model (or if some are different then it's probably a whole different SoC and is very obvious), so measurements on a test device are likely to match customer devices.

To get realistic data about large populations, I guess you'd need access to automatically-uploaded error logs or customer support records to see how many users have encountered storage errors. That would be nice, but seems unlikely to happen.

Flash storage endurance

Posted Jun 8, 2018 18:20 UTC (Fri) by marcH (subscriber, #57642) [Link] (1 responses)

> But if you want to know how that affects the lifetime of a phone, you need to know the behaviour of the software on that phone,

Basic benchmark design problem, not specific to storage or endurance.

> and you need to know what memory chip it uses

Not a problem specific to storage or endurance: https://www.google.com/search?q=iphone+intel+modem

> CPU/GPU benchmarks are much easier since the relevant software is provided by the benchmark itself

Interfaces to GPU are orders of magnitude more complex than storage interfaces; one of the reasons cheating GPU benchmarks is universal: https://www.google.com/search?q=game+benchmark+cheating
https://fosdem.org/2018/schedule/event/apitrace/
Yet no one suggests to stop benchmarking GPUs.

> The problem isn't necessarily that people would cheat, it's that the marketing people would tell the engineers to spend effort legitimately...

We know where "legitimately" often ends up with (at least) GPUs and car emissions. You can take for granted that some actors will always go "beyond legitimate"; again nothing specific to flash storage or endurance.

> I guess you'd need access to automatically-uploaded error logs or customer support records to see how many users have encountered storage errors. That would be nice, but seems unlikely to happen.

How do we know it's not happening already? (biggest lie on the Internet: "I agree")

Flash storage endurance

Posted Jun 8, 2018 18:44 UTC (Fri) by excors (subscriber, #95769) [Link]

>> I guess you'd need access to automatically-uploaded error logs or customer support records to see how many users have encountered storage errors. That would be nice, but seems unlikely to happen.
>
> How do we know it's not happening already? (biggest lie on the Internet: "I agree")

Error logs certainly get uploaded already, on some devices - they're very useful for identifying and prioritising common bugs, quickly detecting regressions when rolling out OTAs, etc. What I mean is unlikely is that the companies with that information would ever release it publicly.

[slightly off-topic] 2AM?

Posted Jun 8, 2018 9:55 UTC (Fri) by awilfox (guest, #124923) [Link] (1 responses)

> Kim said that Android currently declares a ten-minute idle time at 2am that is used to defragment the filesystem. It could perhaps also be used for garbage collection.

What about night owls like me, who are regularly using the phone at that time? If the phone isn't idle then, does it just wait for the next idle time? Searched ddg but nobody seems to have heard of Android doing defragmenting every 02:00...

[slightly off-topic] 2AM?

Posted Jun 18, 2018 12:40 UTC (Mon) by nelzas (subscriber, #4427) [Link]

Yes, and what about people like me, who turn off their phone when going to bed?
Searching for defrag in phone settings doesn't give me results...