Day two of the Linux Storage, Filesystem, and Memory Management Summit was much like its predecessor, but with fewer combined sessions. It was held in San Francisco on April 2. Below is a look at the combined sessions as well as those in the Memory Management track; it is largely based on write-ups from Mel Gorman, with some additions from my notes. In addition, James Bottomley has written up the Filesystem and Storage track.
Steven Sprouse was invited to the summit to talk about flash media. He is the director of NAND systems architecture at SanDisk, and his group is concerned with consumer flash products (for things like mobile devices) rather than enterprise storage applications, which are handled by a different group. But, he said, most of what he would be talking about applies to most flash technologies.
The important measure of flash for SanDisk is called "lifetime terabyte writes", which is calculated by the following formula:
physical capacity * write endurance
-----------------------------------
write amplification
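To make the formula concrete with purely illustrative numbers (these are not SanDisk figures): a 256GB device rated for 3,000 write cycles and seeing a write amplification factor of four would offer

    256 GB * 3000 cycles
    --------------------  =  192 TB of lifetime writes
             4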
Physical capacity is increasing, but write endurance is decreasing (mostly due to cost issues). Write amplification is a measure of the actual amount of writing that must be done because the device has to copy data based on its erase block size.
Write amplification is a function of the usage of the device, its block size, over-provisioning, and the usage of the trim command (to tell the device what blocks are no longer being used).
Block sizes (which are the biggest concern for write amplification) are getting bigger for flash devices, resulting in higher write amplification.
The write endurance is measured in data retention years. As the cells in the flash get cycled, the amount of time that data will last is reduced. If 10,000 cycles are specified for the device, that doesn't mean they die at that point, just that they may no longer hold data for the required amount of time. There is also a temperature factor and most of the devices he works with have a maximum operating temperature of 45-50°C. Someone asked about read endurance, and Sprouse said that reads do affect endurance but didn't give any more details.
James Bottomley asked if there were reasons that filesystems should start looking at storing long-lived and short-lived data separately and not mixing the two. Sprouse said that may eventually be needed. He said there is a trend toward hybrid architectures that have small amounts of high-endurance (i.e. can handle many more write cycles) flash and much larger amounts of low-endurance flash. Filesystems may want to take advantage of that by storing things like the journal in the high-endurance portion, and more stable OS files in the low-endurance area. Or storing hot data on high-endurance and cold data on low-endurance. How that will be specified is not determined, however.
The specs for a given device are based on the worst-case flash cell, but the average cell will perform much better than that worst case. If you cycle all of the cells in a device the same number of times, one of the pages might well only last 364 days, rather than the one year in the spec. Those values are determined by the device being "cycled, read, and baked", he said. The latter is the temperature testing that is done.
Sprouse likened DRAM and SRAM to paper that has been written on in pencil. If a word is wrong, it can be erased without affecting the surrounding words. Flash is like writing in pen; it can't be erased, so a one-word mistake requires that the entire page be copied. That is the source of write amplification. From the host side, there may be a 512K write, but if that data resides in a 2048K block on the flash, the other three 512K chunks need to be copied, which makes for a write amplification factor of four. In 2004, flash devices were like writing on a small Post-it pad that could only fit four words, but in 2012, it is like writing on a piece of paper the size of a large table. The cost for a one-word change has gone way up.
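A minimal sketch of that arithmetic, assuming the naive worst case where any update smaller than an erase block forces the whole block to be rewritten (real flash translation layers are more sophisticated than this):

    /*
     * Toy model of write amplification: a sub-block update causes the
     * whole erase block to be rewritten.  Not how any real FTL works.
     */
    #include <stdio.h>

    static double write_amplification(unsigned long write_bytes,
                                      unsigned long erase_block_bytes)
    {
        if (write_bytes >= erase_block_bytes)
            return 1.0;     /* block-sized writes need no extra copying */
        /* the whole erase block must be rewritten for the small update */
        return (double)erase_block_bytes / (double)write_bytes;
    }

    int main(void)
    {
        /* Sprouse's example: a 512K write landing in a 2048K erase block */
        printf("WA = %.1f\n", write_amplification(512 * 1024, 2048 * 1024));
        return 0;
    }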
In order for filesystems to optimize their operation for the geometry of the flash, there needs to be a way to get that geometry information. Christoph Hellwig pointed out that Linux developers have been asking for five years for ways to get that information without success. Sprouse admitted that was a problem and that exposing that information may need to happen. There is also the possibility of filesystems tagging the data they are writing to give the device the information necessary to make the right decision.
Sprouse posed a question about the definition of a "random" write. A 1G write would be considered sequential by most, while 4K writes would be random, but what about sizes in between? Bottomley said that anything beyond 128K is sequential for Linux, while Hellwig said that anything up to 64M is random. But the "right" answer was: "tell me what the erase block size is". For flash, random writes are anything smaller than the erase block size. In the past, writing in 128K chunks would have been reasonable, he said, but today each write of that size may make the flash copy several megabytes of data.
One way to minimize write amplification is to group data that is going to become obsolete at roughly the same time. Obsolete can mean that the data is overwritten or that it is thrown away via a trim or discard command. The filesystem should strive to avoid having cold data get copied because it is accidentally mixed in with hot data. As an example, Ted Ts'o mentioned package files (from an RPM or Debian package), which are likely to be obsoleted at the same time (by a package update or removal). Some kind of interface so that user space can communicate that information would be required.
In making those decisions, the focus should be on the hottest files (those changing most frequently) rather than the coldest files, Sprouse said. If the device could somehow know what the logical block addresses associated with each file are, that would help it make better decisions. As an example, if a flash device has four dies, and four files are being written, those files could be written in parallel across the dies. That has the effect of being fast for writing, but is much slower when updating one of the files. Alternatively, each could be written serially, which is slower, but will result in much less copying if one file is updated. Data must be moved around under the hood, Sprouse said, and if the flash knows that a set of rewrites are all associated with a single file, it could reorganize the data appropriately when it does the update.
There are a number of things that a filesystem could communicate to the device that would help it make better decisions: which blocks belong to the same file, and which are related by function, like files in a /tmp directory that will be invalid after the next boot, OS installation files, or browser cache files. Filesystems could also mark data that will be read frequently or written frequently. Flash vendors, in turn, need to provide a way for the host to determine the geometry of a device, like its page size, block size, and stripe size.
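As a purely hypothetical illustration (no such interface exists; the structures and field names here are invented), the hints Sprouse described boil down to a small amount of per-write metadata plus a handful of device parameters:

    /*
     * Hypothetical sketch only: collects in one place the kinds of hints
     * a device could use and the geometry a host would like to query.
     */
    struct flash_write_hint {
        unsigned long long file_id;     /* blocks from the same file share an ID */
        unsigned int lifetime_class;    /* e.g. /tmp data vs. OS install files   */
        unsigned int read_frequency;    /* expected read temperature             */
        unsigned int write_frequency;   /* expected write temperature            */
    };

    struct flash_geometry {
        unsigned int page_size;
        unsigned int erase_block_size;
        unsigned int stripe_size;
    };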
Those are all areas where OS developers and flash vendors could cooperate, he said. Another that he mentioned was to provide some way for the host to communicate how much time has elapsed since the last power off. Flash devices are still "operating" even when they are powered off, because they continue to hold the data that was stored. You could think of flash as DRAM with a refresh rate of one year, for example. If the flash knows that it has been off for six months it could make better decisions for data retention.
Some in the audience advocated an interface to the raw flash, rather than going through the flash translation layer (FTL). Ric Wheeler disagreed, saying that we don't want filesystems to have to know about the low-level details of flash handling. Ts'o agreed and noted that new technologies may come along that invalidate all of the work that would have been put in for an FTL-less filesystem. Chris Mason also pointed out that flash manufacturers want to be able to put a sticker on the devices saying that it will store data for some fixed amount of time. They will not be able (or willing) to do that if it requires the OS to do the right thing to achieve that.
One thing that Mason would like to see is some feedback on hints that filesystems may end up providing to these devices. One of his complaints is that there is no feedback mechanism for the trim command, so that filesystem developers can't see what benefits using trim provides. Sprouse said that trim has huge benefits, but Mason wants to know whether Linux is effective at trimming. He would like to see ways to determine whether particular trim strategies are better or worse and, by extension, how any other hints provided by filesystems are performing.
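For reference, the batched-discard interface that already exists on the Linux side is the FITRIM ioctl, which is what the fstrim(8) utility uses to ask a filesystem to discard its free space; a minimal sketch (the /mnt path is just an example, and error handling is abbreviated):

    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(void)
    {
        struct fstrim_range range = {
            .start  = 0,
            .len    = ULLONG_MAX,   /* consider all free space */
            .minlen = 0,            /* no minimum extent size */
        };
        int fd = open("/mnt", O_RDONLY);

        if (fd < 0 || ioctl(fd, FITRIM, &range) < 0)
            perror("FITRIM");
        else    /* the kernel updates range.len with the bytes trimmed */
            printf("trimmed %llu bytes\n", (unsigned long long)range.len);
        return 0;
    }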
Bottomley asked if flash vendors could provide a list of the information they are willing to provide about the internals of a given device. With that list, filesystem developers could say which pieces would be useful. Many of these "secrets" about the internals of flash devices are not so secret; Ts'o pointed out that Arnd Bergmann has used timing attacks to suss out these details, which he has published. Even if there are standards that provide ways for hosts to query these devices for geometry and other information, that won't necessarily solve the problem. As someone in the audience pointed out, getting things like that into a standard does not force the vendors to correctly fill in the data.
Wheeler asked if it would help for the attendees' "corporate overlords" to officially ask for that kind of cooperation from the vendors. There were representatives of many large flash-buying companies at the summit, so that might be a way to apply some pressure. Sprouse said that, like most companies, SanDisk (and presumably other flash companies) has different factions. His group sees the benefit of close cooperation with OS developers, but others see the inner workings as "secret sauce".
It is clear there are plenty of ways for the OS and these devices to cooperate, which would result in better usage and endurance. But there is plenty of work to do on both sides before that happens.
Kent Overstreet discussed the Bcache project, which creates an SSD-based cache for other (slower) block devices. He began by pointing out that the device mapper (DM) stores much of the information that Bcache would need in user space. Basically, the level of pain required to extract the necessary information from DM meant that they bypassed it entirely.
It was more or less acknowledged that Bcache is sufficiently well established in terms of performance that DM should perhaps provide an API it can use. Basically, if a flash cache is to be implemented in the kernel, basing it upon Bcache would be preferable. It would also be preferable for any such cache to be configured via an established interface such as DM; that is the core issue that keeps getting batted around.
It was pointed out that Bcache also required some block-layer changes to split BIOs in some cases, depending on the contents of the btree, which would have been difficult to communicate via DM. This reinforces the original point that adapting Bcache to DM would require a larger number of changes than expected. There was some confusion on exactly how Bcache is implemented and what its requirements are, but the Bcache developers were not against adding DM support as such. They were just indifferent to DM because their needs were already being served.
In different variations, the point was made that the community is behind the curve in terms of caching on flash and that some sort of decision is needed. This did not prevent the discussion being pulled in numerous different directions that brought up a large number of potential issues with any possible approach. The semi-conclusion was that the community "has to do something", but no real conclusion was reached on what that something should be. There was a vague message that a generic caching layer, initially based on SSDs, is required, but exactly which layer of the storage stack it should live in was unclear.
Hiroyuki Kamezawa discussed the problem of hot unplugging full NUMA nodes on Intel "Ivy Bridge"-based platforms. There are certain structures that are allocated on a node that cannot be reclaimed before unplug such as pgdat. The basic approach is to declare these nodes as fully ZONE_MOVABLE and allocate needed support structures off-node. The nodes this policy affects can be set via kernel parameters.
An alternative is to boot only one node and, later, hotplug the remaining nodes, marking them ZONE_MOVABLE as they are brought up. Unfortunately, there is an enumeration problem with this. The mapping of physical CPUs to NUMA nodes is not constant because altering a BIOS setting such as hyperthreading (HT) may change that mapping. For similar reasons, the NUMA node ID may change if DIMMs change slots. Hence, the problem is that the physical node IDs and the node IDs as reported by the kernel are not the same between boots. If, on a four-node machine, they boot nodes zero and one and then hotplug node two, the physical addresses might vary; this is problematic when deciding which node to remove or even when deciding where to place containers.
To overcome this, they need some sort of translation layer that virtualizes the CPU and node ID numbers to keep the mappings consistent between boots. There is more than one use case for this, but the problem mentioned regarding companies that have very restrictive licensing based on CPU IDs was not a very popular one. To avoid going down a political rathole, that use case was acknowledged, but the conversation moved on as there are enough other reasons to provide the translation layer.
It was suggested that perhaps only one CPU and node be activated at boot, with the remaining nodes brought up after udev is active. udev could be used to create symbolic links mapping virtual CPU IDs to physical CPU IDs and, similarly, to link virtual node IDs to the underlying physical IDs in sysfs. A further step might be to rename CPU IDs and node IDs at runtime to match what udev discovers, similar to the way network devices can be renamed, but that may be unnecessary.
Conceivably, an alternative would be that the kernel could be informed what the mapping from virtual IDs to physical IDs should be (based on what's used by management software) and rename the sysfs directories accordingly, but that would be functionally equivalent. It was also suggested that this should be managed by the hardware but that is probably optimistic and would not work for older hardware.
Unfortunately, there was no real conclusion on whether such a scheme could be made to work or whether it would suit Kamezawa's requirements.
Dan Magenheimer started by discussing whether frontswap should be merged. It stalled, he said, due to bad timing, as it crossed a line where there was an increased emphasis on review and testing. To address this he gave an overview of transcendent memory and its components, such as the cleancache and frontswap front-ends and the zcache, RAMster, Xen, and KVM backends. Many of these components have been merged, with RAMster being the most recent addition, but frontswap is notable by its absence despite the fact that some products ship with it.
He presented the results of a benchmark run based on the old reliable parallel kernel build, with increasing numbers of parallel compiles until it started hitting swap, and showed the performance difference when zcache was enabled. The figures seemed to imply that the overhead of the scheme was minimal until there was memory pressure; under pressure, zcache could in fact improve performance due to more efficient use of RAM and reduced file and swap I/O. He referred people to the list where more figures are available.
He followed up by presenting the figures when the RAMster backend was used. The point was made that using RAMster might show an improvement on the target workload while regressing the performance of the machine that RAMster was taking resources from. Magenheimer acknowledged this, but felt there was sufficient evidence justifying frontswap's existence to have it merged.
Andrew Morton suggested posting it again with notes on which products are shipping with it already. He asked how many people had done a detailed review and was discouraged that apparently no one had. On further pushing it turned out that Andrea Arcangeli had looked at it and, while he saw some problems, he also thought it had been significantly improved in recent times. Rik van Riel's problem was that frontswap's API is synchronous, but Magenheimer believes that some of these concerns have been alleviated in recent updates. Morton said that if this gets merged, it will affect everyone and insisted that people review it. It seems probable that more review will be forthcoming this time around, as people in the room did feel that the frontswap+zcache combination, in particular, would be usable by KVM.
Kyungmin Park then talked about the contiguous memory allocator (CMA) and how it has gone through several versions with review but without being merged. Morton said that he had almost merged it a few times but then a new version would come out. He said to post it again and he'll merge that.
Mel Gorman then brought up swap over NFS, which has also stalled. He acknowledged that the patches are complex, and the feedback has been that the feature isn't really needed. But, he maintained, that's not true, it is used by some and, in fact, ships with SUSE Linux. Red Hat does not, but he has had queries from at least one engineer there about the status of the patches.
Gorman's basic question was whether the MM developers were willing to deal with the complexity of swap over NFS. The network people have "stopped screaming" at him, which is not quite the same thing as being happy with the patches, but Gorman thinks progress has been made there. In addition, there are several other "swap over network filesystem" patches floating around, all of which will require much of the same infrastructure that swap over NFS requires.
Morton said that the code needs to be posted again and "we need to promise to look at it". Hopefully that will result in comments on whether it is suitable in its current state or, if not, what has to be done to make it acceptable.
While implementing a page table walker for estimating working set size, Michel Lespinasse found a number of cases where mmap_sem hold time for writes caused significant problems. Despite the topic title ("Working Set Estimation"), he focused on enumerating the worst mmap_sem hold times, such as when a mapped file is accessed and the atime must be updated, or when a threaded application is scanning files and hammering mmap_sem. The user-visible effects of this can be embarrassing. For example, ps can stall for long periods of time if a process is stalled on mmap_sem, which makes it difficult to debug a machine that is responding poorly.
There was some discussion on how mmap_sem could be changed to alleviate some of these problems. The proposed option was to tag a task_struct before entering a filesystem to access a page. If the filesystem needs to block and the task_struct was tagged, it would release the mmap_sem and retry the full fault from the start after the filesystem returns control. Currently the only fault handler that does this properly is x86. The implementation was described as being ugly, so he would like people to look at it and see how it could be improved. Hugh Dickins agreed that it was ugly and wants an alternative. He suggested that maybe we want an extension of pte_same() to cover pte_same_vma_same(), but it was not deeply considered. One possibility would be to have a sequence counter on the mm_struct and observe whether it changed.
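For reference, the drop-and-retry pattern that the x86 fault handler already uses looks roughly like the following; this is a heavily simplified sketch of the kernel code of that era, with VMA checks, signal handling, and error paths omitted:

    /*
     * Simplified sketch of the x86 fault-retry pattern: if the fault
     * path had to block, it drops mmap_sem, returns VM_FAULT_RETRY,
     * and the handler retries the whole fault from the start.
     */
    unsigned int flags = FAULT_FLAG_ALLOW_RETRY;
    int fault;

    retry:
        down_read(&mm->mmap_sem);
        vma = find_vma(mm, address);
        /* ... VMA and access checks omitted ... */
        fault = handle_mm_fault(mm, vma, address, flags);

        if (fault & VM_FAULT_RETRY) {
            /* mmap_sem was already released inside the fault path */
            flags &= ~FAULT_FLAG_ALLOW_RETRY;
            goto retry;
        }

        up_read(&mm->mmap_sem);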
Andrea Arcangeli pointed out that just dropping the mmap_sem may not help as it still gets hammered by multiple threads and instead the focus should be on avoiding blocking when holding mmap_sem for writing because it is an exclusive lock. Lespinasse felt that this was only a particular problem for mlockall() so there may be some promise for dropping mmap_sem for any blocking and recovering afterward.
Dickins felt that, at some point in the past, mmap_sem was dropped for writes and just a read semaphore was held under some circumstances. He suggested doing some archeology in the commit history to confirm whether the kernel ever did that and, if so, why that approach was dropped.
The final decision for Lespinasse was to post the patch that expands task_struct with information that would allow the mmap_sem to be dropped before doing a blocking operation. Peter Zijlstra has concerns that this might have some scheduler impact and Andi Kleen was concerned that it did nothing for hold times in other cases. It was suggested that the patch be posted with a micro-benchmark that demonstrates the problem and what impact the patch has on it. People that feel that there are better alternatives can then evaluate different patches with the same metrics.
Hugh Dickins credited Johannes Weiner's work on reducing the size of mem_cgroup and highlighted Hiroyuki Kamezawa's further work. He asserted that mem_cgroup is now sufficiently small that it should be merged with page_cgroup. He then moved on to page flag availability and pointed out that there currently should be plenty of flags available on 64-bit systems. Andrew Morton pointed out that some architectures have stolen some of those flags already and that should be verified. Regardless of that potential problem it was noted that, due to some slab alignment patches, there is a hole in struct page and there is a race to make use of that space by expanding page flags.
The discussion was side-tracked by bringing up the problem of virtual memory area (VMA) flag availability. There were some hiccups with making VMA flags 64-bit in the past but thanks to work by Konstantin Khlebnikov, this is likely to be resolved in the near future.
Dickins covered a number of different uses of flags in the memory cgroup (memcg) and where they might be stored, but pointed out that memcg was not the primary target. His primary concern was that some patches are contorting themselves to avoid using a page flag. He asserted that the overhead of this complexity is now higher than the memory savings from having a smaller struct page. As keeping struct page very small was originally for 32-bit server-class systems (which are now becoming rare), he felt that we should just expand page flags. Morton pointed out that we are going to have to expand page flags eventually and now is as good a time as any.
Unfortunately, numerous issues were raised about 32-bit systems that would be impacted by such a change and it was impossible to get consensus on whether struct page should be expanded or not. For example, it was pointed out that embedded CPUs with cache lines of 32 bytes benefit from the current arrangement. Instead, it looks like further tricks may be investigated for prolonging the current situation, such as reducing the number of NUMA nodes that can be supported on 32-bit systems.
Johannes Weiner wanted to discuss the memcg statistics and what should be gathered. His problem is that he has gotten very little traction on the list, and he felt that maybe it would be better if he explained the situation in person.
The most important statistics he requires are related to memcg hierarchical reclaim. The simple case is just the root group and the basic case is one child that is reclaimed by either hitting its hard limit or due to global reclaim. It gets further complicated when there is an additional child and this is the minimum case of interest. In the hierarchy, cgroups might be arranged as follows:
root
  cgroup A
    cgroup B
The problem is that if cgroup B is being reclaimed then it should be possible to identify whether the reclaim is due to internal or external pressure. Internal pressure would be due to cgroup B hitting its hard limit. External pressure would be due to either cgroup A hitting its hard limit or global reclaim.
He wants to report pairs of counters for internal and external reclaims. By walking the cgroup tree, the statistics for external pressure can be calculated. By looking at the external figures for each cgroup in user space, it can be determined exactly where the external pressure on any given cgroup originated. The alternative is needing one group of counters per parent, which is unwieldy. Just tracking counters about the parent would be complicated if the group were migrated.
The storage requirements are just for the current cgroup. When reporting to user space a tree walk is necessary, so there is a computational cost, but the information will always be coherent even if the memcg changes location in the tree. There was some dispute over exactly which file should expose this information, but that was a relatively minor problem.
The point of the session was for people to understand how he wants to report statistics and why it is a sensible choice. It seemed that people in the room had a clearer view of his approach and future review might be more straightforward.
Michal Hocko stood up to discuss the current state of the memcg development tree. After the introduction of the topic, Andrew Morton asked why it was not based on linux-next; Hocko said that linux-next is a moving target, which potentially leads to rebases. Morton did not really get why the tree was needed, but the memcg maintainers said the motivation was to develop against a stable point in time without having to wrestle with craziness in linux-next.
Morton wanted the memcg work to be a client of the -mm tree. That tree is itself a client of linux-next, but Morton feels he could manage the issues as long as the memcg developers were willing to deal with rebases, which they were. Morton is confident he can find a way to compromise without the creation of a new tree. In the event of conflicts, he said that those conflicts should be resolved sooner rather than later.
Morton made a separate point about how long it is going to take to finish memcg: it's one file, how much more can there be to do? Peter Zijlstra pointed out that much of the complexity is due to changing semantics and continual churn. The rate of change is slowing, but it still happens.
The conclusion is that Morton will work on extracting the memcg stuff from his view of the linux-mm tree into the memcg devel tree on a regular basis to give them a known base to work against for new features. Some people in the room commented that they missed the mmotm tree as it used to form a relatively stable tree to develop against. There might be some effort in the future to revive something mmotm-like while still basing it on linux-next.
Andi Kleen talked a bit about some of the scalability issues he has run into; these are issues that have shown up in both micro- and macro-benchmarks. He gave the example of very large processes that fork, causing VMA chains that are thousands of entries long. TLB flushing is another problem: pages being reclaimed result in an IPI for each page, and he feels these operations need to be batched. Andrea Arcangeli pointed out that batching may be awkward because pages are being reclaimed in LRU, not MM, order. Reclaim could just send an IPI when a bunch of pages has been gathered, or be able to build lists of pages for multiple MMs.
Another issue was whether clearing the accessed bit should result in a TLB flush or not. There were disagreements in the room as to whether skipping the flush would be safe. It potentially affects reclaim, but the length of time a page lives on the inactive LRU list should be enough to ensure that the process gets scheduled and flushes the TLB. Relying on that was considered problematic, but alternative solutions, such as deferring the flush and then sending a global broadcast, would interfere with other efforts to reduce IPI traffic. Just avoiding the flush when clearing the accessed bit should be fine in the vast majority of cases, so chances are a patch will appear on the list for discussion.
Kleen next raised an issue with drain_pages(), which has a severe lock contention problem when releasing pages back to the zone list, as well as causing a large number of IPIs to be sent.
His final issue was that swap clustering in general seems to be broken and that the expected clustering of virtual addresses to contiguous areas in swap is not happening. This was something 2.4 was easily able to do because of how it scanned page tables, but it's less effective now. However, there have been recent patches related to swap performance, so that particular issue needs to be re-evaluated.
The clear point that shone through is that there are new scalability issues that are going to become higher priority as large machines become cheaper, and that the community should be proactive in dealing with them.
Pavel Emelyanov briefly introduced how Parallels systems potentially create hundreds of containers on a system that are all effectively clones of a template. In this case, it is preferred that the file cache be shared between containers to limit the memory usage so as to maximize the number of containers that can be supported. In the past, they used a unionfs approach but as the number of containers increased so did the response time. This was not a linear increase and could be severe on loaded machines. If reclaim kicked in, then performance would collapse.
Their proposal is to extend cleancache to store the template files and share them between containers. Functionally this is de-duplication and, superficially, kernel samepage merging (KSM) would suit their requirements. However, there were a large number of reasons why KSM was not suitable, primarily because it would not be of reliable benefit but also because it would not work for file pages.
Dan Magenheimer pointed out that Xen de-duplicates data through use of a backend to cleancache and that they should create a new backend instead of extending cleancache which would be cleaner. It was suggested that when they submit the patches that they be very clear why KSM is not suitable to avoid the patches being dismissed by the casual observer.
Pavel Emelyanov talked about a project he started about six months ago to address some of the issues encountered by previous checkpoint implementations, mostly by trying to move it into user space. This was not without issue because there is still some assistance needed from the kernel. For example, kernel assistance was required to figure out if a page is really shared or not. A second issue mentioned was that given a UNIX socket, it cannot be discovered from userspace what its peer is.
They currently have two major issues. The first is with "stable memory management". Applications create big mappings but do not access every single page in them, and writing the full VMA to a disk file is a waste of time and space, so they need to discover which pages have been touched. There is a system call for memory residency (mincore()), but it cannot identify that an address is valid but swapped out, for example. For private mappings, it cannot distinguish between a copy-on-write (COW) page and one that is based on what is on disk. kpagemap also gives insufficient information because information such as the virtual address to page frame number (PFN) mapping is missing.
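A quick illustration of the residency call in question: mincore() reports only whether each page is resident in RAM, so a valid-but-swapped page and a never-touched page look the same, which is exactly the information gap described above. A minimal sketch:

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        size_t len = 16 * page;
        unsigned char *vec = malloc(len / page);
        char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (map == MAP_FAILED || vec == NULL)
            return 1;

        map[0] = 1;                     /* touch only the first page */
        if (mincore(map, len, vec) == 0) {
            /* every untouched (or swapped-out) page reads as "not resident" */
            for (size_t i = 0; i < len / page; i++)
                printf("page %zu: %s\n", i,
                       (vec[i] & 1) ? "resident" : "not resident");
        }
        return 0;
    }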
The second major problem is that, if an inode is being watched with inotify, extracting exact information about the watched inode is difficult. James Bottomley suggested using a debugfs interface. A second proposal was to extend the /proc interface in some manner. The audience in the room was insufficiently familiar with the issue to give full feedback so the suggestion was just to extend /proc in some manner, post the patch and see what falls out as people analyze the problem more closely. There was some surprise from Bottomley that people would suggest extending /proc but for the purpose of discussion it would not cause any harm.
Roland Dreier began by noting that people writing block drivers have only two choices: A full request-based driver, or using make_request(). The former is far too heavyweight with a single very hot lock (the queue lock) and a full-fledged elevator. The latter is way too low down in the stack and bypasses many of the useful block functions, so Dreier wanted a third way that takes the best of both. Jens Axboe proposed using his multi-queue work which essentially makes the block queue per-CPU (and thus lockless) coupled with a lightweight elevator. Axboe has been sitting on these patches for a while but promised to dust them off and submit them. Dreier agreed this would probably be fine for his purposes.
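Roughly, the two existing registration paths look like this; a sketch against the block-layer interfaces as they stood in kernels of that era, not a complete driver (gendisk setup, request handling, and error paths are all omitted):

    #include <linux/blkdev.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(my_lock);

    /* Choice 1: a full request-based driver, with the queue lock and elevator. */
    static void my_request_fn(struct request_queue *q)
    {
        /* fetch requests with blk_fetch_request() and complete them ... */
    }

    /* Choice 2: bypass most of the block layer with a make_request function. */
    static void my_make_request(struct request_queue *q, struct bio *bio)
    {
        /* handle the bio directly ... */
        bio_endio(bio, 0);
    }

    static int setup_queues(void)
    {
        struct request_queue *rq  = blk_init_queue(my_request_fn, &my_lock);
        struct request_queue *mrq = blk_alloc_queue(GFP_KERNEL);

        if (mrq)
            blk_queue_make_request(mrq, my_make_request);
        return (rq && mrq) ? 0 : -ENOMEM;
    }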
Shyam Iyer previewed Dell's vision for where NVMe (Non-Volatile Memory Express - basically PCIe cards with fast flash on them) is going. Currently the interface is disk-like, with all the semantics and overhead that implies, but ultimately Dell sees the device as having a pure memory interface using apertures over the PCIe bus. Many people in the room pointed out that, while a memory-mapped interface may be appealing from the speed point of view, it wouldn't work if the device still had the error characteristics of a disk, because error handling in the memory space is much less forgiving. Memory doesn't do any software error recovery and every failure to deliver data instantly is a hard failure resulting in a machine check, so the device would have to do all recovery itself and only signal a failure to deliver data as a last resort.
Frederick Knight began by previewing the current T10 thoughts on handling shingled drives: devices which vastly increase storage density by overlapping disk tracks. They can increase storage radically, but at the expense of having to write a band at a time (a band is a set of overlapping [shingled] tracks). The committee has three thoughts on how to handle them.
The room quickly decided that only the first and last of those were viable options, so the slides on the new banding commands were skipped.
In the possible hint-based architecture, there would be static and dynamic hints. Static hints would be device-to-host signalling indicating geometry preferences by LBA range, while dynamic hints would go from the host to the device, indicating the data characteristics of a write, which would allow the device to do more intelligent placement.
It was also pointed out that shingled drives have very similar characteristics to SSDs if you consider a band to be equivalent to an erase block.
The problem with the dynamic hinting architecture is that the proposal would repurpose the current group field in the WRITE command to contain the hint, but there would only be six bits available. Unfortunately, virtually every member of the SCSI committee has their own idea about what should be hinted (all the way from sequential vs. random on a 32-level sliding scale, to write and read frequency and latency, to boot-time preload, ...) and this led to orders of magnitude more hints than fit into six bits, so the hint would be an index into a mode page which described what it means in detail. The room pointed out unanimously that the massive complexity in the description of the hints meant that we would never have any real hope of using them correctly, since not even device manufacturers would agree on exactly what they wanted. Martin Petersen proposed identifying a simple set of five or so hints and forcing at least array vendors to adhere to them when the LUN was in Linux mode.
Lukáš Czerner gave a description of the current state of his storage manager command-line tool which, apart from having some difficulty creating XFS volumes, was working nicely and should take a lot of the annoying administrative complexity out of creating LVM volumes for mounted devices.
Martin Petersen began by lamenting that in the ATA TRIM command, T13 only left two bytes for the trim range, meaning that, with one sector of ranges, we could trim at most 32MB of disk in one operation. The other problem is that the current architecture of the block layer only allows us to trim contiguous ranges. Since TRIM is unqueued and filesystems can only send single ranges inline, trimming is currently a huge performance hit. Christoph Hellwig had constructed a prototype with XFS which showed that if we could do multi-range trims inline, performance could come back to within 1% of what it was without sending trim.
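That 32MB figure falls straight out of the two-byte sector count in each trim range (assuming 512-byte sectors):

    65,536 sectors * 512 bytes = 33,554,432 bytes = 32 MiB per range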
Discussion then focused on what had to happen to the block layer to send multi-range commands (it was pointed out that it isn't just trim: scatter/gather SCSI commands with multiple ranges are also on the horizon). Jens Axboe initially favored the idea of allowing a single BIO to carry multiple ranges, whereas Petersen had a prototype using linked BIOs for the range. After discussion it was decided that linked BIOs was a better way forward for the initial prototype.
SR-IOV (Single Root I/O Virtualization) is designed to take the hypervisor out of storage virtualization by allowing a guest to have a physical presence on the storage fabric. The specific problem is that each guest needs a world wide name (WWN) as its unique address on the fabric. It was agreed that we could use some extended host interface for setting WWNs, but that we shouldn't expose this to the guest. The other thought was around the naming of virtual functions when they attach to hosts. In the network world, virtual function (vf) network devices appear as eth<phys>-<virt>, so should we do the same for SCSI? The answer, absent any good justification for this naming scheme, was a categorical "hell no."
The final problem discussed was that when the vf is created in the host, the driver automatically binds to it, so it has to be unbound before passing the virtual function to the guest. Hannes Reinecke pointed out that binding could simply be prevented using the standard sysfs interfaces. James Bottomley would prefer that the driver simply refuse to bind to vf devices in the host.
Robert Love noted that the first iteration of Fibre Channel attributes was out for review. All feedback from Greg Kroah-Hartman has been incorporated so he asked for others to look at it (Bottomley said he'd get round to it now that Kroah-Hartman is happy).
How should we report "unit attentions" (UAs - basically SCSI errors reported by storage devices) to user space? Three choices were proposed.
There was a lot of discussion, but it was agreed that the in-kernel handling should be done by a notifier chain to which drivers and other entities could subscribe, and the push to user space would happen at the other end, probably via either netlink or structured logging.
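A hypothetical sketch of what the in-kernel side could look like; the notifier-chain API shown is a real kernel facility, but the chain name, event code, and payload here are invented for illustration:

    #include <linux/notifier.h>

    /* hypothetical chain to which drivers and other entities subscribe */
    static ATOMIC_NOTIFIER_HEAD(scsi_ua_notifier_list);

    /* a subscriber inspects the UA and could forward it via netlink */
    static int my_ua_event(struct notifier_block *nb, unsigned long code,
                           void *data)
    {
        return NOTIFY_OK;
    }

    static struct notifier_block my_ua_nb = {
        .notifier_call = my_ua_event,
    };

    static void subscribe(void)
    {
        atomic_notifier_chain_register(&scsi_ua_notifier_list, &my_ua_nb);
    }

    /* the midlayer would fire the chain when a unit attention arrives */
    static void report_ua(void *sense_data)
    {
        atomic_notifier_call_chain(&scsi_ua_notifier_list, 0, sense_data);
    }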
[ I would like to thank LWN subscribers for funding that allowed me to attend the summit. Big thanks are also due to Mel Gorman and James Bottomley for their major contributions to the summit coverage. ]
Thanks to Alasdair Kergon for making his photograph of the 2012 Linux Storage, Filesystem, and Memory Management summit available.
Flash and small modifications
Posted Apr 5, 2012 9:12 UTC (Thu) by dlang (subscriber, #313) [Link]
There are many cases where filesystem modifications could take advantage of this capability to modify blocks in place instead of copying the blocks to write the modified versions (which could significantly reduce the write amplification issue)
Two examples that jump to mind are:
sequential writes (i.e. log files): if an append-only file ends in the middle of a block, that block should be able to be modified rather than copied when additional data gets written
journal files: the journal wants to record that a transaction is pending, and then later indicate that the transaction has been completed (with durable writes at each stage). This is either a sequential-write problem (if the journal has separate entries for transaction start and end), or it's a case of wanting to flip a bit in the existing entry to indicate that it has been completed. In either case, some ability to modify an existing block in specific ways would allow you to avoid having to copy the entire erase block because a small change took place.
Checksums (including RAID) make these sorts of changes impossible unless you can also modify the checksum, but this could be addressed either by checksumming smaller chunks (and leaving some chunks blank so that a chunk and its checksum can be written later), or by allowing there to be several possible checksums for a chunk and using the most recent one.
In any case, organizing flash with checksums to work this way will be a less efficient use of the flash cells than the current mode, but it would be similar to the idea being floated of there being different classes of storage ('high durability' vs 'low durability'), except this doesn't actually require different types of flash; it could be implemented via different allocation of the bits of the existing flash.
For example, instead of storing 2M of user data per flash page, it could store 1.5M of user data, but have space for many different versions of the checksums so that the user data could be modified many times without having to copy the pages.
The filesystem would need to be tweaked to make use of this (except possibly in the append-only case), because it would need to modify the bits the 'right' way in the 'right' alignment for this to work, and given the space efficiency vs modification efficiency, it would probably be wise to have the filesystem tell the block device that it wanted to have data stored in the modification efficient mode.
Flash and small modifications
Posted Apr 6, 2012 10:52 UTC (Fri) by valyala (guest, #41196) [Link]
I believe sequential writes are already handled optimally by flash firmware. All writes to flash can be easily buffered in on-board RAM and merged into block-sized writes before hitting the flash. Of course, such RAM must be backed by a power supply (a capacitor), which will allow flushing RAM contents to flash in the event of power loss.
Flash and small modifications
Posted Apr 6, 2012 18:07 UTC (Fri) by dlang (subscriber, #313) [Link]
Unless the device includes an on-board battery with enough power to write the contents of the RAM to flash when it loses power, and I am not aware of any SSD drives that do this.
Flash and small modifications
Posted Apr 6, 2012 18:09 UTC (Fri) by dlang (subscriber, #313) [Link]
Think of a log file that gets a new line written to it every minute or two. If the average log line is ~1/4K (which is what I measured my logs to be), then you have about 2K log entries per block, and so will re-write the block 2K times before filling it up.
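For a rough sense of the numbers (the 512KiB erase block size is my assumption, not something stated above):

    512 KiB block / ~256-byte log line ≈ 2048 entries, i.e. ~2K rewrites of the block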
Flash and small modifications
Posted Apr 6, 2012 21:08 UTC (Fri) by jzbiciak (subscriber, #5246) [Link]
I'll use "set" and "erase" to describe the operations on bits below. In typical flash, "erase" clears the bit to 1, and "set" deposits a charge and causes it to read as 0. Or, you can flip those around with "erase" leaving a 0 and "set" leaving a 1. Either way, the principle holds, and is the same principle you were highlighting.
The flash I've worked with puts an upper bound on the number of times you can set the same bit without an intervening erase cycle. This makes a certain amount of sense to me. The process of flipping a bit from erased to set deposits electrons onto an otherwise-floating transistor gate. Setting the same bit more than once could threaten to break down that gate prematurely by depositing too much charge. At least, I guess that's the reason behind the admonition to not re-program a given word more than once.
That said, write blocks tend to be much smaller than erase blocks, so that does give you some flexibility. So, while a given bit can only be "set" a certain number of times, the hardware generally gives you some granularity to avoid setting it too often.
Anyway, that constraint can limit the number of times you can "modify in place" a given block, and limit the ways in which that modification could be carried out. To me, the most interesting scenario is the "append to log" scenario. This one seems most compatible with the underlying physics. It does require the filesystem to pad sectors beyond EOF with an appropriate value for rewriting later. For example, if flash needs 1s, then the portion of a sector beyond EOF written by the filesystem needs to be 0xFF.
As far as checksums go, if you go with CRCs, they have a really nice property: They're fully linear codes. The CRC for a given block of data is equal to the XOR of the CRCs for each of the 1 bits in that block if you were to assume the other bits were zero. (Assuming no pre/post inversions.) That is, suppose you wanted to CRC this string: 10010001. Then: CRC(10010001) = CRC(10000000) ^ CRC(00010000) ^ CRC(00000001).
So, you could protect a block of data that's updated in this way by just appending the CRC of the delta to the "CRC list". The reader would then need to XOR all of the provided CRCs. Now, if your flash erases to 1 and sets to 0, then you could store CRCs inverted, with the unused CRCs reading as 0xFFFFFFFF. The reader wouldn't even have to know how many CRCs are valid at that point -- it could just read them all in and XOR them. If there's an even number of potential CRCs (as seems likely), the reader wouldn't even need to apply any inversions.
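A quick way to check that property is with a plain CRC-32 (zero initial value and no final inversion, so the computation is strictly linear over GF(2)); this small standalone program verifies the example above:

    #include <stdint.h>
    #include <stdio.h>

    /* MSB-first CRC-32, polynomial 0x04C11DB7, no pre/post inversion */
    static uint32_t crc32_plain(const uint8_t *buf, size_t len)
    {
        uint32_t crc = 0;
        for (size_t i = 0; i < len; i++) {
            crc ^= (uint32_t)buf[i] << 24;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 0x80000000) ? (crc << 1) ^ 0x04C11DB7 : crc << 1;
        }
        return crc;
    }

    int main(void)
    {
        uint8_t whole[] = { 0x91 };                     /* bits 10010001 */
        uint8_t a[] = { 0x80 }, b[] = { 0x10 }, c[] = { 0x01 };

        uint32_t lhs = crc32_plain(whole, 1);
        uint32_t rhs = crc32_plain(a, 1) ^ crc32_plain(b, 1) ^ crc32_plain(c, 1);

        printf("CRC(10010001) = %08x, XOR of per-bit CRCs = %08x\n",
               (unsigned)lhs, (unsigned)rhs);
        return lhs != rhs;      /* exits 0 when the two values agree */
    }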
Flash and small modifications
Posted Apr 6, 2012 22:08 UTC (Fri) by dlang (subscriber, #313) [Link]
But overall I think that we are talking about basically the same thing.
I first ran into this sort of concept many years ago when digital recording first hit the consumer market (if you can remember when the first musical Christmas cards and similar appeared, that timeframe). The way those recorders worked was that they used an EEPROM chip and recorded an analog value (~8 bits worth) in each bit of the EEPROM by hitting it with a smaller-than-spec programming charge repeatedly until the stored value matched the desired analog value. When doing this programming you would 'program' the chip with 00000001 until the first bit got to the desired value, then program it with 00000010 until the second bit got to the desired value, etc.; you would never send a programming signal to a bit that you had already set.
Multi-level flash uses a similar trick to store four levels of signal per cell (it calls these '00', '01', '10', and '11' in the digital world), so the 'partial modification' that I am talking about would require support from the flash chipset to allow it to not try to reprogram a cell that has already been set, but that is a pretty trivial thing to do.
Flash and small modifications
Posted Apr 12, 2012 16:31 UTC (Thu) by wookey (subscriber, #5501) [Link]
So whilst your theory is fine, in practice this hasn't been any use for years.
If I'd been in the audience I'd have been in the 'give us raw access' group, or at least asking for very direct control over what 'optimisations' the device will do itself. My experience is that Linux filesystem writers can do a much better job of getting this right than flash vendors, who really don't have the same optimisation parameters as us at all. Mostly their efforts to do things for us have produced a lot of shitty (slow, unreliable) flash in SD cards.
On the other hand it is clear that some things are better done on the device (checksumming/ECC for a start), and potentially some other stuff (the way modern disk drives by-and-large do a reasonable job internally without messing things up).
But if they just told us the block sizes, that would be a good start. The five years this hasn't been happening have been a terrible waste. I bet we have legal departments and patents to thank for that, as well as a general lack of caring about anything other than 'FAT in cameras'.
Flash and small modifications
Posted Apr 12, 2012 19:31 UTC (Thu) by dlang (subscriber, #313) [Link]
The question is if this would help enough to matter.
There are two areas I see it potentially helping:
1. longer lifetime due to reduced wear by not having to copy the blocks as much
2. better performance by being able to update the blocks and therefore avoid the need to do as many slow erase cycles (the device tries to do these in the background, but if there are enough writes on a full enough device, it has to wait for them)