
LinuxCon: Kernel roundtable covers more than just bloat

By Jake Edge
September 30, 2009

If you have already heard about the kernel roundtable at LinuxCon, it is likely due to Linus Torvalds's statement that the kernel is "huge and bloated". While much of the media focused on that soundbite, there was quite a bit more to the panel session. For one thing, Torvalds definitely validated the impression that the development process is working better than it ever has, which has made his job an "absolute pleasure" over the last few months. In addition, many other topics were discussed, from Torvalds's motivations to the lessons learned in the 2.6 development series—as well as a bit about bloat.

[Roundtable]

The panel consisted of five kernel developers: Torvalds, Greg Kroah-Hartman of Novell, Chris Wright of Red Hat, Jonathan Corbet of LWN, and Ted Ts'o of IBM (and CTO of the Linux Foundation) sitting in for Arjan van de Ven, who got held up in the Netherlands due to visa problems. James Bottomley of Novell moderated the panel and set out to establish the ground rules by noting that he wanted to "do as little work as possible", so he wanted questions from the audience, in particular those that would require answers from Torvalds as "he is sitting up here hoping to answer as little as possible". Bottomley was reasonably successful in getting audience questions, but moderating the panel probably took a bit more effort than he claimed to be looking for.

Innovative features

Bottomley began with a question about the "most innovative feature" that went into the kernel in the last year. Wright noted that he had a "virtualization slant", so he pointed to the work done to improve "Linux as a hypervisor", including memory management improvements that will allow running more virtual machines more efficiently under Linux. Corbet and Ts'o both pointed to the ftrace and performance counters facilities that have been recently added. Tracing and performance monitoring have both been attacked in various ways over the years, without getting into the mainline, but it is interesting to see someone approach "the problem from a different direction, and then things take off", Corbet said.
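
As a rough illustration of the performance counters facility mentioned above, the following minimal sketch counts retired instructions for a small loop via the perf_event_open() system call, assuming a 2.6.32-era (or newer) kernel and matching headers; error handling is kept to a minimum.

    /*
     * Minimal sketch: count retired instructions for a small loop using
     * perf_event_open().  Assumes a kernel and headers with perf event
     * support (the interface was renamed from "perf counters" to "perf
     * events" around 2.6.32).
     */
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        volatile unsigned long i, sink = 0;
        uint64_t count;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        /* Count events for this task, on any CPU. */
        fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0) {
            perror("perf_event_open");
            return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        for (i = 0; i < 1000000; i++)
            sink += i;              /* the work being measured */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        if (read(fd, &count, sizeof(count)) == (ssize_t)sizeof(count))
            printf("instructions: %llu\n", (unsigned long long)count);

        close(fd);
        return 0;
    }

Swapping attr.config for another event, such as PERF_COUNT_HW_CACHE_MISSES, counts something else with the same scaffolding.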

Bottomley altered the question somewhat for Kroah-Hartman, enquiring about the best thing that had come out of the staging tree that Kroah-Hartman maintains. That seemed to stump him momentarily, so he mentioned the USB 3.0 drivers as an innovative feature added to the kernel recently, noting that Linux is the first OS to have a driver for that bus, when hardware using it is still not available to buy: "It's pretty impressive". After a moment's thought, though, Kroah-Hartman pointed out that he had gotten Torvalds's laptop to work by using a wireless driver from the staging tree, which completely justified that tree's existence.

Ts'o also noted the kernel mode setting (KMS) support for graphics devices as another innovative feature, pointing out that "it means that the X server no longer has to run as root—what a concept". He also suggested that it made things easier for users, who could potentially get kernel error messages in the event of a system hang without having to hook up a serial console.

Making it easy for Linus

Torvalds took a "different tack" on the question, noting that he was quite pleased with "how much easier my job has been getting in the last few months". He said that it is a feature that is not visible to users but it is the feature that is most important to him, and that, in the end, "it improves, hopefully, the kernel in every area".

Because subsystem maintainers have focused on making it "easy for Linus" by keeping their trees in a more mergeable state, Torvalds has had more time to get involved in other areas. He can participate in more threads on linux-kernel and "sometimes fix bugs too". He clearly is enjoying that, especially because "I don't spend all my time just hating people that are sending merge requests that are hard to merge".

Over the last two merge windows (including the just-completed 2.6.32 window), things have been going much more smoothly. Smooth merges mean that Torvalds gets a "happy feeling inside that I know what I am merging — whether it works or not [is a] different issue". In order to know what he is merging, Torvalds depends on documentation and commit messages in the trees that outline what the feature is, as well as why people want it. His comfort that the code will actually work rests on trusting the person whose tree he is merging to "fix up his problems afterwards".

Motivation

The first question from the audience was directed at Torvalds's motivation, both in the past and in the future. According to Torvalds, his motivation for working on the kernel has changed a lot over the years. It started with an interest in low-level programming that interacted directly with the hardware, but has slowly morphed into working with the community, though "I shouldn't say 'the community', because when anyone else says 'the community', my hackles rise [...] there's no one community". It is the social aspect of working with other people on the kernel project that is his main motivation today, part of which is that "I really enjoy arguing".

Torvalds's technical itch has already been scratched, so other things keep him going now: "All of my technical problems were solved so long ago that I don't even care [...] I do it because it's interesting and I feel like I am doing something worthwhile". He doesn't see that changing over the next 5-10 years, so, while he wouldn't predict the future, there is a clear sense that things will continue as they are—at least in that time frame.

Malicious code

Another question from the audience was about the increasing rate of kernel contributions and whether that made it harder to keep out malicious code from people with bad intentions. Kroah-Hartman said that it is hard to say what is malicious code versus just a bug, because "bugs are bugs". He said he doesn't remember any recent attempts to intentionally introduce malicious code.

Torvalds pointed out that the problem has never been people intentionally doing something bad, but, instead, trying to do something good and unintentionally ending up causing a security hole or other bug. He did note an attempt to introduce a back door into the kernel via the BitKeeper repository 7-8 years ago which "was caught by BitKeeper with checksums, because they [the attackers] weren't very good at it". While that is the only case he is aware of, "the really successful ones we wouldn't know about". One of Git's design goals was to keep things completely decentralized and to cryptographically sign all of the objects so that a compromise of a public git server would be immediately recognized, because it didn't match others' private trees, he said.
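
To make the detection argument concrete: every object ID is a hash covering the object's content and, for commits, its parents, so rewriting anything on a public server changes every descendant ID and no longer matches the clones people already hold. The toy sketch below uses a simple FNV-1a hash purely as a stand-in for Git's real SHA-1 object hashing, just to keep the example self-contained.

    /*
     * Toy illustration of content-addressed history.  Git really uses
     * SHA-1 over typed object payloads; the FNV-1a hash here is only a
     * stand-in so the sketch stays self-contained.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint64_t fnv1a(const void *data, size_t len, uint64_t h)
    {
        const unsigned char *p = data;

        while (len--) {
            h ^= *p++;
            h *= 1099511628211ULL;          /* 64-bit FNV prime */
        }
        return h;
    }

    /* A commit's ID covers both its content and its parent's ID. */
    static uint64_t commit_id(const char *content, uint64_t parent_id)
    {
        uint64_t h = fnv1a(content, strlen(content), 14695981039346656037ULL);

        return fnv1a(&parent_id, sizeof(parent_id), h);
    }

    int main(void)
    {
        /* A small, honest history... */
        uint64_t a = commit_id("add scheduler tweak", 0);
        uint64_t b = commit_id("fix typo in comment", a);
        uint64_t c = commit_id("update docs", b);

        /* ...and the same history after someone alters the first commit
         * on a public server. */
        uint64_t a2 = commit_id("add scheduler tweak + back door", 0);
        uint64_t b2 = commit_id("fix typo in comment", a2);
        uint64_t c2 = commit_id("update docs", b2);

        printf("original head ID: %016llx\n", (unsigned long long)c);
        printf("tampered head ID: %016llx\n", (unsigned long long)c2);
        printf("heads match: %s\n", c == c2 ? "yes" : "no");
        return 0;
    }

In real Git, the same property means that verifying a single head ID (via a signed tag, say) effectively vouches for the entire history behind it.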

Performance regressions

Bottomley then turned to performance regressions, stating that Intel had been running a "database benchmark that we can't name" on every kernel release. They have found that the performance drops a couple of percentage points each release, with a cumulative effect over the last ten releases of about 12%. Torvalds responded that the kernel is "getting bloated and huge, yes, it's a problem".

"I'd love to say we have a plan" for fixing that, Torvalds said but it's not the case. Linux is "definitely not the streamlined, small, hyper-efficient kernel that I envisioned 15 years ago"; the kernel has gotten large and "our icache [instruction cache] footprint is scary". The performance regression is "unacceptable, but it's probably also unavoidable" due to the new features that get added with each release.

Audio and storage

In response to a question about professional audio, Torvalds said that the sound subsystem in the kernel was much better than it is given credit for, especially by "crazy" Slashdot commenters who pine for the days of the Open Sound System (OSS). Corbet also noted that audio issues have gotten a lot better, though, due to somewhat conflicting stories from the kernel developers over the years, audio developers "have had a bit of a rough ride".

A question about the need for handling memory failures, both in RAM and flash devices, led Ts'o to note that, based on his experience at a recent storage conference, there is "growing acceptance of the fact that hard disks aren't going away". Hard disks will always be cheaper, so flash will just be another element in the storage hierarchy. The flash hardware itself is better placed to know about and handle failures of its cells, so that is likely to be the place where it is done, he said.

Lessons learned

The lessons learned during the six years of the 2.6 development model was the subject of another question from Bottomley. Kroah-Hartman pointed to the linux-next tree as part of a better kernel development infrastructure that has led to more effective collaboration: "We know now how to work better together". Corbet noted that early 2.6 releases didn't have a merge window, which made stability of those releases suffer. "What we've learned is some discipline", he said.

In comparing notes with the NTFS architect from Microsoft, Ts'o related that the core Windows OS team has a similar development model. "Redmond has independently come up with something almost identical to what we're doing", he said. They do quarterly releases, with a merge period followed by a stabilization period. Microsoft didn't copy the Linux development model, according to the NTFS architect, leading him and Ts'o to theorize that when doing development "on that scale, it's one of the few things that actually works well". That led Bottomley to jokingly suggest a headline: "Microsoft validates Linux development model".

Torvalds also noted that the development model is spreading: "The kernel way of doing things has clearly entered the 'hive mind' when it comes to open source". Other projects have adopted many of the processes and tools that the kernel developers use, including things like the sign-off process that was added in response to the SCO mess. Sign-offs provide a nice mechanism to see how a particular chunk of code reached the mainline, and other projects are finding value in that as well.

Overall, the roundtable gave an interesting view into the thinking of the kernel developers. It was much more candid than a typical marketing-centric view that comes from proprietary OS vendors. Of course, that led to the "bloated" headlines that dominated the coverage of the event, but it also gave the audience an unvarnished look at the kernel. The Linux Foundation and Linux Pro magazine have made a video of the roundtable available—unfortunately only in Flash format—which may be of interest; it certainly was useful in augmenting the author's notes.


LinuxCon: Kernel roundtable covers more than just bloat

Posted Oct 1, 2009 1:02 UTC (Thu) by flewellyn (subscriber, #5047) [Link] (23 responses)

No plan yet on how to deal with the bloat problem? Well, I hope that doesn't remain the case for long.

LinuxCon: Kernel roundtable covers more than just bloat

Posted Oct 1, 2009 4:31 UTC (Thu) by dowdle (subscriber, #659) [Link] (17 responses)

Over the last 10 releases (3 months x 10 = 30 months) the kernel has gotten 12% slower... but it has a ton more features and supports a lot more hardware. Doesn't sound like much of a loss to me. I mean how much better have processors gotten in the same 30 months? I'm guessing they have gotten more than 12% faster so it is obviously a net gain.

That isn't to say I want the kernel to keep getting slower.

If you want the video in a different format, I'd recommend visiting the page, pausing the video and waiting until it is completely buffered. Then you can copy /tmp/Flash{random-characters} to ~ and convert it to whatever format you want. Of course it would be nice to have a higher quality source to convert from but it isn't too bad.

LinuxCon: Kernel roundtable covers more than just bloat

Posted Oct 1, 2009 9:32 UTC (Thu) by gevaerts (subscriber, #21521) [Link] (2 responses)

If you want to download the video, just visit the page and look at the source. The flv file is easily downloadable.

LinuxCon: Kernel roundtable covers more than just bloat

Posted Oct 1, 2009 15:28 UTC (Thu) by Velmont (guest, #46433) [Link] (1 responses)

It's surprising that it doesn't work in Gnash. Many video players do; I should think Linux Magazine would have thought about that!

LinuxCon: Kernel roundtable covers more than just bloat

Posted Oct 2, 2009 2:16 UTC (Fri) by DOT (subscriber, #58786) [Link]

You'd think Linux-related sites would already be providing Theora videos (self-hosted, or through dailymotion or blip.tv). It's not like their target demographic is likely to be running Internet Explorer. ;)

The causes of bloat?

Posted Oct 1, 2009 9:46 UTC (Thu) by alex (subscriber, #1355) [Link] (3 responses)

Listening to this week's FLOSS Weekly, which interviewed Linus, I noted that a lot of the "bloat" comes from features like auditing and security checking. I don't know if it's possible to build a stripped-down kernel without these things in it and see if the performance comes back. Not that I'd want to run such a kernel on a production site though...

The causes of bloat?

Posted Oct 4, 2009 17:26 UTC (Sun) by nevets (subscriber, #11875) [Link] (1 responses)

I've seen this with ftrace traces. Running the function graph tracer, I see that a good amount of time is spent in the SELinux code. The price you pay for security.

One might argue that we've become 12% slower, but > 12% more secure.

How much checking do you need to do?

Posted Oct 5, 2009 10:43 UTC (Mon) by alex (subscriber, #1355) [Link]

I'm all for increasing the security of the kernel. However, I feel the ideal* case the kernel should be striving for is a compare/branch for the check. Does SELinux do any caching of its authentication results?

For example, once you have validated that a process can read a given file descriptor, do you need to re-run the whole capability-checking logic for every sys_read()?

Of course any such caching probably introduces another attack vector so care would have to be taken with the implementation?

*ideal being a target even if you may never actually reach that goal.
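
For illustration, a toy sketch of the kind of decision cache being asked about: remember (subject, object, permission) results so that a repeated check is a lookup rather than a full policy walk. SELinux's real access vector cache (AVC) is far more elaborate, and everything named below is made up.

    /*
     * Toy sketch of a (subject, object, permission) decision cache, so a
     * repeated check becomes a table lookup instead of a full policy
     * evaluation.  SELinux's real access vector cache (AVC) is far more
     * involved (policy reloads, auditing, locking); every name and the
     * toy policy rule below are made up for illustration.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define CACHE_SLOTS 256

    struct cache_entry {
        unsigned int subject;       /* e.g. a task's security ID */
        unsigned int object;        /* e.g. an inode's security ID */
        unsigned int perm;          /* e.g. "read" */
        bool allowed;
        bool valid;
    };

    static struct cache_entry cache[CACHE_SLOTS];

    /* Stand-in for the expensive full policy walk. */
    static bool policy_check(unsigned int subject, unsigned int object,
                             unsigned int perm)
    {
        printf("  (slow path: evaluating policy)\n");
        return (subject ^ object ^ perm) % 3 != 0;  /* arbitrary toy rule */
    }

    static bool may_access(unsigned int subject, unsigned int object,
                           unsigned int perm)
    {
        unsigned int slot = (subject * 31 + object * 7 + perm) % CACHE_SLOTS;
        struct cache_entry *e = &cache[slot];

        if (e->valid && e->subject == subject &&
            e->object == object && e->perm == perm)
            return e->allowed;                      /* fast path */

        e->subject = subject;
        e->object = object;
        e->perm = perm;
        e->allowed = policy_check(subject, object, perm);
        e->valid = true;
        return e->allowed;
    }

    int main(void)
    {
        /* The first check misses the cache; the repeated ones do not. */
        for (int i = 0; i < 3; i++)
            printf("read allowed: %d\n", may_access(42, 1001, 4));
        return 0;
    }

Invalidating such a cache correctly (on policy changes, for instance) is exactly where the attack-vector worry above comes in.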

The causes of bloat?

Posted Oct 8, 2009 7:01 UTC (Thu) by kragil (guest, #34373) [Link]

Yeah, it was clear that Linus wasn't happy about how the "huge and bloated" thing was the only part nearly all the media concentrated on (leaving out the "unacceptable but unavoidable" part).

In the interview he said that "at least Linux isn't this fat ugly pig that should have been shot 15 years ago"

I'd like to think that Linus is so bright that the bloat statement was intentional to get the kernel community working on a solution (don't tell me there isn't one that is way too easy), but he probably does not have these mad Sun Tzu communication skillz.

Maybe next time add the pig comment to put things into perspective for the media?

LinuxCon: Kernel roundtable covers more than just bloat

Posted Oct 1, 2009 10:26 UTC (Thu) by job (guest, #670) [Link]

Considering what 5% extra performance costs in hardware in the mainstream segment, running a year old kernel would give you the same benefit (if you don't care about the newest features). That's not a good situation!

Regarding the video, you can wget this link and feed it to mplayer/xine if you have a recent ffmpeg with VP6 installed.

LinuxCon: Kernel roundtable covers more than just bloat

Posted Oct 2, 2009 18:09 UTC (Fri) by simonl (guest, #13603) [Link] (8 responses)

The Moore's law reference stinks. It is a typical commercial vendor apologist view. Sorry, our software sucks, we know, but just buy faster hw and make up for our crap.

When you argue that way, you take no pride in your code. And pride is what built this kernel.

Maybe the kernel devs have become too employed. Too focused on solving customers' immediate needs, and having too little time to go hunt down whatever catches their attention, big or small, with no prospects to please a project manager.

But look what Apple just did in their latest release: Nothing spectacular, except cleaning up. Someone has had the guts to nack new features and focus on removal.

LinuxCon: Kernel roundtable covers more than just bloat

Posted Oct 4, 2009 9:41 UTC (Sun) by Los__D (guest, #15263) [Link] (4 responses)

Features cost performance. Drops for no reason are, of course, a problem, but tradeoffs will need to be made sometimes.

You could have a LIGHTNING FAST DOS box today. What good would it do you?

Features no excuse

Posted Oct 5, 2009 10:34 UTC (Mon) by eru (subscriber, #2753) [Link] (3 responses)

Features cost performance.

But if you don't use the features, you should not need to pay the price! If I understand correctly, the slowdown has been seen in repeatable benchmarks that can be run on both old and new kernel versions. Therefore the benchmarked code certainly isn't using any new features, but it still gets slowed down. Not justifiable.

You could have a LIGHTNING FAST DOS box today. What good would it do you?

Bad comparison. MS-DOS always had severe problems that really did not have much to do with its small footprint. It was brain-damaged already on day one. An OS that does more or less what MS-DOS did, but in a sensible and stable way might still be useful.

Features no excuse

Posted Oct 8, 2009 8:20 UTC (Thu) by renox (guest, #23785) [Link] (1 responses)

[[But if you don't use the features, you should not need to pay the price!
If I understand correctly, the slowdown has been seen in repeatable benchmarks that can be run on both old and new kernel versions. Therefore the benchmarked code certainly isn't using any new features, but it still gets slowed down. Not justifiable.]]

Linus referred to the icache footprint (size) of the kernel: if you add features, even when they are not used they increase the size of the generated code, so they reduce performance.
Sure, if you have an option to remove the code from the kernel at compile time, then this issue shouldn't happen. So which configuration did Intel benchmark?

Without specific figures it's difficult to know where the issue is; I wouldn't be surprised if SELinux or virtualisation were the culprits: these features seem quite invasive.

Features no excuse

Posted Oct 8, 2009 9:03 UTC (Thu) by dlang (guest, #313) [Link]

Adding SELinux definitely slows things down; if you run a benchmark on a kernel compiled with SELinux, you will get lower results than if you run the same benchmark on the same kernel without SELinux.

So to not use the SELinux feature, you would compile a kernel without it.

The same thing goes for many features: turning them on at compile time increases the cache footprint and therefore slows the system, even if you don't use the feature. But you (usually) do have the option to not compile the code into the kernel to really avoid their runtime cost.

Features no excuse

Posted Oct 8, 2009 9:10 UTC (Thu) by bersl2 (guest, #34928) [Link]

But if you don't use the features, you should not need to pay the price! If I understand correctly, the slowdown has been seen in repeatable benchmarks that can be run on both old and new kernel versions. Therefore the benchmarked code certainly isn't using any new features, but it still gets slowed down. Not justifiable.

Then configure out what you don't want already.

Really, you think going with your distro's generic kernel is efficient? It doesn't take very long to find /proc/config* and take out some of the above-mentioned features that can't be modular.

That, or yell at your distro, for the little good that will do.

LinuxCon: Kernel roundtable covers more than just bloat

Posted Oct 4, 2009 12:04 UTC (Sun) by fuhchee (guest, #40059) [Link]

"... Too focused on solving customers' immediate needs, and having too little time to go hunt down whatever catches their attention ..."

I suspect it's the other way around. Many customers care deeply about performance, and it is their vendors who must perform code-karate against this kind of "bloat" (slowdown). To justify each new thing, LKML rarely carries data beyond microbenchmarks.

LinuxCon: Kernel roundtable covers more than just bloat

Posted Oct 5, 2009 8:11 UTC (Mon) by cmccabe (guest, #60281) [Link]

> The Moore's law reference stinks. It is a typical commercial vendor
> apologist view. Sorry, our software sucks, we know, but just buy faster hw
> and make up for our crap.
>
> When you argue that way, you take no pride in your code. And pride is what
> built this kernel.

Well, maybe, you can have clean code that runs 12% slower, or you can have code that's #ifdef'ed to hell that runs at the old speed. In that case, which would you rather have?

Keep in mind, if you choose route #2, people in 2015 might use your name as a curse...

Obviously this is an oversimplification. But still, the point remains: don't criticize the code until you've seen it and understand the tradeoffs.

pride comes before a fall

Posted Oct 14, 2009 9:52 UTC (Wed) by gvy (guest, #11981) [Link]

> And pride is what built this kernel.
I hope *not*.

LinuxCon: Kernel roundtable covers more than just bloat

Posted Oct 1, 2009 16:20 UTC (Thu) by smoogen (subscriber, #97) [Link] (2 responses)

I think the issue is defining 'bloat'. Most of the features that the unnamed priestess of Delphi has are also features that the various people running it have wanted, if for no other reason than all the business-case rules that require them to be audited. Now, for ye olde hacker in the basement, those are things that aren't needed or wanted. It's going to be up to those hackers, in the end, to dive in and fix or remove the bloat to see if the performance can be regained. My guess, though, is that for mainstream Linux it will mostly be there for some time.

LinuxCon: Kernel roundtable covers more than just bloat

Posted Oct 1, 2009 16:43 UTC (Thu) by flewellyn (subscriber, #5047) [Link] (1 responses)

That's a good point: there's no rule that says you have to run a distro-built kernel, even in a production environment. A customized kernel with unneeded features trimmed may help matters. (For instance, I have no need for most of the drivers or filesystems.)

LinuxCon: Kernel roundtable covers more than just bloat

Posted Oct 2, 2009 0:25 UTC (Fri) by giraffedata (guest, #1954) [Link]

Linus was talking about some kind of bloat that makes the kernel slower. I don't think removing device driver or filesystem driver modules from the disk or configuring out most of the configurable features speeds things up.

LinuxCon: Kernel roundtable covers more than just bloat

Posted Oct 2, 2009 4:48 UTC (Fri) by karthik_s1 (guest, #60525) [Link] (1 responses)

make linux a micro kernel :-)

macro overhead

Posted Oct 14, 2009 10:03 UTC (Wed) by gvy (guest, #11981) [Link]

> make linux a micro kernel :-)
Oh yeah, that would make those pesky 12% a non-issue. :]

LinuxCon: Kernel roundtable covers more than just bloat

Posted Oct 1, 2009 16:39 UTC (Thu) by josh (subscriber, #17465) [Link]

Direct link to the Linux kernel roundtable video.

SSD

Posted Oct 2, 2009 9:31 UTC (Fri) by dwmw2 (subscriber, #2063) [Link] (4 responses)

"The flash hardware itself is better placed to know about and handle failures of its cells, so that is likely to be the place where it is done, he said."
I was biting my tongue when he said that, so I didn't get up and heckle.

I think it's the wrong approach. It was all very well letting "intelligent" drives remap individual sectors underneath us so that we didn't have to worry about bad sectors or C-H-S and interleaving. But what the flash drives have to do to present a "disk" interface is much more than that; it's wrong to think that the same lessons apply here.

What the SSD does internally is a file system all of its own, commonly called a "translation layer". We then end up putting our own file system (ext4, btrfs, etc.) on top of that underlying file system.

Do you want to trust your data to a closed source file system implementation which you can't debug, can't improve and — most scarily — can't even fsck when it goes wrong, because you don't have direct access to the underlying medium?

I don't, certainly. The last two times I tried to install Linux to a SATA SSD, the disk was corrupted by the time I booted into the new system for the first time. The 'black box' model meant that there was no chance to recover — all I could do with the dead devices was throw them away, along with their entire contents.

File systems take a long time to get to maturity. And these translation layers aren't any different. We've been seeing for a long time that they are completely unreliable, although newer models are supposed to be somewhat better. But still, shipping them in a black box with no way for users to fix them or recover lost data is a bad idea.

That's just the reliability angle; there are also efficiency concerns with the filesystem-on-filesystem model. Flash is divided into "eraseblocks" of typically 128KiB or so. And getting larger as devices get larger. You can write in smaller chunks (typically 512 bytes or 2KiB, but also getting larger), but you can't just overwrite things as you desire. Each eraseblock is a bit like an Etch-A-Sketch. Once you've done your drawing, you can't just change bits of it; you have to wipe the whole block.

Our flash will fill up as we use it, and some of the data on the flash will still be relevant. Other parts will have been rendered obsolete; replaced by other data or just deleted files that aren't relevant any more. Before our flash fills up completely, we need to recover some of the space taken by obsolete data. We pick an eraseblock, write out new copies of the data which are still valid, and then we can erase the selected block and re-use it. This process is called garbage collection.
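
As a much-simplified sketch of that garbage collection: the toy code below models a few eraseblocks whose pages are free, valid, or obsolete, picks the block with the most obsolete pages, copies the valid ones forward into a spare block, and erases the victim. Real translation layers and flash file systems also track logical-to-physical mappings, wear levelling, and ECC, none of which appears here.

    /*
     * Toy model of eraseblock garbage collection.  Only page states are
     * tracked, not data; everything here is a simplification for
     * illustration.
     */
    #include <stdio.h>

    #define NBLOCKS         4
    #define PAGES_PER_BLOCK 8

    enum page_state { FREE, VALID, OBSOLETE };

    static enum page_state flash[NBLOCKS][PAGES_PER_BLOCK];

    /* The file system knows this data is dead; in the SSD model the drive
     * only learns that through a hint such as TRIM (discussed below). */
    static void mark_obsolete(int block, int page)
    {
        flash[block][page] = OBSOLETE;
    }

    /* Pick the eraseblock with the most obsolete pages: cheapest to reclaim. */
    static int pick_victim(void)
    {
        int best = 0, best_obsolete = -1;

        for (int b = 0; b < NBLOCKS; b++) {
            int obsolete = 0;

            for (int p = 0; p < PAGES_PER_BLOCK; p++)
                if (flash[b][p] == OBSOLETE)
                    obsolete++;
            if (obsolete > best_obsolete) {
                best_obsolete = obsolete;
                best = b;
            }
        }
        return best;
    }

    /* Copy still-valid pages into a spare block, then erase the whole victim. */
    static void garbage_collect(int victim, int spare)
    {
        int copied = 0;

        for (int p = 0; p < PAGES_PER_BLOCK; p++) {
            if (flash[victim][p] == VALID)
                flash[spare][copied++] = VALID;
            flash[victim][p] = FREE;        /* eraseblock-sized erase */
        }
        printf("GC: erased block %d, copied %d valid pages to block %d\n",
               victim, copied, spare);
    }

    int main(void)
    {
        /* Fill blocks 0-2 with data; block 3 is kept as the spare. */
        for (int b = 0; b < 3; b++)
            for (int p = 0; p < PAGES_PER_BLOCK; p++)
                flash[b][p] = VALID;

        /* Delete most of the data that happened to live in block 1. */
        for (int p = 0; p < 6; p++)
            mark_obsolete(1, p);

        garbage_collect(pick_victim(), 3);
        return 0;
    }

The ideal case described below — a block containing nothing but obsolete data — is the one where garbage_collect() has nothing to copy at all.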

One of the biggest disadvantages of the "pretend to be disk" approach is addressed by the recent TRIM work. The problem was that the disk didn't even know that certain data blocks were obsolete and could just be discarded. So it was faithfully copying those sectors around from eraseblock to eraseblock during its garbage collection, even though the contents of those sectors were not at all relevant — according to the file system, they were free space!

Once TRIM gets deployed for real, that'll help a lot. But there are other ways in which the model is suboptimal.

The ideal case for garbage collection is that we'll find an eraseblock which contains only obsolete data, and in that case we can just erase it without having to copy anything at all. Rather than mixing volatile, short-term data in with the stable, long-term data, we actually want to keep them apart, in separate eraseblocks. But in the SSD model, the underlying "disk" can't easily tell which data is which — the real OS file system code can do a much better job.

And when we're doing this garbage collection, it's an ideal time for the OS file system to optimise its storage — to defragment or do whatever else it wants (combining data extents, recompressing, data de-duplication, etc.). It can even play tricks like writing new data out in a suboptimal but fast fashion, and then only optimising it later when it gets garbage collected. But when the "disk" is doing this for us behind our back in its own internal file system, we don't get the opportunity to do so.

I don't think Ted is right that the flash hardware is in the best place to handle "failures of its cells". In the SSD model, the flash hardware doesn't do that anyway — it's done by the file system on the embedded microcontroller sitting next to the flash.

I am certain that we can do better than that in our own file system code. All we need is a small amount of information from the flash. Telling us about ECC corrections is a first step, of course — when we had to correct a bunch of flipped bits using ECC, it's getting on for time to GC the eraseblock in question, writing out a clean copy of the data elsewhere. And there are technical reasons why we'll also want the flash to be able to say "please can you GC eraseblock #XX soon".

But I see absolutely no reason why we should put up with the "hardware" actually doing that kind of thing for us, behind our back. And badly.

Admittedly, the need to support legacy environments like DOS and to provide INT 13h "DISK BIOS" calls or at least a "block device" driver will never really go away. But that's not a problem. There are plenty of examples of translation layers done in software, where the OS really does have access to the real flash but a block device is still presented to the rest of the system. Linux has about 5 of them already. The corresponding "dumb" devices (like the M-Systems DiskOnChip which used to be extremely popular) are great for Linux, because we can use real file systems on them directly.

At the very least, we want the "intelligent" SSD devices to have a pass-through mode, so that we can talk directly to the underlying flash medium. That would also allow us to try to recover our data when the internal "file system" screws up, as well as allowing us to do things properly from our own OS file system code.

SSD

Posted Oct 3, 2009 7:45 UTC (Sat) by job (guest, #670) [Link]

Did Ted not read Val's articles on LWN, or does he just not agree?

SSD

Posted Oct 5, 2009 13:09 UTC (Mon) by i3839 (guest, #31386) [Link] (1 responses)

I agree with you, but I understand the other side too.

> But I see absolutely no reason why we should put up with the "hardware"
> actually doing that kind of thing for us, behind our back. And badly.

Three reasons:

- Interoperability without losing flexibility.
For a "dumb" block driver to work the translation table must be fixed.
And when running in legacy mode the drive will be very slow. There are
a heap of problems waiting when it gives write access too.
It's very hard to let people switch filesystems. They mostly use the
default one from their OS, or FAT. As long as disks are sold as units
and not as part of a system this will be true.

- Performance.
Having dedicated hardware doing everything is faster and uses less power.
Not talking about an ARM microcontroller, but a custom ASIC. (Just compare Intel's
SSD idle/active power usage to others.)

- Fast development.
Currently the flash chip interface is standardized with ONFI and the
other end with SATA. All the interesting development happens in-between.
Because it's wedged between stable interfaces it can change a lot without
impacting anything else (except when it's done badly ;-).

So short term the situation is quite hopeless for direct hardware access.
The best hope is probably SSDs with free specifications and open source
firmware.

Long term I think we should get rid of the notion of a disk and go more
to a model resembling RAM. You could buy arrays of flash and plug them in,
instead of whole disks (they could look like "disks" to the user, but
that's another matter). This decouples the flash from the controller
and makes data recovery easier in case the controller dies.

What is needed is a flash specific interface which replaces SATA and
implements the features that all good flash file systems need: Things
like multi-chip support, ECC handling, background erases and scatter
gather DMA. Perhaps throw in enough support to implement RAID fast
in software, if it makes enough sense. Basically a standardized small,
simple and fast flash controller, preferably accessible via some kind of
host memory interface (PCIe on x86). Make sure to make it flexible enough
to handle things like MRAM/PRAM/FeRAM too.

Maybe AHCI is good enough after a few adaptations, but it can probably be
a lot better. Or perhaps some other interface from the embedded world fits
the description.

This will make embedding the controller on a SoC easier too, without the
need for special support in the OS. SATA is really redundant in such cases.

I'm pretty sure most flash controllers already have such a chip, but don't
expose it. Instead they hide it behind an embedded microcontroller which
handles SATA and implements the FTL. They should standardize it and get
rid of the power hungry, complexity adding bloat.

I also think it's crucial to get optimal performance: flash gets faster
and faster. It seems pointless to get faster and faster SATA when PCIe
is already fast enough. It's also silly to fake SATA by just implementing
an AHCI controller with direct flash access. Keep SATA around for real
external storage, not something that doesn't take much space anyway.

People that look at SSDs and see them just as disks and don't think about
the future will think it's best if the hardware does as much as possible.
But if you forget the classic disk model and look at what's really going
on it seems obvious that the classic disk model isn't that simple anyway
and doesn't fit flash or how the hardware looks and could be used.

SSD

Posted Oct 6, 2009 6:23 UTC (Tue) by dwmw2 (subscriber, #2063) [Link]

We should probably take this discussion elsewhere. Your input would be welcome on the MTD list, where I've started a thread about what we want the hardware to look like, if we could have it our way.

But just a brief response...

"- Interoperability without losing flexibility."
This is still possible with a more flexible hardware design — you just implement the translation layer inside your driver, for legacy systems. M-Systems were doing this years ago with the DiskOnChip. More recently, take a look at the Moorestown NAND flash driver. You can happily use FAT on top of those. But of course you do have the opportunity to do a whole lot better, too. And also you have the opportunity to fix the translation layer if/when it goes wrong. And to recover your data.

"- Performance"
But this isn't being done in hardware. It's being done in software, on an extra microcontroller.

Yes, we do need to look carefully at the interface we ask for, and make sure it can perform well. But there's no performance-based reason for the SSD model.

"- Fast development"
You jest, surely? We had TRIM support for FTL in Linux last year, developed in order to test the core TRIM code. When do we get it on "real hardware"? This year? Next?

Being "wedged between stable interfaces" isn't a boon, in this case. Because it's wedged under an inappropriate stable interface, we are severely hampered in what we can do with it.

"People that look at SSDs and see them just as disks and don't think about the future will think it's best if the hardware does as much as possible. But if you forget the classic disk model and look at what's really going on it seems obvious that the classic disk model isn't that simple anyway and doesn't fit flash or how the hardware looks like and could be used."
Agreed. I think it's OK for the hardware to do the same kind of thing that disk hardware does for us — ECC, and some block remapping to hide bad blocks. But that's all; we don't want it implementing a whole file system of its own just so it can pretend to be spinning rust. In particular, perpetuating the myth of 512-byte sectors is just silly.

SSD

Posted Oct 29, 2009 18:11 UTC (Thu) by wookey (guest, #5501) [Link]

Well said David. I thought Linus was talking rot there too. It's bad enough all the manufacturers thinking the black-box approach is a sensible one - we really don't want kernel people who don't know any better getting that idea as well.


Copyright © 2009, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds