LWN.net Logo

Nastiness explained (sort of)

Nastiness explained (sort of)

Posted May 7, 2008 16:37 UTC (Wed) by MisterIO (subscriber, #36192)
In reply to: Nastiness explained (sort of) by MisterIO
Parent article: Stable kernel updates (security fix)

Ok, with kernel preemption, that's not true.


(Log in to post comments)

Nastiness explained (sort of)

Posted May 7, 2008 19:32 UTC (Wed) by viro (subscriber, #7872) [Link]

Actually, on UP we see everything in order, regardless of preemption.
What happens is that in setlk we update the per-inode list of file_lock 
(making it non-empty), then check that descriptor still points to the
same file.  In close we remove file from descriptor table, then check
per-inode list.  The trouble is, we have no warranty that these will be
seen in order on another CPU.  IOW, check of per-inode list might see
the value prior to update done by setlk and setlk might see descriptor
table state prior to update done by close.  With both sides deciding
that there's no need to evict file_lock - close() doesn't see it, setlk
thinks that when close() comes it'll evict the sucker.

And file_lock holds a reference to struct file, but doesn't contribute
to refcount, with all joy that comes from _that_.  It's not supposed
to outlive removal of struct file from descriptor table, let alone
struct file itself...

lock/unlock around accesses to descriptor table forces SMP ordering
and solves the problem; modification of file_lock list is guaranteed
to be seen before unlock done after the check in setlk and check in
close is guranteed to come after lock done before removal from descriptor
table.  So that spinlock acts as a barrier, preventing the ordering
problems.

See Documentation/memory-barriers.txt for background in that area.
And yes, that's part of the price to be paid for lockless access
schemes (such as RCU for descriptor tables and lazy taking of BKL
for POSIX locks eviction in filp_close()) - one needs to think of
things that normally would be provided automatically by exclusion
primitives.

Real SMP-only races are not too frequent thing around VFS, but yes,
sometimes they do happen...

Much clearer explanation

Posted May 7, 2008 19:51 UTC (Wed) by pr1268 (subscriber, #24648) [Link]

Thank you, sir! Your explanation makes so much more sense than my silly melodramatic version above.

This issue highlights one of the critical issues of multi-processor kernel programming--it's easier to write multi-threaded userspace code, but the buck stops at the kernel with respect to maintaining absolute thread safety.

Nastiness explained (sort of)

Posted May 7, 2008 23:29 UTC (Wed) by csamuel (subscriber, #2624) [Link]

Al, thanks for that description!

Could this be related to a problem we're seeing on Barcelona (B3) based 
systems where open(2) occasionally fails with ENOENT when creating a file 
in a directory that most certainly exists ?

We're seeing it under 2.6.24 and 2.6.25, on both NFS and local (ext3) 
filesystems.

We did think it was the queueing system occasionally corrupting the mode 
variable and dropping O_CREAT until a user forwarded us a message 
showing /bin/cp having an identical problem. :-(

Nastiness explained (sort of)

Posted May 8, 2008 7:51 UTC (Thu) by MisterIO (subscriber, #36192) [Link]

Thanks for the explanation.

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds
Powered by Rackspace Managed Hosting.