Samba 4 has been a long time in coming—and it seems to still be a
ways off. Samba is a free implementation of Microsoft's SMB/CIFS protocol
that is used in the vast majority of consumer-targeted network storage
devices, but the current 3.x versions lack many of the features that
enterprise users require (Active Directory support in particular) and Samba 4 is meant to address that shortcoming while also reworking and rewriting much of
the code base.
The biggest hurdle to a Samba 4 release seems to be integrating the new code
with the existing pieces that currently reside in Samba 3. A
recent proposal to release the current code "as is"—rather than complete
the integration work before a 4.0 release—is under heavy discussion on the
samba-technical mailing list.
There is a fair amount of impatience, it seems, for a Samba 4.0 release.
In the proposal, Andrew Bartlett notes that
vendors would like to build their products atop a stable release, rather
than alphas, and that finishing the integration work in 4.1 might allow a
final 4.0 release "in about three
months time". Samba 4 has been in the works since 2003, with
several "technology preview" releases starting in 2006 and the first alpha
release in 2007, but a 4.0 final release has so far proved elusive.
In his proposal, Bartlett is seeking to route around the hurdles and get
a release out there.
Part of the problem with integrating the Active Directory (AD) Domain
Controller (DC) work with the existing production Samba 3 code is that
there needs to be a clear migration path for users who upgrade. If the
existing Samba 3 file server code (often referred to as "smbd") were still
shipped as an option, existing users would not need to change anything.
Only users that were interested in moving to Samba-based DCs would need to
start using bin/samba (which is the name of the Samba 4 server binary
that provides the AD DC functionality).
But, some have always envisioned Samba 4 as a single server process that
can handle all of the different roles (file server
and AD DC), which necessitates the integration. Others are not so sure, and
think it may make sense to release a Samba 4 that incorporates the new code
for those who need AD DC support, while leaving other users to continue
using the older, working and well-tested code. Part of the problem seems
to be that various different sub-groups within the project are taking their
own approach, and no one has done a lot of development—or
testing—of an integrated server solution.
Those who are currently testing DC setups with the Samba 4 code are using
the simpler, single-process "ntvfs"
file server code, rather than smbd. For that reason and others, Andrew
Tridgell would like to
see ntvfs still be available as an option:
For embedded devices the single process mode and
ntvfs file server code is what we are likely to use for some time, as
it uses far fewer resources. For intensive file serving using smbd
makes sense but I don't want to lose the ability to run in small
[systems].
But that doesn't sit well with Jeremy
Allison, partly because of the maintenance burden:
Having a second file server embedded - used only in certain
cases and having differing semantics is a [recipe] for disaster,
and not a good idea for long term stability of this release.
That's a big sticking point for me. I [thought] we'd decided
s4 fileserver was smbd code - end of story. The details
were how to do the integration.
Beyond just the embedded case, though, Tridgell sees value in keeping ntvfs alive, rather than
forcing those users to switch to the smbd-based file server code. It's a
matter of what testing has been done, as well as causing fewer disruptions
to existing setups:
Apart from the embedded case, it is
also the file server we have done all testing of Samba as a DC against
so far. Being able to run it by setting a config option is a good thing
I think, at the very least for debugging issues with the changeover to
the smbd based file server. It also means that existing sites running s4
as a DC are able to continue to run the file server they have been using
up to now.
Tridgell also thinks that ntvfs has a design and structure that should
eventually migrate into smbd. But, as he notes, his arguments haven't convinced
Allison, so he's started working on a branch that uses the smbd server.
That's only one of the problem areas, though. AD DCs handle far more than just
file serving: they also perform authentication and DNS lookups, act as print
servers, and more. But, once again, there are two versions of some of the services floating
around. For example, winbind, which handles some authentication and
UID/GID lookups, has two separate flavors, neither of which currently
handles everything needed for an AD DC. Tridgell and Bartlett have been
looking into whether it makes sense to have
a single winbind server for both worlds, seemingly coming to the conclusion
that it doesn't. But Allison, Simo Sorce, and Matthieu Patou see that as
another unnecessary split between existing and future functionality. Sorce
is particularly unhappy with that direction, saying:
It almost [seems] like we should
just fork the 2 projects, after all we reimplement everything
differently between the file server components and the DC components
with no intention of sharing common code [...]
Part of the complaint about Bartlett's proposal is about positioning.
Samba 4 has been envisioned as a complete drop-in replacement for Samba 3.
To some, that means integrating the Samba 3 functionality into Samba 4, but for others, it could mean making the Samba 3 pieces available as
part of Samba 4. Tridgell and others are in the latter camp, but Allison looks at it this
way: "It isn't an integrated product yet,
it's just a grab bag of non-integrated features." He goes on to say
that it will put OEMs and Linux distributions "in an incredibly
difficult position w.r.t. marketing and communications".
Everyone seems to agree that the ultimate goal is to integrate the two code
bases, but there is enough clamor for a real release of the AD DC feature
that some would like to see an interim release. One idea that seems to be
gaining some traction is to do a "Samba AD" release using the Samba 4 code
(and possibly including some of the Samba 3 server programs). That release
would be targeted at folks that want to run a Samba AD DC—as many
are already doing using the alpha code—encouraging those who don't need
that functionality to stay with Samba 3. As Géza Gémes put it:
Call the Samba 4.0 release Samba-AD (the idea behind the name belongs to
Sernet people), and continue to release Samba3 as Samba-FS. This way
people would have a suggestion where those are going to be deployable.
Of course I DON'T propose the end of the integration efforts. But if the
plan is to do a release in the near future that seems a good (certainly
not perfect) compromise. Having a Samba release with ability to act as
an AD DC is becoming more and more important to many people who have to
upgrade their network infrastructure.
Doing something like that would remove much of the pressure that the Samba
team is feeling regarding a Samba 4 release. That would allow time to work
out the various technical issues with integrating the two Sambas for an
eventual Samba 4 release that fulfills the goal of having a single server
for handling both the AD DC world and the simpler file-and-print-server
world. As Gémes and others said, it's not a perfect solution, but
it may well be one that solves most of the current problems.
The underlying issue seems to be that Samba 3 and Samba 4 have been on
divergent development paths for some time now. While it was widely
recognized that those paths would need to converge at some point, no real
plan to do so has come about until now. Meanwhile, users and OEMs have been
patiently—or not so patiently—waiting for the new features. It is probably still a ways off
before the "real" Samba 4 makes an appearance, but plans are coming
together, which is certainly a step in the right direction. Given that
some have been using the AD DC feature for some time now, it probably makes
sense to find a way to get it into people's hands. Not everyone is
convinced of that last part, however, so it remains to be seen which way
the project will go.
Developers in the "free network services" movement often highlight Google services like GMail, Google Maps, and Google Docs when they speak about the shortcomings of centralized software-as-a-service products, but they rarely address the software behemoth's core product: web search. That may be changing now that the decentralized search engine YaCy has made its 1.0 release. But while the code may be 1.0, the search results may not scream "release quality."
The rationale given for YaCy is that a decentralized, peer-to-peer search service precludes a central point of control and the problems that come with it: censorship, user tracking, commercial players skewing search results, and so forth. Elsewhere on the project site, the fact that a decentralized service also eliminates a central point of failure comes up, as does the potential for faster responses through load balancing. But user control is the core issue: YaCy users determine which sites and pages get indexed, and it is possible to thoroughly explore the search index from a YaCy client.
In addition to "basic" search functionality indexing the entire web,
individual administrators can point YaCy's crawler towards specific
content, using it to create a search portal limited in scope to a
particular topic, a specific domain, or a company intranet. On one hand,
this design decision allows both custom "search appliances" using free
software, and, with the same code base, puts the indexing choices directly
into the hands of the search engine's users. On the other hand, the
portion of the web indexed by the federated network of YaCy installations
is much smaller than the existing indexes of Google and other commercial
services — and perhaps more importantly, it grows more slowly as well.
The 1.0 release of YaCy is available for download in pre-built packages
for Windows, Mac OS X, and Linux, as well as an Apt repository for Debian and Ubuntu. The package requires OpenJDK version 6 or later; once installed it provides a web interface at http://localhost:8090/ that includes searching, administration and configuration, and control over your local index creation. You can also register the local YaCy instance as a toolbar search engine in Firefox, although it must be running in order to function.
In a relatively new move, the project is also running a public web search portal at search.yacy.net. This portal accesses the primary "freeworld" network of public YaCy peers. There are other, independent YaCy networks that do not attempt to index the entire web, such as the scientific research index sciencenet.kit.edu. These portals make it possible to test out YaCy's coverage and results without installing the client software locally.
The architecture among peers
Broadly speaking, a network of YaCy peers (such as freeworld) maintains
a single, shared reverse-word-index for all of the crawled pages (in other
words, a database of matching URLs, ordered on the words that would make up
likely search terms). The difference is that the index is sharded among
the peers in a distributed hash table (DHT). Whenever a peer indexes a new set of pages, it writes updates to the DHT. Shards are replicated between multiple peers to bolster lookup speed and availability.
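As a rough illustration of that scheme, here is a toy Python sketch (YaCy itself is written in Java, and its real hashes and shard counts differ; the SHA-1 prefix and the four-shard "DHT" below are stand-ins) of a reverse word index sharded by word hash, where indexing a crawled page writes each of its words into the appropriate shard:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(word: str, num_shards: int = NUM_SHARDS) -> int:
    # Map a word onto one of the shards by hashing it. YaCy uses its own
    # compact hashes; SHA-1 here just illustrates the idea.
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

# A toy "DHT": one dict per shard, each mapping word -> set of matching URLs.
shards = [dict() for _ in range(NUM_SHARDS)]

def index_page(url: str, words: list[str]) -> None:
    # Writing a newly crawled page into the shared index touches
    # one shard per word on the page.
    for word in words:
        shards[shard_for(word)].setdefault(word, set()).add(url)

def lookup(word: str) -> set[str]:
    # A search for a word goes to exactly one shard.
    return shards[shard_for(word)].get(word, set())

index_page("http://example.org/kernel", ["linux", "kernel"])
index_page("http://example.org/distro", ["linux", "debian"])
```

In the real network each "shard" lives on a different peer (with replicas), so `index_page()` turns into DHT writes and `lookup()` into remote queries.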
In practice, YaCy's DHT is a bit more complicated. The DHT does not
store the full URL reference for every matching page — the complete
entry includes not just the source URL, but metadata about it such as the
last crawl time, the language, and the filetype (all of which might be
important to the user performing the search). That information is stored
in a local database on the peer that crawled the page, and replicated to
several other peers. The peers' data is kept in a custom-written NoSQL database using AVL trees, a self-balancing search tree that gives logarithmic lookup, insert, and delete operations.
Each record in the DHT's index for a particular word contains a hash of the URL and a hash of the peer where the full reference is stored. Those two hashes are computed and stored separately, so that it is simple to determine that two matching hashes are entries for the same URL. That saves time because a URL that matches multiple search terms only has to be fetched from a peer once for a particular search.
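A minimal sketch of that record layout, with truncated SHA-1 digests standing in for YaCy's own compact hashes, and hypothetical URLs and peer names, shows why storing the two hashes separately makes deduplication cheap:

```python
import hashlib

def h(text: str) -> str:
    # Stand-in for YaCy's compact hashes; the point is only that URL
    # and peer hashes are stored as separate, comparable fields.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]

# DHT index records for two search terms; each record is (url_hash, peer_hash).
entries = {
    "linux":  [(h("http://a.example/page"), h("peer-1")),
               (h("http://b.example/page"), h("peer-2"))],
    "kernel": [(h("http://a.example/page"), h("peer-1"))],
}

def urls_to_fetch(terms: list[str]) -> dict[str, str]:
    # Collect unique URL hashes across all terms: a URL that matches
    # several search terms is fetched from its peer only once.
    seen: dict[str, str] = {}
    for term in terms:
        for url_hash, peer_hash in entries.get(term, []):
            seen.setdefault(url_hash, peer_hash)
    return seen

fetch_plan = urls_to_fetch(["linux", "kernel"])
```

Here the page on "peer-1" matches both terms, but the duplicate hash is detected without contacting any peer, so the fetch plan contains only two URLs.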
Finally, for common terms, the number of URL references for a given word can become unwieldy, so YaCy partitions the DHT not just on full word-entry boundaries, but splits each word entry across multiple peers. That complicates the search process because matches for each word need to be retrieved from multiple peers, but it balances the load.
YaCy lead developer Michael Christen said that the freeworld network
currently partitions word entries into 16 parts (although this is configurable, and he
said the freeworld network may soon scale-up). The peer network is visualized as a circle, and the 16 shards evenly "spaced" around the circle. Two mirrors are created for each shard, and are placed in adjacent positions to the "left" and "right" of the primary location. The freeworld network claims around 1000 active peers with about 1500 passive peers (i.e., those not contributing to the public index); together they are currently indexing just under 888 million pages. You can see a live visualization of the network on the YaCy home page, and the YaCy client allows you to explore it more thoroughly (due to the DHT's circular design, the live visualization resembles a borderline-creepy pulsating human eye; it is still not clear to me whether or not this is intentional...).
Because a search involves sending several stages of request to multiple peers (i.e., getting the matching URL/peer hashes from the DHT, then querying peers for the URL records), the lag time for a YaCy search is potentially much higher than it is for sending a query to a single search engine datacenter. However, because each YaCy peer is also storing its own portion of the full index in a local database, the system makes use of that existing data structure to speed things up.
First, for every search, a query to the local database is started
concurrently with the remote search query. Secondly, each search's results
are cached locally, so that on subsequent searches the local query will
return more hits and not tax the network. This works best for the scenario
when two searches performed in a row are part of a single search session
— such as a search for "Linux" followed quickly by a refinement for
"Linux" and "kernel." It is also presumed that a YaCy user is likely to
have crawled pages that are of particular interest to them, so there is a better-than-average chance that relevant results will already be stored in the local database.
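The local-plus-remote strategy can be sketched as follows; this is an assumption-laden simplification (the `local_cache` and `remote_search()` stand in for YaCy's local NoSQL database and multi-peer DHT lookup), but it shows the two queries running concurrently and the merged results being cached for follow-up searches:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the peer's local database of crawled and cached results.
local_cache = {"linux": {"http://cached.example/linux"}}

def local_search(term: str) -> set[str]:
    # The local database answers immediately, without network traffic.
    return set(local_cache.get(term, set()))

def remote_search(term: str) -> set[str]:
    # Placeholder for the multi-stage DHT lookup across remote peers.
    remote_index = {"linux": {"http://peer.example/linux-intro"}}
    return remote_index.get(term, set())

def search(term: str) -> set[str]:
    # Fire the local and remote queries concurrently.
    with ThreadPoolExecutor() as pool:
        local = pool.submit(local_search, term)
        remote = pool.submit(remote_search, term)
        results = local.result() | remote.result()
    # Cache the merged results so that a refined follow-up query can be
    # answered locally without taxing the peer network again.
    local_cache[term] = results
    return results
```

A second search for "linux" (or a refinement such as "linux kernel") would now find the remote hit already sitting in the local cache.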
The search process in YaCy is complex not only because of the distributed storage system, but because a central server does not assemble and sort the results. Instead, the local YaCy web application must do so. It collects results from the local database (including cached results from previous queries) and from remote peers and sorts them together in a "Reverse Word Index Queue." The ranking algorithm used is similar to the PageRank used by Google, though Christen describes it as simpler. As you continue to use YaCy, however, it observes your actions and uses those to refine future search results.
YaCy next fetches the contents of each matching page. Christen admits that this is time-consuming, but it is done to prevent spamming and gaming the system — by fetching the page, the local web application can verify that the search terms actually appear in the page. The results are loaded as HTTP chunks so that they appear on screen as fast as possible.
From the user's standpoint, the YaCy search interface is a good deal
more complicated than the minimalist design sported by Google and its
big-name competitors. The search box is clean, but the results page
displays more detail for every entry than a production search engine might:
there is a link to the metadata database entry for each hit, another link
to details on the indexing and parsing history of the entry, and floating
buttons to promote, demote, and bookmark each link. There is also a
"filetype navigator," "domain navigator," and "author navigator" for the
results page as a whole, along with a tag cloud.
As interesting as the query and replication design may be, the entire system would be useless without a sufficiently large set of indexed pages. As mentioned earlier, index creation is a task left up to the peers as well. The YaCy client software includes several tools for producing and managing the local peer's portion of the global index.
The current release includes seven methods for adding to the index: a site crawler that limits itself to a single domain, an "expert" crawler that will follow links to any depth requested by the user, a network scanner that looks for locally-accessible servers, an RSS importer that indexes individual entries in a feed, an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) importer, and two specialized crawlers: one for MediaWiki sites, and one for phpBB3 forums.
YaCy can also be configured to index all URLs visited by the local
browser. In this mode, it places several safeguards to protect against
indexing personal and protected pages. It skips all pages fetched using
POST or GET parameters, those requiring HTTP password authorization, and so
on. You can also investigate the local database, pulling up records by word entry or hash (if you happen to know the hash of the word), and edit or delete the metadata stored. You can also blacklist problematic sites with regular expression matching; there is a shared YaCy Blacklist Engine available to all clients, although there is no documentation on its contents.
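The blacklist step amounts to a filter applied before crawling or returning a URL; a minimal sketch, with hypothetical patterns (YaCy's real blacklists use their own host/path syntax, so plain regular expressions here are only an approximation), might look like:

```python
import re

# Hypothetical blacklist patterns; these are illustrations, not entries
# from the shared YaCy Blacklist Engine.
blacklist = [re.compile(p) for p in [
    r"^https?://ads\.",           # an ad-serving host
    r"\.example\.com/private/",   # a protected path on some site
]]

def allowed(url: str) -> bool:
    # A URL is kept only if no blacklist pattern matches it.
    return not any(p.search(url) for p in blacklist)
```

Anything the filter rejects is simply never indexed by the local peer, and therefore never contributed to the shared index.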
However you add to the global index, the actions you take on the search
results also contribute to the search experience: you can promote or demote
any entry using the +/- buttons on the results page. There is also an advanced-configuration tool with which you can
tweak the weight of about two dozen separate factors in how your search
results are sorted locally, including word frequency, placement on the
page, appearance in named anchors, and so on. These customizations are
local, though; they do not affect which pages the peers send in response to
a query.
Options and the future
The description above outlines the default, "basic" mode of YaCy usage. The administrative interface allows you to configure the YaCy client in several other modes, including intranet indexing and serving as a single-site search portal. You can also set up user accounts for use with a password-protected search engine (as might be the case for a company intranet), load "suggestion dictionaries," tweak the filetypes that YaCy will index, and adjust the memory usage and process priority.
Another wrinkle in YaCy administration is that the peer can be configured to also load search results culled from the Scroogle and Blekko search engines. Scroogle is an engine that scrapes Google, but removes user-tracking data from the results, while Blekko is a search engine dedicated to publishing its search algorithms and optimizations for public consumption.
Both speak to the need to "bootstrap" YaCy's global index. A glance at
reader comments to other YaCy stories (such as LWN's announcement and Slashdot's coverage)
indicates that many people have tried YaCy and found the ordering of the
results to be lacking. The topic comes up repeatedly on the YaCy
discussion boards, although Christen noted in an FSCONS 2010 talk that YaCy already has more pages in its index than Google did at the time that it launched.
Nevertheless, the YaCy team has recently been promoting a new idea to boost the size of the index: interoperability with the Apache Solr search platform.
From a practical standpoint, this is probably a good move. YaCy alone is not yet indexing enough of the web to be competitive with commercial search engines. Some modest tests of my own roughly match the experiences of the LWN and Slashdot commenters: YaCy can find big and obvious pages for popular topics, but the real meat of web search is the ability to find the difficult-to-discover content. From one point of view, YaCy is like any other crowd-sourced data initiative: the more people who participate, the better it gets. However, it is drastically different from Wikipedia or OpenStreetMap in one key regard: the partial coverage available during the ramp-up phase of the project makes the system unusable for real work. You can map your own home town and the map will be useful to you on a daily basis — but if you index your own web site, that does not help you find most of your search targets. Better interoperability with Solr and other open source search engines could help, as would a concerted effort to index important un-covered areas of the web (a replacement for Google Code Search comes to mind).
Still, the developers are quick to admit that YaCy is not a production service. At this point, the team is concerned with tackling the tricky problem of distributing indexing, searching, and page ranking over a peer-to-peer network. That is an original problem, even if the current state of the index is not a major challenge to Google.
Here is LWN's fourteenth annual timeline of significant events in the Linux
and free software world for the year.
In many ways, 2011 is just like all the previous years we have
covered—only the details have changed. Releases of new software and
distributions continue at their normal ferocious rate, and Linux adoption
(though perhaps not on the desktop) continues unabated. That said, the
usual threats to our communities keep rearing their heads; in particular,
the patent attacks against free software continue to increase. But,
overall, it was a great year for Linux and free software, just as we expect
2012 (and beyond) to be.
We will be breaking the timeline up into quarters, and this is our report
on January-March 2011. Over the next month, we will be putting out
timelines of the other three quarters of the year.
This is version 0.8 of the 2011 timeline. There are almost certainly some
errors or omissions; if you find any, please send them to firstname.lastname@example.org.
LWN subscribers have paid for the development of this timeline, along with
previous timelines and the weekly editions. If you like what you see here,
or elsewhere on the site, please consider subscribing to LWN.
For those with a nostalgic bent, our timeline index page has links
to the previous 13 timelines and some other retrospective articles
going all the way back to 1998.
Linux 2.6.37 is released (announcement, KernelNewbies summary, Who wrote 2.6.37).
It is no longer vital to work to keep Emacs small. Eight Megabytes Ain't
Constantly Swapping any more.
-- Richard Stallman
No more H.264 video codec support for the Chrome/Chromium browser as
Google focuses on WebM support (announcement).
The Hudson continuous integration server project forks due to
fallout from Oracle's acquisition of Sun. The new project is called
Jenkins.
Free software's awfully like sausages - wonderfully tasty, but sometimes
you suddenly discover that you've been eating sheep nostrils for the past
15 years of your life.
-- Matthew Garrett
LibreOffice makes its first stable release, 3.3 (announcement).
OpenOffice.org also makes a 3.3 release (new features).
The FFmpeg project has a leadership coup, though it eventually
resolves into a fork in March, which results in the Libav project (LWN blurb).
Amarok 2.4 is released (announcement).
Nice to see it gone - it seemed such a good idea in Linux 1.3
-- Alan Cox won't miss the big kernel lock
Mark Shuttleworth announces plans to include Qt and Qt-based
applications on the default Ubuntu install (blog post).
Xfce 4.8 is released (announcement).
linux.conf.au is held in Brisbane, Australia despite the efforts of
Mother Nature to inundate it. Organizers were quick to move to a new venue after catastrophic
flooding, and the conference came off without a hitch. (LWN coverage: Re-engineering the internet, IP
address exhaustion, Server power management,
The controversial Mark Pesce keynote, 30 years of sendmail, Rationalizing the wacom driver,
and a Wrap-up).
KDE Software Compilation 4.6 is released (announcement).
Bufferbloat.net launches as a site to work on solving networking performance
problems caused by bufferbloat. (LWN blurb, web site).
The last IPv4 address blocks are allocated by the Internet Assigned
Numbers Authority (IANA) to the Asia-Pacific Network Information Center
(APNIC), which would (seemingly) make the IPv6 transition even more urgent (announcement).
If you're wondering why people don't follow your instructions to help you
with your project, go hit your local library and check out a cookbook. Bake
something you've never baked before. Then, while eating it, open your
documentation again and take a look at it with this in mind.
FOSDEM is held February 5-6 in Brussels, Belgium (LWN coverage: Freedom Box, Distribution collaboration, and Configuration management).
Eben Moglen announces the FreedomBox Foundation as part of his
FOSDEM talk. A fundraising campaign on Kickstarter garners well over the
$60,000 goal. (LWN article).
Debian 6.0 ("Squeeze") is released (announcement, LWN pre-review).
The Ada Initiative launches to promote women in open technology and
culture (announcement, LWN coverage).
Nokia, our platform is burning.
-- Nokia CEO Stephen
Elop foreshadows the switch to Windows
Nokia drops MeeGo in favor of Windows Phone 7 (LWN blurb, Reuters article).
GNU Guile 2.0.0 released. Guile is an implementation of the
Lisp-like Scheme language (announcement).
The MPEG Licensing Authority (MPEG-LA) calls for patents essential to
VP8, as it is looking to form a patent pool to potentially shake down
implementers of the video codec used by WebM (announcement).
A Linux-based supercomputer is a contestant on Jeopardy. IBM's
"Watson" trounces two former champions (New
York Times article).
Realize that 50% of today's professional programmers have never written a
line of code that had to be compiled.
-- Casey Schaufler
Python 3.2 released (announcement).
FreeBSD 8.2 released (announcement).
Southern California Linux Expo (SCALE) 9x is held in Los Angeles,
February 25-27 (LWN coverage: Unity, Hackerspaces, Distribution unfriendly projects, and Phoronix launches OpenBenchmarking).
Canonical unilaterally switches the Banshee default music store to
Ubuntu One (original coverage,
Mark Shuttleworth's view).
Red Hat stops shipping broken-out kernel patches for RHEL 6 which
causes an uproar in the community and charges of GPL violations. It
actually happened earlier, but came to light in February. (LWN coverage: Enterprise distributions and free software and
Red Hat and the GPL; Red Hat statement).
The vendor-sec mailing list and its host are compromised (announcement, LWN coverage).
Golden rule #12: When the comments do not match the code, they probably are
[both wrong].
-- Steven Rostedt
Scientific Linux 6.0 is released (announcement).
The Yocto project and OpenEmbedded "align" both in terms of
governance and technology, which should result in less fragmentation in the
building of embedded Linux systems (announcement).
Linux 2.6.38 is released (announcement, KernelNewbies summary, and
Who wrote 2.6.38).
openSUSE 11.4 is released (announcement).
Linus Torvalds starts loudly complaining about the ARM kernel tree,
which leads to a large effort to
clean it all up (linux-kernel post, LWN article).
If it's some desperate cry for attention by somebody, I just wish those
people would release their own sex tapes or something, rather than drag the
Linux kernel into their sordid world.
-- Linus Torvalds is unimpressed by the Bionic GPL violation claims
Fraudulent SSL certificates issued by UserTrust (part of Comodo) are
found in the wild (LWN blurb, article and follow-up).
Android's Bionic C library comes under fire for alleged GPL
violations, though it appears to be a concerted fear, uncertainty, and
doubt (FUD) campaign (LWN article).
Microsoft sues Barnes & Noble over alleged patent infringement in
the Android-based Nook ebook
reader (LWN blurb and article).
The worst part about Comodo's letter to the public was how they claimed
that they never thought a nation state would attack them. If that's not
part of your threat model, what business do you have being part of Internet
[infrastructure]?
Firefox 4 is released, marking the beginning of Mozilla's new
quarterly release schedule (announcement).
Google chooses not to release its tablet-oriented Android 3.0
code, because it isn't ready for both tablets and handsets (LWN article).
The Monotone distributed version control system releases its 1.0 version.
GCC 4.6.0 is released (LWN blurb, release notes).
Page editor: Jonathan Corbet