
Leading items

Releasing Samba 4

By Jake Edge
November 30, 2011

Samba 4 has been a long time in coming—and it seems to still be a ways off. Samba is a free implementation of Microsoft's SMB/CIFS protocol that is used in the vast majority of consumer-targeted network storage devices, but the current 3.x versions lack many of the features that enterprise users require (Active Directory support in particular) and Samba 4 is meant to address that shortcoming while also reworking and rewriting much of the code base. The biggest hurdle to a Samba 4 release seems to be integrating the new code with the existing pieces that currently reside in Samba 3. A recent proposal to release the current code "as is"—rather than complete the integration work before a 4.0 release—is under heavy discussion on the samba-technical mailing list.

There is a fair amount of impatience, it seems, for a Samba 4.0 release. In the proposal, Andrew Bartlett notes that vendors would like to build their products atop a stable release, rather than alphas, and that finishing the integration work in 4.1 might allow a final 4.0 release "in about three months time". Samba 4 has been in the works since 2003, with several "technology preview" releases starting in 2006 and the first alpha release in 2007, but a 4.0 final release has so far proved elusive. In his proposal, Bartlett is seeking to route around the hurdles and get a release out there.

Part of the problem with integrating the Active Directory (AD) Domain Controller (DC) work with the existing production Samba 3 code is that there needs to be a clear migration path for users who upgrade. If the existing Samba 3 file server code (often referred to as "smbd") were still shipped as an option, existing users would not need to change anything. Only users that were interested in moving to Samba-based DCs would need to start using bin/samba (which is the name of the Samba 4 server that includes AD and DC functionality).

But, some have always envisioned Samba 4 as a single server process that can handle all of the different roles (file server and AD DC), which necessitates the integration. Others are not so sure that it doesn't make sense to release a Samba 4 that incorporates the new functionality for those who need AD DC support, while leaving other users to continue using the older, working and well-tested code. Part of the problem seems to be that various different sub-groups within the project are taking their own approach, and no one has done a lot of development—or testing—of an integrated server solution.

Those who are currently testing DC setups with the Samba 4 code are using the simpler, single-process "ntvfs" file server code, rather than smbd. For that reason and others, Andrew Tridgell would like to see ntvfs still be available as an option:

For embedded devices the single process mode and ntvfs/ file server code is what we are likely to use for some time, as it uses far fewer resources. For intensive file serving using smbd makes sense but I don't want to lose the ability to run in small environments.

But that doesn't sit well with Jeremy Allison, partly because of the maintenance burden:

Having a second file server embedded - used only in certain cases and having differing semantics is a [recipe] for disaster, and not a good idea for long term stability of this release.

That's a big sticking point for me. I [thought] we'd decided s4 fileserver was smbd code - end of story. The details were how to do the integration.

Beyond just the embedded case, though, Tridgell sees value in keeping ntvfs alive, rather than forcing those users to switch to the smbd-based file server code. It's a matter of what testing has been done, as well as causing fewer disruptions to existing setups:

Apart from the embedded case, it is also the file server we have done all testing of Samba as a DC against so far. Being able to run it by setting a config option is a good thing I think, at the very least for debugging issues with the changeover to the smbd based file server. It also means that existing sites running s4 as a DC are able to continue to run the file server they have been using up to now.

Tridgell also thinks that ntvfs has a design and structure that should eventually migrate into smbd. But, as he notes, his arguments haven't convinced Allison, so he's started working on a branch that uses the smbd server. That's only one of the problem areas, though. AD DCs handle far more than just file serving; they also perform authentication, handle DNS lookups, act as print servers, and more.

But, once again, there are two versions of some of the services floating around. For example, winbind, which handles some authentication and UID/GID lookups, has two separate flavors, neither of which currently handles everything needed for an AD DC. Tridgell and Bartlett have been looking into whether it makes sense to have a single winbind server for both worlds, seemingly coming to the conclusion that it doesn't. But Allison, Simo Sorce, and Matthieu Patou see that as another unnecessary split between existing and future functionality. Sorce is particularly unhappy with that direction, saying:

It almost [seems] like we should just fork the 2 projects, after all we reimplement everything differently between the file server components and the DC components with no intention of sharing common code [...]

Part of the complaint about Bartlett's proposal concerns positioning. Samba 4 has been envisioned as a complete drop-in replacement for Samba 3. To some, that means integrating the Samba 3 functionality into Samba 4, but for others, it could mean making the Samba 3 pieces available as part of Samba 4. Tridgell and others are in the latter camp, but Allison looks at it this way: "It isn't an integrated product yet, it's just a grab bag of non-integrated features." He goes on to say that it will put OEMs and Linux distributions "in an incredibly difficult position w.r.t. marketing and communications".

Everyone seems to agree that the ultimate goal is to integrate the two code bases, but there is enough clamor for a real release of the AD DC feature that some would like to see an interim release. One idea that seems to be gaining some traction is to do a "Samba AD" release using the Samba 4 code (and possibly including some of the Samba 3 server programs). That release would be targeted at folks that want to run a Samba AD DC (as many are already doing using the alpha code), while encouraging those who don't need that functionality to stay with Samba 3. As Géza Gémes put it:

Call the Samba 4.0 release Samba-AD (the idea behind the name belongs to Sernet people), and continue to release Samba3 as Samba-FS. This way people would have a suggestion where those are going to be deployable. Of course I DON'T propose the end of the integration efforts. But if the plan is to do a release in the near future that seems a good (certainly not perfect) compromise. Having a Samba release with ability to act as an AD DC is becoming more and more important to many people who have to upgrade their network infrastructure.

Doing something like that would remove much of the pressure that the Samba team is feeling regarding a Samba 4 release. That would allow time to work out the various technical issues with integrating the two Sambas for an eventual Samba 4 release that fulfills the goal of having a single server for handling both the AD DC world and the simpler file-and-print-server world. As Gémes and others said, it's not a perfect solution, but it may well be one that solves most of the current problems.

The underlying issue seems to be that Samba 3 and Samba 4 have been on divergent development paths for some time now. While it was widely recognized that those paths would need to converge at some point, no real plan to do so has come about until now. Meanwhile, users and OEMs have been patiently—or not so patiently—waiting for the new features. It is probably still a ways off before the "real" Samba 4 makes an appearance, but plans are coming together, which is certainly a step in the right direction. Given that some have been using the AD DC feature for some time now, it probably makes sense to find a way to get it into people's hands. Not everyone is convinced of that last part, however, so it remains to be seen which way the project will go.


YaCy: A peer-to-peer search engine

November 30, 2011

This article was contributed by Nathan Willis

Developers in the "free network services" movement often highlight Google services like GMail, Google Maps, and Google Docs when they speak about the shortcomings of centralized software-as-a-service products, but they rarely address the software behemoth's core product: web search. That may be changing now that the decentralized search engine YaCy has made its 1.0 release. But while the code may be 1.0, the search results may not scream "release quality."

The rationale given for YaCy is that a decentralized, peer-to-peer search service precludes a central point of control and the problems that come with it: censorship, user tracking, commercial players skewing search results, and so forth. Elsewhere on the project site, the fact that a decentralized service also eliminates a central point of failure comes up, as does the potential for faster responses through load-balancing. But user control is the core issue: YaCy users determine which sites and pages get indexed, and it is possible to thoroughly explore the search index from a YaCy client.

In addition to "basic" search functionality indexing the entire web, individual administrators can point YaCy's crawler towards specific content, using it to create a search portal limited in scope to a particular topic, a specific domain, or a company intranet. On one hand, this design decision makes it possible to build custom "search appliances" with free software and, with the same code base, puts the indexing choices directly into the hands of the search engine's users. On the other hand, the portion of the web indexed by the federated network of YaCy installations is much smaller than the existing indexes of Google and other commercial services and, perhaps more importantly, it grows more slowly as well.

The 1.0 release of YaCy is available for download in pre-built packages for Windows, Mac OS X, and Linux, as well as an Apt repository for Debian and Ubuntu. The package requires OpenJDK version 6 or later; once installed it provides a web interface at http://localhost:8090/ that includes searching, administration and configuration, and control over your local index creation. You can also register the local YaCy instance as a toolbar search engine in Firefox, although it must be running in order to function.

In a relatively new move, the project is also running a public web search portal at search.yacy.net. This portal accesses the primary "freeworld" network of public YaCy peers. There are other, independent YaCy networks that do not attempt to index the entire web, such as the scientific research index sciencenet.kit.edu. These portals make it possible to test out YaCy's coverage and results without installing the client software locally.

The architecture among peers

Broadly speaking, a network of YaCy peers (such as freeworld) maintains a single, shared reverse-word-index for all of the crawled pages (in other words, a database of matching URLs, ordered on the words that would make up likely search terms). The difference from a centralized search engine is that the index is sharded among the peers in a distributed hash table (DHT). Whenever a peer indexes a new set of pages, it writes updates to the DHT. Shards are replicated between multiple peers to bolster lookup speed and availability.
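To make the idea of a sharded reverse word index concrete, here is a minimal sketch in Python. It is not YaCy's code; the tokenizer, the hash-modulo shard assignment, and the names `shard_for`, `build_sharded_index`, and `lookup` are all assumptions chosen for illustration, and replication is left out entirely.

```python
import hashlib
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and keep purely alphanumeric words."""
    return [w for w in text.lower().split() if w.isalnum()]


def shard_for(word, num_peers):
    """Pick the peer responsible for a word (illustrative hash-modulo scheme)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_peers


def build_sharded_index(pages, num_peers):
    """Build one word -> set-of-URLs mapping per peer."""
    shards = [defaultdict(set) for _ in range(num_peers)]
    for url, text in pages.items():
        for word in tokenize(text):
            shards[shard_for(word, num_peers)][word].add(url)
    return shards


def lookup(shards, word):
    """Ask only the peer that owns the word for its matching URLs."""
    word = word.lower()
    return shards[shard_for(word, len(shards))].get(word, set())
```

A lookup touches a single shard per search word, which is the property that lets the index be spread over many machines.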

In practice, YaCy's DHT is a bit more complicated. The DHT does not store the full URL reference for every matching page. The complete entry includes not just the source URL, but also metadata about it such as the last crawl time, the language, and the filetype (all of which might be important to the user performing the search). That information is stored in a local database on the peer that crawled the page, and replicated to several other peers. The peers' data is kept in a custom-written NoSQL database built on AVL trees, which are self-balancing binary search trees that offer logarithmic-time lookup, insert, and delete operations.

Each record in the DHT's index for a particular word contains a hash of the URL and a hash of the peer where the full reference is stored. Those two hashes are computed and stored separately, so that it is simple to determine that two matching hashes are entries for the same URL. That saves time because a URL that matches multiple search terms only has to be fetched from a peer once for a particular search.
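The benefit of keeping the two hashes separate can be sketched as follows. The record layout, the truncated SHA-256 hashes, and the helper names are hypothetical and stand in for whatever YaCy actually uses; the point is only that equal URL hashes from different word entries collapse into one fetch.

```python
import hashlib
from typing import NamedTuple


class IndexRecord(NamedTuple):
    url_hash: str   # identifies the page
    peer_hash: str  # identifies the peer holding the full reference


def short_hash(value):
    """Illustrative stand-in for the hash function (truncated SHA-256)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def make_record(url, peer_name):
    return IndexRecord(short_hash(url), short_hash(peer_name))


def plan_fetches(records_by_word):
    """Group the unique URL hashes by peer, so each reference is fetched once."""
    plan = {}
    for records in records_by_word.values():
        for rec in records:
            plan.setdefault(rec.peer_hash, set()).add(rec.url_hash)
    return plan
```

If a page matches both "linux" and "kernel", its record appears under both words, but `plan_fetches` lists its URL hash only once for the peer that stores it.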

Finally, for common terms, the number of URL references for a given word can become unwieldy, so YaCy partitions the DHT not just on full word-entry boundaries, but splits each word entry across multiple peers. That complicates the search process because matches for each word need to be retrieved from multiple peers, but it balances the load.

[Network visualization]

YaCy lead developer Michael Christen said that the freeworld network currently partitions word entries into 16 parts (although this is configurable, and he said the freeworld network may soon scale up). The peer network is visualized as a circle, with the 16 shards evenly "spaced" around it. Two mirrors are created for each shard, and are placed in adjacent positions to the "left" and "right" of the primary location. The freeworld network claims around 1000 active peers with about 1500 passive peers (i.e., those not contributing to the public index); together they are currently indexing just under 888 million pages. You can see a live visualization of the network on the YaCy home page, and the YaCy client allows you to explore it more thoroughly (due to the DHT's circular design, the live visualization resembles a borderline-creepy pulsating human eye; it is still not clear to me whether or not this is intentional...).
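One way to picture the circular layout is the sketch below. Only the figure of 16 partitions and the left/right mirror placement come from the description above; the hash-based ring positions, the rule that the primary holder is the next peer around the circle, and every function name are assumptions made for the example.

```python
import bisect
import hashlib

PARTITIONS = 16  # figure quoted for the freeworld network


def ring_position(key):
    """Map a string to a point in [0, 1) on the circle (illustrative hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def build_ring(peer_names):
    """Sorted list of (position, peer_name) pairs."""
    return sorted((ring_position(name), name) for name in peer_names)


def partition_positions(word, partitions=PARTITIONS):
    """Spread the partitions of one word entry evenly around the circle."""
    base = ring_position(word)
    return [(base + k / partitions) % 1.0 for k in range(partitions)]


def holders(position, ring):
    """Return [primary, left mirror, right mirror] for a point on the ring.

    Assumption: the primary is the first peer at or after the position,
    and the mirrors are its immediate neighbours on either side.
    """
    keys = [pos for pos, _ in ring]
    i = bisect.bisect_left(keys, position) % len(ring)
    left = ring[(i - 1) % len(ring)][1]
    right = ring[(i + 1) % len(ring)][1]
    return [ring[i][1], left, right]
```

For a word such as "linux", calling `holders` on each of the 16 positions from `partition_positions("linux")` yields the peers that would store, and mirror, each slice of that word's entry under these assumptions.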

Search me...

Because a search involves sending several stages of request to multiple peers (i.e., getting the matching URL/peer hashes from the DHT, then querying peers for the URL records), the lag time for a YaCy search is potentially much higher than it is for sending a query to a single search engine datacenter. However, because each YaCy peer is also storing its own portion of the full index in a local database, the system makes use of that existing data structure to speed things up.

First, for every search, a query to the local database is started concurrently with the remote search query. Secondly, each search's results are cached locally, so that on subsequent searches the local query will return more hits and not tax the network. This works best for the scenario when two searches performed in a row are part of a single search session — such as a search for "Linux" followed quickly by a refinement for "Linux" and "kernel." It is also presumed that a YaCy user is likely to have crawled pages that are of particular interest to them, so there is a better-than-average chance that relevant results will already be stored in the local database.
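The two optimizations described here, running the local and remote queries side by side and caching remote hits locally, might look roughly like this. The `LocalIndex` class and the `remote_search` callable are hypothetical placeholders, not YaCy interfaces.

```python
from concurrent.futures import ThreadPoolExecutor


class LocalIndex:
    """Toy stand-in for the peer's local database."""

    def __init__(self):
        self._hits = {}

    def search(self, term):
        return set(self._hits.get(term, ()))

    def cache(self, term, urls):
        self._hits.setdefault(term, set()).update(urls)


def search(term, local, remote_search):
    """Query the local index and the remote peers concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        local_future = pool.submit(local.search, term)
        remote_future = pool.submit(remote_search, term)
        local_hits = local_future.result()
        remote_hits = set(remote_future.result())
    # Remember remote results so the next local query returns more hits.
    local.cache(term, remote_hits)
    return local_hits | remote_hits
```

After the first search for a term, a repeat or refinement of that search can already be answered in part from the local cache.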

The search process in YaCy is complex not only because of the distributed storage system, but because a central server does not assemble and sort the results. Instead, the local YaCy web application must do so. It collects results from the local database (including cached results from previous queries) and from remote peers and sorts them together in a "Reverse Word Index Queue." The ranking algorithm used is similar to the PageRank used by Google, though Christen describes it as simpler. As you continue to use YaCy, however, it observes your actions and uses those to refine future search results.

YaCy next fetches the contents of each matching page. Christen admits that this is time-consuming, but it is done to prevent spamming and gaming the system — by fetching the page, the local web application can verify that the search terms actually appear in the page. The results are loaded as HTTP chunks so that they appear on screen as fast as possible.
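The verification step can be sketched as a filter over candidate URLs. The `fetch` argument is a hypothetical callable that returns a page's text (and raises `OSError` when the page cannot be retrieved); the word-matching rule is likewise an assumption. The function is a generator, so verified hits become available one at a time rather than only after every page has been checked.

```python
import re


def contains_all_terms(page_text, terms):
    """True if every search term occurs as a word in the page text."""
    words = set(re.findall(r"\w+", page_text.lower()))
    return all(term.lower() in words for term in terms)


def verified_results(candidate_urls, terms, fetch):
    """Yield only the candidates whose fetched content contains the terms."""
    for url in candidate_urls:
        try:
            text = fetch(url)
        except OSError:
            continue  # unreachable pages are dropped from the results
        if contains_all_terms(text, terms):
            yield url
```

A peer that advertises a page for terms it does not contain gains nothing under this scheme, since the hit is discarded before it is shown.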

[Search results]

From the user's standpoint, the YaCy search interface is a good deal more complicated than the minimalist design sported by Google and its big-name competitors. The search box is clean, but the results page displays more detail for every entry than a production search engine might: there is a link to the metadata database entry for each hit, another link to details on the indexing and parsing history of the entry, and floating buttons to promote, demote, and bookmark each link. There is also a "filetype navigator," "domain navigator," and "author navigator" for the results page as a whole, along with a tag cloud.

Harvest season

As interesting as the query and replication design may be, the entire system would be useless without a sufficiently large set of indexed pages. As mentioned earlier, index creation is a task left up to the peers as well. The YaCy client software includes several tools for producing and managing the local peer's portion of the global index.

The current release includes seven methods for adding to the index: a site crawler that limits itself to a single domain, an "expert" crawler that will follow links to any depth requested by the user, a network scanner that looks for locally-accessible servers, an RSS importer that indexes individual entries in a feed, an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) importer, and two specialized crawlers: one for MediaWiki sites, and one for phpBB3 forums.

YaCy can also be configured to index all URLs visited by the local browser. In this mode, it places several safeguards to protect against indexing personal and protected pages. It skips all pages fetched using POST or GET parameters, those requiring HTTP password authorization, and so on. You can also investigate the local database, pulling up records by word entry or hash (if you happen to know the hash of the word), and edit or delete the metadata stored. You can also blacklist problematic sites with regular expression matching; there is a shared YaCy Blacklist Engine available to all clients, although there is no documentation on its contents.
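A filter along the lines of these safeguards could be written as below. The function name, its parameters, and the sample blacklist pattern are invented for illustration; YaCy's actual rules are more extensive ("and so on") and are not reproduced here.

```python
import re
from urllib.parse import urlsplit

# Hypothetical blacklist entry; real lists are maintained by the user or shared.
BLACKLIST = [re.compile(r"^https?://ads\.")]


def should_index(url, method="GET", needs_auth=False, blacklist=()):
    """Decide whether a page seen by the browser is safe to index."""
    if method.upper() != "GET":
        return False  # skip form submissions (POST)
    if needs_auth:
        return False  # skip pages behind HTTP password authorization
    if urlsplit(url).query:
        return False  # skip URLs carrying GET parameters
    return not any(pattern.search(url) for pattern in blacklist)
```

For example, `should_index("http://example.org/page?id=3")` is rejected because of its query string, while the same URL without parameters is accepted.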

However you add to the global index, the actions you take on the search results also contribute to the search experience: you can promote or demote any entry using the +/- buttons on the results page. There is also an advanced-configuration tool with which you can tweak the weight of about two dozen separate factors in how your search results are sorted locally, including word frequency, placement on the page, appearance in named anchors, and so on. These customizations are local, though; they do not affect which pages the peers send in response to your queries.
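Local re-ranking with adjustable weights and promote/demote votes can be illustrated with a simple weighted sum. The three factor names, their default weights, and the way a vote is added to the score are assumptions for the sketch; the real tool exposes about two dozen factors and its formula is not described here.

```python
# Hypothetical factor weights; a user could tune these locally.
DEFAULT_WEIGHTS = {
    "word_frequency": 1.0,
    "position_in_page": 0.5,
    "in_named_anchor": 2.0,
}


def score(features, weights=DEFAULT_WEIGHTS, vote=0):
    """Weighted sum of a result's features, plus any promote/demote vote."""
    base = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return base + vote


def rank(results, weights=DEFAULT_WEIGHTS, votes=None):
    """Sort result URLs from best to worst using purely local settings."""
    votes = votes or {}
    return sorted(
        results,
        key=lambda url: score(results[url], weights, votes.get(url, 0)),
        reverse=True,
    )
```

Because the weights and votes are applied after the results arrive, changing them reorders what the user sees without altering what the peers send.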

Options and the future

The description above outlines the default, "basic" mode of YaCy usage. The administrative interface allows you to configure the YaCy client in several other modes, including intranet indexing and serving as a single-site search portal. You can also set up user accounts for use with a password-protected search engine (as might be the case for a company intranet), load "suggestion dictionaries," tweak the filetypes that YaCy will index, and adjust the memory usage and process priority.

Another wrinkle in YaCy administration is that the peer can be configured to also load search results culled from the Scroogle and Blekko search engines. Scroogle is an engine that scrapes Google, but removes user-tracking data from the results, while Blekko is a search engine dedicated to publishing its search algorithms and optimizations for public consumption.

Both speak to the need to "bootstrap" YaCy's global index. A glance at reader comments to other YaCy stories (such as LWN's announcement and Slashdot's coverage) indicates that many people have tried YaCy and found the ordering of the results to be lacking. The topic comes up repeatedly on the YaCy discussion boards, although Christen noted in an FSCONS 2010 talk that YaCy already has more pages in its index than Google did at the time that it launched.

Nevertheless, the YaCy team has recently been promoting a new idea to boost the size of the index: interoperability with the Apache Solr search platform.

From a practical standpoint, this is probably a good move. YaCy alone is not yet indexing enough of the web to be competitive with commercial search engines. Some modest tests of my own roughly match the experiences of the LWN and Slashdot commenters: YaCy can find big and obvious pages for popular topics, but the real meat of web search is the ability to find the difficult-to-discover content. From one point of view, YaCy is like any other crowd-sourced data initiative: the more people who participate, the better it gets. However, it is drastically different from Wikipedia or OpenStreetMap in one key regard: the partial coverage available during the ramp-up phase of the project makes the system unusable for real work. You can map your own home town and the map will be useful to you on a daily basis — but if you index your own web site, that does not help you find most of your search targets. Better interoperability with Solr and other open source search engines could help, as would a concerted effort to index important un-covered areas of the web (a replacement for Google Code Search comes to mind).

Still, the developers are quick to admit that YaCy is not a production service. At this point, the team is concerned with tackling the tricky problem of distributing indexing, searching, and page ranking over a peer-to-peer network. That is an original problem, even if the current state of the index is not a major challenge to Google.


2011 Linux and free software timeline - Q1

Here is LWN's fourteenth annual timeline of significant events in the Linux and free software world for the year.

In many ways, 2011 is just like all the previous years we have covered; only the details have changed. Releases of new software and distributions continue at their normal ferocious rate, and Linux adoption (though perhaps not on the desktop) continues unabated. That said, the usual threats to our communities keep rearing their heads; in particular, the patent attacks against free software continue to increase. But, overall, it was a great year for Linux and free software, just as we expect 2012 (and beyond) to be.

We will be breaking the timeline up into quarters, and this is our report on January-March 2011. Over the next month, we will be putting out timelines of the other three quarters of the year.


This is version 0.8 of the 2011 timeline. There are almost certainly some errors or omissions; if you find any, please send them to timeline@lwn.net.

LWN subscribers have paid for the development of this timeline, along with previous timelines and the weekly editions. If you like what you see here, or elsewhere on the site, please consider subscribing to LWN.

For those with a nostalgic bent, our timeline index page has links to the previous 13 timelines and some other retrospective articles going all the way back to 1998.

January

Linux 2.6.37 is released (announcement, KernelNewbies summary, Who wrote 2.6.37).

It is no longer vital to work to keep Emacs small. Eight Megabytes Ain't Constantly Swapping any more.

-- Richard Stallman

No more H.264 video codec support for the Chrome/Chromium browser as Google focuses on WebM support (announcement, update). [Jenkins logo]

The Hudson continuous integration server project forks due to fallout from Oracle's acquisition of Sun. The new project is called Jenkins (announcement).

Free software's awfully like sausages - wonderfully tasty, but sometimes you suddenly discover that you've been eating sheep nostrils for the past 15 years of your life.

-- Matthew Garrett

[LibreOffice logo]

LibreOffice makes its first stable release, 3.3 (announcement, LWN coverage).

OpenOffice.org also makes a 3.3 release (new features, release notes).

The FFmpeg project has a leadership coup, though it eventually resolves into a fork in March, which results in the Libav project (LWN blurb).

Amarok 2.4 is released (announcement).

Nice to see it gone - it seemed such a good idea in Linux 1.3

-- Alan Cox won't miss the BKL

Mark Shuttleworth announces plans to include Qt and Qt-based applications on the default Ubuntu install (blog post).

Xfce 4.8 is released (announcement, LWN preview).

linux.conf.au is held in Brisbane, Australia despite the efforts of Mother Nature to inundate it. Organizers were quick to move to a new venue after catastrophic flooding, and the conference came off without a hitch. (LWN coverage: Re-engineering the internet, IP address exhaustion, Server power management, The controversial Mark Pesce keynote, 30 years of sendmail, Rationalizing the wacom driver, and a Wrap-up). [KDE logo]

KDE Software Compilation 4.6 is released (announcement).

Bufferbloat.net launches as a site to work on solving networking performance problems caused by bufferbloat. (LWN blurb, web site).

February

The last IPv4 address blocks are allocated by the Internet Assigned Numbers Authority (IANA) to the Asia-Pacific Network Information Center (APNIC), which would (seemingly) make the IPv6 transition even more urgent (announcement).

If you're wondering why people don't follow your instructions to help you with your project, go hit your local library and check out a cookbook. Bake something you've never baked before. Then, while eating it, open your documentation again and take a look at it with this in mind.

-- Mel Chua

FOSDEM is held February 5-6 in Brussels, Belgium (LWN coverage: Freedom Box, Distribution collaboration, and Configuration management). [FreedomBox logo]

Eben Moglen announces the FreedomBox Foundation as part of his FOSDEM talk. A fundraising campaign on Kickstarter garners well over the $60,000 goal. (LWN article). [Debian logo]

Debian 6.0 ("Squeeze") is released (announcement, LWN pre-review).

The Ada Initiative launches to promote women in open technology and culture (announcement, LWN coverage).

Nokia, our platform is burning.

-- Nokia CEO Stephen Elop foreshadows the switch to Windows

Nokia drops MeeGo in favor of Windows Phone 7 (LWN blurb, Reuters report). [Guile logo]

GNU Guile 2.0.0 released. Guile is an implementation of the Lisp-like Scheme language (announcement).

The MPEG Licensing Authority (MPEG-LA) calls for patents essential to VP8, as it is looking to form a patent pool to potentially shake down implementers of the video codec used by WebM (announcement).

A Linux-based supercomputer is a contestant on Jeopardy. IBM's "Watson" trounces two former champions (New York Times article).

Realize that 50% of today's professional programmers have never written a line of code that had to be compiled.

-- Casey Schaufler

[Python logo]

Python 3.2 released (announcement).

FreeBSD 8.2 released (announcement, release notes).

Southern California Linux Expo (SCALE) 9x is held in Los Angeles, February 25-27 (LWN coverage: Unity, Hackerspaces, Distribution unfriendly projects, and Phoronix launches OpenBenchmarking).

Canonical unilaterally switches the Banshee default music store to Ubuntu One (original blog post, update, and Mark Shuttleworth's view)

Red Hat stops shipping broken-out kernel patches for RHEL 6 which causes an uproar in the community and charges of GPL violations. It actually happened earlier, but came to light in February. (LWN coverage: Enterprise distributions and free software and Red Hat and the GPL; Red Hat statement).

March

The vendor-sec mailing list and its host are compromised (announcement, LWN coverage).

Golden rule #12: When the comments do not match the code, they probably are both wrong.

-- Steven Rostedt

[Scientific Linux logo]

Scientific Linux 6.0 is released. (announcement).

The Yocto project and OpenEmbedded "align" both in terms of governance and technology, which should result in less fragmentation in the building of embedded Linux systems (announcement).

Linux 2.6.38 is released (announcement, KernelNewbies summary, and Who wrote 2.6.38). [openSUSE logo]

openSUSE 11.4 is released (announcement, LWN review).

Linus Torvalds starts loudly complaining about the ARM kernel tree, which leads to a large effort to clean it all up (linux-kernel post, LWN article).

If it's some desperate cry for attention by somebody, I just wish those people would release their own sex tapes or something, rather than drag the Linux kernel into their sordid world.

-- Linus Torvalds is unimpressed by the Bionic GPL violation claims

Fraudulent SSL certificates issued by UserTrust (part of Comodo) are found in the wild (LWN blurb, article and follow-up).

Android's Bionic C library comes under fire for alleged GPL violations, though it appears to be a concerted fear, uncertainty, and doubt (FUD) campaign (LWN article).

Microsoft sues Barnes & Noble over alleged patent infringement in the Android-based Nook ebook reader (LWN blurb and article).

The worst part about Comodo's letter to the public was how they claimed that they never thought a nation state would attack them. If that's not part of your threat model, what business do you have being part of Internet infrastructure?

-- Dave Aitel

Firefox 4 is released, marking the beginning of Mozilla's new quarterly release schedule (announcement).

Google chooses not to release its tablet-oriented Android 3.0 ("Honeycomb") source code, because it isn't ready for both tablets and handsets (LWN article). [Monotone logo]

The Monotone distributed version control system releases its 1.0 version (announcement).

GCC 4.6.0 is released (LWN blurb, release notes).


Page editor: Jonathan Corbet


Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds