
Leading items

FLOSS speech recognition

By Jake Edge
July 24, 2013

Akademy 2013

Peter Grasch spoke on the second day of Akademy about the status of FLOSS speech recognition. The subtitle of his talk asked "why aren't we there yet?" and Grasch provided some reasons for that, while also showing that we aren't all that far behind the proprietary alternatives. Some focused work from more contributors could make substantial progress on the problem.

[Peter Grasch]

"Speech recognition" is not really a useful term in isolation, he said. It is like saying you want to develop an "interaction method" without specifying what the user would be interacting with. In order to make a "speech recognition" system, you need to know what it will be used for.

Grasch presented three separate scenarios for speech recognition, which were the subject of a poll that he ran on his blog. The poll was to choose which of the three he would focus his efforts on for his Akademy talk. There are no open source solutions for any of the scenarios, which makes them interesting to attack. He promised to spend a week—which turned into two—getting a demo ready for the conference.

The first option (and the winning choice) was "dictation", turning spoken words into text. Second was "virtual assistant" technology, like Apple's Siri. The last option was "simultaneous translation" from one spoken language to another.

Basics

Speech recognition starts with a microphone that picks up some speech, such as "this is a test". That speech gets recorded by the computer and turned into a representation of its waveform. Then, recognition is done on the waveform ("some magic happens there") to turn it into phonemes, which are, essentially, the individual sounds that make up words in a language. Of course, each speaker says things differently, and accents play a role, which makes phoneme detection a fairly complex, probabilistic calculation with "a lot of wiggle room".

From the phonemes, there can be multiple interpretations, "this is a test" or "this is attest", for example. Turning waveforms into phonemes uses an "acoustic model", but a "language model" is required to differentiate multiple interpretations of the phonemes. The language model uses surrounding words to choose the more likely collection of words based on the context. The final step is to get the recognized text into a text editor, word processor, or other program, which is part of what his project, Simon, does. (LWN reviewed Simon 0.4.0 back in January.)
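To make the pipeline concrete, here is a minimal sketch of driving such a recognizer from Python, assuming the classic pocketsphinx bindings for the CMU Sphinx toolkit discussed below. The model paths and audio file are hypothetical placeholders; this is an illustration, not Simon's actual code:

    # Sketch: decoding recorded audio with the classic pocketsphinx Python
    # bindings. The model paths are placeholders; real acoustic and language
    # models (such as the VoxForge-based ones described below) are required.
    from pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', 'models/acoustic')             # acoustic model
    config.set_string('-lm', 'models/language.lm')           # language model
    config.set_string('-dict', 'models/pronunciation.dict')  # word -> phonemes

    decoder = Decoder(config)
    decoder.start_utt()
    with open('this_is_a_test.raw', 'rb') as audio:  # 16kHz, 16-bit mono PCM
        while True:
            buf = audio.read(1024)
            if not buf:
                break
            decoder.process_raw(buf, False, False)
    decoder.end_utt()

    # The decoder weighs acoustic and language-model scores to pick the most
    # probable word sequence ("this is a test" rather than "this is attest").
    hyp = decoder.hyp()
    print(hyp.hypstr if hyp else "(no recognition)")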

In free software, there are pretty good algorithms for handling the models, from CMU Sphinx for example, but there are few free models. Grasch said that open source developers do not seem interested in creating acoustic or language models. For his work, he used CMU Sphinx for the algorithms, though there are other free choices available. For the models, he wanted open source options because proprietary models are "no fun"—and expensive.

For the acoustic model, he used VoxForge. Creating an acoustic model requires recordings of "lots and lots" of spoken language, all of which is fed to a learning system that builds the model. VoxForge has a corpus of such training data, which Grasch used for his model.

The language model is similar, in that it requires a large corpus of training data, but it is all text. He built a corpus of text by combining data from Wikipedia, newsgroups (which contained a lot of spam that biased his model somewhat), US Congress transcriptions (which don't really match his use case), and so on. The result was around 15GB of text that was used to build the model.
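The statistical machinery behind such a language model is conceptually simple. The toy sketch below, which is purely illustrative rather than the CMU toolchain Grasch actually used, estimates bigram probabilities from a text corpus:

    # Sketch: a toy bigram language model trained on a text corpus. Production
    # toolchains add better smoothing, pruning, and an ARPA output format.
    from collections import Counter

    def train_bigrams(corpus_lines):
        unigrams, bigrams = Counter(), Counter()
        for line in corpus_lines:
            words = ['<s>'] + line.lower().split() + ['</s>']
            unigrams.update(words)
            bigrams.update(zip(words, words[1:]))
        return unigrams, bigrams

    def bigram_prob(unigrams, bigrams, prev, word):
        # Add-one smoothing keeps unseen pairs from having zero probability.
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))

    uni, bi = train_bigrams(["this is a test", "this is a demo"])
    # "a test" comes out as more probable than "a attest" under this model:
    print(bigram_prob(uni, bi, 'a', 'test'), bigram_prob(uni, bi, 'a', 'attest'))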

Demos

Before getting to the demo of his code, Grasch showed two older videos of users trying out the speech recognition in Windows Vista, with one showing predictably comical results; the other worked well. When Vista was introduced six years ago, its speech recognition was roughly where Simon is today, Grasch said. It is popular to bash Microsoft, but its system is pretty good, he said; live demos of speech recognition often show "wonky" behavior.

To prove that point, Grasch moved on to his demos. The system he was showing is not something that he recommends for general use. It is, to some extent, demo-ware that he put together in a short time to try to get people excited about speech recognition so they would "come to the BoF later in the week."

He started by using the KNotes notepad application and said "Hello KDE developers period" into the microphone plugged into his laptop. After a pause of a second or so, "Hi Leo kde developers." showed up in the notepad. He then said "new paragraph this actually works better than Microsoft's system I am realizing period", which showed up nearly perfectly (just missing the possessive in "Microsoft's") to loud applause—applause that resulted in a gibberish "sentence" appearing a moment or two later.

Grasch was using a "semi-professional microphone", the same one he used to train the software, so his demo was done under something close to perfect conditions. Under those conditions, he gets a word error rate (WER) of 13.3%, "which is pretty good". He ran the same test set through the Google speech recognition API, which got a WER of around 30%.

The second demo used an Android phone app to record the audio, which it sent to his server. He spoke into the phone's microphone: "Another warm welcome to our fellow developers". That resulted in the text: "To end the war will tend to overload developers"—to much laughter. But Grasch was unfazed: "Yes! It recognized 'developers'", he said with a grin. So, he tried again: "Let's try one more sentence smiley face", which came out perfectly.

Improvements

To sum up, his system works pretty well in optimal conditions and it "even kinda works on stage, but not really", Grasch said. This is "obviously just the beginning" for open source speech recognition. He pointed to the "80/20 rule", under which the first 80% is relatively easy, and noted that that 80% has not yet been reached. That means there is a lot of "payoff for little work" at this point. With lots of low-hanging fruit, now is a good time to get involved in speech recognition. Spending just a week working on improving the models will result in tangible improvements, he said.

Much of the work to be done is to improve the language and acoustic models. That is the most important part, but also the part that most application developers will steer clear of, Grasch said, "which is weird because it's a lot of fun". The WER is the key metric for the models, and the lower that number is, the better. With a good corpus and set of test samples, which are not difficult to collect, you can change the model, run a few minutes of tests, and see whether the WER has gone up or down.
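WER itself is just word-level edit distance: the minimum number of substitutions, insertions, and deletions needed to turn the recognizer's output into the reference transcript, divided by the length of the reference. A minimal implementation might look like this:

    # Sketch: word error rate (WER) as word-level Levenshtein distance.
    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
        dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dist[i][0] = i
        for j in range(len(hyp) + 1):
            dist[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                                 dist[i][j - 1] + 1,         # insertion
                                 dist[i - 1][j - 1] + cost)  # substitution
        return dist[-1][-1] / len(ref)

    # Two errors (one substitution, one insertion) against three words:
    print(wer("hello kde developers", "hi leo kde developers"))  # ~0.67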

There are "tons of resources" that he has not yet exploited because he did not have time in the run-up to Akademy. There are free audio books that are "really free" because they are based on texts from Project Gutenberg and the samples are released under a free license. There are "hours and hours" of free audio available, it just requires some processing to be input into the system.

Another way to acquire samples is to use the prototype for some task. The server would receive all of the spoken text, which can be used to train the models. This is what the "big guys" are doing, he said. Google's speech recognition service may look free, but Google is using it to collect voice samples. This is "not evil" as long as it is all stored anonymously, he said.

There are also recordings of conference talks that could be used. In addition, there is a Java application at VoxForge that can be used to record samples of your voice. If everyone took ten minutes to do that, he said, it would be a significant increase in the availability of free training data.

For the language model, more text is needed. Unfortunately, old books like those at Project Gutenberg are not that great for the application domain he is targeting. Text from sources like blogs would be better, he said. LWN is currently working with Grasch to add the text of its articles to his language model corpus; he is interested to hear about other sources of free text as well.

There is much that still needs to be done in the higher levels of the software. For example, corrections (e.g. "delete that") are rudimentary at the moment. Also, the performance of the recognition could be increased. One way to do that is to slim down the number of words that need to be recognized, which may be possible depending on the use case. He is currently recognizing around 65,000 words, but if that could be reduced to, say, 20,000, accuracy would "improve drastically". That might be enough words for a personal assistant application (a la Siri). Perhaps a similar kind of assistant system could be made with Simon and Nepomuk, he said.

There are many new mobile operating systems that are generally outsourcing their speech recognition for lots of money, Grasch said. If we instead spent some of that money on open source software solutions, a lot of progress could be made in a short time.

Building our own speech models has advantages beyond just being "free and open source and nice and fluffy", he said. For example, if there were interest in being able to program via speech, a different language model could be created that included JavaScript, C++, Qt function names, and the like. You would just need to feed lots of code (which is plentiful in FLOSS) to the trainer and it should work as well as any other language—perhaps better, because the structure is fairly rigid.

Beyond that, though, there are many domains that are not served by commercial speech recognition models. Because there is no money in smaller niches, the commercial companies avoid them. FLOSS solutions might not. Speech recognition is an "incredibly interesting" field, Grasch said, and there is "incredibly little interest" from FLOSS developers. He is the only active Simon developer, for example.

A fairly lively question-and-answer session followed the talk. Grasch believes that it is the "perceived complexity" that tends to turn FLOSS developers away from working on speech recognition. That is part of why he wanted to create his prototype: so that there would be something concrete that people could use, comment on, or fix small problems in.

Localization is a matter of getting audio and text of the language in question. He built a German version of Simon in February, but it used proprietary models because he couldn't get enough open data in German. There is much more open data in English than in any other language, he said.

He plans to publish the models that he used right after the conference. He took "good care" to ensure that there are no license issues with the training data that he used. Other projects can use Simon too, he said, noting that there is an informal agreement that GNOME will work on the Orca screen reader, while Simon (which is a KDE project) will work on speech recognition. It doesn't make sense for there to be another project doing open source speech recognition, he said.

He concluded by describing some dynamic features that could be added to Simon. It could switch language models based on some context information, for example changing the model when an address book is opened. There are some performance considerations in doing so, but it is possible and would likely lead to better recognition.

[Thanks to KDE e.V. for travel assistance to Bilbao for Akademy.]


Fundraising 101 from the Community Leadership Summit

July 24, 2013

This article was contributed by Martin Michlmayr


CLS 2013

Josh Berkus explained the principles behind fundraising, applicable to all types of projects and organizations, at the Community Leadership Summit (CLS), which was held on the weekend before OSCON. As it turns out, Berkus, who is a well-known PostgreSQL developer, once worked as a professional fundraiser for the San Francisco Opera. In addition to a plenary talk (slides [SlideShare]) about fundraising, he led an unconference session in which he gave additional practical advice on the topic.

Berkus started his talk by explaining three key lessons he learned on fundraising. First, fundraising is a science. "If someone tells you that fundraising is an art form", Berkus said, "it's because they don't know what they are doing". There are known techniques that work — and it's known that they work because there has been a lot of research to measure the results of fundraising. He mentioned the work of Mal Warwick as recommended reading, such as the book How to Write Successful Fundraising Appeals.

Second, fundraising is the same for all organizations. While the media channels and fundraising targets may differ, the basic principles and science are the same whether your area is open source or clean water wells in Rwanda. Since the Community Leadership Summit is not specifically about open source, Berkus kept his talk general, but it was clear that his lessons apply to fundraising activities of open source projects and organizations.

Third, all fundraising is sales. This is "energizing", Berkus said, if you're good at sales, while those not familiar with sales will find it "unnerving, since doing good sales is hard". Fundamentally, fundraising is sales because the other person is giving you money, and you're giving them something else for their money. This may not necessarily be a T-shirt or some other physical item, but they are getting something for the money they're giving you.

Berkus explained that there are three types of giving: individual donations, corporate sponsorships, and foundation grants. Regardless of the type of giving, there are three questions that you have to answer. First, who are they? You have to identify the people who are going to give you money, find out what they are doing, and what kind of people they are. Second, what do they want? You have to ask why it's a good idea for them to give you money. Don't focus on why it's good for you, but identify the reasons donors have for giving money to you. Third, how do you reach them? Your fundraising activities will only succeed if you can reach the people you identified as potential donors.

Individual donations

Individuals are an important source of donations. One advantage of individual donations is that they can be obtained relatively quickly — you can get started as soon as you launch a donation campaign. Individual donations are relatively small, though, and build up slowly. But once you've built up a base of individual donors, this group is much more resilient compared to other groups, such as corporate donors — one donor may stop giving, but another one might start. Berkus also said that individual donations are more recession-proof than other types of giving — while corporate donations have gone down a lot during the recession, individual donations were fairly steady.

Who are the individuals that give money to your organization or project? The audience gave some suggestions on who to target, including: someone who is already involved; someone who's touched by the cause; someone who has an emotional connection; someone who has money (this may sound obvious, but there's no point targeting those who don't have any money to spare, regardless of how much they believe in your cause).

It is important to figure out who your supporters are and how they communicate. "If you don't know who they are, you cannot target them", emphasized Berkus. He added that one of the biggest mistakes projects make is declining to solicit their volunteers because they are already giving their time; volunteers, who are deeply committed to the project, are often your biggest donors.

What do those donors want? Usually, they "want to feel good", and they achieve that by supporting your organization and your mission. "What about stuff?", asked Berkus. While many organizations offer T-shirts in exchange for donations, there is a lot of evidence that people do not donate in order to get stuff. When Berkus offered free tickets to concerts and meetings with singers in his former job, fewer than a third of donors took advantage of the benefits.

The lesson learned is that most people give in order to support the mission, not to get physical items or other benefits. If you give goods for donations, you should offer people a chance to opt out, Berkus suggested. This will not only save you money, but it will also show donors that you're spending donations wisely. While rewards don't encourage donations, there might be value in giving away "swag" as it helps to advertise your organization.

There are several ways to reach potential individual donors. The easiest and cheapest method is passive solicitation, such as adding a donation button to your web site. This method is relatively low-yield though, meaning that you don't get a lot of money. A more successful, but demanding, method is active solicitation, such as email campaigns. Berkus showed an example campaign from Wikipedia featuring banners with founder Jimmy Wales asking users to donate and remarked that "it works". Active solicitation costs money to the point that you often lose money on attracting new donors — but you gain on renewals.

Another method to recruit individual donors is a special appeal. This is where you raise funds for a specific, one-time goal. Such campaigns work because people like well-defined targets. Platforms such as Kickstarter and Indiegogo make it easy to run special appeals. Many individuals and organizations have experimented with crowdfunding in recent times, some with great success. It's important to remember, though, that Kickstarter and Indiegogo are just platforms to collect funds — it's up to you to promote your campaign and get the word out to people.

Events are also a good way to meet potential donors, even though they are often costly to organize and in many cases aren't a direct source of income. Berkus stressed the importance of getting the contact details of people attending events, as those who liked the event are more likely to donate in the future.

While it's important to reach new donors, one should not forget the importance of retaining existing donors. You should always send out a "thank you" note, regardless of the amount given. A newsletter may also be a good idea, since it can be used to show existing and potential donors what you have accomplished. It will also help to keep your project on their minds. Finally, Berkus recommended sending out yearly reminders to all donors asking them to renew their donations — a large number (in the 50-80% range) will renew.

Corporate sponsorship

The best corporate donors are those that are "local to you" — either in a regional sense or in terms of your mission. Most corporations give for marketing reasons. Usually, their main objective is to improve their image. Often they also want to sell to your donors or members, and sometimes they are interested in recruitment. Therefore, the key question to ask is how sponsoring your project will help a company achieve its marketing objectives. Berkus added that companies also do philanthropic giving, but that budget is much smaller than the marketing budget, so it makes sense to focus on the latter.

There are multiple ways to identify and reach out to corporations. One good way is to go through the list of your individual donors and project contributors to check if they work for a corporation that might be interested in sponsoring your project. Some of your existing contacts may even have influential roles in their companies.

Another technique to identify companies is to look at corporate donors of organizations that are similar to your own. Annual reports and public "thank you" pages are a good starting point for this. Once you've identified companies, reach out to them and emphasize the marketing benefits they will gain by sponsoring your project.

Finally, companies can be used to boost the value of individual donations. Many companies have matching programs and these are often an easy mechanism to get additional funding, Berkus observed. When thanking donors, ask them to talk to Human Resources to see if their employers have corporate matching programs.

Foundation grants

There are many foundations and organizations that give grants. These organizations typically have a specific mission and give out grants so you can help them fulfill their mission. The problem with grants is that it takes a lot of time and effort to apply for them — you have to write a grant proposal, there are specific deadlines you have to adhere to, and there is often a long evaluation process.

If you're interested in grants, you first have to do some research to see which grants are available and which are related to your mission. Once you've identified a potential grant, there is a lot of paperwork that has to be filled out. Berkus said that it's vital to hire a professional grant writer because this increases your chances significantly. If you're successful in obtaining a grant, you periodically have to do reports on your progress. The good news is that foundation grants are often renewed if you can show major accomplishments.

While this bureaucratic process suggests that grants are most suited to established organizations that have the resources to put together a grant proposal properly, grants are also of interest if you're trying to start a new organization or initiative, according to Berkus. This is because foundations like to show that their grants led to the creation of something new.

Conclusion

As open source projects and organizations are trying to find new ways to sustain their activities, fundraising is an important skill that many in the open source community will have to learn. Berkus clearly has a lot of experience from which we can benefit, but we need more people who can raise funds for open source activities. Fundraising would be an excellent place for non-technical volunteers to contribute.


Shoot two exposures at once with Magic Lantern

By Nathan Willis
July 23, 2013

Developers from the Magic Lantern (ML) project have enabled a previously unknown capability in high-end Canon digital cameras: the ability to record images and video at two different exposure levels in every frame. Post-processing is required to blend the exposures into a seamless image, but the result is a single shot that covers essentially the entire range of brightness that the camera sensor is capable of seeing. That means high-contrast scenes are visible without sensor noise, without motion artifacts, and without stacking multiple shots in a high dynamic range (HDR) blending program.

The technique in question is currently available as an ML add-on module, in source code form, linked from the first post of the forum discussion thread. It is not part of the stable ML release, although other forum members have built and posted binaries. A separate utility (currently Windows-only, but usable with Wine) is required to convert the camera output into a standard image or video file.

Sensors and sensibility

ML began as an effort to implement extra features for Canon's EOS 5D Mark II cameras—specifically, features to improve the camera's usability when shooting video. Subsequently, the project has taken on creating builds for other camera models with similar internal chip architectures, such as the 5D Mark III and the 7D, and developers have crafted modules that extend the cameras' functionality in ways beyond user interface improvements. The 5D Mark III, in particular, has gained some significant functionality through the ML team's work. For example, back in May the project implemented a raw video recording mode that allowed users to film uncompressed video (at 24 frames per second) at greater-than-1080p resolutions.

Canon's factory firmware shoots video at 1080p, tops, in H.264 compressed format. The camera's image sensor has far more than 1920×1080 pixels, of course; ML's work to move beyond the factory limitations involved reverse engineering the direct memory access (DMA) controller in order to write sensor data to the card more rapidly. As part of that process, ML began mapping out the camera's registers. Thus, it was only a matter of time before someone stumbled across a previously unexploited register address and started doing interesting things with it.

ML developer Alex "a1ex" Dumitrache discovered one such register on the chip that controls the CMOS sensor's "ISO rating" (which emulates the light sensitivity of different film speeds). He noticed that the register value always repeated a digit: 0x003 meant ISO 100 (at the bright-light end of the ISO scale), 0x223 meant ISO 400, 0x553 meant ISO 3200, 0xFF3 meant ISO 12800 (at the low-light end of the scale), and so on. What, he wondered, would happen if those two digits were set to different values, say 0x043?

The answer, it turns out, is that half of the sensor's scan lines will be read at ISO 100, and the others at ISO 1600. These lines are interlaced in pairs, with lines 0 and 1 at one ISO setting followed by 2 and 3 at the other setting, and so forth. This pairing is because the sensor's red, green, and blue pixels are arranged in pairs of lines; a two-by-two square of four pixels is required to get a full RGB triple for any one sample point. The reason that this interlaced setting is possible is that the 5D Mark III has an eight-channel sensor readout, where most Canon cameras have a four-channel readout. The 7D model also has an eight-channel sensor readout, so Dumitrache was also able to perform the dual-ISO sensor setting on that camera. So far, those are the only two camera models to support the feature.
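As a toy illustration of that encoding, the sketch below packs two ISO settings into a register value. The lookup table is reconstructed only from the example values quoted above; the real firmware encoding is certainly more involved:

    # Sketch: packing two ISO settings into the dual-ISO register, based only
    # on the example values in the text (0x003 = ISO 100, 0x223 = ISO 400,
    # 0x553 = ISO 3200, 0xFF3 = ISO 12800, 0x043 = ISO 100/1600).
    ISO_NIBBLE = {100: 0x0, 400: 0x2, 1600: 0x4, 3200: 0x5, 12800: 0xF}

    def dual_iso_register(iso_a, iso_b):
        # One nibble per pair of scan lines; the low nibble (0x3) stays fixed.
        return (ISO_NIBBLE[iso_a] << 8) | (ISO_NIBBLE[iso_b] << 4) | 0x3

    print(hex(dual_iso_register(100, 100)))   # 0x3  (i.e. 0x003, plain ISO 100)
    print(hex(dual_iso_register(100, 1600)))  # 0x43 (alternating ISO 100/1600)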

Dumitrache wrote up the finding in a white paper [PDF] that goes into considerable detail on what happened next. Saving the half-and-half exposure frame as a file was no trouble, but it took far more work to figure out how best to interpolate the data in a meaningful way. If the two ISO settings are significantly different (as they presumably would be for the stunt to be worth doing), then an object that looks properly-exposed in one would look way too bright or too dark in the other. The darker exposure has less sensor noise, but the brighter exposure has better saturation and color fidelity. In addition, the two exposures are recorded with mismatched white points and black levels, and despite the nice round numbers of ISO ratings, the CMOS image sensor does not respond in simple ways to exposure changes.

He eventually worked out an algorithm to mix pixels from both exposures in the image midtones, and transition smoothly to a single exposure each for the shadow and highlight areas. This initially meant that highlights and shadows were at half the vertical resolution of the midtone parts of the image, with existing lines doubled to fill in for the missing ones. Naturally, this line doubling can create image artifacts (especially in areas like high-contrast edges).
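In outline, such a blend weights each pixel by its brightness: the low-ISO data dominates in the highlights, the high-ISO data dominates in the shadows, and the two are mixed smoothly in between. Here is a simplified numpy sketch of that idea; it is not Dumitrache's actual algorithm, which also corrects white point, black level, and the sensor's nonlinear response:

    # Sketch: brightness-weighted blend of two exposures of the same scene.
    # Assumes 'bright' (high ISO) and 'dark' (low ISO) are already interpolated
    # to full resolution and normalized to the [0, 1] range.
    import numpy as np

    def blend_exposures(bright, dark, lo=0.3, hi=0.7):
        # Weight ramps from 0 (use the clean high-ISO shadows) to 1 (use the
        # unclipped low-ISO highlights) as scene brightness increases.
        w = np.clip((bright - lo) / (hi - lo), 0.0, 1.0)
        return (1.0 - w) * bright + w * dark

    scene = np.array([0.05, 0.5, 0.95])   # shadow, midtone, highlight samples
    print(blend_exposures(scene, scene * 0.8))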

Subsequently, however, he has continued to work on interpolating the two exposures, posting samples to the ML discussion forum thread, and now claims to recover almost the entire resolution of the image without noticeable quality loss. There are a variety of sample images in the discussion thread that showcase how the technique compares to a single-ISO exposure. A particularly good example is Jay Birch's photo: capturing both a person sitting inside in the shade and the daylight scene outside the window would normally require merging several photos. There are a number of video samples in the thread as well; MA Visuals' Vimeo offering is a good place to start.

Usage and impact

For now, Dumitrache's command line conversion tool cr2hdr is required to convert the dual-ISO frames into Digital Negative (DNG) format. The tool is licensed as GPLv2+, but it is still under development. So far the blending algorithm is hardcoded in; as he makes adjustments he updates the posted version of the file, but there are no user-tweakable switches.

The frames are recorded to the storage card with the .CR2 file extension used by the cameras for normal shots, but they obviously deviate quite a bit from the "correct" CR2 format. Once converted, stills can be opened in any raw photo tool that supports DNG, which includes most of the free software applications (such as UFraw, Rawstudio, or Darktable). Video editing may be trickier; at least, no one on the ML forum appears to have attempted any dual-ISO video on Linux.

Using the dual-ISO module on the camera is straightforward: one sets the "base" ISO on the camera's physical dial, and the second (or "recovery") ISO in the dual-ISO menu added by the module. The technique does not work with the camera's "Auto ISO" mode. But the big plus is that a shot taken in dual-ISO mode is captured all at once. There are other ways to take two exposures and blend them together, but they come with serious trade-offs.

For example, ML can automatically "bracket" shots: record two sequential frames at different exposures, so that they can then be blended together with a software tool like Hugin or Luminance HDR. ML can even record "HDR video," which it does by taking alternate frames at different exposure levels, then merging them together in pairs.

The drawback of both of these options is that the two exposures merged are taken at different times—only slightly different times, admittedly, but still different. Blending them when they are not perfectly aligned costs image detail. This is a problem when the subject moves, but even the slightest amount of camera shake between them will cause two sequential stills to lose pixel-perfect alignment. Thus, even though the ML dual-ISO shot sacrifices some scan lines in the extreme tones of the image, the two-shot approach loses resolution when the images are merged as well. For video, the problem is even worse, since a moving object will introduce jitter when the sequential frames are merged. There are other products on the market (such as the top-dollar RED digital cinema cameras) that can shoot "HDR video", but they too merge sequential frames.

In addition, Dumitrache contends that the dual-ISO blending algorithm in his command-line tool produces more natural results than merging two bracketed shots anyway. Almost everyone is familiar with the "radioactive" look of poorly merged HDR conversions; the dual-ISO approach does not suffer from that flaw. In part, Dumitrache is simply confident that his blending algorithm produces nicer results than many tone-mapping alternatives, but the math is different, too: his does not have to blend any overlapping pixels. Another nice aspect of the approach is that the ISO settings written to the sensor register are analog amplifications applied before the analog-to-digital conversion step. That provides an original image quality beyond what can be achieved by adjusting exposure in software after the fact.

The practical questions for interested photographers to ask are when taking a dual-ISO shot makes sense and what settings to use. The 5D Mark III's sensor is reported to have a 14-stop dynamic range (i.e., total light-to-dark range), but any one ISO setting only captures a portion of that. Shooting at ISO 100 grabs almost 11 stops, with higher ISOs performing worse and worse. To get the other three stops missed by ISO 100, Dumitrache recommends shooting a recovery exposure of ISO 1600. At higher speeds, there is not any more dynamic range to squeeze out of the sensor, and there is noticeably worse rendition of the midtones.

As to what pictures to take, there are certainly a handful of obvious scenarios in which a dual-ISO image is sure to result in a better photo than a single exposure (one might note, for instance, that almost all of the sample images in the forum thread involve an interior room next to a bright window). But if the original scene doesn't have 14 stops of dynamic range in it, trying to capture them is overkill. That does not detract from the practical value of this new feature, however: the ML team has come up with a remarkable new feature, one that—so far—no other product has matched, free software or otherwise.

It would be nice to see support for these dual-ISO exposures in the various open source photo and video editing applications, but it is not yet clear how feasible that is. On the one hand, ML is a niche project for aftermarket features only. But on the other, if the ML project keeps up this kind of development, it might not stay a niche project for very long.


Page editor: Jonathan Corbet


Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds