
Experimental applications of GStreamer

By Nathan Willis
October 28, 2015

GStreamer Conference

Most free-software conferences do an excellent job when it comes to providing a program of informative talks from project representatives and developers. Far fewer succeed at attracting many sessions from end users exploring novel or otherwise interesting uses of free software, but GStreamer Conference is among the events that do. The 2015 edition of the conference was no exception. Among the talks about applications of GStreamer, the standouts included a session on acoustic location triangulation and a talk about streaming zoomable ultra-high-definition video through a clever use of tiling.

Audio triangulation

Jan Schmidt spoke about location determination with GStreamer. It should be noted, of course, that Schmidt is a GStreamer maintainer, but the project he covered in this session was entirely an extra-curricular exercise. In fact, he started the talk by explaining that the idea struck him just a few months ago, and he only began working on it after his talk proposal was accepted.

[Jan Schmidt]

The idea is straightforward. If multiple microphones are placed in known positions in a room, a program can calculate the position that a sound originates from by measuring the relative time at which each microphone records the sound—after adjusting for any processing and network-transmission delays, that is. The GStreamer 1.6 release added high-precision network clock synchronization and the ability to report network statistics, Schmidt said, so it occurred to him that a network of GStreamer client applications might now be usable as an acoustic triangulation system.
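
The core computation is a standard time-difference-of-arrival (TDOA) problem: with synchronized clocks, only the differences between arrival times matter, which is why the network clock's accuracy is the limiting factor. A toy sketch in Python (NumPy/SciPy, with hypothetical microphone positions and arrival times, not anything taken from Schmidt's code) illustrates the idea:

    # Toy TDOA solver; positions and timings are made-up illustration values.
    import numpy as np
    from scipy.optimize import least_squares

    SPEED_OF_SOUND = 340.29  # m/s

    # Known microphone positions in the room, (x, y) in meters (hypothetical).
    mics = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0]])

    # Arrival times on the shared clock, in seconds (hypothetical values
    # consistent with a source near x=1, y=1).
    arrivals = np.array([0.104156, 0.109293, 0.110596, 0.106571])

    def residuals(pos):
        # Distance from the candidate position to each microphone...
        dists = np.linalg.norm(mics - pos, axis=1)
        # ...converted to arrival-time differences relative to microphone 0.
        predicted = (dists - dists[0]) / SPEED_OF_SOUND
        measured = arrivals - arrivals[0]
        return predicted - measured

    # Find the position that best explains the measured time differences.
    solution = least_squares(residuals, x0=np.array([2.0, 1.5]))
    print("estimated source position (m):", solution.x)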

The specifics are important, of course. The speed of sound is 340.29 m/s, or about 34 cm per millisecond. GStreamer 1.6's transmission overhead on a WiFi signal is about 2 ms, which should result in an accuracy of under one meter. At the very least, he said, that would be good enough for Internet of Things (IoT) usage, such as allowing the user to speak voice commands and have appliances decide by proximity what device (say, a lamp) the user is speaking to.

To test the idea, Schmidt adapted an earlier personal project: Aurena, his GStreamer-based whole-house audio distribution system. Aurena used the Real Time Streaming Protocol (RTSP) to play audio from a server simultaneously on multiple client devices. But to support sending microphone audio from the clients back to the server, he had to write an RTSP recording element. The server handles clock synchronization with the clients and sets up the audio-processing pipeline.
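
The clock-synchronization piece builds on GStreamer's network-clock support rather than on anything Aurena-specific. A minimal sketch of the pattern, assuming the Python GStreamer bindings and placeholder host and port values, might look like this:

    # Sketch only: the server publishes its pipeline clock, and a client
    # slaves its pipeline to that clock so timestamps are comparable.
    import gi
    gi.require_version("Gst", "1.0")
    gi.require_version("GstNet", "1.0")
    from gi.repository import Gst, GstNet

    Gst.init(None)

    # Server side: expose the pipeline clock over UDP (port is arbitrary).
    server = Gst.parse_launch("audiotestsrc ! autoaudiosink")
    provider = GstNet.NetTimeProvider.new(server.get_pipeline_clock(), None, 8554)

    # Client side: slave to the server's clock instead of the local one.
    client = Gst.parse_launch("autoaudiosrc ! queue ! fakesink")
    net_clock = GstNet.NetClientClock.new("net_clock", "192.0.2.10", 8554, 0)
    net_clock.wait_for_sync(Gst.CLOCK_TIME_NONE)  # block until synchronized
    client.use_clock(net_clock)
    client.set_state(Gst.State.PLAYING)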

Since the devices he had on hand to test with (mostly Android phones) varied as to whether they recorded mono or stereo sound, some processing was required on the server to normalize the input. But the "magic correlation step," he said, had already been solved by other people. To perform the triangulation, he used a package called ManyEars that was developed by robotics researchers.
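
Combining the normalized client streams into one multichannel signal can be done with stock GStreamer elements. The sketch below follows the documented interleave example, with two local test sources standing in for the eight network-delivered streams; it is not Schmidt's actual pipeline:

    # Interleave two mono test streams into a single multichannel file;
    # the real setup would feed eight client streams into interleave.
    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)

    pipeline = Gst.parse_launch(
        "interleave name=mix ! audioconvert ! wavenc ! filesink location=combined.wav "
        "audiotestsrc num-buffers=500 freq=440 ! audioconvert ! "
        "  audio/x-raw,channels=1 ! queue ! mix.sink_0 "
        "audiotestsrc num-buffers=500 freq=880 ! audioconvert ! "
        "  audio/x-raw,channels=1 ! queue ! mix.sink_1 "
    )

    pipeline.set_state(Gst.State.PLAYING)
    pipeline.get_bus().timed_pop_filtered(
        Gst.CLOCK_TIME_NONE, Gst.MessageType.EOS | Gst.MessageType.ERROR)
    pipeline.set_state(Gst.State.NULL)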

It was at that stage, however, that he began to run into difficulty. ManyEars itself was not an issue, although it is designed to work with eight microphones precisely placed at the vertices of a cube. GStreamer had no problem combining the client audio streams into the eight-channel signal expected by ManyEars. And it is certainly possible to transform the geometry of a different microphone arrangement (at least in non-pathological cases) and map the results produced by ManyEars onto another room shape, if one is willing to do the math. But, as it turns out, Android's audio layer thwarts the plan by introducing random delays and latency that GStreamer, at present, cannot adjust for. In his tests, the Android-introduced delays varied between 30 and 100 ms, and were neither predictable nor controllable. Furthermore, some Android devices appear to randomly drop audio packets before they are delivered to the GStreamer client application.

Schmidt decided to introduce a calibration step in an attempt to work around the random-delays problem. The tool, which he demonstrated with multiple Android phones set up around the session room, plays an audio tone from each device, in turn, and records the output on all microphones in order to measure the delay. For now, he is not sure if this approach will pan out, since the Android audio stack's delay factor is so unpredictable that it may not be possible to know for certain that the test sound was played on time. Even a few milliseconds of uncertainty would be enough to destroy the accuracy of the positions calculated.
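
The measurement at the heart of that calibration step is conceptually a cross-correlation: slide the known test tone across the recording and find the offset where the two line up best. A toy version with synthetic signals (not Schmidt's actual tool) is shown below:

    # Estimate the delay of a known test tone inside a recording by
    # cross-correlation; the "recording" here is synthetic.
    import numpy as np

    RATE = 48000  # samples per second

    # The reference tone the server asks a device to play (100 ms at 1 kHz).
    t = np.arange(0, 0.1, 1.0 / RATE)
    tone = np.sin(2 * np.pi * 1000 * t)

    # Fake recording: silence, then the tone arriving 23.7 ms late, plus noise.
    delay_samples = int(0.0237 * RATE)
    recording = np.concatenate([np.zeros(delay_samples), tone, np.zeros(RATE // 10)])
    recording += np.random.normal(scale=0.05, size=recording.size)

    # Cross-correlate and take the lag with the strongest match.
    corr = np.correlate(recording, tone, mode="valid")
    print("estimated delay: %.2f ms" % (np.argmax(corr) / RATE * 1000.0))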

That said, the general approach may still be useful for non-Android devices, and there was considerable interest from the audience in seeing where the project heads next. In an era when more and more "smart" household devices start listening to us, perhaps GStreamer will allow developers to do something useful with all the microphones—apart from relaying information through the cloud to advertisers and service providers.

Tiled streaming

Arjen Veenhuizen from the Dutch research institute TNO presented the session about tiled video streaming. The root problem that his development team is out to solve is how to cope with the disparity between the ever-higher resolutions offered by video content and the limited capabilities of mobile devices—which make up a sizable percentage of the screens to which video is delivered.

TNO has been working on a solution that splits a source video stream into a set of tiled sub-streams, any one of which can be delivered separately to a client device. The example that Veenhuizen gave was of a live sporting event like a track meet; viewers are unlikely to want to watch the stadium-wide feed, preferring instead a high-quality feed of just one portion of the field. That way, each user can get HD-quality video, but with the freedom to zoom in or out on a different portion of the source stream at will.

[Arjen Veenhuizen]

The solution that TNO has developed (which it is testing with an arena in Amsterdam) uses GStreamer to stitch together several camera images into a seamless 6K video stream, then divide the total camera area into multiple "region of interest" (ROI) streams. As it is currently deployed, each camera at the arena is attached to an H.264 encoder; those streams produce 600-800 Mbps (as compared to 3 Gbps for the raw camera video).
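
Conceptually, producing the ROI streams amounts to running several crop-and-encode branches over the same stitched frame. The sketch below illustrates that idea with stock elements (tee, videocrop, x264enc) and made-up geometry; it is not TNO's production pipeline:

    # Crop two "regions of interest" out of one source and encode each as
    # its own H.264 stream. Geometry and element choices are illustrative.
    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)

    pipeline = Gst.parse_launch(
        "videotestsrc num-buffers=300 ! video/x-raw,width=1920,height=1080 ! tee name=t "
        # ROI 1: left half of the frame.
        "t. ! queue ! videocrop right=960 ! x264enc tune=zerolatency ! "
        "  mp4mux ! filesink location=roi_left.mp4 "
        # ROI 2: right half of the frame.
        "t. ! queue ! videocrop left=960 ! x264enc tune=zerolatency ! "
        "  mp4mux ! filesink location=roi_right.mp4 "
    )

    pipeline.set_state(Gst.State.PLAYING)
    pipeline.get_bus().timed_pop_filtered(
        Gst.CLOCK_TIME_NONE, Gst.MessageType.EOS | Gst.MessageType.ERROR)
    pipeline.set_state(Gst.State.NULL)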

The camera streams are sent over a dedicated Real Time Messaging Protocol (RTMP) channel on a fiber link to the distribution setup running on a public cloud provider. There, the streams are stitched into the "overview" stream that shows the entire arena, and each ROI stream is stitched together from the relevant camera streams. This step is performed in parallel by a pool of tiling processes (managed by a master process). In addition to tiling the video stream spatially, he noted, the streams are also split up temporally into three-second chunks. Client machines, such as phones and tablets, first tune in to the overview stream; the user can then click to select a sub-stream.
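
The three-second temporal chunking maps naturally onto GStreamer's segment-splitting support. As a hedged sketch, using splitmuxsink with assumed settings rather than TNO's actual configuration:

    # Split an encoded stream into fixed three-second MP4 segments.
    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)

    pipeline = Gst.parse_launch(
        "videotestsrc num-buffers=900 ! "
        "video/x-raw,width=1280,height=720,framerate=30/1 ! "
        # Force a keyframe every 3 s so splits can land on segment boundaries.
        "x264enc tune=zerolatency key-int-max=90 ! h264parse ! "
        "splitmuxsink location=segment_%05d.mp4 max-size-time=3000000000"
    )

    pipeline.set_state(Gst.State.PLAYING)
    pipeline.get_bus().timed_pop_filtered(
        Gst.CLOCK_TIME_NONE, Gst.MessageType.EOS | Gst.MessageType.ERROR)
    pipeline.set_state(Gst.State.NULL)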

The number of tiles varies; Veenhuizen said the team has worked with anywhere from 21 to 90. At the moment, the system is dominated by CPU-bound processes; the team uses 32-processor cloud instances—which is the maximum available. It would be better to use GPUs for the stitching and tiling, he said, but so far no cloud provider offers such a service.

He reported that the project has shown GStreamer to be "extremely stable" on such a large-scale project—and seems to suffer no performance hit when being run inside Docker containers. Streams will run for days at a time uninterrupted, producing multiple terabytes of video. In addition, GStreamer's scaling and cropping operations have proven to be high-performance; on average the entire processing pipeline only introduces four to five seconds of latency.

At the moment, though, the team is working on implementing a GStreamer-based client application for the mobile devices, which is proving tricky. The requirements are steep even on the client side: users want fast switching between different ROI streams, interactivity, and frame-accurate synchronization between the available streams.

Looking forward, he said the team hopes to get the video format and streaming protocol standardized in MPEG DASH (Dynamic Adaptive Streaming over HTTP). Eventually, they also want to get the process working on the next generation of video codecs. The primary target is H.265, although that codec is renowned for being substantially slower to encode than H.264, which presents a practical problem for a project already maxing out the machines available from a cloud provider.

At first glance, it might seem like Schmidt's audio-triangulation project and TNO's ultra-high-definition video streaming project have little in common. One is a single-handed, hobbyist effort, while the other is a large-scale cloud-computing–based service. But it is interesting to note that both have to deal with the realities of realtime network streaming, and both are running into problems with GStreamer application development on mobile device platforms. No doubt the GStreamer developers picked up on the issues of importance as well—mobile support was, in fact, one of the "hot topics" raised by Tim-Philipp Müller in his opening state-of-the-project session. And gathering users with diverse use cases in the same room as the developers is always a wise first step toward solving problems.

[The author would like to thank the Linux Foundation for travel assistance to attend GStreamer Conference.]




Experimental applications of GStreamer

Posted Oct 29, 2015 9:52 UTC (Thu) by paulj (subscriber, #341) [Link]

Any old arena in Amsterdam, or Amsterdam Arena?

Experimental applications of GStreamer - tiled streaming

Posted Oct 29, 2015 17:04 UTC (Thu) by jnareb (subscriber, #46500) [Link] (1 responses)

> It would be better to use GPUs for the stitching and tiling, he said, but so far no cloud provider offers such a service.

Actually, there are cloud providers that offer GPGPU computing access. For example, QwikLab, which offers paid CUDA and OpenACC courses (the first lab in the series is free), runs on the Amazon cloud.

Experimental applications of GStreamer - tiled streaming

Posted Oct 31, 2015 19:29 UTC (Sat) by sciurus (guest, #58832) [Link]

Yes, AWS offers GPU instances with GRID K520 and Tesla M2050 cards from NVIDIA.

Audio location

Posted Oct 29, 2015 18:50 UTC (Thu) by smurf (subscriber, #17840) [Link] (4 responses)

Why would the absolute timing of audio at the receiver matter?
Send a ping to each speaker, at the exact same time but at different frequencies, then measure relative timing at the receiving end.
With known (fixed) speaker locations, calculating the receiver's position is a non-problem these days.

Audio location

Posted Nov 5, 2015 11:24 UTC (Thu) by oldtomas (guest, #72579) [Link] (3 responses)

> Why would the absolute timing of audio at the receiver matter?

It wouldn't if it were constant. But (from the article):

"the Android-introduced delays varied between 30 to 100ms"

I read that as having an uncertainty of roughly 70ms. Of course, if you could develop a solid statistical model of this uncertainty you could try to pick up the signal from the noise, but that sounds like a tall order compared to just cancelling out a (per-device) constant...

Audio location

Posted Nov 8, 2015 15:16 UTC (Sun) by smurf (subscriber, #17840) [Link] (2 responses)

You didn't understand my idea; let me rephrase.

As I understand it, the senders don't have delay issues, they're all on the receiving end.

Thus, instead of pinging one speaker at a time and measuring the absolute time the pulse takes to arrive at the listener, you ping all of them at the same time, record that sound, and use digital processing to figure out the delays between pings. (Thus different frequencies, otherwise you can't tell the pings apart.) This way, delays on the receiver side won't matter, it's all a single sound recorded once by a single microphone.

Audio location

Posted Nov 12, 2015 15:31 UTC (Thu) by nye (subscriber, #51576) [Link] (1 responses)

I don't see how that helps in any way with the problem that the latency is wildly variable?

Audio location

Posted Nov 13, 2015 19:53 UTC (Fri) by smurf (subscriber, #17840) [Link]

Latency may be variable, but once you start a recording, it, well, records like it's supposed to. Thus, if the sounds from all the speakers are on a single recording and you evaluate the differences in timing between speakers instead of the sounds' absolute arrival times, there's no problem.


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds