That's how my audio pipeline works. I have a system written from scratch--i.e. no FFmpeg--that pulls Internet radio (Windows Media, Shoutcast, Flash, or 3GPP; over HTTP or RTSP; in MP3 or AAC) and transcodes to a requested codec and format (any of the same combinations), resampling and possibly splicing in a third stream.
If the decoder produces audio that isn't in a raw format the encoder can handle (wrong number of channels, sample rate, etc.), then the controller transforms it before passing it to the encoder. Ideally both the encoder and decoder handle the widest possible range of formats, because interleaving and resampling are incredibly slow; the cost is mostly memory bandwidth, not CPU cycles spent on the conversion itself. But sometimes you have to resample or change the channel count anyway, because that's what's wanted downstream, even if the encoder could handle the original format.
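That controller step looks roughly like this (a sketch, not my actual code; the struct, function names, and the stereo-to-mono case are just illustrative):

    /* Sketch of the controller's conform step.  Only convert when the
     * formats disagree; the fast path returns the decoder's buffer
     * untouched, since copying is what eats memory bandwidth. */
    #include <stddef.h>

    struct pcm_format {
        int sample_rate;    /* e.g. 44100 */
        int channels;       /* e.g. 2 */
    };

    /* One example conversion: downmix interleaved stereo to mono. */
    static void stereo_to_mono(const short *in, size_t nframes, short *out)
    {
        for (size_t i = 0; i < nframes; i++)
            out[i] = (short)(((int)in[2*i] + in[2*i + 1]) / 2);
    }

    static const short *conform(const short *pcm, size_t nframes,
                                struct pcm_format have, struct pcm_format want,
                                short *scratch)
    {
        if (have.channels == want.channels &&
            have.sample_rate == want.sample_rate)
            return pcm;                         /* fast path: no copy */
        if (have.channels == 2 && want.channels == 1 &&
            have.sample_rate == want.sample_rate) {
            stereo_to_mono(pcm, nframes, scratch);
            return scratch;
        }
        /* Resampling would go here; omitted for brevity. */
        return pcm;
    }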
The server design can handle close to 20 unique transcoded streams per CPU on something like a Core 2 (averaging 3-4% CPU time per stream)--the server doesn't use threads at all; each process is fully non-blocking, driven by an event loop. (It can also reflect internally, which significantly increases the number of output streams possible.)
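The shape of that loop, very roughly (a minimal poll(2) sketch, not the actual server; stream_io stands in for the per-stream state machine):

    /* Minimal single-process event loop sketch.  Every descriptor is
     * non-blocking, so one process multiplexes all streams with no threads. */
    #include <poll.h>
    #include <stdio.h>

    static void stream_io(int fd)
    {
        (void)fd;   /* decode/encode whatever this stream has ready */
    }

    static void event_loop(struct pollfd *fds, int nfds)
    {
        for (;;) {
            int n = poll(fds, (nfds_t)nfds, -1); /* sleep until a stream is ready */
            if (n < 0) {
                perror("poll");
                return;
            }
            for (int i = 0; i < nfds; i++)
                if (fds[i].revents & (POLLIN | POLLOUT))
                    stream_io(fds[i].fd);
        }
    }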
Systems which spin on gettimeofday--or rely on some other tight loop with fine-grained timeouts--are wasteful, too. There are various ways to optimize clocking by being smart about how you poll and buffer I/O; you can usually gauge the relationship between I/O and samples quite readily. For example, a single AAC frame always produces 1024 samples*. So even if the size of a particular frame isn't fixed, you can queue up a batch of frames in one big gulp, know how many seconds of audio you have, sleep longer, and then do a burst of activity, letting the output encoder buffer on its end if necessary. If you need a tightly timed loop to feed a device, it should live in its own process or thread, separate from the other components, so it doesn't hinder optimal buffering.
[*AAC can also produce 960 samples per frame, though I've never seen it in practice; in any event it's in the metadata. MP3 encodes 1152 samples per frame (576 in the MPEG-2 low-sample-rate variants). So if you know the sample rate and the number of samples, you know exactly how many seconds of compressed audio you have.]
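The arithmetic behind that big-gulp pacing is trivial (a sketch, assuming AAC's 1024 samples per frame; the 0.75 slack factor is a made-up number, not something from my pipeline):

    #include <time.h>

    #define SAMPLES_PER_FRAME 1024   /* AAC; see the footnote above */

    /* Seconds of audio represented by `nframes` compressed frames. */
    static double buffered_seconds(unsigned nframes, unsigned sample_rate)
    {
        return (double)nframes * SAMPLES_PER_FRAME / sample_rate;
    }

    /* After reading a gulp of frames, sleep through most of their
     * duration, keeping some slack so downstream never runs dry. */
    static void pace(unsigned nframes, unsigned sample_rate)
    {
        double secs = buffered_seconds(nframes, sample_rate) * 0.75;
        struct timespec ts;
        ts.tv_sec  = (time_t)secs;
        ts.tv_nsec = (long)((secs - (double)ts.tv_sec) * 1e9);
        nanosleep(&ts, NULL);
    }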
My pipeline can do double or triple the work that FFmpeg, Vorbis, and the others can handle, even though it's passing frames over a socket pair (the backend process decodes protocols, formats, and codecs but encodes only to a specific codec; the frontend encodes to a particular format and protocol; I did this for simplicity and security). It's a bit of a shame, because I'm no audiophile, and many of the engineers on those teams are far more knowledgeable about the underlying coding algorithms.
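The two-process split is the classic fork-plus-socketpair arrangement; something like this sketch (names and the stand-in I/O are illustrative, not the real code):

    /* Sketch: the backend does the risky work (protocol/format/codec
     * decode) and writes raw frames over a socket pair; the frontend
     * reads them and encodes to one codec/protocol. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void backend(int fd)   /* decode side: could drop privileges here */
    {
        const char frame[] = "raw-pcm-frame";   /* stand-in for decoded PCM */
        write(fd, frame, sizeof frame);
        close(fd);
    }

    static void frontend(int fd)  /* encode side: one codec, one protocol */
    {
        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            fwrite(buf, 1, (size_t)n, stdout);  /* stand-in for encoding */
        close(fd);
    }

    int main(void)
    {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
            perror("socketpair");
            return 1;
        }
        if (fork() == 0) { close(sv[1]); backend(sv[0]); _exit(0); }
        close(sv[0]);
        frontend(sv[1]);
        return 0;
    }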
Adding video into the mix does add complexity, but you can be smart about it. All the same optimization possibilities apply, and synchronization between the audio and video streams isn't computationally complex by itself; it's all about managing I/O intelligently. Like I said earlier, pipelines should be separated completely from the player (which might need to drop frames or add filler to synchronize playback). It wouldn't be a bad idea at all to write a player that only knows how to play back RTSP, and then write a back-end pipeline that produces RTSP channels. That's a useful kind of abstraction missing entirely from all the players I've seen. RTSP gives you only rough synchronization, so the back-end can be highly optimized; the client can then handle the fine-grained synchronization. Overall you're using your resources far better than by trying to hack everything into one large callback chain.
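The player-side decision per video frame really is that simple: compare the frame's timestamp against the audio clock and drop, repeat, or show (a sketch; the 40 ms threshold and the names are made up):

    /* Sketch of player-side fine-grained A/V sync: drop late frames,
     * repeat the previous one when video runs early. */
    enum sync_action { SHOW, DROP, REPEAT_PREVIOUS };

    #define SYNC_SLACK 0.040   /* 40 ms of tolerated skew (assumed value) */

    enum sync_action video_sync(double video_pts, double audio_clock)
    {
        double skew = video_pts - audio_clock;
        if (skew < -SYNC_SLACK)
            return DROP;             /* video is late: skip this frame */
        if (skew > SYNC_SLACK)
            return REPEAT_PREVIOUS;  /* video is early: add filler */
        return SHOW;
    }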