LWN: Comments on "Transcribing audio with AI using Speech Note"

Providing context

mwood — Wed, 25 Sep 2024 07:53:30 +0000

> I'd love to be able to seed one with a context.

You actually can! Whisper has an `--initial_prompt` option.

e.g. I tried the example from the article and gave it some context, which allowed it to correctly transcribe "shan't" and "bosh" and the repetition, but for some reason it got horribly confused in the middle.

I gave it the following context:

'This is a reading of a two stanza poem. It contains some old/unusual exclamations and the last line of each stanza contains some repetition like "and WORD, and WORD, and WORD"'

[00:00.000 --> 00:13.720] the cat this is a LibriVox recording all LibriVox recordings are in the public domain for more information or to volunteer please visit LibriVox.org
[00:13.720 --> 00:26.560] the cat advice to the young by harry graham from ruthless rhymes for heartless homes LibriVox coffee break collection number eight
[00:26.560 --> 00:41.280] my children you should imitate the harmless necessary cat who eats whatever's on his plate and doesn't even leave the fat who never stays in bed too late or does immoral things like that
[00:41.280 --> 00:55.080] instead of saying shan't or bosh he'll sit and wash and wash and wash when shadows fall and lights grow dim he sits beneath the kitchen stair
[00:55.080 --> 00:55.880] basta
[00:55.880 --> 00:56.460] ba
[00:56.460 --> 01:03.260] and limb a simple couch he chooses there and if you tumble over him he simply loves to hear you
[01:03.260 --> 01:19.500] swear and while bad language you prefer he'll sit and purr and purr and purr end of the cat by harry graham read by patrick wallace
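
For anyone wanting to reproduce this, here is a minimal sketch of how the prompt might be passed to the `whisper` command-line tool from a script. The file name `the_cat.mp3`, the helper names and the choice of the "small" model are assumptions for illustration; only `--model` and `--initial_prompt` are real options of the tool.

```python
import shlex
import subprocess


def whisper_command(audio_path, prompt, model="small"):
    """Build the argv list for the whisper CLI with a context prompt."""
    return [
        "whisper", audio_path,
        "--model", model,
        "--initial_prompt", prompt,
    ]


def transcribe(audio_path, prompt, model="small"):
    """Run whisper; requires the whisper CLI to be installed and on PATH."""
    subprocess.run(whisper_command(audio_path, prompt, model), check=True)


# Hypothetical file name; the prompt is the start of the context quoted above.
cmd = whisper_command(
    "the_cat.mp3",
    "This is a reading of a two stanza poem.",
)
print(shlex.join(cmd))  # show the equivalent shell command
```

Building the command as a list rather than a single string avoids quoting problems when the prompt itself contains quotes, as the one above does.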

Real-time Speech-to-text?

jch — Fri, 13 Sep 2024 22:26:39 +0000

I've been experimenting with real-time STT for the Galene videoconferencing system¹ using the whisper.cpp library², and it has been challenging. The API is not naturally adapted to real-time transcription; I've worked around that by splitting the audio into chunks that I then pipe into the model. More seriously, the only model I can run in real time on the CPU is the "base" model, which is not very useful in practice. I'll be looking at running on the GPU next time I have some time to work on it.

Does anyone have experience with real-time STT? If so, I'd appreciate a note at <galene@lists.galene.org>.

¹ https://github.com/jech/galene-stt
² https://github.com/ggerganov/whisper.cpp
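
The chunking workaround described above could look roughly like the following. This is a sketch under stated assumptions, not the code used in galene-stt: the function name, the 5-second chunk length and the 1-second overlap are all made up for illustration (the overlap is added here so that a word cut at a chunk boundary appears whole in the next chunk). Whisper models do expect 16 kHz audio.

```python
def split_into_chunks(samples, sample_rate=16000,
                      chunk_seconds=5.0, overlap_seconds=1.0):
    """Split a sequence of PCM samples into fixed-length, overlapping chunks."""
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk")
    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + chunk])
        if start + chunk >= len(samples):
            break  # this chunk already reaches the end of the audio
    return chunks


# Toy input: 10 "samples" at 1 Hz, 4-second chunks with 1 second of overlap.
demo = split_into_chunks(list(range(10)), sample_rate=1,
                         chunk_seconds=4, overlap_seconds=1)
print(demo)
```

Each chunk would then be handed to the model separately; the overlapping text still has to be de-duplicated when the partial transcripts are stitched together, which this sketch does not attempt.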

… until it's not

Wol — Sun, 08 Sep 2024 07:09:38 +0000

Going the other way (satnavs) this is a hard problem.

Bear in mind, iirc, the rule is "an i or e after g makes it soft". We have a town locally called Gillingham, which follows the rule and has a soft G. Most satnavs get it wrong and use a hard G, which is right for a different Gillingham, a town down Somerset way where we were on holiday a year or so ago.

If humans don't follow the rules and "just know" how something works, how on earth are "speech to text" and "text to speech" engines going to get it right! :-) Most of the roads down our way they simply can't pronounce.

Cheers.
Wol

Zoom

ringerc — Sat, 07 Sep 2024 22:50:44 +0000

I'm not a big fan of Zoom but I have to give them some serious credit for their speech to text and automatic transcripts.

My work often has calls with mixes of strongly accented fast-talking Americans, accented fast-talking Chinese, accented fast-talking Indians and Pakistanis, and all sorts of others. Plus a lot of jargon. Zoom can sometimes understand better than I can.

I'd love to see openly available models reach this standard. It's troubling how AI model training is becoming another barrier to control of your own things. As if the relentless drive toward SaaS and IoT subscription models and forced cloud-tied accounts isn't already enough to remove control of what you own and use.

… until it's not

ringerc — Sat, 07 Sep 2024 22:46:57 +0000

I'd love to be able to seed one with a context.

An invite list and a list of common terms for a meeting recording. Even better if there's a way to tell it how some jargon is pronounced ("Ceph" spoken by Americans seems to sound like "Seth" to transcription software; "jit" for JIT, git has a hard "g", etc).

Or a seed like an episode synopsis for a TV show.

This could potentially be done in 2 passes too. One to rough cut the transcript and dump out key terms, names etc.; you fix them and re-run it.
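
A rough sketch of what that two-pass flow might look like, assuming the second pass feeds a text prompt to the transcriber. Everything here is hypothetical: the function names, the "recurring capitalised word" heuristic, the attendee names and the `max_chars` cap (an arbitrary guess to keep the prompt short, not a documented limit).

```python
import re
from collections import Counter


def candidate_terms(rough_transcript, min_count=2):
    """Pass one: list recurring capitalised words for a human to review."""
    words = re.findall(r"\b[A-Z][A-Za-z0-9]+\b", rough_transcript)
    counts = Counter(words)
    return [word for word, n in counts.most_common() if n >= min_count]


def build_prompt(attendees, glossary, max_chars=800):
    """Pass two: turn the reviewed names and terms into a context prompt."""
    prompt = ("Attendees: " + ", ".join(attendees) + ". "
              "Glossary: " + ", ".join(glossary) + ".")
    return prompt[:max_chars]


rough = "Seth is slow. We moved Seth to JIT. JIT helps."
terms = candidate_terms(rough)           # a human would correct "Seth" to "Ceph"
prompt = build_prompt(["Alice", "Bob"], ["Ceph", "JIT"])
print(terms)
print(prompt)
```

The manual step in the middle is the point: the first pass only surfaces suspicious terms, and a person decides what they should have been before the re-run.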

whisper in italian

SLi — Sat, 07 Sep 2024 16:13:50 +0000

Ah, yes. Does it still consistently transcribe someone coughing as "Google"?

Burnistoun sketch on voice recognition with accents

farnz — Fri, 06 Sep 2024 10:35:49 +0000

There's a great comedy sketch about a voice activated elevator that can't handle Scottish accents, where they're struggling to get the machine to recognise a simple number.

whisper in italian

Wol — Thu, 05 Sep 2024 16:12:49 +0000

Sounds like Google Meet (I think it is) ... which we use for company pep broadcasts.

In the background my team are usually making fun of it - one of our senior guys has a very strong Scottish accent and oh boy does Meet have trouble with it ...

Cheers,
Wol

whisper in italian

LtWorf — Thu, 05 Sep 2024 08:00:24 +0000

In my experience, whisper in Italian is kinda useless, unless someone speaks very slowly and clearly like the "learn Italian" records.

It constantly invents new words, prefers completely unknown words to very common ones, and gets word boundaries wrong very often.

… until it's not

Paf — Tue, 03 Sep 2024 21:40:22 +0000

Larger multimodal large language models do this - understand context and give reasonable and consistent transcriptions - but I don't know of any good enough ones that can be run locally on realistic hardware, and they come with all of the baggage and issues we're all familiar with, which I won't go into here.

But they're 'smart' enough to do what you're talking about quite well.

… until it's not

yeltsin — Tue, 03 Sep 2024 20:53:42 +0000

I've tried using models of all sizes to transcribe podcasts for later reference (running rg through thousands of files is a lot easier than trying to remember which episode mentioned that application whose name you can't remember, and then digging through it).

It's generally fine, but often struggles with abbreviations, and writes the same name in many creative ways (calling the same person John → Joan → Jon → Johan in one episode).

It would be great to have voice recognition, so it can at least reliably split different speakers into separate paragraphs. For names and abbreviations I'm not sure; it would probably have to learn to understand the context, which seems like it would require a much more complex implementation.
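
Short of real context understanding, the name-spelling drift could be partly cleaned up after the fact. The following is only a sketch of one possible approach, fuzzy-matching capitalised words against a hand-written list of known names with Python's standard `difflib`; the function name and the 0.7 cutoff are assumptions, and it will happily "correct" a genuine Joan into John if both spellings are close.

```python
import difflib
import re


def normalize_names(text, canonical, cutoff=0.7):
    """Replace capitalised words that closely resemble a known name."""
    def fix(match):
        word = match.group(0)
        close = difflib.get_close_matches(word, canonical, n=1, cutoff=cutoff)
        return close[0] if close else word

    # Only capitalised words are considered, so ordinary words are left alone.
    return re.sub(r"\b[A-Z][a-z]+\b", fix, text)


fixed = normalize_names("Joan said Jon met Johan and John.", ["John"])
print(fixed)
```

Because the transcripts stay plain text, a cleanup like this can be run once over the whole archive before searching it.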

Whisper is great

dskoll — Tue, 03 Sep 2024 17:30:27 +0000

I use Whisper (directly from the command-line) with the English language "small" model and it's extremely good. The only nitpicks I have are that its timecodes are a little off, so I have to adjust video captions, and the captions are not split in logical places. But it still saves a huge amount of time compared to captioning a video by hand, and all the processing is local so there's no cloud nonsense involved.
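
For the timecode adjustment, a small script can shift every timestamp in a caption file by a fixed amount. This is a sketch that assumes the captions were written in SRT format (for instance with `--output_format srt`) and that a single constant offset is enough; the function name and the 0.5-second offset are made up for illustration.

```python
import re
from datetime import timedelta

STAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def shift_srt(srt_text, offset_seconds):
    """Shift all SRT timestamps by offset_seconds, clamping at zero."""
    offset = timedelta(seconds=offset_seconds)

    def shift(match):
        h, m, s, ms = (int(g) for g in match.groups())
        t = timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms) + offset
        t = max(t, timedelta(0))
        total_ms = t // timedelta(milliseconds=1)  # integer milliseconds
        h, rem = divmod(total_ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    # Only touch timing lines, not caption text that happens to look similar.
    lines = [STAMP.sub(shift, line) if "-->" in line else line
             for line in srt_text.split("\n")]
    return "\n".join(lines)


sample = "1\n00:00:13,720 --> 00:00:26,560\nthe cat\n"
shifted = shift_srt(sample, 0.5)
print(shifted)
```

This does nothing about captions being split in illogical places; that still needs a human or a smarter re-segmentation step.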

Cross-platform Alternative

burki99 — Tue, 03 Sep 2024 16:22:23 +0000

If you are on Mac or Windows and cannot run Speech Note, noScribe also runs Whisper AI models through a fairly friendly GUI: https://github.com/kaixxx/noScribe