Whisper is great

Posted Sep 3, 2024 17:30 UTC (Tue) by dskoll (subscriber, #1630)
Parent article: Transcribing audio with AI using Speech Note

I use Whisper (directly from the command line) with the English language "small" model and it's extremely good. The only nitpicks I have are that its timecodes are a little off, so I have to adjust video captions, and the captions are not split in logical places. But it still saves a huge amount of time compared to captioning a video by hand, and all the processing is local so there's no cloud nonsense involved.
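
For the timecode adjustment, a minimal sketch, assuming the captions are in SRT format and that the error is a constant offset (real drift may vary over the length of the video). The function name `shift_srt` is made up for illustration.

```python
import re

# SRT timestamps look like 00:01:02,345 (hours:minutes:seconds,milliseconds).
TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def shift_srt(text: str, offset_ms: int) -> str:
    """Shift every SRT timestamp in `text` by `offset_ms` (may be negative)."""

    def shift(match: re.Match) -> str:
        h, m, s, ms = (int(g) for g in match.groups())
        total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
        total = max(total, 0)  # assumption: clamp at zero rather than go negative
        h, rem = divmod(total, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    return TIMESTAMP.sub(shift, text)


if __name__ == "__main__":
    print(shift_srt("00:00:01,500 --> 00:00:03,000", 750))
```

This only rewrites the timestamps; it does nothing about captions being split in awkward places.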

… until it's not

Posted Sep 3, 2024 20:53 UTC (Tue) by yeltsin (guest, #171611) [Link] (4 responses)

I've tried using models of all sizes to transcribe podcasts for later reference (running rg through thousands of files is a lot easier than trying to remember which episode mentioned that application whose name you can't remember, and then digging through it).

It's generally fine, but often struggles with abbreviations, and writes the same name in many creative ways (calling the same person John → Joan → Jon → Johan in one episode).

It would be great to have voice recognition, so it can at least reliably split different speakers into separate paragraphs. As for names and abbreviations, I'm not sure; it would probably have to learn to understand the context, which seems like it would require a much more complex implementation.

… until it's not

Posted Sep 3, 2024 21:40 UTC (Tue) by Paf (subscriber, #91811) [Link]

Larger multimodal large language models do this - understand context and give reasonable and consistent transcriptions - but I don't know that any that are good enough can be run locally on realistic hardware, and they come with all of the baggage and issues we're all familiar with and I won't go into here.

But they're 'smart' enough to do what you're talking about quite well.

… until it's not

Posted Sep 7, 2024 22:46 UTC (Sat) by ringerc (subscriber, #3071) [Link] (2 responses)

I'd love to be able to seed one with a context.

For example, an invite list and a list of common terms for a meeting recording. Even better if there's a way to tell it how some jargon is pronounced ("Ceph" spoken by Americans seems to sound like "Seth" to transcription software; "jit" for JIT, git has a hard "g", etc.).

Or a seed like an episode synopsis for a TV show.

This could potentially be done in 2 passes too. One to rough cut the transcript and dump out key terms, names etc.; you fix them and re-run it.
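
The two-pass idea could be sketched roughly like this. It is only an illustration: the heuristic (capitalised words that do not start a sentence) is an assumption about what counts as a "key term", and `candidate_terms` and `build_context` are made-up names, not part of any transcription tool.

```python
import re
from collections import Counter


def candidate_terms(transcript: str) -> list[str]:
    """Pass 1: list capitalised words that are not sentence-initial, most frequent first."""
    counts: Counter[str] = Counter()
    for sentence in re.split(r"(?<=[.!?])\s+", transcript):
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        for word in words[1:]:  # skip the first word, which is always capitalised
            if word[0].isupper() and len(word) > 1:  # len > 1 drops the pronoun "I"
                counts[word] += 1
    return [word for word, _ in counts.most_common()]


def build_context(names: list[str], terms: list[str]) -> str:
    """Pass 2 input: turn the hand-corrected lists into a short context string."""
    return "Attendees: " + ", ".join(names) + ". Terms: " + ", ".join(terms) + "."


if __name__ == "__main__":
    rough = "We moved the cluster to Seth last week. Joan said Seth was slow. Then Jon fixed it."
    print(candidate_terms(rough))  # review and correct these by hand
    print(build_context(["Joan Smith"], ["Ceph", "JIT"]))
```

A name that only ever appears at the start of a sentence is missed by this heuristic, so the dumped list still needs a human look before the second run.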

… until it's not

Posted Sep 8, 2024 7:09 UTC (Sun) by Wol (subscriber, #4433) [Link]

Going the other way (satnavs) this is a hard problem.

Bear in mind, iirc, the rule is "an i or e after g makes it soft". We have a town locally called Gillingham, which follows the rule and has a soft G. Most satnavs get it wrong and use a hard G, which is how you pronounce a different town, also called Gillingham, down Somerset way, where we were on holiday a year or so ago.

If humans don't follow the rules and "just know" how something works, how on earth are "speech to text" and "text to speech" engines going to get it right! :-) Most of the roads down our way they simply can't pronounce.

Cheers.
Wol

Providing context

Posted Sep 25, 2024 7:53 UTC (Wed) by mwood (guest, #55622) [Link]

> I'd love to be able to seed one with a context.

You actually can! Whisper has an `--initial_prompt` option.
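
A minimal sketch of passing context this way from Python, assuming the `whisper` command from the openai-whisper package is on the PATH. The file name `the_cat.mp3` and the helper `whisper_command` are hypothetical; `--model`, `--language`, and `--initial_prompt` are the CLI's own options.

```python
import shlex


def whisper_command(audio_path: str, prompt: str, model: str = "small.en") -> list[str]:
    """Build the argument list for a Whisper CLI run seeded with `prompt`."""
    return [
        "whisper", audio_path,
        "--model", model,
        "--language", "en",
        "--initial_prompt", prompt,
    ]


if __name__ == "__main__":
    cmd = whisper_command(
        "the_cat.mp3",  # hypothetical input file
        "This is a reading of a two stanza poem.",
    )
    print(shlex.join(cmd))
    # To actually run it (needs Whisper installed and the audio file present):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems when the prompt itself contains quotes.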

e.g. I tried the example from the article and gave it some context, which allowed it to correctly transcribe "shan't" and "bosh" and the repetition, but for some reason it got horribly confused in the middle.

I gave it the following context:

'This is a reading of a two stanza poem. It contains some old/unusual exclamations and the last line of each stanza contains some repetition like "and WORD, and WORD, and WORD"'

[00:00.000 --> 00:13.720] the cat this is a LibriVox recording all LibriVox recordings are in the public domain for more information or to volunteer please visit LibriVox.org
[00:13.720 --> 00:26.560] the cat advice to the young by harry graham from ruthless rhymes for heartless homes LibriVox coffee break collection number eight
[00:26.560 --> 00:41.280] my children you should imitate the harmless necessary cat who eats whatever's on his plate and doesn't even leave the fat who never stays in bed too late or does immoral things like that
[00:41.280 --> 00:55.080] instead of saying shan't or bosh he'll sit and wash and wash and wash when shadows fall and lights grow dim he sits beneath the kitchen stair
[00:55.080 --> 00:55.880] basta
[00:55.880 --> 00:56.460] ba
[00:56.460 --> 01:03.260] and limb a simple couch he chooses there and if you tumble over him he simply loves to hear you
[01:03.260 --> 01:19.500] swear and while bad language you prefer he'll sit and purr and purr and purr end of the cat by harry graham read by patrick wallace
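
One rough way to spot where a run like this went wrong is to flag unusually short segments. This is a sketch under the assumption that sub-second segments are suspicious, which is only a heuristic; it parses the `[MM:SS.mmm --> MM:SS.mmm]` lines shown above (not the hour-long form), and `short_segments` is a made-up name.

```python
import re

# Matches lines such as "[00:55.080 --> 00:55.880] basta".
SEGMENT = re.compile(
    r"\[(\d+):(\d{2})\.(\d{3}) --> (\d+):(\d{2})\.(\d{3})\]\s*(.*)"
)


def short_segments(output: str, max_ms: int = 1000) -> list[tuple[int, int, str]]:
    """Return (start_ms, end_ms, text) for segments shorter than `max_ms`."""
    flagged = []
    for line in output.splitlines():
        match = SEGMENT.match(line.strip())
        if not match:
            continue  # ignore anything that is not a segment line
        sm, ss, sms, em, es, ems, text = match.groups()
        start = (int(sm) * 60 + int(ss)) * 1000 + int(sms)
        end = (int(em) * 60 + int(es)) * 1000 + int(ems)
        if end - start < max_ms:
            flagged.append((start, end, text))
    return flagged


if __name__ == "__main__":
    sample = (
        "[00:41.280 --> 00:55.080] instead of saying shan't or bosh\n"
        "[00:55.080 --> 00:55.880] basta\n"
        "[00:55.880 --> 00:56.460] ba\n"
        "[00:56.460 --> 01:03.260] and limb a simple couch he chooses there\n"
    )
    for start, end, text in short_segments(sample):
        print(start, end, text)
```

On the output above this would pick out the "basta" and "ba" segments, which is where the transcription goes astray.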