Transcription nirvana? Automatic transcription with R & Google Speech API

For as long as I’ve been doing qualitative analysis I’ve been looking for ways to automate transcription. When I was doing my master’s I spent more time (fruitlessly) looking for technical solutions than actually doing transcription. Speech recognition has come a long way since then; perhaps it’s time to try again?

I came across a blog post recently that suggested it’s becoming possible using the Google Speech API. This is the same deep-learning model that powers Android speech recognition, so it seems promising.

After setting up a GCloud account (currently with $300 free credit; not sure how long that will last), installing the R libraries and running a test is simple:

#install package; run first time or to update package....
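
For reference, the install and authorization steps might look roughly like the sketch below. It assumes the googleLanguageR package from CRAN (the package that provides gl_speech) and a service-account JSON key downloaded from the Google Cloud console. The key path is a placeholder, not a real file.

```r
# Run once, or again to update the package.
# Assumption: googleLanguageR is installed from CRAN.
# install.packages("googleLanguageR")

# Placeholder path to a service-account key file; replace with your own.
key_file <- "path/to/your-service-account-key.json"

# Only attempt authorization if the package and the key file are both present.
if (requireNamespace("googleLanguageR", quietly = TRUE) && file.exists(key_file)) {
  library(googleLanguageR)
  gl_auth(key_file)  # authorize this R session against Google Cloud
}
```

The guard around gl_auth is only there so the snippet does nothing harmful when the key file is missing; in an interactive session the two lines inside it are all that is needed.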

Once you’ve authorized with GCloud (a single line of code) the transcription itself requires a single command:

gl_speech("path to audio clip")

I tested it with a really challenging task: a 15-second clip of the Fermanagh Rose from the 2017 Rose of Tralee.

Then run the transcription:

audioclip <- "<<path to audio file>>"
testresult <- gl_speech(audioclip, encoding = "FLAC", sampleRateHertz = 22050,
                        languageCode = "en-IE", maxAlternatives = 2L,
                        profanityFilter = FALSE, speechContexts = NULL, asynch = FALSE)

Which spat out:

 startTime endTime word
1 0s 1.500s things
2 1.500s 1.600s are
3 1.600s 2.600s boyfriend
4 2.600s 2.700s and
5 2.700s 3.200s see
6 3.200s 3.600s uncle
7 3.600s 7.100s supposed
8 7.100s 7.300s to
9 7.300s 7.400s be
10 7.400s 7.500s on
11 7.500s 12.200s something
12 12.200s 12.700s instead
13 12.700s 13s so
14 13s 14.600s Big
15 14.600s 14.900s Brother
16 14.900s 15.400s big
17 15.400s 15.800s buzz
18 15.800s 16.300s around
19 16.300s 17.300s Broad
20 17.300s 17.600s range
21 17.600s 17.900s at
22 17.900s 24.300s Loughborough
23 24.300s 24.700s bank
24 24.700s 25.100s whereabouts
25 25.100s 25.100s in
26 25.100s 25.600s Fermanagh
27 25.600s 27.700s between
28 27.700s 28.300s Fermanagh
29 28.300s 28.800s Cavan
30 28.800s 29.700s and
31 29.700s 29.800s I
32 29.800s 30.100s live
33 30.100s 30.400s action
34 30.400s 30.600s the
35 30.600s 30.900s road
36 30.900s 31.500s on
37 31.500s 31.800s for
38 31.800s 32s the
39 32s 32.200s Marble
40 32.200s 32.300s Arch
41 32.300s 32.400s Caves
42 32.400s 33.800s and
43 33.800s 34.400s popular
44 34.400s 34.600s culture

Honestly, that’s not bad, although not quite usable as it stands. It’s certainly a good base to start transcribing from. I was not expecting it to deal so well with fast speech and regional dialects. Perhaps transcription nirvana will arrive soon. It is not quite here yet, but it is astonishing that such powerful language processing is so easily accomplished.
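
To use the word-level output as a base for transcribing, one option is to collapse it into a draft string and flag the places most likely to need manual attention. The sketch below is base R only and rebuilds the first eight rows of the output by hand, so it does not call the API. The 2-second threshold and the idea that an unusually long word span (for example "supposed", listed from 3.6s to 7.1s) may hide skipped speech are assumptions, not something the API reports.

```r
# First eight rows of the word timings, copied from the output above.
timings <- data.frame(
  startTime = c("0s", "1.500s", "1.600s", "2.600s", "2.700s", "3.200s", "3.600s", "7.100s"),
  endTime   = c("1.500s", "1.600s", "2.600s", "2.700s", "3.200s", "3.600s", "7.100s", "7.300s"),
  word      = c("things", "are", "boyfriend", "and", "see", "uncle", "supposed", "to"),
  stringsAsFactors = FALSE
)

# Strip the trailing "s" and convert the timestamps to numeric seconds.
to_seconds <- function(x) as.numeric(sub("s$", "", x))

timings$start    <- to_seconds(timings$startTime)
timings$end      <- to_seconds(timings$endTime)
timings$duration <- timings$end - timings$start

# Collapse the words into a single draft transcript string.
draft <- paste(timings$word, collapse = " ")

# Flag words whose span is longer than an (arbitrary) 2-second threshold.
suspect <- timings[timings$duration > 2, c("start", "end", "word")]

print(draft)
print(suspect)
```

The flagged start and end times give a short list of points to jump to in the audio when correcting the draft by hand.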