For as long as I’ve been doing qualitative analysis I’ve been looking for ways to automate transcription. When I was doing my master’s I spent more time (fruitlessly) looking for technical solutions than actually doing transcription. Speech recognition has come a long way since then; perhaps it’s time to try again?
I came across a blog post recently that suggested it’s becoming possible using the Google Speech API. This is the same deep-learning model that powers Android speech recognition, so it seems promising.
After setting up a GCloud account (currently with $300 free credit; not sure how long that will last), installing the R libraries and running a test is simple:
# install package; run first time or to update package....
# devtools::install_github("ropensci/googleLanguageR")
library(googleLanguageR)
Once you’ve authorized with GCloud (a single line of code) the transcription itself requires a single command:
gl_speech("path to audio clip")
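For reference, the authorization step mentioned above might look something like the sketch below. It assumes you have downloaded a service-account JSON key from the GCloud console; the file path is a placeholder, not a real location.

```r
library(googleLanguageR)

# Hypothetical path: point this at your own service-account key file
gl_auth("path/to/service-account-key.json")
```

This needs to be run once per R session, before any call to gl_speech().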
I tested it with a really challenging task: a 15 second clip of the Fermanagh Rose from the 2017 Rose of Tralee:
Then run the transcription:
audioclip <- "<<path to audio file>>"
testresult <- gl_speech(audioclip,
                        encoding = "FLAC",
                        sampleRateHertz = 22050,
                        languageCode = "en-IE",
                        maxAlternatives = 2L,
                        profanityFilter = FALSE,
                        speechContexts = NULL,
                        asynch = FALSE)
testresult
Which spat out:
   startTime endTime         word
1         0s  1.500s       things
2     1.500s  1.600s          are
3     1.600s  2.600s    boyfriend
4     2.600s  2.700s          and
5     2.700s  3.200s          see
6     3.200s  3.600s        uncle
7     3.600s  7.100s     supposed
8     7.100s  7.300s           to
9     7.300s  7.400s           be
10    7.400s  7.500s           on
11    7.500s 12.200s    something
12   12.200s 12.700s      instead
13   12.700s     13s           so
14       13s 14.600s          Big
15   14.600s 14.900s      Brother
16   14.900s 15.400s          big
17   15.400s 15.800s         buzz
18   15.800s 16.300s       around
19   16.300s 17.300s        Broad
20   17.300s 17.600s        range
21   17.600s 17.900s           at
22   17.900s 24.300s Loughborough
23   24.300s 24.700s         bank
24   24.700s 25.100s  whereabouts
25   25.100s 25.100s           in
26   25.100s 25.600s    Fermanagh
27   25.600s 27.700s      between
28   27.700s 28.300s    Fermanagh
29   28.300s 28.800s        Cavan
30   28.800s 29.700s          and
31   29.700s 29.800s            I
32   29.800s 30.100s         live
33   30.100s 30.400s       action
34   30.400s 30.600s          the
35   30.600s 30.900s         road
36   30.900s 31.500s           on
37   31.500s 31.800s          for
38   31.800s     32s          the
39       32s 32.200s       Marble
40   32.200s 32.300s         Arch
41   32.300s 32.400s        Caves
42   32.400s 33.800s          and
43   33.800s 34.400s      popular
44   34.400s 34.600s      culture
Honestly, that’s not bad, although not quite usable. It’s certainly a good base to start transcribing from. I was not expecting it to deal so well with fast speech and regional dialects. Perhaps transcription nirvana will arrive soon; it’s not quite here yet, but it’s astonishing that such powerful language processing is so easily accomplished.
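To use the output as a starting draft, the word-by-word table can be collapsed into running text. The snippet below is a minimal sketch, assuming the result is a data frame with the startTime, endTime and word columns printed above; the structure may differ in other versions of the package. The timings data frame is rebuilt by hand from a few of the rows above so the example runs without an API call, and start_sec, end_sec and long_spans are names I've made up for illustration.

```r
# A few rows copied from the output above, rebuilt by hand as a data frame
timings <- data.frame(
  startTime = c("24.700s", "25.100s", "25.100s", "25.600s", "27.700s", "28.300s"),
  endTime   = c("25.100s", "25.100s", "25.600s", "27.700s", "28.300s", "28.800s"),
  word      = c("whereabouts", "in", "Fermanagh", "between", "Fermanagh", "Cavan"),
  stringsAsFactors = FALSE
)

# Join the words into a single line of draft text
draft <- paste(timings$word, collapse = " ")

# Strip the trailing "s" and convert the times to numeric seconds
timings$start_sec <- as.numeric(sub("s$", "", timings$startTime))
timings$end_sec   <- as.numeric(sub("s$", "", timings$endTime))

# Flag words whose recognised span is longer than one second;
# these may be places where speech was missed and are worth re-listening to
long_spans <- timings[timings$end_sec - timings$start_sec > 1, ]

print(draft)
print(long_spans$word)
```

The flagged words give a rough list of spots to check against the audio when tidying up the draft by hand.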