
Automatic speech recognition in transcription?

 

Considerations when using speech recognition

When considering using automatic speech recognition (ASR), it is important to realise that some AV sources are more suitable than others. Here are some things to keep in mind. First, the audio quality must be high. This means that voices should be clear, not echoing, and preferably recorded close to the mouth with adequate microphones.

Secondly, ASR works best with monologues. If an audio file is full of people interrupting each other and talking at cross purposes, the results can be confusing to read, as not all software is able to recognise different people by their voices. Ideally, the file should have a separate channel for each speaker.
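If a recording already has one speaker per channel, the channels can be separated before transcription. The following is a minimal sketch using only Python's standard wave module. It assumes an uncompressed PCM WAV file, and the function name and output file names are made up for illustration.

```python
import wave


def split_channels(src_path, out_prefix):
    """Write one mono WAV file per channel of src_path and return the new paths."""
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()  # bytes per sample
        frame_rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    frame_size = n_channels * sample_width  # bytes per interleaved frame
    paths = []
    for ch in range(n_channels):
        # Pick this channel's sample out of every interleaved frame.
        offset = ch * sample_width
        mono = b"".join(
            frames[i + offset : i + offset + sample_width]
            for i in range(0, len(frames), frame_size)
        )
        path = f"{out_prefix}_ch{ch + 1}.wav"
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(sample_width)
            dst.setframerate(frame_rate)
            dst.writeframes(mono)
        paths.append(path)
    return paths
```

Each resulting file can then be transcribed as a monologue and the transcripts merged afterwards by time code.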

Finally, ASR is generally not very good at dealing with accents and dialects. Speakers such as migrants or rural dwellers may have an accent that is easy for you to understand but that ASR has great difficulty with, let alone accents that are difficult even for outsiders to understand.

 

ASR software

When using software, bear in mind that you are uploading privacy-sensitive files. 

Always read the terms and conditions of an ASR service before deciding whether it meets your privacy requirements.

ASR with aTrain – including speaker detection

Accessible Transcription of Interviews

 

aTrain is a tool for automatically transcribing speech recordings utilizing state-of-the-art machine learning models without uploading any data. It was developed by researchers at the Business Analytics and Data Science-Center at the University of Graz and tested by researchers from the Know-Center Graz.

 

Windows (10 and 11) users can install aTrain via the Microsoft Store (Link) or by downloading the installer from the BANDAS-Center website (Link).

 

aTrain provides user-friendly access to the faster-whisper implementation of OpenAI’s Whisper model, combining best-in-class transcription quality with higher speeds on your local computer. With the highest-quality model selected, transcription takes only around three times the audio length on current mobile CPUs typically found in mid-range business notebooks (e.g., Core i5 12th Gen, Ryzen Series 6000).

 

Speaker detection
aTrain has a speaker detection mode based on pyannote.audio and can analyze each text segment to determine which speaker it belongs to.
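The idea of matching text segments to speakers can be illustrated with a small sketch. This is not aTrain's own code. It assumes that a diarization step has already produced speaker turns with start and end times, and it labels each text segment with the speaker whose turns overlap it the longest. The function name, the speaker labels and the sample data are made up for illustration.

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turns overlap it most.

    segments: list of (start, end, text), times in seconds
    turns:    list of (start, end, speaker), times in seconds
    """
    labelled = []
    for seg_start, seg_end, text in segments:
        overlap_by_speaker = {}
        for turn_start, turn_end, speaker in turns:
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > 0:
                overlap_by_speaker[speaker] = overlap_by_speaker.get(speaker, 0.0) + overlap
        # Segments that overlap no speaker turn stay unlabelled (None).
        best = max(overlap_by_speaker, key=overlap_by_speaker.get) if overlap_by_speaker else None
        labelled.append((best, text))
    return labelled


segments = [(0.0, 4.0, "How did you arrive here?"), (4.2, 9.0, "By train, in 1968.")]
turns = [(0.0, 4.1, "SPEAKER_00"), (4.1, 9.5, "SPEAKER_01")]
for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

When speakers talk over each other, a segment can overlap several turns. The sketch then simply picks the longest overlap, which is one reason why recordings with many interruptions give less reliable speaker labels.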

 

Privacy Preservation and GDPR compliance
aTrain processes the provided speech recordings completely offline on your own device and does not send recordings or transcriptions to the internet. This helps researchers to meet data privacy requirements arising from ethical guidelines and to comply with legal requirements such as the GDPR.

 

MAXQDA, ATLAS.ti and NVivo compatible output 
aTrain provides transcription files that are seamlessly importable into the most popular tools for qualitative analysis, ATLAS.ti, MAXQDA and NVivo. This allows you to directly play audio for the corresponding text segment by clicking on its timestamp. 

In addition to these output files, it also provides a subtitle file (srt) that can be read into a subtitle program such as Subtitle Edit.

 

NVIDIA GPU support

aTrain can either run on the CPU or an NVIDIA GPU (CUDA toolkit installation required). A CUDA-enabled NVIDIA GPU significantly improves the speed of transcriptions and speaker detection, reducing transcription time to 20% of audio length on current entry-level gaming notebooks.
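As a rough planning aid, the two speed figures quoted for aTrain (around three times the audio length on a current mobile CPU with the highest-quality model, and 20% of the audio length on a CUDA-enabled GPU) can be turned into a simple estimate. The sketch below only restates that arithmetic. Actual times depend on the hardware, the model and whether speaker detection is switched on.

```python
CPU_FACTOR = 3.0  # around three times the audio length (mobile CPU, highest-quality model)
GPU_FACTOR = 0.2  # 20% of the audio length (CUDA-enabled NVIDIA GPU)


def estimated_minutes(audio_minutes, use_gpu=False):
    """Estimate the transcription time in minutes for a recording of the given length."""
    factor = GPU_FACTOR if use_gpu else CPU_FACTOR
    return audio_minutes * factor


for label, use_gpu in (("CPU", False), ("GPU", True)):
    minutes = estimated_minutes(90, use_gpu)
    print(f"90-minute interview on {label}: about {minutes:.0f} minutes")
```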

 

* aTrain defaults to the large-v3-turbo model and to the CPU as compute device.
If your computer has a graphics card with more video memory, you will get better results by downloading the large-v2 or large-v3 model and setting Compute Type to float16 in Advanced Settings.
Speaker recognition gives better results if you specify the number of speakers in advance.
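For readers who want to see what these settings correspond to under the hood, here is a minimal sketch that calls the faster-whisper library directly from Python. It is not aTrain's own code. It assumes faster-whisper is installed, and the int8 setting for the CPU is an assumption for this example, not an aTrain default. The helper names and the file name are made up for illustration.

```python
def pick_settings(has_nvidia_gpu):
    """Return device and compute type arguments for faster_whisper.WhisperModel."""
    if has_nvidia_gpu:
        # Half precision on the GPU, matching the float16 advice above.
        return {"device": "cuda", "compute_type": "float16"}
    # Assumption: 8-bit quantisation, a common CPU choice in faster-whisper.
    return {"device": "cpu", "compute_type": "int8"}


def transcribe(audio_path, model_name="large-v3", has_nvidia_gpu=False):
    """Transcribe a file and return a list of (start, end, text) tuples."""
    # Imported here so that pick_settings works without the library installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_name, **pick_settings(has_nvidia_gpu))
    segments, _info = model.transcribe(audio_path)
    # segments is a generator; the transcription runs while iterating over it.
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]


# Usage, once faster-whisper is installed (hypothetical file name):
#     for start, end, text in transcribe("interview.wav"):
#         print(f"[{start:7.2f} -> {end:7.2f}] {text}")
```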

ASR with Subtitle Edit

As of January 2023 (version 3.6.12), a new automatic speech recognition option has been built into Subtitle Edit.

 

This version of Subtitle Edit includes two speech recognition features under the Video tab:

  1. Vosk/Kaldi (a somewhat older ASR method)
  2. Whisper (AI-based modern ASR feature) 

 

Brief installation instructions in Dutch for Subtitle Edit 4.0.6*, to make the program work best for Whisper speech recognition:

 

 

* Version 4.0.10 is also available and adds further Whisper options. The OpenAI, CTranslate2, WhisperX and stable-ts engines require a separate Python installation. The Purfview’s Faster-Whisper-XXL, CPP and Const-me engines can be used without Python.

The Advanced option on the Whisper “Audio to text” screen lets you specify additional parameters for the Whisper command line, and Whisper post-processing can be configured via Settings.

Installing Python is a chapter by itself. A simplified way to install Python and Whisper on your computer is given below under the heading: Installing Whisper and Python (Windows) – for advanced users.

 

 

MacWhisper

Quickly and easily transcribe audio files into text with OpenAI’s state-of-the-art transcription technology Whisper.

 

  • Transcription is done on your device; your (sensitive) data does not leave your computer.
  • Export subtitles in .srt & .vtt. Text export in .csv
  • Search the entire transcript and highlight words
  • Play audio and sync with transcripts
  • Supports 100 different languages
  • Automatically remove ums, uhhs and other similar padding words
  • Supported formats: mp3, wav, m4a and mp4 videos.
  • Supports Tiny and Base models

 

The Pro version requires a fee of € 49 (1 Pro License for Personal use)

The Pro version uses Medium and Large models, where transcription results are often much better.

AI Transcriptions by Riverside

Users who do not want to install software on their computer but still want to use AI transcription can use Riverside’s transcriber. It transcribes audio and video in 100+ languages with just a few clicks, and the AI transcriptions are free of charge.

 

 

There are some drawbacks to using it online:

(Sensitive) data is uploaded to an external server on the internet.
Transcription times can vary depending on file size, content length and how busy Riverside’s servers are.

 

Advantages:

Unlimited file upload (MP3, Wav, MP4 and MOV)

Output in Caption – subtitle file (srt) or Text file (txt)

 

Disadvantage:

Other file formats, such as m4a, must first be converted to a format the website can read, for example with Convertio.co.

 

Whisper SteveDigital online

Users who do not want to install software on their computer but still want to use Whisper can use SteveDigital’s free online service.

It converts audio files or YouTube videos into text online with OpenAI’s advanced transcription technology Whisper.

 

There are some drawbacks to using it online, though:

 

  • (Sensitive) data is uploaded to an external server on the internet.
  • At busy times there is a queue, and large files can take a long time.
  • Output is a text file (without time codes)

 

Advantages:

 

  • Transcription takes 5-10 seconds per minute of audio
  • Uses the large model

Installing Whisper and Python (Windows) – for advanced users

Whisper AI uses the Python programming language.

Installing everything on your computer, from Python to the various Whisper models, does require some computer knowledge. The GitHub site has all the necessary files grouped together:

github.com/openai/whisper

 

Getting everything working on your personal computer is quite complicated. Which programme components to install depends a lot on your computer’s specifications.

 

However, there is an installation programme developed by TroubleChute that goes through the entire installation process automatically, taking into account your computer’s configuration.

 

Below is a link to a video that explains step-by-step how to easily install Python and Whisper on your computer (English spoken):

 

TroubleChute

 

One-click Whisper install windows install script
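Once Python and Whisper are installed, a transcription can also be started from a short Python script. The following is a minimal sketch using the whisper package from the GitHub repository mentioned above. It assumes that ffmpeg is available on the system. The helper names and the file name are made up for illustration, and the model name "base" is only an example.

```python
def format_result(result):
    """Turn a Whisper result dict into 'HH:MM:SS text' lines, one per segment."""
    lines = []
    for seg in result["segments"]:
        total = int(seg["start"])  # start time in whole seconds
        stamp = f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"
        lines.append(f"{stamp} {seg['text'].strip()}")
    return "\n".join(lines)


def transcribe_file(audio_path, model_name="base"):
    """Run Whisper on one file and return its result dict."""
    # Imported here so that format_result works without Whisper installed.
    import whisper

    model = whisper.load_model(model_name)  # downloads the model on first use
    return model.transcribe(audio_path)     # dict with "text", "segments", "language"


# Usage, once everything is installed (hypothetical file name):
#     print(format_result(transcribe_file("interview.mp3")))
```

Larger models generally give better results but need more memory and time.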

ASR with Word 365

Automatic speech recognition

 

Automatic speech recognition with Word in Office 365.

With a Microsoft account, the service can be used online for free.

 

The disadvantage is that the result is a document without time codes.

Via an option in YouTube Studio, subtitles with time codes can be created.

Download the separate instruction document.

 

Instruction document for automatic speech recognition in Word can be downloaded here:

 

 

Automatic transcription

 

Automatic transcription with Word in Office 365.

The service can only be used with an Office 365 premium subscription.

(300 minutes of speech recognition per month)

 

The result is a document with start times per paragraph. An option in YouTube Studio can be used to turn it into a readable subtitle file with time codes. 
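To show what such a conversion involves, here is a minimal sketch that turns paragraphs with start times into SRT cues, letting each cue end where the next one begins. This is not what YouTube Studio does internally. The input format, a list of start times in seconds with their text, is an assumption, and the function names are made up for illustration.

```python
def srt_time(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used in .srt files."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def paragraphs_to_srt(paragraphs, total_seconds):
    """Build SRT text from (start_seconds, text) pairs.

    Each cue ends where the next one starts; the last cue ends at total_seconds.
    """
    cues = []
    for i, (start, text) in enumerate(paragraphs):
        end = paragraphs[i + 1][0] if i + 1 < len(paragraphs) else total_seconds
        cues.append(f"{i + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)


print(paragraphs_to_srt([(0, "Good morning."), (4.5, "Thank you for coming.")], total_seconds=9))
```

Long paragraphs produce long cues this way, so a subtitle editor is still needed to split them into readable lines.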

Download the separate instruction document.

 

 

 

Instruction document for automatic transcription in Word can be downloaded here:

 

ASR with Google Docs

The automatic speech recognition can be used with a Google Account.

The disadvantage is that the result is a document without time codes.

Using an option in YouTube Studio, subtitles with time codes can be created from this.

Download the separate instruction document.

 

 

Instruction document for automatic speech recognition in Google Docs can be downloaded here:

 

 

 

 

 

 

 

ASR with YouTube

The automatic subtitles can be created with a Google / YouTube account.

Only suitable for video files. 

 

If you want to have an audio file (mp3, wav, ogg, etc.) automatically transcribed, it must first be converted to a video file before it can be uploaded to YouTube. There are all kinds of free programmes for this. The trick is to load the sound track, show a single arbitrary picture for the entire length of the sound file, and save the result as an mp4 file. The file is then ready for uploading to YouTube.
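One free programme that can do this is the command-line tool ffmpeg, which is not mentioned in the instructions themselves. The sketch below builds a typical ffmpeg command that combines one still image with an audio file. The function name and the file names are made up for illustration, and running the command requires ffmpeg to be installed.

```python
def still_image_video_command(audio_path, image_path, output_path):
    """Build an ffmpeg command that shows one image for the length of the audio."""
    return [
        "ffmpeg",
        "-loop", "1", "-i", image_path,  # repeat the still image as the video track
        "-i", audio_path,                # the recording to be transcribed
        "-c:v", "libx264", "-tune", "stillimage",
        "-pix_fmt", "yuv420p",           # pixel format that most players accept
        "-c:a", "aac",
        "-shortest",                     # stop when the audio ends
        output_path,
    ]


cmd = still_image_video_command("interview.mp3", "cover.png", "interview.mp4")
print(" ".join(cmd))
# To actually create the video (requires ffmpeg on the PATH):
#     import subprocess
#     subprocess.run(cmd, check=True)
```

With these settings the picture should have an even width and height in pixels, otherwise the video encoder may refuse it.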

 

Instruction document can be downloaded here

 

 

ASR for academics

Transcription Portal

 

  • Easy to use web-based ASR
  • Multilingual
  • Editing possible
  • Free of charge (academic use)

https://www.phonetik.uni-muenchen.de/apps/oh-portal

 

The Transcription Portal is an online ASR tool developed and hosted by LMU Munich for academic transcription purposes. The tool is not an ASR service itself, but allows you to process your audio files through many different ASR services. You can then correct and edit the results within the OH-Portal or export them in a file type of your choice.

 

 

 

If you are interested in transcription tools, please check here:

 

TRANSCRIPTION TOOLS