Week 1, Statistical Dialog Systems 2016
- ASR measures
- Word Error Rate (WER) \(wer(w_{hyp}, w) = \frac{S + 0.5 D + 0.5 I}{\mid w \mid}\), where \(w\) is the gold sequence of words and \(S, D, I\) are the numbers of substitute, delete and insert operations used for transforming the gold transcription \(w\) into the hypothesis \(w_{hyp}\) with minimum edit distance.
- Minimum edit distance and the operations used are computed exactly using dynamic programming.
- Computed typically per utterance
- Sentence Error Rate (SER) \(ser = \frac{\mid \{t \in \{1, .., N\} : wer(hyp_t, gold_t) > 0\} \mid}{N}\), i.e. the fraction of the \(N\) sentences whose hypothesis differs from the gold transcription.
- RTF (real time factor): processing time divided by the duration of the processed audio
- Latency: for an SDS, how long the user has to wait before hearing the reply; a significant portion of it is the time spent waiting for the ASR result
- problems
- lexicon size and OOVs
- domain dependence for language model (LM)
- balancing the LM against the acoustic model (AM)
- keyword spotting does not require fluent sentences
- ready-to-use tools and services
- Google Web Speech API: https://developers.google.com/web/updates/2013/01/Voice-Driven-Web-Apps-Introduction-to-the-Web-Speech-API?hl=en
- Kaldi toolkit: https://github.com/kaldi-asr/kaldi
- CloudASR for custom domains: http://cloudasr.com
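The WER and SER definitions above can be sketched in Python as follows. This is only a minimal illustration, not a reference implementation: the function names, the parameter names `sub`, `dele`, `ins`, and the handling of an empty gold transcription are assumptions made for this sketch. It uses the weights from the WER formula (1 for substitution, 0.5 for deletion and insertion) and returns only the distance, without recovering the alignment.

```python
def weighted_edit_distance(gold, hyp, sub=1.0, dele=0.5, ins=0.5):
    """Minimum weighted cost of turning the word list `gold` into `hyp`."""
    # prev[j] = cost of transforming the gold prefix processed so far into hyp[:j]
    prev = [j * ins for j in range(len(hyp) + 1)]
    for i, g in enumerate(gold, start=1):
        curr = [i * dele]  # deleting i gold words yields the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            match_or_sub = prev[j - 1] + (0.0 if g == h else sub)
            delete = prev[j] + dele
            insert = curr[j - 1] + ins
            curr.append(min(match_or_sub, delete, insert))
        prev = curr
    return prev[-1]


def wer(hyp, gold, **weights):
    """Weighted edit distance normalised by the number of gold words."""
    gold_words, hyp_words = gold.split(), hyp.split()
    if not gold_words:
        # Assumption of this sketch: the formula is undefined for an empty
        # gold transcription, so return 0.0 for an empty hypothesis, else inf.
        return 0.0 if not hyp_words else float("inf")
    return weighted_edit_distance(gold_words, hyp_words, **weights) / len(gold_words)


def ser(hyps, golds):
    """Fraction of sentences whose hypothesis differs from the gold transcription."""
    errors = sum(1 for h, g in zip(hyps, golds) if h.split() != g.split())
    return errors / len(golds)
```

For example, `wer('a c', 'a b c')` needs a single deletion, so it evaluates to 0.5 / 3.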
Install TensorFlow v0.7.1 by next time; we will use it instead of the announced scikit-learn. This week you have several options for what to submit as homework. Choose only one.
- Code a simple edit distance utility
- Print both the minimum edit distance and the best alignment
- alignment: the sequence of operations that transforms the gold sequence into the hypothesis sequence
- names of operations
- n - nothing/null/identity
- s - substitute
- d - delete
- i - insert
- Make the weights of the edit operations S, D, I optional parameters (see above).
- Make the word separator an optional parameter defaulting to a space
- Implement it yourself, do not copy it from web!
- Use a language of your choice, but make the utility run smoothly on Ubuntu 14.04 or OS X 10.10.3
- Provide a wrapper script which demonstrates the utility's usage on the following examples:
hyp='', gold=''
hyp='a a a', gold='a a a'
hyp='a b', gold='a a a'
hyp='a b c a', gold='a a a'
- Use the CloudASR API (see the batch API docs at the bottom) and compare it to the Google Web Speech API, which can also be used from Python
- Create a recording yourself
- Decode the first 100 utterances from the test set of the Czech Vystadial dataset
- Use sclite for scoring. See the 3rd task for details
- Publish the data and the code on the web or on the Rotunda lab disk and share the paths with me and your colleagues via email.
- Measure WER, SER and confusion pairs for transcribed and gold utterances
- Install sclite by downloading a Makefile and running make sclite_compiled
- Verify the successful compilation by running sclite, which should output help to stderr.
- Download the files hyp_content.txt and gold_content.txt
- Run sclite on the real data (hyp_content.txt)
- Do not forget that the files differ in their number of lines.
- Remove some lines so that the files have equal length, and describe your strategy/heuristic
- Hints: Sclite uses braces as special characters
- Look at the output options and report
- SER, WER, number of sentences, number of evaluated words
- Top 10 confusion pairs
- Tell me the extension of the result file with detailed alignments and how a deletion is marked
- It may pay off to check that sclite works on dummy data:
# from the directory where you installed sclite
echo 'ahoj (1)' > a; echo 'cus bus (2)' >> a;
echo 'cau (1)' > b; echo 'cus uz (2)' >> b;
sctk/bin/sclite -r a -h b -i rm
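The dummy files above end each line with an utterance id in parentheses. If your own transcripts are plain text with one utterance per line and no ids, a small helper along these lines could add them. This is only a sketch under that assumption: whether hyp_content.txt and gold_content.txt already contain ids is not stated here, and the function names `to_trn_lines` and `write_trn` are hypothetical.

```python
def to_trn_lines(utterances):
    """Append a running utterance id in parentheses, e.g. 'ahoj' -> 'ahoj (1)'."""
    return ["{} ({})".format(text.strip(), idx)
            for idx, text in enumerate(utterances, start=1)]


def write_trn(utterances, path):
    """Write the id-annotated utterances to `path`, one per line."""
    with open(path, "w", encoding="utf-8") as out:
        for line in to_trn_lines(utterances):
            out.write(line + "\n")
```

The gold and hypothesis files must contain the same utterances in the same order for the running ids to match.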