Ondřej Plátek Blog
PhD candidate@UFAL, Prague. LLM & TTS evaluation. Engineer. Researcher. Speaker. Father.

Week 1, Statistical Dialog Systems 2016


  • ASR measures
    • Word Error Rate (WER) \(wer(w_{hyp}, w) = \frac{S + 0.5 D + 0.5 I}{\mid w \mid}\) where \(w\) is the gold sequence of words, \(\mid w \mid\) is its length, and \(S, D, I\) are the numbers of substitute, delete and insert operations used to transform the gold transcription \(w\) into the hypothesis \(w_{hyp}\) with minimum edit distance. The most common definition weights all three operations equally, i.e. \(S + D + I\) in the numerator.
      • The minimum edit distance and the operations used are computed exactly using dynamic programming.
      • Typically computed per utterance
    • Sentence Error Rate (SER) \(ser = \frac{\mid \{t \in \{1, \ldots, N\} : hyp_t \neq gold_t\} \mid}{N}\), i.e. the fraction of the \(N\) utterances for which \(wer(hyp_t, gold_t) > 0\)
    • Real-time factor (RTF): processing time divided by the duration of the audio
    • Latency: for an SDS, how long the user has to wait before hearing the reply; a significant portion of it is the time spent waiting for the ASR result
  • problems
    • lexicon size and OOVs
    • domain dependence for language model (LM)
    • balancing LM vs AM
      • keyword spotting does not require fluent sentences
  • ready-to-use tools and services
    • Google Web Speech API: https://developers.google.com/web/updates/2013/01/Voice-Driven-Web-Apps-Introduction-to-the-Web-Speech-API?hl=en
    • Kaldi toolkit: https://github.com/kaldi-asr/kaldi
    • For custom domains: http://cloudasr.com
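
The WER and SER definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes whitespace tokenization and unit costs for substitution, deletion and insertion (not the 0.5 weights from the formula above), and it returns only the distance, without the alignment. The function names `edit_distance`, `wer` and `ser` are chosen here for illustration, and returning 0.0 for an empty gold and empty hypothesis is a convention of this sketch.

```python
def edit_distance(gold, hyp):
    """Unit-cost Levenshtein distance between two lists of words."""
    # prev[j] = distance between the gold prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, g in enumerate(gold, start=1):
        curr = [i]  # i deletions turn gold[:i] into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete the gold word g
                curr[j - 1] + 1,         # insert the hypothesis word h
                prev[j - 1] + (g != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(hyp, gold):
    """Unweighted WER of one utterance; hyp and gold are strings."""
    hyp_words, gold_words = hyp.split(), gold.split()
    if not gold_words:
        # WER is undefined for an empty gold transcription.
        # Assumption of this sketch: 0.0 if hyp is empty too, else infinity.
        return 0.0 if not hyp_words else float("inf")
    return edit_distance(gold_words, hyp_words) / len(gold_words)


def ser(hyps, golds):
    """Fraction of utterances whose hypothesis differs from the gold."""
    if len(hyps) != len(golds) or not golds:
        raise ValueError("need two non-empty lists of equal length")
    errors = sum(1 for h, g in zip(hyps, golds) if h.split() != g.split())
    return errors / len(golds)


if __name__ == "__main__":
    pairs = [("", ""), ("a a a", "a a a"), ("a b", "a a a"), ("a b c a", "a a a")]
    for hyp, gold in pairs:
        print(repr(hyp), repr(gold), wer(hyp, gold))
    print("SER:", ser([h for h, _ in pairs], [g for _, g in pairs]))
```

Only two rows of the dynamic programming table are kept in memory, which is enough for the distance but not for recovering the alignment; that would require storing the full table or backpointers.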


Install TensorFlow v0.7.1 before the next session; we will use it instead of the previously announced scikit-learn. This week you have several options for what to submit as homework. Choose only one.

  1. Code a simple edit distance utility
    • Print both the minimum edit distance and the best alignment
      • alignment: the sequence of operations that transforms the gold sequence into the hypothesis sequence
        • names of operations
          • n - nothing/null/identity
          • s - substitute
          • d - delete
          • i - insert
    • Make the weights for the edit operations S, D, I optional (see above).
    • Make the word separator optional, defaulting to a space
    • Implement it yourself, do not copy it from the web!
    • Use a language of your choice, but make it run smoothly on Ubuntu 14.04 or OS X 10.10.3
    • Provide a wrapper script which demos the utility on the following examples:
      • hyp='', gold=''
      • hyp='a a a', gold='a a a'
      • hyp='a b', gold='a a a'
      • hyp='a b c a', gold='a a a'
  2. Use the Cloudasr API (see the batch API docs at the bottom) and compare it to the Google Web Speech API, which can also be used from Python
    • Create a recording yourself
    • Decode the first 100 utterances from the test set of the Czech Vystadial dataset
    • Use sclite for scoring. See the 3rd task for details
    • Publish the data and the code on the web or on the Rotunda lab disk and share the paths with me and your colleagues via email.
  3. Measure WER, SER and confusion pairs for transcribed and gold utterances
    • Install sclite by downloading a Makefile and running make sclite_compiled
      • Verify the successful compilation by running sctk/bin/sclite, which should print its help to stderr.
    • Download the files hyp_content.txt and gold_content.txt

    • Run sclite on the real data hyp_content.txt and gold_content.txt
      • Note that the two files differ in their number of lines.
        • Remove some lines so that the files have equal length and describe your strategy/heuristic
      • Hint: sclite treats braces as special characters
      • Look at the output options and report:
        • SER, WER, number of sentences, number of evaluated words
        • Top 10 confusion pairs
        • Tell me what the file extension of the detailed alignment output is, and how a deletion is marked
    • It may pay off to check that sclite works on dummy data:
    # from the directory where you installed sclite
    echo 'ahoj (1)' > a; echo 'cus bus (2)' >> a;
    echo 'cau (1)' > b; echo 'cus uz (2)' >> b;
    sctk/bin/sclite -r a -h b -i rm
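
Before running sclite on the real data, it may help to inspect the two files for the issues mentioned above: the line-count mismatch and the braces. The following is a minimal Python sketch; the helper names `read_utterances`, `length_report` and `lines_with_braces` are hypothetical, UTF-8 encoding of the files is an assumption, and the sketch only reports problems, leaving the choice of which lines to remove to you.

```python
from pathlib import Path


def read_utterances(path):
    """Read a file into a list of lines (UTF-8 is an assumption)."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def length_report(hyp_lines, gold_lines):
    """Summarize the line counts of the hypothesis and gold files."""
    return {
        "hyp": len(hyp_lines),
        "gold": len(gold_lines),
        "difference": len(hyp_lines) - len(gold_lines),
    }


def lines_with_braces(lines):
    """Return 1-based numbers of the lines that contain '{' or '}'."""
    return [n for n, line in enumerate(lines, start=1)
            if "{" in line or "}" in line]


if __name__ == "__main__":
    hyp_path, gold_path = Path("hyp_content.txt"), Path("gold_content.txt")
    if hyp_path.exists() and gold_path.exists():
        hyp, gold = read_utterances(hyp_path), read_utterances(gold_path)
        print(length_report(hyp, gold))
        print("hyp lines with braces:", lines_with_braces(hyp))
        print("gold lines with braces:", lines_with_braces(gold))
    else:
        print("Download hyp_content.txt and gold_content.txt first.")
```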