Week 1, Dialogue Systems

19 Feb 2019

Content

Setting up account on MFF lab, notebook vs lab stations vs ufal grid
ASR and how does the phonetic dictionary fits in?
- State IDs, triphones, phones, words, phrases
- Exploring phonetic transcriptions in data
- Exploring gold transcription of audio data and ASR hypothesis.
Informal introduction to measures in dialogue: What do you think is important?
- Search the answers in datasets – See the links below
- Word Error Rate (WER) \(wer(w_{hyp}, w) = \frac{S + D + I}{\mid w \mid}\) where \(w\) is sequence of words and \(S, D, I\) are substitute, delete and insert operations used for transforming gold transcription w to hypothesis \(w_{hyp}\) with minimum edit distance.
  - Minimum edit distance and the operations used are computed exactly using dynamic programming.
  - Computed typically per utterance
- Sentence Error Rate (SER) \(ser(w) = \frac{\mid \{gold_t = hyp_t; t \in \{1, .., N\}\}\mid}{N}\) where \(wer(gold_t, hyp_t) = 0\)
- RTF(real time factor)
- Latency - for SDS how long user has to wait before hearing the reply, significant portion is the ASR latency before getting the ASR result
- Intent/dialogue acts classification
- Policy action accuracy
- Dialogue success
- Measuring natural and factual responses content
- Text-to-speech intonation prosody
- BLEU
- Dialogue length
- Fluency
Open discussions about solving homework and bonus tasks

Useful toolkits and datasets

Kaldi ASR toolkit
Espnet end-to-end ASR toolkit
DSTC2 dataset
Vystadial CZ dataset
Ubuntu corpus 2.0
CamInfo See also other datasets and articles in Cambridge Dialogue System group

Homework

(1 point) Find suboptimal phonetic transcriptions for Czech mapping

 # Obtain script for converting a normalized ortographic form into a phonetic transcription
 wget https://raw.githubusercontent.com/kaldi-asr/kaldi/master/egs/vystadial_cz/s5/local/phonetic_transcription_cs.pl
 # Create file text-in-capitals filled with input data and generate phonetic transcription
 perl phonetic_transcription_cs.pl text-in-capitals.txt phonetic-text.txt

PS: See the corresponding RESULTS for Word Error Rate(WER) for the current version of phonetic_transcription_cs.pl. Let’s see if we can improve it.

(1 point) Write 5 example conversations with at least 25 turns in total between a system and a user. Simulate both the user and a system (Wizard-of-Oz style). Specify if your system is task oriented/chit-chat, it is user initiative/system initiative/mixed initiative, etc. Describe the system capabilities. Use English or Czech language for the conversations examples.
(1 point) Code a simple edit distance utility
- Print out the minimum edit distance
- Add an optional separator for words defaulting to space
- Run a unit-test which demoes the script usage on the following inputs:
  - hyp='', gold=''
  - hyp='a a a', gold='a a a'
  - hyp='a b', gold='a a a'
  - hyp='a b c a', gold='a a a'
- High code quality standards & README with usage description are expected
- Programming language of your choice but make the script smoothly runnable on Ubuntu 18.04
- Implement it yourself, do not copy it from web!
- BONUS (+1 point) Display also the best alignment
  - alignment - sequence of operations how to transform the gold sequence to hypothesis sequence
    - names of operations
      - n - nothing/null/identity
      - s - substitute
      - d - delete
      - i - insert
- BONUS (+1 point) write a unit-test for computing Word Error Rate (WER)
BONUS (3 point) Write down a use case and a product description of a (textual/spoken) dialogue system.
- Include a very high level technical implementation.
- Use 30-60 bullet points to motivate it and describe it.
BONUS (3 points) Download the DSTC2 dataset. Select any three slot types (e.g. food, price, time) and use regular expressions to automatically predict them from dialogue history.
- Process the dev part of DSTC2 dataset and measure F1 score
- Use any scripting tools you need.
- Readability and high coding standard are expected in the evaluation part.
BONUS (3 points) Train an acoustic model and a language model on any public dataset using ready-to-use scripts & Kaldi or Espnet.
- Reading the documentation & the tutorial is needed before starting
- Reserve several GBs (20-100GB) of disk space.
- Expect technical problems
  - You will need to change at least the number of jobs and launch commands and paths to data.
- Recommended models to train
  - Vystadial Kaldi recipe
    - Do not train last stage 9 if you do not own a GPU for training neural networks
  - Espnet Librispeech recipe
    - One needs to have at least one GPU and more than 16 cores is recommended.