Week 1, Dialogue Systems
Content
- Setting up an account in the MFF lab: notebook vs. lab stations vs. ufal grid
- ASR and how the phonetic dictionary fits in
- State IDs, triphones, phones, words, phrases
- Exploring phonetic transcriptions in data
- Exploring gold transcriptions of audio data and ASR hypotheses
- Informal introduction to measures in dialogue: What do you think is important?
- Search for the answers in the datasets (see the links below)
- Word Error Rate (WER): \(wer(w_{hyp}, w) = \frac{S + D + I}{\mid w \mid}\), where \(w\) is the gold sequence of words, \(\mid w \mid\) is its length, and \(S, D, I\) are the numbers of substitution, deletion, and insertion operations needed to transform the gold transcription \(w\) into the hypothesis \(w_{hyp}\) with minimum edit distance.
- The minimum edit distance and the operations used are computed exactly using dynamic programming.
- Typically computed per utterance
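- Sentence Error Rate (SER): \(ser = \frac{\mid \{t \in \{1, .., N\} : hyp_t \neq gold_t\} \mid}{N}\), i.e. the fraction of the \(N\) utterances whose hypothesis contains at least one error (the edit distance between \(hyp_t\) and \(gold_t\) is non-zero)
- Real-time factor (RTF): processing time divided by the duration of the processed audio
- Latency: for a spoken dialogue system (SDS), how long the user has to wait before hearing the reply; a significant portion of it is the time spent waiting for the ASR result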
- Intent/dialogue acts classification
- Policy action accuracy
- Dialogue success
- Measuring the naturalness and factual correctness of response content
- Text-to-speech intonation and prosody
- BLEU
- Dialogue length
- Fluency
- Open discussions about solving homework and bonus tasks
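The error-rate measures listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this page, the toy utterances are invented, and the \(S, D, I\) counts are assumed to be already known from a minimum edit distance alignment (computing them is left to the homework).

```python
from typing import Sequence, Tuple


def wer_from_counts(substitutions: int, deletions: int, insertions: int, gold_len: int) -> float:
    """WER = (S + D + I) / |w|, with counts taken from a minimum edit distance alignment."""
    if gold_len == 0:
        raise ValueError("WER is undefined for an empty gold transcription")
    return (substitutions + deletions + insertions) / gold_len


def sentence_error_rate(pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (gold, hyp) utterance pairs whose word sequences differ."""
    if not pairs:
        raise ValueError("need at least one utterance")
    errors = sum(1 for gold, hyp in pairs if gold.split() != hyp.split())
    return errors / len(pairs)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    # gold "the cat sat on the mat" vs. hyp "the cat sit on mat":
    # one substitution (sat -> sit) and one deletion (the), gold length 6.
    print(wer_from_counts(1, 1, 0, gold_len=6))
    pairs = [
        ("hello world", "hello world"),
        ("turn left", "turn right"),
        ("yes", "yes please"),
        ("no", "no"),
    ]
    print(sentence_error_rate(pairs))
    print(real_time_factor(processing_seconds=2.5, audio_seconds=10.0))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, while SER and the sketch's other ratios stay easy to interpret: SER only asks whether an utterance is fully correct.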
Useful toolkits and datasets
- Kaldi ASR toolkit
- Espnet end-to-end ASR toolkit
- DSTC2 dataset
- Vystadial CZ dataset
- Ubuntu corpus 2.0
- CamInfo. See also other datasets and articles from the Cambridge Dialogue Systems group
Homework
- (1 point) Find suboptimal phonetic transcriptions for the Czech mapping

    # Obtain the script for converting a normalized orthographic form into a phonetic transcription
    wget https://raw.githubusercontent.com/kaldi-asr/kaldi/master/egs/vystadial_cz/s5/local/phonetic_transcription_cs.pl
    # Create the file text-in-capitals.txt filled with input data and generate the phonetic transcription
    perl phonetic_transcription_cs.pl text-in-capitals.txt phonetic-text.txt

  PS: See the corresponding RESULTS for Word Error Rate (WER) for the current version of phonetic_transcription_cs.pl. Let’s see if we can improve it.
- (1 point) Write 5 example conversations with at least 25 turns in total between a system and a user. Simulate both the user and the system (Wizard-of-Oz style). Specify whether your system is task-oriented or chit-chat, whether it is user-initiative, system-initiative, or mixed-initiative, etc. Describe the system's capabilities. Use English or Czech for the example conversations.
- (1 point) Code a simple edit distance utility
- Print out the minimum edit distance
- Add an optional separator for words defaulting to space
- Run a unit test that demonstrates the script usage on the following inputs:
hyp='', gold=''
hyp='a a a', gold='a a a'
hyp='a b', gold='a a a'
hyp='a b c a', gold='a a a'
- High code quality standards and a README with a usage description are expected
- Use a programming language of your choice, but make sure the script runs smoothly on Ubuntu 18.04
- Implement it yourself; do not copy it from the web!
- BONUS (+1 point) Also display the best alignment
- alignment: the sequence of operations that transforms the gold sequence into the hypothesis sequence
- names of operations
- n - nothing/null/identity
- s - substitute
- d - delete
- i - insert
- BONUS (+1 point) Write a unit test for computing Word Error Rate (WER)
- BONUS (3 points) Write down a use case and a product description of a (textual/spoken) dialogue system.
- Include a very high-level description of the technical implementation.
- Use 30-60 bullet points to motivate it and describe it.
- BONUS (3 points) Download the DSTC2 dataset. Select any three slot types (e.g. food, price, time) and use regular expressions to automatically predict them from the dialogue history.
- Process the dev part of the DSTC2 dataset and measure the F1 score.
- Use any scripting tools you need.
- Readability and a high coding standard are expected in the evaluation part.
- BONUS (3 points) Train an acoustic model and a language model on any public dataset using ready-to-use scripts & Kaldi or Espnet.
- Read the documentation and the tutorial before starting.
- Reserve 20-100 GB of disk space.
- Expect technical problems
- You will need to change at least the number of jobs and launch commands and paths to data.
- Recommended models to train
- Vystadial Kaldi recipe
- Skip the last stage (stage 9) if you do not have a GPU for training neural networks
- Espnet Librispeech recipe
- At least one GPU is required; more than 16 CPU cores are recommended.
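For the DSTC2 slot-prediction bonus above, the evaluation part can be organised roughly as in the following Python sketch. Everything in it is an assumption made for illustration: the slot names, the value lists in the patterns, and the toy utterances are invented and do not come from the DSTC2 data or its ontology, and micro-averaged F1 over (slot, value) pairs is only one reasonable way to aggregate the score.

```python
import re
from typing import List, Set, Tuple

Slot = Tuple[str, str]  # (slot name, slot value)

# Hypothetical patterns; real value lists should be derived from the dataset itself.
SLOT_PATTERNS = {
    "food": re.compile(r"\b(chinese|italian|indian)\b"),
    "price": re.compile(r"\b(cheap|moderate|expensive)\b"),
    "area": re.compile(r"\b(north|south|east|west|centre)\b"),
}


def predict_slots(utterance: str) -> Set[Slot]:
    """Return every (slot, value) pair matched by the regular expressions."""
    text = utterance.lower()
    return {
        (slot, match.group(1))
        for slot, pattern in SLOT_PATTERNS.items()
        for match in pattern.finditer(text)
    }


def micro_f1(predicted: List[Set[Slot]], gold: List[Set[Slot]]) -> float:
    """Micro-averaged F1 over all (slot, value) pairs of all utterances."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))
    fp = sum(len(p - g) for p, g in zip(predicted, gold))
    fn = sum(len(g - p) for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy utterances and gold labels, invented for illustration only.
    utterances = [
        "I want cheap chinese food",
        "something in the north please",
        "an expensive restaurant",
    ]
    gold = [
        {("food", "chinese"), ("price", "cheap")},
        {("area", "north")},
        {("price", "expensive"), ("food", "italian")},
    ]
    predicted = [predict_slots(u) for u in utterances]
    print(micro_f1(predicted, gold))
```

Keeping prediction and scoring in separate functions makes it easy to swap in better patterns without touching the evaluation code.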