# Week 7, Dialogue Systems

### Content

• Review of homework with Amazon Alexa
• Feedback on ParlAI
  • What’s good?
  • What’s unclear?
  • Would you like to use it for research?
• Deciding which research papers to follow

### Personal notes on two of the suggested papers

#### Unsupervised NLU – clustering dialogue intents and slots without labeled data by Shi et al.

• The idea can be separated into two questions:
  • How is the clustering done?
    • An autoencoder maps each utterance to a fixed-size embedding
    • Features participating in the final embedding:
      • Biterm topic model for intent clustering (Yan et al., 2013) – interesting, should be better than LDA & PLSA
      • Word embeddings
      • Frequent-word feature (interesting) – clusters of noun words
    • An RBF function is used as the distance measure between two clusters
    • Intent clustering uses the biterm and word features; slot clustering uses only the word features
  • How to choose the optimal number of clusters?
    • A dataset split is used for choosing the optimal number of clusters (stopping criterion)
    • A very much needed approach
• Weak baselines; the experimental results are promising but probably far from usable
• The results could be further improved in a supervised way
• Manual mapping between clusters and named intents/slot values works surprisingly well for evaluation. Is that a coincidence for this particular dataset?
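The clustering-plus-model-selection recipe above can be sketched end to end. This is a hypothetical illustration, not the paper's code: toy 2-D points stand in for the learned utterance embeddings, plain k-means stands in for the paper's clustering step (the notes do not pin down the algorithm), and the number of clusters is chosen on a held-out split via the mean silhouette score, mirroring the stopping criterion described above.

```python
# Hypothetical sketch: cluster fixed-size utterance embeddings and pick
# the number of clusters on a held-out split (NOT the paper's code).
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50):
    # deterministic farthest-first initialisation, then standard Lloyd steps
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels

def silhouette(points, labels):
    # mean silhouette score; singleton clusters contribute 0 by convention
    total = 0.0
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i]]
        if len(own) == 1:
            continue
        a = sum(dist(p, q) for q in own if q is not p) / (len(own) - 1)
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / sum(1 for l in labels if l == c)
            for c in set(labels) if c != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# toy "validation split" of utterance embeddings with three clear intents
points = [(0, 0), (0.1, 0.2), (5, 5), (5.1, 4.9), (9, 0), (9.2, 0.1)]
best_k = max(range(2, 5), key=lambda k: silhouette(points, kmeans(points, k)))
print(best_k)  # 3
```

With well-separated data the silhouette score peaks at the true number of intents; on real utterance embeddings the peak is much flatter, which is presumably why a held-out split is needed as a stopping criterion.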

#### End-to-end task-oriented dialogue systems – seq2seq with attention for knowledge base queries by Wen et al.

• Works surprisingly well on the Wizard-of-Oz-collected KVRET dataset
• Solves end-to-end training of a task-oriented dialogue system with a knowledge base (KB) in the form of a table
• The architecture is especially interesting for its soft attention over the KB
  • The dialogue history is encoded by an LSTM word by word (both user & agent utterances)
  • Slots are defined by the columns of the database
  • A pick from the database is represented as a probability distribution over whether each KB entry will be used
    • Unclear how exactly it works
    • Why does it work so well? I worked on a very similar architecture, but the gradient was not propagated
  • Creates a matrix U of dimension (number of slots/KB columns) × (number of turns)
  • The decoder uses attention over the input and attention over the KB through the matrix U
• Training uses XE loss, with the REINFORCE algorithm for fine-tuning
  • REINFORCE is trained in a second stage
  • The XE loss helps REINFORCE a lot with bootstrapping (sampling starts from a strong prior)
• It would be interesting to compare with Hybrid Code Networks: Practical and Efficient End-to-End Dialog Control with Supervised and Reinforcement Learning (Williams et al.)
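The “soft pick from the database” noted above can be illustrated with a minimal sketch. This is one plausible reading, not the paper's implementation: toy fixed vectors stand in for the learned decoder state and KB-row embeddings, and a softmax over dot-product scores yields the probability that each row is used. Because every step is differentiable, gradient can flow back into the scores, unlike a hard argmax lookup, which may explain why the gradient propagates here.

```python
# Hypothetical sketch of soft attention over KB rows (not the paper's code):
# score each row against the decoder state, softmax into a distribution.
import math

def softmax(xs):
    # numerically stable softmax
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kb_attention(decoder_state, kb_rows):
    # dot-product score between the decoder state and each KB-row embedding
    scores = [sum(d * r for d, r in zip(decoder_state, row)) for row in kb_rows]
    return softmax(scores)

# three KB rows embedded in a toy 3-d space; the state matches row 1 best
kb_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
decoder_state = [0.1, 2.0, 0.3]
probs = kb_attention(decoder_state, kb_rows)
# probs sums to 1, and row 1 gets the highest probability
```

In the paper's setup the scores would come from the matrix U (slots × turns) rather than a single dot product, but the differentiable softmax over entries is the part that makes end-to-end training possible.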

#### Relevant datasets

• MultiWOZ – a multi-domain dataset for city information (used in the state-tracking paper mentioned above)
• Submit the homework as a Merge Request to the dias-e2e repository. Extend plan.md and rename the file so that it contains your name, e.g. plan-ondra.md.