
Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue Systems

This paper presents a hierarchical reinforcement learning approach that jointly trains dialogue policy and natural language generation modules for task-oriented dialogue systems, achieving state‑of‑the‑art performance on MultiWOZ 2.0 and 2.1 while preserving response fluency.

Laiye Technology Team

The Ninth International Conference on Learning Representations (ICLR‑2021) featured a paper by Laiye‑Tech and Imperial College London that applies hierarchical reinforcement learning (HRL) to address semantic degradation in task‑oriented dialogue systems, achieving the best results on the MultiWOZ 2.0 and 2.1 datasets.

Background Knowledge

Task‑oriented dialogue systems consist of a pipeline of modules: Natural Language Understanding (NLU), Dialogue State Tracking (DST), Dialogue Policy Learning (DPL), and Natural Language Generation (NLG). Traditionally these modules were rule‑based, but recent deep‑learning approaches replace them with neural networks trained via supervised learning, which improves performance but still depends heavily on large annotated corpora.
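The four-stage pipeline can be sketched as a single turn of processing. This is an illustrative toy, not the paper's models: the module names and the rule-based stand-ins below are assumptions made for the example.

```python
# Illustrative sketch of the NLU -> DST -> DPL -> NLG pipeline described above.
# The module implementations are toy rule-based stand-ins, not trained models.

def dialogue_turn(user_utterance, belief_state, modules):
    frame = modules["nlu"](user_utterance)                # NLU: text -> semantic frame
    belief_state = modules["dst"](belief_state, frame)    # DST: fold frame into state
    act = modules["dpl"](belief_state)                    # DPL: state -> dialogue act
    response = modules["nlg"](act)                        # NLG: act -> surface text
    return response, belief_state

# Toy stand-ins for each module (hypothetical, for demonstration only).
toy_modules = {
    "nlu": lambda text: {"inform": {"food": "thai"}} if "thai" in text else {},
    "dst": lambda state, frame: {**state, **frame.get("inform", {})},
    "dpl": lambda state: ("request", "area") if "area" not in state else ("offer", None),
    "nlg": lambda act: "Which area would you like?"
           if act[0] == "request" else "How about Bangkok City?",
}

reply, state = dialogue_turn("I want thai food", {}, toy_modules)
```

The point of the structure is that each module only sees its predecessor's output, which is exactly the interface the supervised annotations (semantic frames, belief states, dialogue acts) are attached to.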

Reinforcement learning (RL) mitigates the annotation bottleneck by optimizing policies based on task success signals rather than explicit language supervision, making it attractive for dialogue systems where success can be measured directly.

Research Motivation

Existing RL methods either freeze the NLG module while optimizing the policy, which limits how responses can adapt, or treat every token as an action in an end‑to‑end setting, which yields an excessively large action space and degrades language quality. The authors instead propose to model the hierarchical relationship between the policy and NLG modules, decoupling them to shrink the action space and improve learning efficiency.

Research Method

The high‑level policy (Dialogue Policy) selects dialogue acts (options) that serve as sub‑goals for the low‑level policy (Natural Language Generator), which generates the actual utterances (primitive actions). The HRL framework jointly trains both modules with a composite reward: a binary task‑success reward (1 for success, 0 otherwise) and a language‑model reward proportional to the probability of generated tokens, encouraging fluent and coherent responses.

Both rewards are combined, and the joint optimization is shown to converge to a locally optimal solution.
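The option structure and the composite reward can be sketched schematically. Everything below is an illustrative assumption: the policies are canned stand-ins, the per-token probabilities are fixed, and the weighting `alpha` is a made-up hyperparameter name, not a value from the paper.

```python
# Schematic sketch of the option framework: the high-level policy picks a
# dialogue act (option) once per turn; the low-level policy emits tokens
# (primitive actions) conditioned on it; the turn is scored by the composite
# reward: binary task success plus a language-model term on token probabilities.

import math
import random

random.seed(0)

def high_level_policy(dialogue_state):
    """Pick a dialogue act (the option) for this turn. Toy stand-in."""
    return random.choice(["request_area", "offer_restaurant"])

def low_level_policy(option):
    """Generate an utterance token by token, conditioned on the option.
    Returns the tokens and their (stand-in) generation probabilities."""
    canned = {
        "request_area": ["which", "area", "?"],
        "offer_restaurant": ["how", "about", "bangkok", "city", "?"],
    }
    tokens = canned[option]
    probs = [0.9] * len(tokens)  # fixed per-token probabilities for illustration
    return tokens, probs

def composite_reward(task_success, token_probs, alpha=0.1):
    """Binary success reward plus an LM reward on mean token log-probability."""
    lm_reward = sum(math.log(p) for p in token_probs) / len(token_probs)
    return float(task_success) + alpha * lm_reward

option = high_level_policy({"food": "thai"})
tokens, probs = low_level_policy(option)
reward = composite_reward(task_success=True, token_probs=probs)
```

The key design choice this illustrates is that the high-level action space is the small set of dialogue acts rather than the full vocabulary, while fluency pressure enters only through the language-model term of the reward.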

Experimental Results

Evaluation on MultiWOZ 2.0 and 2.1 demonstrates that the proposed HDNO method outperforms previous RL baselines in dialogue success rate while also producing higher‑quality, more semantically coherent responses, as illustrated in the result tables and visualizations.

To verify the interpretability of the learned latent dialogue actions, the authors project the fixed‑dimensional latent vectors into 2‑D space and cluster them with K‑Means into eight groups. Each cluster corresponds to semantically similar replies expressed with varied wording, confirming meaningful latent representations.
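The analysis step above can be reproduced in outline: project latent vectors to 2‑D and run K‑Means with eight clusters. The data here is synthetic (Gaussian blobs standing in for the model's latent dialogue acts), the projection is a plain PCA via SVD, and the K‑Means is a minimal Lloyd's-algorithm sketch, not the exact tooling used in the paper.

```python
# Sketch of the interpretability analysis: PCA-project latent vectors to 2-D,
# then cluster with K-Means (k=8). Latents here are synthetic stand-ins.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic "latent dialogue acts": 8 Gaussian blobs in 16-D (50 points each).
blob_centers = rng.normal(scale=5.0, size=(8, 16))
latents = np.concatenate([c + rng.normal(size=(50, 16)) for c in blob_centers])

# PCA to 2-D via SVD of the mean-centered data.
X = latents - latents.mean(axis=0)
_, _, vt = np.linalg.svd(X, full_matrices=False)
points_2d = X @ vt[:2].T

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-Means: assign to nearest centroid, recompute means."""
    r = np.random.default_rng(seed)
    cent = points[r.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = ((points[:, None, :] - cent[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        cent = np.array([points[labels == j].mean(axis=0)
                         if (labels == j).any() else cent[j]
                         for j in range(k)])
    return labels, cent

labels, centroids = kmeans(points_2d, k=8)
```

Inspecting the utterances that fall into each of the eight clusters is what lets the authors check that one cluster gathers semantically similar replies under varied wording.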

Impact

The hierarchical RL framework simultaneously improves task success and language quality, and it can be extended to incorporate dialogue state tracking and NLG into a unified hierarchical structure, further reducing reliance on annotated data and promoting industrial adoption of RL for dialogue systems.

