
Reinforcement Learning for Lead Generation in Task‑Oriented Dialogue Systems

This article presents a reinforcement‑learning‑based approach to improve lead‑capture efficiency of a task‑oriented chatbot used in local services, detailing the system architecture, RL algorithms (DQN/DDQN), data construction, model training, offline and online evaluation, and the resulting commercial gains.

58 Tech

The "Yellow Page" merchant smart chat assistant is a lead‑capture chatbot deployed on the 58 platform for local services such as cleaning, moving, and repairs. Traditional rule‑based and state‑transition methods suffered from low flexibility, poor naturalness, and diminishing performance over time.

To address these issues, the authors introduced reinforcement learning (RL) techniques, specifically Deep Q‑Network (DQN) and its improved variant Double DQN (DDQN), to learn optimal dialogue policies that maximize the probability of obtaining user contact information (phone, WeChat, address, etc.).

The task‑oriented dialogue system consists of four core modules: Natural Language Understanding (NLU), Dialogue State Tracking (DST), Dialogue Policy Learning (DPL), and Natural Language Generation (NLG). Dialogue management is implemented either with rule‑based finite‑state machines or RL‑based policies.
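To make the division of labor concrete, here is a minimal sketch of how the four modules hand off to one another. All function and field names are illustrative assumptions, not the article's actual implementation; the rule-based policy stands in for the learned Q-network described later.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    intent: str = ""
    slots: dict = field(default_factory=dict)
    last_action: str = ""

def nlu(utterance: str) -> tuple:
    """Toy NLU: keyword-based intent detection and slot filling."""
    if "moving" in utterance:
        return "request_service", {"service": "moving"}
    return "other", {}

def dst(state: DialogueState, intent: str, slots: dict) -> DialogueState:
    """DST: fold the latest NLU result into the tracked state."""
    state.intent = intent
    state.slots.update(slots)
    return state

def policy(state: DialogueState) -> str:
    """DPL: rule-based placeholder; the article replaces this with a Q-network."""
    return "request_phone" if state.intent == "request_service" else "clarify"

def nlg(action: str) -> str:
    """NLG: template lookup for the chosen system action."""
    templates = {
        "request_phone": "Could you share your phone number?",
        "clarify": "Could you tell me more about what you need?",
    }
    return templates[action]

state = DialogueState()
intent, slots = nlu("I need help moving next week")
state = dst(state, intent, slots)
reply = nlg(policy(state))
```

In the article's system, only the DPL box is swapped out for the learned policy; the surrounding NLU/DST/NLG interfaces stay the same.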

Key challenges in applying RL to this domain include constructing reward signals, handling the correlation of sequential samples, and coping with non‑stationary data distributions. Solutions such as experience replay, dual‑network architecture, and reward‑shaping were employed.
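The reward-shaping idea can be sketched as a terminal reward for a captured lead plus a small per-turn cost that discourages dragging the conversation out. The specific values below are assumptions for illustration; the article does not publish its exact reward scheme.

```python
def shaped_reward(got_contact: bool, is_end: bool) -> float:
    """Illustrative shaped reward (assumed magnitudes):
    +1 for a dialogue ending with captured contact info,
    -1 for a dialogue ending without it,
    a small negative step cost otherwise to favor shorter dialogues."""
    if is_end:
        return 1.0 if got_contact else -1.0
    return -0.05  # per-turn shaping cost
```

The step cost is the shaping term: without it, policies that stall indefinitely look no worse than concise ones.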

The training pipeline initializes the environment and two Q‑networks, collects experience tuples {ϕ(S), A, R, ϕ(S'), is_end}, stores them in a replay buffer, and updates the current Q‑network on sampled mini‑batches while periodically syncing the target network.
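The loop above can be sketched end to end. This is a toy DDQN-style update with a linear Q-function on random transitions, not the article's BiLSTM model; all hyperparameters and the environment are assumptions, but the structure (replay buffer, dual networks, current-network action selection with target-network evaluation, periodic sync) matches the pipeline described.

```python
import random
from collections import deque
import numpy as np

STATE_DIM, N_ACTIONS = 4, 3
rng = np.random.default_rng(0)
random.seed(0)

W = rng.normal(0, 0.1, (N_ACTIONS, STATE_DIM))  # current Q-network (linear)
W_target = W.copy()                             # target Q-network
buffer = deque(maxlen=1000)                     # experience replay buffer
gamma, lr, sync_every = 0.9, 0.01, 20

def q_values(weights, s):
    """Q(s, ·) for a linear Q-function."""
    return weights @ s

for step in range(200):
    # Collect one transition (phi(S), A, R, phi(S'), is_end) from a toy env.
    s = rng.normal(size=STATE_DIM)
    if random.random() > 0.1:                   # epsilon-greedy action choice
        a = int(np.argmax(q_values(W, s)))
    else:
        a = random.randrange(N_ACTIONS)
    r = float(s[a])                             # toy reward signal
    s2 = rng.normal(size=STATE_DIM)
    done = random.random() < 0.1
    buffer.append((s, a, r, s2, done))

    # Update the current network from a sampled mini-batch.
    if len(buffer) >= 32:
        for s_b, a_b, r_b, s2_b, done_b in random.sample(buffer, 32):
            # Double DQN: current network selects, target network evaluates.
            a_star = int(np.argmax(q_values(W, s2_b)))
            target = r_b if done_b else r_b + gamma * q_values(W_target, s2_b)[a_star]
            td_error = target - q_values(W, s_b)[a_b]
            W[a_b] += lr * td_error * s_b       # gradient step for the taken action

    if step % sync_every == 0:
        W_target = W.copy()                     # periodic target-network sync
```

Decoupling action selection (current network) from value evaluation (target network) is exactly the DDQN change that reduces the overestimation bias of plain DQN.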

Data for training were extracted from real human‑agent conversations, filtered, clustered, and labeled into 20 action categories (e.g., request phone number, ask service time). Each state is represented as a tuple of recent user queries, previous actions, intent, and slot information.
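A state built from those fields might look like the following. The field names and values are assumed for illustration, following the article's description of recent queries, previous actions, intent, and slots.

```python
# Illustrative dialogue state; keys follow the article's described fields,
# but the exact schema and values are assumptions.
state = {
    "recent_queries": ["How much for a deep clean?", "Can you come Saturday?"],
    "prev_actions": ["greet", "ask_service_time"],
    "intent": "book_cleaning",
    "slots": {"service": "cleaning", "time": "Saturday"},
}
```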

Model training used a BiLSTM‑based dual network, taking roughly 3 hours on a 2‑core CPU and achieving an average inference latency of 74 ms. The loss function combined the DQN loss with an auxiliary action‑classification loss: loss = θ·loss(DQN) + (1−θ)·loss(action).
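The combined loss is a straightforward convex mixture. The weight θ = 0.7 below is an assumed value for illustration; the article does not report the actual setting.

```python
def combined_loss(dqn_loss: float, action_loss: float, theta: float = 0.7) -> float:
    """loss = theta * loss(DQN) + (1 - theta) * loss(action).
    theta = 0.7 is an assumed default, not the article's value."""
    return theta * dqn_loss + (1 - theta) * action_loss
```

The auxiliary classification term supervises the network with human-agent action labels, which stabilizes learning when the RL reward alone is sparse.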

Offline evaluation metrics included average reward and conversation completion rate, while online metrics focused on lead‑capture conversion rate. Experiments showed an absolute 10% increase in conversion for top categories, and the RL‑enhanced model now serves 100% of traffic.

Future work includes extending the approach to long‑tail categories, exploring multi‑task RL, and investigating advanced algorithms such as DDPG and A3C to further boost the chatbot’s lead‑generation capabilities.

Author: Zhu Tao, Senior Algorithm Engineer at 58.com TEG AI Lab, specializing in intelligent QA systems.

Tags: customer service, reinforcement learning, chatbot, DQN, task-oriented dialogue, lead generation
Written by 58 Tech, the official tech channel of 58 and a platform for tech innovation, sharing, and communication.
