LSTM‑Jump: Learning to Skim Text for Faster Sequence Modeling
The paper introduces LSTM‑Jump, a reinforcement‑learning‑trained LSTM variant that can dynamically skip irrelevant tokens, achieving up to six‑fold speed‑ups over standard sequential LSTMs while maintaining or improving accuracy on various NLP tasks such as sentiment analysis, document classification, and question answering.
1 Introduction
In many NLP sub‑fields—document classification, machine translation, and QA—recurrent neural networks (RNNs) have shown great promise, but they typically read every token sequentially, which is slow for long texts. This work proposes a method that allows the model to jump over unimportant parts of the input, reducing computation while preserving performance.
The underlying model is an LSTM that, after reading a small number of tokens, decides how many tokens to skip. A policy‑gradient reinforcement learning approach is used to train the discrete jump decisions. Experiments on four tasks (numeric prediction, sentiment analysis, news classification, and QA) show that the jumping LSTM can be up to six times faster than a standard sequential LSTM with comparable or better accuracy.
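Because the jump decisions are discrete samples, the gradient cannot flow through them directly; the paper trains them with policy gradients (REINFORCE). A minimal sketch of the score-function gradient for a single sampled jump is below; the function name, the toy logits, and the uniform setup are illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_grad(logits, action, reward, baseline=0.0):
    """REINFORCE estimate of the gradient of expected reward w.r.t. the logits.

    Using d/dz_i log softmax(z)_a = 1[i == a] - softmax(z)_i, the estimate
    for one sampled action is (reward - baseline) * (onehot(action) - probs).
    """
    probs = softmax(logits)
    onehot = np.zeros_like(probs)
    onehot[action] = 1.0
    return (reward - baseline) * (onehot - probs)

# toy example: 4 possible jump actions with a uniform policy
logits = np.zeros(4)
action = int(rng.choice(4, p=softmax(logits)))
grad = reinforce_grad(logits, action, reward=1.0)
```

A positive reward pushes probability mass toward the sampled jump; the baseline (the paper can use a learned or running-average baseline) reduces the variance of this estimate without biasing it.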
2 Method
We describe the LSTM‑Jump architecture. Before training, we fix the number of tokens read before each jump R, the maximum jump size K, and the maximum number of jumps N. K is a fixed hyper‑parameter, while N and R may differ between training and testing. The notation d₁:p denotes the sequence d₁, d₂, …, dₚ.
2.1 Model Overview
The model (illustrated in Figure 1) is built on a standard LSTM. At each step the LSTM reads R tokens, produces a hidden state, and feeds it to a softmax that predicts a distribution over possible jump lengths 0…K. A jump length κ is sampled from this distribution, and reading resumes κ tokens past the last token read; after the first R tokens, for example, the next token read is x_{R+κ}. The process repeats until one of three termination conditions occurs:
a) the jump softmax samples a 0;
b) the number of jumps exceeds N;
c) the network reaches the final token x_T.
After termination, the final hidden state is used for the downstream task: a classification softmax for tasks in Sections 3.1–3.3, or a similarity computation for the QA task in Section 3.4.
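The read/jump schedule above can be sketched as a simple loop. This is a toy simulation of which token positions get read, assuming a given policy callback in place of the LSTM's jump softmax; the function and its arguments are mine, not the paper's code:

```python
def skim(tokens, R, K, N, jump_policy):
    """Simulate LSTM-Jump reading: read R tokens, sample a jump in {0..K}, repeat.

    Stops when the policy samples 0, when more than N jumps have been made,
    or when reading reaches the end of the sequence.
    Returns the list of token positions actually read.
    """
    pos, jumps, read = 0, 0, []
    T = len(tokens)
    while pos < T:
        # read the next R tokens (a real model would feed them to the LSTM)
        end = min(pos + R, T)
        read.extend(range(pos, end))
        if end == T:
            break                      # condition (c): reached the final token
        kappa = jump_policy(end)       # stand-in for sampling from the jump softmax
        if kappa == 0:
            break                      # condition (a): sampled a 0
        jumps += 1
        if jumps > N:
            break                      # condition (b): exceeded N jumps
        pos = end + kappa - 1          # resume kappa tokens past the last token read

    return read

# toy policy: always jump the maximum K tokens
tokens = list(range(20))
read = skim(tokens, R=2, K=3, N=5, jump_policy=lambda _: 3)
```

With R=2 and a constant jump of 3, the simulation reads 10 of the 20 tokens, which illustrates where the speed-up comes from: skipped tokens cost no LSTM computation at all.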
3 Experimental Results
Tables 1–7 (not reproduced here) report task and dataset statistics, as well as test accuracy and runtime for various jump settings. Across synthetic number prediction, sentiment analysis (Rotten Tomatoes, IMDB), news article classification (AG News), and the Children's Book Test, LSTM‑Jump consistently reduces processing time while achieving accuracy comparable to or better than the baseline sequential LSTM.