
Demystifying RNNs and LSTMs: Architecture, Limits, and Python Forecasting

This article explains the structure and operation of recurrent neural networks (RNNs), their limitations, and how long short‑term memory (LSTM) networks overcome those limitations with gated mechanisms, then walks through a complete Python implementation of time‑series forecasting on the airline passengers dataset.

Model Perspective

Recurrent Neural Network (RNN)

Construction of RNN

An RNN is a neural‑network architecture that contains loops, allowing information to be passed between time steps: the hidden state produced at one step is fed back as an input to the next step.

The input is a time series. At each time step the network receives the current input and produces two streams: one is emitted as the external output, the other is stored as a hidden state that is fed back at the next step. Unrolled over time, this yields a chain of identical copies of the network, one per time step, which lets the model retain information from earlier moments.
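The loop described above can be sketched in a few lines of NumPy. The weight names (`W_xh`, `W_hh`, `W_hy`) and the dimensions are illustrative assumptions for this sketch, not part of any particular library:

```python
# Minimal sketch of one RNN step with a tanh hidden layer (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 5, 2
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden (the loop)
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # hidden -> output

def rnn_step(x_t, h_prev):
    """One time step: the new hidden state feeds back into the next step."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)  # hidden state (recurrent stream)
    y_t = W_hy @ h_t                           # external output stream
    return y_t, h_t

# unroll over a short sequence: the same weights are reused at every step
h = np.zeros(hidden_dim)
sequence = rng.normal(size=(4, input_dim))     # 4 time steps of 3-dim input
for x_t in sequence:
    y, h = rnn_step(x_t, h)
print(y.shape, h.shape)  # (2,) (5,)
```

Note that the same three weight matrices are shared across all time steps; the "chain of copies" is an unrolled view of a single network.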

Limitations of RNN

Although RNNs can in theory preserve information from arbitrarily distant time steps, in practice the influence of distant states fades because gradients vanish as they are backpropagated through long sequences. This makes it hard for plain RNNs to capture long‑term dependencies, which limits their performance on tasks such as language modeling.
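A toy calculation illustrates the decay. Assuming, hypothetically, that each backpropagation step scales the gradient by a factor of about 0.9 (real per-step factors depend on the weights and activations):

```python
# Why gradients vanish: backpropagating through T steps multiplies T per-step
# factors; if each factor is below 1 the product decays geometrically.
jacobian_norm = 0.9  # toy per-step gradient scale, not from a trained model
for T in (1, 10, 50, 100):
    print(T, jacobian_norm ** T)
```

A state 100 steps back contributes on the order of 0.9**100 ≈ 2.7e-5 to the gradient, so the network effectively stops learning from it.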

Consider a language model that predicts the next word from the words so far. Predicting the final word "汉语" ("Chinese") in "我最常说汉语" ("the language I speak most often is Chinese") requires information from the much earlier clause "我是一个中国人" ("I am Chinese"), whereas predicting the final word "菜" ("dishes") in "我喜欢妈妈做的菜" ("I like the dishes my mom makes") needs only the immediately preceding words.

To address the long‑term dependency problem, Hochreiter and Schmidhuber (1997) introduced the Long Short‑Term Memory (LSTM) network, which uses gated mechanisms to preserve and control information flow.

Long Short‑Term Memory (LSTM)

Relationship between LSTM and RNN

LSTM is a special type of RNN that adds three gated components—forget gate, input (memory) gate, and output gate—to the basic recurrent cell, allowing the network to learn which information to keep, update, or discard over long sequences.

Basic Idea of LSTM

The core of an LSTM cell is the cell state, a vector that runs straight through the entire chain with only minor linear interactions. Gates, implemented as sigmoid‑activated layers followed by element‑wise multiplication, regulate the flow of information into and out of the cell state.

Forget Gate

The forget gate decides which parts of the previous cell state should be discarded. It takes the previous hidden output and the current input, passes them through a sigmoid layer, and multiplies the result with the old cell state.

Input (Memory) Gate

The input gate determines which new information will be added to the cell state. It consists of a sigmoid layer that selects relevant components and a tanh layer that creates candidate values, which are then combined by element‑wise multiplication.

Sigmoid layer outputs values between 0 and 1 to control the amount of new information.

Tanh layer generates a candidate vector whose values lie in [-1, 1].
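These two ranges are easy to verify directly. A quick sketch in plain NumPy (the sigmoid formula is written out explicitly, since NumPy has no built-in sigmoid):

```python
# Sanity-check the ranges of the two squashing functions used in the input gate.
import numpy as np

z = np.linspace(-10, 10, 101)
sigmoid = 1.0 / (1.0 + np.exp(-z))   # gate values: strictly inside (0, 1)
candidate = np.tanh(z)               # candidate values: inside [-1, 1]
assert 0 < sigmoid.min() and sigmoid.max() < 1
assert -1 <= candidate.min() and candidate.max() <= 1
```

The sigmoid output acts as a soft mask (0 blocks, 1 passes), while tanh keeps candidate magnitudes bounded so the cell state cannot grow without limit from a single update.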

Updating the Cell State

The new cell state is computed by element‑wise adding the scaled candidate vector (from the input gate) to the scaled old cell state (after the forget gate).

Output Gate

The output gate determines what part of the cell state will be exposed as the hidden output. It applies a sigmoid layer to the current input and previous hidden state, multiplies the result with the tanh‑transformed cell state, and produces the final output vector.
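Putting the three gates together, one forward step of an LSTM cell can be sketched as follows. This is a didactic sketch: the weight names (`W_f`, `W_i`, `W_c`, `W_o`) and sizes are illustrative assumptions, not the Keras internals used in the implementation below.

```python
# One LSTM forward step combining the forget, input, and output gates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])      # [previous hidden, current input]
    f = sigmoid(W_f @ z + b_f)             # forget gate: what to discard
    i = sigmoid(W_i @ z + b_i)             # input gate: what to write
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate values in [-1, 1]
    c_t = f * c_prev + i * c_tilde         # updated cell state
    o = sigmoid(W_o @ z + b_o)             # output gate: what to expose
    h_t = o * np.tanh(c_t)                 # new hidden output
    return h_t, c_t

rng = np.random.default_rng(1)
input_dim, hidden_dim = 3, 4
shape = (hidden_dim, hidden_dim + input_dim)
params = []
for _ in range(4):                         # four weight/bias pairs, one per layer
    params += [rng.normal(scale=0.1, size=shape), np.zeros(hidden_dim)]

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, c = lstm_step(rng.normal(size=input_dim), h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

The key design point is visible in the update line: the cell state `c_t` is modified only by element‑wise multiplication and addition, giving gradients a mostly uninterrupted path through time.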

Python Implementation (Airline Passenger Forecasting)

<code># LSTM for international airline passengers problem with window regression framing
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from pandas import read_csv
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1):
    dataX, dataY = [], []
    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back), 0]
        dataX.append(a)
        dataY.append(dataset[i + look_back, 0])
    return np.array(dataX), np.array(dataY)

# fix random seed for reproducibility
tf.random.set_seed(7)
# load the dataset
dataframe = read_csv('data/airline-passengers.csv', usecols=[1], engine='python')
dataset = dataframe.values.astype('float32')
# normalize the dataset
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)
# split into train and test sets
train_size = int(len(dataset) * 0.67)
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]
# reshape into X=t and Y=t+1
look_back = 3
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
# reshape input to be [samples, time steps, features]
trainX = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
# create and fit the LSTM network
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)
# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])
# calculate root mean squared error
trainScore = np.sqrt(mean_squared_error(trainY[0], trainPredict[:,0]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = np.sqrt(mean_squared_error(testY[0], testPredict[:,0]))
print('Test Score: %.2f RMSE' % (testScore))
# shift train predictions for plotting
trainPredictPlot = np.empty_like(dataset)
trainPredictPlot[:, :] = np.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict
# shift test predictions for plotting
testPredictPlot = np.empty_like(dataset)
testPredictPlot[:, :] = np.nan
testPredictPlot[len(trainPredict)+(look_back*2)+1:len(dataset)-1, :] = testPredict
# plot baseline and predictions
plt.plot(scaler.inverse_transform(dataset))
plt.plot(trainPredictPlot)
plt.plot(testPredictPlot)
plt.show()
</code>

References

https://zhuanlan.zhihu.com/p/104475016

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/

Python · time series forecasting · neural networks · LSTM · RNN
Written by

Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
