Transforming Time Series Data into Supervised Learning Datasets with Pandas
This tutorial explains how to convert single‑variable and multivariate time‑series data into supervised learning formats using pandas shift() and a custom series_to_supervised() function, covering one‑step, multi‑step, and sequence forecasting examples with complete Python code.
Machine‑learning methods such as deep learning can be applied to time‑series forecasting, but the series must first be reshaped into a supervised‑learning problem where each observation becomes a pair of input (X) and output (y) sequences.
The pandas shift() function is essential for this transformation: a positive shift moves values down the rows and creates lag (historical) features, while a negative shift moves values up and creates forecast features. Combining both kinds of column forms the required X-y pairs.
Example of creating a simple lag column:
<code>from pandas import DataFrame
df = DataFrame()
df['t'] = [x for x in range(10)]
print(df)</code>
The resulting DataFrame holds the original series in column t. Adding a lag column with df['t-1'] = df['t'].shift(1) and printing again produces:
<code>   t  t-1
0  0  NaN
1  1  0.0
2  2  1.0
3  3  2.0
4  4  3.0
5  5  4.0
6  6  5.0
7  7  6.0
8  8  7.0
9  9  8.0</code>
The first row of t-1 is NaN because no earlier observation exists. Using a negative shift creates a forecast column, e.g., df['t+1'] = df['t'].shift(-1), which pulls each next value up by one row and leaves NaN in the last row.
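As a minimal sketch of how the two shifts combine, the block below builds both a lag and a forecast column on the same toy series and drops the incomplete rows. The variable name complete is only illustrative.

```python
from pandas import DataFrame

df = DataFrame({'t': list(range(10))})
df['t-1'] = df['t'].shift(1)    # lag: the previous observation
df['t+1'] = df['t'].shift(-1)   # forecast: the next observation

# The first row has no predecessor and the last row has no successor,
# so both contain NaN and are removed here.
complete = df.dropna()
print(complete)
```

Each remaining row now pairs a past value, the current value, and a future value, which is the shape a supervised learner expects.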
The core utility is the series_to_supervised() function, which automatically builds a supervised‑learning DataFrame from any time‑series array. Its parameters are:
data : list or 2‑D NumPy array of observations (required)
n_in : number of lag observations to use as inputs (default 1)
n_out : number of future observations to predict (default 1)
dropnan : whether to remove rows containing NaN values (default True)
The function concatenates shifted copies of the data, labels input columns as varN(t-k) and output columns as varN(t) and varN(t+m), where N is the variable number, and returns the assembled DataFrame.
<code>from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Convert a time series to a supervised learning dataset.

    data: sequence of observations (list or 2-D ndarray)
    n_in: number of lag observations used as input (X)
    n_out: number of observations used as output (y)
    dropnan: whether to drop rows containing NaN values
    """
    df = DataFrame(data)
    n_vars = df.shape[1]  # 1 for a list or 1-D array
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n_out-1)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # combine the shifted copies side by side
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows that contain NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg</code>
One-step prediction example:
<code>values = [x for x in range(10)]
data = series_to_supervised(values)
print(data)</code>
The output has two columns, var1(t-1) as input and var1(t) as target. The first row is dropped because its lag value is NaN, leaving nine rows.
Multi-step forecasting (e.g., two past steps to predict two future steps) is achieved by calling series_to_supervised(values, 2, 2), producing columns var1(t-2), var1(t-1), var1(t), and var1(t+1).
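The multi-step call can be sketched end to end as follows. The function body is a compact restatement of series_to_supervised() so that the block runs on its own; the split into X and y by column position is an assumption about how the result would typically be consumed, not something the function does itself.

```python
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # Compact restatement of the function above so this block is standalone.
    df = DataFrame(data)
    n_vars = df.shape[1]
    cols, names = [], []
    for i in range(n_in, 0, -1):          # lag columns t-n_in ... t-1
        cols.append(df.shift(i))
        names += ['var%d(t-%d)' % (j + 1, i) for j in range(n_vars)]
    for i in range(n_out):                # output columns t ... t+n_out-1
        cols.append(df.shift(-i))
        suffix = 't' if i == 0 else 't+%d' % i
        names += ['var%d(%s)' % (j + 1, suffix) for j in range(n_vars)]
    agg = concat(cols, axis=1)
    agg.columns = names
    return agg.dropna() if dropnan else agg

values = list(range(10))
n_in, n_out = 2, 2
data = series_to_supervised(values, n_in, n_out)

# For a single variable, the first n_in columns are inputs, the rest outputs.
X = data.iloc[:, :n_in]   # var1(t-2), var1(t-1)
y = data.iloc[:, n_in:]   # var1(t), var1(t+1)
print(data)
print(X.shape, y.shape)
```

Two rows are lost at the start (no two-step history) and one at the end (no next value), so ten observations yield seven usable samples.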
For multivariate series, the same function handles multiple variables. Creating a DataFrame with columns ob1 and ob2, then calling series_to_supervised(values) on its values, yields a supervised dataset in which the previous time step of both variables, var1(t-1) and var2(t-1), serves as input and the current step, var1(t) and var2(t), as output.
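To make the multivariate layout concrete, the sketch below writes out by hand what the default one-step call amounts to for two variables, using shift() and concat() directly. The sample values for ob1 and ob2 are illustrative assumptions, since the text does not give them.

```python
from pandas import DataFrame, concat

raw = DataFrame({
    'ob1': list(range(10)),       # illustrative values, assumed for this sketch
    'ob2': list(range(50, 60)),   # illustrative values, assumed for this sketch
})

# Inputs: both variables at the previous time step.
lagged = raw.shift(1)
lagged.columns = ['var1(t-1)', 'var2(t-1)']

# Outputs: both variables at the current time step.
current = raw.copy()
current.columns = ['var1(t)', 'var2(t)']

supervised = concat([lagged, current], axis=1).dropna()
print(supervised.head(3))
```

The shift is applied to the whole DataFrame at once, so every variable is lagged by the same number of steps and the columns stay aligned row by row.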