
Predicting Tomorrow’s Weather with Random Forests: A European City Case Study

Using detailed meteorological records from 18 European cities between 2000 and 2010, this article demonstrates how random forest regression and comprehensive data preprocessing can forecast daily precipitation, evaluate model performance, and compare climatic patterns across cities, highlighting both strengths and limitations of the approach.

Model Perspective
How can we predict tomorrow's weather? Farmers, travelers, and everyday people all rely on accurate forecasts, which makes weather prediction a blend of science, culture, economy, and social life.
Scientists use weather stations, satellites, and, increasingly, data science to decode atmospheric patterns and anticipate future conditions.
Using detailed records from 18 European cities, we explore how random forest algorithms can help map tomorrow's sky.

Data Collection, Selection, and Processing

The original meteorological data were retrieved from the ECA&D project, which provides daily observations from European and Mediterranean stations. We selected 18 cities with daily data available between 2000 and 2010. The cities include Basel (Switzerland), Budapest (Hungary), Dresden, Düsseldorf, Kassel, Munich (Germany), De Bilt, Maastricht (Netherlands), Heathrow (UK), Ljubljana (Slovenia), Malmö, Stockholm (Sweden), Montpellier, Perpignan, Tours (France), Oslo (Norway), Rome (Italy), and Sankt Blas (Austria).

Only the 2000‑2010 period was kept, yielding 3,654 daily observations across the 18 locations. The dataset contains variables such as average temperature, maximum temperature, minimum temperature, cloud cover, wind speed, gust, humidity, pressure, global radiation, precipitation, and sunshine duration.

After collection, basic cleaning removed columns with >5% invalid entries (marked “-9999”). Columns with ≤5% invalid entries had missing values replaced by the column mean, resulting in 165 variables for 3,654 days. Units were converted to more intuitive scales (e.g., temperature in °C, wind speed in m/s, humidity as a percentage, pressure in 1000 hPa, radiation in 100 W/m², precipitation in cm, sunshine in hours) to facilitate machine‑learning modeling without additional standardization.
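The cleaning rule can be sketched in a few lines of pandas. This is an illustrative reconstruction on a toy frame, not the article's actual script: the -9999 sentinel, the 5% threshold, and mean imputation come from the text above, while the column names and values are invented for the demo.

```python
import numpy as np
import pandas as pd

INVALID = -9999      # sentinel marking invalid entries in the raw files
THRESHOLD = 0.05     # drop columns with more than 5% invalid entries

# Toy frame (hypothetical values): 20 daily rows.
# 'temp_mean' has 1/20 = 5%  invalid -> kept and mean-imputed,
# 'sunshine'  has 3/20 = 15% invalid -> dropped.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "temp_mean": rng.normal(10, 5, 20).round(1),
    "sunshine": rng.uniform(0, 12, 20).round(1),
})
df.loc[0, "temp_mean"] = INVALID
df.loc[[1, 5, 9], "sunshine"] = INVALID

df = df.replace(INVALID, np.nan)            # mark sentinel values as missing
invalid_share = df.isna().mean()            # per-column fraction of missing
df = df.drop(columns=invalid_share[invalid_share > THRESHOLD].index)
df = df.fillna(df.mean())                   # mean-impute the remaining gaps

print(df.columns.tolist())                  # ['temp_mean']
```

The same two-step rule (drop heavily corrupted columns, impute lightly corrupted ones) scales directly to the full 18-city frame.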

Physical Units of Variables

Original units:

CC: cloud cover (eighths)

DD: wind direction (degrees)

FG: wind speed (0.1 m/s)

FX: gust (0.1 m/s)

HU: humidity (1 %)

PP: sea‑level pressure (0.1 hPa)

QQ: global radiation (W/m²)

RR: precipitation (0.1 mm)

SS: sunshine duration (0.1 h)

TG: mean temperature (0.1 °C)

TN: minimum temperature (0.1 °C)

TX: maximum temperature (0.1 °C)

Converted units:

FG, FX: 1 m/s

HU: 100 % scale

PP: 1000 hPa

QQ: 100 W/m²

RR: 10 mm

SS: 1 h

TG, TN, TX: 1 °C
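In code, the conversions amount to multiplying each raw column by a fixed scale factor. The mapping below is derived from the two unit lists above; it is our sketch of how the conversion might be implemented, not the article's code, and the example raw values are chosen to reproduce the first Basel row shown later.

```python
# Scale factors from raw ECA&D units to the converted units listed above
SCALE = {
    "FG": 0.1, "FX": 0.1,             # 0.1 m/s  -> 1 m/s
    "HU": 0.01,                       # 1 %      -> 100 % scale
    "PP": 0.0001,                     # 0.1 hPa  -> 1000 hPa
    "QQ": 0.01,                       # 1 W/m²   -> 100 W/m²
    "RR": 0.01,                       # 0.1 mm   -> 10 mm (cm)
    "SS": 0.1,                        # 0.1 h    -> 1 h
    "TG": 0.1, "TN": 0.1, "TX": 0.1,  # 0.1 °C   -> 1 °C
}

# Illustrative raw values matching the first Basel row shown below
raw = {"TG": 29, "HU": 89, "PP": 10286, "RR": 3}
converted = {k: round(v * SCALE[k], 4) for k, v in raw.items()}
print(converted)  # {'TG': 2.9, 'HU': 0.89, 'PP': 1.0286, 'RR': 0.03}
```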

<code># Import required libraries
import pandas as pd

# Load dataset
file_path = "data/weather_prediction_dataset.csv"
weather_data = pd.read_csv(file_path)

# Show first five rows
weather_data.head()
</code>

Sample of the first five rows:

<code>        DATE  MONTH  BASEL_cloud_cover  BASEL_humidity  BASEL_pressure  \
0 2000-01-01      1                 8           0.89          1.0286   
1 2000-01-02      1                 8           0.87          1.0318   
2 2000-01-03      1                 5           0.81          1.0314   
3 2000-01-04      1                 7           0.79          1.0262   
4 2000-01-05      1                 5           0.90          1.0246   

   BASEL_global_radiation  BASEL_precipitation  BASEL_sunshine  \
0                    0.20                 0.03              0.0   
1                    0.25                 0.00              0.0   
2                    0.50                 0.00              3.7   
3                    0.63                 0.35              6.9   
4                    0.51                 0.07              3.7   

   BASEL_temp_mean  BASEL_temp_min  ...  BASEL_humidity_lag_3  
0               2.9            1.6  ...                   NaN   
1               3.6            2.7  ...                   NaN   
2               2.2            0.1  ...                   NaN   
3               3.9            0.5  ...                  0.89   
4               6.0            3.8  ...                  0.87   

   BASEL_pressure_lag_1  BASEL_pressure_lag_2  BASEL_pressure_lag_3  \
0                    NaN                    NaN                    NaN   
1                 1.0286                    NaN                    NaN   
2                 1.0318                 1.0286                    NaN   
3                 1.0314                 1.0318                 1.0286   
4                 1.0262                 1.0314                 1.0318   

   BASEL_cloud_cover_lag_1  BASEL_cloud_cover_lag_2  BASEL_cloud_cover_lag_3  \
0                       NaN                       NaN                       NaN   
1                       8.0                       NaN                       NaN   
2                       8.0                       8.0                       NaN   
3                       5.0                       8.0                       8.0   
4                       7.0                       5.0                       8.0   

   BASEL_precipitation_lag_1  BASEL_precipitation_lag_2  \
0                         NaN                         NaN   
1                        0.03                         NaN   
2                        0.00                        0.03   
3                        0.00                        0.00   
4                        0.35                        0.00   

   BASEL_precipitation_lag_3  
0                         NaN   
1                         NaN   
2                         NaN   
3                        0.03   
4                        0.00   

[5 rows x 180 columns]
</code>

Data Integrity Analysis

First we check for missing values and outliers.

<code># Check missing values
missing_data_summary = weather_data.isnull().sum()

# Show columns with missing values (if any)
missing_columns = missing_data_summary[missing_data_summary > 0]
missing_columns
</code>
<code>Series([], dtype: int64)</code>

No missing values were found. Next, we examine outliers for a few representative variables.
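A common screen is the 1.5×IQR rule: values beyond 1.5 interquartile ranges of the quartiles are flagged. The article does not show its outlier check, so the rule choice is our assumption; the sketch below applies it to a synthetic series standing in for a single weather column.

```python
import numpy as np
import pandas as pd

# Synthetic year of data standing in for e.g. BASEL_temp_mean,
# with one planted outlier; the same check applies column by column.
rng = np.random.default_rng(1)
s = pd.Series(rng.normal(10, 6, 365))
s.iloc[100] = 60.0    # injected extreme value

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(len(outliers))  # the planted value is among the flagged points
```

Flagged values should be inspected rather than dropped automatically: extreme but genuine weather (heat waves, cloudbursts) is exactly what a precipitation model must learn.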

Seasonality and Trend Analysis

We analyze Basel’s temperature and precipitation seasonality and trends.

<code>import matplotlib.pyplot as plt

# Convert DATE column to datetime
weather_data['DATE'] = pd.to_datetime(weather_data['DATE'], format='%Y%m%d')

# Select Basel temperature and precipitation
basel_temp_mean = weather_data[['DATE', 'BASEL_temp_mean']]
basel_precipitation = weather_data[['DATE', 'BASEL_precipitation']]

# Plot temperature time series
plt.figure(figsize=(12, 6))
plt.plot(basel_temp_mean['DATE'], basel_temp_mean['BASEL_temp_mean'], label='Mean Temperature (°C)')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.title('Mean Temperature Trend in Basel (Switzerland)')
plt.legend()
plt.show()

# Plot precipitation time series
plt.figure(figsize=(12, 6))
plt.plot(basel_precipitation['DATE'], basel_precipitation['BASEL_precipitation'], label='Precipitation (cm)')
plt.xlabel('Date')
plt.ylabel('Precipitation (cm)')
plt.title('Precipitation Trend in Basel (Switzerland)')
plt.legend()
plt.show()
</code>

(Figures: daily mean temperature and daily precipitation in Basel, 2000–2010.)

Observations:

Average temperature shows clear seasonal cycles, rising in summer and falling in winter, with a relatively stable long‑term trend.

Precipitation also exhibits seasonal variation, though less pronounced, and remains stable over the decade.
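The seasonal cycle can be made explicit by averaging over calendar months. The sketch below uses a synthetic sinusoidal series in place of BASEL_temp_mean, so only the aggregation pattern, not the data, reflects the article.

```python
import numpy as np
import pandas as pd

# Synthetic daily temperatures with an annual sinusoidal cycle
# (stand-in for BASEL_temp_mean over the same 2000-2009 span)
dates = pd.date_range("2000-01-01", "2009-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()
temp = 10 + 8 * np.sin(2 * np.pi * (day_of_year - 105) / 365.25)

# Average by calendar month to expose the seasonal cycle
monthly = pd.Series(temp, index=dates).groupby(dates.month).mean()
print(monthly.round(1))   # highest mean in July, lowest in January
```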

City‑wise Comparison

We compare average temperature and humidity across all 18 cities for 2000‑2010.

<code># Extract city names from column prefixes (sorted for reproducible output)
cities = sorted({col.split('_')[0] for col in weather_data.columns if '_' in col})

# Compute average temperature and humidity per city
# (prefix/suffix matching avoids picking up lag columns or other cities)
city_avg_temp_humidity = []
for city in cities:
    temp_cols = [col for col in weather_data.columns
                 if col.startswith(f'{city}_') and col.endswith('temp_mean')]
    humidity_cols = [col for col in weather_data.columns
                     if col.startswith(f'{city}_') and col.endswith('humidity')]
    avg_temp = weather_data[temp_cols].mean().mean()
    avg_humidity = weather_data[humidity_cols].mean().mean()
    city_avg_temp_humidity.append((city, avg_temp, avg_humidity))

city_avg_temp_humidity_df = pd.DataFrame(
    city_avg_temp_humidity, columns=['City', 'Avg_Temperature (°C)', 'Avg_Humidity'])
city_avg_temp_humidity_df = city_avg_temp_humidity_df.sort_values(
    by='Avg_Temperature (°C)', ascending=False)
city_avg_temp_humidity_df
</code>

The resulting table (shown as an image in the original) reveals two clear patterns:

Temperature: Rome and Perpignan have the highest averages; Sankt Blas the lowest.

Humidity: Sankt Blas shows the highest average humidity, while Perpignan shows the lowest.

Machine Learning Model

We use Basel's data to predict the next day's precipitation. The workflow:

1. Select target variable and features – precipitation as target; temperature, humidity, pressure, and cloud cover as features.

2. Create lag features – the past three days of each variable.

3. Split data – chronological training and test sets (no shuffling).

4. Model selection and training – a random forest regressor, well suited to nonlinear tabular data.

5. Prediction and evaluation – assess performance on the held-out test set.

<code># Choose target and features
target_variable = 'BASEL_precipitation'
features = ['BASEL_temp_mean', 'BASEL_humidity', 'BASEL_pressure', 'BASEL_cloud_cover']

# Create lag features (1-3 days) for each feature and for the target itself
lag_days = 3
for feature in features + [target_variable]:
    for lag in range(1, lag_days + 1):
        weather_data[f'{feature}_lag_{lag}'] = weather_data[feature].shift(lag)

# Assemble feature matrix and target vector
# (current-day readings are included alongside the three lags)
selected_features = [f'{feature}_lag_{lag}' for feature in features for lag in range(1, lag_days + 1)]
selected_features += features
X = weather_data[selected_features].iloc[lag_days:]   # drop rows whose lags are NaN
y = weather_data[target_variable].iloc[lag_days:]

# Standardize features (tree ensembles do not require scaling, but it keeps the
# pipeline reusable with scale-sensitive models; note that fitting the scaler on
# the full series lets test-set statistics leak into training)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split (no shuffling, so the test set is the most recent 20%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, shuffle=False)

# Show the first few prepared rows
X_train[:5], y_train[:5]
</code>

We then train a Random Forest regressor.

<code>from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

# Predictions
y_train_pred = rf_regressor.predict(X_train)
train_mae = mean_absolute_error(y_train, y_train_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))

y_test_pred = rf_regressor.predict(X_test)
test_mae = mean_absolute_error(y_test, y_test_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

train_mae, train_rmse, test_mae, test_rmse
</code>

Performance:

Training MAE: 0.0865 cm

Training RMSE: 0.1651 cm

Test MAE: 0.2450 cm

Test RMSE: 0.4743 cm

The higher test error suggests possible over‑fitting, which could be mitigated by hyper‑parameter tuning, feature engineering, or regularization.
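One concrete mitigation is cross-validated hyper-parameter tuning with chronological folds: TimeSeriesSplit keeps each validation fold after its training data, matching the unshuffled split used above. The grid below is an assumed example rather than tuned values from the article, and the demo fits on synthetic data so the snippet runs standalone.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for the lagged feature matrix (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

# Assumed search grid: tree count, depth limit, and leaf size
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),        # chronological folds, no shuffling
    scoring="neg_mean_absolute_error",
    n_jobs=-1,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Limiting `max_depth` and raising `min_samples_leaf` are the usual levers against the train/test gap seen above.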

<code># Visualize actual vs. predicted precipitation on the test set
test_dates = weather_data['DATE'].iloc[lag_days:].iloc[-len(y_test):]
plt.figure(figsize=(12, 6))
plt.plot(test_dates, y_test, label='Actual Precipitation (cm)')
plt.plot(test_dates, y_test_pred, label='Predicted Precipitation (cm)', linestyle='--')
plt.xlabel('Date')
plt.ylabel('Precipitation (cm)')
plt.title('Predicted vs Actual Precipitation in Basel (Switzerland)')
plt.legend()
plt.show()
</code>

The plot shows that while the model captures general precipitation trends, noticeable deviations remain for certain days.
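A natural follow-up is to ask which lagged inputs the forest actually relies on, via impurity-based feature importances. The sketch below runs on synthetic data with a planted signal, so the feature names and the resulting ranking are illustrative, not results from the article.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: only 'precip_lag_1' carries signal by construction
rng = np.random.default_rng(0)
names = ["temp_lag_1", "humidity_lag_1", "pressure_lag_1", "precip_lag_1"]
X = rng.normal(size=(500, 4))
y = 0.8 * X[:, 3] + rng.normal(scale=0.2, size=500)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Rank predictors by impurity-based importance
importances = pd.Series(rf.feature_importances_, index=names).sort_values(ascending=False)
print(importances)
```

On the real Basel features, a ranking like this shows whether recent precipitation, pressure, or humidity dominates the prediction, guiding the feature engineering suggested above.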

Conclusion

We selected Basel’s daily precipitation as the prediction target and used related meteorological variables as features.

A Random Forest regression model was trained and evaluated, achieving modest accuracy on the test set.

Performance gaps indicate the need for further feature engineering, hyper‑parameter optimization, or alternative modeling approaches.

References:

[1] Klein Tank, A.M.G., and Coauthors, 2002: Daily dataset of 20th-century surface air temperature and precipitation series for the European Climate Assessment. Int. J. Climatol., 22, 1441–1453.

[2] Weather Prediction Dataset, Kaggle: https://www.kaggle.com/datasets/thedevastator/weather-prediction

Tags: Machine Learning, time series, random forest, climate data, weather prediction
Written by Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
