Can a Random Forest Predict Smoking Habits? 79% Accuracy Explained

This article analyzes a biomedical dataset to identify key factors influencing smoking status, performs descriptive and exploratory data analysis, selects important features with a Random Forest, builds a predictive model achieving about 79% accuracy, and discusses evaluation metrics and future improvements.

Model Perspective

Aug 5, 2023

Smoking harms health in many ways: it contributes to disease, shortens life expectancy, and is a leading preventable cause of death worldwide.

The dataset analyzed contains many biomedical features such as age, height, weight, blood pressure, blood sugar, cholesterol, and smoking status.

The most important factors associated with smoking status include hemoglobin, height, and γ‑GTP; a machine‑learning model trained on these biomedical features predicts smoking status with 79.34% accuracy on the test set.

import pandas as pd
# Load the dataset
data = pd.read_csv('Smoker Status Prediction.csv')
# Display the first few rows of the dataset
data.head()

In the first five rows, each row represents one individual, and the last column, "smoking", is the target variable.

The analysis proceeds in five steps:

1. Descriptive statistics, to understand central tendency, dispersion, and shape.

2. Missing‑value check.

3. Exploratory data analysis, using visualizations to explore relationships between the features and the target.

4. Feature selection based on the analysis.

5. A predictive model built on the selected features.

Descriptive Statistics

data.describe()

Key statistics: mean age 44.13 years, mean height 164.69 cm, mean weight 65.94 kg, mean waist circumference 82.06 cm, and mean hemoglobin 14.62 g/dL. The mean of the binary "smoking" column is 0.37, so about 37% of the individuals are smokers.
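To illustrate why the mean of the target column can be read as a prevalence, here is a minimal sketch. The DataFrame `toy` and its values are made up for the example and do not come from the real dataset; only the column name "smoking" is taken from the article.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV (values are invented)
toy = pd.DataFrame({
    "age": [40, 55, 30, 45, 60],
    "smoking": [0, 1, 0, 1, 0],
})

# The mean of a 0/1 column equals the share of rows labelled 1
prevalence = toy["smoking"].mean()

# value_counts(normalize=True) reports the same share, per class
class_share = toy["smoking"].value_counts(normalize=True)

print(prevalence)
print(class_share)
```

Applied to the real data, the same call on `data['smoking']` should reproduce the 0.37 figure reported by `data.describe()`.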

Missing‑Value Check

data.isnull().sum()

No missing values were found, allowing direct progression to exploratory analysis.
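Had the check turned up gaps, one common option would be median imputation before modelling. The sketch below is an assumption about how that could look, not a step the article performs; the toy DataFrame, its values, and the column name "height(cm)" are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two deliberately missing values
toy = pd.DataFrame({
    "hemoglobin": [14.2, np.nan, 15.1, 13.8],
    "height(cm)": [165.0, 170.0, np.nan, 160.0],
    "smoking": [0, 1, 1, 0],
})

# Same check as in the article: count missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Fill gaps in the feature columns with each column's median;
# the target column is left untouched
feature_cols = [c for c in toy.columns if c != "smoking"]
filled = toy.copy()
filled[feature_cols] = filled[feature_cols].fillna(filled[feature_cols].median())
print(filled)
```

The median is used here because it is less sensitive to outliers than the mean; other strategies (dropping rows, model-based imputation) may suit a given dataset better.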

Exploratory Data Analysis

Distribution of smokers vs. non‑smokers:

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
plt.figure(figsize=(6,6))
sns.countplot(x='smoking', data=data)
plt.title('Distribution of Smokers')
plt.xlabel('Smoking Status')
plt.ylabel('Count')
plt.xticks([0,1], ['Non-Smoker','Smoker'])
plt.show()

Boxplots show that smokers tend to be slightly older and have a higher median weight than non‑smokers.

# Age boxplot
plt.figure(figsize=(6,6))
sns.boxplot(x='smoking', y='age', data=data)
plt.title('Age Distribution by Smoking Status')
plt.show()

# Weight boxplot
plt.figure(figsize=(6,6))
sns.boxplot(x='smoking', y='weight(kg)', data=data)
plt.title('Weight Distribution by Smoking Status')
plt.show()
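A boxplot comparison can be backed with numbers by computing group medians directly. The following is a small sketch on invented toy data; the column names "age", "weight(kg)", and "smoking" follow the article's code, while the values are made up and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data (values are invented for illustration)
toy = pd.DataFrame({
    "age": [35, 40, 45, 50, 55, 60],
    "weight(kg)": [60, 65, 70, 75, 80, 85],
    "smoking": [0, 0, 0, 1, 1, 1],
})

# Median of each feature within each smoking group;
# these are the values drawn as the centre line of each box
group_medians = toy.groupby("smoking")[["age", "weight(kg)"]].median()
print(group_medians)
```

Running the same `groupby` on `data` would give the medians behind the two boxplots above.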

Feature Selection

Random Forest is used to rank feature importance.

from sklearn.ensemble import RandomForestClassifier
X = data.drop('smoking', axis=1)
y = data['smoking']
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
importances = rf.feature_importances_
feature_importances = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
feature_importances = feature_importances.sort_values('Importance', ascending=False)
feature_importances

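The features, ranked from most to least important:

1. Hemoglobin

2. Height

3. γ‑GTP

4. Triglyceride

5. HDL

6. LDL

7. ALT

8. Cholesterol

9. Waist

10. Fasting blood sugar

11. Systolic pressure

12. AST

13. Diastolic pressure

14. Serum creatinine

15. Weight

16. Age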
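Once the importances are ranked, a subset of features can be kept for modelling. The sketch below shows one possible way to do that on synthetic data from `make_classification`; the cutoff `top_k = 3` and the names `X_toy`, `ranking`, and `selected` are hypothetical choices for illustration, not values taken from the article.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix (assumption: 8 numeric features)
X_arr, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=42
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Fit a forest and rank features by impurity-based importance
rf_toy = RandomForestClassifier(n_estimators=100, random_state=42)
rf_toy.fit(X_toy, y_toy)
ranking = pd.Series(rf_toy.feature_importances_, index=X_toy.columns)
ranking = ranking.sort_values(ascending=False)

# Keep the top_k highest-ranked columns (top_k is an arbitrary illustrative cutoff)
top_k = 3
selected = ranking.head(top_k).index.tolist()
X_selected = X_toy[selected]

print(ranking)
print(selected)
```

Impurity-based importances sum to 1 and tend to favour continuous, high-cardinality features, so a cutoff chosen this way is best treated as a starting point and checked against held-out performance.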

Machine Learning Model Prediction

A Random Forest classifier is trained on the selected features. The dataset is split into training and testing sets (≈80% train, 20% test). Model performance is evaluated using accuracy, precision, recall, and F1‑score.

# Imports for splitting, modelling, and evaluation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Split data: 80% train, 20% test (X and y come from the feature-selection step)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = rf.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

# Classification report (per-class precision, recall, and F1-score)
report = classification_report(y_test, y_pred, target_names=['Non-Smoker','Smoker'])
print(report)

The model achieves an accuracy of 79.34% on the test set. For non‑smokers, precision is 85% and recall is 84%; for smokers, precision and recall are both 73%. The F1‑score for each class is the harmonic mean of its precision and recall.
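The F1‑score is the harmonic mean of precision and recall, so it can be recovered from the rounded figures quoted above. This is a minimal sketch; the helper name `f1_from` is hypothetical, and because the inputs are rounded percentages, the results are approximate.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded per-class figures quoted in the text
f1_non_smoker = f1_from(0.85, 0.84)
f1_smoker = f1_from(0.73, 0.73)

print(round(f1_non_smoker, 2))
print(round(f1_smoker, 2))
```

When precision and recall are equal, as for the smoker class, the F1‑score equals that shared value; for non‑smokers it falls between 0.84 and 0.85.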

Conclusion

The analysis demonstrates that biomedical features such as hemoglobin, height, and γ‑GTP are strong predictors of smoking status, and a Random Forest model can predict smoking with roughly 79% accuracy. Future work may explore additional preprocessing, feature engineering, and alternative algorithms like gradient‑boosted trees or support vector machines to further improve performance.
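As a rough illustration of how alternative algorithms could be compared, the sketch below runs 5‑fold cross‑validation for a Random Forest and a gradient‑boosted tree model. It uses synthetic data from `make_classification` rather than the smoking dataset, so the scores it prints say nothing about the real problem; the dictionary `candidates` and the fold count are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data (assumption: binary target, numeric features)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=42
)

# Candidate models to compare under identical conditions
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Mean accuracy over 5 cross-validation folds for each candidate
cv_scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Cross‑validation gives a steadier estimate than a single 80/20 split, which matters when the difference between models is only a point or two of accuracy.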

Written by

Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
