Common Python Mistakes in Data‑Science Projects and How to Avoid Them

This article outlines nine common Python mistakes in data‑science projects: neglecting virtual environments, overusing notebooks, hard‑coding absolute paths, ignoring warnings, avoiding list comprehensions, omitting type hints, writing unreadable pandas chains, disregarding PEP 8, and not using coding assistants. Each mistake comes with an explanation, and several with code examples, to help developers improve code quality and productivity.

Python Programming Learning Circle

Jul 24, 2024

Applying software‑engineering best practices can improve the quality of data‑science projects: fewer errors, more reliable results, and more efficient coding.

The article lists nine frequent pitfalls and offers concrete advice for each, with code snippets where they help.

1. Not using virtual environments – Isolating project dependencies prevents package conflicts and eases deployment. Tools such as venv, conda (Anaconda), or Pipenv create dedicated environments; Docker goes a step further and isolates the whole runtime in a container.
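
As a small illustration, a script can check at startup whether it is running inside an isolated environment. The sketch below is an assumption on our part, not something from the original article: the function name `in_virtual_environment` is hypothetical, and the check only covers venv-style environments.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if not in_virtual_environment():
    # Only a reminder; the script keeps running either way.
    print("Warning: no virtual environment detected.", file=sys.stderr)
```

Conda environments are separate Python installations rather than venv-style environments, so this comparison will likely not detect them and a different check would be needed there.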

2. Overusing Jupyter Notebooks – Notebooks are great for exploration and teaching, but for long‑term, collaborative, and deployable projects a proper IDE (VS Code, PyCharm, Spyder, etc.) offers better tooling and version control.

3. Using absolute paths instead of relative paths – Absolute paths hinder portability. Set the project root as the working directory and reference files with relative paths.

import os
import pandas as pd

# ---- Incorrect way: absolute paths that exist on one machine only ----
excel_path1 = "C:\\Users\\abdelilah\\Desktop\\mysheet1.xlsx"
excel_path2 = "C:\\Users\\abdelilah\\Desktop\\mysheet2.xlsx"
mydf1 = pd.read_excel(excel_path1)
mydf2 = pd.read_excel(excel_path2)

# ---- Correct way: paths relative to the project root ----
# Assumes the working directory is the project root and the files
# have been copied into its data/ directory.
DATA_DIR = "data"
crime06_filename = "CrimeOneYearofData_2006.xlsx"
crime07_filename = "CrimeOneYearofData_2007.xlsx"
crime06_df = pd.read_excel(os.path.join(DATA_DIR, crime06_filename))
crime07_df = pd.read_excel(os.path.join(DATA_DIR, crime07_filename))
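
The same idea can be written with `pathlib`, which some readers may find easier to compose than `os.path.join`. This is a minimal sketch rather than the article's method; the helper `data_file` and the constant `DATA_DIR_NAME` are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Optional

DATA_DIR_NAME = "data"  # same folder name as in the example above


def data_file(filename: str, project_root: Optional[Path] = None) -> Path:
    """Build the path to a file in the project's data directory."""
    # Default to the current working directory, which the article
    # recommends setting to the project root.
    root = Path.cwd() if project_root is None else Path(project_root)
    return root / DATA_DIR_NAME / filename


crime06_path = data_file("CrimeOneYearofData_2006.xlsx")
# The resulting Path can be passed directly to pd.read_excel(crime06_path).
```

Passing `project_root` explicitly keeps the helper usable from tests or scripts that run from a different working directory.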

4. Ignoring warnings – Warnings (e.g., Pandas SettingWithCopyWarning or DeprecationWarning) signal potential issues. Understand their cause and decide whether they can be safely ignored.
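
One way to act on this advice is the standard‑library `warnings` module. The sketch below uses a made‑up function, `old_mean`, that emits a `DeprecationWarning`; it is a stand‑in for any library call that warns, not code from the original article.

```python
import warnings


def old_mean(values):
    """Hypothetical legacy helper that announces its own deprecation."""
    warnings.warn(
        "old_mean is deprecated; use statistics.mean instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return sum(values) / len(values)


# While investigating: turn the warning into an exception so the
# traceback shows exactly which call triggers it.
with warnings.catch_warnings():
    warnings.simplefilter("error", DeprecationWarning)
    try:
        old_mean([1, 2, 3])
    except DeprecationWarning as exc:
        caught_message = str(exc)

# After deciding it is safe: silence only this category, only in this block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    result = old_mean([1, 2, 3])
```

Scoping the filter with `catch_warnings()` avoids hiding unrelated warnings elsewhere in the program, which a global `warnings.filterwarnings("ignore")` would do.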

5. Not using (or rarely using) list comprehensions – List comprehensions make code more concise and often faster.

import os
import pandas as pd

DATA_PATH = "data"
filename_list = os.listdir(DATA_PATH)

# Verbose approach: explicit loop with append
csv_list = []
for filename in filename_list:
    if filename.endswith(".csv"):  # skip non-CSV files, as the comprehension does
        csv_list.append(pd.read_csv(os.path.join(DATA_PATH, filename)))

# Recommended approach: the same logic as a list comprehension
csv_list = [
    pd.read_csv(os.path.join(DATA_PATH, filename))
    for filename in filename_list
    if filename.endswith(".csv")
]

6. Not using type annotations – Adding type hints improves IDE assistance and code readability.

# Without type hints: the intended argument types are unclear
def mystery_combine(a, b, times):
    return (a + b) * times

# With type hints
def mystery_combine(a: str, b: str, times: int) -> str:
    return (a + b) * times
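
It is worth knowing that Python itself does not enforce these hints at runtime; they are metadata for IDEs and static checkers such as mypy. The short sketch below repeats `mystery_combine` so it runs on its own; the variable names are ours.

```python
from typing import get_type_hints


def mystery_combine(a: str, b: str, times: int) -> str:
    return (a + b) * times


# The annotations are stored on the function object, which is what
# IDEs and static type checkers read.
hints = get_type_hints(mystery_combine)

# The interpreter does not check them: this call contradicts the
# annotations but still runs and returns an int.
unchecked = mystery_combine(1, 2, 3)

# The call the annotations describe.
intended = mystery_combine("ab", "cd", 2)
```

Because nothing fails at runtime, the benefit of the hints only materializes when a type checker or an IDE is actually part of the workflow.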

7. Unreadable pandas method chains – Wrap long chains in parentheses and put one method call per line, so each step can be read, commented out, or debugged on its own.

var_list = ["clicks", "time_spent"]
var_list_Q = [varname + "_Q" for varname in var_list]

# Hard‑to‑read version
# df_Q = df.groupby("id").rolling(window=3, min_periods=1, on="yearmonth")[var_list].mean().reset_index().rename(columns=dict(zip(var_list, var_list_Q)))

# Readable version
df_Q = (
    df
    .groupby("id")
    .rolling(window=3, min_periods=1, on="yearmonth")[var_list]
    .mean()
    .reset_index()
    .rename(columns=dict(zip(var_list, var_list_Q)))
)

8. Not following PEP conventions – Adhering to the official Python style guide (PEP 8) keeps code consistent and easier to maintain. Linters and formatters such as flake8 or Black can check or apply much of it automatically.
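
A brief before‑and‑after may make the point concrete. Both functions below are invented for illustration and compute the same thing; only the style differs.

```python
# Not PEP 8: CapWords function name, one-letter argument,
# body on the same line, no spaces around operators
def CalcMean(L):return sum(L)/len(L)


# PEP 8 style: snake_case name, descriptive argument, docstring,
# spaces around binary operators
def calculate_mean(values):
    """Return the arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


# Module-level constants are written in UPPER_CASE.
DEFAULT_WINDOW = 3
```

Both versions run, which is exactly why style issues tend to accumulate unnoticed unless a linter flags them.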

9. Not using coding assistance tools – Tools such as Pylance, Kite (since discontinued), Tabnine, or GitHub Copilot provide intelligent completions, documentation lookup, and refactoring suggestions that boost productivity.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
