
Eight Python Libraries to Accelerate Data‑Science Workflows

This article introduces eight Python libraries—Optuna, ITMO_FS, shap‑hypetune, PyCaret, floWeaver, Gradio, Terality, and TorchHandle—that streamline data‑science tasks such as hyperparameter optimization, feature selection, model building, visualization, and deployment, helping users save coding time and improve productivity.


When doing data science, a lot of time can be wasted writing code and waiting for computations. The following Python libraries can help you save valuable time.

1. Optuna

Optuna is an open‑source hyperparameter‑optimization framework that automatically finds the best hyperparameters for machine learning models. It uses a Bayesian optimization algorithm called Tree‑structured Parzen Estimator, which leverages the history of previous trials to propose promising new configurations, avoiding exhaustive grid search and saving time. Optuna is framework‑agnostic and works with TensorFlow, Keras, PyTorch, and other ML libraries.

2. ITMO_FS

ITMO_FS is a feature‑selection library offering six categories of methods (supervised filters, unsupervised filters, wrappers, hybrids, embedded, and ensembles). It helps reduce over‑fitting by selecting a smaller, more interpretable set of features. Below is a typical usage example.

>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import SGDClassifier
>>> from ITMO_FS.embedded import MOS

>>> X, y = make_classification(n_samples=300, n_features=10, random_state=0, n_informative=2)
>>> sel = MOS()
>>> trX = sel.fit_transform(X, y, smote=False)

>>> cl1 = SGDClassifier()
>>> cl1.fit(X, y)
>>> cl1.score(X, y)
0.9033333333333333

>>> cl2 = SGDClassifier()
>>> cl2.fit(trX, y)
>>> cl2.score(trX, y)
0.9433333333333334

ITMO_FS is relatively new and may be unstable, but it is worth trying for feature‑selection tasks.

3. shap‑hypetune

shap‑hypetune combines SHAP‑based feature importance with hyperparameter tuning. SHAP (SHapley Additive exPlanations) explains any model’s output by assigning importance values to each feature. By integrating SHAP with hyperparameter search (grid, random, or Bayesian), shap‑hypetune simultaneously selects the best features and the best hyperparameters, reducing coding effort while considering feature‑hyperparameter interactions. It currently supports gradient‑boosting models.

4. PyCaret

PyCaret is an open‑source, low‑code machine‑learning library that automates the entire ML workflow, including data exploration, preprocessing, modeling, interpretability, and MLOps.

Example: loading a dataset and comparing models.

# load dataset 
from pycaret.datasets import get_data 
diabetes = get_data('diabetes') 

# init setup 
from pycaret.classification import * 
clf1 = setup(data = diabetes, target = 'Class variable') 

# compare models 
best = compare_models()

Creating a simple web app for a model:

from pycaret.datasets import get_data 
juice = get_data('juice') 
from pycaret.classification import * 
exp_name = setup(data = juice,  target = 'Purchase') 
lr = create_model('lr') 
create_app(lr)

Generating an API and Docker files for deployment:

from pycaret.datasets import get_data 
juice = get_data('juice') 
from pycaret.classification import * 
exp_name = setup(data = juice,  target = 'Purchase') 
lr = create_model('lr') 
create_api(lr, 'lr_api') 
create_docker('lr_api')

PyCaret provides a comprehensive suite of tools, making it easy to start experimenting with ML models.

5. floWeaver

floWeaver generates Sankey diagrams from flow data, useful for visualizing conversion funnels, marketing journeys, or budget allocations. Input is a table of flows with "source, target, value" columns, and a few lines of code turn it into a diagram.

6. Gradio

Gradio lets you quickly build interactive front‑ends for Python functions by specifying input types, the function, and outputs. Compared with Flask, Gradio requires no HTML/CSS knowledge and can be hosted for free on Hugging Face, making it ideal for rapid prototyping and user interaction.

7. Terality

Terality is a Pandas‑compatible library that claims 10‑100× speedups by shipping Pandas‑style operations to a hosted, auto‑scaling backend for parallel execution. It avoids local memory limits, though the free tier caps processed data at 1 TB per month, after which a paid plan is required.

8. TorchHandle

TorchHandle abstracts repetitive PyTorch training code, allowing data scientists to focus on data processing, model definition, and hyperparameter tuning. It provides concise training pipelines, automatic report generation, and TensorBoard integration.

from collections import OrderedDict 
import torch 
from torchhandle.workflow import BaseContext

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Sequential(OrderedDict([
            ('l1', torch.nn.Linear(10, 20)),
            ('a1', torch.nn.ReLU()),
            ('l2', torch.nn.Linear(20, 10)),
            ('a2', torch.nn.ReLU()),
            ('l3', torch.nn.Linear(10, 1))
        ]))

    def forward(self, x):
        x = self.layer(x)
        return x

num_samples, num_features = int(1e4), int(1e1) 
X, Y = torch.rand(num_samples, num_features), torch.rand(num_samples, 1)  # targets shaped (N, 1) to match the model output

dataset = torch.utils.data.TensorDataset(X, Y) 
trn_loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=0, shuffle=True) 
loaders = {"train": trn_loader, "valid": trn_loader} 
device = 'cuda' if torch.cuda.is_available() else 'cpu'  

model = {"fn": Net} 
criterion = {"fn": torch.nn.MSELoss} 
optimizer = {"fn": torch.optim.Adam,
            "args": {"lr": 0.1},
            "params": {"layer.l1.weight": {"lr": 0.01},
                       "layer.l1.bias": {"lr": 0.02}}}

scheduler = {"fn": torch.optim.lr_scheduler.StepLR,
            "args": {"step_size": 2, "gamma": 0.9}}

c = BaseContext(model=model,
                criterion=criterion,
                optimizer=optimizer,
                scheduler=scheduler,
                context_tag="ex01")
train = c.make_train_session(device, dataloader=loaders)
train.train(epochs=10)

The code shows how a model, dataset, optimizer, loss, and scheduler can each be declared once and then trained automatically, similar to high‑level frameworks like Keras.

Overall, these libraries help reduce boilerplate, accelerate experimentation, and make it easier to move from prototype to production in data‑science projects.

Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
