
Leveraging Large Language Models to Optimize Traditional Machine Learning Pipelines

Large language models can assist and enhance each stage of traditional machine learning—including sample generation, data cleaning, feature engineering, model selection, hyper‑parameter tuning, and workflow automation—by generating synthetic data, refining features, selecting models, and orchestrating pipelines, though challenges such as bias, privacy, and noise remain.

Cognitive Technology Team

1. Sample Generation

LLMs can be used to construct synthetic training samples by describing data in natural language, enabling generation, rewriting, and expansion of datasets. Representative works include Borisov et al. (2022) on realistic tabular data generation and Yu et al. (2023) on diversity and bias in generated data.

Challenges: the potential introduction of bias, irrelevant features, and noise, as well as copyright and privacy concerns.
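A minimal sketch of the pattern: describe the table schema in natural language, ask the model for CSV rows, and parse its reply back into records. The `llm_complete` function is a hypothetical stand-in for any real chat-completion call; here it returns a canned reply so the sketch is self-contained.

```python
import csv
import io

def build_prompt(schema, n):
    """Describe the table in natural language and ask for n synthetic rows."""
    cols = ", ".join(f"{name} ({dtype})" for name, dtype in schema)
    return (f"Generate {n} realistic CSV rows (no header) for a table "
            f"with columns: {cols}.")

def parse_rows(reply, schema):
    """Parse the model's CSV reply into records, dropping malformed lines."""
    rows = []
    for rec in csv.reader(io.StringIO(reply.strip())):
        if len(rec) == len(schema):
            rows.append({name: val.strip() for (name, _), val in zip(schema, rec)})
    return rows

# Hypothetical stand-in for a real chat-completion call.
def llm_complete(prompt):
    return "34, engineer, 72000\n51, teacher, 48000"

schema = [("age", "int"), ("occupation", "str"), ("income", "int")]
rows = parse_rows(llm_complete(build_prompt(schema, 2)), schema)
```

In practice the parsing step matters as much as the prompt: model replies are free-form text, so malformed rows must be validated or dropped before they enter the training set.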

2. Data Cleaning

By phrasing data quality queries in natural language, LLMs can identify erroneous or outlier attribute values and fill in missing ones. Examples include Narayan et al. (2022) on using foundation models for data wrangling and the AutoM3L agents for feature refinement and data imputation.

Challenges: possible misinterpretation of features and lack of domain‑specific expert knowledge.
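One way to phrase imputation as a natural-language query, sketched below: each incomplete record is verbalized into a question about the missing column. `ask_llm` is a hypothetical callable standing in for a real completion call, stubbed here with a fixed answer.

```python
def impute_missing(records, column, ask_llm):
    """Fill missing values in `column` by phrasing each row as a question."""
    for rec in records:
        if rec.get(column) is None:
            context = ", ".join(f"{k}={v}" for k, v in rec.items()
                                if k != column and v is not None)
            rec[column] = ask_llm(
                f"Given a record with {context}, what is a plausible "
                f"value for '{column}'? Answer with the value only.")
    return records

# Hypothetical stand-in for a real completion call.
fake_llm = lambda prompt: "USA"

data = [{"city": "Chicago", "country": None},
        {"city": "Lyon", "country": "France"}]
cleaned = impute_missing(data, "country", fake_llm)
```

This also illustrates the stated challenge: without domain knowledge, the model's "plausible value" may be confidently wrong, so imputed cells should be flagged for review rather than silently merged.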

3. Feature Engineering

3.1 Feature Selection

LLMs can assess feature importance when given task information and example annotations (e.g., LMPriors, 2022), or by scoring and ranking candidate features (LLM‑Select, 2024). AutoM3L also offers automated feature refinement.

Challenge: pretrained LLMs may inherit biases that affect feature selection.
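A sketch of the scoring-and-ranking idea: ask the model for a 0–1 relevance score per feature and keep the top‑k. The `score_fn` here is a stub returning hypothetical scores; a real system would send each prompt to an LLM.

```python
def select_features(features, task, score_fn, k=2):
    """Rank candidate features by an LLM-assigned relevance score."""
    scored = []
    for f in features:
        reply = score_fn(f"On a scale of 0 to 1, how predictive is the "
                         f"feature '{f}' for the task: {task}? "
                         f"Answer with a number only.")
        scored.append((float(reply), f))
    scored.sort(reverse=True)
    return [f for _, f in scored[:k]]

# Hypothetical scores a real model might return for each feature.
canned = {"age": "0.8", "favorite_color": "0.1", "income": "0.9"}
score_fn = lambda prompt: next(v for k, v in canned.items() if f"'{k}'" in prompt)

top = select_features(["age", "favorite_color", "income"],
                      "predict loan default", score_fn)
```

Because the scores come from the model's priors rather than the data, they inherit whatever biases the pretraining corpus encodes, which is exactly the challenge noted above.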

3.2 Feature Extraction

LLMs generate new features such as text embeddings or domain‑specific descriptors by prompting the model with dataset descriptions. Works like Einy et al. (2024) and Portisch & Paulheim (2022) illustrate feature enrichment via external knowledge bases.
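The mechanics of turning a free-text column into numeric features, sketched below. The `toy_embed` function is a deterministic stand-in for a real embedding endpoint (it hashes character bigrams into a fixed-size vector); only the surrounding plumbing is the point.

```python
import math
import zlib

def toy_embed(text, dim=8):
    """Deterministic toy embedding: hash character bigrams into a
    fixed-size vector, then L2-normalize. A stand-in for a real
    embedding model's output."""
    vec = [0.0] * dim
    for a, b in zip(text, text[1:]):
        vec[zlib.crc32((a + b).encode()) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def add_text_features(records, text_col, dim=8):
    """Append each embedding component as a new numeric column."""
    for rec in records:
        for i, x in enumerate(toy_embed(rec[text_col], dim)):
            rec[f"{text_col}_emb{i}"] = x
    return records

rows = add_text_features([{"review": "great battery life"}], "review")
```

Swapping `toy_embed` for a real embedding model gives a downstream tabular learner access to semantics the raw text column cannot express.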

3.3 Feature Fusion

LLM agents can combine existing features and operators to create novel composite features, as demonstrated by Zhang et al. (2024) on dynamic feature generation and TableLLM approaches that transform tabular attributes into natural‑language prompts for fine‑tuning.

Challenge: bias in LLMs can propagate to fused features.
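Feature fusion reduces to applying an agent-proposed plan of (new feature, source A, operator, source B) entries. The plan below is a hypothetical example of what an LLM agent might emit for a housing dataset; the executor itself is ordinary code.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul,
       "sub": operator.sub, "div": operator.truediv}

def apply_fusions(records, plan):
    """Apply a fusion plan: each entry names a new composite feature,
    two source features, and the operator combining them."""
    for rec in records:
        for new_name, a, op, b in plan:
            rec[new_name] = OPS[op](rec[a], rec[b])
    return records

# A plan an LLM agent might propose for a housing dataset (hypothetical).
plan = [("price_per_area", "price", "div", "area"),
        ("rooms_x_baths", "rooms", "mul", "baths")]

records = apply_fusions([{"price": 300000, "area": 100,
                          "rooms": 4, "baths": 2}], plan)
```

Restricting the agent to a whitelist of operators, as `OPS` does, keeps generated plans auditable and prevents arbitrary code execution.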

4. Model Training

4.1 Model Selection

Retrieval‑based methods let LLMs match user requirements with stored model descriptions (e.g., MLCopilot 2023). Generation‑based approaches such as ModelGPT (2024) let LLMs synthesize model architectures and hyper‑parameters from task specifications.

4.2 LLM‑Guided Hyperparameter Search

LLMs automate hyper‑parameter optimization (HPO) by generating search spaces and evaluating configurations, as seen in AgentHPO (2024) and related works on LLM‑driven HPO.
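The search loop itself is simple, as the sketch below shows: each round, the proposer sees the trial history and suggests the next configuration to evaluate. In an AgentHPO-style system the proposer would prompt an LLM with the history; here it is a stub cycling through hypothetical candidates, and the loss is a toy function.

```python
def llm_guided_search(evaluate, propose, budget=3):
    """Each round, show the proposer the trial history, evaluate the
    configuration it suggests, and keep the best one seen so far."""
    history, best = [], (float("inf"), None)
    for _ in range(budget):
        cfg = propose(history)
        score = evaluate(cfg)
        history.append((cfg, score))
        best = min(best, (score, tuple(sorted(cfg.items()))))
    return dict(best[1]), best[0]

# Hypothetical proposer: a real one would prompt an LLM with `history`.
candidates = [{"lr": 0.1}, {"lr": 0.01}, {"lr": 0.001}]
propose = lambda history: candidates[len(history)]
evaluate = lambda cfg: abs(cfg["lr"] - 0.01)  # toy loss, optimum at lr=0.01

best_cfg, best_loss = llm_guided_search(evaluate, propose)
```

Passing the full history back to the proposer is what lets the model reason about the search space instead of sampling blindly.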

4.3 Workflow Automation

Agents like Data Interpreter (2024) decompose complex tasks into sub‑graphs, dynamically constructing and refining execution plans. Verbalized Machine Learning (VML) treats natural‑language descriptions as parameter specifications, using one LLM as a function approximator and another as an optimizer.
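The executor side of such an agent can be sketched as a small DAG runner: the planner (here, a hypothetical hard-coded `plan` of the kind an LLM might emit) maps each step to its dependencies, and steps run in topological order, passing results along.

```python
from graphlib import TopologicalSorter

def run_plan(plan, steps, data):
    """Execute a planner-generated DAG: `plan` maps each step name to
    its dependencies; steps run in topological order."""
    results = {"input": data}
    for name in TopologicalSorter(plan).static_order():
        if name in steps:
            deps = [results[d] for d in plan[name]]
            results[name] = steps[name](*deps)
    return results

# A plan a planner agent might emit (hypothetical step names).
plan = {"clean": ["input"], "features": ["clean"], "train": ["features"]}
steps = {
    "clean": lambda xs: [x for x in xs if x is not None],
    "features": lambda xs: [(x, x * x) for x in xs],
    "train": lambda feats: {"n_samples": len(feats)},
}

out = run_plan(plan, steps, [1, None, 3])
```

The dynamic-refinement part of Data Interpreter would sit around this loop: when a step fails, the agent edits the plan and re-runs only the affected sub-graph.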

4.4 Final Remarks

Traditional small‑scale ML models are limited in capacity and transferability. By feeding the inputs and outputs of many such models into a large model, the LLM can act as an expert that continuously fine‑tunes on new data, providing expert features or predictions that can be fused with conventional models to improve performance.
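At its simplest, the fusion step is a convex blend of the two prediction sources, as in this sketch (the weight is a hypothetical trust parameter one would tune on validation data):

```python
def fuse_predictions(model_prob, expert_prob, weight=0.3):
    """Convex blend of a conventional model's probability with an LLM
    expert's; `weight` is the trust placed in the expert."""
    return (1 - weight) * model_prob + weight * expert_prob

# A small classifier says 0.6; the LLM expert says 0.9.
fused = fuse_predictions(0.6, 0.9, weight=0.5)
```

More elaborate schemes stack the LLM's features or predictions as extra inputs to the conventional model, but the principle is the same: the small model stays cheap, and the large model contributes what it alone can.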

machine learning · feature engineering · Data Generation · LLM · Hyperparameter Optimization
Written by

Cognitive Technology Team

Cognitive Technology Team regularly delivers the latest IT news, original content, programming tutorials and experience sharing, with daily perks awaiting you.
