
Automating a Data‑Science Workflow on Kubernetes: From GitHub Issue Mining to an MLP Bug Classifier

This article describes how to collect, clean, and analyze 90,000+ GitHub issues and pull requests from the Kubernetes repository using Kubeflow, TensorFlow, and a fully automated CI/CD pipeline, then build, train, and serve a simple MLP model that classifies release‑note texts as bugs or non‑bugs.

Cloud Native Technology Community

Introduction

The author outlines a data‑science journey that integrates the entire workflow—data collection, exploration, model building, and deployment—into a Kubernetes‑native environment using Kubeflow and Prow for continuous integration.

Data Acquisition

Raw GitHub API data (≈91,000 issues/PRs) is fetched via the issues endpoint, compressed with xz, and stored as a 25 MiB tarball. Incremental updates retrieve only the delta between the last run and the current time.

> export GITHUB_TOKEN=<MY-SECRET-TOKEN>
> ./main export

Updating the dataset uses the --update-api flag, which logs the number of items processed and timestamps.

> ./main export --update-api
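The incremental-update logic above maps directly onto the GitHub issues endpoint, whose `since` parameter restricts results to items updated after a given timestamp. A minimal sketch of how such a delta request could be built (the `issues_url` helper and the example timestamp are illustrative, not from the tool's source):

```python
def issues_url(repo, since=None, page=1):
    """Build a GitHub issues-endpoint URL.

    Passing `since` (ISO 8601 timestamp) fetches only the delta of
    items updated after that point, which is how incremental runs
    avoid re-downloading all ~91,000 issues/PRs.
    """
    url = ("https://api.github.com/repos/{}/issues"
           "?state=all&per_page=100&page={}".format(repo, page))
    if since:
        url += "&since=" + since
    return url

# Full export vs. incremental update from a hypothetical last-run time:
full = issues_url("kubernetes/kubernetes")
delta = issues_url("kubernetes/kubernetes", since="2020-05-01T00:00:00Z")
```

The authenticated requests would then be paginated until the API returns an empty page.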

Exploratory Analysis

Using matplotlib, the author visualizes issue/PR creation over time, created-vs-closed metrics, and label distributions (e.g., kind/bug, sig/, area/).

> ./main analyze --created

> ./main analyze --labels-by-group

Label Focus

The study concentrates on the kind/ label group, especially kind/bug , because release‑note blocks are high‑quality textual data suitable for natural‑language processing.

Natural‑Language Processing

Release notes are vectorized with sklearn.feature_extraction.text.TfidfVectorizer (unigrams and bigrams; min_df and max_df at their defaults). The resulting vocabulary contains ~50 k terms; a SelectKBest selector could prune it, but the author keeps the full set.

["hostname", "hostname address", "hostname and", ...]

Model Construction

An MLP built on TensorFlow with two hidden layers (64 units, sigmoid activation) is used to classify whether a release note indicates a bug.
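The article builds the network with TensorFlow; purely to make the architecture concrete, here is a NumPy sketch of the forward pass of such an MLP (two 64-unit sigmoid hidden layers, one sigmoid output for the bug probability). The random weights and the 50,000-feature input size are placeholders, not trained values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, params):
    """Forward pass: two 64-unit sigmoid hidden layers, then a single
    sigmoid output interpreted as the probability of a bug."""
    h1 = sigmoid(x @ params["W1"] + params["b1"])
    h2 = sigmoid(h1 @ params["W2"] + params["b2"])
    return sigmoid(h2 @ params["W3"] + params["b3"])

rng = np.random.default_rng(0)
n_features = 50_000  # roughly the size of the TF-IDF vocabulary
params = {
    "W1": rng.normal(0, 0.01, (n_features, 64)), "b1": np.zeros(64),
    "W2": rng.normal(0, 0.01, (64, 64)),         "b2": np.zeros(64),
    "W3": rng.normal(0, 0.01, (64, 1)),          "b3": np.zeros(1),
}
prob = mlp_forward(rng.normal(0, 1, (1, n_features)), params)
```

In the real setup the weights are learned by TensorFlow's optimizer rather than sampled randomly.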

Training

The dataset (≈7,000 samples) is split 80/20 into training and validation sets. Training logs show rapid convergence, reaching ~92% training accuracy and ~77% validation accuracy after 68 epochs.

> ./main train
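The 80/20 split can be expressed as a small shuffle-and-cut helper; this is a generic sketch with an arbitrary seed, not the tool's actual splitting code:

```python
import random

def train_val_split(samples, val_fraction=0.2, seed=42):
    """Shuffle the dataset and split it into training and validation
    sets; with val_fraction=0.2 this yields the 80/20 split used here."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1 - val_fraction))
    return items[:cut], items[cut:]

# ~7,000 labeled release notes -> 5,600 training / 1,400 validation samples.
train, val = train_val_split(range(7000))
```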

Prediction

After training, the model is saved (model.h5) along with the vectorizer and selector. Sample predictions demonstrate high confidence for bug-related texts and low confidence for non-bugs.

> ./main predict --test

Automation with Kubeflow Pipelines

A Kubeflow pipeline orchestrates the steps: fetch source, update dataset, train model, and serve it via KFServing. The pipeline runs on a GPU‑enabled Kubernetes cluster and is triggered by Prow on new pull requests.
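The ordering of those steps could be expressed in the Kubeflow Pipelines v1 DSL roughly as below. This is a non-runnable configuration sketch: the image name, the deploy script, and the exact step granularity are assumptions, not taken from the project's actual pipeline definition:

```python
import kfp.dsl as dsl

IMAGE = "example.com/kubernetes-analysis:latest"  # hypothetical pipeline image

@dsl.pipeline(name="kubernetes-analysis",
              description="Update the dataset, retrain and serve the model")
def bug_classifier_pipeline():
    update = dsl.ContainerOp(
        name="update-dataset", image=IMAGE,
        command=["./main", "export", "--update-api"])
    train = dsl.ContainerOp(
        name="train-model", image=IMAGE,
        command=["./main", "train"]).after(update)
    dsl.ContainerOp(
        name="serve-model", image=IMAGE,
        command=["./deploy.sh"]).after(train)  # hypothetical deploy step
```

Prow then triggers a run of this pipeline whenever a new pull request arrives.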

Serving and Real‑Time Labeling

The trained model is packaged into a container that runs a KFServing endpoint. A custom Prow plugin calls this endpoint for every new PR, automatically adding or removing kind/bug labels based on the prediction.

> curl https://kfserving.k8s.saschagrunert.de/v1/models/kubernetes-analysis:predict -d '{"text": "my test text"}'
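On the plugin side, the prediction returned by that endpoint has to be turned into a concrete labeling decision. A hedged sketch of that mapping; the 0.6 threshold and the function name are illustrative choices, not documented values from the plugin:

```python
def label_action(bug_probability, threshold=0.6):
    """Map the model's predicted bug probability to a Prow label action.

    The threshold is an illustrative cutoff: above it the plugin would
    add kind/bug to the PR, below it the label would be removed.
    """
    if bug_probability >= threshold:
        return "add kind/bug"
    return "remove kind/bug"
```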

Conclusion

The end‑to‑end system demonstrates how data‑science techniques can be fully integrated into a cloud‑native CI/CD workflow, enabling continuous improvement of a bug‑classification model directly within the Kubernetes project.

Tags: Machine Learning, CI/CD, Data Mining, Kubernetes, TensorFlow, Kubeflow
Written by

Cloud Native Technology Community

The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.
