Automating a Data‑Science Workflow on Kubernetes: From GitHub Issue Mining to an MLP Bug Classifier
This article describes how to collect, clean, and analyze 90,000+ GitHub issues and pull requests from the Kubernetes repository using Kubeflow, TensorFlow, and a fully automated CI/CD pipeline, then build, train, and serve a simple MLP model that classifies release‑note texts as bugs or non‑bugs.
Introduction
The author outlines a data‑science journey that integrates the entire workflow—data collection, exploration, model building, and deployment—into a Kubernetes‑native environment using Kubeflow and Prow for continuous integration.
Data Acquisition
Raw GitHub API data (≈91,000 issues/PRs) is fetched via the issues endpoint, compressed with xz, and stored as a 25 MiB tarball. Incremental updates are performed by retrieving the delta between the last run and the current time.
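The export step can be sketched in Python. The repository path, page size, and `since` handling below are illustrative assumptions, not the article's actual implementation:

```python
from typing import Optional
from urllib.parse import urlencode

# Issues endpoint of the Kubernetes repository; it returns both
# issues and pull requests in one stream.
API = "https://api.github.com/repos/kubernetes/kubernetes/issues"

def issues_url(page: int, since: Optional[str] = None) -> str:
    """Build the URL for one page of the issues endpoint.

    Passing `since` (an ISO 8601 timestamp of the last run) restricts
    the response to items updated afterwards, which is what makes the
    incremental update cheap.
    """
    params = {"state": "all", "per_page": 100, "page": page}
    if since:
        params["since"] = since
    return f"{API}?{urlencode(params)}"
```

In practice each page would be fetched with an `Authorization: token $GITHUB_TOKEN` header, incrementing `page` until an empty page comes back.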
> export GITHUB_TOKEN=<MY-SECRET-TOKEN>
> ./main export
Updating the dataset uses the --update-api flag, which logs the number of items processed and timestamps.
> ./main export --update-api
Exploratory Analysis
Using matplotlib, the author visualizes issue/PR creation over time, created-vs-closed metrics, and label distributions (e.g., kind/bug, sig/, area/).
> ./main analyze --created
> ./main analyze --labels-by-group
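The `--created` analysis boils down to bucketing `created_at` timestamps by month before plotting. A minimal sketch of that aggregation (field names follow the GitHub API; the plotting itself is left to matplotlib):

```python
from collections import Counter
from datetime import datetime

def created_per_month(items):
    """Count issues/PRs per calendar month from their created_at field."""
    counts = Counter()
    for item in items:
        ts = datetime.strptime(item["created_at"], "%Y-%m-%dT%H:%M:%SZ")
        counts[ts.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))

sample = [
    {"created_at": "2019-11-02T10:00:00Z"},
    {"created_at": "2019-11-20T12:30:00Z"},
    {"created_at": "2019-12-01T08:15:00Z"},
]
# Monthly buckets, ready to hand to matplotlib's plot/bar functions.
print(created_per_month(sample))  # → {'2019-11': 2, '2019-12': 1}
```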
Label Focus
The study concentrates on the kind/ label group, especially kind/bug , because release‑note blocks are high‑quality textual data suitable for natural‑language processing.
Natural‑Language Processing
Release notes are vectorized with sklearn.feature_extraction.text.TfidfVectorizer (unigrams and bigrams; min_df and max_df left at their defaults). The resulting vocabulary contains ~50,000 terms; a SelectKBest selector could prune it, but the author keeps the full set.
["hostname", "hostname address", "hostname and", ...]
Model Construction
An MLP built on TensorFlow with two hidden layers (64 units, sigmoid activation) is used to classify whether a release note indicates a bug.
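A minimal Keras sketch of such a network follows. The input width (the TF-IDF vocabulary size) and the single sigmoid output unit are assumptions about the exact layer setup:

```python
import tensorflow as tf

VOCAB_SIZE = 50_000  # width of the TF-IDF feature vectors (illustrative)

# Two hidden layers with 64 sigmoid units each, as described above;
# one sigmoid output unit yields P(release note describes a bug).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(VOCAB_SIZE,)),
    tf.keras.layers.Dense(64, activation="sigmoid"),
    tf.keras.layers.Dense(64, activation="sigmoid"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```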
Training
The dataset (≈7,000 samples) is split 80/20 for training/validation. Training logs show rapid convergence, reaching ~92% training accuracy and ~77% validation accuracy after 68 epochs.
> ./main train
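The 80/20 split can be sketched with scikit-learn; synthetic arrays stand in for the TF-IDF matrix and the bug/non-bug labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins for the vectorized release notes and their labels.
features = np.random.rand(7000, 100)
labels = np.random.randint(0, 2, size=7000)

# 80% training, 20% validation, as used by ./main train.
x_train, x_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
print(len(x_train), len(x_val))  # → 5600 1400
```

The training/validation arrays would then be passed straight to the model's fit call, which is where the per-epoch accuracy figures above come from.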
Prediction
After training, the model is saved (model.h5) along with the vectorizer and selector. Sample predictions demonstrate high confidence for bug-related texts and low confidence for non-bugs.
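Persisting the preprocessing artifacts alongside model.h5 can be sketched with pickle; the file name vectorizer.pickle is an assumption:

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(["fix node hostname resolution", "add scheduler feature"])

# The Keras model itself would be stored via model.save("model.h5");
# the fitted vectorizer (and any selector) must be saved too, because
# prediction-time texts have to pass through the identical transform.
with open("vectorizer.pickle", "wb") as f:
    pickle.dump(vectorizer, f)

with open("vectorizer.pickle", "rb") as f:
    restored = pickle.load(f)

# The restored vectorizer produces one feature row per input text.
print(restored.transform(["my test text"]).shape[0])  # → 1
```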
> ./main predict --test
Automation with Kubeflow Pipelines
A Kubeflow pipeline orchestrates the steps: fetch source, update dataset, train model, and serve it via KFServing. The pipeline runs on a GPU‑enabled Kubernetes cluster and is triggered by Prow on new pull requests.
Serving and Real‑Time Labeling
The trained model is packaged into a container that runs a KFServing endpoint. A custom Prow plugin calls this endpoint for every new PR, automatically adding or removing kind/bug labels based on the prediction.
> curl https://kfserving.k8s.saschagrunert.de/v1/models/kubernetes-analysis:predict -d '{"text": "my test text"}'
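A custom predictor served through KFServing is typically declared as an InferenceService resource. The manifest below is a hedged sketch: the image reference is a placeholder, not the article's actual resource.

```yaml
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: kubernetes-analysis
spec:
  default:
    predictor:
      custom:
        container:
          # Placeholder image; the real container bundles model.h5
          # plus the pickled vectorizer and selector.
          image: quay.io/example/kubernetes-analysis:latest
```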
Conclusion
The end‑to‑end system demonstrates how data‑science techniques can be fully integrated into a cloud‑native CI/CD workflow, enabling continuous improvement of a bug‑classification model directly within the Kubernetes project.
Cloud Native Technology Community
The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.