
Fine‑Tuning a ViT Image Classification Model on a Small Flower Dataset Using ModelScope

This tutorial walks through the complete process of fine‑tuning a Vision Transformer (ViT) model for 14‑class flower image classification on ModelScope, covering dataset preparation, model loading, training configuration, evaluation, and inference with practical code examples.

DataFunSummit

The article introduces image classification as a fundamental computer‑vision task and demonstrates how to fine‑tune a pre‑trained ViT model on a small flower dataset containing 14 classes using ModelScope's online notebook environment.

First, users navigate to ModelScope's model hub, select the "ViT image‑classification‑Chinese‑daily‑objects" model, and open a notebook (either PAI‑DSW or Alibaba Cloud Elastic Compute) with free GPU resources.

The dataset, named Flowers14 (namespace tany0699), is organized into train and validation folders of JPG images with accompanying label files (classname.txt and Flowers14.json). Once the ModelScope dependencies are imported, loading the dataset takes a single line of code.
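Assuming the ModelScope SDK is installed (pip install modelscope), that one-line load can be sketched as follows; the dataset name and namespace come from the tutorial, while the split names ('train'/'validation') are assumptions to check against the dataset card:

```python
def load_flowers14():
    """Load the Flowers14 train/validation splits from the ModelScope hub.

    The import is deferred so this module parses even where modelscope
    is not installed; split names are assumed conventions.
    """
    from modelscope.msdatasets import MsDataset

    train_ds = MsDataset.load('Flowers14', namespace='tany0699', split='train')
    val_ds = MsDataset.load('Flowers14', namespace='tany0699', split='validation')
    return train_ds, val_ds
```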

Training details include modifying the configuration file: setting batch_size, num_workers, and max_epochs (set to 1 for the demo), and changing num_classes from the original 1296 to 14. The learning rate and other hyper-parameters are also adjusted for quick convergence.

A trainer is built via build_trainer with arguments such as model_id, work_dir, train_dataset, eval_dataset, and a callback to modify the config. Training is started with trainer.train() and evaluation with trainer.evaluate(). Logs show the epoch number, learning rate, loss, and GPU memory usage.
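Putting those arguments together, a hedged sketch of building and running the trainer; the default model id here is a hypothetical placeholder to be replaced with the actual id from the model page, and the datasets and config callback are passed in from the earlier steps:

```python
def run_finetune(train_dataset, eval_dataset, cfg_modify_fn,
                 model_id='<ViT-model-id-from-the-hub>',  # placeholder, not a real hub id
                 work_dir='./vit_flowers14'):
    """Build a ModelScope trainer, fine-tune for the configured epochs,
    and return the evaluation metrics.

    The keyword names (model, work_dir, train_dataset, eval_dataset,
    cfg_modify_fn) follow ModelScope's build_trainer convention.
    """
    from modelscope.trainers import build_trainer

    kwargs = dict(
        model=model_id,
        work_dir=work_dir,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        cfg_modify_fn=cfg_modify_fn,
    )
    trainer = build_trainer(default_args=kwargs)
    trainer.train()             # logs epoch, learning rate, loss, GPU memory
    return trainer.evaluate()   # dict of metrics, e.g. top-1/top-5 accuracy
```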

After one epoch, the model achieves approximately 87.75% top-1 accuracy and 98.97% top-5 accuracy on the validation set. The resulting checkpoint (pytorch_model.pt) and updated configuration are saved in the work directory.

For inference, the trained model is loaded into a pipeline, and a test image (e.g., a sunflower) is classified, yielding a top‑5 list with the correct class at rank one, though confidence is low due to limited training iterations.
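The inference step can be sketched with ModelScope's pipeline API; the model directory below is an assumption about where the trainer leaves the saved checkpoint and config, so point it at your actual work directory:

```python
def classify_image(image_path, model_dir='./vit_flowers14/output'):
    """Classify an image with the fine-tuned ViT model.

    `model_dir` is assumed to contain the checkpoint and updated config
    written after training; the returned dict holds ranked 'labels'
    and their 'scores' (top-5 by default for classification pipelines).
    """
    from modelscope.pipelines import pipeline
    from modelscope.utils.constant import Tasks

    classifier = pipeline(Tasks.image_classification, model=model_dir)
    return classifier(image_path)
```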

The guide concludes by encouraging users to explore other classification models on ModelScope, such as generic ViT, NextViT, or BEiT, and to experiment with longer training, different learning rates, and larger datasets for improved performance.

Tags: image classification, Python, Deep Learning, Fine-tuning, ModelScope, ViT
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
