Discretizing Numerical Variables with Pandas: between, cut, qcut, and value_counts
This article demonstrates four Pandas techniques (between with loc, cut, qcut, and value_counts) for discretizing numeric variables into bins. Using exam scores graded A, B, and C as the running example, it shows how to generate synthetic data, define bin boundaries, and count records per bin.
Discretization, also known as binning, is a common data preprocessing technique that groups continuous values into intervals or "bins". This tutorial explains four methods using the Python Pandas library to bin numeric variables.
Creating Synthetic Data
<code>import pandas as pd # version 1.3.5
import numpy as np

def create_df():
    # 1,000 random integer scores; randint excludes the upper bound, so 101 yields 0..100
    df = pd.DataFrame({'score': np.random.randint(0, 101, 1000)})
    return df

df = create_df()  # keep the returned DataFrame so the later snippets can use df
df.head()</code>The dataset contains 1,000 students' exam scores ranging from 0 to 100. The goal is to categorize these scores into grades "A", "B", and "C", where "A" is the best and "C" the worst.
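Because the scores are drawn at random, the counts shown in the rest of this article will differ from run to run. A minimal sketch of a reproducible variant follows; the helper name create_seeded_df and the seed value are assumptions made for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

def create_seeded_df(seed=0, n=1000):
    # Hypothetical helper: same idea as create_df, but with a fixed seed
    rng = np.random.default_rng(seed)
    # integers() excludes the upper bound, so 101 yields scores in 0..100
    return pd.DataFrame({'score': rng.integers(0, 101, size=n)})

df_seeded = create_seeded_df()
print(df_seeded.head())
```

Calling create_seeded_df() twice with the same seed returns identical scores, which makes the bin counts repeatable.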
1. between & loc
The between method returns a boolean Series indicating whether each element lies between the specified left and right boundaries. Combined with loc, it can assign grades based on custom intervals.
left: left boundary
right: right boundary
inclusive: which boundaries to include ("both", "neither", "left", "right")
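The effect of the inclusive argument is easiest to see on values that sit exactly on a boundary. The sketch below uses a small hand-made Series (an assumption for illustration, not the article's dataset) and relies on the string values of inclusive, which are available from pandas 1.3 onward.

```python
import pandas as pd

scores = pd.Series([0, 50, 80, 100])

both = scores.between(50, 80, inclusive='both')        # 50 <= x <= 80
right = scores.between(50, 80, inclusive='right')      # 50 <  x <= 80
left = scores.between(50, 80, inclusive='left')        # 50 <= x <  80
neither = scores.between(50, 80, inclusive='neither')  # 50 <  x <  80

# One row per score, one column per inclusive setting
print(pd.DataFrame({'score': scores, 'both': both, 'right': right,
                    'left': left, 'neither': neither}))
```

Only the rows for 50 and 80 change across the four columns, which is why the choice matters when adjacent grade intervals share an edge.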
Grade intervals:
A: (80, 100]
B: (50, 80]
C: [0, 50]
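<code># C: [0, 50], both ends included
df.loc[df['score'].between(0, 50, inclusive='both'), 'grade'] = 'C'
# B: (50, 80], only the right end included
df.loc[df['score'].between(50, 80, inclusive='right'), 'grade'] = 'B'
# A: (80, 100], only the right end included
df.loc[df['score'].between(80, 100, inclusive='right'), 'grade'] = 'A'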
</code>Counting the number of records per grade:
<code>df.grade.value_counts()
</code> <code>C 488
B 310
A 202
Name: grade, dtype: int64
</code>This approach requires explicit handling for each bin, making it suitable only when the number of bins is small.
2. cut
The cut function bins values into discrete intervals, useful for converting continuous variables into categorical ones.
x: the array to bin (must be 1‑D)
bins: sequence defining bin edges (allows non‑uniform widths)
labels: labels for the resulting bins
include_lowest: whether the first interval should be left‑inclusive
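By default cut builds right-closed intervals, so a value equal to the first edge falls outside every bin unless include_lowest=True is set. A minimal sketch on a few boundary scores (hand-picked for illustration) shows the difference:

```python
import pandas as pd

edge_scores = pd.Series([0, 50, 51, 80, 81, 100])
bins = [0, 50, 80, 100]
labels = ['C', 'B', 'A']

# Default: intervals are (0, 50], (50, 80], (80, 100], so a score of 0 gets NaN
without_lowest = pd.cut(edge_scores, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 50]
with_lowest = pd.cut(edge_scores, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({'score': edge_scores,
                    'default': without_lowest,
                    'include_lowest': with_lowest}))
```

For exam scores that can be exactly 0, omitting include_lowest would silently leave those students without a grade.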
<code>bins = [0, 50, 80, 100]
labels = ['C', 'B', 'A']
df['grade'] = pd.cut(x=df['score'], bins=bins, labels=labels, include_lowest=True)
</code>The resulting grade distribution matches the previous method:
<code>df.grade.value_counts()
</code> <code>C 488
B 310
A 202
Name: grade, dtype: int64
</code>3. qcut
The qcut function creates bins based on quantiles, ensuring (approximately) equal numbers of observations per bin.
x: input array (1‑D)
q: number of quantiles (e.g., 3 for terciles)
labels: labels for the bins
retbins: whether to return the bin edges
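As a small illustration of equal-frequency binning, the sketch below applies qcut to the integers 1 through 9, a toy input chosen here so that each tercile holds exactly three values. The variable names are assumptions for this example only.

```python
import pandas as pd

values = pd.Series(range(1, 10))  # 1, 2, ..., 9

# retbins=True returns the computed quantile edges alongside the labels
terciles, edges = pd.qcut(values, q=3, labels=['C', 'B', 'A'], retbins=True)

print(terciles.value_counts(sort=False))  # counts in label order C, B, A
print(edges)                              # four edges delimit three bins
```

Unlike cut, the edges are derived from the data, so the same label can correspond to different score ranges on different datasets.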
<code>df['grade'], cut_bin = pd.qcut(df['score'], q=3, labels=['C','B','A'], retbins=True)
</code>Resulting bin edges:
<code>print(cut_bin)
>> [ 0. 36. 68. 100.]
</code>Grade distribution (≈333 records per grade):
<code>df.grade.value_counts()
</code> <code>C 340
A 331
B 329
Name: grade, dtype: int64
</code>4. value_counts with bins
The value_counts method can also perform binning when the bins argument is supplied. Passing an integer splits the range of the data into that many equal-width bins.
<code>df['score'].value_counts(bins=3, sort=False)
</code>By default, the result is sorted descending by count; setting sort=False preserves the original bin order.
<code>(-0.101, 33.333] 310
(33.333, 66.667] 340
(66.667, 100.0] 350
Name: score, dtype: int64
</code>Custom bin edges can be provided to match the earlier examples:
<code>df['score'].value_counts(bins=[0,50,80,100], sort=False)
</code> <code>(-0.001, 50.0] 488
(50.0, 80.0] 310
(80.0, 100.0] 202
Name: score, dtype: int64
</code>This yields the same grade counts as the between and cut methods.
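The agreement between value_counts with bins and cut can be checked directly. The sketch below regenerates scores with a fixed seed (an assumption for reproducibility, so the counts will not match the figures printed above) and compares the per-bin counts from both approaches.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = pd.Series(rng.integers(0, 101, size=1000), name='score')
edges = [0, 50, 80, 100]

# Binning and counting in a single call
by_value_counts = scores.value_counts(bins=edges, sort=False)

# The same bins built explicitly with cut, then counted in bin order
by_cut = pd.cut(scores, bins=edges, include_lowest=True).value_counts(sort=False)

print(by_value_counts.to_numpy())
print(by_cut.to_numpy())
```

The two arrays should be identical. The difference in practice is that value_counts only reports the counts, whereas cut returns a label per row that can be stored as a new column.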
Conclusion: Pandas provides several flexible ways to discretize numeric data: between with loc, cut, qcut, and value_counts. Each suits a different scenario: a small fixed number of bins, custom bin edges, equal-frequency bins, or quick binning via value counts.
Python Programming Learning Circle