
75 Essential Data Science Terms Every Practitioner Must Know

This article presents an alphabetically ordered list of 75 essential data science and machine learning terms—from accuracy and AUC to zero-shot learning—with concise definitions that help practitioners quickly grasp key concepts and sharpen their analytical vocabulary.

Data science has a rich vocabulary. This list presents the 75 most common and important terms that data scientists use daily.

A

Accuracy : Measures the proportion of correct predictions among total predictions.

Area Under Curve (AUC) : Represents the area under the Receiver Operating Characteristic (ROC) curve, used to evaluate classification models.

ARIMA (AutoRegressive Integrated Moving Average) : A statistical method for time‑series forecasting.
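
As a quick illustration, accuracy can be computed in a few lines of plain Python (a minimal sketch, not tied to any particular library):

```python
def accuracy(y_true, y_pred):
    """Proportion of correct predictions among all predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 3 of 4 correct -> 0.75
```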

B

Bias : The systematic difference between a model's average prediction and the true value; high bias leads to underfitting.

Bayes Theorem : A probability formula that calculates the likelihood of an event based on prior knowledge.

Binomial Distribution : A probability distribution modeling the number of successes in a fixed number of independent Bernoulli trials.
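
Bayes' theorem in a short worked example (the disease-screening numbers below are invented for illustration):

```python
# Hypothetical screening test; all probabilities are illustrative.
p_disease = 0.01            # prior: 1% of the population has the disease
p_pos_given_disease = 0.99  # test sensitivity
p_pos_given_healthy = 0.05  # false-positive rate

# Total probability of a positive result (law of total probability).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.167 despite the 99% sensitivity
```

Even a sensitive test yields mostly false positives when the prior is low, which is exactly the intuition Bayes' theorem formalizes.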

C

Clustering : Grouping data points based on similarity.

Confusion Matrix : A table used to evaluate the performance of classification models.

Cross-validation : A technique that assesses model performance by dividing data into subsets for training and testing.
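
A binary confusion matrix can be tallied directly; this sketch assumes labels encoded as 0/1:

```python
def confusion_matrix(y_true, y_pred):
    """2x2 matrix for binary labels: rows = actual class, cols = predicted class."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

# m[0][0] = true negatives, m[0][1] = false positives,
# m[1][0] = false negatives, m[1][1] = true positives
print(confusion_matrix([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # [[1, 1], [1, 2]]
```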

D

Decision Trees : Tree‑structured models used for classification and regression tasks.

Dimensionality Reduction : The process of reducing the number of features in a dataset while retaining important information.

Discriminative Models : Models that learn boundaries between different classes.

E

Ensemble Learning : Techniques that combine multiple models to improve predictive performance.

EDA (Exploratory Data Analysis) : The process of analyzing and visualizing data to understand its patterns and attributes.

Entropy : A measure of uncertainty or randomness in information.
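
Shannon entropy is straightforward to compute for a discrete distribution; a minimal sketch:

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # a fair coin: 1.0 bit of uncertainty
print(entropy([1.0]))       # a certain outcome: 0.0 bits
```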

F

Feature Engineering : The process of creating new features from existing data to improve model performance.

F-score : The harmonic mean of precision and recall, used to balance the two in binary classification.

Feature Extraction : Automatically extracting meaningful features from data.
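
The F1 score can be computed from raw counts; a small sketch (the counts below are illustrative):

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, from raw prediction counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 8 true positives, 2 false positives, 2 false negatives -> F1 = 0.8
print(f1_score(8, 2, 2))
```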

G

Gradient Descent : An optimization algorithm that iteratively adjusts parameters to minimize a function.

Gaussian Distribution : The normal distribution with a bell‑shaped probability density function.

Gradient Boosting : An ensemble learning method that builds multiple weak learners sequentially.
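
Gradient descent in one dimension, as a minimal sketch (the learning rate and step count are arbitrary choices for the example):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient to minimize a function."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3); the minimum is at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```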

H

Hypothesis : A testable statement or assumption in statistical inference.

Hierarchical Clustering : A clustering method that organizes data into a tree‑like structure.

Heteroscedasticity : Unequal variance of errors in a regression model.

I

Information Gain : A metric used in decision trees to determine feature importance.

Independent Variable : A variable that is manipulated in an experiment to observe its effect on the dependent variable.

Imbalanced Data : A situation where the class distribution in a dataset is uneven.

J

Jupyter : An interactive computing environment for data analysis and machine learning.

Joint Probability : The probability of two or more events occurring simultaneously.

Jaccard Index : A similarity measure between two sets.
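
The Jaccard index is simple to compute with Python sets:

```python
def jaccard(a, b):
    """Intersection over union of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard([1, 2, 3], [2, 3, 4]))  # 2 shared of 4 total -> 0.5
```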

K

Kernel Density Estimation : A non‑parametric method for estimating the probability density function of a continuous random variable.

KS Test (Kolmogorov‑Smirnov Test) : A non‑parametric test that compares two probability distributions.

K-Means Clustering : An algorithm that partitions data into K clusters by repeatedly assigning points to the nearest centroid and updating the centroids.
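
A toy one-dimensional K-means, sketched for clarity (it initializes centroids from the first K points for determinism; real implementations use random or k-means++ initialization):

```python
def kmeans_1d(points, k, iters=10):
    """Naive 1-D k-means; centroids start at the first k points."""
    centroids = list(points[:k])
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

centers = kmeans_1d([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], k=2)
```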

L

Likelihood : The probability of observing the data given specific parameter values of a model.

Linear Regression : A statistical method for modeling the relationship between a dependent variable and one or more independent variables.

L1/L2 Regularization : Techniques that add penalty terms to a model’s loss function to prevent overfitting.
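
Simple (one-variable) linear regression has a closed-form least-squares solution, sketched here:

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

print(linear_fit([1, 2, 3, 4], [3, 5, 7, 9]))  # (2.0, 1.0): y = 2x + 1
```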

M

Maximum Likelihood Estimation : A method for estimating the parameters of a statistical model.

Multicollinearity : A situation where two or more independent variables in a regression model are highly correlated.

Mutual Information : A measure of the amount of information shared between two variables.

N

Naive Bayes : A probabilistic classifier based on Bayes’ theorem that assumes feature independence.

Normalization : Scaling data to a specified range.
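
Min-max scaling, one common form of normalization, as a sketch:

```python
def min_max_scale(xs, lo=0.0, hi=1.0):
    """Rescale values linearly so min(xs) maps to lo and max(xs) to hi."""
    mn, mx = min(xs), max(xs)
    return [lo + (x - mn) * (hi - lo) / (mx - mn) for x in xs]

print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
```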

O

Overfitting : When a model performs well on training data but poorly on unseen data.

Outliers : Data points that are markedly different from the rest of the dataset.

One-hot encoding : Converting categorical variables into binary vectors.
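
One-hot encoding in a few lines of plain Python (sorting the categories is an arbitrary convention chosen here for reproducibility):

```python
def one_hot(labels):
    """Encode each label as a binary vector over the sorted set of categories."""
    cats = sorted(set(labels))
    return [[1 if lab == c else 0 for c in cats] for lab in labels]

# Categories sort to ["green", "red"]
print(one_hot(["red", "green", "red"]))  # [[0, 1], [1, 0], [0, 1]]
```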

P

PCA (Principal Component Analysis) : A dimensionality‑reduction technique that transforms data into orthogonal components.

Precision : The proportion of true positive predictions among all positive predictions in a classification model.

p-value : The probability of observing results at least as extreme as those obtained, assuming the null hypothesis is true.

Q

QQ-plot (Quantile‑Quantile Plot) : A graphical tool for comparing the distributions of two datasets.

QR decomposition : Decomposes a matrix into an orthogonal matrix and an upper‑triangular matrix.

R

Random Forest : An ensemble learning method that uses multiple decision trees for prediction.

Recall : The proportion of true positive predictions among all actual positive instances in a classification model.

ROC Curve : A chart that displays the performance of a binary classifier at various threshold settings.

S

SVM (Support Vector Machine) : A supervised machine‑learning algorithm used for classification and regression.

Standardization : Scaling data to have a mean of 0 and a standard deviation of 1.

Sampling : The process of selecting a subset of data points from a larger dataset.

T

t-SNE (t‑Distributed Stochastic Neighbor Embedding) : A dimensionality‑reduction technique for visualizing high‑dimensional data in lower dimensions.

t-distribution : A probability distribution used in hypothesis testing for small sample sizes.

Type I/II Error : In hypothesis testing, a Type I error is a false positive, and a Type II error is a false negative.

U

Underfitting : When a model is too simple to capture the underlying patterns in the data.

UMAP (Uniform Manifold Approximation and Projection) : A dimensionality‑reduction technique for visualizing high‑dimensional data.

Uniform Distribution : A probability distribution where all outcomes are equally likely.

V

Variance : A measure of how data points spread around the mean.

Validation Curve : A chart that shows how model performance varies with different hyperparameter values.

Vanishing Gradient : A problem in deep neural networks where gradients become extremely small during training.

W

Word embedding : Representing words as dense vectors in natural language processing.

Word cloud : A visual representation of text data where word frequency is indicated by size.

Weights : Parameters learned by a machine‑learning model during training.

X

XGBoost : Extreme Gradient Boosting, a popular gradient‑boosting library.

XLNet : A Transformer‑based language model trained with a generalized autoregressive pretraining objective.

Y

YOLO (You Only Look Once) : A real‑time object detection system.

Yellowbrick : A Python library for visualizing and diagnosing machine‑learning models.

Z

Z-score : A standardized value indicating how many standard deviations a data point is from the mean.

Z-test : A statistical test used to compare a sample mean to a known population mean.

Zero-shot learning : A machine‑learning approach where a model can recognize new categories without having seen explicit examples during training.
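
The z-score formula, sketched with the population standard deviation (some tools use the sample standard deviation instead):

```python
def z_scores(xs):
    """Standardize values: how many standard deviations each lies from the mean."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / std for x in xs]

# [2, 4, 6] -> roughly [-1.22, 0.0, 1.22]
print(z_scores([2, 4, 6]))
```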

Tags: machine learning, statistics, data science, glossary, AI terms
Written by Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
