Data Structures Related to Machine Learning Algorithms
The article explains that machine‑learning programs rely on the same fundamental data structures as other software—especially arrays for linear algebra—but mastering their implementation, dynamic behavior, and alternatives such as linked lists, trees, heaps, stacks, queues, and hash maps is crucial for scaling and optimizing complex AI tasks.
The article argues that the data structures used in machine learning are not fundamentally different from those used in other software domains, but mastering the basics is essential because of the scale and difficulty of many ML problems.
Because machine learning is a mathematical field, data structures should be viewed both as tools for solving mathematical problems and as mathematical objects themselves.
Data structures can be classified in two ways: by their implementation and by the operations they support.
Arrays are highlighted as the most important data structure for ML, especially for linear algebra. One‑dimensional and two‑dimensional arrays correspond to vectors and matrices, while three‑ and four‑dimensional arrays appear for higher‑order tensors or grouped examples. The article discusses choosing among the many scientific computing environments and libraries (MATLAB, IDL, NumPy) for matrix operations.
Example C++ code for a simple matrix‑vector multiplication is provided:
// y = A*x for an n-by-n matrix a and length-n vectors x and y
for (int i = 0; i < n; i++) {
    y[i] = 0;
    for (int j = 0; j < n; j++) y[i] += a[i][j] * x[j];
}

In most cases, an array can be allocated once with a size fixed at runtime; when the size must grow, a dynamic array (e.g., std::vector in the C++ standard library) can be used instead. Languages such as MATLAB and Python provide built‑in extensible arrays.
The article explains the metadata stored with dynamic arrays (capacity and actual size) and the amortized O(1) cost of appending elements.
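The capacity and size bookkeeping described above can be sketched as follows. This is a minimal illustration, not the article's code: the class name DynArray and the doubling growth factor are assumptions, and in practice std::vector already does this job.

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical growable array of doubles (illustration only).
class DynArray {
public:
    void push_back(double value) {
        if (size_ == capacity_) {
            grow();  // occasional O(n) copy; amortized O(1) per append
        }
        data_[size_++] = value;
    }
    double operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return size_; }
    std::size_t capacity() const { return capacity_; }

private:
    // Assumed policy: double the capacity whenever the array is full.
    void grow() {
        std::size_t new_capacity = (capacity_ == 0) ? 1 : 2 * capacity_;
        std::unique_ptr<double[]> bigger(new double[new_capacity]);
        for (std::size_t i = 0; i < size_; ++i) bigger[i] = data_[i];
        data_ = std::move(bigger);
        capacity_ = new_capacity;
    }

    std::unique_ptr<double[]> data_;
    std::size_t size_ = 0;      // elements actually in use
    std::size_t capacity_ = 0;  // elements allocated
};
```

Because the capacity doubles, n appends trigger only about log2(n) reallocations, and the total number of copied elements stays below 2n, which is where the amortized constant cost comes from.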
Linked Lists consist of individually allocated nodes, each holding a data value and a pointer to the next node. Insertion is fast (constant time) once the insertion point is known, but random access is slow because reaching the k‑th element requires walking the list from its head.
Linked lists can be easily concatenated or split, and variations include head/tail insertion, doubly linked lists, and structures derived from the same principle such as binary trees.
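A singly linked list with head insertion might look like the sketch below. The names ListNode and LinkedList are hypothetical, and std::unique_ptr is used here only as one convenient way to own the nodes.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical node type: one value plus an owning pointer to the next node.
struct ListNode {
    int value;
    std::unique_ptr<ListNode> next;
};

class LinkedList {
public:
    // O(1): the new node becomes the head and points at the old head.
    void push_front(int value) {
        auto node = std::make_unique<ListNode>();
        node->value = value;
        node->next = std::move(head_);
        head_ = std::move(node);
    }

    // O(n): random access has to follow `index` links from the head.
    // Returns nullptr when the index is past the end.
    const int* at(std::size_t index) const {
        const ListNode* node = head_.get();
        while (node && index > 0) {
            node = node->next.get();
            --index;
        }
        return node ? &node->value : nullptr;
    }

    // Copy the values, head first, into a contiguous array.
    std::vector<int> to_vector() const {
        std::vector<int> out;
        for (const ListNode* n = head_.get(); n; n = n->next.get()) {
            out.push_back(n->value);
        }
        return out;
    }

private:
    std::unique_ptr<ListNode> head_;
};
```

The contrast with an array is visible in the two methods: push_front touches two pointers regardless of length, while at has to walk the chain.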
Binary Trees have two child pointers per node. In a binary search tree, values in a node's left subtree are smaller than the node's value and values in its right subtree are larger. This keeps the data sorted automatically, with average O(log n) insertion and lookup, although a tree built from already sorted input can degrade to O(n). An in‑order traversal converts the tree into a sorted array.
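The ordering rule above (smaller values to the left, larger to the right) can be sketched as an unbalanced binary search tree. The names TreeNode, bst_insert, bst_contains, and in_order are illustrative, and ignoring duplicate keys is an assumption made for brevity.

```cpp
#include <memory>
#include <vector>

struct TreeNode {
    int key;
    std::unique_ptr<TreeNode> left, right;
    explicit TreeNode(int k) : key(k) {}
};

// Descend left for smaller keys and right for larger ones.
// Duplicate keys are ignored (an assumption of this sketch).
void bst_insert(std::unique_ptr<TreeNode>& root, int key) {
    if (!root) {
        root = std::make_unique<TreeNode>(key);
        return;
    }
    if (key < root->key) {
        bst_insert(root->left, key);
    } else if (key > root->key) {
        bst_insert(root->right, key);
    }
}

// Lookup follows a single root-to-leaf path.
bool bst_contains(const TreeNode* node, int key) {
    while (node) {
        if (key == node->key) return true;
        node = (key < node->key) ? node->left.get() : node->right.get();
    }
    return false;
}

// In-order traversal appends the keys to `out` in ascending order.
void in_order(const TreeNode* node, std::vector<int>& out) {
    if (!node) return;
    in_order(node->left.get(), out);
    out.push_back(node->key);
    in_order(node->right.get(), out);
}
```

Inserting a batch of values and then calling in_order is effectively a tree sort, which is the tree-to-array conversion the article refers to.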
Balanced Trees (self‑balancing trees) keep the tree height close to its log2(n) minimum, so insertion and lookup stay O(log n) even in the worst case. The article mentions KD‑trees as a binary‑tree variant useful for nearest‑neighbor queries in ML.
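To make the KD‑tree idea concrete, here is a hedged two‑dimensional sketch: points are split at the median, alternating between the x and y axes by depth, and the nearest‑neighbor search skips a subtree whenever the splitting line is farther away than the best match found so far. All names (Point2, KdNode, kd_build, kd_nearest, kd_nearest_point) are assumptions for illustration, not taken from the article.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <limits>
#include <memory>
#include <vector>

using Point2 = std::array<double, 2>;

struct KdNode {
    Point2 point;
    std::unique_ptr<KdNode> left, right;
};

// Build over pts[lo, hi): the median along the current axis becomes the node.
std::unique_ptr<KdNode> kd_build(std::vector<Point2>& pts, std::size_t lo,
                                 std::size_t hi, std::size_t depth) {
    if (lo >= hi) return nullptr;
    const std::size_t axis = depth % 2;
    const std::size_t mid = lo + (hi - lo) / 2;
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Point2& a, const Point2& b) {
                         return a[axis] < b[axis];
                     });
    auto node = std::make_unique<KdNode>();
    node->point = pts[mid];
    node->left = kd_build(pts, lo, mid, depth + 1);
    node->right = kd_build(pts, mid + 1, hi, depth + 1);
    return node;
}

double sq_dist(const Point2& a, const Point2& b) {
    const double dx = a[0] - b[0], dy = a[1] - b[1];
    return dx * dx + dy * dy;
}

// Recursive search; best and best_d2 carry the closest point seen so far.
void kd_nearest(const KdNode* node, const Point2& q, std::size_t depth,
                Point2& best, double& best_d2) {
    if (!node) return;
    const double d2 = sq_dist(node->point, q);
    if (d2 < best_d2) {
        best_d2 = d2;
        best = node->point;
    }
    const std::size_t axis = depth % 2;
    const double diff = q[axis] - node->point[axis];
    const KdNode* closer = (diff < 0) ? node->left.get() : node->right.get();
    const KdNode* farther = (diff < 0) ? node->right.get() : node->left.get();
    kd_nearest(closer, q, depth + 1, best, best_d2);
    // The far side can only help if the splitting line is nearer than the best.
    if (diff * diff < best_d2) kd_nearest(farther, q, depth + 1, best, best_d2);
}

// Convenience wrapper; assumes the tree is not empty.
Point2 kd_nearest_point(const KdNode* root, const Point2& q) {
    Point2 best{};
    double best_d2 = std::numeric_limits<double>::infinity();
    kd_nearest(root, q, 0, best, best_d2);
    return best;
}
```

Note that kd_build reorders the input vector in place. Median splitting also keeps the tree balanced at construction time, which ties this variant back to the point about tree height.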
Heaps are hierarchical, partially ordered structures where each parent is larger than its children (max‑heap) or smaller (min‑heap). Insertion and retrieval are performed by “bubbling” elements up or down. Heaps are typically stored in arrays, with the relationship between elements implicit.
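The "bubbling" and the implicit array layout can be sketched as a small max‑heap. The class name MaxHeap is hypothetical, and the index arithmetic assumes the common zero‑based layout in which the children of index i sit at 2i+1 and 2i+2; the standard library offers the same behavior through std::priority_queue.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Max-heap stored in a vector; parent/child links are implied by the indices.
class MaxHeap {
public:
    // Append at the end, then bubble up while larger than the parent.
    void push(int value) {
        data_.push_back(value);
        std::size_t i = data_.size() - 1;
        while (i > 0) {
            const std::size_t parent = (i - 1) / 2;
            if (data_[parent] >= data_[i]) break;
            std::swap(data_[parent], data_[i]);
            i = parent;
        }
    }

    // Remove and return the largest element. Precondition: !empty().
    int pop() {
        const int top = data_.front();
        data_.front() = data_.back();
        data_.pop_back();
        const std::size_t n = data_.size();
        std::size_t i = 0;
        while (true) {  // bubble the moved element down
            const std::size_t left = 2 * i + 1, right = 2 * i + 2;
            std::size_t largest = i;
            if (left < n && data_[left] > data_[largest]) largest = left;
            if (right < n && data_[right] > data_[largest]) largest = right;
            if (largest == i) break;
            std::swap(data_[i], data_[largest]);
            i = largest;
        }
        return top;
    }

    bool empty() const { return data_.empty(); }

private:
    std::vector<int> data_;
};
```

Both operations move along a single root‑to‑leaf path, so each costs O(log n), and popping every element yields the values in descending order.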
Stacks follow a “last‑in‑first‑out” (LIFO) discipline and are used for parsing syntax and implementing programming languages. The article cites domain‑specific languages (DSLs) that rely on stacks for recursive control flow.
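As one small example of a stack in parsing, the sketch below evaluates a postfix (reverse Polish) arithmetic expression. The function name eval_rpn and the whitespace‑separated token format are assumptions chosen for illustration; the article does not specify a particular language or grammar.

```cpp
#include <sstream>
#include <stack>
#include <stdexcept>
#include <string>

// Evaluate a whitespace-separated postfix expression such as "3 4 + 2 *".
double eval_rpn(const std::string& expr) {
    std::stack<double> operands;
    std::istringstream tokens(expr);
    std::string tok;
    while (tokens >> tok) {
        if (tok == "+" || tok == "-" || tok == "*" || tok == "/") {
            if (operands.size() < 2) {
                throw std::invalid_argument("operator is missing an operand");
            }
            // Last in, first out: the right-hand operand is on top.
            const double rhs = operands.top(); operands.pop();
            const double lhs = operands.top(); operands.pop();
            if (tok == "+") operands.push(lhs + rhs);
            else if (tok == "-") operands.push(lhs - rhs);
            else if (tok == "*") operands.push(lhs * rhs);
            else operands.push(lhs / rhs);
        } else {
            operands.push(std::stod(tok));  // throws on non-numeric tokens
        }
    }
    if (operands.size() != 1) {
        throw std::invalid_argument("malformed expression");
    }
    return operands.top();
}
```

Each operator consumes the two most recently pushed values, which is exactly the LIFO behavior that nested expressions and recursive calls require.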
Queues follow a “first‑in‑first‑out” (FIFO) discipline, useful in real‑time programming to maintain a list of pending jobs. Sets, described as unordered collections of unique elements, are also mentioned as useful for many ML mathematical operations.
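Both structures are available in the C++ standard library, so a sketch only needs thin wrappers. The function names run_in_order and distinct_labels are hypothetical. Note also that std::set keeps its elements sorted internally, whereas the mathematical notion of a set is unordered (std::unordered_set is the hash‑based alternative).

```cpp
#include <queue>
#include <set>
#include <string>
#include <vector>

// Drain a queue of pending job names in first-in-first-out order.
// The queue is taken by value, so the caller's copy is left untouched.
std::vector<std::string> run_in_order(std::queue<std::string> pending) {
    std::vector<std::string> finished;
    while (!pending.empty()) {
        finished.push_back(pending.front());  // oldest job first
        pending.pop();
    }
    return finished;
}

// Collect the distinct class labels in a dataset; duplicates collapse.
std::set<int> distinct_labels(const std::vector<int>& labels) {
    return std::set<int>(labels.begin(), labels.end());
}
```

In a real scheduler the body of the loop would execute the job instead of just recording its name.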
Associative Arrays (Hash Maps) store key‑value pairs, reflecting the relational nature of many training datasets. They are suitable for building dictionaries and DSL symbol tables.
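A dictionary of this kind can be sketched with std::unordered_map, here mapping each distinct token to an integer id in order of first appearance. The function name build_vocabulary and the id scheme are assumptions for illustration.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Assign each distinct token the next unused integer id (0, 1, 2, ...).
std::unordered_map<std::string, int> build_vocabulary(
        const std::vector<std::string>& tokens) {
    std::unordered_map<std::string, int> ids;
    for (const std::string& token : tokens) {
        // emplace inserts only if the key is absent, so repeated tokens
        // keep the id they received the first time they were seen.
        ids.emplace(token, static_cast<int>(ids.size()));
    }
    return ids;
}
```

Lookup by key takes constant time on average, which is what makes the structure suitable for symbol tables and for translating words or categories into array indices.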
The article encourages designing custom data structures when standard ones are insufficient, giving examples such as sparse matrix representations that store only non‑zero elements as triples within a dynamic array.
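The triple‑based sparse matrix can be sketched as follows. The names Triple, SparseMatrix, and multiply are hypothetical, and the sketch assumes each (row, col) position is added at most once and that indices are within bounds.

```cpp
#include <cstddef>
#include <vector>

// One stored entry: its position and its non-zero value.
struct Triple {
    std::size_t row;
    std::size_t col;
    double value;
};

// Sparse matrix as a dynamic array of (row, col, value) triples.
struct SparseMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<Triple> entries;

    // Zeros are never stored; they are implied by absence.
    void add(std::size_t r, std::size_t c, double v) {
        if (v != 0.0) entries.push_back(Triple{r, c, v});
    }
};

// y = A*x, visiting only the stored entries: O(nnz) instead of O(rows*cols).
std::vector<double> multiply(const SparseMatrix& a,
                             const std::vector<double>& x) {
    std::vector<double> y(a.rows, 0.0);
    for (const Triple& t : a.entries) {
        y[t.row] += t.value * x[t.col];
    }
    return y;
}
```

Compared with the dense matrix‑vector loop, the work here scales with the number of non‑zero entries, which is the point of storing only those.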
In conclusion, the author emphasizes that while basic fixed‑length arrays are common in their work, understanding and employing a variety of data structures—arrays, linked lists, trees, heaps, stacks, queues, and hash maps—greatly improves program flexibility and performance, especially for complex AI applications that may involve directed or undirected graphs.
Tencent Cloud Developer