Fundamentals · 13 min read

Python List vs Set: Performance Comparison and Underlying Implementation Details

This article compares the lookup speed of Python lists and sets on large datasets, presents benchmark code and results, and explains why sets are dramatically faster by examining the internal C‑level implementations of list (dynamic array) and set/dict (hash table) including resizing rules and collision‑resolution strategies.


The article begins by asking how large‑scale data lookups differ in efficiency between Python's list and set structures, then demonstrates a benchmark that measures the time required to search for randomly generated numbers in a list of one million elements versus a set containing the same elements.

```python
import time
import random

# 1,000 random probe values; about half will fall inside the 0..999999 range
nums = [random.randint(0, 2000000) for i in range(1000)]
list_test = list(range(1000000))
set_test = set(list_test)
count_list, count_set = 0, 0

t1 = time.time()
for num in nums:
    if num in list_test:
        count_list += 1
t2 = time.time()

for num in nums:
    if num in set_test:
        count_set += 1
t3 = time.time()

print('Found, list: {}, set: {}'.format(count_list, count_set))
print('Time taken, list: {:.4f}s'.format(t2 - t1))
print('Time taken, set: {:.4f}s'.format(t3 - t2))
```

The output shows that the list lookup takes about 7.9 seconds while the set lookup completes in roughly 0.001 seconds, illustrating the huge performance gap for big data sets.

Python's list is implemented as a dynamic array of PyObject* pointers, defined in C (simplified) as:

```c
typedef struct {
    PyObject_VAR_HEAD
    PyObject **ob_item;    /* array of pointers to the element objects */
    Py_ssize_t allocated;  /* number of slots currently allocated */
} PyListObject;
```

The array stores references to objects of any type. When an append outgrows the current allocation, CPython computes the new capacity as:

```c
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
new_allocated += newsize;
```

That is, it over-allocates by roughly one-eighth of the new size plus a small constant, which keeps append amortized O(1); the allocation is shrunk only when the list falls below half of its allocated capacity. Each resize still copies the entire pointer array, so workloads that trigger frequent resizing pay for those memory copies.
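The over-allocation is easy to observe from Python itself. This small sketch (assuming CPython; the exact byte counts vary by version and platform) appends twenty items and records the list object's reported size after each one: the size stays flat between resizes, then jumps when an append forces a reallocation.

```python
import sys

sizes = []
lst = []
for i in range(20):
    lst.append(i)
    # getsizeof reports the list object plus its pointer array,
    # not the elements themselves
    sizes.append(sys.getsizeof(lst))

# Plateaus between jumps reveal the over-allocated slack capacity.
print(sizes)
```

If every append grew the array by exactly one slot, the sizes would increase on every iteration instead of in occasional jumps.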

Because membership testing on a list is a linear scan over the pointer array, its lookup time complexity is O(n), which explains the poor performance observed in the benchmark.

In contrast, set (and dict) are built on hash tables. A key is hashed to an integer, the hash is reduced modulo the table size (a power of two, so in practice a bit-mask), and the entry is stored in that bucket. Python resolves collisions with open addressing and pseudo-random probing, and each slot can be in one of three states: Unused (key and value both NULL), Active (key and value present), or Dummy (key replaced by a special placeholder after deletion). The Dummy state preserves the probe chain so that lookups remain correct after deletions.
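To make the three slot states concrete, here is a deliberately tiny toy hash set, not CPython's actual code: it uses linear probing (CPython uses pseudo-random, perturb-based probing), never resizes, and exists only to show why the Dummy placeholder matters.

```python
DUMMY = object()  # deleted-slot marker; None means Unused

class ToySet:
    """Toy open-addressing set: fixed 8 slots, no resize, linear probing."""

    def __init__(self):
        self.slots = [None] * 8

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        while True:
            yield i
            i = (i + 1) % len(self.slots)  # linear probe step

    def add(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is None or slot is DUMMY or slot == key:
                self.slots[i] = key
                return

    def discard(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is None:
                return  # hit Unused: key is not in the table
            if slot is not DUMMY and slot == key:
                self.slots[i] = DUMMY  # keep the probe chain unbroken
                return

    def __contains__(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is None:
                return False  # Unused slot terminates the probe chain
            if slot is not DUMMY and slot == key:
                return True

s = ToySet()
for k in (1, 9, 17):   # all hash to bucket 1 (mod 8), so they chain
    s.add(k)
s.discard(9)           # 9's slot becomes DUMMY, not Unused
print(1 in s, 9 in s, 17 in s)  # 17 is still found past the DUMMY slot
```

If discard set the slot back to Unused instead of Dummy, the lookup for 17 would stop early at the empty slot and wrongly report it missing.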

Sets store only keys, while dictionaries store key-value pairs; both require keys to be hashable (in practice, immutable built-in types such as int, str, and tuple, or objects with a stable hash). To keep collisions rare, CPython resizes the table when the number of used slots exceeds about two-thirds of its capacity, so the effective load factor stays well below 1.
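The hashability requirement is visible directly at the interpreter: immutable built-ins work as set members or dict keys, while a mutable container like a list raises TypeError.

```python
# int, str, and tuple are hashable, so all three are valid set members.
ok = {1, "a", (1, 2)}

try:
    bad = {[1, 2]}  # a list is mutable, hence unhashable
except TypeError as e:
    print("unhashable:", e)
```

A mutable key would be a problem because changing it after insertion would change its hash, leaving the entry stranded in the wrong bucket.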

For homogeneous data (e.g., all integers or strings), the article suggests using the array module or numpy arrays, which provide a contiguous memory layout and avoid the pointer indirection of lists, yielding even better performance.
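The footprint difference can be sketched with the standard library alone. Note that sys.getsizeof on a list counts only the pointer array, not the boxed int objects it references, so those must be added to get the real total; the array stores raw machine integers contiguously with no per-element object overhead.

```python
import array
import sys

n = 100_000
as_list = list(range(n))
as_array = array.array('l', range(n))  # 'l' = signed C long, stored unboxed

# List total = pointer array + every boxed int object it points to.
total_list = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
total_array = sys.getsizeof(as_array)  # one contiguous buffer of raw longs

print(total_list, total_array)
```

Beyond memory, the contiguous layout is also friendlier to CPU caches, which is part of why numpy operations over such buffers are fast.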

Tags: Performance, Python, Data Structures, list, hash table, set, dynamic array
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
