Optimizing Python Dependency Management: Refactoring pipdeptree for TencentOS
This article details the background, research, demos, discovered issues, and the comprehensive refactor of the pipdeptree tool to replace deprecated pkg_resources APIs with importlib.metadata, improving Python package dependency analysis for AI workloads in TencentOS.
Background
As AI ecosystems grow, Python has become their foundational language, bringing an explosion of packages and libraries. Solving the Python dependency‑hell problem brings two main benefits: trimming unnecessary packages reduces the size of AI/Python images, and cleaning up redundant dependencies keeps development environments lean and dependency relationships clear.
Research
The pipdeptree tool meets our requirements and is already integrated in TencentOS Server 4, installable via dnf install python3-pipdeptree. We also use it when building Python runtime and PyTorch images to produce smaller, lighter images.
pipdeptree is simple to use; the two most common commands show the tree for selected packages (-p) and emit the whole tree as JSON (-j):
<code>pipdeptree -p ABC</code>
<code>pipdeptree -j</code>
Internally, pipdeptree relies on the pkg_resources library (shipped with setuptools). The core parts are:
Environment.from_paths(None).iter_installed_distributions – retrieves all installed distributions in the current environment.
DistInfoDistribution.requires() – obtains the dependencies of a distribution.
Example of the generated dependency tree (truncated):
<code>{
  "package": {
    "key": "adal",
    "package_name": "adal",
    "installed_version": "1.2.7"
  },
  "dependencies": [
    {
      "key": "cryptography",
      "package_name": "cryptography",
      "installed_version": "41.0.4",
      "required_version": ">=1.1.0"
    },
    {
      "key": "pyjwt",
      "package_name": "PyJWT",
      "installed_version": "2.6.0",
      "required_version": ">=1.0.0,<3"
    }
  ]
}
</code>
Important notes:
The generated tree does not include tox.ini test dependencies, which are irrelevant for runtime RPM builds and can be ignored.
The pkg_resources API is deprecated; we should replace it with importlib.metadata and related APIs.
Our own RPM packaging also uses BuildRequires and Requires, which can introduce redundant dependencies when the upstream package does not accurately list its needs.
Proposed Optimizations
Use pipdeptree to prune unnecessary Python packages from the environment.
Parse Python package source code to verify whether declared dependencies are actually used.
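As a rough illustration of the first optimization, the sketch below finds packages that no other installed package depends on. These are the natural starting points when reviewing what can be pruned. It assumes the JSON output is a list of entries shaped like the example above; the helper name top_level_packages and the sample data are hypothetical.

```python
import json


def top_level_packages(tree):
    """Return the keys of packages that no other package in the tree requires."""
    all_keys = {entry["package"]["key"] for entry in tree}
    required = {
        dep["key"]
        for entry in tree
        for dep in entry["dependencies"]
    }
    return sorted(all_keys - required)


# Hypothetical sample in the shape of `pipdeptree -j` output.
sample_tree = json.loads("""
[
  {"package": {"key": "adal", "package_name": "adal", "installed_version": "1.2.7"},
   "dependencies": [
     {"key": "cryptography", "package_name": "cryptography",
      "installed_version": "41.0.4", "required_version": ">=1.1.0"},
     {"key": "pyjwt", "package_name": "PyJWT",
      "installed_version": "2.6.0", "required_version": ">=1.0.0,<3"}]},
  {"package": {"key": "cryptography", "package_name": "cryptography",
               "installed_version": "41.0.4"},
   "dependencies": []},
  {"package": {"key": "pyjwt", "package_name": "PyJWT", "installed_version": "2.6.0"},
   "dependencies": []}
]
""")

print(top_level_packages(sample_tree))
```

A top-level package is only a candidate for removal: whether it is actually needed by the image still has to be decided by whoever owns the workload.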
Demo – Step 1
<code>import json
import re
import subprocess

# RPM names to ignore when comparing dependency sets.
FILTERED_DEPENDENCIES = ['python3']


def extract_package_name(dep):
    """Pull the Python name out of an RPM capability such as 'python3dist(foo)'."""
    match = re.match(r'.*python3(?:\.\d+)?dist\(([^)]+)\).*', dep)
    if match:
        return match.group(1)
    return dep.split(' ')[0]


def get_package_dependencies(package_name):
    """Return the RPM-level dependencies that rpm-dep reports for a Python package."""
    package_name = get_rpmname(package_name)
    print(package_name)
    try:
        command = f'rpm-dep -i {package_name} -q'
        # check=True so that a failing rpm-dep actually raises CalledProcessError.
        subprocess.run(command.split(), check=True)
        # f-string so the package name is interpolated into the file name.
        parse_cmd = f"jq -r '.next[] | .pkg_name' dep_tree__{package_name}__install.json | sort | uniq"
        output = subprocess.getoutput(parse_cmd)
        dependencies = output.strip().split('\n')
        return dependencies
    except subprocess.CalledProcessError:
        return []


def get_rpmname(py_name):
    """Map a Python distribution name to its RPM package name, or 'ERROR' if none is found."""
    if not py_name.startswith('python-'):
        command = f"dnf repoquery --whatprovides 'python3dist({py_name})' --latest-limit 1 --queryformat '%{{NAME}}' -q"
        output = subprocess.getoutput(command)
        if output == "":
            # Retry with the lower-cased name.
            command = f"dnf repoquery --whatprovides 'python3dist({py_name.lower()})' --latest-limit 1 --queryformat '%{{NAME}}' -q"
            output = subprocess.getoutput(command)
        if output == "":
            # Fall back to the conventional python3-<name> RPM name.
            py_name = f"python3-{py_name}"
            # py_name already carries the python3- prefix here.
            info_command = f'dnf info {py_name}'
            info_result = subprocess.run(info_command.split(), stderr=subprocess.DEVNULL, stdout=subprocess.PIPE, text=True)
            if info_result.returncode != 0:
                py_name = "ERROR"
        else:
            py_name = output
    else:
        # Strip the 'python-' prefix and look for the python3-<name> RPM.
        py_name = py_name[7:]
        info_command = f'dnf info python3-{py_name}'
        info_result = subprocess.run(info_command.split(), stderr=subprocess.DEVNULL, stdout=subprocess.PIPE, text=True)
        if info_result.returncode != 0:
            py_name = "ERROR"
        else:
            py_name = f"python3-{py_name}"
    return py_name


def check_dependencies(package_data):
    """Compare pipdeptree's dependencies for one package with the RPM repository's."""
    package_name = package_data['package']['key']
    local_dependencies = [get_rpmname(dep['key']) for dep in package_data['dependencies']]
    repo_dependencies = get_package_dependencies(package_name)
    missing_dependencies = list(set(repo_dependencies) - set(local_dependencies))
    extra_dependencies = list(set(local_dependencies) - set(repo_dependencies))
    missing_dependencies = [dep for dep in missing_dependencies if dep not in FILTERED_DEPENDENCIES]
    extra_dependencies = [dep for dep in extra_dependencies if dep not in FILTERED_DEPENDENCIES]
    print(missing_dependencies)
    print(extra_dependencies)
    return {
        'package_name': get_rpmname(package_name),
        'missing_dependencies': missing_dependencies,
        'extra_dependencies': extra_dependencies
    }


def main():
    # packages.json holds the output of `pipdeptree -j`.
    with open('packages.json', 'r') as file:
        packages_data = json.load(file)
    result = []
    for package_data in packages_data:
        package_result = check_dependencies(package_data)
        result.append(package_result)
    with open('result.json', 'w') as file:
        json.dump(result, file, indent=2)


if __name__ == '__main__':
    main()
</code>
Analysis of the results shows two main inaccuracies:
Test dependencies declared in tox.ini are not reported by pipdeptree, so the RPM dependency set appears larger than the one pipdeptree finds.
Some packages do not follow standard Python packaging conventions, leading to mismatched names (e.g., pycryptodome vs. Crypto).
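One possible way to bridge the name mismatch, sketched here and not taken from the demo, is to ask importlib.metadata which installed distributions provide a given top-level module. The sketch assumes Python 3.10 or later for packages_distributions() and returns an empty list on older interpreters; the helper name distributions_for_module is hypothetical.

```python
import importlib.metadata


def distributions_for_module(module_name):
    """Return the installed distribution names that provide a top-level module."""
    finder = getattr(importlib.metadata, "packages_distributions", None)
    if finder is None:
        # packages_distributions() was added in Python 3.10.
        return []
    return list(finder().get(module_name, []))


# On a system where pycryptodome is installed, the result for "Crypto"
# would be expected to include it; otherwise the list is empty.
print(distributions_for_module("Crypto"))
```

The lookup only covers what is installed in the current environment, so it complements the RPM-side name resolution; it does not replace it.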
Demo – Step 2
<code>import importlib.metadata
import importlib.resources
import json
import os
import re
import sys

# Start from the interpreter's compiled-in modules; the standard library is added below.
builtin_modules = set(sys.builtin_module_names)


def get_standard_library_modules():
    """Collect standard-library module names by walking the stdlib directory."""
    lib_path = os.path.dirname(os.__file__)
    modules = []

    def add_module(root, file):
        module_path = os.path.relpath(os.path.join(root, file), lib_path)
        module_name = os.path.splitext(module_path.replace(os.path.sep, '.'))[0]
        if module_name.endswith('.__init__'):
            module_name = module_name[:-9]
        modules.append(module_name)

    for root, dirs, files in os.walk(lib_path):
        # Third-party packages live in site-packages; do not treat them as stdlib.
        if 'site-packages' in dirs:
            dirs.remove('site-packages')
        if root == lib_path:
            for file in files:
                if file.endswith('.py'):
                    add_module(root, file)
        if '__init__.py' in files:
            add_module(root, '__init__.py')
    return modules


builtin_modules.update(get_standard_library_modules())


def parse_imports(file_path):
    """Return third-party top-level modules imported by a file (regex-based scan)."""
    # Ignore undecodable bytes so one odd file does not abort the whole package.
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
        content = file.read()
    # Drop comments and triple-double-quoted strings before matching.
    content = re.sub(r'#.*', '', content)
    content = re.sub(r'""".*?"""', '', content, flags=re.DOTALL)
    import_re = re.compile(r'(?:from\s+([.\w]+)(?:\s+import\s+[\w, ()]+)|import\s+([\w, ()]+))')
    matches = import_re.findall(content)
    imports = []
    for match in matches:
        module_names = match[0] if match[0] else match[1]
        # A leading dot marks a relative import; skip it.
        if not module_names.startswith('.'):
            module_names = module_names.split(',')
            for module_name in module_names:
                module_name = module_name.strip().split(' as ')[0].split('.')[0]
                if module_name not in builtin_modules and not module_name.startswith('_'):
                    imports.append(module_name)
    return imports


def get_package_imports():
    """Map each installed distribution to the imports found in its source files."""
    package_imports = {}
    dists = importlib.metadata.distributions()
    for dist in dists:
        package_name = dist.metadata['Name']
        try:
            # Assumes the import name equals the distribution name; when it does
            # not, the lookup fails and the package is skipped.
            package_dir = importlib.resources.files(package_name)
            if package_dir is not None:
                package_imports[package_name] = {}
                for root, dirs, files in os.walk(str(package_dir)):
                    for file in files:
                        if file.endswith('.py'):
                            file_path = os.path.join(root, file)
                            imports = parse_imports(file_path)
                            imports = list(set(imports))
                            if package_name in imports:
                                imports.remove(package_name)
                            package_imports[package_name][file_path] = imports
        except Exception:
            pass
    return package_imports


package_imports = get_package_imports()
json_data = json.dumps(package_imports, indent=4)
print(json_data)

# packages.json holds the output of `pipdeptree -j`.
with open('packages.json', 'r') as file:
    package_data = json.load(file)

for package in package_data:
    package_name = package['package']['package_name']
    if package_name in package_imports:
        dependencies = {dep['package_name'] for dep in package['dependencies']}
        for file_path, imports in package_imports[package_name].items():
            for import_name in imports:
                if import_name not in dependencies:
                    print(f'In package {package_name}, file {file_path} imports {import_name} which is not in dependencies.')
                else:
                    print(f'In package {package_name}, file {file_path} imports {import_name} is found in pipdeptree.')
</code>
Further analysis reveals additional problems:
The import scan does not reliably distinguish relative imports (e.g., from .ABC import DEF) from absolute imports, causing false positives.
Package names may not match module names (e.g., pycryptodome provides Crypto).
Optional dependencies and test‑only dependencies appear as missing but are harmless.
Upstream packages sometimes omit required dependencies (e.g., urllib3 missing brotli, google, etc.).
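One way to reduce the relative-import false positives, offered as an alternative sketch to the regex scan in the demo, is to walk the syntax tree with the standard ast module. Each from-import node carries a level attribute that is greater than zero for relative imports. The helper name absolute_imports and the sample source are hypothetical.

```python
import ast


def absolute_imports(source):
    """Return the top-level module names that `source` imports absolutely."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 marks a relative import such as `from .ABC import DEF`.
            if node.level == 0 and node.module:
                found.add(node.module.split(".")[0])
    return sorted(found)


sample = '''
import os.path
import json as j
from .ABC import DEF
from . import sibling
from Crypto.Cipher import AES
"""import fake_in_docstring"""
'''

print(absolute_imports(sample))
```

Because the parser works on real syntax, imports mentioned inside strings or comments are ignored without any pre-cleaning. Standard-library names would still need to be filtered out afterwards; on Python 3.10 and later, sys.stdlib_module_names could serve that purpose in place of the directory walk.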
Advanced Refactor – Replacing Deprecated APIs
The pkg_resources API is deprecated. We replace it with importlib.metadata and packaging while preserving functionality:
Replace DistInfoDistribution with importlib.metadata.Distribution.
Replace the pkg_resources Requirement with packaging.requirements.Requirement.
Implement local_only and user_only logic using sys.prefix, sys.base_prefix, site.getsitepackages(), and site.getusersitepackages().
Handle direct_url.json for editable installs and retrieve the source location.
Adapt version specifiers using packaging.specifiers.SpecifierSet.
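To make the first two replacements concrete, here is a minimal sketch of reading dependencies through importlib.metadata. The refactor itself relies on packaging.requirements.Requirement for parsing; to stay within the standard library, this sketch substitutes a simplified regular expression that only extracts the project name and checks for an extra marker, so it is not a full PEP 508 parser. The names parse_requirement and installed_dependency_map are hypothetical.

```python
import importlib.metadata
import re

# Simplified assumption: a requirement string starts with the project name.
_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def parse_requirement(req):
    """Return (name, is_optional) for one requirement string, or None if unparsable."""
    match = _NAME_RE.match(req)
    if match is None:
        return None
    # Everything after ';' is the environment marker, e.g. extra == 'brotli'.
    _, _, marker = req.partition(";")
    is_optional = bool(re.search(r"\bextra\s*==", marker))
    return match.group(1), is_optional


def installed_dependency_map():
    """Map each installed distribution name to its parsed requirements."""
    result = {}
    for dist in importlib.metadata.distributions():
        name = dist.metadata.get("Name")
        if not name:
            continue
        # Distribution.requires is None when no requirements are declared.
        parsed = [parse_requirement(r) for r in (dist.requires or [])]
        result[name] = [p for p in parsed if p is not None]
    return result


samples = [
    "cryptography>=1.1.0",
    "PyJWT<3,>=1.0.0",
    "brotli>=1.0.9; extra == 'brotli'",
]
print([parse_requirement(s) for s in samples])
```

Flagging requirements that carry an extra marker is one way to separate optional dependencies from hard ones when comparing against the import scan.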
Key code snippets for the new implementation:
<code>import site
import sys
from importlib.metadata import distributions


def iter_distributions(local_only=False, user_only=False):
    # Inside a virtual environment, sys.prefix differs from sys.base_prefix.
    if local_only and sys.prefix != sys.base_prefix:
        paths = site.getsitepackages([sys.prefix])
        return list(distributions(path=paths))
    if user_only:
        return list(distributions(path=[site.getusersitepackages()]))
    return list(distributions())
</code>
We also provide compatibility shims for attributes like key, project_name, and editable that were present in the old API.
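A shim of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the code that was merged: the class name DistShim is hypothetical, the key uses PEP 503 style normalization (which may differ in detail from the old pkg_resources rules), and editable is read from the dir_info.editable field that PEP 610 defines for direct_url.json.

```python
import json
import re
import tempfile
from importlib.metadata import Distribution
from pathlib import Path


class DistShim:
    """Hypothetical wrapper exposing a few pkg_resources-style attributes."""

    def __init__(self, dist):
        self._dist = dist

    @property
    def project_name(self):
        return self._dist.metadata["Name"]

    @property
    def key(self):
        # Assumption: PEP 503 style normalization (lower-case, runs of -_. become -).
        return re.sub(r"[-_.]+", "-", self.project_name).lower()

    @property
    def version(self):
        return self._dist.version

    @property
    def editable(self):
        # read_text returns None when the file is absent from the dist-info directory.
        raw = self._dist.read_text("direct_url.json")
        if raw is None:
            return False
        return bool(json.loads(raw).get("dir_info", {}).get("editable", False))


# Build a throwaway dist-info directory so the shim can be exercised offline.
with tempfile.TemporaryDirectory() as tmp:
    info = Path(tmp) / "Demo_Pkg-1.0.dist-info"
    info.mkdir()
    (info / "METADATA").write_text(
        "Metadata-Version: 2.1\nName: Demo_Pkg\nVersion: 1.0\n"
    )
    (info / "direct_url.json").write_text(
        json.dumps({"url": "file:///src/demo", "dir_info": {"editable": True}})
    )
    shim = DistShim(Distribution.at(info))
    demo = (shim.project_name, shim.key, shim.version, shim.editable)

print(demo)
```

The url field of the same file is where the source location of an editable install can be read from.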
Testing Adjustments
Tests were updated to include the new packaging dependency, mock editable installs using MagicMock, and simulate virtual environments by monkey‑patching sys.prefix and command‑line arguments.
<code>import subprocess
import sys

import virtualenv

from pipdeptree.__main__ import main  # assumed import path for the CLI entry point


def test_local_only(tmp_path, monkeypatch, capfd):
    # Create a real virtual environment and install one package into it.
    prefix = str(tmp_path / 'venv')
    result = virtualenv.cli_run([prefix, '--activators', ''])
    pip_path = str(result.creator.exe.parent / 'pip')
    subprocess.run([pip_path, 'install', 'wrapt', '--prefix', prefix], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    cmd = [str(result.creator.exe.parent / 'python3'), '--local-only']
    # sys.prefix must be a string path, not a list.
    monkeypatch.setattr(sys, 'prefix', prefix)
    monkeypatch.setattr(sys, 'argv', cmd)
    main()
    out, _ = capfd.readouterr()
    found = {i.split('==')[0] for i in out.splitlines()}
    expected = {'wrapt', 'pip', 'setuptools', 'wheel'}
    assert found == expected
</code>
Conclusion
After completing the core refactor, the updated pipdeptree uses modern, non‑deprecated APIs, correctly handles virtual environments, optional dependencies, and editable installs, and passes the extended test suite. The work was reviewed and approved by the upstream maintainers, leading to an invitation to become a maintainer of the project.
Tencent Architect
We share technical insights on storage, computing, and access, and explore industry-leading product technologies together.