Photon: High‑Efficiency Multithreaded Web Crawler – Features, Compatibility, and Usage Guide
Photon is a fast, multithreaded Python web crawler that extracts URLs, files, and various intelligence from targets, offering flexible options, Ninja mode, and extensive command‑line parameters while supporting Linux, Windows, macOS, and Termux environments.
Project URL
https://github.com/s0md3v/Photon
Main Features
Photon provides many options for customized crawling, but its standout capability is high‑speed data extraction through intelligent multithreading.
Data Extraction
By default, Photon extracts the following data:
- URLs (both in-scope and out-of-scope)
- Parameterized URLs (e.g., example.com/gallery.php?id=2)
- Intelligence such as emails, social media accounts, Amazon buckets, etc.
- Files (pdf, png, xml, …)
- JavaScript and other files
- Strings matching custom regular-expression patterns
Extracted information is saved as shown in the diagram below.
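To illustrate the kind of pattern matching this extraction involves, here is a minimal sketch in Python (the helper name and regexes are illustrative assumptions, not Photon's actual internals) that pulls emails and parameterized URLs out of page text:

```python
import re

# Illustrative patterns; Photon's real patterns are more elaborate.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PARAM_URL_RE = re.compile(r"https?://\S+?\?\S+")

def extract_intel(text):
    """Return (emails, parameterized_urls) found in a page's text."""
    emails = EMAIL_RE.findall(text)
    param_urls = PARAM_URL_RE.findall(text)
    return emails, param_urls

page = 'Contact admin@example.com or see http://example.com/gallery.php?id=2 now'
emails, urls = extract_intel(page)
```

Running a set of such patterns over every fetched page is what yields the intelligence listed above.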
Intelligent Multithreading
Unlike many tools that misuse threads, Photon assigns distinct work lists to each thread, avoiding contention and maximizing throughput.
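The idea can be sketched as follows (assumed names, not Photon's actual implementation): rather than having all threads compete over one shared queue, the pending URLs are split into disjoint chunks, one per thread:

```python
from concurrent.futures import ThreadPoolExecutor

def split_work(items, num_threads):
    """Split items into num_threads disjoint work lists (round-robin)."""
    return [items[i::num_threads] for i in range(num_threads)]

def process(chunk):
    # Stand-in for a thread crawling its own private list of URLs.
    return [url.lower() for url in chunk]

urls = ['http://A', 'http://B', 'http://C', 'http://D', 'http://E']
chunks = split_work(urls, 2)
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(process, chunks))
```

Because each thread owns its list outright, no locking is needed around the work items themselves.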
Ninja Mode
In Ninja mode, three online services act as proxies alongside your own machine, so requests to the target are distributed across four clients; this spreads the load and reduces the risk of your IP being blocked or connections being reset.
Compatibility & Dependencies
Compatibility
Photon works on Python 2.x and 3.x, though future development may drop Python 2 support.
Operating Systems
Tested on Linux (Arch, Debian, Ubuntu), Termux, Windows 7/10, and macOS. Bugs can be reported on GitHub.
Color Output
ANSI colour codes are not supported on macOS and Windows terminals.
Dependencies
<code>requests
urllib3
argparse</code>
All other required libraries are part of the Python standard library.
How to Use Photon
<code>syntax: photon.py [options]
-u --url target URL
-l --level crawl depth (default 2)
-t --threads number of threads (default 2)
-d --delay delay between requests (seconds)
-c --cookie cookie header
-r --regex custom regex pattern
-s --seeds additional sub‑URLs (comma‑separated)
-e --export export format (e.g., json)
-o --output output directory (default target domain)
--exclude exclude URLs matching regex
--timeout request timeout (seconds, default 5)
--ninja enable Ninja mode
--update check for updates
--dns dump DNS data
--only-urls extract URLs only
--user-agent custom User‑Agent(s) (comma‑separated)
</code>
Single-Site Crawl
<code>python photon.py -u "http://example.com"</code>
Depth Control
<code>python photon.py -u "http://example.com" -l 3</code>
Depth defines how many link levels are followed; depth 2 crawls the homepage and its immediate links.
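The depth semantics can be illustrated with a toy breadth-first walk over an in-memory link graph (hypothetical data and function, not Photon's code):

```python
from collections import deque

def crawl(graph, start, max_depth):
    """Breadth-first walk: depth 1 = start page only, depth 2 = start + its links."""
    seen = {start}
    frontier = deque([(start, 1)])
    order = []
    while frontier:
        url, depth = frontier.popleft()
        order.append(url)
        if depth < max_depth:
            for link in graph.get(url, []):
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))
    return order

links = {'home': ['a', 'b'], 'a': ['c'], 'b': []}
```

With this graph, depth 2 visits 'home', 'a', and 'b'; raising the depth to 3 also reaches 'c'.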
Thread Count
<code>python photon.py -u "http://example.com" -t 10</code>
Increasing threads speeds up crawling but may trigger security mechanisms or overload small sites.
Request Delay
<code>python photon.py -u "http://example.com" -d 2</code>
Specifies a pause (in seconds) between each HTTP(S) request.
Timeout
<code>python photon.py -u "http://example.com" --timeout=4</code>
Sets the maximum wait time for a response before timing out.
Cookies
<code>python photon.py -u "http://example.com" -c "PHPSESSID=u5423d78fqbaju9a0qke25ca87"</code>
Allows sending a Cookie header for sites that require session validation.
Output Directory
<code>python photon.py -u "http://example.com" -o "my_directory"</code>
Results are saved in a folder named after the target domain by default; this option overrides the folder name.
Exclude Specific URLs
<code>python photon.py -u "http://example.com" --exclude="/blog/20(17|18)"</code>
URLs matching the provided regex are omitted from crawling and results.
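The filtering itself amounts to a regex search against each discovered URL; a sketch with an assumed exclude pattern:

```python
import re

def filter_urls(urls, exclude_pattern):
    """Drop URLs matching the exclude regex; keep the rest."""
    exclude = re.compile(exclude_pattern)
    return [u for u in urls if not exclude.search(u)]

found = ['http://example.com/blog/2017/post', 'http://example.com/about']
kept = filter_urls(found, r'/blog/20(17|18)')
```

Here only the /about page survives the filter; both the crawl frontier and the saved results honor the exclusion.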
Specify Sub‑URLs
<code>python photon.py -u "http://example.com" --seeds "http://example.com/blog/2018,http://example.com/portals.html"</code>
Custom seed URLs can be added, separated by commas.
Custom User‑Agents
<code>python photon.py -u "http://example.com" --user-agent "curl/7.35.0,Wget/1.15 (linux-gnu)"</code>
Overrides the default user-agent list without editing the user-agents.txt file.
Custom Regex Pattern
<code>python photon.py -u "http://example.com" --regex "\d{10}"</code>
Extracts strings that match the supplied regular expression during crawling.
Export Results
<code>python photon.py -u "http://example.com" --export=json</code>
Supported export format: json.
Skip Data Extraction
<code>python photon.py -u "http://example.com" --only-urls</code>
Only URLs are collected; files such as JavaScript are ignored.
Update
<code>python photon.py --update</code>
Checks for a newer version, downloads it, and merges updates without overwriting existing files.
Ninja Mode
<code>python photon.py -u "http://example.com" --ninja</code>
Enables the use of three proxy sites to issue requests on your behalf:
codebeautify.org, photopea.com, and pixlr.com
DNS Dump
<code>python photon.py -u http://example.com --dns</code>
Generates an image displaying DNS data for the target domain (sub-domains are not supported).
Source references: kitploit, Covfefe compilation; please credit FreeBuf.COM when republishing.