Crawling and Downloading Thousands of Images from Sogou Using Java
This article explains how to programmatically fetch and save thousands of images from Sogou by analyzing the XHR request parameters, constructing the appropriate URL, extracting image URLs from the JSON response, and using a multithreaded Java downloader with custom HTTP utilities.
Purpose: Retrieve and locally store thousands of images returned by Sogou Image Search for the query 美女 (URL-encoded as %E7%BE%8E%E5%A5%B3).
Preparation: The target URL is https://pic.sogou.com/pics?query=%E7%BE%8E%E5%A5%B3 . Open the page, open the browser's DevTools (Network → XHR), and scroll down. Each scroll that loads more images sends a request of the following form:
https://pic.sogou.com/napi/pc/searchList?mode=1&start=48&xml_len=48&query=%E7%BE%8E%E5%A5%B3
Key parameters: start (starting index), xml_len (number of images per request), and query (search keyword, URL‑encoded).
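The three parameters can be assembled into request URLs as in the following minimal sketch. The class and method names (SogouUrlBuilder, buildUrl, pageUrls) are hypothetical and do not appear in the article's code; only the endpoint and parameter names come from the captured request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the article's classes.
final class SogouUrlBuilder {
    // Endpoint and parameter names as observed in the captured XHR request.
    private static final String TEMPLATE =
            "https://pic.sogou.com/napi/pc/searchList?mode=1&start=%d&xml_len=%d&query=%s";

    /** Builds one request URL; the keyword is percent-encoded as UTF-8. */
    static String buildUrl(int start, int xmlLen, String keyword) {
        try {
            String encoded = URLEncoder.encode(keyword, "UTF-8");
            return String.format(Locale.ROOT, TEMPLATE, start, xmlLen, encoded);
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    /** Builds the URLs needed to cover `total` results in batches of `xmlLen`. */
    static List<String> pageUrls(String keyword, int start, int xmlLen, int total) {
        List<String> urls = new ArrayList<>();
        for (int i = start; i < start + total; i += xmlLen) {
            urls.add(buildUrl(i, xmlLen, keyword));
        }
        return urls;
    }
}
```

Calling buildUrl(48, 48, "美女") reproduces the request URL shown above, since URLEncoder emits the same percent-encoded form of the keyword.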
Analysis: The JSON response contains the desired image URLs in the picUrl field.
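To illustrate the extraction step without any external dependency, the sketch below pulls picUrl values out of a response body with a regular expression. This is a swapped-in technique for illustration only: the article's own code parses the response with fastjson and reads data → items → picUrl. The class name PicUrlExtractor is hypothetical, and the pattern assumes picUrl values contain no embedded double quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a regex stand-in for a real JSON parser.
final class PicUrlExtractor {
    // Matches "picUrl":"<value>", assuming the value contains no double quotes.
    private static final Pattern PIC_URL =
            Pattern.compile("\"picUrl\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns every picUrl value found in the JSON text, in order of appearance. */
    static List<String> extract(String json) {
        List<String> urls = new ArrayList<>();
        Matcher m = PIC_URL.matcher(json);
        while (m.find()) {
            // JSON allows forward slashes to be escaped as \/ ; undo that escape.
            urls.add(m.group(1).replace("\\/", "/"));
        }
        return urls;
    }
}
```

A real JSON parser is the more robust choice for production use, because it handles all escape sequences and nested structures; the regex version only shows which field carries the image address.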
Approach: The workflow consists of four steps:
Configure request parameters.
Fetch the URL and parse the JSON to collect picUrl values.
Accumulate URLs in a list.
Iterate the list with a thread pool to download each image to a local directory.
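The final step, fanning the URL list out over a thread pool, can be sketched as follows. This is not the article's processSync method, whose body is not shown; the class name, the Predicate-based downloader parameter, and the ten-minute wait are assumptions chosen so the concurrency logic can be read (and exercised) without network access.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical sketch of step 4; the downloader is passed in as a function.
final class ConcurrentDownloadSketch {
    /**
     * Applies `downloader` to every URL on a fixed-size pool.
     * Returns a two-element array: {successes, failures}.
     */
    static int[] downloadAll(List<String> urls, Predicate<String> downloader, int threads) {
        AtomicInteger ok = new AtomicInteger();
        AtomicInteger failed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            pool.execute(() -> {
                try {
                    if (downloader.test(url)) {
                        ok.incrementAndGet();
                    } else {
                        failed.incrementAndGet();
                    }
                } catch (RuntimeException e) {
                    // A throwing downloader counts as a failed download.
                    failed.incrementAndGet();
                }
            });
        }
        pool.shutdown(); // accept no new tasks; queued tasks still run
        try {
            // Assumed upper bound on total run time.
            if (!pool.awaitTermination(10, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return new int[] {ok.get(), failed.get()};
    }
}
```

Counting with AtomicInteger mirrors the suc and fails counters in the article's pipeline class and keeps the tallies correct when several worker threads update them at once.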
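Code: The core implementation is provided in two Java classes: SougouImgProcessor, which requests the search API and collects image URLs, and SougouImgPipeline, which downloads the images.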
import com.alibaba.fastjson.JSONObject;
import us.codecraft.webmagic.utils.HttpClientUtils;
import victor.chang.crawler.pipeline.SougouImgPipeline;
import java.util.ArrayList;
import java.util.List;
/**
* A simple PageProcessor.
*/
public class SougouImgProcessor {
    private String url;                 // URL template with start, xml_len, and query placeholders
    private SougouImgPipeline pipeline; // downloads the collected images
    private List<JSONObject> dataList;  // raw items from every response
    private List<String> urlList;       // collected picUrl values
    private String word;                // search keyword

    public SougouImgProcessor(String url, String word) {
        this.url = url;
        this.word = word;
        this.pipeline = new SougouImgPipeline();
        this.dataList = new ArrayList<>();
        this.urlList = new ArrayList<>();
    }

    // Request one batch of results and collect the picUrl of every item.
    public void process(int idx, int size) {
        String res = HttpClientUtils.get(String.format(this.url, idx, size, this.word));
        JSONObject object = JSONObject.parseObject(res);
        // The image entries live under data -> items.
        List<JSONObject> items = (List<JSONObject>) ((JSONObject) object.get("data")).get("items");
        for (JSONObject item : items) {
            this.urlList.add(item.getString("picUrl"));
        }
        this.dataList.addAll(items);
    }

    // download
    public void pipelineData() {
        // multithread
        pipeline.processSync(this.urlList, this.word);
    }

    public static void main(String[] args) {
        String url = "https://pic.sogou.com/napi/pc/searchList?mode=1&start=%s&xml_len=%s&query=%s";
        SougouImgProcessor processor = new SougouImgProcessor(url, "美女");
        int start = 0, size = 50, limit = 1000; // start index, batch size, total
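        // The source is truncated after "for(int i=start;i". The loop condition, step, and the
        // remaining lines of main() are an assumed reconstruction from the variables and methods above.
        for (int i = start; i < start + limit; i += size) {
            processor.process(i, size);
        }
        processor.pipelineData();
    }
}
import java.io.File;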
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
/**
* Store results in files.
*/
public class SougouImgPipeline {
    private String extension = ".jpg";    // file extension for saved images
    private String path;                  // local output directory
    private volatile AtomicInteger suc;   // number of successful downloads
    private volatile AtomicInteger fails; // number of failed downloads

    public SougouImgPipeline() {
        setPath("E:/pipeline/sougou");
        suc = new AtomicInteger();
        fails = new AtomicInteger();
    }
    // ... (methods for downloadImg, process, processSync, etc.)
}
A single run may not download every image, because some requests fail on network errors; running the program again increases the share of images that are downloaded successfully.
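Since the failures are attributed to transient network errors, one option is to retry each image a bounded number of times within a single run instead of rerunning the whole program. The sketch below is an assumption about how that could look, not the article's downloadImg method (which is not shown). The class name, timeouts, and attempt count are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: download one image, retrying on I/O errors.
final class RetryingDownloader {
    /** Returns true once the image is saved to `target`, false after `maxAttempts` failures. */
    static boolean download(String imageUrl, Path target, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                URLConnection conn = new URL(imageUrl).openConnection();
                conn.setConnectTimeout(5_000);  // assumed timeout, in milliseconds
                conn.setReadTimeout(10_000);    // assumed timeout, in milliseconds
                try (InputStream in = conn.getInputStream()) {
                    Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                }
                return true;
            } catch (IOException e) {
                // Remove any partially written file, then try again.
                deleteQuietly(target);
            }
        }
        return false;
    }

    private static void deleteQuietly(Path file) {
        try {
            Files.deleteIfExists(file);
        } catch (IOException ignored) {
            // Best-effort cleanup; the next attempt overwrites the file anyway.
        }
    }
}
```

A method like this could serve as the per-URL task submitted to the thread pool, with the boolean result feeding the success and failure counters.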
Conclusion: By analyzing the Sogou API, extracting image URLs, and employing a multithreaded Java downloader, large‑scale image collection can be automated efficiently.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.