Big Data · 15 min read

E‑commerce Data Scraping: Fundamentals, Tools, Python Scripts, and Challenges

This tutorial covers the fundamentals of e‑commerce web scraping: definitions, tool types, and data categories; step‑by‑step Python scraper construction with Requests, BeautifulSoup, and Selenium; sample code for Amazon, Walmart, and eBay; challenges such as dynamic pages and anti‑scraping measures; and the case for using specialized scraping APIs.

This article introduces the concept of e‑commerce web scraping, describing how data can be extracted from online retail platforms such as Amazon, Walmart, and eBay to support price analysis, review monitoring, market trend identification, and competitor research.

It outlines four main types of scraping tools: custom scripts written in languages like Python or JavaScript, no‑code scraping platforms, dedicated web‑scraping APIs, and browser extensions that collect data directly from the page.

The guide lists the typical data fields that can be harvested, including product details, pricing information, customer reviews, categories, seller information, logistics details, inventory status, and marketing data.

To build a custom scraper, the article recommends a workflow: understand the target page structure with DevTools, select the elements to extract, choose a scraping library, extract the data, clean it, and finally export it as JSON or CSV.
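The workflow above can be sketched end to end on a static HTML snippet. The markup and selectors here are illustrative placeholders (the network-fetch step is omitted), not taken from any real site:

```python
import json
from bs4 import BeautifulSoup

# Illustrative HTML standing in for a fetched product-listing page
html = """
<ul>
  <li class="product"><a href="/p/1">  Laptop A </a><span class="price">$999</span></li>
  <li class="product"><a href="/p/2">Laptop B</a><span class="price">$1,299</span></li>
</ul>
"""

# Steps 1-2: inspect the structure and select the elements to extract
soup = BeautifulSoup(html, "html.parser")
products = []
for item in soup.select("li.product"):
    link = item.select_one("a")
    price = item.select_one(".price")
    # Steps 4-5: extract the raw values and clean them
    products.append({
        "url": link["href"],
        "name": link.get_text(strip=True),
        "price": float(price.get_text(strip=True).lstrip("$").replace(",", "")),
    })

# Step 6: export as JSON
print(json.dumps(products, indent=4))
```

On a real site, the `html` string would come from `requests.get(...).text` or from a Selenium-rendered page, as in the examples below.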

For simple sites, the following Python libraries are sufficient:

pip install requests beautifulsoup4

For pages that rely on JavaScript rendering (e.g., Amazon), Selenium is required:

pip install selenium

Amazon scraping example (searching for laptops):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import json

# Initialize the WebDriver
driver = webdriver.Chrome(service=Service())
driver.get("https://www.amazon.com/")

# Type the query and submit the search
search_input_element = driver.find_element(By.ID, "twotabsearchtextbox")
search_input_element.send_keys("laptop")
search_button_element = driver.find_element(By.ID, "nav-search-submit-button")
search_button_element.click()

# Wait for the result list to render before extracting anything
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '[role="listitem"][data-asin]'))
)

products = []
product_elements = driver.find_elements(By.CSS_SELECTOR, '[role="listitem"][data-asin]')
for product_element in product_elements:
    url_element = product_element.find_element(By.CSS_SELECTOR, ".a-link-normal")
    url = url_element.get_attribute("href")
    name_element = product_element.find_element(By.CSS_SELECTOR, "h2")
    name = name_element.text
    image_element = product_element.find_element(By.CSS_SELECTOR, "img[data-image-load]")
    image = image_element.get_attribute("src")
    products.append({"url": url, "name": name, "image": image})

driver.quit()

with open("products.json", "w", encoding="utf-8") as json_file:
    json.dump(products, json_file, indent=4)

Walmart scraping example (searching for keyboards):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import json

# Initialize the WebDriver and open the search results page directly
driver = webdriver.Chrome(service=Service())
driver.get("https://www.walmart.com/search?q=keyboard")

# Wait for the product carousel to render before extracting anything
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.carousel-4[data-testid="carousel-container"] li'))
)

products = []
product_elements = driver.find_elements(By.CSS_SELECTOR, '.carousel-4[data-testid="carousel-container"] li')
for product_element in product_elements:
    url_element = product_element.find_element(By.CSS_SELECTOR, "a")
    url = url_element.get_attribute("href")
    name_element = product_element.find_element(By.CSS_SELECTOR, "h3")
    name = name_element.get_attribute("innerText")
    image_element = product_element.find_element(By.CSS_SELECTOR, 'img[data-testid="productTileImage"]')
    image = image_element.get_attribute("src")
    products.append({"url": url, "name": name, "image": image})

driver.quit()

with open("products.json", "w", encoding="utf-8") as json_file:
    json.dump(products, json_file, indent=4)

eBay scraping example (searching for mice):

import requests
from bs4 import BeautifulSoup
import json

url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=mouse&_sacat=0"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"}
response = requests.get(url, headers=headers)
response.raise_for_status()  # fail fast if the request was blocked or errored

soup = BeautifulSoup(response.text, "html.parser")
products = []
product_elements = soup.select("li.s-item")
for product_element in product_elements:
    url_element = product_element.select_one("a[data-interactions]")
    url = url_element["href"]
    name_element = product_element.select_one('[role="heading"]')
    name = name_element.text
    image_element = product_element.select_one("img")
    image = image_element["src"]
    products.append({"url": url, "name": name, "image": image})

with open("products.json", "w", encoding="utf-8") as json_file:
    json.dump(products, json_file, indent=4)
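The workflow also mentions CSV as an export format. The same product dictionaries collected by any of the scrapers above can be written with Python's standard-library csv module (the example records below are placeholders):

```python
import csv

# Example records shaped like those collected by the scrapers above
products = [
    {"url": "https://example.com/p/1", "name": "Wireless Mouse", "image": "https://example.com/1.jpg"},
    {"url": "https://example.com/p/2", "name": "Ergonomic Mouse", "image": "https://example.com/2.jpg"},
]

with open("products.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["url", "name", "image"])
    writer.writeheader()        # column header row
    writer.writerows(products)  # one row per product dictionary
```

`newline=""` is required when passing a text file to the csv module so that row endings are not doubled on Windows.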

The article then discusses common challenges such as frequently changing page structures, diverse product page layouts, dynamic pricing, and anti‑scraping mechanisms like CAPTCHAs, which can block automated requests.
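One common way to cope with frequently changing structures and diverse layouts is to try several candidate selectors per field and take the first that matches. A minimal sketch with BeautifulSoup, where both layouts and selectors are hypothetical:

```python
from bs4 import BeautifulSoup

# Two hypothetical layouts for the same kind of product card
old_layout = '<div class="card"><h2 class="title">Mouse A</h2></div>'
new_layout = '<div class="card"><span data-testid="product-name">Mouse B</span></div>'

def first_match(element, selectors):
    """Return the text of the first selector that matches, else None."""
    for selector in selectors:
        node = element.select_one(selector)
        if node is not None:
            return node.get_text(strip=True)
    return None

# Ordered from most to least preferred selector
NAME_SELECTORS = ["h2.title", "[data-testid='product-name']"]

for html in (old_layout, new_layout):
    card = BeautifulSoup(html, "html.parser").select_one(".card")
    print(first_match(card, NAME_SELECTORS))
```

Returning `None` instead of raising lets the scraper log and skip a changed card rather than crash mid-run; it does not help against CAPTCHAs or IP blocking, which the techniques below address.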

To overcome these issues, it suggests advanced techniques (e.g., Python CAPTCHA bypass, Playwright Stealth) and ultimately recommends using dedicated e‑commerce scraping APIs—particularly Bright Data’s API—which handle anti‑scraping, proxy management, and data cleaning for platforms including Amazon, Walmart, eBay, Lazada, and Shein.

In conclusion, while custom scripts provide full control, they require ongoing maintenance and technical expertise; using a specialized scraping API offers a more reliable and scalable solution for extracting e‑commerce data.

Tags: e‑commerce, Python, Data Extraction, web scraping, Selenium, BeautifulSoup, Bright Data
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
