
Engineering Practices for a Billion‑Scale Image Asset Platform

This article recounts how the author built a billion‑scale AI image‑asset library, replacing a week‑long import with a clustered‑table sharded pipeline, MD5‑based unique keys, a custom DataWorks task scheduler, and multi‑engine query layers, and shares the practical engineering lessons learned through successive iterations.

DaTaobao Tech

The author participated in the design and development of an industrial AI image‑asset library that handles tens of billions of images. Many pitfalls were encountered during platform construction; this article shares the engineering practices that emerged.

Initially, the import relied on DataWorks node scheduling, which could not complete a billion‑scale import within 24 hours; in practice, runs often took more than a week. A new import pipeline was designed to overcome this limitation.

1. Clustered tables for parallelism – MaxCompute supports the clustered by | range clustered by syntax. Hash‑clustered tables distribute rows into buckets by hashing the clustering key; for bucketed joins to be optimized, the two tables' bucket counts should be equal or integer multiples of each other (e.g., 512 and 1024). Keeping bucket counts consistent across source and target tables is recommended.
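
Why "equal or multiple" bucket counts help can be seen with a small sketch. The hash function below is illustrative (MaxCompute uses its own internal hashing): when one bucket count divides the other, a key's bucket in the larger table is determined modulo the smaller count, so buckets can be matched pairwise without a full shuffle.

```python
import hashlib

def bucket_of(key: str, num_buckets: int) -> int:
    """Assign a row to a bucket by hashing its clustering key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Rows in bucket k of a 512-bucket table can only meet rows from
# buckets k and k + 512 of a 1024-bucket table.
for key in (f"img_{i}" for i in range(1000)):
    assert bucket_of(key, 1024) % 512 == bucket_of(key, 512)
```
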

2. Data sharding for fault tolerance – Even with clustered tables, a 100 M‑image import can degrade. The workflow splits the job into many shards (e.g., 38 shards for 150 M images, each handling ~1 M images). Benefits include: only failed shards need to be re‑run, controllable parallelism, and finer‑grained resource utilization.
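
A minimal sketch of the sharding idea follows; the shard size, the `run_shard` callback, and the retry policy are illustrative assumptions, not the platform's actual code.

```python
SHARD_SIZE = 1_000_000  # assumed per-shard batch size

def make_shards(total: int, shard_size: int = SHARD_SIZE):
    """Split the id range [0, total) into (start, end) shards."""
    return [(start, min(start + shard_size, total))
            for start in range(0, total, shard_size)]

def run_import(total: int, run_shard, max_retries: int = 3) -> bool:
    """Run every shard independently; re-run only the shards that failed."""
    pending = make_shards(total)
    for _ in range(max_retries):
        failed = [s for s in pending if not run_shard(*s)]
        if not failed:
            return True
        pending = failed  # fault tolerance: retry only the failed shards
    return False

print(len(make_shards(5_000_000)))  # 5 shards of 1M ids each
```

Because each shard is an independent unit of work, parallelism can be capped to match available resources, and a single bad shard never forces the whole import to restart.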

3. Image‑key generation – Requirements: global uniqueness and an extremely low collision probability. Four methods were evaluated (perceptual hash, average hash, difference hash, and MD5), weighing the advantages and drawbacks of each. The final solution uses MD5 as the unique primary key and stores an image fingerprint only when needed.
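
Assuming the MD5 is computed over the raw image bytes (a natural reading of the scheme, though the article does not spell it out), key generation is a one‑liner: identical binary content always maps to the same key, which also makes re-imports idempotent.

```python
import hashlib

def image_key(image_bytes: bytes) -> str:
    """Content-derived key: 32 hex chars, stable across re-imports."""
    return hashlib.md5(image_bytes).hexdigest()

key = image_key(b"\x89PNG\r\n\x1a\n...")  # fake bytes for illustration
print(len(key))  # 32
```

Unlike the perceptual hashes, MD5 changes completely on any byte-level edit, which is exactly why it works as a primary key but not as a similarity fingerprint.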

4. Custom task scheduling framework – Built on the DataWorks open API. Core concepts: a Task is the logical definition of a job (preparation, sharding, key generation, upload, attribute writes); a Trigger and its trigger records initiate task instances; a Task instance is the smallest executable unit (an ODPS node, SQL, or Java job) with dependencies, grouping, and status.
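
These concepts can be sketched as plain Python objects. All names, fields, and the in-memory scheduling loop are illustrative assumptions, not the DataWorks API or the author's framework.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    SUCCESS = "success"

@dataclass
class TaskInstance:
    """Smallest executable unit (e.g. an ODPS SQL node or Java job)."""
    name: str
    depends_on: list = field(default_factory=list)
    status: Status = Status.PENDING

    def ready(self, done: set) -> bool:
        # Runnable once every dependency has succeeded.
        return all(dep in done for dep in self.depends_on)

def schedule(instances):
    """Execute instances in dependency order (simple topological pass)."""
    done, order, pending = set(), [], list(instances)
    while pending:
        runnable = [i for i in pending if i.ready(done)]
        if not runnable:
            raise RuntimeError("cyclic or unsatisfiable dependencies")
        for inst in runnable:
            inst.status = Status.SUCCESS  # stand-in for real execution
            done.add(inst.name)
            order.append(inst.name)
            pending.remove(inst)
    return order

steps = [
    TaskInstance("prepare"),
    TaskInstance("shard", depends_on=["prepare"]),
    TaskInstance("gen_keys", depends_on=["shard"]),
    TaskInstance("upload", depends_on=["gen_keys"]),
    TaskInstance("write_attrs", depends_on=["upload"]),
]
print(schedule(steps))
# ['prepare', 'shard', 'gen_keys', 'upload', 'write_attrs']
```
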

5. Output capabilities – Online queries use OpenSearch (vector search), Holo (Hologres; PostgreSQL‑compatible fast indexing), and MySQL (primary‑key lookup). Offline queries leverage MaxCompute parameterized VIEWs, which accept tables or variables as parameters, enabling reusable SQL logic.
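
The reuse that a parameterized VIEW provides, applying one piece of query logic to whichever table is passed in, can be mimicked with a SQL template function. The table and column names below are illustrative, not the platform's real schema or MaxCompute's VIEW syntax.

```python
def dedup_by_md5_sql(source_table: str) -> str:
    """Render the shared 'one row per image key' query for any table."""
    return (
        "SELECT md5_key, MIN(url) AS url "
        f"FROM {source_table} "
        "GROUP BY md5_key"
    )

# The same logic, reused against two hypothetical tables:
print(dedup_by_md5_sql("image_assets_2023"))
print(dedup_by_md5_sql("image_assets_2024"))
```

A real parameterized VIEW keeps this reuse inside the SQL engine itself, so callers pass a table name instead of copy-pasting the query body.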

6. Image‑hash implementations (Python)

import cv2
import numpy as np

# Perceptual hash

def pic_p_hash(img, hash_size=32):
    # Resize, convert to grayscale, and take the 2-D DCT.
    img = cv2.resize(img, (hash_size, hash_size))
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    dct = cv2.dct(np.float32(gray))
    dct = dct[:hash_size, :hash_size]
    # Each bit records whether a DCT coefficient exceeds the mean.
    avg = np.mean(dct)
    phash = (dct > avg).astype(int).flatten()
    phash_str = ''.join(str(x) for x in phash)
    # hash_size * hash_size bits -> fixed-width hex string.
    phash_hex = hex(int(phash_str, 2))[2:].zfill(hash_size * hash_size // 4)
    return phash_hex

# Average hash

def pic_avg_hash(img):
    img = cv2.resize(img, (8, 8), interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    s = np.sum(gray)
    avg = s / 64
    hash_str = ''.join(['1' if gray[i, j] > avg else '0' for i in range(8) for j in range(8)])
    return hash_str

# Difference hash

def pic_dif_hash(img):
    img = cv2.resize(img, (9, 8), interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    hash_str = ''.join(['1' if gray[i, j] > gray[i, j+1] else '0' for i in range(8) for j in range(8)])
    return hash_str

# Hash comparison

def hash_cmp(hash1, hash2):
    if len(hash1) != len(hash2):
        return -1
    return sum(ch1 != ch2 for ch1, ch2 in zip(hash1, hash2))

if __name__ == '__main__':
    img1 = cv2.imread('a.jpeg')
    img2 = cv2.imread('b.jpeg')
    imgHash1 = pic_p_hash(img1, 32)
    imgHash2 = pic_p_hash(img2, 32)
    print(hash_cmp(imgHash1, imgHash2))

The engineering journey involved two major version iterations. Each iteration addressed new bottlenecks that emerged as data volume grew. Practical design required balancing forward‑looking architecture with avoiding over‑design, making trade‑offs based on deep system and business understanding.

Written by DaTaobao Tech, the official account of DaTaobao Technology.