
Extending PyODPS with PAI‑Designer for Dynamic Offline Data Processing

By integrating PAI‑Designer with PyODPS, users can build visual offline workflows that work around ODPS's lack of network access, dynamic configuration, and image processing. The approach combines reusable Python components, OSS RoleARN authorization, remote configuration fetching, and custom Docker images to read and write MaxCompute and OSS data.

DaTaobao Tech

This article continues the previous introduction to PyODPS and demonstrates how PAI‑Designer works around ODPS limitations: no network access, no dynamic configuration, and no image‑processing capabilities. By integrating PAI‑Designer, users can create flexible offline workflows that combine data handling, OSS file operations, and custom Python scripts.

Background

After a year of using PyODPS, the author still faces three main issues: ODPS cannot access network resources, cannot upload files to OSS while processing rows, and cannot incorporate image‑processing or algorithmic capabilities.

Solution Overview

PAI‑Designer provides a visual workflow where each component can run a Python script with configurable inputs/outputs. The workflow includes:

Creating a workflow in PAI‑Designer.

Configuring OSS RoleARN for data access.

Writing reusable Python script templates that parse ODPS URLs, handle arguments, and read/write both OSS and MaxCompute tables.

Key Python Script Template

"""Python component script example."""
import os
import argparse
import json

ENV_JOB_MAX_COMPUTE_EXECUTION = "JOB_MAX_COMPUTE_EXECUTION"

def init_odps():
    """Initialize an ODPS instance for reading/writing MaxCompute data."""
    from odps import ODPS
    mc_execution = json.loads(os.environ[ENV_JOB_MAX_COMPUTE_EXECUTION])
    return ODPS(
        access_id="",  # AccessKey ID (redacted in the original)
        secret_access_key="",  # AccessKey secret (redacted in the original)
        endpoint=mc_execution["endpoint"],
        project=mc_execution["odpsProject"],
    )

def parse_odps_url(table_uri):
    """Parse a MaxCompute table URI and return (project, table, partition)."""
    from urllib import parse
    parsed = parse.urlparse(table_uri)
    project_name = parsed.hostname
    parts = parsed.path.split("/", 3)
    table_name = parts[2]
    partition = parts[3] if len(parts) > 3 else None
    return project_name, table_name, partition

def parse_args():
    parser = argparse.ArgumentParser(description="Python component script example.")
    parser.add_argument("--input1", type=str, default=None, help="Component input port 1.")
    parser.add_argument("--output1", type=str, default=None, help="Output OSS port 1.")
    # additional inputs/outputs omitted for brevity
    args, _ = parser.parse_known_args()
    return args

if __name__ == "__main__":
    args = parse_args()
    print(f"Input1={args.input1}")
    print(f"Output1={args.output1}")
    # write_table_example(args)
    # write_output1(args)

This template shows how to initialize ODPS, parse table URIs, and handle component arguments. The script can be extended to read data, perform custom aggregation, and write results back to OSS or MaxCompute.
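For instance, the template could be extended along the following lines. This is a minimal sketch: the URI scheme matches the `parse_odps_url` helper above, while the function names, the row-count aggregation, and the result schema are illustrative assumptions, not code from the original workflow. The `o` argument would come from `init_odps()`, and the table URI would arrive via `--input1`.

```python
from urllib import parse

def parse_table_uri(table_uri):
    """Same odps://project/tables/name[/partition] scheme as the template."""
    parsed = parse.urlparse(table_uri)
    parts = parsed.path.split("/", 3)
    return parsed.hostname, parts[2], parts[3] if len(parts) > 3 else None

def count_rows_to_table(o, table_uri, result_table):
    """Read the table behind an input port, aggregate it (here: a plain
    row count), and write the result back to a new MaxCompute table."""
    project, table, partition = parse_table_uri(table_uri)
    with o.get_table(table, project=project).open_reader(partition=partition) as reader:
        num_rows = reader.count
    # Recreate the result table, then write a single aggregate record.
    o.delete_table(result_table, if_exists=True)
    t = o.create_table(result_table, "num_rows bigint")
    with t.open_writer() as writer:
        writer.write([[num_rows]])
```

The same pattern generalizes to per-row transformations: iterate over `reader`, accumulate, and batch the results through `open_writer`.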

Reading Online Configuration

import requests, os, argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Python component script example.")
    parser.add_argument("--output1", type=str, default=None, help="Output OSS port 1.")
    return parser.parse_args()

url = "https://xxx.alicdn.com/fpi/xxxx-data/v1/xxx-config.js?"
response = requests.get(url, timeout=10)
if response.status_code == 200:
    args = parse_args()
    os.makedirs(args.output1, exist_ok=True)
    # Persist the raw config so downstream components can read it from OSS.
    with open(os.path.join(args.output1, "result.txt"), "wb") as f:
        f.write(response.content)
else:
    raise Exception("Failed to fetch the DIY client-side configuration")

The script fetches a remote JSON configuration from a CDN and stores it in an OSS‑mounted directory, enabling downstream components to consume the data.
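A downstream component can then mount the same OSS path as an input port and parse the stored file. This sketch assumes the file keeps the `result.txt` name used above and that the CDN payload is a JSON object, possibly wrapped in JavaScript; the function name is hypothetical:

```python
import json
import os

def load_config(input_dir):
    """Load the configuration written by the upstream fetch component."""
    with open(os.path.join(input_dir, "result.txt"), encoding="utf-8") as f:
        raw = f.read()
    # The CDN serves a .js file, so tolerate an assignment/JSONP wrapper
    # around the JSON object before parsing it.
    start, end = raw.find("{"), raw.rfind("}")
    return json.loads(raw[start:end + 1])
```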

Custom Docker Image for Image Processing

FROM reg.docker.alibaba-inc.com/alibase/alios7u2-min:1.13
COPY ./resource/Python-3.9.18.tar.xz /home/admin/Python-3.9.18.tar.xz
WORKDIR /home/admin
RUN rpm --rebuilddb && yum install -y gcc gcc-c++ automake autoconf libtool make zlib-devel openssl openssl-devel libxslt-devel libxml2-devel
RUN rpm --rebuilddb && yum install -y pcre pcre-devel zlib zlib-devel libffi-devel
# Install Python 3.9
RUN tar xJf Python-3.9.18.tar.xz && \
    cd Python-3.9.18 && ./configure --prefix=/usr/local/python && make && make install && \
    rm -f /usr/bin/python && ln -s /usr/local/python/bin/python3 /usr/bin/python && \
    ln -s /usr/local/python/bin/pip3 /usr/bin/pip
RUN pip install --upgrade pip && pip config set global.index-url https://xxxx.xxxx-xxxx.cn/simple/
RUN pip install "setuptools>=3.0" pyodps pillow requests numpy scipy matplotlib

The Dockerfile builds a custom image containing Python 3.9 and the required libraries (pyodps, pillow, etc.) so that PAI‑Designer can execute image‑processing tasks such as computing edge colors.
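Inside a container built from this image, an "edge color" computation might look like the sketch below. The border-averaging logic and the function name are assumptions for illustration, not the article's actual algorithm:

```python
from PIL import Image

def edge_mean_color(path, border=1):
    """Average RGB of an image's outer border of `border` pixels."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    px = img.load()
    # Collect the coordinates of the top, bottom, left, and right strips.
    coords = (
        [(x, y) for x in range(w) for y in range(border)] +
        [(x, y) for x in range(w) for y in range(h - border, h)] +
        [(x, y) for x in range(border) for y in range(border, h - border)] +
        [(x, y) for x in range(w - border, w) for y in range(border, h - border)]
    )
    n = len(coords)
    return tuple(sum(px[c][ch] for c in coords) // n for ch in range(3))
```

Because pillow is baked into the image, such a script can run unmodified as a PAI‑Designer component reading images staged on OSS.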

Conclusion

PAI‑Designer extends the capabilities of traditional ODPS by allowing network access, dynamic configuration, and integration of external libraries. With reusable Python script components and custom Docker images, complex data pipelines—including image analysis—can be built efficiently. Users should still consider security and compliance when configuring external resources.

Tags: Docker, Python, Data Processing, MaxCompute, PAI-Designer, PyODPS