
Product Middle Platform Workflow Orchestration Engine: Use Cases, Architecture, and High‑Availability Solutions

Tencent’s product middle platform employs a self‑built, stateless workflow orchestration engine—configurable via drag‑and‑drop or DSL—to coordinate massive product processing and audit tasks, using load‑balancing, retry, rate‑limiting, circuit‑breaker and service isolation strategies that ensure high availability, performance, and horizontal scalability on TKE.

Tencent Cloud Developer

May 14, 2024

This article introduces the Tencent Advertising product middle platform and explains how its self-built workflow orchestration engine handles a range of business scenarios while achieving high availability, high performance, and high scalability, improving both development efficiency and system stability.

Use Cases

1. Product processing: The platform manages a catalog of nearly 4 billion products, with daily processing volume exceeding 80 million items. The engine coordinates tasks such as category recognition, attribute extraction, brand identification, tagging, understanding, auditing, image/video transfer, and creative generation.

2. Product audit governance: The engine provides comprehensive audit units (public feature audit, baseline style audit, low-quality style audit, infringement audit, prohibited content audit, etc.) and supports asynchronous human-review callbacks within the workflow.

Why Use a Workflow Orchestration Engine?

Complex business processes involve multiple services that must be coordinated for sequencing, parallelism, timeout handling, retries, circuit breaking, and distributed consistency. An orchestration engine abstracts these concerns, allowing developers to focus on high‑cohesion, low‑coupling service nodes.

Building a Workflow

Two approaches are provided:

• Visual drag-and-drop editing: Users compose tasks (e.g., SCF, Kafka producer, HTTP interface) by dragging nodes in the console. An example shows a workflow with an image-save task followed by an audit task.

• Code-based creation using a DSL: The workflow is defined in a JSON-based DSL. The following example defines a Parallel state whose single branch runs an image-save task and then an audit task, followed by a final DAG end node.

{
  "Comment": "Business A",
  "StartAt": "Parallel",
  "States": {
    "Parallel": {
      "Type": "Parallel",
      "Next": "FinalState",
      "Branches": [
        {
          "StartAt": "ImageSave",
          "States": {
            "ImageSave": {
              "Type": "Task",
              "Comment": "Image transfer",
              "Resource": "resource address; supports HTTP, Kafka, and Serverless protocols",
              "Next": "NewAudit"
            },
            "NewAudit": {
              "Type": "Task",
              "Comment": "Audit workflow",
              "Resource": "resource address; supports HTTP, Kafka, and Serverless protocols",
              "InputPath": "$.inputAswStr",
              "End": true
            }
          }
        }
      ]
    },
    "FinalState": {
      "Type": "Task",
      "Comment": "DAG end node",
      "Resource": "resource address; supports HTTP, Kafka, and Serverless protocols",
      "End": true
    }
  }
}
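To make the control flow of such a definition concrete, here is a minimal sketch of how an interpreter might walk it. This is not the engine's actual implementation: the `run_workflow` function and the `handler` callback are hypothetical, the definition is a trimmed copy of the example above (Comment, Resource, and InputPath omitted), and branches are run sequentially rather than in parallel for simplicity.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical handler type: receives the state name and its input, returns its output.
Handler = Callable[[str, Any], Any]


def run_workflow(definition: Dict[str, Any], handler: Handler, data: Any = None) -> Any:
    """Walk a definition from StartAt until a state marked "End": true."""
    name = definition["StartAt"]
    while True:
        state = definition["States"][name]
        if state["Type"] == "Task":
            data = handler(name, data)
        elif state["Type"] == "Parallel":
            # Sketch only: branches run one after another here. Each branch
            # receives the same input; the outputs are collected into a list.
            data = [run_workflow(branch, handler, data) for branch in state["Branches"]]
        else:
            raise ValueError(f"unsupported state type: {state['Type']}")
        if state.get("End"):
            return data
        name = state["Next"]


# Trimmed copy of the workflow above, keeping only the control-flow fields.
WORKFLOW = json.loads("""
{
  "StartAt": "Parallel",
  "States": {
    "Parallel": {
      "Type": "Parallel",
      "Next": "FinalState",
      "Branches": [
        {
          "StartAt": "ImageSave",
          "States": {
            "ImageSave": {"Type": "Task", "Next": "NewAudit"},
            "NewAudit": {"Type": "Task", "End": true}
          }
        }
      ]
    },
    "FinalState": {"Type": "Task", "End": true}
  }
}
""")

visited: List[str] = []


def record(name: str, data: Any) -> Any:
    """Stand-in for calling the task's Resource: just records the visit order."""
    visited.append(name)
    return data


run_workflow(WORKFLOW, record, {"product_id": 1})
print(visited)  # ['ImageSave', 'NewAudit', 'FinalState']
```

The walk shows that the audit task only starts after the image-save task finishes, and that the final node runs once every branch of the Parallel state has ended.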

Engine Architecture

The engine runs on Tencent Cloud TKE using Docker containers and is stateless, enabling dynamic scaling based on resource pressure.

High‑Availability (Three‑High) Solutions

1. Load‑balancing strategies – round‑robin, least‑connection, hash, random, weighted round‑robin. The scheduler distributes DAG tasks to executors based on these policies and monitors executor health every second.

Executor | Load | Remaining Capacity | Selection Probability
A | a% | 100% - a% | (100 - a) / [(100 - a) + (100 - b) + (100 - c)]
B | b% | 100% - b% | (100 - b) / [(100 - a) + (100 - b) + (100 - c)]
C | c% | 100% - c% | (100 - c) / [(100 - a) + (100 - b) + (100 - c)]
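The selection probabilities above amount to weighted random selection by remaining capacity. The sketch below illustrates that calculation only; the function names and the dictionary of loads are assumptions for illustration, not the scheduler's real interface.

```python
import random
from typing import Dict, Optional


def selection_probabilities(loads: Dict[str, float]) -> Dict[str, float]:
    """Map each executor's load (percent, 0-100) to its selection probability."""
    remaining = {name: 100.0 - load for name, load in loads.items()}
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("no executor has remaining capacity")
    return {name: capacity / total for name, capacity in remaining.items()}


def pick_executor(loads: Dict[str, float], rng: Optional[random.Random] = None) -> str:
    """Pick one executor at random, weighted by remaining capacity."""
    rng = rng or random.Random()
    probs = selection_probabilities(loads)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


# Example loads (hypothetical): A is at 80%, B at 50%, C at 20%.
probs = selection_probabilities({"A": 80, "B": 50, "C": 20})
# Remaining capacities are 20, 50, and 80, so the probabilities are
# A: 20/150, B: 50/150, C: 80/150.
print(probs)
print(pick_executor({"A": 80, "B": 50, "C": 20}))
```

With this policy a heavily loaded executor still receives some work, but proportionally less of it, and a fully loaded executor (100%) is never selected.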

2. Interface retry strategy – detects error codes, decides whether to retry, and configures retry interval, max attempts, and back‑off rate. Example configuration:

{
  "Comment": "Business A",
  "StartAt": "unit_a",
  "States": {
    "unit_a": {
      "Type": "Task",
      "Comment": "Audit unit A",
      "Resource": "resource address; supports HTTP, Kafka, and Serverless protocols",
      "Retry": [
        {
          "ErrorEquals": ["StatesTimeout"],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "End": true
    }
  }
}
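As a rough illustration of what such a rule implies, the sketch below assumes that IntervalSeconds is the wait before the first retry, that each later wait is multiplied by BackoffRate, and that MaxAttempts counts retries rather than total calls. The article does not spell out these semantics, so treat them as assumptions; `TaskError`, `retry_delays`, and `call_with_retry` are hypothetical names.

```python
import time
from typing import Any, Callable, Dict, List


class TaskError(Exception):
    """Hypothetical error type carrying an engine error code."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def retry_delays(rule: Dict[str, Any]) -> List[float]:
    """Wait times before each retry under the assumed back-off semantics."""
    return [rule["IntervalSeconds"] * rule["BackoffRate"] ** i
            for i in range(rule["MaxAttempts"])]


def call_with_retry(task: Callable[[], Any], rules: List[Dict[str, Any]],
                    sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call task, retrying on errors whose code matches a rule's ErrorEquals."""
    used = [0] * len(rules)  # retries consumed so far, per rule
    while True:
        try:
            return task()
        except TaskError as err:
            for i, rule in enumerate(rules):
                if err.code in rule["ErrorEquals"]:
                    break
            else:
                raise  # no rule covers this error code
            if used[i] >= rule["MaxAttempts"]:
                raise  # retries exhausted for this rule
            sleep(rule["IntervalSeconds"] * rule["BackoffRate"] ** used[i])
            used[i] += 1


RULE = {"ErrorEquals": ["StatesTimeout"], "IntervalSeconds": 1,
        "MaxAttempts": 2, "BackoffRate": 2.0}

slept: List[float] = []
outcomes = iter([TaskError("StatesTimeout"), TaskError("StatesTimeout"), "ok"])


def flaky() -> Any:
    """Fails twice with a timeout, then succeeds."""
    item = next(outcomes)
    if isinstance(item, Exception):
        raise item
    return item


result = call_with_retry(flaky, [RULE], sleep=slept.append)
print(result, slept)  # ok [1.0, 2.0]
```

Under these assumptions the configuration above waits 1 second before the first retry and 2 seconds before the second, then gives up if the task times out a third time.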

3. Rate limiting and circuit-breaker – rate limiting prevents overload, while the circuit breaker fails fast on faulty services to avoid cascading failures. The engine also uses message-queue buffering for back-pressure.

4. Service isolation – clusters are physically isolated along hot-spot and user-scenario dimensions, so that failures in one service do not affect others.

Performance and Scalability

The engine has undergone extensive stress testing, with optimizations to storage I/O, large-object handling, concurrency, and component usage. Being stateless and containerized, it supports automatic horizontal scaling on TKE.
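The rate-limiting and circuit-breaker behaviour from item 3 can be sketched with two small classes. The article does not say which algorithms the engine uses, so the token bucket and the consecutive-failure breaker below are common choices picked for illustration; all class names, parameters, and thresholds are assumptions.

```python
import time
from typing import Any, Callable, List, Optional


class TokenBucket:
    """Rate limiter: about `rate` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows a trial call after `cooldown` seconds."""

    def __init__(self, threshold: int, cooldown: float,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], Any]) -> Any:
        if self.opened_at is not None and self.clock() - self.opened_at < self.cooldown:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()  # closed, or half-open trial call after the cooldown
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


# A fake clock keeps the example deterministic.
now = [0.0]
clock = lambda: now[0]

bucket = TokenBucket(rate=1, capacity=2, clock=clock)
burst = [bucket.allow() for _ in range(3)]  # third call exceeds the burst size
now[0] = 1.0
after_refill = bucket.allow()               # one token refilled after 1 second

breaker = CircuitBreaker(threshold=2, cooldown=5, clock=clock)


def failing() -> None:
    raise TimeoutError("downstream timed out")


states: List[str] = []
for _ in range(3):
    try:
        breaker.call(failing)
    except TimeoutError:
        states.append("failed")
    except RuntimeError:
        states.append("rejected")

now[0] = 7.0
recovered = breaker.call(lambda: "ok")      # cooldown elapsed, trial call succeeds
print(burst, after_refill, states, recovered)
```

The third call to the failing service is rejected without touching the downstream at all, which is what keeps one slow dependency from tying up the engine's workers.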

