Tag

data pipelines

1 view collected around this technical thread.

Airbnb Technology Team
Mar 24, 2025 · Artificial Intelligence

Chronon: Open‑Source Feature Platform for Machine Learning – Architecture, Workflow, and Code Examples

Chronon is an open‑source ML feature platform that lets engineers declaratively define, compute, and serve both batch and real‑time features with built‑in observability, data‑quality checks, and a low‑latency retrieval API, ensuring online‑offline consistency while simplifying pipeline management and enabling future automation.

Chronon · Feature Engineering · Observability
0 likes · 13 min read
DataFunSummit
Mar 3, 2025 · Artificial Intelligence

DeepSeek Open Source Week: Seven Core Technologies Reshaping Large‑Model Training

The DeepSeek open‑source week introduced seven breakthrough technologies—FlashMLA, DeepGEMM, DeepEP, DualPipe, EPLB, 3FS, and Smallpond—that together overhaul data flow, algorithmic complexity, hardware utilization, MoE communication, and resource balancing, dramatically improving large‑model training efficiency and lowering entry barriers for the AI industry.

AI hardware · DeepSeek · Large Models
0 likes · 17 min read
DataFunSummit
Feb 24, 2025 · Big Data

Building Real-Time Data Synchronization Pipelines with Apache SeaTunnel

Apache SeaTunnel is an open‑source, distributed data integration platform that enables efficient real‑time data synchronization across diverse sources and destinations, supporting both streaming and batch processing, with detailed architecture, connector plugins, CDC handling, transform capabilities, and deployment strategies for large‑scale data pipelines.

Apache Seatunnel · CDC · data pipelines
0 likes · 34 min read
macrozheng
Dec 20, 2024 · Big Data

Master Data Pipelines with Kestra: Open‑Source Workflow Engine Explained

This article introduces the open‑source Kestra workflow engine, outlines its key features for building scalable data pipelines, provides step‑by‑step Docker installation and YAML workflow examples, and showcases its visual UI for monitoring and managing complex ETL and automation tasks.

Docker · Kestra · data pipelines
0 likes · 6 min read
Rare Earth Juejin Tech Community
Nov 29, 2024 · Big Data

How ByteDance Builds Large-Scale Data Processing Pipelines for Multimodal Models with Ray

The article details ByteDance's use of Ray and Ray Data to construct scalable audio and video data processing pipelines for multimodal AI models, addressing challenges of massive data volume, resource constraints, and fault tolerance through pipeline design, Ray Core enhancements, and custom scheduling optimizations.

AI · Big Data · ByteDance
0 likes · 16 min read
DataFunSummit
Sep 24, 2024 · Artificial Intelligence

Streaming Data Pipelines and Scaling Laws for Efficient Large‑Model Training

The article discusses the challenges of training ever‑larger AI models on internet‑scale data, critiques traditional batch ETL pipelines, and proposes a streaming data‑flow architecture with dynamic data selection and a shared‑memory/Alluxio middle layer to decouple data processing from model training, improving efficiency and scalability.

AI infrastructure · Large Models · data pipelines
0 likes · 20 min read
Test Development Learning Exchange
Mar 31, 2024 · Big Data

Apache Airflow Overview and Advanced Usage Examples

This article introduces Apache Airflow, explains its core concepts such as DAGs, tasks, operators, executors, and the web UI, and provides multiple practical Python code examples for Bash commands, Python functions, SQL queries, task dependencies, sensors, dynamic DAGs, SubDAGs, XCom, email alerts, and error handling.

Apache Airflow · DAG · Workflow
0 likes · 7 min read
Inke Technology
Nov 24, 2023 · Backend Development

Building a Scalable Overseas Ad Platform: Architecture, Permissions & Automation

The article outlines a comprehensive backend architecture built to support rapid overseas expansion, covering management, data ingestion, device tracking, attribution, and offline tasks, and details fine-grained user permission controls, automated product onboarding, batch ad creation, and server‑side attribution workflows, along with planned future enhancements.

Automation · advertising platform · backend
0 likes · 12 min read
DataFunTalk
Aug 3, 2023 · Game Development

Applying A/B Testing to Drive Growth in Tencent Overseas Games

This article explains how Tencent leverages A/B testing across its overseas games, detailing market differences, experimental methodology, multi‑cloud platform compliance, data architecture, and case studies that illustrate how targeted experiments improve user onboarding, gameplay settings, and email‑based re‑engagement.

A/B testing · data pipelines · experiment design
0 likes · 12 min read
Java Architecture Diary
Jul 5, 2023 · Cloud Native

Deploy and Explore StreamPipes: A Self‑Service Industrial IoT Toolbox

This guide introduces StreamPipes, an end‑to‑end industrial IoT toolbox for non‑technical users, outlines its key features, shows how to connect data sources, build pipelines, visualize data, and provides step‑by‑step Docker‑Compose installation, configuration, and development instructions.

Docker Compose · Installation Guide · StreamPipes
0 likes · 8 min read
DevOps Cloud Academy
Feb 28, 2023 · Operations

Understanding Apache Airflow Celery Executor: Architecture, Setup, and Task Execution

This article explains how Apache Airflow's Celery Executor works, covering its key features, installation steps, configuration details, architectural components, and the complete task execution process that enables scalable, distributed workflow orchestration for data pipelines.

Apache Airflow · Celery Executor · data pipelines
0 likes · 15 min read
DevOps Cloud Academy
Nov 13, 2022 · Operations

An Introduction to Apache Airflow: Features and Benefits of Digital Workflow Management

This article explains why modern organizations replace manual cron jobs with automated digital workflow management using Apache Airflow, describing its troubleshooting and monitoring capabilities, flexibility, rich web UI, CLI and API access, handling of complex dependencies, scalability, containerization support, and extensibility through plugins and integrations.

Apache Airflow · Open-source · Workflow Automation
0 likes · 9 min read
DevOps Cloud Academy
Oct 20, 2022 · Big Data

Installing Apache Airflow, Creating Users, and Using Basic Commands

This guide explains how to install Apache Airflow in a virtual environment, set up the Airflow home, create an admin user, understand role‑based access control, and run essential Airflow CLI commands for managing DAGs and tasks.

Airflow Roles · Apache Airflow · Installation
0 likes · 6 min read
DevOps Cloud Academy
Oct 15, 2022 · Big Data

Introduction to Apache Airflow

Apache Airflow is an open‑source platform for programmatically authoring, scheduling, and monitoring workflows using Directed Acyclic Graphs (DAGs), featuring components such as Scheduler, Web Server, Database, and various Executors, and offering easy‑to‑use, extensible, scalable, and robust integrations for data pipeline management.

Apache Airflow · DAG · Executor
0 likes · 10 min read
DataFunSummit
Aug 25, 2022 · Big Data

Managing the Full Lifecycle of Risk Features: Pitfalls, Solutions, and Future Directions

The talk by Tang Gengyang from Citic Baixin Bank details the challenges faced in risk feature engineering, presents two solution frameworks (1.0 and 2.0) for accelerating deployment, improving reuse, and handling offline/online consistency, and outlines future enhancements for a more efficient, automated feature pipeline.

Big Data · Feature Engineering · Flink
0 likes · 14 min read
Top Architect
Jun 7, 2022 · Databases

An Introduction to Change Data Capture (CDC) Practices and Modern Approaches

This article introduces the concept of Change Data Capture (CDC), explains why traditional batch reporting strains resources, describes how CDC captures only data changes to keep source databases performant, and outlines modern CDC architectures, production‑ready considerations, and best‑practice guidelines for building reliable data pipelines.

CDC · Change Data Capture · data integration
0 likes · 16 min read
Big Data Technology Architecture
Jun 3, 2022 · Operations

Understanding Apache Airflow DAGs, Operators, and Scheduling

This article explains Apache Airflow's core concepts, including DAG definitions, scheduling intervals, task dependencies, various operators such as BashOperator, PythonOperator, Branch operators, sensors, and custom operators, and provides code examples and configuration details for building robust data pipelines.

Apache Airflow · DAG · Operators
0 likes · 15 min read
Big Data Technology Architecture
May 31, 2022 · Big Data

Comprehensive Guide to Installing and Using Apache Airflow with Docker on Windows

This article provides a detailed tutorial on Apache Airflow fundamentals, Docker-based installation on Windows, Dockerfile creation, container deployment via Docker run and Docker Compose, Airflow configuration, and practical usage of DAGs, tasks, connections, and UI features for data pipeline orchestration.

Apache Airflow · Docker · Docker Compose
0 likes · 14 min read
ByteDance Data Platform
Apr 8, 2022 · Operations

How Baseline Monitoring Transforms Data Pipeline Reliability at ByteDance

This article explains ByteDance's baseline monitoring system for data pipelines, detailing its motivation, core concepts, architecture, instance generation, alert types, and handling of complex task dependencies to reduce operational costs and improve SLA compliance across hundreds of projects.

Alerting · Big Data · baseline monitoring
0 likes · 21 min read
Efficient Ops
Jun 1, 2021 · Artificial Intelligence

How Time‑Series Analysis Powers AIOps: Overcoming Real‑World Challenges

At the 16th GOPS Global Operations Conference, Shen Hui of DingMao Technology explained how time‑series data analysis underpins AIOps, outlining its four‑step workflow, key challenges, and the company’s three‑pipeline solution that enables trend forecasting, fault prediction, and a robust AI‑driven operational platform.

AI · AIOps · data pipelines
0 likes · 7 min read