Tag

ETL


Architect's Guide
Jun 14, 2025 · Big Data

Mastering Data Warehouse Design: From Fact Tables to Dimensional Modeling

This article explains the core components of a data warehouse ecosystem, distinguishes fact and dimension tables, outlines synchronization strategies, introduces star, snowflake, and constellation schemas, and details the layered architecture from ODS to data marts for effective big‑data analytics.

ETL · big data · data warehouse
15 min read
Java Tech Enthusiast
May 13, 2025 · Big Data

Using Alibaba DataX 3.0 for MySQL Data Synchronization: Installation, Configuration, and Incremental Sync

This article introduces Alibaba DataX 3.0, explains its architecture and role‑based design, walks through Linux installation, JDK setup, MySQL preparation, and provides step‑by‑step examples of full‑load and incremental data synchronization between two MySQL instances using JSON job configurations and command‑line execution.

Data Synchronization · DataX · ETL
14 min read
macrozheng
May 12, 2025 · Big Data

Master DataX: Efficient Data Synchronization for Massive MySQL Datasets

Learn how to overcome inaccurate reporting and cross-database challenges by using Alibaba’s open-source DataX tool to efficiently synchronize massive MySQL datasets, covering its architecture, job scheduling, installation, configuration, full- and incremental sync, and practical command-line examples.

Data Synchronization · DataX · ETL
15 min read
Top Architect
May 7, 2025 · Big Data

Using DataX for Efficient MySQL Data Synchronization

This article provides a comprehensive guide on using Alibaba's open‑source DataX tool for efficient offline synchronization between heterogeneous databases such as MySQL, covering its architecture, installation on Linux, job configuration, full‑ and incremental data transfer, and practical code examples.

Data Synchronization · DataX · ETL
18 min read
Architecture Digest
May 6, 2025 · Big Data

Using DataX for Efficient Data Synchronization Between MySQL Databases

This article explains how to employ Alibaba's open‑source DataX tool to perform fast, reliable full‑ and incremental data synchronization between MySQL instances, covering installation, framework design, job execution, and practical shell commands for Linux environments.

Data Synchronization · DataX · ETL
16 min read
vivo Internet Technology
Dec 18, 2024 · Big Data

Kafka Streams: Architecture, Configuration, and Monitoring Use Cases

Kafka Streams is a client library that enables low‑latency, fault‑tolerant real‑time processing of Kafka data through configurable topologies, time semantics, and state stores. The article explains its architecture and essential configurations, walks through a monitoring‑focused ETL example, and covers performance tuning and strategies for handling partition skew.

ETL · Java · Kafka Streams
25 min read
Test Development Learning Exchange
Dec 1, 2024 · Big Data

How to Install Apache Airflow and Build a Simple Data Processing Pipeline

This tutorial guides you through installing Apache Airflow, initializing its database, starting the web server and scheduler, creating a Python DAG that reads, cleans, groups, and saves CSV data, configuring the DAG directory, and monitoring the pipeline via the Airflow web UI.

Apache Airflow · DAG · ETL
6 min read
macrozheng
Sep 27, 2024 · Big Data

Master DataX: Efficient Offline Data Sync for Heterogeneous Sources

This guide walks through the challenges of synchronizing massive datasets across heterogeneous databases, introduces Alibaba's open‑source DataX tool, explains its framework‑plugin architecture, and provides step‑by‑step instructions—including environment setup, installation, job configuration, and both full and incremental MySQL synchronization—complete with code examples and performance metrics.

DataX · ETL · Incremental Sync
15 min read
IT Xianyu
Aug 26, 2024 · Big Data

Hive Data Warehouse: Modeling, Partitioning, and ID‑Mapping for User Profiles

This article explains how Hive serves as a data‑warehouse layer for user‑profile tagging, covering data‑warehouse fundamentals, fact‑and‑dimension modeling, partitioned storage, label aggregation, and ID‑mapping techniques with practical Hive DDL/DML examples.

ETL · Hive · ID Mapping
11 min read
DataFunTalk
Aug 8, 2024 · Big Data

Building a User Profile Data Warehouse at 58.com: Architecture, Modeling, and Practices

This article details the design and implementation of a user‑profile data warehouse at 58.com, covering data‑warehouse fundamentals, user‑profile tag generation, layered architecture, dimensional modeling choices, ETL migration from Hive to Spark, data‑quality safeguards, and the resulting scale of tables, metrics and tags.

Data quality · ETL · big data
20 min read
DataFunTalk
Jul 10, 2024 · Big Data

Apache SeaTunnel: A Next‑Generation Data Integration Platform for ETL/ELT and OLAP

This article introduces Apache SeaTunnel, a modern data integration platform designed for the EtLT era, detailing its architecture, core connector APIs, checkpoint mechanism, model inference, multi‑table synchronization, the high‑performance SeaTunnel Zeta engine, OLAP use cases, community roadmap, and the commercial WhaleTunnel product.

Apache SeaTunnel · ELT · ETL
22 min read
DaTaobao Tech
Jul 8, 2024 · Big Data

ODPS (MaxCompute) SQL Basics, Data Integration and Hologres Import Guide

This guide provides a comprehensive, beginner‑to‑advanced reference for ODPS (MaxCompute) SQL, covering table creation, DDL/DML commands, query syntax, join hints, MySQL‑to‑ODPS synchronization, one‑click and custom imports into Hologres, and scheduling variables for automated data pipelines.

ETL · Hologres · ODPS
37 min read
DevOps
Jun 27, 2024 · Big Data

Agile Data Engineering: Code‑as‑Infrastructure, Reuse Strategies, and ETL‑Level Continuous Integration

This article explores agile data engineering, advocating code‑as‑infrastructure practices such as code‑everything, data and code reuse, and ETL‑level continuous integration, while discussing the trade‑offs between data‑centric and code‑centric reuse, cost considerations, and practical implementation tips for modern data projects.

Agile Development · Code as Infrastructure · Data Engineering
22 min read
DataFunTalk
May 26, 2024 · Big Data

Athena Data Factory: A One‑Stop Data Development and Governance Platform for Sparkle Thinking

The article details how Sparkle Thinking built the Athena Data Factory—a comprehensive, self‑service data development and governance platform that integrates data integration, ETL, real‑time processing, monitoring, and analytics, describing its architecture, key technologies, implementation timeline, operational practices, performance gains, and future directions.

Airflow · ETL · Spark
26 min read
DataFunTalk
May 13, 2024 · Big Data

Data Integration Maturity Model: From ETL to EtLT

The article examines the evolution of data integration architectures—from traditional ETL through ELT to the emerging EtLT model—highlighting their advantages, disadvantages, industry trends, maturity stages, and practical guidance for enterprises and professionals navigating modern big‑data pipelines.

DataOps · ELT · ETL
31 min read
Test Development Learning Exchange
May 9, 2024 · Fundamentals

Getting Started with petl: Installation, Basic Operations, and Practical Examples

This article introduces the Python petl library for easy ETL tasks, explains how to install it via pip, and demonstrates core operations such as loading CSV data, viewing, filtering, sorting, converting, aggregating, joining, deduplicating, and performing basic statistical analysis with clear code examples.

Data Processing · ETL · Python
4 min read
DataFunSummit
May 2, 2024 · Big Data

Building an Attribution System for NetEase Cloud Music Data Warehouse: Challenges and Solutions

This article presents the problems faced by NetEase Cloud Music's data warehouse attribution system and details a comprehensive solution that includes upgrading the event‑tracking framework, redesigning the attribution model, and launching a unified management platform to improve stability, accuracy, and scalability.

ETL · analytics · big data
13 min read
DataFunSummit
Mar 24, 2024 · Big Data

Design and Implementation of a User Data Warehouse and Profiling System at 58.com

This article details the design and implementation of a user data warehouse at 58.com, covering data warehouse fundamentals, user profiling concepts, multi‑layer architecture, modeling methods, ETL migration from Hive to Spark, data quality assurance, and the resulting achievements.

ETL · Spark · big data
20 min read
DataFunTalk
Mar 1, 2024 · Big Data

Understanding Data Fabric and Data Virtualization: Concepts, Practices, and Real‑World Case Study

This article explains the fundamentals of Data Fabric and data virtualization, highlights the limitations of traditional centralized data warehouses, describes the three‑layer virtualization architecture, and presents a detailed securities‑industry case study that demonstrates cost, efficiency, and compliance benefits.

Data Virtualization · ETL · Logical Data Warehouse
17 min read
Sohu Tech Products
Jan 31, 2024 · Operations

Logstash Grok Filter: Complete Guide for Log Data Parsing and ETL

This guide explains Logstash’s Grok filter plugin, detailing how its 120 built‑in patterns, along with custom ones, transform unstructured logs (such as Apache, MySQL, or HiveServer2 logs) into structured fields through named regex captures. It also covers type conversion, cleaning, debugging, and efficient ETL for analysis and monitoring.

Data Processing · ETL · Grok filter
8 min read