Tag

MapReduce

1 view collected around this technical thread.

Rare Earth Juejin Tech Community
Dec 26, 2024 · Big Data

Understanding Hadoop HDFS and MapReduce: Principles, Architecture, and Sample Code

This article explains the origins of big‑data technologies, details the architecture and read/write mechanisms of Hadoop's HDFS, describes the MapReduce programming model, and provides complete Java code examples for a simple distributed file‑processing job using Maven dependencies.

Big Data · HDFS · Hadoop
0 likes · 15 min read
Qunar Tech Salon
Dec 10, 2024 · Big Data

Understanding and Solving Small File Problems in Hive and Spark

This article explains what constitutes a small file in HDFS and why small files strain memory, compute, and cluster load. It outlines common origins such as the data source itself, streaming jobs, and dynamic partitioning, and provides detailed Hive and Spark solutions (including CombineHiveInputFormat, merge parameters, distribute by, and custom Spark extensions) to merge small files efficiently and improve job performance.

Big Data · Hive · MapReduce
0 likes · 23 min read
Rare Earth Juejin Tech Community
Nov 23, 2024 · Big Data

Implementing a Basic Hadoop MapReduce Word Count with Extensible Design and Performance Tuning

This article explains Hadoop’s core concepts using a library analogy, details HDFS storage and MapReduce processing, provides complete Java implementations for a word‑count job with support for text, CSV, and JSON inputs, and discusses extensibility and performance optimizations such as combiners and custom partitioners.

Big Data · Hadoop · Java
0 likes · 20 min read
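As a rough illustration of the word‑count pattern summarized above, the sketch below simulates the map, combine, partition, and reduce steps in plain Python. It is not the article's Java code and does not use Hadoop; the function names (`map_phase`, `combine`, `partition`, `reduce_phase`, `word_count`) and the two‑reducer setup are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict
import zlib


def map_phase(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Pre-aggregate counts on the map side, as a combiner would."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def partition(word, num_reducers):
    """Stable hash partitioner: a given word always maps to the same reducer."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def reduce_phase(word, counts):
    """Sum all partial counts for one word."""
    return word, sum(counts)


def word_count(splits, num_reducers=2):
    # One bucket per simulated reducer; each maps word -> list of partial counts.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # each split stands in for one map task
        pairs = [pair for line in split for pair in map_phase(line)]
        for word, count in combine(pairs):
            buckets[partition(word, num_reducers)][word].append(count)
    result = {}
    for bucket in buckets:
        for word, counts in bucket.items():
            key, total = reduce_phase(word, counts)
            result[key] = total
    return result


if __name__ == "__main__":
    print(word_count([["the quick fox", "the lazy dog"], ["The fox"]]))
```

The combiner reduces how many pairs cross the simulated shuffle, and the partitioner decides which reducer sees each word; both are the kind of tuning hooks the summary refers to.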
Code Ape Tech Column
Jun 16, 2024 · Backend Development

Introducing PowerJob: A Lightweight Distributed Task Scheduling Framework and Its Usage

This article introduces PowerJob, a young yet mature distributed task scheduling framework, explains why it was chosen, details its architecture, high‑availability design, deployment steps, and demonstrates various job types—including standalone, broadcast, map, and MapReduce—along with CRON, fixed‑rate, and fixed‑delay scheduling configurations.

Java · MapReduce · PowerJob
0 likes · 13 min read
DaTaobao Tech
Dec 11, 2023 · Big Data

Design and Implementation of an Online Batch Processing Framework for Large-Scale Promotion Systems

The paper presents a centralized online batch‑processing framework for large‑scale promotion systems. Applications integrate via an SDK, and a task center schedules and dispatches sub‑tasks through RocketMQ to Dubbo‑enabled containers, using MapReduce‑style splitting, Guava rate limiting, and heartbeat health checks. The framework has handled over 1.3 million tasks during Double‑11.

Big Data · Dubbo · Java
0 likes · 9 min read
Code Ape Tech Column
Dec 9, 2023 · Backend Development

PowerJob Overview: Selection Rationale, Architecture, Task Types, and Scheduling Strategies with Code Samples

This article introduces the PowerJob distributed task framework, explains why it was chosen, details its architecture and high‑availability design, demonstrates various job types—including standalone, broadcast, map, and map‑reduce—with Java code examples, and covers scheduling options such as CRON, fixed‑rate, and fixed‑delay execution.

Java · MapReduce · PowerJob
0 likes · 14 min read
Rare Earth Juejin Tech Community
Sep 29, 2023 · Backend Development

Concurrent Chunk Processing in Go: A MapReduce‑Style Solution

The article explains how to handle business scenarios that require splitting large data sets into concurrent I/O requests and sequential aggregation by presenting a Go‑based chunk processing framework with map and reduce functions, configurable concurrency, and example code.

Chunk Processing · Concurrency · Go
0 likes · 7 min read
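The chunk‑then‑aggregate idea in the summary above can be sketched as follows. This is a Python approximation, not the article's Go framework; `chunked`, `process_chunks`, and their parameters are hypothetical names chosen for illustration. Chunks are mapped concurrently with a bounded worker pool, and the partial results are reduced sequentially in the original chunk order.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunks(items, chunk_size, map_fn, reduce_fn, initial, concurrency=4):
    """Run map_fn on each chunk concurrently, then fold results in order."""
    chunks = chunked(items, chunk_size)
    acc = initial
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Executor.map runs map_fn concurrently but yields results in input
        # order, so the reduce step below stays sequential and deterministic.
        for partial in pool.map(map_fn, chunks):
            acc = reduce_fn(acc, partial)
    return acc


if __name__ == "__main__":
    # Stand-in for an I/O call per chunk: here each chunk is simply summed.
    total = process_chunks(list(range(10)), 3, sum, lambda a, b: a + b, 0)
    print(total)
```

Threads suit the I/O‑bound case the summary describes; `concurrency` caps how many chunk requests are in flight at once.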
Architecture Digest
Jun 3, 2023 · Backend Development

PowerJob: A Next‑Generation Distributed Task Scheduling and Computing Framework – Features, Comparison, and Quick‑Start Guide

PowerJob is a modern distributed job scheduling framework that addresses the limitations of Quartz, XXL‑Job and SchedulerX by offering a web UI, rich scheduling strategies, DAG workflow support, lock‑free high‑performance scheduling, multiple processor types and step‑by‑step quick‑start instructions for developers.

Java · MapReduce · PowerJob
0 likes · 10 min read
Architecture Digest
Dec 16, 2022 · Backend Development

PowerJob: A Next‑Generation Distributed Task Scheduling and Computing Framework – Introduction and Quick‑Start Guide

PowerJob is a third‑generation distributed job scheduler that adds workflow orchestration, map‑reduce style computation and rich execution modes to traditional CRON‑based scheduling, and this guide explains its advantages, core features, architecture, and provides step‑by‑step instructions with code samples to get started quickly.

Java · MapReduce · PowerJob
0 likes · 11 min read
Practical DevOps Architecture
Jan 4, 2022 · Big Data

Step-by-Step Guide to Installing and Configuring Hadoop 2.9.2 Cluster on Three Nodes

This article provides a detailed, step-by-step tutorial for installing Hadoop 2.9.2, configuring environment variables, editing XML configuration files, formatting the NameNode, starting HDFS and YARN services, testing the cluster, and setting up the MapReduce history server on a three‑node Linux environment.

Big Data · Cluster Setup · Hadoop
0 likes · 9 min read
DataFunTalk
Dec 27, 2021 · Big Data

Comprehensive Big Data Interview Q&A: Hadoop, Spark, Kafka, Hive, and Related Technologies

This article presents a detailed interview-style walkthrough covering Hadoop cluster setup, HDFS components, MapReduce workflow, YARN advantages, Spark fundamentals, Kafka replication, Hive table types, and related big‑data concepts, providing concise explanations and practical insights for data engineers.

Big Data · Hadoop · Hive
0 likes · 20 min read
HomeTech
Dec 24, 2021 · Big Data

Handling java.lang.OutOfMemoryError in Hadoop MapReduce

This article explains the four locations where java.lang.OutOfMemoryError can occur in Hadoop's MapReduce framework—client, ApplicationMaster, Map, and Reduce phases—and provides configuration adjustments and best‑practice solutions to mitigate each type of OOM issue.

Big Data · Hadoop · Java
0 likes · 11 min read
DataFunTalk
Jun 11, 2021 · Big Data

Comprehensive Guide to Fast and Stable Hive‑to‑HBase Data Transfer Using Bulkload, MapReduce, and Spark

This article explains how to efficiently move large volumes of data from Hive to HBase by leveraging HBase's bulkload mechanism, detailing the original MapReduce workflow, its performance bottlenecks, and a rewritten Spark‑based solution that simplifies ETL, improves partitioning, and achieves several‑fold speedup.

Big Data · Bulkload · ETL
0 likes · 17 min read
Full-Stack Internet Architecture
Jan 27, 2021 · Big Data

Introduction to Hadoop: Architecture, HDFS, MapReduce, and YARN Overview

This article provides a comprehensive overview of Hadoop, covering its origins, core components such as HDFS, MapReduce, and YARN, their architectures, data storage and processing mechanisms, fault‑tolerance features, scheduling strategies, and practical optimization techniques for large‑scale distributed computing.

Big Data · HDFS · Hadoop
0 likes · 33 min read
Ctrip Technology
Sep 10, 2020 · Big Data

Design and Implementation of a Unified Log Framework for Ctrip Payment Center

The article describes the design, architecture, and operational details of a unified logging framework at Ctrip's payment center, covering log production via a Log4j2 extension, Kafka‑Camus collection, Hive/ORC storage, MapReduce parsing optimizations, and governance strategies for massive daily TB‑scale data.

Big Data · Hadoop · MapReduce
0 likes · 15 min read
Big Data Technology Architecture
Apr 28, 2020 · Big Data

Understanding Shuffle in Hadoop MapReduce and Spark

This article explains the concept and workflow of shuffle in Hadoop MapReduce and Spark, covering map‑side buffering, spill and merge, reduce‑side copy‑merge‑reduce, the reasons for sorting and file merging, and compares Hash‑Shuffle and Sort‑Shuffle implementations with performance considerations.

Big Data · Hash Shuffle · MapReduce
0 likes · 16 min read
Big Data Technology Architecture
Mar 19, 2020 · Big Data

Handling Data Skew in Hive: Join, Group By, and COUNT(DISTINCT) Optimizations

Data skew in Hive MapReduce jobs is caused by uneven key distribution during joins, group‑by, or COUNT(DISTINCT) operations, and it can severely slow tasks. The article explains common scenarios and practical solutions such as using MapJoin, enabling map‑side aggregation, load balancing, and rewriting queries to mitigate skew.

Big Data · Hive · MapJoin
0 likes · 7 min read
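One common way to balance a skewed group‑by is a two‑stage aggregation with a random salt. The sketch below shows that general idea in plain Python; it is an assumption about the kind of load balancing the summary refers to, not code from the article, and `salted_count`, `num_salts`, and `seed` are hypothetical names.

```python
import random
from collections import Counter


def salted_count(keys, num_salts=4, seed=0):
    """Count keys in two stages so a hot key is spread over several groups."""
    rng = random.Random(seed)
    # Stage 1: group by (key, random salt); a hot key is split into up to
    # num_salts partial groups instead of landing in a single one.
    stage1 = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Stage 2: drop the salt and merge the partial counts per original key.
    final = Counter()
    for (key, _salt), partial in stage1.items():
        final[key] += partial
    return dict(final), stage1


if __name__ == "__main__":
    keys = ["hot"] * 1000 + ["a", "b"]
    final, stage1 = salted_count(keys)
    print(final)
    print(max(stage1.values()))  # largest single group after salting
```

The final counts are the same as a direct group‑by; only the size of the largest intermediate group changes, which is what relieves the overloaded reducer.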