
Router-Based Federation in Hadoop: Architecture, Components, and Didi’s Deployment

Router‑Based Federation replaces Hadoop's single‑point HDFS bottleneck with a server‑side global namespace managed by Routers and a State Store, enabling scalable, highly available sub‑clusters. Didi back‑ported the feature, deployed five Routers, fixed numerous bugs, and contributed patches to improve stability and functionality.

Didi Tech

HDFS’s master/slave architecture creates a single‑point bottleneck: a lone NameNode must store all metadata and serve every request, and both become constrained as data scales. To address scalability, performance, and isolation issues, the Hadoop community introduced Federation (HDFS‑1052).

Federation, however, exposes multiple namespaces to users, requiring them to know which namespace holds the data they need. ViewFS (HADOOP‑7257) was proposed as a client‑side solution that mounts user directories to specific namespaces, but it suffers from upgrade difficulty and maintenance overhead.

The community later released Router‑Based Federation (RBF, HDFS‑10467) as a server‑side solution that simplifies namespace management. Didi adopted this approach and performed several customizations.

Router‑Based Federation Overview

The Router service sits in the Federation layer and provides a transparent global namespace, allowing clients to access any sub‑cluster while each sub‑cluster independently manages its own block pool. The Router forwards client requests to the appropriate active NameNode based on metadata stored in a State Store, offering scalability, high availability, and fault tolerance.
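Conceptually, the Router's request forwarding starts with a longest‑prefix match of the client path against the mount table to pick a sub‑cluster. A minimal sketch of that lookup (the `MountTable` class and the entries below are illustrative, not the actual RBF code):

```python
# Illustrative longest-prefix mount-table resolution, as a Router might
# perform it. Class and entry names are hypothetical, not RBF internals.

class MountTable:
    def __init__(self, entries):
        # entries: mount point -> (nameservice, target path in that sub-cluster)
        self.entries = entries

    def resolve(self, path):
        """Return (nameservice, remote path) for the longest matching mount point."""
        best = None
        for mount in self.entries:
            # A mount matches if the path equals it or lives underneath it.
            if path == mount or path.startswith(mount.rstrip("/") + "/"):
                if best is None or len(mount) > len(best):
                    best = mount
        if best is None:
            raise KeyError(f"No mount point for {path}")
        ns, target = self.entries[best]
        suffix = path[len(best):].lstrip("/")
        remote = target.rstrip("/") + ("/" + suffix if suffix else "")
        return ns, remote

table = MountTable({
    "/tmp": ("C0-1", "/tmp"),
    "/user": ("C0-2", "/user"),
})
print(table.resolve("/tmp/job1/part-0"))  # ('C0-1', '/tmp/job1/part-0')
```

Longest‑prefix matching is what lets a more specific mount (say, a dedicated sub‑cluster for one project directory) override a broader one.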

Key Components

1. Router

Provides a global NameNode interface and forwards requests to the active NameNode of the correct sub‑cluster.

Maintains NameNode information in the State Store.

Routers cache remote mount‑table entries and sub‑cluster states for performance. They periodically report NameNode HA status and load/space metrics to the State Store. Routers are stateless; if one fails, others continue serving.

When a Router cannot connect to the State Store for a configured timeout, it enters a safe mode similar to NameNode safe mode, rejecting client requests until the State Store becomes reachable.
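That safe‑mode rule reduces to a timeout check on the last successful State Store contact. A hedged sketch (class and method names below are hypothetical, not the RBF implementation):

```python
# Illustrative Router safe-mode check: reject client requests once the
# State Store has been unreachable longer than the configured timeout.

import time

class RouterSafeMode:
    def __init__(self, timeout_secs):
        self.timeout_secs = timeout_secs
        self.last_contact = time.monotonic()

    def record_state_store_contact(self):
        # Called after each successful State Store heartbeat/read.
        self.last_contact = time.monotonic()

    def in_safe_mode(self, now=None):
        now = time.monotonic() if now is None else now
        return (now - self.last_contact) > self.timeout_secs

router = RouterSafeMode(timeout_secs=30)
assert not router.in_safe_mode()                           # fresh contact: serving
assert router.in_safe_mode(now=router.last_contact + 31)   # stale: reject requests
```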

Interaction interfaces include RPC, Admin, and WebUI. The RPC interface handles standard client operations from engines such as MapReduce, Spark, and Hive. The Admin interface exposes RPCs for managing federation metadata, and the WebUI visualizes federation state, mount tables, and Router status.

2. State Store

The State Store holds:

Sub‑cluster load, available disk space, and HA status.

Mount‑table mappings between directories/files and sub‑clusters (e.g., /tmp → hdfs://C0‑1/tmp).

Router status information.

Its backend can be a file system or ZooKeeper.
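As a hedged sketch of what selecting the ZooKeeper backend might look like (property names follow the RBF documentation of later Hadoop releases and the ZooKeeper addresses are examples; verify both against your deployed version):

```xml
<!-- hdfs-site.xml: illustrative RBF State Store settings; confirm the
     property names against your Hadoop release's RBF documentation. -->
<property>
  <name>dfs.federation.router.store.driver.class</name>
  <value>org.apache.hadoop.hdfs.server.federation.store.driver.impl.StateStoreZooKeeperImpl</value>
</property>
<property>
  <!-- ZooKeeper quorum backing the State Store (example addresses) -->
  <name>hadoop.zk.address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```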

3. Future Plans

RBF currently implements a subset of NameNode APIs. Ongoing JIRA tickets (e.g., HDFS‑13655) will add missing protocol interfaces, and stability issues are tracked in HDFS‑13891.

RBF Deployment at Didi

Didi’s Hadoop version is 2.7.2, so the RBF feature from Hadoop 2.9/3.0 was back‑ported and adapted. The production cluster is split into five NameNode groups, each served by a dedicated Router, resulting in five Routers for the whole cluster. The setup has been stable for over two months.

Compatibility work included modifying the Hive client to resolve absolute HDFS paths with scheme prefixes, preventing “Wrong FS” errors when running Hive jobs on RBF.
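The client‑side fix amounts to qualifying scheme‑less absolute paths with the default filesystem URI before they reach filesystem checks. A minimal sketch (the `default_fs` value and the `qualify` helper are illustrative, not Didi's actual patch):

```python
# Illustrative path qualification: prefix bare absolute paths with the
# default filesystem URI so "Wrong FS" checks see a fully qualified path.

from urllib.parse import urlparse

def qualify(path, default_fs="hdfs://router-fed"):
    """Return the path qualified with the default FS if it lacks a scheme."""
    if urlparse(path).scheme:
        return path  # already fully qualified, leave untouched
    return default_fs.rstrip("/") + path

print(qualify("/user/hive/warehouse/t1"))
# hdfs://router-fed/user/hive/warehouse/t1
```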

Community Contributions

Didi identified and fixed several RBF issues (quota handling, mount‑table cache misuse, ZNode creation bugs, etc.) and contributed patches that were largely accepted by the community, including JIRA tickets HDFS‑13710, HDFS‑13821, HDFS‑13836, HDFS‑13844, HDFS‑13845, HDFS‑13854, HDFS‑13856, HDFS‑13857, HDFS‑13802, HDFS‑13852, and HDFS‑14114.

Additional Work

Improvements were made to the RBF WebUI to correctly aggregate node counts and storage totals across sub‑clusters. New APIs were added to allow product services to programmatically add mount‑table entries.

Conclusion

Router‑Based Federation has been successfully deployed at Didi, providing operational convenience and high availability. Didi plans to continue contributing to the community to enrich RBF functionality and stability.

Tags: big data, High Availability, HDFS, Hadoop, distributed-filesystem, Router Based Federation
Written by Didi Tech, the official Didi technology account.
