Building a Real-Time Stream Processing Platform with Hadoop Ecosystem (Kafka, Spark Streaming, HBase)
This guide details how to construct a real-time data processing platform on CentOS 7 using the Hadoop ecosystem—installing and configuring ZooKeeper, Maven, Hadoop, Kafka, HBase, Spark, and Flume—followed by a Spark Streaming job that consumes Kafka messages and writes them into HBase. The motivation is a real-time big-data pipeline: the existing system is rebuilt on the Hadoop ecosystem, with Kafka as the message bus, Spark Streaming for real-time computation, and HBase for multi-dimensional storage.
All components run on CentOS 7. Required frameworks include Flume 1.8.0, Hadoop 2.9.0, Kafka 1.0.0, Spark 2.2.1, HBase 1.2.6, ZooKeeper 3.4.11, and Maven 3.5.2, with Java 1.8+ and Scala as the development language.
1. Configure development environment – download and extract JDK 1.8 and Scala, then edit /etc/profile to set JAVA_HOME, SCALA_HOME, and PATH:
export JAVA_HOME=/usr/java/jdk1.8.0_144
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export SCALA_HOME=/usr/local/scala-2.11.12
export PATH=$PATH:$SCALA_HOME/bin
Apply the changes with source /etc/profile.
2. Install ZooKeeper and Maven – download, extract, and configure:
export MAVEN_HOME=/usr/local/apache-maven-3.5.2
export PATH=$PATH:$MAVEN_HOME/bin
Set up ZooKeeper by copying zoo_sample.cfg to zoo.cfg, editing the data directory, and starting the server with /usr/local/zookeeper-3.4.11/bin/zkServer.sh start.
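For a single-node setup, a minimal zoo.cfg might look like the following (the dataDir path is illustrative; point it at a directory you create for ZooKeeper state):

```properties
# Basic standalone ZooKeeper configuration (sketch)
tickTime=2000
dataDir=/usr/local/zookeeper-3.4.11/data
clientPort=2181
```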
3. Install Hadoop – download and extract, then set:
export HADOOP_HOME=/usr/local/hadoop-2.9.0
export PATH=$PATH:$HADOOP_HOME/bin
Configure core-site.xml, hdfs-site.xml, and yarn-site.xml, format the NameNode, and start the Hadoop services.
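A minimal core-site.xml for this single-node setup might look like the following sketch; the fs.defaultFS value hdfs://tsk1:9000 matches the hbase.rootdir used later in this guide, while the temp directory is an assumption to adjust for your environment:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://tsk1:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop-2.9.0/tmp</value>
  </property>
</configuration>
```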
4. Install Kafka – download and extract, then set:
export KAFKA_HOME=/usr/local/kafka_2.11-1.0.0
export PATH=$KAFKA_HOME/bin:$PATH
Configure server.properties (log directories, ZooKeeper connection) and start Kafka with kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties. Create a test topic using kafka-topics.sh.
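The relevant server.properties entries might look like this sketch (listener address and log path are assumptions matching the hosts used elsewhere in this guide):

```properties
broker.id=0
listeners=PLAINTEXT://tsk1:9092
log.dirs=/usr/local/kafka_2.11-1.0.0/logs
zookeeper.connect=tsk1:2181
```

A test topic matching the later Spark Streaming test could then be created with kafka-topics.sh --create --zookeeper tsk1:2181 --replication-factor 1 --partitions 1 --topic testTopic.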
5. Install HBase – download and extract, then set:
export HBASE_HOME=/usr/local/hbase-1.2.6
export PATH=$PATH:$HBASE_HOME/bin
Edit hbase-env.sh to set JAVA_HOME and HBASE_MANAGES_ZK=false. Add the following hbase-site.xml configuration (excerpt):
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://tsk1:9000/hbase</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>tsk1:60000</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>192.168.70.135</value>
  </property>
  ...
</configuration>
Start HBase with start-hbase.sh and verify using the HBase shell.
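A quick verification session in the HBase shell might look like the following (the table and column-family names are illustrative, not from the original project):

```
$ hbase shell
hbase> status
hbase> create 'testTable', 'cf'
hbase> list
```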
6. Install Spark – download and extract, then set:
export SPARK_HOME=/usr/local/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
Apply with source /etc/profile.
7. Test the pipeline – compile a Java project containing two classes. HBaseHelper provides a singleton for HBase connections and a putAdd method; KafkaRecHbase creates a Spark Streaming context, reads from Kafka, splits each line, and writes each word to HBase via HBaseHelper.getInstances().putAdd(...). Submit the job with:
spark-submit --jars $(echo /usr/local/hbase-1.2.6/lib/*.jar | tr ' ' ',') \
--class com.test.spark.spark_test.KafkaRecHbase \
--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.1 \
/opt/FileTemp/streaming/spark-test-0.1.1.jar tsk1:2181 test testTopic 1
After the job starts, produce messages to the Kafka topic; they will appear in the HBase table.
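The article does not reproduce the helper's source. As a sketch of what HBaseHelper might look like against the HBase 1.2 client API (the class shape, method names, and ZooKeeper quorum value are assumptions; only the org.apache.hadoop.hbase.* calls are real API):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: singleton holding one shared HBase connection for the streaming job.
public class HBaseHelper {
    private static HBaseHelper instance;
    private final Connection connection;

    private HBaseHelper() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Assumed quorum address, matching the hbase-site.xml excerpt above.
        conf.set("hbase.zookeeper.quorum", "192.168.70.135");
        this.connection = ConnectionFactory.createConnection(conf);
    }

    public static synchronized HBaseHelper getInstances() throws IOException {
        if (instance == null) {
            instance = new HBaseHelper();
        }
        return instance;
    }

    // Write one cell: table / row key / column family / qualifier / value.
    public void putAdd(String tableName, String rowKey,
                       String family, String qualifier, String value) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier),
                          Bytes.toBytes(value));
            table.put(put);
        }
    }
}
```

Sharing a single Connection and opening a Table per write is the pattern the HBase client documentation recommends, since Connection is heavyweight and thread-safe while Table is cheap and short-lived.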
8. Collect Nginx logs with Flume – download Flume, set FLUME_HOME, and create nginxStreamingKafka.conf defining an exec source that tails the Nginx log, a memory channel, and a Kafka sink. Start Flume with:
flume-ng agent --name agent1 --conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/nginxStreamingKafka.conf \
-Dflume.root.logger=INFO,console
The logs are streamed into Kafka and can be processed by the same Spark Streaming job.
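An nginxStreamingKafka.conf along the lines described above might be (the Nginx log path, capacities, and broker address are assumptions; the sink property names follow the Flume 1.8 Kafka sink):

```properties
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

# Exec source tailing the Nginx access log
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /var/log/nginx/access.log
agent1.sources.r1.channels = c1

# In-memory channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 10000
agent1.channels.c1.transactionCapacity = 1000

# Kafka sink feeding the topic consumed by the Spark job
agent1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.k1.kafka.bootstrap.servers = tsk1:9092
agent1.sinks.k1.kafka.topic = testTopic
agent1.sinks.k1.channel = c1
```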
By following these steps, a functional real‑time streaming platform is built, ready for further scaling, monitoring, and optimization.