Building a Real-Time Stream Processing Platform with Hadoop Ecosystem (Kafka, Spark Streaming, HBase)
This guide details how to construct a real-time data processing platform on CentOS 7 using the Hadoop ecosystem—installing and configuring ZooKeeper, Maven, Hadoop, Kafka, HBase, Spark, and Flume—followed by a Spark Streaming job that consumes Kafka messages and writes them into HBase. The motivation is a real-time big-data pipeline: the existing system is rebuilt on the Hadoop ecosystem, with Kafka as the message bus, Spark Streaming for real-time computation, and HBase for multi-dimensional storage.
All components run on CentOS 7. Required frameworks include Flume 1.8.0, Hadoop 2.9.0, Kafka 1.0.0, Spark 2.2.1, HBase 1.2.6, ZooKeeper 3.4.11, and Maven 3.5.2, with Java 1.8+ and Scala as the development language.
1. Configure development environment – download and extract JDK 1.8 and Scala, then edit /etc/profile to set JAVA_HOME, SCALA_HOME, and PATH:
export JAVA_HOME=/usr/java/jdk1.8.0_144
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export SCALA_HOME=/usr/local/scala-2.11.12
export PATH=$PATH:$SCALA_HOME/bin
Apply the changes with source /etc/profile.
2. Install ZooKeeper and Maven – download, extract, and configure:
export MAVEN_HOME=/usr/local/apache-maven-3.5.2
export PATH=$PATH:$MAVEN_HOME/bin
Set up ZooKeeper by copying zoo_sample.cfg to zoo.cfg, editing the data directory, and starting the server with /usr/local/zookeeper-3.4.11/bin/zkServer.sh start.
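For a single-node setup, a minimal zoo.cfg might look like the following (the dataDir path is illustrative; point it at a directory you create for ZooKeeper state):

```properties
# Basic standalone ZooKeeper configuration (sketch)
tickTime=2000
dataDir=/usr/local/zookeeper-3.4.11/data
clientPort=2181
```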
3. Install Hadoop – download and extract, then set:
export HADOOP_HOME=/usr/local/hadoop-2.9.0
export PATH=$PATH:$HADOOP_HOME/bin
Configure core-site.xml, hdfs-site.xml, and yarn-site.xml, format the NameNode, and start the Hadoop services.
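A minimal core-site.xml for this single-node setup might look like the following sketch; the fs.defaultFS value hdfs://tsk1:9000 matches the hbase.rootdir used later in this guide, while the temp directory is an assumption to adjust for your environment:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://tsk1:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop-2.9.0/tmp</value>
  </property>
</configuration>
```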
4. Install Kafka – download and extract, then set:
export KAFKA_HOME=/usr/local/kafka_2.11-1.0.0
export PATH=$KAFKA_HOME/bin:$PATH
Configure server.properties (log directories, ZooKeeper connection) and start Kafka with kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties. Create a test topic using kafka-topics.sh.
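The relevant server.properties entries might look like this sketch (listener address and log path are assumptions matching the hosts used elsewhere in this guide):

```properties
broker.id=0
listeners=PLAINTEXT://tsk1:9092
log.dirs=/usr/local/kafka_2.11-1.0.0/logs
zookeeper.connect=tsk1:2181
```

A test topic matching the later Spark Streaming test could then be created with kafka-topics.sh --create --zookeeper tsk1:2181 --replication-factor 1 --partitions 1 --topic testTopic.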
5. Install HBase – download and extract, then set:
export HBASE_HOME=/usr/local/hbase-1.2.6
export PATH=$PATH:$HBASE_HOME/bin
Edit hbase-env.sh to set JAVA_HOME and HBASE_MANAGES_ZK=false. Add the following hbase-site.xml configuration (excerpt):
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://tsk1:9000/hbase</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>tsk1:60000</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>192.168.70.135</value>
  </property>
  ...
</configuration>
Start HBase with start-hbase.sh and verify using the HBase shell.
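A quick verification session in the HBase shell might look like the following (the table and column-family names are illustrative, not from the original project):

```
$ hbase shell
hbase> status
hbase> create 'testTable', 'cf'
hbase> list
```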
6. Install Spark – download and extract, then set:
export SPARK_HOME=/usr/local/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
Apply with source /etc/profile.
7. Test the pipeline – compile a Java project containing two classes. HBaseHelper provides a singleton for HBase connections and a putAdd method; KafkaRecHbase creates a Spark Streaming context, reads from Kafka, splits each line, and writes each word to HBase via HBaseHelper.getInstances().putAdd(...). Submit the job with:
spark-submit --jars $(echo /usr/local/hbase-1.2.6/lib/*.jar | tr ' ' ',') \
--class com.test.spark.spark_test.KafkaRecHbase \
--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.1 \
/opt/FileTemp/streaming/spark-test-0.1.1.jar tsk1:2181 test testTopic 1
After the job starts, produce messages to the Kafka topic; they will appear in the HBase table.
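The article does not reproduce the helper's source. As a sketch of what HBaseHelper might look like against the HBase 1.2 client API (the class shape, method names, and ZooKeeper quorum value are assumptions; only the org.apache.hadoop.hbase.* calls are real API):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: singleton holding one shared HBase connection for the streaming job.
public class HBaseHelper {
    private static HBaseHelper instance;
    private final Connection connection;

    private HBaseHelper() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Assumed quorum address, matching the hbase-site.xml excerpt above.
        conf.set("hbase.zookeeper.quorum", "192.168.70.135");
        this.connection = ConnectionFactory.createConnection(conf);
    }

    public static synchronized HBaseHelper getInstances() throws IOException {
        if (instance == null) {
            instance = new HBaseHelper();
        }
        return instance;
    }

    // Write one cell: table / row key / column family / qualifier / value.
    public void putAdd(String tableName, String rowKey,
                       String family, String qualifier, String value) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier),
                          Bytes.toBytes(value));
            table.put(put);
        }
    }
}
```

Sharing a single Connection and opening a Table per write is the pattern the HBase client documentation recommends, since Connection is heavyweight and thread-safe while Table is cheap and short-lived.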
8. Collect Nginx logs with Flume – download Flume, set FLUME_HOME, and create nginxStreamingKafka.conf defining an exec source that tails the Nginx log, a memory channel, and a Kafka sink. Start Flume with:
flume-ng agent --name agent1 --conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/nginxStreamingKafka.conf \
-Dflume.root.logger=INFO,console
The logs are streamed into Kafka and can be processed by the same Spark Streaming job.
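An nginxStreamingKafka.conf along the lines described above might be (the Nginx log path, capacities, and broker address are assumptions; the sink property names follow the Flume 1.8 Kafka sink):

```properties
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

# Exec source tailing the Nginx access log
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /var/log/nginx/access.log
agent1.sources.r1.channels = c1

# In-memory channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 10000
agent1.channels.c1.transactionCapacity = 1000

# Kafka sink feeding the topic consumed by the Spark job
agent1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.k1.kafka.bootstrap.servers = tsk1:9092
agent1.sinks.k1.kafka.topic = testTopic
agent1.sinks.k1.channel = c1
```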
By following these steps, a functional real‑time streaming platform is built, ready for further scaling, monitoring, and optimization.