Building Real-Time User Behavior Engineering with Apache Flink: Architecture, Features, and Implementation
This article introduces the design and implementation of a real‑time user behavior engineering platform at Qunar using Apache Flink, covering Flink's core characteristics, distributed runtime, DataStream programming model, fault‑tolerance, back‑pressure handling, event‑time processing, windowing, watermarks, and practical code examples for filtering, splitting, joining, and state management.
Introduction
Hi everyone! This article presents the practice of building a user real‑time basic behavior engineering platform based on Flink, including the relevant Flink technical points and the business logic of the behavior engineering.
Flink is Qunar's primary open‑source real‑time data processing platform, intended to replace Spark Streaming. It processes over 1.2 billion events per day with second‑level latency and supports up to 100k QPS for user‑behavior jobs.
Flink Overview
Apache Flink is a distributed open‑source compute framework for both stream and batch processing, offering low latency, exactly‑once guarantees, and high‑throughput batch capabilities. Unlike Spark Streaming, which treats streaming as a series of micro‑batches, Flink treats streams as unbounded data and batch as a special case of bounded streams.
Key Features of Flink
Exactly‑once stateful computation with checkpointing.
Event‑time semantics and flexible windowing (time, count, session, data‑driven).
Lightweight fault tolerance via state snapshots.
High throughput, low latency processing.
Savepoints for manual state snapshots.
Scalable cluster deployment on YARN, Mesos, etc.
Back‑pressure support.
Built‑in JVM memory management and OOM prevention.
Iterative computation.
Automatic program optimization to avoid expensive shuffles and sorts.
Flink Distributed Runtime
JobManager coordinates the distributed execution, TaskManager performs the actual data processing, and the client generates the execution plan.
Flink DataStream Programming Model
Abstraction Layers
The lowest layer is stateful stream processing via process functions; the DataStream API (the core API) handles most business logic such as aggregations and joins. The Table API and SQL provide schema‑based queries but are less used here.
Stream Programming
Flink programs consist of three elements: Source, Operator, and Sink. Data flows from Source, is transformed by Operators, and finally persisted by Sink.
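As an analogy, the three‑part model can be sketched with plain Java streams (no Flink dependency; the event format and parsing here are made up for illustration):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SourceOperatorSink {
    public static void main(String[] args) {
        // Source: produce raw events (a bounded stand-in for Kafka, sockets, etc.)
        Stream<String> source = Stream.of("click:3", "view:0", "click:7");

        // Operators: transform the stream (parse the value, drop zero-valued events)
        List<Integer> result = source
                .map(s -> Integer.parseInt(s.split(":")[1]))
                .filter(v -> v != 0)
                .collect(Collectors.toList()); // Sink: collect/persist the results

        System.out.println(result); // [3, 7]
    }
}
```

In a real Flink job the source would be a Kafka consumer, the operators would run distributed across TaskManagers, and the sink would write to Redis or HDFS.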
User Real‑Time Basic Behavior Engineering
Architecture
Kafka topics from various business lines are subscribed by Flink jobs for data cleaning. Cleaned data is keyed by business line and stored in Redis. The server retrieves all behaviors for a user from Redis, merges them with offline data via Dubbo, and returns the result to the client. CPU‑intensive tasks like decompression are performed on the client to reduce server load.
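One hypothetical way to lay out the Redis keys implied by "keyed by business line" (the key format and names below are assumptions for illustration, not Qunar's actual schema):

```java
public class BehaviorKey {
    // Hypothetical key scheme: one Redis key per (businessLine, userId) pair,
    // so the server can fetch all of a user's behaviors with one read per line.
    static String redisKey(String businessLine, String userId) {
        return String.format("behavior:%s:%s", businessLine, userId);
    }

    public static void main(String[] args) {
        System.out.println(redisKey("flight", "u42")); // behavior:flight:u42
    }
}
```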
Business Value
The service provides a user's behavior within the last 100 days across flights, hotels, trains, tickets, vacations, etc., sourced from Hotdog logs, Kylin logs, and business‑line logs. It enables fine‑grained retention analysis, quality assessment, product analysis, user segmentation, and personalized recommendation, improving click‑through rates by about 20% compared to using T‑1 day data.
Typical DataStream Operators
1. Filter
Use filter to keep only the desired behavior type.
dataStream.filter(new FilterFunction<Integer>() {
    @Override
    public boolean filter(Integer value) throws Exception {
        return value != 0;
    }
});

2. Split and Select
Split a mixed‑type log stream into separate streams for different business logic.
SplitStream<Integer> split = someDataStream.split(new OutputSelector<Integer>() {
    @Override
    public Iterable<String> select(Integer value) {
        List<String> output = new ArrayList<>();
        if (value % 2 == 0) {
            output.add("even");
        } else {
            output.add("odd");
        }
        return output;
    }
});
DataStream<Integer> even = split.select("even");
DataStream<Integer> odd = split.select("odd");
DataStream<Integer> all = split.select("even", "odd");

3. Join Two Streams
Join two streams on a key using a time window.
dataStream.join(otherStream)
    .where(keySelector1).equalTo(keySelector2)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply(new JoinFunction<...>() { ... });

Fault‑Tolerance Strategies
1. Retry on Failure
Flink automatically restarts failed jobs based on checkpoint state. Checkpoints are taken every 5 seconds and stored in HDFS. Restart strategies can be configured globally via flink-conf.yaml or per‑job (fixed‑delay or failure‑rate). The current job‑level setting retries once after a 10‑second pause.
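The job‑level setting described above could also be expressed globally in flink-conf.yaml, roughly as follows (a sketch using Flink 1.x restart‑strategy keys; the values mirror the article's one retry after a 10‑second pause):

```yaml
# Global default restart strategy (overridable per job)
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 1
restart-strategy.fixed-delay.delay: 10 s
```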
2. Stop‑And‑Savepoint
When updating jobs or the cluster, a manual savepoint captures a globally consistent snapshot.
./bin/flink cancel -s [savepointDirectory]
./bin/flink run -s ...

3. Monitoring & Alerts
Flink exposes Metrics (similar to Qmonitor) for data ingestion rate, failure rate, end‑to‑end latency, processing latency, and persistence success rate.
Availability
1. High Availability
The Flink cluster runs on 4 TaskManagers with 16 slots, MySQL‑MMM for DB HA, a 10‑node Redis cluster, and 8 NG virtual machines for the server layer, eliminating single points of failure.
2. Back‑Pressure Handling
Flink's built‑in back‑pressure automatically throttles upstream producers when downstream operators become slow, using bounded buffers similar to Java's blocking queues.
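The blocking‑queue analogy can be made concrete with a small stdlib sketch (illustrative only; Flink's actual mechanism uses its own bounded network buffer pools, not java.util.concurrent):

```java
import java.util.concurrent.ArrayBlockingQueue;

public class BackPressureSketch {
    public static void main(String[] args) throws Exception {
        // A bounded buffer of capacity 2, standing in for a network buffer pool.
        ArrayBlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(2);

        // A fast producer fills the buffer.
        buffer.put(1);
        buffer.put(2);

        // Buffer full: a non-blocking offer fails immediately. A Flink producer
        // would instead block on put(), which is what throttles the upstream.
        boolean accepted = buffer.offer(3);
        System.out.println("accepted=" + accepted); // false: upstream must wait

        // Once the slow consumer drains an element, the producer proceeds.
        buffer.take();
        accepted = buffer.offer(3);
        System.out.println("accepted=" + accepted); // true
    }
}
```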
3. Event‑Time, Window, and Watermark for Out‑of‑Order Streams
Event‑time processing ensures correct ordering despite out‑of‑order arrivals. Windows (tumbling, sliding, session, global) segment the infinite stream into finite chunks. Watermarks indicate progress in event time, allowing windows to close when late data is unlikely.
Time Models
Processing time – local operator time when an element is processed.
Event time – the original timestamp of the event.
Ingestion time – timestamp assigned when the element enters Flink at the source.
Window Types
Windows split an infinite stream into bounded buffers for aggregation. Types include tumbling, sliding, session, and global windows.
Watermark Concept
A watermark is a special element that carries a timestamp; when the system sees a watermark of time t, it knows no later element with timestamp ≤ t will arrive, so it can trigger window computation.
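This can be sketched as a few lines of logic with hypothetical helper names (in real Flink this bookkeeping lives inside the watermark assigner and window operators):

```java
public class WatermarkSketch {
    // The max event timestamp seen so far, minus the allowed lag, is the
    // watermark (the same shape as a periodic bounded-out-of-orderness assigner).
    static long watermark(long maxTimestampSeen, long maxLagMs) {
        return maxTimestampSeen - maxLagMs;
    }

    // A window covering [start, end) may fire once the watermark passes its end.
    static boolean windowCanFire(long watermark, long windowEnd) {
        return watermark >= windowEnd;
    }

    public static void main(String[] args) {
        long wm = watermark(10_000L, 3_000L); // events up to t=10s seen, 3s lag allowed
        System.out.println(wm);                         // 7000
        System.out.println(windowCanFire(wm, 5_000L));  // true: window [0, 5s) closes
        System.out.println(windowCanFire(wm, 10_000L)); // false: still waiting for late data
    }
}
```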
Implementation Steps
1. Enable Event‑Time
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

2. Define a Sliding Window

ds.window(SlidingEventTimeWindows.of(Time.minutes(2), Time.minutes(1)))

3. Set Watermark Generator
static AssignerWithPeriodicWatermarks<String> assigner = new AssignerWithPeriodicWatermarks<String>() {
    private final long maxTimeLag = 60000; // 60 seconds
    @Override
    public Watermark getCurrentWatermark() {
        return new Watermark(System.currentTimeMillis() - maxTimeLag);
    }
    @Override
    public long extractTimestamp(String logStr, long previousElementTimestamp) {
        JsonObject jsonObject = new JsonParser().parse(logStr).getAsJsonObject();
        return jsonObject.get("timestamp").getAsLong();
    }
};

The custom assigner defines how watermarks are generated and how event timestamps are extracted.
Performance Scaling
When business volume grows, simply add more TaskManagers, expand the Redis cluster, provision additional virtual machines, and submit new Flink jobs. Horizontal scaling is straightforward and does not affect running jobs.
Thank you for reading. Feel free to leave comments, ask questions, or contact me for further discussion.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.