Big Data 21 min read

Path Analysis Model Design and Engineering Implementation for Internet Data Operations

The article details the design and engineering of a high‑performance path analysis model for internet data operations, explaining session handling, Sankey visualizations, adjacency‑table storage, multi‑granular session partitioning, Spark‑to‑ClickHouse pipelines, and optimizations that enable billion‑scale user‑path queries in about one second.

vivo Internet Technology

This article introduces the design and implementation of a path analysis model for internet data operations. Path analysis is a data analysis method specific to the internet industry; it visualizes and analyzes the paths users take through a product.

Application Scenarios: The model addresses questions such as: what are the main user paths ranked by conversion rate, where do users deviate from the expected paths, and how do behavior paths differ across user segments? A practical business scenario demonstrates the analysis of the main behavior paths by which "active users" reach a target landing page (the small video page). The daily data volume is in the billions of records, and queries return in approximately 1 second.

Core Concepts: The article explains key concepts including Session (a time-bounded series of interactions), Sankey diagrams (flow diagrams in which the width of each branch represents the volume of flow), adjacency tables (a compact storage structure for graphs), tree pruning (removing insignificant nodes), and the PV/SV metrics (Page View and Session View counts).
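To make the Session concept concrete, here is a minimal Python sketch of splitting one user's events into sessions. It assumes a session ends when the gap between two consecutive events exceeds a timeout, which is one common interpretation of "time-bounded"; the article does not spell out the exact rule. The names `split_sessions`, `events`, and `by_granularity` are hypothetical.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, page name)


def split_sessions(events: List[Event], gap_minutes: int) -> List[List[str]]:
    """Split one user's events into sessions using an inactivity gap.

    Assumption: a new session starts whenever the time since the previous
    event is larger than gap_minutes.
    """
    sessions: List[List[str]] = []
    last_ts = None
    for ts, page in sorted(events):
        if last_ts is None or ts - last_ts > gap_minutes * 60:
            sessions.append([])  # open a new session
        sessions[-1].append(page)
        last_ts = ts
    return sessions


# Illustrative events for a single user (not real data).
events = [(0, "home"), (60, "search"), (700, "video"), (5000, "home")]

# The same events partitioned under several session granularities.
by_granularity: Dict[int, List[List[str]]] = {
    g: split_sessions(events, g) for g in (5, 10, 15, 30, 60)
}
```

With a 5-minute gap the sample events fall into three sessions, while a 15-minute gap merges the first three page views into one, which is why the choice of granularity changes the resulting paths.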

Data Model Design: Data flows from a unified data warehouse through Spark computation into ClickHouse, with Hive used for cold backup. The model supports flexible session partitioning at multiple time granularities (5, 10, 15, 30, and 60 minutes). The key processing steps are: obtaining page information and partitioning sessions, deduplicating adjacent pages, extracting the 4 levels of forward and backward pages for each page, calculating PV/SV for the forward and backward paths, and computing the conversion rate at each level.
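The deduplication, path extraction, and PV/SV steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the article's Spark job: PV is taken to count every occurrence of a path, SV is taken to count the sessions that contain it at least once, and only forward paths are shown (backward paths would be the same procedure on the reversed session). All function names are hypothetical.

```python
from collections import Counter
from itertools import groupby
from typing import List, Tuple

Path = Tuple[str, ...]


def dedupe_adjacent(pages: List[str]) -> List[str]:
    """Collapse runs of the same page, e.g. a refresh, into one visit."""
    return [page for page, _ in groupby(pages)]


def forward_paths(pages: List[str], levels: int = 4) -> List[Path]:
    """For each page, take the page plus up to `levels` following pages."""
    return [tuple(pages[i:i + 1 + levels]) for i in range(len(pages))]


def count_pv_sv(sessions: List[List[str]], levels: int = 4):
    """Return (pv, sv) counters keyed by path.

    Assumption: PV counts every occurrence, SV counts each path at most
    once per session.
    """
    pv: Counter = Counter()
    sv: Counter = Counter()
    for pages in sessions:
        paths = forward_paths(dedupe_adjacent(pages), levels)
        pv.update(paths)
        sv.update(set(paths))
    return pv, sv


# Illustrative sessions (not real data); one forward level for brevity.
sessions = [["home", "home", "video", "home", "video"], ["home", "video"]]
pv, sv = count_pv_sv(sessions, levels=1)
# ("home", "video") occurs 3 times in total, spread over 2 sessions.
```

A per-level conversion rate could then be derived by dividing the count of a path by the count of its prefix, though the exact formula used in the article is not given here.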

Engineering Architecture: The backend constructs Sankey diagrams by building weighted path trees using adjacency tables organized by level. The implementation includes reading data layer by layer, constructing bidirectional edge relationships (parent-child and child-parent), pruning to remove isolated nodes and incomplete paths, and finally constructing the adjacency table for visualization.
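The level-by-level construction and pruning can be sketched as below. This is a simplified sketch, assuming each stored row is a `(level, parent, child, weight)` edge and that pruning means dropping any edge whose parent was not reached from the root at the previous level; the article's actual row layout and pruning rules may differ. `build_adjacency` and the sample edges are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[int, str, str, int]  # (level, parent page, child page, weight)
Key = Tuple[int, str]             # (level, page)


def build_adjacency(edges: List[Edge], root: str):
    """Build parent->child and child->parent tables, level by level.

    Edges whose parent was not reached at the previous level are pruned,
    so isolated nodes and incomplete paths never enter the tables.
    """
    by_level = defaultdict(list)
    for level, parent, child, weight in edges:
        by_level[level].append((parent, child, weight))

    children: Dict[Key, List[Tuple[str, int]]] = {}
    parents: Dict[Key, List[str]] = {}
    reachable = {root}
    level = 1
    while level in by_level and reachable:
        next_reachable = set()
        for parent, child, weight in by_level[level]:
            if parent not in reachable:
                continue  # pruned: not connected to the root
            children.setdefault((level, parent), []).append((child, weight))
            parents.setdefault((level, child), []).append(parent)
            next_reachable.add(child)
        reachable = next_reachable
        level += 1
    return children, parents


# Illustrative edges (not real data): "cart" is never reached from "home".
edges = [
    (1, "home", "video", 10),
    (1, "home", "search", 5),
    (2, "video", "profile", 4),
    (2, "cart", "pay", 3),
    (3, "pay", "done", 2),
]
children, parents = build_adjacency(edges, root="home")
```

The `children` table is what a Sankey renderer would consume as weighted links per level; the `parents` table supports walking back from any node toward the root.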

Technical Implementation: The system uses ClickHouse for its columnar storage and very fast query performance. The article also describes an optimization of the write path for distributed tables: by writing to local tables, with DNS round-robin used to spread the writes, the number of waiting TCP connections dropped by over 72% and the peak inbound traffic dropped by over 88%.
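The idea of spreading writes across local tables can be illustrated with a small round-robin planner. This sketch only simulates the rotation in-process; in the setup described above the rotation is done at the DNS level, and no ClickHouse client calls are shown. The host names and the `assign_batches` helper are hypothetical.

```python
from itertools import cycle
from typing import Dict, Iterable, List


def assign_batches(batches: Iterable[int], hosts: List[str]) -> Dict[str, List[int]]:
    """Assign write batches to hosts in round-robin order.

    Each host would receive its batches as inserts into its own local
    table, rather than every batch going through the distributed table.
    """
    plan: Dict[str, List[int]] = {host: [] for host in hosts}
    for host, batch in zip(cycle(hosts), batches):
        plan[host].append(batch)
    return plan


# Hypothetical node names and batch ids, for illustration only.
hosts = ["ch-node-1", "ch-node-2", "ch-node-3"]
plan = assign_batches(range(7), hosts)
```

Because each batch lands directly on one node, no single node has to receive all the data and forward it to the others, which is consistent with the reported drop in waiting connections and inbound traffic peaks.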

big data, data modeling, ClickHouse, OLAP, user behavior analysis, path analysis, Sankey Diagram, session analysis