Applying RisingWave to Real-Time Feature Engineering: Architecture, Capabilities, and Use Cases
This article introduces RisingWave, an open‑source streaming database. It explains how the SQL‑based interface, compute‑storage separation, UDF support, and materialized views enable efficient real‑time feature engineering, state management, and a range of downstream applications, and it summarizes the enhancements in RisingWave 2.0.
RisingWave is an open‑source streaming database written in Rust that combines a PostgreSQL‑compatible SQL interface with a compute‑storage separated architecture, supporting both real‑time and batch workloads.
The platform offers easy‑to‑use features such as SQL‑based data ingestion, rich UDF extensions, materialized views for incremental computation, and robust state management with exactly‑once guarantees and checkpointing.
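To make "materialized views for incremental computation" concrete, here is a minimal plain‑Python sketch of the idea. It is not RisingWave code and does not reflect its internals. The class name, method names, and the user_id/amount columns are hypothetical. The view keeps a small running state per key and updates it on each insert or delete, instead of rescanning all history on every query.

```python
from collections import defaultdict


class IncrementalAvgView:
    """Toy stand-in for a view like: SELECT user_id, AVG(amount) ... GROUP BY user_id.

    Hypothetical illustration only; RisingWave's real operators are not shown here.
    """

    def __init__(self):
        # Per-key running state: [row_count, running_sum].
        self._state = defaultdict(lambda: [0, 0.0])

    def apply(self, op, user_id, amount):
        """Apply one upstream change in O(1), without rescanning history."""
        if op not in ("insert", "delete"):
            raise ValueError(f"unknown operation: {op}")
        sign = 1 if op == "insert" else -1
        entry = self._state[user_id]
        entry[0] += sign
        entry[1] += sign * amount
        if entry[0] == 0:
            # Drop the state once a group has no rows left.
            del self._state[user_id]

    def query(self):
        """Serve the current result directly from the maintained state."""
        return {key: total / count for key, (count, total) in self._state.items()}


view = IncrementalAvgView()
view.apply("insert", 1, 10.0)
view.apply("insert", 1, 20.0)
view.apply("insert", 2, 5.0)
view.apply("delete", 1, 10.0)  # a retraction, e.g. from a CDC stream
print(view.query())
```

Because reads only touch the maintained state, serving a query costs the same no matter how many events have already been processed.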
Its three‑layer architecture consists of a frontend layer for query parsing, a compute layer for distributed stream processing, and a storage layer backed by object storage, coordinated by a meta node.
In real‑time feature engineering, RisingWave simplifies the pipeline by handling data ingestion from various sources (Kafka, CDC, file systems), data cleaning and selection via SQL functions, feature construction using aggregation, window, and join operations, and serving features through queryable materialized views.
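The cleaning and feature‑construction steps above can be sketched in plain Python. This is an illustration of the logic only, under stated assumptions: events are dictionaries with hypothetical fields user_id, ts (integer epoch seconds), and amount, and the feature names are invented for the example. In RisingWave itself this logic would be expressed in SQL rather than in application code.

```python
from collections import defaultdict


def tumbling_window_features(events, window_seconds=60):
    """Clean events, then aggregate per (user_id, window_start) tumbling window."""
    features = defaultdict(
        lambda: {"txn_count": 0, "txn_sum": 0.0, "txn_max": float("-inf")}
    )
    for event in events:
        amount = event.get("amount")
        # Cleaning: skip records with a missing or non-positive amount.
        if amount is None or amount <= 0:
            continue
        # Assign the event to the fixed-size window that contains its timestamp.
        window_start = event["ts"] - event["ts"] % window_seconds
        row = features[(event["user_id"], window_start)]
        row["txn_count"] += 1
        row["txn_sum"] += amount
        row["txn_max"] = max(row["txn_max"], amount)
    return dict(features)


events = [
    {"user_id": "u1", "ts": 5, "amount": 10.0},
    {"user_id": "u1", "ts": 59, "amount": 30.0},
    {"user_id": "u1", "ts": 60, "amount": 7.0},   # falls into the next window
    {"user_id": "u1", "ts": 61, "amount": None},  # dropped by cleaning
    {"user_id": "u2", "ts": 10, "amount": -1.0},  # dropped by cleaning
]
for key, row in sorted(tumbling_window_features(events).items()):
    print(key, row)
```

Each output row corresponds to one feature vector keyed by user and window, which is the shape a serving query would read.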
Advanced capabilities include multi‑stream joins (inner, outer, temporal, window), efficient state expiration, long‑term state storage, and the ability to expose internal state tables via SHOW INTERNAL TABLES for debugging.
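Why state expiration matters for stream joins can be shown with a small plain‑Python sketch. It is a simplified model, not RisingWave's join implementation. The class, the impression/click naming, and the max_gap parameter are all hypothetical, and it assumes timestamps arrive in order for each key. One side of the join is buffered as state, and a watermark lets the operator discard rows that can no longer match.

```python
from collections import defaultdict, deque


class IntervalJoin:
    """Join clicks to impressions on ad_id when the click follows within max_gap seconds."""

    def __init__(self, max_gap):
        self.max_gap = max_gap
        # Join state: ad_id -> impression timestamps, oldest first.
        self._impressions = defaultdict(deque)

    def on_impression(self, ad_id, ts):
        self._impressions[ad_id].append(ts)

    def on_click(self, ad_id, ts):
        """Return (ad_id, impression_ts, click_ts) for each buffered match."""
        buffered = self._impressions.get(ad_id, ())
        return [
            (ad_id, imp_ts, ts)
            for imp_ts in buffered
            if ts - self.max_gap <= imp_ts <= ts
        ]

    def advance_watermark(self, watermark):
        """Expire impressions too old to match any click at or after the watermark."""
        for ad_id in list(self._impressions):
            buffered = self._impressions[ad_id]
            while buffered and buffered[0] < watermark - self.max_gap:
                buffered.popleft()
            if not buffered:
                del self._impressions[ad_id]

    def state_size(self):
        return sum(len(buffered) for buffered in self._impressions.values())


join = IntervalJoin(max_gap=10)
join.on_impression("a", 0)
join.on_impression("a", 8)
join.on_impression("b", 1)
print(join.on_click("a", 12))   # only the impression at ts=8 is within 10 seconds
join.advance_watermark(12)
print(join.state_size())        # expired rows no longer occupy state
```

Without the expiration step the buffered side would grow without bound, which is the problem that state expiration and long‑term state storage address in different ways.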
RisingWave also supports creating UDFs and aggregates with CREATE FUNCTION and CREATE AGGREGATE, indexing materialized views for faster serving queries, and scaling streaming and serving components independently.
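The benefit of indexing a materialized view for serving can be sketched in plain Python. This is a conceptual model only: the row layout, column names, and helper functions are hypothetical and say nothing about how RisingWave stores indexes. The idea is that a lookup on a non‑key column reads a precomputed mapping instead of scanning every row.

```python
from collections import defaultdict


def build_index(rows, column):
    """Map each distinct value of `column` to the positions of the rows that hold it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index


def lookup(rows, index, value):
    """Fetch matching rows through the index instead of scanning all rows."""
    return [rows[position] for position in index.get(value, [])]


# Rows of a hypothetical feature view keyed by (user_id, window_start).
feature_rows = [
    {"user_id": "u1", "window_start": 0, "txn_count": 2},
    {"user_id": "u2", "window_start": 0, "txn_count": 1},
    {"user_id": "u1", "window_start": 60, "txn_count": 4},
]
by_user = build_index(feature_rows, "user_id")
print(lookup(feature_rows, by_user, "u1"))
```

The trade‑off is the usual one for secondary indexes: extra storage and maintenance work on every update in exchange for cheaper point lookups at serving time.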
Beyond feature engineering, the system can be used for real‑time monitoring, wide‑table generation, rule engines, and data marketplaces, with integrations to downstream sinks such as Redis, ClickHouse, StarRocks, Elasticsearch, and more.
RisingWave 2.0 introduces a Premium self‑hosted edition, an enhanced cloud version, unified streaming‑batch support, automatic schema change and mapping, optimized materialized view back‑fill, and other performance improvements.
DataFunTalk