Field Extraction and Read‑Time Modeling in the Honghu Data Platform
The Honghu data platform is a unified, UI‑driven environment that uses read‑time modeling to structure massive heterogeneous logs at query time. Instead of pre‑defined schemas and ETL pipelines, it applies rule‑based field extraction (regex, JSON, key‑value, IP) bound to data‑source types, trading CPU cycles for flexible analysis.
The Honghu system is a one‑stop heterogeneous data platform with an out‑of‑the‑box UI for data analysis, which simplifies the overall workflow. Its key technical feature is read‑time modeling, which structures massive heterogeneous data on demand during query execution, so that a schema does not have to be fixed before the data is ingested.
Why read‑time modeling? Traditional relational databases and data warehouses struggle with three main pain points in the cloud‑native big‑data era: massive data volumes that require high write throughput, log formats that change rapidly under micro‑service architectures, and the overhead of maintaining ETL pipelines to unify those formats. Read‑time modeling removes the need for pre‑defined schemas and ETL tasks: the system stores only raw data and minimal metadata, and generates enriched tables dynamically at query time.
Field extraction definition – In a read‑time modeling system, field extraction refers to the process of structuring and enriching raw data on the fly according to predefined extraction rules. It trades CPU cycles for flexible queries and reduces storage consumption.
Implementation principle – A field‑extraction rule set consists of multiple individual rules applied in a defined order. Each rule is bound to a specific data‑source type, allowing the system to apply the correct rule set when a query references that source.
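The ordering requirement can be illustrated with a minimal Python sketch. This is not Honghu's implementation: the `RuleSet` class, the `_raw` field name, the `app_log` source type, and both rule functions are hypothetical names chosen for illustration. Each rule receives the fields produced so far, so a later rule can build on the output of an earlier one.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Event = Dict[str, str]
Rule = Callable[[Event], Event]


@dataclass
class RuleSet:
    """Hypothetical ordered rule set bound to one data-source type."""
    source_type: str
    rules: List[Rule] = field(default_factory=list)

    def apply(self, event: Event) -> Event:
        enriched = dict(event)  # never mutate the stored raw event
        for rule in self.rules:  # rules run strictly in list order
            enriched.update(rule(enriched))
        return enriched


def split_header(event: Event) -> Event:
    """First layer: split the raw line into module and message."""
    m = re.match(r"\[(?P<module>\w+)\] (?P<message>.*)", event.get("_raw", ""))
    return m.groupdict() if m else {}


def parse_message(event: Event) -> Event:
    """Second layer: reads the 'message' field created by split_header."""
    m = re.search(r"user=(?P<user>\w+)", event.get("message", ""))
    return m.groupdict() if m else {}


app_rules = RuleSet("app_log", [split_header, parse_message])
result = app_rules.apply({"_raw": "[auth] login failed user=alice"})
print(result)
```

If the two rules are swapped, `parse_message` runs before any `message` field exists and extracts nothing, which is why the execution order is part of the rule set's definition.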
Extraction rule types – The platform currently supports four built‑in rule types:
Regular‑expression extraction: captures named groups from unstructured logs.
JSON extraction: parses embedded JSON strings into separate columns.
Key‑value extraction: splits "key=value" patterns into distinct fields.
IP‑address extraction: resolves IPs to geographic information.
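The four rule types can be approximated in a few lines of Python using only the standard library. This is an illustrative sketch, not the platform's own extractors: the function names are invented here, and `GEO_TABLE` is a made-up stand-in for a real geolocation database (its networks are reserved documentation ranges and its labels are placeholders).

```python
import ipaddress
import json
import re
from typing import Any, Dict


def extract_regex(raw: str, pattern: str) -> Dict[str, str]:
    """Regex extraction: each named group becomes a field."""
    m = re.search(pattern, raw)
    return m.groupdict() if m else {}


def extract_json(raw: str) -> Dict[str, Any]:
    """JSON extraction: parse the first embedded JSON object into fields."""
    start = raw.find("{")
    if start == -1:
        return {}
    try:
        obj, _end = json.JSONDecoder().raw_decode(raw[start:])
    except json.JSONDecodeError:
        return {}
    return obj if isinstance(obj, dict) else {}


def extract_kv(raw: str) -> Dict[str, str]:
    """Key-value extraction: split key=value tokens into fields."""
    return dict(re.findall(r"(\w+)=(\S+)", raw))


# Hypothetical lookup table standing in for a geolocation database.
GEO_TABLE = {
    ipaddress.ip_network("192.0.2.0/24"): "example-region-a",
    ipaddress.ip_network("203.0.113.0/24"): "example-region-b",
}


def extract_ip(value: str) -> Dict[str, str]:
    """IP extraction: map an address to a location label, if known."""
    try:
        addr = ipaddress.ip_address(value)
    except ValueError:
        return {}
    for network, location in GEO_TABLE.items():
        if addr in network:
            return {"ip_location": location}
    return {}
```

Each function returns a plain dictionary of new fields, so any of them can serve as one step in an ordered rule set.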
Data‑source type binding – Each rule application must be bound to a data‑source type (analogous to a table in a NoSQL namespace). This enables the system to apply the correct extraction logic for different log formats such as switch logs, firewall logs, or router logs.
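One way to picture the binding is a registry keyed by data‑source type that is consulted at query time. The sketch below is an assumption about the general shape, not the platform's API: the registry name, the `firewall` and `switch` type names, and the log formats are all hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical registry: each data-source type owns its ordered rules.
RULES_BY_SOURCE_TYPE: Dict[str, List["re.Pattern[str]"]] = {
    "firewall": [
        re.compile(r"action=(?P<action>\w+)"),
        re.compile(r"src=(?P<src_ip>\S+)"),
    ],
    "switch": [
        re.compile(r"port (?P<port>\d+) (?P<state>up|down)"),
    ],
}


def extract(source_type: str, raw: str) -> Dict[str, str]:
    """Apply only the rules bound to the event's data-source type."""
    fields: Dict[str, str] = {}
    for pattern in RULES_BY_SOURCE_TYPE.get(source_type, []):
        m = pattern.search(raw)
        if m:
            fields.update(m.groupdict())
    return fields


print(extract("firewall", "action=deny src=10.0.0.1 dst=10.0.0.2"))
print(extract("switch", "port 12 down"))
```

Because lookup is by type, a firewall rule never runs against a switch log, and a source type with no bound rules simply yields the raw event with no extra fields.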
Rule‑application UI – The UI allows users to view and edit rule applications, see execution order, and preview extraction results without writing code. Users can create, modify, or delete rules, and the system automatically binds them to the selected data‑source type.
Practical workflow example – A typical process includes: selecting target data, choosing a sample event, editing extraction rules (e.g., using UI‑driven regex generation), previewing enriched fields, and saving the rule set. The example demonstrates extracting time, module, and message fields from a log line, then further enriching IP and detail fields via additional rule layers.
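The layered extraction in this workflow can be mimicked in Python as a rough sketch. The sample log line, both regular expressions, and the `preview` helper are invented for illustration; the original example's exact log format is not reproduced here.

```python
import re
from typing import Dict

# Layer 1: split the raw line into time, module, and message.
BASE_RULE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<module>\w+)\] (?P<message>.*)$"
)
# Layer 2: runs on the extracted message, not on the raw line.
DETAIL_RULE = re.compile(r"from (?P<ip>\d{1,3}(?:\.\d{1,3}){3}): (?P<detail>.*)$")


def preview(raw: str) -> Dict[str, str]:
    """Return the enriched fields a user would see in a preview."""
    fields = {"_raw": raw}
    base = BASE_RULE.match(raw)
    if base:
        fields.update(base.groupdict())
        layered = DETAIL_RULE.search(fields["message"])
        if layered:
            fields.update(layered.groupdict())
    return fields


# Hypothetical sample event chosen for this sketch.
sample = "2023-05-01 08:30:00 [sshd] login failed from 192.0.2.10: invalid password"
for name, value in preview(sample).items():
    print(f"{name:8} {value}")
```

Previewing against a single sample event before saving, as the UI does, makes it easy to spot a rule that matches nothing: the corresponding fields are simply absent from the result.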
Performance comparison – Read‑time modeling incurs extra CPU cost compared with write‑time modeling, which pre‑materializes columns. However, for highly flexible or rapidly evolving data, read‑time modeling offers superior adaptability. Users can mitigate performance loss by using materialized views or pre‑queries when data structures are stable.
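The trade‑off can be made concrete by counting how often the extraction rule runs. The following Python sketch is a toy model under stated assumptions, not a benchmark of the platform: the `status=` log format, the function names, and the `CALLS` counter are all hypothetical.

```python
import re
from typing import Dict, List

STATUS_RULE = re.compile(r"status=(?P<status>\d{3})")
CALLS = {"extract": 0}  # counts rule executions, i.e. CPU work spent


def extract(raw: str) -> Dict[str, str]:
    CALLS["extract"] += 1
    m = STATUS_RULE.search(raw)
    return m.groupdict() if m else {}


def count_read_time(raw_events: List[str], status: str) -> int:
    """Read-time modeling: every query re-runs extraction on raw data."""
    return sum(1 for raw in raw_events if extract(raw).get("status") == status)


def materialize(raw_events: List[str]) -> List[Dict[str, str]]:
    """Materialized view: run extraction once and keep the columns."""
    return [extract(raw) for raw in raw_events]


def count_materialized(rows: List[Dict[str, str]], status: str) -> int:
    """Queries on the view read stored fields without re-extracting."""
    return sum(1 for row in rows if row.get("status") == status)


events = ["GET /a status=200", "GET /b status=500", "GET /c status=200"]
count_read_time(events, "200")
count_read_time(events, "500")
print("read-time extractions:", CALLS["extract"])

CALLS["extract"] = 0
view = materialize(events)
count_materialized(view, "200")
count_materialized(view, "500")
print("materialized extractions:", CALLS["extract"])
```

In this toy model the read‑time cost grows with the number of queries, while the materialized cost is paid once; the price is that the view must be rebuilt whenever the extraction rule or the log format changes.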
Q&A highlights – The Q&A covered the differences between field extraction at read time and at index time, the feasibility of IP‑based geolocation, best practices for custom data‑source types, and whether extraction rules can be edited after creation.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.