Tag

parquet

2 views collected around this technical thread.

DataFunTalk
May 29, 2025 · Databases

Introducing DuckLake: An Integrated Data Lake and Catalog Format Powered by SQL

DuckDB's DuckLake is an open‑standard, SQL‑driven data lake and catalog format that simplifies lakehouse architecture by managing metadata in a database while storing data in scalable Parquet files, offering multi‑user collaboration, time‑travel queries, and MIT licensing.

Data Lake · SQL · databases
0 likes · 4 min read
360 Smart Cloud
May 23, 2024 · Big Data

Archer Engine: Integrating Inverted Index with Iceberg for Scalable Big Data Log Analytics

The article introduces Archer, a new big‑data warehouse engine built on Iceberg that adds an inverted‑index mechanism using Tantivy to provide full‑text and JSON search, storage‑compute separation, and significant performance gains over traditional Elasticsearch and Iceberg connectors.

Archer Engine · Inverted Index · Performance Optimization
0 likes · 9 min read
Baidu Geek Talk
Jun 15, 2022 · Big Data

Replacing Classic Data Warehouse with a One‑Layer Wide Table Model: Architecture, Benefits, and Challenges

The article proposes replacing the traditional multi‑layered data‑warehouse architecture (ODS‑DWD‑DWS‑ADS) with a single column‑store wide table per business theme. It reports roughly 30% storage savings and faster queries, while acknowledging higher ETL complexity, back‑tracking costs, and production timing challenges.

ETL · Storage Optimization · big data
0 likes · 11 min read
Big Data Technology Architecture
Aug 24, 2021 · Big Data

An Overview of Apache Parquet: Architecture, Storage Model, and Comparison with ORC

This article provides a comprehensive introduction to Apache Parquet, covering its origins, columnar storage advantages, nested schema support, internal architecture, storage model components, comparison with ORC, and practical tools for inspecting Parquet files.

Hadoop · ORC Comparison · big data
0 likes · 10 min read
Big Data Technology Architecture
Apr 5, 2021 · Big Data

Understanding Apache Iceberg: Table Format Architecture, Comparison with Hive Metastore, and Business Benefits

This article introduces Apache Iceberg as an open table format for massive analytic datasets, explains its underlying concepts such as schema, partitioning, statistics, and read/write APIs, compares it with Hive Metastore, outlines its ACID commit process, highlights the performance and operational advantages for big‑data workloads, and previews upcoming community features.

ACID · Apache Iceberg · Data Lake
0 likes · 19 min read
Laravel Tech Community
Feb 28, 2021 · Big Data

Apache Beam 2.28.0 Release Highlights and New Features

Apache Beam 2.28.0 introduces extensive Parquet improvements, new hash functions in BeamSQL and ZetaSQL, ApproximateDistinct built on HLL, and enhanced I/O connectors: SpannerIO support for Numeric fields, Beam schema support in ParquetIO, Thrift support in KafkaTableProvider, and an option for HadoopFormatIO to skip key/value cloning, along with various other improvements.

Apache Beam · Data Processing · I/O Connectors
0 likes · 3 min read
DataFunTalk
Oct 29, 2020 · Big Data

Building a Large-Scale Near Real-Time Data Analytics Platform at Lyft Using Apache Flink

Lyft transformed its legacy data pipeline by designing a cloud‑native, Flink‑based near real‑time analytics platform that ingests billions of events, writes Parquet files to S3, leverages Presto for interactive queries, and implements multi‑stage non‑blocking ETL, fault‑tolerant back‑fill, and extensive performance optimizations.

AWS · Data Lake · ETL
0 likes · 12 min read
Big Data Technology Architecture
May 19, 2020 · Big Data

An Overview of Apache Parquet: Architecture, Features, and Comparison with ORC

Apache Parquet is a language‑agnostic columnar storage format for the Hadoop ecosystem that offers high compression, efficient I/O through column projection and predicate push‑down, nested‑structure support, and a three‑layer architecture. The article also compares it with ORC and covers tooling for schema inspection.

Apache Hadoop · Data Formats · ORC Comparison
0 likes · 9 min read
Big Data Technology Architecture
Apr 24, 2020 · Big Data

Kyligence Kylin on Parquet: Architecture, Engine Design, and Performance Evaluation

The article introduces Kyligence's Kylin on Parquet solution, explains its plug‑in architecture and the reasons for replacing HBase with Parquet, details the new Spark‑based build and query engines along with auto‑tuning, the global dictionary, and fault‑tolerance features, and presents performance comparisons with Kylin 3.0.

Apache Kylin · Performance Optimization · Spark
0 likes · 11 min read
Big Data Technology Architecture
Jun 9, 2019 · Big Data

An Introduction to Apache Parquet: Architecture, Data Model, File Format, and Basic Operations

This article provides a comprehensive overview of Apache Parquet, covering its purpose, architectural components, nested data model, file structure, practical Hive commands for creating and inspecting Parquet tables, and a brief introduction to the TPC‑DS benchmark for performance testing.

Hive · TPC-DS · big data
0 likes · 8 min read
Qunar Tech Salon
Feb 26, 2017 · Big Data

Comparative Analysis of Big Data Storage and Query Solutions

This article reviews major big‑data storage and query architectures—including HBase, Dremel/Parquet, pre‑aggregation systems, Lucene, and the custom Tindex solution—evaluating their strengths, weaknesses, and suitability for real‑time, high‑volume analytical workloads.

HBase · Lucene · Query
0 likes · 20 min read