Tagged articles

3697 articles

Jun 2, 2023 · Big Data

Iceberg Data Lake Implementation and Optimization at iQIYI

This article details iQIYI's adoption of the Iceberg data lake, covering its OLAP architecture, reasons for a lake, Iceberg table format advantages over Hive, platform construction, extensive performance optimizations, and real‑world business use cases such as ad‑flow unification, log analysis, audit, and CDC pipelines.

Big Data · Data Lake · Flink

0 likes · 18 min read

DevOps Cloud Academy

Jun 1, 2023 · Big Data

DataOps 2.0: Integrated Data Development and Governance Practices at NetEase

The article recounts NetEase’s presentation at the inaugural DataOps conference, detailing the evolution from the DataOps 1.0 pipeline to a 2.0 integrated data development‑governance model, the challenges faced, practical solutions, and strategic advice for data managers.

Big Data · Data Engineering · Data Management

0 likes · 11 min read

WeChat Backend Team

Jun 1, 2023 · Big Data

How WeChat Boosted Flink Stability with TaskManager Recovery and Load Balancing

This article details WeChat’s Gemini‑2.0 real‑time streaming platform built on Flink, explaining two key stability enhancements: a TaskManager‑level partial failure recovery that avoids data loss during node crashes, and a load‑balancing scheduler that evenly distributes tasks across TaskManagers to improve resource utilization and reduce latency.

Big Data · Flink · Stream Processing

0 likes · 16 min read

DataFunTalk

May 30, 2023 · Big Data

Optimizing Chart Query Performance in YouShu BI: Data Query Principles, Intelligent Caching, Query Merging, and Diagnostics

This article explains the data query fundamentals of YouShu BI charts, introduces intelligent caching design, describes query merging and various optimization techniques—including partition filters, value acceleration, and SQL generation—and outlines performance diagnosis methods to improve BI chart responsiveness.

BI · Big Data · Chart Performance

0 likes · 16 min read

Architects Research Society

May 28, 2023 · Big Data

Understanding Azure Synapse Analytics: An Integrated Data Lake and Data Warehouse Platform

This article examines Microsoft Azure Synapse Analytics, explaining how its unified framework combines data lake and data warehouse capabilities through components such as Pipelines, Dedicated SQL pools, Spark pools, and Serverless SQL, and evaluates its advantages over separate tools like Snowflake and Databricks.

Azure Synapse · Big Data · Cloud Analytics

0 likes · 7 min read

Architects Research Society

May 28, 2023 · Big Data

Databricks vs Snowflake: Comparing Data Lake and Data Warehouse Cloud Solutions

This article compares the cloud‑based analytics platforms Databricks and Snowflake, examining how Databricks serves as a data‑lake processing tool with emerging warehouse features while Snowflake operates as a scalable data‑warehouse that incorporates lake‑style capabilities, and discusses their complementary use cases.

Big Data · Cloud Analytics · Databricks

0 likes · 7 min read

StarRocks

May 26, 2023 · Big Data

How SeaTunnel’s StarRocks Connector Enables High‑Performance Data Sync

This article explains SeaTunnel’s architecture and its StarRocks connector, detailing source and sink features such as field projection, predicate push‑down, parallel reading, state recovery, data type mapping, Stream Load writes, CDC support, configuration examples, and future roadmap for exactly‑once semantics.

Big Data · Connector · Data Integration

0 likes · 16 min read

vivo Internet Technology

May 24, 2023 · Big Data

Kafka Real-time Data Archiving to Hive: Flink SQL and DataStream Implementation Solutions

The article explains how to archive Kafka real‑time data to Hive using either Flink SQL, which quickly creates partitioned ORC tables but requires timezone handling, or Flink DataStream for more complex pipelines, and offers best‑practice guidance on data quality, system complexity, security, and performance.

Big Data · DataStream · Flink

0 likes · 15 min read

DataFunTalk

May 23, 2023 · Big Data

Building a Millisecond‑Response Lakehouse Platform with Apache Iceberg: Architecture, Query Acceleration, and Intelligent Optimization

This article details Bilibili's technical practice of constructing a millisecond‑response lake‑warehouse platform using Apache Iceberg, covering the background challenges, unified architecture, multi‑dimensional sorting and indexing for query acceleration, the Magnus service for intelligent optimization, and the current production deployment and performance metrics.

Big Data · Cube · Iceberg

0 likes · 14 min read

Qunar Tech Salon

May 23, 2023 · Operations

Interview with Sun Bin on Qunar’s Technology Operations, AI Initiatives, and Technical Branding

In this interview, Qunar VP Sun Bin reflects on his 13‑year journey, the technology operations center’s pandemic‑driven innovations, the company’s AI committee and big‑data strategies, and the philosophy of pure technical branding within the ITCP alliance.

Artificial Intelligence · Big Data · Technical Branding

0 likes · 11 min read

DataFunTalk

May 22, 2023 · Big Data

Alibaba Cloud Data Lake: Unified Metadata and Storage Management Practices

This article explains Alibaba Cloud's data lake architecture, unified metadata services, storage management optimizations, and format handling techniques, illustrating how lakehouse concepts, multi‑engine support, and lifecycle policies enable efficient, secure, and cost‑effective big data processing in the cloud.

Big Data · Cloud Services · Data Lake

0 likes · 22 min read

Data Thinking Notes

May 21, 2023 · Information Security

Why Government Data Sharing Stalls and How a “Three‑Rights” Model Can Unlock It

The article analyzes why government data sharing often fails—citing legal, technical, security, and organizational hurdles—then outlines one‑to‑one and centralized sharing models, highlights four critical success factors, and proposes a “three‑rights” framework supported by blockchain to create trustworthy, sustainable inter‑departmental data exchange.

Big Data · Blockchain · Information Security

0 likes · 11 min read

IT Services Circle

May 21, 2023 · R&D Management

Interviewer’s Reflections: Evaluating Senior Candidates for Cloud and Big Data Positions

The article shares an interviewer's experience assessing senior candidates for cloud and big‑data roles, detailing candidate backgrounds, interview questions on algorithms, Java, Spring, and Kubernetes, the evaluation outcomes, and practical advice for both interviewers and senior engineers.

Big Data · Cloud Computing · Interview

0 likes · 11 min read

Big Data Technology & Architecture

May 19, 2023 · Big Data

Comprehensive Big Data Interview Q&A and Personal Project Summary

This article shares a recent graduate's successful job offer story, emphasizes preparing a detailed personal project summary, and provides extensive big‑data interview questions covering Hadoop, Spark, Flink, Kafka, Hive, ClickHouse, and related technologies to help candidates excel in interviews.

Big Data · Flink · Hadoop

0 likes · 15 min read

Data Thinking Notes

May 17, 2023 · Big Data

Inside Wing Pay’s Scalable Big Data Platform: Architecture & Governance

This article details how Wing Pay built a comprehensive data development and governance platform, covering company background, business scenarios, goals, challenges, task development workflow, task types, SparkSQL editor features, double‑environment deployment, Airflow scheduling, DataX data bus, resource isolation, compute optimization, data quality monitoring, cloud‑native practices, future outlook, and a Q&A on data permissions and governance.

Airflow · Big Data · Data Platform

0 likes · 17 min read

DataFunTalk

May 17, 2023 · Databases

Evolution of 360 Commercial Real-Time Data Warehouse and Apache Doris Deployment

This article details the three‑stage evolution of 360's real‑time data warehouse—from Storm + Druid + MySQL to Flink + Druid + TiDB and finally to Flink + Apache Doris—explaining architectural pain points, the reasons for choosing Doris, and how the new system delivers sub‑second query latency, strong consistency, and simplified operations across advertising scenarios.

Apache Doris · Big Data · Data Consistency

0 likes · 17 min read

Tongcheng Travel Technology Center

May 17, 2023 · Databases

StarRocks Production Practice at Tongcheng Travel: Architecture, Use Cases, and Technical Evaluation

This article details Tongcheng Travel’s production deployment of the StarRocks OLAP database, covering background, business scenarios, technical evaluation against ClickHouse and Greenplum, implementation with Flink SQL, real‑time analytics, offline reporting, CDP use cases, performance optimizations, and future cloud‑native plans.

Big Data · Data Warehouse · Flink

0 likes · 12 min read

WeChat Backend Team

May 17, 2023 · Big Data

Boosting Real-Time Recommendations: Apache Pulsar Optimizations at WeChat

This article details how WeChat's Gemini‑2.0 big‑data platform leverages Apache Pulsar, outlining cloud‑native advantages, load‑balancing refinements, cache and SSD tuning, high‑availability safeguards, and cost‑saving strategies that together enable large‑scale, real‑time, deep‑learning recommendation workloads.

Apache Pulsar · Big Data · Message queue

0 likes · 17 min read

Alibaba Cloud Big Data AI Platform

May 16, 2023 · Big Data

How Hash Cluster Tables Slash Shuffle Costs in MaxCompute Pipelines

This article explains how building hash cluster tables in MaxCompute can compress pre‑sorted data, enable shuffle removal, and dramatically reduce execution time and resource consumption for conversion attribution tasks.

Big Data · Data Warehouse · Hash Clustering

0 likes · 7 min read

Laravel Tech Community

May 15, 2023 · Big Data

Introducing DataEase: An Easy‑to‑Use Open‑Source BI Tool with Rich Features and Quick Deployment

The article reviews DataEase, a Chinese open‑source business‑intelligence platform that offers a low‑learning‑curve interface, extensive data‑source support, built‑in template marketplace, and Docker‑based one‑command installation, making data visualization and dashboard creation accessible to a broad range of users.

BI · Big Data · Data visualization

0 likes · 7 min read

DataFunTalk

May 15, 2023 · Big Data

Kuaishou Data Lake Construction with Apache Hudi: Architecture, Challenges, and Solutions

This article explains why Kuaishou built a data lake, describes its Hudi‑based architecture, outlines five major challenges encountered during implementation, and presents the solutions and future development plans, illustrating performance improvements and practical use cases across various business scenarios.

Apache Hudi · Big Data · Data Lake

0 likes · 19 min read

Alibaba Cloud Big Data AI Platform

May 15, 2023 · Big Data

Quickly Analyze Public Big Data Sets with Alibaba DataWorks & MaxCompute (Free Trial)

This step‑by‑step tutorial shows how to set up Alibaba Cloud DataWorks and MaxCompute, bind them together, and use free trial resources to explore public big‑data datasets such as Alibaba e‑commerce, GitHub events, and custom data with SQL queries and visualizations.

Alibaba Cloud · Big Data · Data Analysis

0 likes · 6 min read

Data Thinking Notes

May 14, 2023 · Big Data

Why Data Governance Matters: Boosting Data Quality and Business Value

Data governance is the overarching framework for evaluating, guiding, and supervising an organization’s data lifecycle, from collection to utilization. It ensures high data quality, compliance, and security, maximizes data value, and supports AI-driven initiatives, and it differs from data management and data control in taking a strategic, top‑down approach.

Big Data · Data Management · Data Quality

0 likes · 8 min read

DataFunTalk

May 11, 2023 · Big Data

Scaling ByteDance Feature Store to EB‑Level with Apache Iceberg: Architecture, Practices, and Future Roadmap

This article describes how ByteDance tackled petabyte‑scale feature storage by adopting Apache Iceberg, detailing the problem background, design choices, implementation of COW and MOR back‑fill strategies, performance optimizations, and future plans such as lake‑cold‑layering and materialized views.

Apache Iceberg · Big Data · Data Lake

0 likes · 16 min read

Amap Tech

May 11, 2023 · Artificial Intelligence

A 20‑Year Review of AI Infrastructure Milestones

Over the past two decades, AI infrastructure has evolved from early distributed storage and MapReduce to GPU programming, modern package managers, in‑memory processing, deep‑learning frameworks, parameter servers, AI compilers, synthetic data pipelines, open‑source model hubs, and today’s large‑scale Kubernetes‑based clusters, forming the essential foundation for every breakthrough.

AI Compilers · AI Infrastructure · Big Data

0 likes · 29 min read

Big Data Technology & Architecture

May 11, 2023 · Big Data

Remote State Backend for Flink: Design, Optimization, and Deployment with Taishan KV Store

This article describes the motivation, challenges, design, and performance optimizations of a remote state backend for Flink that leverages Bilibili's Taishan distributed KV store to achieve storage‑compute separation, lighter checkpoints, faster rescaling, and improved resource utilization in large‑scale streaming jobs.

Big Data · Flink · Performance Optimization

0 likes · 20 min read

DataFunTalk

May 9, 2023 · Databases

High‑Performance Inverted Index in Apache Doris for Log Data Storage and Analysis

This article explains how Apache Doris implements a high‑performance, column‑oriented inverted index to address the challenges of massive, real‑time log data storage and analysis, delivering dramatically higher write throughput, lower storage costs, and faster query performance than traditional Elasticsearch and Loki solutions.

Apache Doris · Big Data · Log Analytics

0 likes · 19 min read

Data Thinking Notes

May 7, 2023 · Big Data

How Financial Institutions Can Master Data‑Driven Transformation in 2024

This article examines two decades of data warehouse evolution in the financial sector, identifies persistent pain points such as platform lag, data quality, and low service efficiency, and proposes a cloud‑native, data‑centric framework—including a unified blueprint, three‑layer architecture, and six core capabilities—to accelerate enterprise‑wide data capability building and drive high‑quality digital growth.

Big Data · Data Platform · Digital Transformation

0 likes · 18 min read

DataFunSummit

May 7, 2023 · Big Data

Tencent SuperSQL: A Unified Adaptive Big Data Computing Platform

The article presents Tencent's SuperSQL platform, detailing the big‑data challenges of heterogeneous data sources and fragmented SQL experiences, describing its multi‑layer adaptive architecture, core technologies such as unified SQL parsing, cost‑based and history‑based optimization, federated computation, materialized views and security, and summarizing its performance gains, industry impact and community contributions.

Big Data · SQL optimization · SuperSQL

0 likes · 16 min read

WeiLi Technology Team

May 6, 2023 · Big Data

How We Upgraded Our Flink Cluster from 1.10 to 1.14.6 and Overcame Common Pitfalls

This article details the background of a Flink 1.10 cluster on Huawei Cloud, the technical challenges that prompted an upgrade, a step‑by‑step migration plan to Flink 1.14.6, troubleshooting of frequent errors, precautionary measures, and the performance and operational benefits achieved after the upgrade.

Big Data · CDC · Flink

0 likes · 19 min read

DataFunTalk

May 6, 2023 · Databases

Apache Doris: Overview, Data Lake Analysis Architecture, Community Development and Future Roadmap

This article provides a comprehensive overview of Apache Doris, detailing its origins, MPP‑based analytical capabilities, data‑lake integration techniques, recent architectural enhancements, performance optimizations, community growth, and upcoming development plans, while also addressing common user questions.

Apache Doris · Big Data · Data Lake

0 likes · 20 min read

MaGe Linux Operations

May 5, 2023 · Operations

How to Build a Flexible Kubernetes Monitoring System for Big Data with kube‑prometheus

This article explains how to design and implement a lightweight, flexible monitoring solution for big‑data components running on Kubernetes using kube‑prometheus, covering metric exposure methods, scrape configurations, alert rule design, exporter deployment, and practical examples with code snippets.

Alertmanager · Big Data · Prometheus

0 likes · 19 min read

DataFunTalk

May 5, 2023 · Big Data

NetEase Cloud Music Real-Time Data Warehouse Architecture and Low-Code Platform Practices

This article presents NetEase Cloud Music's real-time data warehouse architecture, covering its streaming and batch scenarios, layered design (ODS, CDM, ADS), technology stack choices, consistency mechanisms, the FastX low-code platform, and future development plans, offering a comprehensive technical overview for data engineers and architects.

Big Data · ClickHouse · Flink

0 likes · 18 min read

Big Data Technology & Architecture

May 5, 2023 · Big Data

Strategies for Handling Small Files in Hive and Spark

This article examines the causes and impacts of small file proliferation in Hive and Spark environments, and presents multiple mitigation techniques—including Spark 3 adaptive query execution, reducing reduce tasks, using DISTRIBUTE BY RAND(), post‑processing clean‑up, Hive and Spark configuration tweaks, and automated tooling—to improve performance and storage efficiency.

Big Data · Hive · Small Files

0 likes · 9 min read

Top Architect

May 4, 2023 · Big Data

Data Middle Platform: General Architecture and Core Components

The article explains the concept, benefits, and detailed modular architecture of a data middle platform, covering data storage, acquisition, processing, governance, security, and operation frameworks, and illustrates how enterprises can build and evolve such platforms to turn data into valuable services.

Big Data · Data Architecture · Data Integration

0 likes · 19 min read

DataFunTalk

May 3, 2023 · Big Data

Shuttle2.0: Enhancing Spark and Flink Shuffle with Distributed Sorting and Adaptive Broadcast

Shuttle2.0 extends OPPO's open‑source high‑availability Spark Remote Shuffle Service to support Flink, introduces a unified stream‑batch data model, pipelines shuffle with distributed sorting, and provides an Adaptive BroadcastJoin solution that dramatically improves performance and stability for large‑scale big‑data workloads.

Adaptive Broadcast · Big Data · Distributed Sorting

0 likes · 11 min read

Data Thinking Notes

Apr 25, 2023 · Operations

Why Data Quality Matters: A Practical Guide to Governance and Seven‑Dimensional Evaluation

This article explains why data quality is critical for businesses, outlines common data quality problems, their root causes, and presents a comprehensive governance framework—including monitoring rules, alerting, full‑link monitoring, and a seven‑dimensional evaluation model—to ensure high‑quality data delivery.

Big Data · Data Quality · Operations

0 likes · 12 min read

ITPUB

Apr 25, 2023 · Big Data

Top 8 Open‑Source ETL Tools for Data Migration and Integration

This article reviews eight widely used ETL and data‑migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their core features, architectures, supported data sources, and typical usage scenarios to help practitioners choose the right solution.

Big Data · Data Integration · Data Migration

0 likes · 13 min read

Python Programming Learning Circle

Apr 23, 2023 · Big Data

Parallel Processing of Large CSV Files in Python with multiprocessing, joblib, and tqdm

This tutorial demonstrates how to accelerate processing of a 2.8‑million‑row CSV dataset by using Python's multiprocessing, joblib, and tqdm libraries, covering serial, parallel, and batch processing techniques, performance measurements, and best‑practice code examples for efficient large‑scale data handling.

Big Data · Data Engineering · Python

0 likes · 9 min read

Big Data Technology & Architecture

Apr 23, 2023 · Big Data

Spark and Flink Optimization Guide: Parallelism, GC Tuning, Memory Settings, and Production Configurations

This article provides a comprehensive guide on optimizing Spark and Flink workloads, covering parallelism settings, garbage‑collection tuning, out‑of‑memory mitigation, and full production‑grade configuration examples for both frameworks.

Big Data · Flink · GC optimization

0 likes · 7 min read

Tongcheng Travel Technology Center

Apr 20, 2023 · Big Data

Apache Paimon in Practice: Replacing Hudi for Improved Write and Query Performance

Apache Paimon was adopted at Tongcheng Travel to replace Hudi, achieving three‑fold write speed gains and ten‑fold query acceleration, with detailed discussion of lakehouse challenges, performance issues, migration steps, configuration examples, and future plans for the platform.

Apache Paimon · Big Data · Flink

0 likes · 15 min read

Data Thinking Notes

Apr 19, 2023 · Big Data

How Bilibili Transformed Big Data Governance: From Reactive Storage Management to Proactive Multi‑Dimensional Control

This article details Bilibili's evolution of big data governance, describing the early data growth challenges, the launch of the "Wanglou" project, the development of asset metadata and governance indicator frameworks, storage cost reduction strategies, scoring models, and the shift from passive, single‑point fixes to proactive, multi‑dimensional governance across the organization.

Big Data · Bilibili · Cost Management

0 likes · 22 min read

Big Data Technology Architecture

Apr 19, 2023 · Big Data

Why the Big Data Era Is Over

The article argues that the era of big data is ending, showing that most organizations store only modest amounts of data, that storage costs outweigh benefits, and that modern cloud and analytics tools allow efficient processing without needing massive datasets.

Analytics · Big Data · Data Management

0 likes · 16 min read

Code Ape Tech Column

Apr 19, 2023 · Databases

Comparative Analysis of Elasticsearch and ClickHouse: Architecture, Query Performance, and Practical Benchmarks

This article compares Elasticsearch and ClickHouse by outlining their architectures, detailing deployment configurations, presenting benchmark queries and performance results, and concluding that ClickHouse generally outperforms Elasticsearch in many basic search and aggregation scenarios, while also noting each system's strengths and limitations.

Big Data · ClickHouse · Elasticsearch

0 likes · 13 min read

dbaplus Community

Apr 18, 2023 · Big Data

How Bilibili Scaled Its OLAP Platform with ClickHouse and Lakehouse Integration

At Bilibili, the OLAP platform evolved through three phases—consolidating data services onto ClickHouse, migrating text search to ClickHouse, and integrating a lake‑house architecture—delivering massive cost reductions, sub‑second query latency, and scalable analytics for billions of daily events.

Big Data · ClickHouse · OLAP

0 likes · 15 min read

DataFunTalk

Apr 18, 2023 · Big Data

Real-time OLAP with Apache Doris: Architecture, Use Cases, and Optimization at Dingdong Maicai

This article details Dingdong Maicai's adoption of Apache Doris as a real‑time OLAP engine, covering business requirements, comparative evaluation with ClickHouse, system architecture, practical applications such as real‑time analytics, B‑end queries, tag systems, and performance‑boosting techniques like Colocate Join, bitmap, prefix and Bloom‑filter indexes, materialized views, and streamlined Broker Load workflows.

Apache Doris · Big Data · Data Warehouse

0 likes · 19 min read

Huolala Tech

Apr 17, 2023 · Big Data

How HuoLala Accelerated Ad‑hoc Queries with a Hybrid Offline Engine

This article describes how HuoLala identified slow ad‑hoc query performance in its Hive‑on‑Tez stack, surveyed comparable industry solutions, and built a multi‑engine hybrid offline service that dramatically improves query latency, outlines its architecture, key design decisions, production impact, and future roadmap.

Big Data · Distributed Systems · SQL Routing

0 likes · 12 min read

Big Data Technology & Architecture

Apr 17, 2023 · Big Data

Comprehensive Guide to Data Governance and Data Asset Management

This article presents a detailed roadmap for enterprise data governance, covering business digitization goals, data governance construction, typical digital platform architecture, core governance actions, implementation pathways, data asset inventory techniques, and real‑world case studies to illustrate practical execution.

Big Data · Data Asset Management · Data Quality

0 likes · 18 min read

Data Thinking Notes

Apr 16, 2023 · Big Data

Mastering Data Asset Management: From Inventory to Value Realization

This article outlines a complete data asset management lifecycle—starting with data inventory, moving through governance, classification, responsibility, permission, and security, and culminating in value realization via basic services, profiling, and algorithmic models—providing practical guidance for building a robust big‑data platform.

Big Data · Data Quality · Master Data

0 likes · 10 min read

Efficient Ops

Apr 16, 2023 · Operations

How Capability Platforms Empower Intelligent Container Cloud Operations

At the 20th GOPS Global Operations Conference, China Mobile Jiangsu showcased how its capability platform leverages AI, big data, and blockchain to automate health scoring and intelligent inspection, dramatically improving container‑cloud operational efficiency and paving the way for smarter, SRE‑driven DevOps practices.

Artificial Intelligence · Big Data · Capability Platform

0 likes · 5 min read

ITPUB

Apr 15, 2023 · Big Data

How Bilibili Turned Big Data Governance from Reactive to Proactive

This article details Bilibili's journey from a late‑started, reactive big‑data platform to a mature, proactive governance system that combines asset metadata, metric‑driven strategies, cost‑aware billing, and automated tooling to achieve massive storage savings and operational efficiency across the organization.

Big Data · Operational Efficiency · Storage Management

0 likes · 22 min read

JD Retail Technology

Apr 14, 2023 · Big Data

Understanding Data Skew and Its Mitigation in Hive and Spark

This article explains the concept of data skew, its symptoms such as slow tasks and OOM errors, and provides comprehensive mitigation techniques and configuration examples for Hive and Spark, including custom partitioning, map joins, adaptive execution, and key detection methods.

Adaptive Execution · Big Data · Data Skew

0 likes · 15 min read

DataFunSummit

Apr 14, 2023 · Big Data

An Overview of User Profiling: Definitions, Elements, Types, Dimensions, Applications, and Development Process

This article provides a comprehensive introduction to user profiling, covering its definition, key elements, classification types, common dimensions, practical application scenarios, lifecycle considerations, development workflow, and validation methods for building effective data‑driven user models.

Big Data · Data Analysis · Marketing

0 likes · 10 min read

DataFunTalk

Apr 13, 2023 · Big Data

Four Paradigms of StarRocks Lakehouse Integration and an Overview of StarRocks 3.0

This article explains why lake‑warehouse integration is needed, outlines its challenges, describes StarRocks' four integration paradigms—including query acceleration, layered modeling, real‑time warehouse‑lake fusion, and the cloud‑native 3.0 solution—and previews the upcoming StarRocks 3.0 release.

Big Data · Data Lake · Data Warehouse

0 likes · 18 min read

Data Thinking Notes

Apr 12, 2023 · Big Data

Building an End‑to‑End Data Governance System: Challenges, Solutions & Impact

This article details DataCake's data‑governance journey, covering the problems of data silos, unclear costs, and tool fragmentation, then explains the strategic thinking, the multi‑layered solution architecture, and the measurable outcomes such as higher resource utilization and reclaimed storage.

Big Data · cost analysis · data governance

0 likes · 17 min read

DataFunSummit

Apr 10, 2023 · Big Data

Spark on Kubernetes: Practices and Optimizations at Eggplant Technology

This article explains how Spark can be effectively deployed on Kubernetes, covering its advantages over traditional Hadoop clusters, the principles of Spark on K8s, dynamic allocation, PVC reuse enhancements, scheduling optimizations, and real‑world performance results from Eggplant Technology's production use.

Big Data · Scheduling · performance-optimization

0 likes · 21 min read

Big Data Technology & Architecture

Apr 10, 2023 · Big Data

Fine‑grained Configuration, State Migration, and Debugging Techniques for Flink SQL at Meituan

This article describes how Meituan addresses the rapid growth of Flink SQL jobs by introducing fine‑grained TTL and concurrency settings, an editable execution plan for state migration, pre‑analysis compatibility checks, and a bytecode‑instrumented debugging system that captures operator data and streams it to Kafka for analysis.

Big Data · Debugging · Flink

0 likes · 24 min read

DataFunTalk

Apr 10, 2023 · Big Data

Interview on Data Lakehouse: Current Applications, Challenges, and Evolution

This interview with NetEase data‑lake technology manager Ma Jin explains the distinction between data lakes and lakehouses, reviews the evolution of table‑format technologies such as Iceberg, Hudi and Delta Lake, evaluates feature maturity and performance trade‑offs, and discusses systematic versus non‑systematic adoption in enterprises.

Big Data · Data Lakehouse · Delta Lake

0 likes · 13 min read

360 Tech Engineering

Apr 10, 2023 · Big Data

Performance Tuning and Stability Analysis of Large Offline Apache Flink Jobs

This article examines how to run large offline Apache Flink jobs stably by analyzing task slot and resource configurations, CPU‑to‑slot ratios, and memory usage, offering practical recommendations to improve speed, reduce resource consumption, and avoid Hadoop‑related failures.

Apache Flink · Big Data · Resource Tuning

0 likes · 10 min read

Data Thinking Notes

Apr 9, 2023 · Big Data

Why Data Quality Is the Hidden Driver of Big Data Success

In the big‑data era, high‑quality data are essential for reliable analytics, and this article explains data‑quality concepts, key dimensions, analysis methods for missing values, outliers, inconsistencies and duplicates, as well as practical management practices to ensure data assets become a competitive advantage.

Big Data · Data Analysis · Data Management

0 likes · 15 min read

ITPUB

Apr 9, 2023 · Big Data

How Meituan Optimized Flink SQL: Fine‑Grained Config, State Migration, and Debugging

This article details Meituan's implementation of Flink SQL at scale, covering fine‑grained job configuration, state‑TTL management, state‑migration techniques for job upgrades, a custom debugging tool for correctness issues, and future directions for Flink SQL enhancements.

Big Data · Debugging · Flink

0 likes · 24 min read

DataFunSummit

Apr 9, 2023 · Big Data

Expert Interview: Architecture and Trends of Big Data Platforms

This article presents a comprehensive interview with several big‑data platform experts, outlining the core components such as data integration, storage and computation, distributed scheduling, and query analysis, while also highlighting current challenges, best‑practice tools, and future trends in big‑data architecture.

Big Data · Data Integration · Distributed computing

0 likes · 10 min read

DataFunTalk

Apr 9, 2023 · Big Data

Building an Agile Business Intelligence Platform at Zhongyuan Bank: Architecture, Practices, and Future Outlook

The article details Zhongyuan Bank's end‑to‑end agile BI platform construction, covering business goals, a step‑by‑step development timeline, core architecture, eight key functionalities, low‑code data processing, real‑time streaming, visualization dashboards, intelligent Q&A, and future directions for platform intelligence and openness.

BI · Big Data · Data Platform

0 likes · 19 min read

ITPUB

Apr 8, 2023 · Big Data

How Bilibili Cut Data Pipeline Costs by 20% with Flink Real‑Time Incremental Computing

Facing daily terabyte‑scale data ingestion and costly duplicate reads in its ODS‑to‑DWD pipeline, Bilibili introduced a Flink‑based real‑time incremental computation and multi‑level partition shuffling, dramatically reducing read amplification, cutting resource usage by ~20%, reducing latency to minutes, and enhancing scalability.

Big Data · Flink · Real-time Processing

0 likes · 19 min read

dbaplus Community

Apr 8, 2023 · Big Data

How Zhihu Built a Scalable DMP: Architecture, Data Pipelines, and Real‑Time Targeting

This article details Zhihu's Data Management Platform (DMP), covering the business problems it solves, the end‑to‑end workflow, feature taxonomy, system architecture, data pipelines for batch and streaming, audience targeting processes, performance challenges, and future technical directions.

Big Data · DMP · Data Platform

0 likes · 8 min read

DataFunTalk

Apr 7, 2023 · Big Data

Introducing Apache Paimon: An Open‑Source Streaming Lakehouse Storage Engine

Apache Paimon is an open‑source streaming data lake storage system that combines LSM‑based real‑time updates, open file formats, and deep integration with Flink, Spark, and Trino to deliver high‑throughput ingestion, low‑latency queries, and unified batch‑stream processing for modern big‑data workloads.

Apache Paimon · Big Data · Flink

0 likes · 7 min read

Data Thinking Notes

Apr 5, 2023 · Big Data

Mastering Data Governance: From Challenges to End‑to‑End Solutions

This article explores the key problems data governance aims to solve, outlines a comprehensive governance framework, and details practical implementation steps—including tool integration, metadata management, lake‑in and lake‑out processes, and governance policies—to achieve a closed‑loop, value‑driven data ecosystem.

Big Data · Data Lake · Data Quality

0 likes · 13 min read

Big Data Technology & Architecture

Apr 4, 2023 · Big Data

Understanding Flink’s Data Flow: Buffer Pools, Network Transfer, and Credit‑Based Flow Control

This article explains Flink’s internal data abstraction and transfer mechanisms, detailing how data moves between operators via network buffers, the role of ByteBuffer and NetworkBufferPool, the serialization process, Netty integration, and credit‑based flow control to handle backpressure.

Big Data · Credit-based Flow Control · Data Flow

0 likes · 10 min read

DataFunTalk

Apr 4, 2023 · Big Data

Upgrading Hangzhou Bank Consumer Finance Big Data Platform with Apache Doris 1.2: Architecture, Performance Gains, and Integration

This article details how Hangzhou Bank Consumer Finance modernized its big‑data platform by introducing Apache Doris 1.2, replacing the original Greenplum + CDH architecture, unifying data sources via Multi‑Catalog, achieving second‑level query latency, reducing storage and compute costs, and outlining the integration workflow with DolphinScheduler, SeaTunnel, and Spark.

Apache Doris · Big Data · Data Integration

0 likes · 20 min read

DataFunTalk

Apr 4, 2023 · Big Data

Compass: An Open‑Source Big Data Task Diagnosis Platform for DolphinScheduler, Airflow and Spark

Compass is an open‑source big‑data diagnostic platform developed by OPPO that provides non‑intrusive, real‑time monitoring and root‑cause analysis for offline and streaming tasks on schedulers such as DolphinScheduler and Airflow. It covers workflow‑level failures, Spark engine anomalies, and resource usage, and offers one‑click reports and extensible rule‑based diagnostics.

Big Data · DolphinScheduler · Spark

0 likes · 13 min read

Bilibili Tech

Apr 4, 2023 · Big Data

How Bilibili’s Flink‑Based Real‑Time Incremental Pipeline Cuts Costs and Boosts Latency

This article details Bilibili’s migration from a Spark‑based offline ODS‑to‑DWD sharding process to a Flink real‑time incremental pipeline, explaining the background challenges, the design of multi‑level partitioning, small‑file optimizations, stability enhancements, and the measurable performance gains achieved.

Big Data · Data Warehouse · Flink

0 likes · 19 min read

DataFunSummit

Apr 3, 2023 · Big Data

Evolution and Architecture of Data Lineage in Volcano Engine DataLeap

This article outlines the background, development stages, architectural evolution, key features such as incremental updates and quality metrics, and future directions of the data lineage capability within Volcano Engine's DataLeap big‑data governance platform.

Big Data · DataLeap · Metadata

0 likes · 18 min read

dbaplus Community

Apr 2, 2023 · Big Data

Unlock Faster ODPS SQL: Proven UNION, COUNT DISTINCT, and Join Optimizations

This article walks through common ODPS SQL scenarios—union, count distinct, large‑table joins, mapjoin, and predicate placement—explains why naïve implementations can be inefficient, shows how to read and interpret execution plans, and provides concrete rewritten queries that dramatically improve performance and resource usage.

Big Data · COUNT DISTINCT · MapJoin

0 likes · 17 min read

Liulishuo Tech Team

Mar 31, 2023 · Big Data

Understanding and Experimenting with the Data Warehouse Toolbox: Dimensional Modeling

This article explains the concepts, key characteristics, terminology, and practical steps of dimensional modeling—including star and snowflake schemas—and demonstrates how to apply the methodology to a real‑world sales analysis scenario, while also discussing common challenges in building star‑schema models.

Big Data · Data Warehouse · Star Schema

0 likes · 13 min read

DataFunSummit

Mar 31, 2023 · Big Data

Data Governance Practices and Implementation at DataCake

The article outlines DataCake's data governance journey, describing the challenges of data silos and cost inefficiencies, the strategic thinking behind a unified metadata platform, the implementation of governance tools, cost analysis modules, and asset inventory, and concludes with results, future plans, and a Q&A session.

Big Data · Operational Efficiency · cost analysis

0 likes · 14 min read

HomeTech

Mar 31, 2023 · Artificial Intelligence

Digital Transformation of Used‑Car Buying: Integrated Data, AI Valuation, and VR Visualization

The article describes how a comprehensive digital platform combines structured, semi‑structured, and panoramic data with machine‑learning valuation models, natural‑language processing, and VR technology to make used‑car condition information transparent, improve estimation accuracy, and enhance user decision‑making in the Chinese second‑hand car market.

AI valuation · Big Data · Data Integration

0 likes · 15 min read

Big Data Technology & Architecture

Mar 30, 2023 · Big Data

Apache Paimon (Incubating): A Streaming Lakehouse Storage Project Overview

Apache Paimon, newly incubated by the Apache Software Foundation, combines Flink's real‑time streaming capabilities with open lakehouse storage formats, offering high‑throughput, low‑latency data ingestion, partial‑update merges, and seamless integration with engines like Flink, Spark, and Trino for unified batch and streaming analytics.

Apache Paimon · Big Data · Data Lake

0 likes · 7 min read

ITPUB

Mar 28, 2023 · Big Data

How We Turned a Hive Data Warehouse into a Real‑Time Lakehouse with Apache Hudi

This article details the migration from a traditional Hive‑based data warehouse to a lakehouse architecture using Apache Hudi, covering the original Lambda setup, its pain points, lake‑vs‑warehouse differences, Hudi features, integration challenges, practical solutions, and future roadmap.

Apache Hudi · Big Data · Data Warehouse

0 likes · 11 min read

Huawei Cloud Developer Alliance

Mar 28, 2023 · Databases

What’s Next for Data Warehouses? From History to Future Trends

This article reviews the origins, core characteristics, and traditional and logical architectures of data warehouses, explores emerging trends such as massive real‑time data, and outlines the evolution of Huawei Cloud GaussDB(DWS) toward a cloud‑native, elastic, lake‑warehouse integrated solution.

Big Data · Data Integration · Data Warehouse

0 likes · 8 min read

DataFunTalk

Mar 28, 2023 · Big Data

Big Data Challenges and Serverless Data Solutions: Insights from an AWS Data Architect

The article examines the evolution of big‑data technologies, outlines the operational, cost and security challenges enterprises face, and presents serverless data—particularly AWS’s cloud‑native services—as a scalable, low‑cost solution that eliminates maintenance while enabling real‑time processing and advanced analytics.

Big Data · Cloud Computing · Serverless

0 likes · 16 min read

Big Data Technology & Architecture

Mar 27, 2023 · Big Data

Key Updates in Apache Flink 1.17: Batch and Streaming Enhancements

The article reviews Apache Flink 1.17's major batch and streaming improvements, including new Delete/Update APIs, performance boosts, SQL client gateway, checkpoint and watermark enhancements, StateBackend upgrades, and practical use‑case scenarios for data engineers.

Apache Flink · Batch processing · Big Data

0 likes · 7 min read

Baidu Geek Talk

Mar 27, 2023 · Big Data

Precise Watermark Design and Implementation in Baidu's Unified Streaming-Batch Data Warehouse

The article details Baidu's precise watermark design for its unified streaming‑batch data warehouse, describing how a centralized watermark server and client ensure end‑to‑end data completeness, align real‑time and batch windows with 99.9‑99.99% precision, and support accurate anti‑fraud calculations within the broader big‑data ecosystem.

Apache Flink · Baidu · Big Data

0 likes · 14 min read

macrozheng

Mar 27, 2023 · Big Data

Top 8 Open-Source ETL Tools for Efficient Data Migration

This guide reviews eight popular ETL and data migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their core features, architectures, and use cases to help engineers choose the right solution for reliable data integration.

Big Data · Data Integration · Data Migration

0 likes · 14 min read

Data Thinking Notes

Mar 26, 2023 · Big Data

Why Data Governance Is the Key to Unlocking Your Data’s True Value

This article explains how effective data governance transforms raw data into a trusted enterprise asset, outlines common pitfalls such as backward and passive governance, and presents a structured, four‑phase approach—including organizational setup, standards, platform selection, and continuous operations—to successfully implement data governance at scale.

Big Data · Data Management · Data Quality

0 likes · 10 min read

ITPUB

Mar 25, 2023 · Big Data

Mastering Efficient SQL in ODPS: Union, Count‑Distinct, and Join Optimizations

This article walks through common SQL development scenarios on ODPS, examining why naïve UNION and COUNT DISTINCT can be slow, how to rewrite queries with GROUP BY, UNION ALL, JSON aggregation, and map‑join techniques, and shows the resulting execution‑plan improvements with concrete code and performance numbers.

Big Data · CountDistinct · MapJoin

0 likes · 17 min read

Su San Talks Tech

Mar 24, 2023 · Big Data

Top 8 Open-Source ETL Tools You Should Know for Efficient Data Migration

Explore a comprehensive overview of eight popular ETL and data migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their features, architectures, and use cases to help you choose the right solution for efficient data integration.

Big Data · Data Integration · Data Migration

0 likes · 13 min read

Volcano Engine Developer Services

Mar 22, 2023 · Fundamentals

How ByteDance Scales Data Governance: Challenges, Distributed Solutions, and Best Practices

This article examines ByteDance's data governance journey, outlining business, organizational, and cultural challenges, the six-stage evolution framework, real‑world case studies, and the shift from centralized to distributed autonomous governance to improve quality, security, cost, and team efficiency.

Big Data · Data Quality · Operations

0 likes · 18 min read

DataFunTalk

Mar 21, 2023 · Databases

Design and Technical Details of Apache Doris for Lakehouse Architecture

This article explains how Apache Doris extends its real‑time OLAP capabilities to support Lakehouse architectures, covering unified metadata, query acceleration, elastic compute, performance benchmarks, and future roadmap for richer data‑source integration and resource isolation.

Apache Doris · Big Data · Data Warehouse

0 likes · 20 min read

Big Data Technology & Architecture

Mar 20, 2023 · Big Data

Using SparkSQL to Connect and Operate with Apache Hudi: Configuration, Table Creation, Data Manipulation, and Deletion

This guide demonstrates how to configure Hive metastore, connect SparkSQL to Apache Hudi, create COW and MOR tables, perform insert, update, merge, delete, and insert‑overwrite operations, and illustrates each step with executable code snippets and sample results.

Apache Hudi · Big Data · Data Lake

0 likes · 14 min read

Data Thinking Notes

Mar 19, 2023 · Big Data

Why Data Quality Is the Key to Successful Big Data Initiatives

The article explains that while big data aims to boost organizational insight and innovation, its true value depends on high data quality. It outlines industry standards, identifies technical, business, and management causes of poor quality, and proposes a three‑phase strategy of prevention, monitoring, and post‑improvement to ensure reliable data for decision‑making.

Big Data · Data Quality · Standards

0 likes · 21 min read

DataFunSummit

Mar 16, 2023 · Artificial Intelligence

Construction of Real‑World Medical Knowledge Graphs and Clinical Event Graphs

The article describes how YiduCloud builds real‑world medical knowledge graphs and clinical event graphs from heterogeneous hospital systems (EMR, HIS, LIS, RIS). The pipeline covers data aggregation, de‑identification, quality control, NLP‑driven entity extraction, standardisation, graph construction, cleaning, and embedding, and supports AI‑powered applications such as decision support, intelligent diagnosis, automated medical‑record generation, and patient recruitment.

AI · Big Data · Medical Knowledge Graph

0 likes · 21 min read

Alibaba Cloud Developer

Mar 16, 2023 · Big Data

How SLS’s Schema‑on‑Read Scanning Boosts Log Analytics Flexibility and Cuts Costs

This article explains the motivation, design, and implementation of Alibaba Cloud's SLS Schema‑on‑Read scanning mode, showing how it enables SQL analysis on raw log data without pre‑built indexes, improves flexibility for evolving schemas, and reduces storage and index costs in various log‑analysis scenarios.

Big Data · Columnar Storage · Log Analytics

0 likes · 27 min read

Bilibili Tech

Mar 14, 2023 · Big Data

Bilibili HDFS Erasure Coding Strategy and Implementation

Bilibili reduced petabyte‑scale storage costs by back‑porting erasure‑coding patches to its HDFS 2.8.4 cluster, deploying a parallel EC‑enabled cluster, adding a data‑proxy service, intelligent routing and block‑checking, and automating cold‑data migration, while noting write overhead and planning native acceleration.

Big Data · Data Reliability · Distributed Systems

0 likes · 14 min read

Open Source Linux

Mar 14, 2023 · Big Data

Can Data Lakes and Data Warehouses Coexist? Exploring the Lake‑Warehouse Fusion

This article traces 20 years of big‑data evolution, compares data lakes and data warehouses, defines both concepts, examines their technical trade‑offs, and presents Alibaba Cloud’s lake‑warehouse (lakehouse) solution that unifies flexible storage with enterprise‑grade performance and governance.

Big Data · Cloud Computing · Data Lake

0 likes · 32 min read

ITPUB

Mar 13, 2023 · Big Data

What’s New in Apache Kyuubi 1.6.0? Server, Client, and Engine Enhancements

Apache Kyuubi 1.6.0 introduces major server‑side upgrades such as batch JAR task submission with RESTful APIs and a metadata store for HA, client‑side improvements including a unified JDBC driver and enhanced Beeline, plus mature Spark, Flink, Trino, and Hive engine plugins, while outlining the community’s roadmap.

Big Data · Engine Plugins · Flink

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Mar 13, 2023 · Big Data

Unlocking Big Data with Alibaba Cloud’s Native Data Lake Solution

Alibaba Cloud’s cloud‑native data lake analysis solution combines fully managed storage (OSS‑HDFS), a one‑stop lake management platform (Data Lake Formation), and multimodal compute capabilities, delivering high performance, massive scalability, and low cost for big‑data and AI workloads across offline, real‑time, and lake‑house scenarios.

Analytics · Big Data · Data Lake

0 likes · 11 min read

Data Thinking Notes

Mar 12, 2023 · Big Data

Why Data Middle Platforms Are Evolving: New Trends in Data Governance and DataOps

The article examines how China's data middle platform concept is reshaping enterprise data strategy, highlighting a shift toward value‑driven adoption, the intertwined relationship with data governance, and emerging trends such as fine‑grained business governance, full‑link monitoring, integrated platforms, and DataOps.

Big Data · Data Middle Platform · DataOps

0 likes · 9 min read

DataFunTalk

Mar 12, 2023 · Big Data

Apache Kyuubi 1.6.0 Feature Overview and Enhancements

The article provides a comprehensive walkthrough of Apache Kyuubi 1.6.0, detailing server‑side enhancements such as batch (JAR) task submission, metadata store and unified API/authentication, client‑side improvements to the built‑in JDBC driver and Beeline, as well as engine plugins for Spark, Flink, Trino and Hive, and concludes with the community’s roadmap and statistics.

Apache Kyuubi · Batch processing · Big Data

0 likes · 12 min read

DataFunSummit

Mar 11, 2023 · Databases

Graph Database Storage and Knowledge Graph Practices – Forum Overview

The forum explores the rapid growth and complexity of knowledge graphs, addressing storage and computation challenges through expert talks on graph database storage, query languages, practical implementation, and large‑scale financial knowledge graph platforms, offering attendees deep technical insights and hands‑on guidance.

Big Data · Data Storage · Knowledge Graph

0 likes · 8 min read

DataFunSummit

Mar 9, 2023 · Big Data

Designing Efficient and Agile Real-Time Big Data Analytics Platforms for Enterprises

The article explains how enterprises can build a comprehensive big data analytics platform—covering data collection, storage, computation, and decision layers—by clarifying business scenarios, choosing appropriate on‑premise or cloud deployment, selecting suitable architectures such as Lambda/Kappa, and addressing component choices and emerging technical trends.

Big Data · Data Architecture · Real-time Analytics

0 likes · 9 min read
