
Overview of Apache Hive Features, Usage, and Management

Apache Hive is an open‑source data‑warehouse system built on Hadoop that enables users to read, write, and manage large distributed datasets using SQL‑like queries, offering features such as ETL support, various file‑format connectors, extensible UDFs, and integration with tools like Tez, Spark, and MapReduce.

Architects Research Society

Jul 27, 2018

The Apache Hive™ data‑warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL syntax.

Hive Features

Hive is built on Apache Hadoop™ and provides the following capabilities:

SQL‑based access to data for data‑warehouse tasks such as extract/transform/load (ETL), reporting, and analysis.

A mechanism to impose structure on various data formats.

Access to files stored directly in Apache HDFS™ or other storage systems like Apache HBase™.

Query execution via Apache Tez™, Apache Spark™, or MapReduce.

Procedural language support with HPL/SQL.

Sub‑second query retrieval through Hive LLAP, Apache YARN, and Apache Slider.

Hive provides standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics.

Hive SQL can be extended with user‑defined functions (UDF), user‑defined aggregation functions (UDAF), and user‑defined table‑generating functions (UDTF).
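The three extension points differ mainly in how many rows go in and how many come out. The sketch below illustrates those shapes in plain Python. It is not Hive's extension API (real Hive functions are typically written in Java against Hive's own interfaces), and the function names and sample rows are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def udf_upper(value: str) -> str:
    """UDF shape: one input value in, one output value out."""
    return value.upper()


def udaf_sum(values: Iterable[int]) -> int:
    """UDAF shape: many input rows in, one aggregated value out."""
    total = 0
    for v in values:
        total += v
    return total


def udtf_explode(row_id: int, items: List[str]) -> Iterator[Tuple[int, str]]:
    """UDTF shape: one input row in, zero or more output rows out."""
    for item in items:
        yield (row_id, item)


# Hypothetical sample rows: (id, comma-separated tags, amount)
rows = [(1, "a,b", 10), (2, "c", 5)]

upper = [udf_upper(tags) for _, tags, _ in rows]
total = udaf_sum(amount for _, _, amount in rows)
exploded = [
    out
    for rid, tags, _ in rows
    for out in udtf_explode(rid, tags.split(","))
]
```

Here `upper` keeps one value per row, `total` collapses all rows into one value, and `exploded` may hold more rows than the input.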

There is no single “Hive format” in which data must be stored. Hive includes built‑in connectors for comma‑ and tab‑separated values (CSV/TSV) text files, Apache Parquet™, Apache ORC™, and other formats.

Users can extend Hive with connectors for additional formats; see the developer guide for File Formats and Hive SerDe details.
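Because the storage format is decoupled from the table definition, structure is applied when the data is read rather than when it is written. The following is a minimal sketch of that idea in plain Python, assuming tab‑separated text input. It is a toy deserializer, not Hive's SerDe interface, and the schema representation and sample data are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A schema is an ordered list of (column name, cast function) pairs.
Schema = List[Tuple[str, Callable[[str], object]]]


def read_with_schema(
    lines: Iterable[str], schema: Schema, delimiter: str = "\t"
) -> List[Dict[str, object]]:
    """Apply a schema to raw delimited text at read time."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # zip stops at the shorter sequence, so extra fields are ignored.
        row = {name: cast(raw) for (name, cast), raw in zip(schema, fields)}
        rows.append(row)
    return rows


# Hypothetical raw file contents; the bytes themselves carry no types.
raw_lines = ["1\talice\t3.5\n", "2\tbob\t4.0\n"]

typed_schema: Schema = [("id", int), ("name", str), ("score", float)]
typed_rows = read_with_schema(raw_lines, typed_schema)

# The same bytes can be read under a different, narrower schema.
string_schema: Schema = [("id", str), ("name", str)]
string_rows = read_with_schema(raw_lines, string_schema)
```

The same raw lines yield differently typed rows depending on the schema supplied, which is the sense in which the data has no fixed format of its own.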

Hive is not suitable for online transaction processing (OLTP) workloads; it is best suited for traditional data‑warehouse tasks.

Hive is designed for maximum scalability (by dynamically adding more machines to a Hadoop cluster), performance, fault tolerance, and loose coupling with input formats.

Hive components include HCatalog and WebHCat.

HCatalog is a Hive component that serves as Hadoop’s table and storage management layer, allowing users of different data‑processing tools, including Pig and MapReduce, to read and write data on the grid more easily.

WebHCat provides a service for running Hadoop MapReduce (or YARN), Pig, and Hive jobs, and for performing Hive metadata operations, through an HTTP (REST‑style) interface.
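As a rough illustration of what a REST‑style call looks like, the sketch below only builds a request URL and sends nothing over the network. The port `50111`, the `/templeton/v1` path prefix, and the `user.name` query parameter are assumed defaults that should be checked against the WebHCat reference for your deployment. The host name, user, and helper function are hypothetical.

```python
from urllib.parse import quote, urlencode


def webhcat_url(host: str, resource: str, user: str, port: int = 50111) -> str:
    """Build a WebHCat-style URL for a resource path (no request is sent).

    Assumes the service is exposed under /templeton/v1 on the given port.
    """
    base = f"http://{host}:{port}/templeton/v1/"
    # Percent-encode each path segment separately so '/' stays a separator.
    path = "/".join(quote(part, safe="") for part in resource.strip("/").split("/"))
    query = urlencode({"user.name": user})
    return f"{base}{path}?{query}"


# Hypothetical host and user: a URL for listing tables in the default database.
url = webhcat_url("webhcat.example.com", "ddl/database/default/table", "alice")
```

An HTTP GET against such a URL would be a metadata operation, while job submission would use a POST to a different resource path.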

Hive Usage

Hive SQL language manual: commands, CLIs, data types, DDL (create/drop/alter/truncate/show/describe), statistics (analyze), indexes, archiving, DML (load/insert/update/delete/merge, import/export, explain plan), queries (select), operators and UDFs, locks, and authorization.

File formats and compression: RCFile, Avro, ORC, Parquet; compression options including LZO.

Procedural language: Hive HPL/SQL.

Hive configuration properties.

Hive clients (JDBC, ODBC, Thrift) and HiveServer2 (client and server, metrics).

Hive web interface.

Hive SerDes: Avro SerDe, Parquet SerDe, CSV SerDe, JSON SerDe.

Hive integrations: Accumulo, HBase, Druid.

Hive transactions, streaming data ingest, and streaming mutation API.

Hive counters.

Hive Management

Installation of Hive.

Configuration of Hive.

Setting up the Metastore.

Hive Schema Tool.

Setting up the Hive web interface.

Setting up Hive servers (JDBC, ODBC, Thrift, HiveServer2).

Hive replication.

Hive on Amazon Web Services.

Hive on Amazon Elastic MapReduce.

Hive on Spark.
