Using Apache Phoenix on CDH HBase: Installation, Configuration, and Secondary Index Creation
This article explains how to integrate Apache Phoenix with CDH‑based HBase, covering Phoenix overview, version selection, parcel installation, HBase configuration, command‑line usage, mapping existing tables, creating schemas and views, building secondary indexes, and comparing different index types for performance optimization.
Apache Phoenix is a high‑performance SQL engine built on top of HBase, providing both OLTP and OLAP capabilities and offering secondary‑index support for HBase tables.
Phoenix introduction
CDH HBase integration with Phoenix
Creating secondary indexes on HBase via Phoenix
Overview of Phoenix index types
Phoenix Introduction
Phoenix stores data in HBase and exposes a standard JDBC API. It supports standard SQL, full ACID transactions, and provides a secondary‑index solution for HBase. Phoenix also integrates with components such as Spark, Hive, and Flume.
CDH HBase Integration – Version Information
For CDH platforms, Phoenix‑4.14.0‑cdh5.{11,12,13,14}.2 is available as source, binary tarball, and parcel packages; the parcel format is recommended.
For newer CDH versions, follow the official Cloudera announcement; older versions can use CLABS_PHOENIX (up to Phoenix 4.7.0) from the Cloudera Labs parcel repository.
Installation
Download the appropriate parcel package and place it on an HTTP server.
[root@hadoop-01 /var/www/html/phoenix/4.14.0]$ ll
total 300524
-rw-r--r-- 1 root root 307722240 Feb 3 19:30 APACHE_PHOENIX-4.14.0-cdh5.11.2.p0.3-el7.parcel
-rw-r--r-- 1 root root 178 Feb 3 19:28 APACHE_PHOENIX-4.14.0-cdh5.11.2.p0.3-el7.parcel.sha512
-rw-r--r-- 1 root root      5081 Feb 3 19:30 manifest.json
Configuration
After installation, add the following properties to hbase-site.xml to enable Phoenix secondary indexes and namespace mapping:
<property>
  <name>hbase.regionserver.wal.codec</name>
  <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
</property>
<property>
  <name>phoenix.schema.isNamespaceMappingEnabled</name>
  <value>true</value>
</property>
<property>
  <name>phoenix.schema.mapSystemTablesToNamespace</name>
  <value>true</value>
</property>
Command‑Line Usage
After CDH installation, environment variables are set. Use phoenix-sqlline.py to start an interactive session:
[root@hadoop-01 ~]$ phoenix-
phoenix-performance.py  phoenix-psql.py  phoenix-sqlline.py  phoenix-utils.py
Run phoenix-sqlline.py to initialize Phoenix.
Mapping an Existing HBase Table
Assume an HBase table ns1000:operate_log with ~2.8 million rows and 14 columns.
hbase(main):017:0> count 'ns1000:operate_log', INTERVAL => 100000
2799827 row(s) in 173.4200 seconds
Creating a Phoenix Schema and View
0: jdbc:phoenix> create schema if not exists "ns1000";
0: jdbc:phoenix> use "ns1000";
0: jdbc:phoenix> create view "operate_log"(
pk varchar primary key,
"f"."appVersion" varchar,
"f"."city" varchar,
"f"."lat" varchar,
"f"."lon" varchar,
"f"."memberType" varchar,
"f"."time" varchar);Query the view to verify data:
0: jdbc:phoenix> select * from "operate_log" limit 1;
+------------------------+-------------+-------+---------------------+---------------------+-------------+----------------+
| pk | appVersion | city | lat | lon | memberType | time |
+------------------------+-------------+-------+---------------------+---------------------+-------------+----------------+
| 1538216707720          | 2.22.0      | 重庆市 | 29.577587127685547  | 106.50493621826172  | 0           | 1538216707720  |
+------------------------+-------------+-------+---------------------+---------------------+-------------+----------------+
Creating a Secondary Index on the time Column
0: jdbc:phoenix> create index index_operate_log_time on "operate_log" ("f"."time");
2799827 rows affected (95.814 seconds)
After the index is built, a time‑range query on "f"."time" completes in ~50 ms:
0: jdbc:phoenix> select count(*) from "operate_log" where "f"."time" between '1538216707720' and '1538223834000';
+-----------+
| COUNT(1) |
+-----------+
| 5883 |
+-----------+
1 row selected (0.049 seconds)
Phoenix Index Types
Covered Indexes
A covered index stores selected columns directly in the index table, allowing queries that only need those columns to be satisfied without accessing the base table.
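As a sketch against the view created above (the index name and the choice of memberType as the covered column are illustrative), a covered index can be declared with an INCLUDE clause so that queries touching only those columns never read the base table:

```sql
-- Covered index: "memberType" is stored in the index table alongside "time"
CREATE INDEX index_operate_log_time_covered ON "operate_log" ("f"."time")
    INCLUDE ("f"."memberType");

-- Both columns live in the index table, so this query can be
-- answered entirely from the index
SELECT "memberType" FROM "operate_log"
WHERE "f"."time" BETWEEN '1538216707720' AND '1538223834000';
```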
Functional Indexes
Introduced in Phoenix 4.3, functional indexes allow arbitrary expressions (e.g., SUBSTR("f"."time", 1, 10)) to be indexed, so queries that use the same expression can be served from the index.
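For example, indexing a prefix of the timestamp column from the view above (index name illustrative):

```sql
-- Index the expression itself; a query whose predicate uses the
-- identical expression can be evaluated against the index
CREATE INDEX index_operate_log_time_prefix
    ON "operate_log" (SUBSTR("f"."time", 1, 10));

SELECT COUNT(*) FROM "operate_log"
WHERE SUBSTR("f"."time", 1, 10) = '1538216707';
```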
Global Indexes
Global indexes store index data in a separate table; they are suitable for read‑heavy workloads but add network overhead on writes.
Local Indexes
Since Phoenix 4.8.0, local indexes keep index data in a dedicated column family of the same table, reducing write overhead but potentially degrading read performance; they are better for write‑heavy scenarios.
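Syntactically, a local index differs from the global index created earlier only by the LOCAL keyword (a sketch against the same view; the index name is illustrative):

```sql
-- Local index: index data is co-located with the table's own regions,
-- avoiding the cross-region network hop that global index writes incur
CREATE LOCAL INDEX local_index_operate_log_time ON "operate_log" ("f"."time");
```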
Summary
Phoenix provides a SQL layer on top of HBase, offering standard SQL queries and powerful secondary‑index capabilities. Global indexes favor read‑heavy use cases, while local indexes are better for write‑heavy workloads. Use EXPLAIN to verify whether a query utilizes an index.
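For instance, running EXPLAIN on the time‑range query from above should show a range scan over the index table rather than a full scan of the base table (the exact plan text varies by Phoenix version; this is the general shape to look for):

```sql
EXPLAIN SELECT COUNT(*) FROM "operate_log"
WHERE "f"."time" BETWEEN '1538216707720' AND '1538223834000';
-- A plan that references INDEX_OPERATE_LOG_TIME with a RANGE SCAN means
-- the index is used; a FULL SCAN over the base table means it is not.
```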
References:
HBase integration with Phoenix – view vs. table
How to create secondary indexes in Phoenix on CDH HBase
Cloudera announces Apache Phoenix support in CDH