Databases 9 min read

Optimizing Queries on Non‑Atomic Data in PostgreSQL Using Regex, Arrays, and GIN Indexes

This article examines how to efficiently query PostgreSQL tables that store comma‑separated city names by comparing plain OR conditions, regular‑expression operators, and array‑based searches with GIN indexes, showing speed measurements and validation of result consistency.

Qunar Tech Salon
Qunar Tech Salon
Qunar Tech Salon
Optimizing Queries on Non‑Atomic Data in PostgreSQL Using Regex, Arrays, and GIN Indexes

Many tables store non‑atomic data as comma‑separated strings; for example, the arrive column in the bps table contains city names such as "泰国,泰国毒蛇研究中心,泰国大城".

b2c_product=# select id, arrive from bps order by id limit 10;
 id | arrive
----+---------------------------------------------------------------
  1 | 北海道
  2 | 济州岛
  3 | 法兰克福
  4 | 北京
  5 | 三亚海棠秀
  6 | 拉市海湿地公园
  7 | 拉市海湿地公园
  8 | 泰国,泰国毒蛇研究中心,泰国大城,泰国国家博物馆,泰国玉佛寺,泰国马车博物馆,曼谷野生动物园,曼谷艺术文化中心,郑王庙,大皇宫,玉佛寺,珊瑚岛,金佛寺,四面佛
  9 | 拉市海湿地公园
 10 | 山东,泰山,孔府,孔林,孔庙,中天门,南天门
(10 rows)

The requirement is to find rows whose arrive field contains any of a list of destinations such as 成都、杜甫草堂、青羊宫、武侯祠、锦里古街.

Solution 1 – plain OR

Using the SQL OR operator:

select id from bps where arrive ~ '成都' or arrive ~ '杜甫草堂' or arrive ~ '青羊宫' or arrive ~ '武侯祠' or arrive ~ '锦里古街' or arrive ~ '成都';

Advantages: simple logic, easy to understand. Disadvantages: cumbersome to build the query programmatically, long SQL, and slow execution.

Solution 2 – regular‑expression operator

PostgreSQL’s regex operator can express the same logic more compactly:

select id, arrive from bps where arrive ~ '成都|杜甫草堂|青羊宫|武侯祠|锦里古街|成都' order by id;

Advantages: short query, clear logic, only requires basic regex knowledge. Disadvantages: still requires regex familiarity and performance is not optimal.

Solution 3 – array method with GIN index

PostgreSQL’s array operators (e.g., @> , <@ , = , && ) can be used after converting the comma‑separated string to an array. Creating a GIN index on the generated array speeds up the search:

create index CONCURRENTLY on bps using gin (string_to_array(arrive, ','));

Query using the overlap operator:

select id, arrive from bps where string_to_array(arrive,',') && ARRAY['成都','杜甫草堂','青羊宫','武侯祠','锦里古街','成都'];

Speed comparison

Execution times measured with EXPLAIN ANALYZE :

Plain OR: ~1060 ms

Regex ~ : ~559 ms (about twice as fast)

Array && with GIN index: ~26 ms (about 40 times faster)

Validation

All three methods return the same total row count (1800) except the array method, which returns 1775 rows; the missing rows are caused by incorrect data entries that contain the destination name but are not true matches. An outer‑join comparison identifies the 25 discrepant rows.

Conclusion and reminder

Using PostgreSQL’s native data‑processing capabilities—especially GIN indexes on arrays—provides a simple yet highly efficient solution for searching non‑atomic fields, demonstrating that a well‑chosen database feature can dramatically improve performance.

PerformanceSQLPostgreSQLarrayRegexGIN index
Qunar Tech Salon
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.