Best Practice: Using EMR Serverless StarRocks AI Function for Financial Text Classification
This article demonstrates how to use StarRocks AI Function on EMR Serverless to run sentiment analysis, intelligent classification, information extraction, and PII redaction on financial text entirely within SQL. Keeping the work inside the database removes the data‑export step, reduces latency, and supports compliance. The article provides concrete code examples, reported performance figures, and best‑practice recommendations.
The financial industry generates massive unstructured data such as customer complaints, regulatory announcements, and contract texts. Traditional pipelines require exporting data to external NLP services, resulting in long latency, high cost, and compliance risks. The article presents a best‑practice solution that uses StarRocks AI Function on EMR Serverless to run large‑language‑model (LLM) capabilities directly inside the database engine, achieving "data‑in‑place" analysis.
Core Capabilities
StarRocks AI Function provides built‑in SQL functions for common financial NLP tasks:
ai_sentiment(text) – sentiment analysis (e.g., customer feedback monitoring).
ai_classify(text, categories) – text classification into predefined business categories.
ai_extract(text, labels) – key‑element extraction from contracts or tickets.
ai_redact(text) – PII redaction for compliance.
ai_summarize(text) – concise summary of long documents.
ai_filter(text, condition) – semantic filtering returning a Boolean.
ai_complete(prompt, instruction) – custom analysis or reasoning.
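As a quick smoke test, the functions listed above can be called on string literals before any table exists. This is a minimal sketch that assumes the argument order shown in the list. The input sentences and category labels are made up for illustration, and the exact output format depends on the configured model.

```sql
-- Illustrative inputs only; argument order follows the function list above.
SELECT ai_sentiment('The new mobile app makes transfers fast and easy.') AS sentiment;

SELECT ai_classify(
           'I was charged an annual fee twice this year.',
           ARRAY['billing dispute', 'business inquiry', 'product experience']
       ) AS category;

SELECT ai_filter(
           'If this is not fixed I will report the bank to the regulator.',
           'Customer threatens to complain to regulator'
       ) AS is_escalation;
```

If these statements fail, the requirements in the Setup section (version, AI Center, public network access) are the first things to check.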
Solution Advantages
Architectural Innovation (high concurrency): An asynchronous double‑pipeline with bthread cooperation allows a few system threads (default 4) to drive hundreds of concurrent LLM requests, removing OS thread scheduling bottlenecks.
Performance Breakthrough (low latency): Predicate push‑down restricts LLM calls to the rows that survive the query's other filters, and an in‑memory LRU cache answers repeated inputs in milliseconds with zero token consumption. The reported gains are roughly 40% lower end‑to‑end latency and 3‑5× higher concurrency.
Cost Efficiency (accurate billing): Three‑layer adaptive rate limiting and intelligent retries avoid wasted calls, and automatic UID injection enables precise cost allocation. The reported saving in token expenses is over 30%.
Security & Compliance (no data export): All analysis runs inside the StarRocks engine, eliminating the risk of leaking PII‑containing data to external services.
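The predicate push‑down point can be illustrated with the fin_customer_tickets sample table defined in the Setup section. In this sketch, ordinary column predicates sit next to the AI function call. Per the description above, only rows that pass them should reach the LLM, although how the optimizer actually orders the work has not been verified here.

```sql
-- Cheap structured predicates narrow the rows first;
-- ai_sentiment is then needed only for the matching tickets.
SELECT ticket_id,
       ai_sentiment(ticket_text) AS sentiment
FROM fin_customer_tickets
WHERE product_type = 'credit_card'
  AND created_at >= '2026-05-01 00:00:00'
  AND created_at <  '2026-05-02 00:00:00';
```

With the eight sample rows, these predicates leave two tickets (1001 and 1006), so at most two LLM calls should be needed instead of eight. Re‑running the identical statement is the case where the cache described above would be expected to help.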
Setup
Requirements:
EMR Serverless StarRocks Stella version 2.1 or later (AI Function support).
AI Center enabled.
Public network access enabled.
Data preparation involves creating three sample tables that mimic typical financial scenarios.
CREATE TABLE fin_customer_tickets (
    ticket_id    BIGINT,
    customer_id  VARCHAR(20)    COMMENT 'Customer ID',
    channel      VARCHAR(10)    COMMENT 'Channel: app/web/phone/branch',
    product_type VARCHAR(20)    COMMENT 'Product type: credit_card/loan/deposit/fund/insurance',
    ticket_text  VARCHAR(65533) COMMENT 'Ticket content',
    created_at   DATETIME
)
DUPLICATE KEY(ticket_id)
DISTRIBUTED BY HASH(ticket_id) BUCKETS 8;
INSERT INTO fin_customer_tickets VALUES
(1001, 'C20240001', 'app', 'credit_card', 'Your credit card bill is wrong! Last month I clearly paid 8000, but the statement shows I still owe 5000. This is the third time. If not resolved, I will complain to the regulator.', '2026-05-01 09:15:00'),
(1002, 'C20240002', 'phone', 'loan', 'I would like to inquire about the early repayment process for a mortgage, what documents are required? Is there a penalty for early repayment?', '2026-05-01 10:30:00'),
(1003, 'C20240003', 'web', 'fund', 'The stable fund you recommended lost 15% in a month. Is this really stable? I demand compensation!', '2026-05-01 11:45:00'),
(1004, 'C20240004', 'app', 'deposit', 'The auto‑renewal feature for large‑amount deposits is convenient, and the interest rate is 0.1% higher than before. Overall experience is good.', '2026-05-01 14:00:00'),
(1005, 'C20240005', 'branch', 'insurance', 'Insurance claim is too slow. I spent 30,000 on hospitalization, and the materials have been submitted for two months with no result.', '2026-05-01 15:20:00'),
(1006, 'C20240006', 'app', 'credit_card', 'The newly issued platinum card benefits are great, the airport lounge is comfortable, and point redemption is reasonable. I have recommended it to friends.', '2026-05-01 16:00:00'),
(1007, 'C20240007', 'phone', 'loan', 'Can the monthly repayment date for my car loan be changed? Currently it is the 5th, but my salary is paid on the 15th, causing cash flow pressure.', '2026-05-02 09:00:00'),
(1008, 'C20240008', 'web', 'fund', 'I have been investing in an index fund for half a year, the annualized return is about 8%, planning to hold longer.', '2026-05-02 10:15:00');
CREATE TABLE fin_regulatory_news (
    news_id      BIGINT,
    source       VARCHAR(50)    COMMENT 'Source agency',
    publish_date DATE,
    title        VARCHAR(500),
    content      VARCHAR(65533) COMMENT 'Full text'
)
DUPLICATE KEY(news_id)
DISTRIBUTED BY HASH(news_id) BUCKETS 4;
INSERT INTO fin_regulatory_news VALUES
(2001, 'People\'s Bank of China', '2026-04-28', 'Notice on Strengthening Consumer Rights Protection', 'To further strengthen consumer rights protection, financial institutions must establish dedicated departments, fully disclose product risks, and handle complaints within 15 working days.'),
(2002, 'China Banking and Insurance Regulatory Commission', '2026-04-25', 'Regulatory Requirements for Loan Fund Usage', 'Recent inspections found some banks misusing loan funds for real estate and stock markets. Institutions must ensure loan funds are used for the agreed purpose and avoid penalties.'),
(2003, 'China Securities Regulatory Commission', '2026-04-20', 'Opinions on Improving Listed Company Disclosure', 'Companies should disclose information truthfully, accurately, completely, and timely, and avoid false statements or material omissions.'),
(2004, 'People\'s Bank of China', '2026-04-15', 'Q1 2026 Monetary Policy Execution Report', 'In Q1 2026, M2 grew 8.2% YoY, RMB loans grew 10.5% YoY, and social financing grew 9.8% YoY. The policy will remain prudent and liquidity ample.');
CREATE TABLE fin_contracts (
    contract_id   VARCHAR(30),
    contract_type VARCHAR(20)    COMMENT 'Contract type: loan/guarantee/pledge',
    party_a       VARCHAR(200),
    party_b       VARCHAR(200),
    contract_text VARCHAR(65533) COMMENT 'Key contract clauses'
)
DUPLICATE KEY(contract_id)
DISTRIBUTED BY HASH(contract_id) BUCKETS 4;
INSERT INTO fin_contracts VALUES
('LOAN-2026-0001', 'loan', 'Some Bank Co., Ltd. Hangzhou Branch', 'Hangzhou StarTech Co., Ltd.', 'Loan amount: RMB 5,000,000.00. Term: 2026‑01‑15 to 2027‑01‑14 (12 months). Interest: LPR + 60 bps (4.15% annual). Repayment: equal principal‑interest, due on the 20th each month. Default: 0.05% daily penalty on overdue amount. Guarantee: mortgage on property at XX Road, West Lake District, Hangzhou.'),
('LOAN-2026-0002', 'loan', 'Some Bank Co., Ltd. Shanghai Branch', 'Shanghai JinXiu Trade Co., Ltd.', 'Loan amount: RMB 10,000,000.00. Term: 2026‑03‑01 to 2028‑02‑28 (24 months). Fixed interest rate 4.35% annually. Repayment: interest‑first, principal at maturity. Default: if borrower misses three consecutive interest payments or misuses funds, the lender may declare early repayment.'),
('GUAR-2026-0001', 'guarantee', 'Some Bank Co., Ltd. Hangzhou Branch', 'Zhejiang Xinda Industrial Group Co., Ltd.', 'Guarantee method: joint and several liability. Scope: principal of RMB 5,000,000 plus interest, penalty, and enforcement costs. Period: two years after debt maturity. Contact: Zhang Wei, phone 13812345678, ID 330102198501153456.');
Scenario 1: Customer Ticket Sentiment Analysis
Identify sentiment with a single SQL statement:
SELECT ticket_id,
       product_type,
       ai_sentiment(ticket_text) AS sentiment,
       LEFT(ticket_text, 40) AS preview
FROM fin_customer_tickets
ORDER BY ticket_id;
The sample result shows negative sentiment for ticket 1001, neutral for ticket 1002, and so on.
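To act on the result, the sentiment can be computed once in a common table expression and then filtered like any other column. This sketch assumes ai_sentiment returns the lowercase label 'negative', which is the value the materialized‑view query later in this article compares against. The CTE name scored is illustrative.

```sql
-- Score each ticket once, then keep only the negative ones for follow-up.
WITH scored AS (
    SELECT ticket_id,
           customer_id,
           product_type,
           ai_sentiment(ticket_text) AS sentiment
    FROM fin_customer_tickets
)
SELECT ticket_id, customer_id, product_type
FROM scored
WHERE sentiment = 'negative'
ORDER BY ticket_id;
```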
Ticket Classification and Sentiment Distribution
SELECT ticket_id,
       ai_classify(ticket_text, ARRAY['billing dispute','business inquiry','complaint','product experience','claim dispute']) AS category,
       ai_sentiment(ticket_text) AS sentiment
FROM fin_customer_tickets;
SELECT product_type,
       ai_sentiment(ticket_text) AS sentiment,
       COUNT(*) AS ticket_count
FROM fin_customer_tickets
GROUP BY product_type, ai_sentiment(ticket_text)
ORDER BY product_type, ticket_count DESC;
Results illustrate that credit‑card tickets are mostly negative, while deposit tickets are positive.
Semantic Filtering for High‑Risk Tickets
SELECT ticket_id,
       product_type,
       ticket_text,
       ai_sentiment(ticket_text) AS sentiment
FROM fin_customer_tickets
WHERE ai_filter(ticket_text, 'Customer threatens to complain to regulator or demand compensation');
In the source article's run, the query returned 7 high‑risk tickets, accounting for 23.3% of the total. That ratio implies a dataset of about 30 tickets, larger than the eight sample rows inserted above.
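The share quoted above is the number of flagged tickets divided by the total number of tickets. A sketch of that calculation follows, assuming ai_filter returns a Boolean as described in the function list. The CTE and alias names are illustrative.

```sql
-- Flag each ticket once, then compute the high-risk share in percent.
WITH flagged AS (
    SELECT ticket_id,
           ai_filter(ticket_text, 'Customer threatens to complain to regulator or demand compensation') AS is_high_risk
    FROM fin_customer_tickets
)
SELECT COUNT(*) AS total_tickets,
       SUM(CASE WHEN is_high_risk THEN 1 ELSE 0 END) AS high_risk_tickets,
       ROUND(100.0 * SUM(CASE WHEN is_high_risk THEN 1 ELSE 0 END) / COUNT(*), 1) AS high_risk_pct
FROM flagged;
```

For reference, 7 flagged tickets out of 30 would give 23.3%. The numbers on the eight sample rows will differ.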
Scenario 2: Regulatory Announcement Classification & Summarization
SELECT news_id,
       source,
       title,
       ai_classify(content, ARRAY['consumer protection','loan supervision','disclosure','monetary policy','anti‑money laundering','market entry']) AS reg_category
FROM fin_regulatory_news;
Results can be visualized as a pie chart in ChatBI.
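A pie chart needs one row per category with a count. The sketch below aggregates the classification result for that purpose. It assumes the output of ai_classify can be cast to VARCHAR so that it is usable in GROUP BY; the cast is a precaution in case the function returns a JSON value, and it can be dropped if the function already returns a string.

```sql
-- One row per regulatory category, ready to feed a chart.
WITH classified AS (
    SELECT CAST(
               ai_classify(content, ARRAY['consumer protection','loan supervision','disclosure','monetary policy','anti-money laundering','market entry'])
               AS VARCHAR
           ) AS reg_category
    FROM fin_regulatory_news
)
SELECT reg_category,
       COUNT(*) AS news_count
FROM classified
GROUP BY reg_category
ORDER BY news_count DESC;
```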
SELECT news_id,
       title,
       ai_summarize(content) AS summary
FROM fin_regulatory_news
WHERE publish_date >= '2026-04-01';
Compliance Impact Analysis
SELECT news_id,
       title,
       ai_complete(content, 'You are a bank compliance analyst. Analyze the impact of this announcement on retail banking and list remediation points (max 100 words)') AS compliance_impact
FROM fin_regulatory_news
WHERE ai_filter(content, 'contains bank compliance requirements or penalties');
Scenario 3: Contract Key‑Element Extraction & PII Redaction
SELECT contract_id,
       contract_type,
       ai_extract(contract_text, ARRAY['loan amount','loan term','annual interest','repayment method','guarantee method','default clause']) AS key_elements
FROM fin_contracts
WHERE contract_type = 'loan';
The result is returned as JSON with structured fields such as loan amount, term, and interest rate.
SELECT contract_id,
       ai_redact(contract_text) AS redacted_text
FROM fin_contracts
WHERE contract_type = 'guarantee';
Example redaction: "Contact: Zhang **, phone: 138****5678, ID: 330102********3456".
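If downstream users should only ever see redacted text, one option is to persist the redacted output once and grant access to that copy instead of the source table. The sketch below uses CREATE TABLE AS SELECT. The table name fin_contracts_redacted is hypothetical, and the statement relies on the engine's default key and distribution settings for CTAS, which may need to be spelled out on some versions.

```sql
-- Hypothetical redacted copy: ai_redact runs once per contract at load time,
-- and later reads of this table trigger no further LLM calls.
CREATE TABLE fin_contracts_redacted AS
SELECT contract_id,
       contract_type,
       ai_redact(contract_text) AS contract_text
FROM fin_contracts;
```

The party_a and party_b columns are left out here on purpose. Whether they count as sensitive depends on the organization's policy.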
Contract Risk Assessment Pipeline
WITH contract_elements AS (
    SELECT contract_id,
           party_b,
           contract_text,
           ai_extract(contract_text, ARRAY['loan amount','annual interest','guarantee method','default clause']) AS elements
    FROM fin_contracts
    WHERE contract_type = 'loan'
)
SELECT contract_id,
       party_b,
       elements,
       ai_complete(
           CONCAT('Contract elements: ', CAST(elements AS VARCHAR), '\nOriginal contract: ', contract_text),
           'You are a credit risk expert. Based on the contract elements, assess the risk level (low/medium/high) and list main risk points (max 80 words).'
       ) AS risk_assessment
FROM contract_elements;
Scenario 4: End‑to‑End Analytical Pipeline
Customer 360 Dashboard
SELECT ticket_id,
       customer_id,
       product_type,
       channel,
       ai_sentiment(ticket_text) AS sentiment,
       ai_classify(ticket_text, ARRAY['billing dispute','business inquiry','complaint','product experience','claim dispute']) AS category,
       ai_extract(ticket_text, ARRAY['involved amount','request']) AS key_info,
       ai_summarize(ticket_text) AS summary
FROM fin_customer_tickets
WHERE created_at >= '2026-05-01';
The result can be materialized for real‑time dashboards in QuickBI.
Materialized View for Reuse
CREATE MATERIALIZED VIEW mv_ticket_analysis AS
SELECT ticket_id,
       customer_id,
       product_type,
       channel,
       created_at,
       ai_sentiment(ticket_text) AS sentiment,
       ai_classify(ticket_text, ARRAY['billing dispute','business inquiry','complaint','product experience','claim dispute']) AS category
FROM fin_customer_tickets;
SELECT DATE(created_at) AS dt,
       product_type,
       COUNT(*) AS negative_tickets
FROM mv_ticket_analysis
WHERE sentiment = 'negative'
GROUP BY DATE(created_at), product_type
ORDER BY dt, negative_tickets DESC;
Batch ETL with INSERT‑SELECT
CREATE TABLE fin_ticket_analysis_result (
    ticket_id    BIGINT,
    customer_id  VARCHAR(20),
    product_type VARCHAR(20),
    sentiment    VARCHAR(20),
    category     JSON,
    key_info     JSON,
    summary      VARCHAR(65533),
    analyzed_at  DATETIME DEFAULT CURRENT_TIMESTAMP
)
DUPLICATE KEY(ticket_id)
DISTRIBUTED BY HASH(ticket_id) BUCKETS 4;
INSERT INTO fin_ticket_analysis_result (ticket_id, customer_id, product_type, sentiment, category, key_info, summary)
SELECT ticket_id,
       customer_id,
       product_type,
       ai_sentiment(ticket_text),
       ai_classify(ticket_text, ARRAY['billing dispute','business inquiry','complaint','product experience','claim dispute']),
       ai_extract(ticket_text, ARRAY['involved amount','request']),
       ai_summarize(ticket_text)
FROM fin_customer_tickets
WHERE ticket_id NOT IN (SELECT ticket_id FROM fin_ticket_analysis_result);
Best‑Practice Recommendations
Prefer the built‑in AI functions (e.g., ai_sentiment, ai_complete) over generic LLM calls for lower latency and cost.
Use INSERT INTO SELECT for bulk processing and persist results in a table or materialized view to avoid repeated AI calls.
Apply ai_filter first to narrow the dataset, then invoke heavier functions on the reduced set.
Monitor token consumption; for long texts (e.g., contracts) run ai_summarize first and then analyze the summary.
Standardize classification labels across the organization to keep ai_classify results consistent.
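Two of the recommendations above, filtering first and summarizing long texts first, can be sketched against the sample tables. Both queries reuse only the functions and prompts already shown in this article. The CTE name summarized is illustrative, and whether summarizing first actually saves tokens depends on text length, since the summary itself costs one call per row.

```sql
-- 1) Filter first: the heavier functions run only on tickets that pass ai_filter.
SELECT ticket_id,
       ai_classify(ticket_text, ARRAY['billing dispute','business inquiry','complaint','product experience','claim dispute']) AS category,
       ai_extract(ticket_text, ARRAY['involved amount','request']) AS key_info
FROM fin_customer_tickets
WHERE ai_filter(ticket_text, 'Customer threatens to complain to regulator or demand compensation');

-- 2) Summarize first: analyze the shorter summary instead of the full contract text.
WITH summarized AS (
    SELECT contract_id,
           ai_summarize(contract_text) AS summary
    FROM fin_contracts
    WHERE contract_type = 'loan'
)
SELECT contract_id,
       ai_complete(summary, 'You are a credit risk expert. Assess the risk level (low/medium/high) and list main risk points (max 80 words).') AS risk_assessment
FROM summarized;
```

The second pattern trades detail for cost: clauses dropped by the summary are invisible to the later analysis, so it suits screening better than final review.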
Overall, StarRocks AI Function offers a "data‑in‑place, zero‑code" paradigm for processing unstructured financial text, delivering high concurrency, lower cost, and strict compliance.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
