
ODPS Development Guide: Parameters, Built‑in Functions, UDF Creation, and Performance Optimization

This comprehensive ODPS (MaxCompute) development guide serves as a mini‑encyclopedia, detailing common parameter tuning, built‑in SQL functions, step‑by‑step Java UDF creation, job lifecycle insights, and practical performance‑optimization techniques such as parallelism adjustment, map‑join hints, and small‑file mitigation.

DaTaobao Tech

Jul 10, 2024

This article is a mini‑encyclopedia for ODPS (MaxCompute) development, covering both beginner and advanced topics.

Common Parameter Settings

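Typical tuning focuses on the instance count, CPU, and memory of the map, join, and reduce stages. Example settings include:

set odps.sql.mapper.cpu=100;         -- CPU per map instance (100 = 1 core)
set odps.sql.mapper.memory=1024;     -- memory per map instance, in MB
set odps.sql.mapper.split.size=256;  -- input data per map instance, in MB
set odps.sql.joiner.instances=-1;    -- join instance count (-1 lets the system decide)
set odps.sql.joiner.cpu=100;         -- CPU per join instance
set odps.sql.reducer.instances=-1;   -- reduce instance count (-1 lets the system decide)
set odps.sql.reducer.cpu=100;        -- CPU per reduce instance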

Additional parameters control file merging, UDF resources, map‑join memory, dynamic partition handling, and data‑skew optimization.
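
Because odps.sql.mapper.split.size sets how much input each map instance reads, it indirectly sets the mapper count. The following is a minimal Java sketch of that arithmetic, assuming a simple ceiling division; the class and method names are hypothetical, and the real planner also considers file boundaries and instance limits, so treat the result as a rough estimate only.

```java
// Rough estimate only: a hypothetical helper, not part of any MaxCompute SDK.
final class MapperEstimate {

    /** Ceiling division of total input size by split size, both in MB. */
    static long estimateMappers(long inputSizeMb, long splitSizeMb) {
        if (inputSizeMb < 0 || splitSizeMb <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (inputSizeMb + splitSizeMb - 1) / splitSizeMb;
    }

    public static void main(String[] args) {
        long inputMb = 100L * 1024; // assumed 100 GB of input, for illustration
        System.out.println(estimateMappers(inputMb, 256)); // prints 400
        System.out.println(estimateMappers(inputMb, 64));  // prints 1600
    }
}
```

Under these assumptions, shrinking the split size from 256 MB to 64 MB quadruples the estimated mapper count, which is why the setting is a common lever for map-stage parallelism.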

Built‑in SQL Functions

The guide classifies functions into date, math, window, aggregation, string, complex‑type, encryption, and others, providing typical usage examples such as:

SELECT DATEADD(GETDATE(), -7, 'dd');

to_char('2018-01-11 10:00:00','yyyymmdd') as date_3

split(str, pat)

regexp_replace(msg_id, "\\[|\\]", "") as msg_id

These functions help with date calculations, string manipulation, JSON extraction, and more.
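
For readers more familiar with Java, the date and string examples above can be approximated with standard-library calls. This is only an analogy, not how MaxCompute implements these functions, and the class and method names are hypothetical. One detail worth noting: in the MaxCompute format string 'yyyymmdd', mm means month, whereas Java's DateTimeFormatter uses MM for month and mm for minutes.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical Java analogues of three built-in functions shown above.
final class BuiltinAnalogues {

    /** Analogue of DATEADD(ts, delta, 'dd'): shift a timestamp by whole days. */
    static LocalDateTime dateAddDays(LocalDateTime ts, long delta) {
        return ts.plusDays(delta);
    }

    /** Analogue of to_char(ts, 'yyyymmdd'); Java needs MM for the month. */
    static String toCharYyyymmdd(LocalDateTime ts) {
        return ts.format(DateTimeFormatter.ofPattern("yyyyMMdd"));
    }

    /** Analogue of regexp_replace(s, "\\[|\\]", ""): drop square brackets. */
    static String stripBrackets(String s) {
        return s.replaceAll("\\[|\\]", "");
    }

    public static void main(String[] args) {
        LocalDateTime ts = LocalDateTime.parse("2018-01-11T10:00:00");
        System.out.println(toCharYyyymmdd(ts));                  // prints 20180111
        System.out.println(toCharYyyymmdd(dateAddDays(ts, -7))); // prints 20180104
        System.out.println(stripBrackets("[abc123]"));           // prints abc123
    }
}
```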

Custom Java UDF Development

Step‑by‑step instructions show how to install the MaxCompute Studio plugin in IDEA, create a Java project, add a UDF class, configure Maven assembly to package dependencies, and publish the JAR to the ODPS resource library.

Key session settings for UDFs:

set odps.sql.udf.jvm.memory=1024;  -- maximum JVM heap available to the UDF, in MB
set odps.sql.udf.timeout=1800;     -- UDF timeout, in seconds

After packaging, the UDF is uploaded via “Deploy to server”, linked to a function name, and can be invoked directly in SQL.
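
A UDF class itself can be very small. The sketch below shows the usual shape, a class exposing an evaluate method, using a lower-casing function as a stand-in example. So that it compiles without the MaxCompute SDK jar, the base class is left out; in a real project the class would extend com.aliyun.odps.udf.UDF. The class name and the sample input are illustrative assumptions.

```java
import java.util.Locale;

// In a MaxCompute project this would be declared as
//   public final class Lower extends com.aliyun.odps.udf.UDF
// The SDK base class is omitted here so the sketch runs standalone.
final class Lower {

    /** Called once per input row; returns NULL for NULL input. */
    public String evaluate(String s) {
        if (s == null) {
            return null;
        }
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Local smoke test before packaging the JAR.
        System.out.println(new Lower().evaluate("MaxCompute")); // prints maxcompute
    }
}
```

Once the JAR is uploaded as a resource, a statement along the lines of CREATE FUNCTION my_lower AS 'Lower' USING 'my_udf.jar'; would bind the class to a SQL function name; my_lower and my_udf.jar are hypothetical names.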

Performance Analysis & Optimization

The article explains the job lifecycle (scheduling, optimization, physical plan generation, execution, and completion) and common bottlenecks such as resource shortage, data skew, excessive small files, and inefficient UDFs.

Typical solutions include adjusting parallelism (set odps.sql.reducer.instances=xxx, where xxx is the desired instance count), enabling HBO (history-based optimization), using map-join hints, dynamic filter hints, and materialized views, and reducing small-file generation (set odps.merge.smallfile.filesize.threshold=64).
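
The idea behind a map-join hint is that the small table is held in memory on every map instance, so the large table can be joined while it is scanned, with no shuffle. The following Java sketch illustrates that idea with a hash map; it is a conceptual model with hypothetical names and toy data, not MaxCompute's actual execution code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual model of a map-side (broadcast) inner join.
final class MapJoinSketch {

    /**
     * smallTable: join key -> dimension value, fully loaded in memory.
     * largeTable: rows of {joinKey, payload}, streamed one at a time.
     */
    static List<String> mapJoin(Map<String, String> smallTable, List<String[]> largeTable) {
        List<String> out = new ArrayList<>();
        for (String[] row : largeTable) {
            String dim = smallTable.get(row[0]); // O(1) lookup, no shuffle needed
            if (dim != null) {                   // inner join: drop unmatched rows
                out.add(row[0] + "," + row[1] + "," + dim);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> small = new HashMap<>();
        small.put("1", "A");
        small.put("2", "B");
        List<String[]> large = Arrays.asList(
                new String[] {"1", "x"},
                new String[] {"3", "y"},
                new String[] {"2", "z"});
        System.out.println(mapJoin(small, large)); // prints [1,x,A, 2,z,B]
    }
}
```

The trade-off is memory: the approach only works while the small table fits in each instance, which is why the guide lists map-join memory among the tunable parameters.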

Sample SQL‑function creation for reuse:

CREATE SQL FUNCTION IF NOT EXISTS get_json_object_checkboxField(@a STRING,@b STRING) AS REPLACE(REPLACE(REPLACE(GET_JSON_OBJECT(@a,@b),'[\"',''),'\"]',''),'\"','');
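
To see what the three nested REPLACE calls do, it can help to trace them outside SQL. The Java sketch below mirrors only the REPLACE chain, assuming GET_JSON_OBJECT has already returned the array as text such as ["red","blue"]; the class name and sample values are hypothetical, and the null check is an assumption that NULL input yields NULL.

```java
// Mirrors REPLACE(REPLACE(REPLACE(x, '["', ''), '"]', ''), '"', '') from the SQL function.
final class CheckboxFieldCleaner {

    static String clean(String jsonArrayText) {
        if (jsonArrayText == null) {
            return null; // assumed to match SQL NULL propagation
        }
        return jsonArrayText
                .replace("[\"", "")  // drop the opening ["
                .replace("\"]", "")  // drop the closing "]
                .replace("\"", "");  // drop the remaining quotes between items
    }

    public static void main(String[] args) {
        System.out.println(clean("[\"red\",\"blue\"]")); // prints red,blue
        System.out.println(clean("[\"only\"]"));         // prints only
    }
}
```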

Conclusion

After a month of consolidation, the author delivers a foundational ODPS development reference, emphasizing continuous learning, knowledge sharing, and community building.
