Understanding Join Algorithms in Presto: Theory, Implementation, and Engineering Practices
The article explains Presto’s join processing by detailing the business need to limit multi‑table joins, then describing nested‑loop, sort‑merge, and hash join algorithms with Java examples, and finally showing how the Volcano model, columnar pages, and planner integration enable scalable, efficient OLAP join execution.
In the series "Exploring the Presto SQL Engine", this second article focuses on the principles and implementation of Join operations in Presto. It starts by describing the business need to limit multi‑table joins for performance reasons, citing Alibaba’s development guidelines that forbid joins on more than three tables and require indexed join columns.
The article then explains the fundamental semantics of Join in relational databases and introduces the Cartesian product (Cross Join) as the simplest case. A concrete Java example demonstrates how a nested loop produces the Cartesian product:
List<Tuple> r = newArrayList(
new Tuple(newArrayList(1,"a")),
new Tuple(newArrayList(2,"b")));
List<Tuple> s = newArrayList(
new Tuple(newArrayList(3,"c")),
new Tuple(newArrayList(4,"d")));
int cnt = 0;
for (Tuple ri : r) {
for (Tuple si : s) {
Tuple c = new Tuple().merge(ri).merge(si);
System.out.println(++cnt + ": " + c);
}
}
/**
* out:
* 1: [1, a, 3, c]
* 2: [1, a, 4, d]
* 3: [2, b, 3, c]
* 4: [2, b, 4, d]
*/From this simple nested loop, the article derives the Nested Loop Join algorithm, highlighting its O(m·n) complexity and discussing practical optimizations such as driving the join with the smaller table and using block‑nested loops to reduce I/O.
Next, the Sort‑Merge Join is presented. By sorting both inputs first, the algorithm reduces the comparison count to O(m log m + n log n + m + n), which can be further lowered to O(m + n) when indexes guarantee ordered data. Visual diagrams illustrate the two‑phase process (Sort then Merge).
The article then covers the Hash Join . It explains why, in large‑scale data warehouses, maintaining sorted data is costly, and how hashing the smaller table enables fast equality lookups with O(m + n) complexity. A Java snippet shows how to build a Page in Presto using column‑oriented storage:
import com.facebook.presto.common.Page;
import com.facebook.presto.common.PageBuilder;
import com.facebook.presto.common.block.Block;
import com.facebook.presto.common.block.BlockBuilder;
import com.facebook.presto.common.type.BigintType;
import com.facebook.presto.common.type.Type;
import com.facebook.presto.common.type.VarcharType;
import com.google.common.collect.Lists;
import io.airlift.slice.Slice;
import java.util.List;
import static io.airlift.slice.Slices.utf8Slice;
public class PageBlockDemo {
private static Page buildPage(List
types, List
dataSet) {
PageBuilder pageBuilder = new PageBuilder(types);
for (Object[] row : dataSet) {
pageBuilder.declarePosition();
for (int column = 0; column < types.size(); column++) {
BlockBuilder out = pageBuilder.getBlockBuilder(column);
Object colVal = row[column];
if (colVal == null) {
out.appendNull();
} else {
Type type = types.get(column);
Class
javaType = type.getJavaType();
if (javaType == long.class) {
type.writeLong(out, (long) colVal);
} else if (javaType == Slice.class) {
type.writeSlice(out, utf8Slice((String) colVal));
} else {
throw new UnsupportedOperationException("not implemented");
}
}
}
}
Page page = pageBuilder.build();
pageBuilder.reset();
return page;
}
private static void readColumn(List
types, Page page) {
for (int column = 0; column < types.size(); column++) {
Block block = page.getBlock(column);
Type type = types.get(column);
Class
javaType = type.getJavaType();
System.out.print("column[" + type.getDisplayName() + "]>>");
List
colList = Lists.newArrayList();
for (int pos = 0; pos < block.getPositionCount(); pos++) {
if (javaType == long.class) {
colList.add(block.getLong(pos));
} else if (javaType == Slice.class) {
colList.add(block.getSlice(pos, 0, block.getSliceLength(pos)).toStringUtf8());
} else {
throw new UnsupportedOperationException("not implemented");
}
}
System.out.println(colList);
}
}
public static void main(String[] args) {
List
types = Lists.newArrayList(BigintType.BIGINT, VarcharType.VARCHAR);
List
dataSet = Lists.newArrayList(
new Object[]{1L, "aa"},
new Object[]{2L, "ba"},
new Object[]{3L, "cc"},
new Object[]{4L, "dd"});
Page page = buildPage(types, dataSet);
readColumn(types, page);
}
}
// column[bigint]>>[1, 2, 3, 4]
// column[varchar]>>[aa, ba, cc, dd]The discussion then introduces the Volcano Model , an extensible and parallel query evaluation architecture originally described by Goetz Graefe. It emphasizes operator separation (open‑next‑close iterator pattern) and dynamic pipeline assembly. Diagrams illustrate how Scan, Select, and Project operators are combined, and how Presto adapts the model with column‑oriented Pages and vectorized execution.
Subsequent sections describe the engineering prerequisites for implementing Join in Presto, such as parsing Join syntax with ANTLR, binding tables and columns, choosing the appropriate Join algorithm (Hash, Sort‑Merge, or Nested Loop), and constructing the execution pipeline (Parser → Binding → Planner → Executor). A code excerpt shows how Presto’s NestedLoopPageBuilder builds the Cartesian product of two Pages using Run‑Length‑Encoded blocks.
Finally, the article summarizes the key takeaways: understanding Join fundamentals, the Volcano execution framework, columnar storage, and the practical considerations for scaling Join operations in OLAP workloads. It points to future topics like multi‑table Join ordering, semi‑join optimizations, and handling data skew.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.