Best Paper Review: Hike – A Hybrid Human‑Machine Method for Entity Alignment in Large‑Scale Knowledge Bases (CIKM 2017)
This article summarizes the award‑winning CIKM 2017 paper “Hike: A Hybrid Human‑Machine Method for Entity Alignment in Large‑Scale Knowledge Bases,” explaining its motivation, the entity‑partition and predicate‑similarity techniques, the construction of partial orders, question‑selection strategies, inference modeling, error‑tolerance mechanisms, and the greedy algorithms that together achieve state‑of‑the‑art alignment performance.
The 26th ACM International Conference on Information and Knowledge Management (CIKM 2017) was held in Singapore. It received 1,274 full and short paper submissions and gave three best‑paper awards. The best long paper, "Hike: A Hybrid Human‑Machine Method for Entity Alignment in Large‑Scale Knowledge Bases", is the focus of this review.
Knowledge bases (KBs) model real‑world entities and their relationships, but aligning entities across heterogeneous, massive KBs remains challenging due to scale, inconsistency, and low recall of existing automatic methods. Hike addresses this by combining human insight with algorithmic processing.
Entity Partition : The method first splits each large KB into small blocks using predicate‑based cues. Matching predicates across the two KBs are identified via similarity scores, and an improved hierarchical agglomerative clustering (HAC) algorithm merges similar predicates into groups. Entities are then assigned to blocks according to these groups, so alignment can run independently, and in parallel, within each block.
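The clustering step can be illustrated with a minimal sketch. It uses plain average‑linkage HAC with a stopping threshold, not the paper's improved variant, and the predicate names, the `SIM` table, and the threshold value are all hypothetical.

```python
from itertools import combinations

def hac(items, sim, threshold):
    """Plain average-linkage agglomerative clustering (illustrative only)."""
    clusters = [frozenset([x]) for x in items]

    def link(c1, c2):
        # Average pairwise similarity between two clusters.
        return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        score, c1, c2 = max(
            ((link(a, b), a, b) for a, b in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        if score < threshold:
            break  # no remaining pair of clusters is similar enough
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters

# Hypothetical predicate similarities; unlisted pairs count as 0.
SIM = {
    frozenset(["kb1:birthPlace", "kb2:bornIn"]): 0.9,
    frozenset(["kb1:author", "kb2:writtenBy"]): 0.8,
}

def sim(p, q):
    return SIM.get(frozenset([p, q]), 0.0)

predicates = ["kb1:birthPlace", "kb2:bornIn", "kb1:author", "kb2:writtenBy"]
groups = hac(predicates, sim, threshold=0.5)
```

Each resulting group would define one block: entities described by the predicates in a group are compared only with each other. This naive version rescans all cluster pairs on every merge, which is the kind of cost an improved HAC would aim to avoid.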
Predicate Similarity : For predicates p_i and p'_j from two KBs, similarity ρ(p_i, p'_j) is computed from the overlap of their subject‑object pairs, with cosine similarity applied to relation and attribute vectors. Predicates are weighted by inverse functionality to reflect their importance: a predicate whose object value nearly identifies its subject counts for more than one whose values are shared by many subjects.
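A small sketch makes the two ingredients concrete. It assumes Jaccard overlap as the overlap measure and that entities in both KBs already share identifiers (for example through seed alignments); neither assumption comes from the paper, and the sample triples are invented.

```python
def pair_overlap(pairs_a, pairs_b):
    """Jaccard overlap between two sets of (subject, object) pairs."""
    a, b = set(pairs_a), set(pairs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def inverse_functionality(pairs):
    """Distinct objects divided by distinct (subject, object) pairs.

    A value near 1 means an object almost determines its subject.
    """
    pairs = set(pairs)
    if not pairs:
        return 0.0
    return len({obj for _, obj in pairs}) / len(pairs)

# Hypothetical subject-object pairs for one predicate in each KB.
born_in = [("e1", "Paris"), ("e2", "Rome"), ("e3", "Paris")]
birth_place = [("e1", "Paris"), ("e2", "Rome"), ("e4", "Oslo")]

rho = pair_overlap(born_in, birth_place)   # 2 shared pairs / 4 distinct pairs
weight = inverse_functionality(born_in)    # 2 distinct objects / 3 pairs
```

Here `rho` is 0.5 and `weight` is about 0.67: the two predicates overlap substantially, but a birthplace alone does not pin down a person.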
Partial Order Construction : Within each block, a partial order over entity pairs is built from their pairwise similarity scores, with more similar pairs preceding less similar ones. This order enables inference in both directions: if an entity pair is confirmed as a match, the pairs that precede it are likely matches as well; if it is confirmed as a non‑match, the pairs that succeed it are likely non‑matches.
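One way to picture such an order is component‑wise dominance between similarity vectors. The sketch below assumes each candidate pair carries one similarity score per compared attribute; that representation, the pair labels, and the scores are assumptions made for illustration.

```python
def precedes(u, v):
    """True if vector u is >= v in every component and > in at least one."""
    return all(x >= y for x, y in zip(u, v)) and any(x > y for x, y in zip(u, v))

def build_partial_order(vectors):
    """Map each entity pair to the set of pairs it precedes."""
    return {
        p: {q for q in vectors if q != p and precedes(vectors[p], vectors[q])}
        for p in vectors
    }

# Hypothetical candidate pairs with two attribute similarities each.
vectors = {
    "A": (0.9, 0.8),
    "B": (0.7, 0.8),
    "C": (0.8, 0.3),
    "D": (0.2, 0.1),
}
order = build_partial_order(vectors)
```

In this example `A` precedes every other pair and `D` precedes none, while `B` and `C` are incomparable because each is higher on a different attribute. That incomparability is what makes the order partial rather than total.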
Question Selection : To minimize crowdsourcing cost, the authors formulate question selection as a budgeted optimization problem, which is NP‑hard: choose B entity pairs (questions) whose answers maximize the number of pairs that can be inferred. Two greedy heuristics are proposed, Serial Question Selection (SQS) and Parallel Question Selection (MQS), both guided by an inference‑expectation metric.
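The greedy idea can be sketched as follows. The scoring formula (match probability times the number of preceding pairs, plus non‑match probability times the number of succeeding pairs) and the optimistic bookkeeping of covered pairs are assumptions for illustration, not the paper's exact metric or algorithm; the order and probabilities are invented.

```python
def ancestors(pair, order):
    """Pairs that precede `pair` in the partial order."""
    return {p for p, below in order.items() if pair in below}

def expected_inferences(pair, order, prob, covered):
    """Expected number of not-yet-covered pairs one answer would settle."""
    up = ancestors(pair, order) - covered
    down = order[pair] - covered
    return prob[pair] * len(up) + (1 - prob[pair]) * len(down)

def greedy_select(order, prob, budget):
    """Pick up to `budget` questions, one per round, by expected gain."""
    chosen, covered = [], set()
    for _ in range(budget):
        candidates = [p for p in order if p not in covered]
        if not candidates:
            break
        best = max(candidates,
                   key=lambda p: expected_inferences(p, order, prob, covered))
        chosen.append(best)
        # Assumption: treat the likelier answer as if it had been given.
        likely = ancestors(best, order) if prob[best] >= 0.5 else order[best]
        covered |= {best} | likely
    return chosen

# Hypothetical chain A > B > C > D > E with estimated match probabilities.
order = {"A": {"B", "C", "D", "E"}, "B": {"C", "D", "E"},
         "C": {"D", "E"}, "D": {"E"}, "E": set()}
prob = {"A": 0.95, "B": 0.8, "C": 0.6, "D": 0.2, "E": 0.05}

questions = greedy_select(order, prob, budget=2)
```

With these numbers the first question is `C`, the pair in the middle of the chain, since either answer settles two other pairs. Pairs at the extremes score low because their answers are nearly certain in advance and settle little.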
Inference Model : Given a partial order and a set of answered questions, the model propagates match/no‑match decisions to other pairs, reducing the need for exhaustive labeling.
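A minimal sketch of this propagation, assuming the partial order is stored as a transitively closed map from each pair to the pairs it precedes: a confirmed match is propagated to the pairs that precede it, and a confirmed non‑match to the pairs that follow it. The data and the first‑answer‑wins handling of conflicts are illustrative assumptions.

```python
def infer(order, answers):
    """Propagate crowd answers along the partial order.

    order:   pair -> set of pairs it precedes (transitively closed)
    answers: pair -> True (match) or False (non-match)
    """
    labels = dict(answers)
    for pair, is_match in answers.items():
        if is_match:
            # Everything that precedes a match is inferred to match.
            inferred = {p for p, below in order.items() if pair in below}
        else:
            # Everything that follows a non-match is inferred not to match.
            inferred = order[pair]
        for p in inferred:
            labels.setdefault(p, is_match)  # explicit and earlier labels win
    return labels

# Hypothetical order: A precedes B, C, D; B and C each precede D.
order = {"A": {"B", "C", "D"}, "B": {"D"}, "C": {"D"}, "D": set()}
labels = infer(order, {"B": True, "C": False})
```

Two answers label all four pairs here: `A` is inferred as a match from `B`, and `D` as a non‑match from `C`.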
Error Tolerance : The paper tackles two error sources: mistakes by crowd workers and the propagation of those mistakes through inference. It weights workers by reliability, aggregates their answers by majority voting, and applies an error‑propagation reduction scheme that asks additional questions about the ancestor and descendant pairs of ambiguous cases.
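Reliability‑weighted voting can be sketched in a few lines. The `margin` threshold for declaring an answer ambiguous, the worker names, and the reliability scores are all hypothetical; the paper's actual weighting scheme may differ.

```python
def weighted_vote(votes, reliability, margin=0.2):
    """Aggregate worker votes, weighting each by worker reliability.

    votes:       worker -> True (match) or False (non-match)
    reliability: worker -> weight in [0, 1]
    Returns True or False, or None when the weighted margin is too small.
    """
    yes = sum(reliability[w] for w, v in votes.items() if v)
    no = sum(reliability[w] for w, v in votes.items() if not v)
    total = yes + no
    if total == 0 or abs(yes - no) / total < margin:
        return None  # ambiguous: a candidate for follow-up questions
    return yes > no

reliability = {"w1": 0.95, "w2": 0.55, "w3": 0.5}

clear = weighted_vote({"w1": True, "w2": True, "w3": False}, reliability)
ambiguous = weighted_vote({"w1": True, "w2": False, "w3": False}, reliability)
```

In the second call, two weaker workers outvote one strong worker by head count, but the weighted totals are nearly tied, so the result is `None`. An ambiguous result like this is where follow‑up questions on neighbouring pairs in the partial order would come in.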
Experimental results show that Hike outperforms prior automatic alignment methods, achieving higher recall and precision while keeping computational complexity manageable (from O(n³) to O(n²) for partitioning and further reductions for partial‑order construction).
The review concludes with personal reflections on the paper’s strengths, minor shortcomings, and an invitation for readers to discuss details.
AntTech