How BlackPearl Dominated All Three KDD 2024 OAG‑Challenge Tracks with Large‑Model Techniques
The BlackPearl team leveraged large‑model strategies—including iterative self‑refinement, train‑time difficulty increase, test‑time augmentation, grafting‑learning, and boosting—to dominate the WhoIsWho‑IND, PST, and AQA tracks of the KDD 2024 OAG‑Challenge Cup, surpassing traditional feature‑engineered, GNN, and BERT baselines.
WhoIsWho‑IND (Paper Name Disambiguation)
Challenge: each sample contains a very large number of candidate papers and long textual fields, causing input sequences to exceed token limits and making conventional clustering ineffective.
Approach:
Task Format Conversion: transform the clustering problem into a pairwise comparison task. The model receives a target paper and a set of reference papers and predicts whether the target belongs to the main (correct) class.
Train‑Time Difficulty Increase (TTDI): during fine‑tuning, reduce the maximum sequence length and increase the proportion of incorrect papers in the training batch, forcing the model to handle harder examples.
Test‑Time Augmentation (TTA): at inference, shuffle the order of reference papers, run the model multiple times, and average the predictions to reduce sensitivity to input ordering.
Iterative Refinement (IRF): after each inference round, rank reference papers by their predicted correctness probability, keep the top‑k as input for the next round, and repeat without additional model training. This self‑feedback loop raises the concentration of true positives among the references in subsequent rounds.
Model ensemble: fine‑tune several models on different data sources (title, author) using DeepSpeed ZeRO stage 1 together with LoRA and QLoRA for parameter‑efficient adaptation, then combine their predictions by weighted averaging.
Input splitting: split overly long inputs to stay within the token budget. Experiments identified title and author as the most informative fields, so dedicated fine‑tuning focused on these two.
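The TTA and IRF steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the team's implementation: the Scorer signature, the function names, the exclusion of the target from its own reference set, and toy_scorer (a word-overlap stand-in for the fine-tuned LLM) are all hypothetical.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Hypothetical scorer: (target paper, ordered reference papers) -> P(target is correct).
# In the described system this role is played by the fine-tuned LLM.
Scorer = Callable[[str, Sequence[str]], float]


def tta_score(score_fn: Scorer, target: str, references: Sequence[str],
              n_rounds: int = 4, seed: int = 0) -> float:
    """Test-time augmentation: average the score over shuffled reference orders."""
    rng = random.Random(seed)
    refs = list(references)
    scores = []
    for _ in range(n_rounds):
        rng.shuffle(refs)
        scores.append(score_fn(target, tuple(refs)))
    return mean(scores)


def iterative_refinement(score_fn: Scorer, papers: List[str], top_k: int,
                         n_iters: int = 2, n_rounds: int = 4) -> Dict[str, float]:
    """Re-score every paper, then keep the top-k most confident papers as references."""
    references = list(papers)  # round 0: every candidate serves as a reference
    probs: Dict[str, float] = {}
    for _ in range(n_iters):
        probs = {
            paper: tta_score(score_fn, paper,
                             [r for r in references if r != paper], n_rounds)
            for paper in papers
        }
        # No training happens here: only the reference set is refined.
        references = sorted(papers, key=lambda p: probs[p], reverse=True)[:top_k]
    return probs


def toy_scorer(target: str, references: Sequence[str]) -> float:
    """Stand-in for the LLM: mean Jaccard word overlap with the references."""
    if not references:
        return 0.0
    words = set(target.split())
    overlaps = []
    for ref in references:
        ref_words = set(ref.split())
        overlaps.append(len(words & ref_words) / len(words | ref_words))
    return mean(overlaps)


if __name__ == "__main__":
    candidates = ["graph neural networks", "graph neural embeddings",
                  "neural graph clustering", "protein folding dynamics"]
    for title, prob in iterative_refinement(toy_scorer, candidates, top_k=2).items():
        print(f"{prob:.2f}  {title}")
```

With the toy scorer, the off-topic paper is pushed out of the reference set after the first round, so the second round scores every candidate only against references that are likely correct.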
Results: extensive ablation studies (Table 1/2, Fig. 2) show that each component—Task Format Conversion, TTDI, TTA, and IRF—adds measurable gains. The full pipeline achieved first‑place performance on the WhoIsWho‑IND track.
PST (Paper Source Tracing)
Challenge: label distribution shift between rule‑based and human‑annotated datasets, extremely long HTML identifiers (tens of thousands of tokens) with little useful content, and a massive unlabeled DBLP corpus that must be leveraged for auxiliary information.
Approach:
Grafting‑Learning For DataSet: fine‑tune a BERT model on the large, low‑quality rule‑based dataset, extract its final‑layer hidden states, and use these hidden states as additional features for a second BERT model trained on the high‑quality human‑annotated data. This grafting preserves useful signals from the noisy dataset while avoiding the negative transfer observed with ordinary sequential fine‑tuning.
Grafting‑Learning For LongText: split long documents and process each segment with separate BERT models. Each BERT produces a prediction probability; the set of probabilities is fed to ChatGLM‑3, which makes the final decision using a short text plus the BERT‑derived signals. This avoids quadratic attention cost on very long inputs.
Automatic Retrieval‑Augmented Generation (RAG) & Feature Engineering: automatically extract salient auxiliary information from the DBLP corpus for each paper, filter noisy entries, and construct a compact feature set that can be concatenated with the short text input. This reduces input length while preserving useful context.
Training pipeline: pre‑train BERT on the DBLP corpus with MLM, then apply the two grafting stages. The final prediction is produced by ChatGLM‑3 using the grafted features.
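The long-text grafting step can be illustrated with a small sketch: split the document, score each segment, and hand only the scores plus a short text to the final model. Everything named here is an assumption made for illustration: score_segment stands in for the segment-level BERT classifiers (a single callable is used for brevity), and the prompt wording is invented rather than taken from the solution.

```python
from typing import Callable, List, Sequence


def split_into_segments(tokens: Sequence[str], max_len: int) -> List[List[str]]:
    """Cut a long token sequence into consecutive chunks of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [list(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]


def build_grafted_prompt(short_text: str, segment_probs: Sequence[float]) -> str:
    """Combine the short text with segment-level probabilities (hypothetical wording)."""
    signals = ", ".join(f"{p:.2f}" for p in segment_probs)
    return (
        f"Context: {short_text}\n"
        f"Segment-level scores from auxiliary classifiers: [{signals}]\n"
        "Is the cited paper a source paper? Answer yes or no."
    )


def grafted_input(tokens: Sequence[str], short_text: str,
                  score_segment: Callable[[List[str]], float],
                  max_len: int = 512) -> str:
    """Score each segment independently, then build the compact final-model input."""
    segments = split_into_segments(tokens, max_len)
    probs = [score_segment(segment) for segment in segments]
    return build_grafted_prompt(short_text, probs)
```

Because each segment is scored on its own, attention cost grows with the number of segments rather than with the square of the full document length, and the final model sees an input whose size depends only on the short text and the number of scores.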
Results: the grafting techniques consistently outperformed ordinary transfer learning in ablation experiments, and the RAG & feature‑engineering step further improved robustness on the large unlabeled DBLP data.
AQA (Academic Paper Question Answering)
Challenge: retrieve the most relevant papers for a user query (MAP@20) from a noisy web‑sourced dataset where questions may map to multiple correct papers and many distractors exist.
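For reference, MAP@20 can be computed as in the sketch below. The normalization by min(number of relevant papers, k) is one common convention and is an assumption here; the competition's exact definition is not given in this article. The ranked list is assumed to contain no duplicate paper IDs.

```python
from typing import Iterable, List, Sequence


def average_precision_at_k(ranked_ids: Sequence[str],
                           relevant_ids: Iterable[str], k: int = 20) -> float:
    """AP@k for one query: average of precision values at each relevant hit."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, paper_id in enumerate(ranked_ids[:k], start=1):
        if paper_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Assumed normalization: the best achievable number of hits within the top k.
    return precision_sum / min(len(relevant), k)


def mean_average_precision_at_k(all_ranked: List[Sequence[str]],
                                all_relevant: List[Iterable[str]],
                                k: int = 20) -> float:
    """MAP@k: mean of AP@k over all queries."""
    if not all_ranked:
        return 0.0
    scores = [average_precision_at_k(r, rel, k)
              for r, rel in zip(all_ranked, all_relevant)]
    return sum(scores) / len(scores)
```

Because a query may have several correct papers, the metric rewards placing all of them near the top, not just the first one.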
Approach:
LLM for Vector: use the 7B SFR‑Embedding‑Mistral model for dense retrieval, leveraging its superior text‑embedding capability compared with sub‑1B auto‑encoders.
Hard Example Mining: during contrastive fine‑tuning, for each positive sample randomly sample 3 hard negatives from a pool of 100 candidates, ensuring the model learns to discriminate difficult distractors.
Boosting (Iterative Negative‑Sample Mining): start with the base SFR‑Embedding‑Mistral model to retrieve the top‑100 hardest negatives, fine‑tune the recall model on these, then repeat the process. The same hard negatives are used to fine‑tune the ranking model based on SOLAR‑10.7B‑Instruct‑v1.0. Each iteration improves MAP@20.
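The mining and boosting steps can be sketched as a loop in which the current retriever supplies the negatives used to train its successor. This is an illustrative outline only: the Retriever and fine_tune interfaces are hypothetical placeholders for the embedding model and its training routine, and treating every retrieved non-positive as a hard negative is an assumption.

```python
import random
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical interfaces: a retriever maps (query, k) to a ranked list of paper IDs,
# and fine_tune returns an updated retriever trained on (query, positive, negatives).
Retriever = Callable[[str, int], List[str]]
Triple = Tuple[str, str, List[str]]
FineTune = Callable[[Retriever, List[Triple]], Retriever]


def mine_hard_negatives(retrieve: Retriever, positives: Dict[str, Set[str]],
                        pool_size: int = 100) -> Dict[str, List[str]]:
    """Top-ranked papers that are not labeled positive form the hard-negative pool."""
    return {
        query: [pid for pid in retrieve(query, pool_size) if pid not in pos]
        for query, pos in positives.items()
    }


def sample_triples(pools: Dict[str, List[str]], positives: Dict[str, Set[str]],
                   n_neg: int = 3, seed: int = 0) -> List[Triple]:
    """For each positive, randomly draw n_neg hard negatives from the query's pool."""
    rng = random.Random(seed)
    triples: List[Triple] = []
    for query, pos in positives.items():
        pool = pools[query]
        for positive in sorted(pos):
            negatives = rng.sample(pool, min(n_neg, len(pool)))
            triples.append((query, positive, negatives))
    return triples


def boosting_loop(retrieve: Retriever, fine_tune: FineTune,
                  positives: Dict[str, Set[str]], n_iters: int = 8) -> Retriever:
    """Each round mines negatives with the latest retriever, then trains on them."""
    for _ in range(n_iters):
        pools = mine_hard_negatives(retrieve, positives)
        retrieve = fine_tune(retrieve, sample_triples(pools, positives))
    return retrieve
```

The design point is that negatives are re-mined every round: papers the previous model already separates well drop out of the top 100, so later rounds concentrate on the distractors the current model still confuses with the positives.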
Training details:
Recall model: instruction‑tuned on SFR‑Embedding‑Mistral with contrastive loss, 10 epochs, learning rate 1e‑4, optional QLoRA for single‑GPU fine‑tuning.
Ranking model: instruction‑tuned on SOLAR‑10.7B‑Instruct‑v1.0 with cross‑entropy loss, using the same hyper‑parameters as the recall model.
Boosting loop: eight iterations; MAP@20 improved from an initial +0.07 to 0.301 after the final iteration (Fig. 4).
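As a rough illustration of the contrastive objective used for the recall model, the sketch below computes an InfoNCE-style loss for one query with one positive and a few hard negatives. The article only says "contrastive loss", so the InfoNCE form, cosine similarity, and the temperature value are assumptions; vectors are assumed to be non-zero.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query: Sequence[float], positive: Sequence[float],
                  negatives: Sequence[Sequence[float]],
                  temperature: float = 0.05) -> float:
    """-log softmax of the positive's similarity against the hard negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]
```

The loss is near zero when the query embedding is much closer to the positive than to every negative, and it grows when a hard negative scores higher than the positive, which is why harder negatives produce a stronger training signal.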
Results: the iterative boosting pipeline consistently raised MAP@20, outperforming traditional feature‑engineered, GNN, and BERT baselines on the OAG‑Bench benchmark.
All three solutions are publicly available at
https://github.com/BlackPearl-Lab/KddCup-2024-OAG-Challenge-1st-Solutions.
References
[1] Wang, G., Li, W., Ourselin, S., & Vercauteren, T. (2019). Automatic brain tumor segmentation using convolutional neural networks with test‑time augmentation. In Brainlesion (pp. 61–72). Springer.
[2] Rasley, J., Rajbhandari, S., Ruwase, O., et al. (2020). DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of KDD 2020 (pp. 3505‑3506).
[3] Hu, E. J., Shen, Y., Wallis, P., et al. (2021). LoRA: Low‑rank adaptation of large language models. arXiv:2106.09685.
[4] Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. NeurIPS 2023.
[5] Jiangli Club. (n.d.). 嫁接学习的提出与具体用例. Retrieved from http://jiangliclub.com/article?article_id=72.
[6] Meng, R., Liu, Y., Joty, S. R., et al. (2024). SFR‑Embedding‑Mistral: Enhance Text Retrieval with Transfer Learning. Salesforce AI Research Blog. Retrieved from https://blog.salesforceairesearch.com/sfr-embedded-mistral/.
[7] Zhang, F., Shi, S., Zhu, Y., et al. (2024). OAG‑Bench: A Human‑Curated Benchmark for Academic Graph Mining. arXiv:2402.15810.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
