When to Pre‑Train Graph Neural Networks: Data‑Active Pre‑Training and a Graph Generator Framework
This article examines when pre‑training graph neural networks (GNNs) is beneficial. It proposes a data‑centric graph generator framework for assessing transferability, introduces a data‑active pre‑training strategy that selects informative graphs, and presents experiments showing that a smaller, well‑chosen pre‑training set can outperform full‑scale pre‑training.
Background: With the rise of large models, researchers have asked whether pre‑training graph neural networks (GNNs) can deliver similarly general capabilities. In practice, however, negative transfer is common, hurting roughly 45% of downstream tasks.
The article first asks when GNN pre‑training is necessary and under what data conditions it benefits downstream tasks.
It proposes evaluating transferability by modeling the generation process from pre‑training data to downstream data using a flexible graph generator framework. The framework defines an input space, a generator space (multiple generator bases), and a possible downstream space.
Generator bases can be combined with weighted coefficients (α1…α4) to form a combined generator that captures common patterns across domains. Restricting the generator‑basis search with domain knowledge or graph similarity keeps the search space tractable.
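As a concrete illustration, a combined generator can be sketched as a mixture that draws one basis with probability proportional to its weight, then samples a graph from it. The toy bases below (`ring_basis`, `star_basis`) and the two‑component mixture are illustrative assumptions, not the talk's actual bases, which combine up to four weighted components (α1…α4).

```python
import random

# Toy generator bases: each is a callable that samples a graph,
# here returned as an edge list. These specific bases are
# illustrative assumptions, not taken from the talk.
def ring_basis(n=8):
    """An n-node ring graph."""
    return [(i, (i + 1) % n) for i in range(n)]

def star_basis(n=8):
    """A star graph centered at node 0."""
    return [(0, i) for i in range(1, n)]

def combined_generator(bases, alphas, rng=random):
    """Mixture generator: pick basis i with probability
    proportional to alpha_i, then generate from it."""
    basis = rng.choices(bases, weights=alphas, k=1)[0]
    return basis()

graph = combined_generator([ring_basis, star_basis], alphas=[0.7, 0.3])
```

In the framework, the weights αi are exactly what gets fitted so that the mixture reproduces the downstream graphs as closely as possible.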
Optimization focuses on finding the best combination weights and generator bases, potentially via gradient‑based methods, to maximize the probability of generating downstream graphs.
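One way to make the weight search gradient‑based is to parameterize the combination weights with a softmax and ascend the log‑likelihood of downstream observations under the mixture. The sketch below is a minimal version of that idea, with each basis summarized by an assumed degree distribution; the distributions, step count, and learning rate are illustrative assumptions, not values from the paper.

```python
import math

# Hypothetical summary of each generator basis as a distribution over a
# graph statistic (here, node degree). These numbers are assumptions.
BASES = [
    {0: 0.1, 1: 0.6, 2: 0.3},  # basis 1
    {0: 0.5, 1: 0.3, 2: 0.2},  # basis 2
]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fit_weights(observations, bases=BASES, steps=200, lr=0.5):
    """Gradient ascent on sum_x log(sum_k alpha_k * p_k(x)),
    with alpha = softmax(logits) so the weights stay normalized."""
    logits = [0.0] * len(bases)
    for _ in range(steps):
        alpha = softmax(logits)
        grad = [0.0] * len(bases)
        for x in observations:
            q = sum(a * p[x] for a, p in zip(alpha, bases))
            for k, p in enumerate(bases):
                # d log q / d logit_k, differentiated through the softmax
                grad[k] += alpha[k] * (p[x] - q) / q
        logits = [l + lr * g for l, g in zip(logits, grad)]
    return softmax(logits)

# Downstream degrees are mostly 1, which basis 1 explains best,
# so its weight should dominate after fitting.
weights = fit_weights([1, 1, 1, 2, 1])
```

The same scheme extends to richer graph statistics or to generator bases with learnable internal parameters; only the likelihood term changes.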
Experiments involve 11 pre‑training datasets from three domains and 13 downstream datasets from seven domains, evaluating both node and graph classification. Results show that selecting a small, well‑chosen subset of pre‑training data often outperforms using all available data.
Key findings include: (1) negative transfer is common, so deciding whether to pre‑train is crucial; (2) data quantity is less important than data relevance; (3) a data‑active pre‑training pipeline that iteratively selects informative samples based on model uncertainty and graph properties yields better performance.
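One round of such a data‑active pipeline can be sketched as scoring each candidate graph by predictive uncertainty plus a structural property and keeping the top‑k; in practice the loop re‑scores the pool after each training stage. The 0.7/0.3 score weights, the entropy proxy for uncertainty, and edge density as the graph property are illustrative assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution:
    higher means the model is more uncertain about the graph."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_informative(candidates, k, w_unc=0.7, w_prop=0.3):
    """Score = weighted model uncertainty + weighted graph property,
    then keep the top-k candidates. Weights are assumptions."""
    def score(g):
        return w_unc * entropy(g["probs"]) + w_prop * g["density"]
    return sorted(candidates, key=score, reverse=True)[:k]

# Toy candidate pool: 'probs' are hypothetical model predictions,
# 'density' is the graph's edge density.
pool = [
    {"id": "g1", "probs": [0.5, 0.5], "density": 0.2},    # most uncertain
    {"id": "g2", "probs": [0.9, 0.1], "density": 0.9},    # dense
    {"id": "g3", "probs": [0.99, 0.01], "density": 0.1},  # uninformative
]
chosen = select_informative(pool, k=2)
```

Repeating this selection as the model trains is what makes the pipeline "active": graphs the current model already handles confidently drop out, and the pre‑training budget concentrates on the informative remainder.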
The work concludes with three use cases: (i) defining the application scope of a GNN pre‑training model via the possible downstream space, (ii) estimating feasibility of pre‑training for a given downstream task before investing resources, and (iii) selecting optimal pre‑training data to maximize downstream gains.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.