FinBERT 1.0: An Open‑Source Chinese Financial Domain Pre‑trained BERT Model and Its Evaluation
FinBERT 1.0 is an open‑source Chinese BERT model pre‑trained on large‑scale financial corpora. Without additional task‑specific tuning, it improves F1 by 2‑5 percentage points across multiple downstream fintech tasks, demonstrating the value of domain‑specific pre‑training for natural language processing.
FinBERT 1.0 is the first open‑source Chinese BERT model pre‑trained on large‑scale financial corpora, released by Entropy AI Lab to promote NLP in fintech.
Background: existing open‑source Chinese BERT models are trained on general‑domain corpora; FinBERT targets the financial domain and achieves a 2‑5 percentage‑point F1 improvement on downstream tasks without modifying the model architecture or hyperparameters.
Model architecture: based on Google’s BERT‑Base (12‑layer Transformer); only the Base version is released.
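The BERT‑Base dimensions (12 Transformer layers, hidden size 768, 12 attention heads) fix the parameter budget. A rough pure‑Python tally, assuming Google's standard Chinese BERT WordPiece vocabulary of 21,128 tokens (an assumption; the released FinBERT vocabulary may differ), sketches why the Base model lands at roughly 100M parameters:

```python
def bert_base_param_count(vocab_size=21128, hidden=768, layers=12,
                          intermediate=3072, max_pos=512, type_vocab=2):
    """Approximate parameter count of a BERT-Base encoder.
    vocab_size=21128 assumes Google's Chinese BERT vocabulary (assumption)."""
    ln = 2 * hidden  # LayerNorm gamma + beta
    embeddings = (vocab_size + max_pos + type_vocab) * hidden + ln
    attention = 4 * (hidden * hidden + hidden) + ln      # Q, K, V, output proj
    ffn = (hidden * intermediate + intermediate          # two dense layers
           + intermediate * hidden + hidden + ln)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + ffn) + pooler

print(bert_base_param_count())  # roughly 102M with the Chinese vocabulary
```

Note that most of the per‑layer budget sits in the attention projections and the 768→3072→768 feed‑forward block, which is why only releasing a Base (rather than Large) checkpoint keeps the model practical for fine‑tuning.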
Training data: ~4 million documents covering financial news (≈1 M), research reports & announcements (≈2 M), and financial encyclopedia entries (≈1 M), filtered to ≈3 billion tokens.
Pre‑training tasks: token‑level Financial Whole Word Masking (FWWM) and Next Sentence Prediction, plus two supervised tasks—research‑report industry classification and financial‑entity recognition—using a two‑stage training schedule (maximum sentence length 128, then 512).
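The key idea of whole word masking is that all subword pieces of a word are masked together, so the model must predict the entire word rather than recover one piece from its neighbors. The sketch below illustrates this with WordPiece `##` continuations; it is not FinBERT's actual implementation, which additionally relies on Chinese word segmentation and a financial lexicon:

```python
import random

def whole_word_mask(tokens, mask_ratio=0.15, seed=0):
    """Mask whole words: a token starting with '##' continues the previous
    word, so all of a word's pieces share one mask decision.
    (Illustrative sketch only; FinBERT's FWWM also uses Chinese word
    segmentation and a financial vocabulary to define word boundaries.)"""
    # Group subword indices into whole-word spans.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(words) * mask_ratio))
    masked = set()
    for span in rng.sample(words, n_to_mask):
        masked.update(span)
    return ["[MASK]" if i in masked else t for i, t in enumerate(tokens)]

print(whole_word_mask(["net", "##flix", "posts", "quarter", "##ly", "earnings"]))
```

Whichever word is chosen, both of its pieces are replaced, which is the property that distinguishes whole word masking from BERT's original per‑token masking.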
Acceleration: TensorFlow XLA and Automatic Mixed Precision (AMP) reduce training time and memory, achieving ~3× speedup.
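The article does not show the exact configuration; in TensorFlow 1.x these two optimizations were commonly switched on via environment variables before importing TensorFlow, without any model‑code changes. A hedged config sketch (not FinBERT's actual training script):

```python
import os

# Enable XLA JIT auto-clustering: fuses eligible ops into compiled kernels.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=2"

# Enable Automatic Mixed Precision: float16 compute where numerically safe,
# float32 master weights, with automatic loss scaling.
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"

# Both variables must be set before `import tensorflow`; the graph/estimator
# code is then built as usual.
```

On Volta‑class or newer GPUs, AMP halves activation memory and exploits Tensor Cores, which together with XLA kernel fusion is consistent with the ~3× speedup reported.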
Experiments: four downstream benchmarks (financial short‑text type classification, industry classification, sentiment classification, and named‑entity recognition) compare FinBERT with Google Chinese BERT, BERT‑wwm and RoBERTa‑wwm‑ext; FinBERT consistently outperforms baselines by 2‑5 percentage points in F1.
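The reported 2‑5 point gaps are measured in F1, which for multi‑class benchmarks like these is typically macro‑averaged over classes. A minimal refresher, with illustrative labels rather than the paper's data:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class precision/recall/F1, then the mean."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Illustrative sentiment labels (hypothetical, not the benchmark data).
gold = ["pos", "neg", "neu", "pos", "neg"]
pred = ["pos", "neg", "pos", "pos", "neu"]
print(round(macro_f1(gold, pred), 3))  # -> 0.489
```

Because macro F1 weights all classes equally, a 2‑5 point gain implies improvement even on the rarer classes, not just the dominant ones.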
Conclusion: FinBERT demonstrates the benefit of domain‑specific pre‑training for Chinese financial NLP; future work includes larger corpora and FinBERT 2.0/3.0 releases.
References and author information are provided, along with the GitHub repository https://github.com/valuesimplex/FinBERT.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.