
FinBERT 1.0: An Open‑Source Chinese Financial Domain Pre‑trained BERT Model and Its Evaluation

FinBERT 1.0 is an open‑source Chinese BERT model pre‑trained on large‑scale financial corpora that achieves 2‑5 % F1 improvements across multiple downstream fintech tasks without additional tuning, demonstrating the value of domain‑specific pre‑training for natural language processing.

DataFunTalk

FinBERT 1.0 is the first open‑source Chinese BERT model pre‑trained on large‑scale financial corpora, released by Entropy AI Lab to promote NLP in fintech.

Background: existing open‑source Chinese BERT models are pre‑trained on general‑domain corpora; FinBERT targets the financial domain and achieves a 2‑5 % F1 improvement on downstream tasks without extra tuning.

Model architecture: based on Google’s BERT‑Base (12‑layer Transformer); only the Base version is released.
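For context on what "BERT‑Base" implies, the configuration (12 layers, hidden size 768, feed‑forward size 3072) can be sanity‑checked with a rough parameter count. The sketch below assumes the 21,128‑entry vocabulary of Google's Chinese BERT tokenizer; whether FinBERT uses exactly this vocabulary is not stated in the article.

```python
def bert_base_param_count(vocab_size=21128, hidden=768, num_layers=12,
                          ffn_size=3072, max_position=512, type_vocab=2):
    """Rough parameter count for a BERT-Base encoder (biases and layernorms included)."""
    # Embedding tables: token + position + segment, plus one layernorm (gamma, beta).
    embeddings = (vocab_size + max_position + type_vocab) * hidden + 2 * hidden
    # Self-attention: Q, K, V and output projections, each hidden x hidden plus a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward: hidden -> ffn_size -> hidden, with biases.
    feed_forward = hidden * ffn_size + ffn_size + ffn_size * hidden + hidden
    # Two layernorms per layer (gamma and beta each).
    layernorms = 2 * 2 * hidden
    return embeddings + num_layers * (attention + feed_forward + layernorms)

print(bert_base_param_count())  # about 102 million parameters
```

This matches the commonly cited ~102 M parameter figure for Chinese BERT‑Base, which is why only releasing the Base version keeps the model practical to fine‑tune.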

Training data: roughly 4 million documents in total, covering financial news (≈1 M), research reports and company announcements (≈2 M), and financial encyclopedia entries (≈1 M), filtered to about 3 billion tokens.

Pre‑training tasks: word‑level Financial Whole Word Mask (FWWM) and sentence‑level Next Sentence Prediction (NSP), plus two supervised tasks (research‑report industry classification and financial‑entity recognition), using a two‑stage training schedule with maximum sequence length 128, then 512.
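Whole‑word masking differs from BERT's original token‑level masking in that, once a word is selected, all of its constituent tokens are masked together; for Chinese financial text this presupposes a domain‑aware word segmenter. A minimal sketch of the grouping logic (the function name and the `word_ids` input are illustrative, not FinBERT's actual implementation):

```python
import random

def whole_word_mask(tokens, word_ids, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask whole words: if a word is selected, mask every one of its tokens.

    tokens:   list of sub-tokens, e.g. single Chinese characters
    word_ids: parallel list mapping each token to the index of its segmented word
    """
    rng = random.Random(seed)
    # Group token positions by the word they belong to.
    word_to_positions = {}
    for pos, wid in enumerate(word_ids):
        word_to_positions.setdefault(wid, []).append(pos)
    masked = list(tokens)
    for wid, positions in word_to_positions.items():
        if rng.random() < mask_prob:
            for pos in positions:   # mask the whole word, never a fragment
                masked[pos] = mask_token
    return masked
```

For example, with the segmentation 招商银行 → 招商 / 银行, either both characters of 银行 are masked or neither is, so the model must predict financial terms as units rather than character by character.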

Acceleration: TensorFlow XLA and Automatic Mixed Precision (AMP) reduce training time and memory, achieving ~3× speedup.
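The article does not show the training configuration. In current TensorFlow 2.x, the two optimizations can be switched on roughly as follows; note this is a config sketch, and the original TF 1.x BERT pre‑training code enables them through different flags:

```python
import tensorflow as tf

# Enable XLA just-in-time compilation of TensorFlow graphs.
tf.config.optimizer.set_jit(True)

# Automatic mixed precision: compute in float16, keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
```

AMP roughly halves activation memory and lets Tensor Cores do the matrix multiplies, which is where the reported ~3× speedup mainly comes from.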

Experiments: four downstream benchmarks (financial short‑text type classification, industry classification, sentiment classification, and named‑entity recognition) compare FinBERT with Google Chinese BERT, BERT‑wwm and RoBERTa‑wwm‑ext; FinBERT consistently outperforms baselines by 2‑5 percentage points in F1.
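The reported gains are measured in F1. As a refresher on the metric behind those comparisons, per‑class and macro‑averaged F1 can be computed as below (the labels are toy values, not the paper's datasets):

```python
def f1_scores(y_true, y_pred):
    """Per-class F1 and the macro average (unweighted mean over classes)."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = (2 * precision * recall / (precision + recall)
                        if precision + recall else 0.0)
    macro = sum(per_class.values()) / len(per_class)
    return per_class, macro

per_class, macro = f1_scores(["pos", "neg", "pos", "neg"],
                             ["pos", "pos", "pos", "neg"])
print(per_class, macro)  # pos: 0.8, neg: 2/3, macro ≈ 0.733
```

A 2‑5 point gain on this scale is substantial for classification and NER benchmarks where baselines already score in the high 80s or 90s.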

Conclusion: FinBERT demonstrates the benefit of domain‑specific pre‑training for Chinese financial NLP; future work includes larger corpora and FinBERT 2.0/3.0 releases.

References and author information are provided, along with the GitHub repository https://github.com/valuesimplex/FinBERT.

deep learning, domain adaptation, BERT, Chinese, pretrained language model, financial NLP, FinBERT
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
