
Tencent FinTech AI Development Platform: Architecture, Challenges, and Solutions

This article details the background, goals, and evolution of Tencent's FinTech AI development platform, outlines the technical challenges faced in feature engineering, model training, and inference services, and presents the comprehensive solutions and future plans implemented to improve efficiency, stability, and scalability.

DataFunTalk

The presentation introduces Tencent FinTech's AI development platform, describing its business scope—mobile payments, investment services, livelihood services, and cross‑border payments—and the need for a unified, one‑stop development environment.

The platform has progressed through four stages, from traditional machine learning to deep learning, but it still suffered from low development efficiency and high usage barriers, prompting a 2022 overhaul.

Key 2022 challenges include:

Feature‑engineering performance and quality: long sample‑construction cycles and a lack of adequate evaluation tools.

Model development efficiency, with duplicated efforts across teams.

Training capability, as larger datasets and models strain resources.

Inference service stability, handling high request volumes.

Solutions implemented:

Construction of a unified feature platform with online and offline services, supporting feature selection, slicing, and monitoring.

Feature selection using quality filters, business relevance, and importance‑based methods (filter, wrapper, embedded).
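Of the three method families, filter methods are the cheapest and typically run first. The sketch below illustrates one such filter, a variance threshold, in plain Python; the function names and threshold are illustrative and not taken from the platform described above.

```python
# A minimal sketch of a filter-style feature-selection pass, assuming
# features arrive as named numeric columns. Near-constant features carry
# little signal and can be dropped before the more expensive wrapper and
# embedded methods run.

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_filter(columns, threshold=1e-3):
    """Keep only feature columns whose variance exceeds the threshold."""
    return {name: vals for name, vals in columns.items()
            if variance(vals) > threshold}

features = {
    "amount":   [1.0, 5.0, 2.5, 7.0],
    "constant": [3.0, 3.0, 3.0, 3.0],   # zero variance -> filtered out
}
kept = variance_filter(features)
```

Wrapper and embedded methods (e.g. training a model and ranking feature importances) would then operate only on the surviving columns.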

Sample rollback optimization using business‑level partitioning, sparse storage, Bloom filters, dictionary conversion, and broadcast joins.
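A Bloom filter in this context lets the pipeline cheaply check whether a sample key could exist in a partition before paying for an expensive offline lookup or join. The toy implementation below is a generic sketch (size, hash count, and key format are illustrative), not the platform's actual code.

```python
# A toy Bloom filter: a fixed-size bit array addressed by several hash
# positions per key. Membership checks can return false positives but
# never false negatives, which is what makes them safe as a pre-filter.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        # Derive several bit positions from one cryptographic digest.
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("user_123")
```

Keys that the filter rejects can skip the join entirely, which pairs naturally with the broadcast-join optimization mentioned above.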

Training optimizations: upgrading to TensorFlow 2, using TFRecord, GPU pre‑loading, sparse embedding acceleration, mixed‑precision, multi‑card training with Horovod, model parallelism for sparse layers, data parallelism for dense layers, and a three‑level cache (SSD, memory, GPU).
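The three-level cache keeps hot embeddings in the fastest tier and promotes entries on access. Below is a minimal two-tier sketch in plain Python standing in for the SSD/memory/GPU hierarchy; the capacity and LRU eviction policy are illustrative assumptions, not details from the talk.

```python
# A minimal sketch of a tiered embedding cache: a small "hot" tier
# (standing in for GPU memory) backed by a larger "cold" tier (standing
# in for host RAM or SSD). Reads promote entries into the hot tier and
# evict the least-recently-used entry when it overflows.
from collections import OrderedDict

class TieredCache:
    def __init__(self, hot_capacity=2):
        self.hot = OrderedDict()   # fast tier, LRU-ordered
        self.cold = {}             # slow but larger tier
        self.hot_capacity = hot_capacity

    def put(self, key, value):
        self.cold[key] = value

    def get(self, key):
        if key in self.hot:                  # hot hit: cheapest path
            self.hot.move_to_end(key)
            return self.hot[key]
        value = self.cold[key]               # cold hit: promote
        self.hot[key] = value
        if len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)     # evict least recently used
        return value

cache = TieredCache(hot_capacity=2)
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    cache.put(k, v)
```

A real implementation would also write back evicted entries and batch GPU transfers, but the promotion/eviction flow is the core idea.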

Model deployment via a unified inference service with a visual UI for deployment, verification, traffic switching, and validation.

Inference acceleration through operator optimization, model pruning, and quantization.
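Quantization shrinks model weights (e.g. float32 to int8) to cut memory traffic and speed up inference. The sketch below shows symmetric post-training quantization with a single scalar scale; production stacks typically calibrate per channel, so treat this as a conceptual toy rather than the platform's method.

```python
# A minimal sketch of symmetric int8 post-training quantization: map
# float weights into [-127, 127] with one shared scale, then recover
# approximate floats by multiplying back.

def quantize(weights):
    """Quantize floats to int8 codes plus a shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Approximately recover the original float weights."""
    return [c * scale for c in codes]

w = [0.5, -1.0, 0.25]
codes, scale = quantize(w)
w_hat = dequantize(codes, scale)
```

The round trip loses a small amount of precision per weight, which is the accuracy/latency trade-off that quantization-aware evaluation has to validate.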

Service governance using cloud‑native architectures for disaster recovery, fault tolerance, and elastic scaling.

Stability measures: change‑management procedures, code and dependency optimization, adherence to development standards, and regular disaster‑recovery drills.

Future plans focus on scaling large‑graph training and enhancing AutoML capabilities for hyper‑parameter tuning and model selection.
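One of the simplest AutoML building blocks for hyper-parameter tuning is random search over a discrete space. The sketch below is a generic illustration with a toy objective; the search space, scoring function, and trial count are all assumptions, not details of the planned system.

```python
# A minimal sketch of random-search hyper-parameter tuning: sample
# configurations from a discrete space and keep the best-scoring one.
import random

def random_search(objective, space, trials=50, seed=0):
    """Sample `trials` configs from `space`; return the best (cfg, score)."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(trials):
        cfg = {name: rng.choice(choices) for name, choices in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy objective: prefers a learning rate near 0.01 and a deeper model.
space = {"lr": [0.001, 0.01, 0.1], "depth": [2, 4, 8]}
objective = lambda cfg: -abs(cfg["lr"] - 0.01) + cfg["depth"] * 0.01
best_cfg, best_score = random_search(objective, space)
```

More sophisticated strategies (Bayesian optimization, successive halving) follow the same interface: an objective, a space, and a budget.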

The Q&A section addresses platform openness, Bloom filter usage, key factors affecting inference stability, and the impact of large models on architecture, highlighting the need for specialized training frameworks and performance optimizations such as vLLM and TensorRT‑LLM.

Tags: cloud-native, Artificial Intelligence, Feature Engineering, model training, Inference, FinTech
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
