
Tencent FinTech AI Development Platform: Architecture, Challenges, and Solutions

This article details the background, goals, and evolution of Tencent's FinTech AI development platform, outlines the technical challenges faced in feature engineering, model training, and inference services, and presents the comprehensive solutions and future plans implemented to improve efficiency, stability, and scalability.

DataFunTalk

The presentation introduces Tencent FinTech's AI development platform, describing its business scope—mobile payments, investment services, livelihood services, and cross‑border payments—and the need for a unified, one‑stop development environment.

The platform has progressed through four stages, from traditional machine learning to deep learning, but it still suffered from low development efficiency and high usage barriers, prompting a 2022 overhaul.

Key 2022 challenges include:

Feature‑engineering performance and quality: long sample‑construction cycles and a lack of adequate evaluation tools.

Model development efficiency, with duplicated efforts across teams.

Training capability, as larger datasets and models strain resources.

Inference service stability, handling high request volumes.

Solutions implemented:

Construction of a unified feature platform with online and offline services, supporting feature selection, slicing, and monitoring.

Feature selection using quality filters, business relevance, and importance‑based methods (filter, wrapper, embedded).
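Of the three method families, filter methods are the cheapest and typically run first. The sketch below illustrates one such filter, a variance threshold, in plain Python; the function names and threshold are illustrative and not taken from the platform described above.

```python
# A minimal sketch of a filter-style feature-selection pass, assuming
# features arrive as named numeric columns. Near-constant features carry
# little signal and can be dropped before the more expensive wrapper and
# embedded methods run.

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_filter(columns, threshold=1e-3):
    """Keep only feature columns whose variance exceeds the threshold."""
    return {name: vals for name, vals in columns.items()
            if variance(vals) > threshold}

features = {
    "amount":   [1.0, 5.0, 2.5, 7.0],
    "constant": [3.0, 3.0, 3.0, 3.0],   # zero variance -> filtered out
}
kept = variance_filter(features)
```

Wrapper and embedded methods (e.g. training a model and ranking feature importances) would then operate only on the surviving columns.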

Sample rollback optimization using business‑level partitioning, sparse storage, Bloom filters, dictionary conversion, and broadcast joins.
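A Bloom filter in this context lets the pipeline cheaply check whether a sample key could exist in a partition before paying for an expensive offline lookup or join. The toy implementation below is a generic sketch (size, hash count, and key format are illustrative), not the platform's actual code.

```python
# A toy Bloom filter: a fixed-size bit array addressed by several hash
# positions per key. Membership checks can return false positives but
# never false negatives, which is what makes them safe as a pre-filter.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        # Derive several bit positions from one cryptographic digest.
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("user_123")
```

Keys that the filter rejects can skip the join entirely, which pairs naturally with the broadcast-join optimization mentioned above.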

Training optimizations: upgrading to TensorFlow 2, using TFRecord, GPU pre‑loading, sparse embedding acceleration, mixed‑precision, multi‑card training with Horovod, model parallelism for sparse layers, data parallelism for dense layers, and a three‑level cache (SSD, memory, GPU).
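The three-level cache keeps hot embeddings in the fastest tier and promotes entries on access. Below is a minimal two-tier sketch in plain Python standing in for the SSD/memory/GPU hierarchy; the capacity and LRU eviction policy are illustrative assumptions, not details from the talk.

```python
# A minimal sketch of a tiered embedding cache: a small "hot" tier
# (standing in for GPU memory) backed by a larger "cold" tier (standing
# in for host RAM or SSD). Reads promote entries into the hot tier and
# evict the least-recently-used entry when it overflows.
from collections import OrderedDict

class TieredCache:
    def __init__(self, hot_capacity=2):
        self.hot = OrderedDict()   # fast tier, LRU-ordered
        self.cold = {}             # slow but larger tier
        self.hot_capacity = hot_capacity

    def put(self, key, value):
        self.cold[key] = value

    def get(self, key):
        if key in self.hot:                  # hot hit: cheapest path
            self.hot.move_to_end(key)
            return self.hot[key]
        value = self.cold[key]               # cold hit: promote
        self.hot[key] = value
        if len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)     # evict least recently used
        return value

cache = TieredCache(hot_capacity=2)
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    cache.put(k, v)
```

A real implementation would also write back evicted entries and batch GPU transfers, but the promotion/eviction flow is the core idea.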

Model deployment via a unified inference service with a visual UI for deployment, verification, traffic switching, and validation.

Inference acceleration through operator optimization, model pruning, and quantization.
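Quantization shrinks model weights (e.g. float32 to int8) to cut memory traffic and speed up inference. The sketch below shows symmetric post-training quantization with a single scalar scale; production stacks typically calibrate per channel, so treat this as a conceptual toy rather than the platform's method.

```python
# A minimal sketch of symmetric int8 post-training quantization: map
# float weights into [-127, 127] with one shared scale, then recover
# approximate floats by multiplying back.

def quantize(weights):
    """Quantize floats to int8 codes plus a shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Approximately recover the original float weights."""
    return [c * scale for c in codes]

w = [0.5, -1.0, 0.25]
codes, scale = quantize(w)
w_hat = dequantize(codes, scale)
```

The round trip loses a small amount of precision per weight, which is the accuracy/latency trade-off that quantization-aware evaluation has to validate.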

Service governance using cloud‑native architectures for disaster recovery, fault tolerance, and elastic scaling.

Stability measures: change‑management procedures, code and dependency optimization, adherence to development standards, and regular disaster‑recovery drills.

Future plans focus on scaling large‑graph training and enhancing AutoML capabilities for hyper‑parameter tuning and model selection.
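One of the simplest AutoML building blocks for hyper-parameter tuning is random search over a discrete space. The sketch below is a generic illustration with a toy objective; the search space, scoring function, and trial count are all assumptions, not details of the planned system.

```python
# A minimal sketch of random-search hyper-parameter tuning: sample
# configurations from a discrete space and keep the best-scoring one.
import random

def random_search(objective, space, trials=50, seed=0):
    """Sample `trials` configs from `space`; return the best (cfg, score)."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(trials):
        cfg = {name: rng.choice(choices) for name, choices in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy objective: prefers a learning rate near 0.01 and a deeper model.
space = {"lr": [0.001, 0.01, 0.1], "depth": [2, 4, 8]}
objective = lambda cfg: -abs(cfg["lr"] - 0.01) + cfg["depth"] * 0.01
best_cfg, best_score = random_search(objective, space)
```

More sophisticated strategies (Bayesian optimization, successive halving) follow the same interface: an objective, a space, and a budget.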

The Q&A section addresses platform openness, Bloom filter usage, key factors affecting inference stability, and the impact of large models on architecture, highlighting the need for specialized training frameworks and performance optimizations such as vLLM and TensorRT‑LLM.

Tags: cloud-native, Artificial Intelligence, Feature Engineering, model training, Inference, FinTech
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
