
Optimizing DSP Deep Model Latency by Externalizing Feature Processing with EzFeaFly

By externalizing feature processing with the EzFeaFly tool and feeding a dense index/value tensor directly to the GPU, the DSP platform decouples feature transformation from model inference, cutting instance usage by ~40%, reducing inference latency by 70-80%, and improving end-to-end latency by more than 60% while lowering costs.

Didi Tech

In the DSP (Demand-Side Platform) business, deep learning models are widely used for ad placement, but model-inference latency has become a critical bottleneck. The original online architecture relied on CPU + TensorFlow Feature Column, which, while convenient, sacrifices inference performance because feature processing cannot be fully accelerated on the GPU.

Analysis of the end‑to‑end latency revealed that more than 75% of the time is spent in the feature‑processing stage, especially due to Feature Column operations that require CPU‑GPU data transfers and memory copies.

The core idea of the solution is to decouple feature processing from model computation. By completely removing Feature Column from the TensorFlow graph and handling features with a dedicated tool called EzFeaFly (Easy Feature Fly), the system can run the heavy computation on GPU while keeping feature transformation on CPU, thus eliminating the costly cross‑device switches.

Basic Principle

Feature processing and model inference are separated.

EzFeaFly processes raw features (both online and offline) and outputs a dense tensor of interleaved (index, value) pairs: counting positions from 1, odd positions hold feature indices and even positions hold the corresponding values.

The dense tensor is fed directly to the model, which can now fully exploit GPU acceleration.
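As a concrete illustration of this layout (the feature indices and values below are invented for the example), splitting the interleaved tensor recovers the index half and the value half:

```python
import numpy as np

# One sample with 4 features as interleaved (index, value) pairs;
# EzFeaFly feature indices start from 1.
sample = np.array([[1, 0.5,
                    2, 1.0,
                    3, 0.0,
                    4, 2.5]], dtype=np.float32)

# Positions 0, 2, 4, ... hold the 1-based feature indices.
indices = sample[:, 0::2].astype(np.int64) - 1  # shift to 0-based for lookup
# Positions 1, 3, 5, ... hold the corresponding feature values.
values = sample[:, 1::2]
```

The stride-2 slices are exactly what the model-side splitting code later in the article performs on its input layer.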

Offline Pipeline

The offline workflow consists of three configuration files:

feature_sql.sql : defines the source of raw features.

parser.conf : specifies the feature, label, and sample-id columns (needed only for offline training, not online).

feature_list.conf : describes the feature processing steps, operators, and output format. The same configuration can be deployed online without modification.
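Conceptually, the offline pipeline reads raw rows from the feature_sql.sql source, applies the operators declared in feature_list.conf, and emits samples in the interleaved layout. The sketch below is purely illustrative; the feature names, operators, and in-memory representation are assumptions, not actual EzFeaFly syntax:

```python
import numpy as np

# Hypothetical stand-ins for entries in feature_list.conf:
# each feature has a source column and a processing operator.
FEATURE_LIST = [
    ("age", lambda v: v / 100.0),       # normalize
    ("clicks", lambda v: np.log1p(v)),  # log transform
    ("ctr", lambda v: v),               # identity
]

def process_row(row):
    """Turn one raw row into the interleaved (index, value) layout."""
    out = []
    for i, (name, op) in enumerate(FEATURE_LIST):
        out.append(float(i + 1))         # 1-based feature index
        out.append(float(op(row[name]))) # processed feature value
    return out

sample = process_row({"age": 30, "clicks": 9, "ctr": 0.05})
```

Because the same feature_list.conf drives both offline sample generation and online serving, the transformation applied at training time is guaranteed to match the one applied at inference time.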

Online Inference Flow

1. Register and publish the required features via the feature service.

2. Synchronize the EzFeaFly processing configuration to the online environment.

3. Deploy the trained model (which expects the dense index/value tensor) to the model service.

4. Configure the strategy workflow to route traffic through the feature service, EzFeaFly, and the model inference service.
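The request path above can be sketched end-to-end as follows. All service names, client functions, and the toy model are hypothetical placeholders, not the actual DSP interfaces:

```python
import numpy as np

FEATURE_NUM = 4

def fetch_raw_features(request_id):
    # Hypothetical stand-in for the feature-service lookup.
    return {"f1": 0.5, "f2": 1.0, "f3": 0.0, "f4": 2.5}

def ezfeafly_process(raw):
    # Stand-in for EzFeaFly: emit the interleaved (index, value) layout,
    # with 1-based feature indices as described in the article.
    out = np.zeros(FEATURE_NUM * 2, dtype=np.float32)
    for i, key in enumerate(sorted(raw)):
        out[2 * i] = i + 1          # 1-based feature index
        out[2 * i + 1] = raw[key]   # feature value
    return out

def model_infer(dense_tensor):
    # Stand-in for the GPU model service; here just a fixed linear model.
    values = dense_tensor[1::2]
    weights = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)
    return float(values @ weights)

def serve(request_id):
    raw = fetch_raw_features(request_id)
    dense = ezfeafly_process(raw)
    return model_infer(dense)

score = serve("req-1")
```

The key property is that the model service receives a single dense tensor and never touches raw features, so its graph contains no Feature Column ops and no CPU-GPU round trips for feature transformation.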

Performance Gains

Machine‑instance count reduced by ~40%, cutting infrastructure cost.

Inference latency decreased by 70‑80% under the same QPS.

Overall end‑to‑end latency improvement exceeds 60% even after accounting for EzFeaFly’s own processing time.

Feature Tensor Splitting Example (TensorFlow)

import tensorflow as tf

# Number of features per sample
FEATURE_NUM = 4

# Input layer: length = FEATURE_NUM * 2, dtype = tf.float32
inputs = tf.keras.Input(shape=[FEATURE_NUM * 2], dtype=tf.float32, name="feature_inputs")

# Extract all indices (subtract 1 because EzFeaFly starts from 1)
input_index = tf.cast(inputs[:, 0::2], tf.int64) - 1

# Extract all values
input_value = tf.cast(inputs[:, 1::2], tf.float32)

# Build the model (outputs omitted for brevity)
model = tf.keras.Model(inputs=inputs, outputs=[...])

Another utility class, IndexEmbedding, subclasses the standard Keras Embedding layer so that it directly accepts the global index tensor and outputs a flattened dense embedding:

import tensorflow as tf

class IndexEmbedding(tf.keras.layers.Embedding):
    """Embedding layer that takes global (0-based) feature indices and
    returns a flattened dense embedding of size sparse_fea_dim * embedding_dim."""

    def __init__(self, dense_fea_dim, embedding_dim, sparse_fea_dim,
                 embeddings_initializer="uniform", embeddings_regularizer=None,
                 activity_regularizer=None, embeddings_constraint=None,
                 mask_zero=False, input_length=None, **kwargs):
        super(IndexEmbedding, self).__init__(
            input_dim=dense_fea_dim,    # vocabulary size of the embedding table
            output_dim=embedding_dim,   # width of each embedding vector
            embeddings_initializer=embeddings_initializer,
            embeddings_regularizer=embeddings_regularizer,
            activity_regularizer=activity_regularizer,
            embeddings_constraint=embeddings_constraint,
            mask_zero=mask_zero,
            input_length=input_length,
            **kwargs)
        self.sparse_fea_dim = sparse_fea_dim
        self.out_dense_dim = sparse_fea_dim * embedding_dim

    def build(self, input_shape=None):
        # Create the embedding table explicitly so its name encodes the shape.
        name = "index_embedding_{}_{}".format(self.input_dim, self.output_dim)
        self.embeddings = self.add_weight(
            shape=(self.input_dim, self.output_dim),
            initializer=self.embeddings_initializer,
            regularizer=self.embeddings_regularizer,
            constraint=self.embeddings_constraint,
            experimental_autocast=False,
            name=name)
        self.built = True

    def call(self, feature, **kwargs):
        # Look up each index, then flatten per sample to one dense vector.
        _embedding = tf.nn.embedding_lookup(self.embeddings, feature)
        embedding = tf.reshape(_embedding, [-1, self.out_dense_dim])
        return embedding

# Example usage: `feature` is the interleaved (index, value) input tensor
input_index = tf.cast(feature[:, 0::2], tf.int64) - 1
embedding = IndexEmbedding(dense_fea_dim=20, embedding_dim=8, sparse_fea_dim=4)(input_index)
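The shape arithmetic in call() can be sanity-checked without TensorFlow: looking up sparse_fea_dim indices, each mapped to an embedding_dim vector, and flattening yields sparse_fea_dim * embedding_dim outputs per sample. The NumPy sketch below mirrors the lookup-then-reshape with the same dimensions as the usage example (the table contents are random, for illustration only):

```python
import numpy as np

dense_fea_dim, embedding_dim, sparse_fea_dim = 20, 8, 4
# Random stand-in for the learned embedding table.
table = np.random.rand(dense_fea_dim, embedding_dim).astype(np.float32)

# Batch of 2 samples, each with 4 global (0-based) feature indices.
input_index = np.array([[0, 1, 2, 3],
                        [4, 5, 6, 7]], dtype=np.int64)

# Equivalent of tf.nn.embedding_lookup followed by tf.reshape in call().
looked_up = table[input_index]                                     # (2, 4, 8)
embedding = looked_up.reshape(-1, sparse_fea_dim * embedding_dim)  # (2, 32)
```

Each sample's four 8-wide embedding rows are simply concatenated into one 32-wide dense vector, which downstream dense layers can consume directly.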

By adopting this external feature‑processing architecture, the DSP system achieved substantial cost savings and latency reductions, enabling more complex models and richer feature sets to be deployed in production.
