
Audio Quality Assessment Using a BiLSTM Deep Learning Model

This article presents a no‑reference audio quality assessment system that extracts spectral features with an FFT and feeds them to a bidirectional LSTM (BiLSTM) network to predict perceptual scores. It covers the architecture, technical advantages, data preparation, loss design, and TensorFlow implementation.

360 Quality & Efficiency

With the rapid growth of audio‑visual products, users demand higher audio quality, prompting research into objective evaluation methods that can replace subjective listening tests. Objective approaches are divided into reference‑based (e.g., PESQ) and no‑reference methods (e.g., P.563).

Most current objective techniques are reference‑based, yet humans can judge quality without a reference, which suggests that quality can be evaluated from the degraded signal alone. Inspired by this, the authors explore a deep‑learning solution that trains a BiLSTM network to emulate human judgment.

The overall solution consists of four stages: (1) the client uploads audio data to the server; (2) a web server forwards the request via Nginx to the appropriate task server, which packages the data for an AI server; (3) the AI server extracts spectral features from the audio using a Fast Fourier Transform (FFT); (4) the extracted features are fed into a bidirectional LSTM network, which outputs an audio‑quality score that is returned to the client.
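Stage (3) can be illustrated with a short NumPy sketch. The article does not give the framing parameters, so the frame length, hop length, and Hann window below are assumptions chosen for illustration, and `magnitude_spectrogram` is a hypothetical helper name.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=512, hop_len=256):
    """Return a [frames, frame_len // 2 + 1] magnitude spectrogram.

    frame_len and hop_len are illustrative assumptions, not values
    taken from the article.
    """
    signal = np.asarray(signal, dtype=np.float32)
    # Zero-pad very short clips so that at least one frame exists.
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))
    num_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hanning(frame_len)
    # Slice overlapping frames and apply the window to each one.
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    # rfft keeps only the non-negative frequency bins of a real signal.
    return np.abs(np.fft.rfft(frames, axis=1)).astype(np.float32)

# One second of a 1 kHz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
spec = magnitude_spectrogram(np.sin(2 * np.pi * 1000 * t))
print(spec.shape)
```

Each row of the result is one frame, so clips of different durations simply yield a different number of rows, which is what lets a recurrent model consume audio of arbitrary length.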

Key technical advantages of this approach are: it requires no reference audio; it can handle audio of arbitrary length; and, because it operates on spectral features rather than raw waveforms, it reduces the computational load while improving prediction accuracy.

In the core implementation, a BiLSTM network is employed because audio frames depend on both past and future context, allowing the model to capture global information. The network produces two outputs: frame‑level scores and an overall quality rating.
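A two-output network of this kind might be sketched in Keras as follows. The article does not list the layers, so the layer sizes, the dense head, and the choice to average frame scores into the overall rating are all assumptions made for illustration; `build_bilstm_scorer` is a hypothetical name. The output order (overall score first, frame scores second) matches how the training code in this article unpacks the model outputs.

```python
import tensorflow as tf

def build_bilstm_scorer(num_freq_bins=257, lstm_units=100):
    """Hypothetical two-output BiLSTM scorer; all sizes are assumptions."""
    # The time dimension is None so that any number of frames is accepted.
    inputs = tf.keras.Input(shape=(None, num_freq_bins))
    # The bidirectional wrapper runs one LSTM forward and one backward in time.
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(lstm_units, return_sequences=True))(inputs)
    x = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(50, activation="relu"))(x)
    # One score per frame: [batch, frames, 1].
    frame_scores = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(1))(x)
    # Assumed aggregation: the overall rating is the mean frame score.
    overall = tf.keras.layers.GlobalAveragePooling1D()(frame_scores)
    return tf.keras.Model(inputs, [overall, frame_scores])

model = build_bilstm_scorer()
overall, frame_scores = model(tf.zeros([2, 40, 257]))
print(overall.shape, frame_scores.shape)
```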

For data preparation, the ST‑CMDS clean Chinese speech corpus is combined with 100 types of noise at various signal‑to‑noise ratios (SNR) to simulate real‑world conditions. PESQ scores are computed for the noisy samples and used as ground‑truth labels.
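Mixing a noise clip into clean speech at a chosen SNR can be sketched as below. This is a generic recipe rather than the authors' script: `mix_at_snr` is a hypothetical helper, and looping short noise clips to cover the speech is an assumption.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech so the mixture has the requested SNR in dB."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Loop the noise if it is shorter than the speech, then trim to length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve clean_power / (scale**2 * noise_power) = 10 ** (snr_db / 10).
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for a clean utterance
babble = rng.standard_normal(4000)    # stand-in for a noise recording
noisy = mix_at_snr(speech, babble, snr_db=10)
```

The PESQ label for each noisy sample would then be computed against the matching clean utterance with a PESQ implementation of choice, which is not shown here.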

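The loss function comprises two parts: (1) the mean‑squared error (MSE) between the predicted overall score and the true PESQ value, and (2) a frame‑level MSE between each frame's predicted score and the utterance's PESQ value, weighted per utterance by 10^(PESQ − 4.5) so that utterances with higher PESQ contribute more to the frame‑level term.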
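Written out, the two-part objective can be expressed as below. The symbol names are chosen here for illustration: B is the batch size, T the number of frames per (padded) utterance, Q_b the true PESQ score of utterance b, \hat{Q}_b the predicted overall score, and \hat{q}_{b,t} the predicted score of frame t. The weight is read off the frame-level loss code in this article.

```latex
\mathcal{L}
  = \frac{1}{B}\sum_{b=1}^{B}\bigl(Q_b - \hat{Q}_b\bigr)^2
  + \frac{1}{B}\sum_{b=1}^{B} 10^{\,Q_b - 4.5}\,
    \frac{1}{T}\sum_{t=1}^{T}\bigl(Q_b - \hat{q}_{b,t}\bigr)^2
```

The weight equals 1 when Q_b = 4.5 and shrinks by a factor of ten for every point below that, so frame-level errors on heavily degraded utterances are penalized less.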

Code for the frame‑level loss and training loop:

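import tensorflow as tf

def frame_mse_tf(y_true, y_pred):
    # y_true, y_pred: [batch, frames, 1]; every frame label repeats the utterance PESQ.
    true_pesq = y_true[:, 0, 0]
    # Per-utterance MSE over all frames.
    per_utt_mse = tf.reduce_mean(tf.math.square(y_true - y_pred), axis=[1, 2])
    # The weight is 1 at PESQ 4.5 and shrinks tenfold per point below it.
    weights = tf.pow(10.0, true_pesq - 4.5)
    return tf.reduce_mean(weights * per_utt_mse)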

def train_loop(features, labels1, labels2):
    loss_object = tf.keras.losses.MeanSquaredError()
    with tf.GradientTape() as tape:
        predictions1, predictions2 = model(features)
        loss1 = loss_object(labels1, predictions1)
        loss2 = frame_mse_tf(labels2, predictions2)
        loss = loss1 + loss2
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

Model training uses a TensorFlow data pipeline: the dataset is loaded from .npy files, shuffled, padded‑batched, and prefetched. The RMSprop optimizer (learning rate 0.001) updates the model parameters during the training loop.

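import numpy as np

def read_npy_file(filename):
    # filename arrives as a scalar string tensor inside tf.py_function.
    data = np.load(filename.numpy().decode())
    return data.astype(np.float32)

def data_preprocessing(feature):
    feature, label1 = feature[..., 0], feature[0][0]
    # Repeat the utterance-level score once per frame: shape [frames, 1].
    label2 = label1[0] * np.ones([feature.shape[0], 1], dtype=np.float32)
    return feature, label1, label2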

def read_feature(filename):
    [feature,] = tf.py_function(read_npy_file, [filename], [tf.float32,])
    data, label1, label2 = tf.py_function(data_preprocessing, [feature], [tf.float32, tf.float32, tf.float32])
    return data, label1, label2

def generate_data(file_path):
    list_ds = tf.data.Dataset.list_files(file_path + '*.npy')
    feature_ds = list_ds.map(read_feature, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    return feature_ds

# `model`, `BATCH_SIZE`, and `train_file_path` are assumed to be defined elsewhere.
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
W = model.layers[0].get_weights()
ds = generate_data(train_file_path)
# Pad every batch to its longest utterance: features, overall label, frame labels.
ds = ds.shuffle(buffer_size=1000).padded_batch(BATCH_SIZE, padded_shapes=([None, None], [None], [None, None])).prefetch(tf.data.experimental.AUTOTUNE)
for step, (x, y, z) in enumerate(ds):
    loss = train_loop(x, y, z)

In summary, the proposed solution builds an automatic, reference‑free audio quality assessment method using a bidirectional LSTM network trained on a synthetic dataset derived from clean Chinese speech and diverse noise, achieving accurate quality predictions without needing a high‑quality reference signal.
