Model‑Based Audio Denoising Using Deep Learning for Device Quality Evaluation
This article presents a deep‑learning approach that transforms recorded audio into spectrograms, trains a noise‑prediction network (e.g., ResNet, U‑Net, LSTM) to estimate environmental noise, subtracts it in the frequency domain, and reconstructs a cleaner signal for more accurate audio‑device quality assessment.
With the rapid development of digital technology, audio has become an integral part of daily life, and the quality of recorded audio is now a key indicator of device performance; however, environmental noise often contaminates recordings, making it difficult to evaluate device quality accurately.
The proposed solution first applies a short‑time Fourier transform (STFT) to obtain the magnitude and phase of the audio signal, feeds the magnitude into a trainable noise‑prediction model that outputs an estimated noise spectrum, subtracts this from the original spectrum, and finally performs an inverse STFT to reconstruct a denoised waveform for quality testing.
The overall workflow is divided into five steps:
1. Collect clean speech recordings and separate recordings of environmental and device noise.
2. Segment the recordings into equal‑length clips, randomly scale them, and mix them to create diverse training samples; then compute their spectrograms.
3. Train a noise‑prediction network that can estimate the noise spectrum from a mixed spectrogram.
4. Denoise by feeding a noisy spectrogram into the trained model, subtracting the predicted noise, and applying inverse STFT.
5. Use the cleaned audio for device quality evaluation.
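Steps 3 and 4 can be illustrated in miniature. This is a toy sketch, not the project's code: `predict_noise` is a placeholder for the trained network, and the function only shows the frequency-domain subtraction logic on a single magnitude/phase pair.

```python
import numpy as np

def subtract_noise(noisy_mag, noisy_phase, predict_noise):
    """Steps 3-4 in miniature: estimate the noise magnitude from the noisy
    magnitude spectrogram, subtract it, and floor at zero so the result
    remains a valid magnitude. predict_noise stands in for the trained model."""
    noise_est = predict_noise(noisy_mag)
    clean_mag = np.maximum(noisy_mag - noise_est, 0.0)
    # Reconstruction reuses the noisy phase, as the pipeline below does
    return clean_mag * np.exp(1j * noisy_phase)
```

An inverse STFT of the returned complex spectrogram then yields the denoised waveform.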
Technical advantages include targeted removal of specific noise types that traditional methods struggle with, fast and simple operation, and the ability to treat spectrograms as images for effective feature extraction.
For the network backbone, architectures such as ResNet, U‑Net, or LSTM can be employed. Writing S for the mixed (noisy) magnitude spectrogram, N for the target noise spectrum, and f(S;θ) for the network output (an elementwise mask applied to S), the loss function is defined as:

L(S, N; θ) = 1/2 || f(S;θ) ⊙ S − N ||²

where ⊙ denotes the elementwise (Hadamard) product.
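A minimal NumPy rendering of this loss; `mask` stands for the network output f(S;θ), and the function names are illustrative:

```python
import numpy as np

def noise_prediction_loss(mask, mixed_spec, noise_spec):
    """L(S, N; theta) = 1/2 * || f(S; theta) ⊙ S - N ||^2

    mask       -- network output f(S; theta), same shape as the spectrograms
    mixed_spec -- mixed (noisy) magnitude spectrogram S
    noise_spec -- target noise magnitude spectrogram N
    """
    predicted_noise = mask * mixed_spec  # elementwise (Hadamard) product
    return 0.5 * np.sum((predicted_noise - noise_spec) ** 2)
```

Minimizing this drives the masked spectrogram toward the true noise spectrum, which is exactly what the subtraction step relies on.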
Data preprocessing involves collecting long‑duration clean and noise recordings, segmenting them into 5‑second clips (sampling rate 16 kHz, 80 000 samples per clip), and saving the segments:
import os
import librosa
import soundfile as sf

path = long_voice_path
files = os.listdir(path)
for file in files:
    name = file.split('.')[0]
    # Load at 16 kHz; each 5-second clip is 80,000 samples
    long_voice, sample_rate = librosa.load(path + file, sr=16000)
    n = int(len(long_voice) / 80000)
    for i in range(1, n + 1):  # keep all n full clips, not n - 1
        short_voice = long_voice[(i - 1) * 80000 : i * 80000]
        save_path = short_voice_path + name + '_' + str(i) + '.wav'
        sf.write(save_path, short_voice, sample_rate)

Audio mixing is performed by randomly scaling clean and noise clips and adding them:
files = os.listdir(clean_voice_path)
noises = os.listdir(noise_voice_path)
for i in range(5):
    clean_voice, clean_sr = librosa.load(clean_voice_path + files[i], sr=16000)
    for j in range(1, 5):
        noise = librosa.load(noise_voice_path + noises[j], sr=16000)[0]
        # Random gains diversify the signal-to-noise ratio of the mixtures;
        # scaling inside the expression avoids compounding the gain across iterations
        x = np.random.uniform(0.5, 3)
        y = np.random.uniform(0.5, 3)
        mixed_voice = y * clean_voice + x * noise
        save_name = files[i].split('.')[0] + '_' + str(j) + '.wav'
        sf.write(mixed_voice_path + save_name, mixed_voice, 16000)

STFT and inverse STFT conversions are carried out as follows:
dim_square_spec = int(n_fft / 2) + 1
m_amp_db_voice, m_pha_voice = numpy_audio_to_matrix_spectrogram(clean_voice, dim_square_spec, n_fft, hop_length_fft)

Training the noise-prediction model uses the mixed spectrogram as input and the noise spectrogram (mixed minus clean) as target:
from sklearn.model_selection import train_test_split
from keras.callbacks import ModelCheckpoint

def training(weights_path, training_from_scratch, epochs, batch_size):
    voice_in = np.load(mixed_voice_spectrogram_path)
    voice_ou = np.load(clean_voice_spectrogram_path)
    # The target is the noise spectrogram: mixed minus clean
    voice_ou = voice_in - voice_ou
    voice_in = scaled_in(voice_in)
    voice_ou = scaled_ou(voice_ou)
    voice_train, voice_test, label_train, label_test = train_test_split(
        voice_in, voice_ou, test_size=0.20, random_state=42)
    if training_from_scratch:
        nn_model = model()
    else:
        nn_model = model(pretrained_weights=weights_path)  # resume from saved weights
    checkpoint = ModelCheckpoint(weights_path + '/model_best.h5', verbose=1,
                                 monitor='val_loss', save_best_only=True, mode='auto')
    history = nn_model.fit(voice_train, label_train, epochs=epochs, batch_size=batch_size,
                           shuffle=True, callbacks=[checkpoint],
                           validation_data=(voice_test, label_test))

During inference, the noisy audio is transformed to a spectrogram, passed through the trained model, the predicted noise is subtracted, and the result is converted back to the time domain:
from keras.models import model_from_json

def prediction(weights_path, name_model, audio_dir_prediction, dir_save_prediction,
               audio_input_prediction, audio_output_prediction, sample_rate,
               min_duration, frame_length, hop_length_frame, n_fft, hop_length_fft):
    # Rebuild the model from its JSON description and load the trained weights
    json_file = open(weights_path + '/' + name_model + '.json', 'r')
    loaded_model = model_from_json(json_file.read())
    json_file.close()
    loaded_model.load_weights(weights_path + '/' + name_model + '.h5')
    # Load the noisy audio and convert it to dB-magnitude and phase spectrograms
    audio = audio_files_to_numpy(audio_dir_prediction, audio_input_prediction,
                                 sample_rate, frame_length, hop_length_frame, min_duration)
    dim_square_spec = int(n_fft / 2) + 1
    m_amp_db_audio, m_pha_audio = numpy_audio_to_matrix_spectrogram(audio, dim_square_spec, n_fft, hop_length_fft)
    # Predict the noise spectrum and subtract it from the noisy magnitude
    X_in = scaled_in(m_amp_db_audio).reshape(m_amp_db_audio.shape[0], m_amp_db_audio.shape[1], m_amp_db_audio.shape[2], 1)
    X_pred = loaded_model.predict(X_in)
    inv_sca_X_pred = inv_scaled_ou(X_pred)
    X_denoise = m_amp_db_audio - inv_sca_X_pred[:, :, :, 0]
    # Reconstruct the waveform with the original phase and save it
    audio_denoise = matrix_spectrogram_to_numpy_audio(X_denoise, m_pha_audio, frame_length, hop_length_fft)
    sf.write(dir_save_prediction + audio_output_prediction, audio_denoise, sample_rate)

Experimental results show that the original noisy recording received a quality score of 1.2, while the U-Net-based denoised version scored 4.1 and the LSTM-based version scored 3.8, demonstrating that the model-driven approach effectively mitigates environmental noise and substantially improves audio quality.
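The spectrogram helpers `numpy_audio_to_matrix_spectrogram` and `matrix_spectrogram_to_numpy_audio` used in the code above come from the project's utility code and are not shown. A minimal NumPy stand-in, assuming the helpers return a dB magnitude plus phase per clip and invert by overlap-add (function names and default parameters here are illustrative):

```python
import numpy as np

def audio_to_matrix_spectrogram(clips, n_fft=255, hop_length_fft=63):
    """Stand-in for numpy_audio_to_matrix_spectrogram (assumed behavior):
    per clip, window each frame, FFT it, and return the magnitude in dB
    together with the phase needed for reconstruction."""
    window = np.hanning(n_fft)
    amp_db, phase = [], []
    for clip in clips:
        frames = np.array([clip[i:i + n_fft] * window
                           for i in range(0, len(clip) - n_fft + 1, hop_length_fft)])
        spec = np.fft.rfft(frames, axis=1).T  # shape (n_fft // 2 + 1, n_frames)
        amp_db.append(20 * np.log10(np.abs(spec) + 1e-10))
        phase.append(np.angle(spec))
    return np.array(amp_db), np.array(phase)

def matrix_spectrogram_to_audio(amp_db, phase, n_fft=255, hop_length_fft=63):
    """Inverse stand-in: undo the dB scaling, reattach the phase, and
    overlap-add the inverse FFT of each frame."""
    window = np.hanning(n_fft)
    audio = []
    for mag_db, pha in zip(amp_db, phase):
        spec = (10 ** (mag_db / 20)) * np.exp(1j * pha)
        frames = np.fft.irfft(spec.T, n=n_fft, axis=1)
        out = np.zeros(hop_length_fft * (frames.shape[0] - 1) + n_fft)
        norm = np.zeros_like(out)
        for k, frame in enumerate(frames):
            out[k * hop_length_fft:k * hop_length_fft + n_fft] += frame * window
            norm[k * hop_length_fft:k * hop_length_fft + n_fft] += window ** 2
        audio.append(out / np.maximum(norm, 1e-8))  # window-squared normalization
    return np.array(audio)
```

Note that with n_fft = 255, dim_square_spec = int(n_fft / 2) + 1 = 128, matching the square-spectrogram convention used in the code above.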
360 Quality & Efficiency
360 Quality & Efficiency focuses on integrating quality and efficiency seamlessly into R&D, sharing 360's internal best practices with industry peers to foster collaboration among Chinese enterprises and deliver greater value through efficiency.