Report on Interspeech 2017 and SLaTE 2017: Highlights in Speech Recognition, Synthesis, and Speaker Technologies
This article reports on Liulishuo’s participation in Interspeech 2017 and the SLaTE 2017 workshop, summarizing key research on noise‑robust ASR, attention‑based models, TensorFlow training, modern TTS systems, and speaker identification datasets, and closing with a hiring announcement for AI engineers.
Interspeech 2017, one of the top international conferences in speech technology, was held in Stockholm from August 20‑24, 2017 under the theme "Situated Interaction", attracting 1,778 registrants and 2,053 attendees. Topics covered included acoustic modeling, language modeling, speech synthesis, speaker verification, and speech enhancement.
Three algorithm engineers from Liulishuo attended the conference and presented the paper "Deep Context Model for Grammatical Error Correction" at the SLaTE 2017 workshop.
Speech Recognition: The report highlights research on noise‑robust and far‑field ASR, including "Acoustic Modeling for Google Home" (adaptive dereverberation front‑end, multi‑channel acoustic modeling, Grid‑LSTMs), permutation‑invariant training for multi‑talker speech, and the shift from CTC‑only to attention‑based seq2seq models with CTC‑guided initialization. It also mentions the joint CTC‑Attention end‑to‑end system with a deep CNN encoder and the use of TensorFlow for large‑scale acoustic model training.
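To give a feel for the attention mechanism at the heart of these seq2seq models, here is a minimal NumPy sketch of dot‑product attention over encoder frames. Function names and dimensions are illustrative only, not taken from any of the cited papers:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, encoder_states):
    """Dot-product attention: weight encoder frames by similarity to the decoder query.

    query          : (D,)   current decoder state
    encoder_states : (T, D) one row per encoder time step
    Returns the context vector (D,) and the attention weights (T,).
    """
    scores = encoder_states @ query    # (T,) similarity of each frame to the query
    weights = softmax(scores)          # attention distribution over frames, sums to 1
    context = weights @ encoder_states # (D,) weighted sum of encoder frames
    return context, weights
```

At each decoding step the model recomputes the weights, so the context vector "looks at" different parts of the input utterance as the output sequence is produced.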
Speech Synthesis: Notable works cited are Siri’s on‑device deep‑learning‑guided unit‑selection TTS, Google’s real‑time unit‑selection synthesizer using seq2seq LSTM autoencoders, and the Tacotron end‑to‑end system that predicts spectrograms directly from characters and uses the Griffin‑Lim algorithm for waveform reconstruction.
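Griffin‑Lim, which Tacotron uses to reconstruct a waveform from a predicted magnitude spectrogram, alternates between enforcing the known magnitude and re‑estimating a consistent phase. A minimal sketch using SciPy’s STFT (the window length, overlap, and iteration count are arbitrary choices for illustration, not Tacotron’s settings):

```python
import numpy as np
from scipy.signal import stft, istft

NPERSEG, NOVERLAP = 256, 192  # illustrative STFT settings (75% overlap)

def griffin_lim(magnitude, n_iter=30, seed=0):
    """Recover a waveform whose STFT magnitude approximates `magnitude`.

    Starts from random phase, then repeatedly: invert to a time-domain
    signal, re-analyze it, and keep only the new phase while restoring
    the target magnitude.
    """
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        _, signal = istft(magnitude * phase, nperseg=NPERSEG, noverlap=NOVERLAP)
        _, _, spec = stft(signal, nperseg=NPERSEG, noverlap=NOVERLAP)
        phase = np.exp(1j * np.angle(spec))  # keep phase, discard magnitude
    _, signal = istft(magnitude * phase, nperseg=NPERSEG, noverlap=NOVERLAP)
    return signal
```

Because only the magnitude is given, the result is an estimate: the algorithm converges to a signal whose spectrogram is close to the target, which is why later systems replaced it with neural vocoders.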
Speaker Identification, Verification, and Diarization: The article references the VoxCeleb dataset built from YouTube videos using computer‑vision techniques, deep neural network embeddings for text‑independent speaker verification, and an online speaker diarization system.
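With neural embeddings, text‑independent verification typically reduces to comparing two fixed‑length vectors. A minimal cosine‑scoring sketch (the decision threshold here is an arbitrary illustration, not a value from any cited paper):

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

def same_speaker(emb_a, emb_b, threshold=0.7):
    """Accept the trial if the embeddings are similar enough.

    The threshold would normally be tuned on a development set
    to trade off false accepts against false rejects.
    """
    return cosine_score(emb_a, emb_b) >= threshold
```

In practice the raw cosine score is often replaced or followed by a learned backend such as PLDA, but the comparison of fixed‑length embeddings is the same.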
SLaTE 2017 Workshop: Sessions covered robot‑assisted language teaching, TTS quality measurement, pronunciation scoring, and a shared task on response acceptance/rejection in dialogue. An off‑topic detection paper using a bidirectional LSTM with attention was discussed, and the authors exchanged insights on the interpretability of attention mechanisms.
Finally, Liulishuo announced a hiring drive for deep‑learning engineers to work on speech recognition, synthesis, adaptive learning, dialogue systems, and natural language understanding.
Liulishuo Tech Team
Help everyone become a global citizen!