Summary
The study proposes a novel approach to speech synthesis from articulatory movements recorded by real-time MRI, achieving a 15.18% Word Error Rate on the USC-TIMIT MRI corpus.
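For readers unfamiliar with the metric, Word Error Rate is the word-level edit distance between the recognized text and the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the paper's evaluation script; the function name `word_error_rate`, the lowercasing, and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Per-utterance WER: (substitutions + deletions + insertions) / reference length.

    Illustrative only: text normalization here (lowercase, whitespace split)
    is an assumption and may differ from the paper's evaluation setup.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a b x d"))
```

Corpus-level figures are usually computed by summing the edit counts over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance scores.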
Highlights
- Introduces a novel approach to speech synthesis from articulatory movements recorded by real-time MRI.
- Adapts the multi-modal self-supervised AV-HuBERT model for text prediction from rtMRI.
- Incorporates a new flow-based duration predictor for speaker-specific alignment.
- Achieves a 15.18% Word Error Rate on the USC-TIMIT MRI corpus.
- Demonstrates generalization ability to unseen speakers.
- Evaluates the impact of different articulators on text prediction.
- Synthesizes speech in any novel voice using a trained speech decoder.
Key Insights
- The reported 15.18% Word Error Rate on the USC-TIMIT MRI corpus is a substantial improvement over the previous state of the art in speech synthesis from rtMRI recordings of articulatory movements.
- Using a multi-modal self-supervised model enables effective text prediction from rtMRI, removing the dependence on noisy ground-truth speech as a training target.
- The incorporation of a flow-based duration predictor enables accurate speaker-specific alignment, crucial for high-quality speech synthesis.
- The study demonstrates the importance of internal articulators in speech production, showing that they have a more significant impact on text prediction than lip movements alone.
- Because the trained speech decoder can synthesize speech in any novel voice, the approach has potential applications in areas such as language learning and support for people with speech disorders.
- The study's results highlight the potential of real-time MRI technology in advancing speech synthesis research and applications.
- The proposed approach has the potential to enable more accurate and efficient speech synthesis, which can benefit various industries, including education, healthcare, and entertainment.
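To illustrate why per-token durations matter for alignment, the sketch below repeats each predicted text unit for its predicted number of frames, so that the sequence handed to a speech decoder has one entry per output frame. This is the generic length-regulation idea from duration-based speech synthesis, not the authors' flow-based predictor; the function name `expand_by_duration`, the phoneme labels, and the frame counts are hypothetical.

```python
from typing import List, Sequence


def expand_by_duration(tokens: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each token durations[i] times to form a frame-level sequence.

    Hypothetical helper for illustration; durations are assumed to be
    non-negative integer frame counts supplied by some duration predictor.
    """
    if len(tokens) != len(durations):
        raise ValueError("tokens and durations must have the same length")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")

    frames: List[str] = []
    for token, duration in zip(tokens, durations):
        frames.extend([token] * duration)
    return frames


if __name__ == "__main__":
    # Made-up phoneme labels and frame counts.
    print(expand_by_duration(["s", "iy"], [2, 3]))
```

Different speakers produce the same text at different rates, which is why the durations need to be speaker-specific for the synthesized audio to stay aligned with the recorded articulatory movements.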
Citation
Shah, N., Kashyap, A., Karande, S., & Gandhi, V. (2024). MRI2Speech: Speech Synthesis from Articulatory Movements Recorded by Real-time MRI (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2412.18836