Leveraging Transfer Learning for Voice Cloning in Bengali Language

EasyChair Preprint 15582, version 2 • 14 pages • Date: January 10, 2025

Abstract

Voice cloning is a major innovation in AI that is changing human-machine interaction. Unlike conventional voice synthesis methods, voice cloning needs only a small amount of data to re-create a speaker's voice and can offer personalized options for communication. However, building compact yet powerful models that work from scarce voice samples remains a major challenge. The challenge is greater still for low-resource languages like Bengali, because of the scarcity of data as well as the intricacy of regional accents. Our study investigates voice cloning for Bengali using transfer learning from speaker verification models. We adapt the SV2TTS framework to Bengali using the Mozilla Common Voice Bengali dataset, which contains voices covering a wide variety of accents and dialects. By retraining the encoder, synthesizer, and vocoder components to capture the phonetic features specific to Bengali, our approach generates realistic, high-quality voice replications. Evaluation with the Mean Opinion Score method indicates that the cloned voices sound natural and closely resemble the target speaker. These findings demonstrate the approach's potential for under-resourced languages, with applications in customized communication, voice acting, and speech-based assistive tools. This research focuses on developing methods and models for Bengali speech processing to address the challenges of low-resource language processing, laying a foundation for further advances in Bengali speech technologies.
Keyphrases: AI-driven, Bengali, Bengali language, Mean Opinion Score, Mel Spectrograms, Mel-spectrogram, Speaker specific, Voice Cloning, Voice replication, acoustics speech and signal processing, automatic speech recognition and translation, bengali synthesis, cloning with a few samples, encoder synthesizer and vocoder, encoder synthesizer and vocoder components, large bengali speech recognition dataset, low resource languages like bengali, low-resource language, mozilla common voice bengali dataset, predicted vs target mel, recognition and translation for low resource, speaker encoder, speaker's unique voice features, speaker's voice, speaker text to speech synthesis, speaker verification to multi speaker, speech synthesis, text-to-speech, tts models and vocoder combinations, verification to multi speaker text, voice cloning for bengali, vs target mel spectrogram
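
The abstract reports evaluation with the Mean Opinion Score (MOS) method but does not describe the computation. As a rough illustration only, the sketch below averages listener ratings on the usual 1 to 5 scale and attaches a 95% confidence half-width. The function name `mean_opinion_score`, the normal-approximation factor of 1.96, and the sample ratings are all assumptions made for this example; none of them come from the preprint.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, ci95_half_width) for listener ratings on a 1-5 scale.

    Hypothetical helper for illustration; the 95% interval uses a normal
    approximation (1.96 * standard error), which is an assumption here.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 MOS scale")
    mos = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, 1.96 * sem


# Made-up naturalness ratings from five listeners (not data from the paper).
example_ratings = [4, 5, 4, 3, 4]
mos, half_width = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

In practice, naturalness and speaker similarity are usually rated separately, so a helper like this would be applied once per criterion.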