
Choppy generation using pre-trained tacotron-gst model checkpoint #536

astricks opened this issue May 7, 2020 · 1 comment

astricks commented May 7, 2020

Hi,

I am using the tacotron-gst model for speech generation (magnitude spectrogram output) and getting choppy generated audio, as someone else noted here. My inference output files are here.

I'm running inference in an NVIDIA TensorFlow Docker container. Here are my inference logs.

The text I am trying to generate is from the M-AILABS dataset itself. My inference file contains the one line below:

en_US/by_book/female/judy_bieber/the_master_key/wavs/the_master_key_10_f000002|UNUSED|How Rob Served a Mighty King.
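
To be explicit about how I read that line (a minimal sketch; the column meanings are my assumption, inferred from the M-AILABS CSV layout, and the middle field is literally "UNUSED" here):

```python
# One row of the inference CSV: wav path | unused column | transcript.
# The column interpretation is my assumption from the M-AILABS layout.
line = ("en_US/by_book/female/judy_bieber/the_master_key/wavs/"
        "the_master_key_10_f000002|UNUSED|How Rob Served a Mighty King.")

wav_path, unused, transcript = line.split("|")
print(wav_path)    # en_US/by_book/.../the_master_key_10_f000002
print(transcript)  # How Rob Served a Mighty King.
```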

If I understand correctly, the provided checkpoint has been trained on the M-AILABS dataset, which means it has seen this particular sentence/audio pair.

  1. Is sample_step0_0_infer_mag.wav representative of the quality to be expected?
  2. Can I swap out Griffin-Lim and use WaveNet to improve the audio quality? (A sketch of the Griffin-Lim step, as I understand it, follows this list.)
  3. Can you please share some Tacotron-GST audio samples you have generated (I found only the non-GST Tacotron samples in the docs), so that we know what to expect? My expectations are set by the Google Tacotron team's audio samples on their webpage.
  4. In short: is there any way to tell (perhaps from the output spectrogram image) what is causing the low-quality generation, and what to change to improve it? The model, the vocoder, or both?
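
For question 2, this is roughly the Griffin-Lim reconstruction I believe happens after the model predicts magnitudes (a minimal sketch using librosa; the file name, log-magnitude handling, STFT parameters, and sample rate are all my assumptions and would need to match the actual tacotron_gst config):

```python
import numpy as np
import librosa
import soundfile as sf

# Load the predicted magnitude spectrogram. The file name is
# hypothetical; the actual inference output may be stored differently.
mag = np.load("sample_step0_0_infer_mag.npy")  # assumed shape: (frames, freq_bins)

# If the model predicts log magnitudes, undo the compression first
# (assumption; depends on the config).
mag = np.exp(mag)

# librosa expects (freq_bins, frames).
mag = mag.T

# Griffin-Lim iteratively estimates the phase that the model does not
# predict. Too few iterations is one classic source of metallic,
# choppy-sounding output. The hop/win lengths here are assumptions and
# must match the STFT settings used in training.
audio = librosa.griffinlim(mag, n_iter=60, hop_length=256, win_length=1024)

# Sample rate is an assumption; it must match the training data.
sf.write("reconstructed.wav", audio, 22050)
```

My thinking: if cranking up n_iter does not help, the problem is more likely in the predicted spectrogram (the model) than in the phase reconstruction (the vocoder), which is what question 4 is trying to get at.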
astricks (Author) commented

I'd really appreciate any advice on this.
