Many of us whose home hardware isn't up to training a decent model end up on Colab, the world's biggest gathering spot for broke model trainers. But for anyone who can't expand their Google Drive or upgrade Colab, uploading the dataset is pure agony: the connection is slow, the space runs out, the upload has to be redone after every runtime reset, and preprocessing is a headache on top of all that. It took me nine days to work around this, and here is my solution. First, register an account on kaggle and obtain an API token. I have uploaded the preprocessed dataset (aidatatang_200zh) there, but downloading it requires a token, and a token requires an account; how to obtain the token is easy to look up, so I won't go into it here.
Then open Colab, go to Edit -> Notebook settings and switch the hardware accelerator from None to GPU, then run the following code:
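To confirm the GPU runtime actually took effect, here is a quick check (Colab images normally ship with PyTorch preinstalled):
import torch
print(torch.cuda.is_available())       # should print True on a GPU runtime
print(torch.cuda.get_device_name(0))   # typically "Tesla T4" on the free tier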
!pip install kaggle
import json
token = {"username":"your-username","key":"your-token"}
with open('/content/kaggle.json', 'w') as file:
    json.dump(token, file)
!mkdir -p ~/.kaggle
!cp /content/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle config set -n path -v /content
Fill in the third line of this cell with the username and token you obtained earlier. This step sets up the kaggle command line.
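If you want to make sure the CLI is authenticated before starting the big downloads, a cheap test is a dataset search (the -s flag of kaggle datasets list searches by keyword):
!kaggle datasets list -s sv2tts     # a result table (rather than a 401 error) means the token works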
Then download and unzip the dataset:
!kaggle datasets download -d bjorndido/sv2ttspart1
!unzip "/content/datasets/bjorndido/sv2ttspart1/sv2ttspart1.zip" -d "/content/aidatatang_200zh"
!rm -rf /content/datasets
!kaggle datasets download -d bjorndido/sv2ttspart2
!unzip "/content/datasets/bjorndido/sv2ttspart2/sv2ttspart2.zip" -d "/content/aidatatang_200zh"
!rm -rf /content/datasets
In case some of you are on the same free tier as me: starting from the raw, unprocessed dataset would blow up the disk, which is why I uploaded the preprocessed version to Kaggle. The zip files are also deleted right after extraction, which is very considerate. In my tests the download runs at up to 200 MB/s, and even on a slow connection you still get 50 MB/s, so this whole step takes less than 10 minutes.
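Before moving on, it is worth checking the extraction and how much ephemeral disk remains (plain shell):
!df -h /content                        # remaining disk on the runtime
!du -sh /content/aidatatang_200zh      # size of the extracted, preprocessed dataset
!ls /content/aidatatang_200zh | head   # peek at the directory layout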
!git clone https://github.com/babysor/MockingBird.git
!pip install -r /content/MockingBird/requirements.txt
Clone the git repo and install its dependencies; nothing special here.
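Optionally, confirm the clone and install went through before moving on (plain shell; the directory names are those of the MockingBird repo):
!ls /content/MockingBird            # expect encoder/, synthesizer/, vocoder/ among others
!pip show torch | head -n 2         # confirm a key dependency resolved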
Then overwrite hparams:
%%writefile /content/MockingBird/synthesizer/hparams.py
import ast
import pprint
import json
class HParams(object):
    def __init__(self, **kwargs): self.__dict__.update(kwargs)
    def __setitem__(self, key, value): setattr(self, key, value)
    def __getitem__(self, key): return getattr(self, key)
    def __repr__(self): return pprint.pformat(self.__dict__)

    def parse(self, string):
        # Overrides hparams from a comma-separated string of name=value pairs
        if len(string) > 0:
            overrides = [s.split("=") for s in string.split(",")]
            keys, values = zip(*overrides)
            keys = list(map(str.strip, keys))
            values = list(map(str.strip, values))
            for k in keys:
                self.__dict__[k] = ast.literal_eval(values[keys.index(k)])
        return self

    def loadJson(self, dict):
        # Loads hparams from a dict, skipping the keys that hold schedules/lists
        print("\nLoading the json with %s\n" % dict)
        for k in dict.keys():
            if k not in ["tts_schedule", "tts_finetune_layers"]:
                self.__dict__[k] = dict[k]
        return self

    def dumpJson(self, fp):
        # Serializes all hparams to the given pathlib.Path as JSON
        print("\nSaving the json with %s\n" % fp)
        with fp.open("w", encoding="utf-8") as f:
            json.dump(self.__dict__, f)
        return self
hparams = HParams(
    ### Signal Processing (used in both synthesizer and vocoder)
    sample_rate = 16000,
    n_fft = 800,
    num_mels = 80,
    hop_size = 200,                          # Tacotron uses 12.5 ms frame shift (set to sample_rate * 0.0125)
    win_size = 800,                          # Tacotron uses 50 ms frame length (set to sample_rate * 0.050)
    fmin = 55,
    min_level_db = -100,
    ref_level_db = 20,
    max_abs_value = 4.,                      # Gradient explodes if too big, premature convergence if too small.
    preemphasis = 0.97,                      # Filter coefficient to use if preemphasize is True
    preemphasize = True,

    ### Tacotron Text-to-Speech (TTS)
    tts_embed_dims = 512,                    # Embedding dimension for the graphemes/phoneme inputs
    tts_encoder_dims = 256,
    tts_decoder_dims = 128,
    tts_postnet_dims = 512,
    tts_encoder_K = 5,
    tts_lstm_dims = 1024,
    tts_postnet_K = 5,
    tts_num_highways = 4,
    tts_dropout = 0.5,
    tts_cleaner_names = ["basic_cleaners"],
    tts_stop_threshold = -3.4,               # Value below which audio generation ends.
                                             # For example, for a range of [-4, 4], this
                                             # will terminate the sequence at the first
                                             # frame that has all values < -3.4

    ### Tacotron Training
    tts_schedule = [(2, 1e-3, 10_000, 32),   # Progressive training schedule
                    (2, 5e-4, 15_000, 32),   # (r, lr, step, batch_size)
                    (2, 2e-4, 20_000, 32),
                    (2, 1e-4, 30_000, 32),
                    (2, 5e-5, 40_000, 32),
                    (2, 1e-5, 60_000, 32),
                    (2, 5e-6, 160_000, 32),  # r = reduction factor (# of mel frames
                    (2, 3e-6, 320_000, 32),  # synthesized for each decoder iteration)
                    (2, 1e-6, 640_000, 32)], # lr = learning rate

    tts_clip_grad_norm = 1.0,                # clips the gradient norm to prevent explosion - set to None if not needed
    tts_eval_interval = 500,                 # Number of steps between model evaluation (sample generation)
                                             # Set to -1 to generate after completing epoch, or 0 to disable
    tts_eval_num_samples = 1,                # Makes this number of samples

    ## For finetune usage, if set, only selected layers will be trained,
    ## available: encoder, encoder_proj, gst, decoder, postnet, post_proj
    tts_finetune_layers = [],

    ### Data Preprocessing
    max_mel_frames = 900,
    rescale = True,
    rescaling_max = 0.9,
    synthesis_batch_size = 16,               # For vocoder preprocessing and inference.

    ### Mel Visualization and Griffin-Lim
    signal_normalization = True,
    power = 1.5,
    griffin_lim_iters = 60,

    ### Audio processing options
    fmax = 7600,                             # Should not exceed (sample_rate // 2)
    allow_clipping_in_normalization = True,  # Used when signal_normalization = True
    clip_mels_length = True,                 # If true, discards samples exceeding max_mel_frames
    use_lws = False,                         # "Fast spectrogram phase recovery using local weighted sums"
    symmetric_mels = True,                   # Sets mel range to [-max_abs_value, max_abs_value] if True,
                                             # and [0, max_abs_value] if False
    trim_silence = True,                     # Use with sample_rate of 16000 for best results

    ### SV2TTS
    speaker_embedding_size = 256,            # Dimension for the speaker embedding
    silence_min_duration_split = 0.4,        # Duration in seconds of a silence for an utterance to be split
    utterance_min_duration = 1.6,            # Duration in seconds below which utterances are discarded
    use_gst = True,                          # Whether to use global style token
    use_ser_for_gst = True,                  # Whether to use speaker embedding referenced for global style token
)
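As an aside, the parse method defined above lets you override scalar hparams from a comma-separated name=value string without editing the file. A minimal sketch, run in a fresh cell after the %%writefile so the rewritten module is the one imported:
import sys
sys.path.append('/content/MockingBird')                   # make the repo importable from /content
from synthesizer.hparams import hparams
hparams.parse("tts_eval_interval=1000,tts_dropout=0.4")   # values are parsed with ast.literal_eval
print(hparams.tts_eval_interval)                          # -> 1000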
I used a batch size of 32; adjust it to fit your GPU. If you hit an out-of-memory error, halve the last element of each tts_schedule tuple, either by editing the cell above and re-running it, or with a quick in-place patch like the one below.
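A one-liner sketch for the patch route; it assumes ", 32)" occurs only inside the tts_schedule tuples, which is true of the file as written above:
!sed -i 's/, 32)/, 16)/g' /content/MockingBird/synthesizer/hparams.py
!grep ', 16)' /content/MockingBird/synthesizer/hparams.py   # verify the change landed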
Start training:
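Before running the training cell, mount Google Drive with the standard google.colab API so checkpoints persist across resets:
from google.colab import drive
drive.mount('/content/drive')   # authorize in the prompt; files land under /content/drive/MyDrive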
%cd "/content/MockingBird/"
!python synthesizer_train.py train "/content/aidatatang_200zh" -m /content/drive/MyDrive/
Note: mount Google Drive before this step (see the mount cell above), or change the path after -m if you'd rather not mount it. I save to Drive because the next session can pick up the saved checkpoints and continue training. Then the happy freeloading begins. Paying users can run !nvidia-smi to check their GPU; on the free tier it is always a Tesla T4 with 16 GB of VRAM. In my run, the attention alignment started to appear around step 9k, with a loss of 0.45. Be warned: on the free tier, Colab disconnects if you leave the machine unattended for too long, and reopening the environment resets it to its initial state. This is exactly where saving to Drive pays off: a runtime reset cannot delete your model.
This is my first write-up, so please forgive the rough edges. I hope this tutorial helps you.