
Happy I: Starting again.

October 4th 2021

Around two years ago, I started working on a digital assistant from scratch. I had decided to work on machine learning for speech for my travail de maturité, a sort of thesis that Swiss students work on. At the time, I only had a few weeks of experience with machine learning, but I thought that it would be cool to have my own version of Siri. I was clearly out of my depth. I eventually ended up with something that worked and a 40-page paper, but by the time I finished, I would look back at some of the code that I had written at the beginning and think about how much better I could do with all of my new knowledge. It was already too late to start again though, which is why I haven't gotten back to it until now.

My new digital assistant will be called "Happy" and although the original project was built with Keras and TensorFlow, the new implementation will be based on PyTorch, both because it has better utilities for handling audio and because I want to work with a different framework. All of the code for this project will be on my GitHub, although I will be separating the model training into different repositories (https://github.com/hnhaefliger/happy).

This first part will focus on all the code that goes around training a CTC-based model, such as reformatting the dataset, writing the training loop and transforming the data, but not the model itself.

At the moment, I am working in Google Colab, which gives me limited resources to experiment with for free before upgrading to a runtime with better performance for the final training. This post is also fairly code-heavy because there isn't much else to this part of the project.

The first thing to do is grab a dataset, which in this case will be Mozilla's Common Voice. A version of it can be downloaded from Kaggle, and Torchaudio already has a class for it (torchaudio.datasets.COMMONVOICE).


kaggle datasets download mozillaorg/common-voice
unzip -q common-voice.zip

Unfortunately, a few things need to be moved around for the files to be compatible with Torchaudio, but nothing a couple of short scripts cannot fix. First, the clips should be in a directory called cv-valid-xxx/clips/cv-valid-xxx, not cv-valid-xxx/cv-valid-xxx. That is easily solved with a few quick bash commands.


mkdir ./cv-valid-train/clips/
mv ./cv-valid-train/cv-valid-train ./cv-valid-train/clips/

mkdir ./cv-valid-test/clips/
mv ./cv-valid-test/cv-valid-test ./cv-valid-test/clips/

Then, the label files need to start with the header "client_id,path,sentence,..." rather than "filename,text". They also need to be .tsv files instead of .csv files and should be moved into the cv-valid-xxx/ directory. We can fix this pretty quickly by opening the files with Python and changing some characters.


# Rewrite the label files in the format torchaudio's COMMONVOICE class expects:
# swap the first 13 header characters ("filename,text" in the Kaggle CSV) for
# "path,sentence", prepend a dummy client_id column to every row and turn the
# commas into tabs. This assumes the sentences themselves contain no commas.
with open('./cv-valid-train.csv', 'r') as f_in:
    with open('./cv-valid-train/train.tsv', 'w+') as f_out:
        data = 'path,sentence' + f_in.read()[13:]
        data = '\n'.join(['client_id,' + line for line in data.split('\n')])
        data = data.replace(',', '\t')
        f_out.write(data)

with open('./cv-valid-test.csv', 'r') as f_in:
    with open('./cv-valid-test/test.tsv', 'w+') as f_out:
        data = 'path,sentence' + f_in.read()[13:]
        data = '\n'.join(['client_id,' + line for line in data.split('\n')])
        data = data.replace(',', '\t')
        f_out.write(data)
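
As a quick sanity check (purely illustrative, not part of the repo), the first line of a rewritten file should now begin with the columns Torchaudio expects:


# Illustrative check: the header should now begin with client_id, path, sentence.
with open('./cv-valid-train/train.tsv') as f:
    print(f.readline().strip().split('\t')[:3])  # ['client_id', 'path', 'sentence']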

In the repo, all of this is done in the happy_sr/scripts/ directory.

Now we can move on to the actual project. To handle the labels, we need some utility functions to convert between characters and integers.


# 28 symbols: index 0 is the space character, index 1 is a '<SPACE>' token and
# indices 2-27 are the letters a-z. Index 28 will be the CTC blank.
chars = ' ,<SPACE>,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z'.split(',')

def text_to_int(text):
    return [chars.index(char) for char in text]

def int_to_text(labels):
    return ''.join([chars[label] for label in labels])
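
As a quick example (not in the repo), the two functions should round-trip cleanly on cleaned-up, lowercase text:


# Quick round-trip check of the character mapping.
sample = 'hello world'
encoded = text_to_int(sample)  # [9, 6, 13, 13, 16, 0, 24, 16, 19, 13, 5]
assert int_to_text(encoded) == sample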

We can also define some audio transforms. We will always pass the model mel spectrograms, which the authors of DeepSpeech found to be the most effective audio representation. For training, however, we will also add some frequency and time masking for data augmentation, which the SpecAugment paper showed can bring significant accuracy gains. We can come back and play with the parameters later on.


import torch
import torchaudio

# 32 ms windows with a 10 ms hop at the 48 kHz sample rate of the clips.
train_audio_transforms = torch.nn.Sequential(
    torchaudio.transforms.MelSpectrogram(
        sample_rate=48000,
        win_length=int(32*48000/1000),
        hop_length=int(10*48000/1000),
        n_fft=int(32*48000/1000),
        n_mels=64,
    ),
    # SpecAugment-style masking, applied only at training time.
    torchaudio.transforms.FrequencyMasking(freq_mask_param=15),
    torchaudio.transforms.TimeMasking(time_mask_param=35),
)

valid_audio_transforms = torch.nn.Sequential(
    torchaudio.transforms.MelSpectrogram(
        sample_rate=48000,
        win_length=int(32*48000/1000),
        hop_length=int(10*48000/1000),
        n_fft=int(32*48000/1000),
        n_mels=64,
    ),
)
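
To get a feel for the shapes involved, here is a rough illustration using a random tensor in place of a real clip; a one-second waveform at 48 kHz comes out as a (channel, n_mels, time) tensor of roughly (1, 64, 101):


# Illustrative only: a fake one-second mono clip at 48 kHz.
waveform = torch.randn(1, 48000)
mel = valid_audio_transforms(waveform)
print(mel.shape)  # (channel, n_mels, time), roughly torch.Size([1, 64, 101])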

While we're at it, we can check if a GPU is available.


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# device is a torch.device object, so check its type rather than comparing it to a string.
if device.type == "cuda":
    num_workers = 1
    pin_memory = True
else:
    num_workers = 0
    pin_memory = False

Now we need a function to transform a batch from the dataset into data that can be passed to the model. We keep the input and label lengths, as they will be needed later to compute the CTC loss. We also use a regular expression to strip invalid characters so that they don't cause errors.


import re

def prepare_training_data(data):
    spectrograms = []
    labels = []
    input_lengths = []
    label_lengths = []

    for (waveform, _, dictionary) in data:
        # (channel, n_mels, time) -> (time, n_mels)
        spectrogram = train_audio_transforms(waveform).squeeze(0).transpose(0, 1)
        spectrograms.append(spectrogram)

        # Strip anything that is not a letter or a space before encoding the label.
        label = torch.Tensor(text_to_int(re.sub('[^a-zA-Z ]+', '', dictionary['sentence'].lower())))
        labels.append(label)

        # Halved to account for the time downsampling in the model.
        input_lengths.append(spectrogram.shape[0]//2)
        label_lengths.append(len(label))

    # Pad to the longest item in the batch: (batch, 1, n_mels, time) and (batch, longest label).
    spectrograms = torch.nn.utils.rnn.pad_sequence(spectrograms, batch_first=True).unsqueeze(1).transpose(2, 3)
    labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True)

    return spectrograms, labels, input_lengths, label_lengths


def prepare_testing_data(data):
    spectrograms = []
    labels = []
    input_lengths = []
    label_lengths = []

    for (waveform, _, dictionary) in data:
        spectrogram = valid_audio_transforms(waveform).squeeze(0).transpose(0, 1)

        spectrograms.append(spectrogram)

        label = torch.Tensor(text_to_int(re.sub('[^a-zA-Z ]+', '', dictionary['sentence'].lower())))
        labels.append(label)

        input_lengths.append(spectrogram.shape[0]//2)
        label_lengths.append(len(label))

    spectrograms = torch.nn.utils.rnn.pad_sequence(spectrograms, batch_first=True).unsqueeze(1).transpose(2, 3)
    labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True)

    return spectrograms, labels, input_lengths, label_lengths
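
To make the shapes concrete, here is a hypothetical smoke test with two fake samples in the (waveform, sample_rate, dictionary) format that torchaudio's COMMONVOICE dataset yields:


# Hypothetical smoke test with fake samples; real batches come from the dataset below.
fake_batch = [
    (torch.randn(1, 48000), 48000, {'sentence': 'hello world'}),
    (torch.randn(1, 72000), 48000, {'sentence': 'testing the collate function'}),
]

spectrograms, labels, input_lengths, label_lengths = prepare_testing_data(fake_batch)
print(spectrograms.shape)  # (batch, 1, n_mels, time), padded to the longest clip
print(labels.shape)        # (batch, longest label length)
print(input_lengths, label_lengths)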

Finally, we can return torch dataloaders for the dataset.


def get_training_data(batch_size=16, root='./cv-valid-train', tsv='train.tsv'):
    train_dataset = torchaudio.datasets.COMMONVOICE(root, tsv)

    return torch.utils.data.DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        collate_fn=prepare_training_data,
        num_workers=num_workers,
        pin_memory=pin_memory,
    )


def get_testing_data(batch_size=16, root='./cv-valid-test', tsv='test.tsv'):
    test_dataset = torchaudio.datasets.COMMONVOICE(root, tsv)

    return torch.utils.data.DataLoader(
        test_dataset,
        batch_size=batch_size,
        shuffle=False,
        drop_last=False,
        collate_fn=prepare_testing_data,
        num_workers=num_workers,
        pin_memory=pin_memory,
    )
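
Assuming the reformatted dataset from earlier is in place, pulling a single batch is an easy way to confirm everything is wired up correctly (just a sketch, not part of the repo):


# Sketch: grab one batch to confirm the loader and collate function work together.
train_loader = get_training_data(batch_size=4)
spectrograms, labels, input_lengths, label_lengths = next(iter(train_loader))
print(spectrograms.shape, labels.shape)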

Now that all of the data preparation is complete, we can write a training loop and initialize an optimizer for the model. As mentioned previously, we will use CTC loss alongside the AdamW optimizer (happy_sr/models).


def get_optimizer(model, loader):
    optimizer = torch.optim.AdamW(model.parameters(), 1e-6)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1e-6, steps_per_epoch=int(len(loader)), epochs=10, anneal_strategy='linear')

    return optimizer, scheduler


def get_loss():
    # blank=28 is the index right after the 28 characters defined earlier.
    loss = torch.nn.CTCLoss(blank=28).to(device)

    return loss
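
The shape conventions matter here: CTCLoss expects log-probabilities of shape (time, batch, classes), where classes is 29 (the 28 characters plus the blank). A hypothetical check on random data:


# Hypothetical shape check for the CTC loss with random data.
loss_fn = get_loss()

log_probs = torch.nn.functional.log_softmax(torch.randn(50, 4, 29), dim=2).to(device)  # (time, batch, classes)
targets = torch.randint(0, 28, (4, 20)).to(device)  # label indices 0-27, 28 is the blank
print(loss_fn(log_probs, targets, input_lengths=[50] * 4, target_lengths=[20] * 4).item())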

To keep track of progress during training, we will use the tqdm library. The functions for an epoch of training and testing can be found in happy_sr/utils. I plan to add other metrics later on as the current loop is not very informative. It may also accelerate training to perform the data transforms after sending the data to the GPU.


from tqdm import tqdm

def train(model, optimizer, loss_fn, dataset):
    model.train()
    progress_bar = tqdm(total=len(dataset))
    progress_bar.set_description('training')

    for batch_idx, (data, target, input_lengths, label_lengths) in enumerate(dataset):
        data = data.to(device)
        target = target.to(device)

        # The model outputs (batch, time, classes); CTC loss wants (time, batch, classes) log-probabilities.
        output = model(data)
        output = torch.nn.functional.log_softmax(output, dim=2)
        output = output.transpose(0, 1)

        loss = loss_fn(output, target, input_lengths, label_lengths)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        progress_bar.set_postfix(loss=f'{loss.item():.2f}')
        progress_bar.update(1)

    progress_bar.close()
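
The matching test function follows the same pattern without the backward pass. Something along these lines (a sketch, not necessarily the exact code in happy_sr/utils):


def test(model, loss_fn, dataset):
    # Evaluation epoch: same shape handling as train(), but no gradient updates.
    model.eval()
    total_loss = 0

    with torch.no_grad():
        for data, target, input_lengths, label_lengths in tqdm(dataset, desc='testing'):
            data = data.to(device)
            target = target.to(device)

            output = model(data)
            output = torch.nn.functional.log_softmax(output, dim=2)
            output = output.transpose(0, 1)

            total_loss += loss_fn(output, target, input_lengths, label_lengths).item()

    return total_loss / len(dataset)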

This code provides a good base for experimenting with different models. I have also added a shell utility for training and made the parameters more easily editable, although this does not change the fundamental functionality of the code. My next post will likely cover metrics and then models. After that, I will address natural language understanding and speech synthesis.