There are a lot of things I wish I had known when I dove head-first into my Master’s, and, as silly as it sounds, one of them is how to pre-process data correctly. Machine learning models are only as good as the data you feed them; feed them garbage and you will very likely get garbage out, something that I, and probably many others, had to learn the hard way. Your models will not magically look past the problems in your data and miraculously learn exactly what you want them to learn (even though I hope this aspect will become easier to handle in the future).

A large portion of your data-science time should be spent on data collection, curation and clean-up. Many open-source tools exist for these purposes and are relatively easy to use, but sometimes you’ll still need to roll up your sleeves and concoct a Python script that does what you want. My biggest gripe with data pre-processing is that it’s almost always neglected and merely brushed over in research papers, rather than explained in a ‘tell me exactly what you did’ manner that would make reproducing the reported results realistically possible.

Getting back to the main point of this post: pre-processing spectrograms and treating them as images. If you’ve followed the trends of deep learning and sound synthesis in recent years, you’ve probably come across models that use GANs for this purpose. One of them is SpecGAN, short for Spectrogram Generative Adversarial Network: a GAN that generates spectrogram representations of sound, with the spectrogram treated as a digital image.

The first time I came across the paper, this sounded like a superb idea: why deal with data in the time domain if we already have strong models for image synthesis? However, this domain conversion usually comes with two caveats:

  • How exactly do we treat a spectrogram as an image?
  • How do we get the actual audio back from the spectrogram?

If you’re not familiar with what spectrograms are, I’ve got you covered with this previous blog post of mine. For now we need not concern ourselves in depth with spectrogram inversion methods (if you would like to learn more about them, check this post); we’ll just assume that we have a ‘good-enough’ method for doing so.

The Nitty Gritty of Spectrogram Pre-processing

Spectrogram pre-processing can be split into several stages:

  • Processing the audio data itself before passing it through the STFT
  • Selecting the STFT parameters
  • Processing the resulting spectrogram

About Audio Data Normalization

Normalizing audio is a step that is often disregarded, yet it belongs right at the start of your pipeline. Making sure that your data samples are as clean and representative as possible is crucial for the steps further down the line.

We’ve previously talked about sample rate and bit depth, and the first thing you want to ensure is that all samples in your dataset have the same sample rate. Bit depth is less important; you can probably get away with having different bit depths, but ideally you’ll also want all samples to share a similar one.

About STFT Parameter Selection

The main STFT parameters are the window length, the hop size and the signal length. For our purposes we will be dealing with short audio samples, so it makes the most sense to pick a specific, fixed sample length.


Next, we need to find the global minimum and global maximum for normalizing our spectrograms:

import os

import librosa
import numpy


def calc_global_norm_stats(dataset_path, hop_length, n_mels):
    """Find the global minimum and maximum of the log-mel spectrograms in a dataset."""
    files = os.listdir(dataset_path)
    mi = numpy.inf
    ma = -numpy.inf
    for file in files:
        y, sr = librosa.load(os.path.join(dataset_path, file))

        mels = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                              n_fft=hop_length * 2, hop_length=hop_length)
        mels = numpy.log(mels + 1e-9)  # add a small constant to avoid log(0)

        mi = min(mi, mels.min())  # track the smallest value seen so far
        ma = max(ma, mels.max())  # track the largest value seen so far

    return mi, ma

Scaling the spectrogram depending on the minimum and maximum:

def scale_minmax(X, mi, ma, min=0.0, max=1.0):
    X_std = (X - mi) / (ma - mi)
    X_scaled = X_std * (max - min) + min
    return X_scaled


def inverse_scale_minmax(X, mi, ma, min=0.0, max=1.0):
    X_std = (X - min) / (max - min)  # undo the output-range scaling
    return X_std * (ma - mi) + mi    # map back to the original value range
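A quick sanity check of the round trip, with the two helpers redefined inline (parameter names tweaked to avoid shadowing Python’s built-ins) so the snippet stands alone. Quantizing to uint8 loses at most one step, (ma - mi) / 255, per bin:

```python
import numpy as np

def scale_minmax(X, mi, ma, out_min=0.0, out_max=1.0):
    return (X - mi) / (ma - mi) * (out_max - out_min) + out_min

def inverse_scale_minmax(X, mi, ma, out_min=0.0, out_max=1.0):
    return (X - out_min) / (out_max - out_min) * (ma - mi) + mi

# round trip through uint8 quantization, as done for spectrogram images
rng = np.random.default_rng(0)
X = rng.uniform(-20.0, 3.0, size=(128, 128))  # plausible log-mel value range
mi, ma = X.min(), X.max()
img = scale_minmax(X, mi, ma, 0, 255).astype(np.uint8)
X_hat = inverse_scale_minmax(img.astype(np.float32), mi, ma, 0, 255)
max_err = np.abs(X - X_hat).max()  # bounded by one quantization step
```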

Creating an image from the spectrogram:

def spectrogram_image(y, sr, hop_length, n_mels, mi, ma):
    # use log-melspectrogram
    mels = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                          n_fft=hop_length * 2, hop_length=hop_length)
    mels = numpy.log(mels + 1e-9)  # add small number to avoid log(0)

    # min-max scale to fit inside 8-bit range
    img = scale_minmax(mels, mi, ma, 0, 255).astype(numpy.uint8)
    img = numpy.flip(img, axis=0)  # put low frequencies at the bottom in image
    img = 255 - img  # invert. make black==more energy
    # the caller can save this array as a PNG image
    return img