Audio data can be represented in many different ways. One of the most useful is the spectrogram, which has found its way into many real-world applications in signal processing and analysis. Recent studies show that combining spectrogram representations with modern deep learning models tailored towards image classification unlocks a completely new perspective on audio-related tasks, with state-of-the-art results.
However, obtaining a spectrogram from a given signal is non-trivial and requires a number of different steps. The aim of this article is to explain these steps in layman’s terms and as succinctly as possible, without assuming any prior knowledge.
Audio data stored on modern systems is an abstraction of what audio is in the real world. On a hard drive audio data is stored as a floating point array of numbers, which essentially captures the amplitude of a signal at different points in time. To get an intuitive idea of spectrograms, it’s probably best to zoom out a little, and have a look at what sound actually is, and how it behaves in the real world.
In the physical world, a perceived sound wave can be regarded as a compression of particles propagating through the space near and around your ear. For instance, if you pop a balloon, you’ll hear a very distinct bang, caused by the compressed air inside the balloon rushing out and expanding in all directions. When this pressure wave reaches our ear, a thin stretch of skin at the end of the ear canal, the ear drum, begins to vibrate. Once in vibration, the ear drum fires a signal to our brain, which is then decoded into semantic information that we can understand and classify. The apparatus that sits between our ear and our brain is quite intricate, and I highly recommend learning more about it. For the purposes of this article, however, this simple explanation should be sufficient.
We can already make two fascinating observations: firstly, sounds come in many different flavors, and secondly, our brain is capable of discerning and classifying these different sounds. Think about it: someone drops something in the other room, and you can probably tell what material that object was made of. Even more impressively, we can understand and decode words and sentences uttered by someone else, all in a matter of milliseconds and with hardly any effort. From an evolutionary point of view, the ability to hear things and understand what they are (and to communicate) probably contributed a great deal to our development as a species.
But back to machines. We’ve only very recently been able to make advances in machine perception and audio classification with convolutional deep learning models. Many of these models don’t actually interact with raw audio signals, but rather operate on spectrograms, as they’re much easier to deal with. More on that later; first let’s have a look at how we actually go from raw audio to a spectrogram representation.
A little bit about recorded audio
We designed microphones to work in much the same way as the human ear. A microphone usually consists of a diaphragm that begins vibrating when it is met with sound (like the ear drum). This diaphragm converts sound waves into an electrical signal via transduction, which is the conversion of one type of energy into another. Here we’re converting pressure waves into an electrical current. We can then record the amplitude of this electrical signal at regular moments over a specific period of time, and simply store this sequence as an array of numbers on disk.
Obviously, in the physical world, sound is a continuous signal, which we are now abstracting as a discrete sequence. This sequence is described mainly by two parameters: sampling rate and bit depth. The number of snapshots of the amplitude that we take per second is the sampling rate. For example, if we were to take 16000 samples per second, the sampling rate of the recorded signal would be 16 kHz (kilohertz). The sampling rate of a signal is essentially its temporal resolution, meaning that if we were to take more snapshots in the same amount of time, we would end up with a better quality recording. The second parameter is the bit depth of the signal, which describes the number of bits used to encode each individual sample. It is essentially the resolution of each individual sample taken, and like before, a higher bit depth means more quality (there is an upper limit though). If you’d like to learn a little bit more about digital audio I recommend this read by izotope.
Great, we now have established a discrete time-signal that can serve as an abstraction of sound waves in the real world.
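To make sampling rate and bit depth a little more tangible, here is a minimal sketch in Python with NumPy. The helper name sample_and_quantize and the 440 Hz test tone are my own choices for illustration, and the "continuous" signal is of course only simulated.

```python
import numpy as np

def sample_and_quantize(freq_hz, duration_s, sample_rate, bit_depth):
    """Sample a sine wave and round each sample to a signed integer grid."""
    n_samples = int(round(duration_s * sample_rate))
    t = np.arange(n_samples) / sample_rate        # sample times in seconds
    signal = np.sin(2 * np.pi * freq_hz * t)      # ideal amplitude in [-1, 1]
    max_level = 2 ** (bit_depth - 1) - 1          # e.g. 32767 for 16 bits
    quantized = np.round(signal * max_level).astype(np.int64)
    return signal, quantized, max_level

# One second of a 440 Hz tone at 16 kHz, stored with 16 bits and with 8 bits.
signal, q16, max16 = sample_and_quantize(440.0, 1.0, 16000, 16)
_, q8, max8 = sample_and_quantize(440.0, 1.0, 16000, 8)

# Worst-case difference between the ideal samples and what was stored.
err16 = np.max(np.abs(signal - q16 / max16))
err8 = np.max(np.abs(signal - q8 / max8))
print(len(q16), err16, err8)
```

The sampling rate decides how many numbers we store per second, while the bit depth decides how finely each of those numbers can be resolved, which is why the 8-bit version ends up with the larger rounding error.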
What is Frequency? Phase?
However, there’s a little more to sound than just sampling rate and bit depth. You’ve probably heard about frequency and phase before, and have a vague idea of what they are but don’t feel like you have a good grasp on these concepts. This was my situation before digging into the topic in more depth, and it took me a while to be able to explain these concepts to myself. This section serves as an explanation, to give you an understanding of both frequency and phase.
Usually, no two ‘natural’ sounds occur the exact same way twice (except if it’s a recording played by a machine). If you clap your hands twice in a row, it is quite impossible to produce the exact same sound twice. We’re talking microscopic levels of ‘exact’ here. The compression of particles that occurs when you clap your hands happens at different rates, meaning that some particles will be displaced harder and faster than others. The resulting pressure waves all travel to your ear at the same speed (the speed of sound), but they make the air particles wiggle back and forth at different rates. The rate at which the air particles wiggle back and forth, measured in cycles per second (hertz), is what we call frequency.
This might be a bit obscure, but basically, when you clap your hands, you’re creating a number of different pressure waves that reach your ear. This combination of different pressure waves is what characterizes specific sounds. For example, you’ll probably agree that when you clap your hands you won’t hear the sound of an ambulance siren, and no matter how hard you try, you’ll probably never be able to produce the sound of an ambulance siren by clapping your hands. However, with the deformable vocal apparatus connected to the bottom of your head, you can imitate, to a certain degree, many sounds such as a clapping sound or the sound of a siren. The important idea here is that sounds in the real world consist of many different frequencies. If a sound were to consist of only a single frequency we would call it a pure tone, and pure tones do not generally occur in nature. To conclude, you can think about frequencies as oscillations at very specific rates.
Sounds in the real world consist of many different frequencies occurring at the same time, overlapping and interleaving with each other to construct intricate and extremely distinguishable sounds. This brings up another important aspect of sound, namely ‘phase’. Phase information describes how the different frequencies that make up a sound are aligned with respect to each other. A good visual example would be race cars on a circular race track: two cars might be racing side by side, but one of them might already be a lap ahead, which you wouldn’t be able to determine by simply looking at them. Obviously, it’s not as simple as that, but this kind of understanding should be sufficient for now.
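Here is a small sketch of what "same frequencies, different alignment" means in practice. It mixes two pure tones, once aligned and once with the second tone shifted by a quarter cycle. The tone frequencies (5 Hz and 12 Hz) and the sampling rate are arbitrary choices for the demo, and np.fft.rfft (NumPy's FFT for real-valued input, a tool this article only gets to later) is used purely as a measuring device.

```python
import numpy as np

sample_rate = 1000                          # samples per second (assumed for the demo)
t = np.arange(sample_rate) / sample_rate    # one second of sample times

def mix(phase_of_second_tone):
    """Add a 5 Hz and a 12 Hz sine; the second tone is shifted by the given phase."""
    return np.sin(2 * np.pi * 5 * t) + np.sin(2 * np.pi * 12 * t + phase_of_second_tone)

aligned = mix(0.0)
shifted = mix(np.pi / 2)

# The two waveforms differ sample by sample...
waveforms_differ = not np.allclose(aligned, shifted)

# ...but the strength of each frequency is the same in both.
mag_aligned = np.abs(np.fft.rfft(aligned))
mag_shifted = np.abs(np.fft.rfft(shifted))
same_magnitudes = np.allclose(mag_aligned, mag_shifted, atol=1e-6)

print(waveforms_differ, same_magnitudes)
```

In other words, the list of frequencies and how strong they are does not pin down the waveform on its own; the phase is the missing piece.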
Before we talk more about frequency and how to represent it, we’ll make a small detour and look at the sampling rate again. Arguably, the most popular sampling rate for audio data is 44.1 kHz. This is due to two reasons: firstly, it’s the sampling rate at which audio is encoded on the good old compact disc (CD), and secondly, it satisfies the Nyquist theorem. Well, it’s really only one reason, because the CD’s sampling rate was chosen to satisfy the Nyquist theorem. The name might sound intimidating, but it’s actually quite a simple rule. Human hearing has an upper limit of around 20 kHz, meaning that we can’t hear frequencies higher than that. How does this relate to the Nyquist theorem? Well, the Nyquist theorem states that for a discrete sequence to capture all the information from a continuous signal, the sampling rate has to be more than twice the highest frequency component in that signal. Why exactly this holds is a bit more difficult to explain, and will be covered in a future post, but for now I think you can already see why this is important. Since our upper limit for hearing is 20 kHz, we need a sampling rate above 40 kHz, and 44.1 kHz is more than enough to capture all the frequencies that we could possibly experience.
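A quick way to see what goes wrong when the rule is violated is to sample a tone that is too fast for the chosen sampling rate. The numbers below (10 samples per second, tones at 3 Hz and 7 Hz) are made up for illustration; the sketch only assumes NumPy.

```python
import numpy as np

sample_rate = 10                         # samples per second, so the limit is 5 Hz
n = np.arange(20)                        # two seconds' worth of sample indices
t = n / sample_rate                      # the moments at which we take snapshots

low_tone = np.cos(2 * np.pi * 3 * t)     # 3 Hz: below half the sampling rate
high_tone = np.cos(2 * np.pi * 7 * t)    # 7 Hz: above half the sampling rate

# Once sampled, the two tones produce the same sequence of numbers.
indistinguishable = np.allclose(low_tone, high_tone)
print(indistinguishable)
```

The 7 Hz tone leaves exactly the same footprint in the samples as the 3 Hz tone, so after recording there is no way to tell which one was actually played. This effect is usually called aliasing, and sampling fast enough is how it is avoided.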
We have now established two different representations for audio: one that approximately describes the continuous signal in the physical world, and another that describes it by its individual frequencies and how they are aligned with each other. The first can simply be represented by a discrete array of numbers, whereas the frequency information can be represented by a spectrogram. Before we discuss how we actually obtain the spectrogram, it might be beneficial to have a look at a spectrogram and its parts. A spectrogram usually has three axes: frequency on the y-axis, time on the x-axis and amplitude represented by color. Let’s have a look at an example:
Let’s inspect each of these three axes in more detail.
- The x-axis represents time, and shows how the different frequencies behave over time (more on this later).
- The y-axis represents the different frequencies present in the signal, from low to high frequencies. Since the physical world is continuous, this axis is also a discretization, where a segment on the axis is oftentimes called a frequency bin. A bin consists of a number of neighboring frequencies grouped together, with their amplitudes added together.
- Color, on the other hand, can be considered the third axis, and is represented by brightness. Similarly to a heatmap, the brighter a specific spot is, the higher the intensity of a specific frequency at that point in time, intensity being the amplitude in this case.
You might have noticed already, but phase information is not included in this spectrogram, and there is a good reason for that.
The Fourier Transform
We’re getting closer to the meat of the story. The problem at hand is: how do we actually obtain this spectrogram representation? To be more precise, how do we transition from a discretized time signal towards a frequency representation? Retrieving the different frequencies that occur in a specific recording might be problematic. At the time of recording, we observed the amplitude of the signal reaching the diaphragm of the microphone, and that recorded amplitude already consisted of a mixture of different frequencies. At this point it would seem impossible to obtain the different frequencies! It would be similar to trying to retrieve the individual fruits and ingredients of a smoothie after they have gone through the blender.
Luckily we can accomplish this with most modern programming frameworks in a couple of lines of code! This is thanks to the Fourier transform, which is named after its creator. It is actually very likely that you’ve used technology that implements a Fourier transform in some manner (spoilers: your phone). Essentially, what the Fourier transform allows us to do is to convert a time domain signal into a frequency representation. The formula for the Fourier transform can look a little scary, but don’t fret for now, it’ll be clear in a second. The formula:
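One common way to write it, with g(t) standing for the time signal and f for the frequency being tested (the symbol names are an assumption on my part, other texts use different letters), is:

$$\hat{g}(f) = \int_{-\infty}^{\infty} g(t)\, e^{-2\pi i f t}\, dt$$

Read from right to left: take the signal g(t), multiply it by a term that spins around a circle f times per second, and add everything up over time. The result is a single complex number for each frequency f.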
I strongly believe that it is much easier to understand the Fourier transform by understanding how it accomplishes this disentanglement, rather than working through the math immediately. The fundamental concept that the Fourier transform builds on is that any signal can be represented by a combination of a number of simpler signals. In this case the simple signal is a sine wave. This is great! The Fourier transform thus tries to find the simpler sine waves that make up our original signal. Of course, in the physical world, not all oscillations are sine waves, but they are more than sufficient to approximate the frequency content in a given signal.
The single most illustrative resource on the Fourier transform that I could find is this YouTube video by Grant Sanderson, who runs his channel under the alias 3blue1brown. I have watched this video more times than I would like to admit, and still feel like I am learning a little something new every time I do. Explaining in 3blue1brown terms, the formula you see above wraps our recorded signal around a circle. Imagine that the signal we just ‘wound’ around the circle has a certain weight to it (similar to a metal wire), with the red dot representing its center of mass. We now wind this wire ever more tightly around the circle and record the position of its center of mass relative to the center of the circle. Observing a large deviation signifies the presence of a specific frequency. This process is illustrated in the gif below, which is actually an excerpt of Grant Sanderson’s video.
In the gif above, the original signal is a simple sine wave, oscillating at 3 beats per second, meaning that it has a frequency of 3 beats per second. By wrapping it around the unit circle and incrementally tightening it, we observe that the center of mass deviates from the center of the circle at this very specific frequency. Attempting to re-explain the Fourier transform when there already are so many great resources that do an amazing job at it would be futile. Here are some more resources, with varying degrees of difficulty, that should be more than sufficient to help you understand the history and mathematics behind the Fourier transform:
- ‘Highlights in the history of the Fourier Transform’ by Alejandro Domínguez is a great read and can be found here. The introduction provides a nice recount of Fourier’s life, before moving on to a quite in-depth analysis of the transform.
- More historical context behind the Fourier transform can be found in the article titled Fourier Transform: Nature’s Way of Analyzing Data by Rohit Thummalapalli.
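The winding picture described earlier can also be sketched numerically. This is my own minimal NumPy rendition of the idea, not code from the video: the function name center_of_mass, the sampling rate and the list of winding frequencies are all assumptions made for the demo.

```python
import numpy as np

sample_rate = 200                                  # samples per second (assumed)
t = np.arange(4 * sample_rate) / sample_rate       # four seconds of sample times
signal = np.cos(2 * np.pi * 3 * t)                 # oscillates at 3 beats per second

def center_of_mass(signal, t, winding_freq):
    """Wrap the signal around the origin at winding_freq turns per second
    and return the average position of the wound-up curve (a complex number)."""
    wound = signal * np.exp(-2j * np.pi * winding_freq * t)
    return wound.mean()

# Try a range of winding speeds and record how far the center of mass
# ends up from the center of the circle for each of them.
winding_freqs = np.arange(0.25, 6.01, 0.25)
distances = np.array([abs(center_of_mass(signal, t, f)) for f in winding_freqs])

best = winding_freqs[np.argmax(distances)]
print(best)
```

For almost every winding speed the wire is spread evenly around the circle and the center of mass sits at the origin. Only when the winding speed matches the signal’s own 3 beats per second does the wire pile up on one side, and that is the deviation the transform picks up.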
DFT, FFT, STFT
For someone new to all of this, the lingo might be quite confusing. The type of Fourier transform that we use today is the DFT, short for Discrete Fourier Transform. The DFT is used on discrete time signals, such as arrays of numbers representing audio signals (but it can be used for all sorts of discrete time signals, such as seismic information for example). The word ‘discrete’ literally means individually separate and distinct. FFT stands for Fast Fourier Transform, which is simply a faster and more efficient algorithm for computing the DFT, and is what is commonly used nowadays.
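To see that the DFT and the FFT really compute the same thing, here is a short sketch that implements the DFT the slow, direct way and compares it with NumPy’s np.fft.fft. The function naive_dft is a hypothetical helper written for this comparison only.

```python
import numpy as np

def naive_dft(x):
    """Direct DFT: compare the signal against N complex sinusoids (O(N^2) work)."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    k = np.arange(n).reshape(-1, 1)            # output frequency index (one per row)
    m = np.arange(n).reshape(1, -1)            # input sample index (one per column)
    basis = np.exp(-2j * np.pi * k * m / n)    # each row is one sinusoid to test against
    return basis @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(256)                   # some arbitrary discrete signal

same_result = np.allclose(naive_dft(x), np.fft.fft(x))
print(same_result)
```

The direct version needs on the order of N² multiplications for a signal of N samples, whereas the FFT gets the same numbers in roughly N log N steps, which is what makes it practical for long recordings.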
Now the last thing we need to clear up is: why isn’t there any phase information in the spectrogram? The Fourier transform actually returns an array of complex numbers, one per frequency. The magnitude (absolute value) of each complex number tells us how strongly that frequency is present in the signal, and its angle tells us the phase. This is why we also call the kind of spectrogram we showed earlier a ‘magnitude’ spectrogram: it simply shows the magnitudes of the frequencies present in the signal. The phase information is not included.
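As a small illustration of how magnitude and phase both live inside the complex output, the sketch below builds a tone with a known amplitude and phase offset and reads both back. The test tone (50 Hz, amplitude 2, offset 0.75 radians) is made up for the example; np.fft.rfft, np.fft.rfftfreq, np.abs and np.angle are standard NumPy functions.

```python
import numpy as np

sample_rate = 1000
t = np.arange(sample_rate) / sample_rate        # one second of sample times
# A 50 Hz cosine with amplitude 2 and a phase offset of 0.75 radians.
x = 2.0 * np.cos(2 * np.pi * 50 * t + 0.75)

spectrum = np.fft.rfft(x)                       # complex numbers, one per frequency bin
freqs = np.fft.rfftfreq(len(x), d=1 / sample_rate)

magnitude = np.abs(spectrum)                    # what a magnitude spectrogram keeps
phase = np.angle(spectrum)                      # what it leaves out

peak = np.argmax(magnitude)
peak_freq = freqs[peak]
peak_amplitude = 2 * magnitude[peak] / len(x)   # rescaled back to the tone's amplitude
peak_phase = phase[peak]
print(peak_freq, peak_amplitude, peak_phase)
```

The strongest bin sits at the tone’s frequency, its magnitude recovers the amplitude after rescaling, and its angle recovers the phase offset we put in.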
This raises the question: why do we have a time axis in the spectrogram then? This is because we don’t actually compute the Fourier transform of the entire signal. We compute many DFTs of smaller chunks of the signal, and then collate them in the spectrogram image we showed. Computing the Fourier transform for the entire signal would simply give us a graph that shows the magnitude of each frequency present in the signal. An example:
This chunked approach has its own name: the STFT, which stands for Short-Time Fourier Transform. It computes the FFT of short (usually overlapping) segments of an audio recording and then collates them.
The Fourier transform has been heavily used in recent years as a crucial step in many machine learning pipelines. With many modern programming frameworks, computing it is as simple as writing one line of code, and it is extremely likely that you have used technology that utilizes a Fourier transform internally. The Fourier transform originates from Fourier’s
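Putting the pieces together, a bare-bones magnitude spectrogram can be sketched in a few lines of NumPy. This is a simplified illustration, not what any particular audio library does internally: the function name magnitude_spectrogram, the frame length of 256 samples, the hop of 128 samples and the Hann window are all choices I made for the demo.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_length=256, hop_length=128):
    """Cut the signal into overlapping frames, window each frame, take its FFT,
    and stack the magnitudes: rows are frequency bins, columns are time steps."""
    window = np.hanning(frame_length)           # fades each frame in and out
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    columns = []
    for i in range(n_frames):
        start = i * hop_length
        frame = signal[start:start + frame_length] * window
        columns.append(np.abs(np.fft.rfft(frame)))
    return np.stack(columns, axis=1)

sample_rate = 8000
t = np.arange(2 * sample_rate) / sample_rate
# First second: a 500 Hz tone. Second second: a 2000 Hz tone.
signal = np.where(t < 1.0, np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 2000 * t))

spec = magnitude_spectrogram(signal)
freqs = np.fft.rfftfreq(256, d=1 / sample_rate)  # center frequency of each bin

first_peak = freqs[np.argmax(spec[:, 0])]        # loudest frequency in the first frame
last_peak = freqs[np.argmax(spec[:, -1])]        # loudest frequency in the last frame
print(spec.shape, first_peak, last_peak)
```

Each column of spec is the magnitude spectrum of one short chunk, so reading the columns from left to right shows how the frequency content changes over time. A single Fourier transform of the whole two seconds would show both tones, but not when each of them was playing.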