From the outside, machine learning seems so cool. The projects you see online or the products created by tech firms make you believe that machine learning is the answer. It seemed so cool to me that I wanted to know what it would take to apply machine learning to a problem from conception to completion. In this post, I want to highlight some key takeaways from my experience.
To give everyone some context, I’m not a machine learning scientist. I’m a software developer whose experience has largely been around full-stack or end-user applications, on both desktop and mobile. That’s been my gig for the last 18-odd years. I came into machine learning because it was so far off from what I do day-to-day. Keep in mind while reading this post that a lot of what I was doing was through trial and error, fumbling my way through the dark, unsure if there was an actual exit to the room I was in. At most I’d completed some courses online and read several books on machine learning, but I hadn’t really tried things out on my own without someone narrating or holding my hand on what I should do next. I have a general understanding of what it means to create a fully-connected network, to apply activations, to test hyperparameters, and so on. With that out of the way, let’s get started.
The project seemed pretty straightforward: take a mixed audio track and extract the vocals. Can machine learning be used to solve this problem? How can I approach this problem? Although this isn’t strictly speaking a “Hello, World” problem, it was my first project with many, many unknowns, and was the first step towards, perhaps, becoming a machine learning person.
Before I could start developing my model, I needed data. Finding data can be hard, or really easy, depending on your situation. Maybe you already have access to the data, maybe you don’t. Don’t be afraid to read scientific papers that might point you to an openly available dataset. Luckily, audio segmentation seems to be a pretty popular area of study, and I managed to stumble on the MUSDB18 dataset by SigSep.
The dataset comes in MP4 format, which includes something called stems. There was a stem for just the vocals, just the drums, and just the accompaniment. I had no idea that MP4s could be stored like that! Moreover, the data is encoded at 44.1 kHz, which means the data is pretty high quality. SigSep also provides a Python library to work with this data.
Put yourself in the model’s shoes
To me, at first blush, this problem appeared to be solved simply by “inputting audio into a model and getting the output.” This was an awful strategy. Machine learning can be powerful, but you have to weigh the cost of what is reasonable and practical. Naively thinking I could pass part or all of an audio file into a model was untenable. Just imagine putting yourself in the shoes of the machine learning model: if I gave you one sample of an audio file, could you tell me where the voice is? All you have is the amplitude data at some point in time. Even if I gave you what the expected amplitude value is at that time, you’d be hard pressed to tell me how much of the input audio’s amplitude at a given time step should be considered just vocals.
I turned to looking at the frequency domain of the audio file. I didn’t know much about working with audio, but I knew about Fourier transforms. After a bit of Googling, the idea of using short-time Fourier transforms (STFT) came up. If you’ve never heard of the STFT, in brief, it allows you to break down your 1D input signal into 2D, where one axis is time and the second axis is frequency. Roughly speaking, it means I could see how strongly each frequency is present in the signal at some point in time. Taking a look at a plot of an STFT, as you move along the time axis, you can move up along the frequency axis and notice some bright spots and some darker spots. Bright spots indicate a stronger value for that given frequency at that time.
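To make the idea concrete, here is a minimal sketch of an STFT written with plain NumPy. It is not the code from this project: the function name, the frame size of 1024, the hop size of 256 and the Hann window are all assumptions chosen for illustration. It slices the signal into overlapping frames, windows each one, and takes the FFT of each frame, so every column of the result is one time step.

```python
import numpy as np

def stft(signal, frame_size=1024, hop=256):
    """Return a complex array of shape (frequency bins, time frames).

    frame_size and hop are illustrative values, not the project's settings.
    """
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    # Cut the 1D signal into overlapping, windowed frames.
    frames = np.stack([signal[i * hop:i * hop + frame_size] * window
                       for i in range(n_frames)])
    # One FFT per frame; transpose so that rows are frequencies.
    return np.fft.rfft(frames, axis=1).T

# One second of a pure 440 Hz tone sampled at 44.1 kHz.
sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

spec = stft(tone)
magnitude = np.abs(spec)  # the "brightness" you would see in a plot

# The brightest bin of the first column should sit near 440 Hz.
freqs = np.fft.rfftfreq(1024, d=1 / sample_rate)
peak_hz = freqs[magnitude[:, 0].argmax()]
print(spec.shape, peak_hz)
```

Each frequency bin here is about 43 Hz wide (44100 / 1024), so the peak lands on the bin closest to 440 Hz rather than on 440 Hz exactly.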
I applied this STFT to both a mixed audio signal and a vocal-only signal, and could visually compare the difference. As a human, if I was given a STFT plot of some audio, and some vocals of the same song, over time I could probably decipher which frequencies belong to voices and which are everything else. I figured, if I could do it, maybe a machine learning model could do it too. My approach would be to use a single column of the STFT as input into my model. What about the output?
Again, I had to consider how I would try to solve this problem if I were the model. Even though I could identify which frequencies might belong to the vocal track, I wouldn’t be able to generate an STFT for you. What I could do was say “This frequency might be part of the vocal track.” This sounded a lot like I could generate a mask of the frequency from the input image. I would take the same approach with my output. Since I have access to the vocal stem in my data, after passing it through the STFT and isolating a time slice, I would use an arbitrary threshold to generate a binary mask as my “expected” output. Here’s what that might look like:
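In code, thresholding one time slice into a binary mask could be sketched roughly like this. The function name, the sample values and the threshold of 0.3 are made up for illustration; the post only says the threshold was arbitrary.

```python
import numpy as np

def binary_mask(vocal_column, threshold):
    """Mark each frequency bin 1.0 if the vocal magnitude exceeds threshold, else 0.0."""
    return (np.abs(vocal_column) > threshold).astype(np.float32)

# Hypothetical magnitudes for five frequency bins of one STFT time slice.
vocal_column = np.array([0.02, 0.9, 0.4, 0.05, 1.3])
mask = binary_mask(vocal_column, threshold=0.3)
print(mask)
```

Taking np.abs first means the same function works on the complex values an STFT produces as well as on plain magnitudes.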
This means, that in order to extract the vocal stem from an input mixed audio, my program would have to do the following steps:
- Get the STFT of the mixed audio
- Pass it into my model, which will generate a mask
- Apply the mask to the input STFT. This should generate a STFT of the expected vocal stem
- Use an inverse STFT on the expected vocal stem to get back the audio signal
All of these steps seemed reasonable to me. Moreover, it was a mental model I could wrap my head around. Now it was time to move forward with training a model.
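The steps above can be sketched end to end with NumPy. This is an illustration under stated assumptions, not the project’s code: the frame and hop sizes are invented, the inverse STFT is a simple overlap-add, and the mask is hand-made (keep everything below 1000 Hz) as a stand-in for whatever a trained model would produce. The “vocals” are just a 220 Hz tone and the “accompaniment” a 3000 Hz tone.

```python
import numpy as np

FRAME, HOP = 1024, 256  # assumed frame and hop sizes

def stft(signal):
    window = np.hanning(FRAME)
    n_frames = 1 + (len(signal) - FRAME) // HOP
    frames = np.stack([signal[i * HOP:i * HOP + FRAME] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T  # (frequency bins, time frames)

def istft(spec):
    """Overlap-add inverse, normalised by the summed squared window."""
    window = np.hanning(FRAME)
    frames = np.fft.irfft(spec.T, n=FRAME, axis=1)
    length = HOP * (len(frames) - 1) + FRAME
    signal = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        start = i * HOP
        signal[start:start + FRAME] += frame * window
        norm[start:start + FRAME] += window ** 2
    return signal / np.maximum(norm, 1e-8)

# Toy "song": a low tone standing in for vocals, a high tone for the rest.
sample_rate = 8000
t = np.arange(8192) / sample_rate
vocals = np.sin(2 * np.pi * 220 * t)
accompaniment = 0.5 * np.sin(2 * np.pi * 3000 * t)
mixed = vocals + accompaniment

# 1. STFT of the mixed audio.
mixed_stft = stft(mixed)
# 2. A mask (hand-made here; in the real pipeline the model generates it).
freqs = np.fft.rfftfreq(FRAME, d=1 / sample_rate)
mask = (freqs < 1000).astype(float)[:, None]
# 3. Apply the mask to the input STFT.
vocal_stft = mask * mixed_stft
# 4. Inverse STFT to get audio back.
estimate = istft(vocal_stft)
```

The first and last few samples are unreliable because the window tapers to zero there, so any comparison against the original is best done away from the edges.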
Don’t be afraid to preprocess your data
In order to accomplish the steps laid out above, I would have to write code (gasp!) to generate the data to pass into my model for training. Taking the naive approach again, I decided to generate the STFT of the mixed song and the vocal stem on the fly. If I was just trying to test things out quickly, this would have been fine. Unfortunately, I quickly ran out of RAM trying to hold all that data in memory. Moreover, I found parts of this pipeline to be very slow, sometimes taking dozens of seconds just to get one sample passed into my model for training.
I won’t bother you with all the gory details of what I did next, but I can list some of the steps I did take to make all of this much more tenable.
- I preprocessed all the audio files and generated CSV files with some metadata. The metadata included things like the duration of each file. Part of what I found during this part of the project was that loading the audio file just to get the length required to compute the STFT was too slow. Reading it from a CSV made things much better
- I had to use an LRU cache to keep recent tracks in memory. My first implementation would just load a track on the fly. This was unbearably slow when I was trying to iterate quickly. Moreover, ffmpeg (the native library needed to read the files off disk) was super slow on some machines I was testing on when loading files in such a tight loop
- I had to drop frames of data. At times, for reasons I never bothered investigating, the expected size of my audio was not loaded into memory. Instead it would silently fail, and my model would crash while training because the expected size of my data was… not expected 🙂
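Two of the items in the list above, the LRU cache and dropping badly sized frames, can be sketched with the standard library alone. Everything named here is hypothetical: load_track stands in for the expensive ffmpeg-backed decode, the cache size of 8 is arbitrary, and the returned samples are fake.

```python
from functools import lru_cache

load_count = 0  # counts how often the "expensive" decode actually runs

@lru_cache(maxsize=8)  # keep the 8 most recently used tracks in memory
def load_track(path):
    """Hypothetical stand-in for decoding an audio file from disk."""
    global load_count
    load_count += 1
    return tuple(range(10))  # pretend these are audio samples

def valid_frames(frames, expected_size):
    """Drop any frame whose length is not what the model expects."""
    return [frame for frame in frames if len(frame) == expected_size]

load_track("song_a.mp4")
load_track("song_a.mp4")  # served from the cache, no second decode
print(load_count)

frames = [[1, 2, 3], [1, 2], [4, 5, 6]]
print(valid_frames(frames, expected_size=3))
```

The cached value is returned as a tuple because lru_cache hands back the same object on every hit, so a mutable array modified by one caller would be seen by all the others.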
Ultimately I settled on these tradeoffs simply to get things going. Looking back, I would say that I spent a disproportionate amount of time working on my data loading for this project. It’s important to note that this could be your reality. You may want to solve some problem using machine learning, trying out cool new model topologies. However, prepare yourself to spend a significant amount of time massaging your data and how it’s loaded just to get your model to start training.
Another approach I could have taken would have been to be more aggressive about creating intermediate forms of my data, much like I did with those CSV files. Creating intermediate forms of your data can shave hours off your model’s training time. Ultimately, I found this part of the machine learning process to be less machine learn-y and more software engineering-y.
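As an illustration of such an intermediate form, here is a small sketch that writes a metadata CSV once and later uses it to work out how many STFT frames each track yields, without touching the audio. The column names, track names, and the frame and hop sizes are assumptions for the example, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io

FRAME, HOP = 1024, 256  # assumed STFT frame and hop sizes

def write_metadata(rows, handle):
    """Write one row per track with its length in samples."""
    writer = csv.DictWriter(handle, fieldnames=["track", "num_samples"])
    writer.writeheader()
    writer.writerows(rows)

def frames_per_track(handle):
    """Compute the STFT frame count of each track from the metadata alone."""
    reader = csv.DictReader(handle)
    return {row["track"]: 1 + (int(row["num_samples"]) - FRAME) // HOP
            for row in reader}

# Preprocessing step: done once, ahead of training.
buffer = io.StringIO()
write_metadata([{"track": "song_a", "num_samples": 44100},
                {"track": "song_b", "num_samples": 88200}], buffer)

# Training-time step: cheap lookup instead of decoding audio.
buffer.seek(0)
frame_counts = frames_per_track(buffer)
print(frame_counts)
```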
Your model’s topology isn’t going to be obvious
PS, I decided to use Keras as my machine learning library for this project. I’ll leave out the decision-making reasons here, and leave some reasons at the end of the article. However, I just wanted to let you know in case I start using some Keras-specific language.
Trying to figure out a model topology can be hard. Maybe this is something that you get better at over time. But given a blank piece of paper and what my inputs and outputs were, I could not figure out what my model should be. As was the approach until this point, I decided to take the naive route, and just create something. This was another bad idea.
The models I created ended up with millions and millions of learnable parameters. Millions of learnable parameters means more computations are required by your model, making it either too slow, or too large, or both. If your model is slow, then training your model will be slow. Coupled with slow data loading from the previous section, this was a recipe for failure. A better approach would have been to try a working topology and start from there. If there is no literature on your data specifically, then pick a model which has shown success with data that somewhat resembles your data. In my case, I came across a paper which used LSTMs as part of their model. LSTMs make sense, since I am dealing with data that has a time element to it. The model was quite small (about 300,000 parameters) and trained somewhat quickly. There was some other literature I came across that passed in a larger slice of the STFT (instead of just using one time step) as a matrix into their model, and used common image-based techniques to derive the same output.
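To see how quickly parameter counts grow, it helps to do the arithmetic by hand. The sketch below uses the standard formulas for a fully-connected layer (weights plus biases) and for an LSTM layer with biases (four gates, each with input weights, recurrent weights and a bias). The layer sizes are invented for illustration and are not the topologies from this project or from the paper mentioned above.

```python
def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units

def lstm_params(inputs, units):
    """Four gates, each with input weights, recurrent weights and biases."""
    return 4 * (units * (inputs + units) + units)

N_BINS = 513  # frequency bins in one STFT column for a 1024-sample frame

# A naive stack of wide fully-connected layers: 513 -> 2048 -> 2048 -> 513.
dense_total = (dense_params(N_BINS, 2048)
               + dense_params(2048, 2048)
               + dense_params(2048, N_BINS))

# A small recurrent model: one 128-unit LSTM followed by a 513-unit output layer.
lstm_total = lstm_params(N_BINS, 128) + dense_params(128, N_BINS)

print(dense_total, lstm_total)
```

With these made-up sizes the dense stack comes to roughly 6.3 million parameters against roughly 0.4 million for the LSTM model, which is the kind of gap described above.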
Be prepared to wait, but don’t waste your time
If preprocessing your data takes a disproportionate amount of your engineering time, training takes a disproportionate amount of your patience. If you see that building your model will take hours, bail out, see if you can break things down to a smaller problem and try again. It’s not worth your time to wait only to see that your model has very poor accuracy.
With Keras (and I’m sure this is true with other libraries), the ETA and the accuracy of each epoch (one iteration through your data) was displayed in the output console. If I saw that it would take me 2+ hours to train this model, I would stop the program, modify some piece (usually from the data generator), and try again.
If the accuracy was bad (10% or less) after a few minutes, I would quit the program again. To me, this usually meant that my model was having a hard time learning any patterns given the input/output I’d given it. Although I present this recount of my experience as a linear journey through time, this was far from the case. I often had to iterate from the beginning: deciding whether or not the input/output combination I’d chosen made sense, reworking my data to fit any new choice, and figuring out if my model could learn from my new approach. However, it’s important to know where to spend your precious time. Your time should be spent:
- Trying to understand your data
- Preprocessing the data into the right format for training
- Understanding what your model’s topology is, and how it can be improved
If training your model will take hours, allow it to do so with the knowledge that you’ve exhausted all other options. Your time is precious, and you should spend it wisely. I often made the mistake of allowing a model to train too long even when the accuracy appeared to be too low. Don’t make the same mistake I made.
It’s never too early to try a prediction
Although this section comes after the section on training, it could have easily been put before. With Keras (again, I’m sure this is true for other libraries), you can run a prediction with your model before training. My suggestion to anyone reading is to always try running a prediction. A model’s learnable parameters are initialized with random values, which means, at worst, your output is random, at best, it works and you don’t have to go through training! The idea is that trying to run a prediction before training forces you to write the code that uses your model and generates the eventual output.
In my case, attempting a prediction early was beneficial. Since the input to my model was a single time slice of the STFT, and the output of my model was a mask, I would have to apply some post-processing steps before I could determine if my model worked. The steps that needed to be written were:
- Take an MP4, and apply the STFT
- For each time step of the STFT, pass it through my model’s prediction to generate a mask
- Collect each time step’s mask and rebuild a mask to match the size of the STFT
- Multiply the mask and the input STFT to generate an STFT corresponding to a vocal STFT
- Invert the vocal STFT (using an inverse STFT), and save the result as a wave file
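The middle steps of that list (predict a mask per time step, reassemble the masks, multiply with the input STFT) might look roughly like the sketch below. It deliberately avoids Keras so that it runs on its own: predict_mask is a hypothetical stand-in for calling predict on an untrained model, built from one random weight matrix and a sigmoid, and the STFT is random data rather than a real song.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BINS = 513

# Stand-in for an untrained model: random weights, never trained.
weights = rng.normal(scale=0.01, size=(N_BINS, N_BINS))
bias = np.zeros(N_BINS)

def predict_mask(column):
    """Map one STFT time slice to a mask with values between 0 and 1."""
    logits = np.abs(column) @ weights + bias
    return 1.0 / (1.0 + np.exp(-logits))

def predict_full_mask(spec):
    """Predict one mask per time step and rebuild a mask the size of the STFT."""
    columns = [predict_mask(spec[:, t]) for t in range(spec.shape[1])]
    return np.stack(columns, axis=1)

# Fake complex STFT with 20 time steps, in place of a real track.
spec = rng.normal(size=(N_BINS, 20)) + 1j * rng.normal(size=(N_BINS, 20))

mask = predict_full_mask(spec)
vocal_estimate = mask * spec  # STFT of the (so far meaningless) vocal estimate
print(mask.shape, vocal_estimate.shape)
```

Because the post-processing only depends on the mask having the same shape as the STFT, the same code keeps working unchanged once the stand-in is swapped for a trained model.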
After running a prediction before training my model, I had two things
- I had working code to see the results of my model, and
- I had a baseline for a randomly generated output
Once your model is done training, simply run the same prediction code again, and bask in the awesomeness of your results! In case you were wondering, here are mine. The first image is the STFT of the actual vocal track, the second image is what my untrained model generated:
Basically it looks like my untrained model just returned a mask that accepted the whole song as a vocal!
Now after all that, you might be wondering, how’d I do? Did things work out? I’d say yes and no. The best accuracy I could muster was about 60%. I strongly believe it has to do with how I split my data (which is a whole other topic on its own). While trying to find a balance between the amount of data I could get into my model for training and trying to get my model to train in some acceptable amount of time, I don’t think I gave my model enough variety in the data for it to generate a strong result. You can see my result for yourself below. The first image is the STFT of the actual vocal track, the second image is what my model generated:
So those were the major takeaways from my experience. I have to admit that I genuinely enjoyed the experience of trying to solve the problem of audio segmentation using machine learning. Were there moments when I thought none of this was possible? Absolutely. But machine learning, in its current state, requires you to persevere.
It’s fun to think that you could create a machine learning model to solve some of the world’s complex problems. And truth be told, you probably could. However, as a hobbyist machine learning developer, you have to focus on what small problem you can solve, so you can effectively train a model to give you reasonable results. Build enough of these models, and you can start to connect the output of one into the input of another and then try to solve much larger complex problems!
I found that trying to put myself in the machine learning model’s shoes was a good way to break down big problems. Breaking things down into smaller problems allowed me to interpret the data I had, and gave me an opportunity to solve it with machine learning.
There were some things I wanted to share that I didn’t think quite fit into my article, so I’ll just put them here.
Machine learning on Macs can be less than ideal
I have a 2017 MacBook Pro I use for my day-to-day development. I don’t have local access to any other machines. Unfortunately, training on OSX is slow. This is in large part due to the fact that most machine learning libraries only make use of GPUs made by Nvidia. My MacBook has an AMD video card, and unfortunately, none of the major Python libraries that I could find could make use of the AMD card. This means that training on my machine was unbearably slow because it had to use the CPU for everything.
The only glimmer of hope was PlaidML. My understanding of PlaidML is that it allows some libraries to run some optimized computations on either Intel chips or AMD video cards. Whatever the case might be, they provide a means of integrating with Keras to help improve training on OSX. You can run benchmarks on your local machine to see which configuration performs best. In my case, I configured PlaidML to use my AMD card using Metal instructions (other options included running OpenCL on either the Intel chip or the AMD video card) and this reduced training time by an entire order of magnitude.
Alternatives to training on an OSX machine include trying out cloud-provided services, like FloydHub. I tried out FloydHub, but I found the feedback loop to be too big for me. I wanted something tighter. I ended up trying Colab as a viable alternative. Colab provides a cloud-based service for working with Jupyter notebooks, for free! I believe it’s meant for training machine learning models, because you can configure your environment to operate on the CPU, GPU or TPU! If you can spare the space required to upload your training and testing data, it’s a pretty compelling option if you don’t have access to a machine-learning-capable computer.
Python has some super sweet libraries!
I’m fairly new to Python, so I’m not currently steeped in how best to use all of its libraries. However, there was more than one instance while working with numpy, pandas, or Python itself where I was truly impressed with what I could do with one or two lines of code that would make me cringe in any other language. A simple example would be checking whether an integer value is between two values. In any other language, the code I’d write might look as follows:
if (value > min && value <= max)
With Python, it would look as follows:
if min < value <= max:
Isn't that delightful? Similarly, array slicing with numpy is so convenient, I sincerely wish it existed in every library I use. Just look at this syntax I could use to get a whole column from a two by two matrix:
column = my_array[:, 0]
Isn't that so nice? I'm sure you can coax languages like Swift to do something similar, but I was genuinely pleased with this ability. Of course, with great power comes great responsibility. NumPy has this thing called broadcasting, which I believe means that if NumPy tries to apply an operation to two arrays of differing shapes, it attempts to stretch the size-1 dimensions of one array to match the other. This led to some unexpected array sizes from time to time. I constantly found myself writing print(X.shape) after each operation just to make sure I was doing the right thing. The need to do this probably goes away after a lot of Python experience, but it was something I constantly found myself getting wrong.
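Here is a small, made-up example of the kind of surprise broadcasting can cause. Adding a flat array of three values to a column of three values does not add them element by element. NumPy stretches both to a 3 by 3 grid instead, and only a shape check reveals it.

```python
import numpy as np

row = np.array([1.0, 2.0, 3.0])               # shape (3,)
column = np.array([[10.0], [20.0], [30.0]])   # shape (3, 1)

# Broadcasting pairs every row value with every column value.
total = row + column
print(total.shape)

# Slicing the column down to shape (3,) gives the element-wise sum instead.
intended = row + column[:, 0]
print(intended.shape, intended)
```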