Having reached a decent stopping point with LSTMs (see previous post), I decided to implement attention myself and see if I could improve on the Music Transformer. I'd also recently heard about OpenAI's GPT-3, and I wondered if I could borrow their idea of a decoder-only Transformer (GPT explanation).
For reference, the feed-forward part is just a simple fully connected layer (or stack of layers) of neurons.
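To make that concrete, here's a minimal sketch of one decoder-only Transformer block in PyTorch: masked self-attention followed by a small feed-forward layer, each with a residual connection and layer norm. The layer sizes, dropout, and normalization placement here are illustrative placeholders, not necessarily the values I actually used.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder-only Transformer block: causal self-attention + feed-forward."""

    def __init__(self, d_model=256, n_heads=8, d_ff=1024, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Causal mask: each position may only attend to itself and earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        x = self.norm1(x + self.drop(attn_out))
        x = self.norm2(x + self.drop(self.ff(x)))
        return x
```

Stacking a few of these blocks on top of a token embedding (plus positional information) and putting a linear layer over the vocabulary at the end gives the basic GPT-style setup.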
Over the course of a few days, I implemented this architecture. I ended up training on a much larger dataset with many artists, so I added an artist token to my data representation (see previous post) to account for different styles, and possibly to allow generation in the style of a particular artist.
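The artist token idea is simple: reserve one extra vocabulary ID per artist and prepend it to that artist's pieces, so the model can condition on style from the very first step. A rough sketch (the token IDs and artist names here are made up, and the real event encoding is the one from the previous post):

```python
# Hypothetical artist-token mapping: IDs placed just past the normal event vocabulary.
EVENT_VOCAB_SIZE = 1000
ARTIST_TOKENS = {"artist_a": EVENT_VOCAB_SIZE, "artist_b": EVENT_VOCAB_SIZE + 1}

def encode_piece(events, artist):
    """events: list of integer event tokens for one piece.
    Returns the same sequence with the artist's style token prepended."""
    return [ARTIST_TOKENS[artist]] + events
```

At generation time, seeding the model with a chosen artist token should nudge the output toward that artist's style.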
This model reached around 78% train accuracy and similar validation accuracy on a dataset of over 1200 pieces of music. Among other improvements, I significantly reduced the size of the dataset on disk, which allowed me to train on more pieces. Previously, to make the model more robust, I had created many transposed, time-stretched copies of each piece; now I just apply a random transposition (within a reasonable range) to each piece when it's drawn for training.
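The on-the-fly transposition amounts to something like the sketch below: pick one random semitone shift per piece each time it's sampled, instead of storing many shifted copies. The note format here (pitch, start, duration tuples) and the shift range are assumptions for illustration, not my exact representation.

```python
import random

def random_transpose(notes, max_shift=5):
    """notes: list of (pitch, start, duration) tuples for one piece (format assumed here).
    Applies a single random semitone shift to the whole piece each time it is
    sampled for training, so no transposed copies need to be stored on disk."""
    shift = random.randint(-max_shift, max_shift)
    return [(pitch + shift, start, duration) for pitch, start, duration in notes]
```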
Here’s an interesting sample (I forgot which artist it was based on):
I also tried training the model on Touhou music and got pretty nice results:
The model somehow reached about 90% train accuracy and 85% validation accuracy on the pieces I used, which is pretty high for this task.