In this blog, I’m going to be mentioning attention a lot. So what is it?
Let’s start with neural networks.
A neural network is a program that learns from data, loosely inspired by the human brain. In this model, software neurons (functions) are grouped into layers, and neurons in adjacent layers are connected to one another. The connections are shown by the lines in the image. The general idea is that you give the network some data and ‘train’ it to produce the desired output by strengthening successful paths and weakening unsuccessful ones. In the image, the darkening lines represent connections being strengthened.
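To make that a little more concrete, here’s a minimal sketch of a single layer in NumPy. The weights below are made-up numbers; in a real network, training is what nudges them up or down.

```python
import numpy as np

# One layer of 3 neurons, each connected to 4 inputs.
# The weights are the connection strengths (made-up values here);
# training strengthens or weakens them based on how useful they are.
W = np.array([[ 0.2, -0.5,  0.1],
              [ 0.8,  0.3, -0.2],
              [-0.4,  0.7,  0.5],
              [ 0.1,  0.0,  0.6]])   # shape: (4 inputs, 3 neurons)
b = np.zeros(3)                      # one bias per neuron

def layer(x):
    # Weighted sum of the inputs, passed through a ReLU activation.
    return np.maximum(0, x @ W + b)

x = np.array([1.0, 0.0, 2.0, 1.0])   # some input data
print(layer(x))                      # this layer's output, fed to the next layer
```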
However, there are many more types of layers. Attention, the key idea behind many state-of-the-art sequence-to-sequence (seq2seq) models today, looks something like this. The high-level idea is to weight the elements of the input (bottom) differently when producing the output (top). This allows models to remember patterns and focus on key information more effectively.
This is the kind of attention I use in my model (self-attention, as the model ‘pays attention to itself’). Specifically, because it uses dot products to compute the attention scores, it’s called dot-product self-attention.
In many seq2seq problems, you have data such as text that can’t easily be represented as numbers. However, a neural network needs a numerical input. Let’s say I want to translate ‘I ate an apple.’ into French. I could give each word a number, turn the sequence into [1, 2, 3, 4], and input that to the model. But why is the value of ‘apple’ closer to the value of ‘an’ than the value of ‘ate’?
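Here’s that naive numbering as a quick sketch (the word-to-number assignment is completely arbitrary, which is exactly the problem):

```python
# Naively assign each word an arbitrary integer.
vocab = {"I": 1, "ate": 2, "an": 3, "apple": 4}
sentence = ["I", "ate", "an", "apple"]
ids = [vocab[word] for word in sentence]
print(ids)  # [1, 2, 3, 4] -- but 'apple' (4) being numerically close to 'an' (3) means nothing
```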
Enter the embedding layer, which generates an n-dimensional vector representation of each input. During training, the model learns these embeddings and figures out which inputs should be closer to each other. Those are the inputs you see in the image.
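Under the hood, an embedding layer is basically a lookup table with one learned row per word. Here’s a rough sketch in NumPy; the vector values below are placeholders, since a real model learns them during training.

```python
import numpy as np

# One row per word in the vocabulary, one column per embedding dimension.
# These 4-dimensional vectors are placeholders; training would learn them.
embedding_table = np.array([
    [0.0, 0.0, 0.0, 0.0],   # id 0: padding / unused
    [0.2, 0.7, 0.1, 0.9],   # id 1: "I"
    [0.8, 0.1, 0.6, 0.3],   # id 2: "ate"
    [0.4, 0.4, 0.2, 0.5],   # id 3: "an"
    [0.5, 0.3, 0.9, 0.1],   # id 4: "apple"
])

ids = [1, 2, 3, 4]                  # "I ate an apple."
embedded = embedding_table[ids]     # look up one vector per word
print(embedded.shape)               # (4, 4): 4 words, 4 dimensions each
```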
For attention, each input needs a query, a key, and a value. If we wanted to create 3-dimensional query, key, and value vectors, as in the diagram, we’d create a 4 by 3 weight matrix for each of the query, key, and value. When we multiply each 1 by 4 input vector by these weight matrices, we get a query, key, and value vector for that input.
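Here’s that step as a sketch in NumPy. The input embeddings and weight matrices below are example values I’m assuming for illustration, picked so they reproduce the first query [1, 0, 2] and the scores used in the next step.

```python
import numpy as np

# Three inputs, each a 1 by 4 embedding (assumed example values).
X = np.array([[1, 0, 1, 0],
              [0, 2, 0, 2],
              [1, 1, 1, 1]], dtype=float)

# A 4 by 3 weight matrix for each of query, key, and value (assumed example values).
W_query = np.array([[1, 0, 1],
                    [1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 1]], dtype=float)
W_key   = np.array([[0, 0, 1],
                    [1, 1, 0],
                    [0, 1, 0],
                    [1, 1, 0]], dtype=float)
W_value = np.array([[0, 2, 0],
                    [0, 3, 0],
                    [1, 0, 3],
                    [1, 1, 0]], dtype=float)

# One 3-dimensional query, key, and value per input.
Q = X @ W_query   # first row: [1, 0, 2]
K = X @ W_key
V = X @ W_value
```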
For now, we’ll try to create the first output. To do this, we take the dot product of the first query [1, 0, 2] with every key. This gives us 3 attention scores, [2, 4, 4], which we normalize with a softmax so they sum to 1, giving roughly [0.0, 0.5, 0.5].
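Continuing the sketch above, the dot products and the softmax look like this. Note that the softmax output is only approximately [0.0, 0.5, 0.5]:

```python
scores = Q[0] @ K.T                              # query 1 dotted with every key -> [2, 4, 4]
weights = np.exp(scores) / np.exp(scores).sum()  # softmax: make the scores sum to 1
print(weights)                                   # approx [0.06, 0.47, 0.47], i.e. roughly [0.0, 0.5, 0.5]
```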
This is where the value vectors come in. We multiply each input’s value vector by its attention score to get a weighted value representation of that input.
Finally, we sum these weighted values to get the first output. The process is then repeated with the second and third queries to create the second and third outputs.
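Continuing the sketch, the last two steps look like this, and a single matrix multiply handles all three queries at once:

```python
# Weight each value vector by its attention score, then sum them up.
weighted_values = weights[:, None] * V    # one weighted value vector per input
output_1 = weighted_values.sum(axis=0)    # first output, roughly [1.9, 6.7, 1.6]

# The same thing for every query at once:
all_scores = Q @ K.T                                    # one score per (query, key) pair
all_weights = np.exp(all_scores)
all_weights /= all_weights.sum(axis=1, keepdims=True)   # softmax along each row
outputs = all_weights @ V                               # one output row per input
```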
In larger models, the process is exactly the same, just with longer input/output sequences, higher-dimensional query/key/value vectors, and so on.
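The nice thing is that the same handful of lines works at any size. Here’s the whole process wrapped into one function, as a minimal sketch rather than a production implementation:

```python
import numpy as np

def dot_product_self_attention(X, W_query, W_key, W_value):
    """Minimal dot-product self-attention over a sequence of input embeddings X."""
    Q, K, V = X @ W_query, X @ W_key, X @ W_value
    scores = Q @ K.T                                        # one score per (query, key) pair
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = scores / scores.sum(axis=1, keepdims=True)    # softmax along each row
    return weights @ V                                      # weighted sums of the value vectors
```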
Anyways, I hope that helps you understand what’s going on under the hood here!