Self-Attention Is Fundamentally Weighted Voting

Mr. For Example
12 min read · Sep 26, 2023


What’s in it for you?

If you search online for an explanation of how the Self-Attention Mechanism works, you will likely find something very abstract, to say the least, like the following:

  • The self-attention mechanism allows the inputs to interact with each other (“self”) and find out who they should pay more attention to (“attention”).
  • The self-attention mechanism enables the model to weigh the importance of different elements in an input sequence and dynamically adjust their influence on the output.
  • The self-attention mechanism is a deep learning mechanism that allows a model to focus on different parts of an input sequence by assigning weights to each part, determining their importance in making predictions, it computes a representation of the same sequence by relating different positions within the sequence.

“Pay more attention to”, “dynamically adjust their influence”, “focus on different parts” — all of those “explanations” read like vague metaphors.

So the question still remains: What exactly does the Self-Attention Mechanism do?

This article provides a different way to understand what the Self-Attention Mechanism truly does, at a highly concrete conceptual level, using its underlying connection with the Dynamic Routing Mechanism introduced in Capsule Neural Networks (CapsNets).

We will first go through how both mechanisms work, from the abstract idea to the mathematical formula. Then we will compare the two and draw the conclusion that Self-Attention and Dynamic Routing are fundamentally two different implementations of the same idea: using weighted voting to reach a global consensus, much like how voting works in a “normal” election.

Self-Attention

What was it designed for?

Originally, the Attention Mechanism was designed to help Recurrent Neural Networks (RNNs) handle longer sequences.

Source: Illustrated Guide to Transformers- Step by Step Explanation by Michael Phi

Consider translating a sentence from one language to another. Translating a sentence word-by-word is almost destined to lose contextual information; in order to translate effectively, the model needs to be selective and determine which words are most important in a specific context.

Source: Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch

In June 2017, the famous paper “Attention Is All You Need” dropped, introducing a standalone Self-Attention Mechanism and, to an extent, pushing RNNs into the history books of Machine Learning.

Intuitively, the Self-Attention Mechanism enhances the input representations by including information about the input’s context. It forms new representations by aggregating the input representations from the lower layer, kind of like a convolution, but with kernel weights that depend on the current inputs.

Source: Attention Is All You Need

How does Self-Attention work mathematically?

Attention Formula: Attention(Q, K, V) = softmax(Q · Kᵀ / √dₖ) · V

Q, K, V stand for QUERY, KEY & VALUE. Each matrix is calculated by multiplying the same input (the X matrix) with one of three different weight matrices (WQ, WK, WV), where each row of the X matrix is a representation vector, e.g. a word embedding.

Source: The Illustrated Transformer
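To make the projection step concrete, here is a minimal NumPy sketch of it. The dimensions are toy values and the weight matrices are random stand-ins for learned parameters:

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8        # 4 tokens, embedding size 8 (toy numbers)

# X: one row per input token, e.g. a word embedding
X = rng.normal(size=(seq_len, d_model))

# Three projection weight matrices (random here, learned in a real model)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Every row of X is projected into a query, a key and a value vector
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)       # (4, 8) (4, 8) (4, 8)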

The Softmax function converts a vector of numbers into a vector of probabilities, where the probability of each value is proportional to its relative scale within the vector.

The Softmax function is applied along the rows of the input matrix, converting each row into a probability distribution called the attention score, i.e. the elements of each row vector sum to 1.

In the last step, the attention scores are used as weights to sum up the row vectors of the V matrix, producing the Z matrix as the output representations.

Source: The Illustrated Transformer
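Putting the steps together, a minimal, self-contained NumPy sketch of the whole Self-Attention computation (again with toy dimensions and random weights standing in for learned ones) might look like this:

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Attention scores: one probability distribution per row (per query vector)
scores = softmax(Q @ K.T / np.sqrt(d_k))
assert np.allclose(scores.sum(axis=-1), 1.0)

# Output representations: each row of Z is a weighted sum of the rows of V
Z = scores @ V
print(scores.shape, Z.shape)           # (4, 4) (4, 8)

Each row of the scores matrix sums to 1, and each row of Z is the context-aware representation of the corresponding input row.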

Dynamic Routing

What’s the big idea?

In October 2017, just four months after the “Attention Is All You Need” paper was released, a paper titled “Dynamic Routing Between Capsules” introduced CapsNets along with the core mechanism its title suggests.

In essence, the key idea that paper proposed is: in an ML model, for a given hidden layer (say layer A), instead of relying on the neurons at layer A to learn a fixed weight matrix that produces the representations, we let the neurons in the previous layer individually predict the representations for layer A and route those predictions among each other to reach a consensus about layer A’s final representations.

Let’s see how Dynamic Routing works overall.

Source: Dynamic Routing Between Capsules

Above is the architecture of a simple CapsNet: i is the index of a capsule in the PrimaryCaps layer, and j is the index of a capsule in the DigitCaps layer, i.e. of a predicted representation in the next layer. Each capsule is represented as a feature vector of size 8, and the PrimaryCaps layer has 32 different types of capsules. The whole calculation process works as follows.

  1. Independent Predictions: for each capsule, we multiply it by a unique weight matrix Wij to predict 10 representations for the next layer, and initialize the weights for all predicted representations of all capsules to the same constant value.
    E.g. like individual voters coming up with their own opinion about an election.
  2. Trying to reach Consensus: for each of those 10 prediction types, we calculate the weighted mean vector of the predicted representations across all capsules.
    E.g. like an opinion poll.
  3. Update Predictions based on Consensus: for each capsule, we update the weights of its 10 predicted representations based on their similarity (dot product or Euclidean distance) with the weighted mean vector of the corresponding prediction type. Then we loop back to step 2.
    E.g. individual voters changing their opinion based on opinion poll results.
  4. Settle on a single Consensus: after a few iterations, we return the 10 weighted mean vectors as the predicted representations for the next layer.
    E.g. in the last step all the voters vote and the winner emerges.

What advantages does Dynamic Routing give us?

In order to understand the advantages it can give us, let’s first consider how the widely used Convolutional Neural Networks (CNNs) use convolution kernels together with parameter sharing (i.e. each depth slice of the input feature map uses the same kernel weights and bias) to process images.

Source: Convolutional Neural Network with TensorFlow implementation
Source: How does a convolutional layer convert a 64 channel input into a 3 channel or one channel?

From the above illustrations, it’s easy to tell that the convolution operation is a fixed-weight sum of the feature vectors within a fixed neighborhood.
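As a toy illustration of that point, the sketch below applies a single 1D convolution kernel as an explicit fixed-weight sum over a sliding neighborhood of feature vectors; the kernel values are random here, but the key property is that they do not depend on the input:

import numpy as np

rng = np.random.default_rng(0)

# A toy 1D feature map: 6 positions, each holding a feature vector of size 3
feature_map = rng.normal(size=(6, 3))

# One convolution kernel covering a fixed neighborhood of 3 positions;
# its weights are fixed after training and reused at every position
kernel = rng.normal(size=(3, 3))                 # (neighborhood, feature size)

outputs = []
for pos in range(feature_map.shape[0] - 2):
    window = feature_map[pos:pos + 3]            # fixed range of neighbors
    outputs.append(np.sum(window * kernel))      # fixed-weight sum -> 1 scalar
print(np.array(outputs).shape)                   # (4,)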

The idea of convolution works well, to the extent of even surpassing human visual ability in some tasks, but it has a big, well-known problem: when CNNs are fed images of different sizes, orientations and feature displacements, they fail, e.g. resize a face image, rotate it upside down, or offset the position of the eyes or nose.

The reason CNNs have this problem is that, in order for a CNN to detect an image whose object has undergone a certain transform, each convolution layer needs kernels that can detect the necessary features under that transform. If any convolution layer lacks the corresponding kernels, the transform information of the object is lost at that layer, which leads to incorrect predictions. Moreover, the inexhaustible array of transformations possible for each individual image makes the problem impossible to solve by merely scaling up the model.

The reason CapsNets don’t have this problem is that Dynamic Routing is like a convolution operation with parameter sharing, but with kernel weights determined dynamically by the input feature maps (plus, it doesn’t reduce the representation dimension by summing up the elements of the feature vector across the channel dimension), so there is no need to learn kernels that detect transformed features in any layer, except for a few feature-detection layers between the input layer and the first capsule layer.

How does Dynamic Routing work mathematically?

The following is the Dynamic Routing Algorithm, where

  • u^j|i is the predicted feature vector j given by capsule i, obtained by multiplying capsule i by its unique weight matrix Wij.
  • bij is the raw weight given to u^j|i.
  • cij is the prediction probability (normalized bij) given to u^j|i.
  • Sj is the weighted sum of u^j|i over all capsules in layer l.
  • Vj is the normalized (squashed) Sj, the final output that represents the consensus of u^j|i across all capsules in layer l.
Source: Dynamic Routing Between Capsules
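For reference, here is a minimal NumPy sketch of that routing loop. It assumes the prediction tensor u_hat (each capsule’s prediction u^j|i, already multiplied by its Wij) is given; the dimensions and iteration count are made-up toy values:

import numpy as np

def squash(s, eps=1e-8):
    # Squash from the CapsNet paper: keeps the direction of s,
    # maps its length into [0, 1)
    norm = np.linalg.norm(s, axis=-1, keepdims=True)
    return (norm ** 2 / (1 + norm ** 2)) * s / (norm + eps)

def dynamic_routing(u_hat, num_iterations=3):
    # u_hat[i, j]: capsule i's predicted feature vector for output capsule j
    num_caps_in, num_caps_out, _ = u_hat.shape
    b = np.zeros((num_caps_in, num_caps_out))            # raw weights b_ij
    for _ in range(num_iterations):
        # c_ij: softmax of b_ij over output capsules j, per input capsule i
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c /= c.sum(axis=1, keepdims=True)
        s = np.einsum('ij,ijd->jd', c, u_hat)            # weighted sums s_j
        v = squash(s)                                    # consensus vectors v_j
        b = b + np.einsum('ijd,jd->ij', u_hat, v)        # agreement update of b_ij
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(32, 10, 16))   # 32 input capsules, 10 outputs, dim 16
print(dynamic_routing(u_hat).shape)     # (10, 16)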

The following is an illustration of the Dynamic Routing process using Euclidean distance to measure similarity. Each point represents a feature vector predicted by a capsule, and the weight of each point is determined by the distance between its position and the weighted mean position, i.e. closer points get more weight and vice versa; the weighted mean position is the consensus of the predicted feature vectors among all capsules.

Source: Introduction to Capsule Networks (CapsNets)

Comparing the two

Step by step comparison

[1] Projection / Prediction phase

Both Self-Attention and Dynamic Routing take their input from an input matrix (i.e. the X matrix) multiplied by a projection weight matrix.
What’s different, though, is that for Dynamic Routing, if we consider each row vector of the input matrix to be a capsule and we only have one prediction type at the next layer, then each capsule is projected only once, by its own unique projection weight matrix, whereas Self-Attention uses three projection weight matrices to project the same input matrix into the Q, K and V matrices respectively.

But there is no necessity for this difference besides some potential flexibility and performance gains. Just like how we use the same kernel weights in CNNs to slide across all positions of an image, by applying the concept of parameter sharing, in Self-Attention we could use the same projection weight matrix for all three of Q, K and V, and in Dynamic Routing we could share the projection weight matrix between all capsules, so that in both cases every input vector is projected only once, by a shared projection weight matrix (see the sketch below).
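A sketch of that hypothetical weight-shared variant of Self-Attention (not how the original Transformer is defined, just the simplification described above, with toy dimensions and random weights):

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8

X = rng.normal(size=(seq_len, d_model))
W = rng.normal(size=(d_model, d_model))   # a single shared projection matrix

# Hypothetical weight-shared variant: the same projection P plays
# the role of Q, K and V at once
P = X @ W
logits = P @ P.T / np.sqrt(d_model)
scores = np.exp(logits - logits.max(axis=-1, keepdims=True))
scores /= scores.sum(axis=-1, keepdims=True)
Z = scores @ P
print(Z.shape)                            # (4, 8)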

Note: in the “Dynamic Routing Between Capsules” paper, no encoded positional information is fed into the CapsNet, unlike Transformers, which have positional encoding as part of their first-layer input. It’s reasonable to assume that giving each capsule its own projection weight matrix is what remedies this in CapsNets.

[2] Weighting phase

Both Self-Attention and Dynamic Routing weight the importance of each projected vector from the first phase using dot-product similarity, and then both normalize the weights.
There are two reconcilable differences:

  1. Self-Attention measures similarity between all pairs of projected vectors, while Dynamic Routing measures similarity between each projected vector and the mean vector of all projected vectors.
    If we think of each query vector in Self-Attention as a mean vector in Dynamic Routing, then the first difference is merely a quantitative gap. In the context of an election, it’s like individual voters changing their opinion based on other voters whose opinions are similar to their own, versus individual voters changing their opinion based on a single global opinion poll. Using the more precise metaphor of the K-Means algorithm, Dynamic Routing would have K equal to 1 whereas Self-Attention would have K equal to the number of samples; the former can only capture the mean of the samples, while the latter has enough capacity to model whatever distribution those samples may have.
  2. Self-Attention normalizes the weights by applying the Softmax function along each row of the Q · Kᵀ matrix (after dividing by a constant value), i.e. over all the weights calculated from the same query vector. Dynamic Routing normalizes the weights by applying the Softmax function over all the weights that belong to the predictions of the same capsule, and later applies the Squash function to the weighted sum vector of each prediction type. E.g. if capsules have 10 prediction types at the next layer, then each capsule is projected 10 times; after measuring the similarities between those 10 projected vectors and the corresponding mean vectors, Softmax is applied over those 10 weights, and in the next phase, after summing up the weighted vectors within each of the 10 prediction types, we squash the 10 weighted sum vectors, approximately by dividing each vector by its own magnitude (i.e. length).
    If we follow both calculations to the end, we find that both the row vectors of the Z matrix (i.e. the output matrix of the Attention function) in Self-Attention and the weighted mean vectors in Dynamic Routing end up with a magnitude of around 1, which is the goal of normalization in general, so this is simply a difference in implementation.

[3] Summarizing / Voting phase

Both Self-Attention and Dynamic Routing generate an output vector for a given prediction type or attention score by multiplying each projected vector with its corresponding weight and summing them all up into a single vector.

There are two marginal differences:

  1. Since Self-Attention has a list of attention scores, one for each query vector, while Dynamic Routing only has a single weight vector for a given prediction type, Self-Attention generates as many output vectors as there are input vectors, whereas Dynamic Routing produces only one weighted mean vector per prediction type.
    In Self-Attention every query vector has its own summarized vector in a certain row of the Z matrix, while in Dynamic Routing, for a given prediction type, all capsules agree on a single summarized vector as the final prediction. In the context of an election, this difference is like individual voters keeping their own opinion, versus all voters summarizing their opinions to reach a single consensus. The cause of this difference is fundamentally a design choice that depends on whether a layer needs to summarize all of its information. E.g. if a layer needs to predict the class of the input image at the next layer, it has no choice but to combine all its predictions one way or another to form the final output; on the other hand, if the model is doing image segmentation, or the layer sits somewhere in the middle of the model, then there are many good reasons to pass through as much useful information as possible to the next layer.
  2. Dynamic Routing can loop multiple times between updating the weights and summing up the weighted vectors, while Self-Attention performs both steps only once.
    Obviously, nothing in Dynamic Routing stops us from setting the iteration number to one; in fact, with just one iteration, each capsule can already determine which prediction type it should put most of its weight on. This is also figuratively true in the context of an election: with just one massive opinion poll, we can be almost certain who is most likely to be elected in the end.

Rewrite Multi-Head Self-Attention Algorithm using Dynamic Routing Algorithm

Finally, I wrote some pseudocode implementing the Dynamic Routing Algorithm, and the Multi-Head Self-Attention Algorithm in the style of the Dynamic Routing Algorithm. By comparing the two algorithms written in the same style, we can confirm that they are distinct implementations of the same concept.

Note: Multi-Head Attention, as the name suggests, is just a bunch of independent Self-Attention heads stacked together, with their outputs combined at the end (a sketch follows below).
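For reference, a minimal NumPy sketch of standard Multi-Head Attention under that reading; toy dimensions, random weights, and the final output projection W_O from the Transformer paper omitted for brevity:

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
d_k = d_model // num_heads

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

X = rng.normal(size=(seq_len, d_model))

heads = []
for _ in range(num_heads):
    # each head is an independent Self-Attention with its own projections
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)

# combine the heads by concatenating their outputs
Z = np.concatenate(heads, axis=-1)
print(Z.shape)                            # (4, 8)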

///
Pseudocode for the Dynamic Routing Algorithm

Annotations:
i: Index for capsule
j: Index for prediction
b_i_j: Raw weight capsule i gives to predicted feature j
c_i_j: Probability capsule i gives to predicted feature j
u_i_j: Predicted feature capsule i outputs for predicted feature j
s_j: Weighted sum of predicted feature j from all capsules
v_j: Final agreed feature j from all capsules
///

b_i_j = 0 for all i, j
c_i_j = 1 / J for all i, j
For K number of iterations
    For J number of predictions
        s_j = 0
        For I number of capsules
            s_j += c_i_j * u_i_j
        v_j = Squash(s_j)

    For J number of predictions
        For I number of capsules
            b_i_j += Dot(u_i_j, v_j)

    For I number of capsules
        c_i = Softmax(b_i)
///
Pseudocode for Multi-Head Attention implemented in the style of Dynamic Routing

Instead of calculating the agreement between each prediction and the mean,
we calculate the agreement between every pair of predictions,
so instead of having a single final consensus,
each capsule arrives at its own conclusion after taking the other capsules' opinions into consideration.

Annotations:
i0, i1: Index for capsule
j: Index for prediction
b_i0_i1_j: Raw weight capsule i0 gives to capsule i1's predicted feature j
c_i0_i1_j: Probability capsule i0 gives to capsule i1's predicted feature j
u_i0_j, u_i1_j: Predicted feature capsule i0 / i1 outputs for predicted feature j
s_i0_j: Capsule i0's weighted sum of predicted feature j from all capsules
///

For J number of predictions
    For I0 number of capsules
        For I1 number of capsules
            b_i0_i1_j = Dot(u_i0_j, u_i1_j)

For J number of predictions
    For I0 number of capsules
        c_i0_j = Softmax(b_i0_j)

For J number of predictions
    For I0 number of capsules
        s_i0_j = 0
        For I1 number of capsules
            s_i0_j += c_i0_i1_j * u_i1_j
