The Transformer (and Attention) Is a Fancy Addition Machine

Mechanistic interpretability is a relatively new subfield of AI that focuses on understanding how neural networks work by reverse engineering their internal mechanisms and representations, with the aim of translating them into human-understandable algorithms and concepts. This contrasts with traditional explainability techniques such as SHAP and LIME.

SHAP is short for SHapley Additive exPlanations. It computes the contribution of each feature to the model's prediction both locally and globally, i.e., for a single example and across a dataset. This allows SHAP to be used to determine overall feature importance for a use case. LIME, on the other hand, works on a single example-prediction pair: it perturbs the example's input and uses the perturbations and their outputs to approximate a simpler surrogate for the black-box model. Both techniques therefore work at the feature level and provide some explanation and heuristic to measure how each input to the model affects the prediction or output.
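To make the idea of a per-feature contribution concrete, here is a minimal sketch of exact Shapley values for a single example. It is not how the SHAP library is implemented (SHAP relies on approximations that scale to real models); the function `shapley_values`, the toy model, and the all-zero baseline are assumptions made purely for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one example by enumerating feature subsets.

    Features that are not in a subset are replaced by their baseline value.
    The cost grows exponentially with the number of features.
    """
    n = len(x)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for subsets of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values


# Hypothetical toy model: a weighted sum of three features.
def toy_model(features):
    return 2.0 * features[0] - 1.0 * features[1] + 0.5 * features[2]


example = [1.0, 3.0, 4.0]
baseline = [0.0, 0.0, 0.0]
local_contributions = shapley_values(toy_model, example, baseline)
print(local_contributions)
```

For a linear model like this one, each contribution reduces to the weight times the feature's difference from the baseline, and the contributions sum to the difference between the prediction and the baseline prediction. Averaging the absolute contributions over many examples is one common way to get a global importance score.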

Mechanistic interpretability, on the other hand, understands things at a more granular level. It can provide a pathway for how a feature is learned by different neurons in different layers of a neural network, and how that learning evolves across the layers. This makes it adept at tracing paths inside the network for a specific feature and at seeing how that feature affects the outcome.

SHAP and LIME, then, answer the question "Which features contribute most to the outcome?" Mechanistic interpretability, on the other hand, answers the questions "Which neurons activate for which feature, how does that feature evolve, and how does it affect the outcome of the network?"

This subfield works primarily with deeper models, such as transformers, since explainability is generally a problem of deeper networks. There are a few places where mechanistic interpretability looks at transformers differently from the traditional view. One of them is multi-head attention. As we will see, the difference lies in reframing the concatenation and projection operations defined in the paper "Attention Is All You Need" as a per-head projection followed by a sum.

But first, a summary of the transformer architecture.

Transformer Architecture

These are the sizes we work with:

batch_size b = 1
sequence length s = 20
vocab_size v = 50,000
hidden_dims d = 512
heads h = 8

This means that the number of dimensions in the Q, K, and V vectors of each head is l = d/h = 512/8 = 64. (In case you don't remember, here is an analogy for queries, keys, and values: for a token at a given position (K), we want to get an alignment (reweighting) to the positions it is relevant to (V), based on its context (Q).)

These are the steps leading up to the transformer's attention computation. (The tensor shapes use the example sizes above to make the steps easier to follow.)

Step	Operation	Input 1 dims (shape)	Input 2 dims (shape)	Output dims (shape)
1	n/a	b x s x v (1 x 20 x 50,000)	n/a	b x s x v (1 x 20 x 50,000)
2	Get embeddings	b x s x v (1 x 20 x 50,000)	v x d (50,000 x 512)	b x s x d (1 x 20 x 512)
3	Add positional embeddings	b x s x d (1 x 20 x 512)	n/a	b x s x d (1 x 20 x 512)
4	Copy embeddings to Q, K, V	b x s x d (1 x 20 x 512)	n/a	b x s x d (1 x 20 x 512)
5	Linear transformation for each head (h = 8)	b x s x d (1 x 20 x 512)	d x l (512 x 64)	b x h x s x l (1 x 1 x 20 x 64)
6	Scaled dot product (Q@K') in each head	b x h x s x l (1 x 1 x 20 x 64)	l x s x h x b (64 x 20 x 1 x 1)	b x h x s x s (1 x 1 x 20 x 20)
7	Scaled dot product (attention calculation) Q@K'V in each head	b x h x s x s (1 x 1 x 20 x 20)	b x h x s x l (1 x 1 x 20 x 64)	b x h x s x l (1 x 1 x 20 x 64)
8	Concat across all heads (h = 8)	b x h x s x l (1 x 1 x 20 x 64)	n/a	b x s x d (1 x 20 x 512)
9	Linear projection	b x s x d (1 x 20 x 512)	d x d (512 x 512)	b x s x d (1 x 20 x 512)

Table view of the shape transformations leading up to the transformer's attention computation

The table explained in detail:

Start with one input sentence of sequence length 20, one-hot encoded to represent the vocabulary words present in the sequence. Shape (b x s x v): (1 x 20 x 50,000)
Multiply this input by a learnable embedding matrix of shape (v x d) to get the embeddings. Shape (b x s x d): (1 x 20 x 512)
A learnable positional encoding matrix of the same shape is then added to the embeddings.
The resulting embeddings are copied into the matrices Q, K, and V. Each of them will later be split along the d dimension across the heads. Shape (b x s x d): (1 x 20 x 512)
The Q, K, and V matrices are each fed to a linear transformation layer, which multiplies them by the learnable weight matrices W_Q, Wₖ, and Wᵥ respectively, each of shape (d x l) (one copy of each for each of the h = 8 heads). Shape (b x h x s x l): (1 x 1 x 20 x 64), where h = 1 because this is the resulting shape for each head.
Next, compute attention with scaled dot-product attention, where Q and K (transposed) are multiplied first in each head. Shape (b x h x s x l) x (l x s x h x b) → (b x h x s x s): (1 x 1 x 20 x 20).
Next come the scaling and masking steps, which I skip here because they are not important for understanding the different ways of viewing MHA. So, next we multiply QK' by V in each head. Shape (b x h x s x s) x (b x h x s x l) → (b x h x s x l): (1 x 1 x 20 x 64)
Concat: here we concatenate the attention results of all heads along the l dimension to get back the shape (b x s x d) → (1 x 20 x 512).
This output is linearly projected once more using yet another learnable weight matrix Wₒ of shape (d x d). The final shape we end with is (b x s x d): (1 x 20 x 512)
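The nine steps above can be sketched in NumPy to check the shapes. This is only an illustration under stated assumptions: the weights are random, the vocabulary is shrunk to 1,000 so the one-hot input and embedding matrix stay small, a softmax is included alongside the scaling while masking is left out, and all eight heads are held in one tensor, so the head axis is 8 where the table shows the per-head shape with 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes from the article, except the vocabulary, which is shrunk (assumption)
# to keep the one-hot input and embedding matrix small.
b, s, v, d, h = 1, 20, 1000, 512, 8
l = d // h  # 64 dimensions per head


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


# Step 1: one-hot encoded input, shape (b, s, v)
token_ids = rng.integers(0, v, size=(b, s))
one_hot = np.eye(v)[token_ids]

# Step 2: embedding lookup as a matrix product, (b, s, v) @ (v, d) -> (b, s, d)
W_embed = rng.normal(size=(v, d)) * 0.02
x = one_hot @ W_embed

# Step 3: add a learnable positional embedding (broadcast over the batch)
W_pos = rng.normal(size=(s, d)) * 0.02
x = x + W_pos

# Steps 4-5: the same embeddings feed Q, K, and V; each head has its own
# (d, l) weight matrix, so the result has shape (b, h, s, l)
W_q = rng.normal(size=(h, d, l)) * 0.02
W_k = rng.normal(size=(h, d, l)) * 0.02
W_v = rng.normal(size=(h, d, l)) * 0.02
Q = np.einsum("bsd,hdl->bhsl", x, W_q)
K = np.einsum("bsd,hdl->bhsl", x, W_k)
V = np.einsum("bsd,hdl->bhsl", x, W_v)

# Step 6: Q @ K' in each head -> (b, h, s, s), scaled; masking is skipped
scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(l)
attn = softmax(scores, axis=-1)

# Step 7: (b, h, s, s) @ (b, h, s, l) -> (b, h, s, l)
heads = attn @ V

# Step 8: concatenate the heads along the l dimension -> (b, s, d)
concat = heads.transpose(0, 2, 1, 3).reshape(b, s, h * l)

# Step 9: final linear projection with W_o of shape (d, d) -> (b, s, d)
W_o = rng.normal(size=(d, d)) * 0.02
out = concat @ W_o

print(one_hot.shape, x.shape, Q.shape, scores.shape, heads.shape, out.shape)
```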

Rethinking Multi-Head Attention

Image by the author: Rethinking Multi-Head Attention

Now let's take a look at how the field of mechanistic interpretability views this, and why it is mathematically equivalent. On the right of the image above, you can see the module that rethinks multi-head attention.

Instead of concatenating the attention outputs, we proceed with the multiplication "inside" the heads themselves. The shape of Wₒ now becomes (l x d), and multiplying it with QK'V of shape (b x h x s x l) gives the shape (b x s x h x d): (1 x 20 x 1 x 512). Next, we sum over the h dimension to end up once again with the shape (b x s x d): (1 x 20 x 512).

Compared with the table above, only the last two steps change:

Step	Operation	Input 1 dims (shape)	Input 2 dims (shape)	Output dims (shape)
8	Matrix multiplication in each head (h = 8)	b x h x s x l (1 x 1 x 20 x 64)	l x d (64 x 512)	b x s x h x d (1 x 20 x 1 x 512)
9	Sum over heads (h dimension)	b x s x h x d (1 x 20 x 1 x 512)	n/a	b x s x d (1 x 20 x 512)
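The two replaced steps can be sketched in NumPy as follows. The attention outputs and weights are random stand-ins (an assumption), and all eight heads are kept in one tensor, so the head axis is 8 where the table shows the per-head shape with 1.

```python
import numpy as np

rng = np.random.default_rng(0)
b, h, s, l, d = 1, 8, 20, 64, 512

# Stand-in for the per-head attention outputs QK'V (random values, assumption)
heads = rng.normal(size=(b, h, s, l))

# One (l x d) output matrix per head instead of a single (d x d) matrix
W_o_per_head = rng.normal(size=(h, l, d))

# Step 8: matrix multiplication in each head -> (b, s, h, d)
per_head = np.einsum("bhsl,hld->bshd", heads, W_o_per_head)

# Step 9: sum over the head dimension -> (b, s, d)
out = per_head.sum(axis=2)

print(per_head.shape, out.shape)  # (1, 20, 8, 512) (1, 20, 512)
```

Keeping `per_head` around before the sum is what makes this form convenient for interpretability: each slice `per_head[:, :, i, :]` is the contribution that head `i` writes into the output space.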

Side note: This "summing over" is reminiscent of how summing over different channels happens in CNNs. In CNNs, each filter operates on the input, and then we sum the outputs across channels. The same happens here: each head can be seen as a channel, and the model learns a weight matrix that maps each head's contribution into the final output space.

But why is project + sum mathematically equivalent to concat + project? In short, the projection weights in the mechanistic perspective are just sliced versions of the weights in the traditional view (sliced along the d dimension and split to match each head).

Let's focus on the h and d dimensions before the multiplication with Wₒ. From the image above, each head now has a vector of size 64 that is multiplied by a weight matrix of shape (64 x 512). Let's denote the result by R and the head output by H.

To get $R_{1,1}$ from the first head, we have this equation:

$$R_{1,1} = H_{1,1} (W_O)_{1,1} + H_{1,2} (W_O)_{2,1} + \dots + H_{1,64} (W_O)_{64,1}$$

Now suppose we had instead concatenated the heads to get an attention output of size 512 and a weight matrix of shape (512 x 512). The equation would then have been:

$$R_{1,1} = H_{1,1} (W_O)_{1,1} + H_{1,2} (W_O)_{2,1} + \dots + H_{1,512} (W_O)_{512,1}$$

So the part $H_{1,65} (W_O)_{65,1} + \dots + H_{1,512} (W_O)_{512,1}$ would have been added. But this extra part is exactly what is present in each of the other heads, in modulo-64 fashion. Said another way, without concatenation, $(W_O)_{65,1}$ is the value sitting behind $(W_O)_{1,1}$ in the second head, $(W_O)_{129,1}$ is the value sitting behind $(W_O)_{1,1}$ in the third head, and so on. Hence, even without concatenation, the "sum over heads" operation results in the same values being added.
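The equivalence can also be checked numerically. The following is a minimal sketch with random stand-in values for the attention outputs and for Wₒ (assumptions, not trained weights); it compares concat + project against project + sum and confirms the modulo-64 correspondence between rows of the full matrix and rows of the per-head slices.

```python
import numpy as np

rng = np.random.default_rng(0)
b, h, s, l = 1, 8, 20, 64
d = h * l  # 512

heads = rng.normal(size=(b, h, s, l))  # stand-in per-head attention outputs
W_o = rng.normal(size=(d, d))          # traditional (d x d) projection

# Traditional view: concatenate along the l dimension, then project
concat = heads.transpose(0, 2, 1, 3).reshape(b, s, d)
out_concat = concat @ W_o

# Mechanistic view: slice W_o into h blocks of shape (l x d),
# project inside each head, then sum over the heads
W_o_sliced = W_o.reshape(h, l, d)
out_sum = np.einsum("bhsl,hld->bshd", heads, W_o_sliced).sum(axis=2)

print(np.allclose(out_concat, out_sum))

# Row 65 of W_o (index 64 when counting from zero) is row 1 of the second
# head's slice, and row 129 (index 128) is row 1 of the third head's slice
print(np.array_equal(W_o[64], W_o_sliced[1, 0]))
print(np.array_equal(W_o[128], W_o_sliced[2, 0]))
```

The two outputs agree up to floating-point rounding, because both compute the same sum of 512 products per output element and only group the terms differently.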

In conclusion, this insight lays the foundation for viewing the transformer as a purely additive model, in that all operations in a transformer take the initial embedding and add to it. This view opens up new possibilities, such as tracing features as they are learned through additions across the layers (called circuit tracing), which is what mechanistic interpretability is about, as I will show in the next article.
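As a rough illustration of the additive view, here is a toy sketch in which every block only adds its output to a running stream. The function `toy_block` is a hypothetical stand-in for an attention or MLP block, the weights are random, and layer normalization is left out, so this is a simplification rather than a real transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
s, d, n_blocks = 20, 512, 4


def toy_block(x, W):
    # Hypothetical stand-in for an attention or MLP block (assumption)
    return np.tanh(x @ W)


embedding = rng.normal(size=(s, d))
block_weights = [rng.normal(size=(d, d)) * 0.05 for _ in range(n_blocks)]

x = embedding
contributions = []
for W in block_weights:
    update = toy_block(x, W)
    contributions.append(update)
    x = x + update  # each block only adds to the running stream

# The final state is the initial embedding plus every block's contribution
reconstructed = embedding + sum(contributions)
print(np.allclose(x, reconstructed))
```

Because the final state decomposes into a sum of per-block terms, each term can be inspected on its own, which is the property that tracing relies on.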

With this, we have shown that the usual view of multi-head attention, which splits Q, K, and V to parallelize and optimize the attention computation, is mathematically equivalent to a very different, additive view. Read more about this in this blog here. The actual paper that introduces these points is here.
