Tandev89

Interview Question: Transformers

Can someone answer the question below? I was asked it in a data scientist (8 YOE) interview.

Why do large language models need a multi-headed attention layer, as opposed to having a single attention layer?

Follow-up question: during the training process, why do the different attention layers get tuned to have different weights?

steppenwolf

Many sentences have ambiguous meanings, and a single attention layer might not capture the true meaning of the sentence.
Multi-head attention addresses this by allowing each head to focus on different parts of the sentence. For example, 'I saw a man with binoculars.' can have two meanings. Multi-headed attention lets each head focus on one part of the sentence and one interpretation of it, and later, by combining that information, the model can decide which meaning better suits the context based on the different weights. That is why they are tuned to have different weights.
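To make that concrete, here is a rough numpy sketch (purely illustrative; the embeddings and weights are random, not from any trained model) of two heads applying different learned projections to the same sentence and then being combined:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # standard scaled dot-product self-attention for one head
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores)              # row i: how much token i attends to every token
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = ["I", "saw", "a", "man", "with", "binoculars"]
d_model, d_head = 8, 4
X = rng.normal(size=(len(tokens), d_model))    # stand-in embeddings, not real ones

# two heads, each with its own (randomly initialised, separately trained) Wq, Wk, Wv
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(2)]

outputs = []
for i, (Wq, Wk, Wv) in enumerate(heads):
    out, attn = attention(X, Wq, Wk, Wv)
    outputs.append(out)
    # because the projections differ, the attention pattern from "saw" (index 1)
    # comes out different in each head
    print(f"head {i}, attention from 'saw':", np.round(attn[1], 2))

# the heads are concatenated and mixed by an output projection, so the model can
# weigh both "interpretations" when building the next representation of the sentence
Wo = rng.normal(size=(2 * d_head, d_model))
combined = np.concatenate(outputs, axis=-1) @ Wo
print("combined shape:", combined.shape)       # (6, 8)
```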

Tandev89

I agree that a sentence can have multiple interpretations, but these are usually limited to 2 or 3, if not 1. Why do LLMs have 12 to 50 full self-attention layers?

When you say one attention head focuses on a part of the sentence, what do you mean by that? In a single full self-attention head, the weighted relation of a word with all other words of the sentence is already captured.

steppenwolf

How many layers to use has been determined empirically, so that the model can extract more complex and deeper understanding. One head here will focus on the relationship between 'man' and 'binoculars', another on the relationship between two other words, and so on.

CompleteKamikaze

Multi-head attention learns the various relationships between different words in different latent spaces; this way it captures better semantic and syntactic relationships.

Different heads learn different kinds of relationships from the inputs they receive, because each head learns its own weights, i.e. its own key, query and value projections.
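For what it's worth, the same structure is visible in PyTorch's nn.MultiheadAttention (shapes only; the sizes here are illustrative, a real LLM uses its own attention implementation, and the average_attn_weights argument needs a reasonably recent PyTorch):

```python
import torch

d_model, num_heads = 512, 8
mha = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads)

# one packed parameter holds the query, key and value projections for all heads
print(mha.in_proj_weight.shape)   # torch.Size([1536, 512]) -> 3 * d_model rows (Q, K, V)

# each head works in its own d_model // num_heads = 64-dimensional subspace: head h
# effectively uses its own 64 rows of the Q, K and V blocks, so every head starts from
# its own random initialisation and receives its own gradients during training
x = torch.randn(6, 1, d_model)    # (seq_len, batch, d_model): a 6-token "sentence"
out, attn_weights = mha(x, x, x, average_attn_weights=False)
print(out.shape)                  # torch.Size([6, 1, 512])
print(attn_weights.shape)         # torch.Size([1, 8, 6, 6]) -> one 6x6 pattern per head
```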

Tandev89

Thanks for your answer. Naive question: why do they learn different semantic relations? Why don't the different heads converge to the same weights?

Umadbro

Bro, ignore these kinds of questions. They're mostly asked by people who have no idea what they want the hire to do. It's been a trend recently where the interviewer wants to show they know stuff. But it's very unlikely to come up in anything you would do on the job. It doesn't even test ability, just knowledge.
