Tandev89

Interview Question: Transformers

Can someone answer the question below? I was asked it in a data scientist (8 YOE) interview.

Why do large language models need a multi-headed attention layer, as opposed to having a single attention layer?

Follow-up question: during the training process, why do the different attention layers get tuned to have different weights?

steppenwolf

Many sentences have ambiguous meanings, and a single attention layer might not capture the true meaning of the sentence.
Multi-head attention addresses this by allowing each head to focus on different parts of the sentence. For example, 'I saw a man with binoculars.' can have two meanings. Multi-headed attention lets each head focus on one part of the sentence and one interpretation of it, and later, by combining that information, the model can decide which meaning better suits the context based on the different weights. That is why they are tuned to have different weights.
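To make that concrete, here is a rough numpy sketch (purely illustrative; the embeddings and weights are random, not from any trained model) of two heads applying different learned projections to the same sentence and then being combined:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # standard scaled dot-product self-attention for one head
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores)              # row i: how much token i attends to every token
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = ["I", "saw", "a", "man", "with", "binoculars"]
d_model, d_head = 8, 4
X = rng.normal(size=(len(tokens), d_model))    # stand-in embeddings, not real ones

# two heads, each with its own (randomly initialised, separately trained) Wq, Wk, Wv
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(2)]

outputs = []
for i, (Wq, Wk, Wv) in enumerate(heads):
    out, attn = attention(X, Wq, Wk, Wv)
    outputs.append(out)
    # because the projections differ, the attention pattern from "saw" (index 1)
    # comes out different in each head
    print(f"head {i}, attention from 'saw':", np.round(attn[1], 2))

# the heads are concatenated and mixed by an output projection, so the model can
# weigh both "interpretations" when building the next representation of the sentence
Wo = rng.normal(size=(2 * d_head, d_model))
combined = np.concatenate(outputs, axis=-1) @ Wo
print("combined shape:", combined.shape)       # (6, 8)
```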

Tandev89

I agree that a sentence can have multiple interpretations, but these are usually limited to 2 or 3, if not 1. Why do LLMs have 12 to 50 full self-attention layers?

When you say one attention head focuses on a part of the sentence, what do you mean by that? In a single full self-attention head, the weighted relation of a word with all other words of the sentence is already captured.

steppenwolf

How many layers to use has been determined empirically, so that the model can extract more complex and deeper understanding. One head here will focus on the relationship between 'man' and 'binoculars', another on the relationship between two other words, and so on.

CompleteKamikaze

Multi-head attention learns the various relationships between different words in different latent spaces; this way it captures better semantic and syntactic relationships.

Different heads learn different kinds of relationships from the inputs they receive, because each head learns its own weights, i.e. its own key, query and value projections.
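For what it's worth, the same structure is visible in PyTorch's nn.MultiheadAttention (shapes only; the sizes here are illustrative, a real LLM uses its own attention implementation, and the average_attn_weights argument needs a reasonably recent PyTorch):

```python
import torch

d_model, num_heads = 512, 8
mha = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads)

# one packed parameter holds the query, key and value projections for all heads
print(mha.in_proj_weight.shape)   # torch.Size([1536, 512]) -> 3 * d_model rows (Q, K, V)

# each head works in its own d_model // num_heads = 64-dimensional subspace: head h
# effectively uses its own 64 rows of the Q, K and V blocks, so every head starts from
# its own random initialisation and receives its own gradients during training
x = torch.randn(6, 1, d_model)    # (seq_len, batch, d_model): a 6-token "sentence"
out, attn_weights = mha(x, x, x, average_attn_weights=False)
print(out.shape)                  # torch.Size([6, 1, 512])
print(attn_weights.shape)         # torch.Size([1, 8, 6, 6]) -> one 6x6 pattern per head
```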

Tandev89

Thanks for your answer. Naive question: why do they learn different semantic relations? Why don't the different heads converge to the same weights?

Umadbro

Bro, ignore these kinds of questions. They're mostly asked by people who have no idea what they want the hire to do. It's been a trend recently where the interviewer wants to show they know stuff. But it's very unlikely to come up in anything you would do on the job. It doesn't even test ability, just knowledge.
