Interview Question: Transformers
Can someone answer the question below? I was asked it in a data scientist (8 YOE) interview.
Why do large language models need a multi-headed attention layer as opposed to a single attention layer?
Follow-up question: during training, why do the different attention layers get tuned to have different weights?
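
For context on what the question is contrasting, here is a minimal NumPy sketch (my own toy example with made-up dimensions `d_model=8`, `n_heads=2`, not anything from the interview) of single-head vs. multi-head scaled dot-product attention:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2   # toy sizes, chosen for illustration only
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))

# Single-head: one Q/K/V projection triple, so one attention pattern per layer.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
single_out = attention(X @ Wq, X @ Wk, X @ Wv)            # shape (4, 8)

# Multi-head: each head gets its own smaller projections, so each head can
# form a different attention pattern; head outputs are concatenated and mixed.
heads = []
for h in range(n_heads):
    Wq_h, Wk_h, Wv_h = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ Wq_h, X @ Wk_h, X @ Wv_h))  # shape (4, 4) each
Wo = rng.normal(size=(d_model, d_model))
multi_out = np.concatenate(heads, axis=-1) @ Wo            # shape (4, 8)

print(single_out.shape, multi_out.shape)  # both (4, 8)
```

The sketch is only meant to show the mechanical difference: multi-head attention splits the model dimension into several smaller projections, so the layer produces `n_heads` independent attention maps at roughly the same cost as one full-width head.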