[4] Transformers - Attention Is All You Need

Title: Attention Is All You Need

Authors & Year: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017

Link: https://arxiv.org/abs/1706.03762

Objective: Develop a neural architecture for sequence modeling tasks that does not rely on recurrent or convolutional networks, with a focus on machine translation.

Context: Recurrent neural networks (RNNs) and convolutional neural networks (CNNs) were the state-of-the-art approaches to sequence modeling, but they suffered from issues like slow, sequential training and difficulty in capturing long-range dependencies.

Key Contributions:

  • Introduced the transformer architecture that uses self-attention mechanisms instead of RNNs or CNNs for sequence modeling.

  • Demonstrated the effectiveness of the model on English-to-German and English-to-French machine translation tasks.

Methodology:

  • The transformer consists of an encoder and a decoder, both composed of multi-head self-attention layers and feedforward layers.

  • The self-attention mechanism allows the model to attend to different parts of the input sequence for each output element, avoiding the need to compress the entire input into a single fixed-length vector.

  • Attention is computed with "scaled dot-product attention": query-key similarity scores are divided by the square root of the key dimension before the softmax, which keeps the softmax well-behaved for large dimensions (see the sketch below).
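
To make the methodology concrete, here is a minimal NumPy sketch of scaled dot-product attention and a single multi-head self-attention layer, the building block used in both the encoder and the decoder. The dimensions, weight matrices, and function names below are illustrative assumptions rather than code from the paper, and the sketch omits pieces the full model also uses (positional encodings, residual connections, layer normalization, and the position-wise feedforward sublayer).

```python
# Minimal sketch of (multi-head) scaled dot-product attention.
# All sizes (d_model = 8, num_heads = 2, seq_len = 5) are toy values chosen
# for illustration, not values from the paper.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)            # query-key similarities, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                                        # weighted sum of value vectors

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Project x into queries/keys/values, attend per head, then recombine."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v                       # each (seq_len, d_model)
    # Split the projections into heads: (num_heads, seq_len, d_head)
    split = lambda t: t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    heads = scaled_dot_product_attention(split(Q), split(K), split(V))
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                       # final output projection

# Toy usage: 5 tokens, model width 8, 2 attention heads.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (5, 8): one contextualized vector per input token
```

Because every output position attends to all input positions through a few matrix multiplications, the whole sequence is processed at once rather than token by token, which is what makes the architecture so much easier to parallelize than an RNN.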

Results:

  • The transformer achieved state-of-the-art performance on several benchmark datasets, including WMT'14 English-to-German and English-to-French machine translation tasks.

  • The transformer's parallelizable architecture allowed for faster training than RNNs or CNNs.

Impact:

  • The transformer introduced a new paradigm for sequence modeling that has since been adopted across a wide range of NLP applications.

  • Inspired further research in NLP, leading to innovations like BERT and GPT-2.

Takeaways:

  • The transformer is a neural architecture that uses self-attention mechanisms for sequence modeling tasks, eliminating the need for RNNs or CNNs.

  • The transformer is easier to parallelize than RNNs or CNNs, which leads to substantially shorter training times.

  • The transformer has significantly influenced the development of NLP models and the direction of NLP research.
