Deep Learning 101: What Is a Transformer and Why Should I Care?

What is a Transformer?

Transformers are a type of neural network architecture that does just what the name implies: it transforms data. Transformers were originally developed for machine translation (i.e. transforming text from one language to another), but they have since been generalized to a variety of natural language problems such as text-to-speech, speech recognition, and even coreference resolution. Transformers are now one of the go-to tools for any task that requires sequence transduction.

But what is it?

A Transformer combines a self-attention mechanism with simple feed-forward neural network layers, rather than the recurrent layers used in earlier sequence models. Transformers speed up the processing and transformation of data by using attention in place of recurrence, which lets them encode each item in a sequence along with its position. This is significantly faster and less computationally expensive than training with recurrence.
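
To make "attention in place of recurrence" concrete, here is a minimal sketch of scaled dot-product self-attention in plain NumPy. The function and variable names are ours for illustration, not from any particular library, and real Transformers also use multiple attention "heads" with separate learned projections.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) input embeddings; W_q/W_k/W_v: learned projection matrices."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how strongly each item attends to every other
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                               # each output is a weighted mix of all items

# Toy example: a 4-token sequence with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)        # (4, 8)
```

The key point is that every item attends to every other item in a single matrix multiplication, so the whole sequence can be processed in parallel rather than one step at a time.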

You might be wondering: if recurrence is what makes training slow, why bother with attention at all? Why not simply switch from a Recurrent Neural Network to a Convolutional Neural Network? The reason is that Convolutional Neural Networks don't necessarily solve the problem of maintaining and transforming the dependencies within a given piece of data: a convolution only sees a local window, while attention lets every item relate directly to every other item, no matter how far apart they are.

Take, for example, the illustration below. The Transformer is what enables us to encode the position and order of the words in the example sentence. By paying attention to each item in the sequence as it's transformed, we can maintain the important context and positioning of the words as they are translated.

Source: http://jalammar.github.io/illustrated-transformer/
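
Attention by itself is order-blind, so Transformers add a positional encoding to every word embedding before any attention is computed. Here is a short sketch of the sinusoidal encoding used in the original "Attention Is All You Need" paper; the function name and sizes are ours for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Position signal added to the word embeddings so the model knows word order."""
    positions = np.arange(seq_len)[:, None]                          # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                               # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                                 # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                            # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                            # odd dimensions use cosine
    return pe

# Each row is a unique "position signature" added to the embedding at that position.
print(positional_encoding(seq_len=6, d_model=8).round(2))
```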

The Transformer itself is made up of a stack of encoders and a stack of decoders, each of which contains a self-attention mechanism and a feed-forward neural network.

Source: http://jalammar.github.io/illustrated-transformer/

Inputs first feed through the self-attention mechanism, which lets the model learn the context of a specific item (usually a word) by looking at the items around it within a given sequence, and then through the encoder's feed-forward network. Decoders take the encoded sequence and apply a special Encoder-Decoder Attention layer in addition to their own self-attention and feed-forward layers.

Source: http://jalammar.github.io/illustrated-transformer/
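
If you'd like to poke at this encoder-decoder structure without building it from scratch, PyTorch ships a reference implementation. The sketch below wires up a tiny encoder-decoder stack and pushes random tensors through it just to show the data flow; the layer sizes are arbitrary, and in a real model the inputs would be embedded tokens plus positional encodings rather than random numbers.

```python
import torch
import torch.nn as nn

d_model = 64
# Two encoder layers and two decoder layers, each built around self-attention plus a
# feed-forward network; the decoder layers also include the encoder-decoder attention
# described above.
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       dim_feedforward=128, batch_first=True)

src = torch.randn(1, 10, d_model)   # "source" sequence: 10 embedded tokens
tgt = torch.randn(1, 7, d_model)    # "target" sequence fed to the decoder
out = model(src, tgt)               # encoders process src, decoders attend to the result
print(out.shape)                    # torch.Size([1, 7, 64])
```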

In future articles we’ll cover how exactly Encoders and Decoders work, the role of Tensors and Embeddings, as well as more in-depth specifics on Self-Attention. Until then, leave any questions you have in the comments!

Until then, check out some of our recommended reads on Transformers.

Natural Language Processing with Transformers
Transformers for Natural Language Processing: Build, train, and fine-tune deep neural network architectures for NLP with Python, PyTorch, TensorFlow, BERT, and GPT-3
Transformers for Machine Learning: A Deep Dive
Deep Learning: A Visual Approach


Want a custom deep dive into this or related topics, or need in-depth answers and walkthroughs on how to build and use Transformer models? Send us an email to inquire about corporate trainings and workshops!
