Quantum Transformers ???🤔

Sampath Kumaran Ganesan
7 min read · Sep 23, 2024


Quantum computing is poised to revolutionize the world with its enormous capabilities, especially in solving optimization problems. Its applications range from meteorology (more accurate weather forecasting) to precision medicine, portfolio optimization, drug discovery in the life sciences, and advances in metallurgy, among others.

We will explore Quantum Transformers in this blog post.

Before we dive deep into the topic, let's refresh the basics of quantum computing.

Let's start with the question, 'Why is there so much buzz in the tech community about quantum computing?' To understand that, we first need to see where traditional computers fall short, and on which kinds of problems.

In classical computing, information is represented by 'bits', each of which can be either '0' or '1', like a switch. Classical computers work using millions of tiny transistors, and every transistor acts like a switch: if current flows through the transistor it reads as '1', otherwise it reads as '0'. Classical processors also largely work serially, where one task has to be completed before the next can begin, and this becomes a limiting factor for large optimization problems.

Depiction of a classical computer's transistor

A quantum computer, on the other hand, works on the principles of quantum physics, which deals with subatomic particles such as electrons, photons, and ions. In quantum computing, we use qubits (quantum bits) to represent information. Unlike a classical bit, a qubit can be in the state '0', the state '1', or a combination of both at the same time; this phenomenon is called superposition. Another key property of quantum physics is entanglement: if two qubits are entangled, measuring one of them immediately tells us the value of the other, even if they are separated by light years. These properties let quantum systems work on many states in parallel rather than serially, so certain operations can be performed so quickly that even supercomputers do not come close.
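
The two ideas above can be demonstrated with a tiny circuit. Below is a minimal sketch (assuming Qiskit and its Aer simulator are installed) that puts one qubit into superposition with a Hadamard gate and entangles it with a second qubit using a CNOT, producing a Bell state; measurements should return only '00' or '11', never '01' or '10'.

```python
from qiskit import QuantumCircuit
from qiskit_aer import AerSimulator

# Build a 2-qubit Bell-state circuit.
qc = QuantumCircuit(2, 2)
qc.h(0)                      # put qubit 0 into superposition
qc.cx(0, 1)                  # entangle qubit 0 with qubit 1
qc.measure([0, 1], [0, 1])   # measure both qubits

# Run on a local simulator; the two qubits always agree.
counts = AerSimulator().run(qc, shots=1000).result().get_counts()
print(counts)                # e.g. {'00': ~500, '11': ~500}
```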

So the next question that comes to mind is, 'Why isn't quantum computing more prevalent?' The answer is that quantum chips need to be kept near absolute zero (0 K, about -273°C), which is literally colder than outer space. Today, most quantum chips are cooled to such temperatures with the help of liquid helium, because subatomic particles such as electrons are disturbed by even the slightest amount of heat. Also, quantum computers cannot outperform classical computers on every type of problem.

Below is an illustration of how quantum computers compare with classical computers when searching a large space.

The first figure shows classical computers: the search operation is serial, i.e., the machine explores the various paths one after another, which takes time.

Classical Computers in Search Operation

Quantum computers, on the other hand, can explore the paths in parallel, saving an enormous amount of time.

Quantum Computers in Search Operation
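
To make the contrast concrete, here is a small back-of-the-envelope script (illustrative only, assuming ideal error-free qubits) comparing the number of lookups a classical search needs with the number of oracle queries Grover's quantum search algorithm needs for an unstructured search over N items.

```python
import math

# Classical unstructured search needs on the order of N lookups (N/2 on average);
# Grover's algorithm needs roughly (pi/4) * sqrt(N) oracle queries.
for n_qubits in (10, 20, 30):
    N = 2 ** n_qubits
    classical = N // 2                                  # expected classical lookups
    grover = math.ceil((math.pi / 4) * math.sqrt(N))    # Grover iterations
    print(f"{n_qubits} qubits -> N = {N:,}: classical ~ {classical:,}, Grover ~ {grover:,}")
```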

Now let's refresh our memory of Transformers, the architecture that paved the way for the buzz in the technology industry around 'Generative AI', or 'Gen AI'. The Transformer was introduced by Google in 2017 in a paper called 'Attention Is All You Need'.

Depiction of Transformers (Courtesy: jalammar.github.io)

The figure above is a high-level depiction of the Transformer. We can see there are two modules: one is called the encoder and the other the decoder. In the actual paper, the encoder stack is composed of six encoder layers and the decoder stack of six decoder layers. From the figure, we can see that every encoder layer in the stack comprises a self-attention layer and a feed-forward layer. Each decoder layer looks similar, but it has an additional encoder-decoder attention layer between the self-attention and feed-forward layers.
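
As a rough sketch of that stack (not the paper's original code), PyTorch's built-in Transformer module can be configured with the paper's base settings of six encoder layers, six decoder layers, a model dimension of 512 and eight attention heads:

```python
import torch
import torch.nn as nn

# Six encoder layers and six decoder layers, as in the original paper's base model.
model = nn.Transformer(
    d_model=512,            # size of each token embedding
    nhead=8,                # number of attention heads
    num_encoder_layers=6,   # encoder stack depth
    num_decoder_layers=6,   # decoder stack depth
    dim_feedforward=2048,   # inner size of the feed-forward sublayers
    batch_first=True,
)

src = torch.randn(1, 10, 512)   # toy source sentence: 10 token embeddings
tgt = torch.randn(1, 7, 512)    # toy target sentence so far: 7 token embeddings
print(model(src, tgt).shape)    # torch.Size([1, 7, 512])
```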

So, what is self-attention? It is a technique for looking at the other words in a sentence while encoding a specific word, in order to work out how relevant each of them is to the word being encoded. Let's look at an example from a translation task. The input is 'I like India and I am proud of it'. Here 'it' refers to 'India', but that is not obvious to the machine. Self-attention helps the model associate 'it' with 'India'. For every word being encoded, self-attention allows the model to look at other positions in the input sequence for clues that lead to a better encoding of that word.
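
Under the hood this is the scaled dot-product attention from the paper, softmax(QK^T / sqrt(d_k)) V. Here is a minimal NumPy sketch; the 4-token, 8-dimensional inputs and random weight matrices are made up purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project tokens to queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # how much each token attends to every other
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # 4 tokens, embedding size 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
output, attn = self_attention(X, Wq, Wk, Wv)
print(attn.round(2))                           # 4x4 matrix of attention weights
```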

You might now ask how the encoder understands the sentence in the first place. The answer is that we use an embedding model to convert the text into vectors. The next question is how the order of the words in the input sentence is preserved, and the answer to that is positional encoding.

Depiction of Encoding (Courtesy: jalammar.github.io)

From the diagram above, we can see that the Transformer adds a positional-encoding vector to each word embedding. These vectors follow a specific pattern that helps the model determine the position of each word; in the original paper the pattern is built from sinusoids of different frequencies.
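
A minimal sketch of the sinusoidal positional encoding from the original paper, where PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                    # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2) dimension indices
    angles = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512); these vectors are added to the word embeddings
```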

Depiction of Decoding (Courtesy: jalammar.github.io)

Now let's move on to the decoder. The encoder processes the input sentence, and the output of the top encoder is transformed into a set of attention vectors K and V. These vectors are used by the 'encoder-decoder attention' module in each decoder layer to focus on the appropriate places in the input sentence. As you can see from the figure above, the same process repeats until an <EOS> token is produced and the output is complete.
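
The sketch below (shapes and weights invented for illustration) shows what makes encoder-decoder attention different from self-attention: the queries Q come from the decoder, while the keys K and values V come from the encoder output.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    Q = decoder_states @ Wq                     # queries come from the decoder
    K = encoder_states @ Wk                     # keys come from the encoder output
    V = encoder_states @ Wv                     # values come from the encoder output
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V         # each target token looks at the source

rng = np.random.default_rng(1)
enc = rng.normal(size=(10, 8))                  # 10 source tokens from the encoder
dec = rng.normal(size=(3, 8))                   # 3 target tokens generated so far
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(cross_attention(dec, enc, Wq, Wk, Wv).shape)   # (3, 8)
```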

Depiction of Decoding (Courtesy: jalammar.github.io)

As the diagram above shows, the output of each step is fed back to the bottom decoder for the next step, and the process continues. Just as we did with the encoder inputs, we embed the decoder inputs and add positional encoding to indicate the position of each word. In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence; this is done by masking future positions (setting their scores to -inf) before the softmax step in the self-attention calculation.
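
A minimal sketch of that causal (look-ahead) mask: scores above the diagonal are set to -inf, so after the softmax each position attends only to itself and to earlier positions. The random scores are just placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
rng = np.random.default_rng(2)
scores = rng.normal(size=(seq_len, seq_len))                 # raw self-attention scores
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)    # -inf strictly above the diagonal
weights = softmax(scores + mask, axis=-1)
print(weights.round(2))   # upper triangle is 0: no attention paid to future positions
```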

Depiction of Feed Forward Layers (Courtesy: jalammar.github.io)

Now let's focus on the final linear and softmax layers. As you might have gathered from the figure above, the output of the decoder stack is a vector of numbers, and we need to convert it back into words. The linear layer is simply a fully connected neural network that projects the decoder output into a much larger 'logits' vector. For example, if the total vocabulary that was learnt or specified contains 5,000 words, the logits vector has 5,000 cells, each corresponding to the score of a unique word. The softmax layer then turns those scores into probabilities (all positive, all adding up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
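
A minimal sketch of that final step with made-up sizes and a random projection matrix: project one decoder output vector onto a 5,000-word vocabulary, apply softmax, and greedily pick the highest-probability word.

```python
import numpy as np

d_model, vocab_size = 512, 5000
rng = np.random.default_rng(3)

decoder_output = rng.normal(size=(d_model,))        # decoder output for one time step
W_vocab = rng.normal(size=(d_model, vocab_size))    # linear layer projecting to logits

logits = decoder_output @ W_vocab                   # shape (5000,): one score per word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax: all positive, sums to 1.0

next_token_id = int(np.argmax(probs))               # greedy choice for this time step
print(next_token_id, float(probs[next_token_id]))
```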

Next, we will talk a bit about ‘Quantum Machine Learning’.

One might ask, 'Why do we need quantum machine learning when classical machine learning already performs so well?' Classical machine learning is indeed good, but consider that training GPT-3 has been estimated to require 355 years on a single GPU and to cost about $4.6 million. Since quantum computing can work with many states concurrently, it could reduce the time needed to build such high-end models. That said, the field is still in its infancy, and we first need quantum machines with high fault tolerance.

Quantum machine learning is the interplay of ideas from quantum computing and machine learning.

In the last couple of years, Transformer-based models such as the GPT series took the world by storm, and the rest is history. Quantum transformers are a variant of the Transformer architecture designed to run on quantum computers. It is still an evolving field, but a very promising one.

By utilizing the properties of qubits, quantum transformers could represent text in much higher-dimensional state spaces, facilitating a better understanding of context and meaning.
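
As a purely hypothetical illustration of that idea (not an actual quantum-transformer layer), the Qiskit sketch below angle-encodes a classical word embedding onto qubits: each feature becomes an RY rotation angle and CNOTs add entanglement, so n qubits index a 2^n-dimensional state space.

```python
import numpy as np
from qiskit import QuantumCircuit

def angle_encode(embedding):
    """Encode a classical feature vector into rotation angles, one qubit per feature."""
    n = len(embedding)
    qc = QuantumCircuit(n)
    for i, x in enumerate(embedding):
        qc.ry(float(x), i)          # feature value -> rotation angle
    for i in range(n - 1):
        qc.cx(i, i + 1)             # entangle neighbouring qubits
    return qc

toy_word_embedding = np.random.uniform(0, np.pi, size=4)   # a toy 4-dimensional "word vector"
print(angle_encode(toy_word_embedding).draw())
```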

Quantum encoding could also reduce the memory footprint required for storing large datasets, making it feasible to work with much more extensive data. Quantum models could adaptively adjust attention weights based on the quantum state of the input, allowing for more flexible and responsive NLP applications. Superposition would allow multiple grammatical structures to be evaluated simultaneously, leading to more accurate parsing, and quantum models could help resolve ambiguous language constructs by evaluating multiple interpretations concurrently.

At a high level, quantum transformers could help with the compute and storage bottlenecks we face at present.

The future looks promising for Quantum Machine Learning.

In upcoming blog posts, we will look deeper into quantum transformers.
