23 Aug Sepp Hochreiter on Parallels Between Attention Mechanisms a…
Transformer and BERT language models, powered by attention mechanisms, have pushed performance on NLP tasks to ever-higher levels. Esteemed German computer scientist and inventor of long short-term memory (LSTM) Sepp Hochreiter says his attempt to explain transformers’ attention mechanisms for a lecture produced the pithy statement “a word is most similar to itself and gets a high score.”
Hochreiter told Synced in an email that this statement reminded him of the resemblance between attention mechanisms and Hopfield networks, a form of recurrent artificial neural network (RNN) in which the stored pattern most similar to a query is retrieved.
Hochreiter and his team at the Institute for Machine Learning of the Johannes Kepler University Linz in Austria joined researchers from the University of Oslo in Norway to publish the new paper Hopfield Networks is All You Need, which argues that “attention mechanism is the update rule of a modern Hopfield network with continuous states.” Drawing on this finding, the team set out to modify transformer and BERT architectures to make them more efficient in learning and boost overall performance.
As associative memory systems with binary threshold nodes, Hopfield networks were first proposed in the 1970s and popularized by John Hopfield in 1982. Although Hopfield networks had been deemed outdated by most in the machine learning community, Hochreiter decided to investigate nonetheless to determine whether attention is indeed related to associative memories.
The storage capacity of the initial binary Hopfield networks was very limited. In 2016, Hopfield and other researchers began laying the foundation for modern Hopfield networks with higher storage capacity and extremely fast convergence.
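For readers unfamiliar with the classical model, the following is a minimal sketch of a binary Hopfield network, assuming Hebbian storage and one-unit-at-a-time sign updates. The function names and the two toy patterns are illustrative assumptions, not taken from the paper.

```python
# Classical binary Hopfield network: Hebbian storage, sign-threshold updates.
# All names and patterns here are illustrative, not from the paper.

def train_hebbian(patterns):
    """Build the weight matrix as a sum of outer products with a zero diagonal."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j] / n
    return weights

def recall(weights, state, max_sweeps=10):
    """Update units one at a time until a full sweep changes nothing."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            field = sum(w * s for w, s in zip(weights[i], state))
            new_value = 1 if field >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state

# Two orthogonal patterns of +1/-1 values.
stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
weights = train_hebbian(stored)

# Flip one bit of the first pattern and let the network clean it up.
noisy = list(stored[0])
noisy[0] = -1
retrieved = recall(weights, noisy)
print(retrieved == stored[0])
```

With only two orthogonal patterns the corrupted input is restored. As more patterns are added to a network of this size, they begin to interfere with one another, which is the capacity limit the modern variants address.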
Hochreiter says that while investigating the relationship between associative memories and attention mechanisms he noticed the new developments in modern Hopfield networks, especially how new energy functions had vastly improved their properties and performance. A 2017 paper on modern Hopfield networks, On A Model Of Associative Memory With Huge Storage Capacity, analyzed polynomial interaction functions and introduced an exponential interaction function to further increase the storage capacity of Hopfield models. This motivated Hochreiter to develop a new energy function that generalized the modern Hopfield networks from discrete states to continuous states while maintaining properties such as high storage capacity and fast convergence.
“That was fantastic,” says Hochreiter. “The update rule that I derived was proven to converge globally, but, more importantly, it is the transformer attention mechanism.”
Deriving the update rule and demonstrating that it converges globally was challenging, Hochreiter says. His team started with ideas based on gradient descent, but ran out of computing power when testing transformer and BERT models. Limited computational resources made it hard for them to test all their ideas connected to Hopfield networks.
The researchers therefore took the modern Hopfield networks with exponential interaction functions proposed in the earlier paper and generalized them to continuous patterns and states. The resulting Hopfield network converges in one update step with exponentially low error.
Most surprisingly, the new update rule is the same key-value attention softmax update used in transformers and BERT.
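The correspondence can be made concrete with a small sketch. Assuming the update takes the form "new query = stored patterns weighted by softmax(beta × similarity to the query)", one step looks like the code below. The patterns, the query and the value of beta are made-up illustrations. In a transformer, the keys and values would be learned projections of the stored patterns and beta would play the role of the 1/sqrt(d_k) scaling.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_update(stored_patterns, query, beta):
    """One update step: new_query = sum_i softmax(beta * <x_i, query>)_i * x_i."""
    scores = [beta * sum(a * b for a, b in zip(x, query)) for x in stored_patterns]
    weights = softmax(scores)
    dim = len(query)
    new_query = [
        sum(w * x[d] for w, x in zip(weights, stored_patterns)) for d in range(dim)
    ]
    return new_query, weights

# Hypothetical stored patterns (they act as both keys and values here).
stored_patterns = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
# A noisy version of the first pattern.
query = [0.9, 0.2, -0.1, 0.0]

retrieved, attention = hopfield_update(stored_patterns, query, beta=8.0)
print([round(v, 3) for v in retrieved])
print([round(a, 3) for a in attention])
```

After a single step nearly all of the attention weight sits on the first stored pattern, so the retrieved vector is close to it. This toy case illustrates the one-step convergence described above for well-separated patterns.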
In the companion paper, Modern Hopfield Networks and Attention for Immune Repertoire Classification, Hochreiter and his colleagues exploited the high storage capacity of modern Hopfield networks to solve immune repertoire classification, a challenging multiple instance learning problem in computational biology characterized by an unprecedentedly high number of instances per object. The researchers’ novel “DeepRC” method integrates modern Hopfield networks, or equivalently transformer-like attention, into deep learning architectures to perform immune repertoire classification.
The team demonstrated that DeepRC outperforms baseline methods such as logistic regression and k-nearest neighbours (KNN) on immune repertoire classification, achieving better predictive performance in large-scale experiments with simulated and real-world virus infection data. DeepRC also enables the extraction of sequence motifs that are connected to a given disease class.
The researchers believe accurate and interpretable machine learning methods for immune repertoire classification could pave the way toward new vaccines and therapies, a research field that has intensified due to the COVID-19 pandemic.
The team has also implemented the Hopfield layer in PyTorch, where it can be used as a plug-in replacement for existing pooling layers (max-pooling or average pooling), permutation equivariant layers, and attention layers.
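To illustrate why such a layer can stand in for a pooling layer, here is a minimal sketch of attention-style pooling over a bag of instances in plain Python. It is not the team's PyTorch API. The function name, the fixed query and the toy bag are assumptions made for illustration, and a trained layer would learn the query instead.

```python
import math

def attention_pool(instances, query, beta):
    """Pool a bag of vectors into one vector using softmax weights against a query."""
    scores = [beta * sum(q * x for q, x in zip(query, inst)) for inst in instances]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    return [sum(w * inst[d] for w, inst in zip(weights, instances)) for d in range(dim)]

# Hypothetical bag of three instances with two features each.
bag = [[1.0, 0.0], [0.0, 1.0], [3.0, 1.0]]
query = [1.0, 0.0]  # fixed here for illustration; a trained layer would learn it

mean_like = attention_pool(bag, query, beta=0.0)   # uniform weights: average pooling
peaked = attention_pool(bag, query, beta=20.0)     # weight concentrates on the best match
ordered = attention_pool(bag, query, beta=1.0)
shuffled = attention_pool(list(reversed(bag)), query, beta=1.0)
```

With beta set to zero the weights are uniform and the result is the average of the bag. With a large beta the result approaches the single best-matching instance, and reordering the bag leaves the output unchanged up to floating-point rounding. This is the sense in which one mechanism can cover both average-like and max-like pooling over sets.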
Reporter: Yuan Yuan | Editor: Michael Sarazen