Attention is All You Need ... to Drive

Since the early days of Nexar, our products and services have served one ultimate vision, Moving as One, in which every vehicle learns collaboratively and helps others to drive. We have built services and conducted research to get closer to it. From the start, we have known that AI and edge computing were the keystones of this vision, and we saw promising results with RNNs that could help us build holistic models. Alas, it was too early. Over the past five years, the world of AI has moved quickly, and transformer attention mechanisms have finally taken the world by storm. We believe Nexar is uniquely positioned, thanks to our existing network and our vast amounts of data, to apply transformers and fulfil our initial vision of Moving as One.

In 2005, Sebastian Thrun and his team at Stanford University won the DARPA Grand Challenge for autonomous driving. Two years later, Thrun joined Google, where he went on to found Google X and lead the self-driving project now known as Waymo. Since then, numerous companies have aimed to achieve full automation in driving, leading to a surge in new companies between 2015 and 2018 and an increase in advanced driver-assistance systems (ADAS) features in vehicles worldwide.

From the beginning, the field of autonomous vehicles (AVs) has been split between those who come from the robotics domain and those who come from the machine learning domain. The two groups differ in philosophy, and so far the “roboticians” have been more successful. The current mainstream approach for ADAS and AVs combines an AI-rich perception phase with rule-driven behaviour curation for planning. Within the sensing > perception > planning > actuation pipeline, planning remains a manual coding step, relying on complex hierarchical control plans and heuristics to capture the nuances of driving. A lack of sufficient volumes of driving examples has prevented the creation of fully AI-driven systems.

Nexar’s network collects a unique set of corner cases, and its processing capabilities allow AV companies to benchmark their perception stacks and simulate vehicle behaviour to evaluate the planning phase. Currently, only Nexar and Tesla possess sufficient data to train AV systems to learn from human behaviour, making this a crucial ingredient for future AV development.

The evolution of AI has confirmed a prediction we made at Nexar’s inception, when Google open-sourced TensorFlow: the architectures and frameworks of AI would become commoditised, and the key differentiator would be the ability to build the best datasets for training differentiated models. We recognised that data would become king, and that now holds true. When we examine the top AI companies pushing boundaries, we notice a common thread: they have access to the largest pools of human knowledge. The other essential component of success is computational resources, as it takes billions of dollars to collect, store, and analyse the data and to train and refine these models. Only a select few companies can afford such an investment. Nevertheless, the biggest barrier to entry for new AI models is data availability. Nexar has collected all types of driving scenarios, both normal and corner cases, and is in a unique position to enable true learning from the 11 trillion miles that humans drive each year, bringing our initial vision closer to reality.

The publication of Google’s research paper “Attention Is All You Need” marked a crucial milestone in the evolution of AI, though the authors may not have anticipated its impact at the time. Transformer attention models have given rise to Large Language Models (LLMs), which learn dependencies well beyond the reach of n-grams and across very long input sequences. This is possible because the dictionary of words, characters, or tokens in a language is limited, while the body of text that can be written in that language is effectively unbounded. Attention models are able to capture this combinatorial potential. LLMs are currently revolutionising web search and advertising and enabling unprecedented human interaction with knowledge. Though it may seem hyped, we have only scratched the surface of what is possible with LLMs.

Transformer attention mechanisms have already been applied to domains beyond language, such as images and audio, though most of this work has focused on translating between language and other modalities. The potential applications of these mechanisms need not be limited to this scope.

The scientific community has long debated whether human behaviour is a discrete or a continuous phenomenon. Although it is widely regarded as continuous, we believe the most plausible way to encode human behaviour is as a discrete one. For example, the pitches of a musical scale are finite, and although the combinations they create may seem infinite and continuous, the composition process is actually discrete. Similarly, driving, as a behaviour defined by humans, can be described through a finite set of tokens, terms, and words that can encode all possible driving behaviours, whether in “normal” or “corner case” scenarios. This encoding is akin to written and spoken language and can be used to capture and replicate all driving behaviours, from navigating a car through a lane to avoiding danger.

This allows us to propose a new hypothesis: just as transformer attention mechanisms can learn human behaviour in written language from a finite dictionary and a very large corpus, they can also learn and generate (“hallucinate”) the language of driving from a similarly finite dictionary and a very large corpus of driving data.
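
To make the mechanism behind this hypothesis concrete, here is a minimal sketch of causal self-attention over a sequence of driving tokens, written in plain Python. It is an illustration under stated assumptions, not a proposed model: the three-token vocabulary and the `EMBEDDINGS` table are hypothetical, and queries, keys, and values are all taken to be the raw embeddings, whereas a real transformer would apply learned projections and stack many such layers.

```python
import math

# Hypothetical toy vocabulary of three driving tokens with 2-d embeddings.
EMBEDDINGS = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position t attends only to positions <= t."""
    dim = len(queries[0])
    outputs = []
    for t, query in enumerate(queries):
        # Score the current position against itself and every earlier position.
        scores = [
            sum(q * k for q, k in zip(query, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # The output is the attention-weighted average of the visible values.
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ])
    return outputs


def attend(tokens):
    """Embed a token sequence and run one simplified attention layer over it."""
    x = [EMBEDDINGS[t] for t in tokens]
    # Simplification: no learned query/key/value projections.
    return causal_self_attention(x, x, x)
```

Because of the causal mask, changing a later token never alters the outputs at earlier positions, which is the property that lets such a model be trained to predict the next driving token from the ones before it.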

If the hypothesis holds, it also opens up the possibility of creating a shared model of driving behaviour that can be used across the industry. Such a model can be used to benchmark and evaluate different autonomous driving systems, enabling fair competition and driving innovation in the field. It also encodes, in a much more abstract and generalised form, the prior planning knowledge we already generate as 3D reconstructions of corner cases, and it generalises the concept of behavioural maps for dangerous situations, such as work zones. Moreover, it provides a driving score, which can eventually be used to filter out bad driving and rank the best driving behaviour. Finally, it can help bridge the gap between human-driven and autonomous vehicles by enabling autonomous vehicles to better anticipate and respond to the behaviour of human drivers, and by warning drivers of dangers ahead. In summary, proving this hypothesis could have far-reaching implications for the development and adoption of autonomous vehicle technology.
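
One way a sequence model could yield the driving score mentioned above is to measure how surprising a token sequence is under the model. The sketch below is only an illustration of that idea: it swaps in a count-based bigram model with add-one smoothing as a stand-in for a trained transformer, and the function names `fit_bigram` and `surprise` are hypothetical.

```python
import math
from collections import Counter


def fit_bigram(sequences, vocab_size):
    """Fit a bigram model with add-one smoothing; returns prob(prev, nxt)."""
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1

    def prob(prev, nxt):
        # Add-one smoothing keeps unseen transitions at a small non-zero probability.
        return (pair_counts[(prev, nxt)] + 1) / (prev_counts[prev] + vocab_size)

    return prob


def surprise(prob, seq):
    """Average negative log-likelihood per transition; lower means more typical."""
    pairs = list(zip(seq, seq[1:]))
    return -sum(math.log(prob(p, n)) for p, n in pairs) / len(pairs)
```

For example, after fitting on sequences that repeat the pattern 0, 1, 2, a new sequence that follows the pattern gets a lower surprise than one that breaks it. Whether low surprise actually corresponds to good driving, rather than merely common driving, is an open question that the experiment would need to address.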

To move from hypothesis to experiment and test whether it holds, we need the following:

1. Develop a finite dictionary of tokens that covers the language of driving, just as written language can be tokenised into characters, words, or tokens. We do not yet know what this dictionary must contain:
- observed behaviour alone, namely acceleration and steering, may be sufficient; we do not know how fine the discretisation needs to be, but we know that, for example, 8-bit representations are more than enough and that floating-point precision is unnecessary,
- visual context is likely necessary too (and can also be encoded as a discrete set), with the dictionary including annotations for the surrounding environment, drawn from our own perception or from that of third parties such as the vehicle’s ADAS, and in particular the behaviour of surrounding agents,
- the dictionary may be domain and location sensitive, or it may generalise, just as humans can learn outside of roads; either way, a discrete location vocabulary can be conceived independently of any base map or road network, for example as a discrete H3 hierarchy polyfilling the world (and perhaps extending beyond known roads).
2. Collect an extremely large dataset of human driving scenarios that we can encode with the above fixed dictionary. We have candidate datasets for this, from Waymo’s to MSI’s.
3. Train a transformer (self-attention) model on this corpus so that it encodes driving.
4. Test the performance of this model.

This is a formidable task, and if we can show that the hypothesis holds against the null hypothesis, the next step will require us to invest time in developing partnerships. The cost of training a global model of driving will be considerable, and the required resources gargantuan. There is also the opportunity to aggregate additional third-party datasets that can be encoded with this dictionary to make the training set even more comprehensive, perhaps also including 2- and 3-wheelers and pedestrians.
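
As a rough illustration of the first step, the sketch below quantises acceleration and steering samples into 8-bit tokens and interleaves them into a single sequence. Everything specific here is an assumption made for the example: the clipping ranges, the interleaved layout, and the 512-token vocabulary (256 acceleration bins followed by 256 steering bins) are hypothetical choices, not a proposed design.

```python
# Assumed clipping ranges, chosen only for illustration.
ACCEL_RANGE = (-8.0, 8.0)   # longitudinal acceleration in m/s^2
STEER_RANGE = (-1.0, 1.0)   # normalised steering, full left to full right
LEVELS = 256                # 8-bit resolution per signal


def quantise(value, low, high, levels=LEVELS):
    """Clip value to [low, high] and map it to an integer bin in [0, levels - 1]."""
    clipped = min(max(value, low), high)
    return min(int((clipped - low) / (high - low) * levels), levels - 1)


def dequantise(token, low, high, levels=LEVELS):
    """Map a bin index back to the centre of its bin."""
    return low + (token + 0.5) * (high - low) / levels


def encode_drive(samples):
    """Turn (acceleration, steering) samples into one interleaved token sequence.

    Steering tokens are offset by LEVELS so the two signals use disjoint
    ranges of a single 512-token vocabulary.
    """
    tokens = []
    for accel, steer in samples:
        tokens.append(quantise(accel, *ACCEL_RANGE))
        tokens.append(LEVELS + quantise(steer, *STEER_RANGE))
    return tokens
```

With these ranges, the worst-case round-trip error for acceleration is half a bin, about 0.03 m/s^2. Visual context and location would need their own token ranges, which this sketch leaves out.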

In summary, we hypothesise that human behaviour, as expressed in actions and decisions, can be encoded as a discrete set of tokens, just like language. If proven true, this hypothesis can have significant impacts, from enabling a step change in the development of autonomous vehicle technology to filtering out bad driving. To move forward with this initiative, we need to develop a finite dictionary, collect the world’s largest dataset of human driving scenarios, and train a transformer model on it. Given our data, our network, and our previous experience with similar tasks, we have a solid foundation to build upon. This task is critical, and we must act now to move it forward. Let us take on this challenge and push the boundaries of AI to create a safer, more efficient, and more enjoyable driving experience for everyone.