Project Dojo

Anyone who watched Tesla’s AI Day back on August 19, 2021 will recall the stunning statistics and architecture of its new chip and modular supercomputer, but no one but me has explained what it does for the layman (which I do below).

As a recap:

The D1 chip is designed and produced by Tesla in-house on a 7-nanometer manufacturing process, with 362 teraflops of processing power delivered by its 50 billion transistors. The D1 is a system on a chip, so there is no need for a motherboard or other integrated components.

D1 Chip

The D1 is equipped with 9TBps of off-chip bandwidth through connectors on each of its four edges (36TBps in total), allowing it to connect to and scale with other D1 chips without sacrificing speed.

9TBps training tile

Instead of a server, 25 D1 chips sit on a training tile. Like the D1 chip, the training tile is modular. Its power and cooling are delivered through the top of the tile, leaving its four edges free for connectors designed for a bandwidth of 36TB/s to connect to other tiles.

Training Tile

Each training tile is less than a cubic foot in size. There will be six tiles to a tray and two trays to a cabinet.

Training matrix

Ten cabinets will generate over an exaflop [a million teraflops] of compute.
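
As a rough sanity check, here is a back-of-the-envelope calculation (a sketch in Python using only the figures quoted above) showing how the chip count and compute add up from chip to tile to tray to cabinet to the full ten-cabinet system:

```python
# Back-of-the-envelope check of the Dojo numbers quoted above:
# 362 TFLOPS per D1 chip, 25 chips per training tile, 6 tiles per tray,
# 2 trays per cabinet, 10 cabinets in the full system.

TFLOPS_PER_D1 = 362          # teraflops per D1 chip
CHIPS_PER_TILE = 25
TILES_PER_TRAY = 6
TRAYS_PER_CABINET = 2
CABINETS = 10

chips = CHIPS_PER_TILE * TILES_PER_TRAY * TRAYS_PER_CABINET * CABINETS
total_tflops = chips * TFLOPS_PER_D1
total_exaflops = total_tflops / 1_000_000    # 1 exaflop = a million teraflops

print(f"D1 chips in ten cabinets: {chips}")                 # 3000
print(f"Total compute: {total_tflops:,} TFLOPS")            # 1,086,000 TFLOPS
print(f"               = {total_exaflops:.2f} exaflops")    # about 1.09 exaflops
```

Three thousand D1 chips at 362 teraflops each come to roughly 1.09 exaflops, which lines up with the "over an exaflop" figure.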

At AI Day 2021 Tesla admitted it had not solved the tile-to-tile interconnect and software; each tile has more external bandwidth than the highest-end networking switches. Tesla now has Dojo up and running and has foreshadowed that its performance will be discussed at Tesla’s AI Day II on September 30.

Yes, but what does it do?

Tesla provided this slide at the Hot Chips 34 (HC34) conference (August 21-23, 2022).

Compute platforms

The blue Learning Computer is the deep-learning trainer: Project Dojo itself. The green Learning Computer is Tesla’s onboard HW3, that is, the computer in each production vehicle that controls the car through Full Self-Driving.

To use a basic illustrative example, FSD needs to know what a traffic cone looks like. Dojo receives “Input Data” from the Tesla fleet (telemetry showing videos of traffic cones under a myriad of circumstances), and the cone is labelled by human operators (and auto-labelling), represented by the green “Output Data”. The trained logic is then stored onboard the vehicles as a sort of program, represented by the blue “Trained Logic”, which the onboard HW3 executes to generate the green “Useful Outputs”: commands to the car’s actuators to navigate the car.
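
To make that division of labour concrete, here is a minimal, purely illustrative sketch (a toy numpy classifier, not Tesla's actual stack): the heavy, slow training happens offline, and what gets shipped to the car is only the compact trained logic, which the onboard computer evaluates cheaply in real time.

```python
import numpy as np

# Toy illustration of the training/inference split described above.
# "Dojo" does the heavy offline training; the car gets only the compact
# trained logic (here, a small weight vector) to run in real time.

def train_offline(features, labels, lr=0.1, epochs=500):
    """Dojo's job: crunch large amounts of labelled fleet data offline."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=features.shape[1])
    b = 0.0
    for _ in range(epochs):
        logits = features @ w + b
        preds = 1 / (1 + np.exp(-logits))          # sigmoid
        grad = preds - labels                      # logistic-loss gradient
        w -= lr * features.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b                                    # the "trained logic"

def onboard_inference(trained_logic, frame_features):
    """The onboard computer's job: a fast forward pass on live camera features."""
    w, b = trained_logic
    return 1 / (1 + np.exp(-(frame_features @ w + b))) > 0.5   # cone or not?

# Hypothetical usage: 1000 labelled examples with 8 features each.
X = np.random.default_rng(1).normal(size=(1000, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(float)          # stand-in for human labels
logic = train_offline(X, y)
print(onboard_inference(logic, X[:5]))
```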

Tesla has so far been using human labellers, together with crude auto-labelling functions. However, the task is mind-bogglingly complex, because Tesla is doing 4D labelling, that is, labelling in both space and time.

The cone is viewed by the car from a distance; its look and shape then change as the car speeds past it and it appears in the rear-vision mirror. Add the complexity of other cars moving in different directions at different speeds, the play of light, and the other factors that make the colour and appearance of the same object differ, and the task of labelling becomes consciously incomprehensible to a human.
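
To illustrate why 4D labelling is so much harder than drawing a box on a single image, here is a hypothetical sketch of what a space-and-time label might look like: the label attaches to a whole object track through 3D space over time, and must stay consistent even as the object's apparent shape and colour change. The schema, field names and numbers are assumptions for illustration, not Tesla's format.

```python
from dataclasses import dataclass, field

# A hypothetical 4D-label schema: one label is not a box in one image,
# it is a track of an object through 3D space over time.

@dataclass
class Observation:
    t: float                            # timestamp in seconds
    xyz: tuple[float, float, float]     # position in a fixed world frame (ego-motion compensated), metres
    yaw: float                          # heading in radians

@dataclass
class ObjectTrack:
    object_id: int
    label: str                                        # e.g. "traffic_cone", "pelican"
    observations: list[Observation] = field(default_factory=list)

    def is_static(self, tol: float = 0.1) -> bool:
        """A cone should not move over the track; a pelican very likely will."""
        xs = [o.xyz for o in self.observations]
        return all(
            abs(a - b) < tol
            for p, q in zip(xs, xs[1:])
            for a, b in zip(p, q)
        )

# The car closes in on the cone, but in the world frame the cone has not moved.
cone = ObjectTrack(1, "traffic_cone", [
    Observation(0.0, (40.0, 1.2, 0.0), 0.0),
    Observation(0.5, (40.0, 1.2, 0.0), 0.0),
])
print(cone.is_static())   # True
```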

Of course, just because we cannot consciously comprehend what is going on doesn’t mean our brains are not processing this data subconsciously. After all, Elon Musk has repeatedly pointed out that in adopting a vision-only approach (dropping Lidar) he is merely emulating a system for navigating vehicles that has worked very well: two gimballed cameras (our eyes) on a larger gimbal (our neck).

When we perceive objects while driving to work, we are constantly judging their distance and speed from various cues we see with our eyes.

Motion parallax contributes to our sense of self-motion when we move our head (our binocular cameras) back and forth. Objects at different distances move at slightly different speeds: closer objects move in the opposite direction to our head movement, while faraway objects move with it.

Interposition is another cue. When objects overlap each other, we get a monocular cue about which one is closer.

Aerial perspective offers colour and contrast clues as to how far away an object might be. As light travels it scatters, causing blurriness, which our brain interprets as distance.

Our binocular vision allows our brain to build a 3D image based on the disparities between what each eye sees in a process similar to trigonometry. 
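
As a small worked example of that binocular-disparity idea, here is the standard pinhole-stereo relation; the baseline and focal length below are illustrative assumptions of mine, not Tesla camera parameters.

```python
# Depth from binocular disparity: with two cameras a known baseline apart,
# the shift (disparity) of the same object between the two images pins
# down its distance via Z = f * B / d.

def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Classic pinhole-stereo relation: depth = focal length * baseline / disparity."""
    if disparity_px <= 0:
        return float("inf")        # zero disparity: effectively at infinity
    return focal_px * baseline_m / disparity_px

# Human-eye-ish assumptions: ~6.5 cm baseline, focal length ~1400 px.
for d in (100, 20, 5, 1):
    print(f"disparity {d:4d} px  ->  depth {depth_from_disparity(1400, 0.065, d):6.1f} m")
```

Note how quickly the usable range runs out: at one pixel of disparity the estimate is already out near 90 metres, which is why the other cues above matter so much for distant objects.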

To rival and exceed human capability by an order of magnitude, which is Elon Musk’s stated target for full self-driving cars (with no driver supervision), the onboard computer needs to be able to do all of this and more. More, because it is processing not just two front cameras but also rear and side cameras, and needs to meld the whole into one unified 4D model.

So it is that the task exceeds the capabilities of human labellers. Instead, Project Dojo is designed to train itself. It does this by making predictions offline (not in real time) and then running its videos forward to compare its predictions to the truth. If Dojo sees what in real life is a pelican and thinks it is a cone, the object may for a fraction of a second look like a cone, but Dojo can fast-forward, see it from other angles and determine it is definitely a pelican. Not only that, it can compare the video with millions of other cone and pelican videos and learn the difference. In the end, after billions of computations, it produces a neat, compact piece of trained logic that it hands to the onboard HW3 in the car. When the onboard computer sees an object that for a split second looks like a cone, it is not fooled and does not have to wait to see the other angles of the object as it passes to make a hyper-accurate prediction.
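
Here is a simplified sketch of that "run the video forward and check" idea (a crude hindsight-consensus scheme of my own devising, not Tesla's auto-labelling pipeline), showing how per-frame guesses early in a clip can be corrected by the label the whole clip agrees on:

```python
from collections import Counter

# Hindsight labelling, toy version: per-frame guesses early in a clip may be
# wrong, but the consensus over the whole clip gives a far better label,
# which can then be used to supervise (re-train) the early-frame predictions.

def hindsight_label(per_frame_guesses: list[str]) -> str:
    """Label the whole track with the class most often seen across the clip."""
    return Counter(per_frame_guesses).most_common(1)[0][0]

def frames_to_relabel(per_frame_guesses: list[str]) -> list[int]:
    """Frames whose real-time guess disagrees with the hindsight label."""
    truth = hindsight_label(per_frame_guesses)
    return [i for i, g in enumerate(per_frame_guesses) if g != truth]

clip = ["cone", "cone", "pelican", "pelican", "pelican", "pelican", "pelican"]
print(hindsight_label(clip))        # 'pelican'
print(frames_to_relabel(clip))      # [0, 1]: the early frames that were fooled
```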

Nor is it just object recognition; there is also trajectory recognition and behaviour recognition. Where will things go? Is a car going to stop, based on its current deceleration? Is another driver going to attempt an illegal U-turn? Does their failure to indicate, combined with erratic lane-keeping, indicate a medical emergency? What will likely happen when they hit a pothole?
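
To give a flavour of how simple some of these behaviour checks can be in principle, here is a toy version of the "is the car ahead going to stop?" question using constant-deceleration kinematics; the numbers are illustrative assumptions only.

```python
# Will the lead car stop before the intersection, given its current speed
# and deceleration?  Constant-deceleration stopping distance: v^2 / (2a).

def will_it_stop(speed_mps: float, decel_mps2: float, distance_m: float) -> bool:
    """True if the stopping distance fits inside the remaining distance."""
    if decel_mps2 <= 0:
        return False                       # not braking (or accelerating)
    stopping_distance = speed_mps ** 2 / (2 * decel_mps2)
    return stopping_distance <= distance_m

# 15 m/s (~54 km/h), braking at 3 m/s^2, stop line 40 m ahead.
print(will_it_stop(15.0, 3.0, 40.0))   # True: needs 37.5 m
# Same speed, but only gently braking at 2 m/s^2.
print(will_it_stop(15.0, 2.0, 40.0))   # False: needs 56.25 m
```

The real problem, of course, is doing this for every agent in the scene at once, under uncertainty, which is exactly the kind of prediction the trained network has to internalise.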

So, in summary, the job of the Dojo computer is like that of a policeman reviewing security-camera footage to work out what happened, and where and why the inferences drawn by the onboard computer were wrong. The onboard HW3 computer is on the front line, living it and processing it in real time, without the benefit of hindsight. The job of Dojo is to give the onboard computer a preternatural ability to make inferences from fleeting context clues that a human driver could not even begin to process (like analysing the behaviour of cars behind, in front and on either side simultaneously), to deliver safety performance an order of magnitude better than a human, and then of course beyond.

