We could talk more about measuring meaningful metrics by Paul Newman CTO & Founder (Oxbotica) | Oxbotica

7th February 2020

A constant theme in my interactions with those interested (vested or otherwise) in the progress of Autonomous Vehicles (AV) is the “intervention / revoke rate question”. There’s been quite a lot of conversation around this for a while now. That’s good.

I’m sympathetic to the question because, at first look, it seems to relate to a defining feature of AVs. After all, in the end, the number of times the vehicle needs help is going to be a pretty good indicator of how good your AV is. So it’s not an obviously bad question: it’s certainly part of the ecosystem and answers are needed. It is also not without safety benefits: if a release of AV software requires safety interventions because of buckaroo lane changes, then that’s something to be aware of. But I’m going to leave the “what is an intervention?” topic for another post.

Good makers, developers and designers of AV systems have always known that seeking a single number by which to compare has shortcomings. But this post is not going to be a treatise on what’s lacking. I think it’s more positive to share a perspective on the metrics we track and utilise at Oxbotica, to get some discussion going. I’ve been meaning to do it for a while: at a recent talk at The Information’s AV Summit, an audience member challenged me to actually say what we use to measure progress. I listed a good few metrics that we use and was surprised to hear that they appreciated the answer. The thing is, it wasn’t detailed; it was simply five or six module-centric and system-level numbers that spoke to reasonable functional requirements. Maybe it was my accent. Anyway, let’s break it down and look under the hood.

Now, if only AVs had access to an oracle — the future would be clear and plans easy to make. Brutally, they don’t.

Instead, AV stacks see complex, super-low-level data streams, and from these they must construct a world model that explains where stuff is, what stuff is, and how that stuff is going to move over the coming seconds. Tough stuff about stuff. Building such a thing, one that can interpret the scene, predict, and then create and execute a plan, requires careful measurement. Here are some of the motivations for, and types of, metrics we use around internal components, both in the arms of simulation and in real-life settings.

The full list of things we use to measure ourselves would make this a madly long article. But giving a few examples makes the point that much can be learnt, and much good behaviour can be driven, by interesting metrics. The unsolved mystery is how common, somewhat intimate, metrics could be used across AV companies to encourage best practice when architectures and creeds vary. And then, with them established, their rates of change tell you a great deal.

So, consider perception: knowing what kind of stuff, including hazards, surrounds the vehicle. We care about when we can identify things as well as how well. We often talk of average recall and average precision. But what we really care about is whether an AV saw that car, pedestrian or canine friend when it was small, far away and in the rain, while we still had time to decide what to do about it. So we need to measure perception performance as a function of distance, weather, size and, of course, thing-type.
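To make the “as a function of” idea concrete, here is a minimal sketch of recall broken down by weather and distance bin rather than averaged into one number. It is an illustration only, not our internal tooling: the function name `recall_by_bucket`, the `(distance_m, weather, detected)` tuple layout and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def recall_by_bucket(labels, bin_edges_m):
    """Recall per (weather, distance-bin) bucket.

    labels: iterable of (distance_m, weather, detected) tuples, one per
        ground-truth object; detected is True if the stack found it.
        (Hypothetical layout, assumed for this sketch.)
    bin_edges_m: ascending distance bin edges in metres, e.g. [0, 50, 100].
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for distance_m, weather, detected in labels:
        # Find the half-open bin [lo, hi) that contains this distance.
        for lo, hi in zip(bin_edges_m, bin_edges_m[1:]):
            if lo <= distance_m < hi:
                key = (weather, lo, hi)
                totals[key] += 1
                hits[key] += int(detected)
                break
    return {key: hits[key] / totals[key] for key in totals}


# Made-up ground truth, purely for illustration.
ground_truth = [
    (10.0, "dry", True),
    (15.0, "dry", True),
    (60.0, "dry", True),
    (70.0, "dry", False),
    (12.0, "rain", True),
    (65.0, "rain", False),
    (80.0, "rain", False),
]

recall = recall_by_bucket(ground_truth, [0, 50, 100])
for key in sorted(recall):
    print(key, recall[key])
```

On this toy data the overall recall is 4 out of 7, which hides the fact that recall beyond 50 m in the rain is zero. The same grouping extends naturally to object size and thing-type by adding fields to the bucket key.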

Then, consider compute. We care about power a great deal: these stacks are likely to run on EVs and every joule counts. We care deeply about high-water marks on different compute platforms. The last thing we need is to run out of compute cycles in a complex scene precisely because it is complex, which is exactly when we need the best performance.
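A high-water mark can be as simple as the worst per-frame processing time seen on a platform, compared against the time budget for a frame. The sketch below assumes exactly that definition; the names `high_water_mark` and `headroom`, the 100 ms budget and the timings are hypothetical and chosen only to illustrate the idea.

```python
def high_water_mark(frame_times_ms):
    """Worst (largest) per-frame processing time observed, in milliseconds."""
    return max(frame_times_ms)


def headroom(frame_times_ms, budget_ms):
    """Fraction of the per-frame budget left at the worst frame.

    A negative value means at least one frame overran its budget.
    """
    return 1.0 - high_water_mark(frame_times_ms) / budget_ms


# Made-up timings for one run on one platform; the third frame is the busy scene.
frame_times_ms = [40.0, 55.0, 75.0, 62.0]
budget_ms = 100.0  # assumed budget, not a real figure

print(high_water_mark(frame_times_ms))
print(headroom(frame_times_ms, budget_ms))
```

Tracking the worst case rather than the mean is the point here: the average frame above leaves plenty of slack, but it is the complex scene that sets how much margin really exists.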

For control we care about jerks, including the rate of change of acceleration. Make this too high in any manoeuvre and your passengers get tense. Too low and it gives the impression of cowardice — the point here is that subjective impressions can be made objective if you get the right metrics.
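Jerk is straightforward to compute from logged acceleration. The following is a minimal sketch using a first-order finite difference over uniformly sampled data; the function names, the sample trace and the comfort-band limits are assumptions for illustration, not values we use.

```python
def jerk_profile(accel_mps2, dt_s):
    """Finite-difference jerk (m/s^3) from uniformly sampled acceleration (m/s^2)."""
    return [(a1 - a0) / dt_s for a0, a1 in zip(accel_mps2, accel_mps2[1:])]


def peak_abs_jerk(accel_mps2, dt_s):
    """Largest jerk magnitude over the manoeuvre."""
    return max(abs(j) for j in jerk_profile(accel_mps2, dt_s))


def in_comfort_band(peak_jerk, low, high):
    """True if the peak jerk is neither too timid (below low) nor too harsh (above high)."""
    return low <= peak_jerk <= high


# Made-up acceleration trace sampled every 0.5 s.
accel = [0.0, 0.5, 1.0, 1.0, 0.0]
dt = 0.5

jerk = jerk_profile(accel, dt)
peak = peak_abs_jerk(accel, dt)
# Hypothetical band limits in m/s^3, chosen only to exercise the check.
comfortable = in_comfort_band(peak, low=0.5, high=2.5)
print(jerk, peak, comfortable)
```

In practice a raw finite difference amplifies sensor noise, so a real pipeline would likely filter the acceleration signal first; that step is left out here to keep the sketch short.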

We care about positioning on a road, in a tunnel or on a track as a function of both weather and visibility. Poor weather can have a profound impact on raw sensor data. Knowing how well you know where you are on average is important, sure, but knowing whether that changes in biblical rain or blinding, burning sun is much better.

We care about development metrics profoundly. Halving your build time is like growing the company by 30% overnight. “Zero to deploy”, going from a barebones computer to an installed AV stack in 10 minutes, makes iterations faster and feedback more acute. Personally, my favourite is the number of lines of code removed per month. It’s clearly valuable but often missed. As B. Gates famously said, “Measuring programming progress by lines of code is like measuring aircraft building progress by weight.”

Then put this all together. What we really care about, ultimately, is how good we are at knowing where we are and what’s around us, and then at predicting the future. Prediction is so much harder than perception. It is the monster problem. Accordingly, we have heaps of metrics for measuring progress towards its taming. How good is that oracle getting in SI units?

None of this appears in single high-level numbers around interventions. Yet to us, maintaining the rate at which the metrics above (and many others) change is our prime directive. Via them, and only via them, can we build an AV system that is, sensibly and honestly, really intervention-free.