Research challenges in using computer vision in robotics systems


Professor Anna Choromanska: Welcome to the first talk of the 2019 seminar series on Modern Artificial Intelligence at the NYU Tandon School
of Engineering. This series, which we launched last year, aims to bring
together faculty, students, and researchers to discuss the most
important trends in the world of AI, and this year's speakers, like last year's,
are world-renowned experts whose research is making an immense impact on
the development of new machine learning techniques and technologies. In the
process they are helping to build a better, smarter, more connected world, and
we are proud to have them here. By the way, the talks are live streamed and
viewed around the globe, helping to spread the word about the amazing work
going on in the AI community. I would like to thank Jelena Kovačević as
well as my own Department of Electrical and Computer Engineering for supporting
this series and graciously hosting our esteemed speakers. As many of you may
already know, Dean Kovačević came to us from Carnegie Mellon University in
Pittsburgh, and she has that in common with today's speaker, Martial Hebert, who
heads the Robotics Institute at Carnegie Mellon. Just like Brooklyn, Pittsburgh has
a storied history that captures the imagination. Many people know it as the
Steel City because of its status as a hub of industry and steel production;
today, thanks in some part to the efforts of Martial and his fellow researchers,
it has another nickname: Roboburgh. Today more than 60 robotics companies
make their home in Allegheny County, many as products of the strong robotics
program at CMU. Martial's own research interests include computer vision and
robotics especially recognition in images and video data, model building and
object recognition from 3d data and perception for mobile robots and for
intelligent vehicles. His group has developed approaches for object
recognition and scene analysis in images, 3d point clouds and video
sequences. In the area of machine perception for robotics, his group has
developed techniques for people detection, tracking, and prediction, and
for understanding the environment of land vehicles from sensor data. He also
currently serves as editor-in-chief of the International Journal of Computer
Vision. Martial will be speaking to us today about the formidable research
challenges in using computer vision in robotics
systems, illustrating his presentation with compelling examples
from autonomous air and ground robots. I know none of us can wait to hear him, so
without further ado I'll invite him to the stage. [Martial Hebert:] Thank you… great to be here on this campus. Can you hear me? Yes, obviously you can hear me in the back,
right.. all right. So I'm going to talk to you a little bit about computer vision,
and more specifically about the issues that we face in using computer vision in
autonomous systems, or more generally robotic systems. The story here
is that we've seen extremely rapid progress over the past few years in
computer vision, fueled in part by progress in machine learning, deep
learning, and other developments. However, progress has been much slower in moving those results to autonomous systems.. robotic systems, and what I'd like
to discuss is a few of the reasons why that is, and a few of the research areas,
if you will, that we need to address to accelerate that connection between the
progress in computer vision and autonomous systems. So when I talk about
autonomous systems I mean things like this.. this is a video from some work from
some years ago of a drone flying through dense.. a forested area like this, the
drone has input only from a single camera and has to generate in real-time..
the steering command left/right steering command here. This is another example
here from another group at CMU, the Robotics Institute; this is a 3D mapping
of an environment.. this is in the context of safe driving… for example a car
driving on the street. You see the speed here; if you do the calculation.. that's
about 100 kilometres an hour in the streets of Pittsburgh, and we're not supposed
to do that.. and unfortunately I'm being streamed.. so that's bad, but ok. This is a
real application, this is from a group at the National Robotics Engineering Center,
part of the Robotics Institute; this is a collaboration with Caterpillar. Those
trucks have been operating for many years now and have driven
thousands of miles.. those are fully autonomous.. okay, there's no human
interaction here, so this is basically one example of a self-driving type
of thing. This is a slightly simpler environment than open-world
self-driving, because of course you don't have other agents and so forth,
but that gives you an idea of what we try to do in robotics and in autonomous
systems. This is another example, this is from another group, and as
you're seeing, this is a fully autonomous helicopter.. now you have to deal with, you
know, everything from bad sensor data to complicated dynamics.. okay, if you
think of the planning that you need to do, there’s no room for error here and
this is kind of the point I'm trying to make with those examples: this is the key
difference between.. you know, what is sometimes called Type 1 AI, the kind
of thing that you have in a question/answer system, you know, Siri or
Alexa.. all that kind of thing, where first of all you can make some mistakes.. it's
okay, nobody dies.. okay; you also don't have a strict time and computation
constraint, unlike the example that I showed there… where an output needs to be
generated within some number of milliseconds and the decision has to be
taken within some number of milliseconds and you cannot make a mistake. So those
are the kind of additional constraints that we have to take into account now,
when migrating those computer vision techniques to those kinds of systems, okay.
Another way to say it is, if you look at your typical vision paper, you know, and you
have 95 percent performance on whatever dataset, let's say, or 99 percent; well, for those
systems 99 percent is not good enough, you need to have
basically consistent performance on every single image that you acquire,
every single bit of sensor data. So those are the kind of problems that we
need to look at to be able to use those techniques effectively, and by the way, if
you're interested, there are a couple of sources here, you know, some special
issues of robotics journals on long-term autonomy and robust autonomy
and related ideas, that analyze in more detail the kind of requirements that we
have for those systems. So what I'm going to do is look at basically
three or four classes of issues that we need to address in vision to make it
work in those kinds of systems, and to do that I'm going to use this
diagram, which is the simplest possible view of an autonomous system.. okay;
actually there's an even simpler one with only two boxes, but we're
going to start with four boxes.. okay. So we basically have some sensory input,
and for this discussion it doesn't really matter if it's images or videos or 3D data, you have
some sensory input continuously; you have some perception box here that
outputs some form of output.. some interpretation of the environment, and
from that you want to make a decision; in the example I showed previously, it's
for example planning for the helicopter or for the truck and so forth,
okay. So the first problem that I'm going to look at is this arrow here. The way
I've drawn this here… it looks like this perception box is always going to give
an answer, right! You get an input, it gives an answer, you make your decision
based on this. The problem with that is that this assumes that this perception
box is always able to give an answer, that the input is always going to fall
within the universe of inputs that this perception box can actually process. That
is a huge assumption, because it assumes that the system is always going to be in
conditions that it's been trained for… So the first set of problems that we
look at, and I will show you one example of an approach for this, is to ask the
question: should the perception system always
give an answer, number one, and number two, can we anticipate when the
perception system should not give an answer… or at least when the sensor data is not
reliable. So let me... this comes under various
names: introspection, self-evaluation, failure detection.. in fact,
there is an entire new research program
exactly on this topic of introspection and self-evaluation, so this is
an important area. So let me illustrate that on one
example. This is a typical vision task, semantic segmentation; this is what you
would do in self-driving. This is a video from a car and the output interpretation
here.. big progress there, things work pretty well now with this. You now put
this on a robot, and then things can go pretty badly okay; they can go pretty
badly because of poor illumination conditions, things are too close, too far..
any number of conditions, and the reason of course is that it's impossible
to anticipate all the conditions under which the system is going to operate, and
it's impossible to anticipate basically all the things that can go wrong.
In fact, we did this: we provided our code for semantic segmentation to people from
the Army Research Lab, with whom we were working, and they
basically took our code, evaluated it on the robot, and then produced this
80-page report on everything that went wrong; it's a really humbling
experience, you know, and of course a lot of things go wrong. So what you really want to
do is to be able to look at the input and automatically figure out whether I
should really use this input, or the output of the perception system, to make
a decision. A simple analogy, and that's only an analogy: if you're
driving and you're suddenly surrounded by fog.. you know instantly that
your vision system is impaired, and you know instantly that you cannot make
decisions the way you were making them before. So you're going to automatically switch
to a different mode of driving instantly. So the question is, can we find approaches
that allow us to do this automatically from the input? This is kind of
related to a concept of introspection, basically the ability to evaluate the
input to a vision algorithm and to assess whether this input is going to yield enough performance to be able to make
a decision and to have good performance of the overall system. So
that's what we're trying to do. Now, one way to do
this would be to explicitly anticipate those failure cases, basically explicitly
try to encode those conditions on the input that are going to lead to bad
performance… but of course that's not going to work.. because you cannot, by
definition, anticipate all those conditions. The second way of doing
this is to learn features from the input that are predictive of performance..
okay; if we can do that, given the input, we can have some prediction of the
overall system performance and we can then decide whether or not we're going
to make those decisions. So this is what we do: you basically have a lot of
examples of the system in operation and you try to have a predictor that will
take the input and predict the overall performance, and from that you can decide
whether or not to use the input, for example, in my example of the fog, whether
or not your visual system is good enough to drive.
or not your visual system is good enough to drive, this is an example non-verbally
example.. this is a full task your segmentation task… image segmentation
task and what the system here the train system here, predicts is.. that the images
on the on the Left are hard to segment, in other words your vision algorithm is
not going to do very well and the images on the right are easy to segment, now of
course you look at this and you go well the obviously… those are harder than
those the, the important thing is to be able to do this automatically using a
systematic approach to be able.. to be able to do this, this is another example
closer to what I showed earlier in the symmetry segmentation case, where again
those are hard images.. ok, we want where the performance is going to be low, since
I’m going to go on and good images here, so if we can do this.. we can basically
anticipate when failures of the vision system are going to occur and we can
make a decisions… decisions based on this okay,
by the way, this is a very old concept… this goes back to the early days of
pattern recognition in radars and things like this, where people used, in those
days… a term that is not used very much anymore..
something called the declaration rate. So you're familiar with the detection rate
and the false positive rate… if you have a classifier or any type of, you know,
system.. that's typically the performance curve, you know, precision-recall
or ROC curves, with the two axes there, but you can have a third axis,
which is what is called the declaration rate, and the declaration rate means the
percentage of time that you are actually going to produce an output… okay. A
declaration rate of a hundred percent means that you always use the output, so
you always assume that it performs reasonably; a very low
declaration rate.. means that you're going to reject most
of the inputs, and of course then the accuracy is very high, because you've been very
conservative; in fact, your vision system is going to work really well if
you never use the output, it's never wrong okay, and of course the performance is
worst.. when you use all of the inputs, because sometimes it's going to fail
catastrophically okay. So basically the idea is that if you can do this
introspection, meaning this prediction of performance… you can add the third axis
and you can basically set the threshold on this declaration rate axis.. okay;
if you're very conservative you're going to be on this side, if you're
willing to take more risk… you're going to be on that side okay.
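A rough sketch of that third axis (my illustration, not code from the talk): sweep the threshold on the introspection score and, at each declaration rate, measure the accuracy over only the inputs the system actually answered.

```python
# Illustrative sketch: trace accuracy versus declaration rate, the fraction of
# inputs on which the system actually commits to an answer.
import numpy as np

def accuracy_vs_declaration(scores, correct):
    """scores: predicted-performance score per input; correct: 1 if the
    perception output was good enough for the downstream decision."""
    order = np.argsort(-np.asarray(scores))        # most confident inputs first
    correct = np.asarray(correct)[order]
    n = len(correct)
    declaration_rate = np.arange(1, n + 1) / n     # fraction of inputs answered
    accuracy = np.cumsum(correct) / np.arange(1, n + 1)
    return declaration_rate, accuracy

# At 100% declaration you always answer (lowest accuracy, includes the
# catastrophic cases); at very low declaration you answer rarely but are
# almost always right.  The operating point is chosen to match the risk you
# are willing to take.
```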
So let me give you an example here; this is an example of a drone flying okay, and this is what I
tell myself every morning when I get up, you know. So basically, what we're
trying to do is look at the… so this is the vision system here: this is the
input video, monocular by the way, and the output is an estimated depth
map; there is nothing special about the vision system that is used here, it's
pretty standard.. oh no, this is another one here.. the system in action, so this is the
input video here, the estimated depth map, and this is a 3D view of the same thing.
The way this works is that it constructs a local 3D
representation of the environment and evaluates trajectories to fly
through. Now the problem of course is that if the vision goes bad here, if this
estimate of the 3D environment is bad.. then we can have a catastrophic failure
of the estimated trajectory and a catastrophic failure of the system. So what we do is
to take that input video, learn from training data how to predict
performance from the input data, and use that predicted
performance to decide whether or not we're going to use that input okay; a
little bit like in my fog example.. if I sense that my vision system is
impaired, I'm going to slow down, I'm going to do something different.. that's
basically the idea.. anticipating failures of the vision system. It's not too important how all of
this is learned; this is using some neural networks, as usual; the
important thing is that we can learn this directly. Now, there is a
question of how you train such a system, right, and the way this is trained
in this case is to have ground truth of the 3D environment; as a
detail, this is done using stereo cameras on the drone at training time. Since we have
ground truth… we know which trajectories are clear and which are
obstructed, so we can then
measure the performance of the system, meaning compare the actual trajectory
evaluation with the one that I get by using my vision system okay; if
it's the same.. then it's good performance, if it's
very different.. then it's bad performance okay. That's the ground truth, if you will,
for the training system, and from that we can learn how to do this prediction.
Okay, so we can look at how well we predict those failures
of the system, and we can then plot basically
this detection rate; we can plot, as an ROC curve, how
well we detect those failures, and this is the example here… false
positive rate and true positive rate on detecting those failures. The red
curve here is what would happen if we were to use just the
confidence of the vision system itself okay.. any vision system is going to give
you a confidence, an internal confidence on its own output, and you could use that
to do this prediction, but it turns out we do much better by doing this
introspection, which is not surprising, because the confidence of the vision
system relies on the fact that the input is a valid input, so we can learn a
more reliable representation, if you will, of the performance prediction. I'll
show you this example here.. let's see if this video is going to work
here… so this is again the example of the drone flying here; what you have on the
right is a scale.. that is going to show how confident it is in the output;
when the scale is high.. it means high probability of failure.. which is what
happens right here. What it does at that point is enter a different mode of
flying: it basically stops and looks around until it gets enough data to have enough
confidence in its output okay. So this implements that behaviour
that I was describing earlier… to detect, from self-introspection,
when the system itself is going to, or may, fail
because of the type of input. If we measure the
performance I don’t know what this thing here is but okay.. if we measure the
performance by seeing how far we can fly without manual intervention… the green
bar here indicate the flying with introspection in other words having this
failure detection and being able to recover from it and the the red bar here
indicates without the introspection and not surprisingly we find that you know
you can go much further and as a much more robust… much more robust execution. We can look at.. what it.. what it form the if you go back and see what failures it
detects… it detects the things that some of the things that you would expect since like you know elimination issues… seems like rotation
of motion, which in this case affects the quality of the 3d reconstruction etc. So
of course, some of those things are not surprising that those are the kind of
failure modes that you would expect.. the important thing is that.. it’s able to
learn how to detect those things automatically… using a you know.. a unified
unified procedure to do this, alright ! let’s keep those things.. this is another
example of the same thing; this shows only the depth map here.
So here we have basically good performance of the vision system, and
you're going to see a case here where things get very, very sparse and you
have a high probability of failure of the system, and it's going to enter again
a recovery mode, looking around until it gets good data again and then flies
again okay. So that's basically the idea of introspection…. yes, yeah. [Answering audience question:] So let's be very careful here: what I'm talking about is ideas on how to deal
with those failures of the vision system; what I have not addressed is what you
do with that information.. you know, what strategy do you use once you have
that information, and.. and so I don't have an answer to
that; that part is still very specific to the system at hand okay. [Audience question:] So for aircraft the requirement is something like a 10⁻⁹ error rate, so what would this get us to? [Martial:] Oh! So again I don't have the answer in this context, but in the case
of the drone, we basically were able, for example in those experiments, to
reduce the error rate by a factor of 10 or something like that; it's
very dependent of course on the system that goes
around it and on the way it's used; it's more about the principle
of being able to learn to characterize the performance
automatically. So one important thing that I mentioned in presenting this is
how it is trained, and I mentioned this idea of measuring, you know, which
trajectories are correct and incorrect and all this, meaning that it's
trained on the behavior of the overall system, it's not trained on the output of
the visual system, and that brings up an important point.. which is a basic problem
that I think is pretty much still open, which is, you know, the choice of the
right error metric or loss function that we use in building those visual systems,
so let me illustrate this on this example.. this is an example here from a
semantic labeling task, so you have an image and you label pixels according to
different types of regions. Those are two different outputs, and it's clear to
everybody that the output on the top is better than the output on the bottom
because of that big green mistake here, that mislabeled region, but if you think about
it, why do we say that it's better? We say that it's better because the error
metric, the loss function if you will, that we use is the percentage of pixels
mislabeled.. right! When we train such a task, we try to maximize
the number of correctly labeled pixels, and in fact the objective here
implicitly is to label every pixel correctly. The problem with that is that, from a robotics and autonomy point of view, there
is really no application where you need to label every pixel correctly, it just
doesn't exist okay. Similarly, when you think of the flying
application that I showed earlier, you could measure the
performance of the system by looking at the accuracy of the 3D representation,
but there again, you don't care if the point cloud is, you know, five
millimeters accurate or 2 millimeters accurate; the only thing you care about
is that your trajectories are correctly evaluated.. that's it.. that's the only
thing you care about. So the basic problem is that in many cases we are
using the wrong error metric or objective function in the vision box
that we have as part of the system, for example the number of pixels labeled or
the accuracy of a 3D point cloud, and we're using one that is actually harder than
the actual task that we're trying to solve; if I try to label every pixel
correctly, I'm actually solving a task that is harder than what I need to.. the
only thing I need to solve is the robotic task. So this is a
basic issue that we have in trying to implement those things and trying to
really use vision effectively in those tasks.. In some cases it is
possible, as I showed in the flying example; there are many other
examples, in manipulation for instance, where the right objective
function, if you will, is whether I've grasped an object or not, so
that's a well-defined objective function.. I can label things, I can measure things
that way.. but if you look at more complex tasks like this, a task of semantic
navigation for example, how do you characterize an end-to-end metric for
this task? So that's a major challenge, and that's the idea of
being able to use, you know, the right error metric.. okay.
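To make the contrast concrete, here is a small illustrative sketch (not from the talk; `is_clear` is a hypothetical helper): the usual pixel-accuracy metric versus a task-level metric that only asks whether the downstream clear/blocked decision for each candidate trajectory would have been the same.

```python
# Illustrative sketch: pixel-level metric versus a task-level metric.
import numpy as np

def pixel_accuracy(pred_labels, true_labels):
    # The usual vision metric: fraction of pixels labeled correctly.
    return float(np.mean(pred_labels == true_labels))

def task_level_score(trajectories, pred_map, true_map, is_clear):
    # Task-level metric: fraction of candidate trajectories that receive the
    # same clear/blocked decision from the predicted map as from ground truth.
    # `is_clear(trajectory, map)` is a hypothetical helper assumed to exist.
    same = [is_clear(t, pred_map) == is_clear(t, true_map) for t in trajectories]
    return float(np.mean(same))
```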
So that was the idea of introspection and understanding the performance of the overall system; let
me look at the second problem.. which is the problem of limited supervision, and
the idea here is this: if you look again at my little system here.. this perception box has to be trained from somewhere, from some
data, and so the assumption here is that we have lots of data that we are
going to use for training this perception box. The problem of course is
that that data is generally supervised data, labelled data… it also has to be
very, very large data. This is a major limitation in being able to use
those systems…. simply because in the kind of online applications that I showed
earlier, we don't have the luxury of being able to acquire a very large amount
of annotated, supervised data; we also don't have the luxury of having the time and
computation to retrain every time we want to introduce a new concept, every
time we change the operating conditions. So the second area that we
look at is this idea of reducing supervision okay, and being able to learn
those visual models with as little data as possible.. ok, so again, the idea
currently is that we train those visual models using a very large amount
of labeled data; what we want to do is be able to do this from very few
examples, or maybe even just one example ok.. and the reason why that
should be possible, of course, is that even though we have only one example
here, we might have a lot of experience offline with many, many training examples
and many, many different visual tasks. So the idea here is to be able to
design those vision systems so that, instead of just being trained from,
you know, one labeled set of data, they use prior experience
to be able to learn those models okay… So imagine that you have
trained your system for many, many different tasks right.. you now have a new
task, a new class that you want to deal with, a new type of object or something;
you want to use all that prior experience to be able to train quickly
on that new task with very few examples okay, and if we can do that, we can reduce
the amount of supervision that is needed and have a more practical system. Now this
falls under various headings, you know, meta-learning, learning to learn, things
like this, which basically means again trying to use the prior experience
of having learned many models to now learn quickly a new model. So let me tell you some of the things that we've been looking at in
this space and I’ll talk a little bit about one aspect of that – which I call
model dynamics, which is the idea of not just reasoning in.. in feature space or in
image space.. as in offer.. as is often the case in.. in in learning, but in the model
space, so let me explain what I mean here… let’s say that we have a some kind of
classification task… simple image classification test okay, let’s say you
know living rooms images versus non living room images… from one of those
data set, so one way to think about it in data space… is to think of that as each
image corresponding to some feature.. is some.. in some very high dimensional space maybe and those features are constrained to be on some you know subset of that of
that feature space, that… that might look like a something like that and then we
have the other class here which is also features in that in that feature space,
in that data space somewhere else and the classification problem can be
visualized as you know learning some kind of boundary between the nothing you
hear, this is basically.. what what one does in machine learning , setting
typically. Now the dual way of thinking about this
is like this: you can think of this task, classifying living room versus non-living-room,
as generating some kind of classifier, which I illustrate by this
blue blob here, and this classifier is basically a big vector, you know;
you can think of it as living in a high dimensional space as well, and it's
also constrained, because, you know, visual tasks have a certain structure…
those classifiers are implicitly constrained to be on some
subspace.. some subset of that high dimensional space. And if you take
another task, say classifying another type of class,
then you have another model that is somewhere else in that high dimensional
space okay.. So the idea here is to understand the relationship between
those models, with the idea that if I have trained many, many models like
this.. this is my past experience, then I can have an idea of the structure of
that space of models, and given a new task.. maybe I can use that knowledge to
learn quickly on that new task okay. So that's basically the idea okay, so let me
show you an example here of using that idea. [Answering audience question:] Yeah, or it can be, it does not
have to be similar, it can be a different one; yeah, in fact it's better if it's not, because
if you have lots of those tasks, conceptually, no, listen, I'm not saying anything
mathematical okay… conceptually you're sampling that set of tasks
right, so it's the dual idea of sampling; instead of dense sampling in, you know, the feature
space… you're sampling in model space. So let me give
you an example of what we can do with this kind of idea.
Suppose that I train a classifier for this category with only one example;
with only one example I'm going to get a classifier… for some reason on the
screen this is very dark.. the classifier is the line here okay. So I train this
classifier, the line here, and it's not going to be a very good one right..
because you have only one example that you used to do this. Then if you
put in more examples.. your classifier is going to be a little bit better, and then
with more examples it's going to be better and better okay …. it's a little bit
hard to see, I'm sorry about this, but it's right here, it's moving closer
this way to the boundary okay and so forth okay.. until you have really a lot of examples, and now you have the final..
you know, what you would call the best classifier alright. So the question now
is, could you somehow say something about how we went from that guy to this guy,
basically some kind of transformation here that goes from the
small-sample classifier to the large-sample classifier okay, and the idea is,
if you had enough examples of this kind of situation, in other words enough
examples where you have the classifier for one sample and the classifier with
lots of samples… which is somewhere else, then maybe you can learn how to figure
out the transformation between them okay. So that's basically the
idea right, we essentially regress what the classifier would have been
with a large sample from the one that is learned from the small sample. Now,
the idea behind this of course is that there would be no reason for such a
thing to be useful, or to exist, except for the fact that the underlying
assumption is that the visual world is in fact highly constrained, and the
corresponding classifiers, all the corresponding models, are also highly
constrained; they don't change in completely arbitrary ways, and that's
basically the idea here. So how is this done in practice? This is done using a
network that takes as input the model learned with few examples and produces
as output the model that would have been learned with a lot of examples okay, and of course it should have the behavior that the network is simpler if
you have more examples; in fact, if you have a lot of examples,
the network should do nothing, it should be the identity, and as you have
fewer and fewer examples, you're going to need a deeper and deeper
network to generate that regression ok.
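A minimal sketch of what such a model-to-model regression could look like (an assumed architecture for illustration, not the published one): a network maps the weights of a classifier trained on few examples toward the weights it would have had with many examples, trained on pairs collected from past tasks.

```python
# Illustrative sketch: regress small-sample classifier weights toward the
# large-sample weights, using (w_few, w_many) pairs from previously seen tasks.
import torch
import torch.nn as nn

class ModelRegressor(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, w_few):
        # Residual form: the mapping stays close to the identity when the
        # small-sample model is already a good one.
        return w_few + self.net(w_few)

def train_step(regressor, optimizer, w_few, w_many):
    """w_few, w_many: classifier weights for the same past task, trained with
    few vs. many examples."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(regressor(w_few), w_many)
    loss.backward()
    optimizer.step()
    return loss.item()
```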
So that's kind of the idea of what we look at here. Now, the interesting thing about this is
that this allows us to implement something that should
be a very natural behavior in learning those things. Imagine that you have a whole
domain of tasks in some region of that space, so you have lots of
categories.. in this example 400, from one of those data sets. Let's say you have
400 categories: some of those categories are going to have lots of
examples right, some of those categories are going to have very few examples ok.
This is basically those 400 categories ordered based on the number of examples
in each category, and this is what happens in real life right... in real
life you're going to have a few things that are very common and very
easy to find examples for training, and then many things that are very rare with
very few examples. So the behavior that you would like to see happen is to use
those categories here to help train on these categories.. basically to use the
stuff that you can get data on very easily… to help train on the stuff for
which you have very little data. So that's basically what we can do
there… because we can learn this transformation using the stuff for which
we have lots of examples and then use it on the stuff for which we have very few
examples, and this is a view of what happens here. So again I
have my four hundred categories, you can think of that as four hundred, you know,
classification tasks, and they are ordered according to the number
of examples in the category here, with the most examples on the
left and few examples on the right.. at the
extreme right, those categories have only one example. What the red
bars here show is the increase in performance… that we get by using this
technique, basically by training this transformation using the categories with
lots of examples… and you see that it
has the desired behavior, which is that it doesn't change anything for those
common categories, where we already have a lot of data, it doesn't do anything, and then
it increases the performance dramatically on those that have very few
examples. So this implements this natural idea again, to have the stuff
that's easy helping the stuff that is hard. All
right, so that’s one idea, the another idea that we’ve worked on very recently,
especially is the idea of hallucinating data.. is the idea of hallucinating data and
this is now a very common idea now in in learning which is the idea of using our
initial training data – to create new data basically hallucinate new data, so that we
can train a better better model… so this is just a quick view of the idea here,
let’s say that again you have your training data with very few examples
because you don’t have yet the luxury of labeling a lot of examples, what we’re
going to do is to use those initial examples… to pass it through a box here
that is going to generate new a new example here… a new sample.. this different ways of doing that nowadays, I’m not going to detail that but it’s just going
to generate a new sample, and then if we do that enough times… we can augment
our training data with those examples. So now we have the real data here, the real
examples, and the ones that we generate here, the new ones; we combine them
and then train the classifier with this combined data, and the
interesting thing is that, at training time… we can adjust the
parameters, basically train the thing, so that the system learns how
to generate those examples… so that they best benefit the training okay.
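A minimal sketch of this joint training (illustrative only, with assumed shapes and names): a hallucinator generates extra feature-label pairs from the few real ones, and the classification loss is backpropagated through both the classifier and the hallucinator so the generated examples learn to actually help.

```python
# Illustrative sketch: hallucinate extra training examples from a few real ones
# and train the hallucinator jointly with the classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hallucinator(nn.Module):
    def __init__(self, dim, noise_dim=16):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(nn.Linear(dim + noise_dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, feats, labels, n_new):
        idx = torch.randint(len(feats), (n_new,))
        noise = torch.randn(n_new, self.noise_dim)
        new_feats = self.net(torch.cat([feats[idx], noise], dim=1))
        return new_feats, labels[idx]     # hallucinated samples keep their class

def training_step(hallucinator, classifier, optimizer, feats, labels, n_new=20):
    # `optimizer` is assumed to cover both the classifier and the hallucinator,
    # so the hallucinator learns to generate examples that reduce the loss.
    fake_feats, fake_labels = hallucinator(feats, labels, n_new)
    all_feats = torch.cat([feats, fake_feats])
    all_labels = torch.cat([labels, fake_labels])
    loss = F.cross_entropy(classifier(all_feats), all_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```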
that’s basically… the that’s basically the idea here, so this combines two idea
the first one was this idea of regressing from source sample, the second
one is the idea of generating new data to help in the endo train and we see
major improvement the the orange bar and the yellow bar.. our technique compared to
to state-of-the-art techniques in this in this area on some classification
tasks. I mentioned all of this in the context of classification, which is not
that useful in practical applications, so we can do the same thing for
detection; these are actually new results that are not published yet, but…. this
is also a standard ImageNet type of task, where the horizontal axis is the number of
training examples, starting with one example… so that's one-shot
learning; the red line here indicates the technique that I just
described and the black lines indicate reference techniques, etc.
The important thing here is that for very few examples we had very
low performance here, and here we get higher performance, to the level where we can
basically bootstrap the training and be able to do things like semi-supervised
and other techniques okay. This… okay… the last thing that I want to
talk about, because I have to be careful about the time.. is one key aspect of
the kind of robotic systems that we saw, which is the time aspect okay, and to
explain that let me go back to my basic diagram here… the way it's drawn here is
the standard way of building those things: we have input, I'm going to
compute for a while here, and then I'm going to get an output, and the way it's drawn here… I have to wait until this is done,
however much time.. whatever time it takes, to be able to use this output. The
problem is, if you are that helicopter or that car or whatever the system is, you
may not have the luxury to wait that long.. maybe you need an answer right now
okay, and in fact maybe there is a situation where for whatever reason you
have even less resources and even less time and now you need an answer, you know,
right now; maybe you don't even care if the answer is not quite the best you
could get… but you need some indication right now okay. You know, if you think of
very simple obstacle detection… you know, for a mobile robot right, maybe
because of the limited computational budget I have, I need an answer right now
to be able to plan my motion.. I don't care if I have the precise location of
the thing… if I know that there may be something on the right and nothing on
the left, that's good enough, but I need that information. So that means that we
need to think of the perception system in a different way. Instead of
thinking of a perception module as something that takes an input and
produces an output, whatever time it takes, we need to
think of the perception system as being able to produce some usable output, no
matter what the computational budget is and no matter when the overall system
needs that answer okay. So that means we need basically a perception system
that gives us an output at any time, and in fact that's a technical term in
computer science… there's a concept, in fact I have a reference here from '96,
there is a concept in computer science of an anytime algorithm, and an anytime
algorithm is an algorithm such that if you interrupt the execution at any time, you
get the best possible output that you could get within that time.
So for example, if you want to invert a matrix, instead of designing and implementing
the algorithm so that you have to wait until the entire matrix inverse is
computed, you can design the algorithm such that if you interrupt it at any time
you get the best approximation of the inverse of that matrix that you could
get within that time okay.
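The matrix example can be made concrete with a short sketch (my illustration, not from the talk): a Newton-Schulz iteration refines an approximate inverse step by step, so it can be stopped after any iteration and still return the best approximation obtained so far.

```python
# Illustrative anytime matrix inversion via Newton-Schulz iteration.
import numpy as np

def anytime_inverse(A, budget_iters):
    # Starting guess scaled so the iteration converges for any nonsingular A.
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(A.shape[0])
    for _ in range(budget_iters):       # stop whenever the time budget runs out
        X = X @ (2 * I - A @ X)         # each step improves the approximation
    return X                            # best estimate available so far
```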
Now, that's a different way of thinking about how to design those algorithms. The more formal definition of this… this
is a reference here from a paper in '96 that defines more formally what that
means: we want the algorithm to be interruptible, meaning that it can
give an answer at any time; you want monotonicity, meaning that the answers do
not get worse over time, that's the minimum that you want okay, you want
basically to have better and better answers.. better and better interpretations
of the environment over time; and the third one is diminishing returns, in
other words… this is important actually because you need to know how
long to wait, so you want the answer to improve less and less over time,
basically diminishing returns. If we can do that, then we can have systems that are
anytime and implement this idea of being able to provide an answer that is usable
for reasoning and decision-making no matter what the computational budget is. So let
me give you just a quick example. So this is an example here in terms of semantic
segmentation.. in an ideal world, what you would like to have is, if you
have very little time… your interpretation of the scene may look
something like this.. a very coarse interpretation.. and if you have a lot more
time it's going to look like that; that's the basic idea here okay. So let me
show you some of the things that we've been doing in that space,
just to give you a rough idea of what we're trying to do here. One way
to look at that is using again a neural network kind of architecture; this is a
completely, you know, high-level, idealized view of the system here: you
have the front layers, different levels in
that network; the typical approach would look at the last level, the output at the
last level, and try to optimize with respect to a loss from the last level..
instead of that, or in addition to that, we can generate
intermediate losses at different levels of this architecture and then
try to optimize all of those losses together, so that if I
interrupt the system at any stage here, I will get a sensible output
or useful output okay. Now the problem is how to optimize those things together. One
way to do that is to take all those losses and combine them, for example in a
linear combination, and train the system, instead of using the loss on the last
layer, on the output, to use this combined loss okay, with the
idea again that if we do that, we're not just trying to get the best possible
output in the end.. we're also trying to get sensible output at intermediate stages
in the network. The key challenge here, the key difficulty… is how to choose
those weights, and there has been of course prior work on this: the simple
thing that could be done is to use constant weights, and there is prior work
on doing this… you could think of using linear weights, there is work from two years ago
on doing this; it turns out, and we can show some theoretical results
that we have on this as well as empirical results that I'm going to show,
that a good way of choosing those weights is to adjust them dynamically such
that the weight is inversely proportional to the loss at that
layer… another way to say that.. is that we're going to give more weight to
the layers that have a lower loss okay,
and it turns out that you get a good behaviour of the system in that way okay,
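A minimal sketch of this weighting rule as I understand it (illustrative, not the exact published formulation): a toy network with one prediction head per stage, trained on a combined loss whose weights are inversely proportional to each head's current loss.

```python
# Illustrative sketch: intermediate prediction heads with inverse-loss weighting,
# so stages that already perform well are not neglected later in training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnytimeNet(nn.Module):
    def __init__(self, dim=64, n_classes=10, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_blocks)])
        self.heads = nn.ModuleList(
            [nn.Linear(dim, n_classes) for _ in range(n_blocks)])

    def forward(self, x):
        outputs = []
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            outputs.append(head(x))      # one usable prediction per stage
        return outputs

def combined_anytime_loss(outputs, target, eps=1e-6):
    losses = torch.stack([F.cross_entropy(o, target) for o in outputs])
    weights = 1.0 / (losses.detach() + eps)   # inverse-loss weights (no gradient)
    weights = weights / weights.sum()         # normalize to sum to 1
    return (weights * losses).sum()
```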
so again, the basic idea.. the main point here is this basic idea of anytime
prediction as an important component of your system, being able to make decisions
with arbitrary budgeted resources; the second point here is this kind of
structure… this weighting structure with this strategy to set the weights. [Answering audience question:] Yeah,
so that depends a little bit on the application, but let's say in
semantic segmentation, at those intermediate levels you can
reconstruct an output labeling and you compute the error on that. Now
it has to be adapted a little bit to each application depending on the actual
format of the output, but in most cases you can do it okay. So this is..
this is an example here on a classification task..
okay, for a classification task it's even easier right, because you can
just have the right output. So here, this is what this shows: the
vertical axis shows the loss, the loss of performance at each of those levels right,
so building blocks means those levels, from 1 to 14 in this particular example
okay. Loss of performance means that, in an ideal world… it should be 0;
in other words, I should get at each level
the best possible performance that I can possibly get with that number
of levels; that's the reference here, ok, that's the
best I can possibly do.. This is what you get with constant weights, and with
constant weights you have this really annoying thing… that we are pretty close here
to the optimal except for the final output…
meaning that, you know, you can interrupt here and get some sensible output, except
that the overall performance of the network goes down dramatically, which is
not what you want.. you still want, of course, the final
output to be correct. This is using another strategy here, and the right
one is using the strategy I described. The behavior that you want is… you want this red line to be as close as possible to zero, so you don't lose performance,
because it's never going to be at zero, that's impossible… right, zero means it's
an oracle where we've trained the system completely with two layers, with four
layers, and so forth okay. So you cannot reach that; you want to be as
close as possible, and you want to maintain the final
performance, so basically you want something flat that's close, so that
shows you the behavior there okay.. We've used this with various
classification tasks and again for segmentation tasks; let me,
to answer your question, show how it looks for semantic
segmentation right… you get intermediate results that are refined as you go
further in the network right.. so hopefully that implements the kind of
thing I was saying, you know, if you interrupt at any time, you have a coarse
output right… which has a higher error but is still a usable output, if you need to
make that decision quickly.. let me skip some of the details here. So this
shows basically the evolution of the output throughout the network, going
from a coarse, noisy output to the final refined output all right… okay so
what I showed is some ideas on doing introspection or input filtering..
basically, I mean the visual system being able to understand its own performance and being
able to adjust its behavior; I talked a little bit about this problem of being
able to design the error metric or the objective function.. such that it is tuned
to the overall task.. One thing I did not talk about is the idea of
multiple hypotheses… which is the idea of, instead of generating a single output
from the visual system, a single interpretation, for example a single
semantic labeling of the scene, generating multiple hypotheses… we
don't have time to talk about that… but this is also an important thing, because
if you don't do that… you assume basically that whatever interpretation
is generated by the vision system is the only one that you can
reason on, and you cannot go back in the reasoning chain. I talked about
anytime prediction, and this idea of small-sample training..
small-sample learning to reduce the amount of supervision. So those are a few
of the topics that we look at in vision in the context of robotics and
autonomy, and I think I should stop here because it's right about on time… thank you.
[Audience question:] Thanks for the presentation, I have a
question about anytime prediction.. so as I understand it, you want at
every stage the L1, or the Li, to be a reasonable prediction, but the
way you optimize the loss, basically the weighting, if the loss is not good, this
weight goes down.. so is that loss basically not motivating…
[Martial:] So there is an important point… yeah, it's counterintuitive basically, you
would expect the weighting to be the other way around right… the reason why it is that way
is that.. if you have one level that works very well okay… the loss is very small; if you're not careful, because
the loss is very small, if you give it a very low weight, then it's going to in a
way get neglected in the later stages of training, and what's going to happen is
that this loss is going to go up again, you see, so you need to make sure
that if a level has a low loss, meaning it has reached a good level of training… you
heighten the weight.. so that it gets preserved, if you will; that's the
rough intuition behind this…. yeah. [Anna:] There is a discussion in the field about whether, you
know, we should rather invest more time and resources to develop end-to-end
learning systems, or whether for autonomous systems it's better to keep them
modular, and one of the key, you know, arguments behind modular is that you
know you can control each part of the system… it's more interpretable
potentially, so what would be your take? [Martial:] Oh well! Yeah, so I'm not sure, because as
I said it's still an open problem, but in the introspection part, the interesting
thing that's going on there is that the system is designed in a modular
fashion right… in other words, the visual system is trained separately and
designed separately, like you said, but this performance prediction part is end
to end.. so you could actually have both right, where you have the system
trained in your modular fashion, but then its performance prediction and
performance control and overall control of the system are done end to end okay…
and I think, from what we've done… that's a good
compromise.. because training fully end-to-end, the problem there is of
course you completely lose control of what the system does, or any explainability
at all of the system all right; doing it purely modular, then you have
the problem that I explained, that you have a mismatch between the
individual modules and your overall task; so doing it modular and then having the
overall performance characterizable and predictable is a good compromise. [Applause]
