Lecture 1 | Introduction to Convolutional Neural Networks for Visual Recognition
Articles Blog

Lecture 1 | Introduction to Convolutional Neural Networks for Visual Recognition

– So welcome everyone to CS231n. I’m super excited to
offer this class again for the third time. It seems that every
time we offer this class it’s growing exponentially
unlike most things in the world. This is the third time
we’re teaching this class. The first time we had 150 students. Last year, we had 350
students, so it doubled. This year we’ve doubled
again to about 730 students when I checked this morning. So anyone who was not able
to fit into the lecture hall I apologize. But, the videos will be
up on the SCPD website within about two hours. So if you weren’t able to come today, then you can still check it
out within a couple hours. So this class CS231n is
really about computer vision. And, what is computer vision? Computer vision is really
the study of visual data. Since there’s so many people
enrolled in this class, I think I probably don’t
need to convince you that this is an important problem, but I’m still going to
try to do that anyway. The amount of visual data in our world has really exploded to a ridiculous degree in the last couple of years. And, this is largely a
result of the large number of sensors in the world. Probably most of us in this room are carrying around smartphones, and each smartphone has one, two, or maybe even three cameras on it. So I think on average
there’s even more cameras in the world than there are people. And, as a result of all of these sensors, there’s just a crazy large, massive amount of visual data being produced
out there in the world each day. So one statistic that I
really like to kind of put this in perspective is a 2015 study from CISCO that estimated that by 2017 which is where we are now that roughly 80% of all traffic on the
internet would be video. This is not even counting all the images and other types of visual data on the web. But, just from a pure
number of bits perspective, the majority of bits
flying around the internet are actually visual data. So it’s really critical
that we develop algorithms that can utilize and understand this data. However, there’s a
problem with visual data, and that’s that it’s
really hard to understand. Sometimes we call visual
data the dark matter of the internet in analogy
with dark matter in physics. So for those of you who have
heard of this in physics before, dark matter accounts
for some astonishingly large fraction of the mass in the universe, and we know about it due to the existence of gravitational pulls on
various celestial bodies and what not, but we
can’t directly observe it. And, visual data on the
internet is much the same where it comprises the majority of bits flying around the internet,
but it’s very difficult for algorithms to actually
go in and understand and see what exactly is
comprising all the visual data on the web. Another statistic that I
like is that of Youtube. So roughly every second of clock time that happens in the world,
there’s something like five hours of video being uploaded to Youtube. So if we just sit here and count, one, two, three, now there’s 15 more hours of video on Youtube. Google has a lot of
employees, but there’s no way that they could ever
have an employee sit down and watch and understand
and annotate every video. So if they want to catalog and serve you relevant videos and maybe
monetize by putting ads on those videos, it’s really
crucial that we develop technologies that can dive in
and automatically understand the content of visual data. So this field of computer vision is truly an interdisciplinary
field, and it touches on many different areas of science and engineering and technology. So obviously, computer vision’s
the center of the universe, but sort of as a constellation of fields around computer vision, we
touch on areas like physics because we need to understand
optics and image formation and how images are
actually physically formed. We need to understand
biology and psychology to understand how animal
brains physically see and process visual information. We of course draw a lot
on computer science, mathematics, and engineering
as we actually strive to build computer systems that implement our computer vision algorithms. So a little bit more about
where I’m coming from and about where the teaching
staff of this course is coming from. Me and my co-instructor
Serena are both PHD students in the Stanford Vision Lab which is headed by professor Fei-Fei Li,
and our lab really focuses on machine learning and
the computer science side of things. I work a little bit more
on language and vision. I’ve done some projects in that. And, other folks in our group have worked a little bit on the neuroscience
and cognitive science side of things. So as a bit of introduction,
you might be curious about how this course relates
to other courses at Stanford. So we kind of assume a basic
introductory understanding of computer vision. So if you’re kind of an undergrad, and you’ve never seen
computer vision before, maybe you should’ve taken
CS131 which was offered earlier this year by Fei-Fei
and Juan Carlos Niebles. There was a course taught last quarter by Professor Chris
Manning and Richard Socher about the intersection of deep learning and natural language processing. And, I imagine a number of
you may have taken that course last quarter. There’ll be some overlap
between this course and that, but we’re really focusing
on the computer vision side of thing, and really
focusing all of our motivation in computer vision. Also concurrently taught this quarter is CS231a taught by
Professor Silvio Savarese. And, CS231a really focuses
is a more all encompassing computer vision course. It’s focusing on things
like 3D reconstruction, on matching and robotic vision, and it’s a bit more all encompassing with regards to vision than our course. And, this course, CS231n, really focuses on a particular class
of algorithms revolving around neural networks and
especially convolutional neural networks and their applications to various visual recognition tasks. Of course, there’s also a number of seminar courses that are taught, and you’ll have to check the syllabus and course schedule for
more details on those ’cause they vary a bit each year. So this lecture is normally given by Professor Fei-Fei Li. Unfortunately, she wasn’t
able to be here today, so instead for the majority of the lecture we’re going to tag team a little bit. She actually recorded a
bit of pre-recorded audio describing to you the
history of computer vision because this class is a
computer vision course, and it’s very critical and
important that you understand the history and the context
of all the existing work that led us to these developments of convolutional neural
networks as we know them today. I’ll let virtual Fei-Fei take over [laughing] and give you a brief
introduction to the history of computer vision. Okay let’s start with today’s agenda.
So we have two topics to cover one is a brief history of computer vision and the
other one is the overview of our course CS 231 so we’ll start with a very
brief history of where vision comes from when did computer vision start and
where we are today. The history the history of vision can go back many many
years ago in fact about 543 million years ago. What was life like during that
time? Well the earth was mostly water there were a few species of animals
floating around in the ocean and life was very chill. Animals didn’t move around
much there they don’t have eyes or anything when food swims by they grab
them if the food didn’t swim by they just float around but something really
remarkable happened around 540 million years ago. From fossil studies zoologists
found out within a very short period of time — ten million years — the number of
animal species just exploded. It went from a few of them to hundreds of
thousands and that was strange — what caused this? There were many theories but for many
years it was a mystery evolutionary biologists call this evolution’s Big Bang.
A few years ago an Australian zoologist called Andrew Parker proposed one of the
most convincing theory from the studies of fossils
he discovered around 540 million years ago the first animals developed eyes and
the onset of vision started this explosive speciation phase. Animals can
suddenly see; once you can see life becomes much more proactive. Some
predators went after prey and prey have to escape from predators so the
evolution or onset of vision started a evolutionary arms race and animals had
to evolve quickly in order to survive as a species so that was the beginning of
vision in animals after 540 million years vision has developed into the
biggest sensory system of almost all animals especially intelligent animals
in humans we have almost 50% of the neurons in our cortex involved in visual
processing it is the biggest sensory system that enables us to survive, work,
move around, manipulate things, communicate, entertain, and many things.
The vision is really important for animals and especially intelligent
animals. So that was a quick story of biological vision. What about humans, the
history of humans making mechanical vision or cameras? Well one of the early
cameras that we know today is from the 1600s, the Renaissance period of time,
camera obscura and this is a camera based on pinhole camera theories. It’s
very similar to, it’s very similar to the to the early eyes that animals developed
with a hole that collects lights and then a plane in the back of the
camera that collects the information and project the imagery. So
as cameras evolved, today we have cameras everywhere this is one of the most
popular sensors people use from smartphones to to other sensors. In the
mean time biologists started studying the mechanism of vision. One of
the most influential work in both human vision where animal vision as well as
that inspired computer vision is the work done by Hubel and Wiesel in the 50s
and 60s using electrophysiology. What they were asking, the question is “what was the visual processing mechanism like in primates, in mammals” so they chose
to study cat brain which is more or less similar to human brain from a visual
processing point of view. What they did is to stick some electrodes in the back
of the cat brain which is where the primary visual cortex area is and then
look at what stimuli makes the neurons in the in the back in the primary visual
cortex of cat brain respond excitedly what they learned is that there are many
types of cells in the, in the primary visual cortex part of the the cat brain
but one of the most important cell is the simple cells they respond to
oriented edges when they move in certain directions. Of course there are also more
complex cells but by and large what they discovered is visual processing starts
with simple structure of the visual world, oriented edges and as information
moves along the visual processing pathway the brain builds up the
complexity of the visual information until it can recognize the complex
visual world. So the history of computer vision also starts around early
60s. Block World is a set of work published by Larry Roberts which is
widely known as one of the first, probably the first PhD thesis of
computer vision where the visual world was simplified into simple geometric
shapes and the goal is to be able to recognize them and reconstruct what
these shapes are. In 1966 there was a now famous MIT summer project called “The
Summer Vision Project.” The goal of this Summer Vision Project, I read: “is an
attempt to use our summer workers effectively in a construction of a
significant part of a visual system.” So the goal is in one summer we’re gonna work
out the bulk of the visual system. That was
an ambitious goal. Fifty years have passed; the field of computer vision has
blossomed from one summer project into a field of thousands of researchers
worldwide still working on some of the most fundamental problems of vision. We
still have not yet solved vision but it has grown into one of the most important
and fastest growing areas of artificial intelligence. Another
person that we should pay tribute to is David Marr. David Marr was a MIT vision
scientist and he has written an influential book in the late 70s about
what he thinks vision is and how we should go about computer vision
and developing algorithms that can enable computers to recognize the visual
world. The thought process in his, in David Mars book is
that in order to take an image and arrive at a final holistic full 3d
representation of the visual world we have to go through several process. The
first process is what he calls “primal sketch;” this is where mostly the edges,
the bars, the ends, the virtual lines, the curves, the boundaries, are represented
and this is very much inspired by what neuroscientists have seen: Hubel and
Wiesel told us the early stage of visual processing has a lot to do with simple
structures like edges. Then the next step after the edges and the curves is what David Marr calls “two-and-a-half d sketch;” this is where we
start to piece together the surfaces, the depth information, the layers, or the
discontinuities of the visual scene, and then eventually we put everything
together and have a 3d model hierarchically organized in terms of
surface and volumetric primitives and so on. So that was a very idealized thought
process of what vision is and this way of thinking actually has dominated
computer vision for several decades and is also a very intuitive way for
students to enter the field of vision and think about how we can deconstruct
the visual information. Another very important seminal group of
work happened in the 70s where people began to ask the question “how can we
move beyond the simple block world and start recognizing or representing real
world objects?” Think about the 70s, it’s the time that there’s very little
data available; computers are extremely slow, PCs are not even around,
but computer scientists are starting to think about how we can recognize and
represent objects. So in Palo Alto both at Stanford as well as SRI, two
groups of scientists that propose similar ideas: one is called “generalized
cylinder,” one is called “pictorial structure.” The basic idea is that every
object is composed of simple geometric primitives; for example a person can be
pieced together by generalized cylindrical shapes or a person can be
pieced together by critical part in their elastic distance between
these parts so either representation is a way to
reduce the complex structure of the object into a collection of
simpler shapes and their geometric configuration. These work have been
influential for quite a few, quite a few years and then in the 80s David Lowe, here
is another example of thinking how to reconstruct or recognize the visual
world from simple world structures, this work is by David Lowe which he tries to
recognize razors by constructing lines and edges and and mostly
straight lines and their combination. So there was a lot of effort in trying to
think what what is the tasks in computer vision in the 60s 70s and 80s and frankly
it was very hard to solve the problem of object recognition; everything I’ve shown
you so far are very audacious ambitious attempts but they remain at the level of
toy examples or just a few examples. Not a lot of
progress have been made in terms of delivering something that can work in
real world. So as people think about what are the problems to solving vision one
important question came around is: if object recognition is too hard,
maybe we should first do object segmentation, that is the task of taking
an image and group the pixels into meaningful areas. We might not know the
pixels that group together is called a person, but we can extract out all the
pixels that belong to the person from its background; that is called image
segmentation. So here’s one very early seminal work by Jitendra Malik and his
student Jianbo Shi from Berkeley from using a graph theory algorithm for the
problem of image segmentation. Here’s another problem that made some headway
ahead of many other problems in computer vision, which is face detection.
Faces one of the most important objects to humans, probably the most important
objects to humans, around the time of 1999 to 2000 machine learning techniques,
especially statistical machine learning techniques start to gain
momentum. These are techniques such as support vector machines, boosting,
graphical models, including the first wave of neural networks. One particular
work that made a lot of contribution was using AdaBoost algorithm to do
real-time face detection by Paul Viola and Michael Jones and there’s a lot to
admire in this work. It was done in 2001 when computer chips are still very very
slow but they’re able to do face detection in
images in near-real-time and after the publication of this paper in five years
time, 2006, Fujifilm rolled out the first digital camera that has a real-time
face detector in the in the camera so it was a very rapid transfer from basic
science research to real world application. So as a field we continue to
explore how we can do object recognition better so one of the very influential
way of thinking in the late 90s til the first 10 years of 2000 is feature based
object recognition and here is a seminal work by David Lowe called SIFT feature.
The idea is that to match and the entire object for example here is a stop sign to
another stop sight is very difficult because there might be all kinds of
changes due to camera angles, occlusion, viewpoint, lighting, and just the
intrinsic variation of the object itself but it’s inspired to observe that there
are some parts of the object, some features, that tend to remain diagnostic
and invariant to changes so the task of object recognition began with identifying
these critical features on the object and then match the features to a similar
object, that’s a easier task than pattern matching the entire object. So here is a
figure from his paper where it shows that a handful, several dozen SIFT
features from one stop sign are identified and matched to the SIFT
features of another stop sign. Using the same building block which is
features, diagnostic features in images, we have as a field has made another step
forward and start to recognizing holistic scenes. Here is an example
algorithm called Spatial Pyramid Matching; the idea is that there are
features in the images that can give us clues about which type of scene it is,
whether it’s a landscape or a kitchen or a highway and so on and this particular
work takes these features from different parts of the image and in different
resolutions and put them together in a feature descriptor and then we do
support vector machine algorithm on top of that. Similarly a very similar work
has gained momentum in human recognition so putting together these features well
we have a number of work that looks at how we can compose human bodies in more
realistic images and recognize them. So one work is called the “histogram of
gradients,” another work is called “deformable part models,” so as you
can see as we move from the 60s 70s 80s towards the first decade of the 21st
century one thing is changing and that’s the quality of the pictures were no
longer, with the Internet the the the growth of the Internet the digital
cameras were having better and better data to study computer vision. So one of
the outcome in the early 2000s is that the field of computer vision has defined
a very important building block problem to solve. It’s not the only problem to solve but in terms of recognition this is a very
important problem to solve which is object recognition. I talked about object
recognition all along but in the early 2000s we began to have benchmark data
set that can enable us to measure the progress of object recognition. One of
the most influential benchmark data set is called PASCAL Visual Object Challenge,
and it’s a data set composed of 20 object classes, three of them are shown
here: train, airplane, person; I think it also has cows, bottles, cats, and so on; and
the data set is composed of several thousand to ten thousand images per
category and then the field different groups develop algorithm to test
against the testing set and see how we have made progress. So here is a figure
that shows from year 2007 to year 2012. The performance on detecting objects the
20 object in this image in a in a benchmark data set has steadily
increased. So there was a lot of progress made. Around that time a group of us from
Princeton to Stanford also began to ask a harder question to ourselves as well
as our field which is: are we ready to recognize every object or most of the
object in the world. It’s also motivated by an observation that is rooted in
machine learning which is that most of the machine learning algorithms it
doesn’t matter if it’s graphical model, or support vector machine, or AdaBoost,
is very likely to overfit in the training process and part of the
problem is visual data is very complex because it’s complex our models tend to
have a high dimension a high dimension of input and have to have a lot of
parameters to fit and when we don’t have enough training data overfitting happens
very fast and then we cannot generalize very well. So motivated by this dual
reason, one is just want to recognize the world of all the objects, the other
one is to come back the machine learning overcome the the machine learning
bottleneck of overfitting, we began this project called ImageNet. We wanted to
put together the largest possible dataset of all the pictures we can find, the
world of objects, and use that for training as well as for benchmarking. So
it was a project that took us about three years, lots of hard work; it
basically began with downloading billions of images from the internet
organized by the dictionary we called WordNet which is tens of thousands of
object classes and then we have to use some clever crowd engineering trick a
method using Amazon Mechanical Turk platform to sort, clean, label each of the
images. The end result is a ImageNet of almost 15 million or 40 million plus
images organized in twenty-two thousand categories of objects and scenes and
this is the gigantic, probably the biggest dataset produced in the field of
AI at that time and it began to push forward the algorithm development of
object recognition into another phase. Especially important is how to benchmark
the progress so starting 2009 the ImageNet team rolled
out an international challenge called ImageNet Large-Scale Visual Recognition
Challenge and for this challenge we put together a more stringent test set of
1.4 million objects across 1,000 object classes and this is to test the image
classification recognition results for the computer vision algorithms. So here’s
the example picture and if an algorithm can output 5 labels and and top five
labels includes the correct object in this picture then we call this a success.
So here is a result summary of the ImageNet Challenge, of the image
classification result from 2010 to 2015 so on x axis you see the
years and the y axis you see the error rate. So the good news is the error rate
is steadily decreasing to the point by 2012 the error rate is so low is on par
with what humans can do and here a human I mean a single Stanford PhD student who
spend weeks doing this task as if he were a computer participating in the
ImageNet Challenge. So that’s a lot of progress made even though we have not
solved all the problems of object recognition which you’ll learn about in
this class but to go from an error rate that’s
unacceptable for real-world application all the way to on par being on par with
humans in ImageNet challenge, the field took only a few years. And one particular
moment you should notice on this graph is the the year 2012. In the first two
years our error rate hovered around 25 percent but in 2012 the error rate was
dropped more almost 10 percent to 16 percent even though now it’s better but
that drop was very significant and the winning algorithm of that year is a
convolutional neural network model that beat all other algorithms around that
time to win the ImageNet challenge and this is the focus of our whole course
this quarter is to look at to have a deep dive into what convolutional neural
network models are and another name for this is deep learning by by popular popular name now it’s called deep
learning and to look at what these models are what are the principles what
are the good practices what are the recent progress of this model, but
here is where the history was made is that we, around 2012 convolutional
neural network model or deep learning models showed the tremendous capacity
and ability in making a good progress in the field of computer vision along with
several other sister fields like natural language processing and speech
recognition. So without further ado I’m going to hand the rest of the lecture to
to Justin to talk about the overview of CS 231n. Alright, thanks so much Fei-Fei. I’ll take it over from here. So now I want to shift gears a little bit and talk a little bit more
about this class CS231n. So this class focuses
on one of these most, so the primary focus of this class is this image classification problem which we previewed a
little bit in the contex of the ImageNet Challenge. So in image classification, again, the setup is that your
algorithm looks at an image and then picks from among
some fixed set of categories to classify that image. And, this might seem like
somewhat of a restrictive or artificial setup, but
it’s actual quite general. And, this problem can be applied
in many different settings both in industry and academia
and many different places. So for example, you could
apply this to recognizing food or recognizing calories
in food or recognizing different artworks, different
product out in the world. So this relatively basic
tool of image classification is super useful on its
own and could be applied all over the place for many
different applications. But, in this course,
we’re also going to talk about several other visual
recognition problems that build upon many of
the tools that we develop for the purpose of image classification. We’ll talk about other problems such as object detection
or image captioning. So the setup in object detection is a little bit different. Rather than classifying an entire image as a cat or a dog or a horse or whatnot, instead we want to go in
and draw bounding boxes and say that there is a
dog here, and a cat here, and a car over in the background, and draw these boxes describing where objects are in the image. We’ll also talk about image captioning where given an image the system now needs to produce a
natural language sentence describing the image. It sounds like a really hard, complicated, and different problem, but we’ll see that many of the tools that we develop in service of image classification will be reused in these
other problems as well. So we mentioned this before in the context of the ImageNet Challenge,
but one of the things that’s really driven the
progress of the field in recent years has been this adoption of convolutional neural networks or CNNs or sometimes called convnets. So if we look at the
algorithms that have won the ImageNet Challenge for
the last several years, in 2011 we see this method from Lin et al which is still hierarchical. It consists of multiple layers. So first we compute some features, next we compute some local invariances, some pooling, and go
through several layers of processing, and then finally feed this resulting descriptor to a linear SVN. What you’ll notice here is that
this is still hierarchical. We’re still detecting edges. We’re still having notions of invariance. And, many of these
intuitions will carry over into convnets. But, the breakthrough
moment was really in 2012 when Jeff Hinton’s group in Toronto together with Alex
Krizhevsky and Ilya Sutskever who were his PHD student at that time created this seven layer
convolutional neural network now known as AlexNet,
then called Supervision which just did very, very well
in the ImageNet competition in 2012. And, since then every year
the winner of ImageNet has been a neural network. And, the trend has been
that these networks are getting deeper and deeper each year. So AlexNet was a seven or
eight layer neural network depending on how exactly you count things. In 2015 we had these much deeper networks. GoogleNet from Google
and VGG, the VGG network from Oxford which was about
19 layers at that time. And, then in 2015 it got really crazy and this paper came out
from Microsoft Research Asia called Residual Networks which
were 152 layers at that time. And, since then it turns out you can get a little bit better if you go up to 200, but you run our of memory on your GPUs. We’ll get into all of that later, but the main takeaway here
is that convolutional neural networks really had
this breakthrough moment in 2012, and since then there’s been a lot of effort focused
in tuning and tweaking these algorithms to make them
perform better and better on this problem of image classification. And, throughout the rest of the quarter, we’re going to really dive in deep, and you’ll understand exactly
how these different models work. But, one point that’s really important, it’s true that the breakthrough moment for convolutional neural
networks was in 2012 when these networks performed very well on the ImageNet Challenge,
but they certainly weren’t invented in 2012. These algorithms had actually been around for quite a long time before that. So one of the sort of foundational works in this area of
convolutional neural networks was actually in the ’90s from
Jan LeCun and collaborators who at that time were at Bell Labs. So in 1998 they build this
convolutional neural network for recognizing digits. They wanted to deploy
this and wanted to be able to automatically recognize
handwritten checks or addresses for the post office. And, they built this
convolutional neural network which could take in the pixels of an image and then classify either what digit it was or what letter it was or whatnot. And, the structure of this network actually look pretty
similar to the AlexNet architecture that was used in 2012. Here we see that, you know, we’re taking in these raw pixels. We have many layers of
convolution and sub-sampling, together with the so called
fully connected layers. All of which will be
explained in much more detail later in the course. But, if you just kind of
look at these two pictures, they look pretty similar. And, this architecture in 2012 has a lot of these architectural similarities that are shared with this
network going back to the ’90s. So then the question you might ask is if these algorithms
were around since the ’90s, why have they only suddenly become popular in the last couple of years? And, there’s a couple
really key innovations that happened that have
changed since the ’90s. One is computation. Thanks to Moore’s law, we’ve gotten faster and faster computers every year. And, this is kind of a coarse measure, but if you just look at
the number of transistors that are on chips, then that has grown by several orders of magnitude
between the ’90s and today. We’ve also had this advent
of graphics processing units or GPUs which are super parallelizable and ended up being a perfect tool for really crunching these
computationally intensive convolutional neural network models. So just by having more compute available, it allowed researchers to
explore with larger architectures and larger models, and in some cases, just increasing the model
size, but still using these kind of classical approaches
and classical algorithms tends to work quite well. So this idea of increasing computation is super important in the
history of deep learning. I think the second key
innovation that changed between now and the ’90s was data. So these algorithms are
very hungry for data. You need to feed them
a lot of labeled images and labeled pixels for them
to eventually work quite well. And, in the ’90s there just wasn’t that much labeled data available. This was, again, before
tools like Mechanical Turk, before the internet was
super, super widely used. And, it was very difficult to collect large, varied datasets. But, now in the 2010s
with datasets like PASCAL and ImageNet, there existed
these relatively large, high quality labeled
datasets that were, again, orders and orders magnitude bigger than the dataset available in the ’90s. And, these much large datasets, again, allowed us to work with
higher capacity models and train these models to
actually work quite well on real world problems. But, the critical takeaway here is that convolutional neural networks although they seem like this
sort of fancy, new thing that’s only popped up in
the last couple of years, that’s really not the case. And, these class of
algorithms have existed for quite a long time in
their own right as well. Another thing I’d like to point out in computer vision we’re in the business of trying to build machines
that can see like people. And, people can actually
do a lot of amazing things with their visual systems. When you go around the world, you do a lot more than just drawing boxes around the objects and classifying
things as cats or dogs. Your visual system is much
more powerful than that. And, as we move forward in the field, I think there’s still a
ton of open challenges and open problems that we need to address. And, we need to continue
to develop our algorithms to do even better and tackle
even more ambitious problems. Some examples of this are
going back to these older ideas in fact. Things like semantic segmentation
or perceptual grouping where rather than
labeling the entire image, we want to understand for
every pixel in the image what is it doing, what does it mean. And, we’ll revisit that
idea a little bit later in the course. There’s definitely work going back to this idea of 3D understanding, of reconstructing the entire world, and that’s still an
unsolved problem I think. There’re just tons and tons of other tasks that you can imagine. For example activity recognition, if I’m given a video of some person doing some activity, what’s the best way to recognize that activity? That’s quite a challenging
problem as well. And, then as we move forward with things like augmented reality
and virtual reality, and as new technologies
and new types of sensors become available, I think we’ll come up with a lot of new, interesting
hard and challenging problems to tackle as a field. So this is an example
from some of my own work in the vision lab on this
dataset called Visual Genome. So here the idea is that
we’re trying to capture some of these intricacies
in the real world. Rather than maybe describing just boxes, maybe we should be describing images as these whole large graphs
of semantically related concepts that encompass
not just object identities but also object relationships,
object attributes, actions that are occurring in the scene, and this type of
representation might allow us to capture some of this
richness of the visual world that’s left on the table when we’re using simple classification. This is by no means a standard
approach at this point, but just kind of giving you this sense that there’s so much more
that your visual system can do that is maybe not captured in this vanilla image classification setup. I think another really interesting work that kind of points in this direction actually comes from
Fei-Fei’s grad school days when she was doing her PHD at Cal Tech with her advisors there. In this setup, they had
people, they stuck people, and they showed people this
image for just half a second. So they flashed this
image in front of them for just a very short period of time, and even in this very, very rapid exposure to an image, people were able to write these long descriptive paragraphs giving a whole story of the image. And, this is quite remarkable
if you think about it that after just half a second
of looking at this image, a person was able to say that this is some kind of a game or
fight, two groups of men. The man on the left is throwing something. Outdoors because it seem like
I have an impression of grass, and so on and so on. And, you can imagine that if a person were to look even longer at this image, they could write probably a whole novel about who these people
are, and why are they in this field playing this game. They could go on and on and on roping in things from
their external knowledge and their prior experience. This is in some sense the
holy grail of computer vision. To sort of understand
the story of an image in a very rich and deep way. And, I think that despite
the massive progress in the field that we’ve had
over the past several years, we’re still quite a long way
from achieving this holy grail. Another image that I
think really exemplifies this idea actually comes, again,
from Andrej Karpathy’s blog is this amazing image. Many of you smiled, many of you laughed. I think this is a pretty funny image. But, why is it a funny image? Well we’ve got a man standing on a scale, and we know that people
are kind of self conscious about their weight sometimes,
and scales measure weight. Then we’ve got this other guy behind him pushing his foot down on the scale, and we know that because
of the way scales work that will cause him to
have an inflated reading on the scale. But, there’s more. We know that this person
is not just any person. This is actually Barack
Obama who was at the time President of the United States, and we know that Presidents
of the United States are supposed to be respectable
politicians that are [laughing] probably not supposed to be playing jokes on their compatriots in this way. We know that there’s these people in the background that
are laughing and smiling, and we know that that means that they’re understanding something about the scene. We have some understanding that they know that President Obama
is this respectable guy who’s looking at this other guy. Like, this is crazy. There’s so much going on in this image. And, our computer vision algorithms today are actually a long way
I think from this true, deep understanding of images. So I think that sort of
despite the massive progress in the field, we really
have a long way to go. To me, that’s really
exciting as a researcher ’cause I think that we’ll have just a lot of really
exciting, cool problems to tackle moving forward. So I hope at this point I’ve
done a relatively good job to convince you that computer
vision is really interesting. It’s really exciting. It can be very useful. It can go out and make
the world a better place in various ways. Computer vision could be applied in places like medical
diagnosis and self-driving cars and robotics and all
these different places. In addition to sort of tying
back to sort of this core idea of understanding human intelligence. So to me, I think that computer vision is this fantastically
amazing, interesting field, and I’m really glad that over the course of the quarter, we’ll
get to really dive in and dig into all these different details about how these algorithms
are working these days. That’s sort of my pitch
about computer vision and about the history of computer vision. I don’t know if there’s
any questions about this at this time. Okay. So then I want to talk a little bit more about the logistics of this class for the rest of the quarter. So you might ask who are we? So this class is taught by Fei-Fei Li who is a professor of computer
science here at Standford who’s my advisor and director
of the Stanford Vision Lab and also the Stanford AI Lab. The other two instructors
are me, Justin Johnson, and Serena Yeung who is
up here in the front. We’re both PHD students
working under Fei-Fei on various computer vision problems. We have an amazing
teaching staff this year of 18 TAs so far. Many of whom are sitting
over here in the front. These guys are really the unsung heroes behind the scenes making
the course run smoothly, making sure everything happens well. So be nice to them. [laughing] I think I also should mention
this is the third time we’ve taught this course,
and it’s the first time that Andrej Karpathy has
not been an instructor in this course. He was a very close friend of mine. He’s still alive. He’s okay, don’t worry. [laughing] But, he graduated, so he’s actually here I think hanging around
in the lecture hall. A lot of the development and
the history of this course is really due to him working on it with me over the last couple of years. So I think you should be aware of that. Also about logistics,
probably the best way for keeping in touch with the course staff is through Piazza. You should all go and signup right now. Piazza is really our preferred
method of communication with the class with the teaching staff. If you have questions that you’re afraid of being embarrassed about asking in front of your classmates, go ahead and ask anonymously even
post private questions directly to the teaching staff. So basically anything that you need should ideally go through Piazza. We also have a staff mailing list, but we ask that this is mostly for sort of personal, confidential things that you don’t want going on Piazza, or if you have something
that’s super confidential, super personal, then feel free to directly email me or
Fei-Fei or Serena about that. But, for the most part,
most of your communication with the staff should be through Piazza. We also have an optional
textbook this year. This is by no means required. You can go through the course
totally fine without it. Everything will be self contained. This is sort of exciting
because it’s maybe the first textbook about deep
learning that got published earlier this year by E.N. Goodfellow, Yoshua Bengio, and Aaron Courville. I put the Amazon link here in the slides. You can get it if you want to, but also the whole content of the book is free online, so you
don’t even have to buy it if you don’t want to. So again, this is totally optional, but we’ll probably be
posting some readings throughout the quarter
that give you an additional perspective on some of the material. So our philosophy about this class is that you should really
understand the deep mechanics of all of these algorithms. You should understand at a very deep level exactly how these algorithms are working like what exactly is going on when you’re stitching together these neural networks, how do these architectural decisions influence how the network is trained and tested and whatnot and all that. And, throughout the course
through the assignments, you’ll be implementing
your own convolutional neural networks from scratch in Python. You’ll be implementing the
full forward and backward passes through these
things, and by the end, you’ll have implemented a whole
convolutional neural network totally on your own. I think that’s really cool. But, we also kind of
practical, and we know that in most cases people
are not writing these things from scratch, so we also want to give you a good introduction to some
of the state of the art software tools that are used
in practice for these things. So we’re going to talk about
some of the state of the art software packages like Tensor
Flow, Torch, [Py]Torch, all these other things. And, I think you’ll get some exposure to those on the homeworks
and definitely through the course project as well. Another note about this course is that it’s very state of the art. I think it’s super exciting. This is a very fast moving field. As you saw, even these plots
in the imaging challenge basically there’s been a ton of progress since 2012, and like while
I’ve been in grad school, the whole field is sort
of transforming ever year. And, that’s super exciting
and super encouraging. But, what that means is that
there’s probably content that we’ll cover this
year that did not exist the last time that this
course was taught last year. I think that’s super
exciting, and that’s one of my favorite parts
about teaching this course is just roping in all
these new scientific, hot off the presses stuff and being able to present it to you guys. We’re also sort of about fun. So we’re going to talk
about some interesting maybe not so serious
topics as well this quarter including image captioning is pretty fun where we can write
descriptions about images. But, we’ll also cover some
of these more artistic things like DeepDream here on the left where we can use neural
networks to hallucinate these crazy, psychedelic images. And, by the end of the course, you’ll know how that works. Or on the right, this
idea of style transfer where we can take an image and render it in the style of famous artists
like Picasso or Van Gogh or what not. And again, by the end of the quarter, you’ll see how this stuff works. So the way the course works
is we’re going to have three problem sets. The first problem set
will hopefully be out by the end of the week. We’ll have an in class,
written midterm exam. And, a large portion of your grade will be the final course
project where you’ll work in teams of one to three and produce some amazing project that
will blow everyone’s minds. We have a late policy, so
you have seven late days that you’re free to allocate
among your different homeworks. These are meant to cover
things like minor illnesses or traveling or conferences
or anything like that. If you come to us at
the end of the quarter and say that, “I suddenly
have to give a presentation “at this conference.” That’s not going to be okay. That’s what your late days are for. That being said, if you have some very extenuating circumstances,
then do feel free to email the course staff
if you have some extreme circumstances about that. Finally, I want to make a note about the collaboration policy. As Stanford students,
you should all be aware of the honor code that governs the way that you should be collaborating
and working together, and we take this very seriously. We encourage you to think very carefully about how you’re
collaborating and making sure it’s within the bounds of the honor code. So in terms of prerequisites,
I think the most important is probably a deep familiarity with Python because all of the programming assignments will be in Python. Some familiarity with C
or C++ would be useful. You will probably not
be writing any C or C++ in this course, but as you’re
browsing through the source code of these various software packages, being able to read C++ code at least is very useful for understanding
how these packages work. We also assume that you
know what calculus is, you know how to take derivatives
all that sort of stuff. We assume some linear algebra. That you know what matrices are and how to multiply them
and stuff like that. We can’t be teaching you how to take like derivatives and stuff. We also assume a little bit of knowledge coming in of computer
vision maybe at the level of CS131 or 231a. If you have taken those courses before, you’ll be fine. If you haven’t, I think
you’ll be okay in this class, but you might have a tiny
bit of catching up to do. But, I think you’ll probably be okay. Those are not super strict prerequisites. We also assume a little
bit of background knowledge about machine learning
maybe at the level of CS229. But again, I think really
important, key fundamental machine learning concepts
we’ll reintroduce as they come up and become important. But, that being said, a
familiarity with these things will be helpful going forward. So we have a course website. Go check it out. There’s a lot of information and links and syllabus and all that. I think that’s all that I
really want to cover today. And, then later this week on Thursday, we’ll really dive into our
first learning algorithm and start diving into the
details of these things.

Leave a Reply

Your email address will not be published. Required fields are marked *

Back To Top