
Computer Vision – image classification and object detection

In this blog post, we will first examine classic computer vision. Then we will move into modern computer vision, which uses deep learning to recognize and locate objects. And lastly, we'll cover how to train the deep learning networks.

We cover both image classification (is there a cat in the picture?) and object detection (where is the cat in the picture?), which is a more advanced form of image classification.

Take a look around the room and notice the different objects. You may see a desk, a lamp, or a chair. For a human, it’s easy to recognize and distinguish between these objects, but for a computer, this is a much more complicated endeavour. Since the 1970s, computer scientists have worked on computer systems that can ‘see’. While they’ve made some advancements towards this goal, the real breakthroughs have happened in the last couple of decades with the introduction of deep neural networks into the field of computer vision. This has enabled computers to see, recognize and understand objects at an unprecedented level.

These advancements have opened up a massive space of opportunities in many different industries, from self-driving cars, to quality assurance in production lines, to detection of rust in wheat.


Classic Computer Vision

At its core, computer vision is computers understanding images and video. This is possible because the images and videos of today are digital. On a computer, a photo is represented as a matrix of pixels, each representing a color defined by a number.

The picture is an example of a low-resolution image of Abraham Lincoln, made up of white, grey and black pixels. Each of these pixels is defined by a number between 0 and 255, where 0 represents black and 255 represents white.

By assigning numerical values to the pixels, we are able to use mathematical calculations that tell us something about the image.
This basic property can lead to something as advanced as an image classifier, which can tell us whether an object is present in a picture or not.


The picture is an example of a low-resolution image of Abraham Lincoln
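
To make this concrete, here is a minimal sketch in Python (using Pillow and NumPy) of how a program turns an image into a matrix of numbers it can calculate on. The filename is a placeholder; any image will do.

```python
# A minimal sketch: to the computer, an image is just a matrix of numbers.
# "lincoln.png" is a placeholder filename, not a file shipped with this post.
import numpy as np
from PIL import Image

img = Image.open("lincoln.png").convert("L")  # "L" = 8-bit grayscale, values 0-255
pixels = np.array(img)                        # 2D matrix with one number per pixel

print(pixels.shape)      # e.g. (32, 32) for a low-resolution image
print(pixels[:4, :4])    # the numbers in the top-left corner
print(pixels.mean())     # a simple calculation over the whole image
```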

Image Classification

An image classifier tells us whether an object is present in a picture or not. It receives an image, for example an image of a cat or a dog, and then outputs a probability of it being either a cat or a dog.

To make this distinction the image classifier needs to extract some characteristics that distinguish the two. Since both dogs and cats vary between breeds, let’s define cats as house cats and dogs as Golden Retrievers.

So what are some of the features that distinguish a house cat and a Golden Retriever?

There are several features we could come up with, but two come to mind.

1. House cats tend to have more pointy ears than Golden Retrievers.

2. House cats tend to have lighter noses than Golden Retrievers.

In computer vision lingo we call these differences features.

To assess the pointedness of the ears, we can create an algorithm that calculates triangularity, as pointed ears have a triangular shape.

To assess the color of the animal’s nose, we can similarly create an algorithm that identifies the nose and calculates the average pixel color.
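
As a rough illustration, the sketch below computes one such hand-made feature: the average color inside a nose region. The image path and the bounding box are hypothetical; in practice they would come from manual annotation or a separate nose-finding step, and the ear-pointedness feature would additionally require contour or shape analysis.

```python
# Sketch of a hand-made feature extractor: average color of a manually chosen nose region.
import numpy as np
from PIL import Image

def nose_color(image_path, nose_box):
    """Mean RGB color inside the box (x0, y0, x1, y1) covering the animal's nose."""
    img = np.array(Image.open(image_path).convert("RGB"))
    x0, y0, x1, y1 = nose_box
    region = img[y0:y1, x0:x1]
    return region.reshape(-1, 3).mean(axis=0)  # one value per color channel

# Hypothetical usage: the file name and box coordinates are invented for illustration.
print(nose_color("cat_01.jpg", nose_box=(120, 80, 150, 105)))
```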

Modern computer vision

Nose Color on the y-axis and the pointedness of ears on the x-axis

Features

Having extracted the two features, we can plot them on an XY diagram with Nose Color on the y-axis and the pointedness of ears on the x-axis.

Not all house cats and Golden Retrievers look the same, so we will usually include many pictures of both animals. In this case, let’s say we include 10 per animal. This means we will have 20 pictures, represented by 20 points on the graph.

We then need to make an algorithm that can distinguish between a house cat and a Golden Retriever. To do so, we train the algorithm to find the line that best separates the two clusters of points in the diagram.

Algorithm that distinguishes between house cats and Golden Retrievers

Once the computer program is able to do so we have essentially trained an algorithm that can make a distinction between a house cat and a Golden Retriever. This algorithm is represented by the line that separates the two clusters of points.

To use this algorithm that distinguishes between house cats and Golden Retrievers, we ‘feed’ it an image. From this image it extracts the two features: the nose color of the animal and the pointedness of the ears. These features are then mapped on the diagram. If the point falls on one side of the line, the algorithm predicts that the object is a house cat. If it falls on the other side, it predicts that the object is a Golden Retriever.

This is computer vision at its core. It can be much more complicated, and usually is, as there tend to be more than two features. Furthermore, the boundary that separates the classes is not necessarily a straight line.


The algorithm is represented by the line that separates the two clusters of points
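
Below is a minimal sketch of this “find the separating line” step, using scikit-learn’s logistic regression as the linear classifier. The 20 feature pairs are invented for illustration: (pointedness of ears, darkness of nose), both scaled to 0–1.

```python
# Sketch: fit a linear classifier on two hand-made features (invented data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
cats = rng.normal(loc=[0.8, 0.3], scale=0.1, size=(10, 2))        # pointy ears, light noses
retrievers = rng.normal(loc=[0.3, 0.8], scale=0.1, size=(10, 2))  # rounder ears, darker noses

X = np.vstack([cats, retrievers])
y = np.array([0] * 10 + [1] * 10)      # 0 = house cat, 1 = Golden Retriever

clf = LogisticRegression().fit(X, y)   # learns the line that separates the two clusters

# Classify a new animal from its two extracted features.
new_animal = [[0.75, 0.35]]
print(clf.predict(new_animal))         # likely [0], i.e. house cat
print(clf.predict_proba(new_animal))   # probability for each class
```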

Manual vs Automatic Feature Extraction

In the case above, a feature engineer manually selects and creates the algorithms that extract the features. But this is very time-consuming and requires an experienced feature engineer. Furthermore, it is not always possible for a feature engineer to define algorithms which can consistently find the features we need.

The difficulty of feature extraction has historically been the core bottleneck in the advancement of computer vision. But with the introduction of neural networks in the late 1990s, a feature engineer no longer had to extract the features manually. Instead, we train an AI to do it for us.

One might think, “So now the computer is selecting and extracting features such as the nose color and pointedness of the ears”. Not exactly. While the computer is exposed to the same image as us, it interprets it completely differently. The features it will select and extract will likely seem quite odd to us. For instance, nose to feet ratio. As mentioned in the beginning, a computer ‘sees’ numbers, not pictures.


Object Detection

Object detection is an extension of image classification. It not only tells us if a certain object is present in the picture, but also tells us where.

On the left, we have an example of a classifier. It tells us that there is a cat in the picture.

The object detector, shown in the image to the right, tells us that there are a cat, a dog, and a duck, and furthermore where they are located.

It starts by analyzing an image and selecting regions in the picture where it “thinks” there is something. It doesn’t know what the object is, but it detects that there is some kind of object; it is not merely background. We call this process region proposal. Let’s look at an example.

Region proposal

In the picture we see a cowboy on his horse. The object detector evaluates the picture and detects four different regions. These four regions are then individually run through image classifiers trained to recognize the different objects in the image.

Based on this, it is able to determine that in one region there is a person, in another there is a horse, in a third there is a hat, and in the last region there is nothing.
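
As a minimal sketch of how this looks in practice, the snippet below runs a pretrained Faster R-CNN from torchvision, a two-stage detector that first proposes regions and then classifies each of them, much like the process described above. The image path is a placeholder, and the printed labels are COCO class ids.

```python
# Sketch: off-the-shelf object detection with a pretrained Faster R-CNN.
# Requires torchvision >= 0.13 for the weights="DEFAULT" argument.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = Image.open("cowboy_on_horse.jpg").convert("RGB")  # placeholder image path
with torch.no_grad():
    predictions = model([to_tensor(img)])[0]  # dict with boxes, labels (COCO ids), scores

for box, label, score in zip(predictions["boxes"], predictions["labels"], predictions["scores"]):
    if score > 0.5:  # keep only confident detections
        print(f"class {label.item()} with score {score.item():.2f} at {box.tolist()}")
```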

Having looked at both image classification and object detection, let’s examine how we train them using a neural network.


Training a Computer Vision Model

To utilize a neural network, both for image classification and for object detection, we need to train it. To do so we need a lot of data of the right quality. And to get data of the right quality we need a good setup. In fact, a good setup is often said to be 80% of computer vision.

Suppose you have a production line where you use computer vision to check the quality of certain components. In such a case, we would help ourselves tremendously by using a lightbox with proper lighting. This would allow us to control the lighting and prevent light contamination. It would reduce the variation in our input significantly, and give us much cleaner data.

We also want to include pictures that cover the entire spectrum of our data. Meaning that, if we want a model that can recognize cars both during daytime and nighttime, we need to train it with pictures of cars both during the day and night. The model can’t recognize objects it hasn’t seen before.

It’s also crucial that we expose our model to what’s called ‘true negatives’. These are the objects that the model will likely see but we don’t want it to recognize. In the car example, it could be objects such as bicycles, trailers, trucks and scooters.

If the model hasn’t seen these during training, it may mistake them for cars once it’s taken into production. If, on the other hand, we expose the model to these objects during training, the model will learn that the negatives are not cars. Instead, it will consider them part of the background. This is particularly important if the negatives share similarities with the positives; for example, trucks and trailers share many features with cars.
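
To make the training step concrete, here is a minimal fine-tuning sketch with PyTorch and torchvision. The dataset folder layout is hypothetical; note the ‘background’ class holding the true negatives discussed above.

```python
# Sketch: fine-tune a pretrained classifier on a hypothetical dataset layout:
#   dataset/train/car/...         - cars, ideally covering both day and night
#   dataset/train/background/...  - true negatives such as trucks, trailers, bicycles
import torch
from torch import nn, optim
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("dataset/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Start from a network pretrained on ImageNet and replace its final layer.
model = models.resnet18(weights="DEFAULT")
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```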


