
In this blog post we will first examine classic computer vision. Then we will move into modern computer vision, which uses deep learning to recognize and locate objects. Lastly, we’ll cover how to train the deep learning networks.

We cover both image classification (is there a cat in the picture?) and object detection (where is the cat in the picture?), which is a more advanced form of image classification.

Take a look around the room and notice the different objects. You may see a desk, a lamp, or a chair.
For a human, it’s easy to recognize and distinguish between these objects, but for a computer, this is a much more complicated endeavour.

Since the 70s, computer scientists have worked on computer systems that can ‘see’. While they’ve made some advancements towards this goal, the real breakthroughs have happened in the last couple of decades with the introduction of deep neural networks into the field of computer vision. This has enabled computers to see, recognize and understand objects at an unprecedented level.

These advancements have opened up a massive space of opportunities in many different industries, from self-driving cars, to quality assurance in production lines, to detection of rust in wheat.

The picture is an example of a low-resolution image of Abraham Lincoln

Classic Computer Vision

At its core, computer vision is about computers understanding images and video. This is possible because the images and videos of today are digital. On a computer, a photo is represented as a matrix of pixels, each pixel’s color defined by a number.

The picture is an example of a low-resolution image of Abraham Lincoln, made up of white, grey and black pixels. But these pixels are also defined by numbers between 0 and 255, where, by the usual grayscale convention, 0 represents black and 255 represents white.

By assigning numerical values to the pixels, we are then able to use mathematical calculations that can tell us something about the image.
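As a tiny illustration, here is a hypothetical 4×4 “image” as a matrix of pixel values, and one of the simplest calculations we can run on it: the average brightness. The values are invented for illustration; real images are far larger.

```python
# A tiny 4x4 grayscale "image" as a matrix of pixel values (0-255).
# Values are hypothetical; by the common convention 0 is black, 255 white.
image = [
    [0,   50, 200, 255],
    [10,  60, 210, 255],
    [20,  70, 220, 250],
    [30,  80, 230, 245],
]

# One simple calculation: the average brightness of the whole image.
pixels = [value for row in image for value in row]
average_brightness = sum(pixels) / len(pixels)

print(average_brightness)
```

From building blocks like this, far more elaborate calculations can be composed.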

This basic property can lead to something as advanced as an image classifier, which can tell us whether an object is present in a picture or not.




Image Classification

An image classifier tells us whether an object is present in a picture or not. It receives an image, for example, an image of a cat or a dog, and then outputs a probability of it being either a cat or a dog.

To make this distinction, the image classifier needs to extract some characteristics that distinguish the two. Since both dogs and cats vary between breeds, let’s define cats as house cats and dogs as Golden Retrievers.

So what are some of the features that distinguish a house cat and a Golden Retriever?

There are several features we could come up with, but two come to mind.

  1. House cats tend to have more pointy ears than Golden Retrievers.
  2. House cats tend to have lighter noses than Golden Retrievers.

In computer vision lingo we call these differences features.

To assess the pointedness of the ears we can create an algorithm that calculates triangularity, as pointed ears have a triangular shape.

To assess the color of the animals’ noses, we can similarly create an algorithm that identifies the nose and calculates its pixel color.

Nose Color on the y-axis and the pointedness of ears on the x-axis


Having extracted the two features, we can plot them on an XY diagram, with nose color on the y-axis and pointedness of ears on the x-axis.

Not all house cats and Golden Retrievers look the same, so we will usually include many pictures of both animals. In this case, let’s say we include 10 per animal. This means we will have 20 pictures, represented by 20 points on the graph.

We then need to make an algorithm that can distinguish between a house cat and a Golden Retriever. To do so, we train the algorithm to find the line that best separates the two clusters of points in the diagram.


Algorithm that distinguishes between house cats and Golden Retrievers

Once the computer program is able to do so we have essentially trained an algorithm that can make a distinction between a house cat and a Golden Retriever. This algorithm is represented by the line that separates the two clusters of points.

To use this algorithm that distinguishes between house cats and Golden Retrievers, we ‘feed’ it an image. From this image it extracts the two features: the nose color of the animal and the pointedness of the ears. These features are then mapped on the diagram. If the point falls on one side of the line, the algorithm predicts that the object is a house cat. If it falls on the other side, it predicts a Golden Retriever.

This is computer vision at its core. In practice it is usually much more complicated, as there tend to be more than two features, and the boundary that separates the classes is not necessarily a straight line.
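The idea above can be sketched with a perceptron, one of the oldest algorithms for finding a separating line. All feature values below are invented for illustration; a real system would extract them from images.

```python
import random

random.seed(0)  # make the sketch reproducible

# Hypothetical training data: (ear pointedness, nose lightness) per animal,
# 10 house cats (label +1) and 10 Golden Retrievers (label -1).
cats = [(0.8 + random.uniform(-0.1, 0.1), 0.7 + random.uniform(-0.1, 0.1))
        for _ in range(10)]
dogs = [(0.3 + random.uniform(-0.1, 0.1), 0.2 + random.uniform(-0.1, 0.1))
        for _ in range(10)]
data = [(p, +1) for p in cats] + [(p, -1) for p in dogs]

# Perceptron training: nudge the line whenever a point lands on the wrong side.
w0, w1, b = 0.0, 0.0, 0.0
for _ in range(100):
    for (f0, f1), label in data:
        if label * (w0 * f0 + w1 * f1 + b) <= 0:  # misclassified
            w0 += label * f0
            w1 += label * f1
            b += label

def predict(f0, f1):
    """Classify a new point by which side of the learned line it falls on."""
    return "house cat" if w0 * f0 + w1 * f1 + b > 0 else "Golden Retriever"

print(predict(0.85, 0.75))  # very pointy ears, light nose
print(predict(0.25, 0.15))  # rounder ears, dark nose
```

In practice the boundary is learned with more robust methods (logistic regression, support vector machines, or a neural network), but the shape of the idea is the same: features in, a decision boundary out.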

Manual vs Automatic Feature Extraction

In the case above, a feature engineer manually selects and creates the algorithms that extract the features. But this is very time-consuming and requires an experienced feature engineer. Furthermore, it is not always possible to define algorithms that can consistently find the features we need.

The difficulty of feature extraction has historically been the core bottleneck in the advancement of computer vision. But with the introduction of neural networks in the late 90s, a feature engineer no longer has to extract the features manually. Instead, we train an AI to do it for us.

One might think, “So now the computer is selecting and extracting features such as the nose color and pointedness of the ears”. Not exactly. While the computer is exposed to the same image as us, it interprets it completely differently. The features it will select and extract will likely seem quite odd to us. For instance, nose to feet ratio. As mentioned in the beginning, a computer ‘sees’ numbers, not pictures.

Object detection is an extension of image classification

Object Detection

Object detection is an extension of image classification. It not only tells us if a certain object is present in the picture, but also tells us where.

On the left, we have an example of a classifier. It tells us that there is a cat in the picture.

The object detector, shown in the image on the right, tells us that there are a cat, a dog and a duck in the picture, and furthermore where they are located.

It starts by analyzing an image and selecting regions where it “thinks” there is something. It doesn’t know what the object is, but it detects that there is some kind of object; it is not merely background. We call this process region proposal. Let’s look at an example.

The object detector evaluates the picture and detects four different regions

Region proposal

In the picture we see a cowboy on his horse. The object detector evaluates the picture and detects four different regions. These four regions are then individually run through image classifiers trained to recognize different objects in the image.

Based on this, it determines that one region contains a person, one contains a horse, another contains a hat, and one contains nothing.
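The two-stage pipeline described above can be sketched as follows. `propose_regions` and `classify_region` are hypothetical stand-ins for real models; here they simply return hard-coded results matching the cowboy example.

```python
def propose_regions(image):
    """Stage 1: return candidate boxes (x, y, width, height) that may contain objects."""
    # A real region-proposal step would analyze the image; we hard-code four boxes.
    return [(40, 10, 30, 25), (30, 40, 80, 90), (45, 5, 20, 10), (0, 0, 20, 20)]

def classify_region(image, box):
    """Stage 2: run an image classifier on one cropped region."""
    # Hypothetical classifier outputs matching the example in the text.
    answers = {
        (40, 10, 30, 25): "person",
        (30, 40, 80, 90): "horse",
        (45, 5, 20, 10): "hat",
        (0, 0, 20, 20): "nothing",
    }
    return answers[box]

def detect_objects(image):
    """Full detector: keep every region where the classifier finds an object."""
    detections = []
    for box in propose_regions(image):
        label = classify_region(image, box)
        if label != "nothing":  # regions with no object are treated as background
            detections.append((label, box))
    return detections

print(detect_objects(image=None))
```

Real detectors fuse these stages into one network, but the logic is the same: propose candidate regions, then classify each one.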

Having looked at both object detection and image classification, let’s examine how we train them using a neural network.

Training a Computer Vision Model

To utilize a neural network, both for image classification and for object detection, we need to train it. To do so we need a lot of data of the right quality. And to get data of the right quality we need a good setup. In fact, a good setup is often said to be 80% of computer vision.

Suppose you have a production line where you use computer vision to check the quality of certain components. In such a case, we would help ourselves tremendously by using a lightbox with proper lighting. This would allow us to control the lighting and prevent light contamination. It would reduce the variation in our input significantly, and give us much cleaner data.

We also want to include pictures that cover the entire spectrum of our data. Meaning that, if we want a model that can recognize cars both during daytime and nighttime, we need to train it with pictures of cars both during the day and night. The model can’t recognize objects it hasn’t seen before.

It’s also crucial that we expose our model to what’s called ‘true negatives’. These are objects that the model will likely see but that we don’t want it to recognize. In the car example, it could be objects such as bicycles, trailers, trucks and scooters.

If the model hasn’t seen these during training, it may mistake them for cars once it’s taken into production. If we, on the other hand, expose the model to these objects during training, the model will learn that the negatives are not cars. Instead, it will consider them part of the background. This is particularly important if the negatives share similarities with the positives, e.g. trucks and trailers share many features with cars.

Once we’ve gathered quality data, the next step is to annotate it. Here we describe the desired result, meaning we manually tell the model that the Skoda in the image is a car.

Annotate data

We do this with many different cars – as many as we can, at least 1,000 but ideally 10,000 (note that this varies substantially between domains). It’s crucial that we don’t leave out any annotations. If we only annotate some cars but not others, we are implicitly telling the model that objects that look like cars are not cars. This ‘confuses’ the model.
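One way an annotation might be stored is as a labeled bounding box per object in each image. The field names and values below are illustrative, not any specific tool’s format.

```python
# A hypothetical annotation record: one image, with every car boxed and labeled.
annotation = {
    "image": "car_0001.jpg",  # hypothetical filename
    "objects": [
        # Box as [x_min, y_min, x_max, y_max] in pixel coordinates.
        {"label": "car", "box": [120, 80, 340, 210]},
        {"label": "car", "box": [400, 95, 610, 230]},
    ],
}

# Every car in every image gets a box; leaving one out implicitly teaches
# the model that car-like objects are background.
for obj in annotation["objects"]:
    print(obj["label"], obj["box"])
```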

With our annotated images at hand, we have our training data ready. We divide these into three sets.

  • A training set
  • A validation set
  • A test set

We give approximately 80% of the data to the training set, 10% to the validation set and 10% to the test set.
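The 80/10/10 split can be done in a few lines. The filenames below are placeholders; the key points are shuffling first, so each set sees the same mix of conditions, and keeping the three sets disjoint.

```python
import random

random.seed(42)  # reproducible split for the sketch

images = [f"img_{i:04}.jpg" for i in range(1000)]  # placeholder filenames
random.shuffle(images)  # mix day/night, angles, etc. across all three sets

n = len(images)
n_train = int(n * 0.8)
n_val = int(n * 0.1)

train_set = images[:n_train]
val_set = images[n_train:n_train + n_val]
test_set = images[n_train + n_val:]

print(len(train_set), len(val_set), len(test_set))  # 800 100 100
```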

The reason we don’t funnel all the data into the training set is to avoid ‘overfitting’. In simplified terms, overfitting is when the model aligns itself too closely to the training data: it essentially memorizes the dataset, treating this data, with all its idiosyncrasies and outliers, as a good model of the context it’s designed to operate in. The result is that it won’t have a general understanding of the underlying system. When it sees new data, it may fail to recognize it, because it doesn’t look exactly like the data it was trained on.

Underfitting, appropriate fitting, overfitting

Validation set

To avoid and detect overfitting, we use a ‘validation set’ to check whether the model is good or not. This set consists of data the model has never seen and has not been trained on. If the model is able to recognize pictures it has never seen before, we can validate that it is in fact a good model.

However, if the model performs poorly on the validation data, but performs well on the training data, that is a clear sign it has become overfitted.
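That check can be expressed as a simple rule of thumb: a large gap between training accuracy and validation accuracy signals memorization. The threshold below is an illustrative choice, not a fixed rule.

```python
def looks_overfitted(train_accuracy, val_accuracy, max_gap=0.10):
    """Flag a model whose training accuracy far exceeds its validation accuracy."""
    return (train_accuracy - val_accuracy) > max_gap

print(looks_overfitted(0.99, 0.72))  # True: great on training data, poor on new data
print(looks_overfitted(0.91, 0.88))  # False: performance generalizes
```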

Lastly, we have the test set, and the reason we use it is similar to the reason we use the validation set. When we test the model on the validation set, we may redo the training many times, adjusting the hyperparameters of the network to make it perform better. Hyperparameters can be characterized as architectural features of the model. This means that, just as the model can overfit the training set, it can also overfit the validation set.

So, to ensure that it will perform well in the real world, we finally test the model on the test set, which the model has never been adjusted against. Should the model perform poorly here, the most likely explanation is that the training data was not a good representation of the domain it was supposed to model.

But, if the model also performs well on the test set, it is ready to take on the real world!



Keep the problem in mind

An important point to end on is that we always want to keep the problem in mind and compare the performance of the AI model with the performance we get without one.

Suppose we’ve created a model that can recognize whether apples are good or bad with an accuracy of 85%. If we compare this with the performance of humans, who can recognize bad apples with 95% accuracy, one might think that this is a subpar result, since the AI model doesn’t perform as well as a person.

However, a computer may be able to do this faster and for longer than people, and it’s entirely possible to have people handle the last 15% of apples and still reach a better result than a purely human-driven quality assurance process. So it’s essentially about keeping the objective in mind, thinking holistically, and not expecting magical results with 100% accuracy.
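A back-of-the-envelope calculation makes this concrete. All numbers are hypothetical: suppose the model confidently handles 85% of the apples and is very accurate on those, while humans inspect the remaining 15%.

```python
# Hypothetical numbers for a hybrid model-plus-human process.
model_share = 0.85      # fraction of apples the model handles confidently
model_accuracy = 0.98   # assumed accuracy on those confident cases
human_accuracy = 0.95   # human accuracy on the apples routed to people

overall = model_share * model_accuracy + (1 - model_share) * human_accuracy
print(round(overall, 4))  # beats the 0.95 of a purely human process
```

Under these assumptions the combined process is more accurate than humans alone, while also being faster and cheaper for the bulk of the apples.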

Key Points

  • Computer Vision is the field in computer science that allows computers to see, understand and recognize images.
  • The introduction of deep neural networks has advanced the capabilities of computer vision significantly, which has opened up a host of opportunities in many different industries.
  • Image classification tells us whether an object is in an image, while object detection also tells us where.
  • Computer vision is possible because images and video can be represented numerically.
  • Feature extraction is about finding the features of the objects we want to recognize.
  • Historically, Computer vision was bottlenecked by the large workload associated with manual feature extraction. But neural networks made automatic feature extraction possible, which has revolutionized the field.

Contact Ambolt AI for a talk about your Computer Vision project
