In the last chapter we talked about how Deep Learning applications end up with a bunch of points in space, and how the nearest neighbor of a point might be interesting. We also said this only works if the front end AI engine creates those points in a meaningful way. So let’s look at that front end.
The first thing you need to do is decide what job you want your engine to do really well. Is it facial recognition, object recognition, visual similarity, or predicting what Sally will buy on Amazon if she just bought Cheerios?
Then you need a “training set,” which basically means a large collection of objects, each paired with the right label. Think of it like the flashcards you show your child when you teach them what a “Car” is, or a “Boat” or an “Elephant”: they have a picture on one side and its name on the other. The bigger the training set the better. If you think about it, this is just like teaching a 3-year-old: the more pictures of different kinds of dogs you show them, the more confident they get about what a “dog” is.
One important thing to remember about current AI: it does not gain intelligence. It just learns to recognize patterns, and usually just the patterns you train it to recognize. Do not expect conceptual leaps, but it can do a really useful job of pattern recognition.
What do we want to end up with? In the case of the flashcards, we want the engine to give us sets of numbers (“descriptors”) for each dog so that in the back end the dots representing pictures of dogs are clustered really close together, and pictures of cats are reasonably close by (since cats kind of look like dogs) in their own cluster, and pictures of cars are pretty far from both of them.
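To make the clustering idea concrete, here is a minimal sketch in Python. The descriptors are made-up sets of three numbers, invented purely for illustration (a real engine would produce many more numbers per picture, and nothing here comes from an actual trained model). It finds the center of each cluster and checks that dogs and cats land closer to each other than either does to cars.

```python
import math

# Hypothetical 3-number descriptors, invented for illustration only.
# A real front end would compute these from actual pictures.
descriptors = {
    "dog": [(1.0, 1.0, 0.0), (1.2, 0.9, 0.1), (0.9, 1.1, 0.0)],
    "cat": [(2.0, 1.5, 0.2), (2.1, 1.4, 0.1), (1.9, 1.6, 0.3)],
    "car": [(8.0, 7.0, 5.0), (8.2, 7.1, 5.1), (7.9, 6.8, 4.9)],
}


def centroid(points):
    """Average the points coordinate by coordinate to get the cluster center."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


centers = {label: centroid(pts) for label, pts in descriptors.items()}

# Straight-line (Euclidean) distance between cluster centers.
dog_to_cat = math.dist(centers["dog"], centers["cat"])
dog_to_car = math.dist(centers["dog"], centers["car"])
cat_to_car = math.dist(centers["cat"], centers["car"])

print(f"dog to cat: {dog_to_cat:.2f}")
print(f"dog to car: {dog_to_car:.2f}")
print(f"cat to car: {cat_to_car:.2f}")
```

With these invented numbers, the dog and cat clusters sit close together and the car cluster sits far from both, which is the layout we want a well-trained front end to produce.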
While a child can learn from flashcards after a few times through the deck, an AI front end can take many, many passes (sometimes hundreds or thousands) through your training set, and it usually takes some tweaking to get the training to “converge” on a working engine. Lots of techniques can help, like “augmenting” your training set by rotating your sample images slightly, or cropping them, which gives your engine more data to look at. The more data the better!
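Here is a rough sketch of augmentation in plain Python, treating an image as a small grid of numbers. It uses cropping, as mentioned above, plus left-to-right mirroring, which is an extra trick added here for illustration. Slight rotation is left out because it needs pixel interpolation, which real projects usually get from an image library. The function names (`mirror`, `crop`, `augment`) and the tiny 4x4 "image" are hypothetical.

```python
def mirror(image):
    """Flip the image left to right."""
    return [row[::-1] for row in image]


def crop(image, top, left, height, width):
    """Cut out a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def augment(image, label, crop_size):
    """Return (image, label) pairs: every square crop of the given size, plus its mirror."""
    rows, cols = len(image), len(image[0])
    samples = []
    for top in range(rows - crop_size + 1):
        for left in range(cols - crop_size + 1):
            patch = crop(image, top, left, crop_size, crop_size)
            samples.append((patch, label))          # the crop itself
            samples.append((mirror(patch), label))  # same label, mirrored pixels
    return samples


# A made-up 4x4 "image"; each number stands in for a pixel value.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

augmented = augment(image, "dog", crop_size=3)
print(len(augmented), "training samples from 1 original")
```

One flashcard turns into eight, and every one of them still carries the label “dog”, so the engine sees more variety without anyone labeling more pictures.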
The good news is we live in the age of “Big Data”: there are now huge repositories of almost every kind of data, and many of them are publicly accessible. The trick is getting value out of them!
The end result is both parts of the Deep Learning system: a front end that takes in the object you are interested in and calculates its “descriptors” (the sets of numbers about that object), and a back end that puts the dot those numbers represent into your cube of space and tells you which other objects are its “nearest neighbors”.
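The two-part split can be sketched in a few lines of Python. Everything here is a simplified assumption: `front_end` is a toy stand-in that computes two numbers by hand (a real front end is a trained network), `BackEnd` is a hypothetical class that checks every stored dot one by one (real back ends use faster search structures), and the pixel lists are made up.

```python
import math


def front_end(pixels):
    """Toy stand-in for a trained engine: (average brightness, brightness range)."""
    return (sum(pixels) / len(pixels), max(pixels) - min(pixels))


class BackEnd:
    """Stores dots in space and finds the closest ones by checking them all."""

    def __init__(self):
        self.points = []  # list of (name, descriptor) pairs

    def add(self, name, descriptor):
        self.points.append((name, descriptor))

    def nearest(self, descriptor, k=1):
        """Return the names of the k stored dots closest to the given descriptor."""
        ranked = sorted(self.points, key=lambda item: math.dist(item[1], descriptor))
        return [name for name, _ in ranked[:k]]


# Made-up "images": short lists of pixel brightness values.
catalog = {
    "dark_flat": [10, 12, 11, 9],
    "bright_flat": [200, 205, 198, 202],
    "high_contrast": [0, 255, 0, 255],
}

back_end = BackEnd()
for name, pixels in catalog.items():
    back_end.add(name, front_end(pixels))  # front end makes the dot, back end stores it

query = [190, 210, 195, 205]
neighbors = back_end.nearest(front_end(query), k=2)
print(neighbors)  # ['bright_flat', 'dark_flat']
```

The query is a bright, fairly even image, so its closest neighbor is the other bright, even image. Swap in a better front end and the same back end keeps working, which is the point of keeping the two parts separate.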
Next we will put this stuff to work, on visual similarity!