This is a clear case of the “old-fashioned way of doing things” getting displaced by something much better.
Autotagging of still images or video is the use of a third-party system to save you work by generating keywords (each usually with a probability score indicating its relevance) that might help you find the image or video in question. It is an automated replacement for a human looking at each still or scene and manually typing in keywords. It has the advantage of providing some metadata, but it also shares the disadvantages of keyword mechanisms.
What disadvantages are those, you ask? There are four big ones, and most of us have run into at least one, if not all of them:
- Keyword mechanisms, whether human or automated, pick words from an “authority list,” meaning you don’t use just any word you want for a keyword: either an in-house archivist or your third-party vendor has a list of words they use. So, when you search using a keyword mechanism, if you do not know the “right” word assigned to what you’re looking for, you won’t find it. If the authority list uses “dog,” and you search for “puppy,” sorry, you strike out. Without a knowledge of the keywords your autotagging vendor uses, how are you supposed to find things? Do you have time to scan and remember their authority list? If not, how do you expect to get great results? And there is nothing more frustrating to users than knowing the image they want is in their database somewhere, but then having to play guessing games to find the magic keyword.
- Keywords, as the name implies, are either single words or very short phrases. How much understanding of an image or video frame can you get with one or a few disjointed words? If you search for a picture of “dogs jumping over a fence” using keywords, you will also get “Bob jumping over a fence while his dog watches.” Not the same thing, is it? We demo a traditional search for “alternative energy,” and the results include “the minister of Energy seeks a viable alternative …” – not a single wind turbine or solar panel in sight. And if you are a marketing person looking to illustrate a concept, how likely is a single-word search mechanism to get the job done?
- Many keywords resulting from autotagging are irrelevant and only add noise to searches, which means a lot of irrelevant objects are going to be put in front of your users. Some autotagging is far worse. What if your user or customer is looking for “summer camp” or “bunk beds” and your DAM returns scenes from a WWII concentration camp because of the autotagged keywords? (Yes, this is an actual customer example.)
- Autotagging is a point-in-time statement. What if new or different terms to describe something come into general use? What is the cost (time and labor) to retroactively locate those assets and update the keywords? What is the likelihood of that happening?
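To make the first two problems concrete, here is a minimal Python sketch of exact-match keyword search. The catalog, asset ids, keywords, and the `keyword_search` function are all hypothetical, invented for illustration; no real vendor's authority list or API is implied.

```python
# Hypothetical catalog: asset id -> keywords assigned from an authority list.
ASSET_KEYWORDS = {
    "img_001": {"dog", "fence", "jumping"},          # dogs jumping over a fence
    "img_002": {"man", "dog", "fence", "jumping"},   # Bob jumps, his dog watches
    "img_003": {"minister", "energy", "alternative"},
    "img_004": {"wind turbine", "renewable"},
}


def keyword_search(query: str) -> list[str]:
    """Return ids of assets whose keyword set contains every query word."""
    terms = set(query.lower().split())
    return sorted(
        asset_id
        for asset_id, keywords in ASSET_KEYWORDS.items()
        if terms <= keywords
    )


# "puppy" is not on the authority list, so the dog photos are never found.
print(keyword_search("puppy"))               # []

# Disjointed words cannot say who is jumping, so both photos match.
print(keyword_search("dog jumping fence"))   # ['img_001', 'img_002']

# The words match the minister; the wind turbine photo is missed entirely.
print(keyword_search("alternative energy"))  # ['img_003']
```

Nothing in this toy matcher is broken. It does exactly what a keyword mechanism is designed to do, and that is the limitation.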
These are some of the reasons the marketplace has become disenchanted with autotagging. Perhaps all this was more or less acceptable when keywords were the only mechanism available to us, as they have been for the last 30 years. But times have changed (dramatically)!
OK, let’s fast-forward to the present day. What’s new that will help us here?
Let’s talk about “understanding” versus “keywords.” A child learning a new language in school starts with individual words, but quickly becomes conversational by seeing words in context. The more conversations they have, the greater the context and the more fluent they become. What if you made the child read all of Wikipedia, for example? They would learn a lot of different ways (using different words) of saying the same thing (and you might be turned in to Child Protective Services). But they would understand a lot of concepts. When you asked them a question, they would be far better prepared to give you a good answer, wouldn’t they?
Similarly, if you showed a youngster millions of photos you might expect them to get sharper and sharper about first recognizing the objects in the photos, and then over time recognizing what is going on in the image, what famous location might be depicted, what “verbs” might describe the action, etc.
And if you then showed a youngster millions of images or video scenes and at the same time the text that describes them, you might expect they would have an even richer understanding of how words and images relate.
And they would do all this without keywords!
They would have a direct mental link between the words and the images, and they would know different ways of using words to represent the exact same thing, and they would also be able to show you images similar to the one they had just looked at.
Welcome to NOMAD™!
MerlinOne’s internally-developed and tunable NOMAD™ visual search technology completely bypasses the crutch of keywords by using highly-advanced AI to go from your query phrases directly to the visual object. NOMAD™ can do this because of its extensive training on huge quantities (hundreds of millions) of images and associated text. It “knows” what you are looking for. It “knows” the contents of your visual objects. And it can directly map one to the other, with NO reliance on specific words: NOMAD™ truly understands the concept you are looking for. It’s the difference between two people having a conversation using only single words, or having a conversation where each participant can use full sentences. There is simply no comparison between the results.
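This article does not describe NOMAD™'s internals, so the following is not its implementation. It is only a generic Python sketch of the broad idea of matching a query phrase directly to images, with no keyword in between, by comparing vectors in a shared space. Every vector, file name, and function here is made up for illustration; a real system would compute such vectors with trained text and image models.

```python
import math

# Made-up 3-dimensional vectors standing in for learned image embeddings.
IMAGE_VECTORS = {
    "dog_jumping_fence.jpg": (0.9, 0.8, 0.0),
    "man_jumping_dog_watching.jpg": (0.3, 0.8, 0.0),
    "wind_turbines.jpg": (0.0, 0.1, 0.95),
    "minister_press_conference.jpg": (0.1, 0.3, 0.2),
}

# Made-up query vectors; a stand-in for what a text model would produce.
QUERY_VECTORS = {
    "puppy leaping over a fence": (0.85, 0.75, 0.05),
    "alternative energy": (0.0, 0.05, 0.9),
}


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def search(query: str) -> list[str]:
    """Rank all images by similarity to the query vector, best first."""
    q = QUERY_VECTORS[query]
    return sorted(
        IMAGE_VECTORS,
        key=lambda name: cosine(q, IMAGE_VECTORS[name]),
        reverse=True,
    )


print(search("puppy leaping over a fence")[0])  # dog_jumping_fence.jpg
print(search("alternative energy")[0])          # wind_turbines.jpg
```

In this toy setup the query “puppy” needs no matching tag, because closeness in the vector space does the work that an authority list would otherwise have to do.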
NOMAD™ has proven this with numerous customers and millions of still images, and is available now. We have a team moving this technology to video and will have NOMAD™ for video in production in Q1 2022.
Why invest in a bad technology that increases your reliance on a 30+ year old system of keywording and using metadata to search for visual objects when the true value of AI for finding visual objects in your DAM is at your fingertips? The future of visual search is NOMAD™.