The Art of Visual Editing Part 2: Search

In Part 1 we set the grounding for the art of visual editing, established the need to define goals, and set criteria for great visual objects we might want to use. The next step is locating candidate visual objects, and for that we need to Search.

For the last 30 years we have found visual objects by searching for text matches in “metadata”, the descriptive text attached to visual objects that is typically entered by someone… if indeed it is entered at all. If you have a still image database with extensive human-generated metadata, you may have done OK. If you have stills with sparse metadata, or historic video, B-roll, or indeed most video, with next to zero metadata, searching has likely been frustrating or even impossible: you just cannot find the hidden gems. Nothing kills creativity faster than frustration at the onset of a project.

Search Categories

Search generally falls into one of two categories, or a blend of the two. One category is where you have “Named Entities” usually associated with some descriptive terms, i.e., you are looking for a specific person doing something, or a specific landmark, like “Elon Musk gets out of a car,” or “Empire State Building with red and green lights for Christmas.”

If metadata exists (rare in video since no one types in descriptions of each scene, but possible in stills), then the name “Elon Musk” might appear in a caption. But what if the caption says, “arrives at the Tesla factory” instead of “gets out of a car”? Your search for “Elon Musk gets out of a car” would fail. Worse, that same search could return an image with a caption like, “Senator Smith gets out of a car on his arrival for a hearing about Elon Musk,” because every search term appears in it: again, a failure! Similarly, the caption for the Empire State Building photo might read, “The Empire State Building is bathed in seasonal holiday lighting…” So, unless your search terms contain the exact (or extremely similar) terms used in the metadata, you will fail to find what you are looking for. Most likely, you will have to dumb down the query (to just “Elon Musk” or “Empire State Building”) and then vet search results manually.
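To make those failure modes concrete, here is a minimal Python sketch of two common styles of text matching: exact phrase and "all terms present." It is an illustration only, not how any particular DAM system is implemented. The function names are hypothetical, and the first caption is an assumed full version of the "arrives at the Tesla factory" example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_match(query, caption):
    """True only if the query words appear in the caption as one contiguous phrase."""
    q = " ".join(tokenize(query))
    c = " ".join(tokenize(caption))
    return f" {q} " in f" {c} "

def all_terms_match(query, caption):
    """True if every query word appears somewhere in the caption, in any order."""
    return set(tokenize(query)) <= set(tokenize(caption))

# Hypothetical captions modeled on the examples in the text.
wanted = "Elon Musk arrives at the Tesla factory"
unwanted = ("Senator Smith gets out of a car on his arrival "
            "for a hearing about Elon Musk")

query = "Elon Musk gets out of a car"

# The image we want is missed by both styles of matching.
print(phrase_match(query, wanted), all_terms_match(query, wanted))

# The image we do not want matches on terms alone.
print(phrase_match(query, unwanted), all_terms_match(query, unwanted))

# The "dumbed down" query matches both captions, leaving manual vetting.
print(all_terms_match("Elon Musk", wanted), all_terms_match("Elon Musk", unwanted))
```

Neither matching style knows anything about what is actually in the picture; both only compare words, which is why the wording of the caption decides everything.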

The second category of search is where you need to illustrate a concept or a feeling, like “alternative energy,” or “smiling woman out running.” Metadata searches are notoriously poor at these, because if there is metadata it might say, “The Energy Commissioner is seeking a viable alternative way to get home…,” or, “Lisa Smith warms up for a 10k race.” In both examples the metadata search would fail: the first caption matches the search terms but has nothing to do with alternative energy, and the second describes exactly the right kind of scene but shares no words with the query. The “visual understanding” of the scene is almost never captured in metadata.

Leveraging AI for Great Visual Searching

I’d like to make the case that traditional metadata-only search, without recent AI advancements, has crippled us and impeded our ability to do great visual storytelling: we are just so used to it that we do not see the restrictions it has forced on us since day one. Even if you feel your metadata is excellent and has given you fine results, I’d argue that you cannot know what you are missing. Let’s look at the examples above.

AI Visual Search

What if you had a brilliant 13-year-old nephew or niece who had spent their life looking at every visual element on the internet and had perfect recall? Do you think they would have a pretty clear understanding that “alternative energy” might include solar cells, wind turbines, solar powered vehicles, and “No Pollution” signs? Similarly, they would find it easy to identify a “smiling woman out running.” And they would also be able to recognize relatively famous people or places, like Elon Musk or the Empire State Building.

Remember earlier when we said you might get 2,000+ images from a wedding, or a corporate or news event? Picture this (unintended pun): you start paging through those images for the “best” one of someone speaking (or dancing or whatever), going through pages of about 30 at a time. By the time you are part way through, your mind keeps coming back to one that really struck you because it had a bright red banner in the background that really caught your eye. Problem is you think it was somewhere in the first 15 or 20 pages, so now you have to go and search all over again. What if you could just describe the scene, adding “with bright red banner” and your smart 13-year-old assistant (who can search visually and knows what you mean by a “bright red banner”) would bring you right to the image you want?

That bright 13-year-old is how we personify our NOMAD AI Visual Search. So clearly there is the potential for an AI model trained on the breadth of the visual content of the Internet to know the common-sense subjects we might search for!

AI Facial Recognition

What if the person in your search is not all that famous, but important in your company, like your CEO, a college dean, or a major donor? And let’s say that, unfortunately, there are almost no captions on the still images and videos in your DAM system. If you had an AI engine that could recognize faces, and you told it that on a specific image the person on the left was your CEO Susan Smith, might the AI visual engine be able to identify Susan Smith in all your other images and videos and add her name to the metadata for each, making them easily discoverable? Would that be a big help in future searches? How much time would it save over going through thousands of images and manually tagging Susan Smith in every occurrence?

AI Visual Similarity

What if you found a photo or video that was a really close, but not quite perfect, match for what you are looking for? What if another AI engine would let you search for “more like this,” meaning visual objects that look a lot like your selection? Couldn’t that find you otherwise undiscoverable objects that might be perfect for your project, but might not have enough metadata for you to find them with a traditional metadata-only search?
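One common way to build a “more like this” search, and only an assumption about how any given product does it, is to have a model turn each image into a list of numbers (an “embedding”) and then rank the other images by how closely their numbers point in the same direction (cosine similarity). The sketch below uses tiny hand-made vectors and hypothetical file names in place of a real model’s output.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two equal-length vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def more_like_this(selected_id, embeddings, top_k=3):
    """Return the top_k other images most similar to the selected one."""
    target = embeddings[selected_id]
    scored = [
        (other_id, cosine_similarity(target, vector))
        for other_id, vector in embeddings.items()
        if other_id != selected_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical embeddings; a real model would produce hundreds of dimensions.
embeddings = {
    "runner_park.jpg": [0.9, 0.1, 0.0],
    "runner_beach.jpg": [0.8, 0.2, 0.1],
    "wind_turbine.jpg": [0.0, 0.1, 0.9],
    "solar_farm.jpg": [0.1, 0.0, 0.8],
}

for image_id, score in more_like_this("runner_park.jpg", embeddings, top_k=2):
    print(image_id, round(score, 3))
```

Notice that no caption or keyword is involved anywhere: the comparison is made purely on what the images look like, which is why it can surface objects with no metadata at all.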


Another common problem for visual editors is that they find a selection of candidate objects, but are torn over which one is the very best one to use (ask me how many hours I have agonized over which of two images to pick!). What if there was an AI engine trained on tens of thousands of images selected by world-class visual editors who were tasked with finding great images and horrible images, and grading them on both aesthetic and technical scales? Such a model could “recognize” which images in your set of candidates are the best and which are the worst, and sort them in that order. You can still, of course, pick whatever you want, but the AI ranking would give you an indication of which images have the best impact and would be most likely to capture and hold your viewer’s attention.
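As a rough sketch of the sorting step only, suppose such a model has already given each candidate an aesthetic score and a technical score between 0 and 1. The scores, file names, and the idea of blending the two scales with a simple weight are all assumptions made for illustration; a real engine may combine them quite differently.

```python
def rank_candidates(scores, aesthetic_weight=0.5):
    """Sort image names from best to worst by a weighted blend of two scores.

    scores maps an image name to an (aesthetic, technical) pair.
    aesthetic_weight is a hypothetical knob: 1.0 ranks on aesthetics alone.
    """
    def combined(item):
        aesthetic, technical = item[1]
        return aesthetic_weight * aesthetic + (1 - aesthetic_weight) * technical

    ranked = sorted(scores.items(), key=combined, reverse=True)
    return [name for name, _ in ranked]

# Hypothetical model output for three candidate images of a speaker.
candidates = {
    "speaker_a.jpg": (0.82, 0.90),
    "speaker_b.jpg": (0.91, 0.60),
    "speaker_c.jpg": (0.40, 0.95),
}

print(rank_candidates(candidates))
print(rank_candidates(candidates, aesthetic_weight=1.0))
```

The ranking is a suggestion, not a verdict: changing the weight changes the order, and the editor still makes the final pick.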

AI Brings Huge Improvements over Metadata-only Searching

These are four examples of ways AI can enhance the discoverability of visual objects and help you select the best, even if they have little or no metadata, or you simply happened to choose search terms different from what their metadata happened to have.

None of these examples are possible with a strictly metadata-based system.

(Continued in Part 3: Running and Visual Search)