As audio and video assets have become more commonplace in business and everyday life (YouTube, voicemail, company videos, news media interviews), we recognized a need to make them useful in a digital asset management system.
How do you make a video or audio file searchable without paying someone to type out a transcript (an expensive and time-consuming process)? Or do you have to sit someone down to view hours and hours of video?
About 5 years ago we looked at this question from the point of view of some of our media customers. Until then the only way to search for a video file was by its filename, which was frequently a cryptic, machine-generated, useless label. We wondered if there was some way to make the spoken words INSIDE the video or audio file available without human labor, so we started experimenting.
What we came up with was a process that extracted the audio track from a video file and fed it through a speech-to-text engine. The text output was then “indexed” by our search engine against timecode (meaning how many seconds into the video or audio file a particular word was spoken). Most text search engines are built to index a word by its position in a document, for example as the 514th word from the beginning (similar to the book index we mentioned back when we discussed searching), so this took some innovation in software.
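To make the idea of indexing against timecode concrete, here is a minimal sketch in Python. It is not our production code: the transcript format (a list of word and seconds pairs) and the function name `build_timecode_index` are assumptions made up for illustration, and a real speech-to-text engine will have its own output format.

```python
from collections import defaultdict

def build_timecode_index(transcript):
    """Map each spoken word to the times (in seconds) at which it occurs.

    `transcript` is a list of (word, seconds) pairs. This format is a
    hypothetical stand-in for whatever a speech-to-text engine emits.
    """
    index = defaultdict(list)
    for word, seconds in transcript:
        # Lowercase the word so that "The" and "the" share one entry.
        index[word.lower()].append(seconds)
    return dict(index)

# Made-up sample data: each word paired with its offset into the recording.
transcript = [
    ("Welcome", 0.4), ("to", 0.9), ("the", 1.1),
    ("budget", 1.3), ("meeting", 1.8),
    ("The", 62.0), ("budget", 62.3),
]

index = build_timecode_index(transcript)
print(index["budget"])
```

The difference from an ordinary text index is only in what gets stored: a time offset into the recording instead of a word position in a document.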
We also had to make a smaller version of the original video file, since we wanted to give you the ability to go right to the point where your search term was spoken and actually view that part of the video over your network, starting 5 seconds before your word appears (so you get a bit of context). Most video files are too big to deliver over a normal Internet connection, so we had to create a streaming “proxy” for each one. Of course there have been many other little discoveries that have led to improvements as well.
The end result is the ability to type in a search term, and if it appears in a caption, or a Word document, or a spreadsheet, or a video or audio file, it is at your fingertips. If the hit is in a video or audio file, just click on it and the video is automatically cued up 5 seconds before your term appears, so you can quickly decide whether it is important to you.
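The 5-second lead-in is simple arithmetic, sketched below in Python. The names `LEAD_IN_SECONDS`, `playback_start`, and `cue_points` are hypothetical and chosen only for this example. The one detail worth noting is that the start time is clamped at zero, so a word spoken in the first few seconds starts playback at the beginning of the file.

```python
LEAD_IN_SECONDS = 5.0  # context played before the search term is spoken

def playback_start(hit_seconds, lead_in=LEAD_IN_SECONDS):
    """Return where playback should begin for a hit at `hit_seconds`."""
    # Never seek before the start of the file.
    return max(0.0, hit_seconds - lead_in)

def cue_points(hit_times, lead_in=LEAD_IN_SECONDS):
    """Return one playback start time for each hit time."""
    return [playback_start(t, lead_in) for t in hit_times]

# Made-up hit times (in seconds) for one search term in one video.
starts = cue_points([1.5, 62.5])
print(starts)
```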
Is it perfect? No. There are some things even advanced technology cannot deal with, so if a jackhammer or other noise source is obscuring the audio, the speech will likely not be recognized. If there is loud music playing, the speech-to-text process will not work as well as the human ear. But if the audio is clean, the accuracy is between 80 and 95%, and that can be an extremely time-saving feature in digital asset management.
Posted by David Tenenbaum
Flickr photo by AndyRobertsPhotos