During the recent SemTech SF event I met with Thomas Steiner of Google Italy to discuss a recent project entitled “SemWebVid.” In addition to working for Google, Steiner is working on his Ph.D. at UPC.edu in Barcelona, Spain. I found Steiner’s project fascinating because it provides a glimpse of how semantic technology will change the way we view, find, and interact with online video in the future.

The SemWebVid project is part of a European Union research project entitled “I-SEARCH.” Google is one of several industry sponsors of this project. All of the research is being done “in the open,” and the findings are available to the public via published papers.

I-SEARCH is an interesting project taking place in Europe. The project’s website states its aim as:

“The aim of the I-SEARCH project is the development of the first search engine able to handle a wide range of specific types of multimedia and multimodal content (text, 2D image, sketch, video, 3D objects, audio and combination of the above), which can be used as queries and retrieve any available relevant content of any of the aforementioned types.”

Steiner began by discussing how video formats have evolved on the web. The web started with animated .gif images, and then, in April of 1995, RealPlayer was introduced as one of the first players capable of streaming media over the internet. There were also media players like QuickTime and Adobe Flash. Now there are open video formats such as WebM. WebM’s goal is to provide a video format that is free and open to everyone on the web.

Steiner then went on to describe the evolution of HTML and video. In the beginning, he said, “there was the use of the embed object.” An example of the embed element and the noembed element is shown below:

Fig. 1.1 – Example of the <embed> element.

Later, the “object” container was introduced. The object container could hold the embed elements and describe parameters like full-screen mode and script access. Until the beginning of 2011, YouTube used the object container. Now, with HTML5, YouTube supports the native <video> element. Using the HTML5 video element, a video can be described as simply as:

Fig. 1.2 Simple example of the HTML5 Video Element

In Fig. 1.2, the attribute poster="cute.jpg" refers to a still image that is displayed before the video starts playing.

Steiner went on to explain that with the introduction of HTML5 we need “fallbacks all over the place to support previous descriptions of video and formats and all the available browsers.” Because of these issues, we move from a clean description, like the one shown in Fig. 1.2, to the one shown in Fig. 1.3 below.

Fig 1.3 – Video element in HTML5 with Fallbacks.

To support all types of codecs, content producers need to produce their video in all the available formats. A few of the common ones are shown above in Fig. 1.3.
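To make the fallback idea concrete, here is a minimal Python sketch that generates a <video> element with one <source> child per encoding, so that a browser can pick the first format it is able to play. It is not a reproduction of Steiner's figure: the helper name `video_element` and the file names other than cute.jpg are assumptions made for illustration.

```python
from html import escape


def video_element(poster, sources,
                  fallback_text="Your browser does not support HTML5 video."):
    """Render a <video> element with one <source> child per encoding.

    `sources` is a list of (url, mime_type) pairs, listed in order of
    preference. The fallback text is only shown by browsers that do not
    understand the <video> element at all.
    """
    lines = [f'<video poster="{escape(poster, quote=True)}" controls>']
    for url, mime_type in sources:
        lines.append(
            f'  <source src="{escape(url, quote=True)}" '
            f'type="{escape(mime_type, quote=True)}">'
        )
    lines.append(f"  {escape(fallback_text)}")
    lines.append("</video>")
    return "\n".join(lines)


# Hypothetical file names: the same clip encoded three times.
markup = video_element(
    "cute.jpg",
    [
        ("cat.webm", "video/webm"),
        ("cat.mp4", "video/mp4"),
        ("cat.ogv", "video/ogg"),
    ],
)
print(markup)
```

Each extra format adds one more line of markup and one more file that the producer has to encode and host, which is the cost Steiner was pointing at.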

Steiner then explained what “the next big challenges are for video content.” The goals, Steiner said, are “to make video more accessible, searchable, skimmable, and enjoyable.” He added that “these challenges are what drive the SemWebVid project.”

SemWebVid is short for Semantic Web Video. SemWebVid, says Steiner, “is a way to automagically annotate video using RDF.” The project is currently tailored towards YouTube videos but can be generalized to any video portal. SemWebVid is based around video metadata like titles, descriptions, tags, and closed captions. It consumes this metadata from YouTube and then semantically enhances it.

Steiner explained how SemWebVid uses open Natural Language Processing (NLP) services like OpenCalais, Zemanta, and Alchemy to enrich the data and extract entities. Figure 1.4, below, shows what this space currently looks like. All of the entities link back to the Linked Open Data (LOD) cloud. The entities are described by open services like the DBpedia API and the Freebase API.

 

Fig 1.4 – Open NLP services used by SemWebVid and the LOD cloud.

 

SemWebVid first feeds the title and description of a video into the open NLP engine. Then the subtitles and captions are fed into the NLP engine. Finally, SemWebVid submits the tags to the LOD cloud for URI lookup. Using this data, SemWebVid obtains a list of extracted named entities.
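The three steps above can be sketched in a few lines of Python. This is only an illustration of the flow, not SemWebVid's code: the real NLP services and LOD lookups are replaced here by a tiny hand-written table of labels and DBpedia URIs, and the names `GAZETTEER`, `extract_entities`, `lookup_tag`, and `annotate` are hypothetical.

```python
# Stand-in for the NLP services and the LOD cloud: a tiny label-to-URI table.
GAZETTEER = {
    "barack obama": "http://dbpedia.org/resource/Barack_Obama",
    "white house": "http://dbpedia.org/resource/White_House",
    "chicago": "http://dbpedia.org/resource/Chicago",
}


def extract_entities(text, gazetteer):
    """Pretend NLP call: return the URIs whose label occurs in the text."""
    lowered = text.lower()
    return {uri for label, uri in gazetteer.items() if label in lowered}


def lookup_tag(tag, gazetteer):
    """Pretend LOD lookup: resolve a single tag to a URI, or None."""
    return gazetteer.get(tag.lower())


def annotate(video, gazetteer):
    """Run the three steps in order and merge the resulting entities."""
    entities = set()
    # Step 1: title and description.
    entities |= extract_entities(
        video["title"] + " " + video["description"], gazetteer
    )
    # Step 2: subtitles and captions.
    entities |= extract_entities(" ".join(video["captions"]), gazetteer)
    # Step 3: tags, resolved one by one.
    for tag in video["tags"]:
        uri = lookup_tag(tag, gazetteer)
        if uri is not None:
            entities.add(uri)
    return sorted(entities)


video = {
    "title": "Barack Obama press conference",
    "description": "Remarks at the White House.",
    "captions": ["Good afternoon.", "I grew up in Chicago."],
    "tags": ["Chicago", "politics"],
}
print(annotate(video, GAZETTEER))
```

Collecting the results in a set means an entity found in both the captions and the tags is reported only once.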

When discussing the future of SemWebVid, Steiner said, “The long term objective is to store all the RDF information for videos and then make it accessible via a triple store.”

One of SemWebVid’s important processes is aligning the entities extracted from a video with the video’s timeline. For example, if Barack Obama is discussed one minute into the video, a picture of him or a link to a URL needs to be associated with the stretch of time during which he is being discussed. According to Steiner, “there are two modeling approaches and right now there is still no winner.”
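As a rough illustration of what this alignment involves, the Python sketch below assumes that the closed captions come with start and end times in seconds and simply records the intervals in which an entity's label is mentioned, merging captions that follow each other without a gap. How SemWebVid actually performs this step was not covered in our conversation; the `Caption` type, the `align` function, and the sample captions are hypothetical.

```python
from typing import NamedTuple


class Caption(NamedTuple):
    start: float  # seconds from the beginning of the video
    end: float
    text: str


def align(captions, labels):
    """Map each entity label to the (start, end) intervals where it is mentioned.

    Intervals from captions that touch or overlap are merged into one.
    """
    timeline = {}
    for caption in sorted(captions):  # sorts by start time first
        lowered = caption.text.lower()
        for label in labels:
            if label.lower() not in lowered:
                continue
            spans = timeline.setdefault(label, [])
            if spans and caption.start <= spans[-1][1]:
                # Continues the previous mention: extend that interval.
                spans[-1] = (spans[-1][0], max(spans[-1][1], caption.end))
            else:
                spans.append((caption.start, caption.end))
    return timeline


captions = [
    Caption(60.0, 64.0, "President Barack Obama arrived."),
    Caption(64.0, 70.0, "Barack Obama then spoke about jobs."),
    Caption(120.0, 125.0, "Later, Barack Obama left Chicago."),
]
print(align(captions, ["Barack Obama", "Chicago"]))
```

With intervals like these in hand, a picture or link for the entity can be shown for exactly the seconds in which it is being discussed.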

Variant one uses media fragment URIs. Media fragment URIs are defined by the W3C’s Media Fragments Working Group. The goal of this group is to develop standards that allow us to describe specific sections of media content. Or, in the W3C’s words, “address temporal and spatial media fragments in the Web using Uniform Resource Identifiers (URI).” For example, by using a temporal description such as t=10,20 (from 10 to 20 seconds) we can address a specific section of the video, and then by using the label “ctag” we can give a specific URI or description of the entity being discussed in the video at the specified time. The URI may, for example, link out to the LOD cloud.
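A small Python sketch shows how such a temporal fragment can be built and paired with an entity. It uses the `#t=start,end` form from the W3C Media Fragments specification, with times in seconds. The video URL is a placeholder, and pairing the fragment with a DBpedia URI in a plain tuple is a simplification of the “ctag” annotation described above.

```python
def format_seconds(value):
    """Format seconds without trailing zeros, e.g. 10 -> '10', 10.5 -> '10.5'."""
    return f"{value:.3f}".rstrip("0").rstrip(".")


def temporal_fragment(video_url, start, end):
    """Return a media fragment URI addressing [start, end) seconds of the video."""
    if start < 0 or end <= start:
        raise ValueError("need 0 <= start < end")
    base = video_url.split("#", 1)[0]  # drop any existing fragment
    return f"{base}#t={format_seconds(start)},{format_seconds(end)}"


# Placeholder video URL; the entity URI points into the LOD cloud.
fragment = temporal_fragment("http://example.org/video.webm", 10, 20)
annotation = (fragment, "http://dbpedia.org/resource/Barack_Obama")
print(annotation)
```

The appeal of this variant is that the time range lives in the URI itself, so the annotated section of the video can be linked to like any other web resource.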

The second variant uses the “event ontology” to describe specific sections of a video. Figure 1.5 shows an example of how events can be used. This variant starts by describing a timeline at the top and ends with the “event factor,” which is the label, or the URI, that represents the entity in the specified stretch of the video’s timeline. Steiner said, “It is still not clear which ontology will be best long-term to describe video.”
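For readers who have not seen this style of modeling, the Python sketch below prints a small Turtle description in the same spirit: a timeline first, then an event with a time interval, and finally the factor. It is not a copy of Steiner's figure. The term names (event:Event, event:time, event:factor, tl:TimeLine, tl:Interval, tl:onTimeLine, tl:beginsAtDuration, tl:durationXSD) are given as I understand the Event and Timeline ontologies and should be checked against their specifications; the video URI and the function name `event_turtle` are placeholders.

```python
# Namespace URIs of the Event and Timeline ontologies (as I understand them).
EVENT = "http://purl.org/NET/c4dm/event.owl#"
TL = "http://purl.org/NET/c4dm/timeline.owl#"


def event_turtle(video_uri, start, length, entity_uri):
    """Describe `length` seconds of a video, beginning at `start`, as an event.

    `start` and `length` are whole seconds. The entity URI becomes the
    event's factor.
    """
    return "\n".join([
        f"@prefix event: <{EVENT}> .",
        f"@prefix tl: <{TL}> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "",
        # The timeline of the video comes first ...
        f"<{video_uri}#timeline> a tl:TimeLine .",
        "",
        # ... then the event, its interval on that timeline, and the factor.
        f"<{video_uri}#event1> a event:Event ;",
        "    event:time [",
        "        a tl:Interval ;",
        f"        tl:onTimeLine <{video_uri}#timeline> ;",
        f'        tl:beginsAtDuration "PT{start}S"^^xsd:duration ;',
        f'        tl:durationXSD "PT{length}S"^^xsd:duration',
        "    ] ;",
        f"    event:factor <{entity_uri}> .",
    ])


turtle = event_turtle(
    "http://example.org/video",  # placeholder video URI
    60,
    10,
    "http://dbpedia.org/resource/Barack_Obama",
)
print(turtle)
```

Compared with the media fragment variant, the time range here is expressed as explicit RDF statements rather than inside the URI.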

Fig. 1.5 – An example of the “event ontology” to describe specific sections of a video

Steiner finished up the discussion by showing me how SemWebVid can help make your videos skimmable using “entity depiction” within the timeline of the video.

So how is all of this useful? Imagine you have a 2-hour video and you want to skim it, see what is discussed, and watch a specific section. With this technology you can quickly skim the video, look at images of all the things being described, and then click to watch a specific section.

A live demo of Steiner’s software can be found at Steiner’s URL. In the demo you will see how SemWebVid extracts the entities from a YouTube video and then allows you to view them and convert them to different formats. Examples are shown in the last two images below.


Fig 1.7 – Screen shot of SemWebVid showing different conversion options

 

Fig. 1.8 – Screen Shot of SemWebVid Showing Entity Extraction


About the Author:
Sean Golliher is founder and publisher of the peer-reviewed Search Marketing Research Journal SEMJ.org. Sean holds four engineering patents, has a B.S. in physics from the University of Washington in Seattle, and a master’s in electrical engineering from Washington State University. He is also president and director of search marketing at Future Farm, Inc., Bozeman, MT, where he focuses on search marketing and internet research and consults for large companies. He has appeared and been interviewed on well-known blogs and radio stations such as Clickz.com, Webmasterradio.com, and SEM Synergy. He was featured in a radio interview on SEM Synergy with representatives from eBay discussing the future of affiliate marketing. To maintain a competitive edge he reads search patents and papers and attends search marketing conferences on a regular basis.