— PHILIP DUDCHUK, SERGE MINOR


Executive Summary

We introduce a web application that integrates the core idea of social bookmarking with semantic components that enhance search and navigation and overcome the drawbacks of collaborative tagging.

Social Bookmarking and its Limits

Social bookmarking services such as StumbleUpon, Del.icio.us and Diigo have proved to be relatively successful Web 2.0 applications. These services allow users to tag web pages or page fragments and to explore bookmarks created by others. For instance, StumbleUpon claims to have over 6.8 million members as of January 2009. According to the eBizMBA article ‘Top 25 Web 2.0 Sites’, published in December 2008, Del.icio.us had 1,699,128 monthly visitors in the U.S. (Compete estimate).

However, such services have several major drawbacks.

First, most familiar social bookmarking systems face the problem of uncontrolled growth of the tag set. This is quite natural: people tend to mistype their tags, to use different tags for the same thing and, conversely, to use the same tag for different things. What is needed is a system that naturally persuades users to employ uniform tags, while still allowing them to introduce their own tags if necessary.

Many existing systems try to overcome this problem by introducing lists of recommended tags. These lists are mostly drawn from the tags previously created for the given page by other users. Obviously, such systems do not work well for recent pages (e.g. the latest news), because these pages may never have been tagged by anyone. This is the second problem.

Furthermore, even the same user can occasionally forget the meaning of her own tag. To recover the meaning of the ‘lost’ tag, the user is forced to review all her bookmarks labeled with that tag. This problem could be solved by giving the user an easy overview of the basic semantic content of the bookmarks labeled with the tag.

Finally, the classic (multiple) tag search proves to be very limited. It does not take into account the semantic content of the bookmarks labeled with a tag. For example, in Diigo the tag search query ‘Hilton’ returns bookmarks both for the hotel and for the person Paris Hilton. This problem could also be solved if the system allowed the user to refine the query by including information about the tag’s semantic context.

As Gartner puts it, ‘any approach that requires significant user tagging is suspect, because users often fail to use tags or use them improperly.’ [1]

The Semantic Solution

Our approach integrates classic social bookmarking with semantic technology, to the mutual benefit of both domains. The basic idea is to combine entity extraction techniques with familiar collaborative tagging to considerably enhance lists of recommended tags. A further benefit of our approach is that it allows the user to navigate through both layers of metadata: user tags and entities identified on the tagged pages. This is accomplished by placing both named entities and tags within a common ontology.

Our solution is in line with Gartner’s idea to ‘use pre-defined vocabularies or ontologies during the course of scanning a document to automatically generate semantic information about the document’, which ‘can automatically be tagged to a document, presented to a human for review before tagging, or generated again when the need arises.’ [2]

There are three crucial ingredients of our solution:

  • A patented NLP engine which performs semantic analysis of web pages
  • A set of algorithms which calculate the list of recommended tags relying on entities extracted from a given page and on user tags added for semantically similar pages
  • A web application which allows the user to navigate through the metadata pertaining to all pages tagged by the user community
     

NLP Core

The dedicated natural language processing (NLP) engine called OntosMiner™ takes web pages or page fragments, analyzes their content and identifies objects (i.e. named entities) and semantic relations between them (i.e. facts about the objects of interest). Objects include names of organizations and people, geographical regions, towns, etc. Identifiable relations include invest, own, employ, mention, etc. Each object and relation is identified by means of special algorithms based on a system of syntactic and semantic rules pertaining to a given natural language (English, German and Russian at the moment).

Fig. 1 illustrates the process of text analysis performed by OntosMiner.

Fig. 1. Input/output of the NLP engine

For each extracted fact, the system records not just the document URL, but also the related text fragment.

Next, the output of the analysis of individual texts is consolidated, and identical objects are merged. Ultimately, a common graph representing the whole collection of documents is formed. This graph is then stored in a dedicated knowledge base (RDF Store).
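
The consolidation step can be illustrated with a minimal sketch. It is not the OntosMiner implementation: the `Fact` and `KnowledgeGraph` classes, the `consolidate` function and the rule that two objects are identical when their normalized names match are all assumptions made for illustration, and a plain in-memory dictionary stands in for the RDF store.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    """One extracted relation with its provenance (hypothetical schema)."""
    subject: str
    relation: str
    obj: str
    url: str        # document the fact was found in
    fragment: str   # text fragment that supports the fact


@dataclass
class KnowledgeGraph:
    """Common graph over a document collection (stand-in for the RDF store)."""
    # canonical object key -> surface names seen in the documents
    objects: dict = field(default_factory=dict)
    # (subject key, relation, object key) -> list of (url, fragment) records
    edges: dict = field(default_factory=dict)

    @staticmethod
    def key(name: str) -> str:
        # Assumption: identical objects are detected by normalized name only.
        return " ".join(name.lower().split())

    def add(self, fact: Fact) -> None:
        s, o = self.key(fact.subject), self.key(fact.obj)
        self.objects.setdefault(s, set()).add(fact.subject)
        self.objects.setdefault(o, set()).add(fact.obj)
        self.edges.setdefault((s, fact.relation, o), []).append(
            (fact.url, fact.fragment)
        )


def consolidate(per_document_facts):
    """Merge the facts of every analyzed document into one graph."""
    graph = KnowledgeGraph()
    for facts in per_document_facts:
        for fact in facts:
            graph.add(fact)
    return graph


# Two documents mention the same organization with different spelling.
doc_a = [Fact("Acme Corp", "employ", "Jane Doe",
              "http://example.com/a", "Acme Corp hired Jane Doe")]
doc_b = [Fact("ACME  Corp", "employ", "Jane Doe",
              "http://example.com/b", "Jane Doe works at ACME Corp")]
graph = consolidate([doc_a, doc_b])
```

In this example both documents contribute to a single edge between two merged objects, and the edge keeps the URL and text fragment of each source, mirroring the provenance recorded for every extracted fact.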

Unifying Tag Sets

As in other systems, the bookmarking functionality is implemented as a browser plug-in. The user can add a bookmark to a web page or a page fragment. Once the user selects the page or fragment she is going to bookmark, the system recommends tags, which include the most semantically salient entities from the selected text and tags created by other users for pages with similar semantic content. The similarity of pages is calculated on the basis of the sets of objects and relations extracted by the NLP engine. Nevertheless, the user can always add a new tag not listed among the recommended ones.
 

Fig. 2. Adding a bookmark to a page fragment

In this way, the user gets a list of relevant recommended tags even for new pages. The existence of a rich list of semantically salient tags for every page increases the probability that users will select similar tags for similar pages. This ensures a new level of consistency and uniformity across tag sets employed by the user community.
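
The recommendation step described in this section can be sketched as follows. The text does not specify the actual algorithms, so everything here is an assumption made for illustration: the function names, the use of Jaccard similarity over sets of extracted features, the salience scores and the `min_similarity` threshold.

```python
from collections import Counter


def jaccard(a: set, b: set) -> float:
    """Overlap of two feature sets (objects and/or relations)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_tags(page_features, salience, tagged_pages,
                   top_n=5, min_similarity=0.2):
    """Rank candidate tags for a page or fragment (hypothetical scoring).

    page_features: set of objects extracted from the selection
    salience:      dict mapping an object to its salience score
    tagged_pages:  list of (feature_set, tags) for bookmarked pages
    """
    scores = Counter()
    # 1. Salient objects of the page itself are candidate tags,
    #    so even a page nobody has tagged yet gets recommendations.
    for feature in page_features:
        scores[feature.lower()] += salience.get(feature, 0.0)
    # 2. Tags that other users gave to semantically similar pages,
    #    weighted by how similar those pages are.
    for features, tags in tagged_pages:
        similarity = jaccard(page_features, features)
        if similarity >= min_similarity:
            for tag in tags:
                scores[tag.lower()] += similarity
    return [tag for tag, _ in scores.most_common(top_n)]


page = {"Yahoo!", "Sunnyvale"}
salience = {"Yahoo!": 0.9, "Sunnyvale": 0.4}
tagged_pages = [
    ({"Yahoo!", "Sunnyvale", "Microsoft"}, ["jobs", "career"]),
    ({"Hilton"}, ["hotels"]),
]
recommended = recommend_tags(page, salience, tagged_pages)
```

Here the tags of the first bookmarked page are recommended because it shares two of three objects with the selection, while `hotels` is dropped because its page has nothing in common with it.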

Semantic Navigation

The browser plug-in highlights objects and bookmark anchors on web pages where they are available. Clicking on a highlighted object or a bookmark anchor opens a mini-browser that allows the user to navigate over entities and semantic relations extracted from various web resources, as well as over tags and bookmarks created by friends or by all users.

For instance, when the user clicks on the highlighted object Wisconsin, she gets the list of semantic relations and tags associated with this object (in Fig. 3 there is the relation Locates/Represents and five related tags associated with Wisconsin). A tag is associated with the objects that occur most frequently on pages labeled with this tag.

Fig. 4. Exploring a tag

The user can choose one of the tags and get a list of related objects (in Fig. 4 the tag books was selected, and the list of related objects includes various commercial and educational organizations, persons, locations, products, etc.). The user can then select an object again to get its associated tags and relations, and so on.

The contextual association of tags with named entities allows the user to easily recover the semantic content of her forgotten tags. Instead of getting buried in tens or hundreds of bookmarks, the user quickly reviews the list of the most salient objects connected to the tag.
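
The association between a tag and its objects can be sketched as a simple frequency count. This is only an illustration under stated assumptions: the bookmark dictionaries, the `objects_for_tag` name and the choice to count each object once per page are hypothetical, not the system's actual data model.

```python
from collections import Counter


def objects_for_tag(tag, bookmarks, top_n=3):
    """Objects occurring most frequently on pages labeled with `tag`.

    Assumption: an object is counted once per bookmarked page,
    however many times it is mentioned on that page.
    """
    counts = Counter()
    for bookmark in bookmarks:
        if tag in bookmark["tags"]:
            counts.update(set(bookmark["objects"]))
    return [obj for obj, _ in counts.most_common(top_n)]


bookmarks = [
    {"tags": ["books"], "objects": ["Amazon", "Penguin", "London"]},
    {"tags": ["books", "shopping"], "objects": ["Amazon", "Penguin"]},
    {"tags": ["books"], "objects": ["Amazon"]},
    {"tags": ["travel"], "objects": ["London", "Hilton Hotels"]},
]
salient = objects_for_tag("books", bookmarks, top_n=2)
```

Reviewing such a short list is what lets the user recall what a forgotten tag was about without opening every bookmark.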

Note that at each step of the navigation the user also gets lists of links to the pages most relevant to the selected entity or tag.

Enhanced Search Capability

Another user benefit of our solution is the ability to combine tag search with search over named entities or facts about them. Returning to the Hilton ambiguity mentioned at the beginning, the user can now select either the person or the hotel among the named entities connected to the tag ‘Hilton’ and get much more accurate search results.

There is also a converse scenario. Imagine that the user is looking for job opportunities at Yahoo! First she selects the named entity Yahoo! and browses the tags associated with it. She easily identifies all the tags she is interested in (jobs, job, career, job opportunities, etc.). By including the object Yahoo! and all the relevant tags in the search query, the user gets all the bookmarks she is looking for. Note that this has two major advantages over the classic multiple tag search:

  • the user does not have to guess which tags she should include in the search query
  • the search results include not only the bookmarks labeled with the tag ‘Yahoo!’, but all the bookmarks containing Yahoo! as a semantically salient named entity, so the search recall is increased.
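
Both scenarios above can be sketched as one combined query. The bookmark records, the `search` function and its matching rules are assumptions made for illustration; the text does not describe the actual query mechanism.

```python
def search(bookmarks, entity=None, tags=()):
    """Return URLs of bookmarks matching an entity and/or any of the tags.

    Hypothetical matching rules:
    - entity: matches when it is among the bookmark's extracted entities
      or when the bookmark carries a tag equal to the entity's name
    - tags:   matches when the bookmark carries at least one of them
    """
    wanted = {t.lower() for t in tags}
    results = []
    for bookmark in bookmarks:
        bookmark_tags = {t.lower() for t in bookmark["tags"]}
        if entity is not None:
            in_entities = entity in bookmark["entities"]
            in_tags = entity.lower() in bookmark_tags
            if not (in_entities or in_tags):
                continue
        if wanted and not (wanted & bookmark_tags):
            continue
        results.append(bookmark["url"])
    return results


bookmarks = [
    {"url": "u1", "tags": ["Hilton"], "entities": {"Paris Hilton"}},
    {"url": "u2", "tags": ["hilton", "travel"], "entities": {"Hilton Hotels"}},
    {"url": "u3", "tags": ["jobs"], "entities": {"Yahoo!"}},
    {"url": "u4", "tags": ["Yahoo!", "career"], "entities": set()},
    {"url": "u5", "tags": ["jobs"], "entities": {"Google"}},
]

ambiguous = search(bookmarks, tags=["hilton"])
hotel_only = search(bookmarks, entity="Hilton Hotels", tags=["hilton"])
yahoo_jobs = search(bookmarks, entity="Yahoo!", tags=["jobs", "career"])
```

In this toy data the plain tag query returns both Hilton bookmarks, adding the entity narrows it to the hotel, and the Yahoo! query finds the bookmark that mentions the company without being tagged with its name.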

Conclusion

Thus the approach described above aims to unite the benefits of natural language processing techniques with those of classic social bookmarking in a semantically based social web application. The key competitive advantages are the following:

  • the new tag recommendation algorithms prevent the uncontrolled growth of the overall user tag set and mitigate the common problems of typos, synonymy, polysemy, etc.
  • recommended tags are generated not only for previously bookmarked pages, but also for recent pages not yet labeled by any user
  • since tags are associated with specific named entities, the user can easily recover the semantic content of forgotten tags
  • the user can navigate through the semantic metadata at the level of both tags and identified named entities and facts
  • our system allows the user to combine tags and named entities in search queries, which results in more accurate and complete search output.

1. David W. Cearley, Whit Andrews, Nicholas Gall. Finding and Exploiting Value in Semantic Technologies on the Web. Gartner Research, 9 May 2007. ID Number: G00148725.
2. Ibid.