— YAKOV KAMEN, TONY GINZBOURG

The RELETON platform was developed to find deep relationships between keywords and to automatically analyze web pages and other types of unstructured text documents. The platform was originally released in January 2006, and as of today it draws on the analysis of over 1.1 billion web pages and over 90 billion actual user search queries.

In this article we provide a brief overview of the RELETON Platform: its architecture, main components, basic services, and several typical applications.

PLATFORM ARCHITECTURE
The RELETON platform consists of the following components:

  • a linearly scalable Keyword Data Center for real-time data processing
  • a database of approximately 50 million of the most popular keywords and over 1 billion keyword proximity relationships
  • a set of technologies for keyword clustering and keyword proximity computation, keyword categorization, document analysis, intelligent web page parsing, keyword and document rating, and sentiment and product extraction
  • databases of adult/sexual keywords, legal and illegal drugs, and tobacco and alcohol products

KEYWORD DATA CENTER
RELETON’s Keyword Data Center is a system of identical, fully interchangeable hardware servers, or nodes. Each node is an 8-core PC with a 500 GB HDD and 32 GB of RAM. The software configuration of each node includes Linux, MySQL, the RELETON Keyword and Keyword Relationship databases, and proprietary technologies implementing algorithms for semantic cluster generation, categorization, arbitrage, rating generation, etc. All nodes are statistically equivalent and interchangeable.

KEYWORD CLUSTERS AND PROXIMITY SCORES
The Keyword and Proximity Relationship Databases are the most important components of the RELETON Platform. In the RELETON Platform, a keyword is defined as a set of words and/or symbols representing a legitimate search request made to an internet search engine. Each keyword is associated with a set of semantically close (relevant) keywords, or neighbors. Each neighbor is assigned a number between 0 and 100, called a proximity score, which measures the closeness between the original keyword and that neighbor. The score is computed as a complex measure of co-occurrence between the two keywords across more than a billion web pages. A proximity score of zero means that the two keywords never co-occur within the analyzed web pages; a proximity score of 100 means that the two keywords co-occur in every document in which either one is present, and can therefore be considered semantically equivalent. The set of semantic neighbors of a given keyword is called a keyword cluster, and the minimal proximity score applied to all clusters is called the proximity threshold. Any neighbor keyword that does not meet the proximity threshold is withheld from the keyword cluster.

After three years of operation, the RELETON Database has collected approximately 50 million of the most popular search keywords and over a billion keyword neighbors and proximity scores. Table 1 below shows part of the semantic cluster for the keyword “British agent 007,” with neighboring keywords ordered by their proximity scores.

Table 1. Some Semantic Neighbors of the Keyword “British Agent 007”
Neighbor             Proximity Score
British Agent 007    100
James Bond           98
James Bond 007       96
Agent 007            90.1
British Agent        34.4
Daniel Craig         16.5
Ian Fleming          14.2
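
The boundary behavior of the proximity score and the role of the proximity threshold can be illustrated with a small sketch. The actual RELETON measure is proprietary and is not described in this article; the sketch below substitutes a plain Jaccard overlap of document sets, chosen only because it also yields 0 when two keywords never co-occur and 100 when they appear in exactly the same documents. The function names, the toy index, and the threshold value are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def proximity_score(docs_a: Set[int], docs_b: Set[int]) -> float:
    """Stand-in proximity score: Jaccard overlap of document-ID sets, scaled to 0..100.

    This is NOT the RELETON measure, only a simple co-occurrence measure
    with the same endpoints (0 = never co-occur, 100 = always co-occur).
    """
    union = docs_a | docs_b
    if not union:
        return 0.0
    return 100.0 * len(docs_a & docs_b) / len(union)


def keyword_cluster(
    keyword: str, index: Dict[str, Set[int]], threshold: float
) -> List[Tuple[str, float]]:
    """Return neighbors meeting the threshold, ordered by descending score."""
    base_docs = index[keyword]
    scored = [(other, proximity_score(base_docs, docs)) for other, docs in index.items()]
    kept = [(other, score) for other, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy index: keyword -> IDs of documents that contain it.
TOY_INDEX = {
    "british agent 007": {1, 2, 3, 4},
    "james bond": {1, 2, 3, 4, 5},
    "ian fleming": {4, 6, 7, 8, 9, 10},
    "atkins diet": {20, 21},
}

# "atkins diet" scores 0 and is withheld; the others stay in the cluster.
cluster = keyword_cluster("british agent 007", TOY_INDEX, threshold=10.0)
```

With this toy data the keyword itself scores 100, "james bond" scores 80, and "ian fleming" scores roughly 11, so raising the threshold above 11 would drop the last neighbor from the cluster.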

KEYWORD CATEGORIZATION
A slightly modified [1] DMOZ categorization (http://www.dmoz.org) is used in the RELETON Platform as an internal categorization structure. For each keyword, RELETON generates an ordered list of the most relevant categories, as well as a numerical characteristic called level of confidence, measured between zero (zero confidence) and 100 (absolute confidence). Tables 2 and 3 below show two examples of RELETON’s categorization solution:
 

Table 2. DMOZ-like Categorization for Keyword “British Agent 007”
RELETON DMOZ Category                                            Level of Confidence
Arts/Movies/Titles/James_Bond_Series                             100

Table 3. DMOZ-like Categorization for Keyword “Atkin’s Diet”
RELETON DMOZ Category                                            Level of Confidence
Health/Nutrition/Dietary_Options/Low_Carbohydrate/Atkins_Diet    100
Shopping/Food/Diet/Low_Carbohydrate                              82.3
Home/Cooking/Special_Diets/Low_Carbohydrate                      82.3
Health/Weight_Loss                                               73.2
Health/Nutrition/Dietary_Options/Low_Carbohydrate                17.6

KEYWORD RATING
For many semantic applications, it is important to be able to rate keywords and to recognize adult, alcohol, tobacco, and drug-related keywords. To make rating decisions, RELETON maintains a comprehensive set of “special” keywords, for instance, a list of all UNESCO classified illegal drugs. To determine whether a keyword belongs to one of the special groups, RELETON analyzes the keyword’s neighbor cluster and assigns the original keyword an aggregated, weighted-average score of membership in one or more of the predefined rating groups. The RELETON rating is one of the most accurate and sensitive semantic rating systems available today. It can accurately rate keywords that are only semantically related to the special groups. For instance, the keyword “rolling paper” belongs to the “tobacco” group because its proximity cluster contains several tobacco-related keywords.
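
The cluster-based rating described above can be sketched as a weighted average over a keyword's neighbors. The article does not give the exact aggregation, so the weighting used here (each neighbor weighted by its proximity score) is an assumption, as are the function name, the group contents, and the neighbor scores.

```python
from typing import Dict, Iterable, Set, Tuple


def rate_keyword(
    neighbors: Iterable[Tuple[str, float]],
    special_groups: Dict[str, Set[str]],
) -> Dict[str, float]:
    """Rate a keyword against each special group on a 0..100 scale.

    Assumed scheme: the rating for a group is the share of total proximity
    weight carried by neighbors that appear in that group's keyword list.
    """
    neighbors = list(neighbors)
    total = sum(score for _, score in neighbors)
    ratings: Dict[str, float] = {}
    for group, members in special_groups.items():
        if total == 0:
            ratings[group] = 0.0
            continue
        hit = sum(score for keyword, score in neighbors if keyword in members)
        ratings[group] = 100.0 * hit / total
    return ratings


# Hypothetical special-keyword lists and a hypothetical cluster for "rolling paper".
SPECIAL_GROUPS = {
    "tobacco": {"cigarette papers", "rolling tobacco", "cigar"},
    "alcohol": {"whiskey"},
}
ROLLING_PAPER_CLUSTER = [
    ("rolling paper", 100.0),
    ("cigarette papers", 60.0),
    ("rolling tobacco", 40.0),
    ("origami paper", 50.0),
]

ratings = rate_keyword(ROLLING_PAPER_CLUSTER, SPECIAL_GROUPS)
```

Note that "rolling paper" itself is not in the hypothetical tobacco list; it receives a non-zero tobacco rating only through its neighbors, which mirrors the example in the text.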
 

KEYWORD POPULARITY, POWER, AND DEMOGRAPHY
RELETON categorization and clustering solutions allow the Platform to provide more complex semantic keyword characteristics, such as:

  • Keyword popularity, as a function of keyword occurrence in web content, and its actual usage as a search query
  • Keyword power, as a function of a keyword’s popularity and the number of bidding advertisers
  • Keyword demographics, as a keyword’s popularity within a certain demographic group

UNSTRUCTURED TEXT ANALYSIS
In the RELETON semantic paradigm, an original text is replaced by a set of its most representative keywords, weighted by their location within the text, the text’s punctuation, and the frequency of each keyword’s usage within the text. RELETON’s keyword analysis technology is then applied to the extracted set of keywords. First, the RELETON Platform forms a set of proximity neighbors for the text by adding semantically relevant and popular keywords and generating the corresponding proximity scores. Next, RELETON creates a text categorization and a text rating. With a library of over 170,000 user-intent patterns, RELETON is able to automatically assess a text’s controversy, as well as public and author sentiments. A library of approximately 800,000 product names and brands allows RELETON to extract products from the text, as well as the sentiment towards those products within the given text.
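
The first step, replacing a text with weighted representative keywords, can be illustrated with a deliberately simplified sketch. It assumes single-word keywords, uses sentence-ending punctuation only to find sentence boundaries, and boosts words in the first sentence as a crude proxy for location; the stop-word list, the boost factor, and the function name are hypothetical and do not describe RELETON's actual extraction.

```python
import re
from collections import defaultdict
from typing import List, Tuple

# Tiny hypothetical stop-word list, for illustration only.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def representative_keywords(
    text: str, top_n: int = 5, lead_boost: float = 2.0
) -> List[Tuple[str, float]]:
    """Rank words by frequency, counting first-sentence occurrences `lead_boost` times."""
    # Punctuation marks the sentence boundaries; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    weights = defaultdict(float)
    for position, sentence in enumerate(sentences):
        boost = lead_boost if position == 0 else 1.0
        for word in re.findall(r"[a-z0-9']+", sentence.lower()):
            if word not in STOPWORDS:
                weights[word] += boost
    # Highest weight first; ties broken alphabetically for a stable order.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


sample = "James Bond returns. The new Bond film stars Daniel Craig."
top_keywords = representative_keywords(sample, top_n=3)
# top_keywords == [("bond", 3.0), ("james", 2.0), ("returns", 2.0)]
```

The resulting weighted keyword set is the kind of input that the later steps (neighbor expansion, categorization, rating) would operate on.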

RELETON POWERED SERVICES
The Platform has several real-time semantic analysis services, defined below:

  • ReleKey – ReleKey is a semantic keyword analysis service.  Given a keyword, ReleKey generates a ranked list of relevant keywords and their proximity scores
  • ReleText – ReleText is a semantic web page and text analysis service.  Given a document or web page, ReleText analyzes its content, generates a ranked list of relevant keywords and their relevance scores, and optionally displays contextual advertisements from leading advertising networks.  ReleText generates not only the most relevant keywords which actually appear within the text, but it also provides related keywords which are semantically relevant to the analyzed text
  • ReleCat – ReleCat is a semantic keyword, web page and text document categorization service. Given a keyword, text document or web page, ReleCat analyzes its content and assigns it to the most relevant of 700,000+ DMOZ categories or of a custom set of categories
  • ReleShield – ReleShield is a keyword, web page and text document rating and protection service. Given a keyword, web page or text document, ReleShield analyzes its content and determines if the keyword, web page or document contains any adult language, sexual content, illegal and prescription drugs, tobacco, etc.

USABILITY STUDY
The majority of current customers use the RELETON platform for ad placement yield optimization. The underlying idea is that there are billions of legitimate (good) long-tail keywords on which nobody is advertising, yet relevant ad placement for long-tail keywords is still highly desirable. For each long-tail keyword, RELETON generates a semantic cluster, and in many cases such a cluster contains one or more keywords that have highly relevant associated ads. Using a custom arbitrage scheme, RELETON allows these ads to be assigned to the original long-tail keyword.
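
A minimal sketch of this idea follows. The custom arbitrage scheme itself is not described in this article, so the rule used here (take the ad of the closest neighbor that has one, subject to a minimum proximity score) is an assumption. The function name, the ad identifiers, and the `min_score` default are hypothetical; the neighbor scores are taken from Table 1.

```python
from typing import Dict, List, Optional, Tuple


def borrow_ad(
    cluster: List[Tuple[str, float]],
    ads_by_keyword: Dict[str, str],
    min_score: float = 50.0,
) -> Optional[Tuple[str, str, float]]:
    """Return (neighbor, ad, score) for the closest neighbor that has an ad.

    Neighbors below `min_score` are ignored so that a long-tail keyword
    never inherits an ad from a weakly related keyword.
    """
    for neighbor, score in sorted(cluster, key=lambda pair: pair[1], reverse=True):
        if score < min_score:
            break  # all remaining neighbors are even less relevant
        ad = ads_by_keyword.get(neighbor)
        if ad is not None:
            return neighbor, ad, score
    return None


# Cluster of the long-tail keyword "British agent 007" (scores from Table 1).
CLUSTER = [
    ("james bond", 98.0),
    ("james bond 007", 96.0),
    ("agent 007", 90.1),
    ("daniel craig", 16.5),
]
# Hypothetical ad inventory: only some keywords have advertisers.
ADS = {"james bond 007": "ad-123", "daniel craig": "ad-456"}

match = borrow_ad(CLUSTER, ADS)
# match == ("james bond 007", "ad-123", 96.0)
```

The minimum-score cutoff is the main design choice in this sketch: without it, the ad attached to "daniel craig" could be shown for a query that is only loosely related to it.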

CONCLUSION
The RELETON Platform is a powerful and versatile engine for deep semantic keyword and web page analysis. The Platform is designed to support real-time semantic analysis, and in many cases delivers results in under 50 ms. RELETON has an automatic update system that periodically refreshes semantic relationships, as well as the collection of the most popular keywords, categories, and the databases of products, keyword ratings, sentiments, etc.
A comprehensive list of RELETON Platform demos is located at http://www.relevad.com/demos. Login: relevad Password: daveler
 

1.   In the RELETON version of DMOZ, we eliminated some unnecessary levels of the category hierarchy, as well as several “trees” of international categories.