Archives: November 2008

Millions of Hits Force Europeana Portal to Reboot

Jennifer Zaino
SemanticWeb.com Contributor

As SemanticWeb.com has reported, European governments and institutions are investing heavily in the development of the Semantic Web.

One effort built on semantic web standards is the European portal Europeana.eu, a prototype site billed as a European digital library that will give users direct, multilingual access, initially to some 2 million digital objects, from film to photos to paintings to manuscripts to archival papers. The site launched November 20, but it has become a victim of its own popularity: Ten million hits an hour crashed it, and it is not due to go live again until mid-December, in what is promised to be a more robust version.

The Europeana project is expected by 2010 to give users access to more than 6 million digital items, and is ultimately expected to include a business model to ensure the site’s sustainability. According to the web site, “Europeana is a Thematic Network funded by the European Commission under the eContentplus program, as part of the i2010 policy. Originally known as the European digital library network — EDLnet — it is a partnership of 90 representatives of heritage and knowledge organizations and IT experts from throughout Europe. They contribute to the Work Packages that are solving the technical and usability issues and developing the specifications for the prototype.”

The project is run by a core team based in the national library of the Netherlands, and builds on the project management and technical expertise developed by The European Library, the site says. The European Library is a portal that enables people to search across 150 million titles, from 172 collections in 31 European national libraries, and is a service of the Conference of European National Libraries.

Structured metadata is key to contributed content for the portal, which will support RDF triples. Europeana will use the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) harvesting approach. All content aggregators and contributors are required to provide metadata about their resources in unqualified Dublin Core which will be used to build a basic index for simple search, according to the Technical Requirements for providing content. Content aggregators and providers are strongly encouraged to provide more elaborate metadata to enable users to get straight to their content, and so that the portal can build the sophisticated services that users expect, and all data transfer will be based on XML structured files, according to those requirements. Content themes for the prototype include cities, social life, music, crime and punishment, and travel and tourism.
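As a rough illustration of the harvesting approach described above, the sketch below parses an OAI-PMH `ListRecords` response that carries unqualified Dublin Core and builds a tiny word index for simple search. The namespaces are the standard OAI-PMH 2.0, `oai_dc`, and Dublin Core ones; the sample record, its identifier, and the helper names `parse_records` and `build_index` are invented for illustration and are not taken from Europeana's Technical Requirements.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Standard namespaces for OAI-PMH 2.0 responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical ListRecords response; the identifier and values are invented.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:item-1</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>View of a city square</dc:title>
          <dc:creator>Unknown photographer</dc:creator>
          <dc:type>image</dc:type>
          <dc:language>nl</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_records(xml_text):
    """Return one dict per record: its identifier plus a list of values per DC element."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for rec in root.findall(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        fields = defaultdict(list)
        dc_root = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc_root is not None:
            for el in dc_root:
                # Tags look like "{namespace}title"; keep only the local name.
                local = el.tag.split("}", 1)[-1]
                if el.text and el.text.strip():
                    fields[local].append(el.text.strip())
        records.append({"identifier": identifier, "fields": dict(fields)})
    return records


def build_index(records):
    """Map each lowercase word found in any DC value to the identifiers containing it."""
    index = defaultdict(set)
    for rec in records:
        for values in rec["fields"].values():
            for value in values:
                for word in value.lower().split():
                    index[word].add(rec["identifier"])
    return index


records = parse_records(SAMPLE_RESPONSE)
index = build_index(records)
print(sorted(index["city"]))
```

A real harvester would also request the data over HTTP and follow resumption tokens; this sketch leaves both out and covers only the parsing and indexing step.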

The problems leading to the temporary closure of the site appear to be related to an unexpected level of interest (it had expected up to 5 million hits per hour and reached 13 million hits at its peak), and a lack of computing capacity to support the high traffic. Three servers were online to support the system; these servers deliver contextual information about the digital items, including a small picture. Once users find what they want by searching this contextual information, they click to get to the full content that is stored on the servers of the respective content contributing institutions.
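To put those figures in perspective, the short calculation below converts hits per hour into an average per-server rate. It assumes the load was spread evenly across the three servers and that a "hit" is a single request, neither of which the press release confirms, so treat the output as a back-of-the-envelope estimate only.

```python
SECONDS_PER_HOUR = 3600
SERVERS = 3  # number of servers reported online at launch


def per_server_rate(hits_per_hour, servers=SERVERS):
    """Average hits per second each server must absorb, assuming an even split."""
    return hits_per_hour / SECONDS_PER_HOUR / servers


expected = per_server_rate(5_000_000)   # planned maximum
peak = per_server_rate(13_000_000)      # reported peak

print(f"expected: {expected:.0f} hits/s per server")
print(f"peak:     {peak:.0f} hits/s per server")
print(f"peak is {peak / expected:.1f}x the planned load")
```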

“After the site went down for the first time, the Europeana management in The Hague increased computer capacity to deal with 8 million hits per hour,” according to a press release issued on the event. But that wasn’t enough to handle the load, and so “a serious upgrade of computer capacity will be carried out in the coming days and then tested in order to cope with the massive interest from the public.”

Restaurant Review App BooRah Beefs Up Menu

Jennifer Zaino
SemanticWeb.com Contributor

Former hedge fund manager and current CNBC TV host Jim Cramer has “boo-yah” as his catch phrase. Nagaraju Bandaru has “BooRah.”

That’s the name of the web site co-founded by Bandaru, also CTO of the semantic web-enabled personalized restaurant review site. (“Boo” for the bad restaurants, “Rah” for the good ones — “sentiment in semantics,” as Bandaru phrases it.) Started about three years ago, the private, venture-funded company has had a busy few weeks. In the last couple of weeks, it has announced a partnership with InfoUSA, one of the top providers of local business listings to every online directory, navigation and mobile player, to provide ratings and reviews for approximately 150,000 of InfoUSA’s restaurant listings; an API for its local syndication partners with enhanced mobile platform capabilities; and a version of its application for the Google Android software stack that has had over 30,000 downloads in two weeks.

The inspiration for BooRah, Bandaru says, was the fact that there wasn’t an easy way for people to get the information they needed out of the tons of formal and blog reviews of restaurants out there in their locales, and that tags didn’t adequately communicate what people were trying to say. The site has these core semantic technology aspects:

  • The ability to match a web page to the correct local business. Particularly in blogs, people might discuss restaurants very casually — e.g., they had a great dinner at Joe’s Pizza, Bandaru says. But which Joe’s Pizza? “We are able to map any entity on any web page we crawl and associate that with the correct local business,” he says.

  • The ability to extract attributes from web pages in a very scalable, automated way. “Each sentence is analyzed for any term and grouped into a top level category — food, service or ambience,” he says, and those terms contribute to overall ratings on each of those counts.

    “At the macro level the real problem is you go to any review site you can find a rating, 3 or 3.5 or 4 stars. But it’s never beyond those because that’s how people write. But when they consume reviews they want more information — is that 3 stars for food or ambience, so we can break down that rating into a percentage score of 100 and give a more granular score,” with the help of its patented natural language processing capabilities. Since its launch, BooRah has grown from a vocabulary of about 1,000 to 50,000 terms.

  • The core technology in the semantic area is giving a quality score associated with sentiment. For example, if you say a restaurant was not so good vs. very bad, the technology would know that the former description rates slightly higher than the latter. “So the core of user reviews is all about negation, how do you handle people’s slang, associate what they are describing and what category and roll it up across every single sentence in a review across every different source and summarize,” Bandaru says.
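The ideas in the list above (sentence-level analysis, grouping terms into food, service or ambience, handling negation, and rolling results up into a score out of 100) can be sketched in a few lines. This is a toy illustration only: BooRah's patented method is not described in the article, and every word list, weight, and function name below is an invented assumption.

```python
import re

# Invented toy vocabularies; a real system would use a far larger lexicon.
CATEGORY_TERMS = {
    "pizza": "food", "pasta": "food",
    "waiter": "service", "staff": "service",
    "decor": "ambience", "music": "ambience",
}
POLARITY = {"good": 0.6, "great": 0.9, "bad": -0.6, "terrible": -0.9, "slow": -0.4}
INTENSIFIERS = {"very": 1.5, "so": 1.2, "really": 1.3}
NEGATORS = {"not", "never", "no"}


def sentence_polarity(tokens):
    """Average polarity of the sentiment words in one sentence, or None if there are none."""
    total, hits = 0.0, 0
    scale, negate = 1.0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
        elif tok in INTENSIFIERS:
            scale *= INTENSIFIERS[tok]
        elif tok in POLARITY:
            value = POLARITY[tok] * scale
            if negate:
                # Negation flips the sign but softens it: "not good" is milder than "bad".
                value = -0.5 * value
            total += max(-1.0, min(1.0, value))
            hits += 1
            scale, negate = 1.0, False  # modifiers apply to the next sentiment word only
    return total / hits if hits else None


def score_review(text):
    """Roll sentence polarities up into a 0-100 score per category."""
    by_category = {}
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = re.findall(r"[a-z']+", sentence)
        polarity = sentence_polarity(tokens)
        if polarity is None:
            continue
        for category in {CATEGORY_TERMS[t] for t in tokens if t in CATEGORY_TERMS}:
            by_category.setdefault(category, []).append(polarity)
    # Map the average polarity from [-1, 1] onto [0, 100].
    return {c: round(50 * (1 + sum(v) / len(v))) for c, v in by_category.items()}


review = "The pizza was great. The waiter was very slow. The decor was not so good."
print(score_review(review))
print(sentence_polarity("not so good".split()) > sentence_polarity("very bad".split()))
```

With these invented weights, "not so good" lands above "very bad", matching the ordering Bandaru describes; a production system would need to learn such weights rather than hard-code them.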

    The Semantic Web, Deep in the Heart of Texas

    Jennifer Zaino
    SemanticWeb.com Contributor

    With a three-year, $550,000 grant from the National Science Foundation, UT Dallas researchers will be delving into issues around the scalability of the semantic web, entity resolution, and policy specification and reasoning.

    The project, whose funding was announced in October, joins other semantic web research efforts taking place at UT Dallas funded by agencies such as the Intelligence Advanced Research Projects Activity, as well as collaborative efforts taking place with Raytheon Co. on creating semantic web technology that can find and analyze visual information.

    The principal investigator for the project is Dr. Bhavani Thuraisingham, a professor of computer science in the Erik Jonsson School of Engineering and Computer Science at UT Dallas. UT Dallas has been working with HP Labs on developing the JENA RDF engine, “but one problem is managing large, large graphs,” explains Thuraisingham.

    “It’s very well to talk about the semantic web doing this, that, and the other thing,” but representing all this information requires very large graphs, she said. That’s where UT Dallas’ expertise in data mining, one area of specialization for Thuraisingham’s colleague Latifur Khan, will come in handy.

    The entity resolution problem is one that is capturing the attention of many researchers. When the same word has different interpretations, ontologies are required to sort out what the particular reference is to. She uses her own name as an example of potential confusion: Bhavani is the name of a ferocious goddess in India, a river, and a city or town, as well as her own name. That ambiguity needs to be resolved. While she can’t provide details at this point, she notes that UT Dallas is looking at some clustering techniques as a unique approach to dealing with these issues.
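The article gives no details of the UT Dallas techniques, so the following is only a generic sketch of how clustering can separate mentions of an ambiguous name: each mention is reduced to its surrounding words, and mentions with sufficiently similar contexts are grouped together. The sample sentences, the stopword list, the 0.2 threshold, and the function names are all assumptions made for illustration.

```python
STOPWORDS = {"the", "a", "of", "in", "through"}

# Invented example sentences mentioning the ambiguous name.
MENTIONS = [
    "bhavani river flows through tamil nadu",
    "professor bhavani teaches computer science",
    "the bhavani river flows past the dam",
    "goddess bhavani temple worship",
    "computer science professor bhavani research",
]


def context_words(sentence, target):
    """Words surrounding the target name, minus stopwords and the name itself."""
    return {w for w in sentence.lower().split() if w != target and w not in STOPWORDS}


def jaccard(a, b):
    """Overlap between two word sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_mentions(mentions, target, threshold=0.2):
    """Greedy one-pass clustering: join the first cluster whose pooled context is similar enough."""
    clusters = []  # each cluster: {"context": set of words, "members": list of mention indexes}
    for i, sentence in enumerate(mentions):
        words = context_words(sentence, target)
        for cluster in clusters:
            if jaccard(words, cluster["context"]) >= threshold:
                cluster["context"] |= words
                cluster["members"].append(i)
                break
        else:
            clusters.append({"context": set(words), "members": [i]})
    return [c["members"] for c in clusters]


print(cluster_mentions(MENTIONS, "bhavani"))
```

Each resulting group stands for one candidate sense of the name (river, person, goddess); linking a group to an ontology entry would be a separate step.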

    One of the reasons the government is interested in the semantic web is related to the issues of data sharing, springing from the 9/11 tragedy where information that might have helped target the terrorists in advance of the attacks had not been shared among different agencies. The semantic web opens the door to making the data that users want to get more available, useful and relevant, but what should be shared and policies around how that data should be shared will be an issue, whether it’s within government agencies or among parties such as health care practitioners, insurance providers, and patients, or even within various social networks.

    The Next Generation of Video

    Jennifer Zaino
    SemanticWeb.com Contributor

    As video becomes a bigger and bigger part of the content we consume – call it Video 3.0 – the challenge grows for publishers to tag that content and make it navigable, searchable, and monetizable. Semanticweb.com recently caught up with Alex Castro, CEO of Delve Networks, the semantically enabled and speech recognition savvy video search platform, who was fresh from the Digital Hollywood conference. Here’s what he says some of the buzz is about, both at Digital Hollywood and throughout the industry, as it relates to next-generation online video content.

    On PC to TV video: At Digital Hollywood, Castro said there were a lot of discussions about how to get Internet video to consumers’ TV sets.

    “That’s been something that’s been kicked around but there’s a bit more tangible traction around people solving that problem,” he says. He pointed, as an example, to ClearLeap, which works with video content owners and cable, satellite and IPTV companies to offer a new model of video delivery on the Internet. “Essentially they tie Internet content companies into their system that works with set-top boxes,” says Castro.

    There’s potential for Delve Networks to put its semantic video platform to work with ClearLeap. “Our idea with them is that the publishers using our system can get video piped to ClearLeap, to make sure it’s in the right format so it looks good on TV. This is part of the recognition that video is becoming ubiquitous.” Traction around solutions like ClearLeap’s is good for Delve, Castro says, because it creates demand for publishers to want a video publishing platform.

    On High-Definition video: Castro says there’s been some surprise in the industry at consumer appetite for higher quality video on the Internet, which is leading to requirements to deliver high definition video online. The issue is that many of the consumers who want this don’t have the high-speed connections required to deliver it seamlessly. That doesn’t have an impact on the semantic capabilities Delve can bring to the video searching picture, but it does create challenges in terms of vendors being smarter about who they deliver video to, and how they do it.

    “If there’s a particular user with a terrible connection, and you are trying to give them high-definition video, that ends up in a terrible user experience, where you have rebuffering and jerky video and it’s irritating,” says Castro. “So, with a system like ours, we need to make multiple copies of a video at different bit rates and then try and determine, based on the quality of bandwidth for end users, which of those versions to serve, and maybe even adjust which bit rate you are using throughout as the video plays.”
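The selection logic Castro describes, with several encodings of one video and a choice among them based on the viewer's bandwidth, can be sketched roughly as follows. The bitrate ladder, the 0.8 headroom factor, and the function names are illustrative assumptions, not details of Delve's platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    label: str
    bitrate_kbps: int


# Hypothetical set of encodings of one video at different bit rates.
LADDER = [
    Rendition("low", 400),
    Rendition("medium", 1200),
    Rendition("high", 2500),
    Rendition("hd", 5000),
]


def pick_rendition(ladder, measured_kbps, headroom=0.8):
    """Pick the highest bit rate that fits within a safety margin of the measured bandwidth."""
    budget = measured_kbps * headroom
    ordered = sorted(ladder, key=lambda r: r.bitrate_kbps)
    choice = ordered[0]  # fall back to the lowest rendition on very poor connections
    for rendition in ordered:
        if rendition.bitrate_kbps <= budget:
            choice = rendition
    return choice


def plan_playback(ladder, bandwidth_samples_kbps):
    """Re-evaluate the choice for each bandwidth sample taken during playback."""
    return [pick_rendition(ladder, sample).label for sample in bandwidth_samples_kbps]


# A connection that degrades while the video plays.
print(plan_playback(LADDER, [6500, 3000, 900, 300]))
```

The headroom factor trades picture quality for fewer rebuffering events: a lower value switches down earlier when bandwidth dips.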

    Inform Helps Media Giants Monetize the Semantic Web

    Jennifer Zaino
    SemanticWeb.com Contributor

    Last month, The Washington Times became the latest media outlet to start using the services of four-year-old Inform Technologies. The newspaper said it was using Inform’s semantic web product to create topic-specific pages about significant news and newsmakers, power its Dig Deeper feature that helps readers find related themes and related stories, and link its video and multimedia content to articles throughout the site.

    Inform Technologies also counts among its customers the Washington Post, Sports Illustrated, the NY Daily News, CNN, and a number of other publishing sites.

    To CEO James Satloff, a list of such clients paying to use its technology makes Inform the standard in the industry.

    “By having so many different authoritative media use your technology for disambiguation and categorization and textual relevance, that’s how you get to be the standard,” he says.

    Inform does four main things for its clients, which range from established media sources to start-up bloggers, Satloff says. The technology saves them money by providing consistent, industry-standard tagging of content; compels users to spend longer on a publisher’s site; attracts new unique visitors; and helps them monetize their digital assets.

    Journalists spend between 12 and 17 minutes per story hand-tagging content, Satloff says — and they hate it.

    “If you spend just 15 minutes doing this per story and you pub 100 stories a day that’s 500 man-days a year of just tagging,” he says. With Inform, writers or editors can submit their text and in 200ms get back tagged articles based on what Satloff says is a “phenomenally deep and rich ontology.”

    And Inform offers this part of its services for free, on the view that better and more consistent tagging across the web is good for the publishers, the industry, and of course, for Inform, too.

    By surfacing valuable contextual links and extracting related topics based on its real-time “reading” of the article, Inform makes it easier for readers to dive further into a publication’s digital assets, keeping them on the site longer by pushing them deeper into areas they didn’t know they wanted to go. Take, for example, an article on Bristol Palin, the daughter of vice presidential candidate Gov. Sarah Palin. In addition to creating automatic links within the text, it extracts contextually related topics that might be of interest to people reading the article, even if those words don’t actually appear in the text — stories on childbirth, for instance.

    Inform can also automatically create topic pages that can raise a site’s profile on the search engines, and with new visitors, by bolstering its credentials as an authority on a particular area. For instance, a story about the current crisis in the financial markets ties in contextually with a topic page of articles the publication has done on Treasury Secretary Henry Paulson.

    “Like beachfront property, those topics pages are valuable from a user engagement perspective,” says Satloff — visitors go to those pages because they are specifically interested in the area, not because they got there by accident. “And that’s brand new real estate that ads can be sold on. Topic pages create a tremendous increase in the volume of pages that exist for the publisher — some 20 or 30%. And since we know with such precision all the topics that story is about, it’s an easy way to pass contextual hints to ad partners for better monetization.”

    CoveritLive: Live Blogging 2.0

    Paula Gregorowicz
    SemanticWeb.com Contributor

    Live blogging took the Internet by storm this past week as millions turned online for perspective on the election, and CoveritLive played a pivotal role in that. The Toronto-based startup offers one-stop shopping for anyone needing to offer live coverage of an event.

    CoveritLive is new, having only been in development since April 2007 and out of beta since November 2007. President of CoveritLive Keith McSpurren and head of development Bob Barnard originally created the product for somewhat selfish reasons. McSpurren explained to me that he was a fan of ESPN’s Bill Simmons and the sports diaries he would publish the day after a game. As a sports fan, McSpurren thought it would be great if fans could get that commentary and perspective real time as the game was happening rather than after the fact. And thus CoveritLive was born and it has been gaining widespread popularity since with thousands of big name users.

    (Editor’s note: Jupitermedia has also used the CoveritLive tool at the Datamation Blog.)

    What is It?

    CoveritLive is Software as a Service that provides a robust, multimedia live blogging platform that anyone can use and integrate into their existing blog or website by using just a small snippet of embedded code. Unlike the many-to-many approach of real time software like chat or the online presentation application of online conferencing, CoveritLive offers a one-to-many approach. The premise is that people go to a live blog because they want to hear the perspectives and commentary of the person hosting the event while also being able to participate in real-time. Here is an example of a live blogging event from their online demo:

    [Image: CoveritLive live blogging demo]

    The live blog occurs right on the website and can include text and all kinds of multimedia. The host uses a console which allows them to administer the event. They can start polls and the results are updated in real time within the same window. They can include all types of multimedia from images to streaming video.

    Readers do not have to download any special software or create user accounts in order to participate. Readers can attend and contribute to the event simply by entering their name and message. The host then decides what level of reader interaction they wish to allow and can determine which comments, if any, are published. Nothing gets into the live blog feed unless the host lets it in. This feature is particularly useful for large groups to avoid a mess of comments detracting from the usefulness of the live event. Anyone who has ever been in a group chat room with numerous people knows how quickly things can turn to a mess of text.
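The host-moderated, one-to-many flow described above can be modeled as a pending queue in front of the published feed. The sketch below is a hypothetical illustration of that idea; the class and method names are invented and do not reflect CoveritLive's actual software.

```python
from collections import deque


class LiveBlog:
    """Toy model: reader comments wait in a queue until the host approves them."""

    def __init__(self):
        self.pending = deque()  # reader submissions awaiting review
        self.feed = []          # what every reader sees, in order

    def host_post(self, text):
        """Host commentary goes straight into the feed."""
        self.feed.append(("host", text))

    def reader_submit(self, name, message):
        """Readers need only a name and a message; nothing is published yet."""
        self.pending.append((name, message))

    def moderate(self, approve):
        """Review pending comments oldest first; publish only those the host approves."""
        while self.pending:
            name, message = self.pending.popleft()
            if approve(name, message):
                self.feed.append((name, message))


blog = LiveBlog()
blog.host_post("Kickoff!")
blog.reader_submit("Ann", "Great play")
blog.reader_submit("Bob", "BUY CHEAP WATCHES")
blog.moderate(lambda name, message: "cheap" not in message.lower())
print(blog.feed)
```

Here the approval rule is a simple function passed in by the caller; in practice the host would make that decision by hand from the console.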

    Podcast: Analyzing the Twine Launch

    Paul Miller
    SemanticWeb.com Contributor

    Unveiled at the Web 2.0 Summit in November 2007 and released in beta earlier this year, version 1.0 of Radar Networks’ Twine was opened to the world toward the end of October, and widely reported (see the SemanticWeb.com story “Radar’s Twine Emerges from Productive Beta”).

    In this discussion we touch upon the purpose of Twine, review the first few days of live operation, and then focus upon the team’s plans for the future.

    When originally announced, Twine was closely associated with the Semantic Web, although the company’s current marketing is less quick to make that link. In conversation we discover more about priorities for the 1.0 release and dig into some of the ways in which semantic technologies will play an increasingly important role moving forward.

    For further Talking with Talis podcasts on the emerging Web of Data, click here.

    At Talis, Paul Miller is active in raising awareness of new trends and possibilities arising from wider adoption of the Semantic Web.

    Web 3.0 Pioneers: Where Are They Now?

    Jennifer Zaino
    SemanticWeb.com Contributor

    In the spirit of keeping up with product updates to some of the interesting semantic web offerings we’ve seen the last couple of weeks, it seems time to take note of the enhancements and partnerships recently announced for a few other technologies we’ve covered.

    Calais

    Version 3.1, the latest release, has gone live and touts as one of its main features improved company and geographic disambiguation. The reference database of tens of millions of company names and their variations now will include a broader range of companies on an ongoing basis, Thomson Reuters says. Expanded name cross-referencing and textual hints, such as location or industry, should work to clear up ambiguities around whether a reference is, for example, to Acme Inc., Acme Corp., or Acme Ltd.

    Geographic disambiguation uses elements of Metaweb’s Freebase and other public data assets to determine to which town, city, state and/or country a given document is referring, as well as hints in the text to refine results. Calais 3.1 also returns geographic coordinates, which can help jump start developers working on mapping applications, the company says.
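The hint-based disambiguation described above can be illustrated with a toy resolver that matches a mention against a reference list of name variants and then uses location or industry words found in the document to choose among candidates. The reference entries, hint words, and the `resolve` function are invented for this sketch; it does not use or reproduce the Calais API.

```python
import re

# Hypothetical reference database: name variants plus location and industry hints.
REFERENCE = [
    {"id": "acme-inc", "names": {"acme", "acme inc"}, "hints": {"chicago", "hardware"}},
    {"id": "acme-corp", "names": {"acme", "acme corp"}, "hints": {"london", "insurance"}},
    {"id": "acme-ltd", "names": {"acme", "acme ltd"}, "hints": {"sydney", "mining"}},
]


def resolve(mention, document_text):
    """Return the id of the best-matching company, or None if the mention stays ambiguous."""
    key = mention.lower().strip().rstrip(".")
    candidates = [entry for entry in REFERENCE if key in entry["names"]]
    if not candidates:
        return None
    words = set(re.findall(r"[a-z]+", document_text.lower()))
    # Score each candidate by how many of its hints appear in the document.
    scored = [(len(entry["hints"] & words), entry["id"]) for entry in candidates]
    best_score = max(score for score, _ in scored)
    winners = [entry_id for score, entry_id in scored if score == best_score]
    # A tie between several candidates means the text gave no usable hint.
    return winners[0] if len(winners) == 1 else None


print(resolve("Acme", "Acme said its London insurance unit grew last quarter."))
print(resolve("Acme", "Acme reported earnings today."))
print(resolve("Acme Ltd.", "Acme Ltd. opened a new office."))
```

Returning `None` on a tie keeps the resolver from guessing; a fuller system might instead return ranked candidates with confidence scores.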

    When SemanticWeb.com spoke to Thomson Reuters in May at the time of its Version 2 release, Thomas Tague, Calais evangelist and project lead, said to expect a dramatic expansion in knowledge domains over the coming year.

    “What’s important is we proved a model where we can use open data sources, like Freebase, combined with NLP (Natural Language Processing) to generate entities more rapidly,” he said at the time. Among the additional semantic entities in the Calais 3.1 vocabulary are PatentFiling, PatentIssuance, FDAPhase, PersonEmailAddress and PersonEmployment, as well as new elements for PersonAttributes and SecondaryIssuance. PersonRelation is an entity that extracts references to symmetric relationships between people in the areas of business, academics, military service or politics, as well as friendship or marital status.

    Cognition

    The company recently announced that its semantic map (500,000 word stems and 4 million semantic contexts strong when we wrote about it in August) is now more than double the size of any other computational linguistic dictionary for English. It includes over 10 million semantic connections, comprising semantic contexts, meaning representations, taxonomy and word meaning distinctions.

    “Our Semantic Map has grown substantially and now is the largest and most complete of the English language,” says CEO Scott Jarus.

    Delve Networks

    Delve has continued to add a few features to its semantically enabled and speech recognition savvy video search platform. It now has support for RSS feeds so that people can subscribe to publishers’ video content via iTunes and get it delivered to Apple’s devices. Also, it now has the ability for customers to schedule when and how often video will be published, avoiding the manual task of, as an example, publishing on the day after the election a retrospective of videos from the Obama campaign, according to CEO Alex Castro.
