Posts Tagged ‘HTML’

Turn Unstructured Web Content Into Structured Data With

Bringing structured data to users from unstructured web content: that is the aim of a new API service that works with millions of blogs, forums, reviews and news sites, and comment posts. The company behind it is drawing on its technology roots as a message board search engine, Omgili, to do the work of delivering posts’ clean text content, dates, authors, links, language and so on in JSON, XML and RSS formats rather than as unstructured content in an HTML file.

Doing Omgili’s job in the message board space, a world untouched by microformats and other semantic web structures, required creating a crawler that used heuristic techniques to extract text, titles and other details from those posts, Geva says. “Once we created that we were able to extract data from less complex sources like blogs, news sites and so on,” he says.

“The output is similar to what does (see The Semantic Web Blog post on that technology here), but on a much larger scale,” says founder Ran Geva. That service, he says, is great when users want specific data from one or two sites, but more is required for the heavy lifting of leveraging a lot of data. “We save and download millions of posts per day,” he says. “When it comes to getting structured content out of complicated sources on a mass scale, that’s a unique technology challenge we conquered a while ago.”

Read more

Web Components: Even Better With Semantic Markup

The W3C’s Web Components model is positioned to solve many of the problems that beset web developers today. “Developers are longing for the ability to have reusable, declarative, expressive components,” says Brian Sletten, a specialist in semantic web and next-generation technologies, software architecture, API design, software development and security, and data science, and president of software consultancy Bosatsu Consulting, Inc.

Web Components should fulfill that longing: With the Templates, Custom Elements, Shadow DOM, and Imports specifications (all still drafts and thus subject to change), developers get a way to build their web applications and elements as a set of reusable components. While most browsers don’t yet support these specifications, Web Component projects like Polymer let developers who want to take advantage of these capabilities right away build Web objects and applications atop the specs today.

“With this kind of structure in place, now there is a market for people to create components that can be reused across any HTML-based application or document,” Sletten says. “There will be an explosion of people building reusable components so that you and I can use those elements and don’t have to write a ton of obnoxious JavaScript to do certain things.”

That in itself is exciting, Sletten says, but even more exciting is his realization that semantic markup can be added to any web component.

Read more

Latest Version of RDFLib Released

Ivan Herman reports, “This has been in the works for a while, but it is done now: the latest (3.4.0 version) of the python RDFLib library has just been released, and it includes an RDFa 1.1, microdata, and turtle-in-HTML parser. In other words, the user can add structured data to an HTML file, and that will be parsed into RDF and added to an RDFLib Graph structure. This is a significant step, and thanks to Gunnar Aastrand Grimnes, who helped me adding those parsers into the main distribution.”

He goes on, “I have written a blog last summer on some of the technical details of those parsers; although there has been updates since then, essentially following the minor changes that the RDFa Working Group has defined for RDFa, as well as changes/updates on the microdata->RDF algorithm, the general approach described in that blog remains valid, and it is not necessary to repeat it here.” Read more
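
To make the idea of turning structured data in an HTML file into triples more concrete, here is a minimal Python sketch. It does not call RDFLib's own parsers, since the announcement quoted above does not show how they are invoked. Instead it uses only the standard library, handles a small subset of RDFa Lite (the vocab, typeof, resource and property attributes, one term per attribute) and assumes well-formed markup. The class name TinyRdfaReader, the sample markup and the example.org URL are made up for illustration.

```python
from html.parser import HTMLParser

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Elements that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}


class TinyRdfaReader(HTMLParser):
    """Collects (subject, predicate, object) tuples from a small RDFa Lite subset."""

    def __init__(self, base="#doc"):
        super().__init__()
        self.triples = []
        # One frame per open element; the first frame stands for the document itself.
        self.stack = [{"subject": base, "vocab": "", "prop": None, "text": []}]
        self.blank_count = 0

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        parent = self.stack[-1]
        vocab = a.get("vocab") or parent["vocab"]
        subject = parent["subject"]
        prop = vocab + a["property"] if a.get("property") else None
        pending = None  # property whose value is the element's text, if any

        if a.get("typeof"):
            # A typed element starts a new subject: `resource` if given, else a blank node.
            subject = a.get("resource")
            if subject is None:
                self.blank_count += 1
                subject = f"_:b{self.blank_count}"
            self.triples.append((subject, RDF_TYPE, vocab + a["typeof"]))
            if prop:
                self.triples.append((parent["subject"], prop, subject))
        elif prop:
            # Prefer a value carried in an attribute; otherwise wait for the text.
            value = a.get("content") or a.get("href") or a.get("src")
            if value is not None:
                self.triples.append((subject, prop, value))
            else:
                pending = prop

        if tag not in VOID_TAGS:
            self.stack.append(
                {"subject": subject, "vocab": vocab, "prop": pending, "text": []}
            )

    def handle_data(self, data):
        # Text counts toward every open element still waiting for a literal value.
        for frame in self.stack:
            if frame["prop"]:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or len(self.stack) == 1:
            return
        frame = self.stack.pop()
        if frame["prop"]:
            literal = "".join(frame["text"]).strip()
            self.triples.append((frame["subject"], frame["prop"], literal))


sample = """
<div vocab="http://schema.org/" typeof="Person" resource="#alice">
  <span property="name">Alice Example</span>
  <a property="url" href="http://example.org/alice">home page</a>
</div>
"""

reader = TinyRdfaReader()
reader.feed(sample)
reader.close()
for triple in reader.triples:
    print(triple)
```

A real RDFa processor also has to deal with prefixes, multiple terms per attribute, datatypes and the HTML5 parsing rules, which is why a maintained parser is preferable to a sketch like this one.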

RDFa Working Group Publishes Last Call Draft of HTML + RDFa 1.1

Ivan Herman of the W3C reports, “The W3C RDFa Working Group has published a Last Call Working Draft of HTML+RDFa 1.1. This specification defines rules and guidelines for adapting the RDFa Core 1.1 and RDFa Lite 1.1 specifications for use in HTML5 and XHTML5. The rules defined in this specification not only apply to HTML5 documents in non-XML and XML mode, but also to HTML4 and XHTML documents interpreted through the HTML5 parsing rules. Comments are welcome through 28 February.” Read more

Paper Review: “Recovering Semantics of Tables on the Web”

A paper entitled “Recovering Semantics of Tables on the Web” was presented at the 37th Conference on Very Large Databases in Seattle, WA. The paper’s authors included six Google engineers along with professor Petros Venetis of Stanford University and Gengxin Miao of UC Santa Barbara. The paper summarizes an approach for recovering the semantics of tables by adding annotations beyond what the author of a table has provided. The paper is of interest to developers working on the semantic web because it gives insight into how programmers can use semantic data (a database of triples) and Open Information Extraction (OIE) to enhance unstructured data on the web. In addition, the authors compare a “maximum-likelihood” model, used to assign class labels to tables, with a “database of triples” approach. They show that their method for labeling tables is capable of labeling “an order of magnitude more tables on the web than is possible using Wikipedia/YAGO and many more than freebase.”
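
As a rough illustration of what assigning a class label to a table column by maximum likelihood can look like, here is a toy Python sketch. It is not the paper's actual model or data: the function name best_label, the isa_counts table of (value, label) counts and the add-one smoothing are all assumptions made for this example. The label chosen is the one under which the column's cell values are most probable.

```python
import math
from collections import defaultdict


def best_label(column, isa_counts, smoothing=1.0):
    """Return the label maximising the smoothed likelihood of the column's values.

    `isa_counts` maps (value, label) pairs to how often that pairing was observed;
    it is a hypothetical stand-in for an isA database mined from the web.
    """
    label_totals = defaultdict(float)
    vocabulary = set(column)
    for (value, label), count in isa_counts.items():
        label_totals[label] += count
        vocabulary.add(value)

    def log_likelihood(label):
        # Smoothed estimate of P(value | label), summed in log space over the column.
        denom = label_totals[label] + smoothing * len(vocabulary)
        return sum(
            math.log((isa_counts.get((value, label), 0) + smoothing) / denom)
            for value in column
        )

    return max(label_totals, key=log_likelihood)


# Made-up counts for illustration only.
isa_counts = {
    ("Paris", "city"): 50,
    ("Berlin", "city"): 40,
    ("Lima", "city"): 30,
    ("Paris", "person"): 5,
    ("Lima", "bean"): 20,
}
column = ["Paris", "Berlin", "Lima"]
label = best_label(column, isa_counts)
print(label)
```

Working in log space avoids numerical underflow on long columns, and the smoothing term keeps a single unseen cell value from ruling a label out entirely.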

Read more

The Value of Semantic Markup to Retailers

A recent article informs online retailers that “Starting now, you’re going to need good structured markup on your X/HTML in addition to your white hat tactics. I see structured markup as being equally important to authoritative inbound links as a ranking factor when optimizing content. Why? Because search robots are designed to serve search engine users by matching their search query expectations, known as user intent. These bots are machines, and they’re trying to discern the human mind’s evaluation of information in answer to human-entered keywords.” Read more

Explaining HTML5 and RDF

A new article provides some insight into HTML5 and RDF for those of us who don’t work on the technical side of things. The article explains that “since the data on the web is often in forms that make it computationally complex to parse or recognize, new HTML tags and standards had to be developed and integrated with HTML5 to provide this functionality.” Read more

RDFa Core 1.1 Last Call Working Draft published

The RDFa Working Group has published a Last Call Working Draft of RDFa Core 1.1. RDFa Core is a specification for attributes to express structured data in any markup language, with an emphasis on HTML-family languages, the Scalable Vector Graphics (SVG) Format, the Open Document Format and other Web-enabled document formats. The specification enables the human-readable and machine-readable markup of people, places, events, products, recipes, social networks, and many other concepts that are frequently published on the web. RDFa 1.1 improves upon RDFa 1.0 by adding a number of features requested by people to ease authoring. The announcement as a Last Call Working Draft is an open invitation to the general public to review and provide feedback on the specification via the RDFa Working Group mailing list. The deadline for review feedback is 6 December.

HTML 5: Spyware built in? – (blog)

HTML 5, which I suppose could be said to be ushering in Web 3.0, makes that much harder. It is already difficult to take steps to protect your privacy.

The Semantic Web: An Explanation in Plain English – Technorati

The Semantic Web is a big step toward Web 3.0, where the ultimate goal is to make Web content more machine-friendly. Most Websites are produced using HTML,