RDF Due Diligence

Jonathan Hendler

on

December 19, 2006


The following provides a basic, op-ed case study on using RDF in a real-world search application for a small organization. Small organizations are resource constrained and investing in newer technology such as RDF poses special risks. The evolution of software, and the Semantic Web in particular, is a tension between a need to make something new and better, and a short-term need to make something practical and efficient.

The arguments presented are for both the engineer working with small organizations, encumbered by reality, and the RDF purist, encumbered by a vision. The "middle ground" is the goal: practical software for small organizations, built using RDF. This is the vision being developed at CivicActions through the NINA project[i], an RDF-based faceted search tool for Drupal. Since RDF is not an easy or standard approach to building search applications, CivicActions, supported by LINC's [ii] NINA project [iii], has been developing software that lets organizations leverage their existing Drupal content in new ways, via a new interface and an RDF-based data storage mechanism.

Definitions:

  • RDF: Resource Description Framework is a W3C recommendation for representing “knowledge” on the Web, most commonly serialized as XML. [iv] RSS 1.0 is one of the most widely used applications of RDF. [v]
  • RDF Store: A database of RDF, usually broken into “statements” or “triples” which can then be queried and reassembled. The most popular language for querying RDF is currently SPARQL, which offers features similar to the query operations provided by SQL.
  • Triples: The most atomic representation of knowledge in RDF. Each triple is a subject-predicate-object statement, and a set of triples forms a directed graph. This is useful in genetics, linguistics and other fields with very complex data, where we have all the information but need to represent complex and unknown relationships.
  • Semantic Web: The vision for the next generation of the web - making it one giant self-explanatory web. The entire Web becomes a facilitated information exchange.
  • Faceted Search/ Faceted Classification[vi] - a controlled vocabulary to describe and organize information
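
To make the triple and query definitions above concrete, here is a minimal Python sketch. It assumes a made-up `ex:` vocabulary and stores triples as plain tuples; the `match` helper is hypothetical and only imitates the simplest kind of pattern that a SPARQL query expresses. It is not how ARC or any real RDF store works internally.

```python
# Hypothetical example data: each triple is (subject, predicate, object).
triples = [
    ("ex:node1", "ex:title", "Community Garden"),
    ("ex:node1", "ex:topic", "ex:Environment"),
    ("ex:node2", "ex:title", "River Cleanup"),
    ("ex:node2", "ex:topic", "ex:Environment"),
    ("ex:node2", "ex:organizer", "ex:node3"),
    ("ex:node3", "ex:name", "Neighborhood Association"),
]


def match(store, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for (ts, tp, to) in store
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    ]


# "Which resources have the topic ex:Environment?"
environment = [s for (s, _, _) in match(triples, p="ex:topic", o="ex:Environment")]
print(environment)  # ['ex:node1', 'ex:node2']
```

Note that `ex:node2` points at `ex:node3` through `ex:organizer`: the object of one triple can be the subject of another, which is what makes the collection a graph rather than a flat table.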

Many Engineers ask:

Why use an RDF store when I can build search functionality “just like that” in SQL faster, cheaper, and more efficiently?

RDF is best NOT used when the data doesn't change, doesn't need to be shared with other sites, doesn't have a lot of meta-data, and doesn't relate to any other data. I'd guess that if your data doesn't need to do any of the aforementioned, its days are numbered.

Traditional database normalization, through entity-relationship diagrams and other techniques followed by DBAs, can yield better performance when the data is well understood. If the data format and context are not well understood, RDF provides a convenient abstraction for storing the information without imposing a constraining schema.

One can make specialized RDBMS engines faster than RDF stores or, conversely, mimic an RDF store's flexibility in an RDBMS. In fact, almost all of the available RDF stores are implemented as a layer above an RDBMS. Some engineers I've spoken with believe this is evidence that RDF stores aren't useful in general applications, since traditional databases can do everything RDF stores can do. However, it should also be clear that if you are implementing a data store as flexible as an RDF store in an RDBMS, you are re-inventing the wheel, and you will run into the same performance issues as an RDF store.

NINA uses ARC[vii] and an RDF store abstraction layer called SONIA.
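
The idea of mimicking an RDF store's flexibility in an RDBMS can be sketched with a single three-column table. The example below is an illustration only, using Python's built-in `sqlite3` module; the table name, index, and `ex:` vocabulary are assumptions made for this sketch and do not describe the schema used by ARC, SONIA, or NINA.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One generic table holds every statement, whatever it describes.
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
# Assumed index for predicate/object lookups; real stores tune this differently.
conn.execute("CREATE INDEX idx_po ON triples (p, o)")

rows = [
    ("ex:node1", "ex:title", "Community Garden"),
    ("ex:node1", "ex:topic", "ex:Environment"),
    ("ex:node2", "ex:title", "River Cleanup"),
    ("ex:node2", "ex:topic", "ex:Housing"),
]
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)

# A new kind of property needs no ALTER TABLE, only another row.
conn.execute(
    "INSERT INTO triples VALUES (?, ?, ?)", ("ex:node2", "ex:volunteers", "12")
)

# Titles of everything tagged ex:Environment.
query = """
    SELECT title.o
    FROM triples AS topic
    JOIN triples AS title ON title.s = topic.s
    WHERE topic.p = 'ex:topic' AND topic.o = 'ex:Environment'
      AND title.p = 'ex:title'
    ORDER BY title.o
"""
titles = [row[0] for row in conn.execute(query)]
print(titles)  # ['Community Garden']
```

Every additional property in a query adds another self-join on the same table. That is the price of the schema-free layout, and it is one reason such a design runs into the performance issues mentioned above.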

To play devil's advocate: RDF stores lack a consistent method for data import and synchronization, and they lack a consistent method for implementing RDBMS features like auto-increment fields, indexes, etc. I have personally replicated RDF-store-like functionality in an RDBMS and achieved good performance results. But the goal from an engineering perspective is more than just performance or even adaptability; it's about standardization.

Some Drupal engineers feel this can all be achieved using the excellent “Views Module”. I am sure it can. It's just that RDF goes beyond the mission of Drupal.

Many seasoned business-types also ask:

Why use an RDF Store? What matters is the interface; faceted search, “folksonomies” and tagging work great and people like to use these interfaces. What is underneath is not as important.

Generally, this is the more important question for small organizations. RDF stores can be difficult to bridge to search applications that expect consistent performance. Since the search interface is consistent, the underlying data can be consistent, which means that in the short term an RDBMS can outperform an RDF store in both price and performance.

However, over time, as diverse organizations bring loosely related data into a central database – there will need to be a way to mitigate the complexity of building search. Search comes not only from the interface on our site as the data set grows, but also from other sites which will aggregate our data and build new search interfaces.

RDF lets data be thought of in terms of relationships. The search interface only reflects this. Faceted search, for example, lets you think of search in the same way. Instead of a search for keyword distribution, the search process with faceted search is a process of discovering relationships.
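
As a rough illustration of faceted search as relationship discovery, the Python sketch below counts the values available under each predicate and then narrows the result set by one facet. The data, the `ex:` vocabulary, and the `facet_counts` and `narrow` helpers are all hypothetical; this is not NINA's implementation.

```python
from collections import Counter, defaultdict

# Hypothetical triples: (subject, predicate, object).
triples = [
    ("ex:node1", "ex:topic", "ex:Environment"),
    ("ex:node1", "ex:city", "Oakland"),
    ("ex:node2", "ex:topic", "ex:Environment"),
    ("ex:node2", "ex:city", "Berkeley"),
    ("ex:node3", "ex:topic", "ex:Housing"),
    ("ex:node3", "ex:city", "Oakland"),
]


def facet_counts(store, subjects=None):
    """Count values per predicate, optionally restricted to some subjects."""
    counts = defaultdict(Counter)
    for s, p, o in store:
        if subjects is None or s in subjects:
            counts[p][o] += 1
    return counts


def narrow(store, predicate, value):
    """Return the set of subjects that carry the given predicate/value pair."""
    return {s for s, p, o in store if p == predicate and o == value}


# Facets over the whole collection.
all_facets = facet_counts(triples)
print(dict(all_facets["ex:topic"]))  # {'ex:Environment': 2, 'ex:Housing': 1}

# The user picks topic = Environment; the remaining facets are recomputed.
selected = narrow(triples, "ex:topic", "ex:Environment")
remaining = facet_counts(triples, selected)
print(dict(remaining["ex:city"]))  # {'Oakland': 1, 'Berkeley': 1}
```

Each predicate becomes a facet without any facet-specific code, so new kinds of relationships in the data show up in the interface as new ways to narrow the search.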

The BIG win with integrating RDF with Drupal is that Drupal already has many interesting relationship models, so you can relate one piece of information to another (node relations, taxonomies, menus, CCK types, etc.).

The well-established standards and science that led to the development of RDF are well represented by engineers, but many of these engineers are not involved with small communities (the developer of ARC being an exception).

Here is what RDF purists would ask about NINA:

  1. Where are the strong types?
    In PHP it is not required to declare a value as an integer, float or string, but other languages, like Java, expect this information. The RDF in NINA currently defines everything as a string for convenience, but will need typed values later.
  2. Where are the constraints and rules on other types?
    A major feature of some RDF-stores is the ability to use AI-like reasoning engines on the data. These engines rely on descriptions of rules, often in OWL, but sometimes built-in with RDFS. NINA/SONIA are building a foundation for this functionality but do not take advantage of these features yet.
  3. Where is the data validation?
    Data validation is handled by Drupal and the Drupal developer by using CCK types and other built-in mechanisms.
  4. Why abstract the RDF store and SPARQL with SONIA? Why not let them access the RDF with SPARQL?
    In the Drupal environment, as would be the case in any CMS, a different level of usability is achieved when the RDF store is used in a ubiquitous way. The data can still be exported to RDF and used in any other RDF store.
  5. The high-level namespaces are inconsistent.
    This feature

Here's a summary of what RDF and RDF stores mean to small organizations:

  1. reduce the redundancy of data through managed data sharing
    quickly aggregate/create large but well understood collections of data
  2. convert RDF to OPML, RSS, Atom, G-Data or other micro-formats, and more easily import these formats as well (consider RSS feed aggregators)
  3. allow complex queries on arbitrary data
  4. be a part of change

Be prepared for dramatic change on the web over the next 5 years!

Thanks to Benjamin Nowack for edits and corrections in the definitions.

Links

i http://drupal.org/project/nina
ii http://lincnet.net/
iii http://lincnet.net/nina
iv http://en.wikipedia.org/wiki/Resource_Description_Framework
v http://xml.com/pub/a/2002/12/18/dive-into-xml.html
vi http://en.wikipedia.org/wiki/Faceted_classification
vii http://arc.web-semantics.org/
