The release of Facebook’s Open Graph Protocol has spurred renewed interest
in the semantic web. I give credit to Facebook for pushing an RDFa-derived
format onto the World Wide Web. In fact, RDFa is the least interesting part:
people have been producing semantic data for a long time.
Most importantly, I give Facebook credit for focusing on the interesting
part of the problem: consumption of semantic data.
And although it’s a great achievement, I regret the locked-in and
centralized nature of Facebook’s Open Graph Protocol. The web is an open
environment, where open wins in the end.
Entity Identifiers
Facebook assigns and owns the internal identifiers of entities. These
identifiers do not look unique, nor are they something that external
parties, disconnected from Facebook, could assign, inspect or do anything
useful with. By owning the identifiers, Facebook owns the entity graph.
The assumption of a single party owning the identifiers for entities is
fundamentally flawed. Entities appear, disappear, merge and fork. Known
entities change constantly and, most importantly, what you and I, or anybody
else, understand as an entity differs. IMDB and Netflix will both describe
the movie “The Book of Eli”, yet each uses its own identifier. To assume
that one unique identifier can be used consistently
and universally across the web leads to information lock-in.
Luckily, the web is an open environment. And, in the end, open wins on the
web (the proof by induction is left to the avid reader).
Identity on the open web is federated, starting at the top-level domain
identifiers and going all the way down to fully qualified Uniform Resource
Identifiers (URIs). In reality, there are many sources of truth. Some of
those are canonical, or trusted, yet many other sources exist. Movie data
from IMDB or Netflix will usually be considered canonical. Yet that does not
stop Wikipedia, or any web publisher, from creating its own entities. All it
takes is a URI.
Anybody on the web, be it Facebook or Freebase, can make statements such as:
this URI on Netflix is about the same movie as this other URI on IMDB. All
it takes is a web page with a couple of links. The difference between
Facebook and the “web” way is that a web page is identified by a URI, and
anybody can create one in a federated way. Not just Facebook.
Semantic Web Annotation Techniques
The most interesting value of the web lies not in describing resources, or
the roles in the relationships between resources, but in the relationships
themselves. The statement “Madonna is related to Guy Ritchie” is, in
relative terms, more important than the more fact-complete “Madonna is Guy
Ritchie’s ex-wife”. Although the fact “ex-wife” increases knowledge, it only
has value once we have asserted that the relationship exists. Establishing
relationships can be achieved through links and URIs. That’s what the web is
about: pages and links to other pages.
The problem, however, with the current web of pages and links is that all
links are anonymous: a link says that two pages are related, but not how.
Ideally, I would like to bring some structure to those relationships so that
I can codify exactly what “ex-wife” means. That’s what semantic annotations
address: naming the role in the relationship between resources in a
structured format.
Inline semantic annotations, à la Facebook, are one possible way of linking
data sets. The issue with this approach, however, is that the metadata about
the resource is mingled with the data. This creates a number of problems.
First, there is an arbitrary distinction between what constitutes data and
what is metadata. Additionally, one needs to be able to parse the resource
format in order to find the semantic annotations. Aside from the
computational cost of parsing, adding semantic annotations to videos,
images, etc. would require custom extensions to the container format, which
is infeasible in practice. Practically speaking, inline semantic
annotations are only partially useful, and only for text-readable content
types.
A better alternative would be to use out-of-line semantic annotations. In
this model, we cleanly separate metadata from the resource data itself.
This separation should be not just syntactical, but structural. Semantic
annotation constructs should treat the data as an opaque resource. For all
we care, we should treat all data as binary resources. If we wanted to look
into the data, a specialized parser would read it and surface interesting
facts that we could then promote to metadata.
The first possible construct that provides out-of-line semantic annotations
is the Link HTTP header.
Using link headers we can describe a web resource without having to parse
the payload, by simply looking at the HTTP headers. Examples of usage for
link headers could include:
Link: <http://www.cern.ch/TheBook/chapter2>; rel="Previous"
Link: <mailto:timbl@w3.org>; rev="Made"; title="Tim Berners-Lee"
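As a rough illustration of how a consumer might read such headers without
touching the payload, here is a minimal Python sketch. The function name
parse_link_header and the shape of its return value are my own assumptions,
not part of any standard library, and the regular expressions only cover
simple cases like the two examples above rather than the full header grammar.

```python
import re

# A parameter value is either a quoted string or a bare token.
_VALUE = r'(?:"[^"]*"|[^\s";,]+)'
# One link-value: <target> followed by zero or more ";name=value" parameters.
_LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w*-]+\s*=\s*' + _VALUE + r')*)')
# One parameter inside the parameter run captured by _LINK.
_PARAM = re.compile(r';\s*([\w*-]+)\s*=\s*(' + _VALUE + r')')


def parse_link_header(value):
    """Split a Link header value into a list of {target, params} dicts.

    Hypothetical helper for illustration only; it handles the simple
    cases shown in this post, not every legal header.
    """
    links = []
    for target, raw_params in _LINK.findall(value):
        params = {}
        for name, raw in _PARAM.findall(raw_params):
            # Keep the first occurrence of a parameter, drop the quotes.
            params.setdefault(name.lower(), raw.strip('"'))
        links.append({"target": target, "params": params})
    return links


links = parse_link_header('<mailto:timbl@w3.org>; rev="Made"; title="Tim Berners-Lee"')
# links[0]["params"] == {"rev": "Made", "title": "Tim Berners-Lee"}
```

The relationship name (rel or rev) ends up as plain data next to the target
URI, which is exactly the "named link" that anonymous hyperlinks lack.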
Although powerful, link headers are not particularly accessible for most
publishers, since they require programmatic access to the web server to
generate those link headers. Additionally, if we are only interested in the
semantic annotations, fetching a full document only to throw it away is
highly inefficient both for consumer and publisher.
A better construct to provide out-of-line semantic annotations is to create
a separate resource altogether representing the semantic annotations for the
parent resource. This alternative, like link headers, also differentiates
between data and metadata, but does not require changes to the HTTP stack.
Furthermore, it does not require the publisher to generate those semantic
annotations. Anybody can publish semantic annotation documents for any
resource on the web. In that sense, the use of out-of-line semantic
documents that describe web resources is truly open and federated.
The Linked-Atom
There are many possible formats for describing resources in an out-of-line
fashion. One of the most interesting ones is the Atom Publishing Format.
With Atom, the focus is on identity and linking resources. The actual
content is hidden away.
But within the context of semantic annotations, Atom has its shortcomings.
Rather than abusing Atom, perhaps we need to create a separate, specialized
out-of-line resource descriptor. I call such a format the “Linked-Atom”.
There are a few differences between Atoms and Linked-Atoms. Whereas the Atom
is a generic format for content publishing, the linked-atom is used only for
linking web resources.
Let’s consider a graph whereby:
- Each resource on the web is uniquely identified by a URI. Let’s make such
resource a vertex, and the identity of this vertex be the URI.
- A resource can link to other resources, also identified by URIs. We’ll
make each link a directed edge, and the link’s identity the URI of the
target resource.
- Edges have a type, which corresponds to the type of the target resource.
- Edges might be named, or remain anonymous.
In this graph, the set of a vertex and its outbound edges constitutes a
linked-atom. The linked-atom introduces some additional constraints:
- A linked-atom is immutable. A change in the graph (adding or removing
edges, or changing edge types or edge names) creates a new linked-atom, with
the same vertex identifier and a new revision number.
- Linked-atoms are identified by a composite key composed of the vertex
identifier and the revision identifier.
- The list of all linked-atoms describing all revisions of a vertex
constitutes a collection.
- Collections are uniquely identified by the vertex identifier.
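The constraints above can be sketched in a few lines of Python. This is only
one possible reading: the class names Edge, LinkedAtom and Collection, the
revise method, and the choice to number revisions from 1 are all assumptions
made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Edge:
    target: str                 # URI of the target resource (the edge identity)
    type: str                   # type of the target resource, e.g. "text/html"
    name: Optional[str] = None  # None means the edge is anonymous


@dataclass(frozen=True)
class LinkedAtom:
    vertex: str                 # URI identifying the vertex
    rev_id: int                 # revision identifier
    edges: Tuple[Edge, ...] = ()

    @property
    def key(self):
        """Composite key: vertex identifier plus revision identifier."""
        return (self.vertex, self.rev_id)


class Collection:
    """All revisions of one vertex, identified by the vertex URI."""

    def __init__(self, vertex):
        self.vertex = vertex
        self._revisions = []

    def revise(self, edges):
        """Record a change as a new revision; earlier atoms stay untouched."""
        atom = LinkedAtom(self.vertex, len(self._revisions) + 1, tuple(edges))
        self._revisions.append(atom)
        return atom

    def revisions(self):
        return tuple(self._revisions)


foo = Collection("http://example.com/foo.html")
first = foo.revise([Edge("http://example.com/bar.html", "text/html", "bar")])
second = foo.revise(first.edges + (Edge("http://example.com/toto.png", "image/png"),))
```

Freezing the dataclasses makes the immutability constraint explicit: adding
the second edge does not modify the first atom, it produces a second one
with the same vertex and the next revision number.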
A possible JSON representation for a linked-atom would be:
{
  "id": "http://example.com/foo.html",
  "rev_id": 1,
  "links": [
    {
      "id": "http://example.com/bar.html",
      "rev_id": 2,
      "type": "text/html",
      "name": "bar"
    },
    {
      "id": "http://example.com/toto.png",
      "rev_id": 1,
      "type": "image/png",
      "name": "toto"
    }
  ]
}
A more compact representation of the atom is perhaps more interesting for
extracting information. Instead of grouping all the edges into a single
document, we would describe each vertex-edge relationship as an n-tuple. A
possible implementation of the linked-atom could store the data as n-tuples
of vertex URI, vertex revision, target URI, target revision, type and name:
http://example.com/foo.html 1 http://example.com/bar.html 2 text/html bar
http://example.com/foo.html 1 http://example.com/toto.png 1 image/png toto
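To show that the two representations carry the same information, here is a
small Python sketch that converts between them. The helper names to_tuples
and from_tuples are hypothetical, and the field order is simply the one used
in the two lines above.

```python
import json

EXAMPLE = json.loads("""
{
  "id": "http://example.com/foo.html",
  "rev_id": 1,
  "links": [
    {"id": "http://example.com/bar.html", "rev_id": 2,
     "type": "text/html", "name": "bar"},
    {"id": "http://example.com/toto.png", "rev_id": 1,
     "type": "image/png", "name": "toto"}
  ]
}
""")


def to_tuples(atom):
    """Flatten one linked-atom document into one 6-tuple per edge."""
    return [
        (atom["id"], atom["rev_id"],
         link["id"], link["rev_id"], link["type"], link.get("name"))
        for link in atom["links"]
    ]


def from_tuples(rows):
    """Regroup 6-tuples into documents keyed by (vertex URI, revision)."""
    atoms = {}
    for vertex, rev, target, target_rev, link_type, name in rows:
        atom = atoms.setdefault(
            (vertex, rev), {"id": vertex, "rev_id": rev, "links": []})
        atom["links"].append(
            {"id": target, "rev_id": target_rev, "type": link_type, "name": name})
    return list(atoms.values())


rows = to_tuples(EXAMPLE)
lines = [" ".join(str(field) for field in row) for row in rows]
```

One design consequence worth noting: a vertex with no outbound edges produces
no tuples at all, so the tuple form cannot express an empty linked-atom
without some extra convention.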
The Graph
Using linked-atoms, we can model the information on the web, not simply as a
graph of pages and links, but as a graph of named and typed links between
vertices. Each linked-atom represents a statement about a web resource, a
piece of knowledge. The advantage of a linked-atom graph is that anybody can
publish a document making statements via linked-atoms and collections of
linked-atoms, and not just the publisher of the web resource.
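As a sketch of what this federation could look like, the following Python
snippet merges n-tuples published by two different parties into one graph.
Everything here is assumed for illustration: the build_graph helper, the
publisher labels, and the example.org review page. How revisions from
different publishers should be reconciled is left open, so the sketch simply
ignores them.

```python
from collections import defaultdict

# Hypothetical statements about the same vertex from two publishers.
SOURCES = {
    "example.com": [
        ("http://example.com/foo.html", 1,
         "http://example.com/bar.html", 2, "text/html", "bar"),
    ],
    "example.org": [
        ("http://example.com/foo.html", 1,
         "http://example.org/review.html", 1, "text/html", "review"),
    ],
}


def build_graph(sources):
    """Index outbound edges by vertex URI, remembering who asserted each one."""
    graph = defaultdict(list)
    for publisher, rows in sources.items():
        for vertex, _rev, target, _target_rev, link_type, name in rows:
            graph[vertex].append((name, link_type, target, publisher))
    return dict(graph)


graph = build_graph(SOURCES)
edges = graph["http://example.com/foo.html"]
publishers = {publisher for _name, _type, _target, publisher in edges}
```

Keeping the publisher next to each edge lets a consumer decide which sources
it treats as canonical, rather than having that decision made centrally.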
This is in contrast with Facebook’s Open Graph Protocol, where only the
publisher of the web resource can make such statements, and where only one
consumer assigns identifiers to those statements. Maybe the Linked-Atom is
not the perfect construct, but it provides an alternative to what I see as a
centralized lock-in model that threatens the open nature of the web.