Archive for the ‘Internet’ Category

Why node.js Matters

Sunday, June 13th, 2010

Since the days when I coded my first reactor [POSA2] back at MIT, I have been convinced of the conceptual simplicity of non-blocking, event-driven servers. Aside from not blocking for I/O and being able to scale well beyond polling architectures, it is harder, sometimes impossible, to make concurrent programming mistakes with an event-driven programming paradigm. However, reactor servers never took off massively, probably because event-based programming of server-side applications is a more complex programming paradigm for the average programmer than a thread- or process-per-request model. In a way, we have not had enough pressure to move yet. Things would just work, reasonably well.

But I believe we are finally at a turning point. Multi-core architectures imply the end of the free lunch. It is happening. Writing highly concurrent server-side applications that are able to scale linearly as the number of cores increases is something that we should be preparing for.

Once you believe reactor servers are the future, and somewhat the present, comes the question of programming language. When it comes to programming for the web, there are many religions and options: PHP, Ruby, Python, Java, C#, … However, when it comes to writing web applications, there are three languages on which all web developers universally agree: HTML, CSS and Javascript.

And this is what makes node.js exciting. node.js implements a reactor server on top of Google’s V8 Javascript engine. First, it solves the free lunch problem using a language that is universally accessible to web developers. Second, it opens the door for innovation to a world of extremely high concurrency, for example, MMORPGs, and potentially to a completely different web interaction paradigm.
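
To make the reactor idea a bit more concrete, here is a minimal sketch in plain Javascript. It is not how node.js is implemented; the names Reactor, on, emit and run are made up for illustration, and events are simulated rather than coming from real sockets. The point it tries to show is that handlers run one at a time on a single thread, so shared state (the hits counter) can be updated without locks.

```javascript
// Hypothetical minimal reactor, for illustration only.
class Reactor {
  constructor() {
    this.handlers = new Map(); // event type -> handler function
    this.queue = [];           // pending events, in arrival order
  }

  // Register the handler for one event type.
  on(type, handler) {
    this.handlers.set(type, handler);
  }

  // Queue an event; nothing runs until the loop picks it up.
  emit(type, payload) {
    this.queue.push({ type, payload });
  }

  // Single-threaded dispatch loop: each handler runs to completion
  // before the next event is taken from the queue.
  run() {
    while (this.queue.length > 0) {
      const event = this.queue.shift();
      const handler = this.handlers.get(event.type);
      if (handler) handler(event.payload, this);
    }
  }
}

const reactor = new Reactor();
const log = [];
let hits = 0; // shared state, mutated without any locking

reactor.on('request', (path, r) => {
  hits += 1;
  log.push('request ' + path);
  r.emit('response', path); // schedule follow-up work instead of blocking
});
reactor.on('response', path => log.push('response ' + path));

reactor.emit('request', '/a');
reactor.emit('request', '/b');
reactor.run();

console.log(hits, log);
```

In this sketch both requests are accepted before either response is produced, which is the interleaving a thread-per-request server would need a second thread to achieve.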

With motivation in mind, I set out to learn a bit more about node.js. I chose one possible stack, node.js + express + mongodb, but there are many more to learn. What follows in the rest of this post are my raw notes, which I am posting here just in case they help fellow developers get up to speed with all this cool new technology.

Installation

First, we install node.js itself, and kiwi, a packaging system for node.js, using homebrew (I have dropped Macports in favor of homebrew, as it allows me to create my own installation recipes really easily).

$ brew install node
$ brew install kiwi

That installed for me:

kiwi (0.3.1)
node (0.1.98)

Once we have kiwi, we install the express framework:

$ kiwi -v install express

Finally, we install mongodb:

$ brew install mongodb

If this is your first install, automatically load on login with:
    cp /usr/local/Cellar/mongodb/1.4.3-x86_64/org.mongodb.mongod.plist ~/Library/LaunchAgents
    launchctl load -w ~/Library/LaunchAgents/org.mongodb.mongod.plist

If this is an upgrade and you already have the org.mongodb.mongod.plist loaded:
    launchctl unload -w ~/Library/LaunchAgents/org.mongodb.mongod.plist
    cp /usr/local/Cellar/mongodb/1.4.3-x86_64/org.mongodb.mongod.plist ~/Library/LaunchAgents
    launchctl load -w ~/Library/LaunchAgents/org.mongodb.mongod.plist

Or start it manually:
    mongod run --config /usr/local/Cellar/mongodb/1.4.3-x86_64/mongod.conf

And finally, the native adapter to mongodb for node.js:

$ kiwi install mongodb-native

Hello World

Fire up a text editor and create app.js:

var kiwi = require('kiwi')
kiwi.require('express')

get('/', function(){
  this.contentType('html')
  return '<h1>Welcome To Express</h1>'
})

run()

And back to the terminal:

$ node app.js

Point the browser to:

http://localhost:3000/

Tutorial App

I found a good blog with many node.js tutorials, including one on building a blog app example that included source code using express and mongodb.

To run the demo, first start mongodb, and then node.js (I used two terminal windows):

$ mongod run --config /usr/local/Cellar/mongodb/1.4.3-x86_64/mongod.conf
$ node app.js

And then visit http://localhost:3000/. I could not create a new blog post by visiting http://localhost:3000/blog/new, as I’d get an error:

/Users/brunofr/Projects/express-mongodb-2/app.js:39
        title: article.title,
                      ^
TypeError: Cannot read property 'title' of undefined
    at /Users/brunofr/Projects/express-mongodb-2/app.js:39:23

Looking at the code, it seems like the route blog/* is being executed before blog/new:

get('/blog/*', function(id){
  var self = this;
  articleProvider.findById(id, function(error, article) {
    self.render('blog_show.html.haml', {
      locals: {
        title: article.title,
        article:article
      }
    });
  });
});

get('/blog/new', function(){
  this.render('blog_new.html.haml', {
    locals: {
      title: 'New Post'
    }
  });
});

Reversing the order fixed the problem.
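
My guess at the explanation is that routes are tried in the order they were registered and the first match wins, so the wildcard swallows /blog/new and looks up an article with id "new", which does not exist. The sketch below is a hypothetical first-match router written to illustrate that behaviour; makeRouter, get and dispatch are made-up names, and this is not Express’s actual routing code.

```javascript
// Hypothetical first-match router, for illustration only.
function makeRouter() {
  const routes = [];
  return {
    // Register a route; '*' in the pattern captures the rest of the path.
    get(pattern, handler) {
      const source = pattern
        .split('*')
        .map(part => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&')) // escape regex characters
        .join('(.+)');
      routes.push({ regex: new RegExp('^' + source + '$'), handler });
    },
    // Try routes in registration order and stop at the first match.
    dispatch(path) {
      for (const route of routes) {
        const match = route.regex.exec(path);
        if (match) return route.handler(...match.slice(1));
      }
      return null;
    }
  };
}

// Wildcard registered first, as in the tutorial code.
const wildcardFirst = makeRouter();
wildcardFirst.get('/blog/*', id => 'show:' + id);
wildcardFirst.get('/blog/new', () => 'new-form');

// Specific route registered first, as in the fix.
const specificFirst = makeRouter();
specificFirst.get('/blog/new', () => 'new-form');
specificFirst.get('/blog/*', id => 'show:' + id);

console.log(wildcardFirst.dispatch('/blog/new')); // handled by the wildcard, with id 'new'
console.log(specificFirst.dispatch('/blog/new')); // handled by the new-post route
```

With the specific route first, other ids such as /blog/42 still fall through to the wildcard.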

Benchmark

I found one interesting benchmark, on Express vs Sinatra, which as always should be taken with a grain of salt. But the benchmark illustrates some of the properties of reactor servers that we have been talking about.

The Linked-Atom Web

Friday, May 14th, 2010

The release of Facebook’s Open Graph Protocol has spurred renewed interest in the semantic web. I give credit to Facebook for pushing forward an RDFa-derived format onto the world wide web. In fact, RDFa is the least interesting part. Producing semantic data has been around for a long time. Most importantly, I give Facebook credit for focusing on the interesting part of the problem: consumption of semantic data.

And although it’s a great achievement, I regret the locked-in and centralized nature of Facebook’s Open Graph Protocol. The web is an open environment, where open wins in the end.

Entity Identifiers

Facebook assigns internal identifiers to entities, and owns them. These identifiers do not look unique, nor like something that external parties disconnected from Facebook could assign, inspect or do anything useful with. By owning the identifiers, Facebook owns the entity graph.

The assumption of a single party owning the identifiers for entities is fundamentally flawed. Entities appear, disappear, merge and fork. Known entities change constantly and, most importantly, what you and I, or anybody else, understand and know as an entity is different. IMDB and Netflix will both describe the movie “The Book of Eli”, yet each uses a different identifier. To assume that one unique identifier can be used consistently and universally across the web leads to information lock-in.

Luckily, the web is an open environment. And, in the end, open wins on the web (the proof by induction is left to the avid reader).

Identity on the open web is federated, starting at the top-level domain identifiers, and all the way down to fully qualified Uniform Resource Identifiers (URIs). In reality, there are many sources of truth. Some of those are canonical, or trusted, yet many other sources exist. Movie data from IMDB or Netflix will usually be considered canonical. Yet that does not stop Wikipedia, or any web publisher, from creating its own entities. All it takes is a URI.

Anybody on the web, be it Facebook or Freebase, can make statements such as: this URI on Netflix is about the same movie as this other URI on IMDB. All it takes is a web page with a couple of links. The difference between Facebook and the “web” way, is that a web page is identified by a URI, and anybody can create one in a federated way. Not just Facebook.

Semantic Web Annotation Techniques

The most interesting value of the web is not about describing resources, or the roles of the relationships between resources, but the relationships themselves. The statement “Madonna is related to Guy Ritchie” is, in relative terms, more important than the more fact-complete “Madonna is Guy Ritchie’s ex-wife”. Although the fact “ex-wife” increases knowledge, it only has value once we have asserted that the relationship exists. Establishing relationships can be achieved through links and URIs. That’s what the web is about. Pages and links to other pages.

The problem however with the current web of pages and links is that all URIs are anonymous. Ideally, I would like to bring some structure to those relationships so that I can codify exactly what “ex-wife” means. That’s what semantic annotations address: naming the role in the relationship between resources in a structured format.

Inlined semantic annotations, à la Facebook, are one possible way of linking data sets. The issue with this approach, however, is that the meta-data about the resource is mingled with the data. This creates a number of problems. First, there is an arbitrary distinction between what constitutes data and what is metadata. Additionally, one needs to be able to parse the resource format in order to find the semantic annotations. Aside from the computing cost of parsing, adding semantic annotations to videos, images, etc. would require custom extensions to the container format, which is practically infeasible. Practically speaking, inline semantic annotations are only partially useful, and only for text-readable content types.

A better alternative would be to use out-of-line semantic annotations. In this model, we cleanly separate meta-data from the resource data itself. This separation should be not just syntactical, but structural. Semantic annotation constructs should treat the data as an opaque resource. For all we care, we should treat all data as binary resources. If we wanted to look into the data, a specialized parser would read the data, and surface interesting facts that we could then promote to metadata.

The first possible construct that provides out-of-line semantic annotations is the Link HTTP Header. Using link headers we can describe a web resource without having to parse the payload, by simply looking at the HTTP headers. Examples of usage for link headers could include:

  Link: <http://www.cern.ch/TheBook/chapter2>; rel="Previous"
  Link: <mailto:timbl@w3.org>; rev="Made"; title="Tim Berners-Lee"

Although powerful, link headers are not particularly accessible for most publishers, since they require programmatic access to the web server to generate those link headers. Additionally, if we are only interested in the semantic annotations, fetching a full document only to throw it away is highly inefficient both for consumer and publisher.
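
To show how little a consumer needs in order to read such a header, here is a small sketch of a parser in Javascript. It is my own illustration, with parseLinkHeader as a made-up name, and it only handles the simple form used in the two examples above: one link per header value, with quoted or bare parameter values. The full header grammar allows more, for instance several comma-separated links in one value.

```javascript
// Sketch parser for a single Link header value of the form
//   <uri>; name="value"; name="value"
// Returns null when the value does not start with <uri>.
function parseLinkHeader(value) {
  const match = /^\s*<([^>]*)>(.*)$/.exec(value);
  if (!match) return null;

  const link = { uri: match[1], params: {} };
  // Each parameter is ';' followed by name=value, where value is quoted or a bare token.
  const paramPattern = /;\s*([A-Za-z*]+)\s*=\s*(?:"([^"]*)"|([^;\s]+))/g;
  let m;
  while ((m = paramPattern.exec(match[2])) !== null) {
    link.params[m[1].toLowerCase()] = m[2] !== undefined ? m[2] : m[3];
  }
  return link;
}

const previous = parseLinkHeader('<http://www.cern.ch/TheBook/chapter2>; rel="Previous"');
const made = parseLinkHeader('<mailto:timbl@w3.org>; rev="Made"; title="Tim Berners-Lee"');

console.log(previous.uri, previous.params.rel);
console.log(made.uri, made.params.rev, made.params.title);
```

Note that the resource body never enters the picture: the relationship is recovered from the header alone.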

A better construct to provide out-of-line semantic annotations is to create a separate resource altogether representing the semantic annotations for the parent resource. This alternative, like link headers, also differentiates between data and meta-data, but does not require stack changes to HTTP. Furthermore, it does not require the publisher to generate those semantic annotations. Anybody can publish semantic annotation documents for any resource on the web. In that sense, the use of out-of-line semantic documents that describe web resources is truly open and federated.

The Linked-Atom

There are many possible formats for describing resources in an out-of-line fashion. One of the most interesting formats is the Atom Publishing Format. With Atom, the focus is on identity and linking resources. The actual content is hidden away.

But within the context of semantic annotations, Atom has its shortcomings. Rather than abusing Atom, perhaps we need to create a separate, specialized out-of-line resource descriptor. I call such a format the “Linked-Atom”. There are a few differences between Atoms and Linked-Atoms. Whereas the atom is a generic format for content publishing, the linked-atom is only used for linking web resources.

Let’s consider a graph whereby:

  • Each resource on the web is uniquely identified by a URI. Let’s make such resource a vertex, and the identity of this vertex be the URI.
  • A resource can link to other resources, also identified by URIs. We’ll make each link a directed edge, and the link’s identity the URI of the target resource.
  • Edges have a type, which corresponds to the type of the target resource.
  • Edges might be named, or remain anonymous.

In this graph, the set of a vertex and its outbound edges constitutes a linked-atom. The linked-atom introduces some additional constraints:

  • A linked-atom is immutable. A change in the graph (adding or removing edges, or changing edge types or edge names) creates a new linked-atom, with the same vertex identifier and a new revision number.
  • Linked-atoms are identified by a composite key composed of the vertex identifier and the revision identifier.
  • The list of all linked-atoms describing all revisions of a vertex constitutes a collection.
  • Collections are uniquely identified by the vertex identifier.
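
These constraints can be sketched in a few lines of Javascript. This is only an illustration under my own assumptions: the helper names (createAtom, atomKey, publish), the in-memory Map used as a store, and revision numbers that simply count up from 1 are all made up for the example.

```javascript
// Hypothetical in-memory store: vertex identifier -> array of all revisions (a collection).
const collections = new Map();

// Build an immutable linked-atom: a vertex id, a revision id and its outbound edges.
function createAtom(id, revId, links) {
  return Object.freeze({
    id,
    rev_id: revId,
    links: Object.freeze(links.map(link => Object.freeze({ ...link })))
  });
}

// Composite key made of the vertex identifier and the revision identifier.
function atomKey(atom) {
  return JSON.stringify([atom.id, atom.rev_id]);
}

// Publishing never modifies an existing atom; it appends a new revision to the collection.
function publish(id, links) {
  const revisions = collections.get(id) || [];
  const atom = createAtom(id, revisions.length + 1, links);
  collections.set(id, revisions.concat([atom]));
  return atom;
}

const vertex = 'http://example.com/foo.html';
const first = publish(vertex, [
  { id: 'http://example.com/bar.html', rev_id: 2, type: 'text/html', name: 'bar' }
]);
// Adding an edge produces a second atom; the first one is left untouched.
const second = publish(vertex, first.links.concat([
  { id: 'http://example.com/toto.png', rev_id: 1, type: 'image/png', name: 'toto' }
]));

console.log(atomKey(first), first.links.length);
console.log(atomKey(second), second.links.length);
console.log(collections.get(vertex).length);
```

Keeping every revision around is what makes the collection a history of statements about the vertex rather than a single mutable record.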

A possible JSON representation for a linked-atom would be:

    {
        "id": "http://example.com/foo.html",
        "rev_id": 1,
        "links": [
            {
                "id": "http://example.com/bar.html",
                "rev_id": 2,
                "type": "text/html",
                "name": "bar" 
            },
            {
                "id": "http://example.com/toto.png",
                "rev_id": 1,
                "type": "image/png",
                "name": "toto" 
            }
        ] 
    }

A more compact representation of the atom is perhaps more interesting for extracting information. Instead of grouping all the edges into a single document, we would describe each vertex-edge relationship as an n-tuple. A possible implementation of the linked-atom could use n-tuples to store the data:

    http://example.com/foo.html 1 http://example.com/bar.html 2 text/html bar
    http://example.com/foo.html 1 http://example.com/toto.png 1 image/png toto
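
Going from the JSON document to the n-tuples is mechanical. A minimal sketch in Javascript, assuming the field names of the JSON representation above and a space-separated tuple layout (toTuples is a made-up name):

```javascript
// The linked-atom from the JSON representation above.
const atom = {
  id: 'http://example.com/foo.html',
  rev_id: 1,
  links: [
    { id: 'http://example.com/bar.html', rev_id: 2, type: 'text/html', name: 'bar' },
    { id: 'http://example.com/toto.png', rev_id: 1, type: 'image/png', name: 'toto' }
  ]
};

// Flatten one linked-atom into one tuple per outbound edge:
//   vertex id, vertex revision, target id, target revision, type, name
function toTuples(linkedAtom) {
  return linkedAtom.links.map(link =>
    [linkedAtom.id, linkedAtom.rev_id, link.id, link.rev_id, link.type, link.name].join(' ')
  );
}

toTuples(atom).forEach(line => console.log(line));
```

A space-separated layout only works while edge names contain no spaces; a real store would need a proper delimiter or escaping.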

The Graph

Using linked-atoms, we can model the information in the web, not simply as a graph of pages and links, but a graph of named and typed links between vertices. Each linked-atom represents a statement about a web resource, a piece of knowledge. The advantage of a linked-atom graph is that anybody can publish a document making statements via linked-atoms and collections of linked-atoms, and not just the publisher of the web resource.

This is in contrast with Facebook’s Open Graph Protocol, where only the publisher of the web resource can make such statements, and where only one consumer assigns identifiers to those statements. Maybe the Linked-Atom is not the perfect construct, but it provides an alternative to what I see as a centralized lock-in model that threatens the open nature of the web.

Brilliant Tech Video Ad

Tuesday, May 4th, 2010

Why facebook is just a game

Friday, May 2nd, 2008

I keep saying that the web follows the pattern of the fashion industry. It’s about image, it’s about fun, it’s about entertaining. Sure, you can also buy stuff and do useful things. But that’s what you have to do, not what you want to do.

That’s why facebook is popular – it’s fun, it’s about image, it’s about entertainment.

And now, to prove it, a nice chart for the breakdown of Facebook applications.

Facebook application breakdown

SUNW becomes JAVA

Sunday, August 26th, 2007

Sun is changing its ticker symbol from SUNW to JAVA, as announced on Sun CEO Jonathan Schwartz’s weblog on Thursday, August 23rd. There has been a lot of mixed feedback. Most techies and engineers inside and outside Sun are criticizing the decision, as they see it narrowing Sun to Java technology. However, Wall Street did not seem to care much.

The stock went up 1.62%, while the Nasdaq index recovered 1.38% so one can possibly assume the market was insensitive to the change. While the volume was double the average, so was the market’s, so again no change.

This seems to be a change purely driven by brand awareness. As Schwartz puts it in his blog:

What’s that distribution and awareness worth to us? It’s hard to say – brands, like employees, aren’t expenses, they’re investments. Measuring their value is more art than science. But there’s no doubt in my mind more people know Java than Sun Microsystems. There’s similarly no doubt they know Java more than nearly any other brand on the internet.

It strikes me that as much as this change might help Sun’s marketing strategy, it will likely damage its ability to hire and retain smart engineers. Time will tell.

Open LinkedIn Platform Should Focus on Privacy

Tuesday, August 21st, 2007

LinkedIn’s CEO Reid Hoffman promised at the end of June to open the LinkedIn platform, very much in line with Facebook publishing its developer APIs, and surely hoping to experience some of the same growth Facebook is seeing thanks to opening its APIs. I hope however that LinkedIn is thinking about all the risks associated with opening up a business community.

LinkedIn will need to review and approve every single application out there consuming their services. The last thing you want is a pile of lawsuits on your desk because of misappropriated data, especially personal data covered by the EU/95 Privacy Directive, also implemented in the UK via the Data Protection Act, and somehow applicable to US companies under the Safe Harbor Agreements.

LinkedIn should focus on opening the APIs for its users. One of my main complaints with LinkedIn is that it is very good at sucking in my data, but it’s very hard to get some of that data back, let’s say synchronizing with my phone’s address book or even the simpler operation of importing my contacts into my Outlook calendar. That’s where I would like to see LinkedIn going, allowing developers to write such plugins, for us to access our own data. Anything beyond this very personal use of the data might end up hurting LinkedIn, and what is worse from a business perspective, possibly dilute it into another, smaller, does-it-all Facebook.

Measuring time spent at a site rather than hits

Sunday, August 19th, 2007

In July, Nielsen’s NetRatings changed its web traffic measurements to focus on time spent at a given site rather than the traditional page views, and page views per user (PV/UU). Since then, many web 2.0 sites, including communities, gaming, video, etc. have received this change as the Holy Grail of web ratings, even those whose ranking went down.

While it is true that time spent at a site increases exposure to ad display, and possibly CPM, the time-based measurement paradigm is only applicable to countries with deep internet and broadband penetration. In countries in Eastern Europe, Russia, South America, Africa and South-East Asia, much of the population still connects via dial-up modems and hits are a much better metric. The ability to watch streamed audio and video in these markets is very limited; gaming is not responsive enough; and the engagement in social networking is rather limited. Or as Yahoo’s Peter Daboll put it: “You’re never going to have one metric that’s the holy grail of Internet measurement.”

The sad thing about Nielsen’s NetRatings change from hits to time spent has not been the change itself, but all the FUD around it. This is one of the annoying things about the internet: evil viral marketing takes over in significantly less time than in the non-virtual world, where the power of scrutiny stops the FUD.