Out of the Eclipse to Netbeans

August 20th, 2009

I have been using emacs and the command line for almost 20 years now. Once in a while I dip into IDEs, but I always go back to the command line. My biggest gripe with IDEs is that they keep me away from the actual build, and that I normally have to spend time duplicating build configuration in the IDE. The canonical source for build configuration should be the build system (Make, ant, maven, etc.). With an IDE one always has to keep the two in sync, which leads to errors and all sorts of weird stuff.

But I keep looking for the holy grail. In the past, I’ve been relatively happy with Eclipse for the odd refactoring job. I would fire it up, fix, close it, and go back to the command line.

This time around, I wanted to see if an IDE would make me stop using emacs. I set myself a few requirements:

  • Auto-complete. Not intellisense crap. Basic emacs completion is enough.
  • Maven is the source. I don’t want to keep a parallel reality when the maven poms already have all the information in them.
  • Unit testing. I want to run and debug my unit tests from the IDE.
  • Basic re-factoring, especially code usage reports.
  • Light on resource usage.

I looked into IntelliJ IDEA 8.1, Netbeans 6.7 and Eclipse 3.5. Here are my findings.

IDEA

I got an evaluation license and only used it a few days. IDEA is a fantastic IDE, and despite high memory consumption, it felt light-weight and kept the system very responsive. In principle, it met all the requirements, and the refactoring support was simply the best out of the lot. However, at some point things started to go wrong:

  • Compilation errors. Cannot find symbol, although running maven from the command line would not throw any errors. I did not want to spend time figuring out what was wrong here. I am sure some IDEA expert would have told me I was doing something wrong, or that some configuration had to be changed. But IDEA was already failing the requirement: maven is the source. I don’t want to fiddle around with the IDE settings.
  • Mercurial integration problems. The plugin kept on throwing up and breaking the environment. I had to uninstall it, but that’s really a minor point.

Eclipse

Then came Eclipse. Overall it looked good, but there was no Maven support out of the box. I tried installing Q4e, but getting JUnit support in debug mode was just too painful. m2eclipse worked fine, and Eclipse met all requirements with it.

Netbeans

Netbeans was last. I’ve never liked Netbeans. But I did now. A lot.

It looked good. It mapped nicely onto OS X, which the others did not, in terms of user input and display. It supported Maven out of the box as a source of truth for build configuration. It offered substantially more refactoring features than Eclipse, but fewer than IDEA. Mercurial integration was spot on. And I could run my unit tests in debug mode.

But most importantly, Netbeans did not get in my way. I did not have to configure anything to get going. It was like using emacs, but with all the goodies.

The winner

I am going for Netbeans. IDEA looks good, but I am not sure I would pay the license fee for the powerful refactoring features. Others might, but I probably won’t. Eclipse has also improved, but having to fiddle with plugins to get Maven support was really sub-standard.

So, it is Netbeans.

Making XMPP Work for the Mobile Environment

August 9th, 2009

I’ve been thinking about a standards-based client for a mobile environment. XMPP has quite a few strong merits over its competitors, such as OMA IMPS and SIP Messaging. For one, it’s a community standard, and it’s actually possible to submit new specs if so required. Secondly, it’s becoming the standard protocol for IM, and it’s emerging in the open Pub/Sub infrastructure, i.e. why-polling-is-bad.

Unfortunately, there are still a couple of key problems with XMPP in a mobile environment – none of them can be solved in a standard way. I wonder if we could harvest a couple of good ideas from OMA IMPS and spec them under XMPP.

Bandwidth

The XMPP stream is verbose, which is not a bad thing by itself, except when you live in a mobile environment. Compression, either at the application (gzip) or the payload level (XEP-138), shifts the problem onto CPU cycles: you still need to decompress first, then decode, which increases latency and shortens battery life. Ideally, you would like to encode the payload into something that can be decoded and parsed in one pass without a separate decompression step, e.g. WBXML.
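
To make the trade-off concrete, here is a minimal Python sketch. It assumes a made-up presence stanza and plain zlib (the same deflate algorithm gzip uses), so it is an illustration rather than how any particular XMPP client works. The point is that the receiver has to inflate the bytes in one pass before the XML parser can see a single element in a second pass.

```python
import zlib
import xml.etree.ElementTree as ET

# Hypothetical stanza, repeated to mimic a verbose stream of similar updates.
STANZA = (
    '<presence from="alice@example.org/phone" to="bob@example.org">'
    "<show>away</show><status>In a meeting</status></presence>"
)
stream = ("<stream>" + STANZA * 50 + "</stream>").encode("utf-8")

# Sender side: deflate the whole stream.
compressed = zlib.compress(stream)

# Receiver side: two separate passes, inflate first, then parse.
inflated = zlib.decompress(compressed)
root = ET.fromstring(inflated)

print(len(stream), "bytes raw,", len(compressed), "bytes compressed")
print(len(root.findall("presence")), "stanzas parsed after inflating")
```

The bandwidth saving is real, but both passes cost CPU on the handset, which is the cost a binary encoding would try to avoid.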

Unfortunately it seems like the community has previously lashed out against these changes. When Google shifted from XMPP to its custom binary XML protocol for GTalk API, there was a heated discussion about its suitability. But what are folks suggesting that works for the mobile environment, and not just smartphones? I have not seen anything that would work on all phones, hence WBXML. Please prove me wrong.

Real-time Notifications

The default XMPP stream uses bi-directional TCP/IP connections to allow real-time notifications (messaging, presence, etc.). Open sockets in a mobile environment are a bad idea for quite a few reasons, and although BOSH/HTTP-binding for XMPP solves the TCP/IP persistent connection problem, it is still expensive from a device perspective, as there is always a hanging HTTP request. As far as I can tell, there is currently no XMPP solution to this, and I can see how an extension to XMPP adding WAP-PUSH, WAP-UDP, and perhaps standalone UDP, would make sense.

Again, I’d like to think there is a standard solution for this. And no, iPhone Push notification does not really count.

The Standard Way

We can let folks come up with their own implementations, and then try to retrofit a spec for the sake of interoperability. But we know how difficult this becomes in other processes like OASIS, JCP, etc. I wonder, then, if it is time now to spec this out (and no, spec jokes don’t count):

  • Binary XMPP
  • Push Notifications

Composable and Concurrent

April 23rd, 2009

In my previous entry on the present and future of programming languages, I briefly covered the reasons I think it is important to be looking at this problem now. I thought I would expand the discussion.

The laws of physics, as far as we know them, have stopped us from making chips significantly faster, so hardware manufacturers now place more and more cores on one die instead. The result for us as software developers is that we are now faced with computers with multiple cores and multiple CPUs. It is now the norm to find computers with 2 cores, and most servers are now 4 or 8 core machines.

Historically, developers lived on the free lunch: they knew that with time code would run faster, since clock speeds would keep rising. Chips will still continue to get faster in the future, but only marginally compared with what we’ve seen over the last 18 years. Today, it’s not about running faster, but about having more transistors. And what once was a good programming paradigm for the single core is no longer valid for the multi-core situation.

Here is where concurrent programming is born. Writing software that scales to multiple cores with the current programming paradigms is hard. The main mechanism we know is the usage of threads, and locks to protect shared state between threads. It is a model somewhat understood by developers, but it is extremely hard to get right.

On the server-side, programming languages that can only leverage one core are going to have a really hard time competing in the future. Programming languages that have adopted the thread-per-request model are better placed, but will nevertheless hit a wall at some point because they rely on locks to protect the transactionality of shared state. Multi-threaded programming and locks are not scalable, mainly because they are not composable.
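
A small Python sketch may help to show what "not composable" means here. The `Account` class and both transfer functions are hypothetical names made up for the illustration. Each method is safe on its own, yet gluing two of them together is not atomic, and fixing that forces the caller to know about the locks inside both objects.

```python
import threading

class Account:
    """Each method is thread-safe on its own thanks to a private lock."""
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

    def withdraw(self, amount):
        with self.lock:
            self.balance -= amount

    def deposit(self, amount):
        with self.lock:
            self.balance += amount

def naive_transfer(src, dst, amount, observe=None):
    # Composing two safe operations: between the two calls the money is
    # in neither account, and any other thread may observe that state.
    src.withdraw(amount)
    if observe is not None:
        observe()
    dst.deposit(amount)

def locked_transfer(src, dst, amount):
    # To make the transfer atomic the caller must reach into both objects
    # and take their locks in a global order (here: by id) to avoid
    # deadlock. Assumes src and dst are different accounts.
    first, second = sorted((src, dst), key=id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount

a, b = Account(100), Account(0)
seen = []
naive_transfer(a, b, 30, observe=lambda: seen.append(a.balance + b.balance))
print(seen)                   # [70]: 30 units were "missing" mid-transfer
locked_transfer(a, b, 20)
print(a.balance, b.balance)   # 50 50
```

The `observe` hook only stands in for another thread looking at the accounts at the wrong moment; in real code that window is what makes the bug intermittent and hard to reproduce.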

Building composable software really requires us to move away from sharing state. Working with immutable data structures, and immutable containers, allows us to avoid state, hence allowing modularity and composability (the core principle of software development for the last 35 years, and the philosophy of UNIX tools). But working with immutable data goes against extensibility principles in object oriented programming.

There are a few paradigms that have been in the research arena for the last 10 years and are now finally being adopted. These paradigms take the shape of message passing, parallel programming and functional programming. Functional programming languages are naturally parallel and use immutable data. Unfortunately functional programming languages are not mainstream, and most functional languages that have been successful (Haskell and Lisp) actually allow mutable structures. Erlang however is an exception. Erlang is naturally parallel, uses only immutable data, and is based on message passing. Erlang is a good language for tackling the concurrent programming problem.
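
For readers who have not seen the message passing style, here is a rough Python approximation of it. It is only a sketch: Python's `queue.Queue` uses locks internally and threads are far heavier than Erlang processes, so this shows the shape of the idea, not Erlang's runtime. The `counter` function and the message tuples are made up for the example.

```python
import queue
import threading

def counter(mailbox):
    """Owns its state; the only way to touch it is to send a message."""
    total = 0
    while True:
        msg = mailbox.get()
        if msg[0] == "add":
            total += msg[1]
        elif msg[0] == "get":
            msg[1].put(total)   # reply on the queue supplied by the caller
        elif msg[0] == "stop":
            break

mailbox = queue.Queue()
worker = threading.Thread(target=counter, args=(mailbox,))
worker.start()

def producer(n):
    # Messages are immutable tuples, so nothing is shared after sending.
    for _ in range(n):
        mailbox.put(("add", 1))

producers = [threading.Thread(target=producer, args=(1000,)) for _ in range(4)]
for p in producers:
    p.start()
for p in producers:
    p.join()

reply = queue.Queue()
mailbox.put(("get", reply))
result = reply.get()
mailbox.put(("stop",))
worker.join()
print(result)   # 4000
```

No application code takes a lock, and the counter can be combined with other such workers simply by passing mailboxes around.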

The question though is whether adopting a language like Erlang would make sense for most organizations. Some may argue instead that evolving the abstraction of concurrency into the current leading programming languages is the way forward. I’d disagree. This is more or less what Scala does, but given that it lives in a JVM, with threads and locks, and uses libraries that use threads and locks, it’s unlikely that new language constructs will allow existing code bases to be fixed. To me, Erlang looks like a much better choice. It’s a proven choice with many years of history.

On Language Polyglotism

April 22nd, 2009

I believe being a polyglot is nothing but an advantage, and that polyglots are normally the best programmers. As a matter of fact, I challenge myself to learn a new programming language every year. Last year it was Scala, this year I’ve started learning and writing some Erlang (and yes, there is a pattern here for functional programming languages).

One of the reasons I like learning programming languages is a drive to reach the Holy Programming Grail: I find it most fascinating to discover which language is better suited for which task. In other words, some languages excel at tasks where other languages fail. An organization could decide to allow all those languages to exist, and if we were doing best-of-breed in every problem domain, we would end up with a language soup. So the question then is, is such soup good or bad? In other words, when we think about developing Internet applications: which and how many languages should an organization use?

There are two factors to consider in the choice of languages (assuming the language is functionally the best choice for solving the problem at hand): talent and operations.

Talent is truly an interesting one. Whereas a startup can possibly afford to pick more exotic languages based on their goodness, large organizations can do less so because of talent. This is because A-class developers will excel in any language, and usually A-class developers are polyglots. But one can only hire so many A-class developers, and eventually you end up growing into B- and C-class developers. Your ability to maintain software at that point really depends on your C-class hiring pipeline. Pick a rare language, and you are asking for trouble.

Operations is very different from talent. Here the size of the company also matters. Developer cost will be the starting cost, but as the business becomes successful operational cost becomes the key driver (especially in consumer Internet applications). In a way, you want to pick the languages that drive your productivity higher at first, and later slowly move towards reducing your operational cost.

If you follow me, you’ll notice talent and operations are conflicting today. Java is the mainstream programming language worldwide, i.e. A-, B- and C-class talent is widely available, yet scaling Java horizontally at linear cost is difficult. On the opposite end, Erlang is a good choice for distributed concurrency and scaling linearly, yet B- and C-class talent is difficult to find.

I used to advocate the JVM as the minimum host environment, yet allowing language choice across the stack. As long as it could run on the JVM and I could manage it, any language was fine. One could run the complete stack on a JVM. The presentation tier in JRuby, Jython, P8 or Caucho’s PHP implementation. The application logic and persistence in Java or Scala. The batch computing in Scala. But it just does not work that well. Even though Scala has the artifacts to work, the JVM and the runtime libraries don’t. Shared memory, native threads and locks mean the JVM will be more expensive to scale and operate in the long run. The JVM is a feasible, stable and proven alternative, yet it is an expensive choice (in operational cost and probably lost opportunity because of slower agility).

Given this context, my language choices would be:

  • Conservative stack, which gets the job done well enough, can be operated, and you can get accessible talent.
    • Frontend: PHP
    • Application Logic: Java
    • Persistence: Java and C/C++
    • Batch: Java
  • Medium Risk stack, that addresses some of today’s shortcomings, but reduces the talent pool:
    • Frontend: JRuby on a JVM, with Merb
    • Application Logic: Scala on a JVM, with Jersey
    • Persistence (non-relational): Scala on a JVM
    • Batch: Scala, with Hadoop
  • High Risk stack, forward looking to the highly multi-core CPU roadmaps announced by Intel and AMD, that reduces even further the talent-pool:
    • Frontend: JRuby on a JVM, with Merb
    • Application Logic: Erlang. Actually, I’d love to see a Scala-like language targeted to the Erlang VM to make it more palatable. But for now, I’d settle on Erlang.
    • Persistence (non-relational): Erlang
    • Batch: Erlang

Heads or Tails

April 7th, 2009

If you have been following the media industry over the last years, you’ll have surely realized that traditional media is struggling. Even before the crisis, advertisers had started switching from offline to online, and now with the crisis we see a consolidation into a few ad agencies and a few brokers.

Many magazines and newspapers are in such a position that if their debt does not get re-financed they will have to file for bankruptcy. Many music labels find themselves in a very similar position, or have already gone belly up. Video houses are struggling with costs. It’s not possible to compete with pirated content. TV channels are going bust.

But there is also a lot of smoke-n-mirrors going on.

I hear claims. Some argue that traditional media is dying. Some advocate social media has taken over. Some forget the economics of running a media business. That the article is dying. That head content is now irrelevant. That the democratization of content production will increase quality. That user-generated content is the future.

Others blame ISPs and the likes of Yahoo, Google, AOL for this. They talk about stealing content and taking away the advertiser money flows they so badly need right now to survive. Some go preaching about how policing the unregulated wild wild web with Digital Rights Management is the only way for them to remain profitable.

What is for sure is that the media industry is suffering. What is also certain is that consumers are generally rational, and make purchasing decisions based on utility. Many consumers are today rather pissed off with the traditional media. And I believe what we are seeing is just a consequence of the media business’s own behavior, being extremely zealous to protect their content rights with solutions that make the consumer experience worse, rather than better.

Is it not annoying that you can’t, legally, make a copy of a DVD for backup purposes? Is it not annoying that you can’t buy DVDs while you travel to other continents? Is it not annoying that your DRM tracks only play with proprietary technology? Is it not insane that games are region locked?

Like Lola would say, I surely think it’s extremely highly very annoying.

A long, long time ago, information traveled slowly, at a snail’s pace actually. Controlling where a film was being shown, or who had the tapes, or who was selling the newspapers was a relatively easy thing to do. Yet, already then, there was sharing. You’d share a magazine at the hairdresser’s. You’d share a VHS tape. But, for most, things were controllable since the medium was perishable, and things were not flowing through the pipes, yet.

The Internet has changed it. Information is available at speed of light, at whatever bandwidth you have contracted. Sharing is easy now. And because they can, and because that’s how humans are, consumers want to share what they own. The same way they can share a book, or photos, or toys, …

Revenue starts to go down on the rich media side of the business due to pirated copies flowing on the net. So DRM moves in. Badly in. Actually completely gone wrong. It alienates consumers. Napster, eMule, Kazaa, … all appear because consumers want it. But truth is that DRM is protecting some of the revenue, and that’s why content rights owners keep lobbying for it. But it’s still going downhill. High-Definition media appears and somehow partially fixes the problem. But it won’t last long. It’s only working because of constraints in bandwidth.

Paper media businesses have trouble getting audience. You’ll find many reasons for it, but the key underlying factor is that they’ve ignored speed of light. Information can flow immediately. And if it can, it will. Those media businesses that understood and embraced this, new incumbents mostly, but also some old faces, have been doing better. But they have made the industry more competitive as a consequence. And those that did not understand it find themselves struggling.

So what do you do when you are losing? For starters, you find a scapegoat. Done that. Ironically, deep inside, you realize your business model is gone and you are just trying to hang onto it for as long as possible. In reality, you are just buying time. Precious time you don’t have, to figure out the next business model.

Yet, one has to ask where did it all go wrong? Audiences are actually not spending that much time on tail content. High quality production “head” content is still where online minutes are spent. Although I put some value to UGC, I still believe good productions require good funding. Good talent needs to be compensated. Be it via ads, via subscriptions, via concerts, …

To me the issue that traditional media needs to address is how to operate their businesses at a whole new level of revenue and cost. Since revenue levels per production unit are down by one order of magnitude, one must produce significantly more, and do it on a much lower cost basis. I can think of three things the media industry can do to achieve this:

  • The tail, by definition, is long. Sorting out the tail is a mess. To start with, go manual, and use your editors to find the gems in the tail. Cream it out. Find the new head, at no production cost. But don’t stop there.
  • In parallel, invest in algorithmic matching. Eventually you want to surface all the metadata in media to replace the editor. We can’t today. But it should be the goal.
  • Production costs are high and need to be shared among all players. By sharing production costs with your competitors (you hear me well) you are actually removing those cost components that should be commodities, so that you can move up the stack to solve the creative problem better. Likewise, if you can share the production costs with the long-tail of user-generated productions, all the better. You are basically getting that subsidy you need at production time. Since you have addressed costs, you can now play with content pricing. You can charge, or you can go free. You have gained that freedom back. You can design new business models. Consumers benefit. Producers benefit.

The first item is really a way to get going. The latter two are truly strategic, and are what I call the “open media infrastructure services”. Think of it like the Amazon services for media production and distribution.

Much of that infrastructure already exists for distribution, as the channels of pirated content, and they are actually almost free. Production is not there yet, and will require investment. But without taking these steps, I am doubtful head content production will last. Or if it does, it will do so only out of government subsidies, using taxpayers’ money. And that, really, would be a pity.

A Web Linked by Atoms

March 27th, 2009

Looking back at the past 20 years of the history of the internet, one can realize that what we call today the web has changed dramatically, but that one thing is certain: it will continue to evolve. Gone are the days of manually editorialized directories, of static content, of one way conversations. Today we live in a dynamic, social, interactive web. And it’s a web that is traversing the digital realm and changing how we understand our physical world, breaking communication barriers and making information and knowledge easily accessible.

Yet, although much has changed, the web remains an almost completely deregulated environment, not only on the legal side, but also on the technical side. That’s one of the web’s merits, making it easy to publish documents. Those documents are part of a forest: many sit in isolation, and only a small number of links connect them. The hyperlink is the web’s way of connecting documents, yet, because the web is a flexible environment, there are no requirements on developers as to how they structure their documents, nor as to what link schemes they use.

But what is an advantage on one end, reducing the barriers of entry to publishers and developers, is also a disadvantage on the other end. We end up with billions of documents that are difficult to navigate, and the information, although there, is difficult to find and access.

Organizing the web has been a growing pain since the early days. As the web evolved, so did the technologies used to organize it. In the beginning, only a few universities and government agencies had documents, and it was feasible to maintain a directory. Later on, commercial entities entered the web at an exponential rate, and businesses were created just to solve exactly this problem: organize the web.

The web grew so fast that we were no longer able to reasonably scale the production and maintenance of web directories. And so came web search engines, based on the information retrieval research of the late 80s and early 90s, to solve this very problem. But search engines are starting to show their age:

  • We are accustomed to a web that is almost infinite, so deep and wide that we know that not even search engines can reach the very end-leafs of it. We know that the long tail is getting longer, and that the amount of disconnected content is just growing every day.
  • Traditional media is struggling to find its place in the web social ecosystem. “Head” content is expensive to produce, and is becoming less and less profitable to operate. The audience is moving to tail and (social) user generated content, where it spends hours. Media businesses, as we know them, are dying slowly. Content is no longer king, and guess what, search can’t find the new king.

Today’s problem is also today’s opportunity. Interestingly, the tools necessary to fix this are almost in place. This is the “Atom Web”, or a web linked by Atoms, which some would perhaps deem a bastard child of the semantic web. The Atom Web is a web where the hyperlink is really empowered as the new king, and takes the crown away from content. It’s a web where document and link meta-data is more important than data, just like the PageRank algorithm changed search by using link (citation) flows as a measure of relevance.

The Atom Syndication Format and the Atom Publishing Protocol have the capabilities to change the web as we know it, well beyond the initial purpose to serve as a content syndication mechanism. Atom contains almost all the necessary meta-data missing in an HTML document, plus it provides a neat way to link content. But to achieve this, we need to empower Atom by introducing three new principles:

  • Content is opaque. We shouldn’t accept (almost) any extensions to Atom. Atom should be reserved as an envelope containing only meta-data, with the source element containing all the content or a reference to the content resource. Perhaps in a few cases we might need extensions, like search, paging, presence and location. But rather, I see the very large majority of extensions as content.
  • Post Atoms, Get Feeds. Atom entries have little meaning by themselves. Instead, they live in collections through feeds. An entry may belong to many collections. Collections are views, representations, of the underlying documents, grouped in a wide array of combinations. Feeds can represent pretty much anything: the revision history of an entry, all known content about an entity, a blend of categories and topics, related content, etc. Congruent collections are much easier for a UI consumer, as a simple request returns everything necessary to render a widget, a module, or a complete page, perhaps even as JSON directly from the browser. Notice that I am not assigning any semantics to the feed, that’s up to the implementation, but specifying that a feed is the first class citizen, followed by the entry.
  • Links are type-aware. Links are the most important part of Atom, and content-types are the tools of choice to help bring semantic value to the relations (rather than custom schemas extending Atom). In a way, the Atom Web is a reduced subset, a predecessor, of the Semantic Web, since we don’t necessarily know the subject-predicate relationship for all structured data (yet). I rather think of something along the lines of <link type="application/atom+xml;content=application/foo+xml">, which would be describing the content-type inside the referred Atom document. Obviously, you may still point directly to the underlying resource, because you can, and because that’s how the web works today.
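
To show how little machinery a consumer would need for such type-aware links, here is a minimal Python sketch that splits the type attribute into its base media type and parameters. The `content` parameter is only the idea floated above, not a registered media type parameter, and `parse_media_type` is a hypothetical helper written for this example.

```python
def parse_media_type(value):
    """Split 'type/subtype;key=value;...' into the base type and a dict of parameters."""
    base, *params = [part.strip() for part in value.split(";")]
    parsed = {}
    for param in params:
        if not param:
            continue  # tolerate a trailing semicolon
        key, _, val = param.partition("=")
        parsed[key.strip().lower()] = val.strip().strip('"')
    return base.lower(), parsed

link_type = "application/atom+xml;content=application/foo+xml"
base, params = parse_media_type(link_type)
print(base)                    # application/atom+xml
print(params.get("content"))   # application/foo+xml
```

A client could then decide whether to follow the link based on the inner content-type alone, without fetching the referred Atom document first.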

Certainly, there is a bit of web infrastructure required to make this true. The same way we have the web server for the sometimes-linked web, we need the atom server for the always-linked web. But an atom server is not just a web server adhering to the AtomPub protocol. Atom server implementations can provide whichever machinery they want as long as they adhere to the three principles above. An atom server needs to give publishers the ability to define collections, or more precisely, the matching rules that will define views, which will be exposed as collections through feed resources. Such view generation will be offloaded partially to batch computing, perhaps in the form of map/reduce jobs. Some implementations may for example decide to use schema-less document stores in order to generate those views.
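
As a rough illustration of collections defined by matching rules, here is a small Python sketch. Everything in it is hypothetical: the `Entry` fields, the rule names and the in-memory batch step stand in for whatever store and map/reduce machinery a real atom server would use.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entry:
    id: str
    title: str
    categories: frozenset = field(default_factory=frozenset)

# Hypothetical matching rules: collection name -> predicate over an entry.
RULES = {
    "trains": lambda e: "trains" in e.categories,
    "london": lambda e: "london" in e.categories,
    "everything": lambda e: True,
}

def build_feeds(entries, rules):
    """Batch step: evaluate every rule against every entry to materialize the views."""
    return {name: [e.id for e in entries if match(e)] for name, match in rules.items()}

entries = [
    Entry("urn:entry:1", "Snow stops trains", frozenset({"trains", "london"})),
    Entry("urn:entry:2", "New tube map", frozenset({"london"})),
    Entry("urn:entry:3", "Atom notes"),
]
feeds = build_feeds(entries, RULES)
for name, ids in feeds.items():
    print(name, ids)
```

The first entry ends up in three feeds at once, which is the "post atoms, get feeds" idea: the publisher posts an entry once and the server decides which views it appears in.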

One could get started today with this and build a gigantic store replicating copies of the content hosted elsewhere on the web. The store could be populated either by crawling or by allowing developers to publish into it. But eventually, this store would have problems finding all the deep end-leaf documents.

A few big players adopting Atom to expose their data, for example making the webmap available through Atom and opening the API for publishers to post onto the store, could dramatically change this game. There would still be plenty of problems to solve, ranging from the now even more important link-spam detection and moderation, to the not so obvious scaling problems, but at least we would be one step closer to organizing the web.

Painfully HTTP Java

February 11th, 2009

I have been working on AtomPub and coding a prototype over the past few evenings to get Apache Abdera talking to CouchDB. I know there is an experimental adapter in Abdera for CouchDB, but it does not use CouchDB the way I need to.

I decided to write my own provider, workspace manager, adapter, etc. I have been using couchdb4j, the Java library binding used to talk to CouchDB, and json-lib, a JSON library for Java.

Why, it has been really especially extremely very painful!

HTTP and JSON, REST a bit, and to a lesser degree XML, are supposed to be technologies that make your life easier as a programmer and as a server. But with Java, it’s the other way round. What should be easy becomes complex. I am very disillusioned about the suitability of the Java language in pretty much any domain related to HTTP.

With Java in HTTP, you have to jump API hurdles, pull hundreds of dependencies, and end up with stack traces deep as a black hole. Yeah, it works, mostly because of the amazing development support and tools for Java, such as Maven and Eclipse. But that alone does not make Java a viable technology for programming HTTP.

Java has its uses, and it is a good programming language. I don’t think I’d ever code C++ again if I could use Java. But that’s more on the number crunching side, for the Hadoops of this world.

I’ll stay with my PHP, Ruby and Python while I can. Perhaps it’s also time to learn a functional programming language like Erlang.

Snow blocks railway company websites

February 2nd, 2009

We knew it was coming. The Met Office kept talking about it. Heavy snow across the South East of England. Regardless, this morning is a complete mess, as expected, and the country comes to a halt, as usual.

And to a degree, it is normal. It is very difficult to size transport for peak emergencies. Possible, but costly to the taxpayers. So, in a way, I understand we are stuck today.

But what I don’t understand is how all the web infrastructure associated with network railways is basically down.

Let’s start with Southern Railway:

$ curl -v http://www.southernrailway.com/
* About to connect() to www.southernrailway.com port 80 (#0)
*   Trying 213.86.249.117... connected
* Connected to www.southernrailway.com (213.86.249.117) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.16.3 (powerpc-apple-darwin9.0) libcurl/7.16.3 OpenSSL/0.9.7l zlib/1.2.3
> Host: www.southernrailway.com
> Accept: */*
>
< HTTP/1.1 500 Generated error
< Date: Mon, 02 Feb 2009 07:57:42 GMT
< Connection: close
< Content-Type: text/html
<
<html><body> <h2>No suitable nodes are available to serve your request.</h2></body></html>
* Closing connection #0

Okay, so let’s see SouthEastern:

$ curl -v http://www.southeasternrailway.co.uk/
* About to connect() to www.southeasternrailway.co.uk port 80 (#0)
*   Trying 213.86.249.125... connected
* Connected to www.southeasternrailway.co.uk (213.86.249.125) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.16.3 (powerpc-apple-darwin9.0) libcurl/7.16.3 OpenSSL/0.9.7l zlib/1.2.3
> Host: www.southeasternrailway.co.uk
> Accept: */*
>
< HTTP/1.1 500 Generated error
< Date: Mon, 02 Feb 2009 08:04:03 GMT
< Connection: close
< Content-Type: text/html
<
<html><body> <h2>No suitable nodes are available to serve your request.</h2></body></html>
* Closing connection #0

Not provisioning capacity for an unforeseen event could be excused, but not provisioning capacity for a foreseen event has no excuse, especially for an infrastructure company. This is the core of a public service, and if doing this is not possible, it’s time to revisit whether their current hosting arrangements are adequate. Definitely, not impressed, especially when cloud infrastructure is becoming so easily available.

pywbxml revisited

January 14th, 2009

I know I said I was not likely to fix pywbxml, but the alternative was even less appealing. I know I have to move away from Java:

  1. Continuing to use a mix of Java and PHP to leverage the PHP PECL wbxml extension creates unnecessary complexity.
  2. Moving the PHP code doing wbxml2xml and xml2wbxml transformations to Java is possible, but I noticed a lack of activity on KML at sourceforge (the project that contains the wbxml library for Java). libwbxml, in contrast, is owned and maintained by the opensync folks, and it’s only the Python bindings that needed some love.
  3. The XMPP library I use in Java, Smack, uses a one-thread-per-socket model, which on the server side, maintaining hundreds or thousands of users, would require much more memory in the box than I would like. I would rather use an event-based / reactor architectural style.
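To make the third point concrete, here is a minimal sketch of the event-based style: one thread serving many sockets. It uses asyncio from the modern Python 3 standard library rather than Twisted or Smack, and every name in it (handle, client, main, run) is made up for illustration.

```python
import asyncio


async def handle(reader, writer):
    # One coroutine per connection; no thread is created for it.
    while True:
        line = await reader.readline()
        if not line:  # the peer closed the connection
            break
        writer.write(line)  # echo the line back
        await writer.drain()
    writer.close()
    await writer.wait_closed()


async def client(port, n):
    # Illustrative client: send one line and wait for the echo.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(("hello %d\n" % n).encode())
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()
    return reply


async def main(clients):
    # Port 0 asks the operating system for any free port.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        # All clients are in flight at once on the same event loop.
        return await asyncio.gather(*(client(port, n) for n in range(clients)))


def run(clients=20):
    return asyncio.run(main(clients))
```

Calling run(20) opens twenty connections at the same time and returns the twenty echoed lines, all handled by a single event loop instead of twenty threads.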

So I decided to patch the Python bindings for my needs (rather than fixing them properly by exposing the args through the API). The Python bindings are written in Pyrex (hence the .pyx file), so editing src/pywbxml.pyx is enough to get the dictionary I wanted for IMPS (WBXML_LANG_WV_CSP12). The change is straightforward, simply:

params.lang = WBXML_LANG_WV_CSP12

or the unified patch:

--- pywbxml-0.1/src/pywbxml.pyx 2006-07-28 01:51:57.000000000 +0100
+++ pywbxml-0.1-wv/src/pywbxml.pyx  2009-01-14 21:55:40.000000000 +0000
@@ -14,7 +14,7 @@
     object PyString_FromStringAndSize(char *s, int len)
     int PyString_AsStringAndSize(object obj, char **buffer, int *length)

-class WBXMLParseError:
+class WBXMLParseError(Exception):
     def __init__(self, code):
         self.code = code
         self.description = <char *> wbxml_errors_string(code)
@@ -28,7 +28,7 @@
     cdef WBXMLGenXMLParams params

     params.gen_type = WBXML_GEN_XML_CANONICAL
-    params.lang = WBXML_LANG_AIRSYNC
+    params.lang = WBXML_LANG_WV_CSP12
     params.indent = 0
     params.keep_ignorable_ws = 1

You will notice an additional change in the patch for WBXMLParseError: this comes from the Debian package, and ensures that we are raising a proper exception.
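Why the base class matters can be shown in isolation. The sketch below is not pywbxml code; both class names and the helper are made up for illustration. It is written for Python 3, where raising an instance of a class that does not derive from BaseException fails with a TypeError. On the Python 2.5 used in this post such an object could still be raised, but an `except Exception` clause would not catch it.

```python
class PlainParseError:
    # Hypothetical stand-in for the unpatched class: no Exception base.
    def __init__(self, code):
        self.code = code


class ProperParseError(Exception):
    # Hypothetical stand-in for the patched class.
    def __init__(self, code):
        Exception.__init__(self, code)
        self.code = code


def raised_type(error_class):
    """Raise error_class(1) and report the name of what was actually caught."""
    try:
        raise error_class(1)
    except Exception as exc:
        return type(exc).__name__
```

With these definitions, raised_type(ProperParseError) reports the class itself, while raised_type(PlainParseError) reports a TypeError raised by the interpreter, so the error code never reaches the caller.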

With this applied, compiled, and installed, I ran the same test with an IMPS CSP document, this time successfully:

$ python
Python 2.5.1 (r251:54863, Apr 15 2008, 22:57:26)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pywbxml import xml2wbxml, wbxml2xml
>>> xml = """<?xml version="1.0"?><!DOCTYPE WV-CSP-Message PUBLIC "-//OMA//DTD WV-CSP 1.2//EN" "http://www.openmobilealliance.org/DTD/WV-CSP.DTD"><WV-CSP-Message xmlns="http://www.wireless-village.org/CSP1.1"><Session><SessionDescriptor><SessionType>Outband</SessionType></SessionDescriptor><Transaction><TransactionDescriptor><TransactionMode>Request</TransactionMode><TransactionID>nok1</TransactionID></TransactionDescriptor><TransactionContent xmlns="http://www.wireless-village.org/TRC1.1"><Login-Request><UserID>hermes.onesoup</UserID><ClientID><URL>WV:IMPEC01$00001@NOK.S60</URL></ClientID><Password>xxxxxx</Password><TimeToLive>86400</TimeToLive><SessionCookie>wv:nokia.1789505498</SessionCookie></Login-Request></TransactionContent></Transaction></Session></WV-CSP-Message>"""
>>> binary = xml2wbxml(xml)
>>> text = wbxml2xml(binary)
>>> text
'<?xml version="1.0"?><!DOCTYPE WV-CSP-Message PUBLIC "-//OMA//DTD WV-CSP 1.2//EN" "http://www.openmobilealliance.org/DTD/WV-CSP.DTD"><WV-CSP-Message><Session><SessionDescriptor><SessionType>Outband</SessionType></SessionDescriptor><Transaction><TransactionDescriptor><TransactionMode>Request</TransactionMode><TransactionID>nok1</TransactionID></TransactionDescriptor><TransactionContent><Login-Request><UserID>hermes.onesoup</UserID><ClientID><URL>WV:IMPEC01$00001@NOK.S60</URL></ClientID><Password>xxxxxx</Password><TimeToLive>86400</TimeToLive><SessionCookie>wv:nokia.1789505498</SessionCookie></Login-Request></TransactionContent></Transaction></Session></WV-CSP-Message>'
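Comparing the two strings by eye is tedious, and the output drops the xmlns attributes, so a plain string comparison would fail. A minimal sketch of a structural comparison that ignores namespaces, using only the standard library (the function names are made up for illustration):

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    # ElementTree spells namespaced tags as "{uri}name"; keep only "name".
    return tag.rsplit("}", 1)[-1]


def shape(element):
    """Return a nested (tag, text, children) tuple with namespaces removed."""
    text = (element.text or "").strip()
    children = tuple(shape(child) for child in element)
    return (local_name(element.tag), text, children)


def same_content(xml_a, xml_b):
    """True if both documents have the same elements and text, namespaces aside."""
    return shape(ET.fromstring(xml_a)) == shape(ET.fromstring(xml_b))
```

In the session above, same_content(xml, text) would be the call to make. Attributes are not compared, which is enough here because the only attributes in the input are the xmlns declarations.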

Yes, I am now happier.

Getting libwbxml on MacPorts

January 13th, 2009

I need to be able to receive wbxml (binary XML) and transform it to xml using the IMPS 1.1/1.2/1.3 dictionary in Python. In PHP, I was using the PHP PECL extension wbxml, which uses libwbxml (wbxml2). In Python, I need pywbxml. I like MacPorts, and I’ll use that instead of compiling the packages myself.

  1. Install MacPorts.
  2. Self-update:
    sudo port -v selfupdate
    
  3. Some general goodness I can’t live without (wget will pull a long dependency list of things we’ll need for web development):
    sudo port install wget
    
  4. Wait.
  5. Install wbxml:
    sudo port install wbxml2
    
  6. And we get an error:
    $ sudo port -v install wbxml2
    --->  Configuring wbxml2
    Error: Target org.macports.configure returned: invalid command name "cd"
    Warning: the following items did not execute (for wbxml2):
    org.macports.activate org.macports.configure org.macports.build
    org.macports.destroot org.macports.install
    Error: Status 1 encountered during processing.
    

So I filed a ticket. It seems like wbxml2 ownership moved to the opensync folks since the port was added. The port builds on 0.9.0 and the latest version is 0.10.1. I don’t know much (nothing?) about MacPorts, but I went ahead and patched it. You can find the patch on the ticket itself.

Finally, moving on to pywbxml. It seems the only maintained pywbxml code is from the synce folks.

  1. Download pywbxml from sourceforge.
  2. Enter the virtualenv:
    $ source ~/Sites/python/imps/bin/activate
    
  3. pywbxml needs pyrex, so we’ll install that first with:
    $ easy_install pyrex
    
  4. Install pywbxml:
    $ tar xzfv pywbxml-0.1.tar.gz
    $ cd pywbxml-0.1
    $ ./configure
    $ make
    $ make install
    
  5. Optionally, move the libraries to the virtualenv to keep things clean:
    $ mv /Library/Python/2.5/site-packages/pywbxml.* \
        /Users/brunofr/Sites/python/imps/lib/python2.5/site-packages
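After moving files around like this, it is worth checking which copy Python actually picks up. A minimal sketch for a current Python 3 (the helper name is made up; in this setup the name to pass would be "pywbxml"):

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin
```

Run inside the activated virtualenv, the returned path should point into the virtualenv’s site-packages rather than the system one.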
    

Now, we can test it:

  $ python
  Python 2.5.1 (r251:54863, Apr 15 2008, 22:57:26)
  [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> from pywbxml import xml2wbxml, wbxml2xml
  >>> xml = """<?xml version="1.0"?><!DOCTYPE WV-CSP-Message PUBLIC "-//OMA//DTD WV-CSP 1.2//EN" "http://www.openmobilealliance.org/DTD/WV-CSP.DTD"><WV-CSP-Message xmlns="http://www.wireless-village.org/CSP1.1"><Session><SessionDescriptor><SessionType>Outband</SessionType></SessionDescriptor><Transaction><TransactionDescriptor><TransactionMode>Request</TransactionMode><TransactionID>nok1</TransactionID></TransactionDescriptor><TransactionContent xmlns="http://www.wireless-village.org/TRC1.1"><Login-Request><UserID>hermes.onesoup</UserID><ClientID><URL>WV:IMPEC01$00001@NOK.S60</URL></ClientID><Password>xxxxxx</Password><TimeToLive>86400</TimeToLive><SessionCookie>wv:nokia.1789505498</SessionCookie></Login-Request></TransactionContent></Transaction></Session></WV-CSP-Message>"""
  >>> binary = xml2wbxml(xml)
  >>> text = wbxml2xml(binary)
  >>> binary
  '\x03\x01j\x00Imnp\x80\x19\x01\x01rtv\x80 \x01u\x03nok1\x00\x01\x01s\x00\x01]\x00\x00z\x03hermes.onesoup\x00\x01Jw\x03WV:\x00\x80\x12\x03PEC01$00001@NOK.S60\x00\x01\x01\x00\x01a\x03xxxxxx\x00\x01r\xc3\x03\x01Q\x80\x01p\x03wv:nokia.1789505498\x00\x01\x01\x01\x01\x01\x01'
  >>> text
  '<?xml version="1.0"?><!DOCTYPE AirSync PUBLIC "-//AIRSYNC//DTD AirSync//EN" "http://www.microsoft.com/"><Delete xmlns="http://synce.org/formats/airsync_wm5/airsync"><unknown><unknown><unknown><Truncation/></unknown></unknown><unknown><unknown><unknown><Supported/></unknown><unknown>nok1</unknown></unknown><unknown><Email3Address><unknown>hermes.onesoup</unknown><Fetch xmlns="http://synce.org/formats/airsync_wm5/airsync"><unknown>WV:<CollectionId/>PEC01$00001@NOK.S60</unknown></Fetch><HomeCity>xxxxxx</HomeCity><PagerNumber>\x01Q\x80</PagerNumber><OtherState>wv:nokia.1789505498</OtherState></Email3Address></unknown></unknown></unknown></Delete>'

It seems like it’s not using the WV CSP 1.1/1.2 dictionary (not surprising since I did not specify it anywhere). Looking at the code, in pywbxml.pyx, I can see:

  params.lang = WBXML_LANG_AIRSYNC

So, whereas the PHP PECL wbxml extension is exposing all the parameters in libwbxml, it seems like the Python version pywbxml is not, and it’s hard-coding assumptions.
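The first bytes of the binary output shown above can be decoded by hand to see what the document says about itself. The sketch below follows the header layout as I read it in the WBXML specification (a version byte, then public identifier, charset and string table length, each as a multi-byte integer). It is written for Python 3 bytes, and the function names are made up for illustration.

```python
def read_mb_u_int32(data, pos):
    """Decode a WBXML multi-byte integer starting at data[pos].

    Each byte carries 7 bits of payload; the high bit is set on every
    byte except the last. Returns (value, next_position).
    """
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos


def parse_header(data):
    """Return the WBXML header fields found at the start of data."""
    # Upper four bits: major version minus one; lower four bits: minor version.
    version = "%d.%d" % ((data[0] >> 4) + 1, data[0] & 0x0F)
    public_id, pos = read_mb_u_int32(data, 1)
    public_id_index = None
    if public_id == 0:
        # A zero public id means the identifier is a string table index instead.
        public_id_index, pos = read_mb_u_int32(data, pos)
    charset, pos = read_mb_u_int32(data, pos)
    table_length, pos = read_mb_u_int32(data, pos)
    return {
        "version": version,
        "public_id": public_id,
        "public_id_index": public_id_index,
        "charset": charset,  # IANA MIBenum value
        "string_table_length": table_length,
        "body_offset": pos + table_length,  # where the tokenised body starts
    }
```

For the header above, parse_header(b"\x03\x01j\x00") gives version 1.3, public id 1, charset 106 (UTF-8 in the IANA registry) and an empty string table. Public id 1 is the value the specification reserves for an unknown or missing identifier, which would fit with the decoder relying on params.lang rather than on the document itself.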

Lesson learned from my day with Python, Twisted, wbxml, etc.: spending my time fixing the basics does not allow me to work on the problems I need to solve. Will I fix pywbxml? I don’t know, but most likely not.