Sunday, December 16, 2012

A digression to the heart of the matter

I "invented" a new micro-language over the weekend. I put the word in air quotes, because I was actually trying very hard not to do any such thing, but was forced by dreadful necessity to do some inventing anyway.

Why the odd programmer self-loathing? Because there are already too many microformats in use, and I'm not sure that adding "yet another unifying standard" to the mix will help. But I need it.

Here's the problem: URLs. That's it in a nutshell. Bloody URLs.

There is a spec for them, but hardly anyone reads it. Even those who have read it usually just chop URL strings up with Regular Expressions to get the bit they want, and fail utterly when given anything that doesn't start with http:// or https://. So it's a minefield.
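For example, a typical quick-and-dirty approach (this snippet is purely illustrative, not from any real codebase) falls over the moment the URL isn't a plain http(s) one:

    // Works for the happy path...
    var parts = "http://example.com/page?id=42".match(
        /^https?:\/\/([^\/?#]+)([^?#]*)(\?[^#]*)?(#.*)?$/);
    // parts[1] === "example.com", parts[2] === "/page", parts[3] === "?id=42"

    // ...and returns null for perfectly legal URLs that don't start with http(s)://
    "//cdn.example.com/lib.js".match(/^https?:\/\//);    // null
    "mailto:someone@example.com".match(/^https?:\/\//);  // null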

The latest crazy thing to do is to make extensive use of what they call the 'hash fragment': everything that comes after the '#' sign in the URL. It is special for two reasons:

  • It is never sent to the server during page requests; it's only available to the client browser and scripts.
  • Clicking on a page link that differs only in its hash fragment from the current page does not reload the page, and in fact does nothing at all (except notify your scripts) if the fragment id doesn't match up with a real element id.

Remember, making the browser focus on an element half-way down the page was the original intention of the hash fragment, so all browsers support it and are 'tolerant' of this abuse. (i.e. they don't throw errors or 'correct' the URL just because a weird fragment ID doesn't actually exist in the page.) We are just extending that metaphor, and it happens to work consistently on almost every browser ever made.
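The basic mechanics look like this (a minimal sketch; nothing here is specific to any framework):

    // Changing only the fragment never triggers a page load or a server request.
    window.location.hash = "#section-3";
    console.log(window.location.hash);   // "#section-3"

    // Browsers that support the hashchange event will tell your script about it.
    window.addEventListener("hashchange", function () {
        console.log("fragment is now", window.location.hash);
    });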

A classic modern use of the hash fragment is to put database article identifiers in it that correspond to AJAX calls which obtain that 'page fragment'. When your script detects that the user has clicked on a new 'fragment link', the corresponding article is loaded. Sites like Twitter and Facebook have reached the point where there is only one real 'page' for the entire site, and everything you see is dynamically loaded into it.
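A sketch of that pattern, assuming a hypothetical /ajax/article endpoint that returns an HTML snippet, and a single #container element on the page:

    function loadFromFragment() {
        var match = window.location.hash.match(/^#!article=(\d+)$/);
        if (!match) { return; }
        var xhr = new XMLHttpRequest();
        xhr.open("GET", "/ajax/article?id=" + match[1]);   // assumed endpoint
        xhr.onload = function () {
            document.getElementById("container").innerHTML = xhr.responseText;
        };
        xhr.send();
    }
    window.addEventListener("hashchange", loadFromFragment);
    loadFromFragment();   // also handle a bookmarked or shared link on first load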

One consequence has been that such AJAX-driven sites were difficult for search engines (i.e. Google) to index properly, since none of the 'content' was on any real pages anymore. So they came up with a scheme: when the search engine (which acts like a big automatic browser) sees a URL with a hash fragment, it calls a 'special' URL with that fragment passed as a normal server GET parameter (basically, the URL is juggled around a bit to put the client part in the server part), and the site can then tell the search engine all about the text and paragraphs that the 'hash fragment' corresponds to.

The search engine can even publish those links, and since they still contain the article id, your site will load up that content and the user will see what they expect!

So long as everyone does their jobs right.

Just to be sure that they don't accidentally index sites which can't cope with this behavior, Google (and therefore all search engines) uses the presence of the "!" symbol at the start of the fragment to indicate "this is an indexable hash link". Since the 'bang' character must be the first one following the 'hash', it is informally referred to as a "hashbang". (For the sound made by primitive parsers mangling your parameter strings just before they crash.)
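As I understand the scheme, the juggling looks roughly like this:

    What the user (and your script) sees:
        http://example.com/map#!article=42

    What the crawler actually requests from your server:
        http://example.com/map?_escaped_fragment_=article=42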

Why does this matter? Well, let's say I have a map application (hey, I do!) and I want to record the co-ordinates of the user's current view in the URL, in such a way that if they copy the link and send it to someone, or just bookmark it, then going back to that link reloads the map at that same geographic location.

These cannot be ordinary server GET parameters, because we can't change the browser's URL to that without causing a page reload. It has to stay in the hash fragment, which means the client has to do all the work of decoding its own parameters.
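Something along these lines is what I'm after (the map object and its setView method are hypothetical stand-ins for whatever mapping API you happen to be using):

    // Record the current view in the fragment so the URL can be bookmarked or shared.
    function saveView(lat, lng, zoom) {
        // Assigning to location.hash adds a history entry but does not reload the page.
        window.location.hash = "lat=" + lat + "&lng=" + lng + "&zoom=" + zoom;
    }

    // Restore the view when someone opens a copied or bookmarked link.
    function restoreView() {
        var m = window.location.hash.match(/^#lat=([-\d.]+)&lng=([-\d.]+)&zoom=(\d+)$/);
        if (m) {
            // 'map' and setView() are placeholders for the real mapping API.
            map.setView(parseFloat(m[1]), parseFloat(m[2]), parseInt(m[3], 10));
        }
    }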

In fact, wouldn't it be nice if we could just append an entire serialized Javascript object up there in the URL? And then deserialize it again with the same ease as JSON? Hey, why don't you just turn Javascript objects into JSON strings, then URLEncode them? Well:
  • The standard Javascript encode/escape routines don't quite do all the edge-case characters right.
  • JSON transforms really badly when URLEncoded. As in, every bracket and quote turning into three or six characters bad (see the sketch after this list).
  • Let's not even get into the unicode/utf8 conversion issues.
  • The result is usually an unreadable mess.
  • A lot of people do exactly that anyway.
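To make those last points concrete, here is the naive stringify-then-escape approach on a small (made-up) map state object:

    var state = { lat: -27.47, lng: 153.02, zoom: 12, layers: ["roads", "labels"] };

    var json = JSON.stringify(state);
    // {"lat":-27.47,"lng":153.02,"zoom":12,"layers":["roads","labels"]}

    var escaped = encodeURIComponent(json);
    // %7B%22lat%22%3A-27.47%2C%22lng%22%3A153.02%2C%22zoom%22%3A12%2C%22layers%22%3A%5B%22roads%22%2C%22labels%22%5D%7D
    // Every brace, bracket, quote, colon and comma has ballooned into three characters.

    // The round trip works, but the string itself is the problem:
    var back = JSON.parse(decodeURIComponent(escaped));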
Well, if the standard JSON format encodes badly because of its syntax character choices, why not just make some different choices that are URL-friendly? And then deal with the other character encoding issues?

...oh, and it would be nice if the data didn't change format when in the hash fragment compared to when it gets used as a server parameter, since different characters are allowed...
... and it would be nice if we could save some bytes by leaving out some of the JSON 'syntactic sugar' (like quoted identifiers) that aren't totally necessary...
... and if there were a more efficient way of encoding binary strings than 'percent' escaping, that would be good....
... and it would be nice if the user (ie, the developer) could still hand-edit some parameters, but then hide them away without affecting anything else...

That's pretty much what I have done. I'm calling the result JSON-U.


So, that's the why. Next time, the how.
