XML is suboptimal, but you have to use it anyway. For good reasons, detailed below, we do not use XML internally or between PSYC applications, but we do use XML where it is useful or inevitable. See these pages for some wisdom on XML:
Using XML as a Protocol for Messaging
"PSYC has quite an unusual syntax. Why didn't you use XML?" is an occasionally asked question, as if XML was the panacea of syntaxes and a reason to replace all existing protocols.
The PSYC syntax has evolved out of the classic RFC 822 syntax used by most Internet protocols, out of the wish to extend the headers with PSYC concepts such as inheritance and state. XML is unsuitable for several of these purposes, as you will see when we go into details:
Major disadvantage: Performance inefficiency
Many people are currently turning away from XML and turning to JSON for efficiency gains, but if you look at the libpsyc benchmarks comparing XML, JSON and PSYC you will find that neither is a very performant format unless you really make use of its respective strengths. Check those benchmark results out, they're really quite enlightening. PSYC turns out to be several times faster than both, whatever input you feed it.
Major disadvantage: Binary Data Transfer
Just like HTTP, PSYC has the ability to say, the following n bytes are binary data - transfer them without trying to interpret them. This may sound totally unspectacular, but several protocols cannot do that. It makes File Transfers or embedding of a photograph or cryptographic key as simple as a fingersnap. XML instead has no simple solution for binary transfers. You either have to escape all XML control characters, or encode the data using something like base64 that needlessly consumes computation power and bandwidth, or you have to encode each message as a separate XML document, and put a binary capable framing protocol around it. XMPP doesn't do that, either.
This also means that XMPP can't just deliver XML data. Ironically, any non-XML protocol is much better suited for delivery of XML than XMPP, because XMPP itself looks like XML and so can't handle anything similar to it. Just like HTTP, PSYC can simply transmit XML as is, while XMPP has to encode it to avoid collision with its own syntax. The result, whether escaped or base64-encoded, will also be quite hard to read, should anyone need to do so.
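To put a rough number on the two fallbacks mentioned above, here is a minimal sketch using only Python's standard library. The payloads and the function names `as_base64` and `as_escaped_text` are made up for illustration; nothing here is PSYC or XMPP code.

```python
import base64
from xml.sax.saxutils import escape


def as_base64(payload: bytes) -> bytes:
    """Encode arbitrary bytes so they can sit inside an XML text node."""
    return base64.b64encode(payload)


def as_escaped_text(document: str) -> str:
    """Escape an XML document so it can be carried as text inside another one."""
    return escape(document)


photo = bytes(range(256)) * 12  # 3072 bytes standing in for binary data
atom = "<entry><title>Hello</title></entry>"

# base64 turns every 3 input bytes into 4 output bytes.
print(len(photo), "raw bytes ->", len(as_base64(photo)), "base64 bytes")
# Escaping keeps the text but makes it hard to read.
print(as_escaped_text(atom))
```

A protocol that announces the body length up front would transmit both payloads unchanged.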
Alternatively, if you twist the XML data in a way that the XML document becomes itself part of the XMPP protocol, then it works. But that's not simple, not transparent, not spontaneous at all: you have to define an XMPP protocol extension with a custom namespace for that. You can't say you're delivering an XML document at the snap of a finger as you would when using HTTP.
XMPP extensions like SOAP over XMPP or ATOM over XMPP re-implement the foreign XML format into XMPP stanzas. This means that whenever you want to receive those packets, you have to parse them, then render the tree structure into a regular XML document, then pass it on to the software library which implements SOAP, ATOM (or whatever).
On the other side, when you want to publish a document you received via SOAP or ATOM, you have to parse it, then re-render it into XMPP. This is a waste of processing power! The only way to avoid this is either to not use libraries and implement everything yourself, or to not use XMPP. Guess what PSYC does: it knows there is an XML thing in its message body and can pass it on without even looking at it.
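The difference between the two relaying styles can be sketched as follows. This is an illustration only: `relay_opaque` and `relay_reparsed` are hypothetical names, and Python's ElementTree stands in for whatever XML library a gateway would actually use.

```python
import xml.etree.ElementTree as ET


def relay_opaque(body: bytes) -> bytes:
    """Forward the body untouched, as a length-framed protocol can."""
    return body


def relay_reparsed(body: bytes) -> bytes:
    """Parse the document into a tree and render it again, as a gateway
    that folds foreign XML into its own stream has to."""
    tree = ET.fromstring(body)
    return ET.tostring(tree)


document = b"<entry  id='1'><title>Hello</title></entry>"

print(relay_opaque(document))    # byte-for-byte identical
print(relay_reparsed(document))  # same information, different bytes
```

Besides the extra CPU work, the re-rendered document is usually no longer byte-identical to what the sender produced, which matters as soon as signatures or checksums are involved.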
Major problem with XMPP: Missing framing
Being able to provide the length of a packet isn't an advantage only for binary packets. It generally allows the receiver to wait for the complete packet before starting to work on it, which allows for more efficient low-level read operations. For PSYC nodes that only provide routing services it is even better, as parsing can stop immediately after the routing header of the packet. Everything else is just read into a buffer at once, which is very processing-friendly.
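A minimal sketch of the idea, assuming a made-up wire format (decimal length, newline, then exactly that many body bytes). This is not the actual PSYC packet syntax; `write_frame` and `read_frame` are hypothetical helpers.

```python
import io
from typing import BinaryIO, Optional


def write_frame(payload: bytes) -> bytes:
    """Prefix the payload with its length in decimal, followed by a newline."""
    return str(len(payload)).encode("ascii") + b"\n" + payload


def read_frame(stream: BinaryIO) -> Optional[bytes]:
    """Read one frame: parse only the length line, then fetch the body in one go."""
    header = stream.readline()
    if not header:
        return None  # clean end of stream
    length = int(header.strip())
    body = stream.read(length)  # no need to look inside the body
    if len(body) != length:
        raise EOFError("stream ended inside a frame")
    return body


wire = io.BytesIO(write_frame(b"<a>\x00\n</a>") + write_frame(b"second"))
print(read_frame(wire))
print(read_frame(wire))
```

The body may contain newlines, null bytes or XML; the reader never has to interpret it to know where the frame ends.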
In XML you have to parse the complete tree before you can find out if something has reached completion, so you have to try parsing every little chunk of data coming from the network, because it might just be the last one you've been waiting for. This makes processing more expensive. XML parser implementations have gotten pretty smart at guessing the number of closed braces to figure out when it is worthwhile to start parsing, but it is still an unnecessarily complicated heuristic approach. Several protocols solve this by wrapping a non-XML length-prefixed framing protocol around XML. In the case of XMPP, however, this is not done:
- "You're correct in your assertion that framed data would mean clean binary transfers, but that isn't a goal of XMPP - anyone shipping binary data directly through XMPP is simply doing something inefficient. If you do want to ship large binary objects, it's more efficient to send them either via email, or send a URL via XMPP, and ship the data by a more suitable protocol."
Flash and some other applications introduced null bytes into the XML stream to frame packets. This doesn't help to optimize read operations, but at least gives you a chance to postpone parsing. XMPP could adopt that.
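A sketch of how a receiver could use such null-byte delimiters to postpone parsing until a packet is complete. The function name `split_null_framed` is hypothetical.

```python
from typing import List, Tuple


def split_null_framed(buffer: bytes) -> Tuple[List[bytes], bytes]:
    """Split a receive buffer on null bytes.

    Returns the complete packets and the unfinished remainder, which the
    caller keeps until more data arrives.
    """
    *complete, rest = buffer.split(b"\x00")
    return complete, rest


packets, pending = split_null_framed(b"<a/>\x00<b>par")
print(packets)  # ready to hand to an XML parser
print(pending)  # not parsed yet
```

The receiver still has to scan every byte for the delimiter, which is why this does not speed up reads the way a length prefix does.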
Minor disadvantage: No built-in state
PSYC makes use of persistence and modification of values kept during a TCP connection or even end to end between communication partners. To achieve this at the lowest syntactical level, the traditional ':' sign of RFC 822 has been extended by '+', '-' and '='. This cannot be mapped to XML without abusing the XML syntax. Jabber has no such concept of low-level state (and thus no bandwidth-saving effects and no data structures for routing). You can use compression on top of XML to achieve similar results per TCP connection, but that will stop working once we implement persistent multicast state beyond TCP connections. Something similar could even be applied to XML, but it is a very big hassle to harness the potential of a decentralized state infrastructure using an XML syntax.
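To give a feeling for what such state operators buy you, here is a deliberately simplified model. The semantics are assumptions made for this sketch, not a statement of the PSYC specification: ':' sets a value for the current packet only, '=' stores it persistently, '+' adds an entry to a persistent list and '-' removes one. The tab-separated line format and the variable names are also just illustrative.

```python
from typing import Dict, List


def apply_packet(state: Dict[str, List[str]], lines: List[str]) -> Dict[str, List[str]]:
    """Update the persistent state in place and return what this packet sees."""
    current: Dict[str, List[str]] = {}
    for line in lines:
        glyph, rest = line[0], line[1:]
        name, _, value = rest.partition("\t")
        if glyph == ":":
            current[name] = [value]                  # this packet only (assumed)
        elif glyph == "=":
            state[name] = [value]                    # persists (assumed)
        elif glyph == "+":
            state.setdefault(name, []).append(value) # add an entry
        elif glyph == "-":
            if value in state.get(name, []):
                state[name].remove(value)            # remove an entry
        else:
            raise ValueError("unknown operator: " + glyph)
    # Persistent values are visible unless the packet overrides them.
    view = {key: list(values) for key, values in state.items()}
    view.update(current)
    return view


state: Dict[str, List[str]] = {}
print(apply_packet(state, ["=_nick\tfippo", "+_members\talice", "+_members\tbob"]))
print(apply_packet(state, ["-_members\talice"]))  # one short line instead of a full resend
```

The second packet carries only the change, yet the receiver ends up with the complete, updated picture.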
Minor disadvantage: No built-in inheritance
Every method and even every single PSYC variable supports inheritance. This means you can take it, add a new keyword to it, and poof, you have a variation which by default behaves just like the original, yet can carry a finer behaviour that for now only you know about. It is a fantastic vehicle for protocol extension, clean and intuitive. In theory you could achieve something similar in XML by adding attributes to existing packet syntaxes, but in XMPP you have to negotiate such an extension every time, which makes it a lot more bureaucratic. Also, it's not the same degree of flexibility, although it does come close.
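One way to picture this fallback is a lookup that strips trailing keywords until it finds something it knows. This is a sketch under the assumption that keywords are joined with underscores, as in the variable names used on this page; the method names and the `resolve` helper are hypothetical.

```python
from typing import Dict, Optional


def resolve(method: str, handlers: Dict[str, str]) -> Optional[str]:
    """Find the most specific known handler for a method name.

    Falls back from '_a_b_c' to '_a_b' to '_a' by dropping the last keyword.
    """
    name = method
    while name:
        if name in handlers:
            return handlers[name]
        name = name.rpartition("_")[0]
    return None


handlers = {"_message": "generic message", "_message_public": "public message"}

# An extension this receiver has never heard of still gets sensible handling.
print(resolve("_message_public_acme", handlers))
print(resolve("_message_private", handlers))
```

No negotiation is needed: a receiver that knows nothing about the extension treats it like its parent.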
Minor disadvantage: No built-in keyword compression
PSYC keywords in variables or methods will be, once standardized, compressed to single bytes. This allows even very complex inheritance structures to be reduced to a few bytes. A protocol which is compact by design will always perform better than a protocol that can only compete by using compression technologies, which you can obviously also apply to PSYC. Should you choose to use short tokens in XML, too, you still have a wrapping syntax which is a lot more verbose.
Major disadvantage: No message display templates
psyctext is a simple but very effective way of bringing both technical information to the client and a useful text rendition of the message to the end user. Such a trick isn't easily feasible with XML. What would you put inside the templates to reference the variables? XPath? (XPath comes up again below.)
In XMPP specifically every client needs to know every single XML message type it may ever receive - there is no way to invent future packets that the client will just handle in a default way until it learns about them. All you can do is wait until a new version of the client is released, which handles the new "XEPs."
Structured Data: Not always a plus for XML
This is notoriously a classic strength of the XML format, and when it comes to handling foreign data, we sometimes find ourselves routing XML packets in our body payloads, mostly to support new XMPP extensions that our code isn't directly aware of.
However we have not encountered many cases where structured data was necessary. Many times large data sets are transmitted when data is supposed to be pulled or synced to something else. This is quite often the wrong approach to the problem: You should push each change to your data to all who need to know about it, just as it happens. When you multicast your data, it automatically becomes little changes of the add something or remove something kind (and we even have + and - for that). You reduce update latency and save large amounts of bandwidth.
PSYC is more like XPath instead. XPath is a representation which lets you map an XML document to a table of pointers and their data. This looks a lot like a series of PSYC variables and their values. It is easier to parse and handle than XML and, given the small amount of structured data we encounter, also more efficient.
<presence to='[_INTERNAL_target_jabber]' from='[_INTERNAL_source_jabber]'>
  <show>xa</show>
  <status>[_description_presence]</status>
  <mood xmlns='http://jabber.org/protocol/mood'><[_INTERNAL_mood_jabber]/></mood>
</presence>
You can see we had to introduce a couple of internal variables to keep the jabber style versions of PSYC variables, yet this template and its variables could be transmitted via PSYC and rendered on the receiving side. Luckily we only do that for Jabber packets that we don't understand.
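Rendering such a template on the receiving side can be sketched in a few lines. This is only an illustration of the `[_variable]` substitution visible in the template above, not the real psyctext implementation; leaving unknown placeholders untouched is an assumption of this sketch, and `render` is a hypothetical name.

```python
import re
from typing import Dict

# A placeholder is a variable name starting with an underscore, in brackets.
_PLACEHOLDER = re.compile(r"\[(_[A-Za-z0-9_]+)\]")


def render(template: str, variables: Dict[str, str]) -> str:
    """Replace every [_name] with its value; keep unknown placeholders as they are."""
    def substitute(match: "re.Match[str]") -> str:
        return variables.get(match.group(1), match.group(0))

    return _PLACEHOLDER.sub(substitute, template)


template = "<status>[_description_presence]</status>"
print(render(template, {"_description_presence": "gone fishing"}))
```

Note that the sketch inserts values verbatim; a template that produces XML would additionally have to escape them.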
We also intend to support binary variables in the future, which can contain arbitrary binary data structures. We use a length-prefixed approach for that, similar to s-expressions or BitTorrent's bencode. Additionally we support JSON if you're desperate.
So even to this purpose, XML is not necessarily the best choice.
Just because some complain we didn't mention the most obvious aspect of XML, here it comes:
XML has too much syntax, too many words. When used with mobile phones or similar technology, it can really become a factor that XML is too verbose. PSYC is a bit verbose too, but only until compact mode is in place. Of course you can throw more technology at it by requiring compression, but even then, what's already compact is compressed to even less.
Major disadvantage: Namespaces are too verbose
XML namespaces are too academic. They are designed to give you a perfect way of extending the XML syntax which can never collide with somebody else's extension. This comes at the expense of plenty of syntactic overhead. Have a look at the ActivityStreams syntax comparison and benchmark we made after implementing libpsyc. Namespaces are so unfriendly that developers turn to JSON and choose to ignore the extension collision problem entirely. That's wrong, too. The right approach is to have a simple and easy extension strategy that is unlikely to collide. PSYC uses method inheritance for this. Just append your (company) name to the method you need to extend, and it is sufficiently unlikely that you will cause havoc by being incompatible with somebody else's extension.
Minor disadvantage: Whitespaces in Attributes
Consider the following piece of XML:
<tagname param1="foo" param2="something with a whitespace" param3="somethingelse"/>
As you see, param2 contains unencoded whitespace. If this was illegal, you could simply do an explode(), split() or whatever your favorite language calls that function to get a list of
param1="foo" param2="something+with+a+whitespace" param3="somethingelse"
But you can't. Instead we use a huge crazy regexp to achieve the same job.
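The problem is easy to reproduce. In this sketch the pattern is a toy that only understands double-quoted values; it ignores single quotes, entities and prefixed names, which is exactly where the real regexp gets huge.

```python
import re
from typing import List, Tuple

# Toy pattern: name="value" with double quotes only.
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def naive_attributes(text: str) -> List[str]:
    """Split on whitespace, as you could if spaces inside values were illegal."""
    return text.split()


def regex_attributes(text: str) -> List[Tuple[str, str]]:
    """Extract (name, value) pairs while respecting the quotes."""
    return ATTRIBUTE.findall(text)


attrs = 'param1="foo" param2="something with a whitespace" param3="somethingelse"'

print(naive_attributes(attrs))  # param2 is torn into several pieces
print(regex_attributes(attrs))  # three clean pairs
```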
ronark notes: "We do a lot of XML based communications at work and even for simple messaging, we find that there is definitely a drop off in speed compared to less verbose techniques. Not just in terms of transmission speed, but a lot of time is spent in the XML parsers. Perhaps this is a by-product of using the XML classes in .NET, but that's the technology we're stuck with. If anyone has some simple benchmarks or tests of XMPP, that would be interesting to see."
<itior> XML is like violence, if it doesn't solve the problem, just use more.
<elmex> xml is like global masochism
<elmex> or more like a wet dream: nice when you dream about it in theory, but when you actually wake up all that's left is a mess...
hehe, or what about this one:
el says: hey, wait a moment, XML is great, anyone can parse it without knowing what's inside
el says: it's like wrapping paper
el says: anyone can unwrap that, too
And another, on xHTML: http://glazman.org/weblog/newarchive/2002_12_08_glazblogarc.html#s85876669
XMPP isn't XML, really
Some say, in the light of today's processing power, it doesn't matter if XMPP uses an inefficient syntax, and yet performance optimizations were important enough for XMPP not to fully embrace the XML standard.
One advantage of XML is the existence of ready to use parsers, and by now, many of them handle the XMPP dialect pretty well. In fact XMPP developers prefer if you don't roll your own. In order to employ such an XML parser, Jabber developers need to tweak certain things before and after parsing:
- XMPP does not allow encodings other than UTF-8, but UTF-16 is a requirement for XML compliance.
- XMPP forbids XML comments, but they are a requirement for XML compliance.
- XMPP forbids processing instructions, like <?include ...
- XMPP does not allow unescaped use of > according to the XML spec (see also Mr. Karneges' mail).
- XMPP applications do not implement namespaces properly.
- XMPP doesn't properly close its document (the stream in that case) when negotiating TLS, instead it reopens a stream on the existing stream. The end result is not a valid XML document. Same thing when using compression.
See also elmex' AnyEvent-XMPP Parser comments.
Snippets and comments
If an XMPP implementation receives characters matching such features over an XML stream, it MUST return a stream error, which SHOULD be <restricted-xml/> but MAY be <bad-format/>
12:05 <elmex> so, "we saw XML once, and it was full of < >"
12:06 <elmex> "and because we really know our XML, we stick namespaces on everything as well"
12:06 <fippo> I wonder anyway what all this schema stuff is for when you're supposed to be "extensible"
12:18 <elmex> Because XMPP does not require the parsing of arbitrary and complete XML documents, there is no requirement that XMPP needs to support the full feature set of [XML].
12:20 <elmex> it's simply not XML
12:21 <elmex> with my XML generator, which validly outputs UTF-16, I can't produce XML streams
12:22 * fippo is glad to have sprintf rather than an XML generator.
12:22 <elmex> for XMPP you just have to reinvent your whole XML toolchain
12:23 <elmex> so that it 1. outputs XMPP-XML and 2. can parse broken XML documents
knorke announces: I thought that's exactly what XML was good for, ignoring unwanted/unrecognized data? :D
knorke announces: why else would anyone use XML? oO
mjacob says: elmex: you can extend it
elmex spouts: mjacob: with XEPs, yes, but the base is still garbage
mjacob says: elmex: but if the cellar is crumbling, there's nothing you can do
mjacob says: I'm too slow, sorry^^
elmex spouts: what's needed is a version 2.0, like the one I once specified for myself, which needs only little handshaking and would fix everything and make a lot of things clearer
mjacob says: then go and make an "XMPP" 2.0
saga ::::: I think we don't need XMPP at all
<elmex> SOAP/XML-RPC over XMPP is another one of those sick things
<elmex> can anybody tell me how I'm supposed to use any SOAP framework at all?
<elmex> I have to parse the XML myself, after all
<elmex> then I have my data structure, and then the SOAP framework needs a way to pull the stuff out of a data structure
<elmex> most of them simply want an XML document
<fippo> well, yeah
<elmex> that's simply not clean layering
<fippo> you render the data structure as a string again
<elmex> I'm using XML::Parser::Expat because expat knows how to parse broken (aka 'partial') XML documents, as XMPP requires.
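The same requirement shows up in any language. As a rough illustration, here is Python's expat-backed `XMLPullParser` (standing in for the Perl module mentioned above) reporting stanzas from a stream whose root element is never closed. The element names and the helper `completed_elements` are made up for this sketch.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List


def completed_elements(chunks: Iterable[str]) -> List[str]:
    """Feed network chunks to an incremental parser and collect the tags of
    all elements that have been closed so far."""
    parser = ET.XMLPullParser(events=("end",))
    tags: List[str] = []
    for chunk in chunks:
        parser.feed(chunk)
        for _event, element in parser.read_events():
            tags.append(element.tag)
    return tags


# The <stream> element is never closed, so this is not a complete document,
# yet the stanzas inside it can already be processed.
print(completed_elements(["<stream><message>hi</message>", "<presence/>"]))
```

A parser that insists on a complete, well-formed document before returning anything cannot be used this way.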
Don't laugh, but PSYC isn't semantically all that distant from what the W3C experts expect from a potential future binary XML format. Binary XML and the parallels to PSYC are discussed on the Syntax page.