SQL was intended as a language to access relational databases, where the major point about it is to be able to join tables and discover information about your data by combining it. But the way it is being used by most applications is rather like a storage medium with a few extra capabilities. That's how SQL is mentioned a lot on Storage.

Contents

Problems with SQL and decentralized systems

One fundamental difference between SQL and an OODBMS or a file-based serialization of objects as psyced does it by default is the fact that SQL disables all encapsulation. Administrators seem to find it a gesture of freedom to be able to log into their data and do arbitrary modifications to it, but in the design of a decentralized system like PSYC such modifications can cause serious desynchronization between nodes. Administrator interference is probably only a minor source of desync, and yet, do we want it at all? Should the PSYC application provide a complete API instead, allowing for all database operations to happen in a way that goes hand in hand with the PSYC entity model? The advantages of having an SQL backend are the popular standardized backup strategies and tools etc. But another uncool aspect about SQL is that it still glues strings together and parses them again, even within one application or host, that's so 70s. SQL queries should be compiled data structures, not strings. Apparently there is no easy solution to this, so the common way to address this is to use precompiled and stored procedures in a non-standard way. So you need a language-specific wrapper interface like JDBC. For SQL within one application SQLite seems to be the most popular choice.

SQL as a glue between PSYC and other applications

Some people get to this page because they have large SQL based applications and would like to have a PSYC server share the same user data just by accessing the SQL database. Some others simply would like to use familiar tools to access the data. Still using these tools requires a gatewaying platform which in this case is the SQL database.

For web applications that do not keep state and always store things back to SQL sharing the database works quite well, but it won't work for PSYC as a PSYC server couldn't perform in any reasonable amount of time if it were to query a database each time it needs to find out whether an incoming message is from a friend or a stranger etc etc.

So either you only let the PSYC server modify SQL data, which forces you to rewrite the other application to go through the PSYC server, then you no longer need SQL for any purpose really (and you're thrown back to problem #1 above), or you leave the scope of traditional SQL and use a rather recent database feature called triggers to be informed automatically when the other application would like to commit changes to the data. As 20after4 points out there is also a third option of using the replication mechanisms as an event protocol interface, that is plug PSYC notification into the replication mechanism. And finally fippo suggested we could implement the query language ourselves in psyced.

One way or the other, this is the point when SQL goes beyond the scope of being a query language but becomes de facto a sort of network protocol with requests, replies and events.

Let's get into the details of each of these approaches.

Trigger-based Synchronization Daemon

Here's the trigger documentation: postgres (pgsql) and mysql to name some popular choices.

A trigger-based solution would install PSYC sendmsg() functions into the SQL database, using any of the trigger API languages, so that it can inform psyced of changes happening to the data by means of SQL statements.

According to the postgres-docs shown above it is evident that interfaces for perl and python are available at least, and our PSYC implementation in perlpsyc is terrific, so such a data gateway daemon could be done like this. Just look at the new psycsyncd, it already does the job half of the way.

Consider that such a daemon needs to be adapted to each specific case, as every web-based community platform or similar applications have their own table structure. So, psycsyncd needs the glue to your table structures now.

Still I'm not entirely sure such a trigger creature is a nice construct. Do the SQL vendors also provide a simpler change notification API than embedded code? Would be most logical to do so.

Related: Here's documentation about a NOTIFY and LISTEN extension of postgres.

Replicated Databases

sticky bit suggests that even having two databases on the localhosts of the two sides of the application with activated replication between each other may be of some use.

<lynX> I feel there is still a risk of wrong data in caches causing problems. Better ensure you have the current data, than put your trust in technology which is not built for distributed eventing.

Replication Notification

There is also the option of using the replication mechanisms as an event protocol interface, that is plug PSYC notification into the replication mechanism. 20after4 points out that the replication mechanism of sql servers is designed for such operations.

Insert links to appropriate documentation here.

Emulate SQL in a PSYC server

fippo suggests, PSYC could emulate a limited set of database functions so that the app talks SQL to PSYC instead of talking SQL to the database. This is an interesting approach.

Circumventing SQL in the end: Storage Synchronization

This all is based on the presumption that the other application doesn't cache data from the SQL database but rather always acts out of it. The whole idea fails if that is not the case.

All of the proposed solutions above suffer from this problem: You cannot tell the application when the data in its cache is old. If the application is caching, you can't use any of the approaches above.

In that case the best idea is probably to take your application and put a wrapper around your database caching interface to make it send and receive PSYC synchronization as described in storage to the PSYC server, effectively circumventing SQL and the cache.

The application we run at MAGIX now uses storage synchronization and it's great. You can change your profile in either PSYC or the website specific way, you can make friends either via PSYC or via the website's specific interface etc. You can keep all the things your users know from your website without crippling the equivalent PSYC features. They simply stay in sync. Okay, it's not simple in fact - but we are working on ways to be fault resistant when one of the two sides is down for some reasons. Remember, the alternative is: the whole thing not working if one of the two sides is down. See the Storage document for details on the sync API.

Discussion

<willo> doing cross syncing comes with a whole pot of other problems

<lynX> maybe you are right that the common sql is such a huge advantage that you can deal with occasional glitches with the web app having temporarily incorrect data. so in the end it depends on the situation at hand and a bit on a question of taste (although it better shouldn't). any of the possible strategies outlined above might just be the right one. and any of those may turn into a huge problem.

SQL for persistent storage in psyced

Alright.. so we have gone through the multi-application common-database thing. If you are absolutely positive that you are interested in SQL for psyced without turning it into any integration highway, start reading here.

For reasons explained in Storage it is pretty impractical to make psyced keep its data in an SQL database. Still...

Synchronous SeQueLing

fippo has modified the Storage implementation to support synchronous loading of user data from SQLite, MySQL or PostGreSQL (the latter with a patch to the driver), nonetheless.

This is a hack since natural operation would be to load data asynchronously. Synchronous means the server will freeze until the database has delivered the data. This will normally be hardly noticeable, but still it is a performance brake as the normal operation of loading data from the flat file can't possibly be slower.

So consider it a quick solution if you absolutely absolutely need to have that data in your database, which of course needs to run on localhost given the way it is accessed.

Still, the advantages may outweigh the hackiness, so go try it out. It's still fresh, experimental and incomplete though.

You really want it?

<lynX> I'd like to know, what is the gain of storing into an SQL database? Realtime backup strategies by replication? Being able to mess with the data while the server is down? Why if you could mess the data while the server is up, by using the PSYC interfaces to the server rather than low-level storage interfaces? Gimme some reasons. It's a wiki because of that.

<fippo> One of the main reasons: people have an existing userbase whose data is stored in a database.

<lynX> Alright, but they aren't switching from one PSYC server to another. They are merging applications, so the discussion should be happening in the paragraphs above, not here. This paragraph is exclusively about single application storage. If you think you are just coding storage, but in reality you are merging applications, you are bound for a lot of problems later and within the year I can see you rewriting the whole thing.