CM Architecture – content storage

One of the ‘strange’ ideas I have in my research is to try to remove the RDBMS from the equation completely.

Sure, having an RDBMS back-end brings a lot of advantages and shortens the “time-to-market”. But since I’m not building a CM system to go on sale by Christmas, I have plenty of time to experiment. So, why not take a different road?

My approach is also based on an old idea: a CM system is a data management system in its own right, with its own specific requirements. Sure, it is similar to the existing DBMSs (notice I removed the R from the acronym), but that’s normal, and it’s one more argument NOT to use a prebuilt system but to be one instead.

So, I thought for a long time about how to model and implement such a core system. My work started in the late 90’s with testing and benchmarking some RDBMSs. I used the newly created (back then) TPC tests (oohh, memories…), as well as Wisconsin and similar older benchmarks. This gave me a glimpse of what performance means and what to expect when you try to analyze the impact design has on it.

To end the digression, my conclusion was that DBMSs (mainly the mainstream ones) are simply not built to handle “content”. They are good at handling “data”, as “pure” as possible. Throw some high transaction volume and concurrency in the soup, and here is your Oracle / MSSQL / DB2 / Postgres / MySQL… whatever.

Content, on the other hand… is special. It is small (think .ini files 😉 )… It is big (think imaging stuff)… It is huge (think movies). All at the same time. Also, it has versions… renditions… annotations…

Of course, this can be modeled in a normal database, but it just doesn’t seem right. I would like to see all of those implemented natively as core functions. Imagine having versions and renditions for a data row in an RDBMS table.

My idea is to give it another shot and rethink the storage concept. And build on that thought.

So, let’s have content (which for me also means metadata) stored as a unitary piece of data. Let’s say in a compound file on the filesystem. Self-contained, self-sufficient. Maybe the versions / renditions can be stored in parallel as actual files, since they may need to reside on other filesystems than the original.
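
To make this a bit more concrete, here is a minimal sketch of what such a compound file could look like. Everything in it is an assumption made for illustration and not a format I have settled on: the `CMF1` signature, the 4-byte length prefix, JSON for the metadata, and the helper names `write_compound`, `read_compound` and `rendition_path` are all hypothetical.

```python
import json
import struct
from pathlib import Path

MAGIC = b"CMF1"  # hypothetical 4-byte signature, invented for this sketch


def write_compound(path, metadata, content):
    """Write metadata and content bytes into one self-contained file.

    Assumed layout: MAGIC | 4-byte big-endian metadata length | JSON metadata | content.
    """
    meta_bytes = json.dumps(metadata, sort_keys=True).encode("utf-8")
    with open(path, "wb") as f:
        f.write(MAGIC)
        f.write(struct.pack(">I", len(meta_bytes)))
        f.write(meta_bytes)
        f.write(content)


def read_compound(path):
    """Read the file back knowing only the layout described above."""
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError("not a compound content file")
        (meta_len,) = struct.unpack(">I", f.read(4))
        metadata = json.loads(f.read(meta_len).decode("utf-8"))
        content = f.read()  # everything after the metadata is the content
    return metadata, content


def rendition_path(path, version, rendition):
    """Build the name of a parallel file holding one version/rendition."""
    p = Path(path)
    return p.with_name(f"{p.stem}.v{version}.{rendition}{p.suffix}")
```

With this naming assumption, `rendition_path("store/report.cmf", 2, "pdf")` points at a sibling file called `report.v2.pdf.cmf`, which could just as well live on another filesystem.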

This has a nice advantage I am quite fond of: if the compound file structure is openly described, then a tool to process it can easily be built at any time, in any technology. So, if I archive that piece of content on a tape and throw it away for 20 years… when I go back, I don’t care if my original software is lost or can’t run any more. I simply build another one. (Sidenote: frankly, how many of you really believe that a records management software will not change from the ground up before such records are due for disposal? Content won’t change.)

OK, what about processing this stuff? I’m thinking of building a system which works on top of this storage and builds up an index of all things. How does it do it? Well… that’s the secret recipe. Until either I publish my work or I get so bored that I discuss it here. Or somebody else comes up with a smarter idea.
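
Just to show the general shape of "an index on top of the storage", and explicitly not the secret recipe, here is a deliberately naive sketch: a plain inverted index over metadata. The function names `build_index` and `lookup` and the sample metadata are hypothetical, and the index is assumed to be disposable, meaning it can always be rebuilt by rescanning the stored content.

```python
from collections import defaultdict


def build_index(items):
    """Map each (field, value) pair to the set of content ids carrying it.

    `items` is assumed to be {content_id: metadata_dict}, e.g. gathered by
    scanning the stored files and reading their metadata.
    """
    index = defaultdict(set)
    for content_id, metadata in items.items():
        for field, value in metadata.items():
            # Multi-valued fields (e.g. tags) get one entry per value.
            values = value if isinstance(value, (list, tuple, set)) else [value]
            for v in values:
                index[(field, str(v).lower())].add(content_id)
    return index


def lookup(index, **criteria):
    """Return the ids matching every field=value criterion (logical AND)."""
    result = None
    for field, value in criteria.items():
        ids = index.get((field, str(value).lower()), set())
        result = set(ids) if result is None else result & ids
    return result if result is not None else set()


items = {
    "a.cmf": {"type": "invoice", "tags": ["2008", "paid"]},
    "b.cmf": {"type": "invoice", "tags": ["2008"]},
    "c.cmf": {"type": "movie"},
}
index = build_index(items)
paid_invoices = lookup(index, type="invoice", tags="paid")
```

Here `paid_invoices` ends up holding only `a.cmf`. The point of the sketch is the direction of the dependency: the files are the source of truth and the index is derived from them, not the other way around.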

So, that’s one of my PhD thesis thoughts. Feel free to trash it… I’ve only been thinking about it for the last 7 years. Seriously, any comment is highly appreciated and I’ll share my thoughts and results openly.


2 thoughts on “CM Architecture – content storage”

  1. A number of companies have been doing exactly that for years with great commercial success. They’re called search engines. I had an apocalyptic event once and used the full text index of a repository to rebuild content lost from storage. It was not as hard as you might think.

    I seem to recall FAST S&T had a few white papers about using their index as the data source for business applications. They had some impressive metrics around improved system resource utilization as compared to a traditional RDBMS. Adding “mutability” and the other application functions is theoretically only an incremental change if you start with a well-designed search engine.

    Interesting idea. Be sure to let me in on the IPO.

  2. Pingback: CM Architecture - yet another search engine? « Me and Content Management
