PCMT – The Perfect Content Management Tool, take one

In my last post I cried out about the need to have Content Management as a tool. This does not mean that “solutions” are not good or useful. Of course they are, but they can support (well) only a fraction of our CM needs. For the rest, we need a tool.

Note that I’m not talking about “Content Management” as in “Web(site) Content Management”. I’m talking about CM as in the management of content, where content is defined as a set of data, usually not just a row in a table. Also, this is not a post about the definition of CM… there’s enough of that going on already…

The journey begins in 2004, when I was talking with a customer during the requirements analysis for a contract management solution. When I asked how he pictured the best software solution for his needs, he replied:

“I would like to have one big button on the screen. And when I push it, it should do whatever I want it to do. That’s about it, I’ll be satisfied with that.”

The Perfect Content Management Tool. (Obviously, you can replace CM with anything and it will still be valid.)

Nirvana.

After a few laughs we resumed the more down-to-earth talk and ended up with a decent set of requirements. But the idea of the “big button” still lingers on. With me and also with my – now old – customer.

Can it be done? Sure!

Now? Dunno. But I will try.

Let’s analyze, split and address the challenge ahead.

There should be “a button”. Ideally “the only button”, but we could probably negotiate on that. A button is simple for the user to understand; they are accustomed to its behavior. We should not necessarily stick to an actual GUI button, but rather embrace the idea of offering the user something very close to the things they interact with every day. It could, for example, also be a “link”. But it needs to be big, aka “visible enough”. Menus probably don’t work… as they are a little more complicated.

The button should do “what I want”. I specifically remember the customer saying “what I want“. Not “need” or “should be done”. I’m not sure right now whether it’s easier to build a button which does what’s needed or one which does what’s wanted. Anyway, one which does what “should be done” is definitely the easiest of all magic buttons.

How do we make such a button? Well, the button should “hear” the context in which it stands, should know the context in which its user is, and must know what resources it can access and which processing paths it can follow. Probably a lot of other things too, but these are the ones I feel are important right now.

Hear the context in which it stands

The button is somewhere. Because it exists.

For example, it should know its location on the screen (c’mon, this is easy)… but not for presenting itself to the user. It needs to know this because the position might have a meaning. If it’s in the top right corner then maybe it will do something completely different than if it’s at the bottom. If the button is on a laptop screen it might perform other things than if it’s on an iPhone. I know it sounds simple, but remember we’re talking about “one” button. Only one, not several.

OK, but you said “hear”. Yes, the button must be able to hear (not receive, as in “destination”) the messages sent by other neighboring “things”. This happens right now at a primitive level with GUI buttons in presentation software. But I’m not talking about that. The “button” is, at this point, just a metaphor for the whole tool/software thingie. The messages are, in my view, not technical ones, but rather linked to the data the user interacts with and to its meaning.

In order to hear, the button must have the receptors for this: web service listeners, message queue monitors, event handlers, psychic serial bus… you name it.

In order to understand what it hears, the button must speak the language of the message. This is where standards or public formats come into play. Both are good; I’m not advocating one over the other, let real life decide. The more standards/public formats the button speaks, the better the button.

Know the context in which its user is

Is my user in an airport, is he at his desk, is she in a meeting, is there some public wireless within reach? This sort of data must be available to the button, and the button should know it. Knowing it, the button has more data to process and is better prepared when the user clicks it.

and so on…

What’s this got to do with Content Management?

Where is my ECM?

Nowhere near, I’m afraid.

But I bet we can get it closer. Not in one swoop, for sure. But would it be that hard to make a button called “Get” which adapts to all of the above, tries to surface my accessible content from wherever it lives, and helps me get to the item I want in some 2-3 clicks through faceted navigation?

Since the button is sensitive to its context, we can move it around our digital workspace and activate it when we reach the area we’re focusing on. Or when it “glows” (probably because it just detected that it can tell me something “new” about the context).

Does it sound like science fiction? Not to me. Just came to mind: “instant search”, anyone? Things are moving, get on with it.

You can call me PhD. In content management

That was it!

When I started this blog I was thinking of it as a way to interact with others on the topic of my PhD thesis. As with many things in life, it turned into something else (not sure what).

But, in order to keep the initial mission close… this post is a marker for the date on which I publicly presented my research and was accepted to bear the title “Doctor of Philosophy“. And I must say I am proud.

I’m proud I did not give up during all these years and that I listened to my teacher, wife and friends and continued on the road. I’m proud I learned a painful lesson in “backup” which triggered a complete rewrite of the thesis just before it was done. I’m proud the research is my own work, although imperfect or similar in its reasoning to others’ (hey! very few ideas are completely new in IT).

Published works? Just a few. You will find them in the bibliography of the paper, which should be published online any day now by the university. Link to it? Maybe… I’m trying to keep this blog as free as possible. Free as in speech, so I’m comfortable with the fact that my name is not directly tied to it. I might be acting stupid here, but I’ll keep it web-crawler-anonymous for the time being.

For you humans out there reading this stuff… it means you care about the subject. So, for its sake, here is the paper’s abstract:

Structural Synthesis of High Performance Databases

The increasing volume of unstructured data existing inside current organizations generates the need for its efficient management. Databases are usually the informational support used to manage such volumes. This paper presents studies and experiments on both the design and the implementation of systems managing data of considerable size – content management systems.
While content management tasks can be easily implemented using the existing database systems, we believe that high performance implementations require specialized processing and architectures.
We present the characteristics of content management systems, focusing on their usage as a data management platform. One of the most important sections of the paper defines the performance metrics and the most important challenges generated by these performance expectations. Performance is defined not only as a quantitative measure of response time but also from other perspectives, such as the system’s resilience over long time periods (tens of years) given the continuous technical evolution.
A new architecture is proposed in the paper, following the analysis of the main identified challenges. An actual implementation of the architecture is also described and several important design decisions are detailed. The implementation’s quantitative performance behaviour is then assessed in order to validate the architectural decisions and observe the scalability to large data volumes. The implementation is then benchmarked against an alternate implementation of the same architecture built on an off-the-shelf standard database management system.
The performance tests show that the architecture allows high performance metrics to be achieved and that it compares very well to other common database management systems; we therefore consider that we successfully designed a technical implementation of a new model which supplies greater performance than conventional implementations.
As next steps we aim to implement a standard interface to the designed system, using CMIS (Content Management Interoperability Services). This will open the possibility to test the proposed system with third-party software applications and will give a more precise indication of how it compares with existing traditional products from a performance point of view.

Puzzled by the title? Don’t be. Back in 2000, when this title was conceived, we thought the term “content management” or “ECM” would not sound academic enough to the community. Inside, it’s all about a CMS engine (in the broader sense, not limited to web content management).

So long for now; I’ll go get some sleep and follow up here shortly.

PhD almost here

A long, long time ago, (in a galaxy far, far away?)… I started my PhD in Computer Science.

Now it seems the final step is near: I just talked with my professor and we kicked off the arrangements for the public presentation. The internal university presentation was in September last year.

I’m writing this post just to mark another milestone. And where better to place it than online… because the web never forgets. Like a giant records management system with permanent retention… almost.
This way I’m almost sure that when I try to remember when this happened, I’ll be able to look it up on the net :). Isn’t that interesting?

On my computer I don’t have files older than a few years… but my Yahoo account is older than my first paycheck. And I have never deleted any non-spam email in that account.

Of course, Yahoo or WordPress could go away at any time… they made no guarantees to me, but here I am trusting them with my valuable letters. It’s like using SharePoint for my ECM: not adequate but convenient… and it sure gets the thing done.

One of my obsessions comes to mind now: what percentage of records management software (not necessarily certified) does not need a major migration during its records’ retention timespan? I can hardly see more than 1% of current software installations surviving 10+ years, not to mention 30 or 40. We were using floppy disks 15 years ago…

Anyway, I’d better present the PhD while it still has some value in it. I am already thinking of rewriting it for the third time.

Happy Content Management!

Get things done and try not to worry too much about the future; we are all doomed anyway… aren’t we? :)))

Content management in a box

“Can I have two of those to go, please?”

A recent announcement from Oracle talks about an OLTP database machine. I’ll let you read the details and other comments in the official announcement and blogosphere.

When I received this pre-announcement over the weekend I appreciated the synergy between the two product lines: RDBMS and server. The RDBMS runs on a server… why not make a specially tuned RDBMS to run on specific hardware, and also tune the hardware to deliver whopping performance for that specific software? While I’m not sure the new Oracle product does all this, I can imagine it.

Now, back to our nice little ECM world. CM software is captive to the RDBMS. Its performance depends on it. The licensing goes hand in hand… You can rarely (if ever) use a major ECM suite without a properly set up RDBMS. Why is that? Well, I can think of several reasons, like ease of deployment, portability, reasonable performance, time-to-market… but the question still remains: “Why not have a CM server?” One box to deliver it all. A CM “appliance”. An “Apple CM”… all in one box, no replaceable battery.

As I know EMC products quite well, it’s obvious this would be a very nice use case for xDB. Let’s see if their R&D can pull it off – if I were EMC, I would build it by the end of 2010 and release it in 2011. I could really use a Documentum package which does not need a separate DB license/product and runs at least acceptably, if not better.

Back to the “box” idea (I really like the Apple analogy), I’m not necessarily talking here about the “no database CMs” (like the list here). I’m talking about a full-fledged, powerful and high-performance CM which is “in tune” with its metadata storage (based on an RDBMS or not…).

I’m pretty sure somebody already has this in their lab or even shop. I have a PhD thesis which is almost on this topic, and I’m probably not the most innovative guy in the world. I would love to learn about any such initiatives, but I’m too lazy to search for them today… that’s another to-do post-it.

It is said that times of crisis are the best drivers of innovation. Really?

Document management with SQL Server

This is a placeholder post, I’ll update it as time goes by.

Currently I’m building a presentation to show the IT community how SQL Server can be used to build document management systems.

My team and I have built many applications on SQL Server, several of them for DM. So I need to structure my experience a bit and give back to the community, while researching what others have done in this area and what the new version, SQL Server 2008, brings to the table.

If you wish to share your thoughts, feel free…

later edit:

Of course I could not update the post as I researched… but here are the outcomes:

Main topics of interest when trying to build a DMS solution on top of SQL 2008:

  • Integrated Fulltext Search
  • FILESTREAM data
  • Remote Blob Store (RBS)

Other significant SQL Server 2008 functionalities:

  • Backup compression
  • Data compression
  • Data encryption
  • New DATE/TIME data types (including time zone offset)
  • Improved XML processing (with Lax validation)
  • Improved Reporting Services (who doesn’t need reports? 🙂 )
  • last, but not least: Sparse Columns
  • more here

Full Text search

Now integrated into the engine (and rewritten), FTS provides more functionality to users and developers. Performance stays roughly on par with 2005, but some areas show significant improvements.

For those brave enough to have used FTS in 2005 and previous versions, the migration options need to be considered (three in total: rebuild, import, reset). Rebuild is needed especially if you want to take advantage of the new stemming and word-breaking rules and languages.

Nice things: stop words are now stored in the database, so they are accessible, programmable and transportable. They are also not only language-dependent; you can define your own stoplists with other “set building” rules.
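To make that concrete, here is a minimal T-SQL sketch of working with the new stoplists (the stoplist name and the words are hypothetical examples of mine, not from any particular setup):

```sql
-- Create a stoplist starting from the built-in system list, then tune it.
CREATE FULLTEXT STOPLIST MyStoplist FROM SYSTEM STOPLIST;

-- Add a noise word of your own; drop one you want indexed after all
-- (1033 = English; assumes 'will' is present in the inherited list).
ALTER FULLTEXT STOPLIST MyStoplist ADD 'foo' LANGUAGE 1033;
ALTER FULLTEXT STOPLIST MyStoplist DROP 'will' LANGUAGE 1033;

-- Being in the database, the stop words are queryable from T-SQL.
SELECT w.stopword, w.language
FROM sys.fulltext_stopwords AS w
JOIN sys.fulltext_stoplists AS l ON w.stoplist_id = l.stoplist_id
WHERE l.name = 'MyStoplist';
```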

The thesaurus is still XML-based, but it is now lazily cached and can be reloaded without restarting the server (yay!). Note that it behaves a little differently than in 2005, so take care when migrating your XML files.

Cool stuff: troubleshooting functions! Something always needed to look into the FT “magic”. Being able to see what keywords were indexed for a particular document/collection is very nice. Seeing it from SQL is even nicer. Being able to see how a query is parsed and transformed is great. I’m also happy I can see how the stemmer and thesaurus work for a particular case.
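For illustration, a quick sketch of the two troubleshooting DMVs I find most useful (the dbo.Docs table name is a hypothetical example, and a full-text index is assumed to exist on it):

```sql
-- See how a query string gets word-broken, stemmed and expanded
-- (1033 = English LCID, NULL = system stoplist, 0 = accent-insensitive).
SELECT display_term, special_term, expansion_type
FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, "archive")', 1033, NULL, 0);

-- List the keywords actually stored in a table's full-text index.
SELECT TOP (20) display_term, document_count
FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID(N'dbo.Docs'))
ORDER BY document_count DESC;
```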

Some advice: take care if you have many keywords (tens of millions). Use fast disks; I/O is very important. Use 64-bit: 3 GB of RAM is usually not enough. Don’t confuse FREETEXT and CONTAINS; use them wisely (see the sketch below).
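To show the difference, a minimal sketch against the same hypothetical dbo.Docs table with a full-text-indexed Body column: FREETEXT matches meaning, CONTAINS matches precise patterns.

```sql
-- FREETEXT: the engine word-breaks, stems and thesaurus-expands the
-- input, then matches any resulting form. Good for end-user search boxes.
SELECT DocId, Title
FROM dbo.Docs
WHERE FREETEXT(Body, N'retention policies');

-- CONTAINS: you control the matching: exact terms, prefixes,
-- boolean logic, proximity. Good for precise, programmatic queries.
SELECT DocId, Title
FROM dbo.Docs
WHERE CONTAINS(Body, N'"retent*" AND (archive NEAR policy)');
```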

BLOB related news

First of all, please don’t use IMAGE and TEXT/NTEXT fields anymore. They are deprecated and no longer encouraged by Microsoft.

You can use VARBINARY(MAX), but you hit a 2 GB limit with it. Use the FILESTREAM attribute (new in 2008) to kill that limit.
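A minimal sketch of such a table (names are my own; it assumes FILESTREAM has been enabled on the instance and the database already has a FILESTREAM filegroup):

```sql
-- FILESTREAM requires a ROWGUIDCOL uniqueidentifier on the table;
-- the Content column's data then lives on NTFS, outside the MDF file.
CREATE TABLE dbo.Documents
(
    DocId    UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    FileName NVARCHAR(260)    NOT NULL,
    Content  VARBINARY(MAX)   FILESTREAM NULL
);
```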

FILESTREAM makes the content be stored on the NTFS volume. Nice, and tricky at the same time. Good for streaming, not so good for frequent updates. Good for big files, not so good for many files (especially with short backup windows).

Nice: it works from T-SQL as well as Win32. Not so nice: it behaves a little differently in T-SQL vs. Win32 (transaction isolation level, performance – not necessarily better in Win32).
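The T-SQL side of the Win32 path looks roughly like this, continuing the hypothetical dbo.Documents table from above: the client fetches a logical file path plus a transaction token, streams the content through the Win32 FILESTREAM API, then commits.

```sql
-- Win32 streaming access is scoped to a transaction: grab the file's
-- path and the transaction context, hand both to OpenSqlFilestream()
-- on the client side, and commit when the streaming is done.
BEGIN TRANSACTION;

SELECT Content.PathName()                   AS FilePath,
       GET_FILESTREAM_TRANSACTION_CONTEXT() AS TxContext
FROM dbo.Documents
WHERE FileName = N'contract.pdf';

-- ... client streams the file content via Win32 here ...

COMMIT TRANSACTION;
```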

So you really have to understand it before using it; you can fall into some not-so-obvious pitfalls. But it is a good thing.

Remote Blob Store – RBS

Whoever does not know what CAS (Content Addressable Storage) is probably does not need it.

RBS is not another column type; it’s an API, to be implemented mainly by CAS vendors and used by applications.

In a way, it’s similar to EBS in SharePoint. In fact, there is some competition between the two (some nice coverage is here), and I also feel that RBS is the way to go (regardless of the current limitation about accessing the context of the blob).

EMC already has an RBS connector for Centera. Nice.

So, 2008 brings a lot of nice things to the table. Let me know when you use them.

PhD paper done. Phew!

Finally, after a long time, I now have a complete version of my PhD thesis.

I would like to elaborate more on this, but after spending four days secluded in a mountain cottage in front of my laptop… I simply can’t.

Now I just need to publish some articles and present my creation to the public. Behold 🙂

PDF/A in Amsterdam

Over the last few days I’ve been attending the first PDF/A International Conference in Amsterdam, trying to get a better understanding of the facts around the topic.

To put it simply, PDF/A is PDF 1.4 with some more rules. And it is an ISO standard (ISO 19005-1).

For those of you wondering why we have yet another file format (which seems to be a branch of the oldie PDF), please know that PDF/A aims to be the format in which documents are stored for long-term archiving.

The idea is excellent for various reasons, and the PDF/A originators (not necessarily Adobe) are not the only ones who thought of this. Microsoft is also trying to jump on the wagon with XPS – which was not designed to be an archiving format, but it seems they think it is useful for this as well.

The need is there, as organizations are tired of having to deal with old file formats every time they dig deep into their electronic archives. And we need to take into consideration the fact that electronic archives are not very old these days. As a fun fact, in the opening keynote Thomas Zellman showed a 5-inch floppy disk to the audience. I think that was an excellent way of reminding everyone that many things (think content here) we create today will need to be used a long time from now. And 5-inch floppies are not too old. Think 8-inch floppies and punch cards.
Therefore, archivists all over the globe are trying to think how to reinvent their job of storing and managing paper and bring electronic content along (yes, “revelation” – paper will not disappear). If you have worked with archivists you will have found out that this job is highly conservatory (couldn’t help the wording 😉 ). It’s in their nature not to change things, and most of them would rather not tackle anything but paper at all.
How do you address this? Make it a standard! “It’s ISO, so it’s good.” At least it’s easier for the archive world to swallow. Second, by deriving it from the ubiquitous PDF, you get a file format which can be read by a lot of software and can be generated easily by others.

Of course, there are rules to take care of if you want to be compliant… Read all about them on the www.pdfa.org website; I’m not going into details here.

How is this relevant to the Content Management area?

First of all, it’s relevant to my PhD thesis, since the objective of PDF/A is to be self-contained (content and metadata). Which is how I store my objects in my great repository (wink).

Idea coming through: how about defining a storage area inside an ECM system so that everything you put there is transparently stored/converted by the CM system to PDF/A, including all its metadata?

Of course, there are some issues to ponder, but I think this sounds good. The file format needs to evolve a bit to allow more content types to be included (think 3D, multimedia) and also to go beyond a primitive implementation of digital signatures and metadata. But the scene is set.

Related to evolution, sadly (?) enough, changes to PDF/A need to undergo ISO certification, so I guess we can all expect the 2.0 version in 2010 (and some speakers at the conference felt the same way).
I’ll stop for now; there were a lot of interesting things discussed at the conference, a lot of case studies, and very interesting people to meet or rejoin for a beer.

Cannot help but add one more thought: Is IT Fashion? Rory Staunton thinks so.