Document management with SQL Server

This is a placeholder post, I’ll update it as time goes by.

Currently I’m building a presentation to show to the IT community how SQL Server can be used to build Document Management systems.

I have built (me and my team) many applications on SQL Server and several for DM. So i need to structure my experience a bit and give back to the community while researching what anyone else did similar and what the new version of SQL 2008 brings to the table.

If you whish to share your thought, feel free..

later edit:

Of course I could not update the post as I researched…. but here are the outcomes:

Main topics of interest when trying to build a DMS solution on top of SQL 2008:

  • Integrated Fulltext Search
  • Remote Blob Store (RBS)

Other significant SQL Server 2008 functionalities:

  • Backup compression
  • Data compression
  • Data encryption
  • New DATE/TIME field (UTC)
  • Improved XML processing (with Lax validation)
  • Improved reporting services (who doesn’t need reports ? 🙂 )
  • last, but not least: Sparse Columns
  • more here

Full Text search

Now being integrated (and rewritten), the FTS engine provides more functions to the user and developer. The performance is kept somehow like in 2005 but some areas show significant improvements.

Fot the brave enough to use FTS in 2005 and previous versions, the migration options need to be considered (3 in total: rebuild, import, reset). Rebuild is needed especially if you want to take advantage of the new stemming and word-breaking rules and languages.

Nice things: stop words are now in the database. So they are accesible, programmable and transportable. They are also not only language dependent but you can also define other “set building” rules.

The thesaurus is still in XML but now is lazy cached and can be updated without restarting the server (yey!). Note that it behaves a little different then in 2005. So you need to take care when migrating your XML files.

Cool stuff: troubleshooting functions! Something always needed to look into the FT “magic”. baing able to see what keywords were indexed for a particular document / collection is very nice. To see it from SQL is even nicer. To be able to see how a query is parsed and transformed is great. I’m also happy since I can see how the stemmer and thesaurus work for a particular case.

Some advice: take care if you have many keywords (x 10 million). Use fast disks, IO is very important. Use 64 bits: 3 GB of RAM is usually not enough. Don’t confuse FREETEXT and CONTAINS, use them wisely.

BLOB related news

First of all, please don’t use IMAGE and TEXT/NTEXT fields anymore. They will no longer be supported / encouraged by Microsoft.

You can use VARBINARY(MAX), but you hit the 2 GB limit with it. Use the FILESTREAM modifier (new in 2008) to kill that limit.

FILESTREAM makes content to be stored in the NTFS drive. Nice. And tricky at the same time.  Good for streaming, not so good for frequent updates. Good for big files, not so good for many files (especially when having short backup windows).

Nice: works from TSQL as well as Win32. Not so nice: behaves a little differently in TSQL vs. Win32 (transaction isolation level, performance – not necessarly better in Win32).

So, you really have to understand it before using. You can get in some not so obvious pitfalls. But is a good thing.

Remote Blob Store – RBS

Who does not know what CAS (Content Addressable Storage) is probably does not need it.

Is not another column type, it’s an API to be implemented by CAS vendors mainly and used by applications.

Somehow, it’s similar with EBS on SharePoint. In fact, there is a competition between the two (some nice cover is here), and I also feel that RBS is the way to go (regardless of the current limitation about accesing the context of the Blob).

EMC already has a RBS connector for Centera. Nice.

So, 2008 brings a lot of nice things on the table. Let me know when you use them.

Sun and IBM

Or, “there can be only one” – to quote a slashdotter.

I was reading this last evening as the news talk about the feeding frenzy which is seen in the market.

Such a huge deal would be sure to attract the attention of monopoly aware government organizations, both in the States and worldwide. I’m curious if it can stand this type of inquiry.

Anyway, if this happens the IT market will change significantly for a lot of people and while this does not directly touch the ECM space, I believe it will affect us (i see Lee feels this also).

Due to the lack of comment on the topic from Sun and IBM officials, i think there is definitely something going on and we might have to see a new and bigger giant appear in the IT space.

Is traditional ECM up for high performance

Last week I stumbled over the WordPress statistics for February.

Impressive. That got me thinking on how I would be able to implement this kind of backend system with a traditional (read “top Gartner stuff”). And I shivered inside while thinking of IBM CM, Documentum, FileNet… SharePoint (lol!).

I remember once I’ve seen a support issue with one of the above vendors in which the administration tool could not display the size of filestores bigger that 2 GB. And I believe the issue is still there, after 2-3 years. Tragic.

Imagine a customer seeing this (and they do) and saying… “Well.. what kind of Enterprise system is this? If it cannot show correctly storage spaces over 2 GB? How can I trust it with my Terabytes?”

My practice based experience tells me that a large scale performance cannot be normally achieved with Enterprise grade software. You have a better chance with some high skilled professionals (not many, 4-7 should be enough) which can put together some “indie” software.

In my work, I rarely (read “never”) seen a ECM system handle a load similar with the WordPress one. ECM in my area tends to cap normally at about several million items and a few TB of space. While requiring a huge deluge of hardware (I can’t still understand why on earth would I need minimmum 4 GB or RAM for an Index Server? Even in the smallest install)

You out there…. Working on ECM… I would like to know your statistics.

Documentum 6.5 ramblings, or something

I just noticed that another month passed by without me writing anything here.

Obviously because I’m busy… same (lame) excuse. Not really. Of course the job is always demanding, but getting 30 minutes a day to ramble about something can’t be that hard.

So I pulled myself and logged back here. Let’s talk about my recent Documentum experiences.

At this stage we are undergoing some D 6.5 implementations. SP1, just to be sure.
I’m not involved technically first hand, and the last time I got my hands on it was about 4-5 months ago. This is to settle expectations straight… you’re not about to see any technical revelations here.

First thing: “shit! is different!” the whole install process I mean. Where’s my DocApp? No more docapps, use the “.dar”. Learn Eclipse… headless.. that is. The Eclipse, not me. Not yet.

Ok, basic stuff works. TaskSpace… almost ok. Let’s move to the Imaging Services… Brr… Strike 1… Strike 2… Use Webtop.
I promise myself I’ll get back on this. Never did. And now my colleagues suffer the same things. Poor them.

But it moves faster. Lot faster. Then I increased the RAM available to the virtual machine. Shouldn’t have done that.
Note to self: don’t increase the memory for virtual machines upon the exhaustion of the host memory. Swapping is bad..

Business Process Manager now. Done. All ok. BP Services? Not yet. Later.

Oooo… BAM. I mean Business Activity Monitor. I feel like Dee-Dee in the laboratory. Same effect.
Rollback to VM snapshot.

At this point I think: there must be an easy way.

Then I read in the installation manual: “rename the XXX file to YYY file and then run setup.exe” (names were changed to protect the innocent). As a last line on the page. Why the %^&*() do I need to rename a file which is provided in the installation kit? Oh well.. this why we are payed the good bucks..  can only imagine how SAP looks like.Oh, stop, I know that also… tough job.

Wanna try Annotation Services? With 6.5 ? Sure…. bring it on. It worked! That brought the spirit up. For a moment.

Today it does not work anymore. Looking for some LiveCycle piece? Good luck!

Great. Now what? Laugh hysterically and get back on it.
You know, Oracle installation kits are free on the Internet. You know why? Because it takes a skilled person to install it properly. Go figure.

Why am I telling you this? Because next week when all the puzzle pieces will be nicely put toghether I can look here with my colleagues and laugh.

And since you read it all the way up to here: let’s build a HA environment toghether. And put some Document Sciences on it to spice it up.