PDF/A in Amsterdam

Over the last few days I’ve been participating in the first PDF/A International Conference in Amsterdam, trying to get a better understanding of the facts around the topic.

To put it simply, PDF/A is PDF 1.4 with some extra rules. And it is an ISO standard (ISO 19005-1).

For those of you who are wondering why we have yet another file format (which seems to be a branch of good old PDF), know that PDF/A aims to be the format in which documents are stored for long-term archiving.

The idea is excellent for various reasons, and the PDF/A originators (not necessarily Adobe) are not the only ones who thought of this. Microsoft is also trying to jump on the wagon with XPS – which was not designed to be an archiving format, but it seems they think it is useful for this as well.

The need is there, as organizations are tired of having to deal with old file formats whenever they dig deep into their electronic archives. And we need to take into consideration the fact that electronic archives are not that old to begin with. As a fun fact, in the opening keynote Thomas Zellman showed a 5.25-inch floppy disk to the audience. I think that was an excellent way of reminding everyone that many things (think content here) we create today will need to be used a long time from now. And 5.25-inch floppies are not that old. Think 8-inch floppies and punch cards.

Therefore, archivists all over the globe are trying to figure out how to reinvent their job of storing and managing paper and bring electronic content along (yes, “revelation” – paper will not disappear). If you have worked with archivists you will find that the job is highly conservatory (couldn’t help the wording 😉). It’s in their nature not to change things, and most of them would not want to tackle anything but paper at all.

How do you address this? Make it a standard! “It’s ISO so it’s good.” That at least makes it easier to swallow for the archive world. Second, by deriving it from the ubiquitous PDF you get a file format which can be read by a lot of software and easily generated by plenty of others.

Of course, there are rules to take care of if you want to be compliant. Read all about them on the www.pdfa.org website; I’m not going into details here.

How is this relevant to the Content Management area?

First of all, it’s relevant to my PhD thesis, since the objective of PDF/A is to be self-contained (content and metadata) – which is how I store my objects in my great repository (wink).

Idea coming through: how about defining a storage area inside an ECM system so that everything you put there is transparently stored/converted by the CM system as PDF/A, including all its metadata?
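
To make the idea a bit more concrete, here is a minimal sketch of what such an “archival area” contract could look like – all names are hypothetical and not tied to any particular ECM product: whatever goes in is converted to PDF/A and stored together with its metadata.

```java
import java.io.InputStream;
import java.util.Map;

// Hypothetical sketch of an "archival area" inside an ECM repository:
// whatever is stored here is transparently converted to PDF/A,
// with the object's metadata embedded in the resulting file.
interface ArchivalArea {

    /**
     * Stores the given content: converts it to PDF/A (if it is not already),
     * embeds the metadata (e.g. as XMP) and returns the id of the archived object.
     */
    String archive(InputStream content, String mimeType, Map<String, String> metadata);

    /** Retrieves the self-contained PDF/A rendition of an archived object. */
    InputStream retrieve(String objectId);
}
```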

Of course, there are some issues to ponder, but I think this sounds good. The file format needs to evolve a bit to allow more content types to be included (think 3D, multimedia) and also to go beyond a primitive implementation of digital signatures and metadata. But the scene is set.

Related to evolution, sadly (?) enough, any change to PDF/A needs to go through the ISO process, so I guess we can all expect the 2.0 version around 2010 (and some speakers at the conference felt the same way).

I’ll stop for now; there were a lot of interesting things discussed at the conference, plenty of case studies, and very interesting people to meet or rejoin for a beer.

I cannot help but add one more thought: Is IT Fashion? Rory Staunton thinks so.

The guts to say “No”

Take one big customer who wants a solution on a new, soon-to-be-deployed ECM platform.

Add some requirements: a community website, as web 2.0 as it can get, with a “wow” design (think Flash all around), targeted at the executive level (think huge multinational company)…

Picture yourself as the solution provider that aims to please this customer, to whom it has already delivered some other solutions.

Sit down and swallow the mission: do this in one month from moment zero to go-live, without any well-defined requirements (“be extremely friendly, web 2.0 with blogs, social networks, wikis, personal pages, and work great on mobiles too”).

You have the technical skills. You can do it… but probably not that fast (4 weeks to go-live actually means about 2 weeks of working and 2 weeks of deploying and waiting for others). And there is this customer who wants a “wow!”. And the users are executives.

Will you say ‘Yes, bring it on!’?

Or will you say to the customer “No thanks, not like this”?

I said No.

Our mission as IT solution providers

We (“consultants”, “integrators”…) always like to say that we work closely with our customers and that we treat them as partners. We say we provide quality services to them so they can carry out their business activities better.

And it’s nice, and it’s cool, when you feel you always come to the “rescue” and give them “solutions”. We solve their problems, and this makes us feel good (and it pays the bills :)).

And sometimes the customer wants something difficult. You beef up and do it. It’s not always easy (actually, it usually never is), but you do it (heck, it’s a job… it’s not supposed to be bells and whistles all the time).

And sometimes the requirement is really difficult. You assess the risks and decide whether to do it or not. Usually in IT you do it (at least this is my way of working, but I see it in many other IT providers as well).

Is it good to say no? Are you supposed to say this so bluntly after just a couple of meetings?

Will the customer’s disappointment hurt you too much?

I think that having the courage to say No in such an early stage is a major step in actually building customer confidence in you. It shows you aim to be a reliable partner, not a “yes person”.

On the other hand, it can be interpreted as a weakness, and you can never combat that in any way other than by delivering a lot of other successful solutions for that customer.

Have you ever said “No”? (I mean an important “No”.)

CM Architecture – How to index

While building my CM engine, I took a deep breath and plunged into the still-unimplemented area of “a new object is created, what do we do with it?”.

The reason is that my CM is built like this: when a client application creates a persistent object, it is quickly stored to disk (well… “storage”) in a portable and self-consistent manner. After making sure it’s there for keeps, a task is added to a background queue for “indexing” – i.e. inserting the new information into the indexing system so that the object can be found in searches.
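
A minimal sketch of that “store first, index in the background” flow – the names and the in-memory map standing in for the disk storage are purely illustrative, not my actual engine:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the "store first, index in the background" flow.
// The storage is just an in-memory map here; the real engine writes to disk.
class Repository {
    private final Map<String, Map<String, String>> store = new ConcurrentHashMap<>();
    private final BlockingQueue<String> indexQueue = new LinkedBlockingQueue<>();

    // Called by client applications: the object is durably stored first,
    // then a task is queued so a background worker can index it later.
    String create(Map<String, String> metadata) {
        String id = UUID.randomUUID().toString();
        store.put(id, metadata);     // object is safe; retrievable by id right away
        indexQueue.offer(id);        // searchability is added later by the indexer
        return id;
    }

    // Background worker thread: drains the queue and feeds the indexing system.
    void runIndexer() throws InterruptedException {
        while (true) {
            index(indexQueue.take());
        }
    }

    // Placeholder: the real engine would update its search indexes here.
    private void index(String id) {
        store.get(id); // look up the stored object and hand it to the index providers
    }
}
```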

The architecture allows for a virtually unlimited number of index provider types (e.g. hashes, b-trees, blingy-blingy, whateva’). So I was now at the task of implementing at least some default index providers; otherwise my content was only nicely stored and retrievable by ID.
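
To illustrate what “pluggable index provider” means here – again a hypothetical sketch, not the actual contract from my engine – the interface can be as small as this:

```java
import java.util.Map;
import java.util.Set;

// A minimal, hypothetical contract for a pluggable index provider:
// the engine can route metadata to any number of implementations
// (hash-based, b-tree based, or anything else).
interface IndexProvider {

    /** Adds or updates an object's metadata in this index. */
    void index(String objectId, Map<String, String> metadata);

    /** Returns the ids of objects whose attribute matches the given value. */
    Set<String> lookup(String attribute, String value);

    /** Removes an object from this index. */
    void remove(String objectId);
}
```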

Sleeves up… I found some nice b-tree variants discussed on the web, added some spice of my own for multi-threading optimization… and there I was, diving into design (and, I admit, also some coding – let’s call it an “agile” approach). With index persistence implemented and a disk cache under consideration, I had my hands full. It worked, and had reasonable performance. Not as stable as I would have liked, but… come on… nothing is bug free on a first release.
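
I won’t publish the actual b-tree code, but as a rough illustration of the multi-threading angle: even a plain sorted index guarded by a read/write lock (concurrent searches don’t block each other, writers get exclusive access) already captures the basic idea. A TreeMap stands in for the b-tree below; everything here is illustrative.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustration only: a sorted attribute index (TreeMap stands in for the b-tree)
// guarded by a read/write lock so concurrent searches do not block each other.
class SortedAttributeIndex {
    private final TreeMap<String, Set<String>> valueToIds = new TreeMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    void put(String value, String objectId) {
        lock.writeLock().lock();
        try {
            valueToIds.computeIfAbsent(value, v -> new HashSet<>()).add(objectId);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Range query over the sorted values; many readers can run this at once.
    Set<String> range(String from, String to) {
        lock.readLock().lock();
        try {
            Set<String> result = new HashSet<>();
            for (Set<String> ids : valueToIds.subMap(from, true, to, true).values()) {
                result.addAll(ids);
            }
            return result;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```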

What to compare it with? I feel it’s not fair to dive right into a head-to-head comparison with Documentum/SharePoint/CM/FileNet. Soo…

My approach is to use as much memory as I can get my hands on – which sounds like TimesTen. Also, I address each piece of metadata individually, so it’s something like a column-oriented database.
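
What I mean by “column-oriented” here, roughly: each metadata attribute lives in its own structure, rather than each object carrying a row with all of its attributes. A toy, in-memory version of that layout (hypothetical, just to show the shape of it):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy illustration of a column-oriented metadata layout: one "column" per
// attribute, each mapping a value to the ids of the objects carrying it.
class MetadataColumns {
    private final Map<String, Map<String, Set<String>>> columns = new HashMap<>();

    void add(String objectId, String attribute, String value) {
        columns.computeIfAbsent(attribute, a -> new HashMap<>())
               .computeIfAbsent(value, v -> new HashSet<>())
               .add(objectId);
    }

    // Queries touch only the columns they need, which is the whole point.
    Set<String> find(String attribute, String value) {
        return columns.getOrDefault(attribute, Map.of())
                      .getOrDefault(value, Set.of());
    }
}
```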

Thinking that TimesTen is not a poorly written DBMS (a highly non-scientific approach, but I know Oracle usually acquires good tech)… I would like to give it a spin.

That being said, I’ll probably try to put TimesTen to the task of acting as column-oriented storage for my metadata.

Let’s see what happens. I’ll start with several million objects. And on my laptop.

Anybody want to bet how fast it will ingest 1 million new objects with an average of 3 metadata values each (yes, I know that’s small)?

HW config: 2 GB RAM, Core2Duo 2 GHz, lame HDD
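
For the curious, the measurement itself will be nothing fancy; a rough harness along these lines (reusing the hypothetical Repository sketched earlier, so not the actual test code) is what I have in mind:

```java
import java.util.Map;

// Rough timing harness for the ingestion bet: 1 million objects,
// 3 metadata values each. Uses the hypothetical Repository sketched earlier.
class IngestBenchmark {
    public static void main(String[] args) {
        Repository repo = new Repository();
        long start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            repo.create(Map.of(
                    "title", "object-" + i,
                    "author", "me",
                    "type", "test"));
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Ingested 1,000,000 objects in " + elapsedMs + " ms");
    }
}
```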

Small disclaimer: these test results (which I’ll probably publish in part) are not to be considered an objective comparison of two systems, but an attempt to see how they perform in very particular situations which may not even be close to real-world situations.