CM Architecture – yet another search engine?

Is a Content Management System basically yet another search engine?

From what I have learned so far, it might seem so. I consider that most of the requests a CM system needs to handle are requests to find and retrieve. Updates and deletions usually address individual items (I tend to generalize here, please forgive me).

What was extremely funny is that last year I was at a technical event of a major ECM provider, and a fairly high-ranking person said out in the open: “We never understood until now how important search is” (the quote is approximate). Despite the tragic situation of having to say this after building “top” ECM systems for many, many years, it really showed me that there are others who see it the same way I do.

Actually, this post was also triggered by a comment ldallas left on my previous one. Without my saying much (if anything) about how I see answering search requests as the primary function of a CM system, he probably saw from my approach that I might be trying to reinvent the wheel.

This is not far from the truth. I’m indeed thinking of a CM architecture in which search is almost the most important function of all. What makes it different from common search engines is that in a CM environment you need to take care of complex security rules.

It is not enough to build a perfect “Google”-like engine. One needs to quickly filter the results based on user permissions. And when user permissions are based on multiple hierarchies of groups and roles this becomes tricky.
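To make the "multiple hierarchies of groups" problem concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `MEMBER_OF` and `READ_ACL` tables, the user and document names, and the function names do not come from any real CM product. It expands a user into every group reachable through nested membership, then keeps only the hits whose read list intersects that set.

```python
from collections import deque

# Hypothetical data: the groups each principal (user or group) directly belongs to.
MEMBER_OF = {
    "alice": {"legal"},
    "legal": {"staff"},
    "bob": {"staff"},
    "staff": set(),
}

# Hypothetical ACLs: document id -> principals allowed to read it.
READ_ACL = {
    "contract-1": {"legal"},
    "memo-7": {"staff"},
    "board-minutes": {"board"},
}


def effective_principals(user):
    """Return the user plus every group reachable through nested membership."""
    seen = {user}
    queue = deque([user])
    while queue:
        current = queue.popleft()
        for parent in MEMBER_OF.get(current, ()):
            if parent not in seen:  # also guards against membership cycles
                seen.add(parent)
                queue.append(parent)
    return seen


def filter_hits(user, hits):
    """Keep only the hits the user may read, preserving the ranking order."""
    principals = effective_principals(user)
    return [doc for doc in hits if READ_ACL.get(doc, set()) & principals]
```

The tricky part the post refers to is the expansion step: with deep or overlapping hierarchies, the set of effective principals has to be computed (or cached) for every user before a single hit can be checked.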

This is why I believe that the search engine (including full-text) needs to be a core part of the CM architecture. This is the only way it can provide quick and adequate responses.

In a system I work with (many of you reading this will recognize it), the search request is forwarded to an external search engine, which returns the result list in chunks (e.g. 200 at a time). These are then stored in a temporary table in the RDBMS and joined with the security information to find out whether the user actually has rights to any of them! Plain ugly! Imagine the search matches 1 million records but I only have the right to see one of them.
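The cost of this post-filtering approach can be shown with a small simulation. This is only a sketch of the pattern described above, not the actual product's code: the function names, the page size, and the `can_read` callback are assumptions, and the temporary table and SQL join are replaced by a plain in-memory check.

```python
def chunks(hit_ids, size=200):
    """Yield the external engine's result list in fixed-size chunks."""
    for start in range(0, len(hit_ids), size):
        yield hit_ids[start:start + size]


def post_filter_first_page(hit_ids, can_read, page_size=10, chunk_size=200):
    """Fetch chunks until one page of readable hits is filled.

    Returns the page and the number of hits that had to be examined.
    """
    page, examined = [], 0
    for chunk in chunks(hit_ids, chunk_size):
        examined += len(chunk)
        page.extend(h for h in chunk if can_read(h))  # stands in for the security join
        if len(page) >= page_size:
            break
    return page[:page_size], examined


# Worst case from the text: 1 million matches, only one of them readable.
page, examined = post_filter_first_page(
    range(1_000_000), can_read=lambda h: h == 999_999
)
```

In this worst case the loop never fills a page early, so all one million hits pass through the security check just to show a single result.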

What I’m building (I bet I’m not the only one) is a system that embeds the search function, so that search natively knows how to handle security. The security model is the common one: item level, based on users/groups with hierarchical permissions (read<…<delete). If any of you know of a similar system and can provide some more technical details, I’d appreciate it.
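A toy version of such an embedded, security-aware search might look like the sketch below. It is an illustration under stated assumptions, not the design of the system being built: the `SecureIndex` class and its methods are hypothetical, and only the two permission levels named in the text (read and delete) are modelled, since the intermediate levels are not spelled out. The point is that the ACL check happens while the posting list is walked, so unreadable items never enter the result list.

```python
from enum import IntEnum


class Level(IntEnum):
    """Ordered permission levels; a higher level implies every lower one."""
    READ = 1
    DELETE = 2  # intermediate levels would slot in between; omitted here


class SecureIndex:
    def __init__(self):
        self.postings = {}  # term -> set of item ids containing it
        self.acl = {}       # item id -> {principal: Level}

    def add(self, item_id, text, acl):
        """Index an item's words together with its item-level ACL."""
        self.acl[item_id] = dict(acl)
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(item_id)

    def granted(self, item_id, principals):
        """Highest level any of the user's principals holds on the item."""
        levels = [lvl for p, lvl in self.acl[item_id].items() if p in principals]
        return max(levels, default=None)

    def search(self, term, principals, needed=Level.READ):
        """Return matching item ids the principals may access at `needed` level."""
        hits = []
        for item_id in sorted(self.postings.get(term.lower(), ())):
            level = self.granted(item_id, principals)
            if level is not None and level >= needed:
                hits.append(item_id)
        return hits
```

Here `principals` is assumed to be the already expanded set of the user and all of the user's groups. A real engine would also need ranking and a far more compact ACL representation; the sketch only shows where the check sits.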

Last but not least, the search functionality should know the business purpose of its content. I’m not sure right now whether this should be a core function, or whether it sits closer to the system front end or is even application-specific. What I do know is that it would be a real pleasure to have a CM system that ranks and groups my results based on their business role (e.g. group contracts and related documents together in a result, then logfile, then SOPs…), not only on word-matching rankings. This looks a little like dynamic taxonomies and result clustering… but not really. I think this topic needs a dedicated later post, anyway.
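As a rough illustration of ranking by business role, the sketch below groups hits by a `role` field and orders the groups as in the example above (contracts first, then logfiles, then SOPs), using the word-match score only inside each group. The field names, the `ROLE_ORDER` list, and the sample structure are assumptions made for the example.

```python
# Hypothetical business-role order, taken from the example in the text.
ROLE_ORDER = ["contract", "logfile", "sop"]


def group_by_role(hits):
    """Group hits by business role; order groups by role, items by score."""
    rank = {role: i for i, role in enumerate(ROLE_ORDER)}
    groups = {}
    # Sort by descending score first so each group keeps its best hits on top.
    for hit in sorted(hits, key=lambda h: -h["score"]):
        groups.setdefault(hit["role"], []).append(hit["id"])
    # Known roles come in business order; unknown roles follow alphabetically.
    ordered_roles = sorted(groups, key=lambda r: (rank.get(r, len(rank)), r))
    return [(role, groups[role]) for role in ordered_roles]
```

Whether this logic belongs in the core, the front end, or the application is exactly the open question; the sketch works equally well as a post-processing step on any result list that carries a role attribute.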

As a conclusion: a Content Management system is like a search engine in the same way it is like a Database Management System: it can be built as one, but it deserves a specific implementation in order to do it right.


3 thoughts on “CM Architecture – yet another search engine?”

  1. I do not really agree. Sure, search is really important, and that is why we have chosen to have an external FAST ESP with a security module for Documentum D6. That way we can not only provide really fast search results with their entity extraction features but also expose metadata from the repository, both in the search results and as drill-down navigators.

    However, ECM brings a lot more to the table, namely context around the objects. Not only “simple” metadata, which is sometimes really useful, especially for all that high-quality content that users produce themselves. It also provides auditing, which previously has mostly been used as a compliance thing. That information, however, is exactly the same info that powers Amazon-like features such as “People who have read this have also read the following”. Another powerful feature that has largely been disregarded is links, or relationships, with their own attributes. That means you can have a set of different link categories between content objects, representing references in a report for instance, or links grouping objects together based on themes, similarity percentage or whatever. These links can then be explored graphically, providing a whole new way of exploring content in an ECM system.

    There is so much under the hood that is rarely used 🙂

  2. All the other functionalities you talk about (relations, links, entity extraction, etc.) are extremely important and should be at the core of all CM systems.
    Also, this information needs to be “understood”/“interpreted” by the search engine. This can only be done reasonably if the search is performed by the CM server itself and not by a loosely coupled external engine.
    The external engine brings other advantages to the table, but it cannot possibly be designed to know exactly what the CM system data “means”.
    FAST is an excellent search technology. So is D6 in terms of Content Management. Together they make a powerful combination, but not necessarily the best in terms of performance. And this is what I am aiming for in my research.

  3. At the heart of this argument is what I call the myth of unstructured data. I hate that term. Unstructured. The term was first applied by “database” snobs who were convinced that if you can’t put right angles around it then it must not have a structure. (i.e. normal forms)

    A CM is a database. What is radically different is that in design you must make choices as to the level of granularity and awareness the system has over the data itself. The premise here (I think) is that the balance of functionality has tipped in favor of the search engine. “Managing to metadata” may not be as efficient as “managing to the content” with current technology.

    As to the rarely used functions – if you are to take a new approach to CM, you must again make choices as to what will be the driving design criteria. Rarely used often equates to seldom needed. That being the case, optimizing to i/o, retrieval and state management is more important than to storing arbitrary descriptors and references that are just as easily maintained as “structured” content in the same architecture.

    BTW, this topic has been great fun to discuss. Thanks for posting it – I may expand on the “unstructured” concept later on my blog…
