The movement towards digital records is generating enormous, rapidly growing volumes of tremendously valuable data. But building a system to manage that data and extract value from it requires accepting a paradox: the system needs to be flexible and tolerant while simultaneously enforcing structure and standards. This is hard.
Many of us have felt the pain of Master Data Management (MDM). In large enough enterprises, even something as simple as keeping addresses current across multiple lines of business is a herculean task. Furthermore, it’s a problem that’s difficult to ignore because poor data management can drive substantial expense and leave invaluable opportunities on the table.
Since I joined Health Market Science (HMS), a company focused on MDM for the healthcare space, that sentiment has been reinforced tenfold. Two things became immediately apparent: the complexity of a complete solution and its value.
The problem is complex because there is a temporal aspect to the data. A complete MDM solution doesn't just provide a current view of entities, but a historical perspective as well. It provides that perspective across all the entities and all the relationships between those entities. Then add in the fact that each source system may have a different schema representing each entity, and that the schema itself may evolve over time. Then pile on all the processing needed to analyze the data: entities need to be standardized, matched using fuzzy matching, and consolidated based on precedence rules. What a fantastic recipe for a fun problem to solve.
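To make those moving parts a little more concrete, here is a toy sketch in Python of standardization, fuzzy matching, and precedence-based consolidation with an "as of" date. Everything in it is an assumption made for illustration: the source names, the precedence ordering, the 0.85 threshold, and the helper names are all hypothetical, and a real MDM system would need far more than string similarity on names.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical precedence rules: a lower number wins when sources disagree.
PRECEDENCE = {"licensing_board": 0, "claims_feed": 1}


@dataclass(frozen=True)
class Record:
    source: str
    effective: date  # date from which this version of the record applies
    name: str
    address: str


def standardize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower()
    )
    return " ".join(cleaned.split())


def is_match(a: Record, b: Record, threshold: float = 0.85) -> bool:
    """Fuzzy-match two records on their standardized names (assumed threshold)."""
    ratio = SequenceMatcher(None, standardize(a.name), standardize(b.name)).ratio()
    return ratio >= threshold


def consolidate(records: list[Record], as_of: date) -> Optional[Record]:
    """Pick the winning record for one entity as it looked on `as_of`.

    Only records already effective on that date are considered. Among those,
    the highest-precedence source wins, and the most recent record breaks ties.
    """
    visible = [r for r in records if r.effective <= as_of]
    if not visible:
        return None
    return min(
        visible,
        key=lambda r: (PRECEDENCE[r.source], -r.effective.toordinal()),
    )


if __name__ == "__main__":
    board = Record("licensing_board", date(2009, 6, 1), "John A Smith", "40 Oak Ave")
    claim = Record("claims_feed", date(2008, 1, 1), "Dr. John A. Smith", "12 Main St.")
    print(is_match(board, claim))
    print(consolidate([board, claim], as_of=date(2008, 12, 31)))
    print(consolidate([board, claim], as_of=date(2010, 1, 1)))
```

Even this toy version hints at the difficulty: the answer to "what is this provider's address?" depends on which date you ask about and which sources you trust, and that is before the schemas themselves start changing.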
But given the complexity, it isn't difficult to see why companies shy away from an "in-house" MDM solution. Solving these problems isn't easy. You may think that if you get yourself a good Data Architect and a massive Oracle instance, you could crank something out. You’ll soon find that standard relational structures quickly become unwieldy, and you end up in a “meta-meta” world, building out schemas to manage schemas. This can be extremely painful.
I'm wondering if people have seen any open source solutions capable of tackling this problem. From what I can find, the open source "MDM" solutions only tackle the Extract-Transform-Load (ETL) portion of the problem. Unfortunately, that is the easy part. Does anyone know of any open source communities focused on delivering a complete solution, from ETL all the way down into storage?