Thursday, February 5, 2015
Absolute Truths, Perspectives, and Parallelism
We have a product called VerifyRx. Chances are, when you go to the pharmacy and hand over your prescription, our data is being used to verify that your doctor was eligible to write that prescription. We deliver this functionality as a web service. And because people will wait at the counter until our web service comes back, you can imagine the kinds of SLAs we are under. We are live/live across two data centers, servicing millions of transactions each day, with real-time replication across those data centers (courtesy of C* =). For compliance/auditing purposes, we store the results of those transactions in perpetuity, supporting ad hoc query capabilities, with about 5B records at the users' fingertips (courtesy of ElasticSearch).
And while this was technically challenging, I don't believe it is the hardest problem that we've had to solve over the years. I say that because each transaction can run in parallel.
f(x, MF) -> y, where all x's are independent.
In the equation, there is only one source of the truth, the absolute truth, which we call our master file (MF). That master file contains the most accurate and current information available anywhere about a doctor. The master file does change, but never as a result of a transaction.
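To make the "all x's are independent" point concrete, here is a minimal sketch in Python. The master file contents, the prescriber ids, and the verify/verify_all names are all made up for illustration; this is not how VerifyRx is implemented. The point is only that each call reads the shared master file and writes nothing, so the calls can be fanned out across workers in any order.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical master file: prescriber id -> eligible to prescribe?
MASTER_FILE = {"dr-1": True, "dr-2": False}


def verify(prescriber_id, master_file=MASTER_FILE):
    """f(x, MF) -> y: reads the master file, never writes to it."""
    return master_file.get(prescriber_id, False)


def verify_all(prescriber_ids):
    """Run every verification in parallel; no call depends on another."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map returns results in input order, whatever order they finish in.
        return list(pool.map(verify, prescriber_ids))
```

Because nothing is mutated, there is nothing to coordinate: adding workers (or data centers) does not change the answer.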
So, if that is easy, what's hard?
Well, problems that have dependent functions are difficult to parallelize. What if we want a real-time system to detect fraud? We want to count the number of prescriptions written by a doctor for controlled substances. Each new count depends on the count before it. Now, we would have something like:
f(x', MF) -> f(x, MF) + 1
People who have toyed with this problem know that the CAP theorem makes distributed counting hard. To solve this problem, the system must not only service transactions, it must be transactional (emphasis on the AL).
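Here is a small sketch of why the dependent version is harder. It is a deterministic, single-process simulation with made-up names (lost_update_demo, node A, node B), not a real distributed system: two nodes each read the current count before either one writes, and one prescription goes missing.

```python
def lost_update_demo():
    """Simulate two nodes counting prescriptions for the same doctor."""
    counts = {"dr-1": 0}

    # Both nodes read the count before either of them writes.
    seen_by_a = counts["dr-1"]
    seen_by_b = counts["dr-1"]

    # Each node writes "what I saw + 1"; the second write clobbers the first.
    counts["dr-1"] = seen_by_a + 1
    counts["dr-1"] = seen_by_b + 1

    # Two prescriptions were processed, but only one was counted.
    return counts["dr-1"]
```

The read and the write have to behave as one step, which is exactly what "transactional" buys you.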
In fact, mutability in general in a distributed system is difficult. In the old days of relational databases, you would simply start a transaction (in the "transactionAL" sense of the word), acquire a lock, and happily, safely update the data. But we know that locking mechanisms don't scale. (and in general, they can be a royal PITA sometimes -- I spent my week diagnosing an Oracle/Hibernate locking issue in production)
Nowadays, we lean on consensus algorithms and conditional updates. If two systems attempt to update the same data/count, one will win. The other will lose and retry. (See: http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0)
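The win/lose/retry pattern can be sketched like this. ConditionalStore is a hypothetical in-memory stand-in, not the Cassandra API: its lock only simulates the atomicity that a real store would get from its consensus round. The part to look at is the loop in increment, which reads, attempts a conditional write, and retries if another writer got there first.

```python
import threading


class ConditionalStore:
    """Hypothetical in-memory stand-in for a store with conditional updates."""

    def __init__(self):
        self._data = {}
        # Stands in for the store's internal consensus; callers never hold it.
        self._lock = threading.Lock()

    def read(self, key):
        with self._lock:
            return self._data.get(key, 0)

    def update_if(self, key, expected, new):
        """Write `new` only if the current value still equals `expected`."""
        with self._lock:
            if self._data.get(key, 0) != expected:
                return False  # someone else won; caller should retry
            self._data[key] = new
            return True


def increment(store, key):
    """Read, try a conditional write, and retry until this writer wins."""
    while True:
        current = store.read(key)
        if store.update_if(key, current, current + 1):
            return current + 1


def run_demo(writers=8, per_writer=250):
    """Have several threads increment the same counter concurrently."""
    store = ConditionalStore()

    def work():
        for _ in range(per_writer):
            increment(store, "dr-1")

    threads = [threading.Thread(target=work) for _ in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.read("dr-1")
```

No writer holds a lock across its read and its write. A loser simply finds out that its expected value is stale and tries again, so every increment is eventually counted.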
Conditional updates enable mutability at scale, but what if you want your customers to be able to define how data is counted/aggregated? Effectively, you want them to supply the functions and the "master version of truth". And what if you also want them to be able to alter that truth transactionally? Well, then you would have:
c(x', mf') -> c(x, mf)...
Now you've got a challenge. And now you've got Master Data Management (MDM).
Our platform allows our customers to build highly interactive, customizable universes of data and analytics.
So, it's one thing to handle transactions, and an entirely different thing to handle them transactionally. =)
BTW -- if you are looking for a challenge, let me know. We are currently looking for a lead developer that wants to come play with us.