If you've been through introductory computer science courses, you know that different data structures are suited to different applications. With a bit of analysis, you can classify the cost of an operation against a specific structure and predict how that cost will scale as the size of the data or the input grows. (See Big O notation if you'd like to geek out.) In Big Data, things are no different, but we are usually talking about performance and scaling relative to additional constraints such as the CAP theorem.
So what are our "data structures and algorithms" in the Big Data space? Well, here is my laundry/reading list:
- Bigtable: A Distributed Storage System for Structured Data
- Dynamo: Amazon’s Highly Available Key-value Store
- MapReduce: Simplified Data Processing on Large Clusters
- Dremel: Interactive Analysis of Web-Scale Datasets
- Paxos Made Simple
- Druid: A Real-time Analytical Data Store
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale
We know from our classes that to accommodate a greater variety of queries, we need different data structures and algorithms to service those queries quickly. That leaves us with no choice but to duplicate data across different data structures and to use different tools to query those systems. This approach is often referred to as "Polyglot Persistence". It worked, but it left the application to orchestrate the writes and queries across the different persistence mechanisms. (Painful.)
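To make the pain concrete, here is a minimal sketch of application-side polyglot persistence, assuming a Python application writing the same record to Cassandra (via the DataStax cassandra-driver) and Elasticsearch (via the elasticsearch-py client). The keyspace, table, and index names are made up for illustration, and the exact client signatures vary a bit across versions:

```python
# A minimal sketch of application-side "polyglot persistence":
# the application writes every event to two stores so that each
# query pattern has a data structure suited to it.
# Assumes the DataStax cassandra-driver and elasticsearch-py clients;
# keyspace, table, and index names are hypothetical.

from cassandra.cluster import Cluster
from elasticsearch import Elasticsearch

cassandra = Cluster(["127.0.0.1"]).connect("events_ks")
es = Elasticsearch(["http://127.0.0.1:9200"])

def save_event(event_id, user, payload):
    # Write 1: Cassandra, for fast key-based lookups and range scans.
    cassandra.execute(
        "INSERT INTO events (event_id, user, payload) VALUES (%s, %s, %s)",
        (event_id, user, payload),
    )
    # Write 2: Elasticsearch, for full-text search and ad-hoc filtering.
    es.index(index="events", id=event_id, body={"user": user, "payload": payload})
    # The pain: the application owns consistency between the two writes.
    # If the second write fails, there is no shared transaction to roll back.
```

The application is the only thing holding the two stores together, which is exactly the orchestration burden described above.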
To alleviate that pain, people are collapsing the persistence mechanisms and interfaces. Already we see first-level integrations, combining inverted indexes and search/query mechanisms with distributed databases, e.g.:
- Titan integrated Elasticsearch.
- DataStax integrated Solr.
- Stargate combines Cassandra and Lucene.
The tight integration between persistence mechanisms makes the duplication of data across stores transparent to the end user, but IMHO we still have a ways to go. What happens if you want to perform an ad-hoc ETL against the data? Then you need to fire up Hive or Pig (or Spark and/or Shark), use a whole different set of infrastructure, and learn a whole other language to accomplish your task.
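Here is a rough sketch of that context switch, assuming PySpark and the DataStax spark-cassandra-connector; the keyspace, table, column names, and output path are hypothetical:

```python
# A sketch of the "whole different set of infrastructure" problem:
# the same events we can already query through Cassandra/Elasticsearch
# now need a separate Spark job (plus the spark-cassandra-connector)
# just to run an ad-hoc extract/transform/load.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adhoc-etl").getOrCreate()

# Extract: pull the raw events out of Cassandra.
events = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="events_ks", table="events")
    .load()
)

# Transform: a trivial aggregation none of the query front-ends expose.
user_counts = events.groupBy("user").agg(F.count("*").alias("event_count"))

# Load: park the result somewhere else entirely (yet another store/format).
user_counts.write.mode("overwrite").parquet("/tmp/user_event_counts")
```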
One can imagine a second- or third-level integration here that unifies the interface to "big data": a single language providing search/query capabilities (e.g. Lucene-like queries), structured aggregations (e.g. SQL-like queries), and extract, transform, and load functions (e.g. Pig/Hive-like scripts), all rolled into one cohesive platform capable of orchestrating the operations/functions/queries across the various persistence and execution frameworks.
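Purely as a thought experiment, a query against such a platform might read something like the sketch below. The `UnifiedQuery` builder and every method on it are invented here to illustrate the idea; this is not an existing API:

```python
# Hypothetical: what a unified search + aggregate + transform query
# might feel like. UnifiedQuery is a stub so the sketch actually runs.

class UnifiedQuery:
    """Stub builder standing in for an imagined unified query language."""

    def __init__(self):
        self.steps = []

    def search(self, lucene_expr):      # Lucene-like full-text filter
        self.steps.append(("search", lucene_expr))
        return self

    def aggregate(self, sql_expr):      # SQL-like structured aggregation
        self.steps.append(("aggregate", sql_expr))
        return self

    def transform(self, etl_expr):      # Pig/Hive-like ETL step
        self.steps.append(("transform", etl_expr))
        return self

    def run(self):
        # A real implementation would plan these steps across the various
        # persistence stores and compute frameworks behind the scenes.
        return self.steps

plan = (
    UnifiedQuery()
    .search('payload:"checkout failure" AND user:premium*')
    .aggregate("SELECT region, count(*) GROUP BY region")
    .transform("STORE results INTO 'hdfs:///reports/failures'")
    .run()
)
```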
I'm not quite sure what that looks like. Would it use Storm or Spark as the compute framework, perhaps running on YARN, backed by Cassandra, Elasticsearch, and Titan, with pre-aggregation capabilities like those found in Druid?
Who knows? Food for thought on a Friday afternoon.
Time to grab a beer.
(P.S. Shout out to Philip Klein and John Savage, two professors back at Brown who inspired these musings.)