Druid is one of the best slicers and dicers on the planet (except maybe for the Vegematic-2 ;). And I know there are people out there who might argue that Elasticsearch can do the job (Shay & Isaac, I'm looking at you), but Metamarkets and Yahoo have proved that Druid can scale to some pretty insane numbers, with query times an order of magnitude better than Spark. And because of that, we've come to rely on Druid as our main analytics database.
The challenge is figuring out how best to pipe events into Druid. As I pointed out in my previous post, Kinesis is now a contender. Alas, Druid has no first-class support for Kinesis. Thus, we need to figure out how to plumb Kinesis to Druid. (<-- druidish pun intended)
First, let me send out some Kudos to Cheddar, Fangjin, Gian and the crew...
They've made *tremendous* progress since I last looked at Druid. (0.5.X days)
That progress has given us the additional options below.
Here are the options as I see them:
Option 1: Traditional KinesisFirehose Approach
In the good ole' days of 0.5.X, real-time data flowed via Firehoses. To introduce a new ingest mechanism, one simply implemented the Firehose interface, fired up a real-time node, and voila: data flowed like spice. (this is what I detailed in the Storm book)
Since 0.5, Druid has updated the Firehose API specifically to address the issues that we saw before (around checkpointing, and making sure events are processed only once, in a highly available manner). For the update on the Firehose API, check out this thread.
After reading that thread, I was excited to crank out a KinesisFirehose. And we *nearly* have a KinesisFirehose up and running. As with most Java projects these days, you spend 15 minutes on the code, and then two hours working out dependencies. Ironically, Amazon's Kinesis Client Library (KCL) uses an *older* version of the aws-java-sdk than Druid, and because the firehose runs in the same JVM as the Druid code, you have to work out the classpath kinks. When I hit this wall, it got me thinking -- hmmm, maybe some separation might be good here. ;)
To summarize what I did: I took the Kinesis Client Library's push model (implemented IRecordProcessor), put a simple little BlockingQueue in place, and wired it to Druid's pull model (Firehose.nextRow). It worked like a charm in unit tests. ;)
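For the curious, here is a minimal sketch of that kind of queue bridge. It is not the actual firehose code: the class name `PushPullBridge` and its methods are hypothetical stand-ins, and the real record processor and Firehose interfaces have more methods (initialization, checkpointing, commit, shutdown) than this shows. The idea is just that the push callback does a blocking `put`, and the pull side does a `poll`/`take`.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical bridge between a push-style producer and a pull-style consumer.
 * The pull-side methods are meant to be called from a single consumer thread.
 */
class PushPullBridge<T> {
    private final BlockingQueue<T> queue;
    private T next; // row fetched by hasMore() but not yet handed out

    PushPullBridge(int capacity) {
        // Bounded, so a slow consumer applies back-pressure to the producer.
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Push side: call from the record processor's callback. Blocks while the queue is full. */
    void push(List<T> records) throws InterruptedException {
        for (T record : records) {
            queue.put(record);
        }
    }

    /** Pull side: waits up to timeoutMillis for a row; false means nothing arrived in time. */
    boolean hasMore(long timeoutMillis) throws InterruptedException {
        if (next == null) {
            next = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        }
        return next != null;
    }

    /** Pull side: returns the row found by hasMore(), or blocks until one arrives. */
    T nextRow() throws InterruptedException {
        T row = (next != null) ? next : queue.take();
        next = null;
        return row;
    }
}
```

The bounded queue is the one design choice worth calling out: if Druid pulls more slowly than Kinesis delivers, `push` blocks instead of letting memory grow without limit.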
(ping me, if you want to collaborate on the firehose implementation)
Option 2: The EventReceiver Firehose
Of course, the whole time I'm implementing the Firehose -- I'm on the consuming end of a push model from the Kinesis Client Library (KCL) and on the producing end of a pull model from Druid, and forced to dance between those two. It made me wonder, "Isn't there a way to push data into Druid?" If there were such a thing, then I could just push data into Druid as I receive it from the KCL. (that sounds nice/simple doesn't it?)
This ends up being the crux of the issue, and something that many people have been wrestling with. It turns out that there is a firehose specifically for this! Boo yah! If you have a look at the available firehoses, you'll see an Event Receiver firehose. The Event Receiver firehose creates a REST endpoint to which you can POST events. (have a look at the addAll() method's
Option 3: Tranquility
In doing some due diligence on Tranquility, I discovered that while I was away, Druid implemented their own task management system!! (See the Indexing-Service documentation for more information.) Honestly, my first reaction was to run for the hills. Why would Druid implement their own task management system? MiddleManagers and Peons... sounds an awful lot like YARN. I thought the rest of the industry was moving to YARN (Spark, Samza, storm-yarn, etc.). YARN was appropriately named, Yet Another Resource Negotiator, because everyone was building their own! What happened to the simplicity of Druid, and the real-time node?
Despite that visceral reaction, I decided to give Tranquility a whirl. I fired up the simple cluster from the documentation, incorporated the Finagle-based API integration and was able to get data flowing through the system. I kept an eye on the overlord's console and watched the tasks move from pending to running to complete. It worked! I was happy... sort of. I was left with a "these are small Hadoop jobs" taste in my mouth. It didn't feel like events were "streaming".
Now, this might just be me. In reality, many/all of the streaming frameworks are actually just doing micro-batching. And with Druid's segment-based architecture, the system really needs to wait for a windowPeriod/segmentGranularity to pass before a segment can be published. I get it.
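To put numbers on that wait, here is a back-of-the-envelope sketch. It assumes hourly segmentGranularity, and it assumes (my reading of the docs, not something I verified in the code) that a segment can be handed off no earlier than the end of its hour plus the windowPeriod. The class and method names are made up for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper: timing arithmetic for an hourly segment. */
class SegmentWindow {

    /** Start of the hourly segment bucket that contains the event. */
    static Instant segmentStart(Instant eventTime) {
        return eventTime.truncatedTo(ChronoUnit.HOURS);
    }

    /** Assumed earliest handoff: end of the hourly bucket plus the window period. */
    static Instant earliestHandoff(Instant eventTime, Duration windowPeriod) {
        return segmentStart(eventTime)
                .plus(1, ChronoUnit.HOURS)
                .plus(windowPeriod);
    }
}
```

Under those assumptions, an event stamped 12:34:56 with a ten-minute windowPeriod lands in the 12:00 segment, and `earliestHandoff` puts that segment's handoff at 13:10:00 at the earliest. That gap is why it felt more like small batches than a stream to me.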
I'm just scared. For me, the more nouns involved, the more brittle the system: tasks, overlords, middle managers, peons, etc. Maybe I should just get myself some cojones, and plow ahead.
Regardless, we plan to meet up with Fangjin Yang and Gian Merlino, who are presently coming out of stealth mode with a new company. I have a tremendous amount of faith in these two. I'm confident they will steer us in the right direction, and when they do -- I'll report back. =)