Monday, October 5, 2015

Getting started w/ Python Kinesis Consumer Library (KCL) (via IPython Notebook)


We are racing ahead with Kinesis.  With streaming events in hand, we need to get our data scientists tapped in. One problem, the KCL and KPL are heavily focused on Java, but our data scientists (and the rest of our organization) love Python. =)

We use IPython notebook to share python code/samples.  So, in this blog I'm going to get you up and running, consuming events from Kinesis, using Python and IPython notebook.  If you are already familiar with IPython notebook, you can pull the notebook from here:
https://github.com/boneill42/kcl-python-notebook/

If you aren't familiar with iPython Notebook, first go here to get a quick overview.  In short, IPython Notebook is a great little webapp that lets you play with python commands via the browser, creating a wiki/gist-like "recipe" that can be shared with others.  Once familiar, read on.

To get started with KCL-python, let's clone the amazon-kinesis-client-python github repo:

   git clone git@github.com:awslabs/amazon-kinesis-client-python.git

Then, create a new virtualenv.  (and if you don't use virtualenv, get with the program =)
 
   virtualenv venv/kcl 
   source venv/kcl/bin/activate

Next, get IPython Notebook setup.  Install ipython notebook, by running the following commands from the directory containing the cloned repo:

   pip install "ipython[notebook]"
   ipython notebook

That should open a new tab in your browser.  Start a new notebook by clicking "New->Python 2".   A notebook comprises one or more cells.  Each cell has a type.   For this exercise, we'll use "Code" cells.   We'll take advantage of "magic" functions available in IPython Notebook.  We'll use run, which will execute a command.  Specifically, need execute the python setup for KCL.

The amazon-kinesis-client-python library actually rides on top of a Java process, and uses MultiLangDaemon for interprocess communication.  Thus, to get KCL setup you need to download the jars.  You can do that in the notebook by entering and executing the following in the first cell:

   run setup.py download_jars

Next, you need to install those jars with:

   run setup.py install

That command installed a few libraries for python (specifically -- it installed boto, which is the aws library for python).  Because the notebook doesn't dynamically reload, you'll need to restart IPython Notebook and re-open the notebook. (hmmm... Is there a way to reload the virtualenv used by the notebook w/o restarting?)

Once, reloaded you are ready to start emitting events into a kinesis stream.  For now, we'll use the sample application in the KCL repo.  Create a new cell and run:

   run samples/sample_kinesis_wordputter.py --stream words -w cat -w dog -w bird -w lobster

From this, you should see:

Connecting to stream: words in us-east-1
Put word: cat into stream: words
Put word: dog into stream: words
Put word: bird into stream: words
Put word: lobster into stream: words

Woo hoo! We are emitting events to Kinesis!  Now, we need to consume them. =)

To get that going, you will want to edit sample.properties.  This file is loaded by KCL to configure the consumer.   Most importantly, have a look at the executableName property in that file.  This sets the name of the python code that the KCL will execute when it receives records.  In the sample, this is sample_kclpy_app.py.  

Have a look at that sample Python code.  You'll notice there are four important methods that you will need to implement:  init, process_records, checkpoint, and shutdown.  The purpose of those methods is almost self-evident, but the documentation in the sample is quite good.  Have a read through there.

Also in the sample.properties, notice the AWSCredentialsProvider.  Since it is using Java underneath the covers, you need to set the Java classname of the credentials provider.  If left unchanged, it will use: DefaultAWSCredentialsProviderChain.

It is worth looking at that documentation, but probably the easiest route to get the authentication working is to create a ~/.aws/credentials file that contains the following:

[default]
aws_access_key_id=XXX
aws_secret_access_key=XXX

Once you have your credentials setup, you can crank up the consumer by executing the following command in another notebook cell:

   run samples/amazon_kclpy_helper.py --print_command --java /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/bin/java --properties samples/sample.properties

Note that the --java parameter needs the location of the java executable.  Now, running that in the notebook will give you the command line that you can go ahead and execute to consume the stream.   Go ahead and cut-and-paste that into a command-line and execute it.  At this point you are up and running.

To get started with your own application, simply replace the sample_kclpy_app.py code with your own, and update the properties file, and you should be off and running.

Now, I find this bit hokey.  There is no native python consumer (yet!).  And fact you actually need to run *java*, which will turn around and call python.  Honestly, with this approach, you aren't getting much from the python library over what you would get from the Java library.  (except for some pipe manipulation in MultiLangDaemon)  Sure, it is a micro-services approach... but maybe it would be better to simply run a little python web service (behind a load balancer?).   Then, we'd be able to scale the consumers better without re-sharding the Kinesis stream.  (One shard/KCL worker could fan-out work to many python consumers across many boxes)  You would need to be careful about checkpointing, but it might be a bit more flexible.  (and would completely decouple the "data science" components from Kinesis)  Definitely food for thought. =)

As always, let me know if you run into trouble.

No comments: