Wednesday, October 3, 2012

Simplified Hadoop frameworks

There are many frameworks available for reducing the complexity involved in writing MapReduce programs for Hadoop.

In this article I discuss a few of them; most are actively developed and have many production implementations. These frameworks increase your productivity by providing high-level features that wrap the low-level complexity. A few of them even let you write plain Java code without thinking in MapReduce.

Before using them, make sure you evaluate and understand the frameworks well so that you do not end up selecting the wrong one. Also think about your long-term needs. In my experience, any framework covers the basic tenets, but as you explore and progress, you may struggle to do things the right way. It gets even worse when the features you need are not actively supported by the framework. It is no different from selecting a framework for any other job: all the usual rules apply.

1. Cascading
2. Pangool
3. Pydoop
4. MRJob
5. Happy
6. Dumbo

Cascading :

  • It is an abstraction layer that works on top of Apache Hadoop.
  • Can create and execute complex workflows.
  • Works with any JVM-based language (Java, Ruby, Clojure).
  • Workflows are primarily executed using pipes.
  • It follows the “source-pipe-sink” paradigm: data is captured from different sources and flows through reusable ‘pipes’ that perform the data analysis.
  • Developers can write JVM-based code without really thinking in MapReduce.
  • Supported commercially by Concurrent, Inc.
  • URL : http://www.cascading.org


Pangool :

  • Works on most Hadoop distributions.
  • Easier MapReduce development.
  • Support for tuples instead of just key/value pairs.
  • Efficient, easy-to-use secondary sorting.
  • Efficient, easy-to-use reduce-side joins.
  • The performance and flexibility of Hadoop without worrying about Hadoop's complexity.
  • First-class multiple inputs and outputs.
  • Built-in serialization support for Thrift and Protostuff.
  • Commercial support from Datasalt.
  • URL : http://pangool.net/

Pydoop :

  • Provides a simple Python API for MapReduce.
  • It is a CPython package, so it gives access to an extensible set of Python libraries such as NumPy and SciPy.
  • A more interactive, high-level Hadoop API is available for executing complex jobs.
  • Provides a high-level HDFS API (see the sketch after this list).
  • Developed by CRS4 (http://www.crs4.it/).
  • URL : pydoop.sourceforge.net
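
For example, the HDFS API lets you work with HDFS files much as you would with local ones. A minimal sketch using the pydoop.hdfs module; the paths are hypothetical and exact details may vary between Pydoop versions:

    import pydoop.hdfs as hdfs

    # List a directory on HDFS (hypothetical path).
    for entry in hdfs.ls("/user/hduser"):
        print(entry)

    # Read an HDFS file as if it were local.
    f = hdfs.open("/user/hduser/input.txt")
    data = f.read()
    f.close()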

MRJob :

  • Simplified MapReduce scripts (see the word-count example after this list).
  • Developed by Yelp and actively used in many production environments.
  • Built for Hadoop, written in Python.
  • Simpler to use than writing Hadoop Streaming jobs in Python directly.
  • Can also run jobs on Amazon Elastic MapReduce (EMR).
  • Can be used for running complex machine-learning algorithms and log processing on a Hadoop cluster.
  • URL : https://github.com/Yelp/mrjob
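
Here is a minimal word count in the standard mrjob style: subclass MRJob, define mapper and reducer methods, and call run():

    from mrjob.job import MRJob

    class MRWordCount(MRJob):

        def mapper(self, _, line):
            # Each input line arrives as the value; the key is unused.
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # counts is an iterator over the 1s emitted for this word.
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()

The same script runs locally ("python wordcount.py input.txt"), on a Hadoop cluster (-r hadoop), or on EMR (-r emr) without code changes.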

Happy :

  • Simplified Hadoop framework built on Jython.
  • MapReduce jobs in Happy are defined by subclassing happy.HappyJob and implementing a map(records, task) and a reduce(key, values, task) function (see the sketch after this list).
  • Calling run() executes the job.
  • It can also be used for complex data processing and has been used in production environments.
  • URL : http://code.google.com/p/happy/
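
A rough sketch of that pattern. The map/reduce signatures are as described above; the attribute names (inputpaths, outputpath, inputformat) are taken from Happy's examples and are an assumption here, so check the project page for exact details:

    import happy

    class WordCount(happy.HappyJob):
        def __init__(self, inputpath, outputpath):
            happy.HappyJob.__init__(self)
            # Attribute names follow Happy's documented examples (assumption).
            self.inputpaths = inputpath
            self.outputpath = outputpath
            self.inputformat = "text"

        def map(self, records, task):
            # records iterates over (key, value) pairs from the input.
            for _, value in records:
                for word in value.split():
                    task.collect(word, "1")

        def reduce(self, key, values, task):
            # values iterates over everything collected under this key.
            task.collect(key, str(sum(1 for _ in values)))

    if __name__ == "__main__":
        WordCount("input", "output").run()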

Dumbo :

  • Dumbo is also written in Python.
  • Python API for writing MapReduce jobs.
  • All the low-level details are nicely wrapped and exposed through Unix pipes.
  • It has many convenient built-in functions for writing high-level MapReduce scripts (see the example after this list).
  • URL : https://github.com/klbostee/dumbo
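
Dumbo's canonical word count shows the style: plain generator functions wired together with dumbo.run():

    def mapper(key, value):
        # key is the byte offset; value is the line of text.
        for word in value.split():
            yield word, 1

    def reducer(key, values):
        # values yields the counts emitted for this word.
        yield key, sum(values)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, reducer)

You submit it through the dumbo launcher, e.g. "dumbo start wordcount.py -hadoop /path/to/hadoop -input in.txt -output out".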

