Wednesday, October 3, 2012

Simplified Hadoop frameworks

There are many frameworks available for reducing the complexity involved in writing MapReduce programs for Hadoop.

In this article I discuss a few of them; most are actively developed and have many production implementations. These frameworks increase your productivity by providing high-level features that wrap the low-level complexity. A few of them even let you write plain Java code without thinking in MapReduce.

Before using them, make sure you evaluate and understand the frameworks well so that you do not end up selecting the wrong one. Also think about your long-term needs. In my experience, any framework covers the basic tenets, but as you explore and progress, you may struggle to do things the right way. It gets even worse when the features you need are not actively supported by the framework. It is no different from selecting a framework for any other job: all the usual rules apply.

1. Cascading
2. Pangool
3. Pydoop
4. MRJob
5. Happy
6. Dumbo

Cascading :

  • It is an abstraction layer that works on top of Apache Hadoop.
  • Can create and execute complex workflows.
  • Works with any JVM-based language (Java, Ruby, Clojure).
  • Workflows are primarily executed using pipes.
  • It follows the “source-pipe-sink” paradigm: data is captured from different sources and flows through reusable ‘pipes’ that perform the data analysis.
  • Developers can write JVM-based code without really thinking in MapReduce.
  • Supported commercially by Concurrent, Inc.
  • URL : http://www.cascading.org


Pangool :

  • Works on most Hadoop distributions.
  • Easier MapReduce development.
  • Support for tuples instead of just key/value pairs.
  • Efficient, easy-to-use secondary sorting.
  • Efficient, easy-to-use reduce-side joins.
  • The performance and flexibility of Hadoop without worrying about Hadoop's complexity.
  • First-class multiple inputs and outputs.
  • Built-in serialization support for Thrift and Protostuff.
  • Commercial support from Datasalt.
  • URL : http://pangool.net/

Pydoop :

  • Provides a simple Python API for MapReduce.
  • It is a CPython package, so it gives access to an extensible set of Python libraries such as NumPy and SciPy.
  • A more interactive, high-level Hadoop API is available for executing complex jobs.
  • Provides a high-level HDFS API (see the sketch after this list).
  • Developed by CRS4 (http://www.crs4.it/).
  • URL : pydoop.sourceforge.net
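
For example, the HDFS API lets you work with HDFS files much as you would with local ones. A minimal sketch using the pydoop.hdfs module; the paths are hypothetical and exact details may vary between Pydoop versions:

    import pydoop.hdfs as hdfs

    # List a directory on HDFS (hypothetical path).
    for entry in hdfs.ls("/user/hduser"):
        print(entry)

    # Read an HDFS file as if it were local.
    f = hdfs.open("/user/hduser/input.txt")
    data = f.read()
    f.close()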

MRJob :

  • Simplified MapReduce scripts (see the word-count example after this list).
  • Developed by Yelp and actively used in many production environments.
  • Built for Hadoop, written in Python.
  • Simpler to use than writing Hadoop Streaming jobs in Python directly.
  • Can also run jobs on Amazon Elastic MapReduce (EMR).
  • Can be used for running complex machine-learning algorithms and log processing on a Hadoop cluster.
  • URL : https://github.com/Yelp/mrjob
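
Here is a minimal word count in the standard mrjob style: subclass MRJob, define mapper and reducer methods, and call run():

    from mrjob.job import MRJob

    class MRWordCount(MRJob):

        def mapper(self, _, line):
            # Each input line arrives as the value; the key is unused.
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # counts is an iterator over the 1s emitted for this word.
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()

The same script runs locally ("python wordcount.py input.txt"), on a Hadoop cluster (-r hadoop), or on EMR (-r emr) without code changes.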

Happy :

  • Simplified Hadoop framework built on Jython.
  • MapReduce jobs in Happy are defined by subclassing happy.HappyJob and implementing a map(records, task) and a reduce(key, values, task) function (see the sketch after this list).
  • Calling run() executes the job.
  • It can also be used for complex data processing and has been used in production environments.
  • URL : http://code.google.com/p/happy/
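
A rough sketch of that pattern. The map/reduce signatures are as described above; the attribute names (inputpaths, outputpath, inputformat) are taken from Happy's examples and are an assumption here, so check the project page for exact details:

    import happy

    class WordCount(happy.HappyJob):
        def __init__(self, inputpath, outputpath):
            happy.HappyJob.__init__(self)
            # Attribute names follow Happy's documented examples (assumption).
            self.inputpaths = inputpath
            self.outputpath = outputpath
            self.inputformat = "text"

        def map(self, records, task):
            # records iterates over (key, value) pairs from the input.
            for _, value in records:
                for word in value.split():
                    task.collect(word, "1")

        def reduce(self, key, values, task):
            # values iterates over everything collected under this key.
            task.collect(key, str(sum(1 for _ in values)))

    if __name__ == "__main__":
        WordCount("input", "output").run()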

Dumbo :

  • Dumbo is also written in Python.
  • Python API for writing MapReduce jobs.
  • All the low-level details are nicely wrapped and exposed through Unix pipes.
  • It has many convenient built-in functions for writing high-level MapReduce scripts (see the example after this list).
  • URL : https://github.com/klbostee/dumbo
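
Dumbo's canonical word count shows the style: plain generator functions wired together with dumbo.run():

    def mapper(key, value):
        # key is the byte offset; value is the line of text.
        for word in value.split():
            yield word, 1

    def reducer(key, values):
        # values yields the counts emitted for this word.
        yield key, sum(values)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, reducer)

You submit it through the dumbo launcher, e.g. "dumbo start wordcount.py -hadoop /path/to/hadoop -input in.txt -output out".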

