Hadoop as an ecosystem has evolved and been adopted by many enterprises to meet their Big Data needs. However, with the current set of development tools, getting Hadoop running and producing what the user wants is not a trivial task. On one hand, many start-ups are making Hadoop more suitable for real-time query processing, while others are making the entire ecosystem simpler to use. Hadoop is not a platform only for querying data; it also helps solve a diverse set of use cases, from log processing to genome analysis. The ecosystem is fairly complex and is maturing to handle a wide variety of problems. So, beyond real-time queries, Hadoop can be applied to many different Big Data needs, and all of them require a reasonably simple development environment to get started. MortarData is one such start-up, trying to make Hadoop development many times easier.
MortarData CEO K Yung and his team have been working on this technology for a while, and their simple USP is "getting ready with Hadoop in one hour". Mortar has launched its Hadoop platform as a service on Amazon. Amazon also offers Amazon Elastic MapReduce, which is a more general Hadoop platform compared to what Mortar is trying to achieve.
Mortar, on the other hand, has built a Hadoop infrastructure that can be driven with simple Python or Pig scripts. Mortar also provides features for sharing public datasets and analysis code so that everyone can get started easily; anyone interested in sharing a public dataset, or code for analyzing large-scale datasets, can do so through GitHub. Besides HDFS, the platform also supports other storage back ends such as Amazon S3 and MongoDB, and data can be loaded from these external stores into HDFS to run MapReduce jobs as and when required.
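To give a feel for that workflow, here is a minimal sketch of the kind of Pig script such a platform can run; the S3 bucket, file layout, and schema are hypothetical and used only for illustration.

    -- Load tab-separated web logs directly from S3 (bucket and schema are hypothetical)
    logs = LOAD 's3n://example-bucket/logs/' USING PigStorage('\t')
           AS (user_id:chararray, url:chararray, ts:long);

    -- Count page views per user
    grouped = GROUP logs BY user_id;
    counts  = FOREACH grouped GENERATE group AS user_id, COUNT(logs) AS views;

    -- Write the result back to S3
    STORE counts INTO 's3n://example-bucket/output/user_view_counts';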
The platform lets users install Python-based analytical tools such as NumPy, SciPy, and NLTK. According to Yung, more tools will be added to the platform over time.
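To illustrate why that matters, a Python UDF along the following lines could bring a library like NLTK into a Pig job; the function name, schema, and registration details are my own assumptions for the sketch, not Mortar's documented API.

    # tokenizer.py -- hypothetical CPython UDF using NLTK
    from pig_util import outputSchema  # helper used by Pig's streaming_python UDF support
    import nltk

    @outputSchema('tokens:bag{t:tuple(token:chararray)}')
    def tokenize(text):
        # Turn a free-text field into a bag of word tokens so Pig can
        # group and count them downstream.
        if text is None:
            return []
        return [(token,) for token in nltk.word_tokenize(text)]

    # In the Pig script this might be registered and called along these lines:
    #   REGISTER 'tokenizer.py' USING streaming_python AS tokenizer;
    #   tokens = FOREACH docs GENERATE FLATTEN(tokenizer.tokenize(body));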
I think more and more people will use these kinds of platforms, as they remove the whole process of installing Hadoop and managing a Hadoop cluster, which is itself complex. However, a simple development environment is not a big differentiator on its own; these companies need to focus on auto scaling and other ways to minimize the cost of running Hadoop clusters based on past workloads. Other areas could be simpler diagnostic and management tools that make debugging straightforward, and pre-configured ecosystem libraries in place of manual installation. These are a couple of the core areas where I think most of the work will be done in the future.