Saturday, December 22, 2012

MortarData : Hadoop PaaS


Hadoop as an ecosystem has evolved and been embraced by many enterprises to solve their Big Data needs. However, with the current set of development tools, getting Hadoop running and producing what the user wants is not a trivial task. On one hand, many start-ups are making Hadoop more suitable for real-time query processing, while others are making the entire ecosystem simpler to use. Hadoop is not a platform for only querying data; it also helps solve a diverse set of use cases, from log processing to genome analysis. The Hadoop ecosystem is fairly complex and is maturing to handle a wide variety of problems. So beyond real-time queries, Hadoop can be applied to many different Big Data needs, and all of them require a fairly simple development environment to get started. MortarData is one such start-up trying to ease Hadoop development by many folds.

MortarData CEO K Young and his team have been working on this technology for a while, and their simple USP is “getting ready with Hadoop in one hour”. Mortar launched its Hadoop platform as a service on Amazon. Amazon also offers Elastic MapReduce, which is a more general Hadoop platform compared to what Mortar is trying to achieve. Mortar, on the other hand, built a Hadoop infrastructure that can be driven with simple Python or Pig scripts. Mortar also lets anyone share public datasets and analysis code so that others can get started easily; anyone interested in sharing a public data set and the code to analyze it at scale can do so through GitHub. Besides HDFS, it supports other storage back-ends such as Amazon S3 and MongoDB, and data can be populated from these external stores into HDFS to run MapReduce as and when required. The platform allows users to install Python-based analytical tools like NumPy, SciPy and NLTK, and according to Young more tools will be added to the platform over time.

I think more and more people will use these kinds of platforms, as they remove the whole Hadoop installation process and the management of the Hadoop cluster, which is by itself complex. However, a simple development environment alone is not a big differentiator; these companies need to focus on auto scaling and other ways to minimize the cost of running Hadoop clusters based on past workloads. Other areas could be simpler diagnostic and management tools that make debugging straightforward, and pre-configuring important ecosystem libraries rather than requiring manual installation. These are a couple of core areas where I think most of the work will happen in the future.

Thursday, December 20, 2012

Platfora and Qubole are taking Hadoop to the next level with real-time interactive queries


As I have discussed earlier, one of the important disruptions currently happening in the Hadoop ecosystem is making Hadoop more real time. Cloudera has already developed Impala, which lets users run real-time queries. MapR, with the Apache Drill project, provides a declarative, SQL-like language to query the underlying Hadoop data in real time. Google already has its proprietary analytics engine and offers a service called BigQuery to query petabytes of data. Facebook, on the other hand, is putting a lot of effort into improving the performance of Hive and its ecosystem.

There are other start-ups, like Platfora and Qubole, taking a different route to solve this big puzzle while still using Hadoop. According to Ben Werther, founder of Platfora, business cannot wait days and weeks to get answers; it has to be instantaneous. The true innovation lies in the data agility and exploration made possible by Hadoop.

Hadoop is a great platform for solving Big Data problems and has one of the biggest ecosystems in the space. But the real issue with Hadoop is that it is not meant for everyone: the platform is still restricted to a handful of researchers and calls for a substantial amount of learning. This is the gap most of these companies want to fill, in many different ways.

Platfora recently raised $20 million in Series B funding and has been working for more than a year to make Hadoop available to business users. There are other first-generation start-ups that built technologies around Hadoop, but most of them run queries in a batch-oriented way. This is where Platfora and Qubole differ in their implementation. Platfora has built an analytics platform from the ground up, with Vizboards for analytics, a scale-out in-memory data processing engine and a Hadoop data refinery as key components. Platfora wants to change the way traditional BI works.

One more start-up, Qubole, is heading along the same lines. They have released the Qubole data platform as a service that runs on Amazon EC2 and Hadoop. Qubole founders Ashish Thusoo and Joydeep Sen Sarma came from the Facebook data infrastructure team, where they built and ran the Hive platform, so they know the bottlenecks of Hadoop and Hive well. One of the main design goals of Qubole is to make Hadoop more real time and a simple platform for business users to run their queries.

To make queries run interactively, they use HQL and Hive, but the entire platform is designed to run queries in real time. The platform scales seamlessly up and down; users need not provision the number of machines upfront. In a typical scenario it is unlikely that the user knows how many machines a query requires. It is the underlying platform’s job to allocate and free machines in the cluster on demand, based on the workload and past statistics.

It is also good to see non-Hadoop in-memory databases like SAP HANA, and the recently announced Amazon Redshift, building technologies that address big data more like a traditional transactional database.

We need to see how far Hadoop can leapfrog in solving this. As of now there is a lot of innovation going on to make Hadoop more real time and simpler for many different use cases. Recently, Microsoft also added support for Hadoop on its Azure platform, though that implementation is pure vanilla Hadoop. Can we expect Microsoft to bring something innovative to the Hadoop table, or will it simply follow the crowd?

Monday, December 10, 2012

Trevni : A columnar file format for Cloudera Impala


Trevni is a columnar file format developed by Doug Cutting. It stores data column by column and is intended to be a core storage format for the Cloudera Impala project, which delivers real-time queries on the Hadoop file system.

Trevni features:
  1. Inspired by the CIF/COF column-oriented database architecture. CIF-based columnar layouts work well with MapReduce.
  2. Stores data by column, which compresses well because the values within a single column are of the same kind. Retrieval is fast because only the needed columns are scanned, compared to a row store.
  3. To achieve scalable, distributed query evaluation, data sets are partitioned into row groups, each containing a distinct set of rows. Within each row group, data is stored vertically, as in a column store (see figure 1 below and the illustrative sketch after it).
  4. Maximizes the size of row groups in order to reduce disk-seek latency. Each row group can be larger than 100 MB, which favours sequential reads and reduces disk IO.
  5. Each row group is written as a separate file, and all values of a column are written contiguously to get optimal IO performance.
  6. Fewer row groups means fewer HDFS files, which reduces the load on the NameNode. So it is better to have a few large files per data set, i.e. fewer row groups.
  7. Allows dynamic access to data within a row group, and supports co-location of columns within a row group, as per the CIF storage architecture.
  8. Supports nested column structures for semi-structured data, in the form of arrays and dictionaries.
  9. Application-specific metadata can be kept at every level: file, column and block. Checksums are used at the block level to provide data integrity.
  10. Supports many data types such as int, long, float, double, string and bytes for complex aggregated data. It also supports NULLs, and a NULL occupies zero bytes, which is one of the key differences between column storage and row storage for saving disk space.

Figure 1: Illustrates the row-group concept in a columnar store.
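To make the row-group idea concrete, here is a small, purely illustrative Python sketch. It is not Trevni's actual on-disk format; it only mimics the idea of partitioning rows into row groups and then laying each group out column by column.

# Purely illustrative: partition rows into row groups, then store each
# group column-wise. This mimics the layout idea, not Trevni's real format.
ROWS = [
    {"name": "sachin", "year": 1996, "score": 86},
    {"name": "sachin", "year": 1997, "score": 97},
    {"name": "nikhil", "year": 1998, "score": 89},
    {"name": "akash",  "year": 1996, "score": 122},
]
ROW_GROUP_SIZE = 2  # in reality a row group would be > 100 MB, not 2 rows

def to_row_groups(rows, group_size):
    """Split rows into row groups and store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Within a row group, all values of one column sit together,
        # which is what makes per-column compression and scanning cheap.
        columns = {key: [row[key] for row in chunk] for key in chunk[0]}
        groups.append(columns)
    return groups

for i, group in enumerate(to_row_groups(ROWS, ROW_GROUP_SIZE)):
    print "row group", i, "->", group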


Tuesday, November 20, 2012

Cloudera Impala Beta - version 0.2 is available now

To download Impala and learn more about it, please go through this Cloudera post:

http://blog.cloudera.com/blog/2012/11/cloudera-impala-beta-version-0-2-and-cloudera-manager-4-1-1-now-available/

To take the Impala training, go through this URL:

http://training.cloudera.com/elearning/impala/

Scraping Twitter with ScraperWiki


While I was searching for a good scraper in Python, I came across many scrapers written in Python. Finally I tried ScraperWiki, and it was quite interesting.

Everything can be done within the browser, and it is very simple to use. We can write Python scraper scripts in the browser, run and test the code, and see the results on the same page. We can also write scripts in other languages like Ruby and PHP.

It also has various other built-in scrapers, for example for scraping CSV and Excel files and storing the data back to a database. Please go through this URL ( https://scraperwiki.com/ ) to learn more.

I thought of writing a simple scraper for getting results from Twitter, and here is my piece of Python code. You can start from the publicly available scripts on the ScraperWiki site, modify them and run them yourself.

import scraperwiki
import simplejson
import urllib2

# Get results from the Twitter search API. Change QUERY to your search term of choice.
# Examples: 'newsnight', 'from:bbcnewsnight', 'to:bbcnewsnight'

QUERY = 'bigdata'
RESULTS_PER_PAGE = '100'
LANGUAGE = 'en'
NUM_PAGES = 5

for page in range(1, NUM_PAGES + 1):
    base_url = 'http://search.twitter.com/search.json?q=%s&rpp=%s&lang=%s&page=%s' \
        % (urllib2.quote(QUERY), RESULTS_PER_PAGE, LANGUAGE, page)
    try:
        results_json = simplejson.loads(scraperwiki.scrape(base_url))
        for result in results_json['results']:
            # Keep only the fields we care about from each tweet.
            data = {}
            data['id'] = result['id']
            data['text'] = result['text']
            data['from_user'] = result['from_user']
            print data['from_user'], data['text']
    except Exception:
        print 'Failed to scrape %s' % base_url
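If you want to keep the tweets instead of just printing them, ScraperWiki's datastore can be used. This is a minimal sketch assuming the scraperwiki.sqlite.save(unique_keys, data) helper the hosted platform exposed at the time; you would call it inside the result loop above.

import scraperwiki

# Save each tweet into the ScraperWiki datastore, keyed by tweet id.
# Assumes scraperwiki.sqlite.save(unique_keys, data) as provided by the platform.
def save_tweet(data):
    scraperwiki.sqlite.save(unique_keys=['id'], data=data)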

Thursday, November 15, 2012

Project Apache Drill and Impala want to SQLize Hadoop for real-time data access


There is a lot of effort going into enabling real-time data access in the Hadoop ecosystem. It is evident that Hadoop is becoming synonymous with the de facto Big Data architecture within enterprises. The ecosystem is sprawling and growing very rapidly, as it gives many different companies and communities a fundamental opportunity to make sense of petabytes of unstructured data. Organizations in space, weather, genetics, carbon footprint, retail and finance are using Hadoop to tackle their Big Data.

Hadoop is used primarily as a batch processing engine to crunch petabytes of data; it is not meant for real-time processing. This has triggered companies to think differently and let Hadoop access data more like a relational database, with real-time queries.

There are two recent initiatives, from MapR and Cloudera, for making Hadoop real time using SQL syntax, along the lines of Hive and HQL. The concept itself is not entirely new: analytics database vendors such as Greenplum and Aster Data already provide a SQL-like interface over MapReduce for large-scale data analytics. However, their design principles are different.

The Apache Drill project is inspired by Google Dremel. Dremel is a scalable interactive query system used by thousands of Google engineers every day to query their large data sets, and it is built on top of Google's GFS and BigTable. Google's BigQuery is based on Dremel and exposed as a service. There are other open source projects developed for real-time access, like Storm and S4. The real difference is that Storm and S4 are streaming engines and are not meant for ad-hoc queries, while Dremel is architected for querying very large data sets interactively, in real time.
Apache Drill is trying to repeat Dremel's success at Google within the Hadoop ecosystem. The design goal of Drill is to scale to as many as 10,000 servers and to query petabytes of data, with trillions of records, interactively within seconds. The project is backed by MapR, one of the most visible vendors in the Hadoop world.

The Apache Drill architecture is designed to interact and scale well with existing Hadoop ecosystems, taking advantage of existing technologies rather than reinventing everything as a completely different product. It has four main components:

    Nested query language : A purpose-built nested query language that parses the query and builds an execution plan. In Drill the language is called DrQL; it is a declarative language much like SQL and HQL. It also supports the Mongo Query Language as an add-on.

    Distributed execution engine : Takes care of the physical plan, the underlying columnar storage and failover. Drill uses columnar storage, like Dryad and Dremel.

    Nested data formats : This layer is built as a pluggable model so that it can work with multiple data formats. Drill can work with both free-form and schema-based formats: schema-less JSON and BSON, and schema-based Protocol Buffers, Avro, JSON and CSV.

    Scalable data sources : This layer supports various data sources and is designed with Hadoop and NoSQL in mind.

The other vendor that wants to take advantage of this space is Cloudera, also a well-known brand in the Hadoop ecosystem. Very recently, Cloudera announced a project called Impala, based on the Google Dremel architecture just like Drill. It is already in beta, and Cloudera promises a production release by the first quarter of 2013.

Unlike Hive, Impala accesses the data directly through its purpose-built query engine to provide more real-time access. Impala queries are not converted to MapReduce at runtime the way Hive queries are.

Impala allows users to query data both on HDFS and HBase and has built-in support for joins and aggregation functions. The query syntax is very similar to SQL and HQL, as it uses the same metadata as Hive. Like project Drill, Impala supports both sequence files and non-sequence files: CSV files, compressed formats like Snappy, GZIP and BZIP, and additional formats like Avro, RCFile and LZO text files. According to the Cloudera blog, Impala also intends to support Trevni, the new columnar format developed by Doug Cutting.

Cloudera is betting big on Impala, as it can co-exist with the existing Hadoop ecosystem and provides a better SQL-like interface for querying petabytes of data in real time. Users will still use Pig, Hive and MapReduce for more complex batch analysis in cases where a declarative language is not an exact fit. All the ecosystem components can co-exist and provide a rich platform for Big Data crunching and analysis. Projects like Drill and Impala can fill the void and strengthen the Hadoop ecosystem, increasing its adoptability across enterprises.


Monday, November 5, 2012

MongoDB and Cassandra make the top 10 in the DB-Engines ranking list

Adoption of the NoSQL data stores MongoDB and Cassandra is growing rapidly, and this is obvious from the DB-Engines ranking site. Recently these two NoSQL data stores entered the top 10 database list, whereas HBase and CouchDB stand at numbers 14 and 15 respectively.

The DB-Engines ranking site has been updated with many new databases added to its list. The DB-Engines ranking is itself an interesting use case of pulling data from social networks and using big data to understand trend patterns.

DB-Engines uses various popular sites to acquire and understand data: search engines, Google Trends, Stack Overflow, the Indeed job portal and LinkedIn. Most importantly, the DB-Engines rank does not depend on technical details, the underlying database architecture or transactions per millisecond, which is what a TPC rating typically measures (http://www.tpc.org/tpcc/).

This reinforces that NoSQL databases are being used in many projects and are gaining a lot of traction. However, there is still a wide gap in the popularity index between relational and NoSQL databases: MySQL's index value is 1273, while MongoDB's is 101, an order of magnitude lower. Transactional systems have had a very large installed base for many, many years compared to NoSQL data stores, and so generate far more discussions and search results; this could be one of the main reasons.

The full ranking is available at: http://db-engines.com/en/ranking

Tuesday, October 9, 2012

An alternative framework for Mahout : CRAB


Have you noticed that every time you purchase an item from online sites like Amazon or BestBuy, other items are displayed as recommendations for you? For example, when I buy a book from Amazon, it recommends a list of books purchased by other shoppers with similar interests. This is possible because the online stores process millions of data points and find the items purchased by similar users with common buying behaviour. This helps the online sites sell more, based on user preferences and online behaviour. Most of the time users do not know exactly what items they are looking for on the net; the recommendation system helps them discover similar items based on their interests.

In today’s overcrowded world with millions of items, it is very difficult to search and narrow down to what we need. In that context the online stores filter the data and present it in the most pleasant way. At times we discover items we might never have heard of.

The same is true beyond online retail. In most social sites we discover friends and people with similar interests; this is done by processing all the social interests expressed over the net and finding similarities between them. On LinkedIn you will find jobs, professionals and groups with similar interests, facilitated by the underlying recommendation system infrastructure. Building a recommendation system can become a fairly complex process as the number of variables increases, and of course the most important variable is the amount of data to be processed.

Mahout has played a very critical role in solving this problem, and it provides a comprehensive set of machine learning tools, but it is not trivial to build applications with it. This is where Crab fits the bill: the main objective of Crab is to provide a very simple way to build a recommendation engine.

Crab is a flexible, fast recommender engine for Python that integrates classic information-filtering recommendation algorithms with the world of scientific Python packages (NumPy, SciPy, Matplotlib).

The project was started in 2010 by Muricoca as an alternative to Mahout. It is developed in Python, so it is much easier for an average programmer to work with compared to Mahout, which is built in Java. It implements user-based, item-based and slope-one collaborative filtering algorithms.

Demo Example can be found at:
https://github.com/marcelcaraciolo/crab/blob/master/crab/tests/test_recommender.py
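For a flavour of the API, here is a minimal sketch of a user-based recommender with Crab, adapted from its documentation of that era; the module paths, class names and the toy ratings data are assumptions from that documentation and may have changed since.

# Minimal user-based recommender with Crab (module paths as of ~2012).
from scikits.crab.models import MatrixPreferenceDataModel
from scikits.crab.metrics import pearson_correlation
from scikits.crab.similarities import UserSimilarity
from scikits.crab.recommenders.knn import UserBasedRecommender

# Toy preference data: user -> {item: rating} (made up for illustration).
ratings = {
    'alice': {'book_a': 4.0, 'book_b': 3.5, 'book_c': 1.0},
    'bob':   {'book_a': 4.5, 'book_b': 3.0, 'book_d': 5.0},
    'carol': {'book_b': 2.5, 'book_c': 4.0, 'book_d': 4.5},
}

model = MatrixPreferenceDataModel(ratings)
similarity = UserSimilarity(model, pearson_correlation)
recommender = UserBasedRecommender(model, similarity, with_preference=True)

# Recommend items that 'alice' has not rated yet.
print recommender.recommend('alice')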






Wednesday, October 3, 2012

Hadoop simplified frameworks

There are many frameworks available for reducing the complexity involved in writing MapReduce programs for Hadoop.

In this article I discuss a few of them; most are actively developed and have many production deployments. These frameworks will increase your productivity, as they provide high-level features that wrap the low-level complexity. A few of them even allow you to write plain Java code without thinking in MapReduce.

Before using them, make sure you evaluate and understand the frameworks well so that you do not end up selecting the wrong one, and think of your long-term needs. In my experience, any framework covers the basic tenets, but as you explore and progress you will hit problems doing things the right way, and it gets even worse when the features you need are not actively supported by the framework. It is no different from selecting a framework for any other job; all the usual rules apply.

1. Cascading
2. Pangool
3. Pydoop
  4. MRJob
5. Happy
6. Dumbo

Cascading :

  • An abstraction layer that works on top of Apache Hadoop.
  • Can create and execute complex workflows.
  • Works with any JVM-based language (Java, Ruby, Clojure).
  • Workflows are primarily executed using pipes.
  • It follows the “source-pipe-sink” paradigm, where data is captured from different sources and flows through reusable ‘pipes’ that perform the data analysis.
  • Developers can write JVM-based code without really thinking in MapReduce.
  • Supported commercially by Concurrent Inc.
  • URL : http://www.cascading.org


Pangool :

  • Works with most Hadoop distributions.
  • Easier MapReduce development.
  • Support for tuples instead of just key/value pairs.
  • Efficient, easy-to-use secondary sorting.
  • Efficient, easy-to-use reduce-side joins.
  • Performance and flexibility like Hadoop without worrying about Hadoop's complexity.
  • First-class multiple inputs and outputs.
  • Built-in serialization support with Thrift and Protostuff.
  • Commercial support from Datasalt.
  • URL : http://pangool.net/

Pydoop :

  • Provides a simple Python API for MapReduce.
  • It is a CPython package, so it gives access to an extensible set of Python libraries like NumPy, SciPy etc.
  • A more interactive, high-level Hadoop API is available for executing complex jobs.
  • Provides a high-level HDFS API (see the sketch after this list).
  • Developed by http://www.crs4.it/
  • URL : pydoop.sourceforge.net
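As a flavour of the HDFS side, here is a minimal sketch using Pydoop's hdfs module; the paths are hypothetical and the exact module surface may differ between Pydoop versions.

# Minimal sketch of Pydoop's HDFS API; the paths are hypothetical.
import pydoop.hdfs as hdfs

# List a directory on HDFS.
print hdfs.ls("/user/hduser")

# Read a file straight from HDFS.
f = hdfs.open("/user/hduser/scores/scores.dat")
print f.read()
f.close()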

MRJob :

  • Simplified MapReduce scripts.
  • Developed by Yelp and actively used in many production environments.
  • Built for Hadoop, written in Python.
  • Simpler to use than raw Python streaming (see the example after this list).
  • Can also run Hadoop jobs on Amazon Elastic MapReduce (EMR).
  • Can be used for running complex machine learning algorithms and log processing on a Hadoop cluster.
  • URL: https://github.com/Yelp/mrjob
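To show how little boilerplate is involved, here is a minimal word-count job in mrjob's documented style; a sketch, with the file names purely illustrative.

# Minimal mrjob word-count: run locally with `python wordcount.py input.txt`
# or on Hadoop/EMR with the -r hadoop / -r emr switches.
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the counts for each word.
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()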

Happy :

  • A simplified Hadoop framework built on Jython.
  • Map-reduce jobs in Happy are defined by sub-classing happy.HappyJob and implementing map(records, task) and reduce(key, values, task) functions.
  • Calling run() executes the job (see the sketch after this list).
  • It can also be used for complex data processing and has been used in production environments.
  • URL : http://code.google.com/p/happy/
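Based on the API described above, a Happy job might look roughly like the sketch below. The map/reduce signatures come from that description; the task.collect emit call and the inputpaths/outputpath attribute names are my assumptions from memory of the Happy docs, so treat them as approximate.

# Rough sketch of a Happy (Jython) word-count job; emit call and job
# attributes are assumed and may differ from the actual API.
import happy

class WordCount(happy.HappyJob):

    def __init__(self, inputpath, outputpath):
        happy.HappyJob.__init__(self)
        self.inputpaths = inputpath    # assumed attribute name
        self.outputpath = outputpath   # assumed attribute name

    def map(self, records, task):
        for _, value in records:
            for word in value.split():
                task.collect(word, "1")

    def reduce(self, key, values, task):
        task.collect(key, str(sum(int(v) for v in values)))

if __name__ == "__main__":
    WordCount("input.txt", "wordcount-out").run()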

Dumbo :
  • Dumbo is also built in Python.
  • A Python API for writing MapReduce jobs.
  • All the low-level plumbing is nicely wrapped and driven through Hadoop streaming (Unix pipes).
  • It has many handy built-in functions/APIs for writing high-level MapReduce scripts (see the sketch after this list).
  • URL : https://github.com/klbostee/dumbo
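As a flavour of the API, here is a minimal word-count in Dumbo's documented mapper/reducer style; a sketch rather than a tested program, and the submit command in the comment is only indicative.

# Minimal Dumbo word-count; submit with something like:
#   dumbo start wordcount.py -hadoop $HADOOP_HOME -input <in> -output <out>
def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)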


Monday, October 1, 2012

MapReduce code in Python for getting the highest score


My Data set : ( scores.dat )

sachin,1996,86
sachin,1996,75
sachin,1996,145
sachin,1996,98
sachin,1997,97
sachin,1997,65
sachin,1996,98
sachin,1996,54
sachin,1998,98
sachin,1998,53
sachin,1997,34
sachin,1997,54
sachin,1997,54
sachin,1997,23
nikhil,1996,56
nikhil,1996,54
nikhil,1997,43
nikhil,1998,89
nikhil,1996,32
nikhil,1997,54
nikhil,1998,45
nikhil,1996,32
nikhil,1996,43
akash,1996,122
akash,1996,98
akash,1997,12
akash,1998,23
akash,1996,87
akash,1997,65
akash,1998,65
akash,1996,32
akash,1996,73

mapper.py ( /user/hduser/mapper.py )


#!/usr/bin/env python

import sys

# Read "name,year,score" lines from stdin and emit "name<TAB>score".
for line in sys.stdin:
    (name, year, score) = line.strip().split(",")
    print "%s\t%s" % (name, score)

reducer.py ( /user/hduser/reducer.py )


#!/usr/bin/env python

import sys

(last_name, max_val) = (None, -sys.maxint)

# Input comes from STDIN, sorted by name, as "name<TAB>score".
for line in sys.stdin:
    # Remove leading and trailing whitespace and split into fields.
    name, val = line.strip().split("\t")

    if last_name and last_name != name:
        # A new name begins: emit the maximum for the previous name.
        print "%s\t%s" % (last_name, max_val)
        (last_name, max_val) = (name, int(val))
    else:
        # Same name (or first line): keep the running maximum.
        (last_name, max_val) = (name, max(max_val, int(val)))

if last_name:
    print "%s\t%s" % (last_name, max_val)

Test it first in a Unix shell:
cat scores.dat | python mapper.py | sort -k1,1 | python reducer.py

akash    122
nikhil     89
sachin    145

Hadoop streaming command for running the job:
${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/contrib/streaming/hadoop-*streaming*.jar -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py -input /user/hduser/scores -output /user/hduser/scores-out




Attending PyCon India 2012

PyCon India 2012:


It was indeed a great experience participating in PyCon India 2012 and presenting a talk on Big Data, Hadoop and Python.

http://in.pycon.org/2012/funnel/pyconindia2012/42-python-in-big-data-world

I  am sharing my presentation 


Please keep watching this blog. I will also share all the code presented during the talk.



Monday, September 17, 2012

What private cloud is not, according to Gartner


  • Private Cloud Is Not Virtualization
  • Private Cloud Is Not Just About Cost Reduction
  • Private Cloud Is Not Necessarily On-Premises
  • Private Cloud Is Not Only Infrastructure as a Service (IaaS)
  • Private Cloud Is Not Always Going to Be Private

Please visit this URL:

http://www.gartner.com/it/page.jsp?id=2157015&goback=%2Egde_61513_member_163225161

Tuesday, May 1, 2012

Does unifying the user interface across personal devices, such as personal systems and mobile devices, make sense?


Here is my take on this:

In today’s world there are many personal computing devices: the good old PC, notebooks, netbooks, tablets and smart phones. At a high level they may seem to solve more or less the same problem, but each has different use cases. They all complement each other, but one cannot simply replace another. Maybe these devices will converge and some of them will become obsolete in the future, but that is altogether a different argument.

As said earlier, each device is used for different purposes and scenarios. Its user interface should therefore closely fit the target objective and be intuitive for that scenario. In this context, can we build a common, unified user interface for all devices? It would improve productivity and reduce the learning curve, since the interface would be the same everywhere.

But I would vote against this, as it will hurt innovation. It would make us stop thinking about the customization each device really needs for what it was built to do. Maybe this is the fundamental reason why Apple departed from its desktop OS and built a completely new OS for mobile devices such as smart phones, and people embraced it.

On the contrary, Microsoft has taken a unified approach, building a common interface for all its devices. From Microsoft’s point of view this strategy may be the right fit, as it has a huge legacy desktop OS and a well-established community, and it definitely wants PC customers to migrate to the upcoming Windows 8. Maybe that is the pull for Windows 8 on personal systems to have the same interface as mobile devices. It also helps Microsoft build a single strong product and make it work on any device. This is not anything new for Microsoft; it tried the same thing earlier, putting the Windows desktop interface on Windows CE, and it was not well received by users.

It is not simply building the OS that matters; an OS's future depends on the active community it can create. As Microsoft is a late entrant into mobile OS (it built one earlier but failed to make a mark), it wants to capitalize on its existing user community to build applications. That may be a justification, but I personally think it is better to have a multiple-user-interface strategy for different categories of devices, as each device has its own consumer strategy.