Thursday, January 10, 2013

Introduction to Cloud Computing

This video is a very interesting and nicely explained introduction to cloud computing. I thought I would add it to my blog, as it provides a basic understanding of the various components and features on which cloud computing is architected.

Wednesday, January 2, 2013

BigData Analysis with Project Spark and Shark

  • Developed by AMPLab, UC Berkeley.
  • Developers: Michael Franklin and Matei Zaharia.
  • An alternative to the MapReduce parallel processing engine.
  • In-memory storage enables very fast iterative queries by removing the temporary writes of intermediate data that MapReduce jobs perform.
  • In Hadoop, data is written to local disk after each map and shuffle, which increases the execution time of the steps that follow. Spark removes this bottleneck by keeping the results available in memory.
  • Spark writes data to RDDs (Resilient Distributed Datasets), which can live in memory, and this is where Spark gets its execution improvements.
  • Reported to be up to 100x faster than Hadoop.
  • Compatible with the existing Hadoop ecosystem and works well with existing HDFS systems.
  • Spark can co-exist with an existing Hadoop cluster using the Mesos cluster manager.
  • It is better suited to iterative algorithms like Logistic Regression and Matrix Factorization than to plain data processing algorithms.
  • Developed in Scala and provides clean APIs in Java and Scala. Python APIs will be added soon.
  • Shark is meant as a Hive replacement with a high degree of speed improvement.
  • Built on top of the Spark data-parallel execution engine.
  • Uses a SQL-like declarative language and works on the Spark infrastructure.
  • Can execute complex queries using JOINs and GROUP BY.
  • Uses a column-oriented store to improve performance. Columnar compression also reduces the storage required.
  • All the queries run in memory to improve performance.
  • Shark provides decent integration with machine learning using RDDs (Resilient Distributed Datasets). Users can call these functions using SQL-like syntax, which minimizes the complexity involved in using machine learning.
  • The entire software stack (Shark + Spark + ecosystem) is called BDAS (Berkeley Data Analytics Stack).
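
The benefit of keeping a dataset in memory for iterative algorithms can be illustrated with a small plain-Python sketch. This is not Spark code. `CountingSource` is a hypothetical stand-in for a disk-backed dataset, and the rows, step size and iteration count are made up for illustration. The sketch runs the same logistic regression gradient descent twice, once re-reading the source on every iteration and once reading it a single time and reusing the in-memory copy.

```python
import math


class CountingSource:
    """Hypothetical stand-in for a disk-backed dataset; counts full scans."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.scans = 0

    def read(self):
        self.scans += 1
        return list(self._rows)


def gradient(points, weights):
    """Logistic regression gradient summed over all (features, label) points."""
    grad = [0.0] * len(weights)
    for features, label in points:
        z = sum(w * x for w, x in zip(weights, features))
        predicted = 1.0 / (1.0 + math.exp(-z))
        for i, x in enumerate(features):
            grad[i] += (predicted - label) * x
    return grad


def train(source, iterations, cache, step=0.1):
    """Run gradient descent, optionally keeping the dataset in memory."""
    weights = [0.0, 0.0]
    cached = source.read() if cache else None
    for _ in range(iterations):
        # Without caching, every iteration pays for a fresh scan of the source.
        points = cached if cache else source.read()
        grad = gradient(points, weights)
        weights = [w - step * g for w, g in zip(weights, grad)]
    return weights


# Made-up training rows: ((bias, feature), label)
ROWS = [((1.0, 0.5), 1), ((1.0, -1.5), 0), ((1.0, 2.0), 1), ((1.0, -0.3), 0)]

cached_source = CountingSource(ROWS)
uncached_source = CountingSource(ROWS)
w_cached = train(cached_source, iterations=20, cache=True)
w_uncached = train(uncached_source, iterations=20, cache=False)
print(cached_source.scans, uncached_source.scans)  # prints: 1 20
```

Both runs produce the same weights. The only difference is how many times the source is scanned, which is the cost that an in-memory dataset avoids across iterations.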

Saturday, December 22, 2012

MortarData : Hadoop PaaS

Hadoop as an ecosystem has evolved and been adopted by many enterprises to solve their Big Data needs. However, with the current set of development tools, getting Hadoop running and getting what the user wants out of it is not a trivial task. On one hand, many start-ups are making Hadoop more suitable for real-time query processing, while others are making the entire ecosystem simpler to use. Hadoop is not a platform only for querying data. It also helps solve a diverse set of use cases, from log processing to genome analysis. The Hadoop ecosystem is fairly complex and is maturing to handle a wide variety of problems. So beyond real-time queries, Hadoop can be applied to many different Big Data needs, and all of them need a fairly simple development environment to get started. MortarData is one such start-up, trying to make Hadoop development many times easier.

MortarData CEO K Young and his team have been working on this technology for a while, and their simple USP is “Getting ready with Hadoop in one hour”. Mortar has launched its Hadoop platform as a service on Amazon. Amazon also has Amazon Elastic MapReduce, which is a more general platform for Hadoop compared to what Mortar is trying to achieve. Mortar, on the other hand, has built a Hadoop infrastructure that can be run using simple Python or Pig scripts. Mortar also lets users share public datasets and analysis code so that everyone can get started easily. Anyone interested in sharing a public dataset and code for analyzing large-scale datasets can do so using GitHub. Besides HDFS, it also supports other data stores such as Amazon S3 and MongoDB. Data can be loaded from these external stores into HDFS to run MapReduce as and when required. The platform allows users to install Python-based analytical tools like NumPy, SciPy and NLTK. According to Young, more tools will be added to the platform over time.

I think more and more people will use these kinds of platforms, as they remove the whole process of installing Hadoop and managing a Hadoop cluster, which is complex in itself. However, a simple development environment is not a big differentiator. These companies need to focus on auto-scaling and on other ways to minimize the cost of running Hadoop clusters based on past workloads. Other areas could be simpler diagnostic and management tools that make debugging fairly simple, and pre-configured ecosystem libraries in place of manual installation. These are a couple of the core areas where I think most of the work will be done in the future.

Thursday, December 20, 2012

Platfora and Qubole are taking Hadoop to the next level with real-time interactive queries

As I have discussed earlier, one of the important disruptions currently happening in the Hadoop ecosystem is making Hadoop more real time. Cloudera has already developed Impala, which lets users run real-time queries. MapR, with the Apache Drill project, provides a declarative SQL-like language to query the underlying Hadoop data in real time. Google already has its proprietary analytics engine and provides a service called BigQuery to query petabytes of data. Facebook, on the other hand, is putting a lot of effort into increasing the performance of Hive and its ecosystem.

There are other start-ups, like Platfora and Qubole, taking a different route to solving this big puzzle while still using Hadoop. According to Ben Werther, founder of Platfora, businesses cannot wait days and weeks to get answers. It has to be instantaneous. The true innovation lies in the data agility and exploration made possible by Hadoop.

Hadoop is a great platform for solving Big Data problems. It has one of the biggest ecosystems in Big Data. But the real issue with Hadoop is that it is not meant for everyone. The platform is still restricted to a handful of researchers and calls for a substantial amount of learning. This is the gap that most of these companies want to fill in many different ways.

Platfora recently received $20 million in Series B funding and has been working for more than a year to make Hadoop available to business users. Other first-generation Hadoop start-ups have built technologies around Hadoop, but most of them run queries in a batch-oriented fashion. This is where Platfora and Qubole differ in their implementation. Platfora has built an analytics platform from the ground up. Its key components are Vizboards for analytics, a scale-out in-memory data processing engine and a Hadoop data refinery. Platfora wants to change the way traditional BI works.

One more start-up, Qubole, is heading along the same lines. They have released the Qubole data platform as a service, which runs on Amazon EC2 and Hadoop. Qubole founders Ashish Thusoo and Joydeep Sen Sarma came from the Facebook data infrastructure team, where they used and implemented the Hive platform, so they know the bottlenecks of Hadoop and Hive. One of the main design goals of Qubole is to make Hadoop more real time and a simple platform for business users to run their queries.

To make queries run interactively, they use HQL and Hive, but the entire platform is designed to run queries in real time. The platform scales seamlessly up and down. Users need not provision the number of machines needed up front. In a typical scenario it is unlikely that a user knows how many machines a query requires. It is the underlying platform’s job to allocate and free the machines in the cluster on demand, based on the workload and past statistics.
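
To make the idea of sizing a cluster from the workload and past statistics concrete, here is a minimal sketch of one possible policy. It is not Qubole's algorithm. The function names, the throughput history and the node limits are all assumptions made up for illustration. The policy estimates how many tasks one node handles per interval from past runs, then picks a node count for the current queue, clamped between a floor and a ceiling.

```python
import math


def estimate_tasks_per_node(history):
    """Average number of tasks one node completed per interval in past runs."""
    return sum(history) / len(history)


def target_nodes(pending_tasks, tasks_per_node, min_nodes=1, max_nodes=50):
    """Hypothetical policy: enough nodes for the queue, within fixed limits."""
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


# Made-up per-node task counts observed in earlier intervals.
past_throughput = [8, 12, 10]
rate = estimate_tasks_per_node(past_throughput)

busy = target_nodes(95, rate)      # scale up for a long queue
idle = target_nodes(0, rate)       # shrink to the floor when nothing is pending
burst = target_nodes(10000, rate)  # never exceed the configured ceiling
print(busy, idle, burst)
```

A real platform would also have to account for machine start-up time and for work already running on nodes it wants to release, which this sketch ignores.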

It is good to see non-Hadoop in-memory databases like SAP HANA, and the recently announced Amazon Redshift, also building their technologies to solve big data more like a traditional transactional database.

We need to see how far Hadoop can leapfrog in solving this. As of now there is a lot of innovation going on to make Hadoop more real time and simpler for many different use cases. Recently, Microsoft has also added support for Hadoop on its Azure platform. This implementation is pure vanilla Hadoop. Will we see Microsoft bring something innovative to the Hadoop table, or will it simply follow the crowd?

Monday, December 10, 2012

Trevni : A columnar file format for Cloudera Impala

Trevni is a columnar file format developed by Doug Cutting. It will be the core storage engine of the Cloudera Impala project, which delivers real-time queries on the Hadoop file system.

Trevni features :
  1. Inspired by the CIF/COF column-oriented storage architecture. The CIF-based columnar architecture works well with MapReduce.
  2. Stores data by column, which provides good compression because all values in a single column are of the same kind. Retrieval is fast because accessing data within one column requires minimal scanning compared to a row store.
  3. To achieve scalable, distributed query evaluation, data sets are partitioned into row groups, each containing a distinct collection of rows. Each row group then stores its data vertically, like a column store. See Figure 1 below.
  4. Maximizes the size of row groups in order to reduce disk I/O seek latency. Each row group can be larger than 100 MB, which allows sequential reads and reduces disk I/O.
  5. Each row group is written as a separate file. All values of a column are written contiguously to optimize I/O performance.
  6. Fewer row groups mean fewer HDFS files, which reduces the load on the name node. So it is better to have few files, and hence few row groups, per data set.
  7. Allows dynamic access to data within a row group. It also supports co-location of columns within a row group, as per the CIF storage architecture.
  8. It also supports nested column structures for semi-structured data, in the form of arrays and dictionaries.
  9. Application-specific data is maintained at every level: file, column and block. Checksums are used at the block level to provide data integrity.
  10. Supports many data types, including int, long, float, double and string, plus a bytes type for complex aggregated data. It also supports NULLs, and a NULL occupies zero bytes, which saves disk space and is one of the key differences between column storage and row storage.

Figure - 1 : Illustrates the row group concept in columnar store.
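
The row group idea can also be sketched in a few lines of plain Python. This is only an in-memory illustration under assumed names, not the Trevni file format or its API. The column names, sample rows and `group_size` are made up. Rows are split into row groups, each group stores its values column by column, and a query that needs one column scans only that column in every group.

```python
def to_row_groups(rows, columns, group_size):
    """Split rows into row groups and store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Within a group, all values of one column sit together in one list.
        groups.append({name: [row[i] for row in chunk]
                       for i, name in enumerate(columns)})
    return groups


def read_column(groups, name):
    """Scan one column across all row groups without touching the others."""
    values = []
    for group in groups:
        values.extend(group[name])
    return values


# Made-up sample data for illustration.
COLUMNS = ("id", "country", "amount")
ROWS = [
    (1, "US", 10.0),
    (2, "US", 12.5),
    (3, "IN", 7.0),
    (4, "IN", 3.5),
    (5, "DE", 9.0),
]

groups = to_row_groups(ROWS, COLUMNS, group_size=2)
print(len(groups))                         # 3 row groups for 5 rows
print(groups[0]["country"])                # ['US', 'US']
print(sum(read_column(groups, "amount")))  # 42.0
```

A larger `group_size` gives fewer groups, which in the file-based setting described above corresponds to fewer and larger files.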