Thursday, January 10, 2013

Introduction to Cloud Computing

This video is a very interesting and nicely explained introduction to cloud computing, so I thought I would add it to my blog. It may provide a basic understanding of the various components and features on which cloud computing is architected.


Wednesday, January 2, 2013

Big Data Analysis with Project Spark and Shark


SPARK :
  • Developed by AMPLab, UC Berkeley.
  • Developers: Michael Franklin and Matei Zaharia.
  • An alternative to the MapReduce parallel processing engine.
  • In-memory storage enables very fast iterative queries by avoiding the temporary writes of intermediate data that MapReduce jobs perform.
  • In Hadoop, the data is written to local disk after each map and shuffle, which increases the execution time of later stages. SPARK removes this bottleneck by keeping the results available in memory.
  • Spark stores data in RDDs (Resilient Distributed Datasets), which can live in memory; this is where Spark's execution improvements come from.
  • Claimed to be up to 100x faster than Hadoop MapReduce, mainly for workloads that fit in memory.
  • Compatible with the existing Hadoop ecosystem and works well with existing HDFS storage.
  • Spark can coexist with an existing Hadoop cluster by using the Mesos cluster manager.
  • It is better suited to iterative algorithms such as Logistic Regression and Matrix Factorization than to plain data processing algorithms.
  • Written in Scala, with clean APIs in Java and Scala. Python APIs will be added soon.
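
To make the in-memory point above concrete, here is a minimal sketch in plain Python. It is not the Spark API: DiskDataset, CachedDataset and train_logistic are hypothetical names invented for this illustration, and the six data points are made up. It only shows why an iterative algorithm such as logistic regression benefits when the input is read once and then served from memory, instead of being re-read from disk on every pass.

```python
import json
import math
import os
import tempfile


def write_points(path, points):
    """Persist (x, label) pairs to disk, one JSON record per line."""
    with open(path, "w") as f:
        for x, y in points:
            f.write(json.dumps([x, y]) + "\n")


class DiskDataset:
    """Re-reads the file on every pass, like a job that reloads its input each iteration."""

    def __init__(self, path):
        self.path = path
        self.reads = 0  # number of full passes over the file on disk

    def scan(self):
        self.reads += 1
        with open(self.path) as f:
            return [tuple(json.loads(line)) for line in f]


class CachedDataset:
    """Reads the file once, then serves every later pass from memory."""

    def __init__(self, path):
        self.source = DiskDataset(path)
        self._cache = None

    @property
    def reads(self):
        return self.source.reads

    def scan(self):
        if self._cache is None:
            self._cache = self.source.scan()
        return self._cache


def train_logistic(dataset, iterations=20, lr=0.5):
    """1-D logistic regression by batch gradient descent: one full scan per iteration."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        points = dataset.scan()
        grad_w = grad_b = 0.0
        for x, y in points:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / len(points)
        b -= lr * grad_b / len(points)
    return w, b


def demo():
    # Made-up toy data: negative x has label 0, positive x has label 1.
    points = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "points.jsonl")
        write_points(path, points)
        disk, cached = DiskDataset(path), CachedDataset(path)
        model_disk = train_logistic(disk)
        model_cached = train_logistic(cached)
    return disk.reads, cached.reads, model_disk, model_cached
```

Calling demo() trains the same model twice. The disk-backed version scans the file once per iteration (20 times here), while the cached version touches the disk only once and still ends up with identical weights. In a real cluster the saving is much larger, because each avoided pass is a distributed read or write rather than a small local file.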
SHARK :
  • Meant as a replacement for Hive, with a high degree of speed improvement.
  • Built on top of the SPARK data-parallel execution engine.
  • Uses a SQL-like declarative language and works on the SPARK infrastructure.
  • Can execute complex queries using JOINs and GROUP BY.
  • Uses a column-oriented store to improve performance. Columnar compression further reduces the storage needed.
  • All queries run in memory to improve performance.
  • Shark provides decent integration with machine learning through RDDs (Resilient Distributed Datasets). Users can call these functions using SQL-like syntax, which minimizes the complexity involved in using machine learning.
  • The entire software stack (SHARK + SPARK + ecosystem) is called BDAS (Berkeley Data Analytics Stack).
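
The column-oriented store and GROUP BY points above can be sketched in a few lines of plain Python. This is only an illustration of the general idea, not Shark's actual storage format or query engine: the table, the column names (country, channel, hits) and the helper functions are all hypothetical, and run-length encoding is just one simple example of a columnar compression scheme.

```python
from collections import defaultdict
from itertools import groupby


def to_columns(rows, names):
    """Pivot row tuples into one list per column (column-oriented layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def rle_encode(values):
    """Run-length encode a column as [(value, run_length), ...]."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand run-length pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


def group_by_sum(columns, key, value):
    """In spirit: SELECT key, SUM(value) FROM t GROUP BY key.

    Only the two referenced columns are read; the others are never touched.
    """
    totals = defaultdict(int)
    for k, v in zip(columns[key], columns[value]):
        totals[k] += v
    return dict(totals)


# Made-up sample table.
rows = [
    ("US", "web", 10),
    ("US", "web", 5),
    ("US", "mobile", 7),
    ("IN", "mobile", 3),
    ("IN", "mobile", 4),
]
columns = to_columns(rows, ["country", "channel", "hits"])
encoded_country = rle_encode(columns["country"])
hits_by_country = group_by_sum(columns, "country", "hits")
```

Here the five-value country column compresses to two runs, (US, 3) and (IN, 2), because equal values sit next to each other within a column. The aggregate reads only the country and hits columns and yields 22 for US and 7 for IN. A row-oriented layout would have to read every field of every row to answer the same query.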