Tuesday, November 20, 2012

Cloudera Impala Beta - version 0.2 is available now

To download Impala and learn more about it, please go through this Cloudera blog post:

http://blog.cloudera.com/blog/2012/11/cloudera-impala-beta-version-0-2-and-cloudera-manager-4-1-1-now-available/

For Impala training, go through this URL:

http://training.cloudera.com/elearning/impala/

Scraping Twitter with ScraperWiki


While searching for a good scraping tool in Python, I came across many scrapers written in Python. I finally tried ScraperWiki, and it was quite interesting.

Everything can be done within the browser, and it is very simple to use. You can write Python scraper scripts in the browser, run and test the code, and see the results on the same page. Scripts can also be written in other languages such as Ruby and PHP.

It also has various other built-in scrapers, such as scraping CSV and Excel files and storing the data back to a database. Please go through https://scraperwiki.com/ to learn more about this.
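
As an example of how little code a scrape needs in this environment, here is a minimal sketch of reading a remote CSV file using only the Python standard library; the URL is hypothetical, purely for illustration:

 import csv
 import urllib2

 # Hypothetical CSV URL, for illustration only.
 CSV_URL = 'http://example.com/data.csv'

 # urllib2.urlopen returns a file-like object that csv.reader can
 # iterate over line by line.
 for row in csv.reader(urllib2.urlopen(CSV_URL)):
     print row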

I decided to write a simple scraper to fetch results from Twitter, and here is my piece of Python code. You can also take any of the publicly available scripts on the ScraperWiki site, modify them, and run them yourself.

 import scraperwiki
 import simplejson
 import urllib2

 # Get results from the Twitter search API. Change QUERY to your search term of choice.
 # Examples: 'newsnight', 'from:bbcnewsnight', 'to:bbcnewsnight'

 QUERY = 'bigdata'
 RESULTS_PER_PAGE = '100'
 LANGUAGE = 'en'
 NUM_PAGES = 5

 for page in range(1, NUM_PAGES + 1):
     base_url = 'http://search.twitter.com/search.json?q=%s&rpp=%s&lang=%s&page=%s' \
         % (urllib2.quote(QUERY), RESULTS_PER_PAGE, LANGUAGE, page)
     try:
         # scraperwiki.scrape fetches the URL; the response body is JSON.
         results_json = simplejson.loads(scraperwiki.scrape(base_url))
         for result in results_json['results']:
             data = {}
             data['id'] = result['id']
             data['text'] = result['text']
             data['from_user'] = result['from_user']
             print data['from_user'], data['text']
     except Exception:
         # Catch errors per page so one failed request does not stop the loop.
         print 'Failed to scrape %s' % base_url
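
To store the scraped tweets back to a database rather than just printing them, ScraperWiki provides a built-in SQLite datastore. A minimal sketch, assuming the scraperwiki.sqlite.save call from ScraperWiki's Python library, which would replace the print statement above:

 # Persist each tweet in ScraperWiki's SQLite datastore. Using 'id' as
 # the unique key means re-running the scraper updates existing rows
 # instead of inserting duplicates.
 scraperwiki.sqlite.save(unique_keys=['id'], data=data)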
       

Thursday, November 15, 2012

Apache Drill and Impala want to SQLize Hadoop for real-time data access


There are many ongoing efforts to enable real-time data access in the Hadoop ecosystem. It is evident that Hadoop is becoming synonymous with the de facto Big Data architecture within enterprises. The ecosystem is sprawling and growing very rapidly, as it provides a fundamental opportunity to process petabytes of unstructured data for many different companies and communities. Organizations in fields such as space, weather, genetics, carbon footprint analysis, retail, and finance are using Hadoop to solve their Big Data problems.

Hadoop is used primarily as a batch-processing engine to crunch petabytes of data, but it is not meant for real-time processing. This has prompted companies to think differently and make data in Hadoop accessible more like a relational database, with real-time queries.

There are two recent initiatives, from MapR and Cloudera, to make Hadoop real time using SQL syntax along the lines of Hive and HQL. The concept is not entirely new: analytics database vendors such as Greenplum and Aster Data already provide SQL-like interfaces for large-scale MapReduce data analytics. However, their design principles are different.

The Apache Drill project is inspired by Google Dremel, a scalable interactive query system used by thousands of Google engineers every day to query their large-scale data sets. Dremel is built on top of, and takes advantage of, Google's GFS and BigTable; Google's BigQuery is based on Dremel and exposed as a service. There are other open source projects for real-time processing, such as Storm and S4, but the real difference is that Storm and S4 are streaming engines not meant for ad-hoc queries, while Dremel is architected for querying very large data sets interactively, in real time.

Apache Drill is trying to replicate Dremel's success at Google within the Hadoop ecosystem. The design goal of Drill is to scale to as many as 10,000 servers and to query petabytes of data, spanning trillions of records, interactively within seconds. The project is backed by MapR, one of the most visible vendors in the Hadoop world.

The Apache Drill architecture is designed to interact and scale well with the existing Hadoop ecosystem, taking advantage of existing technologies rather than reinventing everything as a completely different product. It has four main components:

    Nested query language: a purpose-built nested query language that parses the query and builds an execution plan. Drill's query language is called DrQL and is a declarative language similar to SQL and HQL. The Mongo Query Language is also supported as an add-on.

    Distributed execution engine: responsible for the physical plan, the underlying columnar storage, and failover. Drill uses columnar storage, like Dryad and Dremel.

    Nested data formats: this layer follows a pluggable model so that it can work with multiple data formats. Drill can work with both free-form and schema-based data: schema-less formats such as JSON and BSON, and schema-based formats such as Protocol Buffers, Avro, and CSV (see the sketch after this list).

    Scalable data sources: this layer supports various data sources and is designed with Hadoop and NoSQL in mind.
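
To make the idea of querying nested data concrete, here is a small sketch of the kind of record Drill is designed to query in place; both the record and the DrQL-style query in the comment are hypothetical, for illustration only:

 import json

 # A hypothetical nested record of the kind Drill targets: the goal is
 # to query it directly, without first flattening it into relational tables.
 record = json.loads("""
 {
   "user": {"id": 42, "name": "alice"},
   "events": [
     {"type": "click", "ts": 1353398400},
     {"type": "view",  "ts": 1353398460}
   ]
 }
 """)

 # A DrQL-style query over such data might read like SQL over nested
 # fields (illustrative syntax only, not actual DrQL):
 #   SELECT user.name, COUNT(events) FROM logs GROUP BY user.name
 print record['user']['name'], len(record['events'])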

The other vendor looking to take advantage of this space is Cloudera, another well-known brand in the Hadoop ecosystem. Very recently, Cloudera announced a project called Impala, based on the Google Dremel architecture just like Drill. It is already in beta, and Cloudera is promising a production release by the first quarter of 2013.

Unlike Hive, Impala accesses the data directly through its purpose-built query engine to provide more real-time access. Impala queries are not converted to MapReduce jobs at runtime, as Hive queries are.

Impala allows users to query data on both HDFS and HBase, with built-in support for joins and aggregation functions. The query syntax is very similar to SQL and HQL, as Impala uses the same metadata as Hive. Like Drill, Impala supports both sequence files and non-sequence files, including CSV files and compressed formats such as Snappy, GZIP, and BZIP. It also works with additional formats such as Avro, RCFile, and LZO text files. According to the Cloudera blog, Impala also intends to support Trevni, a new columnar format developed by Doug Cutting.
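
As an illustration of this, since Impala reads the same metadata as Hive, an existing Hive table can be queried with plain SQL. Here is a minimal sketch that shells out to the impala-shell client from Python; the table, columns, and host are made up for illustration, and the flags are assumptions about the impala-shell CLI rather than confirmed behavior of the beta:

 import subprocess

 # Hypothetical aggregation query over a 'sales' table registered in
 # the shared Hive metastore; names are made up for illustration.
 query = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"

 # Assumes impala-shell is on the PATH and an impalad daemon listens on
 # the (hypothetical) host below: -i selects the daemon, -q runs a query.
 subprocess.call(['impala-shell', '-i', 'impalad-host:21000', '-q', query])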

Cloudera is betting big on Impala, as it can co-exist with the existing Hadoop ecosystem while providing a better SQL-like interface for querying petabytes of data in real time. Users will still use Pig, Hive, and MapReduce for more complex batch analysis in cases where a declarative language is not an exact fit. All the ecosystem components can co-exist and together provide a rich platform for Big Data crunching and analysis. Projects like Drill and Impala can fill the real-time void, strengthening the Hadoop ecosystem and increasing its adoption across enterprises.


Monday, November 5, 2012

MongoDB and Cassandra make the top 10 in the DB-Engines ranking list

Adoption of the NoSQL data stores MongoDB and Cassandra is growing rapidly, as is evident from the DB-Engines ranking site. Recently these two NoSQL data stores entered the top-10 database list, whereas HBase and CouchDB stand at numbers 14 and 15 respectively.

The DB-Engines ranking site has been updated, and many new databases have been added to its list. The DB-Engines ranking is itself an interesting use case of collecting data from social networks and using big data to understand trend patterns.

DB-Engines uses various popular sites to acquire and understand data, collecting from search engines, Google Trends, Stack Overflow, the Indeed job portal, and LinkedIn. Most importantly, the DB-Engines rank does not depend on technical details such as the underlying database architecture or transactions per millisecond, which are typically measured by TPC benchmarks (http://www.tpc.org/tpcc/).

This reinforces the fact that NoSQL databases are being used in many projects and gaining a lot of traction. However, there is still a wide gap in the popularity index values between relational databases and NoSQL databases: MySQL has an index value of 1273 while MongoDB's is 101, roughly an order of magnitude apart. Since the installed base of transactional systems has been very large for many, many years compared to NoSQL data stores, they naturally generate more discussions and more search results. This could be one of the main reasons.

The full ranking is available here: http://db-engines.com/en/ranking