Showing posts with label Python.

Saturday, December 22, 2012

MortarData: Hadoop PaaS


Hadoop as an ecosystem has evolved and been adopted by many enterprises to solve their Big Data needs. However, with the current set of development tools, getting Hadoop running and producing what the user wants is not a trivial task. On one hand, many start-ups are making Hadoop more suitable for real-time query processing, while others are making the entire ecosystem simpler to use. Hadoop is not a platform only for querying data; it also helps solve a diverse set of use cases, from log processing to genome analysis. The Hadoop ecosystem is fairly complex and is maturing to handle a wide variety of problems. So beyond real-time queries, Hadoop can be applied to many different Big Data needs, and all of them require a fairly simple development environment to get started. MortarData is one such start-up trying to simplify Hadoop development considerably.

MortarData CEO K Yung and his team have been working on this technology for a while, and their simple USP is "Getting ready with Hadoop in one hour." Mortar launched its Hadoop platform as a service on Amazon. Amazon also offers Elastic MapReduce, which is a more general Hadoop platform compared to what Mortar is trying to achieve. Mortar, on the other hand, has built Hadoop infrastructure that can be driven with simple Python or Pig scripts. Mortar also lets anyone share public datasets and analysis code so that others can get started easily; anyone interested in sharing public datasets and code for large-scale analysis can do so through GitHub. Besides HDFS, it supports other storage back ends such as Amazon S3 and MongoDB, and data can be pulled from these external stores into HDFS to run MapReduce jobs as and when required. The platform allows users to install Python-based analytical tools like NumPy, SciPy and NLTK, and according to Yung more tools will be added to the platform over time.
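To give a flavour of what "simple Python or Pig scripts" means in practice, here is a hypothetical sketch of the kind of Python UDF logic such a pipeline might call. The function and field names are made up, and the Pig registration step (shown only in a comment) varies by Pig/Mortar version, so treat it as an illustration rather than Mortar's documented API.

 # Hypothetical example of Python logic that a Pig-based pipeline could call
 # as a user-defined function (UDF). The function itself is plain Python; how
 # it gets registered with Pig depends on the Pig/Mortar version, so that part
 # is only sketched in the comment at the bottom.

 def clean_text(raw):
     """Normalise a raw log field: lower-case it and strip whitespace."""
     if raw is None:
         return None
     return raw.strip().lower()

 if __name__ == '__main__':
     # Quick local check of the UDF logic before shipping it to a cluster.
     print(clean_text('  ERROR: Disk Full  '))  # -> 'error: disk full'

 # In the Pig script the function would be registered and applied to a
 # relation, roughly along the lines of:
 #   REGISTER 'clean_text.py' USING <python engine> AS text_udfs;
 #   cleaned = FOREACH raw_logs GENERATE text_udfs.clean_text(message);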

I think more and more people will use these kinds of platforms, as they remove the whole Hadoop installation process and the management of a Hadoop cluster, which is itself complex. However, a simple development environment is not a big differentiator by itself; these companies need to focus on auto-scaling and other ways to minimize the cost of running Hadoop clusters based on past workloads. Other areas could be simpler diagnostic and management tools that make debugging fairly straightforward, and pre-configuring important ecosystem libraries instead of requiring manual installation. These are a couple of core areas where I think most of the work will be done in the future.

Tuesday, November 20, 2012

Scraping Twitter with ScraperWiki


While I was searching for a good scraper in Python, I came across many scrapers written in Python. Finally I tried ScraperWiki, and it was quite interesting.

Everything can be done within the browser and it is very simple to use. We can write Python scraper scripts in the browser, run and test the code, and see the results on the same page. Scripts can also be written in other languages like Ruby and PHP.

It also has various other built-in scrapers, for example for scraping CSV and Excel files, and it can store data back to a database. Please go through https://scraperwiki.com/ to learn more.
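For example, saving a record back to the built-in datastore is a one-liner with the library's SQLite helper; a minimal sketch (the record fields here are purely illustrative) looks like this:

 import scraperwiki

 # Illustrative record only: save one row into ScraperWiki's built-in SQLite
 # datastore, keyed on 'id' so re-running the scraper overwrites duplicates.
 record = {'id': 1, 'text': 'example row', 'from_user': 'someuser'}
 scraperwiki.sqlite.save(unique_keys=['id'], data=record)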

I thought of writing a simple scraper to get results from Twitter, and here is my piece of Python code. You can take one of the publicly available scripts on the ScraperWiki site, modify it, and run it yourself.

 import scraperwiki
 import simplejson
 import urllib2

 # Get results from the Twitter search API. Change QUERY to your search term of choice.
 # Examples: 'newsnight', 'from:bbcnewsnight', 'to:bbcnewsnight'

 QUERY = 'bigdata'
 RESULTS_PER_PAGE = '100'
 LANGUAGE = 'en'
 NUM_PAGES = 5

 for page in range(1, NUM_PAGES + 1):
     # Build the search URL for this page of results.
     base_url = 'http://search.twitter.com/search.json?q=%s&rpp=%s&lang=%s&page=%s' \
         % (urllib2.quote(QUERY), RESULTS_PER_PAGE, LANGUAGE, page)
     try:
         # Fetch the page and parse the JSON response.
         results_json = simplejson.loads(scraperwiki.scrape(base_url))
         for result in results_json['results']:
             data = {}
             data['id'] = result['id']
             data['text'] = result['text']
             data['from_user'] = result['from_user']
             print data['from_user'], data['text']
     except Exception:
         print 'Failed to scrape %s' % base_url

Tuesday, October 9, 2012

An alternative framework for Mahout: CRAB


Every time you purchase an item from an online site like Amazon or BestBuy, you may have noticed other items displayed as recommendations for you. For example, when I buy a book from Amazon, it recommends a list of books purchased by other shoppers with similar interests. This is possible because online stores process millions of records and find the items purchased by similar users with common buying behaviour. It helps the sites sell more to their users based on user preferences and online behaviour. Most of the time users do not know exactly what items they are looking for; the recommendation system helps them discover similar items based on their interests.

In today's overcrowded world with millions of items, it is very difficult to search and narrow down what we need. In that context, online stores filter the data and present it in a pleasant way. At times we discover items we might never have heard of.

The same is true not only for online retailers. On most social sites we discover friends and people with similar interests, which is done by processing all the social interests expressed over the net and finding similarities between them. On LinkedIn you will find jobs, professionals and groups with similar interests, facilitated by the underlying recommendation system infrastructure. Building a recommendation system can become fairly complex as the number of variables increases, and the most important variable is the amount of data to be processed.
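To make the similarity idea concrete, here is a tiny, self-contained sketch (not taken from any of the frameworks discussed) of user-based collaborative filtering on a made-up ratings matrix using NumPy:

 import numpy as np

 # Toy user x item ratings matrix (rows are users, columns are items; 0 means
 # the user has not rated that item). The numbers are made up for illustration.
 ratings = np.array([
     [5, 4, 0, 1],
     [4, 5, 1, 0],
     [1, 0, 5, 4],
 ], dtype=float)

 def cosine_similarity(a, b):
     """Cosine similarity between two users' rating vectors."""
     denom = np.linalg.norm(a) * np.linalg.norm(b)
     return float(a.dot(b) / denom) if denom else 0.0

 # Find the user most similar to user 0, then suggest an item that neighbour
 # rated which user 0 has not rated yet.
 target = 0
 others = [u for u in range(len(ratings)) if u != target]
 sims = [cosine_similarity(ratings[target], ratings[u]) for u in others]
 neighbour = others[int(np.argmax(sims))]
 unrated = np.where(ratings[target] == 0)[0]
 best = unrated[int(np.argmax(ratings[neighbour][unrated]))]
 print('Suggest item %d for user %d (most similar user: %d)' % (best, target, neighbour))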

Mahout has played a very critical role in solving this problem, and it provides a comprehensive set of tools for machine learning, but it is not trivial to build applications with it. This is where Crab fits the bill: its main objective is to provide a very simple way to build a recommendation engine.

Crab is a flexible, fast recommender engine for Python that integrates classic information-filtering recommendation algorithms with the world of scientific Python packages (NumPy, SciPy, Matplotlib).

The project was started in 2010 by Muricoca Incorporated as an alternative to Mahout. It is developed in Python, so it is much easier for an average programmer to work with than Mahout, which is built in Java. It implements user-based, item-based and slope-one collaborative filtering algorithms.
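For a flavour of how Crab is used, here is a short sketch following the pattern in the project's tutorial; the sample movie dataset and module paths assume the scikits.crab package layout of that era and may differ in other versions.

 # Sketch of a user-based recommender with Crab, following the pattern shown
 # in the project's tutorial; module paths assume the scikits.crab package.
 from scikits.crab import datasets
 from scikits.crab.models import MatrixPreferenceDataModel
 from scikits.crab.metrics import pearson_correlation
 from scikits.crab.similarities import UserSimilarity
 from scikits.crab.recommenders.knn import UserBasedRecommender

 movies = datasets.load_sample_movies()           # small bundled sample dataset
 model = MatrixPreferenceDataModel(movies.data)   # {user: {item: rating}} data
 similarity = UserSimilarity(model, pearson_correlation)
 recommender = UserBasedRecommender(model, similarity, with_preference=True)

 print(recommender.recommend(5))                  # recommendations for user id 5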

Demo Example can be found at:
https://github.com/marcelcaraciolo/crab/blob/master/crab/tests/test_recommender.py