Sunday, December 20, 2009

The Big Data Puzzle

The new generation of database vendors claims support for petabytes of data, and the list keeps growing day by day. It’s quite interesting to watch the trend from EDW (Enterprise Data Warehouse) to EDA (Enterprise Data Appliance) to EDC (Enterprise Data Cloud), the latest bandwagon started by the database vendor Greenplum. Today most Big Data vendors are singing the same tune, and at the core of it the following technologies are responsible.

  1. Column oriented architecture
  2. Cloud support & Integration
  3. Map Reduce to process the data
  4. Massively Parallel Processing
  5. Data grid with Commoditized H/W

Companies like Vertica, ParAccel, Greenplum, Infobright and Sybase have built databases around these key technologies.

Column Oriented Architecture:

A column-oriented database stores its content by column rather than in the traditional row format used by OLTP database systems. This architecture has a definite advantage when processing massive amounts of data compared to a row-based architecture, and it is most suitable for read-heavy operations and analytical queries. In a typical data warehouse a table may have more than 100 columns. When a user queries only a few of them, there is no need to process entire rows; the necessary columns can be retrieved much faster than from row-based storage.
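
To make the difference concrete, here is a minimal Java sketch with a made-up three-row table (not any vendor's storage engine), showing why a query that touches only two of a table's columns reads far less data in a columnar layout:

  public class ColumnStoreSketch {
      public static void main(String[] args) {
          // Row-oriented layout: every record carries all of its columns together,
          // so even a two-column query has to walk over complete rows.
          String[][] rows = {
              {"1", "EMEA", "120.0", "note a"},
              {"2", "APAC", "75.5",  "note b"},
              {"3", "EMEA", "310.0", "note c"}
          };
          double rowTotal = 0;
          for (String[] row : rows) {
              if (row[1].equals("EMEA")) rowTotal += Double.parseDouble(row[2]);
          }

          // Column-oriented layout: one array per column; the same query touches
          // only the two columns it needs and ignores the rest of the table.
          String[] region = {"EMEA", "APAC", "EMEA"};
          double[] amount = {120.0, 75.5, 310.0};
          double colTotal = 0;
          for (int i = 0; i < region.length; i++) {
              if (region[i].equals("EMEA")) colTotal += amount[i];
          }

          System.out.println(rowTotal + " == " + colTotal); // both print 430.0
      }
  }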

Cloud support & Integration:

Almost all vendors provide a way to integrate their databases with the cloud. Cloud architecture inherently provides the necessary tenets for solving the petabyte-scale data problem, and at the same time it offers better economies of scale. Typically, cloud computing provides the following key features that matter to big data vendors.

  1. On-demand scale-out and provisioning keep capital expenditure low, which attracts companies of all sizes.
  2. The whole analytical process can be configured in an external cloud rather than in house, which reduces the upfront cost of hardware installation.

Map Reduce:

The MapReduce framework provides yet another dimension for solving this equation. MapReduce processes huge datasets by chopping them into small chunks, processing the chunks in a distributed environment, and aggregating the results back. The technique was first tried and implemented at scale by Google. Many architectures have been built on this framework; one notable variant is Hadoop, which processes massive amounts of data as key-value pairs. Greenplum and Aster Data Systems have extended MapReduce to SQL. Some companies feel this helps developers, as they do not need to learn a new language while working with these technologies.
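
As a rough illustration of the idea, here is a minimal, single-process word-count sketch in Java (an assumed example, not the Hadoop or Greenplum API): the map phase emits (word, 1) pairs per document, the shuffle phase groups the pairs by key, and the reduce phase sums each group.

  import java.util.*;

  public class MapReduceSketch {
      public static void main(String[] args) {
          List<String> documents = List.of("big data big cloud", "cloud data");

          // Map: each document is processed independently into (word, 1) pairs.
          List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
          for (String doc : documents) {
              for (String word : doc.split("\\s+")) {
                  pairs.add(Map.entry(word, 1));
              }
          }

          // Shuffle: collect all values emitted for the same key.
          Map<String, List<Integer>> grouped = new HashMap<>();
          for (Map.Entry<String, Integer> pair : pairs) {
              grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
          }

          // Reduce: aggregate each key's values into a single count.
          Map<String, Integer> counts = new HashMap<>();
          grouped.forEach((word, ones) -> counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));

          System.out.println(counts); // {big=2, data=2, cloud=2} (order may vary)
      }
  }

In a real cluster the map tasks run close to the data chunks they process, and the shuffle moves the grouped pairs across the network to the reduce tasks.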

Massively Parallel Processing:

Massively parallel processing with a shared-nothing architecture is one of the key technologies behind today’s distributed computing, and it is what makes the so-called on-demand scale-out possible in cloud infrastructure. Hence it is one of the key ingredients in the Big Data equation.
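
Here is a tiny Java sketch of the shared-nothing principle (an assumed illustration, not any vendor's engine): the data is partitioned across workers, each worker aggregates only its own partition, and a coordinator merges the partial results; no partition is shared between workers.

  import java.util.*;
  import java.util.concurrent.*;

  public class SharedNothingSketch {
      public static void main(String[] args) throws Exception {
          int partitions = 4;
          List<List<Integer>> shards = new ArrayList<>();
          for (int i = 0; i < partitions; i++) shards.add(new ArrayList<>());
          for (int value = 1; value <= 1000; value++) {
              shards.get(value % partitions).add(value);   // partition by hashing the key
          }

          // Each worker owns exactly one shard and never reads another worker's data.
          ExecutorService pool = Executors.newFixedThreadPool(partitions);
          List<Future<Long>> partials = new ArrayList<>();
          for (List<Integer> shard : shards) {
              partials.add(pool.submit(() -> shard.stream().mapToLong(Integer::longValue).sum()));
          }

          // The coordinator only merges the small partial results.
          long total = 0;
          for (Future<Long> partial : partials) total += partial.get();
          pool.shutdown();
          System.out.println("total = " + total);          // 500500
      }
  }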

Data grid with Commoditized H/W:

Last but not least in the stack is the data grid built on commodity hardware. The technological rise described above is possible because of the availability of today’s low-cost commodity hardware. This commoditization has led to the proliferation of large, low-cost grid architectures. Instead of spending money on high-end proprietary hardware, cloud computing provides the opportunity to use low-cost hardware and still reap the same advantages.

Thursday, December 3, 2009

Scale UP versus Scale Out

Scale Up: This is also called vertical scaling. With this approach you add more hardware, such as RAM, processors and network interface cards, to an existing server. This provides more memory and bandwidth for computation, so your application runs faster. Having said that, if we keep adding new hardware to existing servers, at some point they stop scaling or even perform worse, so scale-up always has limitations. While scaling up, the OS and application software also have to support the additional hardware in order to deliver better performance.

Pros:

  1. It does not introduce additional maintenance or support costs.
  2. Current blade servers provide a plug-in architecture for adding new hardware, so adding it is a fairly simple process.
  3. Many software vendors support scale-up, so getting support is easier.
  4. It is claimed that, with all other variables held constant, scale-up provides better performance and lower maintenance costs than scale-out.

Cons:

  1. We cannot simply keep adding new hardware to existing servers; how far we can go depends heavily on the server architecture and on the software as well.
  2. Technology compatibility: since technology changes rapidly, we cannot be sure how well new hardware will remain compatible with the old.
  3. The underlying software also has to be designed for hardware failover.

Scale Out: Scale-out is also called horizontal scaling. You add more servers to the existing infrastructure and load-balance across them. This approach is catching on thanks to technologies like MapReduce. In scaling out there is, in principle, no hard limit: as long as you can load-balance, you can keep adding new servers to the existing cluster, and these servers need not be high-end as in scale-up scenarios; they can be commodity hardware.

Pros:

  1. It is easy to add low-end servers as demand grows.
  2. Servers can be scaled out or scaled back almost instantly, based on demand.
  3. This provides the right economies of scale.
  4. If one node in the grid fails, the system does not stand still; the other nodes take over the load automatically, so a hardware failure is not a big risk.
  5. Companies like Aster Data Systems and Greenplum Software have laid plenty of groundwork with MapReduce technologies.

Cons:

  1. Maintenance and support costs may be higher compared to scale-up.
  2. Scalability depends heavily on the network bandwidth provided by the infrastructure, and in practice scale-out often delivers lower per-node performance than scale-up.
  3. Unless you use open source, the licensing and initial acquisition costs are higher with the scale-out approach.

In a typical enterprise, vertical and horizontal scaling coexist and are used together to get the benefit of both approaches. Teams start with a large vertically scaled server, adding resources as needed; once the single server approaches its limit, they scale out. In some cases scale-out is more practical than scale-up: social networking companies, for example, need to scale out rather than up. When raw performance is the key, scale-up may be the better choice.
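
To make the scale-out side a bit more concrete, here is a minimal Java sketch (hypothetical node names, not a production load balancer) of the round-robin dispatch that spreads requests over a pool of identical nodes; adding capacity just means adding another entry to the node list.

  import java.util.List;
  import java.util.concurrent.atomic.AtomicLong;

  public class RoundRobinBalancer {
      private final List<String> nodes;
      private final AtomicLong counter = new AtomicLong();

      RoundRobinBalancer(List<String> nodes) { this.nodes = nodes; }

      String nextNode() {
          // Pick nodes in rotation; a real balancer would also skip failed nodes.
          int index = (int) (counter.getAndIncrement() % nodes.size());
          return nodes.get(index);
      }

      public static void main(String[] args) {
          RoundRobinBalancer balancer = new RoundRobinBalancer(List.of("node-1", "node-2", "node-3"));
          for (int i = 0; i < 6; i++) {
              System.out.println("request " + i + " -> " + balancer.nextNode());
          }
      }
  }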

Thursday, November 19, 2009

Comparing JSF, Spring MVC, Stripes, Struts 2, Tapestry and Wicket

This presentation is a bit old but contains very nice information about various web frameworks. In my opinion Raible's slideshow provides a good overview, but in real life it is not that easy to compare frameworks and simply choose the best one; it always depends on the kind of project you have. Having said that, it is a good starting point for understanding the various frameworks.
Check out this SlideShare Presentation:

Thursday, October 8, 2009

all the buzz around the cloud...

Over the last couple of years there has been a lot of buzz surrounding cloud computing. Startups built around this new computing platform and business model are mushrooming every day, and a lot of VC investment is flowing into these companies. I hope we are not back in the clouds of the early 2000s, when VCs piled ambition and aspiration onto internet startups, the startups were overvalued, and the bubble finally collapsed. I am not sure whether this cloud syndrome is a lasting phenomenon. Some authors have equated cloud-era data centers to electricity grids. That may be true, but I agree with others who say that existing computing models will persist and cloud computing will simply be one more model in the computing ecosystem. No doubt we are progressing towards a much more deeply collaborative environment; the recent rise of social networks and the growing importance of online collaborative tools and sites bear witness to this pattern. In all of this, the internet is becoming the mainstream computing platform.

Will all applications move to the cloud? Maybe the answer is NO.

Every application is different in terms of its requirements and usage. There was a time, when desktop PCs were gaining a lot of power, that everyone thought the mainframe era was ending, that powerful desktops would replace mainframe computers, and that computing power would be decentralized. The fact is that mainframes still rule today: the majority of mission-critical applications still run on them, and the interesting phenomenon is that we are going back to a centralized computing model much like the mainframe.

The cloud provides a strong business model that justifies a solid ROI, hence the buzz, yet the platform is still new and immature and very few companies have mastered the art. Even so, plenty of proof exists on the ground in terms of the business value this new platform provides; one of the earliest companies to reap its benefits is salesforce.com. Currently, big software players like Microsoft, Google, Amazon, Rackspace, IBM and Sun each have their own definition of cloud, and each provides its own infrastructure and utilities around its technologies. This is bound to happen, and things will consolidate into standards as the technology matures.

Though cloud computing is gaining momentum rapidly, there are still many concerns in corporate minds, the most important being security and the governance of critical data. The current infrastructure and utilities come with a fair amount of vendor lock-in: if a user wants to switch to another cloud provider, what are the options? We really do not know yet. It is therefore fairly clear that data and processes that are not critical to an enterprise can be deployed on an on-demand cloud to gain the wanted business value, whereas critical data and processes can be managed in house. The result will be hybrid models, with different computing strategies for different kinds of requirements within an enterprise.

Thursday, September 17, 2009

polyglot language - New trend in multi-paradigm programming

The Polyglot and Poly-paradigm Programming (PPP) model has been gaining a lot of momentum over the last couple of years, although the model was successfully applied in various applications long ago. For example, plenty of applications use C together with shell/awk scripting, and embedding scripts within HTML is yet another good example of the PPP model.

So why is the PPP model creating buzz of late? The simplest answer is that as enterprise applications get more and more complex, no single universal language can be devised to handle such diverse problems. Today even a simple enterprise application might need languages for domain-specific logic, the web interface, networking, concurrency, the database and so on. So we need to integrate various kinds of programming languages, such as scripting languages, DSLs (domain-specific languages) and functional languages, to solve the various problems within an application domain.

Scripting languages like Python, Ruby and Groovy offer more agile, ready-to-use patterns and components for development, so development time gets shorter and productivity improves. The big problem with languages like Java, C++ and C# is that the learning curve is steeper and they are oriented towards seasoned developers. For a domain expert, implementing a small equation solver within his program should not require learning how to write Java or C++; his objective is to solve the problem, not to learn the nuances of various languages. These are clear cases where DSLs and scripting languages are more helpful, since they inherently simplify the grammar of the program.

In addition to all of this, most modern languages target bytecode that can be deployed directly on either the Java VM or the .NET runtime, which largely removes the integration complexity between languages. Scala, JRuby and JPython for the JVM, and Haskell.net for .NET, are good examples of the PPP model in action.
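
As a small illustration of how cheaply languages can interoperate on the JVM, here is a hedged Java sketch using the standard JSR-223 scripting API (javax.script); it assumes a JRuby jar with its script-engine support is on the classpath, and the engine name is the only JRuby-specific detail.

  import javax.script.ScriptEngine;
  import javax.script.ScriptEngineManager;
  import javax.script.ScriptException;

  public class PolyglotSketch {
      public static void main(String[] args) throws ScriptException {
          // Look up a scripting engine by name; this assumes JRuby is on the classpath.
          ScriptEngine ruby = new ScriptEngineManager().getEngineByName("jruby");
          if (ruby == null) {
              System.out.println("No JRuby script engine found on the classpath");
              return;
          }
          // Evaluate a Ruby expression inside the same JVM and use the result from Java.
          Object sumOfSquares = ruby.eval("[1, 2, 3].map { |x| x * x }.inject(:+)");
          System.out.println("Ruby computed: " + sumOfSquares); // Ruby computed: 14
      }
  }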

This programming practice can be nurtured further if language creators keep simplifying the calling conventions and integration issues between languages. I hope we are progressing towards that goal.

Find out more:

http://www.polyglotprogramming.com/


Thursday, March 26, 2009

Agile data integration using data mashups

Data integration is one of the key drivers for companies to stay competitive. In today’s economic conditions it is all the more imperative; it comes down to survival of the fittest, and adaptation is the key. Top executives use BI to bridge the gap between the data and the knowledge that exist inside their organization, so that they can make proactive decisions and remain competitive.

Today’s enterprise contains data in varied forms, and one of the biggest challenges for companies is to integrate and aggregate it in order to make meaningful decisions. Most of the time no more than 20 to 30% of the data is available for decision making. One way this can be addressed is with data mashups.

In simple terms, a data mashup allows data from multiple sources within an enterprise to be integrated and transformed without custom coding or an understanding of the more complex underlying mechanisms. Anyone can create a simple data integration using these mashup tools without having to understand the underlying technology, which makes them a good fit for web 3.0 projects.

For example, the Aptar data mashup tool lets users pull data from different sources without any coding, by much simpler means. I personally like this because business users can simply drag and drop prebuilt data sources and start integrating data; a user without much knowledge of data integration can drive a data integration project. These tools can really help shrink the so-called “data integration project” that used to take multiple years down to a few days or weeks. In the enterprise context the tools may still need some time to prove themselves, but data mashups point the way forward for agile integration.
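
Under the hood, the core of such a mashup is usually a join of feeds on a shared key. Here is a minimal Java sketch (hypothetical sources and field names) of that merge step, simply to show the kind of work the tool hides from the business user.

  import java.util.*;
  import java.util.stream.*;

  public class MashupSketch {
      public static void main(String[] args) {
          // Source 1: customer names from a CRM feed, keyed by customer id.
          Map<Integer, String> crm = Map.of(1, "Acme Corp", 2, "Globex");

          // Source 2: spend totals from a separate billing system, same key.
          Map<Integer, Double> billing = Map.of(1, 1200.0, 2, 430.5);

          // The "mashup": merge both feeds on the shared key into one view.
          List<String> combined = crm.entrySet().stream()
              .map(e -> e.getValue() + " -> total spend " + billing.getOrDefault(e.getKey(), 0.0))
              .collect(Collectors.toList());

          combined.forEach(System.out::println);
      }
  }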

Saturday, March 14, 2009

Some key takeaways from the Open Source Tech Day held in Chennai, 13th March ’09

In the morning I reached the Chennai Trade Center around 9 am. Though I have lived in Chennai for years, I had never visited this place; thanks to the open source community for giving me the opportunity.

I attended the CXO summit in the Presidency hall of ITC, if I recall the name of the hall correctly. The session started at exactly 10 am IST.

The theme of the day was SaaS, cloud computing, virtualization and FOSS, which is what you would normally expect from any road show or conference happening currently.

I got the sense that the panelists were talking more about the technologies than about how to deliver maximum ROI, with concrete metrics, using FOSS technologies. I think that should have been the key message from the panel.

To start with, virtualization technology means different things to different players, and each flavour has its pros and cons, as with any other technology. Each vendor in the market has crafted it differently to differentiate themselves from the competition.

Dr. Mishra from Novell was keen on explaining Xen’s virtualization features and how it competes with other technologies, for example Red Hat’s sweet spot, KVM, and similarly Amit from Red Hat explained KVM. All of this is good to understand, but as a manager, how do I decide which virtualization technology to adopt?

As Mishra pointed out, currently only two virtualization technologies are enterprise ready. One, commercially available and around for quite some time, is VMware, which is the market leader as of now. The other, from Novell’s perspective, is Xen, which is industry ready and open source.

There was also plenty of discussion of the OpenVZ (openviz.org) model of virtualization, which is more native to Unix and exploits Unix features. It has a small memory and performance footprint, but it is not industry ready yet, as it does not provide all the tenets of enterprise-grade virtualization tools. Another key parameter to consider is that performance differs across virtualization technologies: with full virtualization the performance overhead is higher than with a para-virtualization model.

What I took away at the end, as Amit pointed out, is that in today’s context the hypervisor is essentially free and available at no cost. So it makes more sense for IT managers to think harder about, and look closely at, the right management tools for virtualization, which is the real key.

I will continue with more articles on these technologies, as they are the four key pillars of innovation and are getting a lot of attention. I would add one more technology to watch: SOA, and how it unfolds in the coming days.

Thursday, February 19, 2009

Is open source still a myth, or does it influence social freedom?

There have been enormous numbers of discussions and posts in the past about whether or not to adopt open source; it used to be a matter of choice. Today the whole world has changed and the clouds are looking dark. Everyone feels a sense of insecurity about the future because of what is happening around us. In the early 2000s the slowdown was caused primarily by the tech industry, but it was never this bad. Today it is different: the financial meltdown has caused millions of layoffs and a sense of insecurity even among white-collar, tech-savvy developers. Though tech is not necessarily at the heart of it, it is feeling its share of the pain.

Corporate budgets are shrinking, and CEOs and CIOs are bound to think of innovative ways to cut costs. I think all of this is playing in a positive direction for the open source community and the open source business model. Open source companies are making money because they can be more innovative and at the same time sell products and services at a much lower price than proprietary software can offer.

The open source business model is gaining rapid momentum. Companies like Red Hat, IBM and Sun are testimony to the fact that open source is not just a myth but a viable and sustainable business model. At the core of it, what we are witnessing is collaboration, and that fosters innovation; foundations and projects like GNU, Eclipse, Apache, OpenOffice and KDE, to name a few, are further testimony. I can say that today most proprietary companies use open source in one way or another, and almost all the big tech companies, like IBM, Sun, Oracle and Google, are increasingly adopting or moving toward an open source business model. So much so that VCs are reluctant to invest in proprietary software if the business model does not allow the IP engineering to be opened up. This momentum and growth is fundamentally restructuring social freedom in a sustainable way.