Sunday, December 20, 2009

The Big Data Puzzle

The next-generation database vendors claim support for petabytes of data, and the list keeps growing day by day. It’s quite interesting to watch the trend from EDW (Enterprise Data Warehouse) to EDA (Enterprise Data Appliance) to EDC (Enterprise Data Cloud) — the EDC being the latest bandwagon, from database vendor Greenplum. Today most of the Big Data vendors are singing the same tune, and at the core of it the following technologies are responsible.

  1. Column oriented architecture
  2. Cloud support & Integration
  3. Map Reduce to process the data
  4. Massively Parallel Processing
  5. Data grid with Commoditized H/W

Companies like Vertica, ParAccel, Greenplum, Infobright, and Sybase have built databases around the above key technologies.

Column Oriented Architecture :

A column-oriented database stores its content by column rather than in the traditional row format used by OLTP database systems. This architecture has a definite advantage when processing massive amounts of data compared to row-based storage, and it is most suitable for read-heavy workloads and analytical queries. In a typical data warehouse, a table may have more than 100 columns. When a user queries only a few of them, there is no need to process the entire row; the necessary columns can be retrieved much faster than from row-based storage.
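The advantage described above can be sketched in a few lines of Python — an illustrative toy, not any vendor's actual storage engine:

```python
# Toy comparison of row-oriented vs. column-oriented storage.

# Row-oriented: every field of a record is stored together.
rows = [
    {"id": 1, "region": "EMEA", "sales": 100.0},
    {"id": 2, "region": "APAC", "sales": 250.0},
    {"id": 3, "region": "EMEA", "sales": 175.0},
]

# Column-oriented: each column is its own contiguous array.
columns = {
    "id":     [1, 2, 3],
    "region": ["EMEA", "APAC", "EMEA"],
    "sales":  [100.0, 250.0, 175.0],
}

# Query: total sales. The row store must touch every field of every
# record; the column store reads only the single column it needs.
total_from_rows = sum(r["sales"] for r in rows)
total_from_columns = sum(columns["sales"])

assert total_from_rows == total_from_columns == 525.0
```

With 100+ columns per table, the gap between "scan every field" and "scan one array" is what makes this layout attractive for warehouse queries.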

Cloud support & Integration:

Almost all vendors provide a way to integrate their databases with clouds. Cloud architecture inherently provides the necessary tenets for solving the petabytes-of-data problem, and at the same time it offers better economies of scale. Typically, cloud computing provides the following key features, which are crucial for Big Data vendors.

  1. On-demand scale-out and provisioning keep capital expenditure low, which attracts vendors of all sizes.
  2. The whole analytical process can be configured in an external cloud rather than in house, which reduces the upfront cost of H/W installation.

Map Reduce:

The Map Reduce framework provides yet another dimension for solving this equation. Map Reduce processes huge datasets by chopping them into small sets, working on the pieces in a distributed environment, and aggregating the results back. This technique was first tried and implemented by big daddy Google. There are lots of architectures built on this framework; one notable variant is Hadoop, which can process massive amounts of data as key–value pairs. Map Reduce has been extended to SQL by Greenplum and Aster Data Systems. Some companies feel this will help developers, as they need not learn a new language while working with these technologies.
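A toy word-count sketch shows the map, shuffle, and reduce phases in plain Python. This is purely illustrative — a real framework like Hadoop runs these phases across many machines rather than in one process:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (key, value) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key.
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data big cloud", "cloud data"]
pairs = list(chain.from_iterable(map_phase(d) for d in documents))
counts = reduce_phase(shuffle(pairs))
# counts == {"big": 2, "data": 2, "cloud": 2}
```

The point is that each map call and each reduce call is independent, which is what lets the framework spread the work over a cluster of commodity machines.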

Massively Parallel Processing:

Massively parallel processing with a shared-nothing architecture is one of the key technologies behind today’s distributed computing, and it is what enables the so-called on-demand scale-out in cloud infrastructure. Hence it is one of the key ingredients in the Big Data equation.
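As a rough sketch of the shared-nothing idea, each "node" below owns a disjoint partition of the data, scans only its local rows, and a coordinator merges the partial results. The partitioning scheme and row names are made up for illustration:

```python
# Toy shared-nothing aggregation: hash-partition the table across nodes,
# let each node scan only its own partition, then merge the partials.

NUM_NODES = 4

def partition(table, num_nodes):
    # Hash-partition rows so each node owns a disjoint slice of the data.
    parts = [[] for _ in range(num_nodes)]
    for key, value in table:
        parts[hash(key) % num_nodes].append((key, value))
    return parts

def node_scan(local_rows):
    # Runs independently on each node, against local data only --
    # no shared memory or shared disk between nodes.
    return sum(value for _, value in local_rows)

table = [(f"order-{i}", i * 10) for i in range(100)]
partials = [node_scan(p) for p in partition(table, NUM_NODES)]
total = sum(partials)  # the coordinator's final aggregation step
assert total == sum(v for _, v in table)
```

Because the nodes share nothing, adding more of them adds capacity almost linearly — which is exactly the on-demand scale-out property mentioned above.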

Data grid with Commoditized H/W:

This is the last but not the least technology in the above stack. The technological rise described above is happening thanks to the availability of today’s low-cost commodity H/W. This commoditization has led to the proliferation of large grid architectures at low cost. Instead of spending money on high-end proprietary H/W, cloud computing provides an opportunity to use low-cost H/W and still reap the same advantages.

Thursday, December 3, 2009

Scale UP versus Scale Out

Scale UP : This process is also called vertical scaling. With this approach you add more hardware, such as RAM, processors, and network interface cards, to an existing server. This provides more memory real estate and bandwidth for computation, and hence your application runs faster. Having said that, if we keep adding new hardware to existing servers, at some point they fail or give diminishing or even negative returns. So scale up always has its limitations. While scaling up, the OS and application software also have to support the increased H/W in order to deliver better performance.

Pros :

  1. It doesn’t introduce additional maintenance and support costs.
  2. Current blade servers provide a plug-in architecture for new H/W, so adding hardware is a fairly simple process.
  3. Many software vendors support scale up, so getting support is easier.
  4. It’s claimed that, with all other variables held constant, scale up provides much better performance and lower maintenance cost than the scale-out process.

Cons :

  1. We cannot simply keep adding new H/W to existing servers; how far we can go depends heavily on the server architecture and on the software too.
  2. Technology compatibility is a moving target. As technology changes rapidly, we cannot be sure how well new H/W will remain compatible with the old.
  3. The underlying software also has to be designed for H/W failover.

Scale Out : Scale out is also called horizontal scaling. You add more servers to the existing infrastructure and load-balance across them. This approach is catching on thanks to technologies like Map/Reduce. In principle there is no hard limit to scaling out: as long as you can do the load balancing, you can keep adding new servers to the existing cluster, and these servers need not be high end as in the scale-up scenario — they can be commodity H/W.
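The "add servers and load-balance" idea can be sketched as a minimal round-robin balancer. The class and node names here are hypothetical, purely for illustration:

```python
# Toy round-robin load balancer for a scale-out cluster.

class RoundRobinBalancer:
    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0

    def add_server(self, server):
        # Scaling out: simply register another commodity node in the pool.
        self.servers.append(server)

    def route(self, request):
        # Hand each incoming request to the next server in rotation.
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server

balancer = RoundRobinBalancer(["node-1", "node-2"])
first = [balancer.route(f"req-{i}") for i in range(4)]
# first == ["node-1", "node-2", "node-1", "node-2"]

balancer.add_server("node-3")  # demand grew: add one more cheap server
more = [balancer.route(f"req-{i}") for i in range(4, 7)]
# more == ["node-2", "node-3", "node-1"]
```

Real load balancers track node health and weight, but the core mechanism — new capacity is just another entry in the pool — is what makes scale out open-ended.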

Pros :

  1. It is easy to add low-end servers as demand grows.
  2. Servers can be instantly scaled out or scaled back based on demand.
  3. This provides the right economy of scale.
  4. If one node in the grid fails, the system does not stand still; the other nodes take over the load automatically, so H/W failure is not a big risk.
  5. Companies like Aster Data Systems and Greenplum Software have laid plenty of groundwork with Map/Reduce technologies.

Cons :

  1. Maintenance and support costs may be higher compared to the scale-up process.
  2. Scalability depends much more on the network bandwidth provided by the infrastructure. In practice, scale out may deliver lower performance than scale up.
  3. Unless you use open source, the licensing and initial acquisition costs are higher with the scale-out approach.

In a typical enterprise, vertical and horizontal scaling co-exist, taking the benefit of both approaches. Teams start with a large vertically scaled server, adding resources as needed; once that single server approaches its limit, they scale out. In some cases scale out is more practical than scale up — social networking companies, for example, need to scale out rather than up. Where raw performance is the key, scale up may be the better choice over scale out.