Sunday, December 20, 2009

The Big Data Puzzle

The next-generation database vendors claim support for petabytes of data, and the list grows by the day. It is interesting to watch the trend move from the EDW (enterprise data warehouse) to the EDA (enterprise data appliance) to the EDC (enterprise data cloud). The EDC is the latest bandwagon, from the database vendor Greenplum. Today most of the Big Data vendors are singing the same tune, and at the core of it are the following technologies.

  1. Column oriented architecture
  2. Cloud support & Integration
  3. Map Reduce to process the data
  4. Massively Parallel Processing
  5. Data grid with Commoditized H/W

Companies like Vertica, ParAccel, Greenplum, Infobright and Sybase have built their databases around the key technologies above.

Column Oriented Architecture:

A column-oriented database stores its content by column rather than in the traditional row format used by OLTP database systems. This layout has a clear advantage over row-based storage when processing massive amounts of data, and it is best suited to read operations and queries. In a typical data warehouse, the number of columns may exceed 100 per table. When a query touches only a few of those columns, there is no need to process entire rows; the necessary columns can be retrieved much faster than from row-based storage.
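The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration: the table, its column names and the helper functions are made up for the example, and a real column store adds compression, on-disk encoding and much more.

```python
# Row-oriented layout: all fields of a record are stored together.
# (Hypothetical sample table, invented for this sketch.)
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0, "status": "shipped"},
    {"order_id": 2, "region": "US", "amount": 80.0, "status": "open"},
    {"order_id": 3, "region": "EU", "amount": 50.0, "status": "shipped"},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: all values of one column are stored together.
columns = to_columns(rows)


def total_amount_row_store(rows):
    # Every record has to be visited, even though only one field is used.
    return sum(row["amount"] for row in rows)


def total_amount_column_store(columns):
    # Only the "amount" column is read; the other columns are never touched.
    return sum(columns["amount"])


print(total_amount_row_store(rows))
print(total_amount_column_store(columns))
```

Both functions return the same total. The point is that the column version reads one list out of four, and with 100 or more columns per table that saving grows accordingly.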

Cloud support & Integration:

Almost all vendors provide a way to integrate their databases with the cloud. Cloud architecture inherently provides the tenets needed to tackle the petabyte-scale data problem, and at the same time it offers better economies of scale. Typically, cloud computing provides the following features, which are key for Big Data vendors.

  1. On-demand scale-out and provisioning keep capex low, which attracts vendors of all sizes.
  2. The whole analytical process can be configured within an external cloud; it need not be in-house. This helps reduce the upfront cost of H/W installation.

Map Reduce:

The MapReduce framework adds yet another dimension to solving this equation. MapReduce provides a way to process huge datasets by chopping them into small sets, processing those in a distributed environment, and aggregating the results back together. The technique was initially tried and implemented by Big Daddy Google. Many architectures are built on this framework. One notable variant is Hadoop, which can process massive amounts of data as key-value pairs. MapReduce has been extended to SQL by Greenplum and Aster Data Systems. Some companies feel this will help developers, as they need not learn a new language to work with these technologies.
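The chop, process and aggregate flow can be sketched with the classic word count. This is a single-process toy in plain Python, not Hadoop's or Google's API; the function names and the sample chunks are assumptions made for the example, and a real framework would run the map and reduce steps on many machines.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def map_reduce(chunks):
    # In a real cluster each chunk would be mapped on a different node.
    mapped = []
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, already chopped into small sets.
chunks = ["big data big", "data cloud", "big cloud cloud"]
counts = map_reduce(chunks)
print(counts)
```

Everything flows through key and value pairs: the map step produces them, the shuffle groups them, and the reduce step aggregates each group.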

Massively Parallel Processing:

Massively parallel processing with a shared-nothing architecture, in which each node works on its own slice of the data with its own CPU, memory and disk, is one of the key technologies behind today's distributed computing, and it provides the so-called on-demand scale-out in cloud infrastructure. Hence it is one of the key ingredients in the Big Data equation.
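A rough sketch of the shared-nothing idea, assuming rows are hash-partitioned on a key so that each node owns a disjoint slice of the data. The node count, the column names and the sample rows are hypothetical, and the "nodes" here are just Python lists processed one after another rather than real machines working in parallel.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # assumed cluster size, chosen only for the example


def node_for(key, num_nodes=NUM_NODES):
    """Pick the owning node for a key with a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, num_nodes=NUM_NODES):
    """Send each row to exactly one node, based on its customer_id."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row["customer_id"], num_nodes)].append(row)
    return nodes


def local_sum(node_rows):
    """Work done by one node, using only the rows it owns."""
    totals = defaultdict(float)
    for row in node_rows:
        totals[row["customer_id"]] += row["amount"]
    return dict(totals)


def parallel_sum(rows, num_nodes=NUM_NODES):
    """Combine the per-node results; keys never overlap across nodes."""
    result = {}
    for node_rows in distribute(rows, num_nodes):
        result.update(local_sum(node_rows))
    return result


# Hypothetical sample data.
rows = [
    {"customer_id": "a", "amount": 10.0},
    {"customer_id": "b", "amount": 5.0},
    {"customer_id": "a", "amount": 2.5},
    {"customer_id": "c", "amount": 7.0},
]
print(parallel_sum(rows))
```

Because every key lives on exactly one node, no node needs another node's memory or disk to finish its part, and adding nodes only changes how the keys are spread out.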

Data grid with Commoditized H/W:

This is the last, but not the least, technology in the stack above. The rise of these technologies is driven by the availability of today's low-cost commodity H/W. H/W commoditization has led to a proliferation of large, low-cost grid architectures. Instead of spending money on high-end H/W, cloud computing offers an opportunity to use low-cost H/W and reap the same advantages as proprietary H/W.

2 comments:

  1. I agree with your list of five core technologies, but I'm not sure all the vendors are singing the same tune. If you look more closely, I think you will find that Infobright and Sybase are not singing the MPP refrain, Paraccel and Infobright are not singing the MapReduce harmonies, and Greenplum can't really carry the column oriented architecture melody all that well.

    Dave Menninger
    Vertica

  2. Thanks, Dave, for your comments and correction. I really appreciate the clarification.

    I meant that today's big data vendors are building their databases around these core technologies, not necessarily that all of them use all of the above technologies.

    Thanks.
