Tuesday, February 2, 2010

NoSQL databases gaining momentum...

The next generation of non-relational (NoSQL) databases is gaining a lot of popularity in the world of web-scale data stores. There is an interesting shift happening from traditional row-and-column relational databases to key-value based, non-SQL databases. They are often called document-centric databases.

Some of the limitations of traditional relational databases are as follows:

1. The data has to be normalized and available in row and column format; only that kind of data is a good candidate for a relational data store. In other words, relational databases are not a good fit for storing unstructured data that does not come in row and column form.

2. Normalization reduces performance, because queries have to join the normalized tables back together.

3. Replication between nodes is painful and expensive.

4. Relational databases are very hard to scale horizontally.

On the other hand, key-value data stores provide comparable features with more flexibility and are well suited to storing and processing large amounts of data. They also work well with unstructured or semi-structured data sets.

The following are the key features of these new data stores:

1. Schema-free, de-normalized document storage

2. Key/value based lookups

3. Good candidates for horizontal scaling (scale well over a very large number of nodes)

4. Support for map/reduce-style programming

5. Built-in replication

6. Simple HTTP/REST based APIs (see the sketch after this list)

7. Well suited to cloud-based applications
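
As a quick illustration of the key/value, HTTP/REST style of access, here is a minimal Python sketch against a hypothetical store; the base URL and key naming are made up for illustration, and every product has its own URL conventions.

    # Minimal sketch of key/value access over a plain HTTP/REST API.
    # The endpoint below is hypothetical; CouchDB, Riak, etc. each have their own layout.
    import json
    import urllib.request

    BASE = "http://localhost:8080/mystore"   # hypothetical endpoint

    def put(key, value):
        data = json.dumps(value).encode("utf-8")
        req = urllib.request.Request(BASE + "/" + key, data=data, method="PUT",
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return resp.status

    def get(key):
        with urllib.request.urlopen(BASE + "/" + key) as resp:
            return json.loads(resp.read())

    put("user:42", {"name": "Alice", "tags": ["admin", "beta"]})
    print(get("user:42")["name"])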

Some of the most popular document-oriented and key-value data stores are:

1. Apache CouchDB

2. MongoDB

3. Riak

4. Redis

5. ThruDB

6. Tokyo Cabinet

7. Memcached

Apache CouchDB : Apache CouchDB was created by Damien Katz. It is a document-oriented, highly distributed, schema-free database written in Erlang, ideal for large, concurrent applications.

The database can be queried and indexed MapReduce-style using views. CouchDB also offers incremental replication with bi-directional conflict detection and resolution.

CouchDB provides a RESTful JSON API that can be accessed from any environment that allows HTTP requests.

It is one of the first true document-oriented databases designed to scale with the web, and it is already used by many software companies.
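
To make the RESTful JSON API and MapReduce-style views a bit more concrete, here is a rough Python sketch (standard library only) of creating a database, a document, and a simple map/reduce view, then querying it. It assumes a local CouchDB on the default port with no authentication configured; the database name, fields, and view name are made up.

    # Rough sketch of talking to CouchDB over its RESTful JSON API.
    import json
    import urllib.request

    COUCH = "http://localhost:5984"

    def call(method, path, body=None):
        data = json.dumps(body).encode("utf-8") if body is not None else None
        req = urllib.request.Request(COUCH + path, data=data, method=method,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    call("PUT", "/articles")                           # create the database (fails if it already exists)
    call("PUT", "/articles/doc1", {"type": "post", "title": "NoSQL"})
    # A design document holding a map/reduce view; the map function is JavaScript run by CouchDB.
    call("PUT", "/articles/_design/stats", {
        "views": {"by_type": {"map": "function(doc){ emit(doc.type, 1); }",
                              "reduce": "_count"}}})
    print(call("GET", "/articles/_design/stats/_view/by_type?group=true"))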

MongoDB : MongoDB is a widely used database in this category. It is written in C++ and provides almost all the features of CouchDB. MongoDB is more mature and is commercially backed by a company called 10gen. The database manages collections of JSON-like documents, which are stored in a binary format referred to as BSON.
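
A small sketch of the same idea with MongoDB, assuming the pymongo driver and a local mongod; the database, collection, and field names are just illustrative.

    # Sketch of storing and querying JSON-like documents in MongoDB (pymongo driver assumed).
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    posts = client.blog.posts                      # database "blog", collection "posts"

    posts.insert_one({"title": "NoSQL databases",  # stored internally as BSON
                      "tags": ["nosql", "document"],
                      "views": 0})
    doc = posts.find_one({"tags": "nosql"})        # matches documents whose tags array contains "nosql"
    print(doc["title"])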

Riak : This is one of the newer entrants in this space. It combines a decentralized key-value store, a flexible map/reduce engine, and a friendly HTTP/JSON query interface to provide a database ideally suited for Web applications.

Redis : This is another new project, hosted on Google Code. Redis is a key-value database written in ANSI C and is very fast. Its usage model is similar to Memcached's, but Redis can also persist data to disk. It is available on most platforms and provides many of the features of the other data stores listed here.
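
A tiny sketch of basic Redis usage, assuming the redis-py client and a Redis server on the default port; the key names are made up.

    # Sketch of basic key/value usage with Redis (redis-py client assumed).
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.set("page:home:hits", 0)       # plain key/value set
    r.incr("page:home:hits")         # atomic counter increment
    print(r.get("page:home:hits"))   # redis-py returns bytes, e.g. b"1"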

ThruDB : ThruDB is also hosted on Google Code. Thrudb is a set of simple services built on top of the Apache Thrift framework. It offers fast, flexible, easy-to-use services that can enhance or replace traditional data storage and access layers.

Others : There are many other implementations of document-oriented databases, and most of them try to provide the key features mentioned above. This is a new way of storing and retrieving data. These database models provide an alternative and very flexible way to solve large-scale web data problems, which have traditionally been a big limitation of relational databases and other database models.


Saturday, January 9, 2010

A List of ETL tools

http://www.etltool.com/etltoolslist.htm.

To the above list I would like to add the following ETL products.

Jitterbit ETL - Jitterbit
Expressor Integrator - Expressor

Friday, January 8, 2010

Snap Logic : The data flow company

Most commercial data integration companies today provide data integration in a very traditional way. All the integration and ETL jobs run on premise, and integration is mostly focused on structured data within corporate boundaries. Though most of them support unstructured data, the basic premise of integration never changed. Over the past few years a lot of web standards have emerged, and most software vendors have embraced them. One of the important goals was to provide very loose coupling between services, so that we can truly create a collage of services and deliver on-demand, agile applications. Web services and XML have taken center stage in this paradigm shift. Unfortunately, this did not extend to real enterprise application integration: integration is still done mostly through native APIs, and hence there is always a risk of failure as a growing number of monolithic applications communicate with each other.

On the other hand, the types of data being generated by various applications have also changed over the years. Today, semi-structured and unstructured data such as RSS/Atom feeds, XML, and CSV files is generated in greater volume than structured data. Moreover, this data is not only generated within corporations but also comes from outside sources like blogs and wikis. Traditional integration tools are bound to struggle here, as they were never built on these premises.

In recent years, SaaS (Software as a Service) has emerged as a very promising business model and has changed business dynamics completely. With SaaS, vendors can target customers in any price bracket, which greatly reduces the customer's initial upfront investment in software licenses. At the same time, it has raised new challenges: many corporate legacy applications now have to talk to these externally hosted, managed services. For example, if a company wants to integrate Salesforce with its in-house legacy ERP application, the integration layer needs to support web services rather than only traditional data access layers.

To sum up, today's data integration needs to support many types of data from many sources, accessed through a variety of APIs, standards, and protocols. This has to be accomplished in a loosely coupled, boundary-less, secure, fast, and agile way, unlike traditional data integration projects. This is the gap SnapLogic is trying to fill.

SnapLogic is an open source data integration company. SnapLogic DataFlow is a scalable data integration platform that leverages Web technology and standards to provide organizations of all sizes with a flexible and cost-effective solution for on-demand data integration. It can connect to and fetch data from a variety of sources: traditional databases, SaaS apps such as Salesforce and NetSuite, and social networking sites. DataFlow also provides a way to create custom components and integrate them with SnapLogic.

Sunday, December 20, 2009

The Big Data Puzzle

The next generation of database vendors claims support for petabytes of data, and the list keeps growing day by day. It's quite interesting to watch the trend from EDW (Enterprise Data Warehouse) to EDA (Enterprise Data Appliance) to EDC (Enterprise Data Cloud), the latest bandwagon, championed by database vendor Greenplum. Today most Big Data vendors are singing the same tune, and at the core of it the following technologies are responsible:

  1. Column oriented architecture
  2. Cloud support & Integration
  3. Map Reduce to process the data
  4. Massively Parallel Processing
  5. Data grid with Commoditized H/W

Companies like Vertica, ParAccel, Greenplum, Infobright, and Sybase have built databases around these key technologies.

Column Oriented Architecture :

A column-oriented database stores its content by column rather than in the traditional row format used by OLTP database systems. This architecture has a definite advantage when processing massive amounts of data compared to row-based architectures, and it is most suitable for read-heavy operations and analytical queries. In a typical data warehouse, a table might have more than 100 columns. When a user queries only a few of them, there is no need to process entire rows; the necessary columns can be retrieved much faster than with row-based storage.
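
A toy Python sketch of the difference (no real database involved): in the row layout every row has to be touched even when only one column is needed, while in the column layout a query scans just the one array it cares about.

    # Toy illustration of row versus column layout.
    # Table: (order_id, region, amount); a real warehouse table would have many more columns.
    rows = [
        (1, "EMEA", 120.0),
        (2, "APAC",  75.5),
        (3, "EMEA", 310.0),
    ]

    # Row store: whole rows are read even though only "amount" is needed.
    total_row_store = sum(r[2] for r in rows)

    # Column store: each column lives in its own contiguous array,
    # so a query touching one column scans only that array.
    columns = {
        "order_id": [1, 2, 3],
        "region":   ["EMEA", "APAC", "EMEA"],
        "amount":   [120.0, 75.5, 310.0],
    }
    total_column_store = sum(columns["amount"])

    assert total_row_store == total_column_store == 505.5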

Cloud support & Integration:

Almost all vendors provide a way to integrate their databases with the cloud. Cloud architecture inherently provides the necessary tenets for solving the petabyte-scale data problem, and at the same time it offers better economies of scale. Typically, cloud computing provides the following key features, which are crucial for Big Data vendors:

  1. On-demand scale-out and provisioning keeps capital expenditure low, which attracts organizations of all sizes.
  2. The whole analytical process can be configured in an external cloud; it need not be in house. This helps reduce the upfront cost of H/W installation.

Map Reduce:

The MapReduce framework provides yet another dimension for solving this equation. MapReduce processes huge datasets by chopping them into small pieces, working on the pieces in parallel, and aggregating the results back in a distributed environment. The technique was first implemented at scale by Google. There are many architectures built on this framework; one notable implementation is Hadoop, which processes massive amounts of data as key/value pairs. MapReduce has also been extended to SQL by Greenplum and Aster Data Systems. Some companies feel this will help developers, since they need not learn a new language while working with these technologies.
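
A minimal single-process word-count sketch in Python shows the map, shuffle, and reduce phases; a real framework such as Hadoop runs the same key/value flow distributed across many nodes.

    # Minimal map/reduce word count (single process, for illustration only).
    from collections import defaultdict

    documents = ["big data needs map reduce", "map reduce chops big data"]

    # Map phase: emit (key, value) pairs from each input chunk.
    mapped = []
    for doc in documents:
        for word in doc.split():
            mapped.append((word, 1))

    # Shuffle phase: group the emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce phase: aggregate the values for each key.
    counts = {key: sum(values) for key, values in groups.items()}
    print(counts["map"])   # -> 2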

Massively Parallel Processing:

Massively parallel processing with a shared-nothing architecture is one of the key technologies behind today's distributed computing, and it is what makes the so-called on-demand scale-out possible in cloud infrastructure. Hence it is one of the key ingredients in the Big Data equation.

Data grid with Commoditized H/W:

Last but not least is the technology at the bottom of the stack. The technological rise described above is happening because of the availability of today's low-cost commodity hardware. This commoditization has led to the proliferation of large, low-cost grid architectures. Instead of spending money on high-end proprietary H/W, cloud computing provides an opportunity to use low-cost H/W and still reap comparable advantages.

Thursday, December 3, 2009

Scale UP versus Scale Out

Scale UP : This process is also called vertical scaling. With this approach you add more hardware, such as RAM, processors, and network interface cards, to an existing server. This provides more memory and bandwidth for computation, and hence your application will be faster. Having said that, if we keep adding new hardware to an existing server, at some point it fails or even performs worse, so scale up always has limits. While scaling up, the OS and application software also have to support the increased H/W in order to deliver better performance.

Pros :

  1. It doesn’t introduce additional maintenance and support costs.
  2. Current blade servers provide a plug-in architecture for adding new H/W, so adding hardware is a fairly simple process.
  3. Many software vendors support scale up, so getting support is easier.
  4. It’s claimed that scale up provides much better performance and lower maintenance cost than scale out, keeping all other variables constant.

Cons :

  1. We cannot simply keep adding new H/W to an existing server; how far we can go depends heavily on the server architecture and the software.
  2. Technology compatibility: as technology changes rapidly, we are not sure how compatible new H/W will be with the old.
  3. The underlying software also has to be designed for H/W failover.

Scale Out : Scale out is also called horizontal scaling. You add more servers to the existing infrastructure and load-balance across them. This approach is catching on thanks to technologies like Map/Reduce. In scaling out there is practically no upper limit: as long as you can load-balance, you can keep adding servers to the existing cluster, and these servers need not be high end as in scale-up scenarios; they can be commodity H/W.
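
Here is a toy Python sketch of spreading keys over commodity nodes when scaling out; the node names are made up, and real systems typically use consistent hashing so that adding a node moves only a small fraction of the keys, unlike the naive modulo hashing below.

    # Toy sketch of spreading load over commodity nodes by hashing keys.
    import hashlib

    nodes = ["node-a", "node-b", "node-c"]   # hypothetical commodity servers

    def node_for(key):
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return nodes[int(digest, 16) % len(nodes)]

    print(node_for("user:42"))   # the same key always routes to the same node
    nodes.append("node-d")       # scaling out: just add another cheap server
    print(node_for("user:42"))   # naive modulo hashing may now remap this key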

Pros :

  1. Easy to add low-end servers as demand grows.
  2. Servers can be scaled out or scaled back almost instantly based on demand.
  3. This provides the right economy of scale.
  4. If one node in the grid fails, the system does not stand still; the other nodes take over the load automatically, so H/W failure is not a big risk.
  5. Companies like Aster Data Systems and Greenplum Software have laid plenty of groundwork with Map/Reduce technologies.

Cons :

  1. Maintenance and support costs may be higher than with scale up.
  2. Scalability depends heavily on the network bandwidth provided by the infrastructure, and in practice scale out may deliver lower per-node performance than scale up.
  3. Unless you use open source software, licensing and initial acquisition costs are higher with the scale-out approach.

In a typical enterprise, vertical and horizontal scaling are used together to get the benefit of both approaches; they co-exist. Enterprises start with a large, vertically scaled server, adding resources as needed. Once that single server approaches its limit, they scale out. In some cases scale out is more practical than scale up: social networking companies, for example, need to scale out rather than up. Where raw single-node performance is the key, scale up may be the better choice.