'Source dedupe best suits virtual environments'

Deepa

BANGALORE, INDIA: Of late, we have been hearing a lot about how data, especially unstructured data, is sprawling across the metros, and how storage and networking experts are digging out newer ways to manage it.


Recently I attended an HP POD (Performance Optimised Datacentre) event where HP showcased its 20-foot POD, built on the data centre in a shipping container concept. The POD can support over 1,500 compute nodes and 100 physical servers per rack, with an average power density of 27 kW per rack and a maximum of up to 34 kW, which is much higher than what a traditional data centre can provide, and that too in a smaller space.

It seems the storage industry is set to brave the data growth storm and is waiting for the right wind to blow so it can go further and plan how to curb and manage data more efficiently.

That was a bit of a digression, but containing the growth of data is the common thread I was trying to arrive at. Now coming to the point: over the past few years, experts have been successfully trying their hand at various data management technologies, such as thin provisioning, storage tiering, e-mail and database archiving, and data deduplication, to contain data.

Let's take a look here at the data dedupe concept. Though I have dealt with the topic before, let's see how data dedupe suits the virtual environment.

Deduplication essentially makes sure that only unique data gets backed up to the back-up server. For duplicate data, references or pointers to the copy that has already been stored are created instead, so the back-up, which sits not on the primary machine but on another machine, holds each piece of data only once.
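To make that idea concrete, here is a minimal, hypothetical sketch in Python (not any vendor's actual product): data is split into chunks, each chunk is identified by a hash of its contents, unique chunks are stored once, and repeats are recorded only as pointers to the stored copy.

    import hashlib

    CHUNK_SIZE = 4096          # fixed-size chunks for simplicity; real products often chunk more cleverly
    chunk_store = {}           # hash -> chunk bytes; each unique chunk is stored exactly once

    def back_up(data: bytes) -> list:
        """Return the list of chunk hashes (pointers) needed to rebuild the data."""
        pointers = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in chunk_store:      # only unique data is actually stored
                chunk_store[digest] = chunk
            pointers.append(digest)            # a duplicate costs only a small pointer
        return pointers

    def restore(pointers: list) -> bytes:
        """Rebuild the original data from the stored unique chunks."""
        return b"".join(chunk_store[p] for p in pointers)

Backing up the same file twice with this sketch stores its chunks only once; the second back-up adds nothing but pointers.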

There are three ways of doing dedupe:
Source: on the client machine
Target: on the back-up server
Appliance: on a dedicated device

Source deduplication: Client software, or software sitting on the source machine, checks whether the data already exists on the back-up server, so that only unique data is sent across the network.

Target deduplication: The deduplication software sits on the back-up server. When data comes from the source through the network to the back-up server, each incoming file is checked to see whether it is unique or a duplicate, and it is then deduplicated.

With appliance-based deduplication, all of the back-up data is sent to the device and deduplication occurs at the target. With appliances, users can add systems in place of, or alongside, existing back-up targets and make very little change to the overall back-up methodology.
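The difference between the three really comes down to where the "is this chunk already stored?" check happens. The toy sketch below, again hypothetical and continuing the chunk-and-hash idea above rather than describing any particular product, contrasts the source and target paths by counting how many bytes cross the network in each case.

    import hashlib

    CHUNK_SIZE = 4096

    def chunks_of(data: bytes):
        for i in range(0, len(data), CHUNK_SIZE):
            yield data[i:i + CHUNK_SIZE]

    class BackupServer:
        """Toy back-up target that keeps each unique chunk once."""
        def __init__(self):
            self.chunks = {}          # hash -> chunk
            self.bytes_received = 0   # bytes that actually crossed the network

        def has_chunk(self, digest):
            return digest in self.chunks

        def store_chunk(self, digest, chunk):
            self.bytes_received += len(chunk)
            self.chunks[digest] = chunk

        def receive_and_dedupe(self, chunk):
            self.bytes_received += len(chunk)          # the chunk travelled regardless
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)      # duplicates are dropped only on arrival

    def source_dedupe_backup(data: bytes, server: BackupServer):
        """Source dedupe: ask the server (by hash) first, so duplicates never leave the client."""
        for chunk in chunks_of(data):
            digest = hashlib.sha256(chunk).hexdigest()
            if not server.has_chunk(digest):
                server.store_chunk(digest, chunk)

    def target_dedupe_backup(data: bytes, server: BackupServer):
        """Target (or appliance) dedupe: every chunk is sent; the server discards the repeats."""
        for chunk in chunks_of(data):
            server.receive_and_dedupe(chunk)

Backing up two near-identical virtual machine images through each path makes the trade-off visible: with target dedupe both images travel in full, while with source dedupe the second image adds almost nothing to bytes_received.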


Vijay Mhaskar, VP, Information Management Group, Symantec, says: "Source level deduplication is best suited for virtual environments. In a virtual environment there are virtual machines which in turn have several copies of data, thus making the data extremely bulky. It’s efficient to optimize the data at the source level itself and then move it over the network than getting it done at the target level."


Now, if you are wondering why data managers are wary of transmitting the whole chunk of data (both original and duplicate) over the network and then deduping it at the target, here is the answer.

The downside of the target deduplication process is that duplicate as well as unique data flows through the network to the back-up server, which undermines network optimisation.

"By deduplicating data before it is sent to a target, less data can be sent over the network. So you save network bandwidth because you have not moved the duplicate data on the back-up. The idea is similar to performing compression in software. In fact deduplication processes almost always include compression as well," Mhaskar continues.

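The compression point is easy to bolt onto the same hypothetical sketch, reusing the toy BackupServer above: the unique chunks that do have to travel can be compressed (here with zlib) before they are sent, so the back-up saves bandwidth twice, once by skipping duplicates and once by shrinking what remains.

    import hashlib
    import zlib

    def send_unique_compressed(chunk: bytes, server) -> None:
        """Dedupe first, then compress whatever still has to cross the network."""
        digest = hashlib.sha256(chunk).hexdigest()
        if not server.has_chunk(digest):                       # duplicates are skipped entirely
            server.store_chunk(digest, zlib.compress(chunk))   # unique chunks travel compressed

    # On restore, each stored chunk would be expanded again with zlib.decompress().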
We have seen how source dedupe outsmarts target dedupe in this scenario. However, that is not so in all instances. This was an ideal case, where factors such as source server memory and network bandwidth all favoured carrying out deduplication at the source itself. That is not always so.

“Source deduplication is most preferred, however there is also a huge cost associated with it. Moreover, if application operation consumes 85 per cent of your time, then there will not be much bandwidth left on the client or source machine to run a deduplication algorithm. Also, if applications are consuming a lot of bandwidth on the source machine, then source deduplication is not recommended,” Mhaskar opines.

In countries such as India, where network bandwidth is very expensive, doing dedupe in software at the source is much better than deduping at an appliance.
