Deduplication simply means getting rid of duplicates. If you’ve ever run any sort of backup process, you’re probably familiar with the problem. You want all your data backed up, but you don’t want to copy the entire data set each time. One way around this is to do incremental backups, where each run only copies what has changed since the last one. This works well most of the time, but it isn’t foolproof against duplicate data. Duplicates can still creep in through user input, for example, or through poorly handled backups.
The main reason to deduplicate is to conserve storage space. Storage is a limited resource, with all the associated costs, so it makes sense to store only what is necessary. Anything extra, like copies of data that have already been backed up, is redundant and therefore a wasted expense.
For your primary storage, deduplication removes the extra physical copies of the data and ties the logical copies together. For example, you would have one physical version of the data, with every copy pointing to it through metadata. In the public cloud, these sorts of deduplication techniques aren’t visible to the user; they’re happening, but behind the scenes.
This is partly because at that point it isn’t particularly important to the end-user. Their data is being served to them and that’s what matters. But the actual benefit lies with the cloud provider. Storage space is being charged to the user on the logical capacity, rather than the physical. Any savings that the provider can make on the physical means that they can reduce their costs while still charging the same amount.
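The idea of one physical copy with logical copies pointing to it through metadata, and the resulting gap between logical and physical capacity, can be illustrated with a minimal sketch. The `DedupStore` class and its method names are hypothetical, invented for illustration; real storage systems typically deduplicate at block or chunk level and are far more involved.

```python
import hashlib


class DedupStore:
    """Toy in-memory store: one physical copy per unique content."""

    def __init__(self):
        self.blocks = {}  # content digest -> bytes (the physical copies)
        self.names = {}   # logical name -> content digest (the metadata)

    def put(self, name, data):
        # Identical content always hashes to the same digest,
        # so a duplicate only adds a metadata pointer.
        digest = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(digest, data)
        self.names[name] = digest

    def get(self, name):
        # Follow the metadata pointer to the single physical copy.
        return self.blocks[self.names[name]]

    def logical_bytes(self):
        # What the user sees (and, per the text, what they are billed for).
        return sum(len(self.blocks[d]) for d in self.names.values())

    def physical_bytes(self):
        # What actually occupies storage.
        return sum(len(b) for b in self.blocks.values())


store = DedupStore()
store.put("report.txt", b"hello world")
store.put("report-copy.txt", b"hello world")  # duplicate content
store.put("notes.txt", b"other")

print(store.logical_bytes(), store.physical_bytes())  # prints: 27 16
```

Here the user holds 27 logical bytes across three names, while only 16 bytes are physically stored. That difference is the saving the provider keeps.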
The situation changes when it comes to using cloud storage for backup. Keeping multiple backup images in the cloud uses up a lot of storage, more than the same backups would occupy on a deduplicating platform such as a disk-based backup system.
Often, backup platforms will deduplicate at the source and store the source data on physical storage. The backup software then manages the metadata that translates from logical to physical. Another approach is to deduplicate independently of the backup software, which is useful if you want to move the data around between platforms. A storage gateway can offer an interface that does the deduplicating for you.
This latter approach is the better one, since it puts you in control of the metadata. It’s inadvisable to have your data tied down to one piece of software or one provider. The storage industry, and especially the cloud industry, is constantly changing, and it’s not a given that a particular provider will even last five years. The approach also works well with groups of virtual machines, if your data isn’t suitable to go into the cloud.
It’s worth analysing your data and your provider to see whether your data is actually being deduplicated. Some backup systems achieve a deduplication ratio of more than 20:1, meaning the logical data they hold is more than twenty times the physical space required to store it.
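To make the ratio concrete, here is a small sketch of the arithmetic. The function names and the example sizes are assumptions chosen for illustration, not figures from any particular backup product.

```python
def dedup_ratio(logical_bytes, physical_bytes):
    """Return the deduplication ratio (logical size divided by physical size)."""
    if physical_bytes <= 0:
        raise ValueError("physical_bytes must be positive")
    return logical_bytes / physical_bytes


def space_saved_percent(logical_bytes, physical_bytes):
    """Return the percentage of logical capacity that needs no physical storage."""
    if logical_bytes <= 0:
        raise ValueError("logical_bytes must be positive")
    return 100 * (logical_bytes - physical_bytes) / logical_bytes


TB = 10**12  # one terabyte in bytes (decimal)

# Hypothetical example: 20 TB of logical backups held in 1 TB of physical space.
print(dedup_ratio(20 * TB, 1 * TB))          # prints: 20.0
print(space_saved_percent(20 * TB, 1 * TB))  # prints: 95.0
```

In other words, a 20:1 ratio corresponds to a 95% reduction in the physical space needed.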
How to Deduplicate Your Cloud Backups