
Data Deduplication

What Is Data Deduplication?

Data deduplication is the removal of duplicate data in a way that maintains the integrity of the system, as well as the functioning of applications dependent on the data being cleaned. The need for data deduplication arises when data gets copied in your system. There are a few reasons why this may happen:

  1. An application making a copy of a file it uses: Applications sometimes copy files so they are more readily available among the application's dependencies. Every application has dependencies that it uses, or "depends on," to function. For instance, a music application needs to reference the music on your device in order to play it. In this case, one of the prime dependencies is the library of songs on your computer, tablet, phone, or other device. When you play a song, the application may reach into that area of your device and play the song from there. However, some apps are set to make a copy of the item on which they depend. Apple's iTunes can, for instance, be set to copy every song you add to its library, which results in duplicates on your computer. In many cases, you do not need another copy of the duplicated data, and deduplication can get rid of these excess files.
  2. Data gets duplicated during backups: Often, when an application creates a backup, it duplicates the data it is backing up. When the backup process finishes, you end up with two identical sets of data, and restoring from the backup gives you a copy of the data you are looking for. Some systems already have deduplication in place to prevent this. For example, each time you back up your PC or Mac, the size of the backup varies based on the changes and additions you have made to your system. Your computer examines what was there during the last backup, compares it to what it is about to back up, and then backs up only the new or changed files.
  3. Music and video creation software duplicating data: Music and video files are some of the largest files on your computer. With some music-creation software, such as older versions of Cakewalk, applying an effect to a clip of music copies the clip itself. This lets the program work from its own exclusive, readily available copy, reducing memory consumption and making the most of the hard drive's speed. A similar thing may happen with some video-creation software: the video file is copied, and it is this new file the program works with. With data deduplication, once you have ensured that all video and music software references the original files on your computer, you can delete any duplicates, freeing up space.
  4. Duplicate disk images on virtual machines: In addition to doubles of individual files, a network may also depend on duplicate disk images stored on virtual machines. If a network keeps 300 images of Windows 10 at roughly 15 GB each, for instance, that consumes about 4.48 terabytes of storage. With a deduplication system, however, you can store a single copy of the Windows 10 image and have every virtual machine reference it.

When the deduplication engine finds data that already exists elsewhere in the storage system's environment, the system can keep a pointer in place of the copied data. The pointer is far smaller than the data itself, similar to a link or an application's icon on your desktop. The copy can then be deleted, and any process that needs the data is directed to the single original. This saves significant space.
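As a rough illustration of this pointer mechanism, here is a minimal sketch in Python (hypothetical, not any vendor's actual implementation): each unique block is stored once, keyed by its SHA-256 hash, and every repeat write costs only a small reference.

```python
import hashlib

class DedupStore:
    """Minimal content-addressed store: one copy per unique block."""

    def __init__(self):
        self.blocks = {}    # hash -> block bytes, stored exactly once
        self.pointers = []  # write order -> hash, a tiny reference each

    def write(self, block: bytes) -> str:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.blocks:
            self.blocks[digest] = block  # first occurrence: keep the data
        self.pointers.append(digest)     # duplicates cost only a pointer
        return digest

    def read(self, index: int) -> bytes:
        return self.blocks[self.pointers[index]]

store = DedupStore()
store.write(b"chorus")
store.write(b"verse")
store.write(b"chorus")      # duplicate: no new block is stored
print(len(store.blocks))    # 2 unique blocks kept
print(len(store.pointers))  # 3 logical writes recorded
```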

Exponential Growth in Data

The amount of data companies use has been increasing at an exponential rate. According to a recent study, the amount of data created worldwide is projected to exceed 149 zettabytes in 2024. Because of this, the cost of storing data is also increasing.

When you factor in the duplicates of data that constantly pile up on servers and workstations, the growth curve of data can present serious budgetary and security concerns. The more data you have, the more storage you need. As you acquire more storage, your attack surface increases, giving criminals a wider target to aim for. Duplicate data unnecessarily exacerbates this issue. With data deduplication, you can get rid of space-devouring files, reduce your storage costs, and pave the way for a more secure IT system.

How Data Deduplication Works

Data deduplication solutions search for duplicate data using a variety of methods and then delete the copies. To ensure that all apps that depend on the data can still function properly, data deduplication software deletes only copies whose removal will not negatively impact app performance.

In some cases, data deduplication vendors may be able to recommend adjustments to your architecture to create opportunities for the deduplication to free up even more space. There are also data deduplication storage services that can both dedupe data and store it as a backup or for immediate access by your system.

In-line vs. Post-process Deduplication

In-line and post-process deduplication accomplish the same general objective using two different methods. With post-process deduplication, the system first stores new data on a storage device. Later, a separate process checks the stored data for duplicates.

With in-line deduplication, on the other hand, the detection of duplicates happens as the data first enters the target storage device. In this scenario, if the data storage system finds a duplicate block of data, it stores only a reference to the block it is a copy of.

In-line deduplication comes with the benefit of requiring less storage and generating less network traffic because duplicate data never gets stored. However, there is a trade-off: finding the duplicates requires processor-heavy calculations. This can reduce throughput and cause lags in processes that depend on the data, as well as consume large amounts of power.
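To make the contrast concrete, here is an illustrative post-process pass in the same style (a sketch under the assumption of fixed-size, already-written blocks, not a production design): the data was stored as-is at write time, and a later sweep replaces duplicates with references. The write method in the earlier sketch, which checks the hash before anything is stored, corresponds to the in-line approach.

```python
import hashlib

def post_process_dedup(stored_blocks):
    """Sweep blocks that were written as-is earlier, replacing
    duplicates with references to the first copy seen."""
    seen = {}  # hash -> index of the kept original
    refs = {}  # index of removed duplicate -> index of original
    for i, block in enumerate(stored_blocks):
        digest = hashlib.sha256(block).hexdigest()
        if digest in seen:
            refs[i] = seen[digest]   # duplicate: record a reference...
            stored_blocks[i] = None  # ...and reclaim its space
        else:
            seen[digest] = i         # first occurrence: keep it
    return refs

blocks = [b"A", b"B", b"A", b"C", b"B"]
print(post_process_dedup(blocks))  # {2: 0, 4: 1}
```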

Source Deduplication vs. Target Deduplication

You can also categorize data deduplication solutions according to where their processes happen. When you have data deduplication happening close to the area where data gets created, this is called source deduplication. If the process happens close to where the data gets stored, it is referred to as target deduplication.

With source deduplication, the deduplication process typically occurs right inside the file system. The file system scans new files and, as it does so, creates hashes representing each one. These hashes are then compared to the hashes that already exist. If a match is found, the copy is removed, and the newer file is made to point to the original, older one.
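A simplified sketch of that file-level scan might look like the following (hypothetical; a real file system does this incrementally and transparently rather than sweeping a whole directory at once). Later duplicates are replaced with hard links that point back at the first copy seen.

```python
import hashlib
import os

def source_dedup(directory):
    """Hash each file in a directory and replace later duplicates
    with a hard link to the first copy encountered."""
    seen = {}  # content hash -> path of the original file
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen:
            os.remove(path)              # drop the duplicate...
            os.link(seen[digest], path)  # ...and point at the original
        else:
            seen[digest] = path

# source_dedup("/path/to/media")  # hypothetical directory
```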

However, because deduplicated files are still treated as independent entities, if one of them is altered later on, the system creates a new copy of it. Also, when the deduplication system itself gets backed up, duplicates can be made during that process as well. For these reasons, source deduplication may not be an ideal solution for every organization.

Target deduplication, because it happens close to the storage destination, does not involve the server in the deduplication process. This may result in lower computational requirements on the server side. However, with target deduplication, you end up having more data transferred across the network because it all has to be checked for duplicates near the storage solution. This could add a burden to network resources.

Hardware-based vs. Software-based Deduplication

With software deduplication, the data cleaning happens on the machine whose data is being examined for duplicates. A software program examines the data, and when it finds a duplicate, it takes a predetermined action, such as assigning a pointer to the older version. Hardware-based deduplication services use a separate hardware device to inspect data and remove duplicates.

As with most hardware vs. software paradigms, there are trade-offs. For example, even though a software solution is typically less expensive to run, it can be harder to install and maintain: you have to install agents within the storage server to ensure adequate data redundancy and storage utilization.

Even though hardware deduplication may come with better performance and easier scalability, it is often the more expensive solution, so it may only make sense for enterprises or bigger organizations with large amounts of data.

Data Deduplication Types

Some of the primary data deduplication types include source and target deduplication, as described above, as well as client-side and asynchronous deduplication. 

With client-side deduplication, the deduplication happens on the client, and only after it is complete does the data get sent to the server for storage. Asynchronous deduplication happens in phases: first, all the data gets written; then, at intervals, the deduplication system examines the data, tagging what is new, removing copies, and inserting pointers to the older data that was duplicated.
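A minimal sketch of the client-side variant (a hypothetical protocol, with the server simulated by a dictionary): the client hashes its chunks first, asks which hashes the server already holds, and uploads only the chunks that are missing.

```python
import hashlib

server_store = {}  # hash -> chunk bytes, standing in for server storage

def server_missing(hashes):
    """The server reports which chunk hashes it does not yet hold."""
    return [h for h in hashes if h not in server_store]

def client_upload(chunks):
    """The client hashes locally and sends only unseen chunks."""
    digests = {hashlib.sha256(c).hexdigest(): c for c in chunks}
    for h in server_missing(list(digests)):
        server_store[h] = digests[h]

client_upload([b"report-v1", b"logo"])
client_upload([b"report-v2", b"logo"])  # "logo" is never re-sent
print(len(server_store))                # 3 unique chunks stored
```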

What Is the Difference Between Data Deduplication and Data Encryption?

Data encryption works somewhat like data compression in that your data is transformed and replaced with ciphertext that represents it. When the data is decrypted using a key, it is once again readable and usable by people and systems.

Data deduplication is different from encryption in that it is designed to get rid of copies of data. The process of assigning hashes to represent data may look similar to what happens during encryption, but the purpose is to label data so that when matching labels show up, the duplicated data can be eliminated. Encryption is therefore a security measure, while deduplication is a way of freeing up resources and maximizing their effectiveness.
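The difference is easy to see in code. In this sketch (illustrative only, using the third-party cryptography package), hashing the same content always yields the same label, which is exactly what lets a deduplication engine find matches, while encryption produces scrambled output that only the key can reverse and that is useless as a matching label.

```python
import hashlib
from cryptography.fernet import Fernet  # pip install cryptography

data = b"quarterly-report"

# Hashing: identical input always yields an identical label,
# so duplicates can be found by comparing hashes.
print(hashlib.sha256(data).hexdigest() == hashlib.sha256(data).hexdigest())  # True

# Encryption: reversible only with the key, and encrypting the
# same data twice yields different ciphertexts.
key = Fernet.generate_key()
f = Fernet(key)
token1, token2 = f.encrypt(data), f.encrypt(data)
print(token1 == token2)           # False: useless as a dedup label
print(f.decrypt(token1) == data)  # True: the key restores the original
```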

Benefits of Data Deduplication

Data deduplication provides several benefits that can directly impact how smoothly your digital infrastructure functions and how you use your resources. The volume of data you store plays a significant role in how much space remains available in your system and how consistently your network or computer can access the data it needs to perform essential functions.

For example, data deduplication enables you to do the following:

Achieve More Backup Capacity

By getting rid of redundant data within a backup system, you free up space that you can use for future backups. Because duplicate data can add up over time, you may not even realize how much space you can reclaim until you have gone through the deduplication process.

Retain Data for Longer Periods of Time

When you reclaim free space as a result of deduplication, you get to store data longer in your backup system. Many backup processes involve getting rid of older data to make room for newer, potentially more relevant data. With deduplication, your system does not have to get rid of the older data as frequently because there is less need to free up space.

Verify the Integrity of Backup Data

As you execute the deduplication process in a backup system, you compare stored data with the primary data it is supposed to mirror. The objective is for the backup data to match the primary data your systems depend on, but without unnecessary redundancy. In this sense, deduplication gives you a thorough examination of the backup data versus what you need to back up. Consequently, you get an additional set of checks and balances you can use to verify the integrity of your data.

Is Data Deduplication Safe?

Data deduplication can be safe as long as certain vulnerabilities and weaknesses are accounted for. Some of these include:

  1. The integrity of the file system: With some solutions, a file system is used to run the process. This file system needs to be shielded from viruses and other threats using a solution like a next-generation firewall (NGFW), for example.
  2. The integrity of the index: The various pointers that tell the system to reference the original data where the copy used to be are kept within an index. This index needs to be protected from corruption.
  3. In-place upgrades: You have to ensure your deduplication system still functions after software or hardware gets updated. Otherwise, the deduplication process may not work well with the updated version of the software or hardware.
  4. Many systems still require a tape backup: As you accumulate more and more data, older data may bog down your system—even with a deduplication system in place. At this point, it would still be advisable to store older files on a tape-based system.

How Fortinet Can Help

Fortinet has partnered with Keysight Technologies to provide a flexible data deduplication system, as well as an in-depth security structure. The data gets sent through Ixia’s visibility solutions that examine packets of data as they move through the network. The data then gets sent to network packet brokers (NPBs) that are responsible for processing the data. At this point, the data goes through deduplication. 

Also, any unnecessary header information or packets can be removed from the data. The data then gets forwarded to a FortiGate NGFW that analyzes it according to the organization’s security needs. In this way, Ixia’s solution is able to reduce the workload of a FortiGate or another Fortinet device, ensuring it is not wasting compute power on inspecting duplicate data.

FAQs

What is data deduplication?

Data deduplication is the removal of duplicate data in a way that maintains the integrity of the system, as well as the functioning of applications dependent on the data being cleaned.

How does data deduplication work?

Data deduplication solutions search for duplicate data using a variety of methods and then delete the copies.