Electronic Discovery Center- Electronic Evidence and Discovery - Deduplication
Newsflash

An MD5 hash value is computed by applying the MD5 algorithm to the contents of a file. MD5 always produces a 128-bit value. It is very unlikely that two different documents would have the same hash value (the same 128-bit code), so by comparing the hash values against each other, you can identify duplicate documents, emails, etc.
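As a rough illustration of the idea, the sketch below hashes three short byte strings with Python's standard hashlib module. The helper name md5_hex and the sample texts are assumptions made up for this example; they do not come from any particular e-discovery tool.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the 128-bit MD5 digest of data as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


# Hypothetical sample contents, standing in for the bytes of real files.
original = b"Please review the attached contract."
copy = b"Please review the attached contract."
edited = b"Please review the attached contract!"

# Identical content yields an identical hash.
print(md5_hex(original) == md5_hex(copy))
# A single changed character yields a different hash.
print(md5_hex(original) == md5_hex(edited))
# 32 hex characters at 4 bits each make up the 128-bit value.
print(len(md5_hex(original)) * 4)
```

Because the hash depends only on the content, two files with different names or dates but the same bytes get the same value.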

 
Deduplication
Written by Administrator   
Wednesday, 08 March 2006

As described on Wikipedia:

In database maintenance, deduplication, which is sometimes referred to as referential integrity and by various other names, is the task of removing duplicate data from within a database. For example, similar rows featuring "J. Smith" and "John Smith" may well refer to the same individual, and those rows may need to be merged. This is often achieved with the merge/purge algorithm of Fellegi and Sunter.

In electronic discovery, deduplication is mostly used to compare email records within a single custodian, or across custodians, and to remove the duplicate records, which can yield substantial cost savings. Not only can email files (PST files and Lotus Notes databases) be deduplicated internally, but stand-alone electronic documents (edocs) can be as well. Most often, electronic evidence vendors compute a hash value for each file and then compare that hash value against those of the other files in a database.
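The difference between deduplicating within a custodian and across custodians can be sketched as follows. This is a minimal illustration, not any vendor's actual method: the Record fields, the deduplicate function, and the sample data are all hypothetical, and it hashes raw content bytes, whereas a real tool may hash selected email fields instead.

```python
import hashlib
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    custodian: str  # hypothetical field: person whose data was collected
    name: str       # hypothetical field: file or message label
    content: bytes  # raw bytes that get hashed (an assumption of this sketch)


def deduplicate(records: Iterable[Record], across_custodians: bool) -> List[Record]:
    """Keep the first record seen for each hash and drop later duplicates."""
    seen = set()
    unique = []
    for rec in records:
        digest = hashlib.md5(rec.content).hexdigest()
        # Within-custodian mode scopes the comparison key to one custodian,
        # so the same file held by two custodians is kept once for each.
        key = digest if across_custodians else (rec.custodian, digest)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Made-up sample data for illustration only.
records = [
    Record("alice", "memo.doc", b"Q3 budget memo"),
    Record("alice", "memo-copy.doc", b"Q3 budget memo"),
    Record("bob", "fwd-memo.doc", b"Q3 budget memo"),
    Record("bob", "notes.txt", b"Meeting notes"),
]

within = deduplicate(records, across_custodians=False)
across = deduplicate(records, across_custodians=True)
print([r.name for r in within])  # alice's copy is dropped; bob keeps his memo
print([r.name for r in across])  # only one memo survives across all custodians
```

Deduplicating across custodians removes more records, while deduplicating within each custodian preserves the fact that a given person held a copy.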

Reducing the responsive data in this way, before any imaging begins, is a means of bringing down the costs inherent in that activity.

Last Updated ( Wednesday, 08 March 2006 )
 
(C) 2008 Electronic Discovery Center- Electronic Evidence and Discovery