Similarity Enhanced Transfer
Similarity-Enhanced Transfer (SET) is a technique for improving the speed at which peer-to-peer file sharing and content distribution systems can share data. SET works by finding similar copies of the desired file, and looking for subsets of those copies that match (or are similar to) subsets of the desired file. If these are found, the similar copies can be used as additional download sources, which can increase the download rate as long as the downloader's connection is not already saturated.
The developers of SET found that if a particular piece of content has several different versions available for download from a P2P network, there may be enough similarity between the files in the different releases that they can all be used as download sources for a single version. In particular, they found:
- MP3 music files with identical sound content but different header bytes (artist and title metadata or headers from encoding programs) were 99% similar.
- Movies and trailers in different languages were often 15% or more similar.
- Media files with apparent transmission or storage errors differed in a single byte or small string of bytes in the middle of the file.
- Identical content packaged for download in different ways (e.g., a torrent with and without a README file) were almost identical.
SET uses a technique called handprinting, which is based on earlier techniques known as "shingling" that have been used to filter junk e-mail, to seek out files that contain chunks of data similar to those in the requested file. The SET system computes a handprint for each file and can take chunks of data from files that are either identical or similar to the one being searched for. The lower the similarity threshold that SET searches for, the more sources for that data are likely to be found. The authors claim that the extra overhead of locating these sources does not outweigh the benefit of using them to help saturate the recipient's available bandwidth, and that exploiting similar sources can significantly improve download time.
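The idea of a handprint can be illustrated with a minimal sketch. It is not the SET implementation: the boundary rule (a CRC over a small sliding window), the window and chunk sizes, the choice of SHA-1, and the rule of keeping the k smallest chunk hashes are all assumptions chosen to keep the example short, and every function name is hypothetical.

```python
import hashlib
import zlib

# All constants below are illustrative assumptions, not values used by SET.
WINDOW = 16      # bytes examined when deciding on a chunk boundary
MASK = 0x3F      # boundary when the low 6 bits of the checksum are zero
MIN_CHUNK = 32   # never emit a chunk shorter than this (except the last)


def chunk(data: bytes) -> list:
    """Split data at content-defined boundaries.

    A boundary is placed where a checksum of the preceding WINDOW bytes
    has its low bits all zero, so boundaries follow the content rather
    than absolute offsets and survive bytes inserted earlier in the file.
    """
    chunks, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        at_boundary = zlib.crc32(data[i - WINDOW:i]) & MASK == 0
        if at_boundary and i - start >= MIN_CHUNK:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def chunk_hashes(data: bytes) -> set:
    """Hash every chunk; equal hashes indicate interchangeable chunks."""
    return {hashlib.sha1(c).hexdigest() for c in chunk(data)}


def handprint(data: bytes, k: int = 8) -> set:
    """Keep only the k smallest chunk hashes as a compact signature."""
    return set(sorted(chunk_hashes(data))[:k])


def maybe_similar(a: bytes, b: bytes, k: int = 8) -> bool:
    """Two files are candidate sources for each other if handprints overlap."""
    return bool(handprint(a, k) & handprint(b, k))


def shared_chunks(target: bytes, candidate: bytes) -> set:
    """Chunk hashes of the target that the candidate could also supply."""
    return chunk_hashes(target) & chunk_hashes(candidate)
```

Because every file keeps its smallest hashes rather than a random sample, two files that share many chunks are likely to select some of the same hashes, so a small fixed-size handprint is enough to discover candidates. Lowering the required overlap, or raising `k`, admits more candidate sources at the cost of more lookups.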
In tests, SET improved the transfer time of an MP3 music file by 71%, and a 55 MB movie trailer downloaded 30% faster when the researchers' techniques drew on movie trailers that were 47% similar. SET is expected to help most with less popular files; it is not believed to improve transfer rates much for popular data, which many peers are already downloading. Experiments suggest that in the other cases the improvement can be substantial.
Note, however, that SET can only improve download speed when the downloader's connection is not the bottleneck, which is more often the case for unpopular downloads.
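The bottleneck condition can be made concrete with a deliberately simplified model, in which the achieved rate is the smaller of the downloader's link capacity and the combined rate offered by all sources. The function name and the numbers are hypothetical, and real transfers are affected by many factors this sketch ignores.

```python
def download_rate(downlink_kbps: float, source_rates_kbps: list) -> float:
    """Simplified model: the rate is capped by the downloader's own link."""
    return min(downlink_kbps, sum(source_rates_kbps))


exact_sources = [200, 100]    # peers holding the exact file (assumed rates)
similar_sources = [250, 250]  # extra peers holding similar files (assumed rates)

# Unsaturated link: the extra sources raise the rate from 300 to 800.
before = download_rate(1000, exact_sources)
after = download_rate(1000, exact_sources + similar_sources)

# Saturated link: the downloader is the bottleneck, so nothing is gained.
capped_before = download_rate(300, exact_sources)
capped_after = download_rate(300, exact_sources + similar_sources)
```

In this model, additional sources only matter while their combined rate is below the downlink capacity, which is the situation an unpopular file with few exact sources tends to be in.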
SET was developed by Professor David Andersen of Carnegie Mellon University, Ph.D. student Himabindu Pucha of Purdue University, and Dr. Michael Kaminsky of Intel Research Pittsburgh. Andersen believes that this technique could be immediately used by developers and applied to the BitTorrent file sharing system.
SET could be used to improve the speed of:
- Himabindu Pucha, David G. Andersen, Michael Kaminsky (April 2007). "Exploiting Similarity for Multi-Source Downloads Using File Handprints". Purdue Univ., Carnegie Mellon Univ., Intel Research Pittsburgh. Retrieved 2007-04-15.
- "Speed boost plan for file-sharing". BBC News Online-Technology (BBC News). 2007-04-12. Retrieved 2007-04-13.