Common hard drive errors can manifest in a variety of ways, but in some systems such errors go completely unreported and even undetected. Although the problem is sometimes referred to as a silent read failure, these errors can just as easily occur on a write. As such, the phrase "silent data corruption" is a much more accurate descriptor of the problem at hand.
A Very Real Problem
Although the typical computer user rarely considers it, silent data corruption can be a very real and significant problem. That's exactly why scientists at CERN, the world's largest particle physics laboratory, spent more than a month researching silent data corruption within their own infrastructure and hardware.
The results of their study, which scrutinized data corruption in the form of hard disk errors, RAID system errors and memory errors, revealed some amount of data corruption in every single case. While these results are limited to CERN's hardware and infrastructure, it's easy to see how other organizations and enterprises could arrive at similar statistics. Moreover, it's easy to see how such corruption could wreak havoc on an unprepared organization.
Breaking Down the Numbers
To measure disk errors, the team at CERN wrote a 2 GB file to more than 3,000 nodes simultaneously every two hours, after which the file was read back to the origin system. A total of 500 errors were found across 100 different nodes over a period of five weeks.
Single-bit errors and sector-sized errors each accounted for 10% of disk errors, while the remaining 80% were attributed to an incompatibility between WD's firmware and the 3Ware controllers used by CERN. This problem has since been resolved.
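The write-and-read-back test described above can be sketched in a few lines of Python. This is only a minimal illustration of the idea, not CERN's actual tooling: the function names `write_probe` and `read_digest`, the use of SHA-256 and the small probe size are all assumptions made for the example.

```python
import hashlib
import os
import tempfile

CHUNK = 1024 * 1024  # process the file 1 MiB at a time


def write_probe(path, size_bytes):
    """Write size_bytes of random data to path and return its SHA-256 digest.

    Hypothetical helper: the digest is computed from the data in memory,
    before it ever touches the disk.
    """
    digest = hashlib.sha256()
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = os.urandom(min(CHUNK, remaining))
            digest.update(chunk)
            f.write(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    return digest.hexdigest()


def read_digest(path):
    """Read the file back from storage and return its SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        probe = os.path.join(tmp, "probe.bin")
        expected = write_probe(probe, 8 * CHUNK)  # small stand-in for a 2 GB file
        if read_digest(probe) == expected:
            print("read-back matches what was written")
        else:
            print("silent corruption detected")
```

One caveat: on most operating systems a read that follows a write this closely is likely to be served from the page cache rather than from the disk, so a realistic probe would need to bypass or flush that cache before reading.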
The team at CERN took a different approach when trying to identify potential RAID errors. After the verify command was run on nearly 500 RAID systems each week over the course of one month, the team counted approximately 300 errors in the reading and writing of 2.4 petabytes of data. While this results in a bit error rate (BER) that is slightly lower than advertised, such errors could still do a lot of damage to projects as advanced and sensitive as CERN's.
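For readers who want to see how a bit error rate is derived from figures like these, here is a rough sketch. It rests on two assumptions that the study summary does not confirm: that each reported error corresponds to a single flipped bit, and that a petabyte means 10^15 bytes. The result is therefore an order-of-magnitude estimate only.

```python
BITS_PER_BYTE = 8
PETABYTE = 10**15  # assumption: decimal petabyte


def bit_error_rate(error_count, bytes_transferred):
    """Errors per bit transferred, treating every error as one bad bit."""
    return error_count / (bytes_transferred * BITS_PER_BYTE)


if __name__ == "__main__":
    # Figures quoted above: roughly 300 errors over 2.4 petabytes.
    ber = bit_error_rate(300, 2_400 * 10**12)
    print(f"estimated BER: {ber:.1e}")
```

If a single reported error actually covers many bad bits, or if the data volume counts reads and writes differently, the true rate will differ from this estimate.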
Finally, the team found only three errors spread across 1,300 nodes while analyzing their systems for memory errors over the span of three months. While this may seem like good news on the surface, all of these errors were double-bit errors, which cannot be corrected.
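Why a double-bit error is so much worse than a single-bit one becomes clearer with a toy example. The sketch below assumes the memory is protected by a single-error-correcting, double-error-detecting (SECDED) code, which is common in server memory, although the summary above does not say what CERN's nodes used. It uses an 8-bit extended Hamming code protecting 4 data bits, far smaller than real ECC words, and the function names are invented for the illustration.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds the overall parity bit; indexes 1..7 form Hamming(7,4)
    with parity bits at positions 1, 2 and 4.
    """
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # make the total number of 1 bits even
    return code


def decode(received):
    """Return (status, nibble); nibble is None when the data is unrecoverable."""
    code = list(received)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position  # XOR of the positions of all set bits
    odd_parity = sum(code) % 2 == 1

    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # One bit flipped; the syndrome is its position (0 = overall parity bit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is non-zero: two bits flipped.
        # The error is detected, but its location cannot be determined.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)

    one_flip = list(word)
    one_flip[5] ^= 1
    print(decode(one_flip))

    two_flips = list(word)
    two_flips[5] ^= 1
    two_flips[6] ^= 1
    print(decode(two_flips))
```

With one flipped bit the decoder can locate and repair the damage. With two, it can only report that something is wrong, which is exactly the situation the three memory errors above represent under this assumption.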
Additional Studies
CERN isn't the only organization that has attempted to take on the problem of silent data corruption in the 21st century. The team at NEC, a leading provider of networking and communications hardware, completed its own research on the subject. The results, published in a 2009 whitepaper, draw a direct link between silent data corruption within storage arrays and unrecoverable system failures. According to NEC's numbers, as many as 10% of catastrophic storage failures are a result of such data corruption.
How Silent Data Corruption Can Affect Your Data