Silent Data Corruption: How It’s Detected & Fixed


What Is a Silent Data Error

Silent data corruption is the worst type of disk error. It is caused by bugs in drivers and disk firmware, memory errors, drive-head crashes, and similar software and hardware problems. Silent errors that occur while data is being written to disk are the most dangerous, because nothing indicates that the stored data is incorrect.

Silent errors go undetected by drive firmware and host operating systems. No matter what a “regular” system does, silent errors and the data corruption they cause will not be detected unless they damage a data structure whose integrity is verified by the system or application software. For example, a silent error in a file system table will likely be detected when the host OS becomes unable to locate files or their fragments, and a database server will notice a malformed table when the information describing the table structure is corrupted.

Combating Data Corruption

As drive capacities continue to grow, silent data corruption is becoming an ever-greater concern. Multiple independent studies have reported corruption rates ranging from one error per 67 GB to much higher rates.

Combating data corruption has long been a high-priority concern for computer hardware and software manufacturers. Some errors are detected using checksums stored in each drive sector. When disk operations generate multiple errors within a single sector, the drive may copy the information to a reserve sector and remap the failing sector to a healthy one, without assistance from the operating system.

Sun/Oracle’s ZFS file system was designed to combat data corruption by storing redundancy information on multiple levels. RAIDIX’s forward error-correction algorithm analyzes RAID metadata to detect and fix silent corruption during regular disk operations, without performance degradation. In addition, RAIDIX performs background scans for silent errors during lower-activity periods.

ZFS ties the information that protects data from corruption to its proprietary file system structures. RAIDIX, by contrast, implements error correction at the block level, which makes it compatible with any OS and any file system chosen by the platform’s administrator.

Common Approaches to Detect & Fix Silent Data Errors

The following approaches are commonly used by computer hardware and software vendors to detect and, if possible, to correct silent data errors.

1. Adding checksum information to each block.

This method works best with select SAS drives that can be reformatted with a larger-than-standard sector size. Checksums stored next to each data block are used during read operations to detect possible corruption. This approach reduces the usable disk capacity but has a relatively low impact on performance.

Most drives support only the default sector size, so checksums have to be stored in a different sector. Because this constantly requires additional sector reads, checksumming severely degrades performance in addition to consuming disk capacity.
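The idea of a per-block checksum can be illustrated with a minimal Python sketch. It is not any vendor's on-disk format: the 4-byte CRC32 trailer and the function names seal and verify are assumptions made for this example only.

```python
import zlib

CRC_LEN = 4  # assumed: a 4-byte CRC32 trailer appended to every block


def seal(block: bytes) -> bytes:
    """Return the block with its CRC32 appended (what would be written)."""
    return block + zlib.crc32(block).to_bytes(CRC_LEN, "big")


def verify(sealed: bytes) -> bytes:
    """Return the data block, or raise if the stored CRC32 does not match."""
    block = sealed[:-CRC_LEN]
    stored = int.from_bytes(sealed[-CRC_LEN:], "big")
    if zlib.crc32(block) != stored:
        raise ValueError("checksum mismatch: block is silently corrupted")
    return block
```

On a read, verify either hands back the original data or raises, so a flipped bit is reported instead of being passed to the application. A checksum of this kind only detects the corruption; correcting it still needs redundancy from elsewhere.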

2. Write-ReRead

After each write, the sector is read back and checksummed. The checksums of the data block before and after the write are compared to verify that the data was written correctly.

RAID systems can employ two additional instruments for silent error detection:

3. Vertical-parity

In addition to striping data and parity information, a secondary parity records checksums of the blocks written on each drive in the array. This vertical parity strip is used to detect data corruption.

4. Using Q redundancy of RAID-level 6 arrays to detect corruptions.

RAID6 arrays use a scheme called P+Q redundancy, which utilizes Reed-Solomon codes to protect against up to two disk failures. The Q checksum can also be used to verify data integrity and to detect data corruption.
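As a rough illustration of how P+Q redundancy can expose a silent error, the sketch below uses the textbook RAID6 arithmetic over GF(2^8) with polynomial 0x11d and generator 2. It is not RAIDIX's algorithm, which the text does not describe. The function names pq, locate_corruption and rebuild_from_p are hypothetical, and the sketch assumes at most one corrupted block per stripe.

```python
# GF(2^8) log/antilog tables (polynomial 0x11d, generator 2)
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two GF(2^8) elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def pq(blocks):
    """Compute P (XOR parity) and Q (sum of g^i * D_i) for equal-length blocks."""
    n = len(blocks[0])
    p, q = bytearray(n), bytearray(n)
    for i, blk in enumerate(blocks):
        for j, byte in enumerate(blk):
            p[j] ^= byte
            q[j] ^= gf_mul(EXP[i], byte)
    return bytes(p), bytes(q)


def locate_corruption(blocks, p, q):
    """Return "ok", "P", "Q", a data-block index, or "unknown"."""
    p2, q2 = pq(blocks)
    ps = bytes(a ^ b for a, b in zip(p, p2))  # P syndrome
    qs = bytes(a ^ b for a, b in zip(q, q2))  # Q syndrome
    if not any(ps) and not any(qs):
        return "ok"
    if not any(qs):
        return "P"  # only the P block disagrees
    if not any(ps):
        return "Q"  # only the Q block disagrees
    candidates = set()
    for a, b in zip(ps, qs):
        if a == 0 and b == 0:
            continue
        if a == 0 or b == 0:
            return "unknown"  # not explainable by one bad data block
        # A single bad data block z gives qs = g^z * ps, so z = log(qs) - log(ps)
        candidates.add((LOG[b] - LOG[a]) % 255)
    if len(candidates) == 1:
        z = candidates.pop()
        if z < len(blocks):
            return z
    return "unknown"


def rebuild_from_p(blocks, p, z):
    """Reconstruct data block z as the XOR of P with all other data blocks."""
    out = bytearray(p)
    for i, blk in enumerate(blocks):
        if i != z:
            for j, byte in enumerate(blk):
                out[j] ^= byte
    return bytes(out)
```

When both syndromes are nonzero and their ratio points to a single drive index, the corrupted block can be rebuilt from P and the remaining data blocks. The check uses only the parity a RAID6 array already stores, so it needs no extra checksum storage.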

How RAIDIX Combats Silent Data Corruption

RAIDIX developed a unique algorithm that uses the mathematical properties of RAID6 checksums to detect and correct silent data corruption. Because it relies only on standard RAID properties and needs no additional storage or read operations, it exhibits performance far exceeding competing approaches.