Chipkill
Chipkill is IBM's trademark for a form of advanced error-correcting code (ECC) memory technology, used mainly in server and workstation DRAM subsystems. It protects against memory errors that standard ECC memory cannot correct, including the failure of an entire memory chip.
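Standard ECC memory typically uses a single-error-correcting, double-error-detecting (SECDED) code: it can correct a single-bit error and detect a double-bit error within each memory word. However, when a DRAM chip that supplies several bits of the same word fails completely, the result is a multi-bit error that SECDED cannot correct and may not even detect, which typically leads to data loss and potential system instability. Chipkill addresses this by arranging the data and check bits across multiple DRAM chips so that the failure of any one chip stays within what the code can correct.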
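The single-bit correction and double-bit detection behaviour of standard ECC can be illustrated with a toy example. The sketch below assumes an extended Hamming code over an 8-bit word (4 data bits, 4 check bits), chosen only for brevity; ECC modules commonly use 72-bit words (64 data bits plus 8 check bits), and the function names secded_encode and secded_decode are hypothetical, not part of any real memory controller interface.

```python
def secded_encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming word.

    Index 0 holds the overall parity bit; indices 1..7 follow the
    classic Hamming layout (check bits at 1, 2, 4; data at 3, 5, 6, 7).
    """
    c = [0] * 8
    c[3], c[5], c[6], c[7] = data
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]
    return c


def secded_decode(word):
    """Return (data, status); data is None when the error is uncorrectable."""
    c = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    overall = 0
    for bit in c:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: double-bit error.
        return None, "uncorrectable"
    return [c[3], c[5], c[6], c[7]], status


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    word = secded_encode(data)

    single = list(word)
    single[5] ^= 1                      # one flipped bit
    print(secded_decode(single))

    double = list(word)
    double[2] ^= 1
    double[6] ^= 1                      # two flipped bits
    print(secded_decode(double))
```

With three or more flipped bits in the same word, as a failed multi-bit chip can produce, a decoder of this kind may miscorrect or report no error at all, which is the gap Chipkill is meant to close.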
The fundamental principle of Chipkill is to stripe the data and ECC information in such a way that the failure of an entire single DRAM chip does not result in uncorrectable errors. If a chip fails completely, the system can reconstruct the lost data using the redundant information stored on the remaining functional chips. This reconstruction is transparent to the operating system and applications.
The specific implementation of Chipkill varies between memory vendors and server architectures. However, a common approach, often called bit scattering, distributes the bits of each ECC word across different physical chips on the memory module so that any one chip contributes at most one bit to a given word. A complete chip failure then appears as a single-bit error in each affected word, which the ECC can correct without data loss.
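The effect of spreading a word's bits across chips can be sketched with a small simulation. This is a toy model under stated assumptions, not any vendor's implementation: it uses a Hamming(7,4) single-error-correcting code, seven hypothetical 4-bit-wide chips, and models a chip failure as every bit on that chip being inverted. All function names are illustrative.

```python
def hamming74_encode(data):
    """Encode 4 data bits into 7 bits (check bits at positions 1, 2, 4)."""
    c = [0] * 8                         # index 0 unused, positions 1..7
    c[3], c[5], c[6], c[7] = data
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c[1:]


def hamming74_decode(code):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = [0] + list(code)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos
    if syndrome:
        c[syndrome] ^= 1                # syndrome names the flipped position
    return [c[3], c[5], c[6], c[7]]


def scatter(codewords):
    """Scattered layout: chip i stores bit i of each of the 4 codewords."""
    return [[cw[i] for cw in codewords] for i in range(7)]


def gather(chips):
    """Inverse of scatter: rebuild the 4 codewords from the 7 chips."""
    return [[chips[i][k] for i in range(7)] for k in range(4)]


def pack_naive(codewords):
    """Naive layout: consecutive bits fill each chip, 4 bits per chip."""
    flat = [bit for cw in codewords for bit in cw]
    return [flat[4 * i:4 * i + 4] for i in range(7)]


def unpack_naive(chips):
    """Inverse of pack_naive."""
    flat = [bit for chip in chips for bit in chip]
    return [flat[7 * k:7 * k + 7] for k in range(4)]


def kill_chip(chips, dead):
    """Model a whole-chip failure by inverting every bit on one chip."""
    return [[b ^ 1 for b in chip] if i == dead else list(chip)
            for i, chip in enumerate(chips)]


def read_back(codewords, pack, unpack, dead):
    """Store codewords, fail one chip, then decode what is read back."""
    damaged = kill_chip(pack(codewords), dead)
    return [hamming74_decode(cw) for cw in unpack(damaged)]


if __name__ == "__main__":
    data = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1]]
    codewords = [hamming74_encode(d) for d in data]
    print("scattered ok:", read_back(codewords, scatter, gather, 2) == data)
    print("naive ok:    ", read_back(codewords, pack_naive, unpack_naive, 2) == data)
```

In the scattered layout the dead chip costs each codeword exactly one bit, so every word is recovered. In the naive layout the same failure puts four flipped bits into a single codeword, which exceeds what a single-error-correcting code can repair.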
Although "Chipkill" is an IBM name, the term is often used generically to describe this type of robust error correction. Other vendors market similar technologies under their own names, such as Extended ECC (Sun Microsystems), Chipspare (Hewlett-Packard) and Single Device Data Correction, or SDDC (Intel).
The primary benefit of Chipkill technology is increased system uptime and data integrity. By tolerating complete chip failures, Chipkill helps prevent system crashes and data corruption caused by memory errors. This is particularly important in mission-critical server environments where high availability is paramount.