DDR3 and DDR4 memory is pervasive: it is used in nearly all cloud server systems today, as well as in embedded and military applications. Most critical applications use error detection, and in some cases error correction, on their DDR3/DDR4 memory. However, these techniques do not guarantee error-free operation. When multiple bits fail in a single transfer, many error detection techniques fall short. Our dependence on DDR3 memory (and soon DDR4), combined with this known failure mechanism, should be a wake-up call for the industry. So far, the workaround has been to double the memory refresh rate. This is an attempt to ‘charge up’ the dormant memory cells so that they do not fall victim to adjacent rows that might become ‘hammered’. The workaround hurts performance and increases power consumption, and it only reduces the statistical probability of a failure. The problem is not going away.
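To give a rough sense of what doubling the refresh rate costs and what it buys, the sketch below works through the arithmetic. All timing values are assumptions chosen for illustration, not figures from a particular datasheet: a 64 ms refresh window covered by 8192 REFRESH commands, a refresh cycle time (tRFC) of 260 ns, and a row cycle time (tRC) of 50 ns between ACTIVATE commands to the same bank.

```python
# Illustrative arithmetic only. Every timing value below is an assumed,
# typical-order figure; substitute values from the actual device datasheet.

REFRESH_COMMANDS_PER_WINDOW = 8192  # assumed REFRESH commands per refresh window
T_RFC_NS = 260.0                    # assumed time the device is busy per REFRESH
T_RC_NS = 50.0                      # assumed minimum ACTIVATE-to-ACTIVATE time, same bank


def refresh_stats(window_ms):
    """Return (refresh interval in ns, fraction of time spent refreshing,
    upper bound on ACTIVATEs one row can receive within one refresh window)."""
    window_ns = window_ms * 1e6
    interval_ns = window_ns / REFRESH_COMMANDS_PER_WINDOW
    overhead = T_RFC_NS / interval_ns
    # Upper bound: ignores the time lost to the REFRESH commands themselves.
    max_activates = int(window_ns // T_RC_NS)
    return interval_ns, overhead, max_activates


if __name__ == "__main__":
    for label, window_ms in (("normal", 64.0), ("doubled", 32.0)):
        interval_ns, overhead, max_acts = refresh_stats(window_ms)
        print(f"{label:8s} interval={interval_ns:8.1f} ns "
              f"overhead={overhead:6.2%} max_activates={max_acts}")
```

Under these assumptions, halving the refresh window halves the number of ACTIVATE commands that can reach a row before its neighbors are refreshed, while roughly doubling the share of time the device spends refreshing. The exposure shrinks but does not disappear.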
What is Row Hammer?
In the quest to make memories smaller and faster, memory vendors have had to make trade-offs. One of these is very small physical geometries. Small geometries put memory cells very close together, so one memory cell’s charge can leak into an adjacent one and cause a bit flip. It has come to the attention of the industry that this is indeed happening under certain conditions. Put simply, the problem occurs when the memory controller, under command of the software, issues ACTIVATE commands to a single row address repeatedly. If the physically adjacent rows have not been activated or refreshed recently, charge from the over-activated row leaks into the dormant adjacent rows and causes a bit to flip. This failure mechanism has been named ‘Row Hammer’ because a row of memory cells is being ‘hammered’ with ACTIVATE commands. Once the failure occurs, a Refresh command from the memory controller solidifies the error in the memory cell. The current understanding is that the charge leakage does not damage the physical memory cell, which makes repeated memory tests aimed at finding the failing device useless.
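The sequence of events described above can be sketched as a toy model. This is not a model of real DRAM physics: the class and function names, the row count, and the disturbance threshold are all hypothetical, and the model assumes that consecutive row numbers are physically adjacent, which real devices do not guarantee.

```python
class ToyDram:
    """Toy model: tracks how many ACTIVATEs each row's neighbors have received
    since the row itself was last activated or refreshed."""

    def __init__(self, num_rows, disturb_threshold):
        self.num_rows = num_rows
        self.disturb_threshold = disturb_threshold  # hypothetical value
        self.disturbance = [0] * num_rows
        self.flipped = set()  # rows that have suffered a bit flip

    def activate(self, row):
        # Activating a row restores that row's own charge...
        self.disturbance[row] = 0
        # ...but disturbs the rows assumed to be physically adjacent.
        for victim in (row - 1, row + 1):
            if 0 <= victim < self.num_rows:
                self.disturbance[victim] += 1
                if self.disturbance[victim] >= self.disturb_threshold:
                    self.flipped.add(victim)

    def refresh_all(self):
        # Refresh recharges cells in whatever state they are in, so it
        # clears accumulated disturbance but does not undo earlier flips.
        self.disturbance = [0] * self.num_rows


def hammer(dram, row, count, refresh_every=None):
    """Issue `count` ACTIVATEs to one row, optionally refreshing periodically."""
    for i in range(1, count + 1):
        dram.activate(row)
        if refresh_every and i % refresh_every == 0:
            dram.refresh_all()


if __name__ == "__main__":
    unprotected = ToyDram(num_rows=16, disturb_threshold=1000)
    hammer(unprotected, row=5, count=1000)
    print("no refresh:", sorted(unprotected.flipped))

    refreshed = ToyDram(num_rows=16, disturb_threshold=1000)
    hammer(refreshed, row=5, count=1000, refresh_every=500)
    print("refresh every 500 ACTIVATEs:", sorted(refreshed.flipped))
```

In this model, the rows on either side of the hammered row flip only if the hammering reaches the threshold before a refresh intervenes. A flip that has already occurred stays recorded after a refresh, mirroring the point that Refresh locks the error in rather than correcting it.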
Paper explaining the Row Hammer Failure Mechanism in DDR3 memory
Video explaining Row Hammer Failures
How to detect whether your server/system is getting random memory errors due to Row Hammer failures
Video explaining how to detect the "Row Hammer" event using the DDR Detective®
Search for the latest on DDR memory failures due to Row Hammer