Validation, Test and Performance Characterization of DDR Memory Subsystems

Servers are the largest consumers of DDR memory, with some systems having up to 16 memory channels and 4 DIMMs per channel.  In general, the more DIMMs per channel, the slower the memory bus must run.  Servers, however, do not operate error free.  At a 2012 Server Design Summit, Facebook's Matt Cordery explained that memory failures were the #2 failure seen in Facebook's data center.  In 2009, Google published a landmark study describing a much higher than expected memory error rate across its fleet of servers.  FuturePlus Systems' Barbara Aichinger referenced this study in her 2012 MemCon presentation on JEDEC specification violations as a potential cause of these memory errors.  Most recently, Row Hammer has been publicized as the reason for such errors.  Row Hammer is so prone to causing undetectable data corruption that this site has dedicated an entire page to the problem.

Real Servers with Real Problems!

Below are a few papers and communications between the engineers at FuturePlus and server end users and data center employees in the industry.  They illustrate a growing problem: Cloud Computing is not safe.  The enemy is not the hackers but the companies that skimp on the validation process and sell products that are not of high quality.  In many cases, server memory subsystems are not being validated properly.  Design flaws, and in some cases known issues, are being deployed in the field.

Customer Report to a Data Center Engineer evaluating an early OCP Server for deployment in their data center.

Cloud Computing Expo 2013:  Cloud Downtime due to Memory Errors

JEDEC Paper: A response to the Google Study and the corresponding presentation given in Shenzhen, China

