WHAT IS THIS TECHNOLOGY? Many RAID (redundant array of inexpensive disks) storage systems employ data replication or error correction coding to support automatic recovery of data when disk drives fail, but most still require drive maintenance. Most often, maintenance means hot-plug drive replacement after a single fault, which initiates either data migration and restoration from replicated sources or error correction recoding and recovery. Longer rebuild times increase the risk of a double fault and data loss. To minimize rebuild time and reduce the risk of data loss, replacement disk drives must be kept on hand and arrays must be closely monitored.
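The link between rebuild time and double-fault risk can be illustrated with the classic textbook approximation for an array that tolerates a single drive failure. The sketch below is illustrative only: the function name, drive count, drive MTTF, and rebuild times are assumed values chosen for the example, not Atrato measurements, and the formula assumes independent, exponentially distributed drive failures.

```python
def mttdl_single_fault_tolerant(n_drives, drive_mttf_hours, rebuild_hours):
    """Approximate mean time to data loss (in hours) for an array that
    survives one drive failure.

    Data is lost when a second drive fails while the first failure is
    still being rebuilt, which gives the textbook approximation:
        MTTDL ~= MTTF^2 / (N * (N - 1) * MTTR)
    """
    if n_drives < 2:
        raise ValueError("need at least two drives")
    if rebuild_hours <= 0:
        raise ValueError("rebuild time must be positive")
    return drive_mttf_hours ** 2 / (n_drives * (n_drives - 1) * rebuild_hours)


# Hypothetical inputs: 8 drives, 500,000-hour drive MTTF.
fast_rebuild = mttdl_single_fault_tolerant(8, 500_000, rebuild_hours=12)
slow_rebuild = mttdl_single_fault_tolerant(8, 500_000, rebuild_hours=48)

# MTTDL is inversely proportional to rebuild time in this model.
print(f"MTTDL ratio (12 h vs 48 h rebuild): {fast_rebuild / slow_rebuild:.1f}")
```

In this model, any delay before the rebuild starts (for example, waiting for an operator to fetch and insert a replacement drive) counts toward the rebuild window and lowers MTTDL proportionally.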
WHAT DOES THIS TECHNOLOGY DO? Given the cost of stocking replacement disk drives and of operator monitoring, Atrato Inc. has built spare capacity into its SAID (self-maintaining array of identical disks) architecture for fully automatic fail-in-place recovery that requires no monitoring and minimizes data loss exposure. The Atrato system’s unique approach eliminates drive tending and minimizes the risk of data loss over a three-year operational lifetime. This design provides superior MTTDL (mean time to data loss), high service availability, lower cost of ownership, and minimal spare capacity requirements, and it enables mostly unattended deployments.
Atrato’s fail-in-place design is a fundamental feature of the Atrato system, enabling unmatched actuator density, IO performance, and storage density with zero maintenance. Users do not want to care for storage systems. What they want is worry-free capacity, high performance, and no loss of service or performance if components in the system fail. The Atrato system uses fail-in-place throughout and takes maximum advantage of this alternative to FRUs (field replaceable units). The only FRUs in the Atrato system are fans and cables, and both can be over-provisioned so that no FRU servicing is immediately critical. Given the relentless increase in drive capacity, very few users operate drives longer than three years, and straightforward end-of-life migration from one Atrato SAID to another makes upgrade planning simple.
Fail-in-place is not only fundamentally less expensive than tending hot-plug RAID arrays, it is also safer. Requiring an operator to swap out a failed drive delays the start of data migration and can sometimes result in inadvertent data loss due to human error. In a conventional array, spare capacity is kept on hand as shelved drives that may or may not have been recently tested, burnt in, or scrubbed for sector errors. By comparison, spare capacity managed by the Atrato Virtualization Engine (AVE) is constantly scrubbed at a low background rate and provides hot sparing. Furthermore, given the AVE hot spare management scheme, there is no reason to pull failed drives, which would again risk human error, so the system can be sealed, vastly simplified, and packaged at a lower cost.
In addition, the failed drives managed by the AVE can be spun down and unlinked to isolate them and reduce power usage. With the fail-in-place strategy, the Atrato SAID has been designed to host higher actuator density and higher overall storage density than any other array ever built. Hundreds of drives are contained in a 3RU (rack unit) array, providing up to 50 terabytes of total capacity. More importantly, hundreds of concurrently operating actuators mean that the Atrato system can provide multi-gigabyte IO from this 3RU SAID and tens of thousands of IOs per second with no cache. Fail-in-place not only makes this simplification and unparalleled performance possible, but also lowers total cost of ownership and simplifies administration to the point that the Atrato system can be mostly unattended.
Almost as fundamental to the Atrato system as fail-in-place is FDIR (fault detection, isolation, and recovery). This terminology was coined by the aerospace industry to describe automation for highly reliable and available systems. Deep space probes must employ FDIR automation simply because operator attendance is either not possible or not practical. Central to FDIR design are over-provisioning and redundancy. When faults are detected, spare resources and redundant components are employed to prevent service interruption. The space environment is much harsher than the data center environment, so FDIR is simpler and lower cost for RAID arrays like the Atrato system. Surprisingly, though, many FDIR concepts are not employed in other RAID arrays. The Atrato system design enables cost-effective and powerful FDIR features that differentiate the product from all other SAN/NAS devices.
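The detect, isolate, and recover cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AVE implementation: the class names, the error-count threshold used for detection, and the first-available-spare policy are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class DriveState(Enum):
    ACTIVE = "active"
    SPARE = "spare"        # over-provisioned capacity held in reserve
    ISOLATED = "isolated"  # failed in place; never physically pulled


@dataclass
class Drive:
    drive_id: int
    state: DriveState
    error_count: int = 0


@dataclass
class Array:
    drives: list
    error_threshold: int = 3  # hypothetical detection criterion

    def detect(self):
        """Fault detection: active drives whose error count crossed the threshold."""
        return [d for d in self.drives
                if d.state is DriveState.ACTIVE
                and d.error_count >= self.error_threshold]

    def isolate(self, drive):
        """Isolation: take the failed drive out of service, leaving it in place."""
        drive.state = DriveState.ISOLATED

    def recover(self):
        """Recovery: promote the first available spare, or None if none remain."""
        for d in self.drives:
            if d.state is DriveState.SPARE:
                d.state = DriveState.ACTIVE
                return d
        return None

    def fdir_pass(self):
        """One FDIR cycle; returns (failed_id, replacement_id or None) pairs."""
        events = []
        for failed in self.detect():
            self.isolate(failed)
            spare = self.recover()
            events.append((failed.drive_id, spare.drive_id if spare else None))
        return events


array = Array([Drive(0, DriveState.ACTIVE, error_count=5),
               Drive(1, DriveState.ACTIVE),
               Drive(2, DriveState.SPARE)])
print(array.fdir_pass())
```

A real system would also rebuild the failed drive's data onto the promoted spare; that step is omitted here to keep the control flow visible.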
In addition, RAID mappings in the AVE are integrated with IO to detect corrupted data, handle failures, and improve performance, error handling, and overall system reliability. The RAID replicated regions for mirrors, RAID-5 parity, or RAID-6 parity and error correction codes are mapped so that they are striped, integrated with error handling, and plugged into a common overall FDIR framework. The methods of data corruption detection, recovery, and encoding/replication are specific to each mapping, but detection, isolation, and heuristically guided recovery are common to all RAID mappings. Overall, MTTDL is the important figure of merit. The tight integration of RAID, FDIR, and data protection mechanisms (data digest options) provides simple but very scalable mechanisms to maximize MTTDL even in very high bandwidth, high capacity, and high workload usage scenarios. Finally, unique mapping algorithms, rebuild strategies, and continued operation during rebuilds provide best-in-class performance.
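As an illustration of how a data digest can be tied into the read path of a mirrored mapping, the sketch below stores a checksum with each block, verifies it on read, and falls back to the mirror copy when the primary fails verification. It is a simplified, hypothetical example: the class and method names are invented, CRC-32 stands in for whichever digest options the AVE actually offers, and in-memory dictionaries stand in for drives.

```python
import zlib


class MirroredBlockStore:
    """Toy RAID-1 style store with a per-block digest checked on every read."""

    def __init__(self):
        self.primary = {}  # block_id -> bytes (stand-in for drive A)
        self.mirror = {}   # block_id -> bytes (stand-in for drive B)
        self.digests = {}  # block_id -> CRC-32 of the data as written

    def write(self, block_id, data):
        self.primary[block_id] = data
        self.mirror[block_id] = data
        self.digests[block_id] = zlib.crc32(data)

    def read(self, block_id):
        expected = self.digests[block_id]
        for copy in (self.primary, self.mirror):
            data = copy[block_id]
            if zlib.crc32(data) == expected:
                # Recovery: rewrite both copies from the verified one.
                self.primary[block_id] = data
                self.mirror[block_id] = data
                return data
        raise IOError(f"block {block_id}: all copies failed digest check")


store = MirroredBlockStore()
store.write(7, b"payload")
store.primary[7] = b"pAyload"  # simulate silent corruption on one copy
recovered = store.read(7)      # detected via digest, served from the mirror
```

The same detect-then-recover shape would apply to parity mappings, with reconstruction from parity replacing the mirror read.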
Conclusion: The Atrato system’s maintenance-free design enables management of storage by exception. Customers can minimize data loss exposure while allowing their IT staff to focus more on deploying capacity, access, and quality of service, and less on specific drive types, drive tending, servicing rebuilds, and managing spares.
10955 Westmoor Drive, Suite 300
Westminster, CO 80021