In this WP, hardware systems are iteratively broken down into their basic components, which will be characterized from the reliability standpoint. By characterization we mean the computation of specific parameters and measures that potentially impact the overall system reliability (e.g., area, error masking probability, resource utilization, timing constraints, etc.). This approach follows the concept of modular design/IP reuse for fast time-to-market (TTM), which is a common challenge for both HPC and ES applications.
One of the key aspects of CLERECO is its cross-layer approach to reliability evaluation, in which all system elements, from the raw technologies up to the software layers, are carefully considered with respect to their impact on system reliability. Hardware architectures will be analyzed considering their components (CPUs, memories, accelerators, peripherals, interconnects, etc.) at different levels of detail, but always maintaining a connection between the components and the related user-available instructions. The reliability-related behavior of hardware components will be evaluated (in isolation) at different stages of the system design cycle: from very early specification stages, through high-level and more detailed design stages, down to the prototyping phases of the design flow. Thus, the reliability evaluation will be an iterative process providing different levels of detail while moving from the conceptual design phases, through all intermediate design phases, down to the post-silicon design validation phase.
The following sets of hardware components will be studied:
- Microprocessor cores of different architectures (RISC, CISC, DSP) as well as their multicore and multithreaded versions. The different subcomponents of the processor cores will be separately analyzed: functional units, storage elements, control logic, performance mechanisms (such as predictors, prefetchers), etc.
- Accelerators. A large class of accelerator cores that will be common in future multi-/many-core architectures, including Graphics Processing Units (GPUs), Single Instruction, Multiple Data (SIMD) accelerators, cryptographic cores, etc.
- Memories, both on-chip and off-chip. The project will work on the dominant memory technologies in ES and HPC applications: SRAM, DRAM, and Flash-based memories. Moreover, emerging resistive memory types such as phase-change memories (PCMs) will be considered.
- Peripherals, input/output system. Different classes of peripheral devices will be evaluated mainly focusing on their electronic parts (the device controllers).
- Interconnection elements. The reliability of the hardware components of the on-chip and off-chip interconnection networks will be analyzed.
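The iterative breakdown of a system into basic components can be pictured as a tree whose leaves are the units to be characterized. The following is only a minimal sketch under stated assumptions: the `Component` class, the `leaves` method, and all component names are hypothetical illustrations drawn from the list above, not part of any CLERECO tool.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Component:
    """Hypothetical node in a hardware component hierarchy."""
    name: str
    children: List["Component"] = field(default_factory=list)

    def leaves(self) -> Iterator["Component"]:
        """Yield the basic (non-decomposed) components below this node."""
        if not self.children:
            yield self
            return
        for child in self.children:
            yield from child.leaves()


# Illustrative decomposition of a processor core into the subcomponents
# named in the list above (names are assumptions for this example).
core = Component("cpu_core", [
    Component("functional_units"),
    Component("storage_elements"),
    Component("control_logic"),
    Component("performance_mechanisms", [
        Component("predictors"),
        Component("prefetchers"),
    ]),
])
system = Component("system", [core, Component("memory"), Component("interconnect")])

# The leaves are the units that would be characterized in isolation.
leaf_names = [c.name for c in system.leaves()]
```

Keeping the hierarchy explicit is one possible way to refine a node into finer subcomponents at later design stages without changing the code that walks the tree.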
During the course of the project, different hardware components may emerge or may prevail in different market segments. The CLERECO research work is adaptable and flexible enough to consider the characteristics of new components even in later stages, when the reliability evaluation framework has already been set up to some extent. Therefore, the impact of emerging components on the overall system reliability will be considered as a matter of course.
For each hardware component, a set of parameters relevant to the reliability of the system will be extracted. In particular, the following information will be either estimated or actually measured (depending on the level of abstraction of each phase of the design cycle):
- Size/area and "complexity" of the component: the reliability impact of hardware components significantly depends on their size (both for transient and for permanent faults). For each of the hardware components and each stage of the design cycle, the component size and "complexity" will be identified. Complexity will be indirectly measured by the available information at different design stages: number of inputs and outputs, sequential depth, estimated propagation delays, number of subcomponents, sizes of internal arrays, etc.
- Power/energy budget: power and energy consumption modify the temperature profile of the devices, which affects their ageing and wear-out and thus has a relevant impact on reliability.
- Utilization of the component: different hardware components of a computing system are utilized very differently by the instructions of a given processor architecture: some are intensively used while others are not; some are used in ways that may or may not affect reliability. This inherent "masking" of hardware faults by the instructions determines the component's importance from a reliability point of view. Fault/error masking by higher software layers will be studied separately in WP3.
- Existing fault tolerance mechanisms of the component: depending on the system, the hardware components may be either free of fault tolerance mechanisms or may already employ known techniques such as hardware, time, and information redundancy at low levels (transistor level) or higher levels (architecture level). This information significantly affects the expected reliability of the hardware components and will be taken into consideration.
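To illustrate how the parameters listed above might be recorded together for one component, the following is a minimal sketch, not the project's actual model. The `ComponentProfile` class, its field names, the use of storage bits as a size proxy, and the simple multiplicative derating formula are all assumptions made for this example; power is stored but not used in the estimate.

```python
from dataclasses import dataclass


@dataclass
class ComponentProfile:
    """Hypothetical characterization record for one hardware component."""
    name: str
    size_bits: int                     # size proxy (assumption: storage bits)
    power_mw: float                    # power budget, recorded only
    masking_probability: float         # fraction of faults masked by usage
    protection_coverage: float = 0.0   # fraction handled by existing FT mechanisms

    def __post_init__(self) -> None:
        # Both fractions must be valid probabilities.
        for value in (self.masking_probability, self.protection_coverage):
            if not 0.0 <= value <= 1.0:
                raise ValueError("probabilities must lie in [0, 1]")

    def effective_fault_rate(self, raw_rate_per_bit: float) -> float:
        """Assumed first-order estimate: the raw rate scaled by size,
        then derated by masking and by existing protection."""
        unmasked = 1.0 - self.masking_probability
        unprotected = 1.0 - self.protection_coverage
        return raw_rate_per_bit * self.size_bits * unmasked * unprotected


# Illustrative numbers only; they are not measurements.
register_file = ComponentProfile(
    name="register_file",
    size_bits=2048,
    power_mw=15.0,
    masking_probability=0.75,
    protection_coverage=0.5,
)
rate = register_file.effective_fault_rate(raw_rate_per_bit=0.001)
```

In this sketch a larger component raises the estimate, while higher masking or protection coverage lowers it, which mirrors the qualitative dependencies described in the list.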
Both ASIC and FPGA implementations of different cores will be analyzed in this WP to fully consider all design alternatives available when working on real computing systems.
Finally, WP3 will also be engaged in the implementation of a preliminary library of characterized modules that will be employed for the validation and demonstration activities of this project in WP6. It is worth mentioning that realizing a fully comprehensive library of components is beyond the capacity of this project. Instead, we will show the path for the analysis of future use cases, focusing on instruments that will allow a fast technology transfer from the research domain to real cases.
Participants: POLITO, CNRS, INTEL, THALES, YOGITECH, ABB