





## **CLERECO** project overview

HiPEAC CSW in Barcelona





#### **PROJECT GOAL**

CLERECO is an FP7 ICT collaboration project that aims to define a new framework for Cross-Layer Early Reliability Evaluation for the Computing cOntinuum.



- Motivation
- Overview of the activities
- Conclusions










#### **MOTIVATION**



New technological processes:

- FinFET
- Scaled bulk
- 3D integration
- Spin logic



Aggressive shrinking of devices (<10nm)

Increased aging and process variability

Increased susceptibility to the environment

- Temperature
- Humidity
- Radiation














#### **CONSORTIUM**




















#### **OVERVIEW OF THE ACTIVITIES**



**System reliability evaluation** across layers:

- **APPLICATION SOFTWARE**
- **SYSTEM SOFTWARE**
- **HARDWARE ARCHITECTURE**
- **TECHNOLOGY**





- Impact of hardware faults on:
  - Functionality
  - Performance
  - Power






Figure: work package overview (WP2: Common and domain-specific sources of failure and unreliability; Characterization; WP5: System Level Reliability Estimation)



#### WP2: Common and domain-specific sources of failure and unreliability

- Analysis of the different failure mechanisms that will be relevant for the computing continuum (scaled bulk CMOS, III-V Ge, FinFETs, spin logic, etc.)
- Identification and characterization of the main sources of failure
- Reliability requirements for the different computing segments within the computing continuum, such as embedded systems (ES) and high-performance computing (HPC)
- Definition of the different operating modes of the system (e.g., voltage and frequency levels), and the different operating conditions (e.g., temperature, electronic noise, etc.) that may affect reliability



#### WP3: Hardware components reliability characterization

- Hardware systems are iteratively broken down into their basic components and characterized from a reliability standpoint
  - Computation of specific parameters and measures potentially impacting the overall system reliability
- Classes of considered components:
  - CPUs
  - Memories (e.g., DRAM, SRAM, Flash, RRAM, etc.)
  - Accelerators (e.g., GPUs)
  - Peripherals
  - Interconnects
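
The iterative breakdown can be pictured as a tree whose leaves carry the characterized figures. The following is a minimal sketch, not CLERECO's actual tooling: the `Component` class, the choice of FIT (failures per 10^9 device-hours) as the measure, the additive "any sub-component failure counts" assumption, and all numbers are illustrative assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Component:
    """Node in a hypothetical hardware breakdown tree."""

    name: str
    raw_fit: float = 0.0  # raw failure rate in FIT (assumed measure)
    children: list[Component] = field(default_factory=list)

    def total_fit(self) -> float:
        # Assumption: failures are independent and any sub-component
        # failure counts, so the rates simply add up.
        return self.raw_fit + sum(c.total_fit() for c in self.children)


# Placeholder numbers, not measurements.
system = Component("system", children=[
    Component("cpu", children=[
        Component("register_file", raw_fit=10.0),
        Component("l1_cache", raw_fit=25.0),
    ]),
    Component("dram", raw_fit=50.0),
    Component("interconnect", raw_fit=5.0),
])

print(f"{system.name}: {system.total_fit()} FIT")
```

A raw sum like this ignores masking in the architecture and in software, which is why the characterization is carried across layers rather than stopping at the hardware.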





#### WP4: SW level reliability characterization

- The software stack is broken up into its basic components (from high-level application software modules down to the instruction set architecture level) and analyzed to determine how hardware errors propagate through it
- To allow early reliability estimation when the hardware architecture is not yet defined, WP4 aims at defining metrics and models that abstract the behavior of the software from any specific hardware architecture
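
One way to picture error propagation through the software stack is as a chain of per-layer probabilities. The sketch below is only an illustration under strong assumptions: the layer names, the probabilities, and the independence between layers are hypothetical and are not taken from the project.

```python
from math import prod


def propagation_probability(layer_probs):
    """Probability that an error crosses every layer, assuming independence."""
    probs = list(layer_probs)
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return prod(probs)


# Hypothetical probability that an error entering each layer is NOT masked there.
layers = {
    "isa": 0.5,
    "system_software": 0.4,
    "application": 0.5,
}

p_visible = propagation_probability(layers.values())
print(f"probability the error reaches the application output: {p_visible:.3f}")
```

Because the inputs are properties of the software alone, a figure of this kind could in principle be estimated before the hardware architecture is fixed.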



#### WP5: System level estimation models

- Measures and analyses performed in WP3 and WP4 are integrated into a comprehensive statistical framework able to estimate reliability metrics defined in WP2 (iteratively in the different design stages of the system)
- Estimated reliability metrics are used to develop algorithms that support designers in reliability-related decision making, which in turn allows the design of reliable systems with improved cost-related characteristics (area, energy/power, and performance) and reduced time to market (TTM)
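
As a rough illustration of how hardware and software characterizations might be combined into a system-level figure, the sketch below derates each component's raw failure rate by a software propagation probability and sums the result. This is a simplified stand-in, not the project's statistical framework: the `Contribution` type, the derating model, the constant-failure-rate conversion to MTTF, and all numbers are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contribution:
    name: str
    raw_fit: float         # hypothetical hardware figure, in FIT
    sw_propagation: float  # hypothetical probability a fault reaches the output


def system_fit(contributions):
    """Effective system failure rate: raw rates derated by software masking."""
    return sum(c.raw_fit * c.sw_propagation for c in contributions)


def mttf_hours(fit):
    """Mean time to failure, assuming a constant failure rate (1 FIT = 1e-9 / h)."""
    return float("inf") if fit == 0 else 1e9 / fit


# Placeholder numbers, not measurements.
design = [
    Contribution("cpu", raw_fit=40.0, sw_propagation=0.25),
    Contribution("dram", raw_fit=50.0, sw_propagation=0.5),
    Contribution("interconnect", raw_fit=5.0, sw_propagation=1.0),
]

fit = system_fit(design)
print(f"effective rate: {fit} FIT, MTTF: {mttf_hours(fit):.0f} h")
```

Comparing such figures across candidate designs, alongside their area, power, and performance costs, is the kind of input a decision-support algorithm would need.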








#### Conclusions

- The characteristics that CLERECO pursues for its framework are:
  - Flexibility: reliability analysis must be possible starting from the very early design stages
  - Speed: time-consuming simulations and/or fault-injection campaigns must be avoided and replaced by accurate statistical models and procedures
  - Accuracy: reliability estimations must be as precise as possible
  - Comprehensiveness: a heterogeneous set of target systems must be analyzable, ranging from very application-dependent ES to more general-purpose HPC systems





### **THANK YOU FOR YOUR ATTENTION!**