

## Towards Early and Accurate Reliability Evaluation

#### **Dimitris Gizopoulos**

University of Athens, Computer Architecture Lab

Nikos Foutris, Manolis Kaliorakis, Sotiris Tselonis, George Papadimitriou







# Reliability – Dependability

Computing Continuum layers: Services, Application, System SW, Architectures, Silicon

• Importance across the Computing Continuum

- Failure rates (MTBF):
  - Today: days/weeks
  - 2018: mins/secs?



# Drivers of **Un**-Reliability

- Devices shrinking (10nm and smaller)
- Processes (FinFET, scaled bulk, 3D stacks, spin logic, ...)
- Variability
- Aging, wear-out
- Environment
  - radiation, temperature, humidity
- Many cores, many memories
- Heterogeneity







### Protection

- Protection against (transient, permanent, intermittent) hardware faults costs
  - area, performance, power/energy, design time, ...
- Protection: detection, diagnosis, recovery, repair







# Protection: where and how much?

- Some components are more vulnerable
  - memory: DRAM, SRAM, registers
- But protection technique costs vary:





Y. Luo *et al.*, "Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory", DSN, 2014.



HiPEAC CSW Fall 2014 – Towards Early and Accurate Reliability Evaluation

# How much protection: Decide Early

- Early protection decisions save costs but should be accurate
- Ideally, no re-design cycles to enhance reliability
- But can "early" be accurate?
  - System details missing
  - Unknown software
  - Unknown context
- When can early be accurate?





## Cost & Obtained Reliability



# Micro-architectural Simulators

- Reliability evaluation using microarchitectural simulator models
  - This is **early**
  - Is it **fast enough**?
  - Is it accurate enough?
- For which components can this be good enough?



# Components in uarch models

| Component class                               | Realism of uarch models | Size         | Vulnerability |
|-----------------------------------------------|-------------------------|--------------|---------------|
| Memory arrays (caches, regs, buffers, queues) | Yes                     | Large        | High          |
| Functional units                              | No                      | Medium       | Low           |
| Control logic                                 | No                      | Small/Medium | Low           |

### Micro-architecture Arrays

- Storage arrays
  - registers, buffers, caches, memory...
- Arrays/tables in the uarch models
- #Bits as in final hardware implementation
- Most vulnerable
  - DRAM, SRAM, flip-flops, ...







# Fault Injection or Analytical?

- Statistical fault injection (SFI) vs. analytical methods
  - SFI is accurate but can be very time-consuming
  - Analytical methods (such as ACE analysis for AVF estimation) are very fast (single-pass) but can be very pessimistic
  - Here we choose SFI to obtain accuracy, and evaluate the speed of the measurement campaign





## **Requirements for Accuracy**

- A micro-architectural simulator can provide accurate reliability evaluation through SFI for storage arrays when:
  1. All important arrays are modeled
  2. #Injections is statistically significant
  3. **Injection throughput** is high





### Marssx86-FI

- Fault injector on Marssx86 uarch simulator
  - Transient/intermittent/permanent faults
  - Multiple faults on single or multiple components
  - Full system simulation
- Fault behavior classes:
  - **SDC** (silent data corruptions)
  - **DUE** (detected unrecoverable errors)
  - Masked (benign)
  - Deadlock
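The four fault behavior classes can be read as a decision over what is observed at the end of a faulty run. The sketch below is only an illustration of that reading, not the classification logic of Marssx86-FI itself: the function `classify_run` and its three boolean inputs are hypothetical, and the rule that a run hitting a cycle limit counts as a deadlock is an assumption.

```python
from enum import Enum


class FaultClass(Enum):
    MASKED = "Masked"
    SDC = "SDC"
    DUE = "DUE"
    DEADLOCK = "Deadlock"


def classify_run(error_detected: bool, timed_out: bool,
                 output_matches_golden: bool) -> FaultClass:
    """Map the observed outcome of one faulty run to a fault behavior class.

    All three inputs are hypothetical observations; the slide does not say
    how the real tool derives its classes.
    """
    if error_detected:
        # The fault was noticed (e.g. a crash or exception) but not recovered.
        return FaultClass.DUE
    if timed_out:
        # Assumption: a run that never completes within a cycle limit is a deadlock.
        return FaultClass.DEADLOCK
    if not output_matches_golden:
        # Nothing was detected, yet the output differs from the fault-free run.
        return FaultClass.SDC
    # The run completed and the output is identical to the fault-free run.
    return FaultClass.MASKED


print(classify_run(error_detected=False, timed_out=False,
                   output_matches_golden=False).value)
```

The checks are ordered so that a detected error takes priority over a timeout; a different priority would be equally defensible.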



### Marssx86 Enhancements

Original Marssx86 model enhancements





# Statistical Significance

- Population (N)
  - Bit positions (permanent faults)
  - Bit positions × Execution cycles (transient faults)
- Confidence (t)
- Error margin (e)
- Equally probable (sa0/sa1, 0-to-1/1-to-0 flips)
- #Injections (n)
  - $n = \dfrac{N}{1 + e^2 \times \dfrac{N - 1}{t^2 \times 0.25}}$



R. Leveugle *et al.*, "Statistical Fault Injection: quantified error and confidence", DATE, 2009.
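One way to draw fault sites from the populations defined on this slide is uniform sampling without replacement. The sketch below assumes exactly that; the function names are hypothetical, and the array size and cycle count are borrowed from the register-file example used elsewhere in this talk.

```python
import random


def sample_transient_faults(n, num_bits, num_cycles, seed=0):
    """Draw n distinct (bit, cycle) pairs uniformly from bits x cycles."""
    rng = random.Random(seed)
    # Sampling from a range draws without replacement and never
    # materializes the (very large) population in memory.
    picks = rng.sample(range(num_bits * num_cycles), n)
    # Decode each flat index into (bit position, injection cycle).
    return [divmod(p, num_cycles) for p in picks]


def sample_permanent_faults(n, num_bits, seed=0):
    """Draw n distinct bit positions, each stuck at 0 or 1 with equal probability."""
    rng = random.Random(seed)
    bits = rng.sample(range(num_bits), n)
    return [(bit, rng.randrange(2)) for bit in bits]


transient = sample_transient_faults(5, num_bits=16_384, num_cycles=100_000_000)
permanent = sample_permanent_faults(5, num_bits=16_384)
print(transient)
print(permanent)
```

A fixed seed keeps a campaign reproducible, which helps when a single injection has to be re-run for debugging.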




# Example Numbers of Injections

- Physical integer register file (IntRF): 256 registers × 64 bits = 16,384 bits
- or a **32KB** L1 D-Cache or L1 I-Cache
- Benchmark execution time = **100M cycles**
- Error margin 1%
- Confidence 99%
- **16,587** injections for transient faults (for each of the 3 cases); 8,243 injections for permanent faults
  - #injections saturates for large populations
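The example numbers can be checked by evaluating the sample-size formula from the Statistical Significance slide directly. This is a minimal sketch: the function name is hypothetical, the result is rounded to the nearest integer, and the value t = 2.5758 for 99% confidence is an assumed rounding of the normal quantile.

```python
def sfi_sample_size(population, error_margin, t):
    """n = N / [1 + e^2 * (N - 1) / (t^2 * 0.25)], rounded to the nearest integer."""
    n = population / (1 + error_margin**2 * (population - 1) / (t**2 * 0.25))
    return round(n)


T_99 = 2.5758             # assumed rounded normal quantile for 99% confidence
INTRF_BITS = 256 * 64     # 16,384 bits
L1_BITS = 32 * 1024 * 8   # 32KB cache, data bits only
CYCLES = 100_000_000      # benchmark execution time in cycles

# Transient faults: population = bit positions x execution cycles.
transient_intrf = sfi_sample_size(INTRF_BITS * CYCLES, 0.01, T_99)
transient_l1 = sfi_sample_size(L1_BITS * CYCLES, 0.01, T_99)
# Permanent faults: population = bit positions only.
permanent_intrf = sfi_sample_size(INTRF_BITS, 0.01, T_99)

print(transient_intrf, transient_l1, permanent_intrf)  # 16587 16587 8243
```

Both transient populations are so large that n sits at its limit of t² × 0.25 / e², which is why the register file and the 16-times-larger cache need the same number of injections. The 8,243 figure corresponds to the 16,384-bit register file population.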





# SFI Throughput

- Injection machine
  - Intel® Core™ i7-3970X @ 3.50 GHz (6 cores, 12 threads), 32 GB RAM, Ubuntu 14.04.1 LTS, kernel 3.13.0-36-generic x86\_64

| Time/injection | #Injections/component | #Components         | #Benchmarks |
|----------------|-----------------------|---------------------|-------------|
| ~3 mins        | 16,587                | 3 (IntRF, L1D, L1I) | 20          |

| Total time (1 thread) | Total time (12 threads) | Total time (10 injection machines, 120 threads) |
|-----------------------|-------------------------|-------------------------------------------------|
| ~2070 days            | ~175 days               | ~18 days                                        |


# SFI Throughput (more accuracy)

• Same setup, but with a 0.5% error margin (instead of 1%) and 99.8% confidence (instead of 99%)

| Time/injection | #Injections/component | #Components         | #Benchmarks |
|----------------|-----------------------|---------------------|-------------|
| ~3 mins        | **95,493**            | 3 (IntRF, L1D, L1I) | 20          |

| Total time (1 thread) | Total time (12 threads) | Total time (10 injection machines, 120 threads) |
|-----------------------|-------------------------|-------------------------------------------------|
| ~12,000 days          | ~993 days               | ~99 days                                        |


Calculations refer to single transient faults. Larger numbers of fault injections are needed for multiple transient fault studies.
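The campaign durations follow from simple arithmetic on the table parameters. The sketch below uses a hypothetical helper, `campaign_days`, and assumes ideal linear scaling across threads and exactly 3 minutes per injection, so it only approximates the rounded figures on the slides.

```python
def campaign_days(minutes_per_injection, injections_per_component,
                  components, benchmarks, threads=1):
    """Total wall-clock days for an SFI campaign under ideal linear scaling."""
    total_injections = injections_per_component * components * benchmarks
    total_minutes = total_injections * minutes_per_injection / threads
    return total_minutes / (60 * 24)


# Two accuracy settings: (error margin / confidence, injections per component).
for label, injections in (("1% / 99%", 16_587), ("0.5% / 99.8%", 95_493)):
    days = [campaign_days(3, injections, components=3, benchmarks=20, threads=t)
            for t in (1, 12, 120)]
    print(label, [round(d) for d in days])
```

Under these assumptions the first setting comes to about 2,073, 173 and 17 days, and the second to about 11,937, 995 and 99 days, for 1, 12 and 120 threads respectively.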


#### Results (1)

• Integer Register File



#### Results (2)

• L1 DCache



# Conclusions

- Micro-architectural simulators for SFI and Reliability Evaluation of Storage Arrays
- Early evaluation
- Simulators need **enhancements**
- Fault injection **throughput** depends on:
  - Number of benchmarks and their execution times
  - Desired accuracy (confidence, error margin)
  - Sizes and number of arrays





# Thank you.

#### **Dimitris Gizopoulos**


University of Athens Computer Architecture Lab



