# From the Technology...

- Identify target silicon technologies
- Characterize their failure mechanisms
- Estimate reliability for basic components
  - Memories (SRAM, DRAM, FFs and latches)
  - Combinatorial blocks (gates, blocks and FUs)
- Provide upper layers with the fundamental vulnerability of the underlying circuits

## ...through the hardware...

### CPUs

Register Files, Fetch Buffers, Issue Queues, Load/Store Queue, Control Logic, Functional Units, Branch Predictors, RAS, Prefetchers

### GPUs

Functional Units, Register Files, Branch Divergence Units

### Peripherals

Interconnects

### Memory

- CPU: L1D, L1I, L2, L3, Main Memory
- GPU: Global Memory, Shared Memory, Local Memory

#### Hardware Components Characterization Methods:

- Statistical Fault Injection
- Analytical

#### Fault Injection Infrastructure:

- **Tools** (microarchitectural simulators): MARSSx86 (x86-64 OoO model), GEM5 (x86-64 OoO model, ARM model), GPGPU-Sim (NVIDIA GPU architectures), Multi2Sim (AMD GPU architectures)
- Fault models: transient, intermittent and permanent faults, targeting one or multiple bits in one or more components
- Benchmarks: any workload

## ...and the software...
#### Virtual ISA-based Fault Injection

- Virtual ISA: LLVM (a framework that uses virtualization to perform complex analysis of software applications on different architectures)
- Fault models: Software Fault Model (effect of soft errors on the virtual ISA)
- **Benchmarks**: Matrix Multiplication
  - Simple
  - With duplicated variables
  - With triplicated variables
- Validation: comparison with a simulation-based fault injector for the 8086 μprocessor

[Figure: error propagation across the Hardware Layer and the Software Layer. Labels: Hardware, ISA, Virtual ISA, Software System, Software Application, Software Outcomes.]

#### Software Fault Models

| Fault Model | Description |
|-------------|-------------|
| Wrong Data in an Operand | An operand of the VISA instruction changes its value |
| Not Accessible Operand | An operand of the VISA instruction cannot change its value |
| Instruction Replacement | An instruction is used in place of another |
| Control Flow Error | The control flow is not respected |

#### Results

| Benchmark | Simulator | Masked | SDC | Detected | Crash |
|-------------|-----------|--------|-------|----------|-------|
| (1) mMul | LLVM | 44.0% | 55.0% | 0% | 1.1% |
| | 8086 | 43.7% | 52.7% | 0% | 3.6% |
| (2) mMul dup | LLVM | 22.4% | 17.9% | 58.8% | 0.9% |
| | 8086 | 23.9% | 18.9% | 56.5% | 0.7% |
| (3) mMul TMR | LLVM | 83.7% | 15.1% | 0.2% | 1.0% |
| | 8086 | 85.8% | 13.3% | 0.2% | 0.7% |

Simulation time (CPU time): < 1 minute for each of the three benchmarks with LLVM; 21 hours, 18 hours and 6 hours with the 8086 simulator.

# All together

Evaluation of the overall system reliability

#### Bayesian model used for system level modeling

- Nodes represent the HW/SW system's components
- Edges represent error masking/propagation paths within the system

#### Input:

- Technology raw error rates
- Environmental parameters
- Conditional masking probabilities for single nodes, computed with the fault injection environments developed to characterize hardware and software modules

#### Output:

- An estimate of the system's error rate based on the input parameters
- Back-propagation to identify critical nodes of the system
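
To make the inputs and outputs above concrete, here is a minimal Python sketch of how raw error rates and per-node masking probabilities could combine into a system-level estimate. It is a deliberate simplification and not the model used in this work: it assumes a tree-shaped graph in which each node has a single parent and masking events are independent, whereas a full Bayesian network handles arbitrary conditional dependencies. The names `Node`, `propagation_probability`, `system_error_rate` and `env_scale` are hypothetical, and the hardware rates and masking values are invented for illustration. The only figure taken from the text is the 44.0% masked share reported for mMul under LLVM, used here as the application's masking probability.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Node:
    """One HW/SW component in the simplified propagation graph (hypothetical)."""
    name: str
    raw_rate: float          # raw error rate of the component (arbitrary units)
    p_mask: float            # probability that an error present in this node is masked
    parent: Optional[str]    # next node toward the system output; None for the top node


def propagation_probability(nodes: Dict[str, Node], name: str) -> float:
    """Probability that an error arising in `name` survives every node up to the output."""
    p = 1.0
    current: Optional[str] = name
    while current is not None:
        node = nodes[current]
        p *= 1.0 - node.p_mask   # the error must escape masking at each node on the path
        current = node.parent
    return p


def system_error_rate(
    nodes: Dict[str, Node], env_scale: float = 1.0
) -> Tuple[float, Dict[str, float]]:
    """Return the total system error rate and each node's contribution to it.

    `env_scale` is an assumed multiplier standing in for environmental parameters.
    """
    contributions = {
        name: env_scale * node.raw_rate * propagation_probability(nodes, name)
        for name, node in nodes.items()
    }
    return sum(contributions.values()), contributions


# Illustrative values only: hardware rates and masking probabilities are made up.
nodes = {
    "register_file": Node("register_file", raw_rate=100.0, p_mask=0.80, parent="application"),
    "l1d": Node("l1d", raw_rate=300.0, p_mask=0.90, parent="application"),
    # p_mask = 0.44 mirrors the Masked share of mMul under LLVM in the results.
    "application": Node("application", raw_rate=0.0, p_mask=0.44, parent=None),
}

total, contributions = system_error_rate(nodes)

# Ranking nodes by contribution is a crude stand-in for identifying critical nodes.
critical = sorted(contributions, key=contributions.get, reverse=True)

print(f"system error rate: {total:.2f}")
print("nodes by contribution:", critical)
```

In this sketch the ranking plays the role of the back-propagation step: the node with the largest contribution is the one where added protection would lower the estimated system error rate the most.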