Super User


Some pictures collected at the CLERECO stand during DATE 2014 in Dresden, Germany.

Foutris, N.; Gizopoulos, D.; Chatzidimitriou, A.; Kalamatianos, J.; Sridharan, V., in Proceedings of the 10th IEEE Workshop on Silicon Errors in Logic – System Effects (SELSE 2014), Stanford, CA, USA, April 1-2, 2014


Zuolo, L.; Zambelli, C.; Micheloni, R.; Galfano, S.; Indaco, M.; Di Carlo, S.; Prinetto, P.; Olivo, P.; Bertozzi, D., in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), 2014, pp. 1-6, 24-28 March 2014


Monday, 14 April 2014 14:53

WP1: Project Management

This WP includes all activities related to the proper coordination of both technical and administrative work among the CLERECO project partners and between them and the EC. To this end, internal and external information flows will be defined, implemented and used.

Leader: POLITO

Participants: UoA, CNRS, INTEL, THALES, YOGITECH, ABB

Monday, 14 April 2014 14:50

WP7: Dissemination and exploitation

Dissemination of the CLERECO results to selected groups of interested parties and exploitation of project results at the industrial and academic level are key objectives of the CLERECO project that will be pursued through the tasks of this WP. Given the importance of this activity, a detailed dissemination plan and an exploitation strategy for the project will be developed and continuously updated to reflect the results obtained from the research WPs.

Leader: INTEL

Participants: POLITO, UoA, CNRS, THALES, YOGITECH, ABB

Monday, 14 April 2014 14:43

WP6: Validation and proof-of-concept

WP6 is responsible for the validation of the CLERECO early reliability estimation methodology and for the demonstration of project results, exploiting deliverables and methods coming from WP3, WP4 and WP5.

Two main activities will be carried out within this WP. Although the main load of WP6 is at the end of the project, the evaluation methodology will be defined during the first half of the project. The first activity will focus on the automation of all algorithms and methods defined within WP6 for the CLERECO reliability analysis within a prototype Electronic Design Automation (EDA) tool-suite. This is a mandatory task to enable the application of CLERECO methods to real test cases. The second activity will instead focus on the definition of realistic use cases on which CLERECO concepts can be efficiently demonstrated and validated. This second activity will be steered by the CLERECO industrial partners, who will cooperate on the definition of relevant application examples.

The reliability of selected use cases will be analyzed through the developed EDA tool-suite at different design stages, considering different sets of available information, thus reproducing realistic situations typical of a product design cycle. Reliability results obtained with the CLERECO method will be constantly compared to reliability measures obtained through traditional extensive (and clearly costly and time-consuming) fault injection campaigns as well as laser/EM injections. This will enable CLERECO partners to clearly assess the accuracy of the CLERECO estimation.
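The comparison between an early estimate and a fault injection campaign can be sketched as follows. This is a minimal illustration, not the CLERECO procedure: the function names, the normal-approximation confidence interval and the sample numbers are all assumptions made for the example.

```python
import math


def relative_error(estimated: float, measured: float) -> float:
    """Relative error of an estimate against a reference measurement."""
    if measured == 0:
        raise ValueError("reference measurement must be non-zero")
    return abs(estimated - measured) / abs(measured)


def injection_failure_rate(failures: int, injections: int, z: float = 1.96):
    """Failure probability observed in a fault injection campaign.

    Returns (p, low, high), where [low, high] is a normal-approximation
    confidence interval (z = 1.96 corresponds to roughly 95%).
    """
    if injections <= 0:
        raise ValueError("at least one injection is required")
    p = failures / injections
    half_width = z * math.sqrt(p * (1.0 - p) / injections)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical numbers: 120 failures out of 1000 injected faults,
# compared against a model-based estimate of 0.13.
measured, low, high = injection_failure_rate(120, 1000)
estimate = 0.13
print(f"measured={measured:.3f} interval=[{low:.3f}, {high:.3f}]")
print(f"relative error={relative_error(estimate, measured):.3f}")
print("estimate inside interval:", low <= estimate <= high)
```

Reporting the interval alongside the relative error makes it visible when a discrepancy is within the statistical noise of the injection campaign itself.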

Finally, CLERECO design optimization heuristics will be exploited to show how project results help designers optimize the developed systems and achieve better performance.

Leader: THALES

Participants: POLITO, UoA, CNRS, INTEL, YOGITECH, ABB

Monday, 14 April 2014 14:39

WP5: System level estimation models

WP5 contains the core activities of this project. Descriptions of the target systems and related parameters will be integrated into a comprehensive statistical model that will be used to estimate the reliability metrics defined in WP2 (iteratively in the different design stages of the system). Together with reliability assessment, WP5 includes research on algorithms that support designers in reliability-related decision making, which will in turn allow the design of reliable systems with improved cost-related characteristics (area, energy/power, and performance) and reduced time to market (TTM).

The leading concepts that will be pursued are:

  • Flexibility: reliability analysis must be possible starting from the very early design stages, in which HW/SW components are selected and interconnected at a very high-level description, down to post-silicon design in which system components are well known and characterized.
  • Speed: time-consuming simulations and/or fault-injection campaigns must be avoided and replaced by accurate statistical models and procedures.
  • Accuracy: reliability estimations must be as precise as possible. Accuracy will be influenced by the quality of the information fed into the model and therefore will increase while moving from very early design stages to more advanced design stages.
  • Comprehensiveness: a heterogeneous set of target systems must be analyzable ranging from very application-dependent ES to more general-purpose HPC systems.
  • Interoperability: input and output data must be standardized throughout all steps of the analysis and, whenever possible, also considering well established industrial design flows and practices.
  • Usability: complexity and precision must be carefully handled to enable the application of the proposed model in real industrial scenarios.

WP5 also plays a key role in harmonizing the research activities of this project. WP2, WP3, WP4 and WP5 are closely related WPs that require an intensive exchange of information to achieve their goals. Information must be properly represented and standardized in order to guarantee easy and reliable circulation and integration among tasks. WP5 is in charge of this through a dedicated task.

Finally, due to the complexity of the activities performed in this WP, a constant validation of preliminary and intermediate results is mandatory. Research activities within WP5 will therefore be organized as a continuous alternation of solution development and preliminary validation on simple cases.

Leader: POLITO

Participants: UoA, CNRS, INTEL, THALES, YOGITECH, ABB

Similarly to WP3, WP4 iteratively breaks down the software stack into its basic components (from high-level application software modules down to the instruction set architecture level) that will be characterized from the reliability standpoint.

To enable early reliability estimations, software analysis must be possible at early system design stages, even when a target platform is not yet defined. To cope with this requirement, WP4 aims at defining metrics and models that abstract the behavior of the software regardless of the specific hardware architecture of the system.

Several activities will be addressed:

  • Similarly to WP3, this WP only considers the impact of the interface between the software layer and the hardware layer, represented by the executed microprocessor instructions. To achieve abstraction from the target architecture, a processor-independent instruction set that is as generic and complete as possible will be defined. This can be further linked, if required, to abstract functional units (e.g., arithmetic logic units, memory management units) to further characterize the software activity.
  • Hardware-induced errors must be properly described in the software. WP4 will define precise hardware-independent fault models by considering three well-established types of errors: variable corruptions (corresponding to errors in data registers or in memory), execution errors (corresponding to wrong instruction op-codes) and control flow errors.
  • Software faulty behaviors must be precisely described and characterized. WP4 starts from a well-established taxonomy including: (i) correct execution and correct timing; (ii) correct execution but incorrect timing; (iii) fail silent violation with correct timing; (iv) fail silent violation with incorrect timing; (v) system exception; (vi) crash. This taxonomy will be revised and improved based on specific requirements identified in WP2.
  • Early reliability estimation requires coping with all design stages and therefore with different software descriptions including black box modules, user-defined functions, library functions, legacy code, instructions.
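The six-class taxonomy of software behaviors listed above can be captured as a small classifier. This is a minimal sketch: the `Outcome` names, the four boolean observations and the precedence of crash over exception are assumptions made for illustration, not definitions taken from the project.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct execution, correct timing"
    CORRECT_BAD_TIMING = "correct execution, incorrect timing"
    FSV = "fail silent violation, correct timing"
    FSV_BAD_TIMING = "fail silent violation, incorrect timing"
    SYSTEM_EXCEPTION = "system exception"
    CRASH = "crash"


def classify(crashed: bool, exception: bool,
             output_correct: bool, on_time: bool) -> Outcome:
    """Map the observations of one faulty run to a taxonomy class."""
    # Abnormal terminations take precedence over output and timing checks
    # (an assumed ordering).
    if crashed:
        return Outcome.CRASH
    if exception:
        return Outcome.SYSTEM_EXCEPTION
    if output_correct:
        return Outcome.CORRECT if on_time else Outcome.CORRECT_BAD_TIMING
    # Wrong output delivered without any error indication.
    return Outcome.FSV if on_time else Outcome.FSV_BAD_TIMING


# A run that finishes on time but produces a wrong result.
print(classify(crashed=False, exception=False,
               output_correct=False, on_time=True).value)
```

Counting how many runs fall into each class gives the per-module distribution that a characterization step could record.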

Once all these issues are covered, we will analyze each software component level: system, selected drivers and application. This analysis will be the foundation for the construction of a set of characterized software modules to be used in WP5 and WP6.

Finally, WP4 will also be engaged in the production of a preliminary library of characterized modules that will be exploited for the validation and demonstration activity of this project. As with WP3, realizing a fully comprehensive library of components is beyond the capacity of this project and we will only show the path for the analysis of future use cases.

Leader: CNRS

Participants: POLITO, UoA, INTEL, THALES, YOGITECH, ABB

In this WP hardware systems are iteratively broken down into their basic components, which will be characterized from the reliability standpoint. By characterization we mean the computation of specific parameters and measures potentially impacting the overall system reliability (e.g., area, error masking probability, resource utilization, timing constraints, etc.). This approach follows the concept of modular design/IP reuse for fast TTM, which is a common challenge for both HPC and ES applications.

One of the key aspects of CLERECO is its cross-layer approach to reliability evaluation, where all system elements from the raw technologies up to the software layers are carefully considered with respect to their impact on system reliability. Hardware architectures will be analyzed considering their components (CPUs, memories, accelerators, peripherals, interconnects, etc.) at different levels of detail but always maintaining a connection between components and the related user-available instructions. The reliability-related behavior of hardware components at different stages of the system design cycle will be evaluated (in isolation): from very early specification stages, through high-level and more detailed design stages, down to the prototyping phases of the design flow. Thus, the reliability evaluation will be an iterative process providing different levels of detail while moving from the conceptual design phases, through all intermediate design phases, down to the post-silicon design validation phase.

The following sets of hardware components will be studied:

  • Microprocessor cores of different architectures (RISC, CISC, DSP) as well as their multicore and multithreaded versions. The different subcomponents of the processor cores will be separately analyzed: functional units, storage elements, control logic, performance mechanisms (such as predictors, prefetchers), etc.
  • Accelerators. A large class of accelerator cores that will be common in future multi/many core architectures will be considered, including Graphics Processing Units (GPU), Single Instruction, Multiple Data (SIMD) accelerators, cryptographic cores, etc.
  • Memories both on-chip and off-chip. The project will work on the dominant memory technologies in ES and HPC applications: SRAM, DRAM as well as Flash-based memories. Moreover, emerging resistive memory types such as PCMs will be considered.
  • Peripherals, input/output system. Different classes of peripheral devices will be evaluated mainly focusing on their electronic parts (the device controllers).
  • Interconnection elements. The reliability of the hardware components of the on-chip and off-chip interconnection networks will be analyzed.

During the course of the project, different hardware components may emerge or may prevail in different market segments. The CLERECO research work is adaptable and flexible enough to consider the characteristics of new components even in later stages, when the reliability evaluation framework has already been set up to some extent. Therefore, the impact of emerging components on the overall system reliability will be considered as a matter of course.

For each hardware component, a set of important parameters for the reliability of the system will be extracted. In particular, the following important information will be either estimated or actually measured (depending on the level of abstraction of each phase of the design cycle):

  • Size/area and "complexity" of the component: the reliability impact of hardware components significantly depends on their size (both for transient and for permanent faults). For each of the hardware components and each stage of the design cycle, the component size and "complexity" will be identified. Complexity will be indirectly measured by the available information at different design stages: number of inputs and outputs, sequential depth, estimated propagation delays, number of subcomponents, sizes of internal arrays, etc.
  • Power/energy budget: power/energy modifies the temperature profile of the devices impacting their ageing and wear-out and thus having a relevant impact on the reliability.
  • Utilization of the component: different hardware components of a computing system are utilized very differently by the instructions of a given processor architecture: some are intensively used while others are not; some are used in ways that may or may not affect reliability. This inherent "masking" of hardware faults by the instructions determines the component importance from a reliability point of view. Fault/error masking by higher software layers will be studied separately in WP4.
  • Existing fault tolerance mechanisms of the component: depending on the system, the hardware components may be either free of fault tolerance mechanisms or may already employ known techniques such as hardware-, time-, and information-redundancy at low levels (transistor-level) or higher-levels (architecture-level). This information significantly affects the expected reliability of the hardware components and will be taken into consideration.
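How the parameters listed above could combine into a per-component failure rate can be illustrated with a simple multiplicative derating model. This is a sketch under stated assumptions, not the CLERECO statistical model: the `Component` fields, the `raw_fit_per_bit` parameter, the assumption that the factors are independent, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    size_bits: int          # proxy for size/area
    utilization: float      # fraction of time the component holds live state
    masking: float          # probability that a fault is masked
    coverage: float = 0.0   # fraction of faults handled by existing protection


def effective_fit(c: Component, raw_fit_per_bit: float) -> float:
    """Derate the raw failure rate by utilization, masking and protection."""
    for value in (c.utilization, c.masking, c.coverage):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must lie in [0, 1]")
    return (raw_fit_per_bit * c.size_bits * c.utilization
            * (1.0 - c.masking) * (1.0 - c.coverage))


def system_fit(components, raw_fit_per_bit: float) -> float:
    """Sum the contributions, assuming independent components in series."""
    return sum(effective_fit(c, raw_fit_per_bit) for c in components)


# Hypothetical components and a hypothetical raw rate of 0.001 FIT per bit.
parts = [
    Component("register_file", size_bits=1000, utilization=0.5, masking=0.5),
    Component("cache", size_bits=8000, utilization=1.0, masking=0.0,
              coverage=0.9),
]
for part in parts:
    print(part.name, round(effective_fit(part, 0.001), 3))
print("system", round(system_fit(parts, 0.001), 3))
```

In this example the protected cache is eight times larger than the register file yet contributes only about three times its failure rate, which shows why existing fault tolerance mechanisms have to be part of the characterization.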

Both ASIC and FPGA implementations of different cores will be analyzed in this WP to fully consider all design alternatives available when working on real computing systems.

Finally, WP3 will also be engaged in the implementation of a preliminary library of characterized modules that will be employed for the validation and demonstration activity of this project in WP6. It is worth mentioning that realizing a fully comprehensive library of components is beyond the capacity of this project. We will show the path for the analysis of future use cases, focusing on instruments that allow a fast technology transfer from the research domain to real cases.

Leader: UoA

Participants: POLITO, CNRS, INTEL, THALES, YOGITECH, ABB

WP2 analyzes the different failure mechanisms that will be relevant in the future technologies likely to be employed in the computing continuum (scaled bulk CMOS, III-V Ge, FinFETs, spin logic, etc.) and works on identifying and characterizing the main sources of failure. Moreover, this WP also sets the reliability requirements for the different computing segments within the computing continuum, such as ES and HPC.

The starting point of WP2 activities consists of studying the defects and reliability failure mechanisms that are anticipated in future computing systems (due to technology and architectural specifications). This involves the selection of a set of use cases ranging from very specific ES applications to general-purpose HPC systems. Among the wide range of possible use cases we will target those in which architectural and technological solutions in the ES and HPC segments are rapidly converging, including for instance multi/many cores in the 22/16 nm FinFET technology node. Identified sources of failure will be characterized and used in the reliability estimation methodology developed in CLERECO.

System reliability is influenced by several parameters, which must all be carefully considered in the development of an accurate reliability evaluation framework. To take this into account, WP2 also aims at identifying the different operating modes of the system (e.g., voltage and frequency levels) and the different operating conditions (e.g., temperature, electronic noise, etc.).

The results of the project will eventually be captured in a series of reliability metrics that are required throughout heterogeneous segments of the computing continuum market. Some will be the well-known SDC/DUE FIT rates, but we expect to define additional metrics capturing different reliability aspects. Some will derive from safety standards, while others will derive from more implicit requirements (such as the user experience, or a FIT rate that has an impact on performance but not on correctness). WP2 is also in charge of determining the acceptable estimation error for the different design phases (from early abstract design phases up to final RTL).
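For readers unfamiliar with FIT rates: one FIT is one failure per 10^9 device-hours, and SDC and DUE stand for silent data corruption and detected unrecoverable error. The sketch below converts FIT rates to a mean time to failure. The function names and the sample rates are hypothetical, and the calculation assumes constant failure rates and independent, identical units.

```python
HOURS_PER_BILLION = 1e9  # one FIT = one failure per 10^9 device-hours


def total_fit(sdc_fit: float, due_fit: float, units: int = 1) -> float:
    """Combined SDC + DUE rate for a system of identical, independent units."""
    return units * (sdc_fit + due_fit)


def fit_to_mttf_hours(fit: float) -> float:
    """Mean time to failure in hours for a constant failure rate in FIT."""
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    return HOURS_PER_BILLION / fit


# Hypothetical device: 20 FIT of SDC and 80 FIT of DUE.
single = total_fit(20.0, 80.0)
print(fit_to_mttf_hours(single))    # a single device
cluster = total_fit(20.0, 80.0, units=1000)
print(fit_to_mttf_hours(cluster))   # 1000 such devices in one system
```

The same per-device rate that looks negligible in an embedded product translates into a failure every few thousand hours at the scale of a large machine, which is one reason requirements differ across segments.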

Leader: INTEL

Participants: POLITO, UoA, CNRS, THALES, YOGITECH, ABB