http://www.dmst.aueb.gr/dds/pubs/conf/1999-ESREL-SoftRel/html/chal.html This is an HTML rendering of a working paper draft that led to a publication. The publication should always be cited in preference to this draft using the following reference:
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
Diomidis Spinellis
Department of Information and Communication Systems,
University of the Aegean, Greece
Software increasingly forms a critical part of the design and operation of products, processes, equipment, and installations, affecting their safety and reliability [WCD+98]. Despite the important advances made over the last decades in the area of software engineering and the successful realisation of many safety-critical software systems, the evolution of computer technology is creating new challenges and different types of failure modes. In the following sections we examine how advances and changes in the areas of computer hardware components and subsystems, operating systems, software system architectures, programming languages, and the software development process can potentially affect the safety and reliability of computer-based systems. Despite the Luddite connotations of our presentation, we believe that a critical examination and appraisal of these advances and their effects is of paramount importance in the area of safety-critical applications.
Increasing chip densities have resulted in significant advances in processor hardware performance based on large scale integration of functional units, pipelined designs, and the provision of additional functionality [HP90, pp. 250-349]. However, as a result of pipelined architectures, it is practically impossible to reason accurately and completely about the execution of a program at the lowest level on modern processors. The interdependencies of the multiple functional units, the many levels of cache, and the branch predictors, all of which dynamically change their behaviour as the program executes (often in a multitasking environment), make the isolation of problems that arise at this level an experimental rather than an analytical task. Safety-critical software systems operating in such an environment therefore cannot be proven to satisfy a specification using formal methods.
Processor | Errata |
Intel 80C186 Embedded Processor | 5 |
Texas Instruments TMS320C40 DSP | 17 |
MIPS R4000SC | 55 |
MIPS R4400PC/SC | 23 |
Intel 386 EX Embedded Processor | 40 |
Intel Pentium | 82 |
Intel Pentium Pro | 77 |
Intel Pentium II | 58 |
In addition, as the complexity of processors increases, so does the number of errors that are part of a given implementation. Compiler and system software writers, and often end-user software developers as well, have to be aware of those errors and design their implementations around all known ones. Some of these hardware errors can result in a complete system lock, others in data corruption, and others in subtle differences in arithmetic results; all of them are obviously important to the designer of a computer-based safety-critical system. Table 1 illustrates the number of different errors documented (e.g. [Int96]) in some processor implementations.
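As an illustration of a subtle arithmetic erratum, the sketch below reproduces the widely reported check for the 1994 Pentium floating-point division flaw. The example and the function name `fdiv_residual` are our own additions and are not taken from the errata lists cited above. On a unit that divides correctly in IEEE 754 double precision the residual is exactly zero, whereas affected chips were reported to return 256.

```python
def fdiv_residual(x: float = 4195835.0, y: float = 3145727.0) -> float:
    """Return x - (x / y) * y.

    The default operands are the commonly quoted test values for the
    Pentium division flaw. With correct double-precision division the
    rounding errors cancel and the result is exactly 0.0.
    """
    return x - (x / y) * y


residual = fdiv_residual()
if residual == 0.0:
    print("division check passed")
else:
    print(f"division check failed, residual = {residual}")
```

A check of this kind only detects one known erratum; it says nothing about the remaining documented or undocumented errors of a given processor.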
Modern microprocessor-controlled components such as disk drives, network adapters, and graphic controllers often contain enough intelligence to create a potential for problems at the system integration level. As an example, many modern hard disks rely on a thermal recalibration procedure to compensate for temperature-induced changes in the drive's physical characteristics. Under some circumstances, an unsophisticated implementation of this procedure can delay the drive's response at unpredictable moments, rendering it unsuitable for real-time applications [TCKK95].
The allocation of interrupts and input/output addresses in PCI-based "Plug & Play" systems is performed at system startup using a complicated negotiation procedure among active subsystem components. In Windows-based systems, supporting drivers and modules are then loaded and run in a nondeterministic order. As a result, Gutmann [Gut98] reports that the state of such systems after a reboot is relatively unpredictable. The implication is that establishing a stable test platform, or reproducing a fault by following a specific sequence of actions, may not be feasible.
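The effect of such occasional stalls can be sketched as follows. The helper names (`measure_latencies`, `deadline_report`), the 50 ms deadline, and the synthetic latency figures are hypothetical and serve only to show why average-case measurements can hide a real-time violation.

```python
import statistics
import time


def measure_latencies(operation, repetitions):
    """Time each call to `operation`; return the latencies in seconds."""
    latencies = []
    for _ in range(repetitions):
        start = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - start)
    return latencies


def deadline_report(latencies, deadline):
    """Summarise latencies against a hard deadline (same unit as latencies)."""
    return {
        "median": statistics.median(latencies),
        "worst": max(latencies),
        "misses": sum(1 for t in latencies if t > deadline),
    }


# Synthetic data (assumed, not measured): 99 requests served in 10 ms
# and one request stalled for 500 ms, as a recalibration pause might do.
samples = [0.010] * 99 + [0.500]
report = deadline_report(samples, deadline=0.050)
print(report)
```

The median of the synthetic samples is well within the deadline, yet one request misses it; for a hard real-time requirement the worst case is what counts. In practice `measure_latencies` would be applied to the actual device operation over a long observation period.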
This increasing complexity has important implications for the reliability of software developed for a specific platform. Complicated interfaces are difficult to learn and use effectively [Spi98b]. As a result of their size and complexity, modern operating systems exhibit an increasing number of bugs, as demonstrated by the numerous "fixes" distributed by their vendors. Developers of robust applications have to take these bugs into account by coding around them, or insist on the installation of all relevant fixes. Some fixes may even introduce new errors or render other system components inoperative. The bottom line is that the application developer is rarely solely responsible for the reliability of an application.
Modern networked, multi-tier software system architectures increase the number of failure modes exponentially with the number of interconnected nodes [Spi98a]. Commercial off-the-shelf (COTS) software components are increasingly used as parts of integrated systems [Voa98b]. Their quality is often difficult to assess [Voa98a], and due to the tight coupling between components enforced by some programming language features (structured exception handling, heap-based dynamic memory allocation, and unbounded pointers) they may affect the reliability of the software system in totally unforeseen ways.
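The exponential growth can be made concrete with a deliberately simplified model, which is our assumption rather than the model used in [Spi98a]: each node is either working or failed, so a system of n nodes has 2^n - 1 states in which at least one node has failed. The function name `failure_state_count` is hypothetical.

```python
def failure_state_count(nodes: int) -> int:
    """Count system states with at least one failed node.

    Simplifying assumption: every node is either working or failed,
    so n nodes give 2**n combinations, one of which is fully working.
    """
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return 2 ** nodes - 1


for n in (1, 2, 3, 10, 20):
    print(f"{n:2d} nodes: {failure_state_count(n)} failure states")
```

Real nodes have more than two states (degraded, slow, partitioned), so this count is best read as a lower bound under the stated assumption.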
The use of the Internet as a common network infrastructure often exposes applications to additional failure modes related to the open and insecure nature of the medium. Applications using the Internet as a data pipe can face problems related to connectivity, congestion, routing, and the domain name system. In addition, such applications are exposed to hostile attacks that can be carried out over the network [Bhi96,Den90]. Typical applications are not coded to guard against malicious attacks; in fact even system software that should have been coded in such a way is often compromised [Spa89]. Therefore, the connection of any safety-critical system to the Internet can severely affect its reliability.
Like operating systems, programming languages also tend to grow in size and complexity as they mature. Taking the page count of each language's canonical description as a rough measure, Table 3 illustrates the evolution of the C and C++ programming languages.
Title | Year | Pages |
The C Programming Language (Kernighan and Ritchie) | 1978 | 228 |
The C Programming Language; second edition (Kernighan and Ritchie) | 1988 | 272 |
The C++ Programming Language (Stroustrup) | 1986 | 328 |
The C++ Programming Language; second edition (Stroustrup) | 1991 | 669 |
The C++ Programming Language; third edition (Stroustrup) | 1997 | 910 |
This trend has important implications for the developers of high-reliability systems. Large languages are difficult to learn and use [Hoa83]. It is nowadays not uncommon for programming teams to lack people who understand the whole language at a level sufficient to advise other members on issues regarding the interrelationship between language elements. Subtle bugs arising from the misunderstanding of language features can thus survive code walkthroughs. In addition, language complexity and advanced optimisation techniques, combined with processor complexity, result in an increased number of bugs in modern compilers. This is clearly an additional risk factor for high-reliability designs.
The changing nature of the software development process can also negatively affect the reliability of the delivered system. Information technology outsourcing [LWF95] may exclude the contractor's software developers from a holistic system-wide perspective, resulting in dangerous misunderstandings and grey areas of responsibility. In addition, the increasing adoption of quality systems such as the ISO-9000 series [ISO91] and the Capability Maturity Model [HZG+97] may provide software developers and procurers with a false sense of security.
Despite the advances made over the last decades in the design and implementation of safety-critical systems, major new challenges lie ahead. It is important for managers, designers, and developers to be aware that all architectural, technological, and organisational improvements in the realisation of software systems carry with them new challenges and dangers. The solution to these is most probably not technology-based [Bro87]. In the demanding world of software-based safety-critical systems, planning in advance for the new challenges is as important as embracing the new technologies.
The work reported herein was carried out within the context of ISA-EUNET, an ESPRIT (ESSI-ESBNET, project number 27450) R&D project funded by the Directorate General III of the European Commission.