3. Introduction to Fault Tolerance
3.1 Basic Concepts ...
Fault tolerance is the ability of a system to continue performing its tasks correctly after faults occur.
Reliability of a system is a function of time, R(t), defined as the probability that the system performs its tasks correctly throughout the time interval [t0, t], given that the system was performing correctly at time t0.
Availability is a function of time, A(t), defined as the probability that a system is operating correctly and is available to perform its functions at the instant of time t.
The design of fault-tolerant systems is based on two distinct techniques:
Fault masking
Detection, location, and recovery (via reconfiguration) of the system to remove the faulty component.
If the reconfiguration technique is chosen, the following are used ...
first ...
Fault detection techniques
Fault location techniques
afterwards ...
Fault recovery techniques
Fault recovery techniques ...
Backward recovery (Rollback Recovery)
Forward recovery (Forward Recovery)
All techniques for designing fault-tolerant systems are based on some type and degree of redundancy.
Redundancy is implemented through the use of hardware, software, information, or time beyond what is needed for normal system operation.
Important: redundancy has a major impact on the system in terms of performance, size, weight, power consumption, and reliability.
3.2 Hardware Redundancy
Passive
Active
Hybrid
1. Passive
Based on the concept of fault masking: hides the occurrence of faults and prevents them from resulting in errors (developed around the concept of majority voting).
Does not provide fault detection; it simply masks faults.
Figure: Basic concept of Triple Modular Redundancy (TMR). Module 1, Module 2, and Module 3 feed a Voter, which produces the Output.
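The majority voting behind TMR can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three module functions and the injected fault are hypothetical, and the voter requires exact agreement between outputs.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules.

    Raises ValueError when no value has a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value

# Three hypothetical modules compute the same function; module 2 is faulty.
def module_1(x):
    return x * x

def module_2(x):
    return x * x + 1  # injected fault, for illustration only

def module_3(x):
    return x * x

result = majority_vote([m(5) for m in (module_1, module_2, module_3)])
print(result)  # 25: the single faulty output is masked
```

Note that the voter masks the fault without reporting which module failed, which matches the passive approach described above.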
Figure: The use of triplicated voters in a TMR configuration (Proc 1, Proc 2, Proc 3; three Voters; Mem 1, Mem 2, Mem 3).
Voting at Several Levels within N-Modular Redundancy (NMR) Systems
Approach 1: Three independent temperature sensors sample the temperature, and a vote is performed on the three sensor values. Three separate modules then calculate the amount of heating/cooling, and a vote on the calculations determines the result.
Approach 2: Three independent sensors sample the temperature, perform the calculations, and then provide a single vote on the final result.
Difference between the two approaches: fault containment.
Voting at the sensors masks and contains the effects of a possible sensor fault.
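The containment effect can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the setpoint, the sensor values, the calculation function, and the choice to inject one sensor fault and one calculation fault in different chains.

```python
from collections import Counter

SETPOINT = 22  # hypothetical target temperature

def vote(values):
    """Majority vote; returns None when no value has a strict majority."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else None

def heat_demand(temp):
    return SETPOINT - temp

def faulty_heat_demand(temp):
    return -99  # injected fault in one calculation module

sensors = [20, 95, 20]                                   # sensor 2 is faulty
calcs = [heat_demand, heat_demand, faulty_heat_demand]   # module 3 is faulty

# Voting at two levels: vote on the sensor values, then on the calculations.
voted_temp = vote(sensors)
two_level = vote([calc(voted_temp) for calc in calcs])

# Single final vote: each chain uses its own sensor and calculation module.
single_level = vote([calc(temp) for calc, temp in zip(calcs, sensors)])

print(two_level)     # 2: the sensor fault was contained before the calculation
print(single_level)  # None: two of the three chains are corrupted
```

With a single fault both arrangements give the same answer; the difference only shows once a second fault appears in another chain.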
Figure: Example of SW voting (Proc 1, Proc 2, Proc 3; three copies of Task A, plus Task B and a Voter Task).
HW voting or SW voting? The decision depends on:
1. The availability of a processor to perform the voting
2. The speed at which voting must be performed
3. The criticality of space, power, and weight/volume limitations
4. The number of different voters that must be provided
5. The flexibility required of the voter with respect to future changes in the system
In practical applications of voting, the three results in a TMR system may not completely agree, even in a fault-free environment.
For example, A/D converters in sensors may produce quantities that disagree in the least-significant bits. This disagreement can propagate into larger discrepancies after computation, which can significantly affect the voting process.
Solution: Mid-Value Select Technique
A TMR system selects the value that lies in the middle of the others.
Figure: Mid-value selection (corrupted signal, uncorrupted signals, selected signal).
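A minimal Python sketch of mid-value selection follows. The readings are hypothetical values chosen to show two channels that differ slightly and a third that is corrupted.

```python
def mid_value_select(a, b, c):
    """Return the value that lies between the other two."""
    return sorted((a, b, c))[1]

# Two healthy channels disagree in the low-order digits; the third is corrupted.
print(mid_value_select(20.01, 19.99, 87.5))  # 20.01
```

Unlike an exact-match majority vote, this selection still yields a usable value when no two channels agree bit for bit.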
2. Active (or Dynamic)
Attempts to achieve fault tolerance by means of fault detection, fault location, reconfiguration, and recovery. The property of fault masking is not obtained: there is no attempt to prevent faults from producing errors within the system.
More suitable for applications where temporary, erroneous results are acceptable, as long as the system reconfigures and regains its operational status in a satisfactory length of time.
Duplication of functional units
Standby module technique
Hot Standby Sparing
Cold Standby Sparing
Figure: A software implementation of duplication with comparison. Processor A and Processor B each hold their own result in private memory and place a copy in shared memory; a comparison task on each processor compares the two results and raises error signals A and B.
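The scheme can be mimicked in Python as a rough sketch. The workload functions and the dictionary standing in for shared memory are assumptions made for illustration; real processors would run these steps in parallel.

```python
def checksum(data):
    """Hypothetical workload executed by both processors."""
    return sum(data) % 256

def faulty_checksum(data):
    return (sum(data) + 1) % 256  # injected fault, for illustration only

def duplicate_and_compare(task_a, task_b, data):
    private_a = task_a(data)                   # Processor A's private memory
    private_b = task_b(data)                   # Processor B's private memory
    shared = {"A": private_a, "B": private_b}  # copies placed in shared memory
    error_a = private_a != shared["B"]         # comparison task on Processor A
    error_b = private_b != shared["A"]         # comparison task on Processor B
    return error_a, error_b

print(duplicate_and_compare(checksum, checksum, [1, 2, 3]))         # (False, False)
print(duplicate_and_compare(checksum, faulty_checksum, [1, 2, 3]))  # (True, True)
```

A disagreement signals that a fault exists, but the comparison alone cannot tell which processor is the faulty one.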
3. Hybrid
Combines the attractive features of
both the Active and the Passive
approaches.
3.3 Software Redundancy
Consistency Checks
Capability Checks
N-Self-Checking Programming
N-Version Programming
Recovery Blocks
Consistency Checks
Uses prior knowledge of the characteristics of a given piece of information to verify that the information is correct.
Typically, in most applications it is known that a certain quantity of a given operand must not exceed a previously defined value.
Examples ...
A processing system can sample and store many sensor
readings in a typical control application.
The amount of cash requested by a patron at a bank’s teller
machine should never exceed the maximum withdrawal allowed.
The address generated by a computer should never lie outside
the address range of the available memory.
In a computer, each instruction code can be checked to verify
that it is not one of the illegal codes.
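The withdrawal, address, and instruction-code examples translate directly into range and membership tests. The sketch below is illustrative only: the withdrawal limit, memory size, and opcode set are hypothetical values, not taken from any real system.

```python
MAX_WITHDRAWAL = 500                      # hypothetical withdrawal limit
MEMORY_SIZE = 0x10000                     # hypothetical 64 KiB address space
LEGAL_OPCODES = {0x00, 0x01, 0x02, 0x10}  # hypothetical instruction set

def withdrawal_is_consistent(amount):
    """The requested cash must not exceed the maximum withdrawal."""
    return 0 < amount <= MAX_WITHDRAWAL

def address_is_consistent(address):
    """A generated address must lie inside the available memory."""
    return 0 <= address < MEMORY_SIZE

def opcode_is_consistent(opcode):
    """An instruction code must not be one of the illegal codes."""
    return opcode in LEGAL_OPCODES

print(withdrawal_is_consistent(200), withdrawal_is_consistent(5000))  # True False
print(address_is_consistent(0xFFFF), address_is_consistent(0x10000))  # True False
print(opcode_is_consistent(0x10), opcode_is_consistent(0x7F))         # True False
```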
Capability Checks
Capability checks are performed to verify that a
system possesses the capability expected.
Examples ...
Check whether a computer has the complete memory available.
Check whether the processors in a multiprocessor system are
alive.
Periodically, a processor can execute specific instructions on
specific data and compare the results to known good results
stored in a ROM; this checks both the ALU and the memory.
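A rough Python sketch of such a periodic self-test follows. The tuple standing in for the ROM, the test vectors, and the memory patterns are all assumptions made for illustration.

```python
import operator

# Hypothetical "ROM" contents: (operation, operands, known good result).
ROM_TEST_VECTORS = (
    (operator.add, (0x55, 0xAA), 0xFF),
    (operator.and_, (0xF0, 0x3C), 0x30),
    (operator.xor, (0xFF, 0x0F), 0xF0),
    (operator.mul, (7, 8), 56),
)

def alu_self_test(vectors=ROM_TEST_VECTORS):
    """Run each operation and compare it with the stored good result."""
    return all(op(*args) == expected for op, args, expected in vectors)

def memory_test(memory):
    """Write and read back two patterns. Destroys the previous contents."""
    for pattern in (0x55, 0xAA):
        for i in range(len(memory)):
            memory[i] = pattern
        if any(cell != pattern for cell in memory):
            return False
    return True

print(alu_self_test())             # True
print(memory_test(bytearray(64)))  # True
```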
N-Self-Checking Programming
Figure: The N-Self-Checking Programming approach to software fault tolerance (program inputs feed each program version; each version has its own acceptance tests; selection logic produces the program outputs).
Hot Standby:
all programs are running concurrently
Reduced recovery latency:
reconfiguration process is very fast
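The structure can be sketched in Python as follows. The square-root versions, the injected design fault, and the acceptance test are hypothetical, and the sequential loop merely stands in for versions that would run concurrently.

```python
import math

def sqrt_v1(x):
    return math.sqrt(x)

def sqrt_v2(x):
    return x ** 0.5

def sqrt_faulty(x):
    return x / 2  # injected design fault, for illustration only

def acceptance_test(x, y, tol=1e-9):
    """Accept y as the square root of x when y * y is close enough to x."""
    return abs(y * y - x) <= tol * max(1.0, x)

def n_self_checking(versions, x):
    # Hot standby: every version runs, so a checked result is ready at once.
    results = [version(x) for version in versions]
    passed = [y for y in results if acceptance_test(x, y)]
    if not passed:
        raise RuntimeError("no version passed its acceptance test")
    return passed[0]  # selection logic: first output that passed

print(n_self_checking([sqrt_faulty, sqrt_v1, sqrt_v2], 9.0))  # 3.0
```

Because all versions have already produced checked results, switching away from a failed version requires no recomputation, which is the reduced recovery latency noted above.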
3.4 Information Redundancy
Parity, Berger, and m-of-n codes
Arithmetic codes
Hamming codes
Checksum codes
CRC (Cyclic Redundancy Check) codes
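Two of the listed codes, parity and CRC, can be tried out in Python as a minimal sketch. The data word and message are arbitrary examples, the parity helpers are hypothetical names, and the CRC shown is the CRC-32 provided by the standard `zlib` module.

```python
import zlib

def even_parity_bit(word):
    """Parity bit that makes the total number of 1s even."""
    return bin(word).count("1") % 2

def add_parity(word):
    """Append the parity bit as the new least-significant bit."""
    return (word << 1) | even_parity_bit(word)

def parity_ok(codeword):
    return bin(codeword).count("1") % 2 == 0

codeword = add_parity(0b01101010)
corrupted_word = codeword ^ 0b1000  # single-bit error

message = b"fault tolerance"
stored_crc = zlib.crc32(message)
corrupted_message = bytes([message[0] ^ 0x01]) + message[1:]  # one flipped bit

print(parity_ok(codeword), parity_ok(corrupted_word))  # True False
print(zlib.crc32(message) == stored_crc)               # True
print(zlib.crc32(corrupted_message) == stored_crc)     # False
```

Both codes here only detect the error; correcting it would need a code such as Hamming.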
3.5 Time Redundancy
Transient Fault Detection
Permanent Fault Detection
Recomputation for Error Correction
Transient Fault Detection
The fundamental concept is to perform the same computation two or more times and compare the results to determine if a discrepancy exists.
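Repeating a computation and comparing the results can be sketched in Python as below. The `GlitchyAdder` class is a hypothetical stand-in for a unit that suffers one transient fault on its first use.

```python
def detect_transient(compute, data, repetitions=2):
    """Run the same computation several times; flag any discrepancy."""
    results = [compute(data) for _ in range(repetitions)]
    error = any(r != results[0] for r in results[1:])
    return results[0], error

class GlitchyAdder:
    """Hypothetical unit that suffers one transient fault on its first call."""

    def __init__(self):
        self.calls = 0

    def __call__(self, data):
        self.calls += 1
        total = sum(data)
        # Flip one result bit on the first call only.
        return total ^ 0b100 if self.calls == 1 else total

print(detect_transient(sum, [1, 2, 3]))             # (6, False)
print(detect_transient(GlitchyAdder(), [1, 2, 3]))  # (2, True)
```

With only two repetitions the discrepancy is detected but it is not known which result is correct.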
Permanent Fault Detection
Figure: Time redundancy for permanent fault detection. At time t0: data, computation, store result. At time t1: data, encode data, computation, decode result, store result. The two stored results are compared, and a mismatch signals an error.
Example encoding functions might be the complementation operator or an arithmetic shift. Each computation is paired with a recomputation that checks it:
6 ÷ 4 = 1, remainder 2; check: (1 x 4) + 2 = 6
7 x 8 = 56; check: 56 ÷ 8 = 7
7 x 8 = 56; check: 8 x 7 = 56
2 + 9 = 11; check: 11 - 9 = 2
0110.1010 AND 0111.1111 = 0110.1010; check (the shifts here are circular):
0110.1010 shift right 2: 1001.1010,
0111.1111 shift right 2: 1101.1111,
1001.1010 AND 1101.1111 = 1001.1010,
1001.1010 shift left 2: 0110.1010
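The AND example can be replayed in Python to see why recomputing with shifted operands exposes a permanent fault. This is a sketch under assumptions: the 8-bit width, the circular shifts, and the stuck-at-1 fault on one bit slice of the AND unit are chosen for illustration.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1

def rotate_right(x, n):
    n %= WIDTH
    return ((x >> n) | (x << (WIDTH - n))) & MASK

def rotate_left(x, n):
    return rotate_right(x, WIDTH - n)

def and_unit(a, b, stuck_at_one_bit=None):
    """AND unit with an optional permanent stuck-at-1 fault on one bit slice."""
    result = a & b
    if stuck_at_one_bit is not None:
        result |= 1 << stuck_at_one_bit
    return result

def detect_permanent(a, b, fault_bit=None, shift=2):
    first = and_unit(a, b, fault_bit)            # time t0: plain operands
    encoded = and_unit(rotate_right(a, shift),   # time t1: encoded operands
                       rotate_right(b, shift), fault_bit)
    second = rotate_left(encoded, shift)         # decode the result
    return first, second, first != second

a, b = 0b01101010, 0b01111111
print(detect_permanent(a, b))               # (106, 106, False)
print(detect_permanent(a, b, fault_bit=0))  # (107, 110, True)
```

The faulty bit slice corrupts a different bit position in each pass, so the two results disagree even though the fault is present both times.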
Recomputation for Error Correction
The time redundancy approach can also provide error correction if the computations are repeated three or more times.
Consider the example of a logical AND operation. Suppose the operation is performed three times: first, without shifting the operands; second, with a one-bit logical shift of the operands; and third, with a two-bit logical shift of the operands.
Then, the results generated using the shifted operands are
shifted back to their original bit positions.
Because each of the three operations used operands that were
displaced from each other by at least one bit position, a different
bit in each result will be affected by the faulty bit slice.
If the bits in each position are then compared, the results due to
the faulty bit slice can be corrected by performing a majority vote
on the three results.
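The three-pass scheme with a bitwise majority vote can be sketched in Python. Assumptions made for illustration: 8-bit operands, a unit at least two bits wider than the operands so that left shifts lose no data, and a stuck-at-1 fault on one bit slice.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1

def and_unit(a, b, stuck_at_one_bit=None):
    """AND unit with an optional permanent stuck-at-1 fault on one bit slice."""
    result = a & b
    if stuck_at_one_bit is not None:
        result |= 1 << stuck_at_one_bit
    return result

def corrected_and(a, b, fault_bit=None):
    results = []
    for shift in (0, 1, 2):
        # Logical shift of the operands, then shift the result back.
        shifted = and_unit(a << shift, b << shift, fault_bit)
        results.append((shifted >> shift) & MASK)
    r0, r1, r2 = results
    # Bitwise majority vote over the three realigned results.
    return (r0 & r1) | (r0 & r2) | (r1 & r2)

a, b = 0b01101010, 0b01111111
print(and_unit(a, b, stuck_at_one_bit=0) == a & b)  # False: one pass is wrong
print(corrected_and(a, b, fault_bit=0) == a & b)    # True: the vote corrects it
```

Each pass places the faulty slice under a different result bit, so at any bit position at most one of the three results is wrong and the majority is correct.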