Bohrbug vs. Heisenbug
Bohrbug (named after Niels Bohr) – deterministic: retrying will result in the same error.
Heisenbug (named after Werner Karl Heisenberg) – nondeterministic: retrying may or may not result in the same error.
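The retry distinction above can be sketched as a small heuristic. This is an illustration only: the names classify_by_retry, divide_by_zero and make_flaky are hypothetical, and the attempt count of 5 is an arbitrary assumption. A real Heisenbug can still fail on every attempt by chance, so the labels are hints, not proof.

```python
def classify_by_retry(operation, attempts=5):
    """Run `operation` several times and label the failure pattern.

    Hypothetical helper: the labels are a heuristic, not a diagnosis.
    """
    failures = 0
    for _ in range(attempts):
        try:
            operation()
        except Exception:
            failures += 1
    if failures == 0:
        return "no failure observed"
    if failures == attempts:
        return "consistent failure (Bohrbug-like)"
    return "intermittent failure (Heisenbug-like)"


def divide_by_zero():
    # Deterministic fault: fails the same way on every call.
    return 1 / 0


def make_flaky():
    # Simulated transient fault: fails on every odd-numbered call.
    calls = {"n": 0}

    def flaky():
        calls["n"] += 1
        if calls["n"] % 2 == 1:
            raise RuntimeError("simulated transient fault")

    return flaky


print(classify_by_retry(divide_by_zero))
print(classify_by_retry(make_flaky()))
```

In practice this is why a plain retry is a reasonable recovery strategy for Heisenbugs but not for Bohrbugs.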
- Omission Failure – absence of response
- Timing Failure – system does not provide a result within the specified time frame (early timing failure/late timing failure)
- Response Failure – system provides an incorrect value
- Crash Failure – no further responses until reboot
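As a rough illustration of how the four failure types differ, the sketch below classifies a single observed call. Everything here is an assumption made for the example: the Failure enum, the classify function, its parameters (earliest, latest, expected, alive) and the order in which the checks are applied are not part of the notes.

```python
from enum import Enum


class Failure(Enum):
    NONE = "none"
    OMISSION = "omission"
    TIMING_EARLY = "early timing"
    TIMING_LATE = "late timing"
    RESPONSE = "response"
    CRASH = "crash"


def classify(response, elapsed, earliest, latest, expected, alive=True):
    """Classify one observed call (hypothetical helper).

    response: value returned, or None if nothing came back
    elapsed:  seconds the call took
    earliest, latest: bounds of the specified time frame
    expected: the correct value
    alive:    False if the system stopped responding until reboot
    """
    if not alive:
        return Failure.CRASH          # no further responses until reboot
    if response is None:
        return Failure.OMISSION       # absence of response
    if elapsed < earliest:
        return Failure.TIMING_EARLY   # result arrived too soon
    if elapsed > latest:
        return Failure.TIMING_LATE    # result arrived too late
    if response != expected:
        return Failure.RESPONSE       # incorrect value
    return Failure.NONE


print(classify(41, 0.5, 0.1, 1.0, 42))
```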
Principles of design
- Modularity
- Fail fast / heartbeat
- Independent failure modes
- Redundancy/repair – hot swapping
- Elimination of single point of failure – redundancy
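The heartbeat idea from the list above can be sketched as a monitor that records when each node last reported in and flags nodes that have been silent for too long. The class name HeartbeatMonitor, its methods and the timeout value are hypothetical; the clock is injectable only so the example can be run without waiting.

```python
import time


class HeartbeatMonitor:
    """Track last-seen times and report nodes that missed their heartbeat."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout    # seconds of silence tolerated
        self.clock = clock        # injectable clock (assumption, for testing)
        self.last_seen = {}

    def beat(self, node):
        # Called whenever a heartbeat from `node` arrives.
        self.last_seen[node] = self.clock()

    def failed_nodes(self):
        # A node is suspected failed if it has been silent longer than timeout.
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


# Usage with a fake clock so the behaviour is deterministic.
fake_time = [0.0]
monitor = HeartbeatMonitor(timeout=3.0, clock=lambda: fake_time[0])
monitor.beat("a")
monitor.beat("b")
fake_time[0] = 2.0
monitor.beat("a")          # only "a" keeps reporting
fake_time[0] = 4.0
print(monitor.failed_nodes())
```

A heartbeat only works if modules fail fast: a module that stops cleanly on error goes silent, whereas one that keeps running in a bad state may keep sending heartbeats.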
Pair/Spare approach – provide two instances of each critical resource and the ability to swap one out.
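A minimal sketch of the swap, following only the definition given here (two instances, one can be swapped out). The PairSpare class and the two stand-in resources are hypothetical; a real system would also repair or replace the failed instance.

```python
class PairSpare:
    """Hold two instances of a resource and swap to the spare on failure."""

    def __init__(self, primary, spare):
        self.active = primary
        self.standby = spare

    def call(self, *args, **kwargs):
        try:
            return self.active(*args, **kwargs)
        except Exception:
            # The active instance failed: swap it out and retry once on the spare.
            self.active, self.standby = self.standby, self.active
            return self.active(*args, **kwargs)


def broken_primary(x):
    # Stand-in for a failed resource.
    raise RuntimeError("primary is down")


def healthy_spare(x):
    # Stand-in for the spare instance.
    return x * 2


resource = PairSpare(broken_primary, healthy_spare)
print(resource.call(21))
```

If the spare also fails, the exception propagates to the caller, which keeps the sketch fail-fast rather than hiding a double failure.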