a proposed byzantine fault-tolerant voting architecture ... · a proposed byzantine fault-tolerant...

23
A Proposed Byzantine Fault-Tolerant Voting Architecture Using Time-Triggered Ethernet Andrew Loveless, NASA Johnson Space Center Christian Fidi, Stefan Wernitznigg, TTTech SAE 2017 AeroTech Congress & Exhibition Fort Worth, TX 26 28 September 2017 https://ntrs.nasa.gov/search.jsp?R=20170010131 2020-01-06T16:28:45+00:00Z

Upload: others

Post on 11-Oct-2019

7 views

Category:

Documents


0 download

TRANSCRIPT

A Proposed Byzantine Fault-Tolerant Voting Architecture Using Time-Triggered Ethernet

Andrew Loveless, NASA Johnson Space Center

Christian Fidi, Stefan Wernitznigg, TTTech

SAE 2017 AeroTech Congress & Exhibition

Fort Worth, TX

26 – 28 September 2017

https://ntrs.nasa.gov/search.jsp?R=20170010131 2020-01-06T16:28:45+00:00Z

SAE INTERNATIONAL Paper # 2017-01-2111 2/23

COTS in Manned Spacecraft

COTS technologies are attractive for use in human-rated spacecraft.

• Reduces development and upgrade costs.

• Lowers the need for new design work.

• Eliminates reliance on individual suppliers.

• Leverages larger knowledge base.

• Minimizes schedule risk.

Problem? Hard to meet the high reliability and fault tolerance requirements.

• E.g. 10-9 failures/hour in ultra-dependable systems.

• E.g. Crit-1, “fly-through” fault tolerance.

• Studies for Orion showed purely COTS designs would result in poor reliability and undue expense.

Often custom proprietary solutions are needed.

SAE INTERNATIONAL Paper # 2017-01-2111 3/23

COTS in Manned Spacecraft (cont.)

But the inclusion of COTS technologies is becoming more feasible.

• Greater availability of rad-tolerant components.

• TMR (Maxwell SCS750), lock-step (ARM R5).

• Ability to realize fault-containment regions.

• Growing number of suppliers.

NASA’s strategy for future spacecraft has heavily prioritized using COTS parts.

• Includes launchers, landers, etc.

Multiple projects have explored realizing safety-critical systems using COTS.

• Scalable Processor-Independent Design for Extended Reliability (SPIDER).

• Heavy Lift Vehicle (HLV) Architecture Study.

• Evolvable Mars Campaign (lander).

SAE INTERNATIONAL Paper # 2017-01-2111 4/23

Fault Classifications

All Faults

Byzantine Omissive Symmetric

Transmissive

Asymmetric

Transmissive

Symmetric

Omissive

Symmetric

Strictly Omissive

Asymmetric OTH-4:

PFH-3:

BYZ-1:

SAE INTERNATIONAL Paper # 2017-01-2111 5/23

Fault Classifications (cont.)

All Faults

Byzantine Omissive Symmetric

Transmissive

Asymmetric

Transmissive

Symmetric

Omissive

Symmetric

Strictly Omissive

Asymmetric OTH-4:

PFH-3:

BYZ-1:

Different observers

see a fault manifest

in the same way.

SAE INTERNATIONAL Paper # 2017-01-2111 6/23

Fault Classifications (cont.)

All Faults

Byzantine Omissive Symmetric

Transmissive

Asymmetric

Transmissive

Symmetric

Omissive

Symmetric

Strictly Omissive

Asymmetric OTH-4:

PFH-3:

BYZ-1:

Different observers

see a fault manifest

in different ways.

SAE INTERNATIONAL Paper # 2017-01-2111 7/23

Fault Classifications (cont.)

All Faults

Byzantine Omissive Symmetric

Transmissive

Asymmetric

Transmissive

Symmetric

Omissive

Symmetric

Strictly Omissive

Asymmetric

Byzantine faults are often not considered in satellites.

• Possibility is considered low enough to not warrant additional complexity.

• Impacts of faults are less severe (e.g. not taking a picture).

Manned spacecraft must tolerate Byzantine faults.

• Especially for dynamic mission phases with short time to effect.

• Higher number of “all-or-none” events (e.g. deploy parachutes).

• Failure could result in loss of life.

SAE INTERNATIONAL Paper # 2017-01-2111 8/23

Byzantine Faults

Byzantine faults can disrupt consensus among redundant processors.

• E.g. on internal state information.

• E.g. on sensor data.

• E.g. on diagnosis of system faults.

Occur at rates much > 10-9 failures/hour.

• Slightly-off-specification (SOS) hardware.

• Stuck transmitter – different receivers can interpret a marginal signal differently.

• Time base corruption – messages received slightly too early or too late.

Several architectural approaches for Byzantine-resilient systems.

• Hierarchical – e.g. SAFEbus, Orion VMCs.

• Full exchange – e.g. Draper FTMP, SPIDER.

SAE INTERNATIONAL Paper # 2017-01-2111 9/23

A Typical Approach

OBC1

CCDL Interface

Bus Interface

OBC2

CCDL Interface

Bus Interface

OBC3

CCDL Interface

Bus Interface

External time

reference Cross-Channel Data Link (CCDL)

Interstage

CCDL Interface

COM1

PDU2 PDU1

RIU1

COM2

PDU3

RIU2

Bu

s C

han

nel B

Bu

s C

han

nel A

Bu

s C

han

nel C

“Channelized bus” approach is common in launchers.

• Each OBC can only access devices on its local bus.

• Uses full exchanges.

• Usually designed to be 1FT.

Examples:

• X-38 CRV, Ares I, Delta IV.

Shortcomings?

SAE INTERNATIONAL Paper # 2017-01-2111 10/23

A Typical Approach (cont.)

OBC1

CCDL Interface

Bus Interface

OBC2

CCDL Interface

Bus Interface

OBC3

CCDL Interface

Bus Interface

External time

reference Cross-Channel Data Link (CCDL)

Interstage

CCDL Interface

COM1

PDU2 PDU1

RIU1

COM2

PDU3

RIU2

Bu

s C

han

nel B

Bu

s C

han

nel A

Bu

s C

han

nel C

Shortcomings?

1. Requires separate CCDL for data exchange between OBCs.

SAE INTERNATIONAL Paper # 2017-01-2111 11/23

A Typical Approach (cont.)

OBC1

CCDL Interface

Bus Interface

OBC2

CCDL Interface

Bus Interface

OBC3

CCDL Interface

Bus Interface

External time

reference Cross-Channel Data Link (CCDL)

Interstage

CCDL Interface

COM1

PDU2 PDU1

RIU1

COM2

PDU3

RIU2

Bu

s C

han

nel B

Bu

s C

han

nel A

Bu

s C

han

nel C

Shortcomings?

1. Requires separate CCDL for data exchange between OBCs.

2. Often requires external timing hardware for synchronization.

SAE INTERNATIONAL Paper # 2017-01-2111 12/23

A Typical Approach (cont.)

OBC1

CCDL Interface

Bus Interface

OBC2

CCDL Interface

Bus Interface

OBC3

CCDL Interface

Bus Interface

External time

reference Cross-Channel Data Link (CCDL)

Interstage

CCDL Interface

COM1

PDU2 PDU1

RIU1

COM2

PDU3

RIU2

Bu

s C

han

nel B

Bu

s C

han

nel A

Bu

s C

han

nel C

Shortcomings?

1. Requires separate CCDL for data exchange between OBCs.

2. Often requires external timing hardware for synchronization.

3. Requires separate interstage to meet minimum number of FCRs.

SAE INTERNATIONAL Paper # 2017-01-2111 13/23

A Typical Approach (cont.)

OBC1

CCDL Interface

Bus Interface

OBC2

CCDL Interface

Bus Interface

OBC3

CCDL Interface

Bus Interface

External time

reference Cross-Channel Data Link (CCDL)

Interstage

CCDL Interface

COM1

PDU2 PDU1

RIU1

COM2

PDU3

RIU2

Bu

s C

han

nel B

Bu

s C

han

nel A

Bu

s C

han

nel C

Shortcomings?

1. Requires separate CCDL for data exchange between OBCs.

2. Often requires external timing hardware for synchronization.

3. Requires separate interstage to meet minimum number of FCRs.

4. Requires two rounds of data exchange between OBCs.

SAE INTERNATIONAL Paper # 2017-01-2111 14/23

A Typical Approach (cont.)

OBC1

CCDL Interface

Bus Interface

OBC2

CCDL Interface

Bus Interface

OBC3

CCDL Interface

Bus Interface

External time

reference Cross-Channel Data Link (CCDL)

Interstage

CCDL Interface

COM1

PDU2 PDU1

RIU1

COM2

PDU3

RIU2

Bu

s C

han

nel B

Bu

s C

han

nel A

Bu

s C

han

nel C

Shortcomings?

1. Requires separate CCDL for data exchange between OBCs.

2. Often requires external timing hardware for synchronization.

3. Requires separate interstage to meet minimum number of FCRs.

4. Requires two rounds of data exchange between OBCs.

5. Bandwidth limited.

SAE INTERNATIONAL Paper # 2017-01-2111 15/23

An Approach Using TTE

1FT “switched voter” using TTE.

• Requires only 3 full processors.

• Requires 2-3 redundant switches.

• Devices can connect to OBCs directly or via TTE network.

• Assumes minimum number of SMs and CMs are present for sync.

TTE network used for data distribution and sync.

• Eliminates need for separate CCDL.

• Eliminates need for timing hardware.

• Bandwidth up to 1 Gbit/s.

Switches act as interstages.

• Messages reflected to/from the switches.

• Eliminates need for fourth processor.

OBC1

TTE NIC

….. OBC2

TTE NIC

Bus Interface

COM1

OBCN

TTE NIC

Bus Interface

COM2

RIU2 RIU1

TTE Switch 3

TTE Switch 2

TTE Switch 1

SAE INTERNATIONAL Paper # 2017-01-2111 16/23

Failure Assumptions

End systems may be subject to Byzantine failures.

• May send arbitrary messages.

• May transmit at any point in time.

• May send different messages to different switches.

Switches are restricted to inconsistent omission failures.

• May not create (nor modify to produce) a new “valid” message.

• May drop or fail to receive an arbitrary number of messages.

• May relay messages asymmetrically – some receivers may not get data.

• Acts as a “trusted sender”.

Fault propagation from switches theoretically

requires dual-correlated simultaneous faults.

10-6 ×10-6 = ~10-12 failures/hour

SAE INTERNATIONAL Paper # 2017-01-2111 17/23

Agreement on Local Data

OBC1

TTE NIC

….. OBC2

TTE NIC

Bus Interface

OBCN

TTE NIC

Bus Interface

SW1 SW2 SW3

Bus Interface

RIU1

K 5 5

5

• OBC1 reads a value

of 5 on its local bus. 1

• A fault causes

OBC1 to send a

bad value to SW1.

2

I want to

share 5

SAE INTERNATIONAL Paper # 2017-01-2111 18/23

Agreement on Local Data (cont.)

OBC1

TTE NIC

….. OBC2

TTE NIC

Bus Interface

OBCN

TTE NIC

Bus Interface

SW1 SW2 SW3

Bus Interface

K 5 K 5

K 5

5

5 5

• Each switch

relays the data

to all OBCs.

3

K, 5, 5

Final: 5

K, 5, 5

Final: 5

K, 5, 5

Final: 5

• Each OBC votes

the values sent

from the switches.

4

Absent data is not

included in the vote. • Vote could be

implemented

in TTE NIC or

in software on

the OBCs.

!

SAE INTERNATIONAL Paper # 2017-01-2111 19/23

Agreement on External Data

OBC1

TTE NIC

….. OBC2

TTE NIC

OBCN

TTE NIC

SW1 SW2 SW3

5 5

5

K

K K

5, K, _

Final: ∅

5, K, _

Final: ∅

5, K, _

Final: ∅

IMU1

TTE NIC

5

K

• A fault causes different

values to be sent to

each switch.

1

• Each switch

relays the data

to all OBCs.

2

• All OBCs agree

that no majority

is found.

3

SAE INTERNATIONAL Paper # 2017-01-2111 20/23

Commanding

OBC1

TTE NIC

OBC2

TTE NIC

OBC3

TTE NIC

SW1 SW2 SW3

RIU1

TTE NIC

RIU2

TTE NIC

5 5

5 5

5 5

5 K 5

• A fault causes

OBC2 to send a

bad value to SW3.

1

SAE INTERNATIONAL Paper # 2017-01-2111 21/23

Commanding (cont.)

OBC1

TTE NIC

OBC2

TTE NIC

OBC3

TTE NIC

SW1 SW2 SW3

RIU1

TTE NIC

RIU2

TTE NIC

5, 5, 5 5, 5, 5 5, K, 5

• Each switch

relays the data

from each OBC

to all RIUs.

2

• Each RIU either:

1. Accepts the first valid

value from each OBC.

or

2. Votes the redundant

values from each OBC.

3 • Each RIU votes

the values accepted

in Step 3.

4

Absent data is

included in the vote.

SAE INTERNATIONAL Paper # 2017-01-2111 22/23

Commanding (cont.)

OBC1

TTE NIC

OBC2

TTE NIC

OBC3

TTE NIC

SW1 SW2 SW3

RIU1

TTE NIC

RIU2

TTE NIC

5, 5, 5 5, 5, 5 5, K, 5

• Each switch reflects

the original data

back to all OBCs.

5

• Each OBC votes the

redundant values

from each OBC.

6

Absent data is not

included in the vote.

• Each OBC votes

the results from

Step 6 to diagnose

faulty OBCs.

7

Absent data is

included in the vote.

Happening Simultaneously …

SAE INTERNATIONAL Paper # 2017-01-2111 23/23

Questions?