
Johns Hopkins & Purdue

Approved for Public Release, Distribution Unlimited

Scalability, Accountability and Instant Information Access for Network Centric Warfare

Department of Computer Science, Johns Hopkins University

Yair Amir (PI), Claudiu Danilov, John Lane, Jonathan Shapiro, Ciprian Tutu

Cristina Nita Rotaru

Department of Computer Sciences, Purdue University

http://www.cnds.jhu.edu


Network Centric Warfare Environments

• Wide area network settings.
  – C3I systems usually span large geographical distances.
  – Communication between sites is conducted over unreliable channels.
• Timely decisions based on available information.
  – Intermittent network connectivity.
    • Results in high latency for propagation and for consistent replication of updates.
  – Decisions may have to be made promptly.
    • Based on the best currently available information.
• Required update semantics are not general in many cases.
  – Weaker update semantics may suffice.
  – Common operation picture:
    • Commutative update semantics.
    • Timestamp resolution (most recent update wins).
• Critical information is often not large.
  – Compared with current hardware capabilities.
    • Location of friendly forces and enemy forces.
    • A few plans.
  – Allows storing all updates throughout the duration of engagement (several months).
• Source uniqueness.
  – Every input (update) is initiated by one unique source.
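The combination of timestamp resolution and source uniqueness can be illustrated with a small sketch. It is not taken from the project's code: the `Update` fields, the key names, and the tie-break by source name are assumptions made for illustration. It shows that when the most recent update wins, replicas that receive the same updates in different orders reach the same state, so no agreement on a total order is needed.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Update:
    source: str       # unique originator of the update (hypothetical field)
    key: str          # item being updated, e.g. a position report
    timestamp: float  # time assigned by the source
    value: str


def apply_update(state: Dict[str, Update], update: Update) -> None:
    """Keep the most recent update per key; ties are broken by source name."""
    current = state.get(update.key)
    if current is None or (update.timestamp, update.source) > (
        current.timestamp,
        current.source,
    ):
        state[update.key] = update


def replay(updates: Iterable[Update]) -> Dict[str, Tuple[float, str]]:
    """Apply updates in the given order and return (timestamp, value) per key."""
    state: Dict[str, Update] = {}
    for u in updates:
        apply_update(state, u)
    return {k: (u.timestamp, u.value) for k, u in state.items()}


updates = [
    Update("unit-7", "unit-7/position", 10.0, "grid A3"),
    Update("unit-7", "unit-7/position", 12.5, "grid B1"),
    Update("scout-2", "enemy-4/position", 11.0, "grid C9"),
]

in_order = replay(updates)
reversed_order = replay(reversed(updates))
```

Both replay orders end with `unit-7/position` at the update stamped 12.5. The sketch assumes a source never issues two different updates with the same timestamp for the same key.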


Malicious Insider Threats

• The insider attack has traditionally been a primary threat to computer systems (http://csrc.nist.gov).

• The explosion of the Internet made things worse: insiders commit about 80% of all computer and Internet related crime (www.intergov.org; CSI/FBI 2003 Computer Crime and Security Survey).

• Insiders: participants with legitimate access, or those who bypassed the protection mechanisms, who exhibit arbitrary (malicious) behavior.


Dealing with Insider Threats

• Detection: use intrusion detection systems; however, they are not perfect (high false-positive rate).

• Prevention: use access control, firewalls, proactive security; but vulnerabilities still exist (OS bugs, buffer overflows, covert channels, etc.).

• Mitigation (tolerate/cope): use mechanisms that provide service to correct participants while under attack, even if several participants are compromised.

• The above methods do not exclude each other.


Outline

• Network centric warfare environments.
• Peer Byzantine replication limitations.
• Research approach.
  – Scaling wide area intrusion tolerance replication via hierarchy.
    • Local Byzantine replication within sites.
    • Fault tolerant replication on the wide area.
  – Client accountability.
    • Accountability graph.
    • Snapshots for fast regenerations.
  – Exploiting application semantics.
• Next steps.
• Technology transitioning.
• Summary.


A Distributed Systems Service

• Message-passing system.

• Clients issue requests to servers, then wait for answers.

• Replicated servers process the request, then provide answers to clients.

[Figure: a site with server replicas 1, 2, 3, ..., 3f+1 serving clients.]


State Machine Replication

• Requests must be ordered in a consistent manner by all servers.

• Usually one server manages the ordering process based on information from the other participants, then informs everybody about what was decided.

• If the leader dies, a new leader must be selected to ensure progress.

• Benign faults: Paxos [Lam98,Lam01]: must contact f+1 out of 2f+1 servers and uses 2 rounds to allow consistent progress.

• Byzantine faults: BFT [CL99]: must contact 2f+1 out of 3f+1 servers and uses 3 rounds to allow consistent progress.
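The replica and quorum counts quoted above can be written down as a short sketch. The function names are hypothetical; the formulas are the ones on the slide. The overlap calculation is a standard counting argument added for illustration: any two progress quorums share at least one server in the benign case and at least f+1 servers (so at least one correct server) in the Byzantine case.

```python
def replicas_needed(f: int, byzantine: bool) -> int:
    """Servers required to tolerate f faults."""
    return 3 * f + 1 if byzantine else 2 * f + 1


def progress_quorum(f: int, byzantine: bool) -> int:
    """Servers that must be contacted to make consistent progress."""
    return 2 * f + 1 if byzantine else f + 1


def quorum_overlap(f: int, byzantine: bool) -> int:
    """Minimum number of servers shared by any two progress quorums."""
    n = replicas_needed(f, byzantine)
    q = progress_quorum(f, byzantine)
    return 2 * q - n


for f in (1, 2, 3):
    print(
        f,
        replicas_needed(f, False), progress_quorum(f, False),
        replicas_needed(f, True), progress_quorum(f, True),
    )
```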


A Replicated Server System

• Maintaining consistent servers [Sch90]:
  – To tolerate f benign faults, 2f+1 servers are needed.
  – To tolerate f malicious faults, 3f+1 servers are needed.
• Responding to read-only clients' requests [Sch90]:
  – If the servers exhibit only benign faults: 1 answer is enough.
  – If the servers can be malicious: the client must wait for f+1 identical answers, f being the number of malicious servers.
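A minimal sketch of the client-side rule for malicious servers follows. It assumes replies are already authenticated, so the replica identifier attached to each reply can be trusted; the function name and the reply format are hypothetical. With at most f malicious servers, f+1 identical answers from distinct replicas guarantee that at least one of them came from a correct server.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional, Tuple


def accept_reply(
    replies: Iterable[Tuple[str, Hashable]], f: int
) -> Optional[Hashable]:
    """Return the first answer reported by f+1 distinct replicas, else None."""
    votes: Counter = Counter()
    seen = set()
    for replica_id, answer in replies:
        if replica_id in seen:
            continue  # count each replica at most once
        seen.add(replica_id)
        votes[answer] += 1
        if votes[answer] >= f + 1:
            return answer
    return None


# With f = 1, two matching answers from different replicas are required.
accepted = accept_reply([("r1", "42"), ("r2", "13"), ("r3", "42")], f=1)
not_enough = accept_reply([("r1", "42"), ("r1", "42")], f=1)
```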


Peer Byzantine Replication Limitations

• Limited scalability due to multiple all-peer exchange.
  – 3-round all-peer exchange.
    • Very costly on high latency wide area links.
    • Not very scalable.
• Strong connectivity is required.
  – 2f+1 (out of 3f+1) to allow progress and f+1 to get an answer.
    • Partitions are a real issue.
    • Clients depend on remote information.
  – Bad news: provably optimal.
    • We need to pay something to get something else.
• Construct consistent total order.
  – Agreement is achieved on the order of updates before applying them.
    • Very useful: supports general update semantics.
    • Maybe sub-optimal for C3I applications that need only commutative semantics.
• Focus is solely on replica protection.
  – Compromised clients can inject wrong (though valid) input through authorized channels.
    • Wrong input will be consistently replicated to all servers.


Local Byzantine Replication Within a Site

• No trust between participants in a site.
  – A site acts as one unit that can only crash if the assumptions are met.

• How to make sure that one server cannot manipulate the order?
  – Threshold cryptography seems a good direction.

• Use BFT-like [CL99, YMVAD03] protocols and threshold cryptography to guarantee that any valid message leaving the site is correct.


Fault Tolerant Replication Engine

[Figure: state machine of the fault tolerant replication engine [AT02]. State labels: RegPrim, TransPrim, ExchangeStates, ExchangeMessages, NonPrim, Construct, Un, No. Transition labels: Reg Memb, Trans Memb, Update, Last CPC, Last State, Possible Prim, No Prim or Trans Memb, Recover. Updates are marked Red, Yellow, or Green.]


Fault Tolerant Experiments over Wide-Area Network

• A real experimental network (CAIRN).

• Was also modeled in the Emulab facility.

[Figure: network topology connecting hosts ISIPC, ISIPC4, TISWPC, ISEPC3, ISEPC, UDELPC, and MITPC at sites in Virginia, Delaware, Boston, San Jose, and Los Angeles. Wide area links: 38.8 ms at 1.86 Mbits/sec, 1.4 ms at 1.47 Mbits/sec, 4.9 ms at 9.81 Mbits/sec, 3.6 ms at 1.42 Mbits/sec. Local links: 100 Mb/s, < 1 ms.]


Throughput Comparison (WAN)

[Figure: update transactions per second (y-axis, 0 to 400) versus number of clients (x-axis, 0 to 140; 7 replicas on wide area). Series: FT Replication Engine, Upper bound, 2PC. [ADMST02]]


Hierarchical Architecture

• Each site acts as a logical unit that can crash.

• Fault-tolerant protocols between sites.

[Figure: a site with server replicas 1, 2, 3, ..., 3f+1 serving clients.]


Hierarchical Architecture Details

[Figure: two local sites, each with clients and server replicas 1, 2, ..., 3f+1 on a local area network, connected by a wide area network. Each server replica is drawn with Byzantine replication, fault tolerant replication over Secure Spread, and a monitor. Replica 1 is labeled the wide area representative; the other replicas are labeled wide area standby.]


Payment & Potential Gain

• Protects against f Byzantine faults in each site for the price of having 3f+1 replicas in every site.
• Box numbers / a total site compromise.
• Read queries are limited to the local site.
• On a network with diameter of 50 ms:
  – It takes at least 300 milliseconds to complete 3 wide area round trips used by peer Byzantine replication methods.
  – FT Replication engine was shown to achieve 5 times the performance of 2PC.
• Goal:
  – > factor of 3 compared with a peer system.
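The 300 millisecond figure follows from simple arithmetic, sketched here under the assumption that the 50 ms diameter is the one-way latency across the network, so that each round trip costs twice the diameter. The function name is hypothetical.

```python
def wide_area_latency_ms(diameter_ms: float, round_trips: int) -> float:
    """Lower bound on latency: each round trip crosses the network twice."""
    return 2 * diameter_ms * round_trips


# Three wide area round trips on a network with a 50 ms diameter.
peer_byzantine_ms = wide_area_latency_ms(50, 3)
print(peer_byzantine_ms)
```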


Alternative Scalable Architecture

• Use physical trusted nodes assumed to be working under a weaker adversary: they can crash and recover, but cannot be compromised.

• Take advantage of the trusted nodes to run an optimized Byzantine replication algorithm, potentially reducing the number of rounds.

• Use protocols where communication over the WAN takes place only between trusted nodes, thus avoiding high latency.

• Similar approaches: [CLNV02, Ver03, SurS03]


What About Corrupted Clients?

• We cannot detect corrupted clients without external information (can take advantage of detection mechanisms).

• Can we bring the system to a “clean” state if we have external information about compromised clients?

• Proposed solution: accountability graph (A-DAG).


Client Accountability Graph

[Figure: client updates arranged along a time axis.]

• A directed acyclic graph of updates.

• Each update links to previous updates modifying data it read (causal predecessors).


Client Accountability Graph

[Figure: accountability graph along a time axis, with updates marked as clean, corrupted (X), or suspicious.]

Limits adversary power:

• Adversary can inject updates only as a compromised client.

• Once a compromised network avoids delivering an update, it cannot deliver causally following updates.

Useful for risk assessment.
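One way to read the three update colors is sketched below. The slides do not define "suspicious" precisely, so this is an interpretation labeled as an assumption: an update is corrupted if it was issued by a client known to be compromised, suspicious if it is not corrupted but has a corrupted or suspicious update among its causal predecessors, and clean otherwise. The `UpdateNode` type and the client names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class UpdateNode:
    client: str                                            # client that issued the update
    predecessors: List[str] = field(default_factory=list)  # causal predecessor ids


def classify(graph: Dict[str, UpdateNode], compromised: Set[str]) -> Dict[str, str]:
    """Label every update in the DAG as clean, corrupted, or suspicious."""
    labels: Dict[str, str] = {}

    def label(uid: str) -> str:
        if uid in labels:
            return labels[uid]
        node = graph[uid]
        if node.client in compromised:
            result = "corrupted"
        elif any(label(p) != "clean" for p in node.predecessors):
            result = "suspicious"  # built on data touched by a tainted update
        else:
            result = "clean"
        labels[uid] = result
        return result

    for uid in graph:
        label(uid)
    return labels


graph = {
    "u1": UpdateNode("alice"),
    "u2": UpdateNode("mallory", ["u1"]),
    "u3": UpdateNode("bob", ["u2"]),
    "u4": UpdateNode("carol", ["u1"]),
    "u5": UpdateNode("alice", ["u3", "u4"]),
}
labels = classify(graph, compromised={"mallory"})
```

Here `u2` is corrupted, `u3` and `u5` are suspicious because they causally follow `u2`, and `u1` and `u4` stay clean. The recursion is adequate for a sketch; a long-lived graph would call for an iterative traversal.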


Enabling Fast Regeneration Using Snapshots

[Figure: accountability graph along a time axis showing the most recent snapshot, with updates marked as clean, corrupted (X), or suspicious.]

Periodic snapshots limit state regeneration calculation.

For our application domain, it seems feasible to maintain continuous information over a long period of time.
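How a snapshot bounds regeneration work can be sketched as follows. This is an illustration under several assumptions, none of them from the slides: state is a key-value map, updates overwrite values, snapshots are indexed by the number of log entries they include, a snapshot of the empty initial state always exists at index 0, and the set of excluded updates (corrupted and suspicious) has already been computed.

```python
from typing import Dict, List, Set, Tuple

LogEntry = Tuple[str, str, str]  # (update id, key, value)


def regenerate(
    log: List[LogEntry],
    snapshots: Dict[int, Dict[str, str]],
    excluded: Set[str],
) -> Tuple[Dict[str, str], int]:
    """Rebuild state from the latest snapshot that precedes every excluded update.

    snapshots[i] holds the state after applying log[:i]; snapshots[0] must exist.
    Returns the regenerated state and the log index where replay started.
    """
    first_bad = min(
        (i for i, (uid, _, _) in enumerate(log) if uid in excluded),
        default=len(log),
    )
    start = max(i for i in snapshots if i <= first_bad)
    state = dict(snapshots[start])
    for uid, key, value in log[start:]:
        if uid not in excluded:
            state[key] = value
    return state, start


log = [("u1", "a", "1"), ("u2", "b", "2"), ("u3", "a", "bad"), ("u4", "c", "3")]
snapshots = {0: {}, 2: {"a": "1", "b": "2"}}
state, start = regenerate(log, snapshots, excluded={"u3"})
```

In this example replay starts from the snapshot at index 2 instead of from the beginning of the log, and the excluded update `u3` is skipped.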


Overall Architecture



[Figure: overall architecture. Two local sites, each with clients and server replicas 1, 2, ..., 3f+1 on a local area network, connected by a wide area network. Each server replica is drawn with a monitor, fault tolerant replication over Secure Spread, and the A-DAG. Replica 1 is the wide area representative; the other replicas are labeled wide area standby.]


Risks and Challenges

• Interface the Byzantine-tolerant replication and fault-tolerant replication components.
• Investigate the impact of threshold digital signatures on performance and complexity.
• Interface Byzantine-tolerant replication with the client accountability graph.
• Use of application semantics to optimize protocols.
• Design optimizations to make the cost of the architecture very small when no faults occur.
• Take into account confidentiality under corrupted servers model.


Scalability, Accountability and Instant Information Access for Network-Centric Warfare

http://www.cnds.jhu.edu/funding/srs/

New ideas

• First scalable wide-area intrusion-tolerant replication architecture.
• Providing accountability for authorized but malicious client updates.
• Exploiting update semantics to provide instant and consistent information access.

Impact

• Resulting systems with at least 3 times higher throughput, lower latency and high availability for updates over wide area networks.
• Clear path for technology transitions into Military C3I systems.

Schedule (June 04, Dec 04, June 05, Dec 05)

• C3I model, baseline and demo
• Component analysis & design
• Component implement.
• Component eval.
• System integration & evaluation
• Final C3I demo and baseline eval
