Architectures for Secure Processing
Matt DeVuyst
Research Exam - Matt DeVuyst 2
Introduction
[Figure: CPU containing the pipeline and functional units, L1 instruction and data caches, L2, and L3, connected over the memory bus to main memory. An Encryption/Decryption Unit (EDU) with its keys sits at the chip boundary, which forms the line of trust; the points of attack lie outside it.]
Introduction
What kind of security? Protection of what? For whom? From whom/what?
This work focuses on:
  Protection of execution (process data and control flow)
  Protection for users, copyright holders, and software companies
  Protection from all other processes (including the OS) and from physical attack
This work focuses on general-purpose security mechanisms for general-purpose computers.
Introduction
This research takes an architecture-centric approach.
Cryptographic algorithms may be utilized, but their security will not be proven here.
Focus is given to hardware support; software and the OS reap the benefits.
Goals
Execution Privacy
  Process control flow and data are exposed only to the CPU
Execution Integrity
  Process control flow and data cannot be tampered with without detection
Outline
Execution Privacy
Execution Integrity
Proposed Architectures
Conclusions and Open Questions
Outline
Execution Privacy
  Naïve Encryption
  One Time Pad (OTP) Encryption
  Improved OTP Encryption
Execution Integrity
Proposed Architectures
Conclusions and Open Questions
Naïve Encryption
[Figure: CPU and memory connected by the memory bus, with the Encryption/Decryption Unit in between. Plaintext data on the CPU side crosses the unit as ciphertext data on the memory side, and vice versa.]
A Closer Look At the Encryption/Decryption Unit
AES in Cipher Block Chaining (CBC) Mode
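The chaining structure of CBC can be sketched as follows. This is an illustrative toy, not part of the presented design: a hash-based stand-in (not invertible, not secure) replaces AES so the example stays self-contained, but the CBC recurrence, in which each plaintext block is XORed with the previous ciphertext block before encryption, is the real one.

```python
import hashlib

BLOCK = 16  # AES block size in bytes

def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    # Stand-in for AES: NOT a real block cipher (it is not even
    # invertible); it only serves to illustrate the chaining.
    return hashlib.sha256(key + block).digest()[:BLOCK]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    assert len(plaintext) % BLOCK == 0
    prev, out = iv, b""
    for i in range(0, len(plaintext), BLOCK):
        # XOR with the previous ciphertext block, then encrypt:
        # identical plaintext blocks within one message now differ.
        prev = toy_block_encrypt(key, xor(plaintext[i:i + BLOCK], prev))
        out += prev
    return out
```

Note that CBC is deterministic for a fixed key and IV: re-encrypting the same data reproduces the same ciphertext, which is exactly the patterning weakness examined on the next slides.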
Research Exam - Matt DeVuyst 10
Issues With Naïve Encryption
On the critical path → performance suffers
Not secure against all attacks
Why Naïve Encryption Is Not Secure
[Figure: plaintext and ciphertext write streams plotted over time when only the data is encrypted. The pattern of repeated values is identical in both streams.]
Why Naïve Encryption Is Not Secure
[Figure: plaintext and ciphertext streams over time when both data and address are encrypted. For repeated writes to the same address, the pattern is still identical in the ciphertext.]
Why Naïve Encryption Has Poor Performance
Stores are effectively immune to encryption latency (the store buffer hides it)
Loads that miss in the cache cost:
  Time to bring in data from memory
  Time to decrypt that data
[Figure: load instruction timeline, with memory latency followed serially by decryption latency.]
Outline
Execution Privacy
  Naïve Encryption
  One Time Pad (OTP) Encryption*
  Improved OTP Encryption
Execution Integrity
Proposed Architectures
Conclusions and Open Questions
* Suh, et al. “Efficient Memory Integrity Verification and Encryption for Secure Processors” – MIT; and Yang, et al. “Fast Secure Processor for Inhibiting Software Piracy and Tampering” – UC Riverside
How OTP Encryption/Decryption Works
[Figure: encryption and decryption datapaths. A one-time pad generated from the key, the address, and a sequence number is XORed with the plaintext to encrypt, and with the ciphertext to decrypt.]
Why OTP Encryption is Secure
[Figure: plaintext and ciphertext streams over time when the address and a sequence number are encrypted to form the pad. Even for repeated writes to the same address, no pattern is expressed in the ciphertext.]
How OTP Encryption Solves the Performance Problem
Decryption is done in parallel with the load, taking it off the critical path.
The key to how it works: decryption cannot depend on the ciphertext.
[Figure: load instruction timeline. Pad generation (the decryption latency) overlaps the memory latency; only a final XOR remains once the data arrives.]
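The scheme can be sketched in a few lines (hypothetical names; a real implementation would generate the pad with a block cipher such as AES rather than the HMAC used here for self-containment). The pad depends only on the key, the address, and the block's sequence number, never on the ciphertext, so it can be generated while the fetch is still in flight.

```python
import hmac, hashlib

BLOCK = 16

def pad_for(key: bytes, addr: int, seq: int) -> bytes:
    # Pad = keyed function of (address, sequence number) only.
    msg = addr.to_bytes(8, "big") + seq.to_bytes(8, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()[:BLOCK]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

class OTPMemory:
    def __init__(self, key: bytes):
        self.key, self.mem, self.seq = key, {}, {}

    def store(self, addr: int, data: bytes) -> None:
        # Each write uses a fresh sequence number, so writing the same
        # data to the same address yields different ciphertext.
        s = self.seq.get(addr, 0) + 1
        self.seq[addr] = s
        self.mem[addr] = xor(data, pad_for(self.key, addr, s))

    def load(self, addr: int) -> bytes:
        # In hardware the pad is computed in parallel with the fetch;
        # only this final XOR remains on the critical path.
        return xor(self.mem[addr], pad_for(self.key, addr, self.seq[addr]))
```

The per-write sequence number is what removes the patterns that defeated naïve encryption.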
The Achilles’ Heel of OTP Encryption
The sequence number must be available long before the memory access completes.
[Figure: load instruction timeline. Pad generation can begin only once the sequence number is available; if the sequence number arrives late, the decryption latency is exposed again before the final XOR.]
A sequence number is associated with every cache-block-sized chunk of memory → not all sequence numbers can be kept on chip.
One solution: a sequence number cache.
Outline
Execution Privacy
  Naïve Encryption
  One Time Pad (OTP) Encryption
  Improved OTP Encryption*
Execution Integrity
Proposed Architectures
Conclusions and Open Questions
* Shi, et al. “High Efficiency Counter Mode Security Architecture Via Prediction and Precomputation” – Georgia Tech
Solutions to the OTP Problem: Prediction and Precomputation
Predict the sequence number
Precompute the pad
When the memory access completes, compare the real sequence number with the predicted one:
  If they match, use the precomputed pad
  If they don’t match, compute the real pad
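The steps above can be sketched as follows (a toy model with hypothetical names; `pad_for` stands in for the hardware pad generator, and `fetch` for the memory access). Pads are precomputed for a window of sequence numbers starting at the page's root; the real sequence number that returns with the data either hits in that window or forces the pad to be computed after the fact.

```python
import hmac, hashlib

def pad_for(key: bytes, addr: int, seq: int) -> bytes:
    msg = addr.to_bytes(8, "big") + seq.to_bytes(8, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()[:16]

def load_with_prediction(key, addr, root_seq, depth, fetch):
    # Precompute pads for root_seq .. root_seq + depth - 1 while the
    # memory access is (conceptually) in flight.
    precomputed = {root_seq + i: pad_for(key, addr, root_seq + i)
                   for i in range(depth)}
    ciphertext, real_seq = fetch(addr)  # access completes
    pad = precomputed.get(real_seq)
    hit = pad is not None
    if not hit:
        # Misprediction: pay the full pad-generation latency.
        pad = pad_for(key, addr, real_seq)
    return bytes(c ^ p for c, p in zip(ciphertext, pad)), hit
```

With prediction depth d, any block written fewer than d times since its root was recorded decrypts with a precomputed pad; blocks written more often fall back to the slow path, which motivates the improved predictors on the next slides.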
Research Exam - Matt DeVuyst 21
Prediction and Precomputation
[Figure: each TLB/page table entry carries a root sequence number for its page, while each cache-block-sized chunk of the page has its own real sequence number stored in memory.]
Prediction and Precomputation
[Figure: example page table entries with root sequence numbers 129145, 637432, 179966, and 343923. Initially, all of a page’s block sequence numbers are set to the page’s root sequence number (here, 343923).]
Prediction and Precomputation
[Figure: the same page after some stores. Writes increment the sequence numbers, so the blocks now hold values such as 343924, 343925, 343933, and 343935 while the page’s root remains 343923.]
Prediction and Precomputation
[Figure: on a load, prediction starts from the page’s root sequence number, 343923. While the memory access is in flight, pads are generated for sequence numbers 343923, 343924, 343925, and so on.]
Better Prediction and Precomputation
Problem: frequently updated data will have a sequence number beyond the prediction depth.
One solution:
  Reset the root sequence number
  Use a prediction history for each page
This is called “adaptive prediction”.
[Figure: each TLB/page table entry now carries a prediction history alongside its root sequence number.]
Better Prediction and Precomputation
Problem: frequently updated data will have a sequence number beyond the prediction depth.
Another solution:
  Record the past difference (diff) between the root sequence number and the real sequence number
  On a subsequent load, make predictions around root sequence number + diff
This is called “context-based” prediction.
[Figure: TLB/page table entries with root sequence numbers, plus a register holding the diff.]
Prediction and Precomputation Accuracy
“Adaptive prediction” is reported to be about 80% accurate*; “context-based prediction” is reported to be close to 100% accurate* (though this has not yet been verified by other researchers).
Cost
  A larger TLB
  A slightly larger memory footprint and bandwidth requirement
Conclusion
  Using OTP with these optimizations, decryption latency is almost completely hidden.
* Shi, et al. “High Efficiency Counter Mode Security Architecture Via Prediction and Precomputation” – Georgia Tech
Outline
Execution Privacy
Execution Integrity
  Basic Execution Integrity
  Cached Hash Trees
  Log Hashing
Proposed Architectures
Conclusions and Open Questions
Execution Integrity – Basic Idea
On a write…
  A keyed hash is taken over the data and address
  The data and the hash are stored in memory
On a read…
  The data and hash are returned from memory
  The hash is recomputed
  The computed hash is compared with the returned hash
[Figure: CPU and memory exchanging (Data, Hash(Key, Data, Address)) pairs.]
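A minimal software sketch of this slide, assuming HMAC-SHA256 as the keyed hash (the talk does not commit to a particular function): every stored block carries a tag over (key, address, data), and every load recomputes and compares it.

```python
import hmac, hashlib

class HashedMemory:
    def __init__(self, key: bytes):
        self.key, self.mem = key, {}

    def _tag(self, addr: int, data: bytes) -> bytes:
        # Keyed hash over (data, address): binds a value to its location.
        return hmac.new(self.key, addr.to_bytes(8, "big") + data,
                        hashlib.sha256).digest()

    def store(self, addr: int, data: bytes) -> None:
        self.mem[addr] = (data, self._tag(addr, data))

    def load(self, addr: int) -> bytes:
        data, tag = self.mem[addr]
        if not hmac.compare_digest(tag, self._tag(addr, data)):
            raise ValueError("integrity violation")
        return data
```

Tampered data fails the check, but restoring a stale (data, tag) pair at the same address passes it: the replay weakness discussed on the next slide.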
Security Analysis of Basic Execution Integrity
Arbitrary data cannot be introduced because:
  The hash is keyed, and
  An attacker does not know the key
Data stored at one address cannot be substituted for data stored at another address because:
  Hashing the data along with the address binds the two
But a replay attack is possible because:
  An attacker may replay stale data previously stored at the given address
Outline
Execution Privacy
Execution Integrity
  Basic Execution Integrity
  Cached Hash Trees*
  Log Hashing
Proposed Architectures
Conclusions and Open Questions
* Blum, et al. “Checking the Correctness of Memories” – UC Berkeley; Gassend, et al. “Caches and Hash Trees for Efficient Memory Integrity Verification” – MIT; Merkle, “Protocols for Public Key Cryptosystems”
Cached Hash Trees
Fundamental problem with basic hashing: hashes verified data integrity, but nothing verified the integrity of the hashes.
A solution: cached hash trees
  Keyed hashes are taken over the data
  Keyed hashes are taken over those hashes, etc.
Problem: the memory requirement of the hashes
Solution: hashes are stored in memory and cached on-chip along with data
Cached Hash Trees
How it works
  A tree is built: leaf nodes contain data, intermediate nodes are hashes
  The root hash is kept in a special register on-chip
  Hashes are only updated when necessary
[Figure: a hash tree with data blocks at the leaves, hash nodes at each intermediate level, and a single root hash at the top.]
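The tree can be sketched as follows (illustrative, with hypothetical names: unkeyed SHA-256 stands in for the keyed hashes, and the on-chip caching of nodes is omitted for brevity). Verifying a block recomputes its path up to the root, which plays the role of the on-chip register.

```python
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def build_tree(blocks):
    # Leaves are hashes of data blocks; each level hashes pairs of
    # children, ending in a single root (block count a power of two).
    levels = [[h(b) for b in blocks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i], prev[i + 1])
                       for i in range(0, len(prev), 2)])
    return levels  # levels[-1][0] is the root, kept on-chip

def verify(blocks, index, levels):
    # Recompute the path from one block up to the trusted root,
    # using the stored sibling hashes along the way.
    node = h(blocks[index])
    for level in levels[:-1]:
        sibling = level[index ^ 1]
        node = h(node, sibling) if index % 2 == 0 else h(sibling, node)
        index //= 2
    return node == levels[-1][0]
```

Because only the root must be trusted, an attacker who modifies a data block in memory would also have to forge every hash on its path to the root.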
Cached Hash Tree Consistency
Invariant: if a node is in memory → then its parent hash is consistent with it (whether the parent hash is in the cache or in memory)
Cached Hash Tree Consistency
[Figure: a data block, its parent hash, and its grandparent hash split across cache and memory, with up-to-date and outdated hashes marked. If data is written, the hashes are not updated.]
Cached Hash Tree Consistency
[Figure: the same blocks across cache and memory. If dirty data is evicted, the parent hash in the cache is updated.]
Cached Hash Tree Consistency
[Figure: the same blocks across cache and memory. If a hash block is evicted, its parent hash in the cache is updated.]
Cached Hash Tree Consistency
[Figure: the same blocks across cache and memory. If data is loaded and its parent hash is not in the cache: 1. the parent is loaded and verified against the grandparent; 2. then the data is verified against its parent.]
Performance Analysis of Cached Hash Trees
Common case: hash nodes are in the cache
  Data evictions only require an update to a cached node
  Data loads only require one hash check against a cached node
Uncommon case: hash nodes are not in the cache
  Data evictions require hash node loads
  Data loads require hash node loads
  Passing hash nodes across the memory bus cuts into the bandwidth available for data
  Hash nodes occupy space in the cache
Outline
Execution Privacy
Execution Integrity
  Basic Execution Integrity
  Cached Hash Trees
  Log Hashing*
Proposed Architectures
Conclusions and Open Questions
* Suh, et al. “Efficient Memory Integrity Verification and Encryption for Secure Processors” – MIT
Log Hashing
Key insight
  Verification is not necessary at every load
  Verification is necessary before application results are produced
Implication
  Relax the constraint of constant, vigilant verification
Log Hashing – Incremental Multiset Hashes*
Incremental
  The keyed hash is not computed over all the data, just the additional data
Multiset
  Duplicate items are allowed
  Multiplicity of items is significant
  Order of items is not
[Figure: a hash engine maps Set 1 and Set 2 to digests that are compared for equality.]
* Clarke, et al. “Incremental Multiset Hash Functions and Their Application to Memory Integrity Checking” – MIT
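One simple construction with all three properties (an additive sketch in the spirit of the cited paper; the names are mine): hash each element with a keyed hash and keep a running sum modulo 2^256. Addition commutes, so order does not matter; adding an element twice changes the sum, so multiplicity does; and each element is folded in without touching the others, so the hash is incremental.

```python
import hashlib

MOD = 2 ** 256

def elem_hash(key: bytes, item: bytes) -> int:
    return int.from_bytes(hashlib.sha256(key + item).digest(), "big")

class MultisetHash:
    def __init__(self, key: bytes):
        self.key, self.value = key, 0

    def add(self, item: bytes) -> None:
        # Fold one more element into the running digest.
        self.value = (self.value + elem_hash(self.key, item)) % MOD
```

Two multiset hashes built this way compare equal exactly when the same multiset of elements was added to each, regardless of order.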
Research Exam - Matt DeVuyst 43
Log Hashing
Two incremental multiset hashes
  WriteHash: hashes everything evicted from the cache (written to memory)
  ReadHash: hashes everything fetched from memory
Counters are associated with memory operations, and keyed hashes are taken over (data, counter, address)
Log Hashing
Three phases of operation
Initialization
  All program data is written out to memory (hashed into WriteHash)
Run-time
  The hash of every eviction is added to WriteHash
  The hash of every fetch is added to ReadHash
Verification
  All data not in the cache is brought in (hashed into ReadHash)
  ReadHash is compared to WriteHash. If they are equal, integrity was maintained; otherwise, integrity was violated.
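The three phases can be simulated end to end (a toy model with hypothetical names, using the additive multiset-hash idea; the counter here is a global write counter stored alongside each block). WriteHash accumulates evictions, ReadHash accumulates fetches, and verification drains memory into ReadHash before comparing the two.

```python
import hashlib

MOD = 2 ** 256

def mset_hash(key: bytes, data: bytes, addr: int, counter: int) -> int:
    msg = addr.to_bytes(8, "big") + counter.to_bytes(8, "big") + data
    return int.from_bytes(hashlib.sha256(key + msg).digest(), "big")

class LogHashMemory:
    def __init__(self, key: bytes):
        self.key, self.mem = key, {}
        self.write_hash = self.read_hash = 0
        self.counter = 0

    def write_back(self, addr: int, data: bytes) -> None:
        # Eviction: store with a fresh counter, fold into WriteHash.
        self.counter += 1
        self.mem[addr] = (data, self.counter)
        self.write_hash = (self.write_hash +
                           mset_hash(self.key, data, addr, self.counter)) % MOD

    def fetch(self, addr: int) -> bytes:
        # Fetch: fold the stored (data, counter, address) into ReadHash.
        # The block now lives in the cache until written back again.
        data, ctr = self.mem.pop(addr)
        self.read_hash = (self.read_hash +
                          mset_hash(self.key, data, addr, ctr)) % MOD
        return data

    def verify(self) -> bool:
        # Bring in everything still in memory, then compare the hashes.
        for addr in list(self.mem):
            self.fetch(addr)
        return self.read_hash == self.write_hash
```

Every write is eventually matched by exactly one read of the same (data, address, counter) triple, so the two multiset hashes agree; tampering, replay, or reordering in memory breaks the equality.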
Research Exam - Matt DeVuyst 45
Log Hashing - Initialization
[Figure: initialization. Blocks written from the cache out to memory are hashed into WriteHash; ReadHash is empty.]
Log Hashing – Run-time
[Figure: run-time. Evictions from the cache are hashed into WriteHash as they are written to memory; fetches from memory are hashed into ReadHash.]
Log Hashing – Verification
[Figure: verification. The remaining contents of memory are fetched and hashed into ReadHash, and then WriteHash is compared with ReadHash for equality.]
Log Hashing – Performance Analysis
Initialization and verification are very costly
We assume initialization and verification are rare occurrences
Run-time hashing has no overhead
Loading and storing sequence numbers in memory incurs a small performance overhead and a small memory overhead
Log Hashing – Security Analysis
If data is tampered with in memory:
  ReadHash will be different from WriteHash
If data was returned from memory more times than it was written (as in a replay attack):
  The multiplicity of hashed items will not match → the hashes will not match
If data is returned from memory out of order:
  The hashes won’t match, because different counter values would have been hashed in with the data
Outline
Execution Privacy
Execution Integrity
Proposed Architectures
  XOM
  SP
  AEGIS
  SENSS
Conclusions and Open Questions
Proposed Architectures
XOM*
  First of its kind
  Uses naïve privacy and integrity mechanisms
  Slow and vulnerable to attack
  Keys for encryption and hashing are burned on chip
* Lie, et al. “Architectural Support for Copy and Tamper Resistant Software” – Stanford
Proposed Architectures
Secret-Protected (SP)*
  Based on XOM
  Uses naïve privacy and integrity mechanisms
  Decouples the secret from the device
    Keys are stored on chip only during a user session
    User keys are separate from the device secret (the hardware key) and are transferable
* Lee, et al. “Architecture for Protecting Critical Secrets in Microprocessors” – Princeton
Proposed Architectures
AEGIS*
  Uses OTP encryption for privacy, without performance optimizations like prediction and precomputation
  Uses cached hash trees for integrity
  Hides device keys using Physical Random Functions (PUFs)
    The circuit timing characteristics of a particular chip are unique and effectively impossible to measure; PUFs exploit this to create device secrets
* Suh, et al. “Design and Implementation of the AEGIS Single-Chip Secure Processor Using Physical Random Functions” – MIT
Proposed Architectures
SENSS*
  Uses a simple OTP encryption scheme, like AEGIS
  Uses a cached hash tree scheme, like AEGIS
  Adds support for multiprocessor systems
    Each device has its own key
    A combination of Cipher Block Chaining and One Time Pad mode encryption is used for cache-to-cache transfers
* Zhang, et al. “SENSS: Security Enhancement to Symmetric Shared Memory Multiprocessors” – UTD
Outline
Execution Privacy
Execution Integrity
Proposed Architectures
Conclusions and Open Questions
Conclusions – OTP
Execution privacy is solved by OTP encryption (with optimizations)
  Secure against all system-level attacks and physical attacks (outside the processor)
  Almost no performance cost
Conclusions – Cached Hash Trees
Cached hash trees are secure against all known attacks
But they have potentially poor performance
  No research has been done to stress test them
  Performance is bad when the hash tree is not in the cache → a large working set or a pathological access pattern may result in poor performance
Conclusions – Log Hashing
Log hashing is secure as long as verification is done before results are used
  How do you ensure that results are not consumed by users or other applications? (e.g. disk writes, network writes, shared memory, screen refresh, OS interrupts)
Log hashing has good performance if verification is infrequent
  But what if it’s not? How many applications require frequent verification?
Conclusions – Keys
Execution privacy and integrity require keys
  Keys must be protected, even if the OS is compromised or the device is under physical attack
How should keys be protected?
  Are Physical Random Functions really resistant to physical attack?
How should device public keys be used?
  Should the manufacturer publish them?
  How should revocation work?
  What happens if ownership of the device is transferred?
Cached Hash Tree Consistency
[Figure: the same blocks across cache and memory as in the earlier consistency slides. If dirty data is evicted and its parent hash is not in the cache: 1. the parent is loaded and verified against the grandparent; 2. then the parent is updated.]