
BEBA Behavioural Based Forwarding

Grant Agreement: 644122

BEBA/WP2 – D2.2 Version: 1.1

Deliverable Report

D2.2 Basic BEBA abstraction proof of concept prototype

Deliverable title: Basic BEBA abstraction proof of concept prototype

Version: 1.1

Due date of deliverable (month): October 2015

Actual submission date of the deliverable (dd/mm/yyyy): 31/10/2015

Start date of project (dd/mm/yyyy): 01/01/2015

Duration of the project: 27 months

Work Package: WP2

Task: T2.3

Leader for this deliverable: CNIT

Other contributing partners: NEC, KTH, 6WIND

Authors: Marco Bonola, Luca Pollini, Davide Sanvito, Salvatore Pontarelli, Giuseppe Bianchi, Carmelo Cascone (CNIT), Vincent Jardin, Thomas Monjalon (6WIND), Roberto Bifulco (NEC), Dejan Kostic, Georgios Katsikas, Radosav Rudic (KTH)

Deliverable reviewer(s): Julien Boite (TCS), Viktor Pus (CESNET)




Project co-funded by the European Commission within the Horizon 2020 (H2020) Programme

DISSEMINATION LEVEL

PU Public [X]

PP Restricted to other programme participants (including the Commission Services)

RE Restricted to a group specified by the consortium (including the Commission Services)

CO Confidential, only for members of the consortium (including the Commission Services)

REVISION HISTORY

Revision | Date | Author | Organisation | Description
0.1 | 05/09/2015 | Marco Bonola | CNIT | First draft of section 2
0.2 | 10/09/2015 | Davide Sanvito, Luca Pollini | CNIT | First draft of section 1
0.3 | 15/09/2015 | Marco Bonola, Davide Sanvito, Luca Pollini | CNIT | Section 1 updated and reviewed
0.5 | 21/09/2015 | Vincent Jardin, Thomas Monjalon | 6WIND | Add information about acceleration
0.6 | 6/10/2015 | Roberto Bifulco | NEC | Section 1.5 added
0.7 | 17/10/2015 | Dejan Kostic, Georgios Katsikas, Radosav Rudic | KTH | Section 1.6 added
0.8 | 19/10/2015 | Salvatore Pontarelli | CNIT | Sections 4.1 and 2 updated
0.9 | 21/10/2015 | Vincent Jardin, Thomas Monjalon | 6WIND | Section 3.4 updated
1.0 | 24/10/2015 | Carmelo Cascone | CNIT | Added SW PoC performance analysis
1.1 | 26/10/2015 | Marco Bonola | CNIT | Final including partners’ revision

PROPRIETARY RIGHTS STATEMENT

This document contains information, which is proprietary to the BEBA consortium. Neither this document nor the information contained herein shall be used, duplicated or communicated by any means to any third party, in whole or in parts, except with the prior written consent of the BEBA consortium. This restriction legend shall not be altered or obliterated on or from this document.

STATEMENT OF ORIGINALITY

This deliverable contains original unpublished work except where clearly indicated otherwise. Acknowledgement of previously published material and of the work of others has been made through appropriate citation, quotation or both.




TABLE OF CONTENTS

EXECUTIVE SUMMARY
1 SOFTWARE PROOF OF CONCEPT PROTOTYPE
1.1 HOW DOES THE BEBA PROTOTYPE FIT IN OFSOFTSWITCH13?
1.2 SW IMPLEMENTATION DETAILS
1.2.1 FSM instantiation
1.2.1.1 Table statefulness configuration
1.2.1.2 Lookup/Update scope setting
1.2.1.3 Flow table population
1.2.2 Packet processing
1.2.2.1 State transition
1.3 IMPLEMENTATION OF THE BEBA BASIC CONTROLLER PROTOTYPE
1.3.1 Overview of the Ryu OpenFlow controller
1.3.2 Ryu and OpenFlow experimenters
1.3.3 Implementation of the BEBA extensions
1.3.3.1 Table configuration
1.3.3.2 Lookup-scope and update-scope configuration
1.3.3.3 set_state action support
1.3.3.4 State match support
1.4 IN-SWITCH PACKET GENERATION PROTOTYPE
1.4.1 Packet template table
1.4.2 In-Switch Packet Generation Instruction
1.4.3 Controller InSP API example
1.5 STATE SYNCHRONIZATION MECHANISM PROTOTYPE
1.5.1 Controller to Switch
1.5.2 Switch to Controller
2 FPGA-BASED HARDWARE PROOF OF CONCEPT PROTOTYPE
2.1 DEVELOPMENT PLATFORM
2.2 BEBA COMPONENTS
2.2.1 Microcontroller
2.2.2 Look-up/update extractors
2.2.3 State/timer table
2.2.4 Metadata block
2.2.5 FSM table
2.2.6 Packet output
2.2.7 Configuration interface
2.3 SYNTHESIS AND POC SIMULATION
2.4 DISCUSSION AND EXTENSIONS
2.5 LIMITATIONS
2.6 PERFORMANCE ACHIEVABLE WITH AN ASIC IMPLEMENTATION
3 PERFORMANCE ANALYSIS
3.1 TEST DESCRIPTIONS
3.2 RESULTS
3.3 STRATEGIES FOR IMPROVED PERFORMANCES
3.3.1 Slow path and fast path based acceleration
3.3.2 Packet offload APIs – Netlink (RFC3549) and standardization Switchdev
3.3.3 Packet offload APIs – eBPF and P4 processing
3.3.4 Design solutions for BEBA
4 SIMPLE USE CASE DEPLOYMENT WITH HW/SW PROTOTYPES
4.1 HW PROTOTYPE DEMONSTRATION
4.1.1 Simple use case description
4.1.2 BEBA prototype configuration
4.1.3 Results
4.2 SW PROTOTYPE DEMONSTRATION
4.2.1 Emulation environment description and use case description
4.2.2 Programming the BEBA controller
4.2.3 Testing the BEBA application




Executive summary

This deliverable documents the project’s results related to activity T2.3 “Proof of concept validation”. In particular, section 1 describes the BEBA software switch proof of concept implementation, which integrates the primitives of the basic BEBA API (FSM execution, packet generation and state synchronization). Section 2 focuses on a PoC hardware implementation which realizes a subset of the BEBA basic API (in particular the FSM execution). In section 3 a performance analysis of the SW PoC implementation is given along with a discussion of possible strategies to improve performance. Finally, section 4 provides a description of two simple demonstrations of both the SW and HW prototypes.




1 Software proof of concept prototype

To validate the BEBA abstraction experimentally and gain further insight, we developed a prototype software implementation. Among the available software switch implementations, we chose ofsoftswitch13 [1] and extended it with our proposed stateful operation support, because its simplicity and adherence to the OpenFlow v1.3 specification best fit our requirements.

This software switch is built upon the OpenFlow 1.3 specification and, among the switches we considered, it is the one that best implements the OpenFlow experimenter fields, which are useful to extend the switch with additional functionality.

As for the BEBA controller, we extended the widely used OpenFlow controller Ryu [2] to support the few new messages required to interact with a BEBA switch.

1.1 How does the BEBA prototype fit in ofsoftswitch13?

The BEBA basic forwarding abstraction has been developed in ofsoftswitch13 as an OpenFlow

experimenter extension. From the OpenFlow 1.5 specification [3]:

”Experimenter extensions provide a standard way for OpenFlow switches to offer additional

functionality within the OpenFlow message type space. This is a staging area for features

meant for future OpenFlow revisions. Many OpenFlow object types offer Experimenter

extensions, such as basic messages, OXM matches, instructions, actions, meters and error”.

Implementing the BEBA features as experimenter extensions keeps them independent of the OpenFlow version and makes it easier to accommodate future OpenFlow updates. By contrast, patching ofsoftswitch13 directly would have tied the extension to the OpenFlow 1.3 specification, even if the patch had not altered the basic functionality of the switch.

The OpenFlow control channel processing is done in two distinct steps: firstly, the wire format

is converted to an internal representation, and then this internal representation is interpreted

by the soft switch code. The opposite direction is handled in a similar way: from the internal

representation to the wire format.
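The two-step handling described above can be sketched in a few lines of C. This is a minimal illustration and not the oflib code: struct wire_msg, struct host_msg, msg_unpack() and msg_pack() are hypothetical names, and the two-field layout is an assumption made only to keep the example short.

```c
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire format: every field is in network byte order. */
struct wire_msg {
    uint32_t experimenter;
    uint32_t exp_type;
};

/* Hypothetical internal representation: host byte order. */
struct host_msg {
    uint32_t experimenter_id;
    uint32_t type;
};

/* Receive path: wire bytes -> internal representation.
 * Returns 0 on success, -1 if the buffer is too short. */
int msg_unpack(const uint8_t *buf, size_t len, struct host_msg *out)
{
    struct wire_msg w;

    assert(buf != NULL && out != NULL);
    if (len < sizeof w)
        return -1;
    memcpy(&w, buf, sizeof w);  /* copy first to avoid unaligned reads */
    out->experimenter_id = ntohl(w.experimenter);
    out->type = ntohl(w.exp_type);
    return 0;
}

/* Send path: internal representation -> wire bytes.
 * Returns the number of bytes written, or 0 if the buffer is too small. */
size_t msg_pack(const struct host_msg *in, uint8_t *buf, size_t cap)
{
    struct wire_msg w;

    assert(in != NULL && buf != NULL);
    if (cap < sizeof w)
        return 0;
    w.experimenter = htonl(in->experimenter_id);
    w.exp_type = htonl(in->type);
    memcpy(buf, &w, sizeof w);
    return sizeof w;
}
```

Keeping the byte-order conversion confined to these two functions means the code that interprets the message never has to deal with network byte order.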

The internal representation of each OpenFlow entity (messages, actions, matches, etc.) and the conversion functions (ofl_msg_unpack and ofl_msg_pack) are defined in the OpenFlow library (oflib).

The OpenFlow library functions accept as input an extra argument that points to a set of callback functions (listed in the ofl.h header file), which are used to handle any experimenter entity encountered.




Depending on the message type, the OpenFlow library either uses its internal functions or relies on the callbacks provided in the oflib-exp module, which already contains the callbacks for the Nicira and OpenFlow extensions. This is where the BEBA extension is hooked in, by adding the BEBA experimenter_id as one of the possible options.
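The selection of a callback by experimenter_id can be pictured as a small lookup over registered extensions. The sketch below is illustrative only: struct exp_handler, dispatch_experimenter() and the stub callbacks are hypothetical, OTHER_VENDOR_ID is a made-up value, and the real oflib-exp module is organised differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BEBA_VENDOR_ID   0xBEBABEBAu  /* provisional ID used in this document */
#define OTHER_VENDOR_ID  0x00001234u  /* made-up ID, for illustration only */

typedef int (*exp_unpack_fn)(const uint8_t *body, size_t len);

struct exp_handler {
    uint32_t      experimenter_id;
    exp_unpack_fn unpack;
};

/* Stub callbacks: each returns a distinct code so the dispatch is visible. */
static int beba_unpack(const uint8_t *body, size_t len)
{
    (void)body; (void)len;
    return 1;
}

static int other_unpack(const uint8_t *body, size_t len)
{
    (void)body; (void)len;
    return 2;
}

static const struct exp_handler handlers[] = {
    { OTHER_VENDOR_ID, other_unpack },
    { BEBA_VENDOR_ID,  beba_unpack  },
};

/* Returns the callback's result, or -1 when no extension claims the ID. */
int dispatch_experimenter(uint32_t experimenter_id,
                          const uint8_t *body, size_t len)
{
    size_t i;

    assert(body != NULL || len == 0);
    for (i = 0; i < sizeof handlers / sizeof handlers[0]; i++) {
        if (handlers[i].experimenter_id == experimenter_id)
            return handlers[i].unpack(body, len);
    }
    return -1;
}
```

Under this model, supporting a new experimenter amounts to adding one row to the table, which is the property that keeps the BEBA changes confined to a few files.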

The BEBA extension is limited to a small number of files. The beba-exp.h file describes the wire format structures (cf. section 4 of D.2.1). The ofl-exp-beba.h file defines the required internal structures, reusing the already defined internal headers (e.g. ofl_action_experimenter). Similarly to ofl-exp-openflow and ofl-exp-nicira, ofl-exp-beba defines all the callback functions used to pack and unpack the new BEBA messages, actions and matches.

Once the translation from wire format to internal format is complete, the message is processed in the udatapath module (dp_control.c file). With the message in the oflib internal format, this code dispatches it to the correct module depending on the message type. Experimenter messages are handled by the code in dp_exp.c, which is where the BEBA extension is hooked in.

A response message from the switch to the controller is first created in the BEBA internal format. The sending functionality then uses oflib, which relies on the provided callbacks, to convert it to the OpenFlow wire format.

1.2 SW implementation details

The soft switch implementation (whose workflow is depicted in Figure 1) consists of a main loop (dp_run() function) that periodically checks whether there are messages to handle from either the control channel or a normal switch port.

The controller handles the switch bootstrap event by sending a sequence of messages. Some

of these are part of the OpenFlow specification and others are an extension of it.

Each controller message is processed in the remote_run() function, which is called from the dp_run() function. Once the message is unpacked in ofl_msg_unpack(), it is dispatched according to its message type. All the messages not included in the specification are implemented as extensions using the OFPT_EXPERIMENTER message type. Since various vendors can extend the OpenFlow specification, an experimenter_id is needed to identify which manufacturer the message belongs to. The experimenter_id used by BEBA is 0xBEBABEBA. Note that at this time no official experimenter_id has been chosen: as the project standardization activities are ongoing, this ID may be different from the one that will be submitted to the ONF body.

All the new callback functions have been developed in the ofl_exp_beba.c file inside the oflib-exp module.




Figure 1 - BEBA basic forwarding abstraction prototype workflow




As stated in section 4 of deliverable D.2.1, the configurations and modifications of the state

table are performed with State Modification messages sent by the controller. This message is

an experimenter message with exp_type field set to OFPT_EXP_STATE_MOD.

The following callback function shows how the translation from the wire format to the internal

format is performed.

    ofl_err
    ofl_exp_beba_msg_unpack(struct ofp_header *oh, size_t *len,
                            struct ofl_msg_experimenter **msg)
    {
        ofl_err error = 0;
        struct ofp_experimenter_header *exp_header;

        if (*len < sizeof(struct ofp_experimenter_header)) {
            OFL_LOG_WARN(LOG_MODULE, "Received EXPERIMENTER message has invalid length (%zu).", *len);
            return ofl_error(OFPET_BAD_REQUEST, OFPBRC_BAD_LEN);
        }

        exp_header = (struct ofp_experimenter_header *)oh;

        if (ntohl(exp_header->experimenter) == BEBA_VENDOR_ID) {
            switch (ntohl(exp_header->exp_type)) {
            case (OFPT_EXP_STATE_MOD): {
                struct ofp_exp_msg_state_mod *sm;
                struct ofl_exp_msg_state_mod *dm;

                if (*len < sizeof(struct ofp_experimenter_header) + 2*sizeof(uint8_t)) {
                    OFL_LOG_WARN(LOG_MODULE, "Received STATE_MOD message has invalid length (%zu).", *len);
                    return ofl_error(OFPET_BAD_REQUEST, OFPBRC_BAD_LEN);
                }
                *len -= sizeof(struct ofp_experimenter_header);

                sm = (struct ofp_exp_msg_state_mod *)exp_header;
                dm = (struct ofl_exp_msg_state_mod *)malloc(sizeof(struct ofl_exp_msg_state_mod));

                /* Convert the common header fields to host byte order. */
                dm->header.header.experimenter_id = ntohl(exp_header->experimenter);
                dm->header.type = ntohl(exp_header->exp_type);
                dm->command = (enum ofp_exp_msg_state_mod_commands)sm->command;
                *len -= 2*sizeof(uint8_t);

                /* Unpack the command-specific payload. */
                if (dm->command == OFPSC_STATEFUL_TABLE_CONFIG) {
                    error = ofl_structs_stateful_table_config_unpack(&(sm->payload[0]), len, &(dm->payload[0]));
                } else if (dm->command == OFPSC_SET_L_EXTRACTOR ||
                           dm->command == OFPSC_SET_U_EXTRACTOR) {
                    error = ofl_structs_extraction_unpack(&(sm->payload[0]), len, &(dm->payload[0]));
                } else if (dm->command == OFPSC_SET_FLOW_STATE) {
                    error = ofl_structs_set_flow_state_unpack(&(sm->payload[0]), len, &(dm->payload[0]));
                } else if (dm->command == OFPSC_DEL_FLOW_STATE) {
                    error = ofl_structs_del_flow_state_unpack(&(sm->payload[0]), len, &(dm->payload[0]));
                } else if (dm->command == OFPSC_SET_GLOBAL_STATE) {
                    error = ofl_structs_set_global_state_unpack(&(sm->payload[0]), len, &(dm->payload[0]));
                }
                /* OFPSC_RESET_GLOBAL_STATE: payload is empty, nothing to unpack. */

                if (error) {
                    free(dm);
                    return error;
                }

                (*msg) = (struct ofl_msg_experimenter *)dm;
                return 0;
            }
            default: {
                OFL_LOG_WARN(LOG_MODULE, "Trying to unpack unknown BEBA Experimenter message.");
                return ofl_error(OFPET_BAD_REQUEST, OFPBRC_BAD_EXPERIMENTER);
            }
            }
        } else {
            OFL_LOG_WARN(LOG_MODULE, "Trying to unpack non-BEBA Experimenter message.");
            return ofl_error(OFPET_BAD_REQUEST, OFPBRC_BAD_EXPERIMENTER);
        }
        /* Every path above returns; the original trailing free(msg) was
           unreachable and has been dropped. */
    }

During the flow table population, some flow modification messages may contain BEBA experimenter actions. The message itself is a standard OpenFlow message, but unpacking its contents requires a further extension.

The following function shows how a BEBA experimenter action is translated from the wire format to the internal format.

    ofl_err
    ofl_exp_beba_act_unpack(struct ofp_action_header *src, size_t *len,
                            struct ofl_action_header **dst)
    {
        struct ofp_action_experimenter_header *exp;

        if (*len < sizeof(struct ofp_action_experimenter_header)) {
            OFL_LOG_WARN(LOG_MODULE, "Received EXPERIMENTER action has invalid length (%zu).", *len);
            return ofl_error(OFPET_BAD_REQUEST, OFPBRC_BAD_LEN);
        }

        exp = (struct ofp_action_experimenter_header *)src;

        if (ntohl(exp->experimenter) == BEBA_VENDOR_ID) {
            struct ofp_beba_action_experimenter_header *ext;
            ext = (struct ofp_beba_action_experimenter_header *)exp;

            switch (ntohl(ext->act_type)) {
            case (OFPAT_EXP_SET_STATE): {
                struct ofp_exp_action_set_state *sa;
                struct ofl_exp_action_set_state *da;

                if (*len < sizeof(struct ofp_exp_action_set_state)) {
                    OFL_LOG_WARN(LOG_MODULE, "Received SET STATE action has invalid length (%zu).", *len);
                    /* OFPBAC_* is the code family that pairs with OFPET_BAD_ACTION
                       (the original listing used OFPBRC_BAD_LEN here). */
                    return ofl_error(OFPET_BAD_ACTION, OFPBAC_BAD_LEN);
                }
                sa = (struct ofp_exp_action_set_state *)ext;
                da = (struct ofl_exp_action_set_state *)malloc(sizeof(struct ofl_exp_action_set_state));

                if (sa->table_id >= PIPELINE_TABLES) {
                    if (OFL_LOG_IS_WARN_ENABLED(LOG_MODULE)) {
                        char *ts = ofl_table_to_string(sa->table_id);
                        OFL_LOG_WARN(LOG_MODULE, "Received SET STATE action has invalid table_id (%s).", ts);
                        free(ts);
                    }
                    free(da);
                    return ofl_error(OFPET_BAD_REQUEST, OFPBRC_BAD_TABLE_ID);
                }

                /* Multi-byte fields are converted to host byte order. */
                da->header.header.experimenter_id = ntohl(exp->experimenter);
                da->header.act_type = ntohl(ext->act_type);
                da->state = ntohl(sa->state);
                da->state_mask = ntohl(sa->state_mask);
                da->table_id = sa->table_id;
                da->hard_rollback = ntohl(sa->hard_rollback);
                da->idle_rollback = ntohl(sa->idle_rollback);
                da->hard_timeout = ntohl(sa->hard_timeout);
                da->idle_timeout = ntohl(sa->idle_timeout);

                *dst = (struct ofl_action_header *)da;
                *len -= sizeof(struct ofp_exp_action_set_state);
                break;
            }
            case (OFPAT_EXP_SET_FLAG): {
                struct ofp_exp_action_set_flag *sa;
                struct ofl_exp_action_set_flag *da;

                if (*len < sizeof(struct ofp_exp_action_set_flag)) {
                    OFL_LOG_WARN(LOG_MODULE, "Received SET FLAG action has invalid length (%zu).", *len);
                    return ofl_error(OFPET_BAD_ACTION, OFPBAC_BAD_LEN);
                }
                sa = (struct ofp_exp_action_set_flag *)ext;
                da = (struct ofl_exp_action_set_flag *)malloc(sizeof(struct ofl_exp_action_set_flag));

                da->header.header.experimenter_id = ntohl(exp->experimenter);
                da->header.act_type = ntohl(ext->act_type);
                da->flag = ntohl(sa->flag);
                da->flag_mask = ntohl(sa->flag_mask);

                *dst = (struct ofl_action_header *)da;
                *len -= sizeof(struct ofp_exp_action_set_flag);
                break;
            }
            default: {
                OFL_LOG_WARN(LOG_MODULE, "Trying to unpack unknown Beba Experimenter action.");
                return ofl_error(OFPET_BAD_REQUEST, OFPBRC_BAD_EXPERIMENTER);
            }
            }
        }
        return 0;
    }

The same holds for BEBA experimenter matches. The following code converts a BEBA match field into a format manageable by the switch.

    int
    ofl_exp_beba_field_unpack(struct ofl_match *match, struct oxm_field *f,
                              void *experimenter_id, void *value, void *mask)
    {
        switch (f->index) {
        case OFI_OXM_EXP_STATE: {
            ofl_structs_match_exp_put32(match, f->header,
                                        ntohl(*((uint32_t*) experimenter_id)),
                                        ntohl(*((uint32_t*) value)));
            return 0;
        }
        case OFI_OXM_EXP_STATE_W: {
            if (check_bad_wildcard32(ntohl(*((uint32_t*) value)), ntohl(*((uint32_t*) mask)))) {
                return ofp_mkerr(OFPET_BAD_MATCH, OFPBMC_BAD_WILDCARDS);
            }
            ofl_structs_match_exp_put32m(match, f->header,
                                         ntohl(*((uint32_t*) experimenter_id)),
                                         ntohl(*((uint32_t*) value)),
                                         ntohl(*((uint32_t*) mask)));
            return 0;
        }
        case OFI_OXM_EXP_FLAGS: {
            ofl_structs_match_exp_put32(match, f->header,
                                        ntohl(*((uint32_t*) experimenter_id)),
                                        ntohl(*((uint32_t*) value)));
            return 0;
        }
        case OFI_OXM_EXP_FLAGS_W: {
            if (check_bad_wildcard32(ntohl(*((uint32_t*) value)), ntohl(*((uint32_t*) mask)))) {
                return ofp_mkerr(OFPET_BAD_MATCH, OFPBMC_BAD_WILDCARDS);
            }
            ofl_structs_match_exp_put32m(match, f->header,
                                         ntohl(*((uint32_t*) experimenter_id)),
                                         ntohl(*((uint32_t*) value)),
                                         ntohl(*((uint32_t*) mask)));
            return 0;
        }
        default:
            NOT_REACHED();
        }
    }

1.2.1 FSM instantiation

The instantiation of a state machine requires a set of experimenter and standard OpenFlow

messages. As stated in the previous section, the handling of a BEBA experimenter message is

performed in the dp_exp.c file.

The following function is responsible for parsing a state modification message.

    ofl_err
    handle_state_mod(struct pipeline *pl, struct ofl_exp_msg_state_mod *msg,
                     const struct sender *sender)
    {
        if (msg->command == OFPSC_STATEFUL_TABLE_CONFIG) {
            struct ofl_exp_stateful_table_config *p = (struct ofl_exp_stateful_table_config *) msg->payload;
            struct state_table *st = pl->tables[p->table_id]->state_table;
            state_table_configure_stateful(st, p->stateful);
        } else if (msg->command == OFPSC_SET_L_EXTRACTOR || msg->command == OFPSC_SET_U_EXTRACTOR) {
            struct ofl_exp_set_extractor *p = (struct ofl_exp_set_extractor *) msg->payload;
            struct state_table *st = pl->tables[p->table_id]->state_table;
            if (state_table_is_stateful(st)) {
                /* update == 0: lookup scope, update == 1: update scope */
                int update = 0;
                if (msg->command == OFPSC_SET_U_EXTRACTOR)
                    update = 1;
                state_table_set_extractor(st, (struct key_extractor *)p, update);
            } else {
                OFL_LOG_WARN(LOG_MODULE, "ERROR STATE MOD: cannot configure extractor (stage %u is not stateful)", p->table_id);
            }
        } else if (msg->command == OFPSC_SET_FLOW_STATE) {
            struct ofl_exp_set_flow_state *p = (struct ofl_exp_set_flow_state *) msg->payload;
            struct state_table *st = pl->tables[p->table_id]->state_table;
            if (state_table_is_stateful(st) && state_table_is_configured(st)) {
                state_table_set_state(st, NULL, p, NULL);
            } else {
                OFL_LOG_WARN(LOG_MODULE, "ERROR STATE MOD at stage %u: stage not stateful or not configured", p->table_id);
            }
        } else if (msg->command == OFPSC_DEL_FLOW_STATE) {
            struct ofl_exp_del_flow_state *p = (struct ofl_exp_del_flow_state *) msg->payload;
            struct state_table *st = pl->tables[p->table_id]->state_table;
            if (state_table_is_stateful(st) && state_table_is_configured(st)) {
                state_table_del_state(st, p->key, p->key_len);
            } else {
                OFL_LOG_WARN(LOG_MODULE, "ERROR STATE MOD at stage %u: stage not stateful or not configured", p->table_id);
            }
        } else if (msg->command == OFPSC_SET_GLOBAL_STATE) {
            uint32_t global_states = pl->dp->global_states;
            struct ofl_exp_set_global_state *p = (struct ofl_exp_set_global_state *) msg->payload;
            /* Overwrite only the bits selected by flag_mask. */
            global_states = (global_states & ~(p->flag_mask)) | (p->flag & p->flag_mask);
            pl->dp->global_states = global_states;
        } else if (msg->command == OFPSC_RESET_GLOBAL_STATE) {
            pl->dp->global_states = OFP_GLOBAL_STATES_DEFAULT;
        } else
            return 1;
        return 0;
    }

The following subsections describe each of the possible state modification messages.
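Before turning to them, the masked write performed by the OFPSC_SET_GLOBAL_STATE branch of handle_state_mod() is worth a small worked example. The helper names apply_masked_flags() and masked_flags_example() are hypothetical; only the bitwise expression is taken from the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Only the bits selected by mask are overwritten with the bits of flag;
 * all other bits of current are preserved. */
uint32_t apply_masked_flags(uint32_t current, uint32_t flag, uint32_t mask)
{
    return (current & ~mask) | (flag & mask);
}

/* Worked example: set bit 0 and clear bit 3 in one message,
 * leaving bit 2 untouched. */
void masked_flags_example(void)
{
    uint32_t g = 0x0000000Cu;               /* bits 2 and 3 set */

    g = apply_masked_flags(g, 0x1u, 0x9u);  /* mask selects bits 0 and 3 */
    assert(g == 0x00000005u);               /* bits 0 and 2 set */
}
```

The mask is what lets a controller change a single global flag without first having to read back the current value of the others.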

1.2.1.1 Table statefulness configuration

By default, all the tables in the pipeline are stateless. The controller sends an OFPT_EXP_STATE_MOD message with the OFPSC_STATEFUL_TABLE_CONFIG command for each table that needs to be configured as stateful. The stateful parameter is set to 1 and the target table is identified by the table_id parameter.

Once the command is parsed, the handler changes the stateful parameter of the table from 0 to 1. The payload of this command is unpacked by the following function:

    static ofl_err
    ofl_structs_stateful_table_config_unpack(struct ofp_exp_stateful_table_config *src,
                                             size_t *len,
                                             struct ofl_exp_stateful_table_config *dst)
    {
        if (*len == sizeof(struct ofp_exp_stateful_table_config)) {
            if (src->table_id >= PIPELINE_TABLES) {
                /* table_id is an 8-bit field, so it is printed with %u (not %zu) */
                OFL_LOG_WARN(LOG_MODULE, "Received STATE_MOD message has invalid table id (%u).", src->table_id);
                return ofl_error(OFPET_BAD_REQUEST, OFPBRC_BAD_TABLE_ID);
            }
            dst->table_id = src->table_id;
            dst->stateful = src->stateful;
        } else {
            OFL_LOG_WARN(LOG_MODULE, "Received state mod stateful_table is too short (%zu).", *len);
            return ofl_error(OFPET_BAD_ACTION, OFPBAC_BAD_LEN);
        }
        *len -= sizeof(struct ofp_exp_stateful_table_config);
        return 0;
    }

1.2.1.2 Lookup/Update scope setting

For each stateful table, the controller needs to send two OFPT_EXP_STATE_MOD messages, with the command field set to OFPSC_SET_L_EXTRACTOR and OFPSC_SET_U_EXTRACTOR for the lookup scope and the update scope respectively.

This is the same message type used to configure the statefulness of a table, but with different commands.

Following the same logic explained above, the command is parsed in handle_state_mod(). For both OFPSC_SET_L_EXTRACTOR and OFPSC_SET_U_EXTRACTOR, the state_table_set_extractor() function is in charge of storing the list of fields that will be used to look up and update the states.

    static ofl_err
    ofl_structs_extraction_unpack(struct ofp_exp_set_extractor *src, size_t *len,
                                  struct ofl_exp_set_extractor *dst)
    {
        int i;

        /* Expected length: field_count plus one 32-bit word per field,
           plus four single-byte fields. */
        if (*len == ((1+ntohl(src->field_count))*sizeof(uint32_t) + 4*sizeof(uint8_t))
                && (ntohl(src->field_count) > 0)) {
            if (src->table_id >= PIPELINE_TABLES) {
                /* table_id is an 8-bit field, so it is printed with %u (not %zu) */
                OFL_LOG_WARN(LOG_MODULE, "Received STATE_MOD message has invalid table id (%u).", src->table_id);
                return ofl_error(OFPET_BAD_REQUEST, OFPBRC_BAD_TABLE_ID);
            }
            dst->table_id = src->table_id;
            dst->field_count = ntohl(src->field_count);
            for (i = 0; i < dst->field_count; i++) {
                dst->fields[i] = ntohl(src->fields[i]);
            }
        } else {
            /* control of struct ofp_extraction length */
            OFL_LOG_WARN(LOG_MODULE, "Received state mod extraction is too short (%zu).", *len);
            return ofl_error(OFPET_BAD_ACTION, OFPBAC_BAD_LEN);
        }
        *len -= (((1+ntohl(src->field_count))*sizeof(uint32_t)) + 4*sizeof(uint8_t));
        return 0;
    }

1.2.1.3 Flow table population

In order to populate the flow tables, a set of FLOW_MOD messages is sent. These messages define the behaviour of the Mealy machines.
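To make the Mealy machine reading concrete: each flow entry can be seen as one row mapping a (state, packet match) pair to an output action and a next state. The sketch below is a toy model, not switch code. The two-step "knock on port 5123, then port 22 is forwarded" sequence, the struct transition layout, mealy_step() and the reset-to-zero behaviour on unmatched input are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum action { ACT_DROP, ACT_FORWARD };

/* One row plays the role of one flow entry:
 * match on (state, event) -> output action + next state. */
struct transition {
    uint32_t    state;
    uint32_t    event;       /* stands in for the packet match, e.g. a TCP port */
    enum action action;
    uint32_t    next_state;
};

/* Hypothetical sequence: a packet to port 5123 opens port 22. */
static const struct transition table[] = {
    { 0, 5123, ACT_DROP,    1 },
    { 1,   22, ACT_FORWARD, 1 },
};

/* Applies one event to *state and returns the output action.
 * Unmatched input is dropped and resets the flow to state 0 (assumed). */
enum action mealy_step(uint32_t *state, uint32_t event)
{
    size_t i;

    assert(state != NULL);
    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (table[i].state == *state && table[i].event == event) {
            *state = table[i].next_state;
            return table[i].action;
        }
    }
    *state = 0;
    return ACT_DROP;
}
```

The output depends on both the current state and the input, which is what distinguishes a Mealy machine from a plain stateless match/action table.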

The FLOW_MOD messages are natively supported by OpenFlow, so their parsing is handled by the original switch implementation and not by experimenter callbacks. The function pipeline_handle_flow_mod(), called from handle_control_msg(), accepts the message as input and extracts the new rules to be installed in the table specified by the message.

1.2.2 Packet processing

As stated before, pipeline_process_packet() processes an incoming packet from a standard

switch port. When a packet enters a stateful stage, the following state table lookup is

performed:

    if (state_table_is_stateful(table->state_table) && state_table_is_configured(table->state_table)) {
        state_entry = state_table_lookup(table->state_table, pkt);
        if (state_entry != NULL) {
            /* Add the "state" virtual field, then fill in the actual value. */
            ofl_structs_match_exp_put32(&pkt->handle_std->match, OXM_EXP_STATE, 0xBEBABEBA, 0x00000000);
            state_table_write_state(state_entry, pkt);
        }
    }

The fields defined by the lookup-scope are extracted from the packet header and used as key

to access the state table (a hash table). If an entry for the given key is found, the associated

state is returned. If no state is found, the returned value is the default one (zero).
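The lookup semantics just described (key built from the lookup-scope fields, hash table access, default state zero on a miss) can be modelled with a toy table. This is a hedged sketch and not the prototype's data structure, which relies on the switch's own hash map: the fixed 4-byte key, the open-addressing layout, the FNV-1a hash and the names table_set() and table_lookup() are all assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KEY_LEN        4   /* e.g. one IPv4 address as lookup scope */
#define TABLE_SLOTS    8
#define STATE_DEFAULT  0u

struct entry {
    int      used;
    uint8_t  key[KEY_LEN];
    uint32_t state;
};

/* Must be zero-initialised before first use. */
struct state_table {
    struct entry slots[TABLE_SLOTS];
};

/* Toy hash (FNV-1a) reduced to a slot index. */
static unsigned slot_of(const uint8_t *key)
{
    uint32_t h = 2166136261u;
    int i;

    for (i = 0; i < KEY_LEN; i++) {
        h ^= key[i];
        h *= 16777619u;
    }
    return h % TABLE_SLOTS;
}

/* Linear probing: returns the slot holding key, else the first free
 * slot on the probe path, else NULL when the table is full. */
static struct entry *find_slot(struct state_table *t, const uint8_t *key)
{
    unsigned start = slot_of(key), i;

    for (i = 0; i < TABLE_SLOTS; i++) {
        struct entry *e = &t->slots[(start + i) % TABLE_SLOTS];
        if (!e->used || memcmp(e->key, key, KEY_LEN) == 0)
            return e;
    }
    return NULL;
}

/* Inserts or updates the state of a flow; returns -1 if the table is full. */
int table_set(struct state_table *t, const uint8_t *key, uint32_t state)
{
    struct entry *e;

    assert(t != NULL && key != NULL);
    e = find_slot(t, key);
    if (e == NULL)
        return -1;
    e->used = 1;
    memcpy(e->key, key, KEY_LEN);
    e->state = state;
    return 0;
}

/* A miss is not an error: unknown flows are in the default state. */
uint32_t table_lookup(struct state_table *t, const uint8_t *key)
{
    struct entry *e;

    assert(t != NULL && key != NULL);
    e = find_slot(t, key);
    return (e != NULL && e->used) ? e->state : STATE_DEFAULT;
}
```

Returning the default state on a miss means that flows never seen before need no entry at all, so the table only grows with flows that have actually left the default state.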

The following code excerpt is responsible for extracting the lookup and update keys. The key is copied into *buf and is obtained by concatenating, from the received packet pointed to by *pkt, the match fields defined in *extractor:

    int
    __extract_key(uint8_t *buf, struct key_extractor *extractor, struct packet *pkt)
    {
        int i, extracted_key_len = 0, expected_key_len = 0;
        struct ofl_match_tlv *f;

        for (i = 0; i < extractor->field_count; i++) {
            uint32_t type = (int)extractor->fields[i];
            HMAP_FOR_EACH_WITH_HASH(f, struct ofl_match_tlv, hmap_node,
                                    hash_int(type, 0), &pkt->handle_std->match.match_fields) {
                if (type == f->header) {
                    /* OXM_LENGTH keeps only the last 8 bits of the OXM header,
                       which hold the length of the OXM payload. */
                    memcpy(&buf[extracted_key_len], f->value, OXM_LENGTH(f->header));
                    extracted_key_len = extracted_key_len + OXM_LENGTH(f->header);
                    break;
                }
            }
            expected_key_len = expected_key_len + OXM_LENGTH(type);
        }

        /* Check if the full key has been extracted: if the key is extracted
           partially or not at all, we cannot access the state table. */
        if (extracted_key_len == expected_key_len)
            return 1;
        else
            return 0;
    }

The following function retrieves, from the state table pointed to by *table, the state associated with the key extracted from the packet pointed to by *pkt:




    struct state_entry *
    state_table_lookup(struct state_table *table, struct packet *pkt)
    {
        struct state_entry *e = NULL;
        uint8_t key[MAX_STATE_KEY_LEN] = {0};
        struct timeval tv;

        if (!__extract_key(key, &table->read_key, pkt)) {
            OFL_LOG_WARN(LOG_MODULE, "lookup key fields not found in the packet's header -> NULL");
            return NULL;
        }

        HMAP_FOR_EACH_WITH_HASH(e, struct state_entry, hmap_node,
                                hash_bytes(key, MAX_STATE_KEY_LEN, 0), &table->state_entries) {
            if (!memcmp(key, e->key, MAX_STATE_KEY_LEN)) {
                OFL_LOG_WARN(LOG_MODULE, "found corresponding state %u", e->state);

                /* check if the hard_timeout of matched state entry has expired */
                if ((e->stats->hard_timeout > 0) && state_entry_hard_timeout(table, e)) {
                    /* assignment intended: the original listing had the no-op "e == NULL" */
                    if (e->state == STATE_DEFAULT)
                        e = NULL;
                    break;
                }
                /* check if the idle_timeout of matched state entry has expired */
                if ((e->stats->idle_timeout > 0) && state_entry_idle_timeout(table, e)) {
                    if (e->state == STATE_DEFAULT)
                        e = NULL;
                    break;
                }
                /* refresh the last-used timestamp (microseconds) */
                gettimeofday(&tv, NULL);
                e->last_used = 1000000 * tv.tv_sec + tv.tv_usec;
                break;
            }
        }

        if (e == NULL) {
            OFL_LOG_WARN(LOG_MODULE, "not found the corresponding state value\n");
            return &table->default_state_entry;
        } else
            return e;
    }

The returned state value is appended to the packet header as a virtual header field “state”.

When the key fields are not found in the packet's header, a special NULL state value is

returned and the “state” field is not added. The following code is responsible for such

operation:

    void
    state_table_write_state(struct state_entry *entry, struct packet *pkt)
    {
        struct ofl_match_tlv *f;

        HMAP_FOR_EACH_WITH_HASH(f, struct ofl_match_tlv, hmap_node,
                                hash_int(OXM_EXP_STATE, 0), &pkt->handle_std->match.match_fields) {
            /* the state value follows the experimenter ID in the field payload */
            uint32_t *state = (uint32_t*) (f->value + EXP_ID_LEN);
            *state = (*state & 0x00000000) | (entry->state);
        }
    }

In addition to the “state” field, the value of the global registers is added to the packet through

the 'flags' virtual header field.




//set 'flags' virtual header field value
HMAP_FOR_EACH_WITH_HASH(f, struct ofl_match_tlv, hmap_node,
                        hash_int(OXM_EXP_FLAGS, 0), &pkt->handle_std->match.match_fields) {
    uint32_t *flags = (uint32_t *) (f->value + EXP_ID_LEN);
    *flags = (*flags & 0x00000000) | (pkt->dp->global_states);
}

The packet is then processed in the usual flow table with the possibility to match also over

“state” and “flags” header fields.
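The lookup behaviour described above can be summarized with a small Python model. This is only an illustrative sketch, not the ofsoftswitch code: the dictionary-based table, the extract_key helper and the packet representation are assumptions made for the example, and timeout handling is omitted.

```python
STATE_DEFAULT = 0  # default state value, as in the switch code


def extract_key(packet, scope):
    """Return the key as a tuple of the scope's fields, or None if a field is missing."""
    try:
        return tuple(packet[field] for field in scope)
    except KeyError:
        return None


def lookup_state(state_table, lookup_scope, packet):
    """Model of the lookup: None if the key cannot be built, the stored
    state if an entry exists, STATE_DEFAULT otherwise."""
    key = extract_key(packet, lookup_scope)
    if key is None:
        return None
    return state_table.get(key, STATE_DEFAULT)


# Hypothetical table with a single flow keyed by destination MAC address.
table = {("00:00:00:00:00:01",): 2}
known = lookup_state(table, ["eth_dst"], {"eth_dst": "00:00:00:00:00:01"})
unknown = lookup_state(table, ["eth_dst"], {"eth_dst": "00:00:00:00:00:02"})
missing = lookup_state(table, ["eth_dst"], {"ip_dst": "10.0.0.1"})
```

The three calls cover the three outcomes: a stored state, the default state for an unknown flow, and the NULL case in which the “state” field is not added to the packet.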

1.2.2.1 State transition

State transitions can be performed through either set_state actions, triggered by flow table
matches, or state modification messages sent by the controller.

In both cases the same handling function is called; in the latter case the message must
have the command field set to OFPSC_SET_FLOW_STATE. The following function is responsible
for performing a state transition:

void
state_table_set_state(struct state_table *table, struct packet *pkt,
                      struct ofl_exp_set_flow_state *msg,
                      struct ofl_exp_action_set_state *act)
{
    uint8_t key[MAX_STATE_KEY_LEN] = {0};
    struct state_entry *e;
    uint32_t state, state_mask;
    uint32_t idle_rollback, hard_rollback;
    uint32_t idle_timeout, hard_timeout;
    uint64_t now;
    struct timeval tv;
    int i;
    uint32_t key_len = 0; //update-scope key extractor length
    struct key_extractor *extractor = &table->write_key;

    for (i = 0; i < extractor->field_count; i++) {
        uint32_t type = (int)extractor->fields[i];
        key_len = key_len + OXM_LENGTH(type);
    }

    if (pkt) {
        //SET_STATE action
        state = act->state;
        state_mask = act->state_mask;
        idle_rollback = act->idle_rollback;
        hard_rollback = act->hard_rollback;
        idle_timeout = act->idle_timeout;
        hard_timeout = act->hard_timeout;
        if (!__extract_key(key, &table->write_key, pkt)) {
            OFL_LOG_WARN(LOG_MODULE, "lookup key fields not found in the packet's header");
            return;
        }
    } else if (msg) {
        //SET_STATE message
        state = msg->state;
        state_mask = msg->state_mask;
        idle_rollback = msg->idle_rollback;
        hard_rollback = msg->hard_rollback;
        idle_timeout = msg->idle_timeout;
        hard_timeout = msg->hard_timeout;
        if (key_len == msg->key_len) {
            memcpy(key, msg->key, msg->key_len);
        } else {
            OFL_LOG_WARN(LOG_MODULE, "key extractor length != received key length");
            return;
        }
    }

    /* Case 1: an entry for this key already exists and is updated in place. */
    HMAP_FOR_EACH_WITH_HASH(e, struct state_entry, hmap_node,
                            hash_bytes(key, MAX_STATE_KEY_LEN, 0), &table->state_entries) {
        if (!memcmp(key, e->key, MAX_STATE_KEY_LEN)) {
            OFL_LOG_WARN(LOG_MODULE, "state value is %u updated to hash map", state);
            if ((((e->state & ~(state_mask)) | (state & state_mask)) == STATE_DEFAULT)
                    && hard_timeout == 0 && idle_timeout == 0) {
                state_table_del_state(table, key, key_len);
            } else {
                e->state = (e->state & ~(state_mask)) | (state & state_mask);
                gettimeofday(&tv, NULL);
                now = 1000000 * tv.tv_sec + tv.tv_usec;
                e->created = now;
                /* Drop any previously configured timeouts before setting the new ones. */
                if (e->stats->idle_timeout)
                    hmap_remove_and_shrink(&table->idle_entries, &e->idle_node);
                if (e->stats->hard_timeout)
                    hmap_remove_and_shrink(&table->hard_entries, &e->hard_node);
                e->stats->idle_timeout = 0;
                e->stats->hard_timeout = 0;
                e->stats->idle_rollback = 0;
                e->stats->hard_rollback = 0;
                if (hard_timeout > 0 && hard_rollback != ((e->state & ~(state_mask)) | (state & state_mask))) {
                    e->stats->hard_timeout = hard_timeout;
                    e->stats->hard_rollback = hard_rollback;
                    e->remove_at = now + hard_timeout;
                    hmap_insert(&table->hard_entries, &e->hard_node,
                                hash_bytes(key, MAX_STATE_KEY_LEN, 0));
                }
                if (idle_timeout > 0 && idle_rollback != ((e->state & ~(state_mask)) | (state & state_mask))) {
                    e->stats->idle_timeout = idle_timeout;
                    e->stats->idle_rollback = idle_rollback;
                    e->last_used = now;
                    hmap_insert(&table->idle_entries, &e->idle_node,
                                hash_bytes(key, MAX_STATE_KEY_LEN, 0));
                }
            }
            return;
        }
    }

    /* Case 2: no entry exists for this key, so a new one is created. */
    gettimeofday(&tv, NULL);
    now = 1000000 * tv.tv_sec + tv.tv_usec;
    e = xmalloc(sizeof(struct state_entry));
    e->created = now;
    e->stats = xmalloc(sizeof(struct ofl_exp_state_stats));
    e->stats->idle_timeout = 0;
    e->stats->hard_timeout = 0;
    e->stats->idle_rollback = 0;
    e->stats->hard_rollback = 0;
    memcpy(e->key, key, MAX_STATE_KEY_LEN);
    e->state = state & state_mask;

    // A new state entry with state!=DEF is always installed.
    if ((state & state_mask) != STATE_DEFAULT) {
        OFL_LOG_WARN(LOG_MODULE, "state value is %u inserted to hash map", e->state);
        hmap_insert(&table->state_entries, &e->hmap_node,
                    hash_bytes(key, MAX_STATE_KEY_LEN, 0));
    } else {
        // Otherwise a new state entry with state=DEF will be installed only if
        // at least one timeout is set with rollback!=DEF
        if ((hard_timeout > 0 && hard_rollback != STATE_DEFAULT) ||
            (idle_timeout > 0 && idle_rollback != STATE_DEFAULT))
            hmap_insert(&table->state_entries, &e->hmap_node,
                        hash_bytes(key, MAX_STATE_KEY_LEN, 0));
    }

    // Configuring a timeout with rollback state=state makes no sense
    if (hard_timeout > 0 && hard_rollback != (state & state_mask)) {
        e->remove_at = now + hard_timeout;
        e->stats->hard_timeout = hard_timeout;
        e->stats->hard_rollback = hard_rollback;
        hmap_insert(&table->hard_entries, &e->hard_node,
                    hash_bytes(key, MAX_STATE_KEY_LEN, 0));
    }
    if (idle_timeout > 0 && idle_rollback != (state & state_mask)) {
        e->stats->idle_timeout = idle_timeout;
        e->stats->idle_rollback = idle_rollback;
        e->last_used = now;
        hmap_insert(&table->idle_entries, &e->idle_node,
                    hash_bytes(key, MAX_STATE_KEY_LEN, 0));
    }
}
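The core of the transition is the masked update (old & ~mask) | (new & mask), together with the rule that an entry falling back to the default state with no timeout is removed. The following Python sketch models only that part. It is a simplification under stated assumptions: the table is a plain dictionary, and the rollback and timer bookkeeping of the C function are left out.

```python
STATE_DEFAULT = 0  # default state value, as in the switch code


def masked_state(old_state, new_state, state_mask):
    """Bits selected by the mask come from new_state, all other bits are kept."""
    return ((old_state & ~state_mask) | (new_state & state_mask)) & 0xFFFFFFFF


def set_state(table, key, state, state_mask=0xFFFFFFFF, hard_timeout=0, idle_timeout=0):
    """Update a dict-based table; drop the entry when it falls back to the
    default state and no timeout is configured (timer bookkeeping omitted)."""
    old = table.get(key, STATE_DEFAULT)
    new = masked_state(old, state, state_mask)
    if new == STATE_DEFAULT and hard_timeout == 0 and idle_timeout == 0:
        table.pop(key, None)
    else:
        table[key] = new
    return new


table = {}
first = set_state(table, "flow", 0x12)                       # full mask: state becomes 0x12
second = set_state(table, "flow", 0x0500, state_mask=0xFF00)  # only bits 8..15 change
third = set_state(table, "flow", STATE_DEFAULT)               # back to default: entry removed
```

Using a mask lets an application treat the 32-bit state as several independent sub-fields and update one of them without reading the others first.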

1.3 Implementation of the BEBA basic controller prototype

1.3.1 Overview of the Ryu OpenFlow controller

To support BEBA features at the control plane, we extended the well-known Ryu SDN software
framework [RYU]. Ryu is a component-based, open-source SDN framework written in Python.
In addition to NETCONF, OF-Config and SNMP, it supports OpenFlow as a southbound protocol
for managing network devices. Ryu fully supports OpenFlow 1.0, 1.2, 1.3, 1.4, 1.5 and the
Nicira Extensions.

As depicted in Figure 2, the Ryu framework is driven by events: a Ryu application implements

event handlers corresponding to messages that are meant to be received (for example a

packet received from the switch, an error message, features reply messages or other statistics

messages). Event handlers are defined by decorating application class methods so that the

event handler is called from the application’s event loop when such an event occurs.




Figure 2 - Ryu approach [Source: http://thenewstack.io]

Ryu is multi-threaded, but each Ryu application is automatically created with just one
thread (the Event-Processing Thread). This thread pops events from a FIFO Event Receive
Queue and dispatches them to the appropriate event handler. While an event handler has
control, the Ryu application cannot process further events until the handler returns.
Additional threads can be created to perform application-specific processing using the
hub.spawn function.
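The processing model described above (decorated handlers, one FIFO queue, one event-processing thread) can be illustrated with a toy Python program. The sketch deliberately uses only the standard library and does not call Ryu: the MiniApp class, its handler decorator and the event names are hypothetical and serve only to show the dispatch order.

```python
import queue
import threading


class MiniApp:
    """Toy application: one thread pops events from a FIFO queue and
    dispatches each one to the handler registered for its event name."""

    def __init__(self):
        self.events = queue.Queue()  # FIFO event receive queue
        self.handlers = {}           # event name -> handler function
        self.log = []

    def handler(self, event_name):
        """Decorator that registers a function as the handler for event_name."""
        def register(func):
            self.handlers[event_name] = func
            return func
        return register

    def run(self):
        """Event-processing loop: handlers run one at a time, in arrival order."""
        while True:
            name, payload = self.events.get()
            if name is None:         # sentinel used to stop the loop
                break
            self.handlers[name](payload)


app = MiniApp()


@app.handler("packet_in")
def on_packet_in(payload):
    app.log.append(("packet_in", payload))


@app.handler("error")
def on_error(payload):
    app.log.append(("error", payload))


worker = threading.Thread(target=app.run)
worker.start()
app.events.put(("packet_in", 1))
app.events.put(("error", 2))
app.events.put((None, None))
worker.join()
```

Because a single thread drains the queue, a slow handler delays every later event, which is why long-running work is better moved to an additional thread.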

1.3.2 Ryu and OpenFlow experimenters

The Ryu framework supports OpenFlow 1.3 in particular and includes a library for parsing
and serializing OpenFlow messages. The most relevant module is
ryu.ofproto.ofproto_v1_3_parser, which implements OpenFlow 1.3.x.

To support BEBA basic features we need new messages, new actions and new match fields. A

BEBA-enabled controller, at least for the basic version, does not need to provide northbound

API, so we just need to extend the controller to make it able to parse the user-defined

application and generate the appropriate message on the wire.

Instead of patching the above mentioned module, we can exploit experimenter messages and

actions, already implemented in Ryu, and meant to provide “a standard way for OpenFlow

switches to offer additional functionality within the OpenFlow message type space“ [OF13].

In particular, an OpenFlow experimenter message has a well-defined structure with an
experimenter_header and a user-defined payload. We developed a module that, on the one
hand, offers an API to the user to specify parameters and options and, on the other hand,
creates the experimenter message by filling in its fields and packing the payload. The
resulting message, being a standard OpenFlow experimenter message, can be processed and
sent to the appropriate switch using the standard Ryu API.
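To make the message structure concrete, the sketch below builds the bytes of an OpenFlow 1.3 experimenter message by hand: the 8-byte OpenFlow header, the 32-bit experimenter ID, the 32-bit experimenter type, and an opaque payload. It is an illustration only; in the prototype this serialization is left to Ryu, and pack_experimenter is a hypothetical helper name.

```python
import struct

OFP_VERSION_1_3 = 0x04              # OpenFlow 1.3 wire version
OFPT_EXPERIMENTER = 4               # standard experimenter message type
BEBA_EXPERIMENTER_ID = 0xBEBABEBA   # experimenter ID used by the BEBA code


def pack_experimenter(xid, exp_type, payload):
    """Build ofp_header + experimenter ID + exp_type + opaque payload."""
    length = 8 + 8 + len(payload)   # OpenFlow header, experimenter fields, payload
    header = struct.pack("!BBHI", OFP_VERSION_1_3, OFPT_EXPERIMENTER, length, xid)
    return header + struct.pack("!II", BEBA_EXPERIMENTER_ID, exp_type) + payload


msg = pack_experimenter(xid=1, exp_type=0, payload=b"\x00\x00\x00\x01")
```

Everything after the first 16 bytes is opaque to a switch that does not know the experimenter ID, which is what allows BEBA messages to coexist with standard OpenFlow traffic.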

The same applies for actions: by exploiting OpenFlow experimenter actions structure, our

module gives BEBA actions support by relying on experimenter actions of Ryu API. In this way

users can seamlessly make use of instructions with both OpenFlow and BEBA actions in their

applications.

Finally, as far as match fields are concerned, Ryu already implements matching on
experimenter fields, such as the Nicira ones.

The choice of the OpenFlow Experimenter approach allows a minimal departure from the
original Ryu code and ensures easy maintenance of the code during update operations (this
factor is not negligible in an active open-source project such as Ryu).

1.3.3 Implementation of the BEBA extensions

In this section, the controller side of the FSM instantiation is described with a simple example.

The MAC Learning application offers a simple, but complete use case. In section 5.1 of D2.1 we

specified how the application can be mapped onto BEBA abstraction, while in this section we

are going to show the use of the BEBA API in the context of that particular use case.

The complete source code of the application is available from the BEBA project homepage;

here we will focus just on the main components.

In order to support BEBA, Ryu has been extended by adding two new modules
(ryu.ofproto.beba_v1_0 and ryu.ofproto.beba_v1_0_parser), a new class in
ryu.ofproto.oxm_fields, and two new match fields in ryu.ofproto.ofproto_v1_3.

1.3.3.1 Table configuration

The following two function calls are responsible for instantiating and sending a message of

type OFPSC_STATEFUL_TABLE_CONFIG:

req = osparser.OFPExpMsgConfigureStatefulTable(datapath=datapath, table_id=0, stateful=1)
datapath.send_msg(req)

where:
- osparser is the BEBA module offering the BEBA API,
- OFPExpMsgConfigureStatefulTable() is a method which returns an Experimenter
  message packed with the user-provided data according to the specification of the
  message (4.1.1 in D2.1).

The following Python code excerpt is responsible for implementing the

OFPExpMsgConfigureStatefulTable() function:

# Protocol constants, accessed below through osproto
OFPT_EXP_STATE_MOD = 0
OFPSC_STATEFUL_TABLE_CONFIG = 0
OFP_EXP_STATE_MOD_PACK_STR = '!Bx'
OFP_EXP_STATE_MOD_STATEFUL_TABLE_CONFIG_PACK_STR = '!BB'

def OFPExpMsgConfigureStatefulTable(datapath, stateful, table_id):
    command = osproto.OFPSC_STATEFUL_TABLE_CONFIG
    # State-mod header: command (1 byte) + 1 byte of padding
    data = struct.pack(osproto.OFP_EXP_STATE_MOD_PACK_STR, command)
    # Command body: table_id (1 byte) + stateful flag (1 byte)
    data += struct.pack(osproto.OFP_EXP_STATE_MOD_STATEFUL_TABLE_CONFIG_PACK_STR, table_id, stateful)
    exp_type = osproto.OFPT_EXP_STATE_MOD
    return ofproto_parser.OFPExperimenter(datapath=datapath, experimenter=0xBEBABEBA,
                                          exp_type=exp_type, data=data)

It’s worth noting that the returned object, req, is a Ryu API object representing a standard

OpenFlow OFPExperimenter message, not a custom BEBA object. Thus it is possible to send it

using method datapath.send_msg(req).
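As a quick check of what ends up in the experimenter payload, the following standalone sketch repeats the two struct.pack calls with the pack strings and constants listed above and decodes the result again. The helper name stateful_table_config_payload is hypothetical, and no Ryu objects are involved.

```python
import struct

# Values copied from the BEBA listing above.
OFPSC_STATEFUL_TABLE_CONFIG = 0
OFP_EXP_STATE_MOD_PACK_STR = '!Bx'
OFP_EXP_STATE_MOD_STATEFUL_TABLE_CONFIG_PACK_STR = '!BB'


def stateful_table_config_payload(table_id, stateful):
    """Return the bytes carried as 'data' by the experimenter message."""
    data = struct.pack(OFP_EXP_STATE_MOD_PACK_STR, OFPSC_STATEFUL_TABLE_CONFIG)
    data += struct.pack(OFP_EXP_STATE_MOD_STATEFUL_TABLE_CONFIG_PACK_STR, table_id, stateful)
    return data


payload = stateful_table_config_payload(table_id=0, stateful=1)
# Decode again: command, (pad), table_id, stateful
command, table_id, stateful = struct.unpack('!BxBB', payload)
```

The payload is four bytes long: the command, one padding byte, the table ID and the stateful flag.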

1.3.3.2 Lookup-scope and update-scope configuration

The same consideration holds for the configuration of the lookup-scope and update-scope.
Here is the code used by a BEBA application:

""" Set lookup extractor = {eth_dst} """
req = osparser.OFPExpMsgKeyExtract(datapath=datapath,
                                   command=osp.OFPSC_EXP_SET_L_EXTRACTOR,
                                   fields=[ofp.OXM_OF_ETH_DST],
                                   table_id=0)
datapath.send_msg(req)

""" Set update extractor = {eth_src} """
req = osparser.OFPExpMsgKeyExtract(datapath=datapath,
                                   command=osp.OFPSC_EXP_SET_U_EXTRACTOR,
                                   fields=[ofp.OXM_OF_ETH_SRC],
                                   table_id=0)
datapath.send_msg(req)




While the following is part of the BEBA library:

OFP_EXP_STATE_MOD_EXTRACTOR_PACK_STR='!B3xI'

def OFPExpMsgKeyExtract(datapath, command, fields, table_id):
    field_count=len(fields)
    data=struct.pack(osproto.OFP_EXP_STATE_MOD_PACK_STR, command)
    # table_id (1 byte), 3 bytes of padding, field_count (4 bytes)
    data+=struct.pack(osproto.OFP_EXP_STATE_MOD_EXTRACTOR_PACK_STR,table_id,field_count)
    field_extract_format='!I'
    if field_count <= osproto.MAX_FIELD_COUNT:
        # one 32-bit OXM header per extractor field
        for f in range(field_count):
            data+=struct.pack(field_extract_format,fields[f])
    else:
        LOG.error("OFPExpMsgKeyExtract: Number of fields > MAX_FIELD_COUNT")
    exp_type=osproto.OFPT_EXP_STATE_MOD
    return ofproto_parser.OFPExperimenter(datapath=datapath, experimenter=0xBEBABEBA,
                                          exp_type=exp_type, data=data)
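The fields passed to the key extractor are 32-bit OXM headers, whose low byte gives the length of the field value; the switch sums these lengths to obtain the key length. The sketch below rebuilds the two headers used in the example from the OpenFlow 1.3 OXM layout and packs an extractor payload with the pack strings shown above. The command value SET_L_EXTRACTOR = 1 and the helper names are assumptions for illustration and are not taken from the BEBA sources.

```python
import struct

OFPXMC_OPENFLOW_BASIC = 0x8000  # standard OXM class


def oxm_header(field, length, hasmask=0, oxm_class=OFPXMC_OPENFLOW_BASIC):
    """32-bit OXM header: class (16 bits), field (7), hasmask (1), length (8)."""
    return (oxm_class << 16) | (field << 9) | (hasmask << 8) | length


def oxm_length(header):
    """Length in bytes of the field value, i.e. the low 8 bits of the header."""
    return header & 0xFF


OXM_OF_ETH_DST = oxm_header(field=3, length=6)  # Ethernet destination address
OXM_OF_ETH_SRC = oxm_header(field=4, length=6)  # Ethernet source address

SET_L_EXTRACTOR = 1  # hypothetical command value, for illustration only


def key_extractor_payload(command, table_id, fields):
    """command + pad, then table_id + 3 pad bytes + field count, then the headers."""
    data = struct.pack('!Bx', command)
    data += struct.pack('!B3xI', table_id, len(fields))
    for header in fields:
        data += struct.pack('!I', header)
    return data


payload = key_extractor_payload(SET_L_EXTRACTOR, 0, [OXM_OF_ETH_DST])
key_len = sum(oxm_length(h) for h in [OXM_OF_ETH_DST])
```

With a single Ethernet address in the scope, the key is 6 bytes long and the payload occupies 14 bytes.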

1.3.3.3 set_state action support

Let us consider the following lines of code, which build the list of actions to
be executed (see Fig. 4 in Chapter 5 of D2.1):

actions = [osparser.OFPExpActionSetState(state=i, table_id=0, hard_timeout=10),
           ofparser.OFPActionOutput(out_port)]

OFPExpActionSetState() is part of BEBA API and returns a standard OpenFlow

OFPActionExperimenterUnknown after packing into the payload the parameters provided by

the user. Here follows the Python code for such function:

def OFPExpActionSetState(state, table_id, hard_timeout=0, idle_timeout=0,
                         hard_rollback=0, idle_rollback=0, state_mask=0xffffffff):
    act_type=osproto.OFPAT_EXP_SET_STATE
    # Timeouts are given in seconds by the user and sent in microseconds
    data=struct.pack(osproto.OFP_EXP_ACTION_SET_STATE_PACK_STR, act_type, state, state_mask,
                     table_id, hard_rollback, idle_rollback,
                     hard_timeout*1000000, idle_timeout*1000000)
    return ofproto_parser.OFPActionExperimenterUnknown(experimenter=0xBEBABEBA, data=data)




The resulting action list can be used as an input argument for the Ryu class
OFPInstructionActions. BEBA actions can be freely mixed with OpenFlow actions, since
BEBA actions are transparently packed as OpenFlow Experimenter actions.

1.3.3.4 State match support

The BEBA state field is matched like any other OpenFlow field. For example, the following
line defines a match on the input port and on the state:

match = ofparser.OFPMatch(in_port=i, state=s)

Even in this case standard match fields such as input port and custom (experimenter) match

fields can be put in the same match without problems.

1.4 In-switch packet generation prototype

The in-switch packet generation (InSP) API introduces the possibility of programming the

generation of a packet from the switch, in reaction to some kind of events, e.g., packet

reception. A first description of the API was presented already in D4.1, to which we refer the

reader for further details about the API concepts. In this section, we will focus on describing

the prototype implementation details.

We implemented a subset of the InSP API (almost the entire API, with the exclusion of the

copy instructions) using the OpenFlow experimenter extensions, which enable compatibility

with standard OpenFlow switches. Please notice that the implementation of InSP is orthogonal

to the implementation of the stateful processing API. That is, while we believe that the

combination of these two APIs nicely complements the capability of the SDN switches, we also

believe that the implementation of any one of them has value per se.

The InSP API has two main components: the packet template table, together with the

messages required to manage it, and the triggering instruction. Here, we briefly recall that the

packet template table contains packet templates which are used as a model (i.e., a template)

for a generated packet, while the triggering instruction can be used together with a flow table

entry to trigger the generation of a packet from a packet template. The remainder of this

section describes the implementation of those two components.

1.4.1 Packet template table

The packet template table is the main data structure added to the switch in order to support

the InSP API. The structure is presented in the following listing (taken from ofl-exp-beba.h):

/*************************************************************************/
/*                     experimenter pkttmp table                         */
/*************************************************************************/
struct pkttmp_table {
    struct datapath *dp;
    size_t entries_num;
    struct hmap entries;
};

struct pkttmp_entry {
    struct hmap_node node;
    uint32_t pkttmp_id;
    struct datapath *dp;
    struct pkttmp_table *table;
    uint64_t created;
    uint8_t *data;
    size_t data_length;
};

The packet template table is implemented as a hash map, with each entry identified by a
32-bit packet template id, which serves as the table key. In the current implementation, a
packet template entry further contains the byte array that specifies the content of the
generated packets and a creation timestamp, which may be used for inspecting the switch
state and/or to implement a future timeout feature on the table’s entries. A handful of
functions are also defined to deal with these data structures:

/* experimenter pkttmp table functions */
struct pkttmp_table *
pkttmp_table_create(struct datapath *dp);

void
pkttmp_table_destroy(struct pkttmp_table *table);

/* experimenter pkttmp entry functions */
struct pkttmp_entry *
pkttmp_entry_create(struct datapath *dp, struct pkttmp_table *table,
                    struct ofl_exp_add_pkttmp *mod);

void
pkttmp_entry_destroy(struct pkttmp_entry *entry);

Entries of the packet template table are added and deleted using a new type of experimenter
message named PKTTMP_MOD. A PKTTMP_MOD has two possible commands: add and delete.
The add command contains all the information required to specify a packet template table
entry, while the delete command contains only the pkttmp_id of the entry to be deleted.

The OpenFlow protocol headers for PKTTMP_MOD are specified in beba-exp.h, where the
structures to serialize/deserialize them are defined as follows:

/****************************************************************
 *
 *   OFPT_EXP_PKTTMP_MOD
 *
 ****************************************************************/
struct ofp_exp_msg_pkttmp_mod {
    struct ofp_experimenter_header header; /* OpenFlow's standard experimenter message header */
    uint8_t command;
    uint8_t pad;
    uint8_t payload[];
};

struct ofp_exp_add_pkttmp {
    uint32_t pkttmp_id;
    uint8_t pad[4];
    /* uint8_t data[0]; */ /* Packet data. The length is inferred
                              from the length field in the header. */
};

struct ofp_exp_del_pkttmp {
    uint32_t pkttmp_id;
    uint8_t pad[4];
};

enum ofp_exp_msg_pkttmp_mod_commands {
    OFPSC_ADD_PKTTMP = 0,
    OFPSC_DEL_PKTTMP
};

The corresponding internal data structures to deal with the message are defined in
ofl-exp-beba.h:

/************************
 * pkttmp mod messages
 ************************/
struct ofl_exp_msg_pkttmp_mod {
    struct ofl_exp_beba_msg_header header; /* OFP_EXP_PKTTMP_MOD */
    enum ofp_exp_msg_pkttmp_mod_commands command;
    uint8_t payload[12]; //(sizeof(ofl_exp_add_pkttmp))
};

struct ofl_exp_add_pkttmp {
    uint32_t pkttmp_id;
    size_t data_length;
    uint8_t *data;
};

struct ofl_exp_del_pkttmp {
    uint32_t pkttmp_id;
};




When the switch receives a PKTTMP_MOD, a dedicated handler function processes the
experimenter message. In short, the function adds an entry to, or deletes an entry from,
the packet template table, depending on the command type (ADD/DELETE).
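The wire layout implied by the structures above can be sketched in a few lines of Python. The example covers only the part that follows the standard experimenter header (command, padding, then the add or delete body) and assumes the fields are laid out back to back as declared, without compiler-inserted padding; the helper names are hypothetical.

```python
import struct

OFPSC_ADD_PKTTMP = 0  # values from enum ofp_exp_msg_pkttmp_mod_commands
OFPSC_DEL_PKTTMP = 1


def pkttmp_add_payload(pkttmp_id, pkt_data):
    """command + pad, then pkttmp_id + 4 pad bytes, then the raw packet bytes."""
    return (struct.pack('!Bx', OFPSC_ADD_PKTTMP)
            + struct.pack('!I4x', pkttmp_id)
            + bytes(pkt_data))


def pkttmp_del_payload(pkttmp_id):
    """command + pad, then pkttmp_id + 4 pad bytes."""
    return struct.pack('!Bx', OFPSC_DEL_PKTTMP) + struct.pack('!I4x', pkttmp_id)


add_payload = pkttmp_add_payload(7, b"\xaa\xbb")
del_payload = pkttmp_del_payload(7)
```

The add body carries no explicit data length: as the comment in ofp_exp_add_pkttmp notes, the receiver infers it from the length field of the message header.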

1.4.2 In-Switch Packet Generation Instruction

The InSP API defines an instruction to trigger the generation of a packet. The instruction is

defined as an OpenFlow experimenter instruction, i.e., it can be used in any context an

OpenFlow instruction is used, for instance, in the definition of a flow table’s entry.

The instruction internal structure is defined in ofl-exp-beba.h:

/*************************************************************************/
/*                 experimenter instructions ofl_exp                     */
/*************************************************************************/
struct ofl_exp_beba_instr_header {
    struct ofl_instruction_experimenter header; /* BEBA_VENDOR_ID */
    uint32_t instr_type;
};

struct ofl_exp_instruction_in_switch_pkt_gen {
    struct ofl_exp_beba_instr_header header; /* OFPIT_EXP_IN_SWITCH_PKT_GEN */
    uint32_t pkttmp_id;
    size_t actions_num;
    struct ofl_action_header **actions;
};

To allow for the definition of further instructions in the future, the BEBA experimenter
instruction has a general header which defines a type field. An In-Switch Packet Generation
instruction has a value of OFPIT_EXP_IN_SWITCH_PKT_GEN for this field. The instruction
then contains the packet template’s id, which specifies the template from which the packet
content is derived, and a set of actions to be applied to the generated packet.

Notice that the In-Switch Packet Generation instruction is very similar to the apply/write

actions instructions defined in the OpenFlow specification. In fact, the actions specified by the

instruction are those specified by the OpenFlow specification, while the In-Switch Packet

Generation instruction only adds the packet template id information.

Whenever an InSP instruction is activated, e.g., because of a flow entry matching a packet, the

handling of such instruction is performed by the experimenter instruction callback defined in

ofsoftswitch (cf. Section 1.3). In particular we implemented the dp_exp_inst() callback to

perform the following operations:

1. Extract the pkttmp_id from the instruction;

2. Look up the packet template corresponding to the pkttmp_id;

3. Generate a new packet using the template to create the packet’s content;

4. Execute the instruction’s action list on the generated packet.

Notice that since any action is supported in the InSP instruction, the packet may be delivered

to a switch’s port, or it may also be injected at the beginning of the switch’s pipeline (using the

TABLE port as value of the output port).

Also, notice that the packet that triggered the execution of the InSP instruction continues its

life on the switch’s pipeline. That is, any other instruction specified before and/or after the

InSP instruction is applied on the received packet, and the packet’s action set is eventually

executed.
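The four steps performed by the callback can be modelled in Python as follows. This is a sketch under stated assumptions and not the dp_exp_inst() code: the table is a dictionary, actions are plain callables, and returning an empty list for an unknown template is a choice made for the example, since the behaviour of the switch in that case is not described here.

```python
class PacketTemplateTable:
    """Dict-based stand-in for the packet template table."""

    def __init__(self):
        self.entries = {}

    def add(self, pkttmp_id, data):
        self.entries[pkttmp_id] = bytes(data)

    def delete(self, pkttmp_id):
        self.entries.pop(pkttmp_id, None)


def output(port):
    """Hypothetical output action: reports the port and the packet bytes."""
    def action(packet):
        return (port, bytes(packet))
    return action


def in_switch_pkt_gen(table, pkttmp_id, actions):
    """Look up the template, copy it into a new packet and apply the
    actions in order; returns one result per action."""
    template = table.entries.get(pkttmp_id)   # steps 1 and 2
    if template is None:
        return []
    packet = bytearray(template)              # step 3: independent copy
    return [action(packet) for action in actions]  # step 4


table = PacketTemplateTable()
table.add(0, b"\x01\x02")
sent = in_switch_pkt_gen(table, 0, [output(1)])
nothing = in_switch_pkt_gen(table, 9, [output(1)])
```

The generated packet is a copy of the template, so the triggering packet and the template itself are left untouched by the actions.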

1.4.3 Controller InSP API example

We now provide a brief presentation of the InSP API as implemented in the Ryu controller. We
do not repeat the implementation details, as they are similar to those presented in
Section 1.3, and focus instead on the specific InSP API calls.

The creation of a packet template is achieved by sending a PKTTMP_MOD of type ADD, which
contains the pkttmp_id and the template of the packet’s data. A usage example is presented in
the following code snippet:

""" Create PKTTMP entries """
req = osparser.OFPExpMsgAddPktTmp(datapath=datapath, pkttmp_id=0, pkt_data=pkt_data)
datapath.send_msg(req)

where datapath is an object representing the switch in the Ryu API.

To trigger the generation of a packet using such a packet template, we then need to specify a
flow entry that has the InSP instruction in its instruction list. The following snippet
presents an example:

""" Create PKTTMP trigger (install flow entry) """
match = ofparser.OFPMatch(in_port=1)
actions = [ofparser.OFPActionOutput(1)]
inst = [osparser.OFPInstructionInSwitchPktGen(pkttmp_id, actions)]
mod = ofparser.OFPFlowMod(datapath=datapath, table_id=table_id, priority=priority,
                          match=match, instructions=inst)
datapath.send_msg(mod)

In this specific example, notice that both the match structure’s in_port and the action

output’s out_port have the same value. This is typically not allowed by the OpenFlow




specification, since it states that a received packet should not be sent out to the original

receiving port. However, the InSP instruction generates a new packet, whose lifecycle is

independent from the originally received packet’s lifecycle.

1.5 State synchronization mechanism prototype

The ability to synchronize the states of the dataplane switches in a BEBA network is a
fundamental principle of this next-generation SDN approach. The state synchronization API,
as part of the basic BEBA API, is introduced in D4.1, where all the concepts are detailed.
The goal of this section is to present a prototype implementation of a part of this API and
to provide more technical details on its internal mechanisms. State synchronization relies
on two logical channels that pass information from the BEBA controller to a BEBA switch
and vice versa. The following sections describe the messages that pass through these two
channels.

1.5.1 Controller to Switch

This channel carries downstream messages with which a BEBA controller queries the
state of a BEBA switch. Three messages are introduced in D4.1, grouped under REQ-C6:

1. REQ-C6-1: Given a state table and a flow key, get the associated state of this flow.

2. REQ-C6-2: Given a state table and a state, get the flows currently in that state.

3. REQ-C6-3: Retrieve the global state of a switch.

Switch-side

The switch must contain the appropriate data structures to understand the meaning of these
queries, thus we first introduce the C code of ofsoftswitch that formalizes these data
structures.

struct ofl_exp_msg_multipart_request_state {
    struct ofl_exp_beba_msg_multipart_request header; /* OFPMP_STATE */
    uint8_t table_id;        /* ID of table to read (from ofp_table_multipart),
                                0xff for all tables. */
    uint8_t get_from_state;
    uint32_t state;
    struct ofl_match_header *match; /* Fields to match. */
};

This structure accommodates both REQ-C6-1 and REQ-C6-2. First, the table ID indicates the
state table to query. If the table ID is set to 0xff (all tables), the switch queries every
table. Then, the field get_from_state defines the type of the query. REQ-C6-1 sets this
field to 0 and uses the last field of the structure (i.e. the match pointer) to specify the
key of the flow whose state the controller wants to retrieve. REQ-C6-2, on the other hand,
sets get_from_state=1 and specifies the state through the state variable. The expected
result of this second query is the set of flow keys that are currently in this state.

The switch responds to the two messages above by utilizing a reply data structure as follows:

struct ofl_exp_msg_multipart_reply_state {
    struct ofl_exp_beba_msg_multipart_reply header; /* OFPMP_STATE */
    size_t stats_num;
    struct ofl_exp_state_stats **stats;
};

This structure contains two fields (besides its own header). The former field tells the controller

how many entries are sent by the switch while the latter field contains the information of these

entries in a table. This table is defined below:

struct ofl_exp_state_stats {
    uint8_t table_id;        /* ID of table flow came from. */
    uint32_t duration_sec;   /* Time state entry has been alive in secs. */
    uint32_t duration_nsec;  /* Time state entry has been alive in nsecs
                                beyond duration_sec. */
    uint32_t field_count;    /* number of extractor fields */
    uint32_t fields[OFPSC_MAX_FIELD_COUNT]; /* extractor fields */
    uint32_t hard_rollback;
    uint32_t idle_rollback;
    uint32_t hard_timeout;   /* Number of seconds before expiration. */
    uint32_t idle_timeout;   /* Number of seconds idle before expiration. */
    struct ofl_exp_state_entry entry; /* Description of the state entry. */
};

The data structure ofl_exp_state_stats indicates the table from which the information is
taken, encodes a set of statistics related to the query and finally describes the state entry
retrieved from the table. The comments attached to each field briefly explain its purpose,
so we focus only on the last field, which contains the information that the controller is
asking about. This structure is defined below:

struct ofl_exp_state_entry {
    uint32_t key_len;
    uint8_t key[OFPSC_MAX_KEY_LEN];
    uint32_t state;
};




Depending on the type of the query (i.e. REQ-C6-1 or REQ-C6-2), the controller extracts
either the state (if the query asks for the state of a flow, as per REQ-C6-1) or the key
values (if the query asks for the flows in a given state, as per REQ-C6-2).
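The semantics of the two queries can be captured in a short Python model. It is only a sketch: the state table is a dictionary from flow key to state, the reply is a list of (key, state) pairs standing in for the ofl_exp_state_stats entries, and returning an empty reply for an unknown key is an assumption made for the example.

```python
def handle_state_request(state_table, get_from_state, state=0, key=None):
    """Return the (key, state) pairs a reply would carry.

    get_from_state == 0: look up the entry of a single flow key (REQ-C6-1).
    get_from_state == 1: list every flow currently in `state` (REQ-C6-2).
    """
    if get_from_state:
        return sorted((k, s) for k, s in state_table.items() if s == state)
    if key in state_table:
        return [(key, state_table[key])]
    return []


# Hypothetical state table with three flows.
table = {"aa": 1, "bb": 2, "cc": 1}
one_flow = handle_state_request(table, get_from_state=0, key="bb")
in_state_1 = handle_state_request(table, get_from_state=1, state=1)
```

In both cases every reply entry carries a key and a state; the query type only determines which of the two the controller already knows.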

As far as the global state is concerned, the switch contains a simpler set of data structures that

encode this state. In the following pieces of code, the variable flag corresponds to the global

state.

A request for the global state (REQ-C6-3) can be issued using this message structure:

struct ofl_exp_msg_multipart_request_global_state {
    struct ofl_exp_beba_msg_multipart_request header; /* OFPMP_FLAGS */
};

Since there is only one global state variable per switch, this message simply contains an

appropriate header that is uniquely identifiable by the switch. When seen by the switch, a reply

is composed as follows:

struct ofl_exp_msg_multipart_reply_global_state {
    struct ofl_exp_beba_msg_multipart_reply header; /* OFPMP_FLAGS */
    uint32_t global_states;
};

The switch uses again an appropriate header that corresponds to the global state and one

variable that gives the value of this state.

Controller-side

Having specified the switch-side data structures and messages for REQ-C6-1/2/3, we now
show how the controller can issue remote calls to retrieve the information.

In the Ryu prototype, a single method is responsible for issuing the first two messages
(i.e. REQ-C6-1 and REQ-C6-2). This method is listed below:

def OFPExpStateStatsMultipartRequest(datapath, flags=0, table_id=ofproto.OFPTT_ALL,
                                     state=None, match=None):
    get_from_state = 1
    if state is None:
        get_from_state = 0
        state = 0
    if match is None:
        match = ofproto_parser.OFPMatch()
    data=bytearray()
    msg_pack_into(osproto.OFP_STATE_STATS_REQUEST_0_PACK_STR, data, 0, table_id,
                  get_from_state, state)
    offset=osproto.OFP_STATE_STATS_REQUEST_0_SIZE
    match.serialize(data, offset)
    exp_type=osproto.OFPMP_EXP_STATE_STATS
    return ofproto_parser.OFPExperimenterStatsRequest(datapath=datapath, flags=flags,
                                                      experimenter=0xBEBABEBA,
                                                      exp_type=exp_type, data=data)

The controller specifies the target switch using the variable datapath. Variable table_id
specifies the table from which we want to retrieve the state. If no table is given, the
controller queries all the tables by default (OFPTT_ALL). We then use either variable state
or variable match, depending on whether we want to issue REQ-C6-2 or REQ-C6-1 respectively.
Note that only one of these two variables should be set (i.e. different from None), so that
the two messages can be told apart. As discussed above, the distinction is conveyed through
the variable get_from_state in the switch’s data structure (see switch-side).

In order to avoid programming errors, we made two different wrappers of this Ryu method,

each corresponding to the appropriate message. These methods explicitly set either variable

state or match to None. We list these wrappers below:

Controller-side request for message REQ-C6-1

def OFPExpGetFlowState(datapath, flags=0, table_id=ofproto.OFPTT_ALL, match=None):
    state = None
    return OFPExpStateStatsMultipartRequest(datapath, flags, table_id, state, match)

Controller-side request for message REQ-C6-2

def OFPExpGetFlowsInState(datapath, flags=0, table_id=ofproto.OFPTT_ALL, state=None):
    match = None
    return OFPExpStateStatsMultipartRequest(datapath, flags, table_id, state, match)

Finally, the request for the global state can be issued using the method below:


def OFPExpGlobalStateStatsMultipartRequest(datapath, flags=0):
    data=bytearray()
    exp_type=osproto.OFPMP_EXP_FLAGS_STATS
    return ofproto_parser.OFPExperimenterStatsRequest(datapath=datapath, flags=flags,
                                                      experimenter=0xBEBABEBA,
                                                      exp_type=exp_type, data=data)

This request composes a structure ofl_exp_msg_multipart_request_global_state as introduced

above (switch-side). This prototype completely covers the controller to switch interactions

regarding the state synchronization.




1.5.2 Switch to Controller

The other logical channel of the state synchronization API details the messages and data

structures required to allow a switch to generate a notification. Based on the specification of

these messages in D4.1, we define:

1. REQ-C5: Request a packet template if it does not exist in the packet template table.

2. REQ-C7: Generates notifications about state transitions.

3. REQ-C8: Generates positive acknowledgements when a (set of) rule(s) is truly installed

in the flow table.

In this prototype, we partially cover the functionality of this channel by implementing message
REQ-C7. The remaining two messages (i.e. REQ-C5 and REQ-C8) will be reported in the next
WP2 deliverable.

State Change Notification (REQ-C7)

First of all, a state change can be caused by two different events. One event can be a

controller command that demands a state change of a particular switch. This event is

synchronous and it does not make sense for a switch to generate a state change notification

since the controller already knows about the imminent state change.

However, a state change notification becomes important when asynchronous events from the

dataplane (e.g. an arrived packet or an expired timer) cause a state transition in the switch. In

this case, the switch must make the controller aware of the change.

With the above in mind, an upstream notification requires one extra experimenter message at

the switch side. We complemented the enumeration below with message

OFPT_EXPT_STATE_CHANGED.

/* EXPERIMENTER MESSAGES */
enum ofp_exp_messages {
    OFPT_EXP_STATE_MOD,
    OFPT_EXPT_STATE_CHANGED
};

The switch function called upon a state set is declared in ofl-exp-beba.h as:

void state_table_set_state(struct state_table *, struct packet *,
                           struct ofl_exp_set_flow_state *msg,
                           struct ofl_exp_action_set_state *act);

Upon a state transition, if it is caused by an asynchronous event, we set a variable

NOTIFY_STATE_CHANGES and the following piece of code is executed in the implementation of this

function (ofl-exp-beba.h):




HMAP_FOR_EACH_WITH_HASH(e, struct state_entry, hmap_node,
                        hash_bytes(key, MAX_STATE_KEY_LEN, 0), &table->state_entries) {
    if (!memcmp(key, e->key, MAX_STATE_KEY_LEN)) {
        OFL_LOG_WARN(LOG_MODULE, "state value is %u updated to hash map", state);
        if ((((e->state & ~(state_mask)) | (state & state_mask)) == STATE_DEFAULT)
                && hard_timeout == 0 && idle_timeout == 0) {
            /* Back to DEFAULT with no timeouts: the entry is removed.
               The old state is read before the entry is deleted. */
            old_state = e->state;
            state_table_del_state(table, key, key_len);
            new_state = STATE_DEFAULT;
        } else {
            /* State transition: apply the masked update to the stored state */
            old_state = e->state;
            e->state = (e->state & ~(state_mask)) | (state & state_mask);
            new_state = e->state;
            gettimeofday(&tv, NULL);
            now = 1000000 * tv.tv_sec + tv.tv_usec;
            e->created = now;
            /* Clear the timeouts configured by a previous update */
            if (e->stats->idle_timeout)
                hmap_remove_and_shrink(&table->idle_entries, &e->idle_node);
            if (e->stats->hard_timeout)
                hmap_remove_and_shrink(&table->hard_entries, &e->hard_node);
            e->stats->idle_timeout = 0;
            e->stats->hard_timeout = 0;
            e->stats->idle_rollback = 0;
            e->stats->hard_rollback = 0;
            if (hard_timeout > 0
                    && hard_rollback != ((e->state & ~(state_mask)) | (state & state_mask))) {
                e->stats->hard_timeout = hard_timeout;
                e->stats->hard_rollback = hard_rollback;
                e->remove_at = now + hard_timeout;
                hmap_insert(&table->hard_entries, &e->hard_node,
                            hash_bytes(key, MAX_STATE_KEY_LEN, 0));
            }
            if (idle_timeout > 0
                    && idle_rollback != ((e->state & ~(state_mask)) | (state & state_mask))) {
                e->stats->idle_timeout = idle_timeout;
                e->stats->idle_rollback = idle_rollback;
                e->last_used = now;
                hmap_insert(&table->idle_entries, &e->idle_node,
                            hash_bytes(key, MAX_STATE_KEY_LEN, 0));
            }
        }
        /* Notify controller about state change, only if needed */
        if (NOTIFY_STATE_CHANGES) {
            struct ofl_exp_msg_notify_state_changed message = {
                {.header = OFPT_EXPERIMENTER, .experimenter_id = 0xBEBABEBA},
                .type = OFPT_EXPT_STATE_CHANGED,
                .table_id = extractor->table_id,
                .old_state = old_state,
                .new_state = new_state,
                .state_mask = state_mask,
                .key_len = key_len};
            /* key is an array member, so it is copied rather than assigned */
            memcpy(message.key, key, key_len);
            dp_send_message(dp, (struct ofl_msg_header *)&message, NULL);
        }
        return;
    }
}

The relevant parts of the snippet above are the state transitions and the message emission.

We compose a new message to notify the controller about a transition. This new message

belongs to the experimenter family of messages for BEBA (experimenter header=0xBEBABEBA),

and encodes the message type (type=OFPT_EXPT_STATE_CHANGED), the table in which the transition

occurs (table_id), the old (old_state) and new (new_state) state values, the state mask

(state_mask) and the flow (key_len, key) that is affected by this state change. The data

structure can be found in ofl-exp-beba.h and is depicted below:

struct ofl_exp_msg_notify_state_changed {
    struct ofl_msg_experimenter header;
    enum ofp_exp_messages type;
    uint8_t table_id;
    uint32_t old_state;
    uint32_t new_state;
    uint32_t state_mask;
    uint32_t key_len;
    uint8_t key[OFPSC_MAX_KEY_LEN];
};

The controller needs to read this message and unpack it properly in order to decode all the

information above.
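
As an illustration of the controller-side decoding, the following is a minimal Python sketch and not the BEBA controller code. It assumes a hypothetical wire layout for the body that follows the experimenter header and message type: table_id, three padding bytes, old_state, new_state, state_mask and key_len in network byte order, followed by key_len bytes of key. The names StateChanged and parse_state_changed are ours; the actual layout is defined by the switch-side packing code.

```python
import struct
from collections import namedtuple

StateChanged = namedtuple(
    "StateChanged", "table_id old_state new_state state_mask key")

# Hypothetical body layout (an assumption, see the text above):
# table_id (1 byte), 3 pad bytes, then four 32-bit fields, then the key.
_FIXED_FMT = "!B3xIIII"
_FIXED_LEN = struct.calcsize(_FIXED_FMT)


def parse_state_changed(body):
    """Decode the body of a state change notification into a StateChanged."""
    if len(body) < _FIXED_LEN:
        raise ValueError("truncated state change notification")
    table_id, old_state, new_state, state_mask, key_len = struct.unpack_from(
        _FIXED_FMT, body, 0)
    key = bytes(body[_FIXED_LEN:_FIXED_LEN + key_len])
    if len(key) != key_len:
        raise ValueError("truncated flow key")
    return StateChanged(table_id, old_state, new_state, state_mask, key)
```

A controller application would call parse_state_changed on the payload of each received experimenter message whose type is OFPT_EXPT_STATE_CHANGED.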

2 FPGA-based hardware proof of concept prototype

To gain an understanding of the feasibility and wire-speed operation of the BEBA abstraction, we

have implemented a proof-of-concept (PoC) hardware prototype using an experimental FPGA

platform. The designed hardware prototype is conservative in terms of TCAM entries and clock

frequency but it includes all the key components and features of the BEBA abstraction,

including support for cross-flow state management. We remark that the main focus of this

work is to show the feasibility of a hardware implementation, not to present a fully deployed

FPGA implementation. With this scope, we do not focus on an efficient implementation of

well-defined IP blocks such as TCAMs, but only use a simple implementation able to provide

the TCAM functionality, albeit with a reduced number of entries. Further pushing our FPGA

implementation is thus not only well out of the scope of this work, but it is also of limited

practical relevance, as carrier-grade implementations in the order of Terabit/s throughput [5]




in any case would require a custom ASIC implementation. Considering that the TCAM size does

not affect throughput given its O(1) access time, we believe that even if at reduced scale, our

implementation permits us to comparatively argue about complexity and performance with

respect to an equivalent OpenFlow implementation.

In this section, we describe the main blocks composing the prototype and we discuss the

performance obtained in the current FPGA prototype, and we estimate the performance achievable

with an ASIC implementation. In particular, we show that the stateful

extension can be developed reusing the same building blocks of a standard OpenFlow

implementation, while the proposed extensions (mainly the support for “cross-flow” state

handling, i.e. allowing the arrival of a packet of a given flow to trigger a state transition for a

different flow) can be implemented with simple yet efficient combinatorial hardware blocks.

The proposed implementation is highly scalable, since the main issue in terms of scalability is

related to the number of flow states (that are not in the DEFAULT state) to manage. Since

these states can be easily stored in a d-left hash table, the scalability of the system is related

to the maximum size of the SRAM memory (more than 2 million flows can be stored in a 32

MB embedded SRAM).

Finally, based on our implementation, we show that one of the expected issues, i.e. the

number of clock cycles required to update a state, can be limited to a few clock cycles, thus

allowing a state update so fast that it does not compromise the functionality of the PoC even

when back-to-back packets belonging to the same flow and coming from the same network

interface must be processed.

2.1 Development platform

The PoC has been designed using as target development board the NetFPGA SUME [6], an x8

Gen3 PCIe adapter card incorporating Xilinx’s Virtex-7 690T FPGA [7], four SFP+ transceivers

providing four 10GbE links, three 72 Mbit QDR II SRAMs and two 4GB DDR3 memories. The

board also provides a USB connector for FPGA programming and debugging and a UART

interface. The general scheme of the PoC prototype is depicted in Figure 4. The FPGA is

clocked at 156.25 MHz, with a 64 bits data path from the Ethernet ports, corresponding to a

10 Gbps throughput per port.
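
As a quick check of these figures, the following sketch computes the raw throughput of a data path from its clock frequency and its width. The function and constant names are ours and serve only as an illustration.

```python
def bus_throughput_bps(clock_hz, width_bits):
    """Raw throughput of a data path that moves width_bits per clock cycle."""
    return clock_hz * width_bits


CLOCK_HZ = 156_250_000      # 156.25 MHz FPGA clock
PORT_WIDTH_BITS = 64        # data path width of one Ethernet port

per_port_bps = bus_throughput_bps(CLOCK_HZ, PORT_WIDTH_BITS)   # 10 Gbps
four_ports_bps = 4 * per_port_bps                              # 40 Gbps
```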

The NetFPGA SUME provides a flexible infrastructure to develop hardware for networks. In

particular, the NetFPGA SUME firmware configures the SFP+ transceivers to work as four 10GbE

Ethernet links, exploiting the 10G Ethernet MAC hard-macros of the Xilinx FPGA. The data

input/output of the Ethernet MAC are based on the AXI4-Stream protocol. This protocol has

been selected by Xilinx and ARM as the default protocol for their IPs and peripherals to use in

conjunction with Xilinx and ARM products. This choice allows a fast prototype development that

exploits an IP based design flow to put together several hardware blocks that are commonly

used in the design of network apparatus.




Figure 3 - NetFPGA SUME infrastructure.

Figure 3 provides the view of the top level of the NetFPGA SUME infrastructure. All the blocks

are connected using the AXI4-Stream protocol and are provided as IP from Xilinx or from the

NetFPGA SUME repository. The blocks are mainly used to set-up the 10G Ethernet MAC, the

I2C peripheral for the clock generation, the UART for serial communication, the PCIe

communication block for the PC interface and the block for the generation of the secondary




clocks (such as the one that will be used for the microprocessor). Exploiting this infrastructure,

we interfaced the BEBA prototype (the block inside the red square) using the same AXI4-

Stream protocol used by the other blocks. Therefore, the NetFPGA SUME infrastructure

provides the BEBA prototype with 4 ingress and 4 egress 64 bits wide ports, each one able to

sustain a throughput of 10 Gbps.

2.2 Beba components

The BEBA prototype uses four ingress queues to collect the packets coming from the ingress

ports. Then, a 4-input 1-output mixer block aggregates the packets using a round robin policy.

The output of the mixer is a 320 bits data bus able to provide an overall throughput of 50

Gbps. A delay queue stores the packets during the time needed by the PoC tables to operate.

The packets first go to the look-up and update extractor blocks that build the keys that are

used to read/update the state table. The state table is composed of a d-left hash table, a

TCAM and a companion SRAM. The new key, obtained by composing the previous key and the

extracted state, is fed to the second TCAM/SRAM pair that is in charge of executing the FSM.

The output of this TCAM/SRAM pair provides the command for the Action block and (if

required) a new state that will be written in the hash table. In the following, a detailed

description of the blocks composing the node prototype is presented.

Figure 4 - Scheme of the BEBA HW PoC prototype




2.2.1 Microcontroller

The BEBA prototype hosts a small Microblaze (a soft microprocessor core specifically designed

for Xilinx FPGAs) that is used to provide a communication interface between the prototype and

the external world. In particular, the UART interface is used to send the configuration

command to the BEBA prototype and to retrieve debug/status information. The software

running on the Microblaze acts as an agent for the BEBA prototype, allowing the configuration of the

various elements composing the prototype (configuration registers, TCAM and RAM memories,

etc.). In order to configure the elements composing the core of the BEBA prototype, each

component is memory mapped in the address space of the Microblaze, which can directly

read/write the content of these components. These configuration capabilities can be also

exploited to perform some operations on the data path that are not subject to tight timing

constraints. A typical example is the aging of the entries stored in the state table. In

fact, it is possible to program the microprocessor to perform a periodic scrubbing of the state

table, removing all the entries older than a specific threshold.
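
The periodic scrubbing can be sketched as follows. This is only an illustration in Python, not the Microblaze firmware: the state table is modelled as a dictionary mapping a flow key to a hypothetical (state, expires_at) pair, and the function name is ours.

```python
def scrub_state_table(table, now):
    """Remove every entry whose expiration time has passed.

    table maps a flow key to a (state, expires_at) pair.
    Returns the number of removed entries.
    """
    expired = [key for key, (_, expires_at) in table.items()
               if expires_at <= now]
    for key in expired:
        del table[key]
    return len(expired)
```

In this model the microprocessor would call scrub_state_table at a fixed period, passing the current equipment time.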

2.2.2 Look-up/update extractors

Figure 5 - HW block implementing the field selection for the look-up/update key extraction

The look-up extractor and the update extractor select a subset of the incoming header to

create the look-up and update keys. The selection of the fields composing the keys is based on

the use of two basic operations: the first operation selects the beginning of the header (i.e.

performs the shift of the input key), while the second operation is a bit-wise mask operation.

In particular, each key is built using two fields, each one defined by a starting pointer and a

bit-wise mask.

Two barrel shifters take as input the initial 320 bits of the packet and each one provides as

output 64 contiguous bits starting from the initial bit defined by the corresponding starting

pointers. The two extracted fields are put together forming a 128 bits vector and bit-wise

masked providing the required key. Each extractor is configured using two configuration

registers, one for the shifters and one for the mask.




The logical operations composing the extractor are easily implementable in hardware and are

able to process the incoming data at the required clock frequency, while maintaining a high

degree of flexibility needed to select different kinds of protocol fields. In the case of the MAC

learning example, the lookup scope, that is the destination MAC address, can be configured

by setting the starting pointer to 0 and the mask configuration register to mask all but the first 48

bits of the packet. Instead, for the update scope, we can set the starting pointer to 48 and the

same mask configuration register used for the destination MAC.

We note that in this case, we have multiple choices for the selection of the source MAC, since

the source MAC and the destination MAC are adjacent in the header, and therefore the selection of the

source or destination MAC address can also be done by changing only the mask configuration register.

Instead, if the lookup scope selects the {SRC,DST} MAC pair and the update scope selects the

{DST,SRC} MAC pair, the lookup will use 0 and 48 as starting pointers, while the update will

use 48 and 0 as starting pointers, thus providing the cross flow state management of Ethernet

flows.
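
The behaviour of the extractors described above can be modelled in a few lines of Python. The sketch below is a software illustration under our own assumptions, not the hardware description: the 320 bits header is held in an integer whose most significant bit is the first bit of the packet, and all function and constant names are ours.

```python
HEADER_BITS = 320   # header window seen by the extractors
FIELD_BITS = 64     # width of each barrel shifter output


def header_from_bytes(data):
    """Build the 320 bits header word from the first bytes of a packet."""
    n = HEADER_BITS // 8
    return int.from_bytes(data[:n].ljust(n, b"\x00"), "big")


def extract_field(header, start):
    """Return 64 contiguous bits of the header starting at bit `start`."""
    if not 0 <= start <= HEADER_BITS - FIELD_BITS:
        raise ValueError("starting pointer out of range")
    shift = HEADER_BITS - FIELD_BITS - start
    return (header >> shift) & ((1 << FIELD_BITS) - 1)


def extract_key(header, ptr_a, ptr_b, mask):
    """Concatenate the two extracted fields and apply the bit-wise mask."""
    key = (extract_field(header, ptr_a) << FIELD_BITS) | extract_field(header, ptr_b)
    return key & mask


# Mask keeping the first 48 bits of each 64 bits field (a MAC address pair)
MAC_PAIR_MASK = (0xFFFFFFFFFFFF0000 << 64) | 0xFFFFFFFFFFFF0000
# Mask keeping only the first 48 bits of the key (a single MAC address)
SINGLE_MAC_MASK = 0xFFFFFFFFFFFF0000 << 64
```

With pointers (0, 48) for the lookup and (48, 0) for the update, the update key of a frame equals the lookup key of the frame travelling in the opposite direction, which is the cross-flow behaviour described above. With SINGLE_MAC_MASK, a first pointer of 0 selects the destination MAC and a first pointer of 48 selects the source MAC, as in the MAC learning example.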

We remark that the use of a configuration mask allows combining together multiple fields to

form the scope (e.g. the field of the source IP and the field of the destination port of a TCP

packet can be combined), with the limitation that the bits composing these fields must fall within the

128 bits wide window defined by the index configuration register. This limitation does

not seem particularly important, since typical scopes can refer to multiple fields, but they are

usually contiguous fields, or fields with a limited distance. Instead, the use of two different

pairs of registers for the lookup and update scope selection allows decoupling the header fields

used for the state lookup from the header fields used for the update.

Another motivation that drives the selection of this index/mask implementation of the

extractor block is that the extractors always provide as output a fixed width 128 bits value

(even if some of the 128 bits are often masked to 0). This allows the key

provided by the extractors to be used directly as input for the TCAM and the hash table used to implement the

state table.

Furthermore, this implementation of the extractors makes it easy to extend the BEBA prototype

functionality to select the update scope depending on the state of the packet under inspection.

In fact, the per-state update scope selection only requires storing the values of the 2 registers

(the 2 bytes of the index register plus the 9 bytes of the mask register) in the FSM execution

table, and to apply these values to configure the update scope extractor block.

As a last remark, we notice that the lookup/update extractor configuration is somewhat similar

to the format of a Protocol-Oblivious Forwarding (POF) element [8]. While the lookup/update

extractor configuration is composed of an offset and a mask, the corresponding POF element is

formed by an offset and a length. However, the mask configuration register can be used to

shorten the length of the key to extract (i.e. setting to 0 the last n bits of the incoming input).

Therefore, this simple block can provide a superset of the operations provided by a POF

element.




As previously stated, the proposed extractor blocks represent a good compromise between the

hardware feasibility and the field extraction flexibility required by the BEBA prototype. A

further enhancement of the extractor blocks could be achieved using a generic P4

description [17] of the header fields and implementing on the FPGA a more complex header

extractor.

2.2.3 State/timer table

The state labels (a 32 bits value) are stored in a d-left hash table, with d=4. The table is sized

for 4K entries, and is accessed with 128 bit keys provided, alternatively, by the look-up

extractor and the update extractor during state read and (re)write, respectively. The state

table provides the state associated to a flow identified by the look-up extractor. The table also

stores a 32 bits value representing the expiration time of the flow. When a flow is

inserted/updated in the table, the expiration time is computed as the current time plus the

validity time of the flow.

The actual implementation of the state table depends on several design parameters (hardware

or software, type of memories, latency, throughput, number of flows to manage etc.). For the

hardware implementation, a good compromise between flexibility and hardware cost consists

in implementing the state table by using a hash table and a small TCAM. In our proof-of-

concept FPGA implementation, the TCAM has 32 entries of 128 bits (TCAM1), and an

associated RAM block of 32 entries of 32 bits (RAM1) that reads the output of the TCAM and

provides the state associated to the specific TCAM row. As already anticipated, the limited size

of the TCAM is due to the difficulties of implementing efficient TCAMs using the FPGA

resources. We note that an effective and scalable FPGA implementation of Ternary Content

Addressable Memories is still an open research issue [9], [10], [11], especially since the

priority resolution hardware limits the maximum operating frequency when the number of

TCAM entries increases. Indeed, the performance achievable with FPGA TCAMs is still far, in

terms of size and clock frequency, from those attainable by a full custom ASIC TCAM design

[12]. The TCAM is needed to handle special (wildcard) cases, such as static state assignment

to a pool of flows (e.g. ACLs) or flow categories that are out of the scope of the machine and

must be processed in a different way (if necessary by another stage, either stateless or

stateful). Instead, the hash table keeps track of the state of the flows traveling in the network.

As stated at the beginning of this section, the hash table is a d-left hash table with d=4, sized for 4K

entries and accessed with 128 bits keys. The choice of the right hash

table structure to realize the state table is one of the most important design choices, since this

has an impact on the scalability of the solution (since this hash table will store the state for all

the individual active flows in the network) and on the maximum throughput (since a look-up

on the hash table must be performed for each packet).

Performing a look-up for each packet requires a hash table with a constant access

time. This choice avoids the issues related to a non-fixed worst case delay, which could

compromise the overall performance of the system. Therefore, we implemented the hash table




using a multiple hash structure (MHT) such as the ones described in [13]. Since the different

hash tables can be accessed in parallel, the look-up of a key can be performed in only one

clock cycle, avoiding that the access to the hash table becomes a bottleneck for the system.

However, in order to simplify the hardware structure, and to avoid a bottleneck in the update

of the state, also the insertion of items in the hash table must be performed in a fixed number

(namely one) of clock cycles. This rules out the use of MHTs with moving capabilities

(such as the well-known cuckoo hash tables) since the insertion time of these structures is not

deterministic. This choice, which corresponds to the implementation of a d-left hash table, will

also bring a big simplification in the management of the multiple hash table. The drawback of

this choice is the reduced memory efficiency with respect to cuckoo hashing. While a 4-way

cuckoo hash can reach 99% memory occupancy, the d-left hash permits a memory

occupancy of the order of 70%.

Figure 6 - d-left hash table HW implementation

Figure 6 shows the blocks composing the d-left hash table. The four hash blocks perform four

different H3 hash functions [14]. These hash functions have been chosen since they can be

easily implemented in hardware. The four RAM blocks store the content of the hash table,

while the comparator block checks which RAM block actually stores the queried key and

provides the value associated to that key. Since each flow can require up to two memory

accesses, one for look-up and one for insert/update, the required throughput for the hash

table is two times the maximum throughput of the system. In order to avoid this bottleneck,




the FPGA Block RAMs are configured to work as dual port RAMs, with one write port and one

read port. Therefore, each RAM is able to provide a read and a write operation for each clock

cycle, thus maintaining the target throughput of 50 Gbps. Finally, the control block manages

the insert, remove and write signals needed to update/delete the hash table entries.
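
The structure of Figure 6 can be modelled in software as in the following Python sketch. It is an illustration under our own assumptions, not the FPGA design: each of the four ways is assumed to hold one entry per row (1024 rows per way, 4K entries in total), the H3 matrices are drawn from a seeded pseudo-random generator, and all class and method names are ours. The four ways are scanned sequentially here, whereas the hardware reads the four RAMs in parallel.

```python
import random

KEY_BITS = 128
WAYS = 4               # d = 4
ROWS_PER_WAY = 1024    # 4 x 1024 = 4K entries (one entry per row, assumed)
INDEX_BITS = 10        # log2(ROWS_PER_WAY)


class H3Hash:
    """H3 hash: XOR of one random word for every bit set in the key."""

    def __init__(self, rng):
        self.q = [rng.getrandbits(INDEX_BITS) for _ in range(KEY_BITS)]

    def __call__(self, key):
        h = 0
        for i in range(KEY_BITS):
            if (key >> i) & 1:
                h ^= self.q[i]
        return h


class DLeftTable:
    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.hashes = [H3Hash(rng) for _ in range(WAYS)]
        self.ways = [[None] * ROWS_PER_WAY for _ in range(WAYS)]

    def _slots(self, key):
        return [(w, self.hashes[w](key)) for w in range(WAYS)]

    def lookup(self, key):
        """Return the value stored for key, or None if the key is absent."""
        for w, row in self._slots(key):
            entry = self.ways[w][row]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        """Insert or update a key; entries are never moved. False if full."""
        slots = self._slots(key)
        for w, row in slots:                      # update in place if present
            entry = self.ways[w][row]
            if entry is not None and entry[0] == key:
                self.ways[w][row] = (key, value)
                return True
        for w, row in slots:                      # leftmost free candidate
            if self.ways[w][row] is None:
                self.ways[w][row] = (key, value)
                return True
        return False

    def remove(self, key):
        for w, row in self._slots(key):
            entry = self.ways[w][row]
            if entry is not None and entry[0] == key:
                self.ways[w][row] = None
                return True
        return False
```

Since an insertion only inspects the four candidate rows and never relocates existing entries, its cost is fixed, at the price of insertions that fail when all four candidates are occupied.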

2.2.4 Metadata block

This block provides some additional information (metadata) that will be used by the XFSM to

decide the next state transitions and the actions to perform on the incoming packet. Even if it

is possible to identify several types of metadata, here we limit this information to a few basic

types that can be used in several different applications. In particular, we select as metadata

the following information:

1. input interface: 8 bits value indicating the input interface from which that packet

arrived,

2. timestamp: this information is a 32 bits vector providing the current equipment time,

3. expired flag: we compare the current timestamp with the expiration time provided by the

state table, signaling if the associated flow is expired.

4. random number: this information is a 16 bits vector providing a random value. A typical

use case for this metadata is the use for load balancing.

The metadata is appended to the 320 bits packet header and is sent to the Flow table.

2.2.5 FSM table

Like the state table, the FSM execution table is an abstract structure that can be

implemented in several ways depending on the underlying hardware that will provide the FSM

execution functionalities. In our prototype, the FSM execution table is implemented by a TCAM.

This choice allows the highest degree of flexibility while providing a constant one clock cycle

access time. The TCAM has 128 entries of 160 bits (TCAM2), associated to a Block RAM of 128

entries of 96 bits (RAM2), storing the next state used to update the flow table, the specific

action to perform on the packet and the validity time of the flow. While the next state and the

validity time are used to update the flow table, the action is sent to the action block. The TCAM

takes as input the retrieved flow state and a 128 bits vector extracted from the packet header

+ metadata vector using a key extractor (called FSM scope extractor) similar to the one used

for look-up and update.

The TCAM provides as output the row associated to the matching rule with the highest priority. As

previously mentioned, the limited number of entries of the TCAMs is due to the inefficient

mapping of these structures on an FPGA. This number, however, is similar to that of other

FPGA based TCAM implementations, such as [15].
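
The ternary match with priority resolution performed by the FSM table can be sketched in Python as follows. This is a behavioural model under our own assumptions, not the FPGA implementation: the lowest row index is assumed to have the highest priority, the 160 bits key is assumed to carry the 32 bits state in its upper bits, and the names Tcam, fsm_key and fsm_lookup are ours.

```python
SCOPE_BITS = 128
STATE_BITS = 32


class Tcam:
    def __init__(self, size):
        self.rows = [None] * size      # each row is (value, mask) or None

    def write(self, index, value, mask):
        """Program a row; bits where mask is 0 are wildcards."""
        self.rows[index] = (value & mask, mask)

    def match(self, key):
        """Return the index of the first matching row, or None."""
        for index, row in enumerate(self.rows):
            if row is not None:
                value, mask = row
                if (key & mask) == value:
                    return index
        return None


def fsm_key(state, scope):
    """Compose the 160 bits key: state in the upper 32 bits (assumed order)."""
    state &= (1 << STATE_BITS) - 1
    scope &= (1 << SCOPE_BITS) - 1
    return (state << SCOPE_BITS) | scope


def fsm_lookup(tcam2, ram2, state, scope):
    """Return the (next_state, action, validity) tuple of the matching row."""
    index = tcam2.match(fsm_key(state, scope))
    return None if index is None else ram2[index]
```

A row with an all-zero mask acts as a catch-all rule and is typically placed at the lowest priority.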




2.2.6 Packet output

A final Action Block applies the retrieved action to the packet coming from the delay queue.

Since our prototype is a proof of concept, only a basic subset of OpenFlow actions

has been implemented so far: drop, select (enable one or more of the output ports to forward the

packet), and tag (insert/modify/remove the VLAN tag). This block then provides as output the

four 64 bits data-bus for the four 10 Gbits/sec egress ports. The Action block is realized

composing several blocks that provide elementary operations. All the blocks share the same

I/O interface, which is composed of an input port and one or more egress ports. This choice

makes it possible to extend the functionality of the prototype by adding new blocks if more

actions are needed. In the implemented prototype, the first elementary block

inserts/removes the VLAN tag, or modifies the VLAN tag value. The subsequent elementary

block is a select block that takes the packet as input and provides one output for each port.

Depending on the value of the action given by the FSM execution table, the block selects on

which output ports the packet must be forwarded. If no output is selected, the packet is

dropped.

2.2.7 Configuration interface

As mentioned above, the BEBA prototype can be configured by the microprocessor writing on

the memory addresses corresponding to the various configurable blocks, i.e. the extractors,

the two TCAMs and the two RAMs. Moreover, the microprocessor is able to read/write on the

hash table, in order to perform maintenance tasks (removal of unused/expired entries), to reset

the BEBA prototype to a clean state, or to install specific flows directly in the state table.

The following table summarizes the configuration blocks and provides their address range (the

number of bytes allocated for each configurable entity).

Name                             Address Range
General debug/status registers   0x80000000-0x80007FFF
Lookup pointers                  0x80008010-0x8000801F
Update pointers                  0x80008020-0x8000802F
Lookup mask                      0x80008030-0x8000803F
Update mask                      0x80008040-0x8000804F
FSM scope pointers               0x80008050-0x8000805F
TCAM1                            0x80010000-0x80010FFF
TCAM2                            0x80011000-0x80011FFF
RAM1                             0x80020000-0x80020FFF
RAM2                             0x80021000-0x80021FFF
Hash Table                       0x80100000-0x8010FFFF

Table 1 - memory mapping of the Beba configurable blocks

The general debug/status registers are used for general monitoring/debug tasks

(enable/disable 10GbE interfaces, count number of IP/TCP/UDP packets etc.).

The configuration of the BEBA prototype is triggered by a command sent to the microprocessor

using the UART interface. The command specifies the address defined in Table 1 corresponding

to the block to configure and the value to write.

The microprocessor reads the commands coming from the ofsoftswitch controller, parses the

content of the commands and translates them into a set of read/write commands to send to the

BEBA prototype. From each ofsoftswitch message, the microprocessor extracts the

destination of the data (extractors, TCAMs, RAMs) and

the value to write. Then, if needed, the microcontroller translates the

addresses of the TCAM/RAM rows into the absolute address space of the microcontroller,

and sends the actual commands to the prototype.
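
The address translation step can be illustrated with the following Python sketch. The base addresses are those of Table 1, while the number of bytes occupied by each row (row_stride) is not specified in this document and is therefore a parameter of the sketch; the function name is ours.

```python
# Address ranges of the row-addressable blocks, taken from Table 1
BLOCK_RANGES = {
    "TCAM1": (0x80010000, 0x80010FFF),
    "TCAM2": (0x80011000, 0x80011FFF),
    "RAM1": (0x80020000, 0x80020FFF),
    "RAM2": (0x80021000, 0x80021FFF),
}


def row_address(block, row, row_stride):
    """Translate a row index of a block into an absolute address.

    row_stride is the number of bytes per row (hypothetical parameter).
    """
    first, last = BLOCK_RANGES[block]
    address = first + row * row_stride
    if row < 0 or address + row_stride - 1 > last:
        raise ValueError("row outside the address range of " + block)
    return address
```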

2.3 Synthesis and PoC simulation

Figure 7 - Captured waveform for a PoC simulation




In order to show the behavior of the BEBA prototype, in this subsection we present a small

simulation of BEBA. The system has been configured to implement the port knocking example

that has been previously described in deliverable D2.1. The system inspects several packets,

changing the state corresponding to the SRC IP of the incoming requests and dropping the

packets until the final state is reached. Figure 7 presents the result of the simulation.

The waveform shows the ingress bus (only one of the ingress queues is presented in the

waveform), the four egress queues, and some signals of the hash table and of the TCAM1 and

TCAM2 blocks. In particular, it is possible to see that the first packet only produces a

match in TCAM1 (the flow is in the default state), while the subsequent packets coming

from the same SRC IP are matched also by the hash table. The signal of TCAM2 shows the

state transitions that occur during the processing of the packets. The flow state starts from the

DEFAULT state (labeled as 16) and moves to the intermediate states (labeled as 11, 10, 9)

as the packets with the right TCP destination port arrive. At the end of the state

transitions, the flow state arrives in the OPEN state (labeled as 5), which corresponds to the

forwarding action (the value of the action signal is 0x00000020) in which the packets are

transmitted out of port tx2 of the switch. Instead, during the intermediate states the

packets are dropped, as indicated by the action signal (the value of the action signal is

0x00000100), which sets the drop flag for the action block.
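
The behaviour observed in the waveform can be reproduced with a small software model. The Python sketch below is an illustration under our own assumptions: the state labels and action values are those reported above, while the knock ports are hypothetical placeholders (the actual sequence is defined in D2.1), a wrong port is assumed to bring the flow back to the DEFAULT state, and a dictionary replaces the state table.

```python
DEFAULT, STAGE_1, STAGE_2, STAGE_3, OPEN = 16, 11, 10, 9, 5
ACTION_FORWARD_TX2 = 0x00000020
ACTION_DROP = 0x00000100

KNOCK_PORTS = (5123, 6234, 7345, 8456)     # hypothetical knock sequence
_SEQUENCE = (DEFAULT, STAGE_1, STAGE_2, STAGE_3, OPEN)


class PortKnocking:
    def __init__(self):
        self.state_table = {}              # SRC IP -> state, DEFAULT if absent

    def process(self, src_ip, dst_port):
        """Return the action for a packet and update the state of its flow."""
        state = self.state_table.get(src_ip, DEFAULT)
        if state == OPEN:
            return ACTION_FORWARD_TX2
        step = _SEQUENCE.index(state)
        if dst_port == KNOCK_PORTS[step]:
            next_state = _SEQUENCE[step + 1]
        else:
            next_state = DEFAULT           # assumed reset on a wrong port
        if next_state == DEFAULT:
            self.state_table.pop(src_ip, None)
        else:
            self.state_table[src_ip] = next_state
        return ACTION_DROP
```

In this model every packet processed before the OPEN state is dropped, including the one that completes the sequence; only the following packets of the same SRC IP are forwarded.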

The whole system has been synthesized using the standard Xilinx design flow: the resource

occupation for the implemented system, in terms of used logic resources, is presented in the

table below.

Type of resources      # of used resources      [%]
Number of Slice LUTs   63,742 out of 433,200    14%
Block RAMs             254 out of 1,470         17%

Table 2 - FPGA resources used for the Beba prototype

The PoC prototype requires less than 20% of the available FPGA resources. This result indicates that the PoC

architecture is well suited for a hardware implementation, since the implemented system is

able to provide a minimal but complete implementation of the BEBA abstraction, and is able to

sustain a considerable throughput.

2.4 Discussion and extensions

The FPGA prototype confirms the feasibility of the BEBA implementation. The additional

hardware needed to support cross-flow state management (namely, the extractor modules)

uses a negligible amount of logic resources and does not exhibit any implementation criticality.




Similarly, the limited number of actions and TCAM entries implemented in the prototype is

due only to the proof-of-concept nature of our prototype (and to the lack of OpenFlow hardware

from which BEBA would directly inherit these parts).

2.5 Limitations

If compared with an OpenFlow implementation, BEBA exhibits only one (minor) shortcoming.

The system update latency, i.e. the time interval from the first table lookup to the last state

update, is 5 clock cycles. The FPGA prototype is able to sustain the full throughput of 40

Gbits/sec provided by the 4 switch ports. If we suppose a minimum packet size of 40 bytes

(320 bits), the system is able to process 1 packet for each clock cycle, and thus up to 5

packets could be pipelined. However, the feedback loop (not present in the forward-only

OpenFlow pipelines [16]) raises a concern: the state update performed for a packet at the fifth

clock cycle would be missed by pipelined packets. This could be an issue for packets belonging

to the same flow arriving back-to-back (consecutive clock cycles). In practice, as long as the

system is configured to work by aggregating N ≥ 5 different links, the mixer’s round robin

policy will separate two packets coming from the same link by N clock cycles, thus solving the

problem. Note that the 5 clock cycles latency is fixed by the hardware blocks used in the FPGA

(the TCAM and the Block RAMs) and basically does not change with scaling up the number of

ingress ports or moving to an ASIC. Moreover, we remark that the typical control update

mechanism of OpenFlow also does not allow determining exactly at which time instant a new rule

is installed in the flow tables, since this update is heavily dependent on the way in which the

tables are implemented in the actual OpenFlow switch.
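
The effect of the round robin policy on the feedback loop can be checked with a simple cycle-level model. The Python sketch below assumes the worst case in which every link carries back-to-back packets of a single flow, with one packet entering the pipeline per clock cycle; the function name and this traffic model are ours.

```python
UPDATE_LATENCY = 5   # clock cycles from the first lookup to the state update


def stale_lookups(n_links, n_cycles, latency=UPDATE_LATENCY):
    """Count the lookups that read a state before the previous update lands."""
    visible_at = {}      # flow -> cycle at which its latest update is visible
    stale = 0
    for cycle in range(n_cycles):
        flow = cycle % n_links            # round robin mixer, one flow per link
        if visible_at.get(flow, 0) > cycle:
            stale += 1
        visible_at[flow] = cycle + latency
    return stale
```

With five or more links no lookup is stale in this model, while with four links every packet after the first round reads a state that is one update behind.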

2.6 Performance achievable with an ASIC implementation

As previously stated, while an FPGA prototype makes it possible to assess feasibility, a full

performance/scale architecture requires ASIC technology. Following the same technology

assumptions of [5], a BEBA ASIC design would be able to work at 1GHz operating frequency.

This corresponds to an aggregate throughput of 960M packets/s, that is the maximum

achievable by a 64 ports 10 Gb/s switch chip. However, the most important scaling provided

by the ASIC implementation is given by the number of entries that can be stored in the BEBA

tables. The size of the SRAM that can be instantiated on a last generation chip is up to 32 MB,

corresponding to 2 millions of entries in the d-left hash for the Flow table. The size of a TCAM

can be up to 40 Mb, corresponding to 256K FSM table entries.
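As a rough consistency check on these figures, the width available to each table entry can be derived from the quoted memory sizes. The sketch below assumes binary prefixes (1 MB = 2^20 bytes, 1 Mb = 2^20 bits, "2 million" = 2 × 2^20, 256K = 256 × 2^10); these readings are assumptions, since the text does not state them.

```python
MI = 2 ** 20  # binary "mega" (assumed reading of MB / Mb / million)
KI = 2 ** 10  # binary "kilo" (assumed reading of K)

SRAM_BITS = 32 * MI * 8     # 32 MB of SRAM, expressed in bits
FLOW_ENTRIES = 2 * MI       # 2 million d-left hash entries (Flow table)
TCAM_BITS = 40 * MI         # 40 Mb of TCAM
FSM_ENTRIES = 256 * KI      # 256K FSM table entries


def bits_per_entry(total_bits, entries):
    """Memory budget per entry, in bits (integer division)."""
    return total_bits // entries


if __name__ == "__main__":
    print("Flow table entry:", bits_per_entry(SRAM_BITS, FLOW_ENTRIES), "bits")
    print("FSM table entry:", bits_per_entry(TCAM_BITS, FSM_ENTRIES), "bits")
```

Under these assumptions each Flow table entry has a budget of 128 bits and each FSM table entry a budget of 160 bits.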

3 Performance analysis

3.1 Test descriptions

The BEBA basic API implementation is based on the open-source OpenFlow 1.3 reference switch called “ofsoftswitch13”, originally developed by CPqD [1]. The purpose of ofsoftswitch13 is to provide a simple implementation of the features described by OpenFlow 1.3, especially targeted at developers willing to extend switch functionalities for testing purposes; it therefore has a simpler codebase than more advanced performance-oriented implementations such as Open vSwitch. The main drawback in this case is throughput: ofsoftswitch13 cannot forward packets at a speed suitable for high-performance networks (> 1 Gb/s). However, the simplicity of its codebase made it an ideal choice for our proof-of-concept implementations, as it allows us both to test new functionalities and to formally describe packet processing features.

We performed experiments to evaluate the performance of both ofsoftswitch13 (unmodified) and our BEBA switch. In the following, we refer to these implementations as OFSS13 and BS, respectively.

Figure 8 depicts the testbed topology used for our experiments. A client machine sends UDP

packets at a constant rate to a server. Both machines are connected through a switch. For

each experiment, we evaluate the throughput as the average bit rate measured at the server.

Figure 8 - Testbed topology

We preferred to keep the flow table configuration simple: we do not want to compare the effects of different flow table configurations (e.g. wildcard match vs. exact match), but rather to measure the latency that BEBA stateful operations might introduce. For this reason, we developed a simple pipeline in which packets are forwarded bidirectionally based on the input port (match on port 1, forward on port 2, and vice versa), each time varying the number of stages (state table and flow table) that packets have to go through. Figure 9 shows the pipeline programmed at the switch. At each stage, a state table lookup and update operation is performed, using the Ethernet source address as both lookup and update scope.




Figure 9 - Pipeline used for the BEBA switch performance analysis
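The structure of this test pipeline can be mimicked with a toy model that counts the per-packet state operations. This is only a sketch for reasoning about the amount of work per stage; the class and function names, the packet representation and the placeholder "next state" are assumptions and do not reflect the switch code.

```python
class Stage:
    """One pipeline stage: an optional state table keyed on the Ethernet source."""

    def __init__(self, stateful=False, update=False):
        self.stateful = stateful
        self.update = update
        self.state_table = {}
        self.lookups = 0
        self.updates = 0

    def touch_state(self, pkt):
        if not self.stateful:
            return
        # Lookup scope: Ethernet source address; unknown keys are in state 0.
        state = self.state_table.get(pkt["eth_src"], 0)
        self.lookups += 1
        if self.update:
            # Update scope: Ethernet source address (placeholder next state).
            self.state_table[pkt["eth_src"]] = state + 1
            self.updates += 1


def run_pipeline(stages, pkt):
    """Traverse every stage, then forward port 1 -> 2 and port 2 -> 1."""
    for stage in stages:
        stage.touch_state(pkt)
    return 2 if pkt["in_port"] == 1 else 1


if __name__ == "__main__":
    stages = [Stage(stateful=True, update=True) for _ in range(24)]
    pkt = {"in_port": 1, "eth_src": "00:00:00:00:00:01"}
    print(run_pipeline(stages, pkt), sum(s.lookups for s in stages))
```

In this model the number of state operations per packet grows linearly with the number of stages, which is the quantity the experiments vary.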

3.2 Results

We compared the following test cases:

- OFSS13: unmodified ofsoftswitch13 implementation without BEBA extensions;

- BS (stateless processing): BEBA switch implementation without configuring the stateful flag, i.e. the same processing as OFSS13 is performed;

- BS with state lookup: BEBA switch with all stages configured as stateful, but without using set-state actions in the flow tables, i.e. for each packet only a state lookup operation is performed at each stage;

- BS with state lookup and update: BEBA switch configured to also perform a state update operation at each stage.

For each test case, we performed several experiments varying the number of pipeline stages

from 1 to 24. For each experiment, we configured the client to generate UDP traffic for 60

seconds. We repeated each experiment 5 times.

The experiments were conducted using 3 desktop PCs. The switch was built using an Intel Core 2 Quad processor at 3 GHz with 8 GB of RAM, equipped with an Intel 4-port Gigabit Ethernet adapter (connected through a PCIe v2.0 x16 slot). The same configuration was used for the client and the server, both equipped with a Gigabit Ethernet adapter. To generate traffic we used Iperf, configured to generate a single UDP stream at ~600 Mbps.




Figure 10 - Experimental results

The results obtained are plotted in Figure 10. We performed a total of 960 experiments (4 configurations, 24 different pipeline lengths, 10 repetitions of the same experiment), each time generating traffic for 30 seconds. The values plotted show the mean and standard deviation for each experiment. The baseline is set by OFSS13, which can forward traffic at a rate of up to ~200 Mbps even for longer pipelines. Almost the same result is obtained by BS in the stateless configuration. The interesting case is BS performing stateful operations: in both cases (lookup only, and lookup and update) performance degrades linearly with the number of stages, down to a minimum of ~160 Mbps when using 24 stages. A linearly increasing gap between the two stateful BS configurations can be observed, indicative of the impact of the write operation triggered by the set-state action.

3.3 Strategies for improved performances

The strategies to improve the performance of an SDN switch are to implement the packet processing either in an ASIC/FPGA dataplane or in a dedicated software-based dataplane.

As part of the architecture options, the design can be:

(1) a standalone ASIC dataplane driven by an agent that behaves as a driver of the memories and tables of the switching ASICs. For such designs, when the ASIC is not used, the system cannot work anymore.

(2) a mixed architecture in which the ASIC dataplane is an offload datapath of the software dataplane. For such designs, when the ASIC is not used, the system can keep running.




Since ASICs can come with many limitations due to hardware constraints, option (2) is the preferred one, because it is always possible to design a software fallback for the scenarios that the hardware dataplane cannot support. Moreover, option (2) allows all the control logic to be developed independently of the hardware constraints.

The dataplane based on ASIC logic is described in the previous section.

In the case of a software dataplane, there are two implementation approaches:

(a) a standalone fast dataplane, which is usually a multi-core/multi-threaded dataplane driven by the control plane. This solution requires a single set of data structures to store the forwarding entries.

(b) a mix of a fast path dataplane and a slow path dataplane. In this solution, the slow path dataplane implements all the capabilities of the software stack, while the fast path dataplane implements only the subset of packet processing that is required for most of the packets. Note that this solution usually requires twice as much memory as the previous one, because there is one set of data structures for packet processing in the slow path and a second set for the same packet processing in the fast path.

Design (a) has been the preferred option of the open-source Open vSwitch (OVS) project, which supports I/O drivers based on the DPDK libraries.
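The split of design (b), and the duplicated data structures it implies, can be sketched with a toy model. This is a simplified illustration and not the 6WINDGate or OVS implementation: the class names, the dictionary-based rule tables and the "drop" default are assumptions.

```python
class SlowPath:
    """Full-featured path: holds the complete rule set (first set of data structures)."""

    def __init__(self, rules):
        self.rules = dict(rules)

    def handle_exception(self, flow_key):
        # Packets the fast path cannot handle are resolved here.
        return self.rules.get(flow_key, "drop")


class FastPath:
    """Reduced path: keeps its own copy of the entries it needs (second set)."""

    def __init__(self, slow_path):
        self.slow_path = slow_path
        self.cache = {}
        self.hits = 0
        self.exceptions = 0

    def process(self, flow_key):
        action = self.cache.get(flow_key)
        if action is None:
            # Exception packet: ask the slow path, then synchronize the result.
            self.exceptions += 1
            action = self.slow_path.handle_exception(flow_key)
            self.cache[flow_key] = action
        else:
            self.hits += 1
        return action


if __name__ == "__main__":
    slow = SlowPath({("10.0.0.1", "10.0.0.2"): "output:2"})
    fast = FastPath(slow)
    for _ in range(3):
        fast.process(("10.0.0.1", "10.0.0.2"))
    print(fast.hits, fast.exceptions)
```

Only the first packet of a flow takes the exception route; the entry is then held in both tables, which is where the extra memory of design (b) comes from.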

3.3.1 Slow path and fast path based acceleration

Design (b) is the preferred option of networking companies designing software- and hardware-based telecom and switch routers. It is also the design of 6WINDGate™.




Figure 11 - Slow path + fast path processing (diagram labels: Kernel Stack, Control Plane, Fast Path, Local info, Synchronization modules, Continuous synchronization, Fast path packet, Exception packet)

Currently, as described in the previous section, the prototype implementation of BEBA is done in the udatapath of ofsoftswitch, which corresponds to option (a).

The current bottlenecks of the udatapath of ofsoftswitch are:

- the lack of threading with lockless data structures, which prevents scaling from 1 to multiple CPU cores.

- packets are received and transmitted between the kernel and userland using AF_PACKET sockets, which prevents sustaining more than a few hundred thousand packets per second because of the limitations of system calls and of the kernel’s AF_PACKET design.

- the data structures are not designed to be handled efficiently by the CPU, taking CPU constraints into account (memory alignment, prefetching, CPU cache lines, per-core data structures).

- udatapath is userland only, without a control plane + slow path split (the impact will be explained in the following section).

As with Open vSwitch, DPDK’s librte libraries can be added to the udatapath of ofsoftswitch to handle I/O. Other options, such as netmap, could be used too. However, the industry is getting a lot of benefit from DPDK’s I/O layer, since it has strong support from CPU vendors and from PCI NIC vendors (http://dpdk.org/doc/nics), including support for FPGA-based 100G PCI boards [25]. This option would lead to design (a).

3.3.2 Packet offload APIs – Netlink (RFC3549) and standardization Switchdev

Both the software fast path and the hardware ASIC-based dataplane require information from the slow path in order to be provisioned and then to offload packet processing.

Back in 2003, the industry started to standardize an API for forwarding processing (ForCES) - https://datatracker.ietf.org/wg/forces/documents/ , but it proved too complex to scale, even though it has seen some evolution.

One of the foundations of ForCES was the use of system notifications: PF_ROUTE on *BSD, or Netlink on Linux (RFC3549).
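To give a feel for what such notifications look like on the wire, the sketch below decodes the fixed 16-byte Netlink message header (length, type, flags, sequence number, port ID) in host byte order. The helper names are illustrative assumptions; RTM_NEWLINK and RTM_NEWROUTE carry the values defined in the Linux rtnetlink headers. The buffer in the example is built by hand rather than read from a socket.

```python
import struct

# Netlink message header: u32 length, u16 type, u16 flags, u32 seq, u32 port id.
NLMSG_HDR = struct.Struct("=IHHII")

RTM_NEWLINK = 16   # values from linux/rtnetlink.h
RTM_NEWROUTE = 24


def parse_nlmsghdr(buf, offset=0):
    """Decode one Netlink header starting at the given offset."""
    length, msg_type, flags, seq, pid = NLMSG_HDR.unpack_from(buf, offset)
    return {"len": length, "type": msg_type, "flags": flags, "seq": seq, "pid": pid}


def iter_messages(buf):
    """Yield (header, payload) for each message in a receive buffer."""
    offset = 0
    while offset + NLMSG_HDR.size <= len(buf):
        hdr = parse_nlmsghdr(buf, offset)
        if hdr["len"] < NLMSG_HDR.size:
            break  # malformed header, stop parsing
        payload = buf[offset + NLMSG_HDR.size: offset + hdr["len"]]
        yield hdr, payload
        offset += (hdr["len"] + 3) & ~3  # messages are aligned to 4 bytes


if __name__ == "__main__":
    # Hand-built example buffer: one route message with a 5-byte payload.
    msg = NLMSG_HDR.pack(NLMSG_HDR.size + 5, RTM_NEWROUTE, 0, 1, 0) + b"abcde" + b"\0" * 3
    for hdr, payload in iter_messages(msg):
        print(hdr["type"], payload)
```

A slow path that listens for such messages can mirror each new route or link into the fast path without being linked against the control plane.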

We may over-summarize by saying that a weakness of ForCES was the lack of an open implementation that could have been used to unify all the dataplane providers in the industry.

A new attempt to create a standard dataplane API is based on Switchdev [26]. Switchdev is a young open-source Linux kernel project that aims at providing kernel extensions to support offloading packet processing to switch devices (the fast path), based on the Linux kernel provisioning and states (the slow path).

Compared to programming the switches through their OpenFlow APIs, Switchdev is much closer to an offload of Linux’s own capabilities. Since Linux includes an OpenFlow switch (OVS), Switchdev can become an offload for the switching datapath of OVS.

As described in the following call flow, before Switchdev, using either an OpenFlow model or ForCES, the switches would have been programmed directly, without using the standard Linux environment as a de facto repository and data model. With such a model, the (local) control plane usually has to be linked with the drivers of the dataplanes.




Figure 12 - OpenFlow or ForCES before Switchdev (diagram labels: “ForCES or OpenFlow (ofsoftswitch, OVS, etc.)”; “OpenDaylight, Ryu, etc.”)




Once the de facto standardization of the Switchdev APIs is done, it will allow packet offloading to be supported with a generic model that is independent of the underlying dataplane technologies.

Figure 13 - OpenFlow using Switchdev: driver-independent support (diagram labels: “ForCES or OpenFlow (ofsoftswitch, OVS, etc.)”; “OpenDaylight, Ryu, etc.”)




3.3.3 Packet offload APIs – eBPF and P4 processing

One of the major limits of OpenFlow is that it requires a software, firmware or even hardware update in order to support new actions (BEBA supports new encapsulation protocols such as GENEVE, and new headers in existing protocols such as NSH or GBP for VxLAN).

In order to regain agility in SDN dataplanes, eBPF is being defined as a way to inject arbitrary dataplane actions. eBPF is an extension of BPF with some Just-In-Time CPU optimizations [27].

As demonstrated by ongoing work on OVS [27], eBPF can be combined with OVS to enhance its dataplanes, but it suffers from the performance cost of its generality, since a BPF engine is needed in the datapath (kernel, fast path, ASICs). Most of the CPU cycles (or ASIC clock cycles) of eBPF programs are spent in packet header parsing, so there is an attempt to combine eBPF and P4, with several objectives:

- Transform P4’s OVS needs into eBPF (http://openvswitch.org/support/slides/p4.pdf),

- Have some native P4 blocks (ASIC blocks, C-based fast path/dataplane blocks) that can pre-define some packet processing without the use of eBPF.

These two options for vSwitch dataplanes should allow new SDN features to be created with fair performance (eBPF only), while ultimate performance can be achieved by using the same P4 description hard-wired (hard-coded) into the dataplanes.

3.3.4 Design solutions for BEBA

Currently, BEBA prototypes are based on userland-only extensions of ofsoftswitch, as described previously. This means that the control plane (OpenFlow control message handling) and the slow path (kernel dataplane of packet I/O) are not split into two planes. The use of Switchdev and Netlink, described in the previous sections, requires at minimum support in the Linux kernel, so that the slow path can emit Netlink messages according to the states of the processing rules.

So, the core solution to provide efficient processing for BEBA should start with either a rewrite of ofsoftswitch along with a kernel side, or a rewrite of BEBA into OVS itself, in order to benefit from the OVS design, which already consists of a userland part (control plane) and a kernel datapath (slow path).

Once this port to the OVS kernel datapath is done, it will allow the integration of a fast path that offloads the OVS processing within a DPDK environment.




4 Simple use case deployment with HW/SW prototypes

4.1 HW prototype demonstration

The features of the BEBA PoC described in section 2 have been validated on this prototype

implementing a simple use case. In this section, we describe the selected use case, the

corresponding configuration of the PoC blocks (mainly the memories and the field extractors)

and we show the achieved results.

4.1.1 Simple use case description

The use case we use for the prototype validation is the MAC learning configuration of the BEBA prototype. The switch is configured to read the DST MAC address of the incoming packet and to decide to which output port the packet should be delivered. The output information is read from the state table. In order to implement the learning activity of the switch, for each incoming packet the SRC MAC and the source interface metadata are used to write in the state table the output port learned from the packet. The configuration of the prototype requires configuring the lookup, FSM scope, and update extractors, and programming the TCAMs and RAMs as described in the next section. The lookup is used to read the outgoing port for the incoming packet, while the source interface and the SRC MAC are used to update the state table.
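The behaviour of this use case can be summarized with a minimal software model of the state table. It is only a sketch of the lookup/update logic described above, not of the TCAM/RAM configuration; the class name and the FLOOD marker are illustrative assumptions.

```python
FLOOD = "flood"  # hypothetical marker for the default (flooding) action


class MacLearningSwitch:
    def __init__(self):
        # State table: MAC address -> port on which that address was learned.
        self.state_table = {}

    def process(self, in_port, src_mac, dst_mac):
        # Lookup scope: DST MAC. Unknown destinations are flooded.
        out_port = self.state_table.get(dst_mac, FLOOD)
        # Update scope: SRC MAC. Learn the port the sender is attached to.
        self.state_table[src_mac] = in_port
        return out_port


if __name__ == "__main__":
    sw = MacLearningSwitch()
    print(sw.process(1, "00:00:00:00:00:01", "00:00:00:00:00:02"))  # destination unknown
    print(sw.process(2, "00:00:00:00:00:02", "00:00:00:00:00:01"))  # learned on port 1
```

The first packet is flooded because its destination has not been seen yet; the reply is delivered directly, because the first packet's source was learned on port 1.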

4.1.2 BEBA prototype configuration

The lookup extractor is configured to read the DST MAC address. The FSM scope reads the source interface, and the update extractor selects the SRC MAC. The first TCAM/RAM pair contains only one row, in which we configure the flooding action for any entry not in the hash table. The second TCAM/RAM pair is configured with N² + N rows.

Figure 14 - MAC learning configuration for TCAM/RAM pair




Finally, the update scope extractor is configured to select the SRC MAC that is used to update the content of the state table.

Figure 15 - Screenshot of the MAC learning configuration

Figure 15 presents the debug output of the configuration of the extractors, of the default row of the first TCAM/RAM pair, and of the first 2 rows of the second TCAM/RAM pair. The first debug lines show the configuration of the extractors. Each TCAM row is composed of two lines: the first line represents the bit value, while the second line identifies the don't-care bits. In particular, the bits set to 1 in the second line represent the masked bits.
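The value/mask convention just described (a mask bit set to 1 means "don't care") can be expressed in a few lines. The sketch below is a software illustration of ternary matching under that convention, with hypothetical function names and first-match priority assumed; it does not reproduce the prototype's TCAM.

```python
def tcam_row_matches(key, value, mask):
    """True if key equals value on every bit that is not masked.

    mask bits set to 1 are don't-care bits, as in the debug output.
    """
    return ((key ^ value) & ~mask) == 0


def tcam_lookup(rows, key):
    """Return the index of the first matching (value, mask) row, or None."""
    for index, (value, mask) in enumerate(rows):
        if tcam_row_matches(key, value, mask):
            return index
    return None


if __name__ == "__main__":
    rows = [
        (0xEEEE0001, 0x00000000),  # exact match on the flooding state value
        (0x00000000, 0xFFFFFFFF),  # all bits masked: matches any key
    ]
    print(tcam_lookup(rows, 0xEEEE0001), tcam_lookup(rows, 0x12345678))
```

A row whose mask is all ones, like the second row above, matches every key, which is how a default entry is obtained.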

TCAM 1 is configured with the default ALL-MATCH entry (the second line is an all-1 vector written in hexadecimal form). The default state associated with the default entry is the flooding state EEEE0001. The first row of TCAM 2 is configured to match the state EEEE0001 and the first source interface (the only unmasked bit of the first TCAM 2 row). The second row of TCAM 2 is configured to match the state EEEE0001 and the second source interface. The remaining rows configure the TCAM as described in Figure 14 and are not shown for the sake of space.

4.1.3 Results

The FPGA prototype has been tested with the configured use case. Even though the system is designed to sustain the maximum throughput of 40 Gbit/s, the testbed environment currently in use only supports two 10 GbE interfaces, for a maximum theoretical throughput of 20 Gbit/s. Unfortunately, the current configuration of the PC hosting the Intel GbE cards used for sending packets to the FPGA reaches an actual throughput of only around 6 Gbit/s. The test performed shows that the FPGA was able to correctly deliver the incoming packets regardless of the packet length (assuming a minimum packet length of 50 bytes).

4.2 SW prototype demonstration

4.2.1 Emulation environment description and use case description

The BEBA switch prototype has been integrated in the Linux-based emulation platform Mininet [4]. From the project repository it is possible to download a fully functional image of a Mininet distribution that fetches both the modified ofsoftswitch-based BEBA prototype switch and the Ryu-based BEBA controller implementations.

Once all the required software components are compiled and installed, users can easily deploy network topologies consisting of both BEBA switches and controllers interacting with “legacy” Linux hosts, even in virtualized Linux machines. For example, to launch a simple topology with 2 hosts attached to a single BEBA switch driven by an external BEBA controller, a user executes the following command:

sudo mn --topo single,2 --arp --mac --switch user --controller remote

In the next section we describe how to run a port knocking application, a toy use case already presented in the project DoW and in D2.1, which is particularly useful to better understand the configuration of the BEBA SW prototypes.

Figure 16 shows this simple port knocking emulation environment, in which we have:

1. 2 legacy hosts (terminals “Node h1” and “Node h2”)

2. 1 BEBA switch (terminal “Node s1”)

3. 1 controller (black terminal on the left)

4. 1 terminal for the Mininet command line interface (black terminal on the right)

Figure 16 - Port knocking demo on top of the BEBA mininet emulation environment

4.2.2 Programming the BEBA controller

The demo environment is launched, as briefly described in the previous section, with

“standard” Mininet commands.

The BEBA controller is launched with the following command:

ryu-manager ryu/ryu/app/beba/portknock.py

where portknock.py is the Python code responsible for configuring the single BEBA switch instantiated in this demonstration. The application code is listed below (library imports are omitted):

port_list = [10, 11, 12, 13, 1000]
final_port = port_list[-1]
second_last_port = port_list[-2]

LOG.info("Port knock sequence is %s" % port_list[0:-1])
LOG.info("Final port to open is %s" % port_list[-1])


class OSPortKnocking(app_manager.RyuApp):

    def __init__(self, *args, **kwargs):
        super(OSPortKnocking, self).__init__(*args, **kwargs)

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def switch_features_handler(self, ev):
        msg = ev.msg
        datapath = msg.datapath
        ofp = datapath.ofproto
        parser = datapath.ofproto_parser

        LOG.info("Configuring switch %d..." % datapath.id)

        # Set table 0 as stateful
        req = parser.OFPExpMsgConfigureStatefulTable(
            datapath=datapath, table_id=0, stateful=1)
        datapath.send_msg(req)

        # Set lookup extractor = {ip_src}
        req = parser.OFPExpMsgKeyExtract(
            datapath=datapath, command=ofp.OFPSC_EXP_SET_L_EXTRACTOR,
            fields=[ofp.OXM_OF_IPV4_SRC], table_id=0)
        datapath.send_msg(req)

        # Set update extractor = {ip_src} (same as lookup)
        req = parser.OFPExpMsgKeyExtract(
            datapath=datapath, command=ofp.OFPSC_EXP_SET_U_EXTRACTOR,
            fields=[ofp.OXM_OF_IPV4_SRC], table_id=0)
        datapath.send_msg(req)

        # ARP packets flooding
        match = parser.OFPMatch(eth_type=0x0806)
        actions = [parser.OFPActionOutput(ofp.OFPP_FLOOD)]
        self.add_flow(datapath=datapath, table_id=0, priority=100,
                      match=match, actions=actions)

        # Flow entries for port knocking
        for i in range(len(port_list)):
            match = parser.OFPMatch(eth_type=0x0800, ip_proto=17,
                                    state=i, udp_dst=port_list[i])
            if port_list[i] != final_port and port_list[i] != second_last_port:
                # If state not OPEN, set state and drop (implicit)
                actions = [parser.OFPExpActionSetState(
                    state=i+1, table_id=0, idle_timeout=5)]
            elif port_list[i] == second_last_port:
                # In the transition to the OPEN state, the timeout is set to 10 sec
                actions = [parser.OFPExpActionSetState(
                    state=i+1, table_id=0, idle_timeout=10)]
            else:
                actions = [parser.OFPActionOutput(2)]
            self.add_flow(datapath=datapath, table_id=0, priority=10,
                          match=match, actions=actions)

        # Get back to DEFAULT if wrong knock (UDP match, lowest priority)
        match = parser.OFPMatch(eth_type=0x0800, ip_proto=17)
        actions = [parser.OFPExpActionSetState(state=0, table_id=0)]
        self.add_flow(datapath=datapath, table_id=0, priority=0,
                      match=match, actions=actions)

        # Test port 1300, always forward on port 2
        match = parser.OFPMatch(eth_type=0x0800, ip_proto=17, udp_dst=1300)
        actions = [parser.OFPActionOutput(2)]
        self.add_flow(datapath=datapath, table_id=0, priority=10,
                      match=match, actions=actions)

    def add_flow(self, datapath, table_id, priority, match, actions):
        ofp = datapath.ofproto
        parser = datapath.ofproto_parser
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
        mod = parser.OFPFlowMod(datapath=datapath, table_id=table_id,
                                priority=priority, match=match,
                                instructions=inst)
        datapath.send_msg(mod)




4.2.3 Testing the BEBA application

Figure 16 also shows the actual execution of the port knocking demo. To test the correct execution of the application's finite state machine we use netcat, a simple networking tool for sending data over generic TCP/UDP sockets. The secret sequence (in this case UDP ports 10, 11, 12, 13) is required to open UDP port 1000 on Node h2. To do so, we run the following command sequence on Node h1:

echo -n "*" | nc -q1 -u 10.0.0.2 10

echo -n "*" | nc -q1 -u 10.0.0.2 11

echo -n "*" | nc -q1 -u 10.0.0.2 12

echo -n "*" | nc -q1 -u 10.0.0.2 13

As shown in Figure 16, from this point on UDP port 1000 is open for receiving data.
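The state machine exercised by this knock sequence can be traced with a small stand-alone model of the flow entries installed by the controller. It is a sketch for reasoning about the transitions only: the function name is an assumption, the idle timeouts (5 and 10 seconds) are deliberately ignored, and the per-source-IP state is reduced to a single integer.

```python
port_list = [10, 11, 12, 13, 1000]  # knock sequence followed by the port to open
TEST_PORT = 1300                    # always forwarded, whatever the state


def next_state(state, udp_dst):
    """Return (new_state, forwarded) for one UDP packet of a given source.

    Idle timeouts are not modelled.
    """
    n_knocks = len(port_list) - 1
    if udp_dst == TEST_PORT:
        return state, False or True      # forwarded on port 2, state untouched
    if state < n_knocks and udp_dst == port_list[state]:
        return state + 1, False          # correct knock: advance, packet dropped
    if state == n_knocks and udp_dst == port_list[-1]:
        return state, True               # OPEN state and final port: forward
    return 0, False                      # wrong knock: back to DEFAULT


if __name__ == "__main__":
    state = 0
    for port in (10, 11, 12, 13, 1000):
        state, forwarded = next_state(state, port)
        print(port, state, forwarded)
```

Replaying the four netcat commands followed by a packet to port 1000 walks the state from 0 to 4 and forwards only the last packet; any wrong port along the way resets the state to 0.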




References

[1] ofsoftswitch13 project homepage: https://github.com/CPqD/ofsoftswitch13

[2] Ryu project homepage: http://osrg.github.io/ryu/

[3] Open Networking Foundation. “OpenFlow Switch Specification ver. 1.5.0”. In: Oct 14, 2013.

[4] Mininet project homepage: http://mininet.org

[5] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz, “Forwarding metamorphosis: Fast programmable match-action processing in hardware for sdn,” in ACM SIGCOMM 2013. ACM, 2013, pp. 99–110

[6] “COMBO Product Brief,” http://www.invea-tech.com/data/combo/combo_pb_en.pdf

[7] “Virtex-5 Family Overview,” http://www.xilinx.com/

[8] H. Song, “Protocol-oblivious forwarding: Unleash the power of sdn through a future-proof forwarding plane,” in Proceedings of the Second ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, ser. HotSDN ’13. ACM, 2013, pp. 127–132

[9] B. Jean-Louis, “Using block RAM for high performance read/write TCAMs,” 2012

[10] Z. Ullah, M. Jaiswal, Y. Chan, and R. Cheung, “FPGA Implementation of SRAM-based Ternary Content Addressable Memory,” in IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012

[11] W. Jiang, “Scalable ternary content addressable memory implementation using FPGAs,” in Architectures for Networking and Communications Systems (ANCS), 2013 ACM/IEEE Symposium on. IEEE, 2013, pp. 71–82

[12] K. Pagiamtzis and A. Sheikholeslami, “Content-addressable memory (CAM) circuits and architectures: A tutorial and survey,” IEEE Journal of Solid-State Circuits, vol. 41, no. 3, pp. 712–727, 2006

[13] A. Kirsch, M. Mitzenmacher, and G. Varghese, “Hash-based techniques for high-speed packet processing,” in Algorithms for Next Generation Networks. Springer, 2010, pp. 181–218

[14] M. Ramakrishna, E. Fu, and E. Bahcekapili, “Efficient hardware hashing functions for high performance computers,” Computers, IEEE Transactions on, vol. 46, no. 12, pp. 1378–1381, Dec 1997

[15] J. Naous, D. Erickson, G. A. Covington, G. Appenzeller, and N. McKeown, “Implementing an OpenFlow switch on the NetFPGA platform,” in Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems. ACM, 2008, pp. 1–9

[16] Open Networking Foundation, “OpenFlow Switch Specification ver 1.4,” Tech. Rep., Oct. 2013

[17] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker. P4: Programming protocol-independent packet processors. SIGCOMM Comput. Commun. Rev., 44(3):87–95, July 2014

[18] P4 Language Consortium. P4. http://p4.org/

[19] P4 Language Consortium. P4-HLIR. https://github.com/p4lang/p4-hlir

[20] P4 Language Consortium. P4C-BEHAVIORAL. https://github.com/p4lang/p4c-behavioral

[21] P4 Language Consortium. P4-GRAPHS. https://github.com/p4lang/p4c-graphs

[22] M. Attig and G. Brebner. 400 gb/s programmable packet parsing on a single fpga. In Proceedings of the 2011 ACM/IEEE Seventh Symposium on Architectures for Networking and Communications Systems, ANCS ’11, pages 12–23. IEEE Computer Society, 2011

[23] V. Pus, L. Kekely, and J. Korenek. Low-latency modular packet header parser for fpga. In Proceedings of the Eighth ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ANCS’12, pages 77–78, New York, NY, USA, 2012. ACM

[24] F. Risso and M. Baldi. Netpdl: An extensible xml-based language for packet header description. Comput. Netw., 50(5):688–706, Apr. 2006.

[25] 100G FPGA support with 140Mpps for DPDK based PCI NICs - https://www.linkedin.com/grp/post/4842610-6017768520522158081

[26] Switchdev - Ethernet switch device driver model (switchdev) - https://www.kernel.org/doc/Documentation/networking/switchdev.txt

[27] OVS and eBPF Micro Summit Notes - http://openvswitch.org/pipermail/dev/2014-October/047421.html + eBPF OVS Request For Comments - http://openvswitch.org/pipermail/dev/2014-December/049852.html
