
Page 1

Open Storage Research Infrastructure

HEPiX Spring 2019 - SDSC

Ben Meekhof, University of Michigan Advanced Research Computing, OSiRIS Technical Lead

Page 2

OSiRIS is a pilot project funded by the NSF to evaluate a software-defined storage infrastructure for our primary Michigan research universities and beyond.

Our goal is to provide transparent, high-performance access to the same storage infrastructure from well-connected locations on any of our campuses.

⬝ Leveraging Ceph features such as CRUSH and cache tiers to place data

⬝ NFSv4 (Ganesha) with Krb5 auth and LDAP to map POSIX identity

⬝ Radosgw/S3 behind HAproxy, public and campus local endpoints

⬝ Identity establishment and provisioning of federated users (COmanage)

UM, driven by OSiRIS, recently joined Ceph Foundation: https://ceph.com/foundation

OSiRIS Summary

Page 3

OSiRIS Summary - Structure

Single Ceph cluster (Mimic 13.2.x) spanning UM, WSU, MSU - 792 OSDs, 7 PiB

Network topology store (UNIS) and SDN rules (Flange) managed at IU

NVMe nodes at VAI used for Ceph cache tier (Ganesha NFS export to local cluster)

Page 4

COmanage - Virtual Org Provisioning

When we create a COmanage COU (virtual org):

⬝ Data pools are created

⬝ An RGW placement target is defined, linked to pool cou.Name.rgw

⬝ A CephFS pool is created and added to the filesystem

⬝ A COU directory is created and placed on the CephFS pool

⬝ Default permissions/ownership are set to the COU all-members group, with write permission for the admins group (as a default; can be modified)
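A hedged sketch of the kind of 'ceph' and 'radosgw-admin' commands behind those steps, assuming a COU named MyVO, illustrative pool names/PG counts, and CephFS mounted at /cephfs (not the exact OSiRIS values):

    # create data pools for the new COU
    ceph osd pool create cou.MyVO.rgw.buckets.data 64
    ceph osd pool create cou.MyVO.rgw.buckets.index 8
    ceph osd pool create cou.MyVO.cephfs.data 64

    # define an RGW placement target linked to the COU pools
    radosgw-admin zonegroup placement add --rgw-zonegroup=default --placement-id=cou.MyVO.rgw
    radosgw-admin zone placement add --rgw-zone=default --placement-id=cou.MyVO.rgw \
        --data-pool=cou.MyVO.rgw.buckets.data --index-pool=cou.MyVO.rgw.buckets.index

    # add the CephFS pool to the filesystem, create the COU directory, pin it to that pool
    ceph fs add_data_pool cephfs cou.MyVO.cephfs.data
    mkdir /cephfs/MyVO
    setfattr -n ceph.dir.layout.pool -v cou.MyVO.cephfs.data /cephfs/MyVO

    # group ownership to the all-members group; write access for the admins group
    chgrp cou.MyVO.members /cephfs/MyVO
    chmod 2755 /cephfs/MyVO
    setfacl -m g:cou.MyVO.admins:rwx /cephfs/MyVO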

Page 5

OSiRIS relies on other identity providers to verify users
⬝ InCommon and eduGAIN federations

The first step to using OSiRIS is to authenticate and enroll via COmanage
⬝ Users choose a COU (virtual org)
⬝ Designated virtual org admins can approve new enrollments; OSiRIS admins don't need to be involved in every enrollment

Once enrolled, COmanage can feed information to provisioning plugins
⬝ LDAP and Grouper are core plugins included with COmanage
⬝ We wrote a Ceph provisioner
⬝ The Ceph provisioner runs 'ceph' and 'radosgw-admin' commands directly
⬝ There is no way in the RGW admin API to do placement settings
⬝ There used to be ceph-rest-api, now the ceph-mgr RESTful plugin, but I haven't really looked into whether it offers what we would need (appears not)

OSiRIS Identity Onboarding

Page 6

COmanage - New User Provisioning

Page 7

Virtual Orgs are provisioned from COmanage as Grouper stems

VO admins are given capabilities to create/manage groups under their stem

Groups become Unix group objects in LDAP usable in filesystem permissions

Every COU (VO) has the CO_COU groups available for use by default; COmanage sets membership in these

Grouper - VO Group Self Management

Page 8

COmanage Credential Management

The COmanage Ceph Provisioner plugin provides a user interface to retrieve and manage credentials

Page 9

We want to have a view of usage by virtual org, and also ensure fair usage of available RGW resources (bandwidth, ops)
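The slides don't detail the throttling mechanism; since the S3 endpoints already sit behind HAProxy, one illustrative way to enforce per-client request budgets (not necessarily what OSiRIS runs) is an HAProxy stick table in the radosgw frontend:

    # frontend fragment for radosgw behind HAProxy (names, ports, and limits are illustrative)
    frontend rgw_s3
        bind *:443 ssl crt /etc/haproxy/certs/rgw.pem
        # track per-source HTTP request rate over a 10s window
        stick-table type ip size 100k expire 30s store http_req_rate(10s)
        http-request track-sc0 src
        # reject clients exceeding ~200 requests per 10s
        http-request deny deny_status 429 if { sc_http_req_rate(0) gt 200 }
        default_backend rgw_nodes

    backend rgw_nodes
        balance roundrobin
        server rgw-um  rgw-um.example.org:8080 check
        server rgw-wsu rgw-wsu.example.org:8080 check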

S3 metrics and throttling

Page 10

Pool metrics and other cluster metrics are all gathered and fed to Influxdb by a plugin to ceph-mgr. Originally written by a student here at UM and contributed to Ceph (now greatly modified by others). http://docs.ceph.com/docs/mimic/mgr/influx/
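Getting the plugin running is mostly a matter of enabling the mgr module and pointing it at an InfluxDB host; a minimal sketch (option names follow the mimic mgr/influx docs linked above, host/credentials are placeholders):

    ceph mgr module enable influx
    ceph config set mgr mgr/influx/hostname influxdb.example.org
    ceph config set mgr mgr/influx/database ceph
    ceph config set mgr mgr/influx/username ceph
    ceph config set mgr mgr/influx/password secret
    ceph config set mgr mgr/influx/interval 30     # seconds between reports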

Ceph-mgr influx plugin

(cpu metric from collectd)

Page 11

We isolate our virtual orgs by Ceph data pool so we can gather metrics on usage, throughput, and IOPS at the storage level

Virtual Org Storage Metrics

Page 12

Mounting CephFS directly provides no provision for mapping user identity from external systems (posix numeric uid/gid/grouplist).

This functionality is available in NFSv4 with idmapd, and we are able to provide NFS mounts to our member campuses using their local krb5 realms.
⬝ Setup can be an issue due to complexity and the requirement for client principals
⬝ Used to provide mounts on campus cluster login/transfer nodes where users already obtain krb5 credentials at login
⬝ Also used in the MSAD krb5 environment at the Van Andel site
⬝ Baked-in config: https://hub.docker.com/r/miosiris/nfs-ganesha-ceph
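For context, a trimmed sketch of what a Ganesha export using the Ceph FSAL with krb5 security can look like (export ID, paths, and pseudo root are placeholders; the baked-in container config linked above is the authoritative version):

    EXPORT {
        Export_ID = 20;
        Path = /;                          # CephFS path to export
        Pseudo = /osiris;                  # NFSv4 pseudo-filesystem path
        Access_Type = RW;
        SecType = "krb5", "krb5i", "krb5p";
        Squash = No_Root_Squash;
        FSAL {
            Name = CEPH;
        }
    }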

POSIX ID Mapping

Page 13

Two components:

Map server gid/uid to client user/group

Map krb5 principal to posix uid/gid

NFS ID Mapping Overview

Page 14

We also provide Globus access to CephFS and S3 storage
⬝ For now these are separate endpoints; a future Globus version will support multiple storage connectors
⬝ The Ceph S3 connector is highly scalable - the more raw storage you have, the more it costs!
⬝ The Ceph connector uses the radosgw admin API to look up user credentials and connect to the endpoint URL with them

Credentials: CILogon + globus-gridmap
⬝ We keep the CILogon DN in the LDAP voPerson CoPersonCertificateDN attribute
⬝ For now a script is used to feed the DN to LDAP, but this should become self-service for users (in the COmanage gateway)

A simple Python script generates the globus-gridmap file on our Globus servers
⬝ Anyone want to write a gridmap module that queries the DN directly from LDAP? https://groups.google.com/a/globus.org/forum/#!topic/admin-discuss/8D54FzJzS-o
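The actual generator is a Python script, but the idea fits in a few lines of shell: pull each enrolled user's certificate DN and uid from LDAP and emit grid-mapfile entries. Everything below (LDAP URI, base DN, attribute names) is an assumption for illustration:

    #!/bin/sh
    # sketch: build a globus grid-mapfile from certificate DNs stored in LDAP
    # note: long DNs may be line-wrapped in LDIF output; a robust version should unwrap them
    ldapsearch -LLL -x -H ldaps://ldap.example.org \
        -b "ou=People,dc=osris,dc=org" "(voPersonCertificateDN=*)" uid voPersonCertificateDN |
    awk '/^uid: / { uid = $2 }
         /^voPersonCertificateDN: / { sub(/^voPersonCertificateDN: /, ""); dn = $0 }
         /^$/ && uid && dn { printf "\"%s\" %s\n", dn, uid; uid = dn = "" }
         END { if (uid && dn) printf "\"%s\" %s\n", dn, uid }' \
        > /etc/grid-security/grid-mapfile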

Globus and gridmap

Page 15

We've had issues with network bandwidth inequality: the UM and MSU sites have an 80G link to each other but only 10G to the WSU datacenter
⬝ Adding a new node, or losing enough disks, would completely swamp the 10G link and cause OSD flapping, mon/mds problems, and service disruptions

Lowering recovery tunings fixes the issue, at the expense of under-utilizing our faster links. Recovery sleep had the most effect; the effect of the others was not as clear:

osd_recovery_max_active: 1      # default 3
osd_backfill_scan_min: 8        # default 64
osd_backfill_scan_max: 64       # default 512
osd_recovery_sleep_hybrid: 0.1  # default 0.025
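These can be set in ceph.conf, but they can also be pushed to a live cluster; a sketch using mimic's centralized config and injectargs (same values as above):

    # persist via the mon config database (mimic and later)
    ceph config set osd osd_recovery_max_active 1
    ceph config set osd osd_recovery_sleep_hybrid 0.1
    # or inject into running OSDs without a restart
    ceph tell osd.* injectargs '--osd_backfill_scan_min 8 --osd_backfill_scan_max 64'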

Fighting Network Inequality

Page 16

Speaking of networks... The OSiRIS Network Management Abstraction Layer (NMAL) is a key part of the project, with several important focuses:

⬝ Capturing site topology and routing information from multiple sources: SNMP, LLDP, sflow, SDN controllers, and existing topology and looking-glass services

⬝ Converging on a common scheduled measurement architecture with existing perfSONAR mesh configurations

⬝ Correlating long-term performance measurements with passive metrics collected via other monitoring infrastructure (Check_mk, collectd, Nagios, etc.)

⬝ Applying this information to an SDN-based architecture for traffic routing and traffic shaping / QoS (for example, prioritizing client / cluster service traffic over recovery)

http://www.osris.org/nmal

NMAL work is led by the Indiana University CREST team

Page 17

At Van Andel Institute (Grand Rapids) we deployed 3 NVMe-based nodes
⬝ Part of our cluster, but for cache tier pools only
⬝ Allocated a tier on top of VAI CephFS storage and a local Ganesha NFS server
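For reference, building a tier like that is a handful of commands against the base pool; a sketch with illustrative pool names and settings (not the exact VAI configuration):

    # attach an NVMe pool as a writeback cache tier over a CephFS data pool
    ceph osd tier add cou.VAI.cephfs.data vai-nvme-cache
    ceph osd tier cache-mode vai-nvme-cache writeback
    ceph osd tier set-overlay cou.VAI.cephfs.data vai-nvme-cache
    # sizing/behavior knobs live on the cache pool
    ceph osd pool set vai-nvme-cache hit_set_type bloom
    ceph osd pool set vai-nvme-cache target_max_bytes 10000000000000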

Van Andel - Ceph Cache Tiering

http://www.osris.org/domains/vai.html

Page 18

Another thing we can do to optimize data reads for certain locations is to adjust CRUSH to place the primary (1st) OSD at a site.

Since reads/writes all go to the primary, clients at that site will see a boost to reads (writes are still constrained by latency to the other replicas).

Our 'sites' are all CRUSH buckets, so it's as simple as creating a CRUSH rule that allocates replicas starting with the desired primary site:

rule replicated_primary_wsu {
    id 5
    step take wsu
    step chooseleaf firstn 1 type host
    step emit
    step take msu
    …
}
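Once the edited CRUSH map is compiled and injected, the rule is applied per pool, so only the pools serving that site need to change (pool name illustrative):

    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new
    ceph osd pool set cou.MyVO.cephfs.data crush_rule replicated_primary_wsu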

http://www.osris.org/article/2019/03/01/ceph-osd-site-affinity

CRUSH and Primary OSD

Page 19

We manage everything with Puppet, deployment with Foreman
⬝ foreman-bootdisk for external deployments such as Van Andel
⬝ r10k git environments
⬝ Recently updated to Puppet 5 / Foreman 1.20

We define a site and role (with a sub-role for storage) from the hostname and use these in hiera lookups
⬝ Example: um-stor-nvm01 becomes a Ceph 'stor' node using the devices defined in the 'nvm' nodetype to create OSDs
⬝ site, role, node, and nodetype are hiera tree levels
⬝ At the site level we define things like networks (frontend/backend/mgmt), CRUSH locations, etc.

Ceph deployment and disk provisioning are managed by a Puppet module
⬝ Storage nodes look up Ceph OSD devices in hiera based on a hostname component
⬝ Our module was forked from openstack/puppet-ceph
⬝ Supports all the Ceph daemons, bluestore, and multi-OSD devices
⬝ https://github.com/MI-OSiRIS/puppet-ceph
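A sketch of how that tree could look in hiera.yaml; the custom facts (osiris_site, osiris_role, osiris_nodetype) and paths are assumptions standing in for however the real hierarchy is wired up:

    # hiera.yaml (sketch) - most specific level first
    version: 5
    defaults:
      datadir: data
      data_hash: yaml_data
    hierarchy:
      - name: "Per-node overrides"
        path: "node/%{trusted.certname}.yaml"
      - name: "Nodetype (e.g. nvm -> OSD device list)"
        path: "nodetype/%{facts.osiris_nodetype}.yaml"
      - name: "Role (e.g. stor, mon, rgw)"
        path: "role/%{facts.osiris_role}.yaml"
      - name: "Site (networks, CRUSH location)"
        path: "site/%{facts.osiris_site}.yaml"
      - name: "Common defaults"
        path: "common.yaml"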

Puppet

Page 20

OSiRIS is involved with multiple science domains including medical imaging, biology population studies, ocean tidal studies (NRL), and Physics / HEP

ATLAS Event Service - Storage service based on S3 allowing jobs on any resource to retrieve physics events in small chunks.

⬝ Recently: Testing Globus to move data to OSiRIS S3 (and others)

ATLAS dCache - Ceph can be used as a backend for dCache. With AGLT2, work is underway to provide dCache storage from OSiRIS.

⬝ dCache Book: https://www.dcache.org/manuals/Book-5.0/cookbook-pool.shtml#running-pool-with-ceph-backend

⬝ Some more (old?) background: https://www.dcache.org/manuals/workshop-2017-05-29-Umea/000-Final/tigran-dcache+ceph.pdf

IceCube Neutrino Detector - Interest in using OSiRIS via an object store interface (S3 or directly; not sure yet). Early discussion.

More: http://www.osris.org/domains

OSiRIS and HEP

Page 21

This is the 4th year of OSiRIS, with a potential 5th year TBD by upcoming NSF review

We'd like to get more data on the platform; we have a number of queued-up users and new engagements (IceCube, OSN, Van Andel, more)

We expect to see more utilization of S3 services as a more practical path to working in place on data sets
⬝ Recently deployed Globus S3 connectors (good timing for ATLAS!)

We are almost ready to bring up 100 Gbit links between all sites. Phase-1 will just put our cluster ‘backend’ networks on private peerings between sites and significantly improve replication / recovery capabilities as well as remove this traffic from client network links (cluster_network, public_network in Ceph)

⬝ Links will land directly on OSiRIS TOR switches (Dell Z9100) instead of campus DMZ routes - more control for us, more potential to break things if we peer to campus
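In Ceph terms that phase-1 split is just the two network options; a minimal ceph.conf sketch with placeholder subnets:

    [global]
    # clients stay on the routed campus networks
    public_network  = 198.51.100.0/24
    # replication/recovery/heartbeat traffic moves to the private 100G peerings
    cluster_network = 10.100.0.0/24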

Future

Page 22

Questions?

The End

Page 23

Internet2 COmanage: https://spaces.at.internet2.edu/display/COmanage/Home

Internet2 Grouper: https://www.internet2.edu/products-services/trust-identity/grouper/

OSiRIS CephProvisioner: https://github.com/MI-OSiRIS/comanage-registry/tree/ceph_provisioner/app/AvailablePlugin/CephProvisioner

OSiRIS Docker (Ganesha, NMAL containers): https://hub.docker.com/u/miosiris

OSiRIS Docs: https://www.osris.org/documentation

Ceph Influxdb plugin: http://docs.ceph.com/docs/mimic/mgr/influx/

Reference / Supplemental

Page 24

CPU
– 56 hyper-threaded cores (28 real cores) for 60 disks in one node
– can see 100% usage on those during startup, normally 20-30%

Memory
– 384 GB for 600 TB of disk (about 650 MiB per TB)
– 6.5 GB per 10 TB OSD

4 x NVMe for OSD Bluestore DB (Intel DC P3700 1.6 TB)
– 110 GB per 10 TB OSD (15 per device)
– 11 GB per TB of space
– about 1% of block device space (docs recommend 4%... not realistic for us)

VAI Cache Tier
– 3 nodes, each with 1 x 11 TB Micron Pro 9100 NVMe
– 4 OSDs per NVMe
– 2 x AMD EPYC 7251 2 GHz 8-core, 128 GB

Some Numbers (current hw)

Older hardware has 4 x smaller 400GB or 800GB NVMe - spec’d pre-Bluestore
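For reference, one OSD with its Bluestore DB carved from a shared NVMe device would be created roughly like this (device and VG/LV names are hypothetical; the ~110 GB DB matches the sizing above):

    # carve a ~110 GB DB logical volume from the shared NVMe, then create the OSD
    lvcreate -L 110G -n db-sdb ceph-nvme0
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db ceph-nvme0/db-sdb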

Page 25

ATLAS Event Service

Supplement to ‘heavy’ ATLAS grid infrastructure

Jobs fetch events / store output via S3 URL

Short term compute jobs good for preemptible resources

Page 26

Core Ceph cluster sites share identical config and similar numbers / types of OSD

Any site can be used for S3/RGW access (HAproxy uses RGW backends at each site)

Any site can be used via Globus endpoint for FS or S3

Users at each site can mount an NFS export from Ganesha + the Ceph FSAL. The NFSv4 idmap umich_ldap scheme is used to map POSIX identities.
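A sketch of the client-side idmapd.conf for the umich_ldap method (domain, server, and base DN are placeholders; see the nfs-utils idmapd.conf man page for the full UMICH_SCHEMA options):

    [General]
    Domain = osris.org

    [Translation]
    Method = umich_ldap

    [UMICH_SCHEMA]
    LDAP_server = ldap.example.org
    LDAP_base = dc=osris,dc=org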

Site Overview

Page 27

Example hardware models and details shown in the diagram on the left.

This year’s purchases used R740 head nodes, 10 TB SAS disks, and Intel P3700 PCIe NVMe devices

Site Overview - hardware

Page 28

Cache Tier Benchmarks - RADOS (VAI)

http://www.osris.org/domains/vai.html

Page 29

From SC18: http://www.osris.org/article/2018/11/15/ceph-cache-tiering-demo-at-sc18

Cache Tier Benchmarks - NFS / Iozone

Page 30

NMAL - Topology discovery (viz)

Visualization can also display computed paths through the topology

Page 31

NMAL SDN Deployment - Ryu controller

Ryu SDN framework (https://osrg.github.io/ryu/)

Simple to deploy and integrate with our Python-based tools

Running Ryu in a VM through OVS required some planning to separate the control plane from the data plane, with a common physical LAG on all hosts

Services managed by Puppet

Page 32

NMAL - Bringing topology and monitoring together

Results from regular perfSONAR testing (via the MCA mesh) are exposed via the topology visualization

The goal is to create an OSiRIS monitoring dashboard to quickly analyze and troubleshoot performance issues

Near-term plan:

Integrate passive measurements

Make topology elements dynamically update to highlight current network and device state

Long-term goal:

Integrate analysis engine based on UNIS-RT to perform changepoint detection and reporting