computing at cdf ➢ introduction ➢ computing requirements ➢ central analysis farm ➢...
TRANSCRIPT
![Page 1: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/1.jpg)
Computing at CDF
➢ Introduction
➢ Computing requirements
➢ Central Analysis Farm
➢ Conclusions
Frank WurthweinMIT/FNAL-CD
for the CDF Collaboration
![Page 2: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/2.jpg)
CDF in a Nutshell
Frank Wurthwein/MIT LCCWS '02
➢ CDF + D0 experiments analyze pp collisions from Tevatron at Fermilab➢ Tevatron highest energy collider in world ( TeV) until LHC➢ Run I (1992-1996) huge success 200+ papers (t quark discovery, ...)➢ Run II (March 2001-) upgrades for luminosity (10) + energy (~10%) expect integrated luminosity 20 (Run IIa) and 150 (Run IIb) of Run I
s=2
Run II physics goals:➢Search for Higgs boson➢Top quark properties (m
t,
tot, ...)
➢Electroweak (mW,
W, ZZ, ...)
➢Search for new physics (e.g. SUSY)➢QCD at large Q2 (jets,
s, ...)
➢CKM tests in b hadron decays
![Page 3: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/3.jpg)
CDF RunII Collaboration
LCCWS'02
Goal: Provide computing resources for 200+ collaborators simultaneously doing analysis per day!
Frank Wurthwein/MIT
![Page 4: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/4.jpg)
Frank Wurthwein/MIT LCCWS'02
CDF
Level 3 TriggerProduction Farm
Central Analysis Facility(CAF)
UserDesktops
RoboticTape Storage
CDF DAQ/Analysis Flow
Data Analysis7MHz beam Xing
0.75 Million channels
300 Hz
L1
L2
75 Hz
20 MB/s
Recon
Read/write
Data
MC
![Page 5: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/5.jpg)
Reconstruction Farms
Frank Wurthwein/MIT LCCWS'02
Data reconstruction + validation, Monte Carlo generation154 dual P3's (equivalent to 244 1 Ghz machines)Job management:
➢Batch system FBSNG developed at FNAL➢Single executable, validated offline
150 Million events
![Page 6: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/6.jpg)
Data Handling
Frank Wurthwein/MIT LCCWS'02
Data archived using STK 9940 drives and tape robot
Enstore: Network-attached tape system developed at FNALprovides interface layer for staging data from tape
5 TB/day 100 TB
Today: 176TB on tape
![Page 7: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/7.jpg)
Database Usage at CDF
Frank Wurthwein/MIT LCCWS'02
Oracle DB: Metadata + Calibrations
DB Hardware:
➢2 Sun E4500 Duals
➢1 Linux Quad
Presently evaluating:
➢MySQL
➢Replication to remote sites
➢Oracle9 streams, failover, load balance
![Page 8: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/8.jpg)
Data/Software Characteristics
LCCWS'02
Data Characteristics:➢Root I/O sequential for raw data: ~250 kB/event➢Root I/O multi-branch for reco data: 50-100 kB/event➢'Standard' ntuple: 5-10 kB/event➢Typical RunIIa secondary dataset size: 107 events
Analysis Software:➢Typical analysis jobs run @ 5 Hz on 1 GHz P3
few MB/sec➢CPU rather than I/O bound (FastEthernet)
![Page 9: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/9.jpg)
Computing Requirements
LCCWS'02
Requirements set by goal:200 simultaneous users to analyze secondary data set (107 evts) in a day
Need ~700 TB of disk and ~5 THz of CPU by end of FY'05:need lots of disk need cheap disk IDE Raidneed lots of CPUcommodity CPU dual Intel/AMD
![Page 10: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/10.jpg)
Past CAF Computing Model
LCCWS'02
Very expensive to expand and maintain
Bottom line: Not enough 'bang for the buck'
![Page 11: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/11.jpg)
Design Challenges
LCCWS'02
develop/debug interactively @ remote desktopcode management & rootd
Send binary & 'sandbox' for execution on CAFkerberized gatekeeper
no user accounts on cluster BUT
user access to scratch space with quotascreative use of kerberos
![Page 12: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/12.jpg)
CAF Architecture
LCCWS'02
Users are able to:➢submit jobs➢monitor job progress➢retrieve output
from 'any desktop' in the world
![Page 13: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/13.jpg)
CAF Milestones
11/01
2/25/02
3/6/02
4/25/02
5/30/02
➢ Start of CAF design
➢ CAF prototype (protoCAF)assembled
➢ Fully-functional prototype system (>99% job success)
➢ ProtoCAF integrated into Stage1 system
➢ Production Stage1 CAF forcollaboration
Stage1
ProtoCAF
Design Production system in 6 months!
LCCWS'02
![Page 14: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/14.jpg)
CAF Stage 1 Hardware
LCCWS'02
Worker Nodes
File Servers
Code Server
Linux 8-ways(interactive)
![Page 15: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/15.jpg)
Stage 1 Hardware: Workers
LCCWS'02
Workers (132 CPUs, 1U+2U rackmount):
16 2U Dual Athelon 1.6GHz / 512MB RAM
50 1U/2U Dual P3 1.26GHz / 2GB RAMFE (11 MB/s) / 80GB job scratch each
![Page 16: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/16.jpg)
Stage 1 Hardware: Servers
LCCWS'02
Servers (35TB total, 16 4U rackmount):2.2TB useable IDE RAID50 hot-swapDual P3 1.4GHz / 2GB RAMSysKonnect 9843 Gigabt Ethernet card
![Page 17: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/17.jpg)
File Server Performance
Server/Client Performance: Up to 200MB/s local reads, 70 MB/s NFS
Data Integrity tests: md5sum of local reads/writes under heavy load BER read/write = 1.1+- 0.8 10-15 / 1.0+- 0.3 10-13
Cooling tests: Temp profile of disks w/ IR gun after extended disk thrashing
LCCWS'02
200 MB/s
60 MB/s
70 MB/s
![Page 18: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/18.jpg)
Stage2 Hardware
LCCWS'02
Worker nodes:238 Dual Athlon MP2000+, 1U rackmount
1 THz of CPU power
File servers:76 systems, 4U rackmount, dual red. Power supply14 WD180GB in 2 RAID5 on 3ware 7500-82 WD40GB in RAID1 on 3ware 7000-21 GigE Syskonnect 9843Dual P3 1.4GHz
150 TB disk cache
![Page 19: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/19.jpg)
Stage1 Data Access
LCCWS'02
Static files on disk:
NFS mounted to worker nodesremote file access via rootd
Dynamic disk cache:
dCache in front of Enstore robot
![Page 20: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/20.jpg)
Problems & Issues
LCCWS'02
Resource overloading: ➢DB meltdown dedicated replica, startup delays➢Rcp overload replaced with fcp➢Rootd overload replaced with NFS,dCache➢File server overload scatter data randomly
System issues:➢Memory problems improved burn-in for next time➢Bit error during rcp checksum after copy➢dCache filesystem issues xfs & direct I/O
![Page 21: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/21.jpg)
Lessons Learned
LCCWS'02
➢Expertise in FNAL-CD is essential.
➢Well organized code management is crucial.
➢Independent commissioning of data handling and job processing 3 ways of getting data to application.
![Page 22: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/22.jpg)
CAF: User Perspective
Job Related:➢Submit jobs➢Check progress of job➢Kill a job
Remote file system access:➢ 'ls' in job's 'relative path'➢ 'ls' in a CAF node's absolute path➢ tail' of any file in job's 'relative path'
LCCWS'02
![Page 23: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/23.jpg)
CAF Software
LCCWS'02
![Page 24: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/24.jpg)
CAF User Interface➢ Compile, build, debug analysis job on 'desktop'
➢ Fill in appropriate fields & submit job
➢ Retrieve output using kerberized FTP tools... or write output directly to 'desktop'!
section integer range
user exe+tcl directoryoutput destination
LCCWS'02
![Page 25: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/25.jpg)
Each user a different queue
Process type for job length
test: 5 minsshort: 2
hrsmedium: 6 hrslong: 2
days
This example:1 job 11
sections (+ 1 additional section automatic for job cleanup)
Web Monitoring of User Queues
LCCWS'02
![Page 26: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/26.jpg)
Monitoring jobs in your queue
LCCWS'02
![Page 27: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/27.jpg)
Monitoring sections of your job
LCCWS'02
![Page 28: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/28.jpg)
LCCWS'02
CAF in active use by CDF collaboration
➢300 CAF Users (queues) to date➢Several dozen simultaneous users in
a typical 24 hr period
CAF Utilization
![Page 29: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/29.jpg)
CAF System Monitoring
LCCWS'02
Round-robinDatabase(RRD)
![Page 30: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/30.jpg)
LCCWS'02
CPU Utilization
CAF utilization steadily rising since opened to collaboration
Provided 10-fold increase in analysis resources for last summer physics conferences
Need for more CPU for winter
Week
3 monthsDay
![Page 31: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/31.jpg)
LCCWS'02
File Server
Worker Node
Aggregate I/O 4-8TB/day
Aggregate I/O
Data Processing
Average I/O1-2MB/sec @ ~80%CPU util.
3 months
1 week
![Page 32: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/32.jpg)
Work in Progress
LCCWS'02
Stage2 upgrade: 1THz CPU & 150TB disk
SAM framework for global data handling/distribution
''DCAF'' remote ''replicas'' of CAF
Central login pool @ FNAL
![Page 33: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/33.jpg)
CAF Summary
LCCWS'02
Distributed Desk-to-Farm Computing Model
Production system under heavy use:➢Single farm at FNAL
4-8TB/day processed by user applicationsAverage CPU utilization of 80%
➢Many users all over the world300 total userstypical: 30 users per day share 130 CPU'sRegularly several 1000 jobs queued
➢Connected to tape via large cache➢Currently updating to 1THz & 150TB
![Page 34: Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions Frank Wurthwein MIT/FNAL-CD for the CDF Collaboration](https://reader036.vdocument.in/reader036/viewer/2022081520/56649eff5503460f94c14e70/html5/thumbnails/34.jpg)
CDF Summary
LCCWS'02
Variety of computing systems deployed:
➢Single app. Farms: Online & Offline
➢Multiple app. Farm: user analysis farm
➢Expecting 1.7Petabyte tape archive by FY05
➢Expecting 700TB disk cache by FY05
➢Expecting 5THz of CPU by FY05
➢Oracle DB cluster with loadavg & failover for metadata.