Post on 02-Aug-2020
Providentia Worldwide
S. Ryan Quick @phaedo, Providentia Worldwide. April 2020
HPC Impact
EDA Telemetry Neural Networks
Systems Intelligence
Ecosystem Management
Systems Intelligence Principles
Methodology for leveraging multiple data domains through complex data processing
[Figure: disparate/unlike domains connected through messaging middleware to produce insight.]
• Aggregation
• Event Statistics
• Atomic Pattern Recognition
• Simple example shown as “waterfalling” for illustration; in practice the operations are parallel and stateless
• This pattern illustrates the type and method of telemetry we use for EDA environmental and in-workload collection, feeding AI and neural networks inline
• There are literally thousands of metrics for a single operation, millions per job
Multiple-Domain Simple Data Access
[Figure: Metrics Calculator. Three event sources (CPU Event, App Login Event, DB Access Event) feed three checks:
  • app failed login / app success login * 100 > 3?
  • AVG(cpu waiting / cpu running) / cpu 1m load avg * 100 > 0.5?
  • DB slow queries > 4?
  If all three answer yes: Anomaly Detected: Potential Login Attack.]
Available source fields: app login r/sec, app successful login r/sec, app failed login r/sec, cpu 1m load avg, cpu 5m load avg, cpu 15m load avg, cpu blocked proc cnt, cpu running proc cnt, cpu waiting proc cnt, cpu user %, cpu idle %, cpu system %, cpu io wait %, db active queries, db slow queries, db selects, db updates, db deletes, db rows fetched, db table locks held, db row locks held
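The Metrics Calculator flow can be sketched as one stateless check per aggregation window. This is a minimal sketch, not Providentia's implementation: the `Window` record and the function names are hypothetical, and pairing the thresholds 3, 0.5 and 4 with the login, CPU and DB checks assumes they follow the order in which the figure lists them.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean

# Thresholds read from the figure; pairing them with the checks in order is an assumption.
LOGIN_FAIL_PCT_MAX = 3.0
CPU_WAIT_METRIC_MAX = 0.5
DB_SLOW_QUERIES_MAX = 4


@dataclass
class Window:
    """One aggregation window of CPU, app-login and DB samples (hypothetical shape)."""
    app_failed_login: float   # app failed login r/sec
    app_success_login: float  # app successful login r/sec
    cpu_waiting: list[float]  # cpu waiting proc cnt samples
    cpu_running: list[float]  # cpu running proc cnt samples
    cpu_load_1m: float        # cpu 1m load avg
    db_slow_queries: int      # db slow queries


def failed_login_pct(w: Window) -> float:
    # app failed login / app success login * 100, guarding against division by zero
    if w.app_success_login == 0:
        return float("inf") if w.app_failed_login > 0 else 0.0
    return w.app_failed_login / w.app_success_login * 100


def cpu_wait_metric(w: Window) -> float:
    # AVG(cpu waiting / cpu running) / cpu 1m load avg * 100, skipping samples with no running procs
    ratios = [wt / rn for wt, rn in zip(w.cpu_waiting, w.cpu_running) if rn > 0]
    if not ratios or w.cpu_load_1m == 0:
        return 0.0
    return mean(ratios) / w.cpu_load_1m * 100


def potential_login_attack(w: Window) -> bool:
    # Each check is independent and stateless, so they could run in parallel;
    # the anomaly fires only when all three answer "yes".
    return (
        failed_login_pct(w) > LOGIN_FAIL_PCT_MAX
        and cpu_wait_metric(w) > CPU_WAIT_METRIC_MAX
        and w.db_slow_queries > DB_SLOW_QUERIES_MAX
    )
```

For instance, a window with 5 failed per 100 successful logins, waiting/running counts of [2, 4]/[8, 8], a 1-minute load of 20 and 7 slow queries yields 5%, 1.875 and 7, so all three checks answer yes.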
• Affinity + Simple Case
• Stream + Augmented Datasource
• Parallel Stream
• Frequency-Shifted Stream
• “Correlative/Normalized View”: Similar to a SQL “join” concept, we relate data fields in disparate stream sources
• Many examples — for other talks :)
• This illustrates the mechanisms by which we can combine and augment data types for complex events in AI/neural networks and utilize inline training and active models.
• Also allows us to introduce the notion of insight, which is crucial to the incremental-improvement model, especially for “slight touch ecosystems” like coral reefs
Multiple-Domain Complex Event Processing Approaches
[Figure: Complex Event Processor. CPU Source, Zookeeper Source, RabbitMQ Source and Application Event Source feed Parallel Source and Disparate Normalization stages, yielding three Correlative/Normalized Views.]
Zookeeper fields: approx-data-sz, avg-latency, ephemeral-count, followers, max-fd-cnt, max-latency, min-latency, open-fd-cnt, num-alive-connections, outstanding-requests, packets-received, packets-sent, pending-syncs, synced-followers, watch-cnt, znode-cnt
RabbitMQ fields: message total, message ready, message unacked, rate.publish, rate.deliver, rate.redeliver, rate.confirm, rate.ack, connection.total, connection.idle, channel.total, channel.publisher, channel.consumer, channel.duplex, channel.inactive, exchange.rate.phaedo, q.total, q.idle, q.messages.phaedo, q.consumers.phaedo, q.memory.phaedo, q.ingress.phaedo, q.egress.phaedo, binding.total
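The “Correlative/Normalized View” relates fields from disparate streams much as a SQL join relates tables. A minimal sketch, assuming the join key is a fixed time window and that each source emits `(timestamp, fields)` samples; the function name, the sample shape and the inner-join semantics are assumptions, not the talk's design.

```python
from __future__ import annotations

from collections import defaultdict


def normalized_view(streams: dict[str, list[tuple[float, dict]]], window_s: float) -> dict[int, dict]:
    """Join fields from disparate streams on a time-window key.

    streams maps a source name (e.g. "cpu", "rabbitmq") to (timestamp, fields) samples.
    Field names are prefixed with the source name so identical keys do not collide.
    Only windows in which every source reported are kept (inner-join semantics).
    Within a window, a later sample overwrites an earlier one (last value wins).
    """
    buckets: dict[int, dict] = defaultdict(dict)
    seen: dict[int, set] = defaultdict(set)
    for source, samples in streams.items():
        for ts, fields in sorted(samples, key=lambda s: s[0]):
            key = int(ts // window_s)
            seen[key].add(source)
            for name, value in fields.items():
                buckets[key][f"{source}.{name}"] = value
    sources = set(streams)
    return {k: buckets[k] for k in sorted(seen) if seen[k] == sources}
```

With a 10 s window, CPU samples at 0.5 s and 1.5 s, a RabbitMQ sample at 3.0 s and a Zookeeper sample at 4.0 s all land in window 0 and form one joined row; a window where only some sources reported is dropped.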
Semiconductor EDA: Designing the Digital Future
HPC HTC
• “High Throughput Computing”
• Very predictable, common engineering pipeline
• The toolset is geared to repeat the pattern's steps hundreds or thousands of times per iteration, per engineer, constantly. Each adjustment cascades into hundreds or thousands of small jobs.
• Jobs are very short-lived: average time on a single core is under 3 s. The job scheduler itself is often a bottleneck on large, shared systems.
• EDA requires multiple phases of HDL synthesizers and HLL compilers, so different phases of the pipeline, and different engineering design choices, can produce different kinds of computational bottleneck.
EDA Characteristics
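Why the scheduler becomes the bottleneck for sub-3 s jobs follows from Little's law: the average number of busy cores equals the job-start rate times the time each job holds a core. The dispatch rates used below are hypothetical illustrations, not measurements from the talk.

```python
def max_busy_cores(dispatch_rate_per_s: float, avg_job_s: float) -> float:
    """Little's law: average jobs in service = start (dispatch) rate x time in service.

    If a scheduler can start at most `dispatch_rate_per_s` jobs per second and each job
    occupies one core for `avg_job_s` seconds, no more than this many cores stay busy
    on average, regardless of cluster size.
    """
    return dispatch_rate_per_s * avg_job_s


def utilization(cluster_cores: int, dispatch_rate_per_s: float, avg_job_s: float) -> float:
    # Fraction of the cluster the scheduler can keep busy, capped at 1.0.
    return min(1.0, max_busy_cores(dispatch_rate_per_s, avg_job_s) / cluster_cores)
```

For example, a scheduler starting 200 jobs/s with 3 s jobs keeps about 600 cores busy, so `utilization(10_000, 200, 3)` is 0.06 on a 10,000-core cluster, while the same scheduler with 60 s jobs could saturate it.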
Well-established Sector
• Traditional enterprise storage (NFS3)
• 10-100M small (<=1M files/dir)
• User- and group-based access controls
• POSIX, locking not required
• OS scheduler is often sufficient. Sometimes job submission is separated onto a login node.
• License model is well understood, generally per-core or time-based. Codes are generally proprietary.
• Turnkey deployment is up and running in minutes on a system of nearly any size. Very little motivation to alter the status quo.
EDA Characteristics
What Would It Take to Try Something New?
• All on-prem; cloud tests were successful but not adopted because of:
• too costly
• intellectual property concerns
• ROI delayed
• data management difficulties
• Storage enhancements show improvements, and large shops adopt them, but NFS3 performs well for most small-to-medium practitioners.
EDA Environments
What Would It Take to Try Something New?
• The EDA process is well known, easy to hire for, and well understood in the industry. Why rock the boat?
• Any perturbations to the system would need to overcome the cost of change, which in semiconductor fabrication can be immense.
• Even where bottlenecks are known (storage, compute, scheduling), they are understood and manageable. New is new and unpredictable with unknown value…
EDA Pipelines at Scale?
For valuable and motivational change in semiconductor EDA, we need disruption both in behavior and environment simultaneously.
External focus for HTC/Systems Intelligence
• Two primary mechanisms for augmenting the EDA process:
  • Internally (inside the EDA pipeline).
  • Externally (augmenting and enhancing the pipeline environment).
• For this project we are focusing on the external mechanism, but the usual neural-network caveats apply.
Neural Networks for EDA Pipelines
[Figure: EDA workflow, infrastructure automation, and the Systems Intelligence messaging substrate.
Semiconductor Electronic Design Automation («precondition» API to workflow data): Chip Specification, Design Entry/Functional Verification, RTL Synthesis, Partitioning of Chip, Design for Test (DFT) Insertion, Floor Planning, Placement Stage, Clock Tree Synthesis (CTS), Routing Stage, Final Verification, GDS II.
Infrastructure Automation («precondition» API to all components; «precondition» API backwards compatible): Systems Provisioning, Network Provisioning, Application Deployment, Configuration Management, Platform Management, Change Orchestration.
Capabilities: User/group file CRUD, Workflow scheduling, Job management, License management.
Sequence diagram “Systems Intelligence: EDA Messaging Substrate”: Ingest, CEP, Data Analytics (inline and offline models), Atomic Pattern Recognition, Parallel Stream, Stream Augmentation, Frequency-Shifted Streams, Affinity Streams, Aggregation/Statistics, Command & Control; flows labeled data, scores/metrics, decisioning, orchestration, validation, feedback; augmentation points marked Internal and External.]
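The messaging-substrate diagram labels a loop of data, scores/metrics, decisioning, orchestration, validation and feedback. Below is a minimal in-process sketch of that loop as publish/subscribe topics; the `Substrate` class, topic names, queue-wait score and threshold rule are all hypothetical, and a real deployment would run over actual messaging middleware rather than direct function calls.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class Substrate:
    """Minimal in-process pub/sub bus standing in for the SI messaging substrate (hypothetical)."""

    def __init__(self) -> None:
        self._subs: dict[str, list[Handler]] = defaultdict(list)
        self.log: list[str] = []  # topic trace, useful for inspecting the flow

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, msg: dict) -> None:
        self.log.append(topic)
        for handler in self._subs[topic]:
            handler(msg)


bus = Substrate()
state = {"threshold": 2.0}  # hypothetical tunable adjusted by the feedback stage


def cep(msg: dict) -> None:
    # Score an ingested job sample: queue wait relative to runtime (assumed metric).
    bus.publish("scores", {"job": msg["job"], "score": msg["queue_wait_s"] / msg["runtime_s"]})


def decide(msg: dict) -> None:
    # Decisioning: act only when the score exceeds the current threshold.
    bus.publish("decisions", {**msg, "act": msg["score"] > state["threshold"]})


def orchestrate(msg: dict) -> None:
    if msg["act"]:
        # A real system would call infrastructure-automation APIs here.
        bus.publish("validation", msg)


def validate(msg: dict) -> None:
    # Placeholder: a real validator would re-measure after the change.
    bus.publish("feedback", {**msg, "improved": True})


def feedback(msg: dict) -> None:
    # Insight: lower the threshold (act more readily) when actions help, raise it when they do not.
    state["threshold"] *= 0.95 if msg["improved"] else 1.05


for topic, fn in [("ingest", cep), ("scores", decide), ("decisions", orchestrate),
                  ("validation", validate), ("feedback", feedback)]:
    bus.subscribe(topic, fn)
```

Publishing `{"job": "j1", "queue_wait_s": 9.0, "runtime_s": 3.0}` to `"ingest"` scores 3.0, crosses the 2.0 threshold, and walks the whole loop, after which feedback has nudged the threshold to 1.9.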
“When we think of sensing technologies as devices that order the world, rather than devices that describe it, then alternative relationships between the social and the technical are strikingly brought to light.”
— Genevieve Bell (Intel) @feraldata
EDA Workflow and Supporting Infrastructure: SI Messaging
[Figure: the EDA workflow, infrastructure-automation and capabilities diagram together with the Systems Intelligence messaging-substrate sequence diagram, labeled “External Capabilities and Infrastructure” and “EDA SI Messaging Substrate”, with two “Insight” labels.]
EDA Workflow and AI/NN Frameworks
[Figure: the EDA workflow and infrastructure diagram labeled “External Capabilities and Infrastructure”, the “SI Messaging Substrate”, and an “EDA ML / AI / NN Workflow”.
Sequence diagram “Messaging-Based Machine Learning / AI / Neural Networks Workflow”: CEP/Ingest from Existing Datasources; Data Analytics and Normalization; Reactive Systems; inline learning models (Clustering, Classification, Decision Trees); Offline / replay learning models (Model Run, Model Training); flows labeled scoring/metrics, decisioning, orchestration, validation, feedback.
Insight Consumers: Ecosystem Insight and KPI Enhancements; Ecosystem Messaging Platform Pattern Enhancements. Three “Insight” labels.]
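The inline learning models in this workflow (clustering, classification, decision trees) train on the live stream instead of an offline dataset. As one illustration, here is sequential (MacQueen-style) k-means, a standard online clustering update chosen as a stand-in; the talk does not say which algorithms are used, and `OnlineKMeans` is a hypothetical name.

```python
from __future__ import annotations

import math


class OnlineKMeans:
    """Sequential (MacQueen-style) k-means: centroids update one sample at a time.

    Stands in for an inline learning model that trains on the live stream without
    an offline training set. Seeding from the first k samples is a simplification.
    """

    def __init__(self, k: int) -> None:
        self.k = k
        self.centroids: list[list[float]] = []
        self.counts: list[int] = []

    def predict(self, x: list[float]) -> int:
        # Index of the nearest centroid by Euclidean distance.
        return min(range(len(self.centroids)), key=lambda i: math.dist(x, self.centroids[i]))

    def update(self, x: list[float]) -> int:
        if len(self.centroids) < self.k:  # seed with the first k samples
            self.centroids.append(list(x))
            self.counts.append(1)
            return len(self.centroids) - 1
        i = self.predict(x)
        self.counts[i] += 1
        eta = 1.0 / self.counts[i]  # running-mean learning rate
        self.centroids[i] = [c + eta * (xi - c) for c, xi in zip(self.centroids[i], x)]
        return i
```

Feeding `[0, 0]`, `[10, 10]`, `[0, 2]`, `[10, 12]` seeds two centroids and then moves them to `[0.0, 1.0]` and `[10.0, 11.0]`; each sample is seen once and never stored.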
Unique Position for AI and NN
Why Artificial Intelligence/Neural Networks for This Problem?
• Small, incremental human-driven changes are not cost-effective in today’s DevOps systems
• Continuous observation for “minority report”-style changes makes it difficult to design sprints and test efficacy, and even harder to measure ROI
• Command and control systems can be designed to allow incremental change directly from NNs based on deployments — e.g. allow each “reef” to tune itself based on its own ecosystem
• In EDA, the “show your work”/“show your rationale” problems weigh less against delivering results than they do in other domains
Insight: “looking inward”
Insight provides a mechanism for self-tuning behavior of the running system at all levels: algorithms, models, data access, expert systems, KPIs, behaviors, reports, accuracy, efficiency, even insight itself.
• In-built feedback mechanism for capturing behavior and performance
• Mechanism to ensure that changes over time are accounted for and noticed, if not understood
• Allows for inline and ongoing training without having to maintain offline (and outdated) training datasets
• Allows for locale-specific NN training (the NN-locale problem)
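One concrete way to ensure that changes over time are noticed, if not understood, is a change detector on a model's running error. The sketch below uses the Page-Hinkley test, a standard drift-detection technique swapped in for illustration; the talk does not name a method, and the default `delta` and `threshold` values are arbitrary.

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream (e.g. model error)."""

    def __init__(self, delta: float = 0.005, threshold: float = 50.0) -> None:
        self.delta = delta          # magnitude of change tolerated without alarm
        self.threshold = threshold  # alarm threshold (lambda)
        self.n = 0
        self.mean = 0.0             # running mean of the stream
        self.cum = 0.0              # m_t: cumulative deviation from the running mean
        self.cum_min = 0.0          # M_t: minimum of m_t seen so far

    def update(self, x: float) -> bool:
        # Returns True when the cumulative deviation rises more than `threshold` above its minimum.
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold
```

With `threshold=5.0`, a stream of 100 error values at 0.1 raises no alarm, and a jump to 1.0 is flagged within a handful of samples; a real system would then reset the detector and trigger retraining or model promotion.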
Program Status
Where Are We Now?
• Telemetry data from workload systems feeding messaging platform
• Synthetic workload (provided from partner benchmarking suite) being modified for user-emulation
• NN-specific topology choice and models are under discussion with the wider team, since we will need simultaneous learning, model promotion, results propagation, etc.
• Insight mechanisms are developed in the messaging substrate automatically, with common APIs available to higher-level structures. Common reporting in dashboards, etc.
• Always looking for helpers to take things farther — will report more later as we (un)shelter…