GA 2CASC
Team
Ghaleb Abdulla (0.4) Tina Eliassi-Rad (0.4) Terence Critchlow (0.15)
Task Objective
Identify data storage formats that minimize access times using historical access patterns to the same or similar data sets
Use the spatial and temporal locality that results from data accesses to lay out the data on disk
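As a rough illustration of what spatial and temporal locality mean for an access trace, the sketch below scores a list of disk block numbers. The traces, the function names, and the `max_gap` and `window` parameters are assumptions made for illustration; they are not taken from the project.

```python
def spatial_locality(trace, max_gap=1):
    """Fraction of consecutive accesses whose block numbers differ by at most max_gap."""
    if len(trace) < 2:
        return 0.0
    near = sum(1 for a, b in zip(trace, trace[1:]) if abs(b - a) <= max_gap)
    return near / (len(trace) - 1)


def temporal_locality(trace, window=4):
    """Fraction of accesses that revisit a block seen in the previous `window` accesses."""
    if not trace:
        return 0.0
    hits = 0
    for i, block in enumerate(trace):
        if block in trace[max(0, i - window):i]:
            hits += 1
    return hits / len(trace)


# Hypothetical traces: one sequential scan, one scattered pattern with re-reads.
sequential = [10, 11, 12, 13, 14, 15]
scattered = [10, 500, 37, 10, 902, 37]

print(spatial_locality(sequential), temporal_locality(sequential))
print(spatial_locality(scattered), temporal_locality(scattered))
```

A trace that scores high on spatial locality would favor a contiguous on-disk layout, while a high temporal score would favor caching or replication of the hot blocks.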
Challenges
Data can be accessed with different tools and by different users.
User 1 Tool 1 Data set 1
User m-1 Tool n-1 Data set k-1
User m Tool n Data set k
Enabling Access Pattern Discovery
Application area (astrophysics)
Visualization tool (VisIt)
Analyze history of access patterns on two levels:
- System level: disk references, network overhead, memory usage
- Application level: higher-level commands, user-level info
Enabling Access Pattern Discovery
[Diagram labels: VisIt; Astrophysics; User 1 ... User n; Application Logging; Disk Logging; Log files; Unsupervised Learner (e.g., k-NN, k-means, etc.); Patterns; [Pattern, Hints] training data; Supervised Learner (e.g., neural net, DT, etc.); Hints; Djehuty]
Log file collection
Collect logs at the application and disk level
Managing the log collection process:
- Start and stop collection sensors or agents on demand
- Keep log data in one central place
- Detect any failure in the monitoring agents and restart them
- Preferably work in a distributed environment
JAMM from LBL meets our requirements
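The management requirements above (start and stop on demand, one central log, failure detection and restart) can be sketched with a toy supervisor. This is not JAMM's API: the `Agent` and `Supervisor` classes are hypothetical, everything runs in a single process, and the distributed aspect is left out.

```python
class Agent:
    """Hypothetical monitoring agent that appends readings to a shared central log."""

    def __init__(self, name, central_log):
        self.name = name
        self.central_log = central_log
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def collect(self, reading):
        if not self.running:
            raise RuntimeError(f"agent {self.name} is not running")
        self.central_log.append((self.name, reading))


class Supervisor:
    """Starts and stops agents on demand and restarts any that have failed."""

    def __init__(self):
        self.central_log = []   # all agents write to this one place
        self.agents = {}
        self.wanted = set()     # names of agents that should be running

    def start(self, name):
        if name not in self.agents:
            self.agents[name] = Agent(name, self.central_log)
        self.agents[name].start()
        self.wanted.add(name)

    def stop(self, name):
        self.agents[name].stop()
        self.wanted.discard(name)

    def check(self):
        """Restart agents that should be running but are not; return their names."""
        restarted = []
        for name in sorted(self.wanted):
            if not self.agents[name].running:
                self.agents[name].start()
                restarted.append(name)
        return restarted


sup = Supervisor()
sup.start("disk")
sup.start("application")
sup.agents["disk"].running = False      # simulate a crashed agent
restarted = sup.check()                 # the supervisor notices and restarts it
sup.agents["disk"].collect("block 42 read")
print(restarted, sup.central_log)
```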
JAMM Architecture
What to Collect
Application and user level:
- Open
- Zoom
- Slice
- etc.
System level:
- Network overhead
- Disk block size
- Buffer size
- Disk location, etc.
We need to add our own sensors to collect data
Data format
The DTD for our XML files is as follows:
<!ELEMENT logfile (application+)>
<!ELEMENT application (user+)>
<!ATTLIST application name ID #REQUIRED>
<!ELEMENT user (dataset+)>
<!ATTLIST user name ID #REQUIRED>
<!ELEMENT dataset (session+)>
<!ATTLIST dataset name ID #REQUIRED>
<!ELEMENT session (metadata+)>
<!ATTLIST session time NMTOKENS #REQUIRED>
<!ELEMENT metadata (#PCDATA)>
<!ATTLIST metadata
name ID #REQUIRED
time NMTOKENS #IMPLIED>
Log File Example
<?xml version="1.0" ?>
<!DOCTYPE logfile (View Source for full doctype...)>
<logfile>
<application name="SimTracker">
<user name="Tina">
<dataset name="astro">
<session time="01/11/2002 13:45:00 PST">
<metadata name="access_speed">100K</metadata>
<metadata name="storage_utilization">0</metadata>
<metadata name="cohesion">0</metadata>
<metadata name="fault_tolerance">1</metadata>
<metadata name="num_disks_to_strip">20</metadata>
<metadata name="start_io_device">16</metadata>
<metadata name="stripping_factor" time="01/11/2002 13:55:00">200</metadata>
<metadata name="stripping_unit" time="01/11/2002 13:59:00">64K</metadata>
<metadata name="file_permissions">write</metadata>
<metadata name="access_patterns">random</metadata>
<metadata name="file_size">210M</metadata>
<metadata name="io_buffer_size">128K</metadata>
</session>
</dataset></user></application></logfile>
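A log file with this structure can be flattened into one record per metadata entry before analysis. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the example; the `flatten` function and the record field names are assumptions for illustration. The DOCTYPE line is omitted because it is elided in the slide, and ElementTree does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example log (DOCTYPE omitted, fewer metadata entries).
LOG = """<?xml version="1.0" ?>
<logfile><application name="SimTracker"><user name="Tina"><dataset name="astro">
<session time="01/11/2002 13:45:00 PST">
<metadata name="access_speed">100K</metadata>
<metadata name="stripping_factor" time="01/11/2002 13:55:00">200</metadata>
<metadata name="access_patterns">random</metadata>
</session>
</dataset></user></application></logfile>"""


def flatten(xml_text):
    """Return one dict per <metadata> element, carrying its enclosing context."""
    rows = []
    root = ET.fromstring(xml_text)
    for app in root.findall("application"):
        for user in app.findall("user"):
            for dataset in user.findall("dataset"):
                for session in dataset.findall("session"):
                    for md in session.findall("metadata"):
                        rows.append({
                            "application": app.get("name"),
                            "user": user.get("name"),
                            "dataset": dataset.get("name"),
                            "session_time": session.get("time"),
                            "name": md.get("name"),
                            "time": md.get("time"),  # None when the optional attribute is absent
                            "value": (md.text or "").strip(),
                        })
    return rows


rows = flatten(LOG)
for row in rows:
    print(row["user"], row["dataset"], row["name"], row["value"])
```

Flat records of this kind are easier to turn into feature vectors for a clustering tool than the nested XML.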
Data Analysis
Researched publicly available clustering tools
Narrowed our choice to two:
- CLUTO (University of Minnesota)
- R (GNU)
Testing data processing algorithms on randomly generated log files
Hoping to get real log files in the near future:
- Logging applications
- We are currently looking at the "Flash" log files
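To make the idea of testing on randomly generated log data concrete, here is a minimal sketch that clusters synthetic sessions with a plain k-means. It is a Python stand-in, not CLUTO or R. The two features (I/O buffer size in KB and fraction of random accesses), the generated values, and the farthest-first initialization are all assumptions chosen for illustration.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: dist2(p, centroids[j]))
            for p in points]


def kmeans(points, k, iterations=20):
    # Farthest-first initialization: deterministic, spreads the starting centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    for _ in range(iterations):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign(points, centroids), centroids


# Synthetic sessions: (io_buffer_size in KB, fraction of random accesses).
rng = random.Random(0)
sequential_like = [(rng.gauss(128, 5), rng.gauss(0.1, 0.02)) for _ in range(10)]
random_like = [(rng.gauss(16, 5), rng.gauss(0.9, 0.02)) for _ in range(10)]
sessions = sequential_like + random_like

labels, centroids = kmeans(sessions, k=2)
print(labels)
```

The features are left unscaled, so the buffer size dominates the distance here; with real log data the features would probably need normalization first.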
Questions
This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.
UCRL-MI-xxxxxx