GA 2CASC

Team

Ghaleb Abdulla (0.4)

Tina Eliassi-Rad (0.4)

Terence Critchlow (0.15)

Task Objective

Identify data storage formats that minimize access times using historical access patterns to the same or similar data sets

Use the spatial and temporal locality that results from data accesses to lay out the data on disk

Challenges

Data can be accessed:

—using different tools,

—by different users.

[Diagram: User 1 ... User m; Tool 1 ... Tool n; Data set 1 ... Data set k]

Enabling Access Pattern Discovery

Application area (astrophysics)

Visualization tool (VisIt)

Analyze history of access patterns on two levels:

—System level
  –Disk references
  –Network overhead
  –Memory usage

—Application level
  –Higher-level commands
  –User-level info

Enabling Access Pattern Discovery

[Diagram. Labels: VisIt; Astrophysics; User 1 ... User n; Application Logging; Disk Logging; Log files; Unsupervised Learner (e.g., k-NN, k-means, etc.); Patterns; [Pattern, Hints] training data; Supervised Learner (e.g., neural net, DT, etc.); Hints; Djehuty]

Log file collection

Collect logs at the application and disk level

Managing the log collection process:

—Start and stop collection sensors or agents based on demand

—Keep log data in one central place

—Detect any failure in the monitoring agents and restart them

—Preferably work in a distributed environment

JAMM from LBL meets our requirements
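
The management requirements listed above (start and stop on demand, one central log, restart on failure) can be illustrated with a minimal in-process sketch. This is not JAMM's API: the `Agent` and `Manager` classes, their method names, and the `disk_sensor` name are all hypothetical, and a real agent would wrap a process on a remote host rather than a boolean flag.

```python
class Agent:
    """Hypothetical stand-in for a monitoring sensor or agent."""

    def __init__(self, name):
        self.name = name
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


class Manager:
    """Hypothetical manager: starts/stops agents on demand, restarts failed ones."""

    def __init__(self):
        self.agents = {}       # agent name -> Agent
        self.wanted = set()    # names of agents that should be running
        self.central_log = []  # single central record of (agent, event)

    def request(self, name):
        """Start collection for `name` when someone asks for it."""
        agent = self.agents.setdefault(name, Agent(name))
        self.wanted.add(name)
        if not agent.running:
            agent.start()
            self.central_log.append((name, "started"))

    def release(self, name):
        """Stop collection for `name` when it is no longer needed."""
        self.wanted.discard(name)
        agent = self.agents.get(name)
        if agent is not None and agent.running:
            agent.stop()
            self.central_log.append((name, "stopped"))

    def check(self):
        """Restart any wanted agent that is not running; return their names."""
        restarted = []
        for name in sorted(self.wanted):
            agent = self.agents[name]
            if not agent.running:
                agent.start()
                self.central_log.append((name, "restarted"))
                restarted.append(name)
        return restarted


manager = Manager()
manager.request("disk_sensor")
manager.agents["disk_sensor"].running = False  # simulate an agent failure
print(manager.check())
print(manager.central_log)
```

In a distributed setting, `check` would presumably be driven by a periodic heartbeat rather than called by hand.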

JAMM Architecture

What to Collect

Application and user level:

—Open

—Zoom

—Slice

—etc.

System level:

—Network overhead

—Disk block size

—Buffer size

—Disk location, etc.

We need to add our own sensors to collect data

Data format

The DTD for our XML files is as follows:

<!ELEMENT logfile (application+)>

<!ELEMENT application (user+)>

<!ATTLIST application name ID #REQUIRED>

<!ELEMENT user (dataset+)>

<!ATTLIST user name ID #REQUIRED>

<!ELEMENT dataset (session+)>

<!ATTLIST dataset name ID #REQUIRED>

<!ELEMENT session (metadata+)>

<!ATTLIST session time NMTOKENS #REQUIRED>

<!ELEMENT metadata (#PCDATA)>

<!ATTLIST metadata

name ID #REQUIRED

time NMTOKENS #IMPLIED>
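
As a rough illustration of the nesting this DTD describes (logfile, application, user, dataset, session, metadata), the following sketch builds a small log with Python's standard `xml.etree.ElementTree`. The helper name `build_logfile` is hypothetical and the values are illustrative. The sketch does not validate against the DTD; a validating parser would be stricter, since, for example, `NMTOKENS` values may not contain `/`.

```python
import xml.etree.ElementTree as ET


def build_logfile(application, user, dataset, session_time, metadata):
    """Build a <logfile> tree that follows the element nesting of the DTD.

    `metadata` is a list of (name, value) pairs; hypothetical helper.
    """
    root = ET.Element("logfile")
    app = ET.SubElement(root, "application", name=application)
    usr = ET.SubElement(app, "user", name=user)
    ds = ET.SubElement(usr, "dataset", name=dataset)
    session = ET.SubElement(ds, "session", time=session_time)
    for name, value in metadata:
        item = ET.SubElement(session, "metadata", name=name)
        item.text = value  # metadata content is #PCDATA
    return root


root = build_logfile(
    "SimTracker",
    "Tina",
    "astro",
    "01/11/2002 13:45:00 PST",
    [("access_speed", "100K"), ("file_size", "210M")],
)
print(ET.tostring(root, encoding="unicode"))
```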

Log File, Example

<?xml version="1.0" ?>
<!DOCTYPE logfile (View Source for full doctype...)>
<logfile>
  <application name="SimTracker">
    <user name="Tina">
      <dataset name="astro">
        <session time="01/11/2002 13:45:00 PST">
          <metadata name="access_speed">100K</metadata>
          <metadata name="storage_utilization">0</metadata>
          <metadata name="cohesion">0</metadata>
          <metadata name="fault_tolerance">1</metadata>
          <metadata name="num_disks_to_strip">20</metadata>
          <metadata name="start_io_device">16</metadata>
          <metadata name="stripping_factor" time="01/11/2002 13:55:00">200</metadata>
          <metadata name="stripping_unit" time="01/11/2002 13:59:00">64K</metadata>
          <metadata name="file_permissions">write</metadata>
          <metadata name="access_patterns">random</metadata>
          <metadata name="file_size">210M</metadata>
          <metadata name="io_buffer_size">128K</metadata>
        </session>
      </dataset>
    </user>
  </application>
</logfile>
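
A log in this shape could be flattened into one record per session before analysis. The sketch below is one possible way to do that with the standard library. The `sessions` and `parse_size` helpers are hypothetical, the sample is an abbreviated copy of the example without its elided DOCTYPE, and treating `K` and `M` as binary multiples is an assumption the slides do not confirm.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example log (DOCTYPE omitted because it is elided).
SAMPLE = """<logfile>
  <application name="SimTracker">
    <user name="Tina">
      <dataset name="astro">
        <session time="01/11/2002 13:45:00 PST">
          <metadata name="access_patterns">random</metadata>
          <metadata name="file_size">210M</metadata>
          <metadata name="io_buffer_size">128K</metadata>
        </session>
      </dataset>
    </user>
  </application>
</logfile>"""

# Assumption: suffixes are binary multiples (K = 1024 bytes, and so on).
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert strings such as '64K' or '210M' to a byte count."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return int(text[:-1]) * UNITS[text[-1].upper()]
    return int(text)


def sessions(xml_text):
    """Return one flat dict per <session>, keyed by metadata name."""
    root = ET.fromstring(xml_text)
    rows = []
    for app in root.findall("application"):
        for user in app.findall("user"):
            for ds in user.findall("dataset"):
                for sess in ds.findall("session"):
                    row = {
                        "application": app.get("name"),
                        "user": user.get("name"),
                        "dataset": ds.get("name"),
                        "time": sess.get("time"),
                    }
                    for item in sess.findall("metadata"):
                        row[item.get("name")] = (item.text or "").strip()
                    rows.append(row)
    return rows


rows = sessions(SAMPLE)
print(rows[0]["user"], rows[0]["access_patterns"], parse_size(rows[0]["file_size"]))
```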

Data Analysis

Researched publicly available clustering tools

Narrowed our choice to two:

—CLUTO (University of Minnesota)

—R (GNU)

Testing data processing algorithms on randomly generated log files

Hoping to get real log files in the near future:

—Logging applications

—We are currently looking at the "Flash" log files
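
To make the idea of clustering randomly generated log data concrete, here is a minimal sketch of plain k-means (Lloyd's algorithm) in pure Python. It stands in for CLUTO and R rather than reproducing either tool. The two features (I/O buffer size in KB, file size in MB), the generated values, and the function names are all assumptions chosen for illustration.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns (cluster index per point, list of centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


def random_sessions(n, seed=1):
    """Generate 2*n synthetic sessions from two made-up access profiles."""
    rng = random.Random(seed)
    small = [(rng.gauss(64, 5), rng.gauss(50, 5)) for _ in range(n)]
    large = [(rng.gauss(256, 5), rng.gauss(500, 5)) for _ in range(n)]
    return small + large


points = random_sessions(20)
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

With real logs, the feature vectors would come from the collected metadata instead of a random generator, and the number of clusters would need to be chosen rather than fixed at two.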

Questions

This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.

UCRL-MI-xxxxxx
