team

GA 2 CASC Team Ghaleb Abdulla (0.4) Tina Eliassi-Rad (0.4) Terence Critchlow (0.15)

Upload: yosefu

Post on 08-Jan-2016



TRANSCRIPT

Page 1: Team

GA 2CASC

Team

Ghaleb Abdulla (0.4) Tina Eliassi-Rad (0.4) Terence Critchlow (0.15)

Page 2: Team

Task Objective

Identify data storage formats that minimize access times using historical access patterns to the same or similar data sets

Use the spatial and temporal locality that results from data accesses to format the data on disk

Page 3: Team

Challenges

Data can be accessed using different tools and by different users.

[Diagram: users 1 through m, tools 1 through n, data sets 1 through k]

Page 4: Team

Enabling Access Pattern Discovery

Application area: astrophysics

Visualization tool: VisIt

Analyze history of access patterns on two levels:

- System level: disk references, network overhead, memory usage
- Application level: higher-level commands, user-level info

Page 5: Team

Enabling Access Pattern Discovery

[Diagram. Labeled components: users 1 to n; VisIt (astrophysics); application logging; disk logging; log files; unsupervised learner (e.g., k-NN, k-means, etc.); patterns; hints; [pattern, hints] training data; supervised learner (e.g., neural net, DT, etc.); Djehuty]

Page 6: Team

Log file collection

Collect logs at the application and disk level

Managing the log collection process:

- Start and stop collection sensors or agents on demand
- Keep log data in one central place
- Detect any failure in the monitoring agents and restart them
- Preferably work in a distributed environment

JAMM from LBL meets our requirements
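The management requirements listed on this slide (start and stop agents on demand, detect failures, restart) can be illustrated with a minimal sketch. This is not JAMM's API: the `Agent` and `Supervisor` classes and the `alive` flag are hypothetical stand-ins, and a real deployment would replace the flag with an actual health check.

```python
class Agent:
    """Hypothetical monitoring agent; `alive` stands in for a real health check."""

    def __init__(self, name):
        self.name = name
        self.alive = False
        self.restarts = 0

    def start(self):
        self.alive = True

    def stop(self):
        self.alive = False


class Supervisor:
    """Tracks which agents are wanted and restarts any wanted agent found dead."""

    def __init__(self):
        self.agents = {}
        self.wanted = set()

    def start(self, name):
        # Create the agent on first use, then mark it as wanted and running.
        agent = self.agents.setdefault(name, Agent(name))
        agent.start()
        self.wanted.add(name)

    def stop(self, name):
        # A stopped agent is no longer wanted, so check() will not restart it.
        self.wanted.discard(name)
        if name in self.agents:
            self.agents[name].stop()

    def check(self):
        """Restart every wanted agent that is not alive; return their names."""
        restarted = []
        for name in sorted(self.wanted):
            agent = self.agents[name]
            if not agent.alive:
                agent.start()
                agent.restarts += 1
                restarted.append(name)
        return restarted
```

Calling `check()` periodically gives the failure-detection loop; keeping the wanted set separate from the agents' state is what lets an on-demand stop be told apart from a crash.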

Page 7: Team

JAMM Architecture

Page 8: Team

What to Collect

Application and user level:

- Open
- Zoom
- Slice
- etc.

System level:

- Network overhead
- Disk block size
- Buffer size
- Disk location, etc.

We need to add our own sensors to collect data
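One possible shape for an application-level sensor is a wrapper that records each command as it runs. The sketch below is only an illustration: the `sensor` decorator, the `events` list, and `open_dataset` are hypothetical names, not part of VisIt or JAMM, and a real sensor would forward records to the central log store rather than to an in-memory list.

```python
import functools
import time


def sensor(command, log):
    """Decorator that appends one record per call to `log` (a plain list)."""
    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the command name and elapsed time even if the call raises.
                log.append({
                    "command": command,
                    "seconds": time.perf_counter() - start,
                })
        return inner
    return wrap


events = []


@sensor("Open", events)
def open_dataset(name):
    # Placeholder for a real application command (hypothetical).
    return f"opened {name}"
```

After a call such as `open_dataset("astro")`, `events` holds one record with the command name and its duration.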

Page 9: Team

Data format

The DTD for our XML files is as follows:

<!ELEMENT logfile (application+)>
<!ELEMENT application (user+)>
<!ATTLIST application name ID #REQUIRED>
<!ELEMENT user (dataset+)>
<!ATTLIST user name ID #REQUIRED>
<!ELEMENT dataset (session+)>
<!ATTLIST dataset name ID #REQUIRED>
<!ELEMENT session (metadata+)>
<!ATTLIST session time NMTOKENS #REQUIRED>
<!ELEMENT metadata (#PCDATA)>
<!ATTLIST metadata
    name ID #REQUIRED
    time NMTOKENS #IMPLIED>
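To show how the nesting in this DTD maps onto an actual document, here is a minimal sketch that builds one application/user/dataset/session chain with Python's standard `xml.etree.ElementTree`. The `build_log` helper and its dictionary input are assumptions made for illustration. ElementTree does not validate against a DTD, so the sketch only mirrors the element structure.

```python
import xml.etree.ElementTree as ET


def build_log(application, user, dataset, session_time, metadata):
    """Build a <logfile> tree with one application/user/dataset/session chain.

    `metadata` maps a metadata name to its text value (hypothetical input shape).
    """
    root = ET.Element("logfile")
    app = ET.SubElement(root, "application", name=application)
    usr = ET.SubElement(app, "user", name=user)
    data = ET.SubElement(usr, "dataset", name=dataset)
    session = ET.SubElement(data, "session", time=session_time)
    for name, value in metadata.items():
        entry = ET.SubElement(session, "metadata", name=name)
        entry.text = str(value)
    return root


root = build_log(
    "SimTracker", "Tina", "astro", "01/11/2002 13:45:00 PST",
    {"access_patterns": "random", "file_size": "210M"},
)
# Serialize to a string; no DTD validation is performed here.
xml_text = ET.tostring(root, encoding="unicode")
```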

Page 10: Team

Log File, Example

<?xml version="1.0" ?>
<!DOCTYPE logfile (View Source for full doctype...)>
<logfile>
  <application name="SimTracker">
    <user name="Tina">
      <dataset name="astro">
        <session time="01/11/2002 13:45:00 PST">
          <metadata name="access_speed">100K</metadata>
          <metadata name="storage_utilization">0</metadata>
          <metadata name="cohesion">0</metadata>
          <metadata name="fault_tolerance">1</metadata>
          <metadata name="num_disks_to_strip">20</metadata>
          <metadata name="start_io_device">16</metadata>
          <metadata name="stripping_factor" time="01/11/2002 13:55:00">200</metadata>
          <metadata name="stripping_unit" time="01/11/2002 13:59:00">64K</metadata>
          <metadata name="file_permissions">write</metadata>
          <metadata name="access_patterns">random</metadata>
          <metadata name="file_size">210M</metadata>
          <metadata name="io_buffer_size">128K</metadata>
        </session>
      </dataset>
    </user>
  </application>
</logfile>
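Before a log like this can be analyzed, it has to be flattened into one record per metadata entry. The sketch below does that with `xml.etree.ElementTree` on a shortened sample with no DOCTYPE line. The `flatten` function and the record layout are hypothetical, and falling back to the session time when a metadata element has no `time` attribute is an assumption, not something the slides specify.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the log file example.
SAMPLE = """<logfile>
  <application name="SimTracker">
    <user name="Tina">
      <dataset name="astro">
        <session time="01/11/2002 13:45:00 PST">
          <metadata name="access_patterns">random</metadata>
          <metadata name="stripping_factor" time="01/11/2002 13:55:00">200</metadata>
        </session>
      </dataset>
    </user>
  </application>
</logfile>"""


def flatten(xml_text):
    """Return one dict per <metadata> element, carrying its enclosing context."""
    records = []
    root = ET.fromstring(xml_text)
    for app in root.findall("application"):
        for user in app.findall("user"):
            for dataset in user.findall("dataset"):
                for session in dataset.findall("session"):
                    for meta in session.findall("metadata"):
                        records.append({
                            "application": app.get("name"),
                            "user": user.get("name"),
                            "dataset": dataset.get("name"),
                            # Assumption: use the session time when the
                            # metadata element carries no time of its own.
                            "time": meta.get("time", session.get("time")),
                            "name": meta.get("name"),
                            "value": meta.text,
                        })
    return records


records = flatten(SAMPLE)
```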

Page 11: Team

Data Analysis

Researched publicly available clustering tools

Narrowed our choice to two:

- CLUTO (University of Minnesota)
- R (GNU)

Testing data processing algorithms on randomly generated log files

Hoping to get real log files in the near future:

- Logging applications
- We are currently looking at the "Flash" log files
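The idea of testing on randomly generated logs can be sketched as follows. This is a plain-Python k-means used as a stand-in, not CLUTO or R, and the two features (file size in MB, I/O buffer size in KB) and their distributions are hypothetical choices made only to give the clusterer something to separate.

```python
import random


def random_sessions(n, seed=0):
    """Generate n synthetic (file_size_mb, io_buffer_kb) pairs in two groups."""
    rng = random.Random(seed)
    sessions = []
    for i in range(n):
        if i < n // 2:
            # Small files read with small buffers (hypothetical group).
            sessions.append((rng.gauss(10, 2), rng.gauss(32, 4)))
        else:
            # Large files read with large buffers (hypothetical group).
            sessions.append((rng.gauss(500, 20), rng.gauss(512, 16)))
    return sessions


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Basic k-means; returns (centroids, labels from the last assignment step)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels


sessions = random_sessions(40, seed=1)
centroids, labels = kmeans(sessions, k=2)
```

With real logs, the features would come from the collected metadata, and they would likely need scaling before clustering.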

Page 12: Team

Questions

Page 13: Team

This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.

UCRL-MI-xxxxxx