Long tails and Archive systems
Elliot Jaffe
FDIS 2005
Archive Metrics
• What
  – Distribution of file sizes
  – Distribution of occupied storage
  – How are files accessed
• Why
  – System architecture
  – Scaling for access
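The first two metrics can be illustrated with a small sketch that buckets file sizes by power of two and reports, per bucket, both the number of files and the bytes they occupy. The function name `size_histogram` and the sample sizes are assumptions made for illustration; the slides do not describe how the measurements were taken.

```python
from collections import defaultdict


def size_histogram(sizes):
    """Bucket file sizes by power of two.

    Returns {bucket_floor: (file_count, total_bytes)}, where bucket_floor
    is the largest power of two <= size (0 for empty files).
    """
    counts = defaultdict(int)
    totals = defaultdict(int)
    for size in sizes:
        # int.bit_length() equals floor(log2(size)) + 1 for size > 0.
        floor = 0 if size == 0 else 1 << (size.bit_length() - 1)
        counts[floor] += 1
        totals[floor] += size
    return {b: (counts[b], totals[b]) for b in sorted(counts)}


if __name__ == "__main__":
    # Made-up sizes in bytes, for illustration only.
    sample = [0, 1, 3, 700, 900, 1024, 5000, 1_000_000]
    for bucket, (n, total) in size_histogram(sample).items():
        print(f"{bucket:>8} B+  files={n}  bytes={total}")
```

Comparing the file-count column with the byte column shows whether most of the storage sits in a few large buckets.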
File size studies
UFS93 (1993)
• 12 million files
• UNIX only
• Avg. file size is 2k
• 90% of storage in 11% of files
HUJI (2005)
• 4 million files
• UNIX + Windows
• Avg. file size is 8k
• 90% of storage in 5.5% of files
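A statistic of the form "90% of storage in N% of files" can be computed from a list of file sizes. The sketch below assumes files are counted largest first until the share is reached; the slides do not say how either study derived its figure, and the function name is hypothetical.

```python
def fraction_of_files_holding(sizes, storage_share=0.90):
    """Smallest fraction of files, taken largest first, whose combined
    size reaches `storage_share` of the total bytes."""
    ordered = sorted(sizes, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    target = storage_share * total
    running = 0
    for count, size in enumerate(ordered, start=1):
        running += size
        if running >= target:
            return count / len(ordered)
    return 1.0


if __name__ == "__main__":
    # Made-up sizes: one large file and 99 tiny ones.
    sizes = [1000] + [1] * 99
    print(fraction_of_files_holding(sizes))
```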
What’s Changed
Then
• JAWS, NOW
• Online was expensive
• Offline tape storage
Now
• Central File Servers
• Digital Libraries
• Online is cheap
• No offline storage
• XML
• Multimedia
Empirical Data
Questions
• What is the future of these distributions?
• Are the changes extensions of power-law tails, so that the 10/90 and 20/80 rules no longer hold and are the wrong way to think about these distributions?
• Are the changes based on external factors that are unpredictable?
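One way to probe the power-law question is to estimate the tail exponent from measured file sizes. The sketch below uses the Hill estimator, a standard technique that the slides do not mention; it is offered only as an illustration. It assumes the tail behaves like P(X > x) ~ x^(-alpha), and `k`, the number of largest samples used, is a tuning choice.

```python
import math


def hill_tail_index(sizes, k):
    """Hill estimate of the tail exponent alpha from the k largest values.

    Assumes P(X > x) ~ x**(-alpha) for large x. Requires 1 <= k < len(sizes)
    and a strictly positive (k + 1)-th largest value.
    """
    ordered = sorted(sizes, reverse=True)
    if not 1 <= k < len(ordered):
        raise ValueError("k must satisfy 1 <= k < number of samples")
    threshold = ordered[k]  # the (k + 1)-th largest value
    if threshold <= 0:
        raise ValueError("tail values must be positive")
    log_excess = sum(math.log(x / threshold) for x in ordered[:k])
    if log_excess == 0:
        raise ValueError("top k values all equal the threshold")
    return k / log_excess


if __name__ == "__main__":
    # Synthetic sizes placed at the quantiles of a Pareto law with alpha = 1.5.
    n, alpha = 10_000, 1.5
    sizes = [(i / (n + 1)) ** (-1 / alpha) for i in range(1, n + 1)]
    print(hill_tail_index(sizes, k=1000))
```

A smaller estimated alpha means a heavier tail. Comparing estimates across snapshots taken years apart would show whether the tail is lengthening.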
The Long Tail
• Chris Anderson (2004)
  – http://www.wired.com/wired/archive/12.10/tail.html
• The long tail of a distribution has tremendous mass and creates new market opportunities
• Amazon, Netflix, Wikipedia
Today’s landscape
[Figure: NOW, File Servers, Sarbanes Oxley, and Digital Libraries plotted against Storage Capacity and Access Frequency]
Next Steps
• Collecting data from large storage systems
  – File Sizes, Created, Last Modified, Last Access, Frequency of Reads
• Goal: New architectures for Digital libraries
  – Focus on Operations
  – Store large and small files differently
  – Store very-low access files in slow access
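The data collection and placement ideas above can be sketched as a directory scan followed by a simple tiering rule. Everything named here is an assumption for illustration: `FileRecord`, `scan`, `choose_tier`, the 1 MiB size threshold, and the 365-day idle threshold do not come from the slides. Frequency of reads is not available from file metadata and would need access logs, so this sketch uses last access time as a rough stand-in.

```python
import os
import stat
from collections import namedtuple

# size in bytes; times are seconds since the epoch.
# `changed` is st_ctime: metadata change time on Unix, so it is not a
# reliable creation time; true creation time is platform-specific.
FileRecord = namedtuple("FileRecord", "path size modified accessed changed")


def scan(root):
    """Yield a FileRecord for every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets, devices
            yield FileRecord(path, st.st_size, st.st_mtime,
                             st.st_atime, st.st_ctime)


def choose_tier(record, now, large_bytes=1 << 20, cold_days=365):
    """Hypothetical placement rule; both thresholds are illustrative."""
    idle_days = (now - record.accessed) / 86400
    if idle_days > cold_days:
        return "slow"  # very rarely accessed files go to slow storage
    return "large" if record.size >= large_bytes else "small"


if __name__ == "__main__":
    import time
    now = time.time()
    for rec in scan("."):
        print(choose_tier(rec, now), rec.size, rec.path)
```

Last access times are only as good as the file system's settings: mounts that disable or relax access-time updates will make files look colder than they are.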