Drupal Case Study: ABC Dig Music
DESCRIPTION
A Drupal case study on developing the Australian Broadcasting Corporation's Dig Music website. I gave this talk at Drupal Downunder #ddu2011 in Brisbane, Australia (Jan 23, 2011). I discuss how the Semantic Web was used to create a real-time snapshot of a musical artist that is pulled live from the digital radio broadcast. I also talk about performance issues we encountered and ways that they were overcome.
TRANSCRIPT
Case Study – ABC Dig Music
David Peterson @davidseth #ddu2011 http://www.flickr.com/photos/soyignatius/
Challenge
Create a snapshot of an artist
Combining • Known Data • Data in the Wild
Problem
<xml>
  <track>
    <title>Purple Rain</title>
    <artistName>Prince</artistName>
  </track>
</xml>
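A minimal sketch of reading that feed, assuming the payload looks exactly like the sample on the slide (a real broadcast feed may carry more fields). The function name parse_tracks is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample payload copied from the slide.
FEED = """<xml>
  <track>
    <title>Purple Rain</title>
    <artistName>Prince</artistName>
  </track>
</xml>"""

def parse_tracks(payload):
    """Return a list of (artist, title) tuples found in the feed."""
    root = ET.fromstring(payload)
    tracks = []
    for track in root.iter("track"):
        title = (track.findtext("title") or "").strip()
        artist = (track.findtext("artistName") or "").strip()
        # Skip incomplete entries: both values are needed for a lookup.
        if title and artist:
            tracks.append((artist, title))
    return tracks

print(parse_tracks(FEED))
```

All the feed gives us is two strings per track, which is why everything else has to come from somewhere else.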
Into
It’s all about Storytelling…
Shared Understanding
• Can’t tell a story if the other person doesn’t get what we mean
• Or even speak the same language
• The story matters
• ... but ...
• You never really have all the information you need, whether big or small
You Just don’t Always Know
• Someone else knows more than you
• How to find it?
One Exception
Semantic Web
• Core idea
– you never really know the entire picture
• This is a “good thing”
• Freedom
Closed World
Open World
http://www.flickr.com/photos/almasryalyoum_e/
“If the graph of people is cool, imagine a graph of everything” - Dries Buytaert
Open Data
Facebook?
• A little late to the party ;)
Finding a Solution
• Which APIs to use?
• Which APIs can we use?
• How can we combine data from multiple sources?
• How can we automate it?
The Curse of too Much
• There are over 50 APIs listed on programmableweb.com
• Too many to look into
• Each has its own API methods and return data formats
– JSON, XML, RSS, RDF !!!
Take your Pick
• APIs everywhere
– BBC Music
– Discogs
– Last.fm
– MusicBrainz
– Yahoo Music
– Flickr
– YouTube
– The Hype Machine
Finding the Key
• One common feature was the usage of a MusicBrainz ID
– Last.fm
– Discogs
– Freebase
– Wikipedia/DBpedia
– BBC
Eureka!
• Great, now all I had to do was use the MusicBrainz API to look up the ID and I was done. Easy...
• :(
• The search API sucked. It returned too many fuzzy results
• crap
Back to the Future
• This is where the Semantic Web enters the picture
– All that stuff about storytelling
– Shared understanding
– URIs (web links)
SPARQL
Think of it as Google with a WHERE clause
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?artist WHERE {
  ?artist foaf:name "Prince"@en .
  ?artist a <http://dbpedia.org/ontology/MusicalArtist> .
}
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia2: <http://dbpedia.org/property/>
SELECT ?artist ?bio ?url ?album WHERE {
  ?artist foaf:name "Prince"@en .
  ?artist a <http://dbpedia.org/ontology/MusicalArtist> .
  ?artist dbpedia2:abstract ?bio .
  ?artist foaf:page ?url .
  OPTIONAL {
    ?album <http://dbpedia.org/ontology/artist> ?artist .
    ?album rdfs:label "Purple Rain"@en .
  }
}
LIMIT 1
Pinpoint Results
• This returns ONE result
• “exactly” what we are looking for (or nothing!)
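A rough sketch of how the query above could be generated for each track coming off the feed. It only builds the query text and the request URL and sends nothing. The endpoint URL is an assumption (the public DBpedia endpoint), and the helper names are made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: the public DBpedia endpoint; the slides do not name the one used.
ENDPOINT = "https://dbpedia.org/sparql"

PREFIXES = (
    "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX dbpedia2: <http://dbpedia.org/property/>\n"
)

def sparql_literal(text, lang="en"):
    """Escape text and wrap it as a language-tagged SPARQL string literal."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n"))
    return '"%s"@%s' % (escaped, lang)

def build_artist_query(artist_name, album_title):
    """Fill the artist and album names into the query shown on the slide."""
    return PREFIXES + (
        "SELECT ?artist ?bio ?url ?album WHERE {\n"
        "  ?artist foaf:name %s .\n"
        "  ?artist a <http://dbpedia.org/ontology/MusicalArtist> .\n"
        "  ?artist dbpedia2:abstract ?bio .\n"
        "  ?artist foaf:page ?url .\n"
        "  OPTIONAL {\n"
        "    ?album <http://dbpedia.org/ontology/artist> ?artist .\n"
        "    ?album rdfs:label %s .\n"
        "  }\n"
        "}\n"
        "LIMIT 1\n"
    ) % (sparql_literal(artist_name), sparql_literal(album_title))

def request_url(query):
    """Build a GET URL using the standard SPARQL protocol 'query' parameter."""
    return ENDPOINT + "?" + urlencode({"query": query})

query = build_artist_query("Prince", "Purple Rain")
print(request_url(query))
```

Escaping the names matters because artist and album names come straight from the feed and can contain quotes.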
{170d193a-845c-479f-980e-bef15710653e}
http://www.flickr.com/photos/riseofphoenix/
{070d193a-845c-479f-980e-bef15710653e}
http://www.flickr.com/photos/angeldew/
Raw Data
• Not too pretty to look at
• But computers LOVE this stuff
So, what do we get
• Disambiguation
• MusicBrainz ID
• Discography
• Related Artists
• Official homepage
• Bio
• Credit card details (sometime in 2012)
The Rosetta Stone
• MusicBrainz ID is our key to the wild web of APIs
• Wikipedia URL is the key to the Semantic Web
• One happy family :)
http://www.flickr.com/photos/vportals/
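One way to picture the MusicBrainz ID as a join key: a small sketch that merges records from different sources into a single artist snapshot. The record layout, source names and placeholder values are assumptions for illustration, not what the site actually stores.

```python
import uuid

def normalise_mbid(raw):
    """Return the canonical lower-case, hyphenated form of a MusicBrainz ID."""
    # uuid.UUID accepts the brace-wrapped and upper-case spellings as well.
    return str(uuid.UUID(raw))

def merge_snapshots(records):
    """Merge per-source records that share a MusicBrainz ID, one dict per artist."""
    artists = {}
    for record in records:
        mbid = normalise_mbid(record["mbid"])
        snapshot = artists.setdefault(mbid, {"mbid": mbid, "sources": []})
        snapshot["sources"].append(record["source"])
        for key, value in record.items():
            if key not in ("mbid", "source"):
                # The first source to supply a field wins.
                snapshot.setdefault(key, value)
    return artists

# Hypothetical records; the field values are placeholders, not real data.
records = [
    {"source": "dbpedia", "mbid": "{070d193a-845c-479f-980e-bef15710653e}",
     "name": "Prince", "bio": "(abstract text)"},
    {"source": "lastfm", "mbid": "070D193A-845C-479F-980E-BEF15710653E",
     "similar": ["(related artist)"]},
]
print(merge_snapshots(records))
```

Normalising the ID first means the brace-wrapped form shown earlier and the plain form used by most APIs land in the same snapshot.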
Take a look
[browser]
Hindsight is 20/20
... or lessons learned
Drupal Sucks
• Drupal performance, what performance?
Don’t use Drupal
• To get the best performance out of Drupal 6, don’t use Drupal 6!
Pressflow
• Key patches and enhancements
• Releases mirror official Drupal releases
• Big players are using it
– Drupal.org
– ABC
– Music labels
– Newspapers
Start your Engines
MySQL base install is ... lacking
• MyISAM == slow
• Use Percona XtraDB
• ... or ... InnoDB
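Switching engines is one ALTER TABLE statement per table. A small sketch that generates those statements for review before running them; the table names in the example are only samples, and a real migration would read the list from the database.

```python
def innodb_statements(tables):
    """Build one ALTER TABLE ... ENGINE=InnoDB statement per table name."""
    statements = []
    for name in tables:
        # Quote the identifier, doubling any backtick inside the name.
        quoted = "`%s`" % name.replace("`", "``")
        statements.append("ALTER TABLE %s ENGINE=InnoDB;" % quoted)
    return statements

for statement in innodb_statements(["node", "users"]):
    print(statement)
```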
Reduce your footprint
• APC
– PHP app is compiled & cached in memory
• Memcached
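The usual pattern with Memcached is "cache-aside": look in the cache first and only do the expensive work on a miss. A minimal sketch, with an in-process class standing in for a real Memcached client (TTLCache and cached_lookup are hypothetical names, not part of any library).

```python
import time

class TTLCache:
    """In-process stand-in for a Memcached client, for illustration only."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Expired entries behave like a miss.
            del self._items[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._items[key] = (value, time.monotonic() + ttl)

def cached_lookup(cache, key, loader, ttl=300):
    """Return the cached value for key, calling loader() only on a miss."""
    value = cache.get(key)
    if value is None:
        value = loader()
        cache.set(key, value, ttl)
    return value

cache = TTLCache()
bio = cached_lookup(cache, "artist:prince:bio", lambda: "(expensive API result)")
print(bio)
```

Note that a loader returning None is never cached in this sketch, so it would be called again on every request.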
Search
• Drupal’s built-in search can be a dawg
• Solr
– Much faster search
– Offers faceting
– Can become a platform in its own right
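Faceting is switched on per request with Solr's facet and facet.field parameters. A sketch that builds such a request URL; the host, path and the genre field are assumptions, and the URL is only printed, not fetched.

```python
from urllib.parse import urlencode

def solr_search_url(base_url, text, facet_fields, rows=10):
    """Build a Solr select URL, adding one facet.field parameter per field."""
    params = [("q", text), ("rows", rows), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return base_url.rstrip("/") + "/select?" + urlencode(params)

# Hypothetical local Solr instance and field name.
print(solr_search_url("http://localhost:8983/solr", "prince", ["genre"]))
```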
A Fresh Coat of Paint
• Varnish
– Last but certainly not least
– Up to millions of hits per hour
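Varnish can only help when a response is safe to share between visitors. A sketch of the kind of rule involved, written in Python rather than Varnish's own configuration language: treat a request as cacheable when it is a read and carries no Drupal session cookie. The function name is made up, and the real rules for a site are usually more detailed.

```python
def is_cacheable(method, cookie_names):
    """Rough rule of thumb: cache reads from visitors without a session."""
    if method not in ("GET", "HEAD"):
        return False
    # Drupal session cookie names start with "SESS".
    return not any(name.startswith("SESS") for name in cookie_names)

print(is_cacheable("GET", []))
print(is_cacheable("GET", ["SESSabc123"]))
```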
Performance Optimisations
• Switch host to Linode
• Two-server architecture - db server and app server
• Master-slave relationship for MySQL
• Migrated Drupal to Pressflow
• Changed tables to InnoDB
• Varnish for serving pages
• Memcached for caching
• Set up Munin to monitor servers
An Alternate Future
• Drupal 7
– RDFa
– Views 3
– Entities
– Fields
– Media Module
– Stream Wrappers
– MongoDB