![Page 1: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/1.jpg)
22nd October 2012
Python <3 Content systems - managing millions of tracks for the masses
Tuesday, October 23, 12
![Page 2: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/2.jpg)
![Page 3: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/3.jpg)
![Page 4: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/4.jpg)
![Page 5: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/5.jpg)
![Page 6: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/6.jpg)
![Page 7: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/7.jpg)
![Page 8: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/8.jpg)
![Page 14: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/14.jpg)
> 18 M tracks
> 1 century of listening
> 20 k new tracks added per day
> 500 M playlists
> Available in 15 Countries
> 15 M active users*
* Users active within the previous 30 days
![Page 24: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/24.jpg)
Service overview
...
Storage
User
Search
Metadata
AP
![Page 26: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/26.jpg)
Content pipeline
Label A, Label B, Label C, Label D -> Ingestion
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
![Page 27: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/27.jpg)
Ingestion
XML XML XML XML
Background image: lord enfield (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
![Page 32: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/32.jpg)
Ingestion: Delivery formats
~ 10 different incoming XML formats
- Proprietary formats (majors)
- Spotify delivery format (mostly indies)
Thousands of lines of source-specific code
![Page 33: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/33.jpg)
Data model [simplified]
[Diagram: entities Album, Track, Artist, Disc, Rights, Audio and Transcoding, connected by relations with multiplicities 1 and *]
![Page 34: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/34.jpg)
Ingestion
lxml and XSLT with extensions for parsing/transforming XML
![Page 35: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/35.jpg)
Ingestion: XPath extensions
>>> def formerlify(_, name):
...     return 'The artist formerly known as %s' % name
>>> # Namespace stuff
>>> from lxml import etree
>>> ns = etree.FunctionNamespace('http://my.org/myfunctions')
>>> ns['hello'] = formerlify
>>> ns.prefix = 'f'
>>> root = etree.XML('<a><b>Prince</b></a>')
>>> print(root.xpath('f:hello(string(b))'))
The artist formerly known as Prince
http://lxml.de/extensions.html#xpath-extension-functions
![Page 39: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/39.jpg)
Ingestion
Fun (?!) fact: the largest XML file seen so far had 3.3 million rows and took up 350 MB of disk space
The Bible apparently fits in 3 MB of XML
>>> import timeit
>>> timeit.timeit('e.parse("huge.xml")', setup='import lxml.etree as e', number=5) / 5
4.19...
>>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.cElementTree as e', number=5) / 5
4.78...
>>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.ElementTree as e', number=5) / 5
55.39...
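The timings above load the whole file into one tree. For files in the hundreds of megabytes, one alternative is to stream the document instead. The talk does not say whether this was done, so the following is only a minimal sketch under that assumption, using the standard library's `xml.etree.ElementTree.iterparse`. The `count_elements` helper and the `track` element name are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count elements named `tag` while streaming through the document."""
    count = 0
    # iterparse yields each element once its closing tag has been read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Free the element's text and children; the (now empty)
            # element itself stays attached to its parent.
            elem.clear()
    return count


# Tiny in-memory stand-in for a large delivery file (hypothetical layout).
sample = io.BytesIO(b"<album><track>A</track><track>B</track></album>")
print(count_elements(sample, "track"))
```

The same call also accepts a file name, so a real delivery file could be passed in place of the in-memory sample.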
![Page 42: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/42.jpg)
Content pipeline
Label A, Label B, Label C, Label D -> Ingestion -> Merge
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
![Page 43: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/43.jpg)
Centralized vs. aggregated cataloging
Requires merging!
Requires humans!
![Page 44: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/44.jpg)
Image: Nicolas Genin (CC BY 2.0) http://www.flickr.com/photos/22785954@N08
Metadata - challenges
![Page 48: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/48.jpg)
Content pipeline
Label A, Label B, Label C, Label D -> Ingestion -> Merge -> Curation/enrichment
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
![Page 54: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/54.jpg)
Ambiguous artists - thesis work
• User input
• Machine learning
• Matching against external sources
• Feature selection (#matches per external source, len(name), country-count, multilingual)
• Matching + preprocessing in Python
![Page 58: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/58.jpg)
Content matching
(16 * 10 ** 6) ** 2 = 2.56 * 10 ** 14 = a large number
Reduce search space:
>>> from unicodedata import normalize
>>> key = ''.join(normalize('NFD', char)[0].lower() for char in title)[:5]
Side note: Levenshtein (edit) distance is a heavy operation
-> sped up about 4x with PyPy (or use a C extension)
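The key above can be used to group titles into blocks, so that the expensive edit distance is only computed within a block rather than across all pairs. The sketch below is one possible reading of that idea, not the talk's actual code: the function names, the prefix length of 5 and the `max_distance` threshold are all assumptions, and the edit distance is a plain pure-Python version.

```python
from collections import defaultdict
from itertools import combinations
from unicodedata import normalize


def blocking_key(title, length=5):
    """Accent-stripped, lower-cased prefix of the title."""
    return "".join(normalize("NFD", ch)[0].lower() for ch in title)[:length]


def levenshtein(a, b):
    """Edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def candidate_pairs(titles, max_distance=2):
    """Compare titles only with other titles that share a blocking key."""
    blocks = defaultdict(list)
    for title in titles:
        blocks[blocking_key(title)].append(title)
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if levenshtein(a.lower(), b.lower()) <= max_distance:
                pairs.append((a, b))
    return pairs


# "P\u00farple" is "Purple" with an accented u.
titles = ["Purple Rain", "Purple Rainn", "P\u00farple Rain", "Kiss"]
print(candidate_pairs(titles))
```

The trade-off is that two titles differing within the first five characters end up in different blocks and are never compared, so the prefix length controls how much recall is given up for speed.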
![Page 60: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/60.jpg)
Automatic data processing will never be perfect
Patch it!
![Page 65: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/65.jpg)
Content pipeline
Label A, Label B, Label C, Label D -> Ingestion -> Merge -> Curation/enrichment -> Transcoding
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
![Page 66: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/66.jpg)
Transcoding
Asynchronous
RabbitMQ + amqplib
Master / workers
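The master/worker pattern named on this slide can be illustrated without a broker. The sketch below deliberately swaps RabbitMQ + amqplib for the standard library's `queue.Queue` and threads, so it shows only the shape of the pattern, not the talk's setup. The `transcode` function and the `.transcoded` suffix are hypothetical placeholders.

```python
import queue
import threading


def transcode(track_id):
    """Hypothetical placeholder for the real transcoding work."""
    return "%s.transcoded" % track_id


def worker(jobs, results):
    """Take jobs off the queue until the master sends a None sentinel."""
    while True:
        track_id = jobs.get()
        if track_id is None:
            break
        results.put(transcode(track_id))


def run_master(track_ids, num_workers=3):
    """Start the workers, hand out the jobs and collect the results."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for track_id in track_ids:
        jobs.put(track_id)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return sorted(results.get() for _ in track_ids)


print(run_master(["track-1", "track-2", "track-3"]))
```

With a real message broker the queue would live outside the process, so workers could run on other machines and jobs could survive a restart.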
![Page 72: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/72.jpg)
Content pipeline
Label A, Label B, Label C, Label D -> Ingestion -> Merge -> Curation/enrichment -> Transcoding -> Indexing
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
![Page 76: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/76.jpg)
Index build
• Nightly batch job on db-dumps
• Previously mostly Python, but now moved to Java for performance reasons
• But still lots of Python helper scripts :)
![Page 83: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/83.jpg)
Content pipeline
Label A, Label B, Label C, Label D -> Ingestion -> Merge -> Curation/enrichment -> Transcoding -> Indexing -> Publishing -> On-site live services, e.g. search, browse
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
![Page 89: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/89.jpg)
Distribution/publish
Index A
Index B
Index C
Service A
Service B
Service C
![Page 90: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/90.jpg)
Scheduling being migrated to ZooKeeper
image: http://www.flickr.com/photos/seattlemunicipalarchives/with/3797940791/
![Page 91: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/91.jpg)
Distribution/publish
Staged rollout
![Page 97: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/97.jpg)
Distribution/publish
Exponential back-off
waiting 5s ...
waiting 10s ...
waiting 30s ...
waiting 60s ...
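A retry loop with growing waits can be sketched as follows. The delays are the ones shown on the slide (5, 10, 30, 60 seconds). Everything else is an assumption made for illustration: the `publish_with_backoff` name, the injectable `sleep` argument, and the `flaky_publish` step that fails twice before succeeding.

```python
import time

# Delays in seconds, taken from the slide: 5, 10, 30, 60.
DELAYS = (5, 10, 30, 60)


def publish_with_backoff(publish, delays=DELAYS, sleep=time.sleep):
    """Call `publish` until it succeeds, waiting longer after each failure."""
    for delay in delays:
        try:
            return publish()
        except Exception:
            print("waiting %ds ..." % delay)
            sleep(delay)
    # Last attempt: let any exception propagate to the caller.
    return publish()


# Hypothetical publish step that fails twice, then succeeds.
attempts = []


def flaky_publish():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("service not ready")
    return "published"


# Record the requested waits instead of really sleeping.
waits = []
print(publish_with_backoff(flaky_publish, sleep=waits.append))
```

Passing `sleep` as a parameter keeps the example fast to run; in real use the default `time.sleep` would apply.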
![Page 99: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/99.jpg)
Store ’da data
![Page 106: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/106.jpg)
Choice of database
Depends on the use case - duh!
• PostgreSQL (e.g. user service)
• Cassandra (e.g. playlist service)
• Tokyo Cabinet (e.g. browse service)
• Lucene (search service)
• HDFS
![Page 107: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/107.jpg)
PostgreSQL
[Pic. of elephant]
Image: http2007 (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
![Page 108: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/108.jpg)
PostgreSQL
Redundancy + scaling: master/slave
![Page 109: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/109.jpg)
PostgreSQL
Joins and subqueries - let the query planner roll!
![Page 112: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/112.jpg)
PostgreSQL
Python?
- psycopg2 + SQL queries
- SQLAlchemy migrator for versioning of db schemas
Tip! Server-side, aka named, cursors:
import psycopg2
conn = psycopg2.connect(database='huge_db', user='postgres', password='secret')
# Passing a name makes this a server-side cursor
sscursor = conn.cursor('my_cursor')
sscursor.execute('SELECT * FROM big_table')
# Rows are pulled from the server in batches
rows = sscursor.fetchmany(1000)
...
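The `fetchmany` call above is typically wrapped in a loop that keeps fetching until no rows are left. Since psycopg2 needs a running PostgreSQL server, the sketch below uses the standard library's `sqlite3` purely as a stand-in: only the batching loop carries over, and named server-side cursors remain a psycopg2 feature. The `iter_rows` helper and the table contents are hypothetical.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time while fetching them in batches."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            yield row


# In-memory stand-in for a large table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER)")
conn.executemany("INSERT INTO big_table VALUES (?)",
                 [(i,) for i in range(2500)])

cursor = conn.cursor()
cursor.execute("SELECT id FROM big_table ORDER BY id")
total = sum(1 for _ in iter_rows(cursor))
print(total)
```

With a psycopg2 named cursor the same loop would keep only one batch in client memory at a time.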
![Page 113: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/113.jpg)
Scaling the content pipeline
What to scale for?
![Page 114: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/114.jpg)
Scaling the content pipeline
Size of catalog
![Page 115: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/115.jpg)
Scaling the content pipeline
# Users
![Page 117: Python](https://reader031.vdocument.in/reader031/viewer/2022012922/54bd41e94a795931148b45ae/html5/thumbnails/117.jpg)
Distribution/publish
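Popen + gevent (although IO-bound)
import subprocess
import gevent
import gevent.monkey

gevent.monkey.patch_all()

def _wait(self):
    # Poll the child process and yield to other greenlets between polls
    while True:
        res = self.poll()
        if res is not None:
            return res
        gevent.sleep(0.1)

# Replace the blocking wait() with the cooperative version
subprocess.Popen.wait = _wait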