Shipping Data Science Products! Turning raw data into valuable services BudapestBI Forum 2015 License: CC By Attribution Ian Ozsvald @IanOzsvald ModelInsight.io


Page 1: Shipping Data Science Products! - BI Consultingbiconsulting.hu/letoltes/2015budapestbi/budapestbiforum2015_IanOzsvald.pdfMicroservices Flask is my go-to tool Swagger docs (git pull

Shipping Data Science Products!
Turning raw data into valuable services
BudapestBI Forum 2015
License: CC By Attribution

Ian Ozsvald @IanOzsvald ModelInsight.io

[email protected] @IanOzsvald BudapestBI Forum October 2015

Who Am I?

● “Industrial Data Science” for 15 years
● Data Product Builder
● O'Reilly Author
● Teacher at PyCons

Who are you?

● Type A(nalysis) or B(uilding)
● Robert Chang - “Doing Data Science at Twitter”

What frustrations do we share?

● Lack of useful data
● Biggest time sink - cleaning & transforming
● Conservative management
● How can we derisk projects?
● Medium Data
● luckily we have Wes in the room

Which projects succeed?

● Explain existing data (visualisation!)
● Automate repetitive/slow processes (higher accuracy, more repeatable)
● Augment data to make new data (e.g. for search engines and ML)
● Predict the future (e.g. replace human intuition or use subtler relationships)

Why is it valuable?

Visualising data

● Most data isn't interesting...
● Requires human curation + detective skills to get the good stuff
● Couple a researcher + a business person

Medical data (anti-allergy)

Perceived complexity might make sign-off more difficult...

Medical data (anti-allergy)

Predict using:
● food
● alcohol
● pollen
● pollution
● location
● cats
● ...

Extracting data from binary files

● Copy/pasting PDF/PNG data is laborious
● How can we scale it?
● textract/Tika - unified interface
● Specialised tools e.g. Sovren
● This might take months!

Augmenting data

● Identifying people, places, brands, sentiment
● “i love my apple phone”
● Context-sensitive (e.g. movies vs products)
● Build custom machine-learned tools
● Augment job titles
● Reconcile the same order in 2 tables
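
One way to approach "reconcile the same order in 2 tables" is fuzzy string matching. The sketch below is only an illustration using the standard library's `difflib`; the row layout, the `reconcile` helper, the 0.8 threshold and the sample orders are all assumptions, not something taken from the talk.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case and collapse whitespace so trivial differences vanish."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a 0..1 similarity ratio between two normalised strings."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def reconcile(left_rows, right_rows, threshold=0.8):
    """Pair each left row with its most similar right row, if similar enough.

    Rows are (row_id, description) tuples; both the layout and the
    threshold are illustrative assumptions.
    """
    matches = []
    for left_id, left_text in left_rows:
        best_id, best_score = None, 0.0
        for right_id, right_text in right_rows:
            score = similarity(left_text, right_text)
            if score > best_score:
                best_id, best_score = right_id, score
        # Below the threshold we record "no match" rather than guess
        matched_id = best_id if best_score >= threshold else None
        matches.append((left_id, matched_id, round(best_score, 2)))
    return matches


# Hypothetical order tables from two systems
billing = [(1, "ACME Ltd  - 3x Widget"), (2, "Bolt & Sons - 1x Gear")]
warehouse = [("w9", "Bolt and Sons - 1x gear"), ("w7", "acme ltd - 3x widget")]
matches = reconcile(billing, warehouse)
for left_id, right_id, score in matches:
    print(left_id, "->", right_id, score)
```

Rows that fall below the threshold are reported with `None` so a human can review them, which is usually safer than forcing a pairing.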

Machine Learning

● PyMC (Markov Chain Monte Carlo)

Please cite these projects! (it helps their funding)

Debugging Machine Learning?

● Thoughts from you?
● No obvious tools to show me:
  ● these examples were well-fitted
  ● these always wrongly-fitted
  ● these always uncertain
● No data-diagnostics to validate inputs (e.g. for Logistic Regression)
● No visualisers for most of the models
● Your hard-won knowledge->new debug tools? (PLEASE!)
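
The three groups listed above (well-fitted, always wrongly-fitted, always uncertain) can be approximated by hand. The sketch below assumes you already have several predicted probabilities per example, for instance from repeated cross-validation; the `bucket_examples` helper, the 0.3/0.7 cut-offs and the numbers are made up for illustration.

```python
def bucket_examples(probabilities, labels, low=0.3, high=0.7):
    """Sort examples into well-fitted / wrongly-fitted / uncertain buckets.

    probabilities: one list per example of predicted P(class == 1), e.g.
        one value per cross-validation repeat (an assumed setup).
    labels: the true 0/1 label for each example.
    low/high: illustrative cut-offs for what counts as a confident prediction.
    """
    buckets = {"well_fitted": [], "wrongly_fitted": [], "uncertain": []}
    for index, (probs, label) in enumerate(zip(probabilities, labels)):
        confident_right = all(
            (p >= high) if label == 1 else (p <= low) for p in probs
        )
        confident_wrong = all(
            (p <= low) if label == 1 else (p >= high) for p in probs
        )
        if confident_right:
            buckets["well_fitted"].append(index)
        elif confident_wrong:
            buckets["wrongly_fitted"].append(index)
        else:
            # Everything else, including examples that flip between repeats
            buckets["uncertain"].append(index)
    return buckets


# Made-up predictions from three repeats for four examples
probabilities = [
    [0.92, 0.88, 0.95],  # label 1, confidently right every time
    [0.10, 0.15, 0.05],  # label 1, confidently wrong every time
    [0.55, 0.40, 0.62],  # label 0, never confident
    [0.08, 0.12, 0.20],  # label 0, confidently right every time
]
labels = [1, 1, 0, 0]
buckets = bucket_examples(probabilities, labels)
print(buckets)
```

Printing the raw rows behind the `wrongly_fitted` indices is often the quickest route to spotting a labelling error or a missing feature.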

Debugging Machine Learning?

Roelof Pieters PyDataLondon2015

Delivery: Keep It Simple (Stupid!)

● We're (probably) not publishing the best result
● Debuggability is key - 3am Sunday CTO beeper alert is no time for complexity
● “cult of the imperfect” Watson-Watt
● Dumb models + clean data beat other combinations

Don't Kill It!

● Your data is missing, it is poor and it lies
● Missing data kills projects!
● Log everything!
● Make data quality tools & reports
● More data->desynchronisation
● R&D != Engineering
● Discovery-based
● Success and failure equally useful

engarde
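
A data quality report can start very small. The sketch below counts missing values per column using only the standard library; it is not engarde's API, and the `quality_report` helper, the column names and the sample CSV are hypothetical.

```python
import csv
import io


def quality_report(rows, required):
    """Count missing and blank values per required column.

    rows: an iterable of dicts, e.g. from csv.DictReader.
    required: column names that should always be populated (assumed list).
    """
    report = {name: {"missing": 0, "total": 0} for name in required}
    for row in rows:
        for name in required:
            report[name]["total"] += 1
            value = row.get(name)
            # DictReader yields None for absent fields and "" for blank ones
            if value is None or value.strip() == "":
                report[name]["missing"] += 1
    return report


# Made-up CSV standing in for a real export
raw = io.StringIO(
    "order_id,customer,amount\n"
    "1,alice,10.50\n"
    "2,,7.00\n"
    "3,carol,\n"
)
report = quality_report(csv.DictReader(raw), ["order_id", "customer", "amount"])
for name, counts in report.items():
    print(name, counts["missing"], "of", counts["total"], "missing")
```

Running a report like this on every new data delivery, and logging the result, makes a sudden jump in missing values visible before it reaches a model.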

Internal deployment

● CSVs/Reports
● Database updates
● IPython Notebook (not secure though!)

Deploying live systems

● Spyre (locked-down)
● Microservices
● Flask is my go-to tool
● Swagger docs
● (git pull / fabric / provisioned machines)
● Docker + Amazon ECS
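
To show the shape of a microservice, here is a minimal sketch of a single JSON endpoint. The slide names Flask; this example deliberately uses only the standard library's WSGI interface (the interface Flask itself builds on) so that it runs without dependencies. The `/score` route, the `score` stand-in "model" and the port are assumptions for illustration.

```python
import json
from urllib.parse import parse_qs


def score(text):
    """Stand-in 'model': counts words. A real service would call a trained model."""
    return {"text": text, "n_words": len(text.split())}


def app(environ, start_response):
    """WSGI application exposing a single GET /score?text=... endpoint."""
    headers = [("Content-Type", "application/json")]
    if environ.get("PATH_INFO") != "/score":
        start_response("404 Not Found", headers)
        return [json.dumps({"error": "not found"}).encode("utf-8")]
    query = parse_qs(environ.get("QUERY_STRING", ""))
    text = query.get("text", [""])[0]
    start_response("200 OK", headers)
    return [json.dumps(score(text)).encode("utf-8")]


def serve(host="127.0.0.1", port=8000):
    """Run the app locally; call this by hand, it blocks until interrupted."""
    from wsgiref.simple_server import make_server

    with make_server(host, port, app) as server:
        server.serve_forever()
```

Calling `serve()` and then requesting `/score?text=i+love+my+apple+phone` from a browser returns the JSON produced by `score`. Keeping the model call in its own function means it can be unit tested without starting a server.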

Python Deployment

● Make Python modules (setup.py)
● python setup.py develop # symlink
● Unit tests + coverage
● Use a config system (e.g. github.com/ianozsvald/python_template_with_config)
● Keep Separation of Concerns!
● “12 Factor App” useful ideas
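
A config system in the "12 Factor App" spirit can be as small as a dictionary of named defaults that environment variables may override. This is a sketch under stated assumptions, not the contents of the template repository linked above; `APP_CONFIG`, `APP_DB_URL`, `APP_LOG_LEVEL` and the default values are hypothetical names.

```python
import os


DEFAULTS = {
    "dev": {"db_url": "sqlite:///dev.db", "log_level": "DEBUG"},
    "production": {"db_url": "sqlite:///prod.db", "log_level": "WARNING"},
}


def get_config(environ=None):
    """Pick a config block by name, then let environment variables override it.

    APP_CONFIG, APP_DB_URL and APP_LOG_LEVEL are hypothetical variable names.
    """
    environ = os.environ if environ is None else environ
    name = environ.get("APP_CONFIG", "dev")
    if name not in DEFAULTS:
        raise ValueError("Unknown configuration: {!r}".format(name))
    config = dict(DEFAULTS[name])  # copy so the defaults are never mutated
    config["db_url"] = environ.get("APP_DB_URL", config["db_url"])
    config["log_level"] = environ.get("APP_LOG_LEVEL", config["log_level"])
    config["name"] = name
    return config


# Passing a plain dict instead of os.environ keeps the example deterministic
config = get_config({"APP_CONFIG": "production", "APP_LOG_LEVEL": "INFO"})
print(config)
```

Accepting the environment as a parameter keeps the function easy to unit test, which fits the "unit tests + coverage" bullet.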

Some common gotchas

● MySQL UTF8 is 3 byte by default #sigh
● JavaScript months are 0-based (not 1)
● Never compromise on datetimes (ISO 8601)
● iOS NSDate's epoch is 2001
● Windows CP1252 text (strongly prefer UTF8)
● MongoDB no_cursor_timeout=True
● GitHub's 100MB file limit (new Large File Storage)
● Never throw data away! Never overwrite original data! Always transform it (e.g. Luigi)
● Data duplication bites you in the end...
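
Several of these gotchas are easy to demonstrate. The helpers below are an illustrative sketch with made-up function names: they convert an NSDate offset (seconds since 2001-01-01 UTC) to an ISO 8601 string, shift a JavaScript month to Python's numbering, detect characters that need 4 bytes in UTF-8 (which MySQL's 3-byte `utf8` cannot store; `utf8mb4` can), and decode CP1252 bytes.

```python
from datetime import datetime, timedelta, timezone

# Apple's NSDate counts seconds from 2001-01-01 00:00:00 UTC, not the Unix epoch
NSDATE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def nsdate_to_datetime(seconds):
    """Convert an NSDate reference-date offset (seconds) to an aware datetime."""
    return NSDATE_EPOCH + timedelta(seconds=seconds)


def to_iso8601(moment):
    """Serialise an aware datetime as an ISO 8601 string in UTC."""
    return moment.astimezone(timezone.utc).isoformat()


def js_month_to_python(js_month):
    """JavaScript's Date months run 0..11; Python's datetime uses 1..12."""
    return js_month + 1


def needs_four_byte_utf8(text):
    """True if any character encodes to 4 bytes in UTF-8 (needs utf8mb4 in MySQL)."""
    return any(len(ch.encode("utf-8")) == 4 for ch in text)


def decode_cp1252(raw_bytes):
    """Decode Windows CP1252 bytes to str so they can be re-saved as UTF-8."""
    return raw_bytes.decode("cp1252")


stamp = to_iso8601(nsdate_to_datetime(0))
print(stamp)
```

Storing timestamps as timezone-aware ISO 8601 strings at every system boundary avoids most of the epoch and month-offset confusion above.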

(Perhaps) Avoid Big Data

● Don't be in a rush - 50,000 lines of good data will beat a pile of Bad Big Data

● 244GB RAM EC2+many Xeons $2.80/hr

“Data Science Delivered”

● New mini project / pamphlet
● Includes dirty data strategies, ways to debug ML, thoughts on managing projects - 15 yrs experience (please critique and file bugs!)
● https://github.com/ianozsvald/data_science_delivered
● Please give me your feedback

Closing

● Tell me your dirty data stories, perhaps in a Ruin Pub? (I am automating some of this)
● Takehome - Keep it clean, keep it simple
● Come talk on your projects at our PyDataLondon monthly meetup or start your own!