data scientist enablement dse 400 week 8 roadmap

Data Scientist Enablement DSE 400 - Fast Track to Data Science Week 8 Roadmap Advanced Center of Excellence Modern Renaissance Corporation In Collaboration with SONO team and others Content of this document is under Creative Commons Licence CC BY 4.0

Upload: mohan-bavirisetty

Post on 27-Jan-2015


Page 1: Data scientist enablement   dse 400   week 8 roadmap

Data Scientist Enablement
DSE 400 - Fast Track to Data Science

Week 8 Roadmap

Advanced Center of Excellence
Modern Renaissance Corporation
In Collaboration with SONO team and others

Content of this document is under Creative Commons Licence CC BY 4.0

Page 2: Data scientist enablement   dse 400   week 8 roadmap

Agenda

You can always find the latest version of this document at http://bit.ly/1qbXns0

Week 8 Overview
Mission Statement
Discussions
Learning Path
Activities
Assignment
Submission
DSE Program Timeline
Adaptive Learning Options
References

“Charity and personal force are the only investments worth anything.” - Walt Whitman

Page 3: Data scientist enablement   dse 400   week 8 roadmap

Mission and Objectives

The mission of our program is to provide free, open and world-class enablement of Data Scientists and to help advance the profession of Data Science as well as allied disciplines.

We aim to equip participants with analytical and practical skills, emphasizing breadth and depth across a range of relevant disciplines and capabilities in Data/Decision Sciences, Big Data Analytics, Architecture and Systems Engineering.

Page 4: Data scientist enablement   dse 400   week 8 roadmap

DSE 400 - Week 8 at a glance

Social Discourse: Discuss the ethics of Big Data. Test-drive R-COP and Modern Data Platforms-COP.

Learning plan: Read about Data Quality. Watch the on-demand webinar.

Activities: Explore Google datasets. Start a blog on Big Data. Continue your Personal Roadmap.

Assignment 8: Cleanse and visualize the Sensor dataset. Alternatively, do a case study, write a blog post or create a mini-documentary.

Page 5: Data scientist enablement   dse 400   week 8 roadmap

Social Engagement - Week 8

SONO | LinkedIn | Facebook | Google+

Discussion: Read Big Data’s Dangerous New Era of Discrimination. Research and reflect on how Big Data and its associated technologies are misused or applied unethically. Share your views on how this can be rectified.

You can participate in this discussion on LinkedIn, Facebook and Google+. Discussions on SONO will continue as planned on the DSE 400 Jump Pad. This gives participants more choice, and we hope it will increase social engagement.

Check out the Language R and Modern Data Platforms Communities of Practice (COPs) to help you increase your competence in R, Machine Learning, the Hadoop ecosystem and other platforms. Reach out to Olivia Ramirez, Ellen Brock or Manju Rupani if you want to contribute to these communities or if you have any suggestions.

Page 6: Data scientist enablement   dse 400   week 8 roadmap

Recommended Learning Plan

Read Big Data’s Dangerous New Era of Discrimination
Watch Human Ethical Aspects of Big Data by Grady Booch
<Optional> Read Get a Handle on Big Data Quality
<Optional> Watch Big Data Integration and Governance Use Case - IBM OnDemand Webinar
<Optional> Big Data and the Ethics and Challenges of Living in a Connected Society. O'Reilly Webcast
<Optional> Big Data: Usage, Ethics, Algorithms by Vladislav Shershulsky

Page 7: Data scientist enablement   dse 400   week 8 roadmap

Activities

<Practice> Visit the Google Public Data Directory. Check out the Global Competitiveness Report. Compare your country’s GDP per capita with the world average. Also compare your country’s capacity for innovation with that of other countries in your region, and explore other dimensions in this dataset pertinent to your analysis or area of focus.

<Practice> Continue learning and experimenting with R and the Hadoop ecosystem through R-COP and Modern Data Platforms-COP. Seek and share advice, knowledge and resources. Reach out to Ellen Brock, Manju Rupani or Olivia Ramirez if you want to play a more active role in these communities.

<Practice> Write a blog post on the ongoing disruption in the education sector. Explore sites like Stanford Journal of Social Innovation, blogs.hbr.org, Innovation Excellence, Forbes.com or Asoka Foundation to see if you can publish your blog posts in these communities.

Page 8: Data scientist enablement   dse 400   week 8 roadmap

Examples of Infographics

Source: UNICEF

Page 9: Data scientist enablement   dse 400   week 8 roadmap

Activities

<Practice> Infographics are graphic visual representations of information, data or knowledge intended to present complex information quickly and clearly. Read 10 Free Tools for Creating Infographics. Research a cause you seriously care about and produce a one-page infographic on it. Human rights, the environment, poverty elimination, the fight against child labor, corruption, equality, religious harmony and the prevention of cruelty to animals are a few examples of causes people around the planet care about.

<Practice> Continue your earlier work (or start afresh, if you haven’t started yet) on your Personal Career Advancement Roadmap. Revise it to take advantage of the Certification options available in the DSE program. Read or listen to Malcolm Gladwell’s Outliers: The Story of Success.

<Optional> <Advanced Research> Techniques for Fraud Prevention. Read Improving Credit Card Fraud Prevention Using a Meta Learning Strategy and explore how this framework can be applied to build robust fraud prevention solutions in your industry.

Page 10: Data scientist enablement   dse 400   week 8 roadmap

Assignment 8 - Submission Required

Option A - HDP 2.0 | R-sqldf | BigQuery

Download the Hortonworks Sensor Data from Amazon. Using HDP 2.0 (or its equivalent), R-sqldf or Google BigQuery, complete the following steps.

a) Import the raw data and clean it using the HiveSQL script (see next slide) or an equivalent technique.
b) Download and import the cleansed data (hvac_building) into a spreadsheet such as Google Spreadsheet, Excel or OpenOffice Calc, or into any visualization tool you are familiar with.
c) Visualize the data, showing the geographic distribution pattern based on the hvac_building table.

You may reach out to Rachel Fleming <[email protected]> if you have any difficulties with the assignments or are looking for more challenging assignments or activities.
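As a rough illustration of steps b) and c), the sketch below summarizes readings per country and writes the result as CSV text that a spreadsheet or mapping tool could import. It is only one possible approach, not the method prescribed by the assignment. The column names country and extremetemp come from the HiveSQL script in this assignment; the function names, the output column names and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def summarize_by_country(rows):
    """Count extreme and total readings per country.

    rows: iterable of (country, extremetemp) pairs, where extremetemp is
    '1'/'0' (or 1/0), as produced by the hvac_building table.
    Returns {country: (extreme_readings, total_readings, extreme_share)}.
    """
    totals = defaultdict(lambda: [0, 0])  # country -> [extreme, total]
    for country, extremetemp in rows:
        totals[country][0] += int(extremetemp)
        totals[country][1] += 1
    return {c: (e, n, e / n) for c, (e, n) in sorted(totals.items())}


def to_csv(summary):
    """Render the per-country summary as CSV text for a spreadsheet import."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Output column names are assumptions chosen for this sketch.
    writer.writerow(["country", "extreme_readings", "total_readings", "extreme_share"])
    for country, (extreme, total, share) in summary.items():
        writer.writerow([country, extreme, total, f"{share:.2f}"])
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical sample rows, not taken from the Sensor dataset.
    sample = [("Country A", "1"), ("Country A", "0"),
              ("Country B", "1"), ("Country A", "1")]
    print(to_csv(summarize_by_country(sample)))
```

A per-country share like this maps naturally onto a geographic chart, since most spreadsheet and visualization tools can color regions by a single numeric column.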

Page 11: Data scientist enablement   dse 400   week 8 roadmap

Assignment 8 - Option A: HiveSQL Scripts

Extract the raw tables hvac and building from the Sensor.zip file, then execute the following HiveSQL scripts to generate the tables hvac_temperatures and hvac_building.

-- Add the target/actual difference, a temperature range label and an extreme-temperature flag
create table hvac_temperatures as
select *,
  targettemp - actualtemp as temp_diff,
  IF((targettemp - actualtemp) > 5, 'COLD',
    IF((targettemp - actualtemp) < -5, 'HOT', 'NORMAL')) AS temprange,
  IF((targettemp - actualtemp) > 5, '1',
    IF((targettemp - actualtemp) < -5, '1', '0')) AS extremetemp  -- '0' is quoted so every branch returns a string
from hvac;

-- Join each reading with the attributes of its building
create table if not exists hvac_building as
select h.*, b.country, b.hvacproduct, b.buildingage, b.buildingmgr
from building b
join hvac_temperatures h on b.buildingid = h.buildingid;

(Source: Hortonworks)
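For participants without access to Hive, the "equivalent technique" mentioned in Option A could look something like the sketch below, which reproduces the two statements with Python's built-in sqlite3 module. This is an assumption-laden illustration, not part of the Hortonworks material: SQLite has no IF() function, so CASE expressions are swapped in, the flag is written consistently as the text '1' or '0', and the demo rows are hypothetical rather than taken from Sensor.zip. Only columns named in the scripts above are used.

```python
import sqlite3


def build_hvac_tables(conn):
    """Create hvac_temperatures and hvac_building from existing hvac and building tables."""
    conn.executescript("""
        DROP TABLE IF EXISTS hvac_temperatures;
        CREATE TABLE hvac_temperatures AS
        SELECT *,
               targettemp - actualtemp AS temp_diff,
               CASE WHEN targettemp - actualtemp > 5 THEN 'COLD'
                    WHEN targettemp - actualtemp < -5 THEN 'HOT'
                    ELSE 'NORMAL' END AS temprange,
               CASE WHEN targettemp - actualtemp > 5 THEN '1'
                    WHEN targettemp - actualtemp < -5 THEN '1'
                    ELSE '0' END AS extremetemp
        FROM hvac;

        CREATE TABLE IF NOT EXISTS hvac_building AS
        SELECT h.*, b.country, b.hvacproduct, b.buildingage, b.buildingmgr
        FROM building b
        JOIN hvac_temperatures h ON b.buildingid = h.buildingid;
    """)


def demo():
    """Run the scripts on a tiny in-memory database with hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hvac (buildingid INTEGER, targettemp INTEGER, actualtemp INTEGER)")
    conn.execute(
        "CREATE TABLE building (buildingid INTEGER, buildingmgr TEXT, "
        "buildingage INTEGER, hvacproduct TEXT, country TEXT)")
    # Hypothetical readings: one too cold, one too hot, one within range.
    conn.executemany("INSERT INTO hvac VALUES (?, ?, ?)",
                     [(1, 70, 60), (2, 65, 75), (3, 68, 66)])
    conn.executemany("INSERT INTO building VALUES (?, ?, ?, ?, ?)",
                     [(1, "M1", 25, "Product A", "Country A"),
                      (2, "M2", 10, "Product B", "Country B"),
                      (3, "M3", 5, "Product A", "Country A")])
    build_hvac_tables(conn)
    return conn.execute(
        "SELECT buildingid, temp_diff, temprange, extremetemp, country "
        "FROM hvac_building ORDER BY buildingid").fetchall()


if __name__ == "__main__":
    for row in demo():
        print(row)
```

With real data, the hvac and building tables would first be loaded from the CSV files in Sensor.zip; the loading step is left out here because the file layout is not described in this roadmap.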

Page 12: Data scientist enablement   dse 400   week 8 roadmap

Assignment 8 - Submission Required

Options B, C and D

Option B - Data-Driven Philanthropy
Do a case study on how organizations like the Red Cross, UNICEF, the Gates Foundation or Oxfam are using data-driven strategies to promote global health and development.

Option C - Ethics and Big Data
Write a blog post or short article on the ethical application of Big Data technologies in the areas or sectors you care about. (Fighting poverty, child labor, illiteracy and ecological degradation are a few examples.)

Option D - Biopic or mini-documentary on Florence Nightingale
Florence Nightingale came to prominence for her outstanding service orientation and for originating modern nursing practices. She also employed statistics and data-driven decision management approaches. Research Florence Nightingale and produce a short biopic or mini-documentary about her.

You may reach out to Rachel Fleming <[email protected]> if you have any difficulties with the assignments or are looking for more challenging assignments or activities.

Page 13: Data scientist enablement   dse 400   week 8 roadmap

Submission in PDF format is required

Recommended Deadline: Saturday, 11:59 PM your local time. If you can’t submit your assignment on time, please complete it and turn it in as soon as possible. There is no penalty for late submission, but turning in assignments on time will help you focus on next week’s lessons.

Mail Assignment 8 to <[email protected]> with DSE 400 > Assignment 8 in the subject line. Submit a single PDF document showing your queries and result samples, with screenshots as necessary. For consistency, name your document DSE 400 - Assignment 8 - Your Full Name. Do not send document links; only one single PDF document is accepted.

Page 14: Data scientist enablement   dse 400   week 8 roadmap

DSE Program 2014 timeline

Fast Track to Data Science (DSE 400) | Jan 19 - Mar 15
Machine Learning with R (DSE 501) | Mar 30 - May 10
Modern Data Platforms (DSE 502) | May 25 - July 5
Advanced Techniques in Big Data Analytics (DSE 600) | July 20 - Aug 30

Page 15: Data scientist enablement   dse 400   week 8 roadmap

Adaptive Learning Options
Data Scientist Enablement program

Maturity | Composite Score * | Proficiency | Certificate
Level 5 | > 90 | Innovating Capability | Black Belt
Level 4 | > 80 and <= 90 | Architectural Capability | Green Belt
Level 3 | > 70 and <= 80 | Solutioning Capability | Yellow Belt
Level 2 | > 60 and <= 70 | Basic Understanding | Completion
Level 1 | <= 60 | Basic Familiarity | Audit

* The composite score takes into account participants' performance in assignments, activities, projects, social engagement, collaboration, team development, publications, advanced research etc. across all 4 modules of the DSE program.
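To make the score boundaries concrete, here is a minimal sketch that maps a composite score to its maturity level, proficiency and certificate. The thresholds are copied from the table above; the function name and the assumption that scores are plain numbers on the same scale as the table are illustrative only, and how the composite score itself is computed is not specified in this roadmap.

```python
# Each entry: (exclusive lower bound, level, proficiency, certificate),
# ordered from the highest level down, as in the Adaptive Learning Options table.
LEVELS = [
    (90, 5, "Innovating Capability", "Black Belt"),
    (80, 4, "Architectural Capability", "Green Belt"),
    (70, 3, "Solutioning Capability", "Yellow Belt"),
    (60, 2, "Basic Understanding", "Completion"),
]


def classify(score):
    """Return (level, proficiency, certificate) for a composite score."""
    for lower_bound, level, proficiency, certificate in LEVELS:
        if score > lower_bound:  # bounds are strict: exactly 90 is still Level 4
            return level, proficiency, certificate
    # Anything at or below 60 falls into Level 1.
    return 1, "Basic Familiarity", "Audit"


if __name__ == "__main__":
    for s in (95, 90, 75, 60):
        print(s, classify(s))
```

Note that the upper bounds are inclusive, so a score of exactly 90 earns the Green Belt and a score of exactly 60 corresponds to Audit.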

Page 16: Data scientist enablement   dse 400   week 8 roadmap

References, Resources and Additional Reading

Ethics of Big Data. Davis and Patterson. O’Reilly Publications. 2012
Outliers: The Story of Success. Malcolm Gladwell. Little, Brown and Company. 2008
SQL Tutorial. W3Schools.com
Improving Credit Card Fraud Prevention Using a Meta Learning Strategy. Joseph King-Fung Pun. 2011
17 short tutorials all Data Scientists should read (and practice). Dr. Granville. Data Science Central
Hadoop Illuminated. Kerzner and Maniyam. Hadoop Illuminated LLC. 2013
Hadoop: The Definitive Guide. 3rd Edition. Tom White. O’Reilly Publications. 2012
MapReduce: Simplified Data Processing on Large Clusters. Dean and Ghemawat. Google. 2004
[MIT OCW] How to Process, Analyze and Visualize Data. Marcus & Wu. 2012
[MIT OCW] Ethical Practice: Professionalism, Social Responsibility… Prof Leigh Hafrey. 2012
Big Data - Hadoop, Hive, Pig and Hbase video collection
Modern Data Platforms-Community of Practice
Language R-Community of Practice
Data Science Enablement playlist

Page 17: Data scientist enablement   dse 400   week 8 roadmap

Citation

Content that appears as is, on this document only, is under Creative Commons License CC BY 4.0. This license may not necessarily apply to other material referenced in this document.

The Sensor dataset used in this week’s assignment is attributed to Hortonworks. This dataset is not available under a Creative Commons Licence.

Content from IBM, Hortonworks, Google, Youtube, Data Science Central and O’Reilly Media etc. is excluded from the above Creative Commons License.

Page 18: Data scientist enablement   dse 400   week 8 roadmap

For More Information

Week 8 discussions take place during this week on the DSE 400 forums on LinkedIn, Facebook, Google+ and SONO. There is also an active Q&A session for everyone's benefit. Also check out the Language R-Community of Practice if you would like to advance your competence in R or contribute to this community.

<Mentoring On Demand> You may reach out to Ms. Rachel Fleming <[email protected]> if you have any difficulties with the assignments or are looking for more challenging activities. If you need a mentor or someone to help you accelerate along the DSE program, you may reach out to Vishal Kumar <[email protected]> or Ligia Buzan <[email protected]>.

We welcome questions, thoughts and suggestions. Post these in the right forums/discussions or write to us at <[email protected]>

You can always find the latest version of this document and other DSE 400 roadmaps at http://bitly.com/bundles/o_4ldaljhta4/1

Page 19: Data scientist enablement   dse 400   week 8 roadmap

Thank You