DS280 - Spring 2018 - Final Project Presentation - Olga ...

DS 280 – SPRING 2018 Big Data Architecture Final Project Olga Poleleyeva 1


Post on 05-Aug-2020


TRANSCRIPT


DS 280 – SPRING 2018

Big Data Architecture Final Project

Olga Poleleyeva

1


2

Bicycle sharing and weather conditions

Goal: To analyze Chicago bicycle-sharing data in relation to the available weather information, and to look for possible correlations between the number, frequency, or duration of bicycle trips and weather conditions or other variables, in order to improve the program.


3

Dataset Details

• File Type: CSV
• File Size: 3.24 GB
• Data Source: Kaggle, https://www.kaggle.com/yingwurenjian/chicago-divvy-bicycle-sharing-data

The dataset contains bicycle-sharing data for Chicago from 2013 to 2017, taken from the Divvy website and combined with weather information from the Wunderground website. There are 13,774,715 records in total, in 27 columns. Each trip is recorded with its start and end place and time, its duration, and additional information about the bike stations. The detailed weather information includes temperature, visibility, precipitation, and so on.


4

DB & Table Details

• Database Name: user13_final_db
• Table Name: bikeshare_initial_tbl
• Table Type: external
• Field names were exactly as in the dataset
• The initial table had information that was not intended to be used for this data analysis, so a new table was created.


5

Adjusted Table Details

• Table Name: bikeshare_final_tbl
• Table Type: internal
• Fields were adjusted accordingly:
  - To make the results more user-friendly, trip duration was recalculated from seconds into minutes
  - To get a more meaningful analysis, all trips shorter than 5 minutes were excluded
  - To keep the table size easier to manage, some weather and station info fields were dropped
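
The adjustments listed above could be expressed as a single CREATE TABLE AS SELECT statement. What follows is only a sketch, not the statement actually used for the project. It assumes that the initial table stores the duration in seconds in a column named tripduration (a hypothetical name) and that the remaining column names match those used in the queries in this presentation. In Hive, a table created this way without the EXTERNAL keyword is a managed (internal) table.

```sql
-- Sketch only. `tripduration` (seconds) is an assumed column name in the
-- initial table; the other column names are taken from the queries in this deck.
USE user13_final_db;

CREATE TABLE bikeshare_final_tbl AS
SELECT
    trip_id,
    starttime,
    stoptime,
    tripduration / 60 AS trip_in_minutes,  -- seconds converted to minutes
    usertype,
    gender,
    from_station_id,
    from_station_name,
    to_station_id,
    to_station_name,
    temperature,
    events,
    conditions,
    visibility,
    wind_speed
FROM bikeshare_initial_tbl
WHERE tripduration >= 300;                 -- drop trips shorter than 5 minutes
```

Listing the columns explicitly, instead of SELECT *, is what drops the unused weather and station fields.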


6


7

Data Discovery - Query 1

How many times were the bikes returned on the same day, on the next day, and so on?

SELECT DATEDIFF(stoptime, starttime) AS trip_days,
       COUNT(trip_id) AS number_of_trips
FROM bikeshare_final_tbl
GROUP BY DATEDIFF(stoptime, starttime);

Results:

trip_days   number_of_trips
0           11,838,936
1           46,111
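
The share of trips that end on a later day than they start can also be computed directly. The query below is a minimal sketch that reuses the table and columns from Query 1; the output column names are made up for illustration. If the two rows shown above are the complete result, the share works out to 46,111 / (11,838,936 + 46,111), which is about 0.39%.

```sql
-- Sketch: percentage of trips whose stop date is later than the start date.
-- Output column names (later_day_trips, all_trips, pct_later_day) are illustrative.
SELECT
    SUM(CASE WHEN DATEDIFF(stoptime, starttime) > 0 THEN 1 ELSE 0 END) AS later_day_trips,
    COUNT(*) AS all_trips,
    ROUND(100.0 * SUM(CASE WHEN DATEDIFF(stoptime, starttime) > 0 THEN 1 ELSE 0 END)
          / COUNT(*), 2) AS pct_later_day
FROM bikeshare_final_tbl;
```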


8

Data Discovery - Query 2

What is the bike rental distribution depending on the temperature ranges?

SELECT temp_range, FORMAT_NUMBER(COUNT(trip_id), 0) AS number_of_trips
FROM (
    SELECT trip_id, temperature,
        CASE
            WHEN temperature > 85 THEN '86F and above: VERY HOT'
            WHEN temperature > 76 THEN '77F - 85F: HOT'
            WHEN temperature > 54 THEN '55F - 76F: WARM'
            WHEN temperature > 32 THEN '33F - 54F: COOL'
            WHEN temperature > 0 THEN '1F - 32F: COLD'
            ELSE '0F and below: VERY COLD'
        END AS temp_range
    FROM bikeshare_final_tbl
) AS trips_by_range
GROUP BY temp_range
ORDER BY temp_range DESC;


9

Data Discovery - Query 2

Results: bike rental distribution depending on the temperature ranges

[Chart: Number of trips by temperature range. X-axis: Temperature Range (86F and above: VERY HOT; 77F - 85F: HOT; 55F - 76F: WARM; 33F - 54F: COOL; 1F - 32F: COLD; 0F and below: VERY COLD). Y-axis: Number of trips, 0 to 6,000,000.]


10

Data Discovery - Query 3

What is the bike rental distribution depending on time of day?

SELECT time_range, COUNT(trip_id) AS number_of_trips
FROM (
    SELECT trip_id, HOUR(starttime) AS start_hour,
        -- HOUR() returns an integer, so the bounds are integer literals
        CASE
            WHEN HOUR(starttime) > 5 AND HOUR(starttime) <= 8 THEN '1. 6:00 am - 8:59 am: morning commute'
            WHEN HOUR(starttime) > 8 AND HOUR(starttime) <= 10 THEN '2. 9:00 am - 10:59 am: morning'
            WHEN HOUR(starttime) > 10 AND HOUR(starttime) <= 13 THEN '3. 11:00 am - 1:59 pm: lunch break'
            WHEN HOUR(starttime) > 13 AND HOUR(starttime) <= 15 THEN '4. 2:00 pm - 3:59 pm: afternoon'
            WHEN HOUR(starttime) > 15 AND HOUR(starttime) <= 18 THEN '5. 4:00 pm - 6:59 pm: evening commute'
            WHEN HOUR(starttime) > 18 AND HOUR(starttime) <= 20 THEN '6. 7:00 pm - 8:59 pm: dinner time'
            WHEN HOUR(starttime) > 20 AND HOUR(starttime) <= 23 THEN '7. 9:00 pm - 11:59 pm: party time'
            ELSE '8. 12:00 am - 5:59 am: night time'
        END AS time_range
    FROM bikeshare_final_tbl
) AS trips_by_range
GROUP BY time_range
ORDER BY time_range;


11

Data Discovery - Query 3

Results: bike rental distribution depending on time of day


12

Data Discovery - Query 3

What if we add some weather conditions?

SELECT time_range, COUNT(trip_id) AS number_of_trips_in_storm
FROM (
    SELECT trip_id, HOUR(starttime) AS start_hour,
        -- HOUR() returns an integer, so the bounds are integer literals
        CASE
            WHEN HOUR(starttime) > 5 AND HOUR(starttime) <= 8 THEN '1. 6:00 am - 8:59 am: morning commute'
            WHEN HOUR(starttime) > 8 AND HOUR(starttime) <= 10 THEN '2. 9:00 am - 10:59 am: morning'
            WHEN HOUR(starttime) > 10 AND HOUR(starttime) <= 13 THEN '3. 11:00 am - 1:59 pm: lunch break'
            WHEN HOUR(starttime) > 13 AND HOUR(starttime) <= 15 THEN '4. 2:00 pm - 3:59 pm: afternoon'
            WHEN HOUR(starttime) > 15 AND HOUR(starttime) <= 18 THEN '5. 4:00 pm - 6:59 pm: evening commute'
            WHEN HOUR(starttime) > 18 AND HOUR(starttime) <= 20 THEN '6. 7:00 pm - 8:59 pm: dinner time'
            WHEN HOUR(starttime) > 20 AND HOUR(starttime) <= 23 THEN '7. 9:00 pm - 11:59 pm: party time'
            ELSE '8. 12:00 am - 5:59 am: night time'
        END AS time_range
    FROM bikeshare_final_tbl
    WHERE events = 'tstorms'  -- restrict to trips taken during thunderstorms
) AS trips_by_range
GROUP BY time_range
ORDER BY time_range;
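
Instead of running the query once per weather condition, the separate counts could be gathered in one pass with conditional aggregation. The sketch below is an illustration rather than the approach used in the project. It groups by the raw start hour to stay short, and the event values 'clear' and 'rain or snow' are assumptions; only 'tstorms' appears in the query above.

```sql
-- Sketch: one scan of the table, one output column per weather condition.
-- 'clear' and 'rain or snow' are assumed values of the events column;
-- 'tstorms' is the value used in the query above.
SELECT
    HOUR(starttime) AS start_hour,
    COUNT(trip_id) AS number_of_trips,
    SUM(CASE WHEN events = 'clear' THEN 1 ELSE 0 END) AS number_of_trips_when_sunny,
    SUM(CASE WHEN events = 'rain or snow' THEN 1 ELSE 0 END) AS number_of_trips_in_rain,
    SUM(CASE WHEN events = 'tstorms' THEN 1 ELSE 0 END) AS number_of_trips_in_storm
FROM bikeshare_final_tbl
GROUP BY HOUR(starttime)
ORDER BY start_hour;
```

The same SUM(CASE ...) columns can be placed in the outer SELECT of the time-range query if the labelled ranges are preferred over raw hours.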


13

Data Discovery - Query 3

Results: bike rental distribution depending on time of day for some weather conditions

[Chart: Number of trips by time range affected by weather conditions. Series: number_of_trips, number_of_trips_when_sunny, number_of_trips_in_rain, number_of_trips_in_storm. Two vertical axes, both in thousands: 0 to 160 and 0 to 3,500.]


14

Data Discovery - Query 4

Display 10 longest trips (in hours and minutes) which happened with no visibility in the snow

SELECT from_station_name, to_station_name,
       TO_DATE(starttime) AS startdate,
       trip_in_minutes,
       CONCAT(CAST(FLOOR(trip_in_minutes / 60) AS STRING), ' hrs ',
              CAST(trip_in_minutes - 60 * FLOOR(trip_in_minutes / 60) AS STRING), ' min') AS trip_time,
       temperature, trip_id
FROM bikeshare_final_tbl AS f
WHERE LOWER(events) = 'snow' AND visibility <= 0
GROUP BY from_station_name, to_station_name, starttime, trip_in_minutes, temperature, trip_id
ORDER BY trip_in_minutes DESC
LIMIT 10;



15

Data Discovery - Query 5

What is the total duration of all trips taken on September 11 of each year, in hours rounded to one decimal place, and how did it depend on weather conditions?

SELECT x.year,
       FORMAT_NUMBER(SUM(x.tim) / 60, 1) AS total_trips_in_hours,
       x.conditions
FROM (
    -- Note: this inner GROUP BY collapses trips that share the same
    -- start time, duration, and conditions into a single row.
    SELECT YEAR(starttime) AS year, trip_in_minutes AS tim, conditions
    FROM bikeshare_final_tbl
    WHERE MONTH(starttime) = 9 AND DAY(starttime) = 11
    GROUP BY starttime, trip_in_minutes, conditions
) AS x
GROUP BY x.year, x.conditions;

9-11-2013: 1,397.9 hours
9-11-2014: 1,466.0 hours
9-11-2015: 1,788.4 hours
9-11-2016: 5,116.6 hours
9-11-2017: 3,835.4 hours
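
Because the question is about how usage depends on weather, it may help to see the number of trips next to the total hours. The following is a sketch of one way to do that in a single query without a subquery; the output column names are illustrative and it is not the query that produced the figures above.

```sql
-- Sketch: trips and total hours on September 11, per year and weather condition.
-- Output column names (trip_year, total_trip_hours) are illustrative.
SELECT
    YEAR(starttime) AS trip_year,
    conditions,
    COUNT(trip_id) AS number_of_trips,
    ROUND(SUM(trip_in_minutes) / 60, 1) AS total_trip_hours
FROM bikeshare_final_tbl
WHERE MONTH(starttime) = 9
  AND DAY(starttime) = 11
GROUP BY YEAR(starttime), conditions
ORDER BY trip_year, conditions;
```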


16

Data Discovery - Query 5

9-11-2013: 1,397.9 hours
9-11-2014: 1,466.0 hours
9-11-2015: 1,788.4 hours
9-11-2016: 5,116.6 hours
9-11-2017: 3,835.4 hours


17

Data Discovery - Query 6

What is the average trip duration in minutes, rounded to one decimal, for trips starting and ending at the same station, taken by female subscribers when the wind speed is above 20?

SELECT ROUND(AVG(trip_in_minutes), 1) AS average_trip_length,
       wind_speed
FROM bikeshare_final_tbl AS f
WHERE from_station_id = to_station_id
  AND UPPER(usertype) = 'SUBSCRIBER'
  AND UPPER(gender) = 'FEMALE'
  AND wind_speed > 20
GROUP BY wind_speed
ORDER BY wind_speed;
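
The relationship between wind speed and trip length could also be summarized with a single number. The sketch below uses Hive's built-in CORR aggregate (Pearson correlation) on the same subset of trips, without the wind speed threshold so that the full range of wind speeds is covered. This is an additional check under those assumptions, not part of the original analysis, and the output column names are illustrative.

```sql
-- Sketch: Pearson correlation between wind speed and trip length for
-- round trips (same start and end station) taken by female subscribers.
SELECT
    CORR(wind_speed, trip_in_minutes) AS wind_vs_duration_corr,
    COUNT(*) AS trips_considered
FROM bikeshare_final_tbl
WHERE from_station_id = to_station_id
  AND UPPER(usertype) = 'SUBSCRIBER'
  AND UPPER(gender) = 'FEMALE';
```

A coefficient close to 0 would be in line with the averages showing no clear trend as wind speed rises.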


18

Bicycle sharing and weather conditions

Results:
1. Fewer than 0.4% of trips (46,111 of roughly 11.9 million in the Query 1 results) end on a later day than they start
2. Most bicycles are rented when the temperature is between 55F and 76F
3. The most popular time is between 4 and 7 pm, regardless of weather
4. Harsh weather conditions such as snow and no visibility make trips shorter, but they do not stop people from using the bicycle sharing system
5. Comparing the data for the same calendar date across years suggests a correlation between nice weather conditions and a higher number of trips
6. There doesn't seem to be a correlation between the wind speed and the length of the trips

Applications:
By finding the weaker spots in the bicycle sharing system, an incentives program could be implemented to increase the use of the system on slow days and times.


19

Bicycle sharing and weather conditions

e-Portfolio: https://eportolapol.wordpress.com/


20

Bicycle sharing and weather conditions