DS 280 – SPRING 2018
Big Data Architecture
Final Project
Olga Poleleyeva
1
2
Bicycle sharing and weather conditions
Goal: To analyze Chicago bicycle sharing data in relation to available weather information and find possible correlations between the number, frequency, or duration of bicycle trips and weather conditions or other variables, in order to improve the program.
3
Dataset Details
• File Type: CSV
• File Size: 3.24 GB
• Data Source: Kaggle
https://www.kaggle.com/yingwurenjian/chicago-divvy-bicycle-sharing-data
The dataset contains bicycle sharing data for Chicago from 2013 to 2017 from the Divvy website, combined with weather information from the Wunderground website. There are a total of 13,774,715 records in 27 columns. Each trip is recorded with start/end place and time, duration, and additional bike station info. The detailed weather info includes temperature, visibility, precipitation, and so on.
4
DB & Table Details
• Database Name: user13_final_db
• Table Name: bikeshare_initial_tbl
• Table Type: external
• Field names were exactly as in the dataset
• The initial table had information that was not intended to be used for this data analysis, so a new table was created.
5
Adjusted Table Details
• Table Name: bikeshare_final_tbl
• Table Type: internal
• Fields were adjusted accordingly:
- To make results more user-friendly, trip duration was recalculated from seconds into minutes
- To get a more meaningful analysis, all trips shorter than 5 minutes were excluded
- To make the table size easier to manage, some weather and station info fields were dropped
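The adjustments listed above could be expressed as a single CREATE TABLE AS SELECT statement. This is only a sketch, not the statement actually used for the project: the source column name `tripduration` (in seconds) is an assumption, and the column list is limited to the fields that appear in the queries later in this deck. In Hive, a table created this way is managed (internal) by default.

```sql
-- Sketch only. `tripduration` (seconds) is an ASSUMED column name in the
-- initial table; the other names are taken from the queries in this deck.
USE user13_final_db;

CREATE TABLE bikeshare_final_tbl AS
SELECT
  trip_id,
  usertype,
  gender,
  starttime,
  stoptime,
  tripduration / 60 AS trip_in_minutes,  -- seconds converted to minutes
  from_station_id,
  from_station_name,
  to_station_id,
  to_station_name,
  temperature,
  visibility,
  wind_speed,
  events,
  conditions
FROM bikeshare_initial_tbl
WHERE tripduration >= 300;  -- exclude trips shorter than 5 minutes (300 s)
```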
6
7
Data Discovery - Query 1
How many times were the bikes returned on the same day, on the next
day, and so on?
SELECT DATEDIFF(stoptime, starttime) AS trip_days,
       COUNT(trip_id) AS number_of_trips
FROM bikeshare_final_tbl
GROUP BY DATEDIFF(stoptime, starttime);
Results:
trip_days   number_of_trips
0           11,838,936
1           46,111
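To read these counts as shares of all trips, the same grouping can be wrapped in a window total. The query below is a sketch that assumes a Hive version with windowing support; the alias `pct_of_trips` is introduced here for illustration only.

```sql
-- Sketch: percentage of trips for each day difference.
-- SUM(...) OVER () computes the grand total across all groups.
SELECT trip_days,
       number_of_trips,
       ROUND(100 * number_of_trips / SUM(number_of_trips) OVER (), 2) AS pct_of_trips
FROM (
  SELECT DATEDIFF(stoptime, starttime) AS trip_days,
         COUNT(trip_id) AS number_of_trips
  FROM bikeshare_final_tbl
  GROUP BY DATEDIFF(stoptime, starttime)
) AS trips_by_days
ORDER BY trip_days;
```

Using only the two result rows shown above, next-day returns come to 46,111 / (11,838,936 + 46,111), or roughly 0.39% of trips.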
8
Data Discovery - Query 2
What is the bike rental distribution depending on the temperature ranges?
SELECT temp_range,
       FORMAT_NUMBER(COUNT(trip_id), 0) AS number_of_trips
FROM (
  SELECT trip_id, temperature,
         CASE
           WHEN temperature > 85 THEN '86F and above: VERY HOT'
           WHEN temperature > 76 THEN '77F - 85F: HOT'
           WHEN temperature > 54 THEN '55F - 76F: WARM'
           WHEN temperature > 32 THEN '33F - 54F: COOL'
           WHEN temperature > 0 THEN '1F - 32F: COLD'
           ELSE '0F and below: VERY COLD'
         END AS temp_range
  FROM bikeshare_final_tbl
) AS trips_by_range
GROUP BY temp_range
ORDER BY temp_range DESC;
9
Data Discovery - Query 2
Results: bike rental distribution depending on the temperature ranges
[Bar chart: Number of trips by temperature range. X-axis: Temperature Range (86F and above: VERY HOT; 77F - 85F: HOT; 55F - 76F: WARM; 33F - 54F: COOL; 1F - 32F: COLD; 0F and below: VERY COLD). Y-axis: Number of trips (0 to 6,000,000).]
10
Data Discovery - Query 3
What is the bike rental distribution depending on time of day?
SELECT time_range, COUNT(trip_id) AS number_of_trips
FROM (
  SELECT trip_id, HOUR(starttime) AS start_hour,
         CASE
           WHEN HOUR(starttime) > 5 AND HOUR(starttime) <= 8 THEN '1. 6:00 am - 8:59 am: morning commute'
           WHEN HOUR(starttime) > 8 AND HOUR(starttime) <= 10 THEN '2. 9:00 am - 10:59 am: morning'
           WHEN HOUR(starttime) > 10 AND HOUR(starttime) <= 13 THEN '3. 11:00 am - 1:59 pm: lunch break'
           WHEN HOUR(starttime) > 13 AND HOUR(starttime) <= 15 THEN '4. 2:00 pm - 3:59 pm: afternoon'
           WHEN HOUR(starttime) > 15 AND HOUR(starttime) <= 18 THEN '5. 4:00 pm - 6:59 pm: evening commute'
           WHEN HOUR(starttime) > 18 AND HOUR(starttime) <= 20 THEN '6. 7:00 pm - 8:59 pm: dinner time'
           WHEN HOUR(starttime) > 20 AND HOUR(starttime) <= 23 THEN '7. 9:00 pm - 11:59 pm: party time'
           ELSE '8. 12:00 am - 5:59 am: night time'
         END AS time_range
  FROM bikeshare_final_tbl
) AS trips_by_range
GROUP BY time_range
ORDER BY time_range;
11
Data Discovery - Query 3
Results: bike rental distribution depending on time of day
12
Data Discovery - Query 3
What if we add some weather conditions?
SELECT time_range, COUNT(trip_id) AS number_of_trips_in_storm
FROM (
  SELECT trip_id, HOUR(starttime) AS start_hour,
         CASE
           WHEN HOUR(starttime) > 5 AND HOUR(starttime) <= 8 THEN '1. 6:00 am - 8:59 am: morning commute'
           WHEN HOUR(starttime) > 8 AND HOUR(starttime) <= 10 THEN '2. 9:00 am - 10:59 am: morning'
           WHEN HOUR(starttime) > 10 AND HOUR(starttime) <= 13 THEN '3. 11:00 am - 1:59 pm: lunch break'
           WHEN HOUR(starttime) > 13 AND HOUR(starttime) <= 15 THEN '4. 2:00 pm - 3:59 pm: afternoon'
           WHEN HOUR(starttime) > 15 AND HOUR(starttime) <= 18 THEN '5. 4:00 pm - 6:59 pm: evening commute'
           WHEN HOUR(starttime) > 18 AND HOUR(starttime) <= 20 THEN '6. 7:00 pm - 8:59 pm: dinner time'
           WHEN HOUR(starttime) > 20 AND HOUR(starttime) <= 23 THEN '7. 9:00 pm - 11:59 pm: party time'
           ELSE '8. 12:00 am - 5:59 am: night time'
         END AS time_range
  FROM bikeshare_final_tbl
  WHERE events = 'tstorms'
) AS trips_by_range
GROUP BY time_range
ORDER BY time_range;
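Rather than running the query once per weather condition, the counts for several conditions could be collected in one pass with conditional aggregation. This is a sketch, not the query used for the chart: the `events` values 'clear' and 'rain or snow' are assumptions about how the dataset labels those conditions (only 'tstorms' appears in the query above), and the output column names mirror the chart legend.

```sql
-- Sketch: one pass over the table, one count column per weather condition.
-- ASSUMPTION: 'clear' and 'rain or snow' are valid values of `events`.
SELECT time_range,
       COUNT(trip_id) AS number_of_trips,
       SUM(CASE WHEN events = 'clear' THEN 1 ELSE 0 END) AS number_of_trips_when_sunny,
       SUM(CASE WHEN events = 'rain or snow' THEN 1 ELSE 0 END) AS number_of_trips_in_rain,
       SUM(CASE WHEN events = 'tstorms' THEN 1 ELSE 0 END) AS number_of_trips_in_storm
FROM (
  SELECT trip_id, events,
         CASE
           WHEN HOUR(starttime) BETWEEN 6 AND 8 THEN '1. 6:00 am - 8:59 am: morning commute'
           WHEN HOUR(starttime) BETWEEN 9 AND 10 THEN '2. 9:00 am - 10:59 am: morning'
           WHEN HOUR(starttime) BETWEEN 11 AND 13 THEN '3. 11:00 am - 1:59 pm: lunch break'
           WHEN HOUR(starttime) BETWEEN 14 AND 15 THEN '4. 2:00 pm - 3:59 pm: afternoon'
           WHEN HOUR(starttime) BETWEEN 16 AND 18 THEN '5. 4:00 pm - 6:59 pm: evening commute'
           WHEN HOUR(starttime) BETWEEN 19 AND 20 THEN '6. 7:00 pm - 8:59 pm: dinner time'
           WHEN HOUR(starttime) BETWEEN 21 AND 23 THEN '7. 9:00 pm - 11:59 pm: party time'
           ELSE '8. 12:00 am - 5:59 am: night time'
         END AS time_range
  FROM bikeshare_final_tbl
) AS trips_by_range
GROUP BY time_range
ORDER BY time_range;
```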
13
Data Discovery - Query 3
Results: bike rental distribution depending on time of day for some weather conditions
[Chart: Number of trips by time range affected by weather conditions. Series: number_of_trips, number_of_trips_when_sunny, number_of_trips_in_rain, number_of_trips_in_storm. Two vertical axes, both in thousands (0 to 160 and 0 to 3,500).]
14
Data Discovery - Query 4
Display 10 longest trips (in hours and minutes) which happened with no visibility in the snow
SELECT from_station_name, to_station_name,
       TO_DATE(starttime) AS startdate,
       trip_in_minutes,
       CONCAT(CAST(FLOOR(trip_in_minutes / 60) AS STRING), ' hrs ',
              CAST(trip_in_minutes - 60 * FLOOR(trip_in_minutes / 60) AS STRING), ' min') AS trip_time,
       temperature, trip_id
FROM bikeshare_final_tbl AS f
WHERE LOWER(events) = 'snow' AND visibility <= 0
GROUP BY from_station_name, to_station_name, starttime, trip_in_minutes, temperature, trip_id
ORDER BY trip_in_minutes DESC
LIMIT 10;
15
Data Discovery - Query 5
What is the total trip duration for all trips taken on September 11 each year, in hours rounded to one decimal place, and how did it depend on weather conditions?
SELECT x.year,
       FORMAT_NUMBER(SUM(x.tim) / 60, 1) AS total_trips_in_hours,
       x.conditions
FROM (
  SELECT YEAR(starttime) AS year, trip_in_minutes AS tim, conditions
  FROM bikeshare_final_tbl
  WHERE MONTH(starttime) = 9 AND DAY(starttime) = 11
  GROUP BY starttime, trip_in_minutes, conditions
) AS x
GROUP BY x.year, x.conditions;
9-11-2013: 1,397.9 hours
9-11-2014: 1,466.0 hours
9-11-2015: 1,788.4 hours
9-11-2016: 5,116.6 hours
9-11-2017: 3,835.4 hours
16
Data Discovery - Query 5
17
Data Discovery - Query 6
What is the average trip duration in minutes, rounded to one decimal, for the trips starting and ending at the same station, taken by female subscribers when the wind speed
is above 20?
SELECT ROUND(AVG(trip_in_minutes), 1) AS average_trip_length,
       wind_speed
FROM bikeshare_final_tbl AS f
WHERE from_station_id = to_station_id
  AND UPPER(usertype) = 'SUBSCRIBER'
  AND UPPER(gender) = 'FEMALE'
  AND wind_speed > 20
GROUP BY wind_speed
ORDER BY wind_speed;
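As a complement to comparing averages by wind speed, the relationship could be summarized as a single number with Hive's built-in CORR aggregate, which returns the Pearson correlation coefficient of two numeric columns. This is a sketch under the assumption that `wind_speed` and `trip_in_minutes` are both stored as numeric types; it was not part of the original analysis, and the alias names are illustrative.

```sql
-- Sketch: Pearson correlation between wind speed and trip duration
-- for round trips taken by female subscribers (all wind speeds).
-- ASSUMPTION: wind_speed and trip_in_minutes are numeric columns.
SELECT CORR(wind_speed, trip_in_minutes) AS wind_vs_duration_corr,
       COUNT(*) AS number_of_trips
FROM bikeshare_final_tbl
WHERE from_station_id = to_station_id
  AND UPPER(usertype) = 'SUBSCRIBER'
  AND UPPER(gender) = 'FEMALE';
```

A coefficient close to 0 would be consistent with little or no linear relationship between the two variables.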
18
Bicycle sharing and weather conditions
Results:
1. About 0.4% of bicycles are not returned on the same day (46,111 of about 11.9 million trips in Query 1)
2. Most bicycles are rented when the temperature is between 55F and 76F
3. The most popular time is between 4 and 7 pm, regardless of weather
4. Some harsh weather conditions, like snow and no visibility, make trips shorter but do not stop people from using the bicycle sharing system
5. Comparing data for the same calendar date across years suggests a correlation between nice weather conditions and a higher number of trips
6. There doesn't seem to be a correlation between wind speed and the length of trips
Applications:
By finding weaker spots in the bicycle sharing system, an incentive program could be implemented to increase use of the system on slow days/times.
19
Bicycle sharing and weather conditions
e-Portfolio: https://eportolapol.wordpress.com/
20
Bicycle sharing and weather conditions