Using Data Science on Internet Search Behavior as a Proxy for Human Behavior, Juan Miguel Lavista
![Page 1: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/1.jpg)
Using Data Science on Internet Search Behavior as a Proxy for Human Behavior
Juan Miguel Lavista
![Page 2: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/2.jpg)
Agenda
Using Data Science on Internet Search Behavior as a Proxy for Human Behavior
Context
Problem definition
Examples
Summary
![Page 3: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/3.jpg)
Context
![Page 4: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/4.jpg)
17,293,822,600,000,000,000 bytes [1]
15 exabytes = 1.5 million times the size of all books in the Library of Congress [2]
[1] Rick Smolan, Jennifer Erwitt. The Human Face of Big Data, 2012. ISBN-10: 1454908270
[2] Peter Lyman, Hal R. Varian. "How Much Information?", 2000-10-18
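The byte count and the "15 exabytes" figure on this slide agree if an exabyte is read as a binary exabyte (2^60 bytes). That reading is an assumption; the slide does not say which definition it uses. A quick check:

```python
# Assumption: "exabyte" here means 2**60 bytes (the binary definition).
BYTES_PER_EXABYTE = 2 ** 60

def exabytes_to_bytes(exabytes: int) -> int:
    """Convert a whole number of binary exabytes to bytes."""
    return exabytes * BYTES_PER_EXABYTE

total_bytes = exabytes_to_bytes(15)
print(f"{total_bytes:,}")  # 17,293,822,569,102,704,640
```

Rounded to nine significant digits, this gives the 17,293,822,600,000,000,000 shown on the slide.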
![Page 5: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/5.jpg)
Cost of storage of every single book ever written (~130 million books [4])
1984: US$1 billion [3]
2014: US$3,000
[3] Matthew Komorowski. A History of Storage Cost, 2009
[4] Wallace Yovetich. There Are 130 Million Books in the World, How Many Have You Read?, 2009
![Page 6: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/6.jpg)
Cost of processing power [6]
1996: ASCI Red supercomputer (6,000 Pentium Pro processors), $67,000,000
2014: Xbox One, $399
[6] Sebastian Anthony. The History of Supercomputers, 2012
![Page 7: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/7.jpg)
Concepts
Research
Information is only useful if it's accessible…
1989: Tim Berners-Lee writes his initial proposal for the web
August 1991: First website goes online at CERN, including the first index
Circa 1992: Index discontinued
![Page 8: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/8.jpg)
All 29 websites!
Web – circa 1992
![Page 9: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/9.jpg)
“If you notice something incorrect
or have any comment which you don't think is a FAQ, feel free to mail me”
Phone +1 (617)253 5702, fax +1 (617)258 8682, email: [email protected]
![Page 10: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/10.jpg)
History behind
http://www.
www.cern.ch info.cern.ch
![Page 11: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/11.jpg)
The web started growing, and there was a need to search it
![Page 12: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/12.jpg)
Archie, circa 1990
by Alan Emtage and Peter J. Deutsch
Simply contacted a list of FTP archives on a regular basis and stored the listings locally
Search functionality used Unix grep
![Page 13: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/13.jpg)
24 Years Later…
![Page 14: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/14.jpg)
2 trillion queries per year
2.8 billion users
![Page 15: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/15.jpg)
The indexable web is ~40 trillion pages
A couple of weeks to read…
5,700 web pages per person
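The slide does not say which population the per-person figure divides by. A quick check, assuming a world population of roughly 7 billion (an assumption, not a number from the deck), lands close to the 5,700 quoted:

```python
INDEXABLE_PAGES = 40 * 10**12   # ~40 trillion pages, from the slide
WORLD_POPULATION = 7 * 10**9    # assumption: ~7 billion people, not stated on the slide

pages_per_person = INDEXABLE_PAGES / WORLD_POPULATION
print(round(pages_per_person))  # 5714
```

Dividing by the 2.8 billion users mentioned earlier instead would give about 14,300 pages per user, so the world-population reading seems the more likely one.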
![Page 16: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/16.jpg)
This is just 1 search (we make 2 trillion searches per year)
A lot more time to complete a search…
![Page 17: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/17.jpg)
Agenda
Using Data Science on Internet Search Behavior as a Proxy for Human Behavior
Context
Problem definition
Examples
Summary
![Page 18: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/18.jpg)
Problem definition
Using Data Science on Internet Search Behavior as a Proxy for Human Behavior
Search focus: relevance and performance
What can we learn from what people are searching for?
![Page 19: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/19.jpg)
Agenda
Using Data Science on Internet Search Behavior as a Proxy for Human Behavior
Context
Problem definition
Examples
Summary
![Page 20: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/20.jpg)
Examples
Using Data Science on Internet Search Behavior as a Proxy for Human Behavior
Breaking News
Drug Interactions
Wake up time
Seasonal Flu
![Page 21: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/21.jpg)
Breaking News Detection
![Page 22: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/22.jpg)
Breaking News Detection
Daily traffic follows a very stable pattern
We build a model to predict query volume on a per-minute basis
In the absence of rare events, the prediction of query volume during the day is very accurate
The model works, with some variation, at the country, state, or city level
![Page 23: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/23.jpg)
We compare the daily traffic against prediction, and measure how much they deviate.
Anomaly detection Problem
Z-Score +7
Spike Location: Boston
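The comparison of observed traffic against prediction can be sketched as a per-minute z-score. The deck does not describe the prediction model, so this sketch substitutes a simple stand-in (the mean of the same minute on previous days); the function names, the threshold default, and the toy numbers are all hypothetical.

```python
from statistics import mean, stdev

def zscores(history, today):
    """Z-score of today's per-minute query volume against past days.

    history: list of past days, each a list of per-minute query counts.
    today:   per-minute query counts for the day being checked.
    """
    scores = []
    for minute, observed in enumerate(today):
        past = [day[minute] for day in history]
        predicted = mean(past)   # stand-in for the real prediction model
        spread = stdev(past)     # typical deviation for this minute
        scores.append((observed - predicted) / spread if spread > 0 else 0.0)
    return scores

def find_spikes(history, today, threshold=7.0):
    """Return the minutes whose z-score is at or above the threshold."""
    return [m for m, z in enumerate(zscores(history, today)) if z >= threshold]

# Toy data: three minutes of traffic over five past days (made-up numbers).
history = [
    [100, 120, 110],
    [102, 118, 111],
    [98, 121, 109],
    [101, 119, 112],
    [99, 122, 108],
]
today = [100, 120, 150]  # the last minute is far above its usual level
print(find_spikes(history, today))  # [2]
```

Running the same check separately per city or state is one way to attach a location to a spike, as in the Boston example on the slide.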
![Page 24: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/24.jpg)
Wake up time
![Page 25: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/25.jpg)
Wake up Time
Methodology
We calculated the time at which we receive 50% of daily peak traffic from each metro area, in its local time zone. The 25 cities follow the same general curve across all seven days of the week. While the patterns are the same, we did see a 43-minute shift between the earliest risers and the latest risers.
6:43 6:55 7:10 7:15 7:28 7:32
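The methodology above can be sketched as follows. The sketch starts its search at the overnight low so that late-night traffic just after midnight is not mistaken for the morning ramp; that detail, the function names, and the synthetic traffic curve are assumptions, not part of the deck.

```python
def wake_up_minute(per_minute_traffic, fraction=0.5):
    """First minute after the overnight low at which traffic reaches
    `fraction` of the day's peak. Index 0 is local midnight."""
    peak = max(per_minute_traffic)
    trough = per_minute_traffic.index(min(per_minute_traffic))
    for minute in range(trough, len(per_minute_traffic)):
        if per_minute_traffic[minute] >= fraction * peak:
            return minute
    return None  # the peak came before the overnight low

def format_hhmm(minute):
    """Render a minute-of-day index as H:MM."""
    return f"{minute // 60}:{minute % 60:02d}"

# Synthetic day: flat at 10 until 6:00, linear ramp to 100 by 8:00, then flat.
traffic = []
for m in range(24 * 60):
    if m < 360:
        traffic.append(10.0)
    elif m < 480:
        traffic.append(10.0 + 90.0 * (m - 360) / 120)
    else:
        traffic.append(100.0)

print(format_hhmm(wake_up_minute(traffic)))  # 6:54
```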
![Page 26: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/26.jpg)
San Francisco
![Page 27: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/27.jpg)
![Page 28: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/28.jpg)
Wake up time during the week
At what time do we wake up during the week?
[Figure: wake-up time for each weekday, Monday through Friday. Values shown, in extraction order: 7:06, 7:10, 7:01, 6:48, 7:05]
![Page 29: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/29.jpg)
Detecting Seasonal Influenza Using Search Logs
![Page 30: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/30.jpg)
Epidemics of seasonal influenza are a major public health concern, causing tens of millions of respiratory illnesses and 250,000 to 500,000 deaths worldwide each year
Early detection of disease activity, when followed by a rapid response, can reduce the impact of both seasonal and pandemic influenza
Polgreen, P. M., Chen, Y., Pennock, D. M. & Forrest, N. D. Using internet searches for influenza surveillance. Clinical Infectious Diseases 47, 1443–1448 (2008)
Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski & Larry Brilliant. Detecting influenza epidemics using search engine query data
![Page 31: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/31.jpg)
How does it work?
Detecting influenza epidemics using search engine query data
The CDC publishes national and regional data from these surveillance systems on a weekly basis, typically with a 1-2 week reporting lag
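Because query logs are available immediately, they can fill the 1-2 week reporting gap. Ginsberg et al. describe fitting a linear model between the log-odds of influenza-like-illness (ILI) physician visits and the log-odds of ILI-related queries. The sketch below shows that idea on made-up numbers; the data, the function names, and the plain least-squares fit are illustrative assumptions, not the paper's pipeline.

```python
import math

def logit(p):
    """Log-odds of a proportion p in (0, 1)."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse of logit: map log-odds back to a proportion."""
    return 1 / (1 + math.exp(-x))

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

# Made-up weekly history: share of ILI-related queries and share of
# physician visits that were ILI-related (as later reported by the CDC).
query_fraction = [0.001, 0.002, 0.004, 0.003, 0.0015]
ili_fraction = [0.010, 0.020, 0.040, 0.030, 0.015]

intercept, slope = fit_line(
    [logit(q) for q in query_fraction],
    [logit(i) for i in ili_fraction],
)

def estimate_ili(current_query_fraction):
    """Estimate this week's ILI visit share from this week's query share."""
    return inv_logit(intercept + slope * logit(current_query_fraction))

print(round(estimate_ili(0.0025), 4))
```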
![Page 32: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/32.jpg)
Controversy
![Page 33: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/33.jpg)
Signal is definitely relevant
Model can be improved
“All models are wrong but some are useful” (George Box)
This is NOT a failure for Big Data
We need to be careful of [all data] [no-science] approaches
![Page 34: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/34.jpg)
Article by Chris Anderson, Wired Magazine, 2008 [13]
“… faced with massive data, this approach to science —hypothesis, model, test — is becoming obsolete”
“The new availability of huge amounts of data [...] offers a whole new way of understanding the world. Correlation supersedes causation”
“There is now a better way. Petabytes allow us to say: Correlation is enough.”
“With enough data, the numbers speak for themselves.”
[13] http://edge.org/3rd_culture/anderson08/anderson08_index.html
All data no-science?
Discussion
![Page 35: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/35.jpg)
All Data no-Science Approach
0.81 correlation between Flu Trends and guns-related queries
0.82 correlation between CDC flu data and Les Misérables-related queries
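High correlations like these are easy to find by accident when many candidate series are screened against a short target. The sketch below illustrates the effect with purely random data: it generates one random-walk "target" and 2,000 unrelated random walks, then reports the strongest correlation found. The series length, the number of candidates, and the seed are arbitrary choices for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def random_walk(rng, length):
    """Cumulative sum of independent standard-normal steps."""
    level, walk = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk

rng = random.Random(0)
WEEKS = 20
target = random_walk(rng, WEEKS)                       # stands in for a flu curve
candidates = [random_walk(rng, WEEKS) for _ in range(2000)]  # unrelated "query" series

best = max(abs(pearson(target, c)) for c in candidates)
print(f"strongest correlation among unrelated series: {best:.2f}")
```

None of the candidates has any connection to the target, yet the best match looks impressive, which is why a correlation alone is weak evidence without a hypothesis behind it.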
![Page 36: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/36.jpg)
“Torture the data enough and it will confess…” (Ronald Coase)
![Page 37: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/37.jpg)
Fooled by randomness
![Page 38: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/38.jpg)
Signal is definitely relevant
Model can be improved
“All models are wrong but some are useful” (George Box)
This is NOT a failure for Big Data
We need to be careful of [all data] [no-science] approaches
![Page 39: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/39.jpg)
Detecting Adverse Drug Interactions
![Page 40: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/40.jpg)
Context: Adverse drug events cause substantial morbidity and mortality and are often discovered after a drug comes to market.
In the US alone, adverse drug events cause thousands of deaths annually, and their associated medical treatment costs billions of dollars
![Page 41: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/41.jpg)
Detecting Adverse Drug Interactions
Testing the impact of a drug for the FDA
For each drug, the FDA requires randomized controlled trials before release in order to understand the drug's impact
![Page 42: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/42.jpg)
Interactions
What are interactions?
Drug A alone: OK
Drug B alone: OK
Drug A + Drug B together: Not OK
![Page 43: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/43.jpg)
Web-scale pharmacovigilance: listening to signals from the crowd
Ryen W. White, Nicholas P. Tatonetti, Nigam H. Shah, Russ B. Altman, Eric Horvitz
Hypothesis: Internet users may provide early clues about adverse drug events via their online information-seeking
![Page 44: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/44.jpg)
Test case scenario
Web-scale pharmacovigilance: listening to signals from the crowd
Paroxetine (an antidepressant)
Pravastatin (a cholesterol-lowering drug)
The interaction between the two was reported to cause hyperglycemia
Hyperglycemia, or high blood sugar, is a condition in which an excessive amount of glucose circulates in the blood plasma.
![Page 45: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/45.jpg)
Methodology
Web-scale pharmacovigilance: listening to signals from the crowd
Method: By examining the words used in user queries, the authors sought evidence that people who searched for both pravastatin and paroxetine over time (using logs from 2010) would include hyperglycemia-associated words at a higher rate than people who searched for only one of the drugs
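The comparison described in the method can be sketched as a rate per user group. The term list, the user identifiers, and the toy query log below are all made up for illustration; the published study works on far larger logs with a much more careful design.

```python
# Illustrative term list; the study's actual symptom vocabulary is not given in the deck.
HYPERGLYCEMIA_TERMS = {"hyperglycemia", "high blood sugar", "blurry vision"}

def mentions(queries, terms):
    """True if any of the terms appears in any of the user's queries."""
    text = " ".join(queries).lower()
    return any(term in text for term in terms)

def rates_by_group(user_queries):
    """Share of users in each group who also searched a hyperglycemia term."""
    groups = {"both": [], "paroxetine only": [], "pravastatin only": []}
    for queries in user_queries.values():
        parox = mentions(queries, {"paroxetine"})
        prava = mentions(queries, {"pravastatin"})
        if parox and prava:
            key = "both"
        elif parox:
            key = "paroxetine only"
        elif prava:
            key = "pravastatin only"
        else:
            continue  # user searched for neither drug
        groups[key].append(mentions(queries, HYPERGLYCEMIA_TERMS))
    return {k: (sum(flags) / len(flags) if flags else 0.0) for k, flags in groups.items()}

# Toy log: hypothetical user id -> that user's queries over the period.
toy_log = {
    "u1": ["paroxetine side effects", "pravastatin dosage", "high blood sugar symptoms"],
    "u2": ["paroxetine", "pravastatin interactions"],
    "u3": ["paroxetine withdrawal"],
    "u4": ["paroxetine and alcohol"],
    "u5": ["pravastatin cost", "hyperglycemia"],
    "u6": ["pravastatin vs atorvastatin"],
    "u7": ["pravastatin generic"],
    "u8": ["pravastatin and grapefruit"],
    "u9": ["weather in boston"],
}
rates = rates_by_group(toy_log)
print(rates)  # {'both': 0.5, 'paroxetine only': 0.0, 'pravastatin only': 0.25}
```

A higher rate in the "both" group than in either single-drug group is the kind of signal the authors looked for.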
![Page 46: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/46.jpg)
Results
Web-scale pharmacovigilance: listening to signals from the crowd
The figure shows that people who search for both paroxetine and pravastatin over the 12-month period are more likely to perform searches on the terms associated with hyperglycemia
The study shows that signals concerning drug interactions can be mined directly from search logs and confirms the findings of laboratory studies as well as prior known associations.
![Page 47: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/47.jpg)
Agenda
Using Data Science on Internet Search Behavior as a Proxy for Human Behavior
Context
Problem definition
Examples
Summary
![Page 48: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/48.jpg)
Summary
![Page 49: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/49.jpg)
Using Data Science on Internet Search Behavior as a Proxy for Human Behavior
Search logs are a very powerful data set that can be used not only to improve the relevance of search results, but also as a unique data source to solve other problems.
This is only a small subset of problems; we believe it is the tip of the iceberg of this data source's potential.
We live in an amazing era, and it is too soon to realize how big the impact of the web on humankind will be.
![Page 50: Using Data Science on Internet Search Behavior as a Proxy for Human Behavior Juan Miguel Lavista](https://reader036.vdocument.in/reader036/viewer/2022062815/5681310c550346895d9748bf/html5/thumbnails/50.jpg)
We are living in this era.
Too soon to realize how big the impact of the internet on humankind is.
We are at an inflection point in the history of the world.