![Page 1: Robert Currier, Mote Marine Laboratory Dr. Barbara Kirkpatrick, TAMU/GCOOS](https://reader035.vdocument.in/reader035/viewer/2022062421/56649e2d5503460f94b1c63f/html5/thumbnails/1.jpg)
Developing a Data Harvester in the Amazon Cloud for the
Automated Assimilation of Florida’s Healthy Beaches Reports into
the GCOOS Data Portal
Robert Currier, Mote Marine Laboratory
Dr. Barbara Kirkpatrick, TAMU/GCOOS
Overview
- FL Department of Health monitors 34 coastal counties
- E. coli/Enterococcus samples taken weekly
- DOH data publicly available but no API
- Original DOH website used standard HTML/CSS
- Python “web scraping” app developed to harvest data
- DOH outsourced website to commercial provider
We had no access to DOH staff or API for the data
In today’s “Big Data” world this is becoming typical:
What we built broke when data format changed
This is the story of how we fixed the harvester
Original Data Harvester
- Written in Python
- Used the ‘urllib’ library for web scraping
- Data stored in MySQL database
- Harvester ran nightly out of cron
- App walked through list of counties and built url: http://esetappsdoh.doh.state.fl.us/irm00beachwater/beachresults.apx?county=’sarasota’
- Data returned as Python text object
- Text object fed to regular expression for matching
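The original harvester loop can be pictured roughly as follows. This is only a sketch, not the authors' code: it uses Python 3's `urllib` modules, the county list beyond 'sarasota', the helper names (`build_url`, `fetch_county`, `parse_results`) and the table-row pattern are all hypothetical, and the old page's real markup is not reproduced here.

```python
import re
from urllib.parse import quote
from urllib.request import urlopen

# Base URL taken from the slide; the plain (unquoted) county value is an assumption.
BASE_URL = "http://esetappsdoh.doh.state.fl.us/irm00beachwater/beachresults.apx"

# Hypothetical county list; only 'sarasota' appears in the slides.
COUNTIES = ["sarasota", "manatee"]

# Hypothetical pattern: one table row holding site name, sample date and count.
ROW_RE = re.compile(
    r"<tr>\s*<td>(?P<site>[^<]+)</td>\s*"
    r"<td>(?P<date>\d{1,2}/\d{1,2}/\d{4})</td>\s*"
    r"<td>(?P<count>\d+)</td>\s*</tr>"
)


def build_url(county):
    """Build the per-county results URL."""
    return f"{BASE_URL}?county={quote(county)}"


def fetch_county(county):
    """Download the raw page for one county as a text object."""
    with urlopen(build_url(county), timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse_results(text):
    """Feed the text object to the regular expression and collect matches."""
    return [
        (m.group("site").strip(), m.group("date"), int(m.group("count")))
        for m in ROW_RE.finditer(text)
    ]
```

A nightly cron job would call `fetch_county` and `parse_results` for each entry in `COUNTIES` and write the resulting tuples to the database. A harvester built this way depends entirely on the page markup staying the same.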
Original Data Format
And Then It Stopped Working…
- FL DOH suddenly (to us) outsourced in early 2013
- New website used proprietary JavaScript and Maps
- Plain HTML no longer sent to the browser
- Instead, custom JavaScript was loaded
- The JavaScript used AJAX and DOM manipulation
New Data Format
The Solution
Emulating a browser with Selenium
- Portable software test framework for web applications
- Can act like Firefox, Chrome and IE
- Typically used for building automated tests
- We repurposed and used as a virtual browser
- As a browser Selenium can execute JavaScript
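Using Selenium as a virtual browser might look something like the sketch below. It assumes the third-party `selenium` package and a Firefox install are available; the function names and the injectable `driver_factory` parameter are illustrative choices, not part of the original harvester.

```python
def firefox_factory():
    """Create a real Firefox WebDriver (requires the third-party selenium package)."""
    from selenium import webdriver  # imported lazily so the module loads without it
    return webdriver.Firefox()


def fetch_rendered_html(url, driver_factory=firefox_factory):
    """Load a page in a browser, let its JavaScript run, and return the resulting HTML.

    driver_factory is a hypothetical hook: any callable returning an object with
    get(), page_source and quit() will do, which makes the function easy to test.
    """
    driver = driver_factory()
    try:
        driver.get(url)            # the browser executes the page's JavaScript
        return driver.page_source  # HTML of the DOM after the scripts have run
    finally:
        driver.quit()              # always shut the browser down
```

Because the new site filled the page in through AJAX, a real run would probably also need to wait until the results appear in the DOM before reading `page_source`.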
Soup’s On!
- Selenium worked and we now had data available
- But data was very unstructured and massively ugly
- BeautifulSoup4 to the rescue…
And The Soup Was Tasty!
- BeautifulSoup4 gave us back our “structured” data
- Some modification needed to data parsing code as…
- Locations, variables and dates were not on same line
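One way to handle fields that arrive on separate lines is sketched below. It assumes the third-party `beautifulsoup4` package, and it assumes, purely for illustration, that each record spans three consecutive text lines (location, variable, date); the real page layout and the helper names are not from the slides.

```python
def extract_lines(html):
    """Strip the markup and return the page's non-empty text fragments in order."""
    from bs4 import BeautifulSoup  # third-party; imported lazily
    soup = BeautifulSoup(html, "html.parser")
    return list(soup.stripped_strings)


def group_records(lines, fields=("location", "variable", "date")):
    """Regroup consecutive text lines into one record per sample.

    Assumes (hypothetically) that every record occupies len(fields) lines in a
    fixed order; trailing lines that do not fill a whole record are dropped.
    """
    n = len(fields)
    usable = len(lines) - len(lines) % n
    return [dict(zip(fields, lines[i:i + n])) for i in range(0, usable, n)]
```

For example, `group_records(extract_lines(html))` would turn the rendered page into a list of dictionaries ready to be written to the database.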
The New Code Worked Perfectly
In Our Development Environment
But Failed Spectacularly When We Deployed
What Happened?
- Amazon EC2 instances are “headless” servers
- No display hardware
- No graphics libraries (GTK+)
- Since no graphics libraries, no browsers
- Without a browser, we crash and burn
Adding A Virtual Head
- http://joekiller.com provided us with a script that pulled the source and built GTK+ on our cloud server in under two hours. Thanks, Joe Lawson!
- Unfortunately, the script bombed and didn’t build Firefox. We had to download the source and build by hand.
- Now we had a working browser, but no monitor on which to display our output…
Getting A Head with XVFB
- XVFB: The X virtual frame buffer
- Performs all graphical operations in memory
- Doesn’t show output
- Primarily used for testing, but…
- We repurposed, just like Selenium
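Combining Xvfb with a browser on a headless server could be sketched as follows. The display number, screen geometry and helper names are assumptions, and the `popen` parameter is a hypothetical hook that lets the sketch run without Xvfb installed.

```python
import os
import subprocess
from contextlib import contextmanager


def xvfb_command(display=":99", screen="1024x768x24"):
    """Command line that starts Xvfb on the given display with one in-memory screen."""
    return ["Xvfb", display, "-screen", "0", screen]


@contextmanager
def virtual_display(display=":99", popen=subprocess.Popen):
    """Run Xvfb for the duration of the block and point DISPLAY at it."""
    proc = popen(xvfb_command(display))
    previous = os.environ.get("DISPLAY")
    os.environ["DISPLAY"] = display  # browsers started now draw into memory
    try:
        yield display
    finally:
        # Restore the previous DISPLAY setting and stop the virtual frame buffer.
        if previous is None:
            os.environ.pop("DISPLAY", None)
        else:
            os.environ["DISPLAY"] = previous
        proc.terminate()
        proc.wait()
```

A browser launched inside `with virtual_display():` then has a display to render to even though the server has no monitor. A real deployment would probably also pause briefly, or poll, until Xvfb is ready before starting the browser.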
Automating The Process
Conclusions
- Don’t be afraid to use untraditional data sources
- But be prepared for your code to break
- We live in a data rich environment
- But most of the data is very messy/unstructured
- So tread lightly, and don’t lose your head!
Thanks To:
- Mote Marine Laboratory
- Gulf Coast Ocean Observing Systems
- Texas A&M Department of Oceanography
- All the Free and Open Source Software developers
In Remembrance Of
Seth Vidal, creator of ‘yum’, friend and FOSS guru
Killed while biking on July 8th 2013 in Durham, NC