TRANSCRIPT
Web Scrapers and Your Listing Data: High Risk Lessons
Presenters
Matt Cohen, Chief Technologist, Clareity Consulting
Lauren Hansen, CEO, IRES MLS
Charlie Minesinger, Director of Solution Sales, Distil Networks
Agenda
○ Overview of Bots and Web Scraping
○ Web Scraping’s Impact on Real Estate Websites
○ IRES / ColoProperty.com Case Study
○ About Distil Networks
○ Q&A
Toward Better Security for Real Estate Data Online
A Brief Intro to Bots and Web Scraping
Bad Bots Cause the Majority of Website Problems
In 2015 the most targeted verticals were digital publishing and real estate. Real estate sites saw a 300% increase in bad bot traffic!
Traffic by Type of Site, 2014 vs 2015
What is Web Scraping?
Web scraping is the act of taking content from a website with the intent of using it for purposes outside the direct control of the site owner.
It can be used to:
○ Steal intellectual property
○ Gain competitive advantage
○ Create aggregation or meta-sites
○ Perform market research
○ Damage SEO rankings
Who is Behind Web Scraping?
Competitors
○ Content theft
○ Competitive intel
○ Price scraping
Aggregators
○ Start-ups
○ Unauthorized middlemen
Hackers
○ Content for fake pages
Search Engines
○ Google
○ Bing
○ Yahoo
○ Baidu
Web Scraping Concerns on Real Estate Sites
How Web Scrapers Impact KPIs
Web scraping hurts your KPIs...
○ Slowdowns, downtime, and poor user experiences
○ Increase in costs (infrastructure and people)
○ Distortion of web analytics
○ Digital ad fraud, reputation and trust (bad leads)
Why Bots / Scraping is a Problem in Real Estate
MLSs
○ Obligation to protect copyright
○ Higher cost to use reactive methods - beacons, legal, etc.
○ Duty to enforce NAR Policy (VOWs, IDX optionally)
○ Missed revenue opportunities for licensing content
Brokers and Agents
○ Provided content license on listing for specific purpose
○ Responsible for NAR Policy (VOWs, so far)
○ Stale (scraped) data undermines trust and reputation in brand
Bottom Line on Scraping
The High Costs of Scraping MLS Data
○ Resource costs - 10% to 40% of server utilization and bandwidth
○ Customer Care - Cost per call from consumer? Calls per month?
○ Website Performance - a brownout results in 3 days of low traffic
○ Ad Fraud - If 30% of ads are seen by bots, are advertisers paying?
○ Lead Gen - Bad leads, decreased value of MLS licensed data. $15/mover, $30/storage facility, … $100s per listing going to data pirates … and potentially annoying consumers in the process!
→ Biggest Losers: MLS and Brokers
Value of solution?
○ Antivirus is $40 to $75 per year per member (= $3 - $6/month)
○ Anti-scraping protection should cost the same or less
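The per-month figure above is just the annual antivirus price divided by twelve. A quick check, using only the two prices quoted on the slide (the variable and function names are illustrative, not from the presentation):

```python
# Annual antivirus price range per member, as quoted on the slide (USD).
annual_low, annual_high = 40, 75


def monthly(annual_price):
    """Convert an annual per-member price to a per-month price."""
    return annual_price / 12


print(f"${monthly(annual_low):.2f} to ${monthly(annual_high):.2f} per member per month")
# prints "$3.33 to $6.25 per member per month"
```

That range is consistent with the rounded "$3 - $6/month" on the slide, which is the benchmark the presenter suggests for anti-scraping protection.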
Bottom Line
Scrapers scrape because they are making money with your listings!
And the real estate industry is left with...
○ Higher costs
○ Lost revenues
Why Bots / Scraping is a Problem in Real Estate
Clareity’s 2015 RE Industry Scraping Study
Who
○ 100 MLS executives representing MLSs with over 600,000 subscribers
○ 14 representing 400,000 IDX & VOW websites; others would only speak informally
What Was Found
○ 99% say compliance with rules protecting against misuse of MLS data is important
○ 59% of respondents do NOT test VOW sites for anti-scraping compliance, and the other 41% rely on self-reporting
  ○ The industry lacks a tool for compliance review. I would require a screenshot of the site’s Distil dashboard and documentation of key settings!
○ Almost all IDX/VOW vendors use either no anti-scraping protection or reactive, obsolete detection tactics
  ○ Reactive log analysis, IP-based methods, rate limiting, CAPTCHA
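To illustrate why the study calls IP-based rate limiting a reactive, obsolete tactic, here is a minimal sketch of a per-IP sliding-window limiter. It is a generic illustration, not any vendor's implementation; the class name, the limits, and the example addresses (taken from documentation-reserved ranges) are all assumptions.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client IP."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now):
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = IpRateLimiter(max_requests=5, window_seconds=60)

# A scraper sending 20 requests from one address is cut off after 5.
single_ip = [limiter.allow("203.0.113.7", now=t) for t in range(20)]

# The same 20 requests spread over 20 addresses all get through.
rotating = [limiter.allow(f"198.51.100.{i}", now=i) for i in range(20)]

print(sum(single_ip), "of 20 allowed from a single IP")   # 5 of 20
print(sum(rotating), "of 20 allowed with IP rotation")    # 20 of 20
```

The limiter only ever sees one request per address in the second case, so a scraper that rotates IPs stays under any per-IP threshold.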
Scraping Study / The Path Forward
95% of MLS execs agree that IDX sites should be subject to rules specifically mandating scraping protections
NAR has declined to make the change even though 95% want the “air coverage” of NAR’s specific language
The Path Forward to the 100% Solution
Must start with MLSs:
○ MLS vendors, public listing websites
○ VOW compliance
○ IDX requirements made clear
Once “our own house” is in order, pressure syndication sites
○ The largest have at least some protections already. It’s the scores of others...
IRES / ColoProperty.com Case Study
About IRES / ColoProperty.com
IRES
○ For real estate professionals
○ Serving 6,000 professionals
○ County Assessor data
○ Mapping
○ Broker functionality
ColoProperty.com
○ Consumer-facing site
○ Daily updates on ~15,000 listings
Our Bot Blocking Goals
○ Preserve full value of MLS information
○ Create a trusted environment for key constituents
○ Protect the integrity of listing data
○ Decrease hosting and bandwidth costs
○ Prevent fraudulent lead forms and spam
○ Increase website speed
○ Avoid potential litigation costs
My Advice
Don’t be like these guys...
My Advice
Example from yesterday
(highlights added)
Scraping resources just a click away...
Anyone with basic computer skills can get into the game
Inexpensive relative to the value of the content they steal
Difficult or impossible to prosecute
Website Scraping Has Never Been Easier
Insight and Control is Key
In April we served almost 1.5 million CAPTCHAs...
But there were only 650 attempts to solve them
So, I know I’m only serving CAPTCHAs to the bots... 99.995% of the time
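The reasoning here is a simple ratio: if almost nobody tries to solve the CAPTCHAs, almost all of them went to bots. A rough check using the rounded figures quoted above (the variable names are illustrative):

```python
captchas_served = 1_500_000  # "almost 1.5 million", rounded
solve_attempts = 650

attempt_rate = solve_attempts / captchas_served
unattempted_share = 1 - attempt_rate

print(f"Attempted:       {attempt_rate:.4%}")       # 0.0433%
print(f"Never attempted: {unattempted_share:.3%}")  # 99.957%
```

With these rounded inputs the unattempted share comes out near 99.96%. The 99.995% quoted on the slide presumably rests on exact counts or a slightly different definition that the transcript does not show; either way, the conclusion is the same.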
Insight and Control is Key
I can adjust rate limits up and down and see how they will impact my users...
About Distil Networks
Distil Networks in Real Estate
Majority of Bots are Advanced Persistent Bots (APBs)
APBs have one or more of the following abilities:
Advanced
○ Mimic human behavior
○ Load JavaScript
○ Load external resources
○ Support cookies
○ Browser automation (Selenium, PhantomJS)
Persistent
○ Dynamic IP rotation
○ Distribute attacks across IP addresses
○ Hide behind anonymous and peer-to-peer proxies
2016 Distil Bad Bot Report
Sticky Bot Tracking With No Impact On Real Users
Device Fingerprinting
○ Fingerprints stick to the bot even if it attempts to reconnect from random IP addresses or hide behind an anonymous proxy or peer-to-peer network
○ Tracks distributed attacks that would normally fly under the radar
Without Distil
With Distil
Without Impacting Users Sharing the Same IP
○ Avoids blocking residential users or organizations that might share the same NAT as the bot or botnet
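The idea of a fingerprint that "sticks" across IP addresses can be sketched by counting requests per hashed set of client attributes instead of per IP. This is a toy illustration only, not Distil's method: the attribute names and values are hypothetical, and a production fingerprint would draw on many more signals.

```python
import hashlib
from collections import Counter


def fingerprint(attrs):
    """Hash a client's reported attributes into a stable identifier."""
    canonical = "|".join(f"{key}={attrs[key]}" for key in sorted(attrs))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical attribute sets for one bot and one human visitor.
bot = {"user_agent": "Mozilla/5.0 (X11; Linux x86_64)", "screen": "800x600",
       "timezone": "UTC", "font_count": 3}
human = {"user_agent": "Mozilla/5.0 (Windows NT 10.0)", "screen": "1920x1080",
         "timezone": "America/Denver", "font_count": 212}

# The bot rotates through 50 addresses; the human shares one of them (same NAT).
requests = [(f"198.51.100.{i}", bot) for i in range(50)]
requests.append(("198.51.100.7", human))

by_ip = Counter(ip for ip, _ in requests)
by_fingerprint = Counter(fingerprint(attrs) for _, attrs in requests)

print("Busiest IP:", max(by_ip.values()), "requests")                      # 2
print("Bot fingerprint:", by_fingerprint[fingerprint(bot)], "requests")    # 50
print("Human fingerprint:", by_fingerprint[fingerprint(human)], "request") # 1
```

Counted per IP, the attack is invisible (no address exceeds two requests), and blocking the shared address would also block the human. Counted per fingerprint, the bot's 50 requests stand out while the human sharing its address is left alone.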
Threat Intelligence From All Distil-Protected Sites
Known Violators Database
○ Real-time updates from the world’s largest Known Violators Database, which is based on the collective intelligence of all Distil-protected sites
○ Distil customers are automatically protected against new threats discovered anywhere on the network
Browser Validation
○ Detects all known browser automation tools, such as Selenium and PhantomJS
○ Protects against browser spoofing by validating that each incoming request really comes from the browser it reports itself to be
Advanced Bot Detection Increases Accuracy
Behavioral Modeling and Machine Learning
○ Machine-learning algorithms pinpoint behavioral anomalies specific to your site’s unique traffic patterns
○ Self-optimizing algorithms improve bot detection and mitigation without manual configuration
Flexible Deployment Options
Physical or Virtual Appliances
○ Install on virtualized or bare metal appliance(s)
○ High availability configurations with failover monitoring
○ Heartbeat up to Distil Cloud
○ Deploys in days
Content Delivery Network
○ Automatically compresses and optimizes content for faster delivery
○ 17 global datacenters automatically fail over when a primary location goes offline
○ Automatically increases infrastructure and bandwidth to accommodate spikes
○ Deploys in hours