Web Scrapers and Your Listing Data: High Risk Lessons

Presenters

Matt Cohen, Chief Technologist, Clareity Consulting

Lauren Hansen, CEO, IRES MLS

Charlie Minesinger, Director of Solution Sales, Distil Networks

Agenda

○ Overview of Bots and Web Scraping
○ Web Scraping’s Impact on Real Estate Websites
○ IRES / ColoProperty.com Case Study
○ About Distil Networks
○ Q&A

Toward Better Security for Real Estate Data Online

A Brief Intro to Bots and Web Scraping

Bad Bots Cause the Majority of Website Problems

In 2015 the most targeted verticals were digital publishing and real estate. Real estate sites saw a 300% increase in bad bot traffic!

Traffic by Type of Site, 2014 vs 2015

What is Web Scraping

Web scraping is the act of taking content from a website with the intent of using it for purposes outside the direct control of the site owner.

It can be used to
○ Steal intellectual property
○ Gain competitive advantage
○ Create aggregation or meta-sites
○ Perform market research
○ Damage SEO rankings

http://www.distilnetworks.com/web-scraping/

Who is behind Web Scraping?

Competitors
○ Content Theft
○ Competitive Intel
○ Price Scraping

Aggregators
○ Start-ups
○ Unauthorized Middlemen

Hackers
○ Content for Fake Pages

Search Engines
○ Google
○ Bing
○ Yahoo
○ Baidu

Web Scraping Concerns on Real Estate Sites

How Web Scrapers Impact KPIs

Web scraping hurts your KPIs...
○ Slowdowns, downtime, and poor user experiences
○ Increase in costs (infrastructure and people)
○ Distortion of web analytics
○ Digital ad fraud, reputation and trust (bad leads)

Why Bots / Scraping is a Problem in Real Estate

MLSs
○ Obligation to protect copyright
○ Higher cost to use reactive methods: beacons, legal, etc.
○ Duty to enforce NAR Policy (VOWs; IDX optionally)
○ Missed revenue opportunities for licensing content

Brokers and Agents
○ Provided content license on listing for specific purpose
○ Responsible for NAR Policy (VOWs, so far)
○ Stale (scraped) data undermines trust and reputation in brand

Bottom Line on Scraping

The High Costs of Scraping MLS Data
○ Resource costs: 10% to 40% of server utilization and bandwidth
○ Customer care: Cost per call from consumer? Calls per month?
○ Website performance: brownouts result in 3 days of low traffic
○ Ad fraud: If 30% of ads are seen by bots, are advertisers paying?
○ Lead gen: Bad leads, decreased value of MLS licensed data. $15/mover, $30/storage facility, … $100s per listing going to data pirates … and potentially annoying consumers in the process!

→ Biggest Losers: MLS and Brokers

Value of solution?
Antivirus is $40 to $75 per year per member (= $3 to $6/month). Anti-scraping protection should be the same or lower cost.

Bottom Line
Scrapers scrape because they are making money with your listings!

And the Real Estate industry is left with...

○ Higher Costs
○ Lost Revenues

Why Bots / Scraping is a Problem in Real Estate

Clareity’s 2015 RE Industry Scraping Study

Who
○ 100 MLS executives representing MLSs with over 600,000 subscribers
○ 14 representing 400,000 IDX & VOW websites; others would only speak informally

What Was Found
○ 99% say compliance with rules protecting against misuse of MLS data is important
○ 59% of respondents do NOT test VOW sites for anti-scraping compliance, and the other 41% rely on self-reporting
  ○ The industry lacks a tool for compliance review. I would require a screenshot of the site’s Distil dashboard, documentation of key settings!
○ Almost all IDX/VOW vendors are using either no anti-scraping protection or reactive, obsolete detection tactics
  ○ Reactive log analysis, IP-based methods, rate limiting, CAPTCHA

Scraping Study / The Path Forward

95% of MLS execs agree that IDX sites should be subject to rules specifically mandating scraping protections

NAR has declined to make the change even though 95% want the “air coverage” of specific language NAR’s

The Path Forward to the 100% Solution
○ Must start with MLSs: MLS vendors, Public Listing Websites
○ VOW compliance
○ IDX requirements made clear
○ Once “our own house” is in order, pressure syndication sites. The largest have at least some protections already. It’s the scores of others...

IRES / ColoProperty.com Case Study

About IRES / ColoProperty.com

IRES
○ For real estate professionals
○ Serving 6,000 professionals
○ County Assessor data
○ Mapping
○ Broker functionality

ColoProperty.com
○ Consumer-facing site
○ Daily updates on ~15,000 listings

Our Bot Blocking Goals

○ Preserve full value of MLS information
○ Create a trusted environment for key constituents
○ Protect the integrity of listing data
○ Decrease hosting and bandwidth costs
○ Prevent fraudulent lead forms and spam
○ Increase website speed
○ Avoid potential litigation costs

My Advice

Don’t be like these guys...

My Advice

Example from yesterday

(highlights added)

Website Scraping Has Never Been Easier

○ Scraping resources just a click away...
○ Anyone with basic computer skills can get into the game
○ Inexpensive relative to the value of the content they steal
○ Difficult or impossible to prosecute

Insight and Control is Key


In April we served almost 1.5 million CAPTCHAs...

But there were only 650 attempts to solve them

So, I know I’m only serving CAPTCHAs to the bots... about 99.96% of the time
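Taking the two counts above at face value, the share of CAPTCHAs that nobody tried to solve can be worked out directly. This is only a sketch: "almost 1.5 million" is treated as exactly 1,500,000, which is an assumption, so the result is approximate.

```python
# Figures quoted on the slide; the served count is rounded ("almost 1.5 million").
captchas_served = 1_500_000
solve_attempts = 650

# Fraction of served CAPTCHAs that anyone tried to solve.
attempt_rate = solve_attempts / captchas_served

# Fraction that went unattempted, which suggests they were served to bots.
unattempted_share = 1 - attempt_rate

print(f"Solve attempts: {attempt_rate:.4%} of CAPTCHAs served")
print(f"Left unattempted: {unattempted_share:.3%}")
```

A very low attempt rate is the signal here: real visitors who hit a CAPTCHA usually try to solve it, so unattempted challenges suggest the challenge is landing on automated traffic.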


I can adjust rate limits up and down and see how they will impact my users...
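The "what if" on rate limits can be approximated offline by replaying a request log against several candidate limits. The sketch below is not the Distil dashboard. The log format, the client names, and the fixed one-minute buckets are all assumptions made for illustration.

```python
from collections import Counter

def clients_over_limit(request_log, limit_per_minute):
    """Return the set of client IDs that exceed the limit in any one-minute bucket.

    request_log is an iterable of (client_id, unix_timestamp) pairs.
    Fixed one-minute buckets are a simplification of real rate limiters.
    """
    per_bucket = Counter((client, int(ts // 60)) for client, ts in request_log)
    return {client for (client, _), n in per_bucket.items() if n > limit_per_minute}

# Hypothetical sample log: one browsing visitor and one aggressive crawler.
sample_log = [("visitor", t * 10) for t in range(12)]      # 6 requests/minute for 2 minutes
sample_log += [("crawler", t * 0.5) for t in range(240)]   # 120 requests/minute for 2 minutes

# Try a strict, a moderate, and a loose limit and see who would be affected.
for limit in (5, 30, 200):
    print(limit, sorted(clients_over_limit(sample_log, limit)))
```

With this sample data the strict limit catches the ordinary visitor as well as the crawler, the moderate limit catches only the crawler, and the loose limit catches nobody. That trade-off is what tuning the limit up and down is meant to expose.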

About Distil Networks

Distil Networks in Real Estate

Majority of Bots are Advanced Persistent Bots (APBs)

APBs have one or more of the following abilities:

Advanced
○ Mimic human behavior
○ Load JavaScript
○ Load external resources
○ Support cookies
○ Browser automation (Selenium, PhantomJS)

Persistent
○ Dynamic IP rotation
○ Distribute attacks across IP addresses
○ Hide behind anonymous and peer-to-peer proxies

2016 Distil Bad Bot Report

Sticky Bot Tracking With No Impact On Real Users

Device Fingerprinting
Fingerprints stick to the bot even if it attempts to reconnect from random IP addresses or hide behind an anonymous proxy or peer-to-peer network

Tracks distributed attacks that would normally fly under the radar

Without Distil

With Distil

Without Impacting Users Sharing the Same IP
Avoids blocking residential users or organizations that might share the same NAT as the bot or botnet
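A toy comparison shows why counting by fingerprint behaves differently from counting by IP address. This is a minimal sketch, not Distil's fingerprinting method. The attribute tuples, the hashing scheme, and the threshold are hypothetical, and real fingerprints draw on many more signals.

```python
from collections import Counter
import hashlib

def fingerprint(attrs):
    """Hash a tuple of client attributes into a short, stable identifier."""
    joined = "|".join(attrs)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]

# Hypothetical attribute tuples (user agent, screen size, time zone, plugins).
BOT = ("Mozilla/5.0 (X11; Linux x86_64)", "1024x768", "UTC", "no-plugins")
HUMAN = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64)", "1920x1080", "America/Denver", "pdf-viewer")

# The bot rotates through 50 addresses; one of them (203.0.113.7) is a NAT
# that a human visitor also sits behind.
requests = [(f"203.0.113.{i % 50}", BOT) for i in range(500)]
requests += [("203.0.113.7", HUMAN)] * 8

by_ip = Counter(ip for ip, _ in requests)
by_fp = Counter(fingerprint(attrs) for _, attrs in requests)

THRESHOLD = 100  # assumed per-client request budget
flagged_ips = {ip for ip, n in by_ip.items() if n > THRESHOLD}
flagged_fps = {fp for fp, n in by_fp.items() if n > THRESHOLD}

print("flagged by IP:", flagged_ips)
print("flagged by fingerprint:", flagged_fps)
```

Per-IP counting flags nothing because each address stays under the budget, while per-fingerprint counting flags the bot across all 50 addresses. The human behind the shared NAT has a different fingerprint and is left alone.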

Threat Intelligence From All Distil-Protected Sites

Known Violators Database
Real-time updates from the world’s largest Known Violators Database, which is based on the collective intelligence of all Distil-protected sites

Distil customers are automatically protected against new threats discovered anywhere on the network

Browser Validation
Detects all known browser automation tools, such as Selenium and PhantomJS

Protects against browser spoofing by validating that each incoming request matches the browser it reports itself to be
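The simplest form of this idea is to check whether a request's self-reported identity is internally consistent. The sketch below uses request headers only and is a much weaker stand-in for the validation described above, which the slides do not detail. The function name and the specific heuristics are assumptions for illustration.

```python
# Tokens that PhantomJS and headless Chrome include in their default User-Agent strings.
AUTOMATION_TOKENS = ("PhantomJS", "HeadlessChrome")

def suspicious_reasons(headers):
    """Return a list of reasons a request's self-reported browser identity looks doubtful.

    headers is a dict using canonical header names (e.g. "User-Agent").
    """
    ua = headers.get("User-Agent", "")
    reasons = []
    if not ua:
        reasons.append("missing User-Agent")
    for token in AUTOMATION_TOKENS:
        if token in ua:
            reasons.append(f"automation token in User-Agent: {token}")
    # Mainstream browsers send Accept-Language by default, so its absence
    # alongside a browser-like User-Agent is a (weak) inconsistency signal.
    if ua.startswith("Mozilla/") and "Accept-Language" not in headers:
        reasons.append("browser-like User-Agent without Accept-Language")
    return reasons

phantom = {
    "User-Agent": "Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 "
                  "(KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1",
    "Accept-Language": "en-US",
}
print(suspicious_reasons(phantom))
```

Header checks like these are easy for a scraper to defeat by copying a real browser's headers, so they only catch automation tools left on their default settings.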

Advanced Bot Detection Increases Accuracy

Behavioral Modeling and Machine Learning
Machine-learning algorithms pinpoint behavioral anomalies specific to your site’s unique traffic patterns

Self-optimizing algorithms improve bot detection and mitigation without manual configuration
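As a rough illustration of "anomalies relative to your own site's traffic", the sketch below learns a baseline from past sessions and scores new ones against it. A plain z-score is swapped in here for the machine-learning models the slide refers to. The feature (pages per session), the sample numbers, and the cutoff of 3.0 are all assumptions.

```python
from statistics import mean, stdev

def fit_baseline(history):
    """Summarize historical values as (mean, sample standard deviation)."""
    return mean(history), stdev(history)

def z_score(value, baseline):
    """How many standard deviations a value sits above the baseline mean."""
    mu, sigma = baseline
    return (value - mu) / sigma

# Hypothetical pages-per-session counts from ordinary past traffic on one site.
history = [4, 6, 5, 7, 3, 5, 6, 4, 5, 5]
baseline = fit_baseline(history)

# New sessions to score against this site's own baseline.
new_sessions = {"session-a": 6, "session-b": 480}
flagged = {sid for sid, pages in new_sessions.items() if z_score(pages, baseline) > 3.0}

print(flagged)
```

Because the baseline is fitted per site, the same session could be normal on one site and anomalous on another, which is the point of modeling each site's own traffic.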

Flexible Deployment Options

Physical or Virtual Appliances
○ Install on virtualized or bare-metal appliance(s)
○ High-availability configurations with failover monitoring
○ Heartbeat up to Distil Cloud
○ Deploys in days

Content Delivery Network
○ Automatically compresses and optimizes content for faster delivery
○ 17 global datacenters automatically fail over when a primary location goes offline
○ Automatically increases infrastructure and bandwidth to accommodate spikes
○ Deploys in hours
