duplicate detection via topic modeling
![Page 1: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/1.jpg)
Duplicate Detection via Topic Modeling
![Page 2: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/2.jpg)
HomeAway Key Facts
● 1,300,000+ global vacation rental listings
● 200,000,000+ vacation days / year
● ~190 countries, 22 languages
● HQ in Austin, TX; part of Expedia, Inc.
--> Capable competition and fraud vectors
![Page 3: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/3.jpg)
Competitive Intelligence
![Page 4: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/4.jpg)
Breckenridge, Colorado
HomeAway in blue
![Page 5: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/5.jpg)
Breckenridge, zoomed in
![Page 6: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/6.jpg)
Same Property
![Page 7: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/7.jpg)
The Property Descriptions
Why Property Descriptions?
● Almost identical text
● Similar descriptions seemed probable
○ Consistent owner branding, easy to replicate
● Tech team wanted to use natural language processing techniques
● Didn’t know if this would work when we began
The Other Guys
There are truly inspiring views at High Point Retreat and plenty of places to sit and enjoy them. Take a load off in one of the many rooms with views of the ski mountain and remember how lucky you are to live like this. Cozy up with family in the sunken living room and chat for hours on end. Sit in a circle of tree stumps around the outdoor fire pit and roast marshmallows. After all that sitting, youll be more than happy to walk 250 yards to the free shuttle to get the blood pumping again. Then, have a seat and enjoy your free ride.
Best. Vacation. Ever. Vacation homes allow families to stay...together. At InvitedHome, we think that's pretty important, so we do everything in our power to make your vacation totally epic. Not only do we choose the best homes in the best destinations, but we make the experience effortless so you can really enjoy yourself. Our team will stock your fridge, babysit the kids, cater your party, plan your day trip, make reservations, and do whatever we can to make sure you have the Best. Vacation. Ever.
HomeAway
There are truly inspiring views at High Point Retreat and plenty of places to sit and enjoy them. Take a load off in one of the many rooms with views of the ski mountain and remember how lucky you are to live like this. Cozy up with family in the sunken living room and chat for hours on end. Sit in a circle of tree stumps around the outdoor fire pit and roast marshmallows. After all that sitting, you’ll be more than happy to walk 250 yards to the free shuttle to get the blood pumping again. Then, have a seat and enjoy your free ride.
Best.Vacation.Ever. Vacation homes allow families to stay...together. At InvitedHome, we think that's pretty important, so we do everything in our power to make your vacation totally epic. Not only do we choose the best homes in the best destinations, but we make the experience effortless so you can really enjoy yourself. Let us connect you with the best options in town for babysitting, equipment rental, transportation, catering, day trips, shopping, dining, and even stocking your fridge with groceries! We’ll do everything in our power to make sure you have the Best. Vacation. Ever.
![Page 8: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/8.jpg)
Initial Approach: TF-IDF and Cosine Distance
Worked great, but...
“Large” vocabulary size
~6300 tokens -> 6300 dimensions and millions of sparse vectors
A little slow (took a week to process the US)
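The TF-IDF plus cosine distance idea can be sketched in a few lines of plain Python. This is only an illustration, not the pipeline from the talk: the tokenizer, the smoothed IDF formula, and the function names (`tokenize`, `tfidf_vectors`, `cosine_distance`) are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (a deliberately simple, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    Uses raw term counts and a smoothed IDF, log((1 + N) / (1 + df)) + 1.
    The smoothing choice is an assumption for this sketch.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an empty description as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    listings = [
        "There are truly inspiring views at High Point Retreat and plenty of places to sit and enjoy them.",
        "There are truly inspiring views at High Point Retreat, with plenty of places to sit and enjoy them.",
        "Modern downtown loft close to restaurants and live music.",
    ]
    a, b, c = tfidf_vectors(listings)
    print("near-duplicate pair:", cosine_distance(a, b))
    print("unrelated pair:     ", cosine_distance(a, c))
```

Every distinct token becomes a dimension, which is why a vocabulary of ~6300 tokens leads to 6300-dimensional sparse vectors.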
![Page 9: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/9.jpg)
Spark Clusters?
Topic Modeling?
Other Distance Metrics?
![Page 10: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/10.jpg)
Latent Dirichlet Allocation (Topic Modeling)
Communications of the ACM, Vol. 55 No. 4, Pages 77-84, doi:10.1145/2133806.2133826
![Page 11: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/11.jpg)
Topic Modeling Motivations
● Smaller dimensional space
● Faster processing times
● At the end, we’d have Topic Models
Must be useful for duplicate detection
We used Spark’s ML APIs for this:
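import org.apache.spark.ml.clustering.LDA

// Configure the LDA estimator; each document's topic mixture is written to "topicDistribution"
val countLDA = new LDA()
  .setK(numTopics)
  .setMaxIter(params.maxIterations)
  .setSeed(params.randomSeed)
  .setFeaturesCol(featureCol)
  .setTopicDistributionCol("topicDistribution")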
![Page 12: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/12.jpg)
![Page 13: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/13.jpg)
Distances between Topic Distributions
Euclidean Manhattan Cosine
![Page 14: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/14.jpg)
Distances between Topic Distributions
Euclidean Manhattan Cosine
Jensen-Shannon Hellinger
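For reference, the five measures named on this slide can be written out for two topic distributions `p` and `q` (equal-length lists of probabilities that each sum to 1). This is a minimal sketch using the standard textbook definitions; the function names are chosen for the example, and the Jensen-Shannon value is the divergence with log base 2 (its square root is the corresponding distance). The talk does not state which variant was used.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


def hellinger(p, q):
    """Hellinger distance, bounded in [0, 1] for probability distributions."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))) / math.sqrt(2)


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (log base 2), bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Two hypothetical 4-topic distributions for a pair of listings.
    p = [0.70, 0.20, 0.05, 0.05]
    q = [0.65, 0.25, 0.05, 0.05]
    for name, fn in [("euclidean", euclidean), ("manhattan", manhattan),
                     ("cosine", cosine_distance), ("hellinger", hellinger),
                     ("jensen-shannon", jensen_shannon)]:
        print(name, fn(p, q))
```

Jensen-Shannon and Hellinger are defined specifically for probability distributions, which is what LDA produces per document; Euclidean, Manhattan, and cosine treat the distributions as ordinary vectors.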
![Page 15: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/15.jpg)
![Page 16: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/16.jpg)
![Page 17: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/17.jpg)
![Page 18: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/18.jpg)
![Page 19: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/19.jpg)
How to make something useful?
This is a machine learning effort
![Page 20: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/20.jpg)
![Page 21: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/21.jpg)
![Page 22: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/22.jpg)
Combining Distance and IQR
Interquartile Ranges are more resilient to outliers than standard deviations
IQRs bring information about the entire set of possible duplicates
Random Forest Model (R):
library(caret)         # provides createDataPartition
library(randomForest)  # provides randomForest
# Hold out 10% of the labeled candidate pairs; train on the remaining 90%
trainIdx <- createDataPartition(dupesFoundByTopic$match, p=0.9, list=FALSE, times=1)
train <- dupesFoundByTopic[trainIdx,]
fit <- randomForest(as.factor(match) ~ distance + iqrs, data=train)
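The robustness claim about interquartile ranges is easy to check numerically. The sketch below uses made-up distance values (not data from the talk) and Python's `statistics` module; the helper name `iqr` is chosen for the example.

```python
import statistics


def iqr(values):
    """Interquartile range (Q3 - Q1) using inclusive quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


if __name__ == "__main__":
    # Hypothetical topic distances for a set of candidate duplicates.
    distances = [0.10, 0.12, 0.11, 0.13, 0.12, 0.14, 0.11, 0.13]
    with_outlier = distances + [0.95]  # one far-away candidate

    print("IQR   without / with outlier:", iqr(distances), iqr(with_outlier))
    print("stdev without / with outlier:",
          statistics.stdev(distances), statistics.stdev(with_outlier))
```

A single extreme candidate inflates the standard deviation many times over, while the IQR of this sample is unchanged.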
| Feature | Mean Decrease Gini |
|---|---|
| distance | 498 |
| IQR | 57 |

| Prediction \ Reference | FALSE | TRUE |
|---|---|---|
| FALSE | 204 | 2 |
| TRUE | 4 | 32 |
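Summary metrics follow directly from the confusion matrix. The sketch below assumes rows are predictions and columns are reference labels, with TRUE (a duplicate match) as the positive class; the variable names are chosen for the example.

```python
# Counts read from the confusion matrix (rows: prediction, columns: reference).
# Assumption: TRUE means "duplicate" and is treated as the positive class.
true_negative = 204   # predicted FALSE, reference FALSE
false_negative = 2    # predicted FALSE, reference TRUE
false_positive = 4    # predicted TRUE,  reference FALSE
true_positive = 32    # predicted TRUE,  reference TRUE

total = true_negative + false_negative + false_positive + true_positive

accuracy = (true_positive + true_negative) / total
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
f1 = 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    print(f"pairs evaluated: {total}")
    print(f"accuracy:  {accuracy:.3f}")
    print(f"precision: {precision:.3f}")
    print(f"recall:    {recall:.3f}")
    print(f"F1:        {f1:.3f}")
```

Because true duplicates are a small share of the evaluated pairs (34 of 242), precision and recall on the TRUE class say more about the model than accuracy alone.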
![Page 23: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/23.jpg)
Current Status
● Topic Models / Topic Distances seem useful
○ Esp. when part of a multi-signal model (e.g. images)
● Hybrid Spark and R approach
○ Moving to 100% Spark in future for speed
● Topic Models just sitting there, waiting for exploitation
○ “Programmatic” Marketing Efforts, &c.
![Page 24: Duplicate detection via topic modeling](https://reader030.vdocument.in/reader030/viewer/2022013117/589ff7ca1a28ab46598b5bb9/html5/thumbnails/24.jpg)
Questions?
Brent Schneeman
Principal Data Scientist
HomeAway, Inc.
@schnee
← https://www.homeaway.com/vacation-rental/p3482065