Meetup - August 25, 2015

Case Studies in Text-Mining and Beyond Larkin Liu - Paytm Labs

Upload: larkin-liu

Post on 22-Jan-2017

TRANSCRIPT

Page 1: Meetup - August 25, 2015

Case Studies in Text-Mining and Beyond
Larkin Liu - Paytm Labs

Page 2: Meetup - August 25, 2015

About me

- Visiting Scientist at Paytm Labs
- MASc Student in Operations Research, Industrial Engineering, University of Toronto
- BASc Thesis: “Application of Multivariate and Univariate Time Series Analysis Methods to Stochastic Processes”
- Areas of research and development:
  - Supervised learning algorithms - Reinforcement Learning, Clustering
  - Stochastic models - Hidden Markov Models, Time Series
  - Optimization - Decision Theory
- Working with the following languages and frameworks, such as,

Only been horseback riding once actually…

Page 3: Meetup - August 25, 2015

What is Paytm?

Paytm Labs
• Fraud Detection
• Anomaly Detection
• Classification
• Consumer Analytics
• Recommender Systems
• Inventory Optimization

Paytm
• India’s fastest growing eCommerce platform
• Online shopping and mobile wallet
• 100 million users
• 10+ million transactions per day

Page 4: Meetup - August 25, 2015

Principles of Data Science

• Occam’s Razor
  • “Assume the solution of lowest complexity is the correct solution.” - Most often it is the optimal assumption.
• Data science is statistics
  • Hypothesis Testing
  • Regression
  • Bayesian Inference
• Data science is software engineering
  • Big Data - Optimization
  • Data Structures & Algorithms
  • Artificial Intelligence - Machine Learning
• Data science is holistic in nature
  • Biostatistics
  • Particle physicists
  • Mechatronics

Page 5: Meetup - August 25, 2015
Page 6: Meetup - August 25, 2015

Text Mining - Gibberish Emails vs. Intelligent Emails

Examples of Intelligent Emails:
● [email protected]
● [email protected]
● [email protected]

Examples of Gibberish Emails:
● [email protected]
● [email protected]
● [email protected]

Objective: Distinguish intelligent emails from spammy gibberish emails, and feed the result into our decision model.

Step 1: “[email protected]”

Step 2: N-Grams:
“l a r k i n . l i u” 1-Gram
“la ar rk ki in n. .l li iu” 2-Gram

Step 3: Build 1st Order Markov Chain

Step 4: Build probability thresholds based on sampling of intelligent and gibberish names.

Step 5: Optimize model, experiment with parameters on ROC curve.

Step 6: Build robust classifier.
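
The six steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the classifier used at Paytm Labs: the symbol set, the additive smoothing, the average log-probability score, the midpoint threshold, the `example.com` domain, and the tiny training lists are all hypothetical choices made for the example.

```python
import math
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz."  # assumed symbol set

def local_part(email):
    """Step 1: keep the text before the '@', lowercased."""
    return email.split("@", 1)[0].lower()

def bigrams(text):
    """Step 2: overlapping character 2-grams."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def train(names, smoothing=1.0):
    """Step 3: log P(next char | current char) with additive smoothing."""
    counts = defaultdict(float)
    row_totals = defaultdict(float)
    for name in names:
        for gram in bigrams(name.lower()):
            if gram[0] in ALPHABET and gram[1] in ALPHABET:
                counts[gram] += 1.0
                row_totals[gram[0]] += 1.0
    log_p = {}
    for a in ALPHABET:
        denom = row_totals[a] + smoothing * len(ALPHABET)
        for b in ALPHABET:
            log_p[a + b] = math.log((counts[a + b] + smoothing) / denom)
    return log_p

def score(name, log_p, floor=math.log(1e-6)):
    """Average log-probability per transition; unknown symbols get `floor`."""
    grams = bigrams(name)
    if not grams:
        return floor
    return sum(log_p.get(g, floor) for g in grams) / len(grams)

def pick_threshold(good_scores, bad_scores):
    """Step 4: midpoint between the two samples (one simple choice)."""
    return (min(good_scores) + max(bad_scores)) / 2.0

# Hypothetical samples; a real model needs far more names.
GOOD = ["larkin.liu", "john.smith", "mary.jones", "anna.lee", "david.brown"]
BAD = ["xkqzjvwp", "qwrtzxcv", "zzxqkjhg"]

MODEL = train(GOOD)
THRESHOLD = pick_threshold([score(n, MODEL) for n in GOOD],
                           [score(n, MODEL) for n in BAD])

def is_intelligent(email):
    """Step 6: classify an address by comparing its score to the threshold."""
    return score(local_part(email), MODEL) > THRESHOLD
```

Averaging the log-probability per transition keeps long and short names comparable; Step 5 would then tune `smoothing` and the threshold against labelled data.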

Page 7: Meetup - August 25, 2015

Text Mining - Gibberish Emails vs. Intelligent Emails (Cont.)

Figure: Distribution of P(X).

Figure: ROC curve on 200 names, varying the P(X) threshold.
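
A ROC curve of this kind can be traced by sweeping the threshold over the observed scores and recording the false-positive and true-positive rates at each cut-off. The sketch below assumes "intelligent" is the positive class and that a higher score means more name-like; the scores and labels are made-up toy values, not the 200 names from the talk.

```python
def roc_points(scores, labels):
    """Return (threshold, false_positive_rate, true_positive_rate) tuples.

    scores: one number per name, higher meaning "more intelligent".
    labels: True for intelligent names, False for gibberish.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one name of each class")
    points = []
    # Use every distinct score as a cut-off, from strictest to loosest.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((t, fp / negatives, tp / positives))
    return points

# Hypothetical scores for three intelligent and three gibberish names.
TOY_SCORES = [-1.0, -1.5, -3.0, -2.5, -3.5, -4.0]
TOY_LABELS = [True, True, True, False, False, False]
TOY_ROC = roc_points(TOY_SCORES, TOY_LABELS)
```

Each tuple is one point on the curve; the operating threshold is then picked where the trade-off between the two rates is acceptable.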

Page 8: Meetup - August 25, 2015

Text Mining - Address Fingerprinting

The following address expressions are actually the same address:

● sco-74, F.f, swastik vihar, mdc, sec-5 sco-74, F.f, swastik vihar, mdc, sec-5
● SCO-74, F.F, Swastik Vihar, MDC, Sec-5, Panchkula, Haryana SCO-74, F.F, Swastik Vihar, MDC, Sec-5, Panchkula, Haryana
● SCO-74, F.F, Swastik Vihar, MDC, Sec-5, Panchkula, Haryana PANCHKULA

The fact that someone would purposefully write the same address in multiple ways is an indicator that they are trying to deceive the system.

Objective: Build an algorithm that maps all addresses with the same semantic meaning to one hash. All marketplace activity can then be associated with one identity to detect fraudulent behaviour.
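
One simple way to approach such a fingerprint is to normalise the text, discard repeated and order-dependent information, and hash what remains. The sketch below is an assumption-laden illustration, not the algorithm presented in the talk: the tokenisation rule, the use of SHA-1, and the `LOCALITY_TOKENS` list of city and state words to ignore are all hypothetical.

```python
import hashlib
import re

# Hypothetical locality words dropped before hashing (an assumption).
LOCALITY_TOKENS = frozenset({"panchkula", "haryana"})

def address_fingerprint(address, drop=LOCALITY_TOKENS):
    """Map an address string to a hash that ignores case, punctuation,
    token order, repeated fragments, and the words in `drop`."""
    # Lowercase, then treat every non-alphanumeric run as a separator.
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    # A set removes repeats; sorting removes ordering differences.
    unique = sorted(set(tokens) - drop)
    return hashlib.sha1(" ".join(unique).encode("utf-8")).hexdigest()

# The three address expressions from the slide.
ADDRESSES = [
    "sco-74, F.f, swastik vihar, mdc, sec-5 sco-74, F.f, swastik vihar, mdc, sec-5",
    "SCO-74, F.F, Swastik Vihar, MDC, Sec-5, Panchkula, Haryana "
    "SCO-74, F.F, Swastik Vihar, MDC, Sec-5, Panchkula, Haryana",
    "SCO-74, F.F, Swastik Vihar, MDC, Sec-5, Panchkula, Haryana PANCHKULA",
]
FINGERPRINTS = {address_fingerprint(a) for a in ADDRESSES}
```

With these choices the three expressions collapse to a single fingerprint. A production version would also need to handle abbreviations and misspellings, which exact token matching does not.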

Page 9: Meetup - August 25, 2015

Text Mining - Address Fingerprinting (Cont.)

Number of distinct customers per address fingerprint. Annotation: 62 customers / address fingerprint.

Measure of transaction velocity per address fingerprint.
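
Both measures can be computed by grouping transactions on the fingerprint. The sketch below is only illustrative: the talk does not define transaction velocity, so it is assumed here to mean transactions per day over the inclusive date span, and the record layout and toy data are hypothetical.

```python
from collections import defaultdict
from datetime import date

def summarize(transactions):
    """transactions: iterable of (fingerprint, customer_id, date) tuples."""
    customers = defaultdict(set)
    days = defaultdict(list)
    for fingerprint, customer_id, day in transactions:
        customers[fingerprint].add(customer_id)
        days[fingerprint].append(day)
    summary = {}
    for fingerprint in customers:
        # Inclusive number of days between first and last transaction.
        span = (max(days[fingerprint]) - min(days[fingerprint])).days + 1
        summary[fingerprint] = {
            "distinct_customers": len(customers[fingerprint]),
            "transactions_per_day": len(days[fingerprint]) / span,
        }
    return summary

# Hypothetical transactions for two fingerprints.
TOY_TX = [
    ("fp_a", "c1", date(2015, 8, 1)),
    ("fp_a", "c2", date(2015, 8, 1)),
    ("fp_a", "c3", date(2015, 8, 2)),
    ("fp_a", "c1", date(2015, 8, 2)),
    ("fp_b", "c9", date(2015, 8, 1)),
    ("fp_b", "c9", date(2015, 8, 10)),
]
TOY_SUMMARY = summarize(TOY_TX)
```

A fingerprint shared by many distinct customers, or one with unusually high velocity, is the kind of outlier these charts are meant to surface.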

Page 10: Meetup - August 25, 2015

Community Detection

Based on associations between address fingerprints and remote IP addresses, we can generate a network of related address fingerprints.
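
As a rough illustration, fingerprints can be linked whenever they were seen from the same IP address, and the resulting graph split into connected components. Connected components are used here as a simple stand-in for a community detection method; the talk does not say which algorithm was used, and the observation pairs below are hypothetical.

```python
from collections import defaultdict

def related_groups(observations):
    """observations: iterable of (address_fingerprint, ip_address) pairs.

    Returns a list of sets; each set holds fingerprints that are connected
    directly or indirectly through shared IP addresses.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_fingerprint_for_ip = {}
    for fingerprint, ip in observations:
        find(fingerprint)  # register fingerprints that share no IP
        if ip in first_fingerprint_for_ip:
            union(first_fingerprint_for_ip[ip], fingerprint)
        else:
            first_fingerprint_for_ip[ip] = fingerprint

    groups = defaultdict(set)
    for fingerprint in parent:
        groups[find(fingerprint)].add(fingerprint)
    return sorted(groups.values(), key=lambda g: sorted(g))

# Hypothetical (fingerprint, IP) observations.
TOY_OBSERVATIONS = [
    ("fp1", "10.0.0.1"),
    ("fp2", "10.0.0.1"),
    ("fp2", "10.0.0.2"),
    ("fp3", "10.0.0.2"),
    ("fp4", "10.0.0.9"),
]
TOY_GROUPS = related_groups(TOY_OBSERVATIONS)
```

Here `fp1` and `fp3` never share an IP directly but end up in the same group through `fp2`, which is the indirect association a network view makes visible.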

Page 11: Meetup - August 25, 2015

Q & A

Now is the time to ask questions...