The Human Algorithm: Automating Startup Data Collection at Mattermark

#datapointlive The Human Algorithm: Automating Startup Data Collection at Mattermark Sarah Catanzaro, Head of Data at Mattermark @sarahcat21

Upload: janessa-lantz

Post on 21-Apr-2017

#DPL15 | @sarahcat21

Mattermark

Mattermark is a deal intelligence platform and private company database used by

● investors
● business and corporate development
● sales

THE CHALLENGE

Scale + Information Overload + Stealth

Scale

Over 125 million private companies in the world (only about 45.5 thousand public).

Information overload

Stealth

● Private companies do not have strong incentives (e.g. legal obligations) to share data. Many may have competitive incentives to obfuscate information.

● Investors may request non-disclosure.

Mattermark’s Solution

Software-oriented approach

● A must, due to the scale of our dataset
    ○ 1.3 million companies
    ○ 16.5k investors
    ○ 110k funding events
● Leverage a lean data team

Data collection strategy

● Web scraping
● Machine learning
● Direct submission
● Manual data entry

The “Human Algorithm”

Investors ask questions like

What startups might raise capital in the next 6 months?

What startups is Stephanie Palmeri investing in?

Our data analysts seek to understand:

● Why does this question matter?
● What data is required to answer this question?
● Where can this data be accessed?

Next, data analysts:

1. Define repeatable processes for data collection.
2. Determine whether processes can be replicated through web scraping and/or machine learning algorithms to collect data at scale.
3. Write functional specifications, reviewed by sales and engineering team members.

Next, web and/or machine learning engineers:

1. Write dev designs, reviewed by data analysts.
2. Upon implementation and marketing release, this data becomes available to customers.
3. New questions arise and the cycle starts again.

Funding Automation

Investors ask questions like

How much funding has a company already raised?

Who were the investors at each of those rounds?

Problems with existing sources

Existing sources rely on wiki-style data collection (cannot confirm the credibility of sources).

News reports are better, but

● facts are harder to extract
● different sources report different figures

Solution: funding automation

A new framework for collecting and synthesizing funding data.

1. News article fact extraction (machine learning)
2. Funding override system (web engineering)
3. Funding confirmation email campaign (marketing)

1. News article fact extraction

Crawl RSS feeds, extract data from stories (title, text, links, etc.)

● 750+ sources
● 5,000 - 10,000 articles
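
A minimal sketch of the crawl-and-extract step, assuming a standard RSS 2.0 feed. The sample feed, the `parse_feed` helper, and the field names are illustrative assumptions, not Mattermark's crawler; a real crawler would download each feed over HTTP and likely fetch the full article text as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real crawler would download this over HTTP.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Startup News</title>
    <item>
      <title>Acme raises $5 million Series A</title>
      <link>https://example.com/acme-series-a</link>
      <description>Acme closed a $5 million Series A led by Example Ventures.</description>
    </item>
    <item>
      <title>Ten tips for better stand-ups</title>
      <link>https://example.com/stand-ups</link>
      <description>Keep them short.</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one dict per <item> holding its title, text, and link."""
    root = ET.fromstring(xml_text)
    stories = []
    for item in root.iter("item"):
        stories.append({
            "title": (item.findtext("title") or "").strip(),
            "text": (item.findtext("description") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return stories


stories = parse_feed(SAMPLE_FEED)
```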

1. News article fact extraction

Classify stories about funding

● 250 articles/day
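
The deck does not say which model does the classification. As a stand-in, the sketch below flags a story when enough cue phrases appear; the cue list, the `min_hits` threshold, and the function name are assumptions, and a trained classifier would learn such signals from labeled articles instead.

```python
import re

# Hypothetical cue phrases; a trained model would learn these from labeled data.
FUNDING_CUES = (
    "raises", "raised", "funding", "series a", "series b", "seed round", "led by",
)


def is_funding_story(title, text, min_hits=2):
    """Flag a story as funding-related when at least min_hits cue phrases appear."""
    haystack = f"{title} {text}".lower()
    hits = sum(
        1
        for cue in FUNDING_CUES
        if re.search(r"\b" + re.escape(cue) + r"\b", haystack)
    )
    return hits >= min_hits
```

Requiring two cues rather than one is a crude way to trade recall for precision.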

1. News article fact extraction

● Identify sentences containing information about investors, amount, and/or series
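
One way to picture this step is a sentence filter that tags which kinds of fact a sentence seems to carry. The regular expressions and tag names below are assumptions chosen for illustration; the production system is described as machine learning, not hand-written patterns.

```python
import re

# Hypothetical patterns for the three fact types named on the slide.
AMOUNT = re.compile(
    r"\$\s?\d+(?:\.\d+)?\s?(?:k|m|b|thousand|million|billion)\b", re.IGNORECASE
)
SERIES = re.compile(r"\b(?:seed|series\s+[a-h])\b", re.IGNORECASE)
INVESTOR = re.compile(
    r"\b(?:led by|backed by|investors? include|participation from)\b", re.IGNORECASE
)
PATTERNS = (("amount", AMOUNT), ("series", SERIES), ("investor", INVESTOR))


def split_sentences(text):
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def funding_sentences(text):
    """Return (sentence, tags) pairs for sentences matching any pattern."""
    results = []
    for sentence in split_sentences(text):
        tags = [name for name, pattern in PATTERNS if pattern.search(sentence)]
        if tags:
            results.append((sentence, tags))
    return results


found = funding_sentences(
    "Acme makes widgets. It raised $5 million in a Series A led by "
    "Example Ventures. The company is based in Austin."
)
```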

1. News article fact extraction

● Extract facts
● Match companies and investors to entities in our database
    ○ 30% of extracted articles are entered automatically
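
The matching step can be sketched as name normalization plus a fuzzy lookup. The `KNOWN_COMPANIES` table, the suffix list, and the `0.85` cutoff are hypothetical; the deck does not describe the actual matching method.

```python
import difflib
import re

# Hypothetical stand-in for the company table: normalized name -> entity id.
KNOWN_COMPANIES = {"acme": 101, "example labs": 102, "widget works": 103}

# Legal suffixes that rarely help distinguish one company from another.
SUFFIXES = re.compile(r"\b(?:inc|llc|ltd|corp|co)\b\.?", re.IGNORECASE)


def normalize(name):
    """Lowercase, drop legal suffixes and punctuation, collapse whitespace."""
    name = SUFFIXES.sub("", name.lower())
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    return " ".join(name.split())


def match_company(name, cutoff=0.85):
    """Return the entity id for an extracted name, or None if no confident match."""
    key = normalize(name)
    if key in KNOWN_COMPANIES:
        return KNOWN_COMPANIES[key]
    close = difflib.get_close_matches(key, list(KNOWN_COMPANIES), n=1, cutoff=cutoff)
    return KNOWN_COMPANIES[close[0]] if close else None
```

Returning `None` instead of guessing leaves uncertain matches for a person to resolve, which fits a pipeline where only some articles are entered automatically.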

2. Funding override system

● Identify reports about the same funding event
● Combine information from multiple reports using the Wongi rules engine
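
The two bullets above can be illustrated in plain Python. This sketch does not use the Wongi rules engine, and its rules (group by company and series, take the most frequently reported amount, union the investor lists) are assumptions made for illustration, as are the sample reports.

```python
from collections import Counter, defaultdict

# Hypothetical extracted reports; field names are illustrative.
REPORTS = [
    {"company_id": 101, "series": "A", "amount": 5_000_000,
     "investors": ["Example Ventures"], "source": "news-1"},
    {"company_id": 101, "series": "A", "amount": 5_000_000,
     "investors": ["Example Ventures", "Sample Capital"], "source": "news-2"},
    {"company_id": 101, "series": "A", "amount": 5_500_000,
     "investors": [], "source": "news-3"},
    {"company_id": 102, "series": "seed", "amount": 750_000,
     "investors": ["Sample Capital"], "source": "news-1"},
]


def group_by_event(reports):
    """Treat reports with the same company and series as one funding event."""
    events = defaultdict(list)
    for report in reports:
        events[(report["company_id"], report["series"])].append(report)
    return events


def combine(reports):
    """Merge reports: most commonly reported amount, union of investors."""
    amounts = Counter(r["amount"] for r in reports if r["amount"] is not None)
    investors = sorted({name for r in reports for name in r["investors"]})
    return {
        "amount": amounts.most_common(1)[0][0] if amounts else None,
        "investors": investors,
        "sources": [r["source"] for r in reports],
    }


merged = {key: combine(group) for key, group in group_by_event(REPORTS).items()}
```

A rules engine earns its keep when the merge logic grows beyond a few cases like these, for example when some sources should override others.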

3. Funding confirmation email campaign

Use the CRM and HubSpot to automatically send emails to founders after an equity financing.

What We Learned

Where we struggled

Our initial implementation of a funding override system was inefficient. Why?

Because our data analysts and developers were not aligned on functional requirements.

Solution

● Analysts must work closely with developers
    ○ Pre-spec check-ins
    ○ Analysts review dev designs to ensure that the system design addresses the use case.
● Analysts must avoid being prescriptive
● Analysts must understand data mining and machine learning concepts

Where we succeeded

Implementation of news article fact extraction was successful. Why?

Because data analysts and developers worked as service providers to each other.

How We Did It

1. Tighter Analyst + Dev Communication

Tiger teams: 1 ML developer, 1 web/infrastructure developer, 1 data analyst, 1 project lead

Define milestones & hold daily stand-ups.

3. Track II interaction reinforces the symbiotic relationship

● Devs lead Python learning group
● Data analysts hold seminars on topics like admin tooling and alternative assets

Thank You!