feature surfacing - meetup

Feature surfacing: Discover, Aggregate & Evaluate

Jean-Baptiste PRIEZ, 8 March 2017

Feature engineering

XOR

Inputs: X, Y

Target: Z = (XY > 0)
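The XOR-style example can be checked numerically. The following is a minimal sketch, assuming a small symmetric grid of points (the grid values and variable names are illustrative, not from the slides): neither X nor Y alone is linearly related to Z, while the engineered feature X*Y separates the two classes with a single threshold.

```python
import numpy as np

# Symmetric grid of (X, Y) points; zero is excluded so the sign of X*Y is defined.
values = np.array([-2.0, -1.0, 1.0, 2.0])
X, Y = np.meshgrid(values, values)
X, Y = X.ravel(), Y.ravel()

# Target from the slide: Z = (XY > 0)
Z = (X * Y > 0).astype(float)

# Linear correlation of each raw variable with the target.
corr_x = np.corrcoef(X, Z)[0, 1]
corr_y = np.corrcoef(Y, Z)[0, 1]

# Engineered feature: the product X*Y, thresholded at zero.
engineered = X * Y
accuracy = np.mean((engineered > 0) == (Z == 1))

print(corr_x, corr_y, accuracy)
```

On this grid both correlations vanish, which is why a model limited to the raw variables would need the product feature to be handed to it.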

Users (CustomerId, Firstname, Lastname, Age)

Sales (CustomerId, Product, Amount, Time)

Web (CustomerId, Page, Time)

Flattened table, one row per user:
• Users.Customer_Id
• Users.Firstname
• Users.Lastname
• Users.Age
• Outcome
• Count(Sales.Product)
• CountDistinct(Sales.Product)
• Mean(Sales.Amount)
• Sum(Sales.Amount) where Sales.Product = 'Mobile Data'
• Count(Web.Page) where Day(Web.Time) in [6;7]
• …
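As a rough illustration of how such columns could be computed by hand, here is a sketch using pandas on invented toy data. The rows, the names `MOBILE_SUM` and `WEEKEND_COUNT`, and the reading of `Day(...) in [6;7]` as Saturday and Sunday (with Monday = 1) are assumptions made for this example; the Outcome column is left out.

```python
import pandas as pd

# Toy data (invented for illustration only).
users = pd.DataFrame({
    "CustomerId": [1, 2],
    "Firstname": ["Ada", "Bob"],
    "Lastname": ["Lovelace", "Martin"],
    "Age": [36, 52],
})
sales = pd.DataFrame({
    "CustomerId": [1, 1, 1, 2],
    "Product": ["Mobile Data", "Phone", "Mobile Data", "Phone"],
    "Amount": [10.0, 300.0, 20.0, 150.0],
    "Time": pd.to_datetime(["2017-01-02", "2017-01-10", "2017-02-02", "2017-01-15"]),
})
web = pd.DataFrame({
    "CustomerId": [1, 2, 2],
    "Page": ["/home", "/offers", "/home"],
    "Time": pd.to_datetime(["2017-01-07", "2017-01-08", "2017-01-09"]),
})

MOBILE_SUM = "Sum(Sales.Amount) where Sales.Product = 'Mobile Data'"
WEEKEND_COUNT = "Count(Web.Page) where Day(Web.Time) in [6;7]"

# Plain aggregates: one value per customer.
by_customer = sales.groupby("CustomerId")
features = pd.DataFrame({
    "Count(Sales.Product)": by_customer["Product"].count(),
    "CountDistinct(Sales.Product)": by_customer["Product"].nunique(),
    "Mean(Sales.Amount)": by_customer["Amount"].mean(),
})

# Filter + aggregate: customers with no matching row get 0.
mobile = sales[sales["Product"] == "Mobile Data"]
features[MOBILE_SUM] = (
    mobile.groupby("CustomerId")["Amount"].sum().reindex(features.index, fill_value=0.0)
)

# pandas counts Monday as 0, so add 1 to get Saturday = 6 and Sunday = 7.
weekend = web[(web["Time"].dt.dayofweek + 1).isin([6, 7])]
features[WEEKEND_COUNT] = (
    weekend.groupby("CustomerId")["Page"].count().reindex(features.index, fill_value=0)
)

# Flatten: attach the aggregates to the central Users table.
flat = users.join(features, on="CustomerId")
print(flat)
```

Each new column requires its own hand-written line here, which is the manual effort that feature surfacing aims to automate.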

Feature surfacing

LET’S START WITH AN EXAMPLE…

Feature Surfacing

Example: Outbound Mail Campaign

1 Central Table

Customer

(Id, e-mail address, age, state, will buy within 5 days)

Example: Outbound Mail Campaign

3 Peripheral Tables

Pages visited on the website
(visited pages, duration of the session, browser type…)

Orders
(number of products, amount spent, order status...)

E-mail campaign reactions
(action, action type, time since e-mail was sent…)

Which sources and variables should we choose? How should we represent them?

Should we sit and meditate around a table? Should we try each and every variable manually?

What if we let the machine work?

It’s a Machine Learning problem:
• How to smartly explore the entire set of possible aggregates?
• Without under/overfitting?
• With a linearithmic complexity?

What is Feature Surfacing?

1. Extraction of information contained in a multi-table data source
• Aggregation operators
• Filter operators

2. Evaluation of aggregates extracted from a star-relational data schema

Feature surfacing consists of applying a set of aggregation operators to the peripheral tables to generate features in the central table.

[Diagram: star schema. One central table linked to four peripheral tables* (Peripheral table 1 to Peripheral table 4); each link carries the cardinalities 1,1 and 0,n.]

* 1 row per entity in the central table, corresponding to several rows for the same entity in the peripheral table.

Extraction, then Evaluation (supervised)

What are the operators?

Some aggregation operators:

Name            Return type   Operands     Label
Count           Num           Table        Number of records
CountDistinct   Num           Table, Cat   Number of distinct values
Mode            Cat           Table, Cat   Most frequent value
Mean            Num           Table, Num   Mean value
StdDev          Num           Table, Num   Standard deviation
Median          Num           Table, Num   Median value
Min             Num           Table, Num   Min value
Max             Num           Table, Num   Max value
Sum             Num           Table, Num   Sum of values

Some filter operators:

Name   Return type   Operands       Label
<, ≤   Table         Table, Num     Table filtered over field values smaller than (or equal to) a record
>, ≥   Table         Table, Num     Table filtered over field values greater than (or equal to) a record
=      Table         Table, Field   Table filtered over field values equal to a record

Customize your operators:
• Date: before, after, weekend, etc…
• Time: morning, afternoon, etc…
• String: split, infinitive verb, etc…
• ...
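Because every operator declares the type of its operands, candidate aggregates can be enumerated mechanically from a table's schema. The sketch below illustrates that idea only; the `ORDERS` schema, the `AGGREGATES` dictionary and the function name are hypothetical and do not come from any particular tool.

```python
# Aggregation operators from the table above: name -> type of the field operand.
# None means the operator takes the table alone.
AGGREGATES = {
    "Count": None,
    "CountDistinct": "Cat",
    "Mode": "Cat",
    "Mean": "Num",
    "StdDev": "Num",
    "Median": "Num",
    "Min": "Num",
    "Max": "Num",
    "Sum": "Num",
}

# Hypothetical schema of one peripheral table: field name -> type.
ORDERS = {"product": "Cat", "amount": "Num", "status": "Cat"}


def candidate_aggregates(table_name, schema):
    """List every aggregate whose operand type matches a field of the table."""
    candidates = []
    for op, operand_type in AGGREGATES.items():
        if operand_type is None:
            candidates.append(f"{op}({table_name})")
            continue
        for field, field_type in schema.items():
            if field_type == operand_type:
                candidates.append(f"{op}({table_name}, {field})")
    return candidates


for candidate in candidate_aggregates("Orders", ORDERS):
    print(candidate)
```

Filter operators return a table, so they could be composed in front of any of these candidates, which is what makes the space of possible aggregates grow so quickly.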

Presentation of some smart aggregates

1. Count(Pages visited)
   Number of pages visited by the customer

2. Max(Orders, amount spent)
   The maximum amount spent by the customer

3. Mode(Email reactions, action type)
   The most frequent email request of the customer

4. Median(Pages visited, duration) when Pages visited.device = “smartphone”
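To make these four aggregates concrete, here is a small sketch with pandas on invented toy tables. The column names (`customer_id`, `duration`, `device`, `amount_spent`, `action_type`) and all values are assumptions for illustration.

```python
import pandas as pd

# Toy peripheral tables (invented values).
pages = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "duration": [30.0, 50.0, 10.0, 20.0],
    "device": ["smartphone", "smartphone", "desktop", "desktop"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount_spent": [40.0, 90.0, 15.0],
})
reactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "action_type": ["open", "click", "open", "unsubscribe"],
})

MEDIAN_COL = "Median(PagesVisited, duration) when device = 'smartphone'"

# One row per customer, as in the central table.
features = pd.DataFrame(index=pd.Index([1, 2], name="customer_id"))

# 1. Count(Pages visited)
features["Count(PagesVisited)"] = pages.groupby("customer_id").size()

# 2. Max(Orders, amount spent)
features["Max(Orders, amount_spent)"] = orders.groupby("customer_id")["amount_spent"].max()

# 3. Mode(Email reactions, action type); ties go to the first value in sorted order.
features["Mode(EmailReactions, action_type)"] = (
    reactions.groupby("customer_id")["action_type"].agg(lambda s: s.mode().iloc[0])
)

# 4. Filter on the device, then aggregate; customers without a match stay missing.
smartphone = pages[pages["device"] == "smartphone"]
features[MEDIAN_COL] = smartphone.groupby("customer_id")["duration"].median()

print(features)
```

Customer 2 never used a smartphone in this toy data, so the filtered median is left missing rather than set to zero; how to treat such missing values is a separate modelling choice.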

How to be smart?

• Good aggregate
• 1st: Aggregation ☀❤🐰

• 2nd: Filter + Aggregation ⭐

• 3rd: Filter + Filter + Aggregation ⚠♨🤔

• … etc ... ⛔🔞

M. BOULLÉ. Towards Automatic Feature Construction for Supervised Classification. In ECML/PKDD, pp. 181-196, 2014.

How to evaluate and select features?

• Discretization / Grouping → Correlation with the target
• Select (the most) correlated features

Target set (e.g. sick, healthy)

Split such that the trade-off between entropy & compression is optimal
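The "discretize, then measure the link with the target" step can be sketched as follows. This example swaps in plain equal-frequency binning and mutual information as a stand-in; it is not one of the supervised discretization methods discussed in this talk, and the function names and toy data are assumptions.

```python
import numpy as np


def mutual_information(bins, target):
    """Mutual information (in bits) between two integer-coded arrays."""
    joint = np.zeros((bins.max() + 1, target.max() + 1))
    np.add.at(joint, (bins, target), 1)          # contingency table
    joint /= joint.sum()                          # joint probabilities
    p_bin = joint.sum(axis=1, keepdims=True)
    p_target = joint.sum(axis=0, keepdims=True)
    independent = p_bin @ p_target                # product of the marginals
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log2(joint[nonzero] / independent[nonzero])))


def score_feature(values, target, n_bins=4):
    """Discretize into equal-frequency bins, then score against the target."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(values, edges)
    return mutual_information(bins, target)


# Toy data: one feature ordered like the target, one unrelated to it.
target = np.array([0, 0, 0, 0, 1, 1, 1, 1])
informative = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0])
unrelated = np.array([1.0, 10.0, 3.0, 12.0, 2.0, 11.0, 4.0, 13.0])

scores = {
    "informative": score_feature(informative, target),
    "unrelated": score_feature(unrelated, target),
}
print(scores)
```

Features would then be ranked by score and the best ones kept. A fixed number of bins is the weak point of this stand-in: too many bins overfit and too few underfit, which is the trade-off the supervised discretization methods address.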

Discretization algorithms

• ChiMerge (R, SAS)
• Optimize entropy

• C4.5 (…)
• Optimize compression

• Fusinter (Zighed & co - Sipina)
• MDL-disc / MDLP (Fayyad & Irani, Pfahringer - Spark)
• MODL (Boullé)
• Optimize both: entropy & compression

MODL explained simply

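Target set (e.g. sick, healthy)

I: i_1 i_2 i_3 i_4 i_5 i_6 i_7

Discretize with MODL = minimize the following formula:

$$
\mathrm{Value}(D) = \log n + \log \binom{n+I-1}{I-1} + \sum_{i=1}^{I} \log \binom{n_i+J-1}{J-1} + \sum_{i=1}^{I} \log \frac{n_i!}{n_{i,1}!\, n_{i,2}! \cdots n_{i,J}!}
$$

Formula annotations: entropy, compression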

Interpretation of smart aggregates calculated over the visited pages table

For each customer:

Count(VisitedPages) = Number of visited pages

The interpretation graphic shows that:
• there is a niche of future buyers: those who visited more than 96.5 pages over the period (top segment)
• the majority of the base visited no pages, or only a few pages, of the site over the period

For each customer:

Median(VisitedPages, duration) = median duration of stay on a specific page

+ & -

• + Good complexity
• + Statistically efficient
• + Manages overfitting by design

• - Not enough to win every Kaggle contest…

Let’s stay in touch!

Jean-Baptiste PRIEZ, Data Scientist

[email protected]


Data & Analytics