feature surfacing - meetup
Feature surfacing: Discover, Aggregate & Evaluate
Jean-Baptiste PRIEZ, 8 March 2017
Feature engineering
XOR example with inputs X and Y: Z = (XY > 0)
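The XOR slide can be illustrated with a few lines of code. This is a minimal sketch on a hypothetical toy grid: the sign of X alone, or of Y alone, says nothing about Z, while the sign of the constructed product XY determines Z completely. The `sign_rule_accuracy` helper is an assumption made for this illustration.

```python
# Toy grid of points with no zero coordinate (hypothetical data).
points = [(x, y) for x in (-2, -1, 1, 2) for y in (-2, -1, 1, 2)]
z = [x * y > 0 for x, y in points]  # Z = (XY > 0)


def sign_rule_accuracy(values, labels):
    """Accuracy of the better of two rules: 'value > 0' or 'value <= 0'."""
    hits = sum((v > 0) == label for v, label in zip(values, labels))
    return max(hits, len(labels) - hits) / len(labels)


acc_x = sign_rule_accuracy([x for x, _ in points], z)       # X alone
acc_y = sign_rule_accuracy([y for _, y in points], z)       # Y alone
acc_xy = sign_rule_accuracy([x * y for x, y in points], z)  # engineered XY
print(acc_x, acc_y, acc_xy)
```

On this grid the raw variables do no better than chance, while the engineered product separates the two classes.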
Users(CustomerId, Firstname, Lastname, Age)
Sales(CustomerId, Product, Amount, Time)
Web(CustomerId, Page, Time)
Columns of the flattened table:
• Users.Customer_Id
• Users.Firstname
• Users.Lastname
• Users.Age
• Outcome
• Count(Sales.Product)
• CountDistinct(Sales.Product)
• Mean(Sales.Amount)
• Sum(Sales.Amount) where Sales.Product = 'Mobile Data'
• Count(Web.Page) where Day(Web.Time) in [6;7]
• …
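The flattening above can be sketched in plain Python. The rows below are invented for illustration, the `rows_for` and `build_features` helpers are hypothetical, and `Day(...) in [6;7]` is read here as ISO weekdays 6 and 7 (Saturday and Sunday), which is an assumption. The Outcome column is left out because it comes from the target, not from an aggregate.

```python
from datetime import datetime
from statistics import mean

# Hypothetical toy rows for the three tables.
users = [
    {"CustomerId": 1, "Firstname": "Ada", "Lastname": "Martin", "Age": 34},
    {"CustomerId": 2, "Firstname": "Bob", "Lastname": "Durand", "Age": 51},
]
sales = [
    {"CustomerId": 1, "Product": "Mobile Data", "Amount": 10.0, "Time": datetime(2017, 3, 1, 10)},
    {"CustomerId": 1, "Product": "Mobile Data", "Amount": 20.0, "Time": datetime(2017, 3, 2, 11)},
    {"CustomerId": 1, "Product": "Handset", "Amount": 300.0, "Time": datetime(2017, 3, 3, 12)},
]
web = [
    {"CustomerId": 1, "Page": "/home", "Time": datetime(2017, 3, 4, 9)},   # Saturday
    {"CustomerId": 1, "Page": "/offers", "Time": datetime(2017, 3, 5, 9)}, # Sunday
    {"CustomerId": 1, "Page": "/home", "Time": datetime(2017, 3, 6, 9)},   # Monday
    {"CustomerId": 2, "Page": "/home", "Time": datetime(2017, 3, 8, 9)},   # Wednesday
]


def rows_for(table, customer_id):
    """Rows of a peripheral table that belong to one customer."""
    return [r for r in table if r["CustomerId"] == customer_id]


def build_features(user):
    """One flat row per user: user fields plus aggregates of Sales and Web."""
    s = rows_for(sales, user["CustomerId"])
    w = rows_for(web, user["CustomerId"])
    amounts = [r["Amount"] for r in s]
    return {
        **user,
        "Count(Sales.Product)": len(s),
        "CountDistinct(Sales.Product)": len({r["Product"] for r in s}),
        "Mean(Sales.Amount)": mean(amounts) if amounts else None,
        "Sum(Sales.Amount) where Sales.Product = 'Mobile Data'": sum(
            r["Amount"] for r in s if r["Product"] == "Mobile Data"
        ),
        "Count(Web.Page) where Day(Web.Time) in [6;7]": sum(
            1 for r in w if r["Time"].isoweekday() in (6, 7)
        ),
    }


flat = [build_features(u) for u in users]
```

A customer with no sales still gets a row: counts and sums fall back to 0, and the mean is left as `None` in this sketch.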
Feature surfacing
LET’S START WITH AN EXAMPLE…
Example: Outbound Mail Campaign
1 Central Table
Customer
(Id, e-mail address, age, state, will buy within 5 days)
Example: Outbound Mail Campaign
3 Peripheral Tables
Pages visited on the website
(visited pages, duration of the session, browser type…)
Orders
(number of products, amount spent, order status...)
E-mail campaign reactions
(action, action type, time since e-mail was sent…)
Which sources and variables should we choose? How should we represent them?
Should we sit and meditate around a table? Should we try each and every variable manually?
What if we let the machine work?
It’s a Machine Learning problem:
• How to smartly explore the entire set of possible aggregates?
• Without under/overfitting?
• With a linearithmic complexity?
What is Feature Surfacing?
1. Extraction of information contained in a multi-table data source
• Aggregation operators
• Filter operators
2. Evaluation of aggregates extracted from a star-relational data schema
Feature surfacing consists of applying a set of aggregation operators to the peripheral tables to generate features in the central table.
[Diagram: star schema. A central table is linked to peripheral tables 1 to 4; the links carry the cardinalities 1,1 and 0,n.]
* 1 row per entity in the central table, corresponding to several rows for the same entity in the peripheral table.
Extraction / Evaluation (supervised)
What are the operators?
Some aggregation operators:
Name           Return type  Operands    Label
Count          Num          Table       Number of records
CountDistinct  Num          Table, Cat  Number of distinct values
Mode           Cat          Table, Cat  Most frequent value
Mean           Num          Table, Num  Mean value
StdDev         Num          Table, Num  Standard deviation
Median         Num          Table, Num  Median value
Min            Num          Table, Num  Min value
Max            Num          Table, Num  Max value
Sum            Num          Table, Num  Sum of values
Some filter operators:
Name  Return type  Operands      Label
<, ≤  Table        Table, Num    Table filtered on field values smaller than (or equal to) a given value
>, ≥  Table        Table, Num    Table filtered on field values greater than (or equal to) a given value
=     Table        Table, Field  Table filtered on field values equal to a given value
Customize your operators:
• Date: before, after, week-end, etc…
• Time: morning, afternoon, etc…
• String: split, infinitive verb, etc…
• ...
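The operator tables and the custom filters can be read as small composable functions: a filter maps a table to a table, and an aggregation maps a table to a value. The sketch below assumes a table is a list of dicts; the names `where`, `is_weekend`, `is_morning`, `count` and `median_of`, and the sample visits, are hypothetical and not taken from any library.

```python
from datetime import datetime
from statistics import median


def where(table, field, predicate):
    """Filter operator: Table -> Table, keeping rows whose field passes the predicate."""
    return [row for row in table if predicate(row[field])]


def is_weekend(t):
    """Custom Date predicate: Saturday or Sunday."""
    return t.isoweekday() >= 6


def is_morning(t):
    """Custom Time predicate: before noon."""
    return t.hour < 12


def count(table):
    """Aggregation operator: number of records."""
    return len(table)


def median_of(table, field):
    """Aggregation operator: median of a numeric field, None on an empty table."""
    values = [row[field] for row in table]
    return median(values) if values else None


# Hypothetical visits of a single customer.
visits = [
    {"time": datetime(2017, 3, 4, 9, 30), "duration": 40},  # Saturday morning
    {"time": datetime(2017, 3, 4, 15, 0), "duration": 10},  # Saturday afternoon
    {"time": datetime(2017, 3, 5, 10, 0), "duration": 20},  # Sunday morning
    {"time": datetime(2017, 3, 6, 8, 0), "duration": 5},    # Monday morning
]

weekend_visits = count(where(visits, "time", is_weekend))
weekend_morning_median = median_of(
    where(where(visits, "time", is_weekend), "time", is_morning), "duration"
)
```

Because filters return tables, they chain freely: one filter followed by an aggregation, or two filters followed by an aggregation, use the same building blocks.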
Presentation of some smart aggregates
1. Count(Pages visited): the number of pages visited by the customer
2. Max(Orders, amount spent): the maximum amount spent by the customer
3. Mode(Email reactions, action type): the most frequent e-mail request of the customer
4. Median(Pages visited, duration) when Pages visited.device = “smartphone”
How to be smart?
• Good aggregate
• 1st: Aggregation ☀❤🐰
• 2nd: Filter + Aggregation ⭐
• 3rd: Filter + Filter + Aggregation ⚠♨🤔
• … etc ... ⛔🔞
M. BOULLÉ. Towards Automatic Feature Construction for Supervised Classification. In ECML/PKDD, pp. 181-196, 2014.
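Why each extra filter deserves more caution can be sketched by counting candidates. The schema, the condition lists, and the helper names below are hypothetical; the operator list is the one from the slides. This only illustrates how the search space grows and is not the construction procedure of the cited paper.

```python
from itertools import combinations, product

# Hypothetical peripheral table: field name -> type.
FIELDS = {"page": "cat", "duration": "num", "device": "cat"}

# Hypothetical filter conditions available on each field.
CONDITIONS = {
    "page": ["= 'home'", "= 'cart'"],
    "duration": ["< 30", ">= 30"],
    "device": ["= 'smartphone'", "= 'desktop'"],
}

# Aggregation operators and the operand type they expect (None: the table itself).
AGGREGATIONS = {
    "Count": None, "CountDistinct": "cat", "Mode": "cat",
    "Mean": "num", "StdDev": "num", "Median": "num",
    "Min": "num", "Max": "num", "Sum": "num",
}


def aggregations(table):
    """All type-compatible aggregation expressions on the table."""
    names = []
    for op, operand_type in AGGREGATIONS.items():
        if operand_type is None:
            names.append(f"{op}({table})")
        else:
            names.extend(
                f"{op}({table}, {field})"
                for field, kind in FIELDS.items()
                if kind == operand_type
            )
    return names


def filter_sets(n_filters):
    """Every way to put one condition on each of n_filters distinct fields."""
    for fields in combinations(CONDITIONS, n_filters):
        for conds in product(*(CONDITIONS[f] for f in fields)):
            yield [f"{f} {c}" for f, c in zip(fields, conds)]


def candidates(table, n_filters):
    """Names of the aggregates built with exactly n_filters filters."""
    names = []
    for filters in filter_sets(n_filters):
        suffix = " where " + " and ".join(filters) if filters else ""
        names.extend(agg + suffix for agg in aggregations(table))
    return names


sizes = [len(candidates("PagesVisited", k)) for k in range(3)]
```

Even on this tiny schema the number of candidates grows from 11 to 66 to 132 as filters are added, so deeper aggregates are more numerous and each one needs stronger evidence to be kept.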
How to evaluate and select features?
• Discretization / Grouping → Correlation with the target
• Select (the most) correlated features
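One simple way to picture "discretize, then measure the correlation with the target" is to score each feature by the information gain of its best two-interval split. This is a swapped-in, entropy-only criterion used purely for illustration, not the evaluation method of the talk; the data and the names `entropy` and `best_split_gain` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def best_split_gain(values, labels):
    """Largest information gain over all two-interval discretizations."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = 0.0
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate equal values
        left = [label for _, label in pairs[:k]]
        right = [label for _, label in pairs[k:]]
        conditional = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = max(best, base - conditional)
    return best


# Hypothetical customers: target and two candidate features.
target = ["buy", "no", "no", "buy", "no", "buy"]
features = {
    "Count(VisitedPages)": [120, 3, 5, 100, 1, 98],
    "Age": [25, 30, 35, 40, 45, 50],
}

gains = {name: best_split_gain(values, target) for name, values in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
```

A criterion based on entropy alone keeps rewarding finer splits, which is why the methods discussed in the talk also account for the cost of the discretization itself.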
Target set (e.g. sick, healthy)
Split such that the trade-off between entropy and compression is optimal
Discretization algorithms
• ChiMerge (R, SAS)
• Optimize entropy
• C4.5 (…)
• Optimize compression
• Fusinter (Zighed & co - Sinipa)
• MDL-disc / MDLP (Fayyad & Irani, Pfahringer - Spark)
• MODL (Boullé)
• Optimize both: entropy & compression
MODL in plain terms
Target set (e.g. sick, healthy)
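I: i_1, i_2, i_3, i_4, i_5, i_6, i_7
Discretize with MODL = minimize the following formula:
\[
\mathrm{Value}(D) = \log n + \log \binom{n + I - 1}{I - 1} + \sum_{i=1}^{I} \log \binom{n_i + J - 1}{J - 1} + \sum_{i=1}^{I} \log \frac{n_i!}{n_{i,1}!\, n_{i,2}! \cdots n_{i,J}!}
\]
where n is the number of instances, I the number of intervals, J the number of target values, n_i the number of instances in interval i, and n_{i,j} the number of instances with target value j in interval i.
The formula balances entropy and compression.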
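The criterion can be evaluated directly from the per-interval class counts. This is a minimal sketch assuming natural logarithms and a discretization given as a list of rows, where row i holds the counts n_{i,1} … n_{i,J}; the function name `modl_value` and the toy counts are hypothetical, and the search for the best interval bounds is not shown.

```python
from math import comb, factorial, log


def modl_value(interval_counts):
    """Value(D) for a discretization; interval_counts[i][j] is n_{i,j}."""
    I = len(interval_counts)           # number of intervals
    J = len(interval_counts[0])        # number of target values
    n = sum(sum(row) for row in interval_counts)

    # Choice of the number of intervals and of their bounds.
    value = log(n) + log(comb(n + I - 1, I - 1))

    for row in interval_counts:
        n_i = sum(row)
        # Choice of the class distribution inside interval i.
        value += log(comb(n_i + J - 1, J - 1))
        # Multinomial term: n_i! / (n_{i,1}! ... n_{i,J}!), an exact integer.
        multinomial = factorial(n_i)
        for n_ij in row:
            multinomial //= factorial(n_ij)
        value += log(multinomial)
    return value


# 10 instances, 2 target values (e.g. healthy, sick).
single_interval = modl_value([[5, 5]])
pure_split = modl_value([[5, 0], [0, 5]])    # each interval holds one class
noisy_split = modl_value([[3, 2], [2, 3]])   # split that barely separates classes
```

On these toy counts the pure split scores lower (better) than keeping a single interval, while the noisy split scores higher: the extra interval costs more than the little class information it brings.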
Interpretation of smart aggregates calculated over the visited pages table
For each customer: Count(VisitedPages) = number of visited pages
The interpretation chart shows that:
• there is a niche of future buyers: those who visited more than 96.5 pages over the period (top segment)
• the majority of the base visited no pages, or only a few pages, of the site over the period
For each customer: Median(VisitedPages, duration) = median duration of stay on a specific page
+ & -
• + Good complexity
• + Statistically efficient
• + Manages overfitting by design
• - Not enough to win every Kaggle contest…