jan25 - ottawa machine learning meetup
![Page 1: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/1.jpg)
CLASSIFYING OMNIBUS BILLS
OTTAWA MACHINE LEARNING MEETUP - JAN. 25TH, 2016
SAMUEL WITHERSPOON, MATHEW SONKE
![Page 2: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/2.jpg)
DISCLAIMER
THIS IS OUR FIRST ITERATION AND IS A WORK IN PROGRESS.
![Page 3: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/3.jpg)
PURPOSE
WE WANT TO SHOW HOW WE MOVE FROM START TO A FIRST SET OF RESULTS IN AN ML PROBLEM
![Page 4: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/4.jpg)
SUMMARY OF EFFORT
≈ 50 HOURS SPENT
≈ 120 BILLS MANUALLY CLASSIFIED
SOURCE CODE: https://github.com/switherspoon/MachineLearningMeetup
![Page 5: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/5.jpg)
WHAT IS AN OMNIBUS BILL?
A BILL THAT HAS A WIDE VARIETY OF TOPICS
TYPICALLY VERY LONG
TYPICALLY LOTS OF OTHER BILLS MODIFIED
FOR EXAMPLE, BILL C-51
![Page 6: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/6.jpg)
THAT DEFINITION INFORMED OUR FEATURES
FEATURES:
LENGTH OF BILL
DIVERSITY OF TOPICS IN THE BILL
NUMBER OF OTHER BILLS MODIFIED/REFERENCED
![Page 7: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/7.jpg)
WHAT DOES AN OMNIBUS LOOK LIKE?
![Page 8: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/8.jpg)
BILL C-51 - 41st PARLIAMENT, 2nd SESSION (4/19 PAGES)
BILL C-54 - 41st PARLIAMENT, 2nd SESSION (1/1 PAGE)
![Page 9: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/9.jpg)
GETTING STARTED
WE USED PYTHON3 WITH:
1. NLTK (http://www.nltk.org/) - FOR NLP
2. SCIKIT-LEARN (http://scikit-learn.org/stable/) - FOR CLASSIFIER
3. GENSIM (https://radimrehurek.com/gensim/) - FOR TOPIC MODEL
4. PSYCOPG2 (http://initd.org/psycopg/) - FOR DATA EXTRACT
ALL INSTALLED WITH PIP3
![Page 10: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/10.jpg)
GETTING STARTED (CONT…)
WE SOURCED OUR DATA FROM:
https://openparliament.ca/
http://parl.gc.ca
![Page 11: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/11.jpg)
DATA ANALYSIS
MANUALLY SKIMMED AND EXTRACTED FEATURES FROM ≈120 BILLS AND BUILT A SPREADSHEET
link: https://docs.google.com/spreadsheets/d/1kpbX78NZQ9bJHGVPoSmLE4LcE4Hht1UXxXg90gV1CVU/edit?usp=sharing
![Page 12: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/12.jpg)
MODEL FEATURES
LENGTH OF BILL
NUMBER OF BILLS REFERENCED
AVERAGE SEMANTIC DISTANCE OF TOPICS IN EACH BILL
![Page 13: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/13.jpg)
THE MODEL
![Page 14: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/14.jpg)
THE CLASSIFIER
NAIVE BAYES
EASY
FAST
UNDERSTANDABLE
WORKS WELL WITH SMALL TRAINING SET (MAYBE NOT THIS SMALL)
![Page 15: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/15.jpg)
LENGTH OF BILL
LENGTH OF RAW STRING READ IN FROM FILES
AS EASY AS: len(raw)
![Page 16: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/16.jpg)
NUMBER OF BILLS REFERENCED
(1) DATA RETRIEVAL
(2) PREPROCESSING
(3) NAMED ENTITY RECOGNITION (NER)
![Page 17: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/17.jpg)
(1) DATA RETRIEVAL
2 DATA SETS TO COLLECT
• CONSOLIDATED LIST OF ACTS
• FULL TEXT OF BILLS
![Page 18: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/18.jpg)
DATA RETRIEVAL CONT…
LIST OF ACTS PROVIDED BY GOVERNMENT OF CANADA (http://laws-lois.justice.gc.ca/eng/acts/)
WE NEEDED A WEB SCRAPER AS NO API IS AVAILABLE
• SCRAPY IS POWERFUL BUT NO PYTHON3 SUPPORT
• IMPORT.IO WORKED WELL FOR OUR NEEDS
![Page 19: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/19.jpg)
DATA RETRIEVAL CONT…
![Page 20: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/20.jpg)
DATA RETRIEVAL CONT…
TEXT OF BILLS RETRIEVED FROM OPENPARLIAMENT DATABASE USING SQL
![Page 21: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/21.jpg)
(2) PREPROCESSING
OPENPARLIAMENT DATABASE ISN’T PERFECT
• REMOVED DUPLICATES
• VERIFIED SESSION NUMBER WAS CORRECT
• CONVERTED EVERYTHING TO LOWERCASE
![Page 22: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/22.jpg)
(3) NAMED ENTITY RECOGNITION
MANY APPROACHES TO THIS
• HAND-CRAFTED GRAMMAR BASED
• STATISTICAL MODELS
• MATCHING AGAINST A LIBRARY
![Page 23: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/23.jpg)
NAMED ENTITY RECOGNITION CONT…
WE NOTICED COMMON PHRASES LIKE “AMENDS”, “RELATED AMENDMENTS”, “REPLACED BY” WHEN REFERENCING ACTS
ULTIMATELY WE MATCHED BILL TEXT AGAINST A LIBRARY
• THIS GAVE US GOOD RESULTS WITH LITTLE CODE
• WON’T ALWAYS WORK
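A minimal sketch of the library-matching idea, assuming the library is simply a list of lowercase act titles and that a plain whole-phrase search is good enough. The function name `find_referenced_acts` and the three titles are illustrative assumptions; the talk used the full list scraped from the Government of Canada site, and its actual matching code may differ.

```python
import re

# Hypothetical, hand-picked titles standing in for the scraped list of acts.
ACT_TITLES = [
    "criminal code",
    "income tax act",
    "canada evidence act",
]

def find_referenced_acts(bill_text, act_titles):
    """Return the sorted act titles that appear as whole phrases in the bill text."""
    # Lowercase and collapse whitespace so line breaks inside a title do not matter.
    text = " ".join(bill_text.lower().split())
    found = set()
    for title in act_titles:
        # Word boundaries keep a title from matching inside a longer word.
        pattern = r"\b" + re.escape(title.lower()) + r"\b"
        if re.search(pattern, text):
            found.add(title)
    return sorted(found)

sample = (
    "This enactment amends the Criminal Code and makes related "
    "amendments to the Canada Evidence Act."
)
referenced = find_referenced_acts(sample, ACT_TITLES)
print(len(referenced), referenced)
```

The count of matched titles is what would feed the "number of bills referenced" feature. Exact matching misses abbreviated or paraphrased titles, which is one reason this approach won't always work.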
![Page 24: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/24.jpg)
SEMANTIC DISTANCE OF TOPICS
HYPOTHESIS:
SINCE AN OMNIBUS BILL HAS MANY DIFFERENT TOPICS, THE AVERAGE DISTANCE BETWEEN TOPICS IN AN OMNIBUS BILL WILL BE GREATER THAN IN A NON-OMNIBUS BILL.
![Page 25: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/25.jpg)
SEMANTIC DISTANCE OF TOPICS PROCEDURE
(1) PREPROCESS A BILL
(2) LDA TOPIC MODELLING ON THE BILL
(3) SEMANTIC SIMILARITY (DISTANCE MEASURE)
(4) AVERAGE TOPIC DISTANCE OF THE BILL
![Page 26: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/26.jpg)
(1) PREPROCESSING
• READ IN FILES
• TOKENIZE WORDS
• REMOVE STOP WORDS
• IGNORE WORD ORDER (BAG OF WORDS)
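The steps above can be sketched in plain Python. This is an illustration only: the talk used NLTK, while this version assumes a regular-expression tokenizer and a tiny hand-written stop word set (`STOP_WORDS` and `bag_of_words` are hypothetical names) so that it runs without downloading any corpora.

```python
import re
from collections import Counter

# A tiny illustrative stop word set; NLTK ships a much longer list.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "for", "this", "that"}

def bag_of_words(raw):
    """Tokenize, drop stop words, and count words regardless of their order."""
    tokens = re.findall(r"[a-z]+", raw.lower())          # tokenize into lowercase words
    kept = [t for t in tokens if t not in STOP_WORDS]    # remove stop words
    return Counter(kept)                                 # word order is discarded here

bow = bag_of_words("This Act amends the Criminal Code and the Canada Evidence Act.")
print(bow.most_common(3))
```

A `Counter` is a natural bag-of-words container because it keeps only word frequencies, which is all that a topic model such as LDA looks at.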
![Page 27: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/27.jpg)
(2) LDA TOPIC MODELING
• PROBABILISTIC TOPIC MODEL
• WE ARE NOT USING IT IN ITS OPTIMAL APPLICATION
• PROBABILISTICALLY PRESUMES DOCUMENTS CONTAIN A HIDDEN STRUCTURE BUILT AROUND TOPICS
• IGNORES WORD ORDER
![Page 28: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/28.jpg)
LDA CONT…
• MANY BILLS TOO SHORT FOR MEANINGFUL ANALYSIS W/ LDA
• BILLS THAT ARE TOO SHORT GET AN AGGREGATE SIMILARITY SCORE OF ‘1’
• THIS IS A REALLY BAD WORKAROUND
• WE IGNORE THE LDA TOPIC WEIGHTS/PROBABILITIES
• THIS IS AN OPTIMIZATION PROBLEM
MORE READING: https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
![Page 29: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/29.jpg)
(3) SEMANTIC SIMILARITY
LIN SIMILARITY
BUT WHAT DOES THIS MEAN???
![Page 30: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/30.jpg)
WORDNET
A HIERARCHICAL TREE OF WORDS WITH MORE GENERAL WORDS AT THE ROOT AND MORE SPECIFIC WORDS AT THE LEAVES
![Page 31: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/31.jpg)
SIMILARITY CONT…
LIN SIMILARITY
*OVERSIMPLIFICATION* THERE IS A GRAPH/NETWORK OF SYNONYMS - LIN SIMILARITY IS BASED ON THE FIRST COMMON ANCESTOR (LOWEST COMMON ANCESTOR) OF TWO WORDS: THE MORE SPECIFIC THAT SHARED ANCESTOR IS RELATIVE TO THE WORDS THEMSELVES, THE HIGHER THE SCORE
![Page 32: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/32.jpg)
SIMILARITY CONT…
SCORES ARE BETWEEN 0 AND 1
>0.8 MEANS VERY SIMILAR
<0.2 MEANS NOT VERY SIMILAR
E.G.
CAT & DOG = 0.88 OR 0.89 (BROWN AND SEMCOR IC)
HOUND & DOG = 0.88 OR 0.87 (BROWN AND SEMCOR IC)
CHAIR & DOG = 0.16 OR 0.18 (BROWN AND SEMCOR IC)
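For readers who want the definition behind these scores: Lin similarity is twice the information content (IC) of the lowest common ancestor divided by the sum of the ICs of the two words, where IC is the negative log of a concept's probability in a corpus such as Brown or SemCor. The sketch below computes it from probabilities directly; the probability values are made up for illustration and are not taken from either corpus.

```python
import math

def information_content(probability):
    """IC of a concept: rarer concepts carry more information."""
    return -math.log(probability)

def lin_similarity(p_a, p_b, p_lcs):
    """Lin similarity from the corpus probabilities of two concepts and their
    lowest common ancestor (LCS): 2 * IC(lcs) / (IC(a) + IC(b))."""
    return 2 * information_content(p_lcs) / (
        information_content(p_a) + information_content(p_b)
    )

# Hypothetical probabilities, chosen only to show the two ends of the scale.
score_close = lin_similarity(p_a=0.001, p_b=0.001, p_lcs=0.002)  # specific shared ancestor
score_far = lin_similarity(p_a=0.001, p_b=0.001, p_lcs=0.5)      # very general shared ancestor
print(round(score_close, 2), round(score_far, 2))
```

A shared ancestor that is nearly as specific as the words themselves pushes the score toward 1, while an ancestor near the root of the tree pushes it toward 0, which is why different IC corpora give slightly different scores for the same pair.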
![Page 33: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/33.jpg)
(4) AVG. TOPIC DISTANCE IN A BILL
WE CREATED AN AVERAGE SIMILARITY SCORE FOR EACH BILL:
SUM OF ALL COMPARED SCORES/TOTAL NUMBER OF COMPARISONS
THERE ARE FLAWS IN THIS APPROACH
• NOUNS ONLY
• NO WEIGHTING
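The averaging step can be sketched as an unweighted mean over every pair of topic words. The similarity function is passed in, so any measure could be plugged in; here a hypothetical lookup table stands in for a WordNet-based measure. Two of its values echo the Brown IC examples from the earlier slide, and the cat/chair value is invented. Returning 1.0 when there is nothing to compare mirrors the workaround described for very short bills, though whether the original code applies it at this step is an assumption.

```python
from itertools import combinations

def average_similarity(words, similarity):
    """Sum of all pairwise scores divided by the number of comparisons."""
    pairs = list(combinations(words, 2))
    if not pairs:
        return 1.0  # assumed fallback when there is nothing to compare
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

# Hypothetical lookup table; only cat/dog and chair/dog come from the slides.
TOY_SCORES = {
    frozenset(["cat", "dog"]): 0.88,
    frozenset(["chair", "dog"]): 0.16,
    frozenset(["cat", "chair"]): 0.20,
}

def toy_similarity(a, b):
    """Look up a symmetric score for an unordered pair of words."""
    return TOY_SCORES[frozenset([a, b])]

avg = average_similarity(["cat", "dog", "chair"], toy_similarity)
print(round(avg, 3))
```

Because every pair counts equally, a single dominant topic and a minor one have the same influence on the average, which is the "no weighting" flaw noted above.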
![Page 34: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/34.jpg)
CLASSIFICATION!
WE WERE RUNNING OUT OF TIME…
WE WANTED TO COMPARE:
• NAIVE BAYES
• RANDOM FOREST DECISION TREE
• SVM
WE COMPARED:
• NAIVE BAYES!
![Page 35: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/35.jpg)
CLASSIFIER COMPARISON
:(
NAIVE BAYES
• GAUSSIAN
• MULTINOMIAL
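To make the Gaussian variant concrete, here is a from-scratch sketch of Gaussian naive Bayes over three numeric features. It is not scikit-learn's implementation and not the talk's code: the function names, the variance floor, and every feature value are assumptions chosen for illustration.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9  # small floor avoids dividing by zero
            for col, m in zip(zip(*members), means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model

def predict(model, row):
    """Return the class with the highest log posterior for one feature row."""
    def log_posterior(label):
        prior, means, variances = model[label]
        total = math.log(prior)
        # Naive assumption: features are independent, so log densities add up.
        for x, m, v in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return total
    return max(model, key=log_posterior)

# Hypothetical rows: (length in characters, acts referenced, average topic similarity)
rows = [
    (250000, 40, 0.15), (310000, 55, 0.12),               # omnibus-like
    (8000, 2, 0.60), (12000, 1, 0.55), (5000, 3, 0.70),   # ordinary
]
labels = ["omnibus", "omnibus", "not omnibus", "not omnibus", "not omnibus"]

model = fit_gaussian_nb(rows, labels)
prediction = predict(model, (280000, 45, 0.14))
print(prediction)
```

The Gaussian variant models each feature as a bell curve per class, which suits continuous values such as bill length; the multinomial variant instead expects count-like features.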
![Page 36: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/36.jpg)
MODEL EVALUATION
WE WON’T SHOW YOU ACCURACY BECAUSE…
![Page 37: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/37.jpg)
CLASS IMBALANCE!
•9 OMNIBUS BILLS IN 120 BILLS
•7.5% CHANCE A BILL IS AN OMNIBUS BILL
•A CLASSIFIER COULD HAVE 92.5% ACCURACY BY PICKING ‘NOT OMNIBUS’ EVERY TIME!
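The arithmetic behind this slide, written out as a short check. The only inputs are the two counts stated above; the variable names are illustrative.

```python
total_bills = 120
omnibus_bills = 9

# Share of bills that are omnibus bills.
omnibus_rate = omnibus_bills / total_bills

# Accuracy of a classifier that always answers "not omnibus":
# it is right on every non-omnibus bill and wrong on every omnibus bill.
baseline_accuracy = (total_bills - omnibus_bills) / total_bills

# The same classifier finds none of the omnibus bills.
baseline_recall = 0 / omnibus_bills

print(f"{omnibus_rate:.1%} omnibus, {baseline_accuracy:.1%} baseline accuracy, "
      f"{baseline_recall:.0%} baseline recall")
```

The high baseline accuracy alongside zero recall is why accuracy alone says little here.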
![Page 38: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/38.jpg)
PRECISION
True Positives / (True Positives + False Positives)
![Page 39: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/39.jpg)
RECALL
True Positives / (True Positives + False Negatives)
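Both formulas can be computed directly from paired lists of true and predicted labels. This is a small sketch with made-up labels; returning 0.0 when a denominator is zero is a convention assumed here, not something stated in the talk.

```python
def precision_recall(y_true, y_pred, positive="omnibus"):
    """Compute precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumed convention for 0/0
    return precision, recall

# Hypothetical labels: 3 omnibus bills and 5 others.
y_true = ["omnibus"] * 3 + ["other"] * 5
y_pred = ["omnibus", "omnibus", "other", "omnibus", "omnibus", "other", "other", "other"]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here two of the four "omnibus" predictions are correct (precision 0.5) and two of the three real omnibus bills are found (recall 2/3).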
![Page 40: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/40.jpg)
BUT WE HAVE A CLASS IMBALANCE PROBLEM
![Page 41: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/41.jpg)
PRETENDING WE DON’T HAVE A PROBLEM
![Page 42: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/42.jpg)
CLASS IMBALANCE SOLUTION
REMOVE THE IMBALANCE!!!!
WE WENT FROM 65 TRAINING EXAMPLES TO 25 TO 11 BY REMOVING NEGATIVE EXAMPLES
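A sketch of removing negative examples, assuming they are dropped at random; the talk does not say how the removed bills were chosen. The sizes follow the slides: 5 omnibus and 60 other bills (65 total), reduced to 5:20 (25 total) and then 5:6 (11 total). The function name and the integer stand-ins for bills are hypothetical.

```python
import random

def undersample(examples, labels, positive, n_negatives, seed=0):
    """Keep every positive example and a random subset of the negatives."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    positives = [(x, y) for x, y in zip(examples, labels) if y == positive]
    negatives = [(x, y) for x, y in zip(examples, labels) if y != positive]
    kept = positives + rng.sample(negatives, n_negatives)
    rng.shuffle(kept)
    return kept

# Integers stand in for the 65 bills in the training set.
examples = list(range(65))
labels = ["omnibus"] * 5 + ["not omnibus"] * 60

smaller = undersample(examples, labels, "omnibus", n_negatives=20)
smallest = undersample(examples, labels, "omnibus", n_negatives=6)
print(len(smaller), len(smallest))
```

Undersampling balances the classes at the cost of discarding most of an already small training set, so results on the reduced sets should be read with caution.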
![Page 43: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/43.jpg)
RESULTS
TRUE CLASS IMBALANCE (5:60)
NEW (5:20)
RATIOS ARE (#OMNIBUS:#NOT OMNIBUS)
![Page 44: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/44.jpg)
REMOVING EVEN MORE
NEW (5:20)
NEWEST (5:6)
![Page 45: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/45.jpg)
FINAL TRAINING SET
![Page 46: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/46.jpg)
CONCLUSIONS
EITHER NEED:
(1) SUBSTANTIALLY MORE DATA; OR
(2) BETTER ACCURACY ON TOPIC EXTRACTION AND NAMED ENTITY RECOGNITION
LOTS OF ROOM FOR IMPROVEMENT
WE STILL THINK THREE FEATURES IS ENOUGH
NEED TO DO MORE WORK CLEANING/VALIDATING OUR INPUT DATA
![Page 47: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/47.jpg)
CONCLUSIONS CONT…
WE ARE PERFORMING BETTER THAN RANDOM GUESSING!
WE WOULD LOVE HELP IMPROVING OUR APPROACH
![Page 48: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/48.jpg)
WAYS TO IMPROVE
USE A MORE COMPLEX NER IMPLEMENTATION TO IMPROVE ACCURACY
LINKED TOPIC MODELLING
IMPROVE WORD SIMILARITY APPROACH TO INCLUDE WEIGHTINGS
EXPERIMENT WITH DOCUMENT VECTORS AND NEURAL NETS
USE DIFFERENT DISTRIBUTIONS FOR DIFFERENT FEATURES (OPTIMIZATION OF CLASSIFIER)
TRY TF/IDF AS A DIFFERENT METHOD FOR MEASURING THE ‘SEMANTIC DIFFERENCE’ IN A DOCUMENT
EXPERIMENT WITH OTHER CLASSIFIERS
EXPERIMENT WITH MORE FEATURES
…
![Page 49: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/49.jpg)
QUESTIONS?
![Page 50: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/50.jpg)
Machine learning is no cakewalk.
Can we form a group to help Ottawa companies achieve greater success with ML?
What would this group do? Who would be in it?
How would it be funded? Do we have the local talent? What about protecting IP?
Who would make the decisions? Why bother?
We want your feedback! If you'd like to participate in ongoing discussions, please leave us your contact info.
![Page 51: Jan25 - Ottawa Machine Learning Meetup](https://reader034.vdocument.in/reader034/viewer/2022052705/589a5e401a28abc3438b5733/html5/thumbnails/51.jpg)
RELATIVE OPERATING CHARACTERISTICS (ROC)
[FIGURE: ROC CURVES FOR RANDOM GUESS, GAUSSIAN AND MULTINOMIAL; X-AXIS: FALSE POSITIVE RATE (0 TO 1); Y-AXIS: TRUE POSITIVE RATE (0 TO 1)]