Data Mining and Machine Learning Lab
Unsupervised Feature Selection for Linked Social Media Data
Jiliang Tang and Huan Liu
Computer Science and Engineering
Arizona State University
KDD 2012, August 12-16, 2012
Social Media
• The expansive use of social media generates massive data at an unprecedented rate
- 250 million tweets per day
- 3,000 photos on Flickr per minute
- 153 million blogs posted per year
High-dimensional Social Media Data
• Social media data can be high-dimensional
- Photos
- Video stream
- Tweets
• Presenting new challenges
- Massive and noisy data
- Curse of dimensionality
Feature Selection
• Feature selection is an effective way of preparing high-dimensional data for efficient data mining.
• What is new for feature selection of social media data?
Representation of Linked Data
[Figure: a network of eight linked instances u1 to u8, shown alongside matrix representations with rows u1 to u8, feature columns f1, f2, ..., fm, and link columns u1 to u8; nonzero entries are marked 1.]
Challenges for Feature Selection
• Unlabeled data
- No explicit definition of feature relevancy
- Without additional constraints, many subsets of features could be equally good
• Linked data
- Not independent and identically distributed
Opportunities for Feature Selection
• Social media data provides link information
- Correlation between data instances
• Social media data provides extra constraints
- Enabling us to explore the use of social theories
Problem Statement
• Given n linked data instances, their attribute-value representation X, and their link representation R, we want to select a subset of features by exploiting both X and R for these n data instances in an unsupervised scenario.
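A minimal sketch of the two inputs named in the problem statement, assuming X is stored with one row per feature and one column per instance, and R is a symmetric 0/1 adjacency matrix. The instance names, features, links, and helper functions below are hypothetical toy data for illustration, not taken from the talk.

```python
# Hypothetical toy data: 4 linked instances and 3 binary features.
instances = ["u1", "u2", "u3", "u4"]
features = ["f1", "f2", "f3"]

# Which features occur in which instance (assumed values).
content = {
    "u1": {"f1", "f2"},
    "u2": {"f1"},
    "u3": {"f2", "f3"},
    "u4": {"f3"},
}

# Links between instances, treated as undirected (an assumption).
links = [("u1", "u2"), ("u2", "u3"), ("u3", "u4")]


def build_X(instances, features, content):
    """Attribute-value matrix: one row per feature, one column per instance."""
    return [[1 if f in content[u] else 0 for u in instances] for f in features]


def build_R(instances, links):
    """Link matrix: R[i][j] = 1 if instances i and j are linked."""
    index = {u: i for i, u in enumerate(instances)}
    n = len(instances)
    R = [[0] * n for _ in range(n)]
    for a, b in links:
        R[index[a]][index[b]] = 1
        R[index[b]][index[a]] = 1
    return R


X = build_X(instances, features, content)
R = build_R(instances, links)
```

Feature selection then amounts to choosing a subset of the rows of X while using R as side information.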
Supervised and Unsupervised Feature Selection
• A unified view
- Selecting features that are consistent with some constraints, for either supervised or unsupervised feature selection
- Class labels are a kind of target that acts as a constraint
• Two problems for unsupervised feature selection
- What are the targets?
- Where can we find constraints?
Our Framework: LUFS
The Target for LUFS
The Constraints for LUFS
Pseudo-class Label
• s is a selection vector
- s(j) = 1 if the j-th feature is selected, s(j) = 0 otherwise
- Data with the selected features: $\mathrm{diag}(s)X$
• Y is the pseudo-class label indicator matrix
- $Y = W^{\top}\mathrm{diag}(s)X$
- $\|Y(:,i)\|_0 = 1$
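A small sketch of how a selection vector acts on the data, assuming X has one row per feature so that diag(s)X zeroes the rows of unselected features. The helper names and toy values are hypothetical, and the indicator check assumes the constraint on Y means exactly one nonzero entry per column.

```python
def select_features(X, s):
    """Return diag(s) X: rows of unselected features become all zeros."""
    if len(s) != len(X):
        raise ValueError("s needs one entry per feature (row of X)")
    return [[s_j * x for x in row] for s_j, row in zip(s, X)]


def is_indicator(Y):
    """True if every column of Y has exactly one nonzero entry."""
    n_columns = len(Y[0])
    return all(
        sum(1 for row in Y if row[i] != 0) == 1 for i in range(n_columns)
    )


# Toy data: 3 features (rows) and 4 instances (columns).
X = [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 0, 1, 1]]
s = [1, 0, 1]  # keep features 1 and 3, drop feature 2

X_selected = select_features(X, s)
```

Here `X_selected` keeps the first and third rows of X and zeroes the second, which is the effect of multiplying by diag(s).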
Social Dimension for Link Information
• Social dimensions capture group behaviors of linked instances
- Instances in different social dimensions are dissimilar
- Instances within a social dimension are similar
• Example:
Social Dimension Regularization
• Within, between, and total social dimension scatter matrices
• Instances are similar within social dimensions while dissimilar between social dimensions.
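The similar-within, dissimilar-between idea can be illustrated with standard scatter quantities. The sketch below assumes each instance belongs to exactly one group and uses the textbook definitions of within, between, and total scatter (traces only); it is not the exact matrix formulation used in LUFS, and the points and group labels are hypothetical.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    d = len(vectors[0])
    return [sum(v[k] for v in vectors) / n for k in range(d)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def scatter_traces(points, groups):
    """Return the traces of the (within, between, total) scatter matrices."""
    overall = mean(points)
    total = sum(sq_dist(p, overall) for p in points)

    members = {}
    for p, g in zip(points, groups):
        members.setdefault(g, []).append(p)

    within = 0.0
    between = 0.0
    for pts in members.values():
        center = mean(pts)
        within += sum(sq_dist(p, center) for p in pts)
        between += len(pts) * sq_dist(center, overall)
    return within, between, total


# Two tight groups that are far apart from each other.
points = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
groups = [0, 0, 1, 1]
within, between, total = scatter_traces(points, groups)
```

A good grouping gives a small within value and a large between value; the two always add up to the total scatter.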
Constraint from Attribute-Value Data
• Similar instances in terms of their contents are more likely to share similar topics
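One common way to encode a constraint like this, assumed here rather than read off the slide, is graph Laplacian smoothness: build a content similarity matrix S, form L = D - S with D the diagonal degree matrix, and penalize label scores that differ across similar instances. The similarity values and function names below are hypothetical.

```python
def laplacian(S):
    """Unnormalized graph Laplacian L = D - S for a symmetric similarity S."""
    n = len(S)
    return [
        [(sum(S[i]) if i == j else 0.0) - S[i][j] for j in range(n)]
        for i in range(n)
    ]


def smoothness(y, S):
    """Quadratic form y^T L y for one row of label scores y."""
    L = laplacian(S)
    n = len(y)
    return sum(y[i] * L[i][j] * y[j] for i in range(n) for j in range(n))


def pairwise_penalty(y, S):
    """Equivalent form: 1/2 * sum_ij S_ij * (y_i - y_j)^2."""
    n = len(y)
    return 0.5 * sum(
        S[i][j] * (y[i] - y[j]) ** 2 for i in range(n) for j in range(n)
    )


# Toy similarity: instances 1 and 2 are very similar, 2 and 3 less so.
S = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.5],
     [0.0, 0.5, 0.0]]

same_label = smoothness([1.0, 1.0, 1.0], S)   # no disagreement at all
split_weak = smoothness([1.0, 1.0, 0.0], S)   # cuts only the weak link
split_strong = smoothness([1.0, 0.0, 0.0], S) # cuts the strong link
```

Separating the strongly similar pair costs more than separating the weakly similar pair, which is the behavior the constraint asks for.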
An Optimization Problem for LUFS
The Optimization Problem for LUFS
LUFS after Two Relaxations
• Spectral relaxation on Y
- Social dimension regularization: $\min \; \mathrm{Tr}(YY^{\top}) - \mathrm{Tr}(YFF^{\top}Y^{\top})$
• W = diag(s)W, and adding the 2,1-norm on W
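A short sketch of the 2,1-norm mentioned above: it sums the Euclidean norms of the rows of W, so it pushes whole rows toward zero. Assuming each row of W corresponds to one feature (as W = diag(s)W suggests), features can then be ranked by row norm. The ranking helper and the toy W are hypothetical illustrations, not the authors' code.

```python
import math


def row_norms(W):
    """Euclidean norm of each row of W."""
    return [math.sqrt(sum(w * w for w in row)) for row in W]


def l21_norm(W):
    """2,1-norm of W: the sum of the row-wise Euclidean norms."""
    return sum(row_norms(W))


def rank_features(W, k):
    """Indices of the k rows (features) with the largest row norms."""
    norms = row_norms(W)
    order = sorted(range(len(W)), key=lambda j: norms[j], reverse=True)
    return order[:k]


# Toy W: 3 features (rows) and 2 pseudo-classes (columns).
W = [[3.0, 4.0],
     [0.0, 0.0],
     [0.0, 1.0]]

norm_value = l21_norm(W)
top_two = rank_features(W, 2)
```

The all-zero second row contributes nothing to the norm, which mirrors a feature with s(j) = 0.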
Evaluating LUFS
• Datasets and experiment setting
• What is the performance of LUFS compared with state-of-the-art baseline methods?
• Why does LUFS work?
Data and Characteristics
• BlogCatalog
• Flickr
http://dmml.asu.edu/users/xufei/datasets.html
Experiment Settings
• Metrics
- Clustering: Accuracy and NMI
- K-Means
• Baseline methods
- UDFS
- SPEC
- Laplacian Score
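The two clustering metrics can be sketched in plain Python. This is an illustration under stated assumptions: NMI is normalized by the larger of the two entropies (other normalizations exist), and accuracy uses a brute-force search over label mappings, which is only practical for a handful of clusters. The function names and toy labels are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments."""
    n = len(a)
    ca, cb, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (ca[x] * cb[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """Mutual information divided by the larger entropy (assumed variant)."""
    denom = max(entropy(a), entropy(b))
    # Both assignments are constant, hence identical partitions.
    return mutual_information(a, b) / denom if denom > 0 else 1.0


def clustering_accuracy(truth, predicted):
    """Best fraction of matches over one-to-one cluster-to-class mappings."""
    pred_ids = list(dict.fromkeys(predicted))
    true_ids = list(dict.fromkeys(truth))
    # Pad with None so every predicted cluster gets a distinct target.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(1 for t, p in zip(truth, predicted) if mapping[p] == t)
        best = max(best, hits)
    return best / len(truth)


truth = [0, 0, 1, 1, 2, 2]
relabeled = [1, 1, 0, 0, 2, 2]   # same partition, different cluster ids
one_error = [0, 0, 0, 1, 2, 2]   # one instance in the wrong cluster
```

Both metrics ignore the arbitrary cluster ids that K-Means returns, so `relabeled` scores the same as a perfect match.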
Results on Flickr
Results on BlogCatalog
Probing Further: Why Social Dimensions Work
[Figure: link information is turned into groups in two ways: social dimension extraction yields social dimensions, and random assignment yields random groups.]
Results in Flickr
Future Work
• Further exploration of link information
• Noisy and incomplete social media data
• Other sources: multi-view sources
• The strength of social ties (strong and weak ties mixed)
More Information?
http://www.public.asu.edu/~huanliu/projects/NSF12/
Questions
Acknowledgments: This work is, in part, sponsored by the National Science Foundation via a grant (#0812551). Comments and suggestions from DMML members and reviewers are greatly appreciated.