Study of the parallel techniques for dimensionality reduction and its impact on quality of the text processing algorithms

Marcin Pietroń 1,2, Maciej Wielgosz 1,2, Michał Karwatowski 1,2, Kazimierz Wiatr 1,2

1 AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków
2 ACK Cyfronet AGH, ul. Nawojki 11, 30-950 Kraków

RUC, 17-18.09.2015, Kraków




Agenda

Text classification

System architecture

Metrics

Dimensionality reduction

Experiments and results

Conclusions and future work


Text classification

A common and useful problem in Internet and big-data processing

Real-time processing is required

Preceded by text preprocessing

Clustering is one of several techniques that support text classification


System architecture

Processing pipeline: text pre-processing -> dictionary and model transformation -> SVD -> K-means
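The first two pipeline stages can be sketched in a few lines. This is only an illustration under stated assumptions: the slides name gensim for preprocessing, whereas the sketch uses only the Python standard library; the stop list, the sample corpus, and the function names `preprocess`, `build_dictionary`, and `to_tfidf` are hypothetical; lemmatization is omitted; and the TF-IDF weighting (raw count times log(N / document frequency)) is one common variant, not necessarily the one the authors used.

```python
import math
import re
from collections import Counter

# Hypothetical stop list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stop words (no lemmatization)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_dictionary(token_lists):
    """Map each distinct token to an integer id, in sorted order."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    return {token: idx for idx, token in enumerate(vocab)}


def to_tfidf(token_lists, dictionary):
    """Dense TF-IDF vectors: raw count * log(N / document frequency)."""
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        vec = [0.0] * len(dictionary)
        for token, count in Counter(tokens).items():
            vec[dictionary[token]] = count * math.log(n_docs / doc_freq[token])
        vectors.append(vec)
    return vectors


# Made-up three-document corpus, for illustration only.
corpus = [
    "The price of oil and the stock market",
    "The football match and the league table",
    "Stock market prices in the oil sector",
]
tokens = [preprocess(doc) for doc in corpus]
dictionary = build_dictionary(tokens)
tfidf = to_tfidf(tokens, dictionary)
```

Each row of `tfidf` is one document in the vector space model; these rows are what the later SVD and K-means stages would consume.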


System architecture

Document corpus generation (e.g. with a crawler)

Text preprocessing (implemented with the gensim library: lemmatization, stop list, etc.)

SVD for dimensionality reduction

K-means as the clustering method (documents are clustered into the chosen domains)
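For the final stage, a minimal K-means (Lloyd's algorithm) sketch is given below. It is not the authors' implementation: the function names and the toy 2-D points are hypothetical, and the centroids are seeded with the first k points purely to keep the example deterministic (random or k-means++ seeding is more usual in practice).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    # Naive, deterministic seeding: the first k points become the centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two well-separated toy groups standing in for reduced document vectors.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centroids = kmeans(points, k=2)
```

In the pipeline described on the slide, `points` would be the document vectors after SVD, and `k` would relate to the number of chosen domains.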


Quality metrics


Entropy

$$E(C_i) = -\sum_{h=1}^{k} \frac{n_i^h}{n_i} \log\left(\frac{n_i^h}{n_i}\right)$$

$$\mathrm{Entropy} = \sum_{i=1}^{k} \frac{n_i}{n} E(C_i)$$
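The two entropy formulas translate directly into code. The sketch below assumes the usual reading of the symbols, which the slide does not spell out: n_i^h is the number of documents of class h in cluster i, n_i is the size of cluster i, and n is the total number of documents. The natural logarithm and the toy cluster contents are also assumptions.

```python
import math
from collections import Counter


def cluster_entropy(class_labels):
    """E(C_i): entropy of the class distribution inside one cluster."""
    n_i = len(class_labels)
    return sum(
        -(count / n_i) * math.log(count / n_i)
        for count in Counter(class_labels).values()
    )


def total_entropy(clusters):
    """Size-weighted mean of the per-cluster entropies."""
    n = sum(len(cluster) for cluster in clusters)
    return sum(len(cluster) / n * cluster_entropy(cluster) for cluster in clusters)


# Each inner list holds the true class of every document in one cluster.
clusters = [
    ["sport", "sport", "sport", "sport"],            # pure cluster
    ["business", "science", "business", "science"],  # evenly mixed cluster
]
entropy = total_entropy(clusters)
```

A pure cluster contributes zero, so lower total entropy means clusters that line up better with the true classes.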


Dimensionality reduction

SVD:

$$A = U \Sigma V^{T}$$

where U is the matrix of left singular vectors, V is the matrix of right singular vectors, and $\Sigma$ is the diagonal matrix of singular values
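A small NumPy sketch of how the decomposition reduces dimensionality: keeping only the leading singular triplets maps each document (a row of A) to a short vector. NumPy, the function name `reduce_with_svd`, and the tiny matrix are assumptions for illustration; the slides do not say which SVD routine was used.

```python
import numpy as np


def reduce_with_svd(A, rank):
    """Return documents in the rank-dimensional space, plus s and Vt."""
    # Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Rows of U[:, :rank] * s[:rank] equal A @ Vt[:rank].T, i.e. the documents
    # projected onto the leading right singular vectors.
    return U[:, :rank] * s[:rank], s, Vt


# Hypothetical documents-by-terms matrix (3 documents, 4 terms).
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
])
reduced, s, Vt = reduce_with_svd(A, rank=2)
```

Here `reduced` has one row per document and `rank` columns, which is the form a clustering step such as K-means would take as input.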


Dimensionality reduction

Random Projection:

vectors are projected into a reduced space by specially constructed random matrices (distances between points in the reduced space are approximately preserved up to a scale factor)

A = (e.g. Achlioptas random projection matrix)
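As an illustration of the Achlioptas construction, the sketch below builds a sparse random matrix whose entries are sqrt(3) times +1, 0, or -1 with probabilities 1/6, 2/3, and 1/6, and projects vectors with a 1/sqrt(k) scaling, where k is the reduced dimension. The function names, the sizes, and the random test vectors are assumptions; this is a plain-Python sketch, not the hardware or GPU implementation mentioned in the talk.

```python
import math
import random


def achlioptas_matrix(n_features, n_components, seed=0):
    """Entries are sqrt(3) * {+1, 0, -1} with probabilities 1/6, 2/3, 1/6."""
    rng = random.Random(seed)
    scale = math.sqrt(3.0)
    values = [scale, 0.0, -scale]
    weights = [1 / 6, 2 / 3, 1 / 6]
    return [
        rng.choices(values, weights=weights, k=n_components)
        for _ in range(n_features)
    ]


def project(vectors, matrix):
    """Compute (1 / sqrt(k)) * X R, where k is the reduced dimension."""
    k = len(matrix[0])
    factor = 1.0 / math.sqrt(k)
    return [
        [factor * sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(k)]
        for vec in vectors
    ]


# Two random 400-dimensional vectors reduced to 200 dimensions.
rng = random.Random(1)
x = [rng.random() for _ in range(400)]
y = [rng.random() for _ in range(400)]
R = achlioptas_matrix(n_features=400, n_components=200)
px, py = project([x, y], R)

original = sum((a - b) ** 2 for a, b in zip(x, y))
projected = sum((a - b) ** 2 for a, b in zip(px, py))
print(f"squared distance: original={original:.2f}, projected={projected:.2f}")
```

Because about two thirds of the entries are zero and the rest differ only in sign, the projection needs additions and subtractions rather than general multiplications, which is part of what makes it attractive for parallel hardware.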


Results and experiments

Domain      Number of clusters  Precision    Recall       F-measure
business    3.9(0.3)            0.81(0.022)  0.56(0.077)  0.66(0.034)
culture     3(0)                0.37(0.015)  0.7(0.061)   0.48(0.024)
automotive  4.8(0.4)            0.39(0.007)  0.56(0.021)  0.45(0.01)
science     2.1(0.3)            0.39(0.014)  0.74(0.016)  0.51(0.014)
sport       4.8(0.4)            0.39(0.007)  0.56(0.021)  0.45(0.01)
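The transcript does not preserve the definition of the F-measure. Assuming it is the usual harmonic mean of precision and recall, the short check below recomputes it from the precision and recall columns of the table above. The reported figures appear to be averages over several runs, so small differences from the recomputed values are to be expected.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) copied from the results table.
rows = {
    "business": (0.81, 0.56, 0.66),
    "culture": (0.37, 0.70, 0.48),
    "automotive": (0.39, 0.56, 0.45),
    "science": (0.39, 0.74, 0.51),
    "sport": (0.39, 0.56, 0.45),
}
for domain, (p, r, reported) in rows.items():
    print(f"{domain:<10} computed={f_measure(p, r):.3f} reported={reported:.2f}")
```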

Employed algorithms    Entropy
vsm+kmeans             0.28(0.012)
vsm+tfidf+kmeans       0.17(0.019)
vsm+tfidf+svd+kmeans   0.16(0.006)


Results and experiments

[Figure: "Entropy mean" plot; vertical axis ticks from 0.750 to 1.050, horizontal axis ticks from 2 to 7900]


GPU implementation

Reduction size  GPGPU [ms]  CPU [ms]
10              33          80
20              77          305
30              107         420
40              161         624

Hardware: GPGPU = NVIDIA Tesla M2090, CPU = Intel Xeon E5645
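The speedup implied by these timings is the CPU time divided by the GPGPU time for each reduction size. The short calculation below uses only the numbers from the table; it gives roughly 2.4x at reduction size 10 and a little under 4x for sizes 20 to 40.

```python
# (reduction size, GPGPU time in ms, CPU time in ms) copied from the table.
timings = [(10, 33, 80), (20, 77, 305), (30, 107, 420), (40, 161, 624)]

speedups = {size: cpu_ms / gpu_ms for size, gpu_ms, cpu_ms in timings}
for size, speedup in speedups.items():
    print(f"reduction size {size}: {speedup:.2f}x")
```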


Conclusions and future work

Each added processing stage lowers entropy (vsm+kmeans 0.28, with TF-IDF 0.17, with TF-IDF and SVD 0.16)

A GPU can substantially reduce the running time of text classification

Random projection hardware implementation

K-means GPU acceleration


Questions?