mining the database administration data | stack exchange

Post on 12-Apr-2017


Business Intelligence and Big Data Analytics Project

The case of Stack Exchange - Data Administration

Lamprini Koutsokera

lkoutsokera@aueb.gr

Alexandros Lattas

alexander@lattas.eu

Working Space

Data Acquisition

41.779 Posts
22.390 Users
123.697 Posts History
69.185 Comments
148.425 Votes
42.127 Badges

XML to CSV Converter (online tool)

447.603 rows
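The conversion step can be sketched in a few lines of Python. This is a minimal illustration, not the online tool the authors used. It assumes the usual Stack Exchange dump layout, with one `<row .../>` element per record and fields stored as attributes. The two sample rows are made up. The last lines check that the six table sizes listed above add up to the reported 447.603 rows.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical two-row sample in the shape of a Stack Exchange dump file
# (one <row .../> element per record, fields stored as attributes).
SAMPLE_XML = """<posts>
  <row Id="1" PostTypeId="1" Score="5" Tags="&lt;sql-server&gt;&lt;aggregation&gt;" />
  <row Id="2" PostTypeId="2" Score="3" />
</posts>"""


def xml_rows_to_csv(xml_text):
    """Convert every <row> element into one CSV line; missing attributes stay empty."""
    rows = [dict(row.attrib) for row in ET.fromstring(xml_text).iter("row")]
    # Use the union of attribute names, in first-seen order, as the header.
    header = []
    for row in rows:
        for name in row:
            if name not in header:
                header.append(name)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


csv_text = xml_rows_to_csv(SAMPLE_XML)

# Row counts reported on the slide, written without the thousands separator.
ROW_COUNTS = {
    "Posts": 41779,
    "Users": 22390,
    "Posts History": 123697,
    "Comments": 69185,
    "Votes": 148425,
    "Badges": 42127,
}
total_rows = sum(ROW_COUNTS.values())  # 447603, the total given above
```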

Data Cleansing - Adjustment

Comments & Post History & Posts: Users without Id but with Display Name -> Guest Users

Post History: Users without Id & Display Name -> 10.039 rows deleted

Votes -> 12.207 rows deleted

Badges -> 213 rows deleted -> 73 distinct badges remained

Primary & Foreign keys

5% of data deleted

Varchars to Numerics: age | reputation | views

Tables/dimensions creation: postHistoryTypes | postTypes | voteTypes
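The two user rules can be sketched as below. The column names `UserId` and `UserDisplayName`, the `Guest` label and the sample rows are assumptions for illustration only. The slides do not show how the authors implemented the step. The closing arithmetic uses the deletion counts reported above and confirms the 5% figure.

```python
GUEST = "Guest"  # hypothetical label for users known only by display name


def clean_rows(rows):
    """Relabel display-name-only users as guests and drop fully anonymous rows."""
    kept, deleted = [], 0
    for row in rows:
        has_id = row.get("UserId") not in (None, "")
        has_name = row.get("UserDisplayName") not in (None, "")
        if not has_id and not has_name:
            deleted += 1  # no Id and no Display Name: the row is deleted
            continue
        if not has_id:
            row = {**row, "UserId": GUEST}  # Display Name only: guest user
        kept.append(row)
    return kept, deleted


# Made-up sample rows, one for each case.
sample = [
    {"Id": "1", "UserId": "42", "UserDisplayName": ""},
    {"Id": "2", "UserId": "", "UserDisplayName": "anon123"},
    {"Id": "3", "UserId": "", "UserDisplayName": ""},
]
kept, deleted = clean_rows(sample)

# Share of deleted rows, from the counts reported on the slides.
deleted_rows = 10039 + 12207 + 213
share_deleted = deleted_rows / 447603
print(f"{deleted_rows} rows deleted ({share_deleted:.1%})")  # 22459 rows deleted (5.0%)
```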


Star - Snowflake Schema

Fact Metrics

Total Comment Score
Posts Edits
Users Participated
Score
View Count
Answer Count
Comment Count
Favorite Count

Cube Creation

Dimensions: Users (Age, Reputation, Views), Badge Types, Post Types, Post History Types, Creation Date, Votes Types and Tags

Measurements

Bridge Tables

Posts, Post History, Bridge Tags, Votes, Badges

Fact Table + Posts

Posts

Bridge Tags Tags Post History Post History Types

Votes Votes Types

Users Badges Badges Types

Dimension Usage
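To show how fact metrics roll up along a dimension, here is a small sketch with Python's built-in `sqlite3` module. The table names, column names and rows are hypothetical. The authors built an OLAP cube, and the slides do not show their actual schema definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table (hypothetical names and contents)
CREATE TABLE post_types (post_type_id INTEGER PRIMARY KEY, name TEXT);
-- Fact table: one row per post, with a few of the metrics listed above
CREATE TABLE fact_posts (
    post_id       INTEGER PRIMARY KEY,
    post_type_id  INTEGER REFERENCES post_types(post_type_id),
    score         INTEGER,
    view_count    INTEGER,
    answer_count  INTEGER
);
INSERT INTO post_types VALUES (1, 'Question'), (2, 'Answer');
INSERT INTO fact_posts VALUES
    (1, 1, 5, 120, 2),
    (2, 2, 3,   0, 0),
    (3, 1, 1,  40, 1);
""")

# Aggregate the fact metrics along the Post Types dimension.
rows = conn.execute("""
    SELECT t.name, SUM(f.score), SUM(f.view_count), SUM(f.answer_count)
    FROM fact_posts AS f
    JOIN post_types AS t ON t.post_type_id = f.post_type_id
    GROUP BY t.name
    ORDER BY t.name
""").fetchall()

for name, score, views, answers in rows:
    print(name, score, views, answers)
```

A cube pre-computes the same kind of aggregate for every combination of dimensions, so a query like this one becomes a lookup.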

Stack Exchange in Metrics

Top 10 Tags

Wednesday 3:00 p.m.

Age Group 25-34

Posts through months


Posts through countries

United States: 3.525 posts
India: 1.648 posts
United Kingdom: 1.857 posts
Canada: 1.473 posts

Data Transformation

postid  firebird  checkpoint  warning  oracle-apex  aggregation  subquery
16956   0         0           0        1            0            0
21733   0         0           0        0            0            0
35756   0         0           0        0            0            0
44484   1         0           0        0            0            0
43484   0         0           0        0            0            0
40422   0         0           0        0            0            0
44726   0         0           0        0            0            0
35932   0         0           0        0            0            1

13.608 Posts – 694 Tags

Tag separation into distinct words

<sql-server><aggregation>
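The transformation from a raw tag string to the 0/1 matrix shown above can be sketched as follows. It assumes that "tag separation" means splitting a string such as `<sql-server><aggregation>` into its individual tags. The sample tag strings are hypothetical, chosen only to mirror three rows of the table.

```python
import re

TAG_PATTERN = re.compile(r"<([^<>]+)>")


def split_tags(tag_string):
    """'<sql-server><aggregation>' -> ['sql-server', 'aggregation']"""
    return TAG_PATTERN.findall(tag_string)


def one_hot(posts):
    """posts maps post id -> raw tag string; returns (columns, 0/1 rows by post id)."""
    columns = sorted({tag for raw in posts.values() for tag in split_tags(raw)})
    matrix = {
        post_id: [1 if col in split_tags(raw) else 0 for col in columns]
        for post_id, raw in posts.items()
    }
    return columns, matrix


# Hypothetical tag strings for three of the post ids in the table above.
sample_posts = {
    16956: "<oracle-apex>",
    44484: "<firebird>",
    35932: "<subquery>",
}
columns, matrix = one_hot(sample_posts)
print(columns)        # ['firebird', 'oracle-apex', 'subquery']
print(matrix[16956])  # [0, 1, 0]
```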

Data Mining

Clustering and Association Rules

Scalable EM

30% testing set – 70% training set
Default number of clusters: 10
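As a rough illustration of the clustering setup, the sketch below does a 70/30 split and then fits a plain expectation-maximization (EM) loop for a one-dimensional Gaussian mixture. It is a textbook EM on made-up numbers with two components, not the Scalable EM implementation or the 10-cluster default named on the slide. The function names, the seed and the starting means are all assumptions.

```python
import math
import random


def split_70_30(items, seed=0):
    """Shuffle a copy of the data and cut it 70% training / 30% testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_1d(data, init_means, n_iter=30):
    """Fit a 1-D Gaussian mixture with plain EM, one component per starting mean."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density finite
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, labels


# Made-up values forming two obvious groups.
values = [0.8, 0.9, 1.0, 1.1, 1.2, 7.8, 7.9, 8.0, 8.1, 8.2]
train, held_out = split_70_30(values)
means, labels = em_1d(train, init_means=[0.0, 10.0])
```

This toy version works on raw densities. Data with widely separated or very tight clusters would need log-space arithmetic to avoid underflow.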

Association Rules: min. support 0.01, min. confidence 0.1
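The two thresholds can be made concrete with a small sketch that mines rules between pairs of tags. Support is the share of posts containing both tags. Confidence is that pair count divided by the count of the left-hand tag. The function name and the sample transactions are hypothetical. The authors' mining tool is not shown in the slides, and a full algorithm would also consider larger itemsets.

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.01, min_confidence=0.1):
    """Return rules (lhs, rhs, support, confidence) between single tags."""
    n = len(transactions)
    item_count, pair_count = {}, {}
    for tags in transactions:
        unique = sorted(set(tags))
        for tag in unique:
            item_count[tag] = item_count.get(tag, 0) + 1
        for pair in combinations(unique, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules, one per direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Made-up tag sets, one list per post.
transactions = [
    ["mysql", "replication"],
    ["mysql", "performance", "optimization"],
    ["sqlserver", "backup"],
    ["sqlserver", "index", "performance"],
    ["mysql", "replication", "performance"],
]
# Stricter thresholds than the slide's 0.01 / 0.1, to keep the output short.
strict_rules = pair_rules(transactions, min_support=0.4, min_confidence=0.9)
print(strict_rules)  # [('replication', 'mysql', 0.4, 1.0)]
```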

Cluster Mapping – Posts View

13.608 Posts

Score  Edits   Views      Favorites  Users participated
3.343  6.556   1.035.024  609        8.847
8.700  13.654  1.695.060  1.065      20.637
7.999  12.364  2.067.306  1.028      19.521
6.436  15.655  2.818.903  1.391      18.741
5.078  7.016   948.036    1.038      11.936
3.294  6.939   1.538.607  497        8.914

Cluster Mapping – Users View

6.534 Users

Badges  Reputation  Views    Upvotes  Downvotes  Age group
11.347  475.314     42.600   56.657   2.907      25-34
29.844  1.605.644   131.913  205.183  9.812      25-34
27.052  1.355.876   128.337  177.444  6.503      25-34
25.612  1.005.826   81.750   75.049   2.308      25-34
13.754  709.640     55.846   90.959   3.421      25-34
21.083  1.332.268   81.289   163.349  6.008      25-34

Association Rules

backup

sqlserver

index

mysql

replication

performance

optimization

database-design

Map Reduce

Cleansing

XML Files: Posts & Users

(&).*?(;)

^((?!AboutMe=).)*$
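The two patterns above can be tried out with Python's `re` module. How the authors applied them is not stated. The sketch assumes the first pattern strips character entities such as `&amp;` from the text, and the second identifies lines that contain no `AboutMe=` attribute so that they can be dropped. The helper names and sample strings are hypothetical.

```python
import re

ENTITY = re.compile(r"(&).*?(;)")                # first pattern from the slide
NO_ABOUT_ME = re.compile(r"^((?!AboutMe=).)*$")  # second pattern from the slide


def strip_entities(text):
    """Remove everything from an '&' up to the next ';'."""
    return ENTITY.sub("", text)


def keep_about_me_lines(lines):
    """Drop the lines the second pattern matches, i.e. those without AboutMe=."""
    return [line for line in lines if not NO_ABOUT_ME.match(line)]


print(strip_entities("index &amp; performance"))  # 'index  performance'

user_lines = [
    '<row Id="1" AboutMe="DBA and developer" />',
    '<row Id="2" Reputation="10" />',
]
print(keep_about_me_lines(user_lines))  # only the first line is kept
```

Note that the first pattern also removes ordinary text that happens to sit between a stray `&` and the next `;`, so it is a blunt cleaner.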

Reducer

Mapper #1

Mapper #2
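The two-mapper, one-reducer layout can be simulated in plain Python. The slides do not say what each mapper emits. The sketch assumes a word count, with Mapper #1 reading the `Body` of posts and Mapper #2 reading the `AboutMe` of users, which would fit the word lists reported in the results. All function names and sample records are hypothetical, and a real job would run on a MapReduce framework rather than in one process.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z#+]+")  # keeps tokens such as "c#"


def mapper_posts(record):
    """Mapper #1 (assumed): emit (("body", word), 1) for each word of a post body."""
    for word in WORD.findall(record["Body"].lower()):
        yield ("body", word), 1


def mapper_users(record):
    """Mapper #2 (assumed): emit (("about_me", word), 1) for each profile word."""
    for word in WORD.findall(record["AboutMe"].lower()):
        yield ("about_me", word), 1


def reducer(key, values):
    """Sum the counts emitted for one key."""
    return key, sum(values)


def run_job(posts, users):
    # Shuffle phase: group all emitted values by key before reducing.
    grouped = defaultdict(list)
    for record in posts:
        for key, value in mapper_posts(record):
            grouped[key].append(value)
    for record in users:
        for key, value in mapper_users(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Made-up records.
posts = [{"Body": "Default key value clustering"}, {"Body": "Key node"}]
users = [{"AboutMe": "Java and C# developer"}]
counts = run_job(posts, users)
print(counts[("body", "key")])     # 2
print(counts[("about_me", "c#")])  # 1
```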

Map Reduce Results

Posts: Body
Users: About Me
Posts further analysis

• Key
• Value
• Default
• Clustering
• Slave
• Physical
• Node
• Logging
• Relationship
• C
• Dynamic
• Language

Tags’ description enhancement

DBs’ problem solving

Graph DBs

Programming Languages

Visualization

Users’ background exploration

• Developer
• Software
• Web
• Programming
• Server
• Engineer
• SQL
• Java
• C#
• PHP
• Microsoft
• Linux

Skills Knowledge
Interests Knowledge
Job recommendation

“without”

• Without Time Zone
• Without Restarting
• Without using SQL

Timestamp type without losing timezone information.

Related to Oracle and PostgreSQL. MySQL automatically deals with it.
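The further analysis of the word "without" can be approximated by counting the words that follow it in post bodies. This is a hedged sketch on made-up sentences. The function name, the two-word window and the tokenization are assumptions, not the authors' procedure.

```python
import re
from collections import Counter


def words_after(keyword, texts, width=2):
    """Count the phrases formed by the `width` words that follow `keyword`."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for i, token in enumerate(tokens):
            phrase = tokens[i + 1:i + 1 + width]
            if token == keyword and len(phrase) == width:
                counts[" ".join(phrase)] += 1
    return counts


# Made-up post sentences.
bodies = [
    "Timestamp without time zone loses the offset",
    "Apply the change without restarting the server",
    "Store it without time zone",
]
phrases = words_after("without", bodies)
print(phrases.most_common(1))  # [('time zone', 2)]
```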

Practical Implications

Insights for Solutions & Improvements

Targeted marketing actions per DB Product

Insights on customer behavior per DB Product

Improve SE's data-driven decision-making process

Improve descriptive tag quality
