data mart and data mining on ca state financial...

Upload: danghuong

Post on 11-May-2018

Motivation

Dataset

Project Scope

Data Warehousing

Data Mining

Conclusion

Learning Experience

References

Research 1:

Public school funding is the largest program in the state budget, receiving more than 40% of the state's General Fund resources. The 2014–15 state budget includes more than $45 billion in General Fund resources.

- California Department of Education

Research 2:

Since 1980, higher education spending has decreased by 13 percent in inflation adjusted dollars, whereas spending on California’s prisons and associated correctional programs has skyrocketed by 436 percent.

- Huffington Post (California)

California state government financial data reported by counties, cities, and districts, with billions of records in each file.

The dataset details the expenditures, revenues, and state income of all departments, with income generated in the form of fees, penalties, and taxes.

https://bythenumbers.sco.ca.gov/

Dataset attributes: County, City, District, Year, Department, Sub-Dept., Value, Financial Class

To provide key financial information on government funding and income, broken down by region and department.

Target users: citizens, taxpayers, students, businesses, and non-profit organizations

Data Mart

What is the State Income based on County, City and District?

Which business categories and sub-departments collect the most income?

Determine the expenditures for a particular department.

How much has your county spent on public safety in the past 4 years?

Original data: three different files in .csv format

Handled missing values and listed required attributes for our project

Data integration and data reduction to relevant records

To relate the three datasets to one another, we created extra attributes and identifiers

Loaded the tables into a MySQL database

Large dataset with approximately 20 billion records

Departments with invalid or blank values were eliminated to keep our records consistent

Removed the least relevant sub-categories; the primary focus was on income and expenditure
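The preparation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names, the inline sample rows, and the `financial_class` identifier are hypothetical stand-ins, not the project's actual files or schema.

```python
import csv
import io

# Hypothetical stand-ins for two of the three source CSV files.
REVENUES_CSV = """county,department,year,value,notes
Sacramento,Public Safety,2012,1200,ok
Sacramento,,2012,300,blank department
Los Angeles,Education,2013,n/a,bad value
"""
EXPENDITURES_CSV = """county,department,year,value,notes
Sacramento,Public Safety,2012,900,ok
Los Angeles,Education,2013,700,ok
"""

REQUIRED = ("county", "department", "year", "value")  # assumed attribute list


def clean(text, financial_class):
    """Keep only required attributes, drop incomplete rows, tag the source."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        record = {key: (raw.get(key) or "").strip() for key in REQUIRED}
        if not all(record.values()):
            continue  # a required attribute is blank
        try:
            record["year"] = int(record["year"])
            record["value"] = float(record["value"])
        except ValueError:
            continue  # non-numeric year or value
        record["financial_class"] = financial_class  # extra identifier (assumed name)
        rows.append(record)
    return rows


# Integrate the cleaned files into one list of records.
combined = clean(REVENUES_CSV, "Revenue") + clean(EXPENDITURES_CSV, "Expenditure")
for row in combined:
    print(row)
```

Tagging each record with its source is one simple way to keep the files distinguishable after they are merged into a single table.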

Demo!

Link to our Data Mart: http://athena.ecs.csus.edu/~appanap/

Q. A star schema has what type of relationship between a dimension and fact table?

a) Many to many

b) One to one

c) One to many

d) All of the above

Answer: c) One to many
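A minimal star schema sketch makes the one-to-many relationship concrete: each dimension row is referenced by many fact rows. It uses Python's built-in SQLite rather than the project's MySQL database, and all table names, column names, and figures are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_county (county_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_department (department_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_finance (
    county_id INTEGER REFERENCES dim_county(county_id),
    department_id INTEGER REFERENCES dim_department(department_id),
    year INTEGER,
    financial_class TEXT,
    value REAL
);
""")
conn.executemany("INSERT INTO dim_county VALUES (?, ?)",
                 [(1, "Sacramento"), (2, "Los Angeles")])
conn.executemany("INSERT INTO dim_department VALUES (?, ?)",
                 [(1, "Public Safety"), (2, "Education")])
# Made-up fact rows: several facts point at the same dimension row.
conn.executemany("INSERT INTO fact_finance VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 2011, "Expenditure", 100.0),
    (1, 1, 2012, "Expenditure", 120.0),
    (1, 2, 2012, "Expenditure", 80.0),
    (2, 1, 2012, "Expenditure", 300.0),
    (1, 1, 2012, "Revenue", 50.0),
])

# One county (dimension) maps to many fact rows.
rows_per_county = dict(conn.execute("""
    SELECT c.name, COUNT(*)
    FROM fact_finance f JOIN dim_county c ON c.county_id = f.county_id
    GROUP BY c.name
"""))

# Example data mart question: county spending on public safety over a year range.
total = conn.execute("""
    SELECT SUM(f.value)
    FROM fact_finance f
    JOIN dim_county c ON c.county_id = f.county_id
    JOIN dim_department d ON d.department_id = f.department_id
    WHERE c.name = ? AND d.name = ?
      AND f.financial_class = 'Expenditure'
      AND f.year BETWEEN ? AND ?
""", ("Sacramento", "Public Safety", 2010, 2013)).fetchone()[0]

print(rows_per_county)
print(total)
```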

Data Mining

A classification algorithm is used to classify the counties and departments into Loss and Profit classes

Prediction of value ranges for the year 2014

Used the three combined datasets from our Data Mart application

Kept the attributes required for mining in CSV format

Converted the numeric column to nominal, i.e. values to ranges
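The numeric-to-nominal conversion and the Loss/Profit labelling can be sketched as follows. The bin width and the labelling rule (Profit when revenue is at least expenditure) are assumptions for illustration; the slides do not state the rule or ranges actually used.

```python
def to_range(value, width=1000):
    """Map a non-negative numeric value to a nominal range label."""
    low = int(value // width) * width
    return f"{low}-{low + width - 1}"


def profit_or_loss(revenue, expenditure):
    """Assumed labelling rule: Profit when revenue covers expenditure."""
    return "Profit" if revenue >= expenditure else "Loss"


# Made-up records: (county, revenue, expenditure).
records = [
    ("Sacramento", 1500, 1200),
    ("Los Angeles", 900, 2400),
]

# Replace each numeric value with its range and attach the class label.
nominal = [
    (county, to_range(rev), to_range(exp), profit_or_loss(rev, exp))
    for county, rev, exp in records
]
for row in nominal:
    print(row)
```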

Challenges

Algorithm selection for the dataset

Large dataset

Creating classes for important categories

Classification Tree: J48

Tool: Weka, Tableau

Used data from 2010 to 2013 as training data

Used data from 2014 as test data to check prediction accuracy
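The split by year and the accuracy check can be sketched as below. The predictor here is a trivial majority-class-per-county rule standing in for J48, which runs in Weka and is not reproduced; the records are made up.

```python
from collections import Counter, defaultdict

# Made-up records: (county, year, class label).
records = [
    ("Sacramento", 2010, "Profit"),
    ("Sacramento", 2011, "Profit"),
    ("Sacramento", 2012, "Loss"),
    ("Sacramento", 2013, "Profit"),
    ("Sacramento", 2014, "Profit"),
    ("Los Angeles", 2010, "Loss"),
    ("Los Angeles", 2011, "Loss"),
    ("Los Angeles", 2012, "Loss"),
    ("Los Angeles", 2013, "Profit"),
    ("Los Angeles", 2014, "Profit"),
]

# Temporal split: earlier years train the model, the last year tests it.
train = [r for r in records if 2010 <= r[1] <= 2013]
test = [r for r in records if r[1] == 2014]

# Stand-in model: count the training labels seen for each county.
votes = defaultdict(Counter)
for county, _, label in train:
    votes[county][label] += 1


def predict(county):
    """Return the most frequent training label (county must appear in train)."""
    return votes[county].most_common(1)[0][0]


correct = sum(predict(county) == label for county, _, label in test)
accuracy = correct / len(test)
print(f"accuracy on 2014: {accuracy:.2f}")
```

Splitting by year rather than at random keeps 2014 information out of training, which matches the goal of predicting a future year.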

As we can see, the prediction is almost accurate. The predicted class for 2014 has approximately the same values and range as the actual values above.

Using Tableau, we were able to visualize and investigate the following:

What are the top 5 revenue- and expenditure-generating counties?

Financial data comparison between Sacramento and Los Angeles

Funds distribution for flood control

Government expenditures on public facilities such as roads and parking facilities

Government expenses on public health

Comparison of government expenses on prisons and education

Q. Which of the following is not a data mining functionality?

a) Characterization and Discrimination

b) Classification and regression

c) Selection and interpretation

d) Clustering and Analysis

Answer: c) Selection and interpretation

Data Warehouse design:

PHP, HTML/CSS, JavaScript

Database:

MySQL

Data Mining tools:

WEKA

Data Visualization:

Tableau

Learned how to design a data mart application

Learned different data mining and visualization tools such as Weka, RapidMiner, and Tableau

Learned the practical use of classification algorithms such as J48 and Naïve Bayes, and of techniques such as the correlation matrix

Teamwork and brainstorming helped us resolve issues while executing our project

California State Controller’s Office, Government Financial Reports:

https://bythenumbers.sco.ca.gov/browse?utf8=%E2%9C%93&page=1

California Department of Education:

http://www.cde.ca.gov/fg/fr/eb/

California Drought

http://drought.ca.gov/topstory/top-story-58.html

California Spending More On Prisons Than Colleges, Report

http://www.huffingtonpost.com/2012/09/06/california-prisons-colleges_n_1863101.html