datawarehousing and datamining prof. sin-min lee
TRANSCRIPT
![Page 1: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/1.jpg)
DataWarehousing and DataMining
Prof. Sin-Min Lee
![Page 2: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/2.jpg)
DATA WAREHOUSE, OLAP, and DATA MINING
• Concepts– Data warehousing
– OLAP (On-Line Analytical Processing)
– Data mining
• Case Studies– WebTarget (USN)
– TFDW (USMC)
![Page 3: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/3.jpg)
![Page 4: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/4.jpg)
![Page 5: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/5.jpg)
![Page 6: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/6.jpg)
![Page 7: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/7.jpg)
![Page 8: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/8.jpg)
![Page 9: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/9.jpg)
![Page 10: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/10.jpg)
![Page 11: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/11.jpg)
![Page 12: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/12.jpg)
![Page 13: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/13.jpg)
![Page 14: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/14.jpg)
![Page 15: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/15.jpg)
![Page 16: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/16.jpg)
DATA WAREHOUSE
• DATABASE MANAGEMENT IN THE INTERNET ERA
• CLIENT/SERVER - BASED
• ANALYTICAL vs OPERATIONAL (OLAP vs OLTP)
• MULTI-DIMENSIONAL ANALYSIS
• DATA WAREHOUSE (ENTERPRISE-WIDE) vs
DATA MART (FUNCTIONAL AREA)
![Page 17: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/17.jpg)
MULTIDIMENSIONAL NATURE OF DATA WAREHOUSES
• BORING QUERY: “How many Sailors/Marines chose not to stay in the Navy/Marine Corps this year?”
• USEFUL QUERY: “What was our retention (separation) rate this year by community by paygrade by years of service by gender by rating and how did it compare to last year and what can we expect next year?”
![Page 18: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/18.jpg)
DW ARCHITECTURE
![Page 19: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/19.jpg)
DW 3-TIER ARCHITECTURE
![Page 20: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/20.jpg)
1. DATA QUALITY & DATA CLEANSING
• #1 REASON FOR DW PROJECT FAILURE• PROBLEMS
- Database heterogeneity- Data heterogeneity
• FUNCTIONALITY OF TOOLS- Removing unwanted data from operational databases- Converting to common data names and definitions- Calculating summaries and derived data- Establishing defaults for missing data- Accommodating source data definition changes
![Page 21: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/21.jpg)
APPROACHES TO DATA CLEANSING
• AUTOMATIC CODE GENERATION
Creates code to convert from source to target data
• DATA REPLICATION TOOLS
Captures changes to source database from recovery logs and database triggers and propagates changes to copies of the data
• DYNAMIC TRANFORMATION ENGINES
Rule-driven systems that capture data from source databases at user-defined intervals, transform it, and export it to a data warehouse/mart target
![Page 22: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/22.jpg)
2. METADATA (What does the data mean?)
• Logical Structure of DW Including End User Views
• Identification of Authoritative Data Sources
• Transformation Rules for Populating DW
• Transformation Rules to Deliver Data to End-User Analytical Tools
• Subscription Information for Information Delivery
• DW Operational Information
• DW Usage Metrics
• Security Authorizations, Access Control Lists, etc.
![Page 23: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/23.jpg)
3. DATA WAREHOUSE DATABASE
• PARALLEL COMPUTING PLATFORMS
Exs: Symmetric (Shared) Multiprocessors (SMPs);
Massively Parallel Processors (MPPs)
• ROLAP
Relational DBMS with “Heavy Duty” Indexing Capabilities
• MOLAP
Multidimensional Databases (MDDBs)
3rd Party Tools that Augment Relational Model
![Page 24: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/24.jpg)
4. DATA MARTS
• A Data Warehouse Focused on a Specific Subject Area
• Subsidiary to a Data Warehouse of Integrated Data
• More Rapidly Deployable than a Data Warehouse
• Subject-based vice Enterprise-based
![Page 25: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/25.jpg)
5. ACCESS TOOLS
• QUERY AND REPORTING TOOLS
- Managed query tools: Layer between user and SQL (e.g., BrioQuery)
- Configurable report generators (e.g., Brio’s BrioReport)
• APPLICATIONS
- Application development platforms (e.g., PowerSoft’s PowerBuilder; Microsoft’s Visual Basic)
![Page 26: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/26.jpg)
ACCESS (cont’d)
• OLAP
- Support of multidimensional analysis
- Ability to drilldown and rollup along any of the
predefined dimensions
- Major vendors: Cognos, Business Objects, Brio
![Page 27: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/27.jpg)
MULTIDIMENSIONAL DATA MODEL: STAR SCHEMA
• FACTS: Core data element being analyzed, e.g., Units_of_Items_Sold
• DIMENSIONS: Attributes about FACTS, e.g., Product_Type, Purchase_Date
![Page 28: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/28.jpg)
ROLE OF METRICS
• Facts should be defined as Measures of Effectiveness (sometimes called Key Performance Indicators (KPI’s))
• Exs: NEC Reutilization Rate
Retention Rate
Attrition Rate
Readiness (Personnel)
![Page 29: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/29.jpg)
COGNOS DEMO
• http://www.cognos.com/products/tours/analysis_launch.html
![Page 30: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/30.jpg)
ACCESS: Data Mining
• “Searching for meaningful patterns in large data sets”
• Knowledge acquisition• Motivated and facilitated by:
– Availability of large data sets– Advances in storage technology– Data warehouse technology– E-commerce and the Internet
• Exploratory vs. confirmatory analysis
![Page 31: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/31.jpg)
6. DW ADMINISTRATION AND MANAGEMENT
• “Normal” DBA Responsibilities plus:
• Source Data Quality Checks
• Keeping track of what all the source data means
• Managing Very Large Databases (gigabytes or terabytes in size)
![Page 32: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/32.jpg)
7. INFORMATION DELIVERY SYSTEM
• How to get information from the data warehouse to users?
• Users subscribe to the data warehouse.
• Specifically, they subscribe to specific reports to be delivered on a periodic basis.
• Reports are delivered to user’s Web browser as per prescribed frequency.
• Powerful tool for delivering information to the people who need it in an extremely timely fashion. True MIS; true DSS.
![Page 33: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/33.jpg)
BENEFITS OF DATA WAREHOUSE
• Freedom from restrictions of operational databases• Decision-oriented• Extremely efficient presentation of management
information• Widespread access to critical information for those who
need it when they need it• Knowledge discovery• Improves business intelligence• Relatively inexpensive to implement• Does not require re-engineering of legacy systems
![Page 34: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/34.jpg)
GIS: GEOGRAPHIC INFORMATION SYSTEMS
• Ability to visualize data spatially
• Maps on top of a relational DBMS
• Data is viewed on maps vice from tables
• Features:
- Thematic maps
- Spatial queries
- Geocoding of data
• Vendors: MapInfo; ESRI (ArcInfo)
![Page 35: DataWarehousing and DataMining Prof. Sin-Min Lee](https://reader035.vdocument.in/reader035/viewer/2022081519/56649ea05503460f94ba298b/html5/thumbnails/35.jpg)