FINAL REPORT NCEMBT-091123

DEVELOPMENT OF A DATA WAREHOUSE FOR MINING OF NCEMBT FIELD RESEARCH DATA

NOVEMBER 2009

Hsuan-Tsung Hsieh, Ph.D.

University Of Nevada Las Vegas

Davor Novosel

National Center for Energy Management and Building Technologies


FINAL REPORT NCEMBT-091123

NATIONAL CENTER FOR ENERGY MANAGEMENT AND BUILDING TECHNOLOGIES TASK 06-05: DEVELOPMENT OF A DATA WAREHOUSE FOR MINING OF NCEMBT FIELD RESEARCH DATA

NOVEMBER 2009

Prepared By:

Hsuan-Tsung Hsieh, Ph.D.

University Of Nevada Las Vegas

Davor Novosel

National Center for Energy Management and Building Technologies

Prepared For:

U.S. Department of Energy

William Haslebacher

Project Officer / Manager

This report was prepared for the U.S. Department of Energy under Cooperative Agreement DE-FC26-03GO13072


NOTICE

This report was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or any agency thereof.

NATIONAL CENTER FOR ENERGY MANAGEMENT AND BUILDING TECHNOLOGIES CONTACT

Davor Novosel

Chief Technology Officer

National Center for Energy Management and Building Technologies

601 North Fairfax Street, Suite 250

Alexandria VA 22314

703-299-5633

[email protected]


TABLE OF CONTENTS

EXECUTIVE SUMMARY

1. PROJECT OBJECTIVE

2. BACKGROUND

3. ASSESSMENT OF REQUIRED RESOURCES

3.1 Hardware Requirements

3.2 Hardware Options

3.3 Software Requirements

4. DEFINITION OF HARDWARE AND SOFTWARE ARCHITECTURE

4.1 Source Data Layer

4.2 Data Transformation Layer

4.3 Data Warehouse Layer

4.4 Reporting Layer

4.5 Metadata Layer

4.6 Operations Layer

5. DESIGN, DEVELOPMENT AND IMPLEMENTATION OF DATA WAREHOUSE SYSTEM

5.1 Hardware and System Implementation

5.1.1 Hardware Architecture Layout and Implementation

5.1.2 System Configuration and Implementation

5.1.3 Cluster Services Design, Development, and Implementation

5.2 Designing and Implementation of Map-driven and Multi-dimensional Data Presentation Modules

5.2.1 Map-Driven Data Visualization Interfaces

5.2.2 Multi-Dimensional Data Visualization Interface

5.3 Implementation of the Online Questionnaire System

5.4 Automation of the Building Performance Report Process

5.4.1 Report Module Concept

5.4.2 Report Module Implementation

5.5 Testing and Debugging of the DW System

6. REFINEMENT OF THE DATA WAREHOUSE

7. REFERENCES


LIST OF FIGURES

Figure 1. Data flow for the proposed data warehouse.

Figure 2. Proposed data warehouse architecture

Figure 3. Defined layers for the data warehouse architecture

Figure 4. Data warehouse architecture with nodes and storage system design layout.

Figure 5. The login screenshot with the implemented nodes and member of domain defined.

Figure 6. Cluster administrator screenshot shows the current cluster owner under the defined domain.

Figure 7. Cluster definition screenshot with the public static IP address.

Figure 8. IIS service associated with the script file that uses to execute the cluster configuration.

Figure 9. IIS’s ownership and active node information.

Figure 10. ASP.NET configuration file is located under the shared disk for providing high availability.

Figure 11. SQL server 2005 Cluster service is assigned to a public static IP.

Figure 12. SQL Server 2005 is defined to be accessed by domain user.

Figure 13. SQL Server Management Studio is used for managing database storage in the shared disk.

Figure 14. Building Sciences Database is presented by the map-driven presentation interface.

Figure 15. Three-tiered map-driven data presentation interface.

Figure 16. Map-driven interface with zoom capability (Buildings are marked in the park zone for privacy concern).

Figure 17. Building Characteristics data presentation screenshot

Figure 18. IEQ data presentation screenshot

Figure 19. Sound (Acoustic) data presentation screenshot

Figure 20. Lighting data presentation screenshot

Figure 21. Microbiology data presentation screenshot

Figure 22. Questionnaire resulting data classification screenshot

Figure 23. Questionnaire resulting data presentation screenshot

Figure 24. Multi-dimension data presentation interface.

Figure 25. Building assignment process for online perception questionnaire system

Figure 26. Real-time survey result interface for online perception questionnaire system

Figure 27. Online perception questionnaire system screenshot

Figure 28. Major components for building customized reporting system.

Figure 29. Three major components for building customized reporting system.

Figure 30. Page 1 of the “General Summary” report relating to Building characteristics

Figure 31. Page 2 of the “General Summary” report relating to IEQ measurement results

Figure 32. Page 3 of the “General Summary” report relating to IEQ and Sound measurement results

Figure 33. Page 4 of the “General Summary” report relating to lighting, microbiology, and questionnaire


Figure 34. All reports can be exported as an Excel or pdf file through the built-in functions.

Figure 35. Exported pdf document from the summary report

Figure 36. Measurement data report can be customized by selecting required data type.

Figure 37. Pdf file output from the “Measurement Data” reporting module

Figure 38. Output of the “Complete” data reporting contents can be customized by selecting required data type.

Figure 39. Pdf file output from the “Complete” reporting module

Figure 40. Re-configured NCEMBT web page defines focal features of building sciences database

Figure 41. An integrated interface for searching, reporting and presenting multi-dimensional data


LIST OF TABLES

Table 1. Required software for the current data warehouse design


EXECUTIVE SUMMARY

A data warehouse (DW) is an integrated, subject-oriented data storage system designed to extract data and illustrate correlations that aid the decision-making process. Its design enables a researcher to find correlations between any individual data sets. Data mining, in turn, is the process of extracting and analyzing data, including the analysis of correlations among multiple data sets. Data warehouses are commonly used in the financial services and retail sectors. However, a data warehouse in the building sciences arena does not currently exist. Establishing the mining tools used for extracting warehoused data sets allows the respective organizations to optimize their operations and/or to develop new products and services for their customers.

Building on the knowledge-based resource management system developed under NCEMBT Task 05-12, the objective was to design, develop, and implement a data warehouse utilizing the data gathered in NCEMBT field research. That objective was accomplished in four steps: (1) identification of required resources; (2) definition of hardware and software architectures; (3) design, development and implementation of the DW system; and (4) testing and debugging of the DW.

The completed data warehouse uses industry-standard hardware and software components. It employs map-driven data access, allows for visualization of multi-dimensional data, and provides several reporting tools. The user can request a summary report, which showcases the performance of the building of interest, or request and download recorded data of interest. These data may include measured environmental parameters, such as temperature, humidity, draft, noise or lighting, or summary results of the surveys of building occupants’ perceptions of their indoor environments.


1. PROJECT OBJECTIVE

Building on the knowledge-based resource management system developed under NCEMBT Task 05-12 (Hsieh 2008), the objective of this Task was to design, develop, and implement a data warehouse utilizing the NCEMBT field research data. The data warehouse would use industry-standard hardware and software components to fast-track the completion of this Task.


2. BACKGROUND

The National Center for Energy Management and Building Technologies (NCEMBT) had several projects that gathered data on the performance of residential, commercial, and institutional buildings. The integrated building protocol developed under NCEMBT Task 1, and then applied to Tasks 13, 05-07 and 06-06, collected over a million data points per building.

Under NCEMBT Task 05-12 a knowledge-based resource management system (KBRMS) was developed to store, manage, and allow online access to the individual building data sets.

The true value of the data does not lie within each set but rather in the sum of all sets. The relationships between the individual building data sets are of value to NCEMBT researchers and to other research entities, such as the American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., the U.S. Environmental Protection Agency, and private companies, all of which already have asked to access the data. Statistically significant relationships can only be derived from the associations that exist among all the data sets. For example, a statistically valid correlation between ventilation rates and occupants’ perception of the indoor environmental quality of their work place can be derived from the analysis of the complete database of all buildings. However, to do so, the existing data sets must be upgraded into a data warehouse so that the knowledge discovery (or data mining) process can be initiated.

A data warehouse is an integrated, subject-oriented data storage system designed to extract data and illustrate correlations that aid the decision-making process. Its design enables a researcher to find correlations between any individual data sets. Data mining, in turn, is the process of extracting and analyzing data, including the analysis of correlations among multiple data sets. Data warehouses are commonly used in the financial services and retail sectors. However, a data warehouse in the building sciences arena does not currently exist. Establishing the mining of warehoused data allows the respective organizations to optimize their operations and/or to develop new products and services for their customers. Data mining can also be applied in the manufacturing sector to increase production yields or develop new products.

To derive the full value from the large volumes of data collected under the NCEMBT research, the KBRMS needs to be extended to include the data sets and databases derived from various tasks and projects. Realizing the data warehouse requires both hardware and software upgrades. Although data warehousing is not a new concept, integrating large and varied data sets is quite a challenging task due to the amount and size of the data that must be stored.


3. ASSESSMENT OF REQUIRED RESOURCES

3.1 HARDWARE REQUIREMENTS

Hardware requirements are discussed in terms of three aspects: system issues, hardware configuration options, and cost analysis. These aspects are closely related to each other and to the available funding. The defined hardware configuration is optimized to support the integration of all NCEMBT-sponsored projects.

Six hardware issues need to be considered prior to implementing the data warehouse shown in Figure 1: disk space, memory, central processing unit (CPU), storage, expandability, and system redundancy. Each issue is discussed in terms of its requirements and the associated recommendation.

Figure 1. Data flow for the proposed data warehouse.

Disk Space Requirements

Fast I/O speed (fast read/write access speed)

Operating system installation requires 15GB of space.

Application software, such as SQL Server, OpenViz and Visual Studio, requires around 15GB of space.

Data storage space (tables, indexes, temp space, etc.) - Currently, the NCEMBT database size for the first 20 buildings surveyed is about 3.5GB. Data from another 10 surveyed buildings added 5.5GB. Other types of data, such as numerical simulation and laboratory experimental data, can add approximately another 40GB to the system. To run SQL Server Analysis Services for data mining, the storage required is 2 to 4 times the original database size, or in the range of 100 to 200GB.


Data landing space (for inbound files and archives) - The assumed landing space is about 5 times the current database size, or about 30GB.

Disk Space Recommendation:

Each system was projected to need (SATA or SCSI) hard disk capacity of 150GB.
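The data storage estimate above can be checked with a few lines of arithmetic. The sketch below is only an illustration in Python: the sizes are the figures quoted in the requirements, while the variable and function names are assumptions introduced here and are not part of the project software.

```python
# Sizes quoted in the disk space requirements (GB).
FIRST_20_BUILDINGS_GB = 3.5
NEXT_10_BUILDINGS_GB = 5.5
OTHER_DATA_GB = 40.0  # simulation and laboratory data (approximate)


def analysis_storage_range(base_gb, low_factor=2, high_factor=4):
    """Return (low, high) storage in GB, assuming Analysis Services
    needs 2 to 4 times the original database size."""
    return base_gb * low_factor, base_gb * high_factor


base_gb = FIRST_20_BUILDINGS_GB + NEXT_10_BUILDINGS_GB + OTHER_DATA_GB
low_gb, high_gb = analysis_storage_range(base_gb)

print(f"Base database size: {base_gb:.1f} GB")
print(f"Analysis Services storage: {low_gb:.0f} to {high_gb:.0f} GB")
```

The result, 98 to 196GB for a 49GB base, is consistent with the stated range of 100 to 200GB.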

Memory Requirements

When running any application on a large data set, memory size is a crucial factor in execution speed. Some software, such as older versions of Analysis Services, cannot run with a small amount of memory (Analysis Services may require more than 2GB when working with a large data set). Insufficient memory can easily become a bottleneck for system operation. Reference: http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/anservog.mspx.

Database (database working and dynamic memory): SQL Server 2005 Enterprise Edition recommends 1GB of memory or more (Reference: http://msdn2.microsoft.com/en-us/library/ms143506.aspx).

Memory Recommendation

Since a small memory size can seriously affect processing performance, we suggested equipping the system with 12GB of memory.

Central Processing Unit (CPU) Requirements

Data Transformation (extraction, transformation, and loading): since data processing is not performed in real time, the CPU workload is moderate.

Usage Complexity: an estimated 10% CPU resource is allocated to the “demanding usage” and about 60% to the “general usage”. The number of concurrent users for the data warehouse-related processes is around 12.

Central Processing Unit Recommendation

Most modern applications are designed for parallelism, so a system can benefit from additional processors. 64-bit CPUs have become mainstream for achieving higher performance in desktop computing. Therefore, a Core 2 Duo 2.4GHz CPU was suggested for the standard PC configuration, while a Dual Core Xeon Processor 5130 (2.00GHz, 4MB internal cache) was proposed for the web/database server configuration.

Storage Requirements

To maintain high system availability and prevent data loss.

There are two kinds of storage systems for data warehouse systems: Storage Area Network (SAN) and Redundant Array of Independent Disks (RAID).

SAN is strongly recommended for large data warehouse systems that require high availability. However, such a configuration is very costly.

RAID is robust for data warehouse systems and can be configured in various ways:

RAID 0 offers excellent read performance but no fault tolerance,

RAID 1 is known for its mirroring capability, which keeps two identical copies of the data within the system,

RAID 5 offers good read performance and fault tolerance by striping data with distributed parity rather than mirroring.
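The trade-offs between these RAID levels can be made concrete with a small calculation. The following Python sketch is illustrative only and assumes identical disks and the textbook capacity rules for each level; the function name is hypothetical, and the 146GB disk size is the one cited later in this report for the installed SCSI disks.

```python
def raid_summary(level, disk_count, disk_gb):
    """Return (usable_gb, disk_failures_tolerated) for identical disks."""
    if level == 0:
        if disk_count < 2:
            raise ValueError("RAID 0 needs at least 2 disks")
        # Striping only: all capacity is usable, but any failure loses data.
        return disk_count * disk_gb, 0
    if level == 1:
        if disk_count < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        # Mirroring: every disk holds the same data.
        return disk_gb, disk_count - 1
    if level == 5:
        if disk_count < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        # Striping with distributed parity: one disk's worth goes to parity.
        return (disk_count - 1) * disk_gb, 1
    raise ValueError(f"unsupported RAID level: {level}")


for level, count in [(0, 2), (1, 2), (5, 3)]:
    usable, tolerated = raid_summary(level, count, 146)
    print(f"RAID {level} with {count} x 146GB: "
          f"{usable}GB usable, survives {tolerated} disk failure(s)")
```

With two 146GB disks, RAID 1 yields 146GB of usable space and survives one disk failure, whereas RAID 0 yields 292GB with no protection.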


Storage Recommendation

File reading and writing (I/O) was an expected bottleneck in the implementation of Data Warehouse and Business Intelligence (BI) systems, especially when processing large numbers of data queries simultaneously. A storage system with a RAID 1 configuration and high-performance disks (15K RPM SATA or SCSI) was deemed suitable for the project scale and budget, while also providing fault tolerance.

Expandability Requirements

Hardware was to be expandable when the size of the data sets exceeded the current capacity.

Expandability Recommendation

Once the data size exceeds the current storage capacity, splitting the warehouse database servers and the reporting/data analysis systems across more than one server would be the preferred approach. A RAID 1 storage system could be easily expanded if required.

Redundancy System Requirement

The purpose of a redundancy system is to provide high availability.

Redundancy System Recommendation

A redundancy system with two clustered servers connected to Direct Attached Storage (DAS) was the solution used in this project, as shown in Figure 2. Six 146GB/15K RPM SCSI hard disks were installed in a RAID 1 configuration. Each server (the database engine and analysis servers) uses two hard disks, while the DAS device uses the remaining two for the currently available data sets.

Figure 2. Proposed data warehouse architecture


3.2 HARDWARE OPTIONS

Due to budget constraints, a Mini Tower was used for hosting the NCEMBT web portal and database, and an existing PowerEdge 2950 (131.216.114.163) was used for executing the full functionality of the data warehouse services. The NCEMBT web portal data sets were scheduled to be backed up to an extended network hard drive. A PowerEdge server (PE 2950) from a previous project was used for redundancy. A DELL PowerVault 300 was the intermediate server that attaches directly to the RAID 1 storage architecture. All servers and PCs were connected to uninterruptible power supply (UPS) units to provide continuous service.

3.3 SOFTWARE REQUIREMENTS

SQL Server 2005 Enterprise Edition includes a relational database engine, BI Studio, SQL Studio, SQL Server Integration Service, and Analysis Services for data mining. The full version can perform the basic data warehouse functions; however, the data analysis and data mining tools that come with the SQL Server package are simple. We contemplated acquiring other commercial software for more complicated data analysis and data mining. OpenViz from AVS is a data visualization application that can be used in the analysis and reporting of the data. To customize the tools required by the DW, the MS Visual Studio package was selected for developing data visualization, data analysis, web-based data query, and publication of data mining results.

Table 1. Required software for the current data warehouse design

Layer | Software | Functionality

Data Source/Data Transformation | SQL Server Integration Service | Design, develop and maintain the ETL System

Data Source/Data Transformation | SQL Studio | Write SQL for data manipulation within data warehouse

Data Source/Data Transformation | BI Studio | Provide interface for several SQL Server services including Integration Service

Data Source/Data Transformation | FLUENT | Convert experimental data format to targeted data warehouse structure

Data Warehouse | Relational Engine in MS SQL Server 2005 Enterprise | Store and query data

Reporting | Analysis Services | Perform data analysis

Reporting | SQL Server Reporting Services | Display analysis and summary reports

Reporting | BI Studio | Provide interfaces to SQL Server services, including Analysis Services

Reporting | Data Mining in MS SQL Server | Provide a data mining capability

Reporting | VS Studio | Develop codes for customized reporting and data analysis tools

Reporting | OpenViz | Provide data visualization function

Reporting | FLUENT | Convert simulation data into presentable data formats

Metadata | Word processing and spreadsheet software | Document data

Operations | SQL Studio | Write SQL to manipulate the database system

Operations | VS Studio | Develop codes for Data Warehouse System

* Bundled with SQL Server 2005 Enterprise Version


4. DEFINITION OF HARDWARE AND SOFTWARE ARCHITECTURE

The data warehouse consists of two components: hardware, such as computers and servers, and software for assembling, accessing, and managing the data and the database. Each of these two components consists of a multitude of individual, interconnected items, generally referred to as an “architecture” or “blueprint”. The hardware architecture used off-the-shelf computer components and integrated them into the existing UNLV network. The main concern was to determine the short- and long-term capabilities required to build the architecture. Six layers of data warehouse architecture were identified.

A cluster architecture, in which a group of computers (nodes) interacts as a single computer, was selected to provide high availability and high fault tolerance for applications and services. The failover system is based on MS Windows 2003 Enterprise Edition running on Dell PowerEdge 2950 server hardware with the MD3000 shared storage. If one member of the cluster (node) is not functioning or is unavailable, the other node can take over and carry the load so that applications and services are always available. As on the hardware side, the software architecture (blueprint) would rely on commercially available, industry-standard applications and processes.
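The failover behavior described above can be illustrated with a toy model. The Python sketch below is not the Windows Cluster Service and does not reflect its internals; it only shows, under simplified assumptions, how ownership moves to the surviving node in a two-node active/passive cluster. The class, method, and node names are hypothetical.

```python
class FailoverCluster:
    """Toy active/passive cluster: one healthy node owns the services."""

    def __init__(self, node_names):
        if not node_names:
            raise ValueError("a cluster needs at least one node")
        # All nodes start healthy; the first one listed is the initial owner.
        self.healthy = {name: True for name in node_names}
        self.owner = node_names[0]

    def set_health(self, name, is_healthy):
        """Record a node's health and move ownership if the owner went down."""
        self.healthy[name] = is_healthy
        self._ensure_owner()

    def _ensure_owner(self):
        if self.owner is not None and self.healthy[self.owner]:
            return  # current owner is fine; no failback to a recovered node
        candidates = [name for name, ok in self.healthy.items() if ok]
        self.owner = candidates[0] if candidates else None


cluster = FailoverCluster(["node1", "node2"])  # hypothetical node names
print(cluster.owner)                 # node1
cluster.set_health("node1", False)   # node1 fails
print(cluster.owner)                 # node2 takes over
cluster.set_health("node1", True)    # node1 recovers
print(cluster.owner)                 # node2 keeps ownership
```

In this sketch a recovered node does not automatically reclaim ownership; whether the real cluster fails back is a configuration choice that the report does not specify.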

4.1 SOURCE DATA LAYER

Our current source data sets are held in MS SQL Server and in files that require specific software packages, such as FLUENT, to process. The source data types derived from NCEMBT projects are building survey data (IEQ measurement, microbiological samples, occupant perception questionnaire, sound, and lighting data), experimental data from the Building Technology Laboratory (BTLab), numerical simulation results, and related literature review documents. Other project-related data can be further examined and stored in the source data layer as soon as the projects are well established. The uploading interface was developed according to the specific needs of each project. Currently the NCEMBT source data is stored in the “Conditions” database.

4.2 DATA TRANSFORMATION LAYER

This layer (also known as Extract, Transform and Load, or ETL) handles the extraction of data from the sources, its transformation from the source format and structure into the target data warehouse format and structure, and its loading into the data warehouse. Due to the complexity of the data types, no single application could complete all the required data transformation tasks. Requirements included managing, accessing, analyzing, and visualizing the warehoused data. Building IEQ measurement data were transformed from 10-second interval to 1-hour interval data sets prior to uploading into the data warehouse. Data sets from numerical simulation and laboratory experimental measurements require specific software packages, such as FLUENT, to process the data prior to indexing it into the data warehouse. Most of the simulation data sets were generated by FLUENT software.
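The reduction of 10-second IEQ readings to 1-hour records can be sketched as follows. This Python example is a minimal illustration that assumes the hourly value is the arithmetic mean of the readings within each hour; the report does not state which aggregation was actually used. The function name and the synthetic readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hourly_means(samples):
    """Average (timestamp, value) pairs into one value per clock hour."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Truncate the timestamp to the start of its hour.
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour_start].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}


# Synthetic data: two hours of 10-second readings (360 per hour),
# 20.0 during the first hour and 21.0 during the second.
start = datetime(2009, 1, 1, 8, 0, 0)
samples = [(start + timedelta(seconds=10 * i), 20.0 + (i // 360))
           for i in range(720)]

result = hourly_means(samples)
for hour, mean in result.items():
    print(hour.isoformat(), mean)
```

Each hour of 10-second data collapses 360 readings into a single record, which is what keeps the warehoused IEQ tables manageable.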

4.3 DATA WAREHOUSE LAYER

This layer is a decision support environment that is subject-oriented, integrated, time-variant and non-volatile. Subject-oriented means that it targets entities of specific research interest, while integrated means that data are collected from various sources. Time-variant means that the data sets are collected over time. Non-volatile means that once the data sets are in the DW system, they are not changed.


4.4 REPORTING LAYER This layer processes and converts data into useful information for targeted users. There are numerous software applications, such as Business Intelligence (BI), and Online Analytical Processing (OLAP) applications, that provide such capabilities. Data visualization is the critical interface tool used to communicate to researchers and or interested industrial partners. A selected visualization software package would be integrated into the reporting layer for better presentation.

4.5 METADATA LAYER The metadata layer is used to inform operators and targeted users regarding the data and status holding within the data warehouse.

4.6 OPERATIONS LAYER The operation layer is comprised of the processes of loading, manipulating and extracting data. It also includes tools for managing and maintaining data warehouse itself.

Figure 3. Defined layers for the data warehouse architecture (Source: http://en.wikipedia.org/wiki/Data_warehouse)


5. DESIGN, DEVELOPMENT AND IMPLEMENTATION OF DATA WAREHOUSE SYSTEM

5.1 HARDWARE AND SYSTEM IMPLEMENTATION

5.1.1 Hardware Architecture Layout and Implementation

All cluster nodes (PowerEdge 2950) share external networked disks, which every node accesses through its SCSI card. Cluster service configuration data and building-related data sets are stored on the shared disk, as indicated in Figure 4. Static (fixed) IP addresses, 131.216.114.236 and 131.216.114.237, have been assigned to the two nodes in the cluster. The PowerVault MD3000 is equipped with two power supplies for power redundancy. All internal hard disks are arrayed in a RAID 5 configuration, which provides high read performance and fault tolerance (through distributed parity rather than mirroring). Although both nodes are always connected to and communicate with the MD3000, only one node is the cluster owner (main node) at any given time. Ownership of the cluster is swapped between the nodes depending on node availability.

Figure 4. Data warehouse architecture with nodes and storage system design layout.


5.1.2 System Configuration and Implementation

All cluster nodes must be members of the same domain. The cluster system is currently configured under the “ncacm-h2s1” domain name, as shown in Figure 5. For security reasons, only domain users with administrator rights can access the cluster service setup. Both nodes and the cluster services are members of the domain managed by the domain control server. The domain control server therefore acts as the manager that coordinates the interaction among cluster services and nodes. The control server also uses resources from both the nodes and the cluster, as shown in Figure 6. A separate static IP address, 131.216.114.238, was acquired and assigned to the cluster through the cluster management tool in MS Windows 2003 Server, as shown in Figure 7.

Failover is a default function of a Windows cluster: either node can provide the services of the cluster. To guarantee identical service and function regardless of which node is active, the nodes should be set up identically.
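The ownership rule behind failover can be sketched in a few lines. This is an illustrative model only, not the Windows cluster service logic: it assumes that the current owner keeps ownership while it is online and that ownership does not automatically return to a recovered node. The `Node` class and `select_owner` function are hypothetical; the node names DWH-1 and DWH-2 are those used in this report.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    online: bool = True


def select_owner(nodes, current_owner=None):
    """Return the node that should own the cluster resources.

    The current owner keeps ownership while it is online; otherwise the
    first online node in the list takes over. Returns None if every node
    is offline. (No automatic failback: an assumption of this sketch.)
    """
    if current_owner is not None and current_owner.online:
        return current_owner
    for node in nodes:
        if node.online:
            return node
    return None


if __name__ == "__main__":
    dwh1, dwh2 = Node("DWH-1"), Node("DWH-2")
    owner = select_owner([dwh1, dwh2])
    print("initial owner:", owner.name)
    dwh1.online = False  # simulate a failure of the active node
    owner = select_owner([dwh1, dwh2], owner)
    print("owner after failure:", owner.name)
```

Because either node may become the owner, any file path or setting that differs between the nodes would surface only after a failover, which is why identical node setup matters.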

Figure 5. The login screenshot with the implemented nodes and member of domain defined.

Figure 6. Cluster administrator screenshot shows the current cluster owner under the defined domain.


Figure 7. Cluster definition screenshot with the public static IP address.

5.1.3 Cluster Services Design, Development, and Implementation

Based on the design criteria discussed in the system requirement analysis, the cluster service has to integrate web service, ASP.NET scripting, and SQL Server database services to form a complete data warehouse.

5.1.3.1 Internet Information Service (IIS)

Internet Information Service (IIS) is the web server package built into MS Windows 2003 Server. The cluster provides the IIS service through an IIS script. Because the cluster service is provided by two separate nodes using the script path shown in Figure 8, the node setup and the script file path should be identical on both nodes. All required data sets should be placed on the shared disk (PowerVault MD3000) so that either node can access them. Figure 9 shows the possible owners of the cluster as displayed in the “IIS properties” window. All nodes are candidate owners. However, only one node can be the active node (owner) at any given time. At present, DWH-1 is the active node that provides services. Based on the cluster configuration, the DW can only be accessed through the public IP address 131.216.114.238, with services provided by either node DWH-1 or DWH-2. IIS provides only the Internet connection protocol and generic HTML page formats. To complete a user-friendly DW system, the ASP.NET and SQL Server components discussed later were used to implement the web portal and the DW access interface.


Figure 8. IIS service associated with the script file used to execute the cluster configuration.

Figure 9. IIS’s ownership and active node information.

5.1.3.2 ASP.NET Services

To provide high availability, the latest ASP.NET configuration file is located on the shared disk (PowerVault MD3000) under the M drive, as shown in Figure 10.


Figure 10. ASP.NET configuration file is located under the shared disk for providing high availability.

5.1.3.3 SQL Server Services

Unlike IIS and ASP.NET, Microsoft SQL Server 2005 Enterprise does not run from a script on the node. SQL Server 2005 can be configured as a separate cluster for better security and availability. A static IP address, 131.216.114.239, was reserved for SQL Server 2005, as indicated in Figure 11. The installation procedure automatically copied and synchronized the essential files and settings onto the two nodes. Because the SQL Server 2005 cluster is a member of the domain, domain users with the designated rights can access the SQL Server cluster, as shown in Figure 12. As shown in Figure 13, SQL Server 2005 is under the domain control server “ncacm-h2s1”, but the SQL engine is actually installed on all cluster nodes (DWH-1 and DWH-2). All physical data sets are stored on the shared disk (PowerVault MD3000).


Figure 11. SQL server 2005 Cluster service is assigned to a public static IP.

Figure 12. SQL Server 2005 is defined to be accessed by domain user.


Figure 13. SQL Server Management Studio is used for managing database storage in the shared disk.

5.2 DESIGNING AND IMPLEMENTATION OF MAP-DRIVEN AND MULTI-DIMENSIONAL DATA PRESENTATION MODULES

Due to the complexity of the collected data sets, the development of effective presentation interfaces was crucial.

5.2.1 Map-Driven Data Visualization Interfaces

To interact better with users and increase data presentation efficiency, a map-driven data presentation interface was created. The interface was built on the AJAX (Asynchronous JavaScript And XML) plug-in module originally developed by Google (maps.google.com). AJAX is based on JavaScript and HTTP requests, is easy to implement, and gives effective web responses. Three presentation tiers were used to present the data flow, as shown in Figure 15; their contents are listed below:

Level 1: US map with zoom in and out capability, showing surveyed building locations and types. To protect the identity of the surveyed buildings, each building's pin is placed in a park zone within the same city, as shown in Figure 16.

Level 2: Data measurement categories (building characteristics, IEQ, sound, microbiology, lighting) are displayed once the user clicks on a location pin on the map. All categories are hyperlinked to further data details, that is, the Level 3 data sets.

Level 3: All tabulated or charted measured parameters under each category are displayed at this level.
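The three levels can be read as a drill-down over nested data. The sketch below is a hypothetical in-memory model of that navigation, not the portal's ASP.NET and SQL Server implementation; the building ID, coordinates, parameter names and values are invented for illustration.

```python
# Hypothetical in-memory stand-in for the warehouse content.
BUILDINGS = {
    "B01": {
        "city": "Las Vegas",
        "type": "office",
        # Pin placed in a public park of the same city, not at the real address.
        "pin": (36.17, -115.14),
        "categories": {
            "IEQ": {"Operative temperature (C)": [22.1, 22.4, 22.9]},
            "Lighting": {"Illuminance (lux)": [410, 395]},
        },
    },
}


def level1_pins(buildings):
    """Level 1: one map pin per building (ID, type, masked coordinates)."""
    return [(bid, b["type"], b["pin"]) for bid, b in sorted(buildings.items())]


def level2_categories(buildings, building_id):
    """Level 2: measurement categories available for the selected pin."""
    return sorted(buildings[building_id]["categories"])


def level3_parameters(buildings, building_id, category):
    """Level 3: measured parameters and values under one category."""
    return buildings[building_id]["categories"][category]


if __name__ == "__main__":
    for pin in level1_pins(BUILDINGS):
        print("pin:", pin)
    print("categories:", level2_categories(BUILDINGS, "B01"))
    print("IEQ data:", level3_parameters(BUILDINGS, "B01", "IEQ"))
```

Each level fetches only what the user has asked for, so the map can load quickly and the detailed data sets are queried on demand.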


Figure 14. Building Sciences Database is presented by the map-driven presentation interface.

Figure 15. Three-tiered map-driven data presentation interface.


Figure 16. Map-driven interface with zoom capability (buildings are marked in the park zone for privacy reasons).

5.2.2 Multi-Dimensional Data Visualization Interface

Based on input from the NCEMBT team members and on the data characteristics, the data presentation tools were separated into four categories: tabulated descriptive information, such as the building characteristics; statistical data tables, as shown in the IEQ interface; 2-D plots, such as the hourly temperature data in the IEQ category; and 3-D data plots, as produced by the AVS package.

Because the complete building characteristics data sets are very large, the web presentation shows only a subset of the available data to reduce confusion, as shown in Figure 17. Building IEQ data sets have been categorized into measured and calculated sections. All IEQ data sets are presented as statistical tables and 2-D plots for ease of comparison, as shown in Figure 18. To improve data access and chart output post-processing in the IEQ data presentation module, several revisions were made and several automatic raw data manipulation functions were added. These additions are intended to speed up data processing as more building data sets are checked into the system in the near future. Sound, lighting and microbiology data sets are presented in tabulated format, as shown in Figures 19 through 21.

Figure 17. Building Characteristics data presentation screenshot


Figure 18. IEQ data presentation screenshot


Figure 19. Sound (Acoustic) data presentation screenshot

Figure 20. Lighting data presentation screenshot


Figure 21. Microbiology data presentation screenshot

Results from the 50-page perception questionnaire are valuable but are hard to visualize and navigate when presented as one long list. Pre-defined categories were implemented to divide the survey questions into four parts (general, IEQ, sound, and lighting), as shown in Figure 22. Each category can be further divided into sub-categories. This categorization not only helps users find the information they need but also speeds up queries against the database. The result for a specific question is presented as a pie chart with statistical percentages, as shown in Figure 23.
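The percentages behind such a pie chart are a simple tally of the answers to one question. The sketch below shows the calculation under the assumption that each respondent gives exactly one answer per question; the function name and the sample answers are hypothetical.

```python
from collections import Counter


def answer_percentages(responses):
    """Return {answer: percentage} for one question, rounded to 0.1."""
    counts = Counter(responses)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {answer: round(100.0 * n / total, 1) for answer, n in counts.items()}


if __name__ == "__main__":
    # Ten invented responses to a single thermal comfort question.
    responses = ["Satisfied"] * 6 + ["Neutral"] * 3 + ["Dissatisfied"]
    for answer, share in answer_percentages(responses).items():
        print(f"{answer}: {share}%")
```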

Figure 22. Questionnaire resulting data classification screenshot


Figure 23. Questionnaire resulting data presentation screenshot

The multi-dimensional data presentation interface has been one of the most important assets of the DW implementation project, as shown in Figure 24. Because the AVS OpenViz software package requires significant coding and configuration, several alternative approaches were attempted, such as a Java applet implementation, desktop software installation and direct online data access. Because the data rendering process requires intensive Internet resources, direct online data rendering is not a feasible solution at this point.


The current emphasis is on generating a downloadable package, including data sets of interest and the 3-D data viewer. This viewer can be re-used for future data rendering and analysis. The test case, shown in Figure 24, queries real-time data sets from three (3) data types (BuildingID, Hours, Operative Temperature) and 3686 data sets.

Figure 24. Multi-dimension data presentation interface.

5.3 IMPLEMENTATION OF THE ONLINE QUESTIONNAIRE SYSTEM

The online questionnaire system provides a flexible tool for conducting building surveys with minimal resource allocation. The system has been integrated with the current NCEMBT knowledge-based resource management system. The steps required to set up an online survey were:

The questionnaire coordinator logs in as an “Administrator”.

A unique questionnaire survey package is generated by adding the specific Building ID in the “Security roles” interface, as shown in Figure 25.

The unique user IDs, or respondent IDs, are added in the “User” interface. Respondent IDs can be generated from the MS Access “ConAdmin” interface.

The survey results can be displayed in real time under the “Survey Admin” interface, as shown in Figure 26.

The online survey system uses client-side data validation to conserve Internet resources. All questions on a page must be completed before the respondent can continue to the next page. By design, the respondent cannot go back to previous pages. An example page is shown in Figure 27.
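The paging rule described above (complete the page, then move forward only) can be modeled as a small state machine. In the deployed system this check runs in the respondent's browser; the Python sketch below only illustrates the logic, and the `SurveySession` class and question IDs are hypothetical.

```python
class SurveySession:
    """Forward-only questionnaire: a page must be complete before advancing."""

    def __init__(self, pages):
        # pages: list of lists of question IDs, one inner list per page.
        self.pages = pages
        self.current = 0
        self.answers = {}

    def missing(self):
        """Question IDs on the current page that have no answer yet."""
        return [q for q in self.pages[self.current]
                if self.answers.get(q) in (None, "")]

    def answer(self, question_id, value):
        """Record an answer; only questions on the current page are accepted."""
        if question_id not in self.pages[self.current]:
            raise ValueError("question is not on the current page")
        self.answers[question_id] = value

    def next_page(self):
        """Advance only when the current page is complete; never go back."""
        if self.missing():
            return False
        if self.current < len(self.pages) - 1:
            self.current += 1
        return True


if __name__ == "__main__":
    session = SurveySession([["q1", "q2"], ["q3"]])
    session.answer("q1", "yes")
    print("can advance:", session.next_page(), "missing:", session.missing())
    session.answer("q2", "no")
    print("can advance:", session.next_page(), "page:", session.current)
```

Checking completeness before any request is sent avoids a server round trip for every incomplete page, which is the resource saving the client-side approach aims for.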


Figure 25. Building assignment process for online perception questionnaire system

Figure 26. Real-time survey result interface for online perception questionnaire system


Figure 27. Online perception questionnaire system screenshot.

5.4 AUTOMATION OF THE BUILDING PERFORMANCE REPORT PROCESS

5.4.1 Report Module Concept

The value of the data warehouse can be significantly enhanced by providing better analysis tools. Generating building performance reports through the web portal can be a powerful tool for both researchers and interested building managers. As shown in Figure 28, four parts of the data can be used to generate a customized performance report. Depending on the data resolution of interest (summary, concise, or complete data sets), the participant can select the specific information to include in the report builder. The server-side engine then processes the needed data sets, plots and tables and formulates the report on the fly. Because all report material is generated from the user's selection, wasted resources are significantly reduced.


In the same way, associated research reports and reference tables can be appended to the back of the report for further reference.

Figure 28. Major components for building customized reporting system.

5.4.2 Report Module Implementation

When the user clicks the “Report” category on the map, as shown in Figure 29, three reporting services are offered: a 4-page summary, a measurement data summary and the complete report. Building characteristics, indoor air quality, lighting, microbiology, sound and perception questionnaire results are automatically summarized into a 4-page “general summary report”. The measurement data summary provides only summarized results from the building measurement devices. The final, long complete report contains all data sets, including the questionnaire data.
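The relationship between the three report types and their contents can be sketched as a lookup followed by an optional user selection. This is a hypothetical outline of the idea, not the portal's report engine; in particular, which sections count as coming from measurement devices is an assumption, and all names are invented.

```python
# Assumption: these four sections come from building measurement devices.
MEASURED_SECTIONS = ["indoor air quality", "lighting", "microbiology", "sound"]
ALL_SECTIONS = ["building characteristics", *MEASURED_SECTIONS,
                "perception questionnaire"]

REPORT_TYPES = {
    "summary": {"sections": ALL_SECTIONS, "detail": "summarized"},
    "measurement": {"sections": MEASURED_SECTIONS, "detail": "summarized"},
    "complete": {"sections": ALL_SECTIONS, "detail": "full"},
}


def report_outline(report_type, selected=None):
    """List the sections to render for one report request.

    `selected` optionally narrows the report to the sections the user
    ticked; section names that the report type does not offer are ignored.
    """
    spec = REPORT_TYPES[report_type]
    sections = spec["sections"]
    if selected is not None:
        sections = [s for s in sections if s in selected]
    return {"type": report_type, "detail": spec["detail"], "sections": sections}


if __name__ == "__main__":
    print(report_outline("measurement"))
    print(report_outline("complete", selected={"sound", "perception questionnaire"}))
```

Building the outline first lets the server query and render only the selected sections, which matches the goal of generating report material on demand.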

Figure 28 labels: Perception Questionnaire, Building Characteristics, Building Energy Usage, Building Environment Measurement.


Figure 29. Three major components for building customized reporting system.

5.4.2.1 Short Report Module

As shown in Figures 30 through 33, the summary report presents the most important information for each building of interest. The summary report can be exported as an Excel or PDF file for download, as shown in Figure 34. Figure 35 shows the PDF file produced from the “Summary” report.


Figure 30. Page 1 of the “General Summary” report relating to Building characteristics


Figure 31. Page 2 of the “General Summary” report relating to IEQ measurement results


Figure 32. Page 3 of the “General Summary” report relating to IEQ and Sound measurement results


Figure 33. Page 4 of the “General Summary” report relating to lighting, microbiology, and questionnaire


Figure 34. All reports can be exported as an Excel or PDF file through the built-in functions.

Figure 35. Exported pdf document from the summary report


5.4.2.2 Measurement and Complete Report Module

As shown in Figure 36, the “Measurement Data” reporting module provides flexible options for selecting the required data types. The output is converted directly into an 8-page PDF file, shown in Figure 37, which conserves Internet resources.

The “Complete” reporting module, shown in Figure 38, is similar to the “Measurement Data” reporting module but also provides extensive selections for the perception questionnaire results. The questionnaire results are further compiled into horizontal stacked bar charts. The selections shown in Figure 38 produce a 43-page PDF file (Figure 39).

Figure 36. Measurement data report can be customized by selecting required data type.


Figure 37. Pdf file output from the “Measurement Data” reporting module


Figure 38. Output of the “Complete” data reporting contents can be customized by selecting required data type.


Figure 39. Pdf file output from the “Complete” reporting module

5.5 TESTING AND DEBUGGING OF THE DW SYSTEM

Tests were performed to verify that the system operates as designed. Outside industry experts provided valuable comments on improving the site. Based on the test findings, adjustments were made to the data warehouse. The information contained in the current data warehouse (DW) has been compiled into an interactive data access module and a customized reporting module. System testing was based on three criteria: (1) system reliability, (2) interface usability, and (3) data access efficiency. Bugs reported by external users and by internal developers and researchers were fixed.


6. REFINEMENT OF THE DATA WAREHOUSE

Data flow and presentation efficiency were discussed and revised intensively with project members over the course of the project. The web portal module arrangement went through several major revisions, and the outcome was significant. The Building Sciences Database, the highlight of the entire portal, is now functional and effective. Figure 40 shows the latest home page of the “www.ncembt.org” portal.

Figure 40. Re-configured NCEMBT web page defines focal features of building sciences database


During the last quarter of the project, several data requests were made for analysis purposes, and the unplanned data compilation they required exposed a few issues with the current data warehouse, including (1) a tedious process for creating data files readable by the SAS statistical software, and (2) multiple steps for querying building-based, parameter-specific data sets into MS Excel or csv files. Because the number of customized data requests is expected to increase after all 30 surveyed buildings are integrated into the NCEMBT portal, refining the initial data warehouse design and improving the existing ETL processes were identified as necessary. A “dimensional model”, or structured information presentation strategy, was revisited. As defined earlier, the primary DW design goals were

to present the needed information as simply as possible by providing COMPILED DATA SETS,

to return query results as quickly as possible based on DATA TYPES, such as IEQ or lighting, and

to provide customized BUILDING-BASED reports that reflect the building characteristics.

The revised design needs to

present the needed information based on BUILDING TYPES (office, LEED-certified, school or high-performance) or DATA TYPES (IEQ, microbiology, sound or lighting) using COMPILED DATA SETS,

provide downloadable results based on INDIVIDUAL BUILDING, USER GROUP (registered customers or researchers) or DATA RESOLUTION (hourly or one-minute), and

provide customized INDIVIDUAL BUILDING performance reports (MS Excel or Adobe PDF formats), PARAMETER-BASED BUILDING COMPARISON reports, such as 3-day humidity data sets for selected building types, and BUILDING-BASED BASELINE data sets that reflect the performance correlation between an individual building and the NCEMBT baseline data sets.

The data sets in the NCEMBT web portal can be treated as multi-dimensional data grids. In line with the revised DW goals, a prototype of the integrated interface is presented in Figure 41. Users would be able to access data sets by data type, building type, and geographical location (or climate zone). Because all building characteristics are documented in the DW, users can further query for buildings with certain characteristics, such as square footage, occupant capacity or year of construction, to name a few. The purpose of the revised interface is to help users locate and compare data sets that match their specific needs. The results can be downloaded to local storage and presented in a specific reporting format, such as an Excel spreadsheet or Adobe PDF. The prototype of this interface was developed at the end of the project period. However, the actual product or interface is beyond the scope of the current project.
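A minimal sketch of such a multi-dimensional query, followed by a CSV export of the kind that statistical packages can read, is given below. It is an illustration under assumed field names (`building_type`, `climate_zone`, `floor_area_sqft`, `year_built`), none of which are taken from the actual DW schema, and it filters an in-memory list rather than querying SQL Server.

```python
import csv
import io


def query_buildings(buildings, building_type=None, climate_zone=None,
                    min_floor_area=None, built_after=None):
    """Return buildings matching every criterion that is not None."""
    matches = []
    for b in buildings:
        if building_type is not None and b["building_type"] != building_type:
            continue
        if climate_zone is not None and b["climate_zone"] != climate_zone:
            continue
        if min_floor_area is not None and b["floor_area_sqft"] < min_floor_area:
            continue
        if built_after is not None and b["year_built"] <= built_after:
            continue
        matches.append(b)
    return matches


def to_csv(rows, columns):
    """Serialize the selected columns of `rows` as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented sample records for illustration only.
    sample = [
        {"building_id": "B01", "building_type": "office", "climate_zone": "3B",
         "floor_area_sqft": 120000, "year_built": 1998},
        {"building_id": "B02", "building_type": "school", "climate_zone": "3B",
         "floor_area_sqft": 80000, "year_built": 2005},
    ]
    recent = query_buildings(sample, climate_zone="3B", built_after=2000)
    print(to_csv(recent, ["building_id", "year_built"]))
```

Treating each criterion as optional lets one function serve queries by building type, location or building characteristics, alone or in combination.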


Figure 41. An integrated interface for searching, reporting and presenting multi-dimensional data.



NATIONAL CENTER FOR ENERGY MANAGEMENT AND BUILDING TECHNOLOGIES

601 NORTH FAIRFAX STREET, SUITE 250

ALEXANDRIA, VA 22314

WWW.NCEMBT.ORG