fuzzydatamining.doc



NSF Topic 4(c): MATHEMATICAL SCIENCES: Statistical Methods NSF Sol. 97-64
Proposal Title: “Fuzzy Data Mining” PIN: SFS-97-22

B. Cover Page

NSF Topic 4(c) MATHEMATICAL SCIENCES: Statistical Methods

“FUZZY DATA MINING”

Submitted to:

Solicitation 97-64 (SBIR Program)

National Science Foundation PPU

4201 Wilson Blvd Room P60

Arlington VA 22230

703/306-1391

SciFish - 1 -


C. Project Summary

SUMMARY

With the proliferation of data, data mining tools are becoming available to meet the market demand for ways to find useful information within that data. One drawback to data mining, specifically data mining of spatial data, is representing vastly different data values and inferring missing data. This is especially evident in data mining applications that seek to find relationships between biological and environmental parameters. Current data mining approaches that utilize neural networks, genetic algorithms, or statistical techniques do not inherently allow for such common data inadequacies. A methodology is needed that can properly represent, and process, data with a large amount of uncertainty.

Scientific Fishery Systems, Inc. (SciFish) proposes the development of a fuzzy data mining methodology that utilizes fuzzy set theory in two key steps of the data mining process. First, fuzzy membership functions are used to represent each data attribute. This allows the data mining practitioner to properly represent each parameter, defining the ranges for low, medium, high, and so on. Second, fuzzy set operations are used during the data mining process, providing different fuzzy correlations that can then be examined to reveal strong trends that traditional correlation techniques might have missed.

COMMERCIAL POTENTIAL

The commercial potential of the proposed fuzzy data mining approach will depend on SciFish’s ability to convince GIS and data mining users that incorporating fuzzy techniques will let them extract more information from their data than they currently can. The best way to make this happen is through a successful demonstration of fuzzy data mining on an application of significant interest to a large community. One such area is the fisheries, where the interactions and relationships between various species and their environment are largely unknown. From this foundation, it will then be possible to extend such applications into other areas, such as oil exploration, forest management, wildlife management, retail site exploration, and local zoning and planning.


Table of Contents

B. COVER PAGE

C. PROJECT SUMMARY

TABLE OF CONTENTS

D. IDENTIFICATION AND SIGNIFICANCE OF THE PROBLEM OR OPPORTUNITY

E. BACKGROUND AND TECHNICAL APPROACH

E.1 BACKGROUND
E.1.1 The Ten Steps of Data Mining
E.1.2 Key Environmental Factors Affecting Fishes
E.1.3 Fuzzy Data Mining the North Pacific Fisheries: A Test Case

E.2 TECHNICAL APPROACH
E.2.1 Task 1: Data Collection for the North Pacific
E.2.2 Task 2: Develop Fuzzy Representation Methodology
E.2.3 Task 3: Develop Fuzzy Correlation Methodology
E.2.4 Task 4: Specifying the Fuzzy Data Mining Software Product
E.2.5 Task 5: Perform Market Analysis (SciFish Funded)
E.2.6 Task 6: Technology Transfer

E.3 RELATED RESEARCH AND DEVELOPMENT
E.3.1 Related Work by SciFish
E.3.2 Related Work by Others

F. PHASE I TECHNICAL OBJECTIVES

G. PHASE I RESEARCH PLAN

H. COMMERCIAL POTENTIAL

I. PRINCIPAL INVESTIGATOR AND SENIOR PERSONNEL

I.1 PATRICK K. SIMPSON, PRINCIPAL INVESTIGATOR

J. SUBCONTRACTS AND CONSULTANTS

K. EQUIPMENT, INSTRUMENTATION, COMPUTERS, AND FACILITIES

L. CURRENT AND PENDING SUPPORT OF PI AND SENIOR PERSONNEL

M. EQUIVALENT OR OVERLAPPING PROPOSALS TO OTHER FEDERAL AGENCIES

N. PROPOSED BUDGET

N.1 GENERAL INFORMATION
N.2 COST REFERENCES


D. Identification and Significance of the Problem or Opportunity

With the proliferation of data, data mining tools are becoming available to meet the market demand for ways to find useful information within that data. Data mining is an automated search for new and valuable information in a set of data. The ultimate objective of data mining is knowledge discovery. Data mining methodology extracts hidden predictive information from large databases.

The Problem. One drawback to data mining, specifically data mining of spatial data, is representing vastly different data values and inferring missing data. This is especially evident in data mining applications that seek to find relationships between biological and environmental parameters. As an example, data mining can be a valuable tool if applied to the fisheries. But, with fisheries data, there is a tremendous difference in value ranges, spatial extent, temporal extent and data validity. Current data mining approaches that utilize neural networks, genetic algorithms, or statistical techniques do not inherently allow for such common data inadequacies. A methodology is needed that can properly represent, and process, data with a large amount of uncertainty.

The Opportunity. Scientific Fishery Systems, Inc. (SciFish) proposes the development of a fuzzy data mining methodology that utilizes fuzzy set theory in two key steps of the data mining process. First, fuzzy membership functions are used to represent each data attribute. This allows the data mining practitioner to properly represent each parameter, defining the ranges for low, medium, high, and so on. Second, fuzzy set operations are used during the data mining process, providing different fuzzy correlations that can then be examined to reveal strong trends that traditional correlation techniques might have missed. As an example, it is quite possible that a high abundance of young fish is strongly correlated with high water temperatures. Such a result would be immediately available using the proposed technique. With existing techniques the same result would not be revealed, because high correlations are biased toward the larger end of both value ranges: older fish and higher temperatures. An illustration of the entire fuzzy data mining approach is outlined in Figure 1.

The Benefits. The proposed fuzzy data mining approach will allow the practitioner to partition the parameter space into a set of membership functions that are germane to the task. A large Walleye Pollock has a very different length and weight than a large Pacific Halibut. The proposed approach allows those different ranges to be compared equitably.

In addition to applying fuzzy set technology to the data mining process, the proposed approach also emphasizes the exploration of spatial data sets. Although geographic information systems (GIS) are intended to provide analyses of spatial data, such analysis is almost entirely application specific, intended to answer questions such as: Where is the watershed? How much area is covered by trees? Where is the best spot to look for oil? The proposed fuzzy data mining approach will be a significant new tool in that arsenal, providing answers to a whole new set of questions, such as: What parameters have the greatest impact on young fish? What is the relationship between depth and fish size? What other species are most strongly correlated with Walleye Pollock?

Prior Experience. SciFish is an innovative technology company with a proven track record of taking concepts into working field prototypes, and prototypes into the marketplace. Current fisheries-related products include the development of a broadband sonar fish identification system, a broadband sonar temperature profiler, and a fisheries geographic information system entitled Fisherman’s Associate that integrates several data sources to help fishers optimize their operations. This last product is currently being sold commercially. The sonar fish identification system will begin manufacturing and sales in late 1998. The temperature profiler recently completed Phase I development. All of this technology is the result of SBIR funded projects. Although the proposed fuzzy data mining product is not specifically a fisheries-related application, SciFish will be using a fisheries data set to develop the approach. In addition, SciFish’s prior experience in developing a software product provides this project with valuable insights that can enhance its overall probability of becoming a commercial success.

[Figure 1 boxes, in order: Data Base; Parameter Extraction; Fuzzy Representation; Spatial Representation; Fuzzy Correlation; Analysis Report]

Figure 1. Outline of Fuzzy Data Mining Approach

During Phase I, SciFish will develop a fuzzy data mining software product that can be applied to a myriad of spatial problems. To accomplish this goal, SciFish will develop the fuzzy data mining methodology through its application to the fisheries. Several software product specifications will be created for different commercialization opportunities. A detailed market analysis will be conducted with SciFish funding. Finally, a report will be produced that describes the details of each stage of this development process.

During Phase II, SciFish will produce at least one of the software products, as well as extend the fuzzy data mining methodology from local spatial analysis to global and spatiotemporal analysis.

The Commercial Potential. The commercial potential of the proposed fuzzy data mining approach will depend on SciFish’s ability to convince GIS and data mining users that incorporating fuzzy techniques will let them extract more information from their data than they currently can. The best way to make this happen is through a successful demonstration of fuzzy data mining on an application of significant interest to a large community. One such area is the fisheries, where the interactions and relationships between various species and their environment are largely unknown. From this foundation, it will then be possible to extend such applications into other areas, such as oil exploration, forest management, wildlife management, retail site exploration, and local zoning and planning.

E. Background and Technical Approach

The following three sections provide background (§E.1), describe the technical approach (§E.2), and review related research in the proposed area (§E.3).

E.1 Background

The following background sections lay the groundwork for the Phase I Work Plan that follows. Three areas are reviewed. First (§E.1.1), a set of ten steps for data mining is outlined. Next (§E.1.2), the key environmental factors that influence fishes are reviewed. Finally (§E.1.3), the motivation for using the Walleye Pollock as a test case during product development is provided.

E.1.1 The Ten Steps of Data Mining

In a recent PC AI article, a set of 10 steps for data mining was described. These are summarized here to provide an overview of the current data mining methodology. In the following sections, the steps that will be modified are steps 7 and 8, which deal with constructing the model and validating the findings.

1. Identify the Objective. Clearly define the intent of the analysis.

2. Select the Data. Select the data available for achieving the goal.

3. Prepare the Data. Determine which attributes and parameters within the selected data should be used for the analysis, and format the selected parameters.

4. Audit the Data. Evaluate the resulting data to determine if the data from the various sources have the same level of confidence, range of values, time extent, and spatial extent. Discard all parameters and attributes that are deemed insufficient.

5. Select the Tools. Decide which tool is the best for meeting the objective. The emphasis of the proposed approach is to utilize a fuzzy systems approach for those data elements with widely varying ranges in value, time, and space.

6. Format the Solution. Determine the format of the solution. With the fuzzy systems approach, this step includes the creation of fuzzy membership functions for each of the parameters and attributes. For the application presented herein as an example, the format of the data will consist of fuzzy membership values defined for cells of a predefined resolution.

7. Construct the Model. Apply the selected data mining tool, in this instance a fuzzy correlation approach, to the formatted data. The result of the proposed fuzzy data mining approach will be the identification of strong correlations between variables.

8. Validate the Findings. Share the results of the data mining with the client. Determine if the results are valid. Make corrections and repeat step 7 as needed.

9. Deliver the Findings. Provide a report to the client summarizing the results.

10. Integrate the Solution. Apply the findings as appropriate.

E.1.2 Key Environmental Factors Affecting Fishes

There are several environmental factors that influence different aspects of a fish’s life. Some environmental factors, such as current, affect the transportation of larvae while others are related to food. The following sections outline many of the key environmental factors that affect fishes. With this information, it is then possible to determine which MTPE-derived data products can be used to measure these environmental factors.

Sea and Swell. Waves in the sea, generated by local and distant wind fields (wind waves and swell, respectively) are the most significant phenomena at sea which affect safety, comfort, fishing operations, and fish behavior and availability. There are three different effects of waves on the sea below the surface which are of concern to the fisheries:

1. Vertical mixing by wave action and turbulence caused by breaking waves. This wave mixing can deepen the surface mixed layer depth and sharpen the thermocline gradient. Furthermore, it can affect fish directly by making them seasick and inducing them to move deeper, where the orbital movement of water, caused by waves, is absent.

2. Waves cause current (mass transport by waves) in addition to surface wind drag.

3. Breaking waves cause wave noise which can affect fish behavior.

Surface Currents. Fish sense currents with the rheotactic organ located on the lateral line. Generally, fish head into the current even when they let themselves be carried with it. The swimming speed of the fish depends on their size, and is affected by temperature, being slower in lower temperatures. Fish eggs, larvae, and small juveniles are carried with currents and dispersed by them. Japanese fisheries scientists as well as fishermen have long recognized that pelagic fish tend to aggregate at current boundaries, where good catches are made. 1 The reasons for this are considered to be threefold:

1. Food supply (micronekton) accumulates at current convergences;

2. The current boundary acts as an environmental boundary; and

3. Migrating fish dynamically aggregate at current boundaries.

1 Burbank, A. & Douglas, R. (1969). Fisheries forecasting systems - a study of the Japanese fisheries forecasting system. Final report for IR&D MJO 9843-39. TRW Systems, 99900-6865-RO-00, 55 pp.


Salinity and Basic Nutrients. The chemistry of sea water which might affect fish is little influenced by weather and by climatic changes. Two aspects should, however, be considered: salinity and basic nutrient salts such as phosphates and nitrates. These elements can be limiting factors in basic organic production (phytoplankton production) in the sea. Changes in salinity are small and usually indicative of advective changes and mixing. Of other chemical properties, the changes of nutrient salts are indicative of productivity changes and eutrophication.

Light. Changes in light conditions (cloudiness, turbidity) might affect basic organic production as well as fish behavior.2

Sea Surface Temperature. Heat exchange between the atmosphere and the ocean takes place through the sea surface, so the sea surface temperature (SST) plays a major role in these processes. SST is the most observed parameter in the sea, and is also a good indicator of various processes in the surface layer which have occurred in the past. At times, the temperature itself might not be the direct affecting factor we are looking for, but it might be used as an indicator of other changes and conditions in the sea. Examples of indirect uses are the estimation of upwelling intensities and the computation of current and surface water type boundaries. Correlations between temperature and the behavior and occurrence of fish have been sought and found. An extensive summary of this subject is given by Laevastu & Hayes.3

Surface Pressure. The climatological mean surface pressure systems are created by summation of synoptic surface processes which depict the movement of the surface lows. These storm tracks can vary considerably from year to year in space and time. The consequences of these variations are manifold: first they cause changes in the surface layers in the oceans, especially currents and mixing, which, in turn will affect some components of the marine ecosystem. Secondly, they affect the fishing operations.

Wind Speed and Direction. Monthly surface wind anomalies can be computed by deriving daily surface wind from the surface pressure distribution, forming long-term monthly means, and taking the difference between the given monthly mean wind and the long-term mean. These surface wind anomalies can have various effects on the ocean, besides the creation of anomalies in surface currents. Cushing4 described the formation of the Great Salinity Anomaly of the 1970s as being caused by the stress of northerly winds off East Greenland in winter during the 1960s and drifting across the North Atlantic for nearly twenty years. Upwellings along the coasts are also created by prevailing wind systems and thus are sensitive to local wind anomalies.
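The anomaly computation described above can be sketched in a few lines. This is only an illustration under stated assumptions: wind is treated as a scalar speed (direction is ignored), the anomaly is taken as the given month's mean minus the long-term mean for the same calendar month, and the function name and sample values are hypothetical rather than drawn from any data set in this proposal.

```python
from statistics import mean


def monthly_wind_anomaly(daily_speeds_kmh, same_month_means_kmh):
    """Return this month's mean wind speed minus the long-term monthly mean.

    daily_speeds_kmh: daily surface wind speeds for the month of interest.
    same_month_means_kmh: mean speeds for the same calendar month in past years.
    Both inputs are hypothetical; sign convention is an assumption.
    """
    monthly_mean = mean(daily_speeds_kmh)
    long_term_mean = mean(same_month_means_kmh)
    return monthly_mean - long_term_mean


# Made-up numbers, for illustration only.
january_daily = [20.0, 30.0, 40.0]
january_history = [24.0, 26.0, 25.0]
print(monthly_wind_anomaly(january_daily, january_history))
```

A positive value would indicate a windier month than the long-term average for that calendar month.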

Water Type. The association of a stock of fish with a water type (mass) has been described by Seckel & Waldron.5 Different water types have different plankton contents, both by abundance and by species dominance. It is possible that pelagic fish are associated with different plankton as food in these water types or with different food abundance, and are advected with these preferred water types. Large scale changes of types of water masses caused by circulation changes have been assumed to cause changes in fish distribution.

E.1.3 Fuzzy Data Mining the North Pacific Fisheries: A Test Case

In a recent workshop entitled “Changing Oceans and Changing Fisheries: Environmental Data for Fisheries Research and Management”6 five themes emerged. Two of those themes were: (1) to develop baseline time-series of the most important parameters related to the fisheries, and (2) to apply new environmental data technologies to fisheries problems. Other themes dealt with sharing information and demonstrating the applicability of the identified data sources. Each of these themes points toward a need for a tool that can identify key relationships between environmental parameters and fish stocks as well as identify trends among those key parameters. Data mining can be an important asset here.

2 Laevastu, T. (1993). Marine Climate, Weather and Fisheries, Halstead Press, New York, 204 pp.

3 Laevastu, T. & Hayes, M. (1981). Fisheries Oceanography and Ecology, Fishing News Books, Oxford, 199 pp.

4 Cushing, D. (1982). Climate and Fisheries, Academic Press, London, 373 pp.

5 Seckel, G. & Waldron, K. (1960). Oceanography and the Hawaiian skipjack fishery, Pacific Fisherman, 58(2), 11-13.

6 On-Line at http://upwell.pfeg.noaa.gov/workshop


The North Pacific fisheries would serve as an excellent test case for the fuzzy data mining approach. The vastly different values of fish catch, environmental data, and geographic information pose a true challenge to traditional data mining approaches. Specifically, the Walleye Pollock will be examined using the fuzzy data mining approach described herein. There are several reasons for selecting this species:

1. Commercially Important. Walleye Pollock supports the largest single-species fishery in the world. Fished with various trawls from the Sea of Japan to the Gulf of Alaska, world-wide harvests annually averaged 5.6 million mt during the 1980s. It is fished commercially in the Pacific from the Bering Sea to Oregon. The largest harvests have come from the southeastern Bering Sea, averaging 975,000 mt annually during 1981-83, followed by the western Gulf of Alaska and the Aleutian Islands, averaging 270,000 mt combined. The annual catch off British Columbia averages about 2,000 mt. Historically the focus of foreign fleets, by 1986 joint venture and wholly domestic fisheries accounted for 75% of its harvest in the U.S. EEZ. In 1988, the entire pollock harvest in U.S. waters was 'Americanized' as no directed foreign fishing was permitted off Alaska: U.S. fishermen caught more than 1.4 million mt, valued at over $200 million. The flesh is soft and therefore marketed in processed form, domestically as fish sticks and overseas as animal feed. It is marketed as fish meal and minced fish (surimi) and is often exported in such forms as artificial crab legs. The roe is also an important export.

2. Biologically Important. Walleye Pollock is an extremely important prey species for larger fishes, birds, and mammals and an important predator on pelagic organisms.

3. Politically Important. Walleye Pollock are a highly migratory species, with annual routes that pass through U.S., Canadian, Russian, and International waters. The harvest of Pollock at different points along their annual path is becoming an issue of great concern to the nations involved.

4. Prior Research Available. Because of the tremendous commercial, biological, and political value of Walleye Pollock, there has been a great deal of research conducted on various aspects of the species. As such, it makes an excellent candidate for validating the results of the fuzzy data mining approach. Excellent examples of such work include papers by Swartzman, Silverman, & Williamson.7

5. Data Availability. There is a tremendous amount of oceanographic, atmospheric, biological, and geographical information available in the North Pacific. This data will allow analysis of interactions between species, as well as between species and environment.

E.2 Technical Approach

The proposed fuzzy data mining approach will be developed, demonstrated, and evaluated using Walleye Pollock as the primary target. The objective of this analysis is to determine those environmental variables that are locally and globally correlated with Walleye Pollock. The core elements of the proposed fuzzy data mining methodology are as follows:

1. Fuzzy Membership Representation. Create fuzzy set membership functions for each of the parameters of interest.

2. Creation of Spatial Representations. Create a spatial layer for each fuzzy set membership function that defines the degree of membership for each cell in a predefined grid.

3. Fuzzy Correlation of Spatial Data. Using the spatial representation, perform fuzzy correlation analyses, both locally and globally, to determine which parameters are strongly correlated. Condense the results of this analysis into a set of highly correlated parameters, both in value and space.
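To make the third element more concrete, the sketch below compares two membership layers using the standard fuzzy set operations (minimum for intersection, maximum for union). The overlap ratio it computes is an illustrative stand-in chosen for this example, not the fuzzy correlation measure that the proposal develops; the function name, layer names, and sample values are all hypothetical.

```python
def fuzzy_overlap(layer_a, layer_b):
    """Global overlap of two membership layers given as flat lists in [0, 1].

    Uses min as fuzzy intersection and max as fuzzy union, then returns
    the ratio of their sums (1.0 means identical layers, 0.0 means the
    layers never co-occur). This ratio is an assumed illustrative measure.
    """
    intersection = sum(min(a, b) for a, b in zip(layer_a, layer_b))
    union = sum(max(a, b) for a, b in zip(layer_a, layer_b))
    return intersection / union if union > 0 else 0.0


# Hypothetical memberships for four grid cells.
young_pollock = [0.0, 0.5, 1.0, 0.2]
high_temperature = [0.0, 0.5, 0.8, 0.0]
print(fuzzy_overlap(young_pollock, high_temperature))
```

A local version of the same idea would apply the measure to a neighborhood of cells around each grid position rather than to the whole layer.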

7 Swartzman, G., Silverman, E., & Williamson, N. (1995). Relating trends in walleye pollock (Theragra chalcogramma) abundance in the Bering Sea to environmental factors, Can. J. Aquat. Sci., Vol. 52, pp 369-380.

Pelletier, D. & Parma, A. (1994). Spatial distribution of Pacific Halibut (Hippoglossus stenolpis): An application of geostatistics to longline survey data, Can. J. Aquat. Sci., Vol. 51, pp.1506-1518.


Table 1 Here (landscape view)


The first three steps are organized into separate tasks in the following sections. In addition, tasks for data collection, software product definition, market analysis, and technology transfer are included. The schedule for these tasks is provided in a later section (§G).

E.2.1 Task 1: Data Collection for the North Pacific

During Phase I, the data sources that are available for the North Pacific will be collected and validated. Data sources currently available for this region are listed in Table 1. SciFish has immediate access to all of these data sources. The data will be transferred from the various forms of storage to a single database of tables covering the years 1980 - 1996.

Table 2. Parameters and Attributes for North Pacific Walleye Pollock

Parameter                                Range
Fisheries Statistics
  1. Length (m)                          [0 - 2]
  2. Weight (kg)                         [0 - 200]
  3. Catch Per Unit Effort (kg/hr)       [0 - 1,000,000]
  4. Count                               [0 - 100,000]
Sea Surface
  5. Temperature (°C)                    [-2 - 20]
  6. Pressure (mB)                       [20 - 40]
  7. Current (km/hr)                     [0 - 10]
Wind
  8. Speed (km/hr)                       [0 - 150]
  9. Direction (deg)                     [0 - 360]
Bathymetry
  10. Depth (m)                          [0 - 8,000]
Seasons
  11. Dates (months)                     [Jan - Dec]

[Figure 2: three plots of fuzzy membership (vertical axis from 0.0 to 1.0). Walleye Pollock Length: sets small, medium, large; axis ticks at 10 cm, 30 cm, 50 cm. Walleye Pollock Age: sets young, middle aged, old; axis ticks at 1 yr, 3 yrs, 5 yrs. Bottom Depth: sets very shallow, shallow, deep, very deep; axis ticks at 50 m, 250 m, 500 m, 5,000 m.]

Figure 2. Illustration of Fuzzy Set Representations for Length of Walleye Pollock


E.2.2 Task 2: Develop Fuzzy Representation Methodology

As an example, some of the parameters and their ranges for the North Pacific are shown in Table 2. These data attributes would be relevant to a fuzzy data mining objective of looking for fuzzy correlations between fish species and environmental parameters.

As just this sample shows, the values can range from as small as -2 to as large as 1 million, and can also include textual data. Looking for correlations across such vastly different ranges in value is difficult. This illustrates one way in which the fuzzy data mining approach will be beneficial. Using a fuzzy representation would result in a collection of fuzzy sets for each parameter. One possible set of fuzzy representations is shown in Figure 2 for Walleye Pollock length, Walleye Pollock age, and water depth. The length representation indicates that the majority of Pollock are around 30 cm in length, with large Pollock exceeding 50 cm and small Pollock below 10 cm. The number, value ranges, and shape of the fuzzy sets can be fine-tuned to each parameter.

Fuzzy sets will be derived for each parameter. The resulting fuzzy sets will be used for the fuzzy spatial correlations that are to follow. The range for each fuzzy set is shown in Table 3, with the corresponding fuzzy membership functions shown in Figure 2.

Table 3. Illustrating Fuzzy Membership Function Ranges for Some Parameters

Parameter                      Fuzzy Set           Range

Walleye Pollock Length (cm)    1. Short            [0 - 25]
                               2. Medium           [15 - 45]
                               3. Long             [> 35]

Walleye Pollock Age (yr)       4. Young            [0 - 2.5]
                               5. Middle-Aged      [1.5 - 4.5]
                               6. Old              [> 3.5]

Bottom Depth (m)               7. Very Shallow     [0 - 200]
                               8. Shallow          [75 - 400]
                               9. Deep             [300 - 4,000]
                               10. Very Deep       [> 750]
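As a minimal sketch of how the ranges in Table 3 might be turned into membership functions, the Python below assumes trapezoidal shapes whose sloping edges coincide with the overlap between neighboring ranges. Only the outer limits of each range come from Table 3; the interior breakpoints and the names `trapezoid`, `fuzzify`, and `LENGTH_SETS` are illustrative assumptions, not the shapes the proposal will finally adopt.

```python
INF = float("inf")


def trapezoid(x, a, b, c, d):
    """Rise from 0 at a to 1 at b, stay at 1 until c, fall back to 0 at d."""
    if x < a or x > d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# Assumed breakpoints (a, b, c, d) in cm. Only the outer limits are from Table 3.
LENGTH_SETS = {
    "short": (0.0, 0.0, 15.0, 25.0),
    "medium": (15.0, 25.0, 35.0, 45.0),
    "long": (35.0, 45.0, INF, INF),
}


def fuzzify(x, fuzzy_sets):
    """Return the membership of x in every fuzzy set of one parameter."""
    return {name: trapezoid(x, *params) for name, params in fuzzy_sets.items()}


print(fuzzify(20.0, LENGTH_SETS))
```

Under these assumed breakpoints a 20 cm fish is half "short" and half "medium", which is the kind of graded overlap the ranges in Table 3 are meant to allow.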

Next, the data must be organized into a grid. The spatial resolution of the grid will depend on each application. For the North Pacific example we are developing here, let’s assume the grid cell size is 0.5 degree by 0.5 degree, with a spatial range from 159 W to 164 W Longitude and 52 N to 55 N Latitude, resulting in the grid shown below in Figure 3. Each grid cell’s fuzzy membership value is illustrated by various intensities in color, where the darker colors represent lower membership values. Mathematically, this is expressed as

$$\mathrm{cell}_{ijk} = \mu_k(x)$$

where i = longitude cell number, j = latitude cell number, k = membership function number, $\mathrm{cell}_{ijk}$ is the cell value, $\mu_k$ is the k-th fuzzy membership function, and x is the value being applied to the cell. If a grid cell does not have a value, that cell is given zero membership. If multiple values fall in the same grid cell, one of two approaches will be used: either a centroid weighting will determine the membership value, similar to centroid defuzzification in fuzzy control applications [8], or the median of all the values falling within the grid cell will be calculated before the fuzzy membership function is applied.

8 Eberhart, R., Simpson, P. & Dobbins, R. (1996). Computational Intelligence PC Tools, Academic Press, Boston, MA.
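The gridding step can be sketched as follows. This is a hedged illustration only: it uses the median option for cells holding several values, treats degrees west and north as positive numbers, and relies on a hypothetical `medium_length` membership function with assumed breakpoints. The function name `build_layer` and the sample observations are invented for the example.

```python
from statistics import median


def medium_length(x):
    """Assumed 'Medium' length set: support [15, 45] cm, full membership on [25, 35]."""
    if x <= 15.0 or x >= 45.0:
        return 0.0
    if x < 25.0:
        return (x - 15.0) / 10.0
    if x <= 35.0:
        return 1.0
    return (45.0 - x) / 10.0


def build_layer(observations, membership, lon_min=159.0, lon_max=164.0,
                lat_min=52.0, lat_max=55.0, cell=0.5):
    """Return layer[i][j], the membership of the median value in each grid cell.

    observations: iterable of (lon_west, lat_north, value) tuples.
    Cells without any observation keep zero membership.
    """
    n_i = round((lon_max - lon_min) / cell)
    n_j = round((lat_max - lat_min) / cell)
    buckets = {}
    for lon, lat, value in observations:
        i = int((lon - lon_min) // cell)
        j = int((lat - lat_min) // cell)
        if 0 <= i < n_i and 0 <= j < n_j:
            buckets.setdefault((i, j), []).append(value)
    layer = [[0.0] * n_j for _ in range(n_i)]
    for (i, j), values in buckets.items():
        layer[i][j] = membership(median(values))
    return layer


# Hypothetical catch records: three fish in one cell, one fish in another.
catch = [(159.2, 52.1, 20.0), (159.3, 52.4, 30.0), (159.4, 52.2, 40.0),
         (161.6, 54.7, 20.0)]
layer = build_layer(catch, medium_length)
```

With the grid limits given above, this produces a 10 by 6 layer. One such layer would be built per fuzzy membership function.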


[Figure 3 grid: longitude axis labeled 160W, 162W, 164W; latitude axis labeled 52N, 53N, 54N, 55N]

Figure 3. Illustration of Grid Cells Filled with Fuzzy Memberships

A layer of grid cells, like that shown in Figure 3, will be constructed for each fuzzy membership function that is created. An illustration of the result of this processing can be seen below in Figure 4.

During Phase I, the fuzzy representation approach described herein will be applied to each parameter of interest, resulting in a set of fuzzy membership layers that will be used in the next step of processing. A complete description of the parameters and their corresponding fuzzy set membership functions will be included in the Phase I final report.

E.2.3 Task 3: Develop Fuzzy Correlation Methodology

There are at least three types of fuzzy correlation that can be considered: local correlations (as shown in Figure 4), global correlations, and spatiotemporal correlations. Their processing requirements differ sharply: local correlations carry a relatively modest computational cost, whereas global and spatiotemporal correlations are far more expensive. The local correlations will be the subject of the Phase I effort, with the development of the global and spatiotemporal correlations left for Phase II.

Local Fuzzy Correlations. Local fuzzy correlation is illustrated below in Figure 4, where correlations are performed for the same grid cell across all fuzzy layers. The appropriate fuzzy correlation function is a subject of research during Phase I. Immediate candidates are the fuzzy union (often the MAX function) and the fuzzy intersection (often the MIN function). Other possible fuzzy correlation functions include taking the product of the membership values.

[Figure 4: stacked layers of 1/2 degree cells (Short Length Pollock, Medium Length Pollock, Long Length Pollock, ..., Very Deep) spanning 160W to 164W and 50N to 53N, with the local fuzzy correlation taken across all layers for the same spatial cell]

Figure 4. Illustration of Local Fuzzy Spatial Correlations


The fuzzy correlations will be performed across all combinations of fuzzy layers. Assuming there is a total of N fuzzy layers, the corresponding N x N fuzzy correlation matrix, L, for $\mathrm{cell}_{ij}$ would be constructed using the expression

$$L_{ij}^{kl} = \mathrm{fuzzy\_op}(\mathrm{cell}_{ijk}, \mathrm{cell}_{ijl})$$

where $L_{ij}^{kl}$ represents the correlation between layer k and layer l for $\mathrm{cell}_{ij}$, $\mathrm{cell}_{ijk}$ is the fuzzy layer value for $\mathrm{cell}_{ij}$ in layer k, and fuzzy_op is the fuzzy correlation operator. At the end of this operation, each cell position will have a corresponding fuzzy correlation matrix. An example of this matrix is shown below in Figure 5 using the fuzzy layers shown in Figure 4.

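[Figure 5 matrix: rows and columns are both indexed by the fuzzy layers Short Length Pollock, Medium Length Pollock, Long Length Pollock, ..., Very Deep]

$$L^{kl} = \begin{bmatrix} L_{11} & L_{12} & L_{13} & \cdots & L_{1N} \\ L_{21} & L_{22} & L_{23} & \cdots & L_{2N} \\ L_{31} & L_{32} & L_{33} & \cdots & L_{3N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ L_{N1} & L_{N2} & L_{N3} & \cdots & L_{NN} \end{bmatrix}$$

Figure 5. Illustration of a Fuzzy Correlation Matrix Produced for Each Grid Cell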
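The construction of one cell's matrix can be sketched in a few lines of Python. The function name `local_correlation_matrix` and the membership values are assumptions made for illustration; the operators shown (MIN, MAX, and product) are the candidates named in the text, and none of them is yet the chosen operator.

```python
def local_correlation_matrix(memberships, fuzzy_op=min):
    """Build the N x N matrix L[k][l] = fuzzy_op(cell_k, cell_l) for one grid cell.

    memberships: the N fuzzy layer values of a single cell, one per layer.
    The indices k and l mirror the notation used in the text.
    """
    n = len(memberships)
    return [[fuzzy_op(memberships[k], memberships[l]) for l in range(n)]
            for k in range(n)]


# Hypothetical memberships of one cell in four layers.
cell = [0.2, 0.8, 0.0, 0.6]
L_intersection = local_correlation_matrix(cell, min)             # fuzzy intersection
L_union = local_correlation_matrix(cell, max)                    # fuzzy union
L_product = local_correlation_matrix(cell, lambda a, b: a * b)   # product
```

All three operators are symmetric, so each matrix is symmetric and its diagonal under MIN or MAX simply repeats the cell's own memberships.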

Analysis of the resulting fuzzy correlation matrices will require searching for the larger correlation values within each matrix. These values can be listed for each matrix, and then trends can be sought. Alternatively, further reduction in data can be achieved by computing the median across all matrix locations to produce a summary report for the entire area under examination.
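One possible reading of this analysis step is sketched below: an element-wise median over the matrices of all grid cells, followed by a ranking of the largest entries. Restricting the ranking to off-diagonal entries, and the names `summarize` and `strongest_pairs`, are assumptions for illustration, as are the sample memberships.

```python
from statistics import median


def summarize(matrices):
    """Element-wise median of equally sized square matrices, one per grid cell."""
    n = len(matrices[0])
    return [[median(m[k][l] for m in matrices) for l in range(n)]
            for k in range(n)]


def strongest_pairs(matrix, names, top=3):
    """Largest entries above the diagonal as (value, layer_k, layer_l) tuples.

    Assumes a symmetric matrix, so each pair of layers is reported once.
    """
    n = len(matrix)
    pairs = [(matrix[k][l], names[k], names[l])
             for k in range(n) for l in range(k + 1, n)]
    return sorted(pairs, reverse=True)[:top]


# Hypothetical memberships of three grid cells in three layers.
names = ["short", "medium", "very deep"]
cells = [[0.2, 0.8, 0.6], [0.4, 0.6, 0.6], [0.0, 1.0, 0.9]]
matrices = [[[min(a, b) for b in c] for a in c] for c in cells]
summary = summarize(matrices)
print(strongest_pairs(summary, names, top=1))
```

In this made-up data the strongest summary correlation is between the "medium" and "very deep" layers.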

During Phase I, fuzzy correlation matrices will be created using both Union and Intersection operations. The resulting fuzzy correlation matrices will be analyzed to determine which correlations are strongest and where they occur. The results of this fuzzy correlation analysis will be compared against the existing literature to determine if known relationships were revealed and if new relationships were captured. The results of this comparison will be reported in the Phase I Final report.

Global Fuzzy Correlations. There is an immediate extension of the local fuzzy correlations to a global correlation of data elements. In the global correlation, the correlation matrix is extended to include correlations between all cells of all layers, resulting in an (NxM) X (NxM) fuzzy correlation matrix, where N is the number of layers and M is the total number of grid cells. Clearly this is a tremendous computational extension, but it is thought that it might reveal spatial correlations beyond those found in the local approach. During Phase II, this approach will be examined.
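To give a rough sense of scale, the worked arithmetic below assumes only the N = 10 illustrative fuzzy sets of Table 3 and the 0.5 degree grid of the North Pacific example, which spans 5 degrees of longitude and 3 degrees of latitude. These counts are assumptions for illustration; the real number of layers and cells will depend on the application.

```latex
\begin{align*}
M &= \frac{5}{0.5} \times \frac{3}{0.5} = 10 \times 6 = 60 \text{ grid cells} \\
\text{local entries} &= M \cdot N^2 = 60 \times 10^2 = 6{,}000 \\
\text{global entries} &= (N \cdot M)^2 = 600^2 = 360{,}000 \\
\frac{\text{global}}{\text{local}} &= \frac{(N M)^2}{M N^2} = M = 60
\end{align*}
```

Under these assumptions the global matrix holds M times as many entries as all the local matrices combined, so the gap widens in direct proportion to the number of grid cells.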

Spatiotemporal Fuzzy Correlations. It is an underlying assumption that the fuzzy data mining approach is being applied during a single snapshot in time. It is likely that many other relevant correlations can be found when considering the change of fuzzy membership values from one time slice to the next. This approach would require the formation of local fuzzy correlation matrices for each increment of time, followed by a second analysis that would track correlations over time. This approach will also be considered during Phase II.
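As a very small sketch of what the second analysis might look like, the Python below follows one layer pair at one grid cell across time slices and reports the change between consecutive slices. The approach is left for Phase II, so everything here (the function names `pair_series` and `changes`, the MIN operator, and the sample values) is a hypothetical illustration.

```python
def pair_series(matrices_by_time, k, l):
    """Correlation between layers k and l at one cell, one value per time slice."""
    return [matrix[k][l] for matrix in matrices_by_time]


def changes(series):
    """Change in correlation from each time slice to the next."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Hypothetical memberships of one cell in two layers at three time slices.
slices = [[0.25, 0.5], [0.5, 0.5], [0.75, 0.25]]
# One local correlation matrix per time slice, using MIN as the operator.
matrices = [[[min(a, b) for b in cell] for a in cell] for cell in slices]

series = pair_series(matrices, 0, 1)
trend = changes(series)
```

Here the correlation between the two layers rises and then falls back, which is the sort of pattern a time-based analysis would be looking for.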

E.2.4 Task 4: Specifying the Fuzzy Data Mining Software Product

It is proposed that the development of the fuzzy data mining approach, using a North Pacific fish stock as a test case, will define the methodology to the point where a software product specification can be developed. The best opportunity for a software product of this type is believed to be a third-party add-in to an existing Geographic Information System (GIS), spreadsheet, or database package.

The software produced during Phase I will be written in Visual Basic 5.0 to allow maximum flexibility for developing an add-in package. Products will be explored for each of the existing software products:

1. GIS Add-In. Both ArcView and MapInfo have a large market share in GIS products. A plug-in fuzzy data mining module will be explored with each company.

2. Fisherman’s Associate Plug-In. SciFish has a software product that the fuzzy data mining will be added to immediately following Phase I.

3. Spreadsheet Macro Add-In. Several companies have developed add-in packages that are essentially large macros that run within existing spreadsheet products. Microsoft Excel is an example of a spreadsheet that offers the ability to have such extensions.

4. Database Add-In. Similar to spreadsheets, it is possible to write plug-ins for database products as well. As an example, Microsoft Access has plug-ins for report generation produced by Crystal Reports. In a similar fashion, it would be possible to produce an add-in for Access.

During Phase I, a set of software product specifications will be produced for each of these commercialization avenues. One application avenue will be selected for the Phase II development and subsequent commercialization.

E.2.5 Task 5: Perform Market Analysis (SciFish Funded)

During Phase I, market analysis will be conducted to determine the full commercial potential of the proposed fuzzy data mining product. The results of this analysis will be included in the Phase I Final Report and will include the following:

1. Market Size. The size of the data mining market will be defined. The emphasis will be placed on the domestic market, with a description of the international market if time permits.

2. Market Segmentation. The market will be segmented by industry and geography.

3. Prioritized Targets. The market segments will be prioritized into a set of targets for the initial product introduction.

4. Reaching the Market. The mechanisms for reaching the target market segments will be defined, including direct mail, magazine advertising, internet, and trade shows.

5. Return on Investment (ROI). Using the economic analysis conducted earlier, a return on investment to the customer will be defined over several product price ranges.

6. Sales Projections. A five-year projection of sales under different product price ranges and ROI scenarios will be developed. These scenarios will explore the effect a higher priced product (with higher profit margins) will have compared with a lower priced, but higher quantity, product.

7. Secure Financing. Using the market analysis described above, define the cash flow requirements for the production, sales, and distribution of the product. Construct a balance sheet to demonstrate a projected return on the investor’s investment, and then initiate a search for the investor(s).


E.2.6 Task 6: Technology Transfer

Technology will be transferred between this project and NSF in three ways. First, email and telephone will be used to provide reports of project status and discuss key design decisions. Second, a final briefing will be presented at the end of Phase I. Lastly, a final report will be produced that documents the entire project, including performance evaluation and future design.

E.3 Related Research and Development

E.3.1 Related Work by SciFish

SciFish has a track record of developing and demonstrating innovative new technologies for fisheries applications. Below is a description of Fisherman’s Associate, an SBIR-funded effort that moved from concept to the marketplace. This project demonstrates several aspects of SciFish’s ability to meet the stated objectives of the proposed Fuzzy Data Mining project:

1. Successful SBIR Project. The development of Fisherman’s Associate is being funded by an NSF SBIR project. This project has gone from concept, to prototype, to product introduction within 24 months. The product is now being sold to Alaskan fishermen, with plans to expand to all west coast fisheries by the end of 1998.

2. Availability and Experience with Fisheries Data. SciFish has collected, analyzed, and processed the fisheries data that will serve as the test case during the development of the proposed fuzzy data mining approach.

3. GIS Product Development Experience. It is most likely that the proposed fuzzy data mining approach will have the most immediate market potential in the GIS-related marketplace. Through the development of Fisherman’s Associate, SciFish has gained a tremendous amount of experience working within this segment of the software industry.

4. Software Sales and Support. SciFish has demonstrated that it can sell and support a software product. Fisherman’s Associate currently has a part-time marketer who shares time with product development and testing.

E.3.2 Related Work by Others

There are other data mining packages available today, and some even tout the use of fuzzy logic, but these systems are largely fuzzy expert systems with neural network model generation components at the front end. In these systems, the emphasis is on the development of a model from data, followed by an explanation of the model’s operation.

The proposed fuzzy data mining product is different from existing approaches in at least two ways:

1. GIS Emphasis. The GIS market is growing quickly, yet tools such as the proposed fuzzy data mining product are not available. The proposed product provides an approach that differs significantly from traditional GIS analysis. Furthermore, through the inclusion of fuzzy sets, it is now possible to provide a mechanism for dealing with the vast differences in data ranges that one experiences in the GIS environment.

2. Fuzzy Foundation. Existing data mining products use fuzzy sets for only one aspect of the processing. In effect, they have added some small fuzzy feature to one of their existing functions in an effort to capitalize on the popularity of fuzzy systems. The proposed fuzzy data mining approach instead uses fuzzy sets as the very foundation of its approach. Selecting the fuzzy membership functions for each parameter, determining the fuzzy operations that will be used to create the fuzzy correlation matrix, and, possibly, using fuzzy set operations to further reduce the fuzzy correlation matrices during analysis are the core elements of this approach, not just an option.


F. Phase I Technical Objectives

The goal of the proposed Phase I effort is to develop a fuzzy data mining software product that can be applied to a myriad of spatial problems. There are six objectives that must be met for this project to succeed:

1. Develop Fuzzy Representation Methodology. A methodology will be developed for creating fuzzy representations of data values that will be involved in the fuzzy data mining process.

2. Define Fuzzy Data Mining Operations. A set of fuzzy operators and the results of their application will be defined for spatial data sets.

3. Demonstrate Fuzzy Data Mining. The fuzzy data mining methodology will be demonstrated using the spatial data sets from North Pacific fisheries. The objective of this demonstration will be to find correlations between various environmental and biological phenomena.

4. Define Fuzzy Data Mining Software Product. Define the various software product realizations of the fuzzy data mining methodology, including GIS, database, and spreadsheet add-ins and a stand-alone software product.

5. Market Analysis (SciFish Funded). Define and segment the market. Determine the best strategy for capturing a market presence. Evaluate pricing strategies and barriers to entry.

6. Technology Transfer. Provide progress reports to NSF through regular reporting, a briefing, and a final report.

Phase II Objectives. At the end of Phase I, this project will have at least one, and possibly several, product specifications. These product specifications will be in the form of add-ins for existing GIS, database, and spreadsheet products. From this foundation, the following Phase II objectives are anticipated:

1. Develop software product for at least one of the add-in specifications.

2. Extend product to include global and spatiotemporal analysis as described in §E.2.3.

3. Introduce the first version of the product to test market acceptance.

4. Secure additional capital for the marketing, sales, and distribution of the product.

Phase III Objectives. In Phase III, the software product realization of the fuzzy data mining methodology will be marketed and sold. Infrastructure will be put in place to handle sales, distribution, maintenance, and engineering.

G. Phase I Research Plan

There are six tasks that will be performed during Phase I of this project. The description of what will be accomplished by each task is included in the previous section (§E.2). The schedule for these tasks is outlined in Figure 6. The total duration for this effort is six months with a start date of January 1, 1998.

Task (Months 1 2 3 4 5 6)

1. Data Collection ========

2. Develop Fuzzy Representation ==================

3. Develop Fuzzy Correlations ==================

4. Specify Software Product ============

5. Perform Market Analysis (SciFish Funded) ==================

6. Technology Transfer == ====

Figure 6. Schedule of Tasks


H. Commercial Potential

The commercial potential of the proposed fuzzy data mining approach will depend on SciFish’s ability to convince GIS and data mining users that incorporating fuzzy techniques will let them extract more information from their data than they currently can. The best way to make this happen is through a successful demonstration of fuzzy data mining on an application of significant interest to a large community. One such area is fisheries, where the interactions and relationships between various species and their environment are largely unknown. From this foundation, it will then be possible to extend such applications into other areas, such as oil exploration, forest management, wildlife management, retail site exploration, and local zoning and planning.

The immediate customers for the proposed fuzzy data mining product will be existing spatial data and GIS users. These customers will now be able to represent the data parameters using ranges that relate to the problem domain. Instead of working simply with numbers, they will be working with the degree of membership in predefined sets that are germane to the application. By allowing this flexibility, it will be possible to extract and reveal more relationships and interactions than had previously been possible.

It is estimated that the GIS market is at least $200 M in annual sales, with estimates as high as $1 B when considering the additional value-added services provided. The proliferation of new data sources provided by the internet, new remote sensing systems, and the release of previously classified defense data fuels the need for more sophisticated tools for extracting information, i.e., value, from that data.

Currently, data mining is being performed with neural networks, genetic algorithms, and statistical techniques. Each approach has its own unique value in the data mining tool box. However, none of these techniques provides the representational capability found in the fuzzy data mining approach presented here. Furthermore, each of these approaches emphasizes temporal data, principally financial data, and does not directly address the needs of the spatial data community.

It is SciFish’s intent to deliver the fuzzy data mining software product as a plug-in to existing GIS products. This approach allows SciFish to take advantage of the large customer bases that currently exist, and allows these existing users to work seamlessly within a familiar environment.

I. Principal Investigator and Senior Personnel

The principal investigator for this effort will be Patrick K. Simpson. Mr. Simpson is the original developer of Fisherman’s Associate and has a long track record of successfully developing applications of fuzzy systems to real-world problems. Mr. Simpson will be responsible for the fuzzy data mining design, development, and demonstration, the market analysis, the software product definition, and technology transfer.

Mr. Simpson will hire a software engineer to assist him with this effort immediately upon notification of award.

I.1 Patrick K. Simpson, Principal Investigator

Simpson has a diverse background in image and signal processing, pattern recognition, computational intelligence, and acoustics. Positions at Ball Systems Engineering Division, General Dynamics Electronics Division, Accurate Automation, Inc., and ORINCON, Inc. have stressed a strong mix of project management, technical leadership, and engineering.

In 1993, Simpson founded Scientific Fishery Systems, Inc. to migrate technology from defense to fisheries applications. Since then, Simpson has led the initial development of Fisherman’s Associate, a commercially sold software product for use in the fisheries. In addition, Simpson has led the design and development of two broadband sonar systems and their application to fish species identification and temperature profiling. Simpson also leads the development of a long-range tuna detection system that is being designed in 1997.

Experience History


(1993 - Pres.) Scientific Fishery Systems, Inc., Founder and President
(1990 - 1995) Applied Technology Institute, Instructor
(1992 - 1994) ORINCON, Principal Engineer & Consultant
(1992 - 1994) SeaWay Technologies, Consultant
(1991 - 1992) Accurate Automation, Inc., Chief Engineer
(1988 - 1991) General Dynamics Electronics Division (GDE), Engineering Specialist
(1987 - 1990) University of California at San Diego Extension, Instructor
(1987 - 1988) Ball Systems Engineering Division, Member of Technical Staff
(1986 - 1987) UNISYS Corporation (Sperry), Member of Technical Staff
(1985 - 1986) San Diego State University Foundation (NOSC), Data Analyst
(1974 - 1986) Commercial Fishing on Family Boats, Crew & Captain

Education

B.S. Computer Science, University of California at San Diego, 1986.

Editor Positions

Editorial Board, The Journal of Neural Network Computing (1989 - 1991)
Associate Editor, IEEE Trans. on Neural Networks (1991 - 1994)
Associate Editor, Australian Journal of Intelligent Information Processing (1994 - Present)
Guest Editor, Special Issue on Neural Networks for Oceanic Engineering, IEEE Journal of Ocean Engineering, Fall 1992.

Patents

1. Acoustic method and apparatus for identifying human sonic sources, U.S. Application Serial No. 07/658-642, filed 2/22/91, continuation filed in March 94.

2. Active broadband acoustic method and apparatus for identifying aquatic life, Patent No. 5,377,163, Date of Issue 12/27/94.

Selected Publications

Simpson has written two books, edited two others, and published numerous chapters, articles, reviews, papers, white papers, technical reports, and abstracts on topics that include pattern recognition, the application of intelligent systems to signal processing problems, and the development of intelligent decision aids. A complete resume is available off SciFish’s homepage at www.alaska.net/~scifish. A selection of publications pertinent to this proposal is listed below.

1. Eberhart, R., Simpson, P. & Dobbins, R. (1996). Computational Intelligence PC Tools, Academic Press, Boston, MA.

2. Simpson, P. (1990). Artificial Neural Systems: Foundations, Paradigms, Applications and Implementations, McGraw-Hill Book Company, New York, NY.

3. Simpson, P. ed. (1995). Neural Networks: Theory and Technology, Two Volumes, IEEE Press, Piscataway, NJ.

4. Simpson, P. (1995). Fisheries Management Geographic Information System, NSF Phase I SBIR Final Report, July 1995.

5. Simpson, P. (1992). Fuzzy min-max neural networks: 1. Classification, IEEE Transactions on Neural Networks, Vol. 3, No. 5, pp. 776-786.

6. Simpson, P. (1993). Fuzzy min-max neural networks: 2. Clustering, IEEE Transactions on Fuzzy Systems, Vol. 1, No. 1, pp. 32-45.

7. Brotherton, T., Pollard, T., Simpson, P. & DeMaria, T. (1994). Hierarchical fuzzy neural networks for echocardiogram tissue classification, IEEE Biomedical Eng. Magazine.


J. Subcontracts and Consultants

There are no consultants or subcontracts needed for this effort.

K. Equipment, Instrumentation, Computers, and Facilities

Scientific Fishery Systems, Inc. (SciFish), located in Anchorage, AK, is a rapidly growing research and manufacturing company, currently employing more than 6 people and occupying 1,600 square feet of office and laboratory space. Since the creation of the company in 1993, sales have grown to $500,000 in 1996 and are expected to be over $750,000 in 1997.

SciFish serves as a showcase for the Small Business Innovation Research (SBIR) program by demonstrating that small business holds the key to future technological growth in the United States. SciFish’s SBIRs have been provided by the U.S. Navy, the Department of Commerce, and the National Science Foundation. Objectives of the SBIR program, under which SciFish does the majority of its work, include stimulating technological innovation, strengthening the role of small business in meeting government research and development needs, and increasing the transfer of technology from government research and development programs to private sector applications. As a result of the SBIR programs, SciFish now markets its line of fisheries planning and reporting software products: Fisherman’s Associate and Charter Boat Associate, mapping and planning tools for integrating biological, oceanographic, atmospheric, geographic, and geological information to improve fishing operations.

Facilities. SciFish’s main facility is located at 6100 A Street, Second Floor, Anchorage, AK. This location places SciFish close to a large collection of commercial fisheries, including pollock, salmon, halibut, sablefish, and crab. The SciFish facility has 1,600 square feet of office space and utilizes approximately 250 square feet of additional dry storage. A new office is located near Seattle, WA with approximately 150 square feet of office space and room for expansion. This facility lies close to a significant portion of the North Pacific commercial fishing fleet and provides access to Sea Grant academic facilities.

Equipment. SciFish currently has several PCs, including one large 200 MHz Pentium-Pro w/ 64 MB RAM and 4 GB hard disk that is used to host MapInfo and Fisherman’s Associate development. Another of SciFish’s PCs is a Pentium that hosts an A/D and DSP board that can sample data at 770 kHz, a magneto optical storage device for mass storage, and a CD-ROM drive. SciFish also has a broadband transceiver capable of transmitting and receiving a variety of signal types to a depth of 50 fathoms (100 m). A second broadband transceiver that will operate from 100 to 190 kHz, provide dual beams of 4 and 15 degrees, and operate to a depth of 200 m is currently being built.

Software. SciFish has a large collection of geographic information system, signal processing, and software development tools that support product development, including development environments for Visual C++ and Visual Basic. SciFish has LINUX installed on a 1.2 GB disk on one of the Pentiums to provide UNIX compatibility. Also, SciFish runs Windows 3.1, Windows 95, and Windows NT 4.0 on separate platforms.

L. Current and Pending Support of PI and Senior Personnel

The Principal Investigator, Patrick K. Simpson, is currently devoted full-time to the development of a broadband sonar system for fish identification and a fisheries GIS software product. In January 1998, these projects will be nearing completion, allowing Mr. Simpson to dedicate a portion of his time to the proposed effort.

M. Equivalent or Overlapping Proposals to Other Federal Agencies

There are no other equivalent or overlapping proposals to other federal agencies.
