geoinformatics

Introduction

The Web

The goal of the current Web is to make knowledge widely accessible and to increase the utility of this knowledge by enabling advanced applications for searching, browsing, and evaluation.

The Web presents a vast amount of distributed data and information for human consumption using the Internet infrastructure and a set of WWW communication standards (e.g., FTP, URI, HTTP)

Semantic Web

The Semantic Web is related to the World Wide Web. It is based on data formats that encode knowledge for processing in computer systems (software)

Endeavors leading to it:

Building abstract models that simplify the complex reality

Ontology: description of knowledge about a specific domain in a machine-processable specification with a formally defined meaning. Models are designed with a structure and relationships among their components

Computing with knowledge: representing knowledge such that computers can automatically come to reasonable conclusions (i.e., infer) from encoded knowledge

Exchanging information (communication): distributing, interlinking, and reconciling heterogeneous knowledge at a global scale (i.e., on the Web); this required the Web, HTTP, HTML, etc.

Problems

The knowledge sources are heterogeneous and globally distributed

The exchange of this heterogeneous information requires standard data formats, languages (e.g., HTML), and protocols (e.g., HTTP, FTP, Web services)

Problem with the present Web

The problem is that the Web cannot consume the information which it carries

The users of the Web are human beings, who try to make sense of the information (depending on their background) by reading the ‘best’ hits that the search engines return for their probabilistic keyword-based search

The number of nodes (servers) on the Web, presenting a large volume of data and information, is increasing at a fast pace, making it hard to effectively index the pages and present useful statistics to the users

What can be done?

It would be nice if the information on the Web could automatically be handled by machines, which are capable of processing vast volumes of information at high speed, in a fraction of the time it takes humans to read a document

However, this requires the computers to ‘know’ the meaning, i.e., the semantics, of the information, as human beings do

A Scenario

Suppose we are interested in knowing the location of the normal faults that are currently seismically active in southwest Montana, and which formed through the Tertiary Basin-and-Range tectonics

We, as geologists, type our keywords into the search engines, and get many probabilistic hits. We start reading the documents available on the Web and, depending on the level of our knowledge, which is very variable, extract and learn different information about these faults

The information may not all be what we want: some of the returned hits listed as seismic faults may be the reactivated Tertiary faults we are looking for

Others may be younger than the Basin-and-Range tectonics, and formed through thermal expansion and subsequent subsidence when the North American plate moved southwest relative to the fixed Yellowstone hot spot

Geologists can decipher the difference between these two events using their knowledge of the two extensional events (Basin-and-Range and the hot spot)

They can make this distinction based on their experience in relation to the characteristics of each of these events, such as fault orientation, spatial distribution, cross-cutting relationships, and unconformities based on the sedimentary cover

Problem with keyword search

The computer has no clue as to the meaning of the fault, normal fault, and seismically-active keywords

How can we make computers learn the geological knowledge, so that our queries return more useful information?

How can we tell the computer that the hanging wall in a normal fault moves down during extension; that horsts represent the footwall; the tick (or lollipop) symbol on the fault trace map is on the hanging wall; the trend of the fault trace is read from the North or South reference in azimuth or quadrant format; the length of a fault trace is read based on the scale of the map; or rocks are not liquid?

Computer doesn’t know meaning

Not only does the computer not know what a normal fault or stress is, it also does not know that a normal fault forms by extension, not contraction, when the maximum principal compressive stress is vertical, or that although a fault is a planar feature, it is represented on a map as a horizontal, linear feature (fault trace)

Currently, only geologists know these. We need to make ‘geologistoids’ by formally structuring and specifying our knowledge and feeding it to software that can read it!

Data, Information, and Knowledge

Knowledge management deals with accessing, manipulating, and sharing of knowledge

Knowledge engineering: developing knowledge-based systems (software) in any field that can help the community to process data and information based on the consensual knowledge in that field

This requires understanding the notion of data, information, and knowledge

Terminology

Data refers to values assigned to the attributes (properties) of a particular object or process entity that occupies space and time

An object is a bona fide or fiat portion of reality, such as a class of individuals, an individual or its parts, or a spatial region

Bona fide objects exist independent of our perceptions and classification, and are demarcated from their surroundings (e.g., fold, formation, oil)

Fiat objects, on the other hand, exist only because of our partitioning (classifying) activities (e.g., Montana, west of the Mississippi)

What do Earth Scientists do?

We collect data about particular:

continuant geological objects, e.g., the San Andreas Fault

occurrent processes, such as the Mount St. Helens volcanic eruption (e.g., time interval, type, and nature of eruption)

Data constitute the raw values collected during an activity such as field work, experiment, simulation, or calculation

For example, the age of a rock, the salinity of sea water, and the depth to the water table are data.

Information

We commonly need a series of data about something to make sense of their meaning, i.e., extract information

Data may become meaningful and useful to the scientists, i.e., become information, when they are put together, for example, in a plot, map, or pattern

Information is a collection of data that may mean something to the person examining it, provided he/she has the background knowledge about the subject from which the data were extracted (i.e., is a domain or knowledge expert)

Information is, therefore, the meaning of the data based on background knowledge (e.g., map)

Map is information

A map that presents the orientation (e.g., strike and dip or trend), spatial data (location, distribution), and temporal data (age) of many thrust faults and related folds (axial trace, limb attitude) is information

As such, a geological map is more meaningful to a structural geologist than it may be to, say, a geographer who may not have the required ‘knowledge’ about these geologic structures

Information makes sense with knowledge

Information may be an emergent property of data after they are processed in a context

For example, a population of faults oriented parallel to each other may represent a set which, based on the domain knowledge, may be assumed by the domain experts (i.e., structural geologists) to have formed together during a single tectonic event

The same information may be interpreted differently by applying different knowledge based on different truths, beliefs, perspectives, judgments, and know-how!

Truth depends on knowledge

That’s how science expands: by interpreting the same data differently, until the ‘truth’ is found and verified with the existing knowledge

The ‘truth’, which is what is believed to be true at a given time, may change with new knowledge and discoveries

Knowledge in current scientific books generally presents the latest sets of true statements

Although the data presented on the current Web may be created and formed into meaningful information by both humans and machines, they may only be understood by humans

Difference between knowledge and information

Knowledge is a collection or total sum of true beliefs (statements) about real objects in a field (domain or universe of discourse), which can be used to make a decision

The true beliefs are mainly about universals (i.e., types of things such as fault, mineral), but also include facts about particulars or individuals (i.e., instances of the general types, such as the San Andreas Fault, a sample of quartz)

True statements

General, universal truths

Notice that here we are talking about general true statements (facts) which have been discovered by geoscientists throughout the history of geoscience through the scientific method

Although scientists learn about the general (i.e., universal) types of objects and features which they study, they study particular objects in their research

A hydrogeology book presents the hydrogeology knowledge by dealing more with general facts about universal types of aquifers (confined, unconfined, and leaky) and to some extent with particular aquifers (e.g., Floridan Aquifer)

Knowledge = set of known true statements

Knowledge is a set of true statements (i.e., knowledge fragments)

Examples of knowledge fragments:

‘rock is made of one or more minerals’

‘thrusting moves older rocks on top of younger ones’

‘mylonite forms in a ductile shear zone’

‘pressure of an unconfined aquifer is atmospheric’

The goal of a knowledge-based system is to translate these knowledge fragments into machine-understandable and processable code:

Rock hasPart Mineral

Mineral partOf Rock
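As a minimal sketch of what such machine-processable code might look like, the knowledge fragments can be written as subject-predicate-object triples. The Python below uses plain tuples rather than a real RDF library. Only hasPart and partOf come from the slide; the predicate name formsIn, the class names, and the table of inverse predicates are assumptions made for illustration.

```python
# Each knowledge fragment is a (subject, predicate, object) triple.
# Only 'hasPart' and 'partOf' come from the slide; 'formsIn' is an assumed name.
triples = {
    ("Rock", "hasPart", "Mineral"),
    ("Mylonite", "formsIn", "DuctileShearZone"),
}

# Assumed table of inverse predicates, so one assertion yields the other.
INVERSE = {"hasPart": "partOf", "partOf": "hasPart"}


def with_inverses(facts):
    """Return the facts plus the inverse statement of every invertible fact."""
    derived = set(facts)
    for subject, predicate, obj in facts:
        if predicate in INVERSE:
            derived.add((obj, INVERSE[predicate], subject))
    return derived


def objects_of(facts, subject, predicate):
    """Return all objects related to `subject` by `predicate`, sorted."""
    return sorted(o for s, p, o in facts if s == subject and p == predicate)


knowledge = with_inverses(triples)
print(objects_of(knowledge, "Mineral", "partOf"))
```

Asserting only 'Rock hasPart Mineral' is enough here: the statement 'Mineral partOf Rock' is derived from the inverse table rather than typed in by hand.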

Example

Suppose a good number of temperature measurements in the past winter month ranged between -10°C and -20°C, with an average of -11°C and a cooling trend

The -10°C to -20°C temperatures are data; the cooling trend and the comparison of the average temperature over many years are information

The statement: ‘average temperature drops in winter’ is a piece of general knowledge

Given that even colder temperatures are coming (information), we may make a decision not to go out with a T-shirt unless we want to freeze (we have the knowledge that we may freeze at extremely low temperatures) or to show off our ‘Love Earth’s Diploes’ T-shirt
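The chain from data to information to a decision in this example can be sketched in a few lines of Python. The temperature values below are invented for illustration and are not the measurements mentioned on the slide; the threshold rule standing in for 'knowledge' is likewise an assumption.

```python
from statistics import mean

# Invented daily temperatures in °C (data); they are not the slide's measurements.
temperatures = [-10.0, -11.0, -12.0, -14.0, -15.0, -17.0, -18.0, -20.0]


def trend_per_day(values):
    """Least-squares slope of the values against their day index."""
    days = range(len(values))
    day_mean = mean(days)
    value_mean = mean(values)
    numerator = sum((d - day_mean) * (v - value_mean) for d, v in zip(days, values))
    denominator = sum((d - day_mean) ** 2 for d in days)
    return numerator / denominator


# Information: summaries that give the raw values a meaning.
average = mean(temperatures)
slope = trend_per_day(temperatures)
is_cooling = slope < 0

# Knowledge: a general statement, encoded here as an assumed threshold rule.
FREEZING_RISK_BELOW_C = 0.0


def wear_tshirt(average_c, cooling):
    """Decision: a T-shirt is fine only if it is not cold and not getting colder."""
    return average_c >= FREEZING_RISK_BELOW_C and not cooling


print(average, is_cooling, wear_tshirt(average, is_cooling))
```

The list holds the data, the average and the slope are information derived from it, and the rule in wear_tshirt plays the role of background knowledge used to reach a decision.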

Geoscience example

We are planning to build a large structure (a nuclear reactor or dam) in an area. We are not sure if a fault runs through the area

The epicenters of microseismic events over the past two decades show a linear spatial distribution, which coincides with a straight drainage cut in Quaternary alluvium

In this case the epicenters are data; the linear spatial distribution of the epicenters is information

We know that Quaternary faults may cut through recent alluvium, and that seismic faults are active, i.e., they can slip at any time

This knowledge, and the knowledge that ‘building a nuclear reactor on a fault may be dangerous’, lead us to make a decision not to build the reactor or the dam in this location. The reasoning to make these decisions is based on background knowledge which resides in geoscientists’ heads

Rules of Inference

Computers can also make use of information through inference rules if we explicitly formalize our knowledge with a specific rule-based machine language and logic

Automatic processing of information and performing inference on it require specific languages (e.g., RDF, RDFS, and OWL) with built-in inference rules

We need ways to represent the semantics of our knowledge fragments by identifying real domain objects, and modeling the relationships among these objects and processes that involve them

This knowledge-based model of reality (ontology), with embedded metadata and inference rules, can be used for reasoning (i.e., drawing implicit entailments from the explicitly asserted facts), e.g.:

NormalFault isA Fault

Fault isA PlanarStructure

Entailment: NormalFault isA PlanarStructure

PlanarStructure

Fault

NormalFault
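The entailment in this example can be sketched as a small transitive-closure computation. This is a plain-Python illustration, not an RDFS or OWL reasoner; it applies a single rule, the transitivity of isA, which is similar in spirit to how rdfs:subClassOf behaves in RDFS. The function name entail_is_a is hypothetical.

```python
# Explicitly asserted isA statements (child, parent), taken from the example.
asserted = {
    ("NormalFault", "Fault"),
    ("Fault", "PlanarStructure"),
}


def entail_is_a(statements):
    """Return the asserted statements plus everything implied by transitivity."""
    closure = set(statements)
    changed = True
    while changed:
        changed = False
        for child, parent in list(closure):
            for other_child, grandparent in list(closure):
                # If A isA B and B isA C, then A isA C.
                if parent == other_child and (child, grandparent) not in closure:
                    closure.add((child, grandparent))
                    changed = True
    return closure


# Keep only the statements that were not asserted explicitly.
inferred = entail_is_a(asserted) - asserted
print(inferred)
```

The only statement left after removing the asserted facts is the one the slide lists as the entailment, NormalFault isA PlanarStructure.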

Realist View of the World

Each group of scientists studies a part of the world (domain) by abstracting and simplifying it based on the community’s interest

These so-called domain or knowledge experts (e.g., paleontologist, petroleum geologist):

look at the reality from specific perspectives, and

understand the relationships among the domain object and process entities differently

Different perspectives

An oil or gas ‘reservoir’ for a petroleum geologist is a ‘formation’ for a stratigrapher, a ‘rock type’ for a sedimentologist, and may be an ‘anticline’ to a structural geologist

It is clear that these related domains have a lot in common, and integrated information, collected about the same objects (e.g., a reservoir), viewed from different perspectives, and applying variable knowledge (sedimentology, structural geology), can improve existing geological knowledge, and lead to knowledge discovery and better decision making. This requires integration!

Scientists work autonomously

Individual scientists or groups of scientists in the same domain (e.g., isotope geology, planetary geology) often work independently of each other, applying autonomous data acquisition and processing methodologies, despite sharing the same general knowledge about real domain objects

These scientists may store their data in either worksheets (e.g., MS Excel) or relational databases with ad hoc design or schema

The names in their database tables are as variable as the number of their databases and worksheets

This is a mess, isn’t it?

Each geologist wants to say something about a geological feature or process that he/she studies

These geologists do this by publishing their peer-reviewed work in scientific journals

To see their work, one has to study the article in paper or digital format

Their database may be a node on the Web, and available for human consumption, but commonly cannot be processed by different computers distributed over the Web

Is this frustrating?

The AAA slogan and the OWA

The good thing is that any geoscientist can study and present his/her findings about any geological problem as long as the statements go through the scientific peer review process

This happens to nicely fit the Semantic Web’s AAA slogan: Anyone can say Anything about Any topic

The good news is that scientific endeavors are based on the open world assumption (OWA): scientists may find new information at any time, and what they presently know is but a part of a never-ending universe of knowledge which they will accumulate

So, be cool!
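One way to picture the open world assumption is to compare how a query treats a statement that nobody has asserted yet. The sketch below is plain Python, not a Semantic Web query engine; the fact set, the class name SeismicallyActiveFault, and both function names are hypothetical.

```python
# Facts asserted so far; under the OWA this set is never assumed to be complete.
known_facts = {
    ("SanAndreasFault", "isA", "Fault"),
}


def ask_closed_world(facts, statement):
    """Closed world: anything not asserted is treated as false."""
    return statement in facts


def ask_open_world(facts, statement):
    """Open world: anything not asserted is unknown (None), not false."""
    return True if statement in facts else None


# Hypothetical statement that has not been asserted in known_facts.
question = ("SanAndreasFault", "isA", "SeismicallyActiveFault")

print(ask_closed_world(known_facts, question))
print(ask_open_world(known_facts, question))
```

Under the closed-world reading the missing statement counts as false, while under the open-world reading it stays unknown until someone asserts it, which matches the way new scientific findings add to what is known without contradicting silence.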

Semantic Web based on AAA, OWA, and NUA

Just like the AAA slogan, the open world assumption is exactly what the current Web and the Semantic Web are based on

Another parallel between the Semantic Web and scientific research is the Non-unique Naming Assumption (NUA): different scientists may refer to the same object or process by different names (synonymy), and draw different meanings from the same process or object (polysemy)

Although this seems to be a problem, it reflects the reality, and is hard to change. Let’s just live with it!

The good news is that the Semantic Web is also based on the ‘Non-unique Naming Assumption (NUA)’, and can elegantly handle the disparate naming and meaning in scientific research

No problem?
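The synonymy half of this assumption can be sketched as grouping names that are declared to denote the same thing, in the spirit of OWL's owl:sameAs. The identifiers below are hypothetical, and the small union-find helper is an illustrative choice rather than how a Semantic Web reasoner is implemented. Polysemy is not handled in this sketch.

```python
# Assumed 'same as' statements linking names used by different scientists.
# All identifiers are hypothetical and exist only for this illustration.
same_as = [
    ("petroleum:Reservoir_A", "stratigraphy:Formation_X"),
    ("stratigraphy:Formation_X", "structure:Anticline_7"),
]


def canonical_names(pairs):
    """Map every name to one representative of its group of synonyms."""
    representative = {}

    def find(name):
        # Follow the chain of representatives until reaching the group root.
        representative.setdefault(name, name)
        while representative[name] != name:
            name = representative[name]
        return name

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Keep the alphabetically smaller name as the representative.
            low, high = sorted((root_left, root_right))
            representative[high] = low
    return {name: find(name) for name in representative}


names = canonical_names(same_as)
print(names["structure:Anticline_7"] == names["petroleum:Reservoir_A"])
```

Once the names share a representative, data recorded under any of them can be treated as statements about the same object.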

Scientific data are either stored in diverse databases, randomly scattered in unstructured publication tables, or kept in ad hoc Excel spreadsheets

The problem with these data stores is the lack of a capability to efficiently link their content (i.e., integration)

Moreover, the quality of the data, when they were collected, entered, or processed, is not controlled sufficiently (lack of data integrity) to turn the often voluminous data into information that can lead to knowledge discovery and useful decision making

Software cannot interoperate and process these disparate data

The question in Earth Sciences is how we can automatically (i.e., with autonomous, distributed computers) use a large body of data, collected about components and processes of the Earth system, turn it into information, and then discover new knowledge and improve our current understanding of the Earth

Is there hope?
