
CRF-RDTE-TR-20101102-02

11/2/2009

Public Distribution | Michael Corsello

CORSELLO RESEARCH FOUNDATION

TIME SERIES DATA BACKGROUND CONCEPTS

Corsello Research Foundation

Public Distribution CRF-RDTE-TR-20101102-02

Abstract

Time Series data is an informational construct to deal with sequential data taken in time. Based upon a priori knowledge of temporal sequences, data may be managed in ways that enhance storage or performance efficiencies. Time Series analysis is the set of analytic techniques that operate upon time series data.

Table of Contents

Abstract
Introduction
    Temporal Concept
Time Series
    Time Domain
    Time Interval
    Measure Interval
    Relationship to Temporal Data
Collection
Use
    Methodologies of Use
    Enumeration
Storage
    Mechanisms of Storage
        Field Storage
    Time Series Field Storage Concepts
        Single Field Time Series
        Multiple Field Time Series
    RDBMS Storage Patterns
        Flat Temporal
        Flat Time Series
        Entity Time Series
        Dynamic Time Series
Conclusions
Appendices
    Acronym List
    Works Cited

Introduction

Sensors and other continual monitoring data collection efforts are used in scientific computing and many

related fields. These forms of data collection have a common underlying premise of a fixed set of data

fields collected at regular time intervals over a longer-term time period. This is the essence of a time

series.

How time series data is collected and used has a direct influence on the storage methodology that

should be used. In general, time series data is a form of temporal data that is managed as a set. It is the

uniformity of the collection that enables and favors specific treatment for management.

Temporal Concept

Time is an intrinsic concept familiar to us all. It marks the “when” of all events and, conversely, all events may be marked by “when” they occur. All measurements are collected in time: a temperature (say 15 °C) is a value measured at a point in time (and space). Most intuitively, measurements are “taken” at

the time “now” for when the measurement occurs. At any point in time after the measurement was

recorded, it can be referred to by the time of the measurement. This basic concept implies that all data

is temporal in nature.

The term “temporal” in this construct means “with respect to time” or “of or pertaining to time” where

our key term is “data”. Thus, temporal data is any data that has value measured with respect to or

bound by time. More specifically, temporal data is about bounding the validity or relevance of a specific

data point in time. Again, for our temperature data measure, that measure is only valid with respect to

what was measured at a fixed point in time. If a river is measured to be 15 °C, that measurement is only

valid at the point in time the measurement was taken.

For any data measurement, the value measured (15 °C) is non-temporal: 15 °C is only a value, and it is the “thing” measured (a river) that makes it temporal. For certain very specific applications, a measurement

may not be capable of variance over time and is therefore temporally static. This does not imply that

the data is not temporal, only that the temporal validity of the measurement is equivalent to the

temporal life of the item measured. For example, a standard bath tub is 60” in length. That

measurement is temporal for any bath tub; once manufactured, it may be measured and will be 60” in

length. Since the length of the tub does not change (at a relevant scale – which may be important), we

can say the tub is “always 60” long”. In this case all subsequent measures of length will be equivalent.

This indicates a deeper meaning – the item measured doesn’t change over time – however, each

measurement is still temporal, indicating when it was measured. These are two distinct concepts:

temporality of the measurement and temporality of the item measured.

Time Series

A time series is defined as a fixed structure of data collected repeatedly over time at fixed intervals. This

definition is very broad and as such allows for variability in several areas.

Figure 1. Time Series Example, Single Variable

Time Domain

A single time series data set will have a time domain marking the start and end of the time series. For

continual monitoring scenarios, the end may be thought of as both “now” and “the end of time”. Since

the data is a time series, “now” represents the current last record which is often represented by the

current actual time. However, “now” may represent an arbitrary point in time for a simulation

generated time series. In the simulation time series, “now” represents the latest point in time that the

simulation is currently processing. In the “end of time” case, the time series is expected to go on forever

and therefore has no known end, but the “now” time represents the last entry currently available in the

time series.

Figure 2. Bounded and Unbounded Time Series

Time Interval

For any time series, there is a fixed interval between value points. For example, “every five minutes” is

an interval for a time series of data points collected at five minute intervals. It is this exact concept that

permits a time series to only store the data collected and not the time value it is a measurement for. In

this manner, a time series only stores two actual times, a start and end date/time. Additionally, the

time series stores a single interval value that is the return period or sampling interval separating discrete

readings (five minutes in our example).
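The idea that only a start, an end and an interval need to be stored can be sketched in a few lines of Python. This is a minimal illustration, not the report's implementation; the variable names and sample readings are assumptions made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical series header: only these three time values are stored.
start = datetime(2009, 11, 2, 5, 0)
end = datetime(2009, 11, 2, 5, 20)
interval = timedelta(minutes=5)

# The readings themselves carry no timestamps (made-up sample values).
values = [15.0, 15.2, 15.1, 14.9, 15.3]

def time_of(index):
    """Rebuild the timestamp of the reading at a given position."""
    return start + index * interval

# The number of readings follows from the stored start, end and interval.
expected_count = (end - start) // interval + 1

for i, v in enumerate(values):
    print(time_of(i).isoformat(), v)
```

Each timestamp is derived on demand, so the storage cost per reading is the value alone.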

Figure 3. Time Interval Example, with Hole

Measure Interval

An important related concept for time series data (and for any sensor collection) is the actual measurement interval. If a measurement is taken every five minutes, what is the collection method for the measure? For example, if a temperature measurement is recorded every five minutes on the “0” and “5” (e.g. 5:00, 5:05), is the measure an instantaneous temperature, an average temperature over the previous interval, or an average over a split interval (5:00 recorded, sampled from 4:57:30 to 5:02:30)? This information is not part of the time series itself, but is instead metadata about the series. An important point here is that, for continual monitoring time series, sensors replaced over time may measure using different approaches. Where measure intervals differ, the time series should be split for consistency.

Figure 4. Measure Interval Example

Relationship to Temporal Data

Time series data is a special case of temporal data. A time series is temporal in that each measurement

within the time series may be treated as a single temporal measurement (what measured when). The

fixed interval of measures makes the treatment of the data special, whereas the data itself is not special

in any way.

Figure 5. Time Series vs Temporal Representation

A single time series may have thousands or millions of individual measurements, each spaced at fixed

intervals (note that “fixed” may be overstating, in that some intervals may be “monthly” which is not

explicitly “fixed”). If a time series were to have only a single measurement (the degenerate case), it

would be exactly a temporal measure. Likewise, any collection of temporal measures that have the

property of being evenly spaced in time may be treated as a time series. Further, it is possible to

construct a time series from non-evenly spaced data via an interpolation process. It is common to

abstract detailed measures (such as hourly temperatures at uneven intervals – sparse data) into more

abstract time series such as daily, weekly or monthly means.
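Constructing a time series from unevenly spaced measures can be sketched as follows. The report does not say which interpolation method is intended; linear interpolation is assumed here, and the function name and sample data are hypothetical. Times are plain numbers (for instance minutes since a reference point).

```python
def resample_linear(samples, start, interval, count):
    """Linearly interpolate (time, value) samples onto a fixed-interval grid.

    samples must be sorted by time and must span the requested grid.
    """
    result = []
    j = 0
    for i in range(count):
        t = start + i * interval
        # Advance to the pair of samples that brackets t.
        while j + 1 < len(samples) - 1 and samples[j + 1][0] < t:
            j += 1
        (t0, v0), (t1, v1) = samples[j], samples[j + 1]
        result.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return result

# Made-up readings taken at uneven times 0, 7, 12 and 20.
irregular = [(0, 10.0), (7, 17.0), (12, 12.0), (20, 20.0)]

# Evenly spaced series at times 0, 5, 10, 15 and 20.
regular = resample_linear(irregular, start=0, interval=5, count=5)
print(regular)
```

Aggregating to daily or monthly means follows the same pattern, with an average over each interval in place of the interpolation step.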

Collection

Time series data may be collected in any of a number of ways. For example, a simulation or other

application may generate a time series directly. This relates to a single run of an application generating

a full time series at once. An application may also append to a time series each time it runs. In the

latter case, it is critical the application is consistent in each run to maintain the integrity of time series

temporal offsets. Further, it is often desirable to know which run produced which part of the time

series. In this case, each run must produce a separate time series that is simply linearly synchronous

with the previous time series.

In the collection of time series data from sensors or manual entry, each subsequent “round” of

collection is conceptually separate from the previous “round” of collection. In the case of a field

deployed sensor (non-telemetry), each time the sensor is changed out or data is downloaded there is a

new time series created for that batch of data. This is critical in that each deployment of a sensor may

overlap slightly, may have short gaps, or may be skewed slightly (every five minutes, but on the “1s” and

“6s”).

Figure 6. Multiple Data, Single Sensor

This collection concept of multiple time series that align with collection efforts establishes a need for a

“virtual time series” that spans multiple individual time series. This virtual time series is the defined

“global” time series for a collection definition (fields, interval and domain) that is composed of individual

“physical” time series that each contains actual data records for a collection effort.
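One way to picture a virtual time series is as a single grid that the physical series are laid onto, with gaps left visible. The sketch below is an assumption-laden illustration: the dictionary layout and the function name are invented for the example, times are minutes on a shared clock, and every physical series is assumed to be already aligned to the common interval (a skewed deployment would first need to be resampled).

```python
import math

INTERVAL = 5  # minutes; assumed to be shared by all physical series

# Hypothetical physical series, one per sensor deployment.
physical = [
    {"start": 0, "values": [15.0, 15.2, 15.1]},   # covers 0, 5, 10
    {"start": 20, "values": [14.8, 14.9]},        # covers 20, 25 (gap at 15)
]

def build_virtual(series_list, interval):
    """Merge physical series into one virtual series on a common grid."""
    first = min(s["start"] for s in series_list)
    last = max(s["start"] + (len(s["values"]) - 1) * interval
               for s in series_list)
    count = (last - first) // interval + 1
    virtual = [math.nan] * count          # NaN marks intervals with no data
    for s in series_list:
        offset = (s["start"] - first) // interval
        for i, v in enumerate(s["values"]):
            virtual[offset + i] = v       # later series win where they overlap
    return first, virtual

virtual_start, virtual_values = build_virtual(physical, INTERVAL)
print(virtual_start, virtual_values)
```

Keeping the physical series separate and building the virtual view on request preserves the record of which collection effort produced which readings.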

Figure 7. Virtual Time Series Example

Use

Once time series data is collected and stored, the long-term purpose of the time series can be realized.

The ultimate purpose of time series data is no different than that of any data – use. People and

applications will use and re-use time series data for varying purposes and in various ways. How time

series data is used will influence the approach used for storage to ensure adequate performance and

storage volumes are available to handle the demand.

In general terms, it is the nature of how time series data is used that most influences its special

treatment. In many cases a time series is used as a whole (the entire series) rather than as individual

measures. Without such a directed form of use, the notion of a time series would be irrelevant as a

separate entity from the more general temporal data. Further, it is the cost of storage and transmission

which can greatly affect the performance of applications using time series data that suggests the special

treatment of time series to reduce size and increase access performance.

Methodologies of Use

When using time series data, the data may be used in any of several ways. It is possible that only part of

the time series is needed, or that the entire time series is used. Further, the data that is used may be

enumerated in different ways.

Random Extraction

The most basic form of use for a time series is that of random extractions. In a random extraction, a

user needs data from a time series based upon a set of criteria known only by the user at extraction

time (not planned or expected at data collection time). This is one of the most common scenarios for

any data use and has large implications in storage format. For random extraction, a user may request

“all records where temperature is over 32”. This form of access results in a search over the time series

to extract the individual elements matching the criteria provided.
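A random extraction over a sequentially stored series might look like the sketch below. The variable names and readings are made up; the point is only that a value-based criterion forces a visit to every element, with the times rebuilt from each position.

```python
# Sequential series: a start time, an interval, and the values alone.
start_minute = 0
interval_minutes = 5
temperatures = [30.5, 31.9, 32.4, 33.0, 31.2, 32.1]

# "All records where temperature is over 32" requires a full scan.
matches = [
    (start_minute + i * interval_minutes, t)
    for i, t in enumerate(temperatures)
    if t > 32
]
print(matches)
```

The result is no longer evenly spaced, so it is a set of temporal records rather than a time series.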

Temporal Extraction

The easiest form of extraction from a time series is temporal extraction. In temporal extraction, the

user wants a portion of the time series between two dates. This results in a new time series being

returned that is bounded by the most constrained limits between the user defined limits and the time

series internal limits (such as requesting an extraction starting prior to the start of the time series itself).
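The clamping of user limits against the internal limits of the series can be sketched as follows. The function name `extract` and the sample readings are hypothetical, and the series is assumed to be stored as a start time, an interval and a list of values.

```python
from datetime import datetime, timedelta

def extract(start, interval, values, req_start, req_end):
    """Return (new_start, new_values) for the requested window.

    The window is clamped to the limits of the series itself.
    """
    if not values:
        return None, []
    series_end = start + (len(values) - 1) * interval
    lo = max(start, req_start)      # tighter of the two lower limits
    hi = min(series_end, req_end)   # tighter of the two upper limits
    if lo > hi:
        return None, []             # no overlap at all
    first = -((start - lo) // interval)   # round up to the next reading
    last = (hi - start) // interval       # round down to the previous reading
    if first > last:
        return None, []             # window falls between two readings
    return start + first * interval, values[first:last + 1]

series_start = datetime(2009, 11, 2, 5, 0)
five_minutes = timedelta(minutes=5)
readings = [15.0, 15.2, 15.1, 14.9, 15.3, 15.4]   # 5:00 through 5:25

# The request starts before the series does, so the result starts at 5:00.
new_start, part = extract(series_start, five_minutes, readings,
                          datetime(2009, 11, 2, 4, 50),
                          datetime(2009, 11, 2, 5, 12))
print(new_start, part)
```

Because the result is a contiguous slice, it is itself a time series with the same interval and a new start.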

Complete Delivery

The best case use scenario for a time series is complete delivery. Notice this is not an extraction, in that

the entire data set is delivered as a whole. No processing is required beyond integrating “physical” time

series into the “virtual” record.

Enumeration

Once time series data is delivered, a user will generally “walk through” the data in some manner toward a

goal. For example, to compute the average of a time series a full forward-scrolling read is performed to

sum all values in the time series. This is a complete linear access from start to finish.

Linear or Sequential

Linear or sequential access is the direct reading of the time series in time order of the data. Linear

access has no special requirements and is one common access scenario.

Partial

For any type of access, including linear, it is possible that only a portion of the data is to be reviewed. In

this case, the access will only need to visit a portion of the data points within the time series.

Random

In a random access methodology, the user may need to access any point within the time series at any

time. In this way, the user must be able to “move” within the time series at will. Random access is the

most complex form of access for any data structure, and is commonly required. One common example

of random access is sorting. If a user wanted to sort a time series by temperature rather than by time,

they would use both linear access to enumerate and random access to read specific items.

More significantly, random access allows for access by data field, such as temperature (e.g. get record

for temperature = 26). This form of random access is closely related to random extraction and has

similar impacts for performance.

Index or Ordinal

Index or ordinal access to a time series is access by time offset or by “offset” into the time series by

position (e.g. the 26th data point in the series). Index access is closely related to random access (and is in

fact a mechanism for random access) without the performance issues of other forms of random access.

In general, index access is the only form of random access with low performance costs.
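The low cost of index access comes from the fixed interval: a timestamp converts to a position by arithmetic alone. The sketch below assumes a sequentially stored series; the names and the generated values are illustrative only, and positions are zero-based.

```python
from datetime import datetime, timedelta

start = datetime(2009, 11, 2, 0, 0)
interval = timedelta(minutes=5)
values = [float(i) for i in range(288)]   # one day of five-minute readings

def value_at(when):
    """Look up the reading for a timestamp without searching."""
    index, remainder = divmod(when - start, interval)
    if remainder or not 0 <= index < len(values):
        raise KeyError(when)   # off the grid or outside the series
    return values[index]

# 02:10 is 130 minutes after the start, i.e. zero-based position 26.
print(value_at(datetime(2009, 11, 2, 2, 10)))
```

No scan or search structure is involved, which is why this form of access stays cheap regardless of series length.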

Storage

Time series data may be stored in many different ways. There are many well-defined storage formats

for dealing with the storage and transport of time series data such as CDF (Common Data Format) and

NetCDF (Network Common Data Format). Further, there are many databases and applications that have

support for time series data such as Aquarius, Historis, Temporal Analyst, GrADS and Timescape XDB. Each

of these tools and formats has pros and cons to use and any of these may be adequate for a specific use.

In general, there is a common thread across all time series formats: a time series is a set of data

delimited in time by a fixed interval with a fixed start date (our general definition). In specific

implementations, there may be constraints on the data stored in a single time series (the fields) or on

the maximum size of the time series when stored (Aquarius for example has the limit of the underlying

database).

When planning time series storage, considerations must be made for the collection and use of the data

to be stored to ensure adequate capacity and performance. Further, each type of data to be stored in a

time series (the field set) will require a dedicated time series store. For example, a water quality time

series cannot store sediment data (there are different fields). However, a water/sediment time series

may be created that stores both together as a single entity. In this case however, there is no separation

between the data which will impact flexibility of collection, storage and use.

Mechanisms of Storage

As with any data, a time series may be stored in a relational database management system (RDBMS), in

flat files, as XML or in any other manner. The selection of storage location (e.g. flat file or RDBMS) will

influence how the data within that location is structured. For example, in an RDBMS, each time series

could be stored as a dedicated table, a set of rows in a shared table, or a single row in a shared table.

Field Storage

An important aspect of the time series is the fields within the series. If a time series stores only a single

parameter (such as temperature), the time series storage is relatively trivial. If the time series stores a

complex data structure (such as several fields, some of which may have sub-fields), the storage of the

time series will be equally complex.

Storage Basics

For any storage on a computer, data must be reduced into bytes that are written to and read from disk.

Even in an RDBMS, the same is true. In any programming language or RDBMS, there is a set of specific

data types that are well known and can be directly converted between bytes and the data type (such as

a 32-bit integer or text string).

Each language and database understands a different way of converting between bytes and data types

(for example, a 32-bit integer in Java does not represent the same byte pattern as a 32-bit integer in

Visual Basic or SQL Server). The conversion of a data type to bytes is called “serialization” and the

reverse is called “deserialization”. This is an ongoing issue in computer science and affects all computing

applications. As long as there is a single platform performing all operations across the lifecycle, there is

no measurable issue; however, this is rarely the case. The most consistent format across all platforms is text, which helps explain why XML has been so successful: everything in XML is represented as text.
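Byte order is one common reason the same 32-bit integer serializes differently on different platforms. The sketch below uses Python's standard `struct` module to show both conventions next to a text encoding; it illustrates the general problem and is not a statement about how any particular language or database in this report lays out its integers.

```python
import struct

value = 1000

big = struct.pack(">i", value)      # big-endian 32-bit signed integer
little = struct.pack("<i", value)   # little-endian 32-bit signed integer
text = str(value).encode("ascii")   # text representation of the same value

print(big.hex())      # 000003e8
print(little.hex())   # e8030000
print(text)           # b'1000'

# Deserializing with the wrong convention silently yields another number.
wrong = struct.unpack("<i", big)[0]
print(wrong)
```

The text form reads back the same way everywhere, at the cost of size and parsing time.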

The comparison of data (such as during search) requires the processing software to “understand” the

data stored, so any mechanism that compresses or encodes the data stored must be able to uncompress

or decode the data prior to performing a comparison. Due to this fundamental concept, the storage

format used should be aligned with the ultimate patterns of use and limitations of the storage platforms

(for example maximum allowed field lengths in an RDBMS).

Storage Considerations

As with all storage related concepts, it is critical that storage designers consider volume (size), access

speed (read and write) and general performance when designing a storage implementation. If most

access will enumerate a data set for example, the selected storage mechanism should favor that form of

access. However, if random access is still needed, then no optimizations should be used for

enumerations that make random access unusable. This is always a trade-off and must be evaluated on a

case-by-case basis.

Take Away

Storage planning and format are not trivial concerns. For any given storage platform (such as SQL Server

and C#), a set of storage patterns may be implemented for all reasonably expected needs and then used

wherever those needs arise. This patterning allows for near optimal application at minimal costs since

patterns are developed once and reused over the enterprise.

Time Series Field Storage Concepts

Each time series may have multiple fields of data collected. Further, each time series may have different

fields collected than another time series. Given both of these premises, the design of the data fields

within a time series may be of considerable importance. As will be discussed in later sections, time

series data may be stored in any number of ways using various technologies such as RDBMS

applications, XML, binary files and so on. With each of these technologies, the time series and the data

values are related and may be treated differently based upon the specific technology used. This section

will discuss general concepts relating to the structure of a time series and its data value fields.

Single Field Time Series

The most basic form of time series is a single field time series. In this form of time series only a single

value is collected at each time interval. This form of time series may be thought of and treated as a

basic “value stream” of discrete values for the single field at the fixed interval of the time series. The

field and storage design for this type of time series only needs to deal with the most primitive anomaly:

missing data values. Within any time series it must be expected that some individual value points may

be corrupt and therefore are missing from the series.

In any time series that uses IEEE 754 compliant single (32-bit) or double (64-bit) precision floating point

numbers, there is a built-in “not a number” (NaN) value. In this case, no special handling is required for

the time series except to expect that NaN values may be present anywhere within the value stream. If a

single field time series is storing data in another format, such as integer or string values,

accommodations must be made for the absence of value within the value stream.
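The difference between floating point and other value types can be sketched as follows. The sample values are made up, and the use of `None` as the integer placeholder is an assumption; a reserved sentinel value is another common choice.

```python
import math

# Floating-point stream: NaN marks a missing reading.
temps = [15.0, math.nan, 15.4, 15.2]
present = [v for v in temps if not math.isnan(v)]
mean_temp = sum(present) / len(present)

# Integer stream: no NaN exists, so a placeholder must be chosen.
# None is used here; a reserved value such as -9999 is another option.
counts = [3, None, 5, 4]
total = sum(c for c in counts if c is not None)

print(mean_temp, total)
```

Either way, the missing slot stays in the stream so that the positions of later values still map to the correct times.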

For the design of single field time series data, there are two basic approaches:

Time coupled

Sequential

A time coupled single value series will associate each record within the time series as the (T,V) pair of

time (T) and value (V). This set of pairs becomes the time series. A sequential single value series will

provide all records within the time series as a stream of values with only a single time stored indicating

the start of the series and a single interval which indicates the temporal spacing of the values within the

series. In this manner, the time series may be thought of simply as an array of values.
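The two approaches can be compared side by side. In this sketch the dictionary layout, the function names and the readings are hypothetical, and times are expressed in minutes.

```python
# Time coupled: every record carries its own time as a (T, V) pair.
coupled = [(0, 15.0), (5, 15.2), (10, 15.1), (15, 14.9)]

# Sequential: one start, one interval, then the values alone.
sequential = {"start": 0, "interval": 5, "values": [15.0, 15.2, 15.1, 14.9]}

def to_coupled(series):
    """Expand a sequential series into (T, V) pairs."""
    return [(series["start"] + i * series["interval"], v)
            for i, v in enumerate(series["values"])]

def to_sequential(pairs):
    """Collapse (T, V) pairs; valid only for sorted, evenly spaced pairs."""
    times = [t for t, _ in pairs]
    interval = times[1] - times[0]
    if any(b - a != interval for a, b in zip(times, times[1:])):
        raise ValueError("pairs are not evenly spaced")
    return {"start": times[0], "interval": interval,
            "values": [v for _, v in pairs]}

print(to_coupled(sequential) == coupled)
print(to_sequential(coupled) == sequential)
```

The sequential form stores one time and one interval for the whole series, while the coupled form repeats a time for every value.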

Multiple Field Time Series

In multiple field time series, each temporal record within the time series has a set of multiple fields.

Based upon our working definition of a time series, all records have exactly the same set of fields within

a single time series. However, each time series defines its own set of fields, and therefore may result in

arbitrarily many time series field sets within an organization’s corpus of time series data.

The pattern for storing multiple field time series data can take several forms. The most basic form is to

treat each field within the time series as a distinct, single field time series. This approach isolates each

data field as a distinct time series and provides the ability to distribute the storage of each time series to

different storage locations. There is however the overhead of additional storage for the time series

metadata.

Hub and Spoke

A basic expansion of the single field time series pattern for multiple fields is to create a “hub and spoke”

or “star” pattern for the time series (depicted below). The core time series metadata is recorded as a

single entity, with each field modeled as a discrete time series data value stream.

Figure 8. Example of Hub-and-Spoke Model

Beyond representing a multi-field time series as a collection of single-field time series, any multi-field

time series can be represented as a single entity, with storage modeled again in multiple ways.

Field Interleaved

In a field interleaved representation, a time series is stored as a series of value streams. Each value

stream is complete for the time series, containing all values for a single field. This model most closely

resembles the result of the hub and spoke model, where each parameter is isolated as a series.

Figure 9. Field Interleaved Representation Example

In this storage strategy, the total time series has one value stream per field, each of which can be easily enumerated. If values of multiple fields must be accessed together, there is additional overhead for enumerating multiple streams.
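A field interleaved layout can be pictured as one list per field. The field names and values below are invented for illustration.

```python
# Field interleaved: one complete value stream per field.
field_interleaved = {
    "temperature": [15.0, 15.2, 15.1],
    "ph":          [7.1, 7.0, 7.2],
    "depth":       [1.2, 1.3, 1.3],
}

# Enumerating one field touches only that stream.
max_temp = max(field_interleaved["temperature"])

# Reading fields together means walking several streams in step.
records = list(zip(field_interleaved["temperature"],
                   field_interleaved["ph"]))

print(max_temp, records)
```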

Interval Interleaved

In an interval interleaved representation, the fields are stored in order within each temporal interval.

This permits each temporal interval to be the primary unit of separation between each data record.

Within a single temporal record, the fields are consecutive in a pre-defined order.

Figure 10. Interval Interleaved Representation Example

Interval interleaved storage provides rapid enumeration of the time series when all fields are used in the

enumeration. If it is most common to enumerate the time series to access only a single parameter,

there is overhead in the transport and skipping of unused fields to access the required field.
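An interval interleaved layout can be pictured as a single flat stream in which the fields of each interval sit next to one another. The field names, their order and the values are assumptions for this sketch.

```python
FIELDS = ("temperature", "ph", "depth")   # pre-defined field order

# Interval interleaved: all fields for interval 0, then interval 1, ...
flat = [15.0, 7.1, 1.2,
        15.2, 7.0, 1.3,
        15.1, 7.2, 1.3]

stride = len(FIELDS)

def record(i):
    """All fields for interval i are contiguous in the stream."""
    return dict(zip(FIELDS, flat[i * stride:(i + 1) * stride]))

def field(name):
    """Reading one field means stepping over the others."""
    return flat[FIELDS.index(name)::stride]

print(record(1))
print(field("ph"))
```

Reading a whole record is one contiguous slice, while reading a single field still has to move across the entire stream.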

Coupled Interleaved

For any time series where general enumeration involves specific known groups of fields, a hybrid of field

and interval interleaving may be used. A coupled interleaved method allows for groups of fields to be

represented as field interleaved with the remainder of the dataset interval interleaved.

Figure 11. Coupled Interleaved Representation Example

The coupled representation provides fast enumeration for the coupled fields while avoiding the cost of

skipping unused fields. If the coupling of fields is not known at design time, this representation is

difficult to plan for. Use of this pattern has the overhead of both interleaving methods if enumerating

uncoupled fields (e.g. field 1 and field 5 in the above example).

RDBMS Storage Patterns

Within an RDBMS, time series data can be stored in a number of ways. Most simply, time series data

may be stored as temporal records, one value per row. Likewise, time series data can be compacted

into a single field and stored as a binary object (BLOB) or XML.

Flat Temporal

In the flat temporal model of storing time series data, there is no notion of a time series specifically.

Instead, all data is simply stored as temporal records.

Figure 12. Flat Temporal Pattern Example

This flat data storage is the simplest method of storing temporal data overall, and provides good

performance for random access, but suffers from poor insert performance (mainly when indexed) and

slow overall sequential access performance due to the table-scan nature of retrieval.

Flat Time Series

In the flat time series model of storing time series data, each time series is “registered” in a time series

table that defines only the time series reference information (metadata). All the actual data for the time

series is stored in a values table, where each record in the values table stores a single time series record

(point in time).

Figure 13. Flat Time Series Pattern Example

In most cases, each time series will have a different set of fields and therefore be best represented by a

separate values table. In this model, that results in a single “master” time series table and multiple

values tables as shown below.

Figure 14. Multiple Flat Time Series Example

Flat time series data storage provides performance characteristics similar to the flat temporal

model, as data is stored in a similar fashion. However, the time series table allows for retrieval based

upon a specific time series instance. This allows for a long-term time series (such as continual

monitoring) to be identified in a single values table, with separate physical time series for each

instrument deployment (e.g. single hydrolab collection event).

If random access is the most common access pattern, this model will yield the best overall performance

characteristics and allows querying by data values without any special software capabilities.
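A minimal sketch of the flat time series pattern follows, again using sqlite3. The table names (`time_series`, `series_values`) and the two value fields are assumptions chosen for illustration and do not reproduce the schema in Figures 13 and 14.

```python
import sqlite3


def create_schema(conn):
    conn.executescript(
        """
        -- Master table: metadata only, one row per registered time series.
        CREATE TABLE time_series (
            series_id  INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            instrument TEXT
        );
        -- Values table: one row per point in time for a given series.
        CREATE TABLE series_values (
            series_id   INTEGER NOT NULL REFERENCES time_series(series_id),
            observed_at TEXT NOT NULL,
            temperature REAL,
            depth       REAL,
            PRIMARY KEY (series_id, observed_at)
        );
        """
    )


def register_series(conn, name, instrument):
    cur = conn.execute(
        "INSERT INTO time_series (name, instrument) VALUES (?, ?)", (name, instrument)
    )
    return cur.lastrowid


def add_value(conn, series_id, observed_at, temperature, depth):
    conn.execute(
        "INSERT INTO series_values VALUES (?, ?, ?, ?)",
        (series_id, observed_at, temperature, depth),
    )


def read_series(conn, series_id):
    # Retrieval by time series instance, in time order.
    return conn.execute(
        "SELECT observed_at, temperature, depth FROM series_values"
        " WHERE series_id = ? ORDER BY observed_at",
        (series_id,),
    ).fetchall()
```

Because the values remain ordinary columns, they can still be filtered with plain SQL, for example `WHERE temperature > 15`.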

Entity Time Series

If an entire time series is treated as an entity, with the individual data values treated simply as atoms

within the entity, a time series may be stored as a single record in a database table. This general

concept is the basis for entity time series storage.

Entity time series storage has multiple “flavors”, each of which has subtle differences that improve some aspect

of the time series storage size or performance.

Flat BLOB Storage

The first entity time series storage model is blob storage. Each time series entity is stored in a single

database record and all data points within the time series are stored within a single blob field.

Figure 15. Flat BLOB Pattern Example

Notice that a single table is all that is required, regardless of the data fields collected for the values. The

data within the “DataValueBLOB” field is a software-specific binary encoding of the

entire set of data values for the time series.

This form of storage is very simple in that each time series read or write operation acts upon a single

record within a single table. Overall, the table will have a very small number of records due to the

storage model. However, there is an implied limit in the size of the time series based upon the limit of

the underlying RDBMS. In most cases, this will be in the 2-4 GB range for traditional RDBMS applications,

which may be unacceptable for long-term monitoring.

In the case of long-term monitoring, not only is the size a limiting factor, but each time a value is

appended to the time series, the entire record must be re-constructed and re-written. This is a wasteful

process that generally requires interim caching until sufficient data is collected to mitigate the

write/transfer overhead.

An important aspect of BLOB-based storage is the requirement for custom software (external to the

RDBMS) to interpret, encode and decode the data within the BLOB. This may result in additional

considerations in terms of software costs, software version stability and operating environment

constraints.
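The following sketch illustrates flat BLOB storage and the append cost described above. The binary layout (a 64-bit integer timestamp followed by a 64-bit float per point), the table layout, and the function names are assumptions made for this example; any real encoding would be defined by the external software.

```python
import sqlite3
import struct

# Assumed point layout: little-endian int64 timestamp + float64 value (16 bytes).
POINT = struct.Struct("<qd")


def create_schema(conn):
    conn.execute(
        "CREATE TABLE time_series ("
        " series_id INTEGER PRIMARY KEY,"
        " DataValueBLOB BLOB NOT NULL)"
    )


def encode_points(points):
    return b"".join(POINT.pack(t, v) for t, v in points)


def decode_points(blob):
    return [POINT.unpack_from(blob, off) for off in range(0, len(blob), POINT.size)]


def write_series(conn, series_id, points):
    # The whole series is written as a single record.
    conn.execute(
        "INSERT OR REPLACE INTO time_series (series_id, DataValueBLOB) VALUES (?, ?)",
        (series_id, encode_points(points)),
    )


def read_series(conn, series_id):
    row = conn.execute(
        "SELECT DataValueBLOB FROM time_series WHERE series_id = ?", (series_id,)
    ).fetchone()
    return [] if row is None else decode_points(row[0])


def append_point(conn, series_id, point):
    # Appending one value means reading, rebuilding and rewriting the entire record.
    points = read_series(conn, series_id)
    points.append(point)
    write_series(conn, series_id, points)
```

Note that the database sees only opaque bytes; `decode_points` stands in for the custom software the text refers to.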

Enhancing Flat BLOB Storage

Given the nature of flat BLOB storage, it may seem to be a bad choice. However, if the number of data

fields is small and the size of the overall time series is small (e.g. one month instrument deployment of

two parameter data at hourly sample rate), flat BLOB storage will provide excellent performance and

storage efficiency.
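As a rough worked example of the deployment mentioned above, the size of such a BLOB can be estimated as follows. The 8-byte timestamp and 8-byte value widths are assumptions; the original does not specify an encoding.

```python
def blob_size_bytes(days, samples_per_day, parameters,
                    timestamp_bytes=8, value_bytes=8):
    """Estimate the encoded size of a series stored as fixed-width records."""
    samples = days * samples_per_day
    return samples * (timestamp_bytes + parameters * value_bytes)


# One month (taken as 30 days) of hourly, two-parameter data.
estimated = blob_size_bytes(days=30, samples_per_day=24, parameters=2)
print(estimated)  # 720 samples * 24 bytes = 17280 bytes
```

Under these assumptions the whole series is well under 20 KB, far below any typical RDBMS BLOB limit.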

Flat XML Storage

A close analog of the flat BLOB storage is flat XML storage. Flat XML and flat BLOB storage work in the

same way, except instead of using a BLOB field to store binary data, the values are encoded in XML and

stored in an XML or text field within the time series table.

Figure 16. Flat XML Pattern Example

A major benefit of flat XML storage is that XML data may generally be queried by the RDBMS

without the need for external software. Further, XML is text based and therefore resilient to software

platform and version differences. There are still the same size limitations imposed by the RDBMS as for

BLOB storage, but they are more restrictive in XML as XML encoded data is generally larger by some

amount. It is not uncommon for XML data to be twice or more the size of comparable binary encoded

data. Of course, the actual size of the data varies widely depending on how efficient the

encoding is.
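The size difference between XML and binary encodings can be illustrated with a small sketch. The element and attribute names (`timeSeries`, `point`, `time`, `value`) and the 16-byte binary layout are hypothetical choices for this example only.

```python
import struct
import xml.etree.ElementTree as ET


def encode_xml(points):
    # One <point> element per sample; attribute names are illustrative.
    root = ET.Element("timeSeries")
    for t, v in points:
        ET.SubElement(root, "point", time=str(t), value=repr(v))
    return ET.tostring(root, encoding="unicode")


def decode_xml(text):
    root = ET.fromstring(text)
    return [(int(p.get("time")), float(p.get("value"))) for p in root.iter("point")]


def encode_binary(points):
    # Assumed comparison layout: int64 timestamp + float64 value per point.
    return b"".join(struct.pack("<qd", t, v) for t, v in points)


points = [(i * 3600, 14.0 + i / 10) for i in range(24)]
xml_size = len(encode_xml(points).encode("utf-8"))
binary_size = len(encode_binary(points))
print(xml_size, binary_size)
```

The exact ratio depends on tag names, number formatting and the binary layout chosen, which is the variability noted above.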

External File Storage

Another close analog of the flat BLOB storage is external file storage. In external file storage, instead of

any value data being within the RDBMS, all values are encoded in a data file. The time series table then

simply contains a reference to the file. Again, the encoding of the data within the file is managed by

external software and the RDBMS itself has no knowledge of the file.

External file storage has the benefits of BLOB storage without any RDBMS imposed size limitations.

External file storage has a potential for performance benefits as well based upon how the file storage is

implemented.

Figure 17. External File Pattern Example
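External file storage can be sketched as a table that holds only a file reference, with the values written to a file by external code. The column name `data_file_path`, the file naming scheme, and the binary layout are assumptions for illustration.

```python
import os
import sqlite3
import struct
import tempfile

# Assumed point layout: little-endian int64 timestamp + float64 value.
POINT = struct.Struct("<qd")


def create_schema(conn):
    # The RDBMS stores only a reference to the data file.
    conn.execute(
        "CREATE TABLE time_series ("
        " series_id INTEGER PRIMARY KEY,"
        " data_file_path TEXT NOT NULL)"
    )


def write_series(conn, series_id, directory, points):
    path = os.path.join(directory, f"series_{series_id}.bin")
    with open(path, "wb") as f:
        for t, v in points:
            f.write(POINT.pack(t, v))
    conn.execute("INSERT OR REPLACE INTO time_series VALUES (?, ?)", (series_id, path))
    return path


def _path_for(conn, series_id):
    (path,) = conn.execute(
        "SELECT data_file_path FROM time_series WHERE series_id = ?", (series_id,)
    ).fetchone()
    return path


def append_point(conn, series_id, point):
    # With this file layout an append touches only the end of the file.
    with open(_path_for(conn, series_id), "ab") as f:
        f.write(POINT.pack(*point))


def read_series(conn, series_id):
    with open(_path_for(conn, series_id), "rb") as f:
        data = f.read()
    return [POINT.unpack_from(data, off) for off in range(0, len(data), POINT.size)]
```

In this sketch appends avoid rewriting existing data, which is one way a file-based implementation might gain performance; other file formats would behave differently.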

Dynamic Time Series

A further refinement for RDBMS storage of time series data is to dynamically structure the storage

rather than use fixed elements as in the previous methodologies. Again, dynamic time series storage is

a broad class of methodologies that attempt to gain advantages in performance and size for managing

time series data within an RDBMS.

In all dynamic time series storage strategies, the data within the values fields may be encoded as BLOB

or XML data similarly to the basic entity storage mechanisms from the previous section. In dynamic

storage, the time series is simply broken into multiple individual records each of which contains multiple

data values.

Fixed Size Dynamic Storage

In fixed size dynamic storage, each record has a target size limit (e.g. 100 KB, 10 MB, etc.) for the field storing

data values. The data value encoding software is responsible for breaking the time series into “chunks”

of data values that do not exceed this size limit. The goal is to encode as many discrete values as possible,

in time order, without exceeding this size limit. In this manner, each record will contain all values

between a minimum and maximum time.

Figure 18. Fixed Size Dynamic Pattern Example

There are two basic sub-strategies for fixed size time series storage:

Time Window

Entity Window

In the time window strategy (as depicted above), the time series values table maintains a start date and

end date for each time series record that indicates the bounds stored within that record.
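A minimal sketch of the time window strategy follows: points are packed in time order into chunks that stay under a byte limit, and each chunk records its start and end time. The dictionary keys, function names, and the 16-byte point layout are hypothetical.

```python
import struct

# Assumed point layout: little-endian int64 timestamp + float64 value (16 bytes).
POINT = struct.Struct("<qd")


def chunk_points(points, max_bytes):
    """Split a series into time-ordered records no larger than max_bytes."""
    per_chunk = max_bytes // POINT.size
    if per_chunk < 1:
        raise ValueError("max_bytes is smaller than a single encoded point")
    ordered = sorted(points)
    records = []
    for i in range(0, len(ordered), per_chunk):
        chunk = ordered[i:i + per_chunk]
        records.append({
            "start_time": chunk[0][0],   # bounds stored alongside the data
            "end_time": chunk[-1][0],
            "data": b"".join(POINT.pack(t, v) for t, v in chunk),
        })
    return records


def find_record(records, requested_time):
    # Random access: locate the record whose window covers the requested time.
    for record in records:
        if record["start_time"] <= requested_time <= record["end_time"]:
            return record
    return None
```

In an RDBMS the window lookup would typically be a range query on the start and end columns rather than a linear scan.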

The entity window strategy (depicted below) is very similar, except that if the time series records are all

of fixed size, it is possible to know a priori the exact maximum number of data values that may be

stored within a single record of the time series. In this strategy, the time series itself indicates the

number of values stored within a record and the “offset” to any value is computed as:

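$$\mathit{recordId} = \frac{\mathit{RequestedTime} - \mathit{StartTime}}{\mathit{interval}}$$

$$\mathit{recordOffset} = \frac{\mathit{recordId}}{\mathit{valuesPerRecord}}; \quad \mathit{elementOffset} = \mathit{recordId} \bmod \mathit{valuesPerRecord}$$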

Once the computation is completed, the integer part of recordOffset indicates the sequenceId (zero-based) of the record containing

the value, and the elementOffset indicates which value within the record is to be returned. The need for

this computation makes random access possible but slightly costly computationally. For enumeration of

data, there is no such overhead cost. Instead, enumeration of the time series requires a query with an

ordering upon the sequenceId column of the values table.

Figure 19. Fixed Size Entity Window Example
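The offset computation for the entity window strategy can be sketched as below. It assumes times and the interval are integers in the same unit and that division is integer division; the function name `locate` is hypothetical.

```python
def locate(requested_time, start_time, interval, values_per_record):
    """Return (recordOffset, elementOffset) for a requested time.

    recordOffset is the zero-based sequenceId of the record holding the value;
    elementOffset is the position of the value within that record.
    """
    if requested_time < start_time:
        raise ValueError("requested_time precedes the start of the series")
    record_id = (requested_time - start_time) // interval
    record_offset, element_offset = divmod(record_id, values_per_record)
    return record_offset, element_offset


# Hourly data (interval of 3600 seconds) with 24 values per record:
# the value 25 hours after the start is element 1 of record 1.
print(locate(25 * 3600, 0, 3600, 24))  # (1, 1)
```

The cost of random access here is a few integer operations plus one record fetch and decode.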

Fixed Entities Dynamic Storage

The fixed entities dynamic storage strategy is similar to the fixed size storage except that each record

will contain an exact number of interval records, regardless of the size required to store that number of

records. Of course it is imperative that the number of records stored will fit within the constraints of the

underlying RDBMS. This form of storage is most similar to the entity window strategy of fixed size storage.

Figure 20. Fixed Entities Pattern Example

The fixed entities storage must be able to access each temporal record within the time series. If the

temporal records are of fixed size, then this strategy is identical to the entity window strategy discussed

above, where offsets are directly computed. For situations where the record size is not fixed (such as

multi-field with empty records for holes) the offset to a specific record cannot be directly computed and

must instead be indexed. In these cases, the index offsets of each element are generally written into a

“header” of the data values or stored in a separate “indices” column in the values table.
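For variable-size records, an index header can be sketched as follows. The layout (a 32-bit record count, then one 32-bit offset per record plus an end marker, then the record bodies) is an assumption made for this example, not a format defined by the source.

```python
import struct


def encode_chunk(records):
    """Encode variable-length records (bytes; empty for holes) with an offset header."""
    offsets = []
    position = 0
    for record in records:
        offsets.append(position)
        position += len(record)
    offsets.append(position)  # end marker so every record has a start and an end
    header = struct.pack("<I", len(records))
    index = struct.pack(f"<{len(offsets)}I", *offsets)
    return header + index + b"".join(records)


def read_record(blob, i):
    """Fetch record i using the header index, without decoding other records."""
    (count,) = struct.unpack_from("<I", blob, 0)
    if not 0 <= i < count:
        raise IndexError("record index out of range")
    start, end = struct.unpack_from("<2I", blob, 4 + 4 * i)
    body = 4 + 4 * (count + 1)  # the header and index precede the record bodies
    return blob[body + start:body + end]
```

Storing the same offsets in a separate “indices” column would work the same way, with the index and the bodies split across two fields.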

Conclusions

Every organization must evaluate its information strategy and time series data needs to ensure that

adequate planning and effective implementations are in place to support an effective lifecycle for all users. There

are many considerations for each type of time series data that comprises the organizational information

corpus. Data modeling and implementation planning is a critical activity for ensuring that the proper

entities are captured in a repeatable, standardized and maintainable manner.

Time series data can be reduced to a simple set of concepts and a small set of general patterns for

implementation. Each actual time series data set within the organization can then use these concepts

and patterns to create an effective and efficient implementation for that time series that can be reused

over the organization's lifetime. However, each time series data type will need to be evaluated

separately and the most effective storage patterns used.

Appendices

Acronym List

Acronym Description

CDF Common Data Format

DID Data Item Description

NARA National Archives and Records Administration

NetCDF Network Common Data Format

RDBMS Relational Database Management System

SQL Structured Query Language

XML Extensible Markup Language

Works Cited

Corsello, M. (2009, Feb 02). Temporal Concepts CRF-RDTE-TR-20090202-01. Retrieved from http://cid-75594e1c43b40d0a.skydrive.live.com/browse.aspx/Public/Skydrive%20Series/Temporal%20Concepts%20CRF-RDTE-TR-20090202-01.pdf

Kamas, G., & Lombardi, M. A. (1990). NIST Time and Frequency Users Manual. Washington DC: NIST US

Department of Commerce.

Levine, J. (1999). Introduction to time and frequency metrology. Review of Scientific Instruments, 2567-2596.

Lombardi, M. A. (2001). Fundamentals of Time and Frequency. In R. H. Bishop, The Mechatronics

Handbook. Boca Raton: CRC Press LLC.

Lombardi, M. (2002). NIST Special Publication 432. Washington DC: NIST US Department of Commerce.

Lombardi, M. (1999). Traceability in Time and Frequency Metrology. Washington DC: National Institute

of Standards and Technology Time and Frequency Division.

Sullivan, D. B., & Bergquist, J. C. (2001). Primary Atomic Frequency Standards at NIST. Journal of

Research of the National Institute of Standards and Technology, 47–63.

Wikipedia contributors. (2009, November 13). Knowledge. Retrieved November 13, 2009, from Wikipedia, The Free Encyclopedia: http://en.wikipedia.org/w/index.php?title=Knowledge&oldid=325539292