CRF-RDTE-TR-20101102-02
11/2/2009
Public Distribution| Michael Corsello
CORSELLO RESEARCH FOUNDATION
TIME SERIES DATA BACKGROUND CONCEPTS
Corsello Research Foundation
Public Distribution CRF-RDTE-TR-20101102-02
Abstract
Time Series data is an informational construct to deal with sequential data taken in time. Based upon a
priori knowledge of temporal sequences, data may be managed in ways that enhance storage or
performance efficiencies. Time Series analysis is the set of analytic techniques that operate upon time
series data.
Table of Contents
Abstract
Introduction
    Temporal Concept
Time Series
    Time Domain
    Time Interval
    Measure Interval
    Relationship to Temporal Data
Collection
Use
    Methodologies of Use
    Enumeration
Storage
    Mechanisms of Storage
        Field Storage
    Time Series Field Storage Concepts
        Single Field Time Series
        Multiple Field Time Series
    RDBMS Storage Patterns
        Flat Temporal
        Flat Time Series
        Entity Time Series
        Dynamic Time Series
Conclusions
Appendices
    Acronym List
    Works Cited
Introduction
Sensors and other continual monitoring data collection efforts are used in scientific computing and many
related fields. These forms of data collection share a common underlying premise: a fixed set of data
fields collected at regular time intervals over a longer-term time period. This is the essence of a time
series.
How time series data is collected and used has a direct influence on the storage methodology that
should be used. In general, time series data is a form of temporal data that is managed as a set. It is the
uniformity of the collection that enables and favors specific treatment for management.
Temporal Concept
Time is an intrinsic concept familiar to us all. It marks the “when” of all events and, conversely, all events
may be marked by “when” they occur. All measurements are collected in time: a temperature (say
15 °C) is a value measured at a point in time (and space). Most intuitively, measurements are “taken” at
the time “now” when the measurement occurs. At any point in time after the measurement was
recorded, it can be referred to by the time of the measurement. This basic concept implies that all data
is temporal in nature.
The term “temporal” in this construct means “with respect to time” or “of or pertaining to time”, where
the key term is “data”. Thus, temporal data is any data whose value is measured with respect to, or
bound by, time. More specifically, temporal data is about bounding the validity or relevance of a specific
data point in time. Again, for our temperature measure, that measure is only valid with respect to
what was measured at a fixed point in time. If a river is measured to be 15 °C, that measurement is only
valid at the point in time the measurement was taken.
For any data measurement, the value measured (15 °C) is non-temporal in itself: 15 °C is simply a value,
and it is the “thing” measured (a river) that makes the measurement temporal. For certain very specific
applications, a measurement may not be capable of variance over time and is therefore temporally static.
This does not imply that the data is not temporal, only that the temporal validity of the measurement is
equivalent to the temporal life of the item measured. For example, a standard bath tub is 60 inches in
length. That measurement is still temporal for any bath tub: once manufactured, the tub may be measured
and will be 60 inches in length. Since the length of the tub does not change (at a relevant scale, which
may be important), we can say the tub is “always 60 inches long”. In this case all subsequent measures of
length will be equivalent. This indicates a deeper meaning: the item measured does not change over time,
yet each measurement is still temporal, indicating when it was measured. These are two distinct concepts:
temporality of the measurement and temporality of the item measured.
Time Series
A time series is defined as a fixed structure of data collected repeatedly over time at fixed intervals. This
definition is very broad and as such allows for variability in several areas.
Figure 1. Time Series Example, Single Variable
Time Domain
A single time series data set will have a time domain marking the start and end of the time series. For
continual monitoring scenarios, the end may be thought of as both “now” and “the end of time”. Since
the data is a time series, “now” represents the current last record, which is often represented by the
current actual time. However, “now” may represent an arbitrary point in time for a simulation-generated
time series. In the simulation time series, “now” represents the latest point in time that the
simulation is currently processing. In the “end of time” case, the time series is expected to go on forever
and therefore has no known end, but the “now” time represents the last entry currently available in the
time series.
Figure 2. Bounded and Unbounded Time Series
Time Interval
For any time series, there is a fixed interval between value points. For example, “every five minutes” is
the interval for a time series of data points collected at five-minute intervals. It is this exact concept that
permits a time series to store only the data collected and not the time each value was measured. In
this manner, a time series stores only two actual times, a start and an end date/time. Additionally, the
time series stores a single interval value that is the return period or sampling interval separating discrete
readings (five minutes in our example).
Figure 3. Time Interval Example, with Hole
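The idea that only a start time and an interval need to be stored can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed format: the names `start`, `interval`, `values`, `time_at` and `end_time` are assumptions made for the example, and `None` is used here only as one possible way to mark a hole in the series.

```python
from datetime import datetime, timedelta

# Hypothetical series: a start time, a fixed interval, and the values only.
start = datetime(2009, 11, 2, 5, 0)
interval = timedelta(minutes=5)
values = [15.0, 15.2, None, 15.1]  # None marks a hole (a missed reading)


def time_at(index):
    """Reconstruct the timestamp of the reading at a given position."""
    return start + index * interval


def end_time():
    """The end of the domain follows from the start, the interval and the count."""
    return time_at(len(values) - 1)


# Expanding the compact form back into explicit (time, value) pairs.
readings = [(time_at(i), v) for i, v in enumerate(values)]
```

No per-value timestamp is stored; each one is recomputed from the position of the value in the series.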
Measure Interval
An important related concept for time series data (and for any sensor collection) is the actual
measurement interval. If a measurement is taken every five minutes, what is the collection method for
the measure? For example, if a temperature measurement is recorded every five minutes on the “0”
and “5” (e.g. 5:00, 5:05), is the measure an instantaneous temperature, an average temperature
over the preceding interval, or an average over a split window (5:00 recorded, sampled from 4:57:30 to 5:02:30)?
This information is not part of the time series itself, but is instead metadata about the series. An
important point here is that, for continual monitoring time series, sensors replaced over time may
measure using different approaches. In the case of different measure intervals, the time series should
be split for consistency.
Figure 4. Measure Interval Example
Relationship to Temporal Data
Time series data is a special case of temporal data. A time series is temporal in that each measurement
within the time series may be treated as a single temporal measurement (what was measured, and when).
The fixed interval of measures makes the treatment of the data special, whereas the data itself is not
special in any way.
Figure 5. Time Series vs Temporal Representation
A single time series may have thousands or millions of individual measurements, each spaced at fixed
intervals (note that “fixed” may overstate the case, since an interval such as “monthly” is regular but
not constant in length). If a time series were to have only a single measurement (the degenerate case), it
would be exactly a temporal measure. Likewise, any collection of temporal measures that have the
property of being evenly spaced in time may be treated as a time series. Further, it is possible to
construct a time series from non-evenly spaced data via an interpolation process. It is common to
abstract detailed measures (such as temperatures taken at uneven intervals, i.e. sparse data) into more
abstract time series such as daily, weekly or monthly means.
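As an illustration of abstracting sparse measures into a regular series, the following Python sketch aggregates unevenly spaced readings into daily means. The sample data and the function name `daily_means` are hypothetical, and days with no readings are left as `None` rather than interpolated, which is only one of several reasonable choices.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical sparse temperature readings taken at uneven times.
readings = [
    (datetime(2009, 11, 1, 6, 10), 12.0),
    (datetime(2009, 11, 1, 14, 45), 18.0),
    (datetime(2009, 11, 3, 9, 30), 14.0),
]


def daily_means(readings):
    """Return (first_day, values): one mean per day, None for days without data."""
    buckets = defaultdict(list)
    for when, value in readings:
        buckets[when.date()].append(value)
    first, last = min(buckets), max(buckets)
    values = []
    for offset in range((last - first).days + 1):
        samples = buckets.get(first + timedelta(days=offset))
        values.append(sum(samples) / len(samples) if samples else None)
    return first, values


start, values = daily_means(readings)
```

The result is a daily time series with a start date and one value per day, so the original uneven timestamps no longer need to be kept.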
Collection
Time series data may be collected in any of a number of ways. For example, a simulation or other
application may generate a time series directly. This corresponds to a single run of an application
generating a full time series at once. An application may also append to a time series each time it runs.
In the latter case, it is critical that the application is consistent in each run to maintain the integrity of
the time series temporal offsets. Further, it is often desirable to know which run produced which part of
the time series. In this case, each run must produce a separate time series that is simply linearly
synchronous with the previous time series.
In the collection of time series data from sensors or manual entry, each subsequent “round” of
collection is conceptually separate from the previous “round” of collection. In the case of a
field-deployed sensor (non-telemetry), each time the sensor is changed out or data is downloaded, a
new time series is created for that batch of data. This is critical in that successive deployments of a
sensor may overlap slightly, may have short gaps, or may be skewed slightly (every five minutes, but on
the “1s” and “6s”).
Figure 6. Multiple Data, Single Sensor
This collection concept of multiple time series that align with collection efforts establishes a need for a
“virtual time series” that spans multiple individual time series. This virtual time series is the defined
“global” time series for a collection definition (fields, interval and domain). It is composed of individual
“physical” time series, each of which contains the actual data records for one collection effort.
Figure 7. Virtual Time Series Example
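One possible way to assemble a virtual time series from physical ones is sketched below in Python. The sketch assumes that every physical series starts on the same interval grid (skewed deployments would first need realignment), that gaps are filled with `None`, and that a later deployment wins where two overlap. None of these rules comes from the report; `build_virtual` and the sample data are illustrative only.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=5)

# Hypothetical physical series: (start time, values), one per sensor deployment.
physical = [
    (datetime(2009, 11, 2, 5, 0), [15.0, 15.1, 15.3]),
    (datetime(2009, 11, 2, 5, 25), [15.6, 15.4]),  # two intervals are missing before it
]


def build_virtual(physical, interval):
    """Merge physical series onto one grid; None fills gaps between batches."""
    origin = min(start for start, _ in physical)
    slots = {}
    for start, values in sorted(physical, key=lambda series: series[0]):
        offset = (start - origin) // interval  # whole intervals since the origin
        for i, value in enumerate(values):
            slots[offset + i] = value  # a later batch overwrites an overlap
    length = max(slots) + 1
    return origin, [slots.get(i) for i in range(length)]


virtual_start, virtual_values = build_virtual(physical, INTERVAL)
```

The physical series remain the stored records; the virtual series is computed on demand from them.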
Use
Once time series data is collected and stored, the long-term purpose of the time series can be realized.
The ultimate purpose of time series data is no different than that of any data – use. People and
applications will use and re-use time series data for varying purposes and in various ways. How time
series data is used will influence the approach used for storage to ensure adequate performance and
storage volumes are available to handle the demand.
In general terms, it is the nature of how time series data is used that most influences its special
treatment. In many cases a time series is used as a whole (the entire series) rather than as individual
measures. Without such a directed form of use, the notion of a time series would be irrelevant as a
separate entity from the more general temporal data. Further, the cost of storage and transmission
can greatly affect the performance of applications using time series data, which suggests special
treatment of time series to reduce size and increase access performance.
Methodologies of Use
When using time series data, the data may be used in any of several ways. It is possible that only part of
the time series is needed, or that the entire time series is used. Further, the data that is used may be
enumerated in different ways.
Random Extraction
The most basic form of use for a time series is that of random extractions. In a random extraction, a
user needs data from a time series based upon a set of criteria known only by the user at extraction
time (not planned or expected at data collection time). This is one of the most common scenarios for
any data use and has large implications for storage format. For random extraction, a user may request
“all records where temperature is over 32”. This form of access results in a search over the time series
to extract the individual elements matching the criteria provided.
Temporal Extraction
The easiest form of extraction from a time series is temporal extraction. In temporal extraction, the
user wants a portion of the time series between two dates. This results in a new time series being
returned that is bounded by the tighter of the user-defined limits and the time series' own limits
(as when an extraction is requested that starts prior to the start of the time series itself).
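Temporal extraction can be sketched as index arithmetic over a sequentially stored series. The Python function below is a minimal illustration under stated assumptions: the series is held as a start time, an interval and a list of values, both requested bounds are inclusive, and the name `extract` is hypothetical.

```python
from datetime import datetime, timedelta


def extract(start, interval, values, want_from, want_to):
    """Return (new_start, new_values) for readings with want_from <= t <= want_to."""
    if not values:
        return start, []
    end = start + (len(values) - 1) * interval
    lo_time = max(start, want_from)  # clamp the request to the series limits
    hi_time = min(end, want_to)
    if lo_time > hi_time:
        return lo_time, []
    first = -((start - lo_time) // interval)  # ceiling division on whole intervals
    last = (hi_time - start) // interval      # floor division on whole intervals
    return start + first * interval, values[first:last + 1]


series_start = datetime(2009, 11, 2, 5, 0)
five_minutes = timedelta(minutes=5)
temps = [10, 11, 12, 13, 14]  # readings at 5:00, 5:05, 5:10, 5:15, 5:20

# A request that begins before the series does is clamped to the series start.
clamped = extract(series_start, five_minutes, temps,
                  datetime(2009, 11, 2, 4, 0), datetime(2009, 11, 2, 5, 12))
```

No search over the values is needed; the bounds translate directly into positions, which is why this form of extraction is inexpensive.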
Complete Delivery
The best case use scenario for a time series is complete delivery. Notice this is not an extraction, in that
the entire data set is delivered as a whole. No processing is required beyond integrating “physical” time
series into the “virtual” record.
Enumeration
Once time series data is delivered, a user will generally “walk through” the data in some manner toward a
goal. For example, to compute the average of a time series, a full forward-scrolling read is performed to
sum all values in the time series. This is a complete linear access from start to finish.
Linear or Sequential
Linear or sequential access is the direct reading of the time series in time order of the data. Linear
access has no special requirements and is one common access scenario.
Partial
For any type of access, including linear, it is possible that only a portion of the data is to be reviewed. In
this case, the access will only need to visit a portion of the data points within the time series.
Random
In a random access methodology, the user may need to access any point within the time series at any
time. In this way, the user must be able to “move” within the time series at will. Random access is the
most complex form of access for any data structure, and is commonly required. One common example
of random access is sorting. If a user wanted to sort a time series by temperature rather than by time,
they would use both linear access to enumerate and random access to read specific items.
More significantly, random access allows for access by data field, such as temperature (e.g. get record
for temperature = 26). This form of random access is closely related to random extraction and has
similar impacts on performance.
Index or Ordinal
Index or ordinal access to a time series is access by time offset or by “offset” into the time series by
position (e.g. the 26th data point in the series). Index access is closely related to random access (and is in
fact a mechanism for random access) without the performance issues of other forms of random access.
In general, index access is the only form of random access with low performance costs.
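The difference in cost between index access and access by data value can be sketched as follows. The Python example assumes a sequentially stored series (start, interval, values); the data and the names `index_of` and `find_by_value` are hypothetical.

```python
from datetime import datetime, timedelta

start = datetime(2009, 11, 2, 0, 0)
interval = timedelta(minutes=5)
values = [20 + (i % 12) for i in range(288)]  # one hypothetical day of readings


def index_of(when):
    """Ordinal position of the reading at `when`, found by arithmetic alone."""
    offset, remainder = divmod(when - start, interval)
    if remainder or not 0 <= offset < len(values):
        raise KeyError(when)
    return offset


def find_by_value(target):
    """Access by field value has no such shortcut: every reading is examined."""
    return [i for i, v in enumerate(values) if v == target]


position = index_of(datetime(2009, 11, 2, 2, 5))  # the 26th data point
matches = find_by_value(26)
```

`index_of` does a fixed amount of work regardless of series length, while `find_by_value` grows with the number of readings unless a separate index on the value is maintained.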
Storage
Time series data may be stored in many different ways. There are many well-defined storage formats
for dealing with the storage and transport of time series data, such as CDF (Common Data Format) and
NetCDF (Network Common Data Form). Further, there are many databases and applications that have
support for time series data, such as Aquarius, Historis, Temporal Analyst, GrADS and Timescape XDB.
Each of these tools and formats has pros and cons, and any of them may be adequate for a specific use.
In general, there is a common thread across all time series formats: a time series is a set of data
delimited in time by a fixed interval with a fixed start date (our general definition). In specific
implementations, there may be constraints on the data stored in a single time series (the fields) or on
the maximum size of the time series when stored (Aquarius for example has the limit of the underlying
database).
When planning time series storage, considerations must be made for the collection and use of the data
to be stored to ensure adequate capacity and performance. Further, each type of data to be stored in a
time series (the field set) will require a dedicated time series store. For example, a water quality time
series cannot store sediment data (there are different fields). However, a water/sediment time series
may be created that stores both together as a single entity. In this case however, there is no separation
between the data which will impact flexibility of collection, storage and use.
Mechanisms of Storage
As with any data, a time series may be stored in a relational database management system (RDBMS), in
flat files, as XML or in any other manner. The selection of storage location (e.g. flat file or RDBMS) will
influence how the data within that location is structured. For example, in an RDBMS, each time series
could be stored as a dedicated table, a set of rows in a shared table, or a single row in a shared table.
Field Storage
An important aspect of the time series is the fields within the series. If a time series stores only a single
parameter (such as temperature), the time series storage is relatively trivial. If the time series stores a
complex data structure (such as several fields, some of which may have sub-fields), the storage of the
time series will be equally complex.
Storage Basics
For any storage on a computer, data must be reduced into bytes that are written to and read from disk.
Even in an RDBMS, the same is true. In any programming language or RDBMS, there is a set of specific
data types that are well known and can be directly converted between bytes and the data type (such as
a 32-bit integer or text string).
Each language and database has its own conventions for converting between bytes and data types
(for example, a 32-bit integer written by Java may not have the same byte pattern as a 32-bit integer
written by Visual Basic or SQL Server). The conversion of a data type to bytes is called “serialization” and the
reverse is called “deserialization”. This is an ongoing issue in computer science and affects all computing
applications. As long as a single platform performs all operations across the lifecycle, there is
no measurable issue; however, this is rarely the case. The most consistent format across all platforms is
text, which is a powerful indicator of why XML has been so successful: everything in XML is represented
as text.
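Byte order is one common source of such differences, and it can be illustrated with Python's standard `struct` module. This sketch is only an example of the general problem: it contrasts big-endian encoding (the order Java's `DataOutputStream` uses for an `int`) with little-endian encoding (the usual in-memory order on x86 hardware) and with a plain text encoding.

```python
import struct

value = 1000

big_endian = struct.pack(">i", value)     # most significant byte first
little_endian = struct.pack("<i", value)  # least significant byte first
as_text = str(value).encode("ascii")      # the same number serialized as text

# Deserializing with the wrong convention silently yields a different number.
misread = struct.unpack("<i", big_endian)[0]
round_trip = struct.unpack(">i", big_endian)[0]
```

The two binary forms hold the same four bytes in opposite order, so a reader must know which convention the writer used; the text form avoids that question at the cost of size and parsing.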
The comparison of data (such as during search) requires the processing software to “understand” the
data stored, so any mechanism that compresses or encodes the data stored must be able to uncompress
or decode the data prior to performing a comparison. Due to this fundamental concept, the storage
format used should be aligned with the ultimate patterns of use and limitations of the storage platforms
(for example maximum allowed field lengths in an RDBMS).
Storage Considerations
As with all storage related concepts, it is critical that storage designers consider volume (size), access
speed (read and write) and general performance when designing a storage implementation. If most
access will enumerate a data set for example, the selected storage mechanism should favor that form of
access. However, if random access is still needed, then no optimizations should be used for
enumerations that make random access unusable. This is always a trade-off and must be evaluated on a
case-by-case basis.
Take Away
Storage planning and format is not a trivial concept. For any given storage platform (such as SQL Server
and C#), a set of storage patterns may be implemented for all reasonably expected needs and then used
wherever those needs arise. This patterning allows for near optimal application at minimal costs since
patterns are developed once and reused over the enterprise.
Time Series Field Storage Concepts
Each time series may have multiple fields of data collected. Further, each time series may have different
fields collected than another time series. Given both of these premises, the design of the data fields
within a time series may be of considerable importance. As will be discussed in later sections, time
series data may be stored in any number of ways using various technologies such as RDBMS
applications, XML, binary files and so on. With each of these technologies, the time series and the data
values are related and may be treated differently based upon the specific technology used. This section
will discuss general concepts relating to the structure of a time series and its data value fields.
Single Field Time Series
The most basic form of time series is a single field time series. In this form of time series only a single
value is collected at each time interval. This form of time series may be thought of and treated as a
basic “value stream” of discrete values for the single field at the fixed interval of the time series. The
field and storage design for this type of time series only needs to deal with the most primitive anomaly;
missing data values. Within any time series it must be expected that some individual value points may
be corrupt and therefore are missing from the series.
In any time series that uses IEEE 754 compliant single (32-bit) or double (64-bit) precision floating point
numbers, there is a built-in “not a number” (NaN) value. In this case, no special handling is required for
the time series except to expect that NaN values may be present anywhere within the value stream. If a
single field time series is storing data in another format, such as integer or string values,
accommodations must be made for the absence of value within the value stream.
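The two cases can be contrasted in a short Python sketch. The floating point stream relies on the built-in NaN value, while the integer stream needs an agreed marker; the sentinel `-9999` and the function names are assumptions made for illustration, not a convention taken from the report.

```python
import math

# Floating point stream: NaN marks a missing value with no extra structure.
temps = [15.0, float("nan"), 15.4, 15.2]


def mean_skipping_nan(stream):
    """Average the present values; NaN entries are skipped explicitly."""
    present = [v for v in stream if not math.isnan(v)]
    return sum(present) / len(present) if present else float("nan")


# Integer stream: there is no NaN, so a sentinel must be agreed on beforehand.
MISSING = -9999  # hypothetical sentinel; it must never occur as real data
counts = [3, MISSING, 5]


def mean_skipping_sentinel(stream, sentinel=MISSING):
    """Average the present values; sentinel entries are skipped explicitly."""
    present = [v for v in stream if v != sentinel]
    return sum(present) / len(present) if present else None


temp_mean = mean_skipping_nan(temps)
count_mean = mean_skipping_sentinel(counts)
```

Note that NaN compares unequal to everything, including itself, so it must be detected with `math.isnan` rather than with `==`.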
For the design of single field time series data, there are two basic approaches:
Time coupled
Sequential
A time coupled single value series will associate each record within the time series as the (T,V) pair of
time (T) and value (V). This set of pairs becomes the time series. A sequential single value series will
provide all records within the time series as a stream of values with only a single time stored indicating
the start of the series and a single interval which indicates the temporal spacing of the values within the
series. In this manner, the time series may be thought of simply as an array of values.
Multiple Field Time Series
In multiple field time series, each temporal record within the time series has a set of multiple fields.
Based upon our working definition of a time series, all records have exactly the same set of fields within
a single time series. However, each time series defines its own set of fields, and therefore may result in
arbitrarily many time series field sets within an organization's corpus of time series data.
The pattern for storing multiple field time series data can take several forms. The most basic form is to
treat each field within the time series as a distinct, single field time series. This approach isolates each
data field as a distinct time series and provides the ability to distribute the storage of each time series to
different storage locations. There is however the overhead of additional storage for the time series
metadata.
Hub and Spoke
A basic expansion of the single field time series pattern for multiple fields is to create a “hub and spoke”
or “star” pattern for the time series (depicted below). The core time series metadata is recorded as a
single entity, with each field modeled as a discrete time series data value stream.
Figure 8. Example of Hub-and-Spoke Model
Beyond representing a multi-field time series as a collection of single-field time series, any multi-field
time series can be represented as a single entity, with storage modeled again in multiple ways.
Field Interleaved
In a field interleaved representation, a time series is stored as a series of value streams. Each value
stream is complete for the time series, containing all values for a single field. This model most closely
resembles the result of the hub and spoke model, where each parameter is isolated as a series.
Figure 9. Field Interleaved Representation Example
The total time series has one value stream per field in this storage strategy that can be easily
enumerated. If values of multiple fields must be accessed together, there is additional overhead for
enumerating multiple streams.
Interval Interleaved
In an interval interleaved representation, the fields are stored in order within each temporal interval.
This permits each temporal interval to be the primary unit of separation between each data record.
Within a single temporal record, the fields are consecutive in a pre-defined order.
Figure 10. Interval Interleaved Representation Example
Interval interleaved storage provides rapid enumeration of the time series when all fields are used in the
enumeration. If it is most common to enumerate the time series to access only a single parameter,
there is overhead in the transport and skipping of unused fields to access the required field.
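The trade-off between the two layouts can be seen by flattening the same records both ways. The Python sketch below uses a hypothetical three-field series (temperature, pH, depth); the helper names are illustrative only.

```python
# Hypothetical series of three intervals, each with fields (temp, ph, depth).
records = [(15.0, 7.1, 2.0), (15.2, 7.0, 2.1), (15.1, 7.2, 2.2)]
n_fields = 3
n_records = len(records)

# Interval interleaved: t0.temp, t0.ph, t0.depth, t1.temp, t1.ph, ...
interval_interleaved = [v for record in records for v in record]

# Field interleaved: all temp values, then all ph values, then all depth values.
field_interleaved = [record[f] for f in range(n_fields) for record in records]


def field_from_interval_layout(flat, field, n_fields):
    """Reading one field must stride over the unused fields of every interval."""
    return flat[field::n_fields]


def record_from_field_layout(flat, index, n_records):
    """Reading one whole record must touch every field stream."""
    return tuple(flat[index + f * n_records] for f in range(len(flat) // n_records))


ph_values = field_from_interval_layout(interval_interleaved, 1, n_fields)
second_record = record_from_field_layout(field_interleaved, 1, n_records)
```

Each layout makes one access pattern a contiguous read and the other a strided one, which is the overhead described for each representation.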
Coupled Interleaved
For any time series where general enumeration involves specific known groups of fields, a hybrid of field
and interval interleaving may be used. A coupled interleaved method allows for groups of fields to be
represented as field interleaved with the remainder of the dataset interval interleaved.
Figure 11. Coupled Interleaved Representation Example
The coupled representation provides fast enumeration for the coupled fields while avoiding the cost of
skipping unused fields. If the coupling of fields is not known at design time, this representation is
difficult to plan for. Use of this pattern has the overhead of both interleaving methods if enumerating
uncoupled fields (e.g. field 1 and field 5 in the example above).
RDBMS Storage Patterns
Within an RDBMS, time series data can be stored in a number of ways. Most simply, time series data
may be stored as temporal records, one value per row. Likewise, time series data can be compacted
into a single field and stored as a binary object (BLOB) or XML.
Flat Temporal
In the flat temporal model of storing time series data, there is no notion of a time series specifically.
Instead, all data is simply stored as temporal records.
Figure 12. Flat Temporal Pattern Example
This flat data storage is the simplest method of storing temporal data overall. It provides good
performance for random access, but suffers from poor insert performance (mainly when indexed) and
slow overall sequential access performance due to the table-scan nature of retrieval.
Flat Time Series
In the flat time series model of storing time series data, each time series is “registered” in a time series
table that defines only the time series reference information (metadata). All the actual data for the time
series is stored in a values table, where each record in the values table stores a single time series record
(point in time).
Figure 13. Flat Time Series Pattern Example
In most cases, each time series will have a different set of fields and therefore be best represented by a
separate values table. In this model, that results in a single “master” time series table and multiple
values tables as shown below.
Figure 14. Multiple Flat Time Series Example
Flat time series data storage has performance characteristics similar to those of the flat temporal
model, as data is stored in a similar fashion. However, the time series table allows for retrieval based
upon a specific time series instance. This allows for a long-term time series (such as continual
monitoring) to be identified in a single values table, with separate physical time series for each
instrument deployment (e.g. single hydrolab collection event).
If random access to data is the most common, this model will yield the best overall performance
characteristics and allow for query by data values with no special software capabilities utilized.
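A minimal sketch of the flat time series pattern, using Python's standard `sqlite3` module with an in-memory database, is shown below. The table and column names (`time_series`, `water_quality_values`, `measured_at` and so on) are assumptions made for the example and are not taken from the figures in this report.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE time_series (
        series_id        INTEGER PRIMARY KEY,
        name             TEXT,
        start_time       TEXT,
        interval_minutes INTEGER
    );
    CREATE TABLE water_quality_values (
        series_id   INTEGER REFERENCES time_series(series_id),
        measured_at TEXT,
        temperature REAL,
        PRIMARY KEY (series_id, measured_at)
    );
""")

# One registered series (metadata) and one row per point in time (values).
db.execute("INSERT INTO time_series VALUES (1, 'deployment A', '2009-11-02T05:00', 5)")
db.executemany(
    "INSERT INTO water_quality_values VALUES (1, ?, ?)",
    [("2009-11-02T05:00", 30.5), ("2009-11-02T05:05", 32.5), ("2009-11-02T05:10", 33.0)],
)

# Query by data value using plain SQL, with no special software involved.
over_32 = db.execute(
    "SELECT measured_at, temperature FROM water_quality_values "
    "WHERE series_id = ? AND temperature > 32 ORDER BY measured_at",
    (1,),
).fetchall()
```

Because every value is an ordinary column, the database itself can filter, index and sort on it.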
Entity Time Series
If an entire time series is treated as an entity, with the individual data values treated simply as atoms
within the entity, a time series may be stored as a single record in a database table. This general
concept is the basis for entity time series storage.
Entity time series storage has multiple “flavors”, each with subtle differences that improve some aspect
of the time series storage size or performance.
Flat BLOB Storage
The first entity time series storage model is blob storage. Each time series entity is stored in a single
database record and all data points within the time series are stored within a single blob field.
Figure 15. Flat BLOB Pattern Example
Notice that a single table is all that is required, regardless of the data fields collected for the values. The
data within the “DataValueBLOB” field is a software-specific binary encoding of the
entire set of data values for the time series.
This form of storage is very simple in that each time series read or write operation acts upon a single
record within a single table. Overall, the table will have a very small number of records due to the
storage model. However, there is an implied limit on the size of the time series based upon the limit of
the underlying RDBMS. In most cases, this will be in the 2-4 GB range for traditional RDBMS applications,
which may be unacceptable for long-term monitoring.
In the case of long-term monitoring, not only is the size a limiting factor, but each time a value is
appended to the time series, the entire record must be re-constructed and re-written. This is a wasteful
process that generally requires interim caching until sufficient data is collected to mitigate the
write/transfer overhead.
An important aspect of BLOB-based storage is the requirement for custom software (external to the
RDBMS) to interpret, encode and decode the data within the BLOB. This may result in additional
considerations in terms of software costs, software version stability and operating environment
constraints.
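A minimal sketch of such external encoding software follows. It assumes a hypothetical point layout of a 64-bit integer timestamp followed by a 64-bit float value; the layout and the helper names are illustrative and are not the encoding used by any particular product. The append_point function shows why appending is costly: the whole record is read, re-encoded and rewritten.

```python
import sqlite3
import struct

POINT = struct.Struct("<qd")  # assumed layout: int64 epoch seconds, float64 value

def encode_points(points):
    """Pack (timestamp, value) pairs into one contiguous byte string."""
    return b"".join(POINT.pack(t, v) for t, v in points)

def decode_points(blob):
    """Unpack a byte string produced by encode_points."""
    return [POINT.unpack_from(blob, i) for i in range(0, len(blob), POINT.size)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TimeSeries (TimeSeriesId INTEGER PRIMARY KEY, DataValueBLOB BLOB)"
)
conn.execute(
    "INSERT INTO TimeSeries VALUES (1, ?)",
    (encode_points([(0, 12.1), (3600, 12.3)]),),
)

def read_series(series_id):
    """Read and decode the single record that holds the whole series."""
    (blob,) = conn.execute(
        "SELECT DataValueBLOB FROM TimeSeries WHERE TimeSeriesId = ?", (series_id,)
    ).fetchone()
    return decode_points(blob)

def append_point(series_id, point):
    """Appending means re-encoding and rewriting the entire record."""
    points = read_series(series_id) + [point]
    conn.execute(
        "UPDATE TimeSeries SET DataValueBLOB = ? WHERE TimeSeriesId = ?",
        (encode_points(points), series_id),
    )

append_point(1, (7200, 12.6))
```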
Enhancing Flat BLOB Storage
Given the nature of flat BLOB storage, it may seem to be a bad choice. However, if the number of data
fields is small and the size of the overall time series is small (e.g. one month instrument deployment of
two parameter data at hourly sample rate), flat BLOB storage will provide excellent performance and
storage efficiency.
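As a rough worked example of that case, assume a 30-day month, an 8-byte timestamp and two 8-byte parameter values per sample (these sizes are assumptions, not figures from this report):

\[
30 \times 24 = 720\ \mathrm{samples}, \qquad 720 \times (8 + 2 \times 8)\ \mathrm{bytes} = 17{,}280\ \mathrm{bytes} \approx 17\ \mathrm{KB}
\]

Under these assumptions the entire deployment fits in a BLOB several orders of magnitude below the RDBMS size limits discussed above.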
Flat XML Storage
A close analog of the flat BLOB storage is flat XML storage. Flat XML and flat BLOB storage work in the
same way, except instead of using a BLOB field to store binary data, the values are encoded in XML and
stored in an XML or text field within the time series table.
Figure 16. Flat XML Pattern Example
A major benefit of flat XML storage is that XML data may generally be searched by the RDBMS
without the need for external software. Further, XML is text based and therefore resilient to software
platform and version differences. The same size limitations imposed by the RDBMS apply as for
BLOB storage, but they are more restrictive for XML because XML encoded data is generally larger.
It is not uncommon for XML data to be twice or more the size of comparable binary encoded
data. Of course, the actual size of the data varies widely depending on how efficient an
encoding is used.
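The size difference can be illustrated with a small sketch that encodes the same hypothetical hourly samples both ways. The element and attribute names (values, v, t, x) and the binary layout are assumptions chosen for brevity; a more verbose XML vocabulary would widen the gap.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical data: 24 hourly samples of one parameter.
points = [(i * 3600, 12.0 + i * 0.1) for i in range(24)]

def encode_xml(points):
    """Encode (timestamp, value) pairs as a compact XML document."""
    root = ET.Element("values")
    for t, v in points:
        # repr() keeps enough digits for the float to round-trip exactly.
        ET.SubElement(root, "v", t=str(t), x=repr(v))
    return ET.tostring(root, encoding="unicode")

def decode_xml(text):
    """Parse the XML text back into (timestamp, value) pairs."""
    return [(int(e.get("t")), float(e.get("x"))) for e in ET.fromstring(text)]

xml_text = encode_xml(points)
binary = b"".join(struct.pack("<qd", t, v) for t, v in points)

# Compare the two encodings of identical data.
print(len(binary), "bytes binary;", len(xml_text.encode("utf-8")), "bytes XML")
```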
External File Storage
Another close analog of the flat BLOB storage is external file storage. In external file storage, instead of
any value data being within the RDBMS, all values are encoded in a data file. The time series table then
simply contains a reference to the file. Again, the encoding of the data within the file is managed by
external software and the RDBMS itself has no knowledge of the file.
External file storage has the benefits of BLOB storage without any RDBMS imposed size limitations.
External file storage has a potential for performance benefits as well based upon how the file storage is
implemented.
Figure 17. External File Pattern Example
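The external file pattern might be sketched as follows. The DataFilePath column name, the file naming scheme and the binary point layout are assumptions for illustration. In this sketch an append writes only the new bytes to the file and leaves the database row untouched, which is one example of how the file implementation can affect performance.

```python
import os
import sqlite3
import struct
import tempfile

POINT = struct.Struct("<qd")  # assumed layout: int64 timestamp, float64 value

data_dir = tempfile.mkdtemp()  # stand-in for a managed storage location
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TimeSeries (TimeSeriesId INTEGER PRIMARY KEY, DataFilePath TEXT)"
)

def create_series(series_id):
    """Register a series whose values live in an external file."""
    path = os.path.join(data_dir, f"series_{series_id}.bin")
    open(path, "wb").close()  # create an empty data file
    conn.execute("INSERT INTO TimeSeries VALUES (?, ?)", (series_id, path))

def file_path(series_id):
    """Look up the file reference stored in the time series table."""
    (path,) = conn.execute(
        "SELECT DataFilePath FROM TimeSeries WHERE TimeSeriesId = ?", (series_id,)
    ).fetchone()
    return path

def append_point(series_id, t, v):
    """Append one encoded point to the end of the external file."""
    with open(file_path(series_id), "ab") as f:
        f.write(POINT.pack(t, v))

def read_points(series_id):
    """Decode every point stored in the external file."""
    with open(file_path(series_id), "rb") as f:
        data = f.read()
    return [POINT.unpack_from(data, i) for i in range(0, len(data), POINT.size)]
```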
Dynamic Time Series
A further refinement for RDBMS storage of time series data is to dynamically structure the storage
rather than use fixed elements as in the previous methodologies. Again, dynamic time series storage is
a broad class of methodologies that attempt to gain advantages in performance and size for managing
time series data within an RDBMS.
In all dynamic time series storage strategies, the data within the values fields may be encoded as BLOB
or XML data similarly to the basic entity storage mechanisms from the previous section. In dynamic
storage, the time series is simply broken into multiple individual records each of which contains multiple
data values.
Fixed Size Dynamic Storage
In fixed size dynamic storage, each record has a target field size limit (e.g. 100 KB, 10 MB, etc.) for storing
data values. The data value encoding software is responsible for breaking the time series into “chunks”
of data values that do not exceed this size limit. The goal is to encode as many discrete values as possible,
in time order, without exceeding this size limit. In this manner, each record will contain all values
between a minimum and maximum time.
Figure 18. Fixed Size Dynamic Pattern Example
There are two basic sub-strategies for fixed size time series storage:
Time Window
Entity Window
In the time window strategy (as depicted above), the time series values table maintains a start date and
end date for each time series record that indicates the bounds stored within that record.
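A minimal sketch of the time window strategy is given below. It assumes fixed-width encoded points (a hypothetical int64 timestamp and float64 value), so the number of points per record follows directly from the size limit; variable-width encodings would need to accumulate sizes instead. The function names are illustrative.

```python
import struct

POINT = struct.Struct("<qd")  # assumed layout: int64 timestamp, float64 value

def chunk_by_size(points, max_bytes):
    """Split time-ordered points into records of at most max_bytes each.

    Returns a list of (start_time, end_time, blob) tuples, one per
    values-table record.
    """
    per_record = max_bytes // POINT.size
    if per_record < 1:
        raise ValueError("max_bytes is smaller than one encoded point")
    records = []
    for i in range(0, len(points), per_record):
        chunk = points[i:i + per_record]
        blob = b"".join(POINT.pack(t, v) for t, v in chunk)
        records.append((chunk[0][0], chunk[-1][0], blob))
    return records

def find_record(records, requested_time):
    """Return the blob whose [start, end] window covers requested_time."""
    for start, end, blob in records:
        if start <= requested_time <= end:
            return blob
    return None
```

In a database, find_record would correspond to a query on the start date and end date columns rather than a loop.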
The entity window strategy (depicted below) is very similar, except that if the time series records are all
of fixed size, it is possible to know a priori the exact maximum number of data values that may be
stored within a single record of the time series. In this strategy, the time series itself indicates the
number of values stored within a record, and the “offset” to any value is computed as:
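\[
\mathit{recordId} = \frac{\mathit{RequestedTime} - \mathit{StartTime}}{\mathit{interval}}
\]
\[
\mathit{recordOffset} = \frac{\mathit{recordId}}{\mathit{valuesPerRecord}}; \qquad \mathit{elementOffset} = \mathit{recordId} \bmod \mathit{valuesPerRecord}
\]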
Once the computation is completed, the integer part of recordOffset indicates the sequenceId (zero-based)
of the record containing the value, and the elementOffset indicates which value within the record is to be
returned. This computation makes random access possible, at a slight computational cost. Enumeration of
the data carries no such overhead; it requires only a query ordered upon the sequenceId column of the
values table.
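The offset computation can be sketched in a few lines. The function name and the use of integer division are assumptions of this sketch; times and the interval are taken to be integers in the same unit (for example, seconds).

```python
def locate(requested_time, start_time, interval, values_per_record):
    """Return (recordOffset, elementOffset) for a regularly sampled series."""
    # Index of the requested value counted from the start of the series.
    record_id = (requested_time - start_time) // interval
    if record_id < 0:
        raise ValueError("requested_time precedes start_time")
    # Quotient selects the record (sequenceId); remainder selects the element.
    return divmod(record_id, values_per_record)
```

For example, with hourly data (interval of 3600 seconds) and 24 values per record, a request 25 hours after the start resolves to the second record (sequenceId 1) and its second element (offset 1).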
Figure 19. Fixed Size Entity Window Example
Fixed Entities Dynamic Storage
The fixed entities dynamic storage strategy is similar to fixed size storage, except that each record
will contain an exact number of interval records, regardless of the size required to store that number of
records. Of course, it is imperative that the number of records stored fits within the constraints of the
underlying RDBMS. This form of storage is most similar to the entity window strategy of fixed size storage.
Figure 20. Fixed Entities Pattern Example
The fixed entities storage must be able to access each temporal record within the time series. If the
temporal records are of fixed size then this strategy is identical to the above discussed entity window
strategy where offsets are directly computed. For situations where the record size is not fixed (such as
multi-field with empty records for holes) the offset to a specific record cannot be directly computed and
must instead be indexed. In these cases, the index offsets of each element are generally written into a
“header” of the data values or stored in a separate “indices” column in the values table.
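One possible shape for such a header is sketched below. The layout (a 32-bit count, then one 32-bit offset per record, then the concatenated payload) is an assumption for illustration, and the records are treated as opaque byte strings; an empty byte string stands in for a hole.

```python
import struct

def encode_indexed(records):
    """Encode variable-length byte records behind an offset-table header.

    Assumed layout: uint32 count, then count uint32 offsets into the
    payload, then the concatenated payload.
    """
    offsets, position = [], 0
    for r in records:
        offsets.append(position)
        position += len(r)
    header = struct.pack(f"<I{len(records)}I", len(records), *offsets)
    return header + b"".join(records)

def read_record(blob, index):
    """Return record number index without decoding the other records."""
    (count,) = struct.unpack_from("<I", blob, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    offsets = struct.unpack_from(f"<{count}I", blob, 4)
    payload_start = 4 + 4 * count
    start = payload_start + offsets[index]
    # The last record runs to the end of the blob.
    end = payload_start + offsets[index + 1] if index + 1 < count else len(blob)
    return blob[start:end]
```

Storing the same offsets in a separate “indices” column would work the same way, with the header bytes moved out of the data values field.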
Conclusions
Every organization must evaluate its information strategy and time series data needs to ensure
adequate planning and effective implementations are used for an effective lifecycle for all users. There
are many considerations for each type of time series data that comprises the organizational information
corpus. Data modeling and implementation planning is an activity which is critical to ensure the proper
entities are captured in a repeatable, standardized and maintainable manner.
Time series data can be reduced to a simple set of concepts and a small set of general patterns for
implementation. Each actual time series data set within the organization can then use these concepts
and patterns to create an effective and efficient implementation for that time series that can be reused
over the organization's lifetime. However, each time series data type will need to be evaluated
separately and the most effective storage patterns used.
Appendices
Acronym List
Acronym Description
CDF Common Data Format
DID Data Item Description
NARA National Archives and Records Administration
NetCDF Network Common Data Format
RDBMS Relational Database Management System
SQL Structured Query Language
XML Extensible Markup Language