representation models introduction to data science: data · text attributes : based on edit...

91
Overview Principles of preparing and organizing data in a way that is amenable for analysis. 1 / 74 Overview Principles of preparing and organizing data in a way that is amenable for analysis. Data representation model: collection of concepts that describes how data is represented and accessed. 1 / 74 Overview Principles of preparing and organizing data in a way that is amenable for analysis. Data representation model: collection of concepts that describes how data is represented and accessed. Thinking abstractly of data structure, beyond a specific implementation, makes it easier to share data across programs and systems, and integrate data from different sources. 1 / 74 Overview Structure: We have assumed that data is organized in rectangular data structures (tables with rows and columns) Semantics: We have discussed the notion of values, attributes, and entities. 2 / 74 Overview Structure: We have assumed that data is organized in rectangular data structures (tables with rows and columns) Semantics: We have discussed the notion of values, attributes, and entities. So far, data semantics: a dataset is a collection of values, numeric or categorical, organized into entities (observations) and attributes (variables). 2 / 74 Overview Structure: We have assumed that data is organized in rectangular data structures (tables with rows and columns) Semantics: We have discussed the notion of values, attributes, and entities. So far, data semantics: a dataset is a collection of values, numeric or categorical, organized into entities (observations) and attributes (variables). Each attribute contains values of a specific measurement across entities, and entities collect all measurements across attributes. 2 / 74 Overview In the database literature, we call this exercise of defining structure and semantics as data modeling. 3 / 74 Overview In the database literature, we call this exercise of defining structure and semantics as data modeling. In this course we use the term data representational modeling, to distinguish from data statistical modeling. 3 / 74 Data representational modeling Data model: A collection of concepts that describes how data is represented and accessed Schema: A description of a specific collection of data, using a given data model 4 / 74 Data representational modeling Modeling Constructs: A collection of concepts used to represent the structure in the data. Typically we need to represent types of entities, their attributes, types of relationships between entities, and relationship attributes 5 / 74 Data representational modeling Integrity Constraints: Constraints to ensure data integrity (i.e., avoid errors) 6 / 74 Data representational modeling Integrity Constraints: Constraints to ensure data integrity (i.e., avoid errors) Manipulation Languages: Constructs for manipulating the data 6 / 74 Data representational modeling We desire that models are: sufficiently expressive so they can capture realworld data well, easy to use, lend themselves to defining computational methods that have good performance. 7 / 74 Data representational modeling Some examples of data models are Relational, Entityrelationship model, XML... Objectoriented, Objectrelational, RDF... Current favorites in the industry: JSON, Protocol Buffers, Avro, Thrift, Property Graph 8 / 74 Data representational modeling Data independence: The idea that you can change the representation of data w/o changing programs that operate on it. Physical data independence: I can change the layout of data on disk and my programs won't change index the data partition/distribute/replicate the data compress the data sort the data 9 / 74 Modeling constructs: entities and their attributes relationships and relationship attributes. Entities are objects represented in a dataset: people, places, things, etc. Relationships model just that, relationships between entities. The Entity-Relationship and Relational Models 10 / 74 Diagrams: rectangles are entitites diamonds and edges indicate relationships Circles describe either entity or relationship attributes. The Entity-Relationship and Relational Models 11 / 74 Arrows are used indicate multiplicity of relationships The Entity-Relationship and Relational Models 12 / 74 The Entity-Relationship and Relational Models Relationships are defined over pairs of entities. Relationship over sets of entities and is defined over the cartesian product . For example: if and , then . 13 / 74 The Entity-Relationship and Relational Models Arrows specify how entities participate in relationships. For example: this diagram specifies that entities in appear in only one relationship pair. That is, if , and , then there is no other pair . 14 / 74 The Entity-Relationship and Relational Models In databases and general datasets we work on, both Entities and Relationships are represented as Relations (tables). 15 / 74 The Entity-Relationship and Relational Models In databases and general datasets we work on, both Entities and Relationships are represented as Relations (tables). Such that a unique entity/relationship is represented by a single tuple (the list of attribute values that represent an entity or relationship). 15 / 74 The Entity-Relationship and Relational Models In databases and general datasets we work on, both Entities and Relationships are represented as Relations (tables). Such that a unique entity/relationship is represented by a single tuple (the list of attribute values that represent an entity or relationship). How can we ensure uniqueness of entities? 15 / 74 The Entity-Relationship and Relational Models In databases and general datasets we work on, both Entities and Relationships are represented as Relations (tables). Such that a unique entity/relationship is represented by a single tuple (the list of attribute values that represent an entity or relationship). How can we ensure uniqueness of entities? keys are an essential ingredient to uniquely identify entities and relationships in tables. 15 / 74 Formal introduction to keys Attribute set is a superkey of relation if values for are sufficient to identify a unique tuple of each possible relation Example: {SSN} and {SSN,name} are both superkeys of person 16 / 74 Formal introduction to keys Attribute set is a superkey of relation if values for are sufficient to identify a unique tuple of each possible relation Example: {SSN} and {SSN,name} are both superkeys of person Superkey is a candidate key if is minimal Example: {SSN} is a candidate key for person 16 / 74 Formal introduction to keys Attribute set is a superkey of relation if values for are sufficient to identify a unique tuple of each possible relation Example: {SSN} and {SSN,name} are both superkeys of person Superkey is a candidate key if is minimal Example: {SSN} is a candidate key for person One of the candidate keys is selected to be the primary key Typically one that is small and immutable (doesn’t change often) Primary key typically highlighted in ER diagram 16 / 74 Formal introduction to keys Foreign key: Primary key of a relation that appears in another relation {SSN} from person appears in employs person called referenced relation employs is the referencing relation 17 / 74 Formal introduction to keys Foreign key: Primary key of a relation that appears in another relation {SSN} from person appears in employs person called referenced relation employs is the referencing relation Foreign key constraint: the tuple corresponding to that primary key must exist Imagine: Tuple: ('123-45-6789', 'Apple')in employs But no tuple corresponding to '123456789' in person Also called referential integrity constraint 17 / 74 Tidy Data We use the term Tidy Data to refer to datasets that are represented in a form that is amenable for manipulation and statistical modeling. It is very closely related to the concept of normal forms in the ER model and the process of normalization in the database literature. 18 / 74 Tidy Data Here we assume we are working in the ER data model represented as relations: rectangular data structures where 1. Each attribute (or variable) forms a column 2. Each entity (or observation) forms a row 3. Each type of entity (observational unit) forms a table 19 / 74 Tidy Data Here is an example of a tidy dataset: One entity per row, a single attribute per column. Only information about flights included. year month day dep_time sched_dep_time dep_delay arr_time sched_arr_ 2013 1 1 517 515 2 830 2013 1 1 533 529 4 850 20 / 74 Structure Query Language The StructuredQueryLanguage (SQL) is the predominant language used in database systems. It is tailored to the Relational data representation model. SQL is a declarative language, we don't write a procedure to compute a relation, we declare what the relation we want to compute looks like. 21 / 74 Structure Query Language The basic construct in SQL is the socalled SFW construct: selectfrom where which specifies: select: which attributes you want the answer to have from: which relation (table) you want the answer to be computed from where: what conditions you want to be satisfied by the rows (tuples) of the answer 22 / 74 Structure Query Language E.g.: movies produced by Disney in 1990: note the rename select m.title, m.year from movie m where m.studioname = 'disney' and m.year = 1990 23 / 74 Structure Query Language The select clause can contain expressions (this is paralleled by the mutate operation we saw previously) select title || ' (' || to_char(year) || ')' as titleyear select 2014 - year 24 / 74 Structure Query Language The where clause support a large number of different predicates and combinations thereof (this is parallel to the filter operation) year between 1990 and 1995 title like 'star wars%' title like 'star wars _' 25 / 74 Structure Query Language We can include ordering, e.g., find distinct movies sorted by title select distinct title from movie where studioname = 'disney' and year = 1990 order by title; 26 / 74 Structure Query Language Group-by and summarize SQL has an idiom for grouping and summarizing E.g., compute the average movie length by year select name, avg(length) from movie group by year 27 / 74 Two-table operations So far we have looked at data operations defined over single tables and data frames. In this section we look at efficient methods to combine data from multiple tables. The fundamental operation here is the join, which is a workhorse of database system design and impementation. 28 / 74 Two-table operations The join operation: Combines rows from two tables to create a new single table Based on matching criteria specified over attributes of each of the two tables. 29 / 74 Two-table operations Consider a database of flights and airlines: flights ## # A tibble: 336,776 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 517 515 2 830 ## 2 2013 1 1 533 529 4 850 ## 3 2013 1 1 542 540 2 923 ## 4 2013 1 1 544 545 -1 1004 ## 5 2013 1 1 554 600 -6 812 30 / 74 Two-table operations airlines ## # A tibble: 16 x 2 ## carrier name ## <chr> <chr> ## 1 9E Endeavor Air Inc. ## 2 AA American Airlines Inc. ## 3 AS Alaska Airlines Inc. ## 4 B6 JetBlue Airways ## 5 DL Delta Air Lines Inc. ## 6 EV ExpressJet Airlines Inc. ## 7 F9 Frontier Airlines Inc. 31 / 74 Two-table operations Here, we want to add airline information to each flight. Join the attributes of the respective airline from the airlines table with the flights table based on the values of attributes flights$carrier and airlines$carrier. 32 / 74 Two-table operations Every row of flights with a specific value for flights$carrier, is joined with the the corresponding row in airlines with the same value for airlines$carrier. 33 / 74 Two-table operations There are multiple ways of performing this operation that differ on how nonmatching observations are handled. 34 / 74 Two-table operations Left Join In a left join, all observations on left operand (LHS) are retained: 35 / 74 Two-table operations Other operations: right join: all observations in RHS are retained outer join: all observations are retained (full join) inner join: only matching observations are retained Details in lecture notes 36 / 74 Two-table operations Join conditions All join operations are based on a matching condition: flights %>% inner_join(airlines, by="carrier") specifies to join observations where flights$carrier equals airlines$carrier. 37 / 74 Two-table operations In this case, where no conditions are specified using the by argument: flights %>% left_join(airlines) a natural join is perfomed. In this case all variables with the same name in both tables are used in join condition. 38 / 74 Two-table operations You can also specify join conditions on arbitrary attributes using the by argument. flights %>% left_join(airlines, by=c("carrier" = "name")) 39 / 74 Two-table operations SQL Constructs: Multi-table Queries Key idea: Do a join to combine multiple tables into an appropriate table Use SFW constructs for singletable queries 40 / 74 Two-table operations SQL Constructs: Multi-table Queries Key idea: Do a join to combine multiple tables into an appropriate table Use SFW constructs for singletable queries For the first part, where we use a join to get an appropriate table, the general SQL construct includes: The name of the first table to join The type of join to do The name of the second table to join 40 / 74 Two-table operations select title, year, me.name as producerName from movies m join movieexec me where m.producer = me.id; 41 / 74 Entity Resolution and Record Linkage Often, we will be faced with the problem of data integration: combine two (or more) datasets from different sources that may contain information about the same entities. 42 / 74 Entity Resolution and Record Linkage Often, we will be faced with the problem of data integration: combine two (or more) datasets from different sources that may contain information about the same entities. But,... the attributes in the two datasets may not be the same, 42 / 74 Entity Resolution and Record Linkage Often, we will be faced with the problem of data integration: combine two (or more) datasets from different sources that may contain information about the same entities. But,... the attributes in the two datasets may not be the same, 42 / 74 Entity Resolution and Record Linkage 43 / 74 Entity Resolution and Record Linkage 43 / 74 Entity Resolution and Record Linkage These are examples of a general problem referred to as Entity Resolution and Record Linkage. 44 / 74 Entity Resolution and Record Linkage Problem Denition Given: Entity sets and , Find: Linked entities with and . 45 / 74 Entity Resolution and Record Linkage One approach: similarity function Define a similarity function between entities and Link entities with high similarity. 46 / 74 Entity Resolution and Record Linkage Define similarity as an additive function over some set of shared attributes : with a similarity function defined for each attribute , 47 / 74 Entity Resolution and Record Linkage Example attribute functions Categorical attribute: pairs of entities with the same value are more similar to each other than pairs of entities with different values. E.g., 48 / 74 Entity Resolution and Record Linkage Example attribute functions Continuous attribute: pairs of entities with values that are close to each other are more similar than pairs of entities with values that are farther to each other. Note that to specify close or far we need to introduce some notion of distance. We can use Euclidean distance for example, 49 / 74 Entity Resolution and Record Linkage Example attribute functions Text attributes: based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to specify similarity. For example, fact that John and Johnathan are similar requires domain knowledge of common usage of English names. 50 / 74 Solving the resolution problem Need a rule to match entities we think are linked. This depends on assumptions we make about the dataset, similar to assumptions we made when performing joins. 51 / 74 Solving the resolution problem Model the entity resolution problem as an optimization problem: maximize objective function (based on similarity) over possible sets of valid pairs , where set constraints pairs based on problemspecific assumptions. 52 / 74 Solving the resolution problem Many-to-one resolutions Constrain sets to represent manytoone resolutions. Thus, entities in can only appear once in pairs in , but entities may appear more than once. In this case, we can match where 53 / 74 Solving the resolution problem One-to-one resolutions Suppose we constrain sets to those that represent onetoone resolutions: If then and appear in only one pair in . In this case, we have a harder computational problem. In fact, this is an instance of the maximum bipartite matching problem, and would look at network flow algorithms to solve. 54 / 74 Solving the resolution problem Other constraints We can add additional constraints to to represent other information we have about the task. A common one would be to only allow pairs to have similarity above some threshold . I.e., only if . 55 / 74 Solving the resolution problem Discussion The procedure outlined above is an excellent first attempt to solve the Entity Resolution problem. This is a classical problem in Data Science for which a variety of approaches and methods are in use. 56 / 74 Database Query Optimization Earlier we made the distinction that SQL is a declarative language rather than a procedural language. A reason why data base systems rely on a declarative language is that it allows the system to decide how to evaluate a query most efficiently. 57 / 74 Database Query Optimization Consider a Baseball database where we have two tables Batting and Master what is the maximum batting "average" for a player from the state of California? select max(1.0 * b.H / b.AB) as best_ba from Batting as b join Master as m on b.playerId = m.playerId where b.AB >= 100 and m.birthState = "CA" 58 / 74 Database Query Optimization Now, let's do the same computation using dplyr operations: Version 1: Batting %>% inner_join(Master, by="playerID") %>% filter(AB >= 100, birthState == "CA") %>% mutate(AB=1.0 * H / AB) %>% summarize(max(AB)) ## max(AB) ## 1 0.4057018 59 / 74 Database Query Optimization Version 2: Batting %>% filter(AB >= 100) %>% inner_join( filter(Master, birthState == "CA")) %>% mutate(AB = 1.0 * H / AB) %>% summarize(max(AB)) ## max(AB) ## 1 0.4057018 60 / 74 Database Query Optimization Which should be most efficient? Think about a simple cost model. The costliest operation here is the join between two tables. 61 / 74 Database Query Optimization What is the cost of this algorithm? . For the rest of the operations, let's assume we perform this with a single pass through the table. For example, we assume that filter(T) has cost . 62 / 74 Database Query Optimization Let's write out the cost of each of the two pipelines. Batting %>% inner_join(Master, by="playerID") %>% # cost: |Batting| x |Master| filter(AB >= 100, birthState == "CA") %>% # cost: |R1| mutate(AB=1.0 * H / AB) %>% # cost: |R| summarize(max(AB)) # cost: |R| 63 / 74 Database Query Optimization Cost of version 1 is : inner join between Batting and Master : is filtered to rows with AB >=100 & birthState == "CA". In this example: 2.08e+09 64 / 74 Database Query Optimization Now, let's look at the second version. Batting %>% filter(AB >= 100) %>% # cost: |Batting| inner_join( Master %>% filter(birthState == "CA") # cost: |Master| ) %>% # cost: |B1| x |M1| mutate(AB = 1.0 * H / AB) %>% # cost |R| summarize(max(AB)) # cost |R| 65 / 74 Database Query Optimization Cost of version 2 is : Batting filtered to include only rows with AB >= 100 : Master filtered to include birthState == "CA". In our example: 8.95e+07 66 / 74 Database Query Optimization Version 1 (join tables before filtering) is 23 times costlier. When using SQL in a database system we only write the one query describing our desired result, With the procedural (dplyr) we need to think which of the two versions is more efficient. 67 / 74 Database Query Optimization Database systems use query optimization to decide how to evaluate queries efficiently. The goal of query optimization is to decide the most efficient query plan to use to evaluate a query out of the many possible candidate plans it could use. It needs to solve two problems: search the space of possible plans, approximate the cost of evaluating a specific plan. 68 / 74 Database Query Optimization Think of the two procedural versions above as two candidate plans that the DB system could use to evaluate the query. Query optimzation approximates what it would cost to evaluate each of the two plans and decides to use the most efficient plan. 69 / 74 Semi-structured Data Representation Model The EntityRelational data model we have described so far is mostly defined for structured data: where a specific and consistent schema is assumed. Data models like XML and JSON are instead intended for semi structured data. 70 / 74 Semi-structured Data Representation Model XML: eXtensible Markup Language Data models like XML rely on flexible, selfdescribing schemas: <?xml version="1.0" encoding="UTF-8"?> <!-- Edited by XMLSpy --> <CATALOG> <CD> <TITLE>Empire Burlesque</TITLE> <ARTIST>Bob Dylan</ARTIST> <COUNTRY>USA</COUNTRY> <COMPANY>Columbia</COMPANY> 71 / 74 Semi-structured Data Representation Model JSON: Javascript Object Notation { "firstName": "John", "lastName": "Smith", "isAlive": true, "age": 25, "height_cm": 167.6, "address": { "streetAddress": "21 2nd Street", "city": "New York", "state": "NY", 72 / 74 Semi-structured Data Representation Model This is the format most contemporary data REST APIs use to transfer data. For instance, here is part of a JSON record from a Twitter stream: { "created_at":"Sun May 05 14:01:34+00002013", "id":331046012875583488, "id_str":"331046012875583488", "text":"\u0425\u043e\u0447\u0443, \u0447\u0442\u043e\u0431 \u0442\u044b \u0441\u0434\u0 "source":"\u003ca href=\"http:\/\/twitterfeed.com\"rel=\"nofollow\"\u003etwitterfeed\u0 "in_reply_to_user_id_str":null, "user":{ "id":548422428, 73 / 74 Summary We have looked at specifics of Data Representation Modeling Entity Relationship and Relational Models Definition of Tidy Data Joining tables Entity Resolution Models for semistructured data 74 / 74 Introduction to Data Science: Data Representation Models Héctor Corrada Bravo University of Maryland, College Park, USA 20200212

Upload: others

Post on 30-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

OverviewPrinciples of preparing and organizing data in a way that is amenable foranalysis.

1 / 74

OverviewPrinciples of preparing and organizing data in a way that is amenable foranalysis.

Data representation model: collection of concepts that describes howdata is represented and accessed.

1 / 74

OverviewPrinciples of preparing and organizing data in a way that is amenable foranalysis.

Data representation model: collection of concepts that describes howdata is represented and accessed.

Thinking abstractly of data structure, beyond a specific implementation,makes it easier to share data across programs and systems, andintegrate data from different sources.

1 / 74

OverviewStructure: We have assumed that data is organized in rectangulardata structures (tables with rows and columns)Semantics: We have discussed the notion of values, attributes, andentities.

2 / 74

OverviewStructure: We have assumed that data is organized in rectangulardata structures (tables with rows and columns)Semantics: We have discussed the notion of values, attributes, andentities.

So far, data semantics: a dataset is a collection of values, numeric orcategorical, organized into entities (observations) and attributes(variables).

2 / 74

OverviewStructure: We have assumed that data is organized in rectangulardata structures (tables with rows and columns)Semantics: We have discussed the notion of values, attributes, andentities.

So far, data semantics: a dataset is a collection of values, numeric orcategorical, organized into entities (observations) and attributes(variables).

Each attribute contains values of a specific measurement across entities,and entities collect all measurements across attributes.

2 / 74

OverviewIn the database literature, we call this exercise of defining structure andsemantics as data modeling.

3 / 74

OverviewIn the database literature, we call this exercise of defining structure andsemantics as data modeling.

In this course we use the term data representational modeling, todistinguish from data statistical modeling.

3 / 74

Data representational modelingData model: A collection of concepts that describes how data isrepresented and accessedSchema: A description of a specific collection of data, using a givendata model

4 / 74

Data representational modelingModeling Constructs: A collection of concepts used to represent thestructure in the data.

Typically we need to represent types of entities, their attributes, types ofrelationships between entities, and relationship attributes

5 / 74

Data representational modelingIntegrity Constraints: Constraints to ensure data integrity (i.e., avoiderrors)

6 / 74

Data representational modelingIntegrity Constraints: Constraints to ensure data integrity (i.e., avoiderrors)

Manipulation Languages: Constructs for manipulating the data

6 / 74

Data representational modelingWe desire that models are:

sufficiently expressive so they can capture real­world data well,

easy to use,

lend themselves to defining computational methods that have goodperformance.

7 / 74

Data representational modelingSome examples of data models are

Relational, Entity­relationship model, XML...Object­oriented, Object­relational, RDF...Current favorites in the industry: JSON, Protocol Buffers, Avro, Thrift,Property Graph

8 / 74

Data representational modelingData independence: The idea that you can change therepresentation of data w/o changing programs that operate on it.

Physical data independence: I can change the layout of data ondisk and my programs won't change

index the datapartition/distribute/replicate the datacompress the datasort the data

9 / 74

Modeling constructs:

entities and their attributesrelationships and relationshipattributes.

Entities are objects represented in adataset: people, places, things, etc.

Relationships model just that,relationships between entities.

The Entity-Relationship and Relational Models

10 / 74

Diagrams:

rectangles are entititesdiamonds and edges indicaterelationshipsCircles describe either entity orrelationship attributes.

The Entity-Relationship and Relational Models

11 / 74

Arrows are used indicate multiplicityof relationships

The Entity-Relationship and Relational Models

12 / 74

The Entity-Relationship and Relational ModelsRelationships are defined over pairs of entities.

Relationship   over sets of entities   and   is defined over thecartesian product  .

For example: if   and  , then  .

R E1 E2

E1 × E2

e1 ∈ E1 e2 ∈ E2 (e1, e2) ∈ R

13 / 74

The Entity-Relationship and Relational ModelsArrows specify how entities participate in relationships.

For example: this diagram specifies that entities in   appear in only onerelationship pair.

That is, if  ,   and  , then there is no other pair.

E1

ei ∈ E1 ej ∈ E2 (ei, ej) ∈ R

(ei, ek) ∈ R

14 / 74

The Entity-Relationship and Relational ModelsIn databases and general datasets we work on, both Entities andRelationships are represented as Relations (tables).

15 / 74

The Entity-Relationship and Relational ModelsIn databases and general datasets we work on, both Entities andRelationships are represented as Relations (tables).

Such that a unique entity/relationship is represented by a single tuple(the list of attribute values that represent an entity or relationship).

15 / 74

The Entity-Relationship and Relational ModelsIn databases and general datasets we work on, both Entities andRelationships are represented as Relations (tables).

Such that a unique entity/relationship is represented by a single tuple(the list of attribute values that represent an entity or relationship).

How can we ensure uniqueness of entities?

15 / 74

The Entity-Relationship and Relational ModelsIn databases and general datasets we work on, both Entities andRelationships are represented as Relations (tables).

Such that a unique entity/relationship is represented by a single tuple(the list of attribute values that represent an entity or relationship).

How can we ensure uniqueness of entities?

keys are an essential ingredient to uniquely identify entities andrelationships in tables.

15 / 74

Formal introduction to keysAttribute set   is a superkey of relation   if values for   aresufficient to identify a unique tuple of each possible relation 

Example: {SSN} and {SSN,name} are both superkeys of person

K R K

r(R)

16 / 74

Formal introduction to keysAttribute set   is a superkey of relation   if values for   aresufficient to identify a unique tuple of each possible relation 

Example: {SSN} and {SSN,name} are both superkeys of person

Superkey   is a candidate key if   is minimalExample: {SSN} is a candidate key for person

K R K

r(R)

K K

16 / 74

Formal introduction to keysAttribute set   is a superkey of relation   if values for   aresufficient to identify a unique tuple of each possible relation 

Example: {SSN} and {SSN,name} are both superkeys of person

Superkey   is a candidate key if   is minimal

Example: {SSN} is a candidate key for person

One of the candidate keys is selected to be the primary key

Typically one that is small and immutable (doesn’t change often)Primary key typically highlighted in ER diagram

K R K

r(R)

K K

16 / 74

Formal introduction to keysForeign key: Primary key of a relation that appears in anotherrelation

{SSN} from person appears in employsperson called referenced relationemploys is the referencing relation

17 / 74

Formal introduction to keysForeign key: Primary key of a relation that appears in anotherrelation

{SSN} from person appears in employsperson called referenced relationemploys is the referencing relation

Foreign key constraint: the tuple corresponding to that primary keymust exist

Imagine:Tuple: ('123-45-6789', 'Apple')in employsBut no tuple corresponding to '123­45­6789' in person

Also called referential integrity constraint 17 / 74

Tidy DataWe use the term Tidy Data to refer to datasets that are represented in aform that is amenable for manipulation and statistical modeling.

It is very closely related to the concept of normal forms in the ER modeland the process of normalization in the database literature.

18 / 74

Tidy DataHere we assume we are working in the ER data model represented asrelations: rectangular data structures where

1. Each attribute (or variable) forms a column2. Each entity (or observation) forms a row3. Each type of entity (observational unit) forms a table

19 / 74

Tidy DataHere is an example of a tidy dataset: One entity per row, a singleattribute per column. Only information about flights included.

year month day dep_time sched_dep_time dep_delay arr_time sched_arr_

2013 1 1 517 515 2 830

2013 1 1 533 529 4 850

20 / 74

Structure Query LanguageThe Structured­Query­Language (SQL) is the predominant languageused in database systems.

It is tailored to the Relational data representation model.

SQL is a declarative language, we don't write a procedure to compute arelation, we declare what the relation we want to compute looks like.

21 / 74

Structure Query LanguageThe basic construct in SQL is the so­called SFW construct: select­from­where which specifies:

select: which attributes you want the answer to havefrom: which relation (table) you want the answer to be computed fromwhere: what conditions you want to be satisfied by the rows (tuples)of the answer

22 / 74

Structure Query LanguageE.g.: movies produced by Disney in 1990: note the rename

select m.title, m.year

from movie m

where m.studioname = 'disney' and m.year = 1990

23 / 74

Structure Query LanguageThe select clause can contain expressions (this is paralleled by themutate operation we saw previously)

select title || ' (' || to_char(year) || ')' astitleyearselect 2014 - year

24 / 74

Structure Query LanguageThe where clause support a large number of different predicates andcombinations thereof (this is parallel to the filter operation)

year between 1990 and 1995title like 'star wars%' title like 'star wars _'

25 / 74

Structure Query LanguageWe can include ordering, e.g., find distinct movies sorted by title

select distinct title

from movie

where studioname = 'disney' and year = 1990

order by title;

26 / 74

Structure Query Language

Group-by and summarize

SQL has an idiom for grouping and summarizing

E.g., compute the average movie length by year

select name, avg(length)

from movie

group by year

27 / 74

Two-table operationsSo far we have looked at data operations defined over single tables anddata frames.

In this section we look at efficient methods to combine data from multipletables.

The fundamental operation here is the join, which is a workhorse ofdatabase system design and impementation.

28 / 74

Two-table operations

The join operation:

Combines rows from two tables to create a new single table

Based on matching criteria specified over attributes of each of the twotables.

29 / 74

Two-table operationsConsider a database of flights and airlines:

flights

## # A tibble: 336,776 x 19

## year month day dep_time sched_dep_time dep_delay arr_time

## <int> <int> <int> <int> <int> <dbl> <int>

## 1 2013 1 1 517 515 2 830

## 2 2013 1 1 533 529 4 850

## 3 2013 1 1 542 540 2 923

## 4 2013 1 1 544 545 -1 1004

## 5 2013 1 1 554 600 -6 812 30 / 74

Two-table operations

airlines

## # A tibble: 16 x 2

## carrier name

## <chr> <chr>

## 1 9E Endeavor Air Inc.

## 2 AA American Airlines Inc.

## 3 AS Alaska Airlines Inc.

## 4 B6 JetBlue Airways

## 5 DL Delta Air Lines Inc.

## 6 EV ExpressJet Airlines Inc.

## 7 F9 Frontier Airlines Inc. 31 / 74

Two-table operationsHere, we want to add airline information to each flight.

Join the attributes of the respective airline from the airlines table withthe flights table based on the values of attributes flights$carrierand airlines$carrier.

32 / 74

Two-table operationsEvery row of flights with a specific value for flights$carrier, isjoined with the the corresponding row in airlines with the same valuefor airlines$carrier.

33 / 74

Two-table operationsThere are multiple ways of performing this operation that differ on hownon­matching observations are handled.

34 / 74

Two-table operations

Left JoinIn a left join, all observations on left operand (LHS) are retained:

35 / 74

Two-table operationsOther operations:

right join: all observations in RHS are retained

outer join: all observations are retained (full join)

inner join: only matching observations are retained

Details in lecture notes

36 / 74

Two-table operations

Join conditionsAll join operations are based on a matching condition:

flights %>%

inner_join(airlines, by="carrier")

specifies to join observations where flights$carrier equalsairlines$carrier.

37 / 74

Two-table operationsIn this case, where no conditions are specified using the by argument:

flights %>%

left_join(airlines)

a natural join is perfomed. In this case all variables with the same namein both tables are used in join condition.

38 / 74

Two-table operationsYou can also specify join conditions on arbitrary attributes using the byargument.

flights %>%

left_join(airlines, by=c("carrier" = "name"))

39 / 74

Two-table operations

SQL Constructs: Multi-table QueriesKey idea:

Do a join to combine multiple tables into an appropriate tableUse SFW constructs for single­table queries

40 / 74

Two-table operations

SQL Constructs: Multi-table QueriesKey idea:

Do a join to combine multiple tables into an appropriate tableUse SFW constructs for single­table queries

For the first part, where we use a join to get an appropriate table, thegeneral SQL construct includes:

The name of the first table to joinThe type of join to doThe name of the second table to join 40 / 74

Two-table operations

select title, year, me.name as producerName

from movies m join movieexec me

where m.producer = me.id;

41 / 74

Entity Resolution and Record LinkageOften, we will be faced with the problem of data integration:

combine two (or more) datasets from different sources

that may contain information about the same entities.

42 / 74

Entity Resolution and Record LinkageOften, we will be faced with the problem of data integration:

combine two (or more) datasets from different sources

that may contain information about the same entities.

But,... the attributes in the two datasets may not be the same,

42 / 74

Entity Resolution and Record LinkageOften, we will be faced with the problem of data integration:

combine two (or more) datasets from different sources

that may contain information about the same entities.

But,... the attributes in the two datasets may not be the same,

42 / 74

Entity Resolution and Record Linkage

43 / 74

Entity Resolution and Record Linkage

43 / 74

Entity Resolution and Record LinkageThese are examples of a general problem referred to as EntityResolution and Record Linkage.

44 / 74

Entity Resolution and Record Linkage

Problem De�nition

Given: Entity sets   and  ,

Find: Linked entities   with   and  .

E1 E2

(e1, e2) e1 ∈ E1 e2 ∈ E2

45 / 74

Entity Resolution and Record Linkage

One approach: similarity function

Define a similarity function between entities   and 

Link entities with high similarity.

e1 e2

46 / 74

Entity Resolution and Record LinkageDefine similarity as an additive function over some set of sharedattributes  :

with   a similarity function defined for each attribute  ,

A

s(e1, e2) = ∑j∈A

sj(e1[j], e2[j])

sj j

47 / 74

Entity Resolution and Record Linkage

Example attribute functions

Categorical attribute: pairs of entities with the same value are moresimilar to each other than pairs of entities with different values. E.g.,

sj(e1[j], e2[j]) = { 1 if e1[j] == e2[j]

0 o. w.

48 / 74

Entity Resolution and Record Linkage

Example attribute functions

Continuous attribute: pairs of entities with values that are close to eachother are more similar than pairs of entities with values that are farther toeach other.

Note that to specify close or far we need to introduce some notion ofdistance. We can use Euclidean distance for example,

dj(e1[j], e2[j]) = (e1[j] − e2[j])2;  sj(e1[j], e2[j]) = e−dj(e1[j],e2[j])

49 / 74

Entity Resolution and Record Linkage

Example attribute functions

Text attributes: based on edit distance between strings rather thanEuclidean distance. We can use domain knowledge to specify similarity.

For example, fact that John and Johnathan are similar requires domainknowledge of common usage of English names.

50 / 74

Solving the resolution problemNeed a rule to match entities we think are linked.

This depends on assumptions we make about the dataset, similar toassumptions we made when performing joins.

51 / 74

Solving the resolution problemModel the entity resolution problem as an optimization problem:

maximize objective function (based on similarity)

over possible sets   of valid pairs  , where set   constraintspairs based on problem­specific assumptions.

V (e1, e2) V

R = arg maxV

∑(e1,e2)∈V

s(e1, e2)

52 / 74

Solving the resolution problem

Many-to-one resolutions

Constrain sets   to represent many­to­one resolutions.

Thus, entities in   can only appear once in pairs in  , but entities may appear more than once.

In this case, we can match   where

V

e1 V e2

(e1, e2)

e2 = arg maxe∈E2

s(e1, e)

53 / 74

Solving the resolution problem

One-to-one resolutions

Suppose we constrain sets   to those that represent one­to­oneresolutions:

If   then   and   appear in only one pair in  .

In this case, we have a harder computational problem. In fact, this is aninstance of the maximum bipartite matching problem, and would look atnetwork flow algorithms to solve.

V

(e1, e2) ∈ V e1 e2 V

54 / 74

Solving the resolution problem

Other constraints

We can add additional constraints to   to represent other information wehave about the task.

A common one would be to only allow pairs   to havesimilarity above some threshold  . I.e.,   only if 

.

V

(e1, e2) ∈ V

t (e1, e2) ∈ V

s(e1, e2) ≥ t

55 / 74

Solving the resolution problem

Discussion

The procedure outlined above is an excellent first attempt to solve theEntity Resolution problem.

This is a classical problem in Data Science for which a variety ofapproaches and methods are in use.

56 / 74

Database Query OptimizationEarlier we made the distinction that SQL is a declarative language ratherthan a procedural language.

A reason why data base systems rely on a declarative language is that itallows the system to decide how to evaluate a query most efficiently.

57 / 74

Database Query OptimizationConsider a Baseball database where we have two tables Batting andMaster

what is the maximum batting "average" for a player from the state ofCalifornia?

select max(1.0 * b.H / b.AB) as best_ba

from Batting as b join Master as m on b.playerId = m.playerId

where b.AB >= 100 and m.birthState = "CA"

58 / 74

Database Query OptimizationNow, let's do the same computation using dplyr operations:

Version 1:

Batting %>%

inner_join(Master, by="playerID") %>%

filter(AB >= 100, birthState == "CA") %>%

mutate(AB=1.0 * H / AB) %>%

summarize(max(AB))

## max(AB)

## 1 0.4057018 59 / 74

Database Query OptimizationVersion 2:

Batting %>%

filter(AB >= 100) %>%

inner_join(

filter(Master, birthState == "CA")) %>%

mutate(AB = 1.0 * H / AB) %>%

summarize(max(AB))

## max(AB)

## 1 0.4057018

60 / 74

Database Query OptimizationWhich should be most efficient? Think about a simple cost model. Thecostliest operation here is the join between two tables.

61 / 74

Database Query OptimizationWhat is the cost of this algorithm?  .

For the rest of the operations, let's assume we perform this with a singlepass through the table.

For example, we assume that filter(T) has cost  .

|T1| × |T2|

|T |

62 / 74

Database Query OptimizationLet's write out the cost of each of the two pipelines.

Batting %>%

inner_join(Master, by="playerID") %>% # cost: |Batting| x |Master|

filter(AB >= 100, birthState == "CA") %>% # cost: |R1|

mutate(AB=1.0 * H / AB) %>% # cost: |R|

summarize(max(AB)) # cost: |R|

63 / 74

Database Query OptimizationCost of version 1 is

: inner join between Batting and Master  : is   filtered to rowswith AB >=100 & birthState == "CA".

In this example: 2.08e+09

|Batting| × |Master| + |R1| + 2|R|

R1 R R1

64 / 74

Database Query OptimizationNow, let's look at the second version.

Batting %>%

filter(AB >= 100) %>% # cost: |Batting|

inner_join(

Master %>% filter(birthState == "CA") # cost: |Master|

) %>% # cost: |B1| x |M1|

mutate(AB = 1.0 * H / AB) %>% # cost |R|

summarize(max(AB)) # cost |R|

65 / 74

Database Query OptimizationCost of version 2 is 

: Batting filtered to include only rows with AB >= 100  :Master filtered to include birthState == "CA".

In our example: 8.95e+07

|Batting| × |Master| + |B1| × |M1| + 2|R|

B1 M2

66 / 74

Database Query OptimizationVersion 1 (join tables before filtering) is 23 times costlier.

When using SQL in a database system we only write the one querydescribing our desired result,

With the procedural (dplyr) we need to think which of the two versionsis more efficient.

67 / 74

Database Query OptimizationDatabase systems use query optimization to decide how to evaluatequeries efficiently.

The goal of query optimization is to decide the most efficient query planto use to evaluate a query out of the many possible candidate plans itcould use.

It needs to solve two problems: search the space of possible plans,approximate the cost of evaluating a specific plan.

68 / 74

Database Query OptimizationThink of the two procedural versions above as two candidate plans thatthe DB system could use to evaluate the query.

Query optimzation approximates what it would cost to evaluate each ofthe two plans and decides to use the most efficient plan.

69 / 74

Semi-structured Data Representation ModelThe Entity­Relational data model we have described so far is mostlydefined for structured data: where a specific and consistent schema isassumed.

Data models like XML and JSON are instead intended for semi­structured data.

70 / 74

Semi-structured Data Representation Model

XML: eXtensible Markup Language

Data models like XML rely on flexible, self­describing schemas:

<?xml version="1.0" encoding="UTF-8"?>

<!-- Edited by XMLSpy -->

<CATALOG>

<CD>

<TITLE>Empire Burlesque</TITLE>

<ARTIST>Bob Dylan</ARTIST>

<COUNTRY>USA</COUNTRY>

<COMPANY>Columbia</COMPANY> 71 / 74

Semi-structured Data Representation Model

JSON: Javascript Object Notation

{

"firstName": "John",

"lastName": "Smith",

"isAlive": true,

"age": 25,

"height_cm": 167.6,

"address": {

"streetAddress": "21 2nd Street",

"city": "New York",

"state": "NY",72 / 74

Semi-structured Data Representation ModelThis is the format most contemporary data REST APIs use to transferdata. For instance, here is part of a JSON record from a Twitter stream:

{

"created_at":"Sun May 05 14:01:34+00002013",

"id":331046012875583488,

"id_str":"331046012875583488",

"text":"\u0425\u043e\u0447\u0443, \u0447\u0442\u043e\u0431 \u0442\u044b \u0441\u0434\u0

"source":"\u003ca href=\"http:\/\/twitterfeed.com\"rel=\"nofollow\"\u003etwitterfeed\u0

"in_reply_to_user_id_str":null,

"user":{

"id":548422428, 73 / 74

SummaryWe have looked at specifics of Data Representation Modeling

Entity Relationship and Relational Models

Definition of Tidy Data

Joining tables

Entity Resolution

Models for semi­structured data

74 / 74

Introduction to Data Science: DataRepresentation Models

Héctor Corrada Bravo

University of Maryland, College Park, USA 2020­02­12

Page 2: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

OverviewPrinciples of preparing and organizing data in a way that is amenable foranalysis.

1 / 74

Page 3: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

OverviewPrinciples of preparing and organizing data in a way that is amenable foranalysis.

Data representation model: collection of concepts that describes howdata is represented and accessed.

1 / 74

Page 4: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

OverviewPrinciples of preparing and organizing data in a way that is amenable foranalysis.

Data representation model: collection of concepts that describes howdata is represented and accessed.

Thinking abstractly of data structure, beyond a specific implementation,makes it easier to share data across programs and systems, andintegrate data from different sources.

1 / 74

Page 5: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

OverviewStructure: We have assumed that data is organized in rectangulardata structures (tables with rows and columns)Semantics: We have discussed the notion of values, attributes, andentities.

2 / 74

Page 6: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

OverviewStructure: We have assumed that data is organized in rectangulardata structures (tables with rows and columns)Semantics: We have discussed the notion of values, attributes, andentities.

So far, data semantics: a dataset is a collection of values, numeric orcategorical, organized into entities (observations) and attributes(variables).

2 / 74

Page 7: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

OverviewStructure: We have assumed that data is organized in rectangulardata structures (tables with rows and columns)Semantics: We have discussed the notion of values, attributes, andentities.

So far, data semantics: a dataset is a collection of values, numeric orcategorical, organized into entities (observations) and attributes(variables).

Each attribute contains values of a specific measurement across entities,and entities collect all measurements across attributes.

2 / 74

Page 8: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

OverviewIn the database literature, we call this exercise of defining structure andsemantics as data modeling.

3 / 74

Page 9: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

OverviewIn the database literature, we call this exercise of defining structure andsemantics as data modeling.

In this course we use the term data representational modeling, todistinguish from data statistical modeling.

3 / 74

Page 10: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Data representational modelingData model: A collection of concepts that describes how data isrepresented and accessedSchema: A description of a specific collection of data, using a givendata model

4 / 74

Page 11: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Data representational modelingModeling Constructs: A collection of concepts used to represent thestructure in the data.

Typically we need to represent types of entities, their attributes, types ofrelationships between entities, and relationship attributes

5 / 74

Page 12: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Data representational modelingIntegrity Constraints: Constraints to ensure data integrity (i.e., avoiderrors)

6 / 74

Page 13: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Data representational modelingIntegrity Constraints: Constraints to ensure data integrity (i.e., avoiderrors)

Manipulation Languages: Constructs for manipulating the data

6 / 74

Page 14: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Data representational modelingWe desire that models are:

sufficiently expressive so they can capture real­world data well,

easy to use,

lend themselves to defining computational methods that have goodperformance.

7 / 74

Page 15: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Data representational modelingSome examples of data models are

Relational, Entity­relationship model, XML...Object­oriented, Object­relational, RDF...Current favorites in the industry: JSON, Protocol Buffers, Avro, Thrift,Property Graph

8 / 74

Page 16: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Data representational modelingData independence: The idea that you can change therepresentation of data w/o changing programs that operate on it.

Physical data independence: I can change the layout of data ondisk and my programs won't change

index the datapartition/distribute/replicate the datacompress the datasort the data

9 / 74

Page 17: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Modeling constructs:

entities and their attributesrelationships and relationshipattributes.

Entities are objects represented in adataset: people, places, things, etc.

Relationships model just that,relationships between entities.

The Entity-Relationship and Relational Models

10 / 74

Page 18: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Diagrams:

rectangles are entititesdiamonds and edges indicaterelationshipsCircles describe either entity orrelationship attributes.

The Entity-Relationship and Relational Models

11 / 74

Page 19: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Arrows are used indicate multiplicityof relationships

The Entity-Relationship and Relational Models

12 / 74

Page 20: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

The Entity-Relationship and Relational ModelsRelationships are defined over pairs of entities.

Relationship   over sets of entities   and   is defined over thecartesian product  .

For example: if   and  , then  .

R E1 E2

E1 × E2

e1 ∈ E1 e2 ∈ E2 (e1, e2) ∈ R

13 / 74

Page 21: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

The Entity-Relationship and Relational ModelsArrows specify how entities participate in relationships.

For example: this diagram specifies that entities in   appear in only onerelationship pair.

That is, if  ,   and  , then there is no other pair.

E1

ei ∈ E1 ej ∈ E2 (ei, ej) ∈ R

(ei, ek) ∈ R

14 / 74

Page 22: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

The Entity-Relationship and Relational ModelsIn databases and general datasets we work on, both Entities andRelationships are represented as Relations (tables).

15 / 74

Page 23: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

The Entity-Relationship and Relational ModelsIn databases and general datasets we work on, both Entities andRelationships are represented as Relations (tables).

Such that a unique entity/relationship is represented by a single tuple(the list of attribute values that represent an entity or relationship).

15 / 74

Page 24: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

The Entity-Relationship and Relational ModelsIn databases and general datasets we work on, both Entities andRelationships are represented as Relations (tables).

Such that a unique entity/relationship is represented by a single tuple(the list of attribute values that represent an entity or relationship).

How can we ensure uniqueness of entities?

15 / 74

Page 25: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

The Entity-Relationship and Relational ModelsIn databases and general datasets we work on, both Entities andRelationships are represented as Relations (tables).

Such that a unique entity/relationship is represented by a single tuple(the list of attribute values that represent an entity or relationship).

How can we ensure uniqueness of entities?

keys are an essential ingredient to uniquely identify entities andrelationships in tables.

15 / 74

Page 26: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Formal introduction to keysAttribute set   is a superkey of relation   if values for   aresufficient to identify a unique tuple of each possible relation 

Example: {SSN} and {SSN,name} are both superkeys of person

K R K

r(R)

16 / 74

Page 27: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Formal introduction to keysAttribute set   is a superkey of relation   if values for   aresufficient to identify a unique tuple of each possible relation 

Example: {SSN} and {SSN,name} are both superkeys of person

Superkey   is a candidate key if   is minimalExample: {SSN} is a candidate key for person

K R K

r(R)

K K

16 / 74

Page 28: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Formal introduction to keysAttribute set   is a superkey of relation   if values for   aresufficient to identify a unique tuple of each possible relation 

Example: {SSN} and {SSN,name} are both superkeys of person

Superkey   is a candidate key if   is minimal

Example: {SSN} is a candidate key for person

One of the candidate keys is selected to be the primary key

Typically one that is small and immutable (doesn’t change often)Primary key typically highlighted in ER diagram

K R K

r(R)

K K

16 / 74

Page 29: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Formal introduction to keysForeign key: Primary key of a relation that appears in anotherrelation

{SSN} from person appears in employsperson called referenced relationemploys is the referencing relation

17 / 74

Page 30: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Formal introduction to keysForeign key: Primary key of a relation that appears in anotherrelation

{SSN} from person appears in employsperson called referenced relationemploys is the referencing relation

Foreign key constraint: the tuple corresponding to that primary keymust exist

Imagine:Tuple: ('123-45-6789', 'Apple')in employsBut no tuple corresponding to '123­45­6789' in person

Also called referential integrity constraint 17 / 74

Page 31: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Tidy DataWe use the term Tidy Data to refer to datasets that are represented in aform that is amenable for manipulation and statistical modeling.

It is very closely related to the concept of normal forms in the ER modeland the process of normalization in the database literature.

18 / 74

Page 32: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Tidy DataHere we assume we are working in the ER data model represented asrelations: rectangular data structures where

1. Each attribute (or variable) forms a column2. Each entity (or observation) forms a row3. Each type of entity (observational unit) forms a table

19 / 74

Page 33: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Tidy DataHere is an example of a tidy dataset: One entity per row, a singleattribute per column. Only information about flights included.

year month day dep_time sched_dep_time dep_delay arr_time sched_arr_

2013 1 1 517 515 2 830

2013 1 1 533 529 4 850

20 / 74

Page 34: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Structure Query LanguageThe Structured­Query­Language (SQL) is the predominant languageused in database systems.

It is tailored to the Relational data representation model.

SQL is a declarative language, we don't write a procedure to compute arelation, we declare what the relation we want to compute looks like.

21 / 74

Page 35: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Structure Query LanguageThe basic construct in SQL is the so­called SFW construct: select­from­where which specifies:

select: which attributes you want the answer to havefrom: which relation (table) you want the answer to be computed fromwhere: what conditions you want to be satisfied by the rows (tuples)of the answer

22 / 74

Page 36: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Structure Query LanguageE.g.: movies produced by Disney in 1990: note the rename

select m.title, m.year

from movie m

where m.studioname = 'disney' and m.year = 1990

23 / 74

Page 37: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Structure Query LanguageThe select clause can contain expressions (this is paralleled by themutate operation we saw previously)

select title || ' (' || to_char(year) || ')' astitleyearselect 2014 - year

24 / 74

Page 38: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Structure Query LanguageThe where clause support a large number of different predicates andcombinations thereof (this is parallel to the filter operation)

year between 1990 and 1995title like 'star wars%' title like 'star wars _'

25 / 74

Page 39: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Structure Query LanguageWe can include ordering, e.g., find distinct movies sorted by title

select distinct title

from movie

where studioname = 'disney' and year = 1990

order by title;

26 / 74

Page 40: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Structure Query Language

Group-by and summarize

SQL has an idiom for grouping and summarizing

E.g., compute the average movie length by year

select name, avg(length)

from movie

group by year

27 / 74

Page 41: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Two-table operationsSo far we have looked at data operations defined over single tables anddata frames.

In this section we look at efficient methods to combine data from multipletables.

The fundamental operation here is the join, which is a workhorse ofdatabase system design and impementation.

28 / 74

Page 42: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Two-table operations

The join operation:

Combines rows from two tables to create a new single table

Based on matching criteria specified over attributes of each of the twotables.

29 / 74

Page 43: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Two-table operationsConsider a database of flights and airlines:

flights

## # A tibble: 336,776 x 19

## year month day dep_time sched_dep_time dep_delay arr_time

## <int> <int> <int> <int> <int> <dbl> <int>

## 1 2013 1 1 517 515 2 830

## 2 2013 1 1 533 529 4 850

## 3 2013 1 1 542 540 2 923

## 4 2013 1 1 544 545 -1 1004

## 5 2013 1 1 554 600 -6 812 30 / 74

Page 44: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Two-table operations

airlines

## # A tibble: 16 x 2

## carrier name

## <chr> <chr>

## 1 9E Endeavor Air Inc.

## 2 AA American Airlines Inc.

## 3 AS Alaska Airlines Inc.

## 4 B6 JetBlue Airways

## 5 DL Delta Air Lines Inc.

## 6 EV ExpressJet Airlines Inc.

## 7 F9 Frontier Airlines Inc. 31 / 74

Page 45: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Two-table operationsHere, we want to add airline information to each flight.

Join the attributes of the respective airline from the airlines table withthe flights table based on the values of attributes flights$carrierand airlines$carrier.

32 / 74

Page 46: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Two-table operationsEvery row of flights with a specific value for flights$carrier, isjoined with the the corresponding row in airlines with the same valuefor airlines$carrier.

33 / 74

Page 47: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Two-table operationsThere are multiple ways of performing this operation that differ on hownon­matching observations are handled.

34 / 74

Page 48: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Two-table operations

Left JoinIn a left join, all observations on left operand (LHS) are retained:

35 / 74

Page 49: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Two-table operationsOther operations:

right join: all observations in RHS are retained

outer join: all observations are retained (full join)

inner join: only matching observations are retained

Details in lecture notes

36 / 74

Page 50: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Two-table operations

Join conditionsAll join operations are based on a matching condition:

flights %>%

inner_join(airlines, by="carrier")

specifies to join observations where flights$carrier equalsairlines$carrier.

37 / 74

Page 51: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Two-table operationsIn this case, where no conditions are specified using the by argument:

flights %>%

left_join(airlines)

a natural join is perfomed. In this case all variables with the same namein both tables are used in join condition.

38 / 74

Page 52: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Two-table operationsYou can also specify join conditions on arbitrary attributes using the byargument.

flights %>%

left_join(airlines, by=c("carrier" = "name"))

39 / 74

Page 53: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Two-table operations

SQL Constructs: Multi-table QueriesKey idea:

Do a join to combine multiple tables into an appropriate tableUse SFW constructs for single­table queries

40 / 74

Page 54: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Two-table operations

SQL Constructs: Multi-table QueriesKey idea:

Do a join to combine multiple tables into an appropriate tableUse SFW constructs for single­table queries

For the first part, where we use a join to get an appropriate table, thegeneral SQL construct includes:

The name of the first table to joinThe type of join to doThe name of the second table to join 40 / 74

Page 55: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Two-table operations

select title, year, me.name as producerName

from movies m join movieexec me

where m.producer = me.id;

41 / 74

Page 56: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Entity Resolution and Record LinkageOften, we will be faced with the problem of data integration:

combine two (or more) datasets from different sources

that may contain information about the same entities.

42 / 74

Page 57: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Entity Resolution and Record LinkageOften, we will be faced with the problem of data integration:

combine two (or more) datasets from different sources

that may contain information about the same entities.

But,... the attributes in the two datasets may not be the same,

42 / 74

Page 58: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Entity Resolution and Record LinkageOften, we will be faced with the problem of data integration:

combine two (or more) datasets from different sources

that may contain information about the same entities.

But,... the attributes in the two datasets may not be the same,

42 / 74

Page 59: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Entity Resolution and Record Linkage

43 / 74

Page 60: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Entity Resolution and Record Linkage

43 / 74

Page 61: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Entity Resolution and Record LinkageThese are examples of a general problem referred to as EntityResolution and Record Linkage.

44 / 74

Page 62: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Entity Resolution and Record Linkage

Problem De�nition

Given: Entity sets   and  ,

Find: Linked entities   with   and  .

E1 E2

(e1, e2) e1 ∈ E1 e2 ∈ E2

45 / 74

Page 63: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Entity Resolution and Record Linkage

One approach: similarity function

Define a similarity function between entities   and 

Link entities with high similarity.

e1 e2

46 / 74

Page 64: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Entity Resolution and Record LinkageDefine similarity as an additive function over some set of sharedattributes  :

with   a similarity function defined for each attribute  ,

A

s(e1, e2) = ∑j∈A

sj(e1[j], e2[j])

sj j

47 / 74

Page 65: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Entity Resolution and Record Linkage

Example attribute functions

Categorical attribute: pairs of entities with the same value are moresimilar to each other than pairs of entities with different values. E.g.,

sj(e1[j], e2[j]) = { 1 if e1[j] == e2[j]

0 o. w.

48 / 74

Page 66: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Entity Resolution and Record Linkage

Example attribute functions

Continuous attribute: pairs of entities with values that are close to eachother are more similar than pairs of entities with values that are farther toeach other.

Note that to specify close or far we need to introduce some notion ofdistance. We can use Euclidean distance for example,

dj(e1[j], e2[j]) = (e1[j] − e2[j])2;  sj(e1[j], e2[j]) = e−dj(e1[j],e2[j])

49 / 74

Page 67: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Entity Resolution and Record Linkage

Example attribute functions

Text attributes: based on edit distance between strings rather thanEuclidean distance. We can use domain knowledge to specify similarity.

For example, fact that John and Johnathan are similar requires domainknowledge of common usage of English names.

50 / 74

Page 68: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Solving the resolution problemNeed a rule to match entities we think are linked.

This depends on assumptions we make about the dataset, similar toassumptions we made when performing joins.

51 / 74

Page 69: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Solving the resolution problemModel the entity resolution problem as an optimization problem:

maximize objective function (based on similarity)

over possible sets   of valid pairs  , where set   constraintspairs based on problem­specific assumptions.

V (e1, e2) V

R = arg maxV

∑(e1,e2)∈V

s(e1, e2)

52 / 74

Page 70: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Solving the resolution problem

Many-to-one resolutions

Constrain sets   to represent many­to­one resolutions.

Thus, entities in   can only appear once in pairs in  , but entities may appear more than once.

In this case, we can match   where

V

e1 V e2

(e1, e2)

e2 = arg maxe∈E2

s(e1, e)

53 / 74

Page 71: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Solving the resolution problem

One-to-one resolutions

Suppose we constrain sets   to those that represent one­to­oneresolutions:

If   then   and   appear in only one pair in  .

In this case, we have a harder computational problem. In fact, this is aninstance of the maximum bipartite matching problem, and would look atnetwork flow algorithms to solve.

V

(e1, e2) ∈ V e1 e2 V

54 / 74

Page 72: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Solving the resolution problem

Other constraints

We can add additional constraints to   to represent other information wehave about the task.

A common one would be to only allow pairs   to havesimilarity above some threshold  . I.e.,   only if 

.

V

(e1, e2) ∈ V

t (e1, e2) ∈ V

s(e1, e2) ≥ t

55 / 74

Page 73: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Solving the resolution problem

Discussion

The procedure outlined above is an excellent first attempt to solve theEntity Resolution problem.

This is a classical problem in Data Science for which a variety ofapproaches and methods are in use.

56 / 74

Page 74: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Database Query OptimizationEarlier we made the distinction that SQL is a declarative language ratherthan a procedural language.

A reason why data base systems rely on a declarative language is that itallows the system to decide how to evaluate a query most efficiently.

57 / 74

Page 75: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Database Query OptimizationConsider a Baseball database where we have two tables Batting andMaster

what is the maximum batting "average" for a player from the state ofCalifornia?

select max(1.0 * b.H / b.AB) as best_ba

from Batting as b join Master as m on b.playerId = m.playerId

where b.AB >= 100 and m.birthState = "CA"

58 / 74

Page 76: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Database Query OptimizationNow, let's do the same computation using dplyr operations:

Version 1:

Batting %>%

inner_join(Master, by="playerID") %>%

filter(AB >= 100, birthState == "CA") %>%

mutate(AB=1.0 * H / AB) %>%

summarize(max(AB))

## max(AB)

## 1 0.4057018 59 / 74

Page 77: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Database Query OptimizationVersion 2:

Batting %>%

filter(AB >= 100) %>%

inner_join(

filter(Master, birthState == "CA")) %>%

mutate(AB = 1.0 * H / AB) %>%

summarize(max(AB))

## max(AB)

## 1 0.4057018

60 / 74

Page 78: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Database Query OptimizationWhich should be most efficient? Think about a simple cost model. Thecostliest operation here is the join between two tables.

61 / 74

Page 79: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Database Query OptimizationWhat is the cost of this algorithm?  .

For the rest of the operations, let's assume we perform this with a singlepass through the table.

For example, we assume that filter(T) has cost  .

|T1| × |T2|

|T |

62 / 74

Page 80: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Database Query OptimizationLet's write out the cost of each of the two pipelines.

Batting %>%

inner_join(Master, by="playerID") %>% # cost: |Batting| x |Master|

filter(AB >= 100, birthState == "CA") %>% # cost: |R1|

mutate(AB=1.0 * H / AB) %>% # cost: |R|

summarize(max(AB)) # cost: |R|

63 / 74

Page 81: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Database Query OptimizationCost of version 1 is

: inner join between Batting and Master  : is   filtered to rowswith AB >=100 & birthState == "CA".

In this example: 2.08e+09

|Batting| × |Master| + |R1| + 2|R|

R1 R R1

64 / 74

Page 82: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Database Query OptimizationNow, let's look at the second version.

Batting %>%

filter(AB >= 100) %>% # cost: |Batting|

inner_join(

Master %>% filter(birthState == "CA") # cost: |Master|

) %>% # cost: |B1| x |M1|

mutate(AB = 1.0 * H / AB) %>% # cost |R|

summarize(max(AB)) # cost |R|

65 / 74

Page 83: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Database Query OptimizationCost of version 2 is 

: Batting filtered to include only rows with AB >= 100  :Master filtered to include birthState == "CA".

In our example: 8.95e+07

|Batting| × |Master| + |B1| × |M1| + 2|R|

B1 M2

66 / 74

Page 84: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Database Query OptimizationVersion 1 (join tables before filtering) is 23 times costlier.

When using SQL in a database system we only write the one querydescribing our desired result,

With the procedural (dplyr) we need to think which of the two versionsis more efficient.

67 / 74

Page 85: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Database Query OptimizationDatabase systems use query optimization to decide how to evaluatequeries efficiently.

The goal of query optimization is to decide the most efficient query planto use to evaluate a query out of the many possible candidate plans itcould use.

It needs to solve two problems: search the space of possible plans,approximate the cost of evaluating a specific plan.

68 / 74

Page 86: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Database Query OptimizationThink of the two procedural versions above as two candidate plans thatthe DB system could use to evaluate the query.

Query optimzation approximates what it would cost to evaluate each ofthe two plans and decides to use the most efficient plan.

69 / 74

Page 87: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Semi-structured Data Representation ModelThe Entity­Relational data model we have described so far is mostlydefined for structured data: where a specific and consistent schema isassumed.

Data models like XML and JSON are instead intended for semi­structured data.

70 / 74

Page 88: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Semi-structured Data Representation Model

XML: eXtensible Markup Language

Data models like XML rely on flexible, self­describing schemas:

<?xml version="1.0" encoding="UTF-8"?>

<!-- Edited by XMLSpy -->

<CATALOG>

<CD>

<TITLE>Empire Burlesque</TITLE>

<ARTIST>Bob Dylan</ARTIST>

<COUNTRY>USA</COUNTRY>

<COMPANY>Columbia</COMPANY> 71 / 74

Page 89: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Semi-structured Data Representation Model

JSON: Javascript Object Notation

{

"firstName": "John",

"lastName": "Smith",

"isAlive": true,

"age": 25,

"height_cm": 167.6,

"address": {

"streetAddress": "21 2nd Street",

"city": "New York",

"state": "NY",72 / 74

Page 90: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

Semi-structured Data Representation ModelThis is the format most contemporary data REST APIs use to transferdata. For instance, here is part of a JSON record from a Twitter stream:

{

"created_at":"Sun May 05 14:01:34+00002013",

"id":331046012875583488,

"id_str":"331046012875583488",

"text":"\u0425\u043e\u0447\u0443, \u0447\u0442\u043e\u0431 \u0442\u044b \u0441\u0434\u0

"source":"\u003ca href=\"http:\/\/twitterfeed.com\"rel=\"nofollow\"\u003etwitterfeed\u0

"in_reply_to_user_id_str":null,

"user":{

"id":548422428, 73 / 74

Page 91: Representation Models Introduction to Data Science: Data · Text attributes : based on edit distance between strings rather than Euclidean distance. We can use domain knowledge to

SummaryWe have looked at specifics of Data Representation Modeling

Entity Relationship and Relational Models

Definition of Tidy Data

Joining tables

Entity Resolution

Models for semi­structured data

74 / 74