1 sql, databases, and ensembl modules. 2 please look for next lecture ensembl api tutorial:

33
1 SQL, Databases, and Ensembl Modules

Upload: grant-malone

Post on 11-Jan-2016

221 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

1

SQL, Databases, and Ensembl Modules

Page 2: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

2

Please look for next lecture

Ensembl API Tutorial:http://www.ensembl.org/info/software/core/core_tutorial.html

Page 3: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

3

Survey of Databases(if time)

Ensembl web interface

Ensembl Biomart

Page 4: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

4

RDBMS• Oracle and Sybase (many others)

– industry standard, commercial products– development and management tools

• PostgresSQL– full-featured relational DBMS– open source– found in most linux distributions– handles unusual datatypes well which adds flexibility for future

extensions• MySQL

– open-source relational DBMS– easy to setup and use– Linux/Windows/Mac

• Each has variations– SQL, datatypes, functions, features

Page 5: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

5

Structured Query Language (SQL)

• SQL is the standard language used to create, modify, maintain, and query relational databases.

• SQL commands are issued within the context of the DBMS interface

• SQL commands can be passed to the DBMS by other programs (such as perl, PHP, Java, manually, etc.)

• Many venders have extensions (Sybase-SQL)• SQL Example:• select FIELD from TABLE where CONDITION• select TABLE.FIELD from TABLE where CONDITION• select program from software where program like 'blast%';

Page 6: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

6

SQL datatypes

• datatype -- describes the data stored in a particular column of a table

• typically is either numeric or character strings

• SQL defines subtypes that set different upper limits on the size of text or numerical data

• Also -- special types such as DATE, MONEY

Page 7: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

7

SQL datatypes

• INT• FLOAT• REAL (larger float)• CHAR -- fixed length text string• TEXT -- variable length text• BLOB -- variable length binary field• DECIMAL -- real number stored as character string• DATE • TIMESTAMP -- value updates every time the record is

modified• ENUM -- limited set of options (numeric or named)• SET -- value is one of a limited set

Page 8: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

8

Creating a new table in the database

CREATE TABLE tablename (columnane type [modifiers], columanme type [modifiers])

create table software_package ( packid INT NOT NULL PRIMARY KEY AUTO_INCREMENT,

packname VARCHAR(100),packurl VARCHAR(255),function TEXT,keyword ENUM,os SET,format SET,archfile VARCHAR(255) );

Page 9: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

9

Inserting data

INSERT INTO table(colname1, colname2, columname3) VALUES ('value1','value2','value3')

Page 10: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

10

Accessing Data -- SELECT

SELECT [fields] FROM [table] where [condition]

select * from [table] where [condition]

select packname, packurl from software

select packname, packurl from software where keyword = "sequence alignment";

select packname, packurl from software where packname like "B%";

Page 11: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

11

Joining multiple tables

• SELECT can "join" two related tables• relationships between tables are created by

replicating information ("primary key") from one table as a "foreign key" in another table

select attribute from table1, table2 where primarykey1 = foreignkey2

SELECT packname, packurl, reference_date FROM software, reference WHERE software.package_id = reference.package_id AND reference_date >= 1998;

Page 12: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

12

ExampleAgentsAgentID First Last HireDate Phone

100 Mike Hernandez 5/16/95 553-3992101 Greg Piercy 10/15/95 790-3992102 Kathy Ehrlich 3/1/96 351-4993

ClientsClientID AgentID First Last Phone City

9001 100 Stewart Jameson 553-3999 El Paso9002 101 Shannon McLain 555-5555 Atlanta9003 102 Estela Pundt 541-4993 El Paso

EntertainersEntertainerIDAgentID First Last ….

3000 100 John Slade …3001 101 Mark Jebavy …3002 102 Teresa Weiss …

EngagementsClientID EntertainerIDEngagementDate StartTime StopTime

9003 3001 4/1/96 1:00 PM 3:30 PM9002 3000 4/13/96 9:00 PM 1:30 AM9001 3002 5/2/96 3:00 PM 6:00 PM

select Clients.Last, Clients.First, Clients.Phone from Clients where Clients.City = “El Paso” order by Clients.Last, Clients.First

select Last, First, Phone from Clients where City = "El Paso" order by Last, First

Page 13: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

13

SQL Basics• select * from table_name• select count(*) from table_name• Find start times of Entertainers where Jameson is the Client• Nested query (may not always be supported)

– select StartTime from Engagements where ClientID = (select ClientID from Clients where Last = “Jameson”)

• Join– select Engagements.StartTime, Clients.ClientID from Engagements,

Clients where Engagements.ClientID = Clients.ClientID and Client.Last = “Jameson”

– select Engagements.StartTime from Engagements,Clients where Engagements.ClientID = Clients.ClientID and Client.Last = “Jameson”

• insert into TABLE values (val1, val2, …)• insert into TABLE (FIELDa, FIELDb, …FIELDn) values (value1,

value2, …)• insert into Clients values (“Terry”, “Braun”, 11/10/03,555-6666)

Page 14: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

14

…as a spreadsheetAgents

Clients

Entertainers

Entertainers

Engagements

Engagements

Agents

First

Last HireDate

Phone

ClientID

AgentID

First

Last Phone

City EntertainerID

AgentID

First

Last ClientID

EntertainerID

EngagementDate

StartTime

StopTime

100 Mike Hernandez

5/17/99

553-3992

9001 100 Stewart

Jameson

553-3999

El Paso

3000 100 John Slade

9001 3002 5/3/00

3:00 PM

6:00 PM

101 Greg Piercy

10/16/99

790-3992

9002 101 Shannon

McLain

555-5555

Atlanta

3001 101 Mark Jebavy

9002 3000 4/14/00

9:00 PM

1:30 AM

102 Kathy

Ehrlich

3/2/00

351-4993

9003 102 Estela

Pundt

541-4993

El Paso

3002 102 Teresa

Weiss

9003 3001 4/2/00

1:00 PM

3:30 PM

Flattened table structure, and lined up the rows.

Appears to be okay….

Except…

The relationships in this trivial example are more complicated than a

simple flat structure such as a spreadsheet.

Example) The first Engagement is between Client 9001, and Entertainer 3002

Page 15: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

15

SQL Basics

delete TABLE where FIELD = value

update TABLE set FIELD = value

drop DATABASE

Page 16: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

16

exon

transcript_id

exon_num

sequence_start

sequence_stop

intron

transcript_id

intron_num

sequence_start

sequence_stop

primer_pair

id

transcript_id

left_primer_id

right_primer_id

transcript

id

sequence_id

source

source_id

sequence

id

target_id

type

sequence

chr_name

strand

genomic_start

genomic_stop

source

source_id

refresh

target

id

date

gene_name

description

accession

status

project

id

name

description

date

set_table

id

project_id

name

date

description

set_target_join

set_id

target_id

rank

cas_rank

cas_options

Page 17: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

17

exon

transcript_id

exon_num = 3

sequence_start

sequence_stop

intron

transcript_id

intron_num = 3

sequence_start

sequence_stop

primer_pair

id

transcript_id

left_primer_id

right_primer_id

transcript

id

sequence_id

source = Ensembl

source_id

sequence

id

target_id

type = nucleotide

sequence = ATG…

chr_name = 15

strand = 1

genomic_start = 15,123,120

genomic_stop = 16,378,131

source

source_id

refresh

target

id

date

gene_name = BBS4

description

accession

status

project

id

name = pro1

description

date

set_table

id

project_id

name = testset

date

description

target_set_info

set_id

target_id

rank = 5

cas_rank

cas_options

Sample Data

Page 18: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

18

Extracting Data from a DB

• This simple figure provides me with enough understanding of the database structure to extract data

Page 19: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

19

Local SQL Example:• We have gene name (BBS4), and would like to find project name,

and the exon number, start/stops• select id from target where gene_name = "BBS4"

– 11• select set_table.id from set_table, target_set_info where

target_set_info.set_id = set_table.id and target_set_info.target_id = TARGET_ID– 3

• select project.name from project, set_table where set_table.project_id = project.id and set_table.id = SET_ID– project333

• select id from sequence where target_id = TARGET_ID– 33

• select id from transcript where sequence_id = SEQUENCE_ID– 68

• select exon_num, sequence_start, sequence_stop from exon where transcript_id = TRANSCRIPT_ID

• Whew.

Page 20: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

20

MySQL demo (gscr)

• mysql -h gscr -u tabraun –p

• use trapss;

Page 21: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

21

TrAPSS Database

Show tables;

Show columns from table TABLE_NAME;

Page 22: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

22

exon

transcript_id

exon_num = 3

sequence_start

sequence_stop

intron

transcript_id

intron_num = 3

sequence_start

sequence_stop

primer_pair

id

transcript_id

left_primer_id

right_primer_id

transcript

id

sequence_id

source = Ensembl

source_id

sequence

id

target_id

type = nucleotide

sequence = ATG…

chr_name = 15

strand = 1

genomic_start = 15,123,120

genomic_stop = 16,378,131

source

source_id

refresh

target

id

date

gene_name = BBS4

description

accession

status

project

id

name = pro1

description

date

set_table

id

project_id

name =testset

date

description

target_set_info

set_id

target_id

rank = 5

cas_rank

cas_options

select id from target where gene_name = "BBS4";

Page 23: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

23

exon

transcript_id

exon_num = 3

sequence_start

sequence_stop

intron

transcript_id

intron_num = 3

sequence_start

sequence_stop

primer_pair

id

transcript_id

left_primer_id

right_primer_id

transcript

id

sequence_id

source = Ensembl

source_id

sequence

id

target_id

type = nucleotide

sequence = ATG…

chr_name = 15

strand = 1

genomic_start = 15,123,120

genomic_stop = 16,378,131

source

source_id

refresh

target

id = 23

date

gene_name = BBS4

description

accession

status

project

id

name = pro1

description

date

set_table

id

project_id

name = testset

date

description

target_set_info

set_id

target_id = 23

rank = 5

cas_rank

cas_options

select set_table.id from set_table, target_set_info where target_set_info.set_id = set_table.id and target_set_info.target_id = 23; (3 rows ==> in 3 sets)

Page 24: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

24

exon

transcript_id

exon_num = 3

sequence_start

sequence_stop

intron

transcript_id

intron_num = 3

sequence_start

sequence_stop

primer_pair

id

transcript_id

left_primer_id

right_primer_id

transcript

id

sequence_id

source = Ensembl

source_id

sequence

id

target_id

type = nucleotide

sequence = ATG…

chr_name = 15

strand = 1

genomic_start = 15,123,120

genomic_stop = 16,378,131

source

source_id

refresh

target

id

date

gene_name = BBS4

description

accession

status

project

id

name = pro1

description

date

set_table

id = 7

project_id

name =testset

date

description

target_set_info

set_id =

target_id

rank = 5

cas_rank

cas_options

select project.name from project, set_table where set_table.project_id = project.id and set_table.id = 7;

Page 25: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

25

exon

transcript_id

exon_num = 3

sequence_start

sequence_stop

intron

transcript_id

intron_num = 3

sequence_start

sequence_stop

primer_pair

id

transcript_id

left_primer_id

right_primer_id

transcript

id

sequence_id

source = Ensembl

source_id

sequence

id

target_id = 23

type = nucleotide

sequence = ATG…

chr_name = 15

strand = 1

genomic_start = 15,123,120

genomic_stop = 16,378,131

source

source_id

refresh

target

id

date

gene_name = BBS4

description

accession

status

project

id

name = pro1

description

date

set_table

id

project_id

name = testset

date

description

target_set_info

set_id

target_id

rank = 5

cas_rank

cas_options

select id from sequence where target_id = 23;

Page 26: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

26

exon

transcript_id

exon_num = 3

sequence_start

sequence_stop

intron

transcript_id

intron_num = 3

sequence_start

sequence_stop

primer_pair

id

transcript_id

left_primer_id

right_primer_id

transcript

id

sequence_id = 2216

source = Ensembl

source_id

sequence

id

target_id

type = nucleotide

sequence = ATG…

chr_name = 15

strand = 1

genomic_start = 15,123,120

genomic_stop = 16,378,131

source

source_id

refresh

target

id

date

gene_name = BBS4

description

accession

status

project

id

name = pro1

description

date

set_table

id

project_id

name = testset

date

description

target_set_info

set_id

target_id

rank = 5

cas_rank

cas_options

select id from transcript where sequence_id = 2216;

Page 27: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

27

exon

transcript_id = 5053

exon_num = 3

sequence_start

sequence_stop

intron

transcript_id

intron_num = 3

sequence_start

sequence_stop

primer_pair

id

transcript_id

left_primer_id

right_primer_id

transcript

id= 5053

sequence_id

source = Ensembl

source_id

sequence

id

target_id

type = nucleotide

sequence = ATG…

chr_name = 15

strand = 1

genomic_start = 15,123,120

genomic_stop = 16,378,131

source

source_id

refresh

target

id

date

gene_name = BBS4

description

accession

status

project

id

name = pro1

description

date

set_table

id

project_id

name = testset

date

description

target_set_info

set_id

target_id

rank = 5

cas_rank

cas_options

select exon_num, sequence_start, sequence_stop from exon where transcript_id = 5053;

Page 28: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

28

SQL Examples – One Large Query

select distinct exon_num, project.name, sequence_start, sequence_stop from exon, transcript, sequence, target, target_set_info, set_table, project where target.gene_name = "BBS4" and project.name = "pro1" and set_table.name = "testset" and set_table.project_id = project.id and target_set_info.set_id = set_table.id and target_set_info.target_id = target.id and sequence.target_id = target.id and transcript.sequence_id = sequence.id and exon.transcript_id = transcript.id order by exon_num;

Page 29: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

29

Executing SQL commands from a text file

http://www.mysql.org/doc/refman/5.0/en/batch-commands.html• However, it is also possible to put your SQL statements in a file and then tell

mysql to read its input from that file. To do so, create a text file text_file that contains the statements you wish to execute. Then invoke mysql as shown here:

• shell> mysql db_name < text_file

• If you place a USE db_name statement as the first statement in the file, it is unnecessary to specify the database name on the command line:

• shell> mysql < text_file

• If you are already running mysql, you can execute an SQL script file using the source or \. command:

• mysql> source file_name• mysql> \. file_name

Page 30: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

30

Remotely

mysql -h gscr.eng.uiowa.edu -u tabraun -p

Page 31: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

31

Setting up MySQL

• server -- database application that is running on the machine where the data is stored– runs as a "daemon" on Unix machines– a process that is always on, listening for and responding

to requests (from clients)– mysqld -- server

• client -- programs that connect to the server and request data– can be on same machine as server, or– located on other machines and connected by network– mysql -- client– mysqladmin -- administration tool

Page 32: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

32

MySLQ help

http://dev.mysql.com/

Show tables;

Show columns from table TABLE_NAME;

Show TrAPSS (time permitting)

Page 33: 1 SQL, Databases, and Ensembl Modules. 2 Please look for next lecture Ensembl API Tutorial:

33

Survey of Databases(if time)

Ensembl web interface

Ensembl Ensmart