  • CTSeq: A Gene Sequence Analysis Database Built for Speed (Code-tolerant presentation)

    September 11, 2001

    James E. Ries, M.S., Depts. CECS & HMI

    Gordon K. Springer, Ph.D., Depts. CECS & IATS

  • Abstract

    The discovery and analysis of genomic sequence information generates huge quantities of raw data. To deal with this data effectively, we have created a custom database system called CTSeq, which is based on Faircom Corporation's CTree database toolkit. This system provides high-performance access to a wide variety of sequence information with very low system overhead. CTSeq is modular and general, and should be adaptable to a wide variety of large-scale data.

  • Background (cont.)

  • Swine female reproductive structures and embryos are removed at various times of gestation.

    Gene annotation is quickly enhanced via access to other databases through the MU Internet2 high-speed network.

    Patterns of gene expression are analyzed using microarray technology to reveal the mechanisms controlling reproductive efficiency.

    Increased efficiency, quality, and profitability is the goal.

    High-throughput DNA sequencing at the DNA Core Facility

    Sequence data are analyzed on-campus at UMC.

  • Motivation

    Many current bioinformatics projects are file-based.

    Monsanto Project Example:

        A single plate (96 wells) generates over 500 data files.

        Currently (9/11/2001) there are approximately 394 plates (500 * 394 = 197,000 files).

        Raw plate data alone for the project takes up over 4 GB of space.

    Analysis programs are typically Unix command-line tools.

  • Motivation (cont.)

    Many projects run without any infrastructure.

    Researchers must cut and paste data from one application to another.

    Some data must be converted to a different format before another application can use it, and this is difficult if not impossible for many researchers.

  • Motivation (cont.)

    Even powerful machines can be slow in processing this enormous amount of data.

    Darwin is a Compaq Tru64 UNIX box with 4 processors and 2 GB (4 GB?) of RAM.

    Darwin takes approximately 1 week to completely search GenBank for Monsanto project sequences.

  • Motivation (cont.)

    There are approximately 11,101,000,000 bases in 10,106,000 sequence records as of December 2000.

    Source: http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html

  • Solution Discussion

    Our solution is to build a custom DBMS using Faircom's ISAM toolkit (C-Tree+).

        Allows multiple interfaces (C, C++, ODBC, CGI, etc.)

        Provides complete control of important implementation details (for performance)

        Avoids costly and less effective shrink-wrapped solutions

    Our DBMS (CTSeq) ties together all bioinformatic data and internally does the necessary transformations.

  • Solution Discussion (cont. C++)

    template < class DATATYPE, class PACKEDDATATYPE,
               int INDEX_COUNT, int INDEX_SEGMENT_COUNT >
    class CTTable : public CTBaseTable
    {
    private:
        typedef BOOL (*PACKUNPACKFUNC)(DATATYPE * pDTIn, PACKEDDATATYPE * pDTOut);
    public:
        ...

        // Methods
        BOOL Open(char * pszTableName=NULL);
        BOOL Close();
        COUNT GetFileNo();
        COUNT GetIndexFileNo(char * pszIndexName);
        BOOL AddRecord(/* In */ void * pVoid, int iSize);
        BOOL UpdateRecord(/* In */ void * pVoid, int iSize);
        ...
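    The PACKUNPACKFUNC typedef above suggests that each table converts between an in-memory record type and a packed, stored type. The slides do not show what the packing does, so the following is only a minimal sketch under my own assumptions: the record layouts, the DEMO_MAX_BASES limit, and both function names are hypothetical, and the unpack direction is simply the same signature with the argument types swapped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

typedef int BOOL;  // stand-in for the toolkit's BOOL type (assumption)

const std::size_t DEMO_MAX_BASES = 64;  // hypothetical limit for this sketch

// Fixed-size, in-memory form of a record (hypothetical layout).
struct DemoRecord {
    char szBases[DEMO_MAX_BASES];  // NUL-terminated base string
};

// Packed form: only the first uLength bytes are meaningful (hypothetical layout).
struct PackedDemoRecord {
    unsigned short uLength;      // number of bases stored
    char bases[DEMO_MAX_BASES];  // not NUL-terminated
};

// Same shape as PACKUNPACKFUNC: BOOL f(DATATYPE * pIn, PACKEDDATATYPE * pOut).
BOOL PackDemoRecord(DemoRecord * pIn, PackedDemoRecord * pOut) {
    if (pIn == NULL || pOut == NULL) return 0;
    // Reject input that is not terminated inside its buffer.
    const void * pEnd = std::memchr(pIn->szBases, '\0', DEMO_MAX_BASES);
    if (pEnd == NULL) return 0;
    const std::size_t n =
        static_cast<std::size_t>(static_cast<const char *>(pEnd) - pIn->szBases);
    assert(n < DEMO_MAX_BASES);
    pOut->uLength = static_cast<unsigned short>(n);
    std::memcpy(pOut->bases, pIn->szBases, n);
    return 1;
}

// Reverse direction, with the argument types swapped (assumption).
BOOL UnpackDemoRecord(PackedDemoRecord * pIn, DemoRecord * pOut) {
    if (pIn == NULL || pOut == NULL) return 0;
    if (pIn->uLength >= DEMO_MAX_BASES) return 0;  // leave room for the NUL
    std::memcpy(pOut->szBases, pIn->bases, pIn->uLength);
    pOut->szBases[pIn->uLength] = '\0';
    return 1;
}
```

    A table class could hold one function pointer for each direction and call it just before writing and just after reading, so that callers only ever see the unpacked form.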

  • Solution Discussion (cont. C)

    CEXTERN BOOL InitCTDB();
    CEXTERN BOOL UnInitCTDB();
    CEXTERN BOOL OpenCTTable(/* In */ TABLETYPE tt, /* Out */ CTDBHANDLE * phDB);
    CEXTERN BOOL OpenCTTableByName(/* In */ TABLETYPE tt,
                                   /* Out */ CTDBHANDLE * phDB,
                                   /* In */ char * pszDBName);
    CEXTERN BOOL CloseCTDB(/* In */ CTDBHANDLE hDB);
    CEXTERN BOOL FindFirstCTRecord(/* In */ CTDBHANDLE hDB,
                                   /* Out */ void * pSeq);
    CEXTERN BOOL FindNextCTRecord(/* In */ CTDBHANDLE hDB,
                                  /* Out */ void * pSeq);
    CEXTERN BOOL AddCTRecord(/* In */ CTDBHANDLE hDB,
                             /* In */ void * pSeq,
                             /* In */ int iRecordSize);

    . . .
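    One common way to put a flat C interface like this in front of a C++ table class is to make the handle an opaque pointer to the object, so that each C function casts the handle back and forwards the call. The sketch below illustrates that pattern with an in-memory stand-in; DEMOHANDLE, DemoRow, DemoTable, and all four functions are hypothetical names, and nothing here is taken from the CTSeq source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

typedef int BOOL;          // stand-in for the toolkit's BOOL type (assumption)
typedef void * DEMOHANDLE; // hypothetical opaque handle

// Hypothetical record type for this sketch.
struct DemoRow {
    char szPlateName[16];
    int  iSeqStart;
    int  iSeqEnd;
};

// In-memory stand-in for a C++ table class.
class DemoTable {
public:
    DemoTable() : m_pos(0) {}
    void Add(const DemoRow & row) { m_rows.push_back(row); }
    bool First(DemoRow * pOut) { m_pos = 0; return Next(pOut); }
    bool Next(DemoRow * pOut) {
        assert(pOut != NULL);
        if (m_pos >= m_rows.size()) return false;  // no more records
        *pOut = m_rows[m_pos++];
        return true;
    }
private:
    std::vector<DemoRow> m_rows;
    std::size_t m_pos;  // cursor position for First/Next
};

extern "C" {

// The handle handed back to C callers is simply the object's address.
BOOL OpenDemoTable(DEMOHANDLE * phDB) {
    if (phDB == NULL) return 0;
    *phDB = new DemoTable();
    return 1;
}

BOOL AddDemoRecord(DEMOHANDLE hDB, void * pRow) {
    if (hDB == NULL || pRow == NULL) return 0;
    static_cast<DemoTable *>(hDB)->Add(*static_cast<DemoRow *>(pRow));
    return 1;
}

BOOL FindFirstDemoRecord(DEMOHANDLE hDB, void * pRow) {
    if (hDB == NULL || pRow == NULL) return 0;
    return static_cast<DemoTable *>(hDB)->First(static_cast<DemoRow *>(pRow)) ? 1 : 0;
}

BOOL FindNextDemoRecord(DEMOHANDLE hDB, void * pRow) {
    if (hDB == NULL || pRow == NULL) return 0;
    return static_cast<DemoTable *>(hDB)->Next(static_cast<DemoRow *>(pRow)) ? 1 : 0;
}

BOOL CloseDemoTable(DEMOHANDLE hDB) {
    delete static_cast<DemoTable *>(hDB);
    return 1;
}

}  // extern "C"
```

    A caller opens the table, loops with FindFirstDemoRecord and FindNextDemoRecord until one of them returns 0, and then closes the handle. The C side never sees the class definition, which keeps the header usable from plain C.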

  • Solution Discussion (cont.)

    typedef struct _SEQUENCE
    {
        TEXT szPlateName[MAX_PLATE_NAME];
        TEXT szWellName[MAX_WELL_NAME];
        int iSeqStart;
        int iSeqEnd;
        int iRestrictionStart;
        int iRestrictionEnd;
        int iLibTagStart;
        int iLibTagEnd;
        int iPolyAStart;
        int iPolyAEnd;
        int iPolyASignalStart;
        int iPolyASignalEnd;
        int iVectorStart;
        int iVectorEnd;
        char cStatus;
        TEXT szBases[MAX_BASES];
        unsigned char uPhredScores[MAX_BASES];
    } SEQUENCE, * PSEQUENCE;
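    The start and end fields in the SEQUENCE record mark regions of the base string. As a rough illustration of how such offsets might be used, the sketch below pulls out the region between iSeqStart and iSeqEnd. The slides do not say whether the offsets are zero-based or inclusive, so this assumes a zero-based, half-open range [iSeqStart, iSeqEnd); DemoSequence and TrimmedBases are hypothetical names, and the 64-byte buffer is an arbitrary size chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical subset of the SEQUENCE record, for illustration only.
struct DemoSequence {
    int  iSeqStart;    // first base of the usable region (assumed zero-based)
    int  iSeqEnd;      // one past the last usable base (assumed exclusive)
    char szBases[64];  // NUL-terminated base string
};

// Returns the bases in [iSeqStart, iSeqEnd), or an empty string when the
// offsets do not describe a valid range inside szBases.
std::string TrimmedBases(const DemoSequence & seq) {
    const std::string bases(seq.szBases);
    if (seq.iSeqStart < 0 || seq.iSeqEnd < seq.iSeqStart ||
        static_cast<std::size_t>(seq.iSeqEnd) > bases.size()) {
        return std::string();
    }
    return bases.substr(static_cast<std::size_t>(seq.iSeqStart),
                        static_cast<std::size_t>(seq.iSeqEnd - seq.iSeqStart));
}
```

    The same pattern would apply to the other offset pairs (restriction site, library tag, poly-A, vector), provided they follow the same convention.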

  • Solution Discussion (cont.)

    ////////////////////////////////////////
    // CHROMATOGRAM Table
    ////////////////////////////////////////
    typedef struct _CHROMATOGRAM
    {
        TEXT szPlateName[MAX_PLATE_NAME];
        TEXT szWellName[MAX_WELL_NAME];
        int iChromatogramSize;
        unsigned char blobChromatogram[MAX_CHROM_SIZE];
    } CHROMATOGRAM, * PCHROMATOGRAM;

  • Solution Discussion (cont.)

    ////////////////////////////////////////
    // TRANSACTION Table
    ////////////////////////////////////////
    typedef struct _TRANSACTION
    {
        TEXT szPlateName[MAX_PLATE_NAME];
        TEXT szWellName[MAX_WELL_NAME];
        TEXT szUserID[MAX_USER_ID];
        int iSeqStart;
        int iSeqEnd;
        time_t timeStamp;
        char cStatus;
        TEXT szComment[MAX_COMMENT];
    } TRANSACTION, * PTRANSACTION;

  • Solution Discussion (cont.)

    typedef struct _CLUSTER
    {
        TEXT szPlateName[MAX_PLATE_NAME];
        TEXT szWellName[MAX_WELL_NAME];
        TEXT szClusterClassName[MAX_CLUSTER_NAME]; // e.g., "Folicle" for a "T" type cluster
        unsigned int uClusterNumber;
        unsigned char bSeqType;     // Primary - P; Secondary - S
        unsigned char bClusterType; // Plate - P; Library - L;
                                    // Tissue - T; Unigene (project) - U
    } CLUSTER, * PCLUSTER;

  • Solution Discussion (cont.)

    typedef struct _BLAST
    {
        TEXT szPlateName[MAX_PLATE_NAME];
        TEXT szWellName[MAX_WELL_NAME];
        TEXT szSource[MAX_SOURCE];
        TEXT szGI[MAX_GI];
        TEXT szAccession[MAX_ACCESSION];
        TEXT szLocus[MAX_LOCUS];
        TEXT szAnnotation[MAX_ANNOTATION];
    } BLAST, * PBLAST;

  • Solution Discussion (cont.)

    static inline CTSequenceTable * NewCTSequenceTable()
    {
        const int iSegOffSets[]={0,MAX_PLATE_NAME,MAX_PLATE_NAME+MAX_WELL_NAME};
        . . .
        return (new CTSequenceTable("Sequence",
                                    SEQUENCE_FIXED,
                                    USHRT_MAX,
                                    SHARED | VLENGTH,
                                    SHARED | ctFIXED,
                                    iSegOffSets,
                                    iSegLengths,
                                    iSegModes,
                                    iKeyLens,
                                    iKeyTypes,
                                    iKeyAllowDups,
                                    iKeyNullChecks,
                                    . . .
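    The segment offsets in the factory function line up with the first fields of the record: the plate name at byte 0, the well name at MAX_PLATE_NAME, and the next field at MAX_PLATE_NAME + MAX_WELL_NAME. The sketch below shows, under my own assumptions, how byte offsets and lengths of this kind can describe a composite key. The sizes, iDemoSegLengths, and BuildKey are hypothetical (the slide elides the real length array), and the code does not reproduce how the toolkit itself builds keys.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sizes, chosen so the struct below has no padding.
const int DEMO_MAX_PLATE_NAME = 16;
const int DEMO_MAX_WELL_NAME  = 4;

// Hypothetical leading fields of a record.
struct DemoSequenceKeyFields {
    char szPlateName[DEMO_MAX_PLATE_NAME];
    char szWellName[DEMO_MAX_WELL_NAME];
    int  iSeqStart;
};

// Byte offsets of each segment, mirroring the shape of iSegOffSets.
const int iDemoSegOffsets[] = {0, DEMO_MAX_PLATE_NAME,
                               DEMO_MAX_PLATE_NAME + DEMO_MAX_WELL_NAME};
// Byte lengths of the two name segments (assumption).
const int iDemoSegLengths[] = {DEMO_MAX_PLATE_NAME, DEMO_MAX_WELL_NAME};

// Copies each (offset, length) segment of a record into pKey, back to back,
// and returns the total key length in bytes.
std::size_t BuildKey(const void * pRecord, const int * pOffsets,
                     const int * pLengths, int nSegments,
                     unsigned char * pKey) {
    const unsigned char * p = static_cast<const unsigned char *>(pRecord);
    std::size_t n = 0;
    for (int i = 0; i < nSegments; ++i) {
        const std::size_t len = static_cast<std::size_t>(pLengths[i]);
        std::memcpy(pKey + n, p + pOffsets[i], len);
        n += len;
    }
    return n;
}
```

    Using the first two segments yields a plate-plus-well key, so records for one plate sort together and a well can be located within its plate.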

  • Solution Discussion (cont.)

    // A simple example of using the CTSeq library
    InitCTDB();
    if (OpenCTTableByName(SEQUENCE_TABLE,&hDB,pszSequenceTable))
    {
        StrUpper(szPlateName);
        fRC=FirstLenFieldQuery(hDB,"PlateName",szPlateName,sizeof(szPlateName),&seq);
        while (fRC)
        {
            printf("%s-%s : start=%d, end=%d\n",
                   seq.szPlateName,seq.szWellName,seq.iSeqStart,seq.iSeqEnd);
            fRC=NextFieldQuery(hDB,&seq);
        }
        CloseCTDB(hDB);
    }
    UnInitCTDB();

  • Benchmarks

    We benchmarked our Sequence Quality table.

    27,456 records

    Records contain:

        Plate Name
        Well Name
        Bases
        Various offsets (restriction site, lib tag, etc.)
        Status
        Quality Scores (Phred)

  • Benchmarks (cont.)

    Charts (one bar per configuration listed in the table below): Binary Sizes (bytes), Data File Sizes (bytes), Sequential Read (records/second), Simple Query (records/second).

    Raw Numbers

    DB Type                     Table     File Size (bytes)  Binary Size (bytes)  Startup Time (secs)  Sequential Read (recs/sec)  Simple Query (recs/sec)
    CTSeq (client)              SEQUENCE  44,758,222         458,752              0.00                 10,969.24                   7,150.00
    Access - Client Cursor      SEQUENCE  35,500,032         2,000,000            2.08                 341.26                      4,783.33
    Access - Server Cursor      SEQUENCE  35,500,032         2,000,000            2.00                 4,421.97                    2,831.68
    SQL Server - Client Cursor  SEQUENCE  60,948,480         26,187,651           3.77                 345.44                      4,783.33
    SQL Server - Server Cursor  SEQUENCE  60,948,480         26,187,651           3.14                 573.81                      582.48
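    The raw numbers are easier to compare as ratios against CTSeq. Dividing the CTSeq throughput by each alternative gives a sequential-read advantage ranging from roughly 2.5 times (Access, server cursor) to roughly 32 times (Access, client cursor), and a simple-query advantage ranging from roughly 1.5 times to roughly 12 times. The sketch below redoes that arithmetic from the table values; the struct and function names are my own and are not part of CTSeq.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// One row of the benchmark table (throughput columns only).
struct BenchmarkRow {
    const char * pszName;
    double dSequentialRead;  // records per second
    double dSimpleQuery;     // records per second
};

// Values copied from the "Raw Numbers" table; row 0 is CTSeq.
static const BenchmarkRow kBenchmarks[] = {
    {"CTSeq (client)",             10969.24, 7150.00},
    {"Access - Client Cursor",       341.26, 4783.33},
    {"Access - Server Cursor",      4421.97, 2831.68},
    {"SQL Server - Client Cursor",   345.44, 4783.33},
    {"SQL Server - Server Cursor",   573.81,  582.48},
};
static const std::size_t kBenchmarkCount =
    sizeof(kBenchmarks) / sizeof(kBenchmarks[0]);

// How many times faster CTSeq is than another configuration.
double Speedup(double dCTSeq, double dOther) {
    assert(dOther > 0.0);
    return dCTSeq / dOther;
}

// Prints one line per non-CTSeq configuration with both ratios.
void PrintSpeedups() {
    const BenchmarkRow & base = kBenchmarks[0];
    for (std::size_t i = 1; i < kBenchmarkCount; ++i) {
        std::printf("%-28s sequential read x%.2f, simple query x%.2f\n",
                    kBenchmarks[i].pszName,
                    Speedup(base.dSequentialRead, kBenchmarks[i].dSequentialRead),
                    Speedup(base.dSimpleQuery, kBenchmarks[i].dSimpleQuery));
    }
}
```

    Calling PrintSpeedups() from a small main() prints the four comparison lines. These ratios describe only this one table and workload, so they should not be read as a general ranking of the products.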

  • A Straw Man

    Isn't this just another case of "Not Invented Here" foolish pride?

    Standard solutions such as Microsoft SQL Server or Oracle generally don't provide the control needed for a truly performance-bound project.

    Server-side extensions, which would be necessary if using standard solutions, would be just as proprietary as CTree.

    Oracle or MS SQL Server typically require a full-time administrator and are generally expensive.

  • A Straw Man (cont.)

    However, off-the-shelf DBMS solutions have advantages:

        Existing support infrastructure
        Flexibility
        Mature development tools

    So, a custom DBMS is really only needed where truly high performance is required.

  • We've only just begun

    Currently, only core data is in our DB

        Have trimming/quality information
        Have chromatograms (binary blobs)

    Adding BLAST results

    Adding cluster information

    Adding transaction history

    Plan to integrate our security system, which currently uses another DBMS

    Switch to client-server (currently using stand-alone compile)

  • We've only just begun (cont.)

    Server-side processing modules

        BLAST
        MSA
        GCG?

    Automatic compression / decompression

    Automatic encryption / decryption

    ODBC Driver installation/test

  • Acknowledgements

    Faircom Corporation

    Monsanto Company

    National Library of Medicine

  • Questions

    http://swine.rnet.missouri.edu/

    http://swine.rnet.missouri.edu/Demo/index.html

    http://jimries.com/SeqCTree/

    [email protected]

    Many UNIX command-line tools abound: BLAST from NCBI, MSA, GCG, Phred, Phrap, Cross_match, Clusterd.

    One of the criticisms of CTree is that it is a fairly raw C API. We've written a nice OO wrapper that is applicable to any CTree table and which we feel dramatically improves the usability of CTree.

    Talk about how CTSEQDBHANDLE is really the "this" pointer.