Scaffolding Problems Gao Song 2010/04/27



Outline
  Concepts
  Problem definition
  Non-error Case
  Edge-error Case
  Disconnected Components
  Simulated Data
  Future Work

Concepts
  Contig:
  Edge (PET): library size
  Scaffolding: a sequence of contigs
  Happy Edge:
    Real distance <= expected distance
    Orientations of both contigs are correct

Problem Definition
  Version 1: Given a set of contigs and a set of edges, find a scaffold which has at most p unhappy edges
  Version 2: Given a set of contigs and a set of edges, find a scaffold which has at most p unhappy edges and is also the optimal solution
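A minimal sketch of the Version 1 check, counting unhappy edges in a candidate scaffold. The Edge fields, the count_unhappy name, and the model itself are assumptions made for illustration: contigs are packed end to end with no gaps, "real distance" is the gap between the two contigs, and each edge records which contig comes first and the orientation it implies for both ends.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    u: str           # contig expected to come first (hypothetical field)
    v: str           # contig expected to come second
    u_forward: bool  # orientation the PET implies for u
    v_forward: bool  # orientation the PET implies for v
    max_dist: int    # expected distance bound (library size)

def count_unhappy(scaffold, lengths, edges):
    """scaffold: list of (contig, is_forward) in left-to-right order."""
    start, orient = {}, {}
    pos = 0
    for name, fwd in scaffold:
        start[name], orient[name] = pos, fwd
        pos += lengths[name]
    unhappy = 0
    for e in edges:
        if e.u not in start or e.v not in start:
            continue  # edge not fully placed, so it is not judged here
        gap = start[e.v] - (start[e.u] + lengths[e.u])
        ok_orient = orient[e.u] == e.u_forward and orient[e.v] == e.v_forward
        # Happy edge: correct order, real distance <= expected, both orientations correct
        if not (0 <= gap <= e.max_dist and ok_orient):
            unhappy += 1
    return unhappy
```

Under these assumptions, a scaffold satisfies Version 1 when count_unhappy(scaffold, lengths, edges) <= p.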

Non-error Case
  Connected graph
  Partial Layout:
  Dangling Edge: only one end in partial layout
  Active region: the sequence from the first contig having dangling edges to the end of partial layout (shorter than the library size)
  Domain of a partial layout: all nodes in partial layout
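The definitions above can be sketched in a few lines. This assumes a partial layout is a list of contig names in order and an edge is a plain (u, v) pair; the function names are hypothetical.

```python
def dangling_edges(layout, edges):
    """Edges with exactly one end inside the partial layout."""
    placed = set(layout)
    return {(u, v) for u, v in edges if (u in placed) != (v in placed)}

def active_region(layout, edges):
    """Suffix of the layout starting at the first contig that has a dangling edge."""
    touched = {x for e in dangling_edges(layout, edges) for x in e}
    for i, contig in enumerate(layout):
        if contig in touched:
            return layout[i:]
    return []  # no dangling edges, so the active region is empty

def domain(layout):
    """All nodes in the partial layout."""
    return set(layout)
```

For the layout A, B, C with edges A-B, B-D, C-E, the dangling edges are B-D and C-E, and the active region is B, C.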

Non-error Case
  Theorem: if two partial layouts l1 and l2 have the same active region and dangling set, then
    (1) they have the same domain
    (2) both or neither of them can extend to a solution
  Proof:

Procedure
  Find the unassigned node
    Select the nearest node as next assigned node
  Update current partial layout
    Remove all dangling edges incident to new node
    Add new dangling edges of new node
    Remove contigs from active region

Main Procedure
  Find all nodes that have no ancestors and select one to start
  From an active region, get all unassigned nodes, and update the partial layout
  Remember all visited partial layouts
  If dangling edge set is empty, output the results
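A rough sketch of how the main procedure might look as a depth-first search that remembers visited partial layouts by (active region, dangling set), as the theorem allows. This is an illustration under simplifying assumptions, not the presented implementation: orientation is ignored, contigs are packed end to end, the graph is connected, and an edge (u, v, max_gap) means u precedes v with a gap of at most max_gap. All names are hypothetical.

```python
def find_scaffold(lengths, edges):
    """Return an ordering of all contigs with every edge happy, or None."""
    nodes = list(lengths)
    has_ancestor = {v for _, v, _ in edges}
    starts = [n for n in nodes if n not in has_ancestor]
    visited = set()  # keys of partial layouts already explored

    def happy_so_far(layout):
        start, pos = {}, 0
        for c in layout:
            start[c] = pos
            pos += lengths[c]
        for u, v, max_gap in edges:
            if u in start and v in start:
                gap = start[v] - (start[u] + lengths[u])
                if gap < 0 or gap > max_gap:
                    return False
        return True

    def key(layout):
        placed = set(layout)
        dangling = frozenset((u, v) for u, v, _ in edges
                             if (u in placed) != (v in placed))
        touched = {x for e in dangling for x in e}
        first = next((i for i, c in enumerate(layout) if c in touched),
                     len(layout))
        return tuple(layout[first:]), dangling

    def extend(layout):
        if len(layout) == len(nodes):  # connected graph: dangling set is empty
            return layout
        k = key(layout)
        if k in visited:  # same key means same domain and same outcome
            return None
        visited.add(k)
        for n in nodes:
            if n not in layout and happy_so_far(layout + [n]):
                result = extend(layout + [n])
                if result is not None:
                    return result
        return None

    for s in starts:
        result = extend([s])
        if result is not None:
            return result
    return None
```

The sketch tries every unassigned node rather than only the nearest one, which keeps it short at the cost of extra work.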

Time and space complexity
  Two possibilities
    k vertices in active region: one possible next node
    Fewer than k vertices in active region: n possible next nodes
  Complexity
    O(n^k) * O(1)
    O(n^(k-1)) * O(n)
  Total time complexity: O(n^k)
  Total space complexity: store all visited partial orders

Introduce Edge Error
  Types of edge error
    Chimeric PETs: Mapping error
    Misassembled contigs
  Solution
    Filtering: filter chimeric PETs
      Select x% of PETs
      Shuffle them to get chimeric PETs
      Cluster them to find threshold
    Local threshold
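One possible reading of the filtering steps, sketched below. The slide does not say how the clustering is done, so this assumes that "cluster" means grouping PETs by the pair of contigs they link, and that the threshold is the largest group that arises by chance after shuffling. The function names and the (contig_a, contig_b) representation are hypothetical, and the local threshold is not covered.

```python
import random
from collections import Counter

def chimeric_threshold(pets, fraction, seed=0):
    """pets: list of (contig_a, contig_b) pairs, one per PET."""
    rng = random.Random(seed)
    sample = rng.sample(pets, max(1, int(len(pets) * fraction)))
    left = [a for a, _ in sample]
    right = [b for _, b in sample]
    rng.shuffle(right)  # break the true pairing to simulate chimeric PETs
    clusters = Counter(tuple(sorted(p)) for p in zip(left, right)
                       if p[0] != p[1])
    return max(clusters.values(), default=0)

def filter_edges(pets, threshold):
    """Keep contig pairs supported by more PETs than the chance threshold."""
    support = Counter(tuple(sorted(p)) for p in pets if p[0] != p[1])
    return {pair for pair, n in support.items() if n > threshold}
```

A typical call would be filter_edges(pets, chimeric_threshold(pets, 0.1)), where 0.1 stands in for the x% on the slide.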


Introduce Edge Error
  There are p unhappy edges in final scaffolding
  Partial layout
    Dangling edges: real dangling edges; wrong edges

Equivalence Class
  Active region, dangling edge set, count of current wrong edges
  Same domain
  Assumption: the partial order is a connected graph

Get Unassigned Nodes
  Sort the unassigned nodes
  Properties of nodes:
    Steps to reach this node
    Distance to the end of active region
    Unhappy edges introduced due to this node

Sort Unassigned Nodes
  Breadth-first search
    Select the smallest possible distance: > threshold
  Sort nodes:
    Fewer than 5 steps: compare by distance; same distance: compare by unhappy edges
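The comparison rule can be expressed as a sort key. The Candidate record and its field names are hypothetical. The slide does not say what happens to nodes 5 or more steps away, so this sketch simply places them after the nearer ones, which is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    steps: int     # breadth-first search steps needed to reach the node
    distance: int  # distance to the end of the active region
    unhappy: int   # unhappy edges introduced by placing the node

def sort_candidates(cands, max_steps=5):
    # Nodes within max_steps come first (False sorts before True);
    # ties are broken by distance, then by unhappy edges.
    return sorted(cands,
                  key=lambda c: (c.steps >= max_steps, c.distance, c.unhappy))
```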

Update Partial Layout
  Check if all incident dangling edges not marked wrong are happy
    If yes, just remove all those edges and add new node
    If no, check if setting all unhappy edges as omitted will result in a disconnected graph
      If no, just add new node and remove dangling edges
      If yes, discard current partial layout, to avoid inserting a disconnected component into the sequence
  Add new dangling edges
  Remove all dangling edges which are not happy; check connectedness
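The connectedness check used when edges are set as omitted can be sketched with a breadth-first search. The function name and the representation (unique node names, undirected (u, v) edge pairs) are assumptions.

```python
from collections import deque

def is_connected(nodes, edges, omitted=frozenset()):
    """True if the graph stays connected after dropping the omitted edges."""
    nodes = list(nodes)
    if not nodes:
        return True
    adj = {n: [] for n in nodes}
    for u, v in edges:
        if (u, v) in omitted or (v, u) in omitted:
            continue  # edge has been set as omitted
        adj[u].append(v)
        adj[v].append(u)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        cur = queue.popleft()
        for nxt in adj[cur]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(nodes)
```

In the update step, a partial layout would be discarded when is_connected(nodes, edges, omitted) is False for the set of unhappy edges being omitted.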

Main Procedure
  If active region is empty
    Current connected component is finished
    Check if dangling edge set is empty
      If yes, output the result
      If no, use dangling edges to find a new node and start another scaffolding

Disconnected Components
  First find all the connected components and sort them according to the number of nodes
  From the first component, find a solution, which omits p_1 edges
  For the ith component, if there is no solution that omits p - sum(p_1, …, p_(i-1)) edges, remember all the stop points, return to the (i-1)th component, and see if it can find a solution which omits fewer than p_(i-1) edges. If yes, continue from the stop point of the ith component.
  If the ith component finishes the whole search and finds more than one solution, only remember the solution with minimum p_i. In the future, when the search comes to this component, just use this solution as part of the partial results.
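The first step, finding the connected components and sorting them by node count, might look like the sketch below. The slide does not state the sort direction, so the largest_first flag is an assumption, as are the function names.

```python
def connected_components(nodes, edges):
    """Group nodes into connected components using an iterative DFS."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        comp, stack = [], [n]
        seen.add(n)
        while stack:
            cur = stack.pop()
            comp.append(cur)
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        comps.append(sorted(comp))
    return comps

def sorted_components(nodes, edges, largest_first=True):
    # Sort direction is not given in the source; largest_first is a guess.
    return sorted(connected_components(nodes, edges),
                  key=len, reverse=largest_first)
```

The per-component search would then run over this list in order, with component i allowed to omit at most p - sum(p_1, …, p_(i-1)) edges.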

Optimal Solution
  Branch and Bound
    P’ edges

Simulated Data Result
  Node Num: 1522 nodes
  Contig length: 600 - 10,000

  Wrong edges   p    Time(ms)
  0             0    2765
  1             1    2984
  2             2    4984
  3             3    6562
  4             4    7000
  5             5    7328
  6             6    7281
  7             7    7343
  8             8    7406
  9             9    51813
  10            10   216984

Future Work
  Find the optimal solution
  Wrong contigs
  Repeats
  How to deal with large p
  Find a good way to sort the unassigned nodes

Thank you