The missing data issue for HiSeq runs

The Queensland Brain Institute | The missing data issue and the data resurrection miracle [ElCierne ] 6/13/22

Upload: denis-bauer

Post on 30-Jun-2015

DESCRIPTION

Critical Run files can be missing/corrupt after the Run folder was transferred from the HiSeq storage to the cluster storage. This presentation discusses the issue and suggests four workarounds.

TRANSCRIPT

Page 1: The missing data issue for HiSeq runs

The Queensland Brain Institute |

The missing data issue and the data resurrection miracle

[ElCierne ]

April 14, 2023

Page 2: The missing data issue for HiSeq runs

What is the missing data issue?

Critical Run files are missing/corrupt after the Run folder was transferred from the HiSeq storage to the cluster storage

• Consequence
  – Config.xml might need to be corrected
  – Missing *.bcl, *.stats can be recreated
  – Missing *.filter, *.pos.txt causes the loss of a tile

Page 3: The missing data issue for HiSeq runs

What causes the missing data issue?

• Files are not transferred correctly
  – Millisecond hang-ups of the network, which are not recognized by Windows

• RTA did not generate files in the first place
  – HiSeq computer overload
  – Mismanagement of parallel threads (two processes accessing the same file)

Page 4: The missing data issue for HiSeq runs

Why is it an issue?

• Usual workflow crashes: bclConverter does not proceed if there are missing files.

Page 5: The missing data issue for HiSeq runs

Solutions to recoverable missing data issues

1. Copy *.stats from the same tile of a different cycle
   – PRO: fast
   – CON: fudge, trusts RTA, requires separate workflow for missing *.bcl files

2. Recalculate *.stats from *.dif, *.filter and *.bcl (Sanger)
   – PRO: accurate & fast
   – CON: requires separate workflow for missing *.bcl files, trusts RTA

3. Calculate *.qseq from *.cif for missing tile (QBI)
   – PRO: handles missing *.stats, *.bcl
   – CON: slow, trusts RTA

4. Calculate *.qseq from *.cif for all tiles
   – PRO: handles missing *.stats, *.bcl; recalculates everything, so no potentially corrupt RTA bcl/stats files are used
   – CON: slow (days)

Page 6: The missing data issue for HiSeq runs

New workflow with OLB

Identify the missing files, calculate *.qseq files for the affected tiles, and merge them with the *.qseq files from the normal workflow before proceeding
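The "identify missing files" step could be sketched roughly as below. This is only an illustration, not the script used at QBI: the function names are hypothetical, and the layout `L<lane>/C<cycle>.1/s_<lane>_<tile>.bcl` (plus `.stats`) under the BaseCalls folder is an assumption that should be checked against the actual Run folder. Treating a zero-byte file as missing is likewise a heuristic; it does not detect other kinds of corruption.

```python
from pathlib import Path


def find_missing_tile_files(basecalls, lanes, cycles, tiles,
                            suffixes=(".bcl", ".stats")):
    """Return (lane, tile, cycle, suffix) tuples for absent or empty files.

    Assumed layout (verify for your RTA version):
        <basecalls>/L001/C1.1/s_1_1101.bcl
        <basecalls>/L001/C1.1/s_1_1101.stats
    Results are returned in scan order (lane, cycle, tile, suffix).
    """
    missing = []
    for lane in lanes:
        for cycle in cycles:
            cycle_dir = Path(basecalls) / f"L{lane:03d}" / f"C{cycle}.1"
            for tile in tiles:
                for suffix in suffixes:
                    path = cycle_dir / f"s_{lane}_{tile}{suffix}"
                    # Heuristic: a zero-byte file is treated like a missing one.
                    if not path.is_file() or path.stat().st_size == 0:
                        missing.append((lane, tile, cycle, suffix))
    return missing


def tiles_to_recompute(missing):
    """Collapse the per-cycle findings into a sorted list of (lane, tile)."""
    return sorted({(lane, tile) for lane, tile, _cycle, _suffix in missing})
```

The list returned by `tiles_to_recompute` is what would be handed to the base caller for the affected tiles, while the remaining tiles go through the normal workflow.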

Page 7: The missing data issue for HiSeq runs

Details: If *.stats or *.bcl was missing

1. Start offline base caller (OLB) for the missing tiles

2. Comment out the missing tile in config.xml and start bclConverter to convert the intact tiles (or use setupBclToQseq + bcl2qseq directly with --ignore-missing-bcl or --ignore-missing-stats)

3. Merge *.qseq generated from OLB and bclConverter in one directory (BaseCalls_<date>_<user>)

4. Start GERALD to convert to fastq (_sequence.txt)
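Step 3 (merging the *.qseq files) might look like the following minimal sketch. The function name and both directory arguments are hypothetical, and the `*_qseq.txt` naming pattern is an assumption about how OLB and bclConverter name their output. Existing files are never overwritten: a name that already exists in the target directory suggests the tile was not excluded from the normal workflow and should be checked by hand.

```python
import shutil
from pathlib import Path


def merge_qseq(olb_dir, target_dir):
    """Copy *_qseq.txt files from olb_dir into target_dir without overwriting.

    Returns two lists of file names: (copied, skipped).
    """
    target = Path(target_dir)
    copied, skipped = [], []
    for src in sorted(Path(olb_dir).glob("*_qseq.txt")):
        dest = target / src.name
        if dest.exists():
            # Same tile produced by both workflows: leave it and report it.
            skipped.append(src.name)
        else:
            shutil.copy2(src, dest)
            copied.append(src.name)
    return copied, skipped
```

A non-empty `skipped` list is worth investigating before starting GERALD, since it means two versions of the same tile exist.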

Page 8: The missing data issue for HiSeq runs

Solution requires *.cif files to be saved

• Intensity files (*.cif) are not stored by default

– Remember to tick the save intensity box when starting a run

– Or make it the default: in c:/illumina/HiSeqControlSoftware/RTA/RTA.exe.config add

<add key="DeleteIntensityFiles" value="0" />
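For sites that prefer to script this change, one possible sketch is shown below. It assumes RTA.exe.config follows the usual .NET layout of `<configuration><appSettings><add key="..." value="..." /></appSettings></configuration>`, which the slide does not confirm; the function name is hypothetical. It works on the file content as a string, so reading and writing the file (and keeping a backup) is left to the caller.

```python
import xml.etree.ElementTree as ET


def keep_intensity_files(config_text):
    """Return config XML with DeleteIntensityFiles set to "0".

    Assumes a .NET-style <configuration>/<appSettings> layout (not verified
    against a real RTA.exe.config). Adds the section or the key if absent.
    """
    root = ET.fromstring(config_text)
    settings = root.find("appSettings")
    if settings is None:
        settings = ET.SubElement(root, "appSettings")
    for entry in settings.findall("add"):
        if entry.get("key") == "DeleteIntensityFiles":
            entry.set("value", "0")
            break
    else:
        # Key not present yet: append it.
        ET.SubElement(settings, "add",
                      {"key": "DeleteIntensityFiles", "value": "0"})
    return ET.tostring(root, encoding="unicode")
```

Note that `xml.etree.ElementTree` drops XML comments when parsing, so editing the file by hand may be preferable if the original comments should be preserved.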

Page 9: The missing data issue for HiSeq runs

Acknowledgement

• Thanks to
  – Dr. Steven Leonard, Informatics Division, The Sanger Institute
  – Eugene, Illumina tech support
