Supplementary Methods
Supplementary Methods S1. DNA Extraction, Library Preparation, and
Sequencing
To confirm whether the two sequence outputs of the same individual, extracted
independently, were consistent, the first two extracts (A1 and B) were prepared at the
National Institute of Genetics, Mishima, and the new DNA extract (A2) was prepared at
the National Museum of Nature and Science, Tsukuba, using a previously published
protocol1 with some modifications. Briefly, powdered sample was decalcified with 0.5 M
EDTA (pH 8.0) at 56°C overnight; the buffer was then replaced with a fresh
solution and the sample was decalcified again at 56°C overnight. The decalcified samples were
lysed in 1000 µl of Genomic Lyse buffer (Genetic ID) with 50 µl of 20 mg/ml
proteinase K at 56°C overnight, and DNA was extracted from the lysate using the
FAST ID DNA extraction kit (Genetic ID) in accordance with the manufacturer’s
protocol. Three DNA libraries for A1, A2, and B were prepared from the extracted DNA
solution using the GS Titanium Rapid Library Preparation kit (454 Life Science
Corporation). Modifications to the protocol2 were as follows: (i) In the adapter ligation
step, we used 1 µl of Illumina adapter mix instead of 2 µl to minimize the amount of
adapter dimer; (ii) A first-round PCR was set up with the Multiplex PCR kit (Qiagen) as
follows: 5 µl DNA library, 25 µl Mix, 20-100 nM each Multiplexing PCR primer, and
H2O up to 50 µl. Cycling conditions were 15 min at 95°C, 12 cycles of 30 sec at 95°C,
30 sec at 65°C, and 30 sec at 72°C, with a final extension at 72°C for 10 min; (iii) A
second-round PCR was set up as follows: 5 µl product from the first PCR, 25 µl Multiplex
PCR mix, 500-2,000 nM each Multiplexing PCR primer, and H2O up to 50 µl. Cycling
conditions were 15 min at 95°C, 7-15 cycles of 30 sec at 95°C, 30 sec at 60°C, and 30 sec
at 72°C, with a final extension at 72°C for 10 min; (iv) Half of DNA library A1
was treated with two restriction enzymes, FastDigest® Bsh1236I (Thermo Scientific)
and FastDigest® TauI (Thermo Scientific), that recognize CGCG and GCSGC sites,
respectively. This treatment reduces the amount of non-human DNA as much as possible, thus
enriching the human-derived sequence reads. PCR-amplified DNA libraries were
purified on a 2% agarose gel (BIO RAD) to remove adapter and primer dimers, and the
purified libraries were quantified with the Agilent 2100 Bioanalyzer DNA High-Sensitivity
chip before sequencing on Illumina GAIIx and HiSeq 2000 sequencers.
Supplementary Methods S2. Sequence mapping
FASTX-Toolkit3 was used to trim adapter sequences, to remove reads containing N's,
and to trim bases with base quality below 25, discarding reads shorter than 11 bases. For
adapter trimming, we used fastx_clipper (-a AGATCGG -n -l 11), and for filtering
low-quality bases, we used fastq_quality_trimmer (-t 25 -l 11). After sorting, forward
and reverse reads were merged using fastq-join4 (-m 11). Some unmerged reads still
contained adapter sequences shorter than seven bases. Therefore, we removed
seven bases from the termini of the unmerged reads with fastx_trimmer (-l 94 or 114) and merged
them again with fastq-join, which rescued several percent of the unmerged reads. After
filtering out reads shorter than 35 bases, the remaining merged reads were mapped to the
human reference genome (hg19) using BWA5. The frequencies of sequence reads
mapped to hg19 were examined using the flagstat option of SAMtools6. Mapped reads with
mapping quality ≥ 30 were retained for further analyses. After generating mpileup output with
SAMtools, we counted the covered sites in autosomes and in mitochondrial DNA
independently, and then calculated their coverage. Metagenomic analysis was
also performed using BLASTN7 to check for non-human DNA contamination. A total of
20 Mbp were searched in each sample (parameters were -evalue 3.80e-2, -dbsize
3,200,000,000).
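The site counting described above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes tab-separated mpileup lines (chromosome, position, reference base, depth, ...), hg19-style chromosome names (chr1 to chr22, chrM), and it takes "coverage" to mean the number of covered sites together with the mean depth over those sites. The function name summarize_depth is hypothetical.

```python
from collections import defaultdict

# Assumed hg19-style names; adjust if the reference uses "1".."22" and "MT".
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}
MITO = "chrM"

def summarize_depth(pileup_lines):
    """Return {group: (covered_sites, mean_depth_over_covered_sites)}.

    Each input line is expected to follow the mpileup layout:
    chrom <TAB> pos <TAB> ref <TAB> depth <TAB> bases <TAB> quals
    """
    sites = defaultdict(int)
    total_depth = defaultdict(int)
    for line in pileup_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed lines
        chrom, depth = fields[0], int(fields[3])
        if depth == 0:
            continue  # site not covered by any read
        if chrom in AUTOSOMES:
            group = "autosome"
        elif chrom == MITO:
            group = "mtDNA"
        else:
            continue  # sex chromosomes and other contigs are ignored here
        sites[group] += 1
        total_depth[group] += depth
    return {g: (sites[g], total_depth[g] / sites[g]) for g in sites}
```

Autosomes and mtDNA are tallied separately, as in the text, because mtDNA depth is typically far higher than nuclear depth.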
In PickingBases (PB), any reads that map to the same reference location
are considered duplicates and are merged into a single read. When the reads being
merged have two or more alleles at a site, either the allele that matches the reference
genome is chosen for the merged read or the site is masked as N (using the --use-n option)
(a graphical explanation is given in Supplementary Figure S6). More sequence sites and
lower error rates are expected with PB, and using the --use-n option might remove
bias toward the reference genome. We therefore used the --use-n option in the current
study. PB has two merits over MarkDuplicates (MD). First, PB can detect plus strands (black-colored
bases) and minus strands (gray-colored bases) originating from the same DNA templates as
PCR duplicates, whereas MD cannot (e.g. group 1). In addition, since MD identifies reads
having identical 5' positions as duplicates and chooses the read having the highest sum of base
qualities among them, MD keeps longer DNA reads even if some of the reads are
short and originated from different templates (e.g. group 2). This selectively removes
shorter DNA reads. The effect is quite strong for the DNA reads mapped to the
mitochondrial genome, because its depth is much higher than that of the nuclear genome
and many DNA reads originating from different templates are frequently mapped to the
same 5' ends. Second, PB can ideally reduce the error rate compared to MD, especially
when the Illumina Y-shaped adapter is used, because PB masks some PCR and sequencing
errors as “N” (blue bases) if there are PCR duplicates.
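The merging rule described for PB can be illustrated with a small sketch. This is not the PickingBases source code: the function name merge_duplicates is hypothetical, the reads are assumed to be already grouped as duplicates and to span exactly the same reference interval, and the fallback used when no read carries the reference allele (take the highest-quality base) is an assumption.

```python
def merge_duplicates(reads, reference, use_n=True):
    """Merge duplicate reads covering the same interval into one sequence.

    reads     : list of (bases, quals); bases is a string, quals a list of ints,
                both the same length as `reference` (simplifying assumption).
    reference : reference bases for the interval.
    use_n     : if True, mask sites with two or more alleles as "N";
                if False, prefer the allele that matches the reference.
    """
    merged = []
    for i, ref_base in enumerate(reference):
        column = [(bases[i], quals[i]) for bases, quals in reads]
        alleles = {base for base, _ in column}
        if len(alleles) == 1:
            merged.append(column[0][0])      # all duplicates agree
        elif use_n:
            merged.append("N")               # disagreement: mask the site
        elif ref_base in alleles:
            merged.append(ref_base)          # disagreement: trust the reference allele
        else:
            # Assumption: no read matches the reference, so keep the best-quality base.
            merged.append(max(column, key=lambda bq: bq[1])[0])
    return "".join(merged)
```

With use_n=True a discordant site never takes the reference allele by default, which is the reason the text gives for preferring that option.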
Supplementary Methods S3. DNA authenticity checking
Since it is difficult to remove all modern human DNA contamination that comes from
reagents and experimental rooms, it is essential to estimate the frequency of endogenous
DNA from Jomon individuals. To estimate the frequency of contamination, we focused
on sequence reads mapped to mitochondrial DNA (mtDNA). The mtDNA haplogroups of
both samples, A and B, were previously classified as N9b8. First, we tried to construct
a consensus sequence for each sample using the mitochondrial capture method and
sequencing with MiSeq9. The mtDNA captured reads were mapped and filtered with
the same method as the shotgun sequence reads, except that mapping quality 20 or larger was used instead
of 30. To correctly estimate the consensus sequence of sample A, we merged A1 and A2. It
seems that many sequence reads were misassigned to different indexes
(Supplementary Tables S6) because our DNA libraries carried a single index
(Supplementary Table S3). Kircher et al.10 reported that the misidentification rates were
~0.3%, though they used dual indexes. Higher misclassification rates were inferred in
our sequencing results. Therefore, we tried to detect falsely assigned reads and remove
them in order to construct correct consensus sequences and to infer correct contamination rates
for each sample. To detect the falsely assigned reads, we grouped the sequence
reads mapped to the same reference location as PCR duplicates within each sample, and
checked whether the same group was present in other samples. For each group, we assumed that
the sample having the highest number of PCR duplicates is the origin of the group and that
the copies in other samples are contaminants. After removing the falsely assigned reads, we
filtered PCR duplicates with PB using the --use-n and --ignore-strand options (the latter
treats reads mapped to the same reference location but to different strands as one
group deriving from the same template), and trimmed the first and last five bases using
BamUtil11 to minimize the effect of post-mortem misincorporation. Sites with a read depth of three or
higher were used to call individual-specific SNPs. Sites 515-522 and
8,860 were not used because of ambiguous typing. Read depth was estimated by
calculating the average depth over all sites.
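The rule used to flag falsely assigned reads can be sketched as below. This is an illustrative reading of the description, not the authors' implementation: the function name find_false_assignments is hypothetical, a read is represented only by its mapping location, and groups in which two samples tie for the highest duplicate count are left unresolved (an assumption, since the text does not say how ties were handled).

```python
from collections import Counter, defaultdict

def find_false_assignments(reads_by_sample):
    """Flag duplicate groups that probably belong to another sample.

    reads_by_sample : {sample_name: iterable of (chrom, start, end) locations}
    Returns {sample_name: set of locations to remove from that sample}.
    """
    # location -> number of reads at that location in each sample
    counts = defaultdict(Counter)
    for sample, locations in reads_by_sample.items():
        for loc in locations:
            counts[loc][sample] += 1

    flagged = defaultdict(set)
    for loc, per_sample in counts.items():
        if len(per_sample) < 2:
            continue  # group seen in one sample only: nothing to resolve
        top = max(per_sample.values())
        origins = [s for s, n in per_sample.items() if n == top]
        if len(origins) != 1:
            continue  # tie: left unresolved in this sketch (assumption)
        for sample in per_sample:
            if sample != origins[0]:
                flagged[sample].add(loc)  # treated as index misassignment
    return dict(flagged)
```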
A list of the SNPs and indels was compared to the list of mutations reported in
PhyloTree.org, build 1612, and the mtDNA of both Sanganji Jomon
individuals was classified into haplogroup N9b; individual-specific mutations and
rare mutations within haplogroup N9b were also observed. This haplogroup is common in
northern Jomon people (about 45% in the Hokkaido Jomon), but rare in modern
Japanese (1.9% in the mainland Japanese). Since haplogroup N9b is rare in
modern humans and the mtDNA haplotype of Kanzawa-Kiriyama (who did the DNA
extractions and library preparations) is non-N, we can assume that any sequence reads
that do not match are contaminants. We used sequence reads having
mapping quality 30 or larger, and calculated the proportion of reads that did not
match the consensus sequences. The proportions for samples A1 and A2 were calculated
independently. In the estimation, we removed nucleotide position 14,893 in sample B
because we know empirically that mapping is ambiguous around this region. In
fact, the sequence read identified as a contaminant in sample B was also mapped to chr2
and chr5 with highest E-value in MEGABLAST and it is difficult to correctly identify
whether those reads are contaminants. We calculated the frequency of contamination,
$F_{inc\_mtDNA}$, with the following equation:
$F_{inc\_mtDNA} = \frac{N_{inconsistent}}{N_{consistent} + N_{inconsistent}} \times 100$, (1)
where $N_{inconsistent}$ is the number of reads inconsistent with the mtDNA haplotype of
each individual, and $N_{consistent}$ is the number of reads consistent with the mtDNA
haplotype of each individual. 95% C.I. were computed using the standard t-test. When
all determined sequences were consistent and the contamination proportion was
estimated to be zero, we assumed a simple binomial distribution with proportions $p$ and $1 - p$
for contaminant and authentic mtDNA sequences, respectively. We
estimated the $p$ that gives the 95% C.I. by numerically solving equation (2):
$0.95 = (1 - p)^{n}$, (2)
where $n$ is the number of observed mutually consistent reads.
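Equations (1) and (2) can be evaluated with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and equation (2) is solved in closed form (p = 1 - target^(1/n)) rather than numerically, with the left-hand side exposed as a parameter whose default is the 0.95 given in the text.

```python
def contamination_percent(n_consistent, n_inconsistent):
    """Equation (1): percentage of reads inconsistent with the mtDNA haplotype."""
    total = n_consistent + n_inconsistent
    if total == 0:
        raise ValueError("no reads to evaluate")
    return 100.0 * n_inconsistent / total

def solve_p(n, target=0.95):
    """Equation (2): solve target = (1 - p)**n for p.

    n      : number of observed mutually consistent reads (n > 0)
    target : left-hand side of equation (2); 0.95 as written in the text
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - target ** (1.0 / n)
```

For example, contamination_percent(49, 1) evaluates equation (1) for 49 consistent reads and 1 inconsistent read.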
Two Perl scripts were newly developed by one of us (K. K.):
sam-count-ref-nucleotides.pl for depurination analysis and sam-count-substitutions.pl for
C-to-T misincorporation analysis (see Supplementary Perl Scripts).
Supplementary Methods S4. Estimation of error rates and patterns of errors
After the sequence termini of the Sanganji Jomon reads were trimmed and the contamination
frequency of each library was checked, the data from the low-contamination libraries were
merged. After generating mpileup output for both ancient and modern samples using SAMtools,
we randomly chose one base having base quality 30 or larger at each site, and then
generated alignments of all 17 human samples with PanTro2 and hg19. We then
chose the nucleotide positions where the sequence was determined in all 19 individuals
and where no more than two bases were observed. In addition, we removed tandem
repeat regions (reported in Prüfer et al.13,
http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/simpleRepeat.txt.gz) and
CpG sites present in hg19 and/or PanTro2, because those regions are prone to
mapping errors and recurrent mutations.
We estimated the error rate from the alignments (SI2 of Reich et al.14). We
assumed that there were no errors in PanTro2 and hg19. We chose three sequences from
the alignments (PanTro2, hg19, and one of the 17 samples) and counted the sites at which
one sequence differed from the other two. We regarded these counts as the lineage-specific
substitutions accumulated since the sequences diverged. Assuming that an equal number of
true substitutions have occurred on the hg19 and sample lineages after their divergence,
the excess number of substitutions on the sample lineage can be
attributed to errors (e.g. sequencing errors, PCR errors, and post-mortem changes).
The error rates were inferred for both all sites and transversion-only sites. We also
estimated the error rate in another way: among sample-specific transversion
substitutions, transversions not reported in dbSNP 141 were regarded as errors. To
identify which types of errors are frequent, we compared the alignments of PanTro2 and
the samples. There are 12 different types of substitutions, and the number of substitutions
of each type between PanTro2 and each sample was counted. In downstream statistical analyses,
tandem repeat and CpG masked sequences were used.
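The excess-substitution logic can be sketched as follows. This is a simplified illustration rather than the procedure of Reich et al. or the authors' scripts: the function names are hypothetical, the three sequences are assumed to be pre-aligned strings of equal length, and the error rate is expressed per compared site, which is an assumption about the denominator.

```python
def lineage_specific_counts(chimp, hg19, sample):
    """Count sites where exactly one of the three sequences differs from the other two.

    Returns (counts, compared): counts is a dict keyed by lineage, and compared
    is the number of sites with a called base in all three sequences.
    """
    counts = {"chimp": 0, "hg19": 0, "sample": 0}
    compared = 0
    for c, h, s in zip(chimp, hg19, sample):
        if "N" in (c, h, s):
            continue  # skip sites missing in any sequence
        compared += 1
        if h == c and s != c:
            counts["sample"] += 1   # sample-specific difference
        elif s == c and h != c:
            counts["hg19"] += 1     # hg19-specific difference
        elif h == s and c != h:
            counts["chimp"] += 1    # chimpanzee-specific difference
    return counts, compared

def excess_error_rate(chimp, hg19, sample):
    """Excess of sample-specific over hg19-specific differences, per compared site."""
    counts, compared = lineage_specific_counts(chimp, hg19, sample)
    if compared == 0:
        raise ValueError("no comparable sites")
    return (counts["sample"] - counts["hg19"]) / compared
```

Restricting the input to transversion sites before calling these functions would give the transversion-only estimate mentioned in the text.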
Supplementary Methods S5. Phylogenetic analyses
TreeMix15 was used to construct maximum likelihood trees that can accommodate
possible gene flow events. Two datasets including Sanganji Jomon were used for this
analysis: all SNP sites (43k SNPs) and transversion-only sites (14k SNPs). To check the
reliability of the tree topology and the gene flow using a large number of SNPs, a new
dataset including only the 1000 Genomes Project worldwide humans, Papuan, and Denisovan
genomes, but not the Sanganji Jomon, was also analyzed. To prepare
the new dataset, four filters were applied: HWE <1e-10, MAF <0.01, pruning of SNPs in high linkage
disequilibrium (plink --indep 50 5 2), and random retention of 30% of SNPs (plink --thin 0.3).
TreeMix was run with an incremental number of assumed migrations,
starting from zero. Denisovan was used as the root. Each run used blocks of
100 and 50 SNPs for the all-SNP and transversion-only datasets, respectively. To test
how well the observed branching patterns and gene flow events are supported, we
generated 1,000 pseudosamples for each of the two datasets using a newly made Perl
script and used them as inputs for TreeMix runs with 0 to 9 migrations. Using the original
tree (from non-bootstrapped data) as the reference, the number of corresponding
population splits in the 1,000 bootstrap replicates was counted using four newly made
Perl scripts (see Supplementary Perl Scripts).
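One common way to generate such pseudosamples is a block bootstrap over SNPs, sketched below. This is an assumption about the approach, since the authors' Perl script is not reproduced here: the function name block_bootstrap is hypothetical, and each SNP is represented by an opaque row (for example, one line of allele counts per population).

```python
import random

def block_bootstrap(snp_rows, block_size, rng=None):
    """Build one pseudosample by resampling contiguous SNP blocks with replacement.

    snp_rows   : list of per-SNP records in genome order
    block_size : number of consecutive SNPs per block (e.g. 100 or 50)
    rng        : optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    # Cut the SNP list into consecutive blocks; the last block may be shorter.
    blocks = [snp_rows[i:i + block_size] for i in range(0, len(snp_rows), block_size)]
    # Draw as many blocks as there were originally, with replacement.
    sampled = [rng.choice(blocks) for _ in blocks]
    return [row for block in sampled for row in block]
```

Resampling whole blocks rather than single SNPs keeps linked sites together, which is the usual motivation for block-wise resampling.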
Phylip Package16 was used to produce NJ trees. Nei’s17 genetic distances between
populations were calculated with 1,000 bootstrap replicates, and Splitstree418 was used
to view the NJ trees. For phylogenetic network analysis, we directly constructed
Neighbor-Net networks from Nei’s genetic distances using Splitstree4.
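For reference, Nei's (1972) standard genetic distance between two populations is D = -ln(J_XY / sqrt(J_X J_Y)), where J_X, J_Y, and J_XY are average gene identities across loci. A minimal sketch for biallelic SNPs follows; the function name is hypothetical, and whether the software applies exactly this estimator (for example, with or without sample-size corrections) is not stated in the text.

```python
import math

def nei_standard_distance(freqs_x, freqs_y):
    """Nei's (1972) standard genetic distance for biallelic SNPs.

    freqs_x, freqs_y : frequencies of the same allele at each SNP in
                       populations X and Y (lists of equal, non-zero length).
    """
    if len(freqs_x) != len(freqs_y) or not freqs_x:
        raise ValueError("frequency lists must be non-empty and of equal length")
    n_loci = len(freqs_x)
    # Average identity within X, within Y, and between X and Y.
    j_x = sum(p * p + (1 - p) ** 2 for p in freqs_x) / n_loci
    j_y = sum(q * q + (1 - q) ** 2 for q in freqs_y) / n_loci
    j_xy = sum(p * q + (1 - p) * (1 - q) for p, q in zip(freqs_x, freqs_y)) / n_loci
    # Undefined (math domain error) if the populations share no alleles (j_xy == 0).
    return -math.log(j_xy / math.sqrt(j_x * j_y))
```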
Supplementary Results
Supplementary Results S1. DNA sequence authenticity through mtDNA sequences
Average depths of captured mtDNA were 1.97, 12.24, and 8.59 in samples A1, A2, and
B, respectively. The estimated frequencies of modern human DNA contamination
based on captured mtDNA sequence reads were 0.0% (0.0-15.4%, 95%
C.I.), 1.7% (0.0-4.1%, 95% C.I.), and 2.0% (0.0-6.0%, 95% C.I.) for DNA samples A1,
A2, and B, respectively (Supplementary Table S4), while the frequencies based on
simple shotgun sequencing were 0.0% (0.0-28.4%, 95% C.I.), 11.3% (2.8-19.9%, 95%
C.I.), and 5.6% (0.0-16.1%, 95% C.I.), respectively (Supplementary Table S5). The
confidence interval of sample A1 is wider than those of the other samples because of the low
depth of mtDNA, though we found no inconsistent nucleotides for this sample. The increase
in contamination frequency in sample A2 for the shotgun sequencing was not expected,
and the reasons are not clear, though cross contamination during experiments and/or
sequencing (e.g. index misclassification) is a possibility.
Possible sources of cross contamination of sample A2 during the simple shotgun
sequencing are sample B and some other Jomon samples that are not reported in this
paper. Their mtDNA haplogroups were different from those of the Sanganji Jomon samples used
in this study, and this difference might have increased the contamination frequency if
there was cross contamination from those other Jomon samples. We also cannot reject
the possibility that cross contamination from non-Jomon samples happened during
experiments. The possible sources of contamination in captured mtDNA sequences are
other Jomon samples having haplogroups N9b and M7a. However, no evidence of cross
contamination from them was found in individual A-specific SNPs (Supplementary
Table S4), which are not shared with those other Jomon samples (data not shown), so
cross contamination in the mtDNA captured sequences is unlikely.
Supplementary Results S2. TreeMix analysis
Genome sequence data of the Sanganji Jomon, the 1000 Genomes Project populations,
Karitiana (Native American from Brazil) and Papuan13, Ust’-Ishim from 45,000 YBP
western Siberian19, Mal’ta MA1 from 24,000 YBP south-central Siberian20, and archaic
Denisovan21 were used. We used two datasets: one based on all 43,310 sites and the
other based on 15,455 transversion-only sites. The number of gene flow events was set
from zero to nine, and the resultant trees with bootstrap probabilities on interior branches
are shown in Supplementary Figures S16a-S16j for all sites and in Supplementary
Figures S17a-S17j for transversion-only sites. Bootstrap values were computed by
using five newly made Perl scripts in combination with the TreeMix software. In all
20 trees, the Sanganji Jomon was consistently outside of the five modern East Eurasian
populations (JPT, CHB, CHS, CDX, and KHV). Furthermore, except for the tree with no gene flow
for the all-site data (Supplementary Figure S16a) and the trees with no or one gene flow for the
transversion-only data (Supplementary Figures S17a and S17b), the South American
Karitiana clustered with the five modern East Eurasians and the Sanganji Jomon was
outside of this cluster. Bootstrap values supporting this pattern were relatively low (50-75%)
when transversion-only data were used (Supplementary Figures S17c-S17j);
however, those for the all-site data were high (96-98%) for three to five gene flow
events (Supplementary Figures S16d-S16f). Gene flow from the Sanganji Jomon to
JPT appeared when the number of gene flow events was three or more for the all-site data
(Supplementary Figures S16d-S16j), and when the number of gene flow events was
four or more for the transversion-only data (Supplementary Figures S17e-S17j). The
proportion of Sanganji Jomon gene flow to JPT was in the range of 12-16% in
these 13 cases, and the frequencies (equal to bootstrap probabilities) with which this
gene flow appeared among the 1,000 pseudosamples were 98-99% for the all-site data and 84-96% for
the transversion-only data (detailed data not shown).
HGDP populations were also used for TreeMix analysis (Supplementary
Figure S18; bootstrap values for modern East Eurasian populations were suppressed).
Although the dataset used was small (only 7,529 SNP sites) and had ascertainment bias,
the Sanganji Jomon was in the basal position to the five modern East Eurasians with
high bootstrap support (95%). This result also strengthens the basal position of the
Sanganji Jomon among East Eurasian populations. When we introduced gene flow
events, however, the gene flow from Jomon to JPT appeared only as the seventh event (tree not
shown). It should be noted that estimating gene flow events from a small
dataset such as this one is not easy.
Supplementary Results S3. NeighborNet analysis
When HGDP data were used, three major splits X, Y, and Z related to Sanganji Jomon
were observed (Supplementary Figure S21). Split Z excludes Yakuts from the other East
Asians but groups the Sanganji Jomon with the non-Yakut East Asians. This split may
reflect a recent admixture between Yakuts and West Eurasians. Split Y groups all the
modern East Eurasians, which may correspond to the divergence of the Sanganji Jomon
from the other East Eurasians. Split X, which has a longer distance than those of splits Y
and Z, groups the Sanganji Jomon, West Eurasians, and Africans. Split X might be
the result of postmortem changes in the Sanganji Jomon. However, if
we accept this split as a phylogenetic signal, the following scenarios can be considered:
some of the Sanganji Jomon ancestors originated from populations that diverged before the
Sahulian-East Eurasian divergence, and/or the Sahulians experienced some gene flow
with ancestors of modern East Eurasians after the divergence of the Sanganji Jomon.
Supplementary Results S4. D-statistic analysis
We also examined whether the uniqueness of the Sanganji Jomon in East Eurasia was
the result of gene flow with non-East Eurasians, for example, Africans, Europeans,
Sahulians, and ancient Siberians (Mal’ta MA1 and Ust’-Ishim). Comparison with the
modern populations did not suggest additional gene flow between the Sanganji Jomon and
non-East Eurasians (Supplementary Figures S24 and S25). The trees ((X, Sanganji
Jomon) Ust’-Ishim) and ((X, Sanganji Jomon) Mal’ta MA1) also showed no
evidence of gene flow with Ust’-Ishim or Mal’ta MA1 after their divergence
(Supplementary Figure S22). Next, we tried to detect gene flow between non-East
Eurasians and the East Eurasians who diverged after the divergence of the Sanganji Jomon.
A closer genetic affinity of the Sahulians, especially Melanesians, to the modern
East Eurasians than to the Sanganji Jomon, Yakut, and Native Americans was
inferred, but the Z-scores were not significant (Supplementary Figure S26). It is
possible that this affinity corresponds to split X of Supplementary Figure S21. To
evaluate the effect of genotype errors in the Jomon data on the affinity between Sahulians and
modern East Eurasians, we compared D-values using all sites and transversion-only sites. We
observed some bias when using all sites, and the bias appeared to be an artifact
of post-mortem changes. The affinity between Melanesians and East Eurasians
relative to the Sanganji Jomon was still inferred when only transversion sites from the
1000 Genomes Project populations and a Melanesian genome13 were used, though the
Z-scores do not yet establish its reliability (Supplementary Figure S27). The affinity with
Papuan was not supported (Supplementary Figures S27 and S28). If we accept the tree
((East Eurasians, Jomon/Yakut/Native Americans), Melanesian), gene flow between the
East Eurasians and Melanesians after the divergence of Jomon, Yakut, or Native
Americans is a plausible explanation. We also inspected relationships among the Sanganji
Jomon, East Eurasians, and Native Americans. Neither the tree ((Native American, East
Eurasians) Sanganji Jomon) nor the tree ((Sanganji Jomon, East Eurasians) Native American)
was rejected, but some skew from zero was observed (Supplementary Figures S29
and S30). Assuming no gene flow between the Native Americans and the East Eurasians
after their divergence, the skew would be explained by the earlier split of the Sanganji
Jomon in the ancestry of East Eurasians and Native Americans, as suggested by the TreeMix
analyses. This result is consistent with the low bootstrap value for the branch suggesting
earlier Jomon divergence in the neighbor-joining tree (Figure 5a and Supplementary
Figure S20).
References Cited
1 Adachi, N., Sawada, J., Yoneda, M., Kobayashi, K. & Itoh, S. Mitochondrial
DNA analysis of the human skeleton of the initial Jomon phase excavated at
the Yugura cave site, Nagano, Japan. Anthropol. Sci. 121, 137-143 (2013).
2 Rasmussen, M., Guo, X., Wang, Y., Lohmueller, K. E., Rasmussen, S.,
Albrechtsen, A. et al. An Aboriginal Australian genome reveals separate
human dispersals into Asia. Science 334, 94-98 (2011).
3 http://hannonlab.cshl.edu/fastx_toolkit/
4 https://code.google.com/p/ea-utils/wiki/FastqJoin
5 Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-
Wheeler Transform. Bioinformatics 25, 1754-1760 (2009). Burrows-Wheeler
Aligner, http://bio-bwa.sourceforge.net/index.shtml
6 Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N. et al.
The sequence alignment/map format and SAMtools. Bioinformatics 25,
2078–2079 (2009). SAMtools, http://samtools.sourceforge.net/index.shtml
7 Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W.
& Lipman, D. J. Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucl. Acids Res. 25, 3389–3402 (1997).
Software was downloaded from:
https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch
8 Kanzawa-Kiriyama, H., Saso, A., Suwa, G. & Saitou, N. Ancient
mitochondrial DNA sequences of Jomon teeth samples from Sanganji,
Tohoku district, Japan. Anthropol. Sci. 121, 89-103 (2013).
9 Maricic, T., Whitten, M. & Pääbo, S. Multiplexed DNA sequence capture of
mitochondrial genomes using PCR products. PLoS ONE 5, e14004 (2010).
10 Kircher, M., Sawyer, S. & Meyer, M. Double indexing overcomes
inaccuracies in multiplex sequencing on the Illumina platform. Nucl. Acids
Res. 40, e3 (2012).
11 http://genome.sph.umich.edu/wiki/BamUtil
12 Van Oven, M. & Kayser, M. Updated comprehensive phylogenetic tree of
global human mitochondrial DNA variation. Hum. Mut. 30, E386-E394
(2008) http://www.phylotree.org/
13 Prüfer, K., Racimo, F., Patterson, N., Jay, F., Sankararaman, S., Sawyer, S.
et al. The complete genome sequence of a Neanderthal from the Altai
Mountains. Nature 505, 43-49 (2014).
14 Reich, D., Green, R. E., Kircher, M., Krause, J., Patterson, N., Durand, E. Y.
et al. Genetic history of an archaic hominin group from Denisova Cave in
Siberia. Nature 468, 1053-1060 (2010).
15 Pickrell, J. K. & Pritchard, J. K. Inference of population splits and mixtures
from genome-wide allele frequency data. PLoS Genet. 8, e1002967 (2012).
16 PHYLIP package: http://evolution.genetics.washington.edu/phylip.html
17 Nei, M. Genetic distance between populations. Amer. Nat. 106, 283-292
(1972).
18 Huson, D. H. & Bryant, D. Application of phylogenetic networks in
evolutionary studies. Mol. Biol. Evol. 23, 254–267 (2006). Splitstree,
http://www.splitstree.org/
19 Fu, Q., Li, H., Moorjani, P., Jay, F., Slepchenko, S. M., Bondarev, A. A. et al.
Genome sequence of a 45,000-year-old modern human from western
Siberia. Nature 514, 445–449 (2014).
20 Raghavan, M., Skoglund, P., Graf, K. E., Metspalu, M., Albrechtsen, A.,
Moltke, I. et al. Upper Palaeolithic Siberian genome reveals dual ancestry of
Native Americans. Nature 505, 87-91 (2014).
21 Meyer, M., Kircher, M., Gansauge, M., Li, H., Racimo, F., Mallick, S. et al. A
high-coverage genome sequence from an archaic Denisovan individual.
Science 338, 222-226 (2012).
22 Tamura, K., Stecher, G., Peterson, D., Filipski, A. & Kumar, S. MEGA6:
Molecular Evolutionary Genetics Analysis version 6.0. Mol. Biol. Evol. 30,
2725-2729 (2013).
23 The 1000 Genomes Project Consortium. An integrated map of genetic
variation from 1,092 human genomes. Nature 491, 56-65 (2012).
24 Andrews, R. M., Kubacka, I., Chinnery, P. F., Lightowlers, R. N., Turnbull, D.
M. & Howell, N. Reanalysis and revision of the Cambridge reference
sequence for human mitochondrial DNA. Nat. Genet. 23, 147 (1999).
Supplementary Figure S1. Geographical location of the Sanganji Shell Mound
Red dot shows location of Sanganji Shell Mound. The geographical locations of the
other populations compared in the current study are also plotted: black, orange, blue,
green, brown, and purple are African, West Eurasian, Northern East Eurasian, Southern
East Eurasian, Sahulian, and Native American populations, respectively. The names of
the 12 populations from the 1000 Genomes Project are as follows: 1 (YRI) = Yoruba
in Ibadan, Nigeria, 2 (LWK) = Luhya in Webuye, Kenya, 3 (CEU) = Utah residents with
Northern and Western European ancestry, 4 (IBS) = Iberian populations in Spain, 5
(GBR) = British in England and Scotland, 6 (FIN) = Finnish in Finland, 7 (TSI) =
Toscani in Italy, 8 (JPT) = Japanese in Tokyo, Japan, 9 (CHB) = Han Chinese in Beijing,
China, 10 (CHS) = Southern Han Chinese, China, 11 (CDX) = Chinese Dai in
Xishuangbanna, China, 12 (KHV) = Kinh in Ho Chi Minh City, Vietnam.
Supplementary Figure S2. Content and distribution of the meta-genome for each
sample
About 1.12%, 0.43%, and 8.91% of the meta-genome were assigned to primates with
BLASTN, whereas bacterial DNA accounted for around 35%. About half of the reads
could not be classified into specific taxa.
Supplementary Figure S3. Read length distribution for each sample
Length distributions of sequence reads mapped to the reference genome with mapping quality greater than or equal to 35 are shown. PickingBases was used to remove duplicates.
Supplementary Figure S4. Pattern of postmortem misincorporation for each sample
Red line indicates C in the reference genome and T in the Jomon sample, and blue line indicates G in the reference genome and A in the Jomon sample.
Supplementary Figure S5. Pattern of postmortem depurination for each sample
Base composition of the human reference genome around the 5'- and 3'-ends of the sequence reads.
Supplementary Figure S6. Removing duplicates with MarkDuplicates and PickingBases
The procedure from library preparation to duplicate removal using MarkDuplicates (MD) and PickingBases (PB) is shown. a) A Y-shaped adapter was used for library preparation. Plus and minus strands are colored black and gray, respectively. Forward and reverse adapter sequences are shown as red and blue bars, respectively. Red-colored bases are mismatches compared to the reference genome. b) After PCR amplification, two different amplicons are produced from the same DNA template. c) After mapping the DNA reads to the reference genome, some PCR duplicates will be observed. Red-colored bases are mismatches compared to the reference genome, and such mismatches include SNPs, post-mortem changes, PCR errors, sequencing errors, and mapping errors. Three groups (groups 1, 2, and 3) are shown as examples to explain the difference between MD and PB. Since MD identifies reads having identical 5' positions as duplicates and chooses the read having the highest sum of base qualities among them, MD chooses two DNA reads from group 1 and (ideally) the single longest DNA read from group 2. PB instead creates one new read each from groups 1 and 2, and the other two DNA reads of group 2 are kept. To create the new read, PB picks the base having the highest base quality within each set of PCR duplicates at each site; if two or more alleles are detected, it either relies on the base matching the reference sequence or masks the site as “N” (blue bases) with the --use-n option. This step can remove some (not all) of the PCR and sequencing errors still present in MD output.
Supplementary Figure S7. Principal Component Analysis of Sanganji Jomon and 1000 Genomes Project worldwide humans based on 68,556 SNPs with MD
Supplementary Figure S8. Principal Component Analysis of Sanganji Jomon and 1000 Genomes Project worldwide humans based on transversion sites
(a) based on 24,631 transversion sites with PB, (b) based on 24,632 transversion sites with MD.
Supplementary Figure S9. PCA of Sanganji Jomon and non-Africans
Genetic relationships among the Sanganji Jomon and 1000 Genomes Project populations (Europeans and East Eurasians) were compared based on 53,955 SNP sites with PB. PC1 and PC4 divide the Sanganji Jomon from other populations. Since PC3 is the signal dividing FIN from other Europeans, we did not show it.
Supplementary Figure S10. PCA of Sanganji Jomon and non-Africans based on transversion sites
Genetic relationships among the Sanganji Jomon and 1000 Genomes Project populations (Europeans and East Eurasians) were compared based on 19,415 transversion sites with PB. PC1 and PC4 divide the Sanganji Jomon from other populations. Since PC3 is the signal dividing FIN from other Europeans, we did not show it.
Supplementary Figure S11. PCA of Sanganji Jomon, Native Americans, East Eurasians, and Sahulians
Genetic relationships among the Sanganji Jomon and HGDP populations (Native Americans, East Eurasians, and Sahulians) were compared based on 7,081 SNP sites with PB. PC1 and PC2 divide Sahulians and Native Americans, respectively, from East Eurasians. East Eurasians are the closest to the Sanganji Jomon.
Supplementary Figure S12. PCA of Sanganji Jomon and East Eurasians based on transversion sites
Genetic relationships among the Sanganji Jomon and 1000 Genomes Project populations from East Eurasia were compared based on 16,720 transversion sites with PB. The distribution of each population is basically the same as in Figure 1b, and no effect of post-mortem changes was observed.
Supplementary Figure S13. PCA of Sanganji Jomon and East Eurasians based on two independent datasets
Genetic relationships among the Sanganji Jomon and 1000 Genomes Project populations from East Eurasia were compared based on (a) 12,837 SNP sites from A1 (GAIIx) and (b) 33,531 SNP sites from B (index12) with PB. The distribution of each population is basically the same as in Figure 1b, and no effect of merging the two datasets was observed.
Supplementary Figure S14. PCA of Sanganji Jomon and East Eurasians
Genetic relationships among the Sanganji Jomon and HGDP populations from East Eurasia were compared based on 6,864 SNP sites with PB. PC1 and PC3 also show the uniqueness of the Sanganji Jomon within East Eurasians, as in Figure 2a.
Supplementary Figure S15. PCA of Sanganji Jomon, three Japanese populations, and other East Eurasians
Genetic relationships among the Sanganji Jomon, Ainu, Mainland Japanese, Ryukyuan, and 1000 Genomes Project populations from East Eurasia were compared based on 3,645 SNP sites with PB. PC1 and PC2 show the genetic similarity of the Sanganji Jomon and Ainu people, and PC3 divides the Ainu from the Sanganji Jomon.
Supplementary Figure S16. Maximum likelihood tree for Sanganji Jomon, 12 populations, and 5 individuals using all variant sites
A comparison of Sanganji Jomon, 1000 Genomes Project worldwide populations, Papuan, Karitiana, Mal’ta MA1, Ust’-Ishim, and Denisovan based on 43,310 SNP sites with PB. Denisovan was used as the outgroup. (a) to (j) Trees assuming zero to nine gene flow events, respectively.
Supplementary Figure S17. Maximum likelihood tree for Sanganji Jomon, 12 populations, and 5 individuals using transversion sites
A comparison of Sanganji Jomon, 1000 Genomes Project worldwide populations, Papuan, Karitiana, Mal’ta MA1, Ust’-Ishim, and Denisovan based on 15,455 transversion sites with PB. Denisovan was used as the outgroup. (a) to (j) Trees assuming zero to nine gene flow events, respectively.
Supplementary Figure S18. Maximum likelihood tree for Sanganji Jomon and 24 populations using all variant sites
A comparison of Sanganji Jomon and HGDP worldwide populations based on 7,529 SNP sites with PB. San was used as the outgroup.
Supplementary Figure S19. TreeMix tree without Sanganji Jomon
A comparison of 1000 Genomes Project worldwide populations, Papuan, and Denisovan based on 702,660 SNP sites. Denisovan was used as the outgroup and three migration events were estimated. The tree was drawn by using MEGA622. Values in red are bootstrap probabilities (%) for the adjacent internal branches. Arrows were added to this tree manually, and the colors of migration weight (ratio of gene flow) follow the TreeMix outputs. Values inside arrows are the ratio of gene flow. Bootstrap probabilities of the gene flow from JPT to the root of CHB and CHS, from CEU to Papuan, and from Papuan to Denisovan, estimated from 1,000 bootstrap replicate TreeMix outputs, are 89%, 86%, and 98%, respectively.
Supplementary Figure S20. Neighbor-joining tree of Sanganji Jomon and HGDP worldwide populations based on 7,529 SNP sites with PB
Supplementary Figure S21. Phylogenetic Network of Sanganji Jomon and HGDP worldwide populations based on 7,529 SNP sites with PB
Supplementary Figure S22. D-statistic tests of Sanganji Jomon and 1000 Genomes Project worldwide populations, Mal’ta MA1, and Ust’-Ishim based on 15,549 transversion sites with PB
Supplementary Figure S23. D-statistic tests of Sanganji Jomon, Chimpanzee, 1000 Genomes Project populations from East Eurasia, and Ust’-Ishim based on 14,978 transversion sites with PB
Strong genetic affinity between JPT and other East Eurasians was detected. The position of Sanganji Jomon in the TreeMix analyses would not be affected by sequence errors common to ancient DNA. Bars indicate standard errors.
Supplementary Figure S24. D-statistic tests of Sanganji Jomon, Chimpanzee, 1000 Genomes Project worldwide populations, Karitiana, Mal’ta MA1, and Ust’-Ishim based on 14,978 transversion sites with PB
Considering the tree (YRI, (non-East Eurasians, (Jomon, CHB))), no evidence of gene flow between Jomon and non-East Eurasians after the divergence was observed. Bars indicate standard errors.
Supplementary Figure S25. D-statistic tests of Sanganji Jomon and HGDP worldwide populations based on 7,529 SNP sites with PB
Considering the tree (San, (X, (Jomon, Han))), no evidence of gene flow between Jomon and population X after the divergence was observed. Bars indicate standard errors.
Supplementary Figure S26. D-statistic tests of Sanganji Jomon and HGDP
populations based on 7,529 SNP sites with PB
The tree considered is (San, (Y, (Jomon, X))), where population X is Native American or
East Eurasian and population Y is Sahulian. Bars indicate standard errors.
Supplementary Figure S27. D-statistic tests of Sanganji Jomon, 1000 Genomes
Project worldwide populations, Papuan, Melanesian, Karitiana, Mal’ta MA1, and
Ust’-Ishim with PB
(a) Based on 21,286 SNP sites; (b) based on 7,490 transversion sites. An affinity
between Melanesian and East Eurasians compared to Sanganji Jomon was inferred, but
the affinity with Papuan was not supported when using transversion sites. Bars
indicate standard errors.
Supplementary Figure S28. D-statistic tests of Sanganji Jomon, Chimpanzee, 1000
Genomes Project worldwide populations, Papuan, Karitiana, Mal’ta MA1, and
Ust’-Ishim with PB
(a) Based on 42,128 SNP sites; (b) based on 14,968 transversion sites. An affinity
between Papuan and East Eurasians compared to Sanganji Jomon was still not inferred,
as in Supplementary Figure S29, even though more SNP sites and different outgroups were
used. Bars indicate standard errors.
Supplementary Figure S29. D-statistic tests of Sanganji Jomon, Chimpanzee, 1000
Genomes Project worldwide populations, and Karitiana based on 14,978
transversion sites with PB
(a), (b) The outgroups of the tree were Karitiana and Sanganji Jomon, respectively. Both
panels show some skew from zero, but the Z-scores are not significant. Bars indicate
standard errors.
Supplementary Figure S30. D-statistic tests of Sanganji Jomon and HGDP based
on 7,529 SNP sites with PB
Population X is European, Sahulian, or East Eurasian, and population Y is Native
American. Bars indicate standard errors.
Supplementary Figure S31. D-statistic tests of Sanganji Jomon, 15 humans, and
archaic humans based on 224,646 transversion sites with PB
The tree considered is (Chimp, (Y, (San, X))), where population X is non-African and population Y
is an archaic human. No pair of individuals showed a significant Z-score, but
non-Africans, including Jomon, shift toward positive values compared with Africans. Bars
indicate standard errors.
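The D-statistic tests summarized in the captions above report a D value, its standard error, and a Z-score. The following Python sketch shows one common way such quantities can be computed from ABBA and BABA site counts with a delete-one block jackknife. It is an illustration only, not the authors' pipeline: the sign convention D = (ABBA - BABA) / (ABBA + BABA), the equal-weight jackknife, and the function names `d_stat` and `jackknife_d` are assumptions made for this example.

```python
import math


def d_stat(abba, baba):
    """D = (ABBA - BABA) / (ABBA + BABA); returns 0.0 when there are no informative sites."""
    total = abba + baba
    return (abba - baba) / total if total else 0.0


def jackknife_d(blocks):
    """Estimate D, its standard error, and Z from per-block (abba, baba) counts.

    `blocks` is a list of (abba, baba) tuples, one per genomic block.
    An unweighted delete-one jackknife is used (an assumption; blocks are
    treated as if they were of equal size).
    """
    tot_abba = sum(a for a, _ in blocks)
    tot_baba = sum(b for _, b in blocks)
    d = d_stat(tot_abba, tot_baba)

    n = len(blocks)
    # Leave-one-block-out estimates of D.
    loo = [d_stat(tot_abba - a, tot_baba - b) for a, b in blocks]
    mean_loo = sum(loo) / n
    var = (n - 1) / n * sum((x - mean_loo) ** 2 for x in loo)
    se = math.sqrt(var)
    z = d / se if se > 0 else float("nan")
    return d, se, z


if __name__ == "__main__":
    # Hypothetical counts for four blocks, for illustration only.
    example_blocks = [(30, 10), (28, 12), (33, 9), (25, 15)]
    d, se, z = jackknife_d(example_blocks)
    print(f"D = {d:.4f}, SE = {se:.4f}, Z = {z:.2f}")
```

A |Z| above roughly 3 is often taken as significant in this kind of test; the threshold used for the figures is not stated in this section.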
===== Perl Script for PB (picking bases) =====
If you have any questions about this Perl script, please contact Kirill Kryukov ([email protected]).
#!/usr/bin/env perl
#
# sam-merge-duplicates-picking-bases.pl
#
# Version 0.1.4 (August 3, 2015)
#
# Copyright (c) 2015 Kirill Kryukov
#
# This software is provided 'as-is', without any express or implied
# warranty. In no event will the authors be held liable for any damages
# arising from the use of this software.
#
# Permission is granted to anyone to use this software for any purpose,
# including commercial applications, and to alter it and redistribute it
# freely, subject to the following restrictions:
#
# 1. The origin of this software must not be misrepresented; you must not
# claim that you wrote the original software. If you use this software
# in a product, an acknowledgment in the product documentation would be
# appreciated but is not required.
# 2. Altered source versions must be plainly marked as such, and must not be
# misrepresented as being the original software.
# 3. This notice may not be removed or altered from any source distribution.
#
#
# Usage:
# sam-merge-duplicates-picking-bases.pl --ref-dir REFDIR [Options] <input.sam >output.sam
# Where:
# REFDIR is directory with reference FASTA files.
# Options:
# --ignore-strand - Merge reads regardless of strand (by default merges only same strand reads).
# --debug - Add alignment into output (output is no longer in SAM format).
# --use-n - Use N for ambiguous positions in merged read (those having multiple alleles).
#
# Reads and writes SAM format (when no "--debug" option is used).
#
# Reference directory must contain reference as one file per chromosome, named as:
# "chr1.fa", "chr2.fa", etc.
#
# For each group of reads that look like duplicates, merges them into single read.
#
# Two reads are considered duplicates if they:
# - Are mapped to the same chromosome and at the same starting position
# - Are mapped to reference region of same length
# - Have identically long soft-clipped parts (for both beginning and end).
# - Are mapped to same strand (unless --ignore-strand is specified).
#
# It's OK if they have different insertions/deletions.
#
# Merging means:
# - At sites where all reads in a cluster have same substitution, insertion, or deletion,
# it is preserved in the merged read.
# - At sites where only some reads have substitution, insertion, or deletion, reference
# sequence is used in the merged read.
# - At sites where there are different insertions, the majority among the shortest ones wins.
# - Quality at each merged site is computed as the maximum of all input qualities at this site.
#
# Limitations:
# - This script expects sorted input SAM file.
# - This script discards all reads mapped to reference other than 1..22,X,Y,MT (hardcoded).
# - This script only works with CIGAR containing M,I,D,S - hard clipping is not handled, as well as N.
# - Only reads with flag equal to 0 and 16 are processed.
# - This script only writes optional tags NM and MD for the merged reads.
# - Soft-clipped sequence and quality is simply copied from one of the reads
# (instead of checking which soft-clipped sequence is major, or finding maximum quality).
# - RNEXT, PNEXT and TLEN of the merged read are copied from one of the input reads without any verification.
# - QNAME, FLAG and MAPQ of the merged reads are copied from one of the input reads verbatim.
#
# This script is tested on limited data and may not work on other data.
#
# Let me know if you have suggestions. In case of any issues with this script, please send me
# example input and the expected output.
#
# If you use --debug option, the output will contain complete alignment and other information for
# each cluster of duplicate reads.
#
use strict;
use File::Basename qw(basename);
use File::Glob qw(:bsd_glob);
use File::Slurp;
use Getopt::Long;
$| = 1;
my ($start_time,$debug,$ignore_strand,$use_n,$ref_dir) = (time,0,0,0);
GetOptions(
"debug" => \$debug,
"ref-dir=s" => \$ref_dir,
"ignore-strand" => \$ignore_strand,
"use-n" => \$use_n
);
if (!defined $ref_dir) { die "Reference directory is not specified, please use --ref-dir option\n"; }
if (!-e $ref_dir) { die "Can't find reference directory\n"; }
my @ref_chr_to_use = (1..22,'X','Y','MT');
my %use_chr = map { $_ => 1 } @ref_chr_to_use;
for (@ref_chr_to_use) { if (!-e "$ref_dir/chr$_.fa") { die "Can't find reference file \"$ref_dir/chr$_.fa\"\n"; } }
my ($r_name,$r_seq,$r_len,$cluster_start) = ('','',0,-1);
my ($n_reads_total,$n_single_reads,$n_duplicate_clusters,$n_duplicate_reads,$n_reads_saved,
$n_skipped_by_flag,$n_skipped_by_chromosome) = (0,0,0,0,0,0,0);
my @cluster = ();
#my %flag_count = ();
#my %cigar_char_count = ();
binmode STDIN;
binmode STDOUT;
while (<STDIN>)
{
s/[\x0D\x0A]+$//;
if (/^@/) { print "$_\n"; next; }
my @fields = split /\t/, $_;
my $n_fields = scalar(@fields);
my ($qname,$flag,$rname,$pos,$mapq,$cigar,$rnext,$pnext,$tlen,$seq,$qual) = @fields;
$n_reads_total++;
#$flag_count{$flag}++;
if ($pos != $cluster_start or $rname ne $r_name)
{
if ($rname eq $r_name and $pos < $cluster_start) { die "SAM file is not sorted! Following read is out of order:\n$_\n"; }
process_cluster();
$cluster_start = $pos;
if ($rname ne $r_name)
{
if (exists $use_chr{$rname}) { load_reference_chromosome($rname); }
$r_name = $rname;
}
}
if ($cigar eq '*' or $seq eq '*') { next; }
if (!exists $use_chr{$rname}) { $n_skipped_by_chromosome++; next; }
if ($flag ne '0' and $flag ne '16') { $n_skipped_by_flag++; next; }
#while ($cigar =~ /[0-9]+([MIDNSHPX=])/g) { $cigar_char_count{$1}++; }
#if ($cigar =~ /[0-9]+S/) { print STDERR ' ', $qname; }
my $cs = scalar(@cluster);
@{$cluster[$cs]} = @fields;
}
process_cluster();
print STDERR "\n";
print STDERR "Input has $n_reads_total reads:\n";
if ($n_skipped_by_chromosome > 0) { print STDERR " discarded $n_skipped_by_chromosome reads mapped to reference sequence other than chr. 1-22,X,Y,MT\n"; }
if ($n_skipped_by_flag > 0) { print STDERR " discarded $n_skipped_by_flag reads with flag different from 0 and 16\n"; }
print STDERR " merged $n_duplicate_reads duplicate reads into $n_duplicate_clusters merged reads\n";
print STDERR " kept the other $n_single_reads reads unchanged\n";
#print STDERR "Flags:\n";
#foreach my $flag (sort { $a <=> $b } keys %flag_count) { print STDERR ' ', $flag, ': ', $flag_count{$flag}, "\n"; }
#print STDERR "CIGAR chars:\n";
#foreach my $c (sort { $a cmp $b } keys %cigar_char_count) { print STDERR ' ', $c, ': ', $cigar_char_count{$c}, "\n"; }
my $secs = time - $start_time;
print STDERR "$secs second", (($secs==1)?'':'s'), " elapsed\n";
sub process_cluster
{
if (!scalar @cluster) { return; }
if (scalar(@cluster) == 1) { print join("\t",@{$cluster[0]}), "\n"; $n_single_reads++; @cluster = ();
return; }
my (@soft_clip_start,@soft_clip_end,@ins_len,@del_len,@match_len,@slen,@end_pos);
foreach my $i (keys @cluster)
{
if ($cluster[$i]->[5] =~ /^([0-9]+)S/) { $soft_clip_start[$i] = $1; } else { $soft_clip_start[$i] = 0; }
if ($cluster[$i]->[5] =~ /([0-9]+)S$/) { $soft_clip_end[$i] = $1; } else { $soft_clip_end[$i] = 0; }
$ins_len[$i] = 0;
while ($cluster[$i]->[5] =~ /([0-9]+)I/g) { $ins_len[$i] += $1; }
$del_len[$i] = 0;
while ($cluster[$i]->[5] =~ /([0-9]+)D/g) { $del_len[$i] += $1; }
$match_len[$i] = 0;
while ($cluster[$i]->[5] =~ /([0-9]+)[M=X]/g) { $match_len[$i] += $1; }
$slen[$i] = $match_len[$i] + $del_len[$i];
$end_pos[$i] = $cluster[$i]->[3] + $slen[$i] - 1;
}
my %subclusters;
foreach my $i (keys @cluster)
{
my $key = "$soft_clip_start[$i]-$end_pos[$i]-$soft_clip_end[$i]";
if (!$ignore_strand) { $key .= '-' . (($cluster[$i]->[1] & 16) ? 'minus' : 'plus'); }
push @{$subclusters{$key}->{'indexes'}}, $i;
$subclusters{$key}->{'start_pos'} = $cluster[$i]->[3];
$subclusters{$key}->{'end_pos'} = $end_pos[$i];
$subclusters{$key}->{'slen'} = $slen[$i];
$subclusters{$key}->{'soft_clip_start'} = $soft_clip_start[$i];
$subclusters{$key}->{'soft_clip_end'} = $soft_clip_end[$i];
}
#if (scalar(keys %subclusters) != 1) { print "\n----- multiple clusters starting at the same position! -----\n"; }
foreach my $k (sort { $a cmp $b } keys %subclusters)
{
my $n = scalar(@{$subclusters{$k}->{'indexes'}});
if ($n == 1) { print join("\t",@{$cluster[$subclusters{$k}->{'indexes'}->[0]]}), "\n";
$n_single_reads++; next; }
$n_duplicate_clusters++;
$n_duplicate_reads += $n;
my $slen = $subclusters{$k}->{'slen'};
my $start_pos = $subclusters{$k}->{'start_pos'};
my $end_pos = $subclusters{$k}->{'end_pos'};
my $soft_clip_start = $subclusters{$k}->{'soft_clip_start'};
my $soft_clip_end = $subclusters{$k}->{'soft_clip_end'};
my $rs = substr($r_seq,$start_pos-1,$slen);
my %cigar_char_count = ();
for (my $ii=0; $ii<$n; $ii++)
{
my $i = $subclusters{$k}->{'indexes'}->[$ii];
while ($cluster[$i]->[5] =~ /[0-9]+([MIDNSHPX=])/g) { $cigar_char_count{$1}++; }
}
if ($debug)
{
print "\n----- cluster of $n reads, mapped length: $slen bp (on reference), key: $k -----\n";
if (exists $cigar_char_count{'S'}) { print "----- cluster is soft-clipped -----\n"; }
if (exists $cigar_char_count{'D'}) { print "----- cluster has deletions -----\n"; }
if (exists $cigar_char_count{'I'}) { print "----- cluster has insertions -----\n"; }
print "----- ref-seq: $rs -----\n";
}
my (@aln,@alnq,@ins,@insq,@mininslen,@maxinslen,@ast,@out,@outq,@outins,@outinsq);
# Aligning nucleotides according to CIGAR string.
for (my $ii=0; $ii<$n; $ii++)
{
my $i = $subclusters{$k}->{'indexes'}->[$ii];
my $ap = 0;
my $qp = $soft_clip_start;
while ($cluster[$i]->[5] =~ /([0-9]+)([MIDNPX=])/g)
{
my ($len,$op) = ($1,$2);
if ($op eq 'D') {
for (my $p=0; $p<$len; $p++)
{
$aln[$ii]->[$ap+$p] = '-';
$alnq[$ii]->[$ap+$p] = '';
}
$ap += $len;
}
elsif ($op eq 'I')
{
$ins[$ii]->[$ap] = substr($cluster[$i]->[9],$qp,$len);
$insq[$ii]->[$ap] = substr($cluster[$i]->[10],$qp,$len);
$qp += $len;
}
elsif ($op eq 'M' or $op eq '=' or $op eq 'X')
{
for (my $p=0; $p<$len; $p++)
{
$aln[$ii]->[$ap+$p] = substr($cluster[$i]->[9],$qp+$p,1);
$alnq[$ii]->[$ap+$p] = substr($cluster[$i]->[10],$qp+$p,1);
}
$ap += $len;
$qp += $len;
}
else { die "Unsupported CIGAR component: $op\n"; }
}
}
for (my $p=0; $p<$slen; $p++) { $aln[$n]->[$p] = substr($rs,$p,1); }
# Merging aligned nucleotides
for (my $p=0; $p<$slen; $p++)
{
my %cn = ();
for (my $ii=0; $ii<$n; $ii++) { $cn{$aln[$ii]->[$p]}++; }
my $maxcn = 0;
foreach my $c (keys %cn) { if ($cn{$c} > $maxcn) { $maxcn = $cn{$c}; } }
if ($maxcn == $n)
{
$out[$p] = $aln[0]->[$p];
$ast[$p] = ($aln[0]->[$p] eq '-') ? 'D' : ($aln[0]->[$p] eq $aln[$n]->[$p]) ? '=' : 'X';
}
else
{
$out[$p] = $use_n ? 'N' : $aln[$n]->[$p];
$ast[$p] = '~';
}
}
# Merging insertions
my $has_disagreeing_insertions = 0;
my $has_interesting_insertions = 0;
for (my $p=0; $p<$slen; $p++)
{
my ($min_ins_len,$max_ins_len) = (1000000,0);
for (my $ii=0; $ii<$n; $ii++)
{
my $l = defined($ins[$ii]->[$p]) ? length($ins[$ii]->[$p]) : 0;
if ($l < $min_ins_len) { $min_ins_len = $l; }
if ($l > $max_ins_len) { $max_ins_len = $l; }
}
$mininslen[$p] = $min_ins_len;
$maxinslen[$p] = $max_ins_len;
if ($min_ins_len < 1) { next; }
my %ins_num = ();
my %ins_len_num = ();
for (my $ii=0; $ii<$n; $ii++)
{
my $l = defined($ins[$ii]->[$p]) ? length($ins[$ii]->[$p]) : 0;
if ($l == $min_ins_len) { $ins_num{$ins[$ii]->[$p]}++; }
$ins_len_num{$l}++;
}
if (scalar(keys %ins_num) > 1) { $has_disagreeing_insertions = 1; }
if (scalar(keys %ins_len_num) > 1) { $has_interesting_insertions = 1; }
my $major_ins = (sort { $ins_num{$b} <=> $ins_num{$a} } keys %ins_num)[0];
$outins[$p] = $major_ins;
}
if ($debug)
{
#if ($has_disagreeing_insertions) { print "----- cluster has disagreeing insertions -----\n"; }
#if ($has_interesting_insertions) { print "----- cluster has insertions of varying lengths -----\n"; }
print "----- input reads: -----\n";
foreach my $i (@{$subclusters{$k}->{'indexes'}}) { print join("\t",@{$cluster[$i]}), "\n"; }
print "----- alignment: -----\n";
print 'ref ';
for (my $p=0; $p<$slen; $p++)
{
if ($maxinslen[$p] > 0) { print (('-') x ($maxinslen[$p])); }
print $aln[$n]->[$p];
}
print "\n";
for (my $ii=0; $ii<$n; $ii++)
{
my $aname = sprintf('%3d',$ii+1);
print $aname, ' ';
for (my $p=0; $p<$slen; $p++)
{
if ($maxinslen[$p] > 0)
{
my $l = defined($ins[$ii]->[$p]) ? length($ins[$ii]->[$p]) : 0;
if (defined($ins[$ii]->[$p])) { print $ins[$ii]->[$p]; }
if ($l < $maxinslen[$p]) { print (('-') x ($maxinslen[$p]-$l)); }
}
print $aln[$ii]->[$p];
}
print "\n";
}
print ' ';
for (my $p=0; $p<$slen; $p++)
{
if ($mininslen[$p] > 0) { print (('I') x ($mininslen[$p])); }
if ($maxinslen[$p] > $mininslen[$p]) { print (('-') x ($maxinslen[$p]-$mininslen[$p])); }
print $ast[$p];
}
print "\n";
print 'out ';
for (my $p=0; $p<$slen; $p++)
{
if ($maxinslen[$p] > 0)
{
my $l = defined($outins[$p]) ? length($outins[$p]) : 0;
if (defined($outins[$p])) { print $outins[$p]; }
if ($l < $maxinslen[$p]) { print (('-') x ($maxinslen[$p]-$l)); }
}
print $out[$p];
}
print "\n";
}
# Constructing output CIGAR
my $out_cigar_chars = '';
if ($soft_clip_start > 0) { $out_cigar_chars .= ('S') x $soft_clip_start; }
for (my $p=0; $p<$slen; $p++)
{
if (defined($outins[$p])) { $out_cigar_chars .= ('I') x length($outins[$p]); }
$out_cigar_chars .= ($out[$p] eq '-') ? 'D' : 'M';
}
if ($soft_clip_end > 0) { $out_cigar_chars .= ('S') x $soft_clip_end; }
my $out_cigar = '';
while ($out_cigar_chars =~ /([MIDS])\1*/g) { $out_cigar .= length($&) . $1; }
# Constructing output sequence.
my $out_seq = '';
if ($soft_clip_start > 0) { $out_seq .= substr($cluster[$subclusters{$k}->{'indexes'}->[0]]->[9],0,$soft_clip_start); }
for (my $p=0; $p<$slen; $p++)
{
if (defined($outins[$p])) { $out_seq .= $outins[$p]; }
if ($out[$p] ne '-') { $out_seq .= $out[$p]; }
}
if ($soft_clip_end > 0) { $out_seq .= substr($cluster[$subclusters{$k}->{'indexes'}->[0]]->[9],-$soft_clip_end,$soft_clip_end); }
# Constructing output quality.
my $out_qual = '';
if ($soft_clip_start > 0) { $out_qual .= substr($cluster[$subclusters{$k}->{'indexes'}->[0]]->[10],0,$soft_clip_start); }
for (my $p=0; $p<$slen; $p++)
{
if (defined($outins[$p]))
{
for (my $pp=0; $pp<length($outins[$p]); $pp++)
{
my $q = 33;
for (my $ii=0; $ii<$n; $ii++)
{
if (!defined($ins[$ii]->[$p])) { next; }
if ($ins[$ii]->[$p] ne $outins[$p]) { next; }
my $qa = ord(substr($insq[$ii]->[$p],$pp,1));
if ($qa > $q) { $q = $qa; }
}
$out_qual .= chr($q);
}
}
if ($out[$p] ne '-')
{
my $q = 33;
for (my $ii=0; $ii<$n; $ii++)
{
if ($aln[$ii]->[$p] ne $out[$p]) { next; }
my $qa = ord($alnq[$ii]->[$p]);
if ($qa > $q) { $q = $qa; }
}
$out_qual .= chr($q);
}
}
if ($soft_clip_end > 0) { $out_qual .= substr($cluster[$subclusters{$k}->{'indexes'}->[0]]->[10],-$soft_clip_end,$soft_clip_end); }
# Computing NM tag (edit distance from reference)
my $NM = 0;
for (my $p=0; $p<$slen; $p++)
{
if (defined($outins[$p])) { $NM += length($outins[$p]); }
if ($out[$p] ne $aln[$n]->[$p]) { $NM++; }
}
# Constructing MD tag (reference bases that differ from read)
my $MD = '';
for (my $p=0; $p<$slen; $p++)
{
if ($out[$p] eq '-') { $MD .= '-' . $aln[$n]->[$p]; }
else { $MD .= ($out[$p] eq $aln[$n]->[$p]) ? '=' : ('.' . $aln[$n]->[$p]); }
}
while ($MD =~ /(\=+)/) { $MD = $` . length($1) . $'; }
$MD =~ s/([a-zA-Z])(\.)/${1}0$2/g;
while ($MD =~ /(\-[a-zA-Z]+)\-([a-zA-Z]+)/) { $MD = $` . $1 . $2 . $'; }
$MD =~ s/\-/^/g;
$MD =~ s/\.//g;
if ($MD !~ /^\d/) { $MD = '0' . $MD; }
if ($MD !~ /\d$/) { $MD .= '0'; }
# Printing the output read.
if ($debug) { print "----- merged read: -----\n"; }
my $i0 = $subclusters{$k}->{'indexes'}->[0];
print $cluster[$i0]->[0], "\t", $cluster[$i0]->[1], "\t", $cluster[$i0]->[2], "\t", $cluster[$i0]->[3], "\t",
$cluster[$i0]->[4];
print "\t", $out_cigar, "\t", $cluster[$i0]->[6], "\t", $cluster[$i0]->[7], "\t", $cluster[$i0]->[8];
print "\t$out_seq\t$out_qual\tNM:i:$NM\tMD:Z:$MD\n";
if ($debug) { print "\n"; }
}
@cluster = ();
}
sub load_reference_chromosome
{
my ($name) = @_;
my $file = "$ref_dir/chr$name.fa";
if (!-e $file) { die "Can't find reference file \"$file\"\n"; }
open(my $R,'<',$file) or die "Can't open \"$file\"\n";
binmode $R;
print STDERR "Loading chr$name ..";
$r_seq = '';
<$R>;
while (<$R>) { s/[\x0D\x0A]+$//; $r_seq .= $_; }
close $R;
$r_len = length($r_seq);
print STDERR " OK - $r_len bp\n";
}
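The merging rule described in the header comments of the script above (keep a base or deletion only where all duplicate reads agree, otherwise fall back to the reference base or N, and take the maximum quality among supporting reads) can be sketched in Python as follows. This is a simplified illustration, not a port of the Perl script: it assumes the reads have already been aligned column by column against the reference (with `-` for deleted positions), it ignores insertions and soft clipping, and the function name `merge_columns` is hypothetical.

```python
def merge_columns(reads, quals, ref, use_n=False):
    """Merge duplicate reads column by column.

    reads: aligned read strings, all the same length as `ref`; '-' marks a deletion.
    quals: Phred+33 quality strings of the same length (any placeholder at '-').
    ref:   reference sequence for the same interval.
    use_n: emit 'N' instead of the reference base where reads disagree.
    Returns (merged_sequence, merged_quality).
    """
    out_seq, out_qual = [], []
    for p, ref_base in enumerate(ref):
        column = [r[p] for r in reads]
        if all(c == column[0] for c in column):
            base = column[0]                      # unanimous: keep base or deletion
        else:
            base = "N" if use_n else ref_base     # disagreement: fall back
        if base == "-":
            continue                              # deletion shared by all reads
        # Maximum quality among reads that carry the emitted base;
        # '!' (Phred 0, ASCII 33) when no read supports it.
        supporting = [ord(q[p]) for r, q in zip(reads, quals) if r[p] == base]
        out_seq.append(base)
        out_qual.append(chr(max(supporting, default=33)))
    return "".join(out_seq), "".join(out_qual)


if __name__ == "__main__":
    reads = ["ACGT", "ACTT", "ACGT"]   # the second read disagrees at column 3
    quals = ["IIII", "5555", "FFFF"]
    print(merge_columns(reads, quals, "ACGT"))
    print(merge_columns(reads, quals, "ACGT", use_n=True))
```

Falling back to the reference at discordant sites is conservative for ancient DNA, where post-mortem damage and PCR errors tend to appear in only some copies of a duplicated molecule.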
===== Five Perl Scripts for obtaining bootstrap values from TreeMix outputs =====
We provide the following five Perl scripts for obtaining bootstrap values from TreeMix outputs.
(A) 1Newick_to_Split_name_change.pl
(B) 2Newick_to_Split_one_original.pl
(C) 3Newick_to_Split_many.pl
(D) 4Split_Count.pl
(E) 5Split_to_Newick+bootstrap.pl
We transform the Newick format into a splits matrix format to simplify counting interior branches (= splits) for
obtaining bootstrap probabilities. For example, the splits matrix and population names corresponding to the
TreeMix tree shown in Figure 4 are as follows.
--------------------------------------
Population ID:
000000000111111111 Bootstrap
123456789012345678 Frequency
--------------------------------------
Split 01: 000000000000000011 910
Split 02: 000000000000011000 123
Split 03: 000000000000011100 380
Split 04: 000000000000011111 982
Split 05: 000010000000011111 956
Split 06: 011101111111100000 998
Split 07: 011100111111100000 973
Split 08: 011000111111100000 952
Split 09: 000000000100100000 888
Split 10: 000000001010000000 701
Split 11: 000000001110100000 701
Split 12: 000000001111100000 975
Split 13: 001000001111100000 996
Split 14: 010000110000000000 1000
Split 15: 000000110000000000 526
======================================
List of Populations
ID: name
1 Sanganji Jomon
2 Denisovan
3 Malta_MA1
4 Ust_Ishim
5 Karitiana
6 Papuan
7 YRI
8 LWK
9 CEU
10 IBS
11 GBR
12 FIN
13 TSI
14 JPT
15 CHB
16 CHS
17 CDX
18 KHV
See Chapter 3 of “Introduction to Evolutionary Genomics” (Naruya Saitou, 2014, Springer) for an
explanation of the splits matrix. If you have any questions about these Perl scripts, please contact Naruya Saitou.
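To illustrate the splits representation used in the table above, the following Python sketch converts a Newick topology whose leaves are already numbered 1..n into a set of bit strings, one per interior branch. It is a minimal illustration under stated assumptions, not the algorithm of the Perl scripts listed below: it uses a small recursive parser rather than iterative pair replacement, it ignores branch lengths, and the function name `newick_splits` is hypothetical. As in the table, each split is written so that population 1 is on the "0" side.

```python
import re


def newick_splits(newick, n_leaves):
    """Return the non-trivial splits of a Newick topology as a set of bit strings.

    Leaves must be labelled with integers 1..n_leaves. Branch lengths and
    whitespace are discarded. A split is non-trivial when both sides contain
    at least two leaves.
    """
    s = re.sub(r"\s+", "", newick).rstrip(";")
    s = re.sub(r":[^,();]+", "", s)  # drop branch lengths such as ":0.0123" or ":1e-05"
    pos = 0
    splits = set()

    def record(leaves):
        if 1 < len(leaves) < n_leaves - 1:
            bits = ["1" if i in leaves else "0" for i in range(1, n_leaves + 1)]
            if bits[0] == "1":  # normalize so that leaf 1 is always on the '0' side
                bits = ["0" if b == "1" else "1" for b in bits]
            splits.add("".join(bits))

    def parse():
        nonlocal pos
        if s[pos] == "(":
            pos += 1  # consume '('
            leaves = set()
            while True:
                leaves |= parse()
                if s[pos] == ",":
                    pos += 1
                    continue
                pos += 1  # consume ')'
                break
            # Skip an optional internal node label.
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
            record(leaves)
            return leaves
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return {int(s[start:pos])}

    parse()
    return splits


if __name__ == "__main__":
    # Hypothetical five-leaf tree with branch lengths, for illustration only.
    tree = "((1:0.1,2:0.2):0.05,(3:0.1,4:1e-05):0.2,5:0.3);"
    for split in sorted(newick_splits(tree, 5)):
        print(split)
```

Representing each interior branch as a fixed-length bit string makes comparing trees a matter of string equality, which is what makes counting splits across bootstrap replicates straightforward.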
(A) 1Newick_to_Split_name_change.pl
print "Please type population file ==> ";
$pop_file = <STDIN>;
chomp($pop_file);
open (FILE2,$pop_file) or die "$!";
@poplist=<FILE2>;
close(FILE2);
print @poplist;
@popname=();
@popno=();
foreach $poplist(@poplist) {
($no,$popname) = split(/\s+/,$poplist);
push(@popno,$no);
push(@popname,$popname);
$tree_topology =~ s/$popname:/$no:/;
}
print "Please type name of new folder which copy treeout files ==> ";
$folder = <STDIN>;
chomp($folder);
opendir (DIR, $folder) or die "$folder: $!";
@dirs = readdir(DIR);
foreach $dir(@dirs) {
if ( $dir =~ /^\./) {
next;
}
$filename = "$folder/$dir";
open (FILE2,$filename) or die "$!";
@split_row=<FILE2>;
close(FILE2);
#print join("\n",@split_row);<STDIN>;
for ($i=0;$i<@poplist;$i++) {
$split_row[0] =~ s/$popname[$i]/$popno[$i]/;
}
#print join("\n",$split_row[0]);<STDIN>;
open (NEWFILE, "> $filename") or die "$!";
print NEWFILE $split_row[0];
close (NEWFILE);
}
closedir(DIR);
end_proc:
($sec,$min,$hour) = localtime();
print "Computation ended normally. $hour:$min\nThe new treeout file is created in '$folder'.\n";
(B) 2Newick_to_Split_one_original.pl
($sec,$min,$hour,$mday,$mon,$year) = localtime();
$year += 1900;
$mon += 1;
print "start Newick_to_Split_one.pl $year/$mon/$mday $hour:$min\n";
start:
print 'This perl program is written by Mizuguchi Masako.',"\n";
print 'Algorithm is provided by Saitou Naruya.',"\n";
print 'Version Date: Nov 21, A.S. 0015 (2015 A.D.)',"\n";
print 'This script transforms Newick format file produced by TreeMix to split list format.',"\n";
print "Please type Newick format file ==> ";
$in_file = <STDIN>;
chomp($in_file);
if ($in_file =~ /^(.*)\.treeout$/) {
$out_file = $1 . "_original.split";
$out_file2 = $1 . "_original.popout";
$out_file3 = $1 . "_original.split+tree";
}
open (FILE,$in_file) or die "$!";
@line=<FILE>;
close(FILE);
$tree_topology = $line[0];
$tree_topology =~ /^(.*);/;
$tree_topology = $1;
open (OUTFILE2,"> $out_file2") or die "$!";
print OUTFILE2 $tree_topology, "\n\n";
print "Please type population file ==> ";
$pop_file = <STDIN>;
chomp($pop_file);
open (FILE2,$pop_file) or die "$!";
@poplist=<FILE2>;
close(FILE2);
print OUTFILE2 @poplist;
@popname=();
@popno=();
foreach $poplist(@poplist) {
($no,$popname) = split(/\s+/,$poplist);
push(@popno,$no);
push(@popname,$popname);
$tree_topology =~ s/$popname:/$no:/;
}
print OUTFILE2 "\n",$tree_topology,"\n\n";
$OTUs_no = $no;
while ($tree_topology =~ /(\:-?\d+\.?\d*e?\-?\d*)[\,\)]/g) {
$tree_topology =~ s/$1//;
}
#print OUTFILE2 "\n",$tree_topology,"\n\n",;
LOOP:
@row_list = ();
for ($i=0;$i<$OTUs_no;$i++) {
@one_row = ();
for ($j=0;$j<$OTUs_no;$j++) {
if ($j == $i ) {
push(@one_row,1);
} else {
push(@one_row,0);
}
}
push(@row_list,join('',@one_row));
}
$new_no = $OTUs_no;
@new_row = ();
$new_row = 0;
while ($tree_topology =~ /(\(\d+\,\d+\))/g) {
$new_no++;
# Replace the matched pair "(a,b)" with the number of the new internal node.
$tree_topology =~ s/\Q$1\E/$new_no/;
#print $&;<STDIN>;
$& =~ /(\d+)\,(\d+)/;
@row1=split(//,$row_list[$1-1]);
@row2=split(//,$row_list[$2-1]);
@one_row=();
$total = 0;
for ($j=0;$j<$OTUs_no;$j++) {
$total = $total + $row1[$j]+$row2[$j];
push(@one_row,$row1[$j]+$row2[$j]);
}
push(@popno,"($popno[$1-1],$popno[$2-1])");
push(@popname,"($popname[$1-1],$popname[$2-1])");
#print join("\n",@popno);<STDIN>;
#print join("\n",@popname);<STDIN>;
if ($total > $OTUs_no-2 ) {
last;
}
push(@new_row,,join('',@one_row));
push(@row_list,join('',@one_row));
$new_row++;
if ($new_row>=$OTUs_no-3) {
last;
}
}
for ($i=0;$i<@new_row;$i++) {
@one_row = ();
@one_row = split(//,$new_row[$i]);
if ($one_row[0] == 0) {
next;
}
for ($j=0;$j<@one_row;$j++) {
if ($one_row[$j] == 0) {
$one_row[$j] = 1;
} else {
$one_row[$j] = 0;
}
}
$new_row[$i] = join('',@one_row);
}
open (OUTFILE,"> $out_file") or die "$!";
print OUTFILE join("\n",@new_row),"\n";
print OUTFILE2 join("\n",@new_row),"\n";
open (OUTFILE3,"> $out_file3") or die "$!";
for ($i=0;$i<@new_row;$i++) {
print OUTFILE3 "$new_row[$i] $popname[$i+$OTUs_no]\n";
#print OUTFILE3 "$new_row[$i] $popno[$i+$OTUs_no]\n";
}
close(OUTFILE);
close(OUTFILE2);
close(OUTFILE3);
end_proc:
($sec,$min,$hour) = localtime();
print "Computation ended normally. $hour:$min\nThe split list of the original tree was written to '$out_file' and the check list to '$out_file2'.\nThe number of populations compared is $OTUs_no\n";
(C) 3Newick_to_Split_many.pl
($sec,$min,$hour,$mday,$mon,$year) = localtime();
$year += 1900;
$mon += 1;
print "start Newick_to_Split.pl $year/$mon/$mday $hour:$min\n";
start:
print 'This perl program is written by Mizuguchi Masako.',"\n";
print 'Algorithm is provided by Saitou Naruya.',"\n";
print 'Version Date: Nov 21, A.S. 0015 (2015 A.D.)',"\n";
print 'This script transforms Newick format file produced by TreeMix to split list format.',"\n";
print "Please type common part of Newick format file ==> ";
$file_before = <STDIN>;
chomp($file_before);
print "Please type number of bootstraping pseudosamples ==> ";
$bootstrap_no = <STDIN>;
chomp($bootstrap_no);
print "Please type number of populations compared ==> ";
$OTUs_no = <STDIN>;
chomp($OTUs_no);
chomp($file);
for ($no=1;$no<=$bootstrap_no;$no++) {
if ($no < 10) {
$in_file = $file_before . "000" . $no . ".treeout";
$out_file = $file_before . "000" . $no . ".split";
} elsif ($no < 100) {
$in_file = $file_before . "00" . $no . ".treeout";
$out_file = $file_before . "00" . $no . ".split";
} elsif ($no < 1000) {
$in_file = $file_before . "0" . $no . ".treeout";
$out_file = $file_before . "0" . $no . ".split";
} else {
$in_file = $file_before . $no . ".treeout";
$out_file = $file_before . $no . ".split";
}
open (FILE,$in_file) or die "$!";
@line=<FILE>;
close(FILE);
$tree_topology = $line[0];
$tree_topology =~ /^(.*);/;
$tree_topology = $1;
open (OUTFILE,"> $out_file") or die "$!";
LOOP:
@row_list = ();
for ($i=0;$i<$OTUs_no;$i++) {
@one_row = ();
for ($j=0;$j<$OTUs_no;$j++) {
if ($j == $i ) {
push(@one_row,1);
} else {
push(@one_row,0);
}
}
push(@row_list,join('',@one_row));
}
while ($tree_topology =~ /(\:-?\d+\.?\d*e?\-?\d*)[\,\)]/g) {
$tree_topology =~ s/$1//;
}
$new_no = $OTUs_no;
@new_row = ();
$new_row = 0;
while ($tree_topology =~ /(\(\d+\,\d+\))/g) {
$new_no++;
# Replace the matched pair "(a,b)" with the number of the new internal node.
$tree_topology =~ s/\Q$1\E/$new_no/;
$& =~ /(\d+)\,(\d+)/;
@row1=split(//,$row_list[$1-1]);
@row2=split(//,$row_list[$2-1]);
@one_row=();
$total = 0;
for ($j=0;$j<$OTUs_no;$j++) {
$total = $total + $row1[$j]+$row2[$j];
push(@one_row,$row1[$j]+$row2[$j]);
}
if ($total > $OTUs_no-2 ) {
last;
}
push(@new_row,,join('',@one_row));
push(@row_list,join('',@one_row));
$new_row++;
if ($new_row>=$OTUs_no-3) {
last;
}
}
for ($i=0;$i<@new_row;$i++) {
@one_row = ();
@one_row = split(//,$new_row[$i]);
if ($one_row[0] == 0) {
next;
}
for ($j=0;$j<@one_row;$j++) {
if ($one_row[$j] == 0) {
$one_row[$j] = 1;
} else {
$one_row[$j] = 0;
}
}
$new_row[$i] = join('',@one_row);
}
print OUTFILE join("\n",@new_row),"\n";
close(OUTFILE);
}
end_proc:
($sec,$min,$hour) = localtime();
print "Computation ended normally. $hour:$min\n";
(D) 4Split_Count.pl
#use strict;
($sec,$min,$hour,$mday,$mon,$year) = localtime();
$year += 1900;
$mon += 1;
print "start Split_Count.pl $year/$mon/$mday $hour:$min\n";
start:
print 'This perl program is written by Mizuguchi Masako.',"\n";
print 'Algorithm is provided by Saitou Naruya.',"\n";
print 'Version Date: Nov 21, A.S. 0015 (2015 A.D.)',"\n";
print 'This script counts the number of splits to obtain bootstrap probabilities.',"\n";
print "Please type file name for split list of original tree produced by TreeMix ==> ";
$file = <STDIN>;
chomp($file);
print "Please type name of folder (or directory) which contains split list files ==> ";
$folder = <STDIN>;
chomp($folder);
open (FILE,$file) or die "$!";
@row=<FILE>;
close(FILE);
opendir (DIR, $folder) or die "$folder: $!";
@dirs = readdir(DIR);
@count = ();
for($i=0;$i<@row;$i++) {
$count[$i] = 0;
}
foreach $dir(@dirs) {
if ( $dir =~ /^\./) {
next;
}
if ($dir !~ /split/) {
next;
}
$filename = "$folder\/$dir";
open (FILE2,$filename) or die "$!";
@split_row=<FILE2>;
close(FILE2);
for ($i=0;$i<@row;$i++) {
for ($j=0;$j<@split_row;$j++) {
if ($row[$i] eq $split_row[$j]) {
$count[$i]++;
}
}
}
}
closedir(DIR);
$outfile = $folder . "_count.txt";
open (OUTFILE,"> $outfile") or die "$!";
for ($i=0;$i<@row;$i++) {
chomp($row[$i]);
print OUTFILE "$row[$i] $count[$i] \n";
}
close(OUTFILE);
end_proc:
($sec,$min,$hour) = localtime();
print "Computation ended normally. $hour:$min\nThe output file is '$outfile'.\n";
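The counting step of the script above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the helper name count_split_support is hypothetical, the four replicate split lists are made up, and each split is counted at most once per replicate.

```perl
use strict;
use warnings;

# Hypothetical helper: for each split of the original tree, count the number
# of replicate trees whose split list contains it.
sub count_split_support {
    my ($original, $replicates) = @_;   # array ref, array ref of array refs
    my %count = map { $_ => 0 } @$original;
    for my $replicate (@$replicates) {
        my %seen = map { $_ => 1 } @$replicate;   # one vote per replicate
        $count{$_}++ for grep { $seen{$_} } @$original;
    }
    return map { $count{$_} } @$original;         # same order as @$original
}

# Made-up example: two splits of the original tree and four replicates.
my @original   = ('00111', '00011');
my @replicates = (
    [ '00111', '00011' ],
    [ '00111', '00101' ],
    [ '00110', '01110' ],
    [ '00111', '00011' ],
);
my @support = count_split_support(\@original, \@replicates);
printf "%s %d/%d\n", $original[$_], $support[$_], scalar @replicates
    for 0 .. $#original;
```

Dividing each count by the number of replicates gives the bootstrap proportion for the corresponding internal branch.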
(E) 5Split_to_Newick+bootstrap.pl
#use strict;
($sec,$min,$hour,$mday,$mon,$year) = localtime();
$year += 1900;
$mon += 1;
print "start Split_to_Newick+bootstrap.pl $year/$mon/$mday $hour:$min\n";
start:
print 'This perl program is written by Mizuguchi Masako.',"\n";
print 'Algorithm is provided by Saitou Naruya.',"\n";
print 'Version Date: Jan 19, A.S. 0016 (2016 A.D.)',"\n";
print "Please type Newick format file ==> ";
$file1 = <STDIN>;
chomp($file1);
print "Please type file name of xxxxxx_count.txt ==> ";
$file2 = <STDIN>;
chomp($file2);
if ($file1 =~ /^(.*)\.treeout$/) {
$file3 = $1 . "_original.split+tree";
$outfile = $1. ".tree";
} else {
die "The Newick file name must end with '.treeout': $file1\n";
}
open (FILE1,$file1) or die "$!";
@row1=<FILE1>;
close(FILE1);
$treeout = $row1[0];
open (FILE2,$file2) or die "$!";
@row2=<FILE2>;
close(FILE2);
open (FILE3,$file3) or die "$!";
@row3=<FILE3>;
close(FILE3);
LOOP:
$item = pop(@row3);
if (defined($item)) {
($distance1,$pair) = split(/\s+/,$item);
@pair = split(/,/,$pair);
$item2 = $pair[-1];
@char = split(/\)/,$item2);
$name = $char[0];
@matches = $item2 =~ m/\)/g;
$n = scalar(@matches);
($distance2,$cnt) = split(/\s+/,pop(@row2));
$treeout =~ /^(.*)$name(.*)$/;
$front_char = $1;
$item3 = $2;
@item3 = split(/\)/,$item3);
@item4 = split(/,/,$item3[$n]);
$item4[0] = $item4[0] . "\[" . $cnt/10 . "\]";
$item3[$n] = join(',',@item4);
$item3 = join("\)",@item3);
$treeout = $front_char . $name . $item3;
goto LOOP;
}
open (OUTFILE,"> $outfile") or die "$!";
print OUTFILE $treeout;
close(OUTFILE);
end_proc:
($sec,$min,$hour) = localtime();
print "Computation ended normally. $hour:$min\n";
print "The tree is '$outfile'.\n";