
Supplementary Methods

Supplementary Methods S1. DNA Extraction, Library Preparation, and Sequencing

To confirm whether the two sequence outputs of the same individual, extracted

independently, were consistent, the first two extracts (A1 and B) were prepared at the

National Institute of Genetics, Mishima, and the new DNA extract (A2) was prepared at

the National Museum of Nature and Science, Tsukuba, using a previously published

protocol1 with some modifications. Briefly, the powdered sample was decalcified with 0.5 M
EDTA (pH 8.0) at 56°C overnight; the buffer was then replaced with fresh solution and
the sample was decalcified again at 56°C overnight. The decalcified samples were
lysed in 1000 µl of Genomic Lyse buffer (Genetic ID) with 50 µl of 20 mg/ml
proteinase K at 56°C overnight, and DNA was extracted from the lysate using the
FAST ID DNA extraction kit (Genetic ID) in accordance with the manufacturer’s
protocol. Three DNA libraries for A1, A2, and B were prepared from the extracted DNA

solution using the GS Titanium Rapid Library Preparation kit (454 Life Science

Corporation). Modifications to the protocol2 were as follows: (i) in the adapter ligation
step, we used 1 µl of Illumina adapter mix instead of 2 µl to minimize the amount of
adapter dimer; (ii) a first-round PCR was set up with the Multiplex PCR kit (Qiagen) as
follows: 5 µl DNA library, 25 µl Mix, 20-100 nM each Multiplexing PCR primer, and
H2O up to 50 µl. Cycling conditions were 15 min at 95°C, 12 cycles of 30 sec at 95°C,
30 sec at 65°C, and 30 sec at 72°C, with a final extension at 72°C for 10 min; (iii) a
second-round PCR was set up as follows: 5 µl product from the first PCR, 25 µl Multiplex
PCR mix, 500-2,000 nM each Multiplexing PCR primer, and H2O up to 50 µl. Cycling
conditions were 15 min at 95°C, 7-15 cycles of 30 sec at 95°C, 30 sec at 60°C, and
30 sec at 72°C, with a final extension at 72°C for 10 min; (iv) half the amount of DNA library A1
was treated with two restriction enzymes, FastDigest® Bsh1236I (Thermo Scientific)
and FastDigest® TauI (Thermo Scientific), which recognize CGCG and GCSGC sites,
respectively. This treatment reduces the amount of non-human DNA as much as possible, thus
enriching the human-derived sequence reads. PCR-amplified DNA libraries were
purified on a 2% agarose gel (BIO RAD) to remove adapter and primer dimers, and the
purified libraries were quantified with the Agilent 2100 Bioanalyzer DNA High-Sensitivity
chip before sequencing on the Illumina GAIIx and HiSeq 2000 sequencers.

Supplementary Methods S2. Sequence mapping

FASTX-Toolkit3 was used to trim adapter sequences, to remove reads containing N's, and to
trim bases with base quality below 25, discarding reads shorter than 11 bases. For
adapter trimming we used fastx_clipper (-a AGATCGG -n -l 11), and for filtering
low-quality bases we used fastq_quality_trimmer (-t 25 -l 11). After sorting, forward
and reverse reads were merged using fastq-join4 (-m 11). Some unmerged reads still
contained adapter fragments shorter than seven bases. We therefore removed
seven bases from the termini of the unmerged reads with fastx_trimmer (-l 94 or 114) and merged
them again with fastq-join, which can rescue several percent of the unmerged reads. After
filtering out reads shorter than 35 bases, the remaining merged reads were mapped to the
human reference genome (hg19) using BWA5. The frequencies of sequence reads
mapped to hg19 were examined using the flagstat command of SAMtools6. Mapped reads with
mapping quality ≥ 30 were retained for further analyses. After generating an mpileup with
SAMtools, we counted the covered sites on the autosomes and on the mitochondrial DNA
separately, and then calculated the coverage of each. Metagenomic analysis was
also performed using BLASTN7 to check for non-human DNA contamination. A total of
20 Mbp were searched in each sample (parameters were -evalue 3.80e-2, -dbsize
3,200,000,000).
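The coverage calculation described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: it assumes tab-separated mpileup records whose first and fourth columns are the chromosome name and the read depth, and the function name `coverage_by_compartment`, the chromosome labels, and the genome sizes supplied by the caller are hypothetical.

```python
from collections import defaultdict


def coverage_by_compartment(mpileup_lines, genome_sizes, mito_name="chrM"):
    """Summarise covered sites and mean depth for autosomes and mtDNA.

    mpileup_lines: iterable of tab-separated mpileup records
                   (chromosome, position, reference base, depth, ...).
    genome_sizes:  dict giving the total length of 'autosome' and 'mtDNA'
                   (illustrative values chosen by the caller).
    """
    covered = defaultdict(int)  # number of sites with depth >= 1
    bases = defaultdict(int)    # summed depth over covered sites
    autosomes = {f"chr{i}" for i in range(1, 23)}

    for line in mpileup_lines:
        fields = line.rstrip("\n").split("\t")
        chrom, depth = fields[0], int(fields[3])
        if depth == 0:
            continue
        if chrom == mito_name:
            key = "mtDNA"
        elif chrom in autosomes:
            key = "autosome"
        else:
            continue  # sex chromosomes and unplaced contigs are ignored here
        covered[key] += 1
        bases[key] += depth

    return {
        key: {
            "covered_fraction": covered[key] / size,
            "mean_depth": bases[key] / size,
        }
        for key, size in genome_sizes.items()
    }
```

Counting the two compartments separately matters here because the mitochondrial depth is much higher than the nuclear depth, so a pooled average would be dominated by mtDNA reads.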

In PickingBases (PB), any reads that map to the same reference location
are considered duplicates and are merged into a single read. When the reads being
merged carry two or more alleles at a site, PB either chooses the allele that matches the reference
genome for the merged read or masks the site as N (with the --use-n option);
a graphical explanation is given in Supplementary Figure S6. More sequenced sites and
lower error rates are expected with PB, and using the --use-n option might remove
bias toward the reference genome. We therefore used the --use-n option in the current
study. PB has two merits over MarkDuplicates (MD). First, PB can detect plus strands (black-colored
bases) and minus strands (gray-colored bases) originating from the same DNA template as
PCR duplicates, whereas MD cannot (e.g. group 1). In addition, since MD identifies reads
having identical 5' positions as duplicates and chooses the read having the highest sum of base
qualities among them, MD keeps the longer DNA reads even if some of the reads are
short and originated from different templates (e.g. group 2). This selectively removes
shorter DNA reads. The effect is quite strong for the DNA reads mapped to the
mitochondrial genome because its depth is much higher than that of the nuclear genome,
and DNA reads originating from different templates are frequently mapped to the
same 5’ ends. Second, PB can ideally reduce the error rate compared to MD, especially
when the Illumina Y-shaped adapter is used, because PB masks some PCR and sequencing
errors as “N” (blue bases) if there are PCR duplicates.
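The merging rule can be illustrated with a short sketch. It is not the PickingBases program itself: the function name `merge_duplicates` and the tuple layout of a read are hypothetical, duplicates are assumed to be reads with the same start position and length, strand is ignored, and the fallback used when no read carries the reference allele is an assumption made only for this example.

```python
def merge_duplicates(reads, reference, use_n=True):
    """Collapse reads that share a mapped interval into one read.

    reads:     list of (start, sequence, qualities) tuples, 0-based start.
    reference: reference sequence as a string.
    use_n:     mask sites with conflicting alleles as 'N' when True,
               otherwise prefer the allele matching the reference.
    """
    groups = {}
    for start, seq, quals in reads:
        groups.setdefault((start, len(seq)), []).append((seq, quals))

    merged = []
    for (start, length), members in sorted(groups.items()):
        bases, quals_out = [], []
        for i in range(length):
            # Best base quality observed for each allele at this site
            best = {}
            for seq, quals in members:
                best[seq[i]] = max(best.get(seq[i], 0), quals[i])

            if len(best) == 1:
                base = next(iter(best))
            elif use_n:
                base = "N"
            elif reference[start + i] in best:
                base = reference[start + i]
            else:
                # Assumption: with no reference-matching allele,
                # keep the allele with the best quality
                base = max(best, key=best.get)

            bases.append(base)
            quals_out.append(best.get(base, 0))  # masked sites get quality 0
        merged.append((start, "".join(bases), quals_out))
    return merged
```

With `use_n=True` a site where duplicates disagree is masked rather than resolved toward the reference, which is the behaviour chosen for the current study.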

Supplementary Methods S3. DNA authenticity checking

Since it is difficult to remove all modern human DNA contamination that comes from
reagents and experimental rooms, it is essential to estimate the frequency of endogenous
DNA from the Jomon individuals. To estimate the frequency of contamination, we focused
on sequence reads mapped to mitochondrial DNA (mtDNA). The mtDNA haplogroups of
both samples, A and B, were previously classified as N9b8. First, we constructed
consensus sequences for each sample using a mitochondrial capture method and
sequencing with MiSeq9. The mtDNA-captured reads were mapped and filtered with the
same method as the shotgun sequence reads, except that a mapping quality of 20 or larger
was required instead of 30. To correctly estimate the consensus sequence of sample A, we merged A1 and A2.
Many sequence reads appear to have been misassigned to different indexes
(Supplementary Table S6) because our DNA libraries carried a single index
(Supplementary Table S3). Kircher et al.10 reported misidentification rates of
~0.3%, though they used dual indexes. Higher misclassification rates were inferred in
our sequencing results. We therefore tried to detect falsely assigned reads and remove
them in order to construct correct consensus sequences and to infer correct contamination rates
for each sample. To detect the falsely assigned reads, we grouped the sequence
reads mapped to the same reference location as PCR duplicates within each sample, and
checked whether the same group was present in other samples. For each group, we assumed that
the sample having the highest number of PCR duplicates was the origin of the group, and
that the copies in the other samples were contaminants. After removing the falsely assigned reads, we
filtered PCR duplicates with PB using the --use-n and --ignore-strand options (the latter
treats reads mapped to the same reference location but to different strands as one
group derived from the same template), and trimmed the first and last five bases using
BamUtil11 to minimize the effect of post-mortem misincorporation. Sites with a read depth of three or
higher were used to call individual-specific SNPs. Sites 515-522 and
8,860 were not used because of ambiguous typing. Read depth was estimated by
calculating the average depth over all sites.
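The origin-assignment rule for falsely assigned reads can be sketched as below. This is an illustration under stated assumptions, not the authors' script: the function name `flag_false_assignments` and the interval representation are hypothetical, and leaving ties unresolved is a choice made only for this example because the text does not say how ties were handled.

```python
from collections import Counter, defaultdict


def flag_false_assignments(reads_by_sample):
    """Flag duplicate groups that probably belong to another sample.

    reads_by_sample: dict mapping sample name to a list of
                     (chrom, start, end) intervals, one per mapped read.
    Returns a dict mapping sample name to the set of intervals judged
    to be falsely assigned to that sample.
    """
    # interval -> Counter of how many duplicates each sample holds
    counts = defaultdict(Counter)
    for sample, intervals in reads_by_sample.items():
        for interval in intervals:
            counts[interval][sample] += 1

    flagged = {sample: set() for sample in reads_by_sample}
    for interval, per_sample in counts.items():
        if len(per_sample) < 2:
            continue  # group seen in one sample only
        top = max(per_sample.values())
        origins = [s for s, n in per_sample.items() if n == top]
        if len(origins) > 1:
            continue  # assumption: ties are left unresolved
        for sample in per_sample:
            if sample != origins[0]:
                flagged[sample].add(interval)
    return flagged
```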

The list of SNPs and indels was compared to the list of mutations reported in
PhyloTree.org, build 1612, and the mtDNA of both Sanganji Jomon
individuals was classified into haplogroup N9b; individual-specific mutations and
rare mutations within haplogroup N9b were also observed. This haplogroup is common in
northern Jomon people (about 45% in the Hokkaido Jomon), but rare in modern
Japanese (1.9% in the mainland Japanese). Because haplogroup N9b is rare in
modern humans and the mtDNA haplotype of Kanzawa-Kiriyama (who did the DNA
extractions and library preparations) is non-N, we can assume that sequence reads
that do not match the consensus are contaminants. We used sequence reads having
mapping quality 30 or larger, and calculated the proportion of the reads that did not
match the consensus sequences. The proportions for samples A1 and A2 were calculated
independently. In the estimation, we removed nucleotide position 14,893 in sample B
because we know empirically that mapping is ambiguous around this region. In
fact, the sequence read identified as a contaminant in sample B was also mapped to chr2
and chr5 with highest E-value in MEGABLAST, and it is difficult to determine
whether such reads are contaminants. We calculated the frequency of contamination,

$F_{\mathrm{inc\_mtDNA}}$, with the following equation:

$$F_{\mathrm{inc\_mtDNA}} = \frac{N_{\mathrm{inconsistent}}}{N_{\mathrm{consistent}} + N_{\mathrm{inconsistent}}} \times 100, \qquad (1)$$

where $N_{\mathrm{inconsistent}}$ indicates the number of reads inconsistent with the mtDNA haplotype of
each individual, and $N_{\mathrm{consistent}}$ indicates the number of reads consistent with the mtDNA
haplotype of each individual. The 95% C.I. was computed using the standard t-test. When
all determined sequences were consistent and the contamination proportion was
estimated to be zero, we assumed a simple binomial distribution with proportions $p$ and $1-p$
of contaminant and authentic mtDNA sequences, respectively. We
estimated the $p$ that gives the 95% C.I. by numerically solving equation (2):

$$0.95 = (1 - p)^{n}, \qquad (2)$$

where $n$ is the number of observed mutually consistent reads.
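Equations (1) and (2) can be evaluated with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the left-hand side of equation (2) is exposed as the parameter `lhs` so that other thresholds can be tried, and a closed-form solution is used in place of a numerical solver because equation (2) can be rearranged to $p = 1 - 0.95^{1/n}$.

```python
def contamination_percent(n_inconsistent, n_consistent):
    """Equation (1): percentage of reads inconsistent with the consensus."""
    total = n_consistent + n_inconsistent
    if total == 0:
        raise ValueError("no informative reads")
    return n_inconsistent / total * 100.0


def zero_count_bound(n, lhs=0.95):
    """Solve lhs = (1 - p)**n for p, following equation (2) as written.

    n:   number of observed mutually consistent reads.
    lhs: left-hand side of equation (2); 0.95 in the text.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - lhs ** (1.0 / n)
```

For example, `contamination_percent(1, 49)` evaluates equation (1) for one inconsistent read among 50 informative reads.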

Two Perl scripts were newly developed by one of us (K. K.): sam-count-ref-nucleotides.pl
for depurination analysis and sam-count-substitutions.pl for C-to-T
misincorporation analysis (see Supplementary Perl Scripts).

Supplementary Methods S4. Estimation of error rates and patterns of errors

After the sequence termini of the Sanganji Jomon reads were trimmed and the contamination
frequency of each library was checked, the data from the low-contamination libraries were
merged. After generating an mpileup of both ancient and modern samples using SAMtools,
we randomly chose one base having base quality 30 or larger at each site, and then
generated alignments of all 17 human samples with PanTro2 and hg19. We then
chose the nucleotide positions where the sequence was determined in all 19 individuals
and where no more than two bases were observed. In addition, we removed tandem
repeat regions (reported in Prüfer et al.13,
http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/simpleRepeat.txt.gz) and
CpG sites, where hg19 and/or PanTro2 have these, because those regions are prone to
mapping errors and recurrent mutation.

We estimated the error rate from the alignments (SI2 of Reich et al.14). We
assumed that there were no errors in PanTro2 and hg19. We chose three sequences from
the alignments, PanTro2, hg19, and one of the 17 samples, and counted the sites at which
one sequence alone differed from the other two. We regarded these counts as the numbers of
lineage-specific substitutions since divergence. Assuming that an equal number of true
substitutions occurred on the hg19 and sample lineages after their divergence, the excess
number of substitutions on the sample lineage can be
attributed to errors (e.g. sequencing errors, PCR errors, and post-mortem changes).
The error rates were inferred for both all sites and transversion-only sites. We also
estimated the error rate in another way: among sample-specific transversion
substitutions, transversions not reported in dbSNP 141 were regarded as errors. To
identify which types of errors are frequent, we compared the alignments of PanTro2 and
the samples. There are 12 different types of substitutions, and the number of substitutions of each type
between PanTro2 and each sample was counted. In downstream statistical analyses,
tandem repeat and CpG masked sequences were used.
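The excess-substitution logic can be sketched as follows. This is a minimal illustration and not the procedure of Reich et al. in full: the function names are hypothetical, the three sequences are assumed to be pre-aligned strings of equal length, and dividing the excess by the number of compared sites is an assumption about how the rate is normalised.

```python
from collections import Counter

PURINES = {"A", "G"}


def is_transversion(a, b):
    """True when one base is a purine and the other a pyrimidine."""
    return (a in PURINES) != (b in PURINES)


def estimate_error_rate(chimp, hg19, sample, transversions_only=False):
    """Return (sample_specific, hg19_specific, excess_rate).

    A site is sample-specific when chimp and hg19 agree but the sample
    differs, and hg19-specific when chimp and the sample agree but hg19
    differs. The excess on the sample lineage is attributed to errors.
    """
    sample_specific = hg19_specific = compared = 0
    for c, h, s in zip(chimp, hg19, sample):
        if "N" in (c, h, s):
            continue
        compared += 1
        if c == h and h != s:
            if not transversions_only or is_transversion(h, s):
                sample_specific += 1
        elif c == s and s != h:
            if not transversions_only or is_transversion(s, h):
                hg19_specific += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return sample_specific, hg19_specific, (sample_specific - hg19_specific) / compared


def substitution_spectrum(chimp, sample):
    """Count each of the 12 substitution types between chimp and sample."""
    return Counter(
        f"{c}>{s}" for c, s in zip(chimp, sample) if c != s and "N" not in (c, s)
    )
```

Restricting the count to transversions removes C-to-T and G-to-A changes, which are the substitutions most affected by post-mortem damage.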

Supplementary Methods S5. Phylogenetic analyses

TreeMix15 was used to construct maximum likelihood trees that can accommodate
possible gene flow events. Two datasets including the Sanganji Jomon were used for this
analysis: all SNP sites (43k SNPs) and transversion-only sites (14k SNPs). To confirm the
reliability of the tree topology and the gene flow with a larger number of SNPs, a new
dataset including only the 1000 Genomes Project worldwide populations, Papuan, and Denisovan
genomes, but not the Sanganji Jomon, was also analyzed. To prepare
the new dataset, four filters were applied: HWE <1e-10, MAF <0.01, pruning of SNPs in high linkage
disequilibrium (plink --indep 50 5 2), and random retention of 30% of SNPs (plink --thin 0.3).
TreeMix was run with an incremental number of assumed migrations,
starting from zero migrations. Denisovan was used as the root. Runs used blocks of
100 and 50 SNPs for the all-SNP and transversion-only datasets, respectively. To test
how well the observed branching patterns and gene flow events are supported, we
generated 1,000 pseudosamples for each of the two datasets using a newly made Perl
script and used them as inputs for TreeMix runs with 0 to 9 migrations. Using the original
tree (from non-bootstrapped data) as the reference, the number of corresponding
population splits in the 1,000 bootstrap replicates was counted using four newly made
Perl scripts (see Supplementary Perl Scripts).
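One way to generate such pseudosamples is to resample blocks of SNPs with replacement. The sketch below assumes this block bootstrap; the text does not describe how the Perl script worked, so the resampling scheme, the function names, and the list-based SNP representation are all assumptions made for illustration.

```python
import random


def bootstrap_blocks(snps, block_size, rng):
    """Build one pseudosample by resampling contiguous SNP blocks with replacement."""
    blocks = [snps[i:i + block_size] for i in range(0, len(snps), block_size)]
    chosen = [rng.choice(blocks) for _ in blocks]
    return [snp for block in chosen for snp in block]


def make_pseudosamples(snps, block_size, n_replicates, seed=0):
    """Return n_replicates pseudosamples drawn with a reproducible seed."""
    rng = random.Random(seed)
    return [bootstrap_blocks(snps, block_size, rng) for _ in range(n_replicates)]
```

Resampling whole blocks rather than single SNPs keeps linked sites together, which is the usual reason for working with SNP blocks in the first place.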

The PHYLIP package16 was used to produce NJ trees. Nei’s17 genetic distances between
populations were calculated with 1,000 bootstrap replicates, and Splitstree418 was used
to view the NJ trees. For phylogenetic network analysis, we directly constructed
Neighbor-Net networks from Nei’s genetic distances using Splitstree4.
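For reference, Nei's (1972) standard genetic distance for biallelic SNPs can be computed as in the sketch below. This is not the PHYLIP implementation used in the study; the function name `nei_distance` and the input format (one allele frequency per locus and population) are hypothetical.

```python
import math


def nei_distance(freqs_x, freqs_y):
    """Nei's (1972) standard genetic distance for biallelic loci.

    freqs_x, freqs_y: frequency of the same allele at each locus in
                      populations X and Y (values between 0 and 1).
    """
    if len(freqs_x) != len(freqs_y) or not freqs_x:
        raise ValueError("frequency lists must be non-empty and of equal length")
    # Sums of allele-frequency products; the number of loci cancels in the ratio
    jx = sum(p * p + (1 - p) ** 2 for p in freqs_x)
    jy = sum(q * q + (1 - q) ** 2 for q in freqs_y)
    jxy = sum(p * q + (1 - p) * (1 - q) for p, q in zip(freqs_x, freqs_y))
    return -math.log(jxy / math.sqrt(jx * jy))
```

Identical frequency vectors give a distance of zero, and the distance grows as the allele frequencies of the two populations diverge.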

Supplementary Results

Supplementary Results S1. DNA sequence authenticity through mtDNA sequences

Average depths of captured mtDNA were 1.97, 12.24, and 8.59 in samples A1, A2, and

B, respectively. The estimated frequencies of modern human DNA contamination
based on captured mtDNA sequence reads were 0.0% (0.0-15.4%, 95%
C.I.), 1.7% (0.0-4.1%, 95% C.I.), and 2.0% (0.0-6.0%, 95% C.I.) for DNA samples A1,
A2, and B, respectively (Supplementary Table S4), while the frequencies based on
simple shotgun sequencing were 0.0% (0.0-28.4%, 95% C.I.), 11.3% (2.8-19.9%, 95%
C.I.), and 5.6% (0.0-16.1%, 95% C.I.), respectively (Supplementary Table S5). The
confidence interval of sample A1 is wider than those of the other samples because of the low
depth of mtDNA, though we found no inconsistent nucleotides for this sample. The increase
in contamination frequency in sample A2 for the shotgun sequencing was not expected,
and the reasons are not clear, though cross contamination during experiments and/or
sequencing (e.g. index misclassification) are possibilities.

Possible sources of cross contamination to sample A2 during the simple shotgun

sequencing are sample B and some other Jomon samples which are not reported in this

paper. Their mtDNA haplogroups were different from the Sanganji Jomon samples used

in this study, and this difference might have increased the contamination frequency if

there were cross contaminations from those other Jomon samples. We also cannot reject

the possibility that the cross contamination from non-Jomon samples happened during

experiments. The possible sources of contamination in the captured mtDNA sequences are
other Jomon samples having haplogroups N9b and M7a. However, no evidence of cross
contamination from them was found at the individual A-specific SNPs (Supplementary
Table S4), which are not shared with those other Jomon samples (data not shown), so
cross contamination in the mtDNA-captured sequences is unlikely.

Supplementary Results S2. TreeMix analysis

Genome sequence data of the Sanganji Jomon, the 1000 Genomes Project populations,
Karitiana (Native American from Brazil) and Papuan13, the 45,000 YBP Ust’-Ishim
individual from western Siberia19, the 24,000 YBP Mal’ta MA1 individual from south-central Siberia20, and the archaic
Denisovan21 were used. We used two datasets: one based on all 43,310 sites and the
other based on 15,455 transversion-only sites. The number of gene flow events was set
from zero to nine, and the resultant trees with bootstrap probabilities on interior branches
are shown in Supplementary Figures S16a-S16j for all sites and in Supplementary
Figures S17a-S17j for transversion-only sites. Bootstrap values were computed by
using five newly made Perl scripts in combination with the TreeMix software. In all the

20 trees, the Sanganji Jomon was consistently outside of five modern East Eurasian

populations (JPT, CHB, CHS, CDX, and KHV). Furthermore, except for no gene flow

on all site data (Supplementary Figure S16a) and for no or one gene flow on

transversion only data (Supplementary Figures S17a and S17b), South American

Karitiana clustered with the five modern East Eurasians and the Sanganji Jomon was

outside of this cluster. Bootstrap values supporting this pattern were relatively low (50-75%)
when transversion-only data were used (Supplementary Figures S17c-S17j);
however, those for all-site data were high (96-98%) for three to five gene flow
events (Supplementary Figures S16d-S16f). Gene flow from the Sanganji Jomon to
JPT appeared when the number of gene flow events was three or more for all-site data
(Supplementary Figures S16d-S16j), and when the number of gene flow events was
four or more for transversion-only site data (Supplementary Figures S17e-S17j). The
proportion of the Sanganji Jomon gene flow to JPT was in the range of 12-16% in
these 13 cases, and the frequencies (equal to bootstrap probabilities) with which this
gene flow appeared among the 1,000 pseudosamples were 98-99% for all-site data and 84-96% for
transversion-only site data (detailed data not shown).

HGDP populations were also used for TreeMix analysis (Supplementary

Figure S18; bootstrap values for modern East Eurasian populations were suppressed).

Although the dataset used was small (only 7,529 SNP sites) and had ascertainment bias,

the Sanganji Jomon was in the basal position to the five modern East Eurasians with a

high bootstrap support (95%). This result also strengthens the basal position of the

Sanganji Jomon among East Eurasian populations. When we introduced gene flow

events, however, the gene flow from Jomon to JPT appeared only as the seventh event (tree not
shown). It should be noted that gene flow events are difficult to estimate from a small
dataset such as this one.

Supplementary Results S3. NeighborNet analysis

When HGDP data were used, three major splits X, Y, and Z related to Sanganji Jomon

were observed (Supplementary Figure S21). Split Z excludes Yakuts from the other East

Asians but groups the Sanganji Jomon with the non-Yakut East Asians. This split may

reflect a recent admixture between Yakuts and West Eurasians. Split Y groups all the

modern East Eurasians, which may correspond to the divergence of the Sanganji Jomon

from the other East Eurasians. Split X, which has a longer distance than those of splits Y

and Z, groups the Sanganji Jomon, West Eurasians, and Africans. Split X might be
the result of postmortem changes in the Sanganji Jomon. However, if
we accept this split as a phylogenetic signal, the following scenarios can be considered:
some of the Sanganji Jomon ancestors originated from populations that diverged before the
Sahulian-East Eurasian divergence, and/or the Sahulians experienced some gene flow
with ancestors of modern East Eurasians after the divergence of the Sanganji Jomon.

Supplementary Results S4. D-statistic analysis

We also examined whether the uniqueness of the Sanganji Jomon in East Eurasia was
the result of gene flow with non-East Eurasians, for example, Africans, Europeans,
Sahulians, and ancient Siberians (Mal’ta MA1 and Ust’-Ishim). Comparison with the
modern populations did not suggest additional gene flow between the Sanganji Jomon and
non-East Eurasians (Supplementary Figures S24 and S25). The trees ((X, Sanganji
Jomon) Ust’-Ishim) and ((X, Sanganji Jomon) Mal’ta MA1) also showed no
evidence of gene flow with Ust’-Ishim or Mal’ta MA1 after their divergence
(Supplementary Figure S22). Next, we tried to detect gene flow between non-East

Eurasians and East Eurasians, who diverged after the divergence of the Sanganji Jomon.

A closer genetic affinity between the Sahulians, especially Melanesian, and the modern

East Eurasians compared to the Sanganji Jomon, Yakut, and Native Americans was

inferred, but the Z-scores were not significant (Supplementary Figure S26). It is

possible that this affinity corresponds to split X of Supplementary Figure S21. To

evaluate the effect of genotype errors in Jomon to the affinity between Sahulians and

modern East Eurasians, we compared D-values using all sites and transversion only. We

observed some bias when using all sites, and the bias appeared to be artificially

produced by post-mortem changes. The affinity between Melanesians and East Eurasians
compared to the Sanganji Jomon was still inferred when only transversion sites with
1000 Genomes Project populations and a Melanesian genome13 were used, though the
Z-score does not yet establish its reliability (Supplementary Figure S27). The affinity with
Papuan was not supported (Supplementary Figures S27 and S28). If we accept the tree
((East Eurasians, Jomon/Yakut/Native Americans), Melanesian), gene flow between the
East Eurasians and Melanesians after the divergence of Jomon, Yakut, or Native
Americans is a plausible explanation. We also inspected relationships among the Sanganji
Jomon, East Eurasians, and Native Americans. Neither the tree ((Native American, East
Eurasians) Sanganji Jomon) nor the tree ((Sanganji Jomon, East Eurasians) Native American)
was rejected, but some skew from zero was observed (Supplementary Figures S29
and S30). Assuming no gene flow between the Native Americans and the East Eurasians
after their divergence, the skew would be explained by the earlier split of the Sanganji
Jomon in the ancestry of East Eurasians and Native Americans, as suggested by the TreeMix
analyses. This result is consistent with the low bootstrap value for the branch suggesting
earlier Jomon divergence in the neighbor-joining tree (Figure 5a and Supplementary
Figure S20).
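The D-values and Z-scores discussed in this section can be illustrated with a generic ABBA-BABA calculation. The sketch below is not the software used in the study: the function names are hypothetical, one base per individual per site is assumed, the sign convention (ABBA minus BABA) is one of several in use, and the Z-score comes from an unweighted delete-one block jackknife.

```python
import math


def abba_baba_counts(p1, p2, p3, outgroup):
    """Count ABBA and BABA sites for the tree (((P1, P2), P3), outgroup)."""
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if "N" in (a, b, c, o):
            continue
        if a == o and b == c and b != o:
            abba += 1  # P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            baba += 1  # P1 and P3 share the derived allele
    return abba, baba


def d_statistic(abba, baba):
    """D = (ABBA - BABA) / (ABBA + BABA); 0.0 when no informative sites."""
    total = abba + baba
    return (abba - baba) / total if total else 0.0


def jackknife_z(block_counts):
    """Return (D, Z) from per-block (abba, baba) counts.

    Uses an unweighted delete-one block jackknife for the standard error.
    """
    g = len(block_counts)
    if g < 2:
        raise ValueError("at least two blocks are required")
    tot_abba = sum(a for a, _ in block_counts)
    tot_baba = sum(b for _, b in block_counts)
    d_all = d_statistic(tot_abba, tot_baba)
    leave_one = [d_statistic(tot_abba - a, tot_baba - b) for a, b in block_counts]
    mean = sum(leave_one) / g
    se = math.sqrt((g - 1) / g * sum((d - mean) ** 2 for d in leave_one))
    return d_all, (d_all / se if se > 0 else float("nan"))
```

A D-value close to zero is compatible with the assumed tree, while a large absolute Z-score indicates an excess of allele sharing between P3 and one of the two sister populations.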

References Cited

1 Adachi, N., Sawada, J., Yoneda, M., Kobayashi, K. & Itoh, S. Mitochondrial

DNA analysis of the human skeleton of the initial Jomon phase excavated at

the Yugura cave site, Nagano, Japan. Anthropol. Sci. 121, 137-143 (2013).

2 Rasmussen, M., Guo, X., Wang, Y., Lohmueller, K. E., Rasmussen, S.,

Albrechtsen, A. et al. An Aboriginal Australian genome reveals separate

human dispersals into Asia. Science 334, 94-98 (2011).

3 http://hannonlab.cshl.edu/fastx_toolkit/

4 https://code.google.com/p/ea-utils/wiki/FastqJoin

5 Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-

Wheeler Transform. Bioinformatics 25, 1754-60 (2009). Burrows-Wheeler

Aligner, http://bio-bwa.sourceforge.net/index.shtml

6 Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N. et al.

The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009). SAMtools, http://samtools.sourceforge.net/index.shtml

7 Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W.

& Lipman D.J. Gapped BLAST and PSI-BLAST: a new generation of protein

database search programs. Nucl. Acids Res. 25, 3389–3402 (1997).

Software was downloaded from: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch

8 Kanzawa-Kiriyama, H., Saso, A., Suwa, G. & Saitou, N. Ancient

mitochondrial DNA sequences of Jomon teeth samples from Sanganji,

Tohoku district, Japan. Anthropol. Sci. 121, 89-103 (2013).

9 Maricic, T., Whitten, M. & Pääbo, S. Multiplexed DNA sequence capture of

mitochondrial genomes using PCR products. PLoS ONE 5, e14004 (2010).

10 Kircher, M., Sawyer, S. & Meyer, M. Double indexing overcomes

inaccuracies in multiplex sequencing on the Illumina platform. Nucl. Acids

Res. 40, e3 (2012).

11 http://genome.sph.umich.edu/wiki/BamUtil

12 Van Oven, M. & Kayser, M. Updated comprehensive phylogenetic tree of

global human mitochondrial DNA variation. Hum. Mut. 30, E386-E394

(2008) http://www.phylotree.org/

13 Prüfer, K., Racimo, F., Patterson, N., Jay, F., Sankararaman, S., Sawyer, S.

et al. The complete genome sequence of a Neanderthal from the Altai

Mountains. Nature 505, 43-49 (2014).

14 Reich, D., Green, R. E., Kircher, M., Krause, J., Patterson, N., Durand, E. Y.

et al. Genetic history of an archaic hominin group from Denisova Cave in

Siberia. Nature 468, 1053-1060 (2010).

15 Pickrell, J. K. & Pritchard, J. K. Inference of population splits and mixtures

from genome-wide allele frequency data. PLoS Genet. 8, e1002967 (2012).

16 PHYLIP package: http://evolution.genetics.washington.edu/phylip.html

17 Nei, M. Genetic distance between populations. Amer. Nat. 106, 283-292

(1972).

18 Huson, D. H. & Bryant, D. Application of phylogenetic networks in

evolutionary studies. Mol. Biol. Evol. 23, 254–267 (2006). Splitstree,

http://www.splitstree.org/

19 Fu, Q., Li, H., Moorjani, P., Jay, F., Slepchenko, S. M., Bondarev, A. A. et al.

Genome sequence of a 45,000-year-old modern human from western

Siberia. Nature 514, 445–449 (2014).

20 Raghavan, M., Skoglund, P., Graf, K. E., Metspalu, M., Albrechtsen, A.,

Moltke, I. et al. Upper Palaeolithic Siberian genome reveals dual ancestry of

Native Americans. Nature 505, 87-91 (2014).

21 Meyer, M., Kircher, M., Gansauge, M., Li, H., Racimo, F., Mallick, S. et al. A

high-coverage genome sequence from an archaic Denisovan individual.

Science 338, 222-226 (2012).

22 Tamura, K., Stecher, G., Peterson, D., Filipski, A. & Kumar, S. MEGA6:

Molecular Evolutionary Genetics Analysis version 6.0. Mol. Biol. Evol. 30,

2725-2729 (2013).

23 The 1000 Genomes Project Consortium. An integrated map of genetic

variation from 1,092 human genomes. Nature 491, 56-65 (2012).

24 Andrews, R. M., Kubacka, I., Chinnery, P. F., Lightowlers, R. N., Turnbull, D.

M. & Howell, N. Reanalysis and revision of the Cambridge reference

sequence for human mitochondrial DNA. Nat. Genet. 23, 147 (1999).

Supplementary Figure S1. Geographical location of the Sanganji Shell Mound

Red dot shows location of Sanganji Shell Mound. The geographical locations of the

other populations compared in the current study are also plotted: black, orange, blue,

green, brown, and purple are African, West Eurasian, Northern East Eurasian, Southern

East Eurasian, Sahulian, and Native American populations, respectively. The codes
for the 12 populations of the 1000 Genomes Project are as follows: 1 (YRI) = Yoruba
in Ibadan, Nigeria; 2 (LWK) = Luhya in Webuye, Kenya; 3 (CEU) = Utah residents with
Northern and Western European ancestry; 4 (IBS) = Iberian populations in Spain; 5
(GBR) = British in England and Scotland; 6 (FIN) = Finnish in Finland; 7 (TSI) =
Toscani in Italy; 8 (JPT) = Japanese in Tokyo, Japan; 9 (CHB) = Han Chinese in Beijing,
China; 10 (CHS) = Southern Han Chinese, China; 11 (CDX) = Chinese Dai in
Xishuangbanna, China; 12 (KHV) = Kinh in Ho Chi Minh City, Vietnam.

Supplementary Figure S2. Content and distribution of the meta-genome for each sample

About 1.12%, 0.43%, and 8.91% of the meta-genome were assigned to primates with
BLASTN, whereas bacterial DNA accounted for around 35%. About half of the reads could not
be assigned to specific taxa.

Supplementary Figure S3. Read length distribution for each sample

Length distributions of sequence reads mapped to the reference genome with mapping quality greater than or equal to 35 are shown. PickingBases was used to remove duplicates.

Supplementary Figure S4. Pattern of postmortem misincorporation for each sample

The red line indicates C in the reference genome and T in the Jomon sample, and the blue line indicates G in the reference genome and A in the Jomon sample.

Supplementary Figure S5. Pattern of postmortem depurination for each sample

Base composition of the human reference genome around the 5'- and 3'-ends of the sequence reads.

Supplementary Figure S6. Removing duplicates with MarkDuplicates and PickingBases

The procedure from library preparation to duplicate removal using MarkDuplicates (MD) and PickingBases (PB) is shown. a) A Y-shaped adapter was used for library preparation. Plus and minus strands are colored black and gray, respectively. Forward and reverse adapter sequences are shown as red and blue bars, respectively. Red-colored bases are mismatches compared to the reference genome. b) After PCR amplification, two different amplicons are produced from the same DNA template. c) After mapping the DNA reads to the reference genome, some PCR duplicates are observed. Red-colored bases are mismatches compared to the reference genome; such mismatches include SNPs, post-mortem changes, PCR errors, sequencing errors, and mapping errors. Three groups (groups 1, 2, and 3) are shown as examples to explain the difference between MD and PB. Since MD identifies reads having identical 5' positions as duplicates and chooses the read having the highest sum of base qualities among them, MD chooses two DNA reads from group 1 and (ideally) the one longest DNA read from group 2. PB instead creates one new read each from groups 1 and 2, and the other two DNA reads of group 2 are kept. To create the new read, PB picks the base having the highest base quality within each PCR duplicate group at each site; if two or more alleles are detected, PB either relies on the base matching the reference sequence or masks the site as “N” (blue bases) with the --use-n option. This step can remove some (not all) PCR and sequencing errors still present in MD output.

Supplementary Figure S7. Principal Component Analysis of Sanganji Jomon and 1000 Genomes Project worldwide humans based on 68,556 SNPs with MD

Supplementary Figure S8. Principal Component Analysis of Sanganji Jomon and 1000 Genomes Project worldwide humans based on transversion sites(a) based on 24,631 transversion sites with PB, (b) based on 24,632 transversion sites with MD.

Supplementary Figure S9. PCA of Sanganji Jomon and non-Africans

Genetic relationships among Sanganji Jomon and 1000 Genomes Project populations (Europeans and East Eurasians) were compared based on 53,955 SNP sites with PB. PC1 and PC4 separate Sanganji Jomon from the other populations. Because PC3 reflects the signal separating FIN from the other Europeans, those results are not shown.

Supplementary Figure S10. PCA of Sanganji Jomon and non-Africans based on transversion sites

Genetic relationships among Sanganji Jomon and 1000 Genomes Project populations (Europeans and East Eurasians) were compared based on 19,415 transversion sites with PB. PC1 and PC4 separate Sanganji Jomon from the other populations. Because PC3 reflects the signal separating FIN from the other Europeans, those results are not shown.

Supplementary Figure S11. PCA of Sanganji Jomon, Native Americans, East Eurasians, and Sahulians

Genetic relationships among Sanganji Jomon and HGDP populations (Native Americans, East Eurasians, and Sahulians) were compared based on 7,081 SNP sites with PB. PC1 and PC2 separate Sahulians and Native Americans, respectively, from East Eurasians. East Eurasians are the closest to Sanganji Jomon.

Supplementary Figure S12. PCA of Sanganji Jomon and East Eurasians based on transversion sites

Genetic relationships among Sanganji Jomon and 1000 Genomes Project populations from East Eurasia were compared based on 16,720 transversion sites with PB. The distribution of each population is essentially the same as in Figure 1b, and no effect of post-mortem changes was observed.

Supplementary Figure S13. PCA of Sanganji Jomon and East Eurasians based on two independent datasets

Genetic relationships among Sanganji Jomon and 1000 Genomes Project populations from East Eurasia were compared based on (a) 12,837 SNP sites from A1 (GAIIx) and (b) 33,531 SNP sites from B (index12) with PB. The distribution of each population is essentially the same as in Figure 1b, and no effect of merging the two datasets was observed.

Supplementary Figure S14. PCA of Sanganji Jomon and East Eurasians

Genetic relationships among Sanganji Jomon and HGDP populations from East Eurasia were compared based on 6,864 SNP sites with PB. PC1 and PC3 also show the uniqueness of Sanganji Jomon within East Eurasians, as in Figure 2a.

Supplementary Figure S15. PCA of Sanganji Jomon, three Japanese populations, and other East Eurasians

Genetic relationships among Sanganji Jomon, Ainu, Mainland Japanese, Ryukyuan, and 1000 Genomes Project populations from East Eurasia were compared based on 3,645 SNP sites with PB. PC1 and PC2 show the genetic similarity of Sanganji Jomon and the Ainu, and PC3 separates Ainu from Sanganji Jomon.

Supplementary Figure S16. Maximum likelihood tree for Sanganji Jomon, 12 populations, and 5 individuals using all variant sites

A comparison of Sanganji Jomon, 1000 Genomes Project worldwide populations, Papuan, Karitiana, Mal’ta MA1, Ust’-Ishim, and Denisovan based on 43,310 SNP sites with PB. Denisovan was used as the outgroup. (a) to (j) Trees assuming zero to nine gene flow events, respectively.

Supplementary Figure S17. Maximum likelihood tree for Sanganji Jomon, 12 populations, and 5 individuals using transversion sites

A comparison of Sanganji Jomon, 1000 Genomes Project worldwide populations, Papuan, Karitiana, Mal’ta MA1, Ust’-Ishim, and Denisovan based on 15,455 transversion sites with PB. Denisovan was used as the outgroup. (a) to (j) Trees assuming zero to nine gene flow events, respectively.

Supplementary Figure S18. Maximum likelihood tree for Sanganji Jomon and 24 populations using all variant sites

A comparison of Sanganji Jomon and HGDP worldwide populations based on 7,529 SNP sites with PB. San was used as the outgroup.

Supplementary Figure S19. TreeMix tree without Sanganji Jomon

A comparison of 1000 Genomes Project worldwide populations, Papuan, and Denisovan based on 702,660 SNP sites. Denisovan was used as the outgroup, and three migration events were estimated. The tree was drawn using MEGA6 (ref. 22). Values in red are bootstrap probabilities (%) for the adjacent internal branches. Arrows were added manually to the tree; their colors indicate the migration weight (ratio of gene flow) as in the TreeMix outputs, and the values inside the arrows are the ratios of gene flow. Bootstrap probabilities of the gene flow from JPT to the root of CHB and CHS, from CEU to Papuan, and from Papuan to Denisovan, estimated from 1,000 bootstrap replicate TreeMix outputs, are 89%, 86%, and 98%, respectively.

Supplementary Figure S20. Neighbor-joining tree of Sanganji Jomon and HGDP worldwide populations based on 7,529 SNP sites with PB

Supplementary Figure S21. Phylogenetic Network of Sanganji Jomon and HGDP worldwide populations based on 7,529 SNP sites with PB

Supplementary Figure S22. D-statistic tests of Sanganji Jomon and 1000 Genomes Project worldwide populations, Mal’ta MA1, and Ust’-Ishim based on 15,549 transversion sites with PB

Supplementary Figure S23. D-statistic tests of Sanganji Jomon, Chimpanzee, 1000 Genomes Project populations from East Eurasia, and Ust’-Ishim based on 14,978 transversion sites with PB

Strong genetic affinity between JPT and other East Eurasians was detected. The position of Sanganji Jomon in the TreeMix analyses would not be affected by sequence errors common to ancient DNA. Error bars indicate standard errors.

Supplementary Figure S24. D-statistic tests of Sanganji Jomon, Chimpanzee, 1000 Genomes Project worldwide populations, Karitiana, Mal’ta MA1, and Ust’-Ishim based on 14,978 transversion sites with PB

Considering the tree (YRI, (non-East Eurasians, (Jomon, CHB))), no evidence of gene flow between Jomon and non-East Eurasians after the divergence was observed. Error bars indicate standard errors.

Supplementary Figure S25. D-statistic tests of Sanganji Jomon and HGDP worldwide populations based on 7,529 SNP sites with PB

Considering the tree (San, (X, (Jomon, Han))), no evidence of gene flow between Jomon and population X after the divergence was observed. Error bars indicate standard errors.

Supplementary Figure S26. D-statistic tests of Sanganji Jomon and HGDP populations based on 7,529 SNP sites with PB

The tree considered is (San, (Y, (Jomon, X))), where population X is Native Americans or East Eurasians and population Y is Sahulians. Error bars indicate standard errors.

Supplementary Figure S27. D-statistic tests of Sanganji Jomon, 1000 Genomes Project worldwide populations, Papuan, Melanesian, Karitiana, Mal’ta MA1, and Ust’-Ishim with PB

(a), (b) The trees based on 21,286 SNP sites and 7,490 transversion sites, respectively. An affinity between Melanesian and East Eurasians relative to Sanganji Jomon was inferred, but the affinity with Papuan was not supported when transversion sites were used. Error bars indicate standard errors.

Supplementary Figure S28. D-statistic tests of Sanganji Jomon, Chimpanzee, 1000 Genomes Project worldwide populations, Papuan, Karitiana, Mal’ta MA1, and Ust’-Ishim with PB

(a), (b) The trees based on 42,128 SNP sites and 14,968 transversion sites, respectively. As in Supplementary Figure S29, the affinity between Papuan and East Eurasians relative to Sanganji Jomon was still not inferred, even though more SNP sites and different outgroups were used. Error bars indicate standard errors.

Supplementary Figure S29. D-statistic tests of Sanganji Jomon, Chimpanzee, 1000 Genomes Project worldwide populations, and Karitiana based on 14,978 transversion sites with PB

(a), (b) The outgroups of the tree were Karitiana and Sanganji Jomon, respectively. Both panels show some skew from zero, but the Z-scores are not significant. Error bars indicate standard errors.

Supplementary Figure S30. D-statistic tests of Sanganji Jomon and HGDP based on 7,529 SNP sites with PB

Population X is European, Sahulian, or East Eurasian, and population Y is Native American. Error bars indicate standard errors.

Supplementary Figure S31. D-statistic tests of Sanganji Jomon, 15 humans, and archaic humans based on 224,646 transversion sites with PB

The tree considered is (Chimp, (Y, (San, X))), where population X is non-Africans and population Y is archaic humans. No pair of individuals showed a significant Z-score, but non-Africans, including Jomon, shift toward positive values compared with Africans. Error bars indicate standard errors.

===== Perl Script for PB (picking bases) =====

If you have any questions about this Perl script, please contact Kirill Kryukov ([email protected]).

#!/usr/bin/env perl

#

# sam-merge-duplicates-picking-bases.pl

#

# Version 0.1.4 (August 3, 2015)

#

# Copyright (c) 2015 Kirill Kryukov

#

# This software is provided 'as-is', without any express or implied

# warranty. In no event will the authors be held liable for any damages

# arising from the use of this software.

#

# Permission is granted to anyone to use this software for any purpose,

# including commercial applications, and to alter it and redistribute it

# freely, subject to the following restrictions:

#

# 1. The origin of this software must not be misrepresented; you must not

# claim that you wrote the original software. If you use this software

# in a product, an acknowledgment in the product documentation would be

# appreciated but is not required.

# 2. Altered source versions must be plainly marked as such, and must not be

# misrepresented as being the original software.

# 3. This notice may not be removed or altered from any source distribution.

#

#

# Usage:

# sam-merge-duplicates-picking-bases.pl --ref-dir REFDIR [Options] <input.sam >output.sam

# Where:

# REFDIR is directory with reference FASTA files.

# Options:

# --ignore-strand - Merge reads regardless of strand (by default merges only same strand reads).

# --debug - Add alignment into output (output is no longer in SAM format).

# --use-n - Use N for ambiguous positions in merged read (those having multiple alleles).

#

# Reads and writes SAM format (when no "--debug" option is used).

#

# Reference directory must contain reference as one file per chromosome, named as:

# "chr1.fa", "chr2.fa", etc.

#

# For each group of reads that look like duplicates, merges them into single read.

#

# Two reads are considered duplicates if they:

# - Are mapped to the same chromosome and at the same starting position

# - Are mapped to reference region of same length

# - Have identically long soft-clipped parts (for both beginning and end).

# - Are mapped to same strand (unless --ignore-strand is specified).

#

# It's OK if they have different insertions/deletions.

#

# Merging means:

# - At sites where all reads in a cluster have same substitution, insertion, or deletion,

# it is preserved in the merged read.

# - At sites where only some reads have substitution, insertion, or deletion, reference

# sequence is used in the merged read.

# - At sites where there are different insertions, the majority among the shortest ones wins.

# - Quality at each merged site is computed as the maximum of all input qualities at this site.

#

# Limitations:

# - This script expects sorted input SAM file.

# - This script discards all reads mapped to reference other than 1..22,X,Y,MT (hardcoded).

# - This script only works with CIGAR containing M,I,D,S - hard clipping is not handled, as well as N.

# - Only reads with flag equal to 0 and 16 are processed.

# - This script only writes optional tags NM and MD for the merged reads.

# - Soft-clipped sequence and quality is simply copied from one of the reads

# (instead of checking which soft-clipped sequence is major, or finding maximum quality).

# - RNEXT, PNEXT and TLEN of the merged read are copied from one of the input reads without any verification.

# - QNAME, FLAG and MAPQ of the merged reads are copied from one of the input reads verbatim.

#

# This script is tested on limited data and may not work on other data.

#

# Let me know if you have suggestions. In case of any issues with this script, please send me

# example input and the expected output.

#

# If you use --debug option, the output will contain complete alignment and other information for

# each cluster of duplicate reads.

#

use strict;

use File::Basename qw(basename);

use File::Glob qw(:bsd_glob);

use File::Slurp;

use Getopt::Long;

$| = 1;

my ($start_time,$debug,$ignore_strand,$use_n,$ref_dir) = (time,0,0,0);

GetOptions(

"debug" => \$debug,

"ref-dir=s" => \$ref_dir,

"ignore-strand" => \$ignore_strand,

"use-n" => \$use_n

);

if (!defined $ref_dir) { die "Reference directory is not specified, please use --ref-dir option\n"; }

if (!-e $ref_dir) { die "Can't find reference directory\n"; }

my @ref_chr_to_use = (1..22,'X','Y','MT');

my %use_chr = map { $_ => 1 } @ref_chr_to_use;

for (@ref_chr_to_use) { if (!-e "$ref_dir/chr$_.fa") { die "Can't find reference file \"$ref_dir/chr$_.fa\"\n"; } }

my ($r_name,$r_seq,$r_len,$cluster_start) = ('','',0,-1);

my ($n_reads_total,$n_single_reads,$n_duplicate_clusters,$n_duplicate_reads,$n_reads_saved,

$n_skipped_by_flag,$n_skipped_by_chromosome) = (0,0,0,0,0,0,0);

my @cluster = ();

#my %flag_count = ();

#my %cigar_char_count = ();

binmode STDIN;

binmode STDOUT;

while (<STDIN>)

{

s/[\x0D\x0A]+$//;

if (/^@/) { print "$_\n"; next; }

my @fields = split /\t/, $_;

my $n_fields = scalar(@fields);

my ($qname,$flag,$rname,$pos,$mapq,$cigar,$rnext,$pnext,$tlen,$seq,$qual) = @fields;

$n_reads_total++;

#$flag_count{$flag}++;

if ($pos != $cluster_start or $rname ne $r_name)

{

if ($rname eq $r_name and $pos < $cluster_start) { die "SAM file is not sorted! Following read is out of order:\n$_\n"; }

process_cluster();

$cluster_start = $pos;

if ($rname ne $r_name)

{

if (exists $use_chr{$rname}) { load_reference_chromosome($rname); }

$r_name = $rname;

}

}

if ($cigar eq '*' or $seq eq '*') { next; }

if (!exists $use_chr{$rname}) { $n_skipped_by_chromosome++; next; }

if ($flag ne '0' and $flag ne '16') { $n_skipped_by_flag++; next; }

#while ($cigar =~ /[0-9]+([MIDNSHPX=])/g) { $cigar_char_count{$1}++; }

#if ($cigar =~ /[0-9]+S/) { print STDERR ' ', $qname; }

my $cs = scalar(@cluster);

@{$cluster[$cs]} = @fields;

}

process_cluster();

print STDERR "\n";

print STDERR "Input has $n_reads_total reads:\n";

if ($n_skipped_by_chromosome > 0) { print STDERR " discarded $n_skipped_by_chromosome reads mapped to reference sequence other than chr. 1-22,X,Y,MT\n"; }

if ($n_skipped_by_flag > 0) { print STDERR " discarded $n_skipped_by_flag reads with flag different from 0 and 16\n"; }

print STDERR " merged $n_duplicate_reads duplicate reads into $n_duplicate_clusters merged reads\n";

print STDERR " kept the other $n_single_reads reads unchanged\n";

#print STDERR "Flags:\n";

#foreach my $flag (sort { $a <=> $b } keys %flag_count) { print STDERR ' ', $flag, ': ', $flag_count{$flag}, "\n"; }

#print STDERR "CIGAR chars:\n";

#foreach my $c (sort { $a cmp $b } keys %cigar_char_count) { print STDERR ' ', $c, ': ', $cigar_char_count{$c}, "\n"; }

my $secs = time - $start_time;

print STDERR "$secs second", (($secs==1)?'':'s'), " elapsed\n";

sub process_cluster

{

if (!scalar @cluster) { return; }

if (scalar(@cluster) == 1) { print join("\t",@{$cluster[0]}), "\n"; $n_single_reads++; @cluster = ();

return; }

my (@soft_clip_start,@soft_clip_end,@ins_len,@del_len,@match_len,@slen,@end_pos);

foreach my $i (keys @cluster)

{

if ($cluster[$i]->[5] =~ /^([0-9]+)S/) { $soft_clip_start[$i] = $1; } else { $soft_clip_start[$i] = 0; }

if ($cluster[$i]->[5] =~ /([0-9]+)S$/) { $soft_clip_end[$i] = $1; } else { $soft_clip_end[$i] = 0; }

$ins_len[$i] = 0;

while ($cluster[$i]->[5] =~ /([0-9]+)I/g) { $ins_len[$i] += $1; }

$del_len[$i] = 0;

while ($cluster[$i]->[5] =~ /([0-9]+)D/g) { $del_len[$i] += $1; }

$match_len[$i] = 0;

while ($cluster[$i]->[5] =~ /([0-9]+)[M=X]/g) { $match_len[$i] += $1; }

$slen[$i] = $match_len[$i] + $del_len[$i];

$end_pos[$i] = $cluster[$i]->[3] + $slen[$i] - 1;

}

my %subclusters;

foreach my $i (keys @cluster)

{

my $key = "$soft_clip_start[$i]-$end_pos[$i]-$soft_clip_end[$i]";

if (!$ignore_strand) { $key .= '-' . (($cluster[$i]->[1] & 16) ? 'minus' : 'plus'); }

push @{$subclusters{$key}->{'indexes'}}, $i;

$subclusters{$key}->{'start_pos'} = $cluster[$i]->[3];

$subclusters{$key}->{'end_pos'} = $end_pos[$i];

$subclusters{$key}->{'slen'} = $slen[$i];

$subclusters{$key}->{'soft_clip_start'} = $soft_clip_start[$i];

$subclusters{$key}->{'soft_clip_end'} = $soft_clip_end[$i];

}

#if (scalar(keys %subclusters) != 1) { print "\n----- multiple clusters starting at the same position! -----\n"; }

foreach my $k (sort { $a cmp $b } keys %subclusters)

{

my $n = scalar(@{$subclusters{$k}->{'indexes'}});

if ($n == 1) { print join("\t",@{$cluster[$subclusters{$k}->{'indexes'}->[0]]}), "\n";

$n_single_reads++; next; }

$n_duplicate_clusters++;

$n_duplicate_reads += $n;

my $slen = $subclusters{$k}->{'slen'};

my $start_pos = $subclusters{$k}->{'start_pos'};

my $end_pos = $subclusters{$k}->{'end_pos'};

my $soft_clip_start = $subclusters{$k}->{'soft_clip_start'};

my $soft_clip_end = $subclusters{$k}->{'soft_clip_end'};

my $rs = substr($r_seq,$start_pos-1,$slen);

my %cigar_char_count = ();

for (my $ii=0; $ii<$n; $ii++)

{

my $i = $subclusters{$k}->{'indexes'}->[$ii];

while ($cluster[$i]->[5] =~ /[0-9]+([MIDNSHPX=])/g) { $cigar_char_count{$1}++; }

}

if ($debug)

{

print "\n----- cluster of $n reads, mapped length: $slen bp (on reference), key: $k -----\n";

if (exists $cigar_char_count{'S'}) { print "----- cluster is soft-clipped -----\n"; }

if (exists $cigar_char_count{'D'}) { print "----- cluster has deletions -----\n"; }

if (exists $cigar_char_count{'I'}) { print "----- cluster has insertions -----\n"; }

print "----- ref-seq: $rs -----\n";

}

my (@aln,@alnq,@ins,@insq,@mininslen,@maxinslen,@ast,@out,@outq,@outins,@outinsq);

# Aligning nucleotides according to CIGAR string.

for (my $ii=0; $ii<$n; $ii++)

{

my $i = $subclusters{$k}->{'indexes'}->[$ii];

my $ap = 0;

my $qp = $soft_clip_start;

while ($cluster[$i]->[5] =~ /([0-9]+)([MIDNPX=])/g)

{

my ($len,$op) = ($1,$2);

if ($op eq 'D') {

for (my $p=0; $p<$len; $p++)

{

$aln[$ii]->[$ap+$p] = '-';

$alnq[$ii]->[$ap+$p] = '';

}

$ap += $len;

}

elsif ($op eq 'I')

{

$ins[$ii]->[$ap] = substr($cluster[$i]->[9],$qp,$len);

$insq[$ii]->[$ap] = substr($cluster[$i]->[10],$qp,$len);

$qp += $len;

}

elsif ($op eq 'M' or $op eq '=' or $op eq 'X')

{

for (my $p=0; $p<$len; $p++)

{

$aln[$ii]->[$ap+$p] = substr($cluster[$i]->[9],$qp+$p,1);

$alnq[$ii]->[$ap+$p] = substr($cluster[$i]->[10],$qp+$p,1);

}

$ap += $len;

$qp += $len;

}

else { die "Unsupported CIGAR component: $op\n"; }

}

}

for (my $p=0; $p<$slen; $p++) { $aln[$n]->[$p] = substr($rs,$p,1); }

# Merging aligned nucleotides

for (my $p=0; $p<$slen; $p++)

{

my %cn = ();

for (my $ii=0; $ii<$n; $ii++) { $cn{$aln[$ii]->[$p]}++; }

my $maxcn = 0;

foreach my $c (keys %cn) { if ($cn{$c} > $maxcn) { $maxcn = $cn{$c}; } }

if ($maxcn == $n)

{

$out[$p] = $aln[0]->[$p];

$ast[$p] = ($aln[0]->[$p] eq '-') ? 'D' : ($aln[0]->[$p] eq $aln[$n]->[$p]) ? '=' : 'X';

}

else

{

$out[$p] = $use_n ? 'N' : $aln[$n]->[$p];

$ast[$p] = '~';

}

}

# Merging insertions

my $has_disagreeing_insertions = 0;

my $has_interesting_insertions = 0;

for (my $p=0; $p<$slen; $p++)

{

my ($min_ins_len,$max_ins_len) = (1000000,0);

for (my $ii=0; $ii<$n; $ii++)

{

my $l = defined($ins[$ii]->[$p]) ? length($ins[$ii]->[$p]) : 0;

if ($l < $min_ins_len) { $min_ins_len = $l; }

if ($l > $max_ins_len) { $max_ins_len = $l; }

}

$mininslen[$p] = $min_ins_len;

$maxinslen[$p] = $max_ins_len;

if ($min_ins_len < 1) { next; }

my %ins_num = ();

my %ins_len_num = ();

for (my $ii=0; $ii<$n; $ii++)

{

my $l = defined($ins[$ii]->[$p]) ? length($ins[$ii]->[$p]) : 0;

if ($l == $min_ins_len) { $ins_num{$ins[$ii]->[$p]}++; }

$ins_len_num{$l}++;

}

if (scalar(keys %ins_num) > 1) { $has_disagreeing_insertions = 1; }

if (scalar(keys %ins_len_num) > 1) { $has_interesting_insertions = 1; }

my $major_ins = (sort { $ins_num{$b} <=> $ins_num{$a} } keys %ins_num)[0];

$outins[$p] = $major_ins;

}

if ($debug)

{

#if ($has_disagreeing_insertions) { print "----- cluster has disagreeing insertions -----\n"; }

#if ($has_interesting_insertions) { print "----- cluster has insertions of varying lengths -----\n"; }

print "----- input reads: -----\n";

foreach my $i (@{$subclusters{$k}->{'indexes'}}) { print join("\t",@{$cluster[$i]}), "\n"; }

print "----- alignment: -----\n";

print 'ref ';

for (my $p=0; $p<$slen; $p++)

{

if ($maxinslen[$p] > 0) { print (('-') x ($maxinslen[$p])); }

print $aln[$n]->[$p];

}

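print "\n";

for (my $ii=0; $ii<$n; $ii++)

{

my $aname = sprintf('%3d',$ii+1);

print $aname, ' ';

for (my $p=0; $p<$slen; $p++)

{

if ($maxinslen[$p] > 0)

{

my $l = defined($ins[$ii]->[$p]) ? length($ins[$ii]->[$p]) : 0;

if (defined($ins[$ii]->[$p])) { print $ins[$ii]->[$p]; }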

if ($l < $maxinslen[$p]) { print (('-') x ($maxinslen[$p]-$l)); }

}

print $aln[$ii]->[$p];

}

print "\n";

}

print ' ';

for (my $p=0; $p<$slen; $p++)

{

if ($mininslen[$p] > 0) { print (('I') x ($mininslen[$p])); }

if ($maxinslen[$p] > $mininslen[$p]) { print (('-') x ($maxinslen[$p]-$mininslen[$p])); }

print $ast[$p];

}

print "\n";

print 'out ';

for (my $p=0; $p<$slen; $p++)

{

if ($maxinslen[$p] > 0)

{

my $l = defined($outins[$p]) ? length($outins[$p]) : 0;

if (defined($outins[$p])) { print $outins[$p]; }

if ($l < $maxinslen[$p]) { print (('-') x ($maxinslen[$p]-$l)); }

}

print $out[$p];

}

print "\n";

}

# Constructing output CIGAR

my $out_cigar_chars = '';

if ($soft_clip_start > 0) { $out_cigar_chars .= ('S') x $soft_clip_start; }

for (my $p=0; $p<$slen; $p++)

{

if (defined($outins[$p])) { $out_cigar_chars .= ('I') x length($outins[$p]); }

$out_cigar_chars .= ($out[$p] eq '-') ? 'D' : 'M';

}

if ($soft_clip_end > 0) { $out_cigar_chars .= ('S') x $soft_clip_end; }

my $out_cigar = '';

while ($out_cigar_chars =~ /([MIDS])\1*/g) { $out_cigar .= length($&) . $1; }

# Constructing output sequence.

my $out_seq = '';

if ($soft_clip_start > 0) { $out_seq .= substr($cluster[$subclusters{$k}->{'indexes'}->[0]]->[9],0,$soft_clip_start); }

for (my $p=0; $p<$slen; $p++)

{

if (defined($outins[$p])) { $out_seq .= $outins[$p]; }

if ($out[$p] ne '-') { $out_seq .= $out[$p]; }

}

if ($soft_clip_end > 0) { $out_seq .= substr($cluster[$subclusters{$k}->{'indexes'}->[0]]->[9],-$soft_clip_end,$soft_clip_end); }

# Constructing output quality.

my $out_qual = '';

if ($soft_clip_start > 0) { $out_qual .= substr($cluster[$subclusters{$k}->{'indexes'}->[0]]->[10],0,$soft_clip_start); }

for (my $p=0; $p<$slen; $p++)

{

if (defined($outins[$p]))

{

for (my $pp=0; $pp<length($outins[$p]); $pp++)

{

my $q = 33;

for (my $ii=0; $ii<$n; $ii++)

{

if (!defined($ins[$ii]->[$p])) { next; }

if ($ins[$ii]->[$p] ne $outins[$p]) { next; }

my $qa = ord(substr($insq[$ii]->[$p],$pp,1));

if ($qa > $q) { $q = $qa; }

}

$out_qual .= chr($q);

}

}

if ($out[$p] ne '-')

{

my $q = 33;

for (my $ii=0; $ii<$n; $ii++)

{

if ($aln[$ii]->[$p] ne $out[$p]) { next; }

my $qa = ord($alnq[$ii]->[$p]);

if ($qa > $q) { $q = $qa; }

}

$out_qual .= chr($q);

}

}

if ($soft_clip_end > 0) { $out_qual .= substr($cluster[$subclusters{$k}->{'indexes'}->[0]]->[10],-$soft_clip_end,$soft_clip_end); }

# Computing NM tag (edit distance from reference)

my $NM = 0;

for (my $p=0; $p<$slen; $p++)

{

if (defined($outins[$p])) { $NM += length($outins[$p]); }

if ($out[$p] ne $aln[$n]->[$p]) { $NM++; }

}

# Constructing MD tag (reference bases that differ from read)

my $MD = '';

for (my $p=0; $p<$slen; $p++)

{

if ($out[$p] eq '-') { $MD .= '-' . $aln[$n]->[$p]; }

else { $MD .= ($out[$p] eq $aln[$n]->[$p]) ? '=' : ('.' . $aln[$n]->[$p]); }

}

while ($MD =~ /(\=+)/) { $MD = $` . length($1) . $'; }

$MD =~ s/([a-zA-Z])(\.)/${1}0$2/g;

while ($MD =~ /(\-[a-zA-Z]+)\-([a-zA-Z]+)/) { $MD = $` . $1 . $2 . $'; }

$MD =~ s/\-/^/g;

$MD =~ s/\.//g;

if ($MD !~ /^\d/) { $MD = '0' . $MD; }

if ($MD !~ /\d$/) { $MD .= '0'; }

# Printing the output read.

if ($debug) { print "----- merged read: -----\n"; }

my $i0 = $subclusters{$k}->{'indexes'}->[0];

print $cluster[$i0]->[0], "\t", $cluster[$i0]->[1], "\t", $cluster[$i0]->[2], "\t", $cluster[$i0]->[3], "\t",

$cluster[$i0]->[4];

print "\t", $out_cigar, "\t", $cluster[$i0]->[6], "\t", $cluster[$i0]->[7], "\t", $cluster[$i0]->[8];

print "\t$out_seq\t$out_qual\tNM:i:$NM\tMD:Z:$MD\n";

if ($debug) { print "\n"; }

}

@cluster = ();

}

sub load_reference_chromosome

{

my ($name) = @_;

my $file = "$ref_dir/chr$name.fa";

if (!-e $file) { die "Can't find reference file \"$file\"\n"; }

open(my $R,'<',$file) or die "Can't open \"$file\"\n";

binmode $R;

print STDERR "Loading chr$name ..";

$r_seq = '';

<$R>;

while (<$R>) { s/[\x0D\x0A]+$//; $r_seq .= $_; }

close $R;

$r_len = length($r_seq);

print STDERR " OK - $r_len bp\n";

}
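
The script above builds the output CIGAR by writing one operation character per position and then run-length encoding that string. The standalone sketch below isolates that last step for illustration; the helper name `compress_cigar` is an assumption for this example and does not appear in the script, and only the operations the script emits (M, I, D, S) are handled.

```perl
use strict;
use warnings;

# Hypothetical helper, shown for illustration only: run-length encode a
# per-position operation string (one of M, I, D, S per position) into CIGAR.
sub compress_cigar {
    my ($ops) = @_;
    my $cigar = '';
    # Each match is one maximal run of a single operation character.
    while ($ops =~ /(([MIDS])\2*)/g) {
        $cigar .= length($1) . $2;
    }
    return $cigar;
}

# Two soft-clipped bases, three matches, one inserted base,
# two matches, then a two-base deletion.
print compress_cigar('SSMMMIMMDD'), "\n";   # prints 2S3M1I2M2D
```

Using explicit capture groups instead of `$&` gives the same result and keeps the run and its operation character in named positions.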

===== Five Perl Scripts for obtaining bootstrap values from TreeMix outputs =====

We provide the following five Perl scripts for obtaining bootstrap values from TreeMix outputs.

(A) 1Newick_to_Split_name_change.pl

(B) 2Newick_to_Split_one_original.pl

(C) 3Newick_to_Split_many.pl

(D) 4Split_Count.pl

(E) 5Split_to_Newick+bootstrap.pl

We transform the Newick format into the split matrix format to simplify the counting of interior branches (= splits) when obtaining bootstrap probabilities. For example, the splits matrix and population names corresponding to the TreeMix tree shown in Figure 4 are as follows.

--------------------------------------
Population ID:
          000000000111111111  Bootstrap
          123456789012345678  Frequency
--------------------------------------
Split 01: 000000000000000011   910
Split 02: 000000000000011000   123
Split 03: 000000000000011100   380
Split 04: 000000000000011111   982
Split 05: 000010000000011111   956
Split 06: 011101111111100000   998
Split 07: 011100111111100000   973
Split 08: 011000111111100000   952
Split 09: 000000000100100000   888
Split 10: 000000001010000000   701
Split 11: 000000001110100000   701
Split 12: 000000001111100000   975
Split 13: 001000001111100000   996
Split 14: 010000110000000000  1000
Split 15: 000000110000000000   526
======================================

List of Populations

ID: name

1 Sanganji Jomon

2 Denisovan

3 Malta_MA1

4 Ust_Ishim

5 Karitiana

6 Papuan

7 YRI

8 LWK

9 CEU

10 IBS

11 GBR

12 FIN

13 TSI

14 JPT

15 CHB

16 CHS

17 CDX

18 KHV

See Chapter 3 of “Introduction to Evolutionary Genomics” (Naruya Saitou, 2014, Springer) for an explanation of the splits matrix. If you have any questions about these Perl scripts, please contact Naruya Saitou ([email protected]).
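
The Newick-to-split transformation can be sketched compactly as follows. This is an illustrative re-implementation under stated assumptions, not one of the five scripts: the function name `newick_to_splits` is hypothetical, the tree is assumed to be strictly bifurcating, branch lengths are assumed to have been removed already, and leaves are assumed to be labelled with population IDs 1..n. As in the example matrix above, each split is oriented so that population 1 is on the 0 side.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only (not one of the five scripts).
#   $newick - bifurcating topology without branch lengths, leaves labelled
#             1..$n, e.g. "((1,2),(3,(4,5)))"
#   $n      - number of leaves (populations)
# Returns the non-trivial splits as strings of 0/1, one character per leaf.
sub newick_to_splits {
    my ($newick, $n) = @_;

    # Row i describes node i; each leaf starts as a row with a single '1'.
    my @row = map { my $s = '0' x $n; substr($s, $_ - 1, 1) = '1'; $s } 1 .. $n;

    my (@splits, %seen);
    my $next = $n;
    # Repeatedly collapse the leftmost innermost pair "(a,b)" into a new node.
    while ($newick =~ /\((\d+),(\d+)\)/) {
        my ($left, $right) = ($1, $2);
        $next++;
        $newick =~ s/\((\d+),(\d+)\)/$next/;

        # The new node contains every leaf found under either child.
        my $union = '';
        for my $p (0 .. $n - 1) {
            my $in_left  = substr($row[$left - 1],  $p, 1) eq '1';
            my $in_right = substr($row[$right - 1], $p, 1) eq '1';
            $union .= ($in_left || $in_right) ? '1' : '0';
        }
        push @row, $union;

        # Skip trivial splits (a single leaf, or everything but one leaf).
        my $ones = ($union =~ tr/1//);
        next if $ones < 2 || $ones > $n - 2;

        # Orient each split so that leaf 1 is always on the '0' side.
        $union =~ tr/01/10/ if substr($union, 0, 1) eq '1';
        push @splits, $union unless $seen{$union}++;
    }
    return @splits;
}

# A five-leaf tree has two interior branches, hence two splits.
print "$_\n" for newick_to_splits('((1,2),(3,(4,5)))', 5);
# prints:
# 00111
# 00011
```

Because the two branches that meet at the root describe the same bipartition, the sketch records each oriented split only once; a bootstrap frequency is then simply the number of replicate trees whose split list contains the same string.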

(A) 1Newick_to_Split_name_change.pl

print "Please type population file ==> ";

$pop_file = <STDIN>;

chomp($pop_file);

open (FILE2,$pop_file) or die "$!";

@poplist=<FILE2>;

close(FILE2);

print @poplist;

@popname=();

@popno=();

foreach $poplist(@poplist) {

($no,$popname) = split(/\s+/,$poplist);

push(@popno,$no);

push(@popname,$popname);

$tree_topology =~ s/$popname:/$no:/;

}

print "Please type name of new folder which copy treeout files ==> ";

$folder = <STDIN>;

chomp($folder);

opendir (DIR, $folder) or die "$folder: $!";

@dirs = readdir(DIR);

foreach $dir(@dirs) {

if ( $dir =~ /^\./) {

next;

}

$filename = "$folder\/$dir";

open (FILE2,$filename) or die "$!";

@split_row=<FILE2>;

close(FILE2);

#print join("\n",@split_row);<STDIN>;

for ($i=0;$i<@poplist;$i++) {

$split_row[0] =~ s/$popname[$i]/$popno[$i]/;

}

#print join("\n",$split_row[0]);<STDIN>;

open (NEWFILE, "> $filename") or die "$!";

print NEWFILE $split_row[0];

close (NEWFILE);

}

closedir(DIR);

end_proc:

($sec,$min,$hour) = localtime();

print "Computation is normally ended. $hour:$min\nThe new treeout file is created in '$folder'.\n";

(B) 2Newick_to_Split_one_original.pl

($sec,$min,$hour,$mday,$mon,$year) = localtime();

$year += 1900;

$mon += 1;

print "start Newick_to_Split_one.pl $year/$mon/$mday $hour:$min\n";

start:

print 'This perl program is written by Mizuguchi Masako.',"\n";

print 'Algorithm is provided by Saitou Naruya.',"\n";

print 'Version Date: Nov 21, A.S. 0015 (2015 A.D.)',"\n";

print 'This script transforms Newick format file produced by TreeMix to split list format.',"\n";

print "Please type Newick format file ==> ";

$in_file = <STDIN>;

chomp($in_file);

if ($in_file =~ /^(.*)\.treeout$/) {

$out_file = $1 . "_original.split";

$out_file2 = $1 . "_original.popout";

$out_file3 = $1 . "_original.split+tree";

}

open (FILE,$in_file) or die "$!";

@line=<FILE>;

close(FILE);

$tree_topology = $line[0];

$tree_topology =~ /^(.*);/;

$tree_topology = $1;

open (OUTFILE2,"> $out_file2") or die "$!";

print OUTFILE2 $tree_topology, "\n\n";

print "Please type population file ==> ";

$pop_file = <STDIN>;

chomp($pop_file);

open (FILE2,$pop_file) or die "$!";

@poplist=<FILE2>;

close(FILE2);

print OUTFILE2 @poplist;

@popname=();

@popno=();

foreach $poplist(@poplist) {

($no,$popname) = split(/\s+/,$poplist);

push(@popno,$no);

push(@popname,$popname);

$tree_topology =~ s/$popname:/$no:/;

}

print OUTFILE2 "\n",$tree_topology,"\n\n",;

$OTUs_no = $no;

while ($tree_topology =~ /(\:-?\d+\.?\d*e?\-?\d*)[\,\)]/g) {

$tree_topology =~ s/$1//;

}

#print OUTFILE2 "\n",$tree_topology,"\n\n",;

LOOP:

@row_list = ();

for ($i=0;$i<$OTUs_no;$i++) {

@one_row = ();

for ($j=0;$j<$OTUs_no;$j++) {

if ($j == $i ) {

push(@one_row,1);

} else {

push(@one_row,0);

}

}

push(@row_list,join('',@one_row));

}

$new_no = $OTUs_no;

@new_row = ();

$new_row = 0;

while ($tree_topology =~ /(\(\d+\,\d+\))/g) {

$new_no++;

$tree_topology =~ s/$2/$new_no/;

#print $&;<STDIN>;

$& =~ /(\d+)\,(\d+)/;

@row1=split(//,$row_list[$1-1]);

@row2=split(//,$row_list[$2-1]);

@one_row=();

$total = 0;

for ($j=0;$j<$OTUs_no;$j++) {

$total = $total + $row1[$j]+$row2[$j];

push(@one_row,$row1[$j]+$row2[$j]);

}

push(@popno,"($popno[$1-1],$popno[$2-1])");

push(@popname,"($popname[$1-1],$popname[$2-1])");

#print join("\n",@popno);<STDIN>;

#print join("\n",@popname);<STDIN>;

if ($total > $OTUs_no-2 ) {

last;

}

push(@new_row,,join('',@one_row));

push(@row_list,join('',@one_row));

$new_row++;

if ($new_row>=$OTUs_no-3) {

last;

}

}

for ($i=0;$i<@new_row;$i++) {

@one_row = ();

@one_row = split(//,$new_row[$i]);

if ($one_row[0] == 0) {

next;

}

for ($j=0;$j<@one_row;$j++) {

if ($one_row[$j] == 0) {

$one_row[$j] = 1;

} else {

$one_row[$j] = 0;

}

}

$new_row[$i] = join('',@one_row);

}

open (OUTFILE,"> $out_file") or die "$!";

print OUTFILE join("\n",@new_row),"\n";

print OUTFILE2 join("\n",@new_row),"\n";

open (OUTFILE3,"> $out_file3") or die "$!";

for ($i=0;$i<@new_row;$i++) {

print OUTFILE3 "$new_row[$i] $popname[$i+$OTUs_no]\n";

#print OUTFILE3 "$new_row[$i] $popno[$i+$OTUs_no]\n";

}

close(OUTFILE);

close(OUTFILE2);

close(OUTFILE3);

end_proc:

($sec,$min,$hour) = localtime();

print "Computation is normally ended. $hour:$min\nThe file name created split list of original tree is '$out_file' and the check list is '$out_file2'. \nThe number of populations compared is $OTUs_no\n";

(C) 3Newick_to_Split_many.pl

($sec,$min,$hour,$mday,$mon,$year) = localtime();

$year += 1900;

$mon += 1;

print "start Newick_to_Split.pl $year/$mon/$mday $hour:$min\n";

start:

print 'This perl program is written by Mizuguchi Masako.',"\n";

print 'Algorithm is provided by Saitou Naruya.',"\n";

print 'Version Date: Nov 21, A.S. 0015 (2015 A.D.)',"\n";

print 'This script transforms Newick format file produced by TreeMix to split list format.',"\n";

print "Please type common part of Newick format file ==> ";

$file_before = <STDIN>;

chomp($file_before);

print "Please type number of bootstraping pseudosamples ==> ";

$bootstrap_no = <STDIN>;

chomp($bootstrap_no);

print "Please type number of populations compared ==> ";

$OTUs_no = <STDIN>;

chomp($OTUs_no);

chomp($file);

for ($no=1;$no<=$bootstrap_no;$no++) {

if ($no < 10) {

$in_file = $file_before . "000" . $no . ".treeout";

$out_file = $file_before . "000" . $no . ".split";

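} elsif ($no < 100) {

$in_file = $file_before . "00" . $no . ".treeout";

$out_file = $file_before . "00" . $no . ".split";

} elsif ($no < 1000) {

$in_file = $file_before . "0" . $no . ".treeout";

$out_file = $file_before . "0" . $no . ".split";

} else {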

$in_file = $file_before . $no . ".treeout";

$out_file = $file_before . $no . ".split";

}

open (FILE,$in_file) or die "$!";

@line=<FILE>;

close(FILE);

$tree_topology = $line[0];

$tree_topology =~ /^(.*);/;

$tree_topology = $1;

open (OUTFILE,"> $out_file") or die "$!";

LOOP:

@row_list = ();

for ($i=0;$i<$OTUs_no;$i++) {

@one_row = ();

for ($j=0;$j<$OTUs_no;$j++) {

if ($j == $i ) {

push(@one_row,1);

} else {

push(@one_row,0);

}

}

push(@row_list,join('',@one_row));

}

while ($tree_topology =~ /(\:-?\d+\.?\d*e?\-?\d*)[\,\)]/g) {

$tree_topology =~ s/$1//;

}

$new_no = $OTUs_no;

@new_row = ();

$new_row = 0;

while ($tree_topology =~ /(\(\d+\,\d+\))/g) {

$new_no++;

$tree_topology =~ s/$2/$new_no/;

$& =~ /(\d+)\,(\d+)/;

@row1=split(//,$row_list[$1-1]);

@row2=split(//,$row_list[$2-1]);

@one_row=();

$total = 0;

for ($j=0;$j<$OTUs_no;$j++) {

$total = $total + $row1[$j]+$row2[$j];

push(@one_row,$row1[$j]+$row2[$j]);

}

if ($total > $OTUs_no-2 ) {

last;

}

push(@new_row,,join('',@one_row));

push(@row_list,join('',@one_row));

$new_row++;

if ($new_row>=$OTUs_no-3) {

last;

}

}

for ($i=0;$i<@new_row;$i++) {

@one_row = ();

@one_row = split(//,$new_row[$i]);

if ($one_row[0] == 0) {

next;

}

for ($j=0;$j<@one_row;$j++) {

if ($one_row[$j] == 0) {

$one_row[$j] = 1;

} else {

$one_row[$j] = 0;

}

}

$new_row[$i] = join('',@one_row);

}

print OUTFILE join("\n",@new_row),"\n";

close(OUTFILE);

}

end_proc:

($sec,$min,$hour) = localtime();

print "Computation is normally ended. $hour:$min\n";

(D) 4Split_Count.pl

#use strict;

($sec,$min,$hour,$mday,$mon,$year) = localtime();

$year += 1900;

$mon += 1;

print "start Split_Count.pl $year/$mon/$mday $hour:$min\n";

start:

print 'This perl program is written by Mizuguchi Masako.',"\n";

print 'Algorithm is provided by Saitou Naruya.',"\n";

print 'Version Date: Nov 21, A.S. 0015 (2015 A.D.)',"\n";

print 'This script counts numbers splits to obtain bootstrap probabilities.',"\n";

print "Please type file name for split list of original tree produced by TreeMix ==> ";

$file = <STDIN>;

chomp($file);

print "Please type name of folder (or directory) which contains split list files ==> ";

$folder = <STDIN>;

chomp($folder);

open (FILE,$file) or die "$!";

@row=<FILE>;

close(FILE);

opendir (DIR, $folder) or die "$folder: $!";

@dirs = readdir(DIR);

@count = ();

for($i=0;$i<@row;$i++) {

$count[$i] = 0;

}

foreach $dir(@dirs) {

if ( $dir =~ /^\./) {

next;

}

if ($dir !~ /split/) {

next;

}

$filename = "$folder\/$dir";

open (FILE2,$filename) or die "$!";

@split_row=<FILE2>;

close(FILE2);

for ($i=0;$i<@row;$i++) {

for ($j=0;$j<@split_row;$j++) {

if ($row[$i] eq $split_row[$j]) {

$count[$i]++;

}

}

}

}

closedir(DIR);

$outfile = $folder . "_count.txt";

open (OUTFILE,"> $outfile") or die "$!";

for ($i=0;$i<@row;$i++) {

chomp($row[$i]);

print OUTFILE "$row[$i] $count[$i] \n";

}

close(OUTFILE);

end_proc:

($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time);

print "Computation is normally ended. $hour:$min\nThe output file is '$outfile'.\n";
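The counting step in the script above can also be sketched with a hash lookup instead of the nested loops. This is only a minimal illustration, not the authors' implementation: the function name `count_split_support` and the example split strings are hypothetical, and, unlike the listing above, it counts each split at most once per replicate (identical whenever a replicate's split list has no duplicate lines).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original scripts): for each split of
# the original tree, count how many bootstrap replicates contain that split.
sub count_split_support {
    my ($original, $replicates) = @_;   # array ref, array ref of array refs
    my %seen;
    for my $rep (@$replicates) {
        # Assumption: a split is counted at most once per replicate.
        my %in_rep = map { $_ => 1 } @$rep;
        $seen{$_}++ for keys %in_rep;
    }
    # Splits never observed in any replicate get a count of 0.
    return [ map { $seen{$_} // 0 } @$original ];
}

# Made-up example data: splits written as 0/1 strings, one character per OTU.
my @original   = ('0011', '0101');
my @replicates = (['0011', '0110'], ['0011', '0101'], ['0110', '0101']);

my $support = count_split_support(\@original, \@replicates);
print "$original[$_] $support->[$_]\n" for 0 .. $#original;
```

With the made-up data above, each of the two original splits appears in two of the three replicates, so both lines are printed with a count of 2.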

(E) 5Split_to_Newick+bootstrap.pl

#use strict;

($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time);


$year += 1900;

$mon += 1;

print "start Split_Count.pl $year/$mon/$mday $hour:$min\n";

start:

print 'This perl program is written by Mizuguchi Masako.',"\n";

print 'Algorithm is provided by Saitou Naruya.',"\n";

print 'Version Date: Jan 19, A.S. 0016 (2016 A.D.)',"\n";

print "Please type Newick format file ==> ";

$file1 = <STDIN>;

chomp($file1);

print "Please type file name of xxxxxx_count.txt ==> ";

$file2 = <STDIN>;

chomp($file2);

if ($file1 =~ /^(.*)\.treeout$/) {

$file3 = $1 . "_original.split+tree";

$outfile = $1. ".tree";

}

open (FILE1,$file1) or die "$!";

@row1=<FILE1>;

close(FILE1);

$treeout = $row1[0];

open (FILE2,$file2) or die "$!";


@row2=<FILE2>;

close(FILE2);

open (FILE3,$file3) or die "$!";


@row3=<FILE3>;

close(FILE3);

LOOP:

$item = pop(@row3);

if (defined($item)) {

($distance1,$pair) = split(/\s+/,$item);

@pair = split(/,/,$pair);

$item2 = $pair[-1];

@char = split(/\)/,$item2);

$name = $char[0];

@matches = $item2 =~ m/\)/g;

$n = scalar(@matches);

($distance2,$cnt) = split(/\s+/,pop(@row2));

$treeout =~ /^(.*)$name(.*)$/;

$front_char = $1;

$item3 = $2;

@item3 = split(/\)/,$item3);

@item4 = split(/,/,$item3[$n]);

$item4[0] = $item4[0] . "\[" . $cnt/10 . "\]";

$item3[$n] = join(',',@item4);

$item3 = join("\)",@item3);

$treeout = $front_char . $name . $item3;

goto LOOP;

}

open (OUTFILE,"> $outfile") or die "$!";

print OUTFILE $treeout;

close(OUTFILE);

end_proc:

($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time);

print "Computation is normally ended. $hour:$min\n";

print "The tree is '$outfile'. \n";
