How to create phylogenetic trees from raw polyguanine genotyping data (April 2016)

1. We have provided raw data for every PCR replicate in the folder “Filtered Raw Data After Impurity Exclusion”. PCR replicates are identified by the suffix -1, -2 or -3. For example, the three PCR replicates for sample 12-A are labeled 12-A-1, 12-A-2 and 12-A-3. Low-quality PCRs have already been removed from this data set as described in the supplementary methods of Naxerova et al.: a PCR was considered low quality if the mean height of all its fragments was below 10% of the mean intensity across all replicates for the given marker and patient (the number 0.1 in the file names refers to this cutoff). As a result, some samples have only one or two PCR replicates. Impure samples have also been removed as described in the supplementary methods. Please note that for patient C31 the file names carry no 0.1 extension because we reused previously collected genotyping data, as described in the supplementary methods.
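The provided files already have the quality filter applied, so the following Python sketch is only meant to illustrate the replicate naming and the 10% cutoff. The function names, the input layout (one dictionary of fragment heights per marker and patient) and the reading of "mean intensity across all replicates" as the mean of the per-replicate mean heights are assumptions, not taken from the supplementary methods.

```python
from statistics import mean


def split_replicate_label(label):
    """Split a label such as '12-A-3' into the sample name and replicate number."""
    sample, _, replicate = label.rpartition("-")
    return sample, int(replicate)


def filter_low_quality(peak_heights, cutoff=0.1):
    """Drop low-quality PCRs for one marker and one patient.

    peak_heights maps a replicate label to its list of fragment heights
    (hypothetical layout). A replicate is kept when its mean fragment height
    is at least `cutoff` times the mean over all replicates (assumed to be
    the mean of the per-replicate means).
    """
    per_replicate = {label: mean(h) for label, h in peak_heights.items()}
    overall = mean(per_replicate.values())
    return {
        label: peak_heights[label]
        for label, m in per_replicate.items()
        if m >= cutoff * overall
    }
```

For example, a replicate with a mean height of 3 among replicates whose means are around 135 would fall below the cutoff and be removed.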

2. The sample names in the raw data files are derived from the order of paraffin blocks from which the samples originated. To convert these original sample names to the labels that are used in the phylogenetic trees and the figures in Naxerova et al., please use the SampleNames.txt file, which provides a complete mapping.

3. The first step in our analysis pipeline is to choose a representative replicate from the original three PCR replicates. To achieve this, we first calculate pairwise Jensen-Shannon distances (JSD) among all replicates for a marker. These complete pairwise distances are contained in the “Data matrices” folder and are called [Marker][Patient]0.1_distancematrixdf=JSMr=0.11.tsv. As described in the supplementary methods, we developed the following procedure to select the most representative replicate of a marker in each sample: (i) if all three replicates were classified as identical (JSD below d=0.11), we selected the pair of replicates that minimized the distance measure and then selected the replicate with the higher intensity value; (ii) if two out of three replicates were classified as identical, we selected the replicate with the higher intensity from the two identical replicates; (iii) if only two replicates were available and they were classified as identical, we proceeded as in case (ii); (iv) if only one replicate was available, we proceeded with this replicate; and (v) if all available replicates were classified as different (JSDs above 0.11), we excluded this marker across all samples of a subject.
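The selection rules in step 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the published analysis: the function names and input layout are hypothetical, the Jensen-Shannon distance is taken to be the square root of the base-2 Jensen-Shannon divergence of the normalized fragment heights, and when more than one pair of replicates falls below the threshold the closest pair is used. The intensity values are passed in by the caller because their exact definition is given in the supplementary methods, not here.

```python
import math
from itertools import combinations


def js_distance(p, q):
    """Jensen-Shannon distance (assumed: sqrt of the base-2 JS divergence)."""
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]  # normalize heights to sum to 1
    q = [x / total_q for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a zero weight contribute 0
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


def select_representative(profiles, intensity, threshold=0.11):
    """Pick the representative replicate of one marker in one sample.

    profiles maps a replicate label to its fragment heights; intensity maps
    the same labels to an intensity value. Returns a label, or None when all
    replicates differ (case v, marker to be excluded).
    """
    names = sorted(profiles)
    if len(names) == 1:
        return names[0]  # case (iv): only one replicate available
    pairs = [
        (js_distance(profiles[a], profiles[b]), a, b)
        for a, b in combinations(names, 2)
    ]
    identical = [t for t in pairs if t[0] < threshold]
    if not identical:
        return None  # case (v): no pair classified as identical
    _, a, b = min(identical)  # cases (i)-(iii): closest identical pair
    return a if intensity[a] >= intensity[b] else b
```

A return value of None for any sample signals that the marker should be dropped for all samples of that subject, as described in case (v).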

4. The files called [Marker][Patient]0.1_representativedistancematrixdf=JSMr=0.11.tsv contain the distances between the representative replicates we identified in the previous step. If a sample does not have a representative replicate (case (v) above), the marker is excluded for that subject: all values in the corresponding file are 0, and the file is omitted from further use.

5. To obtain the final distance matrix used for phylogenetic reconstruction (called [Patient]normalized_final_distance_matrix.txt), the distances from the [Marker][Patient]0.1_representativedistancematrixdf=JSMr=0.11.tsv files are normalized by the number of markers used for this subject.
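One way to read steps 4 and 5 is that the per-marker matrices are summed entry by entry and divided by the number of markers that were actually used. The Python sketch below follows that reading. It assumes (hypothetically) that each marker matrix has already been loaded as a nested list with the samples in the same order, and that an all-zero matrix marks an excluded marker as described in step 4.

```python
def final_distance_matrix(marker_matrices):
    """Combine per-marker distance matrices into one normalized matrix.

    marker_matrices maps a marker name to a square nested list of distances
    (same sample order in every matrix). All-zero matrices are treated as
    excluded markers and skipped.
    """
    used = [
        m for m in marker_matrices.values()
        if any(value != 0 for row in m for value in row)
    ]
    if not used:
        raise ValueError("no usable markers for this subject")
    n = len(used[0])
    # Sum each entry over the used markers and divide by their number
    return [
        [sum(m[i][j] for m in used) / len(used) for j in range(n)]
        for i in range(n)
    ]
```

With two usable markers giving distances of 0.2 and 0.4 between a pair of samples, plus one excluded all-zero marker, the combined distance for that pair is 0.3.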

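6. To create neighbor-joining or UPGMA trees, install the R package ape (https://cran.r- ) and use its nj() function for neighbor joining. For UPGMA, use the upgma() function, which is provided by the R package phangorn (phangorn builds on ape).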
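For readers who want to see what the UPGMA step does to the final distance matrix, here is a small standard-library Python sketch that returns the tree in Newick format. It is an independent illustration of the textbook algorithm, not the R implementation, and the function name and input layout are assumptions. For the actual analysis, the R functions named in step 6 remain the reference.

```python
def upgma(labels, dist):
    """Build a UPGMA tree from a symmetric distance matrix; return Newick text."""
    # Each cluster id maps to (newick string, number of leaves, node height)
    clusters = {i: (label, 1, 0.0) for i, label in enumerate(labels)}
    # Pairwise distances keyed by (smaller id, larger id)
    d = {
        (i, j): dist[i][j]
        for i in range(len(labels))
        for j in range(i + 1, len(labels))
    }
    next_id = len(labels)
    while len(clusters) > 1:
        # Merge the two closest clusters
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])
        (na, sa, ha), (nb, sb, hb) = clusters.pop(a), clusters.pop(b)
        height = dab / 2
        newick = f"({na}:{height - ha:.4f},{nb}:{height - hb:.4f})"
        # Distance to the merged cluster is the size-weighted average
        for c in clusters:
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (sa * dac + sb * dbc) / (sa + sb)
        del d[(a, b)]
        clusters[next_id] = (newick, sa + sb, height)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"
```

The resulting Newick string can be opened in any tree viewer, which makes it easy to compare against the tree produced in R.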

Data Matrices

Filtered Raw Data After Impurity

Sample Names