For DNA sequencing data, users should provide variant call format (VCF) files generated by jointly calling each tumor/normal pair.

If there are K tumor/normal pairs, there should be K VCF files, all saved in a single directory that contains only these K files.

The user needs to create two directories, BAF/ and LRR/, with read and write permissions, and save one txt file for each of the 22 chromosomes in each directory, named chr N.txt with N = 1, 2, ..., 22.

The columns for BAF and LRR files are: SNP id (optional), chromosome, position start, position end (optional), tumor_1, normal_1, tumor_2, normal_2, ...
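As a rough illustration of this layout, the sketch below writes one file per chromosome with the columns listed above. It is not part of CHAT: the function `write_signal_files`, the tab delimiter, the header row, the column names and the `chr{n}.txt` file name template are all assumptions made for the example, and the demo values are invented.

```python
import csv
import os
import tempfile


def write_signal_files(out_dir, rows_by_chrom, sample_ids, name_template="chr{n}.txt"):
    """Write one tab-separated file per chromosome (1..22) into out_dir.

    rows_by_chrom maps a chromosome number to a list of
    (snp_id, pos_start, pos_end, values) tuples, where values holds one
    number per entry of sample_ids (tumor_1, normal_1, tumor_2, ...).
    The delimiter, header row and file name template are assumptions.
    """
    os.makedirs(out_dir, exist_ok=True)
    header = ["snp_id", "chromosome", "pos_start", "pos_end"] + list(sample_ids)
    written = []
    for chrom in range(1, 23):
        path = os.path.join(out_dir, name_template.format(n=chrom))
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t")
            writer.writerow(header)
            for snp_id, start, end, values in rows_by_chrom.get(chrom, []):
                if len(values) != len(sample_ids):
                    raise ValueError(f"chromosome {chrom}: expected {len(sample_ids)} values")
                writer.writerow([snp_id, chrom, start, end, *values])
        written.append(path)
    return written


# Invented demo data: two tumor/normal pairs, one marker on chromosome 1.
samples = ["tumor_1", "normal_1", "tumor_2", "normal_2"]
demo_rows = {1: [("snp_1", 1000, 1000, [0.52, 0.49, 0.31, 0.50])]}
out_root = tempfile.mkdtemp()
baf_paths = write_signal_files(os.path.join(out_root, "BAF"), demo_rows, samples)
```

The same function can be called a second time with LRR values and an `LRR` output directory, which keeps the sample columns in the same order in both sets of files.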

Preparing LRR and BAF data from sequencing data is currently not part of CHAT; details can be found in [Extract LRR and BAF signals from next generation sequencing data].

Genome-wide SNP arrays usually contain more than 500K markers.

When the sample size is large, loading the complete data into R consumes a large amount of memory. For example, if 100 tumor samples are analyzed, there are 200 samples in total, since each tumor is paired with a normal sample.
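A back-of-the-envelope estimate makes the memory concern concrete. The sketch below assumes each value is stored as an 8-byte double (as R does for numeric vectors) and ignores any per-object overhead; the helper `matrix_bytes` is hypothetical.

```python
def matrix_bytes(n_markers, n_samples, bytes_per_value=8):
    """Approximate size of a dense markers-by-samples numeric matrix."""
    return n_markers * n_samples * bytes_per_value


n_markers = 500_000   # "more than 500K markers"
n_samples = 2 * 100   # 100 tumors, each paired with a normal

one_matrix = matrix_bytes(n_markers, n_samples)
both_signals = 2 * one_matrix  # one matrix for LRR, one for BAF

print(one_matrix / 1e9)    # 0.8 (GB for a single signal)
print(both_signals / 1e9)  # 1.6 (GB for LRR and BAF together)
```

Under these assumptions the two signal matrices alone take about 1.6 GB, which is one reason to keep the data split into per-chromosome files.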

Note that the identifiers used for tumor_i and normal_i in the actual BAF files must be exactly the same as those used in the LRR files.
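A quick consistency check along these lines can catch mismatched identifiers before running the analysis. This is only a sketch: it assumes each file starts with a tab-separated header line, that the four leading columns (SNP id, chromosome, position start, position end) are all present, and that the sample columns should also appear in the same order. The function names are hypothetical.

```python
def sample_columns(header_line, n_leading=4, sep="\t"):
    """Return the sample identifiers that follow the leading annotation columns."""
    return header_line.rstrip("\n").split(sep)[n_leading:]


def check_matching_ids(baf_header, lrr_header, n_leading=4):
    """Raise ValueError unless both headers list identical sample identifiers."""
    baf_ids = sample_columns(baf_header, n_leading)
    lrr_ids = sample_columns(lrr_header, n_leading)
    if baf_ids != lrr_ids:
        only_baf = sorted(set(baf_ids) - set(lrr_ids))
        only_lrr = sorted(set(lrr_ids) - set(baf_ids))
        raise ValueError(
            f"sample identifiers differ (BAF only: {only_baf}, LRR only: {only_lrr})"
        )
    return baf_ids


# Invented header lines for illustration.
baf_header = "snp_id\tchromosome\tpos_start\tpos_end\ttumor_1\tnormal_1\n"
lrr_header = "snp_id\tchromosome\tpos_start\tpos_end\ttumor_1\tnormal_1\n"
matched_ids = check_matching_ids(baf_header, lrr_header)
```

If the optional columns are omitted from the files, `n_leading` would need to be lowered accordingly.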

Allele-specific SNP array data contain two types of signal for each SNP: the intensity of the A allele and the intensity of the B allele. CHAT takes as input the commonly used log R ratio (LRR) and B-allele frequency (BAF), which are transformed from the original intensities:

LRR = log2(intensity of A + B) - 1

BAF = intensity of B / intensity of (A + B)

The pipeline is optimized using SNP array data from both Illumina and Affymetrix platforms.
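The transformation from allele intensities to LRR and BAF can be sketched as follows. The example reads the two formulas as using the sum of the A and B intensities, and applies them literally to a single SNP; it does not include any platform-specific normalization, and the function name `lrr_baf` is hypothetical.

```python
import math


def lrr_baf(intensity_a, intensity_b):
    """Apply LRR = log2(A + B) - 1 and BAF = B / (A + B) to one SNP."""
    total = intensity_a + intensity_b
    if total <= 0:
        raise ValueError("total intensity must be positive")
    lrr = math.log2(total) - 1
    baf = intensity_b / total
    return lrr, baf


print(lrr_baf(1.0, 1.0))  # (0.0, 0.5): balanced heterozygous signal
print(lrr_baf(3.0, 1.0))  # (1.0, 0.25): higher total intensity, B allele under-represented
```

With these formulas, equal A and B intensities that sum to 2 give LRR = 0 and BAF = 0.5.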