gnomix_frq.py: a conversion script that aggregates information from Gnomix output files (.msp) and VCF files (.vcf) and creates frequency files (.freq) for population groups masked by a specific ancestry (mask_group).
Before running, set the variable "data_dir" in gnomix_frq.py to your own data directory, which must contain a set of chromosome data directories, each with vcf and msp files.
Also provide a sample information file, e.g., hawaiiPopInfo.csv, in the same directory as gnomix_frq.py. It must give each sample's unique identifier, "famid_id" (created from the family ID and sample ID), and the sample's population group, "population".
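As a rough illustration of the sample information file, the sketch below reads a CSV with the two columns named above and groups sample identifiers by population. The rows, the underscore-joined identifier style, and the helper name samples_by_population are assumptions made for the example; they are not taken from gnomix_frq.py.

```python
import csv
import io
from collections import defaultdict

# Hypothetical contents of a sample information file. Only the column
# names "famid_id" and "population" come from the description above;
# the identifiers and their underscore-joined form are made up.
SAMPLE_INFO = """famid_id,population
FAM1_S1,Samoa
FAM1_S2,Samoa
FAM2_S1,Tonga
"""


def samples_by_population(handle):
    """Group sample identifiers by their population label."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["population"].strip()].append(row["famid_id"].strip())
    return dict(groups)


groups = samples_by_population(io.StringIO(SAMPLE_INFO))
for population, samples in groups.items():
    print(population, samples)
```

To try this on a real file, pass an open file handle (for example open("hawaiiPopInfo.csv", newline="")) instead of the in-memory string.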
usage: gnomix_frq.py [-h] [-m MASK_GROUP]
optional arguments:
  -h, --help            show this help message and exit
  -m MASK_GROUP, --mask_group MASK_GROUP
                        mask subpopulation group for alleles
An example of the execution code:
python3 gnomix_frq.py -m Polynesian
main.py: the main execution file, which computes genetic statistics.
Currently, ancestry-specific versions of "F2", "F3", "F4", "F_ST", "Pi", "Psi", and "Heterozygosity" (passed as "H" on the command line) are supported.
usage: main.py [-h] -f FILE [FILE ...] -t DATA_DIR [-b BLOCKSIZE] [-n NUM_REPLICATES]
[-r GROUP [GROUP ...]] [-D DAF] [-d DOWNSAMPLE_SIZE] [--rm_DA_files RM_DA_FILES]
{F2,F3,F4,Pi,Psi,F_ST,H}
positional arguments: optional arguments: |
An example of the execution code:
python3 main.py Psi -f pop_list.txt -t data_American -b 50 -n 200 -r Samoa Tonga -D 0.05 -d 2 -m True --rm_DA_files True
(Psi only) If there are multiple reference population groups, the program first generates an aggregated population file whose name joins the first three letters of each reference group, with the first letter capitalized. For the example above, this produces the file SamTon.freq in the frequency data file directory. The program then generates a file specifying the derived allele positions, named in the following format: psi_<aggr_name>_frq_<DAF>.txt.
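The naming scheme can be sketched as follows. This is not the code used by main.py: the helper names are hypothetical, the prefix length of three is inferred from the SamTon example, and passing DAF as a string (to avoid guessing how the program formats the number) is an assumption.

```python
def aggregated_name(reference_groups, prefix_len=3):
    """Join a capitalized prefix of each reference group name.

    prefix_len=3 is inferred from the "SamTon" example, not from the code.
    """
    return "".join(g[:prefix_len].capitalize() for g in reference_groups)


def psi_positions_filename(reference_groups, daf):
    """Build the derived-allele positions file name for a Psi run.

    daf is passed as a string because the exact number formatting used
    by the program is not documented here.
    """
    return f"psi_{aggregated_name(reference_groups)}_frq_{daf}.txt"


print(aggregated_name(["Samoa", "Tonga"]))
print(psi_positions_filename(["Samoa", "Tonga"], "0.05"))
```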
The outputs for a given statistic are written to the directory path/to/dir/<pop_gen_stat>_output. The statistics matrix file, <pop_gen_stat>_mtx.csv, contains a matrix of the sample means of the statistic, where rows and columns correspond to the populations in pop_list1.txt and pop_list2.txt, respectively. The statistics output file, <pop_gen_stat>_stat.txt, contains one row per population pair in the following format:
popA-popB, <pop_gen_stat>_mean, <pop_gen_stat>_SE, num_of_SNPs_used
NOTE: <pop_gen_stat>_mtx.csv is overwritten on each run, whereas <pop_gen_stat>_stat.txt is appended to on each run.
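Because the stat file grows across runs, it can help to keep only the most recent row for each population pair when reading it back. The sketch below assumes each row follows the four-field, comma-separated layout shown above; the names StatRow, parse_stat_line, and latest_per_pair, and the sample values, are hypothetical.

```python
from typing import NamedTuple


class StatRow(NamedTuple):
    pair: str
    mean: float
    se: float
    num_snps: int


def parse_stat_line(line):
    """Parse one 'popA-popB, mean, SE, num_of_SNPs_used' row.

    Splitting from the right keeps the pair label intact even if a
    population name were to contain a comma.
    """
    pair, mean, se, num_snps = [f.strip() for f in line.rsplit(",", 3)]
    return StatRow(pair, float(mean), float(se), int(num_snps))


def latest_per_pair(lines):
    """Keep the last row seen for each pair, since later runs append."""
    latest = {}
    for line in lines:
        if line.strip():
            row = parse_stat_line(line)
            latest[row.pair] = row
    return latest


# Made-up rows standing in for two runs appended to the same file.
example_lines = [
    "Samoa-Tonga, 0.012, 0.003, 15000",
    "",
    "Samoa-Tonga, 0.010, 0.002, 14000",
]
print(latest_per_pair(example_lines))
```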
Cite: Ioannidis, A. G., Blanco-Portillo, J., Sandoval, K., Hagelberg, E., Barberena-Jonas, C., Hill, A. V., ... & Moreno-Estrada, A. (2021). Paths and timings of the peopling of Polynesia inferred from genomic networks. Nature, 597(7877), 522-526.
Footnotes
- All-pairs example (complete graph): [A1, A2, A3] => A1A2, A1A3, A2A3. Cross-match pair example (bipartite graph): [A1, A2], [B1, B2] => A1B1, A1B2, A2B1, A2B2. Heterozygosity supports only the all-pairs format (1 population list file), whereas F4 supports only the cross-match format (3 population list files, i.e., one reference population is specified and fixed). The remaining statistics support both formats.
- The file directory should contain frequency files named <pop>.freq, for example, Samoa.freq. Each is a csv data file (comma-separated format) whose first line gives the column names CHR, SNP, A1, A2, MAF, NCHROBS, MAF_MSK, NCHROBS_MSK:
  - "CHROM_IDX": chromosome ID
  - "SNP": SNP physical position
  - "A1": alternative allele
  - "A2": reference allele
  - "MAF": alternative allele frequency
  - "NCHROBS": total allele observations
  - "MAF_MSK": alternative allele frequency with ancestry-masked SNP data
  - "NCHROBS_MSK": total allele observations with ancestry-masked SNP data
- When DAF=0.5, the Psi computation has no polarization.
- A valid downsampling size must be less than the minimum number of observations among the population groups passed to the statistic.
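The two pairing modes described in the footnotes (all pairs within one list, and cross-matched pairs between two lists) can be sketched with the standard library. The function names are hypothetical and this is only an illustration of the pairing logic, not the code used by main.py.

```python
from itertools import combinations, product


def all_pairs(pops):
    """All unordered pairs within one population list (complete graph)."""
    return [a + b for a, b in combinations(pops, 2)]


def cross_match_pairs(list1, list2):
    """Every pairing of one population from each list (bipartite graph)."""
    return [a + b for a, b in product(list1, list2)]


print(all_pairs(["A1", "A2", "A3"]))
# ['A1A2', 'A1A3', 'A2A3']
print(cross_match_pairs(["A1", "A2"], ["B1", "B2"]))
# ['A1B1', 'A1B2', 'A2B1', 'A2B2']
```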