rareSurvival
Rare variants have been proposed as contributing to the “missing heritability” of complex human traits. Statistical methodology has been developed to investigate the association of complex traits with multiple rare variants within pre-defined “units” from sequence and array-based studies of the exome or genome. However, software for modelling time to event outcomes for rare variant associations has been underdeveloped in comparison with binary and quantitative traits. We introduce a new command line application, rareSurvival, used for the analysis of rare variants with time to event outcomes. The program is compatible with high-performance computing (HPC) clusters for batch processing. rareSurvival implements statistical methodology, which is a combination of widely used survival and gene-based analysis techniques such as the Cox proportional hazards model and the burden test. We introduce a novel piece of software that will be at the forefront of efforts to discover rare variants associated with a variety of complex diseases with survival endpoints.
Download
Download rareSurvival v1.0
Installation instructions
1) Click on the download link.
2) Once downloaded, save the folder to your Linux, Mac OSX or Windows machine.
3) For Linux or Mac OSX operating system users, you must download Mono to run the software. The command for direct installation of Mono is:
sudo apt-get install mono-complete
For more information about installing Mono on Linux or OSX visit the Mono website:
http://www.mono-project.com/download/
Make sure Mono has downloaded all .NET frameworks to a directory on the machine.
For Windows operating system users, run raresurvival.exe through the windows command prompt.
Input Files
The user is required to specify the 3 data files that will be read into the program. This must be a genotype file (.gen or .impute) which contains the SNP probabilities (imputed or non-imputed) and a sample file (.sample) which contains all the covariate, survival time and censoring indicator information for each individual. The third file to specify is the gene list file which can be in one of two different formats. The gene list file is a text file which can contain; (i) the gene name, chromosome, base pair start position and base pair end position, which should be given the file extension .pos or (ii) the gene name, chromosome and the rsid of each variant in a row list separated by a single space, which should be given the extension name .list.
GENOTYPE file:
SNP1 rs1 500 A T 1 0 0 1 0 0 1 0 0
SNP2 rs2 210 A T 0 1 0 1 0 0 1 0 0
SNP3 rs3 5000 C T 1 0 0 0 0 1 0 1 0
SNP4 rs4 4637 A T 1 0 0 1 0 0 1 0 0
SNP5 rs5 8000 C T 0 0 1 1 0 0 1 0 0
(Genotype file can be zipped[*.zip] or gzipped[*.gz]. Please use gzip/gunzip for compression)
Supports Variant call format (VCF) files v4.0, v4.1 & v4.2. Genotypes can be in the form of hard genotype calls (GT), dosages (DS) and/or probabilities (GP).
##fileformat=VCFv4.1
##filedate=2016.12.14
##source=Minimac3, PLINK, VCFtools
##contig=<ID=1>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1">
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 13480 rs1323 A T . . AF=0.07;MAF=0.07;R2=0.00682 GT:DS:GP 0|0:0.000:1.000,0.000,0.000 0|0:0.000:1.000,0.000,0.000
SAMPLE file:
Sample_id Subject_id Missing Gender SurvivalTime CensoringIndicator
0 0 0 D T C -------> this line can be deleted or the program will delete this line automatically.
1 1 0 1 40 1
2 2 0 1 200 1
3 3 0 1 140 0
Sample file can be any text file extension e.g. .txt, .sample, etc..
GENE LIST file:
.pos extension example:
GeneA 1 11873 14409
GeneB 1 11873 14409
GeneC 1 11873 14409
GeneD 1 14362 16765
.list extension example:
GeneA rs16931821 rs11614957 rs12425974
GeneB rs1106984 rs881194 rs7973597 rs7974784
GeneC rs11061652 rs16925540
GeneD rs1920045
Running rareSurvival
rareSurvival can be run on High Performance Parralel Computing Clusters.
Command line:
mono raresurvival.exe -gf= -sf= -mf= -threads= -t= -c= -cov= -i= -chr= -maf= -info= -lstart= -lstop= -method= -rm= -p= -o=
Four methods are currently offered for analysis; (i) a Burden test in a Cox proportional hazards model, (ii) a Burden test in a parametric Weibull regression model, (iii) a Madsen-Browning weighted Burden test in a Cox proportional hazards model, and (iv) a Madsen-Browning weighted Burden test in a parametric Weibull regression model.
Command | Description |
---|---|
-gf= |
This specifies the genotype file. |
-sf= |
This specifies the sample file. |
-mf= |
This specifies the the gene list file. (.pos or .list). |
-threads= | Specifies the number of threads. On a multi-core system, multiple threads can execute tasks in parallel, with each core executing a different thread or multiple threads. e.g. You have a single computing node with 8 cores. You want to execute the program on the command line analysing 10000 SNPs. If you specify -threads=5, the software will automatically assign resources analysing the 10000 SNPs in 5 equal batches across one or more computing cores. Note: Creating more threads than is required can slow down the analysis. Specify "-threads=1" for a sequential process like previous versions. |
-t= |
This specifies the observation time. |
-c= |
This specifies the censoring indicator. |
-cov= |
Specification of list of covariates. Each one separated by a comma (,). e.g. –cov=Treatment,Age,Gender |
-lstart= | This specifies the line in the gene list file at which the start position of analysis will occur. |
-lstop= | This specifies the line in the gene list file at which the end position of analysis will occur. |
-maf= | Specifies the MAF threshold for inclusion of variants in the analysis. |
-info= | Specifies the info-score threshold for inclusion of variants in the analysis. |
-chr= |
User specified chromosome number. Some genotype files do not contain the chromosome number, so this is specified and written to the output file. |
-p= |
Enter “onlygene” if you want only the gene analysis output to be in the output file. |
-m= | Specify choice of of regression model. This is either “cox” for the Cox proportional hazards model or “weibull” for the Weibull regression model. |
-rm= | Specifies the choice of rare variant analysis method. This is "burden" for the BT with unit weighting or "mbweight" for the Madsen-Browning weighted BT. |
-o= | This specifies the name of the file for output to be saved in (.txt). |
-h or –help | Help command. Prints the descriptions in this table to terminal. |
For an example of a shell script (.sh) to distribute the analyses between 10 computer cores within a Linux cluster, using a sun grid engine batch system click the link below:
To submit the shell script use either of the two commands depending on the cluster manager:
qsub -t 1:10 raresurvival.sh
sbatch --array=1-10 raresurvival.sh
This should submit the analysis and spread it over 10 cores. This will produce 10 output files which you should concatenate using the “cat” command in Linux, e.g.
cat file1 file2 file3 > jointfile.txt
Joining the files will also add multiple header lines within the file, so remember to delete the duplicate header text before or after concatenation.
Output file
InputName/Gene - Variable name (can be the gene name or covariate).
Chr - User specified chromosome number.
BPmid - Mean base pair position of the gene.
RareCount - Total number of rare variants in gene.
TotalSample - Total number of variants (common and rare) in gene.
MAFSum - Sum of rare variant MAFs in gene.
AvMAF - Mean of rare variant MAFs in gene.
InfoSum - Sum of rare variant info-scores in gene.
AvMAF - Mean of rare variant info-scores in gene.
CoefValue - Coefficient estimate.
HR - Hazard Ratio of accumulation of minor alleles.
AF - Acceleration factor of accumulation of minor alleles (Weibull only).
SE - Standard error of coefficient value.
LowerCI - Lower 95% confidence interval for HR (Cox model only)
UpperCI - Upper 95% confidence interval for HR (Cox model only)
Waldpv - p-value calculated using a Wald test for each input parameter.
LRTpv - Likelihood ratio test p-value for for each parameter (Cox model only).
ModLRTpv - Model likelihood ratio p-value.
Shape - The shape parameter of the survival distribution (Weibull only).
The IMPUTE info measure, reflects the information within imputed genotypes relative to the information if only the allele frequency were known. The info measure takes the value 1 if all genotypes are completely certain and a value of 0 if the genotype probabilities for each sample are completely uncertain. For a more detailed explanation and a full derivation of the info measure visit the SNPTEST website below:
https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html
Note: When analysing VCF files, the output file produced will have the output under the column headers “Pos” and “rsid” reversed.
Contact
Hamzah Syed
Block F, Waterhouse Building
Department of Biostatistics
University of Liverpool
Email: hsyed@liverpool.ac.uk
Andrew P Morris
Email: A.P.Morris@liverpool.ac.uk
Andrea L Jorgensen
Email: aljorgen@liverpool.ac.uk
FAQ
Please submit your question via email to hsyed@liverpool.ac.uk and we will respond as soon as possible.
All questions and responses will be published on this page below.