SurvivalGWAS_SV (v1.3.2)
Methodology for the analysis of genome-wide association studies (GWAS) has focused primarily on binary (case-control) phenotypes and quantitative traits. Software implementing these methods, such as PLINK and SNPTEST, can efficiently handle the scale and complexity of genetic data from GWAS, allowing for imputed genotypes at millions of single nucleotide polymorphisms (SNPs).
However, the key outcome of interest in many genetic studies, particularly those in pharmacogenetics, is often “time to event” data (such as death, disease remission or occurrence of an adverse drug reaction), which cannot adequately be modelled by existing software. To address this challenge, we have developed the SurvivalGWAS_SV software implemented using C# and run on Linux operating systems. SurvivalGWAS_SV is able to handle large scale genome-wide imputed data, modelling time to event outcomes under an additive dosage model in the number of minor alleles at a SNP. Analysis can be performed using either Cox proportional hazards or Weibull regression models, the latter capable of analysing accelerated failure time and non-proportional data. The software also allows for multiple covariates and incorporation of SNP-covariate interaction effects.
Download
Download SurvivalGWAS_SV v1.3.2 (For previous versions please email hsyed@liverpool.ac.uk)
Installation instructions
1) Click on the download link.
2) Once downloaded, save the folder to your Linux, Mac OSX or Windows machine.
3) Linux and Mac OSX users must install Mono to run the software. On Debian- or Ubuntu-based Linux distributions, Mono can be installed directly with:
sudo apt-get install mono-complete
For more information about installing Mono on Linux or OSX visit the Mono website:
http://www.mono-project.com/download/
Make sure Mono has downloaded all .NET frameworks to a directory on the machine.
For Windows operating system users, run survivalgwas-sv.exe through the Windows Command Prompt.
Link to publication
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1683-z
Input Files
The user is required to specify the two data files that will be read into the program: a genotype file (.gen or .impute), which contains the SNP genotype probabilities (imputed or non-imputed), and a sample file (.sample), which contains the covariate, survival time and censoring indicator information for each individual.
GENOTYPE file:
SNP1 rs1 500 A T 1 0 0 1 0 0 1 0 0
SNP2 rs2 210 A T 0 1 0 1 0 0 1 0 0
SNP3 rs3 5000 C T 1 0 0 0 0 1 0 1 0
SNP4 rs4 4637 A T 1 0 0 1 0 0 1 0 0
SNP5 rs5 8000 C T 0 0 1 1 0 0 1 0 0
(The genotype file can be zipped [*.zip] or gzipped [*.gz]. Please use gzip/gunzip for compression.)
Variant Call Format (VCF) files v4.0, v4.1 and v4.2 are also supported. Genotypes can be in the form of hard genotype calls (GT), dosages (DS) and/or probabilities (GP).
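As a rough illustration of the additive dosage coding, the sketch below parses one line of the genotype file shown above and converts each triple of genotype probabilities into an expected allele count. It assumes the usual IMPUTE/SNPTEST column order (SNP ID, rsid, position, two alleles, then P(AA), P(AB), P(BB) for each individual) and counts copies of the second allele. Which allele SurvivalGWAS_SV itself counts is not stated here, and the function name is hypothetical.

```python
def gen_line_to_dosages(line):
    """Parse one .gen line into (rsid, position, alleles, per-sample dosages)."""
    fields = line.split()
    snp_id, rsid, pos, allele_a, allele_b = fields[:5]
    probs = [float(x) for x in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("expected three genotype probabilities per sample")
    dosages = []
    for i in range(0, len(probs), 3):
        p_aa, p_ab, p_bb = probs[i:i + 3]
        # Expected copies of the second allele (assumption: allele B is counted).
        dosages.append(p_ab + 2.0 * p_bb)
    return rsid, int(pos), (allele_a, allele_b), dosages


# SNP3 from the example genotype file above.
print(gen_line_to_dosages("SNP3 rs3 5000 C T 1 0 0 0 0 1 0 1 0"))
```

With hard calls, as in the example file, the dosages are simply 0, 1 or 2; with imputed data they take fractional values.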
##fileformat=VCFv4.1
##filedate=2016.12.14
##source=Minimac3, PLINK, VCFtools
##contig=<ID=1>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1">
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 13480 rs1323 A T . . AF=0.07;MAF=0.07;R2=0.00682 GT:DS:GP 0|0:0.000:1.000,0.000,0.000 0|0:0.000:1.000,0.000,0.000
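To make the layout of the per-sample VCF fields concrete, the following sketch splits one sample entry according to the FORMAT column (GT:DS:GP in the record above). It is only an illustration of the file format, not the parser used by SurvivalGWAS_SV, and the function name is hypothetical.

```python
def parse_vcf_sample(format_field, sample_field):
    """Split one VCF sample entry into its GT, DS and GP components."""
    record = dict(zip(format_field.split(":"), sample_field.split(":")))
    parsed = {}
    if "GT" in record:
        # Hard calls may be separated by "/" (unphased) or "|" (phased);
        # a missing allele is written as "." in VCF.
        alleles = record["GT"].replace("|", "/").split("/")
        parsed["GT"] = [None if a == "." else int(a) for a in alleles]
    if "DS" in record:
        parsed["DS"] = float(record["DS"])
    if "GP" in record:
        parsed["GP"] = [float(p) for p in record["GP"].split(",")]
    return parsed


print(parse_vcf_sample("GT:DS:GP", "0|0:0.000:1.000,0.000,0.000"))
```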
SAMPLE file:
Sample_id Subject_id Missing Gender SurvivalTime CensoringIndicator
0 0 0 D T C -------> this line can be deleted or the program will delete this line automatically.
1 1 0 1 40 1
2 2 0 1 200 1
3 3 0 1 140 0
The sample file can have any text file extension, e.g. .txt or .sample.
Running SurvivalGWAS_SV
SurvivalGWAS_SV can be run on high performance parallel computing clusters.
Command line:
mono survivalgwas-sv.exe -gf= -sf= -threads= -t= -c= -cov= -i= -chr= -lstart= -lstop= -method= -p= -o=
Command | Description |
---|---|
-gf= or -gen_file= | This specifies the genotype file. |
-sf= or -sample_file= | This specifies the sample file. |
-threads= | Specifies the number of threads. On a multi-core system, multiple threads can execute tasks in parallel, with each core executing one or more threads. For example, suppose you have a single computing node with 8 cores and want to analyse 10000 SNPs. If you specify -threads=5, the software will automatically split the 10000 SNPs into 5 equal batches and analyse them across one or more computing cores. Note: Creating more threads than required can slow down the analysis. Specify "-threads=1" for a sequential process, as in previous versions. |
-t= or -time= | This specifies the observation time. |
-c= or -censor= | This specifies the censoring indicator. |
-cov= or -covariates= | Specifies the list of covariates, separated by commas (,), e.g. -cov=Treatment,Age,Gender |
-i= or -int= | This specifies the interaction between the SNP and one covariate, separated by a comma (,), e.g. -i=SNP,Treatment |
-lstart= or -linestart= | This specifies the line in the genotype file at which the start position of analysis will occur. |
-lstop= or -linestop= | This specifies the line in the genotype file at which the end position of analysis will occur. |
-sp= | This specifies the start position (in base pairs) on the chromosome. <optional> The number of lines in the file must still be specified using the -lstart= and -lstop= commands. |
-ep= | This specifies the stop position (in base pairs) on the chromosome. <optional> |
-chr= or -chromosome= | User specified chromosome number. Some genotype files do not contain the chromosome number, so this is specified and written to the output file. |
-p= or -print= | Enter “onlysnp” if you want only the SNP analysis output to be in the output file and “onlyint” if you want only the interaction analysis output to be in the output file. |
-m= or -method= | Specify choice of method for analysis. This is either “cox” for the Cox proportional hazards model or “weibull” for the Weibull regression model. |
-o= or -output= | This specifies the name of the file for output to be saved in (.txt). |
-h or -help | Help command. Prints the descriptions in this table to the terminal. |
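
The options in the table can be assembled programmatically, which helps when many jobs are launched from a script. The sketch below builds the argument list for a single run. The file names are hypothetical, the column names are taken from the example sample file above, and it is an assumption that -t= and -c= expect sample-file column names.

```python
import shlex


def build_command(options, exe="survivalgwas-sv.exe"):
    """Assemble a 'mono survivalgwas-sv.exe -flag=value ...' argument list."""
    args = ["mono", exe]
    for flag, value in options.items():
        if isinstance(value, (list, tuple)):
            # Covariates and interaction terms are comma-separated.
            value = ",".join(str(v) for v in value)
        args.append(f"-{flag}={value}")
    return args


options = {
    "gf": "chr1.gen.gz",        # hypothetical genotype file
    "sf": "cohort.sample",      # hypothetical sample file
    "threads": 4,
    "t": "SurvivalTime",        # assumed to be a sample-file column name
    "c": "CensoringIndicator",  # assumed to be a sample-file column name
    "cov": ["Gender"],
    "chr": 1,
    "lstart": 0,
    "lstop": 10000,
    "m": "cox",
    "o": "chr1_results.txt",    # hypothetical output file
}
print(" ".join(shlex.quote(a) for a in build_command(options)))
```

The resulting list can be passed to `subprocess.run` directly, which avoids shell quoting problems with file names.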
For an example of a shell script (.sh) that distributes the analyses between 10 computer cores within a Linux cluster using a Sun Grid Engine batch system, click the link below:
To submit the shell script use the command:
qsub -t 1-10 batchexample.sh
This should submit the analysis and spread it over 10 cores. This will produce 10 output files which you should concatenate using the “cat” command in Linux, e.g.
cat file1 file2 file3 > jointfile.txt
Joining the files will also add multiple header lines within the file, so remember to delete the duplicate header text before or after concatenation.
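As an alternative to removing the duplicate headers by hand, the concatenation can be done with a small script that keeps only the first copy of the header. This is a minimal sketch, assuming every output file begins with the same header line; the function names and file names are hypothetical.

```python
def merge_lines(streams):
    """Yield lines from several outputs, keeping only the first header copy."""
    header = None
    for stream in streams:
        for line in stream:
            text = line.rstrip("\n")
            if header is None:
                header = text          # first line seen is taken as the header
                yield text + "\n"
            elif text != header:       # skip repeated header lines
                yield text + "\n"


def merge_files(paths, merged_path):
    """Concatenate the output files in 'paths' into 'merged_path'."""
    handles = [open(p) for p in paths]
    try:
        with open(merged_path, "w") as out:
            out.writelines(merge_lines(handles))
    finally:
        for handle in handles:
            handle.close()


# Example call (hypothetical file names):
# merge_files(["file1", "file2", "file3"], "jointfile.txt")
```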
Output file
InputName - Variable name (can be the SNP ID, covariate or interaction name).
rsid - The unique SNP identifier for each SNP analysed.
Chr - User specified chromosome number.
Pos - Base pair position of the SNP.
EA - Effect allele.
NonEA - Non effect allele.
CoefValue - Effect size estimate.
HR - Hazard Ratio.
AF - Acceleration factor (Weibull only).
SE - Standard error of effect.
LowerCI - Lower 95% confidence interval (Cox model only)
UpperCI - Upper 95% confidence interval (Cox model only)
Waldpv - p-value calculated using a Wald test for each input parameter.
LRTpv - Likelihood ratio test p-value for each parameter (Cox model only).
ModLRTpv - Model likelihood ratio p-value.
Jointassoc - LRT p-value comparing a model with SNP, covariate and interaction terms with null model including only covariate terms (only if interaction is defined).
zscorestat - Test statistic for p-value calculation (Weibull only).
p-value - Calculated using z statistic (Weibull only).
EAF - Effect/population allele frequency.
MAF - Minor allele frequency
Shape - The shape parameter of the survival distribution (Weibull only).
Infoscore - The IMPUTE info measure, which reflects the information within imputed genotypes relative to the information if only the allele frequency were known. The info measure takes the value 1 if all genotypes are completely certain and a value of 0 if the genotype probabilities for each sample are completely uncertain. For a more detailed explanation and a full derivation of the info measure visit the SNPTEST website below:
https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html
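For orientation, the info measure as commonly defined for IMPUTE can be computed from the genotype probabilities as sketched below. Here e_i is the expected allele count for individual i, f_i is its expected squared count, and theta is the estimated allele frequency. This is an illustration of the published definition, not necessarily the exact calculation performed inside SurvivalGWAS_SV, and the function name is hypothetical.

```python
def impute_info(probabilities):
    """Info measure from a list of (P(AA), P(AB), P(BB)) triples, one per sample."""
    n = len(probabilities)
    expected = [p_ab + 2.0 * p_bb for _, p_ab, p_bb in probabilities]       # e_i
    second_moment = [p_ab + 4.0 * p_bb for _, p_ab, p_bb in probabilities]  # f_i
    theta = sum(expected) / (2.0 * n)  # estimated frequency of the counted allele
    if theta <= 0.0 or theta >= 1.0:
        return 1.0  # monomorphic SNP: the measure is defined as 1
    # Sum of per-sample genotype variances, scaled by the binomial variance.
    variance_sum = sum(f - e * e for e, f in zip(expected, second_moment))
    return 1.0 - variance_sum / (2.0 * n * theta * (1.0 - theta))


certain = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
uncertain = [(0.25, 0.5, 0.25)] * 4   # probabilities carry no more than the allele frequency
print(impute_info(certain), impute_info(uncertain))
```

The two example inputs reproduce the limiting cases described above: fully certain genotypes give 1, and probabilities that reflect only the allele frequency give 0.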
Note: When analysing VCF files, the output file produced will have the output under the column headers “Pos” and “rsid” reversed.
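The relationship between the CoefValue, HR, SE and Waldpv columns for the Cox model can be sketched with the standard Wald construction: the hazard ratio is the exponential of the coefficient, and the p-value comes from the ratio of the coefficient to its standard error. Whether LowerCI and UpperCI are reported on the coefficient scale or the hazard ratio scale is not stated here, so the interval below (on the hazard ratio scale) is an assumption, as is the function name.

```python
import math


def wald_summary(coef, se, z_crit=1.959964):
    """Hazard ratio, 95% interval and two-sided Wald p-value for one coefficient."""
    z = coef / se
    return {
        "HR": math.exp(coef),
        "LowerCI": math.exp(coef - z_crit * se),  # assumed hazard-ratio scale
        "UpperCI": math.exp(coef + z_crit * se),
        # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
        "Waldpv": math.erfc(abs(z) / math.sqrt(2.0)),
    }


print(wald_summary(coef=0.40, se=0.10))
```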
New features & Bug fixes
Version 1.3.2 changes (09/03/2018):
- Sample data included in download folder.
- Improved efficiency by eliminating model LRT calculation.
- Joint association test for interaction effects added. Details in output table above.
Version 1.3.1 changes (25/07/2017):
- Multithreading command "-threads=" did not distribute the number of lines in the genotype file accurately between threads if "-lstart=" was not equal to 0. This has been fixed in the updated version.
Version 1.3 changes (16/06/2017):
- Increased efficiency through multithreading. Command "-threads=" specifies the number of threads. For example, specifying 5 threads could potentially run the analysis 5x faster than previous versions that use a sequential process, but this is dependent on the user's computer/cluster. All threads will write to the same output file.
- VCF file and Sample file can have samples/variables separated by a space or tab.
- VCFtools and PLINK VCF v4.1 & 4.2 supported. Hard genotype calls in VCF file represented as "GT" can be separated with "/" and/or "|". For example: 0/0, 0/1, 1/1 or 0|0, 0|1, 1|1.
- An infoscore of -1 in the output file means that imputation quality could not be determined, because the data are not known to have been imputed.
Version 1.2.2 changes (14/04/2017):
- Reads in VCF (Variant Call Format) files.
- Command for calling software has been made lowercase (survivalgwas-sv.exe).
- Output file headings have changed and have new additions. See updated output table above.
- Analysis by chromosome position added; it is used together with the commands for number of lines.
- Hazard ratio and acceleration factor included in Weibull regression analysis output.
- Likelihood ratio test p-value for each parameter added to Cox proportional hazards analysis output.
Version 1.2.1 changes:
- Analysis using chromosome position was removed. Analysis using this specification was substantially slower than analysis run by specifying the number of lines in genotype file.
- Minor allele frequency ("MAF") calculation was displayed incorrectly in output file. This has been fixed.
- Weibull regression model currently can only fit up to 10 additional covariates, not including SNP.
- Reading of files with the .gz extension fixed.
Contact
Hamzah Syed
Block F, Waterhouse Building
Department of Biostatistics
University of Liverpool
Email: hsyed@liverpool.ac.uk
Andrew P Morris
Email: A.P.Morris@liverpool.ac.uk
Andrea L Jorgensen
Email: aljorgen@liverpool.ac.uk
FAQ
Please submit your question via email to hsyed@liverpool.ac.uk and we will respond as soon as possible.
All questions and responses will be published on this page below.
Q: I am receiving an error: "Unhandled Exception: System.IO.FileNotFoundException: Could not load file or assembly 'System.Data, Version=4.0.0.0, Culture=neutral, PublicKeyToken....".
A: If you installed Mono on Linux using the command: sudo apt-get install mono-devel
You then need to install the complete package version using the command: sudo apt-get install mono-complete
The mono-complete install should correct most cases of "assembly not found" errors.
Q: I am receiving the following error "Unhandled Exception: System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection."
A: This error means that the number of samples in your genotype file does not match the number of rows in your sample file. In versions prior to v1.3, this error would occur because the sample file was tab separated or the VCF file was space separated. The new version can read both tab and space separated files. A blank row at the end of the sample file can also cause this error and should be deleted.
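A quick consistency check before running the analysis can catch this mismatch. The sketch below counts the individuals implied by one .gen line and the data rows in a sample file. It assumes five leading columns in the genotype file and that an optional variable-type row in the sample file starts with "0 0 0", as in the example above; the function names are hypothetical.

```python
def count_gen_samples(gen_line):
    """Number of individuals implied by one .gen line (three values each)."""
    n_values = len(gen_line.split()) - 5  # five leading SNP description columns
    if n_values < 0 or n_values % 3 != 0:
        raise ValueError("genotype line does not hold probability triples")
    return n_values // 3


def count_sample_rows(sample_lines):
    """Number of individuals in a sample file, ignoring blank lines."""
    rows = [line.split() for line in sample_lines if line.strip()]
    rows = rows[1:]  # drop the header row
    if rows and rows[0][:3] == ["0", "0", "0"]:
        rows = rows[1:]  # drop the optional variable-type row (assumed layout)
    return len(rows)


sample_lines = [
    "Sample_id Subject_id Missing Gender SurvivalTime CensoringIndicator",
    "0 0 0 D T C",
    "1 1 0 1 40 1",
    "2 2 0 1 200 1",
    "3 3 0 1 140 0",
    "",  # a trailing blank line is ignored by this check
]
gen_line = "SNP1 rs1 500 A T 1 0 0 1 0 0 1 0 0"
print(count_gen_samples(gen_line) == count_sample_rows(sample_lines))
```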
Q: If the data for each chromosome are saved in different VCF files, but within the same folder, can the software run all chromosomes and output the results into one file using a single command line?
A: Currently, the software only takes one input file for each chromosome at a time. Analysing multiple files would need to be done using multiple executions of commands or using a shell script.
Q: How should missing values be represented in the sample file?
A: Missing values in the sample file should be coded as "NA". For example:
id1 id2 cov time status
231 a22 0 NA NA
221 b33 1 44 1
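To see which individuals carry missing values before running an analysis, the sample file can be scanned for "NA" entries. This is a minimal sketch using the example above; it assumes whitespace-separated columns with a single header row, and the function name is hypothetical.

```python
def rows_with_missing(sample_lines, columns, missing="NA"):
    """Return the first-column IDs of rows with a missing value in 'columns'."""
    rows = [line.split() for line in sample_lines if line.strip()]
    header, data = rows[0], rows[1:]
    indices = [header.index(name) for name in columns]
    return [row[0] for row in data if any(row[i] == missing for i in indices)]


example = [
    "id1 id2 cov time status",
    "231 a22 0 NA NA",
    "221 b33 1 44 1",
]
print(rows_with_missing(example, ["time", "status"]))  # ['231']
```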
Q: When specifying the commands "-lstart=" and "-lstop=" for a VCF file, do I need to specify "-lstart=" at the line the samples start?
A: Not necessarily. “-lstart=” is the line you want to start the analysis from; it can be 0 or any other number. If you input 0 for “-lstart=”, the software will automatically skip over the lines of text at the beginning of a VCF file.
Q: If I want to analyse all SNPs in the file do I still need to specify "-lstart=" and "-lstop="?
A: Yes. The reason for “-lstart=” and “-lstop=” is so the program knows at which line to separate the input genotype file, allowing small batches to be analysed using a shell script or through multithreading.
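When the batches are generated by a script, the line ranges for each job can be computed rather than typed by hand. The sketch below splits a file into contiguous ranges of roughly equal size. Whether the software treats "-lstop=" as inclusive is not stated here, so the ranges are written so that each stop equals the next start; this cannot skip a line, but the boundaries should be checked on a small test run. The function name is hypothetical.

```python
def batch_ranges(total_lines, n_batches):
    """Split lines 0..total_lines into contiguous (lstart, lstop) pairs."""
    size = -(-total_lines // n_batches)  # ceiling division
    ranges = []
    for batch in range(n_batches):
        start = batch * size
        if start >= total_lines:
            break  # fewer lines than batches
        ranges.append((start, min(start + size, total_lines)))
    return ranges


for lstart, lstop in batch_ranges(10000, 10):
    print(f"-lstart={lstart} -lstop={lstop}")
```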
Q: I am trying to submit a batch shell script, however I am getting the error "expr:syntax error"?
A: To submit shell scripts successfully, use either of the following commands:
qsub batchexample.sh
sh batchexample.sh
Q: VCF file created using PLINK cannot be analysed. Please advise?
A: PLINK VCF files can contain genotypes separated by both "/" and "|". Versions of SurvivalGWAS_SV prior to v1.3 could only read genotypes separated by "|". The new version can read both separators.
Q: When using the "-lstop=" command, if I don't want to lose a single SNP from the analysis, should I set it to a large number which is greater than the max number of SNPs in the input file? Will it affect the assignment of resources (e.g. cores & memory) amongst the jobs?
A: Yes, you can set “-lstop=” to a number greater than the number of SNPs in your file, as the software knows when there are no more lines to read in the file. It is better to specify more lines than there are, ideally a number divisible by the number of threads or cores you specify, because the input file will then be distributed evenly. Memory should not be an issue with the software as it efficiently garbage collects tables and unused variables, freeing up memory.
Q: Can SurvivalGWAS_SV be used for the analysis of left-truncated survival data?
A: Left truncated data is currently not supported by the software. However, we are constantly updating the software and suggestions from users are appreciated to help us improve the software.
Q: I have zipped my file using bzip, but the software is not reading the entire file. What should I do?
A: Currently, the software does not support bgzip (Blocked GNU Zip Format) or bzip files. Please use gzip/gunzip to compress or decompress your VCF files.