Oncolonnator

Daniel Chen

May 17, 2020 Data Engineering


Introduction

Oncolonnator is a variant annotation tool that annotates VCF files using the Broad Institute ExAC database and returns variant information for every SNP in a given file. It takes a VCF file as input and writes a CSV file containing basic VCF information together with metadata from ExAC, so that you can learn more about the variants in your dataset. Combining the calls in your VCF file with the ExAC annotations can help you understand how the variants in a given gene may contribute to long-term health.

How to run

Basic Parameters

-h: Shows the help page, which describes the script and its parameters
--input: Path to the VCF file to parse
--output: Name and path of the CSV file to output
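
For illustration, the parameters above could be declared with Python's standard argparse module roughly as follows. This is a minimal sketch and not necessarily how oncolonnator.py defines them: the function name build_parser and the help strings are assumptions, and only the --input and --output flags come from the list above (argparse adds -h automatically).

```python
import argparse


def build_parser():
    # Hypothetical parser that mirrors only the documented flags.
    parser = argparse.ArgumentParser(
        description="Annotate the variants in a VCF file with ExAC metadata."
    )
    parser.add_argument("--input", required=True,
                        help="Path to the VCF file to parse")
    parser.add_argument("--output", required=True,
                        help="Name and path of the CSV file to output")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch
# can be run anywhere; the file names are placeholders.
example_args = build_parser().parse_args(
    ["--input", "sample.vcf", "--output", "annotated.csv"]
)
print(example_args.input, example_args.output)
```

Marking both flags as required makes the script fail early with a usage message when either path is missing.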

Input

The input is the standard VCF format used in genomic studies. More information about what this looks like can be found here:

Variant Call Format
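
To make the input format concrete, here is a minimal sketch of reading VCF data lines with only the Python standard library. It is not the parser Oncolonnator uses (the references list PyVCF); the function names and the sample record are made up for illustration, and the AO key in the sample is just an example of a caller-specific INFO field.

```python
def parse_info(info_field):
    """Split an INFO column such as 'DP=100;AO=25;DB' into a dict."""
    info = {}
    if info_field == ".":
        return info
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        # Keys without '=' are flags; store them as True.
        info[key] = value if sep else True
    return info


def parse_vcf_line(line):
    """Parse one tab-separated VCF data line into a small dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    info = parse_info(fields[7]) if len(fields) > 7 else {}
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alts": alt.split(","),  # ALT may list several alleles
        "info": info,
    }


def read_vcf(lines):
    """Yield parsed records, skipping header ('#') and blank lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)


# Made-up sample content, for illustration only.
sample_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t12345\t.\tG\tT,C\t50\tPASS\tDP=100;AO=25;DB\n",
]
records = list(read_vcf(sample_lines))
print(records[0])
```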

Output

This will output a csv file with the following columns:

CHROMOSOME: The chromosome that this SNP is located on. This can be a number or a string (for example, X).
POSITION: The position of this SNP in the genome, in base pairs (bp).
REF_ALLELE: The reference allele.
ALT_ALLELE: The alternate allele.
TOTAL_DEPTH: The total read depth observed at the SNP position.
ALT_DEPTH: The total number of reads supporting the alternate allele.
ALT_PERCENTAGE: The alternate allele depth divided by the total depth, expressed as a percentage.
ALLELE_FREQUENCY: The allele frequency given by ExAC, based on their own calculations.
WORST_CONSEQUENCE: The worst consequence reported for a specific alternate allele at a given position. Deleterious consequences are preferred, followed by nonsynonymous and then synonymous consequences.
GENES: The list of genes that this SNP is located in.
TRANSCRIPTS: The list of potential transcripts that this SNP may be a part of.
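
The derived columns can be sketched as follows. This is an illustration under stated assumptions rather than Oncolonnator's actual code: the helper names, the severity ranking (a short, incomplete list of Sequence Ontology consequence terms), the zero-depth behaviour, the ";" separator for list columns, and every value in the example row are made up.

```python
import csv
import io

COLUMNS = [
    "CHROMOSOME", "POSITION", "REF_ALLELE", "ALT_ALLELE", "TOTAL_DEPTH",
    "ALT_DEPTH", "ALT_PERCENTAGE", "ALLELE_FREQUENCY", "WORST_CONSEQUENCE",
    "GENES", "TRANSCRIPTS",
]

# Assumed ranking, most severe first: deleterious, then nonsynonymous,
# then synonymous. The real tool may rank more terms or rank differently.
SEVERITY_ORDER = [
    "stop_gained",
    "frameshift_variant",
    "missense_variant",
    "synonymous_variant",
]


def alt_percentage(alt_depth, total_depth):
    """Alternate allele depth as a percentage of total depth."""
    if total_depth == 0:
        return 0.0  # assumption: report 0 instead of dividing by zero
    return 100.0 * alt_depth / total_depth


def worst_consequence(consequences):
    """Return the most severe term; unranked terms sort last."""
    if not consequences:
        return None
    rank = {term: i for i, term in enumerate(SEVERITY_ORDER)}
    return min(consequences, key=lambda term: rank.get(term, len(rank)))


def build_row(chrom, pos, ref, alt, total_depth, alt_depth,
              allele_frequency, consequences, genes, transcripts):
    """Assemble one output row keyed by the documented column names."""
    return {
        "CHROMOSOME": chrom,
        "POSITION": pos,
        "REF_ALLELE": ref,
        "ALT_ALLELE": alt,
        "TOTAL_DEPTH": total_depth,
        "ALT_DEPTH": alt_depth,
        "ALT_PERCENTAGE": round(alt_percentage(alt_depth, total_depth), 2),
        "ALLELE_FREQUENCY": allele_frequency,
        "WORST_CONSEQUENCE": worst_consequence(consequences),
        "GENES": ";".join(genes),
        "TRANSCRIPTS": ";".join(transcripts),
    }


# Placeholder values only; none of these come from real data.
row = build_row("1", 12345, "G", "T", 200, 50, 0.01,
                ["synonymous_variant", "missense_variant"],
                ["GENE_A"], ["TRANSCRIPT_A", "TRANSCRIPT_B"])

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```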

Pipenv / Python

Pipenv builds a virtual environment based on the Pipfile and Pipfile.lock in the directory. You can run the script using pipenv, or manually using your own virtual environment or Python distribution. An example of running it via pipenv is shown below:

pipenv run python oncolonnator.py --input <INPUT_VCF> --output <OUTPUT_CSV>

INPUT_VCF: Path to the VCF file to parse
OUTPUT_CSV: Path and file name for the output csv

Pytest Test Suite

pipenv run pytest

This will run through the unit tests to ensure that the functions behave as expected.
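
As an illustration of the kind of unit test pytest collects, here is a small self-contained sketch. The helper variant_key and its tests are hypothetical and are not taken from the project's test suite; they only show the shape of a test file that pytest would discover.

```python
def variant_key(chrom, pos, ref, alt):
    """Hypothetical helper: build a 'CHROM-POS-REF-ALT' lookup key."""
    chrom = str(chrom)
    # Drop an optional 'chr' prefix so 'chr1' and '1' give the same key.
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return f"{chrom}-{pos}-{ref}-{alt}"


def test_variant_key_plain_chromosome():
    assert variant_key("1", 12345, "G", "T") == "1-12345-G-T"


def test_variant_key_strips_chr_prefix():
    assert variant_key("chrX", 500, "A", "C") == "X-500-A-C"
```

Saved in a file whose name starts with test_, functions like these are picked up automatically by pipenv run pytest.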

Docker

Basic Run

docker run --rm -v <INPUT_DIRECTORY>:/input -v <OUTPUT_DIRECTORY>:/output dchen71/oncolonnator:latest --input /input/<EXAMPLE_VCF_FILE> --output /output/<EXAMPLE_CSV_OUTPUT>

INPUT_DIRECTORY: Absolute path of the directory containing the data you want to parse. This is mapped to the /input folder in the docker container.
OUTPUT_DIRECTORY: Absolute path of the directory where you want to write the output. This is mapped to the /output folder in the docker container.
EXAMPLE_VCF_FILE: The name of the VCF file you want to parse.
EXAMPLE_CSV_OUTPUT: The name of the CSV file you want to output.

References

Variant Call Format
ExAC Rest API
Docker
Pipenv
PyVCF
