Introduction

This tutorial will show you how to execute liftover on VCF files using Picard. This is useful when you have a VCF file that is in one genome build and you want to convert it to another genome build. For example, you may have a VCF file in hg19 (GRCh37, the older build) and you want to convert it to hg38 (GRCh38, which is the newer build); this is what the tutorial will cover, specifically on Chromosome 1 only. Once we get to the end, I will show how would extend this to all chromosomes.

Note there are other file genotype file formats such as PLINK or UCSC BED file, but this tutorial will focus on VCF files. If you have UCSC BED file, you can just use the webtool LiftOver to liftover your file.

Download all the things

First, we need to download the necessary files. We will need the following:

Picard, the software we will use to liftover our VCF file.
- this would give you a .jar file, which is a compressed Java executable file which will have the LiftoverVcf command we will use.
Chain file, hg19ToHg38.over.chain.gz, prerequisite file for liftover.
Fasta Sequence file, Reference Sequence Files, this is our reference sequence file (another prerequisite).
- for this, there is multiple files per chromosomes, we will need all of them. You can download them all by running the following command on your terminal:

#!/bin/bash
# example only for chromosome 1
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_GL383518v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_GL383519v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_GL383520v2_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270706v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270707v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270708v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270709v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270710v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270711v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270712v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270713v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270714v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270759v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270760v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270761v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270762v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270763v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270764v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270765v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270766v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270892v1_alt.fa.gz

Note: Both the chain and sequence files are from https://hgdownload.soe.ucsc.edu/downloads.html, there’s many sections in it. For this tutorial, we are focusing on the ‘Human Section’. The chain file, is from Feb .2009 GRCh37/hg19 - Liftover files. Within the link, there will be multiple ones for different build being lifted to, but here we are lifting to GRCh38/hg38, hence the link I have placed above. The sequence files are from Dec. 2013 GRCh38/hg38- Sequence data by chromosomes, but you can see that each chromosome has multiple files, so we need to download all of them as per the bash code above.

If you’d like to liftover to other builds, or liftover other chromosomes, you can go to the links above and download the respective files.

Prepare the files

Now that we have all the files, we need to prepare them. Here are the different sections we need to do:

Input VCF File

We need to make sure our VCF file’s chromosome ID has a prefix of chr. We can check this using bcftools by running the following command:

bcftools query -f %CHROM input.vcf.gz | head

If it does not have a chr prefix, we can add it by running the following command:

bcftools annotate --rename-chrs chr_prefix.txt input.vcf.gz --output-type z --output input.chr.vcf.gz

where chr_prefix.txt is a text file with the following content:

1 chr1

Where the first column has the current chromosome ID, and the second column has the chromosome ID with the chr prefix. You can add more lines to the file if you have more chromosomes to annotate.

Reference Sequence file

We need to concatenate all the sequence files into one file. We can do this by running the following command:

#!/bin/bash
cat chr1.fa \
    chr1_GL383518v1_alt.fa \
    chr1_GL383519v1_alt.fa \
    chr1_GL383520v2_alt.fa \
    chr1_KI270706v1_random.fa \
    chr1_KI270707v1_random.fa \
    chr1_KI270708v1_random.fa \
    chr1_KI270709v1_random.fa \
    chr1_KI270710v1_random.fa \
    chr1_KI270711v1_random.fa \
    chr1_KI270712v1_random.fa \
    chr1_KI270713v1_random.fa \
    chr1_KI270714v1_random.fa \
    chr1_KI270759v1_alt.fa \
    chr1_KI270760v1_alt.fa \
    chr1_KI270761v1_alt.fa \
    chr1_KI270762v1_alt.fa \
    chr1_KI270763v1_alt.fa \
    chr1_KI270764v1_alt.fa \
    chr1_KI270765v1_alt.fa \
    chr1_KI270766v1_alt.fa \
    chr1_KI270892v1_alt.fa \
    > chr1_reference.fa

Following that we need to generate .dict file for the reference sequence file. We can do this by running the following command:

java -jar picard.jar CreateSequenceDictionary \
    R=chr1_reference.fa \
    O=chr1_reference.dict

Ready to Liftover

Now that we have all the files ready, we can run the liftover command. We can do this by running the following command:

java -jar picard.jar LiftoverVcf \
    WARN_ON_MISSING_CONTIG=true \
    I=input.chr.vcf.gz \
    O=lifted_over_chr1.vcf \
    CHAIN=hg19ToHg38.over.chain.gz \
    REJECT=rejected_variants.vcf \
    R=chr1_reference.fa

WARN_ON_MISSING_CONTIG=true is vital because we are only running liftover on chromosome 1, and we do not want the execution to breakdown because of error for the other chromosomes (which is part of the chain file). If you are running liftover on all chromosomes, you can remove this flag (or keep it for sanity check). As it runs, you can read the warning output, as long as the encountered warnings are for the other chromosomes, you are fine.

MEMORY ISSUE: it is likely you’ll have memory issue given that genotype files these days are massive. You can increase the memory by adding the following flag: -Xmx8g where 8g is 8 gigabytes. You can increase this number if you have more memory available. You can add this flag like so java -Xmx8g -jar picard.jar LiftoverVcf ...

Extending to all chromosomes

Now that you have seen how to do it for chromosome 1, you can do this for other chromosome by downloading all the other sequence file from Reference Sequence Files and concatanate them as above, and make the .dict file as above then run the liftover command as above.

Conclusion

Hopefully this tutorial has helped you to liftover your genotype file. If you have any questions, feel free to contact me.

Yap Chuan Fu

Genome Liftover with Picard

Tutorial on how to execute liftover on VCF files - 3 min read

Share