Research Associate at The University of Manchester. I enjoy solving problems using computers and mathematics.
 
Published Dec 04, 2023
This tutorial will show you how to execute liftover on VCF files using Picard. This is useful when you have a VCF file that is in one genome build and you want to convert it to another genome build. For example, you may have a VCF file in hg19 (GRCh37, the older build) and you want to convert it to hg38 (GRCh38, which is the newer build); this is what the tutorial will cover, specifically on Chromosome 1 only. Once we get to the end, I will show how would extend this to all chromosomes.
Note there are other file genotype file formats such as PLINK or UCSC BED file, but this tutorial will focus on VCF files. If you have UCSC BED file, you can just use the webtool LiftOver to liftover your file.
First, we need to download the necessary files. We will need the following:
LiftoverVcf
command we will use.#!/bin/bash
# example only for chromosome 1
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_GL383518v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_GL383519v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_GL383520v2_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270706v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270707v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270708v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270709v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270710v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270711v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270712v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270713v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270714v1_random.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270759v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270760v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270761v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270762v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270763v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270764v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270765v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270766v1_alt.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1_KI270892v1_alt.fa.gz
Note: Both the chain and sequence files are from https://hgdownload.soe.ucsc.edu/downloads.html, there’s many sections in it. For this tutorial, we are focusing on the ‘Human Section’. The chain file, is from
Feb .2009 GRCh37/hg19 - Liftover files
. Within the link, there will be multiple ones for different build being lifted to, but here we are lifting to GRCh38/hg38, hence the link I have placed above. The sequence files are fromDec. 2013 GRCh38/hg38- Sequence data by chromosomes
, but you can see that each chromosome has multiple files, so we need to download all of them as per the bash code above.If you’d like to liftover to other builds, or liftover other chromosomes, you can go to the links above and download the respective files.
Now that we have all the files, we need to prepare them. Here are the different sections we need to do:
We need to make sure our VCF file’s chromosome ID has a prefix of chr
. We can check this using bcftools by running the following command:
bcftools query -f %CHROM input.vcf.gz | head
If it does not have a chr prefix, we can add it by running the following command:
bcftools annotate --rename-chrs chr_prefix.txt input.vcf.gz --output-type z --output input.chr.vcf.gz
where chr_prefix.txt
is a text file with the following content:
1 chr1
Where the first column has the current chromosome ID, and the second column has the chromosome ID with the chr
prefix. You can add more lines to the file if you have more chromosomes to annotate.
We need to concatenate all the sequence files into one file. We can do this by running the following command:
#!/bin/bash
cat chr1.fa \
chr1_GL383518v1_alt.fa \
chr1_GL383519v1_alt.fa \
chr1_GL383520v2_alt.fa \
chr1_KI270706v1_random.fa \
chr1_KI270707v1_random.fa \
chr1_KI270708v1_random.fa \
chr1_KI270709v1_random.fa \
chr1_KI270710v1_random.fa \
chr1_KI270711v1_random.fa \
chr1_KI270712v1_random.fa \
chr1_KI270713v1_random.fa \
chr1_KI270714v1_random.fa \
chr1_KI270759v1_alt.fa \
chr1_KI270760v1_alt.fa \
chr1_KI270761v1_alt.fa \
chr1_KI270762v1_alt.fa \
chr1_KI270763v1_alt.fa \
chr1_KI270764v1_alt.fa \
chr1_KI270765v1_alt.fa \
chr1_KI270766v1_alt.fa \
chr1_KI270892v1_alt.fa \
> chr1_reference.fa
Following that we need to generate .dict
file for the reference sequence file. We can do this by running the following command:
java -jar picard.jar CreateSequenceDictionary \
R=chr1_reference.fa \
O=chr1_reference.dict
Now that we have all the files ready, we can run the liftover command. We can do this by running the following command:
java -jar picard.jar LiftoverVcf \
WARN_ON_MISSING_CONTIG=true \
I=input.chr.vcf.gz \
O=lifted_over_chr1.vcf \
CHAIN=hg19ToHg38.over.chain.gz \
REJECT=rejected_variants.vcf \
R=chr1_reference.fa
WARN_ON_MISSING_CONTIG=true
is vital because we are only running liftover on chromosome 1, and we do not want the execution to breakdown because of error for the other chromosomes (which is part of the chain file). If you are running liftover on all chromosomes, you can remove this flag (or keep it for sanity check). As it runs, you can read the warning output, as long as the encountered warnings are for the other chromosomes, you are fine.
MEMORY ISSUE: it is likely you’ll have memory issue given that genotype files these days are massive. You can increase the memory by adding the following flag:
-Xmx8g
where 8g is 8 gigabytes. You can increase this number if you have more memory available. You can add this flag like sojava -Xmx8g -jar picard.jar LiftoverVcf ...
Now that you have seen how to do it for chromosome 1, you can do this for other chromosome by downloading all the other sequence file from Reference Sequence Files and concatanate them as above, and make the .dict
file as above then run the liftover command as above.
Hopefully this tutorial has helped you to liftover your genotype file. If you have any questions, feel free to contact me.