Mitochondrial Madness Pt. 2

This week’s post is a continuation of my previous post on assembling mitochondrial genomes from existing data. I referenced doing this last week, but I’m experiencing an episode of tunnel vision locked on k-mer analyses, which I plan to cover in my next blog post…it’s consuming and confusing.

Me, reading my 10th paper on k-mers and still not understanding how to interpret them…

Also, because I pretend I can multitask, I’ll have to release this over the week in three updates (per program basis) because I’ve spent my blogging time starring at my k-mer results for a different project.

In my last post, I needed to improve the mitogenome sequence that I partially captured from mapping reads from 2 RNA-sequencing libraries and a DNA-sequencing library. This more or less gave me a “backbone” for the sequence I am trying to obtain. This also gave me an idea of sequence coverage for specific regions of the mitogenome (reads that mapped from all three sequencing libraries virtually provided coverage at the same regions several times [at high depth]).

I decided to explore de novo assembly methods for mitochondrial genomes. It’s always nice to have someone else pipeline, so you don’t feel like you’re just hacking things together. Unfortunately, that means lots of deadends and trial and error that I typically don’t have time for…but someones gotta do it.

Reasons for pulling my hair out strand by strand when dealing with published programs, pipelines, and workflows:

  1. POOR DOCUMENTATION – if you write software for others, and you expect people to use your software, PLEASE leave well-detailed, debugged, and tested code with documentation on what the code is suppose to do. Instruct, with grave, detail on how a user is supposed to implement and run your scripts, programs, and workflows. This makes it less likely that tons of time is spent on explaining things in github issue forums, and at least people with some coding skills could probably make things work for them without author intervention.
  2. POOR MAINTANENCE – technology is getting older moments after being created, so there are always new ways to improve your code, dependencies, and program algorithms. Many programs are publish, and released to the community and the authors NEVER go back to check for issues presented by users or update their programs. Of course I understand that people’s lives move past projects, but I’ve seen some programs that would be super beneficial to the non-model community if it were maintained!

I mapped contigs from 3 different de novo assembly methods to my “backbone” sequence, in hopes of filling in sequence ambiguities in overlapping regions in Geneious using the Geneious mapping algorithm. I also independently used de novo assembled contigs on a “by program” basis to map to the same reference genome I mapped reads from my species of interest to create the “backbone”.

GetOrganelle

After failing get results from that iterative script i spoke about last time, norgal, and one other promising all-in-one mitogenome solution for reproducible results, i stumbled upon GetOrganelle(github). This program is written in python, so no compiling anything or need for requesting special installation permission from our HPC administration- Just a conda install. I was also worried that this would be a waste of time after reading the publication(link) titled “GetOrganelle: a simple and fast pipeline for de novo assembly of a complete circular chloroplast genome using genome skimming data.” With a bit of reading, I later found out that they have accomodations for vertebrate animal mitochondria genomes! Their methods are based off of the characteristics of circularized sequences, so it is theoretically useful across chloroplast, bacterial, and mitochondrial DNA sequences alike!

Code

The pipeline depends on SPAdes (an assembler program for small sequences), Bowtie2 (an alignment program), BLAST+ (also a local aligner, sequence similarity search algorithm), and if you want to look at a cool assembly graph, you’ll need to download Bandage(visualization tool).

I have taken a liking (can’t you tell I’m southern) to Anaconda/Miniconda and their environments. Basically, programs that have conda recipes (created in, converted/translated to python) can be loaded within specific environments along with all of the recipies for dependencies.

Installation/Setup

```
#setting up a new conda environment for assembling mitogenomes and installing GetOrganelle from the Bioconda channel
conda create --name MitoGenome -c bioconda getorganelle
#activate MitoGenome environment 
conda activate MitoGenome
```

This pipeline has already set up for it’s dependencies to be loaded and/or installed when GetOrganelle is, which is super thoughtful! If they hadn’t, you would have needed to run something like this…

conda create --name MitoGenome getorganelle bowtie2 blast spades

I may have needed to answer with a simple y to finish installation. You can always check to see if your favorite computational biology tools have exisiting recipes here. Here’s a quickstart cheat sheet for conda commands as well.

Pipeline Progression

Here are some snippets from the log file produced while running GetOrganelle. I’ll use # as above to indicate comments about what the code is doing, generally.

`````````
GetOrganelle v1.6.4
Python 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
PYTHON LIBS: numpy 1.18.1; sympy 1.5.1; scipy 1.4.1; psutil 5.7.0
DEPENDENCIES: Bowtie2 2.3.5.1; SPAdes 3.13.1; Blast 2.9.0
WORKING DIR: /gpfs01/scratch/adc0032/GetOrganelle

##Commands that i provided to the program, pointing to sequence reads, requesting their mitochondrial data to seed/guide my assembly, and compressing my output files named with the -o flag

/home/adc0032/anaconda3/bin/get_organelle_from_reads.py  \
-1 {read1.fq.gz} -2 {read2.fq.gz} -o {Output_prefix} -F animal_mt --zip-files --keep-temp -t 15 --verbose --target-genome-size 17000
##The program downsamples the data (skimming), maps with bowtie to their animal_mt index to make contigs and assembles those contigs with spades 
2020-04-08 20:50:04,159 - INFO: Pre-reading fastq …
2020-04-08 20:50:04,160 - INFO: Estimating reads to use … (to use all reads, set '--reduce-reads-for-coverage inf')
2020-04-08 20:50:05,084 - INFO: Tasting 100000+100000 reads …

2020-04-08 20:50:06,013 - INFO: Seed bowtie2 index existed!
2020-04-08 20:50:06,013 - INFO: Mapping reads to seed bowtie2 index …
2020-04-08 20:50:06,013 - INFO: bowtie2 --large-index --seed 12345 --mm -p 15 /gpfs01/home/adc0032/anaconda3/lib/python3.7/site-packages/GetOrganelleLib/SeedDatabase/animal_mt.index \
2020-04-08 20:50:12,387 - INFO: Mapping finished.
2020-04-08 20:50:12,387 - INFO: Tasting 500000+500000 reads …
======= SPAdes pipeline started. Log can be found here: /gpfs01/scratch/adc0032/GetOrganelle/filtered_spades/spades.log
===== Read error correction started.....
Results

I ran this program with DNA and blood RNA sequence reads independently. GetOrganelle attempts to give you one, circularized sequence (1 contig), but I got 3 contigs from RNA reads and 6 from DNA reads. I downloaded the contigs to explore in Geneious, and I downloaded the assembly graphs to look at in Bandage.

Figure: GetOrganelle contigs assembled from RNA sequence reads mapped to the glass lizard mitochondrial genome.
Figure was output from Geneious.

It looks like those contigs pretty much assembled the mitochondrial genome, with the exception of the control region, which is highly repetitive (and therefore quite difficult to sequence and assemble) and variable. The DNA sequence data yielded more contigs, so a less contiguous assembly. It really only ….

Figure: GetOrganelle contigs assembled from DNA sequence reads mapped to the glass lizard mitochondrial genome.
Figure was output from Geneious.

The spades couldn’t resolve the connection between the 3 (or 6) contigs it assembled, as seen in this Bandage visualization of the assembly graph below.

Figure: Bandage Visualization of SPAdes assembly graph produced.
Figure was output from Bandage.

NOVOPlasty

NOVOPlasty is advertised as a de novo assembler and variant caller (heteroplasmy) for small circular genomes. The data paper for this program was published in 2016, so only three years older than GetOrganelle, and they released an updated publication focusing on the variant caller additions to the program in 2019. In the context of usability, I found this program to be the easiest to install and get running. It is a perl script, so as long as you have perl installed, you can clone the NOVOPlasty directory from Github and get started with mitogenome assembly!

Code

NOVOPlasty assembles organelle genomes through a seed-extend algorithm. This just means it uses a small chunk of sequence you are looking for, or something very closely related to what you’re looking for and tries to build your mitogenome by using reads from your dataset to extend the seed sequence. It’s important to have a good (complete) seed if you are using a closely related species. Alternatively, if you have partial sequences from your species of interests (i.e. a complete ND6 gene ) you can use those as seeds. Again, NOVOPlasty was, by far, the easy to mitogenome assembly program to clone and implement.

Installation/Setup

Find out if you have perl installed on your system:

-> which perl

~/anaconda3/bin/perl

If you have perl installed, you can clone the repository using git clone:

-> git clone https://github.com/ndierckx/NOVOPlasty.git

From here, you’re ready to run the script, you just need to set up a config file (which they have an example of on their github):

Project:
Project name = MyProject
Type = mito
Genome Range = 15000-19000
K-mer = 39
Max memory =
Extended log = 1
Save assembled reads = yes
Seed Input = /path/to/reference.fa or /path/to/seed.fa
Extend seed directly = no
Reference sequence =
Variance detection = no
Heteroplasmy =
HP exclude list =
Chloroplast sequence =
Dataset 1:
Read Length = 150
Insert size = 350
Platform = illumina
Single/Paired = PE
Combined reads =
Forward reads = /path/to/forward.fq
Reverse reads = /path/to/reverse.fq

Pipeline Progression

In my initial run of the program it returned an error message, something along the lines of a memory allocation issue. I was using raw genomic reads with an expected coverage of 50X, so I was definitely drowning the program with my data.

I downsampled my reads using a bash one-liner:

zcat read.fq.gz | head -300000000 > downsample_read.fq

This line reads the compressed read.fq file with zcat and pipes the information to the head command, which returns the amount of lines indicated with the - flag (here, return the first 300M reads) and direct those reads to a new file.

It is important that you use a multiple of 4, due to the structure of a fastq file. FASTQ files include 4 lines of information for every read. Here is some more information of FASTQ files.

The program outputs nice log files, and you can request the extended log in your config file by setting Extended Log=1 (1=True in many coding situations).

From a log file:

-------------------------Output data metrics-----------------------------
...
Total contigs : 51
Largest contig : 3139 bp
Smallest contig : 272 bp
Average insert size : 300 bp
-------------------------Input data metrics-----------------------------
Total reads : 150000000
Aligned reads : 5660
Assembled reads : 5378

Results

I was the least impressed with these results. I ran downsampled genomic reads, liver reads, and blood reads. I used a seed in the form of a single known gene from my focal species, and a seed in the form of the mitogenome from my reference species.

The blood and liver data, using either seed resulted in a single contig of 966bp and 945bp long, respectively. If you’ll notice from the config.txt example, I’m hoping to assemble 15,000-17,000 bps of sequence (*super exaggerated eye-roll).

Figure: NOVOPlasty contig assembled from liver reads mapped to the glass lizard mitogenome. Underwhelming coverage of the 12S ribosomal RNA gene.

The downsampled genomic data resulted in 51 contigs of various length. It’s not ideal that it is such a fragmented assembly, but at least this can be mapped to my backbone to verify what I’ve already produced, and/or fill in any missing data covered by these 51 contigs.

Figure: NOVOPlasty contigs assembled from downsampled genomic reads mapped to the glass lizard mitogenome. Of the 51 contigs created by NOVOPlasty, ~10 were duplicates of an existing contig.

NOVOPlasty has some instructions to use circos to build plots for visualization of variants in organelle genomes. There are some hacking file rearrangements necessary to make it work, but the details are on github at their wiki. This was outside of the scope of my interest in using the program, so I didn’t call variants (or make a circos plot here).

These data really only provided more coverage for regions of my mitogenome I’d already assembled, additionally, one of my original goals was to find an all-in-one solution to mitogenome assembly… which we can see didn’t happen for my dataset.

I did test NOVOPlasty on some seastar data for a friend (and colleague) and it was able to assemble and circularize a decent mitogenome sequence.

MitoZ

MitoZ is a python-based de novo mitogenome assembly and annotation tool. I reviewed the data paper for this program in a computational biology colloquium I took last Fall, and I was really excited to try it when I first began exploring tools for mitogenome assembly and annotation.

Code

MitoZ is fairly easy to install, and they have several methods of installation, with instructions on setting up taxonomy information for their report of closely related species. I used a conda installation for this program.

Installation/Setup

The installation was very clearly detailed on the github page, and I can’t think of anything special I needed to have that wasn’t loaded with activation of the mitoZ environment.

-> conda create  -n mitozEnv libgd=2.2.4 python=3.6.0 biopython=1.69 ete3=3.0.0b35 perl-list-moreutils perl-params-validate perl-clone circos=0.69 perl-bioperl blast=2.2.31  hmmer=3.1b2  bwa=0.7.12 samtools=1.3.1 infernal=1.1.1 tbl2asn openjdk
-> conda update -c bioconda tbl2asn ete3

The authors included very detailed troubleshooting information for getting started. I was really please with the level of set-up information provided.

Pipeline Progression

MitoZ has a few run modes that include different levels of data manipulation. There is an option to trim and filter your input dataset before assembly

> python3 MitoZ.py all

Furthermore, it runs a fast mode and provides an option to run a k-mer mode to improve the assembly from fast mode. The memory usage and run time increase greatly in k-mer mode.

The one section I had trouble with was transitioning from fast mode to k-mer mode to improve my assembly. The filenames aren’t quite the same as they instruct you to use, but find the files with the most similar name, and they will get the job done.

Additionally, you need to provide a file that has the scaffolds found in fast mode and the protein-coding gene that were found on each of them listed in a row. You can pretty much build this from your summary output file.

> cat quick_mode_fa_genes.txt

scaffold7320 ATP8 ND5 ND6

scaffold2113 COX1 ATP6

...

Results

NEXT UP:

Bringing it all together (now)… (links to the Beatles :))
https://giphy.com/explore/all-together-now

I will include my results from this project in an upcoming genome publication! Good genomic resources can help expand the questions answered as well as the hypotheses generated that moves science forward.

I am about half-way through assembling mitochondrial genomes for several reptile transcriptomic datasets published in Waits (2019). You can follow my repository for updates on this data.

Commentary on methodology, post content, content accuracy, or future blog content is always welcomed!
Figures and images were created in BioRender or exported from the indicated program. Web image links are included in the captions.
Feature Image “Cute Mitochondria,” https://www.pinterest.com/pin/322148179570782350/
Data used was generated from my research lab and/or downloaded from a public database (NCBI).
Unpublished data will be partially concealed until it is published.

Published by Amanda Clark

PhD candidate at Auburn University in the functional genomics lab of Dr. Tonia Schwartz

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Create your website at WordPress.com
Get started
%d bloggers like this: