Mitochondrial Madness Pt. 2

This week’s post is a continuation of my previous post on assembling mitochondrial genomes from existing data. I referenced doing this last week, but I’m experiencing an episode of tunnel vision locked on k-mer analyses, which I plan to cover in my next blog post…it’s consuming and confusing.

Me, reading my 10th paper on k-mers and still not understanding how to interpret them…

Also, because I pretend I can multitask, I’ll have to release this over the week in three updates (per program basis) because I’ve spent my blogging time starring at my k-mer results for a different project.

In my last post, I needed to improve the mitogenome sequence that I partially captured from mapping reads from 2 RNA-sequencing libraries and a DNA-sequencing library. This more or less gave me a “backbone” for the sequence I am trying to obtain. This also gave me an idea of sequence coverage for specific regions of the mitogenome (reads that mapped from all three sequencing libraries virtually provided coverage at the same regions several times [at high depth]).

I decided to explore de novo assembly methods for mitochondrial genomes. It’s always nice to have someone else pipeline, so you don’t feel like you’re just hacking things together. Unfortunately, that means lots of deadends and trial and error that I typically don’t have time for…but someones gotta do it.

Reasons for pulling my hair out strand by strand when dealing with published programs, pipelines, and workflows:

  1. POOR DOCUMENTATION – if you write software for others, and you expect people to use your software, PLEASE leave well-detailed, debugged, and tested code with documentation on what the code is suppose to do. Instruct, with grave, detail on how a user is supposed to implement and run your scripts, programs, and workflows. This makes it less likely that tons of time is spent on explaining things in github issue forums, and at least people with some coding skills could probably make things work for them without author intervention.
  2. POOR MAINTANENCE – technology is getting older moments after being created, so there are always new ways to improve your code, dependencies, and program algorithms. Many programs are publish, and released to the community and the authors NEVER go back to check for issues presented by users or update their programs. Of course I understand that people’s lives move past projects, but I’ve seen some programs that would be super beneficial to the non-model community if it were maintained!

I mapped contigs from 3 different de novo assembly methods to my “backbone” sequence, in hopes of filling in sequence ambiguities in overlapping regions in Geneious using the Geneious mapping algorithm. I also independently used de novo assembled contigs on a “by program” basis to map to the same reference genome I mapped reads from my species of interest to create the “backbone”.

GetOrganelle

After failing get results from that iterative script i spoke about last time, norgal, and one other promising all-in-one mitogenome solution for reproducible results, i stumbled upon GetOrganelle(github). This program is written in python, so no compiling anything or need for requesting special installation permission from our HPC administration- Just a conda install. I was also worried that this would be a waste of time after reading the publication(link) titled “GetOrganelle: a simple and fast pipeline for de novo assembly of a complete circular chloroplast genome using genome skimming data.” With a bit of reading, I later found out that they have accomodations for vertebrate animal mitochondria genomes! Their methods are based off of the characteristics of circularized sequences, so it is theoretically useful across chloroplast, bacterial, and mitochondrial DNA sequences alike!

Code

The pipeline depends on SPAdes (an assembler program for small sequences), Bowtie2 (an alignment program), BLAST+ (also a local aligner, sequence similarity search algorithm), and if you want to look at a cool assembly graph, you’ll need to download Bandage(visualization tool).

I have taken a liking (can’t you tell I’m southern) to Anaconda/Miniconda and their environments. Basically, programs that have conda recipes (created in, converted/translated to python) can be loaded within specific environments along with all of the recipies for dependencies.

Installation/Setup

```
#setting up a new conda environment for assembling mitogenomes and installing GetOrganelle from the Bioconda channel
conda create --name MitoGenome -c bioconda getorganelle
#activate MitoGenome environment 
conda activate MitoGenome
```

This pipeline has already set up for it’s dependencies to be loaded and/or installed when GetOrganelle is, which is super thoughtful! If they hadn’t, you would have needed to run something like this…

conda create --name MitoGenome getorganelle bowtie2 blast spades

I may have needed to answer with a simple y to finish installation. You can always check to see if your favorite computational biology tools have exisiting recipes here. Here’s a quickstart cheat sheet for conda commands as well.

Pipeline Progression

Here are some snippets from the log file produced while running GetOrganelle. I’ll use # as above to indicate comments about what the code is doing, generally.

`````````
GetOrganelle v1.6.4
Python 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
PYTHON LIBS: numpy 1.18.1; sympy 1.5.1; scipy 1.4.1; psutil 5.7.0
DEPENDENCIES: Bowtie2 2.3.5.1; SPAdes 3.13.1; Blast 2.9.0
WORKING DIR: /gpfs01/scratch/adc0032/GetOrganelle

##Commands that i provided to the program, pointing to sequence reads, requesting their mitochondrial data to seed/guide my assembly, and compressing my output files named with the -o flag

/home/adc0032/anaconda3/bin/get_organelle_from_reads.py  \
-1 {read1.fq.gz} -2 {read2.fq.gz} -o {Output_prefix} -F animal_mt --zip-files --keep-temp -t 15 --verbose --target-genome-size 17000
##The program downsamples the data (skimming), maps with bowtie to their animal_mt index to make contigs and assembles those contigs with spades 
2020-04-08 20:50:04,159 - INFO: Pre-reading fastq …
2020-04-08 20:50:04,160 - INFO: Estimating reads to use … (to use all reads, set '--reduce-reads-for-coverage inf')
2020-04-08 20:50:05,084 - INFO: Tasting 100000+100000 reads …

2020-04-08 20:50:06,013 - INFO: Seed bowtie2 index existed!
2020-04-08 20:50:06,013 - INFO: Mapping reads to seed bowtie2 index …
2020-04-08 20:50:06,013 - INFO: bowtie2 --large-index --seed 12345 --mm -p 15 /gpfs01/home/adc0032/anaconda3/lib/python3.7/site-packages/GetOrganelleLib/SeedDatabase/animal_mt.index \
2020-04-08 20:50:12,387 - INFO: Mapping finished.
2020-04-08 20:50:12,387 - INFO: Tasting 500000+500000 reads …
======= SPAdes pipeline started. Log can be found here: /gpfs01/scratch/adc0032/GetOrganelle/filtered_spades/spades.log
===== Read error correction started.....
Results

I ran this program with DNA and blood RNA sequence reads independently. GetOrganelle attempts to give you one, circularized sequence (1 contig), but I got 3 contigs from RNA reads and 6 from DNA reads. I downloaded the contigs to explore in Geneious, and I downloaded the assembly graphs to look at in Bandage.

Figure: GetOrganelle contigs assembled from RNA sequence reads mapped to the glass lizard mitochondrial genome.
Figure was output from Geneious.

It looks like those contigs pretty much assembled the mitochondrial genome, with the exception of the control region, which is highly repetitive (and therefore quite difficult to sequence and assemble) and variable. The DNA sequence data yielded more contigs, so a less contiguous assembly. It really only ….

Figure: GetOrganelle contigs assembled from DNA sequence reads mapped to the glass lizard mitochondrial genome.
Figure was output from Geneious.

The spades couldn’t resolve the connection between the 3 (or 6) contigs it assembled, as seen in this Bandage visualization of the assembly graph below.

Figure: Bandage Visualization of SPAdes assembly graph produced.
Figure was output from Bandage.

NOVOPlasty

NOVOPlasty is advertised as a de novo assembler and variant caller (heteroplasmy) for small circular genomes. The data paper for this program was published in 2016, so only three years older than GetOrganelle, and they released an updated publication focusing on the variant caller additions to the program in 2019. In the context of usability, I found this program to be the easiest to install and get running. It is a perl script, so as long as you have perl installed, you can clone the NOVOPlasty directory from Github and get started with mitogenome assembly!

Code

NOVOPlasty assembles organelle genomes through a seed-extend algorithm. This just means it uses a small chunk of sequence you are looking for, or something very closely related to what you’re looking for and tries to build your mitogenome by using reads from your dataset to extend the seed sequence. It’s important to have a good (complete) seed if you are using a closely related species. Alternatively, if you have partial sequences from your species of interests (i.e. a complete ND6 gene ) you can use those as seeds. Again, NOVOPlasty was, by far, the easy to mitogenome assembly program to clone and implement.

Installation/Setup

Find out if you have perl installed on your system:

-> which perl

~/anaconda3/bin/perl

If you have perl installed, you can clone the repository using git clone:

-> git clone https://github.com/ndierckx/NOVOPlasty.git

From here, you’re ready to run the script, you just need to set up a config file (which they have an example of on their github):

Project:
Project name = MyProject
Type = mito
Genome Range = 15000-19000
K-mer = 39
Max memory =
Extended log = 1
Save assembled reads = yes
Seed Input = /path/to/reference.fa or /path/to/seed.fa
Extend seed directly = no
Reference sequence =
Variance detection = no
Heteroplasmy =
HP exclude list =
Chloroplast sequence =
Dataset 1:
Read Length = 150
Insert size = 350
Platform = illumina
Single/Paired = PE
Combined reads =
Forward reads = /path/to/forward.fq
Reverse reads = /path/to/reverse.fq

Pipeline Progression

In my initial run of the program it returned an error message, something along the lines of a memory allocation issue. I was using raw genomic reads with an expected coverage of 50X, so I was definitely drowning the program with my data.

I downsampled my reads using a bash one-liner:

zcat read.fq.gz | head -300000000 > downsample_read.fq

This line reads the compressed read.fq file with zcat and pipes the information to the head command, which returns the amount of lines indicated with the - flag (here, return the first 300M reads) and direct those reads to a new file.

It is important that you use a multiple of 4, due to the structure of a fastq file. FASTQ files include 4 lines of information for every read. Here is some more information of FASTQ files.

The program outputs nice log files, and you can request the extended log in your config file by setting Extended Log=1 (1=True in many coding situations).

From a log file:

-------------------------Output data metrics-----------------------------
...
Total contigs : 51
Largest contig : 3139 bp
Smallest contig : 272 bp
Average insert size : 300 bp
-------------------------Input data metrics-----------------------------
Total reads : 150000000
Aligned reads : 5660
Assembled reads : 5378

Results

I was the least impressed with these results. I ran downsampled genomic reads, liver reads, and blood reads. I used a seed in the form of a single known gene from my focal species, and a seed in the form of the mitogenome from my reference species.

The blood and liver data, using either seed resulted in a single contig of 966bp and 945bp long, respectively. If you’ll notice from the config.txt example, I’m hoping to assemble 15,000-17,000 bps of sequence (*super exaggerated eye-roll).

Figure: NOVOPlasty contig assembled from liver reads mapped to the glass lizard mitogenome. Underwhelming coverage of the 12S ribosomal RNA gene.

The downsampled genomic data resulted in 51 contigs of various length. It’s not ideal that it is such a fragmented assembly, but at least this can be mapped to my backbone to verify what I’ve already produced, and/or fill in any missing data covered by these 51 contigs.

Figure: NOVOPlasty contigs assembled from downsampled genomic reads mapped to the glass lizard mitogenome. Of the 51 contigs created by NOVOPlasty, ~10 were duplicates of an existing contig.

NOVOPlasty has some instructions to use circos to build plots for visualization of variants in organelle genomes. There are some hacking file rearrangements necessary to make it work, but the details are on github at their wiki. This was outside of the scope of my interest in using the program, so I didn’t call variants (or make a circos plot here).

These data really only provided more coverage for regions of my mitogenome I’d already assembled, additionally, one of my original goals was to find an all-in-one solution to mitogenome assembly… which we can see didn’t happen for my dataset.

I did test NOVOPlasty on some seastar data for a friend (and colleague) and it was able to assemble and circularize a decent mitogenome sequence.

MitoZ

MitoZ is a python-based de novo mitogenome assembly and annotation tool. I reviewed the data paper for this program in a computational biology colloquium I took last Fall, and I was really excited to try it when I first began exploring tools for mitogenome assembly and annotation.

Code

MitoZ is fairly easy to install, and they have several methods of installation, with instructions on setting up taxonomy information for their report of closely related species. I used a conda installation for this program.

Installation/Setup

The installation was very clearly detailed on the github page, and I can’t think of anything special I needed to have that wasn’t loaded with activation of the mitoZ environment.

-> conda create  -n mitozEnv libgd=2.2.4 python=3.6.0 biopython=1.69 ete3=3.0.0b35 perl-list-moreutils perl-params-validate perl-clone circos=0.69 perl-bioperl blast=2.2.31  hmmer=3.1b2  bwa=0.7.12 samtools=1.3.1 infernal=1.1.1 tbl2asn openjdk
-> conda update -c bioconda tbl2asn ete3

The authors included very detailed troubleshooting information for getting started. I was really please with the level of set-up information provided.

Pipeline Progression

MitoZ has a few run modes that include different levels of data manipulation. There is an option to trim and filter your input dataset before assembly

> python3 MitoZ.py all

Furthermore, it runs a fast mode and provides an option to run a k-mer mode to improve the assembly from fast mode. The memory usage and run time increase greatly in k-mer mode.

The one section I had trouble with was transitioning from fast mode to k-mer mode to improve my assembly. The filenames aren’t quite the same as they instruct you to use, but find the files with the most similar name, and they will get the job done.

Additionally, you need to provide a file that has the scaffolds found in fast mode and the protein-coding gene that were found on each of them listed in a row. You can pretty much build this from your summary output file.

> cat quick_mode_fa_genes.txt

scaffold7320 ATP8 ND5 ND6

scaffold2113 COX1 ATP6

...

Results

NEXT UP:

Bringing it all together (now)… (links to the Beatles :))
https://giphy.com/explore/all-together-now

I will include my results from this project in an upcoming genome publication! Good genomic resources can help expand the questions answered as well as the hypotheses generated that moves science forward.

I am about half-way through assembling mitochondrial genomes for several reptile transcriptomic datasets published in Waits (2019). You can follow my repository for updates on this data.

Commentary on methodology, post content, content accuracy, or future blog content is always welcomed!
Figures and images were created in BioRender or exported from the indicated program. Web image links are included in the captions.
Feature Image “Cute Mitochondria,” https://www.pinterest.com/pin/322148179570782350/
Data used was generated from my research lab and/or downloaded from a public database (NCBI).
Unpublished data will be partially concealed until it is published.

Mitochondrial Madness

When I started graduate school, I ran head first into genomics (wiki link). I didn’t even know what a “command line” was, and I didn’t know any programming languages…not even basic bash.

https://memeguy.com/photo/7694/the-command-line

Now after several computational biology courses, endless scrolls of genomics/bioinformatics forums, and free online coding courses — I know more than I did yesterday, but I still don’t know enough. The fields of “omics” and “informatics” is drastically changing every day, with technology fading into deprecation in a span of months!

So why am I telling you this? Doesn’t matter when you start in bioinformatics to study genomics, you’ll always have something new to learn, replacing the knowledge and tools you just learned yesterday.

Learn it anyway! Even if you don’t become a computer programmer, it helps you learn to use other people programs and gives you the flexibility to tweak them for your goals. After the science, I’ll leave you with some thoughts on troubleshooting other people’s code to accomplish your research goals.

made with: https://imgflip.com/memegenerator

Today’s post was inspired by my recent investigation of assembling mitochondrial genomes and it may turn into a series. Here’s a bit of background to orient you!

A genome is an assembly of the genetic material (DNA) from an organism, organized to reflect their original chromosomal arrangement. I like to use a book analogy (below) to understand genomics terminology. This figure takes you through sequencing and assembly theory.

Simplified sequencing and assembly theory, using a book related analogy.
Created in BioRender.

Virtually all cells have DNA in their nuclei called the nuclear genome. Mitochondria (and plant chloroplast), which are organelles in the cell responsible for generating ATP to power cellular work, also have a separate and smaller, circularized genomes that are maternally inherited. Below is a short comparison of mitochondrial and nuclear DNA. To learn more about mitochondrial origin and function from a lab focused on mitochondria, check out the Ettema Lab.

Brief comparison of DNA from mitochondria versus nuclei.
The details for basepair and gene numbers are specific to animals. These numbers can be very different for plants and fungi.
*This refers only to somatic (non-sex) cells. Sperm and ova (gametic cells) have only a single of the nuclear genome.
Created in BioRender.

Methods in genomics is like the wild west sometimes. We have tons of programs that are based on different algorithms and platforms, each with their own flaw, to carry out the same goal. Further more, different methods are written for different levels of skill for community use, but often the quality of the results vary —especially with “plug ‘n’ chug” graphical user interfaces (GUI) that simplify input and parameterization.

Mitogenome Miniproject

Goal

From broad research, it seems that the generally accepted methodology for assembling a mitochondrial genome (mitogenome) consists of using sequence similarity searches with existing mitogenomes or mitochondrial gene sequences as bait to capture contigs from the genome assembly that are from mitochondrial reads.These contigs are then assembled to create a contiguous, mitogenome with all 37 genes. There are also quite a few programs that attempt to assemble the mitochondrial genome from the raw read data. I’m attempting to reconstruct the full mitogenome for a species of interest using the sequencing data available to me.

The First Hurdle

What if your sequence similarity search yields nada (zilch, zero)? My first round of sequence similarity search with ncbi-blast+/2.6.0 left me with a few hits that weren’t long enough to be mitochondrial contigs, and could very well be nuclear mitochondrial DNA segments (NUMTs, mitochondrial sequences incorporated into the nuclear genome) –which would be really embarrassing to publish as my mitochondrial sequences…sigh. The search algorithm and setting were probably more appropriate for species that are more closely related.

Was my bait species not as closely related as I thought? I checked out the divergence time estimation for both the reference and my species of interest on TimeTree.org.

Pairwise Divergence Time Estimated for the species of interest and a reference (Dopasia gracilis).
Figure exported from timetree.org; edited in BioRender.

45 million years ago (MYA) is quite a long time, but the most commonly used reference for reptile genomics (green anole, Anolis carolinensis) has a divergence time of 165 MYA from my species of interest! Based on my knowledge of complete, annotated mitogenomes available on National Center for Biotechnology Information (NCBI) at the time, I decided to stick with the reference I had.

Did sequencing library preparation lead to reduced mitochondrial representation? Maybe! So, I expanded from my genomic reads, in case library preparation through 10x Genomics doesn’t lead to an absurd amount of mitochondrial sequences (which is typically the case). I included two RNA-seq libraries (raw read sets from DNA extracted from blood & liver) for my species of interest that were readily available on the NCBI SRA database.

If at First You Don’t Succeed…

I expanded my methodology to mapping raw, trimmed reads to a reference mitogenome and two de novo (no reference sequence) assembly methods. All of the methods used were performed separately with all three libraries.

Good Ole’ BWA

I started with the Burrow’s Wheeler Alignment (BWA) mem algorithm, which is commonly used to map sequencing reads to a closely related reference sequence (typically closer than my species and reference are).

bwa mem -t 8 -M ref_mito myreads_R1.fq.gz myreads_R2.fq.gz|samtools sort -@ 8 -o myreadstoref.bam

The binary alignment map (BAM/.bam) files created from mapping the three sequence libraries were downloaded and visualized in Geneious/11.1.4.

I exported the consensus sequence (sequence with corresponding basepairs where reads mapped and N’s where no reads mapped) from all three mapping files. The consensus sequences from the blood and liver RNA-seq data were aligned (MAFFT in Geneious) first, and their consensus was aligned with the consensus sequence from raw DNA-seq reads. Lastly, the consensus of the three libraries was aligned to the reference genome for annotation transfer.

Alignment of three sequencing libraries mapped to the same reference mitogenome (NC_030369). The identity bar indicates sequence similarity, where green is 100% identical basepairs between the sequences and red is completely different basepairs. All red regions (and other low identity [>50%] regions) are N’s in the consensus.
Image exported from Geneious.

My initial thought, “I’m missing so much data, Yikes!” After chatting with my advisor and doing some reading, I came to the conclusion that iterating through the assembly processes will help me capture more mitochondrial reads and fill in my sequencing gaps. This gave me a bit more hope! Although there are SO MANY N’s (red regions on the identity track of the alignment), I did now have a “backbone” mitogenome sequence.

Sequence alignment zoomed in on the ATP8 gene sequence. Regions without read coverage are filled with N’s (shown in red/burgundy).
Image exported from Geneious.

How do I fill in some of these N’s (missing data)?

I stumbled across Forni, 2019 which gave me some new direction and some scripts to test out! Their methods combine mapping and de novo assembly in an iterative manner to reconstruct mitogenomes from RNA-seq data. I attempted to use the scripts found on their github, and initially encountered an obscure error in their code preventing it from running. After a bit of investigation, turns out there were some parentheses missing on a print statement, which was only obvious if you can code in python. This is why I think scientist interested in analyzing their own data should learn code and the command line. BUT even once I fixed that issue, the script ran for WEEKS, passing the iteration limit set using the -N 10 flag and stalling at its 32nd iteration after producing over a terabyte of temporary files (eye roll).

Needless to say, I gave up on that – but I did start looking into other programs promising de novo (or almost de novo) reconstruction of mitogenomes: MitoZ (github), GetOrganelle (github), and NOVOPlasty (github). Virtually all of the programs I’ve found are available only on a command line interface (CLI), requiring basic command line operating skills.

I hope you’ve learned a bit about mitochondrial DNA, sequencing genetic material, and the importance of learning a little bit of coding vernacular to implement programs. Next week, I’ll continue this post, going into detail on how I installed and implemented these three programs in hopes of filling in missing sequence data for mitogenome reconstruction.

Commentary on methodology, post content, content accuracy, or future blog content is always welcomed!
Figures and images were created in BioRender or exported from the indicated program. Web image links are included in the captions.
Feature Image “Cute Mitochondria,” https://www.pinterest.com/pin/322148179570782350/
Data used was generated from my research lab and/or downloaded from a public database (NCBI).
Unpublished data will be partially concealed until it is published.

The First One is Always the Hardest.

The first blog post is what I’m referring to in my title. I’m typically not good with blogging or keeping up with the digital age — but I will admit that it is advantageous in many ways.

For my first blog post, I thought I would focus on writing and/or talking about science to anyone. Most scientists invest many years into studying particular fields of interest, changing our thought patterns for endless inquiry and validation, and learning a heap of vocabulary (and, sigh, acronyms). With technology constantly pushing us to new heights daily, it behooves one to also invest time integrating data management, computer literacy, coding literacy, and statistical data analysis into our scientific skillsets. This helps with efficiency of data analysis and interpretation, which can become a hang up because we often produce more data than we can analyze or we don’t have the right tools optimized and developed yet. I’m not saying we should all be computer programmers, but learning a programming language can take you a long way with troubleshooting other people’s code.

https://www.buzzfeed.com/kevinsmith/grad-life?utm_term=.frlRAvxdE&sub=4058009_7194417

The point is, we learn a lot of things to be successful scientists, but we often forget that communicating science to everyone is the most important part.

I wrote a short article for a colleague about my path to science and my research interests in my first year of graduate school. This article was for Youngzine, an online news and education site for children, and I recently revisited the article to find comments from the viewers (albeit 3 years later) that inspired this inaugural blog and reminded me of my personal commitment to learning about how life around us works, and making the knowledge generated by sound and transparent research accessible and understandable. Our communities depend on it — not just the scientific ones, but the school, environmental, and health communities too.

How did we get here?
Credit: Tim Dunne; https://asunow.asu.edu/20191210-new-asu-courses-offer-tools-engaging-others-science

Although I plan for this to be a casual blog, I hope to also translate what I am learning about throughout my graduate career as a means of strengthening my ability to communicate science to anyone (even, the kiddos!).

Create your website at WordPress.com
Get started