When I started graduate school, I ran head first into genomics (wiki link). I didn’t even know what a “command line” was, and I didn’t know any programming languages…not even basic bash.
Now after several computational biology courses, endless scrolls of genomics/bioinformatics forums, and free online coding courses — I know more than I did yesterday, but I still don’t know enough. The fields of “omics” and “informatics” is drastically changing every day, with technology fading into deprecation in a span of months!
So why am I telling you this? Doesn’t matter when you start in bioinformatics to study genomics, you’ll always have something new to learn, replacing the knowledge and tools you just learned yesterday.
Learn it anyway! Even if you don’t become a computer programmer, it helps you learn to use other people programs and gives you the flexibility to tweak them for your goals. After the science, I’ll leave you with some thoughts on troubleshooting other people’s code to accomplish your research goals.
Today’s post was inspired by my recent investigation of assembling mitochondrial genomes and it may turn into a series. Here’s a bit of background to orient you!
A genome is an assembly of the genetic material (DNA) from an organism, organized to reflect their original chromosomal arrangement. I like to use a book analogy (below) to understand genomics terminology. This figure takes you through sequencing and assembly theory.
Virtually all cells have DNA in their nuclei called the nuclear genome. Mitochondria (and plant chloroplast), which are organelles in the cell responsible for generating ATP to power cellular work, also have a separate and smaller, circularized genomes that are maternally inherited. Below is a short comparison of mitochondrial and nuclear DNA. To learn more about mitochondrial origin and function from a lab focused on mitochondria, check out the Ettema Lab.
Methods in genomics is like the wild west sometimes. We have tons of programs that are based on different algorithms and platforms, each with their own flaw, to carry out the same goal. Further more, different methods are written for different levels of skill for community use, but often the quality of the results vary —especially with “plug ‘n’ chug” graphical user interfaces (GUI) that simplify input and parameterization.
From broad research, it seems that the generally accepted methodology for assembling a mitochondrial genome (mitogenome) consists of using sequence similarity searches with existing mitogenomes or mitochondrial gene sequences as bait to capture contigs from the genome assembly that are from mitochondrial reads.These contigs are then assembled to create a contiguous, mitogenome with all 37 genes. There are also quite a few programs that attempt to assemble the mitochondrial genome from the raw read data. I’m attempting to reconstruct the full mitogenome for a species of interest using the sequencing data available to me.
The First Hurdle
What if your sequence similarity search yields nada (zilch, zero)? My first round of sequence similarity search with
ncbi-blast+/2.6.0 left me with a few hits that weren’t long enough to be mitochondrial contigs, and could very well be nuclear mitochondrial DNA segments (NUMTs, mitochondrial sequences incorporated into the nuclear genome) –which would be really embarrassing to publish as my mitochondrial sequences…sigh. The search algorithm and setting were probably more appropriate for species that are more closely related.
Was my bait species not as closely related as I thought? I checked out the divergence time estimation for both the reference and my species of interest on TimeTree.org.
45 million years ago (MYA) is quite a long time, but the most commonly used reference for reptile genomics (green anole, Anolis carolinensis) has a divergence time of 165 MYA from my species of interest! Based on my knowledge of complete, annotated mitogenomes available on National Center for Biotechnology Information (NCBI) at the time, I decided to stick with the reference I had.
Did sequencing library preparation lead to reduced mitochondrial representation? Maybe! So, I expanded from my genomic reads, in case library preparation through 10x Genomics doesn’t lead to an absurd amount of mitochondrial sequences (which is typically the case). I included two RNA-seq libraries (raw read sets from DNA extracted from blood & liver) for my species of interest that were readily available on the NCBI SRA database.
If at First You Don’t Succeed…
I expanded my methodology to mapping raw, trimmed reads to a reference mitogenome and two de novo (no reference sequence) assembly methods. All of the methods used were performed separately with all three libraries.
Good Ole’ BWA
I started with the Burrow’s Wheeler Alignment (BWA) mem algorithm, which is commonly used to map sequencing reads to a closely related reference sequence (typically closer than my species and reference are).
bwa mem -t 8 -M ref_mito myreads_R1.fq.gz myreads_R2.fq.gz|samtools sort -@ 8 -o myreadstoref.bam
The binary alignment map (BAM/.bam) files created from mapping the three sequence libraries were downloaded and visualized in Geneious/11.1.4.
I exported the consensus sequence (sequence with corresponding basepairs where reads mapped and N’s where no reads mapped) from all three mapping files. The consensus sequences from the blood and liver RNA-seq data were aligned (MAFFT in Geneious) first, and their consensus was aligned with the consensus sequence from raw DNA-seq reads. Lastly, the consensus of the three libraries was aligned to the reference genome for annotation transfer.
My initial thought, “I’m missing so much data, Yikes!” After chatting with my advisor and doing some reading, I came to the conclusion that iterating through the assembly processes will help me capture more mitochondrial reads and fill in my sequencing gaps. This gave me a bit more hope! Although there are SO MANY N’s (red regions on the identity track of the alignment), I did now have a “backbone” mitogenome sequence.
How do I fill in some of these N’s (missing data)?
I stumbled across Forni, 2019 which gave me some new direction and some scripts to test out! Their methods combine mapping and de novo assembly in an iterative manner to reconstruct mitogenomes from RNA-seq data. I attempted to use the scripts found on their github, and initially encountered an obscure error in their code preventing it from running. After a bit of investigation, turns out there were some parentheses missing on a print statement, which was only obvious if you can code in python. This is why I think scientist interested in analyzing their own data should learn code and the command line. BUT even once I fixed that issue, the script ran for WEEKS, passing the iteration limit set using the
-N 10 flag and stalling at its 32nd iteration after producing over a terabyte of temporary files (eye roll).
Needless to say, I gave up on that – but I did start looking into other programs promising de novo (or almost de novo) reconstruction of mitogenomes: MitoZ (github), GetOrganelle (github), and NOVOPlasty (github). Virtually all of the programs I’ve found are available only on a command line interface (CLI), requiring basic command line operating skills.
I hope you’ve learned a bit about mitochondrial DNA, sequencing genetic material, and the importance of learning a little bit of coding vernacular to implement programs. Next week, I’ll continue this post, going into detail on how I installed and implemented these three programs in hopes of filling in missing sequence data for mitogenome reconstruction.
Commentary on methodology, post content, content accuracy, or future blog content is always welcomed!
Figures and images were created in BioRender or exported from the indicated program. Web image links are included in the captions.
Feature Image “Cute Mitochondria,” https://www.pinterest.com/pin/322148179570782350/
Data used was generated from my research lab and/or downloaded from a public database (NCBI).
Unpublished data will be partially concealed until it is published.