--- title: "CRBHits Basic Vignette" author: "Kristian K Ullrich" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{CRBHits Basic Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # CRBHit Basic Vignette - includes CRBHit pair calculation - includes CRBHit pair filtering - includes Longest Isoform selection - includes Codon alignments - includes Ka/Ks calculations ## Table of Contents 1. [Installation](#install) 2. [Conditional Reciprocal Best Hits - Algorithm](#crbhalgorithm) 1. [1. step: sequence similarity search (last)](#crbhstep1) 1. [Input: Coding Sequences (CDS)](#cdsinput) 2. [Get Longest Isoform from NCBI or ENSEMBL Input (optional)](#isoform) 3. [Sequence Similarity Search](#sequencesearch) 2. [2. step: filter hits + RBH extraction](#crbhstep2) 1. [Filter blast-like output prior fitting](#filters) 2. [Custom Filter](#customfilters) 3. [Extract RBHs](#rbhextraction) 3. [3. step: fitting RBHs + get CRBHs](#crbhstep3) 1. [CRBHit pairs - Fitting Parameter](#fittingparam) 3. [Ka/Ks Calculations](#KaKs) 1. [Codon Alignments - cds2codonaln() Function](#cds2codon) 2. [Calculate Ka/Ks values comparing two sequences - cds2kaks() Function](#cds2kaks) 3. [Calculate Ka/Ks values for each CRBHit pair - rbh2kaks() Function](#rbh2kaks) 4. [References](#references) [CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) is a reimplementation of the Conditional Reciprocal Best Hit (CRBH) algorithm [crb-blast](https://github.com/cboursnell/crb-blast) in R. The Reciprocal Best Hit (RBH) approach is commonly used in bioinformatics to show that two sequences evolved from a common ancestral gene. In other words, RBH tries to find orthologous protein sequences within and between species. These orthologous sequences can be further analysed to evaluate protein family evolution, infer phylogenetic trees and to annotate protein function. The CRBH algorithm builds upon the traditional RBH approach to find additional orthologous sequences between two sets of sequences. CRBH uses the sequence search results to fit an expect-value (e-value) cutoff given each RBH to subsequently add sequence pairs to the list of *bona-fide* orthologs given their alignment length. The sequence similarity results can be filtered by common criteria like e-value, query-coverage and more. If chromosome annotations are available for the protein-coding genes (protein order), this information can be directly used to assign tandem duplicates before conducting further processing steps. [CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) features downstream anaylsis like batch calculating synonymous (Ks) and nonsynonymous substitutions (Ka) per orthologous sequence pair. The codon sequence alignment step consist of three subtasks, namely coding nucleotide to protein sequence translation, pairwise protein sequence alignment calculation and converting the protein sequence alignment back into a codon based alignment. __Note:__ In this Vignette the `lastpath` is defined as `vignette.paths[1]` and the `dagchainerpath` as `vignette.paths[2]` to be able to build the Vignette. However, once you have compiled last-1639 and DAGchainer with the functions `make_last()` and `make_dagchainer()` you won't need to specify the paths anymore. Please remove them if you would like to repeat the examples. ```{r} ## load vignette specific libraries library(CRBHits) suppressPackageStartupMessages(library(Biostrings)) suppressPackageStartupMessages(library(dplyr)) suppressPackageStartupMessages(library(ggplot2)) suppressPackageStartupMessages(library(gridExtra)) suppressPackageStartupMessages(library(curl)) ## compile LAST and DAGchainer for the vignette vignette.paths <- CRBHits::make_vignette() ``` ## 1. [Installation]{#install} See the R package page for a detailed description of the install process and its dependencies [https://mpievolbio-it.pages.gwdg.de/crbhits/](https://mpievolbio-it.pages.gwdg.de/crbhits/). ``` ## see a detailed description of installation prerequisites at ## https://mpievolbio-it.pages.gwdg.de/crbhits/ library(devtools) ## from gitlab install_gitlab("mpievolbio-it/crbhits", host = "https://gitlab.gwdg.de", build_vignettes = TRUE, dependencies = FALSE) ## from github install_github("kullrich/CRBHits", build_vignettes = TRUE, dependencies = FALSE) ``` [CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) needs [LAST](https://gitlab.com/mcfrith/last) and [DAGchainer](http://dagchainer.sourceforge.net/) to be installed before one can efficiently use it. All prerequisites (LAST, DAGchainer) are forked within [CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) and can be compiled for Linux/Unix/macOS with the functions `make_last()` and `make_dagchainer()`. For the windows platform the user needs to use e.g. [https://www.cygwin.com/](https://www.cygwin.com/) to be able to compile the prerequisites. ``` library(CRBHits) ## compile last-1639 make_last() ## compile DAGchainer make_dagchainer() ``` ## 2. [Conditional Reciprocal Best Hits - Algorithm]{#crbhalgorithm} ![Figure: Overview of the `cds2rbh()` function](cds2rbhoverview.png){width=400px} The CRBH algorithm was introduced by [Aubry S, Kelly S et al. (2014)](http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1004365) and ported to python [shmlast](https://pypi.org/project/shmlast/) ([Scott C. 2017](https://joss.theoj.org/papers/10.21105/joss.00142)) which benefits from the blast-like sequence search software [LAST](http://last.cbrc.jp/) ([Kiełbasa SM et al. 2011](https://genome.cshlp.org/content/21/3/487.long)). The CRBH algorithm implementation consists of three steps which are explained in more detail in the following subsections. 1. step: sequence similarity search (last)(#crbhstep1) 2. step: filter hits and extract Reciprocal Best Hits (RBHs)(#crbhstep2) 3. step: fitting RBHs and get Conditional Reciprocal Best Hits (CRBHs)(#crbhstep3) All of these steps are combined in either the `cds2rbh()` function (R vector input) or the `cdsfile2rbh()` function (file input). To activate/deactivate specific options use the parameters `@param` given and explained for each function (see `?cds2rbh()` and `?cdsfile2rbh()`). ![Figure: Individual steps of the `cds2rbh()` function](cds2rbh.png){width=800px} ### 2.1. [1. step: sequence similarity search (last)]{#crbhstep1} All further described steps are included in the `cds2rbh()` and `cdsfile2rbh()` function. The subsections will just explain in more detail what is happening with the input data. #### 2.1.1. [Input: Coding Sequences (CDS)]{#cdsinput} [CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) only takes coding nucleotide sequences as the query and target inputs. __Why?__ Because [CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) is implemented to use the same input afterwards to calculate synonymous and nonsynonymous substitutions which must match the protein sequences used to find CRBHit pairs. One can either use a `DNAStringSet` vector from the `Biostrings` R package with the `cds2rbh()` function or use **raw**,**gzipped** or **URL** coding sequences **fasta** files as inputs with the `cdsfile2rbh()` function. [CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) uses either the function `cdsfile2aafile()` to translate a coding sequence **fasta** file into its corresponding protein sequence **fasta** file or `cds2aa()` to translate a `DNAStringSet` into a `AAStringSet`. Thereby it removes all sequences that are not a multiple of three which can not be parsed correctly. __Note:__ The user can easily check if all coding sequences would be a multiple of three or rely on the input files generated by other sources. ```{r} ## example how to check coding sequences if all are a mutiple of three ## load CDS file cdsfile <- system.file("fasta", "ath.cds.fasta.gz", package="CRBHits") cds <- Biostrings::readDNAStringSet(cdsfile) ## the following statement should return TRUE, ## if all sequences are a mutiple of three all(Biostrings::width(cds) %% 3==0) ``` #### 2.1.2. [Get Longest Isoform from NCBI or ENSEMBL Input (optional)]{#isoform} It is also possible to use an URL to directly access coding sequences from [NCBI](https://www.ncbi.nlm.nih.gov/genome) or [ENSEMBL](ftp://ftp.ensembl.org/pub/current_fasta) and reduce this input to the longest annotated isoform. ```` ## example how to access CDS from URL and get longest isoform ## get coding sequences for Homo sapiens from NCBI HOMSAP.cds.NCBI.url <- paste0( "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/", "GCF_000001405.39_GRCh38.p13/", "GCF_000001405.39_GRCh38.p13_cds_from_genomic.fna.gz") HOMSAP.cds.NCBI <- Biostrings::readDNAStringSet(HOMSAP.cds.NCBI.url) ## reduce to the longest isoform HOMSAP.cds.NCBI.longest <- isoform2longest(HOMSAP.cds.NCBI, "NCBI") ## get coding sequences for Homo sapiens from ENSEMBL HOMSAP.cds.ENSEMBL.url <- paste0( "ftp://ftp.ensembl.org/pub/release-101/fasta/", "homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz") HOMSAP.cds.ENSEMBL.file <- tempfile() download.file(HOMSAP.cds.ENSEMBL.url, HOMSAP.cds.ENSEMBL.file, quiet=FALSE) HOMSAP.cds.ENSEMBL <- Biostrings::readDNAStringSet(HOMSAP.cds.ENSEMBL.file) ## reduce to the longest isoform HOMSAP.cds.ENSEMBL.longest <- isoform2longest(HOMSAP.cds.ENSEMBL, "ENSEMBL") ## get help #?isoform2longest ```` __Note:__ It is also possible to obtain and select the longest/primary isoform from CDS Input sources other than NCBI or Ensembl. This can be achieved by parsing a GTF/GFF3 file. This special case is handled in the [CRBHits KaKs Vignette](https://mpievolbio-it.pages.gwdg.de/crbhits/articles/V02KaKsVignette.html) under point 1.2.2, 1.2.3 and 1.4.2. #### 2.1.3. [Sequence Similarity Search]{#sequencesearch} The blast-like software [LAST](http://last.cbrc.jp/) is used to compare the translated coding sequences against each other and output a blast-like output table including the query and target length. __Note:__ Classical RBH can be performed by disabling the `crbh` `@param`of the `cds2rbh()` or `cdsfile2rbh()` function. ```{r} ## example how to get CRBHit pairs from two CDS using classical RBH ## define CDS file1 cdsfile1 <- system.file("fasta", "ath.cds.fasta.gz", package="CRBHits") ## define CDS file2 cdsfile2 <- system.file("fasta", "aly.cds.fasta.gz", package="CRBHits") ## get CDS1 cds1 <- Biostrings::readDNAStringSet(cdsfile1) ## get CDS2 cds2 <- Biostrings::readDNAStringSet(cdsfile2) ## example how to perform classical RBH ath_aly_crbh <- cds2rbh( cds1=cds1, cds2=cds2, crbh=FALSE, lastpath=vignette.paths[1]) ## show summary summary(ath_aly_crbh) ## get help #?cds2rbh ``` Like [shmlast](https://pypi.org/project/shmlast/), [CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) plots the fitted model of the CRBH e-value based algorithm. The CRBH algorithm can be performed and visualized as follows: ```{r} ## example how to get CRBHit pairs using one thread ## and plot CRBHit algorithm fitting curve ## example how to perform CRBH ath_aly_crbh <- cds2rbh( cds1=cds1, cds2=cds2, plotCurve=TRUE, lastpath=vignette.paths[1]) ## show summary summary(ath_aly_crbh) ## get help #?cds2rbh ``` Both, `cdsfile2rbh()` and `cds2rbh()` function, return a list of three `data.frame`'s which contain the CRBH pairs (`$crbh.pairs`) retained, the query > target blast-like output matrix (`$crbh1`) and the target > query blast-like output matrix (`$crbh2`). For each matrix the `$rbh_class` indicates if the hit is a Reciprocal Best Hit (`rbh`) or if it is a conditional secondary hit retained because of RBH fitting (CRBH algorithm) (`sec`). ```{r} ## example showing cds2rbh() results ## show dimension and first parts of retained hit pairs dim(ath_aly_crbh$crbh.pairs) head(ath_aly_crbh$crbh.pairs) ## show first retained hit pairs for the query > target matrix head(ath_aly_crbh$crbh1) ## get the number of CRBHit classified as rbh and sec hit pairs table(ath_aly_crbh$crbh1$rbh_class) table(ath_aly_crbh$crbh2$rbh_class) ``` If the `@param` `crbh` was set to `TRUE`, the CRBH algorithm was applied and the fitting function will also be returned. The fitting function can than be used for manual plotting as follows: ```{r} ## example how to use the fitting function for manual plotting ## plot fitting function curve(ath_aly_crbh$rbh1_rbh2_fit(x), from=1, to=1000, xlab="alnlength", ylab="-log10(evalue)", main="CRBH fitting") ``` One can also specify if a hit pair which is only found in one direction should be retained, which will be classified as (`single`). ```{r} ## example how to retain single direction secondary homologs ## get CRBHit pairs with keepSingleDirection = TRUE ath_aly_crbh <- cds2rbh( cds1=cds1, cds2=cds2, plotCurve=TRUE, keepSingleDirection=TRUE, lastpath=vignette.paths[1]) ## get the number of CRBHit classified as rbh and sec hit pairs table(ath_aly_crbh$crbh1$rbh_class) table(ath_aly_crbh$crbh2$rbh_class) dim(ath_aly_crbh) ## get help #?cds2rbh ``` ### 2.2. [2. step: filter hits + RBH extraction]{#crbhstep2} All further described steps are included in the `cds2rbh()` and `cdsfile2rbh()` function. The subsections will just explain in more detail what is happening with the input data. Until now the hit pairs obtained via the sequence similarity searches were only filtered by default settings (e-value < 0.001). The following subsections will explain the default filters in more detail and show how one can create and apply custom filter on the hit pairs prior RBH extraction and fitting to obtain CRBHit pairs. #### 2.2.1. [Filter blast-like output prior fitting]{#filters} The following filters are already defined and can be used out of the box: - evalue (`@param` `evalue = 0.001`) - query coverage (`@param` `qcov = 0.0`) - target coverage (`@param` `tcov = 0.0`) - protein identity (`@param` `pident = 0`) - alignment length (`@param` `alnlen = 0`) - rost1999 (`@param` `rost1999 = FALSE`) Taking the length of the obtained pairwise protein alignment one can calculate for each hit pair the query coverage as $\frac{alignment length}{query length}$. The data will be filtered with a query coverage of 50\% using the `@param` `qcov` as follows: `cds2rbh(..., qcov = 0.5)` ```{r} ## example how to filter prior crbh for query coverage of 50% ## get CRBHit pairs with direct query coverage filter ath_aly_crbh <- cds2rbh( cds1=cds1, cds2=cds2, plotCurve=TRUE, qcov=0.5, lastpath=vignette.paths[1]) dim(ath_aly_crbh$crbh.pairs) ## get help #?cds2rbh ``` In addition, the users can filter for the **twilight zone of protein sequence alignments** according to [Rost B. (1999)](https://academic.oup.com/peds/article/12/2/85/1550637). This **twilight zone** indicates proteins, which might share a common ancestor (homologue genes) but have diverged from each other through time so that similarity is no longer detectable. Here, a certain alignment length and protein identity is at least necessary to indicate *bona-fide* homologues. The following equation and plot will show how the relationship between alignment length and protein identity is defined for the `rost1999` filter. The implemented `rost1999` filter uses `equation2` of [Rost B. (1999)](https://academic.oup.com/peds/article/12/2/85/1550637) $$f(x_{\text{hit pair}}) = \cases { 100 \text{ , for } L_{\text{hit pair}} < 11 \\ 480 * L^{-0.32 * (1 + e^{\frac{-L}{1000}}) } \text{ , for } L_{\text{hit pair}} \leq 450 \\ 19.5 \text{ , for } L_{\text{hit pair}} > 450}$$ , where $x_{\text{hit pair}}$ is the expected protein identity given the alignemnet length $L_{\text{hit pair}}$. If the actual $pident_{\text{hit pair}} >= f(x_{\text{hit pair}})$ the hit pair is retained. ```{r} ## detailed explanation for the Rost (1999) twilight-zone filter ## define eq2 from Rost (1999) get_pident_by_length <- function(x){ eq2 <- function(L){ if(L<=11){return(100)} if(L<=450){return(480*(L^(-0.32*(1+(exp(-L/1000))))))} if(L>450){return(19.5)} } return(unlist(lapply(x, eq2))) } ## plot expected pident by alignment length using eq2 from Rost (1999) curve(get_pident_by_length, 11, 500, pch=20, xlab="alignment length", ylab="pident", main="expected protein identity (eq2; Rost B. 1999)") ``` This **twilight zone** filter can be directly used by the `@param` `rost1999` set to `TRUE`. ```{r} ## example how to filter prior crbh for eq2 from Rost (1999) ## get CRBHit pairs with direct twilight-zone filter ath_aly_crbh <- cds2rbh( cds1=cds1, cds2=cds2, plotCurve=TRUE, rost1999=TRUE, lastpath=vignette.paths[1]) dim(ath_aly_crbh$crbh.pairs) ## get help #?cds2rbh ## get help #?filter.alnlen ## get help #?filter.eval ## get help #?filter.pident ## get help #?filter.qcov ## get help #?filter.rost1999 ## get help #?filter.tcov ``` #### 2.2.2. [Custom Filter]{#filters} The user can also define its own filter function and apply it to the blast-like output prior fitting and calculating CRBHs as follows: ```{r} ## example for a custom filters ## define custom filter for e.g. bit score (column 12) myfilter1 <- function(rbh, value=500.0){ return(dplyr::filter(rbh, bit_score>=value)) } ## define custom filter for e.g. corrected query_coverage myfilter2 <- function(rbh, value=0.5){ return(dplyr::filter(rbh, ((alignment_length-mismatches-gap_opens) / query_length)>=value)) } ## get CRBHit pairs with custom filter list ath_aly_crbh <- cds2rbh( cds1=cds1, cds2=cds2, plotCurve=TRUE, filter=list(myfilter1, myfilter2), lastpath=vignette.paths[1]) dim(ath_aly_crbh$crbh.pairs) ``` #### 2.2.3. [Extract RBHs]{#rbhextraction} As already explained in the Sequence Similarity Search section (#sequencesearch), both, `cds2rbh()` and `cdsfile2rbh()` function, return a list of three `data.frame`'s which contain the CRBHit pairs (`$crbh.pairs`) retained, the query > target blast-like output matrix (`$crbh1`) and the target > query blast-like output matrix (`$crbh2`). For each matrix the `$rbh_class` indicates if the hit is a Reciprocal Best Hit (`rbh`) or if it is a conditional secondary hit retained because of RBH fitting (CRBH algorithm) (`sec`). To extract just the RBHs one can subset based on the `rbh_class` columns as follows: ```{r} ## example to extract CRBHit pairs classified as rbh ## reduce to rbh_class rbh data("ath_aly_crbh", package = "CRBHits") head(dplyr::filter(ath_aly_crbh$crbh1, rbh_class=="rbh")) head(dplyr::filter(ath_aly_crbh$crbh2, rbh_class=="rbh")) ``` ### 2.3. [3. step: fitting RBHs + get CRBHs]{#crbhstep3} All further described steps are included in the `cds2rbh()` and `cdsfile2rbh()` function. The subsections will just explain in more detail what is happening with the input data. #### 2.3.1. [CRBHit pairs - Fitting Parameter]{#fittingparam} The original implementation of the CRBH algorithm [crb-blast.rb line 438-518](https://github.com/cboursnell/crb-blast/blob/master/lib/crb-blast/crb-blast.rb) uses certain parameters for fitting the e-value against the alignment length. In brief, for each retained reciprocal hit pair (RBH; after filtering) the e-value and the corresponding alignment length of a hit pair and its neighborhood is evaluated. The neighborhood is defined by a weighting parameter of the alignment length (default is set to `@param` `fit.varweight = 0.1`) and a minimum neighborhood alignment length (default is set to `@param` `fit.min = 5`). For each unique alignment length $L$ a value $s_{L}$ is defined as the neighborhood alignment length as $$s_{L} = \cases { \text{fit.min} \text{ , for } L * \text{fit.varweight} < \text{fit.min} \\ L * \text{fit.varweight} \text{ , for } L * \text{fit.varweight} \geq \text{fit.min}}$$ E.g. for an alignment length of 100 and the default settings, $s_{100} = 100 * 0.1 = 10$. All RBHs, for which $L-s_{L} \leq L_{RBH_i} \leq L + s_{L}$, in other words, for which their alignment lengths fall between 90 and 110 are than considered for fitting a cutoff at that alignment length using the mean of the $-log10(evalue)$. $$\text{cutoff}_{L} = \frac{1}{n} \sum_{i = 1}^{n}{ \cases{324 \text{ , for } evalue_{RBH_i} = 0 \\ -log10(evalue_{\text{RBH}_i}) \text{ , for } evalue_{RBH_i} \ne 0}}$$ The user can now alter between `mean` and `median` using the `@param` `fit.type = "mean"` or `fit.type = "median"` to get alternative fitting curves. ```{r} ## example how to get crbh from two coding fasta files using median fitting ## get CRBHit pairs with median fitting ath_aly_crbh <- cds2rbh( cds1=cds1, cds2=cds2, plotCurve=TRUE, fit.type="median", lastpath=vignette.paths[1]) ## get help #?cds2rbh ``` As described earlier, [CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) features downstream anaylsis like batch calculating synonymous (Ks) and nonsynonymous substitutions (Ka) per orthologous sequence pair. The following sections will give more details about these downstream functionalities. ## 3. [Ka/Ks Calculations]{#KaKs} ![Figure: Overview of the `rbh2kaks()` function steps](rbh2kaksoverview.png){width=800px} The resulting CRBHs (see #crbhalgorithm) can be further processed to e.g. filtered for Tandem Duplicate (see [CRBHits KaKs Vignette](https://mpievolbio-it.pages.gwdg.de/crbhits/articles/V02KaKsVignette.html)) and/or can be used to calculate Ka/Ks values for each CRBH pair (see #rbh2kaks and [CRBHits KaKs Vignette](https://mpievolbio-it.pages.gwdg.de/crbhits/articles/V02KaKsVignette.html)). All further described steps are included in the `rbh2kaks()` function. The subsections will just explain in more detail what is happening with the input data. To activate/deactivate specific options use the parameters `@param` given and explained for each function (see `?rbh2kaks()`). __Note:__ To be able to perform *Tandem Duplicate Assignment* with the `rbh2kaks()` function one needs to specify *gene positions* for both Input CDS (can be extracted automatically from *NCBI* or *ENSEMBL* Input (see `cds2genepos()` function)) and supply *selfblast* results for both Input CDS. ![Figure: Individual steps of the `rbh2kaks()` function](rbh2kaks.png){width=600px} ### 3.1. [Codon Alignments - cds2codonaln() Function]{#cds2codon} To be able to calculate synonymous (Ks) and non-synonymous (Ka) substitutions, one needs to use a codon alignment. The `MSA2dist::cds2codonaln()` function takes two single nucleotide sequences as input to obtain such a codon alignment. The function will convert the nucleotide sequences into amino acid sequences, align them with the help of the `pairwiseAlignment()` function from the `Biostrings` R package and convert this alignment back into a codon alignment. ```{r} ## example to get a codon alignment ## define two CDS cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATGCATTGC") ## get codon alignment MSA2dist::cds2codonaln( cds1=cds1, cds2=cds2) ## get help #?MSA2dist::cds2codonaln ``` The user can alter some `@param` to change the alignment parameters, like `type`, `substitutionMatrix`, `gapOpening` and `gapExtension` costs as defined by the `pairwiseAlignment()` function from the `Biostrings` R package. ```{r} ## example to alter the substitionMatrix and use the BLOSUM45 cost matrix ## for the codon alignment ## get codon alignemnt with BLOSUM45 cost matrix MSA2dist::cds2codonaln( cds1=cds1, cds2=cds2, substitutionMatrix="BLOSUM45") ``` The user can also remove gaps in the codon alignment. ```{r} ## example to remove codon gaps ## get codon alignment with gaps removed MSA2dist::cds2codonaln( cds1=cds1, cds2=cds2, remove.gaps=TRUE) ``` ### 3.2. [MSA2dist::dnastring2kaks() Function]{#cds2kaks} The Ka/Ks (dN/dS) values can be obtained either via the codon model of [Li WH. (1999)](https://www.ncbi.nlm.nih.gov/pubmed/8433381) implemented in the R package [seqinr](https://cran.r-project.org/web/packages/seqinr/index.html) or the model [Yang Z and Nielson R. (2000)](https://www.ncbi.nlm.nih.gov/pubmed/10666704) implemented in [KaKs_Calculator2.0](https://sourceforge.net/projects/kakscalculator2/files/KaKs_Calculator2.0.tar.gz/download) (see [Quick Installation](#quickinstall) how to compile the prerequisites). The `MSA2dist::dnastring2kaks()` function takes single nucleotide sequences as input to obtain a codon alignment and calculate Ka/Ks. ```{r} ## calculate Ka/Ks on two CDS ## load example sequence data data("ath", package="CRBHits") data("aly", package="CRBHits") ## select a sequence pair according to a best hit pair (done for you) cds1 <- ath[1] cds2 <- aly[282] ## calculate Ka/Ks values on two CDS using Li model MSA2dist::dnastring2kaks( c(cds1, cds2), model="Li", isMSA=FALSE) ## get help #?MSA2dist::dnastring2kaks ``` All parameters which can be used by the `MSA2dist::cds2codonaln()` function can be overgiven to e.g. use an alternative `substitutionMatrix`. ```{r} ## example to use an alternative substitutionMatrix for the codon alignment ## and obtain Ka/Ks ## calculate Ka/Ks values on two CDS using Li model and BLOSUM45 cost matrix MSA2dist::dnastring2kaks( c(cds1, cds2), model="Li", isMSA=FALSE, substitutionMatrix="BLOSUM45") ``` ### 3.3. [rbh2kaks() Function]{#rbh2kaks} The resulting CRBHit pairs (see [cds2rbh()](#cds2rbhfunction)) can be used with the `rbh2kaks()` function to obtain pairwise codon alignments, which are further used to calculate synonymous (Ks) and nonsynonymous (Ka) substitutions using parallelization. __Note:__ It is important, that the names of the `rbh` columns must match the names of the corresponding `cds1` and `cds2` `DNAStringSet` vectors. However, since one should directly use the same input `DNAStringSet` vector or input `url` to calculate the reciprocal best hit pair matrix with the `cds2rbh()` or the `cdsfile2rbh()` function, this should not be a problem. ```{r} ## example how to get CRBHit pairs from two coding fasta files cdsfile1 <- system.file("fasta", "ath.cds.fasta.gz", package="CRBHits") cdsfile2 <- system.file("fasta", "aly.cds.fasta.gz", package="CRBHits") ath <- Biostrings::readDNAStringSet(cdsfile1) aly <- Biostrings::readDNAStringSet(cdsfile2) ## the following function calculates CRBHit pairs ## using one thread and plots the fitted curve ath_aly_crbh <- cds2rbh( cds1=ath, cds2=aly, lastpath=vignette.paths[1]) ## calculate Ka/Ks using the CRBHit pairs ath_aly_crbh$crbh.pairs <- head(ath_aly_crbh$crbh.pairs) ath_aly_crbh.kaks <- rbh2kaks( rbhpairs=ath_aly_crbh, cds1=ath, cds2=aly, model="Li") head(ath_aly_crbh.kaks) ## get help #?rbh2kaks ``` The Ka/Ks calculations can be parallelized. ```{r} ## calculate Ka/Ks using the CRBHit pairs and multiple threads ath_aly_crbh.kaks <- rbh2kaks( rbhpairs=ath_aly_crbh, cds1=ath, cds2=aly, model="Li", threads=2) head(ath_aly_crbh.kaks) ## get help #?rbh2kaks ``` Like for the `MSA2dist::dnastring2kaks()` function, all parameters which can be used by the `MSA2dist::cds2codonaln()` function can be overgiven to e.g. use an alternative `substitutionMatrix`. ```{r} ## calculate Ka/Ks using the CRBHit pairs and multiple threads ath_aly_crbh.kaks <- rbh2kaks( rbhpairs=ath_aly_crbh, cds1=ath, cds2=aly, model="Li", threads=2, substitutionMatrix="BLOSUM45") head(ath_aly_crbh.kaks) ## get help #?rbh2kaks ``` ## 4. [References]{#references} Aubry S., Kelly S., Kümpers B. M., Smith-Unna R. D., and Hibberd J. M. (2014). **Deep evolutionary comparison of gene expression identifies parallel recruitment of trans-factors in two independent origins of C4 photosynthesis.** *PLoS Genetics*, **10(6)**. [https://doi.org/10.1371/journal.pgen.1004365](https://doi.org/10.1371/journal.pgen.1004365) Charif D., and Lobry J. R. (2007). **SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis.** *In Structural approaches to sequence evolution* (pp. 207-232). Springer, Berlin, Heidelberg. [https://link.springer.com/chapter/10.1007/978-3-540-35306-5_10](https://link.springer.com/chapter/10.1007/978-3-540-35306-5_10) Duong T., and Wand M. (2015). **feature: Local Inferential Feature Significance for Multivariate Kernel Density Estimation.** *R package version 1.2.13*. [https://cran.r-project.org/web/packages/feature/](https://cran.r-project.org/web/packages/feature/) Haas B. J., Delcher A. L., Wortman J. R., and Salzberg S. L. (2004). **DAGchainer: a tool for mining segmental genome duplications and synteny.** *Bioinformatics*, **20(18)**, 3643-3646. [https://doi.org/10.1093/bioinformatics/bth397](https://doi.org/10.1093/bioinformatics/bth397) Haug-Baltzell A., Stephens S. A., Davey S., Scheidegger C. E., Lyons E. (2017). **SynMap2 and SynMap3D: web-based wholge-genome synteny browsers.** *Bioinformatics*, **33(14)**, 2197-2198. [https://academic.oup.com/bioinformatics/article/33/14/2197/3072872](https://academic.oup.com/bioinformatics/article/33/14/2197/3072872) Kiełbasa S. M., Wan R., Sato K., Horton P., and Frith M. C. (2011). **Adaptive seeds tame genomic sequence comparison.** *Genome Research*, **21(3)**, 487-493. [https://doi.org/10.1101/gr.113985.110](https://doi.org/10.1101/gr.113985.110) Kimura M. (1977). **Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution.** *Nature*, **267**, 275-276. Li W. H. (1993). **Unbiased estimation of the rates of synonymous and nonsynonymous substitution.** *Journal of Molecular Evolution*, **36(1)**, 96-99. [https://doi.org/10.1007/bf02407308](https://doi.org/10.1007/bf02407308) Microsoft, and Weston S. (2020). **foreach: Provides Foreach Looping Construct.** *R package version*, 1.5.1. [foreach](https://cran.r-project.org/web/packages/foreach/index.html) Mugal C. F., Wolf J. B. W., Kaj I. (2014). **Why Time Matters: Codon Evolution and the Temproal Dynamics of dN/dS.** *Molecular Biology and Evolution*, **31(1)**, 212-231. Ooms J. (2019). **curl: A Modern and Flexible Web Client for R.** *R package version*, 4.3. [curl](https://cran.r-project.org/web/packages/curl/index.html) Pagès H., Aboyoun P., Gentleman R., and DebRoy S. (2017). **Biostrings: Efficient manipulation of biological strings.** *R package version*, 2.56.0. [Biostrings](http://bioconductor.org/packages/release/bioc/html/Biostrings.html) Revolution Analytics, and Weston S. (2020). **doMC: Foreach Parallel Adaptor for 'parallel'.** *R package version*, 1.3.7. [doMC](https://cran.r-project.org/web/packages/doMC/index.html) Rost B. (1999). **Twilight zone of protein sequence alignments.** *Protein Engineering*, **12(2)**, 85-94. [https://doi.org/10.1093/protein/12.2.85](https://doi.org/10.1093/protein/12.2.85) Scott C. (2017). **shmlast: an improved implementation of conditional reciprocal best hits with LAST and Python.** *Journal of Open Source Software*, **2(9)**, 142. [https://joss.theoj.org/papers/10.21105/joss.00142](https://joss.theoj.org/papers/10.21105/joss.00142) Scrucca L., Fop M., Murphy T. B., and Raftery A. E. (2016) **mclust 5: clustering, classification and density estimation using Gaussian finite mixture models.** *The R Journal*, 8(1), 289-317. [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5096736/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5096736/) Wickham H. (2011). **testthat: Get Started with Testing.** *The R Journal*, 3(1), 5. [testthat](https://cran.r-project.org/web/packages/testthat/index.html) Wickham H. (2019). **stringr: Simple, Consistent Wrappers for Common String Operations.** *R package version*, 1.4.0. [stringr](https://cran.r-project.org/web/packages/stringr/index.html) Wickham H. (2020). **tidyr: Tidy Messy Data.** *R package version*, 1.1.2. [tidyr](https://cran.r-project.org/web/packages/tidyr/index.html) Wickham H., Hester J., and Chang W. (2020). **devtools: Tools to make Developing R Packages Easier.** *R package version*, (2.3.2). [devtools](https://cran.r-project.org/web/packages/devtools/index.html) Wickham H., François R., Henry L., and Müller K. (2020). **dplyr: A Grammar of Data Manipulation.** *R package version*, 1.0.2. [dplyr](https://cran.r-project.org/web/packages/dplyr/index.html) Yang Z., and Nielsen R. (2000). **Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models.** *Molecular Biology and Evolution*, **17(1)**, 32-43. [https://doi.org/10.1093/oxfordjournals.molbev.a026236](https://doi.org/10.1093/oxfordjournals.molbev.a026236)