---
title: "CRBHits Basic Vignette"
author: "Kristian K Ullrich"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{CRBHits Basic Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# CRBHit Basic Vignette

- includes CRBHit pair calculation
- includes CRBHit pair filtering
- includes Longest Isoform selection
- includes Codon alignments
- includes Ka/Ks calculations

## Table of Contents

1. [Installation](#install)
2. [Conditional Reciprocal Best Hits - Algorithm](#crbhalgorithm)
    1. [1. step: sequence similarity search (last)](#crbhstep1)
        1. [Input: Coding Sequences (CDS)](#cdsinput)
        2. [Get Longest Isoform from NCBI or ENSEMBL Input (optional)](#isoform)
        3. [Sequence Similarity Search](#sequencesearch)
    2. [2. step: filter hits + RBH extraction](#crbhstep2)
        1. [Filter blast-like output prior fitting](#filters)
        2. [Custom Filter](#customfilters)
        3. [Extract RBHs](#rbhextraction)
    3. [3. step: fitting RBHs + get CRBHs](#crbhstep3)
        1. [CRBHit pairs - Fitting Parameter](#fittingparam)
3. [Ka/Ks Calculations](#KaKs)
    1. [Codon Alignments - cds2codonaln() Function](#cds2codon)
    2. [Calculate Ka/Ks values comparing two sequences - cds2kaks() Function](#cds2kaks)
    3. [Calculate Ka/Ks values for each CRBHit pair - rbh2kaks() Function](#rbh2kaks)
4. [References](#references)

[CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) is a reimplementation of
the Conditional Reciprocal Best Hit (CRBH) algorithm
[crb-blast](https://github.com/cboursnell/crb-blast) in R.

The Reciprocal Best Hit (RBH) approach is commonly used in bioinformatics to
show that two sequences evolved from a common ancestral gene. In other words,
RBH tries to find orthologous protein sequences within and between species.
These orthologous sequences can be further analysed to evaluate protein family
evolution, infer phylogenetic trees and to annotate protein function.

The CRBH algorithm builds upon the traditional RBH approach to find additional
orthologous sequences between two sets of sequences.

CRBH uses the sequence search results to fit an expect-value (e-value) cutoff
given each RBH to subsequently add sequence pairs to the list of *bona-fide*
orthologs given their alignment length. The sequence similarity results can be
filtered by common criteria like e-value, query-coverage and more. If
chromosome annotations are available for the protein-coding genes
(protein order), this information can be directly used to assign tandem
duplicates before conducting further processing steps.

[CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) features downstream
anaylsis like batch calculating synonymous (Ks) and nonsynonymous substitutions
(Ka) per orthologous sequence pair. The codon sequence alignment step consist
of three subtasks, namely coding nucleotide to protein sequence translation,
pairwise protein sequence alignment calculation and converting the protein
sequence alignment back into a codon based alignment.

__Note:__

In this Vignette the `lastpath` is defined as `vignette.paths[1]` and the `dagchainerpath` as
`vignette.paths[2]` to be able to build the Vignette.

However, once you have compiled last-1639 and DAGchainer
with the functions `make_last()` and
`make_dagchainer()` you won't need to specify the paths anymore. Please remove
them if you would like to repeat the examples.

```{r}
## load vignette specific libraries
library(CRBHits)
suppressPackageStartupMessages(library(Biostrings))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(curl))
## compile LAST and DAGchainer for the vignette
vignette.paths <- CRBHits::make_vignette()
```

## 1. [Installation]{#install}

See the R package page for a detailed description of the install process and
its dependencies
[https://mpievolbio-it.pages.gwdg.de/crbhits/](https://mpievolbio-it.pages.gwdg.de/crbhits/).

```
## see a detailed description of installation prerequisites at
## https://mpievolbio-it.pages.gwdg.de/crbhits/
library(devtools)
## from gitlab
install_gitlab("mpievolbio-it/crbhits", host = "https://gitlab.gwdg.de",
               build_vignettes = TRUE, dependencies = FALSE)
## from github
install_github("kullrich/CRBHits", build_vignettes = TRUE, dependencies = FALSE)
```

[CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) needs
[LAST](https://gitlab.com/mcfrith/last) and [DAGchainer](http://dagchainer.sourceforge.net/) to be installed before one can efficiently use it.

All prerequisites (LAST, DAGchainer) are forked within
[CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) and can be compiled for
Linux/Unix/macOS with the functions `make_last()` and  `make_dagchainer()`.

For the windows platform the user needs to use e.g.
[https://www.cygwin.com/](https://www.cygwin.com/) to be able to compile the
prerequisites.

```
library(CRBHits)
## compile last-1639
make_last()
## compile DAGchainer
make_dagchainer()
```

## 2. [Conditional Reciprocal Best Hits - Algorithm]{#crbhalgorithm}

![Figure: Overview of the `cds2rbh()` function](cds2rbhoverview.png){width=400px}

The CRBH algorithm was introduced by
[Aubry S, Kelly S et al. (2014)](http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1004365)
and ported to python [shmlast](https://pypi.org/project/shmlast/)
([Scott C. 2017](https://joss.theoj.org/papers/10.21105/joss.00142)) which
benefits from the blast-like sequence search software
[LAST](http://last.cbrc.jp/)
([Kiełbasa SM et al. 2011](https://genome.cshlp.org/content/21/3/487.long)).

The CRBH algorithm implementation consists of three steps which are explained
in more detail in the following subsections.

1. step: sequence similarity search (last)(#crbhstep1)
2. step: filter hits and extract Reciprocal Best Hits (RBHs)(#crbhstep2)
3. step: fitting RBHs and get Conditional Reciprocal Best Hits (CRBHs)(#crbhstep3)

All of these steps are combined in either the `cds2rbh()` function
(R vector input) or the `cdsfile2rbh()` function (file input).

To activate/deactivate specific options use the parameters `@param` given and
explained for each function (see `?cds2rbh()` and `?cdsfile2rbh()`). 

![Figure: Individual steps of the `cds2rbh()` function](cds2rbh.png){width=800px}

### 2.1. [1. step: sequence similarity search (last)]{#crbhstep1}

All further described steps are included in the `cds2rbh()` and `cdsfile2rbh()`
function. The subsections will just explain in more detail what is happening
with the input data.

#### 2.1.1. [Input: Coding Sequences (CDS)]{#cdsinput}

[CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) only takes coding
nucleotide sequences as the query and target inputs.

__Why?__ Because [CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) is
implemented to use the same input afterwards to calculate synonymous and
nonsynonymous substitutions which must match the protein sequences used to find
CRBHit pairs.

One can either use a `DNAStringSet` vector from the `Biostrings` R package with
the `cds2rbh()` function or use **raw**,**gzipped** or **URL** coding sequences
**fasta** files as inputs with the `cdsfile2rbh()` function.

[CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) uses either the
function `cdsfile2aafile()` to translate a coding sequence **fasta** file into
its corresponding protein sequence **fasta** file or `cds2aa()` to translate a
`DNAStringSet` into a `AAStringSet`. Thereby it removes all sequences that are
not a multiple of three which can not be parsed correctly.

__Note:__ The user can easily check if all coding sequences would be a multiple
of three or rely on the input files generated by other sources.

```{r}
## example how to check coding sequences if all are a mutiple of three

## load CDS file
cdsfile <- system.file("fasta", "ath.cds.fasta.gz", package="CRBHits")
cds <- Biostrings::readDNAStringSet(cdsfile)
## the following statement should return TRUE,
## if all sequences are a mutiple of three
all(Biostrings::width(cds) %% 3==0)
```

#### 2.1.2. [Get Longest Isoform from NCBI or ENSEMBL Input (optional)]{#isoform}

It is also possible to use an URL to directly access coding sequences from
[NCBI](https://www.ncbi.nlm.nih.gov/genome) or
[ENSEMBL](ftp://ftp.ensembl.org/pub/current_fasta) and reduce this input to the
longest annotated isoform.

````
## example how to access CDS from URL and get longest isoform

## get coding sequences for Homo sapiens from NCBI
HOMSAP.cds.NCBI.url <- paste0(
  "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/",
  "GCF_000001405.39_GRCh38.p13/",
  "GCF_000001405.39_GRCh38.p13_cds_from_genomic.fna.gz")
HOMSAP.cds.NCBI <- Biostrings::readDNAStringSet(HOMSAP.cds.NCBI.url)
## reduce to the longest isoform
HOMSAP.cds.NCBI.longest <- isoform2longest(HOMSAP.cds.NCBI, "NCBI")
## get coding sequences for Homo sapiens from ENSEMBL
HOMSAP.cds.ENSEMBL.url <- paste0(
  "ftp://ftp.ensembl.org/pub/release-101/fasta/",
  "homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz")
HOMSAP.cds.ENSEMBL.file <- tempfile()
download.file(HOMSAP.cds.ENSEMBL.url, HOMSAP.cds.ENSEMBL.file, quiet=FALSE)
HOMSAP.cds.ENSEMBL <- Biostrings::readDNAStringSet(HOMSAP.cds.ENSEMBL.file)
## reduce to the longest isoform
HOMSAP.cds.ENSEMBL.longest <- isoform2longest(HOMSAP.cds.ENSEMBL, "ENSEMBL")

## get help
#?isoform2longest
````

__Note:__ It is also possible to obtain and select the longest/primary isoform
from CDS Input sources other than NCBI or Ensembl. This can be achieved by
parsing a GTF/GFF3 file. This special case is handled in the
[CRBHits KaKs Vignette](https://mpievolbio-it.pages.gwdg.de/crbhits/articles/V02KaKsVignette.html)
under point 1.2.2, 1.2.3 and 1.4.2.

#### 2.1.3. [Sequence Similarity Search]{#sequencesearch}

The blast-like software [LAST](http://last.cbrc.jp/) is used to compare the
translated coding sequences against each other and output a blast-like output
table including the query and target length.

__Note:__ Classical RBH can be performed by disabling the `crbh` `@param`of the
`cds2rbh()` or `cdsfile2rbh()` function.

```{r}
## example how to get CRBHit pairs from two CDS using classical RBH

## define CDS file1
cdsfile1 <- system.file("fasta", "ath.cds.fasta.gz", package="CRBHits")
## define CDS file2
cdsfile2 <- system.file("fasta", "aly.cds.fasta.gz", package="CRBHits")
## get CDS1
cds1 <- Biostrings::readDNAStringSet(cdsfile1)
## get CDS2
cds2 <- Biostrings::readDNAStringSet(cdsfile2)
## example how to perform classical RBH
ath_aly_crbh <- cds2rbh(
    cds1=cds1,
    cds2=cds2,
    crbh=FALSE,
    lastpath=vignette.paths[1])
## show summary
summary(ath_aly_crbh)

## get help
#?cds2rbh
```

Like [shmlast](https://pypi.org/project/shmlast/),
[CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) plots the fitted model
of the CRBH e-value based algorithm. The CRBH algorithm can be performed and
visualized as follows:

```{r}
## example how to get CRBHit pairs using one thread
## and plot CRBHit algorithm fitting curve

## example how to perform CRBH
ath_aly_crbh <- cds2rbh(
    cds1=cds1,
    cds2=cds2,
    plotCurve=TRUE,
    lastpath=vignette.paths[1])
## show summary
summary(ath_aly_crbh)

## get help
#?cds2rbh
```

Both, `cdsfile2rbh()` and `cds2rbh()` function, return a list of three
`data.frame`'s which contain the CRBH pairs (`$crbh.pairs`) retained, the
query > target blast-like output matrix (`$crbh1`) and the target > query
blast-like output matrix (`$crbh2`). For each matrix the `$rbh_class` indicates
if the hit is a Reciprocal Best Hit (`rbh`) or if it is a conditional secondary
hit retained because of RBH fitting (CRBH algorithm) (`sec`). 

```{r}
## example showing cds2rbh() results

## show dimension and first parts of retained hit pairs
dim(ath_aly_crbh$crbh.pairs)
head(ath_aly_crbh$crbh.pairs)

## show first retained hit pairs for the query > target matrix
head(ath_aly_crbh$crbh1)

## get the number of CRBHit classified as rbh and sec hit pairs
table(ath_aly_crbh$crbh1$rbh_class)
table(ath_aly_crbh$crbh2$rbh_class)
```

If the `@param` `crbh` was set to `TRUE`, the CRBH algorithm was applied and
the fitting function will also be returned. The fitting function can than be
used for manual plotting as follows:

```{r}
## example how to use the fitting function for manual plotting

## plot fitting function
curve(ath_aly_crbh$rbh1_rbh2_fit(x),
    from=1,
    to=1000,
    xlab="alnlength",
    ylab="-log10(evalue)",
    main="CRBH fitting")
```

One can also specify if a hit pair which is only found in one direction should
be retained, which will be classified as (`single`).

```{r}
## example how to retain single direction secondary homologs

## get CRBHit pairs with keepSingleDirection = TRUE
ath_aly_crbh <- cds2rbh(
    cds1=cds1,
    cds2=cds2,
    plotCurve=TRUE,
    keepSingleDirection=TRUE,
    lastpath=vignette.paths[1])
## get the number of CRBHit classified as rbh and sec hit pairs
table(ath_aly_crbh$crbh1$rbh_class)
table(ath_aly_crbh$crbh2$rbh_class)
dim(ath_aly_crbh)

## get help
#?cds2rbh
```

### 2.2. [2. step: filter hits + RBH extraction]{#crbhstep2}

All further described steps are included in the `cds2rbh()` and `cdsfile2rbh()`
function. The subsections will just explain in more detail what is happening
with the input data.

Until now the hit pairs obtained via the sequence similarity searches were only
filtered by default settings (e-value < 0.001). The following subsections will
explain the default filters in more detail and show how one can create and
apply custom filter on the hit pairs prior RBH extraction and fitting to obtain
CRBHit pairs.

#### 2.2.1. [Filter blast-like output prior fitting]{#filters}

The following filters are already defined and can be used out of the box:

- evalue (`@param` `evalue = 0.001`)
- query coverage (`@param` `qcov = 0.0`)
- target coverage (`@param` `tcov = 0.0`)
- protein identity (`@param` `pident = 0`)
- alignment length (`@param` `alnlen = 0`)
- rost1999 (`@param` `rost1999 = FALSE`)

Taking the length of the obtained pairwise protein alignment one can calculate
for each hit pair the query coverage as $\frac{alignment length}{query length}$.

The data will be filtered with a query coverage of 50\% using the
`@param` `qcov` as follows: `cds2rbh(..., qcov = 0.5)`

```{r}
## example how to filter prior crbh for query coverage of 50%

## get CRBHit pairs with direct query coverage filter
ath_aly_crbh <- cds2rbh(
    cds1=cds1,
    cds2=cds2,
    plotCurve=TRUE,
    qcov=0.5,
    lastpath=vignette.paths[1])
dim(ath_aly_crbh$crbh.pairs)

## get help
#?cds2rbh
```

In addition, the users can filter for the
**twilight zone of protein sequence alignments** according to
[Rost B. (1999)](https://academic.oup.com/peds/article/12/2/85/1550637).
This **twilight zone** indicates proteins, which might share a common ancestor
(homologue genes) but have diverged from each other through time so that
similarity is no longer detectable. Here, a certain alignment length and
protein identity is at least necessary to indicate *bona-fide* homologues.

The following equation and plot will show how the relationship between
alignment length and protein identity is defined for the `rost1999` filter.

The implemented `rost1999` filter uses `equation2` of
[Rost B. (1999)](https://academic.oup.com/peds/article/12/2/85/1550637)

$$f(x_{\text{hit pair}}) = \cases {
100 \text{ , for } L_{\text{hit pair}} < 11 \\
480 * L^{-0.32 * (1 + e^{\frac{-L}{1000}})
} \text{ , for } L_{\text{hit pair}} \leq 450 \\
19.5 \text{ , for } L_{\text{hit pair}} > 450}$$
, where $x_{\text{hit pair}}$ is the expected protein identity given the
alignemnet length $L_{\text{hit pair}}$. If the actual
$pident_{\text{hit pair}} >= f(x_{\text{hit pair}})$ the hit pair is retained.

```{r}
## detailed explanation for the Rost (1999) twilight-zone filter

## define eq2 from Rost (1999)
get_pident_by_length <- function(x){
    eq2 <- function(L){
        if(L<=11){return(100)}
        if(L<=450){return(480*(L^(-0.32*(1+(exp(-L/1000))))))}
        if(L>450){return(19.5)}
    }
    return(unlist(lapply(x, eq2)))
}

## plot expected pident by alignment length using eq2 from Rost (1999)
curve(get_pident_by_length,
    11,
    500,
    pch=20,
    xlab="alignment length",
    ylab="pident",
    main="expected protein identity (eq2; Rost B. 1999)")
```

This **twilight zone** filter can be directly used by the `@param` `rost1999`
set to `TRUE`.

```{r}
## example how to filter prior crbh for eq2 from Rost (1999)

## get CRBHit pairs with direct twilight-zone filter
ath_aly_crbh <- cds2rbh(
    cds1=cds1,
    cds2=cds2,
    plotCurve=TRUE,
    rost1999=TRUE,
    lastpath=vignette.paths[1])
dim(ath_aly_crbh$crbh.pairs)

## get help
#?cds2rbh
## get help
#?filter.alnlen
## get help
#?filter.eval
## get help
#?filter.pident
## get help
#?filter.qcov
## get help
#?filter.rost1999
## get help
#?filter.tcov
```

#### 2.2.2. [Custom Filter]{#filters}

The user can also define its own filter function and apply it to the blast-like
output prior fitting and calculating CRBHs as follows:

```{r}
## example for a custom filters

## define custom filter for e.g. bit score (column 12)
myfilter1 <- function(rbh, value=500.0){
    return(dplyr::filter(rbh, bit_score>=value))
}
## define custom filter for e.g. corrected query_coverage
myfilter2 <- function(rbh, value=0.5){
    return(dplyr::filter(rbh,
        ((alignment_length-mismatches-gap_opens) / query_length)>=value))
}
## get CRBHit pairs with custom filter list
ath_aly_crbh <- cds2rbh(
    cds1=cds1,
    cds2=cds2,
    plotCurve=TRUE,
    filter=list(myfilter1, myfilter2),
    lastpath=vignette.paths[1])
dim(ath_aly_crbh$crbh.pairs)
```

#### 2.2.3. [Extract RBHs]{#rbhextraction}

As already explained in the Sequence Similarity Search section
(#sequencesearch), both, `cds2rbh()` and `cdsfile2rbh()` function, return a
list of three `data.frame`'s which contain the CRBHit pairs (`$crbh.pairs`)
retained, the query > target blast-like output matrix (`$crbh1`) and the
target > query blast-like output matrix (`$crbh2`). For each matrix the
`$rbh_class` indicates if the hit is a Reciprocal Best Hit (`rbh`) or if it is
a conditional secondary hit retained because of RBH fitting (CRBH algorithm)
(`sec`). 

To extract just the RBHs one can subset based on the `rbh_class` columns
as follows:
```{r}
## example to extract CRBHit pairs classified as rbh

## reduce to rbh_class rbh
data("ath_aly_crbh", package = "CRBHits")
head(dplyr::filter(ath_aly_crbh$crbh1, rbh_class=="rbh"))
head(dplyr::filter(ath_aly_crbh$crbh2, rbh_class=="rbh"))
```

### 2.3. [3. step: fitting RBHs + get CRBHs]{#crbhstep3}

All further described steps are included in the `cds2rbh()` and
`cdsfile2rbh()` function. The subsections will just explain in more detail what
is happening with the input data.

#### 2.3.1. [CRBHit pairs - Fitting Parameter]{#fittingparam}

The original implementation of the CRBH algorithm [crb-blast.rb line 438-518](https://github.com/cboursnell/crb-blast/blob/master/lib/crb-blast/crb-blast.rb)
uses certain parameters for fitting the e-value against the alignment length.

In brief, for each retained reciprocal hit pair (RBH; after filtering) the
e-value and the corresponding alignment length of a hit pair and its
neighborhood is evaluated. The neighborhood is defined by a weighting parameter
of the alignment length (default is set to `@param` `fit.varweight = 0.1`) and
a minimum neighborhood alignment length
(default is set to `@param` `fit.min = 5`).

For each unique alignment length $L$ a value $s_{L}$ is defined as the
neighborhood alignment length as

$$s_{L} = \cases {
\text{fit.min} \text{ , for } L * \text{fit.varweight} < \text{fit.min} \\
L * \text{fit.varweight} \text{ , for } L * \text{fit.varweight} \geq
\text{fit.min}}$$
E.g. for an alignment length of 100 and the default settings,
$s_{100} = 100 * 0.1 = 10$.

All RBHs, for which $L-s_{L} \leq L_{RBH_i} \leq L + s_{L}$, in other words,
for which their alignment lengths fall between 90 and 110 are than considered
for fitting a cutoff at that alignment length using the mean of the
$-log10(evalue)$.

$$\text{cutoff}_{L} = \frac{1}{n} \sum_{i = 1}^{n}{
\cases{324 \text{ , for }  evalue_{RBH_i} = 0 \\
-log10(evalue_{\text{RBH}_i}) \text{ , for } evalue_{RBH_i} \ne 0}}$$

The user can now alter between `mean` and `median` using the
`@param` `fit.type = "mean"` or `fit.type = "median"` to get alternative
fitting curves.

```{r}
## example how to get crbh from two coding fasta files using median fitting

## get CRBHit pairs with median fitting
ath_aly_crbh <- cds2rbh(
    cds1=cds1,
    cds2=cds2,
    plotCurve=TRUE,
    fit.type="median",
    lastpath=vignette.paths[1])

## get help
#?cds2rbh
```

As described earlier, [CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits)
features downstream anaylsis like batch calculating synonymous (Ks) and
nonsynonymous substitutions (Ka) per orthologous sequence pair. The following
sections will give more details about these downstream functionalities.

## 3. [Ka/Ks Calculations]{#KaKs}

![Figure: Overview of the `rbh2kaks()` function steps](rbh2kaksoverview.png){width=800px}

The resulting CRBHs (see #crbhalgorithm) can be further processed to e.g.
filtered for Tandem Duplicate (see [CRBHits KaKs Vignette](https://mpievolbio-it.pages.gwdg.de/crbhits/articles/V02KaKsVignette.html))
and/or can be used to calculate Ka/Ks values for each CRBH pair
(see #rbh2kaks and [CRBHits KaKs Vignette](https://mpievolbio-it.pages.gwdg.de/crbhits/articles/V02KaKsVignette.html)).

All further described steps are included in the `rbh2kaks()` function. The
subsections will just explain in more detail what is happening with the input
data.

To activate/deactivate specific options use the parameters `@param` given and
explained for each function (see `?rbh2kaks()`).

__Note:__ To be able to perform *Tandem Duplicate Assignment* with the
`rbh2kaks()` function one needs to specify *gene positions* for both Input CDS
(can be extracted automatically from *NCBI* or *ENSEMBL* Input
(see `cds2genepos()` function)) and supply *selfblast* results for both
Input CDS.

![Figure: Individual steps of the `rbh2kaks()` function](rbh2kaks.png){width=600px}

### 3.1. [Codon Alignments - cds2codonaln() Function]{#cds2codon}

To be able to calculate synonymous (Ks) and non-synonymous (Ka) substitutions,
one needs to use a codon alignment.

The `MSA2dist::cds2codonaln()` function takes two single nucleotide
sequences as input to obtain such a codon alignment. The function
will convert the nucleotide sequences into amino acid sequences,
align them with the help of the
`pairwiseAlignment()` function from the `Biostrings` R package and
convert this alignment back into a codon alignment.

```{r}
## example to get a codon alignment

## define two CDS
cds1 <- Biostrings::DNAString("ATGCAACATTGC")
cds2 <- Biostrings::DNAString("ATGCATTGC")
## get codon alignment
MSA2dist::cds2codonaln(
    cds1=cds1,
    cds2=cds2)

## get help
#?MSA2dist::cds2codonaln
```

The user can alter some `@param` to change the alignment parameters, like
`type`, `substitutionMatrix`, `gapOpening` and `gapExtension` costs as defined
by the `pairwiseAlignment()` function from the `Biostrings` R package.

```{r}
## example to alter the substitionMatrix and use the BLOSUM45 cost matrix
## for the codon alignment

## get codon alignemnt with BLOSUM45 cost matrix
MSA2dist::cds2codonaln(
    cds1=cds1,
    cds2=cds2,
    substitutionMatrix="BLOSUM45")
```

The user can also remove gaps in the codon alignment.

```{r}
## example to remove codon gaps

## get codon alignment with gaps removed
MSA2dist::cds2codonaln(
    cds1=cds1,
    cds2=cds2,
    remove.gaps=TRUE)
```

### 3.2. [MSA2dist::dnastring2kaks() Function]{#cds2kaks}

The Ka/Ks (dN/dS) values can be obtained either via the codon model of
[Li WH. (1999)](https://www.ncbi.nlm.nih.gov/pubmed/8433381) implemented in the
R package [seqinr](https://cran.r-project.org/web/packages/seqinr/index.html)
or the model
[Yang Z and Nielson R. (2000)](https://www.ncbi.nlm.nih.gov/pubmed/10666704)
implemented in
[KaKs_Calculator2.0](https://sourceforge.net/projects/kakscalculator2/files/KaKs_Calculator2.0.tar.gz/download)
(see [Quick Installation](#quickinstall) how to compile the prerequisites).

The `MSA2dist::dnastring2kaks()` function takes single nucleotide
sequences as input to obtain a codon alignment and calculate Ka/Ks.

```{r}
## calculate Ka/Ks on two CDS

## load example sequence data
data("ath", package="CRBHits")
data("aly", package="CRBHits")
## select a sequence pair according to a best hit pair (done for you)
cds1 <- ath[1]
cds2 <- aly[282]
## calculate Ka/Ks values on two CDS using Li model
MSA2dist::dnastring2kaks(
    c(cds1,
    cds2),
    model="Li",
    isMSA=FALSE)

## get help
#?MSA2dist::dnastring2kaks
```

All parameters which can be used by the `MSA2dist::cds2codonaln()`
function can be overgiven to e.g. use an alternative
`substitutionMatrix`.

```{r}
## example to use an alternative substitutionMatrix for the codon alignment
## and obtain Ka/Ks

## calculate Ka/Ks values on two CDS using Li model and BLOSUM45 cost matrix
MSA2dist::dnastring2kaks(
    c(cds1,
    cds2),
    model="Li",
    isMSA=FALSE,
    substitutionMatrix="BLOSUM45")
```

### 3.3. [rbh2kaks() Function]{#rbh2kaks}

The resulting CRBHit pairs (see [cds2rbh()](#cds2rbhfunction)) can
be used with the `rbh2kaks()` function to obtain pairwise codon
alignments, which are further used to calculate synonymous (Ks) and
nonsynonymous (Ka) substitutions using parallelization.

__Note:__

It is important, that the names of the `rbh` columns must match the
names of the corresponding `cds1` and `cds2` `DNAStringSet` vectors.

However, since one should directly use the same input `DNAStringSet`
vector or input `url` to calculate the reciprocal best hit pair
matrix with the `cds2rbh()` or the `cdsfile2rbh()` function, this
should not be a problem.

```{r}
## example how to get CRBHit pairs from two coding fasta files
cdsfile1 <- system.file("fasta", "ath.cds.fasta.gz", package="CRBHits")
cdsfile2 <- system.file("fasta", "aly.cds.fasta.gz", package="CRBHits")

ath <- Biostrings::readDNAStringSet(cdsfile1)
aly <- Biostrings::readDNAStringSet(cdsfile2)

## the following function calculates CRBHit pairs
## using one thread and plots the fitted curve
ath_aly_crbh <- cds2rbh(
    cds1=ath,
    cds2=aly,
    lastpath=vignette.paths[1])

## calculate Ka/Ks using the CRBHit pairs
ath_aly_crbh$crbh.pairs <- head(ath_aly_crbh$crbh.pairs)
ath_aly_crbh.kaks <- rbh2kaks(
    rbhpairs=ath_aly_crbh,
    cds1=ath,
    cds2=aly,
    model="Li")
head(ath_aly_crbh.kaks)

## get help
#?rbh2kaks
```

The Ka/Ks calculations can be parallelized.

```{r}
## calculate Ka/Ks using the CRBHit pairs and multiple threads
ath_aly_crbh.kaks <- rbh2kaks(
    rbhpairs=ath_aly_crbh,
    cds1=ath,
    cds2=aly,
    model="Li",
    threads=2)
head(ath_aly_crbh.kaks)

## get help
#?rbh2kaks
```

Like for the `MSA2dist::dnastring2kaks()` function, all parameters
which can be used by the `MSA2dist::cds2codonaln()` function can be
overgiven to e.g. use an alternative `substitutionMatrix`.

```{r}
## calculate Ka/Ks using the CRBHit pairs and multiple threads
ath_aly_crbh.kaks <- rbh2kaks(
    rbhpairs=ath_aly_crbh,
    cds1=ath,
    cds2=aly,
    model="Li",
    threads=2,
    substitutionMatrix="BLOSUM45")
head(ath_aly_crbh.kaks)

## get help
#?rbh2kaks
```

## 4. [References]{#references}

Aubry S., Kelly S., Kümpers B. M., Smith-Unna R. D., and Hibberd J. M. (2014). **Deep evolutionary comparison of gene expression identifies parallel recruitment of trans-factors in two independent origins of C4 photosynthesis.** *PLoS Genetics*, **10(6)**. [https://doi.org/10.1371/journal.pgen.1004365](https://doi.org/10.1371/journal.pgen.1004365)

Charif D., and Lobry J. R. (2007). **SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis.** *In Structural approaches to sequence evolution* (pp. 207-232). Springer, Berlin, Heidelberg. [https://link.springer.com/chapter/10.1007/978-3-540-35306-5_10](https://link.springer.com/chapter/10.1007/978-3-540-35306-5_10)

Duong T., and Wand M. (2015). **feature: Local Inferential Feature Significance for Multivariate Kernel Density Estimation.** *R package version 1.2.13*. [https://cran.r-project.org/web/packages/feature/](https://cran.r-project.org/web/packages/feature/)

Haas B. J., Delcher A. L., Wortman J. R., and Salzberg S. L. (2004). **DAGchainer: a tool for mining segmental genome duplications and synteny.** *Bioinformatics*, **20(18)**, 3643-3646. [https://doi.org/10.1093/bioinformatics/bth397](https://doi.org/10.1093/bioinformatics/bth397)

Haug-Baltzell A., Stephens S. A., Davey S., Scheidegger C. E., Lyons E. (2017). **SynMap2 and SynMap3D: web-based wholge-genome synteny browsers.** *Bioinformatics*, **33(14)**, 2197-2198. [https://academic.oup.com/bioinformatics/article/33/14/2197/3072872](https://academic.oup.com/bioinformatics/article/33/14/2197/3072872)

Kiełbasa S. M., Wan R., Sato K., Horton P., and Frith M. C. (2011). **Adaptive seeds tame genomic sequence comparison.** *Genome Research*, **21(3)**, 487-493. [https://doi.org/10.1101/gr.113985.110](https://doi.org/10.1101/gr.113985.110)

Kimura M. (1977). **Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution.** *Nature*, **267**, 275-276.

Li W. H. (1993). **Unbiased estimation of the rates of synonymous and nonsynonymous substitution.** *Journal of Molecular Evolution*, **36(1)**, 96-99. [https://doi.org/10.1007/bf02407308](https://doi.org/10.1007/bf02407308)

Microsoft, and Weston S. (2020). **foreach: Provides Foreach Looping Construct.** *R package version*, 1.5.1. [foreach](https://cran.r-project.org/web/packages/foreach/index.html)

Mugal C. F., Wolf J. B. W., Kaj I. (2014). **Why Time Matters: Codon Evolution and the Temproal Dynamics of dN/dS.** *Molecular Biology and Evolution*, **31(1)**, 212-231.

Ooms J. (2019). **curl: A Modern and Flexible Web Client for R.** *R package version*, 4.3. [curl](https://cran.r-project.org/web/packages/curl/index.html)

Pagès H., Aboyoun P., Gentleman R., and DebRoy S. (2017). **Biostrings: Efficient manipulation of biological strings.** *R package version*, 2.56.0. [Biostrings](http://bioconductor.org/packages/release/bioc/html/Biostrings.html)

Revolution Analytics, and Weston S. (2020). **doMC: Foreach Parallel Adaptor for 'parallel'.** *R package version*, 1.3.7. [doMC](https://cran.r-project.org/web/packages/doMC/index.html)

Rost B. (1999). **Twilight zone of protein sequence alignments.** *Protein Engineering*, **12(2)**, 85-94. [https://doi.org/10.1093/protein/12.2.85](https://doi.org/10.1093/protein/12.2.85)

Scott C. (2017). **shmlast: an improved implementation of conditional reciprocal best hits with LAST and Python.** *Journal of Open Source Software*, **2(9)**, 142. [https://joss.theoj.org/papers/10.21105/joss.00142](https://joss.theoj.org/papers/10.21105/joss.00142)

Scrucca L., Fop M., Murphy T. B., and Raftery A. E. (2016) **mclust 5: clustering, classification and density estimation using Gaussian finite mixture models.** *The R Journal*, 8(1), 289-317. [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5096736/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5096736/)

Wickham H. (2011). **testthat: Get Started with Testing.** *The R Journal*, 3(1), 5. [testthat](https://cran.r-project.org/web/packages/testthat/index.html)

Wickham H. (2019). **stringr: Simple, Consistent Wrappers for Common String Operations.** *R package version*, 1.4.0. [stringr](https://cran.r-project.org/web/packages/stringr/index.html)

Wickham H. (2020). **tidyr: Tidy Messy Data.** *R package version*, 1.1.2. [tidyr](https://cran.r-project.org/web/packages/tidyr/index.html)

Wickham H., Hester J., and Chang W. (2020). **devtools: Tools to make Developing R Packages Easier.** *R package version*, (2.3.2). [devtools](https://cran.r-project.org/web/packages/devtools/index.html)

Wickham H., François R., Henry L., and Müller K. (2020). **dplyr: A Grammar of Data Manipulation.** *R package version*, 1.0.2. [dplyr](https://cran.r-project.org/web/packages/dplyr/index.html)

Yang Z., and Nielsen R. (2000). **Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models.** *Molecular Biology and Evolution*, **17(1)**, 32-43. [https://doi.org/10.1093/oxfordjournals.molbev.a026236](https://doi.org/10.1093/oxfordjournals.molbev.a026236)