--- title: "KaKs Vignette" author: "Kristian K Ullrich" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{KaKs Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Ka/Ks Vignette - includes *Arabidopsis thalina* and *Arabidopsis lyrata* CRBHit pair calculation - includes *Homo sapiens* and *Pan troglodytes* CRBHit pair calculation - inlcudes Longest Isoform selection - includes Gene/Isoform chromosomal position extraction - includes Tandem Duplicate Assignment - includes Synteny Assignment - includes Ka/Ks colored Dot-Plot ## Table of Contents 1. [Conditional Reciprocal Best Hit pairs (CRBH pairs)](#crbhpairs) 1. [Download Coding Sequences (CDS) from NCBI and ENSEMBL](#ensembl) 2. [Get Longest Isoform from NCBI or ENSEMBL Input](#isoform) 1. [isoform2longest()](#isoform2longest) 2. [gtf2longest()](#gtf2longest) 3. [gff2longest()](#gff2longest) 3. [Calculate/Filter CRBHit pairs](#crbhfilter) 4. [Extract Gene/Isoform chromosomal position](#extractposition) 1. [Get gene position from NCBI or ENSEMBL Input](#ncbi2genepos) 2. [Use GTF/GFF3 file to obtain gene position](#gff2genepos) 5. [Assign Tandem Duplicates](#tandemduplicates) 6. [Synteny with DAGchainer](#synteny) 2. [Ka/Ks Calculation](#KaKs) 3. [Ka/Ks Filtering and Visualisation](#KaKsFiltering) 4. [*Homo sapiens* vs. *Pan troglodytes* example](#homosapiens) 5. [References](#references) This vignette is supposed to explain in more detail Ka/Ks calculation and it's downstream filtering and visualization. The Ka/Ks ratio was originally developed for the analysis of genetic sequences from divergent species ([Kimura (1977)](https://www.nature.com/articles/267275a0), [Yang and Nielson (2000)]((https://doi.org/10.1093/oxfordjournals.molbev.a026236))). In short, the Ka/Ks ratio quantifies the mode and strength of selection by comparing synonymous substitution rates (Ks) (assumed to be evolutionary neutral) with nonsynonymous rates (Ka), which are exposed to selection as they change the amino acid composition of a protein ([Mugal et al (2014)](https://academic.oup.com/mbe/article/31/1/212/1049986)). [CRBHits](https://gitlab.gwdg.de/mpievolbio-it/crbhits) is a reimplementation of the Conditional Reciprocal Best Hit (CRBH) algorithm [crb-blast](https://github.com/cboursnell/crb-blast) in R. The CRBHit pairs can be directly used to calculate Ka/Ks ratios and to filter for tandem duplicates or syntenic groups. See the R package page for a detailed description of the install process and its dependencies [https://mpievolbio-it.pages.gwdg.de/crbhits/](https://mpievolbio-it.pages.gwdg.de/crbhits/) or have a look at the [CRBHits Basic Vignette](https://mpievolbio-it.pages.gwdg.de/crbhits/articles/V01CRBHitsBasicVignette.html). __Note:__ In this Vignette the `lastpath` is defined as `vignette.paths[1]` and the `dagchainerpath` as `vignette.paths[2]` to be able to build the Vignette. However, once you have compiled last-1639, KaKs_Calculator2.0 and DAGchainer with the functions `make_last()` and `make_dagchainer()` you won't need to specify the paths anymore. Please remove them if you would like to repeat the examples. ``` ## load vignette specific libraries library(CRBHits) suppressPackageStartupMessages(library(Biostrings)) suppressPackageStartupMessages(library(dplyr)) suppressPackageStartupMessages(library(ggplot2)) suppressPackageStartupMessages(library(gridExtra)) suppressPackageStartupMessages(library(curl)) ## compile LAST, KaKs_Calculator2.0 and DAGchainer for the vignette vignette.paths <- make_vignette() ``` ## 1. [Conditional reciprocal best hit pairs]{#crbhpairs} The CRBH algorithm builds upon the classical reciprocal best hit (RBH) approach to find orthologous sequences between two sets of sequences by defining an expect-value cutoff per alignment length. Further, this cutoff is used to retain secondary hits as additional *bona-fide* homologues. See the R package page and the [CRBHits Basic Vignette](https://mpievolbio-it.pages.gwdg.de/crbhits/articles/V01CRBHitsBasicVignette.html) for a more detailed description of the CRBH algorithm. ### 1.1. [Download Coding sequences (CDS) from NCBI and ENSEMBL]{#ensembl} To calculate CRBHit pairs between two species, one can directly use an URL to access the coding sequences and calculate the CRBH pairs matrix. Two examples are given in this vignette. The first example compares the coding sequences from *Arabidopsis thaliana* and *Arabidopsis lyrata*. The input sequences are from the FTP server from [NCBI Genomes](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/). The second example compares the CDS from *Homo sapiens* and *Pan troglodytes* (see [*Homo sapiens* vs. *Pan troglodytes* example](#homosapiens)). As an alternative the [EnsemblPlants release-52](ftp://ftp.ensemblgenomes.org/pub/plants/release-52/fasta/) can used. ``` ## set URLs for Arabidopis thaliana and Arabidopsis lyrata from NCBI Genomes ## set NCBI URL NCBI <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/" ## set Arabidopsis thaliana CDS URL ARATHA.cds.url <- paste0(NCBI, "GCF/000/001/735/GCF_000001735.4_TAIR10.1/", "GCF_000001735.4_TAIR10.1_cds_from_genomic.fna.gz") ## set Arabidopsis lyrata CDS URL ARALYR.cds.url <- paste0(NCBI, "GCF/000/004/255/GCF_000004255.2_v.1.0/", "GCF_000004255.2_v.1.0_cds_from_genomic.fna.gz") ## get Arabidopsis thaliana CDS ARATHA.cds <- Biostrings::readDNAStringSet(ARATHA.cds.url) ## get Arabidopsis lyrata CDS ARALYR.cds <- Biostrings::readDNAStringSet(ARALYR.cds.url) ``` ### 1.2. [Get Longest Isoform from NCBI or ENSEMBL Input]{#isoform} Prior CRBHit pairs calculation, the longest isoforms can be selected, if the data was downloaded from NCBI or Ensembl. __Note:__ This is an important step. If e.g. isoforms are used, the Ka/Ks calculations might show biased values inspecting mean or median Ka/Ks values grouped per query sequence. #### 1.2.1 [isoform2longest()]{#isoform2longest} ``` ## get longest isoform from sequence IDs ARATHA.cds.longest <- isoform2longest(ARATHA.cds, "NCBI") ARALYR.cds.longest <- isoform2longest(ARALYR.cds, "NCBI") ## get help #?isoform2longest ``` #### 1.2.2 [gtf2longest()]{#gtf2longest} __Note:__ Sometimes the sequence IDs do not have the necessary information to extract the longest isoforms. If this might be the case, please use the dedicated functions `gtf2longest()` or `gff2longest()` to get the gene positions and the longest isoforms. ``` ## set NCBI URL NCBI <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/" ARATHA.NCBI.cds.url <- paste0(NCBI, "GCF/000/001/735/GCF_000001735.4_TAIR10.1/", "GCF_000001735.4_TAIR10.1_cds_from_genomic.fna.gz") ARALYR.NCBI.cds.url <- paste0(NCBI, "GCF/000/004/255/GCF_000004255.2_v.1.0/", "GCF_000004255.2_v.1.0_cds_from_genomic.fna.gz") ARATHA.NCBI.cds <- Biostrings::readDNAStringSet(ARATHA.NCBI.cds.url) ARALYR.NCBI.cds <- Biostrings::readDNAStringSet(ARALYR.NCBI.cds.url) ## get longest isoform from GTF file - NCBI ## set Arabidopsis thaliana GTF URL ARATHA.NCBI.gtf.url <- paste0(NCBI, "GCF/000/001/735/GCF_000001735.4_TAIR10.1/", "GCF_000001735.4_TAIR10.1_genomic.gtf.gz") ARALYR.NCBI.gtf.url <- paste0(NCBI, "GCF/000/004/255/GCF_000004255.2_v.1.0/", "GCF_000004255.2_v.1.0_genomic.gtf.gz") ## get longest isoform from GTF file ARATHA.NCBI.gtf.longest <- gtf2longest( gtffile=ARATHA.NCBI.gtf.url, cds=ARATHA.NCBI.cds, removeNonCoding=TRUE, source="NCBI") ARALYR.NCBI.gtf.longest <- gtf2longest( gtffile=ARALYR.NCBI.gtf.url, cds=ARALYR.NCBI.cds, removeNonCoding=TRUE, source="NCBI") ARATHA.NCBI.gtf.cds.longest <- ARATHA.NCBI.gtf.longest$cds ARALYR.NCBI.gtf.cds.longest <- ARALYR.NCBI.gtf.longest$cds ARATHA.NCBI.gtf.cds.genepos <- ARATHA.NCBI.gtf.longest$genepos ARALYR.NCBI.gtf.cds.genepos <- ARALYR.NCBI.gtf.longest$genepos ## get help #?gtf2longest ``` ``` ## set ENSEMBL URL ENSEMBL <- "http://ftp.ensemblgenomes.org/pub/plants/release-52/" ARATHA.ENSEMBL.cds.url <- paste0(ENSEMBL, "fasta/arabidopsis_thaliana/cds/", "Arabidopsis_thaliana.TAIR10.cds.all.fa.gz") ARALYR.ENSEMBL.cds.url <- paste0(ENSEMBL, "fasta/arabidopsis_lyrata/cds/", "Arabidopsis_lyrata.v.1.0.cds.all.fa.gz") ARATHA.ENSEMBL.cds <- Biostrings::readDNAStringSet(ARATHA.ENSEMBL.cds.url) ARALYR.ENSEMBL.cds <- Biostrings::readDNAStringSet(ARALYR.ENSEMBL.cds.url) ## get longest isoform from GTF file - ENSEMBL ## set Arabidopsis thaliana GTF URL ARATHA.ENSEMBL.gtf.url <- paste0(ENSEMBL, "gtf/arabidopsis_thaliana/", "Arabidopsis_thaliana.TAIR10.52.gtf.gz") ARALYR.ENSEMBL.gtf.url <- paste0(ENSEMBL, "gtf/arabidopsis_lyrata/", "Arabidopsis_lyrata.v.1.0.52.gtf.gz") ## get longest isoform from GTF file ARATHA.ENSEMBL.gtf.longest <- gtf2longest( gtffile=ARATHA.ENSEMBL.gtf.url, cds=ARATHA.ENSEMBL.cds, removeNonCoding=TRUE, source="ENSEMBL") ARALYR.ENSEMBL.gtf.longest <- gtf2longest( gtffile=ARALYR.ENSEMBL.gtf.url, cds=ARALYR.ENSEMBL.cds, removeNonCoding=TRUE, source="ENSEMBL") ARATHA.ENSEMBL.gtf.cds.longest <- ARATHA.ENSEMBL.gtf.longest$cds ARALYR.ENSEMBL.gtf.cds.longest <- ARALYR.ENSEMBL.gtf.longest$cds ARATHA.ENSEMBL.gtf.cds.genepos <- ARATHA.ENSEMBL.gtf.longest$genepos ARALYR.ENSEMBL.gtf.cds.genepos <- ARALYR.ENSEMBL.gtf.longest$genepos ## get help #?gtf2longest ``` #### 1.2.3 [gff2longest()]{#gff2longest} ``` ## set NCBI URL NCBI <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/" ## get longest isoform from GFF3 file - NCBI ## set Arabidopsis thaliana GFF3 URL ARATHA.NCBI.gff3.url <- paste0(NCBI, "GCF/000/001/735/GCF_000001735.4_TAIR10.1/", "GCF_000001735.4_TAIR10.1_genomic.gff.gz") ARALYR.NCBI.gff3.url <- paste0(NCBI, "GCF/000/004/255/GCF_000004255.2_v.1.0/", "GCF_000004255.2_v.1.0_genomic.gff.gz") ## get longest isoform from GFF3 file ARATHA.NCBI.gff3.longest <- gff2longest( gff3file=ARATHA.NCBI.gff3.url, cds=ARATHA.NCBI.cds, removeNonCoding=TRUE, source="NCBI") ARALYR.NCBI.gff3.longest <- gff2longest( gff3file=ARALYR.NCBI.gff3.url, cds=ARALYR.NCBI.cds, removeNonCoding=TRUE, source="NCBI") ARATHA.NCBI.gff3.cds.longest <- ARATHA.NCBI.gff3.longest$cds ARALYR.NCBI.gff3.cds.longest <- ARALYR.NCBI.gff3.longest$cds ARATHA.NCBI.gff3.cds.genepos <- ARATHA.NCBI.gff3.longest$genepos ARALYR.NCBI.gff3.cds.genepos <- ARALYR.NCBI.gff3.longest$genepos ## get help #?gff2longest ``` ``` ## set ENSEMBL URL ENSEMBL <- "http://ftp.ensemblgenomes.org/pub/plants/release-52/" ## get longest isoform from GFF3 file - ENSEMBL ## set Arabidopsis thaliana GFF3 URL ARATHA.ENSEMBL.gff3.url <- paste0(ENSEMBL, "gff3/arabidopsis_thaliana/", "Arabidopsis_thaliana.TAIR10.52.gff3.gz") ARALYR.ENSEMBL.gff3.url <- paste0(ENSEMBL, "gff3/arabidopsis_lyrata/", "Arabidopsis_lyrata.v.1.0.52.gff3.gz") ARATHA.ENSEMBL.gff3.file <- tempfile() download.file(ARATHA.ENSEMBL.gff3.url, ARATHA.ENSEMBL.gff3.file, quiet=FALSE) ARALYR.ENSEMBL.gff3.file <- tempfile() download.file(ARALYR.ENSEMBL.gff3.url, ARALYR.ENSEMBL.gff3.file, quiet=FALSE) ## get longest isoform from GFF3 file ARATHA.ENSEMBL.gff3.longest <- gff2longest( gff3file=ARATHA.ENSEMBL.gff3.file, cds=ARATHA.ENSEMBL.cds, removeNonCoding=TRUE, source="ENSEMBL") ARALYR.ENSEMBL.gff3.longest <- gff2longest( gff3file=ARALYR.ENSEMBL.gff3.file, cds=ARALYR.ENSEMBL.cds, removeNonCoding=TRUE, source="ENSEMBL") ARATHA.ENSEMBL.gff3.cds.longest <- ARATHA.ENSEMBL.gff3.longest$cds ARALYR.ENSEMBL.gff3.cds.longest <- ARALYR.ENSEMBL.gff3.longest$cds ARATHA.ENSEMBL.gff3.cds.genepos <- ARATHA.ENSEMBL.gff3.longest$genepos ARALYR.ENSEMBL.gff3.cds.genepos <- ARALYR.ENSEMBL.gff3.longest$genepos ## get help #?gff2longest ``` ### 1.3. [Calculate/Filter CRBHit pairs]{#crbhfilter} The blast-like software [LAST](http://last.cbrc.jp/) is used to compare the translated CDS against each other and output a blast-like output table including the query and target length. The hit pairs are filtered prior fitting for a query coverage of 50\% and the **twilight zone of protein sequence alignments** ([Rost B. (1999)](https://academic.oup.com/peds/article/12/2/85/1550637)). The CRBhit pairs can be calculated directly from the URLs with the function `cdsfile2rbh()` using multiple threads including longest isoform selection. For a detailed description of the `cdsfile2rbh()` function and the `cds2rbh()` function with its individual filtering steps prior CRBH algorithm fitting see the [CRBHits Basic Vignette](https://mpievolbio-it.pages.gwdg.de/crbhits/articles/V01CRBHitsBasicVignette.html). ``` ## calculate CRBHit pairs for A. thaliana and A. lyrata using 2 threads ## input from CDS obtained from NCBI ## longest isoform selection ## query coverage >= 50% ## rost199 filter ARATHA_ARALYR_crbh <- cds2rbh( cds1=ARATHA.cds, cds2=ARALYR.cds, qcov=0.5, rost1999=TRUE, longest.isoform=TRUE, isoform.source="NCBI", threads=2, plotCurve=TRUE, lastpath=vignette.paths[1]) attributes(ARATHA_ARALYR_crbh)$selfblast ## get help #?cds2rbh ``` ![Figure: CRBHit pairs between A. thaliana and A. lyrata](V02-CRBHit-pairs-ARATHA-ARALYR.png) ### 1.4. [Extract Gene/Isoform chromosomal position]{#extractposition} To be able to filter for *Tandem Duplicates* and to plot a dotplot, one needs to have annotated gene positions per chromosome/contig. With the `cds2genepos()` function it is possible to directly access this information if the data was obtained from NCBI or Ensembl. Chromosomal gene positions might overlap and if one has not filtered to use the longest or primary isoform there will be positions overlap for CDS. To overcome this issue it is recommended to first reduce the CDS Input to the longest isoform (see [Get Longest Isoform from NCBI or ENSEMBL Input](#isoform)). __Note:__ It is also possible to obtain and define chromosomal gene positions from other sources, like parsing a GTF/GFF file and supply a manual curated genepos matrix. In the same way this process might be used to select the longest/primary isoform from CDS Input sources other than NCBI or Ensembl. This special case is handled here: #### 1.4.1. [Get gene position from NCBI or ENSEMBL Input]{#ncbi2genepos} If the CDS Input was obtained from NCBI or ENSEMBL, the gene position can be directly extracted from the `DNAStringSet` as follows. __Note:__ Sometimes the sequence IDs do not have the necessary information to extract the longest isoforms. If this might be the case, please use the dedicated functions `gtf2longest()` or `gff2longest()` to get the gene positions and the longest isoforms. ``` ## get gene position idx from NCBI CDS ## extract gene position from CDS directly from sequence IDs ARATHA.cds.genepos <- cds2genepos( cds=ARATHA.cds, source="NCBI") ## extract gene position from longest isoform CDS ARATHA.cds.longest.genepos <- cds2genepos( cds=ARATHA.cds.longest, source="NCBI") ## show first entries head(ARATHA.cds.genepos) head(ARATHA.cds.longest.genepos) ## get number of gene isoforms with same index table(table(ARATHA.cds.genepos$gene.idx)) table(table(ARATHA.cds.longest.genepos$gene.idx)) ##get help #?cds2genepos ``` #### 1.4.2. [Use GTF/GFF3 file to obtain gene position]{#gff2genepos} As noted above, it is also possible to obtain gene position and corresponding the longest isoforms per gene annotated in a GTF/GFF3 file. see [GTF/GFF3 format](http://www.ensembl.org/info/website/upload/gff3.html) __Note:__ In some cases the IDs provided in the CDS and the GTF/GFF3 file differ, which make it more complicated to directly link gene position and CDS. Here, as an example, the GTF file for the species *Arabidopsis thaliana* is obtained from the FTP server from [Ensembl Plants](https://plants.ensembl.org/index.html). __Note:__ The GTF and GFF3 files can be very large and it might take some time until this data is loaded into memory. see `gtf2longest()` in section 1.2.2 and `gff2longest()` in section 1.2.3 __Note:__ These manual steps (see below) are now deprecated, please use the dedicated functions `gtf2longest()` or `gff2longest()`. ``` ## deprecated, please use gtf2longest() or gff2longest() function ## get gene position idx from GTF/GFF3 ## set NCBI URL ensemblPlants <- "ftp://ftp.ensemblgenomes.org/pub/plants/release-48/" ## set Arabidopsis thaliana GFF3 URL ARATHA.GFF.url <- paste0(ensemblPlants, "gff3/arabidopsis_thaliana/", "Arabidopsis_thaliana.TAIR10.48.gff3.gz") ## downlaod and gunzip file ARATHA.GFF.file <- tempfile() ARATHA.GFF.file.gz <- paste0(ARATHA.GFF.file, ".gz") download.file(ARATHA.GFF.url, ARATHA.GFF.file.gz, quiet=FALSE) system2(command="gunzip", args=c("-f", ARATHA.GFF.file.gz)) ## read ARATHA.GFF.file ARATHA.gff <- read.table(ARATHA.GFF.file, sep="\t", quote="", header=FALSE) colnames(ARATHA.gff) <- c("seqname", "source", "feature", "start", "end", "score", "strand", "frame", "attribute") ## extract gene ARATHA.gff.gene <- ARATHA.gff %>% dplyr::filter(feature=="gene") ## get gene ID ARATHA.gff.gene.id <- gsub("ID\\=", "", stringr::word(ARATHA.gff.gene$attribute, 1, sep=";")) ## extract mRNA ARATHA.gff.mRNA <- ARATHA.gff %>% dplyr::filter(feature=="mRNA") ## get mRNA ID ARATHA.gff.mRNA.id <- gsub("ID\\=", "", stringr::word(ARATHA.gff.mRNA$attribute, 1, sep=";")) ## get mRNA based gene ID ARATHA.gff.mRNA.parent.id <- gsub("Parent\\=", "", stringr::word(ARATHA.gff.mRNA$attribute, 2, sep=";")) ## add mRNA ID and mRNA based gene ID ARATHA.gff.mRNA <- ARATHA.gff.mRNA %>% dplyr::mutate(gene.id=ARATHA.gff.mRNA.parent.id, mRNA.id=ARATHA.gff.mRNA.id) ## extract CDS ARATHA.gff.CDS <- ARATHA.gff %>% dplyr::filter(feature=="CDS") ## get CDS based mRNA ID ARATHA.gff.CDS.parent.id <- gsub("Parent\\=", "", stringr::word(ARATHA.gff.CDS$attribute, 2, sep=";")) ## add CDS based mRNA ID ARATHA.gff.CDS <- ARATHA.gff.CDS %>% dplyr::mutate(mRNA.id=ARATHA.gff.CDS.parent.id) ## get width per mRNA isoform ARATHA.gff.mRNA.len <- ARATHA.gff.CDS %>% dplyr::group_by(mRNA.id) %>% dplyr::summarise(mRNA.id=unique(mRNA.id), len=sum(end-start+1)) ## add mRNA isoform width ARATHA.gff.mRNA <- ARATHA.gff.mRNA %>% dplyr::mutate(mRNA.len= ARATHA.gff.mRNA.len$len[ match(ARATHA.gff.mRNA$mRNA.id, ARATHA.gff.mRNA.len$mRNA.id)]) ## retain only longest mRNA isoform ARATHA.gff.mRNA.longest <- ARATHA.gff.mRNA %>% dplyr::arrange(seqname, desc(mRNA.len), mRNA.id, gene.id, start) %>% dplyr::distinct(gene.id, .keep_all=TRUE) %>% dplyr::mutate(gene.mid=start + (end-start)/2) ## add gene idx ARATHA.gff.mRNA.longest <- ARATHA.gff.mRNA.longest %>% dplyr::mutate(gene.idx=seq(from=1, to=dim(ARATHA.gff.mRNA.longest)[1])) ## create gene position matrix to be used for downstream analysis ARATHA.gff.genepos <- ARATHA.gff.mRNA.longest %>% dplyr::select( gene.seq.id=mRNA.id, gene.chr=seqname, gene.start=start, gene.end=end, gene.mid=gene.mid, gene.strand=strand, gene.idx=gene.idx) ## set attribute attr(ARATHA.gff.genepos, "CRBHits.class") <- "genepos" ## show first entries of generated gene position head(ARATHA.gff.genepos) ``` ### 1.5 [Assign Tandem Duplicates]{#tandemduplicates} Tandem duplicated genes (gene family members that are tightly clustered on a chromosome) are common in plant genomes and can show different Ka/Ks distribution than non tandem duplicates or other segmental duplicated genes ([Rizzon et al., 2006](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0020115)). In order to account for this, it is possible to classify these tandem duplicates with the `tandemdups()` function. __Note:__ To be able to classify tandem duplicated genes within one genome, one needs to calculate selfblast based CRBHit pairs and provide gene position information for the same input data. The following example shows how one can: 1. get selfblast CRBHit pairs 2. get gene positions 3. assign tandem duplicates ``` ## example to assign tandem duplicates given selfblast CRBHit pairs and gene position ## get selfblast CRBHit pairs for A. thaliana ARATHA_selfblast_crbh <- cds2rbh( cds1=ARATHA.cds, cds2=ARATHA.cds, qcov=0.5, rost1999=TRUE, longest.isoform=TRUE, isoform.source="NCBI", threads=2, plotCurve=TRUE, lastpath=vignette.paths[1]) attributes(ARATHA_selfblast_crbh)$selfblast ``` ![Figure: selfblast CRBHit pairs between of A. thaliana](V02-CRBHit-pairs-selfblast-ARATHA.png) ``` ## get selfblast CRBHit pairs for A. lyrata ARALYR_selfblast_crbh <- cds2rbh( cds1=ARALYR.cds, cds2=ARALYR.cds, qcov=0.5, rost1999=TRUE, longest.isoform=TRUE, isoform.source="NCBI", threads=2, plotCurve=TRUE, lastpath=vignette.paths[1]) attributes(ARALYR_selfblast_crbh)$selfblast ``` ![Figure: selfblast CRBHit pairs between of A. lyrata](V02-CRBHit-pairs-selfblast-ARALYR.png) ``` ## get gene position for A. thaliana longest isoforms ARATHA.cds.longest.genepos <- cds2genepos( cds=ARATHA.cds.longest, source="NCBI") ## get gene position for A. lyrata longest isoforms ARALYR.cds.longest.genepos <- cds2genepos( cds=ARALYR.cds.longest, source="NCBI") ## assign tandem duplicates for A. thaliana ARATHA.cds.longest.tandemdups <- tandemdups( rbhpairs=ARATHA_selfblast_crbh, genepos=ARATHA.cds.longest.genepos, dupdist=5) ## assign tandem duplicates for A. lyrata ARALYR.cds.longest.tandemdups <- tandemdups( rbhpairs=ARALYR_selfblast_crbh, genepos=ARALYR.cds.longest.genepos, dupdist=5) ## get help #?tandemdups ``` The resulting tandem duplicated genes detected can be plotted given their chromosomal gene position as follows: ``` ## example how to plot tandem duplicated gene groups ## get tandem group size tandem_group_size <- ARATHA.cds.longest.tandemdups %>% dplyr::group_by(tandem_group) %>% dplyr::group_size() table(tandem_group_size) ## use dplyr::mutate to assign group size ARATHA.cds.longest.tandemdups <- ARATHA.cds.longest.tandemdups %>% dplyr::mutate(tandem_group_size=unlist(apply(cbind(tandem_group_size, tandem_group_size), 1, function(x) rep(x[1], x[2])))) ## use dplyr::group_by to plot group by chromosome and colored by group size ARATHA.cds.longest.tandemdups %>% dplyr::group_by(gene.chr) %>% ggplot2::ggplot(aes(x=gene.mid, y=gene.mid)) + ggplot2::geom_point(shape=20, aes(col=as.factor(tandem_group_size))) + ggplot2::facet_wrap(~gene.chr) + ggplot2::scale_colour_manual(values= CRBHitsColors(length(table(tandem_group_size)))) ``` ![Figure: Tandem duplicates A. thaliana](V02-tandemdups-ARATHA.png) ### 1.6 [Synteny with DAGchainer]{#synteny} Chains of collinear gene pairs can represent segmentally duplicated regions and genes within a single genome or syntenic regions between related genomes ([Haas et al., 2004](https://doi.org/10.1093/bioinformatics/bth397)). There exists external tools which can be used to infer syntenic regions, like e.g. [SynMap](https://genomevolution.org/CoGe/SynMap.pl) as published by [Haug-Baltzell A. et al., 2017 ](https://academic.oup.com/bioinformatics/article/33/14/2197/3072872). Within CRBHits the tool [DAGchainer](http://dagchainer.sourceforge.net/) published by [Haas et al., 2004](https://doi.org/10.1093/bioinformatics/bth397) can be directly applied on the CRBHit pairs. The following example shows how one can run [DAGchainer](http://dagchainer.sourceforge.net/) on CRBHit pairs and gene positions to get syntenic information about the CRBHit pairs. ``` ## example how to run DAGchainer on CRBHit pairs and gene positions ## DAGchainer using gene base pair (gene bp start end) ## Note: change parameter to fit bp option ARATHA_ARALYR_crbh.dagchainer.bp <- rbh2dagchainer( rbhpairs=ARATHA_ARALYR_crbh, gene.position.cds1=ARATHA.cds.longest.genepos, gene.position.cds2=ARALYR.cds.longest.genepos, type="bp", gap_length=10000, max_dist_allowed=200000, dagchainerpath=vignette.paths[2]) ## DAGchainer using gene index (gene order) ARATHA_ARALYR_crbh.dagchainer.idx <- rbh2dagchainer( rbhpairs=ARATHA_ARALYR_crbh, gene.position.cds1=ARATHA.cds.longest.genepos, gene.position.cds2=ARALYR.cds.longest.genepos, type="idx", gap_length=1, max_dist_allowed=20, dagchainerpath=vignette.paths[2]) ## get help #?rbh2dagchainer ``` A plotting function `plot_dagchainer()` is integrated and can be directly applied on the DAGchainer results. ``` ## example how to plot pairwise chromosomal syntenic groups (DAGchainer results) ## plot DAGchainer results for each chromosome combination dim(ARATHA_ARALYR_crbh.dagchainer.bp) plot_dagchainer(ARATHA_ARALYR_crbh.dagchainer.bp) ``` ![Figure: Dagchainer results of A. thaliana and A. lyrata](V02-dagchainer-default-ARATHA-ARALYR.png) ``` ## plot DAGchainer results selected chromosomes g <- plot_dagchainer( dag=ARATHA_ARALYR_crbh.dagchainer.bp, select.chr=c( "AA1:NC_003070.9", "AA1:NC_003071.7", "AA1:NC_003074.8", "AA1:NC_003075.7", "AA1:NC_003076.8", "AA2:NW_003302551.1", "AA2:NW_003302554.1", "AA2:NW_003302553.1", "AA2:NW_003302552.1", "AA2:NW_003302551.1", "AA2:NW_003302550.1", "AA2:NW_003302549.1", "AA2:NW_003302548.1")) g ``` ![Figure: Dagchainer results selected chromosomes](V02-dagchainer-default-ARATHA-ARALYR-selected-chr.png) ``` ## change title size g + ggplot2::theme(title=element_text(size=16)) ``` ![Figure: Dagchainer results altered title size](V02-dagchainer-default-ARATHA-ARALYR-selected-chr-title-size.png) ``` ## change axis title size (gene1.mid; gene2.mid) g + ggplot2::theme(axis.title.x=element_text(size=16), axis.title.y=element_text(size=16)) ``` ![Figure: Dagchainer results altered axis title size](V02-dagchainer-default-ARATHA-ARALYR-selected-chr-axis-size.png) ``` ## change grid title size g + theme(strip.text.x=element_text(size=16), strip.text.y=element_text(size=16)) ``` ![Figure: Dagchainer results altered grid title size](V02-dagchainer-default-ARATHA-ARALYR-selected-chr-grid-title-size.png) ``` ## change grid axis size and angle g + theme(axis.text.x=element_text(size=12, angle=90)) ``` ![Figure: Dagchainer results altered grid angle](V02-dagchainer-default-ARATHA-ARALYR-selected-chr-grid-axis-size.png) ``` ## get help #?plot_dagchainer ``` The same analysis can be done just using the selfblast hits of one species. ``` ## example how to run DAGchainer on CRBHit pairs and gene positions ## DAGchainer using gene base pair (gene bp start end) ## ignore tandem duplicates ## Note: change parameter to fit bp option ARATHA_selfblast_crbh.dagchainer.ignore_tandem.bp <- rbh2dagchainer( rbhpairs=ARATHA_selfblast_crbh, gene.position.cds1=ARATHA.cds.longest.genepos, gene.position.cds2=ARATHA.cds.longest.genepos, type="bp", gap_length=10000, max_dist_allowed=200000, ignore_tandem=TRUE, only_tandem=FALSE, dagchainerpath=vignette.paths[2]) ## DAGchainer using gene base pair (gene bp start end) ## use tandem duplicates ## Note: change parameter to fit bp option ARATHA_selfblast_crbh.dagchainer.tandem.bp <- rbh2dagchainer( rbhpairs=ARATHA_selfblast_crbh, gene.position.cds1=ARATHA.cds.longest.genepos, gene.position.cds2=ARATHA.cds.longest.genepos, type="bp", gap_length=10000, max_dist_allowed=200000, ignore_tandem=FALSE, only_tandem=FALSE, dagchainerpath=vignette.paths[2]) ## DAGchainer using gene base pair (gene bp start end) ## use only tandem duplicates ## Note: change parameter to fit bp option ARATHA_selfblast_crbh.dagchainer.only_tandem.bp <- rbh2dagchainer( rbhpairs=ARATHA_selfblast_crbh, gene.position.cds1=ARATHA.cds.longest.genepos, gene.position.cds2=ARATHA.cds.longest.genepos, type="bp", gap_length=10000, max_dist_allowed=200000, ignore_tandem=FALSE, only_tandem=TRUE, dagchainerpath=vignette.paths[2]) ## get help #?rbh2dagchainer ``` ``` ## example how to plot pairwise chromosomal syntenic groups for selfblast (DAGchainer results) ## plot DAGchainer results for each chromosome combination dim(ARATHA_selfblast_crbh.dagchainer.ignore_tandem.bp) plot_dagchainer(ARATHA_selfblast_crbh.dagchainer.ignore_tandem.bp, DotPlotTitle="ignore tandem duplicates") dim(ARATHA_selfblast_crbh.dagchainer.tandem.bp) plot_dagchainer(ARATHA_selfblast_crbh.dagchainer.tandem.bp, DotPlotTitle="use tandem duplicates") dim(ARATHA_selfblast_crbh.dagchainer.only_tandem.bp) plot_dagchainer(ARATHA_selfblast_crbh.dagchainer.only_tandem.bp, DotPlotTitle="use only tandem duplicates") ``` ![Figure: Dagchainer results selfblast ARATHA ignore tandem](V02-dagchainer-selfblast-ARATHA-ignore-tandem.png) ![Figure: Dagchainer results selfblast ARATHA use tandem](V02-dagchainer-selfblast-ARATHA-tandem.png) ![Figure: Dagchainer results selfblast ARATHA use only tandem](V02-dagchainer-selfblast-ARATHA-only-tandem.png) It is even possible to do the same analysis and use the inter and intra species hits. ``` ## example how to run DAGchainer on CRBHit pairs ## and gene positions using inter and intra species hits ## DAGchainer using gene base pair (gene bp start end) ## ignore tandem duplicates ## Note: change parameter to fit bp option ARATHA_ARALYR_crbh.dagchainer.ignore_tandem.bp <- rbh2dagchainer( rbhpairs=ARATHA_ARALYR_crbh, selfblast1=ARATHA_selfblast_crbh, selfblast2=ARALYR_selfblast_crbh, gene.position.cds1=ARATHA.cds.longest.genepos, gene.position.cds2=ARALYR.cds.longest.genepos, type="bp", gap_length=10000, max_dist_allowed=200000, ignore_tandem=TRUE, only_tandem=FALSE, dagchainerpath=vignette.paths[2]) ## DAGchainer using gene base pair (gene bp start end) ## use tandem duplicates ## Note: change parameter to fit bp option ARATHA_ARALYR_crbh.dagchainer.tandem.bp <- rbh2dagchainer( rbhpairs=ARATHA_ARALYR_crbh, selfblast1=ARATHA_selfblast_crbh, selfblast2=ARALYR_selfblast_crbh, gene.position.cds1=ARATHA.cds.longest.genepos, gene.position.cds2=ARALYR.cds.longest.genepos, type="bp", gap_length=10000, max_dist_allowed=200000, ignore_tandem=FALSE, only_tandem=FALSE, dagchainerpath=vignette.paths[2]) ## DAGchainer using gene base pair (gene bp start end) ## use only tandem duplicates ## Note: change parameter to fit bp option ARATHA_ARALYR_crbh.dagchainer.only_tandem.bp <- rbh2dagchainer( rbhpairs=ARATHA_ARALYR_crbh, selfblast1=ARATHA_selfblast_crbh, selfblast2=ARALYR_selfblast_crbh, gene.position.cds1=ARATHA.cds.longest.genepos, gene.position.cds2=ARALYR.cds.longest.genepos, type="bp", gap_length=10000, max_dist_allowed=200000, ignore_tandem=FALSE, only_tandem=TRUE, dagchainerpath=vignette.paths[2]) ## get help #?rbh2dagchainer ``` ``` ## example how to plot pairwise chromosomal syntenic groups ## for inter and intra species hits (DAGchainer results) ## plot DAGchainer results for each chromosome combination dim(ARATHA_ARALYR_crbh.dagchainer.ignore_tandem.bp) plot_dagchainer(ARATHA_ARALYR_crbh.dagchainer.ignore_tandem.bp, DotPlotTitle="inter and intra species hits - ignore tandem duplicates") dim(ARATHA_ARALYR_crbh.dagchainer.tandem.bp) plot_dagchainer(ARATHA_ARALYR_crbh.dagchainer.tandem.bp, DotPlotTitle="inter and intra species hits - use tandem duplicates") dim(ARATHA_ARALYR_crbh.dagchainer.only_tandem.bp) plot_dagchainer(ARATHA_ARALYR_crbh.dagchainer.only_tandem.bp, DotPlotTitle="inter and intra species hits - use only tandem duplicates") ``` ![Figure: Dagchainer results of A. thaliana and A. lyrata using inter and intra species hits ignore tandem](V02-dagchainer-inter-plus-intra-ARATHA-ARALYR-ignore-tandem.png) ![Figure: Dagchainer results of A. thaliana and A. lyrata using inter and intra species hits use tandem](V02-dagchainer-inter-plus-intra-ARATHA-ARALYR-tandem.png) ![Figure: Dagchainer results of A. thaliana and A. lyrata using inter and intra species hits only tandem](V02-dagchainer-inter-plus-intra-ARATHA-ARALYR-only-tandem.png) ## 2. [Ka/Ks Calculation]{#KaKs} The second part of this vignette describes how to calculate synonymous (Ks) and nonsynonymous substitution rates (Ka) given a set of CRBHit pairs (see [Calculate/Filter CRBHit pairs](#crbhfilter) section how to obtain CRBHit pairs). The `rbh2kaks()` function is used to obtain pairwise codon alignments and directly calculate synonymous (Ks) and nonsynonymous (Ka) substitutions in a parallel fashion. One can choose either to use the model by [Li (1993)](https://doi.org/10.1007/bf02407308) or the model by [Yang and Nielson (2000)](https://doi.org/10.1093/oxfordjournals.molbev.a026236). __Note:__ It is important, that the names of the `rbh` columns must match the names of the corresponding `cds1` and `cds2` `DNAStringSet` vectors. However, since one should directly use the same input `DNAStringSet` vector or input `url` to calculate the CRBHit pairs with the `cds2rbh()` or the `cdsfile2rbh()` function, this should not be a problem (see the [Get Longest Isoform from NCBI or ENSEMBL Input](#isoform) section how to obtain the CDS for the next example). To calculate Ka/Ks values based on the model by [Li (1993)](https://doi.org/10.1007/bf02407308), do the following: __Note:__ The following example can take some time and is not calculated by the vignette building process. ``` ## example how to calculate Ka/Ks values with model "Li" ## calculate Ka/Ks using the CRBHit pairs and multiple threads ARATHA_ARALYR_crbh.kaks.Li <- rbh2kaks( rbhpairs=ARATHA_ARALYR_crbh, cds1=ARATHA.cds.longest, cds2=ARALYR.cds.longest, model="Li", threads=4) ``` To calculate Ka/Ks values based on the model by [Yang and Nielson (2000)](https://doi.org/10.1093/oxfordjournals.molbev.a026236), do the following: __Note:__ The following example can take some time and is not calculated by the vignette building process. ``` ## example how to calculate Ka/Ks values with model "YN" ## calculate Ka/Ks using the CRBHit pairs and multiple threads ARATHA_ARALYR_crbh.kaks.YN <- rbh2kaks( rbhpairs=ARATHA_ARALYR_crbh, cds1=ARATHA.cds.longest, cds2=ARALYR.cds.longest, model="YN", threads=4) ``` ## 3. [Ka/Ks Filtering and Visualisation]{#KaKsFiltering} In some cases the Ka/Ks calculation fails, since e.g. there are no synonymous (Ks) or nonsynonymous substitutions (Ka) between cds1 and cds2 or one wants to get rid of high Ks values due to substitution saturation. One can easily remove these cases before doing further analysis. A plotting function `plot_kaks()` is integrated and can be applied on the Ka/Ks results. ``` ## load example Ka/Ks values (see above commands above how to obtain them) data("ath_aly_ncbi_kaks", package="CRBHits") ## plot Ka/Ks results as histogram colored by Ka/Ks values g <- plot_kaks(kaks=ath_aly_ncbi_kaks) ``` ![Figure: Ka/Ks results as histogram colored by Ka/ks](V02-kaks-hist.png) ``` ## plot Ka/Ks results as histogram filter for ka.min, ka.max, ks.min, ks.max g.min_max <- plot_kaks( kaks=ath_aly_ncbi_kaks, ka.min=0, ka.max=1, ks.min=0, ks.max=1) ``` ![Figure: Ka/Ks results filtered for ka.min, ka.max, ks.max](V02-kaks-hist-min-max.png) ``` ## select subset of chromosomes - needs gene position information head(ARATHA.cds.longest.genepos) head(ARALYR.cds.longest.genepos) g.subset <- plot_kaks( kaks=ath_aly_ncbi_kaks, gene.position.cds1=ARATHA.cds.longest.genepos, gene.position.cds2=ARALYR.cds.longest.genepos, select.chr=c( "NC_003070.9", "NC_003071.7", "NW_003302551.1", "NW_003302554.1")) ``` ![Figure: Ka/Ks results filtered for chromosomes](V02-kaks-hist-chr.png) ``` ## plot Ka/Ks results and split by chromosome - needs gene position information g.split <- plot_kaks( kaks=ath_aly_ncbi_kaks, gene.position.cds1=ARATHA.cds.longest.genepos, gene.position.cds2=ARALYR.cds.longest.genepos, select.chr=c( "NC_003070.9", "NC_003071.7", "NW_003302551.1", "NW_003302554.1"), splitByChr=TRUE) ``` ![Figure: Ka/Ks results split by chromosomes](V02-kaks-hist-chr-split.png) __Note:__ Default **data.frame** handling with the **dplyr** package is possible on the original Ka/Ks results as well as on the resulting plot objects. Apply `dplyr::filter()` and `dplyr::mutate()` function: ``` ## examples how to filter and mutate Ka/Ks results ## filter for Ks values < 2 on original Ka/Ks results head(ath_aly_ncbi_kaks) dim(ath_aly_ncbi_kaks) dim(ath_aly_ncbi_kaks %>% dplyr::filter(ks<2)) ## filter for Ks values < 2 on plot object head(g.split$g.kaks$data) dim(g.split$g.kaks$data) dim(g.split$g.kaks$data %>% dplyr::filter(ks<2)) ## filter for Ks values < 1 on plot object and plot g.split$g.kaks$data %>% dplyr::filter(ks<1) %>% ggplot2::ggplot() + ggplot2::geom_histogram(binwidth=0.1, aes(x=ks)) ``` ![Figure: Ks results as histogram](V02-ks-hist.png) ## 4. [*Homo sapiens* vs. *Pan troglodytes* example]{#homosapiens} The second example compares the CDS from *Homo sapiens* and *Pan troglodytes*. The input sequences are from the FTP server from [Ensembl](https://ensembl.org/index.html). The [Ensembl release-101](ftp://ftp.ensembl.org/pub/release-101/fasta/) is used. ``` ## example comparing Homo sapiens and Pan troglodytes ## set URLs for Homo sapiens and Pan troglodytes from Ensembl ## set Ensembl URL ensembl <- "ftp://ftp.ensembl.org/pub/release-105/fasta/" ## set Homo sapiens CDS URL HOMSAP.cds.url <- paste0(ensembl, "homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz") ## set Pan troglodytes CDS URL PANTRO.cds.url <- paste0(ensembl, "pan_troglodytes/cds/Pan_troglodytes.Pan_tro_3.0.cds.all.fa.gz") ## get Homo sapiens CDS HOMSAP.cds.file <- tempfile() download.file(HOMSAP.cds.url, HOMSAP.cds.file, quiet = FALSE) HOMSAP.cds <- Biostrings::readDNAStringSet(HOMSAP.cds.file) ## get Pan troglodytes CDS PANTRO.cds.file <- tempfile() download.file(PANTRO.cds.url, PANTRO.cds.file, quiet = FALSE) PANTRO.cds <- Biostrings::readDNAStringSet(PANTRO.cds.file) ## get longest isoform HOMSAP.cds.longest <- isoform2longest(HOMSAP.cds, "ENSEMBL") PANTRO.cds.longest <- isoform2longest(PANTRO.cds, "ENSEMBL") ## calculate CRBHit pairs for H. sapiens and P. troglodytes using 2 threads ## input from CDS obtained from Ensembl ## longest isoform selection ## query coverage >= 50% ## rost199 filter HOMSAP_PANTRO_crbh <- cds2rbh( cds1=HOMSAP.cds, cds2=PANTRO.cds, qcov=0.5, rost1999=TRUE, longest.isoform=TRUE, isoform.source="ENSEMBL", threads=2, plotCurve=TRUE, lastpath=vignette.paths[1]) ``` ![Figure: CRBHit pairs of H. sapiens and P. troglodytes](V02-CRBHit-pairs-HOMSAP-PANTRO.png) ``` ## get gene position for H. sapiens longest isoforms HOMSAP.cds.longest.genepos <- cds2genepos( cds=HOMSAP.cds.longest, source="ENSEMBL") ## get gene position for P. troglodytes longest isoforms PANTRO.cds.longest.genepos <- cds2genepos( cds=PANTRO.cds.longest, source="ENSEMBL") ## DAGchainer using gene base pair (gene bp start end) ## Note: change parameter to fit bp option HOMSAP_PANTRO_crbh.dagchainer.bp <- rbh2dagchainer( rbhpairs=HOMSAP_PANTRO_crbh, gene.position.cds1=HOMSAP.cds.longest.genepos, gene.position.cds2=PANTRO.cds.longest.genepos, type="bp", gap_length=10000, max_dist_allowed=200000, dagchainerpath=vignette.paths[2]) ## plot DAGchainer results for selected chromosome combinations plot_dagchainer( dag=HOMSAP_PANTRO_crbh.dagchainer.bp, select.chr=c( "AA1:1","AA1:2","AA1:3","AA1:4","AA1:5","AA1:14", "AA2:1","AA2:3","AA2:4","AA2:5","AA2:14")) ``` ![Figure: Dagchainer results of H. sapiens and P. troglodytes](V02-dagchainer-default-HOMSAP-PANTRO-chr.png) ``` ## get selfblast CRBHit pairs for H. sapiens HOMSAP_selfblast_crbh <- cds2rbh( cds1=HOMSAP.cds, cds2=HOMSAP.cds, qcov=0.5, rost1999=TRUE, longest.isoform=TRUE, isoform.source="ENSEMBL", threads=2, plotCurve=FALSE, lastpath=vignette.paths[1]) PANTRO_selfblast_crbh <- cds2rbh( cds1=PANTRO.cds, cds2=PANTRO.cds, qcov=0.5, rost1999=TRUE, longest.isoform=TRUE, isoform.source="ENSEMBL", threads=2, plotCurve=FALSE, lastpath=vignette.paths[1]) ``` ``` ## example how to run DAGchainer on CRBHit pairs ## and gene positions using inter and intra species hits ## DAGchainer using gene base pair (gene bp start end) ## use tandem duplicates ## Note: change parameter to fit bp option HOMSAP_PANTRO_crbh.dagchainer.tandem.bp <- rbh2dagchainer( rbhpairs=HOMSAP_PANTRO_crbh, selfblast1=HOMSAP_selfblast_crbh, selfblast2=PANTRO_selfblast_crbh, gene.position.cds1=HOMSAP.cds.longest.genepos, gene.position.cds2=PANTRO.cds.longest.genepos, type="bp", gap_length=10000, max_dist_allowed=200000, ignore_tandem=FALSE, only_tandem=FALSE, dagchainerpath=vignette.paths[2]) ## plot DAGchainer results for selected chromosome combinations plot_dagchainer( dag=HOMSAP_PANTRO_crbh.dagchainer.tandem.bp, select.chr=c( "AA1:1","AA1:2","AA1:3","AA1:4","AA1:5","AA1:14", "AA2:1","AA2:3","AA2:4","AA2:5","AA2:14")) ``` ![Figure: Dagchainer results of H. sapiens and P. troglodytes](V02-dagchainer-inter-plus-intra-HOMSAP-PANTRO-chr.png) __Note:__ The following example can take some time and is not calculated by the vignette building process. ``` ## example how to calculate Ka/Ks values with model "Li" ## calculate Ka/Ks using the CRBHit pairs and multiple threads HOMSAP_PANTRO_crbh.kaks.Li <- rbh2kaks( rbhpairs=HOMSAP_PANTRO_crbh, cds1=HOMSAP.cds.longest, cds2=PANTRO.cds.longest, model="Li", threads=4) ``` ``` ## load example Ka/Ks values (see above commands above how to obtain them) data("hom_pan_ensembl_kaks", package="CRBHits") ## plot Ka/Ks results as histogram colored by Ka/Ks values g <- plot_kaks(hom_pan_ensembl_kaks) ``` ![Figure: Ka/Ks results of H. sapiens and P. troglodytes](V02-kaks-hist-HOMSAP-PANTRO.png) ## 5. [References]{#references} Aubry S., Kelly S., Kümpers B. M., Smith-Unna R. D., and Hibberd J. M. (2014). **Deep evolutionary comparison of gene expression identifies parallel recruitment of trans-factors in two independent origins of C4 photosynthesis.** *PLoS Genetics*, **10(6)**. [https://doi.org/10.1371/journal.pgen.1004365](https://doi.org/10.1371/journal.pgen.1004365) Charif D., and Lobry J. R. (2007). **SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis.** *In Structural approaches to sequence evolution* (pp. 207-232). Springer, Berlin, Heidelberg. [https://link.springer.com/chapter/10.1007/978-3-540-35306-5_10](https://link.springer.com/chapter/10.1007/978-3-540-35306-5_10) Duong T., and Wand M. (2015). **feature: Local Inferential Feature Significance for Multivariate Kernel Density Estimation.** *R package version 1.2.13*. [https://cran.r-project.org/web/packages/feature/](https://cran.r-project.org/web/packages/feature/) Haas B. J., Delcher A. L., Wortman J. R., and Salzberg S. L. (2004). **DAGchainer: a tool for mining segmental genome duplications and synteny.** *Bioinformatics*, **20(18)**, 3643-3646. [https://doi.org/10.1093/bioinformatics/bth397](https://doi.org/10.1093/bioinformatics/bth397) Haug-Baltzell A., Stephens S. A., Davey S., Scheidegger C. E., Lyons E. (2017). **SynMap2 and SynMap3D: web-based wholge-genome synteny browsers.** *Bioinformatics*, **33(14)**, 2197-2198. [https://academic.oup.com/bioinformatics/article/33/14/2197/3072872](https://academic.oup.com/bioinformatics/article/33/14/2197/3072872) Kiełbasa S. M., Wan R., Sato K., Horton P., and Frith M. C. (2011). **Adaptive seeds tame genomic sequence comparison.** *Genome Research*, **21(3)**, 487-493. [https://doi.org/10.1101/gr.113985.110](https://doi.org/10.1101/gr.113985.110) Kimura M. (1977). **Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution.** *Nature*, **267**, 275-276. Li W. H. (1993). **Unbiased estimation of the rates of synonymous and nonsynonymous substitution.** *Journal of Molecular Evolution*, **36(1)**, 96-99. [https://doi.org/10.1007/bf02407308](https://doi.org/10.1007/bf02407308) Microsoft, and Weston S. (2020). **foreach: Provides Foreach Looping Construct.** *R package version*, 1.5.1. [foreach](https://cran.r-project.org/web/packages/foreach/index.html) Mugal C. F., Wolf J. B. W., Kaj I. (2014). **Why Time Matters: Codon Evolution and the Temproal Dynamics of dN/dS.** *Molecular Biology and Evolution*, **31(1)**, 212-231. Ooms J. (2019). **curl: A Modern and Flexible Web Client for R.** *R package version*, 4.3. [curl](https://cran.r-project.org/web/packages/curl/index.html) Pagès H., Aboyoun P., Gentleman R., and DebRoy S. (2017). **Biostrings: Efficient manipulation of biological strings.** *R package version*, 2.56.0. [Biostrings](http://bioconductor.org/packages/release/bioc/html/Biostrings.html) Revolution Analytics, and Weston S. (2020). **doMC: Foreach Parallel Adaptor for 'parallel'.** *R package version*, 1.3.7. [doMC](https://cran.r-project.org/web/packages/doMC/index.html) Rost B. (1999). **Twilight zone of protein sequence alignments.** *Protein Engineering*, **12(2)**, 85-94. [https://doi.org/10.1093/protein/12.2.85](https://doi.org/10.1093/protein/12.2.85) Scott C. (2017). **shmlast: an improved implementation of conditional reciprocal best hits with LAST and Python.** *Journal of Open Source Software*, **2(9)**, 142. [https://joss.theoj.org/papers/10.21105/joss.00142](https://joss.theoj.org/papers/10.21105/joss.00142) Scrucca L., Fop M., Murphy T. B., and Raftery A. E. (2016) **mclust 5: clustering, classification and density estimation using Gaussian finite mixture models.** *The R Journal*, 8(1), 289-317. [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5096736/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5096736/) Wickham H. (2011). **testthat: Get Started with Testing.** *The R Journal*, 3(1), 5. [testthat](https://cran.r-project.org/web/packages/testthat/index.html) Wickham H. (2019). **stringr: Simple, Consistent Wrappers for Common String Operations.** *R package version*, 1.4.0. [stringr](https://cran.r-project.org/web/packages/stringr/index.html) Wickham H. (2020). **tidyr: Tidy Messy Data.** *R package version*, 1.1.2. [tidyr](https://cran.r-project.org/web/packages/tidyr/index.html) Wickham H., Hester J., and Chang W. (2020). **devtools: Tools to make Developing R Packages Easier.** *R package version*, (2.3.2). [devtools](https://cran.r-project.org/web/packages/devtools/index.html) Wickham H., François R., Henry L., and Müller K. (2020). **dplyr: A Grammar of Data Manipulation.** *R package version*, 1.0.2. [dplyr](https://cran.r-project.org/web/packages/dplyr/index.html) Yang Z., and Nielsen R. (2000). **Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models.** *Molecular Biology and Evolution*, **17(1)**, 32-43. [https://doi.org/10.1093/oxfordjournals.molbev.a026236](https://doi.org/10.1093/oxfordjournals.molbev.a026236)