| Title: | Conditional reciprocal best hits (CRBHits) in R |
|---|---|
| Description: | CRBHits calculates conditional reciprocal best hit (CRBHit) pairs and subsequently can classify tandem duplicates, assign syntney groups and calculate Ka/Ks values. |
| Authors: | Kristian K Ullrich [aut, cre] (ORCID: <https://orcid.org/0000-0003-4308-9626>) |
| Maintainer: | Kristian K Ullrich <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.10 |
| Built: | 2026-05-11 09:31:46 UTC |
| Source: | https://github.com/kullrich/crbhits |
This function calculates (conditional-)reciprocal best hit
(CRBHit) pairs from two AA AAStringSet's.
Conditional-reciprocal best hit pairs were introduced by
Aubry S, Kelly S et al. (2014).
Sequence searches are performed with last
Kiełbasa, SM et al. (2011) [default]
or with mmseqs2
Steinegger, M and Soeding, J (2017)
or with diamond
Buchfink, B et al. (2021).
If one specifies aa1 and aa2 as the same input a selfblast is conducted.
aa2rbh( aa1, aa2, dbfile1 = NULL, dbfile2 = NULL, searchtool = "last", lastpath = file.path(find.package("CRBHits"), "extdata", "last-1639", "bin"), lastD = 1e+06, lastm = 10, mmseqs2path = NULL, mmseqs2sensitivity = 5.7, mmseqs2maxseqs = 300, diamondpath = NULL, diamondsensitivity = "--sensitive", diamondmaxtargetseqs = 0, lambda3path = NULL, lambda3sensitivity = "sensitive", lambda3nummatches = 25, outpath = "/tmp", crbh = TRUE, keepSingleDirection = FALSE, eval = 0.001, qcov = 0, tcov = 0, pident = 0, alnlen = 0, rost1999 = FALSE, filter = NULL, plotCurve = FALSE, fit.type = "mean", fit.varweight = 0.1, fit.min = 5, threads = 1, remove = TRUE, remove.db = TRUE )aa2rbh( aa1, aa2, dbfile1 = NULL, dbfile2 = NULL, searchtool = "last", lastpath = file.path(find.package("CRBHits"), "extdata", "last-1639", "bin"), lastD = 1e+06, lastm = 10, mmseqs2path = NULL, mmseqs2sensitivity = 5.7, mmseqs2maxseqs = 300, diamondpath = NULL, diamondsensitivity = "--sensitive", diamondmaxtargetseqs = 0, lambda3path = NULL, lambda3sensitivity = "sensitive", lambda3nummatches = 25, outpath = "/tmp", crbh = TRUE, keepSingleDirection = FALSE, eval = 0.001, qcov = 0, tcov = 0, pident = 0, alnlen = 0, rost1999 = FALSE, filter = NULL, plotCurve = FALSE, fit.type = "mean", fit.varweight = 0.1, fit.min = 5, threads = 1, remove = TRUE, remove.db = TRUE )
aa1 |
aa1 sequences as |
aa2 |
aa2 sequences as |
dbfile1 |
aa1 db file [optional] |
dbfile2 |
aa2 db file [optional] |
searchtool |
specify sequence search algorithm last, mmseqs2, diamond or lambda3 [default: last] |
lastpath |
specify the PATH to the last binaries [default: /extdata/last-1639/bin/] |
lastD |
last option D: query letters per random alignment [default: 1e6] |
lastm |
last option m: maximum initial matches per query position [default: 10] |
mmseqs2path |
specify the PATH to the mmseqs2 binaries [default: NULL] |
mmseqs2sensitivity |
specify the sensitivity option of mmseqs2 [default: 5.7] |
mmseqs2maxseqs |
mmseqs2 option: Maximum results per query sequence allowed to pass the prefilter [default: 300] |
diamondpath |
specify the PATH to the diamond binaries [default: NULL] |
diamondsensitivity |
specify the sensitivity option of diamond [default: –sensitive] |
diamondmaxtargetseqs |
specify the maximum number of target sequences per query option of diamond [default: 0] |
lambda3path |
specify the PATH to the lambda3 binaries [default: NULL] |
lambda3sensitivity |
specify the sensitivity option of lambda3 [default: sensitive] |
lambda3nummatches |
specify the number of matches per query option of lambda3 [default: 25] |
outpath |
specify the output PATH [default: /tmp] |
crbh |
specify if conditional-reciprocal hit pairs should be retained as secondary hits [default: TRUE] |
keepSingleDirection |
specify if single direction secondary hit pairs should be retained [default: FALSE] |
eval |
evalue [default: 1e-3] |
qcov |
query coverage [default: 0.0] |
tcov |
target coverage [default: 0.0] |
pident |
percent identity [default: 0.0] |
alnlen |
alignment length [default: 0.0] |
rost1999 |
specify if hit pairs should be filter by equation 2 of Rost 1999 [default: FALSE] |
filter |
specify additional custom filters as list to be applied on hit pairs [default: NULL] |
plotCurve |
specify if crbh fitting curve should be plotted [default: FALSE] |
fit.type |
specify if mean or median should be used for fitting [default: mean] |
fit.varweight |
factor for fitting function to consider neighborhood [default: 0.1] |
fit.min |
specify minimum neighborhood alignment length [default: 5] |
threads |
number of parallel threads [default: 1] |
remove |
specify if last result files should be removed [default: TRUE] |
remove.db |
specify if last db files should be removed [default: TRUE] |
List of three (crbh=FALSE)
1: $crbh.pairs
2: $crbh1 matrix; query > target
3: $crbh2 matrix; target > query
List of four (crbh=TRUE)
1: $crbh.pairs
2: $crbh1 matrix; query > target
3: $crbh2 matrix; target > query
4: $rbh1_rbh2_fit; evalue fitting function
Kristian K Ullrich
Aubry S, Kelly S et al. (2014) Deep Evolutionary Comparison of Gene Expression Identifies Parallel Recruitment of Trans-Factors in Two Independent Origins of C4 Photosynthesis. PLOS Genetics, 10(6) e1004365.
Kiełbasa, SM et al. (2011) Adaptive seeds tame genomic sequence comparison. Genome research, 21(3), 487-493.
Rost B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12(2), 85-94.
## compile last-1639 within CRBHits CRBHits::make_last() ## load example sequence data data("ath", package="CRBHits") data("aly", package="CRBHits") ath.aa <- MSA2dist::cds2aa(ath) aly.aa <- MSA2dist::cds2aa(aly) ## get CRBHit pairs ath_aly_crbh <- aa2rbh( aa1=ath.aa, aa2=aly.aa, plotCurve=TRUE) dim(ath_aly_crbh$crbh.pairs) ## get classical reciprocal best hit (RBHit) pairs ath_aly_rbh <- aa2rbh( aa1=ath.aa, aa2=aly.aa, crbh=FALSE) dim(ath_aly_rbh$crbh.pairs) ## selfblast ath_selfblast_crbh <- aa2rbh( aa1=ath.aa, aa2=ath.aa, plotCurve=TRUE) ## see ?cds2rbh for more examples## compile last-1639 within CRBHits CRBHits::make_last() ## load example sequence data data("ath", package="CRBHits") data("aly", package="CRBHits") ath.aa <- MSA2dist::cds2aa(ath) aly.aa <- MSA2dist::cds2aa(aly) ## get CRBHit pairs ath_aly_crbh <- aa2rbh( aa1=ath.aa, aa2=aly.aa, plotCurve=TRUE) dim(ath_aly_crbh$crbh.pairs) ## get classical reciprocal best hit (RBHit) pairs ath_aly_rbh <- aa2rbh( aa1=ath.aa, aa2=aly.aa, crbh=FALSE) dim(ath_aly_rbh$crbh.pairs) ## selfblast ath_selfblast_crbh <- aa2rbh( aa1=ath.aa, aa2=ath.aa, plotCurve=TRUE) ## see ?cds2rbh for more examples
This function calculates (conditional-)reciprocal best hit (CRBHit) pairs for all possible comparison including self comparison from a directory of AA fasta files. Sequence searches are performed with last Kiełbasa, SM et al. (2011) [default] or with mmseqs2 Steinegger, M and Soeding, J (2017) or with diamond Buchfink, B et al. (2021).
aadir2orthofinder( dir, file_ending = "*", searchtool = "last", lastpath = file.path(find.package("CRBHits"), "extdata", "last-1639", "bin"), lastD = 1e+06, lastm = 10, mmseqs2path = NULL, mmseqs2sensitivity = 5.7, mmseqs2maxseqs = 300, diamondpath = NULL, diamondsensitivity = "--sensitive", diamondmaxtargetseqs = 0, lambda3path = NULL, lambda3sensitivity = "sensitive", lambda3nummatches = 25, outpath = "/tmp", crbh = TRUE, keepSingleDirection = FALSE, eval = 0.001, qcov = 0, tcov = 0, pident = 0, alnlen = 0, rost1999 = FALSE, filter = NULL, fit.type = "mean", fit.varweight = 0.1, fit.min = 5, threads = 1, remove = TRUE )aadir2orthofinder( dir, file_ending = "*", searchtool = "last", lastpath = file.path(find.package("CRBHits"), "extdata", "last-1639", "bin"), lastD = 1e+06, lastm = 10, mmseqs2path = NULL, mmseqs2sensitivity = 5.7, mmseqs2maxseqs = 300, diamondpath = NULL, diamondsensitivity = "--sensitive", diamondmaxtargetseqs = 0, lambda3path = NULL, lambda3sensitivity = "sensitive", lambda3nummatches = 25, outpath = "/tmp", crbh = TRUE, keepSingleDirection = FALSE, eval = 0.001, qcov = 0, tcov = 0, pident = 0, alnlen = 0, rost1999 = FALSE, filter = NULL, fit.type = "mean", fit.varweight = 0.1, fit.min = 5, threads = 1, remove = TRUE )
dir |
directory containing AA fasta files [mandatory] |
file_ending |
define file ending to consider [default: *] |
searchtool |
specify sequence search algorithm last, mmseqs2, diamond or lambda3 [default: last] |
lastpath |
specify the PATH to the last binaries [default: /extdata/last-1639/bin/] |
lastD |
last option D: query letters per random alignment [default: 1e6] |
lastm |
last option m: maximum initial matches per query position [default: 10] |
mmseqs2path |
specify the PATH to the mmseqs2 binaries [default: NULL] |
mmseqs2sensitivity |
specify the sensitivity option of mmseqs2 [default: 5.7] |
mmseqs2maxseqs |
mmseqs2 option: Maximum results per query sequence allowed to pass the prefilter [default: 300] |
diamondpath |
specify the PATH to the diamond binaries [default: NULL] |
diamondsensitivity |
specify the sensitivity option of diamond [default: –sensitive] |
diamondmaxtargetseqs |
specify the maximum number of target sequences per query option of diamond [default: 0] |
lambda3path |
specify the PATH to the lambda3 binaries [default: NULL] |
lambda3sensitivity |
specify the sensitivity option of lambda3 [default: sensitive] |
lambda3nummatches |
specify the number of matches per query option of lambda3 [default: 25] |
outpath |
specify the output PATH [default: /tmp] |
crbh |
specify if conditional-reciprocal hit pairs should be retained as secondary hits [default: TRUE] |
keepSingleDirection |
specify if single direction secondary hit pairs should be retained [default: FALSE] |
eval |
evalue [default: 1e-3] |
qcov |
query coverage [default: 0.0] |
tcov |
target coverage [default: 0.0] |
pident |
percent identity [default: 0.0] |
alnlen |
alignment length [default: 0.0] |
rost1999 |
specify if hit pairs should be filter by equation 2 of Rost 1999 [default: FALSE] |
filter |
specify additional custom filters as list to be applied on hit pairs [default: NULL] |
fit.type |
specify if mean or median should be used for fitting [default: mean] |
fit.varweight |
factor for fitting function to consider neighborhood [default: 0.1] |
fit.min |
specify minimum neighborhood alignment length [default: 5] |
threads |
number of parallel threads [default: 1] |
remove |
specify if last result files should be removed [default: TRUE] |
List of three (crbh=FALSE)
Kristian K Ullrich
Aubry S, Kelly S et al. (2014) Deep Evolutionary Comparison of Gene Expression Identifies Parallel Recruitment of Trans-Factors in Two Independent Origins of C4 Photosynthesis. PLOS Genetics, 10(6) e1004365.
Kiełbasa, SM et al. (2011) Adaptive seeds tame genomic sequence comparison. Genome research, 21(3), 487-493.
Rost B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12(2), 85-94.
## compile last-1639 within CRBHits CRBHits::make_last()## compile last-1639 within CRBHits CRBHits::make_last()
This function calculates (conditional-)reciprocal best hit (CRBHit) pairs from two AA fasta files. Conditional-reciprocal best hit pairs were introduced by Aubry S, Kelly S et al. (2014). Sequence searches are performed with last Kiełbasa, SM et al. (2011) [default] or with mmseqs2 Steinegger, M and Soeding, J (2017) or with diamond Buchfink, B et al. (2021). If one specifies aafile1 and aafile2 as the same input a selfblast is conducted.
aafile2rbh( aafile1, aafile2, dbfile1 = NULL, dbfile2 = NULL, searchtool = "last", lastpath = file.path(find.package("CRBHits"), "extdata", "last-1639", "bin"), lastD = 1e+06, lastm = 10, mmseqs2path = NULL, mmseqs2sensitivity = 5.7, mmseqs2maxseqs = 300, diamondpath = NULL, diamondsensitivity = "--sensitive", diamondmaxtargetseqs = 0, lambda3path = NULL, lambda3sensitivity = "sensitive", lambda3nummatches = 25, outpath = "/tmp", crbh = TRUE, keepSingleDirection = FALSE, eval = 0.001, qcov = 0, tcov = 0, pident = 0, alnlen = 0, rost1999 = FALSE, filter = NULL, plotCurve = FALSE, fit.type = "mean", fit.varweight = 0.1, fit.min = 5, threads = 1, aafile2tmp = TRUE, remove = TRUE, remove.db = TRUE )aafile2rbh( aafile1, aafile2, dbfile1 = NULL, dbfile2 = NULL, searchtool = "last", lastpath = file.path(find.package("CRBHits"), "extdata", "last-1639", "bin"), lastD = 1e+06, lastm = 10, mmseqs2path = NULL, mmseqs2sensitivity = 5.7, mmseqs2maxseqs = 300, diamondpath = NULL, diamondsensitivity = "--sensitive", diamondmaxtargetseqs = 0, lambda3path = NULL, lambda3sensitivity = "sensitive", lambda3nummatches = 25, outpath = "/tmp", crbh = TRUE, keepSingleDirection = FALSE, eval = 0.001, qcov = 0, tcov = 0, pident = 0, alnlen = 0, rost1999 = FALSE, filter = NULL, plotCurve = FALSE, fit.type = "mean", fit.varweight = 0.1, fit.min = 5, threads = 1, aafile2tmp = TRUE, remove = TRUE, remove.db = TRUE )
aafile1 |
aa1 fasta file [mandatory] |
aafile2 |
aa2 fasta file [mandatory] |
dbfile1 |
aa1 db file [optional] |
dbfile2 |
aa2 db file [optional] |
searchtool |
specify sequence search algorithm last, mmseqs2, diamond or lambda3 [default: last] |
lastpath |
specify the PATH to the last binaries [default: /extdata/last-1639/bin/] |
lastD |
last option D: query letters per random alignment [default: 1e6] |
lastm |
last option m: maximum initial matches per query position [default: 10] |
mmseqs2path |
specify the PATH to the mmseqs2 binaries [default: NULL] |
mmseqs2sensitivity |
specify the sensitivity option of mmseqs2 [default: 5.7] |
mmseqs2maxseqs |
mmseqs2 option: Maximum results per query sequence allowed to pass the prefilter [default: 300] |
diamondpath |
specify the PATH to the diamond binaries [default: NULL] |
diamondsensitivity |
specify the sensitivity option of diamond [default: –sensitive] |
diamondmaxtargetseqs |
specify the maximum number of target sequences per query option of diamond [default: 0] |
lambda3path |
specify the PATH to the lambda3 binaries [default: NULL] |
lambda3sensitivity |
specify the sensitivity option of lambda3 [default: sensitive] |
lambda3nummatches |
specify the number of matches per query option of lambda3 [default: 25] |
outpath |
specify the output PATH [default: /tmp] |
crbh |
specify if conditional-reciprocal hit pairs should be retained as secondary hits [default: TRUE] |
keepSingleDirection |
specify if single direction secondary hit pairs should be retained [default: FALSE] |
eval |
evalue [default: 1e-3] |
qcov |
query coverage [default: 0.0] |
tcov |
target coverage [default: 0.0] |
pident |
percent identity [default: 0.0] |
alnlen |
alignment length [default: 0.0] |
rost1999 |
specify if hit pairs should be filter by equation 2 of Rost 1999 [default: FALSE] |
filter |
specify additional custom filters as list to be applied on hit pairs [default: NULL] |
plotCurve |
specify if crbh fitting curve should be plotted [default: FALSE] |
fit.type |
specify if mean or median should be used for fitting [default: mean] |
fit.varweight |
factor for fitting function to consider neighborhood [default: 0.1] |
fit.min |
specify minimum neighborhood alignment length [default: 5] |
threads |
number of parallel threads [default: 1] |
aafile2tmp |
specify if aa input files sequenceIDs should be reduced to the first word and be written to a temporary aa file [default: TRUE] |
remove |
specify if last result files should be removed [default: TRUE] |
remove.db |
specify if last db files should be removed [default: TRUE] |
List of three (crbh=FALSE)
1: $crbh.pairs
2: $crbh1 matrix; query > target
3: $crbh2 matrix; target > query
List of four (crbh=TRUE)
1: $crbh.pairs
2: $crbh1 matrix; query > target
3: $crbh2 matrix; target > query
4: $rbh1_rbh2_fit; evalue fitting function
Kristian K Ullrich
Aubry S, Kelly S et al. (2014) Deep Evolutionary Comparison of Gene Expression Identifies Parallel Recruitment of Trans-Factors in Two Independent Origins of C4 Photosynthesis. PLOS Genetics, 10(6) e1004365.
Kiełbasa, SM et al. (2011) Adaptive seeds tame genomic sequence comparison. Genome research, 21(3), 487-493.
Rost B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12(2), 85-94.
## compile last-1639 within CRBHits CRBHits::make_last() ## load example sequence data athfile <- system.file("fasta", "ath.aa.fasta.gz", package="CRBHits") alyfile <- system.file("fasta", "aly.aa.fasta.gz", package="CRBHits") ## get CRBHit pairs ath_aly_crbh <- aafile2rbh( aafile1=athfile, aafile2=alyfile, plotCurve=TRUE) dim(ath_aly_crbh$crbh.pairs) ## get classical reciprocal best hit (RBHit) pairs ath_aly_rbh <- aafile2rbh( aafile1=athfile, aafile2=alyfile, crbh=FALSE) dim(ath_aly_rbh$crbh.pairs) ## selfblast ath_selfblast_crbh <- aafile2rbh( aafile1=athfile, aafile2=athfile, plotCurve=TRUE) ## see ?cds2rbh for more examples## compile last-1639 within CRBHits CRBHits::make_last() ## load example sequence data athfile <- system.file("fasta", "ath.aa.fasta.gz", package="CRBHits") alyfile <- system.file("fasta", "aly.aa.fasta.gz", package="CRBHits") ## get CRBHit pairs ath_aly_crbh <- aafile2rbh( aafile1=athfile, aafile2=alyfile, plotCurve=TRUE) dim(ath_aly_crbh$crbh.pairs) ## get classical reciprocal best hit (RBHit) pairs ath_aly_rbh <- aafile2rbh( aafile1=athfile, aafile2=alyfile, crbh=FALSE) dim(ath_aly_rbh$crbh.pairs) ## selfblast ath_selfblast_crbh <- aafile2rbh( aafile1=athfile, aafile2=athfile, plotCurve=TRUE) ## see ?cds2rbh for more examples
This function calculates (conditional-)reciprocal best hit (CRBHit) pairs from two AA fasta files. Conditional-reciprocal best hit pairs were introduced by Aubry S, Kelly S et al. (2014). Sequence searches are performed with last Kiełbasa, SM et al. (2011) [default] or with mmseqs2 Steinegger, M and Soeding, J (2017) or with diamond Buchfink, B et al. (2021). If one specifies aafile1 and aafile2 as the same input a selfblast is conducted.
aafile2rbhplus( aafile1, aafile2, dbfile1 = NULL, dbfile2 = NULL, searchtool = "last", lastpath = file.path(find.package("CRBHits"), "extdata", "last-1639", "bin"), lastD = 1e+06, lastm = 10, mafconvertpath = paste0(find.package("CRBHits"), "/extdata/maf-convert-crbhits"), mmseqs2path = NULL, mmseqs2sensitivity = 5.7, mmseqs2maxseqs = 300, diamondpath = NULL, diamondsensitivity = "--sensitive", diamondmaxtargetseqs = 0, lambda3path = NULL, lambda3sensitivity = "sensitive", lambda3nummatches = 25, outpath = "/tmp", crbh = TRUE, keepSingleDirection = FALSE, eval = 0.001, qcov = 0, tcov = 0, pident = 0, alnlen = 0, rost1999 = FALSE, filter = NULL, plotCurve = FALSE, fit.type = "mean", fit.varweight = 0.1, fit.min = 5, threads = 1, aafile2tmp = TRUE, remove = TRUE, remove.db = TRUE )aafile2rbhplus( aafile1, aafile2, dbfile1 = NULL, dbfile2 = NULL, searchtool = "last", lastpath = file.path(find.package("CRBHits"), "extdata", "last-1639", "bin"), lastD = 1e+06, lastm = 10, mafconvertpath = paste0(find.package("CRBHits"), "/extdata/maf-convert-crbhits"), mmseqs2path = NULL, mmseqs2sensitivity = 5.7, mmseqs2maxseqs = 300, diamondpath = NULL, diamondsensitivity = "--sensitive", diamondmaxtargetseqs = 0, lambda3path = NULL, lambda3sensitivity = "sensitive", lambda3nummatches = 25, outpath = "/tmp", crbh = TRUE, keepSingleDirection = FALSE, eval = 0.001, qcov = 0, tcov = 0, pident = 0, alnlen = 0, rost1999 = FALSE, filter = NULL, plotCurve = FALSE, fit.type = "mean", fit.varweight = 0.1, fit.min = 5, threads = 1, aafile2tmp = TRUE, remove = TRUE, remove.db = TRUE )
aafile1 |
aa1 fasta file [mandatory] |
aafile2 |
aa2 fasta file [mandatory] |
dbfile1 |
aa1 db file [optional] |
dbfile2 |
aa2 db file [optional] |
searchtool |
specify sequence search algorithm last, mmseqs2, diamond or lambda3 [default: last] |
lastpath |
specify the PATH to the last binaries [default: /extdata/last-1639/bin/] |
lastD |
last option D: query letters per random alignment [default: 1e6] |
lastm |
last option m: maximum initial matches per query position [default: 10] |
mafconvertpath |
specify the PATH to the maf-convert script [default: /extdata/maf-convert-crbhits] |
mmseqs2path |
specify the PATH to the mmseqs2 binaries [default: NULL] |
mmseqs2sensitivity |
specify the sensitivity option of mmseqs2 [default: 5.7] |
mmseqs2maxseqs |
mmseqs2 option: Maximum results per query sequence allowed to pass the prefilter [default: 300] |
diamondpath |
specify the PATH to the diamond binaries [default: NULL] |
diamondsensitivity |
specify the sensitivity option of diamond [default: –sensitive] |
diamondmaxtargetseqs |
specify the maximum number of target sequences per query option of diamond [default: 0] |
lambda3path |
specify the PATH to the lambda3 binaries [default: NULL] |
lambda3sensitivity |
specify the sensitivity option of lambda3 [default: sensitive] |
lambda3nummatches |
specify the number of matches per query option of lambda3 [default: 25] |
outpath |
specify the output PATH [default: /tmp] |
crbh |
specify if conditional-reciprocal hit pairs should be retained as secondary hits [default: TRUE] |
keepSingleDirection |
specify if single direction secondary hit pairs should be retained [default: FALSE] |
eval |
evalue [default: 1e-3] |
qcov |
query coverage [default: 0.0] |
tcov |
target coverage [default: 0.0] |
pident |
percent identity [default: 0.0] |
alnlen |
alignment length [default: 0.0] |
rost1999 |
specify if hit pairs should be filter by equation 2 of Rost 1999 [default: FALSE] |
filter |
specify additional custom filters as list to be applied on hit pairs [default: NULL] |
plotCurve |
specify if crbh fitting curve should be plotted [default: FALSE] |
fit.type |
specify if mean or median should be used for fitting [default: mean] |
fit.varweight |
factor for fitting function to consider neighborhood [default: 0.1] |
fit.min |
specify minimum neighborhood alignment length [default: 5] |
threads |
number of parallel threads [default: 1] |
aafile2tmp |
specify if aa input files sequenceIDs should be reduced to the first word and be written to a temporary aa file [default: TRUE] |
remove |
specify if last result files should be removed [default: TRUE] |
remove.db |
specify if last db files should be removed [default: TRUE] |
List of three (crbh=FALSE)
1: $crbh.pairs
2: $crbh1 matrix; query > target
3: $crbh2 matrix; target > query
List of four (crbh=TRUE)
1: $crbh.pairs
2: $crbh1 matrix; query > target
3: $crbh2 matrix; target > query
4: $rbh1_rbh2_fit; evalue fitting function
Kristian K Ullrich
Aubry S, Kelly S et al. (2014) Deep Evolutionary Comparison of Gene Expression Identifies Parallel Recruitment of Trans-Factors in Two Independent Origins of C4 Photosynthesis. PLOS Genetics, 10(6) e1004365.
Kiełbasa, SM et al. (2011) Adaptive seeds tame genomic sequence comparison. Genome research, 21(3), 487-493.
Rost B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12(2), 85-94.
## compile last-1639 within CRBHits CRBHits::make_last() ## load example sequence data athfile <- system.file("fasta", "ath.aa.fasta.gz", package="CRBHits") alyfile <- system.file("fasta", "aly.aa.fasta.gz", package="CRBHits") ## get CRBHit pairs ath_aly_crbh <- aafile2rbhplus( aafile1=athfile, aafile2=alyfile, plotCurve=TRUE) dim(ath_aly_crbh$crbh.pairs) ## get classical reciprocal best hit (RBHit) pairs ath_aly_rbh <- aafile2rbhplus( aafile1=athfile, aafile2=alyfile, crbh=FALSE) dim(ath_aly_rbh$crbh.pairs) ## selfblast ath_selfblast_crbh <- aafile2rbhplus( aafile1=athfile, aafile2=athfile, plotCurve=TRUE) ## see ?cds2rbh for more examples## compile last-1639 within CRBHits CRBHits::make_last() ## load example sequence data athfile <- system.file("fasta", "ath.aa.fasta.gz", package="CRBHits") alyfile <- system.file("fasta", "aly.aa.fasta.gz", package="CRBHits") ## get CRBHit pairs ath_aly_crbh <- aafile2rbhplus( aafile1=athfile, aafile2=alyfile, plotCurve=TRUE) dim(ath_aly_crbh$crbh.pairs) ## get classical reciprocal best hit (RBHit) pairs ath_aly_rbh <- aafile2rbhplus( aafile1=athfile, aafile2=alyfile, crbh=FALSE) dim(ath_aly_rbh$crbh.pairs) ## selfblast ath_selfblast_crbh <- aafile2rbhplus( aafile1=athfile, aafile2=athfile, plotCurve=TRUE) ## see ?cds2rbh for more examples
Example cds sequences from Arabidopsis lyrata as
DNAStringSet.
data(aly)data(aly)
an object of class DNAStringSet
see XStringSet-class
data("aly", package="CRBHits")data("aly", package="CRBHits")
Example conditional-reciprocal best hit (CRBHit) pair results.
data(ath_aly_crbh)data(ath_aly_crbh)
an object of class list
data("ath_aly_crbh", package="CRBHits")data("ath_aly_crbh", package="CRBHits")
Example of DAGchainer results between Arabidopsis thalina and Arabidopsis lyrata obtained from NCBI.
data(ath_aly_ncbi_dagchainer)data(ath_aly_ncbi_dagchainer)
an object of class list
data("ath_aly_ncbi_dagchainer", package="CRBHits")data("ath_aly_ncbi_dagchainer", package="CRBHits")
Example of Ka/Ks values between Arabidopsis thalina and Arabidopsis lyrata obtained from NCBI.
data(ath_aly_ncbi_kaks)data(ath_aly_ncbi_kaks)
an object of class list
data("ath_aly_ncbi_kaks", package="CRBHits")data("ath_aly_ncbi_kaks", package="CRBHits")
Example cds sequences from Arabidopsis thaliana as
DNAStringSet.
data(ath)data(ath)
an object of class DNAStringSet
see XStringSet-class
data("ath", package="CRBHits")data("ath", package="CRBHits")
This function extracts the gene position from either NCBI or ENSEMBL CDS input.
cds2genepos(cds, source = "NCBI", keep.names = NULL)cds2genepos(cds, source = "NCBI", keep.names = NULL)
cds |
|
source |
source indicating either NCBI or ENSEMBL [default: NCBI] |
keep.names |
vector indicating gene ids to be kept before chromosomal position assignment [default: NULL] |
matrix
1: $gene.seq.id
2: $gene.chr
3: $gene.start
4: $gene.end
5: $gene.mid
6: $gene.strand
7: $gene.idx
Kristian K Ullrich
## load example sequence data data("ath", package="CRBHits") ath.genepos <- cds2genepos( cds=ath, source="ENSEMBL") ## Not run: ## load example sequence data ## set EnsemblPlants URL ensemblPlants <- "ftp://ftp.ensemblgenomes.org/pub/plants/release-48/fasta/" ## set Arabidopsis thaliana CDS URL ARATHA.cds.url <- paste0(ensemblPlants, "arabidopsis_thaliana/cds/Arabidopsis_thaliana.TAIR10.cds.all.fa.gz") ARATHA.cds.file <- tempfile() ## download CDS download.file(ARATHA.cds.url, ARATHA.cds.file, quiet=FALSE) ARATHA.cds <- Biostrings::readDNAStringSet(ARATHA.cds.file) ## get gene position ARATHA.cds.genepos <- cds2genepos(ARATHA.cds, "ENSEMBL") ## End(Not run)## load example sequence data data("ath", package="CRBHits") ath.genepos <- cds2genepos( cds=ath, source="ENSEMBL") ## Not run: ## load example sequence data ## set EnsemblPlants URL ensemblPlants <- "ftp://ftp.ensemblgenomes.org/pub/plants/release-48/fasta/" ## set Arabidopsis thaliana CDS URL ARATHA.cds.url <- paste0(ensemblPlants, "arabidopsis_thaliana/cds/Arabidopsis_thaliana.TAIR10.cds.all.fa.gz") ARATHA.cds.file <- tempfile() ## download CDS download.file(ARATHA.cds.url, ARATHA.cds.file, quiet=FALSE) ARATHA.cds <- Biostrings::readDNAStringSet(ARATHA.cds.file) ## get gene position ARATHA.cds.genepos <- cds2genepos(ARATHA.cds, "ENSEMBL") ## End(Not run)
This function calculates (conditional-)reciprocal best hit
(CRBHit) pairs from two CDS DNAStringSet's.
CRBHit pairs were introduced by Aubry S, Kelly S et al. (2014).
Sequence searches are performed with last
Kiełbasa, SM et al. (2011) [default]
or with mmseqs2
Steinegger, M and Soeding, J (2017)
or with diamond
Buchfink, B et al. (2021).
If one specifies cds1 and cds2 as the same input a selfblast is conducted.
cds2rbh( cds1, cds2, dbfile1 = NULL, dbfile2 = NULL, searchtool = "last", lastpath = file.path(find.package("CRBHits"), "extdata", "last-1639", "bin"), lastD = 1e+06, lastm = 10, mmseqs2path = NULL, mmseqs2sensitivity = 5.7, mmseqs2maxseqs = 300, diamondpath = NULL, diamondsensitivity = "--sensitive", diamondmaxtargetseqs = 0, lambda3path = NULL, lambda3sensitivity = "sensitive", lambda3nummatches = 25, outpath = "/tmp", crbh = TRUE, keepSingleDirection = FALSE, eval = 0.001, qcov = 0, tcov = 0, pident = 0, alnlen = 0, rost1999 = FALSE, filter = NULL, plotCurve = FALSE, fit.type = "mean", fit.varweight = 0.1, fit.min = 5, longest.isoform = FALSE, isoform.source = "NCBI", threads = 1, remove = TRUE, remove.db = TRUE, shorten1 = FALSE, frame1 = 1, framelist1 = NULL, genetic.code1 = NULL, shorten2 = FALSE, frame2 = 1, framelist2 = NULL, genetic.code2 = NULL )cds2rbh( cds1, cds2, dbfile1 = NULL, dbfile2 = NULL, searchtool = "last", lastpath = file.path(find.package("CRBHits"), "extdata", "last-1639", "bin"), lastD = 1e+06, lastm = 10, mmseqs2path = NULL, mmseqs2sensitivity = 5.7, mmseqs2maxseqs = 300, diamondpath = NULL, diamondsensitivity = "--sensitive", diamondmaxtargetseqs = 0, lambda3path = NULL, lambda3sensitivity = "sensitive", lambda3nummatches = 25, outpath = "/tmp", crbh = TRUE, keepSingleDirection = FALSE, eval = 0.001, qcov = 0, tcov = 0, pident = 0, alnlen = 0, rost1999 = FALSE, filter = NULL, plotCurve = FALSE, fit.type = "mean", fit.varweight = 0.1, fit.min = 5, longest.isoform = FALSE, isoform.source = "NCBI", threads = 1, remove = TRUE, remove.db = TRUE, shorten1 = FALSE, frame1 = 1, framelist1 = NULL, genetic.code1 = NULL, shorten2 = FALSE, frame2 = 1, framelist2 = NULL, genetic.code2 = NULL )
cds1 |
cds1 sequences as |
cds2 |
cds2 sequences as |
dbfile1 |
aa1 db file [optional] |
dbfile2 |
aa2 db file [optional] |
searchtool |
specify sequence search algorithm last, mmseqs2, diamond or lambda3 [default: last] |
lastpath |
specify the PATH to the last binaries [default: /extdata/last-1639/bin/] |
lastD |
last option D: query letters per random alignment [default: 1e6] |
lastm |
last option m: maximum initial matches per query position [default: 10] |
mmseqs2path |
specify the PATH to the mmseqs2 binaries [default: NULL] |
mmseqs2sensitivity |
specify the sensitivity option of mmseqs2 [default: 5.7] |
mmseqs2maxseqs |
mmseqs2 option: Maximum results per query sequence allowed to pass the prefilter [default: 300] |
diamondpath |
specify the PATH to the diamond binaries [default: NULL] |
diamondsensitivity |
specify the sensitivity option of diamond [default: –sensitive] |
diamondmaxtargetseqs |
specify the maximum number of target sequences per query option of diamond [default: 0] |
lambda3path |
specify the PATH to the lambda3 binaries [default: NULL] |
lambda3sensitivity |
specify the sensitivity option of lambda3 [default: sensitive] |
lambda3nummatches |
specify the number of matches per query option of lambda3 [default: 25] |
outpath |
specify the output PATH [default: /tmp] |
crbh |
specify if conditional-reciprocal hit pairs should be retained as secondary hits [default: TRUE] |
keepSingleDirection |
specify if single direction secondary hit pairs should be retained [default: FALSE] |
eval |
evalue [default: 1e-3] |
qcov |
query coverage [default: 0.0] |
tcov |
target coverage [default: 0.0] |
pident |
percent identity [default: 0.0] |
alnlen |
alignment length [default: 0.0] |
rost1999 |
specify if hit pairs should be filter by equation 2 of Rost 1999 [default: FALSE] |
filter |
specify additional custom filters as list to be applied on hit pairs [default: NULL] |
plotCurve |
specify if crbh fitting curve should be plotted [default: FALSE] |
fit.type |
specify if mean or median should be used for fitting [default: mean] |
fit.varweight |
factor for fitting function to consider neighborhood [default: 0.1] |
fit.min |
specify minimum neighborhood alignment length [default: 5] |
longest.isoform |
specify if cds sequences should be reduced to the longest isoform (only possible if data was accessed from NCBI or ENSEMBL) [default: FALSE] |
isoform.source |
specify cds sequences source (either NCBI or ENSEMBL) [default: NCBI] |
threads |
number of parallel threads [default: 1] |
remove |
specify if last result files should be removed [default: TRUE] |
remove.db |
specify if last db files should be removed [default: TRUE] |
shorten1 |
shorten all sequences to multiple of three [default: FALSE] |
frame1 |
indicates the first base of the first codon [default: 1] |
framelist1 |
supply vector of frames for each entry [default: NULL] |
genetic.code1 |
The genetic code to use for the translation of codons into Amino Acid letters [default: NULL] |
shorten2 |
shorten all sequences to multiple of three [default: FALSE] |
frame2 |
indicates the first base of the first codon [default: 1] |
framelist2 |
supply vector of frames for each entry [default: NULL] |
genetic.code2 |
The genetic code to use for the translation of codons into Amino Acid letters [default: NULL] |
List of three (crbh=FALSE)
1: $crbh.pairs
2: $crbh1 matrix; query > target
3: $crbh2 matrix; target > query
List of four (crbh=TRUE)
1: $crbh.pairs
2: $crbh1 matrix; query > target
3: $crbh2 matrix; target > query
4: $rbh1_rbh2_fit; evalue fitting function
Kristian K Ullrich
Aubry S, Kelly S et al. (2014) Deep Evolutionary Comparison of Gene Expression Identifies Parallel Recruitment of Trans-Factors in Two Independent Origins of C4 Photosynthesis. PLOS Genetics, 10(6) e1004365.
Kiełbasa, SM et al. (2011) Adaptive seeds tame genomic sequence comparison. Genome research, 21(3), 487-493.
Rost B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12(2), 85-94.
## compile last-1639 within CRBHits CRBHits::make_last() ## load example sequence data data("ath", package="CRBHits") data("aly", package="CRBHits") ## get CRBHit pairs ath_aly_crbh <- cds2rbh( cds1=ath, cds2=aly, plotCurve=TRUE) dim(ath_aly_crbh$crbh.pairs) ## get classical reciprocal best hit (RBHit) pairs ath_aly_rbh <- cds2rbh( cds1=ath, cds2=aly, crbh=FALSE) dim(ath_aly_rbh$crbh.pairs) ## filter for evalue 1e-100 ath_aly_crbh.eval100 <- cds2rbh( cds1=ath, cds2=aly, eval=1e-100) dim(ath_aly_crbh.eval100$crbh.pairs) ## filter for query coverage ath_aly_crbh.qcov <- cds2rbh( cds1=ath, cds2=aly, qcov=0.5) dim(ath_aly_crbh.qcov$crbh.pairs) ## custom filter for e.g. bit score (column 12) ## Note: multiple filters can be given in filter param as list myfilter <- function(rbh, value=500.0){ return(rbh[as.numeric(rbh[, 12])>=value, , drop=FALSE]) } ath_aly_crbh.custom <- cds2rbh( cds1=ath, cds2=aly, filter=list(myfilter)) dim(ath_aly_crbh.custom$crbh.pairs) ## selfblast ath_selfblast_crbh <- cds2rbh( cds1=ath, cds2=ath, plotCurve=TRUE)## compile last-1639 within CRBHits CRBHits::make_last() ## load example sequence data data("ath", package="CRBHits") data("aly", package="CRBHits") ## get CRBHit pairs ath_aly_crbh <- cds2rbh( cds1=ath, cds2=aly, plotCurve=TRUE) dim(ath_aly_crbh$crbh.pairs) ## get classical reciprocal best hit (RBHit) pairs ath_aly_rbh <- cds2rbh( cds1=ath, cds2=aly, crbh=FALSE) dim(ath_aly_rbh$crbh.pairs) ## filter for evalue 1e-100 ath_aly_crbh.eval100 <- cds2rbh( cds1=ath, cds2=aly, eval=1e-100) dim(ath_aly_crbh.eval100$crbh.pairs) ## filter for query coverage ath_aly_crbh.qcov <- cds2rbh( cds1=ath, cds2=aly, qcov=0.5) dim(ath_aly_crbh.qcov$crbh.pairs) ## custom filter for e.g. bit score (column 12) ## Note: multiple filters can be given in filter param as list myfilter <- function(rbh, value=500.0){ return(rbh[as.numeric(rbh[, 12])>=value, , drop=FALSE]) } ath_aly_crbh.custom <- cds2rbh( cds1=ath, cds2=aly, filter=list(myfilter)) dim(ath_aly_crbh.custom$crbh.pairs) ## selfblast ath_selfblast_crbh <- cds2rbh( cds1=ath, cds2=ath, plotCurve=TRUE)
This function calculates (conditional-)reciprocal best hit (CRBHit) pairs for all possible comparison including self comparison from a directory of CDS fasta files. Sequence searches are performed with last Kiełbasa, SM et al. (2011) [default] or with mmseqs2 Steinegger, M and Soeding, J (2017) or with diamond Buchfink, B et al. (2021).
cdsdir2orthofinder( dir, file_ending = "*", searchtool = "last", lastpath = file.path(find.package("CRBHits"), "extdata", "last-1639", "bin"), lastD = 1e+06, lastm = 10, mmseqs2path = NULL, mmseqs2sensitivity = 5.7, mmseqs2maxseqs = 300, diamondpath = NULL, diamondsensitivity = "--sensitive", diamondmaxtargetseqs = 0, lambda3path = NULL, lambda3sensitivity = "sensitive", lambda3nummatches = 25, outpath = "/tmp", dbpath = "/tmp", crbh = TRUE, keepSingleDirection = FALSE, eval = 0.001, qcov = 0, tcov = 0, pident = 0, alnlen = 0, rost1999 = FALSE, filter = NULL, fit.type = "mean", fit.varweight = 0.1, fit.min = 5, threads = 1, remove = FALSE, remove.db = FALSE )cdsdir2orthofinder( dir, file_ending = "*", searchtool = "last", lastpath = file.path(find.package("CRBHits"), "extdata", "last-1639", "bin"), lastD = 1e+06, lastm = 10, mmseqs2path = NULL, mmseqs2sensitivity = 5.7, mmseqs2maxseqs = 300, diamondpath = NULL, diamondsensitivity = "--sensitive", diamondmaxtargetseqs = 0, lambda3path = NULL, lambda3sensitivity = "sensitive", lambda3nummatches = 25, outpath = "/tmp", dbpath = "/tmp", crbh = TRUE, keepSingleDirection = FALSE, eval = 0.001, qcov = 0, tcov = 0, pident = 0, alnlen = 0, rost1999 = FALSE, filter = NULL, fit.type = "mean", fit.varweight = 0.1, fit.min = 5, threads = 1, remove = FALSE, remove.db = FALSE )
dir |
directory containing CDS fasta files [mandatory] |
file_ending |
define file ending to consider [default: *] |
searchtool |
specify sequence search algorithm last, mmseqs2, diamond or lambda3 [default: last] |
lastpath |
specify the PATH to the last binaries [default: /extdata/last-1639/bin/] |
lastD |
last option D: query letters per random alignment [default: 1e6] |
lastm |
last option m: maximum initial matches per query position [default: 10] |
mmseqs2path |
specify the PATH to the mmseqs2 binaries [default: NULL] |
mmseqs2sensitivity |
specify the sensitivity option of mmseqs2 [default: 5.7] |
mmseqs2maxseqs |
mmseqs2 option: Maximum results per query sequence allowed to pass the prefilter [default: 300] |
diamondpath |
specify the PATH to the diamond binaries [default: NULL] |
diamondsensitivity |
specify the sensitivity option of diamond [default: –sensitive] |
diamondmaxtargetseqs |
specify the maximum number of target sequences per query option of diamond [default: 0] |
lambda3path |
specify the PATH to the lambda3 binaries [default: NULL] |
lambda3sensitivity |
specify the sensitivity option of lambda3 [default: sensitive] |
lambda3nummatches |
specify the number of matches per query option of lambda3 [default: 25] |
outpath |
specify the output PATH [default: /tmp] |
dbpath |
specify the output PATH [default: /tmp] |
crbh |
specify if conditional-reciprocal hit pairs should be retained as secondary hits [default: TRUE] |
keepSingleDirection |
specify if single direction secondary hit pairs should be retained [default: FALSE] |
eval |
evalue [default: 1e-3] |
qcov |
query coverage [default: 0.0] |
tcov |
target coverage [default: 0.0] |
pident |
percent identity [default: 0.0] |
alnlen |
alignment length [default: 0.0] |
rost1999 |
specify if hit pairs should be filter by equation 2 of Rost 1999 [default: FALSE] |
filter |
specify additional custom filters as list to be applied on hit pairs [default: NULL] |
fit.type |
specify if mean or median should be used for fitting [default: mean] |
fit.varweight |
factor for fitting function to consider neighborhood [default: 0.1] |
fit.min |
specify minimum neighborhood alignment length [default: 5] |
threads |
number of parallel threads [default: 1] |
remove |
specify if last result files should be removed [default: TRUE] |
remove.db |
specify if last db files should be removed [default: TRUE] |
List of three (crbh=FALSE)
Kristian K Ullrich
Aubry S, Kelly S et al. (2014) Deep Evolutionary Comparison of Gene Expression Identifies Parallel Recruitment of Trans-Factors in Two Independent Origins of C4 Photosynthesis. PLOS Genetics, 10(6) e1004365.
Kiełbasa, SM et al. (2011) Adaptive seeds tame genomic sequence comparison. Genome research, 21(3), 487-493.
Rost B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12(2), 85-94.
## compile last-1639 within CRBHits CRBHits::make_last()## compile last-1639 within CRBHits CRBHits::make_last()
This function translates a cds fasta file into an aa fasta file.
cdsfile2aafile( infile, outfile, shorten = FALSE, frame = 1, framelist = NULL, genetic.code = NULL )cdsfile2aafile( infile, outfile, shorten = FALSE, frame = 1, framelist = NULL, genetic.code = NULL )
infile |
cds fasta file [mandatory] |
outfile |
aa fasta file [mandatory] |
shorten |
shorten all sequences to multiple of three [default: FALSE] |
frame |
indicates the first base of a the first codon [default: 1] |
framelist |
supply vector of frames for each entry [default: NULL] |
genetic.code |
The genetic code to use for the translation of codons into Amino Acid letters [default: NULL] |
aa fasta file
Kristian K Ullrich
## define file path cdsfile <- system.file("fasta", "ath.cds.fasta.gz", package="CRBHits") ## create empty temp file aafile <- tempfile() ## convert input CDS fasta file into AA fasta file cdsfile2aafile(cdsfile, aafile) aa <- Biostrings::readAAStringSet(aafile) aa## define file path cdsfile <- system.file("fasta", "ath.cds.fasta.gz", package="CRBHits") ## create empty temp file aafile <- tempfile() ## convert input CDS fasta file into AA fasta file cdsfile2aafile(cdsfile, aafile) aa <- Biostrings::readAAStringSet(aafile) aa
This function calculates (conditional-)reciprocal best hit (CRBHit) pairs from two CDS fasta files. CRBHit pairs were introduced by Aubry S, Kelly S et al. (2014). Sequence searches are performed with last Kiełbasa, SM et al. (2011) [default] or with mmseqs2 Steinegger, M and Soeding, J (2017) or with diamond Buchfink, B et al. (2021). If one specifies cdsfile1 and cdsfile2 as the same input a selfblast is conducted.
cdsfile2rbh( cdsfile1, cdsfile2, dbfile1 = NULL, dbfile2 = NULL, searchtool = "last", lastpath = file.path(find.package("CRBHits"), "extdata", "last-1639", "bin"), lastD = 1e+06, lastm = 10, mmseqs2path = NULL, mmseqs2sensitivity = 5.7, mmseqs2maxseqs = 300, diamondpath = NULL, diamondsensitivity = "--sensitive", diamondmaxtargetseqs = 0, lambda3path = NULL, lambda3sensitivity = "sensitive", lambda3nummatches = 25, outpath = "/tmp", crbh = TRUE, keepSingleDirection = FALSE, eval = 0.001, qcov = 0, tcov = 0, pident = 0, alnlen = 0, rost1999 = FALSE, filter = NULL, plotCurve = FALSE, fit.type = "mean", fit.varweight = 0.1, fit.min = 5, longest.isoform = FALSE, isoform.source = "NCBI", threads = 1, remove = TRUE, remove.db = TRUE, shorten1 = FALSE, frame1 = 1, framelist1 = NULL, genetic.code1 = NULL, shorten2 = FALSE, frame2 = 1, framelist2 = NULL, genetic.code2 = NULL )cdsfile2rbh( cdsfile1, cdsfile2, dbfile1 = NULL, dbfile2 = NULL, searchtool = "last", lastpath = file.path(find.package("CRBHits"), "extdata", "last-1639", "bin"), lastD = 1e+06, lastm = 10, mmseqs2path = NULL, mmseqs2sensitivity = 5.7, mmseqs2maxseqs = 300, diamondpath = NULL, diamondsensitivity = "--sensitive", diamondmaxtargetseqs = 0, lambda3path = NULL, lambda3sensitivity = "sensitive", lambda3nummatches = 25, outpath = "/tmp", crbh = TRUE, keepSingleDirection = FALSE, eval = 0.001, qcov = 0, tcov = 0, pident = 0, alnlen = 0, rost1999 = FALSE, filter = NULL, plotCurve = FALSE, fit.type = "mean", fit.varweight = 0.1, fit.min = 5, longest.isoform = FALSE, isoform.source = "NCBI", threads = 1, remove = TRUE, remove.db = TRUE, shorten1 = FALSE, frame1 = 1, framelist1 = NULL, genetic.code1 = NULL, shorten2 = FALSE, frame2 = 1, framelist2 = NULL, genetic.code2 = NULL )
cdsfile1 |
cds1 fasta file [mandatory] |
cdsfile2 |
cds2 fasta file [mandatory] |
dbfile1 |
aa1 db file [optional] |
dbfile2 |
aa2 db file [optional] |
searchtool |
specify sequence search algorithm last, mmseqs2, diamond or lambda3 [default: last] |
lastpath |
specify the PATH to the last binaries [default: /extdata/last-1639/bin/] |
lastD |
last option D: query letters per random alignment [default: 1e6] |
lastm |
last option m: maximum initial matches per query position [default: 10] |
mmseqs2path |
specify the PATH to the mmseqs2 binaries [default: NULL] |
mmseqs2sensitivity |
specify the sensitivity option of mmseqs2 [default: 5.7] |
mmseqs2maxseqs |
mmseqs2 option: Maximum results per query sequence allowed to pass the prefilter [default: 300] |
diamondpath |
specify the PATH to the diamond binaries [default: NULL] |
diamondsensitivity |
specify the sensitivity option of diamond [default: –sensitive] |
diamondmaxtargetseqs |
specify the maximum number of target sequences per query option of diamond [default: 0] |
lambda3path |
specify the PATH to the lambda3 binaries [default: NULL] |
lambda3sensitivity |
specify the sensitivity option of lambda3 [default: sensitive] |
lambda3nummatches |
specify the number of matches per query option of lambda3 [default: 25] |
outpath |
specify the output PATH [default: /tmp] |
crbh |
specify if conditional-reciprocal hit pairs should be retained as secondary hits [default: TRUE] |
keepSingleDirection |
specify if single direction secondary hit pairs should be retained [default: FALSE] |
eval |
evalue [default: 1e-3] |
qcov |
query coverage [default: 0.0] |
tcov |
target coverage [default: 0.0] |
pident |
percent identity [default: 0.0] |
alnlen |
alignment length [default: 0.0] |
rost1999 |
specify if hit pairs should be filter by equation 2 of Rost 1999 [default: FALSE] |
filter |
specify additional custom filters as list to be applied on hit pairs [default: NULL] |
plotCurve |
specify if crbh fitting curve should be plotted [default: FALSE] |
fit.type |
specify if mean or median should be used for fitting [default: mean] |
fit.varweight |
factor for fitting function to consider neighborhood [default: 0.1] |
fit.min |
specify minimum neighborhood alignment length [default: 5] |
longest.isoform |
specify if cds sequences should be removed to the longest isoform (only possible if data was accessed from NCBI or ENSEMBL) [default: FALSE] |
isoform.source |
specify cds sequences source (either NCBI or ENSEMBL) [default: NCBI] |
threads |
number of parallel threads [default: 1] |
remove |
specify if last result files should be removed [default: TRUE] |
remove.db |
specify if last db files should be removed [default: TRUE] |
shorten1 |
shorten all sequences to multiple of three [default: FALSE] |
frame1 |
indicates the first base of the first codon [default: 1] |
framelist1 |
supply vector of frames for each entry [default: NULL] |
genetic.code1 |
The genetic code to use for the translation of codons into Amino Acid letters [default: NULL] |
shorten2 |
shorten all sequences to multiple of three [default: FALSE] |
frame2 |
indicates the first base of the first codon [default: 1] |
framelist2 |
supply vector of frames for each entry [default: NULL] |
genetic.code2 |
The genetic code to use for the translation of codons into Amino Acid letters [default: NULL] |
List of three (crbh=FALSE)
1: $crbh.pairs
2: $crbh1 matrix; query > target
3: $crbh2 matrix; target > query
List of four (crbh=TRUE)
1: $crbh.pairs
2: $crbh1 matrix; query > target
3: $crbh2 matrix; target > query
4: $rbh1_rbh2_fit; evalue fitting function
Kristian K Ullrich
Aubry S, Kelly S et al. (2014) Deep Evolutionary Comparison of Gene Expression Identifies Parallel Recruitment of Trans-Factors in Two Independent Origins of C4 Photosynthesis. PLOS Genetics, 10(6) e1004365.
Kiełbasa, SM et al. (2011) Adaptive seeds tame genomic sequence comparison. Genome research, 21(3), 487-493.
Steinegger, M and Soeding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35, 1026-1028.
Rost B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12(2), 85-94.
## compile last-1639 within CRBHits CRBHits::make_last() ## load example sequence data athfile <- system.file("fasta", "ath.cds.fasta.gz", package="CRBHits") alyfile <- system.file("fasta", "aly.cds.fasta.gz", package="CRBHits") ## get CRBHit pairs ath_aly_crbh <- cdsfile2rbh( cdsfile1=athfile, cdsfile2=alyfile, plotCurve=TRUE) dim(ath_aly_crbh$crbh.pairs) ## get classical reciprocal best hit (RBHit) pairs ath_aly_rbh <- cdsfile2rbh( cdsfile1=athfile, cdsfile2=alyfile, crbh=FALSE) dim(ath_aly_rbh$crbh.pairs) ## selfblast ath_selfblast_crbh <- cdsfile2rbh( cdsfile1=athfile, cdsfile2=athfile, plotCurve=TRUE) ## see ?cds2rbh for more examples## compile last-1639 within CRBHits CRBHits::make_last() ## load example sequence data athfile <- system.file("fasta", "ath.cds.fasta.gz", package="CRBHits") alyfile <- system.file("fasta", "aly.cds.fasta.gz", package="CRBHits") ## get CRBHit pairs ath_aly_crbh <- cdsfile2rbh( cdsfile1=athfile, cdsfile2=alyfile, plotCurve=TRUE) dim(ath_aly_crbh$crbh.pairs) ## get classical reciprocal best hit (RBHit) pairs ath_aly_rbh <- cdsfile2rbh( cdsfile1=athfile, cdsfile2=alyfile, crbh=FALSE) dim(ath_aly_rbh$crbh.pairs) ## selfblast ath_selfblast_crbh <- cdsfile2rbh( cdsfile1=athfile, cdsfile2=athfile, plotCurve=TRUE) ## see ?cds2rbh for more examples
This function checks if external binary exists
check_ext_install(ext_name, ext_dir, binary_name = ext_name)check_ext_install(ext_name, ext_dir, binary_name = ext_name)
ext_name |
external tool name [mandatory] |
ext_dir |
external tool directory [mandatory] |
binary_name |
external binary name (per default ext_name) [mandatory] |
path of prerequisites
Kristian K Ullrich
R color to transparent conversion.
col2transparent(col, alpha.perc = 0)col2transparent(col, alpha.perc = 0)
col |
R color name or R color pal [mandatory] |
alpha.perc |
an alpha-transparency level in percent [default: 0] |
A character vector with elements of 9 characters, "#" followed by the red, blue, green and alpha values in hexadecimal (after rescaling to 0 ... 255).
Kristian K Ullrich
col2transparent("green", 50) ## Demonstrate the colors 1:8 in different palettes using a custom matplot() sinplot <- function(main=NULL) { x <- outer( seq(-pi, pi, length.out = 50), seq(0, pi, length.out = 9), function(x, y) sin(x - y) ) matplot(x, type = "l", lwd = 4, lty = 1, col = 1:9, ylab = "", main=main) } my.palette <- c( "#CBC106", "#27993C", "#1C6838", "#8EBCB5", "#389CA7", "#4D83AB", "#CB7B26", "#BF565D", "#9E163C") palette(my.palette); sinplot("my.palette") ## 25% transparent palette(col2transparent(palette(my.palette), 25)); sinplot("my.palette - transparent") ## 50% transparent palette(col2transparent(palette(my.palette), 50)); sinplot("my.palette - transparent") ## 75% transparent palette(col2transparent(palette(my.palette), 75)); sinplot("my.palette - transparent")col2transparent("green", 50) ## Demonstrate the colors 1:8 in different palettes using a custom matplot() sinplot <- function(main=NULL) { x <- outer( seq(-pi, pi, length.out = 50), seq(0, pi, length.out = 9), function(x, y) sin(x - y) ) matplot(x, type = "l", lwd = 4, lty = 1, col = 1:9, ylab = "", main=main) } my.palette <- c( "#CBC106", "#27993C", "#1C6838", "#8EBCB5", "#389CA7", "#4D83AB", "#CB7B26", "#BF565D", "#9E163C") palette(my.palette); sinplot("my.palette") ## 25% transparent palette(col2transparent(palette(my.palette), 25)); sinplot("my.palette - transparent") ## 50% transparent palette(col2transparent(palette(my.palette), 50)); sinplot("my.palette - transparent") ## 75% transparent palette(col2transparent(palette(my.palette), 75)); sinplot("my.palette - transparent")
CRBHits specific color palette.
CRBHitsColors(n, alpha.perc = 0)CRBHitsColors(n, alpha.perc = 0)
n |
The number of colors (greater than or equal to 1) to be in the palette [mandatory] |
alpha.perc |
an alpha-transparency level in percent [default: 0] |
CRBHits specific color palette
Kristian K Ullrich
plot(1:9, pch=20, col=CRBHitsColors(9), cex=3) ## use alpha plot(1:9, pch=20, col=CRBHitsColors(9, 50), cex=3)plot(1:9, pch=20, col=CRBHitsColors(9), cex=3) ## use alpha plot(1:9, pch=20, col=CRBHitsColors(9, 50), cex=3)
This function filters BLAST-like tabular output according to alignment length.
filter_alnlen(rbh, alnlen = 0, inverse = FALSE)filter_alnlen(rbh, alnlen = 0, inverse = FALSE)
rbh |
BLAST-like tabular matrix [mandatory] |
alnlen |
alignment length [default: 0.0] |
inverse |
specify if filter should keep the removed values [default: FALSE] |
rbh matrix
Kristian K Ullrich
## load crbh data data(ath_aly_crbh) dim(ath_aly_crbh$crbh1) dim(filter_alnlen( rbh=ath_aly_crbh$crbh1, alnlen=75))## load crbh data data(ath_aly_crbh) dim(ath_aly_crbh$crbh1) dim(filter_alnlen( rbh=ath_aly_crbh$crbh1, alnlen=75))
This function filters BLAST-like tabular output according to evalue.
filter_eval(rbh, eval = 0.001, inverse = FALSE)filter_eval(rbh, eval = 0.001, inverse = FALSE)
rbh |
BLAST-like tabular matrix [mandatory] |
eval |
evalue [default: 1e-3] |
inverse |
specify if filter should keep the removed values [default: FALSE] |
rbh matrix
Kristian K Ullrich
## load crbh data data(ath_aly_crbh) dim(ath_aly_crbh$crbh1) dim(filter_eval( rbh=ath_aly_crbh$crbh1, eval=1e-100))## load crbh data data(ath_aly_crbh) dim(ath_aly_crbh$crbh1) dim(filter_eval( rbh=ath_aly_crbh$crbh1, eval=1e-100))
This function filters BLAST-like tabular output according to protein identity.
filter_pident(rbh, pident = 0, inverse = FALSE)filter_pident(rbh, pident = 0, inverse = FALSE)
rbh |
BLAST-like tabular matrix [mandatory] |
pident |
percent identity [default: 0.0] |
inverse |
specify if filter should keep the removed values [default: FALSE] |
rbh matrix
Kristian K Ullrich
## load crbh data data(ath_aly_crbh) dim(ath_aly_crbh$crbh1) dim(filter_pident( rbh=ath_aly_crbh$crbh1, pident=75))## load crbh data data(ath_aly_crbh) dim(ath_aly_crbh$crbh1) dim(filter_pident( rbh=ath_aly_crbh$crbh1, pident=75))
This function filters BLAST-like tabular output according to percentage query coverage.
filter_qcov(rbh, qcov = 0, inverse = FALSE)filter_qcov(rbh, qcov = 0, inverse = FALSE)
rbh |
BLAST-like tabular matrix [mandatory] |
qcov |
query coverage [default: 0.0] |
inverse |
specify if filter should keep the removed values [default: FALSE] |
rbh matrix
Kristian K Ullrich
## load crbh data data(ath_aly_crbh) dim(ath_aly_crbh$crbh1) dim(filter_qcov( rbh=ath_aly_crbh$crbh1, qcov=0.75))## load crbh data data(ath_aly_crbh) dim(ath_aly_crbh$crbh1) dim(filter_qcov( rbh=ath_aly_crbh$crbh1, qcov=0.75))
This function filters BLAST-like tabular output according to equation 2 of Rost 1999.
filter_rost1999(rbh, inverse = FALSE)filter_rost1999(rbh, inverse = FALSE)
rbh |
BLAST-like tabular matrix [mandatory] |
inverse |
specify if filter should keep the removed values [default: FALSE] |
rbh matrix
Kristian K Ullrich
Rost B. (1999). Twilight zone of protein sequence alignments. Protein Engineering, 12(2), 85-94.
## load crbh data data(ath_aly_crbh) dim(ath_aly_crbh$crbh1) dim(filter_rost1999( rbh=ath_aly_crbh$crbh1))## load crbh data data(ath_aly_crbh) dim(ath_aly_crbh$crbh1) dim(filter_rost1999( rbh=ath_aly_crbh$crbh1))
This function filters BLAST-like tabular output according to percentage target coverage.
filter_tcov(rbh, tcov = 0, inverse = FALSE)filter_tcov(rbh, tcov = 0, inverse = FALSE)
rbh |
BLAST-like tabular matrix [mandatory] |
tcov |
target coverage [default: 0.0] |
inverse |
specify if filter should keep the removed values [default: FALSE] |
rbh matrix
Kristian K Ullrich
## load crbh data data(ath_aly_crbh) dim(ath_aly_crbh$crbh1) dim(filter_tcov( rbh=ath_aly_crbh$crbh1, tcov=0.75))## load crbh data data(ath_aly_crbh) dim(ath_aly_crbh$crbh1) dim(filter_tcov( rbh=ath_aly_crbh$crbh1, tcov=0.75))
This function extracts the gene position from GFF3 input and optional extracts the longest isoform.
gff2longest(gff3file, cds = NULL, removeNonCoding = TRUE, source = "NCBI")gff2longest(gff3file, cds = NULL, removeNonCoding = TRUE, source = "NCBI")
gff3file |
|
cds |
|
removeNonCoding |
specify if NonCoding transcripts should be removed |
source |
source indicating either NCBI or ENSEMBL [default: NCBI] |
list
Kristian K Ullrich
## Not run: ## load example sequence data ## set NCBI GFF3 URL NCBI <- "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/" ARATHA.NCBI.gff3.url <- paste0(NCBI, "GCF/000/001/735/GCF_000001735.4_TAIR10.1/", "GCF_000001735.4_TAIR10.1_genomic.gff.gz") ARATHA.NCBI.gff3.file <- tempfile() ## download GTF file download.file(ARATHA.NCBI.gff3.url, ARATHA.NCBI.gff3.file, quiet=FALSE) ## set NCBI CDS URL ARATHA.NCBI.cds.url <- paste0(NCBI, "GCF/000/001/735/GCF_000001735.4_TAIR10.1/", "GCF_000001735.4_TAIR10.1_cds_from_genomic.fna.gz") ARATHA.NCBI.cds.file <- tempfile() ## download CDS file download.file(ARATHA.NCBI.cds.url, ARATHA.NCBI.cds.file, quiet=FALSE) ## load CDS ARATHA.NCBI.cds <- Biostrings::readDNAStringSet(ARATHA.NCBI.cds.file) ## get genepos and longest isoform ARATHA.NCBI.gff3.longest <- gff2longest(gtffile=ARATHA.NCBI.gff3.file, cds=ARATHA.NCBI.cds, source="NCBI") ARATHA.NCBI.gff3.longest$genepos ARATHA.NCBI.gff3.longest$cds ## set ENSEMBL GFF3 URL ensembl <- "http://ftp.ensemblgenomes.org/pub/plants/release-52/" ARATHA.ENSEMBL.gff3.url <- paste0(ensembl, "gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.52.gff3.gz") ARATHA.ENSEMBL.gff3.file <- tempfile() ## download GTF file download.file(ARATHA.ENSEMBL.gff3.url, ARATHA.ENSEMBL.gff3.file, quiet=FALSE) ## set ENSEMBL CDS URL ARATHA.ENSEMBL.cds.url <- paste0(ensembl, "fasta/arabidopsis_thaliana/cds/", "Arabidopsis_thaliana.TAIR10.cds.all.fa.gz") ARATHA.ENSEMBL.cds.file <- tempfile() ## download CDS file download.file(ARATHA.ENSEMBL.cds.url, ARATHA.ENSEMBL.cds.file, quiet=FALSE) ARATHA.ENSEMBL.cds <- Biostrings::readDNAStringSet(ARATHA.ENSEMBL.cds.file) ## get genepos and longest isoform ARATHA.ENSEMBL.gff3.longest <- gff2longest(gff3file=ARATHA.ENSEMBL.gff3.file, cds=ARATHA.ENSEMBL.cds) ARATHA.ENSEMBL.gff3.longest$genepos ARATHA.ENSEMBL.gff3.longest$cds ## End(Not run)## Not run: ## load example sequence data ## set NCBI GFF3 URL NCBI <- "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/" ARATHA.NCBI.gff3.url <- paste0(NCBI, "GCF/000/001/735/GCF_000001735.4_TAIR10.1/", "GCF_000001735.4_TAIR10.1_genomic.gff.gz") ARATHA.NCBI.gff3.file <- tempfile() ## download GTF file download.file(ARATHA.NCBI.gff3.url, ARATHA.NCBI.gff3.file, quiet=FALSE) ## set NCBI CDS URL ARATHA.NCBI.cds.url <- paste0(NCBI, "GCF/000/001/735/GCF_000001735.4_TAIR10.1/", "GCF_000001735.4_TAIR10.1_cds_from_genomic.fna.gz") ARATHA.NCBI.cds.file <- tempfile() ## download CDS file download.file(ARATHA.NCBI.cds.url, ARATHA.NCBI.cds.file, quiet=FALSE) ## load CDS ARATHA.NCBI.cds <- Biostrings::readDNAStringSet(ARATHA.NCBI.cds.file) ## get genepos and longest isoform ARATHA.NCBI.gff3.longest <- gff2longest(gtffile=ARATHA.NCBI.gff3.file, cds=ARATHA.NCBI.cds, source="NCBI") ARATHA.NCBI.gff3.longest$genepos ARATHA.NCBI.gff3.longest$cds ## set ENSEMBL GFF3 URL ensembl <- "http://ftp.ensemblgenomes.org/pub/plants/release-52/" ARATHA.ENSEMBL.gff3.url <- paste0(ensembl, "gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.52.gff3.gz") ARATHA.ENSEMBL.gff3.file <- tempfile() ## download GTF file download.file(ARATHA.ENSEMBL.gff3.url, ARATHA.ENSEMBL.gff3.file, quiet=FALSE) ## set ENSEMBL CDS URL ARATHA.ENSEMBL.cds.url <- paste0(ensembl, "fasta/arabidopsis_thaliana/cds/", "Arabidopsis_thaliana.TAIR10.cds.all.fa.gz") ARATHA.ENSEMBL.cds.file <- tempfile() ## download CDS file download.file(ARATHA.ENSEMBL.cds.url, ARATHA.ENSEMBL.cds.file, quiet=FALSE) ARATHA.ENSEMBL.cds <- Biostrings::readDNAStringSet(ARATHA.ENSEMBL.cds.file) ## get genepos and longest isoform ARATHA.ENSEMBL.gff3.longest <- gff2longest(gff3file=ARATHA.ENSEMBL.gff3.file, cds=ARATHA.ENSEMBL.cds) ARATHA.ENSEMBL.gff3.longest$genepos ARATHA.ENSEMBL.gff3.longest$cds ## End(Not run)
This function extracts the gene position from GTF input and optional extracts the longest isoform.
gtf2longest(gtffile, cds = NULL, removeNonCoding = TRUE, source = "NCBI")gtf2longest(gtffile, cds = NULL, removeNonCoding = TRUE, source = "NCBI")
gtffile |
|
cds |
|
removeNonCoding |
specify if NonCoding transcripts should be removed |
source |
source indicating either NCBI or ENSEMBL [default: NCBI] |
list
Kristian K Ullrich
## Not run: ## load example sequence data ## set NCBI GTF URL NCBI <- "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/" ARATHA.NCBI.gtf.url <- paste0(NCBI, "GCF/000/001/735/GCF_000001735.4_TAIR10.1/", "GCF_000001735.4_TAIR10.1_genomic.gtf.gz") ARATHA.NCBI.gtf.file <- tempfile() ## download GTF file download.file(ARATHA.NCBI.gtf.url, ARATHA.NCBI.gtf.file, quiet=FALSE) ## set NCBI CDS URL ARATHA.NCBI.cds.url <- paste0(NCBI, "GCF/000/001/735/GCF_000001735.4_TAIR10.1/", "GCF_000001735.4_TAIR10.1_cds_from_genomic.fna.gz") ARATHA.NCBI.cds.file <- tempfile() ## download CDS file download.file(ARATHA.NCBI.cds.url, ARATHA.NCBI.cds.file, quiet=FALSE) ## load CDS ARATHA.NCBI.cds <- Biostrings::readDNAStringSet(ARATHA.NCBI.cds.file) ## get genepos and longest isoform ARATHA.NCBI.gtf.longest <- gtf2longest(gtffile=ARATHA.NCBI.gtf.file, cds=ARATHA.NCBI.cds, source="NCBI") ARATHA.NCBI.gtf.longest$genepos ARATHA.NCBI.gtf.longest$cds ## set ENSEMBL GTF URL ensembl <- "http://ftp.ensemblgenomes.org/pub/plants/release-52/" ARATHA.ENSEMBL.gtf.url <- paste0(ensembl, "gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.52.gtf.gz") ARATHA.ENSEMBL.gtf.file <- tempfile() ## download GTF file download.file(ARATHA.ENSEMBL.gtf.url, ARATHA.ENSEMBL.gtf.file, quiet=FALSE) ## set ENSEMBL CDS URL ARATHA.ENSEMBL.cds.url <- paste0(ensembl, "fasta/arabidopsis_thaliana/cds/", "Arabidopsis_thaliana.TAIR10.cds.all.fa.gz") ARATHA.ENSEMBL.cds.file <- tempfile() ## download CDS file download.file(ARATHA.ENSEMBL.cds.url, ARATHA.ENSEMBL.cds.file, quiet=FALSE) ARATHA.ENSEMBL.cds <- Biostrings::readDNAStringSet(ARATHA.ENSEMBL.cds.file) ## get genepos and longest isoform ARATHA.ENSEMBL.gtf.longest <- gtf2longest(gtffile=ARATHA.ENSEMBL.gtf.file, cds=ARATHA.ENSEMBL.cds) ARATHA.ENSEMBL.gtf.longest$genepos ARATHA.ENSEMBL.gtf.longest$cds ## End(Not run)## Not run: ## load example sequence data ## set NCBI GTF URL NCBI <- "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/" ARATHA.NCBI.gtf.url <- paste0(NCBI, "GCF/000/001/735/GCF_000001735.4_TAIR10.1/", "GCF_000001735.4_TAIR10.1_genomic.gtf.gz") ARATHA.NCBI.gtf.file <- tempfile() ## download GTF file download.file(ARATHA.NCBI.gtf.url, ARATHA.NCBI.gtf.file, quiet=FALSE) ## set NCBI CDS URL ARATHA.NCBI.cds.url <- paste0(NCBI, "GCF/000/001/735/GCF_000001735.4_TAIR10.1/", "GCF_000001735.4_TAIR10.1_cds_from_genomic.fna.gz") ARATHA.NCBI.cds.file <- tempfile() ## download CDS file download.file(ARATHA.NCBI.cds.url, ARATHA.NCBI.cds.file, quiet=FALSE) ## load CDS ARATHA.NCBI.cds <- Biostrings::readDNAStringSet(ARATHA.NCBI.cds.file) ## get genepos and longest isoform ARATHA.NCBI.gtf.longest <- gtf2longest(gtffile=ARATHA.NCBI.gtf.file, cds=ARATHA.NCBI.cds, source="NCBI") ARATHA.NCBI.gtf.longest$genepos ARATHA.NCBI.gtf.longest$cds ## set ENSEMBL GTF URL ensembl <- "http://ftp.ensemblgenomes.org/pub/plants/release-52/" ARATHA.ENSEMBL.gtf.url <- paste0(ensembl, "gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.52.gtf.gz") ARATHA.ENSEMBL.gtf.file <- tempfile() ## download GTF file download.file(ARATHA.ENSEMBL.gtf.url, ARATHA.ENSEMBL.gtf.file, quiet=FALSE) ## set ENSEMBL CDS URL ARATHA.ENSEMBL.cds.url <- paste0(ensembl, "fasta/arabidopsis_thaliana/cds/", "Arabidopsis_thaliana.TAIR10.cds.all.fa.gz") ARATHA.ENSEMBL.cds.file <- tempfile() ## download CDS file download.file(ARATHA.ENSEMBL.cds.url, ARATHA.ENSEMBL.cds.file, quiet=FALSE) ARATHA.ENSEMBL.cds <- Biostrings::readDNAStringSet(ARATHA.ENSEMBL.cds.file) ## get genepos and longest isoform ARATHA.ENSEMBL.gtf.longest <- gtf2longest(gtffile=ARATHA.ENSEMBL.gtf.file, cds=ARATHA.ENSEMBL.cds) ARATHA.ENSEMBL.gtf.longest$genepos ARATHA.ENSEMBL.gtf.longest$cds ## End(Not run)
Example of Ka/Ks values between Homo sapiens and Pan troglodytes obtained from NCBI.
data(hom_pan_ensembl_kaks)data(hom_pan_ensembl_kaks)
an object of class list
data("hom_pan_ensembl_kaks", package="CRBHits")data("hom_pan_ensembl_kaks", package="CRBHits")
This function extracts the longest isoform from either NCBI or ENSEMBL CDS input.
isoform2longest(cds, source = "NCBI")isoform2longest(cds, source = "NCBI")
cds |
|
source |
source indicating either NCBI, ENSEMBL or WORMBASE [default: NCBI] |
DNAStringSet
Kristian K Ullrich
## Not run: ## load example sequence data ## set NCBI URL NCBI <- "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/" HOMSAP.NCBI.cds.url <- paste0(NCBI, "GCF/000/001/405/GCF_000001405.39_GRCh38.p13/", "GCF_000001405.39_GRCh38.p13_cds_from_genomic.fna.gz") HOMSAP.NCBI.cds.file <- tempfile() ## download CDS file download.file(HOMSAP.NCBI.cds.url, HOMSAP.NCBI.cds.file, quiet=FALSE) HOMSAP.NCBI.cds <- Biostrings::readDNAStringSet(HOMSAP.NCBI.cds.file) ## get longest isoform HOMSAP.NCBI.cds.longest <- isoform2longest(HOMSAP.NCBI.cds, "NCBI") length(HOMSAP.NCBI.cds) length(HOMSAP.NCBI.cds.longest) ## set ENSEMBL URL ensembl <- "ftp://ftp.ensembl.org/pub/release-101/fasta/" ## set Homo sapiens CDS URL HOMSAP.ENSEMBL.cds.url <- paste0(ensembl, "homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz") HOMSAP.ENSEMBL.cds.file <- tempfile() ## download CDS file download.file(HOMSAP.ENSEMBL.cds.url, HOMSAP.ENSEMBL.cds.file, quiet=FALSE) HOMSAP.ENSEMBL.cds <- Biostrings::readDNAStringSet(HOMSAP.ENSEMBL.cds.file) ## get longest isoform HOMSAP.ENSEMBL.cds.longest<-isoform2longest(HOMSAP.ENSEMBL.cds, "ENSEMBL") length(HOMSAP.ENSEMBL.cds) length(HOMSAP.ENSEMBL.cds.longest) ## End(Not run)## Not run: ## load example sequence data ## set NCBI URL NCBI <- "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/" HOMSAP.NCBI.cds.url <- paste0(NCBI, "GCF/000/001/405/GCF_000001405.39_GRCh38.p13/", "GCF_000001405.39_GRCh38.p13_cds_from_genomic.fna.gz") HOMSAP.NCBI.cds.file <- tempfile() ## download CDS file download.file(HOMSAP.NCBI.cds.url, HOMSAP.NCBI.cds.file, quiet=FALSE) HOMSAP.NCBI.cds <- Biostrings::readDNAStringSet(HOMSAP.NCBI.cds.file) ## get longest isoform HOMSAP.NCBI.cds.longest <- isoform2longest(HOMSAP.NCBI.cds, "NCBI") length(HOMSAP.NCBI.cds) length(HOMSAP.NCBI.cds.longest) ## set ENSEMBL URL ensembl <- "ftp://ftp.ensembl.org/pub/release-101/fasta/" ## set Homo sapiens CDS URL HOMSAP.ENSEMBL.cds.url <- paste0(ensembl, "homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz") HOMSAP.ENSEMBL.cds.file <- tempfile() ## download CDS file download.file(HOMSAP.ENSEMBL.cds.url, HOMSAP.ENSEMBL.cds.file, quiet=FALSE) HOMSAP.ENSEMBL.cds <- Biostrings::readDNAStringSet(HOMSAP.ENSEMBL.cds.file) ## get longest isoform HOMSAP.ENSEMBL.cds.longest<-isoform2longest(HOMSAP.ENSEMBL.cds, "ENSEMBL") length(HOMSAP.ENSEMBL.cds) length(HOMSAP.ENSEMBL.cds.longest) ## End(Not run)
This function tries to build the DAGchainer from source code forked within CRBHits
make_dagchainer(build_dir = file.path(find.package("CRBHits"), "extdata"))make_dagchainer(build_dir = file.path(find.package("CRBHits"), "extdata"))
build_dir |
directory to build DAGchainer [mandatory] |
compile DAGchainer and return binary path
Kristian K Ullrich
Haas BJ et al. (2004) DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics 20 (18), 3643-6.
This function tries to build the prerequisite last-1639 from source code forked within CRBHits
make_last(build_dir = file.path(find.package("CRBHits"), "extdata"))make_last(build_dir = file.path(find.package("CRBHits"), "extdata"))
build_dir |
directory to build last [mandatory] |
compile last and return binary path
Kristian K Ullrich
Kiełbasa SM et al. (2011) Adaptive seeds tame genomic sequence comparison. Genome Res. 21 (3), 487-93.
This function tries to build the prerequisite last-1639, and DAGchainer from source code forked within CRBHits
make_vignette()make_vignette()
path of prerequisites
Kristian K Ullrich
Kiełbasa SM et al. (2011) Adaptive seeds tame genomic sequence comparison. Genome Res. 21 (3), 487-93.
Wang D, Zhang Y et al. (2010) KaKs_Calculator 2.0: a toolkit incorporating gamma-series methods and sliding window strategies. Genomics Proteomics Bioinformatics. 8(1), 77-80.
Haas BJ et al. (2004) DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics 20 (18), 3643-6.
This function plots DAGchainer (http://dagchainer.sourceforge.net/) results obtained via 'rbh2dagchainer()' function.
plot_dagchainer( dag, DotPlotTitle = "DAGchainer results", colorBy = "none", kaks = NULL, ka.max = 5, ks.max = 5, ka.min = 0, ks.min = 0, select.chr = NULL )plot_dagchainer( dag, DotPlotTitle = "DAGchainer results", colorBy = "none", kaks = NULL, ka.max = 5, ks.max = 5, ka.min = 0, ks.min = 0, select.chr = NULL )
dag |
specify DAGchainer results as obtained via 'rbh2dagchainer()' [mandatory] |
DotPlotTitle |
specify DotPlot title [default: DAGchainer results] |
colorBy |
specify if dagchainer groups should be colored by "Ka", "Ks", "Ka/Ks" or "none" [default: none] |
kaks |
specify Ka/Ks input obtained via 'rbh2kaks()' [default: NULL] |
ka.max |
specify max Ka to be filtered [default: 5] |
ks.max |
specify max Ks to be filtered [default: 5] |
ka.min |
specify min Ka to be filtered [default: 0] |
ks.min |
specify min Ks to be filtered [default: 0] |
select.chr |
filter results for chromosome names [default: NULL] |
synteny plot
Kristian K Ullrich
## load example sequence data data("ath_aly_ncbi_dagchainer", package="CRBHits") ## plot DAGchainer results - default plot_dagchainer( dag=ath_aly_ncbi_dagchainer) ## chromosome subset plot_dagchainer( dag=ath_aly_ncbi_dagchainer, select.chr=c("NC_000932.1", "NC_034379.1"))## load example sequence data data("ath_aly_ncbi_dagchainer", package="CRBHits") ## plot DAGchainer results - default plot_dagchainer( dag=ath_aly_ncbi_dagchainer) ## chromosome subset plot_dagchainer( dag=ath_aly_ncbi_dagchainer, select.chr=c("NC_000932.1", "NC_034379.1"))
This function plots Ka/Ks results obtained via 'rbh2kaks()' function or 'MSA2dist::dnastring2kaks()' function.
plot_kaks( kaks, dag = NULL, gene.position.cds1 = NULL, gene.position.cds2 = NULL, tandem.dups.cds1 = NULL, tandem.dups.cds2 = NULL, PlotTitle = "Ka/Ks results", PlotType = "h", binw = 0.05, splitByChr = FALSE, colorBy = "none", ka.max = 5, ks.max = 5, ka.min = 0, ks.min = 0, select.chr = NULL, doPlot = TRUE )plot_kaks( kaks, dag = NULL, gene.position.cds1 = NULL, gene.position.cds2 = NULL, tandem.dups.cds1 = NULL, tandem.dups.cds2 = NULL, PlotTitle = "Ka/Ks results", PlotType = "h", binw = 0.05, splitByChr = FALSE, colorBy = "none", ka.max = 5, ks.max = 5, ka.min = 0, ks.min = 0, select.chr = NULL, doPlot = TRUE )
kaks |
specify Ka/Ks input obtained via 'rbh2kaks()' [mandatory] |
dag |
specify DAGchainer results as obtained via 'rbh2dagchainer()' [default: NULL] |
gene.position.cds1 |
specify gene position for cds1 sequences
(see |
gene.position.cds2 |
specify gene position for cds2 sequences
(see |
tandem.dups.cds1 |
specify tandem duplicates for cds1 sequences
(see |
tandem.dups.cds2 |
specify tandem duplicates for cds2 sequences
(see |
PlotTitle |
specify Plot title [default: Ka/Ks results] |
PlotType |
specify Plot type: "h" histogram or "d" dotplot [default: h] |
binw |
specify binwidth (see |
splitByChr |
specify if plot should be split by chromosome [default: FALSE] |
colorBy |
specify if Ka/Ks gene pairs should be colored by "rbh_class", "dagchainer", "tandemdups" or "none" [default: rbh_class] |
ka.max |
specify max Ka to be filtered [default: 5] |
ks.max |
specify max Ks to be filtered [default: 5] |
ka.min |
specify min Ka to be filtered [default: 0] |
ks.min |
specify min Ks to be filtered [default: 0] |
select.chr |
filter results for chromosome names [default: NULL] |
doPlot |
specify plot [default: TRUE] |
Ka/Ks plot
Kristian K Ullrich
## load example sequence data data("ath_aly_ncbi_kaks", package="CRBHits") ## plot Ka/Ks values - default g <- plot_kaks(ath_aly_ncbi_kaks) ## Calculate Ka/Ks values based on MSA data("hiv", package="MSA2dist") hiv_kaks <- MSA2dist::dnastring2kaks(hiv) g <- plot_kaks(hiv_kaks)## load example sequence data data("ath_aly_ncbi_kaks", package="CRBHits") ## plot Ka/Ks values - default g <- plot_kaks(ath_aly_ncbi_kaks) ## Calculate Ka/Ks values based on MSA data("hiv", package="MSA2dist") hiv_kaks <- MSA2dist::dnastring2kaks(hiv) g <- plot_kaks(hiv_kaks)
This function runs DAGchainer (http://dagchainer.sourceforge.net/) given CRBHit pairs and gene positions for both cds1 and cds2. The default options are set to not compare gene positions in base pairs but instead using gene order (gene.idx).
rbh2dagchainer( rbhpairs, selfblast1 = NULL, selfblast2 = NULL, gene.position.cds1 = NULL, gene.position.cds2 = NULL, dagchainerpath = file.path(find.package("CRBHits"), "extdata", "dagchainer"), gap_open_penalty = 0, gap_extension_penalty = -3, gap_length = 10000, max_match_score = 50, max_dist_allowed = 2e+05, max_evalue = 0.001, ignore_tandem = TRUE, only_tandem = FALSE, min_number_aligned_pairs = 5, type = "bp", plotDotPlot = FALSE, DotPlotTitle = "DAGchainer results", colorBy = "none", kaks = NULL, ka.max = 5, ks.max = 5, ka.min = 0, ks.min = 0, select.chr = NULL )rbh2dagchainer( rbhpairs, selfblast1 = NULL, selfblast2 = NULL, gene.position.cds1 = NULL, gene.position.cds2 = NULL, dagchainerpath = file.path(find.package("CRBHits"), "extdata", "dagchainer"), gap_open_penalty = 0, gap_extension_penalty = -3, gap_length = 10000, max_match_score = 50, max_dist_allowed = 2e+05, max_evalue = 0.001, ignore_tandem = TRUE, only_tandem = FALSE, min_number_aligned_pairs = 5, type = "bp", plotDotPlot = FALSE, DotPlotTitle = "DAGchainer results", colorBy = "none", kaks = NULL, ka.max = 5, ks.max = 5, ka.min = 0, ks.min = 0, select.chr = NULL )
rbhpairs |
(conditional-)reciprocal best hit (CRBHit) pair result
(see |
selfblast1 |
(conditional-)reciprocal best hit (CRBHit) pair selfblast
result for cds1 sequences (see |
selfblast2 |
(conditional-)reciprocal best hit (CRBHit) pair selfblast
result for cds2 sequences (see |
gene.position.cds1 |
specify gene position for cds1 sequences
(see |
gene.position.cds2 |
specify gene position for cds2 sequences
(see |
dagchainerpath |
specify the PATH to the DAGchainer directory containing the dagchainer binaries [default: /extdata/dagchainer/] |
gap_open_penalty |
gap open penalty [default: 0] |
gap_extension_penalty |
gap extension penalty [default: -3] |
gap_length |
length of a gap (avgerage distance expected between two syntenic genes); if type is set to "idx" use 1 [default: 10000] |
max_match_score |
Maximum match score [default: 50] |
max_dist_allowed |
maximum distance allowed between two matches; if type is set to "idx" use 20 [default: 200000] |
max_evalue |
Maximum E-value [default: 1e-3] |
ignore_tandem |
ignore tandem duplicates [default = TRUE] |
only_tandem |
only tandem alignments [default = FALSE] |
min_number_aligned_pairs |
Minimum number of Aligned Pairs [default: 5] |
type |
specify if gene order index "idx" or gene base pair position "bp" should be extracted and used with DAGchainer [default: bp] |
plotDotPlot |
specify if dotplot should be plotted [default: FALSE] |
DotPlotTitle |
specify DotPlot title [default: 'DAGchainer results'] |
colorBy |
specify if dagchainer groups should be colored by "Ka", "Ks", "Ka/Ks" or "none" [default: none] |
kaks |
specify Ka/Ks input obtained via 'rbh2kaks()' [default: NULL] |
ka.max |
specify max Ka to be filtered [default: 5] |
ks.max |
specify max Ks to be filtered [default: 5] |
ka.min |
specify min Ka to be filtered [default: 0] |
ks.min |
specify min Ks to be filtered [default: 0] |
select.chr |
filter results for chromosome names [default: NULL] |
DAGchanier results
1: $gene1.chr
2: $gene1.seq.id
3: $gene1.start
4: $gene1.end
5: $gene1.mid
6: $gene1.idx
7: $gene2.chr
8: $gene2.seq.id
9: $gene2.start
10: $gene2.end
11: $gene2.mid
12: $gene2.idx
13: $evalue
14: $score
Kristian K Ullrich
Haas BJ et al. (2004) DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics. 20(18), 3643-3646.
plot_dagchainer,
cds2genepos,
tandemdups,
rbh2kaks
## compile dagchainer CRBHits::make_dagchainer() ## load example sequence data data("ath", package="CRBHits") ## get selfhits CRBHit pairs ath_selfhits_crbh <- cds2rbh( cds1=ath, cds2=ath, plotCurve=TRUE) ## get gene position ath.genepos <- cds2genepos( cds=ath, source="ENSEMBL") ## get DAGchainer results ath_selfblast_crbh.dagchainer <- rbh2dagchainer( rbhpairs=ath_selfhits_crbh, gene.position.cds1=ath.genepos, gene.position.cds2=ath.genepos) head(ath_selfblast_crbh.dagchainer) ## plot dagchainer plot_dagchainer( dag=ath_selfblast_crbh.dagchainer)## compile dagchainer CRBHits::make_dagchainer() ## load example sequence data data("ath", package="CRBHits") ## get selfhits CRBHit pairs ath_selfhits_crbh <- cds2rbh( cds1=ath, cds2=ath, plotCurve=TRUE) ## get gene position ath.genepos <- cds2genepos( cds=ath, source="ENSEMBL") ## get DAGchainer results ath_selfblast_crbh.dagchainer <- rbh2dagchainer( rbhpairs=ath_selfhits_crbh, gene.position.cds1=ath.genepos, gene.position.cds2=ath.genepos) head(ath_selfblast_crbh.dagchainer) ## plot dagchainer plot_dagchainer( dag=ath_selfblast_crbh.dagchainer)
This function calculates Ka/Ks (dN/dS; accoring to
Li (1993) or Yang and Nielson (2000) for each
(conditional-)reciprocal best hit (CRBHit) pair. The names of the rbh
columns must match the names of the corresponding cds1 and cds2
DNAStringSet vectors.
rbh2kaks( rbhpairs, cds1, cds2, model = "Li", plotHistPlot = FALSE, plotDotPlot = FALSE, dag = NULL, gene.position.cds1 = NULL, gene.position.cds2 = NULL, tandem.dups.cds1 = NULL, tandem.dups.cds2 = NULL, colorBy = "none", threads = 1, ... )rbh2kaks( rbhpairs, cds1, cds2, model = "Li", plotHistPlot = FALSE, plotDotPlot = FALSE, dag = NULL, gene.position.cds1 = NULL, gene.position.cds2 = NULL, tandem.dups.cds1 = NULL, tandem.dups.cds2 = NULL, colorBy = "none", threads = 1, ... )
rbhpairs |
(conditional-)reciprocal best hit (CRBHit) pair result
(see |
cds1 |
cds1 sequences as |
cds2 |
cds2 sequences as |
model |
specify codon model either "Li" or "NG86" or one of KaKs_Calculator2 model "NG", "LWL", "LPB", "MLWL", "MLPB", "GY", "YN", "MYN", "MS", "MA", "GNG", "GLWL", "GLPB", "GMLWL", "GMLPB", "GYN", "GMYN" [default: Li] |
plotHistPlot |
specify if histogram should be plotted [default: TRUE] |
plotDotPlot |
specify if dotplot should be plotted (mandatory to define
|
dag |
specify DAGchainer results as obtained via 'rbh2dagchainer()' [default: NULL] |
gene.position.cds1 |
specify gene position for cds1 sequences
(see |
gene.position.cds2 |
specify gene position for cds2 sequences
(see |
tandem.dups.cds1 |
specify tandem duplicates for cds1 sequences
(see |
tandem.dups.cds2 |
specify tandem duplicates for cds2 sequences
(see |
colorBy |
specify if Ka/Ks gene pairs should be colored by "rbh_class", dagchainer", "tandemdups" or "none" [default: none] |
threads |
number of parallel threads [default: 1] |
... |
other codon alignment parameters
(see |
Ka/Ks values
Kristian K Ullrich
Li WH. (1993) Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J. Mol. Evol., 36, 96-99.
Wang D, Zhang Y et al. (2010) KaKs_Calculator 2.0: a toolkit incorporating gamma-series methods and sliding window strategies. Genomics Proteomics Bioinformatics. 8(1), 77-80.
Yang Z and Nielson R. (2000) Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol., 17(1), 32-43.
dnastring2kaks,
isoform2longest,
cds2genepos,
plot_kaks,
rbh2dagchainer,
tandemdups
## load example sequence data data("ath", package="CRBHits") data("aly", package="CRBHits") ## load example CRBHit pairs data("ath_aly_crbh", package="CRBHits") ## only analyse subset of CRBHit pairs ath_aly_crbh$crbh.pairs <- head(ath_aly_crbh$crbh.pairs) ath_aly_crbh.kaks <- rbh2kaks( rbhpairs=ath_aly_crbh, cds1=ath, cds2=aly, model="Li") head(ath_aly_crbh.kaks) ## plot kaks g.kaks <- plot_kaks(ath_aly_crbh.kaks)## load example sequence data data("ath", package="CRBHits") data("aly", package="CRBHits") ## load example CRBHit pairs data("ath_aly_crbh", package="CRBHits") ## only analyse subset of CRBHit pairs ath_aly_crbh$crbh.pairs <- head(ath_aly_crbh$crbh.pairs) ath_aly_crbh.kaks <- rbh2kaks( rbhpairs=ath_aly_crbh, cds1=ath, cds2=aly, model="Li") head(ath_aly_crbh.kaks) ## plot kaks g.kaks <- plot_kaks(ath_aly_crbh.kaks)
This function returns the first whitespace separated names
value of a XStringSet.
seqid(x)seqid(x)
x |
|
chr
Kristian K Ullrich
## load example sequence data data("ath", package="CRBHits") seqid(ath)## load example sequence data data("ath", package="CRBHits") seqid(ath)
This function assigns tandem duplicates given (conditional-)reciprocal best hit (CRBHit) pairs and their chromosomal gene position. The function is ported into R from Haug-Baltzell et al. (2017) https://github.com/LyonsLab/coge/blob/master/bin/dagchainer/tandems.py
tandemdups(rbhpairs, genepos, dupdist = 5)tandemdups(rbhpairs, genepos, dupdist = 5)
rbhpairs |
(conditional-)reciprocal best hit (CRBHit) pair result
(see |
genepos |
Gene position matrix as obtained from |
dupdist |
Maximum distance between 2 gene positions on the same
chromosome which will call them as a pair of local duplicates. If they are
farther apart than |
matrix
1: $gene.seq.id
2: $gene.chr
3: $gene.start
4: $gene.end
5: $gene.mid
6: $gene.strand
7: $gene.idx
8: $tandem_group
Kristian K Ullrich
Haug-Beltzell A et al. (2017) SynMap2 and SynMap3D: web-based whole-genome synteny browsers. Bioinformatics 33(14), 2197-2198.
## load example sequence data data("ath", package="CRBHits") ## get selfhits CRBHit pairs ath_selfhits_crbh <- cds2rbh( cds1=ath, cds2=ath, plotCurve=TRUE) ## get gene position ath.genepos <- cds2genepos( cds=ath, source="ENSEMBL") ## get tandem duplicate results ath_selfblast_crbh.tandemdups <- tandemdups( rbhpairs=ath_selfhits_crbh, genepos=ath.genepos) head(ath_selfblast_crbh.tandemdups)## load example sequence data data("ath", package="CRBHits") ## get selfhits CRBHit pairs ath_selfhits_crbh <- cds2rbh( cds1=ath, cds2=ath, plotCurve=TRUE) ## get gene position ath.genepos <- cds2genepos( cds=ath, source="ENSEMBL") ## get tandem duplicate results ath_selfblast_crbh.tandemdups <- tandemdups( rbhpairs=ath_selfhits_crbh, genepos=ath.genepos) head(ath_selfblast_crbh.tandemdups)