The aim of ODER is to identify previously unannotated expressed regions (ERs) using RNA-sequencing data. For this purpose, ODER defines and optimises the definition of ERs, then connected these ERs to genes using junction data. In this way, ODER improves gene annotation. Gene annotation is a staple input of many bioinformatic pipelines and a more complete gene annotation can enable more accurate interpretation of disease associated variants.
Returns the optimum definition of the expressed regions by finding the ideal MCC (Mean Coverage Cutoff) and MRG (Max Region Gap). The combination of MCC and MRG that returns the expressed region with the smallest exon delta is the most ideal.
ODER(
bw_paths,
auc_raw,
auc_target,
chrs = "",
genome = "hg38",
mccs,
mrgs,
gtf = NULL,
ucsc_chr,
ignore.strand,
exons_no_overlap = NULL,
biotype = "Non-overlapping",
bw_chr = "chr",
file_type = "non-stranded",
bw_pos = NULL,
bw_neg = NULL,
auc_raw_pos = NULL,
auc_raw_neg = NULL
)
bw_paths | path(s) to bigwig file(s) with the RNA-seq data that you want the #' coverage of. |
---|---|
auc_raw | vector containing AUCs(Area Under Coverage) matching the order of bigwig path(s). |
auc_target | total AUC to normalise all samples to e.g. 40e6 * 100 would be the estimated total auc for sample sequenced to 40 million reads of 100bp in length. |
chrs | chromosomes to obtain mean coverage for, default is "" giving every chromosome. Can take UCSC format(chrs = "chr1") or just the chromosome i.e. chrs = c(1,X) |
genome | the UCSC genome you want to use, the default is hg38. |
mccs | mean coverage cut-offs to apply. |
mrgs | max region gaps to apply. |
gtf | Either a string containg the path to a .gtf file or a pre-imported
gtf using |
ucsc_chr | logical scalar, determining whether to add "chr" prefix to the seqnames of non-overlapping exons and change "chrMT" -> "chrM". Note, if set to TRUE and seqnames already have "chr", it will not add another. |
ignore.strand | logical value for input into
|
exons_no_overlap | Optimum set of exons to help calculate deltas |
biotype | Filters the GTF file passed in to what would be considered the "Gold Standard" exons. The Default is "Non-overlapping" but the options are: "Non-overlapping" (exons that don't intersect each other), "Three Prime" (3' UTR), "Five Prime" (5' UTR), "Internal" (Internal coding), "lncRNA" (Long Non-Coding RNA), "ncRNA" (Non-Coding RNA) and "Pseudogene" |
bw_chr | specifies whether the bigwig files has the chromosomes labelled with a "chr" preceding the chromosome i.e. "chr1" vs "1". Can be either "chr" or "nochr" with "chr" being the default. |
file_type | Describes if the BigWigs are stranded or not. Either "stranded" or non-stranded |
bw_pos | positive strand bigwig file |
bw_neg | negative strand bigwig file |
auc_raw_pos | vector containing AUCs(Area Under Coverage) matching the order of the positive bigwig paths. |
auc_raw_neg | vector containing AUCs(Area Under Coverage) matching the order of the negative bigwig paths. |
list containing optimised ERs, optimal pair of MCC/MRGs and
delta_df
rec_url <- recount::download_study(
project = "SRP012682",
type = "samples",
download = FALSE
)
#> Setting options('download.file.method.GEOquery'='auto')
#> Setting options('GEOquery.inmemory.gpl'=FALSE)
# file_cache is an internal function to download a bigwig file from a link
# if the file has been downloaded recently, it will be retrieved from a cache
bw_path <- file_cache(rec_url[1])
gtf_url <- paste0(
"http://ftp.ensembl.org/pub/release-103/gtf/",
"homo_sapiens/Homo_sapiens.GRCh38.103.chr.gtf.gz"
)
gtf_path <- file_cache(gtf_url)
# As of rtracklayer 1.25.16, BigWig is not supported on Windows.
data(gtex_SRP012682_SRX222703_lung_auc_1, package = "ODER")
if (!xfun::is_windows()) {
opt_ers <- ODER(
bw_paths = bw_path,
auc_raw = gtex_SRP012682_SRX222703_lung_auc_1,
auc_target = 40e6 * 100, chrs = c("chr21", "chr22"),
genome = "hg38", mccs = c(5, 10), mrgs = c(10, 20),
gtf = gtf_path, ucsc_chr = TRUE, ignore.strand = TRUE,
exons_no_overlap = NULL, bw_chr = "chr"
)
opt_ers
}
#> Loading required package: BiocGenerics
#>
#> Attaching package: ‘BiocGenerics’
#> The following objects are masked from ‘package:stats’:
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from ‘package:base’:
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#>
#> Attaching package: ‘S4Vectors’
#> The following objects are masked from ‘package:base’:
#>
#> I, expand.grid, unname
#> 2021-10-08 16:07:35 - Obtaining mean coverage across 1 samples
#> 2021-10-08 16:07:35 - chr21
#> 2021-10-08 16:07:36 - chr22
#> 2021-10-08 16:07:37 - Generating ERs for chr21
#> 2021-10-08 16:07:40 - Generating ERs for chr22
#> 2021-10-08 16:07:42 - Loading in GTF...
#> 2021-10-08 16:08:25 - Obtaining non-overlapping exons
#> 2021-10-08 16:08:27 - Calculating delta for ERs...
#> 2021-10-08 16:08:28 - Obtaining optimal set of ERs...
#> $opt_ers
#> GRanges object with 15977 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr21 5032176-5032217 *
#> [2] chr21 5033408-5033425 *
#> [3] chr21 5034717-5034756 *
#> [4] chr21 5035188-5035189 *
#> [5] chr21 5036577-5036581 *
#> ... ... ... ...
#> [15973] chr22 50798996-50799149 *
#> [15974] chr22 50799209-50799284 *
#> [15975] chr22 50799669-50799688 *
#> [15976] chr22 50799717-50799744 *
#> [15977] chr22 50800460-50800587 *
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
#>
#> $opt_mcc_mrg
#> [1] "mcc_10" "mrg_20"
#>
#> $deltas
#> # A tibble: 4 × 7
#> mcc mrg sum mean median n_eq_0 propor_eq_0
#> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl>
#> 1 5 10 2187300 964. 150. 322 0.142
#> 2 5 20 1892386 898. 141 335 0.159
#> 3 10 10 1771837 1009. 139 316 0.180
#> 4 10 20 1457726 911. 118 323 0.202
#>