ODER: Optimising the Definition of Expressed Regions

The aim of ODER is to identify previously unannotated expressed regions (ERs) using RNA-sequencing data. For this purpose, ODER defines and optimises the definition of ERs, then connected these ERs to genes using junction data. In this way, ODER improves gene annotation. Gene annotation is a staple input of many bioinformatic pipelines and a more complete gene annotation can enable more accurate interpretation of disease associated variants.

Returns the optimum definition of the expressed regions by finding the ideal MCC (Mean Coverage Cutoff) and MRG (Max Region Gap). The combination of MCC and MRG that returns the expressed region with the smallest exon delta is the most ideal.

ODER(
  bw_paths,
  auc_raw,
  auc_target,
  chrs = "",
  genome = "hg38",
  mccs,
  mrgs,
  gtf = NULL,
  ucsc_chr,
  ignore.strand,
  exons_no_overlap = NULL,
  biotype = "Non-overlapping",
  bw_chr = "chr",
  file_type = "non-stranded",
  bw_pos = NULL,
  bw_neg = NULL,
  auc_raw_pos = NULL,
  auc_raw_neg = NULL
)

Arguments

bw_paths	path(s) to bigwig file(s) with the RNA-seq data that you want the #' coverage of.
auc_raw	vector containing AUCs(Area Under Coverage) matching the order of bigwig path(s).
auc_target	total AUC to normalise all samples to e.g. 40e6 * 100 would be the estimated total auc for sample sequenced to 40 million reads of 100bp in length.
chrs	chromosomes to obtain mean coverage for, default is "" giving every chromosome. Can take UCSC format(chrs = "chr1") or just the chromosome i.e. chrs = c(1,X)
genome	the UCSC genome you want to use, the default is hg38.
mccs	mean coverage cut-offs to apply.
mrgs	max region gaps to apply.
gtf	Either a string containg the path to a .gtf file or a pre-imported gtf using `rtracklayer::import` .
ucsc_chr	logical scalar, determining whether to add "chr" prefix to the seqnames of non-overlapping exons and change "chrMT" -> "chrM". Note, if set to TRUE and seqnames already have "chr", it will not add another.
ignore.strand	logical value for input into `findOverlaps`, default is True.
exons_no_overlap	Optimum set of exons to help calculate deltas
biotype	Filters the GTF file passed in to what would be considered the "Gold Standard" exons. The Default is "Non-overlapping" but the options are: "Non-overlapping" (exons that don't intersect each other), "Three Prime" (3' UTR), "Five Prime" (5' UTR), "Internal" (Internal coding), "lncRNA" (Long Non-Coding RNA), "ncRNA" (Non-Coding RNA) and "Pseudogene"
bw_chr	specifies whether the bigwig files has the chromosomes labelled with a "chr" preceding the chromosome i.e. "chr1" vs "1". Can be either "chr" or "nochr" with "chr" being the default.
file_type	Describes if the BigWigs are stranded or not. Either "stranded" or non-stranded
bw_pos	positive strand bigwig file
bw_neg	negative strand bigwig file
auc_raw_pos	vector containing AUCs(Area Under Coverage) matching the order of the positive bigwig paths.
auc_raw_neg	vector containing AUCs(Area Under Coverage) matching the order of the negative bigwig paths.

Value

list containing optimised ERs, optimal pair of MCC/MRGs and delta_df

Examples

rec_url <- recount::download_study(
    project = "SRP012682",
    type = "samples",
    download = FALSE
)
#> Setting options('download.file.method.GEOquery'='auto')
#> Setting options('GEOquery.inmemory.gpl'=FALSE)

# file_cache is an internal function to download a bigwig file from a link
# if the file has been downloaded recently, it will be retrieved from a cache
bw_path <- file_cache(rec_url[1])
gtf_url <- paste0(
    "http://ftp.ensembl.org/pub/release-103/gtf/",
    "homo_sapiens/Homo_sapiens.GRCh38.103.chr.gtf.gz"
)
gtf_path <- file_cache(gtf_url)

# As of rtracklayer 1.25.16, BigWig is not supported on Windows.
data(gtex_SRP012682_SRX222703_lung_auc_1, package = "ODER")
if (!xfun::is_windows()) {
    opt_ers <- ODER(
        bw_paths = bw_path,
        auc_raw = gtex_SRP012682_SRX222703_lung_auc_1,
        auc_target = 40e6 * 100, chrs = c("chr21", "chr22"),
        genome = "hg38", mccs = c(5, 10), mrgs = c(10, 20),
        gtf = gtf_path, ucsc_chr = TRUE, ignore.strand = TRUE,
        exons_no_overlap = NULL, bw_chr = "chr"
    )

    opt_ers
}
#> Loading required package: BiocGenerics
#> 
#> Attaching package: ‘BiocGenerics’
#> The following objects are masked from ‘package:stats’:
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from ‘package:base’:
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: ‘S4Vectors’
#> The following objects are masked from ‘package:base’:
#> 
#>     I, expand.grid, unname
#> 2021-10-08 16:07:35 - Obtaining mean coverage across 1 samples
#> 2021-10-08 16:07:35 - chr21
#> 2021-10-08 16:07:36 - chr22
#> 2021-10-08 16:07:37 - Generating ERs for chr21
#> 2021-10-08 16:07:40 - Generating ERs for chr22
#> 2021-10-08 16:07:42 - Loading in GTF...
#> 2021-10-08 16:08:25 - Obtaining non-overlapping exons
#> 2021-10-08 16:08:27 - Calculating delta for ERs...
#> 2021-10-08 16:08:28 - Obtaining optimal set of ERs...
#> $opt_ers
#> GRanges object with 15977 ranges and 0 metadata columns:
#>           seqnames            ranges strand
#>              <Rle>         <IRanges>  <Rle>
#>       [1]    chr21   5032176-5032217      *
#>       [2]    chr21   5033408-5033425      *
#>       [3]    chr21   5034717-5034756      *
#>       [4]    chr21   5035188-5035189      *
#>       [5]    chr21   5036577-5036581      *
#>       ...      ...               ...    ...
#>   [15973]    chr22 50798996-50799149      *
#>   [15974]    chr22 50799209-50799284      *
#>   [15975]    chr22 50799669-50799688      *
#>   [15976]    chr22 50799717-50799744      *
#>   [15977]    chr22 50800460-50800587      *
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#> 
#> $opt_mcc_mrg
#> [1] "mcc_10" "mrg_20"
#> 
#> $deltas
#> # A tibble: 4 × 7
#>     mcc   mrg     sum  mean median n_eq_0 propor_eq_0
#>   <dbl> <dbl>   <int> <dbl>  <dbl>  <int>       <dbl>
#> 1     5    10 2187300  964.   150.    322       0.142
#> 2     5    20 1892386  898.   141     335       0.159
#> 3    10    10 1771837 1009.   139     316       0.180
#> 4    10    20 1457726  911.   118     323       0.202
#>