find_related_fusions.Rd
Given a set of observed (sequenced) fusions that already carry exon-level
breakpoint annotations, search a reference fusion database and return the
max_results most similar events for every observed fusion.
find_related_fusions(
sequenced_fusions,
db_fusions,
max_results = 5,
verbose = F
)
sequenced_fusions
A data.frame (or data.table) produced
by annotate_genomic_coordinates() containing, at minimum, the
columns
chr1, chr2
bp1_gene, bp2_gene
bp1_gene_strand, bp2_gene_strand
bp1_feature, bp2_feature
exon-proximity columns: bp[12]_within_exon,
bp[12]_fiveprime_exon, bp[12]_threeprime_exon.
db_fusions
A reference fusion catalogue as a
data.frame/data.table with at least the columns
gene1, gene2, gene1_chro, gene2_chro,
gene1_strand, gene2_strand,
gene1_exonnumber, gene2_exonnumber, and fusion_id.
max_results
Integer. Maximum number of highest-scoring database
fusions returned for each sequenced fusion (default 5).
verbose
Logical. If TRUE, progress messages are printed (default FALSE).
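Because both inputs must carry specific columns, it can help to verify them before calling the function. The following is a minimal sketch in base R; the helper name check_required_columns() and the toy table are hypothetical and not part of the package, and the values are illustrative only.

```r
# Hypothetical helper (not part of the package): return the names of
# required columns that are absent from a table.
check_required_columns <- function(df, required) {
  setdiff(required, names(df))
}

# Column names taken from the db_fusions description above.
db_required <- c(
  "gene1", "gene2", "gene1_chro", "gene2_chro",
  "gene1_strand", "gene2_strand",
  "gene1_exonnumber", "gene2_exonnumber", "fusion_id"
)

# Toy reference table; the values are illustrative only.
toy_db <- data.frame(
  gene1 = "BCR", gene2 = "ABL1",
  gene1_chro = "chr22", gene2_chro = "chr9",
  gene1_strand = "+", gene2_strand = "+",
  gene1_exonnumber = 13L, gene2_exonnumber = 2L,
  fusion_id = "F0001",
  stringsAsFactors = FALSE
)

missing_cols <- check_required_columns(toy_db, db_required)
if (length(missing_cols) > 0) {
  stop("db_fusions is missing columns: ", paste(missing_cols, collapse = ", "))
}
```

The same helper could be pointed at sequenced_fusions with its own list of required column names.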
A named list. Each element corresponds to one row in
sequenced_fusions and contains up to max_results sub‑lists,
each with:
db_index – row index in db_fusions
fusion_id – identifier from the reference catalogue
score – composite similarity score
gene1, gene2 – gene symbols in reference fusion
gene1_exon, gene2_exon – reference exon numbers
match_details – short human‑readable description
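A nested list of this shape is often easier to inspect as a table. The sketch below builds a mock result with the documented fields and flattens it to one row per match. The helper flatten_related(), the element names fusion_1 and fusion_2, and every value are assumptions made for illustration; the names the package actually assigns to the list elements are not stated here.

```r
# Mock result with the documented fields; all values are made up.
mock_related <- list(
  fusion_1 = list(
    list(db_index = 12L, fusion_id = "F0012", score = 210,
         gene1 = "BCR", gene2 = "ABL1", gene1_exon = 13L, gene2_exon = 2L,
         match_details = "illustrative"),
    list(db_index = 40L, fusion_id = "F0040", score = 165,
         gene1 = "BCR", gene2 = "ABL1", gene1_exon = 14L, gene2_exon = 2L,
         match_details = "illustrative")
  ),
  fusion_2 = list()  # an observed fusion with no retained matches
)

# Hypothetical helper: one row per (observed fusion, database match).
flatten_related <- function(related) {
  rows <- lapply(names(related), function(nm) {
    hits <- related[[nm]]
    if (length(hits) == 0) return(NULL)
    hit_rows <- lapply(hits, function(h) {
      as.data.frame(h, stringsAsFactors = FALSE)
    })
    data.frame(observed = nm, do.call(rbind, hit_rows),
               stringsAsFactors = FALSE)
  })
  # Drop observed fusions without matches, then stack the rest.
  do.call(rbind, Filter(Negate(is.null), rows))
}

flat <- flatten_related(mock_related)
```

Observed fusions without any match contribute no rows, so the flattened table can be shorter than sequenced_fusions.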
Similarity is evaluated with a composite score made up of:
+100 if the gene symbols match (either orientation).
+50 if the strand orientations also match.
Up to +30 per partner, scaled as \(30/(d+1)\), where \(d\) is the absolute difference between the observed and reference exon numbers. Both five-prime and three-prime nearest exons are considered.
+20 for each partner when both the observed and reference fusion breakpoints occur in exon sequence (or both in intron sequence).
Only candidates with a final score \(> 5\) are retained.
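To make the arithmetic concrete, here is a minimal sketch of the rules listed above. It is not the package's internal scoring function: score_sketch() and its argument layout are hypothetical. It simplifies by using a single observed exon number per partner instead of both the five-prime and three-prime nearest exons, and it assumes the strand bonus applies only when the genes match while the exon and feature terms are always added.

```r
# Hypothetical sketch of the composite score; NOT the package's own code.
# obs and ref are lists with length-2 fields:
#   genes (character), strands (character), exons (integer), in_exon (logical)
score_sketch <- function(obs, ref) {
  score <- 0

  # +100 if the gene symbols match in either orientation.
  same_order <- all(obs$genes == ref$genes)
  swapped    <- all(obs$genes == rev(ref$genes))
  if (same_order || swapped) {
    score <- score + 100
    # Align the reference with the observed orientation before comparing.
    if (!same_order) ref <- lapply(ref, rev)
    # +50 if the strands also match (assumed to require a gene match).
    if (all(obs$strands == ref$strands)) score <- score + 50
  }

  # Up to +30 per partner: 30 / (d + 1), d = absolute exon-number difference.
  d <- abs(obs$exons - ref$exons)
  score <- score + sum(30 / (d + 1))

  # +20 per partner when both breakpoints are exonic or both are intronic.
  score <- score + 20 * sum(obs$in_exon == ref$in_exon)

  score
}

# Illustrative inputs only.
obs <- list(genes = c("BCR", "ABL1"), strands = c("+", "+"),
            exons = c(13L, 2L), in_exon = c(FALSE, FALSE))
ref <- list(genes = c("BCR", "ABL1"), strands = c("+", "+"),
            exons = c(14L, 2L), in_exon = c(FALSE, TRUE))

example_score <- score_sketch(obs, ref)
```

For these inputs the terms are 100 (genes) + 50 (strands) + 15 + 30 (exon differences of 1 and 0) + 20 (one partner with matching feature type), giving 215.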
Internally, two lookup hash tables are created for the reference database:
one keyed on gene-pair (gene1--gene2) and one on chromosome-pair
(chr1_chr2). These lookups limit the candidates that must be scored for
each observed fusion, avoiding a scan of the whole database.
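The idea behind the two lookups can be sketched in base R as follows. This is an illustration under assumptions, not the package's implementation: a named list built with split() stands in for the hash table (an R environment would give hashed lookup), and the toy table and its values are made up. The key formats follow the ones named above.

```r
# Toy reference table; values are illustrative only.
toy_db <- data.frame(
  gene1 = c("BCR", "EML4", "BCR"),
  gene2 = c("ABL1", "ALK", "ABL1"),
  gene1_chro = c("chr22", "chr2", "chr22"),
  gene2_chro = c("chr9", "chr2", "chr9"),
  stringsAsFactors = FALSE
)

# Map each key to the row indices that share it.
gene_pair_index <- split(
  seq_len(nrow(toy_db)),
  paste(toy_db$gene1, toy_db$gene2, sep = "--")
)
chr_pair_index <- split(
  seq_len(nrow(toy_db)),
  paste(toy_db$gene1_chro, toy_db$gene2_chro, sep = "_")
)

# Candidate rows for an observed BCR/ABL1 fusion, checking both orientations.
# Keys absent from the index yield NULL and are dropped by unlist().
keys <- c("BCR--ABL1", "ABL1--BCR")
candidates <- unique(unlist(gene_pair_index[keys], use.names = FALSE))
```

Only the rows in candidates would then need scoring, rather than every row of the reference table.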
The helper function calc_score() (defined inside
find_related_fusions()) implements the scoring scheme described
above. Scores \(\le 5\) are treated as noise and the corresponding
matches are discarded.
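The cut-off and the max_results limit combine into a simple filter, sort, and truncate step. The snippet below sketches that step on made-up scores; the variable names are hypothetical, and how the package orders tied scores is not specified here.

```r
# Hypothetical scores for five candidate database rows.
scores <- c(215, 4.5, 165, 5, 72)
max_results <- 3

keep <- which(scores > 5)                             # discard scores <= 5
keep <- keep[order(scores[keep], decreasing = TRUE)]  # best score first
top  <- head(keep, max_results)                       # at most max_results rows
```

Here rows 2 and 4 are dropped (a score of exactly 5 does not pass), and rows 1, 3, and 5 are kept in that order.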
# \donttest{
# Assumes sequenced_fusions and db_fusions already exist in the session.
related <- find_related_fusions(sequenced_fusions, db_fusions, max_results = 3)
related[[1]] # top matches for the first observed fusion
# }