Find Closest Related Fusions in a Reference Catalogue — find_related

Given a set of observed (sequenced) fusions that already carry exon-level breakpoint annotations, search a reference fusion database and return up to max_results of the most similar events for each observed fusion.

find_related_fusions(
  sequenced_fusions,
  db_fusions,
  max_results = 5,
  verbose = FALSE
)

Arguments

sequenced_fusions

A data.frame (or data.table) produced by annotate_genomic_coordinates() containing, at minimum, the columns

chr1, chr2
bp1_gene, bp2_gene
bp1_gene_strand, bp2_gene_strand
bp1_feature, bp2_feature
exon-proximity columns: bp1_within_exon, bp2_within_exon, bp1_fiveprime_exon, bp2_fiveprime_exon, bp1_threeprime_exon, bp2_threeprime_exon.

db_fusions

A reference fusion catalogue as a data.frame/data.table with at least the columns gene1, gene2, gene1_chro, gene2_chro, gene1_strand, gene2_strand, gene1_exonnumber, gene2_exonnumber, and fusion_id.
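Before calling the function it can help to confirm that the catalogue carries every required column. The following is a minimal sketch: missing_columns() is a hypothetical helper that is not part of the package, and the toy catalogue uses invented gene names, chromosome labels, and column types, since the documentation does not specify the expected value formats.

```r
# Columns that db_fusions must contain, as listed in the argument description.
required_db_cols <- c("gene1", "gene2", "gene1_chro", "gene2_chro",
                      "gene1_strand", "gene2_strand",
                      "gene1_exonnumber", "gene2_exonnumber", "fusion_id")

# Hypothetical helper (not part of the package): names of absent columns.
missing_columns <- function(df, required) {
  setdiff(required, names(df))
}

# Toy catalogue; all values and column types are invented for illustration.
db_toy <- data.frame(
  gene1 = c("GENEA", "GENEC"),
  gene2 = c("GENEB", "GENED"),
  gene1_chro = c("chr1", "chr2"),
  gene2_chro = c("chr5", "chr2"),
  gene1_strand = c("+", "-"),
  gene2_strand = c("+", "-"),
  gene1_exonnumber = c(13L, 6L),
  gene2_exonnumber = c(2L, 20L),
  fusion_id = c("toy_1", "toy_2"),
  stringsAsFactors = FALSE
)

ok_missing  <- missing_columns(db_toy, required_db_cols)
bad_missing <- missing_columns(db_toy[, names(db_toy) != "fusion_id"],
                               required_db_cols)
print(ok_missing)   # empty: nothing is missing
print(bad_missing)  # reports the dropped column
```

The same check can be applied to sequenced_fusions with the column names listed under that argument.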

max_results

Integer. Maximum number of highest‑scoring database fusions returned for each sequenced fusion (default 5).

verbose

Logical. If TRUE, progress messages are printed (default FALSE).

Value

A named list. Each element corresponds to one row in sequenced_fusions and contains up to max_results sub‑lists, each with:

db_index – row index in db_fusions
fusion_id – identifier from the reference catalogue
score – composite similarity score
gene1, gene2 – gene symbols in reference fusion
gene1_exon, gene2_exon – reference exon numbers
match_details – short human‑readable description
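Because the return value is a nested list, it is often convenient to flatten it into one table. The sketch below works on a hand-built mock that only mimics the documented fields. The element names (fusion_1, fusion_2), all values, and the helper flatten_related() are assumptions made for illustration, not output from the function.

```r
# Mock return value with the documented fields; every value is invented.
related_mock <- list(
  fusion_1 = list(
    list(db_index = 12L, fusion_id = "toy_12", score = 210,
         gene1 = "GENEA", gene2 = "GENEB",
         gene1_exon = 13L, gene2_exon = 2L,
         match_details = "example description"),
    list(db_index = 40L, fusion_id = "toy_40", score = 62.5,
         gene1 = "GENEA", gene2 = "GENED",
         gene1_exon = 12L, gene2_exon = 5L,
         match_details = "example description")
  ),
  fusion_2 = list()  # an observed fusion with no retained matches
)

# Hypothetical helper: one row per (observed fusion, database match).
flatten_related <- function(related) {
  rows <- lapply(names(related), function(nm) {
    hits <- related[[nm]]
    if (length(hits) == 0) return(NULL)
    data.frame(
      query     = nm,
      db_index  = vapply(hits, function(h) as.integer(h$db_index), integer(1)),
      fusion_id = vapply(hits, function(h) as.character(h$fusion_id), character(1)),
      score     = vapply(hits, function(h) as.numeric(h$score), numeric(1)),
      stringsAsFactors = FALSE
    )
  })
  rows <- Filter(Negate(is.null), rows)  # drop fusions without matches
  do.call(rbind, rows)
}

flat <- flatten_related(related_mock)
print(flat)
```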

Details

Similarity is evaluated with a composite score made up of:

Gene‐pair match: +100 if the gene symbols match (either orientation).
Strand concordance: +50 if the strand orientations also match.
Breakpoint exon proximity: Up to +30 per partner, scaled as \(30/(d+1)\) where \(d\) is the absolute difference between the observed and reference exon numbers. Both five-prime and three-prime nearest exons are considered.
Breakpoint context: +20 for each partner when both the observed and reference fusion breakpoints occur in exon sequence (or both in intron sequence).

Only candidates with a final score \(> 5\) are retained.
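The arithmetic of the scheme above can be sketched as follows. This is not the package's internal calc_score(): score_sketch() is a hypothetical function that takes already-derived inputs, and it rests on two assumptions that the description leaves open. First, the strand bonus is only granted when the gene pair also matches. Second, d1 and d2 are the exon distances already chosen per partner from the five-prime and three-prime candidates.

```r
# Hypothetical re-statement of the documented scoring arithmetic.
# gene_match, strand_match : logical flags for the candidate pair
# d1, d2                   : absolute exon-number differences per partner
#                            (NA when no exon distance is available)
# same_context1/2          : TRUE when observed and reference breakpoints are
#                            both exonic or both intronic for that partner
score_sketch <- function(gene_match, strand_match, d1, d2,
                         same_context1, same_context2) {
  score <- 0
  if (gene_match) {
    score <- score + 100
    # Assumption: strand concordance counts only on top of a gene-pair match.
    if (strand_match) score <- score + 50
  }
  for (d in c(d1, d2)) {
    if (!is.na(d)) score <- score + 30 / (d + 1)  # 30 when d == 0
  }
  score + 20 * (isTRUE(same_context1) + isTRUE(same_context2))
}

# Same gene pair and strands, exact exon on one side, one exon off on the other:
best <- score_sketch(TRUE, TRUE, 0, 1, TRUE, TRUE)     # 100 + 50 + 30 + 15 + 40
# No gene match, distant exons, different context:
weak <- score_sketch(FALSE, FALSE, 14, 14, FALSE, FALSE)  # 2 + 2

retained <- c(best = best, weak = weak) > 5  # documented retention threshold
print(retained)
```

Under these assumptions the first candidate scores 235 and is kept, while the second scores 4 and falls below the threshold.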

Internally, two lookup hash tables are created for the reference database: one keyed on gene‑pair (gene1--gene2) and one on chromosome‑pair (chr1_chr2). These drastically reduce the number of candidates that must be scored for each observed fusion.
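One way such lookup tables could be built in base R is with hashed environments that map each key to the matching row indices. The sketch below is an assumption about the approach, not the package's actual code: build_index() and lookup() are hypothetical helpers, the toy catalogue is invented, and the key formats simply follow the gene1--gene2 and chr1_chr2 patterns mentioned above.

```r
# Toy catalogue with invented values; only the key columns are needed here.
db_toy <- data.frame(
  gene1 = c("GENEA", "GENEC", "GENEA"),
  gene2 = c("GENEB", "GENED", "GENEB"),
  gene1_chro = c("chr1", "chr2", "chr1"),
  gene2_chro = c("chr5", "chr2", "chr5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: hashed environment mapping key -> row indices.
build_index <- function(keys) {
  list2env(split(seq_along(keys), keys), hash = TRUE, parent = emptyenv())
}

gene_index <- build_index(paste(db_toy$gene1, db_toy$gene2, sep = "--"))
chr_index  <- build_index(paste(db_toy$gene1_chro, db_toy$gene2_chro, sep = "_"))

# Hypothetical helper: candidate rows for a pair, in either orientation.
lookup <- function(index, a, b, sep) {
  fwd <- get0(paste(a, b, sep = sep), envir = index, inherits = FALSE)
  bwd <- get0(paste(b, a, sep = sep), envir = index, inherits = FALSE)
  sort(unique(c(integer(0), fwd, bwd)))  # integer(0) when nothing matches
}

gene_hits <- lookup(gene_index, "GENEB", "GENEA", "--")  # reversed orientation
chr_hits  <- lookup(chr_index, "chr2", "chr2", "_")
no_hits   <- lookup(gene_index, "GENEX", "GENEY", "--")
print(gene_hits)
print(chr_hits)
```

Only the rows returned by these lookups would then need to be scored, instead of every row in the catalogue.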

The helper function calc_score(), defined inside find_related_fusions(), implements the scoring scheme described above. Scores \(\le 5\) are treated as noise and the corresponding matches are discarded.

Author

Your Name

Examples

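# \donttest{
# sequenced_fusions: output of annotate_genomic_coordinates()
# db_fusions: reference catalogue with the columns listed under Arguments
# Both objects must exist in the session before running this example.
related <- find_related_fusions(sequenced_fusions, db_fusions, max_results = 3)
related[[1]]            # top matches for the first observed fusion
# }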