Given a set of observed (sequenced) fusions that already carry exon-level breakpoint annotations, search a reference fusion database and return up to max_results of the most similar events for each observed fusion.

find_related_fusions(
  sequenced_fusions,
  db_fusions,
  max_results = 5,
  verbose = F
)

Arguments

sequenced_fusions

A data.frame (or data.table) produced by annotate_genomic_coordinates() containing, at minimum, the columns

  • chr1, chr2

  • bp1_gene, bp2_gene

  • bp1_gene_strand, bp2_gene_strand

  • bp1_feature, bp2_feature

  • exon‑proximity columns: bp[12]_within_exon, bp[12]_fiveprime_exon, bp[12]_threeprime_exon.

db_fusions

A reference fusion catalogue as a data.frame/data.table with at least the columns gene1, gene2, gene1_chro, gene2_chro, gene1_strand, gene2_strand, gene1_exonnumber, gene2_exonnumber, and fusion_id.

max_results

Integer. Maximum number of highest‑scoring database fusions returned for each sequenced fusion (default 5).

verbose

Logical. If TRUE, prints messages as it runs (default FALSE).
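
Before calling the function it can help to confirm that the reference catalogue carries every column listed for db_fusions. The check below is a minimal sketch, not part of the package: required_db_cols is a hypothetical helper vector, and the two rows are made-up illustrations whose chromosome, strand and exon encodings are assumptions.

```r
# Columns listed under the db_fusions argument.
required_db_cols <- c("gene1", "gene2", "gene1_chro", "gene2_chro",
                      "gene1_strand", "gene2_strand",
                      "gene1_exonnumber", "gene2_exonnumber", "fusion_id")

# Toy catalogue; all values are illustrative and the encodings are assumed.
db_fusions <- data.frame(
  fusion_id        = c("F1", "F2"),
  gene1            = c("BCR", "EML4"),
  gene2            = c("ABL1", "ALK"),
  gene1_chro       = c("chr22", "chr2"),
  gene2_chro       = c("chr9", "chr2"),
  gene1_strand     = c("+", "+"),
  gene2_strand     = c("+", "-"),
  gene1_exonnumber = c(14, 13),
  gene2_exonnumber = c(2, 20),
  stringsAsFactors = FALSE
)

# Stop early with a readable message if anything is missing.
missing_cols <- setdiff(required_db_cols, names(db_fusions))
if (length(missing_cols) > 0) {
  stop("db_fusions is missing: ", paste(missing_cols, collapse = ", "))
}
```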

Value

A named list. Each element corresponds to one row in sequenced_fusions and contains up to max_results sub‑lists, each with:

  • db_index – row index in db_fusions

  • fusion_id – identifier from the reference catalogue

  • score – composite similarity score

  • gene1, gene2 – gene symbols in reference fusion

  • gene1_exon, gene2_exon – reference exon numbers

  • match_details – short human‑readable description
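
Because the result is a list of lists, it is often easier to inspect as a single table. The sketch below builds a mock object with the fields listed above and flattens it; the element names (fusion_1, fusion_2), identifiers, scores and match_details strings are invented for illustration, and the real names and text produced by find_related_fusions() may differ.

```r
# Mock result with the documented fields; every value here is made up.
related <- list(
  fusion_1 = list(
    list(db_index = 12L, fusion_id = "F0012", score = 250,
         gene1 = "BCR", gene2 = "ABL1", gene1_exon = 14, gene2_exon = 2,
         match_details = "placeholder"),
    list(db_index = 40L, fusion_id = "F0040", score = 185,
         gene1 = "BCR", gene2 = "ABL1", gene1_exon = 13, gene2_exon = 3,
         match_details = "placeholder")
  ),
  fusion_2 = list(
    list(db_index = 7L, fusion_id = "F0007", score = 210,
         gene1 = "EML4", gene2 = "ALK", gene1_exon = 13, gene2_exon = 20,
         match_details = "placeholder")
  )
)

# One row per (observed fusion, database match) pair.
flat <- do.call(rbind, lapply(names(related), function(nm) {
  do.call(rbind, lapply(related[[nm]], function(m) {
    data.frame(query = nm, m, stringsAsFactors = FALSE)
  }))
}))

flat[, c("query", "fusion_id", "score")]
```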

Details

Similarity is evaluated with a composite score made up of:

Gene‐pair match

+100 if the gene symbols match (either orientation).

Strand concordance

+50 if the strand orientations also match.

Breakpoint exon proximity

Up to +30 per partner, scaled as \(30/(|d|+1)\), where \(d\) is the difference between the observed and reference exon numbers. Both the five‑prime and three‑prime nearest exons of the observed breakpoint are considered.

Breakpoint context

+20 for each partner when both the observed and reference fusion breakpoints occur in exon sequence (or both in intron sequence).

Only candidates with a final score \(> 5\) are retained.
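
The components above can be combined as in the following sketch. It is an illustration under stated assumptions, not the package's internal scoring code: sketch_score() and proximity() are hypothetical names, the strand bonus is counted only when the genes match (one reading of "also"), and the better of the five‑prime and three‑prime nearest exons is used for each partner, which the text does not specify.

```r
# Exon-proximity component for one partner: 30 / (|d| + 1).
proximity <- function(obs_exons, ref_exon) {
  # Assumption: take the better of the 5' and 3' nearest observed exons.
  max(30 / (abs(obs_exons - ref_exon) + 1))
}

# Hypothetical composite score following the four components listed above.
sketch_score <- function(genes_match, strands_match,
                         obs_exons1, ref_exon1, obs_exons2, ref_exon2,
                         context_match1, context_match2) {
  100 * genes_match +
    50 * (genes_match && strands_match) +  # assumed to require a gene match
    proximity(obs_exons1, ref_exon1) +
    proximity(obs_exons2, ref_exon2) +
    20 * context_match1 +
    20 * context_match2
}

# Same genes, strands, exons and breakpoint context for both partners.
best <- sketch_score(TRUE, TRUE, c(13, 14), 14, c(2, 3), 2, TRUE, TRUE)

# No gene match and distant exons: only small proximity terms remain.
weak <- sketch_score(FALSE, FALSE, c(5, 6), 20, c(1, 2), 30, FALSE, FALSE)

best       # 250, the maximum under these assumptions
weak > 5   # FALSE, so this candidate would be discarded
```

Under this reading, a candidate that shares neither gene symbols nor breakpoint context can only pass the threshold when its exon numbers are close to the reference.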

Internally, two lookup hash tables are created for the reference database: one keyed on gene pair (gene1--gene2) and one on chromosome pair (chr1_chr2). Each observed fusion is then scored only against the reference fusions found under its keys, rather than against the entire catalogue.
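
The idea can be sketched with plain named lists standing in for the hash tables. This is an assumption-laden illustration rather than the function's actual code: the toy db rows, the candidates() helper and the "chr22"-style chromosome labels are all hypothetical, and looking up both key orientations is one possible way to honour the "either orientation" rule.

```r
# Toy reference catalogue; values are made up for illustration.
db <- data.frame(
  fusion_id  = c("F1", "F2", "F3"),
  gene1      = c("BCR", "EML4", "BCR"),
  gene2      = c("ABL1", "ALK", "JAK2"),
  gene1_chro = c("chr22", "chr2", "chr22"),
  gene2_chro = c("chr9", "chr2", "chr9"),
  stringsAsFactors = FALSE
)

# Row indices grouped by key; named lists play the role of the hash tables.
by_gene_pair <- split(seq_len(nrow(db)), paste(db$gene1, db$gene2, sep = "--"))
by_chr_pair  <- split(seq_len(nrow(db)),
                      paste(db$gene1_chro, db$gene2_chro, sep = "_"))

# Hypothetical helper: candidate rows for one observed fusion.
candidates <- function(g1, g2, c1, c2) {
  gene_keys <- c(paste(g1, g2, sep = "--"), paste(g2, g1, sep = "--"))
  chr_keys  <- c(paste(c1, c2, sep = "_"), paste(c2, c1, sep = "_"))
  idx <- c(unlist(by_gene_pair[gene_keys]), unlist(by_chr_pair[chr_keys]))
  sort(unique(as.integer(idx)))
}

candidates("ABL1", "BCR", "chr9", "chr22")  # rows 1 and 3
candidates("EML4", "ALK", "chr2", "chr2")   # row 2
```

Only the rows returned by candidates() would need to be scored; every other row of db is skipped.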

The helper calc_score() function (defined inside find_related_fusions()) implements the scoring scheme described above. Scores \(\le 5\) are treated as noise and the corresponding matches are discarded.

Author

Your Name

Examples

# \donttest{
# sequenced_fusions: output of annotate_genomic_coordinates()
# db_fusions: reference catalogue with the columns listed under Arguments
related <- find_related_fusions(sequenced_fusions, db_fusions, max_results = 3)
related[[1]]            # top matches for the first observed fusion
# }