Description
Describe the bug
When a genomic delins variant spans a position that contains a transcript-reference disagreement (a CIGAR I segment: bases present in the genome but absent from the transcript), the internal mapping creates a double gap. That double gap should collapse to a simple substitution. Instead, g_to_c produces an incorrect multi-base delins, and c_to_p then reports a frameshift.
Steps to reproduce
import hgvs.assemblymapper
import hgvs.dataproviders.uta as uta
import hgvs.parser

hdp = uta.connect()
am = hgvs.assemblymapper.AssemblyMapper(hdp, assembly_name="GRCh38")
hp = hgvs.parser.Parser()

var_g = hp.parse("NC_000011.10:g.119027721_119027726delinsTCACA")
var_c = am.g_to_c(var_g, "NM_001164277.1")
print(var_c)  # NM_001467.6:c.527_532delinsTGTGA ← incorrect
var_p = am.c_to_p(var_c)
print(var_p)  # frameshift ← incorrect

Genomic context:
          :  119,027,740         119,027,720
chrom pos :  .    |    .    |    .    |    .    |    .
seq <-    : CGGGATAGACCCCGTGACACCACACCAACAGAGGAAGGAGACAGAG
seq ->    : GCCCTATCTGGGGCACTGTGGTGTGGTTGTCTCCTTCCTCTGTCTC
tx ref dif:                    I
region    :                     |----|
tx seq -> : GCCCTATCTGGGGCACTGT-GTGTGGTTGTCTCCTTCCTCTGTCTC
tx pos    :   |    .    |    .     |    .    |    .    |
          :  510       520        530       540       550
aa seq -> : AlaLeuSerGlyAlaLeuC-ysValValValSerPheLeuCysLeu
aa pos    : |||            ...             |||
          : 170                            180
Current behavior
- g_to_c returns NM_001467.6:c.527_532delinsTGTGA
- c_to_p returns a frameshift
Expected behavior
The genomic delins only changes a single base relative to the transcript. The tx-ref
disagreement bases in the flanking I CIGAR segment should be absorbed during mapping,
collapsing the double-gap to:
- g_to_c → NM_001164277.1:c.532G>A
- c_to_p → a missense (not a frameshift)
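The expected collapse can be checked by hand from the sequences in the diagram above. The following is a minimal sketch, not the hgvs implementation: `trim_common_affixes` is a hypothetical helper, and the three sequences are read off the diagram in transcript orientation (the inserted bases are the reverse complement of the genomic `TCACA`).

```python
def trim_common_affixes(start, ref, alt):
    """Strip bases shared by ref and alt; return (start, ref, alt) of the minimal edit."""
    # Shared prefix: advance the start coordinate past identical leading bases.
    n = 0
    while n < min(len(ref), len(alt)) and ref[n] == alt[n]:
        n += 1
    start, ref, alt = start + n, ref[n:], alt[n:]
    # Shared suffix: drop identical trailing bases.
    m = 0
    while m < min(len(ref), len(alt)) and ref[-1 - m] == alt[-1 - m]:
        m += 1
    if m:
        ref, alt = ref[:-m], alt[:-m]
    return start, ref, alt


# Assumed inputs, taken from the diagram (transcript orientation).
genome_window = "GGTGTGG"  # genome-only base (I) + the 6 bases of the variant region
tx_ref = "GTGTGG"          # transcript reference at c.527_532
inserted = "TGTGA"         # reverse complement of the genomic TCACA

# The genome-only base is untouched; the 6 region bases are replaced.
alt = genome_window[0] + inserted

print(trim_common_affixes(527, tx_ref, alt))  # (532, 'G', 'A')
```

Under these assumptions the 6-base deletion plus 5-base insertion, offset by the one genome-only base, leaves a single difference at c.532, which matches the expected `c.532G>A`.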
Root cause
The CIGAR alignment for NM_001164277.1 has an I segment at this position (extra bases
in the genome not present in the transcript). The current mapping logic does not
account for adjacent I segments when computing the transcript-coordinate edit, so the
genome-only bases appear as deleted sequence in the transcript representation,
producing a spurious multi-base delins instead of a SNV.
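To illustrate why the I segment matters for coordinate arithmetic, here is a toy CIGAR walk. It is a sketch under stated assumptions, not the mapper's code: `genome_to_tx_offsets` is a hypothetical function, the CIGAR string `19=1I26=` is inferred from the 46-column diagram rather than taken from UTA, and `I` is treated as a genome-only base, following the convention used in this issue.

```python
import re


def genome_to_tx_offsets(cigar, g_start, g_end):
    """Return 0-based transcript offsets aligned to genome offsets [g_start, g_end)."""
    g = t = 0
    offsets = []
    for length, op in re.findall(r"(\d+)([=XMID])", cigar):
        for _ in range(int(length)):
            if op in "=XM":        # aligned base: consumes genome and transcript
                if g_start <= g < g_end:
                    offsets.append(t)
                g += 1
                t += 1
            elif op == "I":        # genome-only base (this issue's convention)
                g += 1
            else:                  # "D": transcript-only base
                t += 1
    return offsets


TOY_CIGAR = "19=1I26="  # hypothetical, inferred from the diagram
TX_FIRST = 508          # c. position of the first diagram column

# Variant region alone (diagram columns 21-26).
region = genome_to_tx_offsets(TOY_CIGAR, 20, 26)
# Region extended by the adjacent genome-only base (columns 20-26).
extended = genome_to_tx_offsets(TOY_CIGAR, 19, 26)

print([TX_FIRST + o for o in region])    # c.527 .. c.532
print(region == extended)                # True
```

In this toy alignment, seven genomic bases and six genomic bases map to the same six transcript positions, so an edit computed without accounting for the adjacent I base has a length that is off by one relative to the transcript.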