delins adjacent to tx-ref disagreement incorrectly produces a double-gap instead of collapsing to a SNV #819

@andreasprlic

Describe the bug
When a genomic delins variant sits adjacent to a transcript-reference disagreement (a CIGAR I segment: bases present in the genome but absent from the transcript), the internal mapping creates a double-gap. That double-gap should collapse to a simple substitution. Instead, g_to_c produces an incorrect multi-base delins, and c_to_p then reports a frameshift p-dot.

Steps to reproduce

import hgvs.dataproviders.uta as uta
import hgvs.assemblymapper
import hgvs.parser

hdp = uta.connect()
am = hgvs.assemblymapper.AssemblyMapper(hdp, assembly_name="GRCh38")
hp = hgvs.parser.Parser()

var_g = hp.parse("NC_000011.10:g.119027721_119027726delinsTCACA")
var_c = am.g_to_c(var_g, "NM_001164277.1")
print(var_c)   # NM_001467.6:c.527_532delinsTGTGA  ← incorrect
var_p = am.c_to_p(var_c)
print(var_p)   # frameshift ← incorrect

Genomic context:

         :       119,027,740         119,027,720
chrom pos :  .    |    .    |    .    |    .    |    .    
seq    <- : CGGGATAGACCCCGTGACACCACACCAACAGAGGAAGGAGACAGAG
seq    -> : GCCCTATCTGGGGCACTGTGGTGTGGTTGTCTCCTTCCTCTGTCTC
tx ref dif:                    I                          
region    :                     |----|                    
tx seq -> : GCCCTATCTGGGGCACTGT-GTGTGGTTGTCTCCTTCCTCTGTCTC
tx pos    :   |    .    |    .     |    .    |    .    |  
          :   510       520        530       540       550
aa seq -> : AlaLeuSerGlyAlaLeuC-ysValValValSerPheLeuCysLeu
aa pos    : |||            ...             |||            
          : 170                            180            

Current behavior

  • g_to_c returns NM_001467.6:c.527_532delinsTGTGA
  • c_to_p returns a frameshift

Expected behavior

Relative to the transcript, the genomic delins changes only a single base. The tx-ref
disagreement bases in the flanking I CIGAR segment should be absorbed during mapping,
collapsing the double-gap to:

  • g_to_c → NM_001164277.1:c.532G>A
  • c_to_p → a missense (not a frameshift)
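
To make the expected collapse concrete, here is a minimal sketch that uses only the sequences read off the genomic context diagram. It is not hgvs library code: the helper `trim_shared_ends` is a hypothetical name, and treating "absorbed" as "prepend the genome-only base to the alternate allele before comparing with the transcript reference" is an assumption about what the fix should do.

```python
def trim_shared_ends(start, ref, alt):
    """Trim bases shared by ref and alt; return (start, ref, alt) for the minimal edit."""
    # Drop the shared prefix, advancing the start coordinate.
    while ref and alt and ref[0] == alt[0]:
        ref, alt, start = ref[1:], alt[1:], start + 1
    # Drop the shared suffix.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    return start, ref, alt


TX_REF = "GTGTGG"    # transcript bases c.527_532, read off the diagram
GENOME_ONLY = "G"    # base under the "I" marker, present only in the genome
ALT = "TGTGA"        # delinsTCACA, reverse-complemented into transcript orientation

# Ignoring the genome-only base: nothing can be trimmed, so the 6-base delins remains.
naive = trim_shared_ends(527, TX_REF, ALT)

# Absorbing the genome-only base: the alleles have equal length and differ at one base.
absorbed = trim_shared_ends(527, TX_REF, GENOME_ONLY + ALT)

print(naive)     # (527, 'GTGTGG', 'TGTGA')
print(absorbed)  # (532, 'G', 'A')
```

The second result corresponds to c.532G>A, which keeps the reading frame intact.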

Root cause

The CIGAR alignment for NM_001164277.1 has an I segment at this position (extra bases
in the genome not present in the transcript). The current mapping logic does not
account for adjacent I segments when computing the transcript-coordinate edit, so the
genome-only bases appear as deleted sequence in the transcript representation,
producing a spurious multi-base delins instead of a SNV.
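
One way the missing check could look is sketched below. Everything in it is illustrative: the CIGAR string is a hypothetical one modelled on the 46-column window in the diagram (not the real UTA alignment), the function names are made up, and the convention that `I` consumes genome bases only is taken from the description above.

```python
import re


def genome_only_spans(cigar):
    """Return [start, end) genome offsets of 'I' ops (genome-only bases)."""
    spans, g = [], 0
    for length, op in re.findall(r"(\d+)([=XMID])", cigar):
        n = int(length)
        if op == "I":
            spans.append((g, g + n))
        if op != "D":  # under this convention, every op except D consumes genome bases
            g += n
    return spans


def touching(spans, start, end):
    """Return the spans that overlap or directly abut the interval [start, end)."""
    return [(s, e) for s, e in spans if s <= end and e >= start]


# Hypothetical CIGAR for the diagram window, in transcript orientation:
# 19 matching columns, 1 genome-only base, 26 matching columns.
WINDOW_CIGAR = "19=1I26="

spans = genome_only_spans(WINDOW_CIGAR)
# The delins region occupies window columns 20..25, i.e. the interval [20, 26).
adjacent = touching(spans, 20, 26)
print(spans, adjacent)
```

A mapper that found a non-empty `adjacent` list could then fold those genome-only bases into the edit before projecting it to transcript coordinates.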

Labels: bug

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions