Enhanced document analysis with Markdown parsing and size statistics #1

Open
jlevy wants to merge 17 commits into main from feature/extend-chopdiff-section-iteration

jlevy commented Aug 5, 2025

Summary

  • Replace regex-based header parsing with robust Marko-based Markdown parser to correctly handle headers in code blocks
  • Add new insert_size_info.py example for inserting document statistics after section headers
  • Extract and simplify read time calculation into dedicated util/read_time.py module
  • Add SectionDoc and FlexDoc for structured document parsing with hierarchical section navigation

Key Changes

Markdown Parsing Enhancement

  • Fixed bug: Headers in code blocks were incorrectly being parsed as real headers
  • Solution: Integrated marko (via flowmark) for proper Markdown AST parsing in SectionDoc
  • Added comprehensive test coverage with diverse Markdown syntax examples
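The bug described above can be reproduced with a minimal, stdlib-only sketch. This is not the marko-based code from this PR: `naive_headers` and `fence_aware_headers` are hypothetical names used only for illustration, and the fence-aware version is a simplification that handles fenced blocks only.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
FENCE_RE = re.compile(r"^\s{0,3}(`{3,}|~{3,})")


def naive_headers(text: str) -> list[tuple[int, str]]:
    """Regex-only detection: any line starting with '#' counts as a header."""
    found = []
    for line in text.splitlines():
        m = HEADER_RE.match(line)
        if m:
            found.append((len(m.group(1)), m.group(2)))
    return found


def fence_aware_headers(text: str) -> list[tuple[int, str]]:
    """Same regex, but lines inside fenced code blocks are skipped."""
    found = []
    open_fence = None  # fence marker that opened the current block, if any
    for line in text.splitlines():
        fence = FENCE_RE.match(line)
        if fence:
            marker = fence.group(1)
            if open_fence is None:
                open_fence = marker
            elif (
                marker[0] == open_fence[0]
                and len(marker) >= len(open_fence)
                and line.strip() == marker  # a closing fence has no info string
            ):
                open_fence = None
            continue
        if open_fence is None:
            m = HEADER_RE.match(line)
            if m:
                found.append((len(m.group(1)), m.group(2)))
    return found


FENCE = "`" * 3  # built this way to avoid a literal fence inside this example
SAMPLE = "\n".join([
    "# Title",
    "",
    FENCE + "bash",
    "# not a header",
    "echo hi",
    FENCE,
    "",
    "## Real section",
])

print(naive_headers(SAMPLE))        # includes the shell comment as a header
print(fence_aware_headers(SAMPLE))  # only the two real headers
```

Even the fence-aware version misses indented code blocks, headers nested in other containers, and similar cases, which is presumably why a full Markdown parser is the safer choice.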

Core Features

  1. SectionDoc class:

    • Parses Markdown documents into hierarchical section structure
    • Provides section iteration with configurable depth and root inclusion
    • Correctly handles headers using proper Markdown AST parsing
  2. FlexDoc class:

    • Flexible document wrapper supporting multiple formats (SectionDoc, TextDoc)
    • Unified API for document statistics and iteration
    • Seamless integration with existing chopdiff functionality
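As a rough illustration of what a hierarchical section structure with depth-limited iteration might look like, here is a small sketch. The `Section` class, `build_tree`, and the `max_depth` / `include_root` parameters are assumptions made for this example and are not the actual `SectionDoc` API.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Section:
    """Hypothetical section node: a header level, a title, and child sections."""

    level: int
    title: str
    children: list["Section"] = field(default_factory=list)

    def iter_sections(
        self, max_depth: Optional[int] = None, include_root: bool = True
    ) -> Iterator["Section"]:
        """Yield sections depth-first, going at most max_depth levels below self."""
        if include_root:
            yield self
        if max_depth is not None and max_depth <= 0:
            return
        next_depth = None if max_depth is None else max_depth - 1
        for child in self.children:
            yield from child.iter_sections(next_depth, include_root=True)


def build_tree(headers: list[tuple[int, str]]) -> Section:
    """Nest (level, title) pairs under a synthetic level-0 root using a stack."""
    root = Section(level=0, title="(root)")
    stack = [root]
    for level, title in headers:
        # Pop until the top of the stack is a strict ancestor of this header.
        while stack[-1].level >= level:
            stack.pop()
        node = Section(level, title)
        stack[-1].children.append(node)
        stack.append(node)
    return root


headers = [(1, "Intro"), (2, "Setup"), (3, "Details"), (2, "Usage"), (1, "Appendix")]
tree = build_tree(headers)
print([s.title for s in tree.iter_sections(include_root=False)])
print([s.title for s in tree.iter_sections(max_depth=1, include_root=False)])
```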

New Examples & Utilities

  1. insert_size_info.py example:

    • Inserts HTML <div class="size-info"> elements after section headers
    • Provides word count, character count, sentence count, paragraph count, subsection count, and reading time
    • Supports customizable header levels
  2. read_time.py utility module:

    • Centralized reading time calculation and formatting
    • Uses prettyfmt for human-readable time display
    • Configurable minimum time threshold (default 3 minutes)
    • Default reading speed of 225 WPM
  3. analyze_doc.py enhancement:

    • Enhanced with rich formatting for colorized tree and table output
    • Shows document structure and comprehensive statistics
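The reading-time defaults and the size-info element listed above can be sketched as follows. Only the 225 WPM speed, the 3-minute threshold, and the `<div class="size-info">` wrapper come from this description. The function names, the text inside the div, and the choice to omit read times below the threshold are assumptions, and plain f-strings stand in for prettyfmt.

```python
WORDS_PER_MINUTE = 225  # default reading speed stated in the PR
MIN_MINUTES = 3         # default minimum threshold stated in the PR


def read_time_minutes(word_count: int, wpm: int = WORDS_PER_MINUTE) -> float:
    """Estimated reading time in minutes."""
    return word_count / wpm


def format_read_time(
    word_count: int, wpm: int = WORDS_PER_MINUTE, min_minutes: float = MIN_MINUTES
) -> str:
    """Return e.g. '4 min read', or '' when below the threshold (assumed behavior)."""
    minutes = read_time_minutes(word_count, wpm)
    if minutes < min_minutes:
        return ""
    return f"{round(minutes)} min read"


def size_info_div(text: str) -> str:
    """Build a size-info element for a section body (inner format is hypothetical)."""
    words = len(text.split())
    parts = [f"{words} words", f"{len(text)} characters"]
    read_time = format_read_time(words)
    if read_time:
        parts.append(read_time)
    return f'<div class="size-info">{", ".join(parts)}</div>'


print(format_read_time(900))           # 900 / 225 = 4.0 minutes
print(size_info_div("one two three"))  # short text, so no read time is shown
```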

Documentation & Examples

  • Simplified README with concise descriptions and links to examples
  • Updated all examples to use PEP 723 inline script dependencies
  • Added comprehensive Markdown test file for parsing validation
  • Code follows CLAUDE.md Python coding guidelines
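For readers unfamiliar with PEP 723, the sketch below shows what an inline metadata block looks like and how a tool could extract it. The regex follows the reference implementation in the PEP. The example header (Python 3.11+ and `rich`) is an assumption based on the versions and dependency mentioned in this PR, so the headers in the example scripts may differ. `script_metadata` is a hypothetical helper name.

```python
import re
from typing import Optional

# Block-matching regex as given in the PEP 723 reference implementation.
BLOCK_RE = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)

# Illustrative script source; the dependency list is assumed, not copied from the repo.
EXAMPLE_SCRIPT = '''\
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "rich",
# ]
# ///
print("hello")
'''


def script_metadata(source: str) -> Optional[str]:
    """Return the TOML text of the `script` metadata block, or None if absent."""
    for match in BLOCK_RE.finditer(source):
        if match.group("type") == "script":
            lines = match.group("content").splitlines(keepends=True)
            # Strip the leading "# " (or bare "#") comment prefix from each line.
            return "".join(
                line[2:] if line.startswith("# ") else line[1:] for line in lines
            )
    return None


print(script_metadata(EXAMPLE_SCRIPT))
```

A PEP 723-aware runner reads this block and installs the listed dependencies before running the script, which is what makes `sys.path` workarounds and separate install instructions unnecessary.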

Testing

  • Added tests for code block header filtering
  • Created comprehensive test suite for insert_size_info functionality
  • Updated tests for simplified read_time API
  • Full test coverage for SectionDoc and FlexDoc classes
  • All tests and linting pass ✅ (150 tests)

Test Plan

  • Run make lint - all checks pass
  • Run make test - all 150 tests pass
  • Test analyze_doc.py on various Markdown files
  • Test insert_size_info.py on README and sample documents
  • Verify headers in code blocks are correctly ignored
  • Confirm read time calculations match expectations
  • Test SectionDoc with complex nested documents
  • Test FlexDoc with different document types

🤖 Generated with Claude Code

jlevy and others added 17 commits August 5, 2025 10:45
- Add SectionDoc for hierarchical Markdown section parsing
- Add FlexDoc as unified interface for TextDoc, TextNode, and SectionDoc views
- Implement thread-safe lazy loading with synchronized decorator
- Add comprehensive tests for all new components
- Update documentation with examples and API reference

Features:
- Parse Markdown into hierarchical section tree structure
- Navigate sections by level, title, or path
- Cross-reference between token, div, and section views
- Smart section-based document chunking
- Thread-safe concurrent access to all views
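
One common way to get thread-safe lazy loading with a `synchronized` decorator is sketched below. This is an illustration of the general pattern rather than the code in this commit: the decorator body, the `LazyDoc` class, and the `_lock` attribute are all assumptions.

```python
import functools
import threading


def synchronized(method):
    """Run `method` while holding the instance's lock (hypothetical helper)."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        with self._lock:
            return method(self, *args, **kwargs)

    return wrapper


class LazyDoc:
    """Toy document whose paragraph view is computed once, on first access."""

    def __init__(self, text: str) -> None:
        self.text = text
        self._lock = threading.RLock()  # re-entrant, so synchronized methods can nest
        self._paragraphs = None
        self.parse_count = 0  # how many times the view was actually built

    @synchronized
    def paragraphs(self) -> list[str]:
        if self._paragraphs is None:
            self.parse_count += 1
            self._paragraphs = [p for p in self.text.split("\n\n") if p.strip()]
        return self._paragraphs


doc = LazyDoc("first\n\nsecond\n\nthird")
threads = [threading.Thread(target=doc.paragraphs) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(doc.parse_count)  # the view is built once, however many threads ask for it
```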

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add explicit type annotations for lists and dicts
- Import override decorator for __repr__ methods
- Fix Callable type hint in test_thread_utils.py
- Ensure all type checking passes in basedpyright

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove override decorator usage (requires Python 3.12+)
- Add reportImplicitOverride = false to basedpyright config
- Ensure compatibility with Python 3.11, 3.12, and 3.13

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Use pyright: ignore comments for implicit override warnings
- Maintain Python 3.11 compatibility without override decorator
- Keep all basedpyright checks enabled as per best practices

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add typing_extensions dependency for @override decorator
- Use @override decorators for all overridden methods
- Simplify docstrings by removing redundant Args/Returns sections
- Keep docstrings concise as per guidelines
- All tests and linting pass
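
The compatibility issue behind these override-related commits is that `typing.override` only exists on Python 3.12+, while `typing_extensions` provides the same decorator for older versions. A hedged sketch of the import pattern follows. The no-op fallback and the `BaseDoc` / `SectionView` classes are additions for this example only, so that it runs even without `typing_extensions` installed.

```python
import sys

if sys.version_info >= (3, 12):
    from typing import override
else:
    try:
        from typing_extensions import override
    except ImportError:
        # Fallback for this example only: a no-op decorator with the same shape.
        def override(method):
            return method


class BaseDoc:
    def describe(self) -> str:
        return "base"


class SectionView(BaseDoc):
    @override
    def describe(self) -> str:
        return "sections"

    @override
    def __repr__(self) -> str:
        return "SectionView()"


print(SectionView().describe(), repr(SectionView()))
```

At runtime the decorator does nothing visible. Its value is that a type checker can flag a method marked as an override that no longer matches anything in a base class.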

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add key properties/methods summaries to class docstrings
- Update README.md examples to use correct method names
- Add thread safety example to README
- Fix incorrect method names that don't exist in implementation
- Delete API.md as all content is now properly integrated

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Create analyze_doc.py CLI tool for section-by-section analysis
- Shows hierarchical tree view with statistics per section
- Includes paragraphs, sentences, words, and reading time
- Supports both tree and flat table output formats
- Add example to README.md with sample output
- Works with files or stdin input

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove redundant Args/Returns sections where obvious from signatures
- Keep docstrings concise as per Python coding guidelines
- Maintain clear documentation while reducing verbosity

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Simplified README.md by removing large code/output blocks
- Replaced verbose examples with concise descriptions and links
- Enhanced analyze_doc.py to use rich library for colorized output
- Added proper type checking support for optional rich dependency
- Maintained backwards compatibility when rich is not installed

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Added inline script dependencies (PEP 723) to all example scripts
- Removed sys.path.insert workarounds in favor of proper dependencies
- Added rich as a direct dependency for analyze_doc.py
- Removed conditional rich import logic - script now requires rich
- Updated README to remove pip install instructions
- Added pyright ignore comments for external dependencies

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Updated SectionDoc to track code fence positions
- Headers inside code blocks are now properly ignored
- Added tests for code block handling
- Fixed rich styling issue in analyze_doc.py

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Replaced regex-based header detection with marko markdown parser
- Now properly uses marko's DOM traversal to find headers
- Headers inside code blocks are automatically ignored by parser
- Maintains backward compatibility with all existing tests
- No longer needs manual code fence tracking

This ensures proper markdown parsing that respects all markdown rules
including code blocks, inline code, and other edge cases.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Added test_markdown.md with various markdown features including:
  - Code blocks in multiple languages (bash, yaml, python, shell)
  - Blockquotes and nested blockquotes
  - Numbered and bulleted lists
  - Tables with # symbols in cells
  - Inline code with # symbols
- Auto-formatted all markdown files with flowmark
- Verified proper header parsing ignores all non-header # symbols

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove obsolete brief parameter from test_insert_size_info tests
- Update test to match current API after merge with main
jlevy changed the title from "Add SectionDoc and FlexDoc for structured document parsing" to "Enhanced document analysis with Markdown parsing and size statistics" on Aug 6, 2025