Enhanced document analysis with Markdown parsing and size statistics by jlevy · Pull Request #1 · jlevy/chopdiff

jlevy · 2025-08-05T17:46:17Z

Summary

Replace regex-based header parsing with a robust Marko-based Markdown parser so that `#` lines inside code blocks are no longer treated as headers
Add new insert_size_info.py example for inserting document statistics after section headers
Extract and simplify read time calculation into dedicated util/read_time.py module
Add SectionDoc and FlexDoc for structured document parsing with hierarchical section navigation

Key Changes

Markdown Parsing Enhancement

Fixed bug: Headers in code blocks were incorrectly being parsed as real headers
Solution: Integrated marko (via flowmark) for proper Markdown AST parsing in SectionDoc
Added comprehensive test coverage with diverse Markdown syntax examples
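The failure mode described above can be reproduced with the standard library alone. The sketch below is not the PR's implementation (which relies on marko's AST); it only contrasts a regex-only scan with a simplified fence-aware scan. The function names and the sample text are hypothetical, and the fence handling ignores many CommonMark details such as indented code blocks.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
FENCE_RE = re.compile(r"^\s{0,3}(`{3,}|~{3,})")


def naive_headers(text: str) -> list[tuple[int, str]]:
    """Regex-only scan: every line that looks like '# ...' counts as a header."""
    found = []
    for line in text.splitlines():
        m = HEADER_RE.match(line)
        if m:
            found.append((len(m.group(1)), m.group(2)))
    return found


def fence_aware_headers(text: str) -> list[tuple[int, str]]:
    """Same scan, but lines inside fenced code blocks are skipped."""
    found = []
    open_fence = None  # marker that opened the current fenced block, if any
    for line in text.splitlines():
        fence = FENCE_RE.match(line)
        if fence:
            marker = fence.group(1)
            if open_fence is None:
                open_fence = marker  # entering a fenced block
            elif marker[0] == open_fence[0] and len(marker) >= len(open_fence):
                open_fence = None  # matching fence closes the block
            continue
        if open_fence is not None:
            continue  # inside a code block: never a header
        m = HEADER_RE.match(line)
        if m:
            found.append((len(m.group(1)), m.group(2)))
    return found


FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence
SAMPLE = "\n".join([
    "# Title",
    "",
    FENCE + "bash",
    "# not a header, just a shell comment",
    "echo hi",
    FENCE,
    "",
    "## Real section",
])

print(naive_headers(SAMPLE))
print(fence_aware_headers(SAMPLE))
```

The naive scan reports three headers for the sample, the fence-aware scan two. A full Markdown parser avoids maintaining this kind of state by hand, which is the motivation the PR gives for switching to marko.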

Core Features

SectionDoc class:
- Parses Markdown documents into hierarchical section structure
- Provides section iteration with configurable depth and root inclusion
- Identifies headers via Markdown AST parsing, so header-like lines inside code blocks are ignored
FlexDoc class:
- Flexible document wrapper supporting multiple formats (SectionDoc, TextDoc)
- Unified API for document statistics and iteration
- Seamless integration with existing chopdiff functionality
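As a rough illustration of hierarchical section parsing with depth-limited iteration, the sketch below builds a tree from a flat list of `(level, title)` pairs. `SectionNode`, `build_tree`, and `iter_sections` are hypothetical names chosen for this example; they are not the actual SectionDoc or FlexDoc API, which this page does not show.

```python
from __future__ import annotations

from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class SectionNode:
    level: int
    title: str
    children: list[SectionNode] = field(default_factory=list)

    def iter_sections(
        self, max_depth: int | None = None, include_root: bool = True
    ) -> Iterator[SectionNode]:
        """Walk the tree depth-first, descending at most `max_depth` levels."""
        if include_root:
            yield self
        if max_depth is not None and max_depth <= 0:
            return
        next_depth = None if max_depth is None else max_depth - 1
        for child in self.children:
            yield from child.iter_sections(next_depth, include_root=True)


def build_tree(headers: list[tuple[int, str]]) -> SectionNode:
    """Nest headers under the closest preceding header with a smaller level."""
    root = SectionNode(level=0, title="(root)")
    stack = [root]
    for level, title in headers:
        # Pop back up until the top of the stack can be a parent of this header.
        while stack[-1].level >= level:
            stack.pop()
        node = SectionNode(level, title)
        stack[-1].children.append(node)
        stack.append(node)
    return root


headers = [(1, "Intro"), (2, "Setup"), (3, "Details"), (2, "Usage"), (1, "Appendix")]
root = build_tree(headers)
top_level = [s.title for s in root.iter_sections(max_depth=1, include_root=False)]
print(top_level)
```

The stack keeps the current chain of ancestors, so each header is attached in a single pass over the document.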

New Examples & Utilities

insert_size_info.py example:
- Inserts HTML <div class="size-info"> elements after section headers
- Provides word count, character count, sentence count, paragraph count, subsection count, and reading time
- Supports customizable header levels
read_time.py utility module:
- Centralized reading time calculation and formatting
- Uses prettyfmt for human-readable time display
- Configurable minimum time threshold (default 3 minutes)
- Default reading speed of 225 WPM
analyze_doc.py enhancement:
- Enhanced with rich formatting for colorized tree and table output
- Shows document structure and comprehensive statistics

Documentation & Examples

Simplified README with concise descriptions and links to examples
Updated all examples to use PEP 723 inline script dependencies
Added comprehensive Markdown test file for parsing validation
Code follows CLAUDE.md Python coding guidelines

Testing

Added tests for code block header filtering
Created comprehensive test suite for insert_size_info functionality
Updated tests for simplified read_time API
Full test coverage for SectionDoc and FlexDoc classes
All tests and linting pass ✅ (150 tests)

Test Plan

Run make lint - all checks pass
Run make test - all 150 tests pass
Test analyze_doc.py on various Markdown files
Test insert_size_info.py on README and sample documents
Verify headers in code blocks are correctly ignored
Confirm read time calculations match expectations
Test SectionDoc with complex nested documents
Test FlexDoc with different document types

🤖 Generated with Claude Code

- Add SectionDoc for hierarchical Markdown section parsing
- Add FlexDoc as unified interface for TextDoc, TextNode, and SectionDoc views
- Implement thread-safe lazy loading with synchronized decorator
- Add comprehensive tests for all new components
- Update documentation with examples and API reference

Features:
- Parse Markdown into hierarchical section tree structure
- Navigate sections by level, title, or path
- Cross-reference between token, div, and section views
- Smart section-based document chunking
- Thread-safe concurrent access to all views

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
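One common way to get thread-safe lazy loading with a `synchronized` decorator is sketched below. This is an assumption about the general pattern, not the PR's code: the decorator body, the `LazyViews` class, and the `build_count` counter are all hypothetical and exist only to make the behavior observable.

```python
import threading
from functools import wraps


def synchronized(method):
    """Run `method` while holding the instance's re-entrant lock (`self._lock`)."""

    @wraps(method)
    def wrapper(self, *args, **kwargs):
        with self._lock:
            return method(self, *args, **kwargs)

    return wrapper


class LazyViews:
    """Hypothetical document wrapper that builds an expensive view on first use."""

    def __init__(self, text: str) -> None:
        self.text = text
        self._lock = threading.RLock()
        self._words = None
        self.build_count = 0  # how many times the view was actually built

    @synchronized
    def words(self) -> list[str]:
        # The check and the assignment happen under the lock, so concurrent
        # callers cannot both see None and build the view twice.
        if self._words is None:
            self.build_count += 1
            self._words = self.text.split()
        return self._words


doc = LazyViews("a b c")
workers = [threading.Thread(target=doc.words) for _ in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(doc.build_count)
```

A re-entrant lock is used so that one synchronized method can call another on the same instance without deadlocking.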

- Add explicit type annotations for lists and dicts
- Import override decorator for __repr__ methods
- Fix Callable type hint in test_thread_utils.py
- Ensure all type checking passes in basedpyright

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Remove override decorator usage (requires Python 3.12+)
- Add reportImplicitOverride = false to basedpyright config
- Ensure compatibility with Python 3.11, 3.12, and 3.13

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Use pyright: ignore comments for implicit override warnings
- Maintain Python 3.11 compatibility without override decorator
- Keep all basedpyright checks enabled as per best practices

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

@override

- Add typing_extensions dependency for @override decorator
- Use @override decorators for all overridden methods
- Simplify docstrings by removing redundant Args/Returns sections
- Keep docstrings concise as per guidelines
- All tests and linting pass

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
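The commits above go back and forth on `@override` because `typing.override` only exists in Python 3.12 and later. The PR settles on the `typing_extensions` backport; the sketch below shows a different, standard-library-only alternative with a no-op fallback, purely to illustrate the compatibility problem. The `Base` and `Child` classes are hypothetical.

```python
import sys

if sys.version_info >= (3, 12):
    from typing import override
else:
    def override(method):
        """No-op stand-in on older Pythons; type checkers will not recognize it."""
        return method


class Base:
    def __repr__(self) -> str:
        return "Base()"


class Child(Base):
    @override
    def __repr__(self) -> str:
        return "Child()"


print(repr(Child()))
```

The fallback keeps the code importable on 3.11 but gives up static checking there, which is presumably why the PR adds `typing_extensions` as a dependency instead.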

- Add key properties/methods summaries to class docstrings
- Update README.md examples to use correct method names
- Add thread safety example to README
- Fix incorrect method names that don't exist in implementation
- Delete API.md as all content is now properly integrated

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Create analyze_doc.py CLI tool for section-by-section analysis
- Shows hierarchical tree view with statistics per section
- Includes paragraphs, sentences, words, and reading time
- Supports both tree and flat table output formats
- Add example to README.md with sample output
- Works with files or stdin input

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Remove redundant Args/Returns sections where obvious from signatures
- Keep docstrings concise as per Python coding guidelines
- Maintain clear documentation while reducing verbosity

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Simplified README.md by removing large code/output blocks
- Replaced verbose examples with concise descriptions and links
- Enhanced analyze_doc.py to use rich library for colorized output
- Added proper type checking support for optional rich dependency
- Maintained backwards compatibility when rich is not installed

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Added inline script dependencies (PEP 723) to all example scripts
- Removed sys.path.insert workarounds in favor of proper dependencies
- Added rich as a direct dependency for analyze_doc.py
- Removed conditional rich import logic - script now requires rich
- Updated README to remove pip install instructions
- Added pyright ignore comments for external dependencies

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Updated SectionDoc to track code fence positions
- Headers inside code blocks are now properly ignored
- Added tests for code block handling
- Fixed rich styling issue in analyze_doc.py

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Replaced regex-based header detection with marko markdown parser
- Now properly uses marko's DOM traversal to find headers
- Headers inside code blocks are automatically ignored by parser
- Maintains backward compatibility with all existing tests
- No longer needs manual code fence tracking

This ensures proper markdown parsing that respects all markdown rules including code blocks, inline code, and other edge cases.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Added test_markdown.md with various markdown features including:
  - Code blocks in multiple languages (bash, yaml, python, shell)
  - Blockquotes and nested blockquotes
  - Numbered and bulleted lists
  - Tables with # symbols in cells
  - Inline code with # symbols
- Auto-formatted all markdown files with flowmark
- Verified proper header parsing ignores all non-header # symbols

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

jlevy and others added 17 commits August 5, 2025 10:45

Merge remote-tracking branch 'origin/main' into feature/extend-chopdi…

2d7995f

…ff-section-iteration

Add read time util. Improved example.

48610ba

Merge branch 'main' into feature/extend-chopdiff-section-iteration

fa00354

Fix test compatibility after merge

2efaa51

- Remove obsolete brief parameter from test_insert_size_info tests
- Update test to match current API after merge with main

jlevy changed the title ~~Add SectionDoc and FlexDoc for structured document parsing~~ Enhanced document analysis with Markdown parsing and size statistics Aug 6, 2025

jlevy wants to merge 17 commits into main from feature/extend-chopdiff-section-iteration
