README
Project Name: TreeKG
Project Overview:
The TreeKG project builds a textbook-based knowledge graph, using explicit and implicit methods to generate structured chapter and entity relationship graphs. Following the approach outlined in the Tsinghua University TreeKG paper, the project relies on natural language processing and knowledge graph construction techniques, producing the final final_kg.json file through phased processing steps.
Main steps: The project is divided into two core phases, as follows.

Core Phase 1: Initial Construction (Explicit KG)
Target: Construct a knowledge graph based on the natural hierarchy of the textbook, generating a structured "chapter-entity" skeleton.
Step 1: Text Segmentation
Core logic: Extract the chapter hierarchy from the PDF table of contents using regular expressions, matching "chapter-section-subsection" boundaries to generate hierarchical TOC nodes.
Example regular expressions:
(第\d+章.*\n\n) matches "第1章 电场\n\n" ("Chapter 1 Electric Field") as a level-2 node;
(\d+\.\d+.*\n\n) matches "1.1 电荷\n\n" ("1.1 Charge") as a level-3 node.
Output: A list of TOC nodes representing the hierarchy, each containing fields such as id (e.g., "section_1.1"), title (section name), level (hierarchy depth), and page_start/page_end (PDF page numbers).

Step 2: Bottom-Up Summarization
Core logic: Start generating summaries from the smallest-level nodes (e.g., level-3 sections) and aggregate upward to produce summaries for the higher-level nodes.
LLM Prompt: Generates a 200-300 word summary for each section, ensuring it includes core concepts, key theorems, and domain terminology.
Output: Each TOC node is associated with a summary used for subsequent entity extraction.

Step 3: Entity & Relation Extraction
Core logic: Extract domain entities (name, alias, type, and original description) from the section summaries, then extract relationships between entities based on the summaries and the extracted entities.
Entity extraction: Extract entities and generate the corresponding JSON files.
Relationship extraction: Identify relationships between entities (such as definition, dependency, application) and establish edges for each section.
Output: An entity list and a relationship list, forming the explicit knowledge graph structure.

Step 4: Tree-like Graph Construction
Core logic: Integrate the TOC nodes, entity nodes, and edges from Steps 1-3 into a "tree hierarchy graph" (the explicit KG).
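The regex-based TOC matching in Step 1 can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name and sample text are invented here, and the regexes are lightly repaired versions of the README's examples.

```python
import re

# Patterns follow the README's examples ('.*' matches the rest of the line)
CHAPTER_RE = re.compile(r"(第\d+章.*)\n\n")   # level-2 nodes, e.g. "第1章 电场"
SECTION_RE = re.compile(r"(\d+\.\d+.*)\n\n")  # level-3 nodes, e.g. "1.1 电荷"

def extract_toc_nodes(text: str) -> list[dict]:
    """Build TOC node dicts (id/title/level) from raw TOC text."""
    nodes = []
    for match in CHAPTER_RE.finditer(text):
        title = match.group(1).strip()
        num = re.search(r"\d+", title).group()
        nodes.append({"id": f"section_{num}", "title": title, "level": 2})
    for match in SECTION_RE.finditer(text):
        title = match.group(1).strip()
        num = re.search(r"\d+\.\d+", title).group()
        nodes.append({"id": f"section_{num}", "title": title, "level": 3})
    return nodes

toc = extract_toc_nodes("第1章 电场\n\n1.1 电荷\n\n正文……\n\n")
```

In the real pipeline each node would also carry page_start/page_end from the PDF; those are omitted here for brevity.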
Output the knowledge graph in standard JSON format, as shown in the example below (the entity's category is stored as entity_type so that "type" only marks the node kind):

    {
      "nodes": [
        {"id": "section_1.1", "type": "toc", "level": 3, "title": "电荷", "summary": "..."},
        {"id": "entity_1", "type": "entity", "name": "电荷", "alias": [], "entity_type": "物理概念", "description": "...", "section_id": "section_1.1", ...}
      ],
      "edges": [
        {"source": "section_1", "target": "section_1.1", "type": "has_subsection"},
        {"source": "section_1.1", "target": "entity_1", "type": "has_entity"}
      ]
    }

Core Phase 2: Iterative Expansion (Implicit KG)
Target: Extend the explicit KG into an implicit knowledge graph, mining cross-chapter relationships through predefined operators.
Operator 1: Context-based Convolution
Core objective: Enhance entity descriptions by supplementing them with neighbor context information, improving semantic completeness.
LLM Prompt: Enhances the entity description based on the entity's neighboring entities and relationships.

Operator 2: Entity Aggregation
Core objective: Assign core and non-core roles to entities, simplifying the hierarchical structure.
LLM Prompt: Determines the coreness of each entity (core entity vs. auxiliary entity).

Operator 3: Node Embedding
Core objective: Convert entity descriptions into dense vectors for similarity retrieval and edge prediction.
Tools used: Sentence-BERT model (such as all-MiniLM-L6-v2).

Operator 4: Entity Deduplication
Core objective: Identify entities with different names but the same meaning, and eliminate the redundancy.
Implementation: Shortlist candidates by entity embedding similarity, then confirm via the LLM whether the entities are the same.

Operator 5: Edge Prediction
Core objective: Predict potential relationships between entities and supplement horizontal edges (entity_related).
Implementation: Predict potential relationships by combining factors such as semantic similarity and structural relevance.

Operating steps
Data preparation and models
Place the textbook source file (.docx) in: src/ExplicitKG/output/
Place the BERT model folder in: src/HiddenKG/model/ (e.g. src/HiddenKG/model/bert-base-chinese/, which contains config.json, pytorch_model.bin, vocab.txt, etc.)
Update the API information under src/ExplicitKG/config/
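Operators 3-5 above all hinge on embedding similarity. A minimal, dependency-free sketch of the candidate-pairing logic is shown below; in the real pipeline the vectors come from the Sentence-BERT model, and the threshold, function names, and toy vectors here are purely illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def candidate_pairs(embeddings: dict[str, list[float]], threshold: float = 0.85):
    """Entity pairs whose similarity exceeds the threshold.

    These pairs would then go to the LLM: Operator 4 confirms
    duplicates, Operator 5 proposes entity_related edges.
    """
    ids = sorted(embeddings)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            sim = cosine(embeddings[a], embeddings[b])
            if sim >= threshold:
                pairs.append((a, b, sim))
    return pairs

# Toy 2-D vectors standing in for Sentence-BERT outputs
emb = {
    "entity_1": [1.0, 0.0],
    "entity_2": [0.9, 0.1],   # near-duplicate of entity_1
    "entity_3": [0.0, 1.0],   # unrelated
}
pairs = candidate_pairs(emb)
```

The pairwise loop is O(n²); for a large entity set an approximate-nearest-neighbor index would replace it, but the thresholding idea is the same.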
Run order:
ExplicitKG phase:
Run main.py under ExplicitKG; it handles data preprocessing, entity extraction, and relation extraction.
HiddenKG phase:
Run main.py under HiddenKG; it handles implicit knowledge graph construction, entity deduplication, edge prediction, and related operations, generating the final knowledge graph.
Output: The final knowledge graph file final_kg.json is located in the src/HiddenKG/output folder, in the standard G=(V, E) knowledge graph format, where:
V: contains the TOC nodes and entity nodes.
E: contains the edges between TOC nodes, and between TOC nodes and entity nodes.

Contact information:
Code copyright: Li Ziliang, School of Electronic and Information Engineering, Wuyi University
Contact email: lzl8800@foxmail.com
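As a quick sanity check after a run, the G=(V, E) invariants described above can be verified with a few lines of Python. The path and field names follow this README; the helper function itself is illustrative, not part of the project.

```python
import json

def check_kg(kg: dict) -> None:
    """Assert basic G=(V, E) invariants of a TreeKG output dict."""
    node_ids = {n["id"] for n in kg["nodes"]}
    assert len(node_ids) == len(kg["nodes"]), "duplicate node ids"
    for edge in kg["edges"]:
        # every edge endpoint must refer to an existing node
        assert edge["source"] in node_ids and edge["target"] in node_ids

# Typical usage after the HiddenKG phase:
# with open("src/HiddenKG/output/final_kg.json", encoding="utf-8") as f:
#     check_kg(json.load(f))

# Self-contained demo on a tiny graph in the documented format
demo = {
    "nodes": [
        {"id": "section_1.1", "type": "toc", "level": 3, "title": "电荷"},
        {"id": "entity_1", "type": "entity", "name": "电荷"},
    ],
    "edges": [
        {"source": "section_1.1", "target": "entity_1", "type": "has_entity"},
    ],
}
check_kg(demo)
```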