README
Project Name: TreeKG
Project Overview:
The TreeKG project builds a textbook-based knowledge graph, using explicit and implicit methods to generate structured chapter and entity relationship graphs. Following the approach outlined in the Tsinghua University TreeKG paper, the project relies on natural language processing and knowledge graph construction techniques, producing the final final_kg.json file through phased processing steps.
Main steps: The project is divided into two core phases, as follows.

Core Phase 1: Initial Construction (Explicit KG)
Target: Construct a knowledge graph based on the natural hierarchy of the textbook, generating a structured "chapter-entity" skeleton.
Step 1: Text Segmentation
Core logic: Extract the chapter hierarchy from the PDF table of contents using regular expressions, matching "chapter-section-subsection" boundaries to generate hierarchical TOC nodes.
Example regular expressions:
(第\d+章.*\n\n) matches "第1章 电场\n\n" ("Chapter 1 Electric Field") as a level-2 node;
(\d+\.\d+.*\n\n) matches "1.1 电荷\n\n" ("1.1 Charge") as a level-3 node.
Output: A list of TOC nodes representing the hierarchy, each containing fields such as id (e.g., "section_1.1"), title (section name), level (hierarchy depth), and page_start/page_end (PDF page numbers).

Step 2: Bottom-Up Summarization
Core logic: Start generating summaries from the smallest-level nodes (e.g., level-3 sections) and aggregate upward to produce summaries for the higher-level nodes.
LLM Prompt: Generates a 200-300 word summary for each section, ensuring it includes core concepts, key theorems, and domain terminology.
Output: Each TOC node is associated with a summary used for subsequent entity extraction.

Step 3: Entity & Relation Extraction
Core logic: Extract domain entities (name, alias, type, and original description) from the section summaries, then extract relationships between entities based on the summaries and the extracted entities.
Entity extraction: Extract entities and generate the corresponding JSON files.
Relationship extraction: Identify relationships between entities (such as definition, dependency, application) and establish edges for each section.
Output: An entity list and a relationship list, forming the explicit knowledge graph structure.

Step 4: Tree-like Graph Construction
Core logic: Integrate the TOC nodes, entity nodes, and edges from Steps 1-3 into a "tree hierarchy graph" (the explicit KG).
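The regex-based TOC matching in Step 1 can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name and sample text are invented here, and the regexes are lightly repaired versions of the README's examples.

```python
import re

# Patterns follow the README's examples ('.*' matches the rest of the line)
CHAPTER_RE = re.compile(r"(第\d+章.*)\n\n")   # level-2 nodes, e.g. "第1章 电场"
SECTION_RE = re.compile(r"(\d+\.\d+.*)\n\n")  # level-3 nodes, e.g. "1.1 电荷"

def extract_toc_nodes(text: str) -> list[dict]:
    """Build TOC node dicts (id/title/level) from raw TOC text."""
    nodes = []
    for match in CHAPTER_RE.finditer(text):
        title = match.group(1).strip()
        num = re.search(r"\d+", title).group()
        nodes.append({"id": f"section_{num}", "title": title, "level": 2})
    for match in SECTION_RE.finditer(text):
        title = match.group(1).strip()
        num = re.search(r"\d+\.\d+", title).group()
        nodes.append({"id": f"section_{num}", "title": title, "level": 3})
    return nodes

toc = extract_toc_nodes("第1章 电场\n\n1.1 电荷\n\n正文……\n\n")
```

In the real pipeline each node would also carry page_start/page_end from the PDF; those are omitted here for brevity.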
Output the knowledge graph in standard JSON format, as shown in the example below (the entity's category is stored as entity_type so that "type" only marks the node kind):

    {
      "nodes": [
        {"id": "section_1.1", "type": "toc", "level": 3, "title": "电荷", "summary": "..."},
        {"id": "entity_1", "type": "entity", "name": "电荷", "alias": [], "entity_type": "物理概念", "description": "...", "section_id": "section_1.1", ...}
      ],
      "edges": [
        {"source": "section_1", "target": "section_1.1", "type": "has_subsection"},
        {"source": "section_1.1", "target": "entity_1", "type": "has_entity"}
      ]
    }

Core Phase 2: Iterative Expansion (Implicit KG)
Target: Extend the explicit KG into an implicit knowledge graph, mining cross-chapter relationships through predefined operators.
Operator 1: Context-based Convolution
Core objective: Enhance entity descriptions by supplementing them with neighbor context information, improving semantic completeness.
LLM Prompt: Enhances the entity description based on the entity's neighboring entities and relationships.

Operator 2: Entity Aggregation
Core objective: Assign core and non-core roles to entities, simplifying the hierarchical structure.
LLM Prompt: Determines the coreness of each entity (core entity vs. auxiliary entity).

Operator 3: Node Embedding
Core objective: Convert entity descriptions into dense vectors for similarity retrieval and edge prediction.
Tools used: Sentence-BERT model (such as all-MiniLM-L6-v2).

Operator 4: Entity Deduplication
Core objective: Identify entities with different names but the same meaning, and eliminate the redundancy.
Implementation: Shortlist candidates by entity embedding similarity, then confirm via the LLM whether the entities are the same.

Operator 5: Edge Prediction
Core objective: Predict potential relationships between entities and supplement horizontal edges (entity_related).
Implementation: Predict potential relationships by combining factors such as semantic similarity and structural relevance.

Operating steps
Data preparation and models
Place the textbook source file (.docx) in: src/ExplicitKG/output/
Place the BERT model folder in: src/HiddenKG/model/ (e.g. src/HiddenKG/model/bert-base-chinese/, which contains config.json, pytorch_model.bin, vocab.txt, etc.)
Update the API information under src/ExplicitKG/config/
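Operators 3-5 above all hinge on embedding similarity. A minimal, dependency-free sketch of the candidate-pairing logic is shown below; in the real pipeline the vectors come from the Sentence-BERT model, and the threshold, function names, and toy vectors here are purely illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def candidate_pairs(embeddings: dict[str, list[float]], threshold: float = 0.85):
    """Entity pairs whose similarity exceeds the threshold.

    These pairs would then go to the LLM: Operator 4 confirms
    duplicates, Operator 5 proposes entity_related edges.
    """
    ids = sorted(embeddings)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            sim = cosine(embeddings[a], embeddings[b])
            if sim >= threshold:
                pairs.append((a, b, sim))
    return pairs

# Toy 2-D vectors standing in for Sentence-BERT outputs
emb = {
    "entity_1": [1.0, 0.0],
    "entity_2": [0.9, 0.1],   # near-duplicate of entity_1
    "entity_3": [0.0, 1.0],   # unrelated
}
pairs = candidate_pairs(emb)
```

The pairwise loop is O(n²); for a large entity set an approximate-nearest-neighbor index would replace it, but the thresholding idea is the same.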
Run order:
ExplicitKG phase:
Run main.py under ExplicitKG; it handles data preprocessing, entity extraction, and relation extraction.
HiddenKG phase:
Run main.py under HiddenKG; it handles implicit knowledge graph construction, entity deduplication, edge prediction, and related operations, generating the final knowledge graph.
Output: The final knowledge graph file final_kg.json is located in the src/HiddenKG/output folder, in the standard G=(V, E) knowledge graph format, where:
V: contains the TOC nodes and entity nodes.
E: contains the edges between TOC nodes, and between TOC nodes and entity nodes.

Contact information:
Code copyright: Li Ziliang, School of Electronic and Information Engineering, Wuyi University
Contact email: lzl8800@foxmail.com
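As a quick sanity check after a run, the G=(V, E) invariants described above can be verified with a few lines of Python. The path and field names follow this README; the helper function itself is illustrative, not part of the project.

```python
import json

def check_kg(kg: dict) -> None:
    """Assert basic G=(V, E) invariants of a TreeKG output dict."""
    node_ids = {n["id"] for n in kg["nodes"]}
    assert len(node_ids) == len(kg["nodes"]), "duplicate node ids"
    for edge in kg["edges"]:
        # every edge endpoint must refer to an existing node
        assert edge["source"] in node_ids and edge["target"] in node_ids

# Typical usage after the HiddenKG phase:
# with open("src/HiddenKG/output/final_kg.json", encoding="utf-8") as f:
#     check_kg(json.load(f))

# Self-contained demo on a tiny graph in the documented format
demo = {
    "nodes": [
        {"id": "section_1.1", "type": "toc", "level": 3, "title": "电荷"},
        {"id": "entity_1", "type": "entity", "name": "电荷"},
    ],
    "edges": [
        {"source": "section_1.1", "target": "entity_1", "type": "has_entity"},
    ],
}
check_kg(demo)
```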