Knowledge Graphs

<div style="background-color: #4B0082; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
{{BloomIntro}}
Knowledge Graphs (KGs) are structured representations of information in which entities (people, places, concepts, organizations) are nodes and the relationships between them are typed, directed edges. Where tables store attributes of individual entities, knowledge graphs store the web of relationships between entities — enabling reasoning, inference, and navigation across connected knowledge. Major knowledge graphs include Google's Knowledge Graph (powering search results), Wikidata, DBpedia, and numerous proprietary enterprise knowledge graphs. In the AI era, knowledge graphs are experiencing a renaissance as a complement to neural AI — providing structured, verifiable, interpretable knowledge to ground language model outputs.
</div>

__TOC__

<div style="background-color: #000080; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Entity''' — A distinct real-world object or concept represented as a node: a person (Albert Einstein), an organization (NASA), a concept (Relativity).
* '''Relationship (predicate)''' — A typed, directed connection between two entities: "bornIn," "worksFor," "isA," "hasCapital."
* '''Triple''' — The fundamental unit of a knowledge graph: (Subject, Predicate, Object). Example: (Albert_Einstein, bornIn, Ulm).
* '''RDF (Resource Description Framework)''' — A W3C standard for representing knowledge graph triples using URIs.
* '''SPARQL''' — The query language for RDF knowledge graphs, analogous to SQL for relational databases.
* '''Ontology''' — A formal specification of concepts, categories, and relationships within a domain; defines the schema of a knowledge graph.
* '''OWL (Web Ontology Language)''' — A W3C language for defining ontologies with rich semantic constraints.
* '''Property Graph''' — An alternative KG model where nodes and edges can have attributes (key-value pairs). Used in Neo4j.
* '''Cypher''' — A declarative query language for property graphs, originally created for Neo4j.
* '''Knowledge Graph Embedding''' — Representing entities and relations as vectors in a continuous space for machine learning over KGs.
* '''Link prediction''' — The task of inferring missing relationships in a knowledge graph from existing ones.
* '''Entity alignment''' — Matching entities across two different knowledge graphs that refer to the same real-world object.
* '''Named Entity Recognition (NER)''' — The NLP task of identifying entities in text; first step in knowledge graph construction from text.
* '''Relation extraction''' — Identifying the relationship between two named entities in text; used in automated KG construction.
* '''Wikidata''' — A free, multilingual, community-maintained knowledge graph with more than 100 million items and over a billion statements.
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Understanding</span> ==
Knowledge graphs represent knowledge as a directed, typed multigraph. Unlike a relational database that describes each entity's attributes in rows and columns, a KG describes the '''web of relationships''' — enabling multi-hop reasoning that relational databases make awkward.

Example: "What are the birthplaces of Nobel Prize winners in Physics who studied at German universities?"

In a relational database, this requires multiple JOINs across tables. In a knowledge graph, it's a graph traversal: find Nobel Physics winners → follow "studiedAt" to universities → filter to Germany → follow "bornIn" to places.
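
The traversal above can be sketched without any graph database. The following is a minimal illustration, assuming a tiny in-memory list of triples; the predicate names (<code>wonAward</code>, <code>studiedAt</code>, <code>locatedIn</code>, <code>bornIn</code>) and the helper functions are chosen for this example and are not part of any standard vocabulary.

<syntaxhighlight lang="python">
from collections import defaultdict

# Toy data for illustration; predicate names are assumptions of this sketch.
TRIPLES = [
    ("Max_Planck", "wonAward", "Nobel_Prize_in_Physics"),
    ("Max_Planck", "studiedAt", "LMU_Munich"),
    ("Max_Planck", "bornIn", "Kiel"),
    ("Marie_Curie", "wonAward", "Nobel_Prize_in_Physics"),
    ("Marie_Curie", "studiedAt", "University_of_Paris"),
    ("Marie_Curie", "bornIn", "Warsaw"),
    ("LMU_Munich", "locatedIn", "Germany"),
    ("University_of_Paris", "locatedIn", "France"),
]


def build_index(triples):
    """Index objects by (subject, predicate) and subjects by (predicate, object)."""
    forward = defaultdict(set)
    backward = defaultdict(set)
    for s, p, o in triples:
        forward[(s, p)].add(o)
        backward[(p, o)].add(s)
    return forward, backward


def birthplaces_of_physics_laureates(triples, country):
    """Follow wonAward -> studiedAt -> locatedIn, then read bornIn."""
    forward, backward = build_index(triples)
    results = {}
    for person in backward[("wonAward", "Nobel_Prize_in_Physics")]:
        universities = forward[(person, "studiedAt")]
        # Keep the person if any of their universities is in the country.
        if any(country in forward[(u, "locatedIn")] for u in universities):
            results[person] = sorted(forward[(person, "bornIn")])
    return results


print(birthplaces_of_physics_laureates(TRIPLES, "Germany"))
</syntaxhighlight>

Each hop is a dictionary lookup keyed by entity and predicate, which is why adding another hop does not require restructuring the query the way an extra JOIN would.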

'''The Open World Assumption''': A key semantic difference from relational databases. Under the open world assumption, the absence of a triple does not mean the fact is false; it may simply not be recorded. Relational databases usually operate under the closed world assumption, where a fact that is not stored is treated as false.
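
The difference can be made concrete with a small sketch. It assumes a set of stored triples and two hypothetical lookup functions, one per assumption; <code>None</code> stands for "unknown".

<syntaxhighlight lang="python">
from typing import Optional

# The only fact recorded in this toy store.
FACTS = {
    ("Albert_Einstein", "bornIn", "Ulm"),
}


def holds_closed_world(triple) -> bool:
    """Closed world: anything not stored is treated as false."""
    return triple in FACTS


def holds_open_world(triple) -> Optional[bool]:
    """Open world: a missing triple is unknown (None), not false."""
    return True if triple in FACTS else None


known = ("Albert_Einstein", "bornIn", "Ulm")
missing = ("Albert_Einstein", "worksFor", "ETH_Zurich")

print(holds_closed_world(missing))  # stored facts only, so this is False
print(holds_open_world(missing))    # not recorded, so the answer is None
</syntaxhighlight>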

'''Knowledge Graph Embeddings''' (TransE, RotatE, ComplEx) learn dense vector representations of entities and relations, enabling:
* Link prediction: can we predict missing triples?
* Similarity computation: are these entities similar?
* KG completion: enrich an incomplete KG using learned patterns

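TransE (a foundational embedding method) represents each relation as a translation in embedding space: h + r ≈ t for each true triple (h, r, t), where h, r, and t are the vectors of the head entity, the relation, and the tail entity. For example, the vector for Paris plus the vector for locatedIn should land close to the vector for France.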
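
A minimal sketch of the TransE scoring idea follows. The two-dimensional vectors are hand-picked for illustration (real embeddings are learned from data and have many more dimensions), and the function names are assumptions of this example.

<syntaxhighlight lang="python">
import math

# Hand-picked toy vectors; in practice these are learned during training.
EMB = {
    "Paris": (1.0, 2.0),
    "Berlin": (4.0, 1.0),
    "France": (2.0, 4.0),
    "Germany": (5.0, 3.0),
}
REL = {"locatedIn": (1.0, 2.0)}


def transe_distance(h, r, t):
    """Euclidean distance between h + r and t; smaller means more plausible."""
    hv, rv, tv = EMB[h], REL[r], EMB[t]
    return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(hv, rv, tv)))


def rank_tails(h, r):
    """Order all entities as candidate tails for (h, r, ?), best first."""
    return sorted(EMB, key=lambda t: transe_distance(h, r, t))


print(rank_tails("Paris", "locatedIn"))
print(rank_tails("Berlin", "locatedIn"))
</syntaxhighlight>

Link prediction with TransE amounts to this ranking step: every candidate tail is scored and the closest ones are proposed as missing triples.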

'''Symbolic vs. neural AI''': Knowledge graphs are a form of '''symbolic AI''' — explicit, interpretable, structured representation. Neural models (LLMs) are statistical learners of implicit patterns. The combination — neuro-symbolic AI — is a growing research direction. RAG with a knowledge graph (GraphRAG) retrieves structured facts rather than unstructured text chunks, enabling more precise and verifiable grounding.
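
The retrieval step of knowledge-graph-based RAG can be sketched as follows. This is a simplified illustration, not the method of any particular GraphRAG system: entity linking is replaced by naive substring matching, and <code>retrieve_facts</code> and <code>build_prompt</code> are hypothetical helper names.

<syntaxhighlight lang="python">
TRIPLES = [
    ("Geoffrey_Hinton", "workedAt", "University_of_Toronto"),
    ("Geoffrey_Hinton", "received", "Nobel_Prize_in_Physics"),
    ("Albert_Einstein", "bornIn", "Ulm"),
]


def retrieve_facts(question, triples):
    """Return triples whose subject or object is mentioned in the question."""
    q = question.lower()
    hits = []
    for s, p, o in triples:
        # Naive matching stands in for a real entity-linking component.
        if s.replace("_", " ").lower() in q or o.replace("_", " ").lower() in q:
            hits.append((s, p, o))
    return hits


def build_prompt(question, triples):
    """Prepend the retrieved facts to the question as grounding context."""
    lines = [f"- {s} {p} {o}" for s, p, o in retrieve_facts(question, triples)]
    return "Facts:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"


print(build_prompt("Where was Albert Einstein born?", TRIPLES))
</syntaxhighlight>

Because each retrieved fact is an explicit triple, an answer can be traced back to the specific statements that supported it.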
</div>

<div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Applying</span> ==
'''Building and querying a knowledge graph with Neo4j (Python driver):'''

<syntaxhighlight lang="python">
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                               auth=("neo4j", "password"))

def create_knowledge_graph(session):
    """Create a simple research knowledge graph."""
    # MERGE every node on its own before merging relationships. MERGE on a
    # whole pattern creates all unbound parts when the full pattern is absent,
    # so merging (c)-[:ENABLES]->(deep {...}) directly would add a second
    # 'Deep Learning' node if one already existed without that relationship.
    session.run("""
        MERGE (p:Person {name: 'Geoffrey Hinton'})
        MERGE (u:University {name: 'University of Toronto'})
        MERGE (a:Award {name: 'Nobel Prize in Physics'})
        MERGE (c:Concept {name: 'Backpropagation'})
        MERGE (deep:Concept {name: 'Deep Learning'})
        MERGE (p)-[:WORKED_AT {from: 1987, to: 2023}]->(u)
        MERGE (p)-[:RECEIVED {year: 2024}]->(a)
        MERGE (p)-[:PIONEERED]->(c)
        MERGE (c)-[:ENABLES]->(deep)
    """)

def find_award_winners_and_contributions(session):
    """Find Nobel Prize winners and what they pioneered."""
    result = session.run("""
        MATCH (p:Person)-[:RECEIVED]->(a:Award {name: 'Nobel Prize in Physics'})
        MATCH (p)-[:PIONEERED]->(c:Concept)
        RETURN p.name AS person, c.name AS contribution
        ORDER BY p.name
    """)
    return [dict(record) for record in result]

def multi_hop_query(session, concept_name):
    """Find all researchers connected to a concept within 2 hops."""
    # ENABLES is included so the second hop can go from one concept to
    # another (for example Backpropagation -> Deep Learning).
    result = session.run("""
        MATCH path = (p:Person)-[:PIONEERED|CONTRIBUTED_TO|ENABLES*1..2]->(c:Concept)
        WHERE c.name CONTAINS $concept
        RETURN p.name AS researcher, [node in nodes(path) | node.name] AS path_names
        LIMIT 10
    """, concept=concept_name)
    return [dict(record) for record in result]

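try:
    with driver.session() as session:
        create_knowledge_graph(session)
        winners = find_award_winners_and_contributions(session)
        print(winners)
        print(multi_hop_query(session, "Deep Learning"))
finally:
    driver.close()  # release the driver's connection pool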
</syntaxhighlight>

; Knowledge graph construction approaches
: '''Manual curation''' → Domain experts curate high-precision triples (medical ontologies, legal KGs)
: '''Crowd-sourced''' → Wikidata model: community contributes and validates
: '''Information extraction''' → NER + relation extraction from text corpora (automated, noisy)
: '''Web scraping + structured sources''' → Infoboxes, tables, linked data sources (DBpedia and Freebase model)
: '''Hybrid''' → Automated extraction + expert curation + community correction
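
The hybrid approach above can be sketched as a routing step that sends automatically extracted triples to one of three outcomes based on extractor confidence. The thresholds (0.9 and 0.5), the function name <code>route_triples</code>, and the candidate scores are assumptions chosen for illustration.

<syntaxhighlight lang="python">
def route_triples(candidates, accept_at=0.9, review_at=0.5):
    """Split scored candidate triples into accepted, review, and rejected."""
    best = {}
    for triple, confidence in candidates:
        # Deduplicate: keep the highest confidence seen for each triple.
        best[triple] = max(confidence, best.get(triple, 0.0))

    routed = {"accepted": [], "review": [], "rejected": []}
    for triple, confidence in best.items():
        if confidence >= accept_at:
            routed["accepted"].append(triple)   # added automatically
        elif confidence >= review_at:
            routed["review"].append(triple)     # queued for expert curation
        else:
            routed["rejected"].append(triple)   # too noisy to keep
    return routed


# Hypothetical extractor output: (triple, confidence) pairs.
candidates = [
    (("Geoffrey_Hinton", "workedAt", "University_of_Toronto"), 0.97),
    (("Geoffrey_Hinton", "workedAt", "University_of_Toronto"), 0.62),
    (("Backpropagation", "enables", "Deep_Learning"), 0.71),
    (("Toronto", "bornIn", "Geoffrey_Hinton"), 0.12),
]
print(route_triples(candidates))
</syntaxhighlight>

Raising <code>accept_at</code> trades coverage for precision, since more triples wait for human review before entering the graph.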
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Analyzing</span> ==
{| class="wikitable"
|+ Knowledge Representation Comparison
! Approach !! Structure !! Reasoning !! Scalability !! Interpretability
|-
| Relational database || Tables, rows || SQL (closed world) || Very high || High
|-
| Knowledge graph (RDF) || Triples || SPARQL, inference rules || High || High
|-
| Property graph (Neo4j) || Nodes + edges with properties || Cypher, path queries || High || High
|-
| Vector embeddings alone || Dense vectors (relations implicit) || Neural similarity || Very high || Low
|-
| Knowledge graph + embeddings || Hybrid || Symbolic + neural || High || Medium
|}

'''Key challenges and failure modes:'''
* '''Incompleteness''' — Even the largest knowledge graphs (Wikidata has more than 100 million items) are dramatically incomplete. Most entity-relation-entity combinations that are true are not represented.
* '''Inconsistency''' — Different sources record conflicting information. Conflict resolution and provenance tracking are essential but difficult.
* '''Coverage-precision trade-off''' — Manual curation is precise but slow and incomplete; automated extraction has high recall but introduces errors.
* '''Schema evolution''' — As understanding of a domain evolves, the ontology needs updating, which can invalidate existing triples.
* '''Entity ambiguity''' — "Apple" could be the company, the fruit, or countless others. Entity linking (mapping text mentions to KG entities) is difficult and error-prone.
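
The inconsistency problem above can be illustrated with a small conflict detector. It is a sketch under one assumption: certain predicates (here <code>bornIn</code> and <code>hasCapital</code>) are treated as single-valued, so two different objects for the same subject are flagged together with the sources that asserted them. The source labels are hypothetical.

<syntaxhighlight lang="python">
from collections import defaultdict

# Predicates assumed to allow only one object per subject in this sketch.
FUNCTIONAL = {"bornIn", "hasCapital"}


def find_conflicts(statements):
    """Group sources by asserted value and return subjects with >1 value.

    statements: iterable of (subject, predicate, object, source) tuples.
    """
    values = defaultdict(dict)  # (subject, predicate) -> {object: [sources]}
    for s, p, o, source in statements:
        if p in FUNCTIONAL:
            values[(s, p)].setdefault(o, []).append(source)
    return {key: objs for key, objs in values.items() if len(objs) > 1}


statements = [
    ("Albert_Einstein", "bornIn", "Ulm", "source_a"),
    ("Albert_Einstein", "bornIn", "Munich", "source_b"),
    ("Albert_Einstein", "bornIn", "Ulm", "source_c"),
    ("France", "hasCapital", "Paris", "source_a"),
]
print(find_conflicts(statements))
</syntaxhighlight>

Keeping the source alongside each statement is what makes resolution possible: a curator or a voting rule can weigh the conflicting values by how trusted their sources are.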
</div>

<div style="background-color: #483D8B; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Evaluating</span> ==
Expert evaluation of knowledge graphs is multi-dimensional:

'''Factual accuracy''': Randomly sample triples and verify against authoritative sources. Precision is the primary metric for knowledge graphs serving downstream systems.
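
A sampled precision estimate might look like the sketch below. It assumes a caller-supplied <code>is_correct</code> check (in practice a human judgment against an authoritative source) and reports a 95% Wilson score interval, which is one common choice for proportions rather than the only option.

<syntaxhighlight lang="python">
import math
import random


def estimate_precision(triples, is_correct, sample_size, seed=0):
    """Estimate precision from a random sample, with a 95% Wilson interval."""
    if not triples:
        raise ValueError("no triples to sample from")
    rng = random.Random(seed)  # fixed seed keeps the audit reproducible
    sample = rng.sample(triples, min(sample_size, len(triples)))
    n = len(sample)
    p = sum(1 for t in sample if is_correct(t)) / n

    z = 1.96  # normal quantile for a 95% interval
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, (centre - half, centre + half)


# Synthetic data: 100 triples, of which the first 90 count as verified.
triples = [(f"entity_{i}", "relatedTo", f"entity_{i + 1}") for i in range(100)]
verified = set(triples[:90])

precision, (low, high) = estimate_precision(
    triples, lambda t: t in verified, sample_size=50
)
print(f"precision={precision:.2f}, 95% interval=({low:.2f}, {high:.2f})")
</syntaxhighlight>

The interval width shows how much the estimate can be trusted: a small sample gives a wide interval, which argues for sampling more triples before reporting a single precision figure.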

'''Coverage / recall''': For a specific domain, what fraction of known true facts are represented? Measured by comparing against a held-out set of verified triples.

'''Link prediction benchmarks''': FB15k-237 (Freebase subset) and WN18RR (WordNet subset) are standard benchmarks for evaluating knowledge graph embedding methods. Metrics: Mean Reciprocal Rank (MRR), Hits@10.
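
Both metrics are computed from the rank that a model assigns to the correct entity for each test triple. The sketch below assumes lower scores are better (as with distance-based models) and uses made-up ranks; the function names are illustrative.

<syntaxhighlight lang="python">
def rank_of(true_entity, distances):
    """1-based rank of the correct entity; lower distance ranks higher."""
    true_distance = distances[true_entity]
    return 1 + sum(1 for d in distances.values() if d < true_distance)


def mrr_and_hits(ranks, k=10):
    """Mean Reciprocal Rank and Hits@k over a list of 1-based ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits


# Ranks of the correct answer for four hypothetical test triples.
ranks = [1, 2, 4, 20]
mrr, hits_at_10 = mrr_and_hits(ranks, k=10)
print(f"MRR={mrr:.3f}, Hits@10={hits_at_10:.2f}")
</syntaxhighlight>

MRR rewards placing the correct entity near the top, while Hits@10 only asks whether it appears among the first ten candidates, so the two can disagree about which model is better.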

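'''Query performance''': For production KGs, measure query latency for typical query patterns, for example the 95th-percentile (p95) execution time of SPARQL or Cypher queries. Neo4j and other property graph databases provide query profiling tools (in Cypher, the <code>EXPLAIN</code> and <code>PROFILE</code> prefixes).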

'''Downstream task impact''': Does using the KG improve performance on the target application (question answering, recommendation, entity disambiguation)? This is the ultimate measure of KG quality.

Expert practitioners also evaluate '''provenance and freshness''': For each triple, is its source known and trusted? How recently was it validated? Temporal knowledge graphs additionally track when facts were true, enabling time-sensitive queries.
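
A temporal knowledge graph can be sketched by attaching a validity interval to each triple. The entities below are hypothetical placeholders, and the field names <code>valid_from</code> and <code>valid_to</code> are assumptions of this example rather than a standard.

<syntaxhighlight lang="python">
from typing import NamedTuple, Optional


class TemporalFact(NamedTuple):
    subject: str
    predicate: str
    obj: str
    valid_from: int            # first year the fact held
    valid_to: Optional[int]    # last year it held; None means still valid


# Hypothetical facts for illustration.
FACTS = [
    TemporalFact("Person_A", "worksFor", "Org_X", 2001, 2010),
    TemporalFact("Person_A", "worksFor", "Org_Y", 2011, None),
]


def facts_at(facts, year):
    """Return the facts that were true in the given year."""
    return [
        f for f in facts
        if f.valid_from <= year and (f.valid_to is None or year <= f.valid_to)
    ]


print([f.obj for f in facts_at(FACTS, 2005)])
print([f.obj for f in facts_at(FACTS, 2020)])
</syntaxhighlight>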
</div>

<div style="background-color: #2F4F4F; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a domain knowledge graph from scratch:

'''1. Domain scoping and ontology design'''
<syntaxhighlight lang="text">
Define scope: What entities matter? What relationships?
    ↓
Review existing ontologies: Schema.org, biomedical ontologies (SNOMED, MeSH), industry standards
    ↓
Design ontology: entity types, relation types, cardinality constraints
    ↓
Define naming conventions and URI scheme
    ↓
Validate with domain experts: are these the right concepts?
</syntaxhighlight>
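
Once entity types and relation types are defined, incoming triples can be checked against them. The sketch below assumes a very small schema that maps each relation type to an expected subject type and object type; the dictionaries and the <code>validate</code> function are illustrative, not an ontology language.

<syntaxhighlight lang="python">
# Assumed toy ontology: entity types and, per relation, (subject type, object type).
ENTITY_TYPES = {
    "Geoffrey_Hinton": "Person",
    "University_of_Toronto": "University",
    "Backpropagation": "Concept",
}
SCHEMA = {
    "WORKED_AT": ("Person", "University"),
    "PIONEERED": ("Person", "Concept"),
}


def validate(triple):
    """Return a list of schema violations for one triple (empty if valid)."""
    s, p, o = triple
    if p not in SCHEMA:
        return [f"unknown relation type: {p}"]
    subject_type, object_type = SCHEMA[p]
    errors = []
    if ENTITY_TYPES.get(s) != subject_type:
        errors.append(f"{s} is not a {subject_type}")
    if ENTITY_TYPES.get(o) != object_type:
        errors.append(f"{o} is not a {object_type}")
    return errors


print(validate(("Geoffrey_Hinton", "WORKED_AT", "University_of_Toronto")))
print(validate(("Backpropagation", "WORKED_AT", "University_of_Toronto")))
</syntaxhighlight>

Running such checks at ingestion time catches many extraction errors (for example a swapped subject and object) before they reach the graph.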

'''2. Knowledge acquisition pipeline'''
<syntaxhighlight lang="text">
Structured sources (databases, APIs, spreadsheets)
    ↓ [Direct mapping to triples]

Unstructured sources (text documents, web pages)
    ↓ [NER → entity linking → relation extraction → triple extraction]
    ↓ [Confidence scoring: filter low-confidence triples]
    ↓ [Human validation of uncertain triples]

Semi-structured sources (tables, infoboxes)
    ↓ [Table understanding + header interpretation]

All sources
    ↓ [Deduplication + entity alignment]
    ↓ [Knowledge base population]
</syntaxhighlight>
</div>