Mining Knowledge Graphs from Text [Note1]

Part 1 Knowledge Graph Primer

  1. What is a Knowledge Graph?

Knowledge Graph = Entities + Relationships + Attributes
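
As a minimal sketch of this definition, a knowledge graph can be held in plain Python as entities that carry attributes, plus a list of typed relationships between them. The class names (`Entity`, `Relationship`, `KnowledgeGraph`) and the sample facts about "Alice" and "Acme" are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    attributes: dict = field(default_factory=dict)  # attribute name -> value


@dataclass(frozen=True)
class Relationship:
    source: str    # name of the source entity
    relation: str  # relationship type
    target: str    # name of the target entity


@dataclass
class KnowledgeGraph:
    entities: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)

    def add_entity(self, name, **attributes):
        self.entities[name] = Entity(name, dict(attributes))

    def relate(self, source, relation, target):
        # Both endpoints must already be known entities.
        if source not in self.entities or target not in self.entities:
            raise KeyError("unknown entity")
        self.relationships.append(Relationship(source, relation, target))

    def neighbors(self, name):
        # Outgoing (relation, target) pairs of one entity.
        return [(r.relation, r.target)
                for r in self.relationships if r.source == name]


# Hypothetical sample data.
kg = KnowledgeGraph()
kg.add_entity("Acme", founded=1999)
kg.add_entity("Alice", role="engineer")
kg.relate("Alice", "works_for", "Acme")
print(kg.neighbors("Alice"))
```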

Popular Knowledge Graphs (General): Google Knowledge Graph, Microsoft Satori Knowledge Graph
Domain-Specific Knowledge Graphs: Microsoft Academic Graph, LinkedIn Economic Graph, Common Sense Knowledge Graph

  2. Why Are Knowledge Graphs Important?

For Humans:

  • Help organize the world’s information
  • Combat Information Overload
  • Easier for Exploration via Clear Structure
  • Tool for Supporting Business Decisions

For AIs:

  • Key ingredient for many AI tasks
  • Bridge from data to human semantics
  • Use decades of work on graph analysis

Applications:

  • QA/Agents
  • Decision Support
  • Fueling Discovery

  3. Where Do Knowledge Graphs Come From?

  • Structured Text: Wikipedia infoboxes, tables, databases, social networks
  • Unstructured Text: WWW, news, social media, reference articles
  • Images
  • Video: YouTube, video feeds

  4. Knowledge Representation Choices

1) Most knowledge graph implementations use RDF (Resource Description Framework) triples

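RDF is a framework for representing metadata, that is, data that describes data (information about information).
E.g., the content of a book is the book's data, while the author's name, the publisher, and the publisher's address are the book's metadata.
The basic construct of RDF is the statement: a triple consisting of a resource, a property (attribute) of that resource, and the value of that property (i.e., subject, predicate/relation, object). It expresses a data model.
Every described resource has a Uniform Resource Identifier (URI). A URI can be a URL or any other symbol that uniquely identifies an object, such as a phone number, an International Standard Book Number (ISBN), or geographic coordinates.
Properties are also identified by URIs, which prevents the confusion that synonyms would otherwise cause.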
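
The book example above can be sketched as subject-predicate-object triples in plain Python. This is only an illustration of the data model, not a real RDF library. The `http://example.org/` namespace, the property names, and the sample values are all assumptions made up for the sketch.

```python
from collections import namedtuple

# One RDF-style statement: subject, predicate, object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

EX = "http://example.org/"  # hypothetical namespace, for illustration only

# Resources and properties are named by URIs; literal values are plain strings.
triples = [
    Triple(EX + "book/1", EX + "author", "Jane Doe"),
    Triple(EX + "book/1", EX + "publisher", EX + "publisher/acme"),
    Triple(EX + "publisher/acme", EX + "address", "1 Example Street"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    ]


# All metadata statements whose subject is the book.
book_metadata = match(triples, subject=EX + "book/1")
for t in book_metadata:
    print(t.predicate, "->", t.object)
```

Because the publisher is itself a resource with a URI, the second triple links to further statements about it, which is what turns a flat list of statements into a graph.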

2) ABox (assertions) versus TBox (terminology)

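The TBox contains assertions about concepts (terminology); the ABox contains assertions about individuals.
The TBox declares subsumption (inclusion) relationships among concepts and among roles, while the ABox is the set of instance assertions about individuals. These assertions state either that an individual is an instance of a concept or that a binary relation holds between two individuals.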
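
To make the split concrete, here is a small sketch that keeps terminology and assertions in two separate sets and uses the TBox to answer an instance check against the ABox. The concept names, individuals, and the helper functions `superclasses` and `is_instance` are hypothetical.

```python
# TBox: terminology, here only concept inclusion (subClassOf).
tbox = {
    ("Student", "subClassOf", "Person"),
    ("Professor", "subClassOf", "Person"),
}

# ABox: assertions about individuals.
abox = {
    ("alice", "type", "Student"),      # concept assertion
    ("bob", "type", "Professor"),      # concept assertion
    ("bob", "advises", "alice"),       # role (binary relation) assertion
}


def superclasses(cls, tbox):
    """All concepts that subsume cls, following subClassOf transitively."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in tbox:
            if p == "subClassOf" and s == current and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def is_instance(individual, cls, abox, tbox):
    """True if the ABox asserts the membership or the TBox implies it."""
    for s, p, o in abox:
        if s == individual and p == "type":
            if o == cls or cls in superclasses(o, tbox):
                return True
    return False


print(is_instance("alice", "Person", abox, tbox))     # implied via the TBox
print(is_instance("alice", "Professor", abox, tbox))  # not asserted or implied
```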

3) Common ontological primitives

  • rdfs:domain, rdfs:range, rdf:type, rdfs:subClassOf, rdfs:subPropertyOf, …
  • owl:inverseOf, owl:TransitiveProperty, owl:FunctionalProperty, …

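RDF is domain-independent, whereas RDFS (RDF Schema) can be used to define the terms and concepts of a particular application domain.
However, both RDF and RDFS can only express binary predicates (predicates that connect two objects), which is not enough to support complex applications on the web. The W3C therefore developed the Web Ontology Language (OWL). OWL extends RDF, shares its syntactic structure, and can define relationships between vocabulary terms, between classes, between properties, and so on.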
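
The effect of a few of these primitives can be sketched as a naive forward-chaining loop over triples. This is a simplified illustration, not a complete RDFS or OWL reasoner: it covers only rdfs:domain, rdfs:range, rdfs:subPropertyOf, owl:inverseOf (applied in one direction), and owl:TransitiveProperty. The `ex:` names and the `infer` function are hypothetical.

```python
# Schema-level statements (what the primitives say about properties).
schema = {
    ("ex:hasParent", "rdfs:domain", "ex:Person"),
    ("ex:hasParent", "rdfs:range", "ex:Person"),
    ("ex:hasParent", "rdfs:subPropertyOf", "ex:hasAncestor"),
    ("ex:hasParent", "owl:inverseOf", "ex:hasChild"),
    ("ex:hasAncestor", "rdf:type", "owl:TransitiveProperty"),
}

# Instance-level statements.
data = {
    ("ex:carol", "ex:hasParent", "ex:bob"),
    ("ex:bob", "ex:hasParent", "ex:alice"),
}


def infer(schema, data):
    """Apply the rules repeatedly until no new triple is produced."""
    facts = set(data)
    while True:
        new = set()
        for s, p, o in facts:
            for prop, rule, value in schema:
                if prop != p:
                    continue
                if rule == "rdfs:domain":
                    new.add((s, "rdf:type", value))   # subject has the domain type
                elif rule == "rdfs:range":
                    new.add((o, "rdf:type", value))   # object has the range type
                elif rule == "rdfs:subPropertyOf":
                    new.add((s, value, o))            # also holds for the super-property
                elif rule == "owl:inverseOf":
                    new.add((o, value, s))            # swap subject and object
                elif rule == "rdf:type" and value == "owl:TransitiveProperty":
                    for s2, p2, o2 in facts:
                        if p2 == p and s2 == o:
                            new.add((s, p, o2))       # chain two hops into one
        if new <= facts:
            return facts
        facts |= new


facts = infer(schema, data)
for triple in sorted(facts - data):
    print(triple)
```

Note that ex:hasParent itself is not declared transitive, so the loop derives that carol has alice as an ancestor but not as a parent.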

4) Semantic Web
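Standards for defining and exchanging knowledge.
Annotated data provide a critical resource for automation.
Major weakness: annotate everything?
Annotated data can serve as a key resource for automated processing, but this is also where the approach is weak: given the large volume of non-standardized semantic expressions, does all data really have to be annotated?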

5) Information Extraction from Text (will be illustrated in Part 2)
Answer to the knowledge acquisition bottleneck
Many challenges:
chunking, polysemy/word sense disambiguation, entity coreference, relation extraction

Ref: 【ReadingNotes】知识图谱导学 Knowledge Graph Tutorial - Part 1