Knowledge Graph Review Outline
Week 1
Intro
What is a KG?
A KG is knowledge in graph form: it captures entities, attributes, and relationships. Nodes are entities, labeled with their attributes; an edge between two nodes captures a relationship between those entities.
A knowledge graph is a directed heterogeneous multigraph whose node and relation types have domain-specific semantics.
Why KGs?
From a human perspective, KGs help combat information overload, support exploration via an intuitive structure, and serve as a tool for knowledge-driven tasks.
From an AI perspective, a KG's content is a key ingredient for many AI tasks: it bridges from raw data to human semantics, and it lets us apply decades of work on graph analysis.
How are KGs used?
QA/Agents: By extracting suitable candidates (entities, relations or literals) from the KG
Decision Support: By reasoning over the KG
Fueling Discovery: By reasoning over the KG, By extracting and suggesting suitable candidates in the KG, By matching and analyzing patterns in the KG
Where do KGs come from?
Structured Text (Wikipedia Infoboxes, tables, databases, social nets)
Unstructured Text (WWW, news, social media, reference articles)
Images
Video (YouTube, video feeds)
Crawling the Web & Intellectual Property
Surface web vs. deep web vs. dark web
Surface web: pages reachable by following links from static pages; people can access these pages directly.
Deep web: pages reachable only via web forms, for example email accounts or databases.
Dark web: pages reachable only via Tor or equivalent tools.
What are the challenges?
scale
deduplication
cost (fetching, parsing/extracting, memory/disk * speed)
errors, redirects
freshness
deep web, forms
counter-crawling/access (login, captchas, traps, fake errors, banning)
localization
dynamic pages
infinite scrolling
archiving
What is the basic architecture of a crawler?
Basically, a crawler can be split into four parts: a scheduler, a queue, a multi-threaded downloader, and storage. The procedure is: first, the user provides a list of seed web pages (URLs) to the scheduler. The scheduler puts these URLs into the queue according to certain rules, then assigns URLs from the queue to the multi-threaded downloader, which automatically fetches the pages. After a page is downloaded, it is parsed and its content is extracted. Newly extracted URLs are reviewed by the scheduler, and the worthwhile ones are added back to the queue. The target data (text and metadata) is stored in storage. The crawl ends when there are no URLs left to crawl or the user terminates it.
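A minimal single-threaded sketch of this loop, assuming the third-party `requests` and `beautifulsoup4` packages (the function and its parameters are illustrative, not from the lecture):

```python
# Minimal single-threaded sketch of the scheduler/queue/downloader/storage loop.
# Assumes the `requests` and `beautifulsoup4` packages; politeness (robots.txt,
# rate limits), error pages, and multi-threading are omitted for brevity.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)   # the scheduler's frontier
    seen = set(seed_urls)      # simple deduplication
    storage = {}               # URL -> extracted text

    while queue and len(storage) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=5)     # "downloader" step
        except requests.RequestException:
            continue                                # skip fetch errors
        soup = BeautifulSoup(resp.text, "html.parser")
        storage[url] = soup.get_text()              # "storage" step
        for a in soup.find_all("a", href=True):     # parse & extract new URLs
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)                      # scheduler admits the URL
                queue.append(link)
    return storage
```

A production crawler would add the multi-threaded downloader from the architecture above, plus robots.txt handling, URL normalization, and retry logic.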
What mechanisms exist to protect intellectual property?
Patents
Copyrights
Trademarks
Trade Secrets
Software Licenses
What are the requirements for a patent?
Patents can be split into three main types:
Utility patents protect useful processes, machines, articles of manufacture, and compositions of matter.
Design patents guard against the unauthorized use of new, original, and ornamental designs for articles of manufacture.
Plant patents protect invented or discovered, asexually reproduced plant varieties.
Patents provide rights for up to 20 years for inventions
What can be copyrighted, and what is the difference in CC licenses?
Copyrights protect works of authorship, such as writings, music, and works of art that have been tangibly expressed. Broadly, the following creative works can be copyrighted:
Literary, musical and dramatic works.
Pantomimes and choreographic works.
Pictorial, graphic and sculptural works.
Sound recordings.
Motion pictures and other AV works.
Computer programs.
Compilations of works and derivative works.
Architectural works.
Creative Commons (CC) licenses combine four conditions:
Attribution: licensees may copy, distribute, display, and perform the work only if they credit the owner.
Noncommercial: licensees may copy, distribute, etc. the work only for noncommercial purposes.
No Derivative Works: licensees may copy, distribute, etc. the work, but not derivative works based on it.
ShareAlike: licensees may distribute derivative works only under a license identical to the one that governs the original work.
Week 2
Information Extraction
What are some typical IE tasks?
sentence:
Part of speech tagging
Dependency Parsing
Named entity recognition
document:
Coreference Resolution
documents:
Entity resolution
Entity linking
Relation extraction
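The sentence-level tasks above are easy to try out with spaCy; a quick sketch, assuming `pip install spacy` and the small English model (spaCy is an illustrative choice, not prescribed by the course):

```python
# Sketch: POS tagging, dependency parsing, and NER on one sentence.
# Assumes `pip install spacy` and `python -m spacy download en_core_web_sm`.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was born in Hawaii.")

for tok in doc:
    # part-of-speech tag, dependency label, and the head it attaches to
    print(tok.text, tok.pos_, tok.dep_, tok.head.text)

for ent in doc.ents:
    # named entities with their types, e.g. "Barack Obama" PERSON, "Hawaii" GPE
    print(ent.text, ent.label_)
```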
What are the three components to IE?
Defining domain
Learning extractors
Scoring the facts
What are the possible levels of supervision for each component?
Supervised
Semi-supervised
Unsupervised
Real-world (O)IE systems
Open domain IE:
Defining domain: Unsupervised
Learning extractors: Unsupervised
Scoring candidate facts: Semi-supervised
Knowledge Representation
What are the basic elements of RDF?
S (subject), P (predicate), O (object): every RDF statement is an SPO triple.
What are the different syntaxes for RDF?
RDF/XML, N3, Turtle, N-Triples, RDFa, JSON-LD
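A small sketch with the `rdflib` package (an illustrative choice), building one SPO triple and printing it in two of these syntaxes:

```python
# Sketch: build one SPO triple and serialize it in two RDF syntaxes.
# Assumes the `rdflib` package; the ex: namespace is invented.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)

g.add((EX.alice, EX.knows, EX.bob))           # S: ex:alice, P: ex:knows, O: ex:bob
g.add((EX.alice, EX.name, Literal("Alice")))  # object can also be a literal

print(g.serialize(format="turtle"))  # compact, prefix-based Turtle
print(g.serialize(format="nt"))      # one full triple per line (N-Triples)
```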
What is RDF Schema?
RDF Schema is the language for defining RDF vocabularies.
It specifies the RDF inference rules: the triples that are implied by the triples you have.
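One such inference rule, rdfs:domain, applied by hand in a sketch (rdflib again; the entities ex:alice, ex:teaches, ex:Professor are invented; a real reasoner such as the `owlrl` package would apply the full RDFS rule set):

```python
# Sketch of the rdfs:domain rule by hand: if (p rdfs:domain C) and (s p o),
# then (s rdf:type C) is implied. Assumes `rdflib`; entities are invented.
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.teaches, RDFS.domain, EX.Professor))  # schema triple
g.add((EX.alice, EX.teaches, EX.kg101))         # data triple

inferred = []
for prop, _, cls in g.triples((None, RDFS.domain, None)):
    for subj, _, _ in g.triples((None, prop, None)):
        inferred.append((subj, RDF.type, cls))  # the implied triple
for triple in inferred:
    g.add(triple)

print((EX.alice, RDF.type, EX.Professor) in g)  # True
```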
Degree of semantics in graphs
????
Week 3
Entity Resolution
Which ER variants exist?
Ambiguity: Entities with the same name
Variance: Different names for the same entity
What are the three basic steps of ER?
Blocking, pairwise matching (scoring candidate pairs), and clustering the matched pairs into entities.
How do we evaluate blocking?
Efficiency = number of pairs compared / total number of pairs in R×R
Recall = number of true matches compared / number of true matches in R×R
Precision = number of true matches compared / number of pairs compared
Max canopy size: the size of the largest block
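A worked toy example of these three metrics (all numbers invented for illustration):

```python
# Toy numbers, all invented: 1,000 records, blocking keeps 10,000 candidate
# pairs, and 900 of the 1,000 true matches survive blocking.
total_pairs  = 1000 * 999 // 2   # all pairs in R x R (unordered, no self-pairs)
compared     = 10_000            # pairs the blocker lets through
true_matches = 1_000             # true matches in R x R
matches_kept = 900               # true matches among the compared pairs

efficiency = compared / total_pairs       # fraction of pairs still compared
recall     = matches_kept / true_matches  # true matches not lost to blocking
precision  = matches_kept / compared      # purity of the candidate pairs

print(f"efficiency={efficiency:.4f} recall={recall:.2f} precision={precision:.2f}")
```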
What are real-world examples of these ER settings?
text -> text: cross-document coreference, i.e., deduplicating entity mentions across documents
text -> KG: entity linking (see Week 5)
KG -> KG: deduplicating entities within a single KG
KG1 -> KG2: aligning two different KGs, e.g., matching DBpedia entities to Wikidata entities
Week 4
Queries & KGs
RDF vs. Property Graphs
Similarity:
Both represent directed graphs as a basic data structure;
Both have associated graph-oriented query languages;
In practice, both are used as “graph stores”, accessible via HTTP and/or various APIs;
Differences:
RDF has an emphasis on the open-world assumption (OWA) and is rooted in the Web via URIs. Not the case for PGs: a PG node is oblivious to what it “contains”; it can be a URL or a literal (a node is not necessarily a URI);
PGs include the possibility to add simple key/value pairs to “relationships” (i.e., RDF predicates), so relationships can carry attributes;
RDF triple-stores vs. Graph DBs
In RDF triple-stores everything is expressed in terms of SPO triples, and predicates cannot have attributes;
In Graph DBs predicates can have attributes;
Fair to say that RDF triple-stores are a kind of Graph DB;
SPARQL: syntax, Graph Patterns, Aggregation, etc…
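A minimal sketch of a SPARQL query combining a basic graph pattern with aggregation, sent to the public Wikidata endpoint via the `SPARQLWrapper` package (the query and agent string are illustrative):

```python
# Sketch: basic graph pattern + aggregation against the public Wikidata
# endpoint. Assumes the `SPARQLWrapper` package; wd:/wdt: prefixes are
# predefined by that endpoint. Query and agent string are illustrative.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper(
    "https://query.wikidata.org/sparql",
    agent="kg-review-sketch/0.1 (example)",  # Wikidata asks for a custom agent
)
sparql.setQuery("""
SELECT ?country (COUNT(?city) AS ?numCities)
WHERE {
  ?city wdt:P31 wd:Q515 .   # ?city is an instance of "city"
  ?city wdt:P17 ?country .  # ?city belongs to ?country
}
GROUP BY ?country
ORDER BY DESC(?numCities)
LIMIT 5
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["country"]["value"], row["numCities"]["value"])
```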
SPARQL vs. Cypher
SPARQL (SPARQL Protocol and RDF Query Language) is RDF's query language.
Cypher is Neo4j's query language.
Special Topics: KG Use Cases
Wikidata data model
Model using statements: each statement has a subject, a property, and a value, and can be extended with qualifiers and references.
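A sketch of a single Wikidata statement as a plain Python dict, just to make the structure concrete (the IDs follow Wikidata conventions, but treat the exact values as illustrative):

```python
# Sketch of the statement model as a plain dict: subject, property, value,
# plus qualifiers and references. IDs follow Wikidata conventions; treat the
# concrete values as illustrative.
statement = {
    "subject":  "Q76",         # Barack Obama
    "property": "P39",         # position held
    "value":    "Q11696",      # President of the United States
    "qualifiers": {
        "P580": "2009-01-20",  # start time
        "P582": "2017-01-20",  # end time
    },
    "references": [
        {"P854": "https://www.whitehouse.gov/"},  # reference URL
    ],
}
```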
Challenges and methods for building KGs in real applications
challenge: make the annotations easy to use
hundreds of millions of sex ads on the open internet…many trafficking related
Data obfuscation makes even simple questions hard to automatically answer
solution:
Use the KG in TSV format: “the knowledge graph is a TSV file”, which makes it easy to process.
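A sketch of what “the KG is a TSV file” means in practice: one (subject, predicate, object) edge per row, streamable with the standard library (file name and columns assumed):

```python
# Sketch: stream a KG stored as TSV, one (subject, predicate, object) edge per
# row. The file name and its contents are assumed for illustration.
import csv

with open("kg.tsv", newline="", encoding="utf-8") as f:
    for subj, pred, obj in csv.reader(f, delimiter="\t"):
        print(subj, pred, obj)
```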
Week 5
Large KGs & Entity Linking
Which (kinds of) large KGs exist?
DBpedia, YAGO, Wikidata, Freebase, ConceptNet, NELL, OpenIE
How is Wikidata different from other large KGs?
Wikidata is collaboratively edited and multilingual, and its statements can carry qualifiers and references (see the Wikidata data model in Week 4).
What are the different methods to get data from large KGs?
Full dumps, public SPARQL endpoints, and HTTP APIs (cf. the “graph stores” accessible via HTTP and/or APIs in Week 4).
What is entity linking (from text) and why is it hard?
Establishing the identity of the entity in a given reference database (text -> KG).
name ambiguity: Entities with the same name
name variation: Different names for the same entity
Missing (NIL) entities
What is the basic architecture of entity linkers?
Mention detection -> candidate selection -> disambiguation -> entity annotation
What are some methods for disambiguation?
Word-based methods (DBpedia Spotlight): compute the cosine similarity between the text paragraph containing the entity mention and the Wikipedia descriptions of each candidate; decide one mention at a time. The linking can be restricted to certain types or even to a custom set of entities (a request sketch follows after these methods).
Graph-based methods (AIDA and AGDISTIS): 1. construct a subgraph that contains all entity candidates together with some facts from a KB; 2. find the best-connected candidates per mention.
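As referenced above, a sketch of calling DBpedia Spotlight's public demo endpoint with `requests` (the endpoint URL and parameters are public but unversioned, so treat them as subject to change):

```python
# Sketch: annotate a sentence via DBpedia Spotlight's public demo endpoint.
# Assumes the `requests` package; URL and parameters may change over time.
import requests

resp = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",
    params={"text": "Berlin is the capital of Germany.", "confidence": 0.5},
    headers={"Accept": "application/json"},
    timeout=10,
)
for res in resp.json().get("Resources", []):
    print(res["@surfaceForm"], "->", res["@URI"])
```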
String Similarity
Which similarity families exist and what is the main idea of each?
What are the strengths and weaknesses of each method?
How do hybrid methods work?
Week 6
Information Extraction
What are labeling functions (Snorkel)? What makes them good?
Labeling functions (LFs) let users encode domain knowledge and other supervision sources programmatically: an LF is a function designed by a human to help label data points.
Instead of people manually labeling the points, people write heuristics that noisily label the data. Humans can leverage real-world knowledge, context, and common-sense heuristics to make labeling decisions, which enables quick, low-cost labeling.
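Two toy labeling functions in Snorkel's v0.9-style API, for an invented spam-detection task (a sketch assuming the `snorkel` package):

```python
# Sketch: two keyword/heuristic labeling functions for an invented
# spam-detection task. Assumes the `snorkel` package (v0.9-style API).
from snorkel.labeling import labeling_function

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_free(x):
    # noisy heuristic: "free" often signals spam
    return SPAM if "free" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    # noisy heuristic: very short messages tend to be ham
    return HAM if len(x.text.split()) < 5 else ABSTAIN
```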
What does the generative model do?
LFs have different latent accuracies; Snorkel wants to learn these latent accuracies without labeled data by leveraging the overlap and conflict among LFs.
Snorkel fits a generative model that maximizes the marginal likelihood of the observed LF outputs to learn these parameters. Intuitively, it compares the LFs' agreements and disagreements, and detects correlations and other dependencies among LFs, in order to correct their accuracy estimates.
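In code, that step looks roughly like this (same assumptions as the sketch above; `df_train` is an assumed unlabeled DataFrame with a `text` column):

```python
# Sketch: apply the LFs to unlabeled data, then fit Snorkel's generative
# LabelModel on the resulting label matrix (n_examples x n_LFs, entries in
# {-1, 0, 1}). `df_train` is an assumed DataFrame with a `text` column.
from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel

applier = PandasLFApplier(lfs=[lf_contains_free, lf_short_message])
L_train = applier.apply(df_train)

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=42)
probs_train = label_model.predict_proba(L_train)  # probabilistic labels
```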
What is the purpose of the discriminative model?
Compiling rules into features.
The output of the generative model is a set of probabilistic training labels.
Snorkel uses these labels to train the final discriminative model, which learns its own feature representation of the data rather than relying on the LFs directly.
The discriminative model aims to generalize beyond the noisy LFs: the more unlabeled data we train on with Snorkel, the better the predictive performance of the discriminative model.
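A common concrete instance: train any off-the-shelf classifier on the probabilistic labels; a sketch with scikit-learn, reusing `df_train` and `probs_train` from the sketches above:

```python
# Sketch: train a discriminative classifier on the probabilistic labels.
# Assumes scikit-learn; `df_train` and `probs_train` come from the sketches
# above. Using argmax turns the probabilistic labels into hard labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(df_train.text)  # features beyond the LFs

clf = LogisticRegression()
clf.fit(X_train, probs_train.argmax(axis=1))

print(clf.predict(vectorizer.transform(["claim your free prize now"])))
```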
Week 7