知识图谱数据获取与知识产权

Tim Berners-Lee published the first-ever website in August 6, 1991
Knowledge Graph的数据大部分来源于Web

Web基本信息

Web结构

  • Surface Web: Pages reachable by following links from static pages (表层网络,可以直接在万维网浏览的内容)
  • Deep Web: Pages reachable only via web forms (不能被www直接访问到,需要账号密码,访问权限等才能进入,例如邮箱、数据库)
  • Dark Web: Pages reachable only via Tor or equivalent (通过特定的浏览器、特殊授权或者特殊设置才能链接上的网络)

Web的Bowtie模型

Strongly Connected Component – 27.5%
IN and OUT – 21.5%
Tendrils and tubes – 21.5%
Disconnected – 8%

Web 数据获取途径 - Crawler

Web Crawler定义: Software that automatically browses the web and downloads content
Crawler 用途: Building search engines, Study the web/internet, Archive the web, Analyze web sites, Download content(知识图谱数据获取)
Crawler 基本框架: 调度器调度爬取页面,下载器多线程下载。

Queue: one per web site
Scheduler: breath first or depth first ?
Multithreaded Downloader: take Politeness and Latency into account
After download: Parse -> Extract Content -> Extract URLs ->Add To Queue

Crawler常用工具:Scrapy, bs4, import.io parsehub
Crawler 挑战: scale、deduplication、cost、error、freshness、counter-crawling/access …
Crawler 要求: robustness、 politeness、robots.txt

Web下的知识产权(Intellectual Property)

Intellectual property (IP) is an intangible creative work. It is not the physical form on which it is stored or delivered. IP can be protected through the use of patents, copyrights, trademarks, and trade secret laws.

专利(Patents)

PATENTS provide rights for up to 20 years for inventions in three broad categories:

  1. Utility patents: protect useful processes, machines, articles of manufacture, and compositions of matter (实用专利)

  2. Design patents guard the unauthorized use of new, original, and ornamental designs for articles of manufacture (设计专利)

  3. Plant patents are the way we protect invented or discovered, asexually reproduced plant varieties (植物专利)

版权(Copyrights)

COPYRIGHTS protect works of authorship, such as writings, music, and works of art that have been tangibly expressed. (文学领域,版权也称为著作权)
The Library of Congress registers copyrights which last the life of the author plus 70 years. (去世前+70年)
Books, albums, movies are all copyrighted.
You cannot copyright facts, such as the information in the telephone book. But you can copyright the particular presentation of those facts

版权拥有者所拥有的权利:
Make copies of the work;
Produce derivative works;
Distribute copies;
Perform the work in public;
Display the work in public;
这也就意味着,非版权拥有者不能够享有这些权利。 => 需要经过许可

在某些情况下,不需要许可使用作品:
合理使用需要考虑的因素: purpose and character of the work, nature of the copyrighted work, amount and substantiality of the portion used in relation to the copyrighted work, effect on the potential market for or value of the copyrighted work
以下情况可以不需要经过版权许可:
Use of copyrighted material that contribute to the creation of new work(The new works cannot significantly affect sales of the source material, thus depriving copyright holders of their income);
Use for some research and educational purposes;
Use for news reporting and critiquing;

不受版权法保护的作品类型:
Works in the public domain are not covered by IP rights;
Ideas (e.g., mathematical formulas);
Works pre-dating copyright law (e.g., Bible);
Expired copyright;
Government works;
Traditional knowledge, folklore.

知识共享(Creative Commons (CC))
Attribution
Noncommercial
No Derivative Works
ShareAlike

商标(Trademarks)

TRADEMARKS protect words, names, symbols, sounds, or colors that distinguish goods and services from other manufacturer’s products (商品符号)
Trademarks, unlike patents, can be renewed forever as long as they are being used in business. (无限期)
No Fair-Use provision in Trademark law

商标种类:
™ : unregistered trade mark used to promote or brand goods
℠ : unregistered service mark used to promote or brand services
® : registered trademark

商业秘密(Trade Secrets)

TRADE SECRETS are information that companies keep secret to give them an advantage over their competitors. example: formula for Coca-Cola
Gives owner perpetual monopoly on secret information (无限期)

数据汇总(Data Aggregation)

Data aggregation, aka “screen scraping” or “wrapping” is conducted widely today
Some of it is on sound legal ground;
But some it is has more ambiguous legal standing

软件许可证(Software Licenses)

Copyleft licenses
Permissive licenses


本博客所有文章除特别声明外,均采用 CC BY-SA 4.0 协议 ,转载请注明出处!