社交网络

总览

社交网络与图, 社交网络的聚类, 社区发现

社交网络与图

1.What is a network? a network can be defined as a graph in which nodes and/or edges have attributes (e.g. names). 带有属性信息的图

2.Network’s properties:

  • size: # of nodes/edges
  • node degree: # of links to other nodes
  • degree distribution: probability that a randomly selected node has degree k
    • scale-free network: network whose degree distribution follows a power law, at least asymptotically. 网络中的大部分节点只和很少节点连接,而有极少的节点与非常多的节点连接,幂律分布/长尾效应
    • Real networks are scale-free
  • path,
  • shortest path
  • diameter: shortest path between most distant nodes/maximal shortest path

3.Mathematical representation
Adjacency matrix : A
Path matrix: 可以通过邻接矩阵计算路径矩阵, A^l 结果表示为路径矩阵

3.Phenomena arise as a result of properties:

  • Friendship paradox: on average, your friends have more friends than you do, 友谊悖论

  • Small world: most nodes are not neighbors of one another,but most nodes can be reached from every other by a small number of hops or steps. 小世界理论,六度空间

    • The typical distance L between two randomly chosen nodes grows proportionally to the logarithm of the number of nodes N in the network, 两个节点的距离和整个网络的节点数的对数成正比
  • Perception bias in networks,网络中存在感知误差

社交网络的聚类

方式:
1.Hierarchical

  • Agglomerative (bottom-up):
    Initially, each point is a cluster,每个点都是一个cluster
    Repeatedly combine the two “nearest” clusters into one,不断往上去合并点,最终得到结果。
    Use distance metric
  • Divisive (top-down): ==>> 本文关注的算法
    Start with one cluster and recursively split it 整个图为一个cluster,不断去细分clusters

2.Point assignment
Maintain a set of clusters
Points belong to “nearest” cluster
Use distance metric

社交网络中距离指标

1.d(x, y) is 0 if there is an edge and 1 if there is no such edge.
限制🚫: violate the triangle inequality
例子: 假设A,B,C三点,AC一边,那么 d(A,B) = d(B,C) = 0 and d(A,C) = 1 > d(A,B) + d(B,C)

====>

2.Shortest path distance: Minimum # of edges connecting to nodes:
例子: 假设A,B,C三点,边AB,BC,那么d(A,B) = d(B,C) = 1 and d(A,C) = 2

The problem is more complex because the distance between data points (nodes) is not measured by Euclidian
由此可以看到传统的聚类方法(KNN,Kmeans)不太适用于社交网络的聚类,因为距离指标计算很复杂

====>
3. Betweenness 中介性/居间性

Edge betweenness: # of shortest paths passing over the edge
边的中介性,简单理解就是经过该条边的最短路径数目

example:Betweenness of edge (a, b):
number of pairs of nodes x and y -> x
edge (a,b) lies on the shortest path between x and y

However,if there are several shortest paths between x and y, edge (a,b) is credited with the fraction of those shortest paths that include edge (a,b)

但是!!!y->x的最短路径不只有一条,假设y->x的最短路径有3条,其中有一条进过了边(a,b),那么这一条最短路径为(a,b)的中介性贡献了1/3

A high score is bad: suggests that edge (a,b) runs between two different communities
我们可以发现如果一条边的中介性很高的话,表明a和b是属于两个不同的社区<<== 聚类实现

社区发现

利用betweeness的社区发现算法:Girvan-Newman Algorithm

Idea: discover communities using divisive hierarchical clustering (Start with one cluster (the social network) and recursively split it)

Strategy: edge betweenness(# of shortest paths passing through the edge)

Algorithm: Girvan-Newman Algorithm

Repeat until no edges are left:
Calculate betweenness of edges
Remove edges with highest betweenness

Result: Connected components are communities

但是通过上述的描述可以发现,通过不断的移除最大的edge betweeness,最后得到结果是一个个孤立的点,所以需要明确以下两点:
1.How to compute betweenness?
2.How to select the number of clusters?

Compute betweenness

Strategy: BFS

1.Perform a breadth-first search (BFS) of the graph, starting at node X
首先对于每一个点 X,得到一个BFS tree,这个BFS tree可以反映出该点到其他所有点的最短路径

2.Label each node by the number of shortest paths that reach it from the root node
标记处每个点到root node X的最短路径数目

3.Calculate for each edge e, the sum over all nodes Y (of the fraction) of the shortest paths from the root X to Y that go through edge e
为每条边e计算从root nodeX到其他所有点Y的最短路径中经过e的总和(包括分数部分)

具体计算:

第一步: 从底部开始
• A and C are leaves: get credit = 1;
• Each of these nodes has only one parent, so their credit=1 is given to edges (B,A) and (B,C)
• At level 2, G is a leaf: gets credit = 1
• B gets credit 1 + credit of DAG edges entering from below = 1 + 1 +1 = 3
• B has only one parent, so edge (D,B) gets entire credit of node B = 3

第二步:碰见需要拆分的情况 G has 2 parents

• In this case, both D and F have just one shortest path from E to each of those nodes
• So, give half credit of node G to each of those edges
• Credit = 1/(1 + 1) = 0.5
• In general, how we distribute credit of a node to its edges depends on number of shortest paths
• Say there were 5 shortest paths to D and only 3 to F
• Then credit of edge (D,G) = 5/8 and credit of edge (F,G) = 3/8
• Node D gets credit = 1 + credits of edges below it = 1 + 3 + 0.5 = 4.5
• Node F gets credit = 1 + 0.5 = 1.5
• D has only one parent, so Edge (E,D) gets credit = 4.5 from D
• Likewise for F: Edge (E,F) gets credit = 1.5 from F.

To complete betweenness calculation, must:

  • Repeat this for every node as root
  • Sum the contributions on each edge
  • Divide by 2 to get true betweenness
    • since every shortest path will be counted twice, once for each of its endpoints

以上仅仅是对针对一个节点计算的各个边的betweeness,需要对每个节点重复这样的计算随后累加除以2得到最终结果。从这里也可以发现GN是针对无向图的,如果是有向图的话,就不需要除以2了。

Select the number of clusters

Communities: sets of tightly connected nodes

Modularity Q: A measure of how well a network is partitioned into communities
模块:衡量一个网络中社区聚类好坏的指标

Modularity compares the number of edges inside a cluster with the expected number of edges that one would find in the cluster if the network were a random network with the same number of nodes and where each node keeps its degree, but edges are otherwise randomly attached.
Networks with high modularity have dense connections between the nodes within groups but sparse connections between nodes in different groups.
The null model is a graph which matches one specific graph in some of its structural features, but which is otherwise taken to be an instance of a random graph. The null model is used as a term of comparison, to verify whether the graph in question displays some feature, such as community structure, or not.

rclone mount GD: /nas/home/binzhang/data –allow-other –allow-non-empty –vfs-cache-mode writes

rclone mount GD: /nas/home/binzhang/GoogleDrive –allow-other –allow-non-empty –vfs-cache-mode writes

rclone copy /home/backup gdrive:backu


本博客所有文章除特别声明外,均采用 CC BY-SA 4.0 协议 ,转载请注明出处!