Self-Supervised Dynamic Graph Representation Learning via Temporal Subgraph Contrast
This is a review of the paper Self-Supervised Dynamic Graph Representation Learning via Temporal Subgraph Contrast, which learns node representations (embeddings) using a self-supervised technique called graph contrastive learning.
Introduction
Borrowing the words of the authors, “Realistic graphs are often dynamic, which means the interaction between nodes occurs at a specific time”. The whole idea of Temporal Graphs, or more formally Continuous-Time Dynamic Graphs (CTDGs), is to model the dynamic evolution of graphs with respect to time. Over the last five posts in this series, we have established the importance of temporal graphs and covered key breakthroughs from both industry and academia. Below is the index of the previous posts for reference:
- Transforming Data: The rise of Temporal Graphs
- Continuous-Time Dynamic Network Embeddings
- Embedding Temporal Network via Neighborhood Formation
- Direct Embedding of Temporal Network Edges via Time-Decayed Line Graphs
- Representation Learning in Continuous-Time Dynamic Signed Networks
The framework proposed by the authors models the dynamic nature of a temporal graph by sampling a subgraph of size k around each node and then contrasting positive and negative feature pairs built from the central node and its subgraph. The primary goal of this self-supervised technique is to maximise the mutual information within the positive node and subgraph pair while minimising it within the negative pairs. The learned features should capture not only the temporal information in the graph (which is the main goal) but also its structural properties. To meet these requirements, the paper introduces three key contributions: a sampling technique that respects both the temporal and structural attributes of the graph, representation techniques for both nodes and subgraphs, and a new contrastive loss function that optimises the objective at hand. We will go into each of these contributions in detail in the later sections.
Self-Supervised Learning in Language & Graph
In my personal opinion, self-supervised learning rose to prominence with language modelling, where the goal is to learn meaningful high-dimensional representations for the words in a pre-defined vocabulary, which in turn enables many downstream tasks such as text classification, clustering and summarization. In contrast with supervised learning, where we explicitly label each data point in the train and test datasets, the supervision signals (labels) in a self-supervised setting come from the data itself. In language modelling, given a context window of text, the model is trained to predict the next word, a label that comes organically with free text: given any context window, the next word is its corresponding label.
Similar to its achievements in the language domain, self-supervised learning has made waves in graph research as well. It leverages the inherent structure and attributes of graphs to generate supervisory signals without the need for explicit labels. Self-supervised approaches enable us to learn rich, high-dimensional representations by designing pretext tasks such as predicting missing edges or node attributes. This is valuable for graphs because they usually come with sparse labels but abundant relational information. The representations we learn can then be used for many downstream tasks such as node classification, link prediction and graph classification.
Framework
The proposed DySubC (Dynamic graph representation via temporal Subgraph Contrast, fancy name, huh?) uses contrastive learning to capture both the temporal and structural features of temporal graphs during training, without any additional supervision.
Temporal Graph: a graph G = (V, Eₜ, X), where V is the set of vertices, Eₜ ⊆ V × V × ℝ⁺ is the set of edges, each carrying a timestamp t, and X is the node feature matrix.
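To make the notation concrete, here is a tiny, hypothetical way such a temporal graph could be held in memory. The node count, feature dimension and edge list below are made up purely for illustration and are reused by the later sketches in this post; nothing here comes from the paper itself.

```python
import numpy as np

# Minimal, illustrative container for a temporal graph G = (V, E_t, X):
# edges are (source, target, timestamp) triples and X holds one feature
# vector per node.
num_nodes, feat_dim = 5, 8
X = np.random.randn(num_nodes, feat_dim)   # node feature matrix X
temporal_edges = [
    (0, 1, 1.0),                           # node 0 interacted with node 1 at t = 1.0
    (1, 2, 2.5),
    (0, 3, 3.0),
    (3, 4, 3.0),                           # interactions can share the same timestamp
]
```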
The complete framework is divided into three components:
- Temporal subgraph sampling: For each node in G, a temporal subgraph Gᵢ = (Aᵢ, Xᵢ) is sampled that preserves both temporal and structural information, along with its counterpart Gᵢ′ = (Aᵢ′, Xᵢ′) in which the time information is not preserved. I was just wondering what happens when Gᵢ = Gᵢ′; this is not impossible, since many interactions can occur at the same time, in which case the two subgraphs would look identical even after the time information is omitted.
- Node and subgraph representations: Given Gᵢ and Gᵢ′, we generate embeddings for both the nodes and each subgraph using GCNs.
- Temporal contrastive learning: Once node and subgraph representations are formed, we will formulate positive and negative training samples and try to optimise the graph contrastive objective function.
Temporal Subgraph Sampling
This step generates a subgraph of size k, and the subsequent training samples, starting from each node in the graph G. The sampling process begins by preprocessing the graph to obtain both time-weighted and unweighted adjacency matrices. For each node, the algorithm initialises a queue and a sampling pool, then iteratively expands the candidate set of neighborhood nodes. If the number of nodes in the candidate set is insufficient to meet the required subgraph size k, all candidates are added to the queue and then to the pool. However, if the number of candidates exceeds k, the algorithm evaluates the importance of each node and selects the most significant ones based on a calculated importance score. This score is based on both the temporal and structural attributes of the nodes, which ensures that the sampled subgraph not only respects node connectivity but also reflects the temporal dynamics of the graph. The output consists of a time-weighted subgraph and its unweighted counterpart.
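As far as I can tell, the procedure boils down to a breadth-first style expansion with an importance-based cutoff. Below is a hedged Python sketch of both the adjacency preprocessing and the sampling loop; the function names are mine, not the authors', `importance` is a placeholder for the scoring callable described next, and using the latest timestamp as the edge weight is only one possible choice.

```python
import numpy as np
from collections import deque

def build_adjacencies(temporal_edges, num_nodes):
    """Preprocessing: unweighted and "time-weighted" adjacency matrices.
    Keeping the latest timestamp as the edge weight is just one choice."""
    adj = np.zeros((num_nodes, num_nodes))
    adj_t = np.zeros((num_nodes, num_nodes))
    for u, v, t in temporal_edges:
        adj[u, v] = adj[v, u] = 1.0
        adj_t[u, v] = adj_t[v, u] = max(adj_t[u, v], t)
    return adj, adj_t

def sample_temporal_subgraph(center, adj, k, importance):
    """Hedged sketch of the sampling loop: expand the central node's
    neighbourhood, taking every candidate while there is room and switching
    to importance-based selection once candidates outnumber the budget."""
    pool = [center]                  # sampling pool, starts with the central node
    queue = deque([center])          # frontier whose neighbours we expand next
    while len(pool) < k and queue:
        u = queue.popleft()
        candidates = [j for j in np.flatnonzero(adj[u]) if j not in pool]
        remaining = k - len(pool)
        if len(candidates) > remaining:
            # too many candidates: keep only the most "important" ones
            candidates = sorted(candidates, key=importance, reverse=True)[:remaining]
        pool.extend(candidates)
        queue.extend(candidates)
    return pool                      # indices of the sampled nodes
```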
The importance score combines two components: a structural importance score and a temporal importance score. The structural importance score of a node j is simply its degree, while the temporal importance score is the largest timestamp on j’s edges, normalized. The total importance score is then a weighted sum of these two components.
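A hedged sketch of such a score, under the assumptions that the temporal term is normalised by the largest timestamp in the whole graph and that the weighting factor (which I call alpha) is a hyperparameter; the authors’ exact formulation may differ.

```python
import numpy as np

def importance_score(j, adj, adj_t, alpha=0.5):
    """Weighted sum of a structural term (degree of node j) and a temporal
    term (largest timestamp on j's edges, normalised by the largest
    timestamp in the graph). Sketch only, not the paper's exact formula."""
    structural = adj[j].sum()                               # degree of node j
    latest = adj_t.max()
    temporal = adj_t[j].max() / latest if latest > 0 else 0.0
    return alpha * structural + (1.0 - alpha) * temporal
```

With this in place, the sampler above could be invoked as sample_temporal_subgraph(i, adj, k, lambda j: importance_score(j, adj, adj_t)).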
Once k nodes are sampled, the time-weighted subgraph Gᵢ and the unweighted subgraph Gᵢ′ are obtained by referring back to the original graph G, and are represented by the adjacency matrices Aᵢ and Aᵢ′ respectively. The feature matrix Xᵢ is formed by taking the rows of the full feature matrix X that correspond to the sampled nodes.
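Slicing the two adjacency views and the feature rows out of the full graph could then look like the following hedged sketch (function and variable names are mine):

```python
import numpy as np

def build_subgraph_views(pool, adj, adj_t, X):
    """Given the sampled node indices, extract the time-weighted subgraph,
    its unweighted counterpart and the corresponding feature rows."""
    idx = np.asarray(pool)
    A_i       = adj_t[np.ix_(idx, idx)]   # time-weighted adjacency of the subgraph
    A_i_prime = adj[np.ix_(idx, idx)]     # unweighted counterpart
    X_i       = X[idx]                    # feature rows of the sampled nodes
    return A_i, A_i_prime, X_i
```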
Node and Subgraph Representation
Once we have Gᵢ = (Aᵢ, Xᵢ) and Gᵢ′ = (Aᵢ′, Xᵢ′), the next step is to use two separate encoders (GCNs) to output their representations Hᵢ and Hᵢ′ respectively.
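The paper does not tie us to one particular GCN implementation; below is a minimal dense-adjacency GCN encoder in PyTorch, my own sketch assuming the standard symmetric normalisation, which could play the role of either encoder.

```python
import torch
import torch.nn as nn

class DenseGCNEncoder(nn.Module):
    """Minimal dense GCN encoder (a sketch, not the authors' code). Each
    layer applies the usual rule H = ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hid_dim)
        self.lin2 = nn.Linear(hid_dim, out_dim)

    @staticmethod
    def normalise(A):
        A_hat = A + torch.eye(A.size(0))                 # add self-loops
        deg = A_hat.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.clamp(min=1e-12).pow(-0.5))
        return d_inv_sqrt @ A_hat @ d_inv_sqrt           # symmetric normalisation

    def forward(self, A, X):
        A_norm = self.normalise(A)
        H = torch.relu(A_norm @ self.lin1(X))
        return A_norm @ self.lin2(H)                     # one embedding per node
```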
A GCN processes the graph as a whole and outputs an embedding for every node in the graph (in our case, a subgraph). Our goal here is to retrieve the embedding of the central node i, which is done using the operation named pick out in the figure above. It is as simple as taking the iᵗʰ row out of the learned embedding matrix.
The next step is to form representations for both the time-weighted and unweighted subgraphs. This is also straightforward: for the time-weighted subgraph, we take a weighted average of the node embeddings in the subgraph, with the importance scores acting as the weighting factors, while for the unweighted subgraph we take a plain average of its node embeddings. The weighted averaging and the plain averaging are called readout 1 and readout 2 respectively in the figure above.
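A hedged sketch of the pick out and the two readout operations, assuming H and H_ are the k × d node-embedding matrices produced by the two encoders and scores is a tensor holding the k importance scores:

```python
import torch

def pick_out(H, center_pos):
    """Embedding of the central node i: just its row in the matrix."""
    return H[center_pos]

def readout_weighted(H, scores):
    """"readout 1": importance-weighted average of the node embeddings."""
    w = scores / scores.sum()
    return (w.unsqueeze(1) * H).sum(dim=0)

def readout_mean(H_):
    """"readout 2": plain average of the node embeddings."""
    return H_.mean(dim=0)
```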
Temporal Contrastive Learning
Now our goal is to formulate positive and negative samples, and a loss function, to facilitate contrastive learning. The representation of the time-weighted subgraph (the output of readout 1) paired with the representation of the central node i (the output of pick out) forms the only positive example. There are two negative samples per training example. The first negative sample pairs the representation of the unweighted subgraph (the output of readout 2) with the central node i; it acts as a temporal negative sample whose purpose is to pull the central node’s representation closer to the time-weighted subgraph representation and push it away from the unweighted one. The second negative sample is generated by randomly picking a time-weighted subgraph representation from a set of shuffled representations and pairing it with the same central node i.
In short, one training example consists of three pieces:
- A positive pair with the representations of time-weighted subgraph and the central node i.
- A negative pair with the representations of unweighted subgraph and the central node i.
- Another negative pair with the representations of a random time-weighted subgraph and the central node i.
The authors chose two sigmoid-based margin loss functions, one for the structural negative sample and another for the temporal negative sample. The two losses are then combined as a weighted average to get the final loss value, which is back-propagated to the parameters of the network.
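I am not reproducing the paper’s exact loss here; the snippet below is only a hedged, margin-style sketch in the same spirit, with the margin and the weighting factor beta treated as assumed hyperparameters.

```python
import torch

def dysubc_style_loss(h_i, s_pos, s_neg_temporal, s_neg_structural,
                      margin=0.5, beta=0.5):
    """Sigmoid-based margin loss sketch (not the paper's exact formulation).

    h_i              : embedding of the central node i (pick out)
    s_pos            : time-weighted subgraph embedding (readout 1)
    s_neg_temporal   : unweighted subgraph embedding (readout 2)
    s_neg_structural : a randomly shuffled time-weighted subgraph embedding
    """
    pos   = torch.sigmoid((h_i * s_pos).sum())
    neg_t = torch.sigmoid((h_i * s_neg_temporal).sum())
    neg_s = torch.sigmoid((h_i * s_neg_structural).sum())
    loss_temporal   = torch.clamp(margin - pos + neg_t, min=0.0)
    loss_structural = torch.clamp(margin - pos + neg_s, min=0.0)
    return beta * loss_temporal + (1.0 - beta) * loss_structural
```

The margin simply stops penalising a pair once the positive score beats the negative score by at least that amount, which is the usual behaviour of margin-based contrastive objectives.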
Experimental Results
The experimental results in the paper highlight the impressive capabilities of the DySubC framework in dynamic graph representation learning. The authors conduct extensive tests on five real-world datasets, focusing on link prediction tasks to evaluate the proposed framework’s performance. The results clearly indicate that DySubC outperforms several baseline models, which include both traditional graph contrastive learning models and contemporary dynamic graph representation models. This success can be attributed to the framework’s ability to effectively integrate temporal information, allowing it to sample more relevant subgraphs that reflect the true nature of node interactions over time.
Furthermore, the authors perform an ablation study to dissect the contributions of various components within the DySubC framework. This analysis reveals that each element plays a vital role in enhancing model performance. Specifically, the study demonstrates how the different time-aware modules contribute to capturing the structural evolution of graphs, emphasising that frequent changes in a node’s neighborhood can indicate shifts in its relationships.
Conclusion
This paper introduced a contrastive learning based approach named Dynamic graph representation via temporal Subgraph Contrast (DySubC) to learn node embeddings from temporal graphs. This is done by sampling a temporal subgraph centered on each node and contrasting the positive and negative samples generated from it during the training phase. The model learns representations with both structural and temporal information by maximizing the mutual information between the node representation and its temporal subgraph representation. The proposed model is compared against a set of baseline models on the link prediction task, where it performs well in almost all cases; however, no further information is provided regarding other downstream tasks such as node classification, graph classification or clustering. I would say the two prime ideas of this paper, temporal subgraph sampling and contrastive learning on graphs, are well explained.
This was also a good read, and it does not take much time to comprehend the core concepts of the paper.
Thanks for spending some time here, have a good one!