The goal of dynamic link property prediction is to predict the property (often the existence) of a link between a node pair at a future timestamp.
Negative Edge Sampling. In real applications, the true edges are not known in advance. Instead, a large number of node pairs are queried, and only the pairs with the highest scores are treated as edges. Motivated by this, we frame the link prediction task as a ranking problem and sample multiple negative edges for each positive edge. In particular, for a given positive edge (s,d,t), we fix the source node s and timestamp t and sample q negative destination nodes. For each dataset, q is selected based on the trade-off between evaluation completeness and test set inference time. Out of the q negative samples, half are sampled uniformly at random, while the other half are historical negative edges (edges that were observed in the training set but are absent at time t).
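A minimal sketch of this sampling scheme in Python/NumPy is given below. The function name, the per-source lookup table train_dst_by_src, and the fallback to uniform sampling when a source has too few historical candidates are our assumptions for illustration, not the benchmark's exact sampler.

import numpy as np

def sample_negatives(src, true_dst, num_nodes, train_dst_by_src, q, rng):
    # The timestamp is fixed by the positive edge (src, true_dst, t);
    # only destination nodes are resampled.
    # Half of the negatives are uniformly random destination nodes.
    n_rand = q // 2
    rand_negs = rng.integers(0, num_nodes, size=n_rand)

    # The other half are historical negatives: destinations that co-occurred
    # with src in the training set but are not the true destination here.
    hist_candidates = np.array(
        [d for d in train_dst_by_src.get(src, ()) if d != true_dst]
    )
    n_hist = q - n_rand
    if len(hist_candidates) >= n_hist:
        hist_negs = rng.choice(hist_candidates, size=n_hist, replace=False)
    else:
        # Assumption: fall back to uniform sampling when too few
        # historical candidates exist for this source node.
        hist_negs = rng.integers(0, num_nodes, size=n_hist)

    return np.concatenate([rand_negs, hist_negs])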
Performance metric. We use the filtered Mean Reciprocal Rank (MRR) as the metric for this task, as it is designed for ranking problems. The MRR computes the reciprocal rank of the true destination node among the negative (fake) destinations and is commonly used in the recommendation system and knowledge graph literature.
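A minimal sketch of the metric follows, assuming the sampled negatives have already been filtered so that none of them is a true edge at the query timestamp; the tie-handling choice (ties counted against the positive) is ours.

import numpy as np

def mrr(pos_scores, neg_scores):
    # pos_scores: shape (B,)   scores of the true destinations
    # neg_scores: shape (B, q) scores of the sampled negative destinations
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)
    # Rank of the true destination: 1 + the number of negatives scoring
    # at least as high.
    ranks = 1 + (neg >= pos).sum(axis=1)
    return float((1.0 / ranks).mean())

# A model that always ranks the true destination first achieves MRR = 1.0.
print(mrr([0.9, 0.8], [[0.1, 0.2], [0.3, 0.0]]))  # 1.0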
Results on small datasets. On the small tgbl-wiki and tgbl-review datasets, we observe that the best performing models are quite different. In addition, the top performing models on tgbl-wiki, such as CAWN and NAT, show a significant reduction in performance on tgbl-review. One possible explanation is that the tgbl-review dataset has a much higher surprise index than the tgbl-wiki dataset. The high surprise index indicates that a large fraction of test set edges is never observed in the training set; thus tgbl-review requires more inductive reasoning. On tgbl-review, GraphMixer and TGAT are the best performing models. Due to the smaller size of these datasets, we are able to sample all possible negatives for tgbl-wiki and one hundred negatives per positive edge for tgbl-review.
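For reference, the surprise index described above can be computed as the fraction of test edges whose source-destination pair never appears in the training set; the helper below is a sketch under that assumption, not the benchmark's implementation.

def surprise_index(train_edges, test_edges):
    # Edges are treated as (source, destination) pairs; timestamps are ignored.
    train_pairs = {(s, d) for (s, d, t) in train_edges}
    unseen = sum(1 for (s, d, t) in test_edges if (s, d) not in train_pairs)
    return unseen / len(test_edges)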
Results on large datasets. Most methods run out of GPU memory on these datasets; thus we compare TGN, DyRep, and EdgeBank on them, due to their lower GPU memory requirements. Note that some datasets, such as tgbl-comment and tgbl-flight, span multiple years, potentially resulting in distribution shifts over their long time spans.
Insights. As seen above on tgbl-wiki, the number of negative samples used for evaluation can significantly impact model performance: we see a significant performance drop across most methods when the number of negative samples increases from 20 to all possible destinations. This verifies that more negative samples are indeed required for robust evaluation. Curiously, methods such as CAWN and EdgeBank show only a minor drop in performance, and we leave it as future work to investigate why certain methods are less affected.
Next, we observe up to two orders of magnitude difference in the training and validation time of TG methods, with the heuristic baseline EdgeBank always being the fastest (as it is implemented simply as a hashtable). This shows that improving model efficiency and scalability is an important future direction, so that both novel and existing models can be tested on the large datasets provided in TGB.
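As a rough illustration of why such a baseline is so cheap, below is a minimal sketch of an EdgeBank-style memorization baseline; the class and method names are ours and the published EdgeBank also includes a fixed-time-window variant not shown here.

class EdgeBankUnlimited:
    # Store every (source, destination) pair ever observed and predict 1.0
    # for a query edge iff the pair has been seen before.
    def __init__(self):
        self.seen = set()

    def update(self, src, dst):
        self.seen.add((src, dst))

    def predict(self, src, dst):
        return 1.0 if (src, dst) in self.seen else 0.0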