Yun Wang, Chrysanthi Kosyfaki, Sihem Amer-Yahia, Reynold Cheng
Hypothesis testing is a statistical method used to draw conclusions about populations from sample data, typically represented in tables. With the prevalence of graph representations in real-life applications, hypothesis testing on graphs is gaining importance. In this work, we formalize node, edge, and path hypotheses on attributed graphs. We develop a sampling-based hypothesis testing framework, which can accommodate existing hypothesis-agnostic graph sampling methods. To achieve accurate and time-efficient sampling, we then propose a Path-Hypothesis-Aware SamplEr, PHASE, an m-dimensional random walk that accounts for the paths specified in the hypothesis. We further optimize its time efficiency and propose \(\text{PHASE}_{\text{opt}}\). Experiments on three real datasets demonstrate the ability of our framework to leverage common graph sampling methods for hypothesis testing, and the superiority of hypothesis-aware sampling methods in terms of accuracy and time efficiency.
Our comprehensive experiments on MovieLens, DBLP, and Yelp datasets demonstrate the effectiveness of our approach:
\(\text{PHASE}_{\text{opt}}\) consistently outperforms the top 5 hypothesis-agnostic samplers across most hypotheses and sampling proportions. Its advantage is less pronounced for easier hypotheses with abundant relevant structures in \(\mathcal{G}\), where hypothesis-agnostic samplers perform comparably. However, for harder hypotheses (e.g., with (hard) tag), \(\text{PHASE}_{\text{opt}}\) shows clear superiority, achieving the highest accuracy at all sampling proportions.
@article{wang2024sampling,
title={A Sampling-based Framework for Hypothesis Testing on Large Attributed Graphs},
author={Wang, Yun and Kosyfaki, Chrysanthi and Amer-Yahia, Sihem and Cheng, Reynold},
journal={[Journal Name]},
year={2024}
}