Introducing ARFBench: A time series question-answering benchmark based on real incidents

By Datadog | The Monitor blog

April 24, 2026

Summary

The authors introduce ARFBench, a benchmark for evaluating how well AI models perform time series question-answering (TSQA) on real-world Datadog incident data. The study finds that current frontier models still underperform human experts, but that a new hybrid architecture combining a time series foundation model (TSFM) with a vision-language model (VLM) shows significant promise for specialized anomaly reasoning. The researchers suggest that because AI models and humans make different kinds of errors, the two can complement each other in incident response.