A looming replication crisis in LLM evals?

In our group’s latest paper, we carried out a series of replication experiments on prompt engineering techniques claimed to improve reasoning abilities in LLMs. Surprisingly, our results show no statistically significant differences for nearly all of the techniques tested, and they also expose several methodological weaknesses in the original studies. I suspect the problem runs deeper: as more studies are replicated, its severity will become more apparent, potentially revealing a replication crisis in the field of LLM evals. If you’re interested in reading further, our paper is available as a preprint here.
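
To give a flavor of the kind of statistical check these comparisons call for (this is an illustrative sketch with made-up numbers, not our paper’s specific methodology or data): when two prompting techniques are evaluated on the same benchmark items, the outcomes are paired, so a paired test such as McNemar’s is more appropriate than eyeballing the accuracy gap.

```python
# Illustrative sketch: exact McNemar test for comparing two prompting
# techniques evaluated on the same items. Hypothetical counts below;
# not taken from the paper.
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar test on the discordant pairs.
    b = items solved only by technique A, c = items solved only by technique B.
    Under H0 (no difference), discordant items split 50/50 between b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Two-sided p-value: P(X <= k) + P(X >= n - k) for X ~ Binomial(n, 0.5)
    p = sum(comb(n, i) for i in range(0, k + 1)) / 2**n
    p += sum(comb(n, i) for i in range(n - k, n + 1)) / 2**n
    return min(p, 1.0)  # guard against double-counting the midpoint when b == c

# Hypothetical outcome on a benchmark: 12 items solved only by the baseline
# prompt, 18 solved only by the "improved" prompt.
p = mcnemar_exact_p(12, 18)
print(f"p = {p:.3f}")  # well above 0.05: a gap this size is easily noise
```

With only 30 discordant items, an apparent 6-point accuracy gap is nowhere near significance, which is exactly the kind of effect that headline accuracy comparisons without a test can overstate.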