1/ CS majors are drilled to think about the "worst-case" performance of algorithms. By contrast, much of the discourse on AI evals focuses on the average case or the best case (e.g. "LLM X can solve IMO problems"). Maybe one key to "reliability" is certifying a low quantile of outputs too (say, the 1st percentile), not just the mean.
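
A rough sketch of how the mean and a low quantile can diverge. The scores are made up for illustration, and `lower_quantile` is a hypothetical helper using the nearest-rank definition, not any particular eval library's API:

```python
import math
from statistics import mean

def lower_quantile(scores, q):
    """Nearest-rank q-quantile: the smallest score such that at least
    a fraction q of all scores are less than or equal to it."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-attempt scores in [0, 1]: 97 strong outputs, 3 total failures.
scores = [0.95] * 97 + [0.0] * 3

avg = mean(scores)                   # the number most evals report
p01 = lower_quantile(scores, 0.01)   # 1st percentile of outputs
worst = min(scores)                  # empirical worst case

print(f"mean={avg:.4f}  p01={p01:.2f}  worst={worst:.2f}")
```

Here the mean comes out around 0.92 while the 1st percentile is 0.0, so a model can look strong on average and still fail outright on a small fraction of attempts.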