AI benchmarks test surgical edits, not messy real-world coding