If you launch an LLM-powered feature, how reliable does it need to be to avoid embarrassing the team? Are there benchmarks people use to decide that?