Evaluation
One uniform protocol behind adapters for OpenAI, Hugging Face, Bedrock, gRPC, and local Ollama. Pluggable scorers covering exact match, similarity, classification, and routed human evaluation, with Krippendorff's alpha and Fleiss' kappa for inter-rater reliability. Every run captures a full environment snapshot so a result can be reproduced months later.
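
As a rough illustration of how an adapter protocol, a pluggable scorer, and an inter-rater statistic might fit together, here is a minimal Python sketch. Every name in it (ModelAdapter, EchoAdapter, exact_match, evaluate, fleiss_kappa) is hypothetical and chosen for this example; none is taken from the project's actual API. The kappa function follows the standard Fleiss formula and assumes every item was rated by the same number of raters.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class ModelAdapter(Protocol):
    """Hypothetical uniform interface that each backend adapter would satisfy."""

    def generate(self, prompt: str) -> str: ...


@dataclass
class EchoAdapter:
    """Stand-in backend for illustration only; always returns a canned answer."""

    answer: str

    def generate(self, prompt: str) -> str:
        return self.answer


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the stripped, case-folded strings are equal, else 0.0."""
    return float(prediction.strip().casefold() == reference.strip().casefold())


def evaluate(adapter: ModelAdapter, cases: Sequence[tuple[str, str]]) -> float:
    """Mean exact-match score over (prompt, reference) pairs."""
    scores = [exact_match(adapter.generate(prompt), ref) for prompt, ref in cases]
    return sum(scores) / len(scores)


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa, where counts[i][j] is the number of raters who
    assigned item i to category j."""
    if not counts:
        raise ValueError("at least one rated item is required")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])
    # Share of all ratings that fall in each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items        # observed agreement
    p_e = sum(p * p for p in p_j)     # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1.0 - p_e)
```

Because evaluate depends only on the generate method, any backend that exposes that method could be swapped in without touching the scorer. For example, fleiss_kappa([[3, 0], [0, 3]]) describes three raters in full agreement on two items and yields a kappa of 1.0.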