AI Evaluation & Governance · Across Every Classification Level
A modular, containerized AI evaluation and governance platform. Vendor-agnostic infrastructure for continuous model assessment — designed to run from unclassified environments through TS/SCI air-gapped networks on the same container images.
One Question, Asked Continuously: Is This Model Safe and Mission-Ready?
AIGIS takes its name from the mythological aegis. The function follows the name: determine whether an AI model is safe and mission-ready for a specific deployment context — not in the abstract, but for this user, this network, this mission, this classification level.
AIGIS is built for the operational evaluator, not the research scientist. It treats AI assessment as a continuous infrastructure problem, not a one-time benchmark.
The Gap Between Research Tools and Program Tools
Levenhall identified the gap through defense advisory engagements: DoD and IC organizations evaluating AI lack standardized, deployment-ready infrastructure for rigorous model assessment — particularly across classification boundaries. Existing tools fall into two categories. Neither is sufficient.
Research-Grade
HELM · Inspect · lm-evaluation-harness
Designed for ML researchers. Not deployable to classified networks. No operational context.
Program-Proprietary
Built for a single program
Reflects one program's mission profile and dies when the program does. Not reusable across organizations.
Operational Infrastructure
Vendor-agnostic · Cross-classification · Continuous
Purpose-built for the operational evaluator. Same platform across DoD, IC, and commercial deployments — from air-gapped TS/SCI to managed cloud.
Two Markets, One Platform
Defense rigor and commercial cadence on the same core. Government users benefit from continuous commercial R&D. Commercial customers benefit from defense-grade evaluation rigor.
DoD & IC
Model evaluation across classification levels. DDIL degradation profiling. Adversarial red-teaming aligned to MITRE ATLAS. Human-AI teaming measurement. Mission-specific benchmark development.
Enterprise AI Governance
Regulatory compliance for the EU AI Act, NIST AI RMF, and ISO 42001. Insurance risk assessment for regulated industries. Continuous governance, not point-in-time audit.
API-First. Microservices. Containerized End-to-End.
Independently deployable components, a harness (infrastructure) decoupled from the benchmarks (content), and an LLM and vector-store abstraction that lets the same platform run on OpenAI, Bedrock, or local Ollama.
Presentation
React / Next.js web dashboard, REST API, CLI.
API Gateway
Traefik with auth, RBAC, rate limiting, and TLS termination.
Five Pluggable Engines
Data
PostgreSQL + pgvector, Redis, object storage, LLM providers behind LLMProvider / VectorStoreProvider interfaces.
Runtime
Docker / OCI containers, orchestrated via Kubernetes. K3s for resource-constrained tactical environments.
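The LLMProvider / VectorStoreProvider interfaces named in the Data layer can be pictured roughly as follows. This is a minimal sketch, not the AIGIS SDK: the method names (complete, add, search), the EchoLLM and InMemoryVectorStore stand-ins, and answer_with_context are all hypothetical, chosen only to show how platform code can depend on an interface rather than on a vendor.

```python
from typing import Protocol, Sequence


class LLMProvider(Protocol):
    """Anything that can turn a prompt into a completion (assumed shape)."""

    def complete(self, prompt: str) -> str: ...


class VectorStoreProvider(Protocol):
    """Anything that can store vectors and return nearest neighbours (assumed shape)."""

    def add(self, key: str, vector: Sequence[float]) -> None: ...

    def search(self, vector: Sequence[float], k: int) -> list[str]: ...


class EchoLLM:
    """Stand-in provider; a real one would wrap OpenAI, Bedrock, or Ollama."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class InMemoryVectorStore:
    """Brute-force store; a pgvector-backed class would expose the same methods."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[float, ...]] = {}

    def add(self, key: str, vector: Sequence[float]) -> None:
        self._rows[key] = tuple(vector)

    def search(self, vector: Sequence[float], k: int) -> list[str]:
        def squared_distance(key: str) -> float:
            return sum((a - b) ** 2 for a, b in zip(self._rows[key], vector))

        # Smallest squared Euclidean distance first.
        return sorted(self._rows, key=squared_distance)[:k]


def answer_with_context(llm: LLMProvider, store: VectorStoreProvider,
                        query: str, query_vec: Sequence[float]) -> str:
    """Caller code sees only the two interfaces, never a vendor SDK."""
    context = store.search(query_vec, k=1)
    return llm.complete(f"{query} [context: {', '.join(context)}]")
```

Swapping EchoLLM for a hosted or local provider would leave answer_with_context unchanged, which is the point of the abstraction.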
Two Lines of Effort
LOE 1 is the platform itself. LOE 2 is the methodology that lets a government team build, validate, and maintain their own benchmarks on top of it.
Evaluation Harness
Model Interface Layer
Adapters for OpenAI-compatible APIs, HuggingFace Inference, AWS SageMaker / Bedrock, local Ollama, and gRPC endpoints. New adapters are written against a documented SDK without modifying the core platform.
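One common way to let new adapters plug in without touching the core is a small registry. The sketch below is illustrative only and assumes nothing about the actual AIGIS adapter SDK: ModelAdapter, register_adapter, load_adapter, and StaticAdapter are hypothetical names.

```python
from abc import ABC, abstractmethod
from typing import Callable


class ModelAdapter(ABC):
    """Hypothetical base class every adapter would implement."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Send a prompt to the underlying model and return its text output."""


# Registry keyed by adapter name; the core only ever looks adapters up here.
ADAPTERS: dict[str, type[ModelAdapter]] = {}


def register_adapter(name: str) -> Callable[[type[ModelAdapter]], type[ModelAdapter]]:
    """Class decorator that adds an adapter to the registry under `name`."""

    def decorator(cls: type[ModelAdapter]) -> type[ModelAdapter]:
        if name in ADAPTERS:
            raise ValueError(f"adapter {name!r} is already registered")
        ADAPTERS[name] = cls
        return cls

    return decorator


@register_adapter("static")
class StaticAdapter(ModelAdapter):
    """Toy adapter that always returns the same reply; useful for dry runs."""

    def __init__(self, reply: str = "ok") -> None:
        self.reply = reply

    def generate(self, prompt: str) -> str:
        return self.reply


def load_adapter(name: str, **kwargs) -> ModelAdapter:
    """Instantiate a registered adapter by name."""
    if name not in ADAPTERS:
        raise KeyError(f"no adapter registered as {name!r}")
    return ADAPTERS[name](**kwargs)
```

A new endpoint type would then be a new decorated class in its own module, with no edit to load_adapter or the harness that calls it.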
Evaluation Execution
Each run records a full environment snapshot (Python version, platform, AIGIS version, LLM provider, timestamp) to support reproducibility. Per-task scoring covers latency, token counts, and pass/fail determinations.
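A snapshot with the fields listed above could be captured along these lines. This is a sketch using only the Python standard library; the function name, the dictionary keys, and the example version string are assumptions, not the platform's actual schema.

```python
import json
import platform
from datetime import datetime, timezone


def environment_snapshot(aigis_version: str, llm_provider: str) -> dict:
    """Collect the run-time facts needed to trace a result back to its environment."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "aigis_version": aigis_version,      # supplied by the caller (hypothetical)
        "llm_provider": llm_provider,        # e.g. "ollama"
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # Store the snapshot alongside the run's scores so both travel together.
    print(json.dumps(environment_snapshot("0.0.0-example", "ollama"), indent=2))
```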
Benchmark Domains
General, safety, adversarial, agentic, multimodal, DDIL (denied / disconnected / intermittent / limited), and mission-specific.
Task Types
Multiple choice, free text, code generation, classification, extraction, summarization, agentic workflow, and human evaluation.
DDIL Conditions Simulator
Injects operational stress — latency, jitter, bandwidth throttling, intermittent connectivity, compute constraints — and evaluates degradation curves rather than binary pass/fail.
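To make "degradation curve rather than binary pass/fail" concrete, the toy sketch below sweeps one stressor (injected latency) and reports the pass rate at each level. It is an assumption-laden illustration: Task, pass_rate, degradation_curve, the timeout rule, and the numbers are all hypothetical and cover only one of the stress types listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    base_latency_ms: float  # time the model needs under ideal conditions
    correct: bool           # whether the answer is right when it arrives


def pass_rate(tasks: list[Task], injected_latency_ms: float, timeout_ms: float) -> float:
    """A task passes only if it is correct AND still arrives before the timeout."""
    passed = sum(
        1 for t in tasks
        if t.correct and t.base_latency_ms + injected_latency_ms <= timeout_ms
    )
    return passed / len(tasks)


def degradation_curve(tasks: list[Task], latency_levels_ms: list[float],
                      timeout_ms: float) -> list[tuple[float, float]]:
    """Pass rate at each stress level, in the order the levels were given."""
    return [(level, pass_rate(tasks, level, timeout_ms)) for level in latency_levels_ms]


TASKS = [
    Task("a", 200, True),
    Task("b", 600, True),
    Task("c", 900, True),
    Task("d", 300, False),
]

curve = degradation_curve(TASKS, [0, 200, 500, 900], timeout_ms=1000)
# curve == [(0, 0.75), (200, 0.5), (500, 0.25), (900, 0.0)]
```

Read as a curve, the result shows where performance starts to fall and how steeply, which a single pass/fail at one condition cannot.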
Agentic AI Evaluation
Full execution-trace capture, scored on task completion rate, efficiency, tool-selection accuracy, and safety-boundary adherence. Aligned with OWASP Agentic AI Top 10 and the MAESTRO framework.
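The four scores named above can be computed from captured traces in a few lines. The sketch below is one plausible set of definitions, not the platform's scoring methodology: the Step and Trace fields, and the choice to define efficiency as optimal steps divided by actual steps (capped at 1), are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool_used: str
    tool_expected: str            # reference label for this step (assumed available)
    violated_boundary: bool = False


@dataclass(frozen=True)
class Trace:
    completed: bool
    optimal_steps: int            # fewest steps a reference solution needs
    steps: tuple[Step, ...]


def score_traces(traces: list[Trace]) -> dict[str, float]:
    """Aggregate completion, efficiency, tool accuracy, and safety adherence."""
    if not traces or any(not t.steps for t in traces):
        raise ValueError("need at least one trace, each with at least one step")
    n = len(traces)
    all_steps = [s for t in traces for s in t.steps]
    return {
        "task_completion_rate": sum(t.completed for t in traces) / n,
        # 1.0 means the agent used no more steps than the reference solution.
        "efficiency": sum(min(1.0, t.optimal_steps / len(t.steps)) for t in traces) / n,
        "tool_selection_accuracy":
            sum(s.tool_used == s.tool_expected for s in all_steps) / len(all_steps),
        # Share of traces with no boundary violation at any step.
        "safety_boundary_adherence":
            sum(not any(s.violated_boundary for s in t.steps) for t in traces) / n,
    }


TRACES = [
    Trace(True, 2, (Step("search", "search"), Step("read", "read"))),
    Trace(False, 2, (Step("search", "search"), Step("delete", "read", True),
                     Step("read", "read"), Step("read", "read"))),
]
scores = score_traces(TRACES)
```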
Adversarial Red-Teaming
Automated pipeline aligned with MITRE ATLAS — prompt injection, jailbreak generation, data-poisoning probes, model-extraction resistance. Results map to ATLAS technique IDs.
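Mapping results to ATLAS technique IDs might look like the sketch below. The probe-type names, the function, and the result shape are hypothetical. The technique IDs are given as we understand the public ATLAS matrix and should be verified against the current ATLAS release before use.

```python
from collections import defaultdict

# Assumed mapping from probe type to MITRE ATLAS technique ID.
# Verify each ID against the current ATLAS matrix; IDs can be revised.
ATLAS_TECHNIQUE_IDS = {
    "prompt_injection": "AML.T0051",   # LLM Prompt Injection
    "jailbreak": "AML.T0054",          # LLM Jailbreak
    "data_poisoning": "AML.T0020",     # Poison Training Data
    "model_extraction": "AML.T0024",   # Exfiltration via inference API
}


def summarize_by_technique(findings: list[tuple[str, bool]]) -> dict[str, dict[str, float]]:
    """Group (probe_type, attack_succeeded) findings under their technique ID."""
    totals: dict[str, dict[str, int]] = defaultdict(lambda: {"attempts": 0, "successes": 0})
    for probe_type, attack_succeeded in findings:
        technique_id = ATLAS_TECHNIQUE_IDS[probe_type]  # KeyError on unknown probe types
        totals[technique_id]["attempts"] += 1
        totals[technique_id]["successes"] += int(attack_succeeded)
    return {
        technique_id: {**counts, "success_rate": counts["successes"] / counts["attempts"]}
        for technique_id, counts in totals.items()
    }


report = summarize_by_technique([
    ("prompt_injection", True),
    ("prompt_injection", False),
    ("jailbreak", False),
])
```

Here a higher success_rate means the attack worked more often, so lower is better for the model under test.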
Human Evaluation
Likert, pairwise comparison, best-of-N ranking. Inter-rater reliability computed automatically (Krippendorff's alpha, Fleiss' kappa).
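For reference, Fleiss' kappa compares the observed agreement among raters with the agreement expected by chance. The sketch below implements the standard formula in plain Python; the function name and input layout are our own choices for illustration and say nothing about how the platform computes it internally.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same number of ratings."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when every rating is in one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Three raters, two items, two categories, full agreement on each item.
perfect = fleiss_kappa([[3, 0], [0, 3]])
```

A value of 1 indicates complete agreement, 0 indicates agreement no better than chance, and negative values indicate less agreement than chance would produce.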
Multimodal Support
Text, image, audio, and video pipelines with modality-specific metrics.
Benchmark Development Methodology
A structured eight-phase toolkit that lets government personnel independently build, validate, and maintain mission-specific benchmarks. Written methodology guide, worked examples, QA checklist, and training curriculum — targeting team self-sufficiency within 90 days.
Same Container. Every Network.
Identical container images at every level. Only Helm configuration values change. Air-gapped deployments bundle all model weights, benchmarks, and the knowledge base as container volumes.
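The "same image, different values" idea can be modelled as a base set of values plus a per-level override, merged the way Helm layers a later values file over an earlier one. Everything in this sketch is hypothetical: the keys, the registry name, the level names, and the placeholder digest are illustrative and are not AIGIS chart values.

```python
import copy

# Shared by every deployment; the image reference is pinned once, here.
BASE_VALUES = {
    "image": {"repository": "registry.example/aigis/harness",
              "digest": "sha256:<pinned-digest>"},
    "llm": {"provider": "openai"},
    "network": {"egress": True},
}

# Per-level overrides touch configuration only, never the image.
OVERRIDES = {
    "unclassified": {},
    "airgapped": {
        "llm": {"provider": "ollama"},
        "network": {"egress": False},
        "volumes": {"modelWeights": "bundled"},
    },
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where nested maps are merged and scalars are replaced."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def render_values(level: str) -> dict:
    """Effective values for one deployment level."""
    return deep_merge(BASE_VALUES, OVERRIDES[level])


def images_identical(levels: list[str]) -> bool:
    """Check that every level resolves to exactly the same image reference."""
    refs = {(v["image"]["repository"], v["image"]["digest"])
            for v in map(render_values, levels)}
    return len(refs) == 1
```

A check like images_identical could run in CI to catch an override that accidentally changes the image.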
Compliance, Risk & Insurance, Continuously
The governance module tracks framework coverage with last-audit dates, scores risk against active incidents, monitors insurance coverage across the portfolio, and surfaces real-time activity across evaluations, adversarial detections, bias audits, and model registrations.
Working Prototype, Hardening for Classified
Open-Source All the Way Down
AIGIS is Active. Early Access is Selective.
The platform architecture, adapter SDK, governance-integrated scoring methodology, DDIL degradation profiling system, and benchmark development toolkit are Levenhall proprietary technology. We work with DoD and IC organizations, regulated commercial customers, and insurance counterparties on early access.