A Levenhall Product
AIGIS

AI Evaluation & Governance · Across Every Classification Level

A modular, containerized AI evaluation and governance platform. Vendor-agnostic infrastructure for continuous model assessment — designed to run from unclassified environments through TS/SCI air-gapped networks on the same container images.

One Question, Asked Continuously: Is This Model Safe and Mission-Ready?

AIGIS takes its name from the mythological aegis. The function follows the name: determine whether an AI model is safe and mission-ready for a specific deployment context — not in the abstract, but for this user, this network, this mission, this classification level.

AIGIS is built for the operational evaluator, not the research scientist. It treats AI assessment as a continuous infrastructure problem, not a one-time benchmark.

Designed For UNCLASS → TS/SCI

Identical container images across every classification level. Only Helm values change.

The Gap Between Research Tools and Program Tools

Levenhall identified the gap through defense advisory engagements: DoD and IC organizations evaluating AI lack standardized, deployment-ready infrastructure for rigorous model assessment — particularly across classification boundaries. Existing tools fall into two categories. Neither is sufficient.

Category 1

Research-Grade

HELM · Inspect · lm-evaluation-harness

Designed for ML researchers. Not deployable to classified networks. No operational context.

Category 2

Program-Proprietary

Built for a single program

Reflects one program's mission profile and dies when the program does. Not reusable across organizations.

AIGIS

Operational Infrastructure

Vendor-agnostic · Cross-classification · Continuous

Purpose-built for the operational evaluator. Same platform across DoD, IC, and commercial deployments — from air-gapped TS/SCI to managed cloud.

Dual-Use Strategy

Two Markets, One Platform

Defense rigor and commercial cadence on the same core. Government users benefit from continuous commercial R&D. Commercial customers benefit from defense-grade evaluation rigor.

Defense

DoD & IC

Model evaluation across classification levels. DDIL degradation profiling. Adversarial red-teaming aligned to MITRE ATLAS. Human-AI teaming measurement. Mission-specific benchmark development.

Commercial

Enterprise AI Governance

Regulatory compliance for the EU AI Act, NIST AI RMF, and ISO 42001. Insurance risk assessment for regulated industries. Continuous governance, not point-in-time audit.

API-First. Microservices. Containerized End-to-End.

Independently deployable components, decoupled infrastructure (harness) and content (benchmarks), and an LLM & vector-store abstraction that lets the same platform run on OpenAI, Bedrock, or local Ollama.

Layer 01

Presentation

React / Next.js web dashboard, REST API, CLI.

Layer 02

API Gateway

Traefik with auth, RBAC, rate limiting, and TLS termination.

Layer 03 · Core Services

Five Pluggable Engines

Model Interface Layer: Pluggable adapters (OpenAI, HuggingFace, Ollama, SageMaker / Bedrock, gRPC) behind a uniform InvokeRequest / InvokeResponse protocol.
Execution Engine: Python / FastAPI + Celery + Redis. Distributed task orchestration with parallel runs, priority queuing, automatic retry, and real-time progress streaming.
Scoring Engine: Pluggable scorer registry. Exact match, contains match, similarity scoring, classification accuracy, and human-evaluation routing.
Benchmark Management: Versioned benchmark suites with immutable snapshots, semver, and full task metadata.
Governance & Risk Engine: Compliance tracking (NIST AI RMF, MITRE ATLAS, EU AI Act, ISO 42001, OWASP Agentic AI Top 10, DoD AI Ethical Principles), risk scoring, insurance coverage monitoring, incident tracking.
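A pluggable scorer registry of the kind described for the Scoring Engine can be sketched in a few lines of Python. This is an illustration only, not the AIGIS implementation: the names SCORERS, scorer, and score are hypothetical, and only the two simplest scorers (exact match and contains match) are shown.

```python
from typing import Callable

# A scorer takes (model_output, reference) and returns a score in [0, 1].
Scorer = Callable[[str, str], float]

# Hypothetical registry: scorer name -> scoring function.
SCORERS: dict[str, Scorer] = {}


def scorer(name: str) -> Callable[[Scorer], Scorer]:
    """Decorator that registers a scoring function under a name."""
    def decorator(fn: Scorer) -> Scorer:
        SCORERS[name] = fn
        return fn
    return decorator


@scorer("exact_match")
def exact_match(output: str, reference: str) -> float:
    # Case-sensitive comparison after trimming surrounding whitespace.
    return 1.0 if output.strip() == reference.strip() else 0.0


@scorer("contains_match")
def contains_match(output: str, reference: str) -> float:
    # Case-insensitive substring check.
    return 1.0 if reference.strip().lower() in output.lower() else 0.0


def score(name: str, output: str, reference: str) -> float:
    """Look up a scorer by name and apply it."""
    return SCORERS[name](output, reference)
```

With this shape, a new scorer is added by decorating one function; callers that select scorers by name do not change.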

Layer 04

Data

PostgreSQL + pgvector, Redis, object storage, LLM providers behind LLMProvider / VectorStoreProvider interfaces.

Layer 05

Runtime

Docker / OCI containers, orchestrated via Kubernetes. K3s for resource-constrained tactical environments.

Capabilities

Two Lines of Effort

LOE 1 is the platform itself. LOE 2 is the methodology that lets a government team build, validate, and maintain their own benchmarks on top of it.

LOE 01

Evaluation Harness

Model Interface Layer

Adapters for OpenAI-compatible APIs, HuggingFace Inference, AWS SageMaker / Bedrock, local Ollama, and gRPC endpoints. New adapters are written against a documented SDK without modifying the core platform.
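As a rough sketch of what writing an adapter against a uniform request/response protocol could look like: the field names on InvokeRequest and InvokeResponse below are assumptions (the source names the two types but not their fields), and ModelAdapter, EchoAdapter, and register_adapter are hypothetical names, not the AIGIS SDK.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class InvokeRequest:
    """Assumed request shape; the real fields are not documented here."""
    prompt: str
    max_tokens: int = 256


@dataclass(frozen=True)
class InvokeResponse:
    """Assumed response shape."""
    text: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int


class ModelAdapter(Protocol):
    """Anything with a name and an invoke() method counts as an adapter."""
    name: str

    def invoke(self, request: InvokeRequest) -> InvokeResponse: ...


class EchoAdapter:
    """Toy adapter that reverses the prompt; stands in for a real backend."""
    name = "echo"

    def invoke(self, request: InvokeRequest) -> InvokeResponse:
        text = request.prompt[::-1][: request.max_tokens]
        return InvokeResponse(
            text=text,
            latency_ms=0.0,
            prompt_tokens=len(request.prompt.split()),
            completion_tokens=len(text.split()),
        )


# Hypothetical adapter registry keyed by adapter name.
ADAPTERS: dict[str, ModelAdapter] = {}


def register_adapter(adapter: ModelAdapter) -> None:
    ADAPTERS[adapter.name] = adapter


register_adapter(EchoAdapter())
response = ADAPTERS["echo"].invoke(InvokeRequest(prompt="hello world"))
```

Because the harness would depend only on the protocol, swapping a hosted API for a local model becomes a registration change rather than a change to evaluation code.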

Evaluation Execution

Each run records an environment snapshot (Python version, platform, AIGIS version, LLM provider, timestamp) so that results can be traced back to the conditions that produced them. Per-task scoring records latency, token counts, and a pass/fail determination.
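The snapshot fields listed above can be gathered with the Python standard library. The sketch below is illustrative: capture_environment is a hypothetical function name, and the version string passed in is a placeholder, not a real AIGIS release.

```python
import platform
from datetime import datetime, timezone


def capture_environment(aigis_version: str, llm_provider: str) -> dict:
    """Collect the fields named in the text into one record for a run."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "aigis_version": aigis_version,
        "llm_provider": llm_provider,
        # Timezone-aware UTC timestamp in ISO 8601 form.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


# Placeholder values for illustration only.
snapshot = capture_environment(aigis_version="0.0.0-example", llm_provider="ollama")
```

Storing this record next to each run's scores lets two runs be compared with their environments side by side.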

Benchmark Domains

General, safety, adversarial, agentic, multimodal, DDIL (denied / disconnected / intermittent / limited), and mission-specific.

Task Types

Multiple choice, free text, code generation, classification, extraction, summarization, agentic workflow, and human evaluation.

DDIL Conditions Simulator

Injects operational stress — latency, jitter, bandwidth throttling, intermittent connectivity, compute constraints — and evaluates degradation curves rather than binary pass/fail.
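To make "degradation curves rather than binary pass/fail" concrete, here is a deliberately simplified sketch. It does not inject real network stress; it assumes each task has a recorded baseline latency and models added latency as a fixed delay measured against a timeout. TaskResult, pass_rate_under_latency, and degradation_curve are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    correct: bool            # did the model answer correctly at baseline?
    base_latency_ms: float   # response time with no injected stress


def pass_rate_under_latency(
    results: list[TaskResult], injected_ms: float, timeout_ms: float
) -> float:
    """Share of tasks that are correct AND still finish before the timeout."""
    passed = sum(
        1
        for r in results
        if r.correct and r.base_latency_ms + injected_ms <= timeout_ms
    )
    return passed / len(results)


def degradation_curve(
    results: list[TaskResult], injected_levels_ms: list[float], timeout_ms: float
) -> list[tuple[float, float]]:
    """Pass rate at each stress level, as (injected latency, pass rate) pairs."""
    return [
        (level, pass_rate_under_latency(results, level, timeout_ms))
        for level in injected_levels_ms
    ]


results = [
    TaskResult(correct=True, base_latency_ms=200.0),
    TaskResult(correct=True, base_latency_ms=400.0),
    TaskResult(correct=True, base_latency_ms=800.0),
    TaskResult(correct=False, base_latency_ms=300.0),
]
curve = degradation_curve(results, [0, 500, 1000], timeout_ms=1000.0)
# curve == [(0, 0.75), (500, 0.5), (1000, 0.0)]
```

The curve shows where performance falls off as stress increases, which a single pass/fail result at one operating point cannot.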

Agentic AI Evaluation

Full execution-trace capture, scored on task completion rate, efficiency, tool-selection accuracy, and safety-boundary adherence. Aligned with OWASP Agentic AI Top 10 and the MAESTRO framework.
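The per-trace metrics named above can be sketched as follows. The metric definitions here are assumptions chosen for illustration (for example, efficiency as reference steps divided by actual steps), and Step, Trace, FORBIDDEN_TOOLS, and score_trace are hypothetical names rather than the platform's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool_called: str     # tool the agent actually invoked
    tool_expected: str   # tool a reference solution would use at this step


@dataclass(frozen=True)
class Trace:
    steps: tuple[Step, ...]
    completed: bool      # did the agent finish the task?
    min_steps: int       # fewest steps a reference solution needs


# Hypothetical safety boundary: tools the agent must never call in this task.
FORBIDDEN_TOOLS = frozenset({"delete_records"})


def score_trace(trace: Trace) -> dict:
    n = len(trace.steps)
    correct_tools = sum(s.tool_called == s.tool_expected for s in trace.steps)
    violations = sum(s.tool_called in FORBIDDEN_TOOLS for s in trace.steps)
    return {
        "completed": trace.completed,
        # 1.0 means the agent used no more steps than the reference solution.
        "efficiency": min(1.0, trace.min_steps / n) if n else 0.0,
        "tool_selection_accuracy": correct_tools / n if n else 0.0,
        "safety_violations": violations,
    }


trace = Trace(
    steps=(
        Step("search", "search"),
        Step("read_file", "read_file"),
        Step("delete_records", "summarize"),
        Step("summarize", "summarize"),
    ),
    completed=True,
    min_steps=3,
)
scores = score_trace(trace)
```

Task completion rate would then be the mean of the completed flag across many traces. Keeping safety violations as a separate count, rather than folding them into one score, prevents a completed task from masking a boundary breach.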

Adversarial Red-Teaming

Automated pipeline aligned with MITRE ATLAS — prompt injection, jailbreak generation, data-poisoning probes, model-extraction resistance. Results map to ATLAS technique IDs.

Human Evaluation

Likert, pairwise comparison, best-of-N ranking. Inter-rater reliability computed automatically (Krippendorff's alpha, Fleiss' kappa).
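For reference, Fleiss' kappa can be computed directly from the raw ratings. The sketch below follows the standard definition and assumes every item is rated by the same number of raters; the function name and the list-of-lists input format are illustrative choices, not the platform's API.

```python
from collections import Counter


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for items each rated by the same number of raters.

    ratings[i] holds the category labels assigned to item i.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(row) for row in ratings]
    categories = set().union(*counts)

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in item.values()) - n_raters) / (n_raters * (n_raters - 1))
        for item in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    total = n_items * n_raters
    p_expected = sum(
        (sum(item[cat] for item in counts) / total) ** 2 for cat in categories
    )

    if p_expected == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_observed - p_expected) / (1.0 - p_expected)


ratings = [
    ["pass", "pass", "pass"],
    ["pass", "pass", "fail"],
    ["fail", "fail", "fail"],
    ["pass", "fail", "fail"],
]
kappa = fleiss_kappa(ratings)  # 1/3 for this example
```

In this example the observed agreement is 2/3 and chance agreement is 1/2, giving kappa = (2/3 - 1/2) / (1 - 1/2) = 1/3.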

Multimodal Support

Text, image, audio, and video pipelines with modality-specific metrics.

LOE 02

Benchmark Development Methodology

A structured eight-phase toolkit that lets government personnel independently build, validate, and maintain mission-specific benchmarks. Written methodology guide, worked examples, QA checklist, and training curriculum — targeting team self-sufficiency within 90 days.

01 Requirements elicitation
02 Task decomposition
03 Input design
04 Scoring criteria
05 Baseline establishment
06 Validation
07 Gaming resistance
08 Maintenance

Same Container. Every Network.

Identical container images at every level. Only Helm configuration values change. Air-gapped deployments bundle all model weights, benchmarks, and the knowledge base as container volumes.

Environment      LLM Provider                 Vector DB       External APIs
Unclassified     OpenAI API                   Managed cloud   Full
IL5 Cloud        OpenAI (FedRAMP) or Ollama   pgvector        Restricted
IL6 / SECRET     Ollama (local)               pgvector        None
TS/SCI · JWICS   Ollama (quantized, local)    pgvector        None
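One way to picture "only configuration changes" is a per-environment profile that the same code reads at startup. The sketch below mirrors the table rows in Python. The profile keys, the provider identifiers, and the validate_provider helper are all hypothetical, and a real deployment would carry these settings in Helm values rather than in code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentProfile:
    allowed_llm_providers: tuple[str, ...]
    vector_store: str
    external_apis: str  # "full", "restricted", or "none"


# Hypothetical identifiers; one entry per row of the environment table.
PROFILES = {
    "unclassified": DeploymentProfile(("openai",), "managed-cloud", "full"),
    "il5": DeploymentProfile(("openai-fedramp", "ollama"), "pgvector", "restricted"),
    "il6-secret": DeploymentProfile(("ollama",), "pgvector", "none"),
    "ts-sci": DeploymentProfile(("ollama-quantized",), "pgvector", "none"),
}


def validate_provider(environment: str, provider: str) -> DeploymentProfile:
    """Reject a provider that the environment's profile does not allow."""
    profile = PROFILES[environment]
    if provider not in profile.allowed_llm_providers:
        raise ValueError(f"{provider!r} is not allowed in {environment!r}")
    return profile


def is_air_gapped(environment: str) -> bool:
    return PROFILES[environment].external_apis == "none"


profile = validate_provider("il5", "ollama")
```

A check of this kind at startup would fail fast if, for example, a hosted API were configured in an air-gapped environment.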

Governance Dashboard

Compliance, Risk & Insurance, Continuously

The governance module tracks framework coverage with last-audit dates, scores risk against active incidents, monitors insurance coverage across the portfolio, and surfaces real-time activity across evaluations, adversarial detections, bias audits, and model registrations.

94% DoD AI Ethical Principles
87% NIST AI RMF
72% MITRE ATLAS
65% EU AI Act
48% ISO 42001
$60M+ Insurance Coverage Tracked

Working Prototype, Hardening for Classified

Core evaluation engine (model interface, execution, scoring) — demonstrable in unclassified environment
Human evaluation forms and reliability computation — functional
Benchmark management and versioning — functional
Web dashboard (React / Next.js) — functional with all major views
DDIL simulator, adversarial pipeline, agentic evaluation, multimodal — designed, partially implemented
Classification-level deployment hardening — architecture complete; IL5+ testing requires Government access

Open-Source All the Way Down

Backend: Python, FastAPI, Celery, SQLAlchemy (async), Alembic
Frontend: React, Next.js, TypeScript, Tailwind CSS, Lucide
Data: PostgreSQL + pgvector, Redis
Infra: Docker / OCI, Kubernetes, K3s (edge), Traefik
LLM Providers: Ollama (local / air-gapped), OpenAI, HuggingFace
Lock-In: None. All infrastructure components are open-source.

Levenhall Proprietary

AIGIS is Active. Early Access is Selective.

The platform architecture, adapter SDK, governance-integrated scoring methodology, DDIL degradation profiling system, and benchmark development toolkit are Levenhall proprietary technology. We work with DoD and IC organizations, regulated commercial customers, and insurance counterparties on early access.