Runs
Manage evaluation runs that track the execution of evals with specific runtime configurations. Runs capture the conversation history, results, and metadata for each eval execution.
Import
import {
createProjectModules,
createStorageProvider,
resolveWorkspace,
RunProcessor,
evaluateCriteria,
generatePersonaMessage,
type Run,
type RunStatus,
type RunResult,
type CreateRunInput,
type CreatePlaygroundRunInput,
type CreateChatRunInput,
type ChatMessageInput,
type ChatMessageResult,
type UpdateRunInput,
type ListRunsOptions,
type RunProcessorOptions,
type CriteriaEvaluationResult,
type GeneratePersonaMessageResult,
} from "@evalstudio/core";
Setup
All entity operations are accessed through project modules:
const workspaceDir = resolveWorkspace();
const storage = await createStorageProvider(workspaceDir);
const modules = createProjectModules(storage, projectId);
Types
Run
interface Run {
id: string; // Unique identifier (UUID)
evalId?: string; // Parent eval ID (optional for playground runs)
personaId?: string; // Persona ID used for this run
scenarioId?: string; // Scenario ID used for this run
connectorId?: string; // Connector ID (for playground runs without eval)
executionId?: number; // Auto-generated ID grouping runs in the same batch
status: RunStatus; // Run status
startedAt?: string; // ISO timestamp when run started
completedAt?: string; // ISO timestamp when run completed
latencyMs?: number; // Total execution time in milliseconds
threadId?: string; // Thread ID for LangGraph (regenerated on retry)
messages: Message[]; // Conversation history (includes system prompt)
output?: Record<string, unknown>; // Structured output
result?: RunResult; // Evaluation result
error?: string; // Error message if failed
createdAt: string; // ISO 8601 timestamp
updatedAt: string; // ISO 8601 timestamp
}
Note: For eval-based runs, the connector is configured at the Eval level. For playground runs (without an eval), connectorId is stored directly on the run. The LLM provider for evaluation is always configured at the project level via evalstudio.config.json llmSettings. The personaId and scenarioId are always stored directly on the run at creation time.
The messages array includes all messages stored during execution:
- System prompt (generated from persona/scenario)
- Seed messages from scenario
- Response messages from the agent
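As a sketch of what that history looks like, here is an illustrative messages array. The Message shape below (role/content pairs) is an assumption for this example; the real Message type is defined in @evalstudio/core.

```typescript
// Assumed minimal Message shape for illustration only; import the real
// type from "@evalstudio/core" in actual code.
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Illustrative conversation history as stored on a run: the generated
// system prompt first, then the scenario's seed message, then the
// agent's responses.
const messages: Message[] = [
  { role: "system", content: "You are simulating an impatient customer..." },
  { role: "user", content: "I need to cancel my booking right now." },
  { role: "assistant", content: "I can help with that. What is your booking reference?" },
];
```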
RunStatus
type RunStatus = "queued" | "pending" | "running" | "completed" | "error" | "chat";
- queued: Run created and waiting to be executed
- pending: Run is being prepared for execution
- running: Run is currently executing
- completed: Run finished (check result.success for pass/fail). Evaluation failures use this status with result.success: false
- error: Run encountered a system error (check the error field). Only runs with this status can be retried
- chat: Live chat session in the Agents page. Chat runs are not processed by RunProcessor and are excluded from eval-related run lists
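Consumers can branch on status exhaustively. The sketch below uses a local copy of the RunStatus union (import the real one from @evalstudio/core); the switch covers every member, so adding a new status would surface as a compile error:

```typescript
// Local copy of the union for illustration; the real RunStatus comes
// from "@evalstudio/core".
type RunStatus = "queued" | "pending" | "running" | "completed" | "error" | "chat";

// True when a run has reached a terminal state from the processor's
// point of view. "chat" runs are managed interactively and never
// finish in this sense, so they are treated as non-terminal here.
function isTerminal(status: RunStatus): boolean {
  switch (status) {
    case "completed":
    case "error":
      return true;
    case "queued":
    case "pending":
    case "running":
    case "chat":
      return false;
  }
}
```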
RunResult
interface RunResult {
success: boolean; // Whether the eval passed
score?: number; // Optional score (0-1)
reason?: string; // Explanation of result
}
CreateRunInput
interface CreateRunInput {
evalId: string; // Required: eval to run
}
Runs use the connector and LLM provider configured on the parent Eval. When runs are created, they are automatically assigned an executionId that groups all runs created in the same batch. This ID is auto-incremented.
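Because all runs from one createMany() call share an executionId, batches can be reassembled client-side. This sketch uses a minimal slice of the Run type (an assumption for the example; the real Run comes from @evalstudio/core):

```typescript
// Minimal slice of the Run type, for illustration only.
interface RunSlice {
  id: string;
  executionId?: number;
}

// Group runs by their batch executionId; runs without one are
// collected under a separate "none" key.
function groupByExecution(runs: RunSlice[]): Map<number | "none", RunSlice[]> {
  const groups = new Map<number | "none", RunSlice[]>();
  for (const run of runs) {
    const key = run.executionId ?? "none";
    const bucket = groups.get(key) ?? [];
    bucket.push(run);
    groups.set(key, bucket);
  }
  return groups;
}

const groups = groupByExecution([
  { id: "a", executionId: 1 },
  { id: "b", executionId: 1 },
  { id: "c", executionId: 2 },
]);
// groups.get(1) holds runs "a" and "b"; groups.get(2) holds "c"
```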
CreatePlaygroundRunInput
interface CreatePlaygroundRunInput {
scenarioId: string; // Required: scenario to run
connectorId: string; // Required: connector for invoking the agent
personaId?: string; // Optional: persona to simulate
}
Used for creating runs directly from scenarios without requiring an eval. The connector is specified directly since there's no parent eval to inherit from. The LLM provider for evaluation is resolved from the project's evalstudio.config.json llmSettings.
CreateChatRunInput
interface CreateChatRunInput {
connectorId: string; // Required: connector for the chat session
}
Used for creating live chat runs from the Agents page. Chat runs have status: "chat" and are not processed by RunProcessor.
ChatMessageInput
interface ChatMessageInput {
content: string; // The user message text
}
ChatMessageResult
interface ChatMessageResult {
run: Run; // The full updated run with all messages
messages: Message[]; // Only the new response messages from this turn
latencyMs: number; // Response time in milliseconds
error?: string; // Error message if the connector invoke failed
}
UpdateRunInput
interface UpdateRunInput {
status?: RunStatus;
startedAt?: string;
completedAt?: string;
latencyMs?: number;
threadId?: string;
messages?: Message[];
output?: Record<string, unknown>;
result?: RunResult;
error?: string;
}
ListRunsOptions
interface ListRunsOptions {
evalId?: string; // Filter by eval ID
scenarioId?: string; // Filter by scenario ID
status?: RunStatus; // Filter by status
limit?: number; // Maximum number of runs to return
}
RunProcessorOptions
interface RunProcessorOptions {
pollIntervalMs?: number; // Polling interval in milliseconds (default: 5000)
maxConcurrent?: number; // Max concurrent runs (falls back to project config, then 3)
onStatusChange?: (runId: string, status: RunStatus, run: Run) => void;
onRunStart?: (run: Run) => void;
onRunComplete?: (run: Run, result: ConnectorInvokeResult) => void;
onRunError?: (run: Run, error: Error) => void;
}
Methods
modules.runs.createMany()
Creates one or more runs for an eval. If the eval's scenario has multiple personas associated with it (personaIds), one run is created for each persona.
async function createMany(input: CreateRunInput): Promise<Run[]>;
Throws: Error if the eval, scenario, or any persona doesn't exist.
const runs = await modules.runs.createMany({
evalId: "eval-uuid",
});
// If scenario has 3 personas, returns 3 runs
// Each run has personaId and scenarioId stored directly
// All runs share the same executionId (auto-assigned)
// runs[0].status === "queued"
modules.runs.create()
Creates a single run for an eval. This is a convenience wrapper around createMany() that returns only the first run.
async function create(input: CreateRunInput): Promise<Run>;
Throws: Error if the eval doesn't exist.
const run = await modules.runs.create({
evalId: "eval-uuid",
});
// run.status === "queued"
// run.personaId and run.scenarioId are stored directly
// run.executionId is auto-assigned
modules.runs.createPlayground()
Creates a run directly from a scenario without requiring an eval. Useful for testing scenarios in a playground environment before setting up formal evaluations.
async function createPlayground(input: CreatePlaygroundRunInput): Promise<Run>;
Throws: Error if the scenario, connector, or persona doesn't exist.
const run = await modules.runs.createPlayground({
scenarioId: "scenario-uuid",
connectorId: "connector-uuid",
personaId: "persona-uuid", // Optional
});
// run.status === "queued"
// run.evalId is undefined
// run.connectorId is stored directly
The run is processed by RunProcessor like any other run. The processor checks for connectorId on the run itself when evalId is not present. The LLM provider for evaluation is resolved from the project's evalstudio.config.json llmSettings.
modules.runs.createChatRun()
Creates a live chat run for a connector. Used by the Agents page for interactive chat sessions.
async function createChatRun(input: CreateChatRunInput): Promise<Run>;
Throws: Error if the connector doesn't exist.
const run = await modules.runs.createChatRun({
connectorId: "connector-uuid",
});
// run.status === "chat"
// run.connectorId is stored directly
// run.evalId, run.scenarioId, run.personaId are undefined
Chat runs are not processed by RunProcessor. They are managed interactively through the Agents page live chat interface.
modules.runs.sendChatMessage()
Sends a user message in a chat run. The server invokes the connector, appends the user message and assistant response to the run, and persists the updated state.
async function sendChatMessage(id: string, input: ChatMessageInput): Promise<ChatMessageResult>;
Throws: Error if the run doesn't exist, is not a chat run, or has no connector configured.
const result = await modules.runs.sendChatMessage(run.id, {
content: "Hello, I need help with my booking",
});
// result.run — full updated run with all messages
// result.messages — only the new assistant/tool response messages
// result.latencyMs — connector response time
// result.error — error message if invoke failed (run is still updated)
modules.runs.get()
Gets a run by its ID.
async function get(id: string): Promise<Run | undefined>;
const run = await modules.runs.get("run-uuid");
modules.runs.list()
Lists runs with flexible filtering options.
async function list(options?: ListRunsOptions): Promise<Run[]>;
// List all runs
const allRuns = await modules.runs.list();
// Filter by status
const queuedRuns = await modules.runs.list({ status: "queued", limit: 10 });
// Filter by eval
const evalRuns = await modules.runs.list({ evalId: "eval-uuid", status: "completed" });
// Filter by scenario
const scenarioRuns = await modules.runs.list({ scenarioId: "scenario-uuid" });
When using the options-based API, results are sorted by createdAt (oldest first), making it suitable for queue processing.
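That oldest-first ordering is what makes list({ status: "queued" }) usable as a work queue. A minimal sketch of the comparator, using a reduced run shape assumed for this example:

```typescript
// Reduced run shape for illustration.
interface QueuedRun {
  id: string;
  createdAt: string; // ISO 8601 timestamp
}

// Oldest-first ordering, matching what list({ status: "queued" })
// returns, so the earliest-created run is picked up first.
function sortForQueue(runs: QueuedRun[]): QueuedRun[] {
  return [...runs].sort(
    (a, b) => Date.parse(a.createdAt) - Date.parse(b.createdAt),
  );
}

const ordered = sortForQueue([
  { id: "b", createdAt: "2024-01-02T00:00:00Z" },
  { id: "a", createdAt: "2024-01-01T00:00:00Z" },
]);
// ordered[0].id === "a" (created first, processed first)
```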
modules.runs.listByEval()
Lists runs for a specific eval, sorted by creation date (newest first).
async function listByEval(evalId: string): Promise<Run[]>;
const evalRuns = await modules.runs.listByEval("eval-uuid");
// Returns runs sorted by createdAt descending
modules.runs.listByScenario()
Lists runs for a specific scenario, sorted by creation date (newest first).
async function listByScenario(scenarioId: string): Promise<Run[]>;
const scenarioRuns = await modules.runs.listByScenario("scenario-uuid");
modules.runs.listByPersona()
Lists runs for a specific persona, sorted by creation date (newest first).
async function listByPersona(personaId: string): Promise<Run[]>;
const personaRuns = await modules.runs.listByPersona("persona-uuid");
modules.runs.listByConnector()
Lists runs for a specific connector, sorted by creation date (newest first).
async function listByConnector(connectorId: string): Promise<Run[]>;
const connectorRuns = await modules.runs.listByConnector("connector-uuid");
// Returns all runs (including chat runs) for this connector
modules.runs.update()
Updates an existing run.
async function update(id: string, input: UpdateRunInput): Promise<Run | undefined>;
// Start a run
await modules.runs.update(run.id, {
status: "running",
startedAt: new Date().toISOString(),
});
// Complete a run with success
await modules.runs.update(run.id, {
status: "completed",
completedAt: new Date().toISOString(),
latencyMs: 1500,
messages: [
{ role: "user", content: "Hello" },
{ role: "assistant", content: "Hi there!" },
],
result: {
success: true,
score: 0.95,
reason: "Agent responded appropriately",
},
});
// Mark a run as error (system failure - retryable)
await modules.runs.update(run.id, {
status: "error",
completedAt: new Date().toISOString(),
error: "Connection timeout",
});
modules.runs.delete()
Deletes a run by its ID.
async function delete(id: string): Promise<boolean>;
Returns true if the run was deleted, false if not found.
const deleted = await modules.runs.delete(run.id);
modules.runs.retry()
Retries a failed run by resetting it to "queued" status with a fresh thread ID.
async function retry(id: string): Promise<Run | undefined>;
Throws: Error if the run's status is not "error". Only runs with system errors can be retried.
const retriedRun = await modules.runs.retry(run.id);
// retriedRun.status === "queued"
// retriedRun.messages === []
// retriedRun.threadId is regenerated
RunProcessor
The RunProcessor class provides background execution of queued evaluation runs. It polls for runs with status "queued" and executes them via the configured connector.
Creating a Processor
const processor = new RunProcessor({
pollIntervalMs: 5000, // Poll every 5 seconds (default)
maxConcurrent: 3, // Process up to 3 runs concurrently (default)
onStatusChange: (runId, status, run) => {
console.log(`Run ${runId} is now ${status}`);
},
onRunStart: (run) => {
console.log(`Started processing run ${run.id}`);
},
onRunComplete: (run, result) => {
console.log(`Run ${run.id} completed`);
},
onRunError: (run, error) => {
console.error(`Run ${run.id} failed:`, error.message);
},
});
Starting and Stopping
// Start the processor (begins polling for queued runs)
processor.start();
// Check if running
console.log(processor.isRunning()); // true
// Get active run count
console.log(processor.getActiveRunCount()); // 0-maxConcurrent
// Graceful shutdown (waits for active runs to complete)
await processor.stop();
One-Shot Processing
For CLI tools or testing, you can process queued runs without starting the polling loop:
const processor = new RunProcessor();
// Process one batch of queued runs and wait for completion
const started = await processor.processOnce();
console.log(`Started ${started} runs`);
Crash Recovery
When start() is called, the processor automatically resets any runs stuck in "running" status back to "queued". This handles recovery from crashes or unexpected shutdowns.
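Conceptually, the reset is a simple filter-and-requeue over the stored runs. This is a sketch of that idea only; the actual logic lives inside RunProcessor and persists through the storage provider:

```typescript
// Reduced run shape for illustration.
interface RecoverableRun {
  id: string;
  status: string;
}

// Any run left in "running" by a crashed process goes back to
// "queued"; all other runs are untouched.
function resetStuckRuns(runs: RecoverableRun[]): RecoverableRun[] {
  return runs.map((run) =>
    run.status === "running" ? { ...run, status: "queued" } : run,
  );
}

const recovered = resetStuckRuns([
  { id: "a", status: "running" },
  { id: "b", status: "completed" },
]);
// "a" is re-queued; "b" keeps its completed status
```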
Usage with CLI and API
The same RunProcessor can be used from both CLI and API contexts:
// CLI usage
const processor = new RunProcessor({
onStatusChange: (runId, status) => {
process.stdout.write(`\r${runId}: ${status}`);
},
});
processor.start();
// API usage (e.g., in Fastify)
const processor = new RunProcessor({
onStatusChange: (runId, status, run) => {
websocket.broadcast({ type: 'run_status', runId, status, run });
},
});
processor.start();
Atomic Claiming
The processor uses atomic status transitions to prevent duplicate processing across multiple processor instances. When a run is claimed:
- The run's status is checked to be "queued"
- Status is atomically updated to "running"
- If another processor claimed it first, the claim fails and the run is skipped
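The claim is a compare-and-set: transition queued → running only if the run is still queued. The in-memory map below is a sketch of the pattern; the real implementation performs this transition atomically against the storage provider:

```typescript
// In-memory stand-in for the run store, for illustration only.
const statuses = new Map<string, string>([["run-1", "queued"]]);

// Compare-and-set claim: succeeds only if the run is still "queued".
function tryClaim(runId: string): boolean {
  if (statuses.get(runId) !== "queued") return false;
  statuses.set(runId, "running");
  return true;
}

const first = tryClaim("run-1");  // true: this processor wins the claim
const second = tryClaim("run-1"); // false: already claimed, run is skipped
```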
Evaluation Loop
When an LLM provider is configured for the project (via evalstudio.config.json llmSettings), the RunProcessor uses a multi-turn evaluation loop:
1. Send message to agent: the user message (from the scenario seed or generated) is sent to the connector
2. Evaluate response: the agent's response is evaluated against the scenario's successCriteria and failureCriteria using an LLM judge
3. Check termination conditions:
   - If successCriteria is met → Run completes as passed
   - If failureCriteria is met and failureCriteriaMode is "every_turn" → Run completes as failed
   - If the maxMessages limit is reached → Run completes (pass/fail based on the final evaluation)
   - Otherwise → Continue to step 4
4. Generate next user message: an LLM generates a contextual user message, optionally impersonating the configured persona
5. Loop: return to step 1
Failure Criteria Modes: The failureCriteriaMode field on the scenario controls when failure criteria stops the loop:
- "on_max_messages" (default): Failure criteria is only checked when maxMessages is reached without success. This allows the agent to recover from mistakes during the conversation.
- "every_turn": Failure criteria is checked at every turn, just like success criteria. The loop stops immediately when failure is detected.
If no LLM provider is configured, the processor falls back to single-turn execution (one request/response cycle).
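The loop above can be sketched as plain control flow. Everything here is illustrative: the connector, LLM judge, and message generator are stubbed as plain functions, and the names and shapes are assumptions rather than the internal RunProcessor API.

```typescript
// Illustrative verdict shape returned by the stubbed LLM judge.
interface Verdict { successMet: boolean; failureMet: boolean; }

// Simplified sketch of the multi-turn evaluation loop.
function runLoop(
  sendToAgent: (msg: string) => string,
  judge: (history: string[]) => Verdict,
  nextUserMessage: (history: string[]) => string,
  opts: { maxMessages: number; failureCriteriaMode: "on_max_messages" | "every_turn" },
  firstMessage: string,
): { success: boolean; turns: number } {
  const history: string[] = [];
  let userMsg = firstMessage;
  for (let turn = 1; ; turn++) {
    history.push(userMsg, sendToAgent(userMsg));        // step 1: send to agent
    const verdict = judge(history);                     // step 2: LLM judge
    if (verdict.successMet) return { success: true, turns: turn };
    if (verdict.failureMet && opts.failureCriteriaMode === "every_turn")
      return { success: false, turns: turn };           // step 3: early failure
    if (history.length >= opts.maxMessages)             // step 3: message cap
      return { success: !verdict.failureMet, turns: turn };
    userMsg = nextUserMessage(history);                 // steps 4-5: next turn
  }
}

// Toy stubs: the judge declares success once four messages have accumulated.
const result = runLoop(
  (msg) => `echo: ${msg}`,
  (h) => ({ successMet: h.length >= 4, failureMet: false }),
  () => "follow-up",
  { maxMessages: 10, failureCriteriaMode: "on_max_messages" },
  "hello",
);
// result: { success: true, turns: 2 }
```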
Standalone Functions
evaluateCriteria()
Evaluates a conversation against success and failure criteria using an LLM judge.
interface EvaluateCriteriaInput {
messages: Message[];
successCriteria?: string;
failureCriteria?: string;
llmProvider: LLMProvider;
model?: string;
}
interface CriteriaEvaluationResult {
successMet: boolean;
failureMet: boolean;
confidence: number; // 0-1 score
reasoning: string;
rawResponse?: string;
}
function evaluateCriteria(input: EvaluateCriteriaInput): Promise<CriteriaEvaluationResult>;
const result = await evaluateCriteria({
messages: conversationHistory,
successCriteria: "User successfully booked an appointment",
failureCriteria: "Agent refused to help or provided incorrect information",
llmProvider: provider, // LLMProvider object from getLLMProviderFromProjectConfig()
});
console.log(result.successMet); // true/false
console.log(result.failureMet); // true/false
console.log(result.confidence); // 0.95
console.log(result.reasoning); // "The agent successfully helped..."
generatePersonaMessage()
Generates a contextual user message for continuing a conversation, optionally impersonating a persona.
interface GeneratePersonaMessageInput {
messages: Message[];
persona?: Persona; // Optional - generates generic user message if not provided
scenario: Scenario;
llmProvider: LLMProvider;
model?: string;
}
interface GeneratePersonaMessageResult {
content: string;
rawResponse?: string;
}
function generatePersonaMessage(input: GeneratePersonaMessageInput): Promise<GeneratePersonaMessageResult>;
const result = await generatePersonaMessage({
messages: conversationHistory,
persona: userPersona, // Optional
scenario: testScenario,
llmProvider: provider, // LLMProvider object
});
console.log(result.content); // "I'd like to reschedule my appointment to next Tuesday"
Storage
Runs are stored in data/runs.json within the project directory.