Participant Guide

Competition Instructions

Everything you need to build, submit, and score your ticket-routing agent.

Overview

The STLabs Compe-ticket-ion challenges you to build a container-based agent that classifies support tickets into ordered (integration, method) pairs. Your agent runs inside a sandboxed environment with a fixed 15-minute timeout, and is scored on correctness, speed, and cost efficiency.

1. Getting Started

  1. Sign in with your @mit.edu email using a one-time magic link on the login page.
  2. View your registry credentials on the Submissions page under “Registry Access”. You’ll receive a private ECR namespace and push-only credentials.
  3. Build your agent as a Docker container targeting linux/amd64. Your container’s CMD / ENTRYPOINT is executed as-is inside the sandbox.

2. Input / Output Protocol

Your agent receives ticket data on stdin and must emit results on stdout, one JSON object per line.

Input format (one line per ticket)

{ "ticket_id": "t001", "text": "Customer reports Slack integration not sending notifications..." }

Output format (one line per ticket)

{ "ticket_id": "t001", "pairs": [{ "integration": "slack", "method": "send_message" }] }
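The protocol above can be sketched as a simple read-loop over stdin. This is a minimal illustration only; `classify` here is a hypothetical stand-in for your actual model-backed logic:

```python
import json
import sys

def classify(text):
    # Hypothetical placeholder: swap in your real classifier.
    # Must return an ordered list of {"integration": ..., "method": ...} dicts.
    if "slack" in text.lower():
        return [{"integration": "slack", "method": "send_message"}]
    return []

if __name__ == "__main__":
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        ticket = json.loads(line)
        result = {"ticket_id": ticket["ticket_id"],
                  "pairs": classify(ticket["text"])}
        # One JSON object per line on stdout; flush so partial output still counts.
        print(json.dumps(result), flush=True)
```

Flushing after each line matters because partial output is scored even if your agent later hits the timeout.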

Ordering matters

The pairs array must be returned in the same order the integrations/methods appear in the ticket text.

Scoring uses a longest-common-subsequence (LCS) algorithm on the ordered list, so swapping the order of correct pairs will lower your accuracy score.
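To see why ordering matters, here is one way an LCS-based per-ticket accuracy could be computed. This is a sketch of the general technique, not the platform's exact implementation:

```python
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def ticket_accuracy(predicted, truth):
    # Pairs are compared as ordered (integration, method) tuples.
    pred = [(p["integration"], p["method"]) for p in predicted]
    gold = [(p["integration"], p["method"]) for p in truth]
    return lcs_length(pred, gold) / len(gold) if gold else 1.0

truth = [{"integration": "slack", "method": "send_message"},
         {"integration": "jira", "method": "create_issue"}]
swapped = [truth[1], truth[0]]

ticket_accuracy(truth, truth)    # 1.0
ticket_accuracy(swapped, truth)  # 0.5, since only one pair can match in order
```

Both pairs in `swapped` are individually correct, yet the score drops to 0.5 because an LCS can only keep one of them in order.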

Semantic matching

Integration and method names in the truth set may not be identical to the phrasing in the ticket text — they can be semantically similar.

For example, a ticket mentioning “posting a message in Slack” might map to { "integration": "slack", "method": "send_message" }. Your agent should normalize to the canonical integration/method vocabulary rather than echoing the exact words from the ticket.
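One simple normalization approach is a lookup table from common surface phrasings to canonical names. The vocabulary below is purely illustrative, not the official canon:

```python
# Hypothetical aliases mapping ticket phrasings to canonical method names.
METHOD_ALIASES = {
    "posting a message": "send_message",
    "sending a notification": "send_message",
    "send a dm": "send_message",
}

def normalize_method(phrase):
    # Fall back to the original phrase when no alias is known.
    return METHOD_ALIASES.get(phrase.lower().strip(), phrase)

normalize_method("Posting a message")  # "send_message"
```

In practice you would likely let the model emit canonical names directly via the prompt, and use a table like this only as a safety net.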

Rules

  • Each output line must include ticket_id and a pairs array with integration and method fields.
  • First valid output per ticket_id wins; duplicates trigger a warning but are not penalized.
  • Unknown ticket IDs are silently ignored.
  • Unknown integration or method values are treated as mismatches.
  • Partial output is scored; there is no failure penalty.
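Before printing, it may be worth validating each output line against these rules. A minimal check, as one possible sketch, might look like:

```python
def is_valid_output(obj):
    # Must have a ticket_id and a pairs list of dicts,
    # each carrying integration and method fields.
    if not isinstance(obj, dict) or "ticket_id" not in obj:
        return False
    pairs = obj.get("pairs")
    if not isinstance(pairs, list):
        return False
    return all(
        isinstance(p, dict) and "integration" in p and "method" in p
        for p in pairs
    )

is_valid_output({"ticket_id": "t001",
                 "pairs": [{"integration": "slack", "method": "send_message"}]})  # True
is_valid_output({"ticket_id": "t001", "pairs": [{"integration": "slack"}]})       # False
```

Since unknown ticket IDs are silently ignored and partial output is scored, dropping an invalid line is safer than emitting malformed JSON.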

3. Runtime Environment

Your container runs in an E2B sandbox with a 15-minute timeout and restricted network egress (platform domain only).

Environment variables available to your agent

Variable               Description
OPENROUTER_BASE_URL    Base URL for the OpenRouter proxy provided by the platform.
OPENROUTER_API_KEY     Per-submission signed proxy token for model API calls.
RUN_PROTOCOL_VERSION   Protocol version string for forward compatibility.

4. LLM Access

You cannot call your own models or any external LLM API directly.

The sandbox firewall blocks all outbound traffic except to the platform domain. All model calls must go through the platform’s OpenRouter proxy at the URL provided in OPENROUTER_BASE_URL. Attempts to reach OpenAI, Anthropic, or any other provider directly will fail with a network error.

The proxy is fully compatible with the OpenAI Python SDK. Point base_url at the environment variable and authenticate with the provided proxy token:

Python example using the OpenAI SDK

import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["OPENROUTER_BASE_URL"],
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# In your agent, ticket_text comes from the stdin input line for each ticket.
ticket_text = "Customer reports Slack integration not sending notifications..."

response = client.chat.completions.create(
    model="openai/gpt-4o-mini",  # must be an allowlisted model
    messages=[
        {"role": "system", "content": "You are a ticket routing assistant."},
        {"role": "user", "content": ticket_text},
    ],
)

print(response.choices[0].message.content)

This works because OpenRouter exposes an OpenAI-compatible /chat/completions endpoint. Simply swap base_url to the proxy and you’re set — no other code changes required.

  • Only allowlisted models may be used; requests for other models will be rejected by the proxy. (All models are currently allowed.)
  • A per-submission cost cap is enforced — exceeding it will terminate your run immediately.

5. Submitting Your Agent

  1. Build and tag your container for linux/amd64.
  2. Push the image to your assigned ECR namespace using the credentials shown on the Submissions page.
  3. The platform automatically detects pushes and enqueues an evaluation run. You can track status on the Submissions page.
  4. Up to 3 queued submissions are allowed at a time. You can cancel queued submissions before they start running.

6. Scoring

Each submission is scored using the formula:

final_score = (w1 * completion_rate)
            + (w2 * accuracy)
            + (w3 * speed_score)
            + (w4 * cost_score)
  • Accuracy is LCS-based on ordered (integration, method) pairs per ticket.
  • Speed score = min(1, baseline_ms / mean_time_ms)
  • Cost score = min(1, baseline_usd / cost_per_ticket)
  • Fixed baselines are used — no peer normalization.

Scoring weights (w1–w4) are set before the competition begins, and we do not plan to change them during the event. In the unlikely event that a weight adjustment is necessary, all leaderboard submissions will be automatically re-scored to ensure fairness.

7. Leaderboard

  • Your best score under the active scoring version is shown.
  • Failed or timed-out runs are excluded.
  • Tie-breaker: earliest completed_at timestamp.
  • During the event, only aggregate metrics are shown. Per-ticket breakdowns are not revealed.

8. Rules and Constraints

  • Your container must target linux/amd64 and use digest-pinned image references.
  • Network access is restricted to the platform domain only — no external API calls.
  • Submissions are evaluated against a fixed 200-ticket set that is disjoint from any samples or templates.
  • Ground truth is never revealed to participants.
  • Credential rotation has a cooldown period. Plan key rotations accordingly.

9. Prizes

First Place

$500 Amazon Gift Card

Second Place

$200 Amazon Gift Card

Third Place

$50 Amazon Gift Card