High-Level Overview
Task Types
An overview of the diverse task categories we design to enhance frontier AI capabilities.
Reasoning-Focused Tasks
Rigorous problem sets crafted to build foundational and advanced logical capabilities.
Foundational Mathematics
High school and AIME-level math problems designed to instill basic mathematical reasoning and structural logic in AI models.
Challenging HLE Problems
Original problem sets spanning undergraduate to PhD and IMO-level difficulty, built to sharpen multi-step reasoning and complex problem-solving and to guide frontier AI models toward the Humanity’s Last Exam benchmark.
Education-Focused Tasks
Scenarios simulating real-world academic interactions and structured learning environments.
Student Prompt Simulation
Realistic queries from students (ages 13-24) targeting homework assistance, complex topic explanation, academic paper interpretation, and open-ended research.
Agentic Environments
Complex “world scenarios” populated with hundreds of distinct artifacts, training models to act as autonomous educators and teaching assistants.
Generalist Data
Broad-domain prompts designed to test everyday utility, safety, and behavioral compliance.
Adversarial Red Teaming
Specialized “trick” prompts utilizing roleplay, metaphors, and complex formatting to stress-test rule adherence and prevent sensitive content generation.
Student-Life Tasks
Everyday student queries focused on general knowledge, productivity planning, study scheduling, and revision material generation.
Task Details
The structural components and deliverables that make up our high-quality training datasets.
Prompt Artifacts
The varied input components and structural configurations that make up the initial query.
Mixed Modalities
Single and multi-turn interactions spanning text-only inputs to complex mixed media, including geometric diagrams, graphs, and technical file analysis.
System Prompts
Targeted system instructions, often situating the model as an expert mathematician, followed by the prompt designed by the writer.
Realistic Noise Injection
Prompts intentionally layered with ambiguity, contradictions, sudden topic shifts, or incomplete context to test models under authentic, non-idealized conditions.
Aspirational Tasks
Requests that demand actions beyond current model capabilities, such as executing local file edits or sending live emails.
Chain of Thought (CoT) Analysis
Detailed breakdowns of model reasoning to pinpoint weaknesses and guide reinforcement learning.
Model Failure Justification
Expert analysis of model responses to isolate specific logic breakdowns, paired with clear, actionable feedback on errors and areas of improvement.
Failure Taxonomy
Systematic categorization of CoT flaws into Hard failures (hallucinations, factual mistakes, logical flaws) and Soft failures (compliance, clarity, formatting, tone).
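As an illustration only, the Hard/Soft split described above could be encoded as a lookup table. The label strings, the `Severity` enum, and the `summarize_failures` helper are hypothetical names chosen for this sketch; only the category lists themselves come from the text.

```python
from enum import Enum


class Severity(Enum):
    HARD = "hard"
    SOFT = "soft"


# Failure labels grouped as in the taxonomy above (label spellings are assumed).
FAILURE_TAXONOMY = {
    "hallucination": Severity.HARD,
    "factual_mistake": Severity.HARD,
    "logical_flaw": Severity.HARD,
    "compliance": Severity.SOFT,
    "clarity": Severity.SOFT,
    "formatting": Severity.SOFT,
    "tone": Severity.SOFT,
}


def summarize_failures(labels):
    """Count hard and soft failures among the labels attached to one CoT."""
    counts = {Severity.HARD: 0, Severity.SOFT: 0}
    for label in labels:
        counts[FAILURE_TAXONOMY[label]] += 1
    return counts


summary = summarize_failures(["hallucination", "tone", "clarity"])
```

Here `summary` holds one hard failure and two soft failures; an unknown label raises `KeyError`, which keeps annotations inside the agreed taxonomy.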
RLHF Preference Rating
Comparative evaluations rating model responses on a 1-5 scale, backed by dimension-by-dimension analysis detailing exactly why one response outperforms another.
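A minimal sketch of how one such comparison might be recorded, assuming each response receives its own 1-5 score and that the per-dimension notes are free text. The `PreferenceRating` class and its field names are hypothetical, not a documented schema.

```python
from dataclasses import dataclass, field


@dataclass
class PreferenceRating:
    response_a: int  # score for response A, 1 (worst) to 5 (best)
    response_b: int  # score for response B, same scale
    # Hypothetical: one short justification per evaluated dimension.
    dimension_notes: dict = field(default_factory=dict)

    def __post_init__(self):
        for score in (self.response_a, self.response_b):
            if not 1 <= score <= 5:
                raise ValueError("ratings must be between 1 and 5")

    def preferred(self):
        """Return 'A', 'B', or 'tie' depending on which score is higher."""
        if self.response_a == self.response_b:
            return "tie"
        return "A" if self.response_a > self.response_b else "B"


rating = PreferenceRating(
    response_a=4,
    response_b=2,
    dimension_notes={"accuracy": "B misapplies the chain rule in step 3."},
)
```

With these example scores, `rating.preferred()` returns `"A"`.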
Response Artifacts
High-fidelity expected answers, hints, and step-by-step reasoning guides.
Golden Responses
Authoritative, formal solutions for text-based tasks or comprehensive descriptions of the ideal state for multimodal outputs.
Intuitive Reasoning Paths
Golden responses structured along natural cognitive paths to optimize the model’s underlying comprehension of complex problems.
Progressive Hints
Variable, difficulty-calibrated clues that highlight critical steps and gently guide the model toward the correct solution without revealing it entirely.
Structured JSON Rubrics
Custom grading criteria used to evaluate the accuracy and quality of the model’s responses.
Coverage and Verifiability
Criteria designed for auto-grader verification, structured as one-dimensional rubrics for single-answer problems or multi-dimensional rubrics to accommodate multiple valid solution paths.
Rubric Dimensions
Atomic criteria evaluating specific response traits, supported by exact metadata: Description, Rationale, Purpose, Grading Guidance, Criterion Dependence, Source, and Weight.
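To make the metadata list concrete, here is a rough sketch of one criterion serialized to JSON, plus a weighted score over a small rubric. The seven field names follow the list above; the field types, the example values, and the `weighted_score` aggregation rule are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RubricCriterion:
    description: str
    rationale: str
    purpose: str
    grading_guidance: str
    criterion_dependence: list  # assumed: descriptions of prerequisite criteria
    source: str
    weight: float


def weighted_score(criteria, passed):
    """Fraction of total weight earned; passed[i] is True if criterion i is met."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c, ok in zip(criteria, passed) if ok)
    return earned / total if total else 0.0


criteria = [
    RubricCriterion(
        description="States the final answer x = 7.",
        rationale="The problem has a single correct value.",
        purpose="Verify correctness.",
        grading_guidance="Accept equivalent forms such as 7.0.",
        criterion_dependence=[],
        source="Golden response",
        weight=3.0,
    ),
    RubricCriterion(
        description="Shows the substitution step.",
        rationale="Confirms the answer was derived rather than guessed.",
        purpose="Verify reasoning.",
        grading_guidance="Any valid substitution is acceptable.",
        criterion_dependence=["States the final answer x = 7."],
        source="Golden response",
        weight=1.0,
    ),
]

rubric_json = json.dumps([asdict(c) for c in criteria], indent=2)
score = weighted_score(criteria, [True, False])
```

In this example only the first criterion is met, so `score` is 3.0 out of a total weight of 4.0, or 0.75.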
Criterion Categorization
Specific tags (e.g., Quantitative Reasoning, Style, Safety, Compliance, Extraction) assigning an exact evaluation target to each atomic rubric criterion.
Baseline Human Rating
Expert-graded sample responses with detailed justifications, establishing a behavioral baseline for future auto-graders.
Quality Assurance
Rigorous verification pipelines ensuring every dataset meets the highest standards of correctness and utility.
Review Process
Multi-stage expert evaluation to validate task alignment, accuracy, and difficulty.
Workflow Overview
Trained reviewers approve, discard, or return tasks for editing, ensuring strict adherence to project-specific guidelines and scope.
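The three reviewer outcomes can be pictured as a small decision function. This is a sketch under stated assumptions: the `Decision` enum mirrors the outcomes named above, but the inputs (`in_scope`, `is_correct`, `fixable`) and the order of the checks are hypothetical and do not describe the actual review guidelines.

```python
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    DISCARD = "discard"
    RETURN_FOR_EDITING = "return_for_editing"


def route_task(in_scope: bool, is_correct: bool, fixable: bool) -> Decision:
    """Hypothetical routing rule for a reviewed task."""
    if not in_scope:
        # Tasks outside the project scope are dropped outright.
        return Decision.DISCARD
    if is_correct:
        return Decision.APPROVE
    # In-scope but flawed: send back only if the writer can repair it.
    return Decision.RETURN_FOR_EDITING if fixable else Decision.DISCARD


decision = route_task(in_scope=True, is_correct=False, fixable=True)
```

Here `decision` is `Decision.RETURN_FOR_EDITING`, since the task is in scope and flawed but repairable.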
Blind-Solve Mechanic
Reviewers attempt problems without prior access to solutions or rubrics, ensuring unbiased evaluation of the answer, difficulty, dataset quality, and alternative solution paths.
Automated Pipeline Checks
Interface-integrated automated checks that verify the structural accuracy of Prompts, CoT analysis, Golden Responses, and Rubrics, optimizing our experts’ time and minimizing mistakes.
Super Review Process and Quality Control
Our most trusted experts form the final authority before client delivery. They conduct strict originality checks and quality control, ensuring our entire talent pool remains perfectly calibrated. Only tasks passing this uncompromising review enter the final dataset.
What do you need to enhance your AI?