The Problem That Made Me Look at Stackless Coroutines
The callback hell wasn’t obvious at first. We had 400+ NPCs each running layered AI behaviors — patrol, investigate, combat, flee, idle chatter — and every transition between states lived in a different callback registered somewhere across six files. Debugging a report like “guard gets stuck after hearing a sound” meant tracing execution across a web of lambdas, state enums, and deferred function pointers. I’d spend 45 minutes just reconstructing what the sequence of events should have been before I could even start looking for the bug.
We tried stackful coroutines first. The implementation used setjmp/longjmp under the hood — each coroutine got a real stack allocation. 64KB per coroutine was the safe minimum we needed to avoid stack overflows inside deeply nested AI utility calls. Do the math: 400 NPCs × some behaviors each × 64KB and you’re looking at tens of megabytes just for coroutine stacks. On a PS5 that’s survivable. On Switch, where you’re fighting for every MB of RAM, it killed the budget immediately. We had to cap active fibers and write a recycling pool, which re-introduced half the complexity we were trying to escape.
The turning point was seeing a behavior tree node rewritten as a coroutine. The original “patrol to waypoint, wait, look around, repeat” behavior was spread across a state enum with 8 values, 3 callback registrations, 2 timer handles, and a context struct passed by pointer. The coroutine version looked like this:
// Before: state machine with 8 states, timers, callbacks, context struct
// After: this.
Task patrolBehavior(NPC& npc, const PatrolRoute& route) {
for (auto& waypoint : route.points) {
npc.moveTo(waypoint);
co_await npc.reachedDestination(); // suspend until arrival
co_await waitSeconds(1.5f); // suspend, don't block thread
npc.playAnimation("look_around");
co_await waitSeconds(0.8f);
}
}
Eleven lines, comments included. Every suspension point readable in sequence. No context struct, no registered callbacks, no state enum. I was sold immediately — not because it was elegant, but because the next junior dev who touched this code could actually reason about it.
The “stackless” constraint sounds scarier than it is. What it means concretely: there’s no separately allocated call stack for the coroutine. The compiler transforms your function into a state machine and stores only the local variables that need to survive a suspension — in a heap-allocated frame, typically a few hundred bytes. The catch is that co_await can only appear at the top level of the coroutine function itself, not inside a helper function it calls. You can’t do co_await buried inside doSomethingComplex() unless that function is also a coroutine. For game AI tasks this constraint almost never matters. Patrol, attack sequences, dialogue trees, cutscene scripting — these are all naturally flat sequences of “do thing, wait, do next thing.” The suspension points are obvious and top-level by design. Where it gets complicated is utility code shared across many behaviors, which I’ll get into — but the fix is usually just making that utility function a coroutine too, not restructuring your whole design.
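To make that concrete, here's the shape of the fix: a sketch reusing the hypothetical NPC API from the patrol example above.
// co_await inside a plain helper won't compile; the helper must be a coroutine too
Task lookAround(NPC& npc) {  // shared utility, promoted to a coroutine
    npc.playAnimation("look_around");
    co_await waitSeconds(0.8f);  // legal here: this function is itself a coroutine
}
Task investigateNoise(NPC& npc, Vec3 noiseOrigin) {
    npc.moveTo(noiseOrigin);
    co_await npc.reachedDestination();
    co_await lookAround(npc);  // awaiting the helper, exactly like any other awaitable
}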
Prerequisites and Compiler Setup
The thing that surprises most people is how little setup this actually requires. C++20 coroutines are in the standard library — no Boost, no third-party headers, no vcpkg incantation. You just need a compiler that’s new enough and the right flag. The catch is “new enough” was a moving target through 2021-2022 as compilers shipped partial implementations, so pinning to specific minimums matters.
Your minimums: MSVC 19.28+ (that’s Visual Studio 2019 version 16.8 — not just “VS 2019”), GCC 10, or Clang 12. Anything older and you’ll get either missing <coroutine> header errors or subtly broken promise type behavior that only blows up at runtime. I’ve been burned by a GCC 9 machine in CI that compiled without warnings and then segfaulted on the first co_await. Pin the version.
The flags couldn’t be simpler:
# GCC or Clang
g++ -std=c++20 -O2 game.cpp -o game
# MSVC (in your .vcxproj or via CLI)
cl /std:c++20 /O2 game.cpp
For CMake — which is what most cross-platform game projects use — do both of these together. The CMAKE_CXX_STANDARD global sets the floor, but target_compile_features makes it a hard requirement that fails loudly at configure time instead of silently building C++17:
cmake_minimum_required(VERSION 3.20)
project(MyGame)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON) # don't silently fall back to an older standard
set(CMAKE_CXX_EXTENSIONS OFF) # avoid GNU extensions polluting your headers
add_executable(my_game main.cpp)
target_compile_features(my_game PRIVATE cxx_std_20)
The Unreal situation is genuinely annoying. UE5.3 shipped experimental coroutine support, but “experimental” here means the promise type assumptions baked into Unreal’s build system can conflict with your own coroutine machinery — particularly around allocator customization and the get_return_object lifetime. My recommendation: drop a standalone header like cppcoro or write your own thin wrapper, compile it outside of Unreal’s unity build, and interface via a normal task queue. UE4 is a hard no — the MSVC version it ships with predates coroutine support entirely, and patching the toolchain voids your Epic support agreement in practice. For console targets, PS5 and Xbox Series X/S SDK revisions from 2023 both landed C++20 coroutine support — the exact SDK version numbers are behind NDA so check your dev portal, but if you’re on a 2023 or later SDK you’re almost certainly fine. Test with a trivial co_return before writing any real coroutine logic against a new SDK drop.
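That smoke test can be as small as this; if it builds and runs cleanly against a fresh SDK, the basic machinery is wired up:
// Trivial co_return probe: the smallest possible coroutine smoke test
#include <coroutine>
#include <exception>
struct Probe {
    struct promise_type {
        Probe get_return_object() { return {}; }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }  // frame self-destructs on completion
        void return_void() noexcept {}
        void unhandled_exception() noexcept { std::terminate(); }
    };
};
Probe probe() { co_return; }
int main() { probe(); }  // runs to completion synchronously; frame freed automatically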
How the C++20 Coroutine Machinery Actually Works
The thing that tripped me up first was thinking coroutines were compiler magic. They’re not — they’re a mechanical transformation. When you write a coroutine function, the compiler splits it into a state machine, allocates a coroutine frame on the heap, and hands you back a handle. Every suspension point becomes a numbered state. The original stack frame is gone by the time your caller resumes execution. This is the “stackless” part: there’s no hidden call stack being preserved, just a heap blob holding your locals and a resume pointer.
Each keyword maps to a concrete operation:
- co_await expr calls expr.await_ready(), and if it returns false, suspends the coroutine and calls expr.await_suspend(handle). When resumed, expr.await_resume() supplies the result of the expression. The compiler injects all three calls around what you wrote.
- co_yield value is syntactic sugar for co_await promise.yield_value(value). Your promise type defines what yielding even means — it's not a language-level concept, it's your code.
- co_return value calls promise.return_value(value), then runs the final suspend point. After that, the coroutine frame is destroyed (if you set up final_suspend to allow it).
The coroutine frame allocation is where gamedevs need to pay attention. By default, the compiler emits an operator new call to allocate the frame. In a hot path — say, spawning 500 enemy behavior coroutines on level load — that’s 500 separate heap allocations with whatever fragmentation your allocator produces. The fix is to provide operator new and operator delete inside your promise type. The compiler will use those instead. You can route them to a pool allocator, a linear arena, whatever your engine already uses. There’s also HALO (Heap Allocation eLision Optimization) in the spec, where the compiler can eliminate the allocation entirely if the coroutine lifetime is provably bounded — but don’t count on it in practice; it works in trivial cases and falls apart with virtual dispatch or complex callsites.
Symmetric transfer is the thing that actually makes chaining coroutines safe. Without it, resuming one coroutine from another’s await_suspend is a regular function call — so if you chain 10,000 coroutines, you blow your call stack. With symmetric transfer, await_suspend returns a coroutine_handle<> instead of void, and the compiler turns the resume into a tail call. The chain collapses to O(1) stack depth. You need to return std::noop_coroutine() at the end of the chain to terminate cleanly. Clang has supported this since Clang 11; MSVC got it in VS 2022 17.x. GCC’s support landed in GCC 11 but verify with your specific target.
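Stripped to its essentials, the mechanism looks like this. A sketch only: the next handle is whatever your continuation wiring stores.
// Symmetric transfer in isolation: await_suspend returns the handle to run next
#include <coroutine>
struct ChainAwaiter {
    std::coroutine_handle<> next;  // wired up by whoever owns the chain (assumed)
    bool await_ready() const noexcept { return false; }
    std::coroutine_handle<> await_suspend(std::coroutine_handle<>) noexcept {
        // Returning a handle (not void) makes the compiler tail-call into it: O(1) stack.
        // noop_coroutine() terminates the chain without resuming anything.
        return next ? next : std::noop_coroutine();
    }
    void await_resume() const noexcept {}
};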
Here’s a minimal promise_type that actually compiles. This gives you a generator-style coroutine you can iterate in a range-based for loop:
#include <coroutine>
#include <optional>
#include <cstdio>
template<typename T>
struct Generator {
struct promise_type {
std::optional<T> current_value;
// Compiler calls this to construct the return object
Generator get_return_object() {
return Generator{std::coroutine_handle<promise_type>::from_promise(*this)};
}
// Don't run the body until first next() call
std::suspend_always initial_suspend() { return {}; }
// Keep frame alive so caller can read current_value after loop ends
std::suspend_always final_suspend() noexcept { return {}; }
// co_yield routes here — store value, then suspend
std::suspend_always yield_value(T value) {
current_value = std::move(value);
return {};
}
void return_void() { current_value.reset(); }
void unhandled_exception() { std::terminate(); }
// Route frame alloc through your engine allocator here
// void* operator new(std::size_t sz) { return MyPool::alloc(sz); }
// void operator delete(void* ptr) { MyPool::free(ptr); }
};
std::coroutine_handle<promise_type> handle;
explicit Generator(std::coroutine_handle<promise_type> h) : handle(h) {}
~Generator() { if (handle) handle.destroy(); }
// Non-copyable, movable
Generator(const Generator&) = delete;
Generator(Generator&& other) noexcept : handle(other.handle) { other.handle = nullptr; }
bool next() {
handle.resume();
return !handle.done();
}
T& value() { return *handle.promise().current_value; }
};
// Usage — enemy patrol waypoints, AI state steps, anything sequential
Generator<int> countdown(int from) {
for (int i = from; i > 0; --i)
co_yield i;
}
int main() {
auto gen = countdown(5);
while (gen.next())
printf("%d\n", gen.value()); // prints 5 4 3 2 1
}
The final_suspend returning suspend_always is deliberate — if it returned suspend_never, the frame would be destroyed automatically, and your Generator destructor would call handle.destroy() on a dangling handle. That’s the kind of silent UB that only shows up under ASAN or after a week of intermittent crashes. Keep final_suspend as suspend_always and let the owning object control destruction.
Building a Game-Friendly Coroutine Task Type from Scratch
Why a Raw Coroutine Handle Will Leak All Over Your Game
The first time I used std::coroutine_handle<> directly in a game loop, I leaked coroutine frames for three hours before figuring out what happened. The handle is literally just a pointer — it owns nothing, destroys nothing, and if you lose track of it, the frame sits in memory until the process exits. For a game running at 60fps spawning coroutines for AI, cutscenes, and tweens, that’s a slow-motion memory disaster. You need a Task<T> wrapper that takes ownership of the handle the moment it’s created and calls destroy() in its destructor. That’s the entire motivation. Everything else flows from that.
The Promise Type Is the Real API Surface
People get confused about where to put logic in a coroutine type. The answer is almost always the promise_type. The compiler synthesizes calls to it — your job is to give it the right shape. Here’s a Task<void> that compiles clean under Clang 16+ and MSVC 19.35+ with /std:c++20:
#pragma once
#include <coroutine>
#include <exception>
// Forward declaration so promise_type can reference Task
template<typename T = void> struct Task;
template<>
struct Task<void> {
struct promise_type {
std::exception_ptr exception;
Task get_return_object() {
// Hand the coroutine handle back to Task immediately.
// At this point the coroutine frame is already allocated.
return Task{ std::coroutine_handle<promise_type>::from_promise(*this) };
}
// Suspend on entry — we decide when to resume, not the runtime.
std::suspend_always initial_suspend() noexcept { return {}; }
// Suspend on exit so the parent can observe completion
// before we destroy the frame. If we returned suspend_never here,
// a co_awaiting parent would read destroyed memory.
std::suspend_always final_suspend() noexcept { return {}; }
void return_void() noexcept {}
void unhandled_exception() noexcept {
exception = std::current_exception();
}
};
using handle_type = std::coroutine_handle<promise_type>;
explicit Task(handle_type h) : handle(h) {}
// No copies — exactly one owner of the frame at all times.
Task(const Task&) = delete;
Task& operator=(const Task&) = delete;
Task(Task&& other) noexcept : handle(other.handle) {
other.handle = nullptr;
}
~Task() {
// This is the whole point of the wrapper.
if (handle) handle.destroy();
}
bool done() const { return handle.done(); }
void resume() { handle.resume(); }
private:
handle_type handle;
};
The final_suspend returning std::suspend_always is the most surprising gotcha here. Your instinct might be to return std::suspend_never so cleanup is automatic, but if another coroutine is co_awaiting this task, it needs to read the result or exception after the coroutine body finishes and before the frame is torn down. Suspend at the end, let the parent read what it needs, then the owning Task destructor calls destroy(). That’s the safe sequence.
Making Task Awaitable — The await_suspend Chain
Once you have a runnable Task, you want to write co_await someTask from inside another coroutine. That requires implementing the three awaiter methods. The trick is that await_suspend receives the calling coroutine’s handle, and that’s how you wire parent-child resumption without a scheduler — you resume the child immediately and store the parent to resume later:
// Add this inside Task<void> as a nested struct, or as operator co_await()
struct Awaiter {
handle_type child;
bool await_ready() const noexcept {
// Don't skip suspension even if already done —
// edge case: a Task that completes synchronously still needs
// to transfer control cleanly.
return false;
}
// The magic: suspend the parent, resume the child.
// When the child hits final_suspend, we need a way back to the parent.
// Store the parent handle in the child's promise.
std::coroutine_handle<> await_suspend(std::coroutine_handle<> parent) noexcept {
child.promise().continuation = parent;
return child; // symmetric transfer — no stack growth
}
void await_resume() {
if (child.promise().exception)
std::rethrow_exception(child.promise().exception);
}
};
auto operator co_await() noexcept { return Awaiter{ handle }; }
Add a std::coroutine_handle<> continuation field to promise_type, and update final_suspend to return a custom awaiter that resumes it. This is called symmetric transfer — await_suspend returns a handle instead of void, which tells the runtime “resume this handle next” without pushing a new frame onto the call stack. Without it, deeply nested co_await chains will stack overflow around 10–20K levels deep, which you absolutely will hit in a game that chains cutscene tasks.
Pulling Coroutine Frames from a Pool Instead of malloc
Every coroutine frame allocation goes through operator new on the promise type. The compiler checks for it via a specific overload signature. This is how you redirect it to a pool allocator — I use a simple 64KB slab divided into 512-byte slots for short-lived game tasks:
struct promise_type {
// ... existing fields ...
// Compiler calls this with the frame size it calculated.
// Must be a static member with exactly this signature.
static void* operator new(std::size_t size) {
return GameCoroutinePool::alloc(size);
}
static void operator delete(void* ptr, std::size_t size) {
GameCoroutinePool::free(ptr, size);
}
};
The size parameter in operator delete is critical — without it the compiler may not generate sized deallocation and your pool can’t figure out which slot to return. Frame sizes vary based on captured locals; a coroutine capturing a glm::mat4 and a few ints will be around 80–100 bytes, while one holding an async load buffer might be several KB. Profile your actual frame sizes by logging the size argument the compiler passes into your promise's operator new; sizeof on the promise type alone won't include locals or internal bookkeeping. Getting this wrong means silent pool corruption — not a crash you’ll catch immediately.
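Here's what that logging hook can look like in a debug build, a sketch over the same promise_type using the pool call from above:
// Debug-only variant of the frame allocator. The size the compiler passes in
// is the authoritative frame size: locals, promise, and bookkeeping included.
static void* operator new(std::size_t size) {
#ifndef NDEBUG
    std::printf("coroutine frame alloc: %zu bytes\n", size);  // needs <cstdio>
#endif
    return GameCoroutinePool::alloc(size);
}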
The Full ~80-Line Header, Ready to Drop In
#pragma once
#include <coroutine>
#include <exception>
#include <utility>
// Minimal pool stub — replace with your allocator
namespace GameCoroutinePool {
inline void* alloc(std::size_t n) { return ::operator new(n); }
inline void free(void* p, std::size_t) { ::operator delete(p); }
}
template<typename T = void> struct Task;
template<>
struct Task<void> {
struct promise_type;  // forward-declared so FinalAwaiter can name it (see note below)
struct FinalAwaiter {
bool await_ready() const noexcept { return false; }
// Resume whoever co_awaited us
std::coroutine_handle<> await_suspend(
std::coroutine_handle<promise_type> h) noexcept {
auto cont = h.promise().continuation;
return cont ? cont : std::noop_coroutine();
}
void await_resume() noexcept {}
};
struct promise_type {
std::exception_ptr exception;
std::coroutine_handle<> continuation;
static void* operator new(std::size_t n) {
return GameCoroutinePool::alloc(n);
}
static void operator delete(void* p, std::size_t n) {
GameCoroutinePool::free(p, n);
}
Task get_return_object() {
return Task{ std::coroutine_handle<promise_type>::from_promise(*this) };
}
std::suspend_always initial_suspend() noexcept { return {}; }
FinalAwaiter final_suspend() noexcept { return {}; }
void return_void() noexcept {}
void unhandled_exception() noexcept {
exception = std::current_exception();
}
};
using handle_type = std::coroutine_handle<promise_type>;
struct Awaiter {
handle_type child;
bool await_ready() const noexcept { return false; }
std::coroutine_handle<> await_suspend(
std::coroutine_handle<> parent) noexcept {
child.promise().continuation = parent;
return child;
}
void await_resume() {
if (child.promise().exception)
std::rethrow_exception(child.promise().exception);
}
};
explicit Task(handle_type h) : handle(h) {}
Task(const Task&) = delete;
Task& operator=(const Task&) = delete;
Task(Task&& o) noexcept : handle(std::exchange(o.handle, nullptr)) {}
~Task() { if (handle) handle.destroy(); }
Awaiter operator co_await() noexcept { return Awaiter{ handle }; }
bool done() const noexcept { return handle.done(); }
void resume() const { handle.resume(); }
void start() const { handle.resume(); } // alias — clearer at call site
private:
handle_type handle;
};
One thing that tripped me up: FinalAwaiter references promise_type before it’s defined, which is why the header forward-declares promise_type right above it. MSVC is stricter about this than Clang, so if a variation of this header compiles on one and not the other, that’s where to look first. The std::noop_coroutine() fallback in FinalAwaiter::await_suspend handles root-level tasks that nothing is awaiting — without it, returning a null handle is UB.
Writing Your First Game Coroutine: NPC Patrol Behavior
The NPC Patrol Coroutine That Changed How I Think About Game Logic
The moment that clicked for me was realizing that NPC patrol logic is essentially a sequential script — walk here, wait, walk there, react — but state machines force you to invert that into “what state am I in, and what do I do this frame?” Coroutines let you write the script directly. Here’s the full patrol NPC we’re building: walks to waypoint A, waits 2 seconds, walks to B, and interrupts everything if the player enters line-of-sight.
// The coroutine — reads exactly like a designer's behavior spec
Task NPCPatrol::RunPatrol(NPC& npc)
{
while (true)
{
// Request navmesh path asynchronously — suspends until path is ready
NavPath pathToA = co_await RequestNavPath(npc.position, waypointA);
co_await npc.WalkPath(pathToA); // suspends until arrival
co_await WaitForSeconds(2.0f); // suspends for 2 game-time seconds
NavPath pathToB = co_await RequestNavPath(npc.position, waypointB);
co_await npc.WalkPath(pathToB);
// co_await on a condition — suspends until lambda returns true
// or we reach the next waypoint loop, whichever comes first
if (co_await WaitUntil([&]{ return npc.CanSeePlayer(); }, 0.5f))
{
co_await RunAlertBehavior(npc); // pivot to combat coroutine
}
}
}
Now look at what this same logic looked like as a state machine. The behavioral intent is identical — but the state machine version needs you to reconstruct the narrative from fragments:
// State machine version — same behavior, zero readability
enum class PatrolState { RequestPathA, WalkingToA, Waiting, RequestPathB, WalkingToB, Alert };
void NPCPatrol::Update(float dt)
{
switch (state)
{
case PatrolState::RequestPathA:
pathRequestHandle = navmesh.RequestPath(npc.position, waypointA);
state = PatrolState::WalkingToA;
break;
case PatrolState::WalkingToA:
if (!pathRequestHandle.IsComplete()) break; // forgot to handle this? enjoy your bug
if (npc.HasReachedDestination())
{
waitTimer = 2.0f;
state = PatrolState::Waiting;
}
if (npc.CanSeePlayer()) state = PatrolState::Alert; // interrupt logic scattered everywhere
break;
case PatrolState::Waiting:
waitTimer -= dt;
if (waitTimer <= 0.0f) state = PatrolState::RequestPathB;
if (npc.CanSeePlayer()) state = PatrolState::Alert; // same check, duplicated again
break;
// ... RequestPathB, WalkingToB, Alert — same pattern repeated
}
}
The state machine has the player-spotted check duplicated in every state that cares about it. Add a new interruption condition and you’ll touch six places. The coroutine has one WaitUntil call at the end of the loop that handles it cleanly. The cognitive load difference isn’t subtle.
WaitForSeconds: Hooking a Simple Awaitable Into Your Update Loop
The key insight here is that your awaitable doesn’t do any waiting itself — it just tells the coroutine machinery “don’t resume me until this condition is satisfied,” and your game loop does the actual checking. Here’s the minimal WaitForSeconds implementation:
struct WaitForSeconds
{
float duration;
float elapsed = 0.0f;
// Called immediately when co_await hits this — return true to skip suspension
bool await_ready() const noexcept { return duration <= 0.0f; }
// Called when we actually suspend — store the handle so the scheduler can resume us
void await_suspend(std::coroutine_handle<> handle)
{
// Register with the game's coroutine scheduler
// Scheduler stores (handle, remaining_time) and decrements each frame
CoroutineScheduler::Get().RegisterTimedResume(handle, duration);
}
void await_resume() const noexcept {} // no return value needed
};
// Inside your game loop:
void CoroutineScheduler::Tick(float dt)
{
    for (auto& [handle, remaining] : timedCoroutines)
    {
        remaining -= dt;
        if (remaining <= 0.0f)
            pendingResumes.push_back(handle);
    }
    // Drop expired entries before resuming, or they'd re-fire every frame
    std::erase_if(timedCoroutines, [](const auto& entry) { return entry.second <= 0.0f; });
    // flush pendingResumes after iteration — never resume inside the range-for
    for (auto h : pendingResumes) h.resume();
    pendingResumes.clear();
}
That “never resume inside the range-for” note in the comment? I learned that from a crash. If a resumed coroutine itself registers a new timed entry, you get iterator invalidation on the vector. Always batch into pendingResumes.
WaitUntil: Condition-Based Suspension With an Optional Timeout
The WaitUntil implementation is the same pattern, just with a stored predicate instead of a timer. The trick is that the scheduler needs to poll the predicate each frame, which costs something — so you wouldn’t use this for thousands of NPCs simultaneously. For a reasonable scene with 20-30 active coroutines it’s completely fine.
struct WaitUntil
{
std::function<bool()> predicate;
float timeout; // negative = no timeout
bool timedOut = false;
bool await_ready() const noexcept { return predicate(); }
void await_suspend(std::coroutine_handle<> handle)
{
CoroutineScheduler::Get().RegisterConditionalResume(handle, predicate, timeout,
[this](bool didTimeout) { timedOut = didTimeout; });
}
// Returns true if condition was met, false if we hit the timeout
bool await_resume() const noexcept { return !timedOut; }
};
// Usage — the return value tells you WHY suspension ended
if (co_await WaitUntil([&]{ return npc.CanSeePlayer(); }, 0.5f))
RunAlertBehavior(); // condition met
// else: 0.5s timeout hit, continue patrol normally
The Async Navmesh Path Request: Where Stackless Coroutines Get Interesting
Most navmesh implementations (Recast/Detour, or your engine’s equivalent) do pathfinding on a job thread and call you back. Co-awaiting that is actually the cleanest use case for stackless coroutines — the coroutine suspends on the game thread, the path computes on the job thread, and when it’s done the job system posts a resume to the game thread’s queue. No mutex held across frames, no polling.
struct NavPathAwaitable
{
PathRequestHandle requestHandle;
NavPath result;
NavPathAwaitable(Vec3 from, Vec3 to)
: requestHandle(NavMesh::Get().RequestPath(from, to)) {}
// Cheap check — path might already be done if navmesh is fast
bool await_ready() const noexcept
{
return requestHandle.IsComplete();
}
void await_suspend(std::coroutine_handle<> handle)
{
// Navmesh job calls this lambda when path is ready — on the job thread!
// So we post the resume back to the game thread queue, never resume directly
requestHandle.SetCompletionCallback([handle](NavPath path)
{
GameThread::Post([handle, path]()
{
// Now we're back on game thread — safe to resume
handle.resume();
});
});
}
NavPath await_resume()
{
return requestHandle.GetResult(); // guaranteed complete by this point
}
};
// Clean call site wraps the struct
NavPathAwaitable RequestNavPath(Vec3 from, Vec3 to)
{
return NavPathAwaitable(from, to);
}
The thing that catches people off guard here is that await_suspend is called on the game thread, but the completion callback fires on the job thread — you absolutely cannot call handle.resume() directly from that callback without a lock or a thread-safe queue. I use a lock-free MPSC queue for GameThread::Post and drain it at the start of each Tick. If your engine already has a task graph or dispatcher, post the resume through that instead of rolling your own.
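If you don't have a dispatcher yet, a mutexed two-step queue is the simple, correct baseline (the lock-free MPSC version is an optimization, not a requirement). A sketch of the GameThread::Post/drain pair assumed above:
#include <functional>
#include <mutex>
#include <vector>
namespace GameThread {
    inline std::mutex g_mutex;
    inline std::vector<std::function<void()>> g_posted;
    // Callable from any thread: job threads post resumes here
    inline void Post(std::function<void()> fn) {
        std::lock_guard lock(g_mutex);
        g_posted.push_back(std::move(fn));
    }
    // Call at the start of each game-thread Tick: resumes run here, single-threaded
    inline void Drain() {
        std::vector<std::function<void()>> batch;
        {
            std::lock_guard lock(g_mutex);
            batch.swap(g_posted);
        }
        for (auto& fn : batch) fn();
    }
}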
Integrating Coroutines with Your Game Loop
The thing that tripped me up most when I first wired coroutines into a game loop wasn’t the coroutine mechanics — it was convincing myself I needed something sophisticated before I’d proven the basics worked. Start with the dumbest possible scheduler and upgrade it only when you hit a real constraint.
The Dead-Simple Scheduler (Start Here)
Before you reach for a priority queue or a task graph, this is all you actually need:
// CoroutineScheduler.h
#include <coroutine>
#include <vector>
struct ScheduledCoroutine {
std::coroutine_handle<> handle;
double wake_time = 0.0; // 0 means "wake immediately next frame"
};
class CoroutineScheduler {
public:
std::vector<ScheduledCoroutine> pending;
void enqueue(std::coroutine_handle<> h, double wake_at = 0.0) {
pending.push_back({ h, wake_at });
}
void tick(double current_time) {
// Swap-and-iterate so resumed coroutines can re-enqueue themselves
std::vector<ScheduledCoroutine> this_frame;
std::swap(this_frame, pending);
for (auto& sc : this_frame) {
if (sc.wake_time <= current_time) {
if (!sc.handle.done()) sc.handle.resume();
} else {
// Not ready yet — push back for later
pending.push_back(sc);
}
}
}
};
The swap trick matters. If you iterate pending in-place and a resumed coroutine calls enqueue(), you’ll get iterator invalidation and a crash that only happens in release builds when you have 200 enemies active. The swap costs one pointer copy per frame, which is nothing.
Tick-Based vs. Time-Based Resumption
Storing a wake_time as a double (seconds since level start) gives you real-time delays. But a lot of gameplay logic actually wants tick-based wakeup — “resume me after 3 frames” for deterministic simulation. I ended up keeping both fields:
struct ScheduledCoroutine {
std::coroutine_handle<> handle;
double wake_time = 0.0; // wall/game time in seconds
uint64_t wake_tick = 0; // frame count
};
// In tick():
bool ready = (sc.wake_time <= current_time) && (sc.wake_tick <= current_tick);
Your WaitForSeconds awaitable sets wake_time. Your WaitFrames awaitable sets wake_tick. When both are at their default zero, the coroutine resumes next frame, which is exactly the behavior you want from a plain co_await NextFrame{}. The scheduler doesn’t need to know which one a particular coroutine used — it just checks both conditions.
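The awaitable feeding wake_tick is tiny. A sketch, where g_scheduler and g_current_tick stand in for however your engine exposes the scheduler and frame counter:
// Tick-based awaitable: fills in wake_tick, leaves wake_time at its default
#include <coroutine>
#include <cstdint>
struct WaitFrames {
    uint64_t frames;
    bool await_ready() const noexcept { return frames == 0; }
    void await_suspend(std::coroutine_handle<> h) {
        // g_scheduler / g_current_tick are assumed globals; wire to your engine
        g_scheduler.pending.push_back({ h, 0.0, g_current_tick + frames });
    }
    void await_resume() const noexcept {}
};
// co_await WaitFrames{3};  // resume after exactly 3 ticks
// co_await WaitFrames{1};  // the plain "next frame" case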
Thread Safety — Don’t Guess, Just Mutex
Coroutines are stack frames dressed up in a trenchcoat. resume() executes synchronously on whatever thread calls it. If your render thread and your audio thread both hold a reference to the same handle and call resume() concurrently, you get a data race on the coroutine’s internal state. The compiler won’t warn you. TSAN might catch it. Production won’t.
class ThreadSafeScheduler {
std::mutex mtx;
std::vector<ScheduledCoroutine> pending;
std::vector<ScheduledCoroutine> incoming; // other threads post here
public:
// Call from any thread
void enqueue_threadsafe(std::coroutine_handle<> h, double wake_at = 0.0) {
std::lock_guard lock(mtx);
incoming.push_back({ h, wake_at });
}
// Call from game thread ONLY — no lock needed on pending after merge
void tick(double current_time, uint64_t current_tick) {
{
std::lock_guard lock(mtx);
pending.insert(pending.end(), incoming.begin(), incoming.end());
incoming.clear();
}
// ... rest of tick logic operates on pending without holding the lock
}
};
The key insight: hold the lock only during the merge, not during resumption. Resuming coroutines while holding a mutex is a deadlock waiting to happen the moment a resumed coroutine tries to enqueue something else. This two-queue pattern (main thread owns pending, other threads post to incoming) keeps the critical section to a pointer-copy operation and nothing else.
Cancellation — The Part Everyone Skips
Every coroutine tutorial ends before explaining how to stop a coroutine mid-flight. The naïve approach is calling handle.destroy() from outside — that's only legal while the coroutine is suspended, it's undefined behavior if the coroutine is mid-execution, and even the legal case silently skips everything after the current suspension point. The correct approach is cooperative cancellation via a shared token:
// CancellationToken.h
#include <memory>
#include <atomic>
struct CancellationState {
std::atomic<bool> cancelled = false;
};
// Cheaply copyable — both the coroutine and its owner hold one
struct CancellationToken {
std::shared_ptr<CancellationState> state = std::make_shared<CancellationState>();
void cancel() const { state->cancelled.store(true, std::memory_order_relaxed); }
bool is_cancelled() const { return state->cancelled.load(std::memory_order_relaxed); }
};
// An awaitable that checks the token without ever staying suspended
struct CheckCancellation {
    CancellationToken token;
    bool await_ready() const noexcept { return token.is_cancelled(); }
    // Returning false aborts the suspension: the coroutine continues immediately.
    // (A void await_suspend here would park the coroutine with nothing to resume it.)
    bool await_suspend(std::coroutine_handle<>) noexcept { return false; }
    // Returns true if cancelled — coroutine checks return value
    bool await_resume() const noexcept { return token.is_cancelled(); }
};
Inside your coroutine, you pepper co_await CheckCancellation{token} at natural yield points — after pathfinding, before firing a projectile, at the top of a wait loop. If await_ready() returns true (cancelled), the suspension is skipped outright; if not, await_suspend returns false, which aborts the suspension and continues execution. Either way the coroutine never stays parked, and your code can co_return early when the token reads cancelled. The coroutine destructs normally, all RAII cleanup runs, and nothing explodes. Passing the token in at construction time via a coroutine parameter is cleaner than a global; pass it through the promise type’s constructor if you want it available without threading it through every function signature.
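Putting it together, here's a sketch of a patrol task honoring its token, reusing the hypothetical awaitables from earlier sections (NextWaypoint is invented for the example):
Task patrolLoop(NPC& npc, CancellationToken token) {
    while (true) {
        if (co_await CheckCancellation{token})
            co_return;  // owner called cancel(); destructors and RAII cleanup run normally
        NavPath path = co_await RequestNavPath(npc.position, npc.NextWaypoint());
        co_await npc.WalkPath(path);
        co_await WaitForSeconds(2.0f);
    }
}
// Owner side, e.g. on despawn:
//   token.cancel();  // patrolLoop exits at its next CheckCancellation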
Hooking Into Unreal and Unity
In Unreal, FTickableGameObject is the cleanest hook that doesn’t require an AActor in the world. Inherit from it, implement Tick(float DeltaTime) and IsTickable(), and call your scheduler’s tick() from there. The gotcha is that FTickableGameObject instances must be heap-allocated — stack instances get registered before the engine is ready and crash on shutdown. Use a TUniquePtr owned by your GameInstance.
// In your GameInstance.h
TUniquePtr<FCoroutineSchedulerTickable> CoroutineScheduler;
// FCoroutineSchedulerTickable.h
class FCoroutineSchedulerTickable : public FTickableGameObject {
CoroutineScheduler scheduler;
public:
void Tick(float DeltaTime) override {
GameTime += DeltaTime;
scheduler.tick(GameTime, FrameCount++);
}
bool IsTickable() const override { return true; }
TStatId GetStatId() const override { RETURN_QUICK_DECLARE_CYCLE_STAT(FCoroutineSchedulerTickable, STATGROUP_Tickables); }
double GameTime = 0.0;
uint64_t FrameCount = 0;
};
For Unity via IL2CPP, you’re in C# land at the API surface but the IL2CPP runtime compiles to native code. If you’re shipping a C++ plugin (e.g., via a native .dll or .so), your scheduler lives entirely in C++ and you expose a single SchedulerTick(double gameTime, uint64_t frame) function that Unity’s MonoBehaviour.Update() calls over the P/Invoke bridge. The IL2CPP threading model means your P/Invoke calls happen on the main Unity thread, so you get the thread-safety guarantees of the single-threaded scheduler for free — the danger zone is only if you’re also running work on a System.Threading.Thread that calls back into native code and tries to enqueue coroutines concurrently.
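The native side of that bridge is one exported function. A sketch; the export macro and the GetScheduler() accessor are assumptions:
#include <cstdint>
#if defined(_WIN32)
    #define PLUGIN_API extern "C" __declspec(dllexport)
#else
    #define PLUGIN_API extern "C" __attribute__((visibility("default")))
#endif
// The single function Unity P/Invokes once per frame from MonoBehaviour.Update().
// GetScheduler() is an assumed accessor for your global scheduler instance.
PLUGIN_API void SchedulerTick(double gameTime, uint64_t frame) {
    // Unity's main thread calls this, so the single-threaded scheduler needs no lock here
    GetScheduler().tick(gameTime, frame);
}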
The Three Things That Surprised Me in Production
I went into this project thinking stackless coroutines were basically “lighter threads with manual scheduling.” That framing is close enough to get started but wrong enough to burn you. Here are the three things that actually caught me off guard after shipping a scene-streaming system built on top of them.
Surprise 1: Exceptions Are Disabled and Your Coroutine Will Silently Eat Errors
Almost every game studio I’ve worked with or talked to ships with -fno-exceptions (Clang/GCC) or /EHs-c- (MSVC). The C++ coroutine machinery still compiles under these flags — but the promise type is still required to define unhandled_exception(), and that's where control lands if anything manages to throw (a middleware library built with exceptions enabled, say). Leave it out and the coroutine won't compile; define it as a no-op and errors get silently swallowed. Neither is acceptable. The only sane implementation in an exceptions-disabled codebase is:
void unhandled_exception() noexcept {
// Exceptions are disabled project-wide (/EHs-c-).
// If we somehow land here, the coroutine frame is corrupt — hard abort.
std::abort();
}
Document this in a comment at the top of your promise type, not buried in a wiki page. I’ve seen junior devs copy a coroutine tutorial from cppreference that defines unhandled_exception() as std::rethrow_exception(std::current_exception()). That compiles fine with exceptions disabled, links fine, and then explodes at runtime on the first error path hit during QA. Make the abort path loud and obvious — a crash with a stack trace beats a swallowed error every time.
Surprise 2: The Heap Allocation Happens at the Call Site, Not at First Suspension
This one hurt. I expected coroutine frame allocations to show up in the profiler during the first co_await suspension — the moment the frame needs to outlive the caller’s stack. Wrong. The compiler allocates the coroutine frame the moment you call the coroutine function, before any suspension point is hit. The compiler can elide this allocation (HALO — Heap Allocation eLision Optimization) when it can prove the coroutine lifetime is bounded, but in practice, with virtual dispatch or any non-trivial awaitable, HALO won’t fire.
What this meant for us: scene load kicked off ~400 coroutines in a single frame to manage asset streaming tasks. Profiler showed a spike of 400 heap allocations right at scene load start — not spread across gameplay where I expected them. Each frame was around 256–512 bytes depending on local variable count. The fix was a custom allocator via operator new on the promise type, backed by a pool pre-warmed during engine init:
struct TaskPromise {
// Redirect coroutine frame allocation to a pool.
// Pool is pre-allocated at engine start — zero heap pressure during load.
void* operator new(std::size_t sz) {
return g_coroutine_pool.allocate(sz);
}
void operator delete(void* ptr, std::size_t sz) {
g_coroutine_pool.free(ptr, sz);
}
// ... rest of promise type
};
After that, the scene load spike disappeared from the allocation profiler entirely. If you’re on a console with strict memory budgets, figure out your coroutine frame sizes in a debug build early. sizeof on the handle or promise type won’t give you the full frame size, so log the size the compiler passes to your promise's operator new, or use platform heap profilers or -fsanitize=address with a custom allocator that logs sizes.
Surprise 3: Debugger Support Is Functional on MSVC, Patchy Everywhere Else
Visual Studio 2022 17.6+ genuinely fixed coroutine debugging. You can see local variables from the coroutine frame in the Watch window and Locals panel, step through co_await somewhat sanely, and the Call Stack at least shows the coroutine’s suspension point. It’s not perfect — stack traces through awaitable chains still collapse — but it’s usable. If you’re on Windows, update to at least 17.6 before trying to debug production coroutine code.
GDB 13 is a different story. As of GDB 13, coroutine frame inspection requires knowing the mangled internal frame struct name, which changes with compiler version and is not stable. The practical workflow on Linux is: add a name field to your promise type, write a GDB Python pretty-printer that finds live coroutine handles via your scheduler’s task list, and accept that you will be reading raw memory offsets more than you’d like. LLDB 16 with Clang is noticeably better — the frame variable command can navigate into the coroutine activation record if the debug info is complete. My recommendation: if your team is Linux-first, build debug tooling early. Don’t wait until you have 50 coroutine types and a live bug.
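The cheapest piece of that tooling is giving every frame a readable name. C++20 selects a promise constructor that matches the coroutine's parameter list when one exists, so the first argument can double as a debug label. A sketch:
// Promise that captures a debug name from the coroutine's first argument.
// The compiler picks this constructor automatically when it matches the
// coroutine's parameters; otherwise it falls back to the default constructor.
struct promise_type {
    const char* debug_name = "<unnamed>";
    promise_type() = default;
    template<typename... Rest>
    promise_type(const char* name, Rest&&...) : debug_name(name) {}
    // ... rest of the promise interface ...
};
// Task loadLevel(const char* name, LevelId id);  // first arg doubles as the label
// A pretty-printer can then walk your scheduler's live handles and print
// handle.promise().debug_name instead of raw frame addresses.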
Performance Characteristics You Need to Measure
The thing that surprised me most when I first profiled our coroutine-based AI system wasn’t the suspension overhead — it was the allocation cost. Each coroutine frame gets heap-allocated by default, and when you’ve got 200 NPCs all spawning patrol coroutines at scene load, that’s 200 individual malloc calls hitting at roughly the same frame. The fix changed everything, but you need to measure first or you’ll optimize the wrong thing.
Coroutine Frame Size
Your coroutine frame holds every local variable that’s live across a suspension point, plus the promise object, plus internal bookkeeping. This is not free. On Clang you can get the actual size at compile time:
// Clang: check allocated frame size
// Compile with: clang++ -std=c++20 -fcoro-aligned-allocation -O2
#include <coroutine>
#include <cstdio>
struct Task {
struct promise_type {
Task get_return_object() { return {}; }
std::suspend_always initial_suspend() { return {}; }
std::suspend_always final_suspend() noexcept { return {}; }
void return_void() {}
void unhandled_exception() {}
};
};
Task patrol_behavior(int waypoint_count, float* path_data) {
// __builtin_coro_size() at any suspension point gives frame bytes
printf("Frame size: %zu\n", __builtin_coro_size());
for (int i = 0; i < waypoint_count; ++i) {
co_await std::suspend_always{};
}
}
Our NPC patrol frames landed at 128–256 bytes each depending on how many locals were live across suspension points. When I refactored a behavior to pull a big PathResult struct out of the coroutine body and pass it as a pointer, the frame dropped from 240 bytes to 96. That matters when you’re sizing your pool allocator. The -fcoro-aligned-allocation flag on Clang is worth enabling — without it, over-aligned types in your frame can cause silent padding that inflates size unpredictably.
Suspension and Resumption Overhead
On x64, suspending and resuming a stackless coroutine is roughly equivalent to an indirect function call through a pointer — you save the instruction pointer into the frame, return to the caller, and later call back in through the resume function pointer. No OS involvement, no stack swap, no signal mask manipulation like you’d get with fibers or ucontext. That said, “roughly a function call” still means branch mispredictions and potential instruction cache pressure if you’re resuming a different coroutine type every tick. Measure on your actual target. A desktop x64 machine and a last-gen console have very different branch predictor behaviors. Here’s the minimal microbenchmark I use:
// Microbenchmark: resume cost in nanoseconds (self-contained minimal task type)
#include <coroutine>
#include <exception>
#include <chrono>
#include <cstdio>
struct BenchTask {
    struct promise_type {
        BenchTask get_return_object() { return { std::coroutine_handle<promise_type>::from_promise(*this) }; }
        std::suspend_always initial_suspend() noexcept { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() noexcept {}
        void unhandled_exception() noexcept { std::terminate(); }
    };
    std::coroutine_handle<promise_type> handle;
    ~BenchTask() { if (handle) handle.destroy(); }
};
BenchTask idle_coro() { while (true) co_await std::suspend_always{}; }
int main() {
    auto task = idle_coro();
    const int N = 1'000'000;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i) task.handle.resume();
    auto t1 = std::chrono::high_resolution_clock::now();
    std::printf("%.1f ns/resume\n",
                std::chrono::duration<double, std::nano>(t1 - t0).count() / N);
}
Pool Allocator Impact
C++20 lets you override the coroutine frame allocator per promise type by adding operator new and operator delete to your promise_type. That’s the hook you need. I built a fixed-size block pool keyed on frame size — each NPC type gets its own pool sized to that type’s measured frame. Switching from default malloc to this dropped our AI tick cost noticeably in a 200-NPC scene. I’m not going to quote you a specific millisecond number because it depends entirely on your platform, frame size, and what else is happening in that tick — profile yours. The pattern looks like this:
template<std::size_t FrameSize>
struct PoolAllocatedPromise {
static void* operator new(std::size_t size) {
// Assert size == FrameSize in debug builds
assert(size <= FrameSize);
return NpcFramePool<FrameSize>::acquire();
}
static void operator delete(void* ptr) {
NpcFramePool<FrameSize>::release(ptr);
}
// ... rest of promise_type interface
};
The gotcha: if you size your pool wrong and the actual frame is larger than you measured (because you added a local variable that’s live across a co_await), you get silent memory corruption in release builds. Add a compile-time check with static_assert against the __builtin_coro_size output — or at minimum an assert in operator new. Treat frame size as an API contract that needs guarding.
When Coroutines Are Overkill
A guard NPC that toggles between IDLE and ALERT based on a single bool does not need a coroutine. The suspension overhead isn’t the issue — the allocator overhead is. Spinning up a coroutine frame for behavior that could be if (heard_player) state = ALERT; is adding machinery for no benefit. My rule: if the behavior has fewer than three distinct suspension points and no complex sequencing, use a state enum and a switch statement. Coroutines earn their cost when you have deep sequences — navigate to point, wait for animation, check for threats, navigate again — where the equivalent callback or state machine would require storing a dozen intermediate values explicitly. That’s the break-even point where the frame allocation pays for itself in code clarity and reduced bug surface.
Real Gotchas and How to Fix Them
Dangling References Across co_await Points
The nastiest bug I’ve hit with coroutines isn’t a crash — it’s silent corruption. Take this pattern:
Task loadAsset(AssetManager& mgr) {
// 'path' lives on the coroutine frame, fine.
// But 'config' is a reference to a local in the CALLER.
const Config& config = mgr.getActiveConfig(); // returns ref to internal buffer
co_await mgr.fetchAsync("texture.dds");
// By here, mgr may have rebuilt its config buffer.
// config is now a dangling reference. Compiler said nothing.
applyConfig(config); // 💥 silent corruption
}
GCC, Clang, and MSVC all fail to warn on this reliably. The coroutine frame captures the reference itself, not what it points to — and if the referent’s lifetime ends between suspension and resumption, you’re reading garbage. The fix: capture by value for anything that doesn’t have a guaranteed lifetime spanning the whole coroutine. Run AddressSanitizer (-fsanitize=address) with coroutine workloads specifically; ASAN does catch use-after-free here even though the compiler static analysis misses it. For MSVC, add /fsanitize=address to your debug build and actually run your scheduler for a few frames in your test suite before trusting a new coroutine.
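For contrast, here's the fixed version of the same function:
Task loadAsset(AssetManager& mgr) {
    // Copy by value: the coroutine frame now owns the data outright
    Config config = mgr.getActiveConfig();
    co_await mgr.fetchAsync("texture.dds");
    applyConfig(config);  // safe: reads the frame-owned copy, not mgr's internal buffer
}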
The Coroutine That Never Wakes Up
Every game dev I’ve shown stackless coroutines to hits this within the first week: the coroutine suspends, the game keeps running, but nothing ever happens again. The awaitable’s await_suspend received the handle, stored it somewhere — and then the scheduler either never called .resume() on it, or worse, destroyed the handle thinking it was done.
// The bug: handle stored, but scheduler tick never drains this queue
struct FrameAwaiter {
std::coroutine_handle<> handle;
bool await_ready() { return false; }
void await_suspend(std::coroutine_handle<> h) {
handle = h;
// BUG: forgot to register h with the global scheduler
// GlobalScheduler::get().pushNextFrame(h); <-- this line missing
}
void await_resume() {}
};
The other variant: your scheduler destroys a handle it considers “orphaned” while a physics callback is holding a copy of it for deferred resumption. Then that callback calls .resume() on a destroyed handle. That’s UB and usually a crash deep inside MSVC’s coroutine machinery with no useful stack trace. The rule I enforce in my scheduler: one owner per handle, always. I use a wrapper type that asserts on double-resume and logs handle addresses in debug builds. Also add a watchdog in your scheduler that fires a warning if any handle has been pending for more than 500ms — it’s saved me hours of “why is my loading screen frozen” debugging.
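The watchdog fits in a few lines of the scheduler. A sketch: the registered_at timestamp and the LogWarning call are assumptions layered on the scheduler from earlier.
// Fire a warning for any handle that has sat suspended past the threshold
void CoroutineScheduler::CheckWatchdog(double now) {
    for (const auto& entry : pendingEntries) {  // assumed: entries carry a registered_at stamp
        double pending_for = now - entry.registered_at;
        if (pending_for > 0.5)  // 500ms without a resume is almost always a lost handle
            LogWarning("coroutine %p pending %.0f ms, lost handle?",
                       entry.handle.address(), pending_for * 1000.0);
    }
}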
MSVC vs Clang Frame Sizes and Your Pool Allocator
If you’re rolling a custom coroutine frame allocator (and you should be, heap allocation per-coroutine is brutal on frame time), the frame size for the same coroutine differs between compilers. Clang tends to be tighter; MSVC with /std:c++20 generates frames that can be 10–20% larger for the same coroutine body because of how it lays out the promise object and internal state machine. I found this out when my pool blocks were exactly right for Clang on Linux but the Windows build started silently overwriting memory because operator new for the coroutine frame was returning a block that was 8 bytes too small.
// In your coroutine's promise_type, override operator new:
static void* operator new(std::size_t sz) {
// sz is compiler-determined. Log it in a first-run assert:
assert(sz <= FramePool::BLOCK_SIZE &&
"Coroutine frame exceeds pool block — rebuild pool with larger blocks");
return FramePool::alloc();
}
My solution: add a static assert test that instantiates each coroutine type on every CI platform and dumps the requested sz to a known file. Then I set BLOCK_SIZE to the largest observed value across Clang 17, GCC 13, and MSVC 19.38, padded to the next 64-byte cache line. Shipping cross-platform with a pool? You must do this. “Works on my machine” is a memory corruption bug in disguise here.
The /await Flag That Breaks Everything on MSVC
Older MSVC guides tell you to add /await to get coroutine support. If your project already uses /std:c++20, that flag will conflict — you get cryptic linker errors about duplicate symbols in the coroutine machinery, or missing std::coroutine_handle specializations that exist in the standard header but get shadowed by the flag’s experimental version.
# CMakeLists.txt — make sure /await is stripped from your .vcxproj and CMake flags entirely
target_compile_options(MyGame PRIVATE
$<$<CXX_COMPILER_ID:MSVC>:/std:c++20>
# Do NOT add /await here. /std:c++20 enables coroutines fully.
)
The error messages MSVC gives you don’t mention /await by name — you get things like LNK2005: "public: void __cdecl std::coroutine_handle<void>::resume(void)" already defined. If you see that, grep your build flags for /await and nuke it. /std:c++20 is the only flag you need on MSVC 19.29 and later.
Bridging C-Style Callbacks (PhysX, FMOD, etc.) Into Coroutines
PhysX contact callbacks, FMOD event callbacks, networking libraries — they all use the C pattern: a static function pointer plus a void* userdata, called from a thread you don’t own. You can’t co_await inside them. The solution is a promise bridge: a shared state object the coroutine waits on and the callback fulfills.
// Shared between the coroutine and the C callback
struct CallbackBridge {
std::atomic<bool> fired{false};
ContactData result{};
std::coroutine_handle<> waiter;
// Called from PhysX thread — no coroutine machinery, just set state
void fire(const ContactData& data) {
result = data;
fired.store(true, std::memory_order_release);
// Resume must happen on YOUR game thread, not PhysX's thread.
// Push the handle to your thread-safe scheduler queue instead:
GameScheduler::get().enqueueResume(waiter);
}
};
struct ContactAwaiter {
CallbackBridge* bridge; // must outlive the coroutine suspension
bool await_ready() { return bridge->fired.load(std::memory_order_acquire); }
void await_suspend(std::coroutine_handle<> h) {
bridge->waiter = h;
// If fired between await_ready and here, we'd stall.
// Re-check after storing the handle:
if (bridge->fired.load(std::memory_order_acquire)) {
GameScheduler::get().enqueueResume(h);
}
}
ContactData await_resume() { return bridge->result; }
};
// Usage in a coroutine:
Task handleCollision(CallbackBridge* bridge) {
ContactData contact = co_await ContactAwaiter{bridge};
applyDamage(contact.impulse);
}
The critical detail in that code is the re-check after storing the waiter handle in await_suspend. There’s a race window between await_ready returning false and the handle being stored — if the callback fires in that window on a PhysX thread, you’ll stall forever without that second check. Note the sketch still isn't bulletproof: if the callback lands in exactly that window, both sides can enqueue a resume, so production code should arbitrate with an atomic flag exchange so exactly one side wins. Also: never call .resume() directly from the callback thread. PhysX callbacks run on their own job system; resuming a game coroutine there means your game logic runs on PhysX’s thread with none of your expected invariants. Always funnel through a thread-safe queue back to your main game thread or your designated coroutine thread.
When NOT to Use Stackless Coroutines
When Stackless Coroutines Are the Wrong Tool
Most coroutine tutorials sell you on the happy path. I want to talk about the four situations where I’ve watched teams waste weeks integrating C++20 coroutines only to rip them out. Knowing where the boundary is saves you from a very specific kind of refactor pain.
Deep Recursive Algorithms That Need Mid-Recursion Suspension
This is the fundamental mechanical limitation, not a fixable quirk. Stackless coroutines store their state in a compiler-generated heap object — only the current frame’s locals get suspended. If you’re writing a recursive pathfinder, a tree traversal that yields results, or any recursive descent parser that needs to pause mid-call, stackless coroutines physically cannot represent that. The recursion stack below the current frame is gone. The standard recommendation — convert recursion to iteration — sounds simple until your algorithm is genuinely recursive in structure and the iterative version requires you to manually maintain a stack anyway. At that point you’ve written more code for worse results.
The right tool here is stackful fibers. Boost.Context 1.84 gives you boost::context::fiber, which suspends the entire OS-level stack. Platform fibers (CreateFiber on Windows, ucontext_t on POSIX) do the same. You pay more per context switch — typically 100–300ns versus roughly an indirect function call for a stackless resume — but you actually get the semantics you need.
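For contrast with the stackless examples in this article, here's mid-recursion suspension with Boost.Context. This is a sketch against the boost::context::fiber API; verify it against the Boost version you actually ship.
#include <boost/context/fiber.hpp>
#include <cstdio>
namespace ctx = boost::context;
// Recursive walk that suspends at every level. The whole stack survives,
// which is exactly what stackless coroutines cannot represent.
void walk(ctx::fiber& sink, int depth) {
    if (depth == 0) return;
    std::printf("visiting depth %d\n", depth);
    sink = std::move(sink).resume();  // suspend the entire fiber stack, return to caller
    walk(sink, depth - 1);            // recursion state is intact after resumption
}
int main() {
    ctx::fiber f{[](ctx::fiber&& sink) {
        walk(sink, 3);
        return std::move(sink);       // hand control back for the last time
    }};
    while (f)
        f = std::move(f).resume();    // drive the fiber until its function returns
}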
Stuck on C++17 With No Console SDK Upgrade for 12+ Months
I’ve seen this situation on two separate projects targeting older console SDKs. The polyfill options for C++20 coroutines on C++17 compilers involve either patching the compiler’s coroutine TS experimental support or using libraries like cppcoro — which targets pre-standardization coroutine TS semantics and is now effectively unmaintained. The mismatch between coroutine TS and C++20 final semantics isn’t cosmetic; the ABI and the promise_type protocol differ in ways that cause subtle miscompiles on MSVC 2019 and Clang 10. If your console vendor’s SDK pins you to a specific compiler version and that compiler doesn’t have solid C++20 coroutine support, the polyfill path is genuinely fragile. Just wait. Write a state machine now, document the refactor as a future ticket, and do it properly when the toolchain catches up.
Hot Paths Running Thousands of Times Per Frame With No Suspension
Every coroutine instantiation allocates its frame object — either on the heap (default) or via a custom allocator you wire up yourself. If you’re calling a coroutine 5,000 times per frame and it never actually suspends, you’ve paid allocation and initialization overhead for zero benefit. The compiler can elide the allocation in simple cases via HALO (Heap Allocation eLision Optimization), but HALO is fragile: it breaks if the coroutine escapes the calling scope or if the coroutine frame size is non-trivially complex. Don’t bet on it in a shipping game. For a hot path — think per-bullet collision checks, per-particle update logic — a plain loop with explicit state flags is faster, smaller, and debuggable in any profiler without coroutine-aware tooling.
// Don't do this for 5000 bullets per frame:
Task updateBullet(Bullet& b) {
// no co_await anywhere, just runs to completion
b.position += b.velocity * dt;
co_return;
}
// Just do this:
for (auto& b : bullets) {
b.position += b.velocity * dt;
}
Codebases Heavy on SEH (Structured Exception Handling) on Windows
This one caught me completely off guard the first time. C++ coroutines and Windows SEH (__try / __except / __finally) do not mix. The MSVC compiler explicitly rejects co_await or co_yield inside an SEH block with an error, but the real danger is subtler: coroutine frame unwinding during exception propagation interacts badly with SEH’s non-standard unwind model. If your codebase uses SEH for crash telemetry, hardware exception catching (access violations, divide-by-zero), or wraps platform plugin boundaries with __try blocks, you’ll hit cases where the coroutine’s destructor sequence and SEH’s EXCEPTION_EXECUTE_HANDLER path interfere. The symptoms are usually wrong-context destructor ordering or, on 32-bit targets, outright stack corruption. If SEH is load-bearing in your architecture, either isolate coroutines completely away from those code paths — enforced by module boundaries — or skip them entirely for that subsystem.
Libraries Worth Knowing About
The thing that surprised me most when I started this journey was how few of these libraries are actually usable in a game project without significant surgery. Most coroutine libraries are designed for async I/O servers, not for deterministic per-frame scheduling with a fixed allocator. Here’s the honest rundown.
cppcoro (lewissbaker/cppcoro)
This is where I’d send anyone who wants to understand how stackless coroutines are properly implemented in C++20. Lewis Baker is one of the people who actually shaped the coroutine TS, so reading his code is like reading the spec with examples. The repo is dormant, with no new commits in years, but that doesn’t make it irrelevant. The task<T>, generator<T>, and async_generator<T> types in cppcoro are textbook-clean. I spent a weekend just reading task.hpp and came away understanding await_suspend return types in a way the cppreference page never gave me. Don’t ship it, but absolutely read it.
libunifex (Facebook/Meta)
This is the reference implementation of the sender/receiver model that eventually became std::execution (P2300). It’s production-grade in the sense that Meta has run it at scale, but “production-grade for a C++ async server” and “usable in a game” are two different things. The binary size alone will give your build engineer a headache. The learning curve is genuinely steep — you’ll spend more time understanding the scheduler abstraction than writing game logic. If your game is doing heavy async I/O (streaming open world, asset pipeline work), it might be worth the pain. For most game loops? No.
Unreal’s TCoroutine (UE5.3 experimental)
UE5.3 shipped TCoroutine under the experimental flag, and it integrates with Unreal’s task graph which is actually the right instinct. The problem is that “experimental” in Unreal means exactly what it says — the API changed between 5.3 and 5.4, and I’ve seen reports of interaction bugs with garbage collection under certain Blueprint-heavy scenarios. Worth watching the release notes every cycle. I wouldn’t build a shipping feature on it until it’s out of experimental and has a full cycle of bug reports behind it. If you’re already deep in Unreal and want to prototype with it, the integration with UE::Tasks is genuinely elegant:
// UE5.3 experimental — API may drift
TCoroutine<void> SpawnEnemyWave(int32 Count)
{
for (int32 i = 0; i < Count; ++i)
{
SpawnEnemy();
co_await UE::Tasks::Suspend(); // yield back to task graph
}
}
Rolling Your Own 200-Line Task.h
This is genuinely the right call for most game projects, and I say that not because I enjoy reinventing wheels but because the control you get is worth it. You pick the allocator — no hidden operator new from a library you don’t control. You write the scheduler loop, so it integrates with your existing frame timing. You add your own debug output, which means coroutine names show up in your profiler the way you want them to, not however the library decided to stringify a coroutine_handle. A minimal task type that supports co_await chaining fits comfortably in one header:
// Task.h — ~200 lines total, zero external deps, C++20
template<typename T = void>
struct Task {
    struct promise_type {
        T result;                              // specialize for T = void separately, as in the full header earlier
        std::coroutine_handle<> continuation;  // who's waiting on us
        Task get_return_object() {
            return Task{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() { return {}; }
        FinalAwaiter final_suspend() noexcept { return {}; }  // the ~10-line awaiter from the full header; resumes continuation
        void return_value(T v) { result = std::move(v); }
        void unhandled_exception() { std::terminate(); }      // replace with your crash reporter
    };
    std::coroutine_handle<promise_type> handle;
    // awaiter impl goes here — ~30 lines
};
The allocator hook is the part libraries don’t give you. Add void* operator new(size_t n) to promise_type and you can route all coroutine frame allocations through your pool allocator. That’s maybe 10 extra lines and it solves the single biggest objection game programmers have to coroutines on console: unpredictable heap churn.
Quick Reference: What to Pick and When
The most common mistake I see is teams defaulting to whatever they already know. If you’ve been writing state machines for years, everything looks like a state machine problem. If you just learned C++20 coroutines, you’ll reach for them even in hot-path code where the overhead will bite you. Here’s how I actually think about the decision:
C++20 stackless coroutines are the right call for:
- NPC behavior sequencing — patrol → investigate → alert → combat flows that read linearly in code but run over many frames. The suspension overhead is negligible when each state runs for hundreds of milliseconds.
- Cutscene scripting — co_await WaitForAnimation("door_open") is genuinely more maintainable than a 200-line state machine with 12 transition flags.
- Async asset loading callbacks — fire a load request, co_await the future, continue as if it was synchronous. No callback pyramid, no manual state restoration.
- Tutorial flows — step-by-step sequences with conditionals that need to survive scene transitions. Coroutine state is self-contained in the frame object.
The hard limit: stackless coroutines cannot suspend mid-callstack. If your NPC logic calls PathFind() which calls AStarStep() and you want to yield from inside AStarStep(), you’re stuck. You’d have to make every layer of that call chain a coroutine, which cascades fast. That’s where stackful fibers earn their place.
Stackful fibers (Boost.Context, or a Naughty Dog-style job system) for:
- Mid-recursion suspension — a fiber has its own real stack, so you can block anywhere in the call tree without redesigning your entire API.
- Wrapping legacy callback APIs — if you’re integrating a third-party SDK that takes completion callbacks, a fiber lets you block on a semaphore inside the callback wrapper and surface a synchronous-looking interface upward.
- Pre-C++20 codebases — Boost.Context runs on C++11. If you’re shipping on a console with a locked toolchain from 2019, stackful is your only real option short of manual state machines.
The honest tradeoff: stackful fibers cost more memory (typical stack size is 64KB–256KB per fiber vs. a few dozen bytes for a coroutine frame), and context switches are measurably more expensive — you’re swapping registers and stack pointers rather than just resuming a function pointer. For 50 NPCs this is invisible. For 5,000 concurrent jobs it matters.
Plain state machines win when:
- You have fewer than 5 states and the transitions are fully enumerable — a coroutine here is just ceremony.
- The code is on the hot path (per-frame particle behavior, collision response) — coroutine frame allocation and the indirect call on every resume are not free.
- Your team isn’t familiar with coroutine semantics yet — a confused programmer maintaining coroutine code will introduce bugs that are genuinely hard to diagnose. A state machine bug is usually obvious.
// Mental model for the decision:
//
// Linear sequence + suspend between calls? → C++20 coroutine
// Need to suspend INSIDE a deep callstack? → Stackful fiber
// <5 states, hot path, or unfamiliar team? → State machine
//
// If you're unsure, prototype the state machine first.
// Coroutine refactors are straightforward once you know the shape.
// Going the other direction is painful.