Callback Hell: Abusing Callbacks, Tail Calls, and Proxy Frames to Obfuscate the Stack

Source & licence. This is a faithful English republication of “Callback hell: abusing callbacks, tail-calls, and proxy frames to obfuscate the stack” by klezVirus (posted 2025-12-21, updated 2025-12-22). The original is licensed under CC BY 4.0; that licence allows republication with attribution, which is provided here in full. All figures, assembly listings, and the POC video are reproduced from the original. Light editorial polish only — technical content and code are unchanged.

Executive Summary

Callback-based execution — whether through QueueUserAPC, EnumXxx-style APIs, timer queues, or the thread pool — is one of the most popular ways to invoke arbitrary code in modern offensive tooling, because the actual target is reached via a forward edge (a call or jmp through an indirect target) rather than a direct call. Unfortunately for the operator, the callback’s address is still very much present on the call stack at the moment of execution, which makes the technique trivially visible to any defender willing to walk the stack — a category that today includes essentially every credible EDR.

klezVirus’ “Callback hell” combines three ideas to remove the callback frame from that stack while still recovering the callee’s return value: (1) a frameless tail-call that hides the callback frame at the cost of the return value, (2) a backward proxy frame (a gadget reached through the target’s ret) whose epilogue contains a MOV [REG], RAX ; RET pattern, so the proxy can store the return value into a caller-controlled slot before unwinding, and (3) chained callbacks (n! permutations of generic and terminating callbacks), so the resulting call stack looks like a perfectly ordinary chain of system callback frames. The full POC, ThreadPoolExecChain, ships an assembly core with three small functions (FirstCallback, GenericCallback, LastCallback) that wire the chain together. This post walks through every step, preserves the original figures and code, and ends with the detection landscape and a hardening checklist for defenders.

Foreword

This piece started, the author writes, with a casual question from Athanasios Tserpelis (trickster0):

I’m using TpAllocWork + TpPostWork to execute an arbitrary function, but I’m not fully sure how to recover the return value. Any ideas?
Conversation referenced in the original article

Recovering the return value of a thread-pool-dispatched function is a small but annoying problem — the dispatcher signature does not give callers a natural way back. That nudged klezVirus into revisiting some older experimental work on stack spoofing.

Note: this specific work did not make it into the final talk. The full conference deck advanced further than what is described here; this material is published in isolation for the sake of documentation.

Definitions

Tail calls

A tail call happens when a function’s last action is to call another function and return whatever that callee returns. At the assembly level the optimisation replaces the call instruction with a jmp: the stack does not grow, and the current frame is reused. Contrast with a normal call, which pushes a return address and produces a brand-new frame on top.
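As a minimal sketch of the definition above (the function names are illustrative and not from the original article), the call in `outer` is in tail position, so an optimising compiler is allowed, but not required, to emit `jmp inner` instead of `call inner` followed by `ret`:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callee: any function whose result is passed straight through. */
uint64_t inner(uint64_t x)
{
    return x * 2u + 1u;
}

/* The call to inner() is the last action and its result is returned unchanged,
 * so it is in tail position. At -O2, GCC and Clang typically compile this to
 * `jmp inner`; that is an optimisation, not a language guarantee. */
uint64_t outer(uint64_t x)
{
    return inner(x + 1u);
}

/* Not a tail call: work remains after inner() returns, so this function's
 * frame must stay alive and a real `call` is needed. */
uint64_t outer_not_tail(uint64_t x)
{
    return inner(x + 1u) + 1u;
}
```

Because C gives no guarantee about the emitted instruction, the article's callbacks are written directly in assembly, where the `jmp` is explicit.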

Callbacks

A callback is a function passed as an argument to be invoked later by another piece of code — usually after a specific event or condition. They underpin asynchronous programming, event-driven UIs, and the extensibility of every reasonable API surface.

ROP Gadget

A ROP (Return-Oriented Programming) gadget is a short instruction sequence ending in a ret, chained with other gadgets to perform arbitrary computation without ever introducing new code. ROP exists precisely because DEP/NX makes injecting and directly executing new code hard on modern Windows.

JOP / COP Gadget

JOP (jump-oriented) and COP (call-oriented) gadgets end in an indirect jmp or call respectively. Both reach the next gadget via a forward edge, which is exactly why they are interesting in a CET / shadow-stack world — they sidestep the mitigations that target return-edges.

Ad-hoc definitions

Forward Proxy Frame

A function (or gadget) executed via its own crafted stack frame, where control flow reaches it through a forward edge — an indirect jump or an indirect call — not a return. Legitimate system functions can be repurposed as forward proxy frames.

Backward Proxy Frame

A ROP gadget executed with a dedicated stack frame, reached via a backward edge (a ret). This is just classical ROP framed as a building block. The author’s earlier CONCEAL gadgets from the moonwalk research are examples.

How Callbacks Appear in Real Life

For a callback to fire at all, its address must live in executable memory — either inside a loaded module (the typical case) or inside an RWX private region (the noisy case). Either way, that address ends up on the stack at the moment the callback executes, which means anyone walking the stack sees it:

Normal callback call stack: the callback frame and its address are visible to any stack inspector. Source: original article.

A well-known mitigation is to implement the callback as a frameless routine that ends in a tail call. The actual target gets reached via jmp rather than call, the callback frame never materialises, and a stack walker only sees the dispatcher frame:

Callback with tail-call optimisation: a frameless callback uses a tail call (jmp) instead of a call, so the callback frame disappears, at the cost of losing the return value of the invoked function. Source: original article.

The idea of frameless callbacks is not new. Chetan Nayak (creator of Brute Ratel C2) publicly discussed something similar years ago, and SafeBreach Labs’ PoolParty work showed the same trick in a different shape. What is interesting here is preserving the return value.
Paraphrased from the original blog

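A concrete example: a tail-called WorkCallback that wraps NtAllocateVirtualMemory. The struct in RDX packs the pointer to NtAllocateVirtualMemory plus the four arguments that vary per call (ProcessHandle, BaseAddress, RegionSize, Protect); ZeroBits and AllocationType are hard-coded in the callback. The callback unpacks them, writes the 5th and 6th arguments into their stack slots, and tail-jumps into NtAllocateVirtualMemory. No frame, no return-value capture: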

section .text

global WorkCallback

WorkCallback:
    mov rbx, rdx                ; backing up the struct as we are going to stomp rdx
    mov rax, [rbx]              ; NtAllocateVirtualMemory
    mov rcx, [rbx + 0x8]        ; HANDLE ProcessHandle
    mov rdx, [rbx + 0x10]       ; PVOID *BaseAddress
    xor r8, r8                  ; ULONG_PTR ZeroBits
    mov r9, [rbx + 0x18]        ; PSIZE_T RegionSize
    mov r10, [rbx + 0x20]       ; ULONG Protect
    mov [rsp+0x30], r10         ; stack pointer for 6th arg
    mov r10, 0x3000             ; ULONG AllocationType
    mov [rsp+0x28], r10         ; stack pointer for 5th arg
    jmp rax
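The offsets that the listing reads through RBX imply a context layout along the following lines. This is a sketch only: the struct name and field names are hypothetical, and only the offsets come from the assembly. It assumes a 64-bit target with 8-byte pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the offsets mirror the loads in WorkCallback. */
typedef struct NtAllocArgs {
    void     *pNtAllocateVirtualMemory; /* [rbx + 0x00] -> rax (jump target)    */
    void     *hProcess;                 /* [rbx + 0x08] -> rcx                  */
    void    **baseAddress;              /* [rbx + 0x10] -> rdx                  */
    size_t   *regionSize;               /* [rbx + 0x18] -> r9                   */
    uint64_t  protect;                  /* [rbx + 0x20] -> [rsp + 0x30], 6th arg.
                                           Protect is a 32-bit ULONG, but the
                                           listing loads 8 bytes, so the slot is
                                           kept 8 bytes wide here.              */
} NtAllocArgs;

/* Compile-time checks that the C layout matches the offsets in the listing. */
_Static_assert(offsetof(NtAllocArgs, hProcess)    == 0x08, "ProcessHandle offset");
_Static_assert(offsetof(NtAllocArgs, baseAddress) == 0x10, "BaseAddress offset");
_Static_assert(offsetof(NtAllocArgs, regionSize)  == 0x18, "RegionSize offset");
_Static_assert(offsetof(NtAllocArgs, protect)     == 0x20, "Protect offset");
```

ZeroBits and AllocationType have no field because the callback supplies them itself (`xor r8, r8` and `mov r10, 0x3000`).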

The pattern works, but it kills the return value, and a lot of callees (NtCreateSection, allocators, handle-creating functions) absolutely need theirs. The rest of the article is about how to get the frame removal and keep the return value.

Initial Frame Swapping Design

While building the moonwalk stack-spoofing research, klezVirus designed an “opaque architecture” for hiding function calls inside an executable. It glues together two pieces:

A conditional trampoline — a dispatcher that deceives stack inspectors about which branch is actually being taken.
An arbitrary function invoker, built from two functions: a frameless function that prepares the stack with a legitimate-looking frame, and a standard framed function that is never fully executed. The frameless function plays the role of a prologue (allocating the frame with the correct return address); the framed function contributes only its epilogue (restoring the registers and returning).

Initial opaque dispatcher architecture: a conditional trampoline combined with an arbitrary-function invoker whose prologue and epilogue are taken from two different functions. Source: original article.

The result obfuscates program flow well, but the complexity-to-benefit ratio eventually felt off. The frame-swapping primitive that came out of this work, however, is genuinely reusable, and that is what powers the rest of the article.

Frame Swapping Proxy Using a Thread Pool

Frame swapping turns out to be a very flexible primitive for proxying arbitrary function calls inside callbacks. The trick: hide the callback frame, but preserve the return value, by routing through a function whose epilogue contains the magic MOV [REG], RAX pattern — storing the function’s return value into memory pointed to by another register before tearing the frame down.

The author found one such gadget in GlobalGetUserAndPassW inside wininet.dll. The relevant epilogue:

180152087 48 89 03        MOV        qword ptr [RBX],RAX
18015208a 48 83 c4 20     ADD        RSP,0x20
18015208e 5b              POP        RBX
18015208f c3              RET

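If we make the target “return” directly to one of this function’s MOV [RBX], RAX instructions (0x180152087 in the excerpt above, or the identical instruction at 0x18015207a, which directly follows a CALL and is the landing point marked in the full listing below), the return value of whatever we just executed is written into [RBX], then the stack is unwound and execution resumes wherever the original caller expects. To make that work, the callback needs to:

Mirror the host function’s prologue (PUSH RBX, then SUB RSP, 0x20) so that the host’s epilogue (ADD RSP, 0x20 ; POP RBX ; RET) unwinds exactly what was set up.
Put a pointer to the result slot into RBX before jumping into the target. RBX is non-volatile in the Windows x64 calling convention, so the target hands it back intact.
Tail-jump into the target with the correctly-shaped stack so that, when the target returns, RAX contains the result and the return-address slot points at the MOV [RBX], RAX gadget.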

The custom callback that arranges all this:

do_call:
     mov r10, [r10 + 08h]   ; addressToPush

     ;; Pretending we are in a frame proc
     ;; Note: compiling this as a frame proc is absolutely not necessary
     ;;       in this case, it is just to show we are mirroring the prolog of GlobalGetUserAndPassW
     .pushreg rbx
     push rbx
     .allocstack 20h
     sub rsp, 20h           ; this is for the spoofed/swapped frame
     .endprolog

     ; This is the address after the call in GlobalGetUserAndPassW
     push r10
     ; GCONTEXT is the address of a user-controlled Work Item structure
     mov rbx, GCONTEXT      ; we need this structure in RBX
     add rbx, 8             ; we use the address of the return value, unused for generic callbacks
     ; Finally we just jump to the target function
     jmp rax

And the full function whose epilogue we are splicing into, with the return-address landing point marked:

180152075 e8 ee af        CALL       internal_function
          00 00
18015207a 48 89 03        MOV        qword ptr [RBX],RAX   <--- We can set the return address here
18015207d b8 01 00        MOV        EAX,0x1
          00 00
180152082 eb 06           JMP        PROCEDURE_END
SAVE_RBX_RDX
180152084 48 89 02        MOV        qword ptr [RDX],RAX
180152087 48 89 03        MOV        qword ptr [RBX],RAX

PROCEDURE_END
18015208a 48 83 c4 20     ADD        RSP,0x20
18015208e 5b              POP        RBX
18015208f c3              RET
180152090 cc              ??         CCh

Call stack after the frame-swapping proxy runs: the callback frame is hidden behind a legitimate proxy frame from wininet!GlobalGetUserAndPassW, and the return value lands in the caller-supplied slot. Source: original article.
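The stack bookkeeping can be checked with a toy model in C. This is a simulation under stated assumptions, not the POC: the `Machine` type and all function names are hypothetical, each stack slot stands for 8 bytes, and the target is reduced to "set RAX, then ret". The point is only to confirm that the pushes in do_call and the pops in the spliced epilogue balance, so that execution resumes at the dispatcher's return address with RBX restored.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_SLOTS 16            /* each slot models 8 bytes of stack */

typedef struct Machine {
    uint64_t stack[STACK_SLOTS];
    int      rsp;                 /* index of the top slot; the stack grows downward */
    uint64_t rax;
    uint64_t rbx;
} Machine;

static void push(Machine *m, uint64_t v) { assert(m->rsp > 0); m->stack[--m->rsp] = v; }
static uint64_t pop(Machine *m) { assert(m->rsp < STACK_SLOTS); return m->stack[m->rsp++]; }

/* Returns the address at which execution resumes after the spliced epilogue. */
uint64_t simulate_frame_swap(Machine *m, uint64_t dispatcher_ret, uint64_t gadget,
                             uint64_t target_result, uint64_t *result_slot)
{
    m->rsp = STACK_SLOTS;
    push(m, dispatcher_ret);                     /* dispatcher called the callback */

    /* do_call: mirror the host prologue, then plant the gadget address. */
    push(m, m->rbx);                             /* push rbx                       */
    m->rsp -= 4;                                 /* sub rsp, 20h (4 slots)         */
    push(m, gadget);                             /* push r10                       */
    m->rbx = (uint64_t)(uintptr_t)result_slot;   /* mov rbx, GCONTEXT ; add rbx, 8 */

    /* jmp rax: the target runs, leaves its result in RAX and executes ret. */
    m->rax = target_result;
    uint64_t rip = pop(m);
    assert(rip == gadget);                       /* the target "returns" into the gadget */

    /* Spliced epilogue of the host function. */
    *(uint64_t *)(uintptr_t)m->rbx = m->rax;     /* mov [rbx], rax                 */
    m->rsp += 4;                                 /* add rsp, 20h                   */
    m->rbx = pop(m);                             /* pop rbx                        */
    return pop(m);                               /* ret                            */
}
```

Calling `simulate_frame_swap` with any dispatcher address and result value should hand back that same dispatcher address, with the result stored in the slot and the original RBX value restored.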

Not really important in our artificial scenario, but if the proxy function had its own structured exception handling, you’d want to make sure the epilogue you splice into doesn’t try to consult unwind info that no longer matches reality.

Limitations

Return value(s)

The MOV [REG], RAX trick captures the single “normal” return value (the contents of RAX). Functions that return additional values out-of-band — via output pointers passed as parameters — do not need any special treatment, because the pointer arguments are already arranged by the caller.

Number of arguments

How many stack arguments you can pass is bounded by how much stack the proxy frame allocates. GlobalGetUserAndPassW reserves only shadow space and saves RBX, so there is just enough room for the four register arguments: a fifth argument would land on the saved-RBX slot and a sixth on the return address. Two options to lift the limit: pick a proxy with a larger allocation (you want at least 0x20 + nargs*8, where nargs counts the stack-passed arguments, i.e. those beyond the fourth), or chain proxy frames.
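The sizing rule can be written down as a small helper. This is a sketch derived from the stack layout described in this section, with a hypothetical function name; it takes the total argument count and ignores the 16-byte stack alignment that a real prologue also has to respect.

```c
#include <assert.h>
#include <stddef.h>

/* Minimum SUB RSP allocation a proxy frame needs so that a target taking
 * `total_args` arguments can be called without touching the saved registers
 * or the return address. Windows x64: the first four arguments travel in
 * RCX, RDX, R8 and R9 but still get 0x20 bytes of shadow space; each further
 * argument occupies one 8-byte stack slot above it. */
size_t min_proxy_alloc(size_t total_args)
{
    size_t stack_args = total_args > 4 ? total_args - 4 : 0;
    return 0x20 + stack_args * 8;
}
```

For the six-argument NtAllocateVirtualMemory this gives 0x30, which is why the 0x20-byte frame of GlobalGetUserAndPassW is too small for it.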

Discussion with Alex Reid (Octoberfest73): this is essentially the same shape as the Moonwalk++ pattern, just with an extra proxy frame between the dispatcher and the target. Chaining proxies trades a bit more stack noise for the ability to pass arbitrarily many arguments.
Paraphrased from the original article

One, None, and a Hundred Thousand

The single frame-swap is only the simplest case. Once you accept that you can splice callbacks and proxy frames, you can construct almost arbitrary stacks. Broadly:

Backward proxy chains. Chain multiple backward proxy frames (ROP-style), being careful with return addresses. This gives you a fully customisable stack, but the return addresses will not be preceded by genuine call instructions, which is the textbook ROP detection signal.
Forward proxy chains via natural callbacks. Use real OS callback APIs as forward proxy frames. Every transition is a valid forward edge, so CET / shadow-stack don’t object; the only weirdness is that the resulting callback chains get increasingly absurd.
Hybrid chains. Mix forward and backward proxies as needed, depending on whether CET enforcement is in play and what you want the stack to look like.

Reference: the SafeBreach & Almond Consulting write-ups on thread-pool work-item abuse are the right background reading here, particularly the parts about how the dispatcher routes through generic worker functions.

Each work item carries a pointer to the next call, the parameters, and (optionally) the return-address slot. With three small assembly functions you can chain arbitrarily many callbacks in n! orderings:

FirstCallback — initialises the work-item structure for the chain (current pointer, return slot, argument count).
GenericCallback — performs a tail call into the next callback invoker, advancing the chain.
LastCallback — performs the actual target invocation through the backward-proxy-frame pattern from the previous section, capturing the return value.

The work-item structure that ties them together:

typedef struct WorkItemContext {
     FARPROC func;           // Function pointer
     void* retAddress;       // Return address to simulate stack return
     uint64_t argc;          // Argument count
     void* args[MAX_ARGC];   // Arguments (up to MAX_ARGC)
} WorkItemContext;

Note: this kind of implementation is purely arbitrary. You can wire up the same primitive with a single callback type and conditional branches; the three-function split just made the assembly cleaner.
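To make the role of the structure concrete, here is a portable C model of what the terminating callback achieves. It is not the POC's assembly. The assumptions are: `MAX_ARGC` is set to 8 (the real value is not given here), `FARPROC_T` is a stand-in for the Windows `FARPROC` type, `add_two` and `run_last_item` are hypothetical names, and the target takes exactly two integer arguments. The do_call listing uses GCONTEXT + 8 as the result slot, which on a 64-bit target is the retAddress field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ARGC 8                 /* assumption: the real value is not given here */
typedef void (*FARPROC_T)(void);   /* portable stand-in for Windows FARPROC */

typedef struct WorkItemContext {
    FARPROC_T func;                /* Function pointer */
    void     *retAddress;          /* Return address / result slot */
    uint64_t  argc;                /* Argument count */
    void     *args[MAX_ARGC];      /* Arguments (up to MAX_ARGC) */
} WorkItemContext;

_Static_assert(sizeof(void *) == 8, "model assumes a 64-bit target");
_Static_assert(offsetof(WorkItemContext, retAddress) == 8, "result slot sits at +8");

/* Hypothetical target function. */
uint64_t add_two(uint64_t a, uint64_t b) { return a + b; }

/* Model of the end result: call the target, then store its return value at
 * context + 8, which is what MOV [RBX], RAX does in the spliced epilogue. */
uint64_t run_last_item(WorkItemContext *ctx)
{
    assert(ctx->argc == 2);
    uint64_t (*fn)(uint64_t, uint64_t) = (uint64_t (*)(uint64_t, uint64_t))ctx->func;
    uint64_t rax = fn((uint64_t)(uintptr_t)ctx->args[0],
                      (uint64_t)(uintptr_t)ctx->args[1]);
    memcpy((unsigned char *)ctx + 8, &rax, sizeof rax);   /* mov [rbx], rax */
    uint64_t stored;
    memcpy(&stored, &ctx->retAddress, sizeof stored);     /* caller reads the slot */
    return stored;
}
```

The caller that queued the work item reads the result back from the same field once the chain has finished.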

Ok, but Why?

Empirically: this unusual stack shape bypasses essentially every “walk the stack and look for suspicious frames” detection in current EDR products. The forward edges look like normal thread-pool work-item dispatch; the backward edge stops at a real epilogue inside a Microsoft-signed DLL; the return value is delivered exactly as if the original caller had invoked the target directly. That alone is enough operational value to justify the technique, even though the construction is, on inspection, deeply weird.

POC || GTFO

A reference implementation lives at github.com/klezVirus/ThreadPoolExecChain. The interesting bits are the three assembly functions; everything around them is plumbing. A short demo video, lifted from the original article:

POC video by klezVirus — chained thread-pool callbacks with return-value recovery in action. Source: original article.

Detection Landscape

Note: dedicated detection of the backward-proxy half is only required for processes where Intel CET is disabled. CET’s shadow stack catches that half of the technique on its own, because the spliced return address is not the one the shadow stack remembers.

For whatever CET does not cover, detection looks a lot like detection for the original stack-moonwalking work:

Return-without-CALL. The most reliable signal: walk back from each frame’s return address and confirm it is preceded by a genuine call instruction. A return address that lands on a MOV [REG], RAX inside a function’s epilogue, with no matching call site, is exactly what the backward proxy produces.
Gadget-scarcity argument. Backward proxies need a very specific sequence: store RAX to a pointer, then return, without clobbering critical registers and with the right stack discipline. There are not many such gadgets in any single signed DLL, so an EDR can maintain a small known-gadget list and treat hits as high signal.
Callback chains. Three or more nested thread-pool callbacks have essentially no legitimate use case. A simple depth heuristic ought to flag the forward-proxy chain variant.
IAT inspection for dynamic calls. If the target of a callback resolves to an address that is not exported by any module that the process legitimately depends on, it is interesting on its own.
CFG / XFG bitmap validation. The indirect call into the target should be a CFG-valid destination. If your stack-walk replays the indirect targets and one of them isn’t in the CFG bitmap, that’s a strong signal.
CALL-site emulation. For each return address, briefly emulate the bytes preceding it; if the result isn’t a control-transfer instruction whose target matches the callee, the frame is synthetic.
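The call-site checks in the list above can be sketched in C as follows. This is a heuristic illustration with hypothetical function names, not a full x86-64 decoder: it recognises a direct near CALL (E8 rel32) and guesses at indirect CALLs (FF /2) by opcode and ModRM reg field only. The `listing` bytes are copied from the GlobalGetUserAndPassW excerpt in this article, and a little-endian host is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes from 0x180152075 to 0x180152090, as shown in the article's listing. */
const uint8_t listing[] = {
    0xE8, 0xEE, 0xAF, 0x00, 0x00,   /* CALL internal_function   */
    0x48, 0x89, 0x03,               /* MOV [RBX], RAX  (+0x05)  */
    0xB8, 0x01, 0x00, 0x00, 0x00,   /* MOV EAX, 1               */
    0xEB, 0x06,                     /* JMP PROCEDURE_END        */
    0x48, 0x89, 0x02,               /* MOV [RDX], RAX           */
    0x48, 0x89, 0x03,               /* MOV [RBX], RAX  (+0x12)  */
    0x48, 0x83, 0xC4, 0x20,         /* ADD RSP, 0x20            */
    0x5B, 0xC3, 0xCC                /* POP RBX ; RET ; int3     */
};

/* True if a direct CALL (E8 rel32) ends exactly at code[ret_off]; the
 * absolute call target is written to *target. `base` is the address of code[0]. */
bool direct_call_before(const uint8_t *code, size_t ret_off, uint64_t base,
                        uint64_t *target)
{
    if (ret_off < 5 || code[ret_off - 5] != 0xE8)
        return false;
    int32_t rel;
    memcpy(&rel, &code[ret_off - 4], sizeof rel);
    *target = base + ret_off + (uint64_t)(int64_t)rel;
    return true;
}

/* Heuristic: true if some common encoding length of an indirect CALL (FF /2)
 * could end at code[ret_off]. It can produce false positives. */
bool indirect_call_before(const uint8_t *code, size_t ret_off)
{
    static const size_t lens[] = { 2, 3, 4, 6, 7 };
    for (size_t i = 0; i < sizeof lens / sizeof lens[0]; i++) {
        size_t n = lens[i];
        if (ret_off >= n && code[ret_off - n] == 0xFF &&
            ((code[ret_off - n + 1] >> 3) & 7) == 2)
            return true;
    }
    return false;
}
```

Applied to the listing, the landing point at 0x18015207a is preceded by a real CALL, so a bare "is there a CALL before the return address" test passes there. What gives the frame away is that the decoded target is internal_function rather than the function that actually ran, which is why the check has to compare targets. The second MOV [RBX], RAX at 0x180152087 has no CALL in front of it at all.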

Closing Remarks

The whole thing started from one friend’s specific problem about recovering thread-pool return values — and, as klezVirus admits in the original, the friend has not yet been told about the solution. The technique works as long as CET is off; with CET enabled, the callback-chaining half still functions, but the backward-proxy return-value trick does not, and a follow-up has to find another way back. That is an open problem worth picking up.

Key Takeaways

A callback has to live in executable memory, and as soon as it calls anything a return address inside it sits on the stack, so the “callback was invoked indirectly” argument does not buy any stack-walker invisibility on its own.
Tail-called frameless callbacks remove the callback frame at the cost of the return value, which is unacceptable for many real-world callees (allocators, handle creators, anything that returns NTSTATUS).
A backward proxy frame whose epilogue contains a MOV [REG], RAX ; RET pattern lets you reclaim the return value while still hiding the callback frame: the proxy stores the result into a caller-controlled slot during its own unwind.
Chained generic + terminating callbacks (FirstCallback → GenericCallback… → LastCallback) produce n! permutations of plausible-looking stacks for the same logical call, defeating shape-based stack-walk heuristics.
Intel CET’s shadow stack catches the backward-proxy half of the technique — the callback-chaining half still works, but return-value recovery via this method does not survive CET enforcement.
The detection lever defenders actually have is “return address not preceded by a CALL site that targets the current frame’s function” — everything else is noise around that primitive.

Defensive Recommendations

Turn on Intel CET / Hardware-Enforced Stack Protection for security-sensitive processes (EDR agents, browser brokers, AV scanners, anything that holds a high-value handle). CET’s shadow stack negates the backward-proxy half of this technique outright.
Stack-walk with CALL-site validation. For every frame, verify that the bytes preceding the saved return address contain an instruction that could have produced that return address (a call with a target matching the frame’s function). A synthetic frame won’t satisfy this constraint.
Maintain a small known-gadget list for proxy abuse. The MOV [REG], RAX ; ADD RSP, 0x20 ; POP RBX ; RET pattern from wininet!GlobalGetUserAndPassW is the canonical example; build a list of similar epilogues across the standard Windows DLL set and treat hits as high-signal.
Flag thread-pool callback chains of depth > 2. Three or more nested TpAllocWork / TpPostWork dispatches inside one logical operation have essentially no legitimate use case in normal application code.
Correlate indirect-call targets with CFG / XFG bitmaps. If your stack-walk encounters an indirect target that isn’t a valid CFG destination for the calling module, treat it as a synthetic frame, not a legitimate one.
Cross-check IAT-resolved targets against the module’s declared imports. A callback resolving to ntdll!NtAllocateVirtualMemory from a module that doesn’t import it is the kind of thing that should never be quiet.
Alert on proxied callbacks that invoke ZwTerminateProcess / ZwSuspendProcess against PPL processes. The technique here is generic, so it is just as suitable for “disable the EDR” as for benign payloads; the destination matters.
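As a starting point for the known-gadget list recommended above, a byte-signature scan might look like the following sketch. The function and constant names are hypothetical, and the signature is the exact byte sequence of the epilogue shown in this article. A production scanner would need to generalise over destination registers, frame sizes and saved-register sets, which this example does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* MOV [RBX], RAX ; ADD RSP, 0x20 ; POP RBX ; RET
 * (bytes as listed for wininet!GlobalGetUserAndPassW in this article). */
static const uint8_t kStoreRaxEpilogue[] = {
    0x48, 0x89, 0x03,
    0x48, 0x83, 0xC4, 0x20,
    0x5B,
    0xC3
};

/* Returns the offset of the first occurrence of the signature in buf,
 * or -1 if it does not occur. */
ptrdiff_t find_store_rax_epilogue(const uint8_t *buf, size_t len)
{
    const size_t n = sizeof kStoreRaxEpilogue;
    if (len < n)
        return -1;
    for (size_t i = 0; i + n <= len; i++) {
        if (memcmp(buf + i, kStoreRaxEpilogue, n) == 0)
            return (ptrdiff_t)i;
    }
    return -1;
}
```

Run over the executable sections of the standard Windows DLLs, hits from a scan of this kind would seed the list; return addresses observed at those locations during a stack walk can then be treated as high-signal.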

Conclusion

“Callback hell” is a clever combination of three small ideas: a tail-call removes the callback frame, a backward proxy frame whose epilogue stores RAX recovers the return value, and a chain of generic + terminating callbacks produces stacks that look ordinary to walking-style detection. It works today because most stack-walking heuristics check shape rather than provenance: they ask “is this stack weird?” instead of “could this return address have been produced by a real call site?”. Intel CET and CALL-site validation are the two levers that change that game; until both are widely deployed, callback chaining will remain a quietly effective evasion primitive for red teams and, less helpfully, for everyone else.

References (from the original article)

klezVirus, “Callback hell: abusing callbacks, tail-calls, and proxy frames to obfuscate the stack” — original article, CC BY 4.0.
klezVirus, ThreadPoolExecChain — reference POC.
klezVirus, Moonwalk++ — stack-spoofing follow-up framework.
klezVirus, AlternativeShellcodeExec — earlier shellcode-execution proxy work.
SafeBreach Labs, PoolParty — thread-pool work-item abuse research.
Chetan Nayak, frameless-callback notes (Brute Ratel C2).
Octoberfest73 (Alex Reid), Moonwalk++ pattern discussion.

Original research, figures, code samples, and video by klezVirus, “Callback hell: abusing callbacks, tail-calls, and proxy frames to obfuscate the stack”, klezVirus blog, posted 2025-12-21, updated 2025-12-22. Republished here under the original’s Creative Commons Attribution 4.0 International (CC BY 4.0) licence, with full attribution to the author.
