The epoll UAF: A Same-CPU Preemption Race in fs/eventpoll.c on Linux 6.6+

Original: This article is an independent English-language rewrite of “The epoll uaf”, published on the personal blog at guysrd.github.io. No author is clearly listed on the source page: the site handle is guysrd, with no byline.

All vulnerability research, reverse engineering, the struct-offset table, the C excerpts from fs/eventpoll.c and the exploit-feasibility analysis are the work of the original author. The article’s two SVG diagrams are linked back to the source rather than re-hosted (the core-jmp.org WordPress install currently refuses SVG uploads). All three C source excerpts and the struct eventpoll layout table are reproduced verbatim from the original. For the full walkthrough — including the timing measurements, cross-cache analysis, and the author’s open challenge to exploit it — read the source.

Source: guysrd.github.io · Fix: git.kernel.org commit 07712db80857

Executive Summary

guysrd documents a use-after-free in Linux’s fs/eventpoll.c caused by an asymmetry between the epoll loop-detection walkers and the struct eventpoll teardown path. Before March 2023 the whole walk happened under a global epmutex; a benchmarking optimisation replaced that with per-instance reference counting and moved the walkers inside rcu_read_lock() — but ep_free() kept calling plain kfree(ep), with no call_rcu deferral. Under CONFIG_PREEMPT=y + CONFIG_PREEMPT_RCU=y, rcu_read_lock() does not disable preemption, so the walker thread can be preempted between loading an epi from the RCU-protected hlist and dereferencing epi->ep — a parallel closer can reach ep_free() in that window and the next dereference touches freed kmalloc-256 memory.

The bug is reachable from any unprivileged process and the affected code is in mainline Linux from 6.6 onwards, which includes recent Android kernels (the author tested on a Pixel 10). The achieved primitive is not a clean arbitrary read/write — it’s a constrained write of loop_check_gen (a u64 monotonic counter) plus a zero byte at fixed offsets, optionally followed by a chained recursion that uses an attacker-controlled epi->ep pointer as the next target. With Pixel’s default governor throttling to 729 MHz at idle, race success drops sharply; the author hit ~4% reliability with 8,000 parent nodes and a ~2 ms walk. Same-cache reclaim within kmalloc-256 is feasible; the cross-cache slab → PCP → PTE path is timing-bound and the author was unable to make it reliable without privileged scheduler access. The fix is one line: kfree(ep) → kfree_rcu(ep, rcu).

epoll in two seconds

The Linux epoll family — epoll_create(), epoll_ctl(), epoll_wait() — is the kernel’s scalable readiness-notification interface, and the foundation of every high-performance event loop on Linux (nginx, Node.js, io_uring’s polled paths, large parts of the Android system server, …). Each instance is a struct eventpoll, and each watched file descriptor is a struct epitem. Because epoll file descriptors can themselves be added to other epoll instances, the runtime relationships among instances form a directed graph, and the kernel must reject cycles or it would loop forever when delivering events. The graph check is what gets us into trouble.
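The cycle rejection is easy to observe from userspace. The following is a minimal sketch, assuming a Linux system with epoll available; epoll_nest and try_epoll_cycle are hypothetical helper names introduced here for illustration. It makes instance a watch instance b, then tries to make b watch a, which the loop check should refuse with ELOOP as documented in epoll_ctl(2).

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Add epoll instance `child` to the interest list of `parent`.
 * Returns 0 on success or the errno value reported by epoll_ctl(). */
static int epoll_nest(int parent, int child)
{
    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = child } };

    if (epoll_ctl(parent, EPOLL_CTL_ADD, child, &ev) == 0)
        return 0;
    return errno;
}

/* Build a -> b (a watches b), then try to close the cycle with b -> a.
 * Returns the errno of the first add if it failed, otherwise the errno
 * of the second, cycle-closing add. */
static int try_epoll_cycle(void)
{
    int a = epoll_create1(EPOLL_CLOEXEC);
    int b = epoll_create1(EPOLL_CLOEXEC);
    assert(a >= 0 && b >= 0);

    int first = epoll_nest(a, b);   /* legal: a now watches b */
    int second = epoll_nest(b, a);  /* would create an a <-> b loop */

    close(a);
    close(b);
    return first != 0 ? first : second;
}
```

Calling try_epoll_cycle() from a test program and comparing the result against ELOOP shows the graph check at work; the same check is the code path the rest of this article is about.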

Structures

The author publishes an SVG diagram of the two relevant structures and the lifetime collision involved. It’s linked here rather than re-hosted because the WordPress install rejects SVG uploads:

/1.svg — epoll data structures and the UAF

For the purposes of the bug, two facts about struct eventpoll matter: the mtx mutex sits at offset 0, and the refs hlist head sits at offset 176. The refs field is the one the walker reads as a list pointer after the object has been freed.

The 2023 optimisation

Before March 2023, the loop-detection walkers ran under a single global epmutex. That mutex was cheap to reason about but expensive under contention — on HTTP-style benchmarks it accounted for roughly 58% of CPU contention. A patch series replaced the global mutex with per-instance reference counting and moved the upward-walk path into an RCU read-side critical section. The reported benefit was a ~60% throughput improvement on those benchmarks. The cost — not visible until you start reading the teardown side — was that the walkers now traversed structures that some other thread could be freeing.

The bug

Two pieces of code matter. First, the loop-check entry point and the upward-direction walker (verbatim from the source):

static int ep_loop_check(struct eventpoll *ep, struct eventpoll *to)
{
    int depth, upwards_depth;

    inserting_into = ep;
    /*
     * Check how deep down we can get from @to, and whether it is possible
     * to loop up to @ep.
     */
    depth = ep_loop_check_proc(to, 0);
    if (depth > EP_MAX_NESTS)
        return -1;
    /* Check how far up we can go from @ep. */
    rcu_read_lock();
    upwards_depth = ep_get_upwards_depth_proc(ep, 0);
    rcu_read_unlock();

    return (depth+1+upwards_depth > EP_MAX_NESTS) ? -1 : 0;
}

/* ... snip ... */

static int ep_get_upwards_depth_proc(struct eventpoll *ep, int depth)
{
    int result = 0;
    struct epitem *epi;

    if (ep->gen == loop_check_gen)
        return ep->loop_check_depth;

    hlist_for_each_entry_rcu(epi, &ep->refs, fllink)
        result = max(result, ep_get_upwards_depth_proc(epi->ep, depth + 1) + 1);
    ep->gen = loop_check_gen;
    ep->loop_check_depth = result;
    return result;
}

The hlist iteration is RCU-safe for the struct epitem objects in the list, because epitems are freed via call_rcu(&epi->rcu, epi_rcu_free). But the recursion target epi->ep is a parent struct eventpoll, and the teardown for those does not go through RCU:

static void ep_free(struct eventpoll *ep)
{
    mutex_destroy(&ep->mtx);
    free_uid(ep->user);
    wakeup_source_unregister(ep->ws);
    kfree(ep);
}

Under CONFIG_PREEMPT=y + CONFIG_PREEMPT_RCU=y, the walker’s rcu_read_lock() does not disable preemption. That means the walker can load an epi and then be preempted; while it is off-CPU, another task on the same CPU can drive the parent eventpoll’s refcount to zero and reach ep_free() → kfree(ep). When the walker resumes, the epitem is still valid under RCU, but following epi->ep lands in a freed struct eventpoll. The same shape exists in reverse_path_check_proc(), another walker on the same surface.
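The difference between an immediate free and a deferred one can be shown with a deterministic, single-threaded toy model. This is only a sketch under loud assumptions: nothing here is kernel code, every toy_* name is hypothetical, the "grace period" is reduced to "no reader is inside a read section", and memory is never actually released (a flag marks the object dead), so the program itself has no undefined behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct eventpoll: a liveness flag and one field. */
struct toy_ep {
    bool live;              /* false once "freed" */
    unsigned long gen;
};

static int readers_active;              /* models rcu_read_lock() nesting */
static struct toy_ep *pending[8];       /* frees waiting for readers to leave */
static size_t npending;

static void toy_read_lock(void) { readers_active++; }

static void toy_read_unlock(void)
{
    assert(readers_active > 0);
    if (--readers_active == 0) {
        /* Simplified "grace period": no reader left, run deferred frees. */
        for (size_t i = 0; i < npending; i++)
            pending[i]->live = false;
        npending = 0;
    }
}

/* Models plain kfree(): the object dies immediately. */
static void toy_free_now(struct toy_ep *ep) { ep->live = false; }

/* Models kfree_rcu(): the object dies only once no reader is inside.
 * (Simplification: with no reader active it dies right away.) */
static void toy_free_deferred(struct toy_ep *ep)
{
    if (readers_active == 0) {
        ep->live = false;
        return;
    }
    assert(npending < sizeof pending / sizeof pending[0]);
    pending[npending++] = ep;
}

/* The walker loads a parent pointer, is "preempted" while the closer runs,
 * then resumes. Returns true if it would touch a dead object. */
static bool walker_hits_dead_object(void (*closer)(struct toy_ep *))
{
    struct toy_ep parent = { .live = true, .gen = 0 };

    toy_read_lock();
    struct toy_ep *ep = &parent;    /* walker: load epi->ep */
    closer(ep);                     /* preemption: last reference dropped */
    bool stale = !ep->live;         /* resume: about to read ep->gen */
    toy_read_unlock();
    return stale;
}
```

With toy_free_now as the closer the walker resumes on a dead object; with toy_free_deferred it does not, because the release waits until the read section has ended. Real RCU grace periods are considerably more involved than this counter, but the ordering guarantee being modelled is the same one the bug is missing.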

Triggering it

The race timeline is published as a second SVG, linked from the original:

/2.svg — the race timeline

The window is “a handful of ARM64 instructions wide” once everything is lined up. To make hitting it tractable, the author scales the walk: ~400 µs walk with 4,096 parents, ~2 ms walk with 8,000 parents and a measured ~4% hit rate. The trigger kernel needs both CONFIG_PREEMPT=y and CONFIG_PREEMPT_RCU=y, which is the default config on Android (and on Pixel 10 specifically). One important inconvenience: Pixel’s default governor throttles to 729 MHz at idle, which stretches some of the timings and lowers reliability — the author calls out that workaround attempts beyond raising thread priority did not produce stable cross-cache results.
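To get a feel for what a ~4% per-attempt hit rate implies, the sketch below computes how many attempts are needed before at least one hit becomes likely. It assumes attempts are independent with a fixed success probability, which is a simplifying assumption of this sketch and not something the original measures; attempts_for is a hypothetical helper name.

```c
#include <assert.h>

/* Smallest number of independent attempts n for which the chance of at
 * least one success, 1 - (1 - p)^n, reaches `target`. */
static unsigned attempts_for(double p, double target)
{
    assert(p > 0.0 && p < 1.0 && target > 0.0 && target < 1.0);

    double miss = 1.0;  /* probability that every attempt so far failed */
    unsigned n = 0;

    while (1.0 - miss < target) {
        miss *= 1.0 - p;
        n++;
    }
    return n;
}
```

Under that independence assumption, attempts_for(0.04, 0.5) evaluates to 17 and attempts_for(0.04, 0.99) to 113, so a 4% race is slow but far from hopeless when each attempt is cheap to retry.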

What gets written

The walker doesn’t just dereference the freed object — it writes to it. The ep->gen = loop_check_gen and ep->loop_check_depth = result assignments land in whatever now occupies the freed slot. The interesting offsets, reproduced verbatim from the source:

Field              Offset  Size  Notes
mtx                0       48    mutex
wq                 48      24    wait_queue_head_t
poll_wait          72      24    wait_queue_head_t
rdllist            96      16    list_head
lock               112     8     rwlock_t
rbr                120     16    rb_root_cached
ovflist            136     8     epitem pointer
ws                 144     8     wakeup_source pointer
user               152     8     user_struct pointer
file               160     8     file pointer
gen                168     8     u64, read then WRITE loop_check_gen
refs               176     8     hlist_head, READ as hlist pointer
loop_check_depth   184     1     u8, WRITE 0 or kernel pointer
refcount           188     4     refcount_t
napi_id            192     4     unsigned int

struct eventpoll: 200 bytes, 4 cachelines, in kmalloc-256. Offsets and sizes are in bytes. Source: original article.
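The layout can be sanity-checked in userspace with a mirror struct. This is a hypothetical reconstruction, not the kernel definition: kernel types are replaced by opaque byte arrays of the listed sizes, the name eventpoll_mirror is invented here, and an LP64 target (8-byte pointers) is assumed. The compile-time assertions confirm that the published offsets and the 200-byte total are mutually consistent, including the 3 bytes of padding after loop_check_depth.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the published layout (illustrative only). */
struct eventpoll_mirror {
    uint8_t  mtx[48];           /* struct mutex */
    uint8_t  wq[24];            /* wait_queue_head_t */
    uint8_t  poll_wait[24];     /* wait_queue_head_t */
    uint8_t  rdllist[16];       /* struct list_head */
    uint8_t  lock[8];           /* rwlock_t */
    uint8_t  rbr[16];           /* struct rb_root_cached */
    void    *ovflist;
    void    *ws;
    void    *user;
    void    *file;
    uint64_t gen;
    void    *refs;              /* hlist_head: a single `first` pointer */
    uint8_t  loop_check_depth;  /* compiler pads 3 bytes after this */
    uint32_t refcount;
    uint32_t napi_id;
};

_Static_assert(offsetof(struct eventpoll_mirror, gen) == 168, "gen offset");
_Static_assert(offsetof(struct eventpoll_mirror, refs) == 176, "refs offset");
_Static_assert(offsetof(struct eventpoll_mirror, loop_check_depth) == 184,
               "loop_check_depth offset");
_Static_assert(offsetof(struct eventpoll_mirror, refcount) == 188,
               "refcount offset");
_Static_assert(offsetof(struct eventpoll_mirror, napi_id) == 192,
               "napi_id offset");
_Static_assert(sizeof(struct eventpoll_mirror) == 200, "total size");
```

If the file compiles, the table is internally consistent for an LP64 ABI; it says nothing about any particular kernel build, whose real offsets depend on configuration.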

Two cases follow. If the 8-byte refs pointer at offset 176 reads as zero (the easy case under init_on_free=1, which Android enables), hlist_for_each_entry_rcu exits immediately and the walker just writes loop_check_gen at offset 168 and a single zero byte at offset 184: nine bytes of silent corruption. If offset 176 reads non-zero, the walker follows it as an hlist pointer, applies container_of() to recover an epitem, dereferences epi->ep, and recurses with a controlled pointer. Each recursion level writes nine bytes at controlled offsets relative to the target. It’s a constrained write primitive, not a free memcpy, but it’s positioned: the only u64 you can write is loop_check_gen, a monotonic global counter you can advance arbitrarily by repeating epoll_ctl(ADD) on a benign descriptor.
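The two cases can be modelled on a plain byte buffer. The sketch below is a userspace simulation, not kernel code: stale_visit is a hypothetical name, the offsets come from the layout table, and the recursion branch is deliberately not followed (the function only reports that the real walker would recurse). It also models the early return in ep_get_upwards_depth_proc() when the stored gen already equals loop_check_gen.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { SLOT_SIZE = 256, OFF_GEN = 168, OFF_REFS = 176, OFF_DEPTH = 184 };

/* Models one stale walker visit to a reused kmalloc-256 slot.
 * Returns the number of bytes written, or -1 when refs is non-zero and
 * the real walker would recurse through it (not modelled here). */
static int stale_visit(uint8_t slot[SLOT_SIZE], uint64_t loop_check_gen)
{
    uint64_t gen, refs;

    memcpy(&gen, slot + OFF_GEN, sizeof gen);
    memcpy(&refs, slot + OFF_REFS, sizeof refs);

    if (gen == loop_check_gen)
        return 0;       /* early return: the walker writes nothing */
    if (refs != 0)
        return -1;      /* would be followed as an hlist pointer */

    /* Empty list: result stays 0, so gen and a zero depth are stored. */
    memcpy(slot + OFF_GEN, &loop_check_gen, sizeof loop_check_gen);
    slot[OFF_DEPTH] = 0;
    return (int)sizeof loop_check_gen + 1;
}
```

On an all-zero slot (the init_on_free=1 picture) with a non-zero loop_check_gen, the function reports nine bytes written: eight at offset 168 and one at offset 184. A second visit with the same generation value writes nothing, which mirrors the memoisation in the real walker.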

Can you cross-cache this?

The honest answer in the original is “not without privileged scheduling control.” struct eventpoll lives in kmalloc-256, which on the test device is an order-1 slab with 32 objects per slab and cpu_partial=52. Same-cache reclaim is the easy case: spray kmalloc-256-sized objects of your choice after the free and you’ll likely land one in the freed slot via the LIFO per-CPU freelist. But the offsets you can write are fixed by the walker, and finding a same-cache object whose layout puts a useful pointer at offset 168 (or 176, or 184) is an exercise the author did not pursue deeply.
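The slab geometry quoted above follows from simple arithmetic, sketched here. The helper name objs_per_slab is hypothetical, the 4 KB page size is taken from the article's description of the test device, and the calculation ignores any per-slab metadata or debug padding, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Objects that fit in one slab of 2^order pages, ignoring per-slab
 * metadata and debug padding (a simplification). */
static size_t objs_per_slab(unsigned order, size_t page_size, size_t obj_size)
{
    assert(obj_size > 0);
    return ((size_t)1 << order) * page_size / obj_size;
}
```

For an order-1 slab of 4096-byte pages and 256-byte objects this gives 2 × 4096 / 256 = 32, matching the figure reported for the test device. It also means a slab can only leave the cache once all 32 of its objects are free.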

Cross-cache is the more ambitious move: chase the freed slab back to the buddy allocator, then re-allocate the page as something with high security relevance — the canonical target on ARM64 is a PTE page (order-0, 4 KB), so a write at the right offset within the page could plant or modify a page-table entry. The path is slab → PCP → buddy → PTE-page reallocation. The author measured the pipeline at ~100 ms minimum end-to-end against a ~2 ms race window with ~4% hit rate, on a CPU that idles at 729 MHz under Pixel’s governor. The arithmetic doesn’t close cleanly without SCHED_FIFO or equivalent priority access.

The fix

The fix is a one-line patch: defer the struct eventpoll free behind RCU, the same way epitem already was. Patch diff verbatim:

static void ep_free(struct eventpoll *ep)
{
    mutex_destroy(&ep->mtx);
    free_uid(ep->user);
    wakeup_source_unregister(ep->ws);
-   kfree(ep);
+   kfree_rcu(ep, rcu);
}

Mainline commit 07712db80857: “eventpoll: defer struct eventpoll free to RCU grace period.” With the change, the free is deferred until after an RCU grace period, and a grace period cannot complete while the walker is still inside its rcu_read_lock() section. The object therefore stays allocated for the whole walk, the dereference is safe, and the writes land in the still-allocated structure rather than in whatever now occupies the slot.

Closing thoughts

The author closes with an honest assessment: the bug is real, the primitive is real, the patch is real, but the exploitability of the resulting constrained write on a stock Android device is hampered by frequency scaling and the tight cross-cache timing window. The post ends with an explicit invitation for anyone who can stabilise an arbitrary read/write on top of this bug to share their work.

Key Takeaways

  • Class of bug: use-after-free in fs/eventpoll.c’s graph-walk code under RCU read lock, with the teardown side not deferring through RCU.
  • Root cause introduced in: a March 2023 optimisation that replaced the global epmutex with per-instance refcounting and moved walkers into rcu_read_lock(), while ep_free() still called plain kfree(ep).
  • Required config: CONFIG_PREEMPT=y + CONFIG_PREEMPT_RCU=y (default on Android, including Pixel 10).
  • Affected versions: mainline Linux 6.6+ until the fix lands; affected Android kernels at the same point in time.
  • Primitive achieved: a constrained write of loop_check_gen (a u64 monotonic counter) at offset 168 of the freed object, plus a zero byte at offset 184. Chained recursion through a controllable epi->ep writes the same nine-byte shape at attacker-directed targets.
  • Exploitability: reachable from any unprivileged process, but real-world race success on stock Android is ~4% with 8,000 parent nodes and is undermined by 729 MHz idle throttling. Same-cache reclaim is feasible; cross-cache to PTE pages was attempted and not made stable.
  • Fix: commit 07712db80857, one line: kfree(ep) → kfree_rcu(ep, rcu).

Defensive Recommendations

  • Pull the fix. Confirm your kernel build includes commit 07712db80857 or the corresponding -stable backport.
  • For Android vendors: verify the Android-side mainline-merge path picks this up; the bug class is reachable from any sandboxed app and the affected code is in Android Common Kernel 6.6+.
  • For kernel maintainers: the structural lesson is the asymmetry — if walkers are moved under RCU, every object they can traverse needs RCU-deferred free, not just the obvious one. Audit similar “mutex → RCU + refcount” refactors elsewhere in the tree (epoll is unlikely to be unique).
  • For defenders: instrument or alert on heavy epoll_ctl(ADD) bursts against deeply nested epoll graphs from unprivileged processes. The exploit shape requires constructing thousands of parent nodes to enlarge the race window — that’s an unusual workload outside a few legitimate event-loop runtimes.
  • For exploit detection: KASAN-equivalent and KFENCE-style canary allocators catch this bug on the first iteration of the walk. Even if you can’t run them in production, they should be on for every CI fuzzing job that touches fs/eventpoll.c.
  • For high-assurance configurations: where feasible, set init_on_free=1 (which Android enables by default) — it doesn’t close the bug but it does turn one of the two exploit branches into a 9-byte silent corruption rather than a chained-recursion target.
  • For research teams: this is a textbook example of how a benchmark-driven optimisation can quietly widen a UAF surface. Bug-class detection on lock-replacement patches is a worthwhile static-analysis target.

Conclusion

This is a clean illustration of the canonical RCU pitfall: walkers under rcu_read_lock() are safe only if every object they can touch is freed via call_rcu/kfree_rcu. The 2023 epoll optimisation forgot half of that contract for the struct eventpoll teardown path; three years and a million-line kernel later, the fix is a one-line edit but the gap was reachable from unprivileged userspace on every Linux 6.6+ kernel until commit 07712db80857. The original write-up by guysrd is worth reading in full for the timing measurements alone — especially if you defend Linux kernel attack surface and want a concrete demonstration of how big a real-world race window can be when the workload is built specifically to enlarge it.

This article is an independent English-language rewrite of “The epoll uaf”, originally published at guysrd.github.io. All vulnerability research, code excerpts from fs/eventpoll.c, the struct eventpoll layout table, the patch diff, and the SVG diagrams remain the work of the original author. Please cite the source when referencing this material.