The Stack, the Heap, and Data Layout

Where your bytes live and how they line up - automatic vs dynamic storage, struct padding and alignment, and laying data out for the cache across seven systems languages.

Every systems language eventually forces the same question on you: where do these bytes actually live, and in what order? A value can sit in a register, on the call stack, in a static segment fixed at link time, or in a block you carved out of the heap at runtime. The lifetime, the cost, and the cache behavior of your program all follow from that choice. This article walks through automatic versus dynamic storage, the mechanics of struct padding and alignment, why layout decides cache performance, and the tools each language gives you to inspect layout - sizeof, @sizeOf, size_of, and friends - across C, C++, HolyC, Zig, Hare, Odin, and Forth.

Two kinds of memory, one mental model

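It is worth pinning down terms before the code, because "stack" and "heap" are conventions, not hardware. "Stack" is shorthand for storage that is tied to a function call and released when the call returns; "heap" is shorthand for storage you request at runtime and release yourself.

The C standard names the underlying ideas explicitly as storage durations: an object has automatic, static, thread, or allocated storage duration. Every language below maps onto the same four ideas; only the syntax and the safety rails differ.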

C: malloc, free, and the duty of care

In C, a plain local has automatic storage; malloc/calloc/realloc hand back allocated storage you must free exactly once.

#include <stdlib.h>
#include <string.h>

typedef struct { int x, y; } Point;

Point on_stack(void) {
    Point p = {1, 2};   /* automatic: lives in this frame only      */
    return p;           /* returned by value - copied, so this is OK */
}

Point *on_heap(void) {
    Point *p = malloc(sizeof *p); /* allocated: outlives the call    */
    if (!p) return NULL;          /* malloc can fail - always check  */
    *p = (Point){3, 4};
    return p;                     /* caller now owns it and must free */
}

Note sizeof *p rather than sizeof(Point): it asks the compiler for the size of whatever p points at, so the allocation stays correct even if the type changes. Returning &p from on_stack instead of p would be the classic dangling-pointer mistake - the frame is gone the instant the function returns.

C++: the same stack/heap split, but lifetimes get destructors

C++ keeps C's storage durations and adds RAII: an object's destructor runs deterministically when its scope ends, so a stack object can own a heap allocation and free it automatically.

#include <memory>
#include <vector>

struct Point { int x, y; };

void demo() {
    Point local{1, 2};                       // automatic storage
    auto owned = std::make_unique<Point>();  // heap, freed when 'owned' dies
    std::vector<Point> grid(1000);           // heap buffer, freed by ~vector

    // No explicit delete anywhere: ~unique_ptr and ~vector reclaim the heap
    // when this scope exits, in reverse order of construction.
}

The heap is still the heap, but the responsibility for freeing it has been moved onto an automatic-storage object whose destruction the compiler guarantees. This is the single biggest ergonomic difference from C, and the modern languages below (Zig, Hare, and Odin) deliberately reject it in favor of explicit, visible cleanup with defer.

HolyC: per-task heaps in a single address space

HolyC, Terry A. Davis's language for TempleOS, is a C dialect with one of the most distinctive memory models in this group. TempleOS runs entirely in 64-bit ring 0 in a single flat address space with no memory protection. Memory is fully manual and GC-free, but the heap is per task: MAlloc() pulls from the current task's data heap and Free() returns it.

// HolyC: MAlloc/Free instead of malloc/free.
class Point { I64 x, y; };

U0 Demo()
{
  Point *p = MAlloc(sizeof(Point)); // grabs from this task's data heap
  p->x = 3;
  p->y = 4;
  Print("size = %d\n", MSize(p));   // real allocated size; large requests
                                    // round up to a power of two
  Free(p);                          // Free(NULL) is permitted
}
Demo;                               // no main() required - the JIT runs this

Two things make HolyC unusual and worth respecting on its own terms. First, heaps are tied to tasks: when a task dies, its code and data heaps are reclaimed automatically, and you can even allocate against another task's heap or spin up an independent one with HeapCtrlInit(). Second, because there is no memory protection whatsoever, a double-free or a wild pointer can corrupt the entire running system. That is not a flaw to be patched - Davis designed TempleOS as a "modern Commodore 64," a machine simple enough for one person to understand completely, and the unforgiving memory model is part of that deliberate simplicity. MSize() reporting the rounded-up real size, rather than your requested size, is a small window into the bump-and-buddy style allocator underneath.

Zig, Hare, Odin: explicit, GC-free, with defer

The three modern languages here converge on the same philosophy - no garbage collector, no hidden allocations, no RAII destructors - and replace destructors with defer, which schedules cleanup to run at scope exit.

const std = @import("std");
const Point = struct { x: i32, y: i32 };

fn demo(allocator: std.mem.Allocator) !void {
    var local = Point{ .x = 1, .y = 2 }; // automatic storage
    _ = &local;

    const p = try allocator.create(Point); // heap; allocator is explicit
    defer allocator.destroy(p);            // runs when demo() returns
    p.* = .{ .x = 3, .y = 4 };

    const buf = try allocator.alloc(u8, 1024);
    defer allocator.free(buf);             // paired with its alloc, visibly
}

Zig's signature move is that allocation is always visible at the call site: any function that needs the heap takes an std.mem.Allocator parameter, so you can swap a GeneralPurposeAllocator (which detects leaks and double-frees), an ArenaAllocator, or a FixedBufferAllocator without touching the code that uses it.

type point = struct { x: int, y: int };

fn demo() void = {
	let local = point { x = 1, y = 2 };   // automatic storage
	let p = alloc(point { x = 3, y = 4 })!; // heap; alloc can fail (nomem),
	                                        // so `!` asserts it succeeded
	defer free(p);                          // released at scope exit
};

Hare uses a built-in alloc expression and free, with defer keeping the two close together. The runtime is tiny and, when linking libc, alloc goes straight through malloc - what you write is close to what runs.

import "core:mem"

Point :: struct { x: int, y: int }

demo :: proc() {
	local := Point{1, 2}          // automatic storage
	p := new(Point)               // heap via context.allocator
	defer free(p)                 // released at scope exit
	p.x, p.y = 3, 4

	buf := make([]u8, 1024)       // heap slice via the same allocator
	defer delete(buf)
}

Odin threads the allocator differently from Zig: an implicit context is passed as a hidden argument to every Odin-convention procedure, and it carries context.allocator. So new, make, free, and delete route through the in-scope allocator without you threading it through every call. Swap context.allocator for an arena and an entire subsystem's memory frees in a single reset.

Forth: the dictionary is your static allocator

Forth has no type system and no garbage collector; memory is fully exposed. Static "data space" is carved directly out of the dictionary - the same contiguous region where word definitions live - and HERE is literally the bump pointer to the next free address.

\ Static data space: ALLOT moves HERE forward; @ and ! are fetch/store.
CREATE point  2 CELLS ALLOT     \ reserve two cells (x, y)
3 point !                       \ store 3 at point+0 (x)
4 point CELL+ !                 \ store 4 one cell past point (y)

\ Optional heap word set: ALLOCATE / FREE behave like malloc / free.
1024 ALLOCATE THROW  CONSTANT buf   \ ( -- addr )  ALLOCATE pushes addr & ior
buf FREE THROW                       \ pair every ALLOCATE with a FREE by hand

CREATE ... ALLOT is static allocation by pointer arithmetic: you are moving HERE forward and remembering where you started. The optional ALLOCATE/FREE/RESIZE word set provides C-style dynamic memory, but it is opt-in and entirely manual - there is no safety net, only the raw mechanism.

Padding and alignment: why sizeof lies to beginners

Here is the fact that surprises everyone the first time: a struct is usually larger than the sum of its fields. The compiler inserts invisible padding so that each field lands on an address that is a multiple of its alignment - and the whole struct is padded out to a multiple of the strictest alignment among its members, so every element of an array of the struct stays aligned too.

Consider this struct in C:

#include <stdio.h>
#include <stddef.h>

struct Bad {
    char  a;   // offset 0
    // 7 bytes of padding so 'b' is 8-byte aligned
    double b;  // offset 8
    char  c;   // offset 16
    // 7 bytes of tail padding so the struct is a multiple of 8
};             // sizeof == 24, not 10

struct Good {
    double b;  // offset 0
    char  a;   // offset 8
    char  c;   // offset 9
    // 6 bytes of tail padding
};             // sizeof == 16

int main(void) {
    printf("Bad:  size=%zu  offset(b)=%zu\n",
           sizeof(struct Bad), offsetof(struct Bad, b));   // 24, 8
    printf("Good: size=%zu\n", sizeof(struct Good));        // 16
}

Same three fields, but reordering from largest-to-smallest shrank the struct from 24 to 16 bytes - a third smaller, for free. offsetof (from <stddef.h>) tells you exactly where each field lands and is the honest way to see the holes. The rule of thumb that falls out of this: declare fields in descending order of alignment (or size) to minimize padding.

C++ obeys the same alignment rules and exposes them through the language with alignof and alignas:

#include <cstdio>
#include <cstddef>

struct alignas(64) CacheLine {  // force this struct onto its own cache line
    int counter;                // ... lots of tail padding up to 64 bytes
};

static_assert(sizeof(CacheLine) == 64);
static_assert(alignof(int) == 4);

int main() {
    std::printf("alignof(double)=%zu\n", alignof(double)); // 8 on x86-64
}

alignas(64) is the standard trick for avoiding false sharing in concurrent code: pad a per-thread counter out to a full cache line so two threads never contend on the same line.

How the other languages let you query and control layout

Every language here gives you a compile-time size operator, and several give you alignment introspection and explicit layout control.

Zig uses builtins - @sizeOf, @alignOf, @offsetOf - and, crucially, distinguishes a normal struct (whose fields the compiler is free to reorder) from an extern struct (C ABI layout, fields in declared order) and a packed struct (bit-level layout, no padding between fields):

const std = @import("std");

const Normal = struct { a: u8, b: u64, c: u8 }; // compiler may reorder fields
const CLike  = extern struct { a: u8, b: u64, c: u8 }; // C layout: padded, 24
const Packed = packed struct { a: u8, b: u64, c: u8 }; // bit-packed: no padding between fields

comptime {
    // @sizeOf(Normal) is implementation-defined: Zig may reorder its
    // fields to cut padding (here, down to 16), so we don't assert on it.
    // extern structs keep C layout, so their size is fixed and assertable.
    std.debug.assert(@sizeOf(CLike) == 24);
    std.debug.assert(@alignOf(CLike) == 8);
    std.debug.assert(@offsetOf(CLike, "b") == 8);
}

That a normal Zig struct may be silently reordered is a real difference from C: it lets the compiler minimize padding without you reordering fields by hand, at the cost of a defined memory layout. When layout must match C or hardware, reach for extern or packed.

Hare exposes size, align, and offset as built-in expressions, and like C it does not reorder fields:

type sample = struct {
	a: u8,
	b: u64,
	c: u8,
};

export fn main() void = {
	const sz = size(sample);          // 24: padded like C
	const al = align(sample);         // 8
	let s = sample { a = 0, b = 0, c = 0 };
	const off = offset(s.b);          // 8 - offset takes a field access
};                                    //     on an object, not a type

Odin uses size_of, align_of, and offset_of. Like C - and unlike Zig - Odin keeps struct fields in their declared order by default, so the natural layout already matches the C ABI; struct tags then let you remove padding (#packed) or force a specific alignment (#align):

import "core:fmt"

Sample      :: struct { a: u8, b: u64, c: u8 }          // natural alignment (C layout)
Packed      :: struct #packed { a: u8, b: u64, c: u8 }  // no padding at all
Aligned     :: struct #align(64) { counter: int }       // whole struct aligned

main :: proc() {
	fmt.println(size_of(Sample))          // 24
	fmt.println(align_of(Sample))         // 8
	fmt.println(offset_of(Sample, b))     // 8 - takes the field name, not a string
	fmt.println(size_of(Packed))          // 10 - packed, padding removed
}

HolyC has the familiar C sizeof, and its types are fixed-width by name - I8/U8/I16/U16/I32/U32/I64/U64, F64 as the only float, and U0 as a genuinely zero-sized void - which makes layout reasoning refreshingly direct since there is no ambiguity about how wide an int is.

// HolyC: fixed-width types make sizeof predictable.
class Sample { U8 a; U64 b; U8 c; };  // padded to natural alignment
Print("%d\n", sizeof(Sample));        // 24
Print("%d\n", sizeof(U0));            // 0 - U0 is truly zero-sized

Forth, being typeless, has no sizeof for structs - you compute layout yourself, scaling counts into address units with CELLS (one cell is the machine word) and CHARS, which is the most honest version of what every other compiler is doing for you under the hood.

Layout is performance: the cache-friendliness payoff

Alignment is not just correctness pedantry; it is the lever that controls how your program meets the memory hierarchy. A trip to main memory costs on the order of 100 times an L1 cache hit, and the CPU never fetches one byte - it fetches a whole cache line (typically 64 bytes). So the real question for hot loops is: of the 64 bytes you just paid to drag in, how many did you actually use?

This is where data-oriented design comes in, and Odin makes it a first-class feature. Suppose you have ten thousand entities and a loop that only touches their positions and velocities. The naive Array of Structures (AoS) layout interleaves every field:

// Array of Structures: each Entity's fields are adjacent in memory.
Entity :: struct {
	pos:    [3]f32,   // the loop reads this...
	vel:    [3]f32,
	health: f32,
	name:   [32]u8,   // ...but this rides along in the cache line, wasted
}
entities: [10_000]Entity

update :: proc() {
	for &e in entities {
		e.pos += e.vel  // every iteration pulls in name[], health, etc.
	}
}

The loop uses 24 bytes of each 60-byte Entity (pos and vel); the other 36 bytes (health and name) ride into cache unused, so more than half of every line fetched is wasted. The fix is Structure of Arrays (SoA): store all the pos values contiguously, all the vel values contiguously, and so on, so the update loop streams through tightly packed data with no waste. Odin's #soa directive does this transparently - you write struct-like code, the compiler stores it column-wise:

// Structure of Arrays via #soa: the update loop touches only pos and vel.
Entity :: struct {
	pos:    [3]f32,
	vel:    [3]f32,
	health: f32,
	name:   [32]u8,
}

entities: #soa[10_000]Entity   // laid out column-by-column under the hood

update :: proc() {
	for &e in entities {
		e.pos += e.vel    // pos[] and vel[] are now each contiguous streams
	}
}

The access syntax is unchanged - e.pos still works - but the bytes are reorganized so the CPU prefetcher can run flat out. For large arrays of "wide" structs where loops touch only a few fields, this can be the difference between memory-bound and compute-bound code, often several times faster, with the same algorithm.

You can reach the same destination by hand in any of these languages - declare parallel arrays instead of an array of structs - but Odin's #soa is notable for letting you flip between AoS and SoA with one keyword while keeping struct-style access. C++ programmers reach for the same idea with parallel std::vectors or libraries; Zig and Hare do it with explicit parallel slices. The principle is universal: lay your data out to match how the loop reads it, not how the human reads it.

A second, cheaper win is the one from the padding section: shrink the struct itself. Reordering fields from 24 bytes to 16 means 50% more of them fit per cache line, which directly raises the hit rate of any traversal - no algorithm change, just honesty about alignment.

Choosing where bytes live: a practical summary

Memory management is not a chore bolted onto these languages - in C, HolyC, Zig, Hare, Odin, and Forth it is the language. Understanding the stack, the heap, and the byte-level shape of your data is the whole game.