Pointers and Memory: From the First Byte to the Last Cache Miss

Black-and-white photograph by Martin Er
Photograph by Martin Er, via Unsplash.

A complete guide to how computer memory works, from what a byte is through virtual address spaces, heap bugs, and the MMU, all the way to CPU cache lines, false sharing, and memory ordering.

Memory is the most fundamental thing a computer does, and the least understood by most people who write code for a living. This article starts at the very beginning (what a byte is, what an address is) and ends at the kinds of problems that appear only in production, under load, on multi-core machines. No language is assumed. C code appears occasionally as the thinnest possible wrapper around the hardware model, not as the subject itself.

Part 1. Memory Is Just Numbered Boxes

Imagine a very long row of small boxes. Every box holds exactly eight light switches, eight bits. Together, those eight switches can encode any value from 0 to 255 (2⁸ = 256 possibilities). This eight-bit unit is a byte, the smallest individually addressable piece of RAM in every modern computer.

Now label each box with a sequential number, starting at zero. That label is its address. On a 64-bit machine, addresses are 64-bit numbers, meaning the machine can theoretically address up to 2⁶⁴ distinct bytes, about 18 exabytes. Real hardware addresses far less, but the address space is vast by design.

When your CPU wants to read or write a value, it places an address on the address bus, a set of parallel wires connecting the processor to RAM, and the memory hardware routes the operation to that exact byte (or group of bytes). The CPU does not look for values by content; it looks for them by location.

A pointer is nothing more than a variable whose value is one of these addresses.

That is the complete definition. A pointer is a number. It happens to be interpreted as “go look at this location in the box array.” Everything else (the syntax, the type system, the ownership rules) is built on top of this one idea.

Why Store an Address Instead of the Value?

If you have a value already, why keep a pointer to it instead of just working with the value directly?

Three reasons:

1. Large data. Copying a 100 MB structure every time you pass it to a function would be catastrophic. Pass a pointer (8 bytes) and the receiver operates on the original.

2. Shared mutation. Two parts of a program that need to see each other’s changes must agree on a single location. A pointer is that agreement.

3. Dynamic structures. A linked list or a tree does not have a fixed size at compile time. The structure is built at runtime by allocating nodes and connecting them with pointers, each node’s address stored in the previous node.

The Size of a Pointer

On a 64-bit system, a pointer is always 8 bytes, regardless of what it points to. Whether it points to a single character or to a 10 GB buffer makes no difference; the pointer itself holds one 64-bit address. On a 32-bit system, pointers are 4 bytes, limiting the addressable space to 4 GB. This is why 32-bit programs could not use more than 4 GB of RAM, and why 64-bit CPUs became necessary.

Part 2. Pointer Arithmetic

The type of a pointer (what kind of thing it points to) does not affect the address stored inside it. But it affects how arithmetic on the pointer behaves.

If p is a pointer to a 4-byte integer and you compute p + 1, you want the address of the next integer, which is 4 bytes ahead, not 1. The compiler knows the size of the pointed-to type and automatically scales the offset. Writing p + 1 produces an address 4 bytes ahead; p + 3 produces an address 12 bytes ahead.

int arr[5] = {10, 20, 30, 40, 50};
int *p = arr;          // p points to arr[0]

// These two are identical:
int x = arr[3];        // standard indexing
int y = *(p + 3);      // pointer arithmetic — same result: 40

Array indexing is pointer arithmetic in disguise. The notation arr[i] is literally defined as *(arr + i): add i positions (scaled by the element size), then dereference. Knowing this makes it clear why going out of bounds is not automatically caught: the arithmetic simply produces an address past the end of the array, and the hardware will read or write there without complaint.

Part 3. The Process’s View of Memory

Here is the first major illusion. Your program does not see physical RAM. What it sees is a virtual address space, a private mapping that the operating system constructs specifically for that process. Every process gets its own virtual address space, isolated from every other process’s.

On a 64-bit Linux machine, the lower 128 terabytes of the virtual address space belong to the user process; the upper half is reserved for the kernel. The process works with virtual addresses. Behind the scenes, the hardware translates each virtual address to a physical location in RAM, but that translation is invisible to the program.

The virtual address space is divided into distinct regions, each with a different purpose:

┌─────────────────────────┐  High addresses
│        Kernel           │  (reserved — programs cannot touch this)
├─────────────────────────┤
│         Stack           │  grows downward ↓
│           ↓             │
│                         │
│           ↑             │
│    Memory-mapped area   │  shared libs, mmap files, anonymous mmap
│           ↑             │
│          Heap           │  grows upward ↑
├─────────────────────────┤
│    BSS segment          │  uninitialized global variables (zero-filled)
├─────────────────────────┤
│    Data segment         │  initialized global variables
├─────────────────────────┤
│    Text segment         │  machine code (read-only)
└─────────────────────────┘  Low addresses (address 0)

The text segment holds the compiled machine instructions. It is mapped read-only; if a program tries to write to it, the OS kills the process with a segfault. This prevents accidental (or malicious) self-modification.

The data and BSS segments hold global and static variables. BSS is zero-filled at startup; the OS does not need to store actual zero bytes in the file, just note how much space to zero on load.

The heap grows upward as the program requests more memory. It starts right above BSS and expands as needed.

The stack grows downward from the high end of the user space. The heap and the stack grow toward each other; if they ever meet, the program runs out of memory.

The memory-mapped area in the middle holds shared libraries (the standard library, graphics drivers, and so on), as well as any files or anonymous regions the program explicitly maps.

Part 4. The Stack

The stack is the simplest and fastest region of a program’s memory.

Every time a function is called, the CPU reserves a block of space on the stack called a stack frame. That frame contains:

  • The function’s local variables
  • The return address (where to jump back when the function ends)
  • Saved values of registers the function will use

When the function returns, its stack frame is discarded instantly. The CPU adjusts a single register, the stack pointer (SP), to point to the previous frame. No allocation, no bookkeeping, no system call: one register increment, and the memory is available again.

This is why local variables are so fast to create. There is no “allocation” in any meaningful sense; just an increment of a register.

The Stack’s Limit

The stack has a fixed maximum size, typically around 8 MB on Linux. Exceed it (usually through very deep recursion) and the program crashes with a stack overflow. The name of the famous programming Q&A site is not a coincidence.

Dangling Stack Pointers

Because a stack frame is destroyed the moment its function returns, returning a pointer to a local variable is undefined behaviour:

int *get_number(void) {
    int x = 42;
    return &x;   // BUG: x is destroyed when this function returns
}

The caller receives a pointer that looks valid (the address is non-null, the bytes are probably still there). But the next function call will reuse that memory, silently overwriting it. The program may appear to work correctly for days before the corruption surfaces. This is the archetypal memory bug: invisible, non-deterministic, and often far from the crash site.

Part 5. The Heap

The heap is for memory that must outlive the function that creates it. When a program requests heap memory at runtime, the operating system provides it via a system call (brk or mmap on Linux). Between the program and the OS sits a heap allocator, a library component that manages the raw memory: breaking it into blocks of different sizes, tracking which blocks are free, and recycling them when released.

The price of this flexibility is responsibility. Heap memory is not cleaned up automatically (in languages without garbage collection). The programmer must release it explicitly. Getting this wrong is the source of every classic memory bug.

Use-After-Free

int *p = malloc(4);   // allocate 4 bytes on the heap
*p = 42;
free(p);              // return the memory to the allocator
*p = 99;              // BUG: the allocator may have given this memory to someone else

After free, the pointer p still holds the old address. The bytes at that address often still contain 42 for a moment; they have not been zeroed. But the allocator may immediately give that memory to a different allocation. When the program writes 99 to the freed address, it is writing into another object’s storage.

Use-after-free is the root of roughly half of all high-severity browser security vulnerabilities. An attacker who can influence what gets allocated in the freed slot between the free and the write can turn an apparent memory corruption into controlled code execution.

Double-Free

free(p);
free(p);  // BUG: calling free twice on the same pointer

The heap allocator uses the freed memory itself to store bookkeeping metadata (a linked list of free blocks, sizes, and flags). Calling free twice overwrites this metadata. The corruption is often silent; it surfaces later, at an unrelated allocation, as a crash or heap corruption error.

Buffer Overflow

char name[8];
strcpy(name, "Christopher");  // BUG: "Christopher" is 11 bytes + null terminator

Writing past the end of an array overwrites whatever comes next. On the stack, that is typically another local variable, the saved frame pointer, or the return address. Overwriting the return address is the classic stack-based exploit: redirect the CPU to attacker-controlled code when the function returns.

On the heap, a buffer overflow overwrites the allocator’s metadata or the contents of an adjacent allocation, usually with less predictable but equally damaging results.

Memory Leak

while (1) {
    char *buf = malloc(1024);  // allocate 1 KB
    process(buf);
    // BUG: forgot to call free(buf)
    // 1 KB disappears from available memory every iteration
}

A leak does not crash the program immediately; it slowly consumes available memory. Long-running server processes are the natural habitat of leaks; they grow and grow until the OS kills them or the operator restarts them. Profiling tools like Valgrind and AddressSanitizer can detect leaks by tracking every allocation and checking whether it is freed before program exit.

Part 6. Virtual Memory, the MMU, and the TLB

We said the process sees virtual addresses. How do those get converted to physical RAM locations?

The hardware component responsible is the Memory Management Unit (MMU), integrated into every modern CPU. The MMU operates on units called pages, typically 4 KB each. Every virtual address is split into two parts:

  • A virtual page number (VPN): which page?
  • An offset within that page: where in that page?

The offset passes through unchanged. The VPN is translated to a physical frame number (PFN) using a data structure called the page table, which the OS maintains in RAM. The physical address is then (PFN × 4096) + offset.

Virtual address:  [ virtual page number | offset ]

                       page table

Physical address: [ phys. frame number  | offset ]

The Problem: Page Tables Are in RAM

Consulting the page table for every single memory access would double the cost of all memory operations. The solution is the Translation Lookaside Buffer (TLB), a small, extremely fast cache inside the MMU that stores the most recently used virtual-to-physical mappings.

  • TLB hit: The mapping is cached. The translation costs approximately one CPU cycle.
  • TLB miss: The mapping is not cached. The CPU must walk the page table, traversing a multi-level tree of page table entries stored in RAM. On x86-64, a full page table walk touches four separate memory locations. Only then is the translation known and the TLB updated.

The TLB typically holds only 64 to 1024 entries. If a program accesses memory spread across many pages (for example, by chasing pointers through a fragmented linked list) it constantly evicts TLB entries, causing repeated page table walks. This is one reason why cache-friendly, contiguous data structures (arrays) are often dramatically faster than pointer-heavy structures (linked lists, trees) even when the algorithmic complexity is the same.

Page Faults

When a program accesses a virtual address that has no mapping in the page table at all, the CPU raises a page fault, a trap to the operating system. The OS decides what to do:

  • If the address is valid but the page is on disk (swapped out), the OS reads it back into RAM, updates the page table, and resumes the program. The program never knew it happened.
  • If the address is invalid (NULL dereference, access past the stack), the OS sends the process a signal, on Linux a SIGSEGV (segmentation violation). The process usually terminates.

ASLR: Randomising the Map

Virtual memory makes a powerful security technique possible. Address Space Layout Randomisation (ASLR) randomises the base addresses of the stack, heap, and loaded libraries on every run. An attacker who wants to overwrition cate a return address with the address of a useful funcnnot know that address in advance, as it changes on every execution. ASLR is enabled by default in every modern OS.

Part 7. CPU Caches and the True Cost of Memory Access

RAM is fast compared to a hard disk, but extraordinarily slow compared to the CPU. On a 3 GHz processor, one CPU cycle lasts about 0.3 nanoseconds. A round trip to main memory takes roughly 60–100 nanoseconds, equating to 200 to 300 CPU cycles during which the processor is waiting, doing nothing useful.

The solution is a hierarchy of caches, each smaller and faster than the one below:

LevelTypical sizeTypical latencyShared by
Registers~1 KB0 cyclesOne core
L1 cache32–64 KB~4 cyclesOne core
L2 cache256 KB–1 MB~12 cyclesOne core
L3 cache4–64 MB~40 cyclesAll cores on chip
RAM8–512 GB~200 cyclesAll cores
NVMe SSDTBs~50 000 cyclesMachine

When the CPU needs a value, it checks L1 first; on a miss, L2; on a miss, L3; on a miss, RAM.

Cache Lines: The Unit of Transfer

The cache does not operate byte by byte. It works in units called cache lines, typically 64 bytes each. When you access any single byte, the CPU fetches the entire 64-byte cache line containing it and stores it in cache. If you then access the next byte, or the byte after that, they are already in the cache, with no RAM access required.

This design exploits spatial locality: if you access one part of a data structure, you are likely to access nearby parts soon. Dense arrays exploit this perfectly; iterating through an array fetches one cache line per 64 bytes accessed. A linked list, where each node is a separately allocated heap object potentially scattered across memory, fetches a new cache line for every node visited.

Temporal locality is the other principle: recently accessed data is likely to be accessed again soon. Hot variables (loop counters, frequently read fields) stay warm in cache; the cache evicts cold data that has not been touched in a while.

False Sharing: A Concurrency Trap

On a multi-core machine, each core has its own L1 and L2 cache, but all cores share L3 (and RAM). If two cores hold a copy of the same cache line and one of them writes to it, the hardware must invalidate the other core’s copy, even if they were writing to different bytes within that cache line.

If thread A repeatedly writes to counter_a and thread B repeatedly writes to counter_b, and both variables happen to live on the same 64-byte cache line, the two cores spend most of their time fighting over ownership of that cache line rather than doing useful work. The performance loss can be severe, sometimes orders of magnitude slower than the sequential case.

The fix is to ensure that each thread’s private data occupies its own cache line, typically by padding:

struct Counter {
    long value;
    char _pad[56];  // pad to 64 bytes so the next Counter is on its own line
};

This wastes memory but eliminates the cache line bouncing.

Part 8. Memory Ordering, When CPUs Lie

Modern CPUs do not execute instructions in the order you wrote them. They reorder memory operations to maximise throughput, running a ready read before an earlier write that is still waiting on a result.

Additionally, CPUs have store buffers: a write goes into the store buffer first and is committed to cache asynchronously. Other cores may not see the write for some time after the instruction has “executed.”

On x86, this reordering is relatively constrained, but stores to cache are always seen in program order by other cores (though loads may be reordered). On ARM and POWER architectures, the memory model is significantly more relaxed; stores and loads can be reordered in almost any way the hardware finds efficient.

This has a concrete consequence for concurrent programs. Consider:

Thread A:           Thread B:
x = 1;              while (ready == 0) {}
ready = 1;          print(x);

Thread A sets x to 1, then sets ready to 1 to signal thread B. Thread B waits until ready is 1, then reads x. On a sequentially consistent machine, thread B would always print 1. On real hardware without synchronisation, thread B might print 0, as the store to x may still be in thread A’s store buffer when thread B observes ready = 1.

The solution is a memory barrier (also called a memory fence): an instruction that forces all preceding stores to be visible to other processors before any subsequent loads or stores proceed. This is the hardware mechanism underlying volatile, atomic, and mutex operations in every language. They ultimately compile down to barrier instructions.

Understanding this is not academic. Every lock you acquire and every atomic operation you use prevents a specific class of reordering. Using the wrong ordering (e.g., relaxed atomics when acquire/release is needed) produces bugs that appear only under high load on many-core machines and are nearly impossible to reproduce in a debugger.

Part 9. Memory-Mapped Files and Copy-on-Write

Memory-Mapped Files

The virtual memory system is not limited to mapping RAM. The OS can map the contents of a file directly into the process’s virtual address space. Accessing the mapped addresses causes the OS to page in the relevant file blocks on demand, exactly like a page fault on anonymous memory.

int fd = open("data.bin", O_RDONLY);
size_t size = /* file size */;
void *map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
// Now 'map' points to the file's contents as if they were RAM
// No explicit read() calls needed

The benefits are significant:

  • Multiple processes that map the same file share the same physical pages in RAM, with no duplication.
  • Only the accessed portions of the file are ever loaded into RAM.
  • On write (MAP_SHARED), changes are automatically written back to the file.

Databases, inter-process shared memory, and high-performance logging systems commonly exploit this mechanism.

Copy-on-Write (CoW)

When Unix-like systems create a new process with fork(), the child process starts as an exact copy of the parent. Copying all of the parent’s memory would be prohibitively expensive. Instead, the OS uses copy-on-write: both parent and child share the same physical pages, with the page table entries marked as read-only for both.

When either process attempts to write to a shared page, a page fault triggers. The OS copies just that page, updates the page table entry to point to the new copy, marks it writable, and resumes. Only the written pages are ever duplicated.

CoW is also used internally by some allocators, by the Linux kernel’s mmap(MAP_PRIVATE), and by data structures in functional programming languages that need cheap copies with local mutations.

Huge Pages

The default page size of 4 KB was set in the 1980s. A machine with 128 GB of RAM has 32 million pages. The TLB typically holds around 1000 entries, covering just 4 MB. Processes that work with large datasets spend a surprising fraction of their time on TLB misses.

Huge pages (2 MB or 1 GB on x86-64) dramatically reduce TLB pressure. Each TLB entry covers 2 MB instead of 4 KB, meaning the same 1000 TLB entries now cover 2 GB rather than 4 MB. Databases (PostgreSQL, Oracle) and high-performance computing applications often configure huge pages to eliminate TLB misses as a bottleneck.

Where This Leaves You

Here is what the introductory courses skip. The memory model you were taught (variables hold values, functions own their scope, assignments happen sequentially) is a simplification that the hardware actively cooperates to maintain. The OS, the compiler, and the CPU all quietly negotiate to keep your program’s assumptions intact.

None of them were asked to.

The TLB has a fixed number of entries and evicts older mappings when it fills. The store buffer drains on its own schedule, not yours. A cache line is 64 bytes regardless of how many you actually needed — and when two cores both write to variables that happen to share one, the hardware spends more time arbitrating ownership than doing useful work. At some point in this article, the gap between the model and the machine probably started to feel uncomfortable. That discomfort is the point.

None of this prevents you from writing memory bugs. It changes what you see when they appear. A use-after-free stops being mysterious corruption and becomes a write to memory that was given away on a specific earlier line. False sharing gets a cause rather than a shrug. Memory ordering failures stop feeling like nondeterminism and start reading as a predictable outcome on a machine that never promised what you assumed.

Heartbleed was four bytes. Not four megabytes, not a catastrophic miscalculation; four bytes past the end of a heap buffer in OpenSSL’s heartbeat handler. The Morris worm used a fixed-size stack buffer in fingerd, a Unix program most operators had not thought about in years. Cloudbleed was a use-after-free in Cloudflare’s HTTP parser, generated by Ragel from a state machine specification. In each case, the code did exactly what the programmer told it to do. The programmer’s model of what that meant at runtime was wrong.

References

  1. Bryant, R. E.; O’Hallaron, D. R. Computer Systems: A Programmer’s Perspective, 3rd ed. Pearson, 2015. The standard textbook for CMU 15-213 / Introduction to Computer Systems.
  2. Carnegie Mellon University. CS 15-213: Introduction to Computer Systems — Virtual Memory. cs.cmu.edu/afs/cs/academic/class/15213-s09/www/lectures/17-virtual-memory.pdf
  3. Stanford University. CS 107: Pointers, Arrays and the Heap (Lab 3). web.stanford.edu/class/archive/cs/cs107/cs107.1216/lab3
  4. Stanford University. CS 107: More Pointers and Arrays (Lecture 6). web.stanford.edu/class/archive/cs/cs107/cs107.1212/lectures/6/Lecture6.pdf
  5. MIT OpenCourseWare. 6.004 Computation Structures: Caches and the Memory Hierarchy. ocw.mit.edu/courses/6-004-computation-structures-spring-2017/pages/c14
  6. MIT OpenCourseWare. 6.004 Computation Structures: Virtual Memory. ocw.mit.edu/courses/6-004-computation-structures-spring-2017/pages/c16
  7. MIT OpenCourseWare. 6.172 Performance Engineering: Memory Systems and Performance Engineering. ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2010
  8. University of Illinois Urbana-Champaign. CS 225: Stack and Heap Memory. courses.grainger.illinois.edu/cs225/sp2022/resources/stack-heap
  9. Technische Universität Darmstadt. Rechnerorganisation, Kapitel 6: Speicherhierarchie. esa.informatik.tu-darmstadt.de
  10. Tanenbaum, A. S. Modern Operating Systems, 4th ed. Pearson, 2014. Ch. 3: “Memory Management.”
  11. Drepper, U. What Every Programmer Should Know About Memory. Red Hat, Inc., 2007. akkadia.org/drepper/cpumemory.pdf
  12. OWASP Foundation. Buffer Overflow Attack. owasp.org/www-community/attacks/Buffer_overflow_attack
  13. Preshing, J. Weak vs. Strong Memory Models. preshing.com, 2012. preshing.com/20120930/weak-vs-strong-memory-models
  14. Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol. 3A: System Programming Guide. Ch. 4: “Paging.” intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html