Memory-Mapped Files

mmap (memory map) is a system call that maps a file (or a portion of a file) directly into a process's virtual address space. Once mapped, the process can access the file's contents via ordinary memory reads and writes (pointer dereference), without using explicit read() or write() system calls. The kernel handles loading file data into memory transparently using the page fault mechanism.

How mmap Works

When a process calls mmap(NULL, length, PROT_READ, MAP_SHARED, fd, offset):

  1. The kernel creates a new virtual memory area (VMA) in the process's address space, spanning length bytes.
  2. The corresponding page table entries are initially marked not present -- no physical memory is allocated yet.
  3. When the process first accesses an address in the mapped region, a page fault occurs.
  4. The page fault handler reads the corresponding page of the file from disk (or finds it in the page cache) and maps it into the process's address space.
  5. Subsequent accesses to the same page require no syscall or disk I/O -- the data is in physical memory, accessible at RAM speed.

This lazy loading is key: only pages actually touched by the program are loaded. If a process maps a 1 GB file but only reads 4 KB from the middle, only one page is loaded.

Shared vs. Private Mappings

  • Shared (MAP_SHARED): multiple processes see the same physical pages. Writes by one process are visible to the others, and changes are eventually written back to the file.
  • Private (MAP_PRIVATE): the process gets its own copy. The first write to a page triggers copy-on-write (COW): the kernel copies the page before modifying it, so other processes and the underlying file are unaffected.

Shared mappings are used for IPC (inter-process communication) and for memory-mapped I/O. Private mappings are used by the dynamic linker to load shared libraries -- each process can modify its own copy of writable data without affecting others.

mmap vs. read()/write() Tradeoffs

Advantages of mmap:

  • No system call overhead per access -- once mapped, accessing the file is a pointer dereference, not a syscall.
  • Excellent for random access -- any byte in the file is accessible via its memory address; no seeking required.
  • Automatic caching -- the kernel's page cache manages which pages are in memory.
  • Shared memory -- multiple processes can map the same file for zero-copy IPC.

Disadvantages of mmap:

  • Page fault overhead -- each first access to a page incurs a page fault (kernel trap), which is more expensive than a well-buffered sequential read().
  • Bad for sequential streaming -- for reading a file sequentially from start to finish, buffered read() with read-ahead is faster because it avoids per-page fault overhead.
  • TLB pressure -- mapping large files uses many page table entries, potentially evicting other TLB entries.
  • Complex error handling -- I/O errors manifest as SIGBUS signals, not return codes.
  • Unmap cost -- munmap() is not free; it must flush dirty pages and invalidate page table entries.

Real-World Uses

  • Database buffer pools (SQLite, LMDB, MongoDB's since-retired MMAPv1 engine): map the entire database file and let the OS handle page management.
  • Dynamic linker (ld.so): shared libraries (.so / .dylib) are memory-mapped into every process that uses them, with code pages shared across all processes.
  • JIT compilers: map executable memory regions for generated code.
  • Large file processing: efficiently access multi-gigabyte files without loading them entirely into memory.

mmap for Database Buffer Pools

Real-World Example

SQLite (when configured with PRAGMA mmap_size) and LMDB use mmap to manage their database files:

LMDB (Lightning Memory-Mapped Database):

// Simplified concept
void* db = mmap(NULL, fileSize, PROT_READ | PROT_WRITE, MAP_SHARED, dbFd, 0);

// Reading a record: just pointer arithmetic, no syscall
Record* r = (Record*)((char*)db + offset);  // cast: arithmetic on void* is non-standard C
printf("key=%s, value=%s\n", r->key, r->value);

// The kernel handles page faults: if the page containing this
// record is not in RAM, a page fault loads it from disk.

Why mmap works well for databases:

  • Random access: database queries jump to arbitrary locations in the file. mmap makes this a pointer dereference instead of lseek() + read().
  • Shared cache: the OS page cache serves as the buffer pool. No need to implement a separate caching layer.
  • Multiple readers: many processes can map the same file read-only and share the physical pages.

Why some databases avoid mmap (PostgreSQL, MySQL InnoDB):

  • They need fine-grained control over eviction policies (e.g., clock algorithm vs. OS LRU).
  • They need to control when dirty pages are flushed (for write-ahead log consistency).
  • mmap's page fault handling is synchronous, meaning a query can stall unpredictably on disk I/O.
  • They prefer to manage their own buffer pool with predictable latency characteristics.

Dynamic linker example: When you run any program that uses libc.so:

# Every process on the system maps libc.so
# Code pages (text segment) are mapped read-only (PROT_READ | PROT_EXEC,
#   MAP_PRIVATE): because they are never written, every process shares the
#   single copy in the page cache
# Data pages (writable globals) are MAP_PRIVATE: each process gets COW copies

On a system with 200 running processes, libc.so's code pages exist only once in physical memory, saving hundreds of megabytes.

mmap: Lazy Loading via Page Faults

[Figure: a process's virtual address space (code + data + stack) with a mmap'd region backing file.db. Pages 0 and 2 are resident, mapped to page-cache frames 5 and 12; pages 1 and 3 are not loaded. Accessing page 1 raises a page fault: (1) trap to kernel, (2) load the page from the file / page cache. Legend: MAP_SHARED -- writes visible to all mappers; MAP_PRIVATE -- copy-on-write.]