Understanding Sandboxes: gVisor, Hypervisors, and Firecracker
Every time you run a serverless function on AWS Lambda, execute a container on Google Cloud Run, or spin up a GitHub Actions workflow, your code runs on a physical machine shared with hundreds of other tenants. The only thing standing between your workload and theirs is a sandbox – an isolation boundary that determines what your code can see, touch, and break.
Most developers have a vague sense that containers “handle this.” They don’t. Containers were designed for packaging and deployment consistency, not for security isolation. The distinction matters, and understanding it requires digging into what the Linux kernel actually provides, where those guarantees end, and what three very different technologies – gVisor, nested hypervisors, and Firecracker – do to close the gap.
What Is a Sandbox?
A sandbox is an execution environment that restricts what a program can do. It limits access to files, network interfaces, system calls, and hardware. The premise is simple: if you’re going to run code you don’t fully trust – whether it’s a third-party library, a customer’s serverless function, or an AI-generated script – you want to confine the blast radius. A vulnerability inside the sandbox should not yield access to the host, to other tenants’ data, or to the underlying infrastructure.
This is the principle of least privilege applied at the infrastructure level. A web server doesn’t need to load kernel modules. A function that resizes images doesn’t need access to the host’s network stack. Without a sandbox, every process runs with whatever authority the operating system grants it, and every kernel bug becomes a potential escape hatch.
The challenge is building sandboxes that are strong enough to be a real security boundary, yet lightweight enough to run thousands of them on a single host.
Containers: The Illusion of Isolation
To understand why containers fall short, you need to understand what a container actually is. There’s no “container” primitive in the Linux kernel. A container is a convention – a combination of several kernel features layered together.
Namespaces: What a Process Can See
Linux namespaces control visibility. Each namespace type isolates a different aspect of the system:
| Namespace | What It Isolates |
|---|---|
| PID | Process IDs. PID 1 inside the container is not PID 1 on the host. |
| Network | Network interfaces, IP addresses, routing tables, iptables rules. |
| Mount | Filesystem mount points. The container sees its own root filesystem. |
| UTS | Hostname. The container can have its own hostname. |
| IPC | System V IPC objects and POSIX message queues. |
| User | User and group IDs. UID 0 inside can map to an unprivileged UID on the host. |
| Cgroup | Cgroup root directory. The process sees its own cgroup as the hierarchy root. |
| Time | Clock offsets (added in Linux 5.6). Per-namespace offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME; the wall clock (CLOCK_REALTIME) is not virtualized. |
Namespaces answer the question: what does this process think the system looks like? A process in a PID namespace sees itself as PID 1. A process in a network namespace sees only its own virtual network interface. But these are visibility restrictions, not security boundaries. The process is still executing on the same kernel.
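One way to see namespaces for yourself is to read the links under /proc/self/ns, where each entry points at a target such as pid:[4026531836]. Two processes are in the same namespace exactly when those inode numbers match. The sketch below is a minimal, Linux-only illustration; the function names are made up for this example.

```python
import os
import re

# Link targets look like "pid:[4026531836]": a namespace kind and an inode.
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace link target into (kind, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def current_namespaces(proc_dir: str = "/proc/self/ns") -> dict[str, int]:
    """Return {entry name: namespace inode} for this process (Linux only)."""
    result = {}
    for entry in sorted(os.listdir(proc_dir)):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(proc_dir, entry)))
        result[entry] = inode
    return result


if __name__ == "__main__":
    for name, inode in current_namespaces().items():
        print(f"{name:20s} {inode}")
```

Running it on the host and again inside a container should show different inode numbers for every namespace the runtime unshared, while the kernel underneath stays the same.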
Cgroups: How Much a Process Can Use
Control groups (cgroups) handle resource limits. They answer the question: how much can this process consume?
Cgroups manage CPU scheduling weight, memory limits and OOM behavior, block I/O throttling, network traffic classification, process count limits, and device access control. When a container is given a 512 MB memory limit and half a CPU, cgroups are what enforce those numbers (the CPU cap is a quota, which is distinct from the relative weight that “CPU shares” express). Without them, a single runaway container could starve the entire host.
Cgroups v1 used separate hierarchies per resource controller – one for CPU, another for memory, another for I/O – which led to configuration complexity and race conditions. Cgroups v2, which became the default in most distributions by 2022-2023, unified everything into a single hierarchy with cleaner semantics and better pressure stall information (PSI) for detecting resource contention.
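In the v2 hierarchy, limits are plain text files in the cgroup’s directory. cpu.max holds a quota and a period in microseconds (or the word max for no limit), and memory.max holds a byte count (or max). A minimal sketch of reading them, with helper names chosen for this example:

```python
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Convert cgroup v2 cpu.max content ("<quota> <period>") to a CPU count.

    Returns None when the quota is "max", meaning no CPU limit is set.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def parse_memory_max(text: str) -> Optional[int]:
    """Convert cgroup v2 memory.max content to bytes, or None for "max"."""
    value = text.strip()
    return None if value == "max" else int(value)


if __name__ == "__main__":
    print(parse_cpu_max("50000 100000"))   # half a CPU
    print(parse_memory_max("536870912"))   # 512 MiB in bytes
```

A quota of 50000 with a period of 100000 is the “half a CPU” case: the group may run for 50 ms out of every 100 ms.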
The Full Stack
A running container is the combination of:
┌──────────────────────────────────────────┐
│ Container Runtime │
├──────────────────────────────────────────┤
│ Namespaces Visibility isolation │
│ Cgroups Resource limits │
│ Seccomp-BPF System call filtering │
│ Capabilities Privilege partitioning │
│ AppArmor/SELinux MAC policies │
├──────────────────────────────────────────┤
│ Shared Host Kernel │
└──────────────────────────────────────────┘
Seccomp-BPF filters which system calls a process can invoke. Linux capabilities split root’s monolithic privileges into ~40 granular capabilities like CAP_NET_ADMIN and CAP_SYS_PTRACE. AppArmor or SELinux add mandatory access control policies on top.
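Capabilities are stored as a bitmask, visible as the CapEff line in /proc/&lt;pid&gt;/status. The sketch below decodes a few bits. The bit numbers come from the kernel’s capability list; the sample mask is an illustrative value resembling a restricted container default, not something to rely on for any particular runtime.

```python
# Bit positions for a handful of Linux capabilities.
CAP_BITS = {
    "CAP_CHOWN": 0,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_NET_ADMIN": 12,
    "CAP_SYS_MODULE": 16,
    "CAP_SYS_CHROOT": 18,
    "CAP_SYS_PTRACE": 19,
    "CAP_SYS_ADMIN": 21,
}

# Illustrative mask only (assumption): a restricted, container-style set.
SAMPLE_STATUS = "Name:\tsh\nCapEff:\t00000000a80425fb\n"


def effective_mask(status_text: str) -> int:
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_capability(mask: int, name: str) -> bool:
    """Report whether the named capability's bit is set in the mask."""
    return bool((mask >> CAP_BITS[name]) & 1)


if __name__ == "__main__":
    mask = effective_mask(SAMPLE_STATUS)
    for cap in CAP_BITS:
        print(f"{cap:22s} {has_capability(mask, cap)}")
```

With this sample mask, the process keeps CAP_CHOWN and CAP_SYS_CHROOT but lacks CAP_NET_ADMIN, CAP_SYS_PTRACE, CAP_SYS_MODULE, and CAP_SYS_ADMIN.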
All of these are resource management and visibility mechanisms. They are useful. They reduce the attack surface. But they are not a security boundary in the way a hypervisor is, because of one architectural fact that cannot be patched away.
The Shared Kernel Problem
Every container on a host shares the same Linux kernel. The x86_64 kernel exposes roughly 350+ system calls. Docker’s default seccomp profile blocks about 40-50 of them, leaving the rest accessible to every container. Each of those syscalls is an entry point into kernel code – code that runs at the highest privilege level the CPU offers.
This means a single kernel vulnerability is a potential container escape.
This is the motivation for everything that follows.
gVisor: A Kernel in User Space
Google’s approach to sandboxing is radical: instead of trying to restrict which system calls reach the host kernel, intercept all of them and handle them yourself. gVisor implements a guest kernel – called the Sentry – entirely in user space, written in Go.
How It Works
When an application inside a gVisor sandbox makes a system call, that call never reaches the host kernel. Instead, the Sentry intercepts it and processes it:
┌──────────────────────────────────────────┐
│ Application Process │
│ (thinks it's on Linux) │
├──────────────────────────────────────────┤
│ Sentry │
│ (user-space kernel in Go) │
│ │
│ Implements ~237 Linux syscalls: │
│ - Memory management │
│ - TCP/IP network stack (netstack) │
│ - Filesystem (tmpfs, procfs, sysfs) │
│ - Process management, signals │
│ - Pipes, sockets, epoll, futexes │
├──────────────┬───────────────────────────┤
│ Gofer │ Host Kernel │
│ (file proxy)│ (~68 syscalls used) │
└──────────────┴───────────────────────────┘
The Sentry reimplements around 237 of the ~350 Linux syscalls – enough to run most containerized workloads. It includes its own memory management with page tables and virtual memory areas, a complete TCP/IP network stack written in Go (called netstack), filesystem implementations for tmpfs, procfs, sysfs, and overlayfs, and full process management with signal handling and threading.
The critical security property: the Sentry itself is restricted to at most 68 host system calls. The ~237 syscalls it implements for applications are serviced in user space, so the host kernel interface reachable from the sandbox shrinks from ~350 syscalls to 68, a reduction of roughly 80%.
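The arithmetic behind these figures is easy to check. The sketch below uses the approximate counts quoted in this article; the value of 45 blocked syscalls is an assumed midpoint of the 40-50 range for Docker’s default profile.

```python
TOTAL_SYSCALLS = 350        # approximate x86_64 syscall count
DOCKER_BLOCKED = 45         # assumption: midpoint of the 40-50 range
GVISOR_HOST_SYSCALLS = 68   # host syscalls available to the Sentry


def surface_reduction(reachable: int, baseline: int) -> float:
    """Fraction of the baseline syscall surface that is no longer reachable."""
    return 1 - reachable / baseline


# Syscalls a default Docker container can still invoke on the host kernel.
container_reachable = TOTAL_SYSCALLS - DOCKER_BLOCKED

if __name__ == "__main__":
    print(container_reachable)
    print(round(surface_reduction(container_reachable, TOTAL_SYSCALLS) * 100))
    print(round(surface_reduction(GVISOR_HOST_SYSCALLS, TOTAL_SYSCALLS) * 100))
    print(round(surface_reduction(GVISOR_HOST_SYSCALLS, container_reachable) * 100))
```

Under these assumptions a default container still reaches about 305 syscalls (a 13% reduction), while gVisor cuts the surface by about 81% against the full kernel and about 78% against the default container baseline. Counting syscalls is a crude proxy for risk, but it shows the order of magnitude.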
The Gofer: File System Isolation
Filesystem access is the one area where the Sentry must interact with the host. When the sandbox needs to read container images or access bind mounts, those requests go through the Gofer – a separate, isolated process that acts as a file proxy.
The Sentry communicates with the Gofer over the LISAFS protocol (a 9P-inspired RPC protocol). The Gofer is the only component that makes host filesystem syscalls. The sandbox itself never directly touches the host filesystem. This separation means that even if an attacker compromises the Sentry, they still cannot directly access host files – they’d need to also compromise the Gofer, which runs as its own isolated process with its own seccomp filters.
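The proxy idea can be illustrated with a toy. The class below is a hypothetical, in-process stand-in, not gVisor’s Gofer and not the LISAFS protocol: it only shows the principle that one narrow component owns host file access and refuses anything outside the root it was given.

```python
import os


class ToyGofer:
    """Hypothetical file proxy: serves reads only from beneath a single root."""

    def __init__(self, root: str) -> None:
        # Resolve symlinks once so later comparisons use canonical paths.
        self.root = os.path.realpath(root)

    def read(self, relative_path: str) -> bytes:
        """Return file contents, or raise PermissionError on an escape attempt."""
        candidate = os.path.realpath(os.path.join(self.root, relative_path))
        # After resolving "..", symlinks, and absolute paths, the result must
        # still live under the root.
        if os.path.commonpath([self.root, candidate]) != self.root:
            raise PermissionError(f"path escapes sandbox root: {relative_path!r}")
        with open(candidate, "rb") as handle:
            return handle.read()
```

In the real design the caller and the proxy are separate processes with separate seccomp filters, so the check cannot be bypassed by corrupting the caller’s memory. This toy has no such protection.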
Interception Platforms
gVisor needs a mechanism to intercept system calls before they reach the host kernel. It calls this mechanism a platform, and two are in common use:
Systrap (the default since mid-2023) uses SECCOMP_RET_TRAP to intercept syscalls. When the sandboxed process executes a syscall, seccomp triggers a SIGSYS signal. A custom signal handler in shared memory notifies the Sentry, which processes the call and returns the result. Systrap works inside VMs, which makes it compatible with cloud environments that don’t expose /dev/kvm.
KVM platform uses the host’s KVM facility to run the Sentry as both a guest OS and VMM. It sets the MSR_LSTAR register to point to a custom syscall handler, so the CPU routes guest syscalls directly to the Sentry without the overhead of signal-based interception. This is faster than Systrap but requires /dev/kvm access, which means it doesn’t work inside VMs without nested virtualization.
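The selection logic described here reduces to one question: is /dev/kvm usable? A minimal sketch, following this article’s reasoning that KVM is preferred when available. The function names are made up for this example, and the returned strings are assumed to match the values accepted by runsc’s --platform flag.

```python
import os


def kvm_usable(device: str = "/dev/kvm") -> bool:
    """Check that the KVM device node exists and is readable and writable."""
    return os.path.exists(device) and os.access(device, os.R_OK | os.W_OK)


def choose_platform(kvm_available: bool) -> str:
    """Pick a gVisor platform name: KVM when usable, Systrap otherwise."""
    return "kvm" if kvm_available else "systrap"


if __name__ == "__main__":
    print(choose_platform(kvm_usable()))
```

Inside a typical cloud VM without nested virtualization this prints systrap, which is exactly the case Systrap was built for.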
Performance Characteristics
The trade-off for gVisor’s security is overhead:
- Syscall latency: Approximately 800ns per syscall with gVisor versus ~70ns for native Linux – roughly 10x overhead per call. This is structural to the interception mechanism.
- Compute-bound workloads: Near-native performance, since CPU-intensive work runs directly without frequent syscall interception.
- I/O-heavy workloads: The Gofer RPC path adds significant latency for filesystem operations. Database workloads and applications with heavy disk I/O feel this the most.
Google has been closing the gap. Directfs (2023) reduced filesystem overhead by 12-17% by allowing the Sentry to make some filesystem calls directly for trusted mounts, bypassing the Gofer. Seccomp-BPF filtering optimizations in 2024 removed ~29% of filtering overhead.
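Whether the per-syscall cost matters depends on how many syscalls a workload makes. A back-of-the-envelope sketch using the approximate latencies quoted above; the syscall rates in the usage lines are hypothetical.

```python
NATIVE_NS = 70    # approximate native syscall latency
GVISOR_NS = 800   # approximate gVisor syscall latency


def added_cpu_fraction(syscalls_per_second: int) -> float:
    """Extra CPU time per second of wall time spent on interception overhead."""
    extra_ns = (GVISOR_NS - NATIVE_NS) * syscalls_per_second
    return extra_ns / 1_000_000_000


if __name__ == "__main__":
    for rate in (1_000, 100_000, 1_000_000):
        print(f"{rate:>9} syscalls/s -> {added_cpu_fraction(rate):.1%} of a core")
```

At a thousand syscalls per second the overhead is noise. At a hundred thousand it is about 7% of a core, and at a million it is most of one, which is why compute-bound code barely notices gVisor and syscall-heavy code does.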
Where gVisor Runs
gVisor powers Google Cloud Run (its first-generation execution environment runs containers inside gVisor), GKE Sandbox (Kubernetes pods with runtimeClassName: gvisor), App Engine Standard, Cloud Functions, and Cloud ML Engine. It’s the right fit when you need container-compatible isolation with a much stronger security boundary than raw containers – and when the workload isn’t I/O-bound.
Nested Hypervisors: Hardware-Enforced Isolation
gVisor reduces the kernel attack surface by reimplementing syscalls in user space. Hypervisor-based isolation takes a fundamentally different approach: give each workload its own kernel entirely, and use hardware to enforce the boundary.
What a Hypervisor Does
A hypervisor (or Virtual Machine Monitor) multiplexes physical hardware across multiple virtual machines. Each VM gets its own kernel, its own memory space, and its own virtual devices. There are two types:
Type 1 (bare-metal) hypervisors run directly on hardware with no host OS underneath. VMware ESXi, Microsoft Hyper-V, and Xen are examples. The hypervisor is the operating system from the hardware’s perspective.
Type 2 (hosted) hypervisors run as applications on a host OS. VMware Workstation and VirtualBox are examples. The host OS manages hardware, and the hypervisor creates VMs within it.
KVM is a hybrid. It’s a Linux kernel module that turns the host kernel into a hypervisor, leveraging hardware virtualization extensions for Type 1-like isolation while running on a general-purpose OS.
How Hardware Virtualization Works
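Modern CPUs have hardware virtualization extensions (Intel VT-x, AMD-V) that add two operating modes in silicon. The names below are Intel’s; AMD-V has direct equivalents (host and guest mode, a VMCB in place of the VMCS, and VMRUN in place of VMLAUNCH/VMRESUME):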
VMX Root Mode: The hypervisor runs here with full privilege, plus additional instructions for VM management (VMLAUNCH, VMRESUME, VMREAD, VMWRITE).
VMX Non-Root Mode: Guest VMs run here. The CPU appears completely normal to the guest – all four privilege rings are available, the guest kernel runs at ring 0 – but certain privileged operations trigger a VM Exit, an automatic hardware trap back to the hypervisor.
The VMCS (Virtual Machine Control Structure) is a per-vCPU data structure that defines the guest state, host state, and which operations trigger VM exits. The hypervisor configures it to control exactly what the guest can and cannot do.
Extended Page Tables (EPT on Intel, NPT on AMD) add a second level of address translation in hardware. The guest kernel manages its own page tables (virtual → guest-physical), and the hardware transparently translates guest-physical addresses to host-physical addresses without hypervisor intervention. Without EPT, every guest page table modification would require a VM exit – a technique called shadow page tables that was extremely expensive.
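The two-stage translation can be modeled with two lookup tables. This is a deliberately flattened sketch: real hardware walks multi-level page tables for both stages, and the dictionaries here are stand-ins chosen for readability.

```python
PAGE_SIZE = 4096


def translate(address: int, table: dict[int, int]) -> int:
    """Map an address through a {page number: frame number} table."""
    page, offset = divmod(address, PAGE_SIZE)
    if page not in table:
        raise LookupError(f"no mapping for page {page:#x}")
    return table[page] * PAGE_SIZE + offset


def guest_virtual_to_host_physical(
    gva: int, guest_page_table: dict[int, int], ept: dict[int, int]
) -> int:
    """Stage 1 is controlled by the guest kernel, stage 2 by the hypervisor."""
    gpa = translate(gva, guest_page_table)  # virtual -> guest-physical
    return translate(gpa, ept)              # guest-physical -> host-physical


if __name__ == "__main__":
    guest_pt = {0x10: 0x2}   # the guest maps virtual page 0x10 to its frame 0x2
    ept = {0x2: 0x7F}        # the hypervisor backs guest frame 0x2 with host frame 0x7f
    print(hex(guest_virtual_to_host_physical(0x10123, guest_pt, ept)))
```

The guest can rewrite the first table freely, but it can only ever reach host frames that appear in the second. A missing second-stage entry corresponds to an EPT violation, which traps to the hypervisor.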
The lifecycle is:
1. Hypervisor executes VMLAUNCH → VM Entry → CPU switches to non-root mode, loads guest state from VMCS.
2. Guest runs at near-native speed.
3. Guest performs a sensitive operation → VM Exit → CPU saves guest state, loads host state.
4. Hypervisor handles the exit, then VMRESUME → back to step 2.
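The lifecycle above is a loop, and a toy model makes its shape concrete. Nothing here touches real hardware: the guest is a list of operation names, and the set of exiting operations is an assumption for illustration (which operations actually exit depends on how the hypervisor configures the VMCS).

```python
from typing import Iterable

# Assumption: operations this toy hypervisor has configured to cause a VM exit.
SENSITIVE = {"cpuid", "io_port_write", "hlt"}


def run_guest(operations: Iterable[str]) -> list[str]:
    """Trace the entry/exit sequence for a stream of guest operations."""
    trace = ["VMLAUNCH"]                      # step 1: first VM entry
    for op in operations:                     # step 2: guest runs
        if op not in SENSITIVE:
            continue                          # executes in non-root mode, no exit
        trace.append(f"VM exit ({op})")       # step 3: hardware trap to the host
        if op == "hlt":
            break                             # toy policy: stop when the guest halts
        trace.append("VMRESUME")              # step 4: re-enter the guest
    return trace


if __name__ == "__main__":
    for event in run_guest(["add", "cpuid", "mov", "hlt"]):
        print(event)
```

Most guest instructions never appear in the trace at all, which is the point: the hypervisor only gets involved at exits, so the fewer exits a workload causes, the closer it runs to native speed.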
The security boundary is enforced by the CPU itself. A vulnerability in the guest kernel does not, on its own, compromise the host, because the guest kernel runs in non-root mode: it cannot address host memory, host devices, or other VMs except through what the hypervisor maps or emulates for it. The remaining attack surface is the hypervisor’s device emulation code and VM exit handling, which is vastly smaller than the 350+ syscall kernel interface that containers share.
What “Nested” Means
Nested virtualization means running a hypervisor inside a VM. This creates three layers:
┌─────────────────────────┐
│ L2: Nested Guest VMs │ Created by L1
├─────────────────────────┤
│ L1: Guest Hypervisor │ Runs inside L0's VM
├─────────────────────────┤
│ L0: Host Hypervisor │ Bare metal
└─────────────────────────┘
When L1 executes VMLAUNCH to start an L2 guest, L0 intercepts it (since L1 is actually in non-root mode from L0’s perspective). L0 then merges L1’s VMCS for L2 with its own control structures and runs L2 directly. When L2 triggers a VM exit, L0 decides whether to handle it or forward it to L1.
This sounds expensive – every L2 VM exit potentially involves both L0 and L1. And it was, until hardware caught up. VMCS Shadowing (Intel, ~2013) allows L1 to read and write L2’s VMCS without causing VM exits to L0 for every VMREAD/VMWRITE. Before this feature, every L1 VMCS operation required software emulation by L0. VMCS Shadowing dramatically reduced the overhead of nested virtualization.
Why Nesting Matters
Nested virtualization enables important use cases in cloud infrastructure. Cloud providers (AWS, GCP, Azure) already run customer workloads in L1 VMs. When those customers need VM-level isolation within their VMs – for CI/CD pipelines that test VM images, for running Kata Containers or Firecracker, for Hyper-V inside a cloud VM – they need L0 to expose virtualization features to L1.
The performance overhead of nesting is 5-20% compared to L1 VMs for most workloads, with higher overhead for VM-exit-intensive operations. Hardware assists (VMCS Shadowing, nested EPT) have made this acceptable for production use.
Firecracker: The MicroVM Approach
Firecracker takes a third path. Rather than reimplementing the kernel (gVisor) or nesting hypervisors, it asks: what if we could get the full hardware isolation of a VM but with the speed and density of a container?
Built by Amazon and written in Rust, Firecracker powers AWS Lambda and AWS Fargate. It has been running in production since 2018, handling millions of workloads per second.
What Firecracker Actually Is
An important distinction first: Firecracker is not a hypervisor. It is a Virtual Machine Monitor (VMM) – the user-space component that sets up the VM, emulates devices, and manages the microVM lifecycle. The actual hypervisor is KVM, the Linux kernel module that provides CPU and memory virtualization via VT-x/AMD-V.
Think of it this way: KVM is the engine, Firecracker is the chassis. KVM handles the hardware-level isolation (non-root mode, EPT, VMCS). Firecracker handles everything else: booting the guest, emulating the devices the guest needs, and providing the API for creating and managing microVMs.
The comparison that matters is Firecracker versus QEMU, since both are VMMs that sit on top of KVM:
| | Firecracker | QEMU |
|---|---|---|
| Device model | 5 devices: virtio-net, virtio-block, virtio-vsock, serial console, i8042 | Hundreds: BIOS, PCI, USB, GPU, sound, etc. |
| Code size | ~50K lines of Rust | Millions of lines of C |
| Boot path | Direct kernel boot. No BIOS, no UEFI, no PCI bus. | Full BIOS/UEFI, PCI enumeration, ACPI tables |
| Boot time | <125ms to user space | Seconds |
| Memory overhead | <5 MiB per microVM | Tens to hundreds of MiB |
| Attack surface | Minimal device emulation | Massive legacy device emulation |
QEMU is a general-purpose VMM designed to run anything from a 1980s Macintosh to a modern GPU-accelerated server. It supports dozens of architectures and hundreds of emulated devices. This flexibility comes at a cost: millions of lines of C code, each line a potential vulnerability in the device emulation layer that runs on the host.
Firecracker strips all of that away. No BIOS, no UEFI, no PCI bus, and no legacy devices beyond a serial console and a minimal i8042 controller. It boots a Linux kernel directly, exposes its three virtio devices over MMIO (memory-mapped I/O, which avoids PCI bus emulation entirely), and nothing else. The result is a VMM with a minimal attack surface, written in a memory-safe language.
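The minimalism shows in the control interface: a microVM is described by a handful of JSON requests sent to the Firecracker API over a Unix socket. The sketch below only builds the request bodies and does not send them. The endpoint and field names follow the Firecracker API as commonly documented, so verify them against the specification for your version; the file paths are hypothetical.

```python
import json


def microvm_requests(
    kernel_path: str, rootfs_path: str, vcpus: int = 1, mem_mib: int = 128
) -> list[tuple[str, str, dict]]:
    """Build the (method, endpoint, body) sequence for a direct-kernel-boot microVM."""
    return [
        ("PUT", "/machine-config", {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        ("PUT", "/boot-source", {
            "kernel_image_path": kernel_path,
            # No firmware and no PCI: the kernel is told so on its command line.
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        }),
        ("PUT", "/drives/rootfs", {
            "drive_id": "rootfs",
            "path_on_host": rootfs_path,
            "is_root_device": True,
            "is_read_only": False,
        }),
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]


if __name__ == "__main__":
    # Hypothetical paths for illustration.
    for method, endpoint, body in microvm_requests("/tmp/vmlinux", "/tmp/rootfs.ext4"):
        print(method, endpoint, json.dumps(body))
```

There is no firmware image, no PCI topology, and no display to configure: a kernel, a root disk, and a machine size are enough to boot.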
How Firecracker Achieves Density
The microVM concept only works at scale if you can run thousands of them per host. Firecracker achieves this through aggressive minimalism:
No firmware boot: Traditional VMs go through BIOS/UEFI initialization, PCI enumeration, ACPI table parsing – all before the kernel even starts. Firecracker skips everything. It loads the kernel directly into guest memory, sets up the initial CPU state, and jumps to the kernel entry point. This gets a microVM from API call to running user-space code in under 125ms.
Creation rate: Up to 150 microVMs per second per host.
Minimal device model: Each microVM needs a network interface, a block device, and a console. Five devices, implemented in a few thousand lines of Rust, with minimal state per VM. Compare this to QEMU, where each VM carries the state for dozens of emulated devices it will never use.
virtio-over-MMIO: Instead of emulating a PCI bus to present virtio devices (which is what QEMU does by default), Firecracker uses virtio-over-MMIO. The guest kernel accesses devices through memory-mapped regions, eliminating the PCI enumeration step and the PCI host bridge emulation.
Built-in rate limiters: I/O and network rate limiting per microVM prevents noisy neighbors – a single microVM cannot saturate the host’s I/O bandwidth.
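A token bucket is one common way to build this kind of limiter. The sketch below is a generic illustration of the technique, not Firecracker’s code; time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Generic token bucket: allows bursts up to capacity, refills at a fixed rate."""

    def __init__(self, capacity: int, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)   # start full so an initial burst is allowed
        self.last = 0.0                 # timestamp of the previous call, in seconds

    def try_consume(self, amount: int, now: float) -> bool:
        """Spend tokens for an I/O of the given size, or refuse if too few remain."""
        elapsed = now - self.last
        self.last = now
        # Refill for the time that passed, never exceeding the bucket size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if amount > self.tokens:
            return False
        self.tokens -= amount
        return True


if __name__ == "__main__":
    bucket = TokenBucket(capacity=100, refill_per_second=50)
    print(bucket.try_consume(80, now=0.0))   # burst fits in the full bucket
    print(bucket.try_consume(40, now=0.0))   # only 20 tokens left
    print(bucket.try_consume(40, now=1.0))   # one second of refill adds 50
```

One bucket per microVM per resource (bytes and operations, for disk and network) means a noisy tenant exhausts only its own tokens.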
Copy-on-Write: The Density Multiplier
Running thousands of microVMs per host sounds expensive – each VM has its own kernel, its own memory, its own filesystem. The secret to making this work at scale is copy-on-write (CoW).
Copy-on-write is a resource management technique where multiple consumers share the same physical memory pages as long as none of them modifies the data. When a consumer writes, the system creates a private copy of just that page and redirects the write to the private copy. Everyone else continues reading the shared original.
At the hardware level, this is implemented by marking the shared pages read-only in the page table entries. When a write occurs, the CPU raises a page fault, and the kernel’s fault handler copies the page, maps the private copy as writable, and lets the write proceed.
Snapshot and Restore: Firecracker supports snapshotting a running microVM – capturing the complete guest memory contents and all vCPU state (registers, MSRs) to files. This snapshot becomes a template:
- Boot a microVM and let it reach a “warm” state – the application is initialized, the runtime is loaded, JIT compilation is complete.
- Snapshot it. This captures a fully initialized microVM in a file.
- Restore new microVMs from the snapshot instead of cold-booting them.
When restoring, guest memory pages are loaded on-demand from the snapshot file. Pages that the guest only reads are served from the shared page cache – the same physical pages back multiple restored microVMs. Pages that the guest modifies trigger a CoW fault and get their own private copy. For homogeneous workloads (like thousands of Lambda functions running the same Node.js runtime), the read-only overlap is enormous.
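The same kernel mechanism is available to any process through a private file mapping, which makes it easy to observe. The sketch below maps one file twice with Python’s mmap.ACCESS_COPY (a private, copy-on-write mapping) and writes through one mapping. The file stands in for a snapshot and the two mappings stand in for two restored microVMs; this is an analogy, not how Firecracker is invoked.

```python
import mmap
import tempfile


def demo_private_mappings() -> tuple[bytes, bytes, bytes]:
    """Return the first byte seen by mapping 1, mapping 2, and the file itself."""
    with tempfile.NamedTemporaryFile() as snapshot:
        snapshot.write(b"A" * mmap.PAGESIZE)
        snapshot.flush()
        # Two private mappings of the same file share pages until one writes.
        vm1 = mmap.mmap(snapshot.fileno(), mmap.PAGESIZE, access=mmap.ACCESS_COPY)
        vm2 = mmap.mmap(snapshot.fileno(), mmap.PAGESIZE, access=mmap.ACCESS_COPY)
        try:
            vm1[0:1] = b"B"          # triggers a private copy of this page for vm1
            snapshot.seek(0)
            return vm1[0:1], vm2[0:1], snapshot.read(1)
        finally:
            vm1.close()
            vm2.close()


if __name__ == "__main__":
    print(demo_private_mappings())
```

Only the writer sees its change. The second mapping and the file on disk still hold the original byte, and every page that nobody wrote to is still backed by a single copy in the page cache.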
Differential snapshots: With track_dirty_pages enabled, Firecracker uses KVM’s dirty page tracking to record which pages have been modified since the last snapshot. A diff snapshot contains only modified pages, making incremental saves fast and compact.
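The bookkeeping behind a diff snapshot can be modeled in a few lines. This is a toy with one placeholder value per page and a Python set as the dirty log; KVM tracks dirty pages in a bitmap, and the class and function names here are invented for this example.

```python
class ToyGuestMemory:
    """Toy guest memory that records which pages were written since the last snapshot."""

    def __init__(self, pages: int) -> None:
        self.pages = [b"\x00"] * pages   # one placeholder value per page
        self.dirty: set[int] = set()

    def write(self, page: int, data: bytes) -> None:
        self.pages[page] = data
        self.dirty.add(page)             # what dirty page tracking records

    def full_snapshot(self) -> dict[int, bytes]:
        """Capture every page and reset the dirty log."""
        self.dirty.clear()
        return dict(enumerate(self.pages))

    def diff_snapshot(self) -> dict[int, bytes]:
        """Capture only pages written since the previous snapshot."""
        diff = {page: self.pages[page] for page in sorted(self.dirty)}
        self.dirty.clear()
        return diff


def apply_diff(base: dict[int, bytes], diff: dict[int, bytes]) -> dict[int, bytes]:
    """Layer a diff over a base snapshot to reconstruct the later state."""
    return {**base, **diff}
```

A guest that touches 1% of its memory between snapshots produces a diff about 1% the size of a full one, and layering the diff over the base reproduces the later state exactly.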
Kernel Same-Page Merging (KSM): A Linux kernel feature that scans physical memory for pages with identical content, deduplicates them, and marks the shared page CoW. This is particularly effective for microVMs because multiple VMs running the same OS and application stack have large amounts of identical memory – kernel code, shared libraries, zero-initialized pages. KSM has a CPU cost (the scanning thread consumes cycles), but for high-density workloads the memory savings are worth it.
The combination is multiplicative. Minimal VMM overhead (<5 MiB per VM) keeps the per-VM fixed cost low. Snapshot/restore with CoW means new VMs don’t allocate full guest memory upfront – they share pages with the template until they diverge. KSM further deduplicates pages across running VMs that happen to contain identical data. And sparse memory allocation means guest memory that’s never touched never consumes physical pages.
A host with 256 GB of RAM can realistically run thousands of microVMs, each configured with 128 MB of guest memory, because actual physical memory consumed per VM is far less than 128 MB – most of it is shared.
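A rough capacity estimate shows how much the sharing matters. The host size, configured guest memory, and per-VM overhead below come from the figures in this article; the 32 MiB of private (unshared) memory per VM is a hypothetical value chosen for illustration, and the model ignores everything else running on the host.

```python
def vms_per_host(host_mib: int, private_mib_per_vm: int, vmm_overhead_mib: int = 5) -> int:
    """Number of microVMs that fit if each consumes its private pages plus VMM overhead."""
    return host_mib // (private_mib_per_vm + vmm_overhead_mib)


HOST_MIB = 256 * 1024   # 256 GiB host

if __name__ == "__main__":
    # No sharing at all: every VM touches and owns its full 128 MiB.
    print(vms_per_host(HOST_MIB, private_mib_per_vm=128))
    # Assumption: only 32 MiB per VM diverges from the shared snapshot pages.
    print(vms_per_host(HOST_MIB, private_mib_per_vm=32))
```

With no sharing the host tops out just under 2,000 microVMs. If only a quarter of each guest’s memory is private, the same host fits roughly 7,000.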
The Three Models Compared
Each approach makes a different fundamental trade-off:
┌─────────────────────────────────────────────────────────────────┐
│ Security vs. Performance │
│ │
│ Containers ──── gVisor ──── Firecracker ──── Traditional VMs │
│ │
│ ◄─── Faster, lighter Stronger isolation ───► │
└─────────────────────────────────────────────────────────────────┘
| | gVisor | Nested Hypervisor | Firecracker |
|---|---|---|---|
| Isolation mechanism | User-space kernel intercepts syscalls | Hardware (VT-x/AMD-V) enforces VM boundary | KVM hardware isolation + Jailer (a companion process that confines the VMM itself with chroot, namespaces, and cgroups) |
| Kernel exposure | ~68 host syscalls | Full guest kernel, but isolated by hardware | Full guest kernel, but isolated by hardware |
| Boot time | Container-like (ms) | Seconds to minutes | <125ms |
| Memory overhead | Moderate (Go runtime) | High (full OS per VM) | <5 MiB per microVM |
| Density | High | Low | Very high (thousands per host) |
| Compatibility | ~237 of ~350 syscalls | Full Linux compatibility | Full Linux compatibility |
| I/O performance | Degraded (Gofer RPC path) | Near-native | Near-native (virtio) |
| Best for | Multi-tenant containers where compatibility is acceptable | Running hypervisors-in-hypervisors, full VM workloads | Serverless, ephemeral, high-density workloads |
| Used by | Google Cloud Run, GKE Sandbox | Cloud provider infrastructure, CI/CD | AWS Lambda, AWS Fargate |
gVisor keeps the container-like developer experience – same images, same orchestration – but interposes a user-space kernel that dramatically reduces host kernel exposure. The cost is syscall overhead and I/O latency.
Nested hypervisors provide the strongest isolation through hardware enforcement, at the cost of boot time and resource overhead. Each VM is a complete, independent system – there’s no shared kernel to exploit. But this comes with the weight of running a full OS per workload.
Firecracker finds a middle ground: hardware-enforced isolation (via KVM) with container-like density and speed (via aggressive minimalism and CoW). It strips away everything a microVM doesn’t need and uses memory sharing to amortize the cost across thousands of VMs.
The trend is clear. As multi-tenant workloads scale and the serverless model becomes dominant, the industry is moving toward stronger isolation with less overhead. Containers alone are not sufficient for workloads where you don’t control the code. The question is which combination of mechanisms – user-space kernels, hardware virtualization, or microVMs – fits the threat model and performance requirements of your specific workload.
There’s no single right answer, but “just use containers” is increasingly the wrong one.