You set up Puppeteer or Playwright, it works locally, and then production goes sideways. Browsers crash. Memory spikes. Processes become zombies. You've been there.

Here's what's actually happening — and what you can do about it.

The most common causes

1. `/dev/shm` is too small in Docker

Chrome uses shared memory for rendering. In Docker, /dev/shm defaults to 64MB. Chrome needs more.

Symptom: Crash on any page with canvas, WebGL, or heavy CSS.

Fix:

# docker-compose.yml
services:
  app:
    shm_size: '256mb'

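Or pass --disable-dev-shm-usage in the Chrome args, which makes Chrome create its shared memory files in a temporary directory instead of /dev/shm: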

await chromium.launch({
  args: ['--disable-dev-shm-usage']
});
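
If you'd rather decide at startup which of the two fixes applies, one option is to measure /dev/shm and add the flag only when it is small. This is a minimal sketch, assuming Node 18.15 or later for fs.statfsSync. The helper names and the 256 MB threshold (borrowed from the shm_size value above) are illustrative assumptions, not values Chrome documents.

```javascript
const fs = require('node:fs');

// Hypothetical threshold, borrowed from the shm_size in the compose file above.
const MIN_SHM_BYTES = 256 * 1024 * 1024;

// Returns the total size of the filesystem at `path` in bytes, or null if it
// cannot be read (missing path, non-Linux host, Node without fs.statfsSync).
function readShmBytes(path = '/dev/shm') {
  try {
    const stats = fs.statfsSync(path);
    return stats.bsize * stats.blocks;
  } catch {
    return null;
  }
}

// Adds --disable-dev-shm-usage when shared memory is small or its size is unknown.
function shmArgs(shmBytes, minBytes = MIN_SHM_BYTES) {
  if (shmBytes === null || shmBytes < minBytes) {
    return ['--disable-dev-shm-usage'];
  }
  return [];
}
```

You would then launch with something like chromium.launch({ args: shmArgs(readShmBytes()) }). Treating an unknown size as "small" is a deliberately cautious choice; flip it if you prefer Chrome's default behavior when the check fails.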

2. No sandbox in restricted environments

Chrome's sandbox relies on kernel features, such as unprivileged user namespaces, that many cloud environments (Kubernetes pods, certain VMs) don't expose.

Symptom: Chrome exits at launch with an error such as "Running as root without --no-sandbox is not supported."

Fix:

await chromium.launch({
  args: ['--no-sandbox', '--disable-setuid-sandbox']
});

Security note: These flags remove the isolation between the pages Chrome renders and the host system. Only use them inside Docker with proper network isolation, and never in a multi-tenant environment.
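
One way to keep that warning enforceable is to leave the sandbox on by default and require an explicit opt-in per deployment. A minimal sketch follows; the CHROME_NO_SANDBOX variable name and the sandboxArgs helper are hypothetical, chosen here for illustration.

```javascript
// Returns the sandbox-disabling flags only when the deployment has opted in.
// CHROME_NO_SANDBOX is a hypothetical environment variable, not a Chrome or
// Playwright setting.
function sandboxArgs(env = process.env) {
  if (env.CHROME_NO_SANDBOX === '1') {
    return ['--no-sandbox', '--disable-setuid-sandbox'];
  }
  return [];
}
```

Set the variable only in the container definitions where you have confirmed the isolation described above, so a copy-pasted launch call elsewhere keeps the sandbox.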

3. Memory leaks from unclosed contexts

If you're opening a new browser per request without closing it, you'll run out of memory within hours.

Wrong:

// Don't do this
app.get('/screenshot', async (req, res) => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  // ... take screenshot ...
  // browser never closed if an error occurs
});

Right: Use a browser pool. Launch browsers once at startup and reuse them across requests. Open a fresh page for each screenshot, close the page (not the browser) when it's done, and restart browsers periodically to prevent memory accumulation.
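
Here's roughly what that pool can look like with a single browser. Treat it as a sketch under stated assumptions: createBrowserPool, withPage, shutdown, and the maxUses default are names and values invented for this example, and launch is any async function that resolves to a browser (for instance () => chromium.launch()). It does not detect a browser that crashes mid-life.

```javascript
// `launch` resolves to an object with newPage() and close(), as Playwright and
// Puppeteer browsers do. `maxUses` is a hypothetical recycling threshold.
function createBrowserPool(launch, { maxUses = 100 } = {}) {
  let current = null; // { browserPromise, uses, active, retired }

  // Close a retired browser once its last page has finished.
  function closeIfIdle(entry) {
    if (!entry.retired || entry.active > 0) return Promise.resolve();
    return entry.browserPromise.then((b) => b.close()).catch(() => {});
  }

  function startEntry() {
    const entry = { browserPromise: launch(), uses: 0, active: 0, retired: false };
    // If the launch fails, forget this entry so the next request retries.
    entry.browserPromise.catch(() => {
      if (current === entry) current = null;
    });
    return entry;
  }

  // Runs task(page) on a fresh page and always closes the page afterwards.
  async function withPage(task) {
    if (current === null) current = startEntry();
    const entry = current;
    entry.uses += 1;
    entry.active += 1;
    if (entry.uses >= maxUses) {
      entry.retired = true; // no new pages on this browser
      current = null;       // the next request launches a fresh one
    }
    let page;
    try {
      const browser = await entry.browserPromise;
      page = await browser.newPage();
      return await task(page);
    } finally {
      if (page) await page.close().catch(() => {});
      entry.active -= 1;
      closeIfIdle(entry);
    }
  }

  // Stops handing out the current browser; it closes once it is idle.
  async function shutdown() {
    const entry = current;
    current = null;
    if (entry) {
      entry.retired = true;
      await closeIfIdle(entry);
    }
  }

  return { withPage, shutdown };
}
```

A request handler would then call pool.withPage(async (page) => { ... }) and never touch the browser directly. Because the page is closed in a finally block, an error in the screenshot code no longer leaks anything.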

4. GPU process crashes

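Chrome's GPU process can crash in headless environments that have no GPU or display server.

Fix: Disable GPU hardware acceleration with --disable-gpu (--no-first-run is unrelated to the GPU; it skips Chrome's first-run tasks):

await chromium.launch({ args: ['--disable-gpu', '--no-first-run'] });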

5. Zombie processes after timeout

If a screenshot times out, the browser process may not be killed properly.

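Fix: Race the navigation against a timeout, and explicitly close the page or context when it fails. Promise.race on its own does not cancel the losing promise, so the navigation keeps running until the page is closed:

let timer;
const timeout = new Promise((_, reject) => {
  timer = setTimeout(() => reject(new Error('Timeout')), 30000);
});
// Losing the race does not cancel goto, so close the page on failure.
const result = await Promise.race([page.goto(url), timeout])
  .catch(async (err) => { await page.close(); throw err; })
  .finally(() => clearTimeout(timer));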

6. OOM killer

On low-memory instances, the Linux OOM killer will terminate Chrome without warning. Chrome is memory-hungry — each tab uses 100–300MB.

Fix: Run a browser pool with a max page count per browser, and limit concurrent requests.
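
Limiting concurrent requests doesn't need a library. The sketch below caps how many screenshot jobs run at once and queues the rest in arrival order. createLimiter is a name made up for this example, and the right limit is an assumption you'd tune against your instance's memory and the per-tab figure above.

```javascript
// Returns a function that runs async tasks with at most `limit` in flight.
// Extra callers wait in a FIFO queue until a slot frees up.
function createLimiter(limit) {
  let active = 0;
  const waiting = [];

  function release() {
    const next = waiting.shift();
    if (next) {
      next(); // hand the slot straight to the next waiter
    } else {
      active -= 1;
    }
  }

  return async function run(task) {
    if (active >= limit) {
      // All slots are taken: wait until release() resolves this promise.
      await new Promise((resolve) => waiting.push(resolve));
    } else {
      active += 1;
    }
    try {
      return await task();
    } finally {
      release();
    }
  };
}
```

Create one limiter at startup (for example const limited = createLimiter(4)) and wrap every screenshot job in limited(() => ...). Requests beyond the limit wait instead of opening more tabs, which keeps peak memory roughly bounded by the limit times the per-tab cost.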

The hidden operational cost

Even if you solve all the above, you're now responsible for:

Monitoring browser health and restarting crashed instances
Memory management — recycling browsers after N screenshots
Updates — Chrome releases updates that can break sites
Scaling — horizontal scaling of stateful browser processes is complex

For most teams, this is 20–40 hours of engineering time to get right, then ongoing maintenance.

When to use a managed API

If screenshots are not your core product, a managed API eliminates all of this:

No Docker /dev/shm configuration
No zombie processes
No OOM monitoring
No browser update management
Scales automatically

You pay a small per-request cost in exchange for the engineering hours and operational overhead you no longer spend.

# 3 lines of bash instead of a browser operations runbook
curl "https://api.snapsharp.dev/v1/screenshot?url=https://example.com" \
  -H "Authorization: Bearer sk_live_..." \
  -o screenshot.png

If screenshots are your core product, you'll want control: run your own browser pool, apply the fixes above, and plan for the ongoing work of keeping that pool healthy.

There's no universal answer. But know the true cost before deciding.
