Every time someone hits /v1/screenshot, we do not spin up a Chromium. We didn't arrive at that decision gracefully — we arrived at it after watching cold starts eat request budgets, watching RAM graphs climb like a fever curve, and watching the socket table fill up under a modest burst. This post is about what replaced the naive approach: a warm browser pool, per-request context isolation, and a singleflight layer that collapses duplicate work in front of the pool.
If you run a screenshot API — or any service that shells out to a heavy headless runtime — the shape of the problem will feel familiar.
The naive approach and why it fails
The first version of anything that renders a URL to a PNG looks like this:
import express from 'express';
import { chromium } from 'playwright';

const app = express();

app.get('/screenshot', async (req, res) => {
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto(req.query.url);
const png = await page.screenshot();
await browser.close();
res.type('image/png').send(png);
});

It works on a laptop. It falls apart the moment you put real traffic on it.
Cold-start Chromium is expensive. Launching a fresh chromium.launch() — including fetching the executable if it's not resident, fork-exec'ing the zygote, warming V8, loading fonts, initializing the GPU process — costs about 1.2 to 2.5 seconds on a well-provisioned server. That happens before you've loaded a single byte of the target URL. If your p95 budget for the endpoint is 5 seconds, you've spent a third of it on overhead you could have paid once.
Memory explodes. A running Chromium sits at roughly 200-300 MB resident. Now imagine 20 concurrent requests. That's 4-6 GB of RAM consumed by nothing but browser boot processes, with nothing useful happening yet. Add the page memory once each navigation actually starts, and you are out of memory on a 16 GB box before you've done anything interesting.
Socket exhaustion. Every fresh Chromium opens dozens of sockets for DevTools Protocol, font servers, cache pipes, and, of course, the actual network requests it makes for the target page. Under bursty traffic, the ephemeral port range on Linux (default 32768-60999) gets crowded fast. You'll see EADDRNOTAVAIL long before you expect to.
Process zoo. Browsers can be killed, but processes hang. If a request crashes mid-flight and your cleanup handler has a bug, you accumulate zombie chrome processes that aren't doing anything except holding file descriptors.
The fix isn't a cleverer way to launch browsers. It's to stop launching them per request.
Browser pool
SnapSharp keeps a small, fixed number of Chromium instances warm and hands out tabs — not browsers — to requests.
┌───────────────────────────────────────────────────────────────┐
│ Browser Pool (BROWSER_POOL_SIZE = 3) │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Chromium #0 │ │ Chromium #1 │ │ Chromium #2 │ │
│ │ │ │ │ │ │ │
│ │ pages: 2/5 │ │ pages: 5/5 │ │ pages: 0/5 │ │
│ │ count: 18 │ │ count: 44 │ │ count: 50 │ ◄──── recycle
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Acquire → pick browser with fewest active pages │
│ Release → decrement pages, bump counter, recycle if needed │
└───────────────────────────────────────────────────────────────┘

Three constants define the pool's behavior, and they're worth understanding concretely:
- BROWSER_POOL_SIZE (default 3) — how many long-lived Chromium processes we keep alive. This is a function of CPU cores and RAM, not request rate. A pool of 3 warm Chromiums on a 4-core box with 8 GB RAM is the sweet spot we've found in production. Bumping it higher buys you concurrency but at diminishing returns, because Chromium is CPU-bound during render.
- MAX_PAGES_PER_BROWSER (default 5) — how many concurrent tabs a single browser will host. Chromium was not really designed to run 50 tabs at once on a headless server. Past 5-8 tabs, per-page latency starts to degrade noticeably because the GPU process, the renderer, and the shared event loop get contended. We cap at 5.
- BROWSER_RECYCLE_AFTER (default 50) — how many screenshots a single Chromium instance serves before it's torn down and relaunched. This is a garbage-collection hack by another name. Long-running Chromium processes leak: shader caches grow, font tables grow, DevTools protocol channels accumulate state. Recycling after ~50 requests throws away the accumulated cruft and resets us to a clean slate, at the cost of one cold-start every ~50 requests — amortized into noise.
With BROWSER_POOL_SIZE=3 and MAX_PAGES_PER_BROWSER=5, the pool serves 15 concurrent requests before it saturates. That's enough for the traffic shapes we see at every plan tier below enterprise; if you need more, you scale horizontally behind a load balancer, not vertically.
A critical detail: we don't give each request a fresh browser, but we do give each request a fresh browser context (browser.newContext()). A context is Chromium's unit of cookie jar, local storage, cache, and permission grants. Reusing a single browser across requests while never reusing a context is what makes this architecture safe for multi-tenant traffic. More on that in the trade-offs section.
Singleflight deduplication
Now imagine this scenario, which actually happens every day:
A blog post with a dynamic OG image goes mildly viral. Ten Slack previews, five LinkedIn unfurls, and twenty Twitter card renderers all hit GET /v1/screenshot?url=example.com/post/123 within the same 800 ms window. All thirty-five requests are byte-for-byte identical.
If the browser pool is the only layer of defense, thirty-five tabs open, thirty-five navigations run, and thirty-five screenshots get rendered — only to produce the same PNG bytes thirty-five times over. That's thirty-five times the CPU, thirty-five times the bandwidth, thirty-five times the pool saturation pressure.
Singleflight fixes this upstream of the pool. Before any request touches a browser, we compute a cache key from the normalized request parameters. If a request with that key is already in flight, the new caller doesn't launch its own work — it awaits the same promise as the first caller and receives the same bytes.
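Key derivation can look something like the sketch below — the parameter names and the flat sort are illustrative; the real normalization (applying defaults, lower-casing, dropping ignored params) is more involved.

```typescript
import { createHash } from "node:crypto";

// Hypothetical, simplified normalization: sort parameters so that
// ?width=800&url=x and ?url=x&width=800 hash to the same key.
function cacheKey(params: Record<string, string>): string {
  const normalized = Object.keys(params)
    .sort()
    .map(k => `${k}=${params[k]}`)
    .join("&");
  return "sha256:" + createHash("sha256").update(normalized).digest("hex");
}
```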
Time ────────────────────────────────────────────────────────────►
Request A ─┬─► singleflight: key="sha256:…" → not in map
│ → register promise
│ → launch screenshot
│ ┌────────────────────► PNG
│ │
Request B ─┤ singleflight: key="sha256:…" → found in-flight
│ → await same promise ─► PNG
│
Request C ─┤ singleflight: key="sha256:…" → found in-flight
│ → await same promise ─► PNG
│
└─► first promise resolves → map cleared → bytes returned to all three

The implementation is boring, which is exactly what you want:
class Singleflight<T> {
private inflight = new Map<string, Promise<T>>();
async do(key: string, fn: () => Promise<T>): Promise<T> {
const existing = this.inflight.get(key);
if (existing) return existing;
const promise = fn().finally(() => {
this.inflight.delete(key);
});
this.inflight.set(key, promise);
return promise;
}
}

That's the whole thing. A dozen lines, no dependencies, no cleverness. The two things worth noting:
- .finally() clears the entry on both success and failure. If we only cleared on success, a failing request would poison its cache key for all future callers.
- There's no TTL. Singleflight is not a cache — it's deduplication during a single in-flight window. Once the first promise resolves, subsequent callers go through the normal Redis cache layer (or, on a cache miss, through the browser pool).
The order of the stack is: singleflight → Redis cache → browser pool → Playwright. In the ten-identical-requests scenario, nine requests never hit Redis and never hit the pool — they just wait for the one that did.
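To make the dedup window concrete, here's a small harness around the Singleflight class (re-declared so the snippet runs standalone); render is a stand-in for a real screenshot job:

```typescript
// Same class as in the post, repeated so this snippet is self-contained.
class Singleflight<T> {
  private inflight = new Map<string, Promise<T>>();
  async do(key: string, fn: () => Promise<T>): Promise<T> {
    const existing = this.inflight.get(key);
    if (existing) return existing;
    const promise = fn().finally(() => {
      this.inflight.delete(key);
    });
    this.inflight.set(key, promise);
    return promise;
  }
}

// Three concurrent callers, one key: render runs once, all three callers
// resolve with the same bytes. Returns the number of times render ran.
async function demo(): Promise<number> {
  const sf = new Singleflight<string>();
  let renders = 0;
  const render = () =>
    new Promise<string>(resolve => {
      renders++;
      setTimeout(() => resolve("png-bytes"), 10);
    });
  const [a, b, c] = await Promise.all([
    sf.do("shot:example.com/post/123", render),
    sf.do("shot:example.com/post/123", render),
    sf.do("shot:example.com/post/123", render),
  ]);
  return a === b && b === c ? renders : -1;
}
```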
What happens when the pool saturates
Sometimes the pool simply runs out. A burst of 30 unique screenshots arrives at once. All 15 page slots are full. What happens to request 16?
It queues, with a deadline. The acquire() method on the pool polls for an available slot every 200 ms up to POOL_ACQUIRE_TIMEOUT_MS (default 8000 — eight seconds). If a slot frees up, the queued request proceeds. If the deadline passes, we throw a typed error that the middleware translates into:
{
"error": "service_unavailable",
"message": "No browser instances available",
"status": 503,
"retry_after": 5,
"request_id": "…"
}

We also cap the queue depth itself. If more than poolSize × MAX_PAGES_PER_BROWSER × 2 callers are waiting, we reject immediately with 503 rather than letting the queue balloon. A client that's about to time out anyway is better off getting a fast failure and retrying than waiting ten seconds to fail.
Beyond the in-process queue, every screenshot navigation itself has a hard ceiling: SCREENSHOT_TIMEOUT (default 20000 ms, 20 seconds). Pages that don't reach their configured wait_until state in that window are aborted, the page is closed, the pool slot is returned, and the caller gets a structured timeout error. That timeout is intentionally strict: a screenshot that takes 45 seconds is a screenshot nobody wants, and holding the pool slot hostage punishes every other request in the queue.
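The queue-with-deadline behavior can be sketched as a polling loop — illustrative code, with acquireWithTimeout, tryAcquire, and PoolSaturatedError as assumed names standing in for the pool's actual internals:

```typescript
// Typed error the middleware can translate into the 503 JSON body above.
class PoolSaturatedError extends Error {
  readonly status = 503;
  readonly retryAfter = 5;
  constructor() {
    super("No browser instances available");
  }
}

// Poll for a free slot every pollMs; give up once the deadline passes.
// tryAcquire stands in for the pool's non-blocking acquire.
async function acquireWithTimeout<T>(
  tryAcquire: () => T | null,
  timeoutMs = 8000, // POOL_ACQUIRE_TIMEOUT_MS
  pollMs = 200,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const slot = tryAcquire();
    if (slot !== null) return slot;
    if (Date.now() >= deadline) throw new PoolSaturatedError();
    await new Promise(resolve => setTimeout(resolve, pollMs));
  }
}
```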
Finally, when utilization climbs above 80%, the pool fires a Telegram alert — rate-limited to one per five minutes so we don't spam the channel during a legitimate traffic spike. This is how we learn the pool is under-provisioned before users do.
Trade-offs
Long-lived Chromium is not free. These are the costs you take on in exchange for the performance wins.
Shared state bleed. A naive reuse of a single page across requests is a security disaster: cookies from user A's screenshot of mail.example.com would be visible to user B's screenshot of ads.example.com. We sidestep this entirely by using browser.newContext() per request. Each context has its own cookie jar, local storage, session storage, cache, and permission state. When the request completes, the context is closed and everything inside it is garbage-collected. The browser process stays warm; the isolation boundary is the context.
Stealth and proxy rely on context isolation. Stealth mode works by patching navigator.webdriver, the plugin list, WebGL fingerprints, and a dozen other detection vectors — via context.addInitScript(). Proxies are configured via context.route() or at the browser.newContext({ proxy: … }) level. Both of these require a per-request context; you can't swap a proxy on a page that's already loaded. This is why every SnapSharp request pays the small overhead of newContext() — it's what makes stealth and proxy work correctly at all.
Browser crashes take down the whole slot. If Chromium crashes mid-render (for example, a malformed WebGL shader triggers a renderer panic), every tab on that browser dies with it. We handle this with a disconnected event handler that triggers a background restart of the crashed slot, while in-flight requests on other slots continue untouched. Restart takes ~1.5 seconds; during that window, the pool operates at reduced capacity.
The recycle window is a tunable, not a magic number. BROWSER_RECYCLE_AFTER=50 is what we ship. On workloads that hit a lot of WebGL-heavy pages, we've seen the optimal value drop to 25. On image-lite workloads (mostly static HTML pages), 100 is fine. If you're building something similar, instrument screenshotCount per browser, watch how page render latency correlates with it, and pick the knee of the curve.
The engineering culture behind these choices
None of the ideas here are novel. Browser pools exist in every serious Puppeteer-as-a-service product. Singleflight is a pattern Google open-sourced in Go a decade ago. Per-request contexts are literally how Playwright expects you to use it. The work isn't in inventing the pattern — it's in committing to it, measuring it, and tuning the constants until they match your actual traffic.
Most screenshot APIs look like the naive example at the top of this post, with progressively more duct tape. Ours looks like this because we had the luxury of rebuilding from the failure modes and being honest about them. If you want to see the numbers we target — p50 and p95 latencies by endpoint, cache hit rates, pool saturation thresholds — they live in the performance docs. If you just want to generate an OG image or capture a site without operating any of this infrastructure yourself, that is the entire point of SnapSharp.
Related: SnapSharp API docs · Why Headless Chrome Keeps Crashing · Screenshot API Comparison 2026 · Bypassing Bot Detection