Inspecting HTTP Response Headers Without Downloading Body with Guzzle

I recently needed to inspect the HTTP response headers of a very large file download to decide, based on the ETag header, whether we should commit to downloading the file. If the ETag hadn't changed for a file we had already downloaded, our application could skip the download entirely and use the copy we already had. Some of these files are multiple gigabytes in size, so the time savings from this optimization really add up.
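To make the skip decision concrete, here is a minimal sketch of the comparison, assuming the headers arrive as a PSR-7 style array (header name mapped to a list of values) and that the previously seen ETag is stored somewhere as a string. The function names firstHeaderValue() and canSkipDownload() are made up for illustration, and the comparison is a plain string match that does not treat weak ETags (W/"...") specially.

```php
<?php
declare(strict_types=1);

/**
 * Returns the first value of a header, matching the name case-insensitively.
 * Servers may send "ETag", "etag" or "Etag", so a direct array lookup can miss it.
 */
function firstHeaderValue(array $headers, string $name): ?string
{
    foreach ($headers as $key => $values) {
        if (strcasecmp((string) $key, $name) === 0) {
            $list = array_values((array) $values);
            return isset($list[0]) ? (string) $list[0] : null;
        }
    }

    return null;
}

/**
 * True only when both ETags are known and identical.
 * A missing ETag on either side means we cannot prove the file is unchanged.
 */
function canSkipDownload(array $headers, ?string $storedEtag): bool
{
    $remoteEtag = firstHeaderValue($headers, 'ETag');

    return $remoteEtag !== null
        && $storedEtag !== null
        && $remoteEtag === $storedEtag;
}
```

Treating a missing ETag as "must download" errs on the side of correctness: the worst case is an unnecessary download rather than a stale file.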

The most obvious solution I reached for was to issue a HEAD request to the URL, which returns just the HTTP headers without the response body (thus not actually downloading the file). This didn't work out as well as I expected. Some of the URLs I was working with were signed S3 URLs that were signed only for GET requests, and they returned a 403 Forbidden status code for HEAD requests. Additionally, this approach requires the server on the other end to implement HEAD requests properly, and I'd rather not rely on that being the case for arbitrary URLs.

So, I needed to issue actual GET requests with Guzzle, but somehow avoid downloading the response body. From looking at the documentation I found the on_headers request option. This seemed promising:

A callable that is invoked when the HTTP headers of the response have been received but the body has not yet begun to download.

That's exactly what I need: some way to inspect the response headers before receiving the body of the response! It even looks like you can throw an exception inside the on_headers callable to abort the request.

After some experimenting I landed on this getHeaders() function:

// Imports for the top of the file:
use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;

private function getHeaders(string $url): array
{
    $response = null;

    try {
        $this->guzzle->get($url, [
            'on_headers' => function (ResponseInterface $responseWithOnlyHeaders) use (&$response) {
                // Keep the header-only response, then abort before the body downloads.
                $response = $responseWithOnlyHeaders;
                throw new BlockResponseBodyDownload();
            },
        ]);
    } catch (RequestException $e) {
        // Guzzle wraps our exception; anything else is a genuine failure.
        // instanceof is safe even when getPrevious() returns null.
        if (!$e->getPrevious() instanceof BlockResponseBodyDownload) {
            throw $e;
        }
    }

    // Have to manually follow redirects when using `on_headers`.
    if (in_array($response->getStatusCode(), [301, 302, 303, 307, 308], true)) {
        return $this->getHeaders($response->getHeader('Location')[0]);
    }

    return $response->getHeaders();
}

As you can see, we issue a GET request to the provided URL. In our on_headers callable we receive the HTTP response in $responseWithOnlyHeaders, which we save for later. Then we immediately throw a BlockResponseBodyDownload exception, which aborts the download of the rest of the HTTP response. This exception should be a very specific one that's used only for this purpose; if you use a generic \Exception, it will be hard to tell apart from native Guzzle exceptions. I named it BlockResponseBodyDownload simply to make it very clear to the next developer who works on this code what the exception does.

When you throw an exception in on_headers, internally Guzzle will convert it to its own RequestException and pass your exception into the $previous parameter of RequestException. So in order to differentiate between our BlockResponseBodyDownload exception and Guzzle's native RequestExceptions, we need to access the previous exception via $e->getPrevious() and check if it's our exception. If it is, simply ignore it. If it's not, re-throw it.
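The check on the previous exception can be sketched in isolation, without Guzzle. In this sketch the body of BlockResponseBodyDownload is an assumption (the class is named above but never shown), isDeliberateAbort() is a hypothetical helper name, and a plain \RuntimeException stands in for Guzzle's RequestException purely for illustration.

```php
<?php
declare(strict_types=1);

// Assumed definition: a dedicated marker exception with no extra behaviour.
final class BlockResponseBodyDownload extends \RuntimeException
{
}

/**
 * True when $e carries our marker exception in its "previous" slot.
 * instanceof simply yields false when getPrevious() returns null,
 * so exceptions without a previous one are handled safely.
 */
function isDeliberateAbort(\Throwable $e): bool
{
    return $e->getPrevious() instanceof BlockResponseBodyDownload;
}

// Stand-ins for Guzzle's RequestException (illustration only).
$wrapped = new \RuntimeException('on_headers failed', 0, new BlockResponseBodyDownload());
$genuine = new \RuntimeException('connection reset');

var_dump(isDeliberateAbort($wrapped)); // bool(true)
var_dump(isDeliberateAbort($genuine)); // bool(false)
```

Using instanceof rather than comparing class names also keeps the check working if the marker exception is ever subclassed, and it avoids passing null to get_class() when there is no previous exception.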

The only other caveat to this solution is that Guzzle no longer automatically follows redirects when on_headers is used this way: the callable also fires for the redirect response itself, so our exception aborts the request before Guzzle gets the chance to follow it. Some of our file download URLs did redirect, so I had to implement redirects manually by checking whether the status code of the response was a redirecting status code, and then calling the same function recursively with the URL given in the Location header.
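Manual redirect handling has two easy-to-miss edge cases: a redirect loop would recurse forever, and the Location header name may arrive in a different case. The following is a rough sketch of a bounded, iterative variant, not the code used in this post. The names findLocation() and resolveFinalHeaders(), the $fetch callable (assumed to return a [statusCode, headers] pair for one request) and the default limit of 5 are all assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/** Returns the first Location value, matching the header name case-insensitively. */
function findLocation(array $headers): ?string
{
    foreach ($headers as $name => $values) {
        if (strcasecmp((string) $name, 'Location') === 0) {
            $list = array_values((array) $values);
            return isset($list[0]) ? (string) $list[0] : null;
        }
    }

    return null;
}

/**
 * Follows redirects iteratively and gives up after $maxRedirects hops.
 *
 * @param callable(string): array $fetch Hypothetical fetcher: takes a URL and
 *                                       returns [int $statusCode, array $headers].
 */
function resolveFinalHeaders(callable $fetch, string $url, int $maxRedirects = 5): array
{
    for ($hop = 0; $hop <= $maxRedirects; $hop++) {
        [$status, $headers] = $fetch($url);
        $location = findLocation($headers);

        // Not a redirect (or no target given): these are the headers we want.
        if (!in_array($status, [301, 302, 303, 307, 308], true) || $location === null) {
            return $headers;
        }

        // Assumes an absolute URL; a relative Location would need resolving against $url.
        $url = $location;
    }

    throw new \RuntimeException("Gave up after {$maxRedirects} redirects");
}
```

In practice $fetch could be a small closure around the single-request part of getHeaders() that returns the status code and headers of the captured response.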

I figured the easiest way to ensure this works as intended is to simply try to get the headers of a very large file download, and see how long it takes. I wrote this quick test script using a 1GB speed test file from Hetzner.

$start = microtime(true);

$headers = $this->getHeaders('https://speed.hetzner.de/1GB.bin');

echo sprintf('Retrieved ETag header (%s) in %.2F seconds', $headers['ETag'][0], microtime(true) - $start);

Which outputs:

Retrieved ETag header ("5253f10e-3e800000") in 0.79 seconds

with a very consistent time on multiple test runs. If we comment out our exception in the on_headers callable so that we don't abort the request after getting the headers:

'on_headers' => function (ResponseInterface $responseWithOnlyHeaders) use (&$response) {
    $response = $responseWithOnlyHeaders;
-    throw new BlockResponseBodyDownload();
+//  throw new BlockResponseBodyDownload();
},

and then re-run the test, we see the time it takes to retrieve the headers skyrocket, because it is fully downloading the 1GB file contained in the HTTP response body:

Retrieved ETag header ("5253f10e-3e800000") in 207.33 seconds

That's good enough proof for me!