Some real cognitive dissonance in this article…
“The PDF Association operates under a strict principle—any new feature must work seamlessly with existing readers” followed by introducing compression as a breaking change in the same paragraph.
All this for Brotli… On a read-many format like PDF, zstd’s decompression speed is a much better fit.
Are they using a custom dictionary with Brotli designed for PDFs? I am not sure if it would help or not, but it seems like one of those cases it may help?
Something like this:
https://developer.chrome.com/blog/shared-dictionary-compress...
In my applications, in the area of 3D, I've been moving away from Brotli because it is just so slow for large files. I prefer zstd, because it is like 10x faster for both compression and decompression.
The PDF Association is still running experiments on whether or not to support custom dictionaries, based on the gains seen in real-life workloads.
So it might land in the spec once it has proven that it offers enough value.
It seems they're using the standard dictionary, which is utterly bizarre.
The standard Brotli dictionary bakes in a ton of assumptions about what the Web looked like in 2015, including not just which HTML tags were particularly common but also such things as which swear words were trendy.
It doesn't seem reasonable to think that PDFs have symbol probabilities remotely similar to the web corpus Google used to come up with that dictionary.
On top of that, it seems utterly daft to be baking that into a format which is expected to fit archival use cases and thus impose that 2015 dictionary on PDF readers for a century to come.
I too would strongly prefer that they use zstd.
BTW I've looked into custom dictionaries before for similar use cases and I suspect it would only offer like a 1% improvement or so for PDFs -- still good, but not a massive difference maker. The issue is that PDFs, like web pages, are incredibly repetitive in terms of their tags/structure. As such the custom dictionary only helps if the doc is really small, otherwise because of the repetitive nature, the self-inferred dictionary will resemble the custom dictionary after just a few blocks of PDF content.
The sole exception is if they are restarting the brotli stream for each page, and they are not sharing a dictionary, custom or inferred across the whole doc. Then the dictionary will have to be re-inferred on each page, and then a shared custom dictionary would make more sense.
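A rough way to see the "only helps when the input is small" effect, as a sketch rather than a benchmark: Python's zlib supports preset dictionaries, so Deflate stands in for Brotli here. The dictionary contents and both sample payloads are made up for illustration.

```python
import zlib
from typing import Optional

# ASSUMPTION: a made-up "shared dictionary" of byte strings that tend to recur
# in PDF objects. Illustrative only; not taken from any spec or proposal.
PDF_DICT = (
    b"/Type /Page /MediaBox /Resources /Font /Contents "
    b"/Length /Filter /FlateDecode endstream endobj"
)

def deflate(data: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Compress with zlib, optionally primed with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()

def inflate(blob: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Decompress; the same dictionary must be supplied if one was used."""
    if zdict is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()

# Hypothetical payloads: one tiny page object, and the same text repeated.
small = (b"<< /Type /Page /MediaBox [0 0 612 792] "
         b"/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>")
large = small * 2000

for label, payload in (("small", small), ("large", large)):
    plain = len(deflate(payload))
    primed = len(deflate(payload, PDF_DICT))
    print(label, "raw:", len(payload), "no dict:", plain, "with dict:", primed)
```

On the small payload the dictionary should save a noticeable share of the output. On the repeated one I'd expect the saving to stay around the same handful of bytes, since only the first occurrence of each token benefits. That is the same reason a per-page restart of the stream would change the picture.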
How can iText claim that adding Brotli is not a backward incompatible change (in the "Why keep encoding separate" table)? In the first section the author states that any new feature must work seamlessly with existing readers. New documents created that include this compression would be unintelligible to any reader that only supports Deflate.
Am I missing something? Adoption will take a long time if you can't be confident the receiver of a document or viewers of a publication will be able to open the file.
It's prototypish work to support it before it lands in the official specification. But it will indeed take some adoption time.
Because I'm doing the work to patch in support across different viewers to help adoption grow. And once the big open-source ones ship it (pdfjs, poppler, pdfium), adoption can rise quickly.
There are old devices where the viewer can’t be patched. That’s killing one of the main features of PDF
Who is responsible for the terrible decision? In the pro vs con analysis, saving 20% size occasionally vs updating ALL pdf libraries/apps/viewers ever built SHOULD be a no-brainer.
What is the point of using a generic compression algorithm in a file format? Does this actually get you much over turning on filesystem and transport compression, which can transparently swap the generic algorithm (e.g. my files are already all zstd compressed. HTTP can already negotiate brotli or zstd)? If it's not tuned to the application, it seems like it's better to leave it uncompressed and let the user decide what they want (e.g. people noting tradeoffs with bro vs zstd; let the person who has to live with the tradeoff decide it, not the original file author).
Few people enable file system compression, and even if they do it's usually with fast algorithms like lz4 or zstd -1. When authoring a document you have very different tradeoffs and can afford the cost of high compression levels of zstd or brotli.
Well, if sanity had prevailed, we would have likely stuck to .ps.gz (or your favourite compression format), instead of ending up with PDF.
Though we might still want to restrict the subset of PostScript that we allow. The full language might be a bit too general to take from untrusted third parties.
Compression filters are in PostScript.
Don't you end up with PDF if you start with PS and restrict it to a subset? And maybe normalize the structure of the file a little. The structure is nice when you want to take the content and draw a bit more on the page. Or when subsetting/combining files.
I suspect PDF was fairly sane in the initial incarnation, and it's the extra garbage that they've added since then that is a source of pain.
I'm not a big fan of this additional change (nor any of the javascript/etc), but I would be fine with people leaving content streams uncompressed and running the whole file through brotli or something.
I thought PDFs can contain arbitrary PS.
- inside the file, the compressor can be varied according to the file content. For example, images can use jpeg, but that isn’t useful for compressing text
- when jumping from page to page, you won’t have to decompress the entire file
> inside the file, the compressor can be varied according to the file content. For example, images can use jpeg, but that isn’t useful for compressing text
Okay, so we make a compressed container format that can perform such shenanigans, for the same amount of back-compat issues as extending PDF in this way.
> when jumping from page to page, you won’t have to decompress the entire file
This is already a thing with any compression format that supports quasi-random access, which is most of them. The answers to https://stackoverflow.com/q/429987/5223757 discuss a wide variety of tools for producing (and seeking into) such files, which can be read normally by tools not familiar with the conventions in use.
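Both positions boil down to the same mechanism: independently compressed chunks plus an index of where each one starts. A minimal sketch, assuming zlib as a stand-in codec and hypothetical helper names (`pack_pages`, `read_page`):

```python
import zlib

def pack_pages(pages):
    """Compress each page on its own; return (blob, index).

    index[i] is the (offset, length) of page i inside blob, so a reader
    can seek straight to one page without touching the others.
    """
    blob = bytearray()
    index = []
    for page in pages:
        chunk = zlib.compress(page, 9)
        index.append((len(blob), len(chunk)))
        blob += chunk
    return bytes(blob), index

def read_page(blob, index, n):
    """Decompress only page n."""
    offset, length = index[n]
    return zlib.decompress(blob[offset:offset + length])

# Hypothetical three-page document.
pages = [b"page one " * 50, b"page two " * 50, b"page three " * 50]
blob, index = pack_pages(pages)
print(len(blob), index)
```

The trade-off is that each chunk starts with an empty history, so cross-page redundancy is lost. Whether that lives inside the file format or in an outer container is exactly what is being argued here.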
If we're making breaking changes to PDFs, I'd love if the committee added a modern image format like JPEG-XL. In my experience, most disk usage of PDFs comes from images, not streams.
I keep a bunch of comics in PDF but JPEG-XL is by far the best way to enjoy them in terms of disk space.
Odd you should say that, as that's exactly what they've been discussing
No it's not. This article is about proposing Brotli as another possible '/Filter' for stream objects, like content streams (page drawing commands). Images are streams too, but unless you mean compressing raw pixel bytes in Brotli, there's no mention of a JPEG-XL or WEBP filter.
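For anyone who hasn't looked inside a PDF, this is roughly what a filtered stream object looks like, sketched with the existing Deflate filter. The helper name is hypothetical. A Brotli stream would presumably differ only in the filter name and the codec, but the exact name isn't given in this thread, so it isn't shown.

```python
import zlib

def flate_stream_object(obj_num: int, payload: bytes) -> bytes:
    """Build a PDF indirect object holding a Deflate-compressed stream.

    /Length is the byte count of the compressed data, and /Filter tells
    the reader which decoder to run before interpreting the bytes.
    """
    data = zlib.compress(payload, 9)
    header = b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % (
        obj_num, len(data))
    return header + data + b"\nendstream\nendobj\n"

# Hypothetical content stream: draw "Hello" in font F1 at 12pt.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
obj = flate_stream_object(4, content)
print(obj[:60])
```

A reader that doesn't recognise the filter name has no way to decode the stream, which is where the compatibility worry elsewhere in the thread comes from.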
well, not mentioned in this specific article. But JPEG-XL support is something they're working on [1].
[1]: https://pdfa.org/wp-content/uploads/2025/10/PDFDays2025-Brea...
Why not zstd?
I think this was the main reason (from the linked article) LOL:
"Brotli is a compression algorithm developed by Google."
They have no idea about Zstandard nor ANS/FSE comparing it with LZ77.
Sheer incompetence.
I can’t imagine the people actually doing the technical work don’t know about Zstandard.
EDIT: Something weird is going on here. When compressing zstd in parallel it produces the garbage results seen here, but when compressing on a single core, it produces result competitive with Brotli (37M). See: https://news.ycombinator.com/item?id=46723158
I just took all PDFs I had in my downloads folder (55, totaling 47M). These are invoices, data sheets, employment contracts, schematics, research reports, a bunch of random stuff really.
I compressed them all with 'zstd --ultra -22', 'brotli -9', 'xz -9' and 'gzip -9'. Here are the results:
+------+------+-----+------+--------+
| none | zstd | xz | gzip | brotli |
+------|------|-----|------|--------|
| 47M | 45M | 39M | 38M | 37M |
+------+------+-----+------+--------+
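If anyone wants to repeat this kind of comparison without depending on what the filesystem reports, here is a sketch that counts compressed bytes in memory. It only covers gzip and xz, because zstd and brotli would need third-party bindings in most Python installs. The function names and the folder argument are placeholders.

```python
import gzip
import lzma
from pathlib import Path

def compressed_sizes(data: bytes) -> dict:
    """Return exact byte counts for the raw data and each codec's output."""
    return {
        "none": len(data),
        "gzip": len(gzip.compress(data, compresslevel=9)),
        "xz": len(lzma.compress(data, preset=9)),
    }

def totals(folder: str) -> dict:
    """Sum the per-file sizes over every *.pdf directly inside folder."""
    total = {"none": 0, "gzip": 0, "xz": 0}
    for path in sorted(Path(folder).glob("*.pdf")):
        for name, size in compressed_sizes(path.read_bytes()).items():
            total[name] += size
    return total

# Usage (hypothetical path):
#   print(totals("/path/to/pdfs"))
```

Counting `len()` of the output avoids any block-size rounding, so small files don't all collapse to the same 4K multiples.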
Here's a table with all the files:
+------+------+------+------+--------+
| raw | zstd | xz | gzip | brotli |
+------+------+------+------+--------+
| 12K | 12K | 12K | 12K | 12K |
| 20K | 20K | 20K | 20K | 20K | x5
| 24K | 20K | 20K | 20K | 20K | x5
| 28K | 24K | 24K | 24K | 24K |
| 28K | 24K | 24K | 24K | 24K |
| 32K | 20K | 20K | 20K | 20K | x3
| 32K | 24K | 24K | 24K | 24K |
| 40K | 32K | 32K | 32K | 32K |
| 44K | 40K | 40K | 40K | 40K |
| 44K | 40K | 40K | 40K | 40K |
| 48K | 36K | 36K | 36K | 36K |
| 48K | 48K | 48K | 48K | 48K |
| 76K | 128K | 72K | 72K | 72K |
| 84K | 140K | 84K | 80K | 80K | x7
| 88K | 136K | 76K | 76K | 76K |
| 124K | 152K | 88K | 92K | 92K |
| 124K | 152K | 92K | 96K | 92K |
| 140K | 160K | 100K | 100K | 100K |
| 152K | 188K | 128K | 128K | 132K |
| 188K | 192K | 184K | 184K | 184K |
| 264K | 256K | 240K | 244K | 240K |
| 320K | 256K | 228K | 232K | 228K |
| 440K | 448K | 408K | 408K | 408K |
| 448K | 448K | 432K | 432K | 432K |
| 516K | 384K | 376K | 384K | 376K |
| 992K | 320K | 260K | 296K | 280K |
| 1.0M | 2.0M | 1.0M | 1.0M | 1.0M |
| 1.1M | 192K | 192K | 228K | 200K |
| 1.1M | 2.0M | 1.1M | 1.1M | 1.1M |
| 1.2M | 1.1M | 1.0M | 1.0M | 1.0M |
| 1.3M | 2.0M | 1.1M | 1.1M | 1.1M |
| 1.7M | 2.0M | 1.7M | 1.7M | 1.7M |
| 1.9M | 960K | 896K | 952K | 916K |
| 2.9M | 2.0M | 1.3M | 1.4M | 1.4M |
| 3.2M | 4.0M | 3.1M | 3.1M | 3.0M |
| 3.7M | 4.0M | 3.5M | 3.5M | 3.5M |
| 6.4M | 4.0M | 4.1M | 3.7M | 3.5M |
| 6.4M | 6.0M | 6.1M | 5.8M | 5.7M |
| 9.7M | 10M | 10M | 9.5M | 9.4M |
+------+------+------+------+--------+
Zstd is surprisingly bad on this data set. I'm guessing it struggles with the already-compressed image data in some of these PDFs.
Going by only compression ratio, brotli is clearly better than the rest here and zstd is the worst. You'd have to find some other reason (maybe decompression speed, maybe spec complexity, or maybe you just trust Facebook more than Google) to choose zstd over brotli, going by my results.
I wish I could share the data set for reproducibility, but I obviously can't just share every PDF I happened to have laying around in my downloads folder :p
Turns out that these numbers are caused by APFS weirdness. I used 'du' to get them which reports the size on disk, which is weirdly bloated for some reason when compressing in parallel. I should've used 'du -A', which reports the apparent size.
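The two numbers can be compared directly from a script. A small sketch, assuming a POSIX system where `st_blocks` is available and counted in 512-byte units (the function name is made up):

```python
import os

def file_sizes(path):
    """Return (apparent_size, on_disk_size) in bytes for one file.

    apparent: the byte length of the file's contents.
    on_disk: the space allocated by the filesystem, which is what a
    plain 'du' summarises.
    """
    st = os.stat(path)
    apparent = st.st_size
    on_disk = st.st_blocks * 512  # ASSUMPTION: POSIX, 512-byte block units
    return apparent, on_disk

# Usage (hypothetical file name):
#   print(file_sizes("report.pdf.zst"))
```

If the two values differ a lot for the compressed outputs, the compressor is probably not the culprit.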
Here's a table with the correct sizes, reported by 'du -A' (which shows the apparent size):
+---------+---------+--------+--------+--------+
| none | zstd | xz | gzip | brotli |
+---------|---------|--------|--------|--------|
| 47.81M | 37.92M | 37.96M | 38.80M | 37.06M |
+---------+---------+--------+--------+--------+
These numbers are much more impressive. Still, Brotli has a slight edge.
If you're worried about double-compression of image data, you can uncompress all images by using qpdf:
qpdf --stream-data=uncompress in.pdf out.pdf
The resulting file should compress better with zstd.
> | 1.1M | 2.0M | 1.1M | 1.1M | 1.1M |
Something is going terribly wrong with `zstd` here, where it is reported to compress a file of 1.1MB to 2MB. Zstd should never grow the file size by more than a very small percent, like any compressor. Am I interpreting it correctly that you're doing something like `zstd -22 --ultra $FILE && wc -c $FILE.zst`?
If you can reproduce this behavior, can you please file an issue with the zstd version you are using, the commands used, and if possible the file producing this result.
Yeah, `--adaptive` will enable adaptive compression, but it isn't enabled by default, so shouldn't apply here. But even with `--adaptive`, after compressing each block of 128KB of data, zstd checks that the output size is < 128KB. If it isn't, it emits an uncompressed block that is 128KB + 3B.
So it is very central to zstd that it will never emit a block that is larger than 128KB+3B.
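To put a number on that, here is back-of-the-envelope arithmetic for the worst case, using the 128KB block size and 3-byte overhead quoted above. The per-frame overhead constant is my own assumption (magic number, a maximal frame header, and a checksum), not something stated in this thread.

```python
import math

BLOCK_MAX = 128 * 1024  # block size quoted above
BLOCK_HEADER = 3        # per-block overhead quoted above
FRAME_OVERHEAD = 22     # ASSUMPTION: 4-byte magic + up to 14-byte header + 4-byte checksum

def worst_case_size(n: int) -> int:
    """Upper bound on output size if every block is stored uncompressed."""
    blocks = max(1, math.ceil(n / BLOCK_MAX))
    return n + blocks * BLOCK_HEADER + FRAME_OVERHEAD

raw = int(1.1 * 1024 * 1024)  # roughly the 1.1M file from the table
print("raw:", raw, "bound:", worst_case_size(raw),
      "overhead:", worst_case_size(raw) - raw)
```

Under these assumptions the overhead for a 1.1M input is tens of bytes, nowhere near the roughly 0.9M of growth shown in that row.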
I will try to reproduce, but I suspect that there is something unrelated to zstd going on.
What version of zstd are you using?
doesn't zstd cap out at compression level 19?
Why not use a more widespread compression algorithm (e.g. gzip) considering that Brotli barely performs better at all? Sounds like a pain for portability
I'm not sold on the idea of adding compression to PDF at all, I'm not convinced that the space savings are worth breaking compatibility with older readers. Especially when you consider that you can just compress it in transit with e.g HTTP's 'Content-Encoding' without any special PDF reader support. (You can even use 'Content-Encoding: br' for brotli!)
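For reference, the server side of that negotiation is tiny. A simplified sketch of picking a coding from the client's `Accept-Encoding` header: it ignores the `*` wildcard and only checks that the q-value is above zero. The function name and the preference order are assumptions.

```python
def pick_encoding(accept_encoding: str, supported=("br", "zstd", "gzip")) -> str:
    """Return the first server-preferred coding the client accepts.

    Falls back to 'identity' (no compression) when nothing matches.
    """
    accepted = {}
    for item in accept_encoding.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # a coding listed without a q-value counts as fully accepted
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed q-value as "not acceptable"
        accepted[name.strip().lower()] = q
    for coding in supported:
        if accepted.get(coding, 0.0) > 0:
            return coding
    return "identity"

print(pick_encoding("gzip, deflate, br"))
```

The PDF on disk stays untouched, and an old client that only sends `gzip` still gets something it can read.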
If you do wanna change PDF backwards-incompatibly, I don't think there's a significant advantage to choosing gzip to be honest, both brotli and zstd are pretty widely available these days and should be fairly easy to vendor. But yeah, it's a slight advantage I guess. Though I would expect that there are other PDF data sets where brotli has a larger advantage compared to gzip.
But what I really don't get is all the calls to use zstd instead of brotli, and treating the choice to use brotli instead of zstd as some form of Google conspiracy. (Is Facebook really better?)
Could you add compression and decompression speeds to your table?
Zstd should not be slower than gzip to decompress here. Given that it has inflated the files to be bigger than the uncompressed data, it has to do more work to decompress. This seems like a bug, or somehow measuring the wrong thing, and not the expected behavior.
Thanks a lot. Interestingly Brotli’s author mentioned here that zstd is 2× faster at decompressing, which roughly matches your numbers:
https://news.ycombinator.com/item?id=46035817
I’m also really surprised that gzip performs better here. Is there some kind of hardware acceleration or the like?
incompetence
You can read about it here https://pdfa.org/brotli-compression-coming-to-pdf/
That mentions zstd in a weird incomplete sentence, but never compares it.
They don’t seem to provide a detailed comparison showing how each compression scheme fared at every task, but they do list (some of) their criteria and say they found Brotli the best of the bunch. I can’t tell if that’s a sensible conclusion or not, though. Maybe Brotli did better on code size or memory use?
I love when I perform all the due diligence tasks. You just can't counter that. Yes but, they did all the due diligence tasks. They considered all the factors. Every one. Think you have one they didn't consider? Nope.
I am often frustrated by PDF issues such as how complicated it is to create one.
But reading the article I realized PDFs have become ubiquitous because of its insistence on backwards compatibility. Maybe for some things it's good to move this slow.
The article is wrong, the PDF spec has introduced breaking changes plenty of times. It’s done slowly and conservatively though, particularly now that the format is an ISO spec.
The PDF format is versioned, and in the past new versions have introduced things like new types of encryption. It’s quite probable that a v1.7 compliant PDF won’t open on a reader app written when v1.3 was the latest standard.
This is a really really bad idea. Don't break backwards compat. for 20% of gains. Internet connection speeds and storage capacities only go up. In a few years time, 20% of gains will seem crazy to have broken back-compat for.
Wouldn't lzma2 be better here since a pdf is more read heavy?
Going by one of Brotli’s authors’ comment [1] on another post, it probably wouldn’t.
This article is AI slop.
Yep.
This is nice, but PDF jumped the shark already. It's no longer a document format that always looks the same everywhere. The inclusion of "Dynamic XFA (XML Form Architecture) PDF" in the spec made it so PDF is an unreliable format. The aforementioned is a PDF without content that pulls down all its content from the web. It even still, ostensibly, supports Flash (swf) animations. In practice these "PDF"s are just empty white pages with an error message like,
>"Please wait... If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document. You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by visiting http://www.adobe.com/go/reader_download. For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader. Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries."
Fortunately, XFA is deprecated. I haven’t seen one of those for a very long time.
Maybe in spec, but the damage is done and persists.
The (USA) Wisconsin Dept. of Natural Resources has nearly all their regulation PDFs as these XFA non-pdfs that I cannot read. So I cannot know the regulations. My emails about this topic (to multiple addresses over many years a dozen times) have gone unanswered.
If Acrobat supports it it doesn't matter what the spec says. Until Adobe drops XFA from Acrobat and forces these extremely silly people to stop, PDF is no longer PDF.
'Your PDF:s will open slower because we decided that the CDN providers are more important than you'.
If size was important to users then it wouldn't be so common that systems providers crap out huge PDF files consisting mainly of layout junk 'sophistication' with rounded borders and whatnot.
The PDF/A stuff I've built stays under 1 MB for hundreds of pages of information, because it's text placed in a typographically sensible manner.
Ridiculous statement. CDN providers can already use filesystem compression and standard HTTP Accept-Encoding compression for transfers (which includes brotli by the way). This ISO provides virtually no benefit to them
This reasoning comes from TFA.
tl;dr Commercial entity is paying to have the ISO altered to "legalize" their SDK they are pushing which is incompatible with standard PDF readers.
ISO is pay to play so :shrug:
No this feature is coming straight from the PDF association itself and we just added experimental support before it's officially in the spec to help testing between different sdk processors.
So your comment is a falsehood
It's not even clear that they were the ones suggesting inclusion. They're just saying their library now supports the new thing.
https://pdfa.org/brotli-compression-coming-to-pdf/
> As of March 2025, the current development version of MuPDF now supports reading PDF files with Brotli compression. The source is available from github.com/ArtifexSoftware/mupdf, and will be included as an experimental feature in the upcoming 1.26.0 release.
> Similarly, the latest development version of Ghostscript can now read PDF files with Brotli compression. File creation functionality is underway. The next official Ghostscript release is scheduled for August this year, but the source is available now from github.com/ArtifexSoftware/Ghostpdl.
Yes, I do not see any source of financial gain that could motivate them for this, because both MuPDF and Ghostscript are free.
MuPDF is an excellent PDF reader, the fastest that I have ever tested. There are plenty of big PDF files where most other readers are annoyingly slow.
It is my default PDF and EPUB reader, except that in very rare cases I encounter PDF files which MuPDF cannot understand, when I use other PDF readers (e.g. Okular).
I'm no fan of Adobe, but it is not that hard to add brotli support given that it is open. Probably can be added by AI without much difficulty - it is a simple feature. I think compared to the ton of other complex features PDF has, this is an easy one.
yup, zstd is better. Overall use zstd for pretty much anything that can benefit from a general purpose compression. It's a beyond excellent library, tool, and an algorithm (set of).
Brotli w/o a custom dictionary is a weird choice to begin with.
Brotli makes a bit of sense considering this is a static asset; it compresses somewhat more than zstd. This is why brotli is pretty ubiquitous for precompressed static assets on the Web.
That said, I personally prefer zstd as well, it's been a great general use lib.
You need to crank up zstd compression level.
zstd is Pareto better than brotli - compresses better and faster
EDIT: Something weird is going on here. When compressing zstd in parallel it produces the garbage results seen here, but when compressing on a single core, it produces result competitive with Brotli (37M). See: https://news.ycombinator.com/item?id=46723158
I did my own testing where Brotli also ended up better than ZSTD: https://news.ycombinator.com/item?id=46722044
Results by compression type across 55 PDFs:
> I couldn't quickly find a way to decompress them
Does your source .pdf material have FlateDecode'd chunks or did you fully uncompress it?
What's the assumption we can point to as the reason for the counter-intuitive result?
That data in PDF files is noisy and zstd should perform better on noisy files?
I love zstd but this isn't necessarily true.
Are you sure? Admittedly I only have 1 PDF in my homedir, but no combination of flags to zstd gets it to match the size of brotli's output on that particular file. Even zstd --long --ultra -22.
If that's about using predefined dictionaries, zstd can use them too.
If brotli has a different advantage on small source files, you have my curiosity.
If you're talking about max compression, zstd likely loses out there, the answer seems to vary based on the tests I look at, but it seems to be better across a very wide range.
It's correct use of Pareto, short for Pareto frontier, if the claim being made is "for every needed compression ratio, zstd is faster; and for every needed time budget, zstd compresses better". (Whether this claim is true is another matter.)
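To make the definition concrete, here is a sketch that filters a set of (time, size) measurements down to the Pareto frontier. The data points are invented and use neutral labels, so they say nothing about how zstd and Brotli actually compare.

```python
def pareto_frontier(points):
    """Return the names of points not dominated by any other point.

    points: list of (name, seconds, size_bytes); lower is better for both.
    A point is dominated if another one is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    for name, t, s in points:
        dominated = any(
            (t2 <= t and s2 <= s) and (t2 < t or s2 < s)
            for _, t2, s2 in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical measurements: two codecs, A and B, at two settings each.
measurements = [
    ("A-fast", 1.0, 50),
    ("A-slow", 10.0, 40),
    ("B-fast", 2.0, 55),
    ("B-slow", 20.0, 38),
]
print(pareto_frontier(measurements))
```

In this made-up data, B-fast is dominated by A-fast, but B-slow stays on the frontier because nothing else reaches its size. "A is Pareto better than B" would only hold if no B setting survived the filter.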
brotli is ubiquitous because Google recommends it. While Deflate definitely sucks and is old, Google ships brotli in Chrome, and since Chrome is the de facto default platform nowadays, I'd imagine it was chosen because it was the lowest-effort lift.
Nevertheless, I expect this to be JBIG2 all over again: almost nobody will use this because we've got decades of devices and software in the wild that can't, and 20% filesize savings is pointless if your destination can't read the damn thing.
Brotli compresses my files way better, but it's doing it way slower. Anyway, universal statement "zstd is better" is not valid.
This bizarre move has all the hallmarks of embrace-extend-extinguish rather than technical excellence
Note the language: "You're not creating broken files—you're creating files that are ahead of their time."
Imagine a sales meeting where someone pitched that to you. They have to be joking, right?
I have no objection to adding Brotli, but I hope they take compatibility more seriously. You may need readers to deploy it for a long time - ten years? - before you deploy it in PDF creation tools.
(sarcasm warning...)
You're absolutely right! It's not just an inaccurate slogan—it's a patronizing use of artificial intelligence. What you're describing is not just true, it's precise.
Well, except for speed, compression algorithms need to be compared in terms of compression, you know.
Here's discussion by brotli's and zstd's staff:
https://news.ycombinator.com/item?id=19678985