ISO PDF spec is getting Brotli – ~20 % smaller documents with no quality loss

ericpauley • 16 days ago

Some real cognitive dissonance in this article…

“The PDF Association operates under a strict principle—any new feature must work seamlessly with existing readers” followed by introducing compression as a breaking change in the same paragraph.

All this for brotli… on a read-many format like pdf zstd’s decompression speed is a much better fit.

xxs • 16 days ago

yup, zstd is better. Overall use zstd for pretty much anything that can benefit from a general purpose compression. It's a beyond excellent library, tool, and an algorithm (set of).

Brotli w/o a custom dictionary is a weird choice to begin with.

adzm • 16 days ago

Brotli makes a bit of sense considering this is a static asset; it compresses somewhat more than zstd. This is why brotli is pretty ubiquitous for precompressed static assets on the Web.

That said, I personally prefer zstd as well, it's been a great general use lib.

dist-epoch • 16 days ago

You need to crank up zstd compression level.

zstd is Pareto better than brotli - compresses better and faster

+5

atiedebee • 16 days ago

mort96 • 16 days ago

EDIT: Something weird is going on here. When compressing zstd in parallel it produces the garbage results seen here, but when compressing on a single core, it produces result competitive with Brotli (37M). See: https://news.ycombinator.com/item?id=46723158

I did my own testing where Brotli also ended up better than ZSTD: https://news.ycombinator.com/item?id=46722044

Results by compression type across 55 PDFs:

    +------+------+-----+------+--------+
    | none | zstd | xz  | gzip | brotli |
    +------|------|-----|------|--------|
    | 47M  | 45M  | 39M | 38M  | 37M    |
    +------+------+-----+------+--------+

mrspuratic • 16 days ago

> I couldn't quickly find a way to decompress them

    pdftk in.pdf output out.pdf decompress

Thoreandan • 15 days ago

Does your source .pdf material have FlateDecode'd chunks or did you fully uncompress it?

atiedebee • 15 days ago

Ran the tests again with some more files, this time decompressing the pdf in advance. I picked some widely available PDFs to make the experiment reproducable.

  file            | raw         | zstd       (%)      | brotli     (%)     |
  gawk.pdf        | 8.068.092   | 1.437.529  (17.8%)  | 1.376.106  (17.1%) |
  shannon.pdf     | 335.009     | 68.739     (20.5%)  | 65.978     (19.6%) |
  attention.pdf   | 24.742.418  | 367.367    (1.4%)   | 362.578    (1.4%)  |
  learnopengl.pdf | 253.041.425 | 37.756.229 (14.9%)  | 35.223.532 (13.9%) |

For learnopengl.pdf I also tested the decompression performance, since it is such a large file, and got the following (less surprising) results using 'perf stat -r 5':

  zstd:   0.4532 +- 0.0216 seconds time elapsed  ( +-  4.77% )
  brotli: 0.7641 +- 0.0242 seconds time elapsed  ( +-  3.17% )

The conclusion seems to be consistent with what brotli's authors have said: brotli achieves slightly better compression, at the cost of a little over half the decompression speed.

order-matters • 16 days ago

Whats the assumption we can potentially target as reason for the counter-intuitive result?

that data in pdf files are noisy and zstd should perform better on noisy files?

+1

jeffbee • 16 days ago

DetroitThrow • 16 days ago

I love zstd but this isn't necessarily true.

+1

dchest • 16 days ago

+1

itsdesmond • 16 days ago

stonogo • 15 days ago

brotli is ubiquitous because Google recommends it. While Deflate definitely sucks and is old, Google ships brotli in Chrome, and since Chrome is the de facto default platform nowadays, I'd imagine it was chosen because it was the lowest-effort lift.

Nevertheless, I expect this to be JBIG2 all over again: almost nobody will use this because we've got decades of devices and software in the wild that can't, and 20% filesize savings is pointless if your destination can't read the damn thing.

deepsun • 15 days ago

Brotli compresses my files way better, but it's doing it way slower. Anyway, universal statement "zstd is better" is not valid.

xxs • 15 days ago

On max compression "--ultra -22", zstd is likely to be 2-4% less dense (larger) on text alike input. While taking over 2x times times to compress. Decompression is also much faster, usually over 2x.

I have not tried using a dictionary for zstd.

greenavocado • 16 days ago

This bizzare move has all the hallmarks of embrace-extend-extinguish rather than technical excellence

mmooss • 15 days ago

Note the language: "You're not creating broken files—you're creating files that are ahead of their time."

Imagine a sales meeting where someone pitched that to you. They have to be joking, right?

I have no objection to adding Brotli, but I hope they take the compatability more seriously. You may need readers to deploy it for a long time - ten years? - before you deploy it in PDF creation tools.

nxobject • 15 days ago

(sarcasm warning...)

You're absolutely right! It's not just an inaccurate slogan—it's a patronizing use of artificial intelligence. What you're describing is not just true, it's precise.

mmooss • 14 days ago

I don't understand your point ...

eventualcomp • 14 days ago

The commenter is making a joke about the style of delivery of the sentence you quoted, because the style is [1]characteristic of AI generated writing.

[1]https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing

spider-mario • 15 days ago

> on a read-many format like pdf zstd’s decompression speed is a much better fit.

brotli decompression is already plenty fast. For PDFs, zstd’s advantage in decompression speed is academic.

deepsun • 15 days ago

Well, except for speed, compression algorithms need to be compared in terms of compression, you know.

Here's discussion by brotli's and zstd's staff:

https://news.ycombinator.com/item?id=19678985

bhouston • 16 days ago

Are they using a custom dictionary with Brotli designed for PDFs? I am not sure if it would help or not, but it seems like one of those cases it may help?

Something like this:

https://developer.chrome.com/blog/shared-dictionary-compress...

In my applications, in the area of 3D, I've been moving away from Brotli because it is just so slow for large files. I prefer zstd, because it is like 10x faster for both compression and decompression.

whizzx • 16 days ago

The pdf association is still running experiments on whether or not to support custom dictionaries based on real life workloads gains.

So it might land in the spec once it has proven if offers enough value

Proclus • 16 days ago

It seems they're using the standard dictionary, which is utterly bizzare.

The standard Brotli dictionary bakes in a ton of assumptions about what the Web looked like in 2015, including not just which HTML tags were particularly common but also such things as which swear words were trendy.

It doesn't seem reasonable to think that PDFs have symbol probabilities remotely similar to the web corpus Google used to come up with that dictionary.

On top of that, it seems utterly daft to be baking that into a format which is expected to fit archival use cases and thus impose that 2015 dictionary on PDF readers for a century to come.

I too would strongly prefer that they use zstd.

bhouston • 16 days ago

BTW I've looked into custom dictionaries before for similar use cases and I suspect it would only offer like a 1% improvement or so for PDFs -- still good, but not a massive difference maker. The issue is that PDFs, like web pages, are incredibly repetitive in terms of their tags/structure. As such the custom dictionary only helps if the doc is really small, otherwise because of the repetitive nature, the self-inferred dictionary will resemble the custom dictionary after just a few blocks of PDF content.

The sole exception is if they are restarting the brotli stream for each page, and they are not sharing a dictionary, custom or inferred across the whole doc. Then the dictionary will have to be re-inferred on each page, and then a shared custom dictionary would make more sense.

bobpaw • 16 days ago

How can iText claim that adding Brotli is not a backward incompatible change (in the "Why keep encoding separate" table)? In the first section the author states that any new feature must work seamlessly with existing readers. New documents created that include this compression would be unintelligible to any reader that only supports Deflate.

Am I missing something? Adoption will take a long time if you can't be confident the receiver of a document or viewers of a publication will be able to open the file.

whizzx • 16 days ago

It's prototypish work to support it before it land's in the official specification. But it will indeed take some adoption time.

Because I'm doing the work to patch in support across different viewers to help adoption grow. And once the big opensource ones ship it pdfjs, poppler, pdfium, adoption can quickly rise.

croes • 16 days ago

There are old devices where the viewer can’t be patched. That’s killing one of the main features of PDF

nialse • 16 days ago

Who is responsible for the terrible decision? In the pro vs con analysis, saving 20% size occasionally vs updating ALL pdf libraries/apps/viewers ever built SHOULD be a no-brainer.

superkuh • 16 days ago

This is nice, but PDF jumped the shark already. It's no longer a document format that always looks the same everywhere. The inclusion of "Dynamic XFA (XML Form Architecture) PDF" in the spec made it so PDF is an unreliable format. The aformentioned is a PDF without content that pulls down all it's content from the web. It even still, ostensibly, supports Flash (swf) animations. In practice these "PDF"s are just empty white pages with an error message like,

>"Please wait... If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document. You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by visiting http://www.adobe.com/go/reader_download. For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader. Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries."

kayodelycaon • 16 days ago

Fortunately, XFA is deprecated. I haven’t seen one of those for a very long time.

superkuh • 15 days ago

Maybe in spec, but the damage is done and persists.

The (USA) Wisconsin Dept. of Natural Resources has nearly all their regulation PDFs as these XFA non-pdfs that I cannot read. So I cannot know the regulations. My emails about this topic (to multiple addresses over many years a dozen times) have gone unanswered.

If Acrobat supports it it doesn't matter what the spec says. Until Adobe drops XFA from Acrobat and forces these extremely silly people to stop, PDF is no longer PDF.

ndriscoll • 16 days ago

What is the point of using a generic compression algorithm in a file format? Does this actually get you much over turning on filesystem and transport compression, which can transparently swap the generic algorithm (e.g. my files are already all zstd compressed. HTTP can already negotiate brotli or zstd)? If it's not tuned to the application, it seems like it's better to leave it uncompressed and let the user decide what they want (e.g. people noting tradeoffs with bro vs zstd; let the person who has to live with the tradeoff decide it, not the original file author).

wongarsu • 16 days ago

Few people enable file system compression, and even if they do it's usually with fast algorithms like lz4 or zstd -1. When authoring a document you have very different tradeoffs and can afford the cost of high compression levels of zstd or brotli.

Someone • 16 days ago

- inside the file, the compressor can be varied according to the file content. For example, images can use jpeg, but that isn’t useful for compressing text

- when jumping from page to page, you won’t have to decompress the entire file

wizzwizz4 • 15 days ago

> inside the file, the compressor can be varied according to the file content. For example, images can use jpeg, but that isn’t useful for compressing text

Okay, so we make a compressed container format that can perform such shenanigans, for the same amount of back-compat issues as extending PDF in this way.

> when jumping from page to page, you won’t have to decompress the entire file

This is already a thing with any compression format that supports quasi-random access, which is most of them. The answers to https://stackoverflow.com/q/429987/5223757 discuss a wide variety of tools for producing (and seeking into) such files, which can be read normally by tools not familiar with the conventions in use.

Someone • 15 days ago

> Okay, so we make a compressed container format that can perform such shenanigans, for the same amount of back-compat issues as extending PDF in this way.

Far from the same amount:

- existing tools that split PDFs into pages will remain working

- if defensively programmed, existing PDF readers will be able to render PDFs containing JPEG XL images, except for the images themselves.

eru • 16 days ago

Well, if sanity had prevailed, we would have likely stuck to .ps.gz (or you favourite compression format), instead of ending up with PDF.

Though we might still want to restrict the subset of PostScript that we allow. The full language might be a bit too general to take from untrusted third parties.

dunham • 16 days ago

Don't you end up with PDF if you start with PS and restrict it to a subset? And maybe normalize the structure of the file a little. The structure is nice when you want to take the content and draw a bit more on the page. Or when subsetting/combining files.

I suspect PDF was fairly sane in the initial incarnation, and it's the extra garbage that they've added since then that is a source of pain.

I'm not a big fan of this additional change (nor any of the javascript/etc), but I would be fine with people leaving content streams uncompressed and running the whole file through brotli or something.

eru • 15 days ago

> Don't you end up with PDF if you start with PS and restrict it to a subset?

PDF is also a binary format.

mikkupikku • 16 days ago

I thought PDFs can contain arbitrary PS.

lmz • 15 days ago

Compression filters are in PostScript.

ksec • 16 days ago

Why not zstd?

PunchyHamster • 16 days ago

incompetence

whizzx • 16 days ago

You can read about it here https://pdfa.org/brotli-compression-coming-to-pdf/

jeffbee • 16 days ago

That mentions zstd in a weird incomplete sentence, but never compares it.

F3nd0 • 16 days ago

They don’t seem to provide a detailed comparison showing how each compression scheme fared at every task, but they do list (some of) their criteria and say they found Brotli the best of the bunch. I can’t tell if that’s a sensible conclusion or not, though. Maybe Brotli did better on code size or memory use?

+1

eviks • 16 days ago

HackerThemAll • 16 days ago

I think this was the main reason (from the linked article) LOL:

"Brotli is a compression algorithm developed by Google."

They have no idea about Zstandard nor ANS/FSE comparing it with LZ77.

Sheer incompetence.

cortesoft • 16 days ago

I can’t imagine the people actually doing the technical work don’t know about Zstandard.

mort96 • 16 days ago

EDIT: Something weird is going on here. When compressing zstd in parallel it produces the garbage results seen here, but when compressing on a single core, it produces result competitive with Brotli (37M). See: https://news.ycombinator.com/item?id=46723158

I just took all PDFs I had in my downloads folder (55, totaling 47M). These are invoices, data sheets, employment contracts, schematics, research reports, a bunch of random stuff really.

I compressed them all with 'zstd --ultra -22', 'brotli -9', 'xz -9' and 'gzip -9'. Here are the results:

    +------+------+-----+------+--------+
    | none | zstd | xz  | gzip | brotli |
    +------|------|-----|------|--------|
    | 47M  | 45M  | 39M | 38M  | 37M    |
    +------+------+-----+------+--------+

Here's a table with all the files:

    +------+------+------+------+--------+
    | raw  | zstd | xz   | gzip | brotli |
    +------+------+------+------+--------+
    | 12K  | 12K  | 12K  | 12K  | 12K    |
    | 20K  | 20K  | 20K  | 20K  | 20K    | x5
    | 24K  | 20K  | 20K  | 20K  | 20K    | x5
    | 28K  | 24K  | 24K  | 24K  | 24K    |
    | 28K  | 24K  | 24K  | 24K  | 24K    |
    | 32K  | 20K  | 20K  | 20K  | 20K    | x3
    | 32K  | 24K  | 24K  | 24K  | 24K    |
    | 40K  | 32K  | 32K  | 32K  | 32K    |
    | 44K  | 40K  | 40K  | 40K  | 40K    |
    | 44K  | 40K  | 40K  | 40K  | 40K    |
    | 48K  | 36K  | 36K  | 36K  | 36K    |
    | 48K  | 48K  | 48K  | 48K  | 48K    |
    | 76K  | 128K | 72K  | 72K  | 72K    |
    | 84K  | 140K | 84K  | 80K  | 80K    | x7
    | 88K  | 136K | 76K  | 76K  | 76K    |
    | 124K | 152K | 88K  | 92K  | 92K    |
    | 124K | 152K | 92K  | 96K  | 92K    |
    | 140K | 160K | 100K | 100K | 100K   |
    | 152K | 188K | 128K | 128K | 132K   |
    | 188K | 192K | 184K | 184K | 184K   |
    | 264K | 256K | 240K | 244K | 240K   |
    | 320K | 256K | 228K | 232K | 228K   |
    | 440K | 448K | 408K | 408K | 408K   |
    | 448K | 448K | 432K | 432K | 432K   |
    | 516K | 384K | 376K | 384K | 376K   |
    | 992K | 320K | 260K | 296K | 280K   |
    | 1.0M | 2.0M | 1.0M | 1.0M | 1.0M   |
    | 1.1M | 192K | 192K | 228K | 200K   |
    | 1.1M | 2.0M | 1.1M | 1.1M | 1.1M   |
    | 1.2M | 1.1M | 1.0M | 1.0M | 1.0M   |
    | 1.3M | 2.0M | 1.1M | 1.1M | 1.1M   |
    | 1.7M | 2.0M | 1.7M | 1.7M | 1.7M   |
    | 1.9M | 960K | 896K | 952K | 916K   |
    | 2.9M | 2.0M | 1.3M | 1.4M | 1.4M   |
    | 3.2M | 4.0M | 3.1M | 3.1M | 3.0M   |
    | 3.7M | 4.0M | 3.5M | 3.5M | 3.5M   |
    | 6.4M | 4.0M | 4.1M | 3.7M | 3.5M   |
    | 6.4M | 6.0M | 6.1M | 5.8M | 5.7M   |
    | 9.7M | 10M  | 10M  | 9.5M | 9.4M   |
    +------+------+------+------+--------+

Zstd is surprisingly bad on this data set. I'm guessing it struggles with the already-compressed image data in some of these PDFs.

Going by only compression ratio, brotli is clearly better than the rest here and zstd is the worst. You'd have to find some other reason (maybe decompression speed, maybe spec complexity, or maybe you just trust Facebook more than Google) to choose zstd over brotli, going by my results.

I wish I could share the data set for reproducibility, but I obviously can't just share every PDF I happened to have laying around in my downloads folder :p

mort96 • 15 days ago

Turns out that these numbers are caused by APFS weirdness. I used 'du' to get them which reports the size on disk, which is weirdly bloated for some reason when compressing in parallel. I should've used 'du -A', which reports the apparent size.

Here's a table with the correct sizes, reported by 'du -A' (which shows the apparent size):

    +---------+---------+--------+--------+--------+
    |  none   |  zstd   |   xz   |  gzip  | brotli |
    +---------|---------|--------|--------|--------|
    | 47.81M  | 37.92M  | 37.96M | 38.80M | 37.06M |
    +---------+---------+--------+--------+--------+

These numbers are much more impressive. Still, Brotli has a slight edge.

terrelln • 16 days ago

> | 1.1M | 2.0M | 1.1M | 1.1M | 1.1M |

Something is going terribly wrong with `zstd` here, where it is reported to compress a file of 1.1MB to 2MB. Zstd should never grow the file size by more than a very small percent, like any compressor. Am I interpreting it correctly that you're doing something like `zstd -22 --ultra $FILE && wc -c $FILE.zst`?

If you can reproduce this behavior, can you please file an issue with the zstd version you are using, the commands used, and if possible the file producing this result.

+2

mort96 • 16 days ago

gcr • 15 days ago

If you're worried about double-compression of image data, you can uncompress all images by using qpdf:

    qpdf --stream-data=uncompress in.pdf out.pdf

The resulting file should compress better with zstd.

noname120 • 16 days ago

Why not use a more widespread compression algorithm (e.g. gzip) considering that Brotli barely performs better at all? Sounds like a pain for portability

+1

mort96 • 16 days ago

noname120 • 16 days ago

Could you add compression and decompression speeds to your table?

+2

mort96 • 16 days ago

gcr • 15 days ago

If we're making breaking changes to PDFs, I'd love if the committee added a modern image format like JPEG-XL. In my experience, most disk usage of PDFs comes from images, not streams.

I keep a bunch of comics in PDF but JPEG-XL is by far the best way to enjoy them in terms of disk space.

Bolwin • 15 days ago

Odd you should say that, as that's exactly what they've been discussing

gcr • 15 days ago

No it's not. This article is about proposing Brotli as another possible '/Filter' for stream objects, like content streams (page drawing commands). Images are streams too, but unless you mean compressing raw pixel bytes in Brotli, there's no mention of a JPEG-XL or WEBP filter.

NoahZuniga • 15 days ago

well, not mentioned in this specific article. But JPEG-XL support is something they're working on [1].

[1]: https://pdfa.org/wp-content/uploads/2025/10/PDFDays2025-Brea...

gcr • 14 days ago

Oh cool!! TIL

whinvik • 16 days ago

I am often frustrated by PDF issues such as how complicated it is to create one.

But reading the article I realized PDFs have become ubiquitous because of its insistence on backwards compatibility. Maybe for some things it's good to move this slow.

jhealy • 15 days ago

The article is wrong, the PDF spec has introduced breaking changes plenty of times. It’s done slowly and conservatively though, particularly now that the format is an ISO spec.

The PDF format is versioned, and in the past new versions have introduced things like new types of encryption. It’s quite probable that a v1.7 compliant PDF won’t open on a reader app written when v1.3 was the latest standard.

nbevans • 15 days ago

This is a really really bad idea. Don't break backwards compat. for 20% of gains. Internet connection speeds and storage capacities only go up. In a few years time, 20% of gains will seem crazy to have broken back-compat for.

cess11 • 16 days ago

'Your PDF:s will open slower because we decided that the CDN providers are more important than you'.

If size was important to users then it wouldn't be so common that systems providers crap out huge PDF files consisting mainly of layout junk 'sophistication' with rounded borders and whatnot.

The PDF/A stuff I've built stays under 1 MB for hundreds of pages of information, because it's text placed in a typographically sensible manner.

noname120 • 16 days ago

Ridiculous statement. CDN providers can already use filesystem compression and standard HTTP Accept-Encoding compression for transfers (which includes brotli by the way). This ISO provides virtually no benefit to them

cess11 • 15 days ago

This reasoning comes from TFA.

h4x0rr • 16 days ago

Wouldn't lzma2 be better here since a pdf is more read heavy?

F3nd0 • 16 days ago

Going by one of Brotli’s authors’ comment [1] on another post, it probably wouldn’t.

[1] https://news.ycombinator.com/item?id=46035817

avalys • 16 days ago

This article is AI slop.

jeffbee • 16 days ago

Yep.

delfinom • 16 days ago

tl;dr Commerical entity is paying to have the ISO altered to "legalize" their SDK they are pushing which is incompatible with standard PDF readers.

ISO is pay to play so :shrug:

whizzx • 16 days ago

No this feature is coming straight from the PDF association itself and we just added experimental support before it's officially in the spec to help testing between different sdk processors.

So your comment is a falsehood

lmz • 16 days ago

It's not even clear that they were the ones suggesting inclusion. They're just saying their library now supports the new thing.

https://pdfa.org/brotli-compression-coming-to-pdf/

> As of March 2025, the current development version of MuPDF now supports reading PDF files with Brotli compression. The source is available from github.com/ArtifexSoftware/mupdf, and will be included as an experimental feature in the upcoming 1.26.0 release.

> Similarly, the latest development version of Ghostscript can now read PDF files with Brotli compression. File creation functionality is underway. The next official Ghostscript release is scheduled for August this year, but the source is available now from github.com/ArtifexSoftware/Ghostpdl.

adrian_b • 16 days ago

Yes, I do not see any source of financial gain that could motivate them for this, because both MuPDF and Ghostscript are free.

MuPDF is an excellent PDF reader, the fastest that I have ever tested. There are plenty of big PDF files where most other readers are annoyingly slow.

It is my default PDF and EPUB reader, except that in very rare cases I encounter PDF files which MuPDF cannot understand, when I use other PDF readers (e.g. Okular).

bhouston • 16 days ago

I'm no fan of Adobe, but it is not that hard to add brotli support given that it is open. Probably can be added by AI without much difficulty - it is a simple feature. I think compared to the ton of other complex features PDF has, this is an easy one.

vgtftf • 16 days ago

[flagged]