Show HN: OSS AI agent that indexes and searches the Epstein files

185 points | 17 hours ago | epstein.trynia.ai

Hi HN,

I built an open-source AI agent that has already indexed and can search the entire Epstein files, roughly 100M words of publicly released documents.

The goal was simple: make a large, messy corpus of PDFs and text files immediately searchable in a precise way, without relying on keyword search or bloated prompts.

What it does:

- The full dataset is already indexed
- You can ask natural language questions
- Answers are grounded and include direct references to source documents
- Supports both exact text lookup and semantic search

Discussion around these files is often fragmented. This makes it possible to explore the primary sources directly and verify claims without manually digging through thousands of pages.

Happy to answer questions or go into technical details.

Code: https://github.com/nozomio-labs/nia-epstein-ai

axegon_ 9 hours ago

As many others pointed out, the released files are nearly nothing compared to the full dataset. Personally, I've been fiddling a lot with OSINT and analytics over the publicly available Reddit data (a considerable amount of my spare time over the last year), and the one thing I can say is that LLMs are under-performing (a huge understatement): they are borderline useless compared to traditional ML techniques. But as far as LLMs go, the best performers are the open source uncensored models (the most uncensored and unhinged), while the worst performers are the proprietary and paid models, especially over the last 2-3 months. They have been nerfed into oblivion, to the extent that simple prompts like "who is eligible to vote in US presidential elections" are considered controversial questions. So in the unlikely event that the full files are released, I personally would look at traditional NLP techniques long before investing any time in LLMs.

jellyotsiro 6 hours ago

On the limited dataset: Completely agree - the public files are a fraction of what exists and I should have mentioned that it is not all files but all publicly available ones. But that's exactly why making even this subset searchable matters. The bar right now is people manually ctrl+F-ing through PDFs or relying on secondhand claims. This at least lets anyone verify what is public.

On LLMs vs traditional NLP: I hear you, and I've seen similar issues with LLM hallucination on structured data. That's why the architecture here is hybrid:

- Traditional exact regex/grep search for names, dates, identifiers
- Vector search for semantic queries
- LLM orchestration layer that must cite sources and can't generate answers without grounding
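
A minimal sketch of how the first two layers described above could be combined. It is not the project's code: the corpus, the document IDs, and all function names are hypothetical, and a bag-of-words cosine similarity stands in for a real embedding model.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus; the IDs and text are made up for illustration.
DOCS = {
    "doc-001": "Flight log lists passengers and dates for March 2002.",
    "doc-002": "Email about scheduling a dinner meeting in New York.",
    "doc-003": "Deposition transcript discussing travel on a private plane.",
}


def exact_search(pattern, docs):
    """Return IDs of documents whose text matches the regex, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [doc_id for doc_id, text in docs.items() if rx.search(text)]


def _vector(text):
    """Bag-of-words term counts; an assumed stand-in for real embeddings."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_search(query, docs, top_k=2):
    """Return up to top_k document IDs ranked by similarity to the query."""
    q = _vector(query)
    scored = [(_cosine(q, _vector(text)), doc_id) for doc_id, text in docs.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


def hybrid_search(query, docs, exact_pattern=None):
    """Exact hits first, then similarity-ranked hits, without duplicates."""
    results = exact_search(exact_pattern, docs) if exact_pattern else []
    for doc_id in semantic_search(query, docs):
        if doc_id not in results:
            results.append(doc_id)
    return results
```

Putting exact hits first is one possible design choice: a literal match on a name or date is hard evidence, while a similarity score is only a hint.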

sebastiennight 4 hours ago

> can't generate answers without grounding

"can't" seems like quite a strong claim. Would you care to elaborate?

I can see how one might use a JSON schema that enforces source references in the output, but there is no technique I'm aware of to constrain a model to only come up with data based on the grounding docs, vs. making up a response based on pretrained data (or hallucinating one) and still listing the provided RAG results as attached reference.

It feels like your "can't" would be tantamount to having single-handedly solved the problem of hallucinations, which if you did, would be a billion-dollar-plus unlock for you, so I'm unsure you should show that level of certainty.
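
To make the distinction concrete, here is a rough sketch of what a post-hoc citation check can and cannot do. The JSON shape (`answer`, `citations`, `doc_id`, `quote`) and the function name are assumptions for illustration, not the project's actual schema.

```python
import json


def verify_citations(answer_json, sources):
    """Return a list of problems; an empty list means every citation checks out.

    This only confirms that each quoted span appears verbatim in the cited
    document. It cannot prove that the answer text itself is faithful.
    """
    problems = []
    answer = json.loads(answer_json)
    citations = answer.get("citations", [])
    if not citations:
        problems.append("no citations provided")
    for citation in citations:
        doc_id = citation.get("doc_id")
        quote = citation.get("quote", "")
        if doc_id not in sources:
            problems.append(f"unknown document: {doc_id}")
        elif not quote or quote not in sources[doc_id]:
            problems.append(f"quote not found in {doc_id}")
    return problems
```

A check like this rejects made-up document IDs and misquoted passages, but a model can still attach a real quote to a claim the quote does not support, which is the gap pointed out above.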

WhitneyLand 5 hours ago

That doesn’t sound right. What model treats this as a controversial question?

"who is eligible to vote in US presidential elections"

pixl97 4 hours ago

Grok: "After Elon personally tortured me I have to say women are not allowed to vote in the US"

axegon_ 3 hours ago

This particular one: I suspect OpenAI uses different models in different regions, so I do get an answer, but I also want to point out that I am not paying a cent, so I can only test the free ones. For the first time ever, I can honestly say that I am glad I don't live in the US, but a friend who does sent me a few of his latest encounters, and that particular question yielded something along the lines of "I am not allowed to discuss such controversial topics, bla, bla, bla, you can easily look it up online". If that is the case, I suspect people will soon start flooding VPN providers, and companies such as OpenAI will roll that out worldwide. Time will tell, I guess.

WhitneyLand 2 hours ago

1. I tried a couple OpenAI models under a paid account with no issue:

“In U.S. presidential elections, you’re eligible to vote if you meet all of these…” goes on to list all criteria.

2. No issue found with Gemini or Claude either.

3. I tried to search for this issue online as you suggested and haven’t been able to find anything.

Not seeing any evidence this is currently a real issue.

mariogintili 8 hours ago

what are the most unhinged and uncensored models out there?

jellyotsiro 6 hours ago

Open source models with minimal safety fine tuning or Grok

axegon_ 3 hours ago

Saying Grok is uncensored is like saying that DeepSeek is uncensored. If anything, DeepSeek is probably less censored than Grok. The Dolphin family has given me the best results, though mostly in niche cases.

apercu 5 hours ago

Grok is arguably not uncensored, it’s re-aligned to a specific narrative lane.

“Uncensored” is simply a branding trick that a lot of seemingly intelligent people seem to fall for.

kanzure 4 hours ago

Wait, is abliteration actually just a branding trick? That doesn't sound correct.

spacecadet 3 hours ago

It's true. We have basically moved off the platforms for agentic security and host our own models now... OpenAI was still the fastest, cheapest, working platform for it up until the middle of last year. Hey OpenAI, thank us later for blasting your platform with threat actor data and behavior for several years! :P

plagiarist 6 hours ago

I understand uncensored in the context of LLMs, but what is unhinged? Fine tuning specifically to increase the likelihood of entering controversial topics without specific prompting?

tyre 5 hours ago

Yes, or catering to a preferred world view different from the mainstream SOTA model worldview.

Look for anything that includes the word “woke” in any marketing/tweet material.

dmos62 8 hours ago

What use-cases gave you disappointing results? Did you build some kind of RAG?

wartywhoa23 10 hours ago

The question is not how to analyze that, it's how to prosecute those who are above the law.

7bit 8 hours ago

For which you must analyze the files.

tyre 5 hours ago

Not really. We know many people involved and they’re not going to get prosecuted. Analysis is not accountability.

delusional 3 hours ago

How hard do I need to analyze the files for Bill Gates, Donald Trump, and Bill Clinton to be sentenced to work night shift at McDonalds for minimum wage for the rest of their lives?

andy_ppp 12 hours ago

I keep thinking that the lack of children’s faces in the blacked-out rectangles makes the files much less shocking. I wonder if AI could put back fake images to make it clearer to people how sick all this is.

13hunteo 9 hours ago

I understand the sentiment, but I'm always very concerned when it comes to AI generating pictures of children.

amelius 8 hours ago

Why? They are generated pictures, not real pictures.

PlatoIsADisease 2 hours ago

I'm on your team here, no one is being hurt.

Now the implications of letting people generate pictures of children....... Do I need to say more? Even then, I'm not sure of my opinion on this. No one is getting hurt by the generation of the images, but they "might could maybe possibly" cause them to act on things in real life.

When I was a teenager I used to make this argument for legalization of drugs. It wasn't the drugs that caused people to steal and murder, it was the human.

Now that I'm older, I can imagine consequences of a few bad apples pointing to AI as the starting point.

ben_w 7 hours ago

A lot of people are now struggling to detect which images are AI generated, and inferring reality from illusions.

To an extent, this was already the case with many other things, including stuff that was expressly labelled as fiction, but I recall an old quote, fooling all of the people some of the time and some of the people all of the time, it is now easier to fool more people all the time and to fool all people an increasing fraction of the time.

This isn't limited to fake pics of kids, but kids are weak and struggle to defend themselves, and in this context the tools faking them seem to me likely to increase rates of harm against them.

Xmd5a 5 hours ago

You're barely scratching the surface.

> Mr. Gates, in turn, praised Mr. Epstein’s charm and intelligence. Emailing colleagues the next day, he said: “A very attractive Swedish woman and her daughter dropped by and I ended up staying there quite late.”

What if I told you that the child sitting on Epstein's lap, the teenager he French-kissed, the girl whose skin he covered with fragments from Nabokov's Lolita, the one who had an entire corridor filled with her pictures in one of his properties, who appeared in every framed photograph on his desk and whose name is on the CD-ROMs, the only woman Epstein said he would ever marry – what if that girl is the daughter Bill Gates mentions? And that she and her mother were Epstein's main romantic interests and most percussive tools?

nancyminusone 6 hours ago

I believe this would decrease credibility of the evidence, not increase it.

Imustaskforhelp 9 hours ago

Please create a way to share conversations. I think that can be really relevant here.

I am not a huge fan of AI but I allow this use case. This is really good in my opinion.

Besides allowing convos to be shared, I hope you can also make those convos archivable on web.archive.org (the Wayback Machine).

So I am thinking that instead of having some random UUID, it can have something like https://duckduckgo.com/?q=hello+test (the query parameter for hello test).

Maybe it's just me, but the archive can show all the links it has archived for a particular domain, so if many people ask queries and archive them, you almost get a database of good queries and answers. Archive features are severely underrated in many cases.

Good luck for your project!

jellyotsiro 6 hours ago

Shareable conversations would definitely make the tool more useful, yeah. I really like the query parameter approach over UUIDs, since it would make links human-readable.
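
For illustration, the query-parameter idea could be sketched with Python's standard library as follows. `BASE_URL` is a placeholder and both function names are hypothetical; nothing here reflects the project's real routes.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder address for illustration only, not the project's real route.
BASE_URL = "https://example.com/search"


def share_url(question):
    """Build a human-readable link that carries the question in ?q=..."""
    return f"{BASE_URL}?{urlencode({'q': question})}"


def question_from_url(url):
    """Recover the question from a shared link, or None if it has no q value."""
    values = parse_qs(urlsplit(url).query).get("q", [])
    return values[0] if values else None
```

A link of this kind carries only the question, so an archived snapshot of the page is what would preserve the answer a given visitor saw.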

yuppiepuppie 10 hours ago

When first reading OSS, I thought this was going to be an Office of Strategic Services AI [0] agent :)

[0] https://en.wikipedia.org/wiki/Office_of_Strategic_Services

sebastiennight 4 hours ago

...whose most famous agent, OSS 117, predates James Bond by four years btw:

https://en.wikipedia.org/wiki/OSS_117

iowemoretohim 15 hours ago

Those are going to be some spicy hallucinations.

onionisafruit 5 hours ago

> I'm experiencing technical difficulties accessing the archive at the moment. The search tools are returning internal server errors.

looks like it’s getting hugged

darepublic 5 hours ago

This is just feeding the files into a rag db I assume? I hope? And then you can use any decent model in front of it

jellyotsiro 5 hours ago

RAG is not the core! We use semantic search, but combine it with FTS, grep, direct reads, etc.

kevin_thibedeau 5 hours ago

It would be nice to have a way to query the exposed redactions to audit which of them were in violation of the Act.

wutsthat4 15 hours ago

And what did you learn?

subzero06 13 hours ago

In 2024, Trump used Epstein's former private jet for campaign appearances

estearum 6 hours ago

Also apparently the two had Thanksgiving dinner together as recently as like 2021?

jellyotsiro 15 hours ago

Trump famously told New York Magazine in 2002: "I've known Jeff for 15 years. Terrific guy. He's a lot of fun to be with. It is even said that he likes beautiful women as much as I do, and many of them are on the younger side."

Trump and Epstein were social acquaintances in Palm Beach and New York circles during the 1990s-early 2000s. They socialized together at Mar-a-Lago and other venues

TowerTall 15 hours ago

Interesting. It is my impression that almost everyone globally already knew this. What else did you learn?

jellyotsiro 15 hours ago

I'll take like 1 hour in the evening to dive deeper; I was never familiar with the Epstein stuff until I built the agent to simplify things for me.

tokai 5 hours ago

It's peak HN to whip out an LLM instead of just reading a newspaper article or two.

ishtanbul 14 hours ago

This is one of the most widely quoted phrases by Trump on the topic of Epstein.

gregw2 7 hours ago

Feedback: This agent didn't really work well when I tried it with a specific non-famous, but definitely publicly known individual with known connections to Epstein. I'd rather not post a specific name here. I found more documents with keyword searches. I guess it did get me to the conclusion that there wasn't much out there, but it didn't even mention stuff that showed up in name keyword searches.

To replicate though, you might look at the list of individuals mentioned in the brief email from Epstein to Bannon a couple weeks before Epstein died containing ~30 names and see how your engine works with each one. See how a keyword search does on Library of Congress vs your agent.

jellyotsiro 6 hours ago

Thanks for testing this. The Bannon email from June 30, 2019 is in there (HOUSE_OVERSIGHT_029622). Good stress test idea.

Couple things happening:

- Semantic search limitation: Less-famous names don't have strong embeddings, so it defaults to general connections rather than specific mentions
- Keyword search gap: You're right - raw grep can catch exact names I'm missing
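
The stress test suggested above could be sketched roughly like this: for each name, compare a plain keyword match against whatever the agent returns and report the documents the agent missed. `agent_search` is a hypothetical callable standing in for the real agent, and the names used when trying it out would be placeholders.

```python
import re


def keyword_hits(name, docs):
    """IDs of documents that contain the name as a whole phrase, ignoring case."""
    rx = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
    return {doc_id for doc_id, text in docs.items() if rx.search(text)}


def missed_by_agent(names, docs, agent_search):
    """Map each name to the keyword-matched documents the agent did not return."""
    report = {}
    for name in names:
        missing = keyword_hits(name, docs) - set(agent_search(name))
        if missing:
            report[name] = sorted(missing)
    return report
```

An empty report would mean the agent found at least everything a literal name search finds, which is the baseline the feedback above asks for.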

VladVladikoff 6 hours ago

I saw a similar problem. Roger Schank had some conversations with Epstein and the emails can be seen in Epsteinvisualizer.com but your site claimed there were no emails or connection. To be fair to Roger, who was an AI legend of his time and someone I knew personally before his untimely death, he really was not a pedo, and most likely never got involved with the girls; I think he and Epstein just talked about AI and education mostly.

nubg 14 hours ago

Does this work with vector embeddings?

jellyotsiro 14 hours ago

it uses semantic search so yes

nathan_compton 5 hours ago

Why the heck does this start with some sort of video bullshit?

sschueller 12 hours ago

Is it able to handle a much larger dataset? Only a tiny fraction of the data has been released, from what it looks like.

jellyotsiro 6 hours ago

Yes! Once more files come out, I will add them right away.

thecopy 11 hours ago

Reminder that only 1-2% of the files have been released.

Terr_ 9 hours ago

Yep: Breaking his campaign promises, in violation of the deadlines imposed by US Federal law, and with unlawful levels of redaction.

mschuster91 4 hours ago

A case can be made to discuss if the deadlines imposed by that law are actually achievable with humans and an acceptable degree of errors (i.e. overredaction, improper/recoverable redaction, and underredaction).

That's also why many "large" criminal cases only have a very limited subset of the initial charges make it to trial (often to understandable public outrage). The larger the case, the more evidence material has to be sifted through to make an airtight case, so a lot of it is dropped before the trial to secure a conviction at all.

Basically Al Capone, rinse and repeat - they got him on taxes because that's far easier to prove than ordering or committing a murder to the required degree of certainty.

The interests of the victims, their families and the general public are different from the interests of the government... the victims/families/public want justice for the unique crime they were subject to, the government just wants to lock up the bad guy for as long (or as short, let's be clear) as possible.

dylan604 3 hours ago

> A case can be made to discuss if the deadlines imposed by that law are actually achievable with humans and an acceptable degree of errors (i.e. overredaction, improper/recoverable redaction, and underredaction).

But just a few months ago, they came out and said there were no more documents to release; now there are so many documents that it's not humanly possible to release them in said time frame?

mschuster91 2 hours ago

> But just a few months ago, they came out and said there were no more documents to release

That lie is a different but just as pressing problem. But for that, at least, it is far easier to hold the responsible people accountable... assuming of course someone actually wants to dive into that rabbit hole, and the current Congress doesn't look like it will. Maybe after the mid-terms there will be some movement if the shift is serious enough, but for now I'll assume the worst case: that either no one will be held accountable, or Trump will issue a blanket pardon again.

tehjoker 15 hours ago

This is a good idea. One thing I never understand about these kinds of projects though: why are the standard questions provided to the user as prompts never cached?

jellyotsiro 15 hours ago

Oh, forgot about it, thanks. Just a fun project I built in a couple of hours, so I didn't really sweat it, haha.

tehjoker 15 hours ago

This agent is really interesting! Learning a lot. Thanks!

jampekka 11 hours ago

Outputs are usually generated with random sampling, so the same prompt may get different outputs.
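
Because sampling makes each run differ, caching the suggested questions would pin one stored answer per prompt instead of regenerating it on every click. A minimal sketch, assuming a hypothetical `generate` callable that wraps the model call:

```python
class PresetAnswerCache:
    """Store one answer per preset prompt so repeat clicks skip the model."""

    def __init__(self, generate):
        self._generate = generate  # hypothetical callable: prompt -> answer
        self._answers = {}

    @staticmethod
    def _key(prompt):
        """Normalize case and whitespace so trivial variants share an entry."""
        return " ".join(prompt.lower().split())

    def answer(self, prompt):
        """Return the cached answer, generating it only on the first request."""
        key = self._key(prompt)
        if key not in self._answers:
            self._answers[key] = self._generate(prompt)
        return self._answers[key]
```

The trade-off is that every visitor sees the same wording for a preset question, which also makes any mistake in that stored answer repeatable and easier to spot.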

dfxm12 15 hours ago

> can search the entire Epstein files

It's worth noting that only about 1% of the files have been released, according to the DOJ.

Of the released files, many have redactions.

Terr_ 9 hours ago

Yep, they failed to meet the deadlines required by law, and it's not just any redactions either, but unlawful redactions.

King-Aaron 12 hours ago

If the Lake Michigan thing is just in the first 1%, then whatever's in the other 99% is going to be absolutely disgusting.

Tom1380 11 hours ago

I searched it with the tool but nothing came up about Lake Michigan. What happened?

King-Aaron 11 hours ago

https://www.justice.gov/epstein/files/DataSet%208/EFTA000250...

"He participated regularly in paying money to force me to ___ with him and he was present when my uncle murdered my newborn child and disposed of the body in Lake Michigan. "

The uncle is allegedly referring to Trump

Terr_ 9 hours ago

I would expect a large portion of the remaining records to be internal emails about memos about the process of building a case around evidence, rather than the root evidence itself.

Not that that would excuse the administration's unlawful behavior so far, or indicate the unreleased 99% can't have some big bombshells.

jellyotsiro 14 hours ago

Sorry, all publicly available files*

DanielScharf 6 hours ago

Super Cool!

ck2 6 hours ago

Not sure if this is possible but it should be known there is a COMPLETE INDEX to the original Epstein Files

(not including the new millions upon millions of documents and photos)

https://storage.courtlistener.com/recap/gov.uscourts.nysd.47...

from a 2017 FOIA they had to provide it

https://www.bloomberg.com/news/newsletters/2025-08-08/here-s...

Might be possible for machine-learning to determine what is missing?

(which is basically 99% missing as we already know less than 1% released)
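
If the index and the released files happened to share identifiers (an assumption, and possibly not true of the real data), finding what is missing would not need machine learning at all; a plain set difference would do. The IDs used when trying this out would be made up.

```python
def missing_from_release(index_ids, released_ids):
    """Return index entries with no released counterpart, in index order."""
    released = set(released_ids)
    return [entry for entry in index_ids if entry not in released]
```

Where identifiers do not line up, matching index descriptions to released documents becomes a fuzzy matching problem, which is where ML techniques might actually help.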

inquirerGeneral 10 hours ago

[dead]

huflungdung 9 hours ago

[dead]

p0w3n3d 11 hours ago

[flagged]

dmos62 10 hours ago

Ah, yes. Post is an LLM-something project: top comment is a general critique of LLMs. Waiting for this to get old. Meanwhile, at least you get points for being funny.

sebastiennight 5 hours ago

I think the GP was unfairly downvoted, as their comment wasn't a critique of LLMs but a comical attempt at critique of the source files themselves being redacted into uselessness.

dmos62 56 minutes ago

Oh, I didn't get that! That's pretty witty indeed. Now I feel bad.

sebastiennight 11 hours ago

    > + '' * n

This looks like what you'd get from using text-davinci-003 as the model in your AI-assisted IDE

p0w3n3d 8 hours ago

No - the UTF-8 black box was removed by Hacker News. Thanks for noticing.

Can't edit it anymore, but it would be "\u25A0" * n
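
For readers who missed the joke, the expression repeats the U+25A0 black square `n` times. A tiny sketch of using it as a redaction bar; the `redact` helper is made up for illustration and is not the original snippet.

```python
def redact(text, start, end):
    """Replace text[start:end] with one black square per hidden character."""
    return text[:start] + "\u25A0" * (end - start) + text[end:]
```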

sebastiennight 5 hours ago

Ha! That makes way more sense, and was indeed quite funny and undeserving of the massive downvoting.

flexagoon 10 hours ago

I think it looks like what you get by writing code and making a typo.

slfreference 8 hours ago

All these attempts look like emulation of "Pen (software) is mightier than Sword", or the idea that if only more people believed in the cause, we would be close to resolution.

Remember folks, soft power is nothing in front of hard power.