I am fascinated by this and similar research (RotorQuant, etc). It seems that by next year we will be able to run this year's largest models on last year's hardware. :)
Maybe we won't need as many data centers and as much power as we thought. Maybe we can run more powerful models locally.
> Maybe we can run more powerful models locally.
I thought the principal consequence of these KV cache optimisations was letting you run more simultaneous inferences on the same model with the same memory. It doesn't let you fit a larger model. In some sense that puts local LLM usage at a further disadvantage relative to inference done in a hyperscaler's data center.
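To see why cache compression helps throughput rather than model size, here is a back-of-envelope sketch. All the numbers (a 7B-class model in fp16, 32 layers, 8 KV heads, head dim 128, a 24 GB card, 8k-token sequences) are illustrative assumptions, not figures from the paper or the thread:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    # K and V tensors per layer: [num_kv_heads, seq_len, head_dim] each,
    # hence the factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

WEIGHT_BYTES = 7e9 * 2   # ~7B params in fp16; the weights are a fixed cost
GPU_BYTES = 24e9         # hypothetical 24 GB card

for cache_bits, label in [(16, "fp16 cache"), (4, "4-bit cache")]:
    per_seq = kv_cache_bytes(32, 8, 128, 8192, cache_bits / 8)
    free = GPU_BYTES - WEIGHT_BYTES
    print(f"{label}: {per_seq / 1e9:.2f} GB/seq -> "
          f"{int(free // per_seq)} concurrent 8k-token sequences")
```

Quantizing the cache 4x shrinks the per-sequence memory 4x, so roughly 4x more requests fit in the leftover memory, but the weight footprint (and hence the largest model you can host) is unchanged.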