Wow, this was super helpful - probably the most intuitive write up on LLM hardware economics that I’ve read anywhere.

One question for you - how does the length of the context window fit into this equation? AFAIK, longer context windows are more computationally expensive, even if you don’t fill them with tokens. How do you account for that in your calculations?

I tried to work out the math when you describe the optimal batch size for memory bound vs compute bound and I think there may be an error. The multiplicative factor of B (batch size) should be with the compute latency calculation.

## How is LLaMa.cpp possible?

Wow, this was super helpful - probably the most intuitive write up on LLM hardware economics that I’ve read anywhere.

One question for you - how does the length of the context window fit into this equation? AFAIK, longer context windows are more computationally expensive, even if you don’t fill them with tokens. How do you account for that in your calculations?

I tried to work out the math when you describe the optimal batch size for memory bound vs compute bound and I think there may be an error. The multiplicative factor of B (batch size) should be with the compute latency calculation.

Kipply's blog also has the same - https://kipp.ly/transformer-inference-arithmetic/#batch-sizes