Efficient LLM inference

Finbarr Timbers

May 9, 2023

On quantization, distillation, and efficiency


7 Comments

Sidhant Chadda

Sidhant’s Substack

Jan 4 (Liked by Finbarr Timbers)

Thanks for the post,

Most model weights I have seen are floating-point values between -1 and 1. If we got rid of the exponent bits, wouldn't we save ~31% of the model weight size in a 16-bit float (5 of the 16 bits in FP16 are exponent bits)?

Presumably this would require changes in the underlying hardware itself, in order to perform calculations with this new floating-point format.

But I still find it bizarre that ML models have all these useless bits lying around.
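To make the arithmetic in this comment concrete, here is a minimal sketch that splits a few values into their IEEE 754 half-precision fields. The helper name `fp16_fields` and the sample weights are assumptions chosen for illustration, not anything from the article. It shows the 5/16 share of exponent bits, and also that values inside (-1, 1) still land on different exponents, which may be worth considering before treating those bits as unused.

```python
import struct


def fp16_fields(x: float) -> tuple[int, int, int]:
    """Round x to IEEE 754 half precision and return (sign, exponent, mantissa) bit fields."""
    # "<e" packs a Python float as a little-endian 16-bit half-precision value.
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    sign = bits >> 15                # 1 sign bit
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits (biased by 15)
    mantissa = bits & 0x3FF          # 10 mantissa bits
    return sign, exponent, mantissa


EXPONENT_BITS, TOTAL_BITS = 5, 16
exponent_share = EXPONENT_BITS / TOTAL_BITS  # fraction of each weight spent on the exponent

if __name__ == "__main__":
    print(f"exponent share: {exponent_share:.2%}")
    # Hypothetical sample weights, all with magnitude below 1.
    for w in (0.5, -0.25, 0.01, 0.001):
        print(w, fp16_fields(w))
```

In this sketch the four sample weights each get a different exponent field, so the exponent is carrying information about scale even though every value is below 1 in magnitude.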


Nathan Lambert

Interconnects

May 9, 2023 (Liked by Finbarr Timbers)

The part about GPTQ is pretty bizarre. I would've thought quantization is just doing what you showed, at scale. Maybe it works because it does the rounding as a vectorized operation, rather than naive rounding, which is slower? That doesn't sound like I'm saying anything intelligent. A tad funny that we don't know exactly why quantization works.
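For reference, the "naive rounding" baseline mentioned in this comment can be sketched as round-to-nearest quantization. This is a minimal sketch under stated assumptions: symmetric scaling by the largest absolute weight, signed 4-bit integers, and hypothetical names (`quantize_rtn`, `dequantize`). It is not GPTQ and not necessarily the scheme shown in the article.

```python
def quantize_rtn(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Round-to-nearest quantization using one absmax scale for the whole list."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for signed 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0     # avoid dividing by zero
    # Each weight is rounded independently of every other weight.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Map the integers back to approximate floating-point weights."""
    return [v * scale for v in q]


# Hypothetical sample weights for illustration.
weights = [0.7, -0.33, 0.1, 0.02]
q, scale = quantize_rtn(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

if __name__ == "__main__":
    print(q, scale, restored, max_err)
```

The point of the sketch is that each weight is rounded on its own, so the error per weight is at most half a quantization step. GPTQ differs in that, after quantizing some weights, it adjusts the weights that are still unquantized to compensate for the error already introduced, which this sketch does not attempt.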


Hamza

Aug 24, 2023

Seems like the figure comparing model sizes to levels of precision is missing.


Eddie Coda

Eddie’s Substack

Jan 5

Correct me if I'm being naive, but this seems like the same approach that was published in Exponentially Faster Language Modeling. Have you had the chance to check out their work?

Link to paper:

https://arxiv.org/abs/2308.14711


Rohit Akiwatkar

The AI Edge

Sep 1, 2023

Efficiency in inference for large language models is paramount, and this article provides valuable insights. It highlights key strategies for optimizing inference, emphasizing code profiling and simple optimizations such as data structure changes to improve performance and reduce resource utilization.


Marl Renfro

Data Brainiacs

May 25, 2023

I've posted some of the conversations I've been engaged in with ChatGPT (Chatty). Thought you might find them interesting. I'd definitely love to hear your feedback regarding the models Chatty develops as a high-level architecture to a GAI entity's sentience.

I've finished the second of three posts, which is the second section of a long interview (in a series I’ve done). You can find the first post in the series at In It's Own Words - "I Apologize. I Can't Do That". In the post, I ask Chatty (ChatGPT) to write a series of essays touching on various aspects of biological sensation and perception. This article ends just before I ask Chatty to “generate an emergent model of qualia, which begins with single cell organisms and ranges through advanced sensory and cognitive organisms.” Chatty’s answer sets up the final part of this three-part interview, in which Chatty models the GAI entity analog of biological developmental stages through which, Chatty speculates (given an architecture that could support such high-level models), GAI entities would experience subjective and (perhaps) qualitative states. Chatty’s speculation on instantiating GAI volition is presented at the end of the final section of this three-part interview.


Artificial Fintelligence
