Portable LLMs with llamafile
Posted May 16, 2024 10:00 UTC (Thu) by yaap (subscriber, #71398)
Parent article: Portable LLMs with llamafile
> The difference is attributable to the fact that during prompt evaluation, the model can use matrix-matrix multiplications, instead of matrix-vector multiplications.
As I understand it, that's not the only factor. Prompt evaluation processes the whole input buffer (all the input tokens) in a single forward pass through the model, whereas generation is iterative: there is a separate pass through the model for each newly generated token.
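
To make the distinction concrete, here is a minimal NumPy sketch (not llamafile or llama.cpp code; the hidden size, token counts, and single weight matrix are illustrative assumptions standing in for a full transformer layer) of the two access patterns: one batched matrix-matrix multiply for the prompt versus one matrix-vector multiply per generated token.

    import numpy as np

    d_model = 4096                    # assumed hidden size
    W = np.random.randn(d_model, d_model).astype(np.float32)

    # Prompt evaluation: all prompt tokens are known up front, so one
    # matrix-matrix multiply covers the whole input buffer in a single pass.
    prompt = np.random.randn(512, d_model).astype(np.float32)  # 512 prompt tokens
    prompt_out = prompt @ W           # (512, d_model) @ (d_model, d_model)

    # Generation: each new token depends on the previous one, so the model
    # runs once per token, and each pass reduces to a matrix-vector multiply.
    token = np.random.randn(d_model).astype(np.float32)
    for _ in range(64):               # generate 64 tokens, one pass each
        token = token @ W             # (d_model,) @ (d_model, d_model)

The batched form lets the hardware reuse the weights across many tokens at once, which is why prompt evaluation can run so much faster per token than generation.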
