A Survey on Efficient Inference for Large Language Models¶
Preliminaries¶
Most of the survey covers standard Transformer background, so these notes only record the parts I was less familiar with.
- The inference process of LLMs can be divided into two stages (a minimal sketch of both stages follows at the end of this section)
- Prefilling Stage: The LLM calculates and stores the KV cache of the initial input tokens, and generates the first output token.
- Decoding Stage: The LLM generates the output tokens one by one with the KV cache, and then updates it with the key (K) and value (V) pairs of the newly generated token.
- Efficiency indicators
- latency
- first token latency: latency to generate the first output token in the prefilling stage.
- per-output token latency: average latency to generate each output token during the decoding stage.
- generation latency: latency to generate the whole output token sequence.
- Memory
- model size: the memory to store the model weights
- KV cache size: memory to store KV cache
- peak memory: maximum memory usage during the generation process (approximately equal to the sum of the model size and the KV cache size)
- Throughput
- token throughput: number of generated tokens per second
- request throughput: the number of completed requests per second
- Efficiency Analysis
- Analyzed costs: computational cost, memory access cost, and memory usage
- Model size
- Attention Operation
- Decoding Approach
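To make the two stages and the latency metrics concrete, here is a minimal sketch using Hugging Face Transformers with greedy decoding; gpt2 is just a small stand-in model, and the timing code and variable names are my own illustration rather than anything from the survey.

```python
# Two-stage LLM inference (prefilling + decoding) with a KV cache,
# plus measurement of the latency/throughput metrics listed above.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("Efficient LLM inference is", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefilling stage: process the whole prompt at once, build the KV cache,
    # and produce the first output token.
    t0 = time.perf_counter()
    out = model(input_ids, use_cache=True)
    past_key_values = out.past_key_values          # the KV cache
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    first_token_latency = time.perf_counter() - t0

    # Decoding stage: generate tokens one by one, feeding only the new token
    # and updating the KV cache with its K/V pairs at every step.
    generated = [next_token]
    step_times = []
    for _ in range(31):
        t1 = time.perf_counter()
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values      # cache grows by one token
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        step_times.append(time.perf_counter() - t1)
        generated.append(next_token)

generation_latency = first_token_latency + sum(step_times)
print(f"first-token latency:      {first_token_latency * 1e3:.1f} ms")
print(f"per-output-token latency: {sum(step_times) / len(step_times) * 1e3:.1f} ms")
print(f"generation latency:       {generation_latency * 1e3:.1f} ms")
print(f"token throughput:         {len(generated) / generation_latency:.1f} tok/s")
print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```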
Data-Level Optimization¶
Input Compression¶
- Prompt Pruning
- core idea: remove unimportant tokens from the prompt (a toy sketch follows this list)
- Prompt Summary: condense the prompt into a shorter summary that preserves the key information
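As an illustration of the prompt-pruning idea, below is a toy sketch that ranks prompt tokens by their surprisal under a small LM and keeps only the most informative ones. The `keep_ratio` parameter and the surprisal heuristic are my own simplifications, not the algorithm of any specific method discussed in the survey.

```python
# Toy token-level prompt pruning: drop the tokens the LM finds least surprising.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def prune_prompt(prompt: str, keep_ratio: float = 0.6) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # Surprisal of token t given its prefix: -log p(x_t | x_<t).
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    surprisal = -logp.gather(-1, ids[:, 1:, None]).squeeze(-1)[0]
    # Always keep the first token; keep the most surprising of the rest.
    k = max(1, int(keep_ratio * surprisal.numel()))
    keep = surprisal.topk(k).indices + 1
    keep = torch.cat([torch.tensor([0]), keep.sort().values])
    return tokenizer.decode(ids[0, keep])

print(prune_prompt("Please kindly summarize the following very long article for me."))
```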
System-Level Optimization¶
Inference Engine¶
Serving System¶
- Memory Management
- paged approach (see the sketch after this sub-list)
- vLLM: see my paper-reading notes.
- LightLLM uses even finer-grained KV cache storage to cut the waste caused by irregular block boundaries: it treats the KV cache of a single token as the allocation unit, so the pre-allocated space is always fully used.
- Paged storage leads to irregular memory access in the attention operator: the data is scattered across different physical locations in memory, so the logical-to-physical block mapping has to be taken into account.
- vLLM structures the head-size dimension of the K cache as 16-byte contiguous vectors to optimize the memory-access pattern.
- FlashInfer designs a variety of data layouts and corresponding memory-access strategies.
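Here is a minimal sketch of the paged approach, assuming a vLLM-style block table that maps each sequence's logical KV cache blocks to physical blocks. `BLOCK_SIZE`, `BlockManager`, and the method names are illustrative, not a real engine's API.

```python
# Paged KV cache management: fixed-size blocks plus a per-sequence block table.
BLOCK_SIZE = 16  # tokens per KV cache block (illustrative value)

class BlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, seq_len: int) -> None:
        """Reserve space for one more token of sequence `seq_id`."""
        table = self.block_tables.setdefault(seq_id, [])
        if seq_len % BLOCK_SIZE == 0:          # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks; preempt or swap")
            table.append(self.free_blocks.pop())

    def lookup(self, seq_id: int, token_pos: int) -> tuple[int, int]:
        """Map a token position to (physical block, offset) for the attention kernel."""
        table = self.block_tables[seq_id]
        return table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

mgr = BlockManager(num_physical_blocks=8)
for pos in range(20):                        # a 20-token sequence spans two blocks
    mgr.append_token(seq_id=0, seq_len=pos)
print(mgr.lookup(seq_id=0, token_pos=17))    # -> (second physical block, offset 1)
mgr.free(seq_id=0)
```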
- Continuous Batching
- Request lengths within a batch can differ, so short requests finish early while the batch waits for the longest one, leading to low utilization.
- Continuous batching mitigates these idle periods by admitting new requests into the batch as soon as some old requests finish (see the sketch after this sub-list).
- ORCA first proposed this idea (iteration-level scheduling).
- split-and-fuse: batch prefilling requests and decoding requests together (split long prefilling requests into chunks, then batch those chunks with multiple short decoding requests).
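A minimal sketch of continuous batching is given below: after every decoding step, finished requests leave the running batch and waiting requests are admitted immediately. The `Request` class and `decode_step` placeholder are my own illustration, not ORCA's actual implementation.

```python
# Continuous (iteration-level) batching: slots are refilled every step.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

    @property
    def finished(self) -> bool:
        return self.generated >= self.max_new_tokens

def decode_step(batch: list[Request]) -> None:
    # Placeholder for one forward pass that produces one token per request.
    for req in batch:
        req.generated += 1

def serve(waiting: deque, max_batch_size: int = 4) -> None:
    running: list[Request] = []
    while waiting or running:
        # Admit new requests as soon as there is room in the batch.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        decode_step(running)
        # Retire finished requests immediately so their slots can be reused.
        done = [r for r in running if r.finished]
        running = [r for r in running if not r.finished]
        for r in done:
            print(f"request {r.req_id} finished after {r.generated} tokens")

serve(deque(Request(i, prompt_len=32, max_new_tokens=n)
            for i, n in enumerate([5, 20, 3, 8, 12])))
```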
- Scheduling Strategy
- The processing time of each request in an LLM service can differ, so the order in which requests are executed has a significant impact on throughput.
- Head-of-Line Blocking: when long requests are processed first, they quickly occupy a large amount of memory and block the processing of subsequent requests.
- FCFS: process requests in order of arrival.
- Prioritize decoding requests.
- Preemptive scheduling strategies.
- Multi-level feedback queue (MLFQ): prioritize the requests with the shortest remaining time.
- Predict request length: skip-join (a small length-aware scheduling sketch follows this list).
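Below is a small sketch of length-aware scheduling, assuming a hypothetical predictor of output length; requests are dequeued by predicted remaining work instead of arrival order, which avoids head-of-line blocking by long requests. The predictor and queue structure are my own illustration.

```python
# Shortest-predicted-job-first scheduling with a priority queue.
import heapq
import itertools

counter = itertools.count()  # tie-breaker so heapq never compares request payloads

def predicted_output_len(prompt: str) -> int:
    # Placeholder predictor: assume longer prompts lead to longer answers.
    return len(prompt.split()) * 2

queue: list[tuple[int, int, str]] = []

def submit(prompt: str) -> None:
    heapq.heappush(queue, (predicted_output_len(prompt), next(counter), prompt))

def next_request() -> str:
    predicted, _, prompt = heapq.heappop(queue)
    return prompt

submit("Write a 2000 word essay about the history of operating systems")
submit("What is 2 + 2?")
submit("Summarize this paragraph briefly")
print(next_request())   # the short question is scheduled first
```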
- Distributed Systems
- LLM services are usually deployed on multiple compute nodes to process many requests in parallel.
- Disaggregation of prefilling and decoding (see the sketch at the end of this section).
- ExeGPT: controls scheduling decision variables to maximize system throughput under latency constraints.
- Llumnix: runtime rescheduling across instances (load balancing, memory de-fragmentation, prioritization).
- TBD
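As a rough illustration of prefilling/decoding disaggregation, the sketch below runs prefill and decode in separate worker pools and hands the KV cache from one to the other. The queues, worker functions, and the string stand-in for KV tensors are purely illustrative; real systems transfer the cache over fast interconnects rather than an in-process Python queue.

```python
# Disaggregated prefilling and decoding: separate worker pools with a KV hand-off.
import queue
import threading
import time

prefill_queue: queue.Queue = queue.Queue()
decode_queue: queue.Queue = queue.Queue()

def prefill_worker() -> None:
    while True:
        req = prefill_queue.get()
        # Compute-bound: process the whole prompt at once and build the KV cache.
        req["kv_cache"] = f"kv({req['prompt']})"   # stand-in for real KV tensors
        decode_queue.put(req)                      # hand the cache off to a decode node

def decode_worker() -> None:
    while True:
        req = decode_queue.get()
        # Memory-bound: generate tokens one by one from the transferred cache.
        print(f"decoding request {req['id']} using {req['kv_cache']}")

threading.Thread(target=prefill_worker, daemon=True).start()
threading.Thread(target=decode_worker, daemon=True).start()

for i, prompt in enumerate(["hello", "explain the KV cache"]):
    prefill_queue.put({"id": i, "prompt": prompt})
time.sleep(0.5)   # let the daemon workers drain the queues in this toy example
```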