Accelerating Inference In Foundational LLMs & Text Generation

Nov 10, 2024

This blog post summarizes the trade-offs involved in accelerating inference in LLMs, along with the output-approximating and output-preserving methods used to speed it up.

Trade-offs

The Quality vs. Latency/Cost Trade-off

The Latency vs. Cost Trade-off

Output-approximating methods

Quantization
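Quantization stores weights (and sometimes activations) in lower-precision integers, cutting memory traffic and footprint at a small accuracy cost. A minimal sketch of symmetric per-tensor int8 quantization; the function names and the toy matrix are illustrative, not from any particular library:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8: one scale chosen from max |w|."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from int8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Rounding error per element is bounded by about half a quantization step.
err = float(np.max(np.abs(w - w_hat)))
```

Production schemes refine this same idea with per-channel or per-group scales, and weight-only variants (e.g. 4-bit weights with float activations) push the memory savings further.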

Distillation
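Distillation trains a smaller student model to match a larger teacher's output distribution, so the cheaper student can serve traffic. A toy sketch of the classic temperature-softened KL objective; the logits here are made up, and in practice this term is combined with ordinary cross-entropy on the labels:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = np.sum(p * (np.log(p) - np.log(q)))
    return float(kl * T * T / len(p))

teacher = np.array([[2.0, 1.0, 0.1]])   # illustrative logits
student = np.array([[1.0, 0.5, -0.5]])
loss = kd_loss(student, teacher)
```

The temperature softens the teacher's distribution so the student also learns the relative ranking of unlikely tokens, not just the argmax.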

Output-preserving methods

Flash Attention
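FlashAttention avoids materializing the full attention score matrix by streaming keys and values through fast on-chip memory in tiles, maintaining running softmax statistics as it goes. A NumPy sketch of that online-softmax recurrence for a single query row; the block size and shapes are illustrative, and the real kernel fuses all of this into one GPU pass:

```python
import numpy as np

def attention_naive(q, K, V):
    """Reference: materializes all scores, then softmax-weights V."""
    scores = K @ q
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

def attention_online(q, K, V, block=2):
    """Streams K/V in blocks, tracking a running max, running softmax
    denominator, and running output, so full scores never exist at once."""
    m = -np.inf                 # running max of scores seen so far
    s = 0.0                     # running softmax denominator
    o = np.zeros(V.shape[1])    # running (unnormalized) output
    for i in range(0, len(K), block):
        scores = K[i:i + block] @ q
        m_new = max(m, scores.max())
        corr = np.exp(m - m_new)        # rescale old stats to new max
        w = np.exp(scores - m_new)
        s = s * corr + w.sum()
        o = o * corr + w @ V[i:i + block]
        m = m_new
    return o / s

rng = np.random.default_rng(1)
q = rng.standard_normal(8)
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 4))
```

Because the rescaling is exact, the tiled result matches the naive computation; the win is purely in memory traffic, which is why this is an output-preserving method.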

Prefix Caching
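Prefix caching reuses the KV-cache already computed for a shared prompt prefix (for example, a common system prompt), so only the new suffix needs a prefill pass. A toy sketch, with a plain dict and a string standing in for real cached KV tensors:

```python
class PrefixCache:
    """Toy prefix cache keyed by token-id tuples."""

    def __init__(self):
        self.store = {}

    def insert(self, tokens, kv_state):
        self.store[tuple(tokens)] = kv_state

    def longest_prefix(self, tokens):
        """Return (hit length, cached state) for the longest cached prefix."""
        for n in range(len(tokens), 0, -1):
            key = tuple(tokens[:n])
            if key in self.store:
                return n, self.store[key]
        return 0, None

cache = PrefixCache()
cache.insert([1, 2, 3], "kv(1,2,3)")          # stand-in for KV tensors
hit_len, kv = cache.longest_prefix([1, 2, 3, 4, 5])
# Only tokens[hit_len:] — here [4, 5] — need a fresh prefill pass.
```

Real servers organize this as a radix tree over token blocks with eviction, but the lookup idea is the same: longest cached prefix wins.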

Speculative Decoding
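Speculative decoding has a small draft model propose several tokens cheaply, then verifies them with the large target model in a single parallel pass, keeping the longest agreeing prefix. A greedy-acceptance toy; real systems accept stochastically against the target's distribution, and the `draft`/`target` callables here are stand-ins for model decoding:

```python
def speculative_step(draft, target, ctx, k=4):
    """Draft proposes k tokens; target verifies and keeps the agreeing
    prefix, plus one token of its own (a correction or a bonus)."""
    proposal, c = [], list(ctx)
    for _ in range(k):
        t = draft(c)
        proposal.append(t)
        c.append(t)

    accepted, c = [], list(ctx)
    for t in proposal:
        expected = target(c)        # in a real system: one batched pass
        if t == expected:
            accepted.append(t)
            c.append(t)
        else:
            accepted.append(expected)  # target's token replaces the miss
            break
    else:
        accepted.append(target(c))     # all accepted: free bonus token
    return accepted

def toy_model(ctx):
    # Stand-in for greedy argmax decoding from a real model.
    return len(ctx) % 3

out = speculative_step(toy_model, toy_model, ctx=[0, 1], k=4)
# A perfect draft yields all k tokens plus one bonus token per step.
```

Because every emitted token is one the target model would have produced, the output is unchanged; the speedup comes from verifying k tokens in one target pass instead of k sequential ones.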

Batching and Parallelization
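Batching amortizes each weight load across many requests; continuous batching goes further by letting finished sequences leave mid-flight so waiting requests can join immediately instead of idling until the whole batch drains. A toy scheduler loop, where the request dicts and `step_fn` are illustrative stand-ins for real decode state:

```python
from collections import deque

def continuous_batching(requests, step_fn, max_batch=2):
    """Toy continuous-batching loop: refill the batch from the queue
    whenever a sequence finishes, keeping the batch as full as possible."""
    queue = deque(requests)
    active, done = [], []
    while queue or active:
        # Admit waiting requests into any free batch slots.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # One decode step for every active sequence.
        for req in list(active):
            if step_fn(req):
                active.remove(req)
                done.append(req)
    return done

def step_fn(req):
    # Stand-in for one decode step; True means the sequence finished.
    req["steps"] -= 1
    return req["steps"] == 0

requests = [{"id": i, "steps": n} for i, n in enumerate([1, 3, 2])]
done = continuous_batching(requests, step_fn, max_batch=2)
```

Production schedulers add tensor/pipeline parallelism across devices on top of this, but the batching logic itself preserves outputs: each sequence decodes exactly as it would alone.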