Backend Management

76 backends available

cuda11-vllm
vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill

Repository: localai
License: apache-2.0
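
As a rough illustration of what these vLLM backends wrap, here is a minimal offline-inference sketch using vLLM's Python API; the model name, prompts, and sampling settings are placeholder choices for the example, not LocalAI defaults.

    # Minimal vLLM sketch; "facebook/opt-125m" is just a small placeholder model.
    from vllm import LLM, SamplingParams

    # vLLM manages the KV cache with PagedAttention and batches incoming
    # requests continuously under the hood.
    llm = LLM(model="facebook/opt-125m")

    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    prompts = [
        "The capital of France is",
        "Continuous batching means",
    ]

    # generate() returns one RequestOutput per prompt.
    for output in llm.generate(prompts, sampling):
        print(output.prompt, "->", output.outputs[0].text)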

cuda12-vllm
vLLM backend (see the cuda11-vllm entry above for the full description).

Repository: localai
License: apache-2.0

rocm-vllm
vLLM backend (see the cuda11-vllm entry above for the full description).

Repository: localai
License: apache-2.0

intel-sycl-f32-vllm
vLLM backend (see the cuda11-vllm entry above for the full description).

Repository: localai
License: apache-2.0

intel-sycl-f16-vllm
vLLM backend (see the cuda11-vllm entry above for the full description).

Repository: localai
License: apache-2.0

cuda11-vllm-development
vLLM backend, development build (see the cuda11-vllm entry above for the full description).

Repository: localai
License: apache-2.0

cuda12-vllm-development
vLLM backend, development build (see the cuda11-vllm entry above for the full description).

Repository: localai
License: apache-2.0

rocm-vllm-development
vLLM backend, development build (see the cuda11-vllm entry above for the full description).

Repository: localai
License: apache-2.0

intel-sycl-f32-vllm-development
vLLM backend, development build (see the cuda11-vllm entry above for the full description).

Repository: localai
License: apache-2.0

intel-sycl-f16-vllm-development
vLLM backend, development build (see the cuda11-vllm entry above for the full description).

Repository: localai
License: apache-2.0

cuda11-rerankers

Repository: localai

cuda12-rerankers

Repository: localai

intel-sycl-f32-rerankers

Repository: localai

intel-sycl-f16-rerankers

Repository: localai

rocm-rerankers

Repository: localai

cuda11-rerankers-development

Repository: localai

cuda12-rerankers-development

Repository: localai

rocm-rerankers-development

Repository: localai

intel-sycl-f32-rerankers-development

Repository: localai

intel-sycl-f16-rerankers-development

Repository: localai

cuda12-transformers
Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer vision, audio, video, and multimodal models, for both inference and training. It centralizes the model definition so that this definition is agreed upon across the ecosystem. transformers is the pivot across frameworks: if a model definition is supported, it will be compatible with the majority of training frameworks (Axolotl, Unsloth, DeepSpeed, FSDP, PyTorch-Lightning, ...), inference engines (vLLM, SGLang, TGI, ...), and adjacent modeling libraries (llama.cpp, mlx, ...) which leverage the model definition from transformers.

Repository: localai
License: apache-2.0
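
For comparison with the vLLM entries above, a minimal text-generation sketch using the Transformers library directly; the "gpt2" model id and the generation settings are placeholders for the example, not LocalAI defaults.

    # Minimal Transformers sketch; "gpt2" is just a small placeholder model.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Tokenize a prompt and generate a short continuation.
    inputs = tokenizer("The transformers library is", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))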
