
💥 Evaluate LLMs - OpenAI Compatible Server

LiteLLM Server is a simple, fast, and lightweight OpenAI-compatible server for calling 100+ LLM APIs in the OpenAI input/output format

LiteLLM Server supports:

  • LLM API Calls in the OpenAI ChatCompletions format
  • Caching + Logging capabilities (Redis and Langfuse, respectively)
  • Setting API keys in the request headers or in the .env

See Code

info

We want to learn how we can make the server better! Meet the founders or join our Discord

Usage​

docker run -e PORT=8000 -p 8000:8000 ghcr.io/berriai/litellm:latest

OpenAI Proxy running on http://0.0.0.0:8000

curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $YOUR_API_KEY"
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'
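Because the server is OpenAI-compatible, the OpenAI SDKs can talk to it directly by overriding the base URL. A minimal sketch with the openai Python package (v1+), assuming the server is running locally on port 8000 and your provider keys are set server-side:

import openai

# Point the client at the LiteLLM server instead of api.openai.com
client = openai.OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    api_key="my-fake-key",  # any string works if keys are configured on the server
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say this is a test!"}],
    temperature=0.7,
)
print(response.choices[0].message.content)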

Other supported models:​

$ docker run -e PORT=8000 -e AWS_ACCESS_KEY_ID=<your-access-key> -e AWS_SECRET_ACCESS_KEY=<your-secret-key> -p 8000:8000 ghcr.io/berriai/litellm:latest
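With those AWS credentials set, the same /chat/completions endpoint can route to Bedrock by using a LiteLLM-style model name. A sketch, assuming bedrock/anthropic.claude-v2 is a model your AWS account can access (check LiteLLM's provider docs for the exact identifiers):

import openai

client = openai.OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="my-fake-key")

# "bedrock/anthropic.claude-v2" is an example identifier; substitute a Bedrock
# model your AWS account has access to.
response = client.chat.completions.create(
    model="bedrock/anthropic.claude-v2",
    messages=[{"role": "user", "content": "Say this is a test!"}],
)
print(response.choices[0].message.content)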

Tutorials (Chat-UI, NeMo Guardrails, PromptTools, Phoenix (Arize AI), Langchain, ragas, LlamaIndex, etc.)​

Start server:

`docker run -e PORT=8000 -p 8000:8000 ghcr.io/berriai/litellm:latest`

The server is now live on http://0.0.0.0:8000

Here's the docker-compose.yml for running LiteLLM Server with Mckay Wrigley's Chat-UI:

version: '3'
services:
  container1:
    image: ghcr.io/berriai/litellm:latest
    ports:
      - '8000:8000'
    environment:
      - PORT=8000
      - OPENAI_API_KEY=<your-openai-api-key>

  container2:
    image: ghcr.io/mckaywrigley/chatbot-ui:main
    ports:
      - '3000:3000'
    environment:
      - OPENAI_API_KEY=my-fake-key
      - OPENAI_API_HOST=http://container1:8000

Run this via:

docker-compose up

Endpoints:​

  • /chat/completions - chat completions endpoint to call 100+ LLMs
  • /embeddings - embedding endpoint for Azure, OpenAI, Huggingface endpoints
  • /models - available models on server
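For example, the /models and /embeddings endpoints can be exercised with the openai Python package like this (a sketch; text-embedding-ada-002 is an example model name your server must be able to serve):

import openai

client = openai.OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="my-fake-key")

# GET /models - list the models the server exposes
for model in client.models.list():
    print(model.id)

# POST /embeddings - "text-embedding-ada-002" is an example OpenAI model name
embedding = client.embeddings.create(
    model="text-embedding-ada-002",
    input=["hello from litellm"],
)
print(len(embedding.data[0].embedding))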

Save Model-specific params (API Base, API Keys, Temperature, etc.)​

Use the router_config_template.yaml to save model-specific information like api_base, api_key, temperature, max_tokens, etc.

  1. Create a config.yaml file

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: azure/chatgpt-v-2 # azure/<your-deployment-name>
      api_key: your_azure_api_key
      api_version: your_azure_api_version
      api_base: your_azure_api_base
  - model_name: mistral-7b
    litellm_params:
      model: ollama/mistral
      api_base: your_ollama_api_base

  2. Start the server

docker run -e PORT=8000 -p 8000:8000 -v $(pwd)/config.yaml:/app/config.yaml ghcr.io/berriai/litellm:latest
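Requests can then reference the model_name aliases from the config. A minimal sketch calling the mistral-7b alias defined above with the openai Python package:

import openai

client = openai.OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="my-fake-key")

# "mistral-7b" resolves to ollama/mistral via the mounted config.yaml
response = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Say this is a test!"}],
)
print(response.choices[0].message.content)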

Caching​

Add Redis Caching to your server via environment variables

### REDIS
REDIS_HOST = ""
REDIS_PORT = ""
REDIS_PASSWORD = ""

Docker command:

docker run -e REDIS_HOST=<your-redis-host> -e REDIS_PORT=<your-redis-port> -e REDIS_PASSWORD=<your-redis-password> -e PORT=8000 -p 8000:8000 ghcr.io/berriai/litellm:latest

Logging​

  1. Debug logs - print the input/output params by setting SET_VERBOSE = "True".

Docker command:

docker run -e SET_VERBOSE="True" -e PORT=8000 -p 8000:8000 ghcr.io/berriai/litellm:latest
  2. Add Langfuse logging to your server via environment variables
### LANGFUSE
LANGFUSE_PUBLIC_KEY = ""
LANGFUSE_SECRET_KEY = ""
LANGFUSE_HOST = "" # optional, defaults to https://cloud.langfuse.com

Docker command:

docker run -e LANGFUSE_PUBLIC_KEY=<your-public-key> -e LANGFUSE_SECRET_KEY=<your-secret-key> -e LANGFUSE_HOST=<your-langfuse-host> -e PORT=8000 -p 8000:8000 ghcr.io/berriai/litellm:latest

Local Usage​

$ git clone https://github.com/BerriAI/litellm.git
$ cd ./litellm/litellm_server
$ uvicorn main:app --host 0.0.0.0 --port 8000

Setting LLM API keys​

This server allows two ways of passing API keys to litellm:

  • Environment variables - by default, the server reads the LLM API keys from environment variables
  • Dynamic variables passed to /chat/completions
    • Set AUTH_STRATEGY=DYNAMIC in the environment
    • Pass the required auth params api_key, api_base, api_version with the request (see the sketch below)
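A minimal sketch of the dynamic-key flow with the openai Python package, assuming the server was started with AUTH_STRATEGY=DYNAMIC; the extra fields are forwarded to the underlying provider with the request:

import openai

client = openai.OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="my-fake-key")

# api_key / api_base / api_version are example values; the server forwards
# them to the provider when AUTH_STRATEGY=DYNAMIC is set.
response = client.chat.completions.create(
    model="azure/chatgpt-v-2",
    messages=[{"role": "user", "content": "Say this is a test!"}],
    extra_body={
        "api_key": "your_azure_api_key",
        "api_base": "your_azure_api_base",
        "api_version": "your_azure_api_version",
    },
)
print(response.choices[0].message.content)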

Deploy on Google Cloud Run​

Click the button to deploy to Google Cloud Run

Deploy

On a successful deploy, the Cloud Run shell output will include the URL of your deployed service

Testing your deployed server​

Assuming the required keys are set as Environment Variables

https://litellm-7yjrj3ha2q-uc.a.run.app is our example server; substitute the URL of your deployed Cloud Run app

curl https://litellm-7yjrj3ha2q-uc.a.run.app/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'

Set LLM API Keys​

Environment Variables​

More info here

  1. In the Google Cloud console, go to Cloud Run

  2. Click on the litellm service

  3. Click Edit and Deploy New Revision

  4. Enter your environment variables, e.g. OPENAI_API_KEY, ANTHROPIC_API_KEY

Advanced​

Caching - Completion() and Embedding() Responses​

Enable caching by adding the following credentials to your server environment

REDIS_HOST = ""       # REDIS_HOST='redis-18841.c274.us-east-1-3.ec2.cloud.redislabs.com'
REDIS_PORT = "" # REDIS_PORT='18841'
REDIS_PASSWORD = "" # REDIS_PASSWORD='liteLlmIsAmazing'

Test Caching​

Send the same request twice:

curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "write a poem about litellm!"}],
"temperature": 0.7
}'

curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "write a poem about litellm!"}],
"temperature": 0.7
}'
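To verify the cache is working, you can time the two identical calls; a sketch with the openai Python package (the second, cached call should return noticeably faster):

import time
import openai

client = openai.OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="my-fake-key")

def timed_call():
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "write a poem about litellm!"}],
        temperature=0.7,
    )
    return time.time() - start, response

first_duration, _ = timed_call()
second_duration, _ = timed_call()  # identical request - should hit the Redis cache
print(f"first: {first_duration:.2f}s, cached: {second_duration:.2f}s")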

Control caching per completion request​

Caching can be switched on/off per /chat/completions request

  • Caching on for completion - pass caching=True:
    curl http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "write a poem about litellm!"}],
    "temperature": 0.7,
    "caching": true
    }'
  • Caching off for completion - pass caching=False:
    curl http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "write a poem about litellm!"}],
    "temperature": 0.7,
    "caching": false
    }'
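The same per-request toggle works from the openai Python package by passing the non-standard caching field through extra_body (a sketch):

import openai

client = openai.OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="my-fake-key")

# caching=True asks the server to cache this completion; caching=False bypasses the cache
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "write a poem about litellm!"}],
    temperature=0.7,
    extra_body={"caching": True},
)
print(response.choices[0].message.content)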