Streaming Chat Completion

A streaming example showing how to use HF-Inferoxy for real-time chat completion with automatic token management.

Overview

This example demonstrates:

  • Getting a managed token from the proxy server (requires authentication)
  • Creating an InferenceClient with the token
  • Making a streaming chat completion request
  • Processing real-time response chunks
  • Proper error handling and token status reporting

⚠️ Important: Authentication Required

All client operations now require authentication with the HF-Inferoxy server. This is part of the Role-Based Access Control (RBAC) system that provides secure access to the proxy services.

Getting Your API Key

  1. Default Admin User: The system creates a default admin user on first run. Check your server logs or the users.json file for the default admin credentials.

  2. Create a User Account: Use the admin account to create a regular user account (a Python equivalent is shown after this list):
    curl -X POST "http://localhost:8000/admin/users" \
      -H "Authorization: Bearer ADMIN_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"username": "youruser", "email": "user@example.com", "full_name": "Your Name", "role": "user"}'
    
  3. Use the Generated API Key: The response will include an API key that you’ll use in all client operations.
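
If you prefer to script this step, the same request can be made from Python with requests; the sketch below is simply a translation of the curl command above, so adjust the host and admin key for your own deployment.

import requests

# Mirror of the curl command above: create a regular user via the admin endpoint.
# Replace the host and ADMIN_API_KEY with the values for your deployment.
response = requests.post(
    "http://localhost:8000/admin/users",
    headers={"Authorization": "Bearer ADMIN_API_KEY"},
    json={
        "username": "youruser",
        "email": "user@example.com",
        "full_name": "Your Name",
        "role": "user",
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())  # the response includes the generated API key (see step 3)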

For detailed RBAC setup and user management, see RBAC_README.md.

Code Example

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

def simple_chat_stream(proxy_api_key: str):
    # Get token from proxy server (requires authentication)
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    
    # Create client with managed token
    client = InferenceClient(
        provider="hf-inference",
        api_key=token
    )
    
    try:
        # Make streaming chat completion request
        stream = client.chat.completions.create(
            model="HuggingFaceTB/SmolLM3-3B",
            messages=[
                {"role": "user", "content": "What is the capital of France?"}
            ],
            stream=True
        )
        
        # Process streaming response
        full_response = ""
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
                full_response += content
        
        print()  # New line after streaming completes
        
        # Report success after stream completes
        report_token_status(token_id, "success", api_key=proxy_api_key)
        
        return full_response
        
    except HfHubHTTPError as e:
        # Report HuggingFace API errors back to the proxy before re-raising
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        # Report any other unexpected error back to the proxy as well
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise

# Usage
if __name__ == "__main__":
    # You need to get your API key from the admin or create a user account
    # See RBAC_README.md for details on user management
    proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
    simple_chat_stream(proxy_api_key)

Key Features

  • Real-time Streaming: Processes response chunks as they arrive
  • Authentication Required: Uses get_proxy_token(api_key=proxy_api_key) to get a valid token
  • Error Handling: Catches both HuggingFace-specific and generic errors
  • Status Reporting: Reports token success/failure back to the proxy server
  • Provider Flexibility: Can easily switch to any supported provider (see the sketch below)
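
As a sketch of that last point, switching providers only changes the provider argument passed to InferenceClient; the provider name below is illustrative, so use whichever provider your account and proxy actually support.

from huggingface_hub import InferenceClient

# Same managed token (from get_proxy_token), different inference provider.
# "novita" is only an example; any provider supported by huggingface_hub works here.
client = InferenceClient(
    provider="novita",
    api_key=token,
)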

Usage

  1. Ensure your HF-Inferoxy server is running
  2. Install required dependencies: uv add huggingface-hub requests
  3. Copy the hf_token_utils.py file to your project
  4. Get your API key from the HF-Inferoxy admin or create a user account
  5. Run the example: Copy the code and run it in your Python environment

Customization

  • Change Provider: Replace "hf-inference" with any supported provider
  • Change Model: Update the model name to use different models
  • Change Message: Modify the user message content
  • Add Context: Extend the messages list with conversation history (see the example after this list)
  • Custom Streaming: Modify how you process and display streaming chunks
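
As an example of the "Add Context" point, the messages list can carry a system prompt and earlier turns before the new question; the content below is placeholder text, and client is the one created in the main example.

# Placeholder conversation history: system prompt plus earlier turns, then the new question.
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "Roughly how many people live there?"},
]

stream = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=messages,
    stream=True
)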

Streaming Benefits

  • Faster Perceived Response: Users see content as it’s generated
  • Better User Experience: Real-time interaction feels more responsive
  • Progress Indication: Users can see the model is working
  • Early Termination: Can stop generation if needed (see the sketch below)
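
A minimal sketch of early termination: the stream returned in the main example is an ordinary Python iterator, so you can simply break out of the loop once you have enough output (the character budget below is an arbitrary example).

# Stop consuming the stream once an arbitrary output budget is reached.
max_chars = 200
collected = ""
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        collected += chunk.choices[0].delta.content
        if len(collected) >= max_chars:
            break  # remaining chunks are simply not read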