Streaming Chat Completion

A streaming example showing how to use HF-Inferoxy for real-time chat completion with automatic token management.

Overview

This example demonstrates:

  • Getting a managed token from the proxy server (requires authentication)
  • Creating an InferenceClient with the token
  • Making a streaming chat completion request
  • Processing real-time response chunks
  • Proper error handling and token status reporting

⚠️ Important: Authentication Required

All client operations now require authentication with the HF-Inferoxy server. This is part of the Role-Based Access Control (RBAC) system that provides secure access to the proxy services.

Getting Your API Key

  1. Default Admin User: The system creates a default admin user on first run. Check your server logs or the users.json file for the default admin credentials.

  2. Create a User Account: Use the admin account to create a regular user account (a Python equivalent is shown after this list):
    curl -X POST "http://localhost:8000/admin/users" \
      -H "Authorization: Bearer ADMIN_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"username": "youruser", "email": "user@example.com", "full_name": "Your Name", "role": "user"}'
    
  3. Use the Generated API Key: The response will include an API key that you’ll use in all client operations.
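
If you prefer to script this step, the same request can be made from Python with requests; the sketch below is simply a translation of the curl command above, so adjust the host and admin key for your own deployment.

import requests

# Mirror of the curl command above: create a regular user via the admin endpoint.
# Replace the host and ADMIN_API_KEY with the values for your deployment.
response = requests.post(
    "http://localhost:8000/admin/users",
    headers={"Authorization": "Bearer ADMIN_API_KEY"},
    json={
        "username": "youruser",
        "email": "user@example.com",
        "full_name": "Your Name",
        "role": "user",
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())  # the response includes the generated API key (see step 3)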

For detailed RBAC setup and user management, see RBAC_README.md.

Code Example

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

def simple_chat_stream(proxy_api_key: str):
    # Get token from proxy server (requires authentication)
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    
    # Create client with managed token
    client = InferenceClient(
        provider="hf-inference",
        api_key=token
    )
    
    try:
        # Make streaming chat completion request
        stream = client.chat.completions.create(
            model="HuggingFaceTB/SmolLM3-3B",
            messages=[
                {"role": "user", "content": "What is the capital of France?"}
            ],
            stream=True
        )
        
        # Process streaming response
        full_response = ""
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
                full_response += content
        
        print()  # New line after streaming completes
        
        # Report success after stream completes
        report_token_status(token_id, "success", api_key=proxy_api_key)
        
        return full_response
        
    except HfHubHTTPError as e:
        # Report HuggingFace API errors back to the proxy before re-raising
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        # Report any other unexpected error back to the proxy as well
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise

# Usage
if __name__ == "__main__":
    # You need to get your API key from the admin or create a user account
    # See RBAC_README.md for details on user management
    proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
    simple_chat_stream(proxy_api_key)

Key Features

  • Real-time Streaming: Processes response chunks as they arrive
  • Authentication Required: Uses get_proxy_token(api_key=proxy_api_key) to get a valid token
  • Error Handling: Catches both HuggingFace-specific and generic errors
  • Status Reporting: Reports token success/failure back to the proxy server
  • Provider Flexibility: Can easily switch to any supported provider (see the sketch below)
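
As a sketch of that last point, switching providers only changes the provider argument passed to InferenceClient; the provider name below is illustrative, so use whichever provider your account and proxy actually support.

from huggingface_hub import InferenceClient

# Same managed token (from get_proxy_token), different inference provider.
# "novita" is only an example; any provider supported by huggingface_hub works here.
client = InferenceClient(
    provider="novita",
    api_key=token,
)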

Usage

  1. Ensure your HF-Inferoxy server is running
  2. Install required dependencies: uv add huggingface-hub requests
  3. Copy the hf_token_utils.py file to your project
  4. Get your API key from the HF-Inferoxy admin or create a user account
  5. Run the example: Copy the code and run it in your Python environment

Customization

  • Change Provider: Replace "hf-inference" with any supported provider
  • Change Model: Update the model name to use different models
  • Change Message: Modify the user message content
  • Add Context: Extend the messages list with conversation history (see the example after this list)
  • Custom Streaming: Modify how you process and display streaming chunks
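
As an example of the "Add Context" point, the messages list can carry a system prompt and earlier turns before the new question; the content below is placeholder text, and client is the one created in the main example.

# Placeholder conversation history: system prompt plus earlier turns, then the new question.
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "Roughly how many people live there?"},
]

stream = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=messages,
    stream=True
)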

Streaming Benefits

  • Faster Perceived Response: Users see content as it’s generated
  • Better User Experience: Real-time interaction feels more responsive
  • Progress Indication: Users can see the model is working
  • Early Termination: Can stop generation if needed (see the sketch below)
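
A minimal sketch of early termination: the stream returned in the main example is an ordinary Python iterator, so you can simply break out of the loop once you have enough output (the character budget below is an arbitrary example).

# Stop consuming the stream once an arbitrary output budget is reached.
max_chars = 200
collected = ""
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        collected += chunk.choices[0].delta.content
        if len(collected) >= max_chars:
            break  # remaining chunks are simply not read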