# Streaming Chat Completion

A streaming example showing how to use HF-Inferoxy for real-time chat completion with automatic token management.
## Overview

This example demonstrates:

- Getting a managed token from the proxy server (requires authentication)
- Creating an `InferenceClient` with the token
- Making a streaming chat completion request
- Processing real-time response chunks
- Proper error handling and token status reporting
## ⚠️ Important: Authentication Required

All client operations now require authentication with the HF-Inferoxy server. This is part of the Role-Based Access Control (RBAC) system that provides secure access to the proxy services.
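In practice this means every call into `hf_token_utils.py` carries your proxy API key. A minimal sketch (the placeholder key is illustrative):

```python
from hf_token_utils import get_proxy_token

# All proxy operations authenticate with your HF-Inferoxy API key;
# the placeholder value below is illustrative.
token, token_id = get_proxy_token(api_key="your_proxy_api_key_here")
```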
### Getting Your API Key

1. **Default Admin User**: The system creates a default admin user on first run. Check your server logs or the `users.json` file for the default admin credentials.
2. **Create a User Account**: Use the admin account to create a regular user account (a Python equivalent is sketched after this list):

   ```bash
   curl -X POST "http://localhost:8000/admin/users" \
     -H "Authorization: Bearer ADMIN_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"username": "youruser", "email": "user@example.com", "full_name": "Your Name", "role": "user"}'
   ```

3. **Use the Generated API Key**: The response will include an API key that you'll use in all client operations.
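If you prefer Python over curl, the same request can be made with `requests`. Treat this as a sketch: the exact response schema (including which field carries the new API key) is defined by your HF-Inferoxy server, so inspect the payload.

```python
import requests

# Same request as the curl command above, issued from Python
resp = requests.post(
    "http://localhost:8000/admin/users",
    headers={"Authorization": "Bearer ADMIN_API_KEY"},
    json={
        "username": "youruser",
        "email": "user@example.com",
        "full_name": "Your Name",
        "role": "user",
    },
)
resp.raise_for_status()
# The response includes the new user's API key; the exact field name
# depends on your server version, so inspect the full payload
print(resp.json())
```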
For detailed RBAC setup and user management, see `RBAC_README.md`.
## Code Example

```python
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError

from hf_token_utils import get_proxy_token, report_token_status


def simple_chat_stream(proxy_api_key: str):
    # Get token from proxy server (requires authentication)
    token, token_id = get_proxy_token(api_key=proxy_api_key)

    # Create client with managed token
    client = InferenceClient(
        provider="hf-inference",
        api_key=token,
    )

    try:
        # Make streaming chat completion request
        stream = client.chat.completions.create(
            model="HuggingFaceTB/SmolLM3-3B",
            messages=[
                {"role": "user", "content": "What is the capital of France?"}
            ],
            stream=True,
        )

        # Process streaming response
        full_response = ""
        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
                full_response += content
        print()  # New line after streaming completes

        # Report success after stream completes
        report_token_status(token_id, "success", api_key=proxy_api_key)

        return full_response

    except HfHubHTTPError as e:
        # Report HuggingFace-specific errors
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        # Report generic errors
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise


# Usage
if __name__ == "__main__":
    # You need to get your API key from the admin or create a user account
    # See RBAC_README.md for details on user management
    proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
    simple_chat_stream(proxy_api_key)
```
## Key Features

- **Real-time Streaming**: Processes response chunks as they arrive
- **Authentication Required**: Uses `get_proxy_token(api_key=proxy_api_key)` to get a valid token
- **Error Handling**: Catches both HuggingFace-specific and generic errors
- **Status Reporting**: Reports token success/failure back to the proxy server
- **Provider Flexibility**: Can easily switch to any supported provider (see the sketch after this list)
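For example, switching providers is a one-line change to the client constructor. A minimal sketch, assuming `"together"` is among the providers your deployment supports and reusing the managed `token` from the example above:

```python
from huggingface_hub import InferenceClient

# Only the provider name changes; the managed token from
# get_proxy_token() is reused as-is. "together" is illustrative --
# substitute any provider your deployment supports.
client = InferenceClient(
    provider="together",
    api_key=token,
)
```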
## Usage

1. Ensure your HF-Inferoxy server is running
2. Install required dependencies: `uv add huggingface-hub requests`
3. Copy the `hf_token_utils.py` file to your project
4. Get your API key from the HF-Inferoxy admin or create a user account
5. Run the example: copy the code and run it in your Python environment (see the commands sketched below)
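Put together, the setup might look like this in a shell (the script name and the source path for `hf_token_utils.py` are illustrative):

```bash
# Illustrative commands only: adjust paths and the script name to your project
uv add huggingface-hub requests
cp /path/to/HF-Inferoxy/hf_token_utils.py .
python streaming_chat.py
```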
## Customization

- **Change Provider**: Replace `"hf-inference"` with any supported provider
- **Change Model**: Update the model name to use different models
- **Change Message**: Modify the user message content
- **Add Context**: Extend the messages list with conversation history (see the sketch after this list)
- **Custom Streaming**: Modify how you process and display streaming chunks
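For example, conversation history is just additional entries in the `messages` list, reusing the `client` from the example above (the conversation content is illustrative):

```python
# Prior turns are passed as extra messages so the model can use them
# as context for the new question.
messages = [
    {"role": "system", "content": "You are a concise geography tutor."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "And what is its population?"},
]

stream = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=messages,
    stream=True,
)
```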
## Streaming Benefits

- **Faster Perceived Response**: Users see content as it's generated
- **Better User Experience**: Real-time interaction feels more responsive
- **Progress Indication**: Users can see the model is working
- **Early Termination**: Can stop generation if needed (see the sketch after this list)
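Early termination is simply breaking out of the chunk loop. A minimal sketch, reusing the `stream` from the example above (the 200-character cutoff is arbitrary):

```python
# Stop consuming the stream once we have enough text; remaining
# chunks are never read. The cutoff is arbitrary -- use whatever
# condition fits your application.
full_response = ""
for chunk in stream:
    if chunk.choices[0].delta.content:
        full_response += chunk.choices[0].delta.content
    if len(full_response) > 200:
        break
```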
## Related Examples

- **Simple Chat Completion** - For non-streaming responses
- **Token Utilities** - Helper functions for token management
- **Provider Examples** - Provider-specific configuration guides
- **RBAC Setup** - User management and authentication setup