Using HuggingFace Hub with HF-Inferoxy Token Management
This guide demonstrates how to use the huggingface_hub library with HF-Inferoxy’s smart token management system for seamless API key rotation and error handling across all supported providers.
Overview
HF-Inferoxy provides a simple client-side utility that automatically manages HuggingFace API tokens, handles errors, and reports usage back to the proxy server for intelligent key rotation. This eliminates the need to manually manage multiple API keys or handle token-related errors in your application code.
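At a glance, every request follows a provision → call → report loop. Here is a minimal sketch using the get_proxy_token and report_token_status helpers defined later in this guide (the model name is just one of the examples used below):

```python
from huggingface_hub import InferenceClient
from hf_token_utils import get_proxy_token, report_token_status

# 1. Provision a managed HF token from the proxy (authenticated with your proxy API key)
token, token_id = get_proxy_token(api_key="your_proxy_api_key_here")

# 2. Use the token exactly like a regular HF token
client = InferenceClient(provider="hf-inference", api_key=token)
completion = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=[{"role": "user", "content": "Hello!"}],
)

# 3. Report the outcome so the proxy can rotate or quarantine keys intelligently
report_token_status(token_id, "success", api_key="your_proxy_api_key_here")
```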
Supported Providers: HF-Inferoxy works with all major HuggingFace providers including Cerebras, Cohere, Groq, Together, Nebius, and many more. See the provider examples directory for comprehensive examples.
⚠️ Important: Authentication Required
All client operations now require authentication with the HF-Inferoxy server. This is part of the Role-Based Access Control (RBAC) system that provides secure access to the proxy services.
Getting Your API Key
- Default Admin User: The system creates a default admin user on first run. Check your server logs or the users.json file for the default admin credentials.
- Create a User Account: Use the admin account to create a regular user account (see the Python sketch after this list):

curl -X POST "http://localhost:8000/admin/users" \
  -H "Authorization: Bearer ADMIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"username": "youruser", "email": "user@example.com", "full_name": "Your Name", "role": "user"}'

- Use the Generated API Key: The response will include an API key that you’ll use in all client operations.
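If you prefer to create the account from Python, here is a sketch that simply mirrors the curl call above (same endpoint and payload; ADMIN_API_KEY is a placeholder for your admin key):

```python
import requests

ADMIN_API_KEY = "ADMIN_API_KEY"  # placeholder: the admin key from your server

response = requests.post(
    "http://localhost:8000/admin/users",
    headers={"Authorization": f"Bearer {ADMIN_API_KEY}"},
    json={
        "username": "youruser",
        "email": "user@example.com",
        "full_name": "Your Name",
        "role": "user",
    },
)
response.raise_for_status()
print(response.json())  # the response includes the API key for the new user
```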
Authentication in Client Code
All examples in this guide require you to pass your proxy API key to the utility functions:
# Get your API key from the admin or user creation
proxy_api_key = "your_proxy_api_key_here"
# Use it in all operations
token, token_id = get_proxy_token(api_key=proxy_api_key)
report_token_status(token_id, "success", api_key=proxy_api_key)
For detailed RBAC setup and user management, see RBAC_README.md.
End-user tracking (optional)
You can optionally include an end-user identifier in token usage reports via the client_name field. If omitted, the server defaults to the authenticated maintainer’s username.
end_user = "customer_123" # e.g., your app's user/customer id
token, token_id = get_proxy_token(api_key=proxy_api_key)
# ... perform requests ...
# Include end-user when reporting status
report_token_status(token_id, "success", api_key=proxy_api_key, client_name=end_user)
Installation
First, ensure you have the required dependencies:
# Install huggingface_hub
uv add huggingface-hub
# Install requests for API communication
uv add requests
Token Utils Implementation
Create a file called hf_token_utils.py with the following functions:
# hf_token_utils.py
import os
import requests
from typing import Optional, Tuple


def get_proxy_token(proxy_url: str = "http://localhost:8000", api_key: Optional[str] = None) -> Tuple[str, str]:
    """
    Get a valid token from the proxy server.

    Args:
        proxy_url: URL of the HF-Inferoxy server
        api_key: Your API key for authenticating with the proxy server

    Returns:
        Tuple of (token, token_id)

    Raises:
        Exception: If token provisioning fails
    """
    headers = {}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"

    response = requests.get(f"{proxy_url}/keys/provision", headers=headers)
    if response.status_code != 200:
        raise Exception(f"Failed to provision token: {response.text}")

    data = response.json()
    token = data["token"]
    token_id = data["token_id"]

    # For convenience, also set the environment variable
    os.environ["HF_TOKEN"] = token

    return token, token_id


def report_token_status(
    token_id: str,
    status: str = "success",
    error: Optional[str] = None,
    proxy_url: str = "http://localhost:8000",
    api_key: Optional[str] = None,
    client_name: Optional[str] = None,
) -> bool:
    """
    Report token usage status back to the proxy server.

    Args:
        token_id: ID of the token to report (from get_proxy_token)
        status: Status to report ('success' or 'error')
        error: Error message if status is 'error'
        proxy_url: URL of the HF-Inferoxy server
        api_key: Your API key for authenticating with the proxy server
        client_name: Optional end-user identifier to attribute this usage to

    Returns:
        True if the report was accepted, False otherwise
    """
    payload = {"token_id": token_id, "status": status}

    if error:
        payload["error"] = error

        # Classify the error based on actual HF error patterns
        error_type = None
        if "401 Client Error" in error:
            error_type = "invalid_credentials"
        elif "402 Client Error" in error and "exceeded your monthly included credits" in error:
            error_type = "credits_exceeded"

        if error_type:
            payload["error_type"] = error_type

    if client_name:
        payload["client_name"] = client_name

    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"

    try:
        response = requests.post(f"{proxy_url}/keys/report", json=payload, headers=headers)
        return response.status_code == 200
    except Exception:
        # Fail silently to avoid breaking the client application
        # In production, consider logging this error
        return False
Basic Usage Patterns
1. Simple Chat Completion
With HF-Inference (Default)
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status


def simple_chat(proxy_api_key: str):
    # Get token from proxy server (requires authentication)
    token, token_id = get_proxy_token(api_key=proxy_api_key)

    # Create client with managed token
    client = InferenceClient(
        provider="hf-inference",
        api_key=token
    )

    try:
        # Make chat completion request
        completion = client.chat.completions.create(
            model="HuggingFaceTB/SmolLM3-3B",
            messages=[
                {"role": "user", "content": "What is the capital of France?"}
            ]
        )

        # Report success
        report_token_status(token_id, "success", api_key=proxy_api_key)

        print(completion.choices[0].message.content)
        return completion

    except HfHubHTTPError as e:
        # Report the error
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        # Report generic error
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise


# Usage
if __name__ == "__main__":
    # You need to get your API key from the admin or create a user account
    # See RBAC_README.md for details on user management
    proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
    simple_chat(proxy_api_key)
With Other Providers
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status


def provider_chat(provider: str = "cerebras", proxy_api_key: str = None):
    """Chat completion with different providers"""
    # Get token from proxy server (requires authentication)
    token, token_id = get_proxy_token(api_key=proxy_api_key)

    # Provider-specific model mapping
    models = {
        "cerebras": "openai/gpt-oss-120b",
        "cohere": "CohereLabs/c4ai-command-r-plus",
        "groq": "openai/gpt-oss-120b",
        "together": "openai/gpt-oss-120b",
        "nebius": "Qwen/Qwen3-235B-A22B-Instruct-2507"
    }

    # Create client with managed token
    client = InferenceClient(
        provider=provider,
        api_key=token
    )

    try:
        # Make chat completion request
        completion = client.chat.completions.create(
            model=models.get(provider, "openai/gpt-oss-120b"),
            messages=[
                {"role": "user", "content": "What is the capital of France?"}
            ]
        )

        # Report success
        report_token_status(token_id, "success", api_key=proxy_api_key)

        print(f"[{provider.upper()}] {completion.choices[0].message.content}")
        return completion

    except HfHubHTTPError as e:
        # Report the error
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        # Report generic error
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise


# Usage with different providers
if __name__ == "__main__":
    # You need to get your API key from the admin or create a user account
    # See RBAC_README.md for details on user management
    proxy_api_key = "your_proxy_api_key_here"  # Get this from admin

    providers = ["cerebras", "cohere", "groq", "together"]
    for provider in providers:
        try:
            provider_chat(provider, proxy_api_key)
        except Exception as e:
            print(f"Error with {provider}: {e}")
2. Feature Extraction
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status


def extract_features(text: str, model: str = "intfloat/multilingual-e5-large", proxy_api_key: str = None):
    # Get managed token (requires authentication)
    token, token_id = get_proxy_token(api_key=proxy_api_key)

    # Create client
    client = InferenceClient(provider="hf-inference", api_key=token)

    try:
        # Extract features
        result = client.feature_extraction(text, model=model)

        # Report success
        report_token_status(token_id, "success", api_key=proxy_api_key)
        return result

    except HfHubHTTPError as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise


# Usage
# You need to get your API key from the admin or create a user account
# See RBAC_README.md for details on user management
proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
embeddings = extract_features("Today is a sunny day", proxy_api_key=proxy_api_key)
print(f"Embedding shape: {len(embeddings)}")
3. Vision-Language Models (VLM)
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status


def vlm_chat(provider: str = "cerebras", image_url: str = "https://example.com/image.jpg", proxy_api_key: str = None):
    """Vision-language chat completion with different providers"""
    # Get managed token (requires authentication)
    token, token_id = get_proxy_token(api_key=proxy_api_key)

    # Provider-specific VLM models
    vlm_models = {
        "cerebras": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "cohere": "CohereLabs/command-a-vision-07-2025",
        "featherless": "google/gemma-3-27b-it",
        "fireworks": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "groq": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "hyperbolic": "Qwen/Qwen2.5-VL-7B-Instruct",
        "nebius": "google/gemma-3-27b-it",
        "novita": "zai-org/GLM-4.5V",
        "nscale": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "sambanova": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
        "together": "meta-llama/Llama-4-Scout-17B-16E-Instruct"
    }

    # Create client with managed token
    client = InferenceClient(
        provider=provider,
        api_key=token
    )

    try:
        # Make VLM completion request
        completion = client.chat.completions.create(
            model=vlm_models.get(provider, "meta-llama/Llama-4-Scout-17B-16E-Instruct"),
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Describe this image in one sentence."
                        },
                        {
                            "type": "image_url",
                            "image_url": {"url": image_url}
                        }
                    ]
                }
            ],
        )

        # Report success
        report_token_status(token_id, "success", api_key=proxy_api_key)

        print(f"[{provider.upper()}] {completion.choices[0].message.content}")
        return completion

    except HfHubHTTPError as e:
        # Report the error
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        # Report generic error
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise


# Usage
if __name__ == "__main__":
    # You need to get your API key from the admin or create a user account
    # See RBAC_README.md for details on user management
    proxy_api_key = "your_proxy_api_key_here"  # Get this from admin

    image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"

    vlm_providers = ["cerebras", "cohere", "groq", "together"]
    for provider in vlm_providers:
        try:
            vlm_chat(provider, image_url, proxy_api_key)
        except Exception as e:
            print(f"Error with {provider}: {e}")
4. Image Generation
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status


def generate_image(prompt: str, provider: str = "fal-ai", proxy_api_key: str = None):
    """Generate image with different providers"""
    # Get managed token (requires authentication)
    token, token_id = get_proxy_token(api_key=proxy_api_key)

    # Provider-specific image models
    image_models = {
        "fal-ai": "Qwen/Qwen-Image",
        "hf-inference": "stabilityai/stable-diffusion-xl-base-1.0",
        "nebius": "black-forest-labs/FLUX.1-dev",
        "nscale": "stabilityai/stable-diffusion-xl-base-1.0",
        "replicate": "Qwen/Qwen-Image",
        "together": "black-forest-labs/FLUX.1-dev"
    }

    # Create client with managed token
    client = InferenceClient(
        provider=provider,
        api_key=token
    )

    try:
        # Generate image
        image = client.text_to_image(prompt, model=image_models.get(provider, "Qwen/Qwen-Image"))

        # Report success
        report_token_status(token_id, "success", api_key=proxy_api_key)
        return image

    except HfHubHTTPError as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise


# Usage
if __name__ == "__main__":
    # You need to get your API key from the admin or create a user account
    # See RBAC_README.md for details on user management
    proxy_api_key = "your_proxy_api_key_here"  # Get this from admin

    prompt = "Astronaut riding a horse"

    image_providers = ["fal-ai", "hf-inference", "nebius"]
    for provider in image_providers:
        try:
            image = generate_image(prompt, provider, proxy_api_key)
            print(f"Generated image with {provider}: {image.size}")
            # image.save(f"astronaut_{provider}.png")
        except Exception as e:
            print(f"Error with {provider}: {e}")
5. Automatic Speech Recognition
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status


def transcribe_audio(audio_file: str, model: str = "openai/whisper-large-v3", proxy_api_key: str = None):
    # Get managed token (requires authentication)
    token, token_id = get_proxy_token(api_key=proxy_api_key)

    # Create client
    client = InferenceClient(provider="hf-inference", api_key=token)

    try:
        # Transcribe audio
        result = client.automatic_speech_recognition(audio_file, model=model)

        # Report success
        report_token_status(token_id, "success", api_key=proxy_api_key)
        return result

    except HfHubHTTPError as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise


# Usage
# You need to get your API key from the admin or create a user account
# See RBAC_README.md for details on user management
# proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
# transcription = transcribe_audio("audio.wav", proxy_api_key=proxy_api_key)
# print(transcription["text"])
Advanced Usage Patterns
1. Automatic Retry with New Token
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status


def chat_with_retry(messages: list, provider: str = "hf-inference", model: str = None, max_retries: int = 2, proxy_api_key: str = None):
    """Chat completion with automatic token retry on auth errors"""
    # Default models for different providers
    default_models = {
        "hf-inference": "HuggingFaceTB/SmolLM3-3B",
        "cerebras": "openai/gpt-oss-120b",
        "cohere": "CohereLabs/c4ai-command-r-plus",
        "groq": "openai/gpt-oss-120b",
        "together": "openai/gpt-oss-120b"
    }

    if model is None:
        model = default_models.get(provider, "openai/gpt-oss-120b")

    for attempt in range(max_retries + 1):
        # Get a fresh token for each attempt
        token, token_id = get_proxy_token(api_key=proxy_api_key)
        client = InferenceClient(provider=provider, api_key=token)

        try:
            completion = client.chat.completions.create(
                model=model,
                messages=messages
            )

            # Report success
            report_token_status(token_id, "success", api_key=proxy_api_key)
            return completion

        except HfHubHTTPError as e:
            error_str = str(e)

            # Report the error
            report_token_status(token_id, "error", error_str, api_key=proxy_api_key)

            # Check if we should retry
            if attempt < max_retries and ("401 Client Error" in error_str or "402 Client Error" in error_str):
                print(f"Token error on attempt {attempt + 1}, retrying with new token...")
                continue
            else:
                # No more retries or non-retryable error
                raise
        except Exception as e:
            # Non-HTTP error, report and don't retry
            report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
            raise

    raise Exception("All retry attempts failed")


# Usage
# You need to get your API key from the admin or create a user account
# See RBAC_README.md for details on user management
proxy_api_key = "your_proxy_api_key_here"  # Get this from admin

messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms"}
]

# Try with different providers
providers = ["hf-inference", "cerebras", "groq"]
for provider in providers:
    try:
        completion = chat_with_retry(messages, provider=provider, proxy_api_key=proxy_api_key)
        print(f"[{provider}] {completion.choices[0].message.content}")
        break  # Success, no need to try more providers
    except Exception as e:
        print(f"[{provider}] Failed: {e}")
        continue
2. Context Manager for Token Management
from contextlib import contextmanager
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status


@contextmanager
def managed_hf_client(provider: str = "hf-inference", proxy_api_key: str = None):
    """Context manager that handles token lifecycle automatically"""
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    client = InferenceClient(provider=provider, api_key=token)

    try:
        yield client, token_id
        # If we get here, the operation was successful
        report_token_status(token_id, "success", api_key=proxy_api_key)
    except HfHubHTTPError as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise


# Usage
def process_text(text: str, provider: str = "hf-inference", proxy_api_key: str = None):
    with managed_hf_client(provider, proxy_api_key) as (client, token_id):
        result = client.feature_extraction(text, model="intfloat/multilingual-e5-large")
        return result


# Multiple operations with same token
def multiple_operations(provider: str = "hf-inference", proxy_api_key: str = None):
    with managed_hf_client(provider, proxy_api_key) as (client, token_id):
        # All operations use the same token
        if provider == "hf-inference":
            embedding = client.feature_extraction("Hello world")
            chat_result = client.chat.completions.create(
                model="HuggingFaceTB/SmolLM3-3B",
                messages=[{"role": "user", "content": "Hi"}]
            )
        else:
            # For other providers, use chat completion
            chat_result = client.chat.completions.create(
                model="openai/gpt-oss-120b",
                messages=[{"role": "user", "content": "Hi"}]
            )
            embedding = None
        return embedding, chat_result


# Usage with different providers
if __name__ == "__main__":
    # You need to get your API key from the admin or create a user account
    # See RBAC_README.md for details on user management
    proxy_api_key = "your_proxy_api_key_here"  # Get this from admin

    providers = ["hf-inference", "cerebras", "groq"]
    for provider in providers:
        try:
            if provider == "hf-inference":
                embedding, chat = multiple_operations(provider, proxy_api_key)
                print(f"[{provider}] Embedding: {len(embedding) if embedding else 'N/A'}")
            else:
                _, chat = multiple_operations(provider, proxy_api_key)
                print(f"[{provider}] Chat: {chat.choices[0].message.content}")
        except Exception as e:
            print(f"[{provider}] Error: {e}")
3. Decorator for Automatic Token Management
import functools
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status


def with_managed_token(provider: str = "hf-inference", proxy_api_key: str = None):
    """Decorator that automatically manages HF tokens for specific providers"""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Get token from proxy
            token, token_id = get_proxy_token(api_key=proxy_api_key)

            # Add token to kwargs if an api_key parameter exists
            if 'api_key' in func.__code__.co_varnames:
                kwargs['api_key'] = token

            try:
                # Call the original function
                result = func(*args, **kwargs)

                # Report success
                report_token_status(token_id, "success", api_key=proxy_api_key)
                return result

            except HfHubHTTPError as e:
                error_str = str(e)

                # Report error
                report_token_status(token_id, "error", error_str, api_key=proxy_api_key)

                # Retry once for auth errors
                if "401 Client Error" in error_str or "402 Client Error" in error_str:
                    print(f"Auth error detected with {provider}, retrying with new token...")

                    # Get a new token and retry with it
                    new_token, new_token_id = get_proxy_token(api_key=proxy_api_key)
                    if 'api_key' in func.__code__.co_varnames:
                        kwargs['api_key'] = new_token

                    try:
                        result = func(*args, **kwargs)
                        report_token_status(new_token_id, "success", api_key=proxy_api_key)
                        return result
                    except Exception as retry_error:
                        report_token_status(new_token_id, "error", str(retry_error), api_key=proxy_api_key)
                        raise

                # Re-raise original error
                raise
            except Exception as e:
                # Handle non-HTTP errors
                report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
                raise

        return wrapper
    return decorator


# Usage with decorator
# You need to get your API key from the admin or create a user account
# See RBAC_README.md for details on user management
proxy_api_key = "your_proxy_api_key_here"  # Get this from admin


@with_managed_token("hf-inference", proxy_api_key=proxy_api_key)
def get_embeddings(text: str, model: str, api_key: str = None):
    client = InferenceClient(provider="hf-inference", api_key=api_key)
    return client.feature_extraction(text, model=model)


@with_managed_token("cerebras", proxy_api_key=proxy_api_key)
def chat_completion(messages: list, model: str, api_key: str = None):
    client = InferenceClient(provider="cerebras", api_key=api_key)
    return client.chat.completions.create(model=model, messages=messages)


# Usage
if __name__ == "__main__":
    try:
        embeddings = get_embeddings("Hello world", "intfloat/multilingual-e5-large")
        print(f"Embeddings: {len(embeddings)}")
    except Exception as e:
        print(f"Embedding error: {e}")

    try:
        chat_result = chat_completion(
            [{"role": "user", "content": "What is AI?"}],
            "openai/gpt-oss-120b"
        )
        print(f"Chat: {chat_result.choices[0].message.content}")
    except Exception as e:
        print(f"Chat error: {e}")
Streaming Support
For streaming responses, use the same pattern but handle the stream appropriately:
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status


def streaming_chat(messages: list, provider: str = "hf-inference", model: str = None, proxy_api_key: str = None):
    """Streaming chat completion with token management"""
    token, token_id = get_proxy_token(api_key=proxy_api_key)

    # Default models for different providers
    default_models = {
        "hf-inference": "HuggingFaceTB/SmolLM3-3B",
        "cerebras": "openai/gpt-oss-120b",
        "cohere": "CohereLabs/c4ai-command-r-plus",
        "groq": "openai/gpt-oss-120b"
    }

    if model is None:
        model = default_models.get(provider, "openai/gpt-oss-120b")

    client = InferenceClient(provider=provider, api_key=token)

    try:
        # Create streaming completion
        stream = client.chat.completions.create(
            model=model,
            messages=messages,
            stream=True
        )

        # Process stream
        full_response = ""
        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
                full_response += content

        print()  # New line after streaming

        # Report success after stream completes
        report_token_status(token_id, "success", api_key=proxy_api_key)
        return full_response

    except HfHubHTTPError as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise


# Usage
# You need to get your API key from the admin or create a user account
# See RBAC_README.md for details on user management
proxy_api_key = "your_proxy_api_key_here"  # Get this from admin

messages = [{"role": "user", "content": "Tell me a short story about a robot"}]

providers = ["hf-inference", "cerebras", "groq"]
for provider in providers:
    try:
        print(f"\n--- {provider.upper()} ---")
        response = streaming_chat(messages, provider=provider, proxy_api_key=proxy_api_key)
        break  # Success, no need to try more providers
    except Exception as e:
        print(f"Error with {provider}: {e}")
        continue
Error Handling Best Practices
1. Comprehensive Error Handling
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError, RepositoryNotFoundError
from hf_token_utils import get_proxy_token, report_token_status


def robust_inference(text: str, model: str, provider: str = "hf-inference", proxy_api_key: str = None):
    """Example with comprehensive error handling"""
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    client = InferenceClient(provider=provider, api_key=token)

    try:
        if provider == "hf-inference":
            result = client.feature_extraction(text, model=model)
        else:
            # For other providers, use chat completion
            result = client.chat.completions.create(
                model="openai/gpt-oss-120b",
                messages=[{"role": "user", "content": text}]
            )

        report_token_status(token_id, "success", api_key=proxy_api_key)
        return result

    # RepositoryNotFoundError is a subclass of HfHubHTTPError, so it must be caught first
    except RepositoryNotFoundError as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        print(f"❌ Model not found: {model}")
        raise

    except HfHubHTTPError as e:
        error_str = str(e)
        report_token_status(token_id, "error", error_str, api_key=proxy_api_key)

        if "401 Client Error" in error_str:
            print("❌ Authentication failed - token may be invalid")
        elif "402 Client Error" in error_str:
            print("❌ Payment required - token credits exceeded")
        elif "429" in error_str:
            print("❌ Rate limited - too many requests")
        elif "500" in error_str:
            print("❌ Server error - try again later")
        else:
            print(f"❌ HTTP error: {error_str}")
        raise

    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        print(f"❌ Unexpected error: {str(e)}")
        raise
2. Logging Integration
import logging
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def logged_inference(text: str, model: str, provider: str = "hf-inference", proxy_api_key: str = None):
    """Example with proper logging"""
    logger.info(f"Starting inference with provider: {provider}, model: {model}")

    token, token_id = get_proxy_token(api_key=proxy_api_key)
    logger.info(f"Obtained token: {token_id}")

    client = InferenceClient(provider=provider, api_key=token)

    try:
        if provider == "hf-inference":
            result = client.feature_extraction(text, model=model)
        else:
            result = client.chat.completions.create(
                model="openai/gpt-oss-120b",
                messages=[{"role": "user", "content": text}]
            )

        # Report success
        report_status = report_token_status(token_id, "success", api_key=proxy_api_key)
        logger.info(f"Inference successful, status reported: {report_status}")
        return result

    except HfHubHTTPError as e:
        error_str = str(e)
        logger.error(f"HF Hub error with {provider}: {error_str}")

        # Report error
        report_status = report_token_status(token_id, "error", error_str, api_key=proxy_api_key)
        logger.info(f"Error status reported: {report_status}")
        raise

    except Exception as e:
        error_str = str(e)
        logger.error(f"Unexpected error with {provider}: {error_str}")

        # Report error
        report_status = report_token_status(token_id, "error", error_str, api_key=proxy_api_key)
        logger.info(f"Error status reported: {report_status}")
        raise
Configuration
Environment Variables
You can configure the proxy URL using environment variables:
import os
from typing import Optional

from hf_token_utils import get_proxy_token, report_token_status

# Set proxy URL via environment variable
os.environ["HF_PROXY_URL"] = "http://my-proxy-server:8000"


def get_proxy_token_env(proxy_api_key: str = None) -> tuple[str, str]:
    """Get token using environment variable for proxy URL"""
    proxy_url = os.getenv("HF_PROXY_URL", "http://localhost:8000")
    return get_proxy_token(proxy_url, api_key=proxy_api_key)


def report_token_status_env(token_id: str, status: str = "success", error: Optional[str] = None, proxy_api_key: str = None, client_name: Optional[str] = None) -> bool:
    """Report status using environment variable for proxy URL"""
    proxy_url = os.getenv("HF_PROXY_URL", "http://localhost:8000")
    return report_token_status(token_id, status, error, proxy_url, api_key=proxy_api_key, client_name=client_name)
Performance Considerations
1. Token Reuse
For multiple operations, consider reusing the same token:
from huggingface_hub import InferenceClient
from hf_token_utils import get_proxy_token, report_token_status


def batch_operations(provider: str = "hf-inference", proxy_api_key: str = None):
    """Reuse token for multiple operations"""
    # Get token once
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    client = InferenceClient(provider=provider, api_key=token)

    results = []
    try:
        if provider == "hf-inference":
            # Multiple operations with the same token
            for text in ["Hello", "World", "AI", "Future"]:
                result = client.feature_extraction(text, model="intfloat/multilingual-e5-large")
                results.append(result)
        else:
            # For other providers, use chat completion
            for text in ["Hello", "World", "AI", "Future"]:
                result = client.chat.completions.create(
                    model="openai/gpt-oss-120b",
                    messages=[{"role": "user", "content": text}]
                )
                results.append(result)

        # Report success once for all operations
        report_token_status(token_id, "success", api_key=proxy_api_key)
    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise

    return results
2. Async Operations
For async operations with asyncio:
import asyncio
from huggingface_hub import AsyncInferenceClient
from hf_token_utils import get_proxy_token, report_token_status


async def async_inference(texts: list[str], provider: str = "hf-inference", model: str = "intfloat/multilingual-e5-large", proxy_api_key: str = None):
    """Async batch inference with token management"""
    token, token_id = get_proxy_token(api_key=proxy_api_key)

    async with AsyncInferenceClient(provider=provider, api_key=token) as client:
        try:
            if provider == "hf-inference":
                # Create tasks for all texts
                tasks = [
                    client.feature_extraction(text, model=model)
                    for text in texts
                ]
            else:
                # For other providers, use chat completion
                tasks = [
                    client.chat.completions.create(
                        model="openai/gpt-oss-120b",
                        messages=[{"role": "user", "content": text}]
                    )
                    for text in texts
                ]

            # Execute all tasks concurrently
            results = await asyncio.gather(*tasks)

            # Report success
            report_token_status(token_id, "success", api_key=proxy_api_key)
            return results

        except Exception as e:
            report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
            raise


# Usage
# You need to get your API key from the admin or create a user account
# See RBAC_README.md for details on user management
# proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
# results = asyncio.run(async_inference(["text1", "text2", "text3"], provider="cerebras", proxy_api_key=proxy_api_key))
Provider-Specific Considerations
Model Compatibility
Different providers support different models and capabilities (see the sketch after this list):
- Cerebras: High-performance inference, supports most open models
- Cohere: Advanced language models with multilingual support
- Groq: Ultra-fast inference, optimized for speed
- Together: Collaborative AI hosting, wide model support
- Nebius: Cloud-native services with enterprise features
- HF-Inference: Core API with comprehensive model support
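If your application routes requests across several of these providers, it can help to keep the differences in one small registry instead of scattering model names through the code. A minimal sketch, using model names taken from the examples earlier in this guide; the PROVIDER_DEFAULTS dict and chat_on helper are illustrative names, and you should adjust the entries to whatever your providers actually serve:

```python
from huggingface_hub import InferenceClient
from hf_token_utils import get_proxy_token, report_token_status

# Illustrative registry: default chat models per provider, taken from the examples above
PROVIDER_DEFAULTS = {
    "hf-inference": "HuggingFaceTB/SmolLM3-3B",
    "cerebras": "openai/gpt-oss-120b",
    "cohere": "CohereLabs/c4ai-command-r-plus",
    "groq": "openai/gpt-oss-120b",
    "together": "openai/gpt-oss-120b",
    "nebius": "Qwen/Qwen3-235B-A22B-Instruct-2507",
}


def chat_on(provider: str, prompt: str, proxy_api_key: str):
    """Pick the provider's default model from the registry and run one chat turn."""
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    client = InferenceClient(provider=provider, api_key=token)
    try:
        completion = client.chat.completions.create(
            model=PROVIDER_DEFAULTS[provider],
            messages=[{"role": "user", "content": prompt}],
        )
        report_token_status(token_id, "success", api_key=proxy_api_key)
        return completion
    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
```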
Rate Limits and Quotas
Each provider has different rate limits and pricing:
- Monitor usage through HF-Inferoxy’s reporting system
- Use appropriate retry strategies for rate limit errors, as in the sketch after this list
- Consider provider-specific error handling patterns
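One possible retry strategy for rate-limit (429) responses is a simple exponential backoff around the same provision/report pattern used throughout this guide. A sketch, assuming that checking for "429" in the error message is sufficient for your providers; chat_with_backoff is an illustrative name:

```python
import time

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status


def chat_with_backoff(messages: list, provider: str, model: str, proxy_api_key: str, max_attempts: int = 3):
    """Retry on 429 responses with exponential backoff; other errors are raised immediately."""
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        token, token_id = get_proxy_token(api_key=proxy_api_key)
        client = InferenceClient(provider=provider, api_key=token)
        try:
            completion = client.chat.completions.create(model=model, messages=messages)
            report_token_status(token_id, "success", api_key=proxy_api_key)
            return completion
        except HfHubHTTPError as e:
            report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
            if "429" in str(e) and attempt < max_attempts:
                print(f"Rate limited by {provider}, retrying in {delay:.0f}s...")
                time.sleep(delay)
                delay *= 2  # exponential backoff
                continue
            raise
    raise RuntimeError("All backoff attempts failed")
```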
Troubleshooting
Common Issues
- Connection Refused: Ensure HF-Inferoxy server is running on the specified URL
- No Valid Keys: Check that valid HF API keys are added to the proxy server
- Import Errors: Ensure huggingface_hub and requests are installed
- Token Errors: The proxy will automatically handle and rotate problematic tokens
- Provider Not Found: Verify provider name spelling and availability
- Model Compatibility: Check if the model is supported by the chosen provider
Debug Mode
Enable debug logging to see detailed token management:
import logging
logging.basicConfig(level=logging.DEBUG)
# Your code here - will show detailed logs
Health Check
Verify the proxy server is working:
curl http://localhost:8000/health
curl http://localhost:8000/keys/status
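The same check can be run from Python, for example at application startup. A minimal sketch that simply mirrors the two curl calls above (PROXY_URL is a placeholder for your server address):

```python
import requests

PROXY_URL = "http://localhost:8000"  # adjust to your HF-Inferoxy server

for endpoint in ("/health", "/keys/status"):
    response = requests.get(f"{PROXY_URL}{endpoint}", timeout=5)
    print(endpoint, response.status_code, response.text[:200])
```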
Conclusion
The HF-Inferoxy token management system provides seamless integration with huggingface_hub across all supported providers, automatically handling:
- Token provisioning: Get valid tokens automatically
- Error handling: Detect and report authentication/credit errors
- Token rotation: Automatic switching to valid tokens
- Quarantine management: Intelligent handling of temporary issues
- Multi-provider support: Unified interface across all providers
This allows you to focus on your AI applications without worrying about token management, rate limits, or credit exhaustion, while leveraging the unique capabilities of each provider.
For detailed provider-specific examples, see the provider examples directory.