Using HuggingFace Hub with HF-Inferoxy Token Management

This guide demonstrates how to use the huggingface_hub library with HF-Inferoxy’s smart token management system for seamless API key rotation and error handling across all supported providers.

Overview

HF-Inferoxy provides a simple client-side utility that automatically manages HuggingFace API tokens, handles errors, and reports usage back to the proxy server for intelligent key rotation. This eliminates the need to manually manage multiple API keys or handle token-related errors in your application code.

Supported Providers: HF-Inferoxy works with all major HuggingFace providers including Cerebras, Cohere, Groq, Together, Nebius, and many more. See the provider examples directory for comprehensive examples.

⚠️ Important: Authentication Required

All client operations now require authentication with the HF-Inferoxy server. This is part of the Role-Based Access Control (RBAC) system that provides secure access to the proxy services.

Getting Your API Key

  1. Default Admin User: The system creates a default admin user on first run. Check your server logs or the users.json file for the default admin credentials.

  2. Create a User Account: Use the admin account to create a regular user account:
    curl -X POST "http://localhost:8000/admin/users" \
      -H "Authorization: Bearer ADMIN_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"username": "youruser", "email": "user@example.com", "full_name": "Your Name", "role": "user"}'
    
  3. Use the Generated API Key: The response will include an API key that you’ll use in all client operations.
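
To confirm the key works before wiring it into client code, you can call the token provisioning endpoint directly (a minimal check, assuming the default localhost URL used throughout this guide); a 200 response containing token and token_id means the key is valid:

curl -X GET "http://localhost:8000/keys/provision" \
  -H "Authorization: Bearer YOUR_PROXY_API_KEY"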

Authentication in Client Code

All examples in this guide require you to pass your proxy API key to the utility functions:

# Get your API key from the admin or user creation
proxy_api_key = "your_proxy_api_key_here"

# Use it in all operations
token, token_id = get_proxy_token(api_key=proxy_api_key)
report_token_status(token_id, "success", api_key=proxy_api_key)

For detailed RBAC setup and user management, see RBAC_README.md.

End-user tracking (optional)

You can optionally include an end-user identifier in token usage reports via the client_name field. If omitted, the server will default to the authenticated maintainer’s username.

end_user = "customer_123"  # e.g., your app's user/customer id
token, token_id = get_proxy_token(api_key=proxy_api_key)

# ... perform requests ...

# Include end-user when reporting status
report_token_status(token_id, "success", api_key=proxy_api_key, client_name=end_user)

Installation

First, ensure you have the required dependencies:

# Install huggingface_hub
uv add huggingface-hub

# Install requests for API communication
uv add requests

Token Utils Implementation

Create a file called hf_token_utils.py with the following functions:

# hf_token_utils.py
import os
import requests
import json
from typing import Dict, Optional, Any, Tuple

def get_proxy_token(proxy_url: str = "http://localhost:8000", api_key: Optional[str] = None) -> Tuple[str, str]:
    """
    Get a valid token from the proxy server.
    
    Args:
        proxy_url: URL of the HF-Inferoxy server
        api_key: Your API key for authenticating with the proxy server
        
    Returns:
        Tuple of (token, token_id)
        
    Raises:
        Exception: If token provisioning fails
    """
    headers = {}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    
    response = requests.get(f"{proxy_url}/keys/provision", headers=headers)
    if response.status_code != 200:
        raise Exception(f"Failed to provision token: {response.text}")
    
    data = response.json()
    token = data["token"]
    token_id = data["token_id"]
    
    # For convenience, also set environment variable
    os.environ["HF_TOKEN"] = token
    
    return token, token_id

def report_token_status(
    token_id: str,
    status: str = "success",
    error: Optional[str] = None,
    proxy_url: str = "http://localhost:8000",
    api_key: Optional[str] = None,
    client_name: Optional[str] = None,
) -> bool:
    """
    Report token usage status back to the proxy server.
    
    Args:
        token_id: ID of the token to report (from get_proxy_token)
        status: Status to report ('success' or 'error')
        error: Error message if status is 'error'
        proxy_url: URL of the HF-Inferoxy server
        api_key: Your API key for authenticating with the proxy server
        client_name: Optional end-user identifier to attribute usage to
        
    Returns:
        True if the report was accepted, False otherwise
    """
    payload = {"token_id": token_id, "status": status}
    
    if error:
        payload["error"] = error
        
        # Extract error classification based on actual HF error patterns
        error_type = None
        if "401 Client Error" in error:
            error_type = "invalid_credentials"
        elif "402 Client Error" in error and "exceeded your monthly included credits" in error:
            error_type = "credits_exceeded"
            
        if error_type:
            payload["error_type"] = error_type
    
    if client_name:
        payload["client_name"] = client_name

    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
        
    try:
        response = requests.post(f"{proxy_url}/keys/report", json=payload, headers=headers)
        return response.status_code == 200
    except Exception:
        # Silently fail to avoid breaking the client application
        # In production, consider logging this error
        return False

Basic Usage Patterns

1. Simple Chat Completion

With HF-Inference (Default)

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

def simple_chat(proxy_api_key: str):
    # Get token from proxy server (requires authentication)
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    
    # Create client with managed token
    client = InferenceClient(
        provider="hf-inference",
        api_key=token
    )
    
    try:
        # Make chat completion request
        completion = client.chat.completions.create(
            model="HuggingFaceTB/SmolLM3-3B",
            messages=[
                {"role": "user", "content": "What is the capital of France?"}
            ]
        )
        
        # Report success
        report_token_status(token_id, "success", api_key=proxy_api_key)
        
        print(completion.choices[0].message.content)
        return completion
        
    except HfHubHTTPError as e:
        # Report the error
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        # Report generic error
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise

# Usage
if __name__ == "__main__":
    # You need to get your API key from the admin or create a user account
    # See RBAC_README.md for details on user management
    proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
    simple_chat(proxy_api_key)

With Other Providers

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

def provider_chat(provider: str = "cerebras", proxy_api_key: str = None):
    """Chat completion with different providers"""
    # Get token from proxy server (requires authentication)
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    
    # Provider-specific model mapping
    models = {
        "cerebras": "openai/gpt-oss-120b",
        "cohere": "CohereLabs/c4ai-command-r-plus",
        "groq": "openai/gpt-oss-120b",
        "together": "openai/gpt-oss-120b",
        "nebius": "Qwen/Qwen3-235B-A22B-Instruct-2507"
    }
    
    # Create client with managed token
    client = InferenceClient(
        provider=provider,
        api_key=token
    )
    
    try:
        # Make chat completion request
        completion = client.chat.completions.create(
            model=models.get(provider, "openai/gpt-oss-120b"),
            messages=[
                {"role": "user", "content": "What is the capital of France?"}
            ]
        )
        
        # Report success
        report_token_status(token_id, "success", api_key=proxy_api_key)
        
        print(f"[{provider.upper()}] {completion.choices[0].message.content}")
        return completion
        
    except HfHubHTTPError as e:
        # Report the error
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        # Report generic error
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise

# Usage with different providers
if __name__ == "__main__":
    # You need to get your API key from the admin or create a user account
    # See RBAC_README.md for details on user management
    proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
    providers = ["cerebras", "cohere", "groq", "together"]
    for provider in providers:
        try:
            provider_chat(provider, proxy_api_key)
        except Exception as e:
            print(f"Error with {provider}: {e}")

2. Feature Extraction

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

def extract_features(text: str, model: str = "intfloat/multilingual-e5-large", proxy_api_key: str = None):
    # Get managed token (requires authentication)
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    
    # Create client
    client = InferenceClient(provider="hf-inference", api_key=token)
    
    try:
        # Extract features
        result = client.feature_extraction(text, model=model)
        
        # Report success
        report_token_status(token_id, "success", api_key=proxy_api_key)
        
        return result
        
    except HfHubHTTPError as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise

# Usage
# You need to get your API key from the admin or create a user account
# See RBAC_README.md for details on user management
proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
embeddings = extract_features("Today is a sunny day", proxy_api_key=proxy_api_key)
print(f"Embedding shape: {len(embeddings)}")

3. Vision-Language Models (VLM)

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

def vlm_chat(provider: str = "cerebras", image_url: str = "https://example.com/image.jpg", proxy_api_key: str = None):
    """Vision-language chat completion with different providers"""
    # Get managed token (requires authentication)
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    
    # Provider-specific VLM models
    vlm_models = {
        "cerebras": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "cohere": "CohereLabs/command-a-vision-07-2025",
        "featherless": "google/gemma-3-27b-it",
        "fireworks": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "groq": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "hyperbolic": "Qwen/Qwen2.5-VL-7B-Instruct",
        "nebius": "google/gemma-3-27b-it",
        "novita": "zai-org/GLM-4.5V",
        "nscale": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "sambanova": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
        "together": "meta-llama/Llama-4-Scout-17B-16E-Instruct"
    }
    
    # Create client with managed token
    client = InferenceClient(
        provider=provider,
        api_key=token
    )
    
    try:
        # Make VLM completion request
        completion = client.chat.completions.create(
            model=vlm_models.get(provider, "meta-llama/Llama-4-Scout-17B-16E-Instruct"),
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Describe this image in one sentence."
                        },
                        {
                            "type": "image_url",
                            "image_url": {"url": image_url}
                        }
                    ]
                }
            ],
        )
        
        # Report success
        report_token_status(token_id, "success", api_key=proxy_api_key)
        
        print(f"[{provider.upper()}] {completion.choices[0].message.content}")
        return completion
        
    except HfHubHTTPError as e:
        # Report the error
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        # Report generic error
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise

# Usage
if __name__ == "__main__":
    # You need to get your API key from the admin or create a user account
    # See RBAC_README.md for details on user management
    proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
    image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
    vlm_providers = ["cerebras", "cohere", "groq", "together"]
    
    for provider in vlm_providers:
        try:
            vlm_chat(provider, image_url, proxy_api_key)
        except Exception as e:
            print(f"Error with {provider}: {e}")

4. Image Generation

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

def generate_image(prompt: str, provider: str = "fal-ai", proxy_api_key: str = None):
    """Generate image with different providers"""
    # Get managed token (requires authentication)
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    
    # Provider-specific image models
    image_models = {
        "fal-ai": "Qwen/Qwen-Image",
        "hf-inference": "stabilityai/stable-diffusion-xl-base-1.0",
        "nebius": "black-forest-labs/FLUX.1-dev",
        "nscale": "stabilityai/stable-diffusion-xl-base-1.0",
        "replicate": "Qwen/Qwen-Image",
        "together": "black-forest-labs/FLUX.1-dev"
    }
    
    # Create client with managed token
    client = InferenceClient(
        provider=provider,
        api_key=token
    )
    
    try:
        # Generate image
        image = client.text_to_image(prompt, model=image_models.get(provider, "Qwen/Qwen-Image"))
        
        # Report success
        report_token_status(token_id, "success", api_key=proxy_api_key)
        
        return image
        
    except HfHubHTTPError as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise

# Usage
if __name__ == "__main__":
    # You need to get your API key from the admin or create a user account
    # See RBAC_README.md for details on user management
    proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
    prompt = "Astronaut riding a horse"
    image_providers = ["fal-ai", "hf-inference", "nebius"]
    
    for provider in image_providers:
        try:
            image = generate_image(prompt, provider, proxy_api_key)
            print(f"Generated image with {provider}: {image.size}")
            # image.save(f"astronaut_{provider}.png")
        except Exception as e:
            print(f"Error with {provider}: {e}")

5. Automatic Speech Recognition

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

def transcribe_audio(audio_file: str, model: str = "openai/whisper-large-v3", proxy_api_key: str = None):
    # Get managed token (requires authentication)
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    
    # Create client
    client = InferenceClient(provider="hf-inference", api_key=token)
    
    try:
        # Transcribe audio
        result = client.automatic_speech_recognition(audio_file, model=model)
        
        # Report success
        report_token_status(token_id, "success", api_key=proxy_api_key)
        
        return result
        
    except HfHubHTTPError as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise

# Usage
# You need to get your API key from the admin or create a user account
# See RBAC_README.md for details on user management
# proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
# transcription = transcribe_audio("audio.wav", proxy_api_key=proxy_api_key)
# print(transcription["text"])

Advanced Usage Patterns

1. Automatic Retry with New Token

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

def chat_with_retry(messages: list, provider: str = "hf-inference", model: str = None, max_retries: int = 2, proxy_api_key: str = None):
    """Chat completion with automatic token retry on auth errors"""
    
    # Default models for different providers
    default_models = {
        "hf-inference": "HuggingFaceTB/SmolLM3-3B",
        "cerebras": "openai/gpt-oss-120b",
        "cohere": "CohereLabs/c4ai-command-r-plus",
        "groq": "openai/gpt-oss-120b",
        "together": "openai/gpt-oss-120b"
    }
    
    if model is None:
        model = default_models.get(provider, "openai/gpt-oss-120b")
    
    for attempt in range(max_retries + 1):
        # Get a fresh token for each attempt
        token, token_id = get_proxy_token(api_key=proxy_api_key)
        client = InferenceClient(provider=provider, api_key=token)
        
        try:
            completion = client.chat.completions.create(
                model=model,
                messages=messages
            )
            
            # Report success
            report_token_status(token_id, "success", api_key=proxy_api_key)
            return completion
            
        except HfHubHTTPError as e:
            error_str = str(e)
            
            # Report the error
            report_token_status(token_id, "error", error_str, api_key=proxy_api_key)
            
            # Check if we should retry
            if attempt < max_retries and ("401 Client Error" in error_str or "402 Client Error" in error_str):
                print(f"Token error on attempt {attempt + 1}, retrying with new token...")
                continue
            else:
                # No more retries or non-retryable error
                raise
                
        except Exception as e:
            # Non-HTTP error, report and don't retry
            report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
            raise
    
    raise Exception("All retry attempts failed")

# Usage
# You need to get your API key from the admin or create a user account
# See RBAC_README.md for details on user management
proxy_api_key = "your_proxy_api_key_here"  # Get this from admin

messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms"}
]

# Try with different providers
providers = ["hf-inference", "cerebras", "groq"]
for provider in providers:
    try:
        completion = chat_with_retry(messages, provider=provider, proxy_api_key=proxy_api_key)
        print(f"[{provider}] {completion.choices[0].message.content}")
        break  # Success, no need to try more providers
    except Exception as e:
        print(f"[{provider}] Failed: {e}")
        continue

2. Context Manager for Token Management

from contextlib import contextmanager
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

@contextmanager
def managed_hf_client(provider: str = "hf-inference", proxy_api_key: str = None):
    """Context manager that handles token lifecycle automatically"""
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    client = InferenceClient(provider=provider, api_key=token)
    
    try:
        yield client, token_id
        # If we get here, operation was successful
        report_token_status(token_id, "success", api_key=proxy_api_key)
    except HfHubHTTPError as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise

# Usage
def process_text(text: str, provider: str = "hf-inference", proxy_api_key: str = None):
    with managed_hf_client(provider, proxy_api_key) as (client, token_id):
        result = client.feature_extraction(text, model="intfloat/multilingual-e5-large")
        return result

# Multiple operations with same token
def multiple_operations(provider: str = "hf-inference", proxy_api_key: str = None):
    with managed_hf_client(provider, proxy_api_key) as (client, token_id):
        # All operations use the same token
        if provider == "hf-inference":
            embedding = client.feature_extraction("Hello world")
            
            chat_result = client.chat.completions.create(
                model="HuggingFaceTB/SmolLM3-3B",
                messages=[{"role": "user", "content": "Hi"}]
            )
        else:
            # For other providers, use chat completion
            chat_result = client.chat.completions.create(
                model="openai/gpt-oss-120b",
                messages=[{"role": "user", "content": "Hi"}]
            )
            embedding = None
        
        return embedding, chat_result

# Usage with different providers
if __name__ == "__main__":
    # You need to get your API key from the admin or create a user account
    # See RBAC_README.md for details on user management
    proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
    
    providers = ["hf-inference", "cerebras", "groq"]
    
    for provider in providers:
        try:
            if provider == "hf-inference":
                embedding, chat = multiple_operations(provider, proxy_api_key)
                print(f"[{provider}] Embedding: {len(embedding) if embedding else 'N/A'}")
            else:
                _, chat = multiple_operations(provider, proxy_api_key)
                print(f"[{provider}] Chat: {chat.choices[0].message.content}")
        except Exception as e:
            print(f"[{provider}] Error: {e}")

3. Decorator for Automatic Token Management

import functools
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

def with_managed_token(provider: str = "hf-inference", proxy_api_key: str = None):
    """Decorator that automatically manages HF tokens for specific providers"""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Get token from proxy
            token, token_id = get_proxy_token(api_key=proxy_api_key)
            
            # Add token to kwargs if api_key parameter exists
            if 'api_key' in func.__code__.co_varnames:
                kwargs['api_key'] = token
            
            try:
                # Call the original function
                result = func(*args, **kwargs)
                
                # Report success
                report_token_status(token_id, "success", api_key=proxy_api_key)
                
                return result
                
            except HfHubHTTPError as e:
                error_str = str(e)
                
                # Report error
                report_token_status(token_id, "error", error_str, api_key=proxy_api_key)
                
                # Retry once for auth errors
                if "401 Client Error" in error_str or "402 Client Error" in error_str:
                    print(f"Auth error detected with {provider}, retrying with new token...")
                    
                    # Get a new token and pass it to the retried call
                    new_token, new_token_id = get_proxy_token(api_key=proxy_api_key)
                    if 'api_key' in func.__code__.co_varnames:
                        kwargs['api_key'] = new_token
                    
                    try:
                        # Retry with new token
                        result = func(*args, **kwargs)
                        report_token_status(new_token_id, "success", api_key=proxy_api_key)
                        return result
                    except Exception as retry_error:
                        report_token_status(new_token_id, "error", str(retry_error), api_key=proxy_api_key)
                        raise
                
                # Re-raise original error
                raise
                
            except Exception as e:
                # Handle non-HTTP errors
                report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
                raise
                
        return wrapper
    return decorator

# Usage with decorator
# You need to get your API key from the admin or create a user account
# See RBAC_README.md for details on user management
proxy_api_key = "your_proxy_api_key_here"  # Get this from admin

@with_managed_token("hf-inference", proxy_api_key=proxy_api_key)
def get_embeddings(text: str, model: str, api_key: str = None):
    client = InferenceClient(provider="hf-inference", api_key=api_key)
    return client.feature_extraction(text, model=model)

@with_managed_token("cerebras", proxy_api_key=proxy_api_key)
def chat_completion(messages: list, model: str, api_key: str = None):
    client = InferenceClient(provider="cerebras", api_key=api_key)
    return client.chat.completions.create(model=model, messages=messages)

# Usage
if __name__ == "__main__":
    try:
        embeddings = get_embeddings("Hello world", "intfloat/multilingual-e5-large")
        print(f"Embeddings: {len(embeddings)}")
    except Exception as e:
        print(f"Embedding error: {e}")
    
    try:
        chat_result = chat_completion(
            [{"role": "user", "content": "What is AI?"}], 
            "openai/gpt-oss-120b"
        )
        print(f"Chat: {chat_result.choices[0].message.content}")
    except Exception as e:
        print(f"Chat error: {e}")

Streaming Support

For streaming responses, use the same pattern but handle the stream appropriately:

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

def streaming_chat(messages: list, provider: str = "hf-inference", model: str = None, proxy_api_key: str = None):
    """Streaming chat completion with token management"""
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    
    # Default models for different providers
    default_models = {
        "hf-inference": "HuggingFaceTB/SmolLM3-3B",
        "cerebras": "openai/gpt-oss-120b",
        "cohere": "CohereLabs/c4ai-command-r-plus",
        "groq": "openai/gpt-oss-120b"
    }
    
    if model is None:
        model = default_models.get(provider, "openai/gpt-oss-120b")
    
    client = InferenceClient(provider=provider, api_key=token)
    
    try:
        # Create streaming completion
        stream = client.chat.completions.create(
            model=model,
            messages=messages,
            stream=True
        )
        
        # Process stream
        full_response = ""
        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
                full_response += content
        
        print()  # New line after streaming
        
        # Report success after stream completes
        report_token_status(token_id, "success", api_key=proxy_api_key)
        
        return full_response
        
    except HfHubHTTPError as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise

# Usage
# You need to get your API key from the admin or create a user account
# See RBAC_README.md for details on user management
proxy_api_key = "your_proxy_api_key_here"  # Get this from admin

messages = [{"role": "user", "content": "Tell me a short story about a robot"}]
providers = ["hf-inference", "cerebras", "groq"]

for provider in providers:
    try:
        print(f"\n--- {provider.upper()} ---")
        response = streaming_chat(messages, provider=provider, proxy_api_key=proxy_api_key)
        break  # Success, no need to try more providers
    except Exception as e:
        print(f"Error with {provider}: {e}")
        continue

Error Handling Best Practices

1. Comprehensive Error Handling

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError, RepositoryNotFoundError, LocalTokenNotFoundError
from hf_token_utils import get_proxy_token, report_token_status

def robust_inference(text: str, model: str, provider: str = "hf-inference", proxy_api_key: str = None):
    """Example with comprehensive error handling"""
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    client = InferenceClient(provider=provider, api_key=token)
    
    try:
        if provider == "hf-inference":
            result = client.feature_extraction(text, model=model)
        else:
            # For other providers, use chat completion
            result = client.chat.completions.create(
                model="openai/gpt-oss-120b",
                messages=[{"role": "user", "content": text}]
            )
        
        report_token_status(token_id, "success", api_key=proxy_api_key)
        return result
        
    except HfHubHTTPError as e:
        error_str = str(e)
        report_token_status(token_id, "error", error_str, api_key=proxy_api_key)
        
        if "401 Client Error" in error_str:
            print("❌ Authentication failed - token may be invalid")
        elif "402 Client Error" in error_str:
            print("❌ Payment required - token credits exceeded")
        elif "429" in error_str:
            print("❌ Rate limited - too many requests")
        elif "500" in error_str:
            print("❌ Server error - try again later")
        else:
            print(f"❌ HTTP error: {error_str}")
        
        raise
        
    except RepositoryNotFoundError as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        print(f"❌ Model not found: {model}")
        raise
        
    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        print(f"❌ Unexpected error: {str(e)}")
        raise

2. Logging Integration

import logging
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def logged_inference(text: str, model: str, provider: str = "hf-inference", proxy_api_key: str = None):
    """Example with proper logging"""
    logger.info(f"Starting inference with provider: {provider}, model: {model}")
    
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    logger.info(f"Obtained token: {token_id}")
    
    client = InferenceClient(provider=provider, api_key=token)
    
    try:
        if provider == "hf-inference":
            result = client.feature_extraction(text, model=model)
        else:
            result = client.chat.completions.create(
                model="openai/gpt-oss-120b",
                messages=[{"role": "user", "content": text}]
            )
        
        # Report success
        report_status = report_token_status(token_id, "success", api_key=proxy_api_key)
        logger.info(f"Inference successful, status reported: {report_status}")
        
        return result
        
    except HfHubHTTPError as e:
        error_str = str(e)
        logger.error(f"HF Hub error with {provider}: {error_str}")
        
        # Report error
        report_status = report_token_status(token_id, "error", error_str, api_key=proxy_api_key)
        logger.info(f"Error status reported: {report_status}")
        
        raise
        
    except Exception as e:
        error_str = str(e)
        logger.error(f"Unexpected error with {provider}: {error_str}")
        
        # Report error
        report_status = report_token_status(token_id, "error", error_str, api_key=proxy_api_key)
        logger.info(f"Error status reported: {report_status}")
        
        raise

Configuration

Environment Variables

You can configure the proxy URL using environment variables:

import os
from typing import Optional

from hf_token_utils import get_proxy_token, report_token_status

# Set proxy URL via environment variable
os.environ["HF_PROXY_URL"] = "http://my-proxy-server:8000"

def get_proxy_token_env(proxy_api_key: str = None) -> tuple[str, str]:
    """Get token using environment variable for proxy URL"""
    proxy_url = os.getenv("HF_PROXY_URL", "http://localhost:8000")
    return get_proxy_token(proxy_url, api_key=proxy_api_key)

def report_token_status_env(token_id: str, status: str = "success", error: str = None, proxy_api_key: str = None, client_name: Optional[str] = None) -> bool:
    """Report status using environment variable for proxy URL"""
    proxy_url = os.getenv("HF_PROXY_URL", "http://localhost:8000")
    return report_token_status(token_id, status, error, proxy_url, api_key=proxy_api_key, client_name=client_name)

Performance Considerations

1. Token Reuse

For multiple operations, consider reusing the same token:

def batch_operations(provider: str = "hf-inference", proxy_api_key: str = None):
    """Reuse token for multiple operations"""
    # Get token once
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    client = InferenceClient(provider=provider, api_key=token)
    
    results = []
    
    try:
        if provider == "hf-inference":
            # Multiple operations with same token
            for text in ["Hello", "World", "AI", "Future"]:
                result = client.feature_extraction(text, model="intfloat/multilingual-e5-large")
                results.append(result)
        else:
            # For other providers, use chat completion
            for text in ["Hello", "World", "AI", "Future"]:
                result = client.chat.completions.create(
                    model="openai/gpt-oss-120b",
                    messages=[{"role": "user", "content": text}]
                )
                results.append(result)
        
        # Report success once for all operations
        report_token_status(token_id, "success", api_key=proxy_api_key)
        
    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise
    
    return results

2. Async Operations

For async operations with asyncio:

import asyncio
from huggingface_hub import AsyncInferenceClient
from hf_token_utils import get_proxy_token, report_token_status

async def async_inference(texts: list[str], provider: str = "hf-inference", model: str = "intfloat/multilingual-e5-large", proxy_api_key: str = None):
    """Async batch inference with token management"""
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    
    async with AsyncInferenceClient(provider=provider, api_key=token) as client:
        try:
            if provider == "hf-inference":
                # Create tasks for all texts
                tasks = [
                    client.feature_extraction(text, model=model) 
                    for text in texts
                ]
            else:
                # For other providers, use chat completion
                tasks = [
                    client.chat.completions.create(
                        model="openai/gpt-oss-120b",
                        messages=[{"role": "user", "content": text}]
                    )
                    for text in texts
                ]
            
            # Execute all tasks concurrently
            results = await asyncio.gather(*tasks)
            
            # Report success
            report_token_status(token_id, "success", api_key=proxy_api_key)
            
            return results
            
        except Exception as e:
            report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
            raise

# Usage
# You need to get your API key from the admin or create a user account
# See RBAC_README.md for details on user management
# proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
# results = asyncio.run(async_inference(["text1", "text2", "text3"], provider="cerebras", proxy_api_key=proxy_api_key))

Provider-Specific Considerations

Model Compatibility

Different providers support different models and capabilities (a small default-model helper follows this list):

  • Cerebras: High-performance inference, supports most open models
  • Cohere: Advanced language models with multilingual support
  • Groq: Ultra-fast inference, optimized for speed
  • Together: Collaborative AI hosting, wide model support
  • Nebius: Cloud-native services with enterprise features
  • HF-Inference: Core API with comprehensive model support
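
The chat and VLM examples above each carry their own provider-to-model mapping. If you call several providers from one code path, it can help to centralize that lookup; the sketch below reuses the default chat models from the examples in this guide (adjust the mapping to the models and providers you actually have access to):

from typing import Optional

# Illustrative defaults, taken from the chat examples earlier in this guide
DEFAULT_CHAT_MODELS = {
    "hf-inference": "HuggingFaceTB/SmolLM3-3B",
    "cerebras": "openai/gpt-oss-120b",
    "cohere": "CohereLabs/c4ai-command-r-plus",
    "groq": "openai/gpt-oss-120b",
    "together": "openai/gpt-oss-120b",
    "nebius": "Qwen/Qwen3-235B-A22B-Instruct-2507",
}

def resolve_model(provider: str, requested: Optional[str] = None) -> str:
    """Return an explicitly requested model, or fall back to the provider default."""
    if requested:
        return requested
    if provider not in DEFAULT_CHAT_MODELS:
        raise ValueError(f"No default model configured for provider '{provider}'")
    return DEFAULT_CHAT_MODELS[provider]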

Rate Limits and Quotas

Each provider has different rate limits and pricing:

  • Monitor usage through HF-Inferoxy’s reporting system
  • Use appropriate retry strategies for rate limit errors (a backoff sketch follows this list)
  • Consider provider-specific error handling patterns
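
Rate-limit (429) errors usually clear on their own, so a short exponential backoff before retrying is often enough. The sketch below builds on the same get_proxy_token/report_token_status pattern used throughout this guide; the backoff parameters are arbitrary and should be tuned per provider:

import time
from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

def chat_with_backoff(messages: list, provider: str, model: str, proxy_api_key: str,
                      max_attempts: int = 3, base_delay: float = 2.0):
    """Retry chat completions on rate-limit errors with exponential backoff."""
    for attempt in range(max_attempts):
        token, token_id = get_proxy_token(api_key=proxy_api_key)
        client = InferenceClient(provider=provider, api_key=token)
        try:
            completion = client.chat.completions.create(model=model, messages=messages)
            report_token_status(token_id, "success", api_key=proxy_api_key)
            return completion
        except HfHubHTTPError as e:
            report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
            # Back off only for rate-limit errors; re-raise everything else immediately
            if "429" in str(e) and attempt < max_attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
                continue
            raise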

Troubleshooting

Common Issues

  1. Connection Refused: Ensure the HF-Inferoxy server is running on the specified URL (a preflight check sketch follows this list)
  2. No Valid Keys: Check that valid HF API keys are added to the proxy server
  3. Import Errors: Ensure huggingface_hub and requests are installed
  4. Token Errors: The proxy will automatically handle and rotate problematic tokens
  5. Provider Not Found: Verify provider name spelling and availability
  6. Model Compatibility: Check if the model is supported by the chosen provider
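
For the first three issues, a quick preflight check can confirm that the proxy is reachable and that your key can provision a token before you make any inference calls. This is a minimal sketch using the /health and /keys/provision endpoints referenced in this guide:

import requests

def preflight_check(proxy_url: str = "http://localhost:8000", proxy_api_key: str = None) -> bool:
    """Return True if the proxy is reachable and the API key can provision a token."""
    try:
        health = requests.get(f"{proxy_url}/health", timeout=5)
    except requests.RequestException as e:
        print(f"Proxy unreachable at {proxy_url}: {e}")
        return False
    if health.status_code != 200:
        print(f"Health check failed: {health.status_code} {health.text}")
        return False

    resp = requests.get(
        f"{proxy_url}/keys/provision",
        headers={"Authorization": f"Bearer {proxy_api_key}"},
        timeout=5,
    )
    if resp.status_code != 200:
        print(f"Token provisioning failed: {resp.status_code} {resp.text}")
        return False
    return True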

Debug Mode

Enable debug logging to see detailed token management:

import logging
logging.basicConfig(level=logging.DEBUG)

# Your code here - will show detailed logs

Health Check

Verify the proxy server is working:

curl http://localhost:8000/health
curl http://localhost:8000/keys/status

Conclusion

The HF-Inferoxy token management system provides seamless integration with huggingface_hub across all supported providers, automatically handling:

  • Token provisioning: Get valid tokens automatically
  • Error handling: Detect and report authentication/credit errors
  • Token rotation: Automatic switching to valid tokens
  • Quarantine management: Intelligent handling of temporary issues
  • Multi-provider support: Unified interface across all providers

This allows you to focus on your AI applications without worrying about token management, rate limits, or credit exhaustion, while leveraging the unique capabilities of each provider.

For detailed provider-specific examples, see the provider examples directory.