Quick Start: HuggingFace Hub + HF-Inferoxy

Get started with managed HuggingFace API tokens in under 5 minutes.

Setup

  1. Install dependencies:
    uv add huggingface-hub requests
    
  2. Create token utility file:

    Save this as hf_token_utils.py:

    import requests
    from typing import Optional, Tuple

    def get_proxy_token(proxy_url: str = "http://localhost:8000", api_key: Optional[str] = None) -> Tuple[str, str]:
        """Provision a managed HuggingFace token from the proxy. Returns (token, token_id)."""
        headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
        response = requests.get(f"{proxy_url}/keys/provision", headers=headers)
        if response.status_code != 200:
            raise RuntimeError(f"Failed to get token: {response.text}")

        data = response.json()
        return data["token"], data["token_id"]

    def report_token_status(
        token_id: str,
        status: str = "success",
        error: Optional[str] = None,
        proxy_url: str = "http://localhost:8000",
        api_key: Optional[str] = None,
        client_name: Optional[str] = None,
    ) -> bool:
        """Report how a provisioned token fared so the proxy can rotate out failing keys."""
        payload = {"token_id": token_id, "status": status}
        if error:
            payload["error"] = error
            # Classify common HuggingFace errors so the proxy can act on them
            if "401 Client Error" in error:
                payload["error_type"] = "invalid_credentials"
            elif "402 Client Error" in error and "exceeded your monthly included credits" in error:
                payload["error_type"] = "credits_exceeded"
        if client_name:
            payload["client_name"] = client_name

        headers = {"Content-Type": "application/json"}
        if api_key:
            headers["Authorization"] = f"Bearer {api_key}"

        try:
            response = requests.post(f"{proxy_url}/keys/report", json=payload, headers=headers)
            return response.status_code == 200
        except requests.RequestException:
            return False
    

⚠️ Important: Authentication Required

All client operations now require authentication with the HF-Inferoxy server. This is part of the Role-Based Access Control (RBAC) system.

Getting Your API Key

  1. Default Admin User: The system creates a default admin user on first run. Check your server logs or the users.json file for the default admin credentials.

  2. Create a User Account: Use the admin account to create a regular user account:
    curl -X POST "http://localhost:8000/admin/users" \
      -H "Authorization: Bearer ADMIN_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"username": "youruser", "email": "user@example.com", "full_name": "Your Name", "role": "user"}'
    
  3. Use the Generated API Key: The response will include an API key that you’ll use in all client operations.

For detailed RBAC setup, see RBAC_README.md.
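
To confirm your key works end to end, provision a token with it right away. A quick sanity check using the hf_token_utils.py helper from the Setup section (substitute your actual key):

from hf_token_utils import get_proxy_token

# Hits the same /keys/provision endpoint used by every example below
token, token_id = get_proxy_token(api_key="YOUR_API_KEY")
print(f"Provisioned token id: {token_id}")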

Basic Usage

End-user tracking (optional)

Include an end-user identifier in reports via client_name. If omitted, the server defaults to the authenticated username.

end_user = "customer_123"  # from your app's user context
token, token_id = get_proxy_token(api_key=proxy_api_key)

# on success
report_token_status(token_id, "success", api_key=proxy_api_key, client_name=end_user)

# on error (inside an except block, where e is the caught exception)
report_token_status(token_id, "error", str(e), api_key=proxy_api_key, client_name=end_user)

Chat Completion

With HF-Inference (Default)

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

# Get managed token (requires authentication)
proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
token, token_id = get_proxy_token(api_key=proxy_api_key)

# Create client
client = InferenceClient(provider="hf-inference", api_key=token)

try:
    # Make request
    completion = client.chat.completions.create(
        model="HuggingFaceTB/SmolLM3-3B",
        messages=[{"role": "user", "content": "Hello!"}]
    )
    
    # Report success
    report_token_status(token_id, api_key=proxy_api_key)
    
    print(completion.choices[0].message.content)
    
except HfHubHTTPError as e:
    # Report error
    report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
    raise

With Other Providers

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

# Get managed token (requires authentication)
proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
token, token_id = get_proxy_token(api_key=proxy_api_key)

# Choose your provider
providers = {
    "cerebras": "openai/gpt-oss-120b",
    "cohere": "CohereLabs/c4ai-command-r-plus", 
    "groq": "openai/gpt-oss-120b",
    "together": "openai/gpt-oss-120b"
}

# Example with Cerebras
client = InferenceClient(provider="cerebras", api_key=token)

try:
    completion = client.chat.completions.create(
        model=providers["cerebras"],
        messages=[{"role": "user", "content": "Hello!"}]
    )
    
    report_token_status(token_id, api_key=proxy_api_key)
    print(completion.choices[0].message.content)
    
except HfHubHTTPError as e:
    report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
    raise

Feature Extraction

from huggingface_hub import InferenceClient
from hf_token_utils import get_proxy_token, report_token_status

# Get managed token (requires authentication)
proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
token, token_id = get_proxy_token(api_key=proxy_api_key)
client = InferenceClient(provider="hf-inference", api_key=token)

try:
    # Extract features
    embeddings = client.feature_extraction(
        "Hello world", 
        model="intfloat/multilingual-e5-large"
    )
    
    # Report success
    report_token_status(token_id, api_key=proxy_api_key)
    
    print(f"Got embeddings with shape: {len(embeddings)}")
    
except Exception as e:
    report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
    raise

Vision-Language Models (VLM)

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

# Get managed token (requires authentication)
proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
token, token_id = get_proxy_token(api_key=proxy_api_key)

# Use a VLM-capable provider
client = InferenceClient(provider="cerebras", api_key=token)

try:
    completion = client.chat.completions.create(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {
                        "type": "image_url",
                        "image_url": {"url": "https://example.com/image.jpg"}
                    }
                ]
            }
        ]
    )
    
    report_token_status(token_id, api_key=proxy_api_key)
    print(completion.choices[0].message.content)
    
except HfHubHTTPError as e:
    report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
    raise

Image Generation

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

# Get managed token (requires authentication)
proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
token, token_id = get_proxy_token(api_key=proxy_api_key)

# Use an image generation provider
client = InferenceClient(provider="fal-ai", api_key=token)

try:
    # Generate image
    image = client.text_to_image(
        "Astronaut riding a horse",
        model="Qwen/Qwen-Image"
    )
    
    # Report success
    report_token_status(token_id, api_key=proxy_api_key)
    
    # Save the image
    image.save("astronaut_horse.png")
    print("Image generated successfully!")
    
except HfHubHTTPError as e:
    report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
    raise

With Auto-Retry

from typing import Optional

from huggingface_hub import InferenceClient
from huggingface_hub.errors import HfHubHTTPError
from hf_token_utils import get_proxy_token, report_token_status

def chat_with_retry(message: str, provider: str = "hf-inference", max_retries: int = 2, proxy_api_key: Optional[str] = None):
    for attempt in range(max_retries + 1):
        # Get fresh token for each attempt
        token, token_id = get_proxy_token(api_key=proxy_api_key)
        client = InferenceClient(provider=provider, api_key=token)
        
        try:
            # Choose model based on provider
            model = "HuggingFaceTB/SmolLM3-3B" if provider == "hf-inference" else "openai/gpt-oss-120b"
            
            completion = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": message}]
            )
            
            # Report success
            report_token_status(token_id, api_key=proxy_api_key)
            return completion.choices[0].message.content
            
        except HfHubHTTPError as e:
            error_str = str(e)
            report_token_status(token_id, "error", error_str, api_key=proxy_api_key)
            
            # Retry on auth errors, otherwise raise
            if attempt < max_retries and ("401 Client Error" in error_str or "402 Client Error" in error_str):
                print(f"Token error on attempt {attempt + 1}, retrying...")
                continue
            else:
                raise
                
        except Exception as e:
            report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
            raise
    
    raise Exception("All retries failed")

# Usage with different providers
# You need to get your API key from the admin or create a user account
proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
response = chat_with_retry("What is machine learning?", provider="cerebras", proxy_api_key=proxy_api_key)
print(response)

Context Manager Pattern

from contextlib import contextmanager
from typing import Optional

from huggingface_hub import InferenceClient
from hf_token_utils import get_proxy_token, report_token_status

@contextmanager
def managed_client(provider: str = "hf-inference", proxy_api_key: Optional[str] = None):
    token, token_id = get_proxy_token(api_key=proxy_api_key)
    client = InferenceClient(provider=provider, api_key=token)
    
    try:
        yield client
        report_token_status(token_id, api_key=proxy_api_key)  # Success
    except Exception as e:
        report_token_status(token_id, "error", str(e), api_key=proxy_api_key)
        raise

# Usage with different providers
# You need to get your API key from the admin or create a user account
proxy_api_key = "your_proxy_api_key_here"  # Get this from admin

with managed_client("cerebras", proxy_api_key) as client:
    result = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(result.choices[0].message.content)

with managed_client("hf-inference", proxy_api_key) as client:
    embeddings = client.feature_extraction("Hello", model="intfloat/multilingual-e5-large")
    print(f"Embedding length: {len(embeddings)}")

Available Providers

HF-Inferoxy supports these providers with comprehensive examples:

| Provider | Capabilities | Example File |
|----------|--------------|--------------|
| Cerebras | LLM, VLM, Text Generation | cerebras.md |
| Cohere | LLM, VLM, Advanced Language Models | cohere.md |
| Fal-AI | Speech, Image, Video Generation | fal-ai.md |
| Featherless AI | LLM, VLM, Text Generation | featherless-ai.md |
| Fireworks AI | LLM, VLM, Fast Inference | fireworks-ai.md |
| Groq | LLM, VLM, Ultra-fast Inference | groq.md |
| HF-Inference | All Tasks, Core API | hf-inference.md |
| Hyperbolic | LLM, VLM, Vision AI | hyperbolic.md |
| Nebius | LLM, VLM, Cloud AI Services | nebius.md |
| Novita | LLM, VLM, Video Generation | novita.md |
| NScale | LLM, VLM, Image Generation | nscale.md |
| Replicate | Image, Video Generation | replicate.md |
| SambaNova | LLM, VLM, Enterprise AI | sambanova.md |
| Together | LLM, VLM, Collaborative AI | together.md |
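
Since every provider goes through the same managed token, switching providers only means changing the provider name and the model ID. As a minimal sketch, the loop below sweeps two chat-capable providers with the managed_client helper from the Context Manager Pattern section (model IDs taken from the examples above; actual availability depends on the keys loaded into your proxy):

# Assumes managed_client and proxy_api_key are defined as in the
# Context Manager Pattern section
for provider, model in [
    ("groq", "openai/gpt-oss-120b"),
    ("together", "openai/gpt-oss-120b"),
]:
    with managed_client(provider, proxy_api_key) as client:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Say hello in five words."}],
        )
        print(f"{provider}: {reply.choices[0].message.content}")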

Testing Your Setup

Create test_setup.py:

#!/usr/bin/env python3
from hf_token_utils import get_proxy_token, report_token_status

def test_connection(proxy_api_key: str):
    try:
        # Test token provisioning
        token, token_id = get_proxy_token(api_key=proxy_api_key)
        print(f"✅ Got token: {token_id}")
        
        # Test status reporting
        success = report_token_status(token_id, "success", api_key=proxy_api_key)
        print(f"✅ Reported status: {success}")
        
        return True
    except Exception as e:
        print(f"❌ Error: {e}")
        return False

if __name__ == "__main__":
    print("Testing HF-Inferoxy connection...")
    # You need to get your API key from the admin or create a user account
    proxy_api_key = "your_proxy_api_key_here"  # Get this from admin
    test_connection(proxy_api_key)

Run with: uv run test_setup.py

Common Issues

  • Connection refused: Ensure the HF-Inferoxy server is running (see the probe sketch after this list)
  • No valid keys: Add API keys to the proxy server first
  • Import errors: Install dependencies with uv add huggingface-hub requests
  • Provider not found: Check provider name spelling and availability
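
To tell a connectivity problem apart from an authentication problem, you can probe the provisioning endpoint directly. A minimal sketch, assuming the default localhost URL: a connection error means the server is not reachable, while a 4xx response usually points at your proxy API key.

import requests

try:
    resp = requests.get(
        "http://localhost:8000/keys/provision",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        timeout=5,
    )
    print(f"Server reachable, status: {resp.status_code}")
except requests.ConnectionError:
    print("Connection refused: is the HF-Inferoxy server running on port 8000?")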

API Reference

  • get_proxy_token(proxy_url, api_key) → (token, token_id)
  • report_token_status(token_id, status, error, proxy_url, api_key, client_name) → bool

That’s it! You’re now using intelligent token management with HuggingFace Hub across all supported providers.