Hosting Hugging Face Transformers on Serverless GPUs
Hugging Face Transformers has become the go-to library for natural language processing, offering thousands of pre-trained models for tasks like text classification, question answering, and text generation. However, deploying these models in production, especially for inference at scale, can be challenging because of their computational requirements. Serverless GPU platforms provide a practical solution, offering a strong balance of performance, scalability, and cost-efficiency.
Note: This guide focuses on deploying Hugging Face models for inference. For fine-tuning models on serverless GPUs, check out our Fine-Tuning Guide.
Why Serverless GPUs for Hugging Face Models?
Serverless GPU platforms offer several advantages for hosting Hugging Face models:
Cost-Effective Scaling
Pay only for the inference time you use, with automatic scaling to handle traffic spikes without over-provisioning resources.
Reduced Latency
Global distribution options bring your models closer to end-users, reducing inference latency.
Simplified Operations
No need to manage Kubernetes clusters or GPU drivers – focus on your models and applications.
Top Platforms for Serverless Hugging Face Deployment
| Platform | Cold Start | Max Memory | GPU Options | Pricing |
|---|---|---|---|---|
| AWS Lambda with EFS | 1-3s (warm) | 10GB | NVIDIA T4 | Per request + GB-sec |
| Google Cloud Run | 100ms-1s (warm) | 8GB | NVIDIA T4, L4 | vCPU-sec + memory |
| Hugging Face Inference API | ~500ms | N/A | Various | Per token/request |
| Banana.dev | 2-10s (cold) | 40GB | A100, A10G | Per second |
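Of these options, the Hugging Face Inference API requires the least setup, since Hugging Face hosts the model for you. A minimal call might look like the sketch below; the model ID matches the example used later in this guide, and the token environment variable name is illustrative.
import os
import requests

# Hosted Inference API endpoint for a public sentiment-analysis model
API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}  # your Hugging Face access token

response = requests.post(API_URL, headers=headers, json={"inputs": "Serverless inference is great!"})
print(response.json())  # e.g. [[{"label": "POSITIVE", "score": ...}, ...]]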
Deploying with AWS Lambda and EFS
Let’s walk through deploying a Hugging Face model using AWS Lambda with EFS for model storage, which is cost-effective for models up to 10GB.
1. Set Up EFS for Model Storage
First, create an EFS filesystem and mount target in your VPC:
# Create EFS filesystem
aws efs create-file-system \
  --creation-token huggingface-models \
  --performance-mode generalPurpose \
  --tags Key=Name,Value=huggingface-models

# Create a mount target in each subnet
aws efs create-mount-target \
  --file-system-id fs-12345678 \
  --subnet-id subnet-12345678 \
  --security-groups sg-12345678

# Create an access point for Lambda
aws efs create-access-point \
  --file-system-id fs-12345678 \
  --posix-user Uid=1000,Gid=1000 \
  --root-directory "Path=/models,CreationInfo={OwnerUid=1000,OwnerGid=1000,Permissions=755}"
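The Lambda function will expect the model files to already exist on EFS. One straightforward way to populate the filesystem is to mount it on a machine inside the VPC (for example, an EC2 instance) and download the model with huggingface_hub — a rough sketch, assuming the filesystem is mounted at /mnt/efs:
# Run on a machine that has the EFS filesystem mounted at /mnt/efs
# (requires: pip install huggingface_hub)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="distilbert-base-uncased-finetuned-sst-2-english",
    local_dir="/mnt/efs/models/distilbert-base-uncased-finetuned-sst-2-english",
)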
2. Create a Lambda Layer with Dependencies
Create a directory structure and install the required packages:
mkdir -p python/lib/python3.9/site-packages
cd python/lib/python3.9/site-packages

# Install dependencies into the directory
pip install \
  torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117 \
  transformers==4.30.0 \
  sentencepiece==0.1.99 \
  -t .

# Package the layer: the zip must contain the python/ directory at its root
cd ../../../..
zip -r hf-transformers-layer.zip python
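With the archive built, publish it as a layer. Because PyTorch wheels are large, the zip will usually exceed the CLI's direct-upload limit, so uploading it to S3 first is the safer path; if the unzipped layer exceeds Lambda's size limit, a slimmer CPU-only PyTorch build may be needed. The bucket and layer names below are placeholders.
# Upload the archive to S3, then publish it as a layer
aws s3 cp hf-transformers-layer.zip s3://my-artifacts-bucket/hf-transformers-layer.zip

aws lambda publish-layer-version \
  --layer-name hf-transformers \
  --content S3Bucket=my-artifacts-bucket,S3Key=hf-transformers-layer.zip \
  --compatible-runtimes python3.9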
3. Deploy the Lambda Function
Create a Lambda function that loads the model from EFS:
import os
import json
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Initialize the model and tokenizer outside the handler so warm starts reuse them
MODEL_PATH = "/mnt/efs/models/distilbert-base-uncased-finetuned-sst-2-english"

try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
    model.eval()
    print("Model loaded successfully")
except Exception as e:
    print(f"Error loading model: {str(e)}")
    raise

def lambda_handler(event, context):
    try:
        # Get input text from the event
        text = event.get('text', '')
        if not text:
            return {
                'statusCode': 400,
                'body': json.dumps({'error': 'No text provided'})
            }

        # Tokenize and predict
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)

        # Convert logits to class probabilities
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        confidence, predicted_class = torch.max(predictions, dim=1)

        return {
            'statusCode': 200,
            'body': json.dumps({
                'prediction': model.config.id2label[predicted_class.item()],
                'confidence': confidence.item(),
                'model': model.config._name_or_path
            })
        }
    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps({'error': str(e)})
        }
Important: Make sure your Lambda function has sufficient memory (at least 3GB) and timeout settings appropriate for your model size.
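Putting the pieces together, the function can be created with the layer attached and the EFS access point mounted at /mnt/efs. The sketch below assumes the handler above is saved as handler.py and zipped into function.zip; the role ARN, layer ARN, and resource IDs are placeholders.
aws lambda create-function \
  --function-name hf-sentiment \
  --runtime python3.9 \
  --handler handler.lambda_handler \
  --role arn:aws:iam::123456789012:role/hf-lambda-role \
  --zip-file fileb://function.zip \
  --layers arn:aws:lambda:us-east-1:123456789012:layer:hf-transformers:1 \
  --memory-size 3008 \
  --timeout 120 \
  --vpc-config SubnetIds=subnet-12345678,SecurityGroupIds=sg-12345678 \
  --file-system-configs Arn=arn:aws:elasticfilesystem:us-east-1:123456789012:access-point/fsap-0123456789abcdef0,LocalMountPath=/mnt/efs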
Optimizing Performance
To get the best performance from your serverless Hugging Face deployment:
1. Model Optimization
- Use the optimum library to optimize models for inference
- Quantize models to FP16 or INT8 for faster inference (a quantization sketch follows the export example below)
- Consider using smaller, distilled models when possible
# Optimize a model with Optimum
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# Convert to ONNX format
ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_id,
    export=True,
    provider="CUDAExecutionProvider"
)
# Save optimized model
ort_model.save_pretrained("./optimized_model")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("./optimized_model")
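For the INT8 option mentioned above, optimum also provides a quantizer. Here is a rough sketch of dynamic quantization applied to the exported model; the avx512_vnni configuration is illustrative and should match your target CPU.
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Dynamic INT8 quantization of the ONNX model exported above
quantizer = ORTQuantizer.from_pretrained("./optimized_model")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./quantized_model", quantization_config=qconfig)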
2. Cold Start Mitigation
- Use provisioned concurrency to keep functions warm (see the sketch after this list)
- Implement a warming strategy with scheduled events
- Consider using larger memory sizes for faster model loading
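For the provisioned-concurrency route, a couple of always-warm execution environments can be configured on a published version or alias of the function — a sketch with placeholder names:
# Keep two execution environments initialized at all times ("live" is a placeholder alias)
aws lambda put-provisioned-concurrency-config \
  --function-name hf-sentiment \
  --qualifier live \
  --provisioned-concurrent-executions 2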
Monitoring and Scaling
Monitor your serverless deployment with these key metrics; a sample alarm on the error count follows the list:
- Invocation Count: Number of times your function is invoked
- Duration: Execution time of each invocation
- Concurrent Executions: Ensure you’re not hitting account limits
- Errors: Monitor for model loading or inference errors
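These are all standard CloudWatch metrics in the AWS/Lambda namespace. For example, the Errors metric can be wired to an alerting topic with a CloudWatch alarm — a sketch with a placeholder SNS topic ARN:
aws cloudwatch put-metric-alarm \
  --alarm-name hf-sentiment-errors \
  --namespace AWS/Lambda \
  --metric-name Errors \
  --dimensions Name=FunctionName,Value=hf-sentiment \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts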
Cost Considerations
Serverless GPU pricing can vary significantly based on:
- Model size and complexity
- Request volume and concurrency
- Inference time per request
- Data transfer costs
For high-traffic applications, consider a hybrid approach with dedicated GPU instances for baseline traffic and serverless for handling spikes.
Conclusion
Hosting Hugging Face Transformers on serverless GPU platforms provides an excellent balance of performance, scalability, and cost-efficiency for many use cases. By following the patterns and optimizations outlined in this guide, you can deploy production-ready NLP models that scale with your application’s needs.
Remember to continuously monitor your deployment and adjust configurations based on actual usage patterns to optimize both performance and cost.