Flex workers: On-demand workers that scale to zero when not in use, so you only pay when processing requests. Flex workers are ideal for variable workloads, non-time-sensitive applications, and maximizing cost efficiency for sporadic usage.
Active workers: Always-on workers that run 24/7. Active workers receive a 20-30% discount compared to flex workers, but you are charged continuously regardless of usage. Use active workers for consistent workloads, latency-sensitive applications, and high-volume processing.
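The trade-off between the two worker types can be estimated with a rough break-even calculation. The sketch below is a simplified model, not an official formula: it assumes a flex worker is billed only while busy, that an active worker is billed for every second at the discounted rate, and it ignores worker start and idle timeout charges. The function name is hypothetical.

```python
def break_even_utilization(active_discount: float) -> float:
    """Busy fraction above which an always-on worker costs less than an on-demand one.

    Simplified model (an assumption, not an official formula):
      flex cost   = flex_rate * busy_seconds
      active cost = flex_rate * (1 - active_discount) * total_seconds
    Setting the two equal gives busy_seconds / total_seconds = 1 - active_discount.
    """
    if not 0.0 <= active_discount < 1.0:
        raise ValueError("active_discount must be in [0, 1)")
    return 1.0 - active_discount


# The 20-30% discount range quoted above.
for discount in (0.20, 0.30):
    busy = break_even_utilization(discount)
    print(f"{discount:.0%} discount: active is cheaper above {busy:.0%} utilization")
```

Under these assumptions, an active worker only pays off when the worker would otherwise be busy roughly 70-80% of the time.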
Serverless billing operates on a precise pay-as-you-go model with specific timing mechanisms. Billing starts when the system signals a worker to wake up and ends when the worker is fully stopped. Runpod Serverless is charged by the second, with partial seconds rounded up to the next full second. For example, if your request takes 2.3 seconds to complete, you’ll be billed for 3 seconds.
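The per-second rounding rule can be sketched as follows. The function names and the per-second price in the usage line are hypothetical placeholders; substitute the rate for your GPU type from the pricing table.

```python
import math


def billed_seconds(duration_seconds: float) -> int:
    """Round a billed duration up to the next full second."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return math.ceil(duration_seconds)


def compute_cost(duration_seconds: float, price_per_second: float) -> float:
    """Compute cost for one billed interval at a given per-second GPU rate."""
    return billed_seconds(duration_seconds) * price_per_second


# The 2.3-second example from the text, at a made-up rate of $0.0004 per second.
print(billed_seconds(2.3))
print(f"${compute_cost(2.3, 0.0004):.4f}")
```

Note that the duration passed in should cover the whole billed interval (from the wake-up signal until the worker is fully stopped), not only the time spent inside your handler.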
Your total Serverless costs include both compute time (GPU usage) and temporary storage:
Compute costs: Charged per second based on the GPU type as shown in the pricing table above.
Storage costs: The worker container disk incurs charges only while workers are running, calculated in 5-minute intervals. Even if your worker runs for less than 5 minutes, you’ll be charged for the full 5-minute period. The storage cost is $0.000011574 per GB per 5 minutes (equivalent to approximately $0.10 per GB per month).
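As a rough illustration of the 5-minute interval rule, the sketch below estimates container disk charges for a single worker run. The rate comes from the text above; the function name is hypothetical, and the sketch assumes each run is rounded up to whole intervals independently, which the text does not specify.

```python
import math

# Rate quoted above: dollars per GB per 5-minute interval.
STORAGE_RATE_PER_GB_PER_INTERVAL = 0.000011574
INTERVAL_SECONDS = 5 * 60


def container_disk_cost(disk_gb: float, running_seconds: float) -> float:
    """Estimate container disk charges for one worker run.

    Assumption: a run shorter than a full interval is billed as one interval,
    and longer runs are rounded up to the next whole interval.
    """
    if disk_gb < 0 or running_seconds < 0:
        raise ValueError("disk_gb and running_seconds must be non-negative")
    intervals = math.ceil(running_seconds / INTERVAL_SECONDS)
    return disk_gb * intervals * STORAGE_RATE_PER_GB_PER_INTERVAL


# A 20 GB container disk on a worker that runs for 90 seconds is billed one interval.
print(f"${container_disk_cost(20, 90):.8f}")

# Sanity check of the monthly equivalent: 1 GB running for a 30-day month.
print(f"${container_disk_cost(1, 30 * 24 * 3600):.4f}")
```

The second call reproduces the approximate $0.10 per GB per month figure, since a 30-day month contains 8,640 five-minute intervals.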
If you have many workers continuously running with high storage costs, you can utilize network volumes to reduce expenses. Network volumes allow you to share data efficiently across multiple workers, reduce per-worker storage requirements by centralizing common files, and maintain persistent storage separate from worker lifecycles. Network volumes are billed hourly at a rate of $0.07 per GB per month for the first 1TB, and $0.05 per GB per month for additional storage beyond that.
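The tiered network volume rate can be estimated with a short calculation. This is a sketch under two assumptions that the text does not confirm: 1TB is treated as 1,000 GB, and the lower rate applies only to the portion above the first 1TB. The function name is hypothetical.

```python
FIRST_TIER_LIMIT_GB = 1000   # assumption: 1TB counted as 1,000 GB
FIRST_TIER_RATE = 0.07       # dollars per GB per month, from the text
ADDITIONAL_RATE = 0.05       # dollars per GB per month beyond the first tier


def network_volume_monthly_cost(size_gb: float) -> float:
    """Estimate the monthly cost of a network volume with marginal tiering."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    first_tier_gb = min(size_gb, FIRST_TIER_LIMIT_GB)
    additional_gb = max(size_gb - FIRST_TIER_LIMIT_GB, 0)
    return first_tier_gb * FIRST_TIER_RATE + additional_gb * ADDITIONAL_RATE


print(f"500 GB:   ${network_volume_monthly_cost(500):.2f} per month")
print(f"1,500 GB: ${network_volume_monthly_cost(1500):.2f} per month")
```

Because network volumes are billed hourly, the actual charge for a partial month is prorated from this monthly figure.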
A worker start occurs when a worker is initialized from a scaled-down state. This typically involves starting the container, loading models into GPU memory, and initializing runtime environments. Worker start time varies based on model size and complexity. Larger models take longer to load into GPU memory. To optimize worker start times, you can use FlashBoot (included at no extra charge) or configure your endpoint settings.
Execution time is the time your worker spends processing a request. It depends on the complexity of your workload, the size of input data, and the performance of the GPUs you’ve selected. Set reasonable execution timeout limits to prevent runaway jobs from consuming excessive resources, and optimize your code to reduce processing time where possible.
After completing a request, workers remain active for a specified period before scaling down. This reduces response times for subsequent requests but incurs additional charges. The default idle timeout is 5 seconds, but you can configure this in your endpoint settings.
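Putting the three phases together, the billed time for a single request served by a cold worker can be approximated as start time plus execution time plus idle timeout. The sketch below is an estimate only: the function name is hypothetical, and it assumes no further requests arrive during the idle window and that shutdown itself takes negligible time.

```python
import math

DEFAULT_IDLE_TIMEOUT_SECONDS = 5.0  # default idle timeout from the text


def billed_seconds_for_cold_request(
    start_seconds: float,
    execution_seconds: float,
    idle_timeout_seconds: float = DEFAULT_IDLE_TIMEOUT_SECONDS,
) -> int:
    """Estimate billed seconds for one request that wakes a scaled-down worker.

    Assumption: the worker is billed from the wake-up signal until it stops,
    which here is approximated as start + execution + idle timeout.
    """
    total = start_seconds + execution_seconds + idle_timeout_seconds
    if total < 0:
        raise ValueError("total duration must be non-negative")
    return math.ceil(total)  # partial seconds round up to the next full second


# An 8-second worker start, a 2.3-second request, and the default idle timeout.
print(billed_seconds_for_cold_request(8.0, 2.3))
```

A longer idle timeout raises this figure for isolated requests, but it can lower the overall cost of bursty traffic by avoiding repeated worker starts.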