Using GPUs in Serverless Architecture

One hurdle we always face while working on AI or other compute-intensive projects that require a GPU in production is deployment. Deployment sounds easy when you don’t think about it deeply: you write a simple Flask server, spin up a GPU machine, run your code and, voila, you have an API. Simple, right?
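
To make the naive approach concrete, here is a minimal sketch of that "just write a Flask server" setup, assuming a PyTorch model. The tiny stand-in model, endpoint name and request shape are placeholders, not a prescription:

```python
# Minimal sketch of the naive "Flask server on a GPU machine" approach.
# The stand-in model, route name and payload shape are placeholders.
import flask
import torch

app = flask.Flask(__name__)

device = "cuda" if torch.cuda.is_available() else "cpu"
# Stand-in model; replace with loading your own trained model.
model = torch.nn.Linear(3, 1).to(device).eval()

@app.route("/predict", methods=["POST"])
def predict():
    payload = flask.request.get_json()
    inputs = torch.tensor(payload["inputs"], dtype=torch.float32).to(device)
    with torch.no_grad():
        outputs = model(inputs)
    return flask.jsonify({"outputs": outputs.cpu().tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

The catch is not the code, it is the machine underneath it: that GPU instance keeps billing whether requests arrive or not.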

If you’re a startup, you might not yet have many active users for GPU-based inference services. In that case, you can safely set scaling aside for a while and focus on cost. On average, a GPU-based cloud server will cost you somewhere between 2,000 and 10,000 USD per month, and it rarely makes sense for an early-stage startup to spend that much on a single server.

The solution to this cost problem is, you guessed it, a serverless architecture. In simple words, instead of renting one complete GPU machine, you build your architecture so that you pay only for the time the GPU actually spends processing.
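
A quick back-of-envelope comparison shows why this matters. The per-second rate, inference time and traffic figures below are illustrative assumptions, not quotes from any provider:

```python
# Back-of-envelope cost comparison; all rates and traffic figures are
# illustrative assumptions, not actual provider pricing.
DEDICATED_GPU_PER_MONTH = 2000.0       # assumed low end of a dedicated GPU server
SERVERLESS_GPU_PER_SECOND = 0.0018     # assumed per-second rate for a GPU-backed call
SECONDS_PER_REQUEST = 0.5              # assumed inference time per request
REQUESTS_PER_MONTH = 100_000           # assumed monthly traffic for an early-stage product

serverless_cost = SERVERLESS_GPU_PER_SECOND * SECONDS_PER_REQUEST * REQUESTS_PER_MONTH
print(f"Dedicated:  ${DEDICATED_GPU_PER_MONTH:,.2f}/month")
print(f"Serverless: ${serverless_cost:,.2f}/month")  # ~$90/month under these assumptions
```

Under these assumed numbers the pay-per-use model is more than an order of magnitude cheaper; the gap obviously shrinks as traffic grows.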

Now we know what we want to do. 

If you search the three biggest clouds, there is a gamut of services that can help you build a serverless architecture. At a high level, all of them fall into two classes:

  1. Function as a Service (FaaS): AWS has Lambda, Google has Cloud Functions, Azure has Function Apps (see the handler sketch after this list).
  2. Serverless containers: Google Cloud Run, AWS Fargate, Azure Container Instances.
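
As a concrete FaaS example, an AWS Lambda handler in Python is just a function that receives an event and returns a response. The `run_model` call below is a hypothetical placeholder, and note that this is exactly the style of service that gives you no GPU:

```python
# Shape of an AWS Lambda (FaaS) handler in Python.
# run_model() is a hypothetical, CPU-only placeholder; Lambda offers no GPU.
import json

def handler(event, context):
    body = json.loads(event.get("body", "{}"))
    inputs = body.get("inputs", [])
    outputs = run_model(inputs)  # hypothetical inference call
    return {
        "statusCode": 200,
        "body": json.dumps({"outputs": outputs}),
    }
```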

The disappointment comes when you realize that none of the above services provides native GPU support. However, Google Cloud and AWS have some good news for us: Google with AI Platform Prediction and AWS with SageMaker offer GPU-backed inference for deep learning models.
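
For SageMaker, deployment typically looks like the sketch below, using the SageMaker Python SDK with a PyTorch model. The S3 path, IAM role, framework versions and instance type are placeholders you would swap for your own; this is an illustration of the pattern, not a drop-in script:

```python
# Sketch: deploying a PyTorch model to a GPU-backed SageMaker endpoint
# with the SageMaker Python SDK. Paths, role ARN, versions and instance
# type are placeholders.
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://your-bucket/model.tar.gz",            # placeholder artifact location
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder IAM role
    entry_point="inference.py",                            # your inference handler script
    framework_version="1.13",                              # placeholder versions
    py_version="py39",
)

# ml.g4dn.xlarge is a GPU instance type; at least one instance stays running,
# which is exactly the cost caveat discussed below.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)

result = predictor.predict({"inputs": [[0.1, 0.2, 0.3]]})
```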

But even these services have their own shortcomings:

SageMaker: at least one instance must always be running. For use-cases with periodic or intermittent loads, this is a waste of resources.

Google AI Platform, on the other hand, currently allows the use of GPUs with the TensorFlow SavedModel format. Google does allow auto-scaling to zero, but you are charged for a minimum of 10 minutes of compute time even if the request takes only a fraction of a second.
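
The workflow there is roughly: export a SavedModel, create a model version with a GPU-enabled machine type (via the gcloud CLI or console), and then send online prediction requests. The sketch below assumes TensorFlow 2.x and the googleapiclient library; the project, model, version names and GCS path are placeholders:

```python
# Sketch: export a TensorFlow SavedModel and query an AI Platform Prediction
# version from Python. Project/model/version names and the GCS path are
# placeholders; the version itself must be created with a GPU machine type
# via gcloud or the console.
import tensorflow as tf
from googleapiclient import discovery

# 1) Export your trained model in the SavedModel format AI Platform expects.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])  # stand-in model
tf.saved_model.save(model, "gs://your-bucket/exported_model/1")            # placeholder path

# 2) Call the deployed version for online prediction.
service = discovery.build("ml", "v1")
name = "projects/your-project/models/your_model/versions/v1"  # placeholder resource name
response = (
    service.projects()
    .predict(name=name, body={"instances": [[0.1, 0.2, 0.3]]})
    .execute()
)
print(response.get("predictions"))
```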

Beyond the big three, searching for ‘serverless GPUs’ brings you to companies like Nuclio or Algorithmia, which tout themselves as truly serverless platforms supporting GPUs for optimized utilization and sharing.

Algorithmia looks like a good option: you pay only for actual compute time, which suits small and intermittent loads well. It does suffer from cold starts, although these can be handled with dummy calls that keep your GPU warm (at some extra cost).
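
One simple way to do that keep-warm trick is a scheduled dummy ping. The sketch below uses plain `requests` on a timer; the endpoint URL, auth header and interval are assumptions you would tune against your own cold-start behaviour and cost tolerance:

```python
# Rough keep-warm sketch: periodically send a tiny dummy request so the
# GPU-backed algorithm stays loaded. Endpoint URL, auth header and interval
# are assumptions; each ping is billed, so this trades cost for latency.
import time
import requests

ENDPOINT = "https://api.example.com/v1/your-algorithm"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                                # placeholder credential
KEEP_WARM_INTERVAL_SECONDS = 300                        # assumed 5-minute cadence

def ping():
    resp = requests.post(
        ENDPOINT,
        json={"inputs": [0.0]},                  # minimal dummy payload
        headers={"Authorization": API_KEY},      # placeholder auth scheme
        timeout=30,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    while True:
        ping()
        time.sleep(KEEP_WARM_INTERVAL_SECONDS)
```

In practice you would run this from a cron job or a cheap scheduled function rather than a long-lived loop.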

Since developments in cloud computing are accelerating, we have no doubt more such services will appear down the line. For the moment, SageMaker, AI Platform and Algorithmia look like the best options to choose from.

If you’re building an AI product and have questions about the scalability, performance or optimization of your architecture, feel free to drop us an email at info@aidetic.in.
