In terms of cold starts, we seem to be very comparable to Modal, based on what users have mentioned and tests we have run.
Easier config/setup is feedback we have gotten from users, since we don't have any special syntax or a "Cerebrium way" of doing things, which makes migration pretty easy and doesn't lock you in - something some engineers appreciate. We just run your Python code as is, with an extra .toml setup file.
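To give a rough idea of what that setup file looks like, here's a simplified sketch (the field names here are illustrative - the real schema is in our config-file docs):

    # cerebrium.toml - simplified, illustrative example; see our docs for the actual schema
    [cerebrium.deployment]
    name = "my-llm-app"
    python_version = "3.11"

    [cerebrium.hardware]
    gpu = "A100"
    cpu = 4
    memory = 16.0

Everything else is just your normal Python code.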
Additionally, we offer AWS Inferentia/Trainium nodes, which offer a great price/performance trade-off for many open-source LLMs - even compared to TensorRT/vLLM on Nvidia GPUs - and get rid of the GPU scarcity problem. We plan to support TPUs and others in future.
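To give a concrete sense of what running an open-source LLM on Inferentia/Trainium involves, the usual route is AWS's Neuron SDK; a rough sketch using Hugging Face's optimum-neuron (not Cerebrium-specific - the model id and argument names are just examples and may differ between versions) looks roughly like this:

    # Rough sketch using Hugging Face optimum-neuron; model id and shapes are just examples.
    from optimum.neuron import NeuronModelForCausalLM
    from transformers import AutoTokenizer

    # Compile/export the model for Neuron cores (Inferentia2/Trainium).
    compiler_args = {"num_cores": 2, "auto_cast_type": "fp16"}
    input_shapes = {"batch_size": 1, "sequence_length": 2048}
    model = NeuronModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf", export=True, **compiler_args, **input_shapes
    )

    # Generate as you would with a regular transformers model.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    inputs = tokenizer("Hello, my name is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0]))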
We are listed on the AWS Marketplace as well as others, which means you can subtract your Cerebrium cost from your committed cloud spend.
Two things we are working on that will hopefully make us a bit different are:
- GPU checkpointing
- Running compute in your own cluster, to use your credits / for privacy concerns
Where Modal really shines is training/data-processing use cases, which we currently don't support very well. However, we do have this on our roadmap for the near future.
Related to that, it seems the syntax isn't documented https://docs.cerebrium.ai/cerebrium/environments/config-file...
You can see an example config file at the bottom of that link you attached - agreed, we should probably make it more obvious.
As for the quoting part, it's mysterious to me why a structured file would use a quoted string for what is obviously an interior structure. Imagine if you opened a file and saw
    fred = "{alpha: ['beta', 'charlie''s dog', 'delta']}"
wouldn't you strongly suspect that there was some interior syntax going on there? Versus the sane encoding of:
    fred:
      alpha:
        - beta
        - charlie's dog
        - delta
in a normal markup language, no "inner/outer quoting" nonsense required. But I did preface it with my toml n00b-ness, and I know that the toml folks believe they can do no wrong, so maybe that's on purpose, I dunno.
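(And as far as I can tell, even plain TOML can express that nesting directly without any inner quoting, e.g.:

    [fred]
    alpha = ["beta", "charlie's dog", "delta"]

which makes the quoted-string-with-interior-syntax encoding even stranger to me - though again, toml n00b here, so take that with a grain of salt.)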
The support is next level - the team is ready to dive into any problem, responses are super fast, and they have helped us solve a bunch of dev problems that a normal platform probably wouldn't.
Really excited to see this one grow!!
We're definitely looking for something like this as we transition away from Azure's (expensive) GPUs. I'm curious how you stack up against something like Runpod's serverless offering (which seems quite a bit cheaper). Do you offer faster cold starts? How long would a ~30GB model take to load?
In terms of cold starts, they mention theirs are 250ms, but I am not sure what workload that is on, or whether we measure cold starts the same way. We have had quite a few customers tell us we are quite a bit faster (2-4 seconds vs ~10 seconds), although we haven't confirmed this ourselves.
For a 30GB model, we have a few ways to speed this up, such as using the Tensorizer framework from CoreWeave, and we cache model files in our distributed caching layer, where we see reads of up to 1GB/s - but I would need to test. If you tell me the model you are running (if open-source) I can get results to you - you can message me on our Slack/Discord community or email me at michael@cerebrium.ai.
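To give a rough idea of the Tensorizer side: you serialize the weights once into a flat file, then at cold start you build an empty model skeleton and stream the tensors straight into it. A simplified sketch based on CoreWeave's tensorizer examples (not our exact pipeline, and the model id is just an example):

    # Simplified sketch of CoreWeave's tensorizer flow; not our exact pipeline.
    from transformers import AutoConfig, AutoModelForCausalLM
    from tensorizer import TensorSerializer, TensorDeserializer
    from tensorizer.utils import no_init_or_tensor

    model_id = "EleutherAI/gpt-j-6B"  # example model

    # One-off: serialize the weights to a flat .tensors file (this is what gets cached).
    model = AutoModelForCausalLM.from_pretrained(model_id)
    serializer = TensorSerializer("gpt-j-6B.tensors")
    serializer.write_module(model)
    serializer.close()

    # At cold start: build the model skeleton without allocating weights, then
    # stream the tensors straight into it - this is where the ~1GB/s reads matter.
    config = AutoConfig.from_pretrained(model_id)
    with no_init_or_tensor():
        model = AutoModelForCausalLM.from_config(config)
    deserializer = TensorDeserializer("gpt-j-6B.tensors", device="cuda")
    deserializer.load_into_module(model)
    deserializer.close()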
I may be misunderstanding your explanation a bit here, but Runpod's serverless "flex" tier looks like the same model (it only charges you for non-idle resources). And at that tier they are still 2x cheaper for an A100 - at your price point with them you could rent an H100.
(Congrats on the launch as well, by the way).
I just shared this on Slack and it looks like the site description has a typo: "A serverless AI infrastructure platform [...] customers experience a 40%+ cost savings as opposed to AWS of GCP"
When you ran it the first time, it took a while to load up. Do subsequent runs go faster?
And what cloud provider are you all using under the hood? We work in a specific sector that excludes us from using certain cloud providers (i.e. AWS) at my company.
We are running on top of AWS; however, we can run on top of any cloud provider, and we are also working on letting you use your own cloud. Happy to hear more about your use case and see if we can help you at all - email me at michael@cerebrium.ai.
PS: I will say that vLLM has shocking load times into VRAM, which we are resolving.
Cerebrium abstracts some functionality for you - like streaming and batching endpoints. I think you would need to build that yourself on Paperspace.
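For context on what "streaming endpoints" means in practice: roughly, your handler just yields results as they are produced and the platform turns that into a streamed HTTP response. A simplified sketch (not the exact Cerebrium contract):

    from typing import Iterator

    # Simplified example: yield tokens/chunks as they are produced; a platform with
    # built-in streaming turns a generator like this into a chunked/SSE response.
    def predict(prompt: str) -> Iterator[str]:
        # Stand-in for a real token-by-token inference loop.
        for word in ("echo:", *prompt.split()):
            yield word + " "

On Paperspace you would wire this up yourself, e.g. by wrapping the generator in FastAPI's StreamingResponse.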