Deploying a GPU-Powered LLM on Cloud Run
Discover how to deploy your own GPU-powered Large Language Model (LLM) on Google Cloud Run. This video walks through taking an open model like Gemma and deploying it as a scalable, serverless service with GPU acceleration. We explore the essential Dockerfile configuration and the `gcloud run deploy` command, highlighting the key flags for tuning performance and keeping costs under control. Serving the LLM as its own service lets your AI agent's intelligent core scale independently of the rest of your application.
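To give you a feel for the setup before watching, here is a minimal sketch of a Dockerfile following the Ollama-on-Cloud-Run pattern the video describes. The model tag (`gemma3:4b` here) is an assumption; check the codelab for the exact model used:

```dockerfile
FROM ollama/ollama:latest

# Listen on all interfaces, on the port Cloud Run routes traffic to
ENV OLLAMA_HOST=0.0.0.0:8080

# Store model weights inside the image so instances start without a download
ENV OLLAMA_MODELS=/models

# Keep the model loaded on the GPU between requests
ENV OLLAMA_KEEP_ALIVE=-1

# Start a temporary server and bake the model weights into the image layer
# (gemma3:4b is a placeholder tag, not necessarily the one used in the video)
RUN ollama serve & sleep 5 && ollama pull gemma3:4b

# Serve the model when the container starts
ENTRYPOINT ["ollama", "serve"]
```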
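And a sketch of the deploy command covering the hardware, performance, and cost flags discussed in the video. The service name, region, and resource sizes below are illustrative assumptions, not the exact values from the demo:

```bash
gcloud run deploy gemma-ollama \
  --source . \
  --region us-central1 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --cpu 8 \
  --memory 32Gi \
  --concurrency 4 \
  --max-instances 1 \
  --no-cpu-throttling \
  --no-allow-unauthenticated
```

Capping `--max-instances` bounds your GPU spend, while `--no-cpu-throttling` keeps the CPU allocated for the lifetime of the instance, which Cloud Run requires when a GPU is attached.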
Chapters:
0:00 - Introduction
0:32 - Why deploy a separate LLM?
1:32 - Serving the LLM with Ollama and Dockerfile
2:41 - Deploying to Cloud Run with `gcloud run deploy`
2:59 - Hardware configuration
3:20 - Performance and cost control
3:40 - Deployment summary
3:57 - Summary and next steps
Resources:
Codelab → https://goo.gle/aaiwcr-3
GitHub Repository → http://goo.gle/4pYAmMi
Google Cloud Run GPU → http://goo.gle/46EYI6g
ADK Documentation → http://goo.gle/46Thw0d
Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech
#GoogleCloud #LLM #CloudRun
Speaker: Amit Maraj
Products Mentioned: Cloud GPUs, Cloud Run