NVIDIA AI Infrastructure & Kubernetes Platform Engineer (DGX Systems) Job at VSG Business Solutions LLC, Remote

  • VSG Business Solutions LLC
  • Remote

Job Description

Related Certifications required

Alternate titles depending on context:

  • AI Platform Architect (DGX & SuperPOD)
  • AI Infrastructure DevOps Engineer (NVIDIA DGX Stack)
  • Senior AI Systems Engineer (DGX | Kubernetes | InfiniBand)

Job Description:

We are seeking a highly skilled AI Infrastructure & Kubernetes Platform Engineer with a proven track record in deploying and managing NVIDIA DGX-based AI clusters, orchestrating containerized AI workloads using Kubernetes, and ensuring secure, high-throughput operations across InfiniBand-powered networks. The ideal candidate will hold a combination of Kubernetes certifications (CKA, CKAD, CKS) and NVIDIA certifications (NCA-AIIO, NCP-AIO, NCP-AII, NCP-AIN), coupled with hands-on training in DGX, BlueField, and high-speed network operations.

This position plays a key role in supporting AI/ML infrastructure at scale, enabling efficient training and inference for complex models, and integrating NVIDIA's cutting-edge compute, storage, and fabric solutions with modern DevOps practices.

Core Responsibilities:

AI Infrastructure Operations

  • Deploy and manage NVIDIA DGX BasePODs and SuperPODs for high-performance AI workloads.
  • Oversee DGX system lifecycle operations including provisioning, monitoring, firmware upgrades, and capacity planning.
  • Operate Base Command Manager to manage GPU clusters, schedule workloads, and integrate with MLOps tools.
  • Perform DGX node health validation, NCCL interconnect testing, and NVLink topology verification following new deployments or hardware changes.

Kubernetes Platform Engineering

  • Architect secure and scalable Kubernetes clusters optimized for GPU-accelerated workloads using NVIDIA GPU Operator.
  • Leverage expertise from CKA/CKAD/CKS to develop, deploy, and secure AI applications on Kubernetes.
  • Implement CI/CD pipelines and GitOps methodologies for deploying and managing ML workflows.

High-Performance Networking & DPUs

  • Administer InfiniBand networks and BlueField DPUs using Unified Fabric Manager (UFM).
  • Optimize NVLink/NVSwitch performance across GPU nodes and tune fabric configurations for minimal latency and maximum throughput.
  • Use BlueField for offloading storage, firewalling, and telemetry, enhancing AI workload security and performance.

Security & Compliance

  • Apply best practices from the CKS certification to secure containerized AI environments.
  • Configure runtime security, secrets management, network segmentation, and auditing using DPU-enhanced Kubernetes deployments.
  • Support zero-trust architecture initiatives by enforcing workload identity, RBAC policies, and supply chain integrity across AI container images and model artifacts.

Monitoring, Telemetry & Optimization

  • Monitor GPU, CPU, and I/O performance using NVIDIA DCGM, Prometheus, Grafana, and Base Command APIs.

  • Tune system performance and model training pipelines for cost-efficiency and throughput.
  • Build and maintain operational runbooks, incident response playbooks, and SLA reporting dashboards covering GPU utilization, thermal thresholds, and fabric health.

Qualifications:

Certifications a plus:

  • Certified Kubernetes Administrator (CKA)
  • Certified Kubernetes Application Developer (CKAD)
  • Certified Kubernetes Security Specialist (CKS)
  • NVIDIA Certified Associate: AI Infrastructure & Operations (NCA-AIIO)
  • NVIDIA Certified Professional: AI Infrastructure (NCP-AII)
  • NVIDIA Certified Professional: AI Operations (NCP-AIO)
  • NVIDIA Certified Professional: AI Networking (NCP-AIN)

Expertise With:

  • DGX System, BasePOD, and SuperPOD Administration
  • BlueField DPU Configuration & Operations
  • InfiniBand Fabric and UFM Management
  • Base Command Manager for workload orchestration

Technical Skills:

  • Kubernetes, Helm, GPU Operator, Kubeflow
  • DevOps tools: Ansible, Terraform, GitOps, CI/CD pipelines
  • Storage: NFS, BeeGFS, Lustre
  • Networking: RoCE, InfiniBand, DPU offload, gRPC, RDMA
  • Programming/scripting: Python, YAML, Bash
