Machine Learning Engineer Job at Evolve Group, San Jose, CA

  • Evolve Group
  • San Jose, CA

Job Description

Machine Learning Engineer

Tech start-up

San Francisco based

We’ve partnered with one of the most ambitious and technically rigorous AI research labs in the world. Based in San Francisco, this team is building foundation models entirely from scratch.

They are now hiring ML Infrastructure Engineers to design and scale the systems that power large-scale, distributed model training. If you’ve built infrastructure that runs across hundreds of GPUs, thrive under technical complexity, and want to work side-by-side with elite AI researchers — this is the role.

Key Responsibilities:

  • Build and scale distributed training systems for large models across LLMs, vision, and robotics.
  • Set up and run large-scale training across many GPUs using tools like Kubernetes, DeepSpeed, and FSDP.
  • Troubleshoot system issues (GPU errors, network problems) and build tools to monitor and recover from failures.
  • Optimize PyTorch data pipelines, including sharding and sampling strategies.
  • Collaborate closely with researchers to support novel model training at scale.

Requirements:

  • 3–15 years in ML infrastructure, systems, or research engineering roles.
  • Proven experience scaling distributed training for large models.
  • Strong skills in PyTorch, CUDA, NCCL, and Kubernetes.
  • Familiar with setting up distributed training clusters.
  • Deep understanding of PyTorch dataloaders, data sharding, and sampling.
  • Strong communicator with a collaborative, mission-driven mindset.

This is a fully in-person role based in San Francisco. It's ideal for engineers excited to build at the edge of what's possible in AI.

Job Tags

Immediate start
