Lead Software Engineer - AI Platform Engineer
On-site · Jersey City, New Jersey, United States
Job Summary
The Lead Software Engineer, focusing on AI Platform Infrastructure, is responsible for designing, developing, and deploying secure, scalable cloud infrastructure and AI/ML workloads; leading architectural evaluations with external vendors and internal teams; building CI/CD pipelines and automation for ML workloads; collaborating with AI teams to translate computational needs into infrastructure requirements; optimizing cloud resources for performance and cost; contributing to a culture of diversity, inclusion, and technical excellence; and advancing communities of practice around modern software engineering and AI platforms.
Required Qualifications
- Formal training or certification in software engineering concepts with 5+ years of applied experience
- Hands-on experience delivering system design, application development, testing, and operational stability
- Proficiency in at least one programming language (Python, Go, Java, or C#)
- Proficiency in automation and continuous delivery methods
- Proficiency in all aspects of the Software Development Life Cycle
- Demonstrated proficiency in software applications and technical processes within cloud/AI/ML domains
- Foundational understanding of machine learning concepts (transformers, ML training, inference)
- Experience with containerization (Docker, Kubernetes) and cloud service providers (AWS, Azure, GCP)
- Experience with Infrastructure as Code
- Deep understanding of cloud component architecture (microservices, containers, IaaS, storage, security)
- Preferred: NVIDIA GPU infrastructure software, PyTorch, TensorBoard, MLflow, Prometheus, Grafana, vLLM, Ray, Slurm, SQL/NoSQL, Linux scripting
Desired Qualifications
- Foundational understanding of machine learning concepts (transformer architecture, ML training and inference)
- Hands-on experience delivering system design, application development, testing, and operational stability
- Experience with containerization (Docker, Kubernetes) and cloud providers (AWS, Azure, GCP)
- Experience with Infrastructure as Code
- Experience with MLOps tooling (MLflow)
- Familiarity with observability tools (Prometheus, Grafana)
- Strong programming skills in Python, Go, Java, or C#
- CI/CD and automation proficiency
- Experience with high-performance computing concepts and ML frameworks (e.g., PyTorch, TensorBoard)
- Strong knowledge of network architecture, databases (SQL/NoSQL), and data modeling
- Security-focused software engineering and scalable AI/ML infrastructure design
- Leadership and collaboration across multiple teams and vendors
- Experience with NVIDIA GPU infrastructure software (DCGM, BCM, Triton) (preferred)
- Experience with ML frameworks and tools (e.g., vLLM, Ray, Slurm) (preferred)
- Experience with distributed systems and microservices architecture (preferred)
- Experience with monitoring/observability stacks (Prometheus, Grafana) (preferred)