LLM Inference Deployment Engineer
$180,000–$240,000 per year
Remote · United States or Canada
Job Summary
Seeking an LLM Inference Deployment Engineer to optimize, deploy, and scale large language models for high-performance inference on energy-efficient AI accelerators. Responsibilities include deploying and optimizing LLMs post-training from libraries such as HuggingFace, working with inference runtimes such as ONNX Runtime and vLLM, tuning batching and tensor parallelism for real-time applications, and building high-performance inference pipelines with Docker and Kubernetes.
Required Qualifications
- Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related field
- Experience in LLM inference deployment, model optimization, and runtime engineering
- Strong expertise in LLM inference frameworks (PyTorch, ONNX Runtime, vLLM, TensorRT-LLM, DeepSpeed)
- In-depth knowledge of Python for model integration and performance tuning
- Experience with containerized AI deployments (Docker, Kubernetes, Triton Inference Server, TensorFlow Serving, TorchServe)
- Experience with real-time LLM applications (chatbots, code generation, retrieval-augmented generation)
EnchargeAI is an equal employment opportunity employer in the United States.
Desired Qualifications
- Experience in LLM inference deployment
- Model optimization
- Runtime engineering
- Containerized AI deployments (Docker, Kubernetes, Triton Inference Server, TensorFlow Serving, TorchServe)
- Experience with HuggingFace libraries
- Proficiency in PyTorch and ONNX Runtime
- Familiarity with vLLM, TensorRT-LLM, DeepSpeed
- Real-time LLM applications (chatbots, code generation, retrieval-augmented generation)
- Python programming for model integration and performance tuning