Senior Lead Software Engineer
On-site · Palo Alto, California, United States
Job Summary
The Senior Lead Site Reliability Engineer collaborates with stakeholders to define non-functional requirements (NFRs) and availability targets, ensures SLOs are reflected in design and tests, and implements AI-powered autonomous operations. The role spearheads observability design, develops AI agents and MCP servers for autonomous incident detection and auto-remediation, integrates multiple data stores, and leads automation in Java, Go, Python, and Terraform. It also includes mentoring engineers, contributing to the SRE community, and driving adoption of reliability practices across distributed, cloud-native systems.
Required Qualifications
- Formal training or certification in software engineering concepts with 10+ years of applied experience in Site Reliability Engineering, DevOps, or Software Engineering
- Advanced knowledge of site reliability culture and principles with demonstrated ability to implement SRE within an application or platform
- Advanced knowledge of observability and alerting with tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk
- Expert-level proficiency in Java, Go, Python, and Terraform for building enterprise-grade applications and automation
- Experience with containerization (Docker, Kubernetes) and CI/CD pipelines
- Strong experience designing and implementing logging pipelines and metrics/trace collection for distributed systems
- Experience with graph databases (Neo4j, TigerGraph) and vector databases (Pinecone, Weaviate, Chroma)
- Experience building production-grade RESTful APIs and event-driven architectures (Kafka, RabbitMQ, SQS)
- Experience with AI/ML frameworks and autonomous systems integration
- Bachelor’s or Master’s degree in Computer Science, Engineering, or related field