We are the Azure Reliability team, a multidisciplinary engineering organization committed to making Azure the world's safest and most reliable cloud. For Azure's most critical services and products, we apply a Site Reliability Engineering (SRE) approach. Our software engineers work closely with product teams to enhance availability, reliability, observability, and operability across our planet-scale systems.
We prioritize long-term platform improvements through engineering over repetitive manual tasks. Increasingly, we leverage AI to detect anomalies, predict incidents, and automate operational workflows, amplifying our ability to scale reliability across Azure. Our teams contribute to product architecture, share knowledge and code, and focus on building reusable solutions that benefit multiple teams and services.
We're not looking for people who know it all; we're looking for those eager to learn it all. If you thrive on collaboration, embrace challenges, and see mistakes as opportunities to grow, we'd love to meet you.
Responsibilities
Billions of users across the world rely on our products, and to meet this demand we design and implement world-class distributed systems.
As a Software Engineer in one of our Azure SRE teams, you will be responsible for improving the reliability of key Azure products.
The Azure SRE key focus areas are:
Building reusable automation and processes that help multiple teams meet their reliability goals. Influencing product architecture and roadmaps to ensure customer-experienced reliability is a core design principle.
Contributing directly to product code to achieve reliability outcomes. Leveraging AI to proactively detect anomalies, predict incidents, and automate operational workflows, scaling reliability efforts across complex systems.
We are looking for engineers passionate about the above areas who are also interested in:
Providing technical leadership across multiple Azure teams. Mentoring others on SRE principles, practices, and tools, as well as on using AI to boost software development productivity.
Designing and developing large-scale distributed software services and solutions. Delivering best-in-class engineering by ensuring services are modular, secure, reliable, testable, diagnosable, observable, and reusable.
Collaborating with internal and external partners to support team goals. Balancing pragmatism with vision, driving continuous improvements in process and codebase. Building automation to prevent or remediate service issues before they impact users.
Driving innovation in large-scale operations by applying cutting-edge AI tools and techniques to reduce operational toil and scale reliability engineering across complex systems. Gaining a working understanding of Microsoft businesses and contributing to cohesive, end-to-end user experiences.
Required Qualifications
Bachelor's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
OR Master's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python.
OR equivalent experience working with large-scale distributed systems (e.g., cloud computing providers, SaaS services, etc., ideally with millions or billions of users) or similarly complex environments.
Awareness of, and ability to reason about, modern distributed software design patterns and cloud systems architecture, including microservices, containers, load balancing, queuing, and caching.
Experience with C#/Java/C/C++/Golang.
Experience in building, shipping, and operating reliable solutions.
Experience operating large-scale distributed systems with high-availability requirements.
This position is open to women and men alike.