Senior Site Reliability Engineer
Every engineering team at Client is responsible for running and operating the software that it builds. Site Reliability Engineers (SREs) standardize and support the rapidly growing teams throughout our organization by assessing their architecture, helping them design scalable services, and fostering excellent operational practices. It's a mission-critical role: ensuring that our systems are always healthy, monitored, automated, and designed to scale.
What makes Reliability Engineering different at Client?
It is software engineering! We resolve problems with a focus on ensuring they don't happen again. We aim to enhance Observability into our systems and to automate ourselves out of our jobs.
We are engineers who are either embedded in specific development teams, where we drive operational improvements, or part of the CORE Site Reliability Team, where we focus on innovating in the Observability & Resilience engineering space.
Define roadmap and architecture based on technology and business needs.
Build holistic visibility into SLIs, SLOs, SLAs, dependency graphs, and the past performance of software, networks, and systems, so that we can continue to scale without increasing operational burden or toil.
Share your knowledge by giving brown bags, tech talks, and evangelizing appropriate tech and engineering best practices.
Build infrastructure and drive projects that break things with the aim of improving the robustness of production systems.
Use the core Site Reliability Engineering principles of monitoring, emergency response, capacity planning, and production readiness reviews to run the platform.
Step back to observe patterns and develop innovative tools and automation to minimize toil. Use those learnings to drive the best operational practices.
Partner with the broader Client organization to build a culture of rigorously learning from incidents.
Unblock, support, and communicate effectively across teams to achieve results.
Diagnose production incidents and develop fixes that can be implemented quickly and efficiently.
Design and implement the Observability strategy for Client Global Technology.
Proficient in Java with 5+ years’ experience.
3 years’ experience in building cloud-based enterprise systems, ideally on AWS.
Basic understanding of DNS, Networking, Virtualization, Linux.
Experience with Docker/Containers and Serverless patterns.
Expertise in designing, debugging, and running fault-tolerant, large-scale distributed systems.
Expertise in NoSQL datastore systems to build highly scalable solutions.
Experience with messaging (pub-sub) patterns.
Good understanding of asynchronous/non-blocking RESTful API approaches and frameworks.
Basic understanding of the following tools: ServiceNow, Jira, Jenkins, Splunk, SignalFx, NewRelic.
Good communication skills.
Good-to-have skills
Experience with Python or Scala.
Experience with test-driven development.
Background in ITIL or Lean is a plus.
Experience with code instrumentation for adding Metrics & Traces.
Demonstrated negotiation and influencing skills.
Requires a Bachelor’s Degree in Computer Science, Engineering, IT or a related field; MBA a plus. Minimum of 2 years of relevant work experience.