Location: Bengaluru, India
Onsite work required
Full-time role
3+ years of Site Reliability Engineering experience required
This is a junior position.
Job Description:
Scope - The following activities are in scope for the proposed SRE role:
Exercise best practices to ensure and improve high availability, reliability, and recoverability of platforms.
Work with proprietary tools that mitigate weaknesses in incident management or software delivery.
Design and build disaster recovery and business continuity automation and perform routine DR trials.
Develop capacity management practices.
Evaluate and re-architect SLIs to dynamically account for projected growth and accurately represent service reliability.
Develop, maintain, and configure cloud observability systems (e.g., Datadog, GCP logging, RUM, APM).
Build flexible monitoring and alerting to proactively address issues before they become incidents.
Develop a framework to evaluate system performance and implement optimizations where appropriate.
Partner with development teams to establish application production readiness through rigorous testing and release procedures.
Participate in on-call rotations for incident response and postmortem investigation.
Participate in rigorous training both within and across engineering teams.
Proactively and swiftly identify areas within systems and processes where resiliency can be improved.
Develop documentation and knowledge-sharing mechanisms with a resiliency-focused approach.
Observability as code
Design a tiered system of reusable monitors for various environments, using configurations maintained in source control.
Design and propose to software development teams how to apply monitoring to prod and non-prod environments in a financially responsible way while accounting for all compliance concerns (GDPR, HIPAA, etc.).