Site Reliability Engineer (SRE)

Req ID:  29260
Posted on:  7 Feb 2025
Location:  Huechuraba, Chile

Department:  Customer Projects Deployment & Services
Job Family:  Information Technology


AIM OF THE JOB:

As an SRE, you are responsible for responding to incidents and escalations. This includes on-call and escalation support, which may be required after office hours and may be planned for weekends; a support duty roster will be implemented. In Technical Support, you are competent in troubleshooting and investigating technical problems, performing root cause analysis (RCA), recommending resolutions, and implementing workarounds when a software fix is not yet available. In Solution and Observability Monitoring, you are competent in developing, customizing, and implementing monitoring of the solution. In Continuous Delivery, you are responsible for deploying new versions of applications. In Solution Quality Assurance, you participate with Product Dev and DevOps in development testing activities (FAT) and drive solution testing during deployment (SAT). You proactively share knowledge with team members and the SRE community, and you have a curious mindset, always learning new things and making improvements.

Main responsibilities and activities:

  • Implement solution monitoring and observability monitoring, and automate detection and response
  • Implement SLI and SLO measurements and monitoring in our Solution Monitoring
  • Conduct service improvement actions and reviews with the team using SLI and SLO data
  • Troubleshoot incidents, conduct post-incident analysis, and perform root cause analysis
  • Implement workarounds to prevent recurrence of incidents, and improve monitoring detection
  • Implement Observability monitoring and perform distributed tracing analysis of applications
  • Deploy new application releases to the preproduction and production environments
  • Participate in and contribute to automation of deployment, testing, and monitoring detection
  • Collaborate with the SQC team on test automation deployment and with DevOps on continuous delivery
  • Participate in planning and review sessions with the Development, DevOps, and Platform teams
  • Expand and grow the technical knowledge, skillsets, and expertise expected of an SRE
  • Create and document artifacts related to SRE practices, for example good practices, patterns, customized dashboards, workarounds, troubleshooting methods, and solution monitoring and observability improvements

PROFILE:

  • College degree or technical training in Computer Science, Software Engineering, or an equivalent combination of training and/or experience
  • At least 5 years of working experience, of which at least 3 years involved software development and 2 years involved IT operations, IT support, or basic system administration. Experience in application maintenance, especially application troubleshooting, bug detection, fixing, and testing, is a must.

TECHNICAL SKILLS:

  • Troubleshooting or debugging applications and complex systems
  • Application tracing and log analysis
  • Linux and virtual machines (VMs)
  • Hands-on experience with shell scripting
  • Application deployment and deployment tools (e.g. Jenkins)
  • Competent knowledge of at least one database (understand the schema, able to perform DML using SQL)
  • Programming and development in at least one programming language (e.g. Python, C, Java)
  • Incident resolution, root cause analysis, and incident management
  • JIRA, ITSM ticketing tools, and documentation tools (e.g. Wiki); Nagios, Splunk, Docker, OpenShift, Kubernetes, and automation (e.g. Ansible)
  • English B2
