
IT Service Engineer - Site Reliability Engineer

Req ID:  27830
Posted on:  12 Apr 2024
Location:  Bangkok (TH01), Thailand

Department:  Customer Projects Deployment & Services (50016328)
Job Family:  Information Technology


 

 

IT SERVICE ENGINEER - SITE RELIABILITY ENGINEER

 

 

PURPOSE OF THE JOB:

As an SRE, you are responsible for responding to incidents and escalations. This includes on-call and escalation support, which may be required after office hours and may be scheduled on weekends. A support duty roster will be implemented.

On technical support, you are competent in troubleshooting and investigating technical problems, performing root cause analysis (RCA), recommending resolutions, and implementing workarounds when a software fix is not yet available. You have experience in software development and in diagnosing defects, and understand design principles well enough to apply them in technical troubleshooting. You must possess a root cause mindset, practice it actively, and prevent recurrence of issues.

On Solution and Observability Monitoring, you must be competent in developing, customizing, and implementing monitoring of the solution. You continually automate detections and actions to improve responsiveness to outages. You are responsible for implementing, maintaining, and improving the defined SLIs and SLOs of the services, using the Solution and Observability Monitoring.

On Continuous Delivery, you are responsible for deploying new versions of applications. You are competent with automation tools for scripting and automating deployments, system tasks, and reconfigurations.

On Solution Quality Assurance, you participate with Product Dev and DevOps in development testing activities (FAT) and drive solution testing during deployment (SAT).

On knowledge sharing and learning, you are responsible for documenting technical issues, resolutions, workarounds, and improvements. You proactively share knowledge with team members and the SRE community, and you possess a curious mindset: always learning new things and making new improvements.

 

OBJECTIVES OF ROLE

  • Collaborate in a multi-functional team (Product development, DevOps, SQC, IT Operations, Support Coordination)
  • Implement solution availability management within the team through SLIs and SLOs
  • Provide technical support, including on-call or escalation support after office hours and on weekends, with weekend work required on a duty-roster basis
  • Incident response and resolution related to solution or service outages
  • Prevent recurrence of past problems and mitigate potential future problems
  • Solution and Services Monitoring to ensure automated detection and response to outages
  • Contribute to and participate in the Continuous Delivery of new application releases
  • Implement a high degree of automation in deployment, solution monitoring, automated action, and reconfiguration
  • Participate in and ensure compliance with Solution Quality Assurance activities
  • Continually learn new technical skills and share the knowledge and expertise with the team


RESPONSIBILITIES

 

  • Implement solution monitoring and observability monitoring, automate detections and responses
  • Implement SLI and SLO measurements and monitoring in our Solution Monitoring
  • Conduct Service improvement actions and review with the team using data from SLI and SLO
  • Troubleshoot incidents, conduct post-incident analysis, and perform root cause analysis
  • Implement workarounds to avoid recurrence of incidents, and improve monitoring detection
  • Implement Observability monitoring and perform distributed tracing analysis of applications
  • Deploy new application releases to the preproduction and production environments
  • Participate in and contribute to automation in deployment, automated testing, and monitoring detection
  • Collaborate with the SQC team on test automation deployment
  • Collaborate with DevOps on continuous delivery
  • Participate in solution FAT process and performance tests
  • Drive the solution SAT process in collaboration with the Development, DevOps, and Platform teams
  • Participate in the planning and review sessions with the Development, DevOps, and Platform teams
  • Expand and grow the technical knowledge, skillsets, and expertise expected of an SRE
  • Create and document artifacts related to SRE practices, for example good practices, patterns, customized dashboards, workarounds, troubleshooting methods, and solution monitoring and observability improvements
  • Document and share any work knowledge, incident experiences, and any improvements with the team
  • Participate in mentoring and coaching new and junior employees or new members of the team
  • Participate in and contribute to the knowledge-sharing community for SRE practices, for example the Solution Monitoring program

 


REQUIREMENTS: 

 

Knowledge, skills:  

  • Basic education and training: Bachelor’s degree or technical training in Computer Science or Information Technology, or an equivalent combination of training and/or experience.
  • Professional experience:  At least 5 years of working experience, of which at least 3 years involved software development and 2 years related to IT operations, IT support, or basic System Administration. Experience in application maintenance, especially application troubleshooting, bug detection, fixing, and testing, is a must.
  • Language skills: Fluent in English

 

Technical skills:

  • Experience in troubleshooting or debugging applications and complex systems
  • Understand application tracing and log analysis
  • Strong knowledge of Linux and VM
  • Hands-on experience in Shell Scripts
  • Experience in application deployment, and deployment tools (e.g. Jenkins)
  • Competent knowledge of at least one database (understanding of schemas, ability to perform DML using SQL)
  • Experience in programming and development in at least one programming language (e.g. Python, C, Java)
  • Experience with incident resolution and root cause analysis and incident management
  • Knowledge and experience in using JIRA, an ITSM ticketing tool, and documentation tools (e.g. Wiki)
  • Knowledge and experience in Nagios and Splunk or similar tools
  • Knowledge and experience in Docker, OpenShift, Kubernetes, or similar technologies
  • Knowledge and experience in automation (e.g. Ansible) will be an advantage


Personal qualities required, abilities: 

  • Strong problem-solving skills
  • Possess a root cause mindset
  • Critical thinking and analytical mindset (organized in thought and methodology)
  • Strong aptitude and motivation in learning new technical knowledge or skills
  • Willing to share, teach and coach others
  • Good sense of humility to learn from mistakes and learn from others
  • Demonstrated ownership of responsibilities
  • Good oral and written communication skills
  • Good interpersonal skills and good influencing skills
  • Ability to work under pressure and meet deadlines
  • Customer-oriented attitude

 

 

JOIN US!

 

Our success comes from our highly skilled and talented employees

Respectful entrepreneurship and a long-term vision are key to success

Our people contribute to a more secure world

Diversity at all levels of an organization is a strength
