Site Reliability Engineer - Singapore, Singapore - Sea

    Sea
    Singapore, Singapore

    3 weeks ago

    $80,000 - $120,000 per year
    Technology / Internet
    Description

    We are seeking a highly skilled Site Reliability Engineer (SRE) with a strong background in maintaining self-hosted Kubernetes clusters. Your primary focus will be ensuring the stability and reliability of our production environment. A smooth-running infrastructure supports our AI researchers by giving them a steady, dependable platform.

    Job Description
  • Work closely with AI researchers to understand their workflow and infrastructure needs, optimizing the cluster configurations accordingly.
  • Implement monitoring, alerting, and self-healing systems to ensure high availability and performance of the clusters.
  • Collaborate with development teams to design and implement best practices for infrastructure as code (IaC).
  • Drive automation initiatives to reduce manual toil and improve system resilience and scalability.
  • Document system design and procedures, and guide researchers on advanced cluster usage.
    Requirements
  • Bachelor's degree or higher in Computer Science, Engineering, or a related field.
  • Proven experience in managing self-hosted Kubernetes clusters in a production environment.
  • Strong understanding of containerization, orchestration, and the Kubernetes ecosystem.
  • Familiarity with AI workflows; a machine learning or deep learning research background is a plus.
  • Proficiency in at least one programming language (e.g., Python, Go) and scripting skills for automation.
  • Good working attitude and strong problem-solving, critical-thinking, and communication skills.