Driving with us to the Next!
"Integration of various energy sources, improvement in energy efficiency, and creation of a powerful platform that benefits everyone"
【Job Description】
We are looking for an SRE who can seamlessly integrate development artifacts with cloud resources. The candidate needs hands-on experience with public cloud platforms and container technologies. We want a highly self-motivated engineer to join us and build operational environments that support everything from customer service to development. Daily tasks may include exploring the latest technologies and adopting them to solve business problems.
【Core Responsibilities】
• Work closely with engineering teams to identify and implement optimal cloud-based solutions for the company
• Build and maintain agile, responsive, container-native CI/CD pipelines (Jenkins / ArgoCD), and support multiple development teams in delivering high-quality builds with measurable performance
• Build, maintain, improve, scale and secure cloud infrastructure and resources using IaC tools (Terraform / Pulumi), with cost in mind
• Build automation tools that improve system observability, availability and reliability using Python and serverless solutions (AWS Lambda, Kubernetes Jobs)
• Design, manage and monitor Kubernetes clusters for multiple production workloads
• Participate in an on-call rotation to mitigate disruptions to production systems, and write root cause analysis reports
• Plan and test disaster recovery scenarios and business continuity plans for a highly available microservices architecture
• Develop and implement security policies in compliance with ISO 27001/27017 standards, including access control, encryption and logging
• Build a central dashboard and alerting mechanisms to identify potential resource problems
• Handle production issues in a smart, systematic way
【Essential Qualifications】
• Bachelor's degree in a computer-related field
• 3 years of experience in AWS cloud management
• 3 years of experience in Kubernetes management
• 3 years of experience with CI/CD (Jenkins)
• 3 years of experience in networking or databases (PostgreSQL, Cassandra, Redis)
• 2 years of experience with observability tooling (Prometheus, Grafana, InfluxDB, OpenSearch, ELK)
• 3 years of experience with Linux
• Skills in performance tuning, error handling, and root cause analysis
• Must be available for on-call duty
【Desirable Abilities】
• AWS-related certification
• CKA, CKAD, CKS
17 LIVE welcomes Site Reliability Engineers who are interested in the role described below to join our family!
If you have the following skills and experience, don’t hesitate to apply — we’d love to hear from you!
Required Skills & Experience
-Basic knowledge of Containers and Kubernetes
-Familiar with CI/CD workflows; experience in writing or maintaining CI/CD pipelines is a plus (tools we use include CircleCI, Jenkins, ArgoCD, and Helm)
-Experience in building or maintaining monitoring systems such as Prometheus, Thanos, Grafana, Elasticsearch, Fluentd, and Kibana
-Solid understanding of the Linux operating system, including the ability to operate and troubleshoot in a Linux environment
-Familiar with using Infrastructure as Code (IaC) tools to manage cloud resources, especially Terraform
-Basic shell scripting skills
-Knowledge of how to design and deploy high availability (HA) and disaster recovery (DR) architectures to improve system reliability
-Willingness to participate in an on-call rotation and provide 24/7 support
-Development experience with at least one programming language (e.g., Go, Python)
Bonus Points
-Ability to quickly learn new technologies and solve complex problems
-Contributions to open-source projects
-Experience in backend service development
What We Look for in You
-Fast response: Able to quickly understand and act on issues
-Cautious and detail-oriented: Prefer to test and validate changes in a limited scope before deploying widely
-Curious and passionate: Eager to learn and enthusiastic about the SRE field
-Great communicator and team player: Strong collaboration and communication skills
We are looking for an enthusiastic and motivated Site Reliability Engineer (SRE) to join our growing team. In this role, you will have the opportunity to learn and contribute to the stability, performance, and scalability of our critical systems. We place a strong emphasis on security in all aspects of our operations. You will work closely with teams to maintain and improve our infrastructure, monitor services, and respond to incidents. This is an excellent opportunity to develop your skills in a dynamic and supportive environment.
<Responsibilities>
1. Assist in Maintaining and Optimizing Infrastructure: Support teams in the day-to-day maintenance and optimization of our infrastructure components.
2. Monitor Services and Address Issues: Monitor system health and service performance, and assist in troubleshooting and resolving issues in a timely manner.
3. Track Resource Usage and System Status: Help monitor various resource indicators and the overall status of the system, contributing to optimization efforts.
4. Support System Stability and Incident Response: Assist in maintaining system stability and participate in incident response procedures under guidance.
5. Contribute to Preventing System Failures: Work with the team to implement measures that help avoid system failures and service interruptions.
6. Collaborate with Other Teams: Work alongside other teams to continuously learn about and contribute to improving system architecture and service quality.
7. Support System Maintenance and Deployment Processes: Assist in the execution of established processes for system maintenance, deployment, and upgrades.
8. Learn and Apply SRE Best Practices: Actively learn and apply SRE principles and best practices in daily tasks.