We are looking for a Site Reliability Engineer (SRE) to make sure our cloud-based commerce platform is up and running and healthy.
As an SRE for iKala Commerce, you will be responsible for everything from our cloud infrastructure and operating systems to developing tools for code deployment and service monitoring. You will also review our code and system design and partner with developers to build our applications.
The SRE is an integral member of our product development team. You will be part of the team that makes crucial decisions about how to manage and scale complex, high-performance distributed systems. You will also provide your own perspective on our backend systems and constantly develop innovative ways to improve how we manage the underlying infrastructure. Our ideal candidate is able to develop applications on their own, but is even more eager to accelerate the whole team by building systems that improve performance and operational efficiency.
Ultimately, you should be involved in all stages of software development to define and improve our SLOs, SLAs & SLIs.
Our current tech stack includes:
GCP, Terraform, Kubernetes, Helm, ArgoCD, GitLab CI/CD, Grafana LGTM
【Key Responsibilities】
1. Designing & implementing infrastructure for collecting metrics, crunching data and improving service monitoring to detect problems before they're visible to our customers.
2. Building systems to automate our server lifecycle, from configuration management, CI/CD to server bootstrap and decommission.
3. Troubleshooting, performing root cause analysis, and resolving production issues from the application and network layers all the way down to the system level.
4. Participating in solution design and advising other developers when building new features so that they're scalable, maintainable, and performing well.
5. Improving the observability of our applications through monitoring, alerting, logging, tracing and profiling, and building such observability features into a common platform.
6. Practicing sustainable incident response and blameless postmortems.
7. Proactively identifying and reducing issues through design, testing, and implementation of software-based solutions.
More info: https://www.ikala.ai
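Responsibilities:
We are looking for technical experts
1. In cloud service provisioning: operate and manage application systems on mainstream cloud providers such as GCP.
2. In release management: ensure compliance of, assemble, and deliver source code into container images, then deploy them to infrastructure of various formats.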
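You are responsible for:
1. Deploying, managing, and operating scalable, highly available, and fault-tolerant systems.
2. Migrating existing on-premises applications to the cloud.
3. Selecting the appropriate cloud service based on compute, data, or security requirements.
4. Estimating cloud usage costs and identifying operational cost control mechanisms.
5. Executing test scripts to build software packages and, as a release engineer, ensuring that new products are configured and coded correctly for successful integration and operation.
6. Building test environments and troubleshooting any issues related to software performance. Working with RD to resolve issues and documenting fixes for future reference.
7. Building tools to support the software engineering process, reviewing engineering practices, assisting in researching new technologies, and meeting with the development team to discuss future needs. Providing ongoing support for completed products and maintaining servers.
8. Handling escalated incidents and providing on-call support when needed.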
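Requirements:
1. 3+ years of experience provisioning, operating, and managing cloud environments.
2. Mastery of CI/CD tools and methodologies.
3. Experience with Kubernetes, Docker-compose, and containerization.
4. Experience with configuration management and infrastructure as code.
5. Experience in Site Reliability Engineering or DevOps is preferred.
6. Cross-department communication skills.
7. Willingness to work in a high-growth/scaling technical environment.
8. Candidates with less experience/exposure will be considered for the SysOps Engineer role.
Tools: Terraform, ArgoCD, Jenkins, shell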
====
Responsibilities
We are looking for technical experts
1. In cloud service provisioning: operate and manage application systems on mainstream cloud providers such as AWS, GCP, Azure, Aliyun.
2. In release management: compile, assemble, and deliver source code into container images, then deploy them to infrastructure of various formats.
You are responsible for
1. Deploy, manage, and operate scalable, highly available, and fault-tolerant systems.
2. Migrate existing on-premises applications to the cloud.
3. Select the appropriate cloud service based on compute, data, or security requirements.
4. Estimate cloud usage costs and identify operational cost control mechanisms.
5. Execute test scripts to build software packages and, as a release engineer, ensure that new products are configured and coded properly for successful integration and operation.
6. Build test environments and troubleshoot any issues pertaining to the software's performance. Work with software engineers to resolve issues and document fixes for future reference.
7. Build tools to support the software engineering process, review engineering practices, assist in researching new technologies, and meet with the development team to discuss future needs. Provide ongoing support for completed products and maintain servers.
8. Handle incident escalation and provide on-call support if necessary.
Requirements:
1. 3+ years of experience in provisioning, operating, and managing cloud environments.
2. Mastery of CI/CD tools and methodologies.
3. Experience in Kubernetes, Docker orchestration, and containerization.
4. Experience in configuration management and infrastructure as code.
5. Experience as a site reliability or DevOps engineer would be ideal.
6. Excellent communication skills.
7. Willingness to work in a high-growth/scaling technical environment.
8. Candidates with less experience/exposure will be considered for the SysOps Engineer role.
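17 LIVE welcomes Site Reliability Engineers who are interested in the work described below to join our family!
If you have the following skills and work experience, don't hesitate to apply right away:
-Basic knowledge of containers and Kubernetes.
-Basic understanding of CI/CD workflows; experience writing or maintaining CI/CD pipelines is a plus (the main tools we use are CircleCI, Jenkins, ArgoCD, and Helm).
-Experience building or maintaining monitoring systems (e.g., Prometheus, Thanos, Grafana, Elasticsearch, Fluentd, Kibana).
-Understanding of the Linux operating system, with the ability to operate and troubleshoot in a Linux environment.
-Familiarity with using Infrastructure as Code (IaC) tools to manage cloud resources, especially Terraform.
-Basic shell scripting skills.
-Improving availability: knowing how to deploy HA and DR architectures.
-Participation in the on-call shift rotation, providing 24/7 support.
-Development experience in at least one programming language (e.g., Go, Python).
Bonus points:
-Ability to quickly learn new technologies and solve problems.
-Contributions to open-source software projects.
-Experience in backend service development.
Traits we hope you have:
-Quick to respond: able to understand problems quickly and take action.
-Careful: in the habit of testing or validating on a small scale before making changes or deploying.
-Love of learning, with great enthusiasm for the SRE field.
-Good communication skills and team spirit.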
If you have the following skills and experience, don’t hesitate to apply — we’d love to hear from you!
Required Skills & Experience
-Basic knowledge of Containers and Kubernetes
-Familiar with CI/CD workflows; experience in writing or maintaining CI/CD pipelines is a plus (tools we use include CircleCI, Jenkins, ArgoCD, and Helm)
-Experience in building or maintaining monitoring systems such as Prometheus, Thanos, Grafana, Elasticsearch, Fluentd, and Kibana
-Solid understanding of the Linux operating system, including the ability to operate and troubleshoot in a Linux environment
-Familiar with using Infrastructure as Code (IaC) tools to manage cloud resources, especially Terraform
-Basic shell scripting skills
-Knowledge of how to design and deploy high availability (HA) and disaster recovery (DR) architectures to improve system reliability
-Willingness to participate in an on-call rotation and provide 24/7 support
-Experience with at least one programming language (e.g., Go, Python)
Bonus Points
-Ability to quickly learn new technologies and solve complex problems
-Contributions to open-source projects
-Experience in backend service development
What We Look for in You
-Fast response: Able to quickly understand and act on issues
-Cautious and detail-oriented: Prefer to test and validate changes in a limited scope before deploying widely
-Curious and passionate: Eager to learn and enthusiastic about the SRE field
-Great communicator and team player: Strong collaboration and communication skills
Driving with us to the Next!
"Integration of various energy sources, improvement in energy efficiency, and creation of a powerful platform that benefits everyone"
【Job Description】
We are searching for an SRE who can seamlessly integrate development artifacts with cloud resources. The candidate needs hands-on experience with public cloud usage and will work closely with container technologies. We are looking for a highly self-motivated engineer to join us and build operational environments that support everything from customer service to development. Daily tasks might include exploring the latest technologies that could be adopted to resolve business problems.
【Core Responsibilities】
• Work closely with engineering teams to identify and implement optimal cloud-based solutions for the company
• Build and maintain agile, responsive, container-native CI/CD pipelines (Jenkins / ArgoCD), and support multiple development teams in delivering high-quality builds with measurable performance
• Build, maintain, improve, scale, and secure cloud infrastructure and resources using IaC tools (Terraform / Pulumi) with cost in mind
• Build automation tools to improve system observability, availability, and reliability using Python and serverless solutions (AWS Lambda, Kubernetes Jobs)
• Design, manage and monitor Kubernetes clusters for multiple production workloads
• Participate in an on-call rotation to mitigate disruptions to production systems and produce root cause analysis reports
• Plan and test disaster recovery scenarios and business continuity plans for a highly available micro-services architecture
• Develop and implement security policies in compliance with ISO 27001/27017 standards, including access control, encryption and logging
• Build central dashboards and alerting mechanisms to identify potential resource problems
• Handle production issues with intelligent means
【Essential Qualifications】
• Bachelor's degree in a computer-related program
• 3 years of experience in AWS cloud management
• 3 years of experience in Kubernetes management
• 3 years of experience in CI/CD (Jenkins)
• 3 years of experience in networking or databases (PostgreSQL, Cassandra, Redis)
• 2 years of experience with observability tooling (Prometheus, Grafana, InfluxDB, OpenSearch, ELK)
• 3 years of experience in Linux
• Performance tuning, error handling, and root cause analysis
• Must be available for on-call duty
【Desirable Abilities】
• AWS-related certification
• CKA, CKAD, CKS