We are looking for a Site Reliability Engineer (SRE) to make sure our cloud-based commerce platform is up and running and healthy.
As an SRE for iKala Commerce, you will be responsible for everything from our cloud infrastructure and operating systems to developing tools for code deployment and service monitoring. You will also review our code and system design and partner with developers to build our applications.
The SRE is an integral member of our product development team. You will be part of the team that makes crucial decisions about how to manage and scale complex, high-performance distributed systems. You will also provide your own perspective on our backend systems and constantly develop innovative ways to improve how we manage the underlying infrastructure. Our ideal candidate is able to develop applications on their own, but is more eager to accelerate the whole team by building systems that improve performance and operational efficiency.
Ultimately, you should be involved in all stages of software development to define and improve our SLOs, SLAs & SLIs.
Our current tech stack includes:
GCP, Terraform, Kubernetes, Helm, ArgoCD, GitLab CI/CD, Grafana LGTM
【Key Responsibilities】
1. Designing & implementing infrastructure for collecting metrics, crunching data and improving service monitoring to detect problems before they're visible to our customers.
2. Building systems to automate our server lifecycle, from configuration management, CI/CD to server bootstrap and decommission.
3. Troubleshooting, performing root cause analysis, and resolving production issues from the application and network layers all the way down to the system level.
4. Participating in solution design and advising other developers when building new features so that they're scalable, maintainable, and performant.
5. Improving the observability of our applications through monitoring, alerting, logging, tracing and profiling, and building such observability features into a common platform.
6. Practicing sustainable incident response and blameless postmortems.
7. Proactively identifying and reducing issues through design, testing, and implementation of software-based solutions.
More info: https://www.ikala.ai
Driving with us to the Next!
"Integration of various energy sources, improvement in energy efficiency, and creation of a powerful platform that benefits everyone"
【Job Description】
We are looking for an SRE who can seamlessly integrate development artifacts with cloud resources. The candidate needs hands-on experience with public cloud platforms and with container technologies. We want a highly self-motivated engineer to join us and build operational environments that support everything from customer service to development. Daily tasks may include exploring the latest technologies to adopt for solving business problems.
【Core Responsibilities】
• Work closely with engineering teams to identify and implement optimal cloud-based solutions for the company.
• Build and maintain agile, responsive, container-native CI/CD pipelines (Jenkins / ArgoCD), and support multiple development teams in delivering high-quality builds with measurable performance
• Build, maintain, improve, scale and secure cloud infrastructure and resources using IaC tools (Terraform / Pulumi), with cost in mind
• Build automation tools that improve system observability, availability and reliability using Python and serverless solutions (AWS Lambda, Kubernetes Jobs)
• Design, manage and monitor Kubernetes clusters for multiple production workloads
• Participate in an on-call rotation to mitigate disruption to production systems, and write root cause analysis reports
• Plan and test disaster recovery scenarios and business continuity plans for a highly available micro-services architecture
• Develop and implement security policies in compliance with ISO 27001/27017 standards, including access control, encryption and logging
• Build central dashboards and alerting mechanisms to identify potential resource problems
• Handle production issues intelligently
【Essential Qualifications】
• Bachelor's degree in a computer-related program
• 3 years of experience in AWS cloud management
• 3 years of experience in Kubernetes management
• 3 years of experience in CI/CD (Jenkins)
• 3 years of experience in networking or databases (PostgreSQL, Cassandra, Redis)
• 2 years of experience with observability tooling (Prometheus, Grafana, InfluxDB, OpenSearch, ELK)
• 3 years of experience in Linux
• Performance tuning, error handling and root cause analysis
• Must be available for on-call duty
【Desirable Abilities】
• AWS-related certification
• CKA, CKAD, CKS
[Job Overview]
We are seeking a skilled and passionate Site Reliability Engineer with a strong technical background and excellent communication skills. This individual will lead the development, construction, and management of reliable and distributed systems that support our business operations.
In this role, you will play a vital part in supporting the "OneDegree HK" businesses, a user-friendly digital insurance platform for individuals and businesses in Hong Kong.
Learn more about the SRE team at OneDegree: https://medium.com/onedegree-tech-blog
[Responsibilities]
· Implement and enhance system reliability, availability, scalability, performance, and efficiency by leveraging monitoring, alerting, and automation tools on public cloud platforms.
· Participate in capacity planning, analyze software performance, and fine-tune systems to ensure optimal operation.
· Develop and enhance the GitLab CI/CD process and toolset to streamline software delivery and deployment.
· Define and monitor key metrics to assess and enhance system reliability.
· Collaborate closely with the engineering team to improve reliability and operational efficiency at every software development life cycle (SDLC) stage.
· Troubleshoot and optimize infrastructure, and automate repetitive tasks to increase efficiency and effectiveness.
[Requirements]
· Proficiency in programming languages such as Bash, Python, or Go.
· Advanced knowledge of monitoring solutions like Prometheus, Grafana, ELK (Elasticsearch, Logstash, Kibana).
· Strong expertise and experience in cloud technologies, specifically Azure and GCP.
· Experience in the complete software development life cycle (SDLC).
· In-depth understanding of network concepts, particularly with a focus on security.
· Hands-on experience implementing CI/CD processes, for example, using GitLab CI.
· Proficiency in automation platforms like Ansible and Terraform.
· Knowledge of orchestration tools like Kubernetes.
· Familiarity with container technologies like Docker.
· Experience with source code version control systems such as Git.
· Strong problem-solving skills with a systematic approach, effective communication abilities, and a self-driven attitude.
About
Want to build a worldwide brand from Taiwan, and to communicate our brand story to millions of users worldwide?
Want to be based in Taiwan but work in a Silicon Valley-like environment, and to build a world-class brand and products?
Want to participate in the global fintech and blockchain movement, and work at an English-speaking workplace?
Come change the world with us! Join this fast-growing startup founded by software veterans and funded by top VCs, Skype co-founders, and the Taiwanese government (NDF)!
We’re hiring an experienced Senior SRE Engineer. The exact mix of other skills does not matter, so long as your tool chest includes a mix of abilities. Be willing to attack anything that comes your way, learn on the fly and get things done. Come talk to us if you want to push your skillset in a dynamic, fast-paced environment.
Responsibilities
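1. Handle day-to-day operations of the AWS production platform and ensure stable 24/7 operation, including system monitoring, application monitoring, log monitoring, component upgrades, security incident response, cost control, and resource management and allocation.
2. Handle, troubleshoot, track, and report issues and emergencies on the AWS production platform.
3. Analyze system bottlenecks and optimize architecture and performance.
4. Build and maintain the Zabbix, Nagios, and ELK monitoring systems, handle and tune alerts, and write custom scripts as needed to track system status.
5. Assist with high-availability deployment, backup, and troubleshooting of application systems and databases.
6. Work with the backend and product teams on architecture, including server setup.
7. Prepare regular reports and maintain incident records.
8. Help resolve internal IT issues, and build and maintain the internal IT environment.
9. Participate in on-call duty as scheduled by the company.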
Requirements
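1. 5+ years of experience using, administering, and operating Linux systems.
2. Experience with 24/7 operations is a plus.
3. Familiarity with the AWS cloud platform, such as: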
AWS EC2
AWS AppSync
AWS API Gateway
AWS Networking (firewalls and routing)
AWS VPC permissions and routing
AWS Lambda functions
AWS Aurora (MySQL but cloud-based)
AWS ElastiCache (specifically Redis)
AWS CloudFront
AWS CloudWatch
AWS Security and protection systems
AWS EKS
AWS IAM
AWS Parameter Store and Secrets Manager
Amazon Simple Notification Service
Dashboard systems such as Grafana
Scripting languages such as Bash script, Python, Golang
Container systems such as Kubernetes
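4. Familiarity with automated configuration management tools: Terraform, Helm, Kustomize.
5. Familiarity with system administration, network administration, monitoring, issue tracking, and troubleshooting in Linux environments.
6. Familiarity with CI/CD pipelines such as Jenkins, GitHub Actions, Argo Workflows, and ArgoCD.
7. Familiarity with Airflow, including setting it up and writing DAGs.
8. Familiarity with large-scale website architecture, and with deploying, backing up, restoring, tuning, and optimizing EKS, MongoDB, Kafka, and related applications, including web servers, databases, traffic management, load balancing, message queues, and high-availability solutions.
9. Relevant information security knowledge.
10. Good communication and teamwork skills, strong resilience under pressure, a strong ability to learn, and the ability to find and solve problems independently and efficiently.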
Location: Taipei
https://goo.gl/maps/vC7WxAurcZVWwCCNA
About XREX
https://www.xrex.io/
Culture
https://downloads.xrex.io/culture
We will process your application first if you apply online:
https://xrex.breezy.hr/p/16372b5314e8
【Job Description】
-Ability to use configuration management tools and revision control system (e.g., Git)
-Experience with CI/CD & Automation systems (e.g., Jenkins)
-Experience with AWS Core Services: EC2 / ELB / S3 / CloudFront / IAM / VPC, AWS SDK and CLI
-Experience building and operating container-based platforms with Nomad / Consul / Kubernetes
-Experience with monitoring, alerting, and log pipeline analysis tools (Graylog2, ELK, Prometheus, etc.)
【Job Requirements】
-The successful candidate will be a self-driven Senior DevOps Engineer with proven experience in large-scale microservice systems hosted on AWS
-The candidate will have a deep understanding of cloud architecture, AWS technologies, and cloud security best practices
-The candidate will be following the latest industry trends and be passionate about cloud computing for large-scale systems
Roles:
1. Develop and maintain complex infrastructure consisting of many microservices on AWS.
Minimum qualifications:
1. 3+ years of experience working in a DevOps role.
2. Experience in a language commonly used in DevOps, such as Python, Go, or Perl.
3. Strong Bash scripting skills.
4. Knowledge of implementing infrastructure as code in Terraform.
5. Experience in developing clean and maintainable infrastructure in a cloud environment, preferably AWS.
Preferred Qualifications:
1. Knowledge of security best practices.
2. Experience with various CI/CD technologies such as GHA and Docker.
3. Experience scaling large databases and data-intensive applications.
4. Expertise in designing fault-tolerant infrastructure.