We are seeking an experienced and passionate Senior Site Reliability Engineer (SRE) to join our team. In this pivotal role, you will be responsible for designing, building, maintaining, and optimizing our growing and highly complex infrastructure, ensuring its stability, scalability, and performance. As a senior member, you will lead technical direction, mentor junior engineers, and collaborate closely with cross-functional teams to continuously improve service quality and system resilience.

<Responsibilities>
1. Lead Infrastructure Design, Implementation, and Strategy: Architect and drive the evolution of scalable, highly available, and cost-effective infrastructure solutions.
2. Spearhead Monitoring and Alerting Systems: Design, implement, and continuously optimize comprehensive monitoring, logging, and alerting systems to ensure proactive issue detection and rapid root cause analysis.
3. Conduct In-depth System Performance Analysis and Optimization: Monitor and analyze various resource metrics and overall system health, identify bottlenecks, and lead the implementation of advanced performance tuning solutions.
4. Enhance System Stability and Incident Response Capabilities: Develop and execute preventive maintenance strategies, lead major incident troubleshooting and emergency response efforts, ensuring service continuity. Design and conduct disaster recovery drills.
5. Drive Automation and Efficiency Improvements: Proactively identify and implement automation solutions for infrastructure operations, deployment, scaling, and other areas to enhance team efficiency.
6. Facilitate Cross-Team Collaboration and Architectural Improvements: Serve as an SRE technical expert, working closely with development, QA, and other teams to provide architectural recommendations from a reliability perspective, driving continuous improvements in service quality.
7. Establish Operational Standards and Processes: Design, establish, and promote standardized processes and best practices for system maintenance, deployment, upgrades, and change management.
8. Mentor and Train: Guide and mentor junior SRE engineers within the team, sharing knowledge and experience to elevate the team's overall technical proficiency.
9. Evaluate and Adopt New Technologies: Assess, test, and introduce new technologies, tools, and methodologies to enhance infrastructure efficiency, reliability, and security.
10. Oversee Capacity Planning and Cost Optimization: Perform capacity planning, forecast resource requirements, and identify opportunities for infrastructure cost optimization.
Salary negotiable
(Regular monthly salary of NT$40,000 or above)
No restriction
<Must Have>
- 6+ years of experience in SRE, DevOps, or Systems Administration roles, with experience operating large-scale distributed, high-traffic systems.
- Proficiency in Linux operating system administration, performance tuning, and troubleshooting. Deep understanding of containerization technologies (Docker, containerd) and Kubernetes (K8s) management, scheduling, networking, and storage. Experience designing, deploying, and maintaining complex K8s clusters.
- Extensive experience in building, integrating, and maintaining monitoring systems (e.g., Prometheus, Grafana) and log aggregation systems (e.g., ELK Stack, Loki).
- Proficiency in using and managing at least two major public cloud platforms (AWS, GCP, Azure) with practical deployment experience.
- Skilled in using one or more IaC tools (e.g., Terraform) with practical experience managing production environments.
- Solid programming skills in at least one programming or scripting language (e.g., Python, Go, Bash) for developing automation tools.
- In-depth understanding of network architecture (TCP/IP, HTTP/HTTPS, DNS, CDN, Load Balancing) and troubleshooting.
- Experience deploying, managing, and performing basic tuning for RDBMS and NoSQL database clusters (e.g., PostgreSQL, MySQL, Redis, MongoDB).
- Experience designing, building, and maintaining CI/CD pipelines using tools such as GitLab CI and GitHub Actions.
- Excellent problem-analysis and resolution skills, capable of independently handling complex technical challenges.
- Excellent cross-team communication and coordination skills, and a strong spirit of teamwork.
- Experience collaborating with development and product teams to help drive product improvements from a reliability and operability perspective.

<Nice to have>
- Experience with multiple public cloud providers.
- Familiarity with Service Mesh technologies (e.g., Istio, Linkerd).
- Familiarity with the principles and practices of Distributed Tracing systems (e.g., Jaeger, Tempo, OpenTelemetry).
- Familiarity with DevOps Research and Assessment (DORA) metrics or other tools for measuring software delivery and operational performance.
- Deeper experience in database administration, performance optimization, or architectural design.
- Familiarity with big data processing or dataflow-related systems and technologies (e.g., Google BigQuery, Dataflow, Apache Spark, Apache Kafka).
- Proficiency in leveraging AI-powered tools to enhance productivity and accelerate tasks within the SRE domain (e.g., AI-assisted coding, automated analysis, documentation generation).
- Experience in network security and infrastructure hardening. Familiarity with advanced configuration and performance tuning of web servers (Nginx, Apache).
- Experience mentoring or leading junior engineers. Contributions to open-source projects.
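Gamania offers market-competitive compensation, along with benefits designed around the needs of its employees, to attract professionals and high-potential talent in every field. Taking Taiwan as an example, the benefits include:
1. Market-competitive salary
2. Year-end bonus of 2 months' salary, plus an operational profit-sharing bonus
3. Labor insurance, national health insurance, and group insurance
4. Employee travel leave and travel subsidy
5. Birthday leave, prenatal checkup leave, grand-tour leave (壯遊假), paternity leave, and leave to accompany a spouse or partner to prenatal checkups
6. Cash gifts for birthdays and the three major festivals
7. Subsidies for weddings, funerals, and other family occasions
8. Free annual health checkup
9. Daily meal subsidy at 普橘島
10. A wide range of club activities
11. A 24-hour leisure and sports center with full-time fitness coaches
12. A fun and safe early-learning space for young children, 幼橘園
13. E-learning platform
14. Company-wide leisure activities, including the year-end party, family day, and festival events
15. A bright, spacious, and modern office environment
For more information about Gamania Taiwan, visit www.gamania.com.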