Site Reliability Engineer - OKX

Hong Kong
Last update 2025-03-15
Expires 2025-04-15
ID #2652581321
Modified March 12, 2025

Description

What You’ll Be Doing:

Ensure the stability of, and optimize, big data platforms (Alibaba Cloud DataWorks, AWS EMR, Databricks on AWS, Spark, Flink) and data warehouses (MaxCompute, Hologres, Hive, ClickHouse, StarRocks, etc.).

Deeply understand the architecture and principles of middleware (Kafka, Spring Cloud, Nacos, Apollo, Kong Gateway, etc.), ensuring high performance and availability.

Effectively optimize existing runtime environments (KVM, Docker, K8s, JVM, etc.) to ensure efficient resource utilization and stable service operation.

Understand network architecture and security, and provide guidance on infrastructure stability across the network and security layers to ensure secure, stable, and efficient network communications.

Lead chaos engineering exercises, coordinating with business units to validate system robustness and recovery capabilities through simulated failure scenarios.

Participate in rapid response and troubleshooting of system failures, and continuously optimize monitoring strategies to reduce system downtime and ensure service continuity and stability.

Drive infrastructure automation and intelligence to improve SRE work efficiency and quality.

Collaborate closely with development teams, providing technical support and advice on infrastructure to jointly promote continuous product improvement and innovation.

What We Look For In You:

Bachelor's degree or above in Computer Science or a related field, with 8+ years of experience in large-scale internet or cloud computing platform development/SRE/operations.

In-depth understanding of the principles and architectures of big data platforms, data warehouses, middleware, runtime environments, and network technology, with extensive hands-on experience and troubleshooting skills.

Proficient in Linux system management and optimization, familiar with scripting languages such as Shell/Python, able to write automation tools and scripts.

Familiar with container and cloud-native technologies like KVM, Docker, and K8s, including their architectures and principles, with extensive experience in handling common issues and failures.

Familiar with network protocols such as TCP/UDP/QUIC, proficient in using network commands like tcpdump, traceroute, and netstat, and tools like Wireshark, with extensive hands-on experience in troubleshooting common network issues.

Extensive experience with Alibaba Cloud and AWS products, from architecture to usage, including hands-on handling of common issues and failures.

Candidates with experience in building service governance systems, architecture optimization, stability assurance, capacity management, support for high-traffic events, and chaos engineering are preferred.

Strong sense of responsibility and team spirit, with excellent problem-solving and analytical skills.

Must have Chinese communication skills; proficiency in both Chinese and English communication is preferred.

Perks & Benefits:

Competitive total compensation package

L&D programs and education subsidy for employees' growth and development

Various team building programs and company events

Wellness and meal allowances

Comprehensive healthcare schemes for employees and dependants

More that we'd love to tell you along the process!

Job details:

Job type: Full time
Contract type: Permanent
Salary type: Monthly
Occupation: Site Reliability Engineer - OKX
