Site Reliability Engineer

NOV

Location: Houston

Posted date: 19 days ago

Minimum level: N/A

Employment type: Full-time

Job category: Engineering

JOB DESCRIPTION

As a Site Reliability Engineer, you will be responsible for:

Operational Excellence & Incident Management

- Maintain and monitor production systems for availability, latency, and performance.

- Lead incident response efforts, including communication, resolution, and postmortem documentation.

- Design and implement health checks, alerting systems, and automated remediation workflows.

- Drive root cause analysis and implement permanent resolutions for recurring issues.

Observability & Insights

- Set up and maintain full observability stacks (logging, metrics, tracing) using tools like Prometheus, Grafana, Datadog, OpenTelemetry, or ELK.

- Analyze telemetry and logs to identify trends, anomalies, and opportunities for improvement.

- Conduct post-incident reviews and use insights to inform future engineering investments.

Performance & Systems Optimization

- Tune and optimize distributed systems, including Akka.NET actors, for performance and resource efficiency.

- Work with developers to evolve architecture and improve system throughput, latency, and stability.

- Optimize PostgreSQL performance, queries, and maintenance strategies.

CI/CD & Automation

- Design and maintain modern CI/CD pipelines using GitHub Actions, Azure Pipelines, or GitLab CI.

- Automate deployment, testing, and rollback processes to reduce friction and increase deployment frequency.

- Standardize infrastructure as code practices across environments.

Education and Experience

- 5+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.

- Bachelor's degree in Information Technology, Computer Science, or a related field.

- Expertise in Kubernetes and container orchestration at scale.

- Strong experience with Akka.NET or similar actor-based frameworks.

- Proficiency with scripting and automation (Bash, PowerShell, Python).

- Experience with observability tools (Phobos, Datadog, Prometheus, Grafana, OpenTelemetry, ELK).

- Hands-on experience with cloud platforms (AWS, Azure, or GCP).

- Strong PostgreSQL knowledge: performance tuning, query optimization, and maintenance.

- Proven ability to lead incident management and drive postmortem processes.

- A builder's mindset with high standards for operational excellence and technical ownership.

Preferred Tools & Ecosystem Experience

- CI/CD: GitHub Actions, Azure Pipelines, GitLab CI

- Infrastructure: Kubernetes, Docker, Terraform

- Monitoring: Phobos (Akka.NET), Datadog, Prometheus

- Source Control: GitHub, GitLab, Azure DevOps

- Programming: C#, Python, Bash, PowerShell
