DevOps Engineer

Vetric · senior

0/3 labs · 0/3 designs

Hands-on labs

3 scenarios

Bash: EKS Fleet Health Reporter · easy

Write a Bash tool that audits a Kubernetes cluster and produces a health report suitable for a first-DevOps-hire onboarding audit. No cluster access needed — the lab ships canned kubectl JSON fixtures you parse with jq.
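The lab itself asks for Bash and jq. As a language-neutral illustration of the parsing logic only, here is a minimal Python sketch. It assumes the fixture has the shape of `kubectl get pods -A -o json` output; the inline fixture, the pod names, and the `unhealthy_pods` helper are all hypothetical and not taken from the lab.

```python
import json

# Hypothetical fixture, shaped like `kubectl get pods -A -o json` output.
FIXTURE = """
{
  "items": [
    {"metadata": {"namespace": "scrapers", "name": "crawler-0"},
     "status": {"phase": "Running", "containerStatuses": [
        {"name": "crawler", "restartCount": 0, "state": {"running": {}}}]}},
    {"metadata": {"namespace": "scrapers", "name": "crawler-1"},
     "status": {"phase": "Running", "containerStatuses": [
        {"name": "crawler", "restartCount": 7,
         "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    {"metadata": {"namespace": "kube-system", "name": "coredns-abc"},
     "status": {"phase": "Pending"}}
  ]
}
"""


def unhealthy_pods(pod_list):
    """Return (namespace, name, reason) tuples for pods that look unhealthy."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        status = pod.get("status", {})
        # A container stuck in a waiting state carries a reason such as
        # CrashLoopBackOff or ImagePullBackOff.
        reasons = [
            cs["state"]["waiting"].get("reason", "Waiting")
            for cs in status.get("containerStatuses", [])
            if "waiting" in cs.get("state", {})
        ]
        # With no waiting containers, fall back to the pod phase.
        if not reasons and status.get("phase") not in ("Running", "Succeeded"):
            reasons = [status.get("phase", "Unknown")]
        for reason in reasons:
            findings.append((meta["namespace"], meta["name"], reason))
    return findings


report = unhealthy_pods(json.loads(FIXTURE))
for namespace, name, reason in report:
    print(f"{namespace}/{name}: {reason}")
```

With the fixture above, the healthy pod is skipped and the two problem pods are listed with their reasons. A jq filter over the same fields would follow the same structure.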

Terragrunt: Multi-Env EKS Module · medium

Design a Terragrunt layout that stamps out a shared EKS module across dev/staging/prod without copy-pasted HCL. Uses OpenTofu and a LocalStack-compatible AWS provider so `tofu plan` works offline.

KEDA: Scale Scrapers on Queue Depth · hard

Deploy a toy scraper workload in your k3s namespace that scales 0→N based on queue depth, mirroring Vetric's event-driven data-pipeline shape. Uses Prometheus and a KEDA ScaledObject.
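The scaling arithmetic behind a queue-depth trigger can be sketched roughly as follows. This is a simplified model, not KEDA's own code: it assumes an average-value target (desired replicas = ceil(queue depth / target per replica)), treats an empty queue as the scale-to-zero condition, and ignores cooldown periods, stabilization windows, and tolerance. The function and parameter names are hypothetical.

```python
import math


def desired_replicas(queue_depth, target_per_replica, max_replicas, min_replicas=0):
    """Simplified model of queue-depth scaling (assumption, not KEDA source)."""
    # Empty queue: fall back to the minimum, which may be zero.
    if queue_depth <= 0:
        return min_replicas
    # Average-value target: each replica should handle `target_per_replica` items.
    wanted = math.ceil(queue_depth / target_per_replica)
    # Any backlog needs at least one replica; never exceed the ceiling.
    return max(1, min_replicas, min(wanted, max_replicas))


for depth in (0, 1, 950, 100_000):
    print(depth, desired_replicas(depth, target_per_replica=100, max_replicas=50))
```

The clamp to `max_replicas` is worth noticing in the last case: once the ceiling is reached, backlog keeps growing and only per-pod throughput or a higher ceiling will drain it.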

Design interviews

3 Job-Description-grounded scenarios

medium

Design the observability platform for Vetric's first real infra buildout: ~200 microservices scraping public web data, customers in 20+ countries, one small team on-call. What do metrics, logs, and traces look like end-to-end?

Directly matches the JD's monitoring stack list and the 'first DevOps hire builds the foundation' framing.

· inferred: ~200 services across EKS (JD says the team is 'large enough to require real infrastructure')
· inferred: 2–4 person on-call rotation, given the JD's 'sharp, focused team' language
· inferred: <$X/month obs budget (a bootstrapped, profitable company is cost-sensitive)
· Must integrate with Prometheus, Grafana, ELK, CloudWatch (JD-named stacks)
· +1 more

hard

Design the ingest pipeline for a new scraper tier that collects 200k public posts/sec across 20+ countries, feeding cybersecurity and public-safety customers who need ≤5min end-to-end latency.

Vetric's entire business is high-volume public-data ingest — this is the core architectural surface the hire will own.

· 200k events/sec sustained peak (inferred from 'massive scale' + 20 countries)
· ≤5 min P95 end-to-end latency to customer delivery
· Per-tenant isolation: a noisy cybersecurity customer cannot starve a public-safety one
· Must run on AWS EKS (JD-mandated stack)
· +2 more
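A quick back-of-the-envelope check of what these constraints imply, as a sketch. The 200k events/sec rate and the 5-minute budget come from the scenario; the 2,000-byte average event size is purely an assumption for illustration and should be replaced with a measured value.

```python
EVENTS_PER_SEC = 200_000        # from the scenario
LATENCY_BUDGET_SEC = 5 * 60     # from the scenario (≤5 min end-to-end)
AVG_EVENT_BYTES = 2_000         # ASSUMPTION: illustrative average payload size

SECONDS_PER_DAY = 86_400

# Daily volume at sustained peak.
events_per_day = EVENTS_PER_SEC * SECONDS_PER_DAY

# Little's law (L = arrival rate x time in system): the most events that can
# be inside the pipeline at once if every event uses the full latency budget.
max_in_flight = EVENTS_PER_SEC * LATENCY_BUDGET_SEC

# Raw ingest bandwidth and daily data volume under the assumed event size.
bytes_per_sec = EVENTS_PER_SEC * AVG_EVENT_BYTES
tb_per_day = bytes_per_sec * SECONDS_PER_DAY / 1e12

print(f"{events_per_day:,} events/day")
print(f"{max_in_flight:,} events in flight at most")
print(f"{bytes_per_sec / 1e6:.0f} MB/s, {tb_per_day:.2f} TB/day")
```

Under these numbers the pipeline handles about 17.28 billion events per day and may hold up to 60 million events in buffers at any moment, which is the figure that sizes queues and per-tenant quotas.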

hard

Vetric is bootstrapped-profitable and your AWS bill just hit a number leadership doesn't love. Redesign the EKS + scraping footprint to cut 40% of cloud spend without reducing throughput or SLA.

'Profitable from day one, fully bootstrapped' is the loudest culture signal in the JD — cost discipline is part of the job.

· 40% cost reduction target on current AWS spend
· Zero regression on customer-facing SLA (≤5min ingest latency)
· Must stay on EKS + AWS (JD-mandated)
· Bootstrapped company: no room for a 6-month rewrite, changes must land in <1 quarter
· +2 more

Troubleshooting drills

2 scenarios — run them as interactive practice

medium

An EKS node group keeps evicting scraper pods under memory pressure. How do you diagnose the root cause?

hard

Your Prometheus is OOMing at 80M active series. What's your first move?

Stack

9 mentioned · 2 inferred

· AWS (EKS, EC2, IAM, VPC)
· Kubernetes
· Terraform / OpenTofu / Terragrunt
· GitHub Actions / Jenkins / ArgoCD
· Prometheus / Grafana / ELK / CloudWatch
· Bash / Python / JavaScript
· Amazon ECS
· KEDA / event-driven autoscaling
· GitHub
· Kafka or similar streaming
· Helm / Kustomize

Likely questions

behavioral · medium
You're the first DevOps hire at Vetric. What do you ship in your first 30 days?

troubleshooting · medium
An EKS node group keeps evicting scraper pods under memory pressure. How do you diagnose the root cause?

architecture · medium
Why would you pick OpenTofu + Terragrunt over vanilla Terraform for a multi-account AWS setup?

architecture · hard
Design an ArgoCD layout for 3 environments and 40 microservices with per-env secrets. What's the repo structure?

systems · hard
A scraper deployment needs to scale from 0 to 500 pods based on Kafka lag. Which KEDA scaler and what are the pitfalls?

scripting · easy
Write a Bash one-liner to find all EKS pods in CrashLoopBackOff across every namespace.

behavioral · medium
Engineers push straight to main and CI is flaky. How do you roll out branch protection without stalling delivery?

security · medium
What IAM boundary would you enforce so a scraper pod can read one S3 prefix but nothing else?

troubleshooting · hard
Your Prometheus is OOMing at 80M active series. What's your first move?

architecture · hard
How would you run a stateful scraping workload on Spot instances without losing in-flight jobs?

Culture

· Bootstrapped and profitable — expect cost-consciousness and long-term thinking over hypergrowth spend.
· First DevOps hire with full technical authority — candidate must be opinionated and self-directed.
· Small sharp team, infrastructure-matters tone — engineering discipline valued over ticket-churn.
· Mission-critical customers (cybersecurity, public safety) — reliability and uptime carry real weight.
· Global customer base across 20+ countries — implies 24/7 reliability expectations and likely on-call.
· Tel Aviv-based role (not flagged remote): assume on-site or hybrid collaboration.

From the bank

3 for this stack

Write a Bash one-liner that tails every pod's logs in a namespace and highlights any line containing 'ERROR' or '5xx'.
Tell me about a time a deploy went wrong in production. What was the blast radius, how did you recover, and what did you change afterwards?
A customer reports that one of our internal Kubernetes-hosted services is returning 502s intermittently. Walk me through how you'd investigate, starting from zero context.

Original job description

DevOps Engineer
Engineering · Tel Aviv, Israel · Senior · Full-time
Description
What is Vetric?

Vetric builds large-scale public data infrastructure.

We provide data pipelines that collect, structure, and deliver high-volume public web data for mission-critical companies operating in cybersecurity, public safety and digital risk protection.

Our systems power platforms that detect bad actors, uncover impersonation and fraud, identify coordinated manipulation, and help public safety organizations respond faster to real-world risks.

We don’t build dashboards, and we don’t sell surface-level insights.

We build stable, production-grade data flows that become part of our customers’ core products, with real impact: saving lives and protecting large, well-known organizations from bad actors.

Operating globally, we serve industry leaders across more than 20 countries who rely on us for scale, reliability, and depth.



Why Vetric?

Vetric is profitable from day one (fully bootstrapped - we haven’t raised external funding), and we’re building foundational technology - not chasing trends. Because this is infrastructure that matters, we operate with engineering discipline, strong ownership, and long-term thinking.

We’re at a true inflection point: the team is now large enough to require real infrastructure, yet still small enough that what you build will define how things work for the next several years.

This is infrastructure that matters, and so does how we operate internally. You’ll be working with a sharp, focused team that takes engineering discipline seriously and is intentionally building an organization that matches the quality of its product.



Position Overview

We are seeking a DevOps Engineer to lead and own the entire DevOps function at Vetric. 

As our first DevOps hire, you won’t just maintain systems, you will set the vision, establish best practices, and build the foundation of our infrastructure strategy for years to come. This is a unique opportunity to step into an impactful role with full technical authority, influencing architectural decisions and guiding how our engineering teams deliver, scale, and secure our large-scale, data-intensive platform.



Responsibilities:

Define and drive Vetric’s infrastructure strategy across all environments
Architect and operate Kubernetes clusters at production scale with a focus on reliability, resilience, and data-heavy workloads
Lead the adoption of Infrastructure as Code (Terraform, OpenTofu, Terragrunt) and establish automation standards
Implement modern CI/CD pipelines (GitHub Actions, Jenkins, ArgoCD, or similar)
Champion observability, monitoring, and reliability engineering practices
Build and optimize infrastructure that powers large-scale, data-driven pipelines at massive scale
Serve as the technical authority for all DevOps matters, influencing and aligning engineering teams
Partner with engineering leadership to shape infrastructure roadmaps and technology choices
Requirements
Qualifications:

5+ years of deep, hands-on AWS experience (EKS, EC2, networking, IAM, scaling)
Proven success in senior DevOps / Cloud Engineering leadership roles
Expert knowledge of Terraform and modern IaC tools (OpenTofu, Terragrunt)
Strong Kubernetes expertise at scale (design, scaling, optimization)
Experience running high-scale, production-grade environments handling large data volumes
Excellent communication skills with the ability to influence, guide, and align teams
Solid scripting/automation skills (Bash, JavaScript, Python, or similar)
Familiarity with cloud-native monitoring & logging stacks (Prometheus, Grafana, ELK, CloudWatch, etc.)


We’d be lucky if you have:

Experience with Amazon ECS
Proficiency with GitHub or similar platforms (GitLab, Bitbucket, etc.)
Exposure to event-driven architectures and autoscaling frameworks (KEDA or similar)