feat(slurm): Add native Prometheus OpenMetrics telemetry for Slurm >= 25.11#5824
AdarshK15 wants to merge 9 commits into
Summary of Changes

This PR enhances observability for Slurm clusters by integrating native Prometheus OpenMetrics telemetry for versions 25.11 and above. It introduces a modular configuration generation system that allows for version-specific logic, facilitating the addition of experimental features like asynchronous replies and improved metric scraping via the Google Cloud Ops Agent.
Code Review
This pull request refactors the Slurm configuration generation logic into an object-oriented structure to support multiple Slurm versions, introducing specific generators for Slurm 24.11, 25.05, and 25.11. It also adds support for native Prometheus OpenMetrics telemetry via Slurm and Google Cloud Ops Agent, and introduces an experimental enable_async_reply setting. The review feedback highlights a potential KeyError in setup.py when initializing nested dictionaries for the Ops Agent configuration, suggesting a safer initialization approach.
if "metrics" not in file:
    file["metrics"] = {"receivers": {}, "service": {"pipelines": {}}}
If the metrics key already exists in the Ops Agent configuration file but does not contain receivers or service, directly accessing file["metrics"]["receivers"] or file["metrics"]["service"]["pipelines"] will raise a KeyError. We should safely initialize these nested dictionaries to prevent potential runtime crashes.
if "metrics" not in file or not isinstance(file["metrics"], dict):
    file["metrics"] = {}
metrics = file["metrics"]
if "receivers" not in metrics or not isinstance(metrics["receivers"], dict):
    metrics["receivers"] = {}
if "service" not in metrics or not isinstance(metrics["service"], dict):
    metrics["service"] = {}
service = metrics["service"]
if "pipelines" not in service or not isinstance(service["pipelines"], dict):
    service["pipelines"] = {}
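The same defensive initialization could also be factored into a small helper, which may be easier to unit test. This is only a sketch: `ensure_dict` and `ensure_metrics_sections` are hypothetical names, not functions from this PR, and `config` stands in for the parsed Ops Agent configuration.

```python
def ensure_dict(parent: dict, key: str) -> dict:
    """Return parent[key] as a dict, replacing a missing or non-dict value."""
    value = parent.get(key)
    if not isinstance(value, dict):
        value = {}
        parent[key] = value
    return value


def ensure_metrics_sections(config: dict) -> dict:
    """Make sure metrics.receivers and metrics.service.pipelines exist.

    Hypothetical helper: existing dict entries are preserved, anything else
    (missing key, None, list, ...) is replaced with an empty dict.
    """
    metrics = ensure_dict(config, "metrics")
    ensure_dict(metrics, "receivers")
    service = ensure_dict(metrics, "service")
    ensure_dict(service, "pipelines")
    return metrics
```

With this shape, a config that already defines `metrics.receivers` keeps its entries, while a missing or malformed `service` section is filled in rather than raising a KeyError.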
Summary
This PR introduces native OpenMetrics (Prometheus) telemetry support to the Slurm controller module (schedmd-slurm-gcp-v6-controller), exporting Slurm metrics directly to Google Cloud Monitoring.

For Slurm clusters running version >= 25.11, this implementation leverages Slurm's native metrics/openmetrics plugin as the Prometheus exporter. The metrics are served natively by the slurmctld daemon, so they are collected and exported from the controller instance only.

List of supported endpoints:

To learn more about these endpoints, see the SchedMD Metrics Documentation.
Targeted Scraping
Because the root /metrics endpoint returns a plain-text index instead of raw OpenMetrics payloads, the Google Cloud Ops Agent cannot scrape it directly. To handle this, we explicitly configure distinct scrape jobs that target the sub-endpoints individually.

Additionally, we intentionally omit scraping the /metrics/jobs-users-accts endpoint. This endpoint duplicates job metrics broken down by every user and account, which introduces excessively high metric cardinality and bloat without adding significant value to cluster-level observability.

Key Changes
- enable_openmetrics: boolean variable for the controller and slurm_files modules.
- conf_v2511.py: conditionally injects MetricsType=metrics/openmetrics into the slurm.conf generation logic.
- setup.py: dynamically injects the Prometheus receiver into /etc/google-cloud-ops-agent/config.yaml, configuring the 4 targeted scrape jobs.

Usage Example
To enable this feature, set enable_openmetrics to true on the controller module in your blueprint. Note that this requires a Slurm version of 25.11 or higher.
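For reviewers who want a concrete picture of the targeted scraping described above, here is a minimal Python sketch of how one scrape job per sub-endpoint might be assembled into an Ops Agent Prometheus receiver. It is not the code from setup.py. The function names, receiver and pipeline names, the `localhost:6817` target, the scrape interval, and the example paths other than /metrics/jobs-users-accts are all assumptions made for illustration.

```python
# Hypothetical values for illustration only; the PR text does not list them.
EXAMPLE_TARGET = "localhost:6817"
EXAMPLE_PATHS = ["/metrics/jobs", "/metrics/nodes", "/metrics/jobs-users-accts"]

# Omitted on purpose: the PR description calls out its high cardinality.
EXCLUDED_PATHS = {"/metrics/jobs-users-accts"}


def build_scrape_configs(paths, target, interval="60s"):
    """Build one Prometheus scrape job per sub-endpoint, skipping excluded ones."""
    configs = []
    for path in paths:
        if path in EXCLUDED_PATHS:
            continue
        # "/metrics/jobs" -> "slurm_metrics_jobs"
        job = "slurm_" + path.strip("/").replace("/", "_").replace("-", "_")
        configs.append(
            {
                "job_name": job,
                "scrape_interval": interval,
                "metrics_path": path,
                "static_configs": [{"targets": [target]}],
            }
        )
    return configs


def add_prometheus_receiver(config, paths, target):
    """Attach a prometheus receiver and pipeline to a parsed Ops Agent config.

    Assumes any existing metrics/receivers/service/pipelines values are dicts.
    """
    metrics = config.setdefault("metrics", {})
    receivers = metrics.setdefault("receivers", {})
    pipelines = metrics.setdefault("service", {}).setdefault("pipelines", {})
    receivers["slurm_prometheus"] = {
        "type": "prometheus",
        "config": {"scrape_configs": build_scrape_configs(paths, target)},
    }
    pipelines["slurm_prometheus_pipeline"] = {"receivers": ["slurm_prometheus"]}
    return config
```

Each job gets its own `metrics_path`, so the agent never requests the root /metrics index. Serializing the resulting dictionary back to YAML is left out of the sketch.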