feat(slurm): Add native Prometheus OpenMetrics telemetry for Slurm >= 25.11 #5824

Draft
AdarshK15 wants to merge 9 commits into GoogleCloudPlatform:develop from AdarshK15:slurm-prometheus

@AdarshK15

Member

Summary

This PR introduces native OpenMetrics (Prometheus) telemetry support to the Slurm controller module (schedmd-slurm-gcp-v6-controller), exporting Slurm metrics directly to Google Cloud Monitoring.

For Slurm clusters running version 25.11 or later, this implementation uses Slurm's native metrics/openmetrics plugin as the Prometheus exporter. The slurmctld daemon serves these metrics itself, so they are collected and exported from the controller instance only.

List of supported endpoints:

$ curl localhost:6818/metrics
slurmctld index of metrics endpoints:
  '/metrics/jobs': get job metrics
  '/metrics/nodes': get node metrics
  '/metrics/partitions': get partition metrics
  '/metrics/jobs-users-accts': get user and account jobs metrics
  '/metrics/scheduler': get scheduler metrics

To learn more about these endpoints, see the SchedMD Metrics Documentation.
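
As a quick sanity check on a controller, the index above can be parsed into a list of endpoint paths. The following is a minimal sketch only: the helper name `parse_metrics_index` and the regular expression are illustrative assumptions and are not part of this PR. The sample text is copied from the curl output above.

```python
import re

# Sample index text, copied from the curl output in this description.
INDEX_TEXT = """\
slurmctld index of metrics endpoints:
  '/metrics/jobs': get job metrics
  '/metrics/nodes': get node metrics
  '/metrics/partitions': get partition metrics
  '/metrics/jobs-users-accts': get user and account jobs metrics
  '/metrics/scheduler': get scheduler metrics
"""

# Matches lines of the form:  '/metrics/<name>': <description>
_ENDPOINT_RE = re.compile(r"^\s*'(/metrics/[^']+)':\s*(.+)$")


def parse_metrics_index(text: str) -> dict:
    """Return a mapping of endpoint path -> description (hypothetical helper)."""
    endpoints = {}
    for line in text.splitlines():
        match = _ENDPOINT_RE.match(line)
        if match:
            endpoints[match.group(1)] = match.group(2).strip()
    return endpoints


if __name__ == "__main__":
    for path, description in parse_metrics_index(INDEX_TEXT).items():
        print(f"{path}: {description}")
```

The header line of the index does not match the pattern and is skipped, so only the five sub-endpoints are returned.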

Targeted Scraping

Because the root /metrics endpoint returns a plain-text index instead of raw OpenMetrics payloads, the Google Cloud Ops Agent cannot scrape it directly. To handle this, we explicitly configure distinct scrape jobs to target the sub-endpoints individually.
Additionally, we intentionally omit scraping the /metrics/jobs-users-accts endpoint. This endpoint duplicates job metrics broken down by every user and account, which introduces excessively high metric cardinality and bloat without adding significant value to cluster-level observability.
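
The targeted scrape setup described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in setup.py: the function name `build_prometheus_receiver`, the `slurm_<endpoint>` job names, and the 60s scrape interval are all hypothetical. The overall shape (a receiver of type `prometheus` wrapping standard Prometheus `scrape_configs`) follows the Ops Agent's Prometheus receiver format.

```python
# Sub-endpoints to scrape. /metrics/jobs-users-accts is deliberately left out
# because of its per-user and per-account cardinality.
SCRAPED_ENDPOINTS = ["jobs", "nodes", "partitions", "scheduler"]


def build_prometheus_receiver(target: str = "localhost:6818",
                              scrape_interval: str = "60s") -> dict:
    """Build a Prometheus receiver with one scrape job per sub-endpoint.

    Job names and the default scrape interval are assumptions for illustration.
    """
    scrape_configs = [
        {
            "job_name": f"slurm_{name}",
            "scrape_interval": scrape_interval,
            # Each job targets one sub-endpoint instead of the root /metrics index.
            "metrics_path": f"/metrics/{name}",
            "static_configs": [{"targets": [target]}],
        }
        for name in SCRAPED_ENDPOINTS
    ]
    return {"type": "prometheus", "config": {"scrape_configs": scrape_configs}}


if __name__ == "__main__":
    for job in build_prometheus_receiver()["config"]["scrape_configs"]:
        print(job["job_name"], job["metrics_path"])
```

Keeping the endpoint list in one place makes the omission of /metrics/jobs-users-accts explicit and easy to revisit later.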

Key Changes

  • Terraform Variables: Added the enable_openmetrics boolean variable to the controller and slurm_files modules.
  • Slurm Configuration: Updated conf_v2511.py to conditionally inject MetricsType=metrics/openmetrics into the slurm.conf generation logic.
  • Ops Agent Pipeline: Updated setup.py to dynamically inject the Prometheus receiver into /etc/google-cloud-ops-agent/config.yaml, configuring the 4 targeted scrape jobs.
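
For reviewers, the conditional slurm.conf injection can be pictured with the sketch below. It is not the code in conf_v2511.py: `parse_slurm_version` and `openmetrics_conf_lines` are hypothetical names, and returning no lines for older Slurm versions (rather than failing) is an assumption. Only the `MetricsType=metrics/openmetrics` line and the 25.11 minimum come from this PR.

```python
def parse_slurm_version(version: str) -> tuple:
    """Parse 'MAJOR.MINOR[.PATCH]' into (major, minor), e.g. '25.11.1' -> (25, 11)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def openmetrics_conf_lines(enable_openmetrics: bool, slurm_version: str) -> list:
    """Return the slurm.conf lines to add for OpenMetrics, or [] if not applicable.

    Assumption: versions below 25.11 are skipped silently; the real module
    may choose to raise an error instead.
    """
    if not enable_openmetrics:
        return []
    # Compare as integer tuples so that 25.05 sorts before 25.11.
    if parse_slurm_version(slurm_version) < (25, 11):
        return []
    return ["MetricsType=metrics/openmetrics"]


if __name__ == "__main__":
    print(openmetrics_conf_lines(True, "25.11"))
    print(openmetrics_conf_lines(True, "25.05"))
```

Comparing parsed integer tuples avoids the pitfalls of comparing version strings directly.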

Usage Example

To enable this feature, set enable_openmetrics to true on the controller module in your blueprint. Note that this requires a Slurm version of 25.11 or higher.

- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
  settings:
    enable_openmetrics: true

@AdarshK15 added the release-key-new-features label (Added to release notes under the "Key New Features" heading.) Jun 22, 2026
@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This PR enhances observability for Slurm clusters by integrating native Prometheus OpenMetrics telemetry for versions 25.11 and above. It introduces a modular configuration generation system that allows for version-specific logic, facilitating the addition of experimental features like asynchronous replies and improved metric scraping via the Google Cloud Ops Agent.

Highlights

  • Telemetry Support: Added native Prometheus OpenMetrics telemetry support for Slurm clusters version 25.11 and higher.
  • Configuration Refactoring: Implemented a new SlurmConfigGenerator class hierarchy to handle version-specific configuration generation, enabling cleaner support for new features like OpenMetrics and experimental settings.
  • Infrastructure Changes: Introduced enable_openmetrics and experimental Terraform variables to control these new features within the controller and slurm_files modules.
  • Observability: Updated the Ops Agent configuration to target specific Slurm metrics endpoints, ensuring compatibility with Google Cloud Monitoring.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request refactors the Slurm configuration generation logic into an object-oriented structure to support multiple Slurm versions, introducing specific generators for Slurm 24.11, 25.05, and 25.11. It also adds support for native Prometheus OpenMetrics telemetry via Slurm and Google Cloud Ops Agent, and introduces an experimental enable_async_reply setting. The review feedback highlights a potential KeyError in setup.py when initializing nested dictionaries for the Ops Agent configuration, suggesting a safer initialization approach.

Comment on lines +699 to +700
        if "metrics" not in file:
            file["metrics"] = {"receivers": {}, "service": {"pipelines": {}}}

medium

If the metrics key already exists in the Ops Agent configuration file but does not contain receivers or service, directly accessing file["metrics"]["receivers"] or file["metrics"]["service"]["pipelines"] will raise a KeyError. We should safely initialize these nested dictionaries to prevent potential runtime crashes.

        if "metrics" not in file or not isinstance(file["metrics"], dict):
            file["metrics"] = {}
        metrics = file["metrics"]
        if "receivers" not in metrics or not isinstance(metrics["receivers"], dict):
            metrics["receivers"] = {}
        if "service" not in metrics or not isinstance(metrics["service"], dict):
            metrics["service"] = {}
        service = metrics["service"]
        if "pipelines" not in service or not isinstance(service["pipelines"], dict):
            service["pipelines"] = {}
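
The same idea can be factored into a small helper so the repeated checks live in one place. This is a sketch only: `ensure_dict` and `ensure_metrics_sections` are hypothetical names, not functions in setup.py. Like the suggestion above, it replaces missing or non-dict values with an empty dict, which also covers a YAML key with an empty value (loaded as None).

```python
def ensure_dict(parent: dict, key: str) -> dict:
    """Return parent[key], first replacing it with {} if missing or not a dict."""
    value = parent.get(key)
    if not isinstance(value, dict):
        value = {}
        parent[key] = value
    return value


def ensure_metrics_sections(config: dict) -> tuple:
    """Make sure metrics.receivers and metrics.service.pipelines exist.

    Returns (receivers, pipelines) so callers can add entries directly.
    """
    metrics = ensure_dict(config, "metrics")
    receivers = ensure_dict(metrics, "receivers")
    pipelines = ensure_dict(ensure_dict(metrics, "service"), "pipelines")
    return receivers, pipelines


if __name__ == "__main__":
    # "metrics" exists, but "receivers" is empty (None) and "service" is missing.
    config = {"metrics": {"receivers": None}, "logging": {"level": "info"}}
    receivers, pipelines = ensure_metrics_sections(config)
    receivers["example_receiver"] = {"type": "prometheus"}
    print(config)
```

Existing dict values are returned as-is, so any receivers or pipelines already present in the Ops Agent configuration are preserved.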
