@gmarciani gmarciani commented Feb 4, 2026

Description of changes

In clustermgtd, we replace the explicit publishing of the heartbeat metric with writing the heartbeat as a clustermgtd event to its dedicated events log. This way, the metric is not created by clustermgtd; it is generated by the metric filter created by the CLI.

The advantages of this approach compared to the previous one are:

  1. It does not require the extra permission cloudwatch:PutMetricData.
  2. It does not require an extra VPC Endpoint for CloudWatch Monitoring in air-gapped environments.
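
To illustrate the idea of deriving the metric from the events log, here is a minimal local sketch that counts heartbeat events in a list of log lines. It only simulates what a metric filter would do; it is not the filter created by the CLI. The function name `count_heartbeats` and the event type `some-other-event` are hypothetical; the `clustermgtd-heartbeat` value is taken from the example event in this PR.

```python
import json

# Value taken from the example heartbeat event in this PR.
HEARTBEAT_EVENT_TYPE = "clustermgtd-heartbeat"


def count_heartbeats(log_lines):
    """Count the log lines that are JSON events of the heartbeat type (hypothetical helper)."""
    count = 0
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if isinstance(event, dict) and event.get("event-type") == HEARTBEAT_EVENT_TYPE:
            count += 1
    return count


sample_lines = [
    '{"event-type": "clustermgtd-heartbeat", "level": "INFO"}',
    '{"event-type": "some-other-event", "level": "WARNING"}',  # hypothetical event type
    "not a json line",
]
print(count_heartbeats(sample_lines))
```

Because the count is derived from log content, the process writing the log needs no metric-publishing permission of its own.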

Notes

In this PR we reverted ebb742f, the commit that introduced the approach of explicitly publishing the metric. Every deletion in the PR diff comes from this revert.

User Experience

Below is an example of a heartbeat event written to the clustermgtd events log /var/log/parallelcluster/clustermgtd.events.
The event follows the standard schema for events published by clustermgtd.

{
  "datetime": "2026-02-04T19:24:37.549+00:00",
  "version": 0,
  "scheduler": "slurm",
  "cluster-name": "clustermgtd-0204-2",
  "node-role": "HeadNode",
  "component": "clustermgtd",
  "level": "INFO",
  "instance-id": "i-00a3f1fa58cbcbd60",
  "event-type": "clustermgtd-heartbeat",
  "message": "Clustermgtd heartbeat",
  "detail": {"heartbeat-timestamp": "2026-02-04T19:24:37.549+00:00"}
}
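
For readers who want to reproduce the shape of this event, here is a minimal sketch that builds a dictionary with the same fields. It is not the event publisher's actual code: the function `build_heartbeat_event` and its parameters are hypothetical, and using the heartbeat time for both `datetime` and `heartbeat-timestamp` is an assumption based on the two values being equal in the example above.

```python
import json
from datetime import datetime, timezone


def build_heartbeat_event(heartbeat_time, cluster_name, instance_id):
    """Build a dict shaped like the example heartbeat event (hypothetical helper)."""
    # Millisecond precision with a UTC offset, matching the example's timestamp format.
    timestamp = heartbeat_time.isoformat(timespec="milliseconds")
    return {
        "datetime": timestamp,  # assumption: same instant as the heartbeat
        "version": 0,
        "scheduler": "slurm",
        "cluster-name": cluster_name,
        "node-role": "HeadNode",
        "component": "clustermgtd",
        "level": "INFO",
        "instance-id": instance_id,
        "event-type": "clustermgtd-heartbeat",
        "message": "Clustermgtd heartbeat",
        "detail": {"heartbeat-timestamp": timestamp},
    }


example_time = datetime(2026, 2, 4, 19, 24, 37, 549000, tzinfo=timezone.utc)
event = build_heartbeat_event(example_time, "clustermgtd-0204-2", "i-00a3f1fa58cbcbd60")
print(json.dumps(event))
```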

Q&A

  1. Why is the heartbeat event published only when the heartbeat file is updated?
    This is the same behavior we had when we were pushing the metric explicitly. It is the correct approach because the heartbeat event then signals a healthy clustermgtd loop.
  2. Why does the heartbeat event contain so many fields that seem unnecessary?
    This is the standard event schema we use for every clustermgtd event. We want to stick with this default schema to keep things simple.
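
The ordering described in the first answer can be sketched as follows: the heartbeat file is written first, and the event is published only if that write succeeds. This is an illustration, not the clustermgtd code. The function `write_heartbeat`, its parameters, and the value of `TIMESTAMP_FORMAT` are assumptions made for this sketch.

```python
import os
import tempfile
from datetime import datetime, timezone

# Assumed format for this sketch; the real TIMESTAMP_FORMAT constant may differ.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f%z"


def write_heartbeat(heartbeat_file_path, publish_heartbeat_event):
    """Write the heartbeat file, then publish the event with the same timestamp."""
    heartbeat_time = datetime.now(tz=timezone.utc)
    with open(heartbeat_file_path, "w") as timestamp_file:
        timestamp_file.write(heartbeat_time.strftime(TIMESTAMP_FORMAT))
    # Reached only when the file write above did not raise.
    publish_heartbeat_event(heartbeat_time)
    return heartbeat_time


published = []  # stands in for the events log
with tempfile.TemporaryDirectory() as tmp_dir:
    heartbeat_path = os.path.join(tmp_dir, "clustermgtd_heartbeat")
    written_time = write_heartbeat(heartbeat_path, published.append)
    with open(heartbeat_path) as heartbeat_file:
        content = heartbeat_file.read()
print(content, len(published))
```

Capturing the timestamp once and reusing it keeps the file content and the event detail consistent with each other.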

Tests

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…clustermgtd heartbeat."

This reverts commit ebb742f.

We must revert this commit because we will no longer put the clustermgtd heartbeat metric explicitly.
Instead, the heartbeat will be written to the clustermgtd events log and the heartbeat metric
will be generated by a metric filter.
@gmarciani gmarciani changed the title from "Wip/mgiacomo/3150/clustermgtd alarm fix 0204 1" to "[Observability] Clustermgtd to emit heartbeat into event logs rather than explicit CW metric" Feb 4, 2026
@gmarciani gmarciani marked this pull request as ready for review February 4, 2026 19:36
@gmarciani gmarciani requested review from a team as code owners February 4, 2026 19:36
- timestamp_file.write(datetime.now(tz=timezone.utc).strftime(TIMESTAMP_FORMAT))
+ timestamp_file.write(heartbeat_time.strftime(TIMESTAMP_FORMAT))
+ # Publish heartbeat event to events log
+ self._event_publisher.publish_heartbeat_event(heartbeat_time)
Contributor

Shall we wrap this in a try/except to prevent it from failing the clustermgtd logic?
We can handle this case by case, but in general we should wrap all non-critical operations in try/except.

Contributor Author

I agree. This is already handled by the decorator on the method:

@log_exception(logger, "publish_heartbeat_event", catch_exception=Exception, raise_on_error=False)
def publish_heartbeat_event(self, heartbeat_timestamp: datetime):

Every function of the event publisher uses the same mechanism so that it does not fail on exceptions.
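
For context, a decorator with this call shape could work roughly as sketched below. This is an assumption about the behavior, not the project's actual `log_exception` implementation; the name `log_exception_sketch`, the log message, and the `failing_publish` function are hypothetical.

```python
import functools
import logging


def log_exception_sketch(logger, action_desc, catch_exception=Exception, raise_on_error=True):
    """Log exceptions raised by the wrapped function and optionally swallow them."""

    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            try:
                return function(*args, **kwargs)
            except catch_exception as e:
                logger.error("Failed when %s with exception %s", action_desc, e)
                if raise_on_error:
                    raise
                return None  # swallow the error so the caller keeps running

        return wrapper

    return decorator


logger = logging.getLogger("sketch")


@log_exception_sketch(logger, "publish_heartbeat_event", catch_exception=Exception, raise_on_error=False)
def failing_publish():
    # Hypothetical failure, e.g. the events log cannot be written.
    raise RuntimeError("events log not writable")


result = failing_publish()  # logs the error and returns None instead of raising
print(result)
```

With raise_on_error=False, a failure while publishing the event is logged and the calling loop continues.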

@gmarciani gmarciani merged commit 9b1b66f into aws:develop Feb 5, 2026
21 checks passed