[Observability] Clustermgtd to emit heartbeat into event logs rather than explicit CW metric #687

gmarciani · 2026-02-04T19:13:59Z

Description of changes

In clustermgtd, we replace the explicit publishing of the heartbeat metric with writing the heartbeat as a clustermgtd event to its dedicated events log. This way, the metric is no longer created by clustermgtd; instead, it is generated by the metric filter created by the CLI.

This approach has two advantages over the previous one:

it does not require the extra permission cloudwatch:PutMetricData
it does not require an extra VPC Endpoint for CloudWatch Monitoring in air-gapped environments.
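Conceptually, the metric filter matches heartbeat events in the events log and produces a data point for each match. The actual filter is defined by the CLI and is not shown in this PR, so the following is only a local Python emulation of that matching logic. The helper name `count_heartbeat_events` is hypothetical, and the assumption is that the events log holds one JSON object per line.

```python
import json

# event-type value taken from the example heartbeat event in this PR
HEARTBEAT_EVENT_TYPE = "clustermgtd-heartbeat"


def count_heartbeat_events(log_lines):
    """Count heartbeat events in an iterable of JSON log lines (hypothetical helper)."""
    count = 0
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(event, dict) and event.get("event-type") == HEARTBEAT_EVENT_TYPE:
            count += 1
    return count


sample_lines = [
    '{"event-type": "clustermgtd-heartbeat", "level": "INFO"}',
    '{"event-type": "some-other-event", "level": "INFO"}',
    "not a json line",
]
heartbeat_count = count_heartbeat_events(sample_lines)
```

Because the matching happens on the log content, clustermgtd itself never calls the CloudWatch metrics API, which is why the extra permission and endpoint are not needed.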

Notes

In this PR we reverted ebb742f, the commit that introduced the approach of explicitly publishing the metric. All deletions in the PR diff come from this revert.

User Experience

Below is an example of a heartbeat event written to the clustermgtd events log /var/log/parallelcluster/clustermgtd.events.
The schema of the event is the standard schema for events published by clustermgtd.

{
  "datetime": "2026-02-04T19:24:37.549+00:00", 
  "version": 0, 
  "scheduler": "slurm", 
  "cluster-name": "clustermgtd-0204-2", 
  "node-role": "HeadNode", 
  "component": "clustermgtd", 
  "level": "INFO", 
  "instance-id": "i-00a3f1fa58cbcbd60", 
  "event-type": "clustermgtd-heartbeat", 
  "message": "Clustermgtd heartbeat", 
  "detail": {"heartbeat-timestamp": "2026-02-04T19:24:37.549+00:00"}
}
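As a rough illustration of the schema, the sketch below builds a dictionary with the same fields as the example event and serializes it to JSON. The function `build_heartbeat_event` is hypothetical and is not the publisher code from this PR. It also assumes that "datetime" and "heartbeat-timestamp" carry the same value, as they do in the example.

```python
import json
from datetime import datetime, timezone


def build_heartbeat_event(heartbeat_time, cluster_name, instance_id):
    """Return a dict shaped like the example heartbeat event (hypothetical helper)."""
    # ISO 8601 with millisecond precision, matching the format in the example
    timestamp = heartbeat_time.isoformat(timespec="milliseconds")
    return {
        "datetime": timestamp,
        "version": 0,
        "scheduler": "slurm",
        "cluster-name": cluster_name,
        "node-role": "HeadNode",
        "component": "clustermgtd",
        "level": "INFO",
        "instance-id": instance_id,
        "event-type": "clustermgtd-heartbeat",
        "message": "Clustermgtd heartbeat",
        "detail": {"heartbeat-timestamp": timestamp},
    }


event = build_heartbeat_event(
    datetime(2026, 2, 4, 19, 24, 37, 549000, tzinfo=timezone.utc),
    cluster_name="clustermgtd-0204-2",
    instance_id="i-00a3f1fa58cbcbd60",
)
# Writing one JSON object per line is an assumption about the events log layout
event_line = json.dumps(event)
```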

Q&A

Why is the heartbeat event published only when the heartbeat file is updated?
This is the same behavior we had when we were pushing the metric explicitly. It is the correct approach because the heartbeat event then signals that the clustermgtd loop is healthy.
Why does the heartbeat event contain so many fields that seem unnecessary?
This is the standard event schema we use for every clustermgtd event. We want to stick with this default schema to keep things simple.
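The ordering described in the first answer can be sketched as follows: the event is published only after the heartbeat file has been written, so a failed write produces no event. This is a simplified stand-in and not the clustermgtd code. The function `write_heartbeat`, the `publish_event` callback, the file name, and the value of `TIMESTAMP_FORMAT` are all assumptions made for illustration.

```python
import os
import tempfile
from datetime import datetime, timezone

# Assumed format for illustration; the real constant is defined in the node package
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f%z"


def write_heartbeat(heartbeat_file_path, publish_event):
    """Write the heartbeat file, then publish the event only if the write succeeded."""
    heartbeat_time = datetime.now(tz=timezone.utc)
    with open(heartbeat_file_path, "w") as timestamp_file:
        timestamp_file.write(heartbeat_time.strftime(TIMESTAMP_FORMAT))
    # Reached only when the file was updated without raising
    publish_event(heartbeat_time)
    return heartbeat_time


published = []
with tempfile.TemporaryDirectory() as tmp_dir:
    # published.append stands in for the real event publisher
    write_heartbeat(os.path.join(tmp_dir, "clustermgtd_heartbeat"), published.append)
```

Using the same `heartbeat_time` for both the file and the event keeps the two records consistent with each other.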

Tests

Manually tested that the heartbeat event is written to the events log (see the example event in the User Experience section above)
Unit tests (updated to cover the current changes)
Manually tested that the metric is emitted and the alarm is able to capture it, by using the CLI changes in [Observability] Use metric filter to generate the clustermgtd heartbeat metric. aws-parallelcluster#7219
test_monitoring SUCCEEDED on a cluster using the CLI changes in [Observability] Use metric filter to generate the clustermgtd heartbeat metric. aws-parallelcluster#7219

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…clustermgtd heartbeat." This reverts commit ebb742f. We must revert this commit because we will no longer publish the clustermgtd heartbeat metric explicitly. Instead, the heartbeat will be written to the clustermgtd events log and the heartbeat metric will be generated by a metric filter.

hanwen-cluster · 2026-02-05T14:30:47Z

src/slurm_plugin/clustermgtd.py

-            timestamp_file.write(datetime.now(tz=timezone.utc).strftime(TIMESTAMP_FORMAT))
+            timestamp_file.write(heartbeat_time.strftime(TIMESTAMP_FORMAT))
+        # Publish heartbeat event to events log
+        self._event_publisher.publish_heartbeat_event(heartbeat_time)


Shall we wrap this in a try/except to prevent it from failing the clustermgtd logic?
We can deal with this case by case, but in general we should wrap all non-critical operations in a try/except.

I agree. This is already handled in

aws-parallelcluster-node/src/slurm_plugin/cluster_event_publisher.py

Lines 497 to 498 in 00ab174

@log_exception(logger, "publish_heartbeat_event", catch_exception=Exception, raise_on_error=False)
def publish_heartbeat_event(self, heartbeat_timestamp: datetime):

Every function of the event publisher uses the same mechanism, so exceptions raised while publishing do not fail clustermgtd.
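For readers unfamiliar with this pattern, here is a minimal sketch of how a decorator of this kind might catch, log, and suppress exceptions. It is an illustrative stand-in named `log_and_suppress`, not the project's `log_exception` implementation, whose internals are not shown in this thread. Only the parameter names mirror the call quoted above.

```python
import functools
import logging


def log_and_suppress(logger, action, catch_exception=Exception, raise_on_error=False):
    """Illustrative stand-in for a decorator like log_exception (not the project's code)."""

    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            try:
                return function(*args, **kwargs)
            except catch_exception as e:
                logger.error("Failed when %s with exception %s", action, e)
                if raise_on_error:
                    raise
                return None  # swallow the error so the caller keeps running

        return wrapper

    return decorator


logger = logging.getLogger("sketch")


@log_and_suppress(logger, "publish_heartbeat_event")
def publish_heartbeat_event(heartbeat_timestamp):
    raise RuntimeError("events log not writable")  # simulated failure


# The failure is logged and suppressed; the call returns None instead of raising
result = publish_heartbeat_event("2026-02-04T19:24:37.549+00:00")
```

With `raise_on_error=False`, a publishing failure costs one missed event rather than an interrupted management loop.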

gmarciani added 2 commits February 4, 2026 11:48

[Observability] Clustermgtd to emit heartbeat event into its events log.

00ab174

gmarciani changed the title ~~Wip/mgiacomo/3150/clustermgtd alarm fix 0204 1~~ [Observability] Clustermgtd to emit heartbeat into event logs rather than explicit CW metric Feb 4, 2026

gmarciani marked this pull request as ready for review February 4, 2026 19:36

gmarciani requested review from a team as code owners February 4, 2026 19:36

gmarciani added enhancement 3.x labels Feb 4, 2026

gmarciani mentioned this pull request Feb 4, 2026

[Observability] Use metric filter to generate the clustermgtd heartbeat metric. aws/aws-parallelcluster#7219

Merged

hanwen-cluster reviewed Feb 5, 2026


hanwen-cluster approved these changes Feb 5, 2026


gmarciani merged commit 9b1b66f into aws:develop Feb 5, 2026
21 checks passed
