@gmarciani gmarciani commented Feb 3, 2026

Description of changes

Add log lines to clustermgtd to signal that the heartbeat file has been updated and the heartbeat metric has been published.
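For illustration only, the change described above can be sketched as follows. This is not the actual diff: the helper name `write_and_publish_heartbeat`, its parameters, and the injected `publish_metric` callable are hypothetical. Only the two log messages and the metric name `ClustermgtdHeartbeat` are taken from the log excerpt in this PR.

```python
import logging
from datetime import datetime, timezone
from pathlib import Path

log = logging.getLogger(__name__)


def write_and_publish_heartbeat(heartbeat_path, publish_metric):
    """Hypothetical helper: log each heartbeat step right before doing it.

    heartbeat_path: where the heartbeat timestamp is written (assumed).
    publish_metric: callable taking (metric_name, value); stands in for
                    whatever the daemon uses to publish to CloudWatch.
    """
    now = datetime.now(tz=timezone.utc)

    # Logging before the write shows the loop reached this step,
    # even if the write itself later fails or hangs.
    log.info("Writing heartbeat file")
    Path(heartbeat_path).write_text(now.isoformat())

    # Logging before the publish call makes a slow or timed-out
    # network call visible as a gap in the log timestamps.
    log.info("Publishing heartbeat metric")
    publish_metric("ClustermgtdHeartbeat", 1)

    return now
```

Logging before each step, rather than after, means a step that blocks still leaves a trace of where the loop was when it stalled.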

Tests

Verified that the logs are written and were actually helpful in root-causing a problem.

2026-02-03 23:13:42,879 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2026-02-03 23:13:42,879 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Writing heartbeat file
2026-02-03 23:13:42,880 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Publishing heartbeat metric
2026-02-03 23:29:43,747 - [slurm_plugin.cloudwatch_utils:put_metric] - ERROR - Failed to publish metric ClustermgtdHeartbeat: Connect timeout on endpoint URL: "https://monitoring.us-east-1.amazonaws.com/"
2026-02-03 23:29:43,747 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2026-02-03 23:29:43,750 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
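One way to read the excerpt above is to measure the time between the "Publishing heartbeat metric" line and the error that follows it. The snippet below is a small sketch of that arithmetic. The function name `seconds_between` is hypothetical, and it assumes every line starts with a timestamp in the format shown, followed by " - ".

```python
from datetime import datetime

# Timestamp layout used in the log excerpt, e.g. "2026-02-03 23:13:42,880".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(first_line, second_line):
    """Return the seconds elapsed between two log lines (hypothetical helper)."""

    def parse(line):
        # The timestamp is everything before the first " - " separator.
        return datetime.strptime(line.split(" - ", 1)[0], LOG_TIME_FORMAT)

    return (parse(second_line) - parse(first_line)).total_seconds()


publishing = (
    "2026-02-03 23:13:42,880 - [slurm_plugin.clustermgtd:manage_cluster]"
    " - INFO - Publishing heartbeat metric"
)
failure = (
    "2026-02-03 23:29:43,747 - [slurm_plugin.cloudwatch_utils:put_metric]"
    " - ERROR - Failed to publish metric ClustermgtdHeartbeat"
)

# Roughly 16 minutes pass between the two lines.
gap_seconds = seconds_between(publishing, failure)
print(f"{gap_seconds:.3f} seconds")
```

With the new log line in place, the gap of about 16 minutes can be attributed to the metric publish step rather than to the heartbeat file write or the orphaned instance check.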

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani marked this pull request as ready for review February 3, 2026 23:31
@gmarciani gmarciani requested review from a team as code owners February 3, 2026 23:31
@gmarciani gmarciani changed the title from "[Logs] Add log lines to clustermgtd to signal that heartbeat file has…" to "[Logs] Add log lines to clustermgtd to signal that heartbeat file has been update and heartbeat metric has been published." Feb 3, 2026