@lionakhnazarov
Collaborator

Detailed Coordination Window Diagnostics

Summary

This PR introduces comprehensive diagnostics and metrics tracking for tBTC coordination windows, significantly enhancing observability into the coordination process. The changes add detailed per-window and per-wallet metrics, improve network diagnostics, and expand performance monitoring capabilities.

The core addition is a metrics tracking system for coordination windows that provides:

  • Per-Window Tracking: Each coordination window is tracked with:

    • Window identification (index, coordination block)
    • Timing information (start time, end time, duration, block ranges)
    • Coordination statistics (wallets coordinated, successful/failed counts)
    • Leader distribution across wallets
    • Action type breakdown
    • Fault statistics (by type and culprit)
  • Per-Wallet Details: For each wallet coordinated in a window:

    • Wallet public key hash
    • Leader address
    • Action type
    • Success/failure status
    • Duration
    • Error messages (if failed)
    • Detailed fault information
  • Memory Management: Tracks up to 100 recent windows (~25 hours) with automatic cleanup of older windows to prevent unbounded memory growth
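The bounded retention described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `windowMetrics` and `windowTracker` types and their methods are hypothetical names, and only the limit of 100 windows comes from the description.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// maxTrackedWindows mirrors the limit of 100 windows stated in the PR description.
const maxTrackedWindows = 100

// windowMetrics is a hypothetical, trimmed-down per-window record.
type windowMetrics struct {
	Index             uint64
	CoordinationBlock uint64
	StartTime         time.Time
	EndTime           time.Time
	WalletsSucceeded  int
	WalletsFailed     int
}

// windowTracker (hypothetical) keeps at most `limit` windows and evicts the oldest first.
type windowTracker struct {
	mu      sync.Mutex
	limit   int
	order   []uint64 // window indexes in insertion order, oldest first
	windows map[uint64]*windowMetrics
}

func newWindowTracker(limit int) *windowTracker {
	return &windowTracker{limit: limit, windows: make(map[uint64]*windowMetrics)}
}

// record stores a window and drops the oldest entries once the limit is exceeded.
func (t *windowTracker) record(w *windowMetrics) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, exists := t.windows[w.Index]; !exists {
		t.order = append(t.order, w.Index)
	}
	t.windows[w.Index] = w
	for len(t.order) > t.limit {
		oldest := t.order[0]
		t.order = t.order[1:]
		delete(t.windows, oldest)
	}
}

// size returns the number of windows currently retained.
func (t *windowTracker) size() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.windows)
}

// has reports whether the window with the given index is still retained.
func (t *windowTracker) has(index uint64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.windows[index]
	return ok
}

func main() {
	tracker := newWindowTracker(maxTrackedWindows)
	for i := uint64(1); i <= 150; i++ {
		tracker.record(&windowMetrics{Index: i, StartTime: time.Now()})
	}
	// Windows 1..50 have been evicted; windows 51..150 remain.
	fmt.Println(tracker.size(), tracker.has(1), tracker.has(150))
}
```

Evicting by insertion order keeps cleanup cheap and assumes windows are recorded in chronological order, which seems reasonable for sequential coordination windows.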

- Updated  to include a new peer for the sepolia network.
- Added timeout handling in  to prevent indefinite hangs.
- Introduced new system metrics: CPU load, RAM utilization, and swap utilization, with corresponding updates to the performance metrics registration.
- Introduced a new  structure to track detailed metrics for individual coordination windows, including timing, success rates, and fault statistics.
- Enhanced the coordination layer to record the start and end of coordination windows, as well as wallet-specific coordination details.
- Added new metrics for coordination windows, including total wallets coordinated, successful, and failed, along with fault tracking.
- Introduced new metrics for redemption actions, including total executions, success, and failure counts, as well as duration tracking.
- Updated the performance metrics registration to include these new redemption metrics.
- Refactored existing code to utilize defined constants for metric names, enhancing consistency and readability.
- Improved error handling in redemption proof submissions to accurately record failure metrics.
- Updated the  and  structures to include JSON tags for improved serialization.
- Introduced a new  structure to capture detailed fault information during coordination.
- Enhanced the  method to include error messages for failed wallet actions.
- Added a new method  to retrieve a summary of coordination window metrics.
- Registered coordination windows as a diagnostic source in the client info for better monitoring.
- Added a mutex and a map to track peers that have already been pinged to avoid duplicate ping tests.
- Updated the connected and disconnected callback functions to manage the pinged peers set, ensuring each unique peer is only pinged once.
- Enhanced disconnection handling to allow re-pinging if a peer reconnects later.
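
The ping deduplication described in the last three items can be sketched as a mutex-guarded set. This is an illustrative sketch only: the `pingedPeers` type, its methods, and the string peer IDs are assumptions and do not reflect the actual libp2p callback signatures used in the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// pingedPeers is a hypothetical set of peer IDs that have already been ping-tested.
type pingedPeers struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newPingedPeers() *pingedPeers {
	return &pingedPeers{seen: make(map[string]struct{})}
}

// markIfNew returns true only for the first caller for a given peer,
// so concurrent connection callbacks trigger at most one ping test.
func (p *pingedPeers) markIfNew(peerID string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, ok := p.seen[peerID]; ok {
		return false
	}
	p.seen[peerID] = struct{}{}
	return true
}

// forget removes the peer on disconnect so that a later reconnect is pinged again.
func (p *pingedPeers) forget(peerID string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.seen, peerID)
}

func main() {
	peers := newPingedPeers()

	// Stand-ins for the connected and disconnected callbacks.
	onConnected := func(id string) {
		if peers.markIfNew(id) {
			fmt.Println("running ping test for", id)
		}
	}
	onDisconnected := func(id string) { peers.forget(id) }

	onConnected("peer-A") // pinged
	onConnected("peer-A") // duplicate connection event, skipped
	onDisconnected("peer-A")
	onConnected("peer-A") // reconnected, pinged again
}
```

Checking and marking under a single lock is what makes the set safe when several connections to the same peer are reported at nearly the same time.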
Member

@lrsaturnino lrsaturnino left a comment

Good work! Some comments pending clarification.

- Refactored metric increment calls in libp2p to utilize constants for peer connections, disconnections, and ping tests.
- Enhanced coordination window metrics by adding a mutex for safe access to previous window data across goroutines.
- Introduced a cleanup goroutine to ensure the end time of the last coordination window is recorded on shutdown.
Member

@lrsaturnino lrsaturnino left a comment

LGTM

2 participants