[ML] Don't try and correct for sample count when estimating statistic variances for anomaly detection#2677
Draft
Conversation
For forecasting, we model the level of a time series in which we've observed step discontinuities via a Markov process. Specifically, we estimate the historical step size distribution and the distribution of the steps in time and as a function of the time series value. For this second part we use an online naive Bayes model to estimate the probability that, at any given point in a forecast roll out, we will get a step. This approach generally works well unless, when we roll out, we're in the tails of the distribution of values we've observed for the time series historically. In this case, our prediction probabilities are very sensitive to the tail behaviour of the distributions we fit to the time series values where we saw a step, and we sometimes predict far too many steps as a result. We can detect this case: when we're in the tails of the time series value distribution. This change does so and stops predicting changes in such cases, which avoids these pathologies. This fixes #2466.
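For intuition, the tail check described above can be sketched as follows. This is an illustrative sketch only: the empirical-quantile test, the `should_predict_steps` name, and the `tail_fraction` threshold are assumptions for demonstration, not the ml-cpp implementation.

```python
import numpy as np


def should_predict_steps(historical_values, current_value, tail_fraction=0.01):
    """Suppress step prediction when current_value falls in the tails of the
    historical value distribution.

    Hypothetical sketch: the quantile test and threshold are assumptions,
    not the actual ml-cpp logic.
    """
    lower, upper = np.quantile(
        historical_values, [tail_fraction, 1.0 - tail_fraction]
    )
    return lower <= current_value <= upper


rng = np.random.default_rng(42)
history = rng.normal(100.0, 10.0, size=10_000)

print(should_predict_steps(history, 102.0))  # True: well inside the observed range
print(should_predict_steps(history, 500.0))  # False: deep in the tails, don't predict steps
```

The point is that once the roll out leaves the range of values on which the naive Bayes model was trained, its step probabilities are no longer trustworthy, so it is safer to predict no step at all.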
…metric-statistic-modelling-part-1
Contributor
Behaviour before the change: I use the following script to generate synthetic data:

```python
import pandas as pd
import numpy as np


def generate_variable_frequency_data():
    """Generate variable frequency throughput data with a failure scenario.

    Returns:
        pandas.DataFrame: A DataFrame containing the generated data with two columns:
            - '@timefield': Timestamps of the data points.
            - 'transaction_throughput': Throughput values at each timestamp.
    """
    # Define start and end dates
    start_date = pd.to_datetime("2024-04-01")
    end_date = pd.to_datetime("2024-04-21")  # 20 day period

    # Initialize lists to store timestamps and throughput values
    timestamps = []
    throughput_values = []

    # Initial timestamp
    current_time = start_date

    while current_time <= end_date:
        # Append the current timestamp
        timestamps.append(current_time)

        # Generate a throughput value with normal variability
        throughput = np.random.normal(200, 50)
        throughput = max(0, throughput)  # Ensure non-negative throughput
        throughput_values.append(throughput)

        # Generate the next timestamp using a sinusoidal frequency, with
        # noise, with a period of 24 hours
        base_frequency = 10  # base frequency in seconds
        sinusoidal_variation = 50 * np.sin(
            2 * np.pi * current_time.hour / 24
        )  # sinusoidal variation
        noise = np.random.normal(0, 5)  # noise
        interval = base_frequency + sinusoidal_variation + noise

        # Simulate a drop in frequency after a certain date
        if current_time > pd.to_datetime(
            "2024-04-18"
        ) and current_time < pd.to_datetime("2024-04-19"):
            interval *= 25  # Increase the interval 25-fold
            throughput_values[-1] = 0

        # Calculate the next timestamp
        current_time += pd.to_timedelta(abs(interval), unit="s")

    return pd.DataFrame(
        {"@timefield": timestamps, "transaction_throughput": throughput_values}
    )


if __name__ == "__main__":
    # Generate data
    data = generate_variable_frequency_data()

    # Save the data to a CSV file
    data.to_csv("variable_frequency_throughput_data.csv", index=False)
```

Hence, while data frequency is time-dependent, the metric value

cc @tveasey
…metric-statistic-modelling-part-1
valeriy42
added a commit
to elastic/elasticsearch
that referenced
this pull request
Jul 9, 2024
While working on elastic/ml-cpp#2677, I encountered a failure in the integration test DetectionRulesIt.testCondition(). It checks the number of returned records. With the new change in ml-cpp, the native code returns two more values that have no significant score. I added filtering to the integration test code to exclude these, so it continues working as expected.
tveasey
commented
Jul 9, 2024
Contributor
Author
tveasey
left a comment
Thanks for finishing this Valeriy! One small suggestion, but LGTM.
…metric-statistic-modelling-part-1
tvernum
pushed a commit
to elastic/elasticsearch
that referenced
this pull request
Feb 25, 2025
While working on elastic/ml-cpp#2677, I encountered a failure in the integration test DetectionRulesIt.testCondition(). It checks the number of returned records. With the new change in ml-cpp, the native code returns two more values that have no significant score. I added filtering to the integration test code to exclude these, so it continues working as expected.


Currently, we:
This adds significant complexity to sampling metrics and creates a disconnect between the data we show in visualisations and the data we use for anomaly detection. Furthermore, the independence assumption frequently does not hold, in which case our current behaviour can lead to false negatives. This is particularly problematic for data where outages are associated with a significant fall in data rate. The choice to try to correct the variance predated modelling periodic variance, which now better accounts for the most common case: that the data rate is periodic.
In this PR I have reverted to using the raw time bucket statistics for model update and anomaly detection, relying on periodic variance estimation to deal with (common instances of) time-varying data rate. This is a step towards #1386.
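To illustrate how a sample-count correction can produce the false negatives described above, consider this hypothetical sketch. It is not the ml-cpp code: the numbers, the `z_score` helper, and the specific form of the correction (scaling the bucket mean's standard deviation by `sqrt(n_typical / n)` under an independence assumption) are assumptions chosen for demonstration.

```python
import numpy as np

# Suppose the model of the bucket-mean statistic was learned from buckets
# holding n_typical samples drawn from N(mu, sigma^2), so its learned
# standard deviation is sigma / sqrt(n_typical).  Under an independence
# assumption, a bucket with only n samples gets its expected spread
# "corrected" (widened) by a factor sqrt(n_typical / n).
sigma, mu, n_typical = 50.0, 200.0, 360
model_std = sigma / np.sqrt(n_typical)


def z_score(bucket_mean, n, correct_for_count):
    """Hypothetical anomaly score of a bucket mean, with or without the
    sample-count variance correction."""
    std = model_std * (np.sqrt(n_typical / n) if correct_for_count else 1.0)
    return (bucket_mean - mu) / std


# Outage: the mean falls to 190 and only 4 samples arrive in the bucket.
print(round(z_score(190.0, 4, correct_for_count=False), 1))  # -3.8: flagged
print(round(z_score(190.0, 4, correct_for_count=True), 1))   # -0.4: missed
```

When the outage lowers both the value and the data rate, the correction widens the expected spread by exactly the factor driven by the low count, so the same drop that the raw bucket statistics would flag is scored as unremarkable: a false negative.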