[ML] Don't try and correct for sample count when estimating statistic variances for anomaly detection#2677
Draft
Conversation
For forecasting, we model the level of a time series in which we've observed step discontinuities via a Markov process. Specifically, we estimate the historical step size distribution and the distribution of the steps in time and as a function of the time series value. For this second part we use an online naive Bayes model to estimate the probability that, at any given point in a forecast roll out, we will get a step. This approach generally works well unless, when we roll out, we're in the tails of the distribution of values we've observed for the time series historically. In this case, our prediction probabilities are very sensitive to the tail behaviour of the distributions we fit to the time series values where we saw a step, and we sometimes predict far too many steps as a result. We can detect this case: when we're in the tails of the time series value distribution. This change does so and stops predicting changes in such cases, which avoids these pathologies. This fixes #2466.
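For intuition, the tail check described above can be sketched as follows. This is an illustrative sketch only: the empirical-quantile test, the `should_predict_steps` name, and the `tail_fraction` threshold are assumptions for demonstration, not the ml-cpp implementation.

```python
import numpy as np


def should_predict_steps(historical_values, current_value, tail_fraction=0.01):
    """Suppress step prediction when current_value falls in the tails of the
    historical value distribution.

    Hypothetical sketch: the quantile test and threshold are assumptions,
    not the actual ml-cpp logic.
    """
    lower, upper = np.quantile(
        historical_values, [tail_fraction, 1.0 - tail_fraction]
    )
    return lower <= current_value <= upper


rng = np.random.default_rng(42)
history = rng.normal(100.0, 10.0, size=10_000)

print(should_predict_steps(history, 102.0))  # True: well inside the observed range
print(should_predict_steps(history, 500.0))  # False: deep in the tails, don't predict steps
```

The point is that once the roll out leaves the range of values on which the naive Bayes model was trained, its step probabilities are no longer trustworthy, so it is safer to predict no step at all.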
…metric-statistic-modelling-part-1
Contributor
Behaviour before the change: I use the following script to generate synthetic data:

```python
import pandas as pd
import numpy as np


def generate_variable_frequency_data():
    """Generate variable frequency throughput data with a failure scenario.

    Returns:
        pandas.DataFrame: A DataFrame containing the generated data with two columns:
            - '@timefield': Timestamps of the data points.
            - 'transaction_throughput': Throughput values at each timestamp.
    """
    # Define start and end dates
    start_date = pd.to_datetime("2024-04-01")
    end_date = pd.to_datetime("2024-04-21")  # 20 day period

    # Initialize lists to store timestamps and throughput values
    timestamps = []
    throughput_values = []

    # Initial timestamp
    current_time = start_date

    while current_time <= end_date:
        # Append the current timestamp
        timestamps.append(current_time)

        # Generate a throughput value with normal variability
        throughput = np.random.normal(200, 50)
        throughput = max(0, throughput)  # Ensure non-negative throughput
        throughput_values.append(throughput)

        # Generate the next timestamp using a sinusoidal frequency, with
        # noise, with a period of 24 hours
        base_frequency = 10  # base frequency in seconds
        sinusoidal_variation = 50 * np.sin(
            2 * np.pi * current_time.hour / 24
        )  # sinusoidal variation
        noise = np.random.normal(0, 5)  # noise
        interval = base_frequency + sinusoidal_variation + noise

        # Simulate a drop in frequency after a certain date
        if current_time > pd.to_datetime(
            "2024-04-18"
        ) and current_time < pd.to_datetime("2024-04-19"):
            interval *= 25  # Increase the interval 25-fold
            throughput_values[-1] = 0

        # Calculate the next timestamp
        current_time += pd.to_timedelta(abs(interval), unit="s")

    return pd.DataFrame(
        {"@timefield": timestamps, "transaction_throughput": throughput_values}
    )


if __name__ == "__main__":
    # Generate data
    data = generate_variable_frequency_data()

    # Save the data to a CSV file
    data.to_csv("variable_frequency_throughput_data.csv", index=False)
```

Hence, while data frequency is time-dependent, the metric value

cc @tveasey
…metric-statistic-modelling-part-1
valeriy42
added a commit
to elastic/elasticsearch
that referenced
this pull request
Jul 9, 2024
While working on elastic/ml-cpp#2677, I encountered a failure in the integration test DetectionRulesIt.testCondition(). It checks the number of returned records. With the new change in ml-cpp, the native code returns two more values that have no significant score. I added filtering to the integration test code to exclude these, so it continues working as expected.
tveasey
commented
Jul 9, 2024
Contributor
Author
tveasey
left a comment
Thanks for finishing this Valeriy! One small suggestion, but LGTM.
…metric-statistic-modelling-part-1
tvernum
pushed a commit
to elastic/elasticsearch
that referenced
this pull request
Feb 25, 2025
While working on elastic/ml-cpp#2677, I encountered a failure in the integration test DetectionRulesIt.testCondition(). It checks the number of returned records. With the new change in ml-cpp, the native code returns two more values that have no significant score. I added filtering to the integration test code to exclude these, so it continues working as expected.


Currently, we:
This adds significant complexity to sampling metrics and creates a disconnect between the data we show in visualisations and the data we use for anomaly detection. Furthermore, the independence assumption frequently does not hold, in which case our current behaviour can lead to false negatives. This is particularly problematic for data where outages are associated with a significant fall in data rate. The choice to try to correct the variance predated modelling periodic variance, which now better accounts for the most common case: that the data rate is periodic.
In this PR I have reverted to using the raw time bucket statistics for model update and anomaly detection, relying on periodic variance estimation to deal with (common instances of) time-varying data rate. This is a step towards #1386.
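To illustrate how a sample-count correction can produce the false negatives described above, consider this hypothetical sketch. It is not the ml-cpp code: the numbers, the `z_score` helper, and the specific form of the correction (scaling the bucket mean's standard deviation by `sqrt(n_typical / n)` under an independence assumption) are assumptions chosen for demonstration.

```python
import numpy as np

# Suppose the model of the bucket-mean statistic was learned from buckets
# holding n_typical samples drawn from N(mu, sigma^2), so its learned
# standard deviation is sigma / sqrt(n_typical).  Under an independence
# assumption, a bucket with only n samples gets its expected spread
# "corrected" (widened) by a factor sqrt(n_typical / n).
sigma, mu, n_typical = 50.0, 200.0, 360
model_std = sigma / np.sqrt(n_typical)


def z_score(bucket_mean, n, correct_for_count):
    """Hypothetical anomaly score of a bucket mean, with or without the
    sample-count variance correction."""
    std = model_std * (np.sqrt(n_typical / n) if correct_for_count else 1.0)
    return (bucket_mean - mu) / std


# Outage: the mean falls to 190 and only 4 samples arrive in the bucket.
print(round(z_score(190.0, 4, correct_for_count=False), 1))  # -3.8: flagged
print(round(z_score(190.0, 4, correct_for_count=True), 1))   # -0.4: missed
```

When the outage lowers both the value and the data rate, the correction widens the expected spread by exactly the factor driven by the low count, so the same drop that the raw bucket statistics would flag is scored as unremarkable: a false negative.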