Improve random() to use name-based salting for order-independent reproducibility #433

@baogorek

Problem

The current random() function in policyengine_core/commons/formulas.py uses a global execution counter (count_random_calls) to differentiate random streams:

seeds = np.abs(entity_ids * 100 + population.simulation.count_random_calls)

This creates a "ripple effect": adding, removing, or reordering variables that call random() changes the random values for every variable that calls random() after them. This makes it impossible to:

  • Compare policy versions with confidence (random noise shifts underneath)
  • Isolate the effect of a specific policy change
  • Run variables in parallel without counter synchronization
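To make the ripple effect concrete, here is a minimal standalone sketch. It is not the policyengine_core implementation: CounterSim and counter_random are hypothetical names, and the standard library's random.Random stands in for the NumPy generator so the example has no dependencies. Only the seed formula is taken from the line quoted above.

```python
import random as stdlib_random


class CounterSim:
    """Hypothetical stand-in for a simulation holding the global call counter."""

    def __init__(self):
        self.count_random_calls = 0


def counter_random(sim, entity_ids):
    # Same seed formula as quoted above: one shared counter for every call.
    sim.count_random_calls += 1
    seeds = [abs(i * 100 + sim.count_random_calls) for i in entity_ids]
    # Assumption: random.Random replaces the NumPy generator for illustration.
    return [stdlib_random.Random(seed).random() for seed in seeds]


entity_ids = [1, 2, 3]

# Baseline: variable B is the first (and only) caller of random().
baseline_sim = CounterSim()
b_alone = counter_random(baseline_sim, entity_ids)

# Reform: an unrelated variable A now calls random() before B does.
reform_sim = CounterSim()
_a_draws = counter_random(reform_sim, entity_ids)  # variable A
b_after_a = counter_random(reform_sim, entity_ids)  # variable B

# B's draws shift even though nothing about B itself changed.
ripple_observed = b_alone != b_after_a
```

In this sketch, B's seeds move from 101, 201, 301 to 102, 202, 302 purely because A was evaluated first, which is the noise shift described in the list above.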

Proposed Solution: Name-Based Salting

Replace the global counter with the variable name (accessible via population.simulation.tracer.stack[-1]["name"]):

base_seed = stable_hash(f"{variable_name}:{per_variable_call_count}")
seeds = entity_ids ^ base_seed

Benefits:

  • Order-independent: Adding/removing variables doesn't affect others
  • True reproducibility: The same variable, per-variable call index, and entity ID always yield the same value
  • Parallelizable: No global state to synchronize
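The proposal and its benefits could be sketched roughly as follows. Everything here is an assumption made for illustration: stable_hash is implemented with SHA-256 (the issue does not say how it would be defined), SketchSimulation mimics the tracer stack with a plain list of {"name": ...} frames, the variable names are made up, and random.Random again stands in for the NumPy generator.

```python
import hashlib
import random as stdlib_random
from collections import defaultdict


def stable_hash(text):
    # Assumed implementation: first 8 bytes of SHA-256 as an unsigned integer.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class SketchSimulation:
    """Hypothetical stand-in with a tracer-like stack and per-variable counts."""

    def __init__(self):
        self.stack = []  # mimics tracer.stack: a list of {"name": ...} frames
        self.calls_per_variable = defaultdict(int)


def salted_random(sim, entity_ids):
    # The salt is the name of the variable currently being computed.
    variable_name = sim.stack[-1]["name"]
    call_index = sim.calls_per_variable[variable_name]
    sim.calls_per_variable[variable_name] += 1
    base_seed = stable_hash(f"{variable_name}:{call_index}")
    seeds = [i ^ base_seed for i in entity_ids]
    return [stdlib_random.Random(seed).random() for seed in seeds]


def compute(sim, variable_name, entity_ids):
    # Toy evaluation of one variable: push a frame, draw, pop the frame.
    sim.stack.append({"name": variable_name})
    try:
        return salted_random(sim, entity_ids)
    finally:
        sim.stack.pop()


entity_ids = [1, 2, 3]

sim_one = SketchSimulation()
alone = compute(sim_one, "takes_up_benefit", entity_ids)

sim_two = SketchSimulation()
compute(sim_two, "unrelated_variable", entity_ids)
after_other = compute(sim_two, "takes_up_benefit", entity_ids)

# The draws depend only on the variable name, call index, and entity ID.
order_independent = alone == after_other
```

A digest-based hash is assumed because Python's built-in hash() for strings is randomized per process by default, so it would not give the same seed across runs. The per-variable call index keeps repeated calls inside one variable distinct without reintroducing a counter shared across variables.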

Breaking Change

This will change the random values in every existing simulation that uses random(). Downstream packages (policyengine-us, policyengine-uk) will see different take-up modeling results.

Questions for Maintainers

  1. Is this change acceptable given the breaking nature?
  2. Should we provide a legacy_random() for transition?
  3. Any concerns about the tracer stack approach?
