Skip to content

Conversation

@batmnnn
Copy link

@batmnnn batmnnn commented Feb 8, 2026

Add serialize/deserialize methods to ThetaSketch with Java-compatible compact format for cross-platform interoperability.

Changes:

  • Add theta/serialization.rs with binary format constants
  • Add ThetaSketch::serialize() for compact format output
  • Add ThetaSketch::deserialize() and deserialize_with_seed()
  • Add ThetaHashTable::seed() and from_entries() helpers
  • Add comprehensive serialization tests

Addresses serialization requirement from #30

Add serialize/deserialize methods to ThetaSketch with Java-compatible
compact format for cross-platform interoperability.

Changes:
- Add theta/serialization.rs with binary format constants
- Add ThetaSketch::serialize() for compact format output
- Add ThetaSketch::deserialize() and deserialize_with_seed()
- Add ThetaHashTable::seed() and from_entries() helpers
- Add comprehensive serialization tests

Addresses serialization requirement from apache#30
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds binary (Java-compatible) compact serialization/deserialization support for ThetaSketch to enable cross-platform interoperability (per Iceberg/Puffin Theta v1 blob requirements in #30).

Changes:

  • Introduces Theta compact-format constants (theta/serialization.rs) and wires the module into theta::mod.
  • Adds ThetaSketch::serialize(), deserialize(), and deserialize_with_seed() implementing the compact Theta format.
  • Adds ThetaHashTable::seed() and ThetaHashTable::from_entries() plus new serialization-focused test coverage.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
datasketches/src/theta/sketch.rs Adds ThetaSketch serialize/deserialize logic using the compact binary format.
datasketches/src/theta/serialization.rs Defines compact-format constants/flags used by Theta serialization.
datasketches/src/theta/mod.rs Registers the new serialization module.
datasketches/src/theta/hash_table.rs Adds seed accessor and helper to rebuild a hash table from deserialized entries.
datasketches/tests/theta_serialization_test.rs Adds round-trip and format-level serialization tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +307 to +335
pub fn from_entries(lg_nom_size: u8, seed: u64, theta: u64, entries: Vec<u64>) -> Self {
let lg_max_size = lg_nom_size + 1;
let lg_cur_size = lg_max_size; // Use max size for deserialized tables
let num_entries = entries.len();

// Rebuild hash table from compact entries
let table_size = 1usize << lg_cur_size;
let mut table_entries = vec![0u64; table_size];

for entry in &entries {
if *entry != 0 {
if let Some(idx) = Self::find_in_entries(&table_entries, *entry, lg_cur_size) {
table_entries[idx] = *entry;
}
}
}

Self {
lg_cur_size,
lg_nom_size,
lg_max_size,
resize_factor: ResizeFactor::X8, // Default for deserialized
sampling_probability: 1.0, // Unknown, assume 1.0
theta,
hash_seed: seed,
entries: table_entries,
num_entries,
}
}
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from_entries sets num_entries to entries.len() even if some values are skipped (0s) or fail insertion (e.g., table full) or are duplicates. This can make the table internally inconsistent and leads to incorrect estimate()/iteration behavior. Consider validating the input (no 0s, no duplicates, len <= capacity) and computing num_entries from the actual number of inserted uniques; if insertion fails, return an error to the caller instead of silently dropping entries.

Copilot uses AI. Check for mistakes.
Comment on lines +386 to +388
let preamble_longs = cursor.read_u8().map_err(make_error("preamble_longs"))?;
let serial_version = cursor.read_u8().map_err(make_error("serial_version"))?;
let family_id = cursor.read_u8().map_err(make_error("family_id"))?;
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deserialize_with_seed reads preamble_longs but never validates it against the supported compact Theta formats. As written, values like preamble_longs=1 with FLAG_IS_EMPTY unset will be parsed as non-empty and can misinterpret the payload. Please validate preamble_longs (e.g., require 1 for empty, 2 for exact, 3 for estimation) and return Error::invalid_preamble_longs(...) (or equivalent) on unexpected values.

Copilot uses AI. Check for mistakes.
let family_id = cursor.read_u8().map_err(make_error("family_id"))?;
let lg_k = cursor.read_u8().map_err(make_error("lg_k"))?;
let _lg_arr = cursor.read_u8().map_err(make_error("lg_arr"))?;
let flags = cursor.read_u8().map_err(make_error("flags"))?;
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flags byte is read but endianness is not enforced. If FLAG_IS_BIG_ENDIAN is set, the current code will still parse all multi-byte fields as little-endian and produce incorrect results. Either reject sketches with the big-endian flag set (return an InvalidData error) or branch reads to use big-endian decoding when the flag is present.

Suggested change
let flags = cursor.read_u8().map_err(make_error("flags"))?;
let flags = cursor.read_u8().map_err(make_error("flags"))?;
// Enforce endianness: this implementation only supports little-endian sketches.
if (flags & FLAG_IS_BIG_ENDIAN) != 0 {
return Err(Error::new(
ErrorKind::InvalidData,
"big-endian ThetaSketch serialization is not supported",
));
}

Copilot uses AI. Check for mistakes.
Comment on lines +436 to +452
// Read retained count (bytes 8-11)
let num_entries = cursor.read_u32_le().map_err(make_error("num_entries"))? as usize;
let _padding = cursor.read_u32_le().map_err(make_error("padding"))?;

// Read theta if in estimation mode (preamble_longs >= 3)
let theta = if preamble_longs >= PREAMBLE_LONGS_ESTIMATION {
cursor.read_u64_le().map_err(make_error("theta"))?
} else {
MAX_THETA
};

// Read hash entries
let mut entries = Vec::with_capacity(num_entries);
for _ in 0..num_entries {
let hash = cursor.read_u64_le().map_err(make_error("hash_entry"))?;
entries.push(hash);
}
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_entries is taken directly from the input and used to pre-allocate (Vec::with_capacity) and to drive reads, without any upper bound derived from lg_k/format. A malformed input can request an extremely large allocation and trigger OOM. Add sanity checks (e.g., num_entries <= 1<<lg_k and bytes.len() is at least preamble_longs*8 + num_entries*8) before allocating/looping.

Copilot uses AI. Check for mistakes.
} else {
MAX_THETA
};

Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

theta is accepted without validation. If it is 0 (or > MAX_THETA), estimate() can divide by zero (or compute nonsensical results). Validate that 0 < theta && theta <= MAX_THETA when preamble_longs indicates an estimation sketch; otherwise return InvalidData.

Suggested change
// Validate theta for estimation sketches to avoid division by zero or invalid estimates
if preamble_longs >= PREAMBLE_LONGS_ESTIMATION && (theta == 0 || theta > MAX_THETA) {
return Err(Error::new(
ErrorKind::InvalidData,
format!("theta {} is out of range (0, {}]", theta, MAX_THETA),
));
}

Copilot uses AI. Check for mistakes.
@tisonkun
Copy link
Member

This can be for the same purpose of #77.

I'll review both of these in this week.

@ZENOTME maybe you two can coordinate with the impl together.

@ZENOTME
Copy link
Contributor

ZENOTME commented Feb 10, 2026

This can be for the same purpose of #77.

I'll review both of these in this week.

@ZENOTME maybe you two can coordinate with the impl together.

Thanks @batmnnn! I think this PR overlaps with #77. Following the C++ and Java implementations, the serialization is done by the compact theta sketch, which I’ve already implemented in #77. #77 also implement compress serialize/deserialize. Maybe we can just keep #77. Welcome any review and suggestions on #77.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants