-
Notifications
You must be signed in to change notification settings - Fork 16
feat: add ThetaSketch serialization support #87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add serialize/deserialize methods to ThetaSketch with Java-compatible compact format for cross-platform interoperability. Changes: - Add theta/serialization.rs with binary format constants - Add ThetaSketch::serialize() for compact format output - Add ThetaSketch::deserialize() and deserialize_with_seed() - Add ThetaHashTable::seed() and from_entries() helpers - Add comprehensive serialization tests Addresses serialization requirement from apache#30
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Adds binary (Java-compatible) compact serialization/deserialization support for ThetaSketch to enable cross-platform interoperability (per Iceberg/Puffin Theta v1 blob requirements in #30).
Changes:
- Introduces Theta compact-format constants (
theta/serialization.rs) and wires the module intotheta::mod. - Adds
ThetaSketch::serialize(),deserialize(), anddeserialize_with_seed()implementing the compact Theta format. - Adds
ThetaHashTable::seed()andThetaHashTable::from_entries()plus new serialization-focused test coverage.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| datasketches/src/theta/sketch.rs | Adds ThetaSketch serialize/deserialize logic using the compact binary format. |
| datasketches/src/theta/serialization.rs | Defines compact-format constants/flags used by Theta serialization. |
| datasketches/src/theta/mod.rs | Registers the new serialization module. |
| datasketches/src/theta/hash_table.rs | Adds seed accessor and helper to rebuild a hash table from deserialized entries. |
| datasketches/tests/theta_serialization_test.rs | Adds round-trip and format-level serialization tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pub fn from_entries(lg_nom_size: u8, seed: u64, theta: u64, entries: Vec<u64>) -> Self { | ||
| let lg_max_size = lg_nom_size + 1; | ||
| let lg_cur_size = lg_max_size; // Use max size for deserialized tables | ||
| let num_entries = entries.len(); | ||
|
|
||
| // Rebuild hash table from compact entries | ||
| let table_size = 1usize << lg_cur_size; | ||
| let mut table_entries = vec![0u64; table_size]; | ||
|
|
||
| for entry in &entries { | ||
| if *entry != 0 { | ||
| if let Some(idx) = Self::find_in_entries(&table_entries, *entry, lg_cur_size) { | ||
| table_entries[idx] = *entry; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| Self { | ||
| lg_cur_size, | ||
| lg_nom_size, | ||
| lg_max_size, | ||
| resize_factor: ResizeFactor::X8, // Default for deserialized | ||
| sampling_probability: 1.0, // Unknown, assume 1.0 | ||
| theta, | ||
| hash_seed: seed, | ||
| entries: table_entries, | ||
| num_entries, | ||
| } | ||
| } |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from_entries sets num_entries to entries.len() even if some values are skipped (0s) or fail insertion (e.g., table full) or are duplicates. This can make the table internally inconsistent and leads to incorrect estimate()/iteration behavior. Consider validating the input (no 0s, no duplicates, len <= capacity) and computing num_entries from the actual number of inserted uniques; if insertion fails, return an error to the caller instead of silently dropping entries.
| let preamble_longs = cursor.read_u8().map_err(make_error("preamble_longs"))?; | ||
| let serial_version = cursor.read_u8().map_err(make_error("serial_version"))?; | ||
| let family_id = cursor.read_u8().map_err(make_error("family_id"))?; |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deserialize_with_seed reads preamble_longs but never validates it against the supported compact Theta formats. As written, values like preamble_longs=1 with FLAG_IS_EMPTY unset will be parsed as non-empty and can misinterpret the payload. Please validate preamble_longs (e.g., require 1 for empty, 2 for exact, 3 for estimation) and return Error::invalid_preamble_longs(...) (or equivalent) on unexpected values.
| let family_id = cursor.read_u8().map_err(make_error("family_id"))?; | ||
| let lg_k = cursor.read_u8().map_err(make_error("lg_k"))?; | ||
| let _lg_arr = cursor.read_u8().map_err(make_error("lg_arr"))?; | ||
| let flags = cursor.read_u8().map_err(make_error("flags"))?; |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The flags byte is read but endianness is not enforced. If FLAG_IS_BIG_ENDIAN is set, the current code will still parse all multi-byte fields as little-endian and produce incorrect results. Either reject sketches with the big-endian flag set (return an InvalidData error) or branch reads to use big-endian decoding when the flag is present.
| let flags = cursor.read_u8().map_err(make_error("flags"))?; | |
| let flags = cursor.read_u8().map_err(make_error("flags"))?; | |
| // Enforce endianness: this implementation only supports little-endian sketches. | |
| if (flags & FLAG_IS_BIG_ENDIAN) != 0 { | |
| return Err(Error::new( | |
| ErrorKind::InvalidData, | |
| "big-endian ThetaSketch serialization is not supported", | |
| )); | |
| } |
| // Read retained count (bytes 8-11) | ||
| let num_entries = cursor.read_u32_le().map_err(make_error("num_entries"))? as usize; | ||
| let _padding = cursor.read_u32_le().map_err(make_error("padding"))?; | ||
|
|
||
| // Read theta if in estimation mode (preamble_longs >= 3) | ||
| let theta = if preamble_longs >= PREAMBLE_LONGS_ESTIMATION { | ||
| cursor.read_u64_le().map_err(make_error("theta"))? | ||
| } else { | ||
| MAX_THETA | ||
| }; | ||
|
|
||
| // Read hash entries | ||
| let mut entries = Vec::with_capacity(num_entries); | ||
| for _ in 0..num_entries { | ||
| let hash = cursor.read_u64_le().map_err(make_error("hash_entry"))?; | ||
| entries.push(hash); | ||
| } |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num_entries is taken directly from the input and used to pre-allocate (Vec::with_capacity) and to drive reads, without any upper bound derived from lg_k/format. A malformed input can request an extremely large allocation and trigger OOM. Add sanity checks (e.g., num_entries <= 1<<lg_k and bytes.len() is at least preamble_longs*8 + num_entries*8) before allocating/looping.
| } else { | ||
| MAX_THETA | ||
| }; | ||
|
|
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
theta is accepted without validation. If it is 0 (or > MAX_THETA), estimate() can divide by zero (or compute nonsensical results). Validate that 0 < theta && theta <= MAX_THETA when preamble_longs indicates an estimation sketch; otherwise return InvalidData.
| // Validate theta for estimation sketches to avoid division by zero or invalid estimates | |
| if preamble_longs >= PREAMBLE_LONGS_ESTIMATION && (theta == 0 || theta > MAX_THETA) { | |
| return Err(Error::new( | |
| ErrorKind::InvalidData, | |
| format!("theta {} is out of range (0, {}]", theta, MAX_THETA), | |
| )); | |
| } |
Thanks @batmnnn! I think this PR overlaps with #77. Following the C++ and Java implementations, the serialization is done by the compact theta sketch, which I’ve already implemented in #77. #77 also implement compress serialize/deserialize. Maybe we can just keep #77. Welcome any review and suggestions on #77. |
Add serialize/deserialize methods to ThetaSketch with Java-compatible compact format for cross-platform interoperability.
Changes:
Addresses serialization requirement from #30