Changed disclosure checks to include NAs for cross-tabulations. #456

Open
anarchodoc wants to merge 1 commit into datashield:master from anarchodoc:master

anarchodoc commented Feb 6, 2026

This PR fixes a disclosure bug in which a table cell with a count below the threshold can be recoded to NA and then revealed in a cross-tabulation. The fix is to include the NA cells in the disclosure test.
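To illustrate the intent of the fix, here is a minimal sketch in base R. It is not the actual tableDS code: `cell_counts_ok` and its `nfilter` argument are hypothetical names used only for this example, and the data are made up.

```r
# Hypothetical helper, not part of dsBase: returns TRUE when no cell,
# including the NA row and the NA column, has a non-zero count below nfilter.
cell_counts_ok <- function(rvar, cvar, nfilter = 3) {
  counts <- table(rvar, cvar, useNA = "always")
  !any(counts > 0 & counts < nfilter)
}

# Toy data: two records coded 9 have been recoded to NA.
original <- c(rep(1, 10), rep(2, 8), 9, 9)
recoded  <- c(rep(0, 10), rep(1, 8), NA, NA)

cell_counts_ok(original, recoded)  # FALSE: the (9, NA) cell holds 2 records
```

Because the NA row and column are treated like any other category, recoding a small category to NA no longer takes it out of the test.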

In the following example, the variable 'sex_bin' cannot be tabulated because at least one of its categories has a non-zero count below the filter value (nfilter.tab = 3):

> ds.table("D$sex_bin")
  Aggregated (exists("sex_bin", D)) [====================================================] 100% / 1s
  Aggregated (asFactorDS1("D$sex_bin")) [================================================] 100% / 1s
  Aggregated (tableDS(rvar.transmit = "D$sex_bin", cvar.transmit = NULL, stvar.transmit = NULL, )...

All studies failed for reasons identified below

Study1: Failed: at least one cell has a non-zero count less than nfilter.tab i.e. 3

Study2: Failed: at least one cell has a non-zero count less than nfilter.tab i.e. 3

Study3: Failed: at least one cell has a non-zero count less than nfilter.tab i.e. 3

$validity.message
[1] "All studies failed for reasons identified below"

$error.messages
$error.messages$COHORT1
[1] "Failed: at least one cell has a non-zero count less than nfilter.tab i.e. 3"

$error.messages$COHORT2
[1] "Failed: at least one cell has a non-zero count less than nfilter.tab i.e. 3"

$error.messages$COHORT3
[1] "Failed: at least one cell has a non-zero count less than nfilter.tab i.e. 3"

>

So I recode the variable. I know it has 3 categories (1, 2 and 9), and I suspect 9 is the one with the small count:

# Convert to numeric
ds.asNumeric(x.name="D$sex_bin",
             newobj = "sex_bin.n",
             datasources = working)
# Recode 1 to 0, 2 to 1 and 9 to NA
ds.recodeValues(
   var.name = "sex_bin.n",
   values2replace = c(1,2,9),
   new.values.vector = c(0,1,NA),
   newobj = "sex_bin.n",
   datasources = working)

# Append the recoded variable to the main working data frame D
ds.dataFrame(x=c("D","sex_bin.n"),
             newobj = "D",
             datasources = working)

Now we can cross-tabulate the original and the recoded variable, and the call succeeds:

> ds.table("D$sex_bin","D$sex_bin.n")
  Aggregated (exists("sex_bin", D)) [====================================================] 100% / 1s
  Aggregated (exists("sex_bin.n", D)) [==================================================] 100% / 1s
  Aggregated (asFactorDS1("D$sex_bin")) [================================================] 100% / 1s
  Aggregated (asFactorDS1("D$sex_bin.n")) [==============================================] 100% / 1s
  Aggregated (tableDS(rvar.transmit = "D$sex_bin", cvar.transmit = "D$sex_bin.n", ) [====] 100% / 1s

Data in all studies were valid

Study1: No errors reported from this study

Study2: No errors reported from this study

Study3: No errors reported from this study

$output.list
$output.list$TABLE.STUDY.COHORT1_row.props
         D$sex_bin.n
D$sex_bin   0   1  NA
       1    1   0   0
       2    0   1   0
       9    0   0   1
       NA NaN NaN NaN

$output.list$TABLE.STUDY.COHORT1_col.props
         D$sex_bin.n
D$sex_bin 0 1 NA
       1  1 0  0
       2  0 1  0
       9  0 0  1
       NA 0 0  0

$output.list$TABLE.STUDY.COHORT2_row.props
         D$sex_bin.n
D$sex_bin   0   1  NA
       1    1   0   0
       2    0   1   0
       9    0   0   1
       NA NaN NaN NaN

$output.list$TABLE.STUDY.COHORT2_col.props
         D$sex_bin.n
D$sex_bin 0 1 NA
       1  1 0  0
       2  0 1  0
       9  0 0  1
       NA 0 0  0

$output.list$TABLE.STUDY.COHORT3_row.props
         D$sex_bin.n
D$sex_bin   0   1  NA
       1    1   0   0
       2    0   1   0
       9    0   0   1
       NA NaN NaN NaN

$output.list$TABLE.STUDY.COHORT3_col.props
         D$sex_bin.n
D$sex_bin 0 1 NA
       1  1 0  0
       2  0 1  0
       9  0 0  1
       NA 0 0  0

$output.list$TABLES.COMBINED_all.sources_row.props
         D$sex_bin.n
D$sex_bin   0   1  NA
       1    1   0   0
       2    0   1   0
       9    0   0   1
       NA NaN NaN NaN

$output.list$TABLES.COMBINED_all.sources_col.props
         D$sex_bin.n
D$sex_bin 0 1 NA
       1  1 0  0
       2  0 1  0
       9  0 0  1
       NA 0 0  0

$output.list$TABLE_STUDY.COHORT1_counts
         D$sex_bin.n
D$sex_bin    0    1 NA
       1  3751    0  0
       2     0 3180  0
       9     0    0  2
       NA    0    0  0

$output.list$TABLE_STUDY.COHORT2_counts
         D$sex_bin.n
D$sex_bin    0   1 NA
       1  1107   0  0
       2     0 983  0
       9     0   0  2
       NA    0   0  0

$output.list$TABLE_STUDY.COHORT3_counts
         D$sex_bin.n
D$sex_bin    0    1 NA
       1  1750    0  0
       2     0 1638  0
       9     0    0  1
       NA    0    0  0

$output.list$TABLES.COMBINED_all.sources_counts
         D$sex_bin.n
D$sex_bin    0    1 NA
       1  6608    0  0
       2     0 5801  0
       9     0    0  5
       NA    0    0  0


$validity.message
[1] "Data in all studies were valid"

> 

The counts now show that category 9 has fewer individuals than the filter value (3) in every study, but the disclosure check does not catch this because those individuals now sit in the NA column. Specifically, there are 2 in COHORT1, 2 in COHORT2 and 1 in COHORT3, giving a total of 5 subjects with value 9 in the original variable.
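The effect can be reproduced locally in base R with the COHORT1 counts above. This is only a sketch: it assumes the current check looks at the non-NA cells only, which is what the output suggests but is not taken from the tableDS source, and the variable names are illustrative.

```r
nfilter <- 3

# Simulated COHORT1 data: 3751 coded 1, 3180 coded 2, 2 coded 9 (recoded to NA).
sex_bin   <- c(rep(1, 3751), rep(2, 3180), rep(9, 2))
sex_bin_n <- c(rep(0, 3751), rep(1, 3180), rep(NA, 2))

counts <- table(sex_bin, sex_bin_n, useNA = "always")

# Assumed current behaviour: only cells in non-NA rows and columns are tested.
non_na <- counts[!is.na(rownames(counts)), !is.na(colnames(counts)), drop = FALSE]
flag_without_na <- any(non_na > 0 & non_na < nfilter)  # FALSE: small cell missed

# Intended behaviour: every cell is tested, NA row and column included.
flag_with_na <- any(counts > 0 & counts < nfilter)     # TRUE: (9, NA) cell is 2
```

With the NA cells excluded the table passes the check; with them included it fails, which is the behaviour this PR aims for.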

anarchodoc marked this pull request as ready for review February 6, 2026 23:18