Stat Database Sec78

download Stat Database Sec78

of 42

Transcript of Stat Database Sec78

  • 7/28/2019 Stat Database Sec78

    1/42

    Security-Control Methods for Statistical Databases:A Comparative StudyNABIL R. ADAMRutgers, The State University of New Jersey, Newark, New Jersey 07102JOHN C. WORTMANNDepartment of Zndustrial Engineering & Management Science, Eindhoven University of Technology,The Netherlands

    This paper considers the problem of providing security to statistical databases againstdisclosure of confidential information. Security-control methods suggested in theliterature are classified into four general approaches: conceptual, query restriction, dataperturbation, and output perturbation.

    Criteria for evaluating the performance of the various security-control methods areidentified. Security-control methods that are based on each of the four approaches arediscussed, together with their performance with respect to the identified evaluationcriteria. A detailed comparative analysis of the most promising methods for protectingdynamic-online statistical databases is also presented.

    To date no single security-control method prevents both exact and partial disclosures.There are, however, a few perturbation-based methods that prevent exact disclosure andenable the database administrator to exercise statistical disclosure control. Some ofthese methods, however introduce bias into query responses or suffer from the O/l query-set-size problem (i.e., partial disclosure is possible in case of null query set or a query setof size 1).We recommend directing future research efforts toward developing new methods thatprevent exact disclosure and provide statistical-disclosure control, while at the same timedo not suffer from the bias problem and the O/l query-set-size problem. Furthermore,efforts directed toward developing a bias-correction mechanism and solving the generalproblem of small query-set-size would help salvage a few of the current perturbation-based methods.Categories and Subjec t Descriptors: H.2.0 [Database Management]: General-security,integrity, and protectionGeneral Terms: Protection, SecurityAdditional Key Words and Phrases: Compromise, controls, disclosure, inference, security

    INTRODUCTION the real world that is modeled), attributes(characteristics of the entities), and rela-A database consists of a model of some tionships among the different entities.part of the real world. Such a model is made Entities with identical attributes consti-up of entities (the elements of the part of tute a particular entity type. In a hospitalSupported by a grant from Rutgers GSM Research Resources Committee.Permission to copy without fee all or part of this material is granted provided that the copies are not made ordistributed for direct commercial advantage, the ACM copyright notice and the title of the publication and itsdate appear, and notice is given that copying is by permission of the Association for Computing Machinery. Tocopy otherwise, or to republish, requires a fee and/or specific permission.0 1989 ACM 0360-0300/89/1200-0515 $01.50

    ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    2/42

    516 l Adam and WortmannCONTENTS

    INTRODUCTIONThe Security Problem of Statistical DatabasesOverview of Solution ApproachesTypes of Statistical Databases and Computersystems1. EVALUATION CRITERIA2. CONCEPTUAL APPROACH2.1 The Conceptual Model2.2 The Lattice Model3. QUERY RESTRICTION APPROACH3.1 Query-Set-Size Control3.2 Query-Set-Overlap Control3.3 Auditing3.4 Partitioning

    3.5 Cell Suppression4. DATA PERTURBATION4.1 The Bias Problem4.2 Probability Distribution4.3 Fixed-Data Perturbation5. OUTPUT-PERTURBATION APPROACH5.1 Random-Sample Queries5.2 Varying-Output Perturbation5.3 Rounding6. COMPARATIVE ANALYSIS OF THESECURITY-CONTROL METHODS6.1 Security Criterion for the COUNT Query6.2 Precision Criterion for the COUNT Query6.3 Security Criterion for the SUM Query6.4 Precision Criterion for the SUM Query6.5 Consistency Criterion6.6 Robustness Criterion6.7 Cost Criterion6.8 Combination of the Traub et al. Method withOther MethodsI. NEW TYPES OF THREATS8. CONCLUSIONSACKNOWLEDGMENTSREFERENCES

    database, for example, patients and treat-ments are system entities, and name, SocialSecurity number, and diagnosis type areattributes of the entity type patient.A statistical database (SDB) system is adatabase system that enables its users toretrieve only aggregate statistics (e.g., sam-ple mean and count) for a subset of theentities represented in the database. Anexample is Ghoshs [1984,1985] descriptionof an SDB that is made up of test data fora manufacturing process. Another exampleis the database maintained by the U.S.Census Bureau. These examples are spe-cial-purpose databases, since providing

    aggregate statistics is their only purpose. Inother situations, a single database mayserve multiple applications, including a sta-tistical application. A hospital database, forexample, might be used by physicians tosupport their medical work as well as bystatistical researchers of the NationalHealth Council. In that case, the statisticalresearchers are authorized to retrieve onlyaggregate statistics; the physicians, on theother hand, can retrieve anything from thedatabase.Many government agencies, businesses,and nonprofit organizations need to collect,analyze, and report data about individualsin order to support their short-term andlong-term planning activities. SDBs there-fore contain confidential information suchas income, credit ratings, type of disease,or test scores of individuals. Such data aretypically stored online and analyzed usingsophisticated database management sys-tems (DBMS) and software packages. Onthe one hand, such database systems areexpected to satisfy user requests of aggre-gate statistics related to nonconfidentialand confidential attributes. On the otherhand, the system should be secure enoughto guard against a users ability to infer anyconfidential information related to a spe-cific individual represented in the database.The problem here is the inevitable conflictbetween the individuals right to privacyand the societys need to know and pro-cess information [Palley 1986; Palley andSimonoff 19871. Dalenius [ 19741 presentsan overview of this problem as does thefollowing citation from Miller [1971, p.1361:

    Some deficiencies inevitably crop up even in theCensus Bureau. In 1963, for example, it reportedlyprovided the American Medical Association with astatistical list of one hundred and eight-eight doc-tors residing in Illinois. The list was broken downinto more than two dozen categories, and each cat-egory was further subdivided by medical specialtyand area residence; as a result, identification ofindividual doctors was possible . . . .The recent proliferation of computer-

    ized-information systems has added to thegrowing public concern about threats toindividuals privacy. It is therefore not sur-prising that the problem of securing SDBsACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    3/42

    Security Control Method For Databases 517has become an important one in recentyears. As we move further into the infor-mation age and see expert and knowledge-based systems used in conjunction withSDBs, the security problem is expectedto become even more important. True-blood [ 19841, for example, illustrates howknowledge-based and expert systems coulduse nonconfidential information to inferconfidential information.The following example of a hospital da-tabase discusses the problem of securing anSDB. The database contains these dataabout patients:(Age, Sex, Employer, Social Security

    Number, Diagnosis Type)In the hospital environment, physiciansmay be given access to patients entire med-ical records, whereas statistical researchersmay only be allowed to obtain aggregatestatistics for subsets of the patient popu-lation. A subset of patients whose data areincluded in the computation of the responseto a query is referred to as the query set.Statistics are calculated for subsets of pa-tients having common attribute values(e.g., Age = 42 and Sex = male). Such asubset can be specified by a characteristicformula, C, which is a logical formulaover the values of the attributes using theBoolean operators AND (&), OR (+), andNOT (1). For example,

    C = (Age = 42) & (Sex = Male)& (Employer = ABC)

    is a characteristic formula that specifies thesubset of male patients, age 42, employedby the ABC company. The size of the queryset that corresponds to a characteristic for-mula C is denoted by ] C 1.Suppose there is a malevolent researcherwho wants to obtain information about thediagnosis type of a given patient, Mr. X. Amalevolent user who wants to compromisethe database is referred to as a snooper.In our example, assume that the snooperknows the age and employer of Mr. X. Hecan then issue the queryQl: COUNT (Age = 42) & (Sex = Male) &(Employer = ABC).

    If the answer is 1, the snooper has locatedMr. X and can then issue such queries asQ2: COUNT (Age = 42) & (Sex = Male) &(Employer = ABC) & (Diagnosis Type

    = Schizophrenia).If the answer to Q2 is 1, the database issaid to be positively compromised and theuser is able to infer that Mr. X has thediagnosis type schizophrenia. If the answeris 0, the database is said to be partiallycompromised, because the user was able toinfer that the diagnosis type of Mr. X is notschizophrenia. Partial compromise refers tothe situation in which some inference abouta confidential attribute of an entity can bemade, even if the exact value cannot bedetermined. It may take the form of a neg-ative compromise, that is, it is inferred thatan attribute of a certain entity does not liewithin a given range.As we have seen from the above example,there are basically three types of authorizedusers: the nonstatistical user (e.g., the phy-sician in the hospital example) who is au-thorized to issue queries and update thedatabase, the researcher who is authorizedto retrieve aggregate statistics from the da-tabase, and the snooper who is interestedin compromising the database. There isalso a database administrator (DBA) who,by definition, is the guardian of thedatabase. In the remainder of this paperwe will use the terms user and researcherinterchangeably.In this paper we assume that the follow-ing elements of access control [Turn andShapiro 19781 have been implemented andare effective in preventing unauthorizedaccess to the system:

    Authorization of people to access thecomputer facility and the database sys-tem.Identification of a person seeking accessto the computer facility and the databasesystem.Authentication of the users identity andaccess authorization.As such we will use the term snooper torefer to a person who is making improperuse of the data normally available to him

    ACM Comput ing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    4/42

    518 . Adam and Wortmannor her as an authorized user. Usually it isassumed that a snooper has supplementary(i.e., a priori) knowledge about a target. Forexample, in queries Ql and Q2, the snooperknows a priori that the target is male, 42years old, and employed by the ABC Com-pany. Assumptions regarding the a prioriknowledge of the snooper are crucial tothe development of an effective security-control method: The more the security-control method is aware of each userssupplementary knowledge, the more effec-tive it will be in reducing the likelihood ofcompromising the database.An SDB that serves multiple applica-tions can be structured as a hierarchical,network, or relational database. An SDBwhose only purpose is to provide aggregatestatistics can be structured in a tabularform. The relational and tabular forms ofSDBs are widely discussed in the literature[Ghosh 19861.In the relational form, each entity in thereal world is represented by a tuple con-sisting of attribute values of that entity. Aset of tuples with similar attributes consti-tutes a relation. Such a relation is usuallydepicted as a two-dimensional table, wherethe rows correspond to tuples and the col-umns correspond to attributes. In statisti-cal analysis, the domain of attribute valuesis often subdivided into classes, which areused as categories. Based on these cate-gories, all entities are classified into ele-mentary cells. If two entities belong to thesame categories for all attributes, they arein the same elementary cell. This processis referred to as microaggregation.If M attributes are used for categoriza-tion, the data can be represented by anM-dimensional table consisting of elemen-tary cells. This type of table representationis called the tabular form. The cells of suchtables contain summary statistics (e.g., thenumber of entities) computed over the en-tities contained in the cell [Denning 19831.Many statistical databases (e.g., censusdata) are published in tabular form.An SDB in relational form is easily trans-lated into an equivalent one in tabularform. The reverse, however, is not true; thatis, it is not always possible to deducethe relational form from the tabular form.

    This is because there is information losswhen attribute values of entities are cate-gorized in order to obtain the tabularrepresentation.The remainder of this section discussesthe problem of securing SDBs and gives abrief overview of the solution approaches.Then Section 1 identifies a set of criteriathat can be used to evaluate the per-formance of a security-control method,and Sections 2 through 5 examine thesecurity-control methods that have beenproposed to date that are related to thesolution approaches discussed in the Intro-duction. A detailed comparative analysis ofthe most promising security-control meth-ods is given in Section 6, followed bya discussion of new types of threats thatare starting to be explored in the literaturein Section 7. Section 8 presents ourconclusions.

    The Security Problem of Statistical DatabasesThe objective of an SDB is to provide re-searchers with aggregate statistics (e.g.,mean and count) about a collection of en-tities while protecting confidentiality ofany individual entity represented in thedatabase. It is the policy of the system asset by the DBA that determines the crite-rion for defining confidential information[Denning and Schlorer 19831.Threats to data security arise from asnoopers attempt to infer some previouslyunknown, confidential data about a givenentity. Threats to security may result inexact or partial disclosure. A disclosure issaid to occur (or, equivalently, an SDB issaid to have been compromised) if throughthe answer to one or more queries a snooperis then able to infer the exact value of (exactdisclosure) or a more accurate estimate ofa confidential attribute of an individualentity. In this paper, we use the termscompromise and disclosure interchangeably.

    Overview of Solution ApproachesSeveral methods for protecting the securityof SDBs have been suggested in the litera-ture. These methods can be classified under

    ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    5/42

    rlDBSDB

    Security Control Method For Databases l 519

    (restricted) Queries4 lG esearcherExact responsesor denialsb

    Dataperturbation

    \ I

    PerturbatedSDB

    responses(b)

    Figure 1. (a) Query set restriction; (b) data perturbation; (c) output perturbation.

    four general approaches: conceptual, queryrestriction, data perturbation, and outputperturbation.Two models are based on the conceptuala,pproach: the conceptual model [Chin andOzsoyoglu 19811 and the lattice model[Denning 1983; Denning and Schlorer19831. The conceptual model provides aframework for investigating the securityproblem at the conceptual-data-model level.The lattice model constitutes a frameworkfor data represented in tabular form. Eachof these models presents a framework forbetter understanding and investigating thesecurity problem of SDBs. Neither presents

    a specific implementation procedure. Adiscussion of these models is given inSection 2.Security-control methods that are basedon the query-restriction approach (see Fig-ure la) provide protection through one ofthe following measures: restricting thequery set size, controlling the overlapamong successive queries by keeping anaudit trail of all answered (or could bededuced) queries for each user, makingcells of small size unavailable to users ofSDBs that are in tabular form, or partition-ing the SDB. Section 3 includes a discus-sion of these methods.

    ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    6/42

    520 . Adam and WortmannData perturbation introduces noise in thedata. The original SDB is typically trans-formed into a modified (perturbed) SDB,which is then made available to researchers(see Figure lb). Section 4 examines each of

    the data-perturbation-based methods.The output-perturbation approach per-turbs the answer to user queries whileleaving the data in the SDB unchanged(see Figure lc). A discussion of the output-perturbation-based methods is given inSection 5.Types of Statistical Databases andComputer Systems

    may not be suitable for dynamic SDBssince the efforts of transforming the origi-nal SDB to the perturbed one may becomeprohibitive.Centralized-DecentralizedIn a centralized SDB there is one database.In a decentralized (distributed) SDB, over-lapping subsets of the database are storedat different sites that are connected bya communication network. A distributeddatabase may be fully replicated, partiallyreplicated, or partitioned. The securityproblem of a distributed SDB is more com-plex than that of a centralized one due

    The nature of the database and the char- to the need to duplicate, at each site, theacteristics of the computer system strongly security-control overhead, as well as theaffect the complexity of the SDB security difficulty of integrating user profiles.problem and the proposed solution ap-proach. The following classification [Turnand Shapiro 19781 reflects the variety ofenvironments.Offline-OnlineIn an online SDB, there is direct real-timeinteraction of a user with the data througha terminal. In an offline SDB, the userneither is in control of data processing norknows when his or her data request is proc-essed. In this mode, protection methodsthat keep track of user profiles becomemore cumbersome. Compromise methodsthat require a large number of queries (e.g.,regression-based-compromise method, seeSection 7) also become more difficult whenworking offline.Static-DynamicA static database is one that never changesafter it has been created. Most census da-

    Dedicated-Shared Computer SystemIn a dedicated SDB, the computer systemis used exclusively to serve SDB applica-tions. In a shared system, the SDB appli-cations run on the same hardware systemwith other applications (possibly using dif-ferent databases). The shared environmentis more difficult to protect, since other ap-plications may be able to interfere with theprotected data directly through the oper-ating system, bypassing the SDB securitymechanism.The above discussion indicates that thedegree of difficulty of securing an SDBagainst a snoopers attempt to infer somepreviously unknown, confidential dataabout an individual entity depends uponwhether the SDB is online or offline, staticor dynamic, centralized or decentralized,and running on a dedicated or sharedcomputer system.1. EVALUATION CRITERIAtabases are static. Whenever a new versionof the database is created, that new version This section discusses the criteria foris considered to be another static database. evaluating the security-control methodsIn contrast, dynamic databases can change investigated in Sections 2 through 5.continuously. This feature can complicate (1) Security is the level of protectionthe security problem considerably, because provided by the control method againstfrequent releases of new versions may en- complete (or exact) and partial disclosure.able snoopers to make use of the differences In perturbation-based methods, partial dis-among the versions in ways that are diffi- closure is referred to as statistical disclo-cult to foresee. Data-perturbation methods sure. Beck [ 19801 suggests that a statistical

    ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    7/42

    Security Control Method For Databases l 521disclosure is said to occur if, using infor-mation from a series of queries, it is pos-sible to obtain a better estimate forconfidential information than was possibleusing only one query. The larger the num-ber of queries required to compromise anSDB partially (or statistically), the moresecure is the SDB.In this paper, we adopted the followingdefinition of compromise. Consider a con-fidential attribute Ai for an individualentity i represented in the database

    ( 1 the entity possesses a givenAi= property (e.g., disease typeAIDS)

    (0 otherwiseorAi is a numerical attribute (e.g., income)An exact compromise is said to take placeif by issuing one or more queries, a user isable to determine that Ai = 1 or its exactvalue (if it is a numerical attribute). Apartial compromise occurs if by issuing oneor more queries, a user is able to determinethat Ai = 0 or to obtain, for the-case ofnumerical attribute, an estimator A; whosevariance satisfies the following:

    Var&)

  • 7/28/2019 Stat Database Sec78

    8/42

    522 . Adam and Wortmannwhere ] SC ] is the number of suppressedcells and TC is the total number of cells.This formula may overestimate the amountof nonconfidential information eliminated.Notice that it is natural in the context ofthe lattice model, where there is no datamanipulation language and all the infor-mation about the model can be character-ized by the cells of the lattice.Bias, precision, and consistency are threecomponents of the statistical quality of theinformation revealed to users. Bias repre-sents the difference between the unper-turbed statistic and the expected valueof its perturbed estimate [Denning andSchlorer 19831. In general, the property ofbeing unbiased is one of the more desiredones in point estimation. We will, there-fore, assume that being unbiased is adesirable property of any security-controlmethod.Precision refers to the variance of theestimators obtained by users. On the onehand, we would like to provide users withas precise information as possible, that is,with an estimator with as low a variance aspossible. On the other hand, we would liketo ensure that an estimator obtained by asnooper would have as high a variance aspossible. Therefore, an effective security-control method is one that enables the DBAto adjust the precision to an appropriatevalue by setting the methods parametersaccordingly. It is also desirable to provideusers with a confidence interval of the es-timated statistic. As will be shown in thenext sections, however, this requirementgives a snooper additional information thatmay enable him or her to find an easy wayto compromise the database.Consistency represents the lack of con-tradictions and paradoxes [Denning andSchlorer 19831. Contradiction arises when,for example, different responses are ob-tained to repetitions of the same query orthe average statistic differs from the com-puted average using the sum and countstatistics. (A difference in answers to rep-etitions of the same queries that is due tochanges in the real world is not, however,considered an inconsistency.) A negativeresponse to a count query is an example ofparadox. Consistency is a desirable featureof any security-control method.

    (7) Cost is made up of three compo-nents. The first is the implementation cost,which represents the effort required by theDBA to implement the security-controlmethod and to determine the required pa-rameters of that method. The second com-ponent, processing overhead per query,measures the CPU time and storage re-quirements of the method during queryprocessing. In an online-dynamic-SDBenvironment, the processing component ofthe cost is a significant factor. The thirdcomponent is the amount of education re-quired to enable users to understand thesecurity-control method so that they canmake effective use of the SDB.2. CONCEPTUAL APPROACH2.1 The Conceptual ModelIn Chin and ijzsoyoglu [1981] and ijzsoy-oglu and Chin [1982] a framework isdescribed for dealing with the securityproblem from the development of theconceptual schema to implementation.Background on this framework can befound in Chins earlier work [1978]. In theconceptual-modeling approach, the popu-lation (a population is a collection of enti-ties that have common attributes) and itsstatistics are all that users can access.There is no data manipulation language(such as relational algebra) allowed tomerge and intersect populations.The framework, which has been analyzedin ijzsoyoglu and Chung [ 19861 and ijzsoy-oglu and Su [ 19851, deals with several issuesthat have not been considered jointly else-where in the literature. The most importantof these issues are as follows:

    (1) An SDB is more than just one filethat has data about one population.An SDB contains information aboutdifferent types of populations (e.g.,patients, treatments, medicines, doc-tors). Moreover, the security controlcomponents of the SDB manage-ment system should be made awareof the relevant subcategories of eachpopulation and the security con-straints with respect to these subcat-egories that must be enforced.

    ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    9/42

    Security Control Method For Databases l 523(2)

    (3)

    (4)

    The dynamics of the database (in-cluding the dynamics of its structureand security constraints) should betaken into account. In particular, dis-closure due to these dynamics shouldbe prevented.Users supplementary knowledgeshould be maintained and kept up todate. The security-control methodcould take such information intoconsideration when responding to auser query.Possible inferences that may lead todisclosure, with or without userssupplementary knowledge, aboutconfidential information should beanalyzed.

    The framework relies on the distinctionbetween the conceptual-data-modelinglevel and the internal level. The frameworkapplies mainly to the conceptual level. Theframework, however, contains a certifiedmain kernel, which deals with all access tothe confidential part of the physical data-base. Within this framework, the securityis basically guaranteed by the introductionof the concept of smallest nondecompos-able subpopulations (referred to as atomicpopulations or A-populations). These A-populations always contain either zero orat least two entities, thus preventing disclo-sure of information about one individualentity. The A-populations correspond toelementary cells in an SDB, which arerepresented in tabular form (see the dis-cussion on the lattice model below). In theconceptual-modeling framework, however,the definition of the A-populations is con-trolled by the database designer (usuallythe DBA). Preventing A-populations fromhaving only one entity due to insertionsand deletions is controlled by either delayedupdate processing or by dummy entities. AsSchlorer [1983] notes, adding dummyentities may introduce bias in reportedstatistics.The framework describes an architec-ture, called the Statistical Security Man-agement Facility, for realizing statisticalsecurity in SDB environments. The partsof this facility are discussed in Sections2.1.1 to 2.1.5.

    2.1.1 Population Definition ConstructEach subpopulation has a correspondingpopulation definition construct that keepstrack of information such as the allowedstatistical query types for each attribute ofthe population, the security constraintsenforced in updates, and the history ofchanges in the (sub)population.2.1.2 User Knowledge ConstructUser Knowledge Construct is a process thatkeeps track of the properties of each usergroup. These properties describe thegroups knowledge from earlier queries aswell as any supplementary knowledge itmay have (e.g., knowledge about confiden-tial attribute values of individual entities).It is worthwhile to note that several of themethods discussed in the next section, thatis, query-overlap control and auditing, pro-pose some kind of user knowledge constructthat is not as elaborate as the one proposedhere.2.1.3 Constraint Enforcer and CheckerConstraint enforcer and checker is the pro-cess that enforces security constraintswhen queries are issued to the system. Itprovides the user knowledge construct withinformation on successfully answered quer-ies (and, therefore, increased knowledge ofa user group). The constraint enforcer andchecker is invoked whenever the DBAneeds to study changes in the security con-straints. It is also invoked whenever userstry to perform inserts, updates, or deletionsof entities in the database. The DBA isinformed by the constraint enforcer andchecker about the security consequences ofsuch transactions.

    2.1.4 Conceptual Model ModificationConceptual model modification is a processthat supports all changes to the conceptualmodel. These changes are first checked ina test mode, where conceptual model mod-ification communicates with the constraintenforcer and checker and reports securityconsequences of the proposed changes tothe DBA.

    ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    10/42

  • 7/28/2019 Stat Database Sec78

    11/42

    Security Control Method For Databases lAGE

    TABLE ASE O-20 21-45 46-65 >65EMPLOYERUNEMPLOYED

    ABC-COMPANYXYZ-INC.

    M 24 2 9 49F 26 0 1 51M 0 1 9 0F 0 16 0 0M 1 20 48 0F 1 0 52 0AGE

    TABLE AS O-20 21-45 46-65 >65SEX M 25 23 66 49F 27 16 53 51

    AGETABLE AE O-20 21-45 46-65 >65EMPLOYERUNEMPLOYEDABC-COMPANYXYZ-INC.

    50 2 10 1000 17 9 02 20 100 0SEX

    TABLE SE M FEMPLOYER TABLE ALL:UNEMPLOYED 84 78

    ABC-COMPANY 10 16 13101XYZ-INC. 69 53AGE

    TABLE A O-20 21-45 46-65 >6552 39 119 100

    SEXM F

    TABLE S 163 147EMPLOYER

    UNEMPLOYED ABC-COMPANY XYZ-INC.TABLE E 162 26 122

    Figure 3. Lattice model of Figure 2 (extension).

    three-dimensional table ASE with dimen- dimensional tables:sions A, S, and E (Figure 3). A particularelementary cell in this table might be, for (1) Table AS, where ASE is aggregatedexample, the cell, where A = 42, S = M, over the dimension E; an example of a celland E = ABC company. This tabular in this case is the one in which A = 42 andform can be aggregated into three two- S = M.

    ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    12/42

    526 l Adam and Wortmann(2) Table AE, where ASE is aggregatedover the dimension S; an example of a cellin this case is the one in which A = 42 andE = ABC company.(3) Table SE, where ASE is aggregatedover the dimension A ; an example of a cellin this case is the one in which S = M andE = ABC company.

    The aggregation performed in obtainingthese three new tables is called microaggre-gation. The process can be repeated in or-der to obtain three one-dimensional tables:(1) Table A, where table AS is aggregatedover the dimension S or table AE is aggre-gated over the dimension E.(2) Table S, where table AS is aggregatedover the dimension A or Table SE is aggre-gated over the dimension E.(3) Table E, where table AE is aggregatedover the dimension A or Table SE is aggre-gated over the dimension S.

    The aggregation can be extended onestep further, where a zero-dimensional ta-ble is obtained. This table contains onlyone cell, providing statistics for the data-base as a whole. Note that the set of alltwo-dimensional tables may sometimes dis-close the elementary cell statistic of thethree-dimensional table.The relationship between tabular dataand SDBs can be defined more formally asfollows: Let the set of attributes of an SDBbe (A,, . . ., AMI. Suppose that for eachattribute Ai, a finite domain of allowablevalues, aii, is given (ji = 1, . . . , ] Ai ( ).Schlorer [ 19831 defines an m-set as a queryset that can be specified using m (but notfewer) attributes. An elementary m-set ischaracterized by a formula of the form

    (AI = aij,)& (AZ = azj,) & . . * 8~ (A, = a,j,,,).

    These elementary m-sets correspond to thecells in tabular data with m dimensions (anm-table).Although in an actual dynamic SDB datawill not always be structured according tosuch tables, elementary m-sets provide aninteresting concept for systematicallystudying the approaches to query restric-

    ACM Computing Surveys, Vol. 21, No. 4, December 1989

    tions. For example, it is shown in Denningand Schlorer [1983] that all m-tables fora given statistic (e.g., COUNT or SUM)constitute a lattice when m ranges from1 to M.3. QUERY RESTRICTION APPROACHFive general methods have been developedto restrict queries: query-set-size control,query-set-overlap control, auditing, cellsuppression, and partitioning. Following isa discussion of these five methods, togetherwith an evaluation of their performancewith respect to the criteria discussed inSection 1. The results are summarized inTable 1. (See also Denning [1982, Chapter61 for an excellent discussion of several ofthese methods and Denning and Schlorer[ 19831 for additional results.)3.1 Query-Set-Size ControlThe query-set-size control method permitsa statistic to be released only if the size ofthe query set ] C ] (i.e., the number of en-tities included in the response to the query)satisfies the condition [Fellegi 1972; Fried-man and Hoffman, 1980; Hoffman andMiller 1970; Schlorer 19751:

    where L is the size of the database (thenumber of entities represented in the da-tabase) and K is a parameter set by theDBA. K should satisfy the conditionOaK+

    It was shown that by using a snooping toolcalled tracker it is possible to compromisethe database even for a value of K that isclose to L/2 [Denning et al. 1979; Denningand Schlorer 1980; Jonge 1983; Schlorer1980; Schwartz et al. 19791. Notice that Kcannot exceed L/2, otherwise no statisticswould ever be released.To illustrate the basic idea of a tracker,consider the following queries where thefemale set is used as a tracker:Q3: q(C) = COUNT (Sex = Female)Q4: 9 (C) = COUNT (Sex = Female + (Age= 42 & Sex = Male & Employer =

    ABC))

  • 7/28/2019 Stat Database Sec78

    13/42

    Security Control Method For Databases l 527Q5: q(C) = COUNT (Sex = Female + (Age= 42 & Sex = Male & Employer =ABC & Diagnosis Type = Schizophre-nia) )

    Suppose that the response to Q3 and Q4are, respectively, A and B, where K % A IL-KandKsBsL-K.IfB=A+l,then the target is uniquely identified by theclause Age = 42 & Sex = Male and Em-ployer = ABC. In this case, the databaseis positively compromised if the responseto Q5 is B and negatively compromised ifthe response to Q5 is A. It is usually easyto find a tracker for a characteristic formulaC [Schlorer 19801.A summary of the performance of thequery-set-size-control method with respectto the evaluation criteria is given in Ta-ble 1. In general, there seems to be a con-sensus in the literature that subverting thequery-set-size-control method is straightforward and cheap [Denning 1982; Traubet al. 19841.

    3.2 Query-Set-Overlap ControlNotice that Q3 and Q4 have a large numberof entities in common. Dobkin et al. [1979]noticed that many compromises use querysets that have a large number of overlap-ping entities. They studied the possibilityof restricting the number of overlappingentities among successive queries of a givenuser. If K denotes the minimum query setsize and r denotes the maximum number ofoverlapping entities allowed between pairsof queries, then according to Dobkin et al.,the number of queries needed for a compro-mise has a lower bound of 1 + (K - 1)/r.Unfortunately, however, for typical valuesof K/r, the lower bound of the number ofqueries needed to compromise the databaseis not a practical hindrance.In practice, query-set-overlap-controlmethod suffers from drawbacks such as[Dobkin et al. 19791: (1) this control mech-anism is ineffective for preventing the co-operation of several users to compromisethe database, (2) statistics for both a setand its subset (e.g., all patients and allpatients undergoing a given treatment)cannot be released, thus limiting the use-fulness of the database, and (3) for eachuser, a user profile has to be kept up to

    date. The performance of the query-set-overlap method with respect to the evalu-ation criteria is summarized in Table 1. Inregard to the cost criterion, the followingcomments are in order. The initial imple-mentation effort consists of developingsoftware that maintains user profiles andcompares a new query set with all previoussets. We feel that such an effort is moder-ate. The processing overhead per querymay, however, be very high due to the com-parison algorithm. Every new query issuedby a given user has to be compared with hisor her previously issued ones. Each com-parison takes O(L) processing time, whereL is the size of the SDB.3.3 AuditingAuditing of an SDB involves keeping up-to-date logs of all queries made by each user(not the data involved) and constantlychecking for possible compromise when-ever a new query is issued [Hoffman 1977;Schlorer 19761. Auditing has advantagessuch as allowing the SDB to provide userswith unperturbed response, provided thatthe response will not result in a compromise[Chin et al. 19841. One of the major draw-backs of auditing, however, is its excessiveCPU time and storage requirements tostore and process the accumulated logs.Chin and Ozsoyoglu [1982] developed aCPU time and storage-efficient method(called audit expert) that controls disclo-sure of a confidential attribute when usingthe SUM query. Consider a response, d, toa SUM query. Such a response provides theuser with information in the form of alinear equation:

    4: aixi = d,i=l

    where L is the number of entities repre-sented in the SDB, ai is one if the ith entitybelongs to the query set and is zero other-wise, and Xi represents the value of a con-fidential numerical attribute for entity i.The users knowledge obtained by que-rying the SDB may, therefore, be describedin the form of a set of linear equationsobtained from linear combinations of equa-tions in the set of answered queries. Theaudit expert maintains a binary matrixACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    14/42

    528 . Adam and WortmannTable 1. A Summary of the Performance

    Security Suitable for Suitabil itySecurityControlMethod

    Exact Partial Numerical Suitable for to OnlineDisclosure Disclosure or Categ. One or More DynamicPossible? Possible? Robustness Attribute Attributes SDBQuery Set Size Yes Yes

    ControlQuery Set Over- Yes (unless Yes

    lap Control number ofqueriesseverelyrestricted)

    Auditing No Yes

    LowLow

    Low

    Partitioning No

    Cell Suppression No

    Yes, but more Controlled byprotection size of A-results populationfrom largermin. size ofA-popula-tion

    Yes Low

    BothBoth

    More than oneMore than one

    ModerateModerate

    Both

    Both

    One, otherwiseprocessingoverhead isvery high

    More than one

    Low

    Yes

    Both More than one No

    whose columns represent specific linearcombinations of database entities andwhose rows represent the user queries thathave already been answered. These rowsare chosen in such a way that they describeexactly and efficiently the knowledge spaceof each user. When a new query is issued,the matrix is updated. A row with all zerosexcept for an ith column indicates thatexact disclosure of the confidential attri-bute of the corresponding entity is possible.Thus, the answer to the new query shouldbe denied. It was shown that it takes theaudit expert no more than O(L2) time toprocess a new query [Chin and Ozsoyoglu19821. Hence, the method is suited for smallSDBs.The auditing method has been furtherinvestigated in Chin et al. [1984] for aspecial type of SUM query, the range SUMquery, which is defined as

    iil wi = d,where ai is one if LB 5 i 5 UB and is zerootherwise, and xi, as before, represents theACM Computing Surveys, Vol. 21, No. 4, December 1989

    value of a confidential-numerical attributefor entity i. Chin et al. [1984] show thatwhen using the proper data structure, thecomplexity of checking if a new range SUMquery could be answered can be reduced toO(L) time and space as long as the numberof queries is less than L or to O(t log L)time and O(L2) space for the t th new rangeSUM query with t 1 L.Table 1 includes a summary of the per-formance of the auditing method withrespect to the evaluation criteria. We noticethat the audit expert that has been devel-oped only for SUM queries does not provideprotection against partial disclosure. Al-though the DBA can provide the audit ex-pert with information regarding additionaluser knowledge, the robustness of themethod is considered very low since statis-tics on subpopulations with only a few en-tities will be made available to users. Withrespect to the cost criterion, the initial im-plementation effort is high because com-plex algorithms have to be implemented.For example, Chin and Ozsoyoglu [1982]show that maximizing the amount of non-confidential information that is to be

  • 7/28/2019 Stat Database Sec78

    15/42

    of the Query Restriction Based MethodsSecurity Control Method For Databases l 529

    Richness of Information costsAmount of

    Nonconf. Info.Eliminated Bias

    Initial ProcessingImplemenation Overhead User

    Precision Consistency Efforts Per Query EducationHighVery high

    NA NA NANA NA NA

    Moderate NA NA NA

    Moderate (veryhigh forsparse SDBs)

    Yes, if dummy NA NAentities areadded

    LowModerate

    LowVery high for

    large SDBs

    High Very high forlarge SDBs

    Moderate forstatic SDB;very high fordynamic SDB

    Very low forstatic SDB

    Very lowVery low

    Low

    Low

    Moderate NA NA NA High None None

    provided to users is an NP-complete prob-lem (i.e. there exists no polynomial-timealgorithm for solving this problem).Two observations are worth noting.First, the security level provided for theSUM query by the audit expert may inactuality be better than assumed. This isdue to the fact that for snoopers to be ableto perform a linear system attack on a SUMquery they must have enough supplemen-tary knowledge about the database entitiesto enable them to identify, through thecharacteristic formulas of the successivequeries, controlled groups of entities [Den-ning 19831. To illustrate, consider the fol-lowing four queries:

    x1 + x2 = dl,x3 + x4 = dz,

    xl + x3 + x5 = d3,1~2 + x4 + x5 = d4.

    Given the query responses dl, d2, d3, andd4, snoopers can infer x5 as follows:d3 + d4 - d, - dzx5 = 2

    As noted in Denning [1983], however, inorder for snoopers to perform such an at-tack, they must have enough supplemen-tary knowledge about the database entitiesSO that they know exactly the coefficientsof x1 through x5 in the query responses. Inmost practical applications, it is rare thata user would have such supplementaryknowledge.The second observation is related to themethod proposed in McLeish [1983]. Thismethod can be classified as a variant ofauditing. It is based on the model used inKam and Ullman [1977], which views anSDB as a function f from strings of k bitsto the positive and negative integers withthe keys being the domain off. A query isalways of length lz bits; for example, for k= 5 a possible query could be l**O*, with sOs and ls (in this case s = 2) and the *standing for do not care. The result of aquery Q that is of length k and has s Osand ls is given by

    c f(i).key i matches QIn the hospital database, for example,the key could consist of 17 bitsACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    16/42

    530 l Adam and Wortmannxxxywwwwwwwzzzzzz as follows:xxxY

    is a code for the clinic,is a code for the patients sex(0 = Male, 1 = Female),wwwwwww is a code for the patients age,zzzzzz is a code for the type of disease.Thus, the query ***0101010111111 wouldrepresent the sum of all male patients,independent of which clinic they are in, ofage 42 who have a disease type 111111.According to McLeish [1983], theamount of information gained by issuing aquery Q is given by:

    log ( L ) if ]C] z2Lmin(]C], L- ICI) or ]C] 501ogL if ICI = L0 otherwisewhere L and ] C ] are the database size andthe query-set size, respectively.McLeish [1983] argues that minimizingthis information function corresponds toincreasing the chance of compromising thedatabase. Given a query of length k bitsissued to an SDB of 2k entities, it is shownthat the expected value of the informationgained by issuing such query is/, \ l/k--l

    k - (k - 1) ;0and is minimized when

    p= ;0l/k-l

    for k > 1 (1)where p is the probability of an * occurringin any given bit position.Based on the above results, the security-control method suggested in McLeish[1983] can be summarized as follows:

    Keep audit trails of the sequence of que-ries in the following ways:l Observe the actual value of p for thatsequence of queries and determine if it isstatistically significantly close to the mini-mum value given by (1). If so, there is ahigh likelihood that the user is attemptingto compromise the database.l Evaluate the information function foreach query in the sequence and study sta-tistically the deviation of this value from

    the minimum expected value given by (1).Based on these deviations, determine thelikelihood that the user is attempting tocompromise the database.Since the study is preliminary in nature,no implementation details such as the com-putational time and storage requirementshave been addressed.In general, the applicability of the audit-ing method to real world situations is ques-tionable since it is not feasible to accountfor disclosure by collusion that involvesseveral users. Furthermore, unless we areconcerned with a centralized and dedicatedSDB with few users and one confidential

    attribute, the CPU time and storage re-quirements would render the method im-practical. Despite these shortcomings, webelieve it is too early to eliminate auditingcompletely from consideration.3.4 PartitioningThe basic idea of partitioning is to clusterindividual entities of the population in anumber of mutually exclusive subsets,called atomic populations [Chin and Ozsoy-oglu 1979, 1981; Schlorer 1983; Yu andChin 19771. The statistical properties ofthese atomic populations constitute the rawmaterials available to the database users.These authors pay special attention to thedisclosure risk due to the dynamics of theSDB: If a snooper has additional knowledgeof entity insertions, updates, and deletions,many new avenues of attack emerge undernearly all query-restricting methods. Par-titioning could be an attractive techniqueto overcome this problem.As long as atomic populations do notcontain precisely one individual entity, ahigh level of security and precision can beattained. Partitioning is illustrated in Fig-ure 4. As can be seen, the cells with size 1have been eliminated by combining themwith neighboring cells of different sex. Interms of the lattice model (Figure 2) weare, in this way, preserving the two-dimen-sional table AE (Age-Employer). The ta-bles AS and SE shown in Figure 2 are nolonger complete ly available. The choice topreserve the two-dimensional table AE(and consequently the one-dimensional ta-bles E and A) is arbitrary. It is not always

    ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    17/42

    Security Control Method For Databases l 531AGE

    TABLE ASE O-20 21-45 46-65 >65EMPLOYERUNEMPLOYED

    ABC-COMPANYXYZ-INC.

    M 24 2 49F 26 0 10 51M 0 17 9 0F 0 0 0MF 2 20 48 00 52 0AGE

    TABLE AS O-20 21-45 46-65 >65SEX M X X X 49F X X X 51

    AGETABLE AE O-20 21-45 46-65 >65EMPLOYERUNEMPLOYEDABC-COMPANYXYZ-INC.

    50 2 10 1000 17 9 02 20 100 0SEX

    TABLE SEEMPLOYERUNEMPLOYED

    ABC-COMPANYXYZ-INC.

    M FX X TABLE ALL:X XX X m

    AGEO-20 21-45 46-65 >65

    TABLE A 52 39 119 100SEX

    TABLE S

    TABLE E

    M FX X

    EMPLOYERUNEMPLOYED ABC XYZ

    162 26 122Figure 4. Protection by partitioning on the model of Figure 3.

    possible to preserve at least one (m - l)- be combined with another cell, thus re-dimensional table if cells of m-dimensional stricting some entry in table AE.tables are combined. For example, if the Schlorer [1983] has investigated a largelady employed by XYZ, Inc., who is younger number of practical databases and foundthan 21 was not included in the database, that a considerable number of atomic pop-the cell that contains the man employed by ulations with only one entity will emerge.XYZ, Inc., who is younger than 21 would Clustering such populations with larger

    ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    18/42

    532 . Adam and Wortmannones leads to serious information loss[ Schlorer 19831.In order to cope with the problem of A-populations. of size 1, it was proposed inChin and Ozsoyoglu [1979, 19811, to adddummy entities to the database. Includingdummy entities, however, introduces biasinto statistics such as the Average [ Schlorer19831. n the same study, Chin and ozsoy-oglu [1981] also proposed to postpone theprocessing of insert and delete transactionsuntil there are two or more such transac-tions per atomic populations. Such a modeof operation is not well suited to a dynamic-SDB environment in which an update tothe SDB should immediately be reflectedin the information provided to users.Extensive study of these problems is stillrequired before wide-scale applicalion ofpartitioning is feasible [Schlorer 19831.The performance of partitioning with re-spect to the evaluation criteria is summa-rized in Table 1. We add the followingcomments. In regard to partial disclosure,the original papers [Chin and Ozsoyoglu1981; Ozsoyoglu and Ozsoyoglu 1981;Schlorer 19833 usually assume that theminimum size of the nonempty atomic pop-ulations equals 2. Atomic populations ofsmall size make partial disclosure morelikely. Small atomic-population sizes couldlead to robustness problems too. Thegeneral idea of partitioning, however, isalso applicable to atomic populations ofsome minimum size of 4, 5, . . . , and so on,in which case partial disclosure can becontrolled to some extent.The amount of nonconfidential infor-mation eliminated is moderate. For sparsedatabases, it could become considerable(information loss > 50%). See Ozsoyogluand Chung [ 19861.As far as cost of partitioning is con-cerned, there are several possibilities. If weare dealing with a static SDB that is alreadyin tabular form, software modules have tobe developed that (1) detect sensitive cellsand create A-populations and (2) prepro-cess queries in order to detect violation ofthe partitioning rules. This seems to be amoderate initial effort. The processingoverhead per query is very low. The situa-tion becomes slightly more complicated for

    a static SDB in relational form. If feasible,a proper way to implement partititioning isto transform the SDB to a correspondingone of tabular form. It could be argued thatsuch a transformation introduces some lossof information due to the fact that therelational model allows queries such asCOUNT (Sex = Female + (Sex

    = Male & Age= 42) & Employer= ABC & Diagnosis= Schizophrenia)

    The additional richness of queriesrelational model, however, always

    Tmein thestemsfrom the fact that A-populations aredivided, which according to partitioningshould be disallowed.For dynamic SDBs in relational form, A-populations have to be defined as separateentity types. Here, several notions from theconceptual-modeling approach described inSection 1 will be required. Therefore, theinitial investment in software modules willbe considerable. The nrocessing overhead

    per query could still be minor if each newversion of the dynamic SDB is brought intotabular form. Otherwise, depending uponthe way in which the query language andthe SDB are implemented, the processingoverhead per query could become very high.Finally, the cost of user education in-volves the publication of the set of A-pop-ulations to the user community; otherwise,users are unaware of queries that should bedenied.3.5 Cell SuppressionCell suppression [Cox 1980; Sande 19831 sone of the techniques typically used bycensus bureaus for data published in tabu-lar form. Cell suppression has been inves-tigated for static SDBs. The basic idea isto suppress from the released table(s) allcells that might cause confidential infor-mation to be disclosed. Other cells of non-confidential information that might lead toa disclosure of some confidential informa-tion also have to be suppressed (this iscalled complementary suppression). The

    ACM Computing Surveys, Vo l. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    19/42

    Security Control Method For Databases l 533AGE

    TABLE ASEEMPLOYERUNEMPLOYED

    O-20 21-45 46-65 >65

    M X X X 49F X X X 51ABC-COMPANY M 0 X X 0F 0 X X 0XYZ-INC. M X X 48 0F X X 52 0

    AGETABLE AS O-20 21-45 46-65 >65SEX M 25 23 66 49F 27 16 53 51

    AGETABLE AEEMPLOYERUNEMPLOYEDABC-COMPANYXYZ-INC.

    O-20 21-45 46-65 >65

    50 2 10 1000 17 9 02 20 100 0

    SEXTABLE SE M FEMPLOYERUNEMPLOYED

    ABC-COMPANYXYZ-INC.

    84 78 TABLE ALL:10 1669 53

    AGEO-20 21-45 46-65 >65

    TABLE A 52 39 119 100SEX

    M FTABLE S 163 147

    EMPLOYERUNEMPLOYED ABC XYZ

    TABLE E 162 26 122Figure 5. Protection by cell suppression in the model of Figure 3.

    basic idea of cell suppression is illustrated cell suppression becomes impractical if anin Figure 5 for the SDB shown in Figure 3. arbitrary complex syntax for queries isA thorough study on cell suppression for allowed. (With such a syntax, suppressionthe static SDB environment is presented of complete tables from the lattice modelin Denning et al. [ 19821. They show that might be necessary.) If, however, the syntax

    ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    20/42

    534 l Adam and Wortmannis restricted such that the query set is anelementary m-set (see Section 2),

    (AI = aij,)& (AZ = azj,) & * * * & (A, = a,j_),

    cell suppression would remain of practicalvalue. Fortunately, the summary tablesused by the census bureaus usually take theform of such elementary m-sets.The determination of complementarysuppressed cells has been studied by Cox[1980]. He shows that the determination ofa minimum set of complementary suppres-sion involves a great deal of computationalcomplexity. An insight into heuristics andsoftware required to solve these problemsin practice (the Canadian census) is in-cluded in Sande [1983]. It is interestingto note that Cox and Sande use an elabo-rate sensitivity criterion, called the k%-dominance rule. According to this criterion,a cell is sensitive if the attribute values oftwo or three entities in the cell contributemore than k% of the corresponding SUMstatistic.Recently, ijzsoyoglu and Chung [1986]have published a study that is of interestfor both the partitioning method and thecell-suppression method. It is concernedwith preventing users from obtaining theinformation that a cell is of size 1. Theproblem is attacked by merging a cell ofsize 1 with a cell o f size > 1. For thisproblem, an efficient heuristic is presented.The information loss may still become high(52%), however, thus limiting its applica-tion to real world SDBs.

    In Table 1 we summarize the perfor-mance of the cell suppression methoddescribed in Cox [1980] with respect to theevaluation criteria.

    4. DATA PERTURBATIONThe methods based on the data-perturba-tion approach fall into two main categories,which we will call the probability-distribution category and the fixed-data-perturbation category. The probability-distribution category considers the SDB tobe a sample from a given population thathas a given probability distribution. In this

    case, the security-control method replacesthe original SDB by another sample fromthe same distribution or by the distributionitself. In the fixed-data-perturbation cate-gory, the values of the attributes in thedatabase, which are to be used for comput-ing statistics, are perturbed once and forall. The fixed-data-perturbation methodsdiscussed in the literature have been devel-oped exclusively for either numerical dataor categorical data. Accordingly, thesemethods will be discussed separately.As shown in Figure lb, the data-pertur-bation approach usually requires that adedicated transformed database is createdfor statistical research. If the original da-tabase is also used for other purposes, theoriginal database and the transformed SDBare both maintained by the system.4.1 The Bias ProblemBefore discussing the data-perturbation-based methods, it is important to note thatthis approach has a large risk of introducingbias to quantities such as the conditionalmeans and frequencies. This point isstressed in Matloff [ 19861. FollowingMatloffs notation and line of thought, thesource of this bias can be explained asfollows.Let X denote the original value of anattribute to be perturbed, and let X denotethis perturbed value X = X + (Y. Nowconsider the set of entities that has a per-turbed value w. Matloff shows that theexpected value of X under the conditionthatX=w;thatis,E(X]X=w)isnotnecessarily equal to w. More specifically, ifX is a numerical-positive variable with astrictly decreasing density function (e.g.,the exponential density function) to whicha perturbation that is symmetrical around0 has been added, Matloff [1986] showsthat

    E(X]X = w) < w.In this case perturbation introduced bias inthe response to the user query.

    This result has important consequencesfor fixed-data perturbation. Each query inwhich the selection of the query set is basedon perturbed values runs the risk of beingACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    21/42

    Security Control Method For Databases l 535biased. Matloff also shows that this biascan be considerable. For example, if X andY are correlated attributes with a bivariateGaussian distribution whose expected valueis 0 and X is perturbed by an independentnoise variable LYwith mean value zero andvariance Var(a), then the following biasoccurs: Let

    m(w) = E(Y ] X= W)= (E(XY)/Var(X))w

    m(w) = E(Y (X = zu)= (E(X Y)/Var(X))w.

    Since LY s independent of Y and both Yand (Y have a mean of zero, we have

    orm(wJ=(VEb,,),(wJ

    iVar(X)

    = Var(X) + Var(cu) ) m(w)

    =(l +iar(ol))m(w)then

    m(w) 1-=m(w) 1 + Var(cY)/Var(X)Therefore, if Var(cu) = Var(X), which isnot unreasonable for perturbing noise, abias of 50% occurs. We assume that such abias is unacceptable. This bias problem willbe further addressed in the context of themethods discussed below.

    4.2 Probability DistributionWithin the probability-distribution cate-gory, two methods can be identified. The

    basic idea of the first is to transform theoriginal SDB with another sample thatcomes from the same (assumed) probabilitydistribution. This method is described inReiss [ 1980, 19841 for multicategoricalattributes and is called data swappingor multidimensional transformation[Schlorer 19811. The method has been fol-lowed for categorical or numerical attri-butes by Liew et al. [1985]. The secondmethod, described in Lefons et al. [1983],calls for replacing the original SDB byits (assumed) probability distribution. Adiscussion of each of these methods ispresented.4.2.1 Data SwappingReiss [1984] suggested a method that dealswith multicategorical attributes calledapproximate data swapping. The method,which extends earlier work by Reiss [1980]and Schlorer [1981], is described forBoolean attributes (O-l). According toReiss [ 19841, however, it is straightforwardto extend the method to the case in whichattributes take any value within the setK4 1, 2, * * . , r - 1) for any arbitrary valuer. In this method, the original database isreplaced with a randomly generated data-base having approximately the same t-orderstatistics as the original database. (A t-order statistic is some statistical quantitythat can be computed from the values ofexactly t attributes [Schlorer 19831, for ex-ample, the number of patients whose Sex= Male and Disease = AIDS is two-orderfrequency count.) Due to the computationalrequirement of the method, it is only fea-sible to consider it for static SDBs whereoffline mode of usage is used (see Figurela). Even if the computational requirementwere reduced, the following issues haveto be resolved before its application to anonline-usage mode would be practical.

    (1) Every time a new entity is added ora current entity is deleted, the relationshipbetween this entity and the rest of thedatabase has to be taken into considerationwhen computing a new perturbation. Therequired algorithm is not straightforward.

    (2) There is a need for a one-to-one map-ping between the original database and theACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    22/42

    536 . Adam and WortmannTable 2. A Summary of the Performance

    Security Suitable for SuitabilitySecurity Exact Partial Numerical Suitable for to OnlineControl Disclosure Disclosure or Categ. One or More DynamicMethod Possible? Possible? Robustness Attribute Attributes SDBData Swapping

    Probab. Distribu-tion by Liewet al.

    Analytical Method

    Fixed Data Per-turb. by Traubet al.

    Fixed Data Per-turb. by Warner

    No Yes (difficult) HighNo Yes (easy es- For exact

    pecially for discl.-large SDBs) Moderate;

    for partialdiscl.-Very Low

    Yes in ex- Yes (especially Moderatetreme for largecases SDBs)

    No Can be bal- Moderateancedagainstsecurity

    No Can be bal- Moderateancedagainstsecurity

    Categorical More thanone

    Both OneNot suitableNot suitable

    for realtime

    Numerical More thanone

    Moderate

    Numerical One Moderate

    Categorical One Moderate

    perturbed database. Although several alter-native methods discussed in Reiss [1984]look promising, further investigation isrequired.

    (3) The precision resulting from thismethod may be considered unacceptablesince, as shown in Reiss [1984, p. 331, themethod may in some cases have an error ofup to 50%. In a static-SDB environmentsuch extreme cases may be analyzed andcorrected by the DBA before the releaseof the perturbed SDB. This is infeasible,however, in the case of the dynamic SDB.(4) The small query-set (size 0 or 1)problem needs to be resolved; otherwise thedatabase is vulnerable to such compromiseas the regression-based one [Palley 1986;Palley and Simonoff 19871 (see Section 7).A summary of the performance of thismethod is presented in Table 2. In general,data swapping has not been developedenough to be seriously considered for staticor dynamic SDBs.

    4.2.2 The Probability-Distribution Method byLiew et al.

    Liew et al. [1985] describe a method forprotecting a single confidential attributeACM Computing Surveys, Vol. 21, No. 4, December 1989

    in an SDB. The method is applicable toboth categorical and numerical attributes.The generalization to multiple-dependent-confidential attributes is not, however,straightforward. The method consists ofthree steps:(1) Identify the underlying density func-tion of the attribute values and estimatethe parameters of this function.(2) Generate a sample series of data fromthe estimated density function of the con-fidential attribute. The new sample shouldbe the same size as that of the database.(3) Substitute the generated data of theconfidential attribute for the original datain the same rank order. That is, the small-est value of the new sample should replacethe smallest value in the original data, andso on.

    As a result of the third step, this methodcould equally as well be classified under thefixed-data-perturbation category. For easeof discussion, we describe it in this section.This method is equivalent to a noise-addition method, hence it introduces sam-pling bias in query responses [Matloff19861. The bias results from sampling froma population that is not the true-target

  • 7/28/2019 Stat Database Sec78

    23/42

    Security Control Method For Databasesof the Data Perturbation Methods

    Richness of Information costsAmount of Initial Processing

    Nonconf. Info. Implementation OverheadEliminated Bias Precision Consistency Efforts Per Query

    None Yes (could be Could be High Very high Noneserious) very low

    None Yes (could be Inversely NA Moderate Noneserious for relatedsmall to sizeSDBs) of the

    SDB

    l 537

    UserEducationModerateVery low

    None Yes (serious) Moderate NA Very high None Very low

    None

    None

    Yes (serious) Can be Moderate Low Very low Very lowbalanced except foragainst extremesecurity

    No Can be Moderate Low Very low Very lowbalancedagainstsecurity

    population [Tendick and Matloff 19871.For an SDB of small size, the noise intro-duced by this method is larger; thus bettersecurity is achieved but biased-query re-sponses are provided to users. As the sizeof the database increases, the bias becomessmaller but less security of confidentialattributes is achieved.The evaluation of the performance of thismethod with respect to the evaluation cri-teria is summarized in Table 2. Partial dis-closure is easily possible since the noiseadded to the confidential attribute becomesrapidly small, even for databases of mod-erate size. Additional knowledge aboutother entities in the database may enhancethe likelihood of partial disclosure.4.2.3 The Analytical MethodLefons et al. [1983] describe a method forprotecting multinumerical-confidential at-tributes. The method consists of estimatingthe joint probability function of several nu-merical attributes. The key contribution ofthis work lies in the approximation of thedata distribution by orthogonal polyno-mials. The coefficients used in the compu-tation of this approximation are calledcanonical coefficients. These coefficients

    are well suited for usage in an online envi-ronment because they can be adopted easilyin case of insertions and deletions of thedatabase entities.Although the method looks promising,its security aspect needs further investiga-tion. In particular, if the new probability-distribution function is a very precise de-scription of the original data, then there ishardly any protection against partial diszclosure. On the other hand, if deviatiohsbetween the distribution function and theoriginal data are possible, then issues suchas how to avoid bias and how could theDBA exercise control on the trade-offbetween precision and security need to beaddressed.The evaluation of the analytic approachwith respect to the criteria of Section 1 ispresented in Table 2. Exact disclosure ispossible in extreme cases. For example, ifthe distribution shows that 1% of the pop-ulation satisfies certain criteria and it isknown that the size of the original databaseamounts to 100, exact disclosure occurs.4.3 Fixed-Data PerturbationThis section discusses the fixed-data-perturbation method for numerical attri-

    ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    24/42

    538 l Adam and Wortmannbutes and the fixed-data-perturbationmethods for categorical attributes.4.3.1 Fixed-Data Perturbation for NumericalAttributesTraub at al. [ 19841 developed a method thatapplies to numerical attributes. Suppose,for example, that the true value of a givenattribute (e.g., salary) of an entity k is Yk.The response to the sum query, under thismethod, will be

    T= i Xk,k=l

    wherexk = Yk + ek,

    ek is a random-perturbation variable withE(ek) = 0 and var(ek) = uz,

    and (ek1 are independent for different ks.Under this method, the perturbation ekof an entity k is fixed. Thus, it is notpossible for snoopers to improve theirestimates of a given statistic by repeatingqueries.The additive-perturbation method de-scribed above suffers in terms of scale. Forexample, perturbing a salary of $150,000 by3000 would be considered a compromisewhile at the same time perturbing a salaryof $15,000 by 3000 would preverve the con-fidentiality of the data. Several alternativesto this basic method have been suggestedin Traub et al. [1984]. One alternative is toapply multiplicative rather than additiveperturbation, thus overcoming the scaleproblem.Note that this method is akin to theprobability-distribution method [Liew etal. 19851. The bias problem, which wasdiscussed earlier in this section, appliesalso to the fixed-data-perturbation method[Traub et al. 19841. The method could,however, be saved for practical usage onlyif a bias-compensation mechanism is alsodeveloped and implemented. In this re-spect, this method has an advantage overthe probability-distribution method [Liewet al. 19851, because the way in which noiseis added to the data is much clearer andtherefore better suited for statistical analy-ACM Computing Surveys, Vol. 21, No. 4, December 1989

    sis. We expect that the bias problem forsingle attribute (or independent attributes)is solvable. Recently, Tendick and Matloff[ 19871 have suggested a bias-correctionmechanism for both the multivariate nor-mal and nonparametric cases. The appli-cation of these bias-correction mechanismsto the fixed-data-perturbation method[Traub et al. 19841 needs further study.A summary of the evaluation of the fixed-data-perturbation method [Traub et al.19841 with respect to the criteria of Section1 is given in Table 2. Unlike other data-perturbation methods, this method is likelyto be appropriate for usage in an onlineSDB. The original and perturbed valuescan be maintained, with users accessingonly the perturbed database. Updates, dele-tions, and insertions affecting the originalattributes can immediately be reflected intothe perturbed values.

    4.3.2 Fixed-Data Perturbation for CategoricalAttributes: Basic MethodWarner [ 19651 developed the randomizedresponse method for the purpose of apply-ing it to data collection through a survey.The method deals with a single confidentialattribute that can take only the value 0 or1 (e.g., drug addiction). In order to describethis method as applied to the COUNTquery in an SDB, we extend our hospitaldatabase example to include a confidential-categorical attribute (O-l): drug addiction.Consider the following query:Q6: COUNT (Age = 42 & Sex = Male &

    Employer = ABC & Drug Addiction .=Yes)Letn = the query-set size (true answer to theabove query).no = the number of entities that satisfy thecharacteristic formula, excluding the clausethat pertains to the confidential attribute;that is, COUNT (Age = 42 & Sex = Male& Employer = ABC). This will be referredto as the nonconfidential-response set.Yi = the value of the confidential attributeafter applying data perturbation to entityi. Specifically, if Xi is the original value of

  • 7/28/2019 Stat Database Sec78

    25/42

    Security Control Method For Databases l 539the confidential attribute, then

    yi = { ;; - Xi ]with probability pwith probability 1 -p

    where the fixed parameter p is set by theDBA. Typical values for p are 0.6-0.8. Theroles of p and 1 - p can be interchangedsuch that p might take values 0.2-0.4.n, = the number of Yis with a value of 1in the confidential-response set (i.e., thequery-set size).N = the response to the user. It is theunbiased-maximum-likelihood estimator ofn (thus, no bias problem) and is given by

    N=nob - 1) + 1

    2p - 1where

    Note that a correct determination of nois crucial for this method. This, in general,is not a trivial problem. For the query Q6,it is easy to determine no because the queryconsists of a nonconfidential-Booleanexpression connected by an AND operatorwith one simple confidential clause. Thus,we are able to distinguish, in a straightfor-ward way, between the nonconfidential-response set (with cardinality no) and itssubset for which Y = 1 (with cardinalityno). Consider, however, the following query:Q7: COUNT (Sex = Male & Drug Addic-tion = Yes + Employer = ABC & Drug

    Addiction = No)In this case, defining no and nl is not atrivial task. We outline a general procedurefor dealing with this problem that aims attransforming the characteristic formula Cinto a formula C of the form(Y = 1) & A, + (Y = 0) & A, + AB,

    where Ai n Aj = 0, for i 5 j and Ai isnonconfidential for i = 1, 2, 3.Step 1. Transform C into its most ex-panded form. Each term of the new formulacontains a clause (Y = l), a clause (Y = 0),or no reference to Y at all.

    Step 2. Collect all terms with Y = 1, allterms with Y = 0, and the remaining terms.This yields a formula of the form (Y = 1)&B,+(Y=O)&B,+B,.Step 3. Transform this formula into (Y =1) & B, & (lB2) & (lB3) + (Y = 0) &(~7~) & Bz & (lB3) + B, + BS, whichgives the required result (the formula C ).Step 4. During the query processing, thealgorithm should use separate countersnf and n, l) for the first clause, nff andnj for the second clause, and nc3) for thethird clause.Step 5. After processing, the response Ncan be estimated as the sum of the threevariables N(l), N@), and Nc3), where

    N(1) = nt(p - 1) + nf2p - 1

    N(2) = nf(p - 1) + n:2p-1

    N3 = n(3)This procedure shows that this fixed-data-perturbation method is a feasible so-lution for an SDB application, although itsgeneralization is not trivial.Table 2 includes a summary of the per-formance of the fixed-data-perturbationmethod by Warner 119651 with respect tothe evaluation criteria. Notice that in thiscase, similar to the fixed-data-perturbationmethod described in Traub et al. [1984], itis possible to balance precision and secu-rity. A more detailed discussion of this issueis given in Section 6.

    4.3.3 Fixed-Data Perturbation for CategoricalAttributes: Extensions

    The fixed-data perturbation for categoricaldata is an extension of Warners [1965]method (as applied to data collectionthrough interviews) to the case in whichthe population can be divided into t cate-gories as given in Abul-Ela et al. [1967].The confidential information is obtainedby a Yes or No answer from eachinterviewee to precisely one randomlydrawn question out of a set of t - 1questions. Each question takes theform, Do you belong to group i? (fori = 1,2,. . . , t - 1). As in Warners method,

    ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    26/42

    540 l Adam and Wortmannthe interviewer is not aware of the partic-ular question to which the interviewee isresponding.Applying this method to an SDB envi-ronment results in a considerable infor-mation loss since the original t-possiblevalues of the confidential attribute arereduced to two possible values (yes or no).More specifically, assuming equal probabil-ity of each of the t values, the availableinformation is reduced from log2 (t ) bits to1 bit. This information loss is considered aserious hindrance in applying the methodto real-world SDBs.It was shown that the above two methodsof randomized response can be formulatedas special cases of a general class of linearregression models [Warner 19711. It wasfurther suggested that this general modelmight be applied to perturb categorical datain a statistical database [Traub et al. 19841.The two randomized-response methods dis-cussed above are the only methods that weare aware of as being specific models thatfall under the general class of models de-scribed by Warner. (We consider the meth-ods described in Greenberg et al. [1969a,1969b] to be minor variations of these twomethods.) Therefore, for Warners generalclass of models to be applicable to SDBs inonline mode (see Figure lb), new methodshave to be developed. Such methods shouldhave better performance with respect totheir computational requirements andinformation loss.The randomized-response methods dis-cussed above do not address the case ofmore than one categorical attribute. Theapplication of these methods to SDBs with

    multicategorical attributes requires furtherstudy.The discussion presented in this sectionindicates that data-perturbation-basedmethods suffer from a bias problem, and/or not being suitable for dynamic SDBs,and/or being limited to one confidentialattribute. A study of how to remedy thesedrawbacks is needed.5. OUTPUT-PERTURBATION APPROACHBefore discussing the output-perturbation-based methods, we would like to note thatthe bias problem pointed out in the pre-

    vious section is less severe here. This is dueto the fact that the query set selected underthe output-perturbation approach is basedon the original values, not the perturbedvalues. Thus, the value of Y in Section 4.1is generated after selecting the appropriatevalues of X.5.1 Random-Sample QueriesDenning [1980] proposed a method that iscomparable to ordinary random samplingwhere a sample is drawn from the query setitself. Given a characteristic formula C, aset of entities that satisfies C is determined.For each entity i in the query set, the sys-tem applies a Boolean formula f (C, i) todetermine whether this entity is to be in-cluded in the sampled query set. The func-tion f is designed in such a way that thereis a probability P that the entity is includedin the sampled query set. The probabilityP can be set by the DBA. The requiredstatistics are computed based on the sam-pled query set. The statistics computedfrom the sampled query set have to bedivided by P in order to provide a corre-sponding unbiased estimator. For example,if the response to a count query based onthe sampled query set is n*, an estimatorfor the true count is n*/P. The samplingmechanism in this method is designed insuch a way that lexicographically the samequeries produce the same sample from thequery set. Logically equivalent but lexico-graphically different queries, however, donot necessarily produce identical sampledquery sets. As a result, independent esti-mates of a given statistic can be obtainedby issuing logically equivalent but not iden-tical queries. The method thus suffers fromthe resulting inconsistency.The probability of an entity being in-cluded in the query set P is either fixed(independent of n) or variable (varieswith n).When P is fixed there should be a query-set-size restriction if P is large. Otherwise,there is a high probability of including allthe entities of a small query set, thus com-promising the database.When P is variable, it should approach 0for a query-set size approaching 1 in orderto avoid compromising the database. This,

    ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    27/42

    Security Control Method For Databases 541however, results in an estimator whosevariance, Var(n), approaches infinity.Therefore, in discussing this method welimit our attention only to the case of fixedP with a minimum query-set-size restric-tion. The minimum query-set-size restric-tion should be applied to n* instead of n,otherwise it is easily shown that compro-mise can be achieved by a tracker of sizeK - 1, where the query-set size is less thanor equal to K. We shall ignore this imple-mentation issue in the remainder of thepaper.A summary of the performance of thismethod with respect to the evaluation cri-teria is given in Table 3, and additionaldiscussion is included in Section 6.Finally we note that a variant of therandom-sample queries [Denning 19801 hasbeen proposed in Leiss [1982]. Accordingto the method by Leiss [ 19821, the modifiedquery set is extended by a small samplefrom those entities in the SDB that are notincluded in the genuine query set. Unlikethe random-sample queries [Denning19801, it is difficult to avoid the bias result-ing from the method in Leiss [ 19821. There-fore, no further discussion of Leisssmethod is included in this paper.5.2 Varying-Output PerturbationBeck [1980] suggested a method for SUM,COUNT, and PERCENTILE queries. Themethod introduces a varying perturbationto the data that are used to compute theresponse to a (eventually repeated) versionof a given query. We will use the COUNTquery and follow Becks notation to illus-trate the basic idea of the method. Letn = the query-set size determined by thecharacteristic formula C.

    N = the perturbed value of n; it is theresponse to users and is given by

    N = C -Co.,k=lwhere

    -G,k = i zgtk,i=l

    where .Z$ are independent random vari-ables withE(Z$) = i and Var(.Zti) = $.

    ThusE(&,k) = 1 and E(&,k) = j:

    andE(N) = n and Var(n) = ja?.

    The variable & is the sum of j randomvariables 2: that change their values atvarying rates. Each new draw of Z3 causesat least a new draw of 2:. After (al/cl)drawings of Zf, however, a snooper mighthave an accurate estimator of n (althoughthis estimator may be biased). Therefore,after (al/cl)* drawings of Zf, a new drawof-w is made, and so on. Let d = (a1/c1)2,then it takes d3 queries for snoopers to findtheir target with sufficient accuracy. Dueto the varying rates of change of the valuesof Zf, subsequent draw of Z3 are highlycorrelated. The purpose of introducing thiscorrelation is to make the number of quer-ies needed for a disclosure grow exponen-tially with j.The varying-output-perturbation method[Beck 19801 for SUM and PERCENTILEqueries proceeds along the same line ofthought. Here we will discuss only the SUMquery.Suppose we have a query SUM(Y) overa query set R, with n entities in R. Supposefurther that the mean value of Y for theque_ry set is yr and for the entire databaseis Y. The response to the query will be

    T=iXk,kwhere xk = Yk + & ( Yk - F,.) + 22.In this equation 2, and Z2 are indepen-dent random variables withE(Z,) = 0 and Var(Zi) = 2ja

    E(Z,) = 0and

    Var(&) = j$ (yr - P)2.ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    28/42

    542 l Adam and WortmannTable 3. A Summary of the Performance

    Security Suitable for SuitabilitySecurity Exact Partial Numerical Suitable for to OnlineControl Disclosure Disclosure or Categ. One or More DynamicMethod Possible? Possible? Robustness Attribute Attributes SDB

    Random Sample No Yes (can be Moderate Both More than one HighQueries by balancedDenning againstprecision)

    Varying Output No Yes (can be Moderate Numerical One or several HighPerturb. by balanced independentBeck against onesprecision)

    It can be shown thatE(T) = jl Y,a,

    Var(T) = 2jna + 2S,2 + 2ja2(Fr - 9)2,where

    s2 = CL (Fr - n2r n

    is the sample variance over the set R.Given Becks criterion of compromise,the value ,yk of some individual k by anestimator Yk through repeated queries isVar(Yk) < c2(Yk - P)

    Beck showed that dj queries (with d =(a/c)) are required for partial compromise.Table 3 shows the evaluation of thismethod with respect to the criteria dis-cussed in Section 1. Additional discussionis included in Section 6.5.3 RoundingOutput perturbation may take some formof rounding, where the answer to the queryis rounded up or down to the nearest mul-tiple of a certain base b. Systematic round-ing [Achugbue and Chin 19791, randomrounding [Fellegi and Phillips 1974; Haq1975, Haq 19771, and controlled rounding[Dalenius 19811 are three types of roundingthat have been investigated. The basic ideaof each of the rounding methods is sum-marized below.

    Consider a database in a tabular form andsuppose that, due to security reasons, it wasdecided to round the true statistic nij of celli, j in the table. As shown above, systematicand random rounding calls for adding aquantity +d, or -dij to nij. According tocontrolled rounding, the same quantity +d,(or -dij ) is added to three other cells. Thesecells are chosen in such a way that thereleased row sum, ri, and column sum, rj ,equal the true row sum, ni, and the truecolumn sum, n , respectively.Random rounding suffers from two majordrawbacks [Achugbue and Chin 19791:

    Let r = ] C ] (mod b) and 1x1 = the value (1) It is possible (with small probability,of X rounded downward. however) for a randomly rounded row of an

    5.3.1 Systematic RoundingThe response to the user

    ifr=O,if r < L(b + 1)/2J

    ICl+b-r ifrrL(b+1)/2J5.3.2 Random RoundingThe response to the user

    c ICI ifr=O,= ICI-ri

    l-rwith probability b

    ] C ] + b - r with probability f5.3.3 Controlled Rounding

    ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    29/42

    of the Querv Restriction Based MethodsSecurity Control Method For Databases 9 543

    Richness of Information costsAmount of Initial Processing

    Nonconf. Info. Implementation Overhead UserEliminated Bias Precision Consistency Efforts Per Query Education

    Low None Can be balanced Low Low Low Lowagainst secu-rity

    Low None Can be balancedagainst secu-rity

    Low Moderate Moderate Very high

    SDB in a tabular form to determine theexact original values of the row cells.(2) It is possible to determine the truevalue, nij (or greatly narrow its range) byaveraging the responses to the same query.

    Systematic rounding introduces nonzerobias. Furthermore, systematic rounding canbe circumvented by derounding or using atracker [Denning and Schlorer 19831.Generally, rounding is not consideredan effective security-control method. Butcombining rounding with other security-control methods seems to be a promisingavenue. For example, in Ozsoyoglu and Su[1985], rounding was used as an additionalsecurity-control method.Before concluding this section, it shouldbe pointed out that the output-perturbationapproach does not add noise to the dataand thus does not suffer from a serious biasproblem. It suffers, however, from thepossibility of having a null query set,thus providing valuable information to asnooper. Compromised methods such asregression-based ones (see Section 7) couldeasily take advantage of the null-query-setsituation.6. COMPARATIVE ANALYSIS OF THE

    SECURITY-CONTROL METHODSThe discussion presented in the previoussections shows that the random-sample-queries method [Denning 19801, the vary-ing-output-perturbation method [Beck19801, the fixed-data-perturbation method[Traub et al. 19841, and the fixed-data-

    perturbation for categorical data method[Warner 19651 are clearly among the mostpromising security-control methods for on-line, dynamic SDBs (the most difficult typeof SDBs). This section presents a compar-ison of these methods based on the evalu-ation criteria discussed in Section 1 asapplied to both the COUNT and SUMqueries.6.1 Security Criterion for the COUNT QueryPartial disclosure occurs if a snooper at-tempts (through issuing m queries) toobtain an estimator ri for the true countvalue, n, whose perturbed value is N andthe variance of that estimator satisfies thefollowing:

    Var(fi) < c?,where c1 is a parameter that is set by theDBA. We want to determine the value ofm that will result in

    Var(fi) < CSunder each of the methods.Let ri be an estimate of n obtained aftermR repeated queries when using the ran-dom-sample queries [Denning 19801. Wethen have

    Var(N) since answers toVar(fi) = 7 repeated queries. are independentor

    Var(N) Var(N)-=-mR = Var(;L) d ,ACM Computing Surveys, Vol. 21, No. 4, December 1989

  • 7/28/2019 Stat Database Sec78

    30/42

    544 lbut

    Hence,

    Adam and WortmannOF

    n(1 - P)Var(N) = p .

    n(1 - P)mR =When the varying-output-perturbationmethod [Beck 19801 is used, the answers torepeated queries are not independent andthe number of repeated queries, mv, thatwill result in

    Var(fi) = CTis given by atmv= 7.0 l

    Under equal precision to users it can beeasily shown that mR 5 mv, since for equalprecision we haven(1 - P)

    P = jaf.Dividing both sides by CSwe get

    n(1 - P) a4PC: =.i 20or

    as compared to2 ialmv= T .0l

    Under certain circumstances (e.g., smalln) the method [Denning 19801 provides aprecision that might not be attainable byBecks [1980] method. In such cases, mR

  • 7/28/2019 Stat Database Sec78

    31/42

    Factors

    Table 4. The Precision Criterion for the COUNT QueryVarying Output

    Random Sample Queries PerturbationFixed Data Perturbation

    by Warner1. Precision: V(N) n(1 - P)V(N) = p V(N) = ja:

    2. Required i nput parameters and theirrecommended values .5 I p 5 .90 4 5 j 5 30 and .5 5

    c1 c 1.0 .6 5 P I .8Minimum query set size could and a: = c:dbe as low as 9 and d = 3.0

    3. Range of PrecisionMin value of V(N): .lln (when p = .9 and n > 9) 3.0 (when j = 4 and c1 = .5) whenl=landp=.6%3Max value of V(N): n(whenp=.5an dn>9) 90.0 (when j = 30 and c1 = 1.0)

    4. Need for a min