
Latent tree models and diagnosis in traditional Chinese medicine

    Nevin L. Zhang a,*, Shihong Yuan b, Tao Chen a, Yi Wang a

a Department of Computer Science & Engineering, The Hong Kong University of Science & Technology, Clear Water Bay Road, Kowloon, Hong Kong, China
b Department of Chinese Medicine Diagnostics, Beijing University of Traditional Chinese Medicine, 11 Beisanhuan Donglu, Beijing 100029, China

    Received 13 January 2007; received in revised form 17 October 2007; accepted 25 October 2007

Artificial Intelligence in Medicine (2008) 42, 229–245

    http://www.intl.elsevierhealth.com/journals/aiim

    KEYWORDS

Machine learning; Latent structure models; Multidimensional clustering; Traditional Chinese medicine; Syndrome differentiation

    Summary

Objective: TCM (traditional Chinese medicine) is an important avenue for disease prevention and treatment for the Chinese people and is gaining popularity among others. However, many remain skeptical and even critical of TCM because of a number of its shortcomings. One key shortcoming is the lack of objective diagnosis standards. We endeavor to alleviate this shortcoming using machine learning techniques.

Method: TCM diagnosis consists of two steps, patient information gathering and syndrome differentiation. We focus on the latter. When viewed as a black box, syndrome differentiation is simply a classifier that classifies patients into different classes based on their symptoms. A fundamental question is: do those classes exist in reality? To seek an answer to the question from the machine learning perspective, one would naturally use cluster analysis. Previous clustering methods are unable to cope with the complexity of TCM. We have therefore developed a new clustering method in the form of latent tree models. We have conducted a case study where we first collected a data set about a TCM domain called KIDNEY DEFICIENCY and then used latent tree models to analyze the data set.

Results: Our analysis has found natural clusters in the data set that correspond well to TCM syndrome types. This is an important discovery because (1) it provides statistical validation to TCM syndrome types and (2) it suggests the possibility of establishing objective and quantitative diagnosis standards for syndrome differentiation. In this paper, we provide a summary of research work on latent tree models and report the aforementioned case study.

© 2007 Elsevier B.V. All rights reserved.

    * Corresponding author. Tel.: +852 2358 7015; fax: +852 2358 1477.

    E-mail address: [email protected] (N.L. Zhang).


    doi:10.1016/j.artmed.2007.10.004


    1. Introduction

We present a work that applies machine-learning techniques to solve a problem in TCM (traditional Chinese medicine). To motivate the work, we will first briefly explain the differences between TCM and western medicine, discuss the increasing importance of TCM in health care, and point out one key shortcoming of TCM (Section 1.1). We will then motivate our approach to overcome the shortcoming (Section 1.2), and describe the content and organization of this paper (Section 1.3).

    1.1. TCM and western medicine

TCM and western medicine represent two different paradigms. Western medicine approaches the human body from an anatomic and biochemical standpoint. It emphasizes specific disease entities and focuses on pathophysiological mechanisms. TCM approaches the human body from an energetic and functional standpoint. It believes that disease is a result of disharmony within the body and between the body and the environment. TCM diagnosis and treatment involve identifying the factors that are out of balance and attempting to bring them back into harmony. Instead of micro-level laboratory tests, TCM diagnosis is primarily based on overall observation of the patient through inspection, auscultation and olfaction, interrogation, and palpation.

Western medicine is often complicated by undesirable side effects and does not have effective treatments for illnesses such as irritable bowel syndrome and menopause [1]. An increasing number of patients in developed countries are turning to traditional medicine for alternative treatment. According to a report by the World Health Organization [2], the global market for herbal medicines currently stands at over US$60 billion annually and is growing steadily. As a matter of fact, 80% of Africans use traditional medicine; 90% of Canadians, 49% of French people, 48% of Australians, 42% of Americans, and 31% of Belgians have received traditional medicine treatment at least once in their lifetimes; and 77% of German clinics recommend acupuncture for pain relief.

TCM is a systematic traditional medicine system, and it has been clinically observed to have dramatic performance in treating many chronic and systemic diseases. However, several shortcomings are hindering its wide acceptance. One key shortcoming is the lack of objective diagnosis standards. In practice, this results in variability of diagnosis conclusions among TCM practitioners [1], which in turn raises doubts about TCM in the minds of many.

To alleviate the situation, researchers in China have been seeking laboratory tests that can be used as gold standards for TCM diagnosis [3,4]. Despite extensive research work for more than half a century, little has been achieved [5]. In this paper, we propose and investigate a different approach.

1.2. Syndrome differentiation, clustering, and latent structures

TCM diagnosis consists of two steps. In the first step, a doctor collects patient information through inspection, auscultation and olfaction, interrogation, and palpation. In the second step, he reaches diagnosis conclusions by analyzing the patient information based on TCM theories and his experience. The second step is known as syndrome differentiation, while the first step can be called patient information gathering.

Subjectivity is an issue in both the patient information gathering step and the syndrome differentiation step. We focus on the latter. Our long-term goal is to establish objective and quantitative standards for syndrome differentiation using machine learning techniques. How could this goal possibly be achieved?

When viewed as a black box, syndrome differentiation is simply a classifier that classifies patients into different classes based on their symptoms. Do those classes exist in reality? Can we characterize them mathematically? To seek answers to those questions from the machine learning perspective, one would naturally use cluster analysis. The idea is to: (1) collect patient data systematically; (2) perform cluster analysis to identify natural clusters of patients in these data; and (3) compare the natural clusters with the classes mentioned in TCM. If some of the natural clusters match the TCM classes, then we would have provided statistical validation for the TCM classes. Moreover, we can use the natural clusters as a basis for syndrome differentiation. The TCM classes are described in natural language and are often vague and prone to subjectivity. The natural clusters, however, are described in the language of mathematics and are objective (in a relative sense). Therefore, they can serve as the foundation for objective and quantitative syndrome differentiation standards.

Which clustering method should we use? To answer this question, we need to take a closer look at syndrome differentiation. There are several syndrome differentiation systems, each focusing on a different perspective of the human body and each with its own theory. The theories describe relationships between syndrome factors and symptoms, as illustrated by this excerpt:

KIDNEY YANG¹ is the basis of all YANG in the body. When KIDNEY YANG is in deficiency, it cannot warm the body and the patient feels cold, resulting in intolerance to cold, cold limbs, and cold lumbus and back [6].

¹ Words in small capital letters are reserved for TCM terms.

The syndrome factor mentioned here is KIDNEY YANG DEFICIENCY. It is not directly observed. Rather, it is similar in nature to concepts such as intelligence and is indirectly measured through its manifestations. Hence, we say that it is a latent variable. In contrast, symptom variables such as cold limbs are directly observed, and we call them manifest variables. TCM theories involve a large number of latent and manifest variables. Abstractly speaking, they describe relationships among latent variables, and between latent variables and manifest variables. They can be viewed as latent structure models specified in natural language. To study syndrome differentiation, therefore, we need statistical models that involve multiple interrelated latent variables, i.e. latent structure models described in the language of mathematics.

How are latent structure models related to clustering? A latent structure model represents a multidimensional classification of objects. Imagine a latent structure model about people that contains two latent variables, intelligence and personality. Suppose the first latent variable has three possible values, high, medium, and low, while the second latent variable has two possible values, outgoing and withdrawn. Then the model represents one classification of people along the dimension of intelligence into three classes, and at the same time another classification along the dimension of personality into two classes. By considering the two dimensions at the same time, one gets a cross classification where one can speak of classes such as outgoing people with high intelligence and withdrawn people with medium intelligence.

Now suppose we start from a data set and by analyzing the data set we obtain a latent structure model. That would mean that we simultaneously cluster the data in multiple ways. Each latent variable in the resultant model represents one dimension along which to cluster the data set, and different latent variables represent different ways to cluster the data set. So, to analyze a data set using latent structure models is to perform multidimensional clustering on the data set.

Is the aforementioned latent structure approach to the study of syndrome differentiation feasible? To answer this question, we have studied a special class of latent structure models called latent tree models, and we have used latent tree models to investigate a subdomain of TCM diagnosis called KIDNEY DEFICIENCY. This case study shows that (1) there exist natural clusters in data that correspond to TCM syndrome types, and (2) we can find those classes using latent tree models. These results indicate that the latent structure approach is indeed feasible.

In summary, we propose a novel approach to the study of syndrome differentiation. The long-term goal is to establish objective and quantitative standards for syndrome differentiation. This paper presents the tools that have been developed for achieving the goal and demonstrates the feasibility of the approach through a case study.

    1.3. Content and organization

In the first half of the paper, we provide a summary of research work on latent tree models. We will first introduce latent tree models as a generalization of latent class models [7,8] (Section 2). Then we will discuss two representational issues, namely model equivalence (Section 3) and identifiability (Section 4). In Section 5, we will survey algorithms for learning latent tree models. In the second half of the paper, we report the aforementioned case study. We will start by describing the data set and the data analysis process (Section 6). In Sections 7 and 8, we interpret the latent variables and clusters in the resultant model. The reader will see that the latent variables correspond to syndrome factors and the latent clusters can be interpreted as syndrome types. The paper will end in Section 9 with a summary and remarks about future directions.

    2. Latent tree models

The concept of a latent variable is defined with respect to some given data set. A variable is observed with respect to a data set if its value can be found in at least one of the records. A variable is latent with respect to a data set if its value is missing from all the records. For simplicity, we will often talk about latent variables without explicitly referring to data sets.

A latent class model [7,8] is a Bayesian network that consists of a latent variable X and a number of observed variables Y1, Y2, ..., Yn. All the variables are categorical and the relationships among them are as shown in Fig. 1. In applications, the latent variable stands for some latent factor and the observed variables stand for manifestations of the latent factor. The observed variables are also called manifest variables.



To learn a latent class model is to: (1) determine the cardinality, i.e. the number of states, of the latent variable X, and (2) estimate the model parameters P(X) and P(Yi|X). A state of the latent variable X corresponds to a class of individuals in a population. It is called a latent class. Thus, to determine the cardinality of X is to determine the number of latent classes, and to estimate P(Yi|X) is to reveal the statistical characteristics of the latent classes. So, latent class analysis is a type of model-based cluster analysis.
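To make the parameterization concrete, here is a minimal Python sketch of a latent class model (our illustration, not code from the paper; the parameter values are toy numbers). It computes the probability of a manifest configuration by summing out the latent variable:

import numpy as np

# Toy latent class model: one latent variable X with 2 states and three
# binary manifest variables Y1, Y2, Y3. P(X) and the P(Yi|X) tables are
# made-up numbers, for illustration only.
p_x = np.array([0.6, 0.4])                       # P(X)
p_y_given_x = [
    np.array([[0.9, 0.1], [0.2, 0.8]]),          # P(Y1|X): row x, column y1
    np.array([[0.7, 0.3], [0.1, 0.9]]),          # P(Y2|X)
    np.array([[0.8, 0.2], [0.3, 0.7]]),          # P(Y3|X)
]

def joint_manifest(y):
    """P(Y1=y[0], Y2=y[1], Y3=y[2]) = sum_x P(x) * prod_i P(y[i] | x)."""
    return sum(
        p_x[x] * np.prod([cpt[x, yi] for cpt, yi in zip(p_y_given_x, y)])
        for x in range(len(p_x))
    )

print(joint_manifest((0, 0, 0)))                 # marginal of one configuration

Learning would, in addition, choose the cardinality of X and fit these tables from data, typically with the EM algorithm.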

Underlying a latent class model is the assumption that the manifest variables are mutually independent given the latent variable. A serious problem with the use of latent class models, known as local dependence, is that this assumption is often violated. If one does not deal with local dependence explicitly, one implicitly attributes it to the latent variable. This can lead to spurious latent classes and poor model fit. It can also degrade the accuracy of classification because locally dependent manifest variables contain overlapping information [9].

The local dependence problem is acknowledged in the latent class analysis literature [10]. Methods for detecting and modeling local dependence have been proposed. To detect local dependence, one typically compares observed and expected cross-classification frequencies for pairs of manifest variables. To model local dependence, one can join manifest variables, introduce multiple latent variables, or reformulate latent class models as log-linear models and then impose constraints on them.

Previous work on dealing with local dependence is not sufficient for a number of reasons. First, there are no criteria for making the trade-off between increasing the cardinalities of existing latent variables and introducing additional latent variables. In [10,11], the cardinalities of all latent variables are fixed at 2 while the number of latent variables is allowed to change. In most other work, the standard one-latent-variable structure is assumed and fixed, while the cardinality of the latent variable is allowed to change. Second, the search for the best model is carried out manually. Typically only a few simple models are considered [11,12]. Finally, when there are multiple pairs of locally dependent manifest variables, it is not clear which pair should be tackled first, or whether all pairs should be handled simultaneously.

Latent tree models were previously known as HLC (hierarchical latent class) models [13,14]. They are a generalization of latent class models. A latent tree model is a Bayesian network where:

- the network structure is a rooted tree;
- the internal nodes represent latent variables and the leaf nodes represent manifest variables; and
- all the variables are categorical.

Fig. 2 shows an example latent tree model. In this paper, we do not distinguish between variables and nodes. So we sometimes speak also of manifest nodes and latent nodes.
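As a data structure, a latent tree model is simply a tree whose internal nodes are latent variables and whose leaves are manifest variables. The following Python sketch (ours; the topology is made up and is not the one in Fig. 2) encodes such a tree and checks the defining conditions:

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    cardinality: int      # number of states; all variables are categorical
    latent: bool
    children: list = field(default_factory=list)

# A made-up rooted latent tree: latent root X1 with latent children X2 and
# X3; every leaf is a manifest variable.
root = Node("X1", 2, True, [
    Node("X2", 2, True, [Node("Y1", 2, False), Node("Y2", 2, False)]),
    Node("X3", 3, True, [Node("Y3", 2, False),
                         Node("Y4", 2, False),
                         Node("Y5", 2, False)]),
])

def check_latent_tree(node):
    """Internal nodes must be latent; leaf nodes must be manifest."""
    if node.children:
        assert node.latent, node.name
        for child in node.children:
            check_latent_tree(child)
    else:
        assert not node.latent, node.name

check_latent_tree(root)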

The class of latent tree models is clearly much larger than the class of latent class models. In the meantime, latent tree models remain computationally attractive because their structures are restricted to trees. Latent tree models can alleviate the local dependence problem that latent class models face. We will later present search-based algorithms for learning latent tree models. When there is no local dependence, the algorithms return latent class models. When local dependence is present, they return latent tree models with local dependence appropriately modeled.

In Section 1.2, we explained why we need latent tree models in TCM research. There are two other reasons why latent tree models are interesting in general. First, latent tree models represent complex dependencies among observed variables and yet are computationally simple to work with. It was for this reason that Pearl [15] first identified latent tree models as a potentially useful class of Bayesian networks. Second, the endeavor of learning latent tree models can reveal interesting latent structures. Researchers have already been inferring latent structures from observed data. One example is the reconstruction of phylogenetic trees [16], which can be viewed as special latent tree models. Latent structure discovery is also interesting for TCM. After all, TCM theories themselves are latent structure models described in natural language.


    Figure 1 The structure of a latent class model.

Figure 2 An example latent tree model. The Xi's are latent variables and the Yj's are manifest variables.


    3. Model equivalence

We write a latent tree model as a pair M = (G, θ), where G stands for the model structure together with the cardinalities of the variables, and θ stands for the vector of probability parameters. We will sometimes refer to G also as a latent tree model. Two latent tree models (G, θ) and (G', θ') are marginally equivalent if they share the same manifest variables Y1, Y2, ..., Yn and

$P(Y_1, \ldots, Y_n \mid G, \theta) = P(Y_1, \ldots, Y_n \mid G', \theta')$.   (1)

A latent tree model G includes another model G' if, for any parameter value vector θ' of G', there exists a parameter value vector θ of G such that (G, θ) and (G', θ') are marginally equivalent. If G includes G', then G can represent any distribution over the manifest variables that G' can. If G includes G' and vice versa, we say that G and G' are marginally equivalent. Marginally equivalent models are equivalent if they have the same number of independent parameters. It is impossible to distinguish between equivalent models based on data if a penalized likelihood score [17] is used for model selection.

Let X1 be the root of a latent tree model G. Suppose X2 is a child of X1 and it is also a latent node. Define another latent tree model G' by reversing the arrow X1 → X2. Variable X2 becomes the root in the new model. The operation is called root walking; the root has walked from X1 to X2.

It has been proved that root walking leads to equivalent models [14]. Therefore, the root and edge orientations of a latent tree model cannot be determined from data. What we learn from data are unrooted latent tree models, which are latent tree models with all directions on the edges dropped. Fig. 3 shows the unrooted latent tree model that corresponds to the latent tree model in Fig. 2.

An unrooted latent tree model represents a class of latent tree models. Members of the class are obtained by rooting the model at various latent nodes. Semantically, it is a Markov random field on an undirected tree. The concepts of marginal equivalence and equivalence can be defined for unrooted latent tree models in the same way as for rooted models. From now on, when we speak of latent tree models, we always mean unrooted latent tree models unless explicitly stated otherwise.

    4. Identifiability and regularity

Let Y1, Y2, ..., Yn be the manifest variables in a latent tree model G. If there exist two different parameter value vectors θ and θ' such that

$P(Y_1, \ldots, Y_n \mid G, \theta) = P(Y_1, \ldots, Y_n \mid G, \theta')$,   (2)

then model G is said to be unidentifiable. In such a case, it is impossible to distinguish between the two parameter value vectors θ and θ' based on data.

Identifiability is closely related to three other concepts, namely effective dimensions, standard dimensions, and parsimonious models. Let k be the product of the cardinalities of the manifest variables. For a given parameter value vector θ, P(Y1, ..., Yn | G, θ) is a point in the space $\mathbb{R}^{k-1}$. If we let θ run through all its possible values, we get a set of points in $\mathbb{R}^{k-1}$. This set is a manifold [18]. We call it the manifold image of model G. The effective dimension of G is defined to be the dimension of its manifold image. The number of independent parameters of G is called its standard dimension. The standard dimension of a latent tree model is always greater than or equal to its effective dimension.

A latent tree model G is parsimonious if there does not exist another model that is marginally equivalent to G and that has fewer independent parameters than G. A latent tree model is strongly parsimonious if its standard dimension is the same as its effective dimension. Clearly, if a model is not parsimonious, then it is not strongly parsimonious. If a model is not strongly parsimonious, then it is unidentifiable.

Are strongly parsimonious models identifiable? The answer is negative. Let G be an arbitrary latent tree model, strongly parsimonious or not, and let θ be a parameter value vector for G. Further, let X be a latent variable in G and s1 and s2 be two states of X. Let θ' be obtained from θ by swapping the parameter values pertaining to s1 and those pertaining to s2. Then the two pairs (G, θ) and (G, θ') satisfy Eq. (2). Hence, G is unidentifiable. This means that one cannot identify the identities of the states of latent variables based on data. We call this the unidentifiability of latent states. It is an intrinsic property of all models with latent variables. It should be noted that, in applications, one can usually determine the meanings of the latent states from domain knowledge.
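The state-swapping argument is easy to verify numerically. In the sketch below (ours; a toy latent class model with arbitrary numbers), permuting the two states of the latent variable, i.e. reordering P(X) together with the rows of the conditional probability tables, leaves the marginal over the manifest variables unchanged:

import numpy as np

p_x = np.array([0.3, 0.7])                         # P(X)
p_y1_given_x = np.array([[0.9, 0.1], [0.2, 0.8]])  # P(Y1|X)
p_y2_given_x = np.array([[0.6, 0.4], [0.1, 0.9]])  # P(Y2|X)

def marginal(p_x, cpts):
    """P(Y1, Y2) as an array indexed by (y1, y2)."""
    return np.einsum("x,xa,xb->ab", p_x, *cpts)

perm = [1, 0]                                      # swap latent states s1, s2
original = marginal(p_x, [p_y1_given_x, p_y2_given_x])
swapped = marginal(p_x[perm], [p_y1_given_x[perm], p_y2_given_x[perm]])
print(np.allclose(original, swapped))              # True: Eq. (2) holds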


Figure 3 The unrooted latent tree model that corresponds to the latent tree model in Fig. 2.


If a model is not parsimonious, then it contains redundancies. The concept of regularity was introduced to identify some of the redundancies [14]. Let |X| stand for the cardinality of a variable X. For a latent variable Z in a latent tree model, enumerate its neighbors as X1, X2, ..., Xk. A latent tree model is regular if, for any latent variable Z,

$|Z| \le \frac{\prod_{i=1}^{k} |X_i|}{\max_{i=1}^{k} |X_i|}$,   (3)

and, when Z has only two neighbors, one of the neighbors is a latent node and inequality (3) holds strictly.

If a latent tree model G is not regular, then there exists a regular model G' that is marginally equivalent to G and has fewer independent parameters [14]. The model G' can be obtained from G through the following regularization process (a code sketch is given after the list):

(1) For each latent variable Z in G:
(a) If Z violates inequality (3), reduce the cardinality of Z to $\prod_{i=1}^{k} |X_i| / \max_{i=1}^{k} |X_i|$.
(b) If Z has only two neighbors, one of which is a latent node, and Z violates the strict version of inequality (3), remove Z from G and connect the two neighbors of Z.
(2) Repeat Step 1 until no further changes occur.
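The sketch below (ours) implements the check of inequality (3) and the regularization process under assumed data structures: card maps each node to its cardinality, latent is the set of latent nodes, and nbrs maps each node to the set of its neighbors in the undirected tree.

from math import prod

def bound(z, card, nbrs):
    """Right-hand side of inequality (3) for latent node z."""
    cs = [card[x] for x in nbrs[z]]
    return prod(cs) // max(cs)      # an integer: the product of the others

def regularize(card, latent, nbrs):
    """Apply steps (a) and (b) repeatedly until nothing changes."""
    changed = True
    while changed:
        changed = False
        for z in list(latent):
            b = bound(z, card, nbrs)
            if card[z] > b:         # step (a): cap the cardinality of z
                card[z] = b
                changed = True
            ns = list(nbrs[z])
            if (len(ns) == 2 and any(n in latent for n in ns)
                    and card[z] >= b):
                # step (b): z violates the strict inequality; remove it
                # and connect its two neighbors directly.
                n1, n2 = ns
                nbrs[n1].discard(z); nbrs[n2].discard(z)
                nbrs[n1].add(n2); nbrs[n2].add(n1)
                del nbrs[z]
                latent.discard(z)
                changed = True
    return card, latent, nbrs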

Regularization can remove redundancies in latent tree models. Can it remove all the redundancies? In other words, are regular models parsimonious? This is an open question. What we do know is that some regular models are not strongly parsimonious. Examples can be found in Table 1, which is borrowed from [19]. The table shows the effective and standard dimensions of several latent class models. Latent class models are identified using the cardinalities of their variables. For example, 2:3,3 refers to a latent class model with two manifest variables, where the cardinality of the latent variable is 2 while those of the two manifest variables are both 3. All the models in the table are regular. However, their effective dimensions are smaller than their standard dimensions. Hence, they are not strongly parsimonious.

Table 1 Standard and effective dimensions of several latent class models

Model      Effective dimension    Standard dimension
2:2,2      3                      5
3:2,2,2    7                      11
4:2,2,2    7                      15
2:3,3      7                      9
3:4,5      17                     23
4:3,3,3    25                     27

In addition to helping us remove redundancies, the concept of regularity also provides a finite search space for the algorithms described in the next section. Let Y be a set of variables. There are infinitely many possible latent tree models with Y as manifest variables. However, only a finite number of them are regular [14].

    5. Learning latent tree models

Assume that there is a collection D of i.i.d. samples that were generated by an unknown regular latent tree model. Each sample contains values for some or all of the manifest variables. By learning latent tree models we mean the effort to reconstruct, from D, the regular unrooted latent tree model that corresponds to the generative model.

Three search-based algorithms have been proposed for this task, namely DHC (double hill-climbing) [13,14], SHC (single hill-climbing) [20], and HSHC (heuristic single hill-climbing) [20]. All those algorithms aim at finding the model with the highest BIC score [21]. The BIC score of a model G given a data set D is:

$\mathrm{BIC}(G \mid D) = \log P(D \mid G, \theta^{*}) - \frac{d(G)}{2} \log N$,   (4)

where θ* is the maximum likelihood estimate of the model parameters, d(G) is the standard dimension of G, and N is the sample size.
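As a plain function, Eq. (4) is a one-liner. The sketch below (ours) assumes that the maximized log-likelihood, the standard dimension, and the sample size have been computed elsewhere:

from math import log

def bic(loglik: float, dim: int, n: int) -> float:
    """BIC(G|D) = log P(D|G, theta*) - d(G)/2 * log N, as in Eq. (4)."""
    return loglik - (dim / 2.0) * log(n)

print(bic(loglik=-12345.6, dim=150, n=2600))     # toy numbers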

The BIC score is a large-sample approximation of the marginal likelihood P(D|G) derived in a setting where all variables are observed. Geiger et al. [18] have re-done the derivation for latent variable models and arrived at another scoring function called the BICe score. The BICe score is the same as the BIC score except that the standard dimension d(G) is replaced by the effective dimension of the model. Theoretically, BICe is advantageous over BIC. Why, then, is BIC still used to guide the search algorithms? There are three reasons. First, effective model dimensions are difficult to compute, despite recent decomposition results [22]. Second, there is no substantial empirical evidence showing that BICe is advantageous over BIC in practice. Third, our experience with about one dozen data sets indicates that one can find good models with BIC.

    5.1. Double hill-climbing

The DHC algorithm searches in the space of regular latent tree models. It starts with the simplest latent tree model, i.e. the model with one binary latent variable. At each step of search, it first generates a number of candidate model structures by modifying the structure of the current model using three search operators, namely node introduction, node deletion, and node relocation. It then optimizes the cardinalities of the latent variables in each of the candidate model structures, resulting in candidate models. Finally, it evaluates the candidate models and picks the best one to seed the next step of search. Search terminates when the best candidate model is no better than the current model. To optimize the cardinalities of the latent variables in a candidate model structure, the algorithm employs another hill-climbing routine. This is why it is called DHC (double hill-climbing).
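Schematically, the outer hill-climb of DHC looks as follows. This is our sketch rather than the authors' code; the three helpers (the structural operators, the inner cardinality hill-climb, and the scoring function) are passed in as parameters because the surrounding text describes them but does not give their implementations.

def dhc(data, initial_model, apply_operators, optimize_cardinalities, score):
    """Outer hill-climb of DHC (schematic).

    apply_operators(model)          -> candidate structures via NI, ND, NR
    optimize_cardinalities(s, data) -> inner hill-climb over cardinalities
    score(model, data)              -> e.g. the BIC score of Eq. (4)
    """
    current = initial_model
    while True:
        candidates = [optimize_cardinalities(s, data)
                      for s in apply_operators(current)]
        best = max(candidates, key=lambda m: score(m, data))
        if score(best, data) <= score(current, data):
            return current          # no candidate beats the current model
        current = best              # seed the next step of search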

Node introduction is conceptually the most important of the three search operators. To motivate the operator, consider the latent tree model G1 shown in Fig. 4. It assumes that the manifest variables are mutually independent given the latent variable. What if this assumption is not true? To be more specific, what if Y1 and Y2 are correlated given X? A natural thing to do in this case is to introduce a new latent node X1 and make it a parent of Y1 and Y2, as shown in G2.

The NI (node introduction) operator is the result of applying this idea to unrooted latent tree model structures. Let X be a latent node in such a structure. Suppose X has more than two neighbors. For any two neighbors Z1 and Z2 of X, we can introduce a new latent node Z to mediate X and its neighbors Z1 and Z2. Afterwards, X is no longer connected to Z1 and Z2. Instead, X is connected to Z, and Z is connected to Z1 and Z2. Consider the model structure G'1 in Fig. 4. Introducing a new latent node X1 to mediate X and its neighbors Y1 and Y2 results in the model structure G'2. For the sake of computational efficiency, node introduction is not allowed to involve three or more neighbors of a latent node. This implies that we cannot reach G'3 from G'1 using the NI operator only.

ND (node deletion) is the opposite of node introduction. Let X be a latent node and let Z1, Z2, ..., Zk be the neighbors of X. If one of these neighbors, say Z1, is also a latent node, then we can delete X with respect to Z1. This means to remove X from the model and connect Z1 to each of Z2, Z3, ..., Zk. In Fig. 4, deleting X1 from G'2 or G'3 with respect to X results in G'1.

The third search operator is called node relocation. Let X be a latent node and Z be a neighbor of X. Suppose X has another neighbor Z' that is also a latent node. Then we can relocate Z from X to Z'. This means to disconnect Z from X and reconnect it to Z'. In Fig. 4, relocating Y3 from X to X1 in G'2 results in G'3. For the sake of computational efficiency, it is not allowed to relocate Z to a latent node that is not a neighbor of X. In Fig. 3, for example, we cannot relocate Y2 from X2 to X3.

Given a latent tree model structure, how do we determine the cardinalities of the latent variables? The answer is to search in the space of all the regular latent tree models with the given model structure. The search starts with the model where the cardinalities of all the latent variables are set at 2. At each step, we obtain a number of candidate models by increasing the cardinality of a single latent variable by one. Irregular candidate models are discarded.


    Figure 4 Illustration of structural search operators.


Each of the candidate models is then evaluated, and the best one is picked to seed the next search step. To evaluate a model, one needs to estimate its parameters. The EM algorithm [23,24] is used for this task.

Zhang [14] tested the DHC algorithm on both synthetic data and real-world data. The synthetic data sets were generated from a latent tree model whose structure is shown in Fig. 2. The sample size ranges from 5000 to 100,000. The models obtained by DHC are of high quality because: (1) their structures are either identical to that of the generative model or only one step away from it, and (2) their logarithmic scores on a testing data set are very close to that of the generative model. In another setting where strong dependence between variables was enforced, DHC correctly recovered the structure of the generative model for all sample sizes. The real-world data sets were taken from the latent class analysis literature. In all cases but one, DHC found the models that are considered to be the best in the literature. The exception happened with a data set for which latent tree models are apparently inappropriate.

The DHC algorithm has one serious drawback, namely its extremely high computational complexity. Although the synthetic data sets involve only seven manifest variables, DHC took around 100 h on a 1 GHz Pentium III machine to analyze each of them.² As such, DHC is a concept-testing algorithm.

² The running times for the different data sets are more or less the same because the data sets contain the same number of distinct samples, although their sample sizes are different.

    5.2. Single hill-climbing

A latent tree model G consists of a network structure, the cardinalities of the manifest variables, and the cardinalities of the latent variables. If the cardinalities of the latent variables are unknown, then G is called a pre-LT (pre-latent-tree) model. The outputs of the three search operators used by DHC are pre-LT models, and the cardinalities of the latent variables are optimized using a separate hill-climbing routine. We have been vague about this point until now for readability.

The SHC algorithm is the same as the DHC algorithm in that they both search in the space of regular latent tree models. However, there are two major differences. First, the search operators in SHC produce latent tree models rather than pre-LT models. This implies that attention must be paid to the cardinalities of the latent variables when designing the operators. Second, SHC does not have a separate routine to optimize the cardinalities of latent variables. They are optimized at the same time as the model structure.

Five search operators are used in SHC, namely node introduction, node deletion, node relocation, state introduction, and state deletion. The NI (node introduction) operator of SHC is the same as the NI operator of DHC, except that it specifies a cardinality for the newly introduced latent node. When introducing a new latent node Z to mediate a latent node X and two of its neighbors, we set |Z| = |X|. Let G and G' be the models before and after the operation. We have the nice property that G' includes G.

The ND (node deletion) and NR (node relocation) operators of SHC are the same as the ND and NR operators of DHC, except that the cardinalities of latent variables are no longer ignored. The SI (state introduction) operator increases the cardinality of a latent variable by one, while the SD (state deletion) operator decreases the cardinality of a latent node by one. Applying the operators to a regular latent tree model might result in irregular models. For this reason, they are always followed immediately by the regularization process described in Section 4. The operators can sometimes make a latent node a leaf node. When this happens, the node is simply deleted.

The SHC algorithm begins with the simplest latent tree model, i.e. the model with one binary latent node. It works in two phases. In Phase I, SHC expands models by applying the NI and SI operators. The aim is to improve the likelihood term of the BIC score. The NR operator is also used in Phase I. In Phase II, SHC simplifies models by applying the ND and SD operators. The aim is to reduce the penalty term of the BIC score while keeping the likelihood term more or less the same. If model quality is improved in Phase II, SHC goes back to Phase I and the process repeats itself.

In Phase I, SHC needs to choose between NI and SI operations. There is an issue of operation granularity here. Suppose there are 100 manifest variables. At the first step of search, the current model contains only one latent variable, which is binary. Denote it by X0. Applying the SI operator to X0 would introduce 101 additional model parameters. Introducing a new latent node to mediate X0 and two of its neighbors, on the other hand, would increase the number of model parameters by only 2. The latter operation is clearly of much finer grain than the former. The former is like a bulldozer if the latter is compared to a shovel.

To deal with operation granularity, SHC adopts the so-called cost-effectiveness principle when choosing among candidate models in Phase I. Let G be the current model and G' be a candidate model. Define the unit score improvement of G' over G given data D to be

$U(G', G \mid D) = \frac{\mathrm{BIC}(G' \mid D) - \mathrm{BIC}(G \mid D)}{d(G') - d(G)}$.   (5)

It is the increase in model score per unit increase in model complexity. The cost-effectiveness principle states that, among all candidate models, one chooses the one with the highest unit score improvement over the current model.

Phase I of SHC involves not only the NI and SI operators but also the NR operator. A technical difficulty might arise when computing the quantity U(G', G | D) for a candidate model generated by NR: the NR operator does not necessarily increase the number of model parameters, so the denominator d(G') - d(G) could be zero or negative. One way to overcome the difficulty is to replace the denominator with max{1, d(G') - d(G)}.
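In code, the guarded version of Eq. (5) is a one-line sketch (ours):

def unit_improvement(bic_new: float, bic_old: float,
                     dim_new: int, dim_old: int) -> float:
    """U(G', G | D) of Eq. (5), with the max{1, .} guard for NR candidates."""
    return (bic_new - bic_old) / max(1, dim_new - dim_old)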

    5.3. Heuristic single hill-climbing

SHC is more efficient than DHC, but it still does not scale up well. Here is the reason. At each search step, SHC generates a list of candidate models, evaluates each of them, and selects the best one. To evaluate a model, it first runs the EM algorithm to obtain the maximum likelihood estimates of the parameters and then computes the BIC score. EM is known to be computationally expensive. SHC runs EM on each candidate model and hence has high computational complexity.

One way to reduce the complexity of SHC is to apply the technique of structural EM [25]. The idea is to complete the data set D based on the current model and evaluate the candidate models using the completed data set Dc. Parameter optimization based on Dc does not require EM, and hence EM is avoided during model evaluation and selection. The HSHC (heuristic SHC) algorithm is the result of incorporating this strategy into SHC.
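The completion step can be illustrated on a latent class model. In the sketch below (ours; toy parameters), each record is extended with the posterior distribution of the latent variable given the record's manifest values; this is the information that the heuristics described next read off Dc:

import numpy as np

# Toy "current model": one latent X (2 states), two binary manifest variables.
p_x = np.array([0.6, 0.4])
p_y_given_x = [np.array([[0.9, 0.1], [0.2, 0.8]]),
               np.array([[0.7, 0.3], [0.1, 0.9]])]

def posterior_x(record):
    """P(X | Y = record) by Bayes' rule under the current model."""
    unnorm = p_x * np.prod(
        [cpt[:, yi] for cpt, yi in zip(p_y_given_x, record)], axis=0)
    return unnorm / unnorm.sum()

data = [(0, 0), (1, 1), (0, 1)]
completed = [(rec, posterior_x(rec)) for rec in data]  # the completed set Dc
for rec, post in completed:
    print(rec, post)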

There is an important issue that one needs to address when incorporating structural EM into SHC. Structural EM cannot be directly applied to the candidate models generated by the NI, SI, and SD operators. The latent nodes in those models are different from those in the current model. NI introduces a latent node that is not in the current model, while SI and SD alter the cardinality of one latent variable in the current model. Dc is complete with respect to the variables in the current model, but it is incomplete with respect to the variables in the candidate models generated by the NI, SI, and SD operators.

To solve the problem, HSHC divides the candidate models into groups, with one group for each operator. Models in a group are compared with each other using heuristics based on the completed data set Dc, and the best one is selected. Thereafter, a second model selection process is invoked to choose the overall best model among the best models of the groups. This second process is the same as the model selection process in SHC, except that there is only one candidate model for each operator. Here the model parameters are optimized using EM.

The heuristics for ranking candidate models are different for different operators. Here we explain the heuristic for NI. Let G be the current model and G' be a candidate model obtained from G by introducing a new latent node Z to mediate a latent node X and two of its neighbors Z1 and Z2. As explained in Section 5.1, the new node would be necessary if Z1 and Z2 are not independent of each other given X. This hypothesis can be tested based on the completed data Dc. The variables X, Z1 and Z2 are all observed in Dc. Hence, one can compute the G-squared statistic for testing the hypothesis that Z1 and Z2 are independent given X. The larger the statistic, the further away Z1 and Z2 are from being independent given X, and hence the more necessary it is to introduce a new node. Therefore, one can rank all the candidate models generated by NI using their G-squared statistics and pick the model with the largest value.
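The G-squared statistic for the NI heuristic can be computed directly from a three-way table of counts in Dc. A sketch (ours; the counts are toy numbers):

import numpy as np

def g_squared(counts):
    """G^2 for testing Z1 independent of Z2 given X, from counts[x, z1, z2]."""
    g = 0.0
    for x in range(counts.shape[0]):
        slab = counts[x]                  # observed table for X = x
        total = slab.sum()
        if total == 0:
            continue
        # expected counts under independence of Z1 and Z2 given X = x
        expected = np.outer(slab.sum(axis=1), slab.sum(axis=0)) / total
        nz = slab > 0                     # empty cells contribute 0
        g += 2.0 * np.sum(slab[nz] * np.log(slab[nz] / expected[nz]))
    return g

counts = np.array([[[30, 5], [4, 20]],    # X = 0
                   [[10, 8], [9, 12]]])   # X = 1
print(g_squared(counts))                  # larger => more need for a new node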

There is one more issue to clarify before the description of the HSHC algorithm is complete. For a given operator, instead of choosing the best model, one can choose the best K models, for some integer K, based on the heuristic. This top-K scheme reduces the chance of hitting local maxima. In general, the larger the K, the lower the probability of encountering local maxima. On the other hand, a larger K also implies running EM on more models and hence longer computation time.

For the sake of computational efficiency, HSHC replaces EM with so-called local EM in the top-K scheme. Let G' be a candidate model obtained from the current model G by applying one of the search operators. The parameters of G have been optimized, and G' differs from G in only one or two nodes. Local EM optimizes only the parameters pertaining to those one or two nodes in G' while freezing the values of all the other parameters. Obviously, model parameters obtained by local EM deviate from those obtained by EM. To avoid accumulation of deviations, HSHC runs EM once at the end of each search step on the best model.

    5.4. Empirical results with SHC and HSHC

How much more efficient is SHC than DHC? To answer this question, SHC was tested on one of the synthetic data sets mentioned in Section 5.1 [20]. SHC turned out to be 22 times faster than DHC, and it obtained the same model as DHC. No more experiments were run to compare DHC and SHC because DHC is extremely slow.

How much more efficient is HSHC than SHC? Can HSHC find high-quality models? To answer those questions, experiments were conducted on five synthetic data sets with 6, 9, 12, 15, and 18 manifest variables, respectively, each with 10,000 records [20]. For the top-K scheme in HSHC, three values were used for K, namely 1, 2, and 3. The experiments were run on a 2.26 GHz Pentium 4 machine, and the time limit was set at 100 h. The results indicate that HSHC is much more efficient than SHC and scales up much better. In particular, SHC was not able to finish analyzing the two data sets with 15 and 18 manifest variables. On the data set with 12 manifest variables, it was 10 times slower than HSHC when K was set at 3. When K = 1, the models found by HSHC for the data sets with 15 and 18 manifest variables are of poor quality. When K = 2 or 3, however, all the models reconstructed by HSHC match the generative models extremely well in terms of the joint distribution of the manifest variables. The structures of these models are either identical or very similar to the structures of the generative models.

How does HSHC perform on real-world data? Can it discover interesting latent structures and learn good probabilistic models? To answer those questions, the algorithm was tested on the CoIL Challenge 2000 data set [26]. This data set consists of 5,822 customer records of an insurance company. Each record contains socio-demographic information and information about insurance product ownership. There are 86 attributes. The task is to learn a model and use it to predict who would buy mobile home insurance policies.

Before the data set was fed to HSHC, it was first preprocessed; the resulting data set contains 42 manifest variables. Four different values were used for K in the top-K scheme, namely 1, 5, 10, and 20. The best model was found in the case of K = 10. The analysis took 121 h. The structure of the best model is shown in Fig. 5.

The latent structure discovered by HSHC is interesting. This can be appreciated from several angles. First of all, the data set contains two variables for each type of insurance. For example, the two variables for bicycle insurance are contribution to bicycle insurance policies (Y62) and number of bicycle insurance policies (Y83). HSHC introduced a latent variable for each such pair. The latent variable introduced for Y62 and Y83 is X11. Obviously, X11 can be interpreted as attitude toward bicycle risks. Similarly, X10 can be interpreted as attitude toward motorcycle risks, X9 as attitude toward moped risks, and so on.

Next consider the manifest variables below the latent variable X8. Besides social security, they are all related to private vehicles. HSHC concluded that people's decisions to buy insurance for private vehicles are influenced by one common latent variable. This is clearly reasonable, and X8 can be interpreted as attitude toward private vehicle risks. The variables about heavy vehicles such as car, mobile home, and boat are placed under X12. This is also reasonable, and X12 can be interpreted as attitude toward heavy private vehicle risks.

Similarly, we see that X1 corresponds to attitude toward firm risks and X15 to attitude toward agriculture risks. The two latent variables X3 and X6 capture attitudes toward risks in daily life; X21 summarizes the socio-demographic information contained in the manifest variables Y04, Y05 and Y43; and X0 can be interpreted as general attitude toward risks.

Figure 5 Latent tree model found by HSHC for the CoIL Challenge 2000 data. The number next to a latent variable is the cardinality of that variable.


It is particularly interesting to note that, although delivery vans (Y48 and Y69) and tractors (Y37 and Y52) are vehicles, HSHC did not conclude that the decisions to buy insurance for delivery vans and tractors are influenced by attitude toward vehicle risks (X8). Rather, it correctly concluded that the decision to buy insurance for delivery vans is influenced by attitude toward firm risks and the decision to buy insurance for tractors is influenced by attitude toward agriculture risks.

Is the latent tree model found by HSHC a good probabilistic model for the domain? The prediction task of CoIL Challenge 2000 provides one criterion for answering this question. In CoIL Challenge 2000, there is a test set of 4000 records that contains 238 mobile home policy owners. The prediction task requires the participants to identify a subset of 800 records that contains as many mobile home policy owners as possible. The random selection by the contest organizers resulted in 42 policy owners, while the best entry identified 121 policy owners. The selection based on the model found by HSHC includes 110 policy owners. This is good performance considering that HSHC aims at optimizing the BIC score rather than classification error.

For the purpose of building a probabilistic model for the domain, one can choose to learn a Bayesian network without latent variables. How would such a network compare with the latent tree model found by HSHC? To answer this question, Zhang and Kocka also analyzed the data set using the GES algorithm [27] for learning Bayesian networks. The structure of the resulting model is shown in Fig. 6. We see that the network structure is less interpretable than the structure of the latent tree model found by HSHC. When the model is used to guide the CoIL prediction task, only 83 policy owners are identified. This is significantly better than the random selection, but significantly worse than the selection based on the latent tree model.

Figure 6 Bayesian network model found by GES for the CoIL Challenge 2000 data set.

    6. Analysis of kidney deficiency data

We propose a new machine learning approach to the study of TCM syndrome differentiation. It is called the latent structure approach. The idea is to first collect patient data systematically and then perform multidimensional cluster analysis using latent structure models. The goal is to find natural clusters in data that correspond well to TCM syndrome types and hence can be used to establish objective and quantitative standards for syndrome differentiation. To investigate the feasibility of the latent structure approach, we have used latent tree models to study a subdomain of TCM diagnosis, namely KIDNEY DEFICIENCY. We report this case study in the rest of the paper. In this section, we describe the domain, the data set, and the data analysis process.

Diagnosis and treatment in TCM are based on several theories. One of the theories is called the ZANG-FU theory. It explains the physiological functions, pathological changes, and mutual relationships of the TCM vital organs, which include HEART, LUNG, KIDNEY, and so on. Although minor similarities exist, the vital organs of the body in TCM are different from the anatomical organs of western medicine. They are classified according to their functions and functional entities. The main physiological functions of KIDNEY, for instance, are: (1) storing YIN-ESSENCE; (2) controlling water metabolism; (3) receiving QI; (4) controlling human reproduction, growth and development; (5) producing marrow, controlling the bones, manufacturing blood and influencing hair luster; (6) opening into the ear; (7) controlling the urinary bladder. Most KIDNEY DISEASES are due to the lack of essence, a condition known as KIDNEY DEFICIENCY. KIDNEY DEFICIENCY includes several subtypes, namely KIDNEY YANG DEFICIENCY, KIDNEY YIN DEFICIENCY, KIDNEY ESSENCE INSUFFICIENCY, and so on [6]. Common symptoms of KIDNEY DEFICIENCY include lumbago, sore and weak lumbus and knees, tinnitus, deafness, loss of hair, impotence, irregular menstruation, edema, abnormal defecation and urination, and so on.

The first step in the latent structure approach is data collection, and the first step in data collection is to determine its coverage in terms of symptom variables. In consultation with the China national standards on clinic terminology of TCM syndromes [28] and some textbooks on TCM diagnosis [6,29], we selected 67 symptom variables for our study.



Those variables cover all aspects of the aforementioned physiological functions of KIDNEY. Each of the variables has four possible values, namely no, light, medium, and severe.

TCM diagnosis consists of two steps, patient information gathering and syndrome differentiation. Subjectivity is an issue in both steps. The focus of our research is on the second step. To minimize the influence of subjectivity in the first step, we adopted, during data collection, the operational standards for determining the severity levels of KIDNEY DEFICIENCY symptoms by Yan et al. [30].

A total of 2600 data cases were collected. The data set was collected from communities in several regions of China, and all the subjects were at or above the age of 60 years. The sample space was set this way because KIDNEY DEFICIENCY is more common among senior people. The entire data collection process was managed personally by the second co-author, and a variety of measures were taken to ensure data quality.

We attempted to analyze the whole KIDNEY data set using the HSHC algorithm. Despite being the most efficient algorithm for learning latent tree models,³ HSHC was not able to handle all 67 symptom variables. We were forced to reduce the number of variables, and only 35 variables were included in the analysis. This limits the applicability of the results in practice. However, it does not prevent us from drawing conclusions about the feasibility of the latent structure approach. The approach is feasible if (1) there exist natural clusters in data that correspond to TCM syndrome types, and (2) it can discover such clusters from data. We can demonstrate those two points using a data set of 35 variables as well as using a data set of 67 variables.

³ The situation is likely to change soon due to some ongoing work.

The analysis was conducted on a 2.4 GHz Pentium 4 machine, and it took 98.5 h to finish. The BIC score of the resultant model is −73,947. HSHC is a hill-climbing algorithm. Like most hill-climbing algorithms, it might get stuck at local maxima. To detect whether HSHC was trapped at a local maximum and, if this was the case, to help it escape from the local maximum, we modified the initial result based on domain knowledge and continued the search with HSHC from the modified models. This resulted in another model, denoted by M*, with BIC score −73,860. Further efforts to improve M* did not result in better models. We hence regard M* as the best model for the data set. The structure of M* is given in Fig. 7.

    7. Interpretation of latent variables

Model M* consists of 14 latent variables, X0 to X13. This means that our analysis has identified 14 latent factors from the KIDNEY data set. Each of the latent variables has a specific number of states. For example, the variable X1 has five states. This means that we have grouped the data set into five clusters according to that latent factor. The variable X13 has four states. This means that we have grouped the data set in another way, into four clusters. Thus, we have simultaneously clustered the data set in multiple ways.

Do the clusters correspond to TCM syndrome types? This question will be answered in the next section. As a preparatory step, we examine the meanings of the latent variables. In this process, we need to compare the structure of model M* with the TCM theory on KIDNEY, which is itself a latent structure model described in natural language. We rely on textbooks [6,29] for a description of the TCM theory.

Figure 7 The structure of the best model M* found for the KIDNEY data set. The symptom variables are at the leaf nodes and the latent variables are at the internal nodes. The numbers in parentheses are the numbers of states of the latent variables. The abbreviation HSFCV stands for hot sensation in the five centers with vexation, where the five centers refer to the centers of the two palms, the centers of the two feet, and the heart.


    7.1. Variable interpretation

We start from the lower left corner of M*. Here the model states that there is a latent variable X1 that: (1) directly influences the three symptoms intolerance to cold, cold limbs, and cold lumbus and back, and (2) through another latent variable X2, indirectly influences loose stool and indigested grain in stool. Do those match the TCM theory on KIDNEY? What are the meanings of the latent variables X1 and X2?

According to TCM theory, KIDNEY YANG is the basis of all YANG in the body. When KIDNEY YANG is in deficiency, it cannot warm the body and the patient feels cold, resulting in manifestations such as cold lumbus and knees, intolerance to cold, and cold limbs. Deficiency of KIDNEY YANG also leads to spleen disorders, resulting in symptoms such as loose stool and indigested grain in stool.

Both of the above two paragraphs speak of two latent factors and some symptoms. The relationships between the latent factors and the symptoms are almost identical in the two cases. Therefore, there is a good match between the lower left corner of M* and the relevant part of the TCM KIDNEY theory. The latent variable X1 can be interpreted as KIDNEY YANG DEFICIENCY, while X2 can be interpreted as SPLEEN DISORDERS DUE TO KYD, where KYD stands for KIDNEY YANG DEFICIENCY.

To the right of X1, model M* states that there is a latent variable X3 that influences the symptoms edema on legs and edema on face. On the other hand, TCM theory maintains that when KIDNEY YANG is in deficiency, water is out of control. It flows to the skin and brings about edema. We see a perfect match here. The latent variable X3 can be interpreted as EDEMA DUE TO KYD.

To the right of X3, model M* states that there is a latent variable X4 that: (1) directly influences the symptoms urine leakage after urination, frequent urination, and frequent nocturnal urination, and (2) through another latent variable X5, indirectly influences urinary incontinence (day) and urinary incontinence (night). On the other hand, TCM theory maintains that when KIDNEY fails to control the urinary bladder, one would observe clinical manifestations such as frequent urination, urine leakage after urination, frequent nocturnal urination, and in severe cases urinary incontinence. Once again, there is a fairly good match between this part of M* and the relevant part of the TCM KIDNEY theory. The latent variable X4 can be interpreted as KIDNEY FAILING TO CONTROL UB, where UB stands for the urinary bladder.

According to TCM theory, the clinical manifestations of KIDNEY ESSENCE INSUFFICIENCY include premature baldness, tinnitus, deafness, poor memory, trance, declination of intelligence, and fatigue and weakness. Those match the symptom variables under X8 fairly well, and hence X8 can be interpreted as KIDNEY ESSENCE INSUFFICIENCY. The clinical manifestations of KIDNEY YIN DEFICIENCY include dry throat, tidal fever or hectic fever, fidget, hot sensation in the five centers, and yellow urine. Those match the symptom variables under X10 fairly well, and hence X10 can be interpreted as KIDNEY YIN DEFICIENCY. Finally, TCM theory maintains that there are several subtypes of KIDNEY DEFICIENCY, namely KIDNEY YANG DEFICIENCY, KIDNEY ESSENCE INSUFFICIENCY, KIDNEY YIN DEFICIENCY, and so on. They share common symptoms such as lumbago, sore and weak lumbus and knees, and mental and physical fatigue. Moreover, KIDNEY DEFICIENCY can be caused by prolonged illness. Those facts and the topology of M suggest that X0 should be interpreted as KIDNEY DEFICIENCY.

Table 2 summarizes the meanings of the latent variables.

Table 2  Interpretations of some of the latent variables in M

X0: KD (KIDNEY DEFICIENCY)
X1: KYD (KIDNEY YANG DEFICIENCY)
X3: EKYD (EDEMA DUE TO KYD)
X4: KFCUB (KIDNEY FAILING TO CONTROL UB)
X8: KEI (KIDNEY ESSENCE INSUFFICIENCY)
X10: KID (KIDNEY YIN DEFICIENCY)

    7.2. Remarks

All symptom variables in our case study are those that a TCM doctor would consider when making diagnostic decisions about KIDNEY DEFICIENCY. It is hence no surprise that one of the latent variables in M can be interpreted as KIDNEY DEFICIENCY. However, it is very interesting that some of the latent variables correspond to syndrome factors such as KIDNEY YANG/YIN DEFICIENCY, as each of them is associated with only a subset of the symptom variables. Take KIDNEY YANG DEFICIENCY as an example. TCM claims that it can cause symptoms such as intolerance to cold, cold limbs, and cold lumbus and back. On the other hand, the result of our analysis states that, judging from the data, there should be a latent factor that directly influences those symptom variables. In this sense, our analysis has validated the TCM claim. This is important because TCM theories are sometimes discarded as being unscientific and even groundless. Our work has shown that there are scientific (more precisely, statistically valid) contents in TCM theories.

It should be noted that the scope of our case study is KIDNEY DEFICIENCY, which is determined by the selection of symptom variables. Hence, the results should be helpful in determining whether a patient is suffering from KIDNEY DEFICIENCY and, if so, from which subtype. Table 2 shows that model M is indeed useful in this respect. What should we do if we want to determine whether a disease lies with KIDNEY, LIVER, or other vital organs? The answer is to build another model, with symptom variables that can help this higher-level differentiation task.

It should also be pointed out that there are some aspects of M that do not match TCM theory well. As one example, consider the symptoms tinnitus and poor memory. According to TCM theory, they can be caused both by KIDNEY YIN DEFICIENCY and by KIDNEY ESSENCE INSUFFICIENCY. In M, however, they are directly connected only to X8, not to X10. This mismatch is due to the restriction that in latent tree models a manifest variable can be connected to only one latent variable. Another mismatch is in the scope: M involves fewer symptom variables than the TCM theory on KIDNEY. This is due to two reasons. First, we were not able to collect data about human reproduction because our subjects did not feel comfortable talking about it. Second, some symptom variables were excluded from data analysis for the sake of computational efficiency. The consequence is twofold. First, many symptoms, especially those about human reproduction, are absent from the discussions above; they are mentioned in TCM theory but do not appear in the model M. Second, two other common subtypes of KIDNEY DEFICIENCY, namely UNCONSOLIDATION OF KIDNEY QI and FAILURE OF KIDNEY TO RECEIVE QI, are absent from M.

Finally, we would like to make a remark about edge orientations in latent tree models. As explained in Section 3, edge orientations cannot be determined during data analysis. However, they sometimes can be determined during model interpretation. As an example, consider the edges between the latent variable X1 and the symptom variables Y2, Y3, and Y4. X1 is interpreted as KIDNEY YANG DEFICIENCY, which according to TCM theory causes intolerance to cold, cold limbs, and cold lumbus and back. Therefore, the edges can be oriented so that they point from X1 to the symptom variables. As a matter of fact, all edges in M can be oriented, and they all point downward. There is only one exception, namely the edge between X0 and Y17.

    8. Interpretation of latent clusters

Each of the latent variables in model M defines a number of clusters. In this section, we examine the meanings of some of the clusters and check to see whether they correspond to TCM syndrome types. We will focus primarily on X1 and X4.

The latent variable X1 was interpreted as KYD (KIDNEY YANG DEFICIENCY), and it has five states. This means that the data has been grouped into five clusters according to KIDNEY YANG DEFICIENCY. Fig. 8 shows the probability distributions of the five symptom variables Y0 to Y4 in each of the clusters. There is one bar diagram for each cluster. In each diagram there are five bars, each corresponding to one of the five symptom variables. The bars consist of up to four segments, each corresponding to one state of the symptom variables. The clusters were ordered and named after an initial visual inspection of those bar diagrams.

Figure 8  Characteristics of the five clusters identified by the states of the latent variable X1: the diagrams show the probability distribution of the symptom variables Y0, Y1, Y2, Y3, and Y4 in each of the clusters. The table shows the most typical members of the clusters, together with the probability distribution of X1 for those members. In the table, symptom severity levels are indicated using integers: 0 = no; 1 = light; 2 = medium; 3 = severe.
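The segments in those bar diagrams are the conditional distributions P(Yi | X1 = s). For a symptom variable attached directly to X1, such a distribution is simply one row of the corresponding conditional probability table of the model. A minimal sketch with invented numbers, not the actual parameters of M:

```python
# Placeholder conditional probability table for one symptom variable, say
# Y2 (intolerance to cold): rows are the five clusters s0..s4, columns the
# severity levels 0-no, 1-light, 2-medium, 3-severe. Numbers are invented.
import numpy as np

p_y2_given_x1 = np.array([
    [0.95, 0.03, 0.01, 0.01],   # s0: symptom almost never occurs
    [0.40, 0.45, 0.10, 0.05],   # s1: mostly light
    [0.35, 0.30, 0.20, 0.15],   # s2: more medium/severe mass than s1
    [0.20, 0.30, 0.40, 0.10],   # s3: mostly medium
    [0.05, 0.10, 0.25, 0.60],   # s4: mostly severe
])

# Each row is one stacked bar of the Fig. 8 diagram for Y2.
for s, row in enumerate(p_y2_given_x1):
    print(f"cluster s{s}: " + " ".join(f"{p:.2f}" for p in row))
```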

The probability bar diagrams provide an overall characterization of the clusters, but they do not provide any information about individual members of the clusters. To compensate for this lack of specificity, we also provide information about the most typical members of the clusters. The most typical member of the cluster X1 = s0, for instance, is the configuration of the states of the variables Y0, Y1, Y2, Y3, and Y4 that maximizes the probability P(X1 = s0 | Y0, Y1, Y2, Y3, Y4). This configuration might not be unique. Nonetheless, we will use the term "the most typical member" for simplicity.
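As a concrete illustration, the sketch below finds most typical members by brute-force enumeration. It mimics the shape of the X1 example (five cluster states, five symptom variables with severity levels 0 to 3) but attaches all five symptoms directly to X1 and uses randomly generated placeholder parameters; the real computation would use the parameters of M and would also sum out X2.

```python
# Sketch: the "most typical member" of each latent cluster by enumeration.
# Placeholder parameters only; in the paper these would come from model M.
import itertools
import numpy as np

n_states, n_symptoms, n_levels = 5, 5, 4   # states of X1; Y0..Y4; severity 0-3
rng = np.random.default_rng(0)
prior = rng.dirichlet(np.ones(n_states))                         # P(X1)
cpts = rng.dirichlet(np.ones(n_levels), (n_symptoms, n_states))
# cpts[i, s] plays the role of P(Y_i | X1 = s)

def joint(s, y):
    """P(X1 = s, Y = y): the prior times the product of the symptom CPTs."""
    p = prior[s]
    for i, yi in enumerate(y):
        p *= cpts[i, s, yi]
    return p

def most_typical_member(s):
    """The symptom configuration y that maximizes P(X1 = s | y)."""
    def posterior(y):
        return joint(s, y) / sum(joint(t, y) for t in range(n_states))
    return max(itertools.product(range(n_levels), repeat=n_symptoms),
               key=posterior)

for s in range(n_states):
    print(f"most typical member of cluster X1 = s{s}: {most_typical_member(s)}")
```

Note that maximizing the posterior P(X1 = s | y) is not the same as maximizing the joint P(X1 = s, y), since the evidence term P(y) varies with y; this is why the normalization inside posterior is needed.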

We can now digest the meanings of the clusters defined by X1. Five symptoms are involved here, namely LS (loose stool), IGS (indigested grain in stool), IC (intolerance to cold), CL (cold limbs), and CLB (cold lumbus and back). In the cluster X1 = s0, the five symptoms almost never occur. The most typical member of X1 = s0 has none of those symptoms. Hence, the cluster means no KYD.

Next, consider the clusters X1 = s1 and X1 = s2. In both clusters, the five symptoms have substantial probability of occurring. The overall probability of the symptoms occurring is higher in X1 = s1, while the probabilities of the symptoms occurring at the medium or severe levels are higher in X1 = s2. For the most typical member of X1 = s1, three of the symptoms, namely LS, IC and CLB, occur only at the light level, while for the most typical member of X1 = s2, only one of the symptoms, namely IC, occurs at the severe level. Therefore, both clusters can be interpreted as light KYD. To differentiate their characteristics, we label X1 = s1 as light KYD (1) and X1 = s2 as light KYD (2).

In the clusters X1 = s3 and X1 = s4, the overall probability of the five symptoms occurring is high, while the probabilities of their occurring at the medium and severe levels in X1 = s4 are much greater than those in X1 = s3. For the most typical member of X1 = s3, three of the symptoms occur at the light level and one occurs at the medium level. For the most typical member of X1 = s4, all five symptoms occur at the severe level. Hence, X1 = s3 can be interpreted as medium KYD and X1 = s4 can be interpreted as severe KYD.

There is one interesting observation that we would like to make about the class probability distributions. In the last four of the clusters, the probabilities of the symptoms LS (Y0) and IGS (Y1) occurring are much lower than those of the other three symptoms. This is consistent with TCM theory. According to the latter, KYD does not affect other vital organs at the benign stage. As the severity increases, however, it might affect other vital organs. In particular, it might lead to spleen disorders, resulting in the symptoms LS and IGS. Hence, the symptoms IC, CL and CLB are more common than LS and IGS: IC, CL and CLB can occur as soon as KYD is present, while LS and IGS occur only when KYD becomes relatively more severe. Our analysis has validated this piece of TCM theory and has provided a quantification for it.

We next consider the four clusters given by the latent variable X4. The probability distributions of the symptom variables Y7, Y8, Y9, Y10, and Y11 in each of the clusters are given in Fig. 9. The most typical members of the clusters are also shown. By using arguments similar to those used in the case of X1, one can conclude that the clusters X4 = s0, X4 = s1, X4 = s2 and X4 = s3 can respectively be interpreted as no KFCUB (KIDNEY FAILING TO CONTROL UB), light KFCUB (1), light KFCUB (2), and severe KFCUB.

As mentioned in the previous section, TCM theory maintains that when KIDNEY fails to control the urinary bladder, one would observe clinical manifestations such as FU (frequent urination), ULU (urine leakage after urination), FNU (frequent nocturnal urination), and in severe cases UID (urinary incontinence (day)) and UIN (urinary incontinence (night)). Therefore, UID and UIN should occur less frequently than the other three symptoms. This is consistent with the bar diagrams in Fig. 9. We see that the probabilities of UID (Y9) and UIN (Y10) occurring are always significantly lower than those of the other three symptoms. Once again, our analysis has validated a piece of TCM theory and has provided a quantification for it.

Figure 9  Characteristics of the four clusters identified by the states of the latent variable X4: the diagrams show the probability distributions of the symptom variables Y7, Y8, Y9, Y10, and Y11 in each of the clusters. The table shows the most typical members of the clusters. In the table, symptom severity levels are indicated using integers: 0 = no; 1 = light; 2 = medium; 3 = severe.

We have also examined the clusters identified by X3, X8, and X10. The interpretations are given in Table 3.

Table 3  Interpretations of some of the latent clusters in M

        s0         s1               s2               s3              s4
X1      No KYD     Light KYD (1)    Light KYD (2)    Medium KYD      Severe KYD
X3      No EKYD    Light EKYD (1)   Light EKYD (2)   Severe EKYD     -
X4      No KFCUB   Light KFCUB (1)  Light KFCUB (2)  Severe KFCUB    -
X8      No KEI     Light KEI (1)    Light KEI (2)    Severe KEI      -
X10     No KID     Light KID        Medium KID (1)   Medium KID (2)  Severe KID

Refer to Table 2 for the meanings of the abbreviations.

    9. Conclusions and future directions

TCM diagnosis consists of two steps: patient information gathering and syndrome differentiation. We propose a novel approach, namely the latent structure approach, to the study of syndrome differentiation. The long-term objective is to establish objective and quantitative standards for syndrome differentiation. The idea is to systematically collect patient data, perform multidimensional cluster analysis, and use the resultant clusters as the basis for syndrome differentiation.

To demonstrate the feasibility of the new approach, we have studied a special class of latent structure models called latent tree models. The study of latent tree models has spanned several years. In this paper, we have provided a detailed survey of the work so far; it is intended to serve as a good entry point for those who want to use or further develop the methodology.

We have collected a data set about KIDNEY DEFICIENCY and have analyzed it using latent tree models. The resulting model matches the relevant TCM theory well: the latent variables correspond to syndrome factors, and the latent clusters correspond to syndrome types. We have recently analyzed several other data sets from joint projects with Beijing University of Chinese Medicine and the China Academy of Chinese Medicine. Equally good results were obtained; they will be reported in forthcoming papers. Consequently, we have shown that the latent structure approach is indeed feasible.

For future work, there is obviously a need to develop more efficient algorithms for learning latent tree models. It would also be interesting to study latent structures beyond trees. On the TCM front, the immediate next step is to investigate how the results produced by the latent structure approach can actually be used to improve TCM diagnosis and treatment. One area to start with is the integration of TCM and Western medicine. The idea here is to divide all patients suffering from a Western medicine disease, e.g. irritable bowel syndrome, into several groups from the TCM perspective and to apply different treatments to different groups. Doctors in China have been practicing this for decades. The latent structure approach can be applied to establish objective and quantitative standards for the grouping. TCM doctors and Western medicine doctors can then come together, study the groups, and design treatments accordingly.

    Acknowledgement

We thank Yang Weiyi and Wang Miqu for valuable discussions during the inception of this work in 2000. We also thank Yan Shilin, Wang Tianfang and many other TCM experts for helping with data collection in 2000 and model interpretation in 2004. Tomas Kocka, Finn Jensen, and Thomas Nielsen collaborated with us on developing algorithms for learning latent tree models and related models. We are grateful to Lai Shilong, Liu Baoyan, Wang Qingguo, Wang Jie, Wang Tianfang, and the anonymous reviewers for feedback on earlier versions of the paper. Research on this work was supported by Hong Kong Research Grants Council Grants #622105 and #622307, and the National Basic Research Program (aka the 973 Program) under project no. 2003CB517106.

    References

[1] Sung JJY, Leung WK, Ching JYL, Lao L, Zhang G, Wu JCY, et al. Agreements among traditional Chinese medicine practitioners in the diagnosis and treatment of irritable bowel syndrome. Alimen Pharmacol Therapeut 2004;20:1205-10.

[2] World Health Organization. WHO traditional medicine strategy 2002-2005. http://www.who.int/medicines/publications/traditionalpolicy/en/index.html [accessed 14.10.07].


[3] Wang HX, Xu YL. The current state and future of basic theoretical research on traditional Chinese medicine. Beijing: Military Medical Sciences Press; 1999.

[4] Feng Y, Wu Z, Zhou X, Zhou Z, Fan W. Knowledge discovery in traditional Chinese medicine: state of the art and perspectives. Artif Intell Med 2006;38:219-36.

[5] Liang MX, Liu J, Hong ZP, Xu YY. Perplexity of TCM syndrome research and countermeasures. Beijing: People's Health Press; 1998.

[6] Yang W, Meng F, Jiang Y. Diagnostics of traditional Chinese medicine. Beijing: Academy Press; 1998.

[7] Lazarsfeld PF, Henry NW. Latent structure analysis. Boston: Houghton Mifflin; 1968.

[8] Bartholomew DJ, Knott M. Latent variable models and factor analysis. 2nd ed. Kendall's Library of Statistics 7. London: Arnold; 1999.

[9] Vermunt JK, Magidson J. Latent class cluster analysis. In: Hagenaars JA, McCutcheon AL, editors. Advances in latent class analysis. Cambridge: Cambridge University Press; 2002. p. 89-106.

[10] Uebersax J. A practical guide to local dependence in latent class models. http://ourworld.compuserve.com/homepages/jsuebersax/condep.htm [accessed 14.10.07].

[11] Hagenaars JA. Latent structure models with direct effects between indicators: local dependence models. Sociol Meth Res 1988;16:379-405.

[12] Goodman LA. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 1974;61:215-31.

[13] Zhang NL. Hierarchical latent class models for cluster analysis. In: Dechter R, Kearns M, Sutton R, editors. Proceedings of the 18th National Conference on Artificial Intelligence (AAAI-02); 2002. p. 230-7.

[14] Zhang NL. Hierarchical latent class models for cluster analysis. J Machine Learn Res 2004;5(6):697-723.

[15] Pearl J. Probabilistic reasoning in intelligent systems: networks of plausible inference. Palo Alto: Morgan Kaufmann; 1988.

[16] Durbin R, Eddy S, Krogh A, Mitchison G. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press; 1998.

[17] Green P. Penalized likelihood. In: Encyclopedia of Statistical Sciences, Update Volume 3. John Wiley & Sons; 1999. p. 578-86.

[18] Geiger D, Heckerman D, Meek C. Asymptotic model selection for directed networks with hidden variables. In: Horvitz E, Jensen FV, editors. Proceedings of the 12th Annual Conference on Uncertainty in Artificial Intelligence (UAI-96); 1996. p. 158-68.

[19] Kocka T, Zhang NL. Dimension correction for hierarchical latent class models. In: Breese J, Koller D, editors. Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI-02). Morgan Kaufmann; 2002. p. 267-74.

[20] Zhang NL, Kocka T. Efficient learning of hierarchical latent class models. In: Khoshgoftaar TM, editor. Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI-2004); 2004. p. 585-93.

[21] Schwarz G. Estimating the dimension of a model. Ann Stat 1978;6(2):461-4.

[22] Zhang NL, Kocka T. Effective dimensions of hierarchical latent class models. J Artif Intell Res 2004;21:1-17.

[23] Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc B 1977;39:1-38.

[24] Lauritzen SL. The EM algorithm for graphical association models with missing data. Comput Stat Data Anal 1995;19:191-201.

[25] Friedman N. Learning belief networks in the presence of missing values and hidden variables. In: Fisher DH, editor. Proceedings of the 14th International Conference on Machine Learning (ICML 1997); 1997. p. 125-33.

[26] van der Putten P, van Someren M, editors. CoIL Challenge 2000: the insurance company case. Amsterdam: Sentient Machine Research; 2002.

[27] Chickering DM. Learning equivalence classes of Bayesian-network structures. J Machine Learn Res 2002;2:445-98.

[28] China State Bureau of Technical Supervision. National standards on clinic terminology of traditional Chinese medical diagnosis and treatment - syndromes, GB/T 16751.2-1997. Beijing: China Standards Press; 1997.

[29] Zhu WF. Diagnostics of traditional Chinese medicine. Shanghai: Shanghai Science Press; 1995.

[30] Yan SL, Zhang LW, Wang MH, Yuan SH. Operational standards for determining the severity levels of kidney deficiency symptoms. J Chengdu Univ Chin Med 2001;24(1):56-9.
