Graphical Models: An Extension to Random Graphs, Trees, and Other Objects
Neil Hallonquist
Johns Hopkins University
neil.hallonquist@yahoo.com

Contents

1 Introduction
  1.1 Random Graphs
    1.1.1 Literature
    1.1.2 Other Literature
    1.1.3 Issues
    1.1.4 Structure
  1.2 Random Trees
    1.2.1 Literature
  1.3 Outline

2 Random Graphs
  2.1 Graphs
  2.2 Marginal Random Graphs
  2.3 Independence
  2.4 Bayesian Networks
    2.4.1 Structure Graphs
    2.4.2 Atomic Variables
  2.5 Gibbs Distribution
  2.6 Markov Random Fields
    2.6.1 Cliques
    2.6.2 Markovity
  2.7 Partially Directed Models
  2.8 Discussion
    2.8.1 Redundant Representations
    2.8.2 Graph Variations
    2.8.3 Projections

3 Random Trees
  3.1 Branching Models
    3.1.1 Trees
    3.1.2 Basic Models
    3.1.3 Substructures
    3.1.4 Attributed Trees
  3.2 Merging Models
  3.3 General Models

4 General Random Objects
  4.1 Projection Families
  4.2 Substructures
  4.3 Compositional Systems

5 Examples
  5.1 Compact Distributions
  5.2 Additional Structure
  5.3 Graph Isomorphisms
  5.4 Attributed Graph Isomorphisms
  5.5 Master Interaction Function
  5.6 Examples
    5.6.1 Example 1: Grid Graphs
    5.6.2 Example 2: 'Molecule' Graphs
    5.6.3 Example 3: Mouse Visual Cortex
    5.6.4 Example 4: Chemistry Data
    5.6.5 Example 5: Vertices with Color and Location
  5.7 Inference and Learning
    5.7.1 Sampling
    5.7.2 Computation
    5.7.3 Learning

6 Summary and Discussion
  6.1 Extended Discussion
    6.1.1 Framework Merit
    6.1.2 Random Vertices
    6.1.3 Consistent Distributions
    6.1.4 Degeneracy

Acknowledgements

Appendices

A Statistical Invariances
  A.1 Invariant Transformations
  A.2 Invariant Functions
    A.2.1 Invariant Probability Mass Functions
  A.3 Conditional Invariance
    A.3.1 General Invariance
  A.4 An Alternative Formulation
  A.5 Moment Invariance
  A.6 Conditional Moment Invariance

References
Abstract

In this work, we consider an extension of graphical models to random graphs, trees, and other objects. To do this, many fundamental concepts for multivariate random variables (e.g., marginal variables, the Gibbs distribution, Markov properties) must be extended to other mathematical objects; it turns out that this extension is possible, as we will discuss, if we have a consistent, complete system of projections on a given object. Each projection defines a marginal random variable, allowing one to specify independence assumptions between them. Furthermore, these independencies can be specified in terms of a small subset of these marginal variables (which we call the atomic variables), allowing the compact representation of independencies by a directed graph. Projections also define factors, functions on the projected object space, and hence a projection family defines a set of possible factorizations for a distribution; these can be compactly represented by an undirected graph.

The invariances used in graphical models are essential for learning distributions, not just on multivariate random variables, but also on other objects. When they are applied to random graphs and random trees, the result is a general class of models that is applicable to a broad range of problems, including those in which the graphs and trees have complicated edge structures. These models need not be conditioned on a fixed number of vertices, as is often the case in the literature for random graphs, and can be used for problems in which attributes are associated with vertices and edges. For graphs, applications include the modeling of molecules, neural networks, and relational real-world scenes; for trees, applications include the modeling of infectious diseases and their spread, cell fusion, the structure of language, and the structure of objects in visual scenes. Many classic models can be seen to be particular instances of this framework.

1 Introduction

In problems involving the statistical modeling of a collection of random variables (i.e., a multivariate random variable), the use of invariance assumptions is often critical for practical learning and inference. A graphical model is a framework for such problems based on conditional independence, a fundamental invariance for these variables; this framework has found widespread use because independence occurs naturally in many problems, and is often specifiable by practitioners. Furthermore, independence assumptions can be made at varying degrees (for many invariances, this is not the case), thus creating a range of model complexities, and allowing practitioners to adjust models to a given problem.
In this work, we consider an extension of graphical models from multivariate random variables to other random objects such as random graphs and trees. To do this, core concepts from graphical models must be abstracted, forming a more general formulation; in this formulation, graphical models can be applied to any object that has, loosely speaking, a structure allowing a hierarchical family of projections on it. Each projection in this family defines a marginal random variable, allowing one to specify independence assumptions between them, and further, allowing a graph to represent these independencies (where vertices correspond to atomic variables). This projection family also defines, for distributions, a family of factors, allowing one to specify general factorizations, and further, to represent them compactly by a graph. A projection family must satisfy certain basic properties in order for the corresponding variables to be consistent with each other.

In the first part of this work, we examine models for random graphs, the problem that originally motivated this investigation. Applying graphical models to them results in a general framework, applicable to problems in which graphs have complicated edge structures. These models need not be conditioned on a fixed number of vertices, as is often the case in the literature, and can be used for problems in which graphs have attributes associated with their vertices and edges. The focus of this work is on problems in which the number of vertices can vary. Some examples of graphs to which these models are applicable are shown in Figures 1.1 and 1.2. This work makes no contribution to the traditional setting of random graphs in which the vertex set is fixed; the formulation presented here is unnecessary in that setting.

After investigating graphical models for graphs, we consider their application to trees, a special type of graph used in many real-world problems. As with graphs, this results in models applicable to a broad range of problems, including those in which trees have complex structures and attributes. In the approach taken in most of the literature, probabilities are placed on trees based on how a tree is incrementally constructed (e.g., from a branching process or grammar). Using graphical models, this approach may be extended, allowing distributions to be defined based on how trees are deconstructed into parts. The benefit of this graphical model approach is that one can make well-defined distributions that have complex dependencies; in contrast, it is often intractable to define distributions over, for example, context-sensitive grammars.

In the last part of this work, we define some consistency and completeness conditions for projection families. These conditions on projections ensure the consistency of their corresponding random variables (i.e., they form a family of marginal variables), which in turn allows graphical models to be directly defined in terms of projection families. In this formulation, graphical models may be loosely thought of as a modeling framework based on independence assumptions between the parts of an object, given that the object is compositional. An object is compositional if: (a) it is composed of parts, which in turn are themselves composed of parts, etc.; and (b) a part can be a member of multiple larger parts.
Objects such as vectors, graphs, and trees are compositional; in more applied settings, objects such as words and sentences, people, and real-world scenes are compositional as well. Graphical models are naturally suited to the modeling of these objects.

1.1 Random Graphs

A graph is a mathematical object that is able to encode relational information, and can be used to represent many entities in the world such as molecules, neural networks, and real-world scenes. An (undirected) graph is composed of a finite set of objects called vertices, and for each pair of vertices, specifies a binary value. If this binary value is positive, there is said to be an edge between that pair of vertices. In most applications, graphs have attributes associated with their vertices and edges; we will refer to attributed graphs simply as graphs in this work. (We make more formal definitions in Section 2.) A random graph is a random variable that maps into a set of graphs. In this section, we give a brief overview of random graph models in the literature, and discuss some of their shortcomings, motivating our work.

1.1.1 Literature

The most commonly studied random graph model is the Erdős-Rényi model ([Erdős and Rényi, 1959], [Gilbert, 1959]). This is a model for conditional distributions in which, for a given set of vertices, a distribution is placed over the possible edges. It makes the invariance assumption that, for any two vertices, the probability of an edge between them is independent of the other edges in the graph, and further, that this probability is the same for all edges. This classic model, due to its simplicity, is conducive to mathematical analysis; its asymptotic behavior (i.e., its behavior as the number of vertices becomes large) has been researched extensively ([Bollobás, 1998], [Janson et al., 2011]).

There are many ways in which the Erdős-Rényi model can be extended. One such extension is the stochastic blockmodel [Holland et al., 1983]. This model is for conditional distributions over the edges, given vertices, where each vertex has a label (e.g., a color) associated with it. Similar to the Erdős-Rényi model, for any two vertices, the probability of an edge between them is independent of the other edges in the graph; unlike the Erdős-Rényi model, this probability depends on the labels of those two vertices.

An extension of the stochastic blockmodel is the mixed membership stochastic blockmodel [Airoldi et al., 2009]. In this model, instead of associating each vertex with a fixed label, each vertex is associated with a probability vector over the possible labels. Given a set of vertices (and their label probability vectors), a set of edges can be sampled as follows: for each pair of vertices, first sample their respective labels, then sample from a Bernoulli distribution that depends on these labels.

Another extension of the stochastic blockmodel is the latent space model [Hoff et al., 2002], where instead of associating vertices with labels from a finite set, they are instead associated with positions in a Euclidean space; given the positions of two vertices, the probability of an edge between them depends only on their distance. A general class of random graph models, within which the above models fall, is the exponential family ([Holland and Leinhardt, 1981], [Robins, 2011], [Snijders et al., 2006]).
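To make the conditional-edge structure of the two simplest models above concrete, here is a minimal sampling sketch. It is not from the original text; the function names are ours, and the label-pair probabilities `p` are an assumed input (taken to be symmetric).

```python
import random

def sample_erdos_renyi_edges(vertices, p, rng=random.Random(0)):
    """Erdos-Renyi: given the vertices, each unordered pair receives an
    edge independently, with the same probability p for every pair."""
    return {frozenset((u, v)): int(rng.random() < p)
            for i, u in enumerate(vertices) for v in vertices[i + 1:]}

def sample_blockmodel_edges(labels, p, rng=random.Random(0)):
    """Stochastic blockmodel: `labels` maps each vertex to its label, and
    the edge probability p[a][b] depends only on the endpoint labels
    (p is assumed symmetric: p[a][b] == p[b][a])."""
    items = list(labels.items())
    return {frozenset((u, v)): int(rng.random() < p[a][b])
            for i, (u, a) in enumerate(items) for (v, b) in items[i + 1:]}
```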
A well-known example is the Frank and Strauss model [Frank and Strauss, 1986], also a model for conditional distributions, specifying the probability of having some set of edges, given vertices. Since the randomness is only over the edges, a graphical model can be applied in which there is a random variable for each pair of vertices, specifying the presence or absence of an edge. In this model, these random variables are conditionally independent if they do not share a common vertex.

1.1.2 Other Literature

In this section, we review models from outside the mainstream random graph community that were designed for graphs that vary in size and have complicated attributes. One of the first such models was developed by Ulf Grenander under the name pattern theory ([Grenander and Miller, 2007], [Grenander, 1997], [Grenander, 2012]). This work was motivated by the desire to formalize the concept of a pattern within a mathematical framework. A large collection of natural and man-made patterns is shown in [Grenander, 1996]. Examples range from textures to leaf shapes to human language. In each of these examples, every particular instance of the given pattern can be represented by a graph. These instances have natural variations, and so the mathematical framework for describing these patterns is probabilistic, i.e., a random graph model. The model developed was based on applying Markov random fields to graphs. Learning and inference are often difficult in this model, limiting its practical use.

Later, random graph models were developed within the field of relational statistical learning. In particular, techniques such as Probabilistic Relational Models [Getoor et al., 2001], Relational Markov Networks [Taskar et al., 2002], and Probabilistic Entity-Relationship Models [Heckerman et al., 2007] were specifically designed for modeling entities that are representable as graphs. These models specify conditional distributions, applying graphical models in which: (1) for each vertex, there is a random variable representing its attributes; and (2) for each pair of vertices, there is a random variable representing their edge attributes. (This is an approach similar to the one taken in the Frank and Strauss model.)

1.1.3 Issues

Suppose we want to learn a distribution over some graph space. This distribution cannot be directly modeled with graphical models because these were designed for multivariate random variables (with a fixed number of components). To avoid this issue, most random graph models in the literature transform the problem into one in which graphical models can be applied. This is done by only modeling a selected set of conditional distributions, for example, the set of distributions in which each is conditioned on some number of vertices. Aside from the fact that many applications simply require full distributions, problems with this approach include: (1) there are complicated consistency issues; a distribution may not exist that could produce a given set of conditional distributions; and (2) this partial modeling, loosely speaking, cannot capture important structures in distributions (e.g., there may be invariances within a full distribution that are difficult to encode within conditional distributions). To correct these issues, graphical models (for multivariate random variables) cannot be used for this problem; we need statistical models specifically designed for general graph spaces.
Suppose we have a graph space $\mathcal{G}$ in which graphs may differ in their order (i.e., graphs in this space may vary in their number of vertices); in this work, we want to develop distributions over this type of space.

Figure 1.1: Examples of molecule graphs in the MUTAG dataset [Shervashidze et al., 2011].

Figure 1.2: A mouse visual cortex [Bock et al., 2011].

In addition, we want models that are applicable to problems in which: (a) graphs have complex edge structures; and (b) graphs have attributes associated with their vertices and edges. To handle these problems, expressive models are necessary (i.e., models containing a large set of distributions). To make learning feasible in these models, it becomes imperative to specify structure in them as well.

1.1.4 Structure

To specify structure in random graph distributions, we look to the standard methods used for multivariate random variables for insight. Suppose we have a random variable $X$ taking values in $\mathcal{X} = \{0,1\}^n$. In general, its distribution has $2^n - 1$ parameters that need to be specified. If the value of $n$ is not small, learning this number of parameters is infeasible in most real-world problems; hence the need to control complexity. This has led to the widespread use of graphical models ([Lauritzen, 1996], [Jordan, 2004], [Wainwright and Jordan, 2008]), a framework that uses factorization to simplify distributions. In this framework, joint distributions are specified by simpler functions; more specifically, the probability of any $X \in \mathcal{X}$ is uniquely determined by a product of functions over subsets of its components.

Now, suppose we have a random graph $G$ taking values in some finite graph space $\mathcal{G}$. In general, its distribution has $|\mathcal{G}| - 1$ parameters that need to be specified, and again, there is clearly a need to control complexity. Similar to the above graphical models, we can simplify distributions through the use of factorization: the probability of any graph $G \in \mathcal{G}$ can be uniquely determined by a product of simpler functions over its subgraphs. Thus, we can create a general framework for random graphs analogous to that of graphical models. Indeed, just as graphs can be used to represent the factorization in graphical models, graphs can also be used to represent the factorizations in random graphs. These ideas are explored in Section 2.

1.2 Random Trees

A tree is a special type of graph used in many real-world problems. Like graph models, random tree models range from simplistic ones, amenable to asymptotic analysis, to complex ones, more suited to problem domains with smaller, finite trees. We now briefly review models in the literature.

1.2.1 Literature

A classic random tree model is the Galton-Watson model ([Watson and Galton, 1875], [Le Gall et al., 2005], [Drmota, 2009], [Haas et al., 2012]), in which trees are incrementally constructed: beginning with the root vertex, the number of its children is sampled according to some distribution; for each of these vertices, the number of its children is sampled according to the same distribution, and so on. The literature on these models is vast, most of it focusing on the probability of extinction and behavior in the limit. These models are used, for example, in the study of extinction [Haccou et al., 2005] and the spread of infectious diseases [Britton, 2010].
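As an illustration of this incremental construction, the following sketch (our own, not from the text; the depth cutoff is an added safeguard, since a Galton-Watson tree may otherwise grow without bound) samples a tree by repeatedly drawing child counts:

```python
import random

def sample_galton_watson(offspring_dist, max_depth=20, rng=random.Random(0)):
    """Sample a Galton-Watson tree: starting from the root, each vertex
    independently draws its number of children from `offspring_dist`,
    a dict mapping child counts to probabilities.  Vertices are encoded
    as tuples of child indices, with () denoting the root."""
    counts, weights = zip(*offspring_dist.items())
    tree, frontier = [()], [()]
    while frontier:
        v = frontier.pop()
        if len(v) >= max_depth:      # truncate: the process may not go extinct
            continue
        k = rng.choices(counts, weights=weights)[0]
        children = [v + (i,) for i in range(k)]
        tree.extend(children)
        frontier.extend(children)
    return tree

# critical case (mean offspring one): 0 or 2 children with equal probability
tree = sample_galton_watson({0: 0.5, 2: 0.5})
```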
An extension of the Galton-Watson model is the multi-type Galton-Watson model ([Seneta, 1970], [Mode, 1971]), in which each vertex now has a label from some finite set. As before, trees are incrementally constructed; for a given vertex, the number of its children and their labels are now sampled according to some conditional distribution (conditioned on the label of the parent).

In problems in which vertices have relatively complex labels, often a grammar is used to specify which trees are valid (i.e., used to define the tree space). These grammars produce trees by production rules, which may be thought of as functions that take a tree and specify a larger one; beginning with the empty tree, trees are incrementally built by the iterative application of these rules. In a context-free grammar [Chomsky, 2002], the production rules are functions that depend only on one (leaf) vertex in a given tree, and specify its children and their attributes. Distributions can be defined over trees in this grammar by associating a probability with each production rule. A context-sensitive grammar is an extension of a context-free grammar in which production rules are functions that depend both on a given leaf vertex and on certain vertices neighboring this leaf as well. It is well known that the approach of associating probabilities with production rules does not extend to context-sensitive grammars (i.e., does not produce well-defined distributions in this case); to make distributions for these grammars, very high-order models are required.

There are many applications of random trees with attributes: in linguistics, they are used to describe the structure of sentences and words in natural language [Chomsky, 2002]; in computer vision, they are used to describe the structure of objects in scenes ([Jin and Geman, 2006], [Zhu and Mumford, 2007]); and in genetics, they are used in the study of structure in RNA molecules ([Cai et al., 2003], [Dyrka and Nebel, 2009], [Dyrka et al., 2013]).

In this work, we consider a graphical model approach for random trees; by decomposing a random tree into its marginal random (tree) variables, it becomes tractable to make well-defined tree distributions that are, loosely speaking, context-sensitive. Since trees are graphs, one could model them by applying the same framework that we develop for general graphs. However, it is beneficial to instead use models that are tuned to the defining properties of trees.

1.3 Outline

In Section 2, we examine the common compositional structure within multivariate random variables and random graphs, allowing graphical models to be applied to each. The main ideas for extending graphical models to other objects are outlined in this section. In Section 3, we explore the modeling of random trees with graphical models. In Section 4, we provide a formulation for general random objects, and in Section 5, we illustrate the application of these models with some examples, focusing on random graphs. Finally, we conclude with a discussion in Section 6.

2 Random Graphs

In this section, we present a general class of models for random graphs which can be used for creating complex distributions (e.g., distributions that place significant mass on graphs with complicated edge structures). We begin by defining a canonical projection family based on projections that take graphs to their subgraphs.
These projections define a consistent family of marginal random (graph) variables, allowing us to specify conditional independence assumptions between them, and in turn, to apply Bayesian networks (over the marginal variables that are atomic). Next, we define, using these same graph projections, a Gibbs form for graph distributions, allowing us to specify general factorizations, and in turn, to apply Markov random fields. Finally, we consider partially directed models (also known as chain graph models), a generalization of Markov random fields and Bayesian networks; these models are important for random graphs because, as we will discuss, they avoid certain drawbacks that these former models have for this problem, while maintaining their advantages.

2.1 Graphs

Suppose we have a vertex space $\Lambda_V$ and an edge space $\Lambda_E$, and for simplicity, assume the vertex space is finite. We define a graph to be a couple of the form $G = (V, E)$, where $V$ is a set of vertices and $E$ is a function assigning an edge value to every pair of vertices:

$$V \subseteq \Lambda_V, \qquad E : V \times V \to \Lambda_E.$$

Hence, every vertex is unique, i.e., no two vertices can share the same value in $\Lambda_V$. We assume the edge space $\Lambda_E$ contains a distinguished element that represents the 'absence' of an edge (e.g., the value $0$). If a graph has no vertices, i.e., $|V| = 0$, we will denote it by $\emptyset$ and refer to it as the empty graph. For simplicity, we assume there are no self-loops; that is, there are no edges between a vertex and itself (i.e., $E(v, v) = 0$ for all $v \in V$).

Example 2.1. Suppose we have a vertex space in which each vertex has a color and a location; let $\Lambda_V = C \times L$, where:

$$C = \{\text{'red'}, \text{'blue'}\}, \qquad L = \{1, \dots, p\} \times \{1, \dots, p\}.$$

Here, $L$ represents a location space, a two-dimensional grid of size $p$. Let the edge space be $\Lambda_E = \{0, 1, 2\}$, where the value $0$ represents the absence of an edge. See Figure 2.1 for an example graph.

Figure 2.1: An example graph that uses the vertex and edge space described in Example 2.1. The edge values 1 and 2 are represented by lines and dotted lines, respectively.

In most real-world applications, graphs have attributes associated with their vertices and edges; in this case, attributes can be incorporated into the vertex space $\Lambda_V$ and edge space $\Lambda_E$, or alternatively, graphs can be defined to be attributed. For the presentation of the random graph models, though, we will proceed in this simpler setting, deferring attributed graphs and other variations to Section 2.8.2. We consider more examples in Section 5.

2.2 Marginal Random Graphs

Suppose we have a graph space $\mathcal{G}$ that we want to define a distribution over. To do this, some basic probabilistic concepts need to be developed; in this section, we define, for a random graph, a family of marginal random (graph) variables. These marginal variables are defined using projections on the graph space, and hence, we require that random graphs take values in graph spaces that are projectable.

Let's begin by defining an induced subgraph. For a graph $G = (V, E)$, let the subgraph induced by a subset $V_0 \subseteq V$ of its vertices be the graph $G_0 = (V_0, E_0)$, where $E_0 = E|_{V_0 \times V_0}$ is the restriction of $E$; we let $G_0 = G(V_0)$ denote the subgraph of $G$ induced by $V_0 \subseteq V$. For a given graph, its subgraphs may be thought of as its components or parts, and they are fundamental to its statistical modeling.
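The following sketch (ours, with a graph represented as a vertex set plus an edge dictionary) makes the definition of induced subgraphs concrete; later sketches reuse this representation:

```python
from itertools import chain, combinations

# A graph G = (V, E): a set of vertices and an edge value for each
# ordered pair of distinct vertices (0 encoding an absent edge).
def induced_subgraph(G, V0):
    """G(V0): restrict the vertex set to V0 and the edge function E
    to pairs of retained vertices."""
    V, E = G
    W = V & V0
    return (W, {(u, v): e for (u, v), e in E.items() if u in W and v in W})

def all_subgraphs(G):
    """S(G): the subgraphs induced by every subset of V(G)."""
    V, _ = G
    for W in chain.from_iterable(combinations(sorted(V), k)
                                 for k in range(len(V) + 1)):
        yield induced_subgraph(G, set(W))
```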
A graph space $\mathcal{G}$ is projectable if, for every graph in this space, its subgraphs are also in it. For a graph $G$, let $S(G)$ denote the set of all its subgraphs:

$$S(G) = \{G(V_0) \mid V_0 \subseteq V(G)\},$$

where $V(G)$ is the set of vertices of graph $G$. This set contains, for example, the subgraphs corresponding to individual vertices in $G$ (i.e., the subgraphs $G(V_0)$ where $V_0 \subseteq V(G)$ and $|V_0| = 1$), and the subgraphs corresponding to pairs of vertices in $G$ (i.e., the subgraphs $G(V_0)$ where $V_0 \subseteq V(G)$ and $|V_0| = 2$). Now, we may define a projectable graph space:

Definition 2.1 (Projectable Space). A graph space $\mathcal{G}$ is projectable if:

$$G \in \mathcal{G} \implies G_0 \in \mathcal{G} \text{ for all } G_0 \in S(G).$$

Henceforth, we assume that every graph space $\mathcal{G}$ is projectable. Now, we may define graph projections:

Definition 2.2 (Canonical Graph Projection). Let $V \subseteq \Lambda_V$ be a set of vertices. Define the projection $\pi_V : \mathcal{G} \to \mathcal{G}_V$, where $\mathcal{G}_V = \{G \in \mathcal{G} \mid V(G) \subseteq V\}$, as follows:

$$\pi_V(G) = G(V \cap V_0),$$

where $V_0 = V(G)$ is the set of vertices of the graph $G \in \mathcal{G}$.

The projection $\pi_V$ maps graphs to their induced subgraphs based on the intersection of their vertices with the vertices $V$. That is, for a graph $G$, if there are no vertices in this intersection (i.e., $V \cap V_0 = \emptyset$, where $V_0 = V(G)$), then $G$ gets projected to the empty graph; if there is an intersection (i.e., $V \cap V_0 \neq \emptyset$, where $V_0 = V(G)$), then $G$ gets projected to its subgraph induced by the vertices in this intersection.

This projection has the property that the image of a projectable graph space is also a projectable space. That is, if the domain $\mathcal{G}$ is projectable, then for each projection $\pi_V$, the codomain $\mathcal{G}_V \subseteq \mathcal{G}$ is also projectable. This property is useful because it allows us to define a consistent set of marginal random variables. Suppose we have a distribution $P$ over a countable graph space $\mathcal{G}$; then the distribution for a marginal random variable $G_V$ taking values in $\mathcal{G}_V$ is defined as:

$$P_V^{\text{marg}}(G) = \sum_{\substack{G' \in \mathcal{G} \\ \pi_V(G') = G}} P(G'), \qquad G \in \mathcal{G}_V.$$

It can be verified that this defines a valid probability distribution, i.e., that

$$\sum_{G \in \mathcal{G}_V} P_V^{\text{marg}}(G) = 1,$$

and further, that this set of distributions (i.e., the set $\{P_V^{\text{marg}}, V \subseteq \Lambda_V\}$) is consistent, i.e., for all $V_0, V_1 \subseteq \Lambda_V$ such that $V_0 \subseteq V_1$, the distributions $P_{V_0}^{\text{marg}}$ and $P_{V_1}^{\text{marg}}$ are consistent:

$$P_{V_0}^{\text{marg}}(G) = P_{V_1}^{\text{marg}}(\{G' \in \mathcal{G}_{V_1} \mid G'(V_0) = G\}), \qquad \text{for all } G \in \mathcal{G}_{V_0}.$$
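Concretely, the marginal is the pushforward of the distribution's mass under the projection. A sketch (ours, reusing `induced_subgraph` from the earlier sketch, and freezing graphs into hashable form so they can serve as dictionary keys):

```python
from collections import defaultdict

def freeze(G):
    """Hashable form of a graph (V, E)."""
    V, E = G
    return (frozenset(V), frozenset(E.items()))

def marginal(P, V):
    """P_V^marg: push the distribution P (a dict mapping frozen graphs
    to probabilities) through pi_V, summing the mass of all graphs
    that project to the same induced subgraph."""
    Pm = defaultdict(float)
    for (Vg, Eg), prob in P.items():
        G = (set(Vg), dict(Eg))
        Pm[freeze(induced_subgraph(G, V))] += prob
    return dict(Pm)
```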
2.3 Independence

The marginal random variables for random graphs defined in the previous section allow us to use the standard definitions of independence and conditional independence for random variables. For convenience, we repeat the definition of independence here, using the notation for random graphs. Suppose we have a vertex space $\Lambda_V$ and an edge space $\Lambda_E$, and let $\mathcal{G}$ be a graph space with respect to them. Define independence as follows:

Definition 2.3 (Independence). Let $V_1, V_2 \subset \Lambda_V$. For a distribution $P$ over $\mathcal{G}$, we say that the marginal random variables $G_{V_1}$ and $G_{V_2}$ are independent if

$$\begin{aligned} P(G_{V_1}, G_{V_2}) &= P(\pi_{V_1}(G) = G_{V_1},\ \pi_{V_2}(G) = G_{V_2}) \\ &= P(\pi_{V_1}(G) = G_{V_1}) \cdot P(\pi_{V_2}(G) = G_{V_2}) \\ &= P_{V_1}^{\text{marg}}(G_{V_1}) \cdot P_{V_2}^{\text{marg}}(G_{V_2}), \end{aligned}$$

for all $G_{V_1} \in \mathcal{G}_{V_1}$ and $G_{V_2} \in \mathcal{G}_{V_2}$.

Similarly, conditional independence for random graphs can be defined using the standard definitions as well, which we do not repeat here. These definitions suggest methods for specifying structure in distributions:

Example 2.2 ('Naive' Random Graphs). To define a distribution $P$ over a graph space $\mathcal{G}$, a naive approach might be to assume, loosely speaking, that all marginal random variables are independent. Unlike for multivariate random variables, though, due to the constraints imposed by the dependence of edges on vertices, conditional independence assumptions are also necessary. Let $\mathcal{V} = \mathcal{V}^{(1)} \cup \mathcal{V}^{(2)}$, where each $\mathcal{V}^{(i)} = \{V \subseteq \Lambda_V : |V| = i\}$, be the set of all singleton vertices as well as all pairs of vertices. We will make invariance assumptions with respect to the marginal random variables $G_V$, $V \in \mathcal{V}$: assume independence between these variables if they do not have vertices in common, and further, assume conditional independence between edge variables, given the vertex variables. More precisely, suppose:

1. For every $V_1, V_2 \in \mathcal{V}$ such that $V_1 \cap V_2 = \emptyset$, the random variable $G_{V_1}$ is independent of the random variable $G_{V_2}$.

2. For every $V_1, V_2 \in \mathcal{V}$ such that $V_1 \cap V_2 \neq \emptyset$, the random variable $G_{V_1}$ is conditionally independent of the random variable $G_{V_2}$, given the variable $G_{V_1 \cap V_2}$.

Loosely speaking, this latter assumption makes edges incident on a common vertex conditionally independent, given that vertex. With these assumptions, the model has the form:

$$\begin{aligned} P(G) &= P(G_V, V \in \mathcal{V}) \\ &= P(G_V, V \in \mathcal{V}^{(1)}) \cdot P(G_V, V \in \mathcal{V}^{(2)} \mid G_V, V \in \mathcal{V}^{(1)}) \\ &= \prod_{V \in \mathcal{V}^{(1)}} P_V^{\text{marg}}(G_V) \prod_{V \in \mathcal{V}^{(2)}} P_V^{\text{marg}}(G_V \mid G_{\{v\}}, v \in V) \\ &= \prod_{v \in \Lambda_V} P_{\{v\}}^{\text{marg}}(G_{\{v\}}) \prod_{\substack{v, v' \in \Lambda_V \\ v \neq v'}} P_{\{v, v'\}}^{\text{marg}}(G_{\{v, v'\}} \mid G_{\{v\}}, G_{\{v'\}}). \end{aligned} \tag{2.1}$$

Finally, we mention that this model may be further simplified by assuming the distribution is invariant to isomorphisms (see Section 5.3).

2.4 Bayesian Networks

In graphical models, graphs are used to represent the structure within distributions; we will refer to these as structure graphs to avoid confusion. For a graph with a binary edge function, two vertices $v, u$ are said to have a directed edge from $v$ to $u$ (denoted by $v \to u$) if $E(v, u) = 1$ and $E(u, v) = 0$, and are said to have an undirected edge between them (denoted $v - u$) if $E(v, u) = E(u, v) = 1$. The vertex $v$ is a parent of vertex $u$ if $v \to u$, and vertices $v$ and $u$ are neighbors if $v - u$. The set of parents of $v$ is denoted by $\text{pa}(v)$ and the set of neighbors by $\text{ne}(v)$. In this section, we consider Bayesian networks, a modeling framework based on conditional independence assumptions, specified in structure graphs with directed edges [Pearl and Shafer, 1995].

2.4.1 Structure Graphs

Let's begin by considering Bayesian networks for multivariate random variables; suppose we have a random variable $X$ taking values in $\mathcal{X} = \{0,1\}^n$, and a structure graph with vertices $\{1, \dots, n\}$ and a binary edge function of the form $N : \{1, \dots, n\}^2 \to \{0, 1\}$. Further, assume this structure graph has directed edges and is acyclic; a distribution $P$ over $\mathcal{X}$ is said to factor according to this structure graph if it can be written in the form:

$$P(X) = \prod_{i=1}^n P(X_i \mid X_{\text{pa}(i)}),$$

where $X_A \equiv \pi_A(X)$ is the projection of $X$ onto its components in the set $A \subseteq \{1, \dots, n\}$.
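This factorization is easy to state in code. The sketch below (ours, for binary components, with conditional probability tables passed in as plain dictionaries) evaluates a Bayesian-network probability:

```python
def bn_probability(x, parents, cpds):
    """P(X) = prod_i P(X_i | X_pa(i)).  `parents[i]` lists the parents of
    component i; `cpds[i]` maps (x_i, tuple of parent values) to a
    conditional probability."""
    p = 1.0
    for i, xi in enumerate(x):
        pa_vals = tuple(x[j] for j in parents[i])
        p *= cpds[i][(xi, pa_vals)]
    return p

# a two-component chain X0 -> X1, so P(X) = P(X0) * P(X1 | X0)
parents = {0: [], 1: [0]}
cpds = {0: {(0, ()): 0.3, (1, ()): 0.7},
        1: {(0, (0,)): 0.9, (1, (0,)): 0.1,
            (0, (1,)): 0.2, (1, (1,)): 0.8}}
assert abs(bn_probability((1, 1), parents, cpds) - 0.7 * 0.8) < 1e-12
```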
Now consider Bayesian networks for random graphs; suppose we have a random graph $G$ taking values in a graph space $\mathcal{G}$, and a structure graph with vertices $\mathcal{V} = \mathcal{V}^{(1)} \cup \mathcal{V}^{(2)}$, where each $\mathcal{V}^{(i)} = \{V \subseteq \Lambda_V : |V| = i\}$, and a binary edge function of the form $N : \mathcal{V}^2 \to \{0, 1\}$. Further, assume the structure graph is directed and acyclic; a distribution $P$ over $\mathcal{G}$ factorizes according to this structure graph if it can be written in the form:

$$P(G) = P(G_V, V \in \mathcal{V}) = \prod_{V \in \mathcal{V}} P(G_V \mid G_{\text{pa}(V)}),$$

where, for $A \subseteq \mathcal{V}$, we have $G_A \equiv \{G_V, V \in A\}$, and where, recall, $G_V \equiv \pi_V(G)$ is the projection of $G$ onto the vertices $V$.

Example 2.3 ('Naive' Random Graphs (cont.)). We revisit Example 2.2, now specifying the structure in terms of a structure graph. Define the neighborhood function $N : \mathcal{V}^2 \to \{0, 1\}$ as follows:

$$N(V, V') = \begin{cases} 1, & \text{if } V \subset V' \\ 0, & \text{otherwise}. \end{cases}$$

In other words, this neighborhood function specifies a directed edge from each vertex variable $G_{\{v\}}$ to every edge variable of the form $G_{\{v, v'\}}$. Distributions that cohere with this structure graph can be written in the form of equation (2.1). This model has minimal complexity in the sense that the neighborhood function cannot have fewer non-zero values while still defining a valid structure graph (i.e., a structure graph that specifies independence assumptions that are consistent, in the sense that there exists a well-defined distribution satisfying them). Hence the reason for referring to this as the naive model.

2.4.2 Atomic Variables

In the previous section, the main difference between the graphical model for multivariate random variables and that for random graphs was in the marginal variables used in the structure graph in each case (i.e., the variables to which vertices in the structure graph correspond). In this section, we consider in more detail the subset of variables used, for a given random object, by graphical models in their structure graphs.

Suppose we have a random graph $G$ taking values in a projectable graph space $\mathcal{G}$. The canonical set of projections on this graph space defines a set of marginal random variables, and a projection in this set such that, loosely speaking, no other projection further projects downward, defines an atomic variable. Informally, a projection $\pi$ is atomic (with respect to a finite projection family) if: (a) there does not exist a projection in this family that projects to a subset of its image; or (b) if there are projections in this family that project to a subset of its image, then this set loses information (i.e., $\pi$ is not a function of these projections). We defer more formal definitions to Section 4.1. The second condition ensures that any object projected by the set of atomic projections can be reconstructed. We will call a marginal variable atomic if it corresponds to an atomic projection.

For random graphs, the atomic projections have the form $\pi_{\{v\}}$ or $\pi_{\{v, v'\}}$ (i.e., loosely speaking, the projections to some vertex or edge), and the non-atomic projections have the form $\pi_V$ where $|V| > 2$ (i.e., the projections to larger vertex sets). Hence, for a random graph $G$, the atomic variables are $\{\pi_V(G), V \in \mathcal{V}\}$, where $\mathcal{V} = \mathcal{V}^{(1)} \cup \mathcal{V}^{(2)}$; these variables can be used as a representation of the random graph, and graphical models specify structure in terms of them (i.e., the vertices in structure graphs correspond to these variables).
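Enumerating the atomic index sets is straightforward; a small sketch (ours):

```python
from itertools import combinations

def atomic_vertex_sets(Lambda_V):
    """The index sets of a random graph's atomic variables: all
    singletons {v} (vertex variables) and all pairs {v, v'} (edge
    variables) from the vertex space."""
    singles = [frozenset({v}) for v in Lambda_V]
    pairs = [frozenset(pair) for pair in combinations(Lambda_V, 2)]
    return singles + pairs

# three vertices yield 3 vertex variables and 3 edge variables
assert len(atomic_vertex_sets([1, 2, 3])) == 6
```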
2.5 Gibbs Distribution

In this section, we define a Gibbs form for random graphs based on a canonical factorization; this factorization is determined by the canonical projections, the projection family taking graphs to their subgraphs. For a graph $G$, let $S_k(G)$ denote the set of all subgraphs of $G$ of order $k$:

$$S_k(G) = \{G' \in S(G) : |V(G')| = k\},$$

where, recall, $S(G)$ is the set of all induced subgraphs of $G$, and where $V(G)$ denotes the vertices of graph $G$. Hence, the set $S_1(G)$ contains graphs having a single vertex, the set $S_2(G)$ contains graphs having two vertices, and so on. For this section, let the vertex space $\Lambda_V$ be countable, and for any graph, assume its vertex set $V \subset \Lambda_V$ is finite. We can define a Gibbs distribution for a countable graph space $\mathcal{G}$ as follows:

Definition 2.4 (Gibbs Distribution). A probability mass function (pmf) $P$ over a countable graph space $\mathcal{G}$ is a Gibbs distribution if it can be written in the form

$$P(G) = \exp\Big(\psi_0 + \sum_{G' \in S_1(G)} \psi_1(G') + \sum_{G' \in S_2(G)} \psi_2(G') + \dots\Big), \tag{2.2}$$

where $\psi_k : \mathcal{G}^{(k)} \to \mathbb{R} \cup \{-\infty\}$ is called the potential of order $k$, and $\mathcal{G}^{(k)}$ denotes the space of graphs of order $k$, i.e.:

$$\mathcal{G}^{(k)} = \{(V, E) \in \mathcal{G} : |V| = k\}.$$

A graph space need not be countable (depending on $\Lambda_E$), but for ease of exposition, we have assumed so here. We give some examples in which classic models are expressed in this form.

Example 2.4 (The Erdős-Rényi model; [Erdős and Rényi, 1959], [Gilbert, 1959]). Let $\mathcal{G}$ be a standard graph space (i.e., the vertex space $\Lambda_V = \mathbb{N}$ is the set of natural numbers and the edge space is $\Lambda_E = \{0, 1\}$). The Erdős-Rényi model is a conditional distribution specifying the probability of edges $E$ given a finite set of vertices $V$. It makes the invariance assumption that, for any two vertices, the probability of an edge between them is independent of the other edges in the graph:

$$P(E \mid V) = \exp\Big(\sum_{G' \in S_2((V, E))} \psi_2(G')\Big),$$

where

$$\psi_2(G') = \begin{cases} \log(p), & \text{if } G' \text{ has an edge} \\ \log(1 - p), & \text{otherwise}, \end{cases}$$

and $p \in [0, 1]$.

Example 2.5 (The stochastic blockmodel; [Holland et al., 1983]). Let $\mathcal{G}$ be a graph space where the vertex space is $\Lambda_V = \{1, \dots, l\} \times \mathbb{N}$, where the first component corresponds to some label, and the edge space is $\Lambda_E = \{0, 1\}$. The stochastic blockmodel is also a conditional distribution specifying the probability of edges $E$ given a finite set of vertices $V$. It makes the invariance assumption that, for any two vertices, the probability of an edge between them depends only on the labels of those two vertices:

$$P(E \mid V) = \exp\Big(\sum_{G' \in S_2((V, E))} \psi_2(G')\Big),$$

where

$$\psi_2(G') = \begin{cases} \log p(a_1, a_2), & \text{if } G' \text{ has an edge} \\ \log(1 - p(a_1, a_2)), & \text{otherwise}, \end{cases}$$

where $a_1, a_2$ are the labels of the two vertices in $G' \in \mathcal{G}^{(2)}$ and $p : \{1, \dots, l\}^2 \to [0, 1]$ is a symmetric function.

We now define a positivity condition for distributions; this will allow us to make a statement about the universality of the Gibbs representation.

Definition 2.5 (Positivity Condition). Let $P$ be a real function over a projectable graph space $\mathcal{G}$. The function $P$ is said to satisfy the positivity condition if, for all $G \in \mathcal{G}$, we have:

$$P(G) > 0 \implies P(G') > 0 \text{ for all } G' \in S(G).$$

Theorem 2.1. If $P$ is a positive distribution over a projectable graph space $\mathcal{G}$, then $P$ can be written in Gibbs form.
Proof. For a given graph $G \in \mathcal{G}$, define

$$\phi_G(W) = \log \frac{P(G_W)}{P(\emptyset)},$$

where $W \subseteq V(G)$ and where $G_W = \pi_W(G)$. Using the Möbius formula, we can write

$$\phi_G(W) = \sum_{W' \subseteq W} \psi_G(W'), \qquad \psi_G(W) = \sum_{W' \subseteq W} (-1)^{|W| - |W'|} \phi_G(W'),$$

where the positivity condition is required for the validity of the second equation. Note that $\psi_G(W)$ depends only on $G_W$ (not on the rest of $G$), so it can be renamed $\psi(G_W)$; letting $W = V(G)$, we have:

$$P(G) = P(\emptyset) \exp\Big(\sum_{W \subseteq V(G)} \psi(G_W)\Big).$$

This theorem shows that distributions can be expressed in such a way that the probability of a graph is a function of only its induced subgraphs; that is, statistical models need not include (more formally, may set to zero) the value of potentials that involve vertices absent from a given input graph. Henceforth, we return to assuming vertex spaces are finite (since, in our formulation, graphical models are limited to finite projection families; see Section 4).

2.6 Markov Random Fields

In the previous section, we defined a Gibbs distribution for random graphs, a universal representation (Theorem 2.1) based on a general factorization. In this section, we consider Markov random fields, a graphical model that specifies structure in distributions based on these factorizations ([Kindermann et al., 1980], [Geman and Graffigne, 1986], [Clifford, 1990]).

Consider Markov random fields for multivariate random variables: suppose we have a random variable $X$ taking values in $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_n$, where each $\mathcal{X}_i$ is finite. To define a distribution over $\mathcal{X}$, we will assume it equals some product of simpler functions (i.e., functions that have smaller domains than $\mathcal{X}$). To define these simpler functions, we use projections of the form $\pi_C : \mathcal{X} \to \mathcal{X}_C$, where $C \subseteq \{1, \dots, n\}$ and $\mathcal{X}_C = \prod_{i \in C} \mathcal{X}_i$, which take elements in $\mathcal{X}$ to their components. Using these projections, we can define factors of the form $f_C : \mathcal{X}_C \to \mathbb{R}_+$, and a distribution $P$ factorizes over $\mathcal{C} \subseteq \mathcal{P}(\{1, \dots, n\})$ if it can be written as:

$$P(X) = \frac{1}{Z} \prod_{C \in \mathcal{C}} f_C(\pi_C(X)) = \frac{1}{Z} \prod_{C \in \mathcal{C}} f_C(X_C),$$

where $X_C \equiv \pi_C(X)$, and where $\mathcal{P}(\{1, \dots, n\})$ denotes the power set of $\{1, \dots, n\}$. Structure can be specified in this model by the choice of factors. For a given model, complexity can be reduced through the removal of factors (i.e., removing elements from the set $\mathcal{C}$).

Now suppose we have a random graph $G$ taking values in $\mathcal{G}$. As was done in the multivariate case, we define the factorization of distributions over this graph space using a projection family; a distribution can be defined as a product of factors of the form $f_V : \mathcal{G}_V \to \mathbb{R}_+$, where, recall, $\mathcal{G}_V$ is the smaller, projected graph space. A distribution $P$ factorizes over $\mathcal{V} \subseteq \mathcal{P}(\Lambda_V)$ if it can be written as:

$$P(G) = \frac{1}{Z} \prod_{V \in \mathcal{V}} f_V(\pi_V(G)) = \frac{1}{Z} \prod_{\substack{V \in \mathcal{V} \\ V \subseteq V(G)}} f_V(G_V),$$

where $G_V \equiv \pi_V(G)$, and where we adopt the convention that a factor $f_V$ evaluates to $1$ unless all of the vertices in $V$ are present in the graph (so only factors with $V \subseteq V(G)$ contribute). As above, structure can be specified in this model through the choice of factors.
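The graph factorization just given can be evaluated directly, up to the normalizing constant. A sketch (ours, reusing `induced_subgraph` from the earlier sketch; `factors` is an assumed input mapping each vertex set in $\mathcal{V}$ to its factor function):

```python
def unnormalized_weight(G, factors):
    """The product of f_V(G_V) over the vertex sets V with V a subset of
    V(G): a factor contributes only when all of its vertices are present
    in G, matching the convention that f_V evaluates to 1 otherwise."""
    V_G, _ = G
    w = 1.0
    for Vset, f in factors.items():
        if Vset <= V_G:
            w *= f(induced_subgraph(G, set(Vset)))
    return w
```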
2.6.1 Cliques

We now consider the representation of the factorizations in the previous section in terms of an undirected structure graph; suppose we have a neighborhood function $N : \mathcal{V}^2 \to \{0, 1\}$ that is symmetric, where $\mathcal{V} = \mathcal{V}^{(1)} \cup \mathcal{V}^{(2)}$. In order for a neighborhood function to be valid (i.e., to specify independence assumptions that are consistent in the sense that there exists a well-defined distribution that satisfies them), it must specify a direct dependency between any $V, V' \in \mathcal{V}$ such that one is a subset of the other. That is, for all $V, V' \in \mathcal{V}$, we require that $N(V, V') = 1$ if $V \subset V'$ or $V' \subset V$. A neighborhood function specifies the set of factors within a model based on its cliques, where cliques are defined as follows:

Definition 2.6 (Cliques). For a neighborhood function $N$, a collection of vertex sets $\tilde{\mathcal{V}} \subseteq \mathcal{V}$ is a clique if:

1. $N(V_0, V_1) = 1$ for all $V_0, V_1 \in \tilde{\mathcal{V}}$; or

2. $|\tilde{\mathcal{V}}| = 1$.

Hence, by the second condition, we have that each vertex $\{v\}$ and each pair of vertices $\{v, v'\}$ is a clique. Let $\mathcal{V}_N$ contain the vertex sets that correspond to cliques:

$$\mathcal{V}_N = \Big\{ \bigcup_{V \in \tilde{\mathcal{V}}} V : \tilde{\mathcal{V}} \subseteq \mathcal{V},\ \tilde{\mathcal{V}} \text{ is a clique} \Big\}.$$

This set represents the set of factors to be used in a distribution (i.e., for each $V \in \mathcal{V}_N$, we will assume there is a factor over this set of vertices).
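Enumerating $\mathcal{V}_N$ by brute force is simple for small models (it is exponential in $|\mathcal{V}|$ in general); a sketch (ours, with the neighborhood function passed as a symmetric predicate):

```python
from itertools import combinations

def factor_index_sets(V_list, N):
    """V_N: for every clique of the neighborhood function N, i.e. every
    subset of V whose members are pairwise neighbors (plus all
    singletons), record the union of its vertex sets."""
    out = set()
    for k in range(1, len(V_list) + 1):
        for sub in combinations(V_list, k):
            if k == 1 or all(N(a, b) for a, b in combinations(sub, 2)):
                out.add(frozenset().union(*sub))
    return out
```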
Hence, a Gibbs distribution with respect to a neighborhood function can be defined as follows:

Definition 2.7 (Gibbs Distribution). Let $P$ be a pmf over $\mathcal{G}$. The distribution $P$ is a Gibbs distribution with respect to the neighborhood function $N$ if it can be written in the form:

$$P(G) = \frac{1}{Z} \prod_{\substack{V \in \mathcal{V}_N \\ V \subseteq V(G)}} \phi_V(G_V), \tag{2.3}$$

where $\phi_V : \mathcal{G}_V \to [0, \infty)$.

Now that we have defined a Gibbs distribution with respect to a neighborhood function, let's consider its connections to Markov properties and Markov distributions.

2.6.2 Markovity

A distribution is Markov if, loosely speaking, conditional probabilities depend only on local parts of the random object. Let's consider Markovity for multivariate random variables. Suppose we have a random variable $X$ taking values in $\mathcal{X} = \{0,1\}^n$, and a (symmetric) neighborhood function $N : \{1, \dots, n\}^2 \to \{0, 1\}$. A distribution $P$ over $\mathcal{X}$ is Markov with respect to the neighborhood function $N$ if, for all $X \in \mathcal{X}$ and for all $i$, we have that:

$$P(X_i \mid X_j, j \neq i) = P(X_i \mid X_j, j \in J_i), \tag{2.4}$$

where $J_i = \{j \mid N(i, j) = 1\}$, and where each $X_i$ denotes the $i$th component of $X$.

Now consider random graphs; let $\Lambda_V$ and $\Lambda_E$ be a vertex and edge space, respectively, and let $\mathcal{G}$ be a graph space with respect to them. Further, let $\mathcal{V} = \mathcal{V}^{(1)} \cup \mathcal{V}^{(2)}$, and suppose we have a (symmetric) neighborhood function $N : \mathcal{V}^2 \to \{0, 1\}$. Then, a distribution $P$ is Markov with respect to the neighborhood function $N$ if, for all $G \in \mathcal{G}$ and all $V \in \mathcal{V}$, we have that:

$$P(G_V \mid G_{V'}, V' \in \mathcal{V} \setminus u(V)) = P(G_V \mid G_{V'}, V' \in J_V), \tag{2.5}$$

where $J_V = \{V' \in \mathcal{V} \setminus u(V) \mid N(V, V') = 1\}$, and where $u(V) = \{V' \in \mathcal{V} \mid V \subseteq V'\}$. Thus, we define Markovity as follows:

Definition 2.8 (Markov Distribution). Let $P$ be a pmf over $\mathcal{G}$. The distribution $P$ is a Markov distribution with respect to neighborhood function $N$ if, for all $G \in \mathcal{G}$ and $V \in \mathcal{V}$, equation (2.5) holds.

We have that if a distribution is Gibbs with respect to some neighborhood function, then it is Markov with respect to it as well:

Proposition 2.1. Let $P$ be a distribution over $\mathcal{G}$ and let $N$ be a neighborhood function. Then: $P$ is Gibbs w.r.t. $N$ $\implies$ $P$ is Markov w.r.t. $N$.

The reverse implication in the above proposition is not true (i.e., the Hammersley-Clifford theorem ([Grimmett, 1973], [Besag, 1974]) does not hold). A neighborhood function can specify more structure for a Markov distribution than for a Gibbs distribution; hence, one cannot specify (general) independence assumptions and then assume a Gibbs form. The reason is that the atomic variables have redundancy in them: a vertex variable $G_{\{v\}}$ is a function of an edge variable of the form $G_{\{v, v'\}}$. For a discussion of this issue, see Section 2.8. To avoid this drawback, but maintain the advantages offered by undirected models (in particular, the ability to express the probability of a graph in terms of only its subgraphs), we now consider partially directed models.

2.7 Partially Directed Models

In this section, we briefly review chain graph models [Lauritzen and Richardson, 2002], which we will use in the modeling of random graphs. These models involve structure graphs that can have both directed and undirected edges, a generalization of Bayesian networks and Markov random fields. The reason chain graph models are beneficial for random graphs is that they allow one to specify, loosely speaking, a Gibbs distribution over vertices, as well as a Gibbs distribution over edges, while avoiding the functional dependencies that are problematic. For these structure graphs, we will assume that all edges between vertex variables and edge variables are directed, and all other edges undirected.

In these models, structure graphs are required to be acyclic, where cycles are now defined as follows: a partially directed cycle is a sequence of $n \geq 3$ distinct vertices $v_1, \dots, v_n$ in a graph, and a vertex $v_{n+1} = v_1$, such that:

1. for all $1 \leq i \leq n$, either $v_i - v_{i+1}$ or $v_i \leftarrow v_{i+1}$; and

2. there exists a $1 \leq j \leq n$ such that $v_j \leftarrow v_{j+1}$.

A chain graph is a graph in which there are no partially directed cycles. For a given chain graph, let the chain components $\mathcal{K}$ be the partition of its vertices such that any two vertices $v$ and $u$ are in the same partition set if there exists a path between them that contains only undirected edges. In other words, $\mathcal{K}$ is the partition that corresponds to the connected components of the graph after the directed edges have been removed.

A distribution $P$ over graph space $\mathcal{G}$ factorizes according to a chain graph $H$ if it can be written in the form:

$$P(G) = \prod_{K \in \mathcal{K}} P(G_K \mid G_{\text{pa}(K)}),$$

and further, we have that:

$$P(G_K \mid G_{\text{pa}(K)}) = \frac{1}{Z(G_{\text{pa}(K)})} \prod_{C \in \mathcal{C}(K)} \phi_C(G_C),$$

where $\mathcal{C}(K)$ is the set of cliques in the moralization of the graph $H_{K \cup \text{pa}(K)}$, i.e., the undirected graph that results from adding edges between any unconnected vertices in $\text{pa}(K)$ and converting all directed edges into undirected edges, and where

$$\text{pa}(K) = \bigcup_{v \in K} \text{pa}(v) \setminus K.$$

The factor $Z$ normalizes the distribution:

$$Z(G_{\text{pa}(K)}) = \sum_{G \in \mathcal{G}_K} \prod_{C \in \mathcal{C}(K)} \phi_C(G_C).$$
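Extracting the chain components is a connected-components computation on the undirected part of the structure graph; a sketch (ours, using union-find):

```python
def chain_components(vertices, undirected_edges):
    """The chain components K of a chain graph: the connected components
    that remain after all directed edges are removed."""
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v
    for u, v in undirected_edges:
        parent[find(u)] = find(v)
    components = {}
    for v in vertices:
        components.setdefault(find(v), set()).add(v)
    return list(components.values())
```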
Example 2.6. Suppose we have a vertex space $\Lambda_V = \{1, 2, 3\}$, an edge space $\Lambda_E = \{0, 1\}$, and a graph space with respect to them. Then, the atomic variables for this graph space correspond to the set $\mathcal{V} = \mathcal{V}^{(1)} \cup \mathcal{V}^{(2)}$, where, in this case:

$$\mathcal{V}^{(1)} = \{\{1\}, \{2\}, \{3\}\}, \qquad \mathcal{V}^{(2)} = \{\{1, 2\}, \{1, 3\}, \{2, 3\}\},$$

and so the vertices in structure graphs correspond to them. Suppose the structure graph is a chain graph as shown in Figure 2.2.

Figure 2.2: An example chain graph for Example 2.6. The thick edges represent that the vertices connected by them are fully connected.

Then the chain components $\mathcal{K}$ form the partition $\mathcal{K} = \{\mathcal{V}^{(1)}, \mathcal{V}^{(2)}\}$. The distribution takes the following form:

$$P(G) = \prod_{K \in \mathcal{K}} P(G_K \mid G_{\text{pa}(K)}) = P(G_{\mathcal{V}^{(1)}}) \, P(G_{\mathcal{V}^{(2)}} \mid G_{\mathcal{V}^{(1)}}),$$

where, recall, $G_{\mathcal{V}^{(i)}} = \{G_V, V \in \mathcal{V}^{(i)}\}$. Further, we have that each component can be expressed in Gibbs form:

$$P(G_{\mathcal{V}^{(1)}}) \propto \prod_{V \subseteq V(G)} \sigma(V), \qquad P(G_{\mathcal{V}^{(2)}} \mid G_{\mathcal{V}^{(1)}}) \propto \prod_{V \subseteq V(G)} \phi_V(G_V),$$

where $\sigma : \mathcal{P}(\mathcal{V}) \to [0, \infty)$ and $\phi_V : \mathcal{G}_V \to [0, \infty)$.

2.8 Discussion

We now take a step back and examine some of the design choices made in this section. Graphical models, from a high level, may be thought of as a framework for modeling random objects based on the use of independence assumptions between the parts of the object. It is important that these independence assumptions be made, or can be made, between the smallest parts, those that cannot be decomposed into smaller ones. The reason, as we will discuss in this section, is that this makes the space of (possible) independence assumptions as large as possible, and hence allows the most structure to be specified within a graphical model.

2.8.1 Redundant Representations

The representation of a random object based on its atomic marginal variables can have redundancy in it; for example, a vertex variable $G_{\{v\}}$ is a function of an edge variable of the form $G_{\{v, v'\}}$. This redundancy may appear troublesome since, for example, it means the Hammersley-Clifford theorem cannot be used, preventing us from specifying independencies and then assuming a Gibbs form for distributions. We could remove the redundant variables (i.e., variables that are functions of other variables), and represent the random graph $G$ by only the random variables $\{G_V, V \in \mathcal{V}^{(2)}\}$, a subset of the atomic variables. However, this approach is problematic since it also diminishes our ability to specify structure. Representations with redundancy have the advantage, compared to representations without redundancy, of providing a larger space of possible independence assumptions. We illustrate the concept with some examples:

Example 2.7 (Context-Sensitive Independence). Suppose we have a multivariate random variable $X$ taking values in $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_n$, and suppose we specify (within a Bayesian network) that the distribution over $X$ has the following conditional independence:

$$P(X_1 \mid X_i, i \neq 1) = P(X_1 \mid X_2, X_3).$$

Now suppose that we want to specify the additional invariance that

$$P(X_1 \mid X_2, X_3 = \bar{X}) = P(X_1 \mid X_3 = \bar{X}),$$

where $\bar{X} \in \mathcal{X}_3$; this type of invariance is sometimes referred to as context-sensitive independence ([Shimony, 1991], [Boutilier et al., 1996], [Chickering et al., 1997]). A simple way to incorporate it within a Bayesian network is by the addition of a redundant random variable of the form $Y = f(X_2, X_3)$, taking values in a partition of the space of values of the input variables. In particular, define the function $f : \mathcal{X}_2 \times \mathcal{X}_3 \to \mathcal{P}(\mathcal{X}_2 \times \mathcal{X}_3)$ to be:

$$f(X_2, X_3) = \begin{cases} B, & \text{if } X_3 = \bar{X} \\ \{(X_2, X_3)\}, & \text{otherwise}, \end{cases}$$

where $B = \{(X_2, X_3) \mid X_3 = \bar{X}\}$. Then, by including the variable $Y$ in the network and letting it be the sole parent of $X_1$, we have that

$$P(X_1 \mid Y, X_i, i \neq 1) = P(X_1 \mid Y).$$
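The redundant variable of Example 2.7 is easy to construct explicitly. In the sketch below (ours), partition blocks are represented by canonical labels rather than literal sets:

```python
def make_context_variable(x_bar):
    """Y = f(X2, X3): collapse all pairs with X3 = x_bar into a single
    block B, and keep every other pair as its own block, so that X1 can
    depend on (X2, X3) only through Y."""
    def f(x2, x3):
        return ('B',) if x3 == x_bar else (x2, x3)
    return f

f = make_context_variable(x_bar=0)
assert f(0, 0) == f(1, 0)   # within the context X3 = 0, X2 is irrelevant
assert f(0, 1) != f(1, 1)   # outside it, Y determines (X2, X3) exactly
```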
Hence, using the redundant variable, we were able to specify this additional invariance and reduce the number of parameters in the model. This method for specifying context-sensitive invariance differs from the one taken in [Boutilier et al., 1996], where the focus is on the representation of dependencies within conditional probability tables, using, for example, tree structures.

Example 2.8 (General Independence). In the previous example, we encoded a context-sensitive independence within a Bayesian network, where the context was based on events of the form $\{X_A = \bar{X}\}$, where $\bar{X} \in \mathcal{X}_A$ and $A \subseteq \{1, \dots, n\}$. This was done by defining a partition over the space of values of the parents of a random variable (corresponding to these events), and then introducing a new variable that takes values in this partition. This same approach works for more general context-sensitive independence, i.e., contexts based on events of the form $\{X_A \in D\}$, where $D \subset \mathcal{X}_A$. Hence, we can specify invariances of the form

$$P(X_A, X_B \mid f_3(X_C)) = P(X_A \mid f_3(X_C)) \cdot P(X_B \mid f_3(X_C)) \tag{2.6}$$

within a Bayesian network. Finally, notice that redundant variables also allow us to include invariances of the form:

$$P(f_1(X_A), f_2(X_B) \mid f_3(X_C)) = P(f_1(X_A) \mid f_3(X_C)) \cdot P(f_2(X_B) \mid f_3(X_C)), \tag{2.7}$$

the most general form that independence assumptions can take. The invariance in equation (2.6) implies that in equation (2.7), but not the other way around. For an additional discussion of statistical invariances, and in particular, how different invariances relate to each other, see Appendix A.

These examples illustrate that representing a random object with atomic variables, even if there is redundancy in them, allows more invariances to be specified by a graphical model than would be possible without all of them. Although having a larger space of independence assumptions is not always beneficial (a practitioner cannot specify invariances between variables so low-level that they are uninterpretable), the specification of invariances involving vertices is natural when modeling random graphs, and so vertex variables should generally be included in any graphical model for this problem.

2.8.2 Graph Variations

In this section, we briefly describe other mathematical objects, variations on the definition of a graph, that may be useful for some problems; the graphical model framework discussed in this section can accommodate these objects in a straightforward way.

In the definition of graphs presented in Section 2.1, vertices were a subset of some vertex space $\Lambda_V$, and hence each vertex has a unique value in this space. In some applications, graphs have attributes associated with their vertices, in which case the vertices need only be unique on some component, for example a location component, and may otherwise have common attribute values. These graphs are referred to as attributed in the literature ([Pfeiffer III et al., 2014], [Jain and Wysotzki, 2004]). Suppose we have a finite vertex space $\Lambda_V$, an edge space $\Lambda_E$, and an attribute space $\mathcal{X}$. We define an attributed graph to be of the form $G = (V, X, E)$, where $V$ is a set of vertices, $X$ is a function assigning an attribute value to each vertex, and $E$ is a function assigning an edge value to every pair of vertices:

$$V \subseteq \Lambda_V, \qquad X : V \to \mathcal{X}, \qquad E : V \times V \to \Lambda_E.$$

Hence, every vertex in a graph has a unique value in $\Lambda_V$, and the vertices may be thought of as indices for the variables $X_v \equiv X(v)$. For example, if we let $\Lambda_V = \{1, \dots, n\}$, then a graph may be thought of as some collection of variables of the form $\{X_i, i \in V\}$, where $V \subseteq \{1, \dots, n\}$, along with edges $E$ between them. The attribute space $\mathcal{X}$ could be, for example, a finite set of labels or a Euclidean space (for specifying positions).
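A minimal container in the spirit of this definition (ours; attribute values are left untyped, and integer vertex identifiers are an assumption):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set, Tuple

@dataclass
class AttributedGraph:
    """G = (V, X, E): vertices from a finite vertex space, an attribute
    for each vertex, and an edge value for each ordered pair of distinct
    vertices (0 encoding the absence of an edge)."""
    V: Set[int] = field(default_factory=set)
    X: Dict[int, Any] = field(default_factory=dict)
    E: Dict[Tuple[int, int], int] = field(default_factory=dict)

    def induced(self, V0: Set[int]) -> "AttributedGraph":
        """The attributed subgraph induced by V0."""
        W = self.V & V0
        return AttributedGraph(
            W,
            {v: self.X[v] for v in W},
            {(u, v): e for (u, v), e in self.E.items() if u in W and v in W},
        )
```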
Hence, every vertex in a graph has a unique value in $\Lambda_V$, and the vertices may be thought of as indices for the variables $X_v \equiv X(v)$. For example, if we let $\Lambda_V = \{1, \ldots, n\}$, then a graph may be thought of as some collection of variables of the form $\{X_i, i \in V\}$, where $V \subseteq \{1, \ldots, n\}$, as well as edges $E$ between them. The attribute space $\mathcal{X}$ could be, for example, a finite set of labels or a Euclidean space (for specifying positions).

Graphs may be further generalized to allow higher-order edges, referred to as hypergraphs [Berge and Minieka, 1973]. Suppose we have a finite vertex space $\Lambda_V$ and an edge space $\Lambda_E$. Then, we can define a generalized graph to be of the form $G = (V, E_1, \ldots, E_k)$, where $V$ is a set of vertices, and each $E_i$ is a function assigning an edge value to every group of $i$ vertices:
$$V \subseteq \Lambda_V, \qquad E_1 : V \to \Lambda_E, \qquad E_2 : V^2 \to \Lambda_E, \qquad \ldots, \qquad E_k : V^k \to \Lambda_E.$$
Graphs with higher-order edges may be useful in problems in which interactions can be between multiple objects, and these interactions are not a function of the pairwise interactions ([Zhou et al., 2006], [Tian et al., 2009]).

2.8.3 Projections

It is worth noting that if an attributed graph space is constrained to only graphs that: (a) contain the same set of vertices; and (b) have no edges, then the canonical graph projections (Definition 2.2), in essence, reduce to the component projections used with multivariate random variables. In this sense, the graph projections may be thought of as an extension of the component projections to graph spaces.

3 Random Trees

In this section, we consider the statistical modeling of trees; since trees are a type of graph, the random graph models described in Section 2 could be used. However, it is beneficial to instead use models that are tuned to the defining structure of trees. If the vertices in trees are assumed to take a certain form, then the edges in trees are deterministic, given the vertices; as a result, the tree space and its modeling are simplified. In particular, with these assumptions about the vertex space, the atomic variables correspond to individual vertices (in contrast to the atomic variables in random graphs). Hence, in basic models, the vertices in structure graphs correspond to the vertices in trees, and in more complex models (e.g., with context-sensitive dependencies), the vertices in structure graphs correspond to the vertices in the vertex space.

We begin by considering Bayesian networks in which: (a) the directionality of edges (in the structure graph) is from root to leaves, which we refer to as branching models; and (b) models in which the directionality is the opposite, from leaves to root, which we refer to as merging models. The former is well-suited for problems in which there is a physical process such that, as time progresses, objects divide into more objects; most models in the literature are of this form. The latter, in contrast, is well-suited for problems in which there is some initial set of objects, and as time progresses, these objects merge with each other.

In these types of causal problems, it is generally accepted that the directionality of edges in Bayesian networks should, if possible, correspond to the causality. In some applications, however, trees are not formed by an obvious causal mechanism, and one need not be limited to either a branching or merging model.
For example, consider trees that describe the structure of objects in scenes, where vertices correspond to objects (e.g., cars, trucks, tires, doors, etc.), and edges encode when an object is a subpart of another object ([Jin and Geman, 2006], [Zhu and Mumford, 2007]). These trees are representations of scenes, not formed by a clear time-dependent process. Hence, although distributions on these trees can be expressed using branching or merging models, they may not be expressible by them in a compact form, which is essential. In the last part of this section, we consider more general models that may be useful for these problems.

3.1 Branching Models

In this section, we consider directed and partially-directed models for random trees in which the directed edges are from root to leaf. We first consider trees without attributes, then proceed to trees with them. To demonstrate the value of the graphical model approach to random trees, we contrast it with approaches based on grammars.

3.1.1 Trees

A tree is a graph that is connected and acyclic. A rooted tree is a tree that has a partial ordering (over its vertices) defined by distance from some designated vertex referred to as the root of the tree. Due to the structure of trees, if the vertices in them are given appropriate labels, then the edges are deterministic. For simplicity, let us consider binary trees; let the vertex space $\Lambda = \Lambda_V$ be
$$\Lambda = \bigcup_{n=0}^{N} \left( \{v_{\text{root}}\} \times \{0, 1\}^n \right),$$
where $N$ is some natural number. Thus, a vertex $v \in \Lambda$ has the form $v = (v_{\text{root}}, v_1, \ldots, v_n)$, where each $v_i \in \{0, 1\}$ and $v_{\text{root}}$ is some arbitrary element that denotes the root vertex (see Figure 3.1).

Figure 3.1: An example tree using the vertex labeling in the branching model (Section 3.1).

Let $\pi_{1:k}$ be the projection of a vertex to its first $k$ components:
$$\pi_{1:k}(v) = \begin{cases} (v_{\text{root}}, v_1, \ldots, v_{k-1}), & \text{if } k \leq |v| \\ v, & \text{otherwise}. \end{cases}$$
Let a tree $T \subseteq \Lambda$ be a set of vertices such that, for each vertex in $T$, its ancestors are also in it:
$$v \in T \implies \pi_{1:k}(v) \in T \text{ for all } k \leq |v|.$$
If $T = \emptyset$, we will refer to it as the empty tree. Given a tree $T$, define the parent, children, and siblings of a vertex $v \in T$ as:
$$\mathrm{pa}(v) = \pi_{1:|v|-1}(v), \qquad \mathrm{ch}(v) = \{v' \in T \mid \mathrm{pa}(v') = v\}, \qquad \mathrm{sib}(v) = \{v' \in T \mid \mathrm{pa}(v') = \mathrm{pa}(v)\}.$$

3.1.2 Basic Models

In this section, we consider random tree models over a finite tree space $\mathcal{T}$ based on marginal random variables that take values in $\mathcal{T}$, i.e., are also random trees. In the next section, we expand the set of marginal variables to also include tree parts that do not take values in $\mathcal{T}$, but rather in substructures of this space. Let $\mathcal{T}$ be a set of trees that is projectable, i.e.:
$$T \in \mathcal{T} \implies T' \in \mathcal{T} \text{ for all } T' \in S(T),$$
where $S(T)$ denotes the set of all subtrees of $T$. We can then define tree projections:

Definition 3.1 (Tree Projections). Let $V \in \mathcal{T}$ be a set of vertices. Define the projection $\pi_V : \mathcal{T} \to \mathcal{T}_V$, where $\mathcal{T}_V = \{T \in \mathcal{T} \mid T \subseteq V\}$, as follows:
$$\pi_V(T) = T \cap V.$$
Let $T_V \equiv \pi_V(T)$ denote the projection of a tree $T$ onto the vertices $V$.

This projection is similar to the one used for (general) graphs, the main difference being that the set of vertices $V$ being projected onto cannot be an arbitrary subset of the vertex space, but must correspond to a tree (i.e., $V \in \mathcal{T}$). The reason is so that the projection of a tree is always a tree (in a projectable tree space $\mathcal{T}$).
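The vertex labeling above is easy to make concrete. The following Python sketch (with hypothetical helper names) encodes vertices as tuples and implements $\pi_{1:k}$, the parent/children/sibling maps, the ancestor-closure test, and the tree projection of Definition 3.1.

```python
# Vertices are tuples ('root', b1, ..., bn) with each bi in {0, 1}.
ROOT = ('root',)

def pi_1k(v, k):
    """The map pi_{1:k}: project a vertex onto its first k components."""
    return v[:k] if k <= len(v) else v

def pa(v):
    """Parent: drop the last component (None for the root)."""
    return v[:-1] if len(v) > 1 else None

def ch(v, T):
    """Children of v within the tree T."""
    return {u for u in T if len(u) == len(v) + 1 and u[:-1] == v}

def sib(v, T):
    """Siblings of v within T: vertices sharing v's parent (including v)."""
    return {u for u in T if len(u) == len(v) and u[:-1] == v[:-1]}

def is_tree(T):
    """Check the ancestor-closure condition that defines a tree."""
    return all(v[:k] in T for v in T for k in range(1, len(v) + 1))

def project(T, V):
    """Tree projection pi_V(T) = T intersect V (Definition 3.1)."""
    return T & V

T = {ROOT, ROOT + (0,), ROOT + (1,), ROOT + (0, 1)}
print(is_tree(T))                        # -> True
print(project(T, {ROOT, ROOT + (0,)}))   # the subtree {root, root0}
```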
We consider projections onto substructures of the tree space in the next section.

Suppose we have a distribution $P$ over the tree space $\mathcal{T}$. For each $V \in \mathcal{T}$, we can define a marginal random (tree) variable taking values in $\mathcal{T}_V$:
$$P^{\text{marg}}_V(T) = \sum_{\substack{T' \in \mathcal{T} \\ \pi_V(T') = T}} P(T'), \qquad T \in \mathcal{T}_V.$$

For the projection family $\{\pi_V, V \in \mathcal{T}\}$, the atomic projections correspond to vertex sets $V$ that are trees with only one leaf, where a leaf is a vertex with no children (i.e., a vertex $v$ such that $\mathrm{ch}(v) = \emptyset$); we will refer to a tree with a single leaf as a path-tree. The reason the set of atomic projections corresponds to the set of path-trees is because: (1) any tree can be represented by a set of path-trees; and (2) no path-tree can be represented by a set of smaller path-trees. Let $\mathcal{V}_{\text{atom}}$ denote the set of path-trees:
$$\mathcal{V}_{\text{atom}} = \{V \in \mathcal{T} \mid V \text{ is a path-tree}\}.$$

To define structure in distributions over the tree space $\mathcal{T}$, we can apply a graphical model. We use a Bayesian network here; let $H = (\mathcal{V}_{\text{atom}}, N)$ be a structure graph, where $N : \mathcal{V}_{\text{atom}} \times \mathcal{V}_{\text{atom}} \to \{0, 1\}$ is an edge function that is asymmetric (where the asymmetry is used in specifying edges that are directed¹). To define valid distributions, the structure graph must be acyclic and must specify a dependency between any two path-trees in which one is a function of the other; thus, we will assume that: (1) there is a directed edge from every path-tree to its immediate successors (i.e., the path-trees that contain it and have one additional vertex); and (2) there is no directed edge from a path-tree to any path-tree that is a subtree of it. That is:

1. $N(V_0, V_1) = 1$ for all $V_0 \subset V_1$, $|V_0| = |V_1| - 1$.
2. $N(V_1, V_0) = 0$ for all $V_0 \subset V_1$.

These requirements on the edge function $N$ ensure it is consistent with the chain rule:
$$P(T) = P(T_V, V \in \mathcal{V}_{\text{atom}}) = P(T_{\{v_{\text{root}}\}}) \cdot \prod_{i=2}^{\infty} P\big(T_V, V \in \mathcal{V}^{(i)}_{\text{atom}} \,\big|\, T_V, V \in \mathcal{V}^{(1)}_{\text{atom}} \cup \ldots \cup \mathcal{V}^{(i-1)}_{\text{atom}}\big) = P(T_{\{v_{\text{root}}\}}) \cdot \prod_{i=2}^{\infty} P\big(T_V, V \in \mathcal{V}^{(i)}_{\text{atom}} \,\big|\, T_V, V \in \mathcal{V}^{(i-1)}_{\text{atom}}\big) \tag{3.1}$$
where $\mathcal{V}^{(i)}_{\text{atom}} = \{V \in \mathcal{V}_{\text{atom}} : |V| = i\}$ denotes the set containing path-trees of cardinality $i$.

¹ We consider there to be a directed edge from $V_0$ to $V_1$ if $N(V_0, V_1) = 1$ and $N(V_1, V_0) = 0$.

3.1.3 Substructures

Similar to multivariate random variables, the use of projections onto substructures (Section 4.2) is important when modeling random trees. These additional projections allow one to form additional marginal random variables, which in turn allow statistical models to specify more structure in distributions. Let a shifted tree be a pair $(T, v_0)$, where $v_0 \in \Lambda$ is a vertex and $T \subseteq \Lambda$ is a set of vertices such that:

1. $|v_0| \leq |v|$ for all $v \in T$;
2. $v \in T \implies \pi_{1:k}(v) \in T$ for all $|v_0| \leq k \leq |v|$.

In other words, a shifted tree may be thought of as a tree in which $v_0$ serves as the root vertex. For the tree space $\mathcal{T}$, let $\mathcal{T}(v_0)$ denote the set of shifted trees with root vertex $v_0$, i.e.:
$$\mathcal{T}(v_0) = \{T \cap \Lambda(v_0) \mid T \in \mathcal{T}\},$$
where $\Lambda(v_0)$ denotes the set of vertices that are descendants of $v_0$, i.e.:
$$\Lambda(v_0) = \{v \in \Lambda \mid \pi_{1:|v_0|}(v) = v_0\}.$$
For a vertex $v_0 \neq v_{\text{root}}$, the space $\mathcal{T}(v_0) \not\subseteq \mathcal{T}$ is not a subset of the tree space, but rather a substructure of the space $\mathcal{T}$. For a given vertex $v_0$, we define the projection taking trees in $\mathcal{T}$ to trees in $\mathcal{T}(v_0)$ as follows:
Definition 3.2 (Substructure Projections). Let $v_0 \in \Lambda$ be a vertex and let $V = \Lambda(v_0)$ be the set of all its descendants. Define the projection $\pi_V : \mathcal{T} \to \mathcal{T}(v_0)$ as:
$$\pi_V(T) = T \cap V.$$
Let $T_V \equiv \pi_V(T)$ denote the projection of a tree $T$ onto the vertices $V$.

For a random tree $T$ taking values in a tree space $\mathcal{T}$, the substructure projections define marginal random (shifted tree) variables of the form $T_V = \pi_V(T)$, where $V = \Lambda(v)$ and $v \in \Lambda$. Each substructure is itself equipped with tree projections. Hence, allowing for both projections to substructures and then projections to trees within a substructure, the set of all projections on $\mathcal{T}$ is the projection family $\{\pi_V, V \in \mathcal{T}(v), v \in \Lambda\}$, and the atomic projections are just the projections onto individual vertices, i.e., those in the set $\{\pi_{\{v\}}, v \in \Lambda\}$.

To define structure in distributions over the tree space $\mathcal{T}$, we can apply a graphical model. Let $H = (\Lambda, N)$ be a structure graph, where $N : \Lambda \times \Lambda \to \{0, 1\}$ is an edge function. As before, let the structure graph be acyclic, and require that it specify a dependency (either directly or indirectly) between any two vertices in which one is an ancestor of the other.

Similar to general graphs, it will often be useful in the statistical modeling of trees to incorporate invariance assumptions about (shifted) trees that are isomorphic to each other. Recall, two graphs are said to be isomorphic if they share the same edge structure (see Section 5.3). Similarly, two rooted trees are said to be isomorphic if they share the same edge structure, as well as the same partial ordering structure:

Definition 3.3 (Tree Isomorphism). A tree $(T, v_0)$ is isomorphic to a tree $(T', v_0')$ if there exists a bijection $f : T \to T'$ such that:

1. $f(v_0) = v_0'$.
2. $v \in \mathrm{ch}(u) \implies f(v) \in \mathrm{ch}(f(u))$.

Two trees that are isomorphic are denoted by $(T, v_0) \simeq (T', v_0')$. (The set of children $\mathrm{ch}(u)$ for vertex $u$ is with respect to the tree $T$, and the set of children $\mathrm{ch}(f(u))$ for vertex $f(u)$ is with respect to the tree $T'$.)

Example 3.1 (Galton-Watson Model). The Galton-Watson model is a classic random tree model that makes two invariance assumptions. The first is that a vertex is conditionally independent of all other vertices except its parent and siblings:
$$N(v_0, v_1) = \begin{cases} 1, & \text{if } v_0 \in \mathrm{sib}(v_1) \text{ or } v_0 = \mathrm{pa}(v_1) \\ 0, & \text{otherwise}, \end{cases}$$
where the children and sibling functions are with respect to the total tree $T = \Lambda$. The second invariance assumption is that, conditioned on their roots, shifted trees that are isomorphic to each other have the same probability. That is, for all shifted trees $(T, v_0)$ and $(T', v_0')$ such that $(T, v_0) \simeq (T', v_0')$, we assume that
$$P(T \mid T_{\{v_0\}}) = P(T' \mid T'_{\{v_0'\}}).$$
(We have simplified the notation from $P^{\text{marg}}_V$ to $P$; the distribution should be clear from the context.) Thus, we have:
$$P(T) = P(T_{\{v\}}, v \in \Lambda) = P(T_{\{v_{\text{root}}\}}) \cdot \prod_{i=2}^{\infty} P\big(T_{\{v\}}, v \in \Lambda^{(i)} \,\big|\, T_{\{v\}}, v \in \Lambda^{(i-1)}\big) = P(T_{\{v_{\text{root}}\}}) \cdot \prod_{i=2}^{\infty} \prod_{v \in \Lambda^{(i-1)}} P(T_{\mathrm{ch}(v)} \mid T_{\{v\}}) = P(T_{\{v_{\text{root}}\}}) \cdot \prod_{v \in T} \mu(|T_{\mathrm{ch}(v)}|),$$
where $\Lambda^{(i)} = \{v \in \Lambda : |v| = i\}$ denotes the set containing vertices of cardinality $i$, and where $\mu$ is a distribution over the number of children (e.g., over $\{0, 1, 2\}$ in binary trees). The second-to-last equality follows from the independence assumptions for this model, and the last equality follows from the isomorphism invariance assumption.
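A sampler for this model is only a few lines. The following Python sketch draws a binary Galton-Watson tree with a hypothetical offspring distribution $\mu$ on $\{0, 1, 2\}$, truncating at the maximal depth of the vertex space; for simplicity, a lone child always receives label 0 (all names are illustrative).

```python
import random

def sample_gw(mu=(0.25, 0.5, 0.25), max_depth=10, seed=0):
    """Draw a binary Galton-Watson tree; mu gives P(0), P(1), P(2) children."""
    rng = random.Random(seed)
    root = ('root',)
    tree, frontier = {root}, [root]
    while frontier:
        v = frontier.pop()
        if len(v) - 1 >= max_depth:              # respect the finite vertex space
            continue
        k = rng.choices([0, 1, 2], weights=mu)[0]
        for child in (v + (0,), v + (1,))[:k]:   # a lone child is labeled 0
            tree.add(child)
            frontier.append(child)
    return tree

print(len(sample_gw()))  # number of vertices in one sampled tree
```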
3.1.4 Attributed Trees

In many real-world problems, the vertices in trees have attributes associated with them. In most of the literature on attributed trees, grammars are used to define the tree space (i.e., the set of trees the grammar can produce). These grammars produce trees by production rules; beginning with the empty tree, larger trees are incrementally built by the iterative application of these rules. For context-free grammars, distributions can be defined over trees by associating a probability with each production rule. However, it is well-known that this approach (associating probabilities to production rules) does not generalize to the case of context-sensitive grammars (i.e., it does not produce well-defined distributions for this grammar). The reason is that, in context-sensitive grammars, the order in which production rules are applied now matters (in determining what trees can be produced), and hence the grammar must have an ordering policy that specifies the next production rule to apply, given the current tree; this policy is a function that generally depends on many of the vertices in the current tree. Hence, to define a distribution over this tree space, the conditional probability of the next tree in a sequence, given the previous one, would not (in general) be conditionally independent of vertices even far removed in tree distance from the vertices being used by the production rule itself. In other words, to make well-defined distributions for a context-sensitive grammar, very high-order models are required.

In this section, rather than trying to define distributions in terms of grammars, we use a graphical model approach; by using the marginal random variables in a random tree, it becomes tractable to specify dependencies and make well-defined tree distributions that are, loosely speaking, context-sensitive.

Let an attributed tree be a pair $T = (V, X)$, where $V \in \mathcal{T}$ is a tree and $X : V \to \mathcal{X}$ is a function taking each vertex to some attribute value in an attribute space $\mathcal{X}$. For an attribute space $\mathcal{X}$, let $\mathcal{T}_{\mathcal{X}}$ denote the space of attributed trees:
$$\mathcal{T}_{\mathcal{X}} = \{(V, X) \mid V \in \mathcal{T}, \; X : V \to \mathcal{X}\}.$$
Since $\mathcal{X}$ need not be finite, the space $\mathcal{T}_{\mathcal{X}}$ may not be finite either (we have only assumed the vertex space $\Lambda$ is finite, implying a finite number of projections). The definition of a projection on a tree can be extended to a projection on an attributed tree in a straightforward manner: for a tree $T = (V, X)$, let the projection $\pi_{V'}(T) = (V \cap V', X_{V \cap V'})$ be the intersection of the tree's vertices with $V' \subseteq \Lambda$ and the restriction of the attribute function $X$ to these vertices. We let $T_{V'} \equiv \pi_{V'}(T)$. Let an attributed shifted tree be a triple $T = (V, X, v_0)$, where $v_0$ is the designated root and $V \in \mathcal{T}(v_0)$, the space of shifted trees with respect to $v_0$.

The definition of isomorphisms for attributed trees is the same as before, except with the additional requirement that the attribute values also match: the trees $T = (V, X, v_0)$ and $T' = (V', X', v_0')$ are isomorphic to each other if there exists a bijection $f : V \to V'$ such that:

1. $(V, v_0) \simeq (V', v_0')$ with respect to $f$.
2. $X(v) = X'(f(v))$ for all $v \in V$.

Two trees that are isomorphic are denoted by $T \simeq T'$.
Example 3.2 (Probabilistic Context-Free Grammar). A probabilistic context-free grammar is a random tree model that may be thought of as an extension of the Galton-Watson model to attributed trees. The attribute space is assumed to have the form $\mathcal{X} = \mathcal{X}_{\text{leaf}} \cup \mathcal{X}_{\text{non-leaf}}$, where $\mathcal{X}_{\text{leaf}} \cap \mathcal{X}_{\text{non-leaf}} = \emptyset$, and the tree space is assumed to be restricted to trees such that leaf vertices only take attribute values in $\mathcal{X}_{\text{leaf}}$ and non-leaf vertices only take attribute values in $\mathcal{X}_{\text{non-leaf}}$. The model makes two invariance assumptions. The first is that a vertex is conditionally independent of all other vertices except its parent and siblings; in terms of its structure graph, the independence assumptions are:
$$N(v_0, v_1) = \begin{cases} 1, & \text{if } v_0 \in \mathrm{sib}(v_1) \text{ or } v_0 = \mathrm{pa}(v_1) \\ 0, & \text{otherwise}. \end{cases}$$
The second invariance assumption is that, conditioned on their roots, shifted trees that are isomorphic to each other have the same probability. That is, for all shifted trees $T = (V, X, v_0)$ and $T' = (V', X', v_0')$ such that $T \simeq T'$, we assume that $P(T \mid T_{\{v_0\}}) = P(T' \mid T'_{\{v_0'\}})$. Thus, we have:
$$P(T) = P(T_{\{v\}}, v \in \Lambda) = P(T_{\{v_{\text{root}}\}}) \cdot \prod_{i=2}^{\infty} P\big(T_{\{v\}}, v \in \Lambda^{(i)} \,\big|\, T_{\{v\}}, v \in \Lambda^{(i-1)}\big) = P(T_{\{v_{\text{root}}\}}) \cdot \prod_{i=2}^{\infty} \prod_{v \in \Lambda^{(i-1)}} P(T_{\mathrm{ch}(v)} \mid T_{\{v\}}) = P(T_{\{v_{\text{root}}\}}) \cdot \prod_{v \in V(T)} \mu(\tilde{T}_{\mathrm{ch}(v)} \mid \tilde{T}_{\{v\}}),$$
where $\mu$ is a distribution over the space $\mathcal{T}^{(1)}_{\mathcal{X}} = \{T \in \mathcal{T}_{\mathcal{X}} : |v| \leq 2 \text{ for all } v \in V(T)\}$, the set of trees of depth less than or equal to one, and $\tilde{T}$ denotes a tree such that $\tilde{T} \simeq T$ and $\tilde{T} \in \mathcal{T}_{\mathcal{X}}$, i.e., a non-shifted version of the shifted tree $T$. The second-to-last equality follows from the independence assumptions for this model, and the last equality follows from the isomorphism invariance assumption. The distribution $\mu$ is usually assumed to have zero probability over a portion of its input trees (or, equivalently, the tree space is assumed to be constrained).

Example 3.3 (Context-Sensitive Random Tree). We will refer to a random tree model as context-sensitive if, compared to the probabilistic context-free grammar in the previous example, it has the following additional dependencies. As before, a vertex depends on its siblings and parent, but now also depends on certain vertices that are adjacent to its parent as well. For each vertex $v$, define its adjacent vertices as the set
$$\mathrm{adj}(v) = \{v' \in \Lambda : |v'| = |v| \text{ and } d(v, v') \leq 1\},$$
where $d$ denotes some distance function between vertices of the same level in a tree. For example, if one visualizes a tree by depicting it as an image on the plane, as in Figure 3.1, vertices on each level will have an ordering based on which vertices come before others from left to right. In linguistics, this ordering coincides with the order in which words occur in sentences, loosely speaking. More formally, we can assume there is an order relation $\leq$ on each set $\Lambda^{(i)} = \{v \in \Lambda : |v| = i\}$, and then define $d$ based on this ordering. Then, in terms of its structure graph, the independence assumptions are:
$$N(v_0, v_1) = \begin{cases} 1, & \text{if } v_0 \in \mathrm{sib}(v_1) \text{ or } v_0 = \mathrm{pa}(v_1) \text{ or } v_0 \in \mathrm{adj}(\mathrm{pa}(v_1)) \\ 0, & \text{otherwise}. \end{cases}$$
This structure graph could also have directed edges not just between adjacent levels of the tree, but across multiple levels of the tree. Similar to probabilistic context-free grammars, this random tree model also makes isomorphism assumptions, except with respect to subsets of vertices that may not be trees.
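The PCFG case can be sketched the same way. In the following Python fragment, the rule set, symbols, and probabilities are entirely hypothetical, with the nonterminals playing the role of $\mathcal{X}_{\text{non-leaf}}$ and the terminal 'a' the role of $\mathcal{X}_{\text{leaf}}$.

```python
import random

# Hypothetical rules: each nonterminal expands to child attributes with the
# given probabilities; 'a' is a terminal (leaf) attribute.
RULES = {
    'S':  [(0.6, ('NP', 'VP')), (0.4, ('a',))],
    'NP': [(1.0, ('a',))],
    'VP': [(0.5, ('NP', 'NP')), (0.5, ('a',))],
}

def sample_pcfg(seed=0):
    """Sample an attributed tree: a dict mapping vertex tuples to attributes."""
    rng = random.Random(seed)
    root = ('root',)
    tree, frontier = {root: 'S'}, [root]
    while frontier:
        v = frontier.pop()
        sym = tree[v]
        if sym not in RULES:                     # leaf attribute: stop expanding
            continue
        probs, prods = zip(*RULES[sym])
        children = rng.choices(prods, weights=probs)[0]
        for i, attr in enumerate(children):
            u = v + (i,)
            tree[u] = attr
            frontier.append(u)
    return tree

print(sample_pcfg())
```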
3.2 Merging Models

In the previous section, we used a vertex space in which the label of each vertex encoded its entire ancestry; hence, if we know a vertex is in a tree, then we also know its ancestors, and this limits one to branching models. In this section, we consider a vertex space in which the label of each vertex instead encodes its descendants, allowing merging models for random trees: beginning with some set of initial objects, trees can be formed by iteratively merging them. Examples include the modeling of cell fusion (i.e., cells that combine) and the modeling of mergers between industrial corporations (which, in the end, form monopolies). We present a simplified version of the vertex space here; it can be extended to more sophisticated forms. As before, due to the structure of trees, if the vertices in them are given appropriate labels, then the edges are deterministic.

Suppose we have some set of vertices $\Lambda_{\text{leaf}}$ such that, for every tree, its leaves are in this set; beginning with some set of vertices $V_{\text{leaf}} \subseteq \Lambda_{\text{leaf}}$, trees will be constructed by merging them. Letting $N = |\Lambda_{\text{leaf}}|$, define the vertex space $\Lambda$ to be:
$$\Lambda = \mathcal{P}(\{1, \ldots, N\}) \setminus \{\emptyset\}.$$
Thus, a vertex $v \in \Lambda$ has the form $v \subseteq \{1, \ldots, N\}$. As before, we assume binary trees for simplicity; a tree $T \subseteq \Lambda$ is a set of vertices such that:

1. There exists a vertex $v \in T$ such that $|v| > |v'|$ for all $v' \in T$, $v' \neq v$. This vertex corresponds to the root of the tree.
2. For each vertex $v \in T$, its cardinality is $|v| = 2^n$ for some $n \in \{0, 1, 2, \ldots\}$. The value $n$ for a vertex corresponds to its level, which we denote by $\mathrm{level}(v)$.
3. For each vertex $v \in T$ such that $|v| > 1$, there exists a binary partition $\{v', v''\}$ of this vertex (i.e., $v = v' \cup v''$ and $v' \cap v'' = \emptyset$) such that $v', v'' \in T$ and $|v'| = |v''|$.

An example tree is shown in Figure 3.2. If $T = \emptyset$, we will refer to it as the empty tree.

Figure 3.2: An example tree using the vertex labeling in the merging model (Section 3.2).

In this tree definition, a vertex $v \in T$ is a leaf if and only if it has cardinality one (i.e., $|v| = 1$). Hence, the label of each individual vertex determines whether it is a leaf or not (unlike in the previous section). For a tree $T$, let $\mathrm{leaf}(T)$ denote the set of its vertices that are leaves:
$$\mathrm{leaf}(T) = \{v \in T : |v| = 1\}.$$
This distinction, in turn, means that for a subset $T' \subseteq T$ to be a tree (i.e., a subtree of $T$), its leaves must be a subset of the leaves of $T$ (i.e., $\mathrm{leaf}(T') \subseteq \mathrm{leaf}(T)$). This requirement is in contrast to the previous section, where trees and their subtrees had to have the root vertex in common.

Let $\mathcal{T}$ be a set of trees that is projectable, i.e.:
$$T \in \mathcal{T} \implies T' \in \mathcal{T} \text{ for all } T' \in S(T),$$
where $S(T)$ denotes the set of all subtrees of $T$. As before, we can then define tree projections:

Definition 3.4 (Tree Projections). Let $V \in \mathcal{T}$ be a set of vertices. Define the projection $\pi_V : \mathcal{T} \to \mathcal{T}_V$, where $\mathcal{T}_V = \{T \in \mathcal{T} \mid T \subseteq V\}$, as follows:
$$\pi_V(T) = T \cap V.$$
Let $T_V \equiv \pi_V(T)$ denote the projection of a tree $T$ onto the vertices $V$.

In the case of the projection family $\{\pi_V, V \in \mathcal{T}\}$, the atomic projections are not a proper subset, but rather coincide with the entire projection family. However, assuming projections to substructures as well, as was done in the branching models, we then arrive at the same set of atomic projections, the set of individual vertices $\{\pi_{\{v\}}, v \in \Lambda\}$.
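The three conditions above are straightforward to check mechanically. The following Python sketch (with hypothetical names) encodes merging-model vertices as frozensets and tests the validity of a candidate tree.

```python
def is_merge_tree(T):
    """Check the three conditions defining a binary merging-model tree."""
    if not T:
        return True                              # the empty tree
    root = max(T, key=len)
    # 1. A unique vertex of maximal cardinality (the root).
    if sum(1 for v in T if len(v) == len(root)) != 1:
        return False
    # 2. Every cardinality is a power of two (the vertex's level).
    if any(len(v) & (len(v) - 1) for v in T):
        return False
    # 3. Every non-leaf splits into two disjoint equal-size halves in T.
    for v in T:
        if len(v) > 1:
            halves = [u for u in T if u < v and len(u) == len(v) // 2]
            if not any(a | b == v for a in halves for b in halves if not a & b):
                return False
    return True

T = {frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4}),
     frozenset({1, 2}), frozenset({3, 4}), frozenset({1, 2, 3, 4})}
print(is_merge_tree(T))  # -> True
```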
To define structure in distributions over the tree space $\mathcal{T}$, we can apply a graphical model. We use a Bayesian network here; let $H = (\Lambda, N)$ be a structure graph, where $N : \Lambda^2 \to \{0, 1\}$ is an edge function that is asymmetric (where the asymmetry is used in specifying edges that are directed). To define valid distributions, the structure graph must be acyclic; for merging models, we assume that edges are in the direction from leaves to root. We must specify a dependency between any two vertices in which one is a function of the other; thus, we assume:

1. $N(v_0, v_1) = 1$ for all $v_0 \subset v_1$ with $\mathrm{level}(v_0) = \mathrm{level}(v_1) - 1$.
2. $N(v_1, v_0) = 0$ for all $v_0 \subset v_1$.

These requirements on the edge function $N$ ensure it is consistent with the chain rule:
$$P(T) = P(T_{\{v\}}, v \in \Lambda) = P(\mathrm{leaf}(T)) \cdot \prod_{i=1}^{\infty} P\big(T_{\{v\}}, v \in \Lambda^{(i)} \,\big|\, T_{\{v\}}, v \in \Lambda^{(1)} \cup \ldots \cup \Lambda^{(i-1)}\big) = P(\mathrm{leaf}(T)) \cdot \prod_{i=1}^{\infty} P\big(T_{\{v\}}, v \in \Lambda^{(i)} \,\big|\, T_{\{v\}}, v \in \Lambda^{(i-1)}\big) \tag{3.2}$$
where $\Lambda^{(i)} = \{v \in \Lambda : \mathrm{level}(v) = i\}$ denotes the set of vertices on level $i$.

If one assumes that a vertex can only merge with one other vertex on a given layer, then complex dependencies are introduced in which a vertex depends on more than just its children; this situation is similar to that of context-sensitive grammars in branching models, except in the reverse direction. In this case, complex models can result.

3.3 General Models

In the previous sections, we used specialized vertex spaces for defining trees; using vertices with labels that specify their set of possible children or possible parents (and assuming, in any valid tree, these sets are non-overlapping), trees have deterministic edges, given the vertices. However, we could instead define trees in terms of an arbitrary vertex space and then define the tree space by restricting the corresponding graph space to only trees. This has the advantage of allowing one to employ any type of graphical model for random graphs (Section 2). In this more general formulation, distributions need not be defined in terms of how trees are incrementally constructed by a top-down or bottom-up process, but rather how they deconstruct (e.g., into subtrees). This allows, for some problems, a more natural method for defining distributions, since it may allow a more compact representation of dependencies.

We will assume the vertex space has some minimal structure, allowing us to define trees based on basic conditions on the vertices and edges. Suppose we have a vertex space $\Lambda$ of the form
$$\Lambda = \bigcup_{n=0}^{N} \Lambda^{(n)},$$
where each space $\Lambda^{(n)}$ corresponds to the set of vertices that can occur on the $n$th level of the tree (i.e., the distance from a vertex in this set to the root is assumed to be $n$ in any tree). Further, we assume $\Lambda^{(n)} \cap \Lambda^{(m)} = \emptyset$ for every $n \neq m$. For example, for modeling real-world scenes, one often assumes some fixed hierarchy of objects (e.g., cars occur on the $k$th level and car tires occur on the $(k+1)$th level). Finally, suppose the edge space $\Lambda_E = \{0, 1\}$ is binary.
Let a tree be a graph $T = (V, E)$ with respect to this vertex space and edge space (i.e., where $V \subseteq \Lambda$ is a set of vertices and $E : V^2 \to \{0, 1\}$ a binary edge function) such that the following conditions are satisfied, letting $V^{(n)} \equiv V \cap \Lambda^{(n)}$:

1. There is only a single root vertex: if $V \neq \emptyset$, then $|V^{(0)}| = 1$.
2. Every (non-root) vertex has one and only one parent: for $n = 1, 2, \ldots$, for all $v \in V^{(n)}$, we have:
$$\sum_{v' \in V^{(n-1)}} E(v', v) = 1.$$
3. There are only edges between adjacent layers: for all $n, m$ such that $|n - m| \neq 1$, we have $E(v, v') = 0$ for all $v \in V^{(n)}$ and $v' \in V^{(m)}$.

Let $\mathcal{T}$ be the space of all such trees. A distribution over this space can be defined using a random graph model; in particular, we may apply an undirected or partially directed model. As mentioned, this additional flexibility may be useful for modeling problems in which there is no obvious causal mechanism.

4 General Random Objects

In this section, we consider a general formulation of graphical models on a sample space $\Omega$ based on a family of random variables with basic consistency and completeness properties. In the literature, the definition of consistency for random variables is stated in terms of distributions ([Chung, 2001], [Lamperti, 2012]). In this work, however, we find it convenient to define consistency in terms of the functions themselves (rather than the distributions induced by them). This more elemental definition will be useful in modeling over more general spaces, where to make independence assumptions on distributions, a consistent projection family must first be specified. The projections from this family then define random variables that are consistent (referred to as marginal variables). We begin by considering the case in which projections are from a given sample space to subsets of it; the random graph model discussed in Section 2 uses projections of this form. Then, we consider more general projections, of which the random tree model discussed in Section 3 and the traditional formulation of graphical models for multivariate random variables are instances. For simplicity, we limit the formulation here to finite projection families.

4.1 Projection Families

Suppose we have a random object taking values in some space $\Omega$, and suppose we have a family $\Pi$ of projections where each projection has the form $\pi : \Omega \to \Omega' \subseteq \Omega$. Recall, a function $\pi$ is a projection if $\pi \circ \pi = \pi$, i.e., projecting an object more than once does not change its value. In order to produce random variables that are consistent with each other, the projections must be consistent with each other:

Definition 4.1 (Consistency). The projections $\pi_1 : \Omega \to \Omega_1$ and $\pi_2 : \Omega \to \Omega_2$ are consistent if:
$$\Omega_1 \subseteq \Omega_2 \implies \pi_1 \circ \pi_2 = \pi_1, \qquad \Omega_2 \subseteq \Omega_1 \implies \pi_2 \circ \pi_1 = \pi_2.$$

In other words, two projections are consistent if: (a) neither one's image is a subset of the other's; or (b) projecting an object onto the smaller space is the same as first projecting the object onto the larger space, and then projecting onto the smaller space. We say that a projection family is consistent if every pair of projections in it is consistent. A consistent family of projections defines a consistent family of random variables (referred to as marginal variables).
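Definition 4.1 can be checked directly on examples. In the following Python sketch (all names hypothetical), objects are subsets of a ground set, $\pi_V(w) = w \cap V$, and the image of $\pi_V$ is identified with the set $V$ for the subset test.

```python
def make_projection(V):
    """The projection pi_V(w) = w & V on subsets of a ground set."""
    return lambda w: w & V

def consistent(p1, img1, p2, img2, samples):
    """Check Definition 4.1 on a list of sample objects."""
    ok = True
    if img1 <= img2:                  # image of p1 inside image of p2
        ok = ok and all(p1(p2(w)) == p1(w) for w in samples)
    if img2 <= img1:
        ok = ok and all(p2(p1(w)) == p2(w) for w in samples)
    return ok

V1, V2 = frozenset({1, 2}), frozenset({1, 2, 3})
samples = [frozenset({1, 3, 4}), frozenset({2}), frozenset()]
print(consistent(make_projection(V1), V1,
                 make_projection(V2), V2, samples))  # -> True
```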
Although this definition of consistent projections corresponds to the definition of consistent random variables, it will be useful when formulating graphical models to assume a stronger form:

Definition 4.2 (Strong Consistency). The projections $\pi_1 : \Omega \to \Omega_1$ and $\pi_2 : \Omega \to \Omega_2$ are strongly consistent if $\Omega_1 \cap \Omega_2 \neq \emptyset$ implies that there exists a projection $\pi_3 : \Omega \to \Omega_1 \cap \Omega_2$ consistent with $\pi_1$ and $\pi_2$. If this projection exists, then it is unique.

As before, we say that a projection family is strongly consistent if every pair of projections in it is strongly consistent. The canonical projection family for random graphs (Section 2.2) is strongly consistent, and the canonical projection family for random vectors (i.e., the coordinate projections) is strongly consistent as well (see next section). We illustrate the importance of strong consistency in modeling with an example:

Example 4.1 (Consistency and Conditional Independence). Let $\Omega$ be a sample space with a distribution over it. In order to model this distribution using independence assumptions, we need to specify some random variables. Suppose we have two projections $\pi_0 : \Omega \to \Omega_0$ and $\pi_1 : \Omega \to \Omega_1$ such that $\Omega_0 \cap \Omega_1 \neq \emptyset$, $\Omega_0 \not\subseteq \Omega_1$, and $\Omega_1 \not\subseteq \Omega_0$. Since neither projection's image is a subset of the other's, these two projections are consistent. However, the question arises about the nature of their agreement when objects are projected to the intersection $\Omega_0 \cap \Omega_1$; there are two scenarios to consider.

First, suppose $\pi_0$ and $\pi_1$ are strongly consistent; then there exists a unique projection $\pi_2 : \Omega \to \Omega_0 \cap \Omega_1$ consistent with both $\pi_0$ and $\pi_1$ (and so the set $\{\pi_0, \pi_1, \pi_2\}$ is consistent). A standard assumption is that the random variables $\pi_0$ and $\pi_1$ are conditionally independent given $\pi_2$, which we denote by $(\pi_0 \perp \pi_1 \mid \pi_2)$.

Now, suppose $\pi_0$ and $\pi_1$ are not strongly consistent; then there does not exist a projection $\pi_2 : \Omega \to \Omega_0 \cap \Omega_1$ consistent with both $\pi_0$ and $\pi_1$. In order to specify a conditional independence assumption analogous to the one above, define $\pi_2' : \Omega \to \Omega_0 \cap \Omega_1$ as the projection consistent with $\pi_0$, and define $\pi_2'' : \Omega \to \Omega_0 \cap \Omega_1$ as the projection consistent with $\pi_1$. Now, if we want to specify conditional independence between $\pi_0$ and $\pi_1$, we must condition on both $\pi_2'$ and $\pi_2''$, i.e.: $(\pi_0 \perp \pi_1 \mid \pi_2', \pi_2'')$. This illustrates that, to specify conditional independence between random variables that are not strongly consistent, the structure graph in a graphical model both needs to be larger (i.e., incorporate more variables) and needs more edges than if they were. This example motivates formulating graphical models in terms of strongly consistent projections.

It will be convenient to index projection families so as to indicate which projections' images are subsets of others'. This can be done as follows. For a finite consistent projection family $\Pi$, there exists some finite set $B$ and $\mathcal{A} \subseteq \mathcal{P}(B)$ such that we can write:
$$\Pi = \{\pi_A : \Omega \to \Omega_A, \; A \in \mathcal{A}\},$$
where:

1. $\Omega_B = \Omega$.
2. $A' \subseteq A \iff \Omega_{A'} \subseteq \Omega_A$.
3. $A' \cap A = \emptyset \iff \Omega_{A'} \cap \Omega_A = \emptyset$.

Further, we will assume that $B$ is a minimal set for indexing $\Pi$ in this way (in the sense that there does not exist $B'$ such that $|B'| < |B|$ and the above holds). Thus, the indices in an index set $\mathcal{A}$ show when the images of projections intersect or are subsets. Henceforth, we assume projection families are indexed in this way.
We now define a completeness condition for a projection family, which will also be useful in modeling:

Definition 4.3 (Completeness). A projection family $\Pi$ is complete if its index set $\mathcal{A}$ is closed under intersections:
$$A, A' \in \mathcal{A} \implies A \cap A' \in \mathcal{A}.$$

In other words, for any two projections $\pi_A : \Omega \to \Omega_A$ and $\pi_{A'} : \Omega \to \Omega_{A'}$ in $\Pi$, a projection of the form $\pi_{A \cap A'} : \Omega \to \Omega_{A \cap A'}$ also exists in it. Notice that if a projection family $\Pi$ is consistent and complete, then it is also strongly consistent. Conversely, if a projection family is strongly consistent, then it can be made complete by augmenting it with additional projections. For modeling purposes, the value of completing a projection family in this sense is that it provides a larger space of possible independence assumptions. Since the traditional formulation of graphical models is in terms of a consistent, complete system of projections, we will define the extended formulation likewise.

We now define the notion of atomic projections. Loosely speaking, for a projection family $\Pi$, a projection in it is atomic if: (1) there does not exist a projection in this family that projects to a strict subset of its image; or (2) if there are projections in this family that project to subsets of its image, then this set of projections loses information. The second condition ensures that any object projected by a set of atomic projections can be reconstructed. To define this more formally, we introduce some notation. For a projection family $\Pi$ indexed by $\mathcal{A}$, let $\Pi_{\mathcal{B}} \subseteq \Pi$ denote the subset of projections indexed by $\mathcal{B} \subseteq \mathcal{A}$, i.e.:
$$\Pi_{\mathcal{B}} = \{\pi_A, \; A \in \mathcal{B}\}.$$
We say that a set of projections $\Pi_{\mathcal{B}}$ is invertible over a set $\Omega' \subseteq \Omega$ if there exists a function $\Pi_{\mathcal{B}}^{-1}$ such that
$$\Pi_{\mathcal{B}}^{-1}(\Pi_{\mathcal{B}}(w)) = w, \quad \forall w \in \Omega'.$$
We define atomic projections as follows:

Definition 4.4 (Atomic Projections). For a finite projection family $\Pi$ indexed by $\mathcal{A}$, a projection $\pi_A$ in this family is atomic if:

1. $\mathcal{B}(A) = \emptyset$; or
2. if $\mathcal{B}(A) \neq \emptyset$, then the projection set $\Pi_{\mathcal{B}(A)}$ is not invertible over $\Omega_A$,

where $\mathcal{B}(A) = \{A' \in \mathcal{A} \mid A' \subset A\}$.

In other words, for a projection family, the atomic projections are those with either the smallest images, or, if there are projections with smaller images, those whose images cannot be reconstructed from them. We will call a random variable atomic if it corresponds to an atomic projection. Finally, to be used in modeling, we need to assume that a projection family has enough coverage over a space $\Omega$ so that it can be used for representing objects in it:

Definition 4.5 (Atomic Representation). For a finite projection family $\Pi$ over $\Omega$, a set of atomic projections $\Pi_{\text{atom}} \subseteq \Pi$ is an atomic representation of the space $\Omega$ if it is invertible over $\Omega$.

If a finite projection family $\Pi$ over $\Omega$ contains the identity projection $I_\Omega$, then there exists an atomic representation of $\Omega$ within $\Pi$. If a finite projection family $\Pi$ over $\Omega$ contains the identity projection $I_\Omega$ and is consistent and complete, then it has a unique atomic representation. For defining a graphical model over $\Omega$ with respect to a projection family $\Pi$, we will let its structure graph correspond to an atomic representation of $\Omega$ within $\Pi$ (i.e., the vertices in the structure graph will correspond to the projections in the atomic representation); if the atomic representation is unique, then so is the vertex set in the structure graph. We can now express, for graphical models, the requirements on projection families.
Suppose we have a consistent, complete system of projections $\Pi$ over an object space $\Omega$. Further, assume $\Pi$ is finite, non-empty, and contains the identity projection $I_\Omega$. With only these assumptions on the projection family, we can model distributions over $\Omega$ using independence and factorization, the invariances used in graphical models. The projections, since they are consistent, define a set of marginal random variables, as well as a unique set of atomic random variables, and so we can encode independence assumptions in a compact form using them. A projection family on the space $\Omega$ also gives rise to a Gibbs representation for distributions over it:
$$P(w) = \frac{1}{Z} \exp\left[ \sum_{A \in \mathcal{A}} \psi_A(w_A) \right], \tag{4.1}$$
where $\psi_A : \Omega_A \to \mathbb{R} \cup \{-\infty\}$ and $w_A \equiv \pi_A(w)$, facilitating factorization and the use of undirected models. If there are functional dependencies in the atomic projections¹, then the objects in $\Omega$ have structural constraints, and the structure graph must respect them. If there are no functional dependencies, then the Hammersley-Clifford theorem may be directly applied; otherwise, a partially directed network may be necessary.

¹ There are functional dependencies in the atomic representation whenever there exists a projection $\pi_A$ in it such that the set $\mathcal{B}(A) = \{A' \in \mathcal{A} \mid A' \subset A\}$ is non-empty.

The formulation of graphical models given here encompasses the random graph models from Section 2. However, notice that in graphical models for multivariate random variables, the projections are not to subsets of the sample space, but rather to substructures. We now turn to this topic, and extend the formulation to include these more general projections.
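To make equation (4.1) concrete, the following Python sketch evaluates the unnormalized log-probability $\sum_A \psi_A(\pi_A(w))$ for a toy family of singleton projections; the potentials and projections are hypothetical placeholders, and $-\infty$ encodes forbidden configurations.

```python
import math

def log_potential(w, projections, potentials):
    """sum_A psi_A(pi_A(w)); -inf marks configurations of probability zero."""
    total = 0.0
    for A, pi in projections.items():
        total += potentials[A](pi(w))
        if total == -math.inf:        # a forbidden configuration: stop early
            break
    return total

# Objects are subsets of {1, 2, 3}; one potential per singleton projection.
projections = {i: (lambda w, V=frozenset({i}): w & V) for i in (1, 2, 3)}
potentials = {i: (lambda wA, i=i: 0.5 * i if wA else 0.0) for i in (1, 2, 3)}
w = frozenset({1, 3})
print(log_potential(w, projections, potentials))  # -> 0.5 + 1.5 = 2.0
```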
4.2 Substructures

In the previous section, we discussed projection families in which each projection's image was a subset of its domain (i.e., for each projection $\pi : \Omega \to \Omega'$, we have $\Omega' \subseteq \Omega$). In this section, we consider functions on $\Omega$ whose images may not be a subset (i.e., $\Omega' \not\subseteq \Omega$); for convenience, we will view these functions as mapping into substructures, and refer to them as projections. This extension is important because it allows projection families to be larger, which in turn allows additional structure to be incorporated within models. An example of a projection from a space $\Omega$ to a substructure is the projection of a vector onto one of its components. For simplicity, the development here is not given in terms of substructures; the ideas can be stated in terms of the projections, without making more explicit assumptions about the structure of the space $\Omega$.

Suppose we want to define a distribution over a space $\Omega$ using graphical models, and further, suppose we have a set of projections of the form $\pi : \Omega \to \Omega'$; since the images of these projections are not necessarily subsets of $\Omega$, the composition of these projections is no longer well-defined (i.e., the image of one projection is not necessarily a subset of the domain of another projection). In order to define a notion of consistency for this projection family, there must exist projections between these spaces. For a projection family $\Pi = \{\pi_A : \Omega \to \Omega_A, A \in \mathcal{A}\}$, where $\mathcal{A}$ is an index set, suppose that the following conditions hold:

1. (Completeness) For all $A, A' \in \mathcal{A}$ such that $A \cap A' \neq \emptyset$, we have that $A \cap A' \in \mathcal{A}$.
2. (Consistency) For all $A, A' \in \mathcal{A}$ such that $A' \subseteq A$, there exists a projection of the form $\pi_{A \to A'} : \Omega_A \to \Omega_{A'}$, and this projection is defined by $\pi_{A \to A'} \circ \pi_A = \pi_{A'}$.

If these conditions hold, then we say the projection family $\Pi$ is consistent and complete (this is a natural extension of the definitions in the previous section). Incorporating these projections into the above projection family, we have the following set of projections:
$$\tilde{\Pi} = \{\pi_{A \to A'} : \Omega_A \to \Omega_{A'}\},$$
where $A, A' \in \mathcal{A}$ and $A' \subseteq A$.

Example 4.2 (Vectors). A simple example of projections to substructures is the familiar coordinate projections used in modeling multivariate random variables. Let $\Omega$ be a product space of the form $\Omega = \mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$, and let the coordinate projection $\pi_A : \mathcal{X} \to \mathcal{X}_A$ be
$$\pi_A(x) = x_A,$$
where $\mathcal{X}_A = \prod_{i \in A} \mathcal{X}_i$ and $A \subseteq \{1, \ldots, n\}$. Since $\mathcal{X}_A \not\subseteq \mathcal{X}$ for $A \neq \{1, \ldots, n\}$, these projections are to substructures of the original space. Let $\Pi$ be the projection family
$$\Pi = \{\pi_A, \; A \in \mathcal{A}\},$$
where $\mathcal{A} = \mathcal{P}(\{1, \ldots, n\})$. Then the projection family $\Pi$ is complete (since the index set $\mathcal{A} = \mathcal{P}(\{1, \ldots, n\})$ is closed under intersection), and is also consistent (since for any $A, A' \in \mathcal{A}$ such that $A' \subseteq A$, there exists a projection between substructures of the form $\pi_{A \to A'} : \mathcal{X}_A \to \mathcal{X}_{A'}$ such that $\pi_{A \to A'} \circ \pi_A = \pi_{A'}$). Another example is the projection to substructures used in modeling random trees (Section 3.1.3).

For a sample space $\Omega$ with a distribution $P$ over it, if we have a finite, consistent, complete system of substructure projections $\Pi$ on $\Omega$, then we can define a marginal random variable for each index $A \in \mathcal{A}$ in this family, and similarly, we may also define a Gibbs form (equation 4.1). Hence, we have arrived at a general framework based on general projections.
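The consistency property in Example 4.2 can be verified numerically. The following Python sketch (hypothetical helper names) builds coordinate projections $\pi_A$ and the substructure projections $\pi_{A \to A'}$, and checks that $\pi_{A \to A'} \circ \pi_A = \pi_{A'}$.

```python
def pi(A):
    """Coordinate projection pi_A onto the sorted index set A."""
    A = tuple(sorted(A))
    return lambda x: tuple(x[i] for i in A)

def pi_between(A, Aprime):
    """The substructure projection pi_{A -> A'} for A' a subset of A."""
    A = tuple(sorted(A))
    pos = {i: k for k, i in enumerate(A)}   # where index i sits inside x_A
    return lambda xA: tuple(xA[pos[i]] for i in sorted(Aprime))

x = ('a', 'b', 'c', 'd')                    # a point of X1 x X2 x X3 x X4
A, Ap = {0, 1, 3}, {1, 3}
# Consistency: pi_{A -> A'} o pi_A == pi_{A'}
print(pi_between(A, Ap)(pi(A)(x)) == pi(Ap)(x))  # -> True
```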
4.3 Compositional Systems

A projection family on an object may be viewed as defining a compositional system. Compositionality refers to the phenomenon in which objects are composed of parts, which in turn are themselves composed of parts, etc., and in which the same part can occur in multiple larger parts. For a given set of objects $\Omega$, a projection family on it defines the decomposition of objects into a hierarchy of parts, and this may be viewed as a top-down approach to defining a compositional system. This approach for defining these systems differs from that taken in [Geman et al., 2002]; in that work, given a set of primitive parts $T \subseteq \Omega$, a set of composition rules is used to define the allowable groupings of parts into larger parts, and this may be viewed as a bottom-up approach to defining compositional systems. The alternative perspective offered here on these systems is very different from that taken in the literature; our intention is only to provide a context for how graphical models, as formulated above, fit among other general frameworks, and this is only one possible interpretation of the relationship.

To illustrate modeling a compositional system, consider character recognition, a classic problem in the field of computer vision. The goal is to design a computer vision system that takes images of handwritten characters and determines the character being displayed. Let the object space $\Omega$ be the space of possible binary (i.e., black and white) images with a label (i.e., we assume every image has a label attached to it from some label set). Following along the lines of the example given in [Geman et al., 2002], the most primitive parts may be images with only a single black point, having the label 'point'; the next simplest parts might be images with only two points within close proximity, having the label 'linelet'; these objects can be combined to form objects with the label 'line', which in turn can be combined to form objects with the label 'L-junction', and so on, until finally objects with the label 'character' are formed. If, instead of defining a part $t \in \Omega$ as a single element, we let it be a random variable $t$ that takes values in a subset of this space, then we may associate these parts to projections (taking images and their labels to a subset of the images and their labels). These projections are consistent with each other, and so define a consistent set of marginal distributions over these parts. In turn, these marginal variables allow a graphical model approach to be applied to the problem, allowing the efficient estimation of distributions over the object space $\Omega$.

We note that in the approach to compositional systems based on projection families, the composition rules (describing how to combine parts into larger parts) are provided, in a sense, by the (marginal) probabilities of a part having some value, conditioned on the values of its constituent parts (the values of its projections). When this probability is nonzero, one may interpret a composition rule as dictating that these constituent parts are combinable.

5 Examples

In this section, we consider the practical application of the models described in the previous sections. Since the random graph models are more general than the random tree ones, and because they differ more from the models in the literature, we focus our attention on them here. We will use factorization to specify structure in distributions (which, for graphs, differs from specifying independence assumptions; see Section 2.6); the reason is that, for the examples considered here, this invariance is more straightforward to specify and operate on. We also discuss invariances on distributions based on graph isomorphisms, an assumption used in many random graph models. The use of these invariances on unattributed graphs, however, causes models to be susceptible to degeneracy problems. To avoid this issue, it is important for models that employ these invariances to assign latent variables to the vertices, or equivalently, to use attributed graphs. We will assume models that take a simple exponential form based on the use of template graphs. We illustrate the ideas with several examples.

5.1 Compact Distributions

Although a distribution over a finite graph space $\mathcal{G}$ can always be specified by directly assigning a probability to each graph in it, in practice we need to make assumptions about the distribution. In Section 2.6, we discussed Gibbs form and the specification of structure based on factorization, where a Gibbs distribution has the form:
$$P(G) = \exp\left[ \psi_0 + \sum_{G' \in S_1(G)} \psi_1(G') + \sum_{G' \in S_2(G)} \psi_2(G') + \ldots \right].$$
In the examples considered here, we find it natural to allow slightly more structure than can be obtained only through the specification of factors; we also want to be able to assign individual graphs a factor value of zero.
In other words, we are interested in defining structure through the specification of a small subset $\mathcal{G}_{\text{basis}} \subset \mathcal{G}$ such that, by assigning a potential value to each graph in $\mathcal{G}_{\text{basis}}$, the probability of every graph in $\mathcal{G}$ can be determined. Hence, given a basis, we assume the potential of any graph $G \notin \mathcal{G}_{\text{basis}}$ is zero, and define the probability of a graph as
$$P(G) = \frac{1}{Z} \exp\left[ \sum_{G' \in C(G)} \psi(G') \right], \tag{5.1}$$
where $\psi : \mathcal{G}_{\text{basis}} \to \mathbb{R} \cup \{-\infty\}$ and $C(G) \equiv S(G) \cap \mathcal{G}_{\text{basis}}$.

5.2 Additional Structure

The model given in equation 5.1 can be further simplified by assuming the function $\psi$ has some structure. This can be done in many ways; the simplest is to assign the same function value to graphs that are similar in some sense. For example, we might want graphs that are isomorphic to each other to have equal values (i.e., setting $\psi(G_1) = \psi(G_2)$ for all $G_1, G_2 \in \mathcal{G}_{\text{basis}}$ that are isomorphic). More generally, we can specify structure in $\psi$ by assuming an additive relationship of the form:
$$\psi(G) = \sum_{k=1}^{K} \lambda_k \, \mathbb{1}\{G \in D_k\},$$
where each $D_k \subset \mathcal{G}_{\text{basis}}$ is a subset of the basis and each $\lambda_k$ a real number. Then the model in equation 5.1 simplifies to:
$$P(G) = \frac{1}{Z} \exp\left[ \sum_{k=1}^{K} \lambda_k U_k(G) \right], \tag{5.2}$$
where $U_k(G) = \#\{G' \in S(G) : G' \in D_k\}$ is the number of subgraphs of type $k$ in the graph $G$. We will find it convenient to reformulate each set $D_k$ as a binary function: define a function $R_k : \mathcal{G} \to \{0, 1\}$ such that
$$R_k(G) = 1 \iff G \in D_k.$$
Then, equivalently, we have that $U_k(G) = \#\{G' \in S(G) : R_k(G') = 1\}$. We refer to the binary functions $R_k$ as compatibility maps. We now consider methods for specifying these maps.

5.3 Graph Isomorphisms

An important way to compare two graphs is based on how their parts compare. In this section, we consider isomorphisms, a comparison method based on second-order subgraphs; two graphs are said to be isomorphic if they share the same edge structure:

Definition 5.1 (Graph Isomorphism). A graph $G = (V, E)$ is isomorphic to a graph $G' = (V', E')$ if there exists a bijection $f : V \to V'$ such that:
$$E(v, v') = E'(f(v), f(v')) \text{ for all } v, v' \in V.$$
Two graphs that are isomorphic are denoted by $G \simeq G'$.

A distribution $P$ over a graph space $\mathcal{G}$ is said to be invariant to isomorphisms if any two graphs that are isomorphic have the same probability, i.e.:
$$G \simeq G' \implies P(G) = P(G'),$$
where $G, G' \in \mathcal{G}$. We now consider some isomorphism variations that will be useful for attributed graphs.

5.4 Attributed Graph Isomorphisms

When modeling unattributed graphs, it is often important to associate attributes with the vertices in these graphs. The attributes, in this case, may be thought of as latent variables, which can reduce the order of models. Suppose we have a finite vertex space $\Lambda_V$, an edge space $\Lambda_E$, and an attribute space $\mathcal{X}$. Recall from Section 2.8.2, an attributed graph has the form $G = (V, X, E)$, where:
$$V \subseteq \Lambda_V, \qquad X : V \to \mathcal{X}, \qquad E : V \times V \to \Lambda_E.$$
The simplest isomorphism for attributed graphs is based on the edge structure and the attributes on individual vertices:

Definition 5.2 (First-order Isomorphism). A graph $G = (V, X, E)$ is first-order isomorphic to a graph $G' = (V', X', E')$ if there exists a bijection $f : V \to V'$ such that:

1. $X(v) = X'(f(v))$ for all $v \in V$.
2. $E(v, v') = E'(f(v), f(v'))$ for all $v, v' \in V$.

Two attributed graphs that are first-order isomorphic are denoted by $G \simeq_1 G'$. This definition is a natural extension of Definition 5.1 to attributed graphs. As an example, suppose the attribute space $\mathcal{X} = \{c_1, \ldots, c_k\}$ is some finite set of labels or colors; then, for graphs to be isomorphic by this definition, the coloring of the vertices must be respected in addition to the edge structure.
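Definition 5.2 translates directly into a brute-force test. The following Python sketch (with hypothetical names; exponential in the number of vertices, so suitable only for small template graphs) searches over all vertex bijections.

```python
from itertools import permutations

def first_order_isomorphic(G1, G2):
    """Check Definition 5.2 for graphs given as (vertices, attr, edge) triples."""
    V1, X1, E1 = G1
    V2, X2, E2 = G2
    V1, V2 = sorted(V1), sorted(V2)
    if len(V1) != len(V2):
        return False
    for perm in permutations(V2):
        f = dict(zip(V1, perm))                  # a candidate bijection f
        if (all(X1[v] == X2[f[v]] for v in V1) and
            all(E1[(u, v)] == E2[(f[u], f[v])] for u in V1 for v in V1)):
            return True
    return False

G1 = ({1, 2}, {1: 'red', 2: 'blue'},
      {(1, 1): 0, (1, 2): 1, (2, 1): 1, (2, 2): 0})
G2 = ({5, 6}, {5: 'blue', 6: 'red'},
      {(5, 5): 0, (5, 6): 1, (6, 5): 1, (6, 6): 0})
print(first_order_isomorphic(G1, G2))  # -> True (map 1 -> 6, 2 -> 5)
```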
The next simplest isomorphism for attributed graphs is based on the attributes on pairs of vertices. Suppose we have a distance function $d$ over the attribute space $\mathcal{X}$.

Definition 5.3 (Second-order Isomorphism). A graph $G = (V, X, E)$ is second-order isomorphic to a graph $G' = (V', X', E')$ if there exists a bijection $f : V \to V'$ such that:

1. $d(X(v), X(v')) = d(X'(f(v)), X'(f(v')))$ for all $v, v' \in V$.
2. $E(v, v') = E'(f(v), f(v'))$ for all $v, v' \in V$.

Two attributed graphs that are second-order isomorphic are denoted by $G \simeq_2 G'$.

This second-order isomorphism is used in many latent position models ([Hoff et al., 2002]), where $\mathcal{X} = \mathbb{R}^d$ is a Euclidean space; for models using this isomorphism invariance, the probability of a graph depends on the distances between vertices in it, not on their particular locations. These definitions can be extended to higher orders in a straightforward manner. To summarize, we presented some isomorphisms that can be used in specifying when graphs are similar to each other. We will make use of them to specify compatibility maps in the examples presented later in this section.

5.5 Master Interaction Function

In defining distributions over a graph space, it will often be useful to reduce the size of the graph space, removing graphs that have zero probability. One way to do this, assuming that the edge space $\Lambda_E$ has a partial ordering $\leq$, is to define a function that restricts the edge configurations allowed in graphs:

Definition 5.4 (Master Interaction Functions).

1. A master interaction function over vertices is a function of the form $F_V : \Lambda_V^2 \to \Lambda_E$. A graph $G = (V, X, E)$ is said to respect a master interaction function $F_V$ if, for all $v, v' \in V$, we have $E(v, v') \leq F_V(v, v')$.
2. A master interaction function over attributes is a function of the form $F_X : \mathcal{X}^2 \to \Lambda_E$. A graph $G = (V, X, E)$ is said to respect a master interaction function $F_X$ if, for all $v, v' \in V$, we have $E(v, v') \leq F_X(X(v), X(v'))$.

We use master interaction functions to restrict graph spaces to only those graphs that respect them. That is, for a graph space $\mathcal{G}$ and some master interaction functions $F_V$ and $F_X$, we can restrict the graphs to the set:
$$\mathcal{G}' = \left\{ (V, X, E) \in \mathcal{G} : E(v, v') \leq F_V(v, v') \text{ and } E(v, v') \leq F_X(X(v), X(v')) \text{ for all } v, v' \in V \right\}.$$

Example 5.1. Suppose the vertex space is $\Lambda_V = \{1, \ldots, p\}$ and the edge space is $\Lambda_E = \{0, 1\}$. We can define a master interaction function $F_V$ that ensures there is no edge between vertices that are farther apart than $t \in \mathbb{R}^+$ as follows:
$$F_V(v, v') = \begin{cases} 0, & \text{if } |v - v'| > t \\ 1, & \text{otherwise}. \end{cases}$$

Example 5.2. Suppose the attribute space is $\mathcal{X} = \{c_1, \ldots, c_k\}$, where each $c_i$ represents a color, and the edge space is $\Lambda_E = \{0, 1\}$. We can define a master interaction function $F_X$ that ensures vertices with the same color cannot have an edge:
$$F_X(c_i, c_j) = \begin{cases} 0, & \text{if } c_i = c_j \\ 1, & \text{otherwise}. \end{cases}$$
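The following Python sketch combines the two examples: hypothetical functions `F_V` and `F_X` encode the threshold and color constraints, and `respects` tests whether an attributed graph lies in the restricted space $\mathcal{G}'$.

```python
T = 3  # hypothetical distance threshold t for F_V

def F_V(v, vp):
    """No edge between vertices farther apart than t (Example 5.1)."""
    return 0 if abs(v - vp) > T else 1

def F_X(ci, cj):
    """No edge between vertices of the same color (Example 5.2)."""
    return 0 if ci == cj else 1

def respects(G):
    """Does the attributed graph G respect both interaction functions?"""
    V, X, E = G
    return all(E[(u, v)] <= min(F_V(u, v), F_X(X[u], X[v]))
               for u in V for v in V)

G = ({1, 2, 6}, {1: 'red', 2: 'blue', 6: 'red'},
     {(u, v): 0 for u in (1, 2, 6) for v in (1, 2, 6)})
G[2][(1, 2)] = G[2][(2, 1)] = 1   # allowed: |1 - 2| <= 3, different colors
print(respects(G))                 # -> True
```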
5.6 Examples

In this section, we illustrate the above ideas with some examples. In each example, the model takes the form of equation 5.2 and uses some set of templates $\{T_1, \ldots, T_K\}$. For each template $T_k$, the compatibility map $R_k : \mathcal{G} \to \{0, 1\}$ is based on whether a graph $G \in \mathcal{G}$ is isomorphic to it, i.e.:
$$R_k(G) = \begin{cases} 1, & \text{if } G \simeq T_k \\ 0, & \text{otherwise}. \end{cases} \tag{5.3}$$
In all the examples except the first one, we assume the isomorphism used is the first-order isomorphism. The sampling and learning algorithms are discussed in Section 5.7.

5.6.1 Example 1: Grid Graphs

We consider unattributed grid-like graphs such as the one shown in Figure 5.1. Let the vertex space $\Lambda_V = \{1, \ldots, p\}^2$ be a grid of size $p$, and let $\Lambda_E = \{0, 1\}$, specifying the absence or presence of an edge, respectively. We can specify the master interaction function $F_V$ to take pairs of vertices that cannot have an edge to the value 0, and pairs that can have an edge to the value 1. Define $F_V$ as follows:
$$F_V(v, v') = \begin{cases} 1, & \text{if } |v_1 - v_1'| \leq 1 \text{ and } |v_2 - v_2'| \leq 1 \\ 0, & \text{otherwise}, \end{cases}$$
where $v = (v_1, v_2) \in \Lambda_V$ and $v' = (v_1', v_2') \in \Lambda_V$. Hence, this master interaction function $F_V$ ensures the graph space $\mathcal{G}$ only contains grid-like graphs.

Figure 5.1: An example of a grid-like graph.

A possible set of templates is shown in Table 5.1. Each template $T_k$ in this table specifies a compatibility map based on graphs that are isomorphic to $T_k$. Here, we made the following design choices. First, we have limited the order of the template graphs to fourth order and lower (i.e., graphs such that $|V(G)| \leq 4$). Second, to make computation feasible, we apply a 'locality' principle in which only connected graphs are used as templates. Since unconnected graphs constitute the vast majority of the subgraphs in $S(G)$ for any given graph $G \in \mathcal{G}$, the restriction to connected ones is necessary for computational reasons. For example, consider the second-order subgraphs of the graph in Figure 5.1; there are $|S_2(G)| = \binom{17}{2} = 136$ subgraphs of this order, but only 18 of them are connected. If we consider higher-order subgraphs, this gap widens. Given these templates, the number of subgraphs that correspond to a given pattern can be calculated for any graph $G \in \mathcal{G}$, and hence its probability can be calculated. For example, for the graph $G$ in Figure 5.1, the probability is expressed as follows:
$$P(G) = \frac{1}{Z} \exp\left[ \sum_{k=1}^{K} \lambda_k U_k(G) \right] = \frac{1}{Z} \exp\left[ 17\lambda_1 + 18\lambda_2 + 26\lambda_3 + 4\lambda_4 + 45\lambda_5 + 20\lambda_6 \right].$$

Table 5.1: The set of connected graphs (with parameters $\lambda_1, \ldots, \lambda_6$) that are used as templates; the compatibility maps are based on graphs that are isomorphic to these templates.
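Evaluating the model is then a dot product in the exponent. The following Python sketch plugs the subgraph counts reported above for the grid graph of Figure 5.1 into equation (5.2), with hypothetical parameter values $\lambda_k$ (the normalizer $Z$ is intractable and omitted).

```python
import math

def unnormalized_prob(counts, lam):
    """exp( sum_k lambda_k * U_k(G) ), without the constant 1/Z."""
    return math.exp(sum(l * u for l, u in zip(lam, counts)))

U = [17, 18, 26, 4, 45, 20]                 # U_1(G), ..., U_6(G) from the text
lam = [0.1, -0.2, 0.05, 0.3, -0.01, 0.02]   # hypothetical parameters
print(unnormalized_prob(U, lam))
```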
5.6.2 Example 2: 'Molecule' Graphs

We consider an example in which the graph space $\mathcal{G}$ is composed of graphs that loosely resemble molecules in appearance. An example is shown in Figure 5.2.

Figure 5.2: An example of a 'molecule' graph. This is an artificial graph, made only for illustration.

In this example, we use attributed graphs of the form $G = (V, X, E)$. Let $\Lambda_V = \{1, \ldots, p\}$ be the vertex space, $\mathcal{X} = \{c_1, c_2, c_3, c_4\}$ the attribute space, where each $c_i$ represents a color, and $\Lambda_E = \{0, 1\}$ the edge space. We can specify the master interaction function $F_X : \mathcal{X}^2 \to \Lambda_E$ so that vertices with the same color cannot have an edge between them (e.g., set $F_X(c_i, c_j) = 0$ if $c_i = c_j$). Similarly, we might want to specify that vertices with certain different colors can have an edge between them (e.g., set $F_X(c_i, c_j) = 1$ for some $c_i \neq c_j$).

A possible set of templates and their corresponding parameters is shown in Table 5.2. For each template graph $T_k$, we define a compatibility map $R_k : \mathcal{G} \to \{0, 1\}$ based on graphs that are second-order isomorphic to it (equation 5.3). Given these templates, the number of subgraphs that correspond to a given pattern can be calculated for any graph $G \in \mathcal{G}$, and hence its (unnormalized) probability. For the graph $G$ in Figure 5.2, the probability can be expressed as follows:
\[
P(G) = \frac{1}{Z} \exp\left[\sum_{k=1}^{K} \lambda_k U_k(G)\right]
= \frac{1}{Z} \exp\left[20\lambda_1 + 4\lambda_2 + 9\lambda_3 + 5\lambda_4 + 4\lambda_5 + 20\lambda_6 + 4\lambda_7 + 9\lambda_8 + 6\lambda_9 + 4\lambda_{10}\right].
\]

Notice that in this example, the attributes (i.e., the colors associated with vertices) allow distributions in which, loosely speaking, typical samples have complex structure, even though the basis does not contain high-order graphs. For example, the edge structure in these graphs is very unlikely to have been generated by independent coin flips as in an Erdős–Rényi model. If the vertices did not have these attributes and we wanted to define a distribution with equivalent probabilities (e.g., assigning the same probability to the unattributed version¹ of the graph in Figure 5.2), any basis would need to contain graphs of much higher order than those used here. Hence, we see that attributes are important latent variables even if one only wants to define distributions over unattributed graph spaces. Thus, ideas contained in latent position models [Hoff et al., 2002] and latent stochastic blockmodels ([Airoldi et al., 2009], [Latouche et al., 2011]) can be incorporated within the framework here.

¹That is, for an attributed graph $G = (V, X, E)$, removing the attribute function $X : V \to \mathcal{X}$ from it to form the unattributed graph $G = (V, E)$.

Table 5.2: The set of graphs used as templates, with parameters $\lambda_1, \ldots, \lambda_{10}$. These are used to specify the compatibility maps based on graphs that are isomorphic to them. (The template graphs themselves are shown pictorially in the original table.)

5.6.3 Example 3: Mouse Visual Cortex

A graph $G_0$ that corresponds to the visual cortex of a mouse is shown in Figure 5.3.

Figure 5.3: A mouse visual cortex, which we label as $G_0$ [Bock et al., 2011].

In order to model mouse visual cortexes, we consider a hierarchical model; we begin by modeling parts (i.e., interesting subgraphs) that compose these visual cortexes. From the graph $G_0$, we extract subgraphs that exist in the following graph space:
\[
\mathcal{G} = \left\{ G \in S(G_0) :
\begin{array}{l}
|V| = 25 \\
G \text{ is connected} \\
\mathrm{diameter}(G) \leq 5
\end{array}
\right\},
\]
where $S(G_0)$ denotes the set of all subgraphs of $G_0$. Some examples of graphs in $\mathcal{G}$ are shown in Figure 5.4.

Figure 5.4: Some examples of subgraphs of $G_0$ used for learning our subgraph model.

Notice that the graphs in $\mathcal{G}$ do not have attributes, i.e., the vertex and edge spaces have the form $\Lambda_V = \{1, \ldots, 25\}$ and $\Lambda_E = \{0, 1\}$. For modeling purposes, we introduce attributes for the vertices, which serve as latent variables; without them, very high-order templates would be necessary. We assign attributes to the vertices based on their edge counts, which can be used to define different types of vertices. Let $d_G(v)$ denote the number of edges incident on vertex $v$ in the graph $G$. We assign each vertex in a graph an attribute value in $\mathcal{X} = \{c_1, c_2, c_3\}$ based on its edge count as follows. For a graph $G = (V, E)$, define the attribute function $X : V \to \mathcal{X}$ by
\[
X(v) =
\begin{cases}
c_1, & \text{if } 0 \leq d_G(v) \leq 1 \\
c_2, & \text{if } 2 \leq d_G(v) \leq 4 \\
c_3, & \text{if } 5 \leq d_G(v),
\end{cases}
\]
and augment the graph $G$ with it to form the attributed graph $G' = (V, X, E)$. For an example of this augmentation, see Figure 5.5.
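A short sketch of this augmentation (ours, storing the attribute as a networkx node attribute) might look as follows:

```python
# Sketch: assigning degree-based color attributes to vertices (Example 3).
import networkx as nx

def assign_degree_colors(G):
    """Augment G in place with X(v) in {c1, c2, c3} based on the degree d_G(v)."""
    for v in G.nodes:
        d = G.degree(v)
        if d <= 1:
            color = "c1"
        elif d <= 4:
            color = "c2"
        else:
            color = "c3"
        G.nodes[v]["color"] = color
    return G

G = assign_degree_colors(nx.path_graph(5))
print([G.nodes[v]["color"] for v in G.nodes])  # ['c1', 'c2', 'c2', 'c2', 'c1']
```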
We note a couple of things: (1) more complicated attributes can be used for the vertices, for example using a generalized notion of edge counts; and (2) a clustering algorithm can be used here, for example clustering similar subgraphs.

Figure 5.5: An example of a graph being assigned color attributes.

The templates used for our model are shown in Tables 5.3, 5.4, and 5.5. We learn the model parameters using the algorithm in Section 5.7.3 and a small training set of 31 graphs (examples are shown in Figure 5.4). Some samples from the model are shown in Figure 5.6. As mentioned, these are only samples of subgraphs. With only one full graph $G_0$, we cannot develop a hierarchical model, so instead we combine these subgraphs using a few simple rules (e.g., combining them by randomly placing edges between them). An example is shown in Figure 5.7.

Table 5.3: The set of 1st-order graphs used as templates, with parameters $\lambda_1, \lambda_2, \lambda_3$.

Table 5.4: The set of 2nd-order graphs used as templates, with parameters $\lambda_4, \ldots, \lambda_9$.

Table 5.5: The set of 3rd-order graphs used as templates, with parameters $\lambda_{10}, \lambda_{11}$.

Figure 5.6: Model samples.

Figure 5.7: Mouse visual cortex sample from our model.

5.6.4 Example 4: Chemistry Data

The chemoinformatics dataset MUTAG [Shervashidze et al., 2011] contains 188 mutagenic aromatic and heteroaromatic nitro compounds. Examples are shown in Figure 5.8.

We will form a simple hierarchical model using deterministic subgraphs. Define a set of subgraphs as shown in Table 5.6. These subgraphs will correspond to vertices in the second level of the model. There will be an edge between two vertices in the second level if there is an intersection between the two corresponding subgraphs. For example, let $G$ be a molecule graph and suppose $G_1, G_2 \subset G$ are two subgraphs of $G$ that appear in Table 5.6. If $G_1$ and $G_2$ have a common subgraph (i.e., have at least one common vertex in $G$), then in the second level of the model there will be an edge between the corresponding vertices. This edge will have an attribute specifying the degree of intersection, i.e., how many common vertices the two subgraphs share. See Figure 5.9 for some examples. Thus, all the randomness in the problem is at this second level of the hierarchy, and a model can be applied over it; a sketch of the construction follows the captions below.

Figure 5.8: Examples of molecule graphs in the MUTAG dataset.

Table 5.6: Parts in the model. The blue vertices on the right-hand side represent the corresponding graph on the left-hand side. If one of the graphs on the left-hand side is a subgraph of a larger graph, then we may simplify the description of that larger graph through the use of these parts.

Figure 5.9: Examples of molecule graphs (from the MUTAG dataset) depicted by higher-level parts (subgraphs) rather than their lowest-level parts.
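As a sketch of this construction (ours; the detection of part occurrences is assumed given), the second-level graph can be built from the vertex sets of the detected parts:

```python
# Sketch: building the second-level 'parts' graph of Example 4. Each detected
# part occurrence is represented by its vertex set in the molecule graph; two
# parts are joined by an edge carrying their degree of intersection.
from itertools import combinations
import networkx as nx

def part_level_graph(part_occurrences):
    """part_occurrences: list of sets of molecule-graph vertices, one per part."""
    H = nx.Graph()
    H.add_nodes_from(range(len(part_occurrences)))
    for i, j in combinations(range(len(part_occurrences)), 2):
        shared = len(part_occurrences[i] & part_occurrences[j])
        if shared > 0:  # edge iff the parts intersect
            H.add_edge(i, j, intersection=shared)
    return H

H = part_level_graph([{1, 2, 3, 4, 5, 6}, {6, 7, 8, 9}, {10, 11}])
print(list(H.edges(data=True)))  # [(0, 1, {'intersection': 1})]
```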
It should be easier to learn a distribution over this higher-level description of these graphs.

5.6.5 Example 5: Vertices with Color and Location

We consider an example of attributed graphs in which the vertex space $\Lambda_V = \{1, \ldots, p\}^2$ is a two-dimensional grid of size $p$, the attribute space $\mathcal{X} = \{c_1, c_2, c_3, c_4\}$ is a set of colors, and the edge space $\Lambda_E = \{0, 1\}$ is binary. We define a master interaction function $F_V : \Lambda_V^2 \to \Lambda_E$ that assigns the value 0 to every pair of vertices that cannot have an edge, and the value 1 to every pair of vertices that can have an edge. Define $F_V$ as follows:
\[
F_V(v, v') =
\begin{cases}
0, & \text{if } d(v, v') > t \\
1, & \text{otherwise,}
\end{cases}
\]
where $d(v, v')$ is some distance function that assigns a distance between vertices based on their locations. In other words, this master interaction function can be used to ensure there is no edge between vertices that are farther apart than $t \in \mathbb{R}$.

Let $N$ be the maximum order of graphs in the graph space. Let the graph space be
\[
\mathcal{G} = \left\{ (V, E) :
\begin{array}{l}
V \subseteq \Lambda_V, \; |V| \leq N \\
X : V \to \mathcal{X} \\
E : V \times V \to \Lambda_E \\
E(v, v') \leq F_V(v, v') \text{ for all } v, v' \in V
\end{array}
\right\}.
\]
The templates and parameters used are shown in Tables 5.7 and 5.8. For a template graph $T_k \in \mathcal{G}$, we define the compatibility map $R_k : \mathcal{G} \to \{0, 1\}$ based on graphs that are second-order isomorphic to it (equation 5.3). Some samples are shown in Figure 5.10; these were generated using the sampling algorithm in Section 5.7.

Table 5.7: The set of 1st-order graphs used as templates, with hand-tuned parameters $\lambda_1 = 0.5$, $\lambda_2 = 0.4$, $\lambda_3 = 0.5$, $\lambda_4 = 0.4$.

Table 5.8: The set of 2nd- and 3rd-order graphs used as templates, with hand-tuned parameters $\lambda_5 = 0.5$, $\lambda_6 = 1.5$, $\lambda_7 = 0.4$, $\lambda_8 = 1.5$, $\lambda_9 = \cdots = \lambda_{14} = -\infty$, $\lambda_{15} = -5$, $\lambda_{16} = -5$, $\lambda_{17} = 0.75$, $\lambda_{18} = -5$, and $\lambda_{19} = \lambda_{20} = \lambda_{21} = -\infty$.

Figure 5.10: Samples from the model.

5.7 Inference and Learning

In this section, we discuss inference for random graphs; for a given probability distribution, inference refers to the calculation (or estimation) of probabilities in that distribution, or more generally, of functions of those probabilities. Inference can be performed by sampling from distributions; a standard Metropolis–Hastings algorithm is presented here for this purpose. Next, we present a learning algorithm for random graph models; for a given model, learning refers to the selection of a particular distribution in it. A stochastic learning algorithm is presented here.

5.7.1 Sampling

Suppose we have a vertex space $\Lambda_V$, an edge space $\Lambda_E$, and an attribute space $\mathcal{X}$, and let $\mathcal{G}$ be a finite graph space with respect to them. Further, suppose we have a distribution $P$ over $\mathcal{G}$ that we want to sample from. We use Markov chain Monte Carlo (MCMC), and in particular, the Metropolis–Hastings algorithm:

Algorithm 1 Metropolis–Hastings
Given a transition kernel $q(G' \mid G)$ and starting from an initial state $G_1$, repeat the following steps from $t = 1$ to $T$:
1. Generate a candidate $G' \sim q(G' \mid G_t)$.
2. Generate $U \sim \mathcal{U}(0, 1)$ and set
\[
G_{t+1} =
\begin{cases}
G', & \text{if } U \leq \alpha(G_t, G') \\
G_t, & \text{otherwise,}
\end{cases}
\]
where $\alpha(G, G')$ is the acceptance probability, given by
\[
\alpha(G, G') = \min\left( \frac{P(G')\, q(G \mid G')}{P(G)\, q(G' \mid G)}, \, 1 \right).
\]
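A compact sketch of Algorithm 1 (ours; the proposal kernel and the probability function are assumptions supplied by the caller, and an unnormalized probability suffices since $Z$ cancels in the acceptance ratio):

```python
# Sketch: Metropolis-Hastings over a graph space (Algorithm 1). `propose(G)`
# returns a candidate G' and the kernel values q(G'|G), q(G|G'); `p_tilde`
# returns an unnormalized probability, e.g. exp(sum_k lam_k U_k(G)).
import random

def metropolis_hastings(G1, propose, p_tilde, T=10_000):
    G = G1  # assumed to have positive probability
    samples = []
    for _ in range(T):
        G_new, q_fwd, q_rev = propose(G)
        ratio = (p_tilde(G_new) * q_rev) / (p_tilde(G) * q_fwd)
        if random.random() <= min(ratio, 1.0):
            G = G_new  # accept the candidate; otherwise keep the current state
        samples.append(G)
    return samples
```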
This algorithm generates a sequence $G_1, G_2, \ldots$ of dependent random graphs, and for large $t$, the graph $G_t$ will be approximately distributed according to $P$ (assuming an appropriate transition kernel). Given a graph $G = (V, X, E)$, the transition kernel generates a proposal graph $G'$ using simple moves. There are many possibilities, but at least for simple problems, we found the following moves suffice:
1. Adding a vertex $v \in \Lambda_V \setminus V$ to the vertex set and adding an attribute value $X(v)$ to the attributes (and not adding any edges, i.e., $E(v, \cdot) = 0$).
2. Deleting a vertex $v \in V$ from the vertex set and restricting the attribute function $X$ accordingly (the vertex is chosen from among vertices with no incident edges).
3. Changing the value of an edge in $E$ (i.e., changing the value of $E(v, v') \in \Lambda_E$ for some $v, v' \in V$).
The probability of each type of move can be uniform, although sometimes non-uniform probabilities may be preferable (e.g., assigning a greater probability to edge moves). Each move in this set has an inverse move allowing a chain to return to the previous state; hence these moves satisfy the weak symmetry property.

5.7.2 Computation

We now consider computational efficiencies for this sampling algorithm. Suppose we have a distribution $P$ in the form of equation 5.2, i.e.,
\[
P(G) = \frac{1}{Z} \exp\left[\sum_{k=1}^{K} \lambda_k U_k(G)\right],
\]
and define $H$ as the exponent in this distribution:
\[
H(G) = \sum_{k=1}^{K} \lambda_k U_k(G).
\]
In this section, we consider the calculation of differences of the form $\Delta H \equiv H(G_{\mathrm{new}}) - H(G_{\mathrm{old}})$, where $G_{\mathrm{new}}$ and $G_{\mathrm{old}}$ are graphs. In the naive approach, each exponent is computed separately and the difference taken. However, the difference $\Delta H$ can be calculated efficiently by ignoring subgraphs that are shared between the two graphs. Define $J_0 \subseteq S(G_{\mathrm{new}})$ as the set of basis subgraphs of $G_{\mathrm{new}}$ that are not subgraphs of $G_{\mathrm{old}}$; similarly, define $J_1 \subseteq S(G_{\mathrm{old}})$ as the set of basis subgraphs of $G_{\mathrm{old}}$ that are not subgraphs of $G_{\mathrm{new}}$. That is,
\[
J_0 \equiv \{ G' \in S(G_{\mathrm{new}}) \mid G' \notin S(G_{\mathrm{old}}) \} \cap \mathcal{G}_{\mathrm{basis}}, \qquad
J_1 \equiv \{ G' \in S(G_{\mathrm{old}}) \mid G' \notin S(G_{\mathrm{new}}) \} \cap \mathcal{G}_{\mathrm{basis}}.
\]
Hence, we have
\[
\Delta H \equiv H(G_{\mathrm{new}}) - H(G_{\mathrm{old}})
= \sum_{k=1}^{K} \lambda_k \left[ U_k(G_{\mathrm{new}}) - U_k(G_{\mathrm{old}}) \right]
= \sum_{k=1}^{K} \lambda_k \left[ \tilde{U}_k(G_{\mathrm{new}}) - \tilde{U}_k(G_{\mathrm{old}}) \right],
\]
where
\[
\tilde{U}_k(G_{\mathrm{new}}) = \#\{ G' \in J_0 : R_k(G') = 1 \}, \qquad
\tilde{U}_k(G_{\mathrm{old}}) = \#\{ G' \in J_1 : R_k(G') = 1 \}.
\]
Let's consider some examples.

New Node: Let $G_{\mathrm{old}}$ be any graph and suppose we form $G_{\mathrm{new}}$ by adding a vertex $u$ to it (and possibly edges). In this case, we have $G_{\mathrm{old}} \subset G_{\mathrm{new}}$ (i.e., it is an induced subgraph) and hence:
\[
J_0 = \{ G' \in S(G_{\mathrm{new}}) \mid u \in V(G') \} \cap \mathcal{G}_{\mathrm{basis}}, \qquad J_1 = \emptyset.
\]

Deleted Node: Let $G_{\mathrm{old}}$ be a graph (with at least one vertex) and suppose we form $G_{\mathrm{new}}$ by deleting a vertex $u \in V(G_{\mathrm{old}})$. In this case, we have $G_{\mathrm{new}} \subset G_{\mathrm{old}}$ and hence:
\[
J_0 = \emptyset, \qquad J_1 = \{ G' \in S(G_{\mathrm{old}}) \mid u \in V(G') \} \cap \mathcal{G}_{\mathrm{basis}}.
\]

New Edge: Let $G_{\mathrm{old}}$ be a graph (with at least two vertices) and suppose we form $G_{\mathrm{new}}$ by changing the value of an edge $E(v, v')$ for some $v, v' \in V(G_{\mathrm{old}})$. In this case, we have:
\[
J_0 = \{ G' \in S(G_{\mathrm{new}}) \mid v, v' \in V(G') \} \cap \mathcal{G}_{\mathrm{basis}}, \qquad
J_1 = \{ G' \in S(G_{\mathrm{old}}) \mid v, v' \in V(G') \} \cap \mathcal{G}_{\mathrm{basis}}.
\]
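To illustrate the New Edge case (a sketch under our own assumptions: connected basis templates and induced subgraphs), $\Delta H$ can be computed by re-counting templates only on subgraphs that contain both endpoints:

```python
# Sketch: efficient Delta-H for an edge toggle (the 'New Edge' case). Only
# induced subgraphs containing both endpoints v, w can change, so we restrict
# the template counts to vertex subsets that include both of them.
from itertools import combinations
import networkx as nx

def local_energy(G, v, w, templates, lam):
    """Sum of lam_k over template matches on induced subgraphs containing v and w."""
    energy = 0.0
    others = [u for u in G.nodes if u not in (v, w)]
    max_order = max(T.number_of_nodes() for T in templates)
    for r in range(0, max_order - 1):          # r extra vertices besides v, w
        for extra in combinations(others, r):
            H = G.subgraph((v, w) + extra)
            if not nx.is_connected(H):
                continue
            for k, T in enumerate(templates):
                if T.number_of_nodes() == r + 2 and nx.is_isomorphic(H, T):
                    energy += lam[k]
    return energy

def delta_H_edge_toggle(G_old, G_new, v, w, templates, lam):
    return (local_energy(G_new, v, w, templates, lam)
            - local_energy(G_old, v, w, templates, lam))
```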
5.7.3 Learning

To estimate the parameters $\lambda = \{\lambda_1, \ldots, \lambda_K\}$ in the model in equation 5.2, we use the maximum likelihood estimate (MLE). Suppose we have a set of graphs $\{G_1, G_2, \ldots, G_N\}$ sampled according to $P(G; \lambda^*)$, where $\lambda^* \in \Lambda$ and where $\Lambda \subset \mathbb{R}^K$ is a compact set. The MLE is the solution to the following optimization problem:
\[
\hat{\lambda} = \arg\max_{\lambda \in \Lambda} \prod_{i=1}^{N} P(G_i; \lambda).
\]
It can be shown that the MLE is also a solution to the equations
\[
\mathbb{E}_{\lambda}[U_k(G)] = \hat{U}_k, \qquad k = 1, \ldots, K,
\]
where $\hat{U}$ are the empirical statistics of the features:
\[
\hat{U}_k = \frac{1}{N} \sum_{i=1}^{N} U_k(G_i), \qquad k = 1, \ldots, K.
\]
We run a stochastic approximation algorithm ([Younes, 1988], [Salakhutdinov, 2009]) to estimate the solution (see Algorithm 2 below). This stochastic algorithm uses multiple Markov chains in parallel; we find this beneficial in practice since individual chains can sometimes become stuck in certain regions for long periods of time.

Algorithm 2 Stochastic Approximation Procedure
Input: empirical statistics $\hat{U}$; initial parameters $\lambda_1$; initial set of $M$ particles $\{G_{1,1}, \ldots, G_{1,M}\}$.
for $t = 1 : T$ (number of iterations) do
  for $m = 1 : M$ (number of parallel Markov chains) do
    Sample $G_{t+1,m}$ given $G_{t,m}$ using the transition operator $T_{\lambda_t}(G_{t+1,m}, G_{t,m})$
  end for
  Update: $\lambda_{t+1} = \lambda_t + \alpha_t \left[ \hat{U} - \frac{1}{M} \sum_{m=1}^{M} U(G_{t+1,m}) \right]$
  Decrease $\alpha_t$
end for
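A sketch of Algorithm 2 (ours; `transition` advances one chain under the current parameters, e.g., by several Metropolis-Hastings moves, and `U_stats` computes the feature vector $U(G)$; both are supplied by the caller):

```python
# Sketch: the stochastic approximation procedure (Algorithm 2) with M parallel
# chains, matching model moments to the empirical statistics U_hat.
import numpy as np

def stochastic_approximation(U_hat, lam0, particles, transition, U_stats,
                             T=1000, alpha0=0.1):
    lam = np.asarray(lam0, dtype=float)
    U_hat = np.asarray(U_hat, dtype=float)
    for t in range(1, T + 1):
        particles = [transition(G, lam) for G in particles]   # advance M chains
        U_model = np.mean([U_stats(G) for G in particles], axis=0)
        alpha_t = alpha0 / t                                  # decreasing step size
        lam = lam + alpha_t * (U_hat - U_model)               # moment-matching update
    return lam, particles
```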
6 Summary and Discussion

In this work, we considered the statistical modeling of real-world problems that involve objects naturally represented by graphs of varying orders. In general, a distribution can be specified over a finite space of objects (e.g., a finite graph space) by directly assigning a probability to each object in it. Of course, for all but the smallest spaces this is impractical, and it is essential to use invariance assumptions. To address this, we considered independence and factorization, the invariances used in graphical models, and observed that they can be defined in terms of projections, allowing their use in the modeling of random graphs. These invariances, for a given family of projections, can be described using only a small subset of projections, those that are atomic (with respect to this family), allowing their compact representation by a (structure) graph. We found factorization to be an easier invariance to use for graphs, at least in some problems, and illustrated such modeling with some examples.

One critique of this work may be that the formulation of graphical models discussed here is not actually an extension, and that only the original formulation of graphical models was presented, applied to a particular multivariate random variable. That is, since a multivariate random variable is a general object, almost any other object can be represented by it (e.g., a random graph, since each of its marginal random graphs is a random variable, is representable by a multivariate random variable). We mention that the reverse is also true: a random graph is also a general object, and any multivariate random variable can be represented by it (e.g., for a multivariate random variable, each of its individual marginal variables can be represented by an attributed vertex, and the graphs assumed to have no edges). This equivalence is examined in [Grenander and Miller, 2007], chapter 6, where it was proven more formally (for the random graph model considered in that work). In this work, we are not concerned with how a random variable is interpreted; for the purposes of statistical modeling, it is irrelevant whether a variable is considered a graph or a vector; only the structure within the object matters. Thus, we view a graphical model as a framework for modeling any object with an appropriate structure, where this structure is defined by a projection family. The value of this more abstract viewpoint is that it focuses attention on the essence of these models, the relationship between object structure and invariance, and hence clarifies the modeling of complicated objects such as graphs and trees.

Lastly, we note that although we limited our attention here, for the most part, to the invariances used in graphical models, there may exist others that result in similar modeling frameworks. A model is a set of distributions, and a modeling framework is a set of models, usually defined in terms of an invariance. To design a (useful) modeling framework, an invariance must be identified such that: (a) it is applicable to many problems; and (b) it can be imposed at varying degrees, creating a range of model complexities and allowing practitioners to adjust models to a given problem. It would be interesting to investigate other invariances, besides independence and factorization, that have hierarchical structures. One could imagine, if such invariances could be identified, the development of frameworks similar to graphical models. Perhaps, given an appropriate hierarchical structure, these invariances would also be representable by graphs, in which case the graphical model framework itself would be expanded.

6.1 Extended Discussion

In this section, we address questions we have received and elaborate on some possible points of confusion.

6.1.1 Framework Merit

Q: Since some random objects can be converted to random vectors, the extended formulation of graphical models is equivalent, in some instances, to the original. What is the value of using the more abstract formulation?

A: This question, in essence, asks how we should define graphical models and why this abstract formulation is even necessary. There are a few ways in which the more abstract formulation might be of value. If we take graphical models to be, at their core, a modeling framework based on invariances between random variables in some family, then the question of how to define these models reduces to the question of what properties this family should satisfy. Given a distribution over a countable space $\Omega$, any function on this space defines a random variable, and any family of functions defines a family of random variables. In this work, for a family of functions to produce an appropriate family of random variables for graphical models, we considered the fundamental properties to be that it is finite, consistent, and complete (Section 4). These latter properties are also pertinent to spaces that have infinite projection families, rather than just finite ones. This is the case, for example, in stochastic processes where $\Omega$ is an uncountable space of functions and has an infinite set of consistent projections on it, each restricting the functions in this space to a finite set of points in their domain (i.e., the projections that define the finite-dimensional distributions).
Since the concept of consistent projections occurs throughout probability theory, there appears to be conceptual value in aligning the definition of graphical models with it, rather than confining them to coordinate projections on product spaces.

From a more applied viewpoint, the value of a general framework is partly related to the degree to which it compresses knowledge, which can, for example, illuminate similarities and differences in somewhat disconnected models. The formulation of graphical models given here covers many classic random graph models; more importantly, it provides insight into modeling random graphs in which the graphs vary in order, an important problem that has received less research attention (we defer discussion of this topic to the next section). This framework for random graphs also includes random trees as a particular instance, a desirable property since a tree is a type of graph, and it covers some classic random tree models (e.g., probabilistic context-free grammars). The framework is based on fundamental invariances for structured objects, where the necessary structure has been defined. Finally, due to their importance, many inference algorithms are specifically designed with these invariances in mind.

On a more technical note, even if a random object can be mapped to a random vector, this does not imply that graphical models for the random object are equivalent to graphical models for the corresponding random vector. A basic property of graphical models is that they specify sufficient conditions for the invariances they use to be consistent (i.e., consistent in the sense that there exists a distribution that satisfies them). For example, in Bayesian networks for random vectors, the structure graph encodes a set of consistent independence assumptions if the structure graph is acyclic. For other random objects, however, this statement does not necessarily hold. This can be seen, for example, in random graphs: since a graph has structural constraints imposed by the dependence of edges on vertices, for structure graphs to be valid, the invariances they specify cannot violate these structural constraints (Section 2.4). Similarly, in random trees, there are structural constraints imposed by the dependence of a vertex on its ancestors. Since the structural constraints of objects in a space $\Omega$ can be described by a projection family on it, the more general formulation of graphical models, being defined in terms of these projection families, can ensure the consistency of its invariances on more general random objects.

6.1.2 Random Vertices

Q: In the random graph models discussed in this work, the vertex set can be random. Why is this of interest?

A: There are two problem paradigms in which random graphs are applicable: the traditional one, in which graphs have a fixed order and randomness only on the edges, and the less established one, in which graphs can have variable order. We will compare these two paradigms in more detail in the next section. For the moment, we note that this work will not be useful to those interested in the former problem. Rather, the focus here is on the latter, and those interested in that problem are the intended audience.

Applications of random graphs with random vertices are studied, for example, within the field of statistical relational learning [Getoor, 2007].
One application is the statistical modeling of a set $\Omega$ of real-world scenes. Suppose scenes are composed of objects with attributes (e.g., a child wearing a hat, a blue sedan), and relationships between objects (e.g., holding hands, driving), and further, scenes vary in the objects they contain (e.g., a scene may be empty or may have numerous objects). These scenes can be represented by graphs, where vertices represent objects, attributes of vertices represent attributes of objects (including their type), and edges represent relationships between objects. Thus, modeling these scenes corresponds to modeling random graphs in which graphs vary in their order. In the literature, many approaches to this problem are based on modeling a selected set of conditional distributions, where each is conditioned on the objects in the scene. For example, in probabilistic relational models [Getoor et al., 2001], these conditional distributions are specified using templates and assuming repeated structure. In this work, we considered a formulation of graphical models that allows us to model full distributions over this type of space.

It is worth mentioning that this formulation of graphical models may be relevant to applied research in probabilistic logic, a field that couples probability and logic (an overview is given in [Russell, 2015]). Notice that if we have a probability distribution over a countable space $\Omega$, then sentences of a logical language about this space (where sentences correspond to binary functions of the form $f : \Omega \to \{0, 1\}$) can be assigned probabilities. (That is, a sentence is assigned probability equal to the probability of the subset of $\Omega$ in which it is true.) If, on the other hand, the distribution over a space $\Omega$ is unknown and we want to learn it, logical expressions can be used to express invariances (i.e., constraints, structure) in the distribution. For example, invariances can be defined on distributions by constraining the distributions to those that assign certain probabilities to certain sentences. More generally, invariances can be defined on these distributions in terms of logical expressions about the distribution itself ([Fagin et al., 1990], [Halpern, 1990]), referred to as probability expressions (these correspond to functionals that map a distribution $P$ over $\Omega$ to $\{0, 1\}$). In other words, distributions are constrained to only those in which some set of probability expressions is true. Thus, probabilistic logic can be viewed as a modeling framework based on general invariances, as expressed by logical expressions about the distribution. (In contrast, graphical models are a framework based on the less expressive invariances of independence and factorization.) This level of expressiveness in invariances, however, can result in the specification of a set of invariances that is inconsistent, in the sense that there does not exist a well-defined distribution satisfying it. This problem has led researchers to consider forgoing some of the expressive power in these logics to ensure consistency, and, since graphical models provide consistency guarantees for their invariances (see the previous section), to research extensions of graphical models to more general spaces.
Of particular interest are extensions to spaces $\Omega$ containing structured objects not of a fixed size (such as, for example, the real-world scenes described above); one example is the template-based graphical models mentioned above. In [Milch et al., 2007], an extension of Bayesian networks, referred to as Blog, is proposed for modeling full distributions over such spaces, where probabilities are placed on objects in $\Omega$ based on how they are incrementally constructed from some generative process. In this work, an extension was also proposed, and it is instructive to examine the differences between these two general frameworks. First, the formulation here is based on finite families of functions on the space $\Omega$, whereas theirs extends to infinite families. (It would be interesting to expand the formulation here in this regard.) Second, in Blog, the function families are not required to be consistent (i.e., the formulation is not in terms of families of marginal random variables, but rather general random variables). A family of random variables that is inconsistent (among the subset that is atomic) generally produces an inefficient representation of invariances in a distribution. This is not necessarily a problem, but since the traditional formulation of graphical models is in terms of marginal variables, we defined our extension in terms of them as well. Finally, the formulation here yields other forms of factorization (i.e., undirected models where probabilities are placed on objects in $\Omega$ based on how they deconstruct (factorize) into a set of parts).

6.1.3 Consistent Distributions

Q: In [Shalizi and Rinaldo, 2013], it was observed that for random graphs, if a set of distributions from the same exponential family all have the same parameter values, then these distributions are inconsistent, except under very special circumstances. How does the proposed framework address this consistency problem?

A: To answer this question about the consistency of distributions, it is constructive to consider it from two perspectives. As mentioned above, there are two paradigms for random graphs: the traditional one, in which graphs have a fixed order and randomness only on the edges, and the less established one, in which graphs have randomness on both vertices and edges. Let's consider consistency in each. For simplicity, suppose we have a vertex space $\Lambda_V$ and an edge space $\Lambda_E$ that are both finite.

In the random vertices setting, a random graph has a distribution $P$ over a graph space
\[
\mathcal{G} = \{ G = (V, E) \mid V \subseteq \Lambda_V, \; E : V \times V \to \Lambda_E \},
\]
containing graphs of varying order. In this case, we can form conditional distributions of the form $P_V(E) \equiv P(E \mid V)$ by conditioning on the vertices $V \subseteq \Lambda_V$ in a graph. A set of conditional distributions $\{P_V, V \subseteq \Lambda_V\}$ is consistent if there exists a full distribution $P$ that produces it. Since the random graph models in this work define full distributions over $\mathcal{G}$, any conditional distributions induced from them will be consistent, and hence this consistency problem is not an issue. (Recall, one of our motivations for modeling full distributions was to avoid these difficult consistency problems.)

In the traditional setting, a random graph has a distribution $P$ over a graph space $\mathcal{G}_\Lambda = \{ G \in \mathcal{G} \mid V(G) = \Lambda_V \}$, containing only graphs with $n = |\Lambda_V|$ vertices, where $n$ is large and possibly infinite (for simplicity, we assume finiteness here).
For modeling purposes, we can define projections to substructures of this graph space as follows. For a set of vertices $V \subseteq \Lambda_V$, let the substructure projection $\pi_V : \mathcal{G}_\Lambda \to \mathcal{G}_V$, where $\mathcal{G}_V = \{ G \in \mathcal{G} \mid V(G) = V \} \not\subseteq \mathcal{G}_\Lambda$, be defined as $\pi_V(G) = G(V)$ (i.e., the subgraph of $G$ induced by the vertices $V$). If we have a distribution $P$ over $\mathcal{G}_\Lambda$, then the projections $\pi_V$, $V \subseteq \Lambda_V$, define a set of marginal distributions $P^{\mathrm{marg}}_V$, $V \subseteq \Lambda_V$. On the other hand, if we do not have the distribution $P$ and want to determine it through the specification of a set of distributions $P^{\mathrm{marg}}_V$, $V \subseteq \Lambda_V$ (using an extension theorem in the infinite case), we must take care to ensure we specify a set that is consistent. In [Shalizi and Rinaldo, 2013], it was shown that if a set of distributions from the same exponential family all use the same parameter values for each distribution $P^{\mathrm{marg}}_V$ in the set, then it is not consistent except under special circumstances. This result says more about the difficulty of directly specifying a set of consistent distributions (especially when these distributions are assumed invariant to isomorphisms; see the next section) than about any limitations of exponential models.

It is worth mentioning that this set of projections to substructures (i.e., the set $\{\pi_V, V \subseteq \Lambda_V\}$) is consistent and complete, and so graphical models, as formulated in Section 4, are applicable to them when this projection family is finite. However, the formulation of random graphs presented here, based on variable-order graph spaces and subset projections (rather than fixed-order graph spaces and substructure projections), may provide a more interesting vantage point, since it includes random trees as an instance.

6.1.4 Degeneracy

Q: In the literature, many of the proposed exponential random graph models suffer from a 'degeneracy' problem. Since the random graph models in this work are exponential models, why are they not degenerate?

A: It is important to stress upfront that degeneracy has nothing to do with exponential models per se; rather, it is an issue most acutely affecting unattributed random graph models that use the isomorphism invariance (i.e., the assumption that all isomorphic graphs have the same probability). The formulation of graphical models for random graphs (Section 2) does not use this invariance, and so is not degenerate. Further, if the graphs are attributed, then even if models use isomorphism invariances (Section 5.4), the probability of a graph will still depend on its vertex set beyond its cardinality; latent variables are now associated with the vertices, providing a means for models to differentiate vertices from each other. In the literature, many models use attributes (latent variables) and are not considered degenerate (although most of these are not high-order models). In Section 5, we gave five random graph examples; all but the first, which was a toy problem for illustration, used attributed graphs (as well as constrained graph spaces). We now let the discussion proceed to the setting in which the degeneracy problem appears, at least in an acute form.

To begin, it is useful to illustrate the degeneracy problem with an example. Suppose we have a vertex space $\Lambda_V = \mathbb{N}$ and an edge space $\Lambda_E = \{0, 1\}$, and let $\mathcal{G}$ be the graph space containing all graphs with respect to them.
For a finite set of vertices $V \subset \Lambda_V$, let $P_V$ denote the following conditional distribution: $P_V(E) \equiv P(E \mid V)$, where $E$ can be any function of the form $E : V \times V \to \Lambda_E$. Assume that these conditional distributions are invariant to isomorphisms:
\[
G_0 \simeq G_1 \implies P_{V_0}(E_0) = P_{V_1}(E_1),
\]
where $G_0 = (V_0, E_0)$ and $G_1 = (V_1, E_1)$. For all vertex sets of a given cardinality $n \in \mathbb{N}$, since these distributions depend only on the edge structure and not on the particular vertices, we can assume the vertex set is $V = \{1, \ldots, n\}$ and index the conditional distributions by $P_n \equiv P_V$. Now consider the exponential model studied in [Handcock et al., 2003]; suppose the model has two sufficient statistics, the number of edges and the number of '2-stars' in a graph:
\[
P_n(E) = \frac{1}{Z(n, \lambda)} \exp\left[ \lambda_1 f_1(E) + \lambda_2 f_2(E) \right],
\qquad
f_1(E) = \sum_{i < j} E(i, j), \quad f_2(E) = \sum_{i} \binom{d_i(E)}{2},
\]
where $d_i(E)$ denotes the degree of vertex $i$. For a distribution $P_n$, let $P^*_n$ denote its maximum probability, and let $M_{\epsilon, n} = \{ E : P_n(E) > (1 - \epsilon) P^*_n \}$ be the subset of $\epsilon$-modes for $P_n$. The following definition is based on the observation that degenerate exponential-family distributions tend to concentrate almost all of their mass on the distribution's modes:

Definition 6.1 ([Strauss, 1986], [Schweinberger, 2011]). A set of distributions $P_n$ is degenerate if, for any $0 < \epsilon < 1$, we have $P_n(M_{\epsilon, n}) \to 1$ as $n \to \infty$.

We now make some comments. First, notice that by this definition, every distribution in the Erdős–Rényi model is degenerate. (This follows from well-known results about typical sets for sequences of independent and identically distributed random variables; see [Cover and Thomas, 2012], chapter 3.) This suggests that the characterization of degeneracy in this definition differs from our intuition: there appears to be a fundamental difference between the Erdős–Rényi model and the 2-star model described above, and this should be captured in any definition.

Second, this definition only concerns degeneracy of a distribution, whereas it appears that degeneracy should be a property of a model (i.e., a set of distributions). For example, in the empirical studies performed in [Handcock et al., 2003], it was observed that nearly all distributions in the model placed the majority of their mass on either the empty graph or the complete graph. Hence, it is not the fact that the distributions in this model are 'degenerate' in the sense of the above definition that is interesting, but rather that (1) most distributions in this model have the exact same modes, i.e., the empty or complete graph; and (2) the transition between these two disparate distributions can occur suddenly in the parameter space. This suggests that a more suitable definition would concern phase transitions, i.e., loosely, the existence of singularities in the parameter space. This might be defined as the existence of a point in the parameter space such that arbitrarily small balls around it contain points corresponding to dramatically different distributions, taking limits appropriately.

A phase-transition definition appears to capture our intuitive idea of what constitutes degenerate and non-degenerate models. For example, the Erdős–Rényi model would not be classified as degenerate by this definition. Consider another example: suppose we want to extend the second-order Erdős–Rényi model to a third-order one. There are four relevant statistics to consider: (a) the number (of third-order subgraphs) with three edges; (b) the number with two edges; (c) the number with one edge; and (d) the number with no edges. Adding any three of these four statistics to the Erdős–Rényi model results in a set of sufficient statistics for the extended model. The important thing to notice here is that these sufficient statistics counterbalance each other: an increase in the parameter corresponding to the number of triangles can be offset, loosely speaking, by an increase in the parameter corresponding to the number of empty (third-order) subgraphs. This balance in the set of sufficient statistics suggests that distributions in this model change smoothly in the parameter space (when the parameters are non-zero), and thus that this model does not contain singularities, given an appropriate definition. In turn, this suggests that learning (e.g., maximum likelihood estimation) is feasible and that this model may be useful in practice.
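As an aside (our own sketch, not in the original), the statistics in this discussion are simple counts; for an unattributed graph they can be computed by brute force as follows:

```python
# Sketch: the sufficient statistics discussed above, computed by brute force:
# edges, 2-stars, and the counts of third-order induced subgraphs with
# 3, 2, 1, and 0 edges.
from itertools import combinations
from math import comb
import networkx as nx

def two_star_count(G):
    """Number of 2-stars: pairs of edges sharing a vertex."""
    return sum(comb(d, 2) for _, d in G.degree())

def third_order_counts(G):
    """Counts of 3-vertex induced subgraphs with 3, 2, 1, 0 edges."""
    counts = [0, 0, 0, 0]
    for triple in combinations(G.nodes, 3):
        m = G.subgraph(triple).number_of_edges()
        counts[3 - m] += 1  # index 0 holds triangles, index 3 empty triples
    return counts

G = nx.complete_graph(4)
print(G.number_of_edges(), two_star_count(G), third_order_counts(G))
# 6 12 [4, 0, 0, 0]
```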
Lastly, let's consider model complexity. In the example in Section 5.6.2, it was observed that adding attributes to unattributed graphs, loosely speaking, allows them to be modeled using lower-order models. Since, for any model using latent variables, there exists an equivalent but more complex model without latent variables, this suggests that models of unattributed graphs may require extremely high orders to produce equivalent distributions.

Acknowledgements

I am indebted to Professors Laurent Younes and Donald Geman for looking at very preliminary and vague versions of this work, and for their helpful discussions and insights.

Appendices

A Statistical Invariances

Graphical models are a tool that uses conditional independencies to produce distributions with compact representations, promoting learning and inference. More generally, however, any statistical invariance, not just conditional independence, may be used to compress representations and ease learning. One example is in random graphs, where a common assumption is that the probability of a graph depends only on its edge structure, and not on the particular labeling of the vertices; distributions are assumed invariant to graph isomorphisms (see Section 5.3). Another example is in the modeling of the spatial configuration of objects in street-scene imagery, where a simple assumption is that there is a symmetry in the horizontal location of objects, e.g., the probability of a person on the left half of an image equals the probability of a person on the right half [Geman et al., 2015]. Naturally, it is beneficial for practitioners to incorporate as many valid invariances as possible into their models.

In this appendix, the goal is simply to provide a formulation of invariance that encompasses these ideas. We begin by considering some basic definitions of invariance involving transforms, and then proceed to more complicated definitions of invariance involving functions and distributions. In Section A.3, we show that independence and conditional independence are particular instances of this general formulation. In Section A.5, we consider invariances that are functions of the probability space, which we refer to as moment invariances.

A.1 Invariant Transformations

For a given probability space, a statistical invariance is some property of objects in it such that, loosely speaking, objects that share the same property have the same probability. A property of a space $\mathcal{X}$ is said to be invariant under a transformation if the transformation preserves that property.
In general, a property may be considered as an equivalence class in which $x_1, x_2 \in \mathcal{X}$ share the property if and only if $x_1$ and $x_2$ are in the same equivalence class. In other words, we define a property on $\mathcal{X}$ as an equivalence class, or likewise, as an equivalence relation:

Definition A.1 (Equivalence Relation). An equivalence relation on $\mathcal{X}$ is a symmetric, reflexive, and transitive binary relation $\sim$ on $\mathcal{X}$. For any $x_1, x_2 \in \mathcal{X}$, we say $x_1$ and $x_2$ have the same property if $x_1 \sim x_2$.

Now, we can define an invariant transform:

Definition A.2 (Invariant Transform). A transform $T : \mathcal{X} \to \mathcal{X}$ is invariant to an equivalence relation $\sim$ if for all $x_1, x_2 \in \mathcal{X}$, we have $x_1 \sim x_2 \implies T(x_1) \sim T(x_2)$.

A.2 Invariant Functions

We defined invariances for functions of the form $T : \mathcal{X} \to \mathcal{X}$. Now, let's extend the notion of invariance to a general function $f : \mathcal{X} \to \Psi$ in which the domain and codomain are not necessarily the same space. To define an invariance on a general function, we need two equivalence relations, one on the domain $\mathcal{X}$ and one on the codomain $\Psi$; denote these by $\sim_\mathcal{X}$ and $\sim_\Psi$, respectively. Now, let's define an invariant function:

Definition A.3 (Invariant Function). A function $f : \mathcal{X} \to \Psi$ is invariant to equivalence relations $\sim_\mathcal{X}$ and $\sim_\Psi$ if for all $x_1, x_2 \in \mathcal{X}$, we have $x_1 \sim_\mathcal{X} x_2 \implies f(x_1) \sim_\Psi f(x_2)$.

We will find this definition useful in defining invariant distributions.

A.2.1 Invariant Probability Mass Functions

In this section, we consider the invariance of probability mass functions (pmfs); for ease of exposition, we skip the details about more general distributions. Let $(\mathcal{X}, \Sigma_\mathcal{X}, P)$ be a probability space in which $\mathcal{X}$ is countably infinite and $\Sigma_\mathcal{X}$ is the power set of $\mathcal{X}$. As above, to define the invariance of a function $P : \mathcal{X} \to [0, 1]$, we need to specify equivalence relations on both $\mathcal{X}$ and $[0, 1] \subset \mathbb{R}$. Suppose we have some equivalence relation $\sim_\mathcal{X}$ on $\mathcal{X}$, and for the equivalence relation on $\mathbb{R}$, let $\sim_\mathbb{R}$ be the equality relation (i.e., for any $r_1, r_2 \in \mathbb{R}$, let $r_1 \sim_\mathbb{R} r_2$ if and only if $r_1 = r_2$). For simplicity, we always use the equality relation as the relation on the real numbers $\mathbb{R}$. Thus, we have:

Definition A.4 (Invariant pmfs). A pmf $P$ is invariant to $\sim_\mathcal{X}$ if for all $x_1, x_2 \in \mathcal{X}$, we have $x_1 \sim_\mathcal{X} x_2 \implies P(x_1) = P(x_2)$.

Let's consider some examples.

Example A.1 (Symmetry). Let $\mathcal{X} = \mathbb{Z}$ be the integers, and define the following equivalence relation: for every $x_1, x_2 \in \mathcal{X}$,
\[
x_1 \sim_\mathcal{X} x_2 \iff |x_1| = |x_2|.
\]
Thus, a pmf $P$ that is invariant to this equivalence relation has a symmetry in which $P(x) = P(-x)$ for all $x \in \mathcal{X}$. If one is trying to estimate $P$ from data, this symmetry can be utilized to improve the estimate, as is the case for any invariance (see the sketch below).

Example A.2 (Graph Isomorphisms). Let $\mathcal{X} = \mathcal{G}$ be a graph space, and define the following equivalence relation: for every $G_1, G_2 \in \mathcal{G}$,
\[
G_1 \sim_\mathcal{X} G_2 \iff G_1 \text{ is isomorphic to } G_2.
\]
Thus, a pmf $P$ that is invariant to this equivalence relation assigns the same probability to all graphs that are isomorphic to each other. Many random graph models use this invariance (e.g., the Erdős–Rényi model).
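To illustrate the estimation remark in Example A.1, here is a small sketch of our own: an empirical pmf over the integers is symmetrized by averaging the empirical mass within each equivalence class $\{x, -x\}$.

```python
# Sketch: using the symmetry invariance of Example A.1 to improve a pmf
# estimate. Each class {x, -x} receives the average of its empirical mass.
from collections import Counter

def symmetrized_pmf(samples):
    n = len(samples)
    emp = Counter(samples)
    pmf = {}
    for x in set(abs(s) for s in samples):
        mass = (emp[x] + emp[-x]) / n          # total mass of the class {x, -x}
        share = mass / (1 if x == 0 else 2)    # split evenly within the class
        pmf[x] = share
        pmf[-x] = share
    return pmf

print(symmetrized_pmf([1, 1, -1, 2]))  # {1: 0.375, -1: 0.375, 2: 0.125, -2: 0.125}
```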
A.3 Conditional Invariance

In this section, we consider conditional and marginal invariances. We then define independence and conditional independence, two special cases. Let $(\Omega, \Sigma_\Omega, \mathbb{P})$ be a probability space and let $\mathcal{X}$ be a countably infinite space. Further, let $X : \Omega \to \mathcal{X}$ be a measurable function (i.e., a discrete random variable) with a pmf $P_X : \mathcal{X} \to [0, 1]$, where
\[
P_X(x) = \mathbb{P}(\{\omega \in \Omega : X(\omega) = x\}), \qquad x \in \mathcal{X}.
\]
Thus, we may form the probability space $(\mathcal{X}, \Sigma_\mathcal{X}, P_X)$, where $\Sigma_\mathcal{X}$ is the power set of $\mathcal{X}$. Furthermore, for any $A \in \Sigma_\Omega$ such that $\mathbb{P}(A) > 0$, we may form the conditional probability
\[
P_X(x \mid A) = \mathbb{P}(\{\omega \in \Omega : X(\omega) = x\} \mid A)
= \frac{\mathbb{P}(\{\omega \in A : X(\omega) = x\})}{\mathbb{P}(A)}, \qquad x \in \mathcal{X}.
\]
Finally, suppose we have an equivalence relation $\sim_{\Sigma_\Omega}$ on $\Sigma_\Omega$. Then, we may define conditional invariance as follows.

Definition A.5 (Conditional Invariance). A distribution $P_X$ is conditionally invariant to $\sim_{\Sigma_\Omega}$ if, for all $A_1, A_2 \in \Sigma_\Omega$, we have
\[
A_1 \sim_{\Sigma_\Omega} A_2 \implies P_X(\cdot \mid A_1) = P_X(\cdot \mid A_2).
\]

Now, we consider distributions that have both regular invariances and conditional invariances.

A.3.1 General Invariance

Thus far, we have described a regular invariance (Definition A.4) and a conditional invariance (Definition A.5); now, suppose we want to combine these. As above, suppose we have the two probability spaces $(\Omega, \Sigma_\Omega, \mathbb{P})$ and $(\mathcal{X}, \Sigma_\mathcal{X}, P)$, where $P$ is the distribution induced by $\mathbb{P}$. We now define an invariance in terms of an equivalence relation $\sim_\Sigma$ on the product space $\Sigma = \Sigma_\Omega \times \Sigma_\mathcal{X}$. The general definition of an invariance is as follows.

Definition A.6 (General Invariance). A distribution $P$ is invariant to $\sim_\Sigma$ if for all $(A_1, B_1), (A_2, B_2) \in \Sigma_\Omega \times \Sigma_\mathcal{X}$, we have
\[
(A_1, B_1) \sim_\Sigma (A_2, B_2) \implies P(B_1 \mid A_1) = P(B_2 \mid A_2).
\]

Example A.3 (Independence). Suppose we have an equivalence relation $\sim_\Sigma$ on the product space $\Sigma = \Sigma_\Omega \times \Sigma_\mathcal{X}$. Let $A_1, A_2 \in \Sigma_\Omega$ and $B_1, B_2 \in \Sigma_\mathcal{X}$, and suppose that $A_2 = \Omega$ and $B_1 = B_2$. Then, the event $B_1$ is said to be independent of the event $A_1$ if $(A_1, B_1) \sim_\Sigma (A_2, B_2)$. That is, if $B_1$ is independent of $A_1$, then $P(B_1 \mid A_1) = P(B_1)$.

Example A.4 (Conditional Independence). Suppose we have an equivalence relation $\sim_\Sigma$ on the product space $\Sigma = \Sigma_\Omega \times \Sigma_\mathcal{X}$. Let $A_1, A_2 \in \Sigma_\Omega$ and $B_1, B_2 \in \Sigma_\mathcal{X}$, and suppose that $A_2 \subset A_1$ and $B_1 = B_2$. Then, the event $B_1$ is said to be conditionally independent of the event $A_1 \setminus A_2$, given the event $A_2$, if $(A_1, B_1) \sim_\Sigma (A_2, B_2)$. That is, if $B_1$ is independent of $A_1 \setminus A_2$ given $A_2$, then $P(B_1 \mid A_1) = P(B_1 \mid A_2)$.

A.4 An Alternative Formulation

From the perspective of estimation, a pmf $P$ that is invariant to an equivalence relation $\sim_\mathcal{X}$ on $\mathcal{X}$ allows the production of additional data from our limited samples. That is, suppose we have a sample $x \in \mathcal{X}$; then, there exists a replication procedure on this example $x$ such that the equivalence class is preserved in each replicated version of the data. That is, we could define a transformation that takes any $x \in \mathcal{X}$ to the set $\mathcal{X}' = \{x' \in \mathcal{X} : x' \sim x\}$. In this section, we present an alternative representation of an equivalence relation $\sim_\mathcal{X}$; this representation will not be useful in and of itself, but it motivates more elaborate invariances for modeling purposes, which we consider in the next section.

As above, let $\mathcal{X}$ be a countably infinite space, and now define $\mathcal{T}$ to be the space of bijective transformations of the form $T : \mathcal{X} \to \mathcal{X}$.
Notice that an equivalence relation $\sim_\mathcal{X}$ on $\mathcal{X}$ corresponds to a subset $\mathcal{T}_0 \subset \mathcal{T}$, where
\[
\mathcal{T}_0 = \{ T \in \mathcal{T} \mid x \sim_\mathcal{X} T(x) \text{ for all } x \in \mathcal{X} \}.
\]
In other words, $\mathcal{T}_0$ contains all bijective transformations that are consistent with the equivalence relation. Thus, the equivalence relation $\sim_\mathcal{X}$ induces a binary partition of $\mathcal{T}$. Furthermore, this binary partition may be represented by an equivalence relation $\sim_\mathcal{T}$ on $\mathcal{T}$. That is, define $\sim_\mathcal{T}$ as follows: for all $T_1, T_2 \in \mathcal{T}$, let
\[
T_1 \sim_\mathcal{T} T_2 \iff T_1, T_2 \in \mathcal{T}_0 \text{ or } T_1 = T_2.
\]
Using this equivalence relation on transforms, we can give an alternative definition of an invariant pmf.

Definition A.7 (Invariant pmfs). Suppose we have a space $\mathcal{T}$ composed of bijective transformations of the form $T : \mathcal{X} \to \mathcal{X}$, and suppose that we have an equivalence relation $\sim_\mathcal{T}$ on $\mathcal{T}$. A pmf $P : \mathcal{X} \to [0, 1]$ is invariant to $\sim_\mathcal{T}$ if for all $T_1, T_2 \in \mathcal{T}$, we have
\[
T_1 \sim_\mathcal{T} T_2 \implies P(T_1(x)) = P(T_2(x)) \text{ for all } x \in \mathcal{X}.
\]
Although this alternative definition of an invariance is not particularly interesting in its own right¹, it motivates the concept of invariances of moments, which we now consider.

¹Suppose an equivalence relation $\sim_\mathcal{T}$ is constructed as described in this section. Then the partition of $\mathcal{T}$ that corresponds to $\sim_\mathcal{T}$ can have only one set of size greater than one (the set to which the identity transform $T(x) = x$ belongs). Hence, this definition is not particularly useful.

A.5 Moment Invariance

In this section, we consider an important type of invariance of a distribution, that of moment invariance. As above, suppose we have a countably infinite space $\mathcal{X}$ and a pmf $P$ over it. Now, suppose we have: (1) a space $\mathcal{F}$ composed of functions of the form $f : \mathcal{X} \to \mathbb{R}$; and (2) an equivalence relation $\sim_\mathcal{F}$ on $\mathcal{F}$. Define moment invariance as follows:

Definition A.8 (Moment Invariance). Suppose we have a set of real functions $\mathcal{F}$ and an equivalence relation $\sim_\mathcal{F}$ on it. A pmf $P$ is moment invariant to $\sim_\mathcal{F}$ if for all $f_1, f_2 \in \mathcal{F}$, we have
\[
f_1 \sim_\mathcal{F} f_2 \implies \mathbb{E}_P(f_1(X)) = \mathbb{E}_P(f_2(X)).
\]
The functions in $\mathcal{F}$ are referred to as features. If we compare this definition of invariance to the one in the previous section, we see that it follows naturally. Let's now consider some examples.

Example A.5 (Symmetric Expectations). Let $\mathcal{X} = \mathbb{Z}$ be the integers. Define
\[
f_1(x) = |x| \cdot \mathbb{I}\{x < 0\}, \qquad f_2(x) = |x| \cdot \mathbb{I}\{x \geq 0\},
\]
let $\mathcal{F} = \{f_1, f_2\}$, and let $f_1 \sim f_2$. If a pmf $P$ is moment invariant to $\sim$, then we have
\[
\mathbb{E}_P(|X| \cdot \mathbb{I}\{X < 0\}) = \mathbb{E}_P(|X| \cdot \mathbb{I}\{X \geq 0\}).
\]
Notice that this is a less stringent invariance than in the previous example, where $P$ was required to be symmetric around the axis; here we only require that the expected value over the negative numbers have the same magnitude as the expected value over the positive numbers.

Example A.6 (Marginal Expectations). Suppose $\mathcal{X} = \mathbb{Z}^n$, and let $\mathcal{F} = \{f_1, \ldots, f_n\}$ be a set of real functions over $\mathcal{X}$ such that each $f_i(x) = x_i$ is the projection of $x \in \mathcal{X}$ onto its $i$th factor. Further, suppose we define the trivial equivalence relation on $\mathcal{F}$ in which all functions are in the same equivalence class (i.e., $f_i \sim f_j$ for all $i, j$). Then, if $P$ is moment invariant to this equivalence relation, we have $\mathbb{E}_P(X_1) = \cdots = \mathbb{E}_P(X_n)$. That is, the expected values of all marginal random variables are the same.
Example A.7 (Marginal Distributions). Suppose we want a distribution $P$ over the space $\mathcal{X} = \mathbb{Z}^n$ to have the invariance property that all marginal distributions are equal to each other, i.e., that $P(X_1) = \cdots = P(X_n)$. To specify this invariance, let
\[
\mathcal{F} = \{ f_{i,j} \mid i = 1, \ldots, n \text{ and } j \in \mathbb{Z} \}
\]
be the set of real functions on $\mathcal{X}$ in which each $f_{i,j}(x) = \mathbb{I}\{x_i = j\}$ is an indicator function. Further, define the equivalence relation on $\mathcal{F}$ such that $f_{i,j} \sim f_{k,j}$ for all $i, k \in \{1, \ldots, n\}$. Then, if $P$ is moment invariant to this equivalence relation, all marginal random variables have the same distribution. Naturally, this invariance is useful in estimations involving the distribution $P$, which we consider in the next example.

Example A.8 (Estimation using Moment Invariances). Suppose we have a finite set of real functions $\mathcal{F}$ with an equivalence relation $\sim$ on it. For a function $f_0 \in \mathcal{F}$, denote its equivalence class by $\mathcal{F}_0 = \{ f \in \mathcal{F} \mid f \sim f_0 \}$. If a pmf $P$ is moment invariant to $\sim$, then:
\[
\mathbb{E}_P(f_0(X)) = \frac{1}{|\mathcal{F}_0|} \sum_{f \in \mathcal{F}_0} \mathbb{E}_P(f(X))
= \frac{1}{|\mathcal{F}_0|} \mathbb{E}_P\left( \sum_{f \in \mathcal{F}_0} f(X) \right).
\]
Thus, given data samples $x^{(1)}, \ldots, x^{(m)}$, we may utilize this invariance and estimate $\mathbb{E}_P(f_0(X))$ as follows:
\[
\hat{\mathbb{E}}_P(f_0(X)) = \frac{1}{|\mathcal{F}_0|} \frac{1}{m} \sum_{i=1}^{m} \sum_{f \in \mathcal{F}_0} f(x^{(i)}).
\]
Utilizing these invariances will be beneficial in many applications.

A.6 Conditional Moment Invariance

In Section A.3, we defined independence and conditional independence, two special cases of invariances on distributions. Similarly, in this section, we define moment independence and conditional moment independence. We begin by defining conditional and marginal moment invariances.

Let $(\Omega, \Sigma_\Omega, \mathbb{P})$ be a probability space and let $\mathcal{X}$ be a countably infinite space. Further, let $X : \Omega \to \mathcal{X}$ be a measurable function (i.e., a discrete random variable) with a pmf $P : \mathcal{X} \to [0, 1]$, where $P(x) = \mathbb{P}(\{\omega \in \Omega : X(\omega) = x\})$ for $x \in \mathcal{X}$. Thus, we may form the probability space $(\mathcal{X}, \Sigma_\mathcal{X}, P)$, where $\Sigma_\mathcal{X}$ is the power set of $\mathcal{X}$. Furthermore, for any $A \in \Sigma_\Omega$ such that $\mathbb{P}(A) > 0$, we may form the conditional probability
\[
P_X(x \mid A) = \frac{\mathbb{P}(\{\omega \in A : X(\omega) = x\})}{\mathbb{P}(A)}, \qquad x \in \mathcal{X}.
\]
Suppose we have a set $\mathcal{F}$ composed of functions of the form $f : \mathcal{X} \to \mathbb{R}$. We now define an invariance in terms of an equivalence relation $\sim_\Gamma$ on the product space $\Gamma = \Sigma_\Omega \times \mathcal{F}$. The general definition is as follows.

Definition A.9 (Conditional Moment Invariance). A distribution $P$ is conditionally moment invariant to $\sim_\Gamma$ if for all $(A_1, f_1), (A_2, f_2) \in \Sigma_\Omega \times \mathcal{F}$, we have
\[
(A_1, f_1) \sim_\Gamma (A_2, f_2) \implies \mathbb{E}_P[f_1(X) \mid A_1] = \mathbb{E}_P[f_2(X) \mid A_2],
\]
where
\[
\mathbb{E}_P[f(X) \mid A] \equiv \sum_{x \in \mathcal{X}} f(x) P(x \mid A).
\]

Example A.9 (Moment Independence). Suppose we have an equivalence relation $\sim_\Gamma$ on the product space $\Gamma = \Sigma_\Omega \times \mathcal{F}$. Let $A_1, A_2 \in \Sigma_\Omega$ and $f_1, f_2 \in \mathcal{F}$, and suppose that $A_2 = \Omega$ and $f_1 = f_2$. Then, the feature $f_1$ is said to be moment independent of the event $A_1$ if $(A_1, f_1) \sim_\Gamma (A_2, f_2)$. That is, if $f_1$ is moment independent of $A_1$, then $\mathbb{E}_P(f_1(X) \mid A_1) = \mathbb{E}_P(f_1(X))$.

Example A.10 (Conditional Moment Independence). Suppose we have an equivalence relation $\sim_\Gamma$ on the product space $\Gamma = \Sigma_\Omega \times \mathcal{F}$. Let $A_1, A_2 \in \Sigma_\Omega$ and $f_1, f_2 \in \mathcal{F}$, and suppose that $A_2 \subset A_1$ and $f_1 = f_2$. Then, the feature $f_1$ is said to be conditionally moment independent of the event $A_1 \setminus A_2$, given the event $A_2$, if $(A_1, f_1) \sim_\Gamma (A_2, f_2)$. That is, if $f_1$ is moment independent of $A_1 \setminus A_2$ given $A_2$, then $\mathbb{E}_P(f_1(X) \mid A_1) = \mathbb{E}_P(f_1(X) \mid A_2)$.
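To close the appendix, a small sketch of our own making the estimator of Example A.8 concrete:

```python
# Sketch: estimating E_P(f0(X)) using a moment invariance (Example A.8).
# `features` is the equivalence class F0 of f0; the estimate averages the
# empirical means of all features in the class.
def moment_invariant_estimate(features, samples):
    m = len(samples)
    total = sum(f(x) for x in samples for f in features)
    return total / (len(features) * m)

# Example A.5's class {f1, f2}: symmetric expectations over the integers.
f1 = lambda x: abs(x) if x < 0 else 0
f2 = lambda x: abs(x) if x >= 0 else 0
print(moment_invariant_estimate([f1, f2], [-3, 1, 2, -2]))  # 1.0
```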
References

Edo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. In Advances in Neural Information Processing Systems, pages 33–40, 2009.

Claude Berge and Edward Minieka. Graphs and Hypergraphs, volume 7. North-Holland Publishing Company, Amsterdam, 1973.

Julian Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B (Methodological), pages 192–236, 1974.

Davi D. Bock, Wei-Chung Allen Lee, Aaron M. Kerlin, Mark L. Andermann, Greg Hood, Arthur W. Wetzel, Sergey Yurgenson, Edward R. Soucy, Hyon Suk Kim, and R. Clay Reid. Network anatomy and in vivo physiology of visual cortical neurons. Nature, 471(7337):177–182, 2011.

Béla Bollobás. Random Graphs. Springer, 1998.

Craig Boutilier, Nir Friedman, Moises Goldszmidt, and Daphne Koller. Context-specific independence in Bayesian networks. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, pages 115–123. Morgan Kaufmann Publishers Inc., 1996.

Tom Britton. Stochastic epidemic models: a survey. Mathematical Biosciences, 225(1):24–35, 2010.

Liming Cai, Russell L. Malmberg, and Yunzhou Wu. Stochastic modeling of RNA pseudoknotted structures: a grammatical approach. Bioinformatics, 19(suppl 1):i66–i73, 2003.

David Maxwell Chickering, David Heckerman, and Christopher Meek. A Bayesian approach to learning Bayesian networks with local structure. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pages 80–89. Morgan Kaufmann Publishers Inc., 1997.

Noam Chomsky. Syntactic Structures. Walter de Gruyter, 2002.

Kai Lai Chung. A Course in Probability Theory. Academic Press, 2001.

Peter Clifford. Markov random fields in statistics. Disorder in Physical Systems: A Volume in Honour of John M. Hammersley, pages 19–32, 1990.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

Michael Drmota. Random Trees: An Interplay between Combinatorics and Probability. Springer Science & Business Media, 2009.

Witold Dyrka and Jean-Christophe Nebel. A stochastic context free grammar based framework for analysis of protein sequences. BMC Bioinformatics, 10(1):1, 2009.

Witold Dyrka, Jean-Christophe Nebel, and Malgorzata Kotulska. Probabilistic grammatical model for helix-helix contact site classification. Algorithms for Molecular Biology, 8(1):1, 2013.

Paul Erdős and Alfréd Rényi. On random graphs I. Publ. Math. Debrecen, 6:290–297, 1959.

Ronald Fagin, Joseph Y. Halpern, and Nimrod Megiddo. A logic for reasoning about probabilities. Information and Computation, 87(1-2):78–128, 1990.

Ove Frank and David Strauss. Markov graphs. Journal of the American Statistical Association, 81(395):832–842, 1986.

Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. Visual Turing test for computer vision systems. Proceedings of the National Academy of Sciences, 112(12):3618–3623, 2015.