Universal Clustering via Crowdsourcing
Authors: Ravi Kiran Raman, Lav Varshney
Abstract: Consider unsupervised clustering of objects drawn from a discrete set, through the use of human intelligence available in crowdsourcing platforms. This paper defines and studies the problem of universal clustering using responses of crowd workers, without knowledge of worker reliability or task difficulty. We model stochastic worker response distributions by incorporating traits of memory for similar objects and traits of distance among differing objects. We are particularly interested in two limiting worker types: temporary workers who retain no memory of responses and long-term workers with memory. We first define clustering algorithms for these limiting cases and then integrate them into an algorithm for the unified worker model. We prove asymptotic consistency of the algorithms and establish sufficient conditions on the sample complexity of the algorithm. Converse arguments establish necessary conditions on sample complexity, proving that the defined algorithms are asymptotically order-optimal in cost.

Index Terms: Budget optimality, clustering, crowdsourcing, universal information theory, unsupervised learning

I. INTRODUCTION

Crowdsourcing has grown in recent times as a potent tool for performing complex tasks using human skill and knowledge. It is increasingly being used to collect training data for novel machine learning problems. Almost a fortiori, there is no prior knowledge on the nature of the task and so the use of general human intelligence has been needed [1]. As such, this setting requires processing human-generated signals in the absence of prior knowledge about their properties; this setting requires universality. Because crowdsourcing often employs unreliable workers, the signals they generate are inherently noisy [2]. Hence, responses of crowd workers are modeled as outputs of a noisy channel.
Although these channels are unknown to the employer, crowdsourcing techniques have thus far made assumptions on either the channel distribution or structure to design appropriate decoders. We define an alternative approach, universal crowdsourcing, that designs decoders without channel knowledge, and develop achievability and converse arguments that demonstrate order-optimality.

The emergence of diverse online crowdsourcing platforms such as Amazon Mechanical Turk and Upwork (formerly oDesk) has created the option of choosing between temporary workers and long-term workers. That is, tasks can be completed either by soliciting responses from a large number of workers performing small parts of a large task, or by a specialized group of employees who work long-term on the task at hand. The trade-off between reliability and cost of each type of worker pool warrants systematic study. Whereas temporary workers are inexpensive and easily available, some labor economists argue the excess cost of long-term employment is worthwhile due to the reliability and quality of work it ensures [3]. However, no quantitative characterization for this conjecture exists. As we will see, the results of this paper allow such comparisons.

Since workers are human, they are subject to several factors that affect human decision making, as identified in behavioral economics [4]. A standard assumption in crowdsourcing and in universal clustering [5], [6] has been independent and identically distributed (i.i.d.) worker responses across time/tasks. However, empirical evidence argues against temporal independence for individual worker responses [7]. Due to the availability heuristic, worker responses may rely on the immediate examples that come to a person's mind, indicating memory in responses across tasks/time. Further, this influence may be due more to salient (vivid) information than to full statistical history.
Due to the anchoring and adjustment heuristic, people tend to rely excessively on a specific trait of an object in decision making; further, due to the representativeness heuristic, people tend to assume commonality among objects. (This work was supported in part by NSF Grants IIS-1550145 and CCF-1623821, and by Air Force STTR Contract FA8650-16-M-1819. The authors are with the University of Illinois at Urbana-Champaign.) These traits indicate there is a notion of distance among the response distributions corresponding to different objects.

To capture memory and distance, we define a unified model of worker responses, and then study two limiting cases. First we consider responses of temporary workers who respond independently across tasks and across workers. We then consider responses of long-term workers with object-specific memory. Specifically, we consider a Markov memory model wherein the response to an object depends on the most recent response and the response to the most recent occurrence of an object of the same class; generalizations to other Markov models follow readily. In both cases, we address questions of universality and sample complexity for reliable clustering, providing benchmarks for worst-case performance.

A. Prior Work

There is a vast and rich literature on crowdsourcing and clustering; we describe a non-exhaustive listing of particularly relevant prior work.

Algorithm design for crowdsourcing typically focuses on minimizing the cost of reliability. In particular, algorithms with order-optimal budget-reliability trade-offs have been designed for binary classification with unknown (but i.i.d.) crowd reliabilities [5]. Efficient algorithms for multi-class labeling have been proposed in [8], albeit without cost optimality guarantees. More recently, non-parametric permutation models of crowd workers were considered for binary classification [9].
An alternate strategy for multi-class labeling with workers lacking sufficient domain expertise is to decompose the overall task into simpler subtasks and introduce redundancy through an error control code [10]. This approach is implicitly effective for mismatched crowdsourcing for speech transcription [11].

Separate from crowdsourcing, the problem of clustering has been widely studied. Algorithms such as k-means clustering and its generalization to other Bregman divergence similarity measures [12] are popular methods that incorporate distance-based clustering. The problem of universal clustering was considered in a communication setting [6], [13], such that messages communicated across an unknown channel, after encoding using a random codebook, are clustered by exploiting dependence among outputs of similar messages. In particular, the decoder uses the minimum partition information functional [14] to perform optimal clustering. Similar information-based agglomerative clustering schemes have also been explored [15]. Classification using crowdsourced responses in a clustering framework, followed by a labeling phase performed by a domain expert, has been studied experimentally [16].

B. Main Contributions

In this work, we provide a theoretical study of universal crowdsourcing for clustering. In particular, we focus on the design of universal clustering algorithms with provable asymptotic consistency and order optimality in sample complexity. The presence of memory in worker responses demands an approach that differs from past crowd algorithms defined for i.i.d. models [5]. Notwithstanding [10], in the crowdsourcing framework herein, we do not have the opportunity to encode messages. This calls for a treatment different from past work in universal clustering [6].
We first consider the case of temporary workers without memory, wherein we design a distance-based universal clustering decoder that uses distributional identicality among objects of the same class. That is, two objects are clustered together if the f-divergence between the conditional distributions of the responses is small. We prove asymptotic consistency of the algorithm and prove order optimality in sample complexity. The algorithm applies directly to a large class of similarity measures.

We then consider long-term employees with object-specific memory. Specifically, we consider a Markov memory model wherein the response to an object depends on the most recent response and the response to the most recent occurrence of an object of the same class. For this model, we show the existence of information-based universal clustering strategies that perform asymptotically reliable clustering. Further, we study the sample complexity of the decoder and show order optimality by comparing with the necessary cost. We also highlight the extension of the algorithm to higher-order Markov memory models. We also observe that the universal clustering algorithm is structurally similar to traditional clustering algorithms such as MIRN (mutual information relevance networks) clustering [17] and minimum partition information clustering [14] under added constraints on the channel model.

Finally, we use results obtained for these two limiting cases to construct a clustering algorithm for a unified worker model. We prove asymptotic consistency. Further, we show order optimality of the sample complexity as a function of the number of objects to be clustered.

II. MODEL

Fig. 1. Model of the crowdsourcing system.

A. System

We formalize the model of the crowdsourcing system by first describing a unified worker model and then specializing to the cases of workers with and without memory.
We aim to design universal decoders that cluster a given set of objects using the crowd responses. For any index vector $S$, let $Z_S$ be the set $\{Z_i : i \in S\}$, where $Z$ is any vector; similar notation is used for a matrix of indices $S$ and matrix $Z$.

We consider the problem of crowd workers employed to perform classification of objects. For instance, consider the task of classifying images of dogs according to their breeds. The workers observe images and respond with the breed of the dog in the image. Since worker responses are noisy, in the absence of knowledge of worker channels it is not feasible to identify the labels (breeds of dogs) accurately. Thus, we aim to cluster the dogs according to their breeds and determine the labels of each cluster by using a domain expert. The crowdsourcing system model is depicted in Fig. 1.

Let $\mathcal{T}$ be a finite alphabet of object classes. Without loss of generality, let us assume that $\mathcal{T} = [\tau] = \{1, \dots, \tau\}$, where $\tau < \infty$ is a constant. Each object is viewed by crowd workers as $X \in \mathcal{X}$, which has some relation to its type $T \in \mathcal{T}$. Let the set of objects to be clustered be $\{X_1, \dots, X_\ell\}$, and let the label of object $X_i$ be $T_i$, for all $i \in [\ell]$. That is, the objects to be clustered are treated as manifestations of the various object labels, so clustering the set of objects is the same as clustering according to their object labels. Let us assume that the objects are drawn according to an unknown prior $P_T(\cdot)$ on the set of classes.

For each object $X_i$, $i \in [\ell]$, the crowdsourcing system solicits $n$ responses, $Y_i^n$, from workers employed to classify the objects according to their labels. The collection of responses is given by the matrix $Y = (Y_1^n, \dots, Y_\ell^n)^{\mathsf{T}} = [Y_{i,j}]_{i \in [\ell],\, j \in [n]}$. Let $W \in \mathcal{W}$ be the index of a worker and let $S_W \subset [\ell] \times [n]$ be the index set corresponding to the responses offered by worker $W$ in $Y$.
We assume that $S_w \cap S_{\hat{w}} = \emptyset$ for any two workers $w \neq \hat{w}$, and $\bigcup_{w \in \mathcal{W}} S_w = [\ell] \times [n]$. Let $\mathcal{Q}$ be the set of all conditional probability mass functions (pmfs) that characterize worker responses. Then, $Y_{S_W} \sim Q^{(W)}(Y_{S_W} \mid T_{\hat{S}_W})$, where $Q^{(w)}(\cdot) \in \mathcal{Q}$ is the distribution characterizing the responses of worker $w \in \mathcal{W}$ and $\hat{S}_W$ is the corresponding set of object indices. Models of these distributions are detailed later.

With regard to the example of classifying dog images, the object label $T$ is the breed of the dog in an image; the object $X$ is the image of a particular dog; and the responses of the crowd workers, $Y^n$, are the breeds to which they assign the image.

Since clustering is performed solely based on the responses of the workers, it is fair to assume that the number of clusters that can be formed is directly dependent on $|\mathcal{Y}|$. However, it is not essential for every worker to answer every question in a practical crowdsourcing platform. Thus we assume that the workers either respond with an answer in $\mathcal{T}$ or offer a 'null' response $\xi$ to every task [18]. Thus, without loss of generality, we assume that $Y_{ij} \in \mathcal{Y} = \mathcal{T} \cup \{\xi\}$, for all $i \in [\ell]$, $j \in [n]$.

B. Universal Clustering Performance

Definition 1 (Correct Clustering): A clustering of a set of objects $X_1, \dots, X_\ell$ is a partition $\mathcal{P}$ of $[\ell]$. The sets of a partition are referred to as clusters. The clustering is said to be correct if, for all $i, j \in [\ell]$, $i$ and $j$ lie in a common cluster $C \in \mathcal{P}$ if and only if $T_i = T_j$.

For a given set of object labels $T^\ell$, let $\mathcal{P}^*_{T^\ell}$ be the correct clustering. Let $\mathscr{P}$ be the set of all partitions of $[\ell]$.

Definition 2 (Partition Ordering): A partition $\mathcal{P}$ is finer than $\mathcal{P}'$, written $\mathcal{P} \preceq \mathcal{P}'$, if for all $C \in \mathcal{P}$ there exists $C' \in \mathcal{P}'$ such that $C \subseteq C'$. Similarly, a partition $\mathcal{P}$ is said to be denser than $\mathcal{P}'$ if $\mathcal{P}' \preceq \mathcal{P}$.
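As a concrete illustration of Definitions 1 and 2, the following sketch (the helper names are ours, not the paper's) builds the correct clustering from a label sequence and checks the refinement ordering:

```python
# Hypothetical helpers illustrating Definitions 1 and 2.
def is_finer(P, P_prime):
    """Definition 2: P is finer than P' iff every cluster of P
    is contained in some cluster of P'. Partitions are lists of sets."""
    return all(any(C <= C2 for C2 in P_prime) for C in P)

def correct_clustering(labels):
    """Definition 1: group object indices by their labels T_i."""
    clusters = {}
    for i, t in enumerate(labels):
        clusters.setdefault(t, set()).add(i)
    return list(clusters.values())

labels = [1, 2, 2, 1, 3]
P_star = correct_clustering(labels)   # clusters {0, 3}, {1, 2}, {4}
P_fine = [{0}, {3}, {1, 2}, {4}]      # splits the cluster {0, 3}
assert is_finer(P_fine, P_star)
assert not is_finer(P_star, P_fine)   # P_star is denser than P_fine
```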
Definition 3 (Universal Clustering Decoder): A universal clustering decoder is a sequence of functions $\Phi^{(n)} : \mathcal{Y}^{\ell \times n} \to \mathscr{P}$ that are designed in the absence of knowledge of $\mathcal{Q}$ and $P_T$. Here the index $n$ corresponds to the number of crowd responses collected per object.

We now define how to characterize decoder performance.

Definition 4 (Error Probability): Let $\Phi^{(n)}(\cdot)$ be a universal decoder. Then the error probability is given by
$$P_e(\Phi^{(n)}) = \Pr\left[\Phi^{(n)}(Y) \neq \mathcal{P}^*_{T^\ell}\right] = \mathbb{E}_{P_T^{\otimes \ell}}\left[\mathbb{E}\left[\mathbf{1}\left\{\Phi^{(n)}(Y) \neq \mathcal{P}^*_{T^\ell}\right\} \,\middle|\, T^\ell\right]\right], \qquad (1)$$
where $P_T^{\otimes \ell}(t^\ell) = \prod_{i=1}^{\ell} P_T(t_i)$ and $\mathbf{1}\{\cdot\}$ is the indicator function.

Definition 5 (Asymptotic Consistency): A sequence of decoders $\Phi^{(n)}$ is said to be universally asymptotically consistent if
$$\lim_{n \to \infty} P_e(\Phi^{(n)}) = 0, \quad \text{for all } P_T \in \mathcal{M}(\mathcal{T}),$$
where $\mathcal{M}(\cdot)$ is the space of all prior distributions on the set of classes $\mathcal{T}$.

Definition 6 (Sample Complexity): Let $\epsilon > 0$ be the permissible error margin. Then the sample complexity of the universal clustering problem is
$$N^*(\epsilon) = \min\left\{ n \in \mathbb{N} : \max_{P_T \in \mathcal{M}(\mathcal{T})} P_e(\Phi^{(n)}) < \epsilon \right\},$$
where the minimum is taken over the set of all sequences of universal decoders $\Phi^{(n)}$. For simplicity, we write $\Phi$ for $\Phi^{(n)}$ when clear from context.

C. Workers

We now define a model to characterize crowd worker responses. Let us assume the crowdsourcing system employs $n$ crowd workers chosen at random from $\mathcal{W}$; without loss of generality, let the set of workers chosen be $[n]$. We assume that each worker responds to every object $X_i$, $i \in [\ell]$, and the set of responses of worker $j$ is $\{Y_{1,j}, \dots, Y_{\ell,j}\}$. We assume that the responses of each worker $w \in \mathcal{W}$ are drawn according to the conditional distribution $Q^{(w)}$, and that the responses of each worker depend on prior responses in a Markov sense.
In particular, define the set of neighbors as
$$N_i = \{i-1\} \cup \left\{\max\{k \in [i-1] : T_k = T_i\}\right\}$$
for any $i \in [\ell]$, i.e., the most recent object and the most recent occurrence of an object of the same class. The responses of any worker $j \in [n]$, for any $k \leq \ell$, satisfy
$$\Pr\left[(Y_{1,j}, \dots, Y_{k,j}) = (y_1, \dots, y_k) \mid T^k = t^k\right] = \prod_{i=1}^{k} Q^{(j)}(y_i \mid y_{N_i}, t_i). \qquad (2)$$
Additionally, we assume that for any worker $j \in [n]$ and $i \leq \ell$, for every $t \in \mathcal{T}$,
$$\Pr[Y_{i,j} = y \mid T_i = t] = Q^{(j)}(y \mid t). \qquad (3)$$
That is, the worker responses depend on the prior responses in such a way that the marginal conditional distribution of a response, given an object, is identical across objects of the same class (that is, the marginals are invariant across permutations of the given set of objects). We also assume that, given the object, the responses are independent across workers. That is,
$$\Pr\left[(Y_{i,1}, \dots, Y_{i,k}) = (y_1, \dots, y_k) \mid T_i = t\right] = \prod_{j=1}^{k} \Pr[Y_{i,j} = y_j \mid T_i = t]. \qquad (4)$$
Thus, the unified worker model of crowd responses is characterized by (2), (3), and (4).

As mentioned in Section I, in addition to the unified worker model, we focus on two special classes of workers: temporary workers without memory and long-term workers with memory. In particular, when temporary workers are employed, we assume they do not retain memory of their prior responses, so the responses are independent across time. To solicit such responses, we may assume that the crowdsourcing platform delegates each task to a sequence of $n$ workers selected uniformly at random from a sufficiently large crowd to ensure independence of responses across objects. That is, if the index set of responses of worker $w$ is $S_w \subset [\ell] \times [n]$, then
$$\Pr\left[Y_{S_w} = y_{S_w} \mid T^\ell = t^\ell\right] = \prod_{(i,j) \in S_w} Q^{(w)}(y_{i,j} \mid t_i). \qquad (5)$$
Further, the responses also satisfy (4). The second special class of workers we consider is long-term workers with Markov memory.
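The neighbor sets $N_i$ that drive the Markov memory model in (2) can be computed directly from the label sequence; a minimal sketch (function and variable names are our own, not the paper's):

```python
# Sketch of the neighbor sets N_i from the Markov memory model: each object i
# points to the previous object i-1 and to the most recent earlier object
# with the same label (when one exists).
def neighbor_sets(labels):
    """labels[i-1] is T_i (1-indexed objects); returns {i: N_i} for i >= 2."""
    N = {}
    for i in range(2, len(labels) + 1):  # N_1 is empty: no earlier objects
        same = [k for k in range(1, i) if labels[k - 1] == labels[i - 1]]
        N[i] = {i - 1} | ({max(same)} if same else set())
    return N

# Object labels for 7 objects drawn from 3 classes.
labels = [1, 2, 2, 3, 1, 2, 3]
N = neighbor_sets(labels)
assert N[3] == {2}        # previous object is also of the same class
assert N[5] == {4, 1}     # most recent object 4, last class-1 object 1
assert N[7] == {6, 4}     # most recent object 6, last class-3 object 4
```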
In particular, responses of this class of workers are characterized by (2) and (4).

III. TEMPORARY WORKERS

Consider the scenario of clustering using responses of temporary workers (without memory). As mentioned in the system model, we assume worker responses are independent across objects and workers. We first introduce f-divergences and their properties, so as to characterize the quality of crowd responses. We then define the universal clustering algorithm and prove asymptotic consistency and order optimality in sample complexity.

A. f-Divergence

To measure the separation among the conditional distributions of crowd responses to different object classes, we use the Csiszár f-divergence [19], [20].

Definition 7 (f-divergence): Let $p, q$ be discrete probability distributions defined on an alphabet of size $m$. Given a convex function $f : [0, \infty) \to \mathbb{R}$, the f-divergence is defined as
$$D_f(p \| q) \triangleq \sum_{i=1}^{m} q_i\, f\!\left(\frac{p_i}{q_i}\right). \qquad (6)$$
The function $f$ is said to be normalized if $f(1) = 0$.

Some specific f-divergences are the KL divergence $D(p\|q)$ and the total variational distance $\delta(p, q)$; these are the f-divergences corresponding to the functions $f(x) = x \log x$ and $f(x) = |x - 1|$, respectively. We now state some bounds for f-divergences.

Theorem 1 ([21, Chapter II.1]): Let $p, q$ be discrete probability distributions on an alphabet of size $m$ such that there exist $r, R$ satisfying $0 \leq r \leq p_i/q_i \leq R \leq \infty$ for all $i \in [m]$. Let $f : [0, \infty) \to \mathbb{R}$ be a convex and normalized function satisfying the following criteria:
1) $f$ is twice differentiable on $[r, R]$, and
2) there exist real constants $c, C < \infty$ such that $c \leq x f''(x) \leq C$ for all $x \in (r, R)$.
Then we have
$$c\, D(p\|q) \leq D_f(p\|q) \leq C\, D(p\|q). \qquad (7)$$
For ease, we refer to the constraints on $f$ in Theorem 1 as smoothness constraints.
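As a numerical sanity check of Definition 7 and the sandwich bound (7), the sketch below (helper names are ours) computes $D_f$ for $f(x) = (x-1)^2$, the chi-square divergence, which satisfies the smoothness constraints with $c = 2r$ and $C = 2R$ since $x f''(x) = 2x$; the KL divergence is taken in nats so that these constants apply directly:

```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) per Definition 7; assumes q_i > 0 for all i."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

# KL divergence (in nats) is the f-divergence for f(x) = x log x.
kl = f_divergence(p, q, lambda x: x * math.log(x))

# Chi-square divergence: f(x) = (x - 1)^2, so x f''(x) = 2x and we may
# take c = 2r, C = 2R over the ratio range [r, R] in Theorem 1.
chi2 = f_divergence(p, q, lambda x: (x - 1) ** 2)
ratios = [pi / qi for pi, qi in zip(p, q)]
r, R = min(ratios), max(ratios)
assert 2 * r * kl <= chi2 <= 2 * R * kl   # the sandwich bound (7)
```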
If $f$ is twice differentiable on $[r, R]$, then there exists a constant $L$ such that $f$ is $L$-Lipschitz on $[r, R]$.

Theorem 2 ([21, Chapter II.3]): Let $f : [0, \infty) \to \mathbb{R}$ be convex, normalized, and $L$-Lipschitz on $[r, R]$. Then,
$$0 \leq D_f(p\|q) \leq L\, \delta(p, q). \qquad (8)$$
Further, Pinsker's inequality lower bounds the KL divergence with respect to the total variational distance as
$$D(p\|q) \geq (2 \log_2 e)\, \delta^2(p, q). \qquad (9)$$

Corollary 1: For any convex and normalized function $f$ that satisfies the smoothness constraints and is $L$-Lipschitz,
$$\kappa\, \delta^2(p, q) \leq D_f(p\|q) \leq L\, \delta(p, q), \qquad (10)$$
where $\kappa = 2c \log_2 e$.

Proof: The result follows from Theorems 1 and 2, and (9).

These inequalities are used to prove consistency of the designed universal clustering decoder.

B. Task Difficulty for Worker Pool

Let $Q_1, \dots, Q_\tau$ be the conditional response distributions given the object class, defined as
$$Q_i(y) \triangleq \Pr[Y = y \mid T = i] = \mathbb{E}\left[Q^{(W)}(y \mid i)\right],$$
where the expectation is taken over the workers in the pool. Since the responses are obtained from temporary workers chosen at random, it suffices to consider these expected conditional response distributions.

Definition 8 (Distance Quality): For a given pool of temporary workers, the difficulty of the tasks is defined as
$$\theta_d \triangleq \min_{i, j \in \mathcal{T},\, i \neq j} D_f(Q_i \| Q_j), \qquad (11)$$
where $D_f$ is the f-divergence chosen as the notion of similarity for the problem at hand. The operational significance of this informational definition of distance (Definition 8) will emerge in the coding theorems, Lemma 1 and Theorem 3.

Clustering is performed using the maximum likelihood estimates of the f-divergence between distributions corresponding to responses to objects. The empirical estimates are asymptotically consistent, and the rates of convergence are discussed in Appendix A.
C. Universal Clustering Using Temporary Workers

Responses to objects of the same class are identical in distribution. Thus, we perform universal clustering, $\Phi_{\mathrm{temp}}(Y)$, according to Algorithm 1. That is, the algorithm identifies the cliques in the graph obtained by thresholding the f-divergence between the corresponding empirical distributions. The functioning of the algorithm is depicted in Fig. 2.

Algorithm 1 Clustering with temporary workers, $\Phi_{\mathrm{temp}}(Y)$
  if $f(x) = |x-1|$ then
    $\gamma_n \leftarrow c_1 n^{-\alpha}$, where $c_1$ is a constant and $\alpha \in [0, 1/2]$
  else
    $\gamma_n \leftarrow c_1 n^{-\beta}$, where $c_1$ is a constant and $\beta \in [0, 1]$
  end if
  Determine empirical distributions $q_j$, $j \in [\ell]$
  Construct $G = ([\ell], E)$ s.t. $(i, j) \in E \iff D_f(q_i \| q_j) \leq \gamma_n$
  $\mathcal{C} = \{C : C \text{ is a maximal clique in } G\}$
  Select minimal weight partition of $[\ell]$ from $\mathcal{C}$

Lemma 1: For $f(x) = |x-1|$ and $\gamma_n = c_1 n^{-\alpha}$, $\alpha \in [0, 1/2]$, let $\gamma_n < \theta_d/2$. For any other convex function $f$ satisfying the smoothness constraints, let
$$\gamma_n = c_1 n^{-\beta} < \frac{\kappa}{2L}\theta_d^2 + \frac{2\kappa^2}{L^2}\left(1 - \sqrt{1 + \frac{L\theta_d^2}{2}}\right),$$
where $\kappa = 2c \log_2 e$ and the function $f$ is $L$-Lipschitz. Define the ball of radius $\rho$ centered at $p$ as $B_f(p, \rho) = \{q : D_f(p\|q) \leq \rho\}$. If for all $i \in [\ell]$ the empirical distribution of responses satisfies $q_i \in B_f(Q_{t_i}, \gamma_n/2)$, then $\Phi_{\mathrm{temp}}(Y) = \mathcal{P}^*(t^\ell)$, the correct clustering of the set of objects.

Proof: Let us first consider $f(x) = |x-1|$. Since $q_i \in B_f(Q_{t_i}, \gamma_n/2)$ for all $i \in [\ell]$, we have
$$\max_{\{i, j \in [\ell]:\, T_i = T_j\}} \delta(q_i, q_j) \leq \gamma_n \leq \frac{\theta_d}{2} \leq \min_{\{i, j \in [\ell]:\, T_i \neq T_j\}} \delta(q_i, q_j).$$
Let $C_i = \{j \in [\ell] : T_j = i\}$ for any $i \in [\tau]$. Then, for $j, k \in C_i$, $D_f(q_j \| q_k) < \gamma_n$ and so $(j, k) \in E$; thus $C_i$ is a clique of $G$. Further, this observation also implies that for any $i \in C_t$, $j \in C_{t'}$ with $t \neq t'$, $D_f(q_i \| q_j) > \gamma_n$ and so $(i, j) \notin E$.
Thus, any set $C \subseteq [\ell]$ containing $i, j$ with $T_i \neq T_j$ is not a clique in $G$. Thus $C_i$ is a maximal clique in $G$ for all $i \in [\tau]$, and $\Phi_{\mathrm{temp}}(Y) = \{C_i : i \in [\tau]\} = \mathcal{P}^*(t^\ell)$, the correct partition.

For the second part of the lemma, from (10), we note that the condition on $\gamma_n$ guarantees
$$\max_{\{i, j \in [\ell]:\, T_i = T_j\}} D_f(q_i \| q_j) \leq \gamma_n < \min_{\{i, j \in [\ell]:\, T_i \neq T_j\}} D_f(q_i \| q_j).$$
Thus the result follows from a very similar argument.

From Lemma 1, we observe that when the empirical distributions are sufficiently close to the corresponding true distributions, the algorithm outputs the correct clustering. Using this result, we prove consistency of the algorithm.

Theorem 3: If $f(x) = |x-1|$, then for any $\alpha \in (0, 1/2)$ and constant $c_1$, for
$$n \gtrsim \max\left\{ \left(\frac{2 c_1}{\theta_d}\right)^{1/\alpha},\; \left(\frac{4 \log \ell}{c_0 c_1^2}\right)^{1/(1-2\alpha)} \right\} \qquad (12)$$
sufficiently large, $\Phi_{\mathrm{temp}}(\cdot)$ achieves arbitrarily low clustering error probability. For fixed $\ell$ and $\theta_d$, it is universally asymptotically consistent. For any other convex, normalized function $f$ satisfying the smoothness constraints, for any $\beta \in (0, 1)$ and constant $c_1$, for
$$n \gtrsim \max\left\{ \left(\frac{2 c_1 L^2 \mu_{\theta_d}}{\kappa}\right)^{1/\beta},\; \left(\frac{C \log \ell}{c_1}\right)^{1/(1-\beta)} \right\}, \qquad (13)$$
where
$$\mu_{\theta_d} = \left[ L\theta_d^2 + 4\kappa\left(1 - \sqrt{1 + \frac{L\theta_d^2}{2}}\right) \right]^{-1},$$
with sufficiently large constant $c_1$, $\Phi_{\mathrm{temp}}(\cdot)$ achieves arbitrarily low clustering error probability. For fixed $\ell$ and $\theta_d$, it is universally asymptotically consistent.

Fig. 2. Distance-based clustering of 9 objects of 3 types. The graph is obtained by thresholding the f-divergences of empirical distributions of responses to objects. The clustering is then done by identifying the maximal cliques in the thresholded graph.

Proof: For $f(x) = |x-1|$ and $n \geq (2c_1/\theta_d)^{1/\alpha}$, we have $\gamma_n \leq \theta_d/4$.
Thus, when $f(x) = |x-1|$, we can bound the error probability as
$$P_e(\Phi_{\mathrm{temp}}) \leq \mathbb{E}_{P_T^{\otimes \ell}}\left[\Pr\left[\exists\, i \in [\ell] : q_i \notin B(Q_{t_i}, \gamma_n/2)\right]\right] \leq \ell\, (n+1)^{|\mathcal{Y}|} \exp\left(-\frac{c_0 n \gamma_n^2}{4}\right) \qquad (14)$$
$$= \exp\left( \log \ell + (\tau + 1)\log(n+1) - \frac{c_0 c_1^2}{4}\, n^{1-2\alpha} \right),$$
where (14) follows from the union bound and Lemma 5. Thus, the cost conditions given in (12) and asymptotic consistency follow. Using a similar argument and Lemma 1, we obtain (13).

Corollary 2: Given $\mathcal{T} = [\tau]$ with $\tau < \infty$ a constant:
1) for a constant $\theta_d > 0$, $N^*_{\mathrm{temp}}(\epsilon) = O\!\left((\log \ell)^{1/(1-\beta)}\right)$, and taking $\beta \to 0$, we observe that $N^*_{\mathrm{temp}}(\epsilon) = O(\log \ell)$ for any of the similarity metrics;
2) for a constant $\ell$: for $f(x) = |x-1|$, $N^*_{\mathrm{temp}}(\epsilon) = O\!\left(\theta_d^{-1/\alpha}\right)$, and taking $\alpha \to 1/2$, $N^*_{\mathrm{temp}}(\epsilon) = O(\theta_d^{-2})$. On the other hand, for other convex functions $f$ satisfying the smoothness constraints, $N^*_{\mathrm{temp}}(\epsilon) = O(\theta_d^{-1/\beta})$, and specifically, taking $\beta \to 1$, $N^*_{\mathrm{temp}}(\epsilon) = O(\theta_d^{-1})$.

Proof: The results follow directly from Theorem 3.

To summarize, this subsection has defined a universal clustering algorithm that is asymptotically consistent and has described sufficient conditions on the sample complexity of universal clustering.

D. Lower Bound on Sample Complexity

We now show matching lower bounds on the sample complexity for universal clustering.

Theorem 4: Let $\min_{i, j \in [\tau]} \delta(Q_i, Q_j) = \theta_d$. Then the sample complexity of universal clustering satisfies
$$N^*_{\mathrm{temp}} = \Omega\left( \frac{\log \ell}{\theta_d^2} \right).$$

Proof: Consider the prior distribution $P_T$ such that $P_T(1) = P_T(2) = 0.5$. Let $\psi_{ij}$ be the binary hypothesis testing problem given by
$$\psi_{ij} : \begin{cases} H_0 : T_i = T_j \\ H_1 : T_i \neq T_j. \end{cases} \qquad (15)$$
There exist $\binom{\ell}{2}$ such binary hypothesis tests. Choose a set of tests $\tilde{\psi} \subset \{\psi_{ij} : i, j \in [\ell], i \neq j\}$ of cardinality $\lfloor \ell/2 \rfloor$ such that no two tests in the set share a common object. Then these binary hypothesis tests are independent of each other, owing to the independence across objects.
Let $\Phi$ be a decoder for the clustering problem. A correct solution to the clustering problem implies a correct solution to $\psi_{ij}$ for all $i, j \in [\ell]$, $i \neq j$; in particular, an instance of correct clustering translates to correct decisions in all tests in $\tilde{\psi}$. Thus,
$$P_e(\Phi) \geq 1 - \prod_{\{i, j \in [\ell]:\, \psi_{ij} \in \tilde{\psi}\}} \left(1 - \Pr[\text{error in } \psi_{ij}]\right) \qquad (16)$$
$$\geq 1 - \prod_{\{i, j \in [\ell]:\, \psi_{ij} \in \tilde{\psi}\}} \left(1 - \tfrac{1}{4}\exp(-2 n B_{ij})\right) \qquad (17)$$
$$\geq 1 - \left(1 - \tfrac{1}{4}\exp(-2 n B_{\max})\right)^{\lfloor \ell/2 \rfloor} \qquad (18)$$
$$= \frac{\lfloor \ell/2 \rfloor}{4} \exp(-2 n B_{\max}) + o\left(\exp(-4 n B_{\max})\right), \qquad (19)$$
where (16) follows from the independence of the binary tests and (17) follows from the Kailath lower bound [22]. Here $B_{ij}$ is the Bhattacharyya distance corresponding to the hypotheses of the test $\psi_{ij}$, and by non-triviality there exists a test such that $B_{ij} > 0$. Bounding from below by the test with maximum distance in (18), where $B_{\max} = \max_{\{i, j \in [\ell]:\, \psi_{ij} \in \tilde{\psi}\}} B_{ij} > 0$, and using the binomial expansion, we obtain (19).

Now, using Jensen's inequality, we have
$$B_{ij} \leq \frac{1}{2}\left( D(Q_1 \| Q_2) + D(Q_2 \| Q_1) \right).$$
This follows from the definition of the binary hypothesis tests and the independence of samples. From Pinsker's inequality and the reverse Pinsker inequality [23], we have
$$(2 \log_2 e)\, \delta^2(P, Q) \leq D(P \| Q) \leq \frac{4 \log_2 e}{Q_{\min}}\, \delta^2(P, Q),$$
where $Q_{\min} = \min_{x \in \mathrm{supp}(P)} Q(x)$ and $\mathrm{supp}(\cdot)$ is the support of the distribution. Since we are concerned with the sample complexity in the worst case, when $\delta(P, Q) \to 0$, it suffices to consider $Q_{\min} > 0$. Thus, the above bounds indicate that
$$D(Q_1 \| Q_2) \asymp D(Q_2 \| Q_1) \asymp \theta_d^2,$$
where we write $g \asymp h$ if there exist constants $a, b > 0$ such that $a h \leq g \leq b h$. Thus, we have
$$P_e(\Phi) \geq \frac{1}{8} \exp\left( \log \ell - n c \theta_d^2 \right),$$
where $c$ is a constant scaling based on Pinsker's and reverse Pinsker's inequalities. From this we observe that $N^*_{\mathrm{temp}} = \Omega\!\left(\frac{\log \ell}{\theta_d^2}\right)$.

Corollary 3: Let $f(\cdot)$ be a convex function satisfying the smoothness constraints.
Further, let $\min_{i, j \in [\tau]} D_f(Q_i \| Q_j) = \theta_d$. Then, for constant $\ell$, $N^*_{\mathrm{temp}} = \Omega(\theta_d^{-1})$.

Proof: From (10) we have $\min_{i, j \in [\tau]} \delta(Q_i, Q_j) \lesssim \sqrt{\theta_d}$. This proves the result.

Thus, we see that the universal clustering decoder achieves the lower bound up to a constant factor in sample complexity. Hence our clustering algorithm is asymptotically order optimal in the number of objects to be clustered and in the minimum separation of hypotheses. It is worth noting that the quantity $\theta_d^2$ is equivalent in definition to the crowd quality defined in [5] and matches the lower bound obtained on the cost for binary classification using crowd workers biased toward giving the right label. The factor of $\log \ell$ in the cost per object arises since the error probability studied here is the block (blocklength $\ell$) error probability, whereas [5] studies the average symbol (classification) error probability.

Fig. 3. Bayesian network model of responses to a set of 7 objects chosen from a set of 3 types. We observe that each response is influenced by the most recent response and the response to the most recent object of the same type.

IV. WORKERS WITH MEMORY

Recall that in Section II we defined the structure of the stochastic kernel that determines the responses of workers with memory. In particular, we considered a Markov memory structure (2). This structure is represented in the Bayesian network depicted in Fig. 3. Specifically, we assume that the response $Y_{i,j}$ to an object $X_i$ by worker $j$ depends on the response to the most recent object, $X_{i-1}$, and the response to the most recent object of the same class. This set of indices for any object $X_i$ is given by $N_i$.

A. Task Difficulty for Worker Pool

Let $\mathcal{Q}$ be the set of such Markov-structured distributions representing the worker pool. We assume the workers are chosen independently and identically from this set according to some underlying distribution.
Thus, the conditional distribution of the response vector for a random worker is
$$\Pr\left[Y^\ell = y^\ell \mid T^\ell = t^\ell\right] = \mathbb{E}\left[Q\left(Y^\ell = y^\ell \mid T^\ell = t^\ell\right)\right] = \bar{Q}\left(y^\ell \mid t^\ell\right),$$
where the expectation is taken over the worker distribution. A sufficient statistic of the worker responses is the empirical distribution, and asymptotically the empirical pmf converges to $\bar{Q}$ by the strong law of large numbers. It thus suffices to study the decoder with regard to this characteristic worker response pmf, which retains the assumed memory properties. Throughout the section, for any $i \in [\ell]$, denote by $\tilde{\imath}$ the index such that $\tilde{\imath} \in N_i$, $T_i = T_{\tilde{\imath}}$.

Definition 9 (Memory Quality): The memory quality in a given pool of long-term workers is
$$\theta_m = \frac{1}{2}\left( \min_{i \in [\ell]} I(Y_i; Y_{N_i}) - \max_{i \in [\ell],\, j < i,\, j \neq \tilde{\imath}} I(Y_i; Y_{i-1}, Y_j) \right).$$

Lemma 2: Given (2), let $\nu_i$ be the set of maximizers $\arg\max_{j \leq i-1} I(Y_i; Y_{i-1}, Y_j)$. If $|\nu_i| > 1$, then $\nu_i \setminus \{i-1\}$ is in the same cluster as $i$.

Proof: The results follow from the model definition.

We use the data processing inequality to obtain the following property, which motivates the decoder construction.

Lemma 3: Let $C = \{c_1, \dots, c_k\}$ and, without loss of generality, let $c_1 < c_2 < \cdots < c_k$. Given (2), $C \in \mathcal{P}^*$ if and only if, for all $j \in [k]$ with $j > 1$,
$$c_{j-1} \in \arg\max_{i \leq c_j - 1} I\left(Y_{c_j}; Y_{c_j - 1}, Y_i\right).$$

Proof: Since $\theta_m > 0$, the most recent object of the same class has residual information given any other pair from the past. This in turn implies that for all $j < i$, $j \neq \tilde{\imath}$, $I(Y_i; Y_{i-1}, Y_j) < I(Y_i; Y_{N_i})$. The result thus follows.

It is evident that the partition can be obtained through a careful elimination process using mutual information values. Maximum likelihood estimates of the mutual information can be obtained from the samples using asymptotically consistent estimators. Such estimators converge exponentially; convergence rates are detailed in Appendix B.

C. Information Clustering Algorithm

We now describe the clustering algorithm in two stages.
First, we describe an algorithm that, given the set of objects and the mutual information values, outputs a partition that may be coarser than the correct one. We then describe how this shortcoming is overcome by recursively identifying sub-clusters within the identified clusters. Finally, we show correctness of the algorithm and prove that it is asymptotically consistent when the ML estimates of the mutual information are used.

From the directed acyclic graph (Bayesian network) corresponding to the given set of objects, we know that identifying the parents of each node suffices to identify clusters such that objects of the same type lie in the same cluster. From Lemma 3, for any i ∈ [ℓ] and j < i with j ≠ ĩ, I(Y_i; Y_{i−1}, Y_j) < I(Y_i; Y_{N_i}). Thus identifying the parents of node i is equivalent to solving
η_i = argmax_{j ≤ i−1} I(Y_i; Y_{i−1}, Y_j).
Using this property we design Algorithm 2, Φ_info(I), which outputs a partition of the set of objects when the corresponding mutual information values are given as input. The algorithm starts from the last object and iterates backward, finding the parents of each node; upon identification, the parent is added to the same cluster as the object.

Theorem 5: Given a set of objects T^ℓ and the corresponding set of mutual informations {I(Y_i; Y_{i−1}, Y_j) : i ∈ [ℓ], j < i}, the output of Alg. 2 satisfies P = Φ_info(I) ⪰ P*, i.e., every cluster of P* is contained in some cluster of P.

Fig. 4. Functioning and shortcoming of Alg. 2. (a) Alg. 2 outputs P = {{1, 5}, {2, 3, 6}, {4, 7}} = P*. Here, object 4 of type 3 is not clustered with object 3 of type 2, as object 3 is already assigned a cluster. (b) Alg. 2 outputs P = {{1, 4, 5, 7}, {2, 3, 6}} ⪰ P*. Here object 5 is clustered with 4, as objects of type 1 have not been encountered before.
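The backward parent-finding pass of Alg. 2 can be sketched on the 7-object, 3-type example of Fig. 4(a). The numeric mutual-information values below are oracle stand-ins (only their ordering matters): I(Y_i; Y_{i−1}, Y_j) is assumed largest at j = ĩ, the most recent earlier object of the same type, as Lemma 3 guarantees.

```python
# A minimal sketch of Phi_info (Alg. 2) on oracle MI values.
types = [1, 2, 2, 3, 1, 2, 3]          # objects 1..7 as in Fig. 4(a)
ell = len(types)

def recent_same_type(i):
    """Most recent j < i with the same type as object i, or None."""
    for j in range(i - 1, 0, -1):
        if types[j - 1] == types[i - 1]:
            return j
    return None

def I(i, j):
    """Oracle stand-in for I(Y_i; Y_{i-1}, Y_j); maximized at j = i~."""
    return 2.0 if j == recent_same_type(i) else 0.5

def phi_info():
    cluster = {}                        # object -> cluster id
    next_id = 0
    for i in range(ell, 1, -1):         # iterate backward over objects
        if i not in cluster:
            cluster[i] = next_id
            next_id += 1
        # parent eta_i = argmax_{j < i} I(Y_i; Y_{i-1}, Y_j)
        eta = max(range(1, i), key=lambda j: I(i, j))
        if eta not in cluster:          # attach only unassigned parents
            cluster[eta] = cluster[i]
    if 1 not in cluster:
        cluster[1] = next_id
    groups = {}
    for obj, c in cluster.items():
        groups.setdefault(c, set()).add(obj)
    return {frozenset(g) for g in groups.values()}

P = phi_info()
print(P)   # {{1, 5}, {2, 3, 6}, {4, 7}}, matching Fig. 4(a)
```

Note how object 4 (first of its type) finds no informative parent, but the wrong merge is blocked because the candidate parent is already assigned, exactly the behavior described for Fig. 4(a).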
Algorithm 3 Clustering with memory, P = Φ_mem(T^ℓ)
  Choose constant k = ⌈−log ε / (log ℓ − 2 log τ)⌉
  for i = 1 to k do
    Choose a permutation ξ([ℓ]) uniformly at random
    Collect responses Y^(i) for the sequence of objects T_ξ([ℓ])
    Compute I = {Î(Y_j; Y_{j−1}, Y_k) : k < j, j ∈ [ℓ]}
    P_i = Φ_info(I)
  end for
  Output the finest partition P consistent with P_1, . . . , P_k, i.e., the common refinement with P_i ⪰ P for all i ∈ [k]

Proof: From Lemma 3, I(Y_i; Y_{i−1}, Y_j) ≤ I(Y_i; Y_{N_i}) for all j < i, with equality if and only if j ∈ N_i. Thus the parents of every node i ∈ [ℓ] in the Bayesian network can be determined given the mutual information values. For P = Φ_info(I), for every type t ∈ T there exists C ∈ P such that {i ∈ [ℓ] : T_i = t} ⊆ C. Hence the result follows.

D. Consistency of Universal Clustering

Theorem 5 indicates that, given the mutual information values, objects of the same type are clustered together. The maximizer η_i in Alg. 2 is added to the cluster of object i only if it has not been assigned a cluster in an earlier iteration; in particular, object i is not merged with i − 1 when i − 1 already belongs to another cluster. This scenario is depicted in the Bayesian network of Fig. 4(a). However, the algorithm fails in one specific scenario: when there exist clusters C_1 and C_2 such that i < j for every i ∈ C_1, j ∈ C_2, and max{i ∈ C_1} = min{j ∈ C_2} − 1, the resulting partition contains the single cluster C_1 ∪ C_2 rather than the two individual clusters. This is because objects of C_1 have not yet been encountered, and thus the immediate neighbor of the first object of C_2 is clustered along with C_2 owing to the Markov memory structure. This scenario is depicted in Fig. 4(b). Such shortcomings, however, occur only for outlier orderings, which have low probability when ℓ ≫ τ.
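The merge failure just described is defeated by rerunning Φ_info on random reorderings and combining the outputs. The combination step admits a simple reading, sketched here under the assumption that "finest partition" means the common refinement: two objects share a cluster in P iff they share a cluster in every P_i. The example partitions are illustrative, not taken from the paper.

```python
# Common refinement of several partitions (one reading of the
# "finest partition" step of Alg. 3).
def common_refinement(partitions, objects):
    def label(P, x):
        """Index of the cluster of partition P that contains x."""
        for c_id, cluster in enumerate(P):
            if x in cluster:
                return c_id
        raise ValueError(x)

    groups = {}
    for x in objects:
        # Objects agree on this key iff co-clustered in every P_i.
        key = tuple(label(P, x) for P in partitions)
        groups.setdefault(key, set()).add(x)
    return {frozenset(g) for g in groups.values()}

# P1 wrongly merges {1} with {2, 3} (the Fig. 4(b) style failure);
# P2, from another permutation, merges {1} with {4} instead.
# Intersecting the two recovers the individual clusters.
P1 = [{1, 2, 3}, {4}]
P2 = [{1, 4}, {2, 3}]
P = common_refinement([P1, P2], objects=[1, 2, 3, 4])
print(P)   # {{1}, {2, 3}, {4}}
```

Since each run's output is a coarsening of P*, intersecting across independent reorderings can only split spuriously merged clusters, never break a true one.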
Thus, if the finest partition consistent with the outputs of Φ_info on a collection of uniformly random permutations of the given set of objects is computed, then with high probability the correct partition is obtained. The overall algorithm is summarized in Alg. 3 for a target error probability of at most 2ε. For each permutation of the objects, n responses per object are obtained from the workers, so the overall number of samples per object is kn.

Theorem 6: Let T^ℓ be the set of objects with ℓ ≥ τ², and let Y be the set of responses. Let k ≥ ⌈−log ε / (log ℓ − 2 log τ)⌉ be the number of permutations chosen in Alg. 3. Then, for
n ≳ max{ (log ℓ − log ε)^{1/(1−2α−β)}, (log ℓ − log ε)^{1/(1−4α)}, θ_m^{−1/α} },   (21)
with 0 < α < 1/2 and 0 < β < 1 such that log²(2n) ≤ n^β, we have P_e(Φ_mem) ≤ 2ε for any ε > 0. Further, for constant ℓ and θ_m, the algorithm is asymptotically consistent.

Proof: We first observe that when |Î(Y_i; Y_{i−1}, Y_j) − I(Y_i; Y_{i−1}, Y_j)| < θ_m for all i ∈ [ℓ], j < i, we have Φ_info(Î) = Φ_info(I). That is, when the empirical mutual information values do not deviate significantly from the true values, the clustering algorithm works without error. Let γ_n = c_1 n^{−α}; then for n ≥ (c_1/θ_m)^{1/α}, γ_n < θ_m. Let I_ij = I(Y_i; Y_{i−1}, Y_j). Then,
Pr[Φ_info(Î) ≠ Φ_info(I)] ≤ Σ_{i ∈ [ℓ], j < i} Pr[|Î_ij − I_ij| > γ_n]   (22)
  ≤ ℓ² exp(−nγ_n²/(18 log²(2n)) + o(1))   (23)
  ≤ exp(2 log ℓ − (c_1²/18) n^{1−ν} + o(1)),
implying asymptotic consistency. Here ν = 2α + β, (22) follows from the union bound, and (23) follows from Lemma 6. Additionally, we note that
Pr[Φ_info(Î) ≠ Φ_info(I)] ≤ Σ_{i ∈ [ℓ], j < i} Pr[|Î_ij − I_ij| > γ_n] ≤ 3ℓ²(n + 1)^{τ²} exp(−c̃ n γ_n⁴)   (24)
  ≤ exp(2 log ℓ − c̃ c_1⁴ n^{1−4α} + o(1)),
where (24) follows from (47). To obtain the correct partition, we use the responses generated for several uniformly random permutations of the given set of objects and select the finest consistent partition.
The correct partition may not be recovered if there exist t, t′ ∈ T such that max{i ∈ [ℓ] : T_i = t} = min{i ∈ [ℓ] : T_i = t′} − 1 in every chosen permutation. Let K_t = |{i ∈ [ℓ] : T_i = t}| for t ∈ [τ], and let M_t = max{i ∈ [ℓ] : T_i = t} and m_t = min{i ∈ [ℓ] : T_i = t}. The probability that P = Φ_mem(T^ℓ) ≠ P* is then bounded as
Pr[P ≠ P*] = Pr[∃ t ≠ t′ : M_t = m_{t′} − 1 in every permutation] ≤ Σ_{t,t′ ∈ T, t ≠ t′} Pr[M_t = m_{t′} − 1]^k.
The total number of possible orderings of the given set of objects is
κ_permut = ℓ! / Π_{t̃ ∈ T} K_t̃!.
The number of sequences with M_t = m_{t′} − 1 can be bounded by choosing K_t + K_{t′} − 1 locations out of ℓ − 1 for the objects of types t and t′ and permuting the remaining objects:
κ_{t,t′} ≤ C(ℓ − 1, K_t + K_{t′} − 1) · (ℓ − K_t − K_{t′})! / Π_{t̃ ∈ T \ {t,t′}} K_t̃!.
Since the permutations are chosen uniformly at random,
Pr[M_t = m_{t′} − 1] ≤ κ_{t,t′}/κ_permut = (1/ℓ) · K_t! K_{t′}! / (K_t + K_{t′} − 1)! ≤ 1/ℓ,   (25)
where (25) follows from the fact that
K_t! K_{t′}! / (K_t + K_{t′} − 1)! ≤ 1,   (26)
as K_t, K_{t′} ≥ 1. Thus we have
Pr[P ≠ P*] ≤ (τ²/2)(1/ℓ)^k ≤ exp(−k(log ℓ − 2 log τ)).   (27)
Thus, for k ≥ ⌈−log ε / (log ℓ − 2 log τ)⌉, Pr[P ≠ P*] ≤ ε.

Now we prove the consistency of Alg. 3. Using the union bound, the probability of error is bounded as
P_e ≤ Pr[P ≠ P*] + k Pr[Φ_info(Î) ≠ Φ_info(I)]
  ≤ exp(−k(log ℓ − 2 log τ + log 2)) + exp(2 log ℓ − max{ (c_1²/18) n^{1−ν}, c̃ c_1⁴ n^{1−4α} } + o(1))   (28)
  ≤ 2ε
for large enough n, where (28) follows from the two concentration bounds on empirical mutual information described in Appendix B. Thus, for any ε > 0 there exist n, k sufficiently large such that P_e ≤ 2ε, and the consistency result follows. The sample complexity is obtained from the error exponent in (28).
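The counting argument behind (25)-(26) can be checked exhaustively on a small instance. For type counts K = (2, 2, 2) over ℓ = 6 objects, the bound (1/ℓ) · K_t! K_{t′}! / (K_t + K_{t′} − 1)! = 1/9 turns out to be exact for the adjacency event, and it is below 1/ℓ as (26) promises.

```python
from fractions import Fraction
from itertools import permutations

# Exhaustive check of (25)-(26): probability, over a uniformly random
# ordering, that the last type-1 object lands immediately before the
# first type-2 object.
objs = (1, 1, 2, 2, 3, 3)
ell = len(objs)
seqs = set(permutations(objs))          # 6!/(2!2!2!) = 90 distinct orders

def adjacent(seq):
    last_1 = max(i for i, t in enumerate(seq) if t == 1)
    first_2 = min(i for i, t in enumerate(seq) if t == 2)
    return last_1 + 1 == first_2

count = sum(adjacent(s) for s in seqs)
prob = Fraction(count, len(seqs))
# (1/ell) * K1! K2! / (K1 + K2 - 1)! = (1/6) * (2! 2! / 3!) = 1/9
bound = Fraction(1, ell) * Fraction(2 * 2, 6)
print(len(seqs), count, prob, bound)    # 90 10 1/9 1/9
```

Raising this per-permutation probability to the k-th power across independent reorderings is what drives the exp(−k(log ℓ − 2 log τ)) decay in (27).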
We observe from the proof that there is a tradeoff between the values of k and n needed to achieve a given level of accuracy. In particular, when ℓ is large, it suffices to consider a small number of permutations of the set of objects, while each permutation requires a larger number of samples. On the other hand, when ℓ is relatively small, one needs a large number of permutations, while each permutation requires far fewer samples. We restrict focus to the case ℓ ≫ τ², for which we find the following sample complexity result.

Corollary 4: Given T = [τ] with τ < ∞ a constant and ℓ ≫ τ²,
N*_mem(ε) = O( (log ℓ − log ε)^{min{1/(1−2α−β), 1/(1−4α)}} / θ_m^{2/(1−β)} ).
Proof: The result follows from Theorem 6 and the fact that the total number of samples used per object is kn (since clustering with n samples is done k times).

For large ℓ we thus have N*_mem(ε) = O(log ℓ / θ_m²). Note that under the Markov memory model for long-term workers, the sufficient number of samples per object is of the same order as for temporary workers.

E. Lower Bound on Sample Complexity

We now provide matching lower bounds by studying the probability of error of a problem that is a reduction of the universal clustering problem.

Theorem 7: The sample complexity of universal clustering using workers with memory satisfies: 1) for fixed θ_m > 0, N*_mem = Ω(log ℓ); and 2) for fixed ℓ < ∞, N*_mem = Ω(θ_m^{−1}).

Proof: Choose a prior, parametrized by the size of the problem, P_T(1) = 1 − π, P_T(2) = π, with π = 1/ℓ. Let E be the set of all vectors of objects with at most one object of type 2. Then,
Pr[T^ℓ ∈ E] = (2 − 1/ℓ)(1 − 1/ℓ)^{ℓ−1} → 2e^{−1} as ℓ → ∞.
In particular, Pr[T^ℓ ∈ E] ≥ 2e^{−1} > 1/2 for every ℓ ≥ 2. For a given constant θ_m, consider the special case of the problem where N_i = {ĩ}.
That is, consider the problem where any two responses are dependent if and only if the corresponding objects are of the same type. Clearly, any algorithm that solves the universal clustering with memory problem solves this simplified problem as well. Thus, following the convention established, we have
I(Y_i; Y_ĩ) ≥ 2θ_m for all i ∈ [ℓ],  and  I(Y_i; Y_j) = 0 for all i, j ∈ [ℓ] with T_i ≠ T_j.
For a parameter 0 < μ < 1/2, define
W = [W_ij]_{1 ≤ i,j ≤ 2} = [ 1/2 + μ, 1/2 − μ ; 1/2 − μ, 1/2 + μ ].
Consider the scenario where worker responses are inertial over time, characterized by
Pr[Y_i = k | Y_ĩ = j] = W_kj, for any k, j ∈ {0, 1} and i ∈ [ℓ].
Additionally, assume that the marginals of the responses are uniform (that is, the response to the first object of each type is distributed as Bern(1/2)). The information constraint then implies
1/2 − h^{−1}(1 − 2θ_m) ≤ μ < 1/2,
where h(·) is the binary entropy function and h^{−1}(·) its inverse. Let μ = 1/2 − h^{−1}(1 − 2θ_m).

From the definition of the error probability,
P_e ≥ (1/2) Pr[P̂ ≠ P* | T^ℓ ∈ E].   (29)
Now consider the set E of vectors. Identifying the correct partition for a vector of objects from this set is equivalent to identifying the objects. Thus consider the (ℓ+1)-ary hypothesis testing problem
H_0 : T_j = 1 for all j ∈ [ℓ];  H_i : T_i = 2, T_j = 1 for all j ≠ i.   (30)
We seek the average error probability of (30) under the prior P_T. By symmetry, the optimal decoder accrues the same probability of error under H_i for every i > 0. Thus
Pr[error in (30)] = Pr[H_0] Pr[error in (30) | H_0] + Pr[error in (30) | H_1] Σ_{i ∈ [ℓ]} Pr[H_i].
Now note that
Σ_{i ∈ [ℓ]} Pr[H_i] = (1 − 1/ℓ)^{ℓ−1},  Pr[H_0] = (1 − 1/ℓ)^ℓ.
Thus, for ℓ > 1,
(1/2) Pr[{H_i : i > 0}] ≤ Pr[H_0] ≤ Pr[{H_i : i > 0}],
i.e., Pr[{H_i : i > 0}] ≍ Pr[H_0].
This indicates that the average error probability is lower-bounded by a constant factor times the minimax error probability lower bound for (30). Let Q_i be the distribution of the set of responses under hypothesis H_i of (30):
Q_0(y^ℓ) = (1/2) Π_{k=2}^{ℓ} W_{y_k y_{k−1}};
Q_i(y^ℓ) = (1/4) (Π_{j=2}^{i−1} W_{y_j y_{j−1}}) W_{y_{i+1} y_{i−1}} (Π_{k=i+2}^{ℓ} W_{y_k y_{k−1}}),  i ∈ [ℓ],   (31)
with empty products equal to one and the boundary cases i ∈ {1, ℓ} handled accordingly.

Lemma 4: For all ℓ, D(Q_i ‖ Q_j) = O(1).
Proof: See Appendix C.

Having bounded the KL divergences between the hypotheses, we obtain a lower bound on the error probability of (30) using the generalized Fano inequality [24]. Let β̄ = max_{i,j ∈ [ℓ] ∪ {0}, i ≠ j} D(Q_i ‖ Q_j). The loss function considered here is the 0-1 loss. Hence, using the comparability of the prior masses established above,
Pr[P̂ ≠ P* | T^ℓ ∈ E] = Pr[error in (30)] ≥ (1/2)(1 − (nβ̄ + log 2)/log(ℓ + 1)).   (32)
Hence, for a constant θ_m > 0, the sample complexity of universal clustering satisfies N*_mem = Ω(log ℓ).

Now, for fixed ℓ, we seek the sample complexity with respect to the memory quality of the crowd. To this end, note that any consistent clustering algorithm is also consistent for the binary hypothesis test ψ:
H_0 : I(Y_1; Y_2) = 0;  H_1 : I(Y_1; Y_2) ≥ 2θ_m.   (33)
That is, if Φ is a decoder for the universal clustering problem, then Φ also solves ψ. Since the sufficient statistics for this binary test are the responses to T_1 and T_2, it suffices to consider Y_1^n, Y_2^n. Let the prior here be P_T(1) = P_T(2) = 1/2, and let p(Y_1^n, Y_2^n) and q(Y_1^n, Y_2^n) be the corresponding response distributions under H_1 and H_0, respectively.
Here,
p(y_1^n, y_2^n) = (1/2) Π_{i=1}^{n} Pr[y_{1,i}, y_{2,i} | T_1 = T_2 = 1] + (1/2) Π_{i=1}^{n} Pr[y_{1,i}, y_{2,i} | T_1 = T_2 = 2],
and
q(y_1^n, y_2^n) = (1/2) Π_{i=1}^{n} Pr[y_{1,i} | T_1 = 1] Pr[y_{2,i} | T_2 = 2] + (1/2) Π_{i=1}^{n} Pr[y_{1,i} | T_1 = 2] Pr[y_{2,i} | T_2 = 1].
Let p_j^(i)(y) = Pr[Y_i = y | T_i = j] and q_j^(i)(y) = Pr[Y_i = y | T_i = j] denote the corresponding marginals under p and q, respectively. Without loss of generality, assume I_p(Y_1; Y_2) = 2θ_m. Since the distributions satisfy the information constraints, we have
D(p ‖ q) ≤ (n/4)( 4 I_p(Y_1; Y_2) + Σ_{i,j,k ∈ [2]} D(p_j^(i) ‖ q_k^(i)) )   (34)
  = 2nθ_m
when the marginals under the two hypotheses are equal, as is the case for the inertial worker response channel; here (34) follows from convexity. Thus the smallest upper bound on the KL divergence between the hypotheses is 2nθ_m, and since we consider the worst case with respect to θ_m, it suffices to consider this bound. Thus,
P_e(Φ) ≥ Pr[error in ψ] ≥ (1/4) exp(−2B(p, q))   (35)
  ≥ (1/4) exp(−2D(p ‖ q)) ≥ (1/4) exp(−4nθ_m),   (36)
where (35) follows from the Kailath lower bound [22] in terms of the Bhattacharyya distance B(p, q), and (36) then follows using Jensen's inequality. Thus N*_mem = Ω(θ_m^{−1}).

We observe from Theorem 7 that the universal clustering algorithm is order-optimal in terms of the number of objects ℓ. However, there is a gap between the lower bound and the achievable cost in terms of the crowd quality θ_m; this is exactly the well-known gap for entropy estimation observed in [25, Corollary 2].

F. Reductions to Other Clustering Algorithms

Several clustering paradigms are based on mutual information. Here we describe two such algorithms and reductions of our model under which our clustering algorithm coincides with them.
An information clustering strategy is defined in [14] that identifies clusters through the minimum partition information
I(Z_V) = min_P I_P(Z_V),
the minimum taken over partitions P of V with |P| > 1, where the partition information of P is
I_P(Z_V) = (1/(|P| − 1)) ( Σ_{C ∈ P} H(Z_C) − H(Z_V) ).
Consider the Markov memory model with N_i = {ĩ}, i.e., Y_i is conditionally dependent on an earlier response only if the corresponding object is of the same type. Then, if |P*| > 1,
I(Y^ℓ) = I_{P*}(Y^ℓ) = 0.
That is, the correct partition minimizes the partition information. The following reduction shows that minimizing the partition information is equivalent to our algorithm, so [14] is a special case of our approach. First,
H(Y^ℓ) = Σ_{i=1}^{ℓ} H(Y_i | Y_ĩ).
Similarly, for any cluster C,
H(Y_C) = Σ_{i ∈ C} H(Y_i | Y_{j(i)}),  where j(i) = max{k < i : k ∈ C}.
This implies
I_P(Y^ℓ) = (1/(|P| − 1)) Σ_{i=1}^{ℓ} ( I(Y_i; Y_ĩ) − I(Y_i; Y_{j(i)}) ),
which indicates that minimizing the partition information is equivalent to Φ_info(I).

Another information-based clustering strategy is mutual information relevance network (MIRN) clustering [17]. For a given threshold γ, this strategy outputs the connected components of the graph G = ([ℓ], E) with
(i, j) ∈ E ⟺ I(Z_i; Z_j) > γ.
For the Markov memory model, under the restriction
min_{i,j ∈ [ℓ] : T_i = T_j} I(Y_i; Y_j) > γ ≥ max_{i,j ∈ [ℓ] : T_i ≠ T_j} I(Y_i; Y_j),
MIRN outputs the correct partition. However, in the universal clustering scenario the decoder is not aware of γ, so it may not be feasible to implement MIRN clustering optimally. Nevertheless, under the restriction N_i = {ĩ} and γ = 0, MIRN clustering is again a special case of Φ_info(I).
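The claim that the correct partition attains partition information zero can be checked directly on a toy joint distribution: take Z_1 = Z_2 a shared fair coin and Z_3 an independent fair coin, so the correct partition is {{1,2},{3}}. The example is illustrative, not from [14].

```python
import math

# Joint pmf over (z1, z2, z3): Z1 = Z2, Z3 independent fair coin.
joint = {}
for b in (0, 1):
    for c in (0, 1):
        joint[(b, b, c)] = 0.25

def entropy(marginal):
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)

def H(subset):
    """Entropy of the variables indexed by `subset` (1-based)."""
    marg = {}
    for z, p in joint.items():
        key = tuple(z[i - 1] for i in sorted(subset))
        marg[key] = marg.get(key, 0.0) + p
    return entropy(marg)

def I_P(P):
    """Partition information (sum of cluster entropies minus joint)."""
    return (sum(H(C) for C in P) - H({1, 2, 3})) / (len(P) - 1)

partitions = [
    [{1, 2}, {3}], [{1, 3}, {2}], [{2, 3}, {1}],
    [{1}, {2}, {3}],
]
vals = {tuple(map(frozenset, P)): I_P(P) for P in partitions}
best = min(vals, key=vals.get)
print(vals, best)   # minimizer is ({1, 2}, {3}) with value 0
```

The partition {{1,2},{3}} scores 0 while every other partition is strictly positive, consistent with the reduction above.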
Fig. 5. Bayesian network model of responses to a set of 10 objects chosen from a set of 3 types, with ζ = 2. Every response is influenced by the most recent response and by the responses to the two most recent objects of the same type.

G. Extended Memory Workers

While the model above considers the dependence of responses on just the most recent object of the same kind, our results hold for any fixed, finite-order Markov structure as well. In particular, consider the scenario where worker responses depend on the set N_i = {i − 1} ∪ M_i, where M_i ⊆ {j < i : T_j = T_i} contains the most recent ζ indices of the same type. That is, the response to an object depends not only on the most recent response but also on a constant number of prior responses to objects of the same type. An example of this worker model is shown in Fig. 5 for a set of 10 objects of 3 types with worker memory ζ = 2. The algorithm and the results extend to this scenario: the parents of a node i can be computed using the rule
I(Y_i; Y_{N_i}) > I(Y_i; Y_S), for any index set S ⊆ [i − 1] with |S| ≤ ζ.
Then, for any constant ζ and n ≳ (τ + 1)^{2ζ}, the consistency of the algorithm holds; that is, as long as ζ = O(1), the sample complexity results follow.

V. UNIFIED WORKER MODEL

While we studied two distinct classes of worker models, temporary workers and workers with memory, these two scenarios are limiting cases of a unified worker model described here. After all, it is reasonable to characterize practical crowd worker decisions as influenced by both aspects: memory of individual responses and task difficulty with respect to objects. For the unified model, we provide an achievable scheme that makes use of the algorithms defined earlier, and we prove consistency and order optimality of the scheme.
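Under the unified model, the same distance-quality notion used for temporary workers applies to the pooled class-conditional pmfs: the smallest pairwise total variation separation between classes. A small numerical sketch, with made-up pmf values:

```python
import itertools
import numpy as np

# Illustrative pooled class-conditional pmfs Q_i(y) = Pr[Y = y | T = i]
# for tau = 3 object types over 3 responses (values are made up).
Q = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])

def tv(p, q):
    """Total variation distance between two pmfs."""
    return 0.5 * np.abs(p - q).sum()

# Distance quality: the smallest pairwise separation of the classes.
theta_d = min(tv(Q[i], Q[j])
              for i, j in itertools.combinations(range(3), 2))
print(theta_d)   # 0.5 for these pmfs
```

A larger theta_d means even the two most confusable object types are well separated at the level of single responses, which is what drives the temporary-worker component of the unified scheme.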
As in Section II, consider worker model (3), where each worker is characterized by a Markov memory model subject to fixed conditional marginal distributions.

A. Worker Quality

Let us now define worker quality. Let Q_1, . . . , Q_τ be the conditional response distributions given the object class,
Q_i(y) ≜ Pr[Y = y | T = i] = E[Q^(W)(Y = y | T = i)],
where the expectation is taken over the workers in the pool. Define the distance quality as
θ_d ≜ min_{i,j ∈ T, i ≠ j} δ(Q_i, Q_j).
Notice this distance quality is analogous to the definition of worker quality in the case of temporary workers. Additionally, define the memory quality as
θ_m ≜ (1/2) [ min_{i ∈ [ℓ]} I(Y_i; Y_{N_i}) − max_{i ∈ [ℓ], j < i, j ≠ ĩ} I(Y_i; Y_{i−1}, Y_j) ].

Theorem 8: Let T^ℓ be the set of objects with ℓ > τ². Then, for
n ≳ max{ (log ℓ − log ε)^{1/(1−2α−β)},   (37)
  (log ℓ − log ε)^{1/(1−4α)}, (θ_m + θ_d)^{−1/α} },   (38)
with 0 < α < 1/2 and 0 < β < 1, we have P_e(Φ_u) ≤ 2ε for any ε > 0.

Proof: First note that for n ≥ (c_1/(θ_m + θ_d/4))^{1/α}, γ_n ≤ θ_d/4 + θ_m; thus at least one of γ_n ≤ θ_d/4 or γ_n ≤ θ_m holds. This in turn indicates that at least one of Φ_mem or Φ_temp is consistent. Next, from Theorem 5, the output satisfies P_info ⪰ P*, so subsequent clustering within the individual clusters suffices. This establishes the correctness and asymptotic consistency of Alg. 4.

We now observe that the sample complexity with respect to the number of objects to be clustered is still O(log ℓ), while that with respect to the quality parameters is O((θ_m + θ_d)^{−2}).

Corollary 5: Given T = [τ] with τ < ∞ a constant: 1) for constant θ_m, θ_d > 0, N*_u(ε) = O(log ℓ); and 2) for constant ℓ, N*_u(ε) = O((θ_d + θ_m)^{−2}).

It is worth noting the limiting cases of the unified worker model. In particular, when θ_m → 0, the problem reduces to clustering with temporary workers, as do the achievable scheme and the sample complexity requirements.
On the other hand, θ_d → 0 corresponds to a particular case of clustering using workers with memory.

C. Lower Bound on Sample Complexity

We now derive the lower bound on sample complexity by extending the proof of the converse for workers with memory.

Theorem 9: The sample complexity of universal clustering under the unified worker model satisfies: 1) for fixed θ_m, θ_d > 0, N*_u = Ω(log ℓ); and 2) for fixed ℓ < ∞, N*_u = Ω((θ_m + θ_d²)^{−1}).

Proof: We proceed in similar fashion to the proof for workers with memory. Again, consider the prior parametrized by the size of the problem, P_T(1) = 1 − π, P_T(2) = π, with π = 1/ℓ, and again use the generalized Fano inequality over the set E of vectors of objects. We again consider the case N_i = {ĩ}. Consider worker responses such that the marginals of the responses to an object satisfy
Pr[Y = i | T = j] = p if i = j, and 1 − p if i ≠ j,
irrespective of the order of occurrence. Define the matrices
W^(1) = [W^(1)_ij]_{1 ≤ i,j ≤ 2} = [ a, 1 − a ; 1 − b, b ],  W^(2) = [W^(2)_ij]_{1 ≤ i,j ≤ 2} = [ b, 1 − b ; 1 − a, a ],
and let the worker responses be characterized by
Pr[Y_i = k | Y_ĩ = k̃, T_i = T_ĩ = j] = W^(j)_{k k̃}.
From the constraint on distance quality, we have 2p − 1 ≥ θ_d. The constraint on the nature of the marginals establishes ap − b(1 − p) = p. The restriction on the information quality implies
h(p) − p h(a) − (1 − p) h(b) ≥ 2θ_m.
Consider the case when both inequalities hold with equality. This yields a specific worker channel that satisfies the memory and distance quality requirements, and we analyze the error probability on this channel. Using analysis similar to the proof of Lemma 4, we observe that the KL divergences between the hypotheses of the (ℓ+1)-ary hypothesis testing problem are O(1). Hence there exists a constant β̄ such that (32) holds.
Thus, for constant θ_m and θ_d, the sample complexity of universal clustering satisfies N*_u = Ω(log ℓ).

Now, for fixed ℓ, we study the necessary sample complexity with respect to θ_m and θ_d. We know that a consistent universal clustering algorithm also solves the binary hypothesis test ψ:
H_0 : I(Y_1; Y_2) = 0;  H_1 : I(Y_1; Y_2) ≥ 2θ_m.
Following the analysis in the proof of Theorem 7, from (34) we have
D(p ‖ q) ≤ n( 2θ_m + θ_d log((1 + θ_d)/(1 − θ_d)) ) ≲ n(θ_m + θ_d²).
Finally, using the Kailath lower bound, we obtain
P_e(Φ) ≥ (1/4) exp(−4cn(θ_m + θ_d²)).   (39)
Thus, for constant ℓ, N*_u = Ω((θ_m + θ_d²)^{−1}).

From the theorem, we note that the universal clustering algorithm is order-optimal in sample complexity in terms of the number of objects, for a crowd of given quality. However, for a given number of objects, there exists an order gap between the achievable sample complexity and the converse. As expected, the gap follows from the gap in the case of workers with memory, which in turn stems from the gap in estimating entropy [25]. In particular, we observe that in the limit θ_d → 0 the problem reduces to the case of workers with memory, while θ_m → 0 reduces it to clustering using temporary workers without memory. A finer point in the analysis is that the worst-case channel considered in the converse proofs is the inertial channel of the proof of Theorem 7, which is also the solution to the channel constraints in the unified scenario in the limit θ_d → 0. Thus temporary workers and long-term workers with memory are indeed closely related through the unified worker model studied here, arising as its limiting cases.

VI. CONCLUSION

This paper establishes an information-theoretic framework to study the universal crowdsourcing problem.
Specifically, we defined a unified worker model (incorporating aspects of human decision making from experimental crowdsourcing and behavioral economics) and designed unsupervised clustering algorithms that are budget-optimal. We first studied two limiting cases of workers, those with and those without memory. For temporary workers without memory, we used the distributional identicality of responses to design a universal clustering algorithm that is asymptotically consistent and order-optimal in sample complexity. For workers with memory, under a Markov model of memory, we used the dependence structure to design a novel universal clustering algorithm that is asymptotically consistent and order-optimal in sample complexity with respect to the number of objects. We also note that the gap obtained between the necessary and sufficient conditions on sample complexity with respect to the memory quality is the same gap observed in the empirical estimation of entropy. We then integrated the limiting cases to develop a universal clustering algorithm for the unified worker model, again proving asymptotic consistency and order optimality in sample complexity with respect to the number of objects. With regard to the quality of crowd workers, the gap observed in the case of workers with memory remains.

Behavioral experiments using crowd workers on platforms such as Amazon Mechanical Turk could be performed to gain insight into the performance of the algorithms in practice and to validate the unified worker model. Our results provide a way to compare costs between crowdsourcing platforms having workers with and without memory, thereby providing the opportunity to choose the right task-dependent platform. Further, they provide a window into more general studies of the computational capabilities and complexities of human-based information systems.
In particular, the work sheds light on the influence of various attributes of crowd workers, such as object-specific memory. In essence, the work studies a space-time tradeoff for human computation systems and, to the best of our knowledge, is the first of its kind.

APPENDIX A
CONCENTRATION OF EMPIRICAL DISTRIBUTIONS

In this section we briefly study the rates of convergence of the ML estimates of f-divergence. Let Z be a finite set of objects and Z_1, . . . , Z_n ~ iid p. Let p̂ be the empirical distribution,
p̂(z) = (1/n) Σ_{i=1}^{n} 1{Z_i = z}.

Lemma 5: If p and p̂ are the true and empirical distributions, respectively, then
Pr[δ(p̂, p) ≥ ε] ≤ (n + 1)^{|Z|} exp(−c_0 n ε²),   (40)
where c_0 = 2 log_2 e. Further, for any convex function f satisfying the smoothness constraints,
Pr[D_f(p̂ ‖ p) ≥ ε] ≤ (n + 1)^{|Z|} exp(−nε/C),   (41)
where C < ∞ is a constant such that x f″(x) < C.
Proof: The results follow from Pinsker's inequality (9), Theorem 1, and Sanov's theorem.

APPENDIX B
ESTIMATING MUTUAL INFORMATION FROM SAMPLES

Here we briefly describe the maximum likelihood (ML) estimate of mutual information and its convergence properties. Let X ~ p be a random variable on a discrete space X. Let X_1, . . . , X_n ~ iid p and let p̂ be the corresponding empirical distribution. The ML estimate of the entropy of X is
Ĥ(X) = E_p̂[−log p̂(X)].
The ML estimate of the mutual information between random variables X, Y is then
Î(X; Y) = Ĥ(X) + Ĥ(Y) − Ĥ(X, Y).
The ML estimates of entropy and mutual information have been widely studied [25]-[28]. In particular, the following results are notable:
1) For all n, from Jensen's inequality, E[Ĥ(X)] < H(X) [26]. For all n and ε > 0,
Pr[|Ĥ(X) − E[Ĥ(X)]| > ε] ≤ 2 exp(−nε²/(2 log²(2n))),   (42)
from McDiarmid's inequality.
2) The ML estimate Ĥ is negatively biased [27]:
b_n(Ĥ) ≜ E_p[Ĥ(X)] − H(X) = −E_p[D(p̂ ‖ p)] < 0.
Further,
−|X|/n ≤ −log(1 + (|X| − 1)/n) ≤ b_n(Ĥ) ≤ 0.
3) From [25], the sample complexity of estimating entropy up to an additive error ε is lower-bounded by Ω(|X|/(ε log |X|)).
4) From [28], the deviation of the empirical entropy from the entropy of X ~ P is bounded in terms of the total variation distance as
|Ĥ(X) − H(X)| ≤ −2δ(P̂, P) log(2δ(P̂, P)/|X|).   (43)
Since we deal with finite, constant alphabet sizes, it suffices for us to consider the ML estimates with sufficiently large n, such that the bias is negligible.

Lemma 6: For fixed alphabet sizes |X|, |Y|, the ML estimates of entropy and mutual information are asymptotically consistent and satisfy
Pr[|Ĥ(X) − H(X)| > ε] ≤ 2 exp(−nε²/(2 log²(2n))) + o(1),   (44)
Pr[|Î(X; Y) − I(X; Y)| > ε] ≤ 6 exp(−nε²/(18 log²(2n))) + o(1).   (45)
Proof: The convergence of the entropy estimate follows directly by applying the triangle inequality, the union bound, and (42), together with the fact that the alphabet is of finite, constant size. This implies the convergence result for mutual information.

Lemma 7: For fixed alphabet sizes |X|, |Y|, the ML estimates of entropy and mutual information are asymptotically consistent and satisfy
Pr[|Ĥ(X) − H(X)| > ε] ≤ (n + 1)^{|X|} exp(−cnε⁴),   (46)
Pr[|Î(X; Y) − I(X; Y)| > ε] ≤ 3(n + 1)^{|X||Y|} exp(−c̃nε⁴),   (47)
where c = (2|X|² log 2)^{−1} and c̃ = (32 max{|X|, |Y|}² log 2)^{−1}.
Proof: We first observe that for all x > 0, log x < √x. Thus, for any X ~ P, from (43) we have
|Ĥ(X) − H(X)| ≤ √(2|X| δ(P̂, P)).
Using this and (40), the first inequality is obtained. Subsequently, using the triangle inequality and the union bound, we obtain the convergence of the empirical mutual information.
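The negative bias of the plug-in estimate in item 2) is easy to exhibit numerically. The following Monte-Carlo sketch (distribution and sample sizes are illustrative) averages the plug-in entropy over many draws and observes a mean strictly below the true entropy.

```python
import numpy as np

# Monte-Carlo illustration of the negative bias of the plug-in (ML)
# entropy estimate H_hat = E_phat[-log2 phat(X)].
rng = np.random.default_rng(1)
p = np.array([0.5, 0.25, 0.25])
H = -np.sum(p * np.log2(p))            # true entropy = 1.5 bits

def plugin_entropy(samples, k=3):
    phat = np.bincount(samples, minlength=k) / len(samples)
    nz = phat[phat > 0]
    return -np.sum(nz * np.log2(nz))

n, trials = 30, 5000
est = [plugin_entropy(rng.choice(3, size=n, p=p)) for _ in range(trials)]
bias = np.mean(est) - H
print(H, np.mean(est), bias)           # bias is negative
```

The observed bias is on the order of (|X| − 1)/(2n), consistent with the bracketing −|X|/n ≤ b_n(Ĥ) ≤ 0 stated above.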
These rates of convergence are used to prove consistency.

APPENDIX C
PROOF OF LEMMA 4

In this section we describe the proof of Lemma 4.
Proof: We first note that
D(Q_i ‖ Q_j) = E_{Q_i}[ log( Q_i(Y^ℓ)/Q_j(Y^ℓ) ) ]
  = E_{Q_i}[ log( W_{Y_{i+1} Y_{i−1}} W_{Y_j Y_{j−1}} W_{Y_{j+1} Y_j} / (W_{Y_{i+1} Y_i} W_{Y_i Y_{i−1}} W_{Y_{j+1} Y_{j−1}}) ) ]
  = E_{Q_{ℓ−i}}[ log( W_{Y_{ℓ−i+1} Y_{ℓ−i−1}} W_{Y_{ℓ−j} Y_{ℓ−j−1}} W_{Y_{ℓ−j+1} Y_{ℓ−j}} / (W_{Y_{ℓ−i+1} Y_{ℓ−i}} W_{Y_{ℓ−i} Y_{ℓ−i−1}} W_{Y_{ℓ−j+1} Y_{ℓ−j−1}}) ) ]
  = D(Q_{ℓ−i} ‖ Q_{ℓ−j}).
Then we note that
D(Q_0 ‖ Q_1) = Σ_{y^ℓ} (1/2) Π_{i=2}^{ℓ} W_{y_i y_{i−1}} log(2 W_{y_2 y_1}) = 1 − h(1/2 − μ),
where h(·) is the binary entropy function. Similarly,
D(Q_1 ‖ Q_0) = −(1/2) log(1 − 4μ²).
For any 1 < i < ℓ,
D(Q_0 ‖ Q_i) = E_{Q_0}[ log( 2 W_{Y_i Y_{i−1}} W_{Y_{i+1} Y_i} / W_{Y_{i+1} Y_{i−1}} ) ]
  = 1 + (1/2 + 2μ − 2μ²) log(1/2 + μ) + (1/2 − 2μ + 2μ²) log(1/2 − μ).
Similarly,
D(Q_i ‖ Q_0) = −1 − (1/2 − μ) log(1/2 + μ) − (1/2 + μ) log(1/2 − μ).
Having computed these distances, we make one additional observation. For 1 ≤ i, j ≤ ℓ, i ≠ j,
D(Q_i ‖ Q_j) = E_{Q_i}[ log( W_{Y_{i+1} Y_{i−1}} / (W_{Y_{i+1} Y_i} W_{Y_i Y_{i−1}}) ) + log( W_{Y_j Y_{j−1}} W_{Y_{j+1} Y_j} / W_{Y_{j+1} Y_{j−1}} ) ]
  = D(Q_0 ‖ Q_j) + D(Q_i ‖ Q_0).
Since μ is bounded away from 1/2, D(Q_0 ‖ Q_i) = O(1) and D(Q_i ‖ Q_0) = O(1). This in turn proves that the KL divergence between any two hypotheses is a constant independent of ℓ.

REFERENCES
[1] A. Kittur, E. H. Chi, and B. Suh, "Crowdsourcing user studies with Mechanical Turk," in Proc. SIGCHI Conf. Hum. Factors Comput. Syst. (CHI 2008), Apr. 2008, pp. 453-456.
[2] P. G. Ipeirotis, F. Provost, and J. Wang, "Quality management on Amazon Mechanical Turk," in Proc. ACM SIGKDD Workshop Human Comput. (HCOMP'10), Jul. 2010, pp. 64-67.
[3] N. Scheiber, "A middle ground between contract worker and employee," The New York Times, Dec. 2015.
[4] F. Gino and G. Pisano, "Toward a theory of behavioral operations," Manuf. Service Oper. Manag., vol. 10, no. 4, pp.
676–691, Fall 2008.
[5] D. R. Karger, S. Oh, and D. Shah, "Budget-optimal task allocation for reliable crowdsourcing systems," Oper. Res., vol. 62, no. 1, pp. 1–24, Jan.–Feb. 2014.
[6] V. Misra and T. Weissman, "Unsupervised learning and universal communication," in Proc. 2013 IEEE Int. Symp. Inf. Theory, Jul. 2013, pp. 261–265.
[7] H. J. Jung, Y. Park, and M. Lease, "Predicting next label quality: A time-series model of crowdwork," in Proc. AAAI Conf. Human Comput. and Crowdsourcing (HCOMP'14), Nov. 2014, pp. 87–95.
[8] D. R. Karger, S. Oh, and D. Shah, "Efficient crowdsourcing for multi-class labeling," in Proc. ACM SIGMETRICS Int. Conf. Meas. Model. Comput. Syst., Jun. 2013, pp. 81–92.
[9] N. B. Shah, S. Balakrishnan, and M. J. Wainwright, "A permutation-based model for crowd labeling: Optimal estimation and robustness," arXiv:1606.09632, Jun. 2016.
[10] A. Vempaty, L. R. Varshney, and P. K. Varshney, "Reliable crowdsourcing for multi-class labeling using coding theory," IEEE J. Sel. Topics Signal Process., vol. 8, no. 4, pp. 667–679, Aug. 2014.
[11] L. R. Varshney, P. Jyothi, and M. Hasegawa-Johnson, "Language coverage for mismatched crowdsourcing," in Proc. 2016 Inf. Theory Appl. Workshop, Feb. 2016.
[12] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," J. Mach. Learn. Res., vol. 6, pp. 1705–1749, Oct. 2005.
[13] V. Misra, "Universal communication and clustering," Ph.D. dissertation, Stanford University, Jun. 2014.
[14] C. Chan, A. Al-Bashabsheh, J. B. Ebrahimi, T. Kaced, and T. Liu, "Multivariate mutual information inspired by secret-key agreement," Proc. IEEE, vol. 103, no. 10, pp. 1883–1913, Oct. 2015.
[15] N. Slonim, N. Friedman, and N. Tishby, "Agglomerative multivariate information bottleneck," in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds.
Cambridge, MA: MIT Press, 2002, pp. 929–936.
[16] J. Zhang, V. S. Sheng, J. Wu, and X. Wu, "Multi-class ground truth inference in crowdsourcing with clustering," IEEE Trans. Knowl. Data Eng., vol. 28, no. 4, pp. 1080–1085, Apr. 2016.
[17] K. Nagano, Y. Kawahara, and S. Iwata, "Minimum average cost clustering," in Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, Eds. MIT Press, 2010, pp. 1759–1767.
[18] Q. Li, A. Vempaty, L. R. Varshney, and P. K. Varshney, "Multi-object classification via crowdsourcing with a reject option," arXiv:1602.00575 [cs.LG], Jun. 2016.
[19] S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," J. R. Stat. Soc. Ser. B Methodol., vol. 28, no. 1, pp. 131–142, 1966.
[20] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observations," Stud. Sci. Math. Hung., vol. 2, pp. 299–318, 1967.
[21] S. S. Dragomir, Ed., Inequalities for Csiszár f-Divergence in Information Theory, ser. RGMIA Monographs. Victoria University, 2000.
[22] T. Kailath, "The divergence and Bhattacharyya distance measures in signal selection," IEEE Trans. Commun. Technol., vol. COM-15, no. 1, pp. 52–60, Feb. 1967.
[23] I. Csiszár and Z. Talata, "Context tree estimation for not necessarily finite memory processes, via BIC and MDL," IEEE Trans. Inf. Theory, vol. 52, no. 3, pp. 1007–1016, Mar. 2006.
[24] B. Yu, "Assouad, Fano, and Le Cam," in Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, D. Pollard, E. Torgersen, and G. L. Yang, Eds. New York: Springer, 1997, pp. 423–435.
[25] G. Valiant and P. Valiant, "Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs," in Proc. 43rd Annu. ACM Symp. Theory Comput. (STOC'11), Jun.
2011, pp. 685–694.
[26] A. Antos and I. Kontoyiannis, "Convergence properties of functional estimates for discrete distributions," Random Struct. Algorithms, vol. 19, no. 3–4, pp. 163–193, 2001.
[27] L. Paninski, "Estimation of entropy and mutual information," Neural Comput., vol. 15, no. 6, pp. 1191–1253, Jun. 2003.
[28] P. Netrapalli, S. Banerjee, S. Sanghavi, and S. Shakkottai, "Greedy learning of Markov network structure," in Proc. 48th Annu. Allerton Conf. Commun. Control Comput., Sep. 2010, pp. 1295–1302.