

Learning from Multiple Noisy Partial Labelers

Peilin Yu, Tiffany Ding, Stephen H. Bach
Department of Computer Science
Brown University
{peilin_yu, tiffany_ding, stephen_bach}@brown.edu

Abstract

Programmatic weak supervision creates models without hand-labeled training data by combining the outputs of noisy, user-written rules and other heuristic labelers. Existing frameworks make the restrictive assumption that labelers output a single class label. Enabling users to create partial labelers that output subsets of possible class labels would greatly expand the expressivity of programmatic weak supervision. We introduce this capability by defining a probabilistic generative model that can estimate the underlying accuracies of multiple noisy partial labelers without ground truth labels. We prove that this class of models is generically identifiable up to label swapping under mild conditions. We also show how to scale up learning to 100k examples in one minute, a 300× speedup compared to a naive implementation. We evaluate our framework on three text classification and six object classification tasks. On text tasks, adding partial labels increases average accuracy by 9.6 percentage points. On image tasks, we show that partial labels allow us to approach some zero-shot object classification problems with programmatic weak supervision by using class attributes as partial labelers. Our framework is able to achieve accuracy comparable to recent embedding-based zero-shot learning methods using only pre-trained attribute detectors.

1 Introduction

The need for large-scale labeled datasets has driven recent research on methods for programmatic weak supervision (PWS), such as data programming [1, 2], adversarial label learning [3], learning rules from labeled exemplars [4], and weak supervision with self-training [5]. In PWS, labeling functions, such as user-written rules and other heuristics, provide votes on the true labels for unlabeled examples or abstain from voting. Then, a label model, such as a probabilistic generative model or minimax game, is often used to estimate the true labels in a way that accounts for unknown differences in accuracies and other properties of the labeling functions. Finally, these estimated labels are used to train an end model on the unlabeled data that generalizes beyond the information contained in the labeling functions. This approach has had recent success with applications in natural language processing [6, 7], computer vision [8], medicine [9, 10, 11], and the Web [12]. However, all of these methods assume that labeling functions cast votes for individual classes. In this work, we propose to generalize PWS to support labeling functions that cast votes for a subset of classes, called partial labels. We refer to such labeling functions as partial labeling functions (PLFs). Our goal is to aggregate the information provided by multiple partial labeling functions that are noisy (i.e., have imperfect accuracy) in order to estimate labels for unlabeled data.

Incorporating partial labels into PWS would enable users to take advantage of a wider range of domain knowledge. In typical PWS frameworks, only heuristics that are specific to one class can be incorporated. As a result, creating labeling functions requires careful task-specific engineering to avoid features that are shared by more than one class. For example, consider the task of classifying images of animals as {HORSE, TIGER, LION, ZEBRA}. There are many useful heuristics that can be learned from other labeled data sets, such as attribute detectors for claws or stripes [13].

Preprint. Under review.

arXiv:2106.04530v1 [cs.LG] 8 Jun 2021

Figure 1: Examples of the expressivity of partial labeling functions. On the left, three functions each vote for two of three classes ({POLITICS, BUSINESS, SPORTS}) if they observe a particular token ("President", "Won", "Points") in a news article and abstain otherwise. On the right, two functions each vote for two of four classes ({TIGER, LION, HORSE, ZEBRA}) if they detect a particular attribute ("Stripes", "Claws") in an image. Otherwise, they vote for the other two classes.

Such heuristics divide the label space with multiple partitions. A claw detector could produce two partial labels: {TIGER, LION} if a claw is detected and {HORSE, ZEBRA} if not. Likewise, a stripe detector could output {TIGER, ZEBRA} if stripes are detected and {HORSE, LION} if not. However, these heuristics cannot be used as labeling functions in current PWS frameworks. More generally, we observe a need for such partial labeling functions in many multiclass applications where users want to express heuristics that narrow down the set of possible class labels but are not specific to a single class (Figure 1).

Learning in such scenarios is challenging because we must resolve ambiguity arising from three sources: (1) PLF imprecision, i.e., voting for a set of classes instead of a single class, (2) PLF inaccuracy, i.e., voting for a set of classes that does not contain the true class, and (3) conflict among multiple PLFs. A further requirement is that PWS frameworks should support labeling functions that abstain, meaning they can choose not to label certain examples. This is particularly critical for hand-engineered rules that might be highly specialized. A framework for learning from multiple noisy PLFs should therefore be able to resolve all these types of ambiguities in a principled way while also maintaining the expressive capabilities of existing PWS frameworks.

This problem setting is quite general and is related to multiple lines of work in machine learning, although each of them only addresses part of the problem considered here. As mentioned above, previous PWS frameworks generally require labeling functions to provide a single class label [1, 2, 3, 4, 5, 7]. One exception is Snorkel MeTaL [14], which is capable of handling labeling functions with a multi-task tree structure where higher-level labeling functions are grouped into super classes that encompass fine-grained classes. This requirement of a tree structure makes modeling partial labels that divide the label space into overlapping subsets practically infeasible. There is also a wide body of work on learning from partial labels, also called superset learning [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]. In these settings, there is generally one partial label per example. The ambiguity in the imprecise labels can be resolved as a maximum likelihood or risk minimization problem. Many methods additionally learn the likely confusions between partial and true labels [25, 26, 27, 28]. However, such methods do not handle the case of multiple partial labelers that can disagree and abstain. Finally, some work on zero-shot learning (ZSL) creates attribute detectors that can be viewed as partial labelers [29, 13, 30, 31]. PWS with partial labels can also be viewed as a generalization of the transductive ZSL setting [32, 33], in which labelers are allowed to abstain and a class may be associated with multiple attribute values. Across all these related directions, there remains a need for learning from multiple noisy partial labelers.

To address this issue, we propose a generalized PWS framework that supports partial labels and handles the additional ambiguity caused by the imprecise outputs of the PLFs. We introduce a probabilistic generative model that estimates the agreement between the outputs of each partial labeling function and the latent, true label. We prove that this model is generically identifiable up to label swapping under mild conditions on the PLFs. This result means that we can, in principle, estimate the accuracy of each partial labeling function without access to ground truth labels.


We also show how to learn these parameters efficiently for large datasets. For example, we can learn on 100k examples with 10 PLFs in one minute. Since PWS is inherently human-in-the-loop, fast iteration is crucial. Using the learned parameters, we can compute the posterior distribution over true labels for each example. These probabilistic training labels can then be used to train an end model in the same manner as other PWS frameworks.

We demonstrate this framework with experiments on three text and six object classification tasks. On the text classification tasks, we show that the additional flexibility provided by partial labelers enables heuristics that significantly improve over single-class labelers alone. We find an average 9.6 percentage point improvement in accuracy. On the object classification tasks, we find that modeling the accuracies of the PLFs explicitly enables us to achieve accuracy comparable to recent embedding-based ZSL methods using only pre-trained attribute detectors. These results provide a foundation for constructing and learning more modular, reusable knowledge sources for weak supervision.

2 Related Work

In the past few years, programmatic weak supervision (PWS) has emerged as a systematic approach to efficiently create labeled training data [1, 2, 3, 4, 5]. A typical PWS framework consists of three stages. First, domain experts engineer weak supervision sources in the form of labeling functions, such as rules or classifiers related to the target task. Second, a label model, such as a probabilistic generative model, is used to estimate the latent true labels using the labeling function outputs. Third, the estimated labels are used to train an end model that generalizes beyond the information in the supervision sources. The core of such frameworks is the label modeling stage. The choice of label model determines what types of supervision sources are supported. Many frameworks are based on crowdsourcing methods [34, 35, 36], where providing a single label is a natural assumption. In the original data programming framework [1], labeling functions can output a single label or abstain. The label model is generative, meaning that each true label is a latent variable and the observed votes of the labeling functions are conditioned on the true labels. The parameters of the label model are learned by maximizing the marginal likelihood of the observed votes. Statistical dependencies such as correlations among the votes can be modeled, and methods have been introduced to learn specific types of dependencies from unlabeled data [37, 38].

The Snorkel MeTaL [14] framework extends data programming to learn across multiple, related tasks organized in a tree structure. For example, in a fine-grained named entity recognition task, one might use a set of labeling functions that vote on coarse-grained entity types and separate sets of labeling functions to further vote on the subtypes within each coarse type. In this way, users can write labeling functions specialized to different subtasks. The outputs of labeling functions at higher levels of the tree can be thought of as a restricted form of partial labels, in the sense that all labeling functions must follow the same tree-structured organization of the classes. In contrast, in our setting, each partial labeling function can organize the classes into its own, possibly overlapping groups.

Other PWS frameworks have approached labeling functions and the label modeling process in different ways, but all so far assume that each labeling function votes for a single class. Adversarial label learning [3], performance-guaranteed majority vote [39], and related work in semi-supervised ensemble learning [40, 41, 42] solve minimax games based on assumed or estimated constraints on labeling function accuracies. Awasthi et al. [4] proposed learning from rules and exemplars for those rules, learning to downweight the confidence in the rules on data instances not similar to the exemplars. Karamanolakis et al. [5] proposed integrating PWS with semi-supervised self-training.

Other work on learning with partial labels has focused on the case where there is a single partial label per example. The ambiguity in the imprecise labels can be resolved via maximum likelihood estimation [15, 19] or empirical risk minimization [16, 17, 18, 20, 21, 22, 24]. Many methods additionally learn the likely confusions between partial and true labels [25, 26, 27, 28]. Wang et al. [23] proposed learning multiple partially labeled tasks simultaneously, in order to exploit structure among the tasks, but during training there is still only one partial label per prediction. Partial labels are also related to complementary labels [43, 44], which are annotations that indicate which label the example does not have.

Our problem setting is also related to some forms of zero-shot learning (ZSL) [32, 33]. In zero-shot classification, a model learns to match semantic descriptions of classes to examples of those classes.


Once learned, the model can be applied to novel classes. Many early approaches to ZSL created detectors for different attributes [29, 13, 30, 31]. In the transductive setting [32, 33], in which the target classes are known and unlabeled examples of them are available during model development, these detectors can be viewed as restricted partial labeling functions that always divide the label set into non-overlapping groups and never abstain. More recently, much work on ZSL has moved away from relying entirely on attribute detectors, and recent work can be grouped into either embedding-based or generative-based methods [45]. Embedding-based methods align representation spaces between classes and examples in order to classify unlabeled data [46, 47, 48, 49, 32, 33]. Some work, e.g., Liu et al. [50] and Liu et al. [51], also learns to exploit and expand attribute-based information, but generally still does not use separate attribute detectors. On the other hand, generative-based ZSL methods generate examples of the unseen classes with deep generative models and then train a classifier with that data [52, 53, 54, 55, 56, 57]. In our experiments, we focus on comparing against transductive embedding-based methods, which are more similar to PWS because both involve trying to label a fixed, unlabeled data set. We leave incorporating zero-shot data generation into PWS for future work.

3 A Framework for Learning from Partial Labeling Functions

In this section, we describe our weak supervision framework. Following prior work in programmatic weak supervision [1, 2], our framework consists of three stages. First, users develop partial labeling functions for the target task (Section 3.1). Second, the partial labeling functions are applied to unlabeled data and a probabilistic label model is estimated using their outputs, with no need for hand-labeled training data (Section 3.2). Third, the learned label model is used to compute the posterior distribution for the true label of each unlabeled example, which is used to train a noise-aware classifier (Section 3.3).

3.1 Expressing Weak Supervision as Partial Labeling Functions

We propose generalizing labeling functions to partial labeling functions (PLFs) in order to make use of many available weak supervision sources that are informative but not specific enough to identify a single class. PLFs can range in granularity, from dividing the label space into two large groups down to identifying a specific class, i.e., a regular labeling function. This flexibility allows users to take advantage of many additional supervision signals, as we illustrate in our examples and experiments.

A PLF G is a function that maps an unlabeled example to a proper subset of the possible labels or abstains by outputting the full set of all possible labels. Formally, our goal is to learn a classifier C : X → Y, where X is the space of inputs and Y = {y_1, . . . , y_k} is the set of possible labels. A PLF is then a function G : X → G ⊆ P(Y) \ {∅}, where P(Y) is the power set of Y. G(X) is a partial label for X ∈ X, i.e., the set of labels that the PLF indicates the example X could have (although this information could be incorrect). If G(X) = Y, the PLF is said to abstain, because it provides no information about the true label. As we will explain further in Section 3.2, a key characteristic of a PLF is its codomain G, excluding the case when the PLF abstains. We denote this set of partial labels for a PLF G as T(G) = G \ {Y}. To ensure that our label model is well-defined, we impose the following conditions on T(G): (1) each label y ∈ Y appears in at least one element of T(G), and (2) no label y ∈ Y appears in every element of T(G). These are very mild conditions that can easily be satisfied by adding a "dummy" output to the codomain G that the PLF might not actually produce.

A PLF can be programmatically defined based on a variety of noisy supervision heuristics using domain knowledge and/or available resources, such as classifiers for related tasks. To better understand PLFs, consider the following text and object classification examples corresponding to Figure 1:

Example 1. Consider a news classification task where Y = {POLITICS, SPORTS, BUSINESS}. In this task, some words can be very informative as supervision sources even if they do not narrow the example down to a specific class. For example, the word "president" may frequently appear in both political and business contexts. We can construct a PLF G based on a simple token matcher for "president" such that G : X → {{BUSINESS, POLITICS}, {SPORTS}, Y}. If the token "president" appears in the example X, then G(X) = {BUSINESS, POLITICS}. Otherwise, G(X) = Y, i.e., G abstains, because the absence of the token is not enough to conclude anything about the label with high confidence. In this example, T(G) = {{BUSINESS, POLITICS}, {SPORTS}}. Notice that here {SPORTS} is a "dummy" label set added to satisfy the conditions on T(G) described above.


Example 2. Consider an object classification task where Y = {HORSE, TIGER, LION, ZEBRA}. Following work in zero-shot learning, we can build a binary classifier for the visual attribute of having stripes by training on other classes of animals for which we already have labels. We can then use the classifier's output to define a PLF G_1 : X → {{TIGER, ZEBRA}, {HORSE, LION}}. For an example X, if the stripes detector returns a positive label, then G_1(X) = {TIGER, ZEBRA}. Otherwise, G_1(X) = {HORSE, LION}. We can similarly construct a PLF with a claw detector as G_2 : X → {{TIGER, LION}, {HORSE, ZEBRA}}.
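To ground the two examples, here is a minimal Python sketch of such PLFs. The function names are our own, and `detect_stripes` stands in for a hypothetical pre-trained binary attribute classifier; neither is taken from the paper's code.

```python
from typing import Callable, FrozenSet

NEWS_LABELS: FrozenSet[str] = frozenset({"POLITICS", "SPORTS", "BUSINESS"})

def president_plf(text: str) -> FrozenSet[str]:
    # Token-matcher PLF from Example 1: vote {BUSINESS, POLITICS} on a hit,
    # abstain by returning the full label set Y otherwise.
    if "president" in text.lower().split():
        return frozenset({"BUSINESS", "POLITICS"})
    return NEWS_LABELS  # abstain

ANIMAL_LABELS: FrozenSet[str] = frozenset({"HORSE", "TIGER", "LION", "ZEBRA"})

def make_attribute_plf(detector: Callable[[object], bool],
                       pos: FrozenSet[str],
                       neg: FrozenSet[str]) -> Callable[[object], FrozenSet[str]]:
    # Attribute-detector PLF from Example 2: never abstains; always splits
    # the label space according to the detector's binary output.
    return lambda x: pos if detector(x) else neg

# e.g., with a hypothetical stripes classifier `detect_stripes`:
# stripes_plf = make_attribute_plf(detect_stripes,
#                                  frozenset({"TIGER", "ZEBRA"}),
#                                  frozenset({"HORSE", "LION"}))
```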

PLFs are a generalization of the labeling functions used in prior work on PWS; traditional labeling functions can be represented as PLFs with codomain G = {{y_1}, . . . , {y_k}, Y}. PLFs give users the additional flexibility to incorporate weak supervision heuristics with differing granularities and ways of dividing the label space.

3.2 Label Model

In our framework, users provide two inputs: PLFs and unlabeled examples in X. Like other PWS frameworks, at the core of our method is a probabilistic label model that statistically models the properties of the weak supervision sources by representing the unknown ground-truth labels as latent variables. In this subsection, we propose and analyze a novel probabilistic label model for PLFs.

Setup  For a classification task with input space X and label space Y = {y_1, . . . , y_k}, we are given m unlabeled examples X = (X_1, . . . , X_m) with unknown ground truth labels Y = (Y_1, . . . , Y_m) such that (X, Y) are i.i.d. samples from some distribution D. We are also given n PLFs G = (G_1, . . . , G_n). When it is clear from context, we use G as shorthand for the m × n matrix of PLF outputs, where G_{ai} = G_i(X_a).

Joint Distribution  In order to model and resolve the three types of noise in our setting, we define a joint distribution P(G, Y) over the outputs of the PLFs on X and the latent, true labels Y. We assume that the PLF outputs are conditionally independent given the true labels, i.e., the naive Bayes assumption. In practice this works well, but extending work on learning more complex distributions for other types of PWS is a potential direction for future exploration [37, 38]. Analogous to prior work [14], for each PLF G_i, we define parameters α_i ∈ [0, 1]^k and β_i ∈ [0, 1]. Each element α_{ij} is the accuracy of G_i on examples of class y_j, i.e., the probability that y_j ∈ G_{ai} given that X_a has label y_j and G_{ai} is not Y. β_i is the propensity of G_i to vote, i.e., to not abstain. In other words, β_i = P(G_{ai} ≠ Y). In our framework, the class balance P(Y) can either be a learnable distribution or fixed. We also assume that when a PLF G_i makes a mistake, it outputs an incorrect partial label from T(G_i) uniformly at random.

To define the joint distribution P(G, Y), for each PLF G_i we also need to refer to the sets in T(G_i) that are consistent or inconsistent with each label. Let N_{ij} = {L ∈ T(G_i) | y_j ∈ L} be the set of label sets in the codomain of G_i that contain label y_j (excluding Y). Likewise, let N^C_{ij} = {L ∈ T(G_i) | y_j ∉ L} be the set of label sets in the codomain of G_i that do not contain label y_j. Then, the joint distribution is

    P(G, Y) = ∏_{a=1}^{m} P(Y_a) ∏_{i=1}^{n} P(G_{ai} | Y_a),    (1)

where

    P(G_{ai} | Y_a = y_j) = 1 − β_i                        if G_{ai} = Y (abstain),
                            β_i α_{ij} / |N_{ij}|          if y_j ∈ G_{ai},
                            β_i (1 − α_{ij}) / |N^C_{ij}|  otherwise.
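To make the three cases concrete, the following Python sketch evaluates P(G_{ai} | Y_a = y_j) for a single vote. It is a minimal illustration of Eq. (1), not the paper's released code; the function name and argument layout are our own.

```python
def plf_conditional_prob(vote, y_j, codomain, full_label_set, alpha_ij, beta_i):
    """P(G_ai | Y_a = y_j) from Eq. (1).

    `vote` is the PLF's output (a set of labels), `codomain` is T(G_i), the
    PLF's possible non-abstaining outputs, and `full_label_set` is Y.
    """
    if vote == full_label_set:                    # case 1: the PLF abstained
        return 1.0 - beta_i
    n_ij = sum(1 for L in codomain if y_j in L)   # |N_ij|
    n_ij_c = len(codomain) - n_ij                 # |N^C_ij|
    if y_j in vote:                               # case 2: vote contains y_j
        return beta_i * alpha_ij / n_ij
    return beta_i * (1.0 - alpha_ij) / n_ij_c     # case 3: inconsistent vote
```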

Learning  Given the unlabeled examples and PLF outputs G, our goal is to estimate the parameters of P(G, Y) (denoted collectively as Θ) and compute the posterior P(Y | G) over the unknown labels. To estimate Θ, we maximize the marginal likelihood of the observed outputs of the PLFs:

    Θ̂ = argmax_Θ P_Θ(G) = argmax_Θ Σ_Y P_Θ(G, Y).

This optimization is implemented in PyTorch [58]. The marginal log likelihood of a batch of examples is computed in the forward pass, and stochastic gradient descent is used to update the parameters.


We find that the way the likelihood computation is implemented in the forward pass can lead to an orders-of-magnitude difference in training time. For every example, we need to compute its conditional likelihood for every class based on votes from every PLF. Naively, this requires three nested for loops over examples, PLFs, and classes. We can speed up computation by expressing the conditional log likelihood computation as a sequence of matrix operations.
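For contrast, here is a minimal sketch of the naive triple loop just described. It reuses the hypothetical plf_conditional_prob helper sketched above and is our own illustration, not the paper's implementation.

```python
import math

def naive_log_likelihood(votes, codomains, full_label_set, alpha, beta):
    # votes[a][i]: output of PLF i on example a; alpha[i][j], beta[i]: model
    # parameters. Returns the m x k matrix of log P(G_a | Y_a = y_j).
    m, n, k = len(votes), len(codomains), len(alpha[0])
    ll = [[0.0] * k for _ in range(m)]
    for a in range(m):                  # loop over examples
        for i in range(n):              # loop over PLFs
            for j in range(k):          # loop over classes
                ll[a][j] += math.log(plf_conditional_prob(
                    votes[a][i], j, codomains[i], full_label_set,
                    alpha[i][j], beta[i]))
    return ll
```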

Let m be the number of instances in one batch, n be the number of PLFs, and k be the number of classes. For each batch we precompute accuracy indicator matrices AI ∈ {−1, 1}^{m×n×k} and count matrices N ∈ R^{m×n×k}, where AI_{a,i,j} = 1 and N_{a,i,j} = −log |N_{ij}| if class y_j is in the label subset output by the i-th PLF on the a-th example, and AI_{a,i,j} = −1 and N_{a,i,j} = −log |N^C_{ij}| otherwise. We also precompute propensity indicator matrices PI ∈ {0, 1}^{m×n}, where PI_{a,i} = 1 if the a-th instance received a non-abstaining vote (i.e., a vote that is not Y) from the i-th PLF. Let A ∈ R^{n×k} be the log of the accuracy parameters and B ∈ R^n be the log of the propensity parameters. We can map these parameters back to probability space as

    α_{ij} = exp(A_{ij}) / (exp(A_{ij}) + exp(−A_{ij}))  and  β_i = exp(B_i) / (exp(B_i) + 1).    (2)

We extend PI, A, and B to PI_ext, A_ext, and B_ext in three dimensions, with PI replicated along the third axis k times, A replicated along the first axis m times, and B replicated along the first axis m times and the third axis k times. Then, during each forward pass, we only need to calculate normalizing matrices ZA ∈ R^{n×k} and ZB ∈ R^n for accuracy and propensity, respectively, where

    ZA_{ij} = −log(exp(A_{ij}) + exp(−A_{ij}))  and  ZB_i = −log(exp(B_i) + 1).

We similarly extend ZA and ZB to ZA_ext and ZB_ext. During the forward pass we calculate the batch conditional log likelihood, an m × k matrix, as

    log P(G | Y) = Σ_{i=1}^{n} ( ZB_ext + PI_ext ⊙ ( A_ext ⊙ AI + N + B_ext + ZA_ext ) ),

where ⊙ denotes element-wise multiplication and the sum is taken over the PLF axis. This modification removes the for loops from the computation graph, yielding a 300× speedup in training time compared to the naive approach. This speedup makes the framework practical for iterative PLF development. For example, learning with 100k examples and 10 PLFs on an Intel i5-6600k CPU requires one minute.
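The computation above translates almost line for line into PyTorch. The following is a minimal sketch under the parameterization just described; the function and variable names mirror the text rather than any released implementation, and broadcasting stands in for the explicit _ext replication.

```python
import torch

def batch_log_likelihood(AI, N, PI, A, B):
    # Vectorized log P(G | Y) for one batch.
    # Shapes: AI, N: (m, n, k); PI: (m, n); A: (n, k); B: (n,).
    ZA = -torch.logaddexp(A, -A)               # -log(exp(A) + exp(-A)), (n, k)
    ZB = -torch.nn.functional.softplus(B)      # -log(exp(B) + 1), (n,)
    inner = A * AI + N + B.unsqueeze(-1) + ZA  # broadcasts to (m, n, k)
    terms = ZB.unsqueeze(-1) + PI.unsqueeze(-1) * inner
    return terms.sum(dim=1)                    # sum over PLFs -> (m, k)

# For the marginal log likelihood maximized during learning, combine with the
# log class balance log_py of shape (k,) and sum out the latent label:
#   nll = -torch.logsumexp(log_py + batch_log_likelihood(AI, N, PI, A, B),
#                          dim=1).mean()
```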

Identifiability  An important theoretical question is whether it is reasonable to try to learn the parameters of P(G, Y) even though Y is never observed. We answer this question affirmatively by showing that, as long as the codomains of the PLFs are sufficiently targeted or diverse, it is possible to determine the parameters of the label model (up to label swapping) using only the distribution of PLF outputs P(G), except on a measure-zero subset of the space of possible parameter values. This property is the strongest useful notion of identifiability for models with latent variables [59]. A model whose parameters can be determined except on a measure-zero subset is called generically identifiable. Label swapping refers to the fact that unobserved classes in a latent variable model can be relabeled without changing the observed distribution. This means that the map from the observed distribution of a label model with k classes to parameter values is at best k!-to-one and cannot be one-to-one even under ideal conditions. In practice, label swapping is not an issue because most PLFs are more accurate than random guessing. The specific condition imposed on the PLF codomains to ensure generic identifiability is described in the following theorem.

Theorem 1. The parameters of the label model described in Section 3.2 are generically identifiable up to label swapping, provided that the collection G of partial labeling functions can be partitioned into three disjoint non-empty sets S_1, S_2, and S_3 such that, for sets j = 1, 2 and all classes y ∈ Y, we can choose label sets t_i ∈ T(G_i) satisfying

    ⋂_{G_i ∈ S_j} t_i = {y}.

The proof is given in Appendix A. This theorem tells us that it is reasonable to try to estimate the PLF accuracies even though the true class labels are never observed. Our proof adapts ideas presented in Theorem 4 of Allman et al. [60], which uses Kruskal's unique factorization theorem and feature grouping to establish conditions for the generic identifiability of a naive Bayes model with arbitrary parameters. Since the space of models we consider is equivalent to a measure-zero subset of the parameters in an arbitrary naive Bayes model, an additional proof is needed to show that these parameters are generically identifiable.


We develop a novel argument to show that the above is a sufficient condition for generic identifiability up to label swapping.

In words, the condition described in Theorem 1 requires that for each class y, we can select a label group from the codomain of each PLF in S_1 such that the intersection of these label groups contains only the class y, and likewise for S_2. One way to satisfy this condition is to create PLFs that produce single-class label groups. For example, if PLF G_i contains {1}, {2}, . . . , {k} in its codomain, then any set S_j that contains G_i will satisfy the condition of Theorem 1. However, even if no PLF outputs any single-class label sets, it is still possible for the label model parameters to be identifiable, because the condition can also be satisfied by using multiple PLFs with different codomains. Suppose that we want to show that the condition is satisfied for class 1 and we have {1, 2, 3} ∈ T(G_1), {1, 3, 4} ∈ T(G_2), and {1, 2, 4} ∈ T(G_3). The intersection of these sets is {1}.
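The condition is easy to check mechanically for small codomains. The brute-force checker below is a hypothetical helper for illustration, not part of the paper's framework; it enumerates one label-set choice per PLF and tests whether some choice isolates each class.

```python
from itertools import product

def satisfies_theorem1_condition(codomains, labels):
    # codomains[i] is T(G_i) for the i-th PLF in a candidate set S_j,
    # given as a list of label sets. Returns True if, for every class y,
    # some choice of one label set per PLF intersects to exactly {y}.
    for y in labels:
        if not any(set.intersection(*map(set, choice)) == {y}
                   for choice in product(*codomains)):
            return False
    return True

# The worked example from the text: one set from each of three codomains
# whose intersection isolates class 1.
print({1, 2, 3} & {1, 3, 4} & {1, 2, 4})  # {1}
```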

3.3 Noise-Aware Classifier

The final stage of our framework is to train a classifier. After P(G, Y) is estimated with unlabeled data, we compute the posterior P(Y | G). Then, we minimize the expected empirical risk with respect to this distribution. For classifiers that output probabilistic predictions, the loss function becomes the cross-entropy loss weighted by the posterior over true labels. As in other PWS frameworks [1, 2, 14, 7], many off-the-shelf neural networks can be chosen based on the task.
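In code, this noise-aware objective is short to write. The sketch below shows one way to express the posterior-weighted cross-entropy in PyTorch; it is our own minimal rendering of the loss described above, not the paper's training code.

```python
import torch.nn.functional as F

def noise_aware_loss(logits, posterior):
    # logits: (m, k) end-model outputs; posterior: (m, k) rows of P(Y | G)
    # from the label model. Each example contributes its posterior-weighted
    # sum of per-class log losses (an expected cross-entropy).
    log_probs = F.log_softmax(logits, dim=1)
    return -(posterior * log_probs).sum(dim=1).mean()
```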

4 Experimental Results

We demonstrate the benefits of incorporating partial labels into PWS on applications in text and object classification. In Section 4.1, we compare our framework with baselines that (1) use only traditional labeling functions and (2) heuristically aggregate partial labels without a probabilistic model. Our proposed approach significantly improves accuracy over both baselines. In Section 4.2, we use pretrained visual attribute detectors as PLFs for classifying unseen objects. Our framework achieves accuracy that is competitive with recent embedding-based transductive ZSL methods. While our framework is not designed specifically for ZSL, we present this comparison to demonstrate its flexibility and to show another scenario where modeling the noise of multiple partial labeling functions can significantly improve performance relative to a heuristic approach and make discrete attribute detectors competitive with recent ZSL approaches. Together, these results show that partial labels are a useful new capability for PWS.

The code for the experiments will be released upon acceptance. Additional details about the experiments, datasets, and methods are available in Appendix C.

4.1 Text Classification

Datasets  We consider three datasets. First, SciCite [61] is a citation classification dataset sourced from scientific literature. The corresponding task is to classify a citation as referring to either background information, method details, or results. Second, TREC-6 [62] is a question classification dataset containing open-domain questions. The task is to classify each question as asking about one of six semantic categories. Finally, AG-NEWS [63] is a large-scale news dataset. The task is to classify each example as one of four topics: world politics, sports, business, or technology.

Methods  We developed PLFs using a development set for each dataset (916 examples for SciCite, 389 for TREC-6, and 500 for AG-NEWS). We evaluate three methods for combining the outputs of the PLFs to train an end model. First, as a baseline, we consider using only the PLFs that are equivalent to traditional labeling functions, i.e., they always output one label or abstain. We call this baseline LFs Only. Second, as another baseline, we use a heuristic called Nearest Class (NC), which chooses the first class with the maximal number of compatible partial labels (see the sketch below). This baseline is a generalized majority-vote heuristic for PLFs. Finally, our method, called Noisy Partial Label Model (NPLM), is the label model from Section 3.2. In all cases, we use the estimated labels to train a biLSTM + attention classifier implemented with AllenNLP [64]. Input text is embedded with ELMo [65]. Detailed PLF designs, hyperparameters, and end model architecture details are in Appendix C.1.
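As a reference point, the NC baseline amounts to counting class-compatible votes. The sketch below is our reading of the heuristic as described above, with label sets represented as index sets and abstentions dropped; it is illustrative rather than the authors' implementation.

```python
import numpy as np

def nearest_class(votes, k):
    # votes: list of label index sets output by the PLFs for one example.
    # Count, for each class, how many non-abstaining partial labels contain
    # it, and return the first class with the maximal count.
    counts = np.zeros(k, dtype=int)
    full = set(range(k))
    for v in votes:
        if set(v) != full:          # ignore abstentions (the full label set)
            for j in v:
                counts[j] += 1
    return int(np.argmax(counts))   # argmax returns the first maximum
```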

As ablations, we also report each method's performance when the label model is used directly to make predictions, before training an end model, denoted (-End).


Table 1: Results for text classification with mean accuracy (ACC), macro F1 (F1), and 95% CIs.

                         SciCite              TREC-6               AG-News
                         ACC       F1         ACC       F1         ACC       F1
Supervised (Dev. Set)    75.9±0.7  73.0±0.7   72.9±4.0  62.9±2.7   79.9±1.6  79.6±1.8
LFs Only (-End)          65.1      44.7       22.0      23.8       33.0      25.1
NC (-End)                73.2      69.2       29.6      32.6       46.1      43.5
NPLM (-End)              71.5      69.4       38.2      43.0       51.1      49.4
LFs Only                 73.2±2.1  70.5±2.4   59.4±0.8  61.0±1.2   78.4±1.1  78.4±1.1
NC                       73.2±1.7  70.4±1.7   68.2±0.6  57.2±0.6   70.5±0.8  68.5±1.1
NPLM                     76.3±1.2  74.0±1.2   81.3±1.4  81.3±1.9   82.1±0.2  81.7±0.3
NPLM vs. LFs Only        ↑3.1      ↑3.5       ↑21.9     ↑20.3      ↑3.7      ↑3.3
NPLM vs. NC              ↑3.1      ↑3.6       ↑13.1     ↑24.1      ↑11.6     ↑13.5

Results  We report the mean macro-averaged F1 and micro-averaged accuracy of the compared methods on the standard test sets in Table 1. Results using the end model are shown with 95% confidence intervals obtained using five different random seeds. NPLM consistently improves F1 and accuracy relative to LFs Only (9.6 and 9.0 percentage points on average, respectively) and NC (9.3 and 13.6 percentage points on average, respectively). The performance advantage over LFs Only demonstrates the benefits of additional weak supervision that can be expressed as PLFs, and the advantage over NC demonstrates that the proposed label model is learning useful information. The ablated versions of the methods significantly underperform their counterparts, showing that in all cases the end model learns to generalize beyond the information contained in the weak supervision heuristics. Many, and sometimes most, of the errors are on examples for which all supervision sources abstain, where a label is chosen arbitrarily or according to the class prior P(Y). For context, we also report the performance of the end model trained on the development set with ground-truth labels.

4.2 Object Classification

In this task, we show how our framework can be used to model discrete visual attribute detectors, and that this approach can achieve results competitive with recent embedding-based ZSL methods. Although they have not been used often in recent ZSL work, discrete attribute detectors have benefits such as modularity and interpretability. These experiments show that modeling them as PLFs with our unsupervised label model can lead to good accuracy.

Datasets  We consider the Large-Scale Attribute Dataset (LAD) [66] and Animals with Attributes 2 (AwA2) [32], which both provide class-level discrete visual attributes. LAD is a recently proposed attribute-based dataset with 78k instances that organizes common objects into five sub-datasets: electronics, vehicles, fruits, hairstyles, and animals. For each sub-dataset, the classes are divided into five folds of seen and unseen classes, and average performance over all tasks is used as a benchmark for ZSL. AwA2 is a widely used attribute-based ZSL animal classification dataset consisting of ∼30k instances with 85 binary attributes, 40 seen classes, and 10 unseen classes.

Methods  Following early work on zero-shot object classification [29, 13, 30, 31], we model each visual attribute in the datasets with a binary classifier. In all cases, the classifiers are trained on the seen classes for that task or fold, and the unseen classes are not used at all, not even as validation data. To create classifiers for LAD, we extract features from a ResNet-50 [67] pretrained on ILSVRC [68], in order to compare fairly with prior work. For AwA2, we fine-tune a pretrained ResNet-101 on the seen classes. Each classifier is trained with respect to the class-wise attribute annotations on the training sets of the seen classes. We create PLFs according to the provided attribute annotations.

We incorporate PLFs into the NC and NPLM methods, as described in Section 4.1. We use the data from the unseen classes as our unlabeled data. In all cases, our end models are three-layer perceptrons trained on the extracted features (ResNet-50 for LAD and ResNet-101 for AwA2). For LAD, we evaluate in a strict zero-shot setting, so we train the model on the unseen classes with the estimated labels. For AwA2, we evaluate in a generalized zero-shot setting, meaning that the model is evaluated on both seen and unseen classes, so we mix the unseen classes with estimated labels and the seen classes with given labels during training.


Table 2: Results for object classification. For LAD, we report mean accuracy (ACC) with 95% CIs across the five standard splits for each of the five subtasks. For AwA2, we report mean class accuracy (MCA) with 95% CIs. We evaluate AwA2 in a generalized setting: U and S stand for per-class accuracy on the unseen and seen classes, respectively, and H is the harmonic mean 2·U·S/(U+S).

                LAD (ACC)                                                            AwA2 (MCA)
                Animals    Fruit      Vehicles   Electronics Hairstyles  Avg.        U          S          H
ConSE [71]      36.9       29.8       37.5       28.3        24.6        31.4        0.5        90.6       1.0
ESZSL [48]      50.2       37.2       45.8       32.8        31.8        39.6        77.8       5.9        11.0
SynC [72]       61.6       51.4       54.9       43.0        29.1        48.0        90.5       10.0       18.0
VCL [70]        75.4±0.8   35.0±1.0   62.4±0.5   36.7±0.5    33.8±0.7    48.7±0.3    21.4       89.6       34.6
QFSL [69]       -          -          -          -           -           -           66.2       93.1       77.4
WDVSc [70]      97.2±0.8   43.3±1.3   82.1±0.6   54.8±1.1    31.1±2.6    61.7±0.6    76.4       88.1       81.8
NC (-End)       65.8       31.2       60.3       40.3        39.1        47.3        47.7       -          -
NPLM (-End)     86.0       38.7       73.5       51.8        45.9        59.2        68.2       -          -
NC              71.9±1.2   36.2±0.6   65.3±1.2   48.0±0.7    40.9±0.5    52.5±0.3    43.1±1.2   91.8±0.2   58.6±1.1
NPLM            87.6±0.2   42.4±0.8   77.0±0.2   57.7±0.7    46.9±0.9    62.3±0.2    71.1±0.6   91.9±0.1   80.1±0.3
NPLM vs. NC     ↑15.7      ↑6.2       ↑11.7      ↑9.7        ↑6.0        ↑9.8        ↑28.0      -          ↑21.5

Detailed hyperparameters and end model architectures are in Appendix C.2.

We compare with three recent transductive, embedding-based ZSL methods: QFSL [69], VCL [70], and WDVSc [70]. For context, we also report results from three standard inductive methods, ConSE [71], ESZSL [48], and SynC [72], although they are at a disadvantage because they access neither the unlabeled data nor any information about the unseen classes. For LAD, we replicate and report the results of WDVSc and VCL using the same extracted features.

Results  We report the average results and 95% confidence intervals based on five random seeds in Table 2. Similar to the text classification tasks, NPLM significantly outperforms NC (by an average of 9.8 percentage points on LAD and 21.5 percentage points on AwA2), and the ablations show that the end model generalizes beyond the PLFs even though they never abstain. NPLM is also competitive with WDVSc, the top-performing ZSL method in these experiments, either slightly overperforming it (LAD) or underperforming it (AwA2).

5 Conclusions, Limitations, and Future Work

In this paper, we introduced a new capability for programmatic weak supervision (PWS): the ability to learn from partial labeling functions using a novel probabilistic label model. We demonstrated a scalable way to learn these models, and our theoretical analysis shows that they are generically identifiable up to label swapping (the strongest useful notion of identifiability for latent variable models [59]). Our experiments show that our framework can (1) significantly improve the accuracy of PWS on text classification tasks and (2) enable pre-trained attribute detectors to achieve performance comparable to recent embedding-based methods for transductive ZSL on object classification tasks.

Our work expands the space of supervision sources that can be incorporated into PWS systems. Weak supervision is complementary to many other techniques, such as semi-supervised learning [73, 74], transfer learning [75, 76], active learning [77, 78], and zero-shot data generation [52, 53, 54, 55, 56, 57]. A limitation of our work is that exploring how partial labeling functions interact with these techniques is left as future work. The same is true for complementary techniques within weak supervision, such as adversarial label learning [3], learning rules from labeled exemplars [4], and weak supervision with self-training [5]. Additionally, while PWS can enable more rapid development, its dependence on heuristics introduces the potential for bias. For this reason, auditing any created models for potential negative impacts is as important, if not more important, in PWS as in traditional supervised learning [79, 80].

Our goal is to enable users to incorporate a wider range of supervision sources into PWS systems, including less specific rules and pre-trained models for related tasks. As future work, we envision creating and exploiting large libraries of rules and pre-trained models that are more modular because they are freed from the requirement that they narrow the label space down to a single class.


Acknowledgements

This material is based on research sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under agreement number FA8750-19-2-1006. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) or the U.S. Government. We gratefully acknowledge support from Google and Cisco. Disclosure: Stephen Bach is an advisor to Snorkel AI, a company that provides software and services for weakly supervised machine learning.

References[1] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. Ré, “Data programming: Creating large training sets,

quickly,” in Advances in Neural Information Processing Systems (NeurIPS), 2016.

[2] A. J. Ratner, S. H. Bach, H. E. Ehrenberg, J. Fries, S. Wu, and C. Ré, “Snorkel: Rapid training data creationwith weak supervision,” The VLDB Journal, vol. 29, no. 2, pp. 709–730, 2020.

[3] C. Arachie and B. Huang, “A general framework for adversarial label learning,” The Journal of MachineLearning Research, vol. 22, pp. 1–33, 2021.

[4] A. Awasthi, S. Ghosh, R. Goyal, and S. Sarawagi, “Learning from rules generalizing labeled exemplars,”in International Conference on Learning Representations (ICLR), 2020.

[5] G. Karamanolakis, S. Mukherjee, G. Zheng, and A. H. Awadallah, “Self-training with weak supervision,”in Conference of the North American Chapter of the Association for Computational Linguistics (NAACL),2021.

[6] N. Mallinar, A. Shah, R. Ugrani, A. Gupta, M. Gurusankar, T. K. Ho, Q. V. Liao, Y. Zhang, R. K. Bellamy,R. Yates et al., “Bootstrapping conversational agents with weak supervision,” in AAAI Conference onArtificial Intelligence (AAAI), 2019.

[7] E. Safranchik, S. Luo, and S. H. Bach, “Weakly supervised sequence tagging from noisy rules,” in AAAIConference on Artificial Intelligence (AAAI), 2020.

[8] V. S. Chen, P. Varma, R. Krishna, M. Bernstein, C. Re, and L. Fei-Fei, “Scene graph prediction with limitedlabels,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

[9] J. Fries, P. Varma, V. Chen, K. Xiao, H. Tejeda, S. Priyanka, J. Dunnmon, H. Chubb, S. Maskatia,M. Fiterau, S. Delp, E. Ashley, C. Ré, and J. Priest, “Weakly supervised classification of rare aortic valvemalformations using unlabeled cardiac MRI sequences,” Nature Communications, vol. 10, no. 1, 2019.

[10] K. Saab, J. Dunnmon, C. Ré, D. Rubin, and C. Lee-Messer, “Weak supervision as an efficient approach forautomated seizure detection in electroencephalography,” NPJ Digital Medicine, vol. 3, no. 1, pp. 1–12,2020.

[11] J. A. Fries, E. Steinberg, S. Khattar, S. L. Fleming, J. Posada, A. Callahan, and N. H. Shah, “Ontology-driven weak supervision for clinical entity classification in electronic health records,” Nature Communica-tions, vol. 12, no. 1, pp. 1–11, 2021.

[12] S. H. Bach, D. Rodriguez, Y. Liu, C. Luo, H. Shao, C. Xia, S. Sen, A. Ratner, B. Hancock, H. Alborzi,R. Kuchhal, C. Ré, and R. Malkin, “Snorkel DryBell: A case study in deploying weak supervision atindustrial scale,” in ACM SIGMOD Conference on Management of Data (SIGMOD) Industry Track, 2019.

[13] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-classattribute transfer,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[14] A. J. Ratner, B. Hancock, J. Dunnmon, F. Sala, S. Pandey, and C. Ré, “Training complex models withmulti-task weak supervision,” in AAAI Conference on Artificial Intelligence (AAAI), 2019.

[15] R. Jin and Z. Ghahramani, “Learning with multiple labels,” in Neural Information Processing Systems(NeurIPS), 2002.

[16] N. Nguyen and R. Caruana, “Classification with partial labels,” in ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining (KDD), 2008.

[17] J. Luo and F. Orabona, “Learning from candidate labeling sets,” Tech. Rep., 2010.

[18] T. Cour, B. Sapp, and B. Taskar, “Learning from partial labels,” The Journal of Machine Learning Research,vol. 12, pp. 1501–1536, 2011.

[19] L. Liu and T. G. Dietterich, “A conditional multinomial mixture model for superset label learning,” inNeural Information Processing Systems (NeurIPS), 2012.

10

Page 11: Learning from Multiple Noisy Partial Labelers

[20] L. Liu and T. Dietterich, “Learnability of the superset label learning problem,” in International Conferenceon Machine Learning (ICML), 2014.

[21] E. Hüllermeier and W. Cheng, “Superset learning based on generalized loss minimization,” in JointEuropean Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD),2015.

[22] V. Cabannnes, A. Rudi, and F. Bach, “Structured prediction with partial labelling through the infimumloss,” in International Conference on Machine Learning (ICML), 2020.

[23] H. Wang, W. Liu, Y. Zhao, T. Hu, K. Chen, and G. Chen, “Learning from multi-dimensional partial labels,”in International Joint Conference on Artificial Intelligence (IJCAI), 2020.

[24] V. Cabannes, F. Bach, and A. Rudi, “Disambiguation of weak supervision with exponential convergencerates,” arXiv preprint arXiv:2102.02789, 2021.

[25] T. Durand, N. Mehrasa, and G. Mori, “Learning a deep convnet for multi-label classification with partiallabels,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[26] C. Li, X. Li, and J. Ouyang, “Learning with noisy partial labels by simultaneously leveraging global andlocal consistencies,” in ACM International Conference on Information & Knowledge Management (CIKM),2020.

[27] Y. Yan and Y. Guo, “Partial label learning with batch label correction,” in AAAI Conference on ArtificialIntelligence (AAAI), 2020.

[28] M.-K. Xie and S.-J. Huang, “Partial multi-label learning with noisy label identification,” IEEE Transactionson Pattern Analysis and Machine Intelligence, 2021.

[29] A. Farhadi, I. Endres, D. Hoiem, and D. A. Forsyth, “Describing objects by their attributes,” in IEEEConference on Computer Vision and Pattern Recognition (CVPR), 2009.

[30] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell, “Zero-shot learning with semantic outputcodes,” in Advances in Neural Information Processing Systems (NeurIPS), 2009.

[31] D. Jayaraman and K. Grauman, “Zero shot recognition with unreliable attributes,” in Advances in NeuralInformation Processing Systems (NeurIPS), 2014.

[32] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata, “Zero-shot learning: A comprehensive evaluation of thegood, the bad and the ugly,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI),2018.

[33] W. Wang, V. W. Zheng, H. Yu, and C. Miao, “A survey of zero-shot learning: Settings, methods, andapplications,” ACM Transactions on Intelligent Systems and Technology (TIST), 2019.

[34] A. P. Dawid and A. M. Skene, “Maximum likelihood estimation of observer error-rates using the EMalgorithm,” Journal of the Royal Statistical Society C, vol. 28, no. 1, pp. 20–28, 1979.

[35] S. Nitzan and J. Paroush, “Optimal decision rules in uncertain dichotomous choice situations,” InternationalEconomic Review, vol. 23, no. 2, pp. 289–97, 1982.

[36] C. Gao and D. Zhou, “Minimax optimal convergence rates for estimating ground truth from crowdsourcedlabels,” CoRR, vol. abs/1207.0016, 2013.

[37] S. H. Bach, B. He, A. Ratner, and C. Ré, “Learning the structure of generative models without labeleddata,” in International Conference on Machine Learning (ICML), 2017.

[38] P. Varma, F. Sala, A. He, A. Ratner, and C. Ré, “Learning dependency structures for weak supervisionmodels,” in International Conference on Machine Learning (ICML), 2019.

[39] A. Mazzetto, D. Sam, A. Park, E. Upfal, and S. H. Bach, “Semi-supervised aggregation of dependent weaksupervision sources with performance guarantees,” in Artificial Intelligence and Statistics (AISTATS), 2021.

[40] A. Balsubramani and Y. Freund, “Optimally combining classifiers using unlabeled data,” in Conference onLearning Theory (COLT), 2015.

[41] ——, “Scalable semi-supervised aggregation of classifiers,” in Neural Information Processing Systems(NeurIPS), 2016.

[42] ——, “Optimal binary classifier aggregation for general losses,” in Neural Information Processing Systems(NeurIPS), 2016.

[43] T. Ishida, G. Niu, W. Hu, and M. Sugiyama, “Learning from complementary labels,” in Neural InformationProcessing Systems (NeurIPS), 2017.

[44] L. Feng, T. Kaneko, B. Han, G. Niu, B. An, and M. Sugiyama, “Learning with multiple complementarylabels,” in International Conference on Machine Learning (ICML), 2020.

[45] F. Pourpanah, M. Abdar, Y. Luo, X. Zhou, R. Wang, C. Lim, and X. Wang, “A review of generalizedzero-shot learning methods,” ArXiv, vol. abs/2011.08641, 2020.

11

Page 12: Learning from Multiple Noisy Partial Labelers

[46] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng, “Zero-shot learning through cross-modal transfer,” inAdvances in Neural Information Processing Systems (NeurIPS), 2013.

[47] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, “Devise: A deepvisual-semantic embedding model,” in Advances in Neural Information Processing Systems (NeurIPS),2013.

[48] B. Romera-Paredes and P. Torr, “An embarrassingly simple approach to zero-shot learning,” in Internationalconference on machine learning. PMLR, 2015.

[49] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele, “Latent embeddings for zero-shotclassification,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[50] Y. Liu, J. Guo, D. Cai, and X. He, “Attribute attention for semantic disambiguation in zero-shot learning,” in IEEE International Conference on Computer Vision (ICCV), 2019.

[51] L. Liu, T. Zhou, G. Long, J. Jiang, and C. Zhang, “Attribute propagation network for graph zero-shot learning,” 2020.

[52] M. Bucher, S. Herbin, and F. Jurie, “Generating visual representations for zero-shot classification,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 2666–2673.

[53] V. K. Verma, G. Arora, A. Mishra, and P. Rai, “Generalized zero-shot learning via synthesized examples,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[54] R. Felix, I. Reid, G. Carneiro et al., “Multi-modal cycle-consistent generalized zero-shot learning,” in European Conference on Computer Vision (ECCV), 2018.

[55] M. B. Sariyildiz and R. G. Cinbis, “Gradient matching generative networks for zero-shot learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2168–2178.

[56] Y. Xian, S. Sharma, B. Schiele, and Z. Akata, “f-VAEGAN-D2: A feature generating framework for any-shot learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[57] S. Narayan, A. Gupta, F. S. Khan, C. G. Snoek, and L. Shao, “Latent embedding feedback and discriminative features for zero-shot classification,” in European Conference on Computer Vision (ECCV), 2020.

[58] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NeurIPS Autodiff Workshop, 2017.

[59] E. S. Allman, J. A. Rhodes, E. Stanghellini, and M. Valtorta, “Parameter identifiability of discrete Bayesian networks with hidden variables,” Journal of Causal Inference, vol. 3, no. 2, pp. 189–205, 2015.

[60] E. S. Allman, C. Matias, J. A. Rhodes et al., “Identifiability of parameters in latent structure models with many observed variables,” The Annals of Statistics, vol. 37, no. 6A, pp. 3099–3132, 2009.

[61] A. Cohan, W. Ammar, M. Van Zuylen, and F. Cady, “Structural scaffolds for citation intent classification in scientific publications,” in Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.

[62] X. Li and D. Roth, “Learning question classifiers,” in International Conference on Computational Linguistics (COLING), 2002.

[63] A. Gulli. [Online]. Available: http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html

[64] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. S. Zettlemoyer, “AllenNLP: A deep semantic natural language processing platform,” 2017.

[65] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2018.

[66] B. Zhao, Y. Fu, R. Liang, J. Wu, Y. Wang, and Y. Wang, “A large-scale attribute dataset for zero-shot learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019.

[67] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[68] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), 2015.

[69] J. Song, C. Shen, Y. Yang, Y. Liu, and M. Song, “Transductive unbiased embedding for zero-shot learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[70] Z. Wan, D. Chen, Y. Li, X. Yan, J. Zhang, Y. Yu, and J. Liao, “Transductive zero-shot learning with visual structure constraint,” in Neural Information Processing Systems (NeurIPS), 2019.


[71] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean, “Zero-shot learning by convex combination of semantic embeddings,” arXiv preprint arXiv:1312.5650, 2013.

[72] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha, “Synthesized classifiers for zero-shot learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[73] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning. MIT Press, 2009.

[74] J. E. Van Engelen and H. H. Hoos, “A survey on semi-supervised learning,” Machine Learning, vol. 109, no. 2, pp. 373–440, 2020.

[75] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2009.

[76] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2020.

[77] D. A. Cohn, Z. Ghahramani, and M. I. Jordan, “Active learning with statistical models,” Journal of Artificial Intelligence Research, vol. 4, pp. 129–145, 1996.

[78] B. Settles, “Active learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 6, no. 1, pp. 1–114, 2012.

[79] P. Saleiro, B. Kuester, L. Hinkson, J. London, A. Stevens, A. Anisfeld, K. T. Rodolfa, and R. Ghani, “Aequitas: A bias and fairness audit toolkit,” arXiv preprint arXiv:1811.05577, 2018.

[80] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, “A survey on bias and fairness in machine learning,” arXiv preprint arXiv:1908.09635, 2019.

[81] J. B. Kruskal, “Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics,” Linear Algebra and its Applications, vol. 18, no. 2, pp. 95–138, 1977.

[82] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” in Advances in Neural Information Processing Systems, 2015.

[83] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019.


A Proof of Theorem 1

Theorem 1 provides sufficient conditions for the generic identifiability of the label model described in Section 3.2. In this section, we prove this theorem. Our proof is non-trivial because our label model yields probability distributions that are a measure-zero subset of the distributions considered by Allman et al. [60]. Allman et al. allows each entry of the class-conditional distributions to be any value in the interval [0, 1] such that the entries sum to 1, whereas our label model imposes additional algebraic constraints on the entries. Allman et al. establishes identifiability except on a measure-zero subset of the distributions that they consider, but we are unable to directly apply their results because our family of distributions might be contained in the measure-zero subset they exclude. It is therefore necessary to establish that the set of distributions for which identifiability does not hold has measure zero with respect to the distributions that can be produced by our label model, or, equivalently, that the set of values of the accuracies $\alpha_{i,j}$, propensities $\beta_i$, and class balance $P(Y)$ for which identifiability does not hold has measure zero with respect to the set of all possible parameter values.

Background The key tool that we use in our proof is Kruskal's unique factorization theorem, which relies on the concept of Kruskal rank [81]. The Kruskal rank of a matrix is defined to be the largest integer $n$ such that every set of $n$ rows is linearly independent. A useful fact is that a matrix with full row rank also has full Kruskal rank. Kruskal's theorem says that if, for $u = 1, 2, 3$, we have a $k \times r_u$ matrix $M_u$ with Kruskal rank $I_u$, and these $I_u$ satisfy

$$I_1 + I_2 + I_3 \geq 2k + 2 \tag{3}$$

then, given only the three-dimensional tensor $M$ whose entry $(a, b, c)$ is given by

$$M(a, b, c) = \sum_{j=1}^{k} M_1(j, a)\, M_2(j, b)\, M_3(j, c) \tag{4}$$

we can recover the original matrices $M_u$.
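To make Equations (3) and (4) concrete, here is a minimal numpy sketch (with a hypothetical number of classes and codomain sizes, not our label model's actual parameterization) that builds the tensor $M$ and checks the rank condition. Since each $M_u$ here has exactly $k$ rows, full matrix rank coincides with full Kruskal rank, so `np.linalg.matrix_rank` suffices for the check.

```python
import numpy as np

# Minimal sketch of Equation 4 (hypothetical class count k and codomain
# sizes): row j of each M_u is the distribution over grouped-PLF outputs
# given that the latent class Y = j.
k = 3
rng = np.random.default_rng(0)
M1, M2, M3 = (rng.dirichlet(np.ones(r), size=k) for r in (4, 5, 4))

# Equation 4: M(a, b, c) = sum_j M1[j, a] * M2[j, b] * M3[j, c].
M = np.einsum("ja,jb,jc->abc", M1, M2, M3)

# Each M_u has exactly k rows, so matrix rank k means Kruskal rank k, and
# three full-Kruskal-rank factors give 3k >= 2k + 2 (Equation 3) for k >= 2.
ranks = [np.linalg.matrix_rank(Mu) for Mu in (M1, M2, M3)]
print(all(r == k for r in ranks))
```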

Allman et al. [60] makes the connection that the probability distribution of a latent variable model with three observed variables and one latent variable that takes on a finite set of values can be described by the matrix $M$. Each $M_u$ can be interpreted as a conditional probability matrix where row $c$ is a probability distribution over the possible values of feature $u$ given that the latent variable has value $c$.

In situations where there are more than three observed variables, variables can be combined to form “grouped” variables that satisfy the theorem conditions, if needed. In our case, it is possible that individual PLFs do not have codomains that are large and diverse enough to satisfy the Kruskal rank requirement. In these situations, the condition can be satisfied by amalgamating multiple PLFs to form a grouped PLF, which can be viewed as an observed variable with a codomain that is the Cartesian product of the codomains of its member PLFs.
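For illustration, a minimal Python sketch of the grouped codomain, where each PLF output is a hypothetical partial label (a subset of the possible classes):

```python
from itertools import product

# Minimal sketch of grouping PLFs (hypothetical codomains): each PLF outputs
# a partial label, i.e., a subset of the classes. The grouped PLF's codomain
# G(S) is the Cartesian product of the members' codomains, so its "output"
# is the tuple of member outputs.
plf_1 = [frozenset({"HORSE", "ZEBRA"}), frozenset({"TIGER", "LION"})]
plf_2 = [frozenset({"TIGER"}), frozenset({"HORSE", "LION", "ZEBRA"})]

grouped_codomain = list(product(plf_1, plf_2))
print(len(grouped_codomain))  # |G(plf_1)| * |G(plf_2)| = 4 joint outputs
```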

We show that the conditions of Theorem 1 ensure that, for a generic choice of parameters in a $k$-class model, there is a tripartition of the PLFs such that two of the corresponding conditional probability matrices have full Kruskal rank ($k$) and the third conditional probability matrix has a Kruskal rank of at least 2. Thus, the conditions of Kruskal's unique factorization theorem are satisfied, so we can recover the conditional probability matrices, from which the accuracies $\alpha_{i,j}$, propensities $\beta_i$, and class balance $P(Y)$ can be computed by solving a system of equations.

Proof of Theorem 1. $S_1$, $S_2$, and $S_3$ partition the PLFs into three disjoint subsets. For $u = 1, 2, 3$, define some ordering of the PLFs in subset $S_u$ so that $S_{u,i}$ gives the $i$-th PLF in the subset. We will treat $S_u$ as a “grouped” PLF with codomain $G(S_u) = \{(t_1, t_2, \dots, t_{|S_u|}) \mid t_1 \in G(S_{u,1}), t_2 \in G(S_{u,2}), \dots, t_{|S_u|} \in G(S_{u,|S_u|})\}$, where $G(\cdot)$ denotes the codomain of a PLF. Let $M_u$ denote the $k \times |G(S_u)|$ conditional probability matrix for the combined output of all PLFs in subset $S_u$, where each entry is a product containing some combination of $\beta_i$, $(1-\beta_i)$, $\alpha_{i,j}$, $(1-\alpha_{i,j})$, and normalizing constants. We assume that the class balance $P(Y)$ has positive entries and all PLFs have non-zero propensities $\beta_i$, because any extraneous class labels or non-voting PLFs would be removed. Define $\tilde{M}_1 = \mathrm{diag}(P(Y)) M_1$, where $\mathrm{diag}(v)$ denotes the matrix with the entries of vector $v$ along its main diagonal and zeros elsewhere. $P(G)$, the observed distribution of PLF outputs, corresponds to the three-dimensional tensor obtained from applying Equation 4 to $\tilde{M}_1$, $M_2$, and $M_3$. We will consider the Kruskal ranks of $\tilde{M}_1$, $M_2$, and $M_3$, which we respectively denote $I_1$, $I_2$, and $I_3$.

We first consider $M_2$. The (row) rank of a matrix $A$ is equal to the largest integer $n$ for which there exists an $n \times n$ submatrix of $A$ that has a nonzero determinant. The determinant of such a submatrix is called an $n$-minor. $M_2$ has less than full row rank if and only if all of its $k$-minors are zero. This condition can be expressed as the vanishing of a polynomial in the entries of $M_2$, which are themselves functions of the label model parameters. In other words, the set of parameter values for which $M_2$ does not have full row rank is the zero set of this polynomial. As described in Allman et al. [60], so long as the polynomial is not identically zero, the parameter values yielding less than full row rank form a measure-zero subset of the full parameter space. To show that this polynomial is not identically zero, it is sufficient to show that there exists at least one set of parameter values for which the polynomial is nonzero, or, equivalently, that there is a set of parameter values for which $M_2$ has full row rank.

The values of the propensities $\beta_i$ and class balance $P(Y)$ do not affect row rank as long as they are nonzero, as assumed above. We now show that there is a setting of the accuracies $\alpha_{i,j}$ for which the Kruskal rank of $M_2$ is $k$. Set all $\alpha_{i,j} = 1$. By the conditions of Theorem 1, for each class $c$, there is an output in the codomain of $S_2$ for which $c$ appears in all of the individual PLF outputs and no other class appears in all outputs. This implies the following two statements about the column in $M_2$ that is associated with this output: (1) the $c$-th entry of this column does not contain $(1 - \alpha_{i,j})$ in its product, and (2) all other entries are products containing at least one $(1 - \alpha_{i,j})$ factor. When $\alpha_{i,j} = 1$, the entries containing $(1 - \alpha_{i,j})$ are all zero. In other words, $M_2$ has $k$ columns that are all zero except for a single entry, and the row containing this entry is different across the $k$ columns. These columns form the basis of a column space of dimension $k$. For any matrix $A$, $\dim(\mathrm{Col}\, A) = \dim(\mathrm{Row}\, A)$, which is the row rank of $A$. Thus, the row rank of $M_2$ is $k$. Since $M_2$ has full row rank when all $\alpha_{i,j} = 1$, it also has full Kruskal rank. This shows that the polynomial whose nonvanishing determines whether $M_2$ has full row rank is not identically zero, so $M_2$ generically has full row and Kruskal rank.
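As a toy numerical check of this argument (with hypothetical values rather than the label model's actual entries): any $k$-row matrix containing $k$ single-nonzero-entry columns that hit $k$ distinct rows necessarily has full row rank, and hence full Kruskal rank.

```python
import numpy as np

# Toy witness for the alpha = 1 argument (hypothetical values): k columns,
# each zero except for one entry, with the nonzero entries in k distinct rows.
k = 4
M2 = np.abs(np.random.default_rng(1).normal(size=(k, 7)))
M2[:, :k] = np.eye(k)                   # the k "witness" columns
M2 /= M2.sum(axis=1, keepdims=True)     # renormalize rows to distributions
assert np.linalg.matrix_rank(M2) == k   # full row rank, hence full Kruskal rank
```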

We now consider $\tilde{M}_1$. The arguments applied to $M_2$ can be applied exactly to $M_1$, but we are interested in $\tilde{M}_1 = \mathrm{diag}(P(Y)) M_1$. However, since we assumed that $P(Y)$ contains only positive entries, and multiplying each row of a matrix by a nonzero scalar does not change its row rank, the same arguments can be applied to $\tilde{M}_1$. We conclude that $\tilde{M}_1$ also generically has full Kruskal rank.

Finally, we consider $M_3$. The Kruskal rank of a matrix is less than two only if there are two rows that are scalar multiples of each other. This can happen in our model only when the class-conditional accuracies for two classes are exactly equal, which corresponds to a measure-zero subset of the parameter space. Thus, we can generically assume that $M_3$ has a Kruskal rank of $I_3 \geq 2$.

Since, generically, $\tilde{M}_1$ and $M_2$ have Kruskal ranks of $k$ and $M_3$ has a Kruskal rank of at least 2, we have $I_1 + I_2 + I_3 \geq 2k + 2$, so Kruskal's unique factorization theorem tells us that we can recover $\tilde{M}_1$, $M_2$, and $M_3$ from $P(G)$, the observed distribution of PLF outputs. Once $\tilde{M}_1$, $M_2$, and $M_3$ are known, the accuracies $\alpha_{i,j}$, propensities $\beta_i$, and class balance $P(Y)$ can be computed using algebraic manipulations.

B Dataset Information

SciCite [61] is a citation purpose classification dataset containing 8,243 train, 916 development, and 1,861 test instances across 3 categories, sourced from scientific literature. It is publicly available under the Apache License 2.0.

TREC-6 [62] is a publicly available dataset for research use. It is a question classification dataset containing a broad range of open-domain questions across 6 semantic categories. Since the original dataset lacks a validation/development set, we sample 389 instances from the training set, yielding a train/dev/test split of 5,063/389/500.

AG-News [63, 82] is a publicly available dataset for research use. It is a large-scale news topic classification dataset containing 4 categories. We similarly sample 500 training instances as our development set. The train/dev/test sizes are 119.5k/500/7,600.
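A minimal sketch (hypothetical function and variable names) of how such a development set can be carved out of the training split, as done for TREC-6 (389 instances) and AG-News (500 instances):

```python
import random

# Sample dev_size examples uniformly at random from the training data and
# return the remaining examples as the new training set.
def split_off_dev(train_examples, dev_size, seed=0):
    rng = random.Random(seed)
    dev_idx = set(rng.sample(range(len(train_examples)), dev_size))
    dev = [ex for i, ex in enumerate(train_examples) if i in dev_idx]
    train = [ex for i, ex in enumerate(train_examples) if i not in dev_idx]
    return train, dev
```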


LAD [66] is a publicly available dataset for research use. It has approximately 78k instances organized into five sub-datasets of common objects: electronics, vehicles, fruits, hairstyles, and animals. Each sub-dataset is associated with 5 different seen/unseen class splits.

AwA2 [32] is a publicly available dataset for research use. It has 85 binary attributes for 50 animal classes. For our experiment, following the dataset authors, we adopt the proposed split that divides the 50 classes into 40 seen classes and 10 unseen classes.

All datasets we use are publicly available standard research datasets. These datasets generally do not contain personally identifiable information. Public figures are sometimes mentioned in the text datasets.

C Additional Experiment Details

For the PLF development and label modeling stages of the text classification tasks, the experiments run on a local PC with an Intel i5-6600K CPU and 32 GB of RAM. For the discriminative modeling and PLF development that involve neural network inference/training for the object classification tasks, we perform our experiments on virtual computing instances with an Intel Xeon E5-2698 v4 CPU, 1 NVIDIA V100 GPU, and 32 GB of RAM.

C.1 Text Classification

For both LFs Only and NPLM, following prior practice in programmatic weak supervision [2, 12], we filter the training instances by retaining only those with at least one PLF/LF vote; the filtered instances are used for LFs Only/NPLM label model and end model training. For the optimization of LFs Only and NPLM, we use an initial learning rate of 0.01 and a reduce-learning-rate-on-plateau scheduler with a decay factor of 0.1. We train the NPLM/LFs Only label models for 5 epochs.
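A minimal sketch (hypothetical data structures, not our released implementation) of this coverage filter:

```python
# Keep only training instances on which at least one PLF/LF does not abstain.
ABSTAIN = None

def filter_covered(examples, plf_outputs):
    """plf_outputs[i][j] is PLF j's partial label (a set of classes) for
    example i, or ABSTAIN if the PLF does not vote on that example."""
    return [ex for ex, votes in zip(examples, plf_outputs)
            if any(v is not ABSTAIN for v in votes)]
```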

For the end model, we adopt the ELMo-Original contextualized word embeddings [65]. We use a bidirectional LSTM with hidden size 128 as our encoder backbone, apply multiplicative attention to the encoded embeddings, and pass the result to a 3-layer MLP classifier with hidden dimension 256, LeakyReLU activations, batch normalization, and a dropout layer with 50% probability. For all discriminative models, we use the AdamW (Adam with decoupled weight decay) [83] optimizer, a cosine learning rate scheduler, and a training batch size of 32. For AG-News, we train for 5 epochs with an initial learning rate of 3e-4 and a gradient clipping threshold of 2.0. For TREC-6, we train for a total of 10 epochs with a gradient clipping threshold of 3.0 and an initial learning rate of 4e-4. For SciCite, we train for a total of 15 epochs with a gradient clipping threshold of 4.0 and an initial learning rate of 4e-4. The best end model is picked based on the best validation macro-averaged F1.
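For concreteness, a rough PyTorch sketch of this architecture; the input dimension, attention parameterization (a learned scoring vector, one simple form of multiplicative attention), and layer layout are illustrative assumptions, not the exact released implementation.

```python
import torch
import torch.nn as nn

class TextEndModel(nn.Module):
    def __init__(self, embed_dim=1024, hidden=128, mlp_hidden=256, n_classes=6):
        super().__init__()
        self.encoder = nn.LSTM(embed_dim, hidden, batch_first=True,
                               bidirectional=True)
        # Attention scores over encoder states via a learned vector.
        self.attn = nn.Linear(2 * hidden, 1, bias=False)
        layers, dim = [], 2 * hidden
        for _ in range(2):  # two hidden layers, then an output layer (3 total)
            layers += [nn.Linear(dim, mlp_hidden), nn.BatchNorm1d(mlp_hidden),
                       nn.LeakyReLU(), nn.Dropout(0.5)]
            dim = mlp_hidden
        layers.append(nn.Linear(dim, n_classes))
        self.mlp = nn.Sequential(*layers)

    def forward(self, embeddings):              # (batch, seq, embed_dim)
        states, _ = self.encoder(embeddings)    # (batch, seq, 2 * hidden)
        weights = torch.softmax(self.attn(states), dim=1)
        pooled = (weights * states).sum(dim=1)  # attention-weighted pooling
        return self.mlp(pooled)
```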

C.2 Object Classification

We use AwA2 for our generalized zero-shot experiments and the sub-tasks of LAD for zero-shot evaluations. For AwA2, we follow the proposed splits; for LAD, we follow the seen/unseen class split guide noted by the original authors. We adopt previous practices and guidelines in evaluating generalized AwA2 results, using average per-class top-1 accuracy (or mean class accuracy, MCA) as the main performance metric for both unseen and seen classes at test time, and then report the harmonic mean. For LAD, we follow the authors' practice of reporting the average accuracy over the 5 sub-categories, each with 5 different seen/unseen class splits.
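A minimal sketch of this metric (hypothetical array inputs):

```python
import numpy as np

# Mean class accuracy (MCA): average of per-class top-1 accuracies, computed
# separately over seen and unseen classes, then combined by the harmonic mean.
def mean_class_accuracy(y_true, y_pred, classes):
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes
            if np.any(y_true == c)]
    return float(np.mean(accs))

def harmonic_mean(mca_seen, mca_unseen):
    return 2 * mca_seen * mca_unseen / (mca_seen + mca_unseen)
```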

While the train/validation split among the seen classes is given for AwA2, LAD does not supply one. We randomly sample at least one and at most 10% of the seen classes as validation classes for the detectors and use the rest of the seen data for training.

For both AwA2 and LAD, we train a detector for each attribute with a 3-layer MLP. We consider setting the hidden dimensions to either 512 or 1024 and select the size that gives the higher minimum per-class accuracy (in other words, whichever improves the worst-scoring class the most). We use ILSVRC-pretrained ResNet-50 features for LAD and seen-class-finetuned ResNet-101 features for AwA2. The minority class is balanced by oversampling. We also apply batch normalization and 50% dropout at each layer during training; the activation function is LeakyReLU. For the optimization, we adopt an Adam optimizer with an initial learning rate of 1e-4 and a multi-step learning rate scheduler. We train the detectors for {100, 300, 500} epochs with learning rate scheduling step sizes of {30, 80, 200}, respectively. The best model is selected based on the best validation accuracy measured on the held-out seen classes.
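A rough PyTorch sketch of one such per-attribute detector; the input feature dimension and the two-way output head are illustrative assumptions.

```python
import torch.nn as nn

# 3-layer MLP attribute detector with BatchNorm, LeakyReLU, and 50% dropout;
# hidden size is chosen from {512, 1024} as described above.
def make_attribute_detector(feat_dim=2048, hidden=1024):
    return nn.Sequential(
        nn.Linear(feat_dim, hidden), nn.BatchNorm1d(hidden),
        nn.LeakyReLU(), nn.Dropout(0.5),
        nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden),
        nn.LeakyReLU(), nn.Dropout(0.5),
        nn.Linear(hidden, 2),  # attribute present / absent
    )
```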

For the NPLM label model, we use an Adam optimizer with a reduce-learning-rate-on-plateau scheduler. The full set of hyperparameters will be included in the code to be released upon acceptance. The end model is a 3-layer MLP with hidden layers of size 1024; we apply batch normalization and 50% dropout at each layer, with LeakyReLU activations. We optimize the end discriminative model with an initial learning rate of 1e-4 and adopt a reduce-learning-rate-on-plateau scheduler with a decay factor of 0.1. For the generalized task on AwA2, we train the model for 11 epochs. For LAD, we pick the best model with the lowest training loss. As with the text tasks, the training objective is soft cross entropy.
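A minimal sketch of the soft cross-entropy objective (hypothetical tensor shapes): the end model is trained against the label model's probabilistic labels rather than hard class indices.

```python
import torch.nn.functional as F

# logits: (batch, n_classes); soft_targets: (batch, n_classes), rows sum to 1.
def soft_cross_entropy(logits, soft_targets):
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```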
