john blitzer: so before i start, i want to credit people who have helped me along. so shai ben-david is a professor at waterloo. koby crammer is a postdoc in our group. mark dredze interned with you guys last summer in new york city. ryan mcdonald was a fellow grad student-- is now at google in new york city as well. and fernando is my adviser.
so the problem i'm going to talk about is hopefully one that anyone who's ever worked with statistical models has encountered in some form or another. and that's that you train up a model in the lab with the data you have. you test it out. it looks pretty good. and then when you go to apply it in the real world, it's just terrible. so this can happen in vision, where you have cases
where you may have good face data at train time, but when you go to apply a face recognizer, you might have occlusion or lighting differences that you need to deal with. for gene finding now, this is becoming exceptionally important, because you may have good, well-curated, annotated dna sequences from one organism, but you want to train a model that works well for a different organism. probably the most familiar setting to people here is in
speech and speaker adaptation, where you have transcriptions from one person, but you actually want to use a speech recognizer in a case where a person has a very different accent or vocal tract length. so in text, this problem is particularly acute, because there's really a huge variation in vocabulary and style. so you can have domains like financial news and blogs and scientific text.
so one way to approach this is just to say ok, well, we've got multiple domains. let's just train a model in each domain, and we can that way handle all the data we ever encounter. well, that works well, but if you think about it, training a model usually involves some sort of annotation, and in the case of something like translation, this means you would have to go and get a translator to go and translate blogs for you.
it's not something that's particularly cheap. and furthermore it's unclear what exactly you meant by domain. so obviously within blogs, there's a wide variety of different kinds of language that people use. i mean, even within certain types, you may have blogs about cell phones versus about software. again, where do you draw the line here? so let me dive into two specific cases that i'm going
to address today in this talk. the first is sentiment classification for product reviews. so in this setting, we're going to receive a review. it's just some text describing how a person felt about a product he or she bought. and the idea is we're going to want to pipe this through a statistical classifier-- svm, naive bayes, or something-- and get out a rating that's either positive or negative.
and so i just want to pause here and say i realize that this has actually become a pretty hot topic in the literature, and there are lots of people now working on this. i guess there are even people at google in new york who are doing a project on sentiment now. so where's the domain adaptation happening? we're going to look at a setting where you have annotated product reviews from one particular type of product-- let's say books-- where together with each
review, someone went through and told me this review is expressing positive sentiment or this review is expressing negative sentiment. but what we're going to want to do is go to a different product-- let's say kitchen appliances-- and apply a classifier that we learned on books. and for kitchen appliances, we're not going to have any labeled data at all.
so let me give you guys two examples from these two domains, just to illustrate the kinds of problems we're going to try to overcome here. so the first is from books. we pulled both of these off amazon. so this is running with scissors, a memoir. "this book was horrible. i read half of it, suffering from a headache the entire time.
and eventually i lit it on fire. one less copy in the world. don't waste your money. i wish i had the time spent reading this book back, so i could use it for better purposes. this book wasted my life." ok. now let's look at a kitchen appliance review from the same site, also from amazon.
"i love the way the tefal deep fryer cooks. however, i'm returning my second one due to a defective lid closure. the lid may close initially, but after a few uses, it no longer stays closed. i will not be purchasing this one again." ok. so if you look at the actual words that people are using to express sentiment in these two cases, actually it's quite
different, right? you read half of a book. that means that you didn't really like it. we are not going to read half of a tefal deep fryer. again, you don't say things like this book was defective. it just didn't work. and i'm returning it. right? that's not the way you express negativity about books.
and in practice actually, if you train up an svm on books and you actually test it out on kitchen appliances, the error doubles. so this is a pretty serious problem. the other task that i'm going to address is a more traditional, sort of canonical nlp task. this is part of speech tagging. so we're going to get some training data from the wall street journal.
and let me read to you guys this sentence. so this is a large corpus of annotated financial news. "the clash is a sign of a new toughness and divisiveness in japan's once cozy financial circles." and the task again, here, for part of speech tagging is to take each word and annotate it with its grammatical functions, so to say something like once cozy is an adjective, toughness is a noun, and so on. and actually there are people at penn now who are interested
in building nlp pipelines for biomedical abstracts. so they are in the biology department, and they want to build nlp tools for biology texts. but they don't have any labeled texts, right? so they get sentences like "the oncogenic mutated forms of the ras proteins are constitutively active and interfere with normal signal transduction." so again, the vocabulary really changes a lot here. what i've highlighted here are words that occur five times or
more frequently in one domain than in another. so in particular, you have words like oncogenic which almost never occurs in the wall street journal, no matter how much text from that particular domain i'm going to show you. and the same kind of problem occurs. so if you train up a state-of-the-art part of speech tagger in wall street journal and test it out on medline, the error quadruples. ok, so before i go into structural correspondence
learning, which is the topic of the talk, i want to get us all on the same page in terms of the supervised models i'm going to be using. so these are just linear models for text. i guess everyone here, especially at google, will be completely familiar with them, but just to sort of get on the same page as far as notation. so the idea here is that we're going to take a document, and we're going to represent this as a vector in a high
dimensional vector space, where each dimension of the vector is a particular feature, like a word or a bigram in the case of sentiment. and for a particular instance, the dimensions which have positive non-zero values are those words which actually occur in the document. so here you say, oh well, the word horrible occurred three times in this book review. i give it 0.3.
read half occurred once. i give it 0.1, and so on. we also have a weight vector. and in this weight vector, each weight sort of corresponds to the propensity of a particular word to indicate positive or negative sentiment. so you have things like horrible gets a -1 because it's an indicator that a document might be negative. waste gets -1.2 and so on.
and the way we're going to classify is just take the dot product of these two things. so this will give us a score. if we add up all the features, weighted by their weights, and the score is negative, we end up saying ok, this document expresses negative sentiment. if we add them up, and it's positive, this document expresses positive sentiment. the problem i want to focus on again is-- so remember we had a feature like defective.
and we had a bunch of training data from books on which we can estimate the entries in this weight vector. but we've never seen the word defective, this particular feature, so what do we do? well, the best we could possibly do is give it a zero weight. we've never seen it before. we don't know what it means. so we just give it a zero weight.
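a minimal sketch of the linear scoring just described, assuming raw unigram and bigram counts rather than the normalized values on the slide. the weights in `book_weights` and the two toy reviews are hypothetical, made up only to show how a feature never seen in the source domain ends up contributing nothing to the score.

```python
from collections import Counter


def featurize(tokens):
    """Map a tokenized review to sparse unigram and bigram counts."""
    feats = Counter(tokens)
    feats.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return feats


def score(feats, weights):
    """Dot product of features and weights; unseen features get weight 0."""
    return sum(value * weights.get(name, 0.0) for name, value in feats.items())


# Hypothetical weights, standing in for ones estimated from labeled book reviews.
book_weights = {"horrible": -1.0, "waste": -1.2, "read half": -0.8, "fascinating": 1.1}

book_review = "this book was horrible i read half of it".split()
kitchen_review = "the lid is defective".split()

book_score = score(featurize(book_review), book_weights)        # negative: "horrible", "read half"
kitchen_score = score(featurize(kitchen_review), book_weights)  # zero: "defective" is unknown
```

the kitchen review scores exactly zero here, which is the failure the rest of the talk is trying to repair.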
but in practice, of course, this isn't going to help us when we get to kitchen appliances. ok. so the other thing i want to mention is that i gave you a simple binary classification setting. obviously, this is kind of the state-of-the-art for even more complex nlp structured tasks. and we'll use the same sorts of ideas-- this vectorized representation-- in part of
speech tagging as well. so structural correspondence learning will cut this error that i was showing you guys by 40%. and the basic idea is just to use unlabeled data from the target domain. the reason we call it correspondence learning is that we're going to induce correspondences among features from the different domains. so if we think about our sentiment example, these are
things like well the bigram, read half, in books is sort of like the word defective in kitchen appliances. they roughly mean the same thing. and if we can find good correspondences, then the basic intuition is that the labeled data for the source domain will automatically give us a good classifier for the target domain. so if we do a good job of learning these correspondences, we can learn a representation that is already going to help us to do adaptation well.
yes, sam. audience: one-to-one? or many-to-one? john blitzer: so we'll see. it's basically many to many. and when the algorithm gets fleshed out, we'll see exactly how that works. actually, as far as i know we're the first people to use this kind of idea for text, but for those of you again who
know speech, this is a fairly common problem set-up that they have there, where they want to do speaker adaptation. and there's a technique called maximum likelihood linear regression which works pretty well for them. the set-up is almost identical, but the techniques are going to be quite different. scl, as i alluded, is a two-step learning process. in the first step, we get a bunch of unlabeled data, from both the source and target domains.
and the idea is to learn a common shared representation that maps source instances and target instances into the same low dimensional vector space. then we're going to simply take this low dimensional representation and learn features for that to do good classification. so phi now provides us with a bunch of new features, and we're just going to learn weights on those features to do our classification.
we kind of alluded to this before, but just to think about what are the properties of phi that we're looking for. well, one, we need to make the domains look as similar as possible. but also we need to allow ourselves to do-- we designed our feature space to have good discriminative power. and we don't want to lose that power in doing this mapping. in particular, you can think of fulfilling the first
criterion by mapping all the points onto one low dimensional point. and that's obviously not going to help us. what's the intuition for how we're going to do this? so if we go back to our example on kitchen appliances, there is this word defective. and we said that if we only knew that this was a negative word, we could do well here. so how can we figure out that it's a negative word?
let's take our unlabeled kitchen context. again, look up the word defective in a bunch of kitchen appliances reviews from amazon, and see where it occurs. so you get things like "do not buy the sharp portable steamer. the trigger mechanism is defective." "the very nice lady assured me that i must have a defective set. what a disappointment." "maybe mine was defective.
the directions were unclear." so the words i've picked here in blue are things that basically could be methods for expressing negativity about either books or kitchen appliances. let's look up now these words in the book context. so for not buy, "the book is so repetitive that i found myself yelling. i will definitely not buy another." "a disappointment-- ender was talked about for some small number of pages
altogether." so ender is a character in this book that this guy really likes, and he didn't get enough face time i guess. "it's unclear," "it's repetitive and boring." so again we want to somehow use the co-occurrence with these blue features to realize that defective is like boring, number of pages, or repetitive, when you go to books. so what are these blue features?
we're going to call them pivot features. and they have several properties that i want to make explicit here. first, they have to occur frequently in both domains. they need to be good at characterizing the task we actually want to do-- the discriminative task. in practice, they're going to number in the hundreds or thousands. so i showed you three, but we're going
to choose many more. and we need to choose them using the data we have. so what can we exploit? we have some labeled source data. we have some unlabeled source and target data for picking these pivots. so let me give you guys two examples of how to pick pivots and what kinds of features come out. so the first is what i'm going to call scl, and that's just
to choose words and bigrams that occur frequently in both domains. and the second, i'm going to call scl-mi, which is like scl, but it's based also on the mutual information with the labels. so in the first case, you get words like one, about, when-- probably not such great in terms of pivots. but in the second case, when you also include the mutual information from the labels you have in the source domain,
you get ones that look much better-- highly recommended, awful, loved it. these pivots are things that if you could model the co-occurrence well, you would assume that you might be able to do good classification. so how are we going to actually do this? well, the idea behind the pivots is just to use them to align other features. so if we go back to our first example with not buy, the idea
here is that we're just going to cover up not buy, mask it, and use the other features to predict the presence or absence of not buy in this particular example. we're going to construct a single binary problem, and instantiate that across all the data, and say does the phrase not buy occur here. yes or no. and we're going to train n linear predictors here-- one for each binary problem.
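before any predictors can be trained, the pivots themselves have to be picked. the following is a rough sketch of the two selection schemes mentioned earlier (frequency only, and frequency plus mutual information with the source labels). the function names, the `min_count` threshold, and the toy documents are all assumptions made for illustration; the talk does not give the actual thresholds.

```python
import math
from collections import Counter


def doc_freq(docs):
    """Number of documents each feature occurs in (docs are sets of features)."""
    return Counter(f for doc in docs for f in set(doc))


def mutual_information(feature, docs, labels):
    """Mutual information (nats) between presence of `feature` and a binary label."""
    n = len(docs)
    joint = Counter((feature in doc, y) for doc, y in zip(docs, labels))
    mi = 0.0
    for (present, y), count in joint.items():
        p_xy = count / n
        p_x = sum(c for (p, _), c in joint.items() if p == present) / n
        p_y = sum(c for (_, l), c in joint.items() if l == y) / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi


def select_pivots(source_docs, source_labels, target_docs, k, min_count=2, use_mi=False):
    """Keep features frequent in both domains, then rank by frequency or by MI."""
    src, tgt = doc_freq(source_docs), doc_freq(target_docs)
    frequent = [f for f in src if src[f] >= min_count and tgt[f] >= min_count]
    if use_mi:
        rank = lambda f: mutual_information(f, source_docs, source_labels)
    else:
        rank = lambda f: min(src[f], tgt[f])
    return sorted(frequent, key=lambda f: (-rank(f), f))[:k]


# Hypothetical toy data: labeled book reviews and unlabeled kitchen reviews.
source_docs = [{"the", "awful", "plot"}, {"the", "awful", "boring"},
               {"the", "loved it", "engaging"}, {"the", "loved it", "grisham"}]
source_labels = [0, 0, 1, 1]
target_docs = [{"the", "awful", "defective"}, {"the", "loved it", "breeze"},
               {"the", "awful", "leaking"}, {"the", "loved it", "espresso"}]

freq_pivots = select_pivots(source_docs, source_labels, target_docs, k=1)
mi_pivots = select_pivots(source_docs, source_labels, target_docs, k=2, use_mi=True)
```

on this toy data the frequency-only scheme picks a function word, while the mutual information scheme picks the sentiment-bearing features, which mirrors the contrast described above.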
the thing to notice here is that each linear predictor we train is characterized by a weight vector. one issue i want to point out is that these-- what i'm going to call pivot predictors-- are implicitly aligning features from different domains. how do they do that? well, if we notice that defective and repetitive both have positive weight for not buy, this pivot predictor,
then we know that in that case, we can kind of hypothesize that these might be aligned. audience: [inaudible] john blitzer: what's a negative example? any instance which doesn't have the phrase not buy in it. we have all these weight vectors. if we construct a matrix where the columns themselves are the weight vectors from these binary prediction problems, note that actually doing the matrix vector multiplication
gives n new binary features, where the value of the i-th feature is basically just the propensity to see not buy in the same document. so we give you a document back, and i get a bunch of new features which say could "not buy" occur in this context, yes or no? audience: each column is the weight vector for a predictor of a pivot? john blitzer: right, exactly.
audience: so you're saying i features see not buy. i plus one will be some other pivot-- some other predictor that's [inaudible] not buy but is-- john blitzer: what was the other one i listed? awful. something like this. we're almost done, but we created these 1,000 features. let's say n is 1,000-- if we had 1,000 pivots.
that's still reasonably large. the reason i say this is that there's a lot of duplicate information here. you have predictors that are like horrible, terrible, and awful. all are good pivots, but they tend to mean the same thing. what we'd like is to have a simple basis that kind of characterizes this space, and that we can basically just plug into a standard linear model.
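the pivot predictors and the matrix built from their weight vectors can be sketched as follows. this is an illustration under stated assumptions, not the training procedure from the talk: the loss used for the pivot predictors is not given, so ridge regression onto +1/-1 targets is swapped in as a stand-in, all pivot columns are masked at once for simplicity, and the tiny document matrix is hypothetical.

```python
import numpy as np


def train_pivot_predictors(X, pivot_idx, ridge=1.0):
    """Return W (n_features x n_pivots): one column of weights per pivot.

    Stand-in learner: ridge regression onto +1/-1 "does the pivot occur" targets.
    """
    n_docs, n_feats = X.shape
    nonpivot = np.ones(n_feats, dtype=bool)
    nonpivot[pivot_idx] = False
    X_masked = X * nonpivot                      # cover up the pivot features
    Y = np.where(X[:, pivot_idx] > 0, 1.0, -1.0)  # one binary problem per pivot
    A = X_masked.T @ X_masked + ridge * np.eye(n_feats)
    return np.linalg.solve(A, X_masked.T @ Y)


# Hypothetical unlabeled documents; columns are
# ["not buy" (pivot), "defective", "repetitive", "great", "the"].
X = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 1],
], dtype=float)

W = train_pivot_predictors(X, pivot_idx=[0])
pivot_features = X @ W   # one new real-valued feature per pivot, per document
```

in this toy matrix "defective" and "repetitive" never co-occur, yet both receive positive weight for the "not buy" predictor, which is the implicit alignment described above.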
and we're going to construct this by computing the svd and using the top left singular vectors, which i'll call phi here. so for those of you who know the history of dimensionality reduction in language, there are these two probably most famous papers, which are latent semantic indexing and a bayesian and probabilistic variant of that, latent dirichlet allocation. i want to stop here and try and characterize just at a
high level what the difference is. so first, these dimensionality reductions are done on the feature document matrix. and in particular, by picking pivots, we can actually characterize the kinds of representations that we learn. and this actually is important. because here if we get a good representation, that's great. but if we don't, then there's no real recourse to understanding how we want to design a representation to do
a particular discriminative problem. so by actually choosing the pivots appropriately, we can direct this dimensionality reduction to give us good features which are useful discriminatively. so now back to the second step, which is how do we use this in a linear predictor? so we have these two vectors. i showed you guys the high dimensional vector. now we have the projection of this-- say we took the top 50
singular vectors. we have the projection onto a 50 dimensional real valued space. so we want to use this in a classifier-- standard linear model. the way we're going to do this is very simple. we had a weight vector before for x. now we have another weight vector for phi transpose x. and we just add the two together.
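a small sketch of the svd step and the combined score, assuming `W` is the matrix whose columns are the pivot predictor weight vectors. the random matrix, the sizes, and the weights `w` and `v` below are placeholders chosen for illustration (the talk uses the top 50 singular vectors and on the order of 1,000 pivots).

```python
import numpy as np


def learn_projection(W, k):
    """phi: the top-k left singular vectors of W (n_features x n_pivots)."""
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :k]


def score(x, w, v, phi):
    """Combined linear model: w . x  +  v . (phi^T x)."""
    return w @ x + v @ (phi.T @ x)


rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))          # placeholder for real pivot predictor weights
phi = learn_projection(W, k=2)       # 6 features projected down to 2 dimensions

w = np.array([0.5, -1.0, 0.0, 0.0, 0.0, 0.0])  # feature 5 unseen in source: weight 0
v = np.array([1.0, -2.0])                      # weights on the low dimensional features
x = np.zeros(6)
x[5] = 1.0                                     # a document containing only feature 5

s = score(x, w, v, phi)
```

even though `w` is zero on the unseen feature, the document still gets a score through `v` and the projection, which is how the low dimensional representation is meant to carry over to the new domain.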
at train time, we're going to learn w and v together, and at test time, the idea is to first apply phi and then apply w and v. and the hope is that, again, the representation we learn here of phi is good for domain adaptation. in that case, we'll be able to classify instances in a new domain using v. ok, so before i go into the results, i want to stop and mention two sort of direct inspirations for my design of scl.
so the first is alternating structure optimization, which is a semi-supervised technique by ando and zhang. and the idea there is again to use auxiliary predictors on unlabeled data, to train discriminative models on unlabeled data, and use that to characterize a reasonable hypothesis space for doing good discriminative learning. the second is-- for those of you who know dimensionality reduction-- this area of correspondence dimensionality reduction.
so the idea there is that you have some high dimensional representations of a single low dimensional manifold. and you want to learn a manifold that respects these high dimensional representations, these high dimensional correspondences. so let's go on to the experimental results. so first, for sentiment classification, again, all our data is from amazon. so what we did, we just crawled the amazon site and we
pulled down a bunch of reviews. these are books, dvds, kitchen appliances, and electronics. these are four domains. we had 2,000 labeled reviews from each and between 3,000 and 6,000 unlabeled. so we treat this as a binary classification problem. each review has together with it a set of stars. we take things that are four or more stars, four and five stars, and call them positive--
one and two stars and call them negative. audience: when you say unlabeled, do you mean you downloaded the stars for those 3,000 to 6,000 but you don't show [inaudible] the algorithm? john blitzer: right. audience: there's some reason why in the data set, you don't believe those labels. john blitzer: no, no, no. there's no reason actually.
we could in fact use them as labeled. this is purely for experimental purposes. so i mean well-- that's not quite true. we sort of curated and tried to throw out duplicates, and do a good job at finding reasonable reviews for our labeled data. but basically, they come from the same ultimate source. so the features--
we use unigrams and bigrams, which is pretty standard for this task. audience: so the technique you're explaining, does it assume that the domains are given, that you're going to tell it this text is in this domain, that text is in that domain? or are you able to generalize across text where the domain is itself unlabeled? john blitzer: so all the experiments i'm going to show
you are from the first case. we can actually talk afterward about firstly, potentially discovering multiple domains, and also using the same sorts of ideas here when you have no idea where the domains are segmented. both of those are things we've looked at, but they're not part of this talk. so for the pivots, we're going to use scl and scl-mi, which i showed you guys several slides back.
and at train time, we're just going to minimize a huberized version of the hinge loss. you can use whatever your favorite loss is. so before i show you the numerical results, i want to show you a visualization of the kinds of projections that could potentially come out of this sort of learning procedure. so what i'm showing you in the top left here are words that only occur in the books domain and are negative under this
projection. this is a single column of phi. so plot-- if you talk about the plot, you don't like the book. if you say something's predictable, that's not a good thing. for kitchen appliances, you know the plastic-- if the little plastic handle breaks-- books typically aren't poorly designed, although they could be, but kitchen appliances, if you don't like it, you say
it's poorly designed. leaking-- books don't leak. positive-- you have "fascinating," "engaging." must read. grisham-- people like john grisham on amazon. for kitchen appliances, espresso is sort of like the john grisham of kitchen appliances. people just like espresso.
you have other words like "are perfect," "using this was a breeze," "i've been using it for years now," all these are ways of expressing positivity that are specific to appliances. the nice thing about this, other than just being cute, is that actually-- remember that we're going to want to train a discriminative model here. and even if we've never seen all these words on the bottom,
we can tell that "poorly designed," for instance, expresses negative sentiment, by virtue of the way it's projected relative to plot, predictable, and number of pages. and we do have labeled data for these, right? so we can actually tell this immediately from our labeled data. so here are the first set of numerical results. they're kind of complicated, so let me parse them a bit.
first, what i've labeled up top, each of these sections is one domain that we're going to adapt to. so this is testing on books here. this 80.4 is the result we would get if we took all 2,000 labeled data points and trained a classifier on that in books. so this is sort of like the upper bound-- how well could you do if you had a good books classifier? each of these sets of three bars is training in one
particular other domain. so again, like electronics and kitchen are not very similar to books. dvds are more similar, so in general the bars are higher. the baseline is what happens if i just train an svm, and test it. the blue is what happens if i use the scl features, and the green are what happens if i use the scl-mi features. so the take away message here is that in particular, how to
interpret these-- if you look at this set, the baseline loss due to adaptation is 7.6%. the scl-mi loss is around 0.7%. so you can do almost as well as having books using only unlabeled data from the books domain and lots of labeled dvd data. audience: these are great results. i'm not trying to be negative about it, but i just want to understand the red line, labeled 80.4, doesn't get to
see the pool of unlabeled data that the green and blue bars got to see, right? john blitzer: that's right. so potentially-- audience: in principle, if someone thought that they were a semi-supervised learning genius, they might claim that they could push the red line up a little bit. john blitzer: actually, you can push it up quite a bit. and i'll show this--
typically you can't push it up-- well, ok. you can push the red line up. yeah. i mean-- it depends how much labeled versus unlabeled data you have. no, no, you're right. we'll show, sort of, adaptation versus semi-supervised learning in the next set of results.
but i don't have numbers for this particular task. one way to point this out is actually-- on kitchen and electronics-- so these are really similar. kitchen appliances are almost all a kind of electronic. so both of them can be defective. a lot of the words are the same. so here, you do get this kind of semi-supervised result, where if you add a lot of unlabeled data, you can
actually do better than the sort of gold standard here-- the red line. but the thing i want to focus on briefly here is the screw up. so somehow we actually did worse using scl than not using the unlabeled data at all. so what happened is basically we somehow managed to misalign features from the two domains. you're learning a representation
from unlabeled data. so there's a lot of variance in book reviews. some of it is whether or not someone's positive or negative. but a lot of it is like, well, this is a christian literature book, and this is a fiction book, and this is a nonfiction book, and this is a self help book. so a lot of things you see-- a lot of these mistakes that you see-- are basically
projections that look reasonable for kitchen appliances, but on books are actually kind of doing topical discrimination. so we thought about how one might go about possibly fixing this. and we said well, what could you do with a minimal amount of labeled data if you were just a guy who wanted to quickly prototype something. what could you do with 50 instances?
so this is 50 versus 2,000. again, we're assuming that we were using the same training procedure as before, so here's what we're going to do. we're going to train on the source data, save the weight vector here-- v sub s-- for the scl features, the low dimensional features. now on new target data, we're going to simply regularize the weight vector to be close to the weight vector we had from the source domain.
so if you look at this as an optimization problem, the first term is just the hinge loss that we had before. the second term, we want to encourage not using the high dimensional features as much as possible. with only 50 instances, it's unlikely that we're going to get a feel for the kind of vocabulary you see in the new domain. but on the other hand, we might be able to learn something reasonable about the low dimensional features.
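the optimization problem on the slide is not reproduced in the transcript, so the following is only a sketch of an objective with the three terms described: a loss on the target instances, a penalty on the high dimensional weights `w`, and a penalty on how far the low dimensional weights `v` move from the source weights `v_source`. the plain hinge loss (instead of the huberized one), the trade-off parameters `lam` and `mu`, the subgradient optimizer, and the random data are all assumptions.

```python
import numpy as np


def target_objective(w, v, X, Z, y, v_source, lam=1.0, mu=1.0):
    """hinge loss + lam * ||w||^2 + mu * ||v - v_source||^2, with Z = X @ phi."""
    margins = y * (X @ w + Z @ v)
    hinge = np.maximum(0.0, 1.0 - margins).sum()
    return hinge + lam * (w @ w) + mu * ((v - v_source) @ (v - v_source))


def fit_target(X, Z, y, v_source, lam=1.0, mu=1.0, lr=0.01, epochs=200):
    """Plain subgradient descent, returning the best iterate seen."""
    w = np.zeros(X.shape[1])
    v = v_source.copy()               # start from the source-domain SCL weights
    best = (target_objective(w, v, X, Z, y, v_source, lam, mu), w.copy(), v.copy())
    for _ in range(epochs):
        active = y * (X @ w + Z @ v) < 1.0     # instances violating the margin
        grad_w = -(y[active][:, None] * X[active]).sum(axis=0) + 2 * lam * w
        grad_v = -(y[active][:, None] * Z[active]).sum(axis=0) + 2 * mu * (v - v_source)
        w = w - lr * grad_w
        v = v - lr * grad_v
        obj = target_objective(w, v, X, Z, y, v_source, lam, mu)
        if obj < best[0]:
            best = (obj, w.copy(), v.copy())
    return best[1], best[2]


# Hypothetical data: 50 labeled target instances, 20 features, 3 SCL dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
phi = rng.normal(size=(20, 3))
Z = X @ phi
y = np.where(Z @ np.array([1.0, 0.5, -0.5]) >= 0, 1.0, -1.0)
v_source = np.array([1.0, 0.0, 0.0])

w_fit, v_fit = fit_target(X, Z, y, v_source)
```

a large `lam` discourages leaning on the high dimensional features, and `mu` controls how strongly the low dimensional weights are held near the source solution.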
so the idea here is that we want to keep the scl weights as close as possible to the source. we believe that it's mostly right. but we want to correct the few things that we did wrong. we're going to do that by trading off basically this first term against this last term. so this technique actually is based on an idea from chelba and acero. chelba is now here at google--
i guess he's in kirkland. i don't really know-- but they actually proposed regularizing on the high dimensional weight vector. so the place we differ from them is that we advocate using the low dimensional features. we think that by having this low dimensional representation, you can get a lot more power out of the small number of labeled instances you have.
so here are the results. basically, the baseline is exactly the technique of chelba and acero. the idea there again is to regularize, based on the high dimensional features that you had. and our technique is this variant where you only try to match the low dimensional features to what you learn in the source domain. and in this case, scl-mi always improves over the baseline for every pair of domains that we have.
so i showed you guys a bunch of results here. and i want to kind of help distill them. so first, even without any labeled target data, we reduced error due to transfer by 36%. and the thing i want to point out is that if you have just 50 instances and use this other technique, basically it doesn't work at all. and that's because 50 instances just isn't enough to help you with such a high dimensional weight vector--
hundreds of thousands or millions of features. but if you have a good low dimensional representation, you can further improve to a 46% relative reduction in error. so the other task that i want to talk about is part of speech tagging. for this task, the data that we're going to look at is quite a bit larger. so we have 1,000,000 labeled words of wall
street journal text. and we're going to add 2,000,000 or 3,000,000 words of unlabeled text from the wall street journal and from medline. and again, the task is to train a tagger in the wall street journal and test it on medline. we're going to use as our supervised learner what i'll call mira crf. basically the idea here is that you want to separate the
best label from the top highest scoring incorrect labels by a margin. and there's a good jmlr paper that describes this that i highly encourage everyone to read. and so the other thing that i want to specify here is what we choose for pivots. so i'm going to focus on sort of the word-by-word representation. so if you look at a three-word window, what we're going to
use for pivots are common left, middle, and right words across domains. audience: also, we're [inaudible] information about this-- john blitzer: actually we have results for that, but i'm not going to show them here. it does slightly better when you use mutual information. now the same visualization of the projection onto a single dimension from phi.
so only in medline, you get words like "receptors," "mutation assays," and "lesions," negative under this projection. only in wall street journal, "company," "transaction," "officials." only in medline that are positive, "metastatic," "neuronal," "transient," and "functional," versus "political," "short-term," "pretty." so what is this projection doing for us? well, it's separating nouns on the negative side from adjectives and determiners on the positive side.
again, the takeaway message is that even if we haven't seen any of these words here on top, we can do a good job at discriminative learning, by using their projection onto this line and the similarity with these other words that we do have lots of labeled data for-- the wall street journal words. so here is the set of results comparing sort of semi-supervised with scl. so the black line here is just train a mira tagger on the
wall street journal. the blue line here is train the semi-supervised method of ando and zhang-- this alternating structure optimization. and the red curve here is scl. so the first thing i want to point out is that if you don't have very much labeled data at all on the wall street journal-- so these are learning curves-- number of wall street journal sentences.
so you don't have very much data at all-- then you can get a nearly 20% reduction in error for part of speech tagging. but even when you have a lot-- so 0.6 is another baseline that i just wanted to throw out there-- this is adwait ratnaparkhi's part of speech tagger. and it's sort of the standard out of the box tagger that one would use if you wanted to work on this problem. so here actually, you can still significantly improve
over all these methods. and one interesting thing is for unknown words, where you've never seen this word before and this is kind of what we've designed for, you can actually get more than 20% improvement, even with 1,000,000 words of labeled wall street journal text. again, so there are other methods for incorporating labeled data in this setting. in particular, for these kind of what are called structured
problems, where you have a label that's more complex, you can potentially use the output of a tagger trained in the source domain as a feature for the target domain. so this was advocated by florian et al. at naacl a couple, i guess three or four years ago. the idea here is that we want to just-- so how are we going to compare scl with a normal supervised tagger? we just want to train one of these taggers in the source domain and use it as a feature in the target domain, and see
how much improvement we can get. ok, so looking at that, this thick black line here is not using any target data at all, so we just ignore this data. that's why the curve doesn't go up. we just train the source tagger. this dotted line is what happens when you use a small number of in-domain training instances. and here what i'm showing, the blue line is the supervised tagger and the red line is scl, combining it with the
target data, using the same idea of florian et al. again, so for this side of the graph, you can get a nearly 40% relative reduction in error. so it's from something like 86 to 91 by using scl, and together with this trick of using features of the source tagger. but even for a large amount, notice here, actually using the source data doesn't help you at all, versus not using it.
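The "relative reduction in error" quoted above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the rounded accuracies mentioned in the talk (about 86 and about 91); the function name is hypothetical, and the exact figures behind the plotted curves are not given in the transcript.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the baseline error that is removed, given accuracies in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    if err_before <= 0:
        raise ValueError("baseline has no error left to reduce")
    return (err_before - err_after) / err_before


# Rounded accuracies quoted in the talk: error drops from 14 points to 9 points.
print(round(relative_error_reduction(86.0, 91.0), 3))  # 0.357
```

With these rounded inputs the reduction comes out near 36%; the "nearly 40%" in the talk presumably reflects the unrounded numbers on the slide.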
but once you train the source tagger with scl, even with a fairly large number of medline training sentences, you can still get a significant improvement. so i want to end on a somewhat speculative note for various people in this audience that maybe we can talk about afterward. so first is machine translation. i know there are a lot of people working on this here. the basic scenario i'm envisioning is where you have
some domain-specific parallel text-- like news or legal text or un transcripts, what have you-- but what you want is to actually do translation in a very different domain-- like let's say blogs. and now you may have actually lots of similar corpora. people write blogs in chinese. people write blogs in english, but they don't often translate them. and you want to exploit all this unlabeled data.
so i could envision exploiting it in two ways. and we can talk about the specifics of that offline, but basically you can obviously adapt a language model and people have worked on this. but you could also conceive of learning new translation rules, based on similar contexts to the source data that you have. the other obvious problem for this audience is search ranking.
i can kind of envision a scenario where you have a query and a list of top-ranked documents already. what you want to do is rerank them, based on some features of the documents and the query. so and you may have some labeled data either in the form of editorial data or in the form of click through data. and the adaptation here is well, you might have very different markets.
so in particular, you may have lots of good editorial data for english, but in indonesia, you have barely any. and yet you still want to be able to explore this in a reasonable way. and the pivots i'm envisioning here are just common relevant features across the different models. you may have features that are relevant in both domains and a bunch of other features that you don't really know about. you want to somehow align these features.
and finally, for those of you who are more interested in learning theory, we have some work on that as well that i would be happy to talk about. so the idea here is that you have a model that you've trained on one particular distribution, and you're going to test it on another. can we prove learning bounds, bounds on the error of that model in a new domain? and we have a couple of papers on that.
ok, thanks. [applause] audience: i was just curious, is it a [? deal ?], this structural correspondence learning, as basically trying to pick up the pivot features, and then cast the features that you don't see [? the trending ?] data into those features first. and do a [? low-dimensional ?] reduction and then train your model on that. there's a learning times.
you have sort of two sets of features-- one for the original things, one more for these pivot-related features. do you still need the first parts? john blitzer: yeah. that's a good question. actually you do need it, in the sense that it improves results. so the reason for that is that there are some things which
doing unsupervised learning you just aren't going to model well. and you need to kind of pick up the slack there. so we have done experiments using just the low dimensional and basically that kind of does in between the two. so wherever the results i showed you are, if you use just the low dimensional representation, it's usually a little better than using the high dimensional, but not as good as using both.
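The pipeline discussed in this exchange (predict pivots from the other features, reduce the predictor weights with an SVD, then train on the original features plus the low-dimensional projection) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used for the results in the talk: the pivot predictors here are fit by ridge regression purely for brevity, and the function names and the `ridge` parameter are hypothetical.

```python
import numpy as np


def scl_projection(X_unlabeled, pivot_idx, k, ridge=1.0):
    """Learn a k x d projection from unlabeled data pooled from both domains.

    X_unlabeled: n x d feature matrix (counts or 0/1 indicators).
    pivot_idx:   column indices of the pivot features.
    """
    X = np.asarray(X_unlabeled, dtype=float)
    pivot_idx = np.asarray(pivot_idx)
    d = X.shape[1]
    if k > min(d, len(pivot_idx)):
        raise ValueError("k cannot exceed the number of pivots or features")

    # Hide the pivots from the input so each predictor must rely on the
    # non-pivot features that co-occur with its pivot.
    X_masked = X.copy()
    X_masked[:, pivot_idx] = 0.0

    # One binary target per pivot: +1 if the pivot occurs, -1 otherwise.
    Y = np.where(X[:, pivot_idx] > 0, 1.0, -1.0)

    # Assumption: ridge regression stands in for the per-pivot linear
    # predictors; all pivots are solved at once. W has shape d x m.
    A = X_masked.T @ X_masked + ridge * np.eye(d)
    W = np.linalg.solve(A, X_masked.T @ Y)

    # The top-k left singular vectors of W give the shared low-dimensional space.
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :k].T


def augment(X, theta):
    """Keep the original features and append the k projected features."""
    X = np.asarray(X, dtype=float)
    return np.hstack([X, X @ theta.T])
```

Because each pivot's predictor is independent of the others, the fitting step could be split across machines. A supervised model trained on `augment(X_source, theta)` corresponds to the "using both" setting described above; training on `X_source @ theta.T` alone corresponds to the low-dimensional-only setting.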
audience: i'm thinking about the domain of book reviews. i'm wondering about the significance of two sources of noise in sentiment that i can imagine. one is irony, where somebody uses irony in their review, leading to a sense of their sentiment might deflect from what it appears from the learning analysis. the other one is the books themselves are often criticisms or praising of various topics. so discussing the book, independent of your sentiment
of the book, you might be using criticism or praise words in discussing the content of the book. john blitzer: yes. so first, these are problems for sentiment analysis in general, independent of the particular method that you use, but i absolutely agree with you. these do come up. in our experience, it's quite rare. but you have to understand that someone going to amazon,
and writing i lit the book on fire is not in the new york times review of books. they're not going to express sentiment in quite the same way. but you're right that it is a potential problem. and i think this is one reason why you don't see numbers as high as you would expect for binary text classification. so after 2,000 instances, if you look at-- i don't know-- some other text classification problem-- the
reuters task. you're almost always well above 90. here you're not going to see that because there is quite a bit of noise, and you're right, that's something you have to deal with. sam? audience: so in the feature selection, that seems like selecting the pivots is very important here. audience: so when you're given mutual information, is it that
you have to have in your own domain high mutual information with the class, or do you also have to have high mutual information? do you understand my question? john blitzer: yes, i understand your question. so i didn't say that, but you're absolutely right. the assumption is that mutual information with a label is the same across domains. so you can violate this a little bit, but in cases where
you violate it a lot, you're basically screwed. so you can envision instances where there's some adversary that basically takes the best linear predictor in your source domain and flips all the weights. then there's just nothing you can do from that unlabeled data. in practice-- it seems like in all the problems i've looked at, both these and a few others, you don't see this kind of
flipping a lot. so the mutual information assumption seems reasonable. audience: do you have things that commonly code for inductamanes, and you filter by mutual information again in both domains? john blitzer: well no. we don't have mutual information in the target. we just assume that by filtering in the source, and using the commonality, you also got features that are
highly informative in the target. yes. prakash? audience: you use that [inaudible] john blitzer: nothing. so one is limited by the amount of data we can actually get access to. obviously you guys have much more than we do on our computers.
but there is a larger version of the wall street journal which has 30,000,000 words, which we are using for a separate entity recognition task. in terms of scaling, there's no reason why this method can't scale up. in fact, training the predictors is of course completely parallelizable. you could do that fairly easily. yes?
audience: i was wondering how this would apply to text that does not work well with part of speech taggers, which [inaudible] a lot of comments are very short, maybe not grammatically correct ones? john blitzer: so again, the underlying assumption here is that there is a sort of single, good model for all these different domains. so it doesn't have to be a model that you can train in
one, but the idea is that if i had a lot of-- i don't know-- blog comments to train on, and i had a lot of wall street journal text to train on, then i could train a single good tagger that was good at both. if i can't-- if for some reason, you know, when i say the word dog in a blog, it's an adjective or a verb. i mean there's slang usages that probably would be particularly difficult to get.
that would be problematic. but i think i would love to try this out on more widely varying domains. and actually, the medline is pretty different. i mean, there are really some things where i look at it, and i look at it myself and say i have no idea how to tag this. i have no idea what these people are talking about. audience: did you try it? the approach of training in one domain?
then you'll see if the classifier of the other domain taking the ones from each express a strong reference, and then retraining on the [? assessed ?]. john blitzer: so yes we did. so this approach is often called self training. it can work very well sometimes. it can break horribly sometimes. so i think this is one problem with the sort of bootstrapping approaches to semi-supervised learning, or
to using unlabeled data. it's that you can potentially introduce a lot of noise, and ruin the model you had before. so in our experience, there are some tasks for which it works really well, and some for which it actually works quite a bit worse than a model that used only labeled data. there is a paper at this year's acl by [? chun ?] [? hsiung ?]
[? jai ?] and one of his students, that basically study a bunch of these other approaches. and they look at self training as well. go look at that for other references to self training. audience: so one question is can this adaptation be thought of as learning a translation-length table from domain one to domain two for the important points? john blitzer: i mean, at a high level, it's possible. i don't want to commit to that in part because i'm not an mt
guy and i don't really know. but in part, because i think i mean part of the power of this method is that a lot of these features are bad. a lot of the projections, phi, are actually noisy, but by having a good discriminative model to train on, you can overcome that. so it's sort of like if you had a hard alignment, it's almost like doing hard clustering. you throw out a lot of information you had, and you
could potentially have used that information to recover a good discriminative model. i do think that having actually this sort of soft projection onto a low dimensional real-valued space is very important. peng? audience: i was wondering so for this selection of [? yurts ?], you have two factors basically. you have one, is it a good predictor for the final task?
and you have one of them which is kind of [inaudible] deletes appear frequently, which would mean that it might be also a good predictor for the second task. so i was wondering if to measure is a term a good predictor, if you have a linear update [? fail, ?] you have actually the derivative, that is, if you said the effect of win-win the word in the way it is said. so could you use that instead? john blitzer: so there are lots of ways of doing feature selection, and
mutual information is by no means the best one. i don't actually want to commit to saying i love mutual information for feature selection. actually i would be interested in looking at either that criterion, or i mean obviously l1 regularization is a good criterion for doing feature selection. any of these i think have the potential to be even better by accounting for common dependencies among the features.
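The pivot selection criterion discussed in this part of the q&a (features frequent in both domains, ranked by mutual information with the source label, since the target has no labels) can be sketched as below. This is a minimal sketch assuming binary labels and presence/absence features; the function names and the `min_count` and `num_pivots` parameters are hypothetical.

```python
import numpy as np


def mutual_information(feature_present, labels):
    """Mutual information (in nats) between a binary feature and a binary label."""
    f = np.asarray(feature_present).astype(bool)
    y = np.asarray(labels).astype(bool)
    mi = 0.0
    for fv in (False, True):
        for yv in (False, True):
            p_joint = np.mean((f == fv) & (y == yv))
            if p_joint > 0:
                p_f = np.mean(f == fv)
                p_y = np.mean(y == yv)
                mi += p_joint * np.log(p_joint / (p_f * p_y))
    return mi


def select_pivots(X_src, y_src, X_tgt, min_count, num_pivots):
    """Pick features frequent in both domains, ranked by source-label MI."""
    X_src = np.asarray(X_src) > 0
    X_tgt = np.asarray(X_tgt) > 0

    # Frequency filter: a pivot must occur often enough in BOTH domains so
    # that its predictor can be estimated from unlabeled data.
    frequent = (X_src.sum(axis=0) >= min_count) & (X_tgt.sum(axis=0) >= min_count)
    candidates = np.flatnonzero(frequent)

    # Mutual information uses source labels only.
    scores = np.array([mutual_information(X_src[:, j], y_src) for j in candidates])
    order = np.argsort(-scores, kind="stable")
    return candidates[order[:num_pivots]]
```

Swapping `mutual_information` for another scoring function (for example, weights from an l1-regularized model, as suggested in the answer above) would leave the rest of the selection unchanged.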
audience: have you tried that? john blitzer: no. i haven't tried that, haven't tried that. audience: i have actually two questions-- one is, how important is the choice of pivot features? for example, do you have any [inaudible] or parameters to control the size of features? john blitzer: so, the actual identity of the pivot features, the number of pivot features, is actually not so important.
so there are basically two criteria that you need to kind of trade off. one is that you need enough pivot features to basically characterize the kinds of space you're in, in terms of training these predictors. the other is that each individual predictor-- you need to get good statistics for it on the unlabeled data. if you go to include all the features, and you have maybe
some features which only occur once or twice, even in a huge amount of unlabeled data, you can envision ten grams or something, in that case, it's hard to actually train a good predictor that generalizes well. so this is why we choose only the frequent ones. audience: if you have many-- i can change the direction of [inaudible] because you can't like one-- that error's weighted the same each weight. john blitzer: the error's weighted the same, but
actually this is exceptionally important, because you can get a much higher margin if you have fewer instances to deal with. so if you think about the kinds of loss-- higher margin basically corresponds to a smaller weight vector. so if you can train something and the magnitude of the weight vector actually does appear in the svd. audience: so the second one is, if you use the adapted
model to the original domain, how much worse do you do? or do you do worse? john blitzer: you typically don't do worse. and the reason for this is for a couple reasons. one is you have all the features from the original domain. sometimes you can do better, if the domains are similar enough. you have all the features from the original domain to--
if you happen to learn a representation that's not so good, you can recover that because you have the original feature space. the other reason though is that if these domains are so high dimensional, the feature space is so high dimensional, that actually for text there usually is one good model. there typically is one reasonable model for sentiment. of course, there are these words that irony or words that
switch polarity in domains-- something like predictable is bad for books, but actually your kitchen appliances, you want them to be predictable. but there aren't that many of those kinds of words. typically you can do as well as or better in-- audience: do you have a way of detecting that kind of word and throwing it out? john blitzer: i mean, there are heuristics you can use, but none of them are really good.
you can sort of say oh well, predictable in kitchen appliances co-occurs with a lot of words that seem to have positive sentiment, but in books it's negative. but i haven't found one good heuristic that seems to work well all the time. audience: do we need to apply the svd before you try to train the combined model? what if you didn't do svd? because you can basically put some regularization
on top of it and-- john blitzer: you can. so empirically, it doesn't work as well. and i don't have any really good theoretical justification of that yet. i actually hypothesize that the reason the svd is important is that you actually don't really care about predicting the pivots themselves. in fact, that can cause you to kind of
overfit to the pivots. what you want really is to be able to sort of predict somewhat nebulous, general positive or negative sentiment. so there really is a low dimensional, underlying representation that you want. that's why i think the svd is important. how to characterize that formally is [? get to work ?]. audience: what [inaudible] is a sort of performance
difference we're looking at? like in terms of a 46 reduction? john blitzer: instead of 46, i actually don't know the across the board, but it tends to do again about half as well. so you still get an improvement, but it's not as big. it's right in between the two. i didn't do sort of the extensive averaging or cross validation that i did for these actual numbers that i'm
showing here. audience: thank you. john blitzer: thanks.
