迁移学习理论与应用 Transfer Learning: An Overview
杨强 (Qiang Yang), HKUST
Thanks: Sinno Jialin Pan (NTU, Singapore), Ying Wei (HKUST, Hong Kong), Ben Tan (HKUST, Hong Kong)


A psychological point of view

• Transfer of Learning (学习迁移) in Education and Psychology
– The study of the dependency of human conduct, learning, or performance on prior experience.
– [Thorndike and Woodworth, 1901] explored how individuals transfer learning from one context to another context that shares similar characteristics.
• E.g., C++ → Java; Math/Physics → Computer Science/Economics

2

Transfer Learning in the machine learning community

•  The ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel tasks/domains, which share some commonality.

• Given a new (target) domain/task, how can we transfer knowledge from previous (source) domains/tasks to it?

• Key: representation learning, change of representation

3

Why Transfer?
• Build every model from scratch?
– Time-consuming and expensive:
  • Data collection/labeling
  • Privacy
  • Time to train
• Reuse common knowledge extracted from existing systems?
– More practical

4

Why Transfer Learning?

5

[Diagram] Labeled source-domain data are used to train predictive models; unlabeled data (or a few labeled examples) from the target domain are fed to transfer learning algorithms for adaptation, and the adapted model is then tested on target-domain data. Example domain pairs: Electronics vs. DVD reviews, Time Period A vs. Time Period B, Device A vs. Device B.

Transfer Learning in Different Fields

• Transfer learning for reinforcement learning
[Taylor and Stone, Transfer Learning for Reinforcement Learning Domains: A Survey, JMLR 2009]

• Transfer learning for classification and regression problems
[Pan and Yang, A Survey on Transfer Learning, IEEE TKDE 2010]

6

Focus!

Motivating Example I: Indoor WiFi Localization

7

[Diagram: WiFi received-signal-strength readings at different locations, e.g., -30dBm, -70dBm, -40dBm.]

Indoor WiFi Localization (cont.)

8

[Diagram] A localization model is trained on labeled signal data collected from Device A, e.g.,
S = (-37dBm, ..., -77dBm), L = (1,3); S = (-41dBm, ..., -83dBm), L = (1,4); ...; S = (-61dBm, ..., -28dBm), L = (15,22),
and then tested on unlabeled signal vectors.
Test on Device A: average error distance ~1.5 meters.
Test on Device B: average error distance ~10 meters, a large drop in accuracy.

Difference between Domains

9

[Figure: the signal distributions differ between Time Period A and Time Period B, and between Device A and Device B.]

Motivating Example II: Sentiment Classification

10

Sentiment Classification (cont.)

11

A sentiment classifier is trained on labeled Electronics reviews.
Test on Electronics reviews: classification accuracy ~84.6%.
Test on DVD reviews: classification accuracy ~72.65%, a clear drop.

Difference in Representation

12

Electronics reviews:
(1) Compact; easy to operate; very good picture quality; looks sharp!
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
(5) It is also quite blurry in very dark settings. I will never buy HP again.

Video Games reviews:
(2) A very good game! It is action packed and full of excitement. I am very much hooked on this game.
(4) Very realistic shooting action and good plots. We played this and were hooked.
(6) The game is so boring. I am extremely unhappy and will probably never buy UbiSoft again.

A Major Assumption in Traditional Machine Learning

• Training and future (test) data come from the same domain, which implies:

– they are represented in the same feature space;

– they follow the same data distribution.

13

Machine Learning: Yesterday, Today and Tomorrow

14

Deep Learning: Features
Reinforcement Learning: Rewards
Transfer Learning: Adaptation
(Yesterday → Today → Tomorrow)

Machine Learning: Yesterday, Today and Tomorrow

15

Deep Learning: lots of data, only the rich
Reinforcement Learning: lots of data, only the rich
Transfer Learning: few data, everyone
(Yesterday → Today → Tomorrow)

Different Scenarios
• Training and testing data may come from different domains:
– different feature spaces or different marginal distributions, P_S(X) ≠ P_T(X);
– different conditional distributions, P_S(Y|X) ≠ P_T(Y|X), or different label spaces.

16

Transfer Learning Approaches

17

Instance-based Approaches

Feature-based Approaches

Parameter/Model -based Approaches

Relational Approaches

Instance-based Transfer Learning Approaches

General assumption: the source and target domains have a lot of overlapping features.

18

Instance-based Transfer Learning Approaches

Two problem settings, each with its own assumption:
Case I: no labeled data in the target domain.
Case II: some labeled data in the target domain.

19

Instance-based Approaches: Case I

Given a target task, the goal is to learn a model that performs well under the target distribution, even though the labeled training data come from the source distribution.

20

Instance-based Approaches: Case I (cont.)

Assumption: the source and target domains share the same conditional distribution, P_S(y|x) = P_T(y|x), but differ in their marginal distributions, P_S(x) ≠ P_T(x).

21

Instance-based Approaches: Case I (cont.)

Under this assumption, the expected target-domain loss equals a re-weighted source-domain loss, with instance weights β(x) = P_T(x)/P_S(x).

22

Correcting Sample Selection Bias / Covariate Shift [Quionero-Candela et al., Dataset Shift in Machine Learning, MIT Press 2009]

Instance-based Approaches: Correcting sample selection bias (cont.)

• The distribution of the selector variable maps the target onto the source distribution.

23

• Label instances from the source domain with 1.
• Label instances from the target domain with 0.
• Train a binary (domain) classifier; its predicted probabilities yield the instance weights.

[Zadrozny, ICML-04]
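To make the idea concrete, here is a minimal Python sketch of this classifier-based weighting (my own illustration, not from the slides); the logistic-regression domain classifier is an assumption, and any probabilistic classifier would do.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def domain_classifier_weights(X_src, X_tgt):
    """Classifier-based instance weighting in the spirit of [Zadrozny, ICML-04]:
    label source points 1 and target points 0, train a probabilistic domain
    classifier, and weight each source instance by the estimated odds
    p(target | x) / p(source | x), which tracks the density ratio P_T(x)/P_S(x)
    up to a constant depending on the sample sizes."""
    X = np.vstack([X_src, X_tgt])
    d = np.concatenate([np.ones(len(X_src)), np.zeros(len(X_tgt))])
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_src = clf.predict_proba(X_src)[:, 1]        # p(domain = source | x)
    return (1.0 - p_src) / np.clip(p_src, 1e-6, None)

# The returned weights can be passed as sample_weight when training the final
# source-domain model, so that it focuses on target-like instances.
```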

Instance-based Approaches Kernel mean matching (KMM)

Maximum Mean Discrepancy (MMD) [Alex Smola, Arthur Gretton and Kenji Fukumizu, ICML-08 tutorial]: MMD compares two samples through the distance between their mean feature embeddings, MMD(X_S, X_T) = || (1/n_S) Σ_i φ(x_i^S) - (1/n_T) Σ_j φ(x_j^T) ||_H. KMM chooses instance weights β_i on the source sample so that the re-weighted source mean embedding matches the target mean embedding.

24
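As a concrete reference for the MMD criterion that KMM minimizes, here is a small sketch of the empirical (biased) MMD estimate with an RBF kernel; the bandwidth gamma is an illustrative choice, not a value from the slides.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd_rbf(X_src, X_tgt, gamma=1.0):
    """Empirical (biased) squared MMD between two samples under an RBF kernel:
    the squared distance between the source and target mean feature embeddings."""
    k_ss = rbf_kernel(X_src, X_src, gamma=gamma).mean()
    k_tt = rbf_kernel(X_tgt, X_tgt, gamma=gamma).mean()
    k_st = rbf_kernel(X_src, X_tgt, gamma=gamma).mean()
    return k_ss + k_tt - 2.0 * k_st

# KMM chooses per-instance weights on the source sample so that the weighted
# source mean embedding moves as close as possible to the target mean embedding.
```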

Instance-based Approaches Direct density ratio estimation

25

[Sugiyama et al., NIPS-07; Kanamori et al., JMLR-09]

KL-divergence loss [Sugiyama et al., NIPS-07]; least-squares loss [Kanamori et al., JMLR-09]

Instance-based Approaches Case II

• Intuition: part of the labeled data in the source domain can be reused in the target domain after re-weighting.

26

Instance-based Approaches Case II (cont.)

• TrAdaBoost [Dai et al., ICML-07]: for each boosting iteration,

– use the same strategy as AdaBoost to update the weights of the target-domain data;

– use a new mechanism to decrease the weights of misclassified source-domain data.

27
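A minimal sketch of the TrAdaBoost weight update described above; variable names and the way the weak learner's errors are passed in are my own assumptions for illustration.

```python
import numpy as np

def tradaboost_update(w_src, w_tgt, miss_src, miss_tgt, eps_t, n_rounds):
    """One weight update in the spirit of TrAdaBoost [Dai et al., ICML-07].
    w_src, w_tgt : current instance weights for source / target data
    miss_src/tgt : boolean arrays, True where the weak learner erred
    eps_t        : weighted error of the weak learner on the target data
    n_rounds     : total number of boosting iterations N"""
    beta_t = eps_t / (1.0 - eps_t)                            # AdaBoost factor (target)
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(len(w_src)) / n_rounds))
    # Target domain: increase the weights of misclassified instances (AdaBoost rule).
    w_tgt_new = w_tgt * np.power(beta_t, -miss_tgt.astype(float))
    # Source domain: decrease the weights of misclassified instances, so source
    # data that conflict with the target gradually lose influence.
    w_src_new = w_src * np.power(beta, miss_src.astype(float))
    return w_src_new, w_tgt_new
```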

Feature-based Transfer Learning Approaches

When the source and target domains only share some overlapping features (many features have support only in the source domain or only in the target domain).

28

Feature-based Transfer Learning Approaches (cont.)

How to learn the transformation?
• Solution 1: encode application-specific knowledge to learn the transformation.

• Solution 2: apply general approaches to learning the transformation.

29

Feature-based Approaches Encode application-specific knowledge

30

Electronics reviews:
(1) Compact; easy to operate; very good picture quality; looks sharp!
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
(5) It is also quite blurry in very dark settings. I will never_buy HP again.

Video Games reviews:
(2) A very good game! It is action packed and full of excitement. I am very much hooked on this game.
(4) Very realistic shooting action and good plots. We played this and were hooked.
(6) The game is so boring. I am extremely unhappy and will probably never_buy UbiSoft again.

Feature-based Approaches Encode application-specific knowledge (cont.)

31

Electronics (training data):
compact  sharp  blurry  hooked  realistic  boring
1 1 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0

Learned classifier: y = f(x) = sgn(w · x), with w = [1, 1, -1, 0, 0, 0]^T

Video Game (prediction data):
compact  sharp  blurry  hooked  realistic  boring
0 0 0 1 0 0
0 0 0 1 1 0
0 0 0 0 0 1

The classifier trained on Electronics puts zero weight on the Video Game-specific features, so its predictions on the Video Game domain are uninformative.

Feature-based Approaches Encode application-specific knowledge (cont.)

32

Electronics reviews:
(1) Compact; easy to operate; very good picture quality; looks sharp!
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
(5) It is also quite blurry in very dark settings. I will never_buy HP again.

Video Games reviews:
(2) A very good game! It is action packed and full of excitement. I am very much hooked on this game.
(4) Very realistic shooting action and good plots. We played this and were hooked.
(6) The game is so boring. I am extremely unhappy and will probably never_buy UbiSoft again.

Feature-based Approaches Encode application-specific knowledge (cont.)

• Three different types of features:
– source-domain (Electronics) specific features, e.g., compact, sharp, blurry;
– target-domain (Video Game) specific features, e.g., hooked, realistic, boring;
– domain-independent features (pivot features), e.g., good, excited, nice, never_buy.

33

Feature-based Approaches Encode application-specific knowledge (cont.)

• How to identify pivot features?
– Term frequency in both domains
– Mutual information between features and labels (source domain)
– Mutual information between features and domains

• How to utilize pivots to align features across domains?
– Structural Correspondence Learning (SCL) [Blitzer et al., EMNLP-06]
– Spectral Feature Alignment (SFA) [Pan et al., WWW-10]

34

Feature-based Approaches Spectral Feature Alignment (SFA)

• Intuition:
– Use a bipartite graph to model the correlations between pivot features and other (domain-specific) features.
– Discover new shared features by applying spectral clustering techniques to the graph.

35

• If two domain-specific words have connections to many common pivot words in the graph, they tend to be aligned or clustered together with a higher probability.
• If two pivot words have connections to many common domain-specific words, they tend to be aligned together with a higher probability.

Spectral Feature Alignment (SFA) High level idea

36

[Figure: a bipartite graph between pivot features (exciting, good, never_buy) and domain-specific features (sharp, compact, blurry from Electronics; hooked, realistic, boring from Video Game), with edge weights given by co-occurrence counts. Applying spectral clustering to this graph derives new features by aligning domain-specific words into clusters such as {sharp, hooked}, {compact, realistic}, and {blurry, boring}.]

37

Spectral Feature Alignment (SFA) Derive new features (cont.)

Electronics (training data, in the aligned feature space):
sharp/hooked  compact/realistic  blurry/boring
1 1 0
1 0 0
0 0 1

Learned classifier: y = f(x) = sgn(w · x), with w = [1, 1, -1]^T

Video Game (prediction data, in the aligned feature space):
sharp/hooked  compact/realistic  blurry/boring
1 0 0
1 1 0
0 0 1

Because the aligned features are shared across domains, the classifier trained on Electronics now transfers to the Video Game domain.

Spectral Feature Alignment (SFA):
1. Identify P pivot features.
2. Construct a bipartite graph between the pivot features and the remaining features.
3. Apply spectral clustering on the graph to derive new features.
4. Train classifiers on the source domain using the augmented features (original features + new features).

39
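The following is a rough numerical sketch of steps 2-3 (not the authors' implementation); it assumes a pre-computed co-occurrence count matrix between domain-specific words and pivot words built from the combined source and target corpora.

```python
import numpy as np

def sfa_alignment(cooc, k=3):
    """Spectral Feature Alignment sketch: build the bipartite graph between
    domain-specific words and pivot words, then take the top-k eigenvectors
    of the normalized adjacency as the derived (aligned) features.
    cooc : (n_specific x n_pivot) co-occurrence counts."""
    n_spec, n_piv = cooc.shape
    n = n_spec + n_piv
    A = np.zeros((n, n))
    A[:n_spec, n_spec:] = cooc             # bipartite adjacency
    A[n_spec:, :n_spec] = cooc.T
    d = A.sum(axis=1)
    d[d == 0] = 1.0
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A @ D_inv_sqrt   # symmetric normalization
    vals, vecs = np.linalg.eigh(A_norm)
    U = vecs[:, -k:]                       # spectral embedding (soft clusters)
    return U[:n_spec]                      # alignment mapping for domain-specific words

# Usage: if X_spec holds each document's counts over the domain-specific words,
# the augmented representation is np.hstack([X_original, X_spec @ sfa_alignment(cooc)]).
```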

Feature-based Approaches Develop general approaches

40

[Figure: domains that differ by time period (A vs. B) and by device (A vs. B), motivating general, application-agnostic transformations.]

Feature-based Approaches Transfer Component Analysis [Pan etal., IJCAI-09, TNN-11]

41

Motivation: [Figure] the source and target WiFi data are generated from shared latent factors such as temperature, signal properties, building structure, and the power of access points (APs).

Transfer Component Analysis (cont.)

42

[Figure, cont.] Some of these latent factors differ across domains, which causes the data distributions of the two domains to be different.

Transfer Component Analysis (cont.)

43

[Figure, cont.] Other latent factors, such as signal properties and building structure, are stable across domains; TCA aims to recover these principal components while discarding the noisy, domain-varying components.

Transfer Component Analysis (cont.): learning a mapping by only minimizing the distance between the two distributions is not enough, because such a mapping may destroy the original data structure.

44

Transfer Component Analysis (cont.)
Main idea: the learned mapping should project the source- and target-domain data into a latent space spanned by factors that both reduce the domain difference and preserve the original data structure.

45

High-level optimization problem: minimize the distance between the projected source and target distributions plus a regularizer on the mapping, subject to preserving the variance (structure) of the projected data.

Transfer Component Analysis (cont.)

46

Recall: Maximum Mean Discrepancy (MMD)
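To connect the pieces, here is a compact numerical sketch of TCA built on the kernelized MMD just recalled; the RBF kernel and the hyperparameters mu and gamma are illustrative assumptions, following the standard parametric-kernel-map formulation of [Pan et al., IJCAI-09].

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.metrics.pairwise import rbf_kernel

def tca(X_src, X_tgt, n_components=2, mu=1.0, gamma=1.0):
    """Transfer Component Analysis sketch: find components that keep the MMD
    between the projected source and target data small while preserving data
    variance. Returns the projected source and target samples."""
    X = np.vstack([X_src, X_tgt])
    ns, nt = len(X_src), len(X_tgt)
    n = ns + nt
    K = rbf_kernel(X, X, gamma=gamma)
    e = np.vstack([np.full((ns, 1), 1.0 / ns), np.full((nt, 1), -1.0 / nt)])
    L = e @ e.T                                    # MMD matrix: tr(K L) = squared MMD
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    # Generalized eigenproblem: maximize variance subject to small MMD + regularizer.
    vals, vecs = eigh(K @ H @ K, K @ L @ K + mu * np.eye(n))
    W = vecs[:, np.argsort(-vals)[:n_components]]
    Z = K @ W                                      # embeddings of all n samples
    return Z[:ns], Z[ns:]
```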

Transfer Component Analysis (cont.)

47

An illustrative example: latent features learned by PCA and TCA (panels: original feature space, PCA, TCA).

Feature-based Approaches: Self-taught Feature Learning (Andrew Ng et al.)

• Intuition: useful higher-level features can be learned from unlabeled data.

• Steps:
1) Learn higher-level features from a large amount of unlabeled data.
2) Use the learned higher-level features to represent the data of the target task.
3) Train models on the new representations of the target task (supervised).

• How to learn higher-level features:
– Sparse coding [Raina et al., 2007]
– Deep learning [Glorot et al., 2011]

48

Feature-based Approaches: Multi-task Feature Learning

• Assumption: if tasks are related, they should share some good common features.

• Goal: learn a low-dimensional representation shared across related tasks.

49

General Multi-task Learning Setting

Multi-task Learning Assumption: if tasks are related, they may share similar parameter vectors. For example [Evgeniou and Pontil, KDD-04], the parameter vector of task t is decomposed as w_t = w_0 + v_t, where w_0 is the common part shared by all tasks and v_t is the specific part for the individual task.

50
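A small gradient-descent sketch of this shared-plus-specific decomposition (squared loss, plain ridge penalties); the solver and hyperparameters are illustrative assumptions rather than the KDD-04 algorithm itself.

```python
import numpy as np

def multitask_shared_specific(X_tasks, y_tasks, lam0=0.1, lam=1.0,
                              n_iters=200, lr=0.01):
    """Regularized multi-task regression with w_t = w0 + v_t:
    w0 is shared across tasks, v_t is task-specific, and both are penalized."""
    d = X_tasks[0].shape[1]
    T = len(X_tasks)
    w0 = np.zeros(d)
    V = np.zeros((T, d))
    for _ in range(n_iters):
        grad_w0 = lam0 * w0
        for t, (X, y) in enumerate(zip(X_tasks, y_tasks)):
            resid = X @ (w0 + V[t]) - y
            g = X.T @ resid / len(y)
            grad_w0 += g
            V[t] -= lr * (g + lam * V[t])          # update the task-specific part
        w0 -= lr * grad_w0                         # update the common part
    return w0, V
```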

Multi-task Feature Learning

51

[Argyriou et al., NIPS-07]; [Ando and Zhang, JMLR-05]; [Ji et al., KDD-08]

Deep Learning in Transfer Learning

52

Transfer Learning with Deep Learning
Transfer learning perspective: why do we need deep learning?

• Deep neural networks learn nonlinear representations
– that are hierarchical;
– that disentangle the different explanatory factors of variation behind the data samples;
– that manifest invariant factors underlying different populations.

Deep learning perspective: why do we need transfer learning?

• Transfer learning alleviates
– the inability to learn from a dataset that may not be large enough to train an entire deep neural network from scratch.

Benchmark Dataset: Office
• Description: leverage source images to improve the classification of target images.
• 3 domains, 31 categories (e.g., backpack, bike):
– amazon: 2,817 object images from Amazon;
– webcam: 957 low-resolution images taken with a web camera;
– dslr: 795 high-resolution images taken with a digital SLR camera.

Results Unsupervised domain adaptation Amazon→Webcam over time

[Figure: multi-class accuracy on the Amazon→Webcam task over time (2011-2015), for TL without DL (TCA, GFK), DL without TL (CNN), and TL with DL (DLID, DDC, DAN, BA, DASH-N [1], fine-tuning [2,3], SCNN [4]).]

With deep learning, transfer learning improves. Applying transfer learning techniques outperforms directly applying deep learning models trained only on the source.

Overview
[Taxonomy] Deep transfer learning methods can be organized by two questions: Are there labelled target data (supervised vs. unsupervised)? Are the dimensions of the source and target domains equal (single modality vs. multiple modalities)?
– Single modality: Glorot et al. [5], DLID [6], DDC [7], DAN [8], BA [9] (unsupervised); DASH-N [1], fine-tuning [2,3], SCNN [4]
– Multiple modalities: SHL-MDNN [10], ST [11], DBN [12], DBM [13], MDRNN [14]
(Full citations for [1]-[14] are given in the References slide.)

SingleModality

• Directly apply the model parameters (deep neural network weights) from the source domain to the target domain.
[Figure: a source-domain network (input → output) and a target-domain network (input → output) with shared weights.]
Are the features transferrable?

Single Modality • Transferability of layer-wise features

ImageNet (1000 classes) is randomly split into A (500 classes, the source domain) and B (500 classes, the target domain).

baseA: train all layers with A. baseB: train all layers with B.

BnB: initialize the first n layers with baseB and fix them; randomly initialize the other layers and train with B.
BnB+: initialize the first n layers with baseB, randomly initialize the other layers, and train all layers with B.

AnB: initialize the first n layers with baseA and fix them; randomly initialize the other layers and train with B.
AnB+: initialize the first n layers with baseA, randomly initialize the other layers, and train all layers with B.

Single Modality • Transferability of layer-wise features (cont.)

[3]

Conclusion 1: lower-layer features are more general and transferrable; higher-layer features are more specific and less transferrable.
Conclusion 2: transferring features plus fine-tuning always improves generalization.
Open questions: what if we do not have any labelled data for fine-tuning in the target domain? What happens if the source and target domains are very dissimilar?

(To study the dissimilar case, ImageNet is not randomly split, but divided into A = {man-made classes} and B = {natural classes}.)
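The AnB/AnB+ protocol can be mimicked in a few lines of PyTorch; the sketch below uses a torchvision ResNet-18 as a stand-in for the original network (the study used an AlexNet-style model), and the layer grouping, the torchvision >= 0.13 weights API, and the class count are assumptions for illustration.

```python
import torch.nn as nn
from torchvision import models

def build_AnB(n_copied=3, n_frozen=3, num_target_classes=500):
    """Copy the first n_copied blocks from a source-pretrained network into a
    fresh target network; freeze the first n_frozen of them (AnB) or set
    n_frozen=0 to fine-tune everything (AnB+)."""
    source = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    target = models.resnet18(weights=None)
    blocks = ["conv1", "bn1", "layer1", "layer2", "layer3", "layer4"]
    for name in blocks[:n_copied]:
        getattr(target, name).load_state_dict(getattr(source, name).state_dict())
    for name in blocks[:n_frozen]:
        for p in getattr(target, name).parameters():
            p.requires_grad = False               # keep transferred layers fixed
    target.fc = nn.Linear(target.fc.in_features, num_target_classes)
    return target                                 # then train with labeled target data B
```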

Single Modality • General framework of unsupervised transfer

[Figure: source and target networks (input → output) share weights in the lower layers; a domain distance loss is applied to the higher layers.]

For lower-level features (more general and transferrable), the source transfers to the target directly.

For higher-level features (more domain-specific and less transferrable), the source transfers to the target by minimizing domain distances.

If some labelled target data are available, it would be better.

Single Modality

• Overall training objective = source-domain classification loss + domain distance loss.

• Domain distance losses
– Maximum Mean Discrepancy (MMD) [7]

(The domain distance is computed on a particular representation, e.g., the activations after the 5th layer.)
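A minimal PyTorch sketch of this overall objective (classification loss plus an MMD penalty on one adaptation layer, DDC-style [7]); the linear-kernel MMD, the trade-off weight lam, and the assumption that the model returns both logits and the chosen layer's activations are illustrative.

```python
import torch.nn.functional as F

def mmd_linear(f_src, f_tgt):
    # distance between the mean activations of the two domains
    return (f_src.mean(dim=0) - f_tgt.mean(dim=0)).pow(2).sum()

def transfer_objective(model, x_src, y_src, x_tgt, lam=0.25):
    """Overall loss = source classification loss + lam * domain distance loss,
    where the domain distance is measured on an adaptation layer's activations."""
    logits_src, feat_src = model(x_src)            # model returns (logits, features)
    _, feat_tgt = model(x_tgt)                     # unlabeled target batch
    cls_loss = F.cross_entropy(logits_src, y_src)
    return cls_loss + lam * mmd_linear(feat_src, feat_tgt)
```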

Single Modality

• Domain distance losses
– MK-MMD (multi-kernel variant of MMD) [8]: learns a more flexible distance metric than single-kernel MMD by adjusting the kernel combination weights.
– Domain classifier [4, 9]: a distribution-free criterion; the feature embedding is trained to maximize the domain-classification error, so that the two domains become indistinguishable.
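For the domain-classifier criterion, the standard trick is a gradient reversal layer [9]; here is a short PyTorch sketch, where the small domain classifier domain_clf is an assumed module and lam is the reversal strength.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lam in the
    backward pass, so the feature extractor learns to confuse the domain classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def domain_adversarial_loss(feat_src, feat_tgt, domain_clf, lam=1.0):
    feats = torch.cat([feat_src, feat_tgt], dim=0)
    labels = torch.cat([torch.zeros(len(feat_src), dtype=torch.long),
                        torch.ones(len(feat_tgt), dtype=torch.long)])
    logits = domain_clf(GradReverse.apply(feats, lam))
    # Minimized by the domain classifier, but (through the reversed gradient)
    # maximized by the shared feature extractor.
    return F.cross_entropy(logits, labels)
```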

Single Modality
• Other factors to improve transfer
– At which layers should the domain distance loss be applied?

• By learning: pinpoint the single layer that minimizes the domain distance among all the task-specific layers, say the fourth [7].

• Or apply it to all the task-specific layers, say the last two layers [8].

[Figure: source and target networks with the domain distance loss attached to the selected layers.]

Single Modality
• Other factors to improve transfer – what if we have some labeled training data in the target domain?

• Soft-label supervision [4]: categories without any labeled target data are still updated to output non-zero probabilities.

[Figure: source-domain and target-domain networks.]

Multiple Modalities
• The source domain and target domain could have different feature spaces, i.e., different dimensionality.
– Multimedia on the web: images, text documents, audio, video
– Recommender systems: Douban, Taobao, Xiami Music
– Robotics: vision, audio, sensors

How to deal with multi-modal transfer with Deep Learning?

Multiple Modalities

• Key: a shared concept links the two modalities. For example, the sentence "The cat is sitting on a sofa with ears cocking." and an image tagged cat, ears, kitten, eyes refer to the same underlying concept.

Multiple Modalities

• General framework of unsupervised transfer: [Figure] the source and target networks map their inputs into a common (shared) layer; each side also has a reconstruction layer.

Reconstruction errors are minimized for each modality; the paired loss preserves, in the common space, the similarity of each pair of co-occurring source and target instances.
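A toy PyTorch sketch of this unsupervised multimodal framework: one encoder/decoder branch per modality, a reconstruction loss on each branch, and a paired loss that pulls co-occurring instances together in the common space. Layer sizes, the MSE paired loss, and the weight alpha are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class ModalityBranch(nn.Module):
    def __init__(self, in_dim, common_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, common_dim))
        self.dec = nn.Sequential(nn.Linear(common_dim, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.enc(x)                  # embedding in the common space
        return z, self.dec(z)            # plus the reconstruction

def multimodal_loss(branch_src, branch_tgt, x_src, x_tgt, alpha=1.0):
    """x_src and x_tgt are assumed row-aligned co-occurring pairs
    (e.g., an image and its accompanying text)."""
    z_s, rec_s = branch_src(x_src)
    z_t, rec_t = branch_tgt(x_tgt)
    recon = F.mse_loss(rec_s, x_src) + F.mse_loss(rec_t, x_tgt)
    paired = F.mse_loss(z_s, z_t)        # co-occurring pairs share a representation
    return recon + alpha * paired
```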

Multiple Modalities

• General framework of supervised transfer: [Figure] the same architecture as in the unsupervised case, with classification losses added on the source and target outputs in addition to the paired loss.

MIR-Flickr Dataset
• 1 million images with user-generated tags.

• 25,000 images are labelled with 24 categories; 10,000 are used for training, 5,000 for validation, and 10,000 for testing.

[Examples] Category labels include: baby, female, portrait, people; plant life, river, water; clouds, sea, sky, transport, water; animals, dog, food. Domain 1: images; domain 2: text.

Results Mean Average Precision (MAP) by applying LR to different layers [13]

Transferring from either one of the two domains to the other through the joint hidden representation outperforms using that domain's own input features alone (image_input or text_input).

DBN [12], DBM [13]

References
[1] Nguyen, Hien V., et al. "Joint hierarchical domain adaptation and feature learning." PAMI, 2013.
[2] Oquab, Maxime, et al. "Learning and transferring mid-level image representations using convolutional neural networks." CVPR, 2014.
[3] Yosinski, Jason, et al. "How transferable are features in deep neural networks?" NIPS, 2014.
[4] Tzeng, Eric, et al. "Simultaneous deep transfer across domains and tasks." ICCV, 2015.
[5] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Domain adaptation for large-scale sentiment classification: A deep learning approach." ICML, 2011.
[6] Chopra, Sumit, Suhrid Balakrishnan, and Raghuraman Gopalan. "DLID: Deep learning for domain adaptation by interpolating between domains." ICML, 2013.
[7] Tzeng, Eric, et al. "Deep domain confusion: Maximizing for domain invariance." arXiv preprint arXiv:1412.3474, 2014.
[8] Long, Mingsheng, and Jianmin Wang. "Learning transferable features with deep adaptation networks." arXiv preprint arXiv:1502.02791, 2015.
[9] Ganin, Yaroslav, and Victor Lempitsky. "Unsupervised domain adaptation by backpropagation." ICML, 2015.
[10] Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers." ICASSP, 2013.
[11] Gupta, Saurabh, Judy Hoffman, and Jitendra Malik. "Cross modal distillation for supervision transfer." arXiv preprint arXiv:1507.00448, 2015.
[12] Ngiam, Jiquan, et al. "Multimodal deep learning." ICML, 2011.
[13] Srivastava, Nitish, and Ruslan Salakhutdinov. "Multimodal learning with deep Boltzmann machines." JMLR, 2014.
[14] Sohn, Kihyuk, Wenling Shang, and Honglak Lee. "Improved multimodal deep learning with variation of information." NIPS, 2014.

Simultaneous Deep Transfer Across Domains and Tasks. Eric Tzeng, Judy Hoffman, Trevor Darrell, Kate Saenko, ICCV 2015.

Tzeng et al.: Architecture [Figure]

Oquab, Bottou, Laptev, Sivic: Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks. CVPR 2014.

Transfer Learning in Convolutional Neural Networks

• Source domain: ImageNet – 1000 classes, 1.2 million images

• Target domain: Pascal VOC 2007 object classification – 20 classes, about 5,000 images

• PRE-1000C: the proposed method (pre-train on the 1000 ImageNet classes, then transfer)

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

• Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell. ICML 2014.

• Questions:
– How to transfer features to tasks with different labels?
– Do features extracted from the CNN generalize to other datasets?
– How does performance vary with network depth?

• Algorithm:
– A deep convolutional model is first trained in a fully supervised setting using a state-of-the-art method (Krizhevsky et al., 2012).
– Various features are extracted from this network, and their efficacy is evaluated on generic vision tasks.

78

Comparison: DeCAF to others

79

Relational Transfer Learning Approaches

• Motivation: if two logically described domains (relational; the data are non-i.i.d.) are related, they must share similar relations among objects.

• These relations can be used for transfer learning.

80

Relational Transfer Learning Approaches (cont.)

81

[Figure] Academic domain (source): Professor (A), Student (B), AdvisedBy, Paper (T), Publication. Movie domain (target): Actor (A), Director (B), WorkedFor, Movie (M), MovieMember.

AdvisedBy (B, A) ˄ Publication (B, T) => Publication (A, T)

WorkedFor (A, B) ˄ MovieMember (A, M) => MovieMember (B, M)

Both rules instantiate the same second-order clause: P1(x, y) ˄ P2(x, z) => P2(y, z)

[Mihalkova et al., AAAI-07; Davis and Domingos, ICML-09]

TRANSFER LEARNING APPLICATIONS (迁移学习应用)

82

83

Query Classification and Online Advertisement

• ACM KDDCUP'05 Winner
• SIGIR 2006
• ACM Transactions on Information Systems, 2006
– Joint work with Dou Shen, Jiantao Sun and Zheng Chen

84

QC as Machine Learning

Inspired by the KDDCUP'05 competition:
– classify a query into a ranked list of categories;
– queries are collected from real search engines;
– target categories are organized in a tree, with each node being a category.

85

Target-transfer Learning in QC

• A classifier, once trained, stays constant.
– Target classes before: Sports, Politics (European, US, China)
– Target classes now: Sports (Olympics, Football, NBA), Stock Market (Asian, Dow, Nasdaq), History (Chinese, World)
How can we allow the target to change?

• Application:
– advertisements come and go,
– but our query→target mapping need not be retrained!

• We call this the target-transfer learning problem.

86

Solutions: Query Enrichment + Staged Classification

The Architecture of Our Approach: [Figure]
Phase I (training): construct synonym-based classifiers and a statistical classifier that map queries to the target categories.
Phase II (testing): a query is sent to a search engine; the labels and the text of the returned pages are classified, and the classified results are combined into the final result.
Solution: the bridging classifier.

87

Step 1: Query enrichment
• Enrich each query with the textual information of its search results (title, snippet, full text) and their category information.

88

Step 2: Bridging Classifier

• What we wish to avoid: when the target is changed, training needs to be repeated!

• Solution: connect the target taxonomy and queries by taking an intermediate taxonomy as a bridge.

89

Bridging Classifier (Cont.)

• How to connect? The bridging classifier scores a target category C_i^T for a query q through the intermediate taxonomy {C_j^I}:

p(C_i^T | q) ∝ Σ_j p(C_i^T | C_j^I) · p(q | C_j^I) · p(C_j^I)

where p(C_j^I) is the prior probability of the intermediate category C_j^I, p(C_i^T | C_j^I) captures the relation between C_i^T and C_j^I, p(q | C_j^I) captures the relation between q and C_j^I, and p(C_i^T | q) is the desired relation between q and C_i^T.

90
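A tiny numerical sketch of this combination step, assuming the three ingredients have already been estimated separately (the estimation itself is the bulk of the original system and is not shown here).

```python
import numpy as np

def bridging_scores(p_target_given_inter, p_query_given_inter, p_inter):
    """Unnormalized scores p(C_T | q) for one query q:
    p_target_given_inter : (n_target x n_intermediate) matrix of p(C_i^T | C_j^I)
    p_query_given_inter  : (n_intermediate,) vector of p(q | C_j^I)
    p_inter              : (n_intermediate,) prior vector p(C_j^I)"""
    return p_target_given_inter @ (p_query_given_inter * p_inter)

# Ranking the target categories for the query:
# ranked = np.argsort(-bridging_scores(PT_I, Pq_I, PI))
```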

Category Selection for Intermediate Taxonomy

– Category selection for reducing complexity:

• Total Probability (TP)

• Mutual Information (MI)

91

Result of Bridging Classifiers

– Using the bridging classifier allows the target classes to change freely: there is no need to retrain the classifier!

Performance of the bridging classifier with different granularities of the intermediate taxonomy [Figure].

Cross-Domain Activity Recognition [Zheng, Hu, Yang, Ubicomp 2009]

• Challenge: a new domain of activities without labeled data.

• Cross-domain activity recognition: transfer some available labeled data from source activities to help train the recognizer for the target activities.

92

Example activity domains: Cleaning Indoor, Laundry, Dishwashing. How can the similarities between activities be used?

93

[Diagram] Source-domain labeled data <SensorReading, ActivityName> (e.g., <SS, "MakeCoffee">) are converted into target-domain pseudo-labeled data using a similarity measure mined from the Web: if sim("MakeCoffee", "MakeTea") = 0.6, the pseudo training example <SS, "MakeTea", 0.6> is generated. A weighted SVM classifier is then trained on the pseudo-labeled data.

Calculating Activity Similarities
• How similar are two activities?
– Use Web search results.
– TF-IDF: traditional IR similarity metrics (cosine similarity).
– Example: mined similarity between the activity "sweeping" and "vacuuming", "making the bed", "gardening".

[Chart: calculated similarity with the activity "sweeping".]

94
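A short sketch of the Web-mined similarity step, assuming the retrieval has already been done and each activity name maps to the concatenated text of its search results; TF-IDF plus cosine similarity follows the slide's description.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def activity_similarity(docs_by_activity):
    """Return the activity names and their pairwise TF-IDF cosine similarities."""
    names = list(docs_by_activity)
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(
        [docs_by_activity[n] for n in names])
    return names, cosine_similarity(tfidf)

# A source example <sensor_reading, "make coffee"> then yields the pseudo-labeled
# target example <sensor_reading, "make tea", sim("make coffee", "make tea")>,
# and the similarity is used as the instance weight in a weighted SVM.
```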

Cross-Domain AR: Performance

Dataset | Mean accuracy with cross-domain transfer | # Activities (source domain) | # Activities (target domain) | Baseline (random guess)
MIT Dataset (Cleaning to Laundry) | 58.9% | 13 | 8 | 12.5%
MIT Dataset (Cleaning to Dishwashing) | 53.2% | 13 | 7 | 14.3%
Intel Research Lab Dataset | 63.2% | 5 | 6 | 16.7%

95

• Activities in the source domain and the target domain are generated from ten random trials; mean accuracies are reported.

Transferring knowledge from social to physical

• Ubiquitous physical sensors motivate extensive research on ubiquitous computing.

Which activity is this person performing?

Transferring from social to physical. Example posts:
"I am on a business trip in New York. The Metropolitan Museum of Art is fantastic!"
"Brilliant night at Chilli. Food, wine, hospitality all excellent. Bristol's top restaurant."
"Back in the #gym after 3.5 weeks :) feeling good #exercise"

Can we transfer knowledge from social media to the physical world?

Transfer from social to physical

• Cellphone Sensor Dataset: 232 sensor records, 10 volunteers; time, GPS, tri-axial accelerometer, location POI info.

• Sina Weibo: 10,791 tweets. [Charts: distribution of labels; distribution of the top 9 labels.]

Transfer from social to physical

• Results: a naive combination of sensor and social features performs better than sensor features only (Combined vs. Sensor), which validates the necessity of instilling social knowledge into physical sensor data.

• Heterogeneous transfer learning methods improve over Combined: employing social messages to enrich the sensor readings' feature representation in a latent space is more effective than naive combination.

• Our method can reach the same performance using only 50% of the labelled data required by other methods.

Transfer Learning for Collaborative Filtering

101

[Figure: IMDB Database (movies) and Amazon.com (books).]

Transfer Learning in Collaborative Filtering

• Source (dense): encode cluster-level rating patterns.
• Target (sparse): map users/items to the encoded prototypes.

102

[Figure: codebook transfer. The dense auxiliary MOVIES rating matrix is permuted (rows and columns) and reduced to groups, yielding a cluster-level rating pattern (codebook) over user and item clusters. The sparse target BOOKS rating matrix is then matched against this codebook: its users and items are mapped to the corresponding clusters, and the missing ratings are filled in from the cluster-level pattern.]

ADVANCED DEVELOPMENTS

103

Source-Free Transfer Learning

Evan Wei Xiang, Sinno Jialin Pan, Weike Pan, Jian Su and Qiang Yang. Source-Selection-Free Transfer Learning. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI-11), Barcelona, Spain, July 2011.

[Figure: a lack of labeled training data always happens; when we have some related source domains, we move from supervised learning to transfer learning.]

Where are the "right" source data?

• We may have an extremely large number of choices of potential sources to use.

SFTL: Building base models

[Figure: many pairwise "A vs. B" base classifiers.]

From the taxonomy of the online information source, we can "compile" a large number of base classification models.

Source-Free Transfer Learning

[Figure: the predictions of the pairwise base classifiers are aggregated.]

For each target instance, we can obtain a combined result on the label space by aggregating the predictions from all the base classifiers.

However, do we need to call the base classifiers during the prediction phase? The answer is No!

We can then use a projection matrix V to transform such combined results from the m-dimensional label space into a q-dimensional latent space.

[Figure: a target instance's probability vector over the label space is projected by V into the latent space.]

Compilation: learning a projection matrix W to map the target instance to the latent space.

[Figure: using the target domain's labeled and unlabeled data, a projection matrix W is learned that maps target instances from their original feature space into the latent space.]

Our regression model has a loss on the labeled data and a loss on the unlabeled data: for each target instance, we first aggregate its predictions in the base label space and then project the result onto the latent space.

SFTL: Predictions for the incoming test data

[Figure: incoming target-domain test data are mapped by the learned projection matrix W into the latent space.]

With the parameter matrix W, we can make predictions on any incoming test data based on the distance to the label prototypes, without calling the base classification models. There is no need to use the base models explicitly!

Transitive Transfer Learning with intermediate domains

Qiang Yang

Hong Kong University of Science and Technology

http://www.cse.ust.hk/~qyang

Far Transfer vs. Near Transfer

Problem definition: given distant source and target domains and a set of intermediate domains, can we find one or more intermediate domains that enable transfer learning between the source and the target?

[Figure: the source is not directly transferrable to the target; intermediate domain 1 shares a common factor with each of them.]

Previous work and TTL
• Traditional machine learning: training and test data should come from the same problem domain (ML: same domain).
• Transfer learning: training and test data should come from similar problem domains (TL: similar domains).
• Transitive transfer learning: training and test data could come from distant problem domains (TTL: distant domains).

Text-to-Image Classification

• The source and target domains have few overlapping features.

• Example: text-to-image classification with co-occurrence data as the intermediate domain.

• Example: accelerometer-to-gyroscope activity recognition with data from intelligent devices as intermediate domains.

TTL: single intermediate domain. Select the intermediate domain, then propagate knowledge.
• Crawl a large number of images with annotations from the Internet.
• Use a domain distance, such as the A-distance, to identify a suitable intermediate domain.
• Transitive transfer through hidden factors shared in the rows via matrix tri-factorization.

Matrix tri-factorization for clustering/classification; TTL shares the hidden row factors across the tri-factorizations.

Experiments: NUS-WIDE dataset
• The NUS-WIDE dataset is used.
• 45 text-to-image tasks; each task is composed of 1,200 text documents, 600 images, and 1,600 co-occurring text-image pairs.

Supervised learning with an auto-encoder: [Figure] labeled source-domain data go through feature engineering (a shared auto-encoder layer) and predictive model learning for text classification.

Designing the Objective Function of TTL

• Transitive transfer learning with intermediate data: intermediate domain weighting/selection.

• The weights for the intermediate domains are learned from the data; the intermediate data help find a better hidden layer for feature engineering and predictive model learning.

TTL with supervised auto-encoder: [Figure] the source, intermediate, and target domains share the feature-engineering (auto-encoder) layer, on top of which the predictive models are learned.

• The NUS-WIDE data: 45 text-to-image tasks; each task is composed of 1,200 text documents, 600 images, and 1,600 co-occurring text-image pairs. In each task, 1,600 × 45 co-occurring text-image pairs are used for knowledge transfer.

TTL with supervised auto-encoder (cont.): [Figure/results] text-to-image classification with intermediate data.

Reinforcement Transfer Learning via Sparse Coding

• Slow learning speed remains a fundamental problem for reinforcement learning in complex environments.

• Main problem: the numbers of states and actions in the source and target domains are different.
– Existing works: hand-coded inter-task mappings between state-action pairs.

• Tool: a new transfer learning method based on sparse coding.

Ammar, Tuyls, Taylor, Driessens, Weiss: Reinforcement Learning Transfer via Sparse Coding. AAMAS, 2012.

Reinforcement Learning Transfer via Sparse Coding
The authors measured performance as the number of steps during an episode for which the pole is kept in an upright position, for a given fixed amount of samples.

• Given state-action-state triplets in the source task, learn a dictionary (via sparse coding).

• Using the coefficient matrix from the first step, learn the dictionary in the target task.

• Then, for each triplet in the target task, sparse projection is used to find its coefficients.

• As a result, the inter-task mapping can be learned!
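A very rough sketch of this sparse-coding idea using scikit-learn; the alignment of source and target triplets, the least-squares step for the target dictionary, and all hyperparameters are simplifying assumptions, not the AAMAS-12 algorithm itself.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

def sparse_coding_transfer(src_triplets, tgt_triplets, n_atoms=20):
    """src_triplets / tgt_triplets: row-aligned arrays of flattened (s, a, s')
    triplets from the source and target tasks."""
    # 1) Learn a dictionary and sparse codes on the source-task triplets.
    dl = DictionaryLearning(n_components=n_atoms, transform_algorithm="lasso_lars")
    codes = dl.fit_transform(src_triplets)                  # coefficient matrix
    # 2) Reuse those coefficients to fit a target-task dictionary (least squares).
    D_tgt, *_ = np.linalg.lstsq(codes, tgt_triplets, rcond=None)
    # 3) New target triplets are sparsely projected onto D_tgt; matching the
    #    resulting codes against the source codes gives the inter-task mapping.
    def encode_target(x):
        return sparse_encode(x, D_tgt, algorithm="lasso_lars")
    return dl.components_, D_tgt, encode_target
```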

Reinforcement Transfer Learning via Sparse Coding [Figure: results]

References
• [Thorndike and Woodworth, The Influence of Improvement in One Mental Function upon the Efficiency of Other Functions, 1901]
• [Taylor and Stone, Transfer Learning for Reinforcement Learning Domains: A Survey, JMLR 2009]
• [Pan and Yang, A Survey on Transfer Learning, IEEE TKDE 2010]
• [Quionero-Candela et al., Dataset Shift in Machine Learning, MIT Press 2009]
• [Blitzer et al., Domain Adaptation with Structural Correspondence Learning, EMNLP 2006]
• [Pan et al., Cross-Domain Sentiment Classification via Spectral Feature Alignment, WWW 2010]
• [Pan et al., Transfer Learning via Dimensionality Reduction, AAAI 2008]

126

References (cont.)
• [Pan et al., Domain Adaptation via Transfer Component Analysis, IJCAI 2009]
• [Evgeniou and Pontil, Regularized Multi-Task Learning, KDD 2004]
• [Zhang and Yeung, A Convex Formulation for Learning Task Relationships in Multi-Task Learning, UAI 2010]
• [Agarwal et al., Learning Multiple Tasks using Manifold Regularization, NIPS 2010]
• [Argyriou et al., Multi-Task Feature Learning, NIPS 2007]
• [Ando and Zhang, A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data, JMLR 2005]
• [Ji et al., Extracting Shared Subspace for Multi-label Classification, KDD 2008]

127

References (cont.)
• [Raina et al., Self-taught Learning: Transfer Learning from Unlabeled Data, ICML 2007]
• [Dai et al., Boosting for Transfer Learning, ICML 2007]
• [Glorot et al., Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach, ICML 2011]
• [Davis and Domingos, Deep Transfer via Second-order Markov Logic, ICML 2009]
• [Mihalkova et al., Mapping and Revising Markov Logic Networks for Transfer Learning, AAAI 2007]
• [Li et al., Cross-Domain Co-Extraction of Sentiment and Topic Lexicons, ACL 2012]

128

References (cont.)
• [Sugiyama et al., Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation, NIPS 2007]
• [Kanamori et al., A Least-squares Approach to Direct Importance Estimation, JMLR 2009]
• [Cristianini et al., On Kernel Target Alignment, NIPS 2002]
• [Huang et al., Correcting Sample Selection Bias by Unlabeled Data, NIPS 2006]
• [Zadrozny, Learning and Evaluating Classifiers under Sample Selection Bias, ICML 2004]

129

Transfer Learning in Convolutional Neural Networks

• Convolutional neural networks (CNNs) deliver outstanding image-classification performance.

• Learning CNNs requires a very large number of annotated image samples: they have millions of parameters, which prevents applying CNNs to problems with limited training data.

• Key idea:
– the internal layers of the CNN can act as a generic extractor of mid-level image representations;
– this is model-based transfer learning.

Thank You

131