One-shot learning by inverting a compositional causal process. Brenden M. Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum. Presented by 能地宏 @NII. (The figures in these slides are taken from the paper.)

NIPS読み会2013: One-shot learning by inverting a compositional causal process


Page 1: NIPS読み会2013: One-shot learning by inverting  a compositional causal process

One-shot learning by inverting a compositional causal process

Brenden M. Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum

能地宏 @NII

Note: the figures in these slides are taken from the paper.

Page 2: NIPS読み会2013: One-shot learning by inverting  a compositional causal process

One-shot classification

One-shot learning by inverting a compositional causal process

Brenden M. Lake
Dept. of Brain and Cognitive Sciences, MIT

Ruslan Salakhutdinov
Dept. of Statistics and Computer Science, University of Toronto

Joshua B. Tenenbaum
Dept. of Brain and Cognitive Sciences, MIT

Abstract

People can learn a new visual class from just one example, yet machine learning algorithms typically require hundreds or thousands of examples to tackle the same problems. Here we present a Hierarchical Bayesian model based on compositionality and causality that can learn a wide range of natural (although simple) visual concepts, generalizing in human-like ways from just one image. We evaluated performance on a challenging one-shot classification task, where our model achieved a human-level error rate while substantially outperforming two deep learning models. We also tested the model on another conceptual task, generating new examples, by using a “visual Turing test” to show that our model produces human-like performance.

1 Introduction

People can acquire a new concept from only the barest of experience – just one or a handful of examples in a high-dimensional space of raw perceptual input. Although machine learning has tackled some of the same classification and recognition problems that people solve so effortlessly, the standard algorithms require hundreds or thousands of examples to reach good performance. While the standard MNIST benchmark dataset for digit recognition has 6000 training examples per class [19], people can classify new images of a foreign handwritten character from just one example (Figure 1b) [23, 16, 17]. Similarly, while classifiers are generally trained on hundreds of images per class, using benchmark datasets such as ImageNet [4] and CIFAR-10/100 [14], people can learn a

[Figure residue from the embedded paper page: characters annotated with stroke order and scores, grouped under “Human drawers” and “Simple drawers” with a “canonical” example for each; panel labels a), b), c).]

Figure 1: Can you learn a new concept from just one example? (a & b) Where are the other examples of the concept shown in red? Answers for b) are row 4 column 3 (left) and row 2 column 4 (right). c) The learned concepts also support many other abilities such as generating examples and parsing.


Page 3: NIPS読み会2013: One-shot learning by inverting  a compositional causal process

One-shot Generation

[Figure 6 grid labels: People, HBPL, Affine, HD; Example = the single target image.]

Figure 6: Generating new examples from just a single “target” image (left). Each grid shows nine new examples synthesized by people and the three computational models.

Visual Turing test. To compare the examples generated by people and the models, we ran a visual Turing test using 50 new participants in the USA on Mechanical Turk. Participants were told that they would see a target image and two grids of 9 images (Figure 6), where one grid was drawn by people with their computer mice and the other grid was drawn by a computer program that “simulates how people draw a new character.” Which grid is which? There were two conditions, where the “computer program” was either HBPL or the Affine model. Participants were quizzed on their comprehension and then they saw 50 trials. Accuracy was revealed after each block of 10 trials. Also, a button to review the instructions was always accessible. Four participants who reported technical difficulties were not analyzed.

Results. Participants who tried to label drawings from people vs. HBPL were only 56% correct, while those who tried to label people vs. the Affine model were 92% correct. A 2-way Analysis of Variance showed a significant effect of condition (p < .001), but no significant effect of block and no interaction. While both group means were significantly better than chance, a subject analysis revealed only 2 of 21 participants were better than chance for people vs. HBPL, while 24 of 25 were significant for people vs. Affine. Likewise, 8 of 50 items were above chance for people vs. HBPL, while 48 of 50 items were above chance for people vs. Affine. Since participants could easily detect the overly consistent Affine model, it seems the difficulty participants had in detecting HBPL’s exemplars was not due to task confusion. Interestingly, participants did not significantly improve over the trials, even after seeing hundreds of images from the model. Our results suggest that HBPL can generate compelling new examples that fool a majority of participants.
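The “better than chance” subject and item analyses above are not spelled out in detail; one plausible reading is a per-participant (or per-item) one-sided binomial test against 50% accuracy over the 50 trials, sketched below with SciPy. The choice of test and alpha level are assumptions, not taken from the paper.

```python
from scipy.stats import binomtest

def above_chance(n_correct, n_trials=50, alpha=0.05):
    """One-sided binomial test: is accuracy significantly above the 50% chance level?"""
    # Hypothetical reading of the paper's "better than chance" criterion.
    return binomtest(n_correct, n_trials, p=0.5, alternative="greater").pvalue < alpha

# Illustrative values: 28/50 correct is not significantly above chance, 35/50 is.
print(above_chance(28), above_chance(35))
```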

4 Discussion

Hierarchical Bayesian Program Learning (HBPL), by exploiting compositionality and causality, departs from standard models that need a lot more data to learn new concepts. From just one example, HBPL can both classify and generate compelling new examples, fooling judges in a “visual Turing test” that other approaches could not pass. Beyond the differences in model architecture, HBPL was also trained on the causal dynamics behind images, although just the images were available at evaluation time. If one were to incorporate this compositional and causal structure into a deep learning model, it could lead to better performance on our tasks. Thus, we do not see our model as the final word on how humans learn concepts, but rather, as a suggestion for the type of structure that best captures how people learn rich concepts from very sparse data. Future directions will extend this approach to other natural forms of generalization with characters, as well as speech, gesture, and other domains where compositionality and causality are central.

Acknowledgments

We would like to thank MIT CoCoSci for helpful feedback. This work was supported by ARO MURI contract W911NF-08-1-0242 and an NSF Graduate Research Fellowship held by the first author.


Page 4: NIPS読み会2013: One-shot learning by inverting  a compositional causal process

One-shot Generation


Page 5: NIPS読み会2013: One-shot learning by inverting  a compositional causal process

One-shot Generation


People / The model

Page 6: NIPS読み会2013: One-shot learning by inverting  a compositional causal process

One-shot Generation


People / The model

Page 7: NIPS読み会2013: One-shot learning by inverting  a compositional causal process

Overview

‣ Humans can extract the characteristics of a symbol from just a single example

- Classification: they can pick out similar examples

- Generation: they can produce new samples

‣ Machine learning typically requires a large amount of data per label

- e.g., MNIST: 6000 training examples per class

‣ Task and contributions

- Can machine learning imitate this human ability?

- Carefully defining a generative model yields results comparable to humans

- This may suggest that humans extract features through a similar mechanism

Page 8: NIPS読み会2013: One-shot learning by inverting  a compositional causal process

Data and Learning

Omniglot dataset

50 alphabets; 1600 characters; 20 examples per character

Figure 2: Four alphabets from Omniglot, each with five characters drawn by four different people.

new visual object from just one example (e.g., a “Segway” in Figure 1a). These new larger datasets have developed along with larger and “deeper” model architectures, and while performance has steadily (and even spectacularly [15]) improved in this big data setting, it is unknown how this progress translates to the “one-shot” setting that is a hallmark of human learning [3, 22, 28].

Additionally, while classification has received most of the attention in machine learning, people can generalize in a variety of other ways after learning a new concept. Equipped with the concept “Segway” or a new handwritten character (Figure 1c), people can produce new examples, parse an object into its critical parts, and fill in a missing part of an image. While this flexibility highlights the richness of people’s concepts, suggesting they are much more than discriminative features or rules, there are reasons to suspect that such sophisticated concepts would be difficult if not impossible to learn from very sparse data. Theoretical analyses of learning express a tradeoff between the complexity of the representation (or the size of its hypothesis space) and the number of examples needed to reach some measure of “good generalization” (e.g., the bias/variance dilemma [8]). Given that people seem to succeed at both sides of the tradeoff, a central challenge is to explain this remarkable ability: What types of representations can be learned from just one or a few examples, and how can these representations support such flexible generalizations?

To address these questions, our work here offers two contributions as initial steps. First, we introduce a new set of one-shot learning problems for which humans and machines can be compared side-by-side, and second, we introduce a new algorithm that does substantially better on these tasks than current algorithms. We selected simple visual concepts from the domain of handwritten characters, which offers a large number of novel, high-dimensional, and cognitively natural stimuli (Figure 2). These characters are significantly more complex than the simple artificial stimuli most often modeled in psychological studies of concept learning (e.g., [6, 13]), yet they remain simple enough to hope that a computational model could see all the structure that people do, unlike domains such as natural scenes. We used a dataset we collected called “Omniglot” that was designed for studying learning from a few examples [17, 26]. While similar in spirit to MNIST, rather than having 10 characters with 6000 examples each, it has over 1600 characters with 20 examples each – making it more like the “transpose” of MNIST. These characters were selected from 50 different alphabets on www.omniglot.com, which includes scripts from natural languages (e.g., Hebrew, Korean, Greek) and artificial scripts (e.g., Futurama and ULOG) invented for purposes like TV shows or video games. Since it was produced on Amazon’s Mechanical Turk, each image is paired with a movie ([x, y, time] coordinates) showing how that drawing was produced.

In addition to introducing new one-shot learning challenge problems, this paper also introduces Hierarchical Bayesian Program Learning (HBPL), a model that exploits the principles of compositionality and causality to learn a wide range of simple visual concepts from just a single example. We compared the model with people and other competitive computational models for character recognition, including Deep Boltzmann Machines [25] and their Hierarchical Deep extension for learning with very few examples [26]. We find that HBPL classifies new examples with near human-level accuracy, substantially beating the competing models. We also tested the model on generating new exemplars, another natural form of generalization, using a “visual Turing test” to evaluate performance. In this test, both people and the model performed the same task side by side, and then other human participants judged which result was from a person and which was from a machine.

2 Hierarchical Bayesian Program Learning

We introduce a new computational approach called Hierarchical Bayesian Program Learning (HBPL) that utilizes the principles of compositionality and causality to build a probabilistic generative model of handwritten characters. It is compositional because characters are represented as stochastic motor programs where primitive structure is shared and re-used across characters at multiple levels, including strokes and sub-strokes. Given the raw pixels, the model searches for a […]
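To make the compositional idea above concrete, here is a minimal, hypothetical sketch of a character type encoded as strokes built from sub-strokes that index into a shared library of motor primitives. The class names, fields, and relation labels are illustrative only and are not the paper's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SubStroke:
    primitive_id: int                  # index into the shared library of motor primitives
    control_points: List[Tuple[float, float]]  # (x, y) control points of the trajectory

@dataclass
class Stroke:
    sub_strokes: List[SubStroke]       # a stroke is a sequence of re-used primitives
    relation: str                      # how the stroke attaches to earlier strokes
                                       # (illustrative labels, e.g. "independent" or "start")

@dataclass
class CharacterType:
    strokes: List[Stroke] = field(default_factory=list)

# A toy two-stroke character type that re-uses primitive 17 in both strokes.
char = CharacterType(strokes=[
    Stroke([SubStroke(17, [(0.0, 0.0), (1.0, 1.0)])], relation="independent"),
    Stroke([SubStroke(17, [(1.0, 1.0), (2.0, 0.0)]),
            SubStroke(3,  [(2.0, 0.0), (2.0, 1.0)])], relation="start"),
])
print(len(char.strokes), "strokes,",
      sum(len(s.sub_strokes) for s in char.strokes), "sub-strokes")
```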


30 alphabets: learn hyperparameters

[Figure 4 image residue: panel a) library of motor primitives, panel b) histogram of number of strokes, panel c) stroke start positions by stroke number (1, 2, 3, ≥ 4).]

Figure 4: Learned hyperparameters. a) A subset of primitives, where the top row shows the most common ones. The first control point (circle) is filled. b & c) Empirical distributions, where the heatmap in c) shows how the starting point differs by stroke number.

Image. An image transformation $A^{(m)} \in \mathbb{R}^4$ is sampled from $P(A^{(m)}) = \mathcal{N}([1, 1, 0, 0], \Sigma_A)$, where the first two elements control a global re-scaling and the second two control a global translation of the center of mass of $T^{(m)}$. The transformed trajectories can then be rendered as a 105x105 grayscale image, using an ink model adapted from [10] (see Section SI-2). This grayscale image is then perturbed by two noise processes, which make the gradient more robust during optimization and encourage partial solutions during classification. These processes include convolution with a Gaussian filter with standard deviation $\sigma_b^{(m)}$ and pixel flipping with probability $\epsilon^{(m)}$, where the amount of noise $\sigma_b^{(m)}$ and $\epsilon^{(m)}$ are drawn uniformly on a pre-specified range (Section SI-2). The grayscale pixels then parameterize 105x105 independent Bernoulli distributions, completing the full model of binary images $P(I^{(m)} \mid \theta^{(m)}) = P(I^{(m)} \mid T^{(m)}, A^{(m)}, \sigma_b^{(m)}, \epsilon^{(m)})$.
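As a rough illustration of the image model just described, the sketch below (assuming NumPy/SciPy, a hypothetical grayscale rendering `ink_image` in [0, 1], and an observed binary image `binary_image`) applies the Gaussian blur, folds the pixel-flip probability into the per-pixel Bernoulli parameters, and evaluates the log-likelihood $\log P(I^{(m)} \mid T^{(m)}, A^{(m)}, \sigma_b^{(m)}, \epsilon^{(m)})$. Exactly how the flip noise enters the likelihood is my assumption; this is not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def image_log_likelihood(binary_image, ink_image, sigma_b, epsilon):
    """Bernoulli log-likelihood of a binary image under a blurred, flip-perturbed rendering."""
    # Convolve the rendered ink (values in [0, 1]) with a Gaussian filter of std sigma_b.
    blurred = np.clip(gaussian_filter(ink_image, sigma=sigma_b), 0.0, 1.0)
    # Assumed flip model: each pixel is flipped with probability epsilon.
    p_on = (1.0 - epsilon) * blurred + epsilon * (1.0 - blurred)
    p_on = np.clip(p_on, 1e-6, 1.0 - 1e-6)  # guard against log(0)
    # Independent Bernoulli likelihood over all 105x105 pixels.
    return np.sum(binary_image * np.log(p_on) + (1.0 - binary_image) * np.log(1.0 - p_on))

# Hypothetical usage with a fake 105x105 stroke rendering.
rng = np.random.default_rng(0)
ink = np.zeros((105, 105)); ink[40:60, 50:55] = 1.0
obs = (gaussian_filter(ink, 1.0) + 0.05 * rng.random(ink.shape)) > 0.3
print(image_log_likelihood(obs.astype(float), ink, sigma_b=1.0, epsilon=0.01))
```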

2.3 Learning high-level knowledge of motor programs

The Omniglot dataset was randomly split into a 30-alphabet “background” set and a 20-alphabet “evaluation” set, constrained such that the background set included the six most common alphabets as determined by Google hits. Background images, paired with their motor data, were used to learn the hyperparameters of the HBPL model, including a set of 1000 primitive motor elements (Figure 4a) and position models for a drawing’s first, second, and third stroke, etc. (Figure 4c). Wherever possible, cross-validation (within the background set) was used to decide issues of model complexity within the conditional probability distributions of HBPL. Details are provided in Section SI-4 for learning the models of primitives, positions, relations, token variability, and image transformations.
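A minimal sketch of the background/evaluation split and of the kind of per-stroke-index start-position statistics described above, assuming hypothetical inputs (`alphabet_names`, a `most_common` list of six alphabets, and `background_drawings` given as lists of strokes with (x, y, time) points). The real procedures are described in Section SI-4 of the paper.

```python
import random
from collections import defaultdict

def split_omniglot(alphabet_names, most_common, n_background=30, seed=0):
    """Split alphabets into a 30-alphabet 'background' set (used to learn hyperparameters)
    and an 'evaluation' set, forcing the six most common alphabets into the background."""
    rest = [a for a in alphabet_names if a not in most_common]
    random.Random(seed).shuffle(rest)
    background = list(most_common) + rest[: n_background - len(most_common)]
    evaluation = [a for a in alphabet_names if a not in background]
    return background, evaluation

def stroke_start_positions(background_drawings):
    """Collect empirical stroke start positions grouped by stroke index (1st, 2nd, 3rd, ...),
    the kind of data summarized by the position models in Figure 4c."""
    starts = defaultdict(list)
    for drawing in background_drawings:            # drawing = list of strokes
        for idx, stroke in enumerate(drawing):     # stroke = list of (x, y, time) points
            x, y, _ = stroke[0]
            starts[idx].append((x, y))
    return starts
```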

2.4 Inference

Posterior inference in this model is very challenging, since parsing an image $I^{(m)}$ requires exploring a large combinatorial space of different numbers and types of strokes, relations, and sub-strokes. We developed an algorithm for finding $K$ high-probability parses, $\psi^{[1]}, \theta^{(m)[1]}, \ldots, \psi^{[K]}, \theta^{(m)[K]}$, which are the most promising candidates proposed by a fast, bottom-up image analysis, shown in Figure 5a and detailed in Section SI-5. These parses approximate the posterior with a discrete distribution,

$$P(\psi, \theta^{(m)} \mid I^{(m)}) \approx \sum_{i=1}^{K} w_i \, \delta(\theta^{(m)} - \theta^{(m)[i]}) \, \delta(\psi - \psi^{[i]}), \qquad (4)$$

where each weight $w_i$ is proportional to the parse score, marginalizing over shape variables $x$,

$$w_i \propto \tilde{w}_i = P(\psi^{[i]} \setminus x, \, \theta^{(m)[i]}, \, I^{(m)}) \qquad (5)$$

and constrained such that $\sum_i w_i = 1$. Rather than using just a point estimate for each parse, the approximation can be improved by incorporating some of the local variance around the parse. Since the token-level variables $\theta^{(m)}$, which closely track the image, allow for little variability, and since it is inexpensive to draw conditional samples from the type-level $P(\psi \mid \theta^{(m)[i]}, I^{(m)}) = P(\psi \mid \theta^{(m)[i]})$ as it does not require evaluating the likelihood of the image, just the local variance around the type level is estimated with the token level fixed. Metropolis Hastings is run to produce $N$ samples (Section SI-5.5) for each parse $\theta^{(m)[i]}$, denoted by $\psi^{[i1]}, \ldots, \psi^{[iN]}$, where the improved approximation is

$$P(\psi, \theta^{(m)} \mid I^{(m)}) \approx Q(\psi, \theta^{(m)}, I^{(m)}) = \sum_{i=1}^{K} w_i \, \delta(\theta^{(m)} - \theta^{(m)[i]}) \, \frac{1}{N} \sum_{j=1}^{N} \delta(\psi - \psi^{[ij]}). \qquad (6)$$
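The discrete approximation in Eqs. (4)-(6) can be sketched as follows, assuming hypothetical inputs: a log-score per parse (the $\log \tilde{w}_i$ of Eq. (5)) and, for each parse, the list of $N$ type-level Metropolis-Hastings samples. This only illustrates how the weighted mixture would be assembled and queried; it is not the authors' inference code.

```python
import numpy as np

def build_posterior_approx(log_parse_scores, type_samples_per_parse):
    """Normalize log parse scores into weights w_i with sum_i w_i = 1 (Eq. 5)
    and pair each weight with that parse's N type-level MH samples (Eq. 6)."""
    log_w = np.asarray(log_parse_scores, dtype=float)
    log_w -= log_w.max()                # stabilize before exponentiating
    w = np.exp(log_w)
    w /= w.sum()
    return list(zip(w, type_samples_per_parse))

def posterior_expectation(approx, f):
    """E_Q[f(psi)] under the mixture: sum_i w_i * (1/N) * sum_j f(psi^[ij])."""
    return sum(w * np.mean([f(psi) for psi in samples]) for w, samples in approx)

# Hypothetical usage with K=3 parses and N=2 type-level samples each.
approx = build_posterior_approx(
    [-10.0, -12.5, -11.0],
    [[{"n_strokes": 2}] * 2, [{"n_strokes": 3}] * 2, [{"n_strokes": 2}] * 2])
print(posterior_expectation(approx, lambda psi: psi["n_strokes"]))
```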


20 alphabets: learn the posterior from only one example


Page 9: NIPS読み会2013: One-shot learning by inverting  a compositional causal process

First, the results.

One-shot classification (error rate, %)

Human 4.5 | HBPL 4.8 | Affine 18.2 | DBM 34.8 | HD 38

One-shot generation

Visual Turing test: judges see nine examples of the same symbol and guess which set was drawn by a human ⇒ only 56% correct

‣ Better performance than the deep learning models

‣ An error rate nearly the same as humans

Page 10: NIPS読み会2013: One-shot learning by inverting  a compositional causal process

Model

Figure 3 (diagram omitted): An illustration of the HBPL model generating two character types (left and right), where the dotted line separates the type-level from the token-level variables. Legend: number of strokes κ, relations R, primitive id z (color-coded to highlight sharing), control points x (open circles), scale y, start locations L, trajectories T, transformation A, noise ε and σ_b, and image I.

“structural description” to explain the image by freely combining these elementary parts and their spatial relations. Unlike classic structural description models [27, 2], HBPL also reflects abstract causal structure about how characters are actually produced. This type of causal representation is psychologically plausible, and it has been previously theorized to explain both behavioral and neuro-imaging data regarding human character perception and learning (e.g., [7, 1, 21, 11, 12, 17]). As in most previous “analysis by synthesis” models of characters, strokes are not modeled at the level of muscle movements, so that they are abstract enough to be completed by a hand, a foot, or an airplane writing in the sky. But HBPL also learns a significantly more complex representation than earlier models, which used only one stroke (unless a second was added manually) [24, 10] or received on-line input data [9], sidestepping the challenging parsing problem needed to interpret complex characters.

The model distinguishes between character types (an ‘A’, ‘B’, etc.) and tokens (an ‘A’ drawn by a particular person), where types provide an abstract structural specification for generating different tokens. The joint distribution on types $\psi$, tokens $\theta^{(m)}$, and binary images $I^{(m)}$ is given as follows,

$$P(\psi, \theta^{(1)}, \ldots, \theta^{(M)}, I^{(1)}, \ldots, I^{(M)}) = P(\psi) \prod_{m=1}^{M} P(I^{(m)} \mid \theta^{(m)})\, P(\theta^{(m)} \mid \psi). \qquad (1)$$

Pseudocode to generate from this distribution is shown in the Supporting Information (Section SI-1).
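As a rough illustration of this factorization (not the SI-1 pseudocode itself), the sketch below samples one type and then M conditionally independent tokens and images; `sample_type`, `sample_token`, and `render_image` are hypothetical stand-ins for P(ψ), P(θ^(m) | ψ), and P(I^(m) | θ^(m)).

```python
def generate_characters(M, sample_type, sample_token, render_image, rng):
    """Sample from P(psi, theta^(1..M), I^(1..M)) as factored in Eq. 1.

    Hypothetical stand-ins:
      sample_type(rng)          ~ P(psi)          (abstract character type)
      sample_token(psi, rng)    ~ P(theta | psi)  (one person's rendition)
      render_image(theta, rng)  ~ P(I | theta)    (binary image)
    """
    psi = sample_type(rng)                 # the type is sampled once
    tokens, images = [], []
    for _ in range(M):                     # tokens/images are i.i.d. given psi
        theta = sample_token(psi, rng)
        tokens.append(theta)
        images.append(render_image(theta, rng))
    return psi, tokens, images
```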

2.1 Generating a character type

A character type $\psi = \{\kappa, S, R\}$ is defined by a set of strokes $S = \{S_1, \ldots, S_\kappa\}$ and spatial relations $R = \{R_1, \ldots, R_\kappa\}$ between strokes. The joint distribution can be written as

$$P(\psi) = P(\kappa) \prod_{i=1}^{\kappa} P(S_i)\, P(R_i \mid S_1, \ldots, S_{i-1}). \qquad (2)$$

The number of strokes $\kappa$ is sampled from a multinomial $P(\kappa)$ estimated from the empirical frequencies (Figure 4b), and the other conditional distributions are defined in the sections below. All hyperparameters, including the library of primitives (top of Figure 3), were learned from a large “background set” of character drawings, as described in Sections 2.3 and SI-4.

Strokes. Each stroke is initiated by pressing the pen down and terminated by lifting the pen up. In between, a stroke is a motor routine composed of simple movements called sub-strokes $S_i = \{s_{i1}, \ldots, s_{i n_i}\}$ (colored curves in Figure 3), where sub-strokes are separated by brief pauses of the pen.
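A minimal sketch of sampling a type under this factorization, assuming a made-up stroke-count distribution `P_KAPPA` and hypothetical helpers `sample_stroke` and `sample_relation` in place of the learned conditionals; it is meant only to show the structure of Eq. 2, not the learned model.

```python
import numpy as np

# Made-up stand-in for the empirical stroke-count distribution of Figure 4b.
P_KAPPA = np.array([0.35, 0.30, 0.18, 0.17])   # P(kappa = 1, 2, 3, 4+)

def sample_character_type(rng, sample_stroke, sample_relation):
    """Sample psi = {kappa, S, R} following the factorization of Eq. 2.

    Hypothetical stand-ins:
      sample_stroke(rng)                  -> one stroke S_i (its sub-stroke
                                             primitive ids, shapes, and scales)
      sample_relation(prev_strokes, rng)  -> R_i given S_1, ..., S_{i-1}
    """
    kappa = int(rng.choice(len(P_KAPPA), p=P_KAPPA)) + 1   # number of strokes
    strokes, relations = [], []
    for _ in range(kappa):
        relations.append(sample_relation(strokes, rng))    # relation to earlier strokes
        strokes.append(sample_stroke(rng))
    return {"kappa": kappa, "strokes": strokes, "relations": relations}
```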

Page 11: NIPS読み会2013: One-shot learning by inverting  a compositional causal process

Learning the hyperparameters

‣ Learn “common sense” about how symbols are drawn

‣ Uses the motor (pen-trajectory) data, i.e., recordings of how each character was drawn

Figure 4 (diagram omitted; panels: library of motor primitives, number of strokes, stroke start positions): Learned hyper-parameters. a) A subset of primitives, where the top row shows the most common ones. The first control point (circle) is filled. b & c) Empirical distributions, where the heatmap in c) shows how the starting point differs by stroke number.

Image. An image transformation $A^{(m)} \in \mathbb{R}^4$ is sampled from $P(A^{(m)}) = N([1, 1, 0, 0], \Sigma_A)$, where the first two elements control a global re-scaling and the second two control a global translation of the center of mass of $T^{(m)}$. The transformed trajectories can then be rendered as a 105×105 grayscale image, using an ink model adapted from [10] (see Section SI-2). This grayscale image is then perturbed by two noise processes, which make the gradient more robust during optimization and encourage partial solutions during classification. These processes include convolution with a Gaussian filter with standard deviation $\sigma_b^{(m)}$ and pixel flipping with probability $\epsilon^{(m)}$, where the amount of noise $\sigma_b^{(m)}$ and $\epsilon^{(m)}$ are drawn uniformly on a pre-specified range (Section SI-2). The grayscale pixels then parameterize 105×105 independent Bernoulli distributions, completing the full model of binary images $P(I^{(m)} \mid \theta^{(m)}) = P(I^{(m)} \mid T^{(m)}, A^{(m)}, \sigma_b^{(m)}, \epsilon^{(m)})$.
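A sketch of this binary-image likelihood, with an off-the-shelf Gaussian blur standing in for the blur step and a hypothetical `render_ink` standing in for the ink model of [10] / Section SI-2; the exact way the pixel-flip noise is folded in here is an assumption, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def log_likelihood_image(binary_image, trajectories, affine, sigma_b, epsilon,
                         render_ink):
    """log P(I | T, A, sigma_b, epsilon): blur + pixel flips + Bernoulli pixels.

    render_ink(trajectories, affine) -> grayscale ink image in [0, 1]; this is
    a hypothetical stand-in for the ink model of [10] / Section SI-2.
    """
    ink = render_ink(trajectories, affine)            # 105x105 grayscale in [0, 1]
    blurred = gaussian_filter(ink, sigma=sigma_b)     # Gaussian blur noise
    # Assumed pixel-flip mixing: each pixel is flipped with probability epsilon.
    p_on = (1.0 - epsilon) * blurred + epsilon * (1.0 - blurred)
    p_on = np.clip(p_on, 1e-6, 1.0 - 1e-6)            # numerical safety
    I = binary_image.astype(float)
    return float(np.sum(I * np.log(p_on) + (1.0 - I) * np.log(1.0 - p_on)))
```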

2.3 Learning high-level knowledge of motor programs

The Omniglot dataset was randomly split into a 30-alphabet “background” set and a 20-alphabet “evaluation” set, constrained such that the background set included the six most common alphabets as determined by Google hits. Background images, paired with their motor data, were used to learn the hyperparameters of the HBPL model, including a set of 1000 primitive motor elements (Figure 4a) and position models for a drawing’s first, second, and third stroke, etc. (Figure 4c). Wherever possible, cross-validation (within the background set) was used to decide issues of model complexity within the conditional probability distributions of HBPL. Details are provided in Section SI-4 for learning the models of primitives, positions, relations, token variability, and image transformations.


Page 12: NIPS読み会2013: One-shot learning by inverting  a compositional causal process

One-shot classification


For each image, estimate the posterior over strokes (the parse).


Figure 5 (diagram omitted; panels: binary image, thinned image, traced graph (raw), traced graph (cleaned), and candidate parses with their scores): Parsing a raw image. a) The raw image (i) is processed by a thinning algorithm [18] (ii) and then analyzed as an undirected graph [20] (iii), where parses are guided random walks (Section SI-5). b) The five best parses found for that image (top row) are shown with their log w_j (Eq. 5), where numbers inside circles denote stroke order and starting position, and smaller open circles denote sub-stroke breaks. These five parses were re-fit to three different raw images of characters (left in image triplets), where the best parse (top right) and its associated image reconstruction (bottom right) are shown above its score (Eq. 9). (The slide annotates the “train” parses as the type level and the “test” re-fits as the token level.)

Given an approximate posterior for a particular image, the model can evaluate the posterior predictive score of a new image by re-fitting the token-level variables (bottom of Figure 5b), as explained in Section 3.1 on inference for one-shot classification.

3 Results

3.1 One-shot classification

People, HBPL, and several alternative models were evaluated on a set of 10 challenging one-shot classification tasks. The tasks tested within-alphabet classification on 10 alphabets, with examples in Figure 2 and detailed in Section SI-6. Each trial (of 400 total) consists of a single test image of a new character compared to 20 new characters from the same alphabet, given just one image each produced by a typical drawer of that alphabet. Figure 1b shows two example trials.

People. Forty participants in the USA were tested on one-shot classification using Mechanical Turk. On each trial, as in Figure 1b, participants were shown an image of a new character and asked to click on another image that shows the same character. To ensure classification was indeed “one shot,” participants completed just one randomly selected trial from each of the 10 within-alphabet classification tasks, so that characters never repeated across trials. There was also an instructions quiz, two practice trials with the Latin and Greek alphabets, and feedback after every trial.

Hierarchical Bayesian Program Learning. For a test image $I^{(T)}$ and 20 training images $I^{(c)}$ for $c = 1, \ldots, 20$, we use a Bayesian classification rule for which we compute an approximate solution

$$\underset{c}{\operatorname{argmax}}\ \log P(I^{(T)} \mid I^{(c)}). \qquad (7)$$

Intuitively, the approximation uses the HBPL search algorithm to get $K = 5$ parses of $I^{(c)}$, runs $K$ MCMC chains to estimate the local type-level variability around each parse, and then runs $K$ gradient-based searches to re-optimize the token-level variables $\theta^{(T)}$ (all are continuous) to fit the test image $I^{(T)}$. The approximation can be written as (see Section SI-7 for derivation)

$$\log P(I^{(T)} \mid I^{(c)}) \approx \log \int P(I^{(T)} \mid \theta^{(T)})\, P(\theta^{(T)} \mid \psi)\, Q(\theta^{(c)}, \psi, I^{(c)})\ d\psi\, d\theta^{(c)}\, d\theta^{(T)} \qquad (8)$$

$$\approx \log \sum_{i=1}^{K} w_i\ \max_{\theta^{(T)}} P(I^{(T)} \mid \theta^{(T)})\ \frac{1}{N} \sum_{j=1}^{N} P(\theta^{(T)} \mid \psi^{[ij]}), \qquad (9)$$

where $Q(\cdot, \cdot, \cdot)$ and $w_i$ are from Eq. 6. Figure 5b shows examples of this classification score. While inference so far involves parses of $I^{(c)}$ refit to $I^{(T)}$, it also seems desirable to include parses of $I^{(T)}$ refit to $I^{(c)}$, namely $P(I^{(c)} \mid I^{(T)})$. We can re-write our classification rule (Eq. 7) to include just the reverse term (Eq. 10 center), and then to include both terms (Eq. 10 right), which is the rule we use,

$$\underset{c}{\operatorname{argmax}}\ \log P(I^{(T)} \mid I^{(c)}) = \underset{c}{\operatorname{argmax}}\ \log \frac{P(I^{(c)} \mid I^{(T)})}{P(I^{(c)})} = \underset{c}{\operatorname{argmax}}\ \log \frac{P(I^{(c)} \mid I^{(T)})}{P(I^{(c)})}\, P(I^{(T)} \mid I^{(c)}), \qquad (10)$$

This is the probability of generating the target image from the posterior-estimated type.
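A sketch of the resulting two-way decision rule (Eqs. 9–10), assuming a hypothetical `log_predictive(a, b)` that evaluates the Eq. 9 approximation of log P(a | b) and a hypothetical `log_marginal(b)` for log P(b).

```python
import numpy as np

def classify_one_shot(test_img, train_imgs, log_predictive, log_marginal):
    """Pick the training class c maximizing the two-way score of Eq. 10.

    Hypothetical stand-ins:
      log_predictive(a, b) ~ log P(a | b) via the Eq. 9 approximation
      log_marginal(b)      ~ log P(b)
    """
    scores = []
    for train_img in train_imgs:
        forward = log_predictive(test_img, train_img)    # log P(I^(T) | I^(c))
        reverse = log_predictive(train_img, test_img)    # log P(I^(c) | I^(T))
        scores.append(forward + reverse - log_marginal(train_img))
    return int(np.argmax(scores))
```

Inside `log_predictive`, the Eq. 9 approximation would combine the K parse weights, the maximized image likelihood over θ^(T), and the averaged type-level prior term.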

Page 13: NIPS読み会2013: One-shot learning by inverting  a compositional causal process

posterior  inference

1. Perform random walks over the symbol (its thinned skeleton) to obtain candidate stroke parses (150 samples)

2. Score those parses under the prior and keep only the top K (see the sketch below)
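A sketch of this candidate search, with `random_walk_parse` (one guided random walk over the thinned-image skeleton graph) and `log_prior_score` as hypothetical stand-ins for the procedure of Section SI-5.

```python
import numpy as np

def top_k_parses(skeleton_graph, random_walk_parse, log_prior_score,
                 n_candidates=150, K=5, rng=None):
    """Propose parses by random walks and keep the K best under the prior.

    Hypothetical stand-ins:
      random_walk_parse(skeleton_graph, rng) -> one candidate parse (stroke set)
      log_prior_score(parse)                 -> log-probability under the type prior
    """
    candidates = [random_walk_parse(skeleton_graph, rng)
                  for _ in range(n_candidates)]
    scores = np.array([log_prior_score(p) for p in candidates])
    best = np.argsort(scores)[::-1][:K]        # indices of the K highest scores
    return [candidates[i] for i in best], scores[best]
```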


The posterior is approximated by selecting candidate parses.


Scores from the prior ⇒ normalized (to give the weights w_i).

Page 14: NIPS読み会2013: One-shot learning by inverting  a compositional causal process

adjustment


Computing the score P(I^(T) | I^(c)):

Using the estimated type variables, the target's token variables are re-estimated by fitting them to the test image (MCMC).
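A sketch of such a re-fit using generic numerical optimization; `log_image_likelihood`, `log_token_prior`, and the `pack`/`unpack` helpers are hypothetical stand-ins, and the paper's own implementation uses a gradient-based search rather than the derivative-free method shown here.

```python
from scipy.optimize import minimize

def refit_token(test_image, psi, theta_init,
                log_image_likelihood, log_token_prior, pack, unpack):
    """Re-optimize the continuous token-level variables theta^(T) for a test image.

    Hypothetical stand-ins:
      log_image_likelihood(image, theta) -> log P(I^(T) | theta)
      log_token_prior(theta, psi)        -> log P(theta | psi)
      pack(theta) / unpack(vector)       -> flatten / restore the token variables
    """
    def neg_log_posterior(vec):
        theta = unpack(vec)
        return -(log_image_likelihood(test_image, theta) +
                 log_token_prior(theta, psi))

    result = minimize(neg_log_posterior, pack(theta_init), method="Nelder-Mead")
    return unpack(result.x), -result.fun      # refit theta and its log score
```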

Page 15: NIPS読み会2013: One-shot learning by inverting  a compositional causal process

Summary

‣ The model is heavily hand-crafted and rather ad hoc

‣ Even so, it is interesting in that it shows a machine-learning system can, like a person, classify and generate new examples from a single example

‣ It also contributes to understanding how humans extract the relevant features