
Tensor Decomposition and its Applications


An introductory slide deck on tensor decompositions and their applications in data mining.


Page 1: Tensor Decomposition and its Applications

Applications of tensor (multiway array) factorizations and decompositions in data mining

Machine learning group reading seminar, 11/10/25

@taki__taki__

Page 2: Tensor Decomposition and its Applications

Paper

Mørup, M. (2011), Applications of tensor (multiway array) factorizations and decompositions in data mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1: 24–40. doi: 10.1002/widm.1

Several figures in these slides are quoted from this paper.

Page 3: Tensor Decomposition and its Applications

Table of Contents

1. General introduction
2. Second-order tensors and matrix factorization
3. SVD: Singular Value Decomposition
4. Introduction and notation from the paper
5. Introduction to the Tucker and CP models
6. Applications and summary

Page 4: Tensor Decomposition and its Applications

• Tensor decomposition (referred to in the paper as factorization or decomposition) has become an important tool in data mining, so we introduce it here.

• (According to Wikipedia) a tensor is a generalization of linear quantities and geometric concepts; roughly speaking, it corresponds to a multidimensional array.

• We consider tensor decomposition mainly for matrices (second-order tensors) and for three-dimensional arrays (third-order tensors, i.e., collections of matrices).

Introduction

We start from matrix factorization.

Page 5: Tensor Decomposition and its Applications

Second-order tensors (1)

• Definition: a second-order tensor is a function T that assigns a real number to two arbitrary vectors and that satisfies the bilinearity conditions below for arbitrary vectors and any scalar k.

• The inner product of two vectors is one such function T. In three-dimensional space with an orthonormal basis, let T_{ij} denote the value of the tensor on the pair of basis vectors (e_i, e_j).

Page 6: Tensor Decomposition and its Applications

Second-order tensors (2)

• Arbitrary vectors u and v are expanded over the basis, so by bilinearity T(u, v) is determined by the values on the combinations {i, j} of basis vectors

Pairs of transformed basis vectors

• By bilinearity:

• A linear transformation given by a matrix takes the same form as these pairs of basis-vector transformations, so we identify the two.

Page 7: Tensor Decomposition and its Applications

Second-order tensors (3)

• The components of the vector obtained by applying a linear transformation T to an arbitrary vector can be expressed using the matrix representation of T (an ordinary matrix product).

• A linear transformation given by a matrix has the following linearity with respect to two vectors v and u:

• The inner product of the image T(v) of a vector v under the linear transformation T with a vector u is a real number, so we treat this operation as a function:

• This function satisfies bilinearity, so this T is again a second-order tensor. We therefore also represent a second-order tensor by the matrix T (a small numerical check follows below).
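As a sanity check of this correspondence, a minimal Octave sketch (the matrix and vectors are arbitrary choices, not from the slides): the bilinear form u·(Tv) computed through the matrix equals the componentwise sum Σ_{ij} u_i T_{ij} v_j.

% A second-order tensor represented by a matrix, evaluated as a bilinear form.
T = [2 0 1; 0 3 0; 1 0 2];      % arbitrary matrix representation of T
u = [1; 2; 3];
v = [4; 5; 6];

t_matrix = u' * (T * v);        % inner product of u with the image T*v

t_sum = 0;                      % the same value from the components T_{ij}
for i = 1:3
  for j = 1:3
    t_sum = t_sum + u(i) * T(i,j) * v(j);
  end
end
printf("u'(T v) = %g, sum_ij u_i T_ij v_j = %g\n", t_matrix, t_sum);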

Page 8: Tensor Decomposition and its Applications

Decomposing second-order tensors (1)

(Diagram: the world of data ↔ the world of linear algebra, connected by "transform" and "decompose".)

• Matrix representations come up fairly often: for example, a graph can be represented by its adjacency matrix, and the occurrence relation between documents and words can be represented by a matrix.

• Techniques from linear algebra are therefore used to measure the properties of such matrices and to express their features.

Page 9: Tensor Decomposition and its Applications

Decomposing second-order tensors (2)

• Since a second-order tensor is represented by a matrix, decomposing a second-order tensor is equivalent to matrix factorization. We therefore look at the singular value decomposition (SVD) as a worked example.

• The SVD is one of the basic matrix factorizations and is used in LSI (Latent Semantic Indexing). LSI uses a matrix that represents which words occur in each document.
• When word i appears in document j, the (i, j) entry of the matrix stores that information (frequency, number of occurrences, TF-IDF weight, etc.).

• The SVD transforms this matrix into relations between terms and some set of latent concepts, and between those concepts and the documents.

Page 10: Tensor Decomposition and its Applications

Table of Contents

1. General introduction
2. Second-order tensors and matrix factorization
3. SVD: Singular Value Decomposition
4. Introduction and notation from the paper
5. Introduction to the Tucker and CP models
6. Applications and summary

Page 11: Tensor Decomposition and its Applications

SVD:Singular Value Decomposition

• The idea behind the SVD is as follows. As an example, consider a matrix X of word occurrences in documents.

Each row of X: information about word t_i across the documents d_j
Each column of X: information about the words t_i within document d_j

X Xᵀ collects all inner products between word (row) vectors
Xᵀ X collects all inner products between document (column) vectors

Page 12: Tensor Decomposition and its Applications

SVD:Singular Value Decomposition

• The idea behind the SVD (continued), for the same word-document matrix X:

X Xᵀ collects all inner products between word vectors
Xᵀ X collects all inner products between document vectors

• If the matrix X can be factored as X = U Σ Vᵀ with orthogonal U, V and a diagonal Σ, then X Xᵀ = U Σ² Uᵀ and Xᵀ X = V Σ² Vᵀ, i.e., both Gram matrices are diagonalized.

Information about the words goes into U, and information about the documents goes into V. A small Octave check follows.
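The following minimal Octave sketch checks these two identities on a made-up toy term-document matrix (rows = words, columns = documents; the numbers are purely illustrative):

X = [1 0 2; 0 1 1; 3 0 0; 0 2 1];    % toy word-document occurrence matrix

[U, S, V] = svd(X);

% X*X'  : all word-word inner products,          equals U*(S*S')*U'
% X'*X  : all document-document inner products,  equals V*(S'*S)*V'
disp(norm(X*X' - U*(S*S')*U', "fro"))    % ~ 0
disp(norm(X'*X - V*(S'*S)*V', "fro"))    % ~ 0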

Page 13: Tensor Decomposition and its Applications

SVD:Singular Value Decomposition

• The SVD factors an (m, n) matrix A as A = U Σ Vᵀ.

• U: an (m, r) matrix whose columns are the r left singular vectors, i.e., the eigenvectors of A Aᵀ with nonzero eigenvalues.

• V: an (n, r) matrix whose columns are the r right singular vectors, i.e., the eigenvectors of Aᵀ A with nonzero eigenvalues.

• Σ: an (r, r) diagonal matrix of singular values (the square roots of the nonzero eigenvalues of Aᵀ A), in descending order.

• The SVD automatically exposes which dimensions matter: truncating to the largest singular values in Σ gives a good low-rank approximation of the original matrix A (see the sketch below).
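As an illustration of the truncation, a short Octave sketch (the matrix and the choice k = 2 are arbitrary): keeping only the k largest singular values gives the best rank-k approximation in the Frobenius norm, and the error equals the size of the discarded singular values.

A = rand(6, 5);                  % any (m, n) matrix
[U, S, V] = svd(A);

k  = 2;                          % keep the k largest singular values
Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';

s   = diag(S);
err = norm(A - Ak, "fro");       % actual approximation error
ref = norm(s(k+1:end));          % size of the discarded singular values
printf("||A - Ak||_F = %g (discarded singular values give %g)\n", err, ref);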

Page 14: Tensor Decomposition and its Applications

octave-3.4.0:4> A = [1 2 3; 4 5 6; 7 8 9];
octave-3.4.0:5> [U, S, V] = svd(A);   # svd returns U, then the singular values S, then V
octave-3.4.0:6> U
  -0.21484   0.88723   0.40825
  -0.52059   0.24964  -0.81650
  -0.82634  -0.38794   0.40825
octave-3.4.0:7> S
  1.6848e+01   0            0
  0            1.0684e+00   0
  0            0            1.4728e-16
octave-3.4.0:8> V
  -0.479671  -0.776691   0.408248
  -0.572368  -0.075686  -0.816497
  -0.665064   0.625318   0.408248

(The third singular value is about 1.5e-16, i.e., numerically zero: this A has rank 2.)

Page 15: Tensor Decomposition and its Applications

Table of Contents

1. General introduction
2. Second-order tensors and matrix factorization
3. SVD: Singular Value Decomposition
4. Introduction and notation from the paper
5. Introduction to the Tucker and CP models
6. Applications and summary

Page 16: Tensor Decomposition and its Applications

Introduction of the paper

• (Wikipedia) A tensor corresponds, roughly speaking, to a multidimensional array: a second-order tensor (matrix) is a two-dimensional array, and a third-order tensor is a three-dimensional array.

• Just as in the second-order examples, suppose various kinds of data are collected as arrays of three or more dimensions. As with the SVD, we would like to decompose such a tensor into a few factors and interpret them.

• The paper explains the tensor decompositions called the Candecomp/Parafac (CP) model and the Tucker model (there is a lot of freedom here, and many other models exist).

Page 17: Tensor Decomposition and its Applications

Tensor notation (1)

• An order-N real tensor is written X ∈ R^{I_1×I_2×...×I_N}, with elements x_{i_1,i_2,...,i_N}. As a simple example, consider third-order tensors A and B, and let α be a real number.
• Scalar multiplication
• Addition of tensors
• Inner product of tensors
• Frobenius norm
• Matricizing / un-matricizing of a tensor

Quoted from the paper (section "Tensor Nomenclature"):

Tensors and multiway arrays, also referred to as hypermatrices, are normally written in calligraphic letters. A general real tensor of order N is written X ∈ R^{I_1×I_2×...×I_N}; its size is denoted compactly as X^{I_1×I_2×...×I_N}, and a given element of the tensor X is denoted x_{i_1,i_2,...,i_N}. The basic notation and operations below are given for third-order tensors for clarity, but they generalize trivially to tensors of arbitrary order.

Consider the third-order tensors A^{I×J×K} and B^{I×J×K}. Scalar multiplication, addition of two tensors, and the inner product between two tensors are given by

  αB = C, where c_{i,j,k} = α b_{i,j,k}   (1)
  A + B = C, where c_{i,j,k} = a_{i,j,k} + b_{i,j,k}   (2)
  ⟨A, B⟩ = Σ_{i,j,k} a_{i,j,k} b_{i,j,k}   (3)

As such, the Frobenius norm of a tensor is given by ‖A‖_F = √⟨A, A⟩.

The nth-mode matricizing and un-matricizing operations map a tensor into a matrix and a matrix into a tensor, respectively, i.e.,

  X^{I_1×I_2×...×I_N} → (matricizing) → X_(n)^{I_n × I_1·I_2···I_{n−1}·I_{n+1}···I_N}   (4)
  X_(n)^{I_n × I_1·I_2···I_{n−1}·I_{n+1}···I_N} → (un-matricizing) → X^{I_1×I_2×...×I_N}   (5)

The matricizing operation for a third-order tensor is illustrated in Figure 1. The n-mode multiplication of an order-N tensor X^{I_1×...×I_N} with a matrix M^{J×I_n} is given by

  X ×_n M = X •_n M = Z^{I_1×...×I_{n−1}×J×I_{n+1}×...×I_N},   (6)
  z_{i_1,...,i_{n−1},j,i_{n+1},...,i_N} = Σ_{i_n=1}^{I_n} x_{i_1,...,i_{n−1},i_n,i_{n+1},...,i_N} m_{j,i_n}.   (7)

Using the matricizing operation, this corresponds to Z_(n) = M X_(n). As a result, the matrix products underlying the singular value decomposition (SVD) can be written as U S Vᵀ = S ×_1 U ×_2 V = S ×_2 V ×_1 U, as the order of the multiplications does not matter. The outer product of three vectors a, b, and c is given by

  X = a ∘ b ∘ c, such that x_{i,j,k} = a_i b_j c_k   (8)

The Kronecker product is given by

  P^{I×J} ⊗ Q^{K×L} = R^{IK×JL}, such that r_{k+K(i−1), l+L(j−1)} = p_{ij} q_{kl}   (9)
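A minimal Octave sketch of a few of these operations (sizes are arbitrary; the matricization uses Octave's column-major ordering, so the column order of X_(n) may differ from other conventions):

I = 4; J = 3; K = 2;
A = randn(I, J, K);
B = randn(I, J, K);

ip   = sum(A(:) .* B(:));      % inner product <A, B>, Eq. (3)
nrmF = sqrt(sum(A(:).^2));     % Frobenius norm ||A||_F = sqrt(<A, A>)

% mode-n matricizing, Eq. (4): mode n becomes the row index
X1 = reshape(A, I, J*K);                     % mode-1 unfolding, size I x (J*K)
X2 = reshape(permute(A, [2 1 3]), J, I*K);   % mode-2 unfolding, size J x (I*K)
X3 = reshape(permute(A, [3 1 2]), K, I*J);   % mode-3 unfolding, size K x (I*J)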


(In the matricizing operation, the tensor is opened up around the nth mode: mode n becomes the rows.)

Page 18: Tensor Decomposition and its Applications

Illustration of matricizing

(Figure 1 from the paper: the matricizing operation on a third-order tensor of size 4 × 4 × 4.)


(Quoted from the paper.)



Page 19: Tensor Decomposition and its Applications

Tensor notation (2)

• n-mode product: the n-mode product of a tensor X and a matrix M is written X ×_n M; it is defined by Eqs (6)-(7) above (a sum over the nth index).

Using the matricizing operator, the SVD of a matrix A, A = U Σ Vᵀ, can be written with n-mode products as Σ ×_1 U ×_2 V.

(These are stated in the paper as properties of the n-mode product; a small Octave sketch follows.)
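A minimal Octave sketch of the n-mode product via matricizing (Z_(n) = M X_(n)); the sizes are arbitrary and the example multiplies along mode 2:

I = 4; J = 3; K = 2;
X = randn(I, J, K);
M = randn(5, J);                              % M has size (new dimension) x I_n

X2 = reshape(permute(X, [2 1 3]), J, I*K);    % matricize along mode 2
Z2 = M * X2;                                  % Z_(2) = M * X_(2)
Z  = permute(reshape(Z2, 5, I, K), [2 1 3]);  % un-matricize: Z has size I x 5 x K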

Page 20: Tensor Decomposition and its Applications

Tensor notation (3)

• The outer product of three vectors forms a third-order tensor

• Kronecker product

• Khatri-Rao product (a column-wise Kronecker product)

These products have convenient properties when computing the Moore-Penrose pseudoinverse (Eqs (11)-(12) in the excerpt on the next slide); a small Octave sketch follows.
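A minimal Octave sketch of the Kronecker and Khatri-Rao products and of property (11) (the sizes are arbitrary; there is no built-in Khatri-Rao function, so the column-wise loop below is an illustrative stand-in):

A = randn(4, 3);
B = randn(5, 3);                 % same number of columns as A

KR = zeros(size(A,1)*size(B,1), size(A,2));
for j = 1:size(A, 2)
  KR(:, j) = kron(A(:, j), B(:, j));   % column-wise Kronecker product (Khatri-Rao)
end

P = randn(2, 3);  Q = randn(4, 5);
% Property (11): the pseudoinverse of a Kronecker product factorizes
disp(norm(pinv(kron(P, Q)) - kron(pinv(P), pinv(Q)), "fro"))   % ~ 0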

Page 21: Tensor Decomposition and its Applications

FIGURE 1 | The matricizing operation on a third-order tensor of size 4 × 4 × 4.

The Khatri–Rao product is defined as a column-wise Kronecker product,

  A^{I×J} |⊗| B^{K×J} = A^{I×J} ⊙ B^{K×J} = C^{IK×J}, such that c_{k+K(i−1), j} = a_{ij} b_{kj}.   (10)

An important property when calculating the Moore–Penrose inverse (i.e., A† = (AᵀA)⁻¹Aᵀ) of Kronecker and Khatri–Rao products is

  (P ⊗ Q)† = P† ⊗ Q†   (11)
  (A ⊙ B)† = [(AᵀA) ∗ (BᵀB)]⁻¹ (A ⊙ B)ᵀ   (12)

where ∗ denotes elementwise multiplication. This reduces the complexity from O(J³L³) to O(max{IJ², KJ², J³, L³}) and from O(IKJ²) to O(max{IKJ, IJ², KJ², J³}), respectively.

TABLE 1 | Summary of the utilized variables and operations (calligraphic, bold uppercase, bold lowercase, and plain letters denote tensors, matrices, vectors, and scalars, respectively):
  ⟨A, B⟩ (inner product): ⟨A, B⟩ = Σ_{i,j,k} a_{i,j,k} b_{i,j,k}
  ‖A‖_F (Frobenius norm): √⟨A, A⟩
  X_(n) (matricizing): X^{I_1×I_2×...×I_N} → X_(n)^{I_n × I_1·I_2···I_{n−1}·I_{n+1}···I_N}
  ×_n or •_n (n-mode product): X ×_n M = Z, where Z_(n) = M X_(n)
  ∘ (outer product): a ∘ b = Z, where z_{i,j} = a_i b_j
  ⊗ (Kronecker product): A ⊗ B = Z, where z_{k+K(i−1), l+L(j−1)} = a_{ij} b_{kl}
  ⊙ or |⊗| (Khatri–Rao product): A ⊙ B = Z, where z_{k+K(i−1), j} = a_{ij} b_{kj}
  k_A (k-rank): the maximal number of columns of A guaranteed to be linearly independent.

(Quoted from the paper.)

Page 22: Tensor Decomposition and its Applications

Table of Contents

1. General introduction
2. Second-order tensors and matrix factorization
3. SVD: Singular Value Decomposition
4. Introduction and notation from the paper
5. Introduction to the Tucker and CP models
6. Applications and summary

Page 23: Tensor Decomposition and its Applications

Introduction to the Tucker and CP models

• The Tucker and CP models are widely used tensor decomposition methods. The paper explains them for third-order tensors.

FIGURE 2 | Illustration of the Tucker model of a third-order tensor X. The model decomposes the tensor into loading matrices with a mode-specific number of components as well as a core array accounting for all multilinear interactions between the components of each mode. The Tucker model is particularly useful for compressing tensors into a reduced representation given by the smaller core array G.

FIGURE 3 | Illustration of the CANDECOMP/PARAFAC (CP) model of a third-order tensor X. The model decomposes a tensor into a sum of rank-one components, and the model is very appealing due to its uniqueness properties.

The paper introduces the CP model as the special case of the Tucker model in which the core array is diagonal, X^{I×J×K} ≈ D^{D×D×D} ×_1 A^{I×D} ×_2 B^{J×D} ×_3 C^{K×D}. This restriction leads to uniqueness of the representation, with the uniqueness criterion k_A + k_B + k_C ≥ 2D + 2 (Kruskal), where k_A denotes the k-rank of A; on the other hand, degenerate CP solutions with highly correlated components can occur when the chosen number of components does not match the data.

Tucker model / CP model (figures quoted from the paper)

Page 24: Tensor Decomposition and its Applications

TABLE 2 | Overview of the most common tensor decomposition models (details and references in Refs 24, 28, and 44):

  CP: x_{i,j,k} ≈ Σ_d a_{i,d} b_{j,d} c_{k,d} — unique. The minimal D for which the approximation is exact is called the rank of the tensor; the model is in general unique.
  Tucker: x_{i,j,k} ≈ Σ_{l,m,n} g_{l,m,n} a_{i,l} b_{j,m} c_{k,n} — not unique. The minimal L, M, N for which the approximation is exact is called the multilinear rank of the tensor.
  Tucker2: x_{i,j,k} ≈ Σ_{l,m} g_{l,m,k} a_{i,l} b_{j,m} — not unique. Tucker model with an identity loading matrix along one of the modes.
  Tucker1: x_{i,j,k} ≈ Σ_l g_{l,j,k} a_{i,l} — not unique. Tucker model with identity loading matrices along two of the modes; equivalent to regular matrix decomposition.
  PARAFAC2: x_{i,j,k} ≈ Σ_d a_{i,d} b^{(k)}_{j,d} c_{k,d}, s.t. Σ_j b^{(k)}_{j,d} b^{(k)}_{j,d′} = ψ_{d,d′} — unique. Imposes consistency in the covariance structure of one of the modes; well suited to shape changes, and the second mode can vary in dimensionality.
  INDSCAL: x_{i,j,k} ≈ Σ_d a_{i,d} a_{j,d} c_{k,d} — unique. Imposes symmetry on two modes of the CP model.
  Symmetric CP: x_{i,j,k} ≈ Σ_d a_{i,d} a_{j,d} a_{k,d} — unique. Imposes symmetry on all modes; useful in the analysis of higher-order statistics.
  CANDELINC: a CP model with linear constraints on the loadings; can be considered a Tucker decomposition whose core array itself has CP structure — not unique.
  DEDICOM: x_{i,j,k} ≈ Σ_{d,d′} a_{i,d} b_{k,d} r_{d,d′} b_{k,d′} a_{j,d′} — unique. Can capture asymmetric relationships between two modes that index the same type of object.
  PARATUCK2: x_{i,j,k} ≈ Σ_{d,e} a_{i,d} b_{k,d} r_{d,e} s_{k,e} t_{j,e} — unique. A generalization of DEDICOM that can consider interactions between two possibly different sets of objects.
  Block Term Decomposition: x_{i,j,k} ≈ Σ_r Σ_{l,m,n} g^{(r)}_{l,m,n} a^{(r)}_{i,l} b^{(r)}_{j,m} c^{(r)}_{k,n} — unique. A sum over R Tucker models of varying sizes; the CP and Tucker models are natural special cases.
  ShiftCP: x_{i,j,k} ≈ Σ_d a_{i,d} b_{j−τ_{i,d},d} c_{k,d} — unique. Can model latency changes across one of the modes.
  ConvCP: x_{i,j,k} ≈ Σ_{τ=1}^{T} Σ_d a_{i,d,τ} b_{j−τ,d} c_{k,d} — uniqueness depends on T. Can model shape and latency changes across one of the modes; when T = J the model can be reduced to regular matrix factorization.

The quoted pages also note the HOSVD, a special case of the Tucker model in which the loadings of each mode are determined solely by the SVD of the matricized array (A S^{(1)} V^{(1)ᵀ} = X_(1), and similarly for B and C). Although this strikingly resembles the SVD, it is not guaranteed to find an optimal compression, so the HOSVD is commonly used as an initialization that is refined by other Tucker estimation approaches.

(Quoted from the paper.)

Page 25: Tensor Decomposition and its Applications

Tucker model (1)

• The Tucker model splits a third-order tensor X into a core array G and loading matrices for the three modes.

Definition via the n-mode product (quoted from the paper): the Tucker model for a third-order tensor X^{I×J×K} reads

  X^{I×J×K} ≈ Σ_{l,m,n} g_{l,m,n} a_l ∘ b_m ∘ c_n, such that x_{i,j,k} ≈ Σ_{l,m,n} g_{l,m,n} a_{i,l} b_{j,m} c_{k,n},

where the so-called core array G^{L×M×N} with elements g_{l,m,n} accounts for all possible linear interactions between the components of each mode. To indicate how many vectors pertain to each modality, the model is also denoted a Tucker(L, M, N) model. Using the n-mode tensor product ×_n, the model can be written as

  X^{I×J×K} ≈ G^{L×M×N} ×_1 A^{I×L} ×_2 B^{J×M} ×_3 C^{K×N}.

Each mode of the array is approximately spanned by the loading matrix for that mode, and the vectors of each modality interact with the vectors of all remaining modalities with strengths given by the core tensor G (see Figure 2).
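A minimal Octave sketch of this definition (the sizes are arbitrary choices): it builds X from a random core and random loadings by summing the weighted outer products g_{l,m,n} a_l ∘ b_m ∘ c_n, which is the same tensor as G ×_1 A ×_2 B ×_3 C.

I = 6; J = 5; K = 4;  L = 3; M = 2; N = 2;    % illustrative sizes
A = randn(I, L);  B = randn(J, M);  C = randn(K, N);
G = randn(L, M, N);                           % core array

X = zeros(I, J, K);
for l = 1:L
  for m = 1:M
    for n = 1:N
      % outer product a_l o b_m o c_n, weighted by the core element g_{l,m,n}
      X = X + G(l,m,n) * reshape(kron(C(:,n), kron(B(:,m), A(:,l))), I, J, K);
    end
  end
end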

Page 26: Tensor Decomposition and its Applications

Tucker model (2)

• The Tucker model is not unique: for any invertible matrices Q^{L×L}, R^{M×M}, and S^{N×N}, an equivalent representation is obtained as X ≈ (G ×_1 Q ×_2 R ×_3 S) ×_1 (AQ⁻¹) ×_2 (BR⁻¹) ×_3 (CS⁻¹).

• Using the n-mode product, the matricizing operator, and the Kronecker product, the Tucker model can be written as X_(1) ≈ A G_(1) (C ⊗ B)ᵀ, X_(2) ≈ B G_(2) (C ⊗ A)ᵀ, X_(3) ≈ C G_(3) (B ⊗ A)ᵀ.

• Models that do not fully decompose a third-order tensor along all three modes are called the Tucker2 and Tucker1 models (one or two loading matrices are fixed to the identity); the Tucker1 model is equivalent to regular matrix decomposition.

Page 27: Tensor Decomposition and its Applications

Tucker model (3)

• The Tucker model is estimated by updating the components of each mode in turn; when the objective is least squares, this procedure is called ALS (alternating least squares). The resulting updates are quoted below.

Model estimation (quoted from the paper): fitting the Tucker model by ALS reduces the estimation to a sequence of regular matrix factorization problems, and for least-squares minimization each mode can be solved by pseudoinverses, i.e.,

  A ← X_(1) (G_(1) (C ⊗ B)ᵀ)†
  B ← X_(2) (G_(2) (C ⊗ A)ᵀ)†
  C ← X_(3) (G_(3) (B ⊗ A)ᵀ)†
  G ← X ×_1 A† ×_2 B† ×_3 C†.

These procedures are unfortunately not guaranteed to converge to the global optimum.
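A minimal Octave sketch of one ALS sweep following the updates above (sizes, initialization, and the single sweep are illustrative; convergence checks are omitted, and the unfoldings use Octave's column-major ordering):

X = randn(6, 5, 4);                  % data tensor (illustrative)
[I, J, K] = size(X);
L = 3; M = 3; N = 3;                 % Tucker(L, M, N) sizes

A = randn(I, L);  B = randn(J, M);  C = randn(K, N);
G = randn(L, M, N);

X1 = reshape(X, I, J*K);                     % mode-1, -2, -3 unfoldings
X2 = reshape(permute(X, [2 1 3]), J, I*K);
X3 = reshape(permute(X, [3 1 2]), K, I*J);

G1 = reshape(G, L, M*N);                     % matching unfoldings of the core
A  = X1 * pinv(G1 * kron(C, B)');            % A <- X_(1) (G_(1) (C (x) B)')^+
G2 = reshape(permute(G, [2 1 3]), M, L*N);
B  = X2 * pinv(G2 * kron(C, A)');            % B <- X_(2) (G_(2) (C (x) A)')^+
G3 = reshape(permute(G, [3 1 2]), N, L*M);
C  = X3 * pinv(G3 * kron(B, A)');            % C <- X_(3) (G_(3) (B (x) A)')^+

% G <- X x_1 A^+ x_2 B^+ x_3 C^+  (core update, in mode-1 matricized form)
G1 = pinv(A) * X1 * kron(pinv(C), pinv(B))';
G  = reshape(G1, L, M, N);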


• An orthogonality constraint can be imposed on the Tucker model; this constraint simplifies the estimation.

The left singular vectors obtained by SVD are used as the initial values of A, B, and C.



• Here "estimation" should be read as the search for a better decomposition; the simplified model obtained this way is called the HOSVD.
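To make the orthogonality-constrained estimation (HOOI) and the SVD-based initialisation in the bullets above concrete, a minimal Octave sketch could look as follows. It assumes a third-order array X; the helper `unfold`, the fixed number of sweeps, and the use of svd(..., 'econ') are my own illustrative choices, not the paper's reference implementation.

% Minimal HOOI sketch (illustrative only); `unfold` matricizes a third-order
% array along mode n, as in the earlier sketch.
unfold = @(T, n) reshape(permute(T, [n, setdiff(1:3, n)]), size(T, n), []);
L = 2; M = 3; N = 2;                       % target Tucker(L, M, N) sizes

% Initialise A, B, C with the leading left singular vectors of each unfolding.
[U1, S1, V1] = svd(unfold(X, 1), 'econ'); A = U1(:, 1:L);
[U2, S2, V2] = svd(unfold(X, 2), 'econ'); B = U2(:, 1:M);
[U3, S3, V3] = svd(unfold(X, 3), 'econ'); C = U3(:, 1:N);

for it = 1:50                              % fixed number of sweeps for simplicity
  [U1, S1, V1] = svd(unfold(X, 1) * kron(C, B), 'econ'); A = U1(:, 1:L);
  [U2, S2, V2] = svd(unfold(X, 2) * kron(C, A), 'econ'); B = U2(:, 1:M);
  [U3, S3, V3] = svd(unfold(X, 3) * kron(B, A), 'econ'); C = U3(:, 1:N);
end

% With orthonormal loadings the pseudoinverse is just the transpose,
% so the core is G = X x1 A' x2 B' x3 C'.
G = reshape(A' * unfold(X, 1) * kron(C, B), L, M, N);

A real implementation would monitor the change in fit instead of running a fixed number of sweeps.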

Page 28: Tensor Decomposition and its Applications

CP model (1)

• The CP model was conceived as a special case of the Tucker model. In the CP model the core array has the same size along every mode, L = M = N. The decomposition is defined as below.

The constraint is that only the diagonal elements of the core are nonzero, so interactions occur only between components that share the same index.

• Because of this restriction the CP model has a unique core; multiplying by invertible matrices

merely rescales it, since D is diagonal.
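As a quick numerical illustration of this definition (a sum of D rank-one terms in which only components sharing the same index interact), the following Octave sketch builds a small CP tensor from factor matrices; the sizes and variable names are illustrative assumptions of mine.

% Minimal sketch: build a third-order CP tensor as a sum of D rank-one terms,
%   x(i,j,k) = sum_d A(i,d) * B(j,d) * C(k,d).
I = 4; J = 5; K = 3; D = 2;                        % arbitrary example sizes
A = randn(I, D); B = randn(J, D); C = randn(K, D);

X = zeros(I, J, K);
for d = 1:D
  for k = 1:K
    % slice k of the d-th rank-one term a_d o b_d o c_d
    X(:, :, k) += C(k, d) * (A(:, d) * B(:, d)');
  end
end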

Page 29: Tensor Decomposition and its Applications

CP model (2)

• The CP model is estimated as follows; in order to remove the scaling ambiguity,

the model fixes all diagonal elements of the core to 1.

FIGURE 3 | Illustration of the CANDECOMP/PARAFAC (CP) model of a third-order tensor $\mathcal{X}$. The model decomposes a tensor into a sum of rank-one components, and the model is very appealing due to its uniqueness properties.

For invertible matrices $Q^{D\times D}$, $R^{D\times D}$, and $S^{D\times D}$, we find

$\mathcal{X} \approx (\mathcal{D} \times_1 Q \times_2 R \times_3 S) \times_1 (A Q^{-1}) \times_2 (B R^{-1}) \times_3 (C S^{-1}) = \tilde{\mathcal{D}} \times_1 \tilde{A} \times_2 \tilde{B} \times_3 \tilde{C}.$

As such, the new core $\tilde{\mathcal{D}}$ must be nonzero only along the diagonal for the representation to remain a CP model. In practice this has the consequence that $Q$, $R$, and $S$ can only be scale and permutation matrices with identical permutation. In Refs 34, 35 the uniqueness properties of the CP model were thoroughly investigated and, among several results, the following uniqueness criterion was derived:

$k_A + k_B + k_C \geq 2D + 2.$  (13)

Here, the Kruskal rank or k-rank $k_A$ of a matrix $A$ is the maximal number $r$ such that any set of $r$ columns of $A$ is linearly independent; therefore $k_A \leq \operatorname{rank}(A) \leq D$, where $D$ is the number of components. The notion of k-rank is closely related to the notion of spark in compressed sensing, and while the k-rank is NP-hard to compute, it can be bounded by measures of coherence. In the presence of noise with a continuous probability distribution, and when all the dimensions of the tensor are larger than $D$, we have in practice $k_A = k_B = k_C = D$. As Kruskal wrote in Ref 34, struck by his own uniqueness criterion:

'A surprising fact is that the nonrotatability characteristic can hold even when the number of factors extracted is greater than every dimension of the three-way array.'

The criterion has been generalized to order-N arrays in Ref 37.

The uniqueness property of the optimal CP solution is perhaps the most appealing aspect of the CP model. Uniqueness of matrix decomposition has been a longstanding challenge that spurred a great deal of research early on in the psychometrics literature, where rotational approaches such as VARIMAX were proposed (see also Refs 1, 2, 31, and references therein), and lately in the signal processing literature, where methods based on statistical independence (see also 'Signal Processing') have been derived in order to disambiguate matrix decompositions. Thus, contrary to the regular matrix factorization approaches, the CP model admits a unique representation of the data.

Unfortunately, uniqueness sometimes comes at a price, as CP degenerate solutions are known to occur, i.e., solutions in which the component loadings are highly correlated in all the modes and where the solution is not physically meaningful but rather a mathematical artifact caused by the inability of CP to model that particular tensor with that particular number of components. This makes the CP estimation unstable, slow in convergence, and difficult to interpret because the components are dominated by strong cancellation effects between the various components in the model. An example of CP degeneracy is given in Figure 6.

Model Estimation. As with the Tucker model, the CP decomposition does not admit any known closed-form solution, and in general there is no guarantee that the optimal solution, even if it exists, can be identified. Although a direct fitting approach based on a generalized eigenvalue problem with fixed computational complexity can be employed, the optimization criterion is not strictly well defined in terms of the least squares objective, and the solutions obtained have been found to be inferior to the following ALS approach.

Parameter estimation by ALS is widely used because of its ease of implementation through the Khatri–Rao product and the matricizing operations. For a third-order CP model, we can remove the scaling ambiguity between the diagonal elements of the core and the loadings by fixing the diagonal core elements to 1, such that the CP model can be written as

$\mathcal{X}^{I\times J\times K} \approx \sum_d a_d \circ b_d \circ c_d$, such that $x_{i,j,k} \approx \sum_d a_{i,d}\, b_{j,d}\, c_{k,d}.$

Using the matricizing and Khatri–Rao product ($\odot$) operations, this is equivalent to

$X_{(1)} \approx A (C \odot B)^\top$,
$X_{(2)} \approx B (C \odot A)^\top$,
$X_{(3)} \approx C (B \odot A)^\top.$
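The Khatri–Rao product is simply a column-wise Kronecker product, so the matricized CP forms above can be checked with a few lines of Octave. The helpers `khatrirao` and `unfold` are assumed names of my own, and X, A, B, C refer to the small CP tensor constructed in the earlier sketch.

% Minimal sketch: Khatri-Rao product (column-wise Kronecker) and the matricized
% CP identity X(1) = A * khatrirao(C, B)'.
khatrirao = @(U, V) cell2mat(arrayfun(@(d) kron(U(:, d), V(:, d)), ...
                                      1:size(U, 2), 'UniformOutput', false));
unfold    = @(T, n) reshape(permute(T, [n, setdiff(1:3, n)]), size(T, n), []);

% X, A, B, C as in the rank-one construction sketched above.
err1 = norm(unfold(X, 1) - A * khatrirao(C, B)', 'fro');
err2 = norm(unfold(X, 2) - B * khatrirao(C, A)', 'fro');
err3 = norm(unfold(X, 3) - C * khatrirao(B, A)', 'fro');
printf('CP matricization residuals: %g %g %g\n', err1, err2, err3);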

• For the least-squares objective, the updates become:

For the least squares objective we thus find

$A \leftarrow X_{(1)} (C \odot B)(C^\top C \ast B^\top B)^{-1}$
$B \leftarrow X_{(2)} (C \odot A)(C^\top C \ast A^\top A)^{-1}$
$C \leftarrow X_{(3)} (B \odot A)(B^\top B \ast A^\top A)^{-1},$

where $\ast$ denotes the elementwise (Hadamard) product. However, some calculations are redundant between the alternating steps. Thus, the following approach based on premultiplying the largest mode(s) with the data is more computationally efficient. Multiplying the first mode with the data when updating the second and third modes of a third-order array gives

$A \leftarrow X_{(1)} (C \odot B)(C^\top C \ast B^\top B)^{-1}, \qquad \tilde{X}_{(1)} = A^\top X_{(1)}$
$B \leftarrow \tilde{X}_{(2)} (C \odot I)(C^\top C \ast A^\top A)^{-1}$
$C \leftarrow \tilde{X}_{(3)} (B \odot I)(B^\top B \ast A^\top A)^{-1}.$

We will, without loss of generality, assume $I \geq J \geq K$; hence, the above approach reduces the cost of the updates of $B$ and $C$ from $O(IJKD)$ to $O(\max\{JD^2, D^3\})$ when taking advantage of the sparsity structure of the Khatri–Rao products, where $D$ is the number of components in the CP model. Commonly, the above ALS algorithm is run multiple times with different initializations to avoid identifying a local minimum.

Although the ALS algorithms for CP and Tucker estimation are widely used, ALS can suffer from slow convergence. ALS converges at most linearly and in practice can be extremely slow, particularly in cases of high collinearity between the factors. Alternative estimation approaches such as Levenberg–Marquardt, conjugate gradient, and enhanced line search have been shown to improve convergence. For further details on CP and Tucker model estimation and alternatives to ALS optimization, including complexity analysis and performance evaluation, see also Refs 24, 26, 27, and references therein.

Rank and Multilinear Rank of a Tensor. The rank of a tensor is given by the minimal number $R$ of rank-one components such that

$\mathcal{X} = \sum_{r=1}^{R} a_r \circ b_r \circ c_r.$  (14)

Notice that, contrary to the matrix case, the rank of a tensor can be greater than $\min(I, J, K)$. Furthermore, for tensor decomposition over the real field a $2\times 2\times 2$ tensor can be both rank 2 and rank 3, whereas in the complex field $2\times 2\times 2$ tensors generically have rank 2. The fact that the typical rank of a tensor can take more than one value is specific to the real field. One major difference between matrix and tensor rank is that there is no straightforward algorithm to determine the rank of a tensor; in practice, the rank of a tensor is determined numerically by fitting CP models for different $R$. Using the Tucker model representation, a third-order tensor is said to have multilinear rank-(L, M, N) if its mode-1 rank, mode-2 rank, and mode-3 rank are equal to $L$, $M$, and $N$, respectively:

$\mathcal{X} = \sum_{l,m,n}^{L,M,N} g_{l,m,n}\, a_l \circ b_m \circ c_n.$  (15)

Although the Tucker model, due to its orthogonal representation, is useful for projection onto tensorial subspaces (i.e., compression), the CP model is by definition outer-product rank revealing and often of interest because of its unique and easily interpreted representations.

Missing Values. Missing data can arise in a variety of settings because of loss of information, errors in the data collection process, or costly experiments. In the case of missing data, a standard practice is to impute missing values, i.e., the missing element $x_{i,j,k}$ is replaced by the estimate $r_{i,j,k}$ of that element from the decomposition model, starting from an initial guess of these missing values (this is also referred to as expectation maximization). An alternative approach is marginalization, where the missing values are ignored (i.e., marginalized) during optimization; for least squares estimation this amounts to the objective

$\sum_{i,j,k} w_{i,j,k}\,(x_{i,j,k} - r_{i,j,k})^2,$

where $w_{i,j,k} = 1$ if $x_{i,j,k}$ is present and $w_{i,j,k} = 0$ if $x_{i,j,k}$ is missing. Although methods based on imputation are often easier to implement (i.e., alternating least squares can be applied directly), they are useful only as long as the amount of missing data is relatively small, since their performance tends to degrade for large amounts of missing data as the intermediate models used for imputation have an increased risk of converging to a wrong solution (see also Refs 27, 40, and references therein). Factorizing tensors based on the CP model has been shown to recover the true underlying components from noisy data with up to 99% of the data missing for third-order tensors, whereas the corresponding two-way methods become rather unstable already with 25–40% of the data missing. This has been attributed to the fact that there are fewer free parameters $p$ relative to observations for models accounting for multilinear dynamics, i.e., the CP model has $p = D(I + J + K)$ and the Tucker model has $p = IL + JM + KN$ (when considering the core a deterministic function of the loadings).
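Putting the pieces together, a plain ALS loop for the CP model might look as follows in Octave. This is only a sketch of the unpremultiplied updates, with `khatrirao` and `unfold` as in the earlier sketches, a random initialisation, and a fixed number of sweeps; none of these choices is prescribed by the paper.

% Minimal CP-ALS sketch for a third-order array X (illustrative only).
D = 2;                                              % number of components
[I, J, K] = size(X);
A = randn(I, D); B = randn(J, D); C = randn(K, D);  % random initialisation

for it = 1:100                                      % fixed number of sweeps
  A = unfold(X, 1) * khatrirao(C, B) / ((C' * C) .* (B' * B));
  B = unfold(X, 2) * khatrirao(C, A) / ((C' * C) .* (A' * A));
  C = unfold(X, 3) * khatrirao(B, A) / ((B' * B) .* (A' * A));
end

% Relative reconstruction error of the fitted model.
Xhat    = reshape(A * khatrirao(C, B)', I, J, K);
rel_err = norm(X(:) - Xhat(:)) / norm(X(:));
printf('relative reconstruction error: %g\n', rel_err);

As the excerpt notes, in practice the loop would be run from several different initialisations and stopped with a convergence criterion rather than after a fixed number of sweeps.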

Page 30: Tensor Decomposition and its Applications

Table of Contents

1. General introduction
2. Second-order tensors and matrix decomposition
3. SVD: Singular Value Decomposition
4. Introduction and notation of the paper
5. The Tucker and CP models
6. Applications and summary

Page 31: Tensor Decomposition and its Applications

• The paper discusses seven application areas: psychology, chemistry, neuroscience, signal processing, bioinformatics, computer vision, and web mining.

Applications of tensor decomposition

Fast prototyping and handling of sparse multiway arrays in Matlab is provided by the Tensor Toolbox. For additional software, see also Refs 4, 24.

TENSOR FACTORIZATION FOR DATA MINING. The first applications of tensor decomposition were within the field of psychology in the 1970s, when the CP model was demonstrated to alleviate the rotational ambiguity in factor analysis while the framework made it possible to address higher-order interactions. In 1981 Appellof and Davidson pioneered the use of the CP model in chemistry for the analysis of fluorescence data, and Möcks demonstrated in 1988 how the CP model was useful in the analysis of multisubject evoked potentials of electroencephalography (EEG) data by reinventing the model under the name topographic component model. Since then, tensor decompositions have found wide use in practically all fields of science, ranging from signal processing, computer vision, and bioinformatics to web mining. In many of these studies it has been shown that tensor decomposition can explore relations and interactions between the modes of the data that are lost when resorting to traditional matrix approaches. In particular, tensor decomposition efficiently extracts the consistent patterns of activation while giving an intuitive account of how the measurements of each mode interact. Tensor decomposition has not only proven useful for redundancy reduction (i.e., compression) but, for many types of data, has also been shown to account well for the underlying physics/dynamics of the system generating the data. In the following, some of the key applications of tensor factorization in data mining are given across a multitude of scientific fields, more or less in their historical order. This is in no way an exhaustive account of the many applications of tensor decomposition; however, the examples given will hopefully demonstrate some of the many benefits of multiway modeling for a variety of data and problem domains.

Psychology. The first applications of CP were within the field of psychometrics in 1970, pioneered by the work of Carroll and Chang and of Harshman. Ref 30 introduced Canonical Decomposition in the context of analyzing multiple similarity or dissimilarity matrices from a variety of subjects, and applied the method to one data set on auditory tones from Bell Labs and to another data set of comparisons of countries, based on the idea that simply averaging the data removed the different aspects present in the data. Ref 31 introduced PARAFAC because it eliminated the rotational ambiguity associated with two-dimensional PCA and thus has better uniqueness properties, motivated by Cattell's principle of parallel proportional profiles. PARAFAC was here applied to vowel-sound data where different individuals spoke different vowels and the formant (i.e., the pitch) was measured, i.e.,

$\mathcal{X}^{\mathrm{Subject}\times\mathrm{Vowel}\times\mathrm{Pitch}} \approx \sum_d a_d^{\mathrm{Subject}} \circ b_d^{\mathrm{Vowel}} \circ c_d^{\mathrm{Pitch}}.$  (18)

Since these initial works, both the CP and the Tucker model, also referred to as N-mode PCA, have had widespread application within the social and behavioral sciences, addressing questions such as 'Which group of subjects behave differently on which variables under which conditions?' Figure 4 shows a Tucker(2, 3, 2) analysis of 24 Chopin preludes based on 20 types of scoring scales evaluated by 38 judges/subjects, i.e.,

$\mathcal{X}^{\mathrm{Prelude}\times\mathrm{Scales}\times\mathrm{Subject}} \approx \sum_{l,m,n} g_{l,m,n}\, a_l^{\mathrm{Prelude}} \circ b_m^{\mathrm{Scales}} \circ c_n^{\mathrm{Subject}}.$  (19)

FIGURE 4 | Example of a Tucker(2, 3, 2) analysis of the Chopin data $\mathcal{X}^{24\ \mathrm{Preludes}\times 20\ \mathrm{Scales}\times 38\ \mathrm{Subjects}}$ described in Ref 49. The overall mean of the data has been subtracted prior to analysis. Black and white boxes indicate negative and positive variables, and the size of the boxes their absolute value. The model accounts for 40.42% of the variation in the data, whereas the model on the same data randomly permuted accounts for 2.41 ± 0.09% of the variation. As such, the data are very structured and compressible by the Tucker model.

The analysis extracts loadings that span the dynamics of each mode well, whereas the core array accounts for the interactions between the components of each mode.

Page 32: Tensor Decomposition and its Applications

FIGURE 7 | Left panel: tutorial dataset two of ERPWAVELAB, given by $\mathcal{X}^{64\ \mathrm{Channels}\times 61\ \mathrm{Frequency\ bins}\times 72\ \mathrm{Time\ points}\times 11\ \mathrm{Subjects}\times 2\ \mathrm{Conditions}}$. Right panel: a three-component nonnegativity-constrained three-way CP decomposition of Channel × (Time–Frequency) × (Subject–Condition) and a three-component nonnegative matrix factorization of Channel × (Time–Frequency–Subject–Condition). The two models account for 60% and 76% of the variation in the data, respectively. The matrix factorization assumes spatial consistency but individual time–frequency patterns of activation across the subjects and conditions, whereas the three-way CP analysis imposes consistency in the time–frequency patterns across the subjects and conditions. As such, the most consistent patterns of activation are identified by the model.

… down-weighted in the extracted estimates of the consistent event-related activations:

$\mathcal{X}^{\mathrm{Channel}\times\mathrm{Time}\times\mathrm{Trial}} \approx \sum_{d=1}^{D} a_d^{\mathrm{Channel}} \circ b_d^{\mathrm{Time}} \circ c_d^{\mathrm{Trial}}.$  (22)

Unfortunately, violation of multilinearity in the data can cause degeneracy in the CP model (see also Figure 6). To avoid CP degeneracy, artificial restrictions in the form of orthogonality have been imposed, or alternatively the signals have been analyzed via purely additive models based on analysis of amplitudes in a spectral representation; see also Ref 6 and references therein. In Ref 6, these ad hoc measures were found unsatisfactory. Rather than restricting the CP model, a pseudo-multilinear model was proposed that combines the unambiguous CP model with a time shift accounting for explicit delays, based on the shiftCP representation. In Figure 6 it can be seen that accounting for shift can indeed alleviate CP degeneracy while the consistent patterns of activation are identified; for details on the shiftCP approach, see also Ref 6 and references therein.

Signal Processing. Multilinear algebra has recently gained large interest within the signal processing community, largely due to its applications in higher-order statistics (HOS). In the original work on independent component analysis (ICA) by Comon, it was demonstrated how the blind source separation problem

$X = AS + E,$  (23)

where $S$ is statistically independent and $E$ is residual noise, can be solved through the CP decomposition of some higher-order cumulants, due to the important property that cumulants obey multilinearity. The first-order cumulant corresponds to the mean and the second-order cumulant to the variance, such that

$E(X) = A\,E(S) + E(E),$  (24)
$\mathrm{Cov}(X) = A\,\mathrm{Cov}(S)\,A^\top + \mathrm{Cov}(E),$  (25)

where $E(\cdot)$ denotes expectation and $\mathrm{Cov}$ the covariance. For a general Nth-order cumulant we have

$\mathcal{K}^{(N)}_X = \mathcal{K}^{(N)}_S \times_1 A \times_2 A \times \cdots \times_N A + \mathcal{K}^{(N)}_E,$  (26)

where $\mathcal{K}^{(N)}_S$ is diagonal for independent $S$. The ICA problem can potentially be solved uniquely by identifying $A$ in the symmetric CP decomposition of any cumulant of order $N > 2$, which for the third- or fourth-order cumulant is given by

$\mathcal{K}^{(3)}_X \approx \mathcal{D} \times_1 A \times_2 A \times_3 A,$  (27)
$\mathcal{K}^{(4)}_X \approx \mathcal{D} \times_1 A \times_2 A \times_3 A \times_4 A,$  (28)

where $\mathcal{D}$ is a diagonal tensor. Generally speaking, it becomes harder to estimate cumulants from sample data as the order increases, i.e., longer datasets are required to obtain the same accuracy. Hence, in practice, the use of higher-order statistics is usually restricted to third- and fourth-order cumulants, and because the third-order cumulants of symmetric distributions are zero, fourth-order cumulants are often used.
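As a small numerical companion to Eqs (24)–(28), the following Octave sketch estimates the covariance and the full fourth-order cumulant tensor of multichannel data from samples; the brute-force loops and variable names are my own illustrative choices, and no attempt is made to actually compute the symmetric CP decomposition of K4.

% Sketch: empirical second- and fourth-order cumulants of multichannel data
% X (p channels x T samples).  For X = A*S with independent zero-mean sources,
% K4 is (approximately) a diagonal tensor transformed multilinearly by A.
[p, T] = size(X);
Xc  = X - mean(X, 2) * ones(1, T);     % center each channel
Cov = (Xc * Xc') / T;                  % second-order cumulant, Eq. (25)

K4 = zeros(p, p, p, p);                % fourth-order cumulant tensor
for i = 1:p
  for j = 1:p
    for k = 1:p
      for l = 1:p
        m4 = mean(Xc(i, :) .* Xc(j, :) .* Xc(k, :) .* Xc(l, :));
        K4(i, j, k, l) = m4 - Cov(i, j) * Cov(k, l) ...
                            - Cov(i, k) * Cov(j, l) ...
                            - Cov(i, l) * Cov(j, k);
      end
    end
  end
end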

Page 33: Tensor Decomposition and its Applications

• One reason why the decomposition and analysis of data stored in multiway arrays has advanced is the increase in computational power, and these methods are now applied to a wide variety of data.

• Since the decomposition of second-order tensors (matrices) is already a powerful tool for understanding and analyzing data, the analysis of third- and higher-order (Nth-order) tensors is expected to become one of the key techniques in the future.

• Nth-order tensors, being more general than matrices, are expected to enable more sophisticated analyses.

Summary