U-curve Search for Biological States Characterization and Genetic Network Design
Marcelo Ris – Universidade de São Paulo – Instituto de Matemática e Estatística
Junior Barrera – Universidade de São Paulo – Instituto de Matemática e Estatística
Helena Brentani – Hospital do Câncer, Fundação Antônio Prudente
Outline
• Introduction
• Feature selection problem
• U-curve search algorithm
• Characterization of biological states
• Genetic network design
• Application
Introduction
• Biological Problems
  P1. Biological states characterization
  P2. Genetic Network Design
• Gene expression data
  P1. States samples
  P2. Time-course samples
• Mathematical approach
  – Feature Selection Problem
  – State of the art: heuristic optimizations
  – U-curve algorithm
Feature selection problem
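Feature Selection
$x[t] = (x_1[t], x_2[t], \ldots, x_n[t])^T$
$\{(x[0], y[0]), (x[1], y[1]), \ldots, (x[m], y[m])\}$
$A \subset \{1, 2, \ldots, n\}$
$x_i[t] \in R = \{r_1, r_2, \ldots, r_l\}$, $r_i \in \mathbb{R}$
$y[t] \in K = \{1, \ldots, c\}$
$\psi : R^{|A|} \to K$
$P(X, Y)$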
Y Distribution
$P : \{-1, 0, 1\} \to [0, 1]$, $\sum_{y \in \{-1, 0, 1\}} P(y) = 1$
Y Entropy
$H(Y) = -\sum_{y \in \{-1, 0, 1\}} P(y) \log P(y)$
$H(Y) > H(Y')$
$H(Y') = H(Y'')$
Mutual Information
$I(X, Y) = H(Y) - H(Y|X) \ge 0$
[Figure: three bar charts of P(Y), P(Y') and P(Y'') over the values -1, 0, 1]
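Entropy depends only on the probability masses, not on which value carries them. The sketch below checks this numerically for three hypothetical distributions over {-1, 0, 1}. The probabilities are illustrative assumptions rather than values from the slides, and the base-2 logarithm is also an assumption, since the slides do not fix the base.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a distribution given as {value: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

# Hypothetical distributions over {-1, 0, 1}; the numbers are illustrative only.
p_y = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}   # uniform: maximum uncertainty
p_y1 = {-1: 0.8, 0: 0.1, 1: 0.1}        # mass concentrated on -1
p_y2 = {-1: 0.1, 0: 0.1, 1: 0.8}        # same masses, concentrated on 1

h_y, h_y1, h_y2 = entropy(p_y), entropy(p_y1), entropy(p_y2)
print(h_y, h_y1, h_y2)
```

The uniform distribution has the largest entropy, and the two concentrated distributions, being permutations of each other, have the same entropy.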
Mean Conditional Entropy
$E[H(Y|X)] = -\sum_{x \in \{-1, 0, 1\}^{|A|}} P_X(x) \sum_{y \in \{-1, 0, 1\}} P_{Y|X}(y|x) \log P_{Y|X}(y|x)$
Mean Mutual Information
$E[I(X, Y)] = H(Y) - E[H(Y|X)]$
Estimation
$\hat{E}[H(Y|X)] = -\sum_{x \in \{-1, 0, 1\}^{|A|}} \hat{P}_X(x) \sum_{y \in \{-1, 0, 1\}} \hat{P}_{Y|X}(y|x) \log \hat{P}_{Y|X}(y|x)$
$\hat{E}[I(X, Y)] = \hat{H}(Y) - \hat{E}[H(Y|X)]$
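A minimal sketch of how the estimated mean conditional entropy could be computed from samples, assuming the estimated probabilities are plain relative frequencies (a plug-in estimate). The function name, the toy samples and the base-2 logarithm are assumptions made for illustration.

```python
import math
from collections import Counter

def mean_conditional_entropy(samples, subset):
    """Plug-in estimate of E[H(Y|X)] with X restricted to the features in `subset`.

    samples: list of (x, y) pairs, where x is a tuple of quantized values.
    """
    m = len(samples)
    joint = Counter((tuple(x[i] for i in subset), y) for x, y in samples)
    marg = Counter()
    for (xa, _), c in joint.items():
        marg[xa] += c
    h = 0.0
    for (xa, _), c in joint.items():
        p_x = marg[xa] / m            # estimated P(x)
        p_y_given_x = c / marg[xa]    # estimated P(y|x)
        h -= p_x * p_y_given_x * math.log2(p_y_given_x)
    return h

# Toy samples (hypothetical): y copies feature 0, feature 1 carries no information.
samples = [((-1, 0), -1), ((-1, 1), -1), ((0, 0), 0), ((0, 1), 0),
           ((1, 0), 1), ((1, 1), 1)]
print(mean_conditional_entropy(samples, [0]))  # feature 0 determines y
print(mean_conditional_entropy(samples, [1]))  # feature 1 leaves H(Y) unchanged
print(mean_conditional_entropy(samples, []))   # empty subset: estimate of H(Y)
```

With an empty subset the function returns the estimate of H(Y), so the estimated mean mutual information is the difference between that value and the estimate for the chosen subset.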
• Problem
  – Find the subset A that optimizes the cost function
  – Ex: mean conditional entropy minimization (cost function)
  – Exponential: there are 2^n candidate subsets
• Search Space
  – Complete Boolean lattice of order n
  – Each node represents a possible candidate A
  – Cost function: estimated for each node
  – Find the node with the minimum cost
[Figure: Boolean lattice of order 4; a 4-element chain is emphasized]
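To make the search space concrete, the sketch below enumerates the Boolean lattice of order 4 level by level and checks that a sequence of subsets is a chain. The helper names are hypothetical, and the chain shown is an arbitrary choice, not necessarily the one emphasized in the figure.

```python
from itertools import combinations

def lattice(n):
    """All subsets of {1, ..., n}, grouped by size (the levels of the lattice)."""
    items = range(1, n + 1)
    return [[frozenset(c) for c in combinations(items, k)] for k in range(n + 1)]

def is_chain(subsets):
    """True if each subset is strictly contained in the next one."""
    return all(a < b for a, b in zip(subsets, subsets[1:]))

levels = lattice(4)
print(sum(len(level) for level in levels))  # 2**4 = 16 candidate subsets

# One 4-element chain (arbitrary example).
chain = [frozenset(), frozenset({2}), frozenset({2, 3}), frozenset({1, 2, 3})]
print(is_chain(chain))
```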
• Heuristics: SFS, SFFS
  – Incremental
  – Do not search the whole candidate space
  – May fail to obtain the best result
    • Ex: two elements that each make the result worse alone, but improve it a lot together
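The failure mode of incremental heuristics can be reproduced on a small synthetic data set. In the sketch below (hypothetical data, frequency-based entropy estimate in bits), y is the XOR of features 0 and 1, so each of them looks useless alone, while a third, noisier feature looks better in the first greedy step.

```python
import math
from collections import Counter
from itertools import product

def cond_entropy(samples, subset):
    """Plug-in estimate of E[H(Y|X_subset)] in bits."""
    m = len(samples)
    joint = Counter((tuple(x[i] for i in subset), y) for x, y in samples)
    marg = Counter()
    for (xa, _), c in joint.items():
        marg[xa] += c
    # P(x) * P(y|x) = c / m, and -log2 P(y|x) = log2(marg / c)
    return sum(c / m * math.log2(marg[xa] / c) for (xa, _), c in joint.items())

def sfs_first_pick(samples, n):
    """First step of a greedy forward selection: the single best feature."""
    return min(range(n), key=lambda i: cond_entropy(samples, [i]))

# Hypothetical data: y = x0 XOR x1, while x2 agrees with y in 3 of 4 cases.
samples = []
for x0, x1 in product([0, 1], repeat=2):
    y = x0 ^ x1
    for k in range(4):
        x2 = y if k < 3 else 1 - y
        samples.append(((x0, x1, x2), y))

print(cond_entropy(samples, [0]), cond_entropy(samples, [1]))  # each alone: 1 bit
print(cond_entropy(samples, [0, 1]))                           # together: 0 bits
print(sfs_first_pick(samples, 3))                              # greedy starts with feature 2
```

A greedy search that commits to feature 2 first never evaluates the pair {0, 1} on its own, which is the kind of candidate an exhaustive or branch-and-bound search would still reach.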
U-curve search algorithm
• U-curve property of Ê[H(Y|X)]
[Figure: Ê[H(Y|X)] plotted against |A|]
  – Why?
  – The estimate is composed of:
    • Real measure: decreases from H(Y) to the real value E[H(Y|X)]
    • Estimation error: increases as more attributes are added to X
  – For a fixed number of samples and for any chain of the search space, Ê[H(Y|X)] forms a U-curve
• Features of the algorithm
  – Branch-and-bound: covers the whole space without having to visit every candidate
  – Stochastic
  – Some definitions:
    • U-cost Boolean lattice
    • Local minimum
    • Exhausted minimum
    • Global minimum
• Search space characterized by:
  – Upper bound list
  – Lower bound list
• An element is reachable if there is a chain to it from an element of the upper or lower bound list
• At each step:
  – Select, with some probability, the starting list (upper or lower bound list)
  – Select a random element from this list
  – Build a chain iteratively:
    • Append to the chain a random reachable element adjacent to the last one
    • Stop when the cost of the last element is greater than the cost of the previous one
[Figure: example chain 0000, 0100, 0110, 1110, with node labels 10, 7, 9, 6, and the prune procedure]
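The chain-building step can be sketched as follows. This is a simplified illustration under stated assumptions: every adjacent element is treated as reachable (the bound lists are ignored), subsets are represented as frozensets, and the cost function is a hypothetical U-shaped stand-in rather than an entropy estimate.

```python
import random

def build_chain(start, n, cost, rng, upward=True):
    """Walk a random chain from `start`, stopping once the cost goes up.

    Subsets of {0, ..., n-1} are frozensets. Returns the visited chain and its
    minimum element (the one before the cost increase, or the last one if the
    chain ran out of neighbours).
    """
    chain = [start]
    while True:
        last = chain[-1]
        if upward:
            neighbours = [last | {i} for i in range(n) if i not in last]
        else:
            neighbours = [last - {i} for i in last]
        if not neighbours:
            return chain, last
        nxt = rng.choice(neighbours)
        chain.append(nxt)
        if cost(nxt) > cost(last):
            return chain, last

def toy_cost(subset):
    """Hypothetical U-shaped cost: distance between the subset size and 2."""
    return abs(len(subset) - 2)

chain, minimum = build_chain(frozenset(), 4, toy_cost, random.Random(0))
print([sorted(s) for s in chain], sorted(minimum))
```

Starting from the empty set, the walk descends in cost until it reaches a subset of size 2 and stops at the first subset of size 3, whose cost is higher.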
• Additional Procedures
  – Minimum exhausting
    • Avoids more than one visit to the same candidate
    • Uses a stack
  – Pruning elements from an element E
    • Upper bound list: remove the elements U that contain E, and insert the elements reachable from U that do not contain E
    • Lower bound list: remove the elements L that are contained in E, and insert the elements reachable from L that are not contained in E
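One possible reading of the upper bound list update, sketched under assumptions: each U that contains E is replaced by the largest subsets of U that do not contain E (U with one element of E removed), and entries covered by another entry are dropped. The function name and the frozenset representation are hypothetical, and the lower bound list would be handled symmetrically.

```python
def prune_upper(upper_list, e):
    """Cut every subset that contains `e` out of the region below `upper_list`.

    Each U that contains e is replaced by the maximal subsets of U that do not
    contain e, i.e. U with one element of e removed (assumed interpretation).
    """
    candidates = set()
    for u in upper_list:
        if e <= u:
            candidates.update(u - {i} for i in e)
        else:
            candidates.add(u)
    # Keep only maximal elements, so no entry is covered by another one.
    return [c for c in candidates if not any(c < d for d in candidates)]

full = frozenset({0, 1, 2, 3})
e = frozenset({1, 2})
new_upper = prune_upper([full], e)
print(sorted(sorted(u) for u in new_upper))
```

After the update, no subset below the new upper bound list contains E, while every subset that does not contain E remains reachable.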
Characterization of biological states
$x[t] = (x_1[t], x_2[t], \ldots, x_n[t])^T$ (quantized microarray)
$\{(x[0], y[0]), (x[1], y[1]), \ldots, (x[m], y[m])\}$
$A \subset \{1, 2, \ldots, n\}$
U-curve algorithm
$x_i[t] \in R = \{r_1, r_2, \ldots, r_l\}$, $r_i \in \mathbb{R}$ (quantized values)
$y[t] \in K = \{1, \ldots, c\}$ (biological states)
$\psi : R^{|A|} \to K$
$P(X, Y)$
Genetic network design
• Dynamical Systems
  – State: vector x
  – Transition function Φ
  – x[t+1] = Φ(x[t])
• Stochastic Process
  – Stochastic transition function
    • Next state: realization of a random vector
  – Ex: Markov chain (π_{Y|X}, π_0)
    • Discrete time, finite-size vector, finite domain
    • Random state sequence
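\[
\pi_{Y|X} =
\begin{pmatrix}
p_{1|1} & p_{2|1} & p_{3|1} & \cdots & p_{|R^n|\,|\,1} \\
p_{1|2} & p_{2|2} & p_{3|2} & \cdots & p_{|R^n|\,|\,2} \\
p_{1|3} & p_{2|3} & p_{3|3} & \cdots & p_{|R^n|\,|\,3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
p_{1\,|\,|R^n|} & p_{2\,|\,|R^n|} & p_{3\,|\,|R^n|} & \cdots & p_{|R^n|\,|\,|R^n|}
\end{pmatrix},
\qquad
\pi_0 =
\begin{pmatrix}
p_1 \\ p_2 \\ p_3 \\ \vdots \\ p_{|R^n|}
\end{pmatrix}
\]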
– Markov chain $(\pi_{Y|X}, \pi_0)$ with the following axioms:
a. $\pi_{Y|X}$ is homogeneous, that is, $p_{y|x}$ is independent of $t$;
b. $p_{y|x} > 0$, $\forall x, y \in R^n$;
c. $\pi_{Y|X}$ is conditionally independent, that is, $\forall x, y \in R^n$, $p_{y|x} = \prod_{i=1}^{n} p(y_i | x)$;
d. $\pi_{Y|X}$ is almost deterministic, that is, $\forall x \in R^n$ and $i \in N = \{1, \ldots, n\}$, there is $r \in R$ such that $p(y_i = r | x) \approx 1$;
e. $\forall i \in N$, $\forall x \in R^n$, there is a sub-space of dimension $j$, $j \ll n$, such that $p(y_i | x) = p(y_i | x')$, where $x'$ is the projection of $x$ on this sub-space.
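Under conditional independence, one row of the transition matrix is the product of per-gene conditional distributions. The sketch below builds such a row for a 2-gene network with three quantization levels. The per-gene probabilities are hypothetical, chosen so that every transition has positive probability and one value per gene has probability close to 1.

```python
from itertools import product

R = (-1, 0, 1)  # quantized expression levels (assumed three-level quantization)

def transition_row(per_gene):
    """Distribution of the next state y given a fixed current state x.

    per_gene[i][r] is p(y_i = r | x); conditional independence gives
    p(y | x) as the product of these terms over the genes.
    """
    row = {}
    for y in product(R, repeat=len(per_gene)):
        p = 1.0
        for i, r in enumerate(y):
            p *= per_gene[i][r]
        row[y] = p
    return row

# Hypothetical per-gene distributions for one state x of a 2-gene network.
per_gene = [{-1: 0.98, 0: 0.01, 1: 0.01},
            {-1: 0.01, 0: 0.01, 1: 0.98}]
row = transition_row(per_gene)
best = max(row, key=row.get)
print(best, row[best], sum(row.values()))
```

The row has one dominant next state, which is the sense in which such a chain behaves almost like a deterministic dynamical system.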
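• Probabilistic Genetic Networks - PGN
$P_{X_1|X}, P_{X_2|X}, \ldots, P_{X_n|X}$
\[
P_{X_i|X} =
\begin{pmatrix}
p_{r_1|1} & p_{r_2|1} & p_{r_3|1} & \cdots & p_{r_l|1} \\
p_{r_1|2} & p_{r_2|2} & p_{r_3|2} & \cdots & p_{r_l|2} \\
p_{r_1|3} & p_{r_2|3} & p_{r_3|3} & \cdots & p_{r_l|3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
p_{r_1\,|\,|R^n|} & p_{r_2\,|\,|R^n|} & p_{r_3\,|\,|R^n|} & \cdots & p_{r_l\,|\,|R^n|}
\end{pmatrix}
\]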
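• Markov Chain
\[
\pi_{Y|X} =
\begin{pmatrix}
p_{1|1} & p_{2|1} & p_{3|1} & \cdots & p_{|R^n|\,|\,1} \\
p_{1|2} & p_{2|2} & p_{3|2} & \cdots & p_{|R^n|\,|\,2} \\
p_{1|3} & p_{2|3} & p_{3|3} & \cdots & p_{|R^n|\,|\,3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
p_{1\,|\,|R^n|} & p_{2\,|\,|R^n|} & p_{3\,|\,|R^n|} & \cdots & p_{|R^n|\,|\,|R^n|}
\end{pmatrix}
\]
• Probabilistic Genetic Networks - PGN
Almost Deterministic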
Time-Course Gene Expression Data
[Figure: expression of Gene 1, Gene 2, Gene 3, ..., Gene n plotted against time; expression measurement techniques yield the state sequence x[1], x[2], ..., x[m-1], x[m]]
$x[t] = (x_1[t], x_2[t], \ldots, x_n[t])^T$ (quantized microarray at $t$)
$\{(x[0], y_j[0]), (x[1], y_j[1]), \ldots, (x[m], y_j[m])\}$
$A_j \subset \{1, 2, \ldots, n\}$
U-curve algorithm
$x_i[t] \in R = \{r_1, r_2, \ldots, r_l\}$, $r_i \in \mathbb{R}$ (quantized values)
$y_j[t] = x_j[t+1]$ (gene $j$ quantized expression at $t+1$)
$\psi_j : R^{|A_j|} \to R$
$P(X, Y_j)$, $j = 1, \ldots, n$
Application
• Application: design of an estrogen-regulated network
  – 21 estrogen-regulated target genes
  – Time-course gene expression (Ed Liu et al.)
    • MCF-7 cells treated with estrogen
    • 16 microarray experiments in 24 hours:
      – every hour during the first 8 hours
      – every 2 hours during the last 16 hours
  – Quantization (Barrera et al.)
    • 3 levels {-1, 0, 1}
  – For each target gene
    • 15x21 matrix
      – First 20 columns: gene expressions in the first 15 experiments
      – Last column: target gene expression in the last 15 experiments
  – The algorithm returns, among the 20 genes, the subset that best predicts the target
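The per-target matrix described above can be assembled as in the sketch below. It assumes that the 20 predictor columns are the genes other than the target and that consecutive experiments are paired regardless of the uneven time spacing. The data are random stand-ins, not the MCF-7 measurements, and the function name is hypothetical.

```python
import random

def design_matrix(series, target):
    """Rows pair the predictors at time t with the target gene at time t+1.

    series: list of time points, each a list with one quantized value per gene.
    Returns len(series) - 1 rows: the other genes at t, then the target at t+1.
    """
    rows = []
    for now, nxt in zip(series, series[1:]):
        predictors = [v for g, v in enumerate(now) if g != target]
        rows.append(predictors + [nxt[target]])
    return rows

# Synthetic stand-in for the quantized data: 16 experiments x 21 genes.
rng = random.Random(0)
series = [[rng.choice([-1, 0, 1]) for _ in range(21)] for _ in range(16)]
matrix = design_matrix(series, target=0)
print(len(matrix), len(matrix[0]))  # 15 rows, 21 columns
```

Each row is one sample (x[t], y_j[t]) with y_j[t] = x_j[t+1], so the matrix can be handed to the feature selection search with the last column as the value to predict.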
THBS1
PGR
CXCL12
STC2
SERPINE1
NRIP1
FZD8
CRABP2
BMP7
IL6ST
REA
EPHA4
CD7
JAK1
CTSD
JDP1
CCNG2
ABCA3
GATA3
NMA
IGFBP4
Thanks!!!