David Russo
Wang Lab Summer 2010
June 2010-August 2010
Background
• Senior Mathematics and Biology double major at Pacific Lutheran University in Tacoma, WA
• Math: mostly pure math
• Biology: wet lab background working with the A. thaliana jasmonic acid signaling pathway
Getting Started
• Familiarized myself with topics covered in BIO5488
• Gained familiarity with Perl, next-generation sequencing, epigenetics, homology, etc.
• Chose to use Java instead of Perl for convenience and familiarity
Goal
• Identify conserved regions in DNA that could serve as transcription factor binding sites
– Improve existing Patser program
Patser
• Program written by Gary Stormo and Jerry Hertz for pattern recognition in sequences
• Accepts an alignment matrix and a sequence as its arguments
• Converts alignment matrix to position weight matrix and scores the sequences according to the PWM
Patser
Additional Patser Features
• Patser also calculates P-values and E-values, which were not included in my program for simplicity
Alignment matrices
• Also known as “raw count matrices”
• Base counts across multiple genomes for transcription factor of interest
• Generated with ChIP-seq and ChIP-chip whole-genome binding experiments
• Obtained from TRANSFAC and JASPAR libraries
CSLC alignment matrix
A   5  21  21  21   0   0
C   2   0   0   0   0   0
G   9   0   0   0  21   0
T   5   0   0   0   0  21
Conversion to Position Weight Matrix
• The conversion of an alignment matrix to a PWM makes use of the following formula:
\[ \ln\left( \frac{n_k + p_k}{N + 1} \cdot \frac{1}{p_k} \right) \]
• Where n_k is the kth element of the alignment matrix, p_k = 0.25, and N is the number of sequences being compared (sum of elements in any column)
CSLC Alignment Matrix --> PWM
Column 1:
A: ln(((5+.25)/(21+1))/0.25) = -0.047
C: ln(((2+.25)/(21+1))/0.25) = -0.894
G: ln(((9+.25)/(21+1))/0.25) = 0.520
T: ln(((5+.25)/(21+1))/0.25) = -0.047
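The conversion can be sketched in Java as follows. This is only an illustration of the formula on the slide, not the program described in this talk: the names `PwmSketch` and `toPwm` are hypothetical, and a uniform background of 0.25 for every base is assumed, as in the worked example.

```java
/** Sketch: convert an alignment (raw count) matrix into a position weight matrix. */
final class PwmSketch {
    // p_k from the slide; a uniform background over A, C, G, T is assumed.
    static final double BACKGROUND = 0.25;

    /** Rows are A, C, G, T; columns are motif positions. */
    static double[][] toPwm(int[][] counts) {
        int rows = counts.length;
        int cols = counts[0].length;
        double[][] pwm = new double[rows][cols];
        for (int j = 0; j < cols; j++) {
            // N is the number of sequences compared, i.e. the column sum.
            int n = 0;
            for (int b = 0; b < rows; b++) {
                n += counts[b][j];
            }
            for (int b = 0; b < rows; b++) {
                // ln( ((n_k + p_k) / (N + 1)) / p_k )
                pwm[b][j] = Math.log(((counts[b][j] + BACKGROUND) / (n + 1.0)) / BACKGROUND);
            }
        }
        return pwm;
    }

    public static void main(String[] args) {
        int[][] cslc = {
            {5, 21, 21, 21, 0, 0},  // A
            {2, 0, 0, 0, 0, 0},     // C
            {9, 0, 0, 0, 21, 0},    // G
            {5, 0, 0, 0, 0, 21}     // T
        };
        double[][] pwm = toPwm(cslc);
        String bases = "ACGT";
        for (int b = 0; b < pwm.length; b++) {
            StringBuilder row = new StringBuilder().append(bases.charAt(b));
            for (double w : pwm[b]) {
                row.append(String.format(" %7.3f", w));
            }
            System.out.println(row);
        }
    }
}
```

Adding p_k to each count keeps zero counts from producing ln(0), which is why a base never seen at a position still gets a finite (large negative) weight.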
Position Weight Matrix
• Position weight matrices approximate binding affinities between proteins and DNA sequences
• Key difference: positive values correspond to higher binding affinities, while negative values correspond to lower binding affinities
Patser Algorithm
A  -0.047   1.352   1.352   1.352  -3.091  -3.091
C  -0.894  -3.091  -3.091  -3.091  -3.091  -3.091
G   0.520  -3.091  -3.091  -3.091   1.352  -3.091
T  -0.047  -3.091  -3.091  -3.091  -3.091   1.352
Random segment from an imaginary sequence: ACGATA
Patser Score for random segment: (-0.047)+(-3.091)+(-3.091)+(1.352)+(-3.091)+(-3.091) = -11.059
Patser Algorithm
• Since the imaginary sequence ACGATA has only one base in common with the logo, the Patser score is very low
• Because the score is so low, this segment is not a potential binding site for the CSLC transcription factor
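The scoring step for a single segment can be sketched as below. The PWM values are the rounded CSLC entries from the slide; the class and method names (`PatserScoreSketch`, `rowOf`, `score`) are hypothetical and are not taken from Patser or from the program written for this project.

```java
/** Sketch: score one segment against a PWM by summing the matched entries. */
final class PatserScoreSketch {
    // Rounded CSLC position weight matrix from the slide; rows are A, C, G, T.
    static final double[][] CSLC_PWM = {
        {-0.047,  1.352,  1.352,  1.352, -3.091, -3.091},  // A
        {-0.894, -3.091, -3.091, -3.091, -3.091, -3.091},  // C
        { 0.520, -3.091, -3.091, -3.091,  1.352, -3.091},  // G
        {-0.047, -3.091, -3.091, -3.091, -3.091,  1.352}   // T
    };

    /** Maps a base to its row in the matrix. */
    static int rowOf(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("Unexpected base: " + base);
        }
    }

    /** Sums the PWM entry for the base observed at each position of the segment. */
    static double score(double[][] pwm, String segment) {
        if (segment.length() != pwm[0].length) {
            throw new IllegalArgumentException("Segment length must equal the motif width");
        }
        double total = 0.0;
        for (int i = 0; i < segment.length(); i++) {
            total += pwm[rowOf(segment.charAt(i))][i];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.printf("ACGATA -> %.3f%n", score(CSLC_PWM, "ACGATA"));
        System.out.printf("GAAAGT -> %.3f%n", score(CSLC_PWM, "GAAAGT"));
    }
}
```

With these rounded entries, ACGATA sums to -11.059, while a segment matching the highest-weighted base at every position (GAAAGT) sums to 7.28.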
Interleukin 22 (IL22)
• Located on 12th chromosome in humans (10th chromosome in mice)
• Involved in cellular inflammatory responses by initiating innate immune responses against bacterial pathogens, especially in respiratory and gut epithelial cells
Choosing a Region To Analyze
• Approximately 10,000bp before the beginning of the gene body, and 1,000bp into the gene body
• Region in front of promoter is thought to be regulatory in nature, and a site where transcription factors can bind
IL22 Region of Interest
Chr12: 66,918,269-66,929,282
Transcription factors for investigation (chosen by Ting and GeneCard)
• STAT3
• CSLC
• CSLH
• CSLL
• CMAF01
• RORA1
• RORA2
• AP1
• NFKappaB1
• STAT5A1
• STAT5A4
• STAT5B1
Standard Patser Output (first 5 scores for CSLC)
hg18_dna position= 1  score= -7.46  sequence= CTTTGT
hg18_dna position= 1C score= -6.62  sequence= ACAAAG
hg18_dna position= 2  score= -15.50 sequence= TTTGTG
hg18_dna position= 2C score= -7.46  sequence= CACAAA
hg18_dna position= 3  score= -11.06 sequence= TTGTGG
hg18_dna position= 3C score= -11.91 sequence= CCACAA
hg18_dna position= 4  score= -11.06 sequence= TGTGGA
hg18_dna position= 4C score= -11.06 sequence= TCCACA
hg18_dna position= 5  score= -14.94 sequence= GTGGAG
hg18_dna position= 5C score= -16.35 sequence= CTCCAC
Top Scoring Positions
hg18_dna position= 1810C  score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 2970C  score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 3623   score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 4446C  score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 4770   score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 7383C  score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 8151C  score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 9014   score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 9551C  score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 10245  score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 10352  score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
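The output above lists each position twice: once for the forward window and once, marked "C", for its reverse complement. A minimal sketch of such a two-strand scan is given below. It reuses the rounded CSLC matrix from the earlier slide, and the 10-base test sequence CTTTGTGGAG is simply the first five overlapping windows of the output stitched together. The names `StrandScanSketch`, `scan`, and `reverseComplement` are hypothetical; this is not Patser's own code or output format.

```java
/** Sketch: slide a PWM along a sequence and score both strands of every window. */
final class StrandScanSketch {
    // Rounded CSLC position weight matrix from the slide; rows are A, C, G, T.
    static final double[][] CSLC_PWM = {
        {-0.047,  1.352,  1.352,  1.352, -3.091, -3.091},  // A
        {-0.894, -3.091, -3.091, -3.091, -3.091, -3.091},  // C
        { 0.520, -3.091, -3.091, -3.091,  1.352, -3.091},  // G
        {-0.047, -3.091, -3.091, -3.091, -3.091,  1.352}   // T
    };

    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            switch (dna.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                case 'T': sb.append('A'); break;
                default: throw new IllegalArgumentException("Unexpected base: " + dna.charAt(i));
            }
        }
        return sb.toString();
    }

    static double score(double[][] pwm, String segment) {
        double total = 0.0;
        for (int i = 0; i < segment.length(); i++) {
            int row = "ACGT".indexOf(segment.charAt(i));
            if (row < 0) {
                throw new IllegalArgumentException("Unexpected base: " + segment.charAt(i));
            }
            total += pwm[row][i];
        }
        return total;
    }

    /** Returns one row per window: [forward score, reverse-complement score]. */
    static double[][] scan(double[][] pwm, String dna) {
        int width = pwm[0].length;
        int windows = Math.max(dna.length() - width + 1, 0);
        double[][] scores = new double[windows][2];
        for (int p = 0; p < windows; p++) {
            String window = dna.substring(p, p + width);
            scores[p][0] = score(pwm, window);
            scores[p][1] = score(pwm, reverseComplement(window));
        }
        return scores;
    }

    public static void main(String[] args) {
        double[][] scores = scan(CSLC_PWM, "CTTTGTGGAG");
        for (int p = 0; p < scores.length; p++) {
            // Positions are reported 1-based, as in the listing above.
            System.out.printf("position= %d score= %.2f%n", p + 1, scores[p][0]);
            System.out.printf("position= %dC score= %.2f%n", p + 1, scores[p][1]);
        }
    }
}
```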
STAT3 and STAT5 Binding Sites
CSLC, CSLH, CSLL, CMAF01 Binding Sites
AP1, NFKappaB1, RORA1, RORA2 Binding Sites
Conservation
• Patser could be improved by integrating conservation scores (obtained from UCSC Genome Browser)
• Two strategies: multiply each position's PWM score by the average of all conservation scores, or multiply each position's PWM score by its corresponding conservation score
Conservation
• “Average Approach” is not likely to be effective, as it is an oversimplification of the conservation scores
• Essentially reports a scaled down version of Patser scores
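A short derivation suggests why the average approach only rescales the Patser scores. It assumes the average is taken once over the whole region, which the slides do not state explicitly. The symbols are introduced here for illustration: w_k is the PWM entry matched at position k, \bar{c} is the average conservation score, and the window runs from i to i+L.

```latex
\[
S_{\text{avg}}(i)
  = \sum_{k=i}^{i+L} \bar{c}\, w_k
  = \bar{c} \sum_{k=i}^{i+L} w_k
  = \bar{c}\, S_{\text{Patser}}(i)
\]
```

Under that assumption \bar{c} is the same positive constant for every window, so the ranking of candidate sites is unchanged and no conservation information specific to a site enters the score.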
Conservation
• Base-by-base conservation approach: More likely to be effective at identifying transcription factor binding sites
• New Patser formula:
\[ \sum_{k=i}^{i+L} \left( \ln\left( \frac{n_k + p_k}{N + 1} \cdot \frac{1}{p_k} \right) \times c_k \right) \]
• c_k is the conservation score at position k
Example
• Imaginary Sequence
– ACGATA
• Imaginary Conservation Scores
– 1(A): 0.5
– 2(C): 0.25
– 3(G): 0.125
– 4(A): 0
– 5(T): 0.125
– 6(A): 0.25
Example
(A): (0.5 x -0.047)
(C): (0.25 x -3.091)
(G): (0.125 x -3.091)
(A): (0 x 1.352)
(T): (0.125 x -3.091)
(A): (0.25 x -3.091)
Σ = −2.34
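The base-by-base approach can be sketched as a small change to the plain scoring loop: each matched PWM entry is multiplied by the conservation score at the same position before summing. The sketch below uses the rounded CSLC matrix and the imaginary conservation scores from this example; `ConservationScoreSketch` and `weightedScore` are hypothetical names, not those used in PatserCons.java.

```java
/** Sketch: conservation-weighted Patser score (base-by-base approach). */
final class ConservationScoreSketch {
    // Rounded CSLC position weight matrix from the slide; rows are A, C, G, T.
    static final double[][] CSLC_PWM = {
        {-0.047,  1.352,  1.352,  1.352, -3.091, -3.091},  // A
        {-0.894, -3.091, -3.091, -3.091, -3.091, -3.091},  // C
        { 0.520, -3.091, -3.091, -3.091,  1.352, -3.091},  // G
        {-0.047, -3.091, -3.091, -3.091, -3.091,  1.352}   // T
    };

    /** Sums (PWM entry x conservation score) over the positions of the segment. */
    static double weightedScore(double[][] pwm, String segment, double[] conservation) {
        if (segment.length() != pwm[0].length || conservation.length != segment.length()) {
            throw new IllegalArgumentException("Segment, motif, and conservation lengths must match");
        }
        double total = 0.0;
        for (int k = 0; k < segment.length(); k++) {
            int row = "ACGT".indexOf(segment.charAt(k));
            if (row < 0) {
                throw new IllegalArgumentException("Unexpected base: " + segment.charAt(k));
            }
            total += pwm[row][k] * conservation[k];
        }
        return total;
    }

    public static void main(String[] args) {
        double[] conservation = {0.5, 0.25, 0.125, 0.0, 0.125, 0.25};
        System.out.printf("ACGATA -> %.2f%n", weightedScore(CSLC_PWM, "ACGATA", conservation));
    }
}
```

Note that the only matching base in ACGATA (the A at position 4) has a conservation score of 0, so it contributes nothing to the weighted sum.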
Methods for Generating Conservation Scores
• PhastCons Scores
– Uses Hidden Markov Model to generate scores
• PhyloP Scores
– Reports scores as a p-value
PhastCons Scores
• Hidden Markov Model generates each score based on the preceding and following bases
• Longer, clumped regions of conservation
• Scores range from 0 to 1
PhastCons Scores
Patser and PhastCons/Patser for CSLC, CSLH, CSLL
Patser and PhastCons/Patser for STAT3 and STAT5A
Patser and PhastCons/Patser for RORA1, NFKappaB1
Patser/PhastCons Scores: CMAF, STAT5B1, AP1
PhyloP Scores
• Determine scores on an individual, base-by-base basis
• Scores (for specific IL22 region) range from -5.799 to 3.225
• Scores are reported as negative logarithms of p-values
– Large, positive scores indicate conservation
– Negative scores indicate non-conserved regions
PhyloP Scores
Patser and PhyloP
• PhyloP scores cannot be used directly in Patser scoring system
• Direct use of PhyloP scores results in scores greater than 30
– Negative PhyloP scores were multiplied by negative PWM entries to yield large positive scores
PhyloP Score Adjustment
• All negative scores were assigned a conservation score of 0
• All positive scores were divided by the largest conservation score, scaling them to the 0-to-1 range so they are directly comparable to PhastCons scores
• Negative scores indicate positive selection, which is not of interest in this study
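The adjustment described above can be sketched as follows. The endpoints -5.799 and 3.225 are the range reported for the IL22 region; the two intermediate values in `main` are made up for illustration, and `PhyloPAdjustSketch` and `adjust` are hypothetical names.

```java
/** Sketch: clamp negative PhyloP scores to 0 and scale positive ones into (0, 1]. */
final class PhyloPAdjustSketch {
    static double[] adjust(double[] phyloP) {
        // Largest conservation score in the region; stays 0 if nothing is positive.
        double max = 0.0;
        for (double s : phyloP) {
            max = Math.max(max, s);
        }
        double[] adjusted = new double[phyloP.length];
        if (max == 0.0) {
            return adjusted;  // no positive scores: every position is assigned 0
        }
        for (int i = 0; i < phyloP.length; i++) {
            adjusted[i] = phyloP[i] > 0 ? phyloP[i] / max : 0.0;
        }
        return adjusted;
    }

    public static void main(String[] args) {
        // -5.799 and 3.225 are the reported extremes; -0.5 and 1.6125 are illustrative.
        double[] raw = {-5.799, -0.5, 1.6125, 3.225};
        double[] adjusted = adjust(raw);
        for (int i = 0; i < raw.length; i++) {
            System.out.printf("%8.4f -> %.4f%n", raw[i], adjusted[i]);
        }
    }
}
```

After this step a negative PhyloP score can no longer flip the sign of a negative PWM entry, which is what produced the inflated scores when the raw values were used directly.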
Adjusted PhyloP Scores
PhyloP / Adjusted PhyloP Comparison
Patser and PhyloP/Patser for CSLC, CSLH, CSLL
Patser and PhyloP/Patser for STAT3 and STAT5A
Patser and PhyloP/Patser for RORA1, NFKappaB1
Patser/PhyloP Scores: CMAF, STAT5B1, AP1
Conclusions
• Sites in which Patser, Patser/PhastCons, and Patser/PhyloP were in the same region: CMAF, AP1, CSLC
– Complete summary available as PDF on wiki page
• Results could be verified in wet-lab to test effectiveness of program
• If effective, program could provide quick means of identifying putative TF binding sites
Future Work
• Improvements on program efficiency
– Currently, the program PatserCons.java is run, the results are then saved as a text file, and the program TopScore.java organizes these results to display only the top-scoring sites
– PatserCons and TopScore could be condensed
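One possible way to condense the two steps, offered only as a sketch and not as a description of PatserCons.java or TopScore.java, is to keep the best n sites in a bounded min-heap while scoring, so no intermediate text file is needed. The names `TopSitesSketch`, `Site`, and `topSites` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch: keep only the n best-scoring positions while scanning. */
final class TopSitesSketch {
    static final class Site {
        final int position;   // 1-based window start
        final double score;

        Site(int position, double score) {
            this.position = position;
            this.score = score;
        }
    }

    /** Returns the n highest-scoring sites, best first. */
    static List<Site> topSites(double[] scores, int n) {
        Comparator<Site> byScore = Comparator.comparingDouble((Site s) -> s.score);
        // Min-heap: the weakest of the current top n sits at the head.
        PriorityQueue<Site> heap = new PriorityQueue<>(byScore);
        for (int p = 0; p < scores.length; p++) {
            heap.add(new Site(p + 1, scores[p]));
            if (heap.size() > n) {
                heap.poll();  // drop the lowest score once more than n are held
            }
        }
        List<Site> best = new ArrayList<>(heap);
        best.sort(byScore.reversed());
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {-7.46, -15.50, 7.28, -11.06, -6.62};
        for (Site site : topSites(scores, 2)) {
            System.out.printf("position= %d score= %.2f%n", site.position, site.score);
        }
    }
}
```

The heap never holds more than n + 1 entries, so memory use stays small even for long regions.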
Future Work
• Cleaning up code
• Possibly test effectiveness with A. thaliana project at PLU
Additional Work
• Bacterial plasmid preparation for Xiaoyun
• Collection of updated transcription factor binding profiles from JASPAR (available on wiki page)
Acknowledgments
• Everyone presently in the Wang Lab
• Former members: Kevin, Chris
• Ryan Christensen in Gary Stormo’s Lab