Wang labsummer2010

David Russo
Wang Lab Summer 2010

June 2010-August 2010

Background

• Senior Mathematics and Biology double major at Pacific Lutheran University in Tacoma, WA

• Math: mostly pure math
• Biology: wet lab background working with the A. thaliana jasmonic acid signaling pathway

Getting Started

• Familiarized myself with topics covered in BIO5488

• Gained familiarity with Perl, next-generation sequencing, epigenetics, homology, etc.

• Chose to use Java instead of Perl for convenience and familiarity

Goal

• Identify conserved regions in DNA that could serve as transcription factor binding sites
– Improve the existing Patser program

Patser

• Program written by Gary Stormo and Jerry Hertz for pattern recognition in sequences

• Accepts an alignment matrix and a sequence as its arguments

• Converts alignment matrix to position weight matrix and scores the sequences according to the PWM

Patser

Additional Patser Features

• Patser also calculates P-values and E-values, which were not included in my program for simplicity

Alignment matrices

• Also known as “raw count matrices”
• Base counts across multiple genomes for the transcription factor of interest
• Generated with ChIP-seq and ChIP-chip whole-genome binding experiments
• Obtained from TRANSFAC and JASPAR libraries

CSLC alignment matrix

A   5  21  21  21   0   0
C   2   0   0   0   0   0
G   9   0   0   0  21   0
T   5   0   0   0   0  21

Conversion to Position Weight Matrix

• The conversion of an alignment matrix to a PWM makes use of the following formula:

$$\ln\left[\left(\frac{n_k + p_k}{N + 1}\right)\left(\frac{1}{p_k}\right)\right]$$

• Where n_k is the kth element of the alignment matrix, p_k = 0.25, and N is the number of sequences being compared (sum of elements in any column)

CSLC Alignment Matrix --> PWM
Column 1:
A: ln(((5+.25)/(21+1))/0.25) = -0.047
C: ln(((2+.25)/(21+1))/0.25) = -0.894
G: ln(((9+.25)/(21+1))/0.25) = 0.520
T: ln(((5+.25)/(21+1))/0.25) = -0.047
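The column-by-column conversion can be sketched in Java as follows. This is a minimal illustration, not the original program: the class name `PwmSketch` and method `toPwm` are hypothetical, and the background probability is fixed at 0.25 as on the slide.

```java
final class PwmSketch {
    /**
     * Converts a raw count matrix (rows A, C, G, T) into a position weight
     * matrix using ln(((n + p) / (N + 1)) / p) with p = 0.25.
     */
    static double[][] toPwm(int[][] counts) {
        final double p = 0.25; // uniform background probability, as on the slide
        int rows = counts.length;
        int cols = counts[0].length;
        double[][] pwm = new double[rows][cols];
        for (int j = 0; j < cols; j++) {
            int n = 0; // N: number of aligned sequences, i.e. the column sum
            for (int i = 0; i < rows; i++) {
                n += counts[i][j];
            }
            for (int i = 0; i < rows; i++) {
                pwm[i][j] = Math.log(((counts[i][j] + p) / (n + 1)) / p);
            }
        }
        return pwm;
    }

    public static void main(String[] args) {
        int[][] cslc = {
            {5, 21, 21, 21, 0, 0},  // A
            {2, 0, 0, 0, 0, 0},     // C
            {9, 0, 0, 0, 21, 0},    // G
            {5, 0, 0, 0, 0, 21}     // T
        };
        // Print one PWM row per base, three decimals per entry.
        for (double[] row : toPwm(cslc)) {
            StringBuilder line = new StringBuilder();
            for (double v : row) {
                line.append(String.format("%8.3f", v));
            }
            System.out.println(line);
        }
    }
}
```

Adding p_k to every count keeps zero counts from producing ln(0), which is why bases that were never observed in a column receive a finite negative weight.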

Position Weight Matrix

• Position weight matrices approximate binding affinities between proteins and DNA sequences

• Key difference: positive values correspond to higher binding affinities, while negative values correspond to lower binding affinities

Patser Algorithm

A  -0.047   1.352   1.352   1.352  -3.091  -3.091
C  -0.894  -3.091  -3.091  -3.091  -3.091  -3.091
G   0.520  -3.091  -3.091  -3.091   1.352  -3.091
T  -0.047  -3.091  -3.091  -3.091  -3.091   1.352

Random segment from an imaginary sequence: ACGATA

Patser Score for random segment: (-0.047)+(-3.091)+(-3.091)+(1.352)+(-3.091)+(-3.091) = -11.059

Patser Algorithm

• Since the imaginary sequence ACGATA has only one base in common with the logo, the Patser score is very low

• Because the score is so low, the segment is not a likely binding site for the CSLC transcription factor
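Scoring a single segment amounts to picking one PWM entry per position and summing them. A minimal Java sketch under that reading is below; the class name `SegmentScoreSketch` and its method are hypothetical, and the matrix values are the rounded CSLC weights from the slide.

```java
final class SegmentScoreSketch {
    // CSLC position weight matrix from the slide; rows are A, C, G, T.
    static final double[][] CSLC_PWM = {
        {-0.047,  1.352,  1.352,  1.352, -3.091, -3.091}, // A
        {-0.894, -3.091, -3.091, -3.091, -3.091, -3.091}, // C
        { 0.520, -3.091, -3.091, -3.091,  1.352, -3.091}, // G
        {-0.047, -3.091, -3.091, -3.091, -3.091,  1.352}  // T
    };

    /** Sums the PWM entry selected by each base of the segment. */
    static double score(double[][] pwm, String segment) {
        if (segment.length() != pwm[0].length) {
            throw new IllegalArgumentException("segment length must equal motif width");
        }
        double total = 0.0;
        for (int k = 0; k < segment.length(); k++) {
            int row = "ACGT".indexOf(segment.charAt(k)); // row index of this base
            if (row < 0) {
                throw new IllegalArgumentException("unexpected base: " + segment.charAt(k));
            }
            total += pwm[row][k];
        }
        return total;
    }

    public static void main(String[] args) {
        // The imaginary segment from the slide, then the best-matching segment.
        System.out.printf("%.3f%n", score(CSLC_PWM, "ACGATA"));
        System.out.printf("%.3f%n", score(CSLC_PWM, "GAAAGT"));
    }
}
```

With these weights, ACGATA sums to -11.059 as on the slide, while GAAAGT, which takes the largest entry in every column, reaches the matrix maximum of 7.28.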

Interleukin 22 (IL22)

• Located on 12th chromosome in humans (10th chromosome in mice)

• Involved in cellular inflammatory responses by initiating innate immune responses against bacterial pathogens, especially in respiratory and gut epithelial cells

Choosing a Region To Analyze

• Approximately 10,000bp before the beginning of the gene body, and 1,000bp into the gene body

• Region in front of promoter is thought to be regulatory in nature, and a site where transcription factors can bind

IL22 Region of Interest
Chr12: 66,918,269-66,929,282

Transcription factors for investigation (chosen by Ting and GeneCards)

• STAT3
• CSLC
• CSLH
• CSLL
• CMAF01
• RORA1
• RORA2
• AP1
• NFKappaB1
• STAT5A1
• STAT5A4
• STAT5B1

Standard Patser Output (first 5 scores for CSLC)

hg18_dna position= 1 score= -7.46 sequence= CTTTGT
hg18_dna position= 1C score= -6.62 sequence= ACAAAG
hg18_dna position= 2 score= -15.50 sequence= TTTGTG
hg18_dna position= 2C score= -7.46 sequence= CACAAA
hg18_dna position= 3 score= -11.06 sequence= TTGTGG
hg18_dna position= 3C score= -11.91 sequence= CCACAA
hg18_dna position= 4 score= -11.06 sequence= TGTGGA
hg18_dna position= 4C score= -11.06 sequence= TCCACA
hg18_dna position= 5 score= -14.94 sequence= GTGGAG
hg18_dna position= 5C score= -16.35 sequence= CTCCAC
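The output above pairs each position with a "C" line, which is the reverse complement of the same window. A sliding-window scan over both strands might look like the sketch below. It is an illustration only: the class `StrandScanSketch` and its methods are hypothetical, the input is assumed to contain only uppercase A, C, G, T, and the ten-base test sequence is pieced together from the first five windows listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class StrandScanSketch {
    // Rounded CSLC weights from the slide; rows are A, C, G, T.
    static final double[][] PWM = {
        {-0.047,  1.352,  1.352,  1.352, -3.091, -3.091},
        {-0.894, -3.091, -3.091, -3.091, -3.091, -3.091},
        { 0.520, -3.091, -3.091, -3.091,  1.352, -3.091},
        {-0.047, -3.091, -3.091, -3.091, -3.091,  1.352}
    };

    /** Sums the PWM entry selected by each base (uppercase ACGT assumed). */
    static double score(String segment) {
        double total = 0.0;
        for (int k = 0; k < segment.length(); k++) {
            total += PWM["ACGT".indexOf(segment.charAt(k))][k];
        }
        return total;
    }

    /** Reverses the segment and swaps A<->T and C<->G. */
    static String reverseComplement(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int k = s.length() - 1; k >= 0; k--) {
            sb.append("TGCA".charAt("ACGT".indexOf(s.charAt(k))));
        }
        return sb.toString();
    }

    /** Scores every window on the forward strand and on its reverse complement. */
    static List<String> scan(String sequence) {
        int width = PWM[0].length;
        List<String> lines = new ArrayList<>();
        for (int start = 0; start + width <= sequence.length(); start++) {
            String fwd = sequence.substring(start, start + width);
            String rev = reverseComplement(fwd);
            lines.add(String.format(Locale.ROOT, "position= %d score= %.2f sequence= %s",
                    start + 1, score(fwd), fwd));
            lines.add(String.format(Locale.ROOT, "position= %dC score= %.2f sequence= %s",
                    start + 1, score(rev), rev));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : scan("CTTTGTGGAG")) {
            System.out.println(line);
        }
    }
}
```

Scanning the complement strand matters because a transcription factor can bind in either orientation, so a site that scores poorly on the forward strand may still score well on the reverse.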

Top Scoring Positions

hg18_dna position= 1810C score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 2970C score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 3623 score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 4446C score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 4770 score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 7383C score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 8151C score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 9014 score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 9551C score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 10245 score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT
hg18_dna position= 10352 score= 7.28 ln(p-value)= -8.32 sequence= GAAAGT

STAT3 and STAT5 Binding Sites

CSLC, CSLH, CSLL, CMAF01 Binding Sites

AP1, NFKappaB1, RORA1, RORA2 Binding Sites

Conservation

• Patser could be improved by integrating conservation scores (obtained from UCSC Genome Browser)

• Two strategies: multiply the score at each position in the sequence by the average of all conservation scores, or multiply the score at each position by its corresponding conservation score

Conservation

• “Average Approach” is not likely to be effective, as it is an oversimplification of the conservation scores

• Essentially reports a scaled down version of Patser scores
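Why the average approach only rescales Patser scores can be seen by writing it out. The short derivation below assumes the average is taken once over all conservation scores in the region; the symbols \bar{c} (that average) and w_k (the PWM entry selected at position k) are introduced here for illustration and do not appear on the slides.

```latex
\[
\sum_{k=i}^{i+L} \bar{c}\, w_k \;=\; \bar{c} \sum_{k=i}^{i+L} w_k
\]
```

Because \bar{c} is the same constant for every window, each Patser score is multiplied by the same factor. For conservation scores between 0 and 1 that factor is non-negative, so the ranking of candidate sites does not change.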

Conservation

• Base-by-base conservation approach: More likely to be effective at identifying transcription factor binding sites

• New Patser formula:

$$\sum_{k=i}^{i+L} \left( \ln\left[\left(\frac{n_k + p_k}{N + 1}\right)\left(\frac{1}{p_k}\right)\right] \times c_k \right)$$

• c_k is the conservation score at position k

Example

• Imaginary Sequence
– ACGATA

• Imaginary Conservation Scores
– 1(A): 0.5
– 2(C): 0.25
– 3(G): 0.125
– 4(A): 0
– 5(T): 0.125
– 6(A): 0.25

Example

(A): (0.5 x -0.047)
(C): (0.25 x -3.091)
(G): (0.125 x -3.091)
(A): (0 x 1.352)
(T): (0.125 x -3.091)
(A): (0.25 x -3.091)

Σ = -2.34
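The base-by-base approach in this example can be sketched in Java as a weighted sum. This is an illustration under the assumptions stated on the slides, not the PatserCons.java source: the class `ConservationScoreSketch` and method `weightedScore` are hypothetical names, and the conservation values are the imaginary ones from the example.

```java
final class ConservationScoreSketch {
    // Rounded CSLC weights from the slide; rows are A, C, G, T.
    static final double[][] CSLC_PWM = {
        {-0.047,  1.352,  1.352,  1.352, -3.091, -3.091},
        {-0.894, -3.091, -3.091, -3.091, -3.091, -3.091},
        { 0.520, -3.091, -3.091, -3.091,  1.352, -3.091},
        {-0.047, -3.091, -3.091, -3.091, -3.091,  1.352}
    };

    /** Multiplies each selected PWM entry by the conservation score at that position, then sums. */
    static double weightedScore(double[][] pwm, String segment, double[] conservation) {
        if (segment.length() != pwm[0].length || conservation.length != segment.length()) {
            throw new IllegalArgumentException("segment, motif, and conservation lengths must match");
        }
        double total = 0.0;
        for (int k = 0; k < segment.length(); k++) {
            int row = "ACGT".indexOf(segment.charAt(k));
            if (row < 0) {
                throw new IllegalArgumentException("unexpected base: " + segment.charAt(k));
            }
            total += conservation[k] * pwm[row][k];
        }
        return total;
    }

    public static void main(String[] args) {
        double[] conservation = {0.5, 0.25, 0.125, 0.0, 0.125, 0.25};
        System.out.printf("%.2f%n", weightedScore(CSLC_PWM, "ACGATA", conservation));
    }
}
```

Note that a conservation score of 0 removes a position from the sum entirely, whether its PWM entry is favorable or not, as happens with the matching A at position 4 here.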

Methods for Generating Conservation Scores

• PhastCons Scores
– Uses a Hidden Markov Model to generate scores

• PhyloP Scores
– Reports scores derived from p-values

PhastCons Scores

• The Hidden Markov Model scores each base using the bases that precede and follow it

• Produces longer, clumped regions of conservation
• Scores range from 0 to 1

PhastCons Scores

Patser and PhastCons/Patser for CSLC, CSLH, CSLL

Patser and PhastCons/Patser for STAT3 and STAT5A

Patser and PhastCons/Patser for RORA1, NFKappaB1

Patser/PhastCons Scores: CMAF, STAT5B1, AP1

PhyloP Scores

• Determine scores on an individual, base-by-base basis

• Scores (for specific IL22 region) range from -5.799 to 3.225

• Scores are reported as negative logarithms of p-values
– Large, positive scores indicate conservation
– Negative scores indicate non-conserved regions

PhyloP Scores

Patser and PhyloP

• PhyloP scores cannot be used directly in Patser scoring system

• Direct use of PhyloP scores results in scores greater than 30
– Negative PhyloP scores were multiplied by negative PWM entries to yield large positive scores

PhyloP Score Adjustment

• All negative scores were assigned a conservation score of 0

• All positive scores were divided by the largest conservation score, rescaling them to a 0-to-1 range directly comparable to PhastCons scores

• Negative scores indicate positive selection, which is not of interest in this study
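The adjustment described above can be sketched in Java as a clamp followed by a rescale. This is a minimal illustration: the class `PhyloPAdjustSketch` and method `adjust` are hypothetical names, and taking the maximum over the supplied array (rather than some other reference value) is an assumption.

```java
final class PhyloPAdjustSketch {
    /**
     * Maps raw PhyloP scores onto a 0-to-1 scale: negative scores become 0,
     * positive scores are divided by the largest score in the array.
     */
    static double[] adjust(double[] phyloP) {
        double max = 0.0;
        for (double s : phyloP) {
            max = Math.max(max, s);
        }
        double[] adjusted = new double[phyloP.length];
        for (int k = 0; k < phyloP.length; k++) {
            // A positive score implies max > 0, so the division is safe.
            adjusted[k] = phyloP[k] > 0.0 ? phyloP[k] / max : 0.0;
        }
        return adjusted;
    }

    public static void main(String[] args) {
        // The extremes are the range reported for the IL22 region; the middle values are made up.
        double[] raw = {-5.799, 0.0, 1.6125, 3.225};
        for (double v : adjust(raw)) {
            System.out.printf("%.3f%n", v);
        }
    }
}
```

Clamping first avoids the sign problem noted earlier, where a negative conservation score times a negative PWM entry would contribute a large positive term.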

Adjusted PhyloP Scores

PhyloP / Adjusted PhyloP Comparison

Patser and PhyloP/Patser for CSLC, CSLH, CSLL

Patser and PhyloP/Patser for STAT3 and STAT5A

Patser and PhyloP/Patser for RORA1, NFKappaB1

Patser/PhyloP Scores: CMAF, STAT5B1, AP1

Conclusions

• Transcription factors for which Patser, Patser/PhastCons, and Patser/PhyloP identified sites in the same region: CMAF, AP1, CSLC
– Complete summary available as PDF on wiki page

• Results could be verified in wet-lab to test effectiveness of program

• If effective, program could provide quick means of identifying putative TF binding sites

Future Work

• Improvements on program efficiency
– Currently, the program PatserCons.java is run, the results are then saved as a text file, and the program TopScore.java organizes these results to display only the top-scoring sites

– PatserCons and TopScore could be condensed
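One way the two steps might be condensed is to keep only the best sites in memory while scoring, instead of writing every score to a text file first. The sketch below shows that idea with a bounded min-heap. It is an assumption about a possible design, not a description of PatserCons.java or TopScore.java, and the names `TopSitesSketch`, `Site`, and `topN` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopSitesSketch {
    /** A scored window: 1-based start position and its score. */
    static final class Site {
        final int position;
        final double score;

        Site(int position, double score) {
            this.position = position;
            this.score = score;
        }
    }

    /** Returns the n highest-scoring sites, best first, without storing all scores. */
    static List<Site> topN(double[] scores, int n) {
        Comparator<Site> byScore = Comparator.comparingDouble((Site s) -> s.score);
        PriorityQueue<Site> heap = new PriorityQueue<>(byScore); // lowest score at the head
        for (int i = 0; i < scores.length; i++) {
            heap.add(new Site(i + 1, scores[i]));
            if (heap.size() > n) {
                heap.poll(); // discard the current lowest so at most n sites remain
            }
        }
        List<Site> best = new ArrayList<>(heap);
        best.sort(byScore.reversed());
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {-7.46, -15.50, 7.28, -11.06, 1.0};
        for (Site s : topN(scores, 2)) {
            System.out.println("position= " + s.position + " score= " + s.score);
        }
    }
}
```

In a combined program, `scores[i]` would be replaced by whatever the scoring loop produces for each window, so the intermediate text file is no longer needed.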

Future Work

• Cleaning up code
• Possibly test effectiveness with A. thaliana project at PLU

Additional Work

• Bacterial plasmid preparation for Xiaoyun
• Collection of updated transcription factor binding profiles from JASPAR (available on wiki page)

Acknowledgments

• Everyone presently in the Wang Lab
• Former members: Kevin, Chris
• Ryan Christensen in Gary Stormo’s Lab
