Bioinformatics: October 2007

SNPs (Single Nucleotide Polymorphisms)

# More than 99% of human DNA sequences are the same across the population

# To be considered a SNP, a variation must occur in at least 1% of the population.

# SNPs, which make up about 90% of all human genetic variation, occur every 100 to 300 bases along the 3-billion-base human genome

# Two of every three SNPs involve the replacement of cytosine (C) with thymine (T)

# SNPs can occur in both coding (gene) and noncoding regions of the genome.

# Many SNPs have no effect on cell function, but scientists believe others could predispose people to disease or influence their response to a drug

Normalization

# Background Correction (Oligonucleotide arrays)
- The array is split into 16 rectangular zones
- Zone background is chosen to be the lowest 2% of intensities in each zone
- The background for each probe is computed as a weighted sum of the backgrounds of all zones
- The corrected probe values are calculated by subtracting the background
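
The steps above can be sketched in R roughly as follows. This is only an illustration under stated assumptions: the function name zone_background, the use of the mean of the lowest 2% as the zone value, and the inverse squared-distance weights with a smoothing constant of 100 are choices made for this sketch, and the intensity matrix is simulated.

```r
# Sketch of zone-based background estimation for a matrix of probe intensities.
# Assumptions (not from the notes): mean of the lowest 2% per zone, and
# weights 1 / (squared distance to zone centre + smooth).
zone_background <- function(x, n_side = 4, frac = 0.02, smooth = 100) {
  nr <- nrow(x)
  nc <- ncol(x)
  # assign every row and column to one of n_side bands (4 x 4 = 16 zones)
  row_zone <- cut(seq_len(nr), n_side, labels = FALSE)
  col_zone <- cut(seq_len(nc), n_side, labels = FALSE)
  zones <- expand.grid(zr = seq_len(n_side), zc = seq_len(n_side))
  # zone background: mean of the lowest 2% of intensities in that zone
  zones$bg <- mapply(function(zr, zc) {
    v <- sort(x[row_zone == zr, col_zone == zc])
    n_low <- max(1, floor(frac * length(v)))
    mean(v[seq_len(n_low)])
  }, zones$zr, zones$zc)
  # zone centres in row/column coordinates
  row_centre <- as.numeric(tapply(seq_len(nr), row_zone, mean))
  col_centre <- as.numeric(tapply(seq_len(nc), col_zone, mean))
  zones$cy <- row_centre[zones$zr]
  zones$cx <- col_centre[zones$zc]
  # background of each cell: normalized weighted sum of all zone backgrounds
  bg <- matrix(0, nr, nc)
  for (i in seq_len(nr)) {
    for (j in seq_len(nc)) {
      w <- 1 / ((i - zones$cy)^2 + (j - zones$cx)^2 + smooth)
      bg[i, j] <- sum(w * zones$bg) / sum(w)
    }
  }
  bg
}

set.seed(1)
x <- matrix(runif(40 * 40, min = 100, max = 1000), 40, 40)  # simulated intensities
bg <- zone_background(x)
corrected <- x - bg   # cells below the local background come out negative
```

Plain subtraction can give negative values for the dimmest cells, so a real pipeline needs some rule for handling them.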

# Normalization is necessary because the raw intensities of labeled targets vary among arrays due to sources of experimental variability that are independent of the level of expression
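
One simple way to remove such array-wide differences is scaling: multiply each array by a factor so that a robust average matches a common target. The sketch below is only an illustration; the function name scale_arrays, the target of 500, the 2% trimmed mean, and the simulated data are all assumptions made here.

```r
# Scaling normalization sketch: one factor per array (column), chosen so that
# the trimmed mean of every array equals the same target value.
scale_arrays <- function(x, target = 500, trim = 0.02) {
  sf <- target / apply(x, 2, mean, trim = trim)  # scaling factor per array
  sweep(x, 2, sf, `*`)
}

set.seed(1)
# two simulated arrays; the second is systematically brighter
raw <- cbind(a = rlnorm(1000, 6, 1), b = 1.8 * rlnorm(1000, 6, 1))
scaled <- scale_arrays(raw)
apply(scaled, 2, mean, trim = 0.02)  # both trimmed means now equal the target
```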

Twin Study

C: the number of concordant pairs
D: the number of discordant pairs

Pairwise concordance = C/(C+D)
Probandwise concordance = 2C/(2C+D)
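
As a quick check of the two formulas, here is a small R sketch. The function names and the pair counts are made up for illustration.

```r
# Concordance rates from counts of twin pairs.
# C = number of concordant pairs, D = number of discordant pairs.
pairwise_concordance <- function(C, D) C / (C + D)
probandwise_concordance <- function(C, D) 2 * C / (2 * C + D)

n_conc <- 20   # hypothetical count of concordant pairs
n_disc <- 30   # hypothetical count of discordant pairs

pairwise_concordance(n_conc, n_disc)      # 20 / 50 = 0.4
probandwise_concordance(n_conc, n_disc)   # 40 / 70, about 0.571
```

For the same counts the probandwise rate is never smaller than the pairwise rate, because each concordant pair is counted twice.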

Pedigree Analysis

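# pedigree() and its plot method are provided by kinship2,
# the successor to the archived kinship package
library(kinship2)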

# generate example data
id <- 1:14
dadid <- c(NA, NA, 1, 1, 1, 3, 5, NA, NA, 8, 8, NA, NA, 11)    # NA = founder
momid <- c(NA, NA, 2, 2, 2, 12, 13, NA, NA, 9, 9, NA, NA, 4)   # NA = founder
sex <- c(1, 2, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1)             # 1 = male, 2 = female
affected <- c(1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 1)
status <- c(0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0)          # 0 = alive, 1 = dead

# make it a data frame
ped <- data.frame(id, dadid, momid, sex, affected, status)

# build the pedigree object
pp <- pedigree(id = ped$id, dadid = ped$dadid, momid = ped$momid,
               sex = ped$sex, affected = ped$affected, status = ped$status)

# pedigree plot
plot(pp)

Microarray in general

### cDNA microarray - 2 color

Cy5: Red (experimental, mutant)
Cy3: Green (reference, wild-type)

M = log2(R/G)
A = {log2(R) + log2(G)}/2
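
A minimal R sketch of the M and A calculation, using made-up intensities for five spots (the values are hypothetical and assumed to be background-corrected already):

```r
# Hypothetical intensities for five spots.
R <- c(1200, 800, 150, 4000, 64)   # Cy5 (red) channel
G <- c(600, 800, 300, 1000, 64)    # Cy3 (green) channel

M <- log2(R / G)               # log ratio: 0 = no change, 1 = two-fold up in red
A <- (log2(R) + log2(G)) / 2   # average log intensity of the spot

data.frame(R, G, M, A)
```

Plotting M against A gives the usual MA plot, where intensity-dependent dye bias shows up as a trend away from M = 0.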

### "Long" oligo arrays
- two color and double-stranded
- 60-80 bp long

### Affymetrix arrays
- one color array and single stranded
- 25 bp

# probe: material that is purposefully placed on the array before the experiment
# target: material that is gathered from a sample
# hybridization: target material is put on the array, and targets stick to complementary probes

Comparison Analysis (Experimental vs Baseline arrays)

# Compare the difference values (PM-MM) of each probe pair in the baseline array to its matching probe pair on the experimental array.

# Before comparing two arrays, variations between the two experiments caused by technical and biological factors must be corrected by scaling, normalization, or robust normalization.

# Change p-value
Using the differences between PM and MM as well as between PM and background intensities, the Change p-value is calculated with Wilcoxon's signed-rank test.

# Change Call
Increase (I): p-value < gamma1
Marginal Increase (MI): gamma1 < p-value < gamma2
No Change (NC): gamma2 < p-value < 1-gamma2
Marginal Decrease (MD): 1-gamma2 < p-value < 1-gamma1
Decrease (D): p-value > 1-gamma1
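
The thresholds above map a change p-value to a call. A small R sketch of that mapping follows; the function name change_call and the gamma values used in the example are illustrative assumptions, not documented defaults. The notes use strict inequalities, so the handling of p-values that fall exactly on a threshold is also an assumption here (they go to the lower interval).

```r
# Map change p-values to change calls using the cut points
# 0 < gamma1 < gamma2 < 1 - gamma2 < 1 - gamma1 < 1.
change_call <- function(p, gamma1, gamma2) {
  stopifnot(gamma1 > 0, gamma1 < gamma2, gamma2 < 0.5)
  breaks <- c(0, gamma1, gamma2, 1 - gamma2, 1 - gamma1, 1)
  labels <- c("I", "MI", "NC", "MD", "D")
  # intervals are closed on the right; include.lowest keeps p = 0 in "I"
  as.character(cut(p, breaks = breaks, labels = labels, include.lowest = TRUE))
}

# illustrative thresholds only
change_call(c(0.001, 0.0025, 0.5, 0.9975, 0.9995), gamma1 = 0.002, gamma2 = 0.003)
```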

# Signal Log Ratio Algorithm
One-step Tukey's Biweight method
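
A one-step Tukey biweight is a robust average: values far from the median, measured in units of the median absolute deviation, get little or no weight. The R sketch below shows one common formulation; the tuning constant c = 5 and the small epsilon that avoids division by zero are assumptions of this sketch, and the input values are made up (for instance, per-probe-pair log ratios).

```r
# One-step Tukey biweight estimate of location.
tukey_biweight <- function(x, c = 5, epsilon = 1e-4) {
  m <- median(x)
  s <- median(abs(x - m))                    # median absolute deviation (unscaled)
  u <- (x - m) / (c * s + epsilon)           # scaled distance from the median
  w <- ifelse(abs(u) <= 1, (1 - u^2)^2, 0)   # bisquare weights, zero for outliers
  sum(w * x) / sum(w)
}

log_ratios <- c(1.0, 1.1, 0.9, 1.05, 0.95, 8)  # hypothetical values, one outlier
mean(log_ratios)            # pulled upward by the outlier
tukey_biweight(log_ratios)  # stays close to 1 because the outlier gets weight 0
```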

Single Array Analysis (Oligonucleotide expression arrays)

# Single stranded DNA, 25 bp

# 14~20 probe pairs for each gene

# Each probe pair has a Perfect Match (PM) and a Mismatch (MM)

# MAS 4.0 (Average difference) = average of PM-MM difference

# Low-level analysis: feature extraction, normalization, computation of expression indexes

# High-level analysis: t-test, ANOVA

# Discrimination Score
R = (PM - MM) / (PM + MM)

# Detection p-value by One-sided Wilcoxon's Signed Rank test
H0: E(R) = tau (default = 0.015)
Ha: E(R) > tau

# Detection Call
Present: p-value <= alpha1
Marginal: alpha1 < p-value <= alpha2
Absent: p-value > alpha2

defaults: alpha1=0.04, alpha2 = 0.06
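
The discrimination score, detection p-value, and detection call can be put together in a few lines of R using wilcox.test from the stats package. This is a simplified sketch: the function name detection_call and the PM/MM intensities are made up, and any extra filtering of probe pairs that a production implementation might do is left out.

```r
# Detection call for one probe set from its PM and MM intensities.
detection_call <- function(pm, mm, tau = 0.015, alpha1 = 0.04, alpha2 = 0.06) {
  r <- (pm - mm) / (pm + mm)   # discrimination score for each probe pair
  # one-sided Wilcoxon signed-rank test of the scores against tau
  p <- wilcox.test(r, mu = tau, alternative = "greater")$p.value
  label <- if (p <= alpha1) {
    "Present"
  } else if (p <= alpha2) {
    "Marginal"
  } else {
    "Absent"
  }
  list(p.value = p, call = label)
}

# hypothetical probe set with 16 probe pairs where PM is clearly above MM
pm_hi <- seq(500, 800, by = 20)
mm_hi <- rep(200, 16)
detection_call(pm_hi, mm_hi)$call   # "Present"

# hypothetical probe set where MM is always above PM
pm_lo <- seq(200, 350, by = 10)
mm_lo <- pm_lo + 50
detection_call(pm_lo, mm_lo)$call   # "Absent"
```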

# Signal Algorithm
One-Step Tukey's Biweight Estimate

If PM > MM, the probe pair is informative.
If PM < MM, the probe pair is uninformative, and an imputed value called the Idealized Mismatch (IM) is used in place of MM.

Saturday, October 27, 2007