Basic awk to tweak data files!

Making an Annotation File for Genes

I had the task to make a little annotation file (for use with vcftools) to annotate a big list of SNPs with genes, because
I don’t speak “SNP” language. This turned into having fun with awk and text files, and I wanted to share what I learned.

We could easily get chromosome positions using R, but I happen to know that my diabolical lab mate had so nicely provided a file with positions for 827 autism genes, and so we will use this file. Thanks Maude!! You will have to ask her if you want this file, but I’ll just give you a preview of what the format looks like:


Ensembl Gene ID,Gene Start (bp),Gene End (bp),HGNC symbol,Chromosome Name,EntrezGene ID
ENSG00000162068,2471499,2474145,NTN3,16,4917
ENSG00000275351,54832726,54848250,KIR2DS4,CHR_HSCHR19KIR_G085_A_HAP_CTG3_1,3809
ENSG00000271771,34158121,34158767,,5,
</pre>



And for our file to do gene annotations with vcftools, we want it to match this format, and we want to get it in this format


#CHR     FROM   TO      ANNOTATION 
1        12345  22345   gene1
1        67890  77890   gene2
</pre>


Time for some awk magic!!


awk -F"," 'NR!=1{print $2,$5,$3,$4}' OFS="\t" all_chrom_position.txt > AU-827.annot
</pre>


Let’s talk about the command above. The -F command tells awk that the file is separated by commas. The NR!=1 command says that we want to skip the first line (this is the old header). The {print $2,$5,$3,4$} says that we want to print the 2nd, 5th, 3rd, and 4th column. Finally, OFS=”\t” indicates an output field separator to print, and the final appendage of > AU-827.annot has all that screen output concat into the file “AU-827.annot”

Don’t freak out – putting the starting chromosome position first was intentional because we want to sort the file by that.


sort -t\t -nk1 AU-827.annot > AU-827-sorted.annot
</pre>


The -t\t tells sort that the file is tab delimited. The -nk1 tells to sort by the first column, and it’s a number.

Now we can put it in the correct order (if it even matters, might as well!)


awk -F"\t" '{print $2,$1,$3,$4}' OFS="\t" AU-827-sorted.annot > AU-827.annot
</pre>


Let’s get rid of the annotations for which there are no gene names (I don’t know if Maude meant to include these or not!)


awk -F'\t' '$4 != ""' AU-827.annot > AU-827-filter.annot
bgzip AU-827-filter.annot
</pre>


If you wanted to add a header, you could do it like this (although we don’t need to)


#CHR FROM TO ANNOTATION
awk 'BEGIN{print "#CHR\tFROM\tTO\tANNOTATION"}1' AU-827-sorted.annot > AU-827-header.annot

vim AU-827-header.annot

#CHR    FROM    TO      ANNOTATION
KI270752.1      144     268
MT      577     647     MT-TF
MT      648     1601    MT-RNR1
KI270724.1      1529    1607
</pre>


Omg. So awesome.

NOW we can annotate our vcf file with the genes! First, we need to get rid of the ones that we don’t have gene names for (sadface). We know that each line SHOULD have 4 things, so let’s get rid of ones that do not.


awk -F'\t+' 'NF == 4' AU-827-header.annot > AU-827-header-filter.annot
bgzip AU-827-header.annot

# Now convert to tabix (indexed) format:

tabix -p vcf AU-827-header.annot.gz
vcf-annotate -h

cat AU-merged.vcf | vcf-annotate -a AU-827-header.annot.gz -d key=ANNOTATION,ID=ANN,Number=1,Type=Integer,Description='GENE' -c CHROM,FROM,TO,INFO/ANN > test.vcf

#CHR FROM TO ANNOTATION
16 2471499 2474145 NTN3



This is where I stopped at this point, because the script writing algorithm in my brain told me I should do these steps in R instead, and I agreed!

Learning about VCF (variant call format) file analysis

If you just want to learn from the scripts I’m working on: SCRIPTS
And as a subset of this post, if you want to learn about using awk to work with text files

What is variant analysis?

We all know that our genes have a lot to say about phenotype: we spend countless time and money to search for genetic markers that could predict disease, and in the future we would want to be able to change these markers to prevent it. We also have different theories for how we think about the relationship between genes and disease. The common variant model of disease says that there are a bunch of variations in our genetic code that are pretty common, and it’s when you get a particular combination of these common variants that you develop disease. The rare variant model says that disease can be accounted for by a much smaller number of rare variants (perhaps that are only present in an individual or family.) I’m not going to state that one is more right than the other, however my opinion is that the rare variant model makes more sense in the context of diseases like autism for which we still haven’t found a “genetic signature” in the most typical places to look. Where are those places? We typically look at protein coding sequences of the genome (called the “exome”), across a bunch of single nucleotide polymorphisms (SNPs) [snpedia]. What are we looking at, exactly?

When we look at someones genetic code, you may remember that genes are made of the base pairs ATCG, called nucleotides. And you may also know that while we are all very similar in this sequence, there are random insertions, deletions, or other kinds of variation that happens. Is this good or bad? Well, a lot of the mutations don’t really do anything that is totally obvious to us. However, sometimes if a mutation changes the function of a gene (meaning that a protein is not coded, or is coded differently), this could lead to phenotypic changes (some of which might be linked to the disease!) So when we want to study diseases (and possible find explanations for them), we are interested in finding these mutations. So in a nutshell, “variant analysis” is looking at variations in our genetic sequence, and hoping to link those variations to disease.

What is a vcf file?

In biomedical informatics we are very good at making standard file formats. A vcf file is just a formatted text file in “variant call format,” meaning that it is full of information about variants! And here we are at this post. I need to learn how to parse a vcf file to find rare variants. First let me (broadly) set up the analysis that I want to do:

Define a threshold of “rare” to be represented in <1% of the population
Extract these rare variants
Filter down the set to only include deleterious or important
Map these variants to genes
Add in some (intelligent?) weighting, (to be thought about later)
DAVID (more on this later)

The first step is to get a vcf file. I am using one from our autism dataset.

Next, we should decide how to work with this file. It is again notable that the file is just a text file – there is nothing magic or compiled about it :) This means that if we wanted, we could write our own script with bash, python, R, etc. However, there are already tools that are made for this! Let’s download (and learn) the vcftools and a nice little package for working with big text tables called tabix.

First, get the vcftools with subversion:


svn checkout http://svn.code.sf.net/p/vcftools/code/trunk/ vcftools
</pre>


Note this means you can always then cd to your vcftools directory and type “svn update” to do exactly that. Once we have the tools, we need to compile for perl. CD to where you installed them:


cd /home/vanessa/Packages/vcftools

# Export the path to the perl subdirectory:  
export PERL5LIB=/home/vanessa/Packages/vcftools/perl/

# and (in the vcftools directory), compile!  
make  
</pre>


Now for tabix, download and extract/unzip to your Packages folder


cd tabix-0.2.6
make
</pre>


Now we need to add the tools to our path:


vim /home/vanessa/.profile

# Add vcftools to path
VCFTOOLS=/home/vanessa/Packages/vcftools
export VCFTOOLS
PATH=$PATH:$VCFTOOLS/bin

# Add tabix to path
TABIX=/home/vanessa/Packages/tabix-0.2.6
PATH=$PATH:$TABIX
export PATH

# (now exit)
</pre>


Now source the bash profile to adjust the path


source /home/vanessa/.profile
</pre>


and now when you type “which vcftools” you should see:


/home/vanessa/Packages/vcftools/bin/vcftools
</pre>


Good job!


# Why do we have vcf files?

Again, the variant call format (vcf) is a text file for storing gene sequence variations. Why did someone come up with it? Well, if we stored an ENTIRE genome, that is a MASSIVE file. Someone smart a while back figured out that we could store a “standard human genome” (we call this a reference), and then a second file (the vcf) that describes how one or more people vary from this reference. It starts with lots of header fields for meta information, and then is basically rows and rows of variants, each of which has meta information of its own! Let’s take a look at what this might look like. I’m going to steal this right off of wikipedia:

The junk at the top with ## before it are all meta (header) information fields. For example, if you were opening this file for the first time, you might want to know how it was filtered, and what different annotations in the file mean (e.g., AA == Ancestral Allele”).

When the double ## goes down to a single #, these are the field names that describe the following rows, including:

- CHROM: Chromosome number
- POS: Position on the chromosome
- ID: If (for example) a SNP has a particular ID (rs*)
- REF: This is the base that is found in the reference genome
- ALT: This is the alternate allele found in the sample’s genome
- QUAL: This is a quality score, called a “Phred quality score” http://en.wikipedia.org/wiki/Phred_quality_score, 10log_10 prob(call in ALT is wrong).
- FILTER: If the nucleotide passes all filters, it will say PASS. if it says something else, this is the filter that was not passed
- INFO: NS means “number of samples” and you should look up the link provided below to see what all the other acronyms mean!
- FORMAT: the stuff here explains what the numbers mean in the next set of columns (the samples). For example,
- GT:GQ:DP:HQ says that the first thing is a genotype, followed by a conditional genotype quality, a read depth, and haplotype qualities. For the genotype, the | means phased and / means unphased.


##fileformat=VCFv4.0
##fileDate=20110705
##reference=1000GenomesPilot-NCBI37
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS    ID        REF  ALT     QUAL FILTER INFO                              FORMAT      Sample1        Sample2        Sample3
2      4370   rs6057    G    A       29   .      NS=2;DP=13;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:52,51 1|0:48:8:51,51 1/1:43:5:.,.
2      7330   .         T    A       3    q10    NS=5;DP=12;AF=0.017               GT:GQ:DP:HQ 0|0:46:3:58,50 0|1:3:5:65,3   0/0:41:3
2      110696 rs6055    A    G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
2      130237 .         T    .       47   .      NS=2;DP=16;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:56,51 0/0:61:2
2      134567 microsat1 GTCT G,GTACT 50   PASS   NS=2;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3
</pre>


This is why you should not be afraid when someone mentions a vcf file. It’s just a text file that follows a particular format. I went over the basics, but if you want more details on the format can be [found here](http://www.1000genomes.org/wiki/Analysis/Variant%2520Call%2520Format/vcf-variant-call-format-version-41).


# Basic working with vcf files

We will first get comfortable working with the tools, and then write a bash (or python) script to achieve the functionality that we want. Let’s cd to where we downloaded the vcf file:


cd /home/vanessa/Documents/Work/GENE_EXPRESSION/tutorial/vcf
</pre>


Let’s get basic info about our vcf file:


vcftools --vcf AU-8001_1.vcf

VCFtools - v0.1.13
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
--vcf AU-8001_1.vcf

After filtering, kept 1 out of 1 Individuals
After filtering, kept 21749 out of a possible 21749 Sites
Run Time = 0.00 seconds
</pre>



# Filtering and Writing Files

We might want to filter down to a certain chromosome, a quality score, or just for a particular region. This is super easy to do!


# Here filter to chromosome 1, from base pairs 2,000,000 to 3,000,000
vcftools --vcf AU-8001_1.vcf --chr 1 --from-bp 2000000 --to-bp 3000000

VCFtools - v0.1.13
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
--vcf AU-8001_1.vcf
--chr 1
--to-bp 3000000
--from-bp 2000000

After filtering, kept 1 out of 1 Individuals
After filtering, kept 10 out of a possible 21749 Sites
Run Time = 0.00 seconds
</pre>


I’m not sure how you would a priori decide which bases to look at. Perhaps doing a search for a gene, and then deciding in this interface?

But the entire purpose of filtering is to reduce your data to some subset, so arguably we would want to write this subset to a new file. To do this, just add “–recode –recode-INFO-all”. The second part makes sure that we keep the metadata.


vcftools --vcf AU-8001_1.vcf --chr 1 --from-bp 2000000 --to-bp 3000000 --chr 1 --from-bp 1000000 --to-bp 2000000 --recode --out subset
</pre>


You can also pipe the output into a file, sort of like the above


vcftools --vcf AU-8001_1.vcf --chr 1 --from-bp 2000000 --to-bp 3000000 --chr 1 --from-bp 1000000 --to-bp 2000000 --recode -c > /home/vanessa/Desktop/shmeagley.vcf
</pre>


Two files are produced: a subset.log (containing the screen output you would have seen) and a subset.recode.vcf with the filtered data. If you want to pipe the data into your script (and not produce a gazillion new files) you can also do that pretty easily:


vcftools --vcf AU-8001_1.vcf --chr 1 --from-bp 2000000 --to-bp 3000000 --recode --stdout
</pre>


If you want to look through it manually, add | more like this:


vcftools --vcf input_data.vcf --diff other_data.vcf --out compare
</pre>


Or the equivalent | less will let you browse through it in a terminal like vim (and scroll up, etc.)


# Comparing Files

If I had a Mom and a Dad and child, or any people that I want to compare, I can do that too!


vcftools --vcf input_data.vcf --diff other_data.vcf --out compare
</pre>


You could again pipe that into an output file.


# Getting Allele Frequencies

You can also get allele frequency for ALL people in your file like this:


vcftools --vcf AU-8001_1.vcf --freq --out output
vim output.frq

CHROM POS N_ALLELES N_CHR {ALLELE:FREQ}  
 1 69511 2 2 A:0 G:1  
 1 881627 2 2 G:0.5 A:0.5  
 1 887801 2 2 A:0 G:1  
 1 888639 2 2 T:0 C:1  
 1 888659 2 2 T:0 C:1  
 1 897325 2 2 G:0 C:1  
 1 900505 2 2 G:0.5 C:0.5  
 1 909238 2 2 G:0 C:1  
 1 909309 2 2 T:0.5 C:0.5  
 1 949608 2 2 G:0.5 A:0.5  
 …
</pre>


I can imagine wanting to do this with MANY vcf files to get averages for a particular population. The above isn’t very interesting because I only have ONE person, so it’s essentially showing that for a particular chromosome, a particular position, there are 2 possible alleles at a given locus (N_ALLELES), the alternate and the reference, and we have data available for both of those at each site (N_CHR), the reference and our sample. Then the last two columns show the frequency counts for each. Let’s download another set of VCFs and see if we can merge them! I think the perl vcftools could accomplish this.


# Merging many VCF files into one

It looks like we are going to need to compress the files (using bgzip) and index (using tabix) before doing any kind of combination. Note that doing bgzip *.vcf is not going to work, so we need to feed in a list of files. Here is how to do that.


VCFS=`ls *.vcf`
for i in ${VCFS}; do
  bgzip $i
  tabix -p vcf $i".gz";
done

ls *.gz.tbi -1
AU-2901_1.vcf.gz.tbi
AU-7801_2.vcf.gz.tbi
AU-7901_2.vcf.gz.tbi
AU-8001_1.vcf.gz.tbi
AU-8401_2.vcf.gz.tbi
AU-9201_3.vcf.gz.tbi
</pre>


There they are! If you try to look at the file, it’s compiled (hence the efficient indexing) so it looks like gobbeltee-gook.


# Merging the files into one VCF!


vcf-merge *.vcf.gz | bgzip -c > AU-merged.vcf.gz

Using column name 'AU-26302_2' for AU-2901_1.vcf.gz:AU-26302_2
Using column name 'AU-7801_2' for AU-7801_2.vcf.gz:AU-7801_2
Using column name 'AU-7901_2' for AU-7901_2.vcf.gz:AU-7901_2
Using column name 'AU-8001_1' for AU-8001_1.vcf.gz:AU-8001_1
Using column name 'AU-8401_2' for AU-8401_2.vcf.gz:AU-8401_2
Using column name 'AU-9201_3' for AU-9201_3.vcf.gz:AU-9201_3
</pre>


Sweet! That seemed to work. We are making progress in the world! Now let’s try calculating the allele frequencies (for a more interesting result?)


gunzip AU-merged.vcf.gz
vcftools --vcf AU-merged.vcf --freq --out all-AU

VCFtools - v0.1.13
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
--vcf AU-merged.vcf
--freq
--out all-AU

After filtering, kept 6 out of 6 Individuals
Outputting Frequency Statistics...
After filtering, kept 46228 out of a possible 46228 Sites
Run Time = 1.00 seconds

CHROM POS N_ALLELES N_CHR {ALLELE:FREQ}
1 69270 2 2 A:0 G:1
1 69511 2 10 A:0 G:1
1 69761 2 2 A:0 T:1
1 69897 2 2 T:0 C:1
1 877831 2 10 T:0.2 C:0.8
1 879317 2 2 C:0.5 T:0.5
1 881627 2 12 G:0.333333 A:0.666667
1 887801 2 12 A:0 G:1
1 888639 2 12 T:0 C:1
1 888659 2 12 T:0 C:1
1 897325 2 12 G:0 C:1
1 900505 2 6 G:0.5 C:0.5
1 906272 2 2 A:0.5 C:0.5
1 909238 2 10 G:0.2 C:0.8
1 909309 2 4 T:0.5 C:0.5
1 909419 2 4 C:0.5 T:0.5
1 949608 2 8 G:0.5 A:0.5
1 949654 2 12 A:0.0833333 G:0.916667
...
</pre>


Right off the bat we can see that there are markers that we only have for a subset of the population (the ones that are not 12 – since we have 6 people if everyone has two, this means that the N_CHR would be 12?). I think more importantly in this file is the Allele Frequency – for example having A for 949654 is super rare! Again, this is calculating frequencies across our vcf files. I think if we are looking for variants in autism, for example, we would want to compare to a reference genome, because it could be that particular variants in an ASD population are more common.


# Exporting variants to a table

This vcftools business makes it so easy! Not only can I sort (vcf-sort) and calculate statistics over the files (vcf-stats) or create a custom format file (vcf-query), I can in one swift command convert my file to a tab delimited table:


bgzip AU-merged.vcf
zcat AU-merged.vcf.gz | vcf-to-tab > AU-merged.tab
</pre>



# Sequencing Depth

Here is how to get sequencing depth. Again, this would probably be useful to do some kind of filtering… I would imagine we would want to eliminate samples entirely that don’t have a particular depth?:


vcftools --vcf input_data.vcf --depth -c > depth_summary.txt

INDV N_SITES MEAN_DEPTH
AU-8001_1 21749 69.3184
</pre>


70 seems OK. If it were under 10, would that not be good enough? What is an acceptable depth?


# Linkage Disequilibrium

I don’t remember exactly what this is – I seem to remember Maude explaining to me that there are chunks on chromosomes that don’t get flopped around when DNA does it’s recombination thing, and so as a result of that you have sets of alleles / genes that tend to stick around together. So (I think) that doing this, we do a pairwise comparison of our data and find markers that travel together over time? And I (think) this is how you would do that:


vcftools --vcf AU-8001_1.vcf --geno-chisq
</pre>



# Putting this all together…

We now want to write a specific command to achieve our goal. What do we want to do again?

 1. Compile all vcfs into one  
 2. Filter down to calls with good quality  
 3. Identify rare variants (less than 1% of population, reference)  
 4. Write rare variants into a big matrix! (for further analysis in R)

Let’s do this, and write a script to perform what we want to do!


## [parseVCF.sh: A compilation of the stuff above to go from VCF –> tab delimited data file for R (currently in progress!)](https://github.com/vsoch/vvcftools/blob/master/parseVCF.sh)

Aug 28

Self Organizing Maps (SOM)

I have a thing for dimensionality reduction. One method in particular, the Self organizing map, is probably my favorite. I made this tutorial for my lab, but I thought it would be useful to share with the world of the internet! Click on the presentation, and then use your cursor to move through the slides. Here you can also see the presentation in a new window, or download a PDF. <iframe class="iframe-class" frameborder="0" height="650" scrolling="yes" src="http://vbmis.com/bmi/diabolical/SOM.html" width="100%"></iframe>
·
Jul 30

Randy Pausch: The Last Lecture

I want to draw attention to a video that most of us have seen at one time or another by an amazing professor of computer science at CMU, Randy Pausch. To say that this man was amazing does not do him justice. This talk is inspiring, touching, and humbling. While I recommend watching it in entirety, even if you’ve already seen it, here are some of my favorite things:

When there’s an elephant in the room, introduce him.
We can not change the cards we are dealt, just play the hand.
The brick walls are there to give us a chance to show how badly we want something. The brick walls are there to stop the people that don’t want it badly enough. They’re there to stop the other people :) When you’re screwing up and nobody is saying anything to you anymore, that means they gave up.
Experience is what you get when you get what you don’t wanted.
Wait long enough and people will surprise and impress you.
That’s one of the best things you can give somebody, the chance to show them what it feels like to make someone else excited and happy.
Never use the child-like wonder, it’s too important, it’s what drives us.
When it comes to men that are romantically interested in you, just ignore everything they say, and only pay attention to what they do.
You get people to help you by being honest, earnest.
Don’t bail, the best of the gold is at bottoms of barrels of crap.
Be good at something, it makes you valuable.
Luck is where preparation meets opportunity.

My favorite:

Don’t complain, just work harder.

And this seems comically relevant for all of us graduate students pursuing a PhD:

This is my son, he’s a doctor but not the kind who helps people.

I made a conscious decision to leave the following items of wisdom out of my list:

You can’t get there alone.
Have something to bring to the table, because that will make you more welcomed.

These statements obviously are getting at collaboration, something that is both necessary and frustrating for someone who sets incredibly high ular standards for her work. For the first, getting “there” also means that I must have some vision or goal to define “there.” I don’t spend a lot of time writing out careful goals or plans because I’m more proactive to just jump into what needs to be done. This makes me more concerned with enjoying “here.” While I agree that collaboration is essential for many larger things, I’ve had enough life experience of having completely relied on myself for things large and small, and I’m convinced that when I need to, I can do things (mostly) on my own. “Mostly” means that I am the catalyst, and look to internet or peer resources if I have a particular question. For the second statement, I need to address this table. I don’t know if dinner tastes better on it, but I’m much happier sitting in the quiet side room with my Thanksgiving plate, and enjoying listening in on the noise from the other room. Once you’re welcomed to the table, there is an expectation that you sit there, and I’d rather not. :)
·
Jul 27

Machine Learning in R

·

Create Cytogenetic Maps from Gene Lists

I have two or more lists of genes that I want to compare. If I compare strings directly, I could run into trouble if there is slight variation (e.g., SLC6A4 vs. SLC6A9). I could also look at functional annotation, but that still doesn’t give me a nice visualization of “how close are these two sets?.” What I basically want is a nice snapshot of where the genes are in relation to one another.

This is a job for… Cytogenetic Mapping!

A cytogenetic map historically means chromosomes stained under a microscope. Let’s be real. Most of us don’t work with microscopes, and thus a cytogenetic map is simply a mapping of something (genes, SNP’s, etc) to chromosomes. I would like to create these maps with d3 (to allow for interactive graphics), however a rule of thumb when you are a noob at anything is to start simple. So today we will do the following:

Start with gene lists
Look up chromosome id and position
Create a static cytogenetic map with an easy-to-use software solution

An important note! We will be mapping the **starting position of genes, since we are working with full genes and not SNPs. This means that the genes may take up more real estate on the chromosomes than the map indicates.

Writing a Script in R

R can do my laundry for me, so it sure as heck can do this simple task:


 library('biomaRt') 

# Set our output and present working directory setwd('/home/vanessa/Documents/Work/GENE_EXPRESSION/neurosynth/results/sigresults/') 

# Prepare biomart to look up chromosome information from gene # What marts are available? 
listMarts() 
# Get ensemble object 
ens = useMart('ensembl') 
# What databases are there? 
listDatasets(ens) 
# Get the human database
human = useMart('ensembl', dataset = 'hsapiens_gene_ensembl')
# What attributes can I get?
listAttributes(human)

# Each file has a single line of genes, separated by tab
genelists = list.files('visualization/genelis',pattern='*txt')

for (g in genelists){
  # The name of the file will be our phenotype label
  geneset = gsub('.txt','',g)
  genes = strsplit(readLines(paste("visualization/genelist/",g,sep='')),sep='\t')

  # This is the command to look up the chromosome name and position
  chr = getBM(attributes= c('hgnc_symbol','chromosome_name', 'start_position'), filters='hgnc_symbol',values=genes,mart=human)
  
  # Get rid of the weird ones
  idx = grep('PATCH|LRG|HSCHR|HG&',chr$chr)
  if (length(idx) != 0){
    chr = chr[-idx,]
  }
  
  # Add label column
  chr = cbind(chr,rep(geneset,nrow(chr)))
  colnames(chr) = c('annotation','chr','pos','phenotype')

  # Write single disorder to output file
  write.table(chr,file=paste('visualization/geneChromosomeTable',geneset,'.txt',sep=''),row.names=FALSE,sep='\t',quote=FALSE)  
}
</pre>



## Creating the Map

You can combine different gene lists from the output files if you want more than one phenotype in your map.  It’s then simply a matter of [using this tool to show you map](http://visualization.ritchielab.psu.edu/phenograms/plot).  It’s so easy! [![phenogramDisorderSubset](http://vsoch.com/blog/wp-content/uploads/2014/07/phenogramDisorderSubset-1024x682.png)](http://vsoch.com/blog/wp-content/uploads/2014/07/phenogramDisorderSubset.png)


## When is this a good idea?

I’m not great at visualization, but I want to be.  The first question I’ve been asking myself is “What do I want to show?” I think it’s easy to get sucked into wanting to make some crazy cool network without actually thinking about what you are trying to communicate to your viewer.  Don’t do that!  Once you know what you want to show, the second question is “Does this strategy achieve that?”  We already know we want to show physical proximity for different gene sets, if we can justify that closeness == relatedness.  With this in mind, when will the static cytogenetic map be a good visualization strategy?  It will be good when…

- **you have a small number of genes or SNPs**: too many things to plot means a visually overwhelming graphic that just isn’t that meaningful
- **you have more phenotypes: **The strength of this visual is to show the relationship between multiple phenotypes
- **chromosome position is meaningful: **if physical closeness isn’t useful or related to what you are trying to show, this is pretty useless

 


## How can I do better?

This is a good start, but I’d really like this interactive.  I’ll start by incorporating these static plots into a d3 visualization, and ultimately (hopefully) have these maps be integrated into the d3 as well.  As always, start simple, and grow from that! :)

Jun 08

Look up Probe IDs with Bioconductor

This is one of those “I’ll store this here for safekeeping” posts, specifically for a snippet of code to seamlessly translate between gene probe IDs.

I started doing genomic analyses at the beginning of the calendar year, 2014. Before delving in I was under the naive impression that genes existed in a cleanly defined set, each with a unique name, and of course, plenty of documentation. Little did I know such thoughts were fantastic! There are more probe IDs than I know what to do with. And of course the format that I’m currently working with is not the format that I need. This calls for easy ways to convert IDs between formats.

What did I do first? I found a few places to convert IDs. My browser would hang for hours on end, and sometimes return a result, sometimes not. Inputting smaller set sizes was arduous and annoying. I’m sure that’s where every poor noob soul browser goes to cough and die.

Silly me, silly me. The same can be accomplished programmatically, of course! Bioconductor is quite amazing.
```
# Install and load packages
source("http://bioconductor.org/biocLite.R")
biocLite(c("hgu95av2.db", "GO.db"))
library(AnnotationDbi)
library("GO.db")
library("hgu95av2.db")

# What kind of IDs can I choose from?
keytypes(hgu95av2.db)

# What does an ACCNUM Id look like?
head(keys(hgu95av2.db, keytype="ACCNUM"))

# Now GIVE ME DATA FRAME that goes from my ids (accnum) to the gene symbol (for gsea chip file)
test = select(hgu95av2.db, as.character(genes$GB_ACC), "GENENAME", "ACCNUM")

# What is the result?
head(test)
ACCNUM SYMBOL
1 AB000409 MKNK1
2 AB000463 SH3BP2
3 AB000781 KNDC1
4 AB001328 SLC15A1
5 AB002294 ZNF646
6 AB002308 SEC16A
</pre>
```
·
May 11

An interpretation of what constitutes a teacher

What is a good interpretation of what constitutes a teacher? That was the question of interest a few days ago in an email conversation between a friend and I. He was looking for an abstract answer, and I realized quickly that this was a case when I could take advantage of my tendency to be straightforward about things. I decided to frame the challenge by asking some basic questions.

What is a teacher when we are young?

On the most basic level, a teacher is some entity that demonstrates knowledge. When we are very little, teachers are people, including our parents, and our siblings. As kids the role expands to include these funny looking people that exist in strange classroom places. They draw with chalk, tell us to march in lines and mind our manners, and try to shove knowledge into us. How is this knowledge? It is a static set of facts and things to memorize, and we always can know the right answer. This time of life is passive in that learning is the responsibility of this other person, and not ourselves. The teacher when we are young is a person of authority that directs our attention, and makes choices for us.

What is a teacher when we are teenagers?

As we get older, the role of teacher takes on greater responsibility. We start learning less from our parents, and these teacher people (possibly) become friends, mentors, or annoying drill sergeants. At the beginning of college, all of a sudden the role is reflected back on us and we become responsible for some of our own learning. Teachers become our peers, our life experiences, and finally, a little bit of ourselves. As children we were showered in political correctness and told how wonderful we are, and in college we face the reality that failure is not only possible, it is definite. Coming to terms with that is a challenge in itself, and many just can’t. At some point, a subset of us crash into an epiphany like bullets through glass that there is strength in being able to take risks, fail, and get up to try again. During this time we learn how to learn, and how to be responsible humans and take care of ourselves. There are still authority figures and rules, however there is a shift from learning that the world is static, to realizing that everything is gray and must be challenged. Our little pyramids of certainty are questioned, and then collapse, and we must make new sense of the world.

What is a teacher when we are young adults?

Beyond college the transition is complete, and there no longer exists an entity that cares whether or not we learn a thing. It’s all up to us, and some people choose to learn nothing more, and to plaster their pyramids of knowledge into an again static structure. They might still be terrified of uncovering things that they don’t understand, or living in a world where questions are unanswered. They also might covet the abilities and achievements of others, and become addicted to what they can get and claim. The world isn’t fair, and they will devote their energy to complaining about that. The other type of individual has found empowerment in the uncertainty, and dives into the task to ask more questions and accumulate more knowledge than when someone was compelling them to. They take on the majority of their responsibility for learning, and perhaps even take on the responsibility for others. Rather than focusing on what they do not have, they identify what they want to achieve, and the careful steps needed to do exactly that. They are addicted to what they can build and become.

So, what constitutes a teacher?

So the interesting thing about the role of the teacher, for me, isn’t some specific quality in a person, outcome, or goal. It is the change between different states of taking ownership for learning, being aware of those states, and understanding the beauty and limits of one’s own understanding.
·
Apr 29

Generate RSS 2.0 Feed for Files in Directory

This was hard to find, so I want to share a little script for automatically generating an RSS 2.0 feed for files (with a particular extension) in a server directory. For example, you may have a set of photos, videos, or audio files that you want to share! Here are steps for customizing it for your server:
1. Create a directory, and add your files
2. Download the script
3. Change the $allowed_ext variable to include extensions of files you want to include in the feed
4. Change the $feedName, $feedDesc, $feedURL, and $feedBaseURL to your server paths
5. Drop the file in your http://www.mysite/directory
6. Navigate to http://www.mysite/directory to see the feed’s xml
Done! Now you can subscribe via your favorite RSS service, or check that your feed is valid.
·
Apr 24

Introduction to Gephi

This is a guest post by my diabolically awesome lab-mate Maude, who is very experienced using the software Gephi to make beautiful network visualizations.

##

How to use Gephi ?

Here I try to describe the procedure I used in a few steps, to help you to start with this pretty easy and awesome software.

##

1/ Download Gephi

(no you don’t need to donate)
(well you can)

##

2/ What does it do ?

It is an Open Source Software for Exploring and Manipulating Networks . It uses a force-directed graph drawing algorithms based on a linear-linear model (attraction and repulsion proportional to distance between nodes) . The (attraction,repulsion)-model of ForceAtlas has an intermediate position betweenNoack’s LinLog (technical university of Cottbus) and the algorithm of Fruchterman and Rheingold. Basically: the higher is the number that linked 2 nodes = the edge, the closer they will appear on a graph. The smaller, the more apart the nodes will be.

##

3/ File format.

The format that works well and I use is a matrix, separated by semi colon ( check for format error with wrangler is a good way to avoid a lot of problem, needs to be “unix” and the first column and raw needs to be empty, and make sure you don’t have any spaces in the names)

You can basically apply this software for any matrix. But I think it is more interesting to determine the correlation between variables based on solid correlation stats, and with the network making some sense (positive/negative correlations…..) Below is an example.

##

4/In the example of Vanessa:
1. For each gene set that were significantly enriched ( 4 different ones) Vanessa gave the list of genes with the ES value (=enrichment score) which gives an idea of the ranking of the genes in the list when the GSEA is performed.
2. Which mean that we keep all the genes in the gene sets, but we attribute them their ES.
3. Perform Shapiro test: here, out of 200 genes 12 were not normal , so I kept going (5% ? not bad)
4. Then use a Pearson test to determine if there was a correlation. And corrected the p-value ( FDR, ade4 package, local variation)
5. Build a matrix.
Which value did I put in the matrix?
- I built it with all the correlations tested = symmetric
- I appended the r scare of the Pearson tests performed
- I brought to zero the r scare of the correlations that where not significant after p value correction
- I added in front the sign of the r value of the pearson ( negative or positive regressive slope)
That way : the lowest value is -1, the highest +1, and the non correlated zero, which is what Gephi is trying to take in account.

What is the purpose of this ?

To see which genes are together at the beginning of the list of at the end, or if some genes are at the end when some other are at the beginning i.e. they are negatively correlated.
·
Mar 17

Analysis of Affymetrix .CEL Files

I’ve been working with microarray expression data from the Broad Connectivity Map, and of course decided to download raw data (to work with in R) over using their GUI. The raw data is in the form of “.CEL” files, which I believe are produced by the Affymetrix GeneChip System. What in the world? Basically, we can evaluate mRNA expression of thousands of genes at once, which are put on one of those cool chips. I’m not going to claim to know anything about the preparation of the DNA - that’s for the wet lab folk :). What is important to me is that this machine can scan the array, analyze the images, and spit out expression profiles For more details, here is the reference I was reading. For Affymetrix methods (implemented in Bioconductor), I will reference this document.

Instead of going through the pain of formatting code for this blog, I created an RMarkdown and published on RPubs. So much easier! Note that toward the end I filtered my data based on the affymetrix IDs that I had entrez IDs for (to submit to gene ontology (GO), and you of course don’t have to do this, and can cluster without filtering.

Analysis of Affymetrix .CEL Files
·
Feb 06

Use Foursquare API to Map a Resource

I wanted to test out the foursquare API, so I made a simple goal of creating a “resource density map” for a search term of interest. Given my lab’s expertise, I chose the search term “autism.” I first downloaded a file with all major US cities, and then familiarized myself with the foursquare API. You need to create an account and register an API, and then use your client ID and secret when you make calls. I wrote a pretty basic script to query for search term “autism” for these cities, and then save to a file for import into a Google Fusion Table (for easy mapping, you could map the data any way that you like!)

·
Jan 04

Why Are Journal Articles So Boring?

Anyone who claims that reading journal articles is exciting is probably fibbing. It’s not that the research itself is boring, but rather the way that it is communicated. Part of my brain is eager to understand the method and the result, and the other part is yawning and polishing some neurons. So, why not just change the writing style of journals to be more like science news?

We are unfortunately stuck like a blueberry in a surprising pancake. On the one hand, the style of a journal is a reflection of the authors in it. Researchers adopt a professional writing style to be viewed as professional themselves. On the syrupy side of this pancake, domain-specific jargon can limit the audience with which the research is shared. It follows, then, that some desire to maintain a reputation (either the journal or the researcher), flies against the goal of the publication process entirely, which is to disseminate the findings to as many individuals as possible. However, given the stringent access rules and ridiculous costs to wheedle down the market to only the biggest of research institutions, I’m guessing that journals don’t mind the second point so much. They leave it to the science media to translate cool findings to the rest of the world! And thus, we are stuck in a static, black box-reality where (most) journals are professional, and dry.

This is exactly the conundrum that went through my mind late one evening last week, and what I did next is prime evidence for the direct relationship between tiredness and silly thinking. You are probably anticipating some solution to the problem stated above, but unfortunately what I’m about to share is completely useless. I started to replace random words with much better choices (why, they didn’t test their experimental manipulation with a control group, they used a flock of angry penguins!), and was a little bit giddy and pleased with my silly result. I then decided that a good use of my time would be to create Pubmed Abstract… Mad Libs! And so, I did :o)

##

PubmedLib

This little page queries Pubmed for an abstract based on a search term,

It then parses the abstract to determine parts of speech, and randomly selects words for you to fill in:

Then, you get to see your silly creation, with words in red. That’s pretty much it :o)

As you can see, completely useless, other than maybe some practice for me using Pubmed’s API. I should probably do some real work now, hehe :)
·
Jan 04

Star Wars, Origami, and Photoshop

Pew, pew pew!

·
Jan 04

Haricots V

A moment of awareness, overwhelming for a small pea
an ephemeral glimpse of what it might be like, to be.
The world is so harsh for one that is so small.
The world is too busy to listen for your call.
Many miss you, your gentle nature they fail to see.
But I’ll never miss you, you are special to me, my Haricots V.

·
Dec 29

Introduction to the Genome Analysis Toolkit (GATK) I

The Genome Analysis Toolkit (GATK) is a nice software package for the analysis of sequence data. With the development of the Allen Brain Atlas and the desire to do analysis that spans imaging and genetics, I’ve been waiting for the perfect storm (or this is a good thing, so let’s say the perfect sunny day) to teach myself this software and associated methods. I am following documentation from the Broad Institute to write this. Let’s jump right in!

Installation of the Genome Analysis Toolkit

You definitely want to do this on a linux machine. You could use a virtual machine from Windows, or the pathetic cygwin, but I highly recommend just sticking with linux. First, register on the GATK website and download the software. There login system is a little buggy - my credentials kept getting rejected, but then when I opened a new tab, I was logged in. You can extract the software to some nice folder where you have packages. Next, check your version of java. As of the end of 2013, you need java 1.7 for the software to work:
```
# Check your version of java
java -version

# You should see something like this
java version "1.7.0_25" Java(TM) SE Runtime Environment (build 1.7.0_25-b15) Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)
```
If you don’t, then first download JDK version 1.7 from oracle, and follow the installation instructions detailed on the page. I chose to install under /usr/local/java, e.g.,
```
# From the downloads directory
mkdir /usr/local/java
tar -zxvf jdk-7u-linux-x64.tar.gz
/usr/local/java
```
If you want to permanently add this version of java to your path (to overwrite your old version, then you can edit your .bash_profile (or in linux mint it is just called .profile in the home folder). You can either add the following to this file, or just type it in a console for a one-time session:
```
PATH=/usr/local/java/bin:$PATH
export $PATH
```
I didn’t do either of the above, actually, I decided to make an alias (also in my .profile), so I could call this version by typing “java7” and the old version would remain as “java”
```
# In your bash profile
alias java7='/usr/lib/jvm/java-7-openjdk/bin/java'

# Back on the command line, resource your environment
. .profile
```
And then typing java7 or java will run the correct version.

Testing the Genome Analysis Toolkit installation

Installation just means that you unzipped it to a folder. Go to that folder, and run (using either your java7 alias or just java):
```
code
```
If you see help text, you are golden! If you see this as the first line of an error message:
```
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/sting/gatk/CommandLineGATK : Unsupported major.minor version 51.0
```
you are still using the wrong version of java.

Getting Started with GATK

The raw data files that we are working with are called (FASTQ) and they are the output of, for example, an Ilumina Genome Analyzer. You can read about the format, but largely it’s a text file that lists the nucleotides that are present in the many snippets of DNA cut apart by the machine (my advisor said the typical length is ~35-75 base pairs (bp)), and there are billions. Each snippet of the read has 4 lines: an identifier, the read itself, another identifier/description line, and a quality line. A simple goal of this software is to put these pieces back together, taking into account missing reads, the potential for error, and data quality.

Data Preprocessing

This is where we go from our raw FASTQ files to what are called BAM files, which is a file that has all the snippets put back together, ready for analysis. For my fastq file (sometimes abbreviated .fq) I went into the “resources” directory of GATK, and decided to use “sample_1.fastq.” The initial step of mapping your fastq reads to a reference genome is not done with GATK. So let’s go through the following steps:

The steps involved are:
1. Mapping and Marking Duplicates
2. Local Realignment Around Indels
3. Base Quality Score Recalibration (BQSR)
4. Data Compression with ReduceReads (optional)
5. Variant discovery: from reads (BAM files) to variants (VCF files)
6. Suggested preliminary analyses
Don’t worry, I haven’t a clue what most of these steps are, so we are going to learn together! I have chosen a numbered list instead of bullets because they must be done in this order!

Mapping to a Reference Genome with the Burrows Wheeler Aligner

The basis of many of these analyses is mapping our reads in our fastq file to what is called a “reference genome.” A reference genome is basically an approximated template sequence for a particular species or human subtype (eg, European vs African). If you imagine taking a population of individuals and lining up their sequences, the reference genome might get the most highly agreed upon letter in each position. Even if our fastq snippets don’t have perfect overlap with this reference, the overlap is similar enough so that we can use the reference to correctly order our snippets. For this aim, we are going to be using the Burrows-Wheeler Aligner, which is suggested by GATK.

Download and Install Stuff!

My first strategy was to find different examples of reference genomes and random reads off of the internet, and I quickly learned that this was a bad idea. A more experienced colleague of mine summed it up nicely: “Don’t waste good analyses trash.” Apparently, there is quite a bit of sketchy data out there. My next strategy was to look at the Broad Institute resource bundle.

You basically connect to their FTP server, and have lots of files to choose from. At the highest level is a zip called “tutorial_files.zip.” This has a reference genome, fasta file, and a few .vcf file with known SNPs:
```
dbsnp_b37_20.vcf
dbsnp_b37_20.vcf.idx
dedupped_20.bam
Homo_sapiens_assembly19.1kg_pilot_indels.vcf
human_b37_20.fasta
indels_b37_20.vcf
indels_b37_20.vcf.idx
tutorial_files.zip
```
I don’t know that much about genetic analyses, but think that the dbsnp.vcf and indels.vcf and Homo_sapiens*.vcf files are all known SNPs, and the .fasta file is our reference. I would also guess that the “20” refers to the chromosome that the reads are for. the dedupped_20.bam file is of reads that have already been processed (mapped to the reference and duplicates removed). Since we aren’t given a file of raw reads (and this preprocessing is important) - I also downloaded the exampleFASTA.fasta.zip in the “exampleFASTA” folder. We can learn preprocessing using this file, and then switch over to the provided bam file, since we haven’t a clue what the exampleFASTA is.

Now, let’s install BWA. Let’s also grab the “samtools,” which stand for “Sequence Alignment/Map” Tools, which is the file format that is used to store large sequences of nucleotides. What isn’t totally obvious is that you also need to install “htslib” https://github.com/samtools/htslib as a part of samtools:
```
# Install BWA
git clone https://github.com/lh3/bwa.git
cd bwa
make

# Get SAM Tools
git clone git://github.com/samtools/samtools.git

# Get htslib (put in same directory as samtools top folder)
git clone git://github.com/samtools/htslib.git
cd htslib
make

# Go back into samtools, then compile
cd ../samtools
make

# If you get this error, it can't find your htslib installation:
Makefile:53: ../htslib/htslib.mk: No such file or directory
make: *** No rule to make target `../htslib/htslib.mk'. Stop.
```
You can either put it where it expects it (as instructed above), or see line 52 of the MakeFile and change the path to where you installed it. We also need the Picard tools to sort the reads by coordinate. Download and unzip the software to your directory of choice.

Now let’s start using these tools! First, here is how you can add the directory to any set of tools to your path, so it’s easy to call:
```
# I usually just cd to the directory
cd /path/to/bwa
PATH=$PATH:$PWD
export PATH
```
You can also add that to your .profile (or .bash_rc) if you want it to happen at startup.

Index your reference genome

Have you heard of a hash table in computer science? You can think of it like a dictionary in python or your language of choice - it’s a data structure that maps keys to values, or makes it easy to compute an index into some array, and each index is unique. I imagine that we would need to do something like this for the reference genome, because it’s such a beastly file. Which algorithm should we use? The default algorithm is called “IS”:
```
IS linear-time algorithm for constructing suffix array. It requires 5.37N memory where N is the size of the database. IS is moderately fast, but does not work with database larger than 2GB. IS is the default algorithm due to its simplicity. The current codes for IS algorithm are reimplemented by Yuta Mori.
```
what we would want for whole genome (the hg19.fa file) is BWT-SW:
```
bwtsw Algorithm implemented in BWT-SW. This method works with the whole human genome.
```
This is local sequence alignment! We learned about this in Russ’ BMI 214 class! From the documentation, the BWT-SW algorithm seeds alignments with maximal exact matches (MEMs) and then extends seeds with the affine-gap Smith-Waterman algorithm (SW). Note that if you use whole genome this step can take a couple of hours, and you need 5GB of memory:
```
# Index your reference genome
bwa index -a bwtsw human_b37_20.fasta

# If you do it like this, you will use the default IS algorithm:
bwa index bwtsw human_b37_20.fasta

[bwa_index] Pack FASTA...
```
This creates a truck ton of files (.amb,.bwt,.ann,.pac,.sa) that I’m thinking are for BWA to use to perform the alignment. LOOK UP THESE FILES. When the above finishes, we now want to make a simple file that lists the indexes. Remember the samtools?:
```
samtools faidx human_b37_20.fasta
```
The output of this ends in “*.fai” which I think stands for “fasta index,” and the script doesn’t produce any interesting output in the terminal window. If we look at the file, each line has a contig (“a set of overlapping DNA segments that together represent a consensus region of DNA”) name, size, location, basesPerLine, and bytesPerLine. Here is the fa index file for chromosome 20 (human_b37_20.fasta.fai):

My, that’s a tiny file! I also want to show the index for the whole genome (hg19.fa), just to demonstrate that we have many more chromosomes:

Now let’s jump back to the Picard tools and create a sequence dictionary, “human_b37_20.fasta.dict” by running the following:
```
java -jar CreateSequenceDictionary.jar \
REFERENCE=human_b37_20.fasta \
OUTPUT=human_b37_20.dict
```
This produces some job output to the screen as well. I can imagine if I were setting this up in a massive processing pipeline, I would want to direct all this screen output to some job file, for checking in the case of error. Here is a snapshot of what the dictionary looks like:
```
@HD VN:1.4 SO:unsorted
@SQ SN:20 LN:20000000 UR:file:/home/vanessa/Packages/bwa/data/test/human_b37_20.fasta M5:26585f45bc49d1acf5efd030c442ef0f
```
@HD specifies the header line. VN is the version. SO means “sorting order.”
@SQ specifies the start of the sequence dictionary. SN is the reference sequence name (20), and LN is the reference sequence length. UR is the URI of the sequence, which is a uniform resource identifier, which is like a unique ID that can be used for many kinds of web resources. Here is the whole genome dictionary (just to compare):

(Finally) Mapping to a Reference Genome with the Burrows Wheeler Aligner

Now let’s use BWA to create our SAM, or Sequence Alignment/Map file. I highly stress to read the README.md file, which tells us about the different algorithms that can be used to match data to a reference genome (BWA-backtrack, BWA-SM, and BWA-MEM). BWA-MEM is recommended as the latest for high quality queries. For this example, our reference is called “human_b37_20.fasta.”
```
# Generate a SAM file containing aligned reads - 1 file
bwa mem -M -R '@RG\tID:exampleFASTA\tSM:sample1' human_b37_20.fasta exampleFASTA.fasta > aligned_reads.sam
```
The “-R” argument is the “read group,” for more information seem the SAM file spec: http://samtools.sourceforge.net/SAMv1.pdf. The only required fields seem to be the ID and SM (sample). If you had more than one reads file, each would need a unique ID. Thanks to my friend “Boots” for pointing this out to me :)

If you don’t specify the read string, then your SAM file won’t have read group information, and you will need to add this later (I include this step in the walkthrough below.) I would recommend doing it correctly from the start, but if you don’t, I go over the “fix it” step later.
```
# Convert to BAM, sort and mark duplicates
java -jar SortSam.jar \
INPUT=aligned_reads.sam \
OUTPUT=sorted_reads.bam \
SORT_ORDER=coordinate
```
This is the first step that appears to produce a compiled file (e.g., I can no longer open something coherent in my text editor). As a reminder, we started with a bunch of unordered reads, and we first aligned them to a reference genome, and then we sorted them based on their coordinate. Are we done with preprocessing? Not quite yet! During sequencing, the same DNA molecule can be sequenced multiple times, meaning that we could have duplicates. This doesn’t do much to help us, so let’s get rid of these duplicates. This process is called “dedupping.”
```
# Create "dedup_reads.bam" file - same as before, but without duplicates
java -jar MarkDuplicates.jar \
INPUT=sorted_reads.bam \
OUTPUT=dedup_reads.bam \
M=metrics.txt
```
What in the world is in the metrics file? It looks like a job file that keeps a log of the duplicates. Finally, as a last step in pre-preprocessing, we need to create an index file for the BAM file:
```
java -jar BuildBamIndex.jar \
INPUT=dedup_reads.bam
```
We are now done with pre-preprocessing!

Local Realignment Around Indels

In neuroimaging analysis, dealing with noise is like brushing your teeth in the morning. The same seems to be true for sequence alignment. An indel is a term that we use to describe either an insertion or deletion. When we do this initial mapping, a read that aligns with the edge of an indel can look like single nucleotide polymorphisms (SNPs), but it’s just noise. This realignment procedure aims to correct these mapping artifacts. This step is basically running Smith-Waterman over the entirety of an aggregated BAM file, however the documentation notes that if you have a super large number of BAM files, you can run the script in “-knownsOnly” mode, which will clean up sites known in dbSNP or 1000 genomes. You might also want to do this if you are worried about false positives (eg, tagging a SNP as noise when it’s not!). Most of an individual’s known indels will be found here, and of course any novel sites will not be cleaned. If your lab has a list of known indel sites (called “gold_indels.vcf”) you can specify for GATK to use it (see below), or what I did is use the .vcf files provided in the tutorial zip. This is also where we want to use their dedupped_20.bam, otherwise you get an error message about having a file with a mismatched number of bases and qualities. Don’t forget to first index the bam file:
```
java -jar BuildBamIndex.jar \
INPUT=dedupped_20.bam
```
And now
```
java -jar GenomeAnalysisTK.jar \
-T RealignerTargetCreator \ # the analysis type
-R human_b37_20.fasta \ # our reference sequence
-I dedupped_20.bam \ # input sequence alignment map file
-known dbsnp_b37_20.vcf \ # known indel sites
-known Homo_sapiens_assembly19.1kg_pilot_indels.vcf \
-known indels_b37_20.vcf
-o target_intervals.list
```
At this point, if you don’t have the read group specified, the GATK will give you this error:
```
...SAM/BAM file dedup_reads.bam is malformed: SAM file doesn't have any read groups defined in the header.
```
Oops. This is where not specifying the read group information when I first use bwa came back to bite me in the butt! If you do have this read information, skip the next two steps. If not, let’s see if we can fix this… thank goodness picard has a nice little script to do just that:
```
java -jar AddOrReplaceReadGroups.jar \
INPUT=dedup_reads.bam \
OUTPUT=addrg_reads.bam \
RGID=group1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1
```
I wasn’t entirely sure where to get this kind of information for a random fq file that I downloaded. I’m guessing it would be in the documentation from wherever the data is sequenced. So, I just used the dummy information above. I then created a fixed BAM index file:
```
java -jar BuildBamIndex \
INPUT=addrg_reads.bam
```
So now we have another bam index file, with the same name but appended with “bai.” Now, you would try the RealignerTargerCreator command from above once more, but with the addrg_reads.bam file.

I’ll admit to you that I’ve gone through this sequence of steps muliple times, and finally when I didn’t get an error message on this step I wanted to dance! And guess what? It’s tracking your every move too… sent to Amazon S3!
```
INFO 15:46:30,009 GATKRunReport - Uploaded run statistics report to AWS S3
```
Our output is a list of intervals that are deemed to need realignment, spit out in a text file called “target_intervals.list.”

Now we should take this list… and do the realignments! Let’s compare to the command above to better familiarize with GATK. The -T argument obviously is specifying the analysis type, R is always going to have our reference, I the input, and the rest are self explanatory:
```
java -jar GenomeAnalysisTK.jar
-T IndelRealigner
-R human_b37_20.fasta
-I dedupped_20.bam
-targetIntervals target_intervals.list
-known dbsnp_b37_20.vcf
-known Homo_sapiens_assembly19.1kg_pilot_indels.vcf
-known indels_b37_20.vcf
-o realigned_reads.bam
```
The output of this step is “realigned_reads.bam,” or a file similar to the original dedup_reads.bam, but cleaned up! Now we can move on to the next step.

Base Quality Score Recalibration (BSQR)

As we noted earlier, one of the lines (for each read) in these sequence files (both the raw and BAM/SAM files) is an indicator of the quality of the read. This is a super important metric, and unfortunately it is subject to error as well. We are going to use methods from machine learning to make it so that these quality scores are closer to representing the probability of an error in matching to our reference genome. For more details, see here: http://www.broadinstitute.org/gatk/guide/article?id=44. I recommend scrolling down to the plots to see the “before” and “after” shots of the quality score distribution. First we are going to figure out what patterns of covariation we have in our sequence dataset:
```
java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R human_b37_20.fasta \
-I realigned_reads.bam \
-L 20 \
-knownSites dbsnp_b37_20.vcf
-knownSites Homo_sapiens_assembly19.1kg_pilot_indels.vcf
-knownSites indels_b37_20.vcf
-o recal_data.grp
```
This is an initial model, and we then create a secondary model that looks for remaining covariation after recalibration:
```
java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R human_b37_20.fasta \
-I realigned_reads.bam \
-L 20 \
-knownSites dbsnp_b37_20.vcf
-knownSites Homo_sapiens_assembly19.1kg_pilot_indels.vcf
-knownSites indels_b37_20.vcf
-BQSR recal_data.grp \
-o post_recal_data.grp
```
The -BQSR points GATK to our previously-made file with the recalibration data. To make these analyses a little less “black box,” let’s look at the base quantities before and after recalibration:
```
java -jar GenomeAnalysisTK.jar \
-T AnalyzeCovariates \
-R human_b37_20.fasta \
-L 20 \
-before recal_data.grp \
-after post_recal_data.grp \
-plots recalibration_plots.pdf
```
This didn’t work the first time I tried it - I got an error about the RScript. I figured out that I was missing some libraries in R, so here is how to install them. I had to download an older version of gplots from the archive: http://cran.r-project.org/src/contrib/Archive/gplots/
```
install.packages('ggplot2')
install.packages('gtools')
install.packages('gdata')
install.packages('caTools')
install.packages('reshape')
install.packages('gplots_2.8.0.tar.gz',repos=NULL, type="source")
install.packages('gsalib')
```
If installing those doesn’t work for you, try running the script on your own to debug:
https://github.com/broadgsa/gatk-protected/blob/master/public/R/scripts/org/broadinstitute/sting/utils/recalibration/BQSR.R

If you run the AnalyzeCovariates with the flag “-l DEBUG” you can see where the output .csv file is put, which you would need for the R script. I’m glad that it finally worked for me and I didn’t have to do that! My plots generally looked like this, which looked ok, so I moved on to applying the recalibration to my bam file:
```
java -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R human_b37_20.fasta \
-I realigned_reads.bam \
-L 20 \
-BQSR recal_data.table \
-o recal_reads.bam
```
This “recal_reads.bam” file now has quality scores that are more accurate. It looks like the old ones get written over, but if you are an anxious personality type and don’t want to lose them, use the “-emit-original-quals” flag to keep them in the file with the tag “OQ.” We are now done with recalibration, and can move on to variant discovery! Note that if you have large data, you might want to use the “ReduceRead” tool to compress your data before moving forward:

Variant Discovery

What does variant discovery mean? It means finding sites in your reads (the bam file) where the genome is different from the reference, and suck out those sites to see what the difference is. This process contains two steps:
- variant calling (designed to maximize sensitivity, or the true positive rate),
- variant quality score recalibration (VQSR, filters to improve specificity, true negative rate).
Before we run this: a few notes. The default mode is called “DISCOVERY,” and this means we choose the most probable alleles in the data. If you are only interested in a certain set of alleles, change that arg to “GENOTYPE_GIVEN_ALLELES” and then pass a .vcf file with the “-alleles” argument. I don’t have a good sense for the confidence thresholds, so I’m going to use the default suggested in the documentation:
```
java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R human_b37_20.fasta \
-I recal_reads.bam \ # reduced or not
-L 20 \
--genotyping_mode DISCOVERY \
-stand_emit_conf 10 \
-stand_call_conf 30 \
-o raw_variants.vcf
```
When I was running the above, I saw this message:
```
WARN 13:25:31,915 DiploidExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 20:13967957 has 8 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument
```
Despite the enormous runtime of the above command, I decided to run it again, using the max_alternate_alleles argument. That would give me two files to work with, raw_variants.vcf, and raw_variants_maxaa.vcf.
```
java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R human_b37_20.fasta \
-I recal_reads.bam \ # reduced or not
-L 20 \
--genotyping_mode DISCOVERY \
-max_alternative_alleles
-stand_emit_conf 10 \
-stand_call_conf 30 \
-o raw_variants_maxaa.vcf \
```
Both of these files have alleles (SNPs, insertions, and deleltions) from our sequence that are different from the reference genome. Let’s take a look! I always recommend using vim to look at these - it doesn’t freak out with the big file size like gedit does:

As a reminder, we just maximized sensitivity, meaning that we aren’t going to miss real variants. However, in that nice little .vcf file, we probably have some false positives too, and need to do some filtering to reduce that number of false positives. For the below, I returned to the Broad Institute resource bundle, and downloaded (and unzipped) the following files into a “ref” folder:
```
1000G_omni2.5.b37.vcf
1000G_omni2.5.b37.vcf.idx
dbsnp_b37_20.vcf
dbsnp_b37_20.vcf.idx
hapmap_3.3.b37.vcf
hapmap_3.3.b37.vcf.idx
Mills_and_1000G_gold_standard.indels.b37.vcf
Mills_and_1000G_gold_standard.indels.b37.vcf.idx
```
I downloaded the index files as well, but you could also make them like we did earlier. This is one of those “I’m not entirely sure if I’m doing the right thing, but I think so,” things, after reading the documentation here.
```
java -jar GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R human_b37_20.fasta \
-input raw_variants.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 Mills_and_1000G_gold_standard.indels.b37.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_b37_20.vcf \
-an DP \
-an QD \
-an FS \
-an MQRankSum \
-an ReadPosRankSum \
-mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
-recalFile recalibrate_SNP.recal \ # recalibration report with data
-tranchesFile recalibrate_SNP.tranches \ # contains quality score thresholds
-rscriptFile recalibrate_SNP_plots.R
```
Then apply the recalibration:
```
java -jar GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R human_b37_20.fasta \
-input raw_variants.vcf \
-mode SNP \
--ts_filter_level 99.0 \
-recalFile recalibrate_SNP.recal \
-tranchesFile recalibrate_SNP.tranches \
-o recalibrated_snps_raw_indels.vcf
```
This didn’t actually work for me, and those files are indeed the ones they suggest using. I think it’s because my file is too small, so instead I’m going to apply hard filters, first to my SNPs, and then to the indels:
```
# This is for the SNPs
java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R human_b37_20.fasta \
-V raw_variants.vcf \
-L 20 \
-selectType SNP \
-o raw_snps.vcf
```
That made a file with SNPs from the original file of raw variants, and now we need to apply a manual filter. Here I am again going to use the suggested values:
```
java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R human_b37_20.fasta \
-V raw_snps.vcf \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filterName "vanessa_snp_filter" \
-o filtered_snps.vcf
```
Now if we look at the filtered_snps.vcf file, we can see PASS labels for the ones that passed the filter, and if it didn’t pass, it says “vanessa_snp_filter.” Next, let’s do the same for the indels (insertions and deletions) from the call set. You’ll notice that we are again using the “SelectVariants” option, but instead of “SNP” for the select type, we are choosing “INDEL”:
```
java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R human_b37_20.fasta \
-V raw_HC_variants.vcf \
-L 20 \
-selectType INDEL \
-o raw_indels.vcf
```
Now we have the file “raw_indels.vcf” that just has the insertions and deletions from the original file of raw variants. Let’s again apply manual filters:
```
java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R human_b37_20.fasta \
-V raw_indels.vcf \
--filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \ --filterName "vanessa_indel_filter" \ -o filtered_indels.vcf
```
We are done filtering! At this point it’s time for preliminary analysis, and next comes analysis. Since this post is quite a bit long, I’m going to end it here, and will do preliminary analysis in a second post.

Other Resource for Reference Genomes

Other than the GATK Resource Bundle, available via FTP, we can download reference genomes from either Ensemble or the UCSC Genome Browser. From the USCS Genome Browser, if you click on Human –> Full Data Set, and then scroll down, you will see many links for zipped files that end in “fa.” It looks like there are multiple choices, and obviously you would choose based on your particular data. To test, I chose the hg19 2bit file, and I found this application for converting it to fa, as recommended by some colleagues. In the case that you want whole genome (using the hg19.fa file) here is how to convert 2bit to fa:
```
chmod u+x twoBitToFa
./twoBitToFa hg19.2bit hg19.fa
```
That’s all for today!
·
Dec 22

3D Web Models with Three.js and Blender

An important part of being a researcher is visualization. It’s hard to present meaningful work if you don’t have good methods to display your data, and although most of us are pretty decent with plotting in Matlab and R, I would still say that it’s an undervalued skillset. There is also a pretty solid barrier between our research environments (eg., R or Matlab), and the places that people go to learn (eg, the internet), although recent services like Plot.ly are bringing down that wall.

If we think for a minute about the broad range of expertise that would be required to be both awesome at research and all web visualization technologies, it’s quite a bit. There are people with entire careers devoted to making web pages, dynamic web content, or animation, so how is the researcher to fit in this expertise on top of normal work, paper writing, and teaching? I think I’ve probably laid out the problem well enough. While we can’t be the best at all these technologies, I think that it’s important to minimally stay on top of the field, and do enough little projects over time so that when push comes to shove, you can build what you need. This is the rationle for my little bit of fun today! It might be a little over-analytical for what comes down to just another thing I wanted to try (and waited for Christmas break to do so). My end goal is to build an in browser racing game using WebGL, which is a JavaScript API that can talk to your graphics card from the browser, and render beautiful content. I don’t have a research use-case in mind yet, but I can imagine it would be super awesome to have a data or brain rendering in some future online toolkit that I might develop. Since I haven’t a clue what I’m doing, I thought that I’d start very simple - let’s create a floating 3D object in the browser. Then we can figure out how to get it under the user control, and animating some road and background. That’s a racing game, right?

Step 1: Prepare a Vector Graphic

Back from my days of doing models in Maya, I remember this website TurboSquid where you can download lots of free models. That isn’t so helpful though, because I would want models of my own images and data. So for this learning tutorial, it’s important to make our own models from scratch. First, let’s choose an image and turn it into a vector graphic using Inkscape on linux. For this purpose, I am calling on my Mr. Catsup:

Now, download Inkscape, which is the Linux (free!) version of Adobe Illustrator. Adobe has recently done away with all single licenses, and users are required to pay ridiculous monthly fees for its “Creative Cloud.” It’s amazing what companies will do to make money, and ironically it means that they are likely going to lose a huge customer base. Here is how to turn the graphic into a vector graphic:
1. Open up your image.
2. Select the image, and go to Path –> Trace Bitmap
3. In the Trace Bitmap window, check “remove background”
4. Click Update to see the result, and tweak the settings and thresholds until you like it.
5. Click OK, and now we have a vector graphic! Be sure to delete the original graphic (in the layer underneath), and save as an .svg file.
Here is our vectorized Mr. Catsup:

Step 2: Create a 3D Model

If we somehow got a 2D image floating in a browser, that would just be OK. I’d really like to extrude him a bit, and give him some depth! With a little searching, I stumbled on something amazing - there is an open source (free!) modeling software (akin to Maya) for linux, called Blender! This is so exciting! First, download blender, extract it to a folder on your machine, and get it started by running ./blender. The learning curve on this one is a little steep, only because there are so many buttons, and if you don’t have a mouse like me, you are forced to try zooming, panning, and moving around the model space with the key and mousepad alone.

Once in blender, go to File –> Import, and select your svg. Note that the new workspace opens with a cube, and you should delete it, and you’ll likely need to zoom in to see your tiny svg. This is what my imported graphic looked like at first - I didn’t see anything!

To zoom in, click shift+B and then click and drag to select the origin area (where your imported graphic is):

It will then look something like this:

We would do much better looking at it from the top down. Click on the View menu in the bottom left, and select “Top”

Also note that three buttons to the right of this, there is a dropdown menu, and we are currently in “Object Mode.” Object mode seems to be where you can grab entire pieces and move them around, versus “Edit mode,” which allows you to select individual vertices, edges, meshes, etc.

There he is! Now, let’s make him bigger. First, let’s snap the origin (the thing with arrows coming out of it) to our geometry, so when we transform it uses his center as the vanishing point. Click on “origin” in the menu to the left, and then “Origin to geometry.” Then press a to select him (a is the general select/unselect all button), and the selection is indicated by orange lines around the entire model. Then, click on “scale” in the “Object Tools” menu on the left. When I did this, he turned white, and moving my cursor made him dynamically get bigger and smaller, but it wouldn’t allow for making him as large as I liked. Instead of moving your cursor around, just left click (so he’s orange again), and then notice the “Resize” panel with “X,Y,Z” boxes in the panel to the left? I put in 10 for each of these, and he was nicely sized:

Now we need to zoom out, and if you don’t have a mouse, this is possible with holding the middle keypad button, and then moving up and down. We now need to extrude him. Change from “Object Mode” to “Edit mode,” and then press B, click and drag around his entirety. This will select all of his vertices! And you should note if you press A you can select and deselect, and had you selected a smaller region, you would only select that subset of vertices. Once he is selected, look at the menu on the right, and find the tab that has two points connected by a line. This is the “Curves” panel. Find the “Geometry” section, and then enter a value between 0 and .2 under extrude:

We are almost done! We now need to convert him to a mesh. If we were to export him as is, it would be an empty json file (I know because I learned the hard way!). We also can’t do anything with textures if he doesn’t have skin. To convert to mesh, first go into object mode, and hit Alt+C, and select to convert to mesh:

For now, we won’t give him a texture, because I haven’t yet figured out the UV mapping, and I think we can color him with webGL.

Step 3: Export your Model

We need to enable an extra module in Blender to export to three.js, which is basically a JavaScript WebGL library for dummies.
1. Download three.js somewhere on your computer. Find the folder utils/exporters/blender/2.65/scripts/addons/io_mesh_threejs and move the entire folder into the directory /blender*/2.69/scripts/addons
2. Then in Blender go to File --> User Preferences, search for “three” (you may need to click the refresh button in the bottom right), and then click the check box to the right of the plugin to enable it. Close user preferences.</span>
3. Then to export, in blender go to File --> Export --> Three.js (.js) Done! Now we have our model! Open the file in a text editor to make sure that it isn’t an empty object. If it is, make sure that you converted to a mesh, and possibly try selecting the entire thing before exporting. I didn’t need to restart my blender, but if you are having trouble, a restart is a good idea.
Step 4: Use three.js to Animate the Model

First, create a directory for your site, copy your ketchup.js (the exported model) into a “models” folder, and create another directory called “js.” In the JS folder, copy the following files from three.js into your js folder:
```
threejs/examples/js/controls/OrbitControls.js
threejs/build/three.min.js
```
The secret to learning anything that you haven’t a clue about is starting a template, and then tweaking. For this aim, I found a nice example from the guys at Treehouse, and I didn’t do very much in the way of editing, because I was too eager to get it in the browser. As long as your directory structure is ok for the js and models folder, and you change the name of the model, it should be good to go! Also keep in mind that WebGL doesn’t work in all browsers - on my linux it didn’t work in Chrome, but it worked in firefox. To see the code, just go to either of my final pages below, right click –> View Source.

Here is the finished Mr. Catsup!
And I also made another one called “brain map” that is rather puntastic. If my brain were a globe with water and continents, it would look something like this.

What Next?

When I want to learn something, it’s good to start with basic goals, and then slowly add complexity. A good “next” step would be to figure out texture mapping, and use three.js to do simple user controlled movements. Then, toward my car racing goal, I’d want to animate a background and road. That will be for a later time, however, because I want to work a little bit on something else now :)
·
Dec 21

Happy Holidays from Goggles Optional!

From our podcast family to yours, wishing you a merry holiday season, 2013!

A Nature Christmas from Goggles Optional on Vimeo.
·
Dec 17

Connecting to MySQL Databases in R 3.0.1 using RODBC

I want to connect to MySQL databases from R. It’s easy to do in Python, but nothing is more painful than trying to get RMySQL working in R for Windows. This is possibly punishment from some higher god for having an installation of Windows, period. Don’t hate, Linux gods, your Gimp and Inkspace just aren’t up to par! As a solution, I stumbled on the RODBC package, which I actually got working. I followed the steps here, which I will re-iterate specially for Windows 7:

1. Install R

Obviously, first you need R installed for Windows. I have version 3.0.1, run via RStudio. If you want/need to install multiple versions, you can still use the Windows executable, and check “Full Installation,” and then “customized startup.” One of the boxes asks if you want to save the version into a Windows registry - I’m guessing that you wouldn’t want to check that. To switch between versions in R-Studio, just hold down the Control key when it’s booting up, or once in RStudio, go to Tools –> Options –> General, and the “R Version” selection box is right at the top.

2. Install RODBC

Either launch Rgui.exe (in your start menu or under ”C:\Program Files\R\R-2.15.2\bin\i386\Rgui.exe,” or launch R-Studio. If you’ve used R before, this is probably all set up. Now type
```
install.packages('RODBC')
```
3. Configure MySQL

You first need a MySQL Connector/ODBC. I’m not a Windows database expert, but I have MySQL Workbench installed, and know that the guts to connect to a MySQL server with ODBC is slightly different. You can download MySQL Connector/ODBC from here, or if you already have MySQL Workbench, there is a “MySQL Installer” in the start menu under “MySQL” that will allow you to select “MySQL Connector/ODBC.” I chose the MSI-installer, the ansi version. Install away, Merrill!</span>

4. Adding your data source

This was a little confusing at first, because I’m used to specifying credentials from within R or python, as opposed to having them stored in my computer. You actually need to add your different connections to the “ODBC Data Source Administrator,” which I found under Control Panel –> Administrative Tools –> Data sources (ODBC). The first tab that you see is “User DSN.” A DSN in a Data Source Name, or one of your connections. Thank you, Microsoft, for the excess of random acronyms that make things seem much more complicated than they actually are :)
1. First hit “Add,” and select “MySQL ODBC 5.2a Driver.”
2. When you hit finish, it will open up a “MySQL Connector/ODBC Data Source Configuration” dialog.
3. Nope, you can’t leave it empty! Enter the hostname under TCP-IP, and your username and password. You can specify the credentials from within R, which is probably a better idea, but you should enter them now to test the connection.
4. Come up with a nice name for it under “Data Source Name.”(DSN!)
5. I would hit test to test the connection. For me it worked the first time!
6. There are additional options under “Details,” if you desire to tweak.
7. Press OK. You are done.
5. Connecting from R

First, load the package, and use the odbcConnect function (with the first argument being the dsn) to connect to your database:
```
library(RODBC)
dsn = 'candyDatabase'
conn = odbcConnect(dsn)
query = sqlFetch('candyDatabase.TABLE')
```
If you have trouble with the case (eg, the sqlFetch tells you it cannot find “candyDatabase.table,” then in your initial connection you need to specify the case varible:
```
ch = odbcConnect('candyDatabase',case="nochange") # other options include toupper and tolower
```
There are beautiful examples of how to work with your database in the documentation for ODBC. To see, type the following in R:
```
RShowDoc("RODBC", package="RODBC")
```
Happy databasing! :)
·
Dec 15

NDAR on the Cloud

The National Database for Autism Research (NDAR) is moving toward a cloud-based data service (the standard now is to log in and download data with a java applet). What does this mean? It means that you can either get data packaged and sent to you via Amazon S3, or NDAR will make a database instance for you, also from your Amazon account.

Setup

You first need an AWS account (free for Amazon users), as well as having an account and access to NDAR (quite a bit of work, but hands down one of the best databases out there).

Create an Amazon EC2 Security Group

To make your database instance, you have to give NDAR permission. This is where an EC2 security group comes in, and following the proper steps to take, log into your AWS Management console:
- Open the Amazon EC2 console
- In the navigation pane on the left, click “Security Groups”
- Click Create Security Group, the big grey button at the top
- The name of the security group should be “NDAR,” and the description can be anything
Open up a Port for NDAR

Next, we need to open up a port for inbound connections from NDAR. Select the security group NDAR by clicking on it, and look at the bottom of the page to see details about the group:
- Click on the “Inbound” tab
- Next to “create a security rule,” “Custom TCP Rule” should be selected
- Under “port range,” add “1521”
- Click “Add Rule”
- Click “Apply Rule Changes”
###

Contact NDAR!

If you want to arrange a meeting for someone to help you through the next steps, you can submit a help request. You can also give it a go on your own by:
- Clicking on the register tab (it will ask you to login) to see this screen:
- You can find your Amazon Web Services Account Number in the top right of the Manage your Account portal.
- The External ID is just a tag to identify the connection, and it doesn’t really matter what it is
- If everything is set up correctly, when you click “Attach AWS Account,” it will tell you that it’s been successfully added.
- Next, query NDAR as you normally would, and create a data package.
- Instead of launching their java applet, click on the “Launch to NDAR Hosted Database” button.
- Note: you can also go directly to the miNDAR tab and select an old data package you’ve created.
- In the next screen, you will be asked to select a “Dataspace Number.” NDAR will give you three, meaning that you can create three database instances:
- Select an empty slot (eg, Empty 2), and then click “Create New Password.” The password that you create will be needed to connect to your instance, so don’t lose it. Also note that the requirements are very stringent: I had to do it multiple times before the “Launch Cloud Database” button was un-grayed. If it seems buggy, try clicking “reset” and trying again.
- You will then see the loading icon of doom while the instance is created. Be patient - it can take a long time:
When it finishes, the “Loading…” will go away, and you will see something similar to the following:

Your Oracle database is now ready. You can log in at mindar.xxxxxxxxxx.xx-xxxx-1.rds.amazonaws.com Port xxxx service MINDAR. You have been assigned the username: flast, flast2, and flast3. Use the corresponding login for the dataspace and the password created before deployment to log in.

That’s it! You can now connect to your database from your software of choice. Keep in mind that you might either have an SQL or an Oracle database, and those two things differ in how you connect to them. I specifically asked for SQL, and so I am connecting using the MySQLdb module in Python, or SQL Workbench for quick graphical stuffs. Command line always works too. If you have an Oracle database, I didn’t try this, but it looks like you can use the cx_oracle module for python.

###

What about the Data Dictionary?

Ideally, the data dictionary would come in one of the tables. I don’t think that this is (yet) a standard, but if you ask it seems like something that is possible to do. Also keep in mind that each behavioral data has the variable names in the second row, and that you can find complete data dictionaries for all metrics on the ndar website. I was fully prepared to be downloading a gazillion text files and creating my own database data dictionary, but thank goodness I didn’t need to!
·