For genetic analyses, we use a program called Plink to read subject data from a .ped/.map file pair and pull out the SNPs that we are interested in. Alongside each SNP, Plink reports the family and individual IDs of the participants and the allele combo that each participant has. Getting the output of this program (a strangely formatted .list file) into Excel columns used to be a cumbersome process, followed by some custom (likely error-prone) procedure to match the many family and individual identifiers with whatever identifier is used in the study. This script cuts that process, which might have taken someone an hour, to under a minute!
This script reads in a .list file created with the software program “Plink” and lets the user select the SNPs of interest, along with an optional subject ID lookup table and an optional gene reference table. The script matches the study IDs with the correct participants and appends the gene identifier to the name of each SNP. The output is a .csv file for easy import into Excel, SPSS, etc. The output file is named after the list file, with the date appended.
Running the Script
- Save the script to your directory of choice, and prepare a .list file, and optionally, a subject lookup table and a genetic reference file.
- Either add the location of the script to your MATLAB path with addpath(genpath('mypath')), or navigate to the folder that contains the script before running it.
- To run, simply type “Plink_View” on the MATLAB command line. You can either run with no input arguments, or specify your listfile, lookup table, and gene reference file as arguments 1, 2, and 3, respectively. Here are your choices:
Plink_View – will be prompted for listfile, lookup, and ref
Plink_View(listfile) – will not have gene names and other_IDs
Plink_View(listfile,lookup) – will not have gene names
Plink_View(listfile,lookup,ref) – will have gene names and other_IDs
**listfile**: Must be a .list file generated by Plink. Each row lists the gene identifier (the leading field, which Plink's own file format documentation describes as the chromosome code), the SNP name, the allele combo, and then pairs of family ID followed by individual ID. The script expects the various allele combos for each SNP (11, 12, 22, 00).
**lookup**: OPTIONAL. Must be an Excel file (.xls/.xlsx) with two columns. The individual identifier should be in the first column, and the “other_ID” in the second column. The user can choose whether to look up by individual ID, family ID, or Family_Individual ID (the concatenation of the two).
**ref**: OPTIONAL. An rs_gene_reference Excel (.xls/.xlsx) document, formatted so that the first column has the SNP name and the second column has the gene symbol. If the user specifies a gene reference file, the gene symbol is appended to the SNP name in the output table. If not, the output table will only have the SNP name.
How does it work?
1. The script starts by evaluating how much information the user has provided. In the case of zero arguments, it prompts the user for all of the input variables (giving the choice to add the lookup table and gene reference table, since those two are optional!). If input arguments are provided on the command line, it goes right into step 2.
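As a rough illustration of this argument handling, the sketch below maps however many arguments were supplied onto the three named inputs. It is not the script's actual code: the variable names and the file name 'study.list' are hypothetical, and in a real function the cell array `args` would come from `varargin`.

```matlab
% Hypothetical call: only the list file was supplied.
args  = {'study.list'};
names = {'listfile', 'lookup', 'ref'};

% Start with every input empty, then fill in the ones that were given.
inputs = struct('listfile', '', 'lookup', '', 'ref', '');
for k = 1:numel(args)
    inputs.(names{k}) = args{k};
end

% With zero arguments, every input would be requested interactively
% (for example with uigetfile); otherwise missing optional inputs stay empty.
prompt_for_all = isempty(args);
```

Here `inputs.lookup` and `inputs.ref` remain empty, which corresponds to the Plink_View(listfile) case: no other_IDs and no gene names in the output.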
2. The script reads the second column of the list file, which always holds the SNP names. Each SNP is repeated once per allele combo, so the script reduces the column to a list of unique SNPs and presents it to the user in a popup GUI. The user selects the SNPs from the list to include in the output file.
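One way this step could look is sketched below, using rows taken from the sample .list excerpt in this document. The variable names are illustrative, and the dialog call is left as a comment because it needs an interactive session.

```matlab
% Rows from the sample .list excerpt (one row per SNP per allele combo).
rows = { ...
    '7 rs16 12 941 1 8955 1 9705 1 7220 1 9232 1 5650 1 698 1', ...
    '7 rs16 00', ...
    '7 rs8 11 7680 1 7468 1 3862 1 4992 1', ...
    '7 rs8 12 8955 1 9519 1 5922 1 4346 1 8796 1 8905 1 8702 1'};

% Collect the second field of every row, which is the SNP name.
snps = cell(size(rows));
for k = 1:numel(rows)
    fields  = strsplit(strtrim(rows{k}), ' ');
    snps{k} = fields{2};
end

% Keep one entry per SNP, in the order the SNPs appear in the file.
snp_list = unique(snps, 'stable');

% In an interactive session the list could be offered in a dialog:
%   idx    = listdlg('ListString', snp_list, 'SelectionMode', 'multiple');
%   chosen = snp_list(idx);
```

Using the 'stable' option keeps the file order instead of sorting alphabetically, which is a design choice of this sketch and not necessarily what the script does.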
3. The script then loads the genetic reference table and matches each chosen SNP with its gene. These names are saved into another variable for use in the creation of the output file, where they become the headers for each SNP. For example, if a SNP is called rs16 and it is from the gene “THSD7A”, the column that holds each subject’s allele will have the header “rs16_THSD7A”. If the user does not supply a gene reference table, the header will simply be “rs16”.
4. The script next reads the data. This relies on Plink producing a consistently formatted output .list file. Here is a sample output with two SNPs (shortened of course!):
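The header construction can be sketched as follows. The reference table is shown as an in-memory cell array instead of being read from Excel, and the fallback to the bare SNP name when a SNP is missing from the table is an assumption of this sketch.

```matlab
% Reference table: column 1 = SNP name, column 2 = gene symbol.
ref    = {'rs16', 'THSD7A'};
chosen = {'rs16', 'rs8'};

% Build a SNP -> gene lookup, one row at a time.
gene_of = containers.Map();
for r = 1:size(ref, 1)
    gene_of(ref{r, 1}) = ref{r, 2};
end

% Append the gene symbol where one is known (assumed fallback: SNP name only).
headers = cell(size(chosen));
for k = 1:numel(chosen)
    if isKey(gene_of, chosen{k})
        headers{k} = [chosen{k} '_' gene_of(chosen{k})];
    else
        headers{k} = chosen{k};
    end
end
```

With this input, `headers` holds 'rs16_THSD7A' for the SNP found in the table and plain 'rs8' for the one that is not.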
7 rs16 11 7612 1 6931 1 3658 1 4346 1 9744
7 rs16 12 941 1 8955 1 9705 1 7220 1 9232 1 5650 1 698 1
7 rs16 22 5966 1 9412 1 9519 1 5922 1 2772 1 7129 1 5537 1
7 rs16 00
7 rs8 11 7680 1 7468 1 3862 1 4992 1
7 rs8 12 8955 1 9519 1 5922 1 4346 1 8796 1 8905 1 8702 1
7 rs8 22 941 1 7612 1 5966 1 6931 1 9412 1 9705 1 3658 1 7220 1
7 rs8 00 6904 1 7312 1 7393 1
Each line has the following format:
(gene) (SNP) (allele) (family ID 1) (individual ID 1) (family ID 2) (Individual ID 2)… etc
The script reads in one line, one element at a time, skipping over the (gene). Each element is found by locating the next space character and taking the substring from the start of the line up to the character just before that space. Once an element has been read and saved to a temporary variable, it is cut from the beginning of the line.
A line is processed only if its SNP is one the user selected. A DATA structure is created that holds the various fields for each person. With a .list file containing multiple SNPs, an individual will very likely appear more than once. So each person is read in with their family and individual ID, and then searched for in the structure. If they already exist, the new SNP information is added to their entry. If they don’t exist, a new subject is added with the new family and individual ID.
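The element-by-element reading described above can be sketched like this, using one row from the sample excerpt. It is an illustration of the technique under the stated format, not the script's own code, and the variable names are hypothetical.

```matlab
row  = '7 rs8 11 7680 1 7468 1 3862 1 4992 1';
rest = strtrim(row);
tokens = {};

while ~isempty(rest)
    sp = find(rest == ' ', 1);           % position of the next space
    if isempty(sp)
        tok  = rest;                      % last element on the line
        rest = '';
    else
        tok  = rest(1:sp-1);              % start of line up to space - 1
        rest = strtrim(rest(sp+1:end));   % cut the element off the front
    end
    tokens{end+1} = tok; %#ok<AGROW>
end

% tokens{1} is the (gene) field, which is skipped.
snp    = tokens{2};
allele = tokens{3};
ids    = tokens(4:end);
family_ids     = ids(1:2:end);   % odd positions: family IDs
individual_ids = ids(2:2:end);   % even positions: individual IDs
```

Everything is kept as text here, so IDs with leading zeros or letters would survive unchanged.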
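A minimal sketch of this find-or-add logic is shown below. The field names of the structure are assumptions, and the records are already-parsed values taken from the sample excerpt (subjects 941 and 8955 each appear under both rs16 and rs8).

```matlab
% Parsed records: family ID, individual ID, SNP, allele combo.
records = { ...
    '941',  '1', 'rs16', '12'; ...
    '8955', '1', 'rs16', '12'; ...
    '8955', '1', 'rs8',  '12'; ...
    '941',  '1', 'rs8',  '22'};

% Empty struct array with hypothetical field names.
DATA = struct('family', {}, 'individual', {}, 'snps', {});

for r = 1:size(records, 1)
    [fam, ind, snp, allele] = records{r, :};

    % Search for an existing subject with the same family and individual ID.
    idx = find(strcmp({DATA.family}, fam) & strcmp({DATA.individual}, ind), 1);

    if isempty(idx)
        % Not seen before: add a new subject.
        idx = numel(DATA) + 1;
        DATA(idx).family     = fam;
        DATA(idx).individual = ind;
        DATA(idx).snps       = struct();
    end

    % Record the allele combo under the SNP name.
    DATA(idx).snps.(snp) = allele;
end
```

The four records collapse into two subjects, each carrying one allele combo per SNP.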
5. Once all subjects and their SNPs have been added, the script loads the study ID lookup table and goes through each subject saved in the DATA structure, searching for a matching ID in the lookup table. If the ID is found, it is placed into the “Study_ID” field for the subject. If it is not found, this field will remain blank.
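The matching step might look roughly like the sketch below. The lookup table is an in-memory cell array instead of an Excel file, the study ID 'S001' is made up, and joining family and individual ID with an underscore is an assumption about how the Family_Individual key is built.

```matlab
subjects = struct( ...
    'family',     {'941', '8955'}, ...
    'individual', {'1', '1'}, ...
    'Study_ID',   {'', ''});

% Hypothetical lookup rows: column 1 = key, column 2 = other_ID.
lookup   = {'941_1', 'S001'};
key_type = 'family_individual';   % or 'individual' or 'family'

for s = 1:numel(subjects)
    % Build the key according to the user's choice.
    switch key_type
        case 'individual'
            key = subjects(s).individual;
        case 'family'
            key = subjects(s).family;
        otherwise
            key = [subjects(s).family '_' subjects(s).individual];
    end

    % Fill in the study ID when the key is found; otherwise leave it blank.
    hit = find(strcmp(lookup(:, 1), key), 1);
    if ~isempty(hit)
        subjects(s).Study_ID = lookup{hit, 2};
    end
end
```

Subject 941 receives its study ID, while subject 8955 has no entry in the table and keeps an empty Study_ID.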
6. The last step is printing all of the data to a .csv file with the proper headers and comma placement. The script gives the .csv file the same name as the list file provided by the user, with the date appended.
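A small sketch of the output step follows. The column names, the date format, the file name 'study.list', and writing into the temporary folder are all assumptions made for illustration; the script's real choices may differ.

```matlab
listfile = 'study.list';                       % hypothetical input name
[~, base] = fileparts(listfile);               % 'study'
outname  = [base '_' datestr(now, 'yyyy-mm-dd') '.csv'];
outpath  = fullfile(tempdir, outname);         % temp folder for the demo

headers = {'Family_ID', 'Individual_ID', 'Study_ID', 'rs16_THSD7A'};
rows = { ...
    '941',  '1', 'S001', '12'; ...
    '8955', '1', '',     '12'};

fid = fopen(outpath, 'w');
fprintf(fid, '%s\n', strjoin(headers, ','));   % header line
for r = 1:size(rows, 1)
    fprintf(fid, '%s\n', strjoin(rows(r, :), ','));
end
fclose(fid);
```

A subject with no match in the lookup table produces an empty cell between two commas, so the columns stay aligned when the file is opened in Excel or SPSS.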