Here are the presentations for Chapter 1: Learning R from the upcoming text Applied Population Genetics. More information on this text can be found here.
Previous Dyerlab member, Anna Tucker, just got her thesis work accepted for publication in Auk. Be on the lookout for the following:
Opportunistic conspecific brood parasitism in a box-nesting population of Prothonotary Warblers (Protonotaria citrea)
I will be posting portions of all 10 chapters of my upcoming textbook, Applied Population Genetics, as early draft chapters to this website over the spring semester. Read more
Been working on a lexicographic analysis of ‘Sustainability’ as published by the journals PNAS and Sustainability. Here are the stemmed word forms for 366 published articles represented as a hierarchical clustering. The wordclouds represent the top 10 word stems per group.
Every time I upgrade in any significant way, two R libraries seem to raise their ugly heads and scream like a spoiled child— rgdal and rgeos . Why do these two have to be SOOOO much of a pain? Why can’t we have a auto build of a binary with all the options in it for OSX? Who knows? I always feel like I get the fuzzy end of the lollipop with these two. Here is my latest approach for getting them going.
In R, there is often the need to merge two
data.frame objects (say one with individual samples and the other with population coordinates. The
merge() function is a pretty awesome though it may take a little getting used to.
Here are some things to remember:
- You need to have two data.frame objects to merge
- The first one in the function call will be the one merged on-to the second one is added to the first.
- Each will need a column to use as an index—it is a column that will be used to match rows of data. If they are the same column names then the function will do it automagically, if no common names are found in the names() of either data.frame objects, you can specify the columns using the optional by.x= and by.y= function arguments.
Much of the work in my laboratory uses spatial data in some context. As such it is important to try to be able to grab and use spatial data to in an easy fashion. At present, R is probably the best way to grab, visualize, and analyze spatial data. For this example, I went to http://worldclim.org and downloaded the elevation (altitude) for tile 13 (eastern North America) as a GeoTiff. A GeoTiff is a specific type of image format that has spatial data contained within it. The tile data has a pixel resolution of 30 arc seconds which puts us in the general area of ~ 1km. First, we need to get things set up to work.
# Set the working directory to where you want it.
# load in the raster library
Loading required package: raster
Loading required package: sp
Here is a short (39 minute) video of some basic graphics approaches in R I use in a class on population genetics.
The default CRAN repository is not the only place that R packages are stored. You can also find them on github. When I develop libraries for R, I typically develop them on http://github.com/dyerlab and then upload them to CRAN when I get to major milestones. The latest versions of all my software will always be found on github. So here is how to install packages directly. Read more
I’ve been working on integrating the Swift language into my analysis workflow but much of what I do involves the GNU Scientific Libraries for matrix analysis and other tools. Here is a quick tutorial on how to install the GSL library on a clean OSX platform.
- It is easiest if you have XCode installed. You can get this from the App Store for free. Go download it and install it.
- Download the latest version of the GSL libraries. You can grab them by:
- Looking for your nearest mirror site listed at http://www.gnu.org/prep/ftp.html and connecting to it.
- Open the directory
gsl/where all the versions will be listed. Scroll down and grab
- Open the terminal (Utilities -> Terminal.app) and type:
- Unpack the archive by:
tar zxvf gsl-latest.tar.gzthen
cd gsl-1.16/(or whatever the version actually was, it will probably be some number larger than 1.16).
- Inside that folder will be a README file (which you probably won’t read) and an INSTALL file (which you should read). In that folder it will tell you to:
sudo make install. This last command will require you to type in your password as it is going to install something into the base system.
- All the libraries and header files will be installed into the
There seems to be some nefarious conspiracy against packaging spatial R packages on the mac platform. Don’t quite understand it but it sucks. Here is how to install the rgeos package.
If you try the normal way, you get the following error:
package ‘rgeos’ is available as a source package but not as a binary
Warning in install.packages : package ‘rgeos’ is not available (as a binary package for R version 3.1.3)
which is not very helpful. Read more
Working on some code and was having a tough time configuring the color palette in GGally since it does not produce a ggplot object. It appears to be a larger problem. So, here is one hack, redefine the ggplot function and change the default palette there. Need to make a dyerlab::palette now…
ggplot <- function(...) ggplot2::ggplot(...) + scale_color_brewer(palette="Set1")
ggpairs(df,columns = 3:7,axisLabels="none",color="color")
If you use Google Docs for your writing, there are several cool tricks you can use to increase your efficiency. Here is one thing that has made it much easier when it comes to turning in assignments. Previously, one would create a document in some word processor, work on it, put it on a thumb drive, take it home and work on it, take it back to school, perhaps a lab computer, maybe it is also worked on in the library, etc. Eventually, you finish the document and then to turn it in you can either print it off (got to go find a printer or where I put that extra paper) or email it in. This last option is terrible if you have a large class!
If you are using Google Docs, you can just share it with the instructor. In the sharing options, you can designate that you share with someone but only allow them to make “suggestions”. This keeps the integrity of your document in place while allowing another person to mark it up. Once you share it, they can open it and write in it but any and all changes to the document are indicated via a highlight color. Since both of you are working on the document, there is no need to email it back and forth, there is only one document.
Here is a short video how that is done if you need more visual input.
This quick tutorial is for how you set up your site to make it able to syndicate to a class site. I am using the BIOL310 Genetics Online course as an example. You are going to need the following:
- A category given to you by the professor to use on your site to indicate which posts should be sent over to the class site.
- A blog. Here I am running WordPress as it is the supported one from VCU. Others are available if you already have a blog going, if not got to rampages.us and sign up as a VCU student and make one. Consider it a digital portfolio for all your work.
- Send your professor the address of your blog.
Below is a video of the process. It is pretty easy to do.
That should be it. Once your professor has the link and sets up syndication, your posts (when the category is applied to them) will show up on the site.
This is the main package that provides data types and routines for spatial analysis of genetic marker data. The previous version is currently available on CRAN and you can install it rom within your R environtment by invoking the command
If you want to keep up with the latest developments of this package, you can use the version found on GitHub. Install it from within R as:
and that should get you up-to-date. You’ll need to have a fully working LaTeX install and some other stuff to build it if you fork.
The Users Manual for the package with several examples can be found here
I have started a github account for this package, you can get access to the whole codebase read about it on the wiki, and contribute to the project from its repo at https://github.com/dyerlab.
This document shows you how to extract data from rasters.
Getting The Libraries
First, I’ll load in some packages to get the ability to work with raster data and to load in the Arapatus attenuatus data set (it is part of the default gstudiopackage).
Loading and Cropping Rasters
We can load in the raster, and then crop it to just the are we need. These rasters were downloaded from [http://www.worldclim.org] and are much larger than the study area. This just makes it easier on the computer to not have to deal with such large areas. After cropping it, we will load in the annual precip and temperature data as well.
tmp <- raster("alt_22.bil")
e <- extent(c(-115, -109, 22, 30))
baja_alt <- crop(tmp, e)
baja_temp <- crop(raster("bio1_22.bil"), e)
baja_prec <- crop(raster("bio12_22.bil"), e)
Getting Example Data from Araptus attenuatus
Now, lets grab the Araptus data and look at the data and plot out the locations.
## Species Cluster Population ID Latitude
## CladeA: 75 CBP-C :150 32 : 19 101_10A: 1 Min. :23.1
## CladeB: 36 NBP-C : 84 75 : 11 101_1A : 1 1st Qu.:24.6
## CladeC:252 SBP-C : 18 Const : 11 101_2A : 1 Median :26.2
## SCBP-A: 75 12 : 10 101_3A : 1 Mean :26.3
## SON-B : 36 153 : 10 101_4A : 1 3rd Qu.:27.5
## 157 : 10 101_5A : 1 Max. :29.3
## (Other):292 (Other):357
## Longitude LTRS WNT EN EF
## Min. :-114 01:01:147 03:03 :108 01:01 :225 : 2
## 1st Qu.:-113 01:02: 86 01:01 : 82 01:02 : 52 01:01:219
## Median :-112 02:02:130 01:03 : 77 02:02 : 38 01:02: 52
## Mean :-112 02:02 : 62 03:03 : 22 02:02: 90
## 3rd Qu.:-110 : 11 01:03 : 7
## Max. :-109 03:04 : 8 03:04 : 6
## (Other): 15 (Other): 13
## ZMP AML ATPS MP20
## : 33 08:08 : 51 05:05 :155 05:07 : 64
## 01:01: 46 07:07 : 42 03:03 : 69 07:07 : 53
## 01:02: 51 07:08 : 42 09:09 : 66 18:18 : 52
## 02:02:233 04:04 : 41 02:02 : 30 05:05 : 48
## : 23 07:09 : 14 05:06 : 22
## 07:09 : 22 08:08 : 9 11:11 : 12
## (Other):142 (Other): 20 (Other):112
plot(arapat, zoom = 6)
Extracting Point Data
To elevation, temperature and precipitation from the rasters for each sampling location, we need to translate them into points first. I’ll first grab the coordinate data as a data.frame.
coords <- StrataCoordinates(arapat)
Then we can grab them using the normal functions in the sp library.
pts <- SpatialPoints(coords[, 2:3])
coords$elev <- extract(baja_alt, pts)
coords$temp <- extract(baja_temp, pts)
coords$prec <- extract(baja_prec, pts)
## Strata Longitude Latitude elev temp prec
## 1 88 -114.3 29.33 681 178 143
## 11 9 -113.9 29.01 361 195 148
## 20 84 -113.7 28.97 368 197 124
## 29 175 -113.5 28.73 240 205 106
## 36 177 -114.0 28.66 177 203 120
## 46 173 -112.9 28.41 26 223 102
## 56 171 -113.2 28.22 522 195 145
## 66 89 -113.4 28.04 290 203 117
## 76 159 -113.3 27.53 87 211 123
## 85 SFr -113.0 27.36 305 205 107
## 94 160 -112.5 27.40 371 212 124
## 104 162 -112.4 27.20 248 220 119
## 114 12 -112.7 27.18 488 202 130
## 124 161 -113.0 27.04 29 216 91
## 134 93 -112.0 26.95 66 230 81
## 144 165 -112.4 26.25 NA NA NA
## 154 169 -111.4 26.21 6 238 134
## 164 58 -111.4 26.02 12 240 129
## 173 166 -112.1 25.91 70 224 84
## 183 64 -111.3 25.61 397 214 245
## 188 168 -111.2 25.56 354 217 235
## 198 51 -111.6 25.35 80 222 140
## 205 Const -111.7 25.02 50 217 127
## 216 77 -110.7 24.88 52 231 193
## 226 164 -111.5 24.75 35 213 95
## 236 75 -110.7 24.59 21 231 166
## 247 163 -111.0 24.21 196 219 127
## 257 ESan -110.4 24.46 9 237 175
## 265 153 -110.5 24.13 31 235 165
## 275 48 -110.3 24.21 117 233 217
## 285 156 -110.0 24.04 0 237 196
## 291 157 -110.1 24.02 609 208 443
## 301 73 -109.9 24.01 103 233 227
## 311 Aqu -110.1 23.29 142 226 227
## 319 Mat -110.1 23.09 5 234 159
## 324 98 -109.6 23.08 75 237 233
## 328 101 -110.6 27.91 8 249 244
## 337 32 -109.3 26.64 18 244 337
## 356 102 -109.1 26.38 10 245 346
Plotting Trend lines.
Cool, lets sort this by latitude
coords <- coords[order(coords$Latitude), ]
and then plot out some values to look at what is going on.
ggplot(coords, aes(x = Latitude, y = elev)) + geom_line(color = "gray") + theme_bw() + geom_text(aes(y = elev + 10, label = Strata), color = "blue")
ggplot(coords, aes(x = Latitude, y = temp)) + geom_line(color = "gray") + theme_bw() + geom_text(aes(y = temp + 5, label = Strata), color = "blue")
ggplot(coords, aes(x = Latitude, y = prec)) + geom_line(color = "gray") + theme_bw() + geom_text(aes(y = prec + 10, label = Strata), color = "blue")
This covers some basic unix commands so that you can log into a machine and move around in it with ease. It is by no means comprehensive.
To log into a machine you must use SSH. This is a secure shell and is encrypted on both ends so that others cannot snoop on your passwords or activities. If you are a windows person (shudder), you will have to use a GUI application to log into the server. Download one and install it. The server we will be using in this exercise is:
and can also be referenced by the raw IP address as:
Both of these addresses are the same. You will need both a user name and password on the server before you can log into it.
If you are outside VCU network, you will have to log into our VPN (see vcu.edu and search for webvpn if you do not know how to do this). On a mac/unix/linux box, you can use the terminal application to log into the server as:
or with the raw IP address. Note: if you have a user name that is different on your local machine than it is on the server, you need to specify that in the ssh call as:
Once you log into the remote server you will be greeted with a prompt from which you will be able to interact with the computer directly (e.g., no need for gui-pointy-clicky stuff).
When you are logged into the server, you are actually working in an interactive programming environment. By default, the dyerlab servers run the Bash Shell. This is a command line interface that has the ability to do quite a few things that reduce the amount of tedium in your computational life. The entire philosophy of unix is to provide an environment where:
- There is a strict partition between the stuff that you do in your own account and the stuff others do. Security at the OS level is priority number one and as such has a fairly strict set of requirements for what you as a user can see and do. This is good.
- Things work well together. Programs are small. There is no monolithic program that tries to do everything (except emacs but that is a flamewar of a different type). Programs do one type of task and do not try to multitask. This is very good because we have evolved an ecosystem of programs that are very efficient.
- It is assumed that unix programs read and write text files, not binary files. It goes against the philosophy of the OS for a program to sequester its data into formats that are unreachable by other programs. If a program does this then it is no longer a citizen of the unix program community and as such has cut itself off from the breadth of opportunities that such an interaction entails. This is a very very good thing because you as a user can build complicated workflows in the environment that achieve things that would take much longer in the normal clickity-clickity mouse world we typically inhabit.
To move around a unix box on a server you are not in front of, you must do it from the command line. You will always be logged into your ‘home’ account. This is literally at the location in the file system /home/your_user_name. Everyone has their own home directory and you cannot see what is in other peoples home directories unless they do some rather severe hacks (violating #1 above). However, there are places you can put stuff to share materials between users, it is just not within the home directory. Here is an overview of some basic commands. You can always find out where you are by typing:
|pwd||Prints out the current directory you are in.|
|ls||Lists the files and folders in that directory (see ls -l for a long listing)|
|cd||Change to your home directory|
|cd ..||Changes to directory that is the immediate parent of the current one|
|cd folder||Changes the directory to ‘folder’|
|exit||Logs you out of the computer|
|chmod||Changes the permission/privileges for files.|
|echo||Prints out variables/content to the shell output|
|cat||Concatenates text to the terminal output|
|head||Prints out the top lines of a file|
|wc||Counts words (default) or lines (-l) in a file|
|grep||Gets a regular expression (e.g., used for text searching)|
Almost all programs (pwd, ls, cd, etc.) have a manual associated with them. This is because there are often many options to modify the behavior of the output. You can get to the manual for any program by:
All manual entries have the same format and go through all the options available.
There is a plethora of editors available to the unix user. Perhaps the easiest one on the servers you’ll be using is nano. This is a simple editor that allows you to open, change, and save text files. Given that this is a command-line environment, the menu-like options are available via keyboard combinations indicated at the bottom of the screen. In most unix environments the symbol ‘^’ denotes the control key (e.g., ^-Q would be holding down the control key and typing Q), and ‘M’ denotes the key labelled ‘alt’ (the modifier key). In nano these combinations act as the menu functionality.
To start an editing session type:
and it will make a new file. If you have an existing file, you can edit it as:
for example and it will open it up and start you on an editing session.
Putting and Getting Files
To move documents from your computer to a remote server, we use the SSH Copy command ‘scp’. This is built into the ssh software itself.
scp file remoteserver:~/
The last part, ‘:~/’ is required as it tells scp that the remoteserver is a server and you want the file to go into your home directory ‘~/’
scp myremoteserver:~/thefile .
This will pull a file from the remote server to your current directory. Again notice that the period is there, a period in the options to a program specifically means ‘this place in the directory heirarchy.’
If you are using windows, there is a GUI drag and drop interface with your SSH client.
Running a Command Line Program
On a server all programs are run from the command line. You have already seen that you can run a program by typing its name and hitting return. The only reason this works is because there are executable binary files located in specific directories on the machine. These directories are all called ‘bin’ and they are located in parts of the folder heirarchy dictating who can access them. If you type at the command prompt:
it will return all the bin files that you are, by default, able to search for programs. The program echo prints out the value of a variable and all Bash Shell variables are indicated by a dollar sign (and by convention are typed in all upcase).
If a program is not in a ‘bin’ folder you access it will not run when you simply type its name. What you have to do is to tell the Bash Shell the location of the program specifically, even if it is in the same directory as you are.
There is one more requirement to execute a program. The program must be labelled as something that can be executed! In unix, scripts such as Bash Scripts, perl, python, lua, and many other languages can be used to make programs but these programs are just text files containing the instructions necessary to perform a task. To designate a program as a specific one that can be executed (if its contents actually can do something) you need to set the executable bit on the file by changing the mode of the file. This is done using chmod and telling it to add executable privileges.
chmod +x program_file
Many things have executable privileges, try a ls -l and look at the far left column to see the privileges, the last digit is either x or – indicating executable status or not (n.b., folders are executable, if you turn it off you will not be able to change directory into that folder).
As mentioned above, programs are small and do specialized things but can work together in ways that are quite dynamic. This is done by redirecting output. Here are some examples. In the directory /usr/share/velvet_1.2.10/data, there are several example fasta files that come with velvet. Copy the test_reference.fa file to your directory as:
cp /usr/share/velvet_1.2.10/data/test_reads.fa .
You should be able to see this file in your directory (via ls). Lets count how many lines are in that file. I see the following output (yours will be a bit different because you are not in my home directory (all the stuff to the left of the $ is not typed and is part of the bash prompt).
rodney@chesapeake ~ $ wc -l test_reads.fa
So this file has 285,716 lines in it. Lets now look at the first few lines:
This output is dumped to the terminal. To redirect this output to another place, say a file, we use the redirect operator ‘>’ otherwise known as the greater than sign.
rodney@chesapeake ~ $ head test_reads.fa > firstfew.fa
rodney@chesapeake ~ $ wc -l firstfew.fa
Piping Between Programs
Hooking together programs is the next step and it is called piping. It is accomplished by hooking together the output of one program with the input of another using the pipe character ‘|’ (the vertical line on the slash key on US english keyboards). Lets say I wanted to search for a particular sequence in the test_reads.fa file. I’ll use GATACA because it was a good movie. I use the grep command to find it by passing first what I am looking for and then the file I am looking for it in. I will then pipe this through the wc program to see how many lines have that sequence of nucleotides within it.
rodney@chesapeake ~ $ grep GATACA test_reads.fa | wc -l
So there are almost 800 reads with that sequence in it. We can continue to combine commands together beyond just these two. See if you can figure out what these commands do:
rodney@chesapeake ~ $ cat test_reads.fa | sort | head -100 > sortedfirstfew.fa
rodney@chesapeake ~ $ head sortedfirstfew.fa
OK, so when working on long-running programs, there is the need to be able to log in and out of a session without the various things you have running stopping each time. When you log into a unix machine, you start a ‘session’ and you can log in many times using the same user ID. When you log out, every process (running program) is purged from the memory and thus lost. So unless you plan on being logged into a terminal until a long running process is done running (some may take months), a better solution is needed. This is where a program called ‘screen’ comes in handy.
Screen is a program that is run after you log into the server. What this does is to then make a ‘virtual session’ (or many of them) that you can attach and detach your terminal session to. You start screen by typing:
and then a read the verbage and continue (via space bar or hitting return). After that it will look like your normal terminal session and you’ll be able to do whatever you want to do just like before. Screen is running in the background
So lets say you start a long running process like:
./VelvetOptimiser.pl -s 16 -e 31 -f “-short -raw ~/data/pedima/pedimaSNPs.fasta”
This can take a while to run. Now that you are already within a screen session, you can run it and then detach your terminal from that screen session. After you detach, the program will continue running just as before and you can re-attach at a later time to check on progress.
Here are the commands for listing, attaching, and detaching from screen sessions (n.b., they all start with CTRL+a followed by another letter.
|CTRL+a c||Creates a new shell window|
|CTRL+a k||Kills the current window|
|CTRL+a w||Lists all windows (the current session is marked with an ‘*’)|
|CTRL+a d||Detach from current session|
|CTRL+a D||Detach from current session and close the shell as well|
|CTRL+a 0-9||Go to session 0, 1, … 9|
After you detach (CTRL+a d) you can exit the server and when you come back, you can reattach as:
if you only have one screen session going, otherwise you can just start screen, list the open sessions, and then come back at that time.
Link to UMissouri
From NCBI: A basic protocol