Compiling RGDAL on OSX – Why do you hate me?

Compiling RGDAL on OSX – Why do you hate me?

Every time I upgrade in any significant way, two R libraries seem to raise their ugly heads and scream like a spoiled child— rgdal and rgeos .  Why do these two have to be SOOOO much of a pain?  Why can’t we have a auto build of a binary with all the options in it for OSX?  Who knows?  I always feel like I get the fuzzy end of the lollipop with these two.  Here is my latest approach for getting them going.

Read more

Merging data frames

In R, there is often the need to merge two data.frame objects (say one with individual samples and the other with population coordinates.  The merge()  function is a pretty awesome though it may take a little getting used to.
Here are some things to remember:

  1. You need to have two data.frame objects to merge
  2. The first one in the function call will be the one merged on-to the second one is added to the first.
  3. Each will need a column to use as an index—it is a column that will be used to match rows of data.  If they are the same column names then the function will do it automagically, if no common names are found in the names()  of either  data.frame objects, you can specify the columns using the optional by.x=  and by.y=  function arguments.

Read more

Loading in Rasters

Much of the work in my laboratory uses spatial data in some context.  As such it is important to try to be able to grab and use spatial data to in an easy fashion.  At present, R is probably the best way to grab, visualize, and analyze spatial data. For this example, I went to http://worldclim.org and downloaded the elevation (altitude) for tile 13 (eastern North America) as a GeoTiff.  A GeoTiff is a specific type of image format that has spatial data contained within it.  The tile data has a pixel resolution of 30 arc seconds which puts us in the general area of ~ 1km.   First, we need to get things set up to work.

Read more

Compiling the GSL Library for OSX

Compiling the GSL Library for OSX

I’ve been working on integrating the Swift language into my analysis workflow but much of what I do involves the GNU Scientific Libraries for matrix analysis and other tools.  Here is a quick tutorial on how to install the GSL library on a clean OSX platform.

  1. It is easiest if you have XCode installed.  You can get this from the App Store for free.  Go download it and install it.
  2. Download the latest version of the GSL libraries.  You can grab them by:
    1. Looking for your nearest mirror site listed at http://www.gnu.org/prep/ftp.html and connecting to it.
    2. Open the directory gsl/ where all the versions will be listed.  Scroll down and grab gsl-latest.tar.gz.
  3. Open the terminal (Utilities -> Terminal.app) and type:  cd ~/Downloads
  4. Unpack the archive by:  tar zxvf gsl-latest.tar.gz then cd gsl-1.16/ (or whatever the version actually was, it will probably be some number larger than 1.16).
  5. Inside that folder will be a README file (which you probably won’t read) and an INSTALL file (which you should read).  In that folder it will tell you to:  ./configure then  make then sudo make install. This last command will require you to type in your password as it is going to install something into the base system.
  6. All the libraries and header files will be installed into the /usr/local/ directory.

Install rgeos on OSX

There seems to be some nefarious conspiracy against packaging spatial R packages on the mac platform. Don’t quite understand it but it sucks.  Here is how to install the rgeos package.

If you try the normal way, you get the following error:

which is not very helpful.   Read more

Color palettes in ggpairs

Working on some code and was having a tough time configuring the color palette in GGally since it does not produce a ggplot object.  It appears to be a larger problem.  So, here is one hack, redefine the ggplot function and change the default palette there.  Need to make a dyerlab::palette now…

Turning In Assignments via Google Sharing

If you use Google Docs for your writing, there are several cool tricks you can use to increase your efficiency.  Here is one thing that has made it much easier when it comes to turning in assignments.  Previously, one would create a document in some word processor, work on it, put it on a thumb drive, take it home and work on it, take it back to school, perhaps a lab computer, maybe it is also worked on in the library, etc.  Eventually, you finish the document and then to turn it in you can either print it off (got to go find a printer or where I put that extra paper) or email it  in.  This last option is terrible if you have a large class!

If you are using Google Docs, you can just share it with the instructor.  In the sharing options, you can designate that you share with someone but only allow them to make “suggestions”.  This keeps the integrity of your document in place while allowing another person to mark it up.  Once you share it, they can open it and write in it but any and all changes to the document are indicated via a highlight color.  Since both of you are working on the document, there is no need to email it back and forth, there is only one document.

Here is a short video how that is done if you need more visual input.

Setting Up Your Site for Syndication

This quick tutorial is for how you set up your site to make it able to syndicate to a class site.  I am using the BIOL310 Genetics Online course as an example.  You are going to need the following:

  1. A category given to you by the professor to use on your site to indicate which posts should be sent over to the class site.
  2. A blog.  Here I am running WordPress as it is the supported one from VCU.  Others are available if you already have a blog going, if not got to rampages.us and sign up as a VCU student and make one.  Consider it a digital portfolio for all your work.
  3. Send your professor the address of your blog.

Below is a video of the process.  It is pretty easy to do.

That should be it. Once your professor has the link and sets up syndication, your posts (when the category is applied to them) will show up on the site.

GStudio: An R Package for Spatial Analysis of Marker Data

This is the main package that provides data types and routines for spatial analysis of genetic marker data. The previous version is currently available on CRAN and you can install it rom within your R environtment by invoking the command

install.packages("gstudio")

If you want to keep up with the latest developments of this package, you can use the version found on GitHub.  Install it from within R as:

require(devtools)
install_github("dyerlab/gstudio")

and that should get you up-to-date.  You’ll need to have a fully working LaTeX install and some other stuff to build it if you fork.

The Users Manual for the package with several examples can be found here

I have started a github account for this package, you can get access to the whole codebase read about it on the wiki, and contribute to the project from its repo at https://github.com/dyerlab.

Extracting Data from Rasters

This document shows you how to extract data from rasters.

 

Getting The Libraries

First, I’ll load in some packages to get the ability to work with raster data and to load in the Arapatus attenuatus data set (it is part of the default gstudiopackage).

 

Loading and Cropping Rasters

We can load in the raster, and then crop it to just the are we need. These rasters were downloaded from [http://www.worldclim.org] and are much larger than the study area. This just makes it easier on the computer to not have to deal with such large areas. After cropping it, we will load in the annual precip and temperature data as well.

 

Getting Example Data from Araptus attenuatus

Now, lets grab the Araptus data and look at the data and plot out the locations.

 

 

png-3

Extracting Point Data

To elevation, temperature and precipitation from the rasters for each sampling location, we need to translate them into points first. I’ll first grab the coordinate data as a data.frame.

 

Then we can grab them using the normal functions in the sp library.

 

 

Plotting Trend lines.

Cool, lets sort this by latitude

 

and then plot out some values to look at what is going on.

png

 

png-1

 

png-2

Unix Basics

This covers some basic unix commands so that you can log into a machine and move around in it with ease.  It is by no means comprehensive.

Logging In

To log into a machine you must use SSH.  This is a secure shell and is encrypted on both ends so that others cannot snoop on your passwords or activities.   If you are a windows person (shudder), you will have to use a GUI application to log into the server.  Download one and install it.  The server we will be using in this exercise is:

chesapeake.envs.vcu.edu

and can also be referenced by the raw IP address as:

128.172.178.27

Both of these addresses are the same.  You will need both a user name and password on the server before you can log into it.

If you are outside VCU network, you will have to log into our VPN (see vcu.edu and search for webvpn if you do not know how to do this).  On a mac/unix/linux box, you can use the terminal application to log into the server as:

ssh chesapeake.envs.vcu.edu

or with the raw IP address.  Note: if you have a user name that is different on your local machine than it is on the server, you need to specify that in the ssh call as:

ssh server_user_id@chesapeake.envs.vcu.edu

Once you log into the remote server you will be greeted with a prompt from which you will be able to interact with the computer directly (e.g., no need for gui-pointy-clicky stuff).

Bash

When you are logged into the server, you are actually working in an interactive programming environment.  By default, the dyerlab servers run the Bash Shell.  This is a command line interface that has the ability to do quite a few things that reduce the amount of tedium in your computational life.  The entire philosophy of unix is to provide an environment where:

  1. There is a strict partition between the stuff that you do in your own account and the stuff others do.  Security at the OS level is priority number one and as such has a fairly strict set of requirements for what you as a user can see and do.  This is good.
  2. Things work well together.  Programs are small.  There is no monolithic program that tries to do everything (except emacs but that is a flamewar of a different type).  Programs do one type of task and do not try to multitask.  This is very good because we have evolved an ecosystem of programs that are very efficient.
  3. It is assumed that unix programs read and write text files, not binary files.  It goes against the philosophy of the OS for a program to sequester its data into formats that are unreachable by other programs.  If a program does this then it is no longer a citizen of the unix program community and as such has cut itself off from the breadth of opportunities that such an interaction entails.  This is a very very good thing because you as a user can build complicated workflows in the environment that achieve things that would take much longer in the normal clickity-clickity mouse world we typically inhabit.

Basic Commands

To move around a unix box on a server you are not in front of, you must do it from the command line.  You will always be logged into your ‘home’ account.  This is literally at the location in the file system /home/your_user_name. Everyone has their own home directory and you cannot see what is in other peoples home directories unless they do some rather severe hacks (violating #1 above).  However, there are places you can put stuff to share materials between users, it is just not within the home directory.  Here is an overview of some basic commands. You can always find out where you are by typing:

 

Program Description
pwd Prints out the current directory you are in.
ls Lists the files and folders in that directory (see ls -l for a long listing)
cd Change to your home directory
cd .. Changes to directory that is the immediate parent of the current one
cd folder Changes the directory to ‘folder’
exit Logs you out of the computer
chmod Changes the permission/privileges for files.
echo Prints out variables/content to the shell output
cat Concatenates text to the terminal output
head Prints out the top lines of a file
wc Counts words (default) or lines (-l) in a file
grep Gets a regular expression (e.g., used for text searching)

Almost all programs (pwd, ls, cd, etc.) have a manual associated with them.  This is because there are often many options to modify the behavior of the output.  You can get to the manual for any program by:

man prog

All manual entries have the same format and go through all the options available.

Editing Files

There is a plethora of editors available to the unix user.  Perhaps the easiest one on the servers you’ll be using is nano.  This is a simple editor that allows you to open, change, and save text files.  Given that this is a command-line environment, the menu-like options are available via keyboard combinations indicated at the bottom of the screen.  In most unix environments the symbol ‘^’ denotes the control key (e.g., ^-Q would be holding down the control key and typing Q), and ‘M’ denotes the key labelled ‘alt’ (the modifier key).  In nano these combinations act as the menu functionality.

To start an editing session type:

nano

and it will make a new file.  If you have an existing file, you can edit it as:

nano myfile.fasta

for example and it will open it up and start you on an editing session.

Putting and Getting Files

To move documents from your computer to a remote server, we use the SSH Copy command ‘scp’.  This is built into the ssh software itself.

scp file remoteserver:~/

The last part, ‘:~/’ is required as it tells scp that the remoteserver is a server and you want the file to go into your home directory ‘~/’

scp myremoteserver:~/thefile .

This will pull a file from the remote server to your current directory.  Again notice that the period is there, a period in the options to a program specifically means ‘this place in the directory heirarchy.’

If you are using windows, there is a GUI drag and drop interface with your SSH client.

Running a Command Line Program

On a server all programs are run from the command line.  You have already seen that you can run a program by typing its name and hitting return.  The only reason this works is because there are executable binary files located in specific directories on the machine.  These directories are all called ‘bin’ and they are located in parts of the folder heirarchy dictating who can access them.  If you type at the command prompt:

echo $PATH

it will return all the bin files that you are, by default, able to search for programs.  The program echo prints out the value of a variable and all Bash Shell variables are indicated by a dollar sign (and by convention are typed in all upcase).

If a program is not in a ‘bin’ folder you access it will not run when you simply type its name.  What you have to do is to tell the Bash Shell the location of the program specifically, even if it is in the same directory as you are.

There is one more requirement to execute a program.  The program must be labelled as something that can be executed!  In unix, scripts such as Bash Scripts, perl, python, lua, and many other languages can be used to make programs but these programs are just text files containing the instructions necessary to perform a task.  To designate a program as a specific one that can be executed (if its contents actually can do something) you need to set the executable bit on the file by changing the mode of the file.  This is done using chmod and telling it to add executable privileges.

chmod +x program_file

Many things have executable privileges, try a ls -l and look at the far left column to see the privileges, the last digit is either x or – indicating executable status or not (n.b., folders are executable, if you turn it off you will not be able to change directory into that folder).

Redirecting Output

As mentioned above, programs are small and do specialized things but can work together in ways that are quite dynamic.  This is done by redirecting output.  Here are some examples.  In the directory /usr/share/velvet_1.2.10/data, there are several example fasta files that come with velvet.  Copy the test_reference.fa file to your directory as:

cp /usr/share/velvet_1.2.10/data/test_reads.fa .

You should be able to see this file in your directory (via ls).  Lets count how many lines are in that file.  I see the following output (yours will be a bit different because you are not in my home directory (all the stuff to the left of the $ is not typed and is part of the bash prompt).

rodney@chesapeake ~ $ wc -l test_reads.fa
285716 test_reads.fa

So this file has 285,716 lines in it.  Lets now look at the first few lines:

>SEQUENCE_0_length_35
GGATATAGGGCCAACCCAACTCAACGGCCTGTCTT
>SEQUENCE_1_length_35
CGACGAATGACAGGTCACGAATTTGGCGGGGATTA
>SEQUENCE_2_length_35
CCAAATAGGTCCTTACATCATGAGACGGGCCAAAT
>SEQUENCE_3_length_35
CGAGATGTATACCTCTAACACTGTGTTCCAAGTAC
>SEQUENCE_4_length_35
AAGCTCCCGCAATGGATCTTGTGACGGGCTGCTCG

This output is dumped to the terminal.  To redirect this output to another place, say a file, we use the redirect operator ‘>’ otherwise known as the greater than sign.

rodney@chesapeake ~ $ head test_reads.fa > firstfew.fa
rodney@chesapeake ~ $ wc -l firstfew.fa
10 firstfew.fa

Piping Between Programs

Hooking together programs is the next step and it is called piping.  It is accomplished by hooking together the output of one program with the input of another using the pipe character ‘|’ (the vertical line on the slash key on US english keyboards).  Lets say I wanted to search for a particular sequence in the test_reads.fa file. I’ll use GATACA because it was a good movie.  I use the grep command to find it by passing first what I am looking for and then the file I am looking for it in.  I will then pipe this through the wc  program to see how many lines have that sequence of nucleotides within it.

rodney@chesapeake ~ $ grep GATACA test_reads.fa  | wc -l
790

So there are almost 800 reads with that sequence in it.  We can continue to combine commands together beyond just these two.  See if you can figure out what these commands do:

rodney@chesapeake ~ $ cat test_reads.fa | sort | head -100 > sortedfirstfew.fa
rodney@chesapeake ~ $ head sortedfirstfew.fa
AAAAAAAACGGGCTTATAGACCATGCAGGCTTCAT
AAAAAAACACTATACAGCCAGAGTTCCTTCTTCTT
AAAAAAACCCTTCTGTGTTTGATCTACCTACTATA
AAAAAAACGGGCTTATAGACCATGCAGGCTTCATG
AAAAAAACGTAAGGAGCGTTTATGCCAAACGAAGA
AAAAAAAGGCTCGTGACTGTCATCATCGAGACGCC
AAAAAAAGTGGGGTTCAAACACTCTATCCATGAAG
AAAAAAATTGACTGTTAATGGCAATTTCAAGTTAT
AAAAAACAGCGAACCAGATCTTATTTTGCTTCTAC
AAAAAACATGACAACGAGAGCAACCCGGGCATTTG

Screen

OK, so when working on long-running programs, there is the need to be able to log in and out of a session without the various things you have running stopping each time.  When you log into a unix machine, you start a ‘session’ and you can log in many times using the same user ID.  When you log out, every process (running program) is purged from the memory and thus lost.  So unless you plan on being logged into a terminal until a long running process is done running (some may take months), a better solution is needed.  This is where a program called ‘screen’ comes in handy.

Screen is a program that is run after you log into the server.  What this does is to then make a ‘virtual session’ (or many of them) that you can attach and detach your terminal session to.  You start screen by typing:

screen

and then a read the verbage and continue (via space bar or hitting return). After that it will look like your normal terminal session and you’ll be able to do whatever you want to do just like before.  Screen is running in the background

So lets say you start a long running process like:

./VelvetOptimiser.pl -s 16 -e 31 -f “-short -raw ~/data/pedima/pedimaSNPs.fasta”

This can take a while to run.  Now that you are already within a screen session, you can run it and then detach your terminal from that screen session. After you detach, the program will continue running just as before and you can re-attach at a later time to check on progress.

Here are the commands for listing, attaching, and detaching from screen sessions (n.b., they all start with CTRL+a followed by another letter.

Keys Description
CTRL+a c Creates a new shell window
CTRL+a k Kills the current window
CTRL+a w Lists all windows (the current session is marked with an ‘*’)
CTRL+a d Detach from current session
CTRL+a D Detach from current session and close the shell as well
CTRL+a 0-9 Go to session 0, 1, … 9

After you detach (CTRL+a d) you can exit the server and when you come back, you can reattach as:

screen -r

if you only have one screen session going, otherwise you can just start screen, list the open sessions, and then come back at that time.

Velvet

Link to UMissouri

http://umbc.rnet.missouri.edu/resources/How2RunVELVETonClark.html

From NCBI: A basic protocol

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2952100/