Saturday, September 22, 2018

Lesson 4: Reading and Writing Data in R

In this post we'll look at reading and writing text files, delimited files, and Excel files.

Reading delimited data sets into R is pretty straightforward.  R includes several base functions that let you read delimited files directly into a data frame.

Similarly, reading and writing Excel data is easy with the right R libraries.  Setting up the libraries requires a few extra steps.  First we need to install two packages.  Then we can load them into our R session and access the functions needed to read and write Excel files.  We'll get into the nitty-gritty details of this a little later in the post.

Prep Work

Before we get into the fun stuff, we need to save some files to our desktop. 

You can download the example files here.
There are four files:
  1. my_doc.txt
  2. tab_delim.txt
  3. comma_delim.csv
  4. excel_example.xls
I've created the files to save you some time, but you can also just make your own files if you don't feel like downloading mine.

Part #1:  Reading & Writing Text Files

Okay, so now that you have your files saved to your desktop, set your working directory to your desktop.
setwd("~/Desktop/")

Next, you can use the file I gave you in the link above, or you can create a simple text file using R that contains the infamous sentence "The quick brown fox jumps over the lazy dog."
To create the file, first create a string variable called "contents" with the "infamous sentence" stored in it.

contents <- "The quick brown fox jumps over the lazy dog."

Now, write that "infamous sentence" to a file called "my_doc.txt".

writeLines(contents, "my_doc.txt")

So, good job, you just created a text file using R.  Now let's read in that same text file using the "readLines" function and store it in a string variable called "read_it".

First create a variable called "my_text_file" where you will store the name of the newly created text file "my_doc.txt". 

my_text_file <- "my_doc.txt"
  
Next, read in and store the contents of the text file into a string variable.

read_it <- readLines(my_text_file)

Now let's create a copy of the file by using the "writeLines" function and save the new file as "my_doc2.txt".

writeLines(read_it, "my_doc2.txt")
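The same two functions also handle files with more than one line.  Here is a minimal sketch, assuming a temporary file (the names "tmp_file", "lines_out" and "lines_in" are made up for this illustration).  It shows that "readLines" returns a character vector with one element per line of the file.

```r
# Write a three-line file to a temporary location (hypothetical path for this sketch)
tmp_file <- tempfile(fileext = ".txt")
lines_out <- c("The quick brown fox", "jumps over", "the lazy dog.")
writeLines(lines_out, tmp_file)

# readLines returns one character element per line of the file
lines_in <- readLines(tmp_file)
length(lines_in)
identical(lines_in, lines_out)
```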

Another, faster way to make a copy of a file is the "file.copy" function.
We covered file & directory operations in Lesson 2 (check it out for more info).
Using the "file.copy" function, save a copy of "my_doc.txt" as "my_doc3.txt".

file.copy(from = "my_doc.txt", to = "my_doc3.txt")


Part #2:  Reading & Writing Delimited Files

Next we're going to discuss dealing with delimited file data.  Reading & writing delimited files, whether they are tab, comma, or semicolon delimited, is easy in R.  R comes loaded and ready to go with base functions that allow us to read in this type of data.

So let's get started!  Use the "read.delim" function and specify the delimiter using the "sep" argument.  For example, let's read in a tab delimited file called "tab_delim.txt" and store it in a data frame called "tab", setting the "sep" argument to a tab.  A tab character is written in R as "\t".

tab <- read.delim(file ="tab_delim.txt", sep = "\t")

Writing the contents of the "tab" data frame is just as simple.  We will use a base function called "write.table" and we'll specify the NEW file's name as "tab_delim2.txt".  Be sure to use the "row.names" argument to prevent row numbers from being exported along with the data.

write.table(x = tab, file = "tab_delim2.txt", sep = "\t", row.names = FALSE)
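If you don't have the example file handy, the round trip can be tried on a small made-up data frame.  This is only a sketch: "demo_df", "tmp_tab" and "tab_back" are hypothetical names, and the file is written to a temporary location rather than the desktop.

```r
# A small made-up data frame standing in for the contents of tab_delim.txt
demo_df <- data.frame(id = 1:3, fruit = c("Apple", "Pear", "Plum"))

# Write it out tab-delimited, without row numbers
tmp_tab <- tempfile(fileext = ".txt")
write.table(x = demo_df, file = tmp_tab, sep = "\t", row.names = FALSE)

# Read it back in and inspect the structure of the result
tab_back <- read.delim(file = tmp_tab, sep = "\t")
str(tab_back)
```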

Next, let's move on to comma delimited files or comma-separated value (csv) files.  To do this we use another base function called "read.csv", which is a close cousin of the "read.delim" function (imagine that!).  Both are convenience versions of the more general "read.table" function.
We like to use the "read.csv" function for csv-files because it already knows your delimiter is a comma, so you don't have to specify the "sep" argument.

Now, let's go ahead and read in a comma-delimited or comma-separated value (csv) file called "comma_delim.csv" and store it in a data frame called "comma".

comma <- read.csv(file = "comma_delim.csv")

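Similar to the "read.csv" function, the "write.csv" function doesn't need you to specify the "sep" argument, since a comma delimiter is implied.  Finally, let's go ahead and write the "comma" data frame to a NEW file called "comma_delim2.csv".  Note that the file name goes in the "file" argument, and that "row.names" again keeps row numbers out of the export.

write.csv(x = comma, file = "comma_delim2.csv", row.names = FALSE)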
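To see what the "row.names" argument changes, here is a short sketch that writes the same made-up data frame twice and counts the columns that come back.  The names "demo_df", "with_rows" and "without_rows" are assumptions for this example only.

```r
# A small made-up data frame with two columns
demo_df <- data.frame(id = 1:3, fruit = c("Apple", "Pear", "Plum"))

with_rows <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

# Default behaviour: row numbers are written as an extra first column
write.csv(x = demo_df, file = with_rows)
# row.names = FALSE: only the data columns are written
write.csv(x = demo_df, file = without_rows, row.names = FALSE)

cols_with <- ncol(read.csv(file = with_rows))        # 3 (extra row-number column)
cols_without <- ncol(read.csv(file = without_rows))  # 2
```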


Part #3:  Reading & Writing Excel Files

For this next part, you'll need to install two different packages for R: the "readxl" and "writexl" packages.

install.packages("readxl")
install.packages("writexl")

Once you've installed the packages, load the associated libraries for those packages.

library(readxl)
library(writexl)

Now we're ready to read in some Microsoft Excel spreadsheet data.  Exciting, right!!  I know!
Microsoft Excel is a great program, but it is limited in its capabilities when it comes to LARGER data sets.  I'm talking over 1,000,000 rows of data, which is common nowadays; even that many rows is considered small data.  When we talk about BIG DATA, we're talking hundreds of millions of rows.  Trying to wrangle and tidy data of this magnitude can be impossible in Excel.  This is where a program like R can really shine!

Now, we are not going to deal with a file with that many rows today.  We're going to keep it simple and stick to 7 rows... yeah, laugh it up!

First, let's read in "Sheet1" from an Excel file called "excel_example.xls" and store it in a data frame called "excel".
NOTE: "read_xls" reads the older xls format.  For xlsx files use "read_xlsx", or use "read_excel", which works out the format for you.

excel <- read_xls(path = "excel_example.xls", sheet = "Sheet1")

Now, let's save the "excel" data frame to a NEW Excel file called "excel_example2.xlsx".
NOTE: "write_xlsx" writes the xlsx format only, so give the new file an ".xlsx" extension rather than ".xls".

write_xlsx(x = excel, path = "excel_example2.xlsx")
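Here is a hedged sketch of the full Excel round trip on a made-up data frame.  It assumes both packages are installed (the code skips itself otherwise), writes to a temporary file with an ".xlsx" extension because that is the format "writexl" produces, and reads it back with "read_excel".  The names "demo_df", "tmp_xlsx" and "excel_back" are hypothetical.

```r
# Only run the round trip when both packages are available
if (requireNamespace("readxl", quietly = TRUE) &&
    requireNamespace("writexl", quietly = TRUE)) {
  # A small made-up data frame standing in for the spreadsheet data
  demo_df <- data.frame(id = 1:3, fruit = c("Apple", "Pear", "Plum"))

  # Write an xlsx file to a temporary location
  tmp_xlsx <- tempfile(fileext = ".xlsx")
  writexl::write_xlsx(x = demo_df, path = tmp_xlsx)

  # Read the first sheet back in; read_excel detects the file format
  excel_back <- readxl::read_excel(path = tmp_xlsx, sheet = 1)
  print(excel_back)
}
```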

Great stuff, I know!

While the examples we covered today may seem basic, they really are important to know.
I use these functions daily and I'm sure you will too as you continue to use R!


Good Luck!


Only one more step.  Let's save that script you just created by clicking on "File" -> "Save As..." and let's name your script "R_Lesson4.R".  Click "Save".
Congratulations!  You've completed Lesson 4!


DOWNLOAD CODE Here is the code from my GitHub gist "R Lesson 4 - Reading and Writing Data in R" in case you'd rather just copy and paste it and then play around with it.


Saturday, September 15, 2018

Lesson 3: Vectors, Lists, Matrices, and Data Frames in R

Data structures are next on the list of things we need to discuss because they are central to just about everything we do in R.  Data structures in R come in four main flavors: vectors, lists, matrices and data frames.  We'll discuss each of these data structures this week, along with special cases of vectors, such as scalars and factors, as well as arrays, which extend matrices to three or more dimensions.

Okay, let's get started!

PART #1: VECTORS, SCALARS, AND FACTORS

Vectors are able to hold one kind of data type or mode (e.g. numeric, character).
Vectors can be created using the combine function "c()".
So, let's first create a vector with numeric data.

my_first_vector <- c(1, 2, 3, 4, 5)

Now let's create a vector with character data or strings.

my_second_vector <- c("Orange", "Apple", "Banana", "Pear", "Peach")
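Because a vector holds only one data type, R quietly converts mixed values to a common type.  The sketch below (the variable names are made up for this illustration) shows what happens when numbers, strings and logical values are combined.

```r
# Mixing a number, a string and a logical value: everything becomes character
mixed <- c(1, "Apple", TRUE)
mixed_class <- class(mixed)      # "character"
mixed                            # "1" "Apple" "TRUE"

# Mixing a number and a logical value: the logical becomes numeric
num_and_logical <- c(1, TRUE)
num_class <- class(num_and_logical)  # "numeric"
```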

To reference an element or value in a vector use an index value enclosed in square brackets "[ ]".  For example, to reference the first element in the vector, use "[1]".

my_second_vector[1]

One can also create subsets of the original vector by using a continuous index range from x to y with the syntax "[x:y]".  For example, to create a new vector that is a subset of the second vector's last three items, use an index range of "[3:5]".

my_third_vector <- my_second_vector[3:5]

You can also create a new vector that is a subset of the original vector's first and last items using the "[c(1,5)]".

my_fourth_vector <- my_second_vector[c(1,5)]


To add a value to a vector and create a new vector, use the combine function "c()" once again, referencing the original vector in the combine function.

my_fifth_vector <- c(my_second_vector, "Apricot")

Next we'll discuss scalars.  Scalars are vectors with only one element.
Pi in R is stored as a scalar.  You can test this by referencing its "first" element.

pi[1]

Factors are similar to vectors except a factor stores each unique value in the vector as a 'level' or category.  Factors can be used to label or categorize your data.

Let's turn the fifth vector we created into a factor.

my_first_factors <- factor(my_fifth_vector)
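To see the levels at work, here is a small sketch with repeated values (the "fruit" vector is made up for this example).  Each unique value becomes a level, and under the hood each element is stored as an integer code pointing at its level.

```r
# A made-up character vector with repeated values
fruit <- c("Pear", "Apple", "Pear", "Banana", "Apple", "Pear")
fruit_factor <- factor(fruit)

# The unique values, sorted alphabetically, become the levels
fruit_levels <- levels(fruit_factor)     # "Apple" "Banana" "Pear"

# Each element is stored as the position of its level
fruit_codes <- as.integer(fruit_factor)  # 3 1 3 2 1 3

# Count how many times each level occurs
table(fruit_factor)
```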

PART #2: LISTS

Unlike vectors, lists are able to hold more than one kind of data type or mode (e.g. numeric, character).
Let's create a list with numeric and character data modes using the "list()" function.

my_first_list <- list(1, 2, 3, "car", "truck", "van")

Referencing values or elements in a list is similar to referencing elements in a vector.  To reference the fourth element in the list, use double square brackets: "[[4]]".

my_first_list[[4]]

Similar to vectors, one can also create subsets of the original list by using a continuous index range from x to y with the syntax "[x:y]".  For example, to create a new list that is a subset of the original list's last three items, use "[4:6]" (the list has six items).

my_second_list <- my_first_list[4:6]

And again, similar to vectors, to create a new list that is a subset of the original list's first and last items, use "[c(1,6)]".

my_third_list <- my_first_list[c(1,6)]

Finally, similar to vectors, to add a value to a list and create a new list, use the combine function "c()" once again, referencing the original list in the combine function.

my_fourth_list <- c(my_first_list, "suv")
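The difference between single and double square brackets on a list is easy to miss, so here is a brief sketch (the name "demo_list" is made up for this example).  Single brackets return a smaller list, while double brackets return the element itself.

```r
# A made-up list with numeric and character elements
demo_list <- list(1, 2, 3, "car", "truck", "van")

# Single brackets keep the list wrapper
single_class <- class(demo_list[4])    # "list"

# Double brackets pull out the element itself
double_class <- class(demo_list[[4]])  # "character"

# A range with single brackets returns a list of that many items
range_length <- length(demo_list[4:6]) # 3
```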

PART #3: MATRICES AND ARRAYS

You may or may not remember matrices from high school or college math.  Matrices are (typically) numeric vectors arranged in two dimensions: rows and columns.  A matrix can be produced from multiple vectors.

a <- c(1,2,3)
b <- c(4,5,6)
c <- c(7,8,9)

To create a matrix we use the "matrix()" function.  First, you should know that there are three parameters used by the "matrix()" function that define what the matrix looks like.  The "data" parameter allows you to specify what vectors to use for data using the "c()" function.  The "nrow" and "ncol" parameters allow you to tell R how many rows and columns define the matrix.

So, let's create a matrix using the "matrix()" function and the three vectors above.

my_first_matrix <- matrix(data = c(a,b,c), nrow = 3, ncol = 3)

Similar to vectors and lists you can reference specific values in the matrix.  For instance, to reference the item in the first row of the third column in the matrix use "[1,3]".

my_first_matrix[1,3]
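One thing worth knowing is that "matrix()" fills the matrix column by column unless told otherwise, which is why "[1,3]" above returns 7.  The sketch below uses the numbers 1 to 9 and made-up names ("by_col", "by_row") to compare the default with the "byrow" argument.

```r
# Default: values fill the matrix one column at a time
by_col <- matrix(data = 1:9, nrow = 3, ncol = 3)

# byrow = TRUE: values fill the matrix one row at a time
by_row <- matrix(data = 1:9, nrow = 3, ncol = 3, byrow = TRUE)

col_value <- by_col[1, 3]  # 7
row_value <- by_row[1, 3]  # 3
```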

Now for the next exercise, we'll need to create a fourth vector.

d <- c(10,11,12)

To add the new vector "d" to "my_first_matrix" as an additional column you need to use the column bind or "cbind" function.

my_second_matrix <- cbind(my_first_matrix,d)

To add the new vector 'd' to 'my_first_matrix' as an additional row you need to use the row bind or "rbind" function.

my_third_matrix <- rbind(my_first_matrix,d)


Next, let's talk about arrays.  Arrays are just matrices with 3 or more dimensions.
To make our first array, let's make a vector with 27 items, for example the numbers from 1 to 27.

my_first_array <- c(1:27)

Now to create the array, let's divide the vector up into three 3 x 3 matrices and stack them.  We use the dimension or "dim" function together with the combine or "c()" function to tell R how to structure the vector as a 3-dimensional array.

dim(my_first_array) <- c(3,3,3)

Review the results of your first array.

print(my_first_array)
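Elements of a 3-dimensional array are referenced with three index values: row, column, and slice.  As a sketch, the same array can also be built in one step with the base "array()" function (the name "demo_array" is made up for this example).

```r
# Build a 3 x 3 x 3 array from the numbers 1 to 27 in one step
demo_array <- array(data = 1:27, dim = c(3, 3, 3))

# Row 1, column 2 of the third 3 x 3 slice
one_value <- demo_array[1, 2, 3]   # 22

# Leaving the row and column blank returns a whole slice as a matrix
second_slice <- demo_array[, , 2]  # values 10 to 18
dim(second_slice)
```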

PART #4: DATA FRAMES

Data frames are the last data structure we'll discuss this week.  Data frames are designed to hold tabular data.  They are similar to spreadsheets.  They have columns and rows and can store all data types.  Data frames can be built from vectors, lists, or matrices.  Let's build our first data frame from "my_first_vector" and "my_second_vector" using the "data.frame" function.  For this example, we're also going to set a parameter called "stringsAsFactors" to FALSE (this is the default since R 4.0.0).  This will make sure that the string data is saved as "character" data type rather than "factor" data type.

my_first_df <- data.frame(my_first_vector, my_second_vector, stringsAsFactors = FALSE)

We can also create data frames from lists, so we'll do that next.
But before we do that, we have to create a list of vectors.

my_vector_list <- list(my_first_vector, my_second_vector)

When using a list to create a data frame, you can use the "as.data.frame" function, which converts an existing object into a data frame.

my_other_df <- as.data.frame(my_vector_list, stringsAsFactors = FALSE)

Since the default column names are the vector names, let's rename the columns.

names(my_first_df) <- c("Quantity", "Fruit")

Using matrix notation, let's see what's in row 3.

my_first_df[3,]

Now, let's see what's in column 2.

my_first_df[,2]

OR use the name of column 2 instead to see what's in it.

my_first_df$Fruit

Next, let's see what's in row 4, column 2.

my_first_df[4,2]

Finally, let's change the value in row 4, column 2 from "Pear" to "Plum".

my_first_df[4,2] <- "Plum"
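As a variation on the steps above, the columns can be named when the data frame is created, and rows can be picked with a condition instead of a row number.  This is just a sketch; "fruit_df" and "big_quantities" are made-up names.

```r
# Name the columns up front instead of renaming them afterwards
fruit_df <- data.frame(
  Quantity = c(1, 2, 3, 4, 5),
  Fruit = c("Orange", "Apple", "Banana", "Pear", "Peach"),
  stringsAsFactors = FALSE
)
names(fruit_df)

# Keep only the rows where Quantity is greater than 3
big_quantities <- fruit_df[fruit_df$Quantity > 3, ]

# Change "Pear" to "Plum" by matching the value rather than the position
fruit_df$Fruit[fruit_df$Fruit == "Pear"] <- "Plum"
```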


Only one more step.  Let's save that script you just created by clicking on "File" -> "Save As..." and let's name your script "R_Lesson3.R".  Click "Save".
Congratulations!  You've completed Lesson 3!


DOWNLOAD CODE Here is the code from my GitHub gist "R Lesson 3 - Vectors, Lists, Matrices, and Data Frames in R" in case you'd rather just copy and paste it and then play around with it.



Sunday, September 9, 2018

Lesson 2: Working Directory & File Operations in R

So you got your first program behind you!  Now we'll take a look at how you can determine where R is storing your new script.  The following is a tutorial on how to determine your "working directory", set your working directory and then do some simple file operations while you're in the "working directory".

So, you might be asking "What exactly is the working directory?"  The working directory is where R-Studio saves your scripts as well as R-Studio's .Rdata files that contain workspace settings and data so that you can load a past session and pick up where you left off.  Additionally, the working directory is where R-Studio will look when you ask it to read a file without providing an absolute path.

STEP#1 - CHECK YOUR WORKING DIRECTORY

To display the working directory in the R-Studio console you can use the 'getwd' function.

getwd()

My working directory is currently set to:

[1] "/Users/RGuy/Documents"

STEP#2 - SET YOUR WORKING DIRECTORY

Next, to change the working directory  to my Desktop in R-Studio, you can use the 'setwd' function.

work_directory = "~/Desktop"
setwd(work_directory)
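A handy detail is that 'setwd' hands back the previous working directory, so you can switch folders and return later.  The sketch below uses R's temporary folder as a stand-in for the Desktop; "start_dir" and "old_dir" are made-up names.

```r
# Remember where we started
start_dir <- getwd()

# setwd returns the previous working directory (invisibly), so store it
old_dir <- setwd(tempdir())
getwd()

# Switch back to where we were
setwd(old_dir)
```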

STEP#3 - SEE CONTENTS OF WORKING DIRECTORY

Next, let's display the contents of our working directory.
dir(work_directory)

The contents of my working directory currently are:
[1] "a.jpg"       "b.jpg"      "Documents"   "temp_submit"

STEP#4 - PERFORM VARIOUS FILE OPERATIONS

You can perform various file operations while in R-Studio. Let's store a list of the files and folders in the working directory and then do the following: a) copy a file, b) rename a file, c) determine the time a file was last modified, and lastly d) determine the size of a file.

Store a list of the working directory's contents (files & folders).
directory_contents = dir(work_directory)

a. Copy 'a.jpg' and name the copy 'a_copy.jpg'.  Remember that 'a.jpg' was the first item listed in the working directory contents.  Since we stored the directory's contents in the 'directory_contents' variable as a character vector, we can reference item #1 (a.jpg) by typing 'directory_contents[1]'.
file.copy(directory_contents[1], "a_copy.jpg")

b. Next we'll rename 'a_copy.jpg' to 'c.jpg'
file.rename("a_copy.jpg", "c.jpg")

c. Now, we'll determine the date & time the file 'c.jpg' was last modified
file.mtime("c.jpg")

d. Finally, we'll see what the size of the file 'c.jpg' is in bytes.  NOTE: Divide value by 1,000 for kilobytes, and 1,000,000 for megabytes.
file.size("c.jpg")
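The file operations above rely on the working directory being the folder that holds the files.  If it is not, one option is to work with full paths instead.  The sketch below uses a made-up temporary folder ("demo_dir") and the base functions 'list.files' and 'file.path'.

```r
# Create a made-up folder with one small file in it
demo_dir <- tempfile("lesson2_demo_")
dir.create(demo_dir)
writeLines("example", file.path(demo_dir, "a.txt"))

# full.names = TRUE returns complete paths instead of bare file names
full_paths <- list.files(demo_dir, full.names = TRUE)

# Copy the first file using full paths, so the working directory does not matter
copy_path <- file.path(demo_dir, "a_copy.txt")
copied <- file.copy(from = full_paths[1], to = copy_path)

file.exists(copy_path)
file.size(copy_path)
```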

STEP#5 - SAVING YOUR SCRIPT

Only one more step.  Let's save that script you just created by clicking on "File" -> "Save As..." and let's name your script "R_Lesson2.R".  Click "Save".
Congratulations!  You've completed Lesson 2!

DOWNLOAD CODE Here is the code from my GitHub gist "R Lesson 2 - R-Studio Working Directory & File Operations" in case you'd rather just copy and paste it and then play around with it.

Tuesday, September 4, 2018

Lesson 1: Writing Your First Program In R Using R-Studio

So, you're ready to write your first program in R!  I'm really excited for you!  R is one of my favorite programming languages because once you get the hang of the syntax, it's very intuitive and very well supported by the community.  Many times answers to your questions can be found through a simple Google search or by visiting Stack Exchange.

As you may (or may not) know when first learning a programming language, the first program you typically write is "Hello, World!".  So, that is what we are going to do here next with R in R-Studio!

NOTE:  To run the code in R-Studio, you can put your cursor all the way to the left on the line of code you'd like to execute.  For example place your cursor to the left of the 'p' in the 'print' function and press "ctrl + enter" (Windows/Linux) or "command + enter" (Mac OSX).  This will run that line of code and display "Hello, World!" in the R-Studio console window.


STEP 1: USE PRINT FUNCTION TO DISPLAY A STRING

To display a text string variable in the R-Studio console, you can use the 'print' function.
print("Hello, World!")

STEP 2: ASSIGN A STRING TO A VARIABLE AND USE PRINT TO DISPLAY IT

Next, let's assign "Hello, World!" to a string variable and use the 'print' function to display it in the console.
x = "Hello, World!"
print(x)

STEP 3: ASSIGN TWO STRINGS TO TWO VARIABLES AND USE PASTE TO DISPLAY IT

Next we will use the 'paste' function to combine variables and strings.  Like 'print', it will display the result in the console.

x = "Hello, World"
y = "once again!"
paste(x,y, sep = " ")
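A few variations on 'paste' are worth trying.  In this sketch the result names ("with_space", "no_space", "collapsed") are made up; it shows the default separator, the 'paste0' shortcut for no separator, and the 'collapse' argument for joining the items of a single vector.

```r
x <- "Hello, World"
y <- "once again!"

# The default separator is a single space
with_space <- paste(x, y)        # "Hello, World once again!"

# paste0 joins strings with no separator at all
no_space <- paste0(x, "!")       # "Hello, World!"

# collapse joins the elements of one vector into a single string
collapsed <- paste(c("a", "b", "c"), collapse = "-")  # "a-b-c"
```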

STEP 4: LEARNING MORE ABOUT R FUNCTIONS

At any time during your R-Studio session you can display information about a given function by preceding the function name with a '?'. This will display helpful information about the function in the Help pane on the right.
?print
?paste

STEP 5: SAVING YOUR SCRIPT

Only one more step.  Let's save that script you just created by clicking on "File" -> "Save As..." and let's name your script "R_Lesson1.R".  Click "Save".
Congratulations!  You've completed Lesson 1!

DOWNLOAD CODE
Here is the code from my GitHub gist "R Lesson 1 - First Program" in case you'd rather just copy and paste it and then play around with it.