STAT 19000: Project 10 — Fall 2020
Motivation: Functions are powerful. They are building blocks to more complex programs and behavior. In fact, there is an entire programming paradigm based on functions called functional programming. In this project, we will learn to apply functions to entire vectors of data using sapply.
Context: We’ve just taken some time to learn about and create functions. One of the more common "next steps" after creating a function is to use it on a series of data, like a vector. sapply is one of the best ways to do this in R.
Scope: r, sapply, functions
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/okcupid/filtered
Questions
Please double-check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like |
Question 1
Load the following datasets into data.frames named users and questions, respectively: /class/datamine/data/okcupid/filtered/users.csv, /class/datamine/data/okcupid/filtered/questions.csv. This is data from users on OkCupid, an online dating app. In your own words, explain what each file contains and how they are related. It's always a good idea to poke around the data to get a better understanding of how things are structured!
Be careful, just because a file ends in |
- R code used to solve the problem.
- 1-2 sentences describing what each file contains and how they are related.
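A file's extension does not guarantee which delimiter it uses, so it can help to look at a few raw lines before parsing. The sketch below illustrates the idea with a small temporary file whose contents are hypothetical (not the OkCupid data) and which happens to use semicolons; the separator needed for the real files is something to check yourself.

```r
# Toy file: named .csv but delimited by semicolons (hypothetical contents)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;text", "q1;Do you like cats?", "q2;Coffee or tea?"), tmp)

# Look at the raw lines first to see which separator is actually used
readLines(tmp, n = 2)

# Parse with the separator observed above
toy <- read.csv(tmp, sep = ";", stringsAsFactors = FALSE)
dim(toy)
str(toy)
```

If dim reports a single column where you expected several, the separator is a likely culprit.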
Question 2
grep is an incredibly powerful tool available to us in R. We will learn more about grep in the future, but for now, know that a simple application of grep is to find a word in a string. In R, grep is vectorized and can be applied to an entire vector of strings. Use grep to find a question that references "google". What is the question?
If at first you don’t succeed, run |
To prepare for Question 3, look at the entire row of the |
- R code used to solve the problem.
- The text of the question that references Google.
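As a rough illustration of how grep behaves on a vector of strings, here is a minimal sketch. The strings in toy_text are made up for this example and are not taken from the dataset.

```r
# Hypothetical strings, not taken from the dataset
toy_text <- c("Do you like hiking?",
              "Have you ever searched for yourself online?",
              "Is Hiking fun?")

# grep returns the indices of the elements that match the pattern
idx <- grep("hiking", toy_text)
idx

# ignore.case = TRUE matches regardless of capitalization
idx_any_case <- grep("hiking", toy_text, ignore.case = TRUE)
toy_text[idx_any_case]

# value = TRUE returns the matching strings instead of their indices
matches <- grep("online", toy_text, value = TRUE)
matches
```

By default the match is case-sensitive, which is why the first call finds fewer elements than the second.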
Question 3
In Question 2 we found a pretty interesting question. What is the percentage of users that Google someone before the first date? Does the proportion change by gender (as defined by gender2)? How about by gender_orientation?
The two videos posted in Question 2 might help. |
If you look at the column of |
Use the |
the correct column of users,
breaking up the data according to gender2 or according to gender_orientation,
and use this as your function in the tapply:
function(x) {prop.table(table(x, useNA="always"))}
- R code used to solve this problem.
- The results of running the code.
- Written answers to the questions.
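To see how tapply combines with the prop.table function given in the hint, here is a small sketch on made-up data. The data frame toy and its columns answer and group are hypothetical stand-ins, not columns of users.

```r
# Hypothetical toy data: an answer column and a grouping column
toy <- data.frame(
  answer = c("Yes", "No", NA, "Yes", "Yes", NA),
  group  = c("a", "a", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)

# Proportion of each answer, counting NA as its own category
prop_with_na <- function(x) {prop.table(table(x, useNA = "always"))}

# Overall proportions
overall <- prop_with_na(toy$answer)
overall

# Proportions within each group: tapply(values, grouping, function)
by_group <- tapply(toy$answer, toy$group, prop_with_na)
by_group[["a"]]
by_group[["b"]]
```

Because the function returns a whole table rather than a single number, tapply returns one table per group, and the proportions within each group sum to 1.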
Question 4
In Project 8, we created a function called count_words. Use this function and sapply to create a vector which contains the number of words in each row of the column text from the questions dataframe. Call the new vector question_length, and add it as a column to the questions dataframe.
count_words <- function(my_text) {
  my_split_text <- unlist(strsplit(my_text, " "))
  return(length(my_split_text[my_split_text != ""]))
}
- R code used to solve this problem.
- The result of str(questions) (this shows how your questions data frame looks after adding the new column called question_length).
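The following sketch shows why sapply is needed here, using a hypothetical data frame toy_df rather than questions. Called on a whole vector, count_words pools every string together and returns a single total; sapply instead calls it once per element.

```r
count_words <- function(my_text) {
  my_split_text <- unlist(strsplit(my_text, " "))
  return(length(my_split_text[my_split_text != ""]))
}

# Hypothetical toy data; note the double space in the second string
toy_df <- data.frame(
  text = c("one two three", "hello  world", ""),
  stringsAsFactors = FALSE
)

# Called on the whole column, the function returns one pooled total
total_words <- count_words(toy_df$text)
total_words

# sapply calls count_words once per element;
# USE.NAMES = FALSE drops the names sapply would otherwise attach
toy_df$n_words <- sapply(toy_df$text, count_words, USE.NAMES = FALSE)
str(toy_df)
```

The column name n_words is only for this illustration; the question asks for question_length.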
Question 5
Consider this function called number_of_options that accepts a data frame (for instance, questions)…
number_of_options <- function(myDF) {
  table(apply(as.matrix(myDF[ , 3:6]), 1, function(x) {sum(!(x == ""))}))
}
…and counts the number of questions that have each possible number of responses. For instance, if we calculate number_of_options(questions) we get:
`
0 2 3 4
590 936 519 746
`
which means that: 590 questions have 0 possible responses; 936 questions have 2 possible responses; 519 questions have 3 possible responses; and 746 questions have 4 possible responses.
Now use the split function to break the data frame questions into 7 smaller data frames, according to the value in questions$Keywords. Then use the sapply function to determine, for each possible value of questions$Keywords, the analogous breakdown of questions with different numbers of responses, as we did above.
You can write:
|
The way sapply works is that its first argument is the object whose elements are passed, one at a time, as the first argument to your function; the second argument is the function you want applied; and after that you can specify further arguments to that function by name. For example:
test1 <- c(1, 2, 3, 4, NA, 5)
test2 <- c(9, 8, 6, 5, 4, NA)
mylist <- list(first=test1, second=test2)
# for a single vector in the list
mean(mylist$first, na.rm=TRUE)
# what if we want to do this for each vector in the list?
# how do we remove NAs?
sapply(mylist, mean)
# we can specify the arguments that are for the mean function
# by naming them after the first two arguments, like this
sapply(mylist, mean, na.rm=TRUE)
# in the code shown above, na.rm=TRUE is passed to the mean function
# just like if you run the following
mean(mylist$first, na.rm=TRUE)
mean(mylist$second, na.rm=TRUE)
# you can include as many arguments to mean as you normally would
# and in any order. just make sure to name the arguments
sapply(mylist, mean, na.rm=TRUE, trim=0.5)
# or sapply(mylist, mean, trim=0.5, na.rm=TRUE)
# which is similar to
mean(mylist$first, na.rm=TRUE, trim=0.5)
mean(mylist$second, na.rm=TRUE, trim=0.5)
- R code used to solve this problem.
- The results of running the code.
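The split-then-sapply pattern described in this question can be sketched on a tiny, made-up data frame. The names toy, keyword, and score are hypothetical and stand in for questions and its columns.

```r
# Hypothetical toy data
toy <- data.frame(
  keyword = c("a", "a", "b", "b", "b"),
  score   = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# split returns a named list of data frames, one per keyword value
pieces <- split(toy, toy$keyword)
names(pieces)

# sapply calls the function once per data frame in the list
rows_per_keyword <- sapply(pieces, nrow)
mean_per_keyword <- sapply(pieces, function(df) mean(df$score))
rows_per_keyword
mean_per_keyword
```

When the function returns results of different lengths for different pieces, sapply cannot simplify them into a vector or matrix and returns a list instead.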
Question 6
Lots of questions are asked in this okcupid dataset. Explore the dataset, and either calculate an interesting statistic/result using sapply, or generate a graphic (with good x-axis and/or y-axis labels, main labels, legends, etc.), or both! Write 1-2 sentences about your analysis and/or graphic, and explain what you thought you’d find, and what you actually discovered.
- R code used to solve this problem.
- The results from running your code.
- 1-2 sentences about your analysis and/or graphic, and explain what you thought you’d find, and what you actually discovered.
OPTIONAL QUESTION
Does it appear that there is an association between the length of the question and whether or not users answered the question? Assume NA means "unanswered". First create a function called percent_answered that, given a vector, returns the percentage of values that are not NA. Use percent_answered and sapply to calculate the percentage of users who answer each question. Plot this result against the length of the questions.
- R code used to solve this problem.
- The plot.
- Whether or not you think there is an association between question length and whether or not the question is answered.
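One detail that may help with this question: a data frame is a list of columns, so sapply applied to a data frame visits one column at a time. The sketch below shows this on a hypothetical data frame toy with made-up columns q1 and q2, computing the fraction (not the percentage) of non-missing values in each column.

```r
# Hypothetical toy data: each column holds answers to one question
toy <- data.frame(
  q1 = c("Yes", NA, "No", "Yes"),
  q2 = c(NA, NA, "No", NA),
  stringsAsFactors = FALSE
)

# is.na gives TRUE/FALSE per value; the mean of a logical vector
# is the fraction of TRUE values
frac_present <- sapply(toy, function(x) mean(!is.na(x)))
frac_present
```

The result is a named vector with one entry per column, which can then be lined up against other per-question quantities.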