STAT 29000: Project 14 — Fall 2020
Motivation: As we learned earlier in the semester, bash scripts are a powerful tool when you need to perform repeated tasks in a UNIX-like system. In addition, sometimes preprocessing data using UNIX tools prior to analysis in R or Python is useful. Ample practice is integral to becoming proficient with these tools. As such, we will be reviewing topics learned earlier in the semester.
Context: We’ve just ended a series of projects focused on SQL. In this project we will begin to review topics learned throughout the semester, starting writing bash scripts using the various UNIX tools we learned about in Projects 3 through 8.
Scope: awk, UNIX utilities, bash scripts, fread
Dataset
The following questions will use ENTIRE_PLOTSNAP.csv from the data folder found in Scholar:
/anvil/projects/tdm/data/forest/
To read more about ENTIRE_PLOTSNAP.csv that you will be working with:
Questions
Question 1
Take a look at at ENTIRE_PLOTSNAP.csv
. Write a line of awk code that displays the STATECD
followed by the number of rows with that STATECD
.
-
Code used to solve the problem.
-
Count of the following `STATECD`s: 1, 2, 4, 5, 6
Question 2
Unfortunately, there isn’t a very accessible list available that shows which state each STATECD
represents. This is no problem for us though, the dataset has LAT
and LON
! Write some bash that prints just the STATECD
, LAT
, and LON
.
There are 92 columns in our dataset: |
-
Code used to solve the problem.
-
The output of your code piped to
head
.
Question 3
fread
is a "Fast and Friendly File Finagler". It is part of the very popular data.table
package in R. We will learn more about this package next semester. For now, read the documentation here and use the cmd
argument in conjunction with your bash code from (2) to read the data of STATECD
, LAT
, and LON
into a data.table
in your R environment.
-
Code used to solve the problem.
-
The
head
of the resultingdata.table
.
Question 4
We are going to further understand the data from question (3) by finding the actual locations based on the LAT
and LON
columns. We can use the library revgeo
to get a location given a pair of longitude and latitude values. revgeo
uses a free API hosted by photon in order to do so.
For example:
library(revgeo)
revgeo(longitude=-86.926153, latitude=40.427055, output='frame')
The code above will give you the address information in six columns, from the most-granular housenumber
to the least-granular country
. Depending on the coordinates, revgeo
may or may not give you results for each column. For this question, we are going to keep only the state
column.
There are over 4 million rows in our dataset — we do not want to hit photon’s API that many times. Instead, we are going to do the following:
-
Unless you feel comfortable using
data.table
, convert yourdata.table
to adata.frame
:
my_dataframe <- data.frame(my_datatable)
-
Calculate the average
LAT
andLON
for eachSTATECD
, and call the newdata.frame
,dat
. This should result in 57 rows of lat/long pairs. -
For each row in
dat
, run a reverse geocode and append thestate
to a new column calledSTATE
.
To calculate the average |
|
Here is some extra help:
|
It is okay to get "Not Found" for some of the addresses. |
-
Code used to solve the problem.
-
The
head
of the resultingdata.frame
.