R Notes

Last week: dplyr

\(\color{blue}{\text{summarize()}}\) - summarize \(\color{blue}{\text{variables}}\)

Summarize according to a summary function

Summary functions include

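Summary Functions

| Function | Returns | Function | Returns |
|---|---|---|---|
| first() | first value | sum() | sum of values |
| last() | last value | n() | number of values |
| nth(.x, n) | nth value | n_distinct() | number of distinct values |
| min() | minimum value | mean() | mean value |
| max() | maximum value | var() | variance |
| median() | median value | sd() | standard deviation |
| quantile(.x, probs = .25) | quantile at the given probability (here, the 25th percentile) | IQR() | interquartile range |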

Things to note:

  • multiple summary functions can be called within the same command
  • we can give the summary values new names (though we don’t have to)
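A minimal sketch of both points, assuming dplyr is installed and using the built-in mtcars data as a stand-in (the notes do not name a dataset). The column names n_cars, mean_mpg, and so on are made up for illustration.

```r
library(dplyr)

# Several summary functions in one call, each given a new name.
# mtcars ships with R; mpg and hp are two of its numeric columns.
overall <- mtcars %>%
  summarize(
    n_cars   = n(),                          # number of rows
    mean_mpg = mean(mpg),
    sd_mpg   = sd(mpg),
    max_hp   = max(hp),
    q25_mpg  = quantile(mpg, probs = .25)    # 25th percentile
  )

# Naming is optional: without a name, the column is named after the expression.
unnamed <- mtcars %>%
  summarize(mean(mpg))

overall
names(unnamed)
```

Without group_by, summarize collapses the whole data frame to a single row.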

Summarize is especially helpful when combined with group_by

\(\color{green}{\text{group_by()}}\) - group \(\color{green}{\text{rows}}\)

Aggregate/group by value(s) of column(s).

  • we can group by more than one variable at once
  • we can perform other operations after group_by as well, like mutate
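As a sketch of the two points above (again assuming dplyr and the built-in mtcars data, which the notes themselves do not use), grouping by two variables before summarize gives one row per combination, while mutate after group_by computes values within each group and keeps every row. The new column names are illustrative.

```r
library(dplyr)

# Group by more than one variable, then summarize within each combination.
by_cyl_gear <- mtcars %>%
  group_by(cyl, gear) %>%
  summarize(n = n(), mean_mpg = mean(mpg), .groups = "drop")

# mutate after group_by works group-wise: each car's mpg is compared
# with the mean mpg of cars that have the same number of cylinders.
with_dev <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_vs_cyl_mean = mpg - mean(mpg)) %>%
  ungroup()

by_cyl_gear
head(with_dev)
```

The ungroup() at the end is a habit worth keeping: it stops later steps from silently running per group.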

\(\color{blue}{\text{mutate()}}\) - create new \(\color{blue}{\text{variables}}\)

Create new columns or alter existing columns

  • we can mutate new variables as functions of other variables (ratios, conditions, ranks, etc.)
  • we can mutate multiple variables in the same command
  • we can mutate based on conditions, e.g., with if_else or case_when:
df <- df %>% 
  mutate(newvar = if_else(condition, value_if_true, value_if_false, value_if_na))

df <- df %>% 
  mutate(newvar = case_when(
    condition1 ~ value1, 
    condition2 ~ value2, 
    condition3 ~ value3, 
    TRUE ~ value_everything_else))
  • \(\color{blue}{\text{summarize(across())}}\) - apply summary function to select \(\color{blue}{\text{variables}}\)
  • \(\color{blue}{\text{summarize(across(where()))}}\) - apply summary function to \(\color{blue}{\text{variables}}\) by conditions
  • across() can also be used within mutate
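The templates and across() variants above can be sketched on built-in data, assuming dplyr 1.0 or later. The thresholds, labels, and new column names below are invented for illustration only.

```r
library(dplyr)

# Conditional mutate: if_else for two outcomes, case_when for several.
cars_flagged <- mtcars %>%
  mutate(
    efficient = if_else(mpg > 25, "yes", "no"),
    size = case_when(
      cyl == 4 ~ "small",
      cyl == 6 ~ "medium",
      TRUE     ~ "large"      # everything else
    )
  )

# summarize(across()): one summary function applied to selected columns.
cyl_means <- mtcars %>%
  group_by(cyl) %>%
  summarize(across(c(mpg, hp), mean))

# summarize(across(where())): select columns by a condition (here, numeric).
iris_means <- iris %>%
  summarize(across(where(is.numeric), mean))

# across() inside mutate: rescale two columns by their maximum.
scaled <- mtcars %>%
  mutate(across(c(mpg, hp), ~ .x / max(.x)))
```

case_when evaluates its conditions in order, so the catch-all TRUE line belongs last.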

Last week: Factors

Factors are variables that take on a limited number of values, aka categorical variables. In R, a factor is stored as a vector of integer codes plus the corresponding set of character values shown when the factor is displayed (colloquially, labels; in R, levels).
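A small base R sketch of that storage, using a made-up vector of sizes:

```r
# A factor with an explicit level order.
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))

levels(sizes)      # "small" "medium" "large"
as.integer(sizes)  # 1 3 2 1  (positions in the level set)
typeof(sizes)      # "integer"
```

The integer codes index into the levels, which is why the order of the levels matters for sorting, plotting, and modeling.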

forcats

The forcats package, part of the tidyverse, provides helper functions for working with factors, including:

  • fct_infreq(): reorder factor levels by frequency of levels
  • fct_reorder(): reorder factor levels by another variable
  • fct_relevel(): change order of factor levels by hand
  • fct_recode(): change factor levels by hand
  • fct_collapse(): collapse factor levels into defined groups
  • fct_lump(): collapse least/most frequent levels of factor into “other”
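To make the list above concrete, here is a sketch on a tiny made-up factor, assuming forcats is installed. The vector x and the new level names ("beta", "ab") are hypothetical.

```r
library(forcats)

f <- factor(c("b", "a", "c", "b", "b", "c"))
x <- c(10, 1, 5, 20, 30, 7)   # a second variable, one value per element of f

levels(f)                                    # "a" "b" "c" (alphabetical default)
levels(fct_infreq(f))                        # "b" "c" "a" (most frequent first)
levels(fct_reorder(f, x))                    # "a" "c" "b" (by median of x)
levels(fct_relevel(f, "c"))                  # "c" "a" "b" (moved by hand)
levels(fct_recode(f, beta = "b"))            # "a" "beta" "c"
levels(fct_collapse(f, ab = c("a", "b")))    # "ab" "c"
levels(fct_lump(f, n = 1))                   # keeps the top level, lumps the rest into "Other"
```

All of these return a new factor; the original f is unchanged unless reassigned.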

Some tidyr

Splitting/Combining

Data from one column to multiple columns, or from multiple columns into one

  • unite(data, col, ..., sep = "_", remove = TRUE, na.rm = FALSE)
  • separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
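A round-trip sketch of the two signatures above, assuming tidyr is installed. The data frame and its column names are invented for illustration.

```r
library(tidyr)

dates <- data.frame(year = c(2020, 2021), month = c("01", "12"))

# Multiple columns into one.
united <- unite(dates, "year_month", year, month, sep = "-")
united$year_month   # "2020-01" "2021-12"

# One column back into several; convert = TRUE re-guesses column types,
# so both pieces come back as integers (month "01" becomes 1).
split_again <- separate(united, year_month, into = c("year", "month"),
                        sep = "-", convert = TRUE)
split_again
```

With remove = TRUE (the default for both functions) the input columns are dropped from the result.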

Some stringr

stringr provides a set of functions to make working with strings easier. Built on stringi, it implements some of the most frequently used string manipulation functions.

All functions in stringr start with str_ and take a vector of strings as the first argument. Some key functions:

  • Getting characters from a string: str_sub(x, start = 1L, end = -1L)
  • Adding/removing white space (or characters): str_pad(string, width, side = c("left", "right", "both"), pad = " ") or str_trim(string, side = c("both", "left", "right")) or str_wrap(string, width = 80, indent = 0, exdent = 0)
  • Modifying case: str_to_upper(string) or str_to_lower(string) or str_to_title(string)
  • A whole slew of pattern matching: str_detect(), str_count(), str_subset(), str_locate(), str_extract(), str_replace()
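A few of the functions listed above, sketched on made-up strings and assuming stringr is installed. The strings are examples only, not values taken from the court data.

```r
library(stringr)

x <- c("  Circuit Court ", "general district")

trimmed <- str_trim(x)                  # "Circuit Court" "general district"
str_sub(trimmed, 1, 7)                  # "Circuit" "general"
str_to_title(trimmed)                   # "Circuit Court" "General District"
str_pad("7", width = 3, pad = "0")      # "007"
str_detect(trimmed, "Court")            # TRUE FALSE
str_replace(trimmed, " ", "_")          # "Circuit_Court" "general_district"

# Pattern matching with a regular expression: pull the leading title
# number out of a code-section-style string.
str_extract("18.2-95", "^[0-9.]+")      # "18.2"
```

The pattern argument is treated as a regular expression by default, so characters like "." need care when they should match literally.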

See stringr for more, and the stringr vignette on regular expressions.

R Markdown

Combine code, results, and prose into dynamic and reproducible documents suitable for sharing! These notes are made with R Markdown!

Cheat Sheets!

Because nobody can remember all of this!

Artwork by @allison_horst


Court Data Notes

From PS 1

  • Outcomes of interest noted in PS1 (for circuit court): SentenceTime (4), FineAmount (4), ProbationTime (4), HearingPlea (4), ConcludedBy (3), ChargeType (2), Costs (2), HearingType (2), Charge, DefenseAttorney, OffenseDate, ArrestDate, HearingResult, SentenceSuspended, ConcurrentConsecutive, ProgramType
  • Vars to understand better noted in PS1 (for circuit court): ChargeType (3), HearingType (3), Class (2), Charge (2), Costs (2), ProgramType (2), ProbationType (2), DispositionCode (2), SentenceSuspended (2), RestitutionAmount, DrivingRestrictions, JailPenitentiary, AmendedCode, HearingResult, fips, locality, HearingRoom, DOB, AKA

Stuff I know or have recently learned

  • It’s worth looking at the online case information system from which this data is scraped: https://eapps.courts.state.va.us/gdcourts/changeCourt.do
  • FIPS/FIPS3 is more useful as a descriptor of jurisdiction than locality; Birth year is masked in DOB; AKA records any known aliases
  • Digging into class: only felonies and misdemeanors appear to have class designations; felonies have six classes and misdemeanors have four, but each may also be Unclassified (U) or classified as Other (O). Other (O) is said to have penalties defined in the code (this one feels more speculative to me, but potentially verifiable in the data); Unclassified (U) is said not to fully fit the designated class definitions (here for example). Penalty ranges are defined by the charge type and class.
  • The code section of a charge refers directly to the sections in the Code of Virginia; the specific charge can change as the case flows through the system (AmendedCharge)
  • Title 18.2 defines crimes and offenses: https://law.lis.virginia.gov/vacode/title18.2/
  • Title 19.2 defines criminal procedure: https://law.lis.virginia.gov/vacode/title19.2/
  • Title 46.2 defines motor vehicle operations: https://law.lis.virginia.gov/vacode/title46.2/, etc.
  • Latin terms (I keep forgetting):
    • nolle prosequi: the prosecutor will drop prosecution
    • capias: an arrest warrant (though I’m confused about its presence as a response value in CaseType in the general court data, where the law student says it means “failure to appear”; it makes more sense in the description under HearingResult in circuit court data)
    • nolo contendere: the defendant doesn’t contest the charge (a “no contest” plea)
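The note above that class designations are "potentially verifiable in the data" could be checked with last week's dplyr tools. This is only a sketch on a toy tibble: the rows and the value codes ("Felony", "6", "U", and so on) are invented, and only the column names ChargeType and Class come from the notes. The real data would replace toy_cases.

```r
library(dplyr)

# Toy rows only: invented values, NOT real court records.
toy_cases <- tibble(
  ChargeType = c("Felony", "Felony", "Misdemeanor", "Misdemeanor", "Infraction"),
  Class      = c("6", "U", "1", "O", NA)
)

# How often does each Class value occur within each ChargeType?
class_counts <- toy_cases %>%
  count(ChargeType, Class)

# Which charge types carry any class designation at all?
classified <- toy_cases %>%
  group_by(ChargeType) %>%
  summarize(
    any_class = any(!is.na(Class)),
    n_classes = n_distinct(Class, na.rm = TRUE)
  )

class_counts
classified
```

If only felonies and misdemeanors have class designations, any_class should be FALSE for every other charge type in the real data.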

XKCD, Randall Munroe, https://xkcd.com/2494/