Public Interest::Data Ethics & Practice

R Notes

Last week: dplyr

- summarize
- group
- create new

Last week: Factors

forcats

Some tidyr

Splitting/Combining

Some stringr
R Markdown
Cheat Sheets!

Court Data Notes

From PS 1
Stuff I know or have recently learned

R Notes

Last week: `dplyr`

- summarize

Summarize according to a summary function

Summary functions include

Summary Functions

| Function | Value | Function | Value |
|---|---|---|---|
| first() | first value | sum() | sum of values |
| last() | last value | n() | number of values |
| nth(.x, n) | nth value | n_distinct() | number of distinct values |
| min() | minimum value | mean() | mean value |
| max() | maximum value | var() | variance |
| median() | median value | sd() | standard deviation |
| quantile(.x, probs = .25) | quantile at the given probability | IQR() | interquartile range |

Things to note:

- multiple summary functions can be called within the same command
- we can give the summary values new names (though we don’t have to)

Summarize is especially helpful when combined with group_by
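
A minimal sketch of calling several summary functions in one summarize() command and naming the results. The `cases` data frame and its column names are hypothetical toy values made up for illustration, not the court data.

```r
library(dplyr)

# Hypothetical toy data: fine amounts for five made-up cases
cases <- tibble(
  court = c("A", "A", "B", "B", "B"),
  fine  = c(100, 250, 50, 75, 400)
)

# Several summary functions in one call, each result given a new name
fine_summary <- cases %>%
  summarize(
    n_cases     = n(),
    mean_fine   = mean(fine),
    median_fine = median(fine),
    max_fine    = max(fine)
  )

fine_summary
```

Without grouping, the result is a single row with one column per named summary.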

- group

Aggregate/group by value(s) of column(s).

- we can group by more than one variable at once
- we can perform other operations after group_by as well, like mutate
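
The two points about grouping can be sketched as follows, again on a hypothetical toy data frame (the `court`, `type`, and `fine` columns are invented for illustration). The first pipeline groups by two variables before summarizing; the second uses mutate after group_by, which keeps every row and adds a value computed within each group.

```r
library(dplyr)

# Hypothetical toy data, not the real court records
cases <- tibble(
  court = c("A", "A", "B", "B", "B"),
  type  = c("felony", "misdemeanor", "felony", "felony", "misdemeanor"),
  fine  = c(100, 250, 50, 75, 400)
)

# Group by two variables at once, then summarize within each combination
by_court_type <- cases %>%
  group_by(court, type) %>%
  summarize(n_cases = n(), mean_fine = mean(fine), .groups = "drop")

# group_by followed by mutate: each fine as a share of its court's total
with_share <- cases %>%
  group_by(court) %>%
  mutate(share_of_court_fines = fine / sum(fine)) %>%
  ungroup()
```

Calling ungroup() at the end is a design choice here, so that later steps do not silently keep operating by group.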

- create new

Create new columns or alter existing columns

- we can mutate new variables as functions of other variables (ratios, conditions, ranks, etc.)
- we can mutate multiple variables in the same command
- we can mutate based on conditions, e.g., if_else, case_when

df <- df %>% 
  mutate(newvar = if_else(condition, value_if_true, value_if_false, value_if_na))

df <- df %>% 
  mutate(newvar = case_when(
    condition1 ~ value1, 
    condition2 ~ value2, 
    condition3 ~ value3, 
    TRUE ~ value_everything_else))
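
As a concrete sketch of the two templates above, the example below fills them in with hypothetical columns (`fine`, `jail_days`) and made-up cutoffs. Both new variables are created in the same mutate command.

```r
library(dplyr)

# Hypothetical toy data; column names and cutoffs are illustrative only
cases <- tibble(
  fine      = c(0, 100, 250, NA),
  jail_days = c(0, 30, 400, 10)
)

cases <- cases %>%
  mutate(
    # if_else: one condition, with an explicit value for missing input
    any_fine = if_else(fine > 0, "fine", "no fine", missing = "unknown"),
    # case_when: conditions are checked in order; TRUE catches the rest
    sentence = case_when(
      jail_days == 0   ~ "none",
      jail_days <= 365 ~ "up to a year",
      TRUE             ~ "over a year"
    )
  )
```

Because case_when stops at the first condition that is true, the order of the conditions matters.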

- apply a summary function to selected columns
- apply a summary function to columns chosen by a condition

across() handles both cases inside summarize, and it can also be used within mutate
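
A short sketch of across() in each of these roles, using a hypothetical toy data frame (the `fine` and `costs` columns are invented for illustration).

```r
library(dplyr)

# Hypothetical toy data
cases <- tibble(
  court = c("A", "A", "B"),
  fine  = c(100, 250, 50),
  costs = c(20, 40, 10)
)

# Apply a summary function to columns selected by name
means_selected <- cases %>%
  summarize(across(c(fine, costs), mean))

# Apply a summary function to columns chosen by a condition
means_numeric <- cases %>%
  summarize(across(where(is.numeric), mean))

# across() within mutate: transform several columns in one step
in_hundreds <- cases %>%
  mutate(across(c(fine, costs), ~ .x / 100))
```

In this toy data the two summarize calls give the same result, because `fine` and `costs` are the only numeric columns.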

Last week: Factors

Factors are variables which take on a limited number of values, aka categorical variables. In R, factors are stored as a vector of integer values, with a corresponding set of character values that are shown when the factor is displayed (colloquially, labels; in R, levels).
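
The integer-plus-levels storage can be seen directly in base R. The `plea` vector below is a hypothetical example, not taken from the court data.

```r
# Hypothetical example values
plea <- factor(c("guilty", "not guilty", "guilty", "nolo contendere"))

plea_levels <- levels(plea)      # the character labels (levels)
plea_codes  <- as.integer(plea)  # the integer codes stored underneath

plea_levels
plea_codes
```

Each integer code is the position of that value's label in the levels vector, and by default the levels are the sorted unique values.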

`forcats`

The forcats package, part of the tidyverse, provides helper functions for working with factors, including:

fct_infreq(): reorder factor levels by frequency of levels
fct_reorder(): reorder factor levels by another variable
fct_relevel(): change order of factor levels by hand
fct_recode(): change factor levels by hand
fct_collapse(): collapse factor levels into defined groups
fct_lump(): collapse least/most frequent levels of factor into “other”
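
A brief sketch of several of these helpers applied to one hypothetical factor. The `charge` values are made up for illustration; fct_reorder() is left out because it needs a second variable to order by.

```r
library(forcats)

# Hypothetical example factor
charge <- factor(c("traffic", "felony", "traffic", "misdemeanor",
                   "traffic", "felony", "civil"))

by_freq   <- fct_infreq(charge)                    # most frequent level first
releveled <- fct_relevel(charge, "felony")         # move "felony" to the front
recoded   <- fct_recode(charge, infraction = "traffic")  # new name = old name
collapsed <- fct_collapse(charge, criminal = c("felony", "misdemeanor"))
lumped    <- fct_lump(charge, n = 2)               # keep the 2 most frequent levels

levels(by_freq)
levels(lumped)
```

None of these calls change the underlying values; they only change the set of levels, their names, or their order.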

Some `tidyr`

Splitting/Combining

Data from one column to multiple columns, or from multiple columns into one

unite(data, col, ..., sep = "_", remove = TRUE, na.rm = FALSE)
separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
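
A minimal sketch of both directions, assuming a hypothetical `hearings` data frame with a date stored as text. separate() splits the date into three columns, and unite() pastes two of them back together.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data
hearings <- tibble(
  case_id      = c("001", "002"),
  hearing_date = c("2021-03-15", "2020-11-02")
)

# One column into several; convert = TRUE turns the pieces into integers
split_dates <- hearings %>%
  separate(hearing_date, into = c("year", "month", "day"),
           sep = "-", convert = TRUE)

# Several columns into one; the inputs are dropped because remove = TRUE by default
rejoined <- split_dates %>%
  unite("year_month", year, month, sep = "_")
```

Note that convert = TRUE drops the leading zero in the month, so the united value is built from the integer, not from the original text.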

Some `stringr`

stringr provides a set of functions to make working with strings easier. Built on stringi, it implements some of the most frequently used string manipulation functions.

All functions in stringr start with str_ and take a vector of strings as the first argument. Some key functions:

- Getting characters from a string: str_sub(x, start = 1L, end = -1L)
- Adding/removing white space (or characters): str_pad(string, width, side = c("left", "right", "both"), pad = " "), str_trim(string, side = c("both", "left", "right")), or str_wrap(string, width = 80, indent = 0, exdent = 0)
- Modifying case: str_to_upper(string), str_to_lower(string), or str_to_title(string)
- A whole slew of pattern-matching functions: str_detect(), str_count(), str_subset(), str_locate(), str_extract(), str_replace()

See the stringr documentation for more, and the stringr vignette on regular expressions.
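
A quick sketch of a few of the key functions on a hypothetical vector of charge descriptions (the strings are invented for illustration).

```r
library(stringr)

# Hypothetical example strings
charges <- c("  petit larceny ", "GRAND LARCENY", "dui: 1st offense")

trimmed     <- str_trim(charges)                     # drop leading/trailing spaces
first_five  <- str_sub(trimmed, 1, 5)                # characters 1 through 5
padded      <- str_pad("7", width = 3, side = "left", pad = "0")
lower       <- str_to_lower(trimmed)                 # modify case
has_larceny <- str_detect(lower, "larceny")          # pattern matching
digits      <- str_extract(trimmed, "[0-9]+")        # first run of digits, else NA
replaced    <- str_replace(trimmed, "1st", "first")  # replace first match
```

Lowercasing before str_detect() is one simple way to make the match case-insensitive.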

R Markdown

Combine code, results, and prose into dynamic and reproducible documents suitable for sharing! These notes are made with R Markdown!

See Wickham and Grolemund’s Chapter 27 in R4DS for key details!
See Yihui Xie’s comprehensive book for everything you could ever want to know about R Markdown

Cheat Sheets!

Because nobody can remember all of this!

Artwork by @allison_horst

Court Data Notes

From PS 1

Outcomes of interest noted in PS1 (for circuit court): SentenceTime (4), FineAmount (4), ProbationTime (4), HearingPlea (4), ConcludedBy (3), ChargeType (2), Costs (2), HearingType (2), Charge, DefenseAttorney, OffenseDate, ArrestDate, HearingResult, SentenceSuspended, ConcurrentConsecutive, ProgramType
Vars to understand better noted in PS1 (for circuit court): ChargeType (3), HearingType (3), Class (2), Charge (2), Costs (2), ProgramType (2), ProbationType (2), DispositionCode (2), SentenceSuspended (2), RestitutionAmount, DrivingRestrictions, JailPenitentiary, AmendedCode, HearingResult, fips, locality, HearingRoom, DOB, AKA

Stuff I know or have recently learned

It’s worth looking at the online case information system from which this data is scraped: https://eapps.courts.state.va.us/gdcourts/changeCourt.do
FIPS/FIPS3 is more useful as a descriptor of jurisdiction than locality; birth year is masked in DOB; AKA records any known aliases
Digging into class: only felonies and misdemeanors appear to have class designations; felonies have six classes and misdemeanors have four, but each may also be Unclassified (U) or classified as Other (O). Other (O) is said to have penalties defined in the code (this one feels more speculative to me, but potentially verifiable in the data); Unclassified (U) is said not to fully fit the designated class definitions (here for example). Penalty ranges are defined by the charge type and class.
The code section of a charge refers directly to the sections in the Code of Virginia; the specific charge can change as the case flows through the system (AmendedCharge)
Title 18.2 defines crimes and offenses: https://law.lis.virginia.gov/vacode/title18.2/
Title 19.2 defines criminal procedure: https://law.lis.virginia.gov/vacode/title19.2/
Title 46.2 defines motor vehicle operations: https://law.lis.virginia.gov/vacode/title46.2/, etc.
Latin terms (I keep forgetting):
- nolle prosequi: the prosecutor will drop prosecution
- capias: an arrest warrant (though I’m confused about its presence as a response value in CaseType in the general court data, where the law student says it means “failure to appear”; it makes more sense in the description under HearingResult in circuit court data)
- nolo contendere: the defendant doesn’t contest the charge

XKCD, Randall Munroe, https://xkcd.com/2494/

Public Interest::Data Ethics & Practice

Michele Claibourn

2022-02-02
