R Syntax Cheatsheet - Livia's Notes

# R Syntax Cheatsheet

#tool/rstats

Much of this comes from a guide by Henrik Andersen, boiled down a bit; I wanted to make myself a personal cheatsheet.

## Basic Syntax

```R
# variable assignment
x <- 5
x = 5

# math
2 + 2
2 - 2
2 * 2
2 / 2
3^3 # 3 * 3 * 3

# logical operations
2 == 2
3 != 3
2 > 3
2 >= 3
2 < 3
2 <= 3
```

Variables should get meaningful names, e.g. `alter <- 5`, with subtypes separated by an underscore: `alter_zentriert`. Dots between words already appear in many base function names (example: `as.integer()`) and mark S3 methods (example: `print.data.frame`), so they should be avoided in variable names. Meaning: **don't** do `alter.zentriert`! Hyphens are a minus!

## Data Types

- double: 5.24, 12.888, 4
- integer: 4L, 12L, 101L
- logical: TRUE, FALSE
- character (strings): "hi", 'bye'

Check a data type with `typeof()`.

**Coercion**, i.e. type conversion, happens automatically. Manual conversion goes through the `as.*()` family of functions, for example `as.integer()` and `as.character()`.

```R
x = c(1, 4); typeof(x) # "double"
```

**Scientific notation** is often used in `R` for printing values much smaller than one:

```R
d = 0.0000123
print(d)
```

Result: 1.23e-05, i.e. 0.0000123

### Vectors

Vectors are built with `c(a, b, c)`. A vector holds exactly one data type, so mixed inputs are converted automatically to a common type:

```R
x = c(TRUE, FALSE, TRUE)   # type logical
y = c(TRUE, 2, 3L, "four") # type character, CAREFUL
z = c(TRUE, 3, 4L)         # type double
```

*Careful*: as soon as strings are involved, everything is converted to strings!
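The conversion rule can be checked step by step. A minimal sketch using only base R; the variable names (`lgl_int`, `int_dbl`, `dbl_chr`) are made up for illustration:

```R
# Coercion moves towards the most flexible type:
# logical < integer < double < character
lgl_int = c(TRUE, 4L)     # logical + integer
int_dbl = c(4L, 2.5)      # integer + double
dbl_chr = c(2.5, "four")  # double + character

typeof(lgl_int)  # "integer"
typeof(int_dbl)  # "double"
typeof(dbl_chr)  # "character"

# Manual conversion with the as.*() functions
as.integer(TRUE)                      # 1L
as.integer("12")                      # 12L
as.character(3.5)                     # "3.5"
suppressWarnings(as.numeric("four"))  # NA (R normally warns here)
```

Text that cannot be read as a number becomes `NA` instead of raising an error, so it is worth checking for `NA` after a manual conversion.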
```R
w = c("yes", TRUE, 3, 4L)
w; typeof(w)
```

> [1] "character"

#### Vectors can be edited flexibly

```R
presidents <- c("obama", "trump")
print(presidents)
```

> [1] "obama" "trump"

Removing entries from objects:

```R
presidents <- presidents[presidents != "trump"]
print(presidents)
```

> [1] "obama"

Adding entries to objects:

```R
presidents <- c(presidents, "harris")
print(presidents)
```

> [1] "obama" "harris"

### Lists, Tables and Dataframes

Lists can hold different data types:

```R
new_list = list("a", 2, 33L, TRUE)

# also nested, so effectively a table
new_df = list(c("a", "b", "c"),
              c(2, 3, 4),
              c(33L, 44L, 55L),
              c(TRUE, TRUE, FALSE))
```

Dataframes work much like nested lists, but they are their own data type and as such vectorized and addressable:

```R
df_people = data.frame(age = c(23, 24, 25),
                       eye_color = c("red", "green", "blue"),
                       height = c(183, 175, 192),
                       born_in_kms = c(TRUE, FALSE, TRUE))
```

There are also tidyverse dataframes, tibbles:

```R
library(tibble)
dataframe <- tibble(x = c(1, 2, 3), y = 2 * x)
```

### Vectors and/or Iteration

Instead of looping through the rows or columns of a dataframe (as in other languages), R has vectors. Concretely, I can simply apply one operation to all values of a variable in a dataframe.

```R
centered_age = df_people$age - mean(df_people$age, na.rm = TRUE)
```

...instead of laboriously iterating:

```R
centered_age = numeric(nrow(df_people)) # empty result vector
for (i in seq_len(nrow(df_people))) {
  centered_age[i] = df_people$age[i] - mean(df_people$age, na.rm = TRUE)
}
```

(the for loop works too, but is needlessly complicated)

That said, a for loop can still be handy in some situations. Here is an example:

```R
# Make a dataframe.
dfPeople = data.frame(age = c(23, 24, 25),
                      eye_color = c("red", "green", "blue"),
                      height = c(183, 175, 192),
                      born_in_kms = c(TRUE, FALSE, TRUE))

# Determine appropriate data types for each variable
for (i in 1:ncol(dfPeople)) {
  print(typeof(dfPeople[ , i]))
}
```

This goes through all columns once.

## Missing Values

Missing values, i.e. `NA`, are "contagious" and produce further missing values: `mean(c(1, NA, 3))` returns `NA` unless `na.rm = TRUE` is set. *Careful*, a big source of errors!

## Convenience functions

Base R already ships with a few:

```R
mean(c(2, 3, 4))
sum(c(2, 3, 4))
min(c(2, 3, 4))
max(c(2, 3, 4))
cumsum(c(1, 2, 3, 4, 5))  # cumulative sum
prod(1, 2, 3, 4, 5)       # product
# Notice prod() takes multiple arguments, no need for c()
cumprod(c(1, 2, 3, 4, 5)) # cumulative product
# cumprod() takes a single argument, like cumsum()
```

More functions come with packages.

## Subsetting

Square brackets cut out values. We can choose values from a vector using square brackets `[]`. For example, `[1]` means select the first element, `[2]` means select the second element, and so on.

```R
x = c(7, 8, 9)
x[1]       # result: 7
x[c(1, 2)] # result: 7, 8
```

A colon selects all integers between two numbers. We can use `:` to enter ranges. For example, `1:5` gives us the integers 1, 2, 3, 4, 5 (in interval notation it would be [1, 5], because the endpoints are included).

```R
1:5 # result: 1 2 3 4 5
```

The two can also be combined:

```R
w = c(3, 4, 2, 1, 6, 4, 2, 6, 3, 3)
w[1:6] # result: 3 4 2 1 6 4
```

Normally, we subset dataframes by columns, i.e., for selecting specific variables.

```R
df = data.frame(age = c(23L, 33L, 45L),
                married = c(TRUE, FALSE, FALSE),
                height = c(149.4, 189.2, 175.3),
                eyes = c("blue", "green", "brown"))
```

There is a special operator for doing so, `$`, as in `df$eyes`:

```R
mean(df$age) # result: 33.66667
```

## Packages

If a package is not installed but gets used anyway, R shows a message. Chill. Example: install and load the package `lavaan`:

```R
install.packages("lavaan") # quotation marks matter!
library(lavaan)
```

## Loading External Data

Use `package::function(argument)` to load data:

```R
# install.packages("haven")
library(haven)
albus = haven::read_sav("allbus2021.sav")
```

Working with this a little bit:

```R
str(albus)       # view structure (including variable names)
head(albus)      # view first few entries, by default 6
head(albus$lm20) # examine a single variable with $variable
```

## Special Data Types

- For now, there are two relevant special data types: `factor` and `labelled` (date is also interesting)
- for important applications like linear regression, character variables are incompatible
- assigning them plain integers is also bullshit, because m=1, w=2, nb=3 would mean nb's are 3 times more valuable than males (which is kinda true but also, no)

**Factors** encode data as integer values but treat the integers as categorical elements with labels to aid interpretation:

```R
sex = factor(c(1, 2, 3, 1, 2, 3),         # set of data
             levels = 1:3,                # there are 3 different levels
             labels = c("m", "f", "d"))   # male female diverse
sex
```

The other important special data type is the labelled variable. Labelled variables are similar to factor variables in that they assign labels to integer values, but they are used for ordinal (ordered categorical) variables. The values of ordinal variables can be put in order, say, from largest to smallest, but the distances between categories may not be the same for all portions of the scale.

The package haven comes with a function **labelled** which takes three arguments:

- x: the numerical vector of data,
- labels: the category labels (which need not be exhaustive), and
- label: the variable label.

```R
attitude = haven::labelled(c(1, 3, 5, 4, 2, 1, 5, 4, 1, 2),
                           labels = c("Disagree fully" = 1,
                                      "Neither nor" = 3,
                                      "Agree fully" = 5),
                           label = "A Likert-style attitude measure")
attitude
```

**Unlike factor variables**, with labelled variables not every possible integer value needs a label. So we can label the endpoints, or the middle of a scale, for example.
This is particularly helpful with [[Likert Skalierung]] style variables that are usually considered ordinal variables.

## Paths and Working Directory

Instead of using long absolute paths, either create an RStudio project or, even simpler (and more portable): set a working directory (wd) at the very top of a script.

```R
setwd("/Users/lia/Library/Mobile Documents/com~apple~CloudDocs/40 Areas/43 Uni/M6 Statistik II Übung R/") # set working directory
```

From there on, all paths are relative. Example for loading an SPSS file (the package haven gets loaded as well):

```R
# install.packages("haven")
library(haven)
albus = haven::read_sav("Daten/allbus2021.sav") # load files from "Daten" folder in current working directory
```

## Recode

Example:

```r
> head(df$work)
<labelled<double>[6]>: BEFRAGTE(R) BERUFSTAETIG?
[1] 1 1 4 4 4 4

Labels:
 value                label
   -42     DATENFEHLER: MFN
   -41          DATENFEHLER
    -9         KEINE ANGABE
     1 HAUPTBERUFL.VOLLZEIT
     2 HAUPTBERUFL.TEILZEIT
     3  NEBENHER BERUFSTAE.
     4  NICHT ERWERBSTAETIG
```

Task: recode the variable work (BEFRAGTE(R) BERUFSTAETIG) to:

- 1 = NICHT ERWERBSTAETIG
- 2 = NEBENHER BERUFSTAE. + TEILZEIT
- 3 = VOLLZEIT, GANZTAGS

…and turn it into a factor.

```r
library(dplyr)

df <- df |>
  dplyr::mutate(
    work_rec = case_when(
      # 4 -> 1 = NICHT ERWERBSTAETIG
      work == 4 ~ 1,
      # 2 or 3 -> 2 = NEBENHER BERUFSTAE. + TEILZEIT
      work %in% c(2, 3) ~ 2,
      # 1 -> 3 = VOLLZEIT, GANZTAGS
      work == 1 ~ 3,
      # special cases (error codes, no answer) -> NA
      work %in% c(-42:0) ~ NA_real_,
      # anything else -> NA
      TRUE ~ NA_real_
    ),
    # turn the new codes into a factor with readable labels
    work_rec = factor(
      work_rec,
      levels = 1:3,
      labels = c("NICHT ERWERBSTAETIG", # 1
                 "NEBENHER/TEILZEIT",   # 2
                 "VOLLZEIT"             # 3
      )
    )
  )
```
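One way to sanity-check a recode like this is to run the same mapping on a handful of values and cross-tabulate old against new codes. A minimal sketch in base R, assuming made-up toy data: `work`, `lookup` and the values are hypothetical and do not come from the ALLBUS file, and the lookup vector is a swapped-in alternative to `case_when()`, not the method used above.

```R
# Hypothetical toy values standing in for df$work
work = c(1, 1, 4, 2, 3, -9, 4)

# Lookup vector: old codes 1, 2, 3, 4 -> new codes 3, 2, 2, 1
lookup = c(3, 2, 2, 1)

# match() returns NA for codes outside 1:4 (e.g. -9),
# so those end up as NA in the recoded variable
work_rec = lookup[match(work, 1:4)]

# Turn the new codes into a factor with readable labels
work_rec = factor(work_rec,
                  levels = 1:3,
                  labels = c("NICHT ERWERBSTAETIG",
                             "NEBENHER/TEILZEIT",
                             "VOLLZEIT"))

# Cross-tabulate old vs. new codes, including NA
table(work, work_rec, useNA = "ifany")
```

Every old code should land in exactly one column of the table, and the special codes should only show up in the `NA` column.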