Day 1 - Girls in Data Science#

../_images/programming_women.jpg

Photos/Images by #WOCinTech/#WOCinTech Chat (CC-BY)

Team Introduction#

  • Lead Instructor: Katie Burak (Assistant Professor of Teaching, Department of Statistics, UBC)

  • Data Science Mentors: Sky Sheng, PhD student

Thank you to our sponsors!

../_images/sponsor1.png ../_images/sponsor2.png ../_images/sponsor3.png

Camp Motivation#

The main motivation for the development of this camp is to promote and uplift women and gender minorities in data science. The leaky pipeline is a phrase describing the phenomenon that even though women are interested in careers in STEM fields (including data science), they are disproportionately lost along the way as the level of their education and careers progresses. My hope is that camps like this one help to encourage and empower women to pursue careers in STEM fields, while exposing them to new and exciting topics in data science.

Ice Breaker Activity#

In small groups of 3–4, take turns answering the following questions to get to know each other:

  1. What’s your favorite school subject and why do you like it?

  2. Where is one place you’ve always wanted to visit and what would you do there?

  3. What’s one thing everyone in your group has in common that isn’t obvious?

Day 1 Learning Objectives#

  • Edit and execute code and markdown in Jupyter notebooks

  • Understand the essentials of the R programming language

  • Apply the basics of data manipulation and data visualization in R

We’ve got a lot to do - let’s get started!#

https://media.giphy.com/media/6onMzNPjtFeCI/giphy.gif

1. Getting started with Jupyter & R#

This webpage is called a Jupyter notebook. A notebook is a place to write computer code for analysis, view the results of the analysis, as well as to narrate the analysis with rich formatted text…

1.1. Text Cells#

In a notebook, each rectangle containing text or code is called a cell.

Text cells (like this one) can be edited by double-clicking on them. They’re written in a simple format called Markdown to add formatting and section headings. For more practice using Markdown, try this online tutorial!

After you edit a text cell, click the “run cell” button at the top that looks like ▶ to confirm any changes. (Try not to delete the instructions of the lab.)

1.2. Code Cells#

Other cells contain code in the R language. Running a code cell will execute all of the code it contains.

To run the code in a cell, first click on that cell to activate it. When activated, it will be highlighted with a blue rectangle to its left. Next, either press Run ▶ or hold down the shift key and press return or enter. Try running the next cell:

print("Hi Sky!")
[1] "Hi Sky!"

The above code cell contains a single line of code, but cells can also contain multiple lines of code. When you run a cell, the lines of code are executed in the order in which they appear. Every print expression prints a line. Run the next cell and notice the order of the output.

print("First this line is printed,")
print("and then this one.")
[1] "First this line is printed,"
[1] "and then this one."

1.3. Writing Jupyter Notebooks#

You can use Jupyter notebooks for your own projects or documents. When you make your own notebook, you’ll need to create your own cells for text and code.

To add a cell, click the + button in the menu bar of this tab. The newly created cell will start out as a code cell. You can change it to a text cell by clicking inside it so it’s highlighted, clicking the drop-down box next to the restart and run all button (⏩) in the menu bar of this tab, and changing it from “Code” to “Markdown”.

Exercise#

Add a code cell below this one. Write code in it that prints out:

My name is ____!

Run your cell to verify that it works.

Next, add a text/Markdown cell and write the text “My name is ____!” in it.

1.4. Comments#

Below you see lines like this in code cells:

# Test cell; please do not change!

That is called a comment. It doesn’t make anything happen in R; R ignores anything on a line after a #. Instead, it’s there to communicate something about the code to you, the human reader. Comments are extremely useful and can help increase how readable our code is.

The below code cell contains comments (one at the start of a line, and one after some other code). Run the cell. You will see that everything after a comment symbol # is ignored by R.

# you can use comments to document your code, or make R ignore some code without deleting it entirely
# print("this is a commented line that R will ignore. You won't see this text in the output!")

print("hello!") # you can also put comments at the end of a line of code
[1] "hello!"

1.5. Errors#

R is a language, and like natural human languages, it has rules. It differs from natural language in two important ways:

  1. The rules are simple. You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.

  2. The rules are rigid. If you’re proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes. A computer running R code is not smart enough to do that.

Whenever you write code, you’ll make mistakes (everyone who writes code does, even your course instructor!). When you run a code cell that has errors, R will sometimes produce error messages to tell you what you did wrong.

Errors are totally okay; even experienced programmers make many errors. It’s a natural part of the coding process. When you make an error, you just have to find the source of the problem, fix it, and move on.

We have made an error in the next cell. Remove the # symbol below (i.e., uncomment the line), and then run the cell to see what happens.

print("This line is missing something.")
[1] "This line is missing something."

ws1_error_image.png

There’s a lot of terminology in programming languages, but you don’t need to know it all in order to program effectively. Even though the error message can seem cryptic, if you read it carefully you’ll often find hints as to what went wrong. For example, above, you’ll see the message unexpected end of input (among a lot of other technical jargon). In other words, R reached the end of the line of code, and wasn’t expecting to reach the end – it thinks there is still something missing!

Of course, even if you do your best to interpret the error message, sometimes you may get stuck figuring out what went wrong and how to fix it. In that case, ask someone.

Try to fix the code above so that you can run the cell and see the intended message instead of an error.

1.6 Saving your work#

It’s important to save your work often so you don’t lose your progress! At the top of the screen, go to the File menu, then Save Notebook. There are also keyboard shortcuts for saving your work: control + s on Windows, or command + s on Mac. Once you’ve saved your work, you will see a message at the bottom of the screen that says Saving completed.

1.7 Numbers#

Quantitative information arises everywhere in data science. In addition to representing commands to print out lines, our R code can represent numbers and methods of combining numbers. The expression 3.2500 evaluates to the number 3.25. (Run the cell and see.)

3.2500
3.25

Notice that we didn’t have to write print(). When you run a notebook cell, Jupyter helpfully prints out that value for you.

2
3
4
2
3
4

Above, you should see that the three numbers (2, 3, and 4) are printed out. When a cell contains several values, each on its own line, running the cell displays all of them in order.

1.8 Arithmetic#

The lines in the next cell perform some mathematical operations. Run them!

2.0 - 1.5 + 45
45.5
2 * 2
4
6/2
3
log(2)
0.693147180559945

Many basic arithmetic operations are built into R. This webpage describes many arithmetic operators that might be useful in this camp.
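Here is a small sketch of a few more built-in operations you might try. The particular numbers are made up for illustration.

```r
# Exponentiation: 2 raised to the power of 3
2^3        # 8

# Remainder after dividing 7 by 3
7 %% 3     # 1

# Integer division: how many whole times 3 fits into 7
7 %/% 3    # 2

# Square root is a function rather than an operator
sqrt(16)   # 4
```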

1.9 Names#

In natural language, we have terminology that lets us quickly reference very complicated concepts. We don’t say, “That’s a large mammal with brown fur and sharp teeth!” Instead, we just say, “Bear!”

Similarly, an effective strategy for writing code is to define names for data as we compute it, like a lawyer would define terms for complex ideas at the start of a legal document to simplify the rest of the writing.

In R, we do this with objects. An object has a name on the left side of an <- sign and an expression to be evaluated on the right.

answer <- 3*2 + 4
answer
10

When you run that cell, R first evaluates the first line. It computes the value of the expression 3 * 2 + 4, which is the number 10. Then it gives that value the name answer. The second line, answer, asks R to display the value bound to that name. At that point, the code in the cell is done running.

After you run that cell, the value 10 is bound to the name answer:

answer
10

Note: You can also use an = sign for assignment, although <- is the more common style in R.

When naming objects in R there are some rules:

  1. Names in R can have letters (upper- and lower-case letters are both okay and count as different letters e.g. “Answer” and “answer” will be treated as different objects), underscores, dots, and numbers.

  2. The first character can’t be a number (otherwise a name might look like a number) or an underscore.

  3. Names can’t contain spaces, since spaces are used to separate pieces of code from each other.
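The naming rules above can be tried out in a short sketch. The object names here are made up for illustration, and the lines that break the rules are commented out so the cell still runs.

```r
# Upper- and lower-case letters count as different letters,
# so these are two separate objects.
answer <- 10
Answer <- 20
answer == Answer   # FALSE

# Underscores, dots, and numbers are allowed after the first character.
answer_2 <- answer * 2
answer.3 <- answer * 3

# These lines would cause errors, so they are commented out:
# 2nd_answer <- 5    # starts with a number
# my answer <- 5     # contains a space
```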

Other than those rules, what you name something doesn’t matter to R. For example, the next cell runs without complaint even though its names say very little about what the values mean:

answer1 <- 840
b <- 2 * answer1
answer2 <- 12
d <- answer2 * b
d
20160

This cell also shows another common pattern: a series of lines in a single cell builds up a complex computation in stages, naming the intermediate results.

However, names are very important for making your code readable to yourself and others. The cell above is short, but it’s totally useless without an explanation of what it does.

There is also cultural style associated with different programming languages. In the modern R style, object names should use only lowercase letters, numbers, and _. Underscores (_) are typically used to separate words within a name (e.g., answer_one).

Exercise#

Question: How old will you be in 2050?

  • Create an object called age that holds your current age.

  • Compute how many years will pass until 2050 by subtracting the current year from 2050 (call this object years).

  • Finally, create an object called future_age that adds age and years together to get your age in 2050.

age <- 16
years <- 2050-2025
future_age <- age + years
future_age
41

1.10 Functions#

The most common way to combine or manipulate values in R is by calling functions. R comes with many built-in functions that perform common operations.

We used a function print() at the beginning of this notebook when we printed text from a code cell. Here we’ll demonstrate using another function toupper() that converts text to uppercase:

greeting <- toupper("Why, hello there!")
greeting
'WHY, HELLO THERE!'

Try using the function tolower now!

tolower(greeting)
'why, hello there!'

Some functions take multiple arguments, separated by commas. For example, the built-in max function returns the maximum argument passed to it.

biggest <- max(4,7,23,1,5)
biggest
23

Try to use the min function now to find the smallest number!

1.11 Packages#

R has many built-in functions, but we can also use functions that are stored within packages created by other R users. We are going to use a package, called tidyverse, to load, modify and plot data. This package has already been installed for you. Later in the course you will learn how to install packages so you are free to bring in other tools as you need them for your data analysis.

To use the functions from a package you first need to load it using the library function. This needs to be done once per notebook (and a good rule of thumb is to do this at the very top of your notebook so it is easy to see what packages your R code depends on).

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
 dplyr     1.1.4      readr     2.1.5
 forcats   1.0.0      stringr   1.5.1
 ggplot2   3.5.1      tibble    3.2.1
 lubridate 1.9.3      tidyr     1.3.1
 purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
 Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Note: it is normal and expected that a message is printed out after loading the tidyverse and some other packages. Generally, this message lets you know whether functions from different loaded packages share the same name (which is confusing to R), and if so, which one you can access using just its name and which one you need to refer to using both the package name and the function name. This is called masking. Additionally, the tidyverse is a special R package: it is a meta-package that bundles together several related and commonly used packages. Because of this, the message also lists the packages that it loads.
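As a small sketch of what masking means in practice, the cell below uses the two filter functions named in the message. It assumes the dplyr package is installed and uses mtcars, a small example data set that comes with R. The object names are made up for illustration.

```r
library(dplyr)

# After loading dplyr, the plain name filter() refers to dplyr's version,
# which keeps the rows of a data frame that meet a condition.
four_cyl <- filter(mtcars, cyl == 4)
nrow(four_cyl)

# The masked function from the stats package is still available
# by writing the package name, two colons, and the function name.
moving_avg <- stats::filter(1:5, rep(1/3, 3))
moving_avg
```

Writing dplyr::filter() also works, and makes it explicit which version you mean.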

1.12 Looking for Help#

No one, not even experienced professional programmers, remembers what every function does or every possible function argument/option. So both experienced and new programmers (like you!) need to look things up, A LOT!

One of the most efficient places to look for help on how a function works is the R help files. Let’s say we wanted to pull up the help file for the toupper() function. We can do this by typing a question mark in front of the function we want to know more about. Remove the hashtag and run the cell below to find out more about toupper().

# ?toupper

At the very top of the file, you will see the function itself and the package it is in (in this case, it is base). Next is a description of what the function does. You’ll find that the most helpful sections on this page are “Usage”, “Arguments” and “Examples”.

  • Usage gives you an idea of how you would use the function when coding–what the syntax would be and how the function itself is structured.

  • Arguments tells you the different parts that can be added to the function to make it more simple or more complicated. Often the “Usage” and “Arguments” sections don’t provide you with step by step instructions, because there are so many different ways that a person can incorporate a function into their code. Instead, they provide users with a general understanding as to what the function could do and parts that could be added. At the end of the day, the user must interpret the help file and figure out how best to use the functions and which parts are most important to include for their particular task.

  • The Examples section is often the most useful part of the help file as it shows how a function could be used with real data. It provides skeleton code that users can build on.

Beyond the R help files there are many resources that you can use to find help. Stack Overflow, an online forum, is a great place to go and ask questions such as how to perform a complicated task in R or why a specific error message is popping up. Oftentimes, a previous user will have already asked your question of interest and received helpful advice from fellow R users.

2. Introduction to Data Science#

What is data science exactly?#

Data science is the use of reproducible and auditable processes to obtain value (i.e., insight) from data.

Every good data analysis begins with a question that you aim to answer using data. As it turns out, there are a number of different types of questions regarding data:

  • descriptive

  • exploratory

  • predictive

  • inferential

  • causal

  • mechanistic.

Note: In this camp, we will focus on the first 3 types of questions.

  • Descriptive: a question that asks about summarized characteristics of a data set without interpretation (i.e., report a fact, describe characteristics).

  • Exploratory: a question that asks whether there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study (discovery of ideas and thoughts).

  • Inferential: a question that asks whether an association observed in your exploratory analysis holds in a different sample that is representative of a population.

  • Predictive: a question that asks what predicts an outcome, e.g., what predicts the grade someone will achieve in a certain class.

  • Causal: a question that asks whether changing one factor will change another factor.

  • Mechanistic: a question that asks how one factor changes another, e.g., how a medication leads to a reduction in the number of illnesses.

Why are we using programming languages to do data analysis?#

  • There are many advantages to using R (or another language, like Python or Julia):

    • statistical analysis functions that go beyond Excel

    • free and open-source

    • transparent & reproducible code

    • can handle large amounts of data and complex analyses

  • Using a programming language is like baking with a recipe:

    • Ingredients = data

      https://www.thespruceeats.com/thmb/FYR4bNLrj304CEaE2aSGPYzygzY=/4680x2632/smart/filters:no_upscale()/greek-butter-cookies-1705307-step-01-5bfef717c9e77c00510e3bf9.jpg
    • Recipe = code

      https://cdn.pixabay.com/photo/2014/12/21/23/28/recipe-575434_640.png
  • Someone else can use your recipe (code) to bake the same cake (produce the same data analyses)

  • Spreadsheets in Excel make it very difficult to understand where results came from

In the data science workflow (source: Grolemund & Wickham, R for Data Science)

2.1 Reading in Tabular Data#

Loading/importing data#

  • 4 most common ways to do this in Data Science

    1. read in a text file with data in a spreadsheet format

    2. read from a database (e.g., SQLite, PostgreSQL)

    3. scrape data from the web

    4. use a web API (Application Programming Interface) to read data from a website

We will focus on option 1!

It is important to read in data carefully and check results after! This will help reduce bugs and speed up your analyses down the road… think of it as tying your shoes before you run; not exciting, but if done wrong it will trip you up later!

Different ways to locate a file / dataset#

Local (on your computer)

  • An absolute path locates a file with respect to the “root” folder on a computer

    • starts with /

      e.g. /home/instructor/documents/timesheet.xlsx

  • A relative path locates a file relative to your working directory

    • doesn’t start with /

      e.g. documents/timesheet.xlsx
      (where working directory is /home/instructor/)
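The relationship between the two kinds of path can be sketched in R. This is only an illustration: in a real session, getwd() reports the working directory, while here it is written out by hand to match the example above.

```r
# A relative path, taken from the example above
relative_path <- "documents/timesheet.xlsx"

# Assumed working directory, as in the example
working_directory <- "/home/instructor"

# Joining the two gives the absolute path to the same file
absolute_path <- file.path(working_directory, relative_path)
absolute_path
# "/home/instructor/documents/timesheet.xlsx"

# An absolute path starts with "/", a relative path does not
startsWith(absolute_path, "/")   # TRUE
startsWith(relative_path, "/")   # FALSE
```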

Remote (on the web)

via “URL” that starts with http:// or https://

http://traffic.libsyn.com/mbmbam/MyBrotherMyBrotherandMe367.mp3

Absolute vs relative paths: Which should you use?

  • Generally to ensure your code can be run on a different computer, you should use relative paths

  • e.g. Alice is working inside the folder /home/alice/project/. To specify where to load data from in her Jupyter notebook, she uses the absolute path /home/alice/project/data/happiness_report.csv. What issue will arise when she shares the notebook with her collaborator Keeran who tries to read in the data on their computer?

/home/keeran/project/data/happiness_report.csv

What relative data path could they use to collaborate more effectively?

data/happiness_report.csv

  • Even though Alice and Keeran stored their files in the same place on their computers (in their home folders), the absolute paths are different due to their different usernames.

  • If Alice has code that loads the happiness_report.csv data using an absolute path, the code won’t work on Keeran’s computer. But the relative path from inside the project folder (data/happiness_report.csv) is the same on both computers; any code that uses relative paths will work on both.

  • Absolute paths are like GPS coordinates: they take you to one specific location regardless of where you are starting from. Relative paths are like directions: they are based on your starting point (e.g., go two blocks north and then one west).
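The Alice and Keeran example can be sketched in R as follows. The folder names come from the example above, the object names are made up, and no files are actually read.

```r
# The same relative path is written in the shared notebook
relative_path <- "data/happiness_report.csv"

# Each person's project folder (their working directory)
alice_folder  <- "/home/alice/project"
keeran_folder <- "/home/keeran/project"

# The relative path resolves to a different absolute path on each computer
alice_file  <- file.path(alice_folder, relative_path)
keeran_file <- file.path(keeran_folder, relative_path)
alice_file
keeran_file

# The absolute paths differ, which is why hard-coding one of them
# would break the notebook on the other computer.
alice_file == keeran_file   # FALSE
```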

A data set is a structured collection of numbers and characters. Aside from that, there are really no strict rules; data sets can come in many different forms! Perhaps the most common form of data set that you will find in the wild, however, is tabular data. Think spreadsheets in Microsoft Excel: tabular data are rectangular-shaped and spreadsheet-like.

When we load tabular data into R, it is represented as a data frame object. We refer to the rows as observations and columns as variables.

The main kind of data file that we will learn how to load into R as a data frame is the comma-separated values format (.csv for short). These files have names ending in .csv, and can be opened and saved using common spreadsheet programs like Microsoft Excel and Google Sheets.

To load data into R so that we can do things with it (e.g., perform analyses or create data visualizations), we will need to use a function. A function is a special word in R that takes instructions (we call these arguments) and does something. The function we will use to load a .csv file into R is called read_csv (made accessible by loading the tidyverse R package). In its most basic use-case, read_csv expects that the data file:

  • has column names (or headers)

  • uses a comma (,) to separate the columns

  • does not have row names

Please note that data comes in many forms and there are a wide variety of functions and approaches to loading in your data, but in this camp we will be focusing on reading in tabular data using read_csv.
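To see what read_csv expects, here is a minimal sketch that reads a tiny made-up CSV written out as text instead of a file on disk. The account names and numbers are invented for illustration. It assumes readr version 2.0 or later, where wrapping text in I() tells read_csv to treat it as the file contents.

```r
library(readr)

# A tiny made-up CSV: the first line holds the column names,
# and commas separate the columns. There are no row names.
csv_text <- "name,followers_millions\naccount_a,580.1\naccount_b,519.9\n"

# I() marks the text as literal data rather than a path to a file
toy <- read_csv(I(csv_text), show_col_types = FALSE)
toy
```

The result is a data frame with two rows (observations) and two columns (variables).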

We will now look at an Instagram data set focusing on 200 of the most popular Instagram accounts. Let’s try reading it in with read_csv using a relative path!

insta <- read_csv('data/insta.csv')
Rows: 200 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): name, channel_Info, Category, Posts, Followers, Avg. Likes, Eng Rate
dbl (1): rank

 Use `spec()` to retrieve the full column specification for this data.
 Specify the column types or set `show_col_types = FALSE` to quiet this message.
insta <- insta |> select(-'Avg. Likes')

head(insta)
A tibble: 6 × 7
rank   name         channel_Info  Category                  Posts  Followers  Eng Rate
<dbl>  <chr>        <chr>         <chr>                     <chr>  <chr>      <chr>
1      instagram    brand         photography               7.3K   580.1M     0.1%
2      cristiano    male          Health, Sports & Fitness  3.4K   519.9M     1.4%
3      leomessi     male          Health, Sports & Fitness  1K     403.7M     1.7%
4      kyliejenner  female        entertainment             7K     375.9M     1.7%
5      selenagomez  female        entertainment             1.8K   365.3M     1.1%
6      therock      male          entertainment             7K     354.3M     0.3%

Run the following code chunk before continuing to rename/reformat some of the variables:

insta <- suppressWarnings(insta |>
    mutate(Followers = as.numeric(str_replace(Followers, "M", "")) * 1e6) |>
    rename(Channel = channel_Info, eng_rate = 'Eng Rate') |>
    mutate(
        Posts = case_when(
            str_detect(Posts, "K") ~ as.numeric(str_replace(Posts, "K", "")) * 1000,
            str_detect(Posts, "M") ~ as.numeric(str_replace(Posts, "M", "")) * 1000000,
            TRUE ~ as.numeric(Posts)  # If no suffix, just convert to numeric
        )
    ) |>
    mutate(eng_rate = as.numeric(str_replace(eng_rate, "%", ""))) |>
    mutate(Category = if_else(is.na(Category), "Not Available", Category)) |>
    mutate(Channel = if_else(is.na(Channel), "Not Available", Channel)))

It looks like this data set has 200 rows (observations) representing the top 200 Instagram accounts and the following 7 columns (variables):

  • rank (1-200)

  • name (Instagram handle)

  • Channel (brief description of the account)

  • Category

  • Posts

  • Followers

  • eng_rate (the account’s engagement rate: the total number of likes and comments received divided by the total number of followers, expressed as a percentage).

Without looking at the entire data set, we can find out attributes of our data such as the number of rows, columns, or overall dimension.

# Number of rows
nrow(insta)

# Number of columns
ncol(insta)

# Dimension 
dim(insta)
200
7
  1. 200
  2. 7

2.2 Data Wrangling!#

https://raw.githubusercontent.com/allisonhorst/stats-illustrations/main/rstats-artwork/dplyr_wrangling.png

The cartoon illustrations are by Allison Horst

  • In the real world, when you get data, it’s usually very messy

    • inconsistent format (commas, tabs, semicolons, missing data, extra empty lines)

    • split into multiple files (e.g. yearly recorded data over many years)

    • corrupted files, custom formats

  • when you read it successfully into R, it will often still be very messy

  • you need to make your data “tidy”

What is Tidy Data?#

https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_1.jpg

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

https://github.com/allisonhorst/stats-illustrations/raw/main/rstats-artwork/tidydata_2.jpg

True or False: The Instagram data set is tidy#

True!

  • each row corresponds to a single observation,

  • each column corresponds to a single variable, and

  • each cell (row, column pair) corresponds to a single value
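For contrast, here is a sketch of a small made-up table that is not tidy, and one way to tidy it. The account names and numbers are invented, and pivot_longer (from the tidyr package, part of the tidyverse) is just one possible tool for the job.

```r
library(tidyr)
library(tibble)

# A made-up "wide" table: the year, which is a variable,
# is spread across the column names, so this table is not tidy.
wide <- tibble(
  name   = c("account_a", "account_b"),
  `2023` = c(10, 20),
  `2024` = c(15, 25)
)

# pivot_longer() gathers the year columns so that each row is one
# account in one year, and each column is a single variable.
tidy <- pivot_longer(
  wide,
  cols      = c(`2023`, `2024`),
  names_to  = "year",
  values_to = "followers"
)
tidy
```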

Tools for tidying and wrangling data#

  • tidyverse package functions from:

    • dplyr package (select, filter, mutate, group_by, summarize)

2.2.1 Mutate#

The mutate function adds new columns or transforms existing ones.

e.g. convert the engagement rate from a percentage to a decimal

head(mutate(insta, eng_rate = eng_rate / 100)) # head function returns only the first 6 rows
A tibble: 6 × 7
ranknameChannelCategoryPostsFollowerseng_rate
<dbl><chr><chr><chr><dbl><dbl><dbl>
1instagram brand photography 73005801000000.001
2cristiano male Health, Sports & Fitness34005199000000.014
3leomessi male Health, Sports & Fitness10004037000000.017
4kyliejennerfemaleentertainment 70003759000000.017
5selenagomezfemaleentertainment 18003653000000.011
6therock male entertainment 70003543000000.003

Note: The above creates a new data frame; it does not change the original insta data frame. To keep the result, we would need to assign it to an object with <-.
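The difference between running mutate and saving its result can be sketched with a small made-up table. The names toy and toy_decimal are invented for illustration.

```r
library(dplyr)

# A made-up table with an engagement rate stored as a percentage
toy <- tibble(
  name     = c("account_a", "account_b"),
  eng_rate = c(1.4, 0.3)
)

# This displays a converted copy, but toy itself is not changed
mutate(toy, eng_rate = eng_rate / 100)
toy$eng_rate

# Assigning the result to a name keeps the converted version
toy_decimal <- mutate(toy, eng_rate = eng_rate / 100)
toy_decimal$eng_rate
```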

2.2.2 Select#

The select function is used to select a subset of columns (variables) from a dataframe.

?select
select {dplyr}R Documentation

Keep or drop columns using their names and types

Description

Select (and optionally rename) variables in a data frame, using a concise mini-language that makes it easy to refer to variables based on their name (e.g. a:f selects all columns from a on the left to f on the right) or type (e.g. where(is.numeric) selects all numeric columns).

Overview of selection features

Tidyverse selections implement a dialect of R where operators make it easy to select variables:

  • : for selecting a range of consecutive variables.

  • ! for taking the complement of a set of variables.

  • & and | for selecting the intersection or the union of two sets of variables.

  • c() for combining selections.

In addition, you can use selection helpers. Some helpers select specific columns:

  • everything(): Matches all variables.

  • last_col(): Select last variable, possibly with an offset.

  • group_cols(): Select all grouping columns.

Other helpers select variables by matching patterns in their names:

  • starts_with(): Starts with a prefix.

  • ends_with(): Ends with a suffix.

  • contains(): Contains a literal string.

  • matches(): Matches a regular expression.

  • num_range(): Matches a numerical range like x01, x02, x03.

Or from variables stored in a character vector:

  • all_of(): Matches variable names in a character vector. All names must be present, otherwise an out-of-bounds error is thrown.

  • any_of(): Same as all_of(), except that no error is thrown for names that don't exist.

Or using a predicate function:

  • where(): Applies a function to all variables and selects those for which the function returns TRUE.

Usage

select(.data, ...)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details.

...

<tidy-select> One or more unquoted expressions separated by commas. Variable names can be used as if they were positions in the data frame, so expressions like x:y can be used to select a range of variables.

Value

An object of the same type as .data. The output has the following properties:

  • Rows are not affected.

  • Output columns are a subset of input columns, potentially with a different order. Columns will be renamed if new_name = old_name form is used.

  • Data frame attributes are preserved.

  • Groups are maintained; you can't select off grouping variables.

Methods

This function is a generic, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages: dplyr (data.frame).

Examples

Here we show the usage for the basic selection operators. See the specific help pages to learn about helpers like starts_with().

The selection language can be used in functions like dplyr::select() or tidyr::pivot_longer(). Let's first attach the tidyverse:

library(tidyverse)

# For better printing
iris <- as_tibble(iris)

Select variables by name:

starwars %>% select(height)
#> # A tibble: 87 x 1
#>   height
#>    <int>
#> 1    172
#> 2    167
#> 3     96
#> 4    202
#> # i 83 more rows

iris %>% pivot_longer(Sepal.Length)
#> # A tibble: 150 x 6
#>   Sepal.Width Petal.Length Petal.Width Species name         value
#>         <dbl>        <dbl>       <dbl> <fct>   <chr>        <dbl>
#> 1         3.5          1.4         0.2 setosa  Sepal.Length   5.1
#> 2         3            1.4         0.2 setosa  Sepal.Length   4.9
#> 3         3.2          1.3         0.2 setosa  Sepal.Length   4.7
#> 4         3.1          1.5         0.2 setosa  Sepal.Length   4.6
#> # i 146 more rows

Select multiple variables by separating them with commas. Note how the order of columns is determined by the order of inputs:

starwars %>% select(homeworld, height, mass)
#> # A tibble: 87 x 3
#>   homeworld height  mass
#>   <chr>      <int> <dbl>
#> 1 Tatooine     172    77
#> 2 Tatooine     167    75
#> 3 Naboo         96    32
#> 4 Tatooine     202   136
#> # i 83 more rows

Functions like tidyr::pivot_longer() don't take variables with dots. In this case use c() to select multiple variables:

iris %>% pivot_longer(c(Sepal.Length, Petal.Length))
#> # A tibble: 300 x 5
#>   Sepal.Width Petal.Width Species name         value
#>         <dbl>       <dbl> <fct>   <chr>        <dbl>
#> 1         3.5         0.2 setosa  Sepal.Length   5.1
#> 2         3.5         0.2 setosa  Petal.Length   1.4
#> 3         3           0.2 setosa  Sepal.Length   4.9
#> 4         3           0.2 setosa  Petal.Length   1.4
#> # i 296 more rows

Operators:

The : operator selects a range of consecutive variables:

starwars %>% select(name:mass)
#> # A tibble: 87 x 3
#>   name           height  mass
#>   <chr>           <int> <dbl>
#> 1 Luke Skywalker    172    77
#> 2 C-3PO             167    75
#> 3 R2-D2              96    32
#> 4 Darth Vader       202   136
#> # i 83 more rows

The ! operator negates a selection:

starwars %>% select(!(name:mass))
#> # A tibble: 87 x 11
#>   hair_color skin_color  eye_color birth_year sex   gender    homeworld species
#>   <chr>      <chr>       <chr>          <dbl> <chr> <chr>     <chr>     <chr>  
#> 1 blond      fair        blue            19   male  masculine Tatooine  Human  
#> 2 <NA>       gold        yellow         112   none  masculine Tatooine  Droid  
#> 3 <NA>       white, blue red             33   none  masculine Naboo     Droid  
#> 4 none       white       yellow          41.9 male  masculine Tatooine  Human  
#> # i 83 more rows
#> # i 3 more variables: films <list>, vehicles <list>, starships <list>

iris %>% select(!c(Sepal.Length, Petal.Length))
#> # A tibble: 150 x 3
#>   Sepal.Width Petal.Width Species
#>         <dbl>       <dbl> <fct>  
#> 1         3.5         0.2 setosa 
#> 2         3           0.2 setosa 
#> 3         3.2         0.2 setosa 
#> 4         3.1         0.2 setosa 
#> # i 146 more rows

iris %>% select(!ends_with("Width"))
#> # A tibble: 150 x 3
#>   Sepal.Length Petal.Length Species
#>          <dbl>        <dbl> <fct>  
#> 1          5.1          1.4 setosa 
#> 2          4.9          1.4 setosa 
#> 3          4.7          1.3 setosa 
#> 4          4.6          1.5 setosa 
#> # i 146 more rows

& and | take the intersection or the union of two selections:

iris %>% select(starts_with("Petal") & ends_with("Width"))
#> # A tibble: 150 x 1
#>   Petal.Width
#>         <dbl>
#> 1         0.2
#> 2         0.2
#> 3         0.2
#> 4         0.2
#> # i 146 more rows

iris %>% select(starts_with("Petal") | ends_with("Width"))
#> # A tibble: 150 x 3
#>   Petal.Length Petal.Width Sepal.Width
#>          <dbl>       <dbl>       <dbl>
#> 1          1.4         0.2         3.5
#> 2          1.4         0.2         3  
#> 3          1.3         0.2         3.2
#> 4          1.5         0.2         3.1
#> # i 146 more rows

To take the difference between two selections, combine the & and ! operators:

iris %>% select(starts_with("Petal") & !ends_with("Width"))
#> # A tibble: 150 x 1
#>   Petal.Length
#>          <dbl>
#> 1          1.4
#> 2          1.4
#> 3          1.3
#> 4          1.5
#> # i 146 more rows

See Also

Other single table verbs: arrange(), filter(), mutate(), reframe(), rename(), slice(), summarise()


[Package dplyr version 1.1.4 ]
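The help file uses the starwars and iris data sets. As a sketch closer to this camp, here is select applied to a small made-up table that borrows a few column names from the Instagram data. The values and the name toy are invented for illustration.

```r
library(dplyr)

# A made-up table with a few of the same column names as the Instagram data
toy <- tibble(
  rank      = c(1, 2),
  name      = c("account_a", "account_b"),
  Posts     = c(7300, 3400),
  Followers = c(580100000, 519900000)
)

# Keep only the name and Followers columns
select(toy, name, Followers)

# Drop the rank column and keep everything else
no_rank <- select(toy, -rank)
names(no_rank)
```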