Intro to R Programming - Lesson 3 (Data Management)

Introduction

Welcome back! In the first two lessons, we explored what R is all about and how to get data into R. If you’ve followed along, you should now feel comfortable creating datasets, importing them from files, and understanding the basic data structures like vectors, matrices, and data frames. That’s a big achievement, because without clean and structured data, no analysis or fancy machine learning model will work properly.

In this lesson, we’re diving into one of the most important (and sometimes most frustrating) aspects of working with data: data management. Think of this as the part where you tidy up your messy desk before starting work. You don’t want to analyze a dataset with missing values scattered all over, weird codes like -99 representing “unknown,” or cryptic column names that nobody understands. If you’ve ever had to clean up someone else’s Excel sheet before, you know what I mean.

The goal here is to show you how R makes data management easier and more reproducible than working by hand in Excel. We’ll walk through creating new variables, recoding and renaming columns, dealing with missing values, working with dates, and more. Along the way, I’ll show you both base R methods and modern tidyverse approaches (like dplyr). By the end, you’ll have a toolkit for preparing your data in a way that’s clean, consistent, and ready for analysis or visualization.

A Running Example

Throughout this lesson, we’ll use the classic mtcars dataset. It comes with R and contains fuel consumption and specifications for 32 cars. Let’s take a peek.

head(mtcars)

This dataset has columns like mpg (miles per gallon), hp (horsepower), wt (weight in 1000 lbs), and am (0 = automatic, 1 = manual transmission). It’s small, but it’s great for demonstrations. Later, I’ll also show you how to simulate missing values and merge with another dataset to make things more realistic.

⸻

Creating New Variables

One of the most common tasks in data management is creating new variables from existing ones. For example, what if we wanted to know the power-to-weight ratio of each car? That’s horsepower divided by weight.

mtcars$power_to_weight <- mtcars$hp / mtcars$wt
head(mtcars[, c("hp", "wt", "power_to_weight")])

Notice how easy that was: just a simple arithmetic operation and we’ve created a new column in the data frame. You can use this same idea for anything: ratios, percentages, log transformations, or even combining text variables.
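The other transformations mentioned here follow the same pattern. Here is a minimal sketch that works on a fresh copy of the built-in data, so the running example stays untouched. The names mt, log_hp, wt_share_pct, and label are made up for illustration.

```r
# Work on a copy so the lesson's mtcars object is left as it is
mt <- datasets::mtcars

# Log transformation of horsepower (natural log)
mt$log_hp <- log(mt$hp)

# Percentage: each car's weight as a share of the total weight of all cars
mt$wt_share_pct <- 100 * mt$wt / sum(mt$wt)

# Combining text: build a label from the row names and the cylinder count
mt$label <- paste0(rownames(mt), " (", mt$cyl, " cyl)")

head(mt[, c("log_hp", "wt_share_pct", "label")], 3)
```

Every one of these is a vectorized operation, so a single line fills the whole column without a loop.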

⸻

Recoding Variables

Raw data often comes with cryptic codes. For instance, the am variable in mtcars is coded as 0 for automatic and 1 for manual. That’s fine for the computer, but humans prefer readable labels.

We can recode this variable using ifelse:

mtcars$transmission <- ifelse(mtcars$am == 0, "Automatic", "Manual")
table(mtcars$transmission)

Now we have a nice categorical variable that makes sense when we plot or summarize. In real datasets, you might see codes like -9 for missing, or 1/2/3 for “Agree / Neutral / Disagree.” Recoding those into meaningful labels is one of the first cleanup steps you should do.
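For codes with more than two values, factor() is often a better fit than nested ifelse() calls. The sketch below uses a made-up vector of survey answers (the data and the name opinion are hypothetical) to show one way to handle both the -9 sentinel and the 1/2/3 codes.

```r
# Hypothetical survey answers: 1 = Agree, 2 = Neutral, 3 = Disagree, -9 = missing
answers <- c(1, 3, -9, 2, 1, -9, 3)

# Step 1: turn the sentinel code into a real missing value
answers[answers == -9] <- NA

# Step 2: attach readable labels; factor() keeps the level order we specify
opinion <- factor(answers,
                  levels = c(1, 2, 3),
                  labels = c("Agree", "Neutral", "Disagree"))

# useNA = "ifany" also counts the missing answers
table(opinion, useNA = "ifany")
```

Converting the sentinel to NA first matters: otherwise -9 would be treated as a real number in any mean or sum.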

⸻

Renaming Variables

Sometimes datasets come with terrible column names like var1, var2, or worse, long phrases with spaces. You can rename columns in base R by assigning to names(mtcars), but the dplyr package makes it more convenient. Note that rename() takes its arguments in the order new_name = old_name.

library(dplyr)
mtcars <- rename(mtcars, miles_per_gallon = mpg)
head(mtcars)

Now instead of remembering that mpg means miles per gallon, you have a clear column name.
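For comparison, the base R route works by assigning into names(). This sketch starts from a fresh copy (called mt here, a name chosen only for illustration) so it does not depend on the rename above.

```r
# Fresh copy with the original column names
mt <- datasets::mtcars

# Find the column called "mpg" and overwrite just that name
names(mt)[names(mt) == "mpg"] <- "miles_per_gallon"

names(mt)[1:3]
```

Matching by name rather than by position means the code still works if the column order changes.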

⸻

Handling Missing Values

Missing data is everywhere. Maybe a survey respondent skipped a question, or a sensor malfunctioned during data collection. In R, missing values are represented as NA.

Let’s simulate missing values:

mtcars$hp[c(3, 7)] <- NA
summary(mtcars$hp)

We can drop rows with missing values:

mtcars_no_na <- na.omit(mtcars)

Or we can replace missing values with something like the mean:

# Replace each NA in hp with the mean of the observed hp values
mtcars$hp <- ifelse(is.na(mtcars$hp), mean(mtcars$hp, na.rm = TRUE), mtcars$hp)

The right choice depends on your context. Dropping rows might bias results if missingness is systematic, while imputing with the mean shrinks the column’s variance and can distort its relationship with other variables. The important thing is that R gives you flexible options.
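Before choosing, it helps to see how much is actually missing. The sketch below repeats the simulated gaps on a fresh copy and uses the median as one possible alternative to the mean. The names mt, na_counts, and hp_filled are illustrative only.

```r
mt <- datasets::mtcars
mt$hp[c(3, 7)] <- NA          # same simulated gaps as above

# How many values are missing in each column?
na_counts <- colSums(is.na(mt))
na_counts[na_counts > 0]

# How many rows are fully observed?
complete <- complete.cases(mt)
sum(complete)

# Alternative imputation: the median is less sensitive to extreme values
mt$hp_filled <- mt$hp
mt$hp_filled[is.na(mt$hp_filled)] <- median(mt$hp, na.rm = TRUE)
```

Writing the filled values to a new column keeps the original gaps visible, which makes it easy to compare results with and without imputation.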

⸻

Working with Dates

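Dates and times are notoriously tricky. Luckily, R has good support for them. You can use as.Date() to convert strings into dates; by default it expects year-month-day strings such as "1995-10-23", and other layouts need a format argument.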

today <- Sys.Date()
birthday <- as.Date("1995-10-23")
age_in_days <- today - birthday
age_in_days

There are also great packages like lubridate that make date handling much more pleasant (e.g., parsing messy formats, extracting years, months, weekdays).
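Base R can already cover a fair amount of this without extra packages. Here is a small sketch, assuming a date written as day/month/year; the fixed end date is arbitrary and only there to keep the result reproducible.

```r
# A date written day/month/year needs an explicit format string
birthday <- as.Date("23/10/1995", format = "%d/%m/%Y")

# Pull out components with format(); the results are character strings
format(birthday, "%Y")   # four-digit year
format(birthday, "%m")   # two-digit month
format(birthday, "%u")   # weekday number, 1 = Monday

# Differences between dates are difftime objects; convert to a plain number
days_alive <- as.numeric(difftime(as.Date("2000-01-01"), birthday, units = "days"))
days_alive
```

Using format codes rather than weekdays() or months() avoids output that changes with the computer’s language settings.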

⸻

Type Conversions

Sometimes your numbers come in as text (like “42”) or categorical data comes in as numbers. You’ll need to convert them.

x <- c("1", "2", "3")
as.numeric(x)

Or maybe you want to turn a numeric variable into a factor:

mtcars$cyl <- as.factor(mtcars$cyl)
str(mtcars$cyl)

This becomes especially useful when running models, since R treats factors as categorical predictors.
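One conversion deserves extra care: going from a factor back to numbers. The sketch below uses made-up values to show the common pitfall and one way around it.

```r
# Numbers that arrived as a factor (for example after a careless import)
f <- factor(c("10", "20", "10", "35"))

as.numeric(f)                 # internal level codes: 1 2 1 3
as.numeric(as.character(f))   # the actual values: 10 20 10 35

# Text that is not a number becomes NA (R also raises a warning)
mixed <- c("42", "n/a", "7")
nums <- suppressWarnings(as.numeric(mixed))
nums
```

Going through as.character() first recovers the labels instead of the level codes.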

⸻

Sorting Data

Sorting is straightforward with order() in base R or arrange() in dplyr.

arrange(mtcars, desc(miles_per_gallon))

This is much cleaner than messing around with Excel filters.
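For readers who want the base R version, here is a sketch using order(). It starts from a fresh copy of the data, so the column is still called mpg here; the object names are illustrative.

```r
mt <- datasets::mtcars   # fresh copy, so the column is still called mpg

# order() returns row positions; use them to index the data frame
by_mpg_desc <- mt[order(mt$mpg, decreasing = TRUE), ]
head(by_mpg_desc[, c("mpg", "cyl")], 3)

# Several keys: cylinders ascending, then mpg descending within each group
by_cyl_mpg <- mt[order(mt$cyl, -mt$mpg), ]
head(by_cyl_mpg[, c("cyl", "mpg")], 3)
```

Both approaches return a sorted copy; the original data frame keeps its order unless you assign the result back.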

⸻

Merging Datasets

In real life, you’ll rarely get all your data in one neat file. You might have customer demographics in one table and purchase history in another. That means merging.

df1 <- data.frame(id=1:3, score=c(90,85,88))
df2 <- data.frame(id=2:4, grade=c("A","B","A"))
merge(df1, df2, by="id", all=TRUE)

Setting all=TRUE requests a full outer join: every id from either table is kept (1 through 4), with NA in grade for id 1 and NA in score for id 4.
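The other join types come from the same merge() function, using its all.x and all.y arguments. A short sketch with the same two tables:

```r
df1 <- data.frame(id = 1:3, score = c(90, 85, 88))
df2 <- data.frame(id = 2:4, grade = c("A", "B", "A"))

inner <- merge(df1, df2, by = "id")                # only ids in both: 2, 3
left  <- merge(df1, df2, by = "id", all.x = TRUE)  # every row of df1: 1, 2, 3
right <- merge(df1, df2, by = "id", all.y = TRUE)  # every row of df2: 2, 3, 4
```

It is worth checking nrow() after a merge: an unexpected row count is usually the first sign of duplicated or unmatched keys.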

⸻

Subsetting Data

Sometimes you only want part of the data. Maybe just the cars with more than 25 miles per gallon and 4 cylinders.

subset(mtcars, miles_per_gallon > 25 & cyl == 4)

You can also keep just the columns you need:

mtcars_small <- mtcars[, c("miles_per_gallon", "hp", "wt")]
head(mtcars_small)
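Row filtering with square brackets behaves differently from subset() when the condition involves missing values. The sketch below uses a tiny made-up data frame to show the difference, plus one way to drop a column by name.

```r
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))

# Bracket indexing keeps a row of NAs where the condition is NA
bracket_rows <- df[df$x > 1, ]

# subset() and which() both drop rows where the condition is NA
subset_rows <- subset(df, x > 1)
which_rows  <- df[which(df$x > 1), ]

# Dropping a column by name; drop = FALSE keeps the result a data frame
without_y <- df[, names(df) != "y", drop = FALSE]
```

If a filtered data frame suddenly contains rows full of NA, this bracket behavior is the usual cause.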

⸻

Data Manipulation with dplyr

Modern R users love dplyr because it makes common operations concise and readable. The %>% operator (pipe) passes the result on its left as the first argument of the function on its right, so you can chain commands like a sentence.

mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(miles_per_gallon))

This reads like English: “Take mtcars, group by cylinder, and calculate the average miles per gallon.”
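Longer chains work the same way. The sketch below combines several of the steps from this lesson into one pipeline. It assumes dplyr is installed and starts from a fresh copy of the data, so the original column name mpg is used; the summary column names are illustrative.

```r
library(dplyr)

mt <- datasets::mtcars   # fresh copy with the original column names

by_cyl <- mt %>%
  filter(!is.na(hp)) %>%                 # guard against missing horsepower
  mutate(power_to_weight = hp / wt) %>%  # derived column, as earlier in the lesson
  group_by(cyl) %>%
  summarise(
    n_cars  = n(),                       # rows per group
    avg_mpg = mean(mpg),
    avg_ptw = mean(power_to_weight)
  ) %>%
  arrange(desc(avg_mpg))

by_cyl
```

Each step takes a data frame and returns a data frame, which is what makes the chain easy to extend or reorder.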

⸻

Using SQL in R

If you’re coming from a database background, you might find it easier to write SQL queries directly. The sqldf package lets you do just that.

library(sqldf)
sqldf("SELECT cyl, AVG(miles_per_gallon) as avg_mpg FROM mtcars GROUP BY cyl")

This is especially nice when working with colleagues who already know SQL but not R.
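Filtering and ordering carry over as well. Here is a sketch that assumes sqldf is installed and uses a fresh copy named mt (an illustrative name), so the original column name mpg applies.

```r
library(sqldf)

mt <- datasets::mtcars   # sqldf looks the data frame up by its R name

res <- sqldf("
  SELECT cyl,
         COUNT(*) AS n_cars,
         AVG(mpg) AS avg_mpg
  FROM mt
  WHERE hp > 100
  GROUP BY cyl
  ORDER BY avg_mpg DESC
")
res
```

The result is an ordinary data frame, so it can be passed straight into the base R or dplyr functions covered above.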

⸻

Summary

Data management is often less glamorous than modeling or visualization, but it’s arguably the most critical step. If your dataset is messy, your results will be unreliable. In this lesson, we covered a lot: creating and recoding variables, renaming, handling missing values, working with dates, type conversions, sorting, merging, subsetting, and using both dplyr and SQL for manipulation.

If you practice these techniques, you’ll find yourself spending less time struggling with messy spreadsheets and more time actually doing analysis. And trust me, future-you will thank present-you for writing clean, reproducible code instead of clicking around in Excel at 2 AM.

Next up in Lesson 4: Getting Started with Graphs. This is where things get visual and fun!

⸻

Quiz Time

Here are a few questions to check your understanding. Try to answer them before peeking at the solutions.
1. How do you create a new variable in R that is the ratio of horsepower to weight in the mtcars dataset?
2. What function would you use to rename the variable mpg to miles_per_gallon using dplyr?
3. What symbol does R use for missing values?
4. How can you replace missing values in a column with the mean of that column?
5. Write a command using dplyr to group the mtcars dataset by cyl and calculate the average horsepower.
6. Which package allows you to run SQL queries directly on data frames?
7. What’s the difference between arrange() and order() in terms of syntax and readability?

⸻

Quiz Answers

Please comment in the discussion if you wish to obtain the answers to the quiz.
