Hello!
The dplyr package for R provides functions for manipulating data frames: filtering rows, selecting columns, sorting, grouping, and much more.
In this article we will look at its main functions.
To begin, install the package:
install.packages("dplyr")
Main functions of dplyr
Function filter()
used to select rows from data that meet certain conditions:
library(dplyr)
# create a data frame with information about students
students <- data.frame(
name = c("Alice", "Bob", "Charlie", "David"),
age = c(22, 25, 21, 24),
grade = c("A", "B", "C", "A")
)
# filter the data to keep students older than 21
filtered_students <- filter(students, age > 21)
print(filtered_students)
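filter() also accepts several conditions at once. A minimal sketch, reusing the students data frame from above; the thresholds and the variable names top_older and young_or_b are chosen only for illustration:

```r
library(dplyr)

students <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(22, 25, 21, 24),
  grade = c("A", "B", "C", "A")
)

# comma-separated conditions are combined with AND
top_older <- filter(students, age > 21, grade == "A")
print(top_older)

# | means OR; %in% matches any value from a set
young_or_b <- filter(students, age < 22 | grade %in% c("B"))
print(young_or_b)
```

Rows keep their original order, so filter() only decides which rows survive, not how they are sorted.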
Function select()
used to select specific columns from data:
# select only the columns with student names and grades
selected_data <- select(students, name, grade)
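Besides listing columns by name, select() can drop columns, match them by pattern, and rename them on the way. A short sketch on the same students data; the result names (without_age, g_columns, renamed) are illustrative only:

```r
library(dplyr)

students <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(22, 25, 21, 24),
  grade = c("A", "B", "C", "A")
)

# a minus sign drops a column and keeps the rest
without_age <- select(students, -age)

# selection helpers pick columns by name pattern
g_columns <- select(students, starts_with("g"))

# new_name = old_name renames a column while selecting it
renamed <- select(students, student = name, grade)

print(names(without_age))
print(names(g_columns))
print(names(renamed))
```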
mutate()
allows you to create new variables based on existing data:
# create a new variable that flags students older than 23
mutated_data <- mutate(students, senior = ifelse(age > 23, "Yes", "No"))
print(mutated_data)
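When there are more than two outcomes, nested ifelse() calls become hard to read; dplyr's case_when() is one alternative. A sketch under assumed age bands (the cut-offs and column names age_next_year and age_group are made up for this example):

```r
library(dplyr)

students <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(22, 25, 21, 24),
  grade = c("A", "B", "C", "A")
)

age_groups <- mutate(
  students,
  # several variables can be created in one call
  age_next_year = age + 1,
  # conditions are checked top to bottom; the first match wins
  age_group = case_when(
    age < 22 ~ "under 22",
    age < 25 ~ "22 to 24",
    TRUE ~ "25 and over"
  )
)
print(age_groups)
```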
summarize()
used to calculate a summary of the data:
# calculate the average age of the students
summary_data <- summarize(students, average_age = mean(age))
print(summary_data)
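summarize() can return several statistics in one row, and n() gives the number of rows. A minimal sketch on the students data; the output column names are illustrative:

```r
library(dplyr)

students <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(22, 25, 21, 24),
  grade = c("A", "B", "C", "A")
)

# each argument becomes one column of the one-row result
age_stats <- summarize(
  students,
  n_students = n(),
  youngest = min(age),
  oldest = max(age),
  average_age = mean(age)
)
print(age_stats)
```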
arrange()
used to sort data by specific columns:
# sort students by age in ascending order
sorted_data <- arrange(students, age)
print(sorted_data)
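For descending order, wrap the column in desc(); additional columns act as tie-breakers. A short sketch on the same data (the result names are illustrative):

```r
library(dplyr)

students <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(22, 25, 21, 24),
  grade = c("A", "B", "C", "A")
)

# oldest students first
by_age_desc <- arrange(students, desc(age))
print(by_age_desc)

# sort by grade, and within each grade by age, oldest first
by_grade_then_age <- arrange(students, grade, desc(age))
print(by_grade_then_age)
```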
Working with Data Groups
group_by()
used to group data by one or more columns:
library(dplyr)
# create a data frame with sales information
sales <- data.frame(
product = c("A", "B", "A", "B", "A"),
amount = c(100, 150, 200, 120, 180)
)
# group the data by product
grouped_sales <- group_by(sales, product)
print(grouped_sales)
After grouping the data, you can apply summarize() to calculate statistics for each group:
summary_data <- summarize(grouped_sales, avg_sales = mean(amount))
print(summary_data)
You can calculate several statistics at once for each group, again using summarize():
summary_data <- summarize(grouped_sales,
avg_sales = mean(amount),
max_sales = max(amount))
print(summary_data)
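Counting rows per group is a common companion to these statistics. A sketch using n() inside summarize() and the count() shortcut, on the same sales data; the column names n_sales and total_sales are illustrative:

```r
library(dplyr)

sales <- data.frame(
  product = c("A", "B", "A", "B", "A"),
  amount = c(100, 150, 200, 120, 180)
)

# n() returns the number of rows in the current group
per_product <- sales %>%
  group_by(product) %>%
  summarize(n_sales = n(), total_sales = sum(amount))
print(per_product)

# count() groups and counts in one step; the count column is called n
product_counts <- count(sales, product)
print(product_counts)
```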
Of course, you can perform other operations, such as filtering or creating new variables:
# filter within each group, keeping only amounts above the group mean
filtered_data <- grouped_sales %>%
filter(amount > mean(amount))
print(filtered_data)
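Creating new variables works per group as well. In the sketch below, x %>% f(y) is simply another way of writing f(x, y), and mutate() computes each sale's share of its product total; the column names group_total and share are made up for this example:

```r
library(dplyr)

sales <- data.frame(
  product = c("A", "B", "A", "B", "A"),
  amount = c(100, 150, 200, 120, 180)
)

sales_share <- sales %>%
  group_by(product) %>%
  # sum(amount) is evaluated separately within each product
  mutate(group_total = sum(amount),
         share = amount / group_total) %>%
  # drop the grouping so later steps work on the whole table again
  ungroup()
print(sales_share)
```

Unlike summarize(), a grouped mutate() keeps every original row and only adds columns.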
Data merging and consolidation
Join functions:
- left_join(): keeps all rows from the first (left) data set and adds the matching columns from the second by key.
- right_join(): keeps all rows from the second (right) data set and adds the matching columns from the first by key.
- inner_join(): returns only rows that have matching key values in both data sets.
- full_join(): returns all rows from both data sets, filling in NA where one side has no match.
# example data
df1 <- data.frame(id = c(1, 2, 3),
name = c("Ivan", "Kolya", "Nastya"))
df2 <- data.frame(id = c(2, 3, 4),
age = c(25, 30, 35))
# left_join(): join on the key "id", keeping all rows from df1
left_merged <- left_join(df1, df2, by = "id")
print(left_merged)
# right_join(): join on the key "id", keeping all rows from df2
right_merged <- right_join(df1, df2, by = "id")
print(right_merged)
# inner_join(): join on the key "id", keeping only rows whose key appears in both data sets
inner_merged <- inner_join(df1, df2, by = "id")
print(inner_merged)
# full_join(): join on the key "id", returning all rows from both data sets
full_merged <- full_join(df1, df2, by = "id")
print(full_merged)
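A quick way to see how the four joins differ is to compare how many rows each returns for the same inputs. A minimal sketch with the df1 and df2 from above; join_sizes and unmatched_ids are names introduced only for this illustration:

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 3),
                  name = c("Ivan", "Kolya", "Nastya"))
df2 <- data.frame(id = c(2, 3, 4),
                  age = c(25, 30, 35))

# number of rows each join returns for the same two inputs
join_sizes <- c(
  left  = nrow(left_join(df1, df2, by = "id")),
  right = nrow(right_join(df1, df2, by = "id")),
  inner = nrow(inner_join(df1, df2, by = "id")),
  full  = nrow(full_join(df1, df2, by = "id"))
)
print(join_sizes)

# ids that got an NA in the full join because one side had no match
full_merged <- full_join(df1, df2, by = "id")
unmatched_ids <- full_merged$id[is.na(full_merged$name) | is.na(full_merged$age)]
print(unmatched_ids)
```

Only ids 2 and 3 appear in both data frames, which is why the inner join is the smallest result and the full join the largest.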
Usage examples
Sales data analysis:
library(dplyr)
# load the sales data
sales_data <- read.csv("sales_data.csv")
# convert Date to the Date class (assumes ISO "YYYY-MM-DD" strings), then keep the chosen period
sales_filtered <- sales_data %>%
mutate(Date = as.Date(Date)) %>%
filter(Date >= as.Date("2023-01-01") & Date <= as.Date("2024-12-31"))
# group by product and compute total sales
sales_summary <- sales_filtered %>%
group_by(Product) %>%
summarise(Total_Sales = sum(Sales))
print(sales_summary)
Processing customer data:
library(dplyr)
customer_data <- read.csv("customer_data.csv")
# keep customers from one country
customer_filtered <- customer_data %>%
filter(Country == "Spain")
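# add an approximate age: current year minus year of birth
# (base R is used here because year() and now() come from lubridate, which is not loaded)
customer_processed <- customer_filtered %>%
mutate(Age = as.integer(format(Sys.Date(), "%Y")) - Year_of_Birth)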
# select the columns of interest
customer_selected <- customer_processed %>%
select(Name, Age, Gender, Email)
print(customer_selected)
A small analysis of site traffic:
library(dplyr)
web_traffic <- read.csv("web_traffic.csv")
# keep one country and dates from 2023-01-01 onward (Date is assumed to hold ISO "YYYY-MM-DD" strings)
traffic_filtered <- web_traffic %>%
mutate(Date = as.Date(Date)) %>%
filter(Country == "USA", Date >= as.Date("2023-01-01"))
# compute the average session duration
average_session_duration <- traffic_filtered %>%
summarise(Avg_Session_Duration = mean(Session_Duration))
print(average_session_duration)
Summary statistics on employees and their salaries:
library(dplyr)
employee_data <- read.csv("employee_data.csv")
# group by department and compute the average salary
salary_summary <- employee_data %>%
group_by(Department) %>%
summarise(Avg_Salary = mean(Salary))
# sort the result by average salary, descending
salary_summary_sorted <- salary_summary %>%
arrange(desc(Avg_Salary))
print(salary_summary_sorted)
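The CSV files used in these examples are not included with the article, so here is a sketch of the last example that runs on its own. The inline data frame is invented for illustration (the departments and salaries are hypothetical), and the two steps are combined into a single pipeline:

```r
library(dplyr)

# hypothetical stand-in for employee_data.csv
employee_data <- data.frame(
  Department = c("IT", "IT", "Sales", "Sales", "HR"),
  Salary = c(5000, 6000, 4000, 4500, 3800)
)

# group, summarise and sort in one chain, without intermediate variables
salary_summary_sorted <- employee_data %>%
  group_by(Department) %>%
  summarise(Avg_Salary = mean(Salary), Employees = n()) %>%
  arrange(desc(Avg_Salary))
print(salary_summary_sorted)
```

Chaining the steps avoids naming every intermediate result; keeping them separate, as in the examples above, can be easier to debug.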