A2 - Part II

Valentine’s Day Spending

Author

Sara S

Dataset: Valentine’s Day Spending

Introduction

This analysis explores how much people spend on gifts at different ages, focusing on items like Candy, Clothing, and Flowers. It aims to show trends in spending habits and preferences among age groups.

Installing Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mosaic)
Registered S3 method overwritten by 'mosaic':
  method                           from   
  fortify.SpatialPolygonsDataFrame ggplot2

The 'mosaic' package masks several functions from core packages in order to add 
additional features.  The original behavior of these functions should not be affected by this.

Attaching package: 'mosaic'

The following object is masked from 'package:Matrix':

    mean

The following objects are masked from 'package:dplyr':

    count, do, tally

The following object is masked from 'package:purrr':

    cross

The following object is masked from 'package:ggplot2':

    stat

The following objects are masked from 'package:stats':

    binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
    quantile, sd, t.test, var

The following objects are masked from 'package:base':

    max, mean, min, prod, range, sample, sum
library(skimr)

Attaching package: 'skimr'

The following object is masked from 'package:mosaic':

    n_missing
library(ggformula)
library(ggprism)

Reading the dataset

gifts_age <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-02-13/gifts_age.csv")
Rows: 6 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Age
dbl (8): SpendingCelebrating, Candy, Flowers, Jewelry, GreetingCards, Evenin...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gifts_age
# A tibble: 6 × 9
  Age   SpendingCelebrating Candy Flowers Jewelry GreetingCards EveningOut
  <chr>               <dbl> <dbl>   <dbl>   <dbl>         <dbl>      <dbl>
1 18-24                  51    70      50      33            33         41
2 25-34                  40    62      44      34            33         37
3 35-44                  31    58      41      29            42         30
4 45-54                  19    60      37      20            42         31
5 55-64                  18    50      32      13            43         29
6 65+                    13    42      25       8            44         24
# ℹ 2 more variables: Clothing <dbl>, GiftCards <dbl>

Organizing the data

glimpse(gifts_age)
Rows: 6
Columns: 9
$ Age                 <chr> "18-24", "25-34", "35-44", "45-54", "55-64", "65+"
$ SpendingCelebrating <dbl> 51, 40, 31, 19, 18, 13
$ Candy               <dbl> 70, 62, 58, 60, 50, 42
$ Flowers             <dbl> 50, 44, 41, 37, 32, 25
$ Jewelry             <dbl> 33, 34, 29, 20, 13, 8
$ GreetingCards       <dbl> 33, 33, 42, 42, 43, 44
$ EveningOut          <dbl> 41, 37, 30, 31, 29, 24
$ Clothing            <dbl> 33, 27, 26, 20, 19, 12
$ GiftCards           <dbl> 23, 19, 22, 23, 20, 20
inspect(gifts_age)

categorical variables:  
  name     class levels n missing                                  distribution
1  Age character      6 6       0 18-24 (16.7%), 25-34 (16.7%) ...             

quantitative variables:  
                 name   class min    Q1 median    Q3 max     mean        sd n
1 SpendingCelebrating numeric  13 18.25   25.0 37.75  51 28.66667 14.733183 6
2               Candy numeric  42 52.00   59.0 61.50  70 57.00000  9.777525 6
3             Flowers numeric  25 33.25   39.0 43.25  50 38.16667  8.886319 6
4             Jewelry numeric   8 14.75   24.5 32.00  34 22.83333 10.870449 6
5       GreetingCards numeric  33 35.25   42.0 42.75  44 39.50000  5.089204 6
6          EveningOut numeric  24 29.25   30.5 35.50  41 32.00000  6.066300 6
7            Clothing numeric  12 19.25   23.0 26.75  33 22.83333  7.359801 6
8           GiftCards numeric  19 20.00   21.0 22.75  23 21.16667  1.722401 6
  missing
1       0
2       0
3       0
4       0
5       0
6       0
7       0
8       0
skimr::skim(gifts_age)
Data summary
Name gifts_age
Number of rows 6
Number of columns 9
_______________________
Column type frequency:
character 1
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Age 0 1 3 5 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SpendingCelebrating 0 1 28.67 14.73 13 18.25 25.0 37.75 51 ▇▁▂▂▂
Candy 0 1 57.00 9.78 42 52.00 59.0 61.50 70 ▃▃▃▇▃
Flowers 0 1 38.17 8.89 25 33.25 39.0 43.25 50 ▃▃▃▇▃
Jewelry 0 1 22.83 10.87 8 14.75 24.5 32.00 34 ▅▁▂▁▇
GreetingCards 0 1 39.50 5.09 33 35.25 42.0 42.75 44 ▃▁▁▁▇
EveningOut 0 1 32.00 6.07 24 29.25 30.5 35.50 41 ▃▇▃▃▃
Clothing 0 1 22.83 7.36 12 19.25 23.0 26.75 33 ▃▇▁▇▃
GiftCards 0 1 21.17 1.72 19 20.00 21.0 22.75 23 ▃▇▁▃▇

Data Dictionary

Quantitative Variables

  • SpendingCelebrating <dbl>: It is the amount spent on celebrating occasions.

  • Candy <dbl>: Represents the amount spent on candy.

  • Flowers <dbl>: It is the amount spent on flowers.

  • Jewelry <dbl>: Shows amount spent on jewelry.

  • GreetingCards <dbl>: Indicates amount spent on greeting cards.

  • EveningOut <dbl>: It is the amount spent on evening outings.

  • Clothing <dbl>: Shows the amount spent on clothin.

  • GiftCards <dbl>: Represents the amount spent on gift cards.

Qualitative Variables

  • Age <chr>: This variable categorizes individuals into age groups (e.g., 18-24, 25-34, etc.), useful for demographic analysis.

Analyzing the qualitative data

Age

gifts_age%>% count(Age)
# A tibble: 6 × 2
  Age       n
  <chr> <int>
1 18-24     1
2 25-34     1
3 35-44     1
4 45-54     1
5 55-64     1
6 65+       1

Observations

Each age group is evenly distributed with 1 observation each. This equal distribution provides a balanced foundation for analyzing patterns or trends across various age groups.

Analyzing the quantitative data

SpendingCelebrating

gf_histogram(~SpendingCelebrating, data = gifts_age)

Candy

gf_histogram(~Candy, data = gifts_age)

Flowers

gf_histogram(~Flowers, data = gifts_age)

Jewelry

gf_histogram(~Jewelry, data = gifts_age)

Greeting Cards

gf_histogram(~GreetingCards, data = gifts_age)

Evening Out

gf_histogram(~EveningOut, data = gifts_age)

Clothing

gf_histogram(~Clothing, data = gifts_age)

Gift Cards

gf_histogram(~GiftCards, data = gifts_age)

Observations

The histograms for the eight spending categories—SpendingCelebrating, Candy, Flowers, Jewelry, GreetingCards, EveningOut, Clothing, and GiftCards—show different patterns in how much people spend across various age groups. Most categories reveal that individuals tend to spend similar amounts, especially around mid-range values. For instance, SpendingCelebrating and GiftCards generally show lower spending, while Candy and Jewelry have more even spending across different amounts. The EveningOut category has a noticeable peak around 30, indicating that many people spend around this amount. Similarly, Clothing shows more spending between 20 and 25, while GreetingCards tends to cluster around higher amounts, particularly 42 and 43. Flowers demonstrate a consistent spending pattern without big gaps. Overall, while spending habits differ by category, there is a common trend where most people prefer to spend similar mid-range amounts, reflecting shared habits in gift-giving and leisure activities.

Transforming the data

Notes:

  • data.frame: It is a specific type of data structure that organizes data into rows and columns, allowing for efficient storage and manipulation of datasets in a tabular format. In contrast, names like gifts_age or gifts_extra are user-defined variable names that represent specific instances of data frames, helping to describe the content they hold and making the code easier to understand.

  • pivot_longer: It reshapes data from a wide format to a long format. This means it takes multiple columns and combines them into key-value pairs, creating a more compact data structure. It helps make data analysis and visualization easier by organizing related data points into a single column, which is useful for comparing and plotting.

gifts_extra <- data.frame(
  Age = c("18-24", "25-34", "35-44", "45-54", "55-64", "65+"),
  SpendingCelebrating = c(51, 40, 31, 19, 18, 13),
  Candy = c(70, 62, 58, 60, 50, 42),
  Flowers = c(50, 44, 41, 37, 32, 25),
  Jewelry = c(33, 34, 29, 20, 13, 8),
  GreetingCards = c(33, 33, 42, 42, 43, 44),
  EveningOut = c(41, 37, 30, 31, 29, 24),
  Clothing = c(33, 27, 26, 20, 19, 12),
  GiftCards = c(23, 19, 22, 23, 20, 20)
)

gifts <- gifts_extra %>%
  pivot_longer(
    cols = -Age,               
    names_to = "Gifts",        
    values_to = "AmountsSpent" 
  )
gifts%>%
  dplyr::mutate(
    Age = as_factor(Age),      
    Gifts = as_factor(Gifts)  
  )
# A tibble: 48 × 3
   Age   Gifts               AmountsSpent
   <fct> <fct>                      <dbl>
 1 18-24 SpendingCelebrating           51
 2 18-24 Candy                         70
 3 18-24 Flowers                       50
 4 18-24 Jewelry                       33
 5 18-24 GreetingCards                 33
 6 18-24 EveningOut                    41
 7 18-24 Clothing                      33
 8 18-24 GiftCards                     23
 9 25-34 SpendingCelebrating           40
10 25-34 Candy                         62
# ℹ 38 more rows
gifts
# A tibble: 48 × 3
   Age   Gifts               AmountsSpent
   <chr> <chr>                      <dbl>
 1 18-24 SpendingCelebrating           51
 2 18-24 Candy                         70
 3 18-24 Flowers                       50
 4 18-24 Jewelry                       33
 5 18-24 GreetingCards                 33
 6 18-24 EveningOut                    41
 7 18-24 Clothing                      33
 8 18-24 GiftCards                     23
 9 25-34 SpendingCelebrating           40
10 25-34 Candy                         62
# ℹ 38 more rows

Target Variables

The target variable is what you’re trying to predict or understand in an analysis. In this case, AmountsSpent is the focus. When examining spending on items like Candy, Clothing, and Flowers across different age groups reveals how spending habits change with age, making AmountsSpent the key outcome of interest in the analysis.

Predictor Variables

Predictor variables are the independent variables that are used to explain or predict the outcome in an analysis. The predictor variables for this dataset are as follows:

  • Age: It categorizes people by their years of life, helping to analyze how spending habits vary across different life stages.

  • Gifts: These are categories of gifts and activities like Candy or EveningOut. Analyzing spending on these items can help understand consumer preferences across different age groups.

Research Question

How does gift spending vary across different age groups?

Is there a significant difference in spending between items such as Candy, Clothing, and Flowers across age groups?

Plot

gf_line(AmountsSpent ~ Age, data = gifts, group = ~ Gifts, color = ~ Gifts) %>%
 gf_point() %>%
  gf_labs(
    title = "Valentine's Day Spending over Age",
    x = "Age Group in Years",
    y = "Amounts Spent per Gift",
    caption = "Tidy Tuesday 13-Feb-2024. I was teaching O&C that day"
  ) %>%
  gf_refine(scale_color_manual(values = c("Candy" = "red", "Clothing" = "blue", "EveningOut" = "green", "Flowers" = "orange", "GiftCards" = "purple", "GreetingCards" = "pink", "Jewelry" = "yellow", "SpendingCelebrating" = "black"))) 

Plot with shapes

gf_line(AmountsSpent ~ Age, data = gifts, group = ~ Gifts, color = ~ Gifts) %>%
 gf_point(shape = ~ Gifts, size = 3) %>%  # Use shape directly
  gf_labs(
    title = "Valentine's Day Spending over Age",
    x = "Age Group in Years",
    y = "Amounts Spent per Gift",
    caption = "Tidy Tuesday 13-Feb-2024. I was teaching O&C that day"
  ) %>%
  gf_refine(scale_color_manual(values = c("Candy" = "red", "Clothing" = "blue", "EveningOut" = "green", 
                              "Flowers" = "orange", "GiftCards" = "purple", 
                              "GreetingCards" = "pink", "Jewelry" = "yellow", 
                              "SpendingCelebrating" = "black"))) %>%
  gf_refine(scale_shape_manual(values = c("Candy" = 16, "Clothing" = 1, "EveningOut" = 17, 
                                          "Flowers" = 4, "GiftCards" = 18, "GreetingCards" = 25, 
                                          "Jewelry" = 8, "SpendingCelebrating" = 3)))

Plot Analysis

  1. Type of Chart.

    The chart is a line plot with points(shapes), where the different points represent different items like candy, clothing, etc.

  2. Variables Used for Various Geometrical Aspects.

    X-axis Variables: Represents the age groups such as 18-24, 25-34, 35-44, 45-54, 55-64 and 65+

    Y-axis Variables: Represents the amount spent on each category

    Colour: Each item is represented by a different color to distinguish spending patterns.

    Shape: Different shapes for data points represent different gifts

  3. What activity might have been carried out to obtain the data graphed here?

    A research survey was likely conducted, asking respondents from different age groups about their spending habits on Valentine’s Day gifts. Participants probably reported how much they spent on items like candy, clothing, flowers, and gift cards.

  4. What pre-processing of the data was required to create the chart?

    Pivoting: The dataset is reshaped from wide to long format using the pivot_longer function. This transformation consolidates the amounts spent on each gift category into a single column, AmountsSpent, while retaining Age as an identifier variable. Each row in the new dataset now corresponds to the spending on a specific gift category for a particular age group.

    Factor Conversion: After pivoting, the code uses the mutate function under the dplyr package to convert the Age and Gifts columns into factors. This conversion is essential for better handling of categorical variables in visualizations, ensuring that the plotting functions recognize these variables as categorical rather than continuous.

  5. Hypothesis/Research Question.

    How does gift spending vary across different age groups?

  6. Two-Line Story Based on the Graph.

    Spending on candy and celebrations decreases steadily with age, while spending on greeting cards and gift cards remains relatively stable across age groups. The data suggests that older individuals prefer more practical gifts like gift cards, while younger people spend more on traditional gifts like candy and flowers. This reveals that younger age groups tend to spend more on gifts for Valentine’s Day, particularly on candy and flowers, indicating that romantic gestures are more prominent among younger individuals and highlighting differing approaches to celebration across age brackets.

Inferences and My Journey

I initially began working on a case study titled “Women Live Longer.” I made significant progress on it and almost done with then case study when I ran into issues with replicating the graph. I tried multiple different versions, but none of them worked as expected. The closest attempt was using the following code:

##ggplot(life_expectancy_data, aes(x = LifeExpMale, y = LifeExpFemale, size = Population))+   ##geom_point(shape = 21, color = "black", fill = NA) +  
  ##geom_abline(intercept = 0, slope = 1, linetype = "dashed") + 
  ##labs(title = "Life Expectancy across Countries and Years", 
     ##x = "For Men (Years)", 
     ##y = "For Women (Years)") +  
  ##scale_size_continuous(breaks = c(5e8, 1e9), 
                        ##labels = c("5.0e+08", "1.0e+09"), 
                        ##guide = guide_legend(override.aes = list(fill = NA, color = "black"))) +  
  ##xlim(50, 90) + 
  ##ylim(50, 90)    

However, the resulting plot had issues. The circles representing data points were inaccurately placed, and despite my efforts, I couldn’t figure out how to correct it. Additionally, I wanted to attach the index file along with the rest of my A2, but the file was bugged, preventing me from rendering and pushing it to Git.

After spending many hours without success, I decided to switch to the “Valentine’s Day Spending” case study. This one was much easier to understand and implement. I had no problems transforming the data and replicating the graphs. Identifying the quantitative and qualitative variables, along with the target and predictor variables, was also pretty easy.

However, I did encounter a challenge while analyzing the graphs for quantitative variables. Most of them showed very little variation, which made them look different from typical histograms. This lack of diversity made it hard to draw meaningful observations from those graphs.

I also felt confused about whether to use the transformed dataset or the original dataset when deciding on the quantitative and qualitative variables, as well as identifying the target and predictor variables. In the end, I decided to use the original dataset for identifying the quantitative and qualitative variables, and then I transformed it to determine the target and predictor variables.

When transforming the data, using the pivot_longer() function was helpful. It reshaped the data into a long format, bringing together the amounts spent on each gift category into a single column. This change made it easier to visualize and compare spending across categories. By converting the Age and Gifts columns into factors, the analysis ensured they were treated correctly as categorical variables, which helped clarify the results.

The visualizations I created using gf_line() and gf_point() effectively illustrated spending trends across different age groups. These functions allowed me to create clear line plots that displayed the data in an organized way. By using different colors and shapes for each gift category, I was able to make the information more visually appealing and easier to interpret.

Learning how to implement shapes in the code was quite simple. had to look up which numbers corresponded to each shape and apply them in my plots. This added another layer of clarity, as it helped distinguish between the various spending categories. However, I’m still figuring out how to create the shape for spendingcelebrating, which looks like a circle with a ‘+’ sign inside. My final guess is that maybe having a code where one shape is added to another might achieve the final shape. Despite this small challenge, using different shapes in my visualizations significantly improved my ability to see and compare spending patterns across age groups. It helped me understand how different age groups prioritize their spending on gifts.

One important observation from the analysis is the difference in gift preferences among various age groups. Younger people, especially those aged 18-34, tend to spend more on traditional romantic gifts like candy and flowers, showing a clear preference for symbolic gifts. In contrast, older individuals (ages 55 and up) are less interested in extravagant items like jewelry and evening outings, but they consistently spend on practical gifts, highlighting a focus on usefulness and sentimental value. Distinct spending patterns also appeared across different gift categories; for example, spending on clothing and jewelry decreases with age, while greeting cards remain popular across all groups. Interestingly, spending on evening outings peaks among those in their mid-30s to mid-40s, suggesting this age group values experiences more than their younger or older counterparts. Overall, these trends provide valuable insights into how consumer behavior changes at different life stages.