COVID-19 Example Data

Author

Juluis Visnel Foyet

Published

March 12, 2026

The dataset

The dataset used in this project is an example Covid data of 82,101 observations and 31 variables relating to demographics, symptoms and outcomes of Covid (Hospitalization and dead). It can be downloaded at https://github.com/appliedepi/epiRhandbook_eng/tree/master/data/covid_example_data

Packages Importation

library(readxl) # To import excel datasets
library(tidyverse) # Data wrangling and Viz
library(freqtables)
library(gt)
library(patchwork)

Data Importation

After importing the dataset, I use the glimpse function to have a feel about the size and the variables. R seems to have accurately captured the type of each variable.

df <- read_excel('covid_example_data.xlsx')

glimpse(df)

Rows: 82,101
Columns: 31
$ PID                    <chr> "3a85e6992a5ac52f", "c6b5281d5fc50b96", "53495a…
$ reprt_creationdt_FALSE <dttm> 2020-03-22, 2020-02-01, 2020-02-10, 2020-03-20…
$ case_dob_FALSE         <dttm> 2004-11-08, 1964-06-07, 1944-04-06, 1964-06-25…
$ case_age               <dbl> 16, 57, 77, 57, 56, 65, 47, 61, 36, 42, 74, 27,…
$ case_gender            <chr> "Male", "Male", "Female", "Female", "Male", "Ma…
$ case_race              <chr> "WHITE", "WHITE", "BLACK", "BLACK", "WHITE", "B…
$ case_eth               <chr> "NON-HISPANIC/LATINO", "NON-HISPANIC/LATINO", "…
$ case_zip               <dbl> 30308, 30308, 30315, 30213, 30004, 30314, 30313…
$ Contact_id             <chr> "Yes-Symptomatic", "Yes-Symptomatic", "Yes-Symp…
$ sym_startdt_FALSE      <dttm> 2020-03-20, 2020-01-28, 2020-02-10, 2021-05-19…
$ sym_fever              <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "No", "…
$ sym_subjfever          <chr> "Yes", "No", NA, "Yes", "Yes", "Yes", "No", "Ye…
$ sym_myalgia            <chr> "No", "Yes", "Yes", "Yes", "Yes", "No", "Unk", …
$ sym_losstastesmell     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Yes", NA, …
$ sym_sorethroat         <chr> "Yes", "No", "Yes", "Yes", "No", "Unk", "Yes", …
$ sym_cough              <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"…
$ sym_headache           <chr> "Yes", "No", NA, "Yes", "No", "Unk", "Yes", "No…
$ sym_resolved           <chr> "No, still symptomatic", "No, still symptomatic…
$ sym_resolveddt_FALSE   <dttm> NA, NA, NA, NA, NA, 2020-02-21, NA, NA, NA, NA…
$ contact_household      <chr> "Yes", "No", NA, "No", "No", "No", "No", "No", …
$ hospitalized           <chr> "No", "No", "Yes", NA, "Yes", "Yes", "Yes", "No…
$ hosp_admidt_FALSE      <dttm> NA, NA, 2020-02-08, NA, 2020-02-26, 2020-01-27…
$ hosp_dischdt_FALSE     <dttm> NA, NA, NA, NA, NA, 2020-02-21, NA, NA, NA, 20…
$ died                   <chr> "No", "No", "No", "No", NA, "Yes", "No", NA, "N…
$ died_covid             <chr> "No", "No", "No", "No", NA, "Yes", "No", NA, "N…
$ died_dt_FALSE          <dttm> NA, NA, NA, NA, NA, 2020-02-21, NA, NA, NA, NA…
$ confirmed_case         <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"…
$ covid_dx               <chr> "Confirmed", "Confirmed", "Confirmed", "Confirm…
$ pos_sampledt_FALSE     <dttm> 2020-03-22, 2020-02-01, 2020-02-10, 2021-01-17…
$ latitude_JITT          <dbl> 33.776645460, 33.780510140, 33.730233310, 33.55…
$ longitude_JITT         <dbl> -84.385685230, -84.389474740, -84.384251890, -8…

Data Exploration

Race and Ethnicity Formating

From glimpse, I noticed that race and ethnicity are in all caps. I have a preference for cap only at the beginning of each word.

I also choose to focus on confirmed cases.

df <- df %>% 
  filter(confirmed_case == "Yes") %>% 
  mutate(
    case_race = str_to_title(case_race),
    case_eth = str_to_title(case_eth),
    sym_myalgia = str_to_title(sym_myalgia) # The symptom analysis below revealed that the case is no consistent here.
  )

Missing values

Up to 7 (died_df_FALSE, hosp_dischdt_FALSE, hosp_admdt_FALSE, sym_resolveddt_FALSE, sym_losstastesmell, died_covid, sym_resolved) variables have missing values for more than 50% of observations. The good news is that some of this are not missing per se, but not applicable. For example, if a person did not died, died_df_FALSE does not apply to them.

df %>% 
  summarise(
    across(everything(), ~sum(is.na(.)))
  ) %>% 
  gather(
    Variables, n_NAs, 1:31
  ) %>% 
  arrange(-n_NAs) %>% 
  mutate(Perc_NAs = round((n_NAs*100)/nrow(df), 2)) %>% 
  gt() %>% 
  fmt_number(columns = n_NAs, use_seps = T, decimals = 0)

Variables	n_NAs	Perc_NAs
died_dt_FALSE	80,357	97.92
hosp_dischdt_FALSE	78,564	95.74
hosp_admidt_FALSE	77,079	93.93
sym_resolveddt_FALSE	65,772	80.15
sym_losstastesmell	50,711	61.80
died_covid	42,285	51.53
sym_resolved	42,279	51.52
sym_subjfever	37,895	46.18
sym_startdt_FALSE	37,465	45.65
died	36,819	44.87
contact_household	36,725	44.75
hospitalized	32,473	39.57
sym_sorethroat	32,231	39.28
Contact_id	32,194	39.23
sym_myalgia	32,127	39.15
sym_headache	32,008	39.00
sym_cough	31,620	38.53
sym_fever	31,567	38.47
case_race	2,627	3.20
case_eth	2,571	3.13
longitude_JITT	164	0.20
pos_sampledt_FALSE	119	0.15
case_gender	60	0.07
latitude_JITT	58	0.07
case_dob_FALSE	47	0.06
case_age	47	0.06
case_zip	13	0.02
PID	0	0.00
reprt_creationdt_FALSE	0	0.00
confirmed_case	0	0.00
covid_dx	0	0.00

Demographics

This section will include age, gender, race, and ethnicity of cases, and I choose to stratify by gender.

Gender

Females are the most represented number of cases in this dataset (52.78%). A very small proportion of cases (0.42%) are of unknown gender.

df %>% 
  select(case_gender) %>% 
  na.omit() %>% 
  freq_table(case_gender) %>% 
  select(cat, n, percent) %>% 
  rename(Gender = cat) %>%
  gt() %>% 
  grand_summary_rows(
    columns = c(n, percent),
    fns = sum ~ sum(.),
    fmt = ~fmt_number(., use_seps = T, decimals = 0)
  ) %>% 
  fmt_number(columns = percent, decimals = 2) %>% 
  fmt_number(columns = n, use_seps = T, decimals = 0)

	Gender	n	percent
	Female	43,280	52.78
	Male	38,376	46.80
	Unknown	346	0.42
sum	—	82,002	100

Race

The black race is the most represented (44.11%), followed by the white race (39.76%). Unknown races, Asians, American Indians/Alaska Natives, and Native Hawaiians/Pacific Islanders represent minorities in this dataset, with less than 5% representativity each.

df %>% 
  select(case_race) %>% 
  na.omit() %>% 
  freq_table(case_race) %>% 
  select(cat, n, percent) %>% 
  rename(Race = cat) %>% 
  arrange(-percent) %>% 
  gt() %>% 
  grand_summary_rows(
    columns = c(n, percent),
    fns = sum ~ sum(.),
    fmt = ~fmt_number(., use_seps = T, decimals = 0)
  ) %>% 
  fmt_number(columns = percent, decimals = 2) %>% 
  fmt_number(columns = n, use_seps = T, decimals = 0)

	Race	n	percent
	Black	35,037	44.11
	White	31,584	39.76
	Other	5,860	7.38
	Unknown	3,720	4.68
	Asian	3,071	3.87
	American Indian/Alaska Native	84	0.11
	Native Hawaiian/Pacific Islander	79	0.10
sum	—	79,435	100

Ethnicity

Non-Hispanos/Latinos largely dominate (78.81%).

df %>% 
  select(case_eth) %>% 
  na.omit() %>% 
  freq_table(case_eth) %>% 
  select(cat, n, percent) %>% 
  rename(Ethnicity = cat) %>% 
  arrange(-percent) %>% 
  gt() %>% 
  grand_summary_rows(
    columns = c(n, percent),
    fns = sum ~ sum(.),
    fmt = ~fmt_number(., use_seps = T, decimals = 0)
  ) %>% 
  fmt_number(columns = percent, decimals = 2) %>% 
  fmt_number(columns = n, use_seps = T, decimals = 0)

	Ethnicity	n	percent
	Non-Hispanic/Latino	62,649	78.81
	Hispanic/Latino	8,621	10.85
	Not Specified	8,221	10.34
sum	—	79,491	100

Age

Age spans from 0 to 106 years, with an average of 39.69 and a standard deviation of 19.15.

p1 <- ggplot(df %>% filter(case_age>0), aes(x = case_age, y=" "))+
  geom_boxplot()+
  scale_x_continuous(breaks = seq(0, 120, 20), expand = c(.01,0))+
  theme_bw()+
  theme(
    axis.title = element_blank(),
    axis.text = element_text(size = 11, colour = "black")
  )

p2 <- ggplot(df %>% filter(case_age>0), aes(x = case_age))+
  geom_histogram(color = "black", fill = "white")+
  scale_x_continuous(breaks = seq(0, 120, 20), expand = c(.01,0))+
  theme_bw()+
  theme(
    axis.title.y = element_blank(),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 11, colour = "black")
  )

p1+p2+
  plot_layout(ncol = 1, heights = c(.5,1))

Demographics by gender

# Compute the frequency distribution of race and ethnicity for per gender.
race_eth_gender <- c("case_race", "case_eth") %>% # a function from purrr that works like a loop.
  map(\(i){
    df %>% 
      select(case_gender, !!rlang::sym(i)) %>% 
      na.omit() %>%
      freq_table(!!rlang::sym(i), case_gender)
  }) %>% 
  list_rbind()

# Format the result from above into a publication ready table.
race_eth_gender %>% 
  freq_format(recipe = "n (percent_row)", name = 'perc', digits = 2) %>% 
  select(row_var, row_cat, col_cat, perc) %>% 
  spread(col_cat, perc, fill = 0) %>% 
  left_join(
    race_eth_gender %>% 
      group_by(row_var, row_cat) %>% 
      summarise(Total = paste0(sum(n), " (100)"), .groups = "drop"),
    by = join_by(row_var, row_cat)
  ) %>% 
  gt(
    groupname_col = "row_var",
    row_group_as_column = T,
    rowname_col = "row_cat"
  ) %>% 
  tab_spanner(
    label = "Case Gender",
    columns = Female:Unknown
  )

		Case Gender			Total
		Female	Male	Unknown	Total
case_eth	Hispanic/Latino	4547 (52.74)	4060 (47.09)	14 (0.16)	8621 (100)
	Non-Hispanic/Latino	33627 (53.68)	28894 (46.13)	117 (0.19)	62638 (100)
	Not Specified	3879 (47.28)	4171 (50.83)	155 (1.89)	8205 (100)
case_race	American Indian/Alaska Native	46 (54.76)	38 (45.24)	0	84 (100)
	Asian	1449 (47.18)	1617 (52.65)	5 (0.16)	3071 (100)
	Black	19916 (56.86)	15046 (42.96)	65 (0.19)	35027 (100)
	Native Hawaiian/Pacific Islander	38 (48.10)	40 (50.63)	1 (1.27)	79 (100)
	Other	3024 (51.61)	2797 (47.74)	38 (0.65)	5859 (100)
	Unknown	1733 (46.74)	1842 (49.68)	133 (3.59)	3708 (100)
	White	15830 (50.13)	15691 (49.69)	55 (0.17)	31576 (100)

# Viz
ggplot(data = race_eth_gender, aes(x = row_cat, y = percent_row, fill = col_cat))+
  geom_col(position = "dodge")+
  facet_wrap(~row_var, scales = "free_x", space = "free_x")+
  scale_y_continuous(breaks = seq(0, 60, 10), limits = c(0,60), expand = c(.01,0))+
  labs(y = "Percentage (%)", fill = "Gender")+
  theme_bw()+
  theme(
    legend.position = "top",
    axis.text.x = element_text(angle = 15, hjust = 1),
    axis.title.x = element_blank(),
    strip.background = element_rect(fill = "white"),
    strip.text = element_text(size = 12, face = "bold", hjust = 0),
    axis.title.y = element_text(size = 12, face = "bold"),
    legend.title = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 11, colour = "black")
  )

p1 <- ggplot(df %>% filter(case_age>0 & !is.na(case_gender)), aes(x = case_age, y=" ", fill = case_gender))+
  geom_boxplot()+
  scale_x_continuous(breaks = seq(0, 120, 20), expand = c(.01,0))+
  theme_bw()+
  theme(
    axis.title = element_blank(),
    axis.text = element_text(size = 11, colour = "black")
  )

p2 <- ggplot(df %>% filter(case_age>0 & !is.na(case_gender)),
             aes(x = case_age, fill = case_gender, fill = case_gender))+
  geom_histogram(color = "black")+
  scale_x_continuous(breaks = seq(0, 120, 20), expand = c(.01,0))+
  theme_bw()+
  theme(
    # axis.title.y = element_blank(),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 11, colour = "black")
  )

p1+p2+
  plot_layout(ncol = 1, heights = c(.5,1), guides = "collect")&
  theme(legend.position = "top")

Symptoms

Overall

This analysis reveals that top 3 Covid symptoms include cough (prevalence = 43.49%), headache (43.29%) and loss of taste-smell (40.59%). The least prevalent symptom is sore throat (25.10%).

names(df %>% 
        select(
          starts_with("sym_"),
          -c(sym_startdt_FALSE, sym_resolved, sym_resolveddt_FALSE)
        )) %>% 
  map(\(i){
    df %>% 
      select(!!rlang::sym(i)) %>% 
      na.omit() %>% 
      freq_table(!!rlang::sym(i))
  }) %>% 
  list_rbind() %>% 
  freq_format(recipe = "n (percent)", name = 'perc', digits = 2) %>% 
  select(var, cat, perc) %>% 
  spread(cat, perc) %>% 
  gt %>% 
  tab_spanner(
    label = "Symptom Occurrence",
    columns = No:Yes
  )

var	Symptom Occurrence
var	No	Unk	Yes
sym_cough	27452 (54.42)	1054 (2.09)	21936 (43.49)
sym_fever	33936 (67.21)	1446 (2.86)	15113 (29.93)
sym_headache	27175 (54.29)	1211 (2.42)	21668 (43.29)
sym_losstastesmell	18092 (57.71)	534 (1.70)	12725 (40.59)
sym_myalgia	29197 (58.47)	1219 (2.44)	19519 (39.09)
sym_sorethroat	36085 (72.41)	1237 (2.48)	12509 (25.10)
sym_subjfever	30446 (68.93)	1023 (2.32)	12698 (28.75)

By Gender

symp_by_gender <- names(df %>% 
                          select(
                            starts_with("sym_"),
                            -c(sym_startdt_FALSE, sym_resolved, sym_resolveddt_FALSE)
                          )) %>% 
  map(\(i){
    df %>% 
      select(case_gender, !!rlang::sym(i)) %>% 
      na.omit() %>%
      freq_table(case_gender, !!rlang::sym(i))
  }) %>% 
  list_rbind()

# Format the result from above into a publication ready table.
symp_by_gender %>% 
  freq_format(recipe = "n (percent_row)", name = 'perc', digits = 2) %>% 
  select(col_var, row_cat, col_cat, perc) %>% 
  spread(col_cat, perc, fill = 0) %>% 
  left_join(
    symp_by_gender %>% 
      group_by(col_var, row_cat) %>% 
      summarise(Total = paste0(sum(n), " (100)"), .groups = "drop"),
    by = join_by(col_var, row_cat)
  ) %>% 
  rename("Case Gender" = row_cat) %>% 
  gt(
    groupname_col = "col_var",
    row_group_as_column = T
  ) %>% 
  tab_spanner(
    label = "Symptom Occurrence",
    columns = No:Yes
  )

	Case Gender	Symptom Occurrence			Total
	Case Gender	No	Unk	Yes	Total
sym_cough	Female	14563 (53.76)	542 (2.00)	11984 (44.24)	27089 (100)
	Male	12833 (55.17)	508 (2.18)	9921 (42.65)	23262 (100)
	Unknown	53 (63.86)	4 (4.82)	26 (31.33)	83 (100)
sym_fever	Female	18990 (70.11)	734 (2.71)	7361 (27.18)	27085 (100)
	Male	14877 (63.80)	708 (3.04)	7732 (33.16)	23317 (100)
	Unknown	65 (76.47)	4 (4.71)	16 (18.82)	85 (100)
sym_headache	Female	13649 (50.76)	610 (2.27)	12631 (46.97)	26890 (100)
	Male	13469 (58.37)	597 (2.59)	9008 (39.04)	23074 (100)
	Unknown	54 (65.85)	4 (4.88)	24 (29.27)	82 (100)
sym_losstastesmell	Female	9307 (54.59)	266 (1.56)	7476 (43.85)	17049 (100)
	Male	8750 (61.43)	266 (1.87)	5229 (36.71)	14245 (100)
	Unknown	34 (62.96)	2 (3.70)	18 (33.33)	54 (100)
sym_myalgia	Female	15488 (57.79)	613 (2.29)	10698 (39.92)	26799 (100)
	Male	13647 (59.22)	603 (2.62)	8796 (38.17)	23046 (100)
	Unknown	58 (70.73)	3 (3.66)	21 (25.61)	82 (100)
sym_sorethroat	Female	18913 (70.65)	626 (2.34)	7232 (27.01)	26771 (100)
	Male	17103 (74.46)	607 (2.64)	5260 (22.90)	22970 (100)
	Unknown	64 (78.05)	4 (4.88)	14 (17.07)	82 (100)
sym_subjfever	Female	16679 (70.38)	510 (2.15)	6511 (27.47)	23700 (100)
	Male	13721 (67.25)	510 (2.50)	6172 (30.25)	20403 (100)
	Unknown	41 (71.93)	3 (5.26)	13 (22.81)	57 (100)

From the table above and the graph below, the prevalence of headache, loss of taste/smell and sore throat seems higher in females, while fever seems more prevalent in males. The prevalences of cough and myalgia differ by less than 1% between both genders. These results need to be perceived carefully, since about 1-3% of cases have unknown symptom status, talkless of about 39-62% of participants having missing values for symptom status.

# Viz
ggplot(symp_by_gender, aes(x = row_cat, y = percent_row, fill = col_cat))+
  geom_col(color = "black")+
  geom_text(data = symp_by_gender %>% filter(col_cat == "Yes"),
            aes(label = paste0(round(percent_row,1),"%")),
            position = position_stack(vjust = .5), color = "white")+
  facet_wrap(~col_var, scales = "free_x", nrow = 2)+
  scale_y_continuous(breaks = seq(0, 100, 20), expand = c(.01,0))+
  scale_fill_manual(values = c("#00BA38", "grey", "#F8766D"))+
  labs(y = "Percentage (%)", fill = "Symptom Presence")+
  theme_bw()+
  theme(
    legend.position = "top",
    axis.title.x = element_blank(),
    strip.background = element_rect(fill = "white"),
    strip.text = element_text(size = 12, face = "bold", hjust = 0),
    axis.title.y = element_text(size = 12, face = "bold"),
    legend.title = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 11, colour = "black")
  )

Covid Outcome

Overall, 10.68% of cases were hospitalized, 3.76% died, and 3.36% died from covid.

c("hospitalized", "died", "died_covid") %>% 
  map(\(i){
    df %>% 
      select(!!rlang::sym(i)) %>% 
      na.omit() %>% 
      freq_table(!!rlang::sym(i))
  }) %>% 
  list_rbind() %>% 
  freq_format(recipe = "n (percent)", name = 'perc', digits = 2) %>% 
  select(var, cat, perc) %>% 
  spread(cat, perc, fill = 0) %>% 
  gt() %>% 
  tab_spanner(
    label = "Outcome Occurrence",
    columns = No:Yes
  )

var	Outcome Occurrence
var	No	Under Review	Unknown	Yes
died	43383 (95.89)	0	158 (0.35)	1702 (3.76)
died_covid	38318 (96.33)	123 (0.31)	0	1336 (3.36)
hospitalized	44295 (89.32)	0	0	5294 (10.68)

By Gender

out_by_gender <- c("hospitalized", "died", "died_covid") %>% 
  map(\(i){
    df %>% 
      select(case_gender, !!rlang::sym(i)) %>% 
      na.omit() %>%
      freq_table(case_gender, !!rlang::sym(i))
  }) %>% 
  list_rbind()

# Format the result from above into a publication ready table.
out_by_gender %>% 
  freq_format(recipe = "n (percent_row)", name = 'perc', digits = 2) %>% 
  select(col_var, row_cat, col_cat, perc) %>% 
  spread(col_cat, perc, fill = 0) %>% 
  left_join(
    out_by_gender %>% 
      group_by(col_var, row_cat) %>% 
      summarise(Total = paste0(sum(n), " (100)"), .groups = "drop"),
    by = join_by(col_var, row_cat)
  ) %>% 
  rename("Case Gender" = row_cat) %>% 
  gt(
    groupname_col = "col_var",
    row_group_as_column = T
  ) %>% 
  tab_spanner(
    label = "Outcome Occurrence",
    columns = No:Yes
  )

	Case Gender	Outcome Occurrence				Total
	Case Gender	No	Under Review	Unknown	Yes	Total
died	Female	23263 (96.23)	0	74 (0.31)	837 (3.46)	24174 (100)
	Male	20062 (95.49)	0	83 (0.40)	865 (4.12)	21010 (100)
	Unknown	54 (98.18)	0	1 (1.82)	0	55 (100)
died_covid	Female	20539 (96.69)	71 (0.33)	0	633 (2.98)	21243 (100)
	Male	17728 (95.92)	52 (0.28)	0	703 (3.80)	18483 (100)
	Unknown	47 (100.00)	0	0	0	47 (100)
hospitalized	Female	23800 (89.78)	0	0	2709 (10.22)	26509 (100)
	Male	20411 (88.78)	0	0	2580 (11.22)	22991 (100)
	Unknown	76 (93.83)	0	0	5 (6.17)	81 (100)

From the table and the graph, a greater proportion of male cases died from covid or another cause. A greater proportion of males got hospitalized too.

# Viz
ggplot(out_by_gender, aes(x = row_cat, y = percent_row, fill = col_cat))+
  geom_col(color = "black")+
  geom_text(data = out_by_gender %>% filter(col_cat == "Yes"),
            aes(label = paste0(round(percent_row,1),"%")),
            position = position_stack(vjust = .5), color = "white")+
  facet_wrap(~col_var, scales = "free_x", nrow = 1)+
  scale_y_continuous(breaks = seq(0, 100, 20), expand = c(.01,0))+
  scale_fill_manual(values = c("#00BA38", "grey", "grey", "#F8766D"))+
  labs(y = "Percentage (%)", fill = "Outcome Occured")+
  theme_bw()+
  theme(
    legend.position = "top",
    axis.title.x = element_blank(),
    strip.background = element_rect(fill = "white"),
    strip.text = element_text(size = 12, face = "bold", hjust = 0),
    axis.title.y = element_text(size = 12, face = "bold"),
    legend.title = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 11, colour = "black")
  )

Weekly trend of Outcomes

Given the number of NAs for hospitalization date and the dead date, I used the report creation date to study weekly trends of cases, hospitalization and dead.
To capture the weekly trend, instead of using week (1-53), ISO week or epidemiological week, I decided to floor report creation date to the nearest week. The resulting date version, does not make a lot of sense by its own, but it is very convenient for visualization. Below is a table illustrating the date flooring.

tibble(
  Original_Date = seq(dmy("08032026"), dmy("17032026"), "1 day")
) %>% 
  mutate(
    Floored_Date = floor_date(Original_Date, unit = "week")
  ) %>% 
  gt() %>% 
  fmt_date(date_style = "wday_day_month_year")

Original_Date	Floored_Date
Sunday 8 March 2026	Sunday 8 March 2026
Monday 9 March 2026	Sunday 8 March 2026
Tuesday 10 March 2026	Sunday 8 March 2026
Wednesday 11 March 2026	Sunday 8 March 2026
Thursday 12 March 2026	Sunday 8 March 2026
Friday 13 March 2026	Sunday 8 March 2026
Saturday 14 March 2026	Sunday 8 March 2026
Sunday 15 March 2026	Sunday 15 March 2026
Monday 16 March 2026	Sunday 15 March 2026
Tuesday 17 March 2026	Sunday 15 March 2026

Hospitalization

n Cases

…

p1 <- df %>% 
  mutate(
    hosp_week = floor_date(reprt_creationdt_FALSE, unit = "week")
  ) %>% 
  select(hosp_week, case_gender) %>% 
  na.omit() %>% 
  group_by(hosp_week, case_gender) %>% 
  count() %>% 
  ggplot(aes(x = hosp_week, y = n, fill = case_gender))+
  geom_col(color = "grey")+
  scale_y_continuous(breaks = seq(0, 3500, 500), expand = c(.01,0))+
  scale_x_date(date_breaks = "2 months", date_labels = "%b\n%Y",
               limits = c(ymd("2019-12-01"), ymd("2021-08-27")))+
  scale_fill_manual(values = c("#F8766D", "#00BA38", "#619CFF"))+
  labs(y = "n Cases", x = "Report Creation Date")+
  theme_bw()+
  theme(
    legend.position = "top",
    axis.title = element_text(size = 12, face = "bold"),
    legend.title = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 11, colour = "black")
  )

% of Hospitalization from n Cases

…

hospi_week <- df %>% 
  mutate(
    hosp_week = floor_date(reprt_creationdt_FALSE, unit = "week")
  ) %>% 
  group_by(hosp_week) %>% 
  mutate(n = n()) %>%
  group_by(hosp_week, n) %>% 
  summarise(n_hospi = sum(hospitalized == "Yes", na.rm = T), .groups = "drop") %>% 
  mutate(perc_hospi = (n_hospi*100)/n)


hospi_week_gender <- df %>% 
  mutate(
    hosp_week = floor_date(reprt_creationdt_FALSE, unit = "week")
  ) %>% 
  group_by(hosp_week, case_gender) %>% 
  mutate(n = n()) %>% 
  group_by(hosp_week, case_gender, n) %>% 
  summarise(n_hospi = sum(hospitalized == "Yes", na.rm = T), .groups = "drop") %>% 
  mutate(perc_hospi = (n_hospi*100)/n) %>% 
  ungroup() %>% na.omit()

p2 <- ggplot(hospi_week, aes(x = hosp_week, y = perc_hospi))+
  geom_line(aes(color = "_Global"))+
  geom_point(aes(color = "_Global"))+
  geom_line(data = hospi_week_gender, 
            aes(x = hosp_week, y = perc_hospi, colour = case_gender))+
  geom_point(data = hospi_week_gender, 
             aes(x = hosp_week, y = perc_hospi, colour = case_gender))+
  scale_y_continuous(breaks = seq(0, 100, 20), expand = c(.01,0))+
  scale_x_date(date_breaks = "2 months", date_labels = "%b\n%Y",
               limits = c(ymd("2019-12-01"), ymd("2021-08-27")))+
  scale_color_manual(values = c("black", "#F8766D", "#00BA38", "#619CFF"))+
  labs(y = "% Hospitalized from n Cases", x = "Report Creation Date")+
  theme_bw()+
  theme(
    legend.position = "top",
    axis.title = element_text(size = 12, face = "bold"),
    legend.title = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 11, colour = "black")
  )

n Deads

…

p3 <- df %>% 
  mutate(
    hosp_week = floor_date(reprt_creationdt_FALSE, unit = "week")
  ) %>% 
  select(hosp_week, case_gender, died) %>% 
  na.omit() %>% 
  group_by(hosp_week, case_gender, died) %>% 
  count() %>% 
  filter(died == "Yes") %>% 
  ggplot(aes(x = hosp_week, y = n, fill = case_gender))+
  geom_col(color = "grey")+
  scale_y_continuous(breaks = seq(0, 70, 10), expand = c(.01,0))+
  scale_x_date(date_breaks = "2 months", date_labels = "%b\n%Y",
               limits = c(ymd("2019-12-01"), ymd("2021-08-27")))+
  scale_fill_manual(values = c("#F8766D", "#00BA38"))+
  labs(y = "n Deaths", x = "Report Creation Date")+
  theme_bw()+
  theme(
    legend.position = "top",
    axis.title = element_text(size = 12, face = "bold"),
    legend.title = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 11, colour = "black")
  )

n Deads from Covid

p4 <- df %>% 
  mutate(
    hosp_week = floor_date(reprt_creationdt_FALSE, unit = "week")
  ) %>% 
  select(hosp_week, case_gender, died_covid) %>% 
  na.omit() %>% 
  group_by(hosp_week, case_gender, died_covid) %>% 
  count() %>% 
  filter(died_covid == "Yes") %>% 
  ggplot(aes(x = hosp_week, y = n, fill = case_gender))+
  geom_col(color = "grey")+
  scale_y_continuous(breaks = seq(0, 70, 10), expand = c(.01,0))+
  scale_x_date(date_breaks = "2 months", date_labels = "%b\n%Y",
               limits = c(ymd("2019-12-01"), ymd("2021-08-27")))+
  scale_fill_manual(values = c("#F8766D", "#00BA38"))+
  labs(y = "n Deaths from Covid", x = "Report Creation Date")+
  theme_bw()+
  theme(
    legend.position = "top",
    axis.title = element_text(size = 12, face = "bold"),
    legend.title = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 11, colour = "black")
  )

Bamm!!

p1+p2+p3+p4+
  plot_layout(ncol = 1)