This notebook is part of a GitHub repository: https://github.com/pessini/european-voters
MIT Licensed
Author: Leandro Pessini
Photo: The Irish Times — BREXIT: THE FACTS
This analysis investigates a dataset from the European Social Survey (ESS), a cross-national survey of the attitudes and behaviour of European citizens. The topics covered by the ESS are heterogeneous and include media and social trust, politics, immigration, citizen involvement, health and care, the economy, work and well-being.
The focus is on which factors can influence a person to vote for their country to leave or remain a member of the European Union. The variables selected are mostly socio-demographic, such as education, employment status and trade union membership status.
EDUYRS About how many years of education have you completed, whether full-time or part-time? Please report these in full-time equivalents and include compulsory years of schooling.
EISCED Generated variable: Highest level of education, ES - ISCED 9 (What is the highest level of education you have successfully completed?)
UEMP3M Have you ever been unemployed and seeking work for a period of more than three months?
MBTRU Are you or have you ever been a member of a trade union or similar organisation? IF YES, is that currently or previously?
VTEURMMB Imagine there were a referendum in [country] tomorrow about membership of the European Union. Would you vote for [country] to remain a member of the European Union or to leave the European Union?
YRBRN And in what year were you born?
ISCED is the reference international classification for organising education programmes and related qualifications by levels and fields. ISCED 2011 (levels of education) has been implemented in all EU data collections since 2014.
Levels
More info about ISCED can be found here.
# Change the default plots size
options(repr.plot.width=15, repr.plot.height=10)
options(warn=-1)
# Suppress summarise info
options(dplyr.summarise.inform = FALSE)
# Check if the packages that we need are installed
want = c("dplyr", "ggplot2", "ggthemes", "gghighlight", "foreign", "scales", "survey", "srvyr", "caret",
"ggpubr", "forcats")
have = want %in% rownames(installed.packages())
# Install the packages that we miss
if ( any(!have) ) { install.packages( want[!have] ) }
# Load the packages
junk <- lapply(want, library, character.only = TRUE)
# Remove the objects we created
rm(have, want, junk)
Selecting the variables which will be used for the data analysis
survey_rawdata <- read.spss("ESS9e03_1.sav", use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
variables <- c("cntry",
"eduyrs",
"eisced",
"uemp3m",
"mbtru",
"vteurmmb",
"yrbrn",
"agea",
"gndr",
"anweight",
"psu",
"stratum")
european_survey <- survey_rawdata[,variables]
head(european_survey)
paste0("Number of rows in the dataset: ", nrow(european_survey))
# Checking for NA's in the dataset
sapply(european_survey, function(x) sum(is.na(x)))
# For the purpose of this analysis, considering Vote as Leave or Remain
european_survey$vteurmmb <- as.character(european_survey$vteurmmb)
european_survey$vteurmmb[european_survey$vteurmmb == "Remain member of the European Union"] <- "Remain"
european_survey$vteurmmb[european_survey$vteurmmb == "Leave the European Union"] <- "Leave"
european_survey$vteurmmb[european_survey$vteurmmb == "Would submit a blank ballot paper"] <- NA
european_survey$vteurmmb[european_survey$vteurmmb == "Would spoil the ballot paper"] <- NA
european_survey$vteurmmb[european_survey$vteurmmb == "Would not vote"] <- NA
european_survey$vteurmmb[european_survey$vteurmmb == "Not eligible to vote"] <- NA
european_survey$vteurmmb <- as.factor(european_survey$vteurmmb)
# Cleaning responses that are not able to fit into ISCED
european_survey$eisced <- as.character(european_survey$eisced)
european_survey$eisced[european_survey$eisced == "Not possible to harmonise into ES-ISCED"] <- NA
european_survey$eisced[european_survey$eisced == "Other"] <- NA
# Cleaning NA values
df_european_survey <- european_survey[complete.cases(european_survey), ]
sapply(df_european_survey, function(x) sum(is.na(x)))
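The recoding and cleaning steps above can also be written with a named lookup vector, which keeps the whole mapping in one place. The sketch below uses a small made-up data frame; `toy`, `vote_map` and `toy_clean` are hypothetical names for illustration and are not part of the notebook.

```r
# Toy data standing in for the survey answers (values are illustrative only)
toy <- data.frame(
  cntry = c("IE", "IE", "IT", "IT", "CZ"),
  vote  = c("Remain member of the European Union",
            "Would not vote",
            "Leave the European Union",
            "Remain member of the European Union",
            "Not eligible to vote"),
  stringsAsFactors = FALSE
)

# Named vector: original label -> simplified label
vote_map <- c("Remain member of the European Union" = "Remain",
              "Leave the European Union"            = "Leave")

# Labels that are missing from the map come back as NA
toy$vote <- factor(unname(vote_map[toy$vote]))

# Keep only the rows without any NA
toy_clean <- toy[complete.cases(toy), ]
print(toy_clean)
```

Any answer that is not listed in the map becomes NA automatically, so the "blank ballot", "spoil" and "not eligible" cases do not need a line each.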
# Round-trip through character to drop unused factor levels, leaving only the Yes/No answers
df_european_survey$uemp3m <- as.character(df_european_survey$uemp3m)
df_european_survey$uemp3m <- as.factor(df_european_survey$uemp3m)
# Creating a new feature Education by aggregating the ISCED's levels
# Low, Medium and High Education
df_european_survey <- df_european_survey %>%
mutate(Education = case_when(
eisced == "ES-ISCED I , less than lower secondary" ~ "Low Education",
eisced == "ES-ISCED II, lower secondary" ~ "Low Education",
eisced == "ES-ISCED IIIb, lower tier upper secondary" ~ "Medium Education",
eisced == "ES-ISCED IIIa, upper tier upper secondary" ~ "Medium Education",
eisced == "ES-ISCED IV, advanced vocational, sub-degree" ~ "Medium Education",
eisced == "ES-ISCED V1, lower tertiary education, BA level" ~ "High Education",
eisced == "ES-ISCED V2, higher tertiary education, >= MA level" ~ "High Education",
TRUE ~ eisced))
df_european_survey$Education <- as.factor(df_european_survey$Education)
df_european_survey$eisced <- as.factor(df_european_survey$eisced)
# For the purpose of this analysis, considering the answer if the respondent ever been a member
# of a trade union or similar organisation - "Yes, currently" and "Yes, previously" as simple Yes
df_european_survey$mbtru <- as.character(df_european_survey$mbtru)
df_european_survey$mbtru[df_european_survey$mbtru == "Yes, currently"] <- "Yes"
df_european_survey$mbtru[df_european_survey$mbtru == "Yes, previously"] <- "Yes"
df_european_survey$mbtru <- as.factor(df_european_survey$mbtru)
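# Transforming the variable Years of Education to numeric
# (via as.character, so that a factor yields its labelled values rather than its level codes)
df_european_survey$eduyrs <- as.numeric(as.character(df_european_survey$eduyrs))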
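A note on converting factors to numbers: when a column read from SPSS arrives as a factor, `as.numeric()` returns the internal level codes, not the values written in the labels. The sketch below shows the difference on a made-up vector (`years`, `codes` and `values` are hypothetical names), assuming the labels are plain numbers.

```r
# A factor whose labels are numbers, as can happen with labelled SPSS data
years <- factor(c("12", "9", "16", "12"))

codes  <- as.numeric(years)                # internal level codes
values <- as.numeric(as.character(years))  # the numbers the labels spell out

print(codes)
print(values)
```

Going through `as.character()` first is harmless when the column is already numeric, so it is a safe default for both cases.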
# Creating a new feature with age bands (<20, 20-39, 40-65, >65)
df_european_survey$agea <- as.numeric(as.character(df_european_survey$agea))
df_european_survey <- df_european_survey %>%
mutate(Age_Band = case_when(
agea < 20 ~ "<20",
agea >= 20 & agea < 40 ~ "20-39",
agea >= 40 & agea <= 65 ~ "40-65",
agea > 65 ~ ">65"))
df_european_survey$Age_Band <- as.factor(df_european_survey$Age_Band)
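The same banding can be sketched with base R's `cut()`, which also keeps the bands in their natural order instead of the alphabetical order that `as.factor()` produces. This is an alternative for illustration, not the notebook's method; it assumes ages are whole years, and `ages` and `age_band` are hypothetical names.

```r
# Illustrative ages, including the boundary values of each band
ages <- c(17, 20, 39, 40, 65, 66, 80)

# right = FALSE makes each interval closed on the left: [20, 40), [40, 66), ...
age_band <- cut(ages,
                breaks = c(-Inf, 20, 40, 66, Inf),
                labels = c("<20", "20-39", "40-65", ">65"),
                right  = FALSE)

print(data.frame(ages, age_band))
```

With whole-year ages the break at 66 reproduces the `agea <= 65` rule used above.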
Conventionally there are four main geographical regions or subregions in Europe.
Northern Europe refers to the portion of Europe to the north of Western Europe, the English Channel, and the Baltic Sea; it also includes the Baltic republics of Estonia, Latvia, and Lithuania.
Western Europe is bounded by the Atlantic Ocean in the west, the English Channel and the North Sea to the north, and the Alps in the south.
Conventionally Eastern Europe is the geographical region east of Germany and west of the Ural Mountains. The United Nations geo-scheme lists ten countries including the former Eastern bloc countries of Poland, Czechia, and Slovakia (formerly Czechoslovakia), Hungary, Romania, and Bulgaria, the former Soviet republics of Belarus and Ukraine, as well as European Russia.
Southern Europe or Mediterranean Europe refers to the mainly subtropical southern portion of the continent. The region is bounded by the Mediterranean Sea in the south. There are 13 sovereign countries in Southern Europe; seven of those states are members of the European Union.
northern <- c("Denmark","Finland","Ireland","Latvia","Lithuania","Sweden")
western <- c("Austria","Belgium","France","Germany","Netherlands")
eastern <- c("Bulgaria","Czechia","Hungary","Poland","Slovakia")
southern <- c("Slovenia","Cyprus","Spain","Croatia","Italy","Portugal")
df_european_survey <- df_european_survey %>% mutate(Region = case_when(cntry %in% northern ~ "Northern Europe",
cntry %in% western ~ "Western Europe",
cntry %in% eastern ~ "Eastern Europe",
cntry %in% southern ~ "Southern Europe",
TRUE ~ "Europe"))
Surveys often use complex sample designs and weighting adjustments so that the sample resembles the intended population more closely. As the ESS is a cross-national survey and countries implement different sample designs, weights should be used in all analyses to account for the country context and avoid biased results.
Post-stratification weights are intended to reduce the impact of coverage, sampling and nonresponse error. This weight is based on gender, age, education and geographical region.
Cluster sampling generally produces less precise population estimates than a simple random sample of the same size, because respondents within a cluster tend to resemble one another. Ignoring the clustering makes survey results appear more precise than they are. To address this problem ESS uses Clustering Adjustments.
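The effect of clustering on precision is often summarised with Kish's design effect, deff = 1 + (m - 1) * rho, where m is the average number of respondents per cluster and rho is the intraclass correlation. The sketch below is a worked example with made-up numbers; `n`, `m`, `rho`, `deff` and `n_eff` are illustrative names and values, not ESS figures.

```r
n   <- 2000   # number of respondents (illustrative)
m   <- 10     # average respondents per cluster (illustrative)
rho <- 0.05   # intraclass correlation within clusters (illustrative)

# Kish's approximation of the design effect for a clustered sample
deff <- 1 + (m - 1) * rho

# Size of the simple random sample that would give the same precision
n_eff <- n / deff

cat("design effect:", deff, "\n")
cat("effective sample size:", round(n_eff), "\n")
```

A design effect above 1 means the clustered sample carries less information than its nominal size suggests, which is why the design variables are passed to the variance estimation.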
According to ESS documentation:
It is recommended that by default you should always use anweight (analysis weight) as a weight in all analysis. This weight is suitable for all types of analysis, including when you are studying just one country, when you compare across countries, or when you are studying groups of countries.
anweight corrects for differential selection probabilities within each country as specified by sample design, for nonresponse, for noncoverage, and for sampling error related to the four post-stratification variables, and takes into account differences in population size across countries.
Details about how ESS weights the data can be found here.
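To see why weights matter, the sketch below compares an unweighted and a weighted proportion on a made-up sample. The data and the names `toy`, `is_leave`, `unweighted` and `weighted` are hypothetical; real analyses should use anweight together with the design variables.

```r
# Five illustrative respondents with made-up analysis weights
toy <- data.frame(
  vote   = c("Leave", "Remain", "Remain", "Leave", "Remain"),
  weight = c(0.5, 2.0, 1.5, 0.5, 0.5)
)

is_leave <- as.numeric(toy$vote == "Leave")

# Every respondent counts equally
unweighted <- mean(is_leave)

# Every respondent counts in proportion to their weight
weighted <- weighted.mean(is_leave, w = toy$weight)

cat("unweighted share of Leave:", unweighted, "\n")
cat("weighted share of Leave:  ", weighted, "\n")
```

Here the Leave voters carry small weights, so their weighted share is half the unweighted one.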
Two R packages help with complex survey designs: survey and srvyr.
In the ESS dataset the clustering variable is psu, stratification is indicated by stratum, and weighting by anweight.
The srvyr library, which is based on survey, brings a dplyr-style syntax.
weighted_df_ess <- df_european_survey %>% as_survey_design(ids=psu, strata=stratum, weights=anweight)
# Lonely PSUs - http://r-survey.r-forge.r-project.org/survey/exmample-lonely.html
options(survey.lonely.psu = "adjust")
weighted_df_ess
# Classifying happiness with EU by splitting countries with more than 15% of voting to Leave as Unfavorable
happiness_EU <- weighted_df_ess %>%
group_by(cntry,vteurmmb) %>%
summarise(proportion = survey_mean()) %>%
filter(vteurmmb == "Leave") %>%
mutate(EU_Opinion = ifelse(proportion < .16, "Favorable", "Unfavorable")) %>%
group_by(EU_Opinion) %>% summarise(total = n()) %>%
mutate(prop = total / sum(total),
label = paste0(round(total / sum(total) * 100, 0), "%"),
label_y = cumsum(prop) - 0.5 * prop)
happiness_overview <- happiness_EU %>%
ggplot(aes(x = "", y = prop)) +
geom_bar(aes(fill = fct_reorder(EU_Opinion, prop, .desc = FALSE)),
stat = "identity", width = .5, alpha=.9) +
coord_flip() +
scale_fill_manual(values = c("#67a9cf", "#ef8a62")) +
geom_text(aes(y = label_y, label = paste0(label, "\n", EU_Opinion)), size = 8, col = "white", fontface = "bold") +
labs(x = "", y = "%",
title = "How happy are member nations with the European Union?",
subtitle = "Considering more than 15% of votes to Leave the EU as Unfavorable view") +
theme_void() +
theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
theme(legend.position = "none",
plot.title=element_text(vjust=.8,family='', face='bold', colour='#636363', size=25),
plot.subtitle=element_text(vjust=.8,family='', face='bold', colour='#636363', size=15))
#ef8a62 - Happy
#67a9cf - Not so Happy
options(repr.plot.width=12, repr.plot.height=5)
happiness_overview
The majority of countries surveyed have shown a favorable view regarding the European Union. However, not everyone is happy with the institution. Of the 22 EU member countries surveyed, about 32% are classified as holding an unfavorable view, meaning that more than 15% of their respondents would vote to Leave.
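The percentages in the bar are shares of countries, not of respondents, because the classification step counts one row per country. A minimal sketch with made-up Leave shares shows the calculation; `leave_share`, `opinion` and `share_by_opinion` are hypothetical names, and the 0.16 cut-off mirrors the one used in the code above.

```r
# Made-up Leave proportions for four fictional countries
leave_share <- c(A = 0.05, B = 0.12, C = 0.18, D = 0.22)

# Same rule as above: below 0.16 is Favorable, otherwise Unfavorable
opinion <- ifelse(leave_share < 0.16, "Favorable", "Unfavorable")

# Share of countries falling in each class
share_by_opinion <- prop.table(table(opinion))
print(share_by_opinion)
```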
# changing the global plot size back
options(repr.plot.width=15, repr.plot.height=10)
countries_by_Vote_Leave <- weighted_df_ess %>% group_by(cntry,vteurmmb) %>%
summarise(total = survey_total(), prop = survey_mean()) %>%
filter(vteurmmb == "Leave") %>%
arrange(desc(prop)) %>%
head(15)
countries_by_Vote_Leave %>%
mutate(factor(cntry, levels = .$cntry),
label = paste0(round(prop * 100, 0), "%")) %>%
ggplot(aes(x=reorder(cntry,prop), y=prop)) +
geom_segment(aes(xend = cntry, yend = 0), color = "#67a9cf", size=1.2) +
geom_point(size = 18, color="#67a9cf") +
geom_text(fontface="bold", color = "white", size = 5, aes(label = label)) +
geom_hline(aes(yintercept = .20), colour = "#8da0cb", linetype ="longdash", size = .8) +
annotate("text", x = 12.5, y = .23, family='', fontface='bold', colour='#636363', size=7,
label = "2 countries \n with more than 20%") +
scale_y_continuous(labels = scales::percent) +
labs(x = "", y = "",
title = "Countries with the highest proportion of votes to Leave the EU",
subtitle = "% is approximate") +
theme_minimal() + coord_flip() +
theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.text.y = element_text(face="bold", color="#636363", size=18),
plot.title=element_text(vjust=1.5,family='', face='bold', colour='#636363', size=25),
plot.subtitle=element_text(vjust=1.5,family='', face='bold', colour='#636363', size=15))
The countries with the highest proportion of votes to leave the EU are Czechia, Italy, France, Finland and Cyprus. In each of them, more than 17% of respondents are inclined to vote for their country to Leave the EU in a hypothetical referendum.
Czechia and Italy are the only countries with more than 20% of voting intentions to Leave the EU.
countries_by_Vote_Remain <- weighted_df_ess %>% group_by(cntry,vteurmmb) %>%
summarise(total = survey_total(), prop = survey_mean()) %>%
filter(vteurmmb == "Remain") %>%
arrange(desc(prop)) %>%
head(15)
countries_by_Vote_Remain %>%
mutate(factor(cntry, levels = .$cntry),
label = paste0(round(prop * 100, 0), "%")) %>%
ggplot(aes(x=reorder(cntry,prop), y=prop)) +
geom_segment(aes(xend = cntry, yend = 0), color = "#ef8a62", size=1.2) +
geom_point(size = 18, color="#ef8a62") +
geom_text(fontface="bold", color = "white", size = 5, aes(label = label)) +
scale_y_continuous(labels = scales::percent) +
labs(x = "", y = "",
title = "Countries with the highest proportion of votes to Remain member of the EU",
subtitle = "% is approximate") +
theme_minimal() + coord_flip() +
theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.text.y = element_text(face="bold", color="#636363", size=18),
plot.title=element_text(vjust=1.5,family='', face='bold', colour='#636363', size=25),
plot.subtitle=element_text(vjust=1.5,family='', face='bold', colour='#636363', size=15))
In contrast, in Poland, Ireland, Spain, Portugal and Lithuania more than 92% of voting intentions are to Remain a member of the EU.
weighted_df_ess %>%
group_by(Region,vteurmmb) %>%
summarise(total = round(survey_total(),2), proportion = round(survey_mean(),2)) %>%
mutate(label = paste0(round(proportion * 100, 2), "%"),
label_y = cumsum(proportion) - 0.5 * proportion) %>%
ggplot(aes(x= fct_reorder2(Region, vteurmmb, proportion, .desc = FALSE), y=proportion)) +
geom_bar(aes(fill=vteurmmb), position = position_stack(reverse = TRUE) ,stat="identity", width = .4) +
scale_fill_manual(values = c("#67a9cf", "#ef8a62")) +
scale_y_continuous(labels = scales::percent) +
coord_flip() +
geom_text(aes(y=label_y, label = paste0(label, "\n", vteurmmb)),
col = "white",
size = 6,
fontface = "bold") +
labs(x = "", y = "", fill = "",
title = "Voting intention on European Regions")+
theme_minimal() +
theme(legend.position = "none",
axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.text.y = element_text(face="bold", color="#636363", size=18),
axis.title.y = element_blank(),
plot.title=element_text(vjust=.5,family='', face='bold', colour='#636363', size=25),
plot.subtitle=element_text(vjust=.5,family='', face='bold', colour='#636363', size=15))
Northern and Southern Europe are the regions with the highest EU rejection rate, each with 14% of voting intentions to Leave the Brussels-based institution.
Curiously, those regions are also home to four of the countries with the highest voting intention to Remain.
Countries with the happiest citizens regarding the EU:
Ireland and Lithuania from the Northern region; Spain and Portugal from the Southern region.
Eastern Europe is the region that holds the most favorable views of the European Union. But not all countries in the region are happy: Czechia, which belongs to it, showed the highest voting intention to Leave the EU.
weighted_df_ess %>%
group_by(gndr,Age_Band) %>%
summarise(total = round(survey_total(),2), proportion = survey_mean()) %>%
mutate(label = paste0(round(proportion * 100, 2), "%"),
label_y =