Exploratory Data Analysis on the Student Alcohol Consumption dataset (Code)
This post carries out, in code, the analysis explained in this blog post. The analysis was done as part of the Data Mining course at Multimedia University. The contributors were Aaron Patrick Nathaniel, Lim Yue Hng (Neil) and Nicolas Raj.
Libraries and loading the .csv file
The .csv file we chose to explore is student-mat.csv.
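As a rough sketch of the loading step, the snippet below reads a CSV into a data frame with base R. To keep it runnable without the real file, it first writes a few made-up rows to a temporary file; the `toy` rows, the `path` variable and the `students` name are illustrative assumptions, not the original script. It also assumes a comma-separated file. Some copies of this dataset are semicolon-separated, in which case `sep = ";"` is needed.

```r
# Made-up rows standing in for student-mat.csv (illustrative values only).
toy <- data.frame(
  sex  = c("F", "M", "F"),
  Dalc = c(1L, 2L, 1L),
  G3   = c(10L, 12L, 0L)
)

# Write the toy rows to a temporary file so the example runs anywhere.
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Assumption: comma-separated input. For a semicolon-separated copy,
# use read.csv(path, sep = ";") instead.
students <- read.csv(path, stringsAsFactors = FALSE)

# Inspect column names and types before doing anything else.
str(students)
```

With the real data, `path` would simply point at student-mat.csv.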
Preprocessing - Checking For Special Values
We used a function to check whether the dataset contains any special values. The function found none in the entire dataset, so no cleaning was required.
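The post does not show the checking function, so the following is only one possible sketch. It assumes "special values" means `NA`, `NaN` and infinite entries, and the helper name `count_special` is hypothetical. The toy data frame deliberately contains one special value per column to show what a non-clean result looks like.

```r
# Hypothetical helper: count NA/NaN/Inf entries in each column.
count_special <- function(df) {
  sapply(df, function(col) {
    if (is.numeric(col)) {
      # is.na() is TRUE for NaN as well as NA; Inf needs its own check.
      sum(is.na(col) | is.infinite(col))
    } else {
      sum(is.na(col))
    }
  })
}

# Made-up data with one special value planted in each column.
toy <- data.frame(
  G1     = c(10, NA, 12),
  G2     = c(11, 13, Inf),
  school = c("GP", "MS", NA),
  stringsAsFactors = FALSE
)

special <- count_special(toy)
print(special)

# A dataset is "clean" in this sense when every count is zero.
is_clean <- all(special == 0)
```

On a dataset with no special values, every count would be zero and `is_clean` would be `TRUE`.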
Preprocessing - Finding Outliers In Grades
Now we want to check for outliers in the columns that hold the First Period Grade (G1), Second Period Grade (G2) and Final Period Grade (G3), with respect to workday alcohol consumption (Dalc). We pay the most attention to G3’s boxplot because we think the final grade is what matters most to students, so they will put more effort into a good final grade than into a first or second period grade.
There seem to be many outliers in G3’s data.
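For readers who want the outliers as numbers rather than points on a plot, here is a minimal sketch using base R's `boxplot.stats()`, which applies the same 1.5 × IQR whisker rule that `boxplot()` uses by default. The toy grades are made up, and grouping with `split()` is our choice for illustration, not necessarily how the original script did it.

```r
# Made-up final grades for two Dalc levels (illustrative values only).
toy <- data.frame(
  Dalc = c(rep(1, 8), rep(2, 6)),
  G3   = c(10, 11, 12, 11, 10, 12, 11, 0,
           9, 10, 11, 10, 9, 10)
)

# For each Dalc level, list the G3 values that fall beyond the whiskers.
outliers <- lapply(
  split(toy$G3, toy$Dalc),
  function(g) boxplot.stats(g)$out
)
print(outliers)

# The matching plot would be drawn with:
# boxplot(G3 ~ Dalc, data = toy, xlab = "Dalc", ylab = "G3")
```

In this toy data only the grade of 0 in the Dalc = 1 group is flagged.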
Comparing mean grades of G1, G2 and G3 with respect to Dalc
We will now compare the mean grades of G1, G2 and G3 with respect to the workday alcohol consumption levels (Dalc). Our main focus will again be G3. When you hover over the bar chart, please ignore the Dalc values shown. The Dalc value within each group of bars is supposed to be the same.
It seems that the mean G3 grade of students who consume alcohol at the lowest level (1) is quite close to that of students who consume alcohol at the highest level (5). Why is that so?
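The group means behind such a chart can be computed with base R's `aggregate()`. This is a sketch on made-up grades, not the original code; the commented `barplot()` call shows a static alternative to the interactive chart.

```r
# Made-up grades for three Dalc levels (illustrative values only).
toy <- data.frame(
  Dalc = c(1, 1, 2, 2, 5, 5),
  G1   = c(10, 12, 9, 11, 8, 10),
  G2   = c(11, 13, 10, 10, 9, 9),
  G3   = c(12, 0, 10, 12, 8, 10)
)

# One row per Dalc level, one mean per grade column.
means <- aggregate(cbind(G1, G2, G3) ~ Dalc, data = toy, FUN = mean)
print(means)

# A grouped bar chart needs one column per Dalc level:
# barplot(t(as.matrix(means[, c("G1", "G2", "G3")])),
#         beside = TRUE, names.arg = means$Dalc, legend.text = TRUE)
```

Note how the single G3 value of 0 in the toy Dalc = 1 group pulls that group's mean G3 well below its G1 and G2 means.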
It looks like an abnormally large number of students have a Dalc of 1 and a G3 value of 0. This significantly lowers the mean value of G3 when Dalc = 1. However, we cannot remove these values because we cannot be sure whether they are errors or genuine results. There may be other factors unrelated to alcohol consumption that affect the students’ grades.
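One way to see how much the zero grades weigh on each group is to count them per Dalc level and compare the group mean with and without them. The sketch below does this on made-up data purely as a diagnostic; in line with the reasoning above, nothing is removed from the dataset itself.

```r
# Made-up data with zeros concentrated at Dalc = 1 (illustrative only).
toy <- data.frame(
  Dalc = c(1, 1, 1, 1, 2, 2, 5, 5),
  G3   = c(0, 0, 12, 14, 10, 11, 9, 0)
)

# Number of zero final grades at each Dalc level.
zero_counts <- tapply(toy$G3 == 0, toy$Dalc, sum)

# Mean G3 per level, with all rows and with zero grades set aside.
mean_all     <- tapply(toy$G3, toy$Dalc, mean)
nonzero      <- toy[toy$G3 > 0, ]
mean_nonzero <- tapply(nonzero$G3, nonzero$Dalc, mean)

print(data.frame(
  Dalc         = names(mean_all),
  zeros        = as.vector(zero_counts),
  mean_all     = as.vector(mean_all),
  mean_nonzero = as.vector(mean_nonzero)
))
```

A large gap between the two means for a level signals that zeros, rather than the typical grade, are driving that level's average.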
Associating Family Relationship With Alcohol Consumption Levels
As stated in the previous section, our hypothesis is that people who have bad family relationships tend to drink more. Let us check whether the data supports this hypothesis.
As we can see, the general trend is decreasing as the quality of family relationships (famrel) increases. This suggests that students who have a close relationship with their family members tend to consume less alcohol, on workdays and weekends alike.
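The averages behind this kind of trend can be sketched as follows, using the dataset's column names `famrel`, `Dalc` and `Walc` (Walc being the weekend consumption column) on made-up values. The toy numbers were chosen to show a falling trend and are not results from the real data.

```r
# Made-up consumption levels for famrel 1 (worst) to 5 (best).
toy <- data.frame(
  famrel = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  Dalc   = c(3, 2, 2, 2, 2, 1, 1, 2, 1, 1),
  Walc   = c(4, 3, 3, 3, 3, 2, 2, 2, 2, 1)
)

# Mean workday and weekend consumption at each famrel level.
by_famrel <- aggregate(cbind(Dalc, Walc) ~ famrel, data = toy, FUN = mean)
print(by_famrel)

# A quick line plot of both trends could be drawn with:
# matplot(by_famrel$famrel, by_famrel[, c("Dalc", "Walc")], type = "b",
#         xlab = "famrel", ylab = "mean consumption level")
```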
Associating Relationship Status With Alcohol Consumption Levels
Based on the research we mentioned in the previous section, people with an intimate relationship with their significant other should have lower alcohol consumption levels. However, our dataset only states whether or not the student has a romantic partner. This does not represent the level of intimacy between them. We therefore make this comparison under the assumption that their level of intimacy is high.
The level of alcohol consumption for students who have romantic partners (Couple) is similar to that of students without romantic partners (Single). The mean consumption levels of couples and singles differ by only 0.03 for both workday and weekend alcohol consumption, with couples higher in the former and singles higher in the latter.
This could mean that romantic relationships in high school should not be assumed to be as intimate as adult relationships. In our dataset, alcohol consumption levels barely differ by relationship status.
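A comparison like this boils down to two group means and their difference. The sketch below assumes the relationship column is called `romantic` and holds "yes"/"no" values, which it relabels as Couple and Single; the data are made up, so the printed gap is not the 0.03 reported above.

```r
# Made-up rows; "romantic" with yes/no values is an assumed column layout.
toy <- data.frame(
  romantic = c("yes", "no", "yes", "no", "no", "yes"),
  Dalc     = c(1, 2, 2, 1, 1, 1),
  Walc     = c(2, 3, 2, 2, 1, 3),
  stringsAsFactors = FALSE
)

# Relabel to match the chart's Couple / Single wording.
toy$status <- ifelse(toy$romantic == "yes", "Couple", "Single")

# Mean workday and weekend consumption for each group.
couple <- colMeans(toy[toy$status == "Couple", c("Dalc", "Walc")])
single <- colMeans(toy[toy$status == "Single", c("Dalc", "Walc")])

# Positive values mean couples drink more, negative mean singles do.
gap <- couple - single
print(round(gap, 2))
```

A difference in means on its own says nothing about statistical significance; a test such as `t.test()` would be needed for that claim.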
Associating Health With Alcohol Consumption Levels
In the previous section we made the guess that people who are not healthy (health = 1) would drink less than people who are healthy (health = 5) to prevent their health from deteriorating even further.
Although our initial guess holds, the trend is not strictly increasing for either workday or weekend alcohol consumption. Still, for the most part, healthier people consume higher levels of alcohol.
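Whether a trend rises at every step or only overall can be checked directly from the group means. The following sketch does so for weekend consumption (`Walc`) on made-up data that mimics the pattern described: higher at health = 5 than at health = 1, but with a dip along the way. The variable names are illustrative.

```r
# Made-up weekend consumption for health 1 (very bad) to 5 (very good).
toy <- data.frame(
  health = rep(1:5, each = 2),
  Walc   = c(1, 2, 2, 2, 2, 1, 3, 2, 3, 3)
)

# Mean Walc at each health level, ordered by health.
by_health <- aggregate(Walc ~ health, data = toy, FUN = mean)

# Change in the mean from one health level to the next.
steps <- diff(by_health$Walc)

# TRUE only if the mean never drops between consecutive levels.
never_drops <- all(steps >= 0)

# TRUE if the healthiest group still ends above the least healthy one.
ends_higher <- tail(by_health$Walc, 1) > head(by_health$Walc, 1)

print(by_health)
print(c(never_drops = never_drops, ends_higher = ends_higher))
```

The same check applies to `Dalc` by swapping the column name in the formula.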
The R script for the exploratory exercise can be found here.