Student Performance EDA · kaggle

The challenge

"Which levers move grades?" Demographic features (gender, parent_education) sit alongside behavioral ones (study hours, attendance, participation, sleep). The deliberate analytical choice in this repo is to cluster on behavioral features only, using K-means with demographics left out of the inputs so the personas reflect habit patterns rather than background, and to predict overall_score with a Random Forest.

Data schema

data schema student_performance_data.csv

column	type	note
gender	categorical	encoded via OneHot drop-first
internet_access	categorical	Yes / No
extra_classes	categorical	Yes / No
parent_education	categorical	multi-level education ladder
study_hours_per_day	numeric	behavioral · clustering input
attendance_percentage	numeric	behavioral · clustering input
participation_score	numeric	behavioral · clustering input
sleep_hours	numeric	behavioral · clustering input
assignment_score	numeric	performance · feature for RF
midterm_score	numeric	performance · feature for RF
final_exam_score	numeric	performance · feature for RF
overall_score	target	RandomForestRegressor target
grade	categorical	A · B · C · D · F ordered, used in box plots

The split between behavioral and performance features is the conceptual hinge. Behavior gets clustered; performance gets predicted; the grade label drives the visual EDA.

Dataset spotlight

grade distribution · 10,000 students

chart data

label	value
grade A	154 students
grade B	2,704 students
grade C	5,073 students
grade D	2,008 students
grade F	61 students

Class balance of the grade label. Roughly bell-shaped: the A and F tails are small (154 and 61 students), and B, C, and D dominate.
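
To put numbers on that balance, the counts can be turned into percentage shares. This is a minimal sketch that only uses the five counts listed in the chart data; the variable names are illustrative and not taken from the repo.

```python
# Grade counts copied from the chart data above.
grade_counts = {"A": 154, "B": 2704, "C": 5073, "D": 2008, "F": 61}

total = sum(grade_counts.values())  # 10,000 students

# Share of each grade as a percentage of all students.
grade_share = {g: 100 * n / total for g, n in grade_counts.items()}

for grade, pct in grade_share.items():
    print(f"grade {grade}: {pct:5.2f}%")

# Combined share of the two rare tail grades (A and F).
tail_share = grade_share["A"] + grade_share["F"]
print(f"A + F tails: {tail_share:.2f}%")
```

About half of the students sit in grade C and only about 2% in the A and F tails combined, so any per-grade plot rests on thin A and F groups.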

feature × feature correlation

[heatmap: 8 × 8 Pearson correlation matrix over study, attend, assign, midterm, final, particip, sleep, overall · color scale −1 to +1 · all cell values are listed in the chart data below]

chart data

pair	correlation
study x study	1.00
study x attend	-0.01
study x assign	-0.01
study x midterm	0.01
study x final	0.00
study x particip	-0.01
study x sleep	-0.01
study x overall	-0.00
attend x study	-0.01
attend x attend	1.00
attend x assign	-0.02
attend x midterm	-0.02
attend x final	-0.01
attend x particip	0.01
attend x sleep	-0.01
attend x overall	0.15
assign x study	-0.01
assign x attend	-0.02
assign x assign	1.00
assign x midterm	0.01
assign x final	0.01
assign x particip	-0.01
assign x sleep	-0.00
assign x overall	0.40
midterm x study	0.01
midterm x attend	-0.02
midterm x assign	0.01
midterm x midterm	1.00
midterm x final	-0.00
midterm x particip	0.00
midterm x sleep	0.02
midterm x overall	0.53
final x study	0.00
final x attend	-0.01
final x assign	0.01
final x midterm	-0.00
final x final	1.00
final x particip	-0.01
final x sleep	-0.01
final x overall	0.69
particip x study	-0.01
particip x attend	0.01
particip x assign	-0.01
particip x midterm	0.00
particip x final	-0.01
particip x particip	1.00
particip x sleep	-0.01
particip x overall	0.24
sleep x study	-0.01
sleep x attend	-0.01
sleep x assign	-0.00
sleep x midterm	0.02
sleep x final	-0.01
sleep x particip	-0.01
sleep x sleep	1.00
sleep x overall	-0.00
overall x study	-0.00
overall x attend	0.15
overall x assign	0.40
overall x midterm	0.53
overall x final	0.69
overall x particip	0.24
overall x sleep	-0.00
overall x overall	1.00

The bottom row is the only one that matters. final_exam (0.69) and midterm (0.53) drive overall_score; assignment (0.40) and participation (0.24) help; attendance (0.15) is mild. Sleep and study hours are uncorrelated with overall_score in this synthetic dataset — the assumed behavioral lever isn't really a lever.
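
Pulling out just that bottom row is a one-liner in pandas. The sketch below is not the repo's code: it builds a small hypothetical frame with random columns and an assumed weighting for overall_score, only to show the mechanics of computing the Pearson matrix and keeping the target's row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for student_performance_data.csv: independent random
# columns, plus a target built from the three score columns only.
rng = np.random.default_rng(42)
n = 2_000
df = pd.DataFrame({
    "study_hours_per_day": rng.uniform(1, 10, n),
    "sleep_hours": rng.uniform(4, 9, n),
    "assignment_score": rng.uniform(50, 100, n),
    "midterm_score": rng.uniform(40, 100, n),
    "final_exam_score": rng.uniform(30, 100, n),
})
# Assumed weights, for illustration only; not the dataset's real formula.
df["overall_score"] = (0.2 * df["assignment_score"]
                       + 0.3 * df["midterm_score"]
                       + 0.5 * df["final_exam_score"])

# Full Pearson matrix, then keep only the target's row, strongest first.
corr = df.corr(method="pearson")
target_row = (corr["overall_score"]
              .drop("overall_score")
              .sort_values(ascending=False))
print(target_row.round(2))
```

In this toy frame the columns that feed the target rank first and the independent ones hover near zero, which is the same pattern the real heatmap shows for study hours and sleep.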

distribution of study_hours_per_day

chart data

label	value
0–1 h	0 students
1–2 h	1,118 students
2–3 h	1,138 students
3–4 h	1,103 students
4–5 h	1,144 students
5–6 h	1,161 students
6–7 h	1,063 students
7–8 h	1,069 students
8–9 h	1,092 students
9–10 h	1,112 students

Self-reported study hours per day, binned by the hour. Roughly uniform across the whole 1–10 h range (about 1,060 to 1,160 students per bin), with no students below 1 h and no obvious bimodal split between 'studiers' and 'non-studiers'.

distribution of sleep_hours per night

chart data

label	value
0–1 h	0 students
1–2 h	0 students
2–3 h	0 students
3–4 h	0 students
4–5 h	2,042 students
5–6 h	1,958 students
6–7 h	1,968 students
7–8 h	1,951 students
8–9 h	2,072 students
9–10 h	9 students
10–11 h	0 students
11–12 h	0 students

Sleep is spread almost evenly between 4 and 9 hours (about 2,000 students per bin), with only 9 students in the 9–10 h bin. No values below 4 h or above 10 h.
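
The one-hour bins behind both distribution charts can be reproduced with fixed bin edges. This is a minimal sketch using numpy.histogram on a handful of made-up values; how the original charts were actually binned is not stated, so treat the edges as an assumption.

```python
import numpy as np

# Hypothetical sample of sleep_hours values (not the real data).
sleep_hours = np.array([4.0, 4.5, 5.2, 6.9, 7.0, 8.3, 9.0])

# One-hour bin edges 0, 1, ..., 12, matching the labels "0–1 h" ... "11–12 h".
edges = np.arange(0, 13)

# Bins are half-open [lo, hi), except the last one, which includes its right edge.
counts, _ = np.histogram(sleep_hours, bins=edges)

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo}–{hi} h\t{c} students")
```

With half-open bins, a value of exactly 9.0 lands in the 9–10 h bin rather than 8–9 h. That is one possible explanation for a nearly empty top bin, but it has not been checked against the CSV.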

Analytical pipeline

one CSV, four analyses

01 · load · no train/test split. EDA fits on the whole table; the goal is structure, not generalization.
02 · RF importance · predicts overall_score. All 11 input columns (7 numeric, 4 categorical; one-hot encoding expands the categoricals) → score_feature_importance.png
03 · K-means k=3 + PCA · behavioral subset only. Clusters on (study, attendance, participation, sleep). PCA scatter → student_clusters_pca.png
04 · correlation heatmap · 7 numerical + target. Pearson, vlag cmap → student_correlation.png
05 · grade box plots · study & sleep vs grade. Two box plots side by side, hue per grade → grade_distributions.png

Modeling choices

analytical hyperparameters

preprocessing

numeric pipeline: StandardScaler
categorical pipeline: OneHotEncoder · drop='first' · sparse_output=False
transformer: ColumnTransformer (sklearn)

score predictor

model: RandomForestRegressor
n_estimators: 100
random_state: 42
n_jobs: -1

behavioral clustering

algorithm: K-Means
k: 3
n_init: 10
random_state: 42
features: study_hours · attendance · participation · sleep
projection: PCA (n_components=2)

Cluster profiles (mean of behavioral features + overall_score) are printed to stdout (analyze_student.py lines 107–109).
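
The clustering settings above can be sketched end to end as follows. The data here is random and hypothetical, and scaling the behavioral features before K-means is an assumption made for this sketch; the excerpted source does not show how the repo prepares the clustering inputs.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

behavioral_cols = ["study_hours_per_day", "attendance_percentage",
                   "participation_score", "sleep_hours"]

# Hypothetical random data standing in for the CSV.
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "study_hours_per_day": rng.uniform(1, 10, n),
    "attendance_percentage": rng.uniform(50, 100, n),
    "participation_score": rng.uniform(0, 10, n),
    "sleep_hours": rng.uniform(4, 9, n),
    "overall_score": rng.uniform(40, 100, n),
})

# Assumption: scale first so no feature dominates the Euclidean distance.
X_behavior = StandardScaler().fit_transform(df[behavioral_cols])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(X_behavior)

# 2-D projection for the scatter plot; the clustering used all four features.
coords = PCA(n_components=2).fit_transform(X_behavior)
df["pc1"], df["pc2"] = coords[:, 0], coords[:, 1]

# Cluster profiles: mean of behavioral features plus overall_score.
profiles = df.groupby("cluster")[behavioral_cols + ["overall_score"]].mean()
print(profiles.round(2))
```

Note that overall_score appears only in the profile table, not in the clustering inputs, so any score gap between clusters is something to read off afterwards rather than something K-means was asked to find.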

analyze_student.py L31–L55 python · 25 lines

categorical_cols = ['gender', 'internet_access', 'extra_classes', 'parent_education']
numerical_cols = ['study_hours_per_day', 'attendance_percentage', 'assignment_score', 
                  'midterm_score', 'final_exam_score', 'participation_score', 'sleep_hours']

X = df[categorical_cols + numerical_cols]
y = df['overall_score']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_cols)
    ])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))
])

pipeline.fit(X, y)

cat_encoder = pipeline.named_steps['preprocessor'].named_transformers_['cat']
cat_features = cat_encoder.get_feature_names_out(categorical_cols)
all_features = numerical_cols + list(cat_features)

importances = pipeline.named_steps['regressor'].feature_importances_

The behavioral angle

Almost every Kaggle student-performance template throws all features at a Random Forest and calls it done. The repo's deliberate choice is to cluster on the four behavioral features in isolation. That keeps demographics out of the clustering inputs, so the personas reflect what students do, not who they are. The three clusters are read as archetypes (high-performer, middle-tier, struggling) that sit on behavior, not background.

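Box plots of study hours and sleep hours against the A→F grade ladder are meant to surface the actionable intervention point. They're plotted alongside the cluster scatter so the reader can move from "here's a persona" to "here's how to move someone between personas" in two glances. Because both features correlate with overall_score at roughly zero in this synthetic dataset, the boxes may look much the same from grade to grade.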
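
The numbers a box plot draws can be checked without plotting. This minimal sketch uses a few hypothetical rows, orders the grade ladder explicitly, and computes the quartiles per grade; the column names follow the schema, everything else is illustrative.

```python
import pandas as pd

# Hypothetical rows; the real CSV has 10,000 students.
df = pd.DataFrame({
    "grade": ["C", "A", "B", "F", "C", "D", "B", "C"],
    "study_hours_per_day": [5.0, 6.5, 3.0, 4.0, 8.0, 2.5, 7.0, 9.0],
})

# Order the grade ladder explicitly so groups appear as A, B, C, D, F.
grade_order = ["A", "B", "C", "D", "F"]
df["grade"] = pd.Categorical(df["grade"], categories=grade_order, ordered=True)

# Quartiles per grade: the box edges (25%, 75%) and the median line (50%).
box_stats = (df.groupby("grade", observed=True)["study_hours_per_day"]
               .quantile([0.25, 0.5, 0.75])
               .unstack())
print(box_stats)
```

If the medians and quartiles barely change from A to F, the feature is unlikely to be a useful intervention point on its own.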

Companion to: AI Adoption · Fortune 500 — same pipeline shape, very different domain.