Student Performance EDA
EDA on student grades — Random Forest predictor of overall_score, behavioral K-means personas, sleep / study correlations, A→F grade distributions.
The challenge
“Which levers move grades?” Demographic features (gender, parent_education) sit alongside behavioral ones (study hours, attendance, sleep). The deliberate analytical choice in this repo is to cluster on behavioral features only, using K-means, so that demographics never enter the clustering and the personas reflect habit patterns rather than background. Separately, a Random Forest predicts overall_score.
Data schema
student_performance_data.csv
The split between behavioral and performance features is the conceptual hinge. Behavior gets clustered; performance gets predicted; the grade label drives the visual EDA.
Dataset spotlight
The target's class balance. Roughly normal-shaped — A and F tails are smaller; B/C/D dominate.
In the correlation heatmap, the bottom row (overall_score against the other numeric columns) is the only one that matters. final_exam (0.69) and midterm (0.53) correlate most strongly with overall_score; assignment (0.40) and participation (0.24) help; attendance (0.15) is mild. Sleep and study hours are uncorrelated with overall_score in this synthetic dataset, so the assumed behavioral lever isn't really a lever.
Self-reported study hours, 0–10 per day. Roughly uniform from 1–8 with thinner tails — no obvious bimodal split between 'studiers' and 'non-studiers'.
Sleep concentrates between 4 and 10 hours — typical for the demographic. No outliers near 0 or 12+.
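The correlation row described above can be reproduced without any plotting. The following is a minimal sketch on hypothetical stand-in data (the `toy` frame and its score formula are assumptions made for illustration, not the repo's table); it computes the Pearson matrix and pulls out the target's row, sorted by strength.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the real table is student_performance_data.csv.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "final_exam_score": rng.uniform(0, 100, n),
    "midterm_score": rng.uniform(0, 100, n),
    "sleep_hours": rng.uniform(4, 10, n),
})
# Assumed toy relationship: the score depends on the two exams only, not on sleep.
toy["overall_score"] = (
    0.6 * toy["final_exam_score"] + 0.4 * toy["midterm_score"] + rng.normal(0, 5, n)
)

# Pearson correlation matrix; keep only the target's row and drop its self-correlation.
corr = toy.corr(method="pearson")
target_row = corr.loc["overall_score"].drop("overall_score").sort_values(ascending=False)
print(target_row.round(2))
```

On the real table, the same two lines at the end would be applied to the 7 numerical columns plus overall_score.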
Analytical pipeline
- 01 · load · no train/test split. EDA fits on the whole table; the goal is structure, not generalization.
- 02 · RF importance · predicts overall_score. All 11 input columns (4 categorical, 7 numerical) pass through the preprocessor → score_feature_importance.png
- 03 · K-means k=3 + PCA · behavioral subset only. Clusters on (study, attendance, participation, sleep). PCA scatter → student_clusters_pca.png
- 04 · correlation heatmap · 7 numerical + target. Pearson, vlag cmap → student_correlation.png
- 05 · grade box plots · study & sleep vs grade. Two box plots side by side, hue per grade → grade_distributions.png
Modeling choices
- numeric pipeline: StandardScaler
- categorical pipeline: OneHotEncoder · drop='first' · sparse_output=False
- transformer: ColumnTransformer (sklearn)
- model: RandomForestRegressor
- n_estimators: 100
- random_state: 42
- n_jobs: -1
- algorithm: K-Means
- k: 3
- n_init: 10
- random_state: 42
- features: study_hours · attendance · participation · sleep
- projection: PCA (n_components=2)
Cluster profiles (mean of behavioral features + overall_score) are printed to stdout (analyze_student.py lines 107–109).
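The clustering settings listed above (k=3, n_init=10, random_state=42, PCA with two components) can be sketched end to end as follows. This is an illustration on hypothetical random data, not the contents of analyze_student.py; in particular, scaling the behavioral columns before K-means is an assumption, since the page does not say whether the repo does so.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data using the column names from the page's code.
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "study_hours_per_day": rng.uniform(0, 10, n),
    "attendance_percentage": rng.uniform(50, 100, n),
    "participation_score": rng.uniform(0, 100, n),
    "sleep_hours": rng.uniform(4, 10, n),
    "overall_score": rng.uniform(30, 100, n),
})
behavioral_cols = ["study_hours_per_day", "attendance_percentage",
                   "participation_score", "sleep_hours"]

# Assumption: scale first so wide-range columns do not dominate the distances.
X_scaled = StandardScaler().fit_transform(toy[behavioral_cols])

# Cluster on behavioral features only; overall_score is never seen by K-means.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
toy["cluster"] = kmeans.fit_predict(X_scaled)

# 2-D projection used only to draw the cluster scatter.
coords = PCA(n_components=2).fit_transform(X_scaled)
toy["pc1"], toy["pc2"] = coords[:, 0], coords[:, 1]

# Cluster profile: mean of each behavioral feature plus overall_score.
profiles = toy.groupby("cluster")[behavioral_cols + ["overall_score"]].mean()
print(profiles.round(2))
```

overall_score appears only in the profile table, after the labels are assigned, which is what keeps the personas behavioral.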
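import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Path assumed relative to the working directory
df = pd.read_csv('student_performance_data.csv')

categorical_cols = ['gender', 'internet_access', 'extra_classes', 'parent_education']
numerical_cols = ['study_hours_per_day', 'attendance_percentage', 'assignment_score',
                  'midterm_score', 'final_exam_score', 'participation_score', 'sleep_hours']
X = df[categorical_cols + numerical_cols]
y = df['overall_score']

# Scale numeric columns; one-hot encode categoricals, dropping the first level of each
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_cols)
    ])
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))
])
pipeline.fit(X, y)  # fit on the full table; no train/test split

# Rebuild feature names in ColumnTransformer output order: numeric first, then one-hot
cat_encoder = pipeline.named_steps['preprocessor'].named_transformers_['cat']
cat_features = cat_encoder.get_feature_names_out(categorical_cols)
all_features = numerical_cols + list(cat_features)
importances = pipeline.named_steps['regressor'].feature_importances_

The behavioral angle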
Almost every Kaggle student-performance template throws all features at a Random Forest and calls it done. The repo’s deliberate choice is to cluster on the four behavioral features in isolation. Demographics never enter the clustering, so the personas reflect what students do, not who they are. The archetypes that emerge (high-performer, middle-tier, struggling) are defined by behavior, not background.
Box plots of study hours and sleep hours against the A→F grade ladder surface the actionable intervention point. They’re plotted alongside the cluster scatter so the reader can move from “here’s a persona” to “here’s how to move someone between personas” in two glances.
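The numbers behind such box plots are per-grade quartiles. Here is a minimal sketch on hypothetical data: the `grade` column name and the letter-grade cutoffs are assumptions made for illustration, since the page does not state how the repo defines the A→F label.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, not the repo's table.
rng = np.random.default_rng(1)
n = 400
toy = pd.DataFrame({
    "study_hours_per_day": rng.uniform(0, 10, n),
    "sleep_hours": rng.uniform(4, 10, n),
    "overall_score": rng.uniform(40, 100, n),
})
# Hypothetical cutoffs and column name; the repo's grade label may differ.
toy["grade"] = pd.cut(
    toy["overall_score"],
    bins=[0, 60, 70, 80, 90, 100],
    labels=["F", "D", "C", "B", "A"],
    include_lowest=True,
)

# Quartiles per grade: the values a box plot draws as box edges and median line.
quartiles = (
    toy.groupby("grade", observed=True)[["study_hours_per_day", "sleep_hours"]]
    .quantile([0.25, 0.5, 0.75])
)
print(quartiles.round(2))
```

If the medians barely move from F to A, the box plots will look flat, which would match the near-zero correlations reported for sleep and study hours.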
Companion to: AI Adoption · Fortune 500 — same pipeline shape, very different domain.