§ Tabular Active

Student Performance EDA

EDA on student grades — Random Forest predictor of overall_score, behavioral K-means personas, sleep / study correlations, A→F grade distributions.

dataset 792 KB · tabular · 11 features · 1 score + 1 letter target · grades A, B, C, D, F
metric Random Forest R² (runtime) higher is better
infra sbl1 · local
last touched 2026-04-02
technique stack
RandomForestRegressor · K-means k=3 (behavioral only) · PCA (2 components) · letter-grade box plots

The challenge

“Which levers move grades?” Demographic features (gender, parent_education) sit alongside behavioral ones (study hours, attendance, participation, sleep). The deliberate analytical choice in this repo is to run K-means on the behavioral features only, leaving demographics out of the clustering so the personas reflect habit patterns rather than background, and to predict overall_score with a Random Forest.

Data schema

data schema student_performance_data.csv
column type note
gender categorical encoded via OneHot drop-first
internet_access categorical Yes / No
extra_classes categorical Yes / No
parent_education categorical multi-level education ladder
study_hours_per_day numeric behavioral · clustering input
attendance_percentage numeric behavioral · clustering input
participation_score numeric behavioral · clustering input
sleep_hours numeric behavioral · clustering input
assignment_score numeric performance · feature for RF
midterm_score numeric performance · feature for RF
final_exam_score numeric performance · feature for RF
overall_score target RandomForestRegressor target
grade categorical A · B · C · D · F ordered — used in box plots

The split between behavioral and performance features is the conceptual hinge. Behavior gets clustered; performance gets predicted; the grade label drives the visual EDA.

Modeling choices

analytical hyperparameters

preprocessing
  numeric pipeline: StandardScaler
  categorical pipeline: OneHotEncoder · drop='first' · sparse_output=False
  transformer: ColumnTransformer (sklearn)

score predictor
  model: RandomForestRegressor
  n_estimators: 100
  random_state: 42
  n_jobs: -1

behavioral clustering
  algorithm: K-Means
  k: 3
  n_init: 10
  random_state: 42
  features: study_hours · attendance · participation · sleep
  projection: PCA (n_components=2)

Cluster profiles (mean of behavioral features + overall_score) are printed to stdout (analyze_student.py lines 107–109).
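
The clustering step described above can be sketched roughly as follows. This is a minimal illustration, not the repo's code: the data is synthetic (`make_synthetic_students` is a hypothetical stand-in for student_performance_data.csv), the helper name `cluster_behavior` is invented here, and standardizing the behavioral columns before K-means is an assumption that the write-up does not state.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# The four behavioral columns named in the data schema.
BEHAVIORAL_COLS = [
    "study_hours_per_day",
    "attendance_percentage",
    "participation_score",
    "sleep_hours",
]


def make_synthetic_students(n=300, seed=42):
    """Hypothetical stand-in for the real CSV; value ranges are invented."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "study_hours_per_day": rng.uniform(0, 8, n),
        "attendance_percentage": rng.uniform(50, 100, n),
        "participation_score": rng.uniform(0, 10, n),
        "sleep_hours": rng.uniform(4, 10, n),
    })
    # Toy score so the profile table has something to average.
    raw = 30 + 5 * df["study_hours_per_day"] + 0.3 * df["attendance_percentage"]
    df["overall_score"] = (raw + rng.normal(0, 5, n)).clip(0, 100)
    return df


def cluster_behavior(df, k=3, seed=42):
    """Cluster on behavioral columns only, then project to 2-D with PCA."""
    # Assumption: features are standardized so no single unit dominates.
    scaled = StandardScaler().fit_transform(df[BEHAVIORAL_COLS])
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scaled)
    coords = PCA(n_components=2).fit_transform(scaled)

    out = df.copy()
    out["cluster"] = labels
    out["pc1"], out["pc2"] = coords[:, 0], coords[:, 1]

    # Cluster profile: mean of behavioral features plus overall_score.
    profiles = out.groupby("cluster")[BEHAVIORAL_COLS + ["overall_score"]].mean()
    return out, profiles


students = make_synthetic_students()
clustered, profiles = cluster_behavior(students)
print(profiles.round(2))
```

Note that overall_score appears only in the profile table; it is never passed to K-means, so the personas are formed from behavior alone and the score is read off afterwards.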

analyze_student.py L31–L55 python · 25 lines
categorical_cols = ['gender', 'internet_access', 'extra_classes', 'parent_education']
numerical_cols = ['study_hours_per_day', 'attendance_percentage', 'assignment_score', 
                  'midterm_score', 'final_exam_score', 'participation_score', 'sleep_hours']

X = df[categorical_cols + numerical_cols]
y = df['overall_score']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_cols)
    ])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))
])

pipeline.fit(X, y)

cat_encoder = pipeline.named_steps['preprocessor'].named_transformers_['cat']
cat_features = cat_encoder.get_feature_names_out(categorical_cols)
all_features = numerical_cols + list(cat_features)

importances = pipeline.named_steps['regressor'].feature_importances_

The behavioral angle

Almost every Kaggle student-performance template throws all features at a Random Forest and calls it done. This repo deliberately clusters on the four behavioral features in isolation, so demographics never enter the distance computation and the personas reflect what students do, not who they are. The three archetypes that emerge (high-performer, middle-tier, struggling) are therefore defined by behavior, not background.

Box plots of study hours and sleep hours against the A→F grade ladder surface the actionable intervention point. They’re plotted alongside the cluster scatter so the reader can move from “here’s a persona” to “here’s how to move someone between personas” in two glances.
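
A rough sketch of the grade-ladder box plots, assuming matplotlib is the plotting library (the write-up does not say which one the repo uses). The data is synthetic, and the helpers `make_synthetic_grades` and `plot_by_grade` are hypothetical names introduced here. The key detail is declaring `grade` as an ordered categorical so the boxes appear in A→F order rather than in whatever order the rows happen to arrive.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

GRADE_ORDER = ["A", "B", "C", "D", "F"]


def make_synthetic_grades(n=250, seed=42):
    """Hypothetical stand-in for the real CSV; values are invented."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        # Ordered categorical keeps the A→F ladder in every groupby and plot.
        "grade": pd.Categorical(rng.choice(GRADE_ORDER, n),
                                categories=GRADE_ORDER, ordered=True),
        "study_hours_per_day": rng.uniform(0, 8, n),
        "sleep_hours": rng.uniform(4, 10, n),
    })


def plot_by_grade(df, columns):
    """Draw one box-plot panel per column, with one box per letter grade."""
    fig, axes = plt.subplots(1, len(columns), figsize=(5 * len(columns), 4),
                             squeeze=False)
    for ax, col in zip(axes[0], columns):
        groups = [df.loc[df["grade"] == g, col].to_numpy() for g in GRADE_ORDER]
        ax.boxplot(groups)
        ax.set_xticklabels(GRADE_ORDER)
        ax.set_xlabel("grade")
        ax.set_ylabel(col)
    fig.tight_layout()
    return fig


grades = make_synthetic_grades()
fig = plot_by_grade(grades, ["study_hours_per_day", "sleep_hours"])

# The same comparison as numbers: the median line of each box.
medians = grades.groupby("grade", observed=True)[
    ["study_hours_per_day", "sleep_hours"]
].median()
print(medians.round(2))
```

Calling `fig.savefig(...)` with a path of your choice writes the figure to disk; the median table gives the same per-grade comparison in a form that is easy to diff between runs.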

Companion to: AI Adoption · Fortune 500 — same pipeline shape, very different domain.