§ Tabular Active

AI Adoption · Fortune 500

EDA + clustering on a synthetic Fortune-500 AI-adoption panel. Random Forest ROI predictor, K-means personas, PCA projection, correlation atlas, use-case trajectories.

dataset 572 KB · 2020–2025 · Fortune 500 tabular · 7 features · 1 target
metric Random Forest R² (runtime) · higher is better
infra sbl1 · local
last touched 2026-05-17 · kaggle competition
technique stack
RandomForestRegressor (100 trees) · OneHot + StandardScaler · K-means k=4 · PCA (2 components) · correlation atlas

The challenge

Three questions, asked of the same Fortune-500 panel:

  1. What predicts AI-adoption ROI? Random Forest feature importance.
  2. What kinds of AI adopters exist? K-means personas in financial × maturity space, projected with PCA.
  3. How have use cases evolved year over year? Temporal trajectory.

Plus a correlation atlas over the four core numerical metrics.
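
A correlation atlas of this kind can be sketched in a few lines of pandas. The page does not name the four metrics, so the column list below (the four numeric columns from the data schema) is an assumption, and the toy frame is random data standing in for the real CSV.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real CSV; values are random, not from the dataset.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Year": rng.integers(2020, 2026, size=n),          # 2020..2025 inclusive
    "Revenue_USD": rng.lognormal(mean=23, sigma=1, size=n),
    "AI_Maturity_Score": rng.uniform(0, 100, size=n),
})
# Make ROI loosely depend on maturity so the matrix is not pure noise.
toy["AI_ROI_Percent"] = 0.3 * toy["AI_Maturity_Score"] + rng.normal(0, 5, size=n)

# ASSUMPTION: these are the "four core numerical metrics".
metric_cols = ["Year", "Revenue_USD", "AI_Maturity_Score", "AI_ROI_Percent"]

# Pairwise Pearson correlations; the result is a symmetric 4x4 frame.
corr = toy[metric_cols].corr(method="pearson")
print(corr.round(2))
```

The resulting frame can be handed to any heatmap plotter; which plotting library the original script uses is not stated here.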

Data schema

data schema · ai-adoption-fortune500-synthetic-dataset-2020-2025.csv

column             type         note
Industry           categorical  sector; fed into RF after OneHot
Company_Type       categorical  ownership / structure
Employee_Size      categorical  bucketed headcount
Use_Case           categorical  deployed AI use case; top 6 tracked over time
Year               numeric      2020 – 2025
Revenue_USD        numeric      company revenue
AI_Maturity_Score  numeric      0 – 100 maturity scale
Uses_AI            categorical  filter; ROI model trained on Yes only
AI_ROI_Percent     target       the dependent variable

9 columns; 4 categorical + 3 numerical + 1 filter + 1 target. analyze.py lines 35–39 partition them into the RF preprocessor.
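
The two schema notes that involve row selection (the Uses_AI filter and the top-6 use-case tracking) might look roughly like the sketch below. The use-case names and rows are made up, and counting rows per year is an assumption; the original may track a different quantity, such as mean ROI.

```python
import pandas as pd

# Toy rows using the schema's column names; values are hypothetical.
toy = pd.DataFrame({
    "Year":     [2020, 2020, 2021, 2021, 2021, 2022],
    "Use_Case": ["Chatbots", "Forecasting", "Chatbots", "Chatbots", "Fraud", "Forecasting"],
    "Uses_AI":  ["Yes", "Yes", "Yes", "No", "Yes", "Yes"],
})

# Keep adopters only, mirroring the "trained on Yes only" filter.
df_ai = toy[toy["Uses_AI"] == "Yes"]

# Top-N use cases by overall frequency (the page tracks the top 6).
top_n = 6
top_cases = df_ai["Use_Case"].value_counts().head(top_n).index

# Year x Use_Case count table: each column is one trajectory.
trajectory = (
    df_ai[df_ai["Use_Case"].isin(top_cases)]
    .groupby(["Year", "Use_Case"])
    .size()
    .unstack(fill_value=0)
)
print(trajectory)
```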

Dataset spotlight

Analytical pipeline

Modeling choices

analytical hyperparameters

preprocessing
  numeric pipeline: StandardScaler
  categorical pipeline: OneHotEncoder · drop='first' · sparse_output=False
  unknown handling: handle_unknown='ignore'
  transformer: ColumnTransformer (sklearn)

ROI predictor
  model: RandomForestRegressor
  n_estimators: 100
  random_state: 42
  n_jobs: -1 (all cores)

clustering
  algorithm: K-Means
  k: 4
  n_init: 10
  random_state: 42
  clustering features: Revenue_USD · AI_ROI_Percent · AI_Maturity_Score
  projection: PCA (n_components=2)

Source: analyze.py lines 33–95. Cluster profile means are printed to stdout (line 115); the PCA explained-variance ratio is annotated on the scatter.
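
The clustering half of the script is not excerpted on this page. A minimal sketch using the hyperparameters listed above could look like this; the toy data is random, and scaling the three features before K-means is an assumption rather than something the source confirms.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in data; not drawn from the real panel.
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    "Revenue_USD": rng.lognormal(23, 1, n),
    "AI_ROI_Percent": rng.normal(15, 8, n),
    "AI_Maturity_Score": rng.uniform(0, 100, n),
})
cluster_cols = ["Revenue_USD", "AI_ROI_Percent", "AI_Maturity_Score"]

# ASSUMPTION: standardize first so Revenue_USD does not dominate distances.
X_scaled = StandardScaler().fit_transform(toy[cluster_cols])

# Same settings as the hyperparameter list: k=4, n_init=10, random_state=42.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
toy["Cluster"] = kmeans.fit_predict(X_scaled)

# Cluster profile: mean of each clustering feature per persona.
profile = toy.groupby("Cluster")[cluster_cols].mean()
print(profile.round(2))

# 2-D projection for the scatter, plus the variance each axis explains.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_.round(3))
```

Because K-means is distance-based, an unscaled Revenue_USD column would swamp the two 0–100-range features, which is why the sketch standardizes first.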

analyze.py L33–L65 python · 33 lines
print("Training Random Forest to predict AI ROI...")
# Features to use
categorical_cols = ['Industry', 'Company_Type', 'Employee_Size', 'Use_Case']
numerical_cols = ['Year', 'Revenue_USD', 'AI_Maturity_Score']

X = df_ai[categorical_cols + numerical_cols]
y = df_ai['AI_ROI_Percent']

# Preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), categorical_cols)
    ])

# Pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))
])

pipeline.fit(X, y)

# Get feature names after one-hot encoding
cat_encoder = pipeline.named_steps['preprocessor'].named_transformers_['cat']
cat_features = cat_encoder.get_feature_names_out(categorical_cols)
all_features = numerical_cols + list(cat_features)

# Get importances
importances = pipeline.named_steps['regressor'].feature_importances_
importance_df = pd.DataFrame({'Feature': all_features, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False).head(15)

What’s distinct

The dataset is synthetic but laid out like a real Fortune-500 panel: multi-industry, multi-year, with a maturity score that crosses revenue tiers. The K-means k=4 split is interesting precisely because synthetic ROI values can’t be over-fit by the model. Clustering runs only on Revenue_USD, AI_ROI_Percent and AI_Maturity_Score, so the personas that emerge correspond to organizational readiness archetypes (high-revenue mature deployers, high-revenue early adopters, low-revenue mature, low-revenue exploratory) regardless of which industry a row sits in.

Companion to: Student Performance EDA — same pipeline shape, very different domain.