AI Adoption · Fortune 500 · kaggle

The challenge

Three questions, asked of the same Fortune-500 panel:

What predicts AI-adoption ROI? — Random Forest feature importance.
What kinds of AI-adopters exist? — K-means personas in financial × maturity space, projected with PCA.
How have use cases evolved year over year? — temporal trajectory.

Plus a correlation atlas over the four core numerical metrics.

Data schema

data schema · ai-adoption-fortune500-synthetic-dataset-2020-2025.csv

column	type	note
Industry	categorical	sector; fed into RF after one-hot encoding
Company_Type	categorical	ownership / structure
Employee_Size	categorical	bucketed headcount
Use_Case	categorical	deployed AI use case; top 6 tracked over time
Year	numeric	2020–2025
Revenue_USD	numeric	company revenue
AI_Maturity_Score	numeric	0–100 maturity scale
Uses_AI	categorical	filter; ROI model trained on Yes only
AI_ROI_Percent	target	the dependent variable

9 columns; 4 categorical + 3 numerical + 1 filter + 1 target. analyze.py lines 35–39 partition them into the RF preprocessor.

Dataset spotlight

industry distribution · 6,000 rows

chart data

label	value
Technology	702 rows
E-commerce	685 rows
Finance	680 rows
Telecom	670 rows
Energy	667 rows
Retail	653 rows
Manufacturing	644 rows
Healthcare	638 rows
Logistics	631 rows
Automotive	12 rows
Consumer Goods	12 rows
Industrial	6 rows

Nine industries are evenly represented (631–702 rows each); three (Automotive, Consumer Goods, Industrial) are deeply underrepresented, with 30 rows between them (0.5% of the 6,000). Any cross-industry model trained naively will produce high-variance estimates for those tail industries.
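One way to make the tail explicit before modeling is to flag industries below a minimum row count and, optionally, pool them into a shared level. The sketch below works from the row counts in the table above; the cutoff `MIN_ROWS` and the `bucket` helper are illustrative assumptions, not something analyze.py is known to do.

```python
# Row counts copied from the industry distribution table above.
industry_rows = {
    "Technology": 702, "E-commerce": 685, "Finance": 680, "Telecom": 670,
    "Energy": 667, "Retail": 653, "Manufacturing": 644, "Healthcare": 638,
    "Logistics": 631, "Automotive": 12, "Consumer Goods": 12, "Industrial": 6,
}

MIN_ROWS = 50  # hypothetical cutoff, not taken from analyze.py

total_rows = sum(industry_rows.values())
tail = [name for name, n in industry_rows.items() if n < MIN_ROWS]
tail_share = sum(industry_rows[name] for name in tail) / total_rows


def bucket(industry: str) -> str:
    """Map tail industries to a shared 'Other' level; leave the rest as-is."""
    return "Other" if industry in tail else industry


print(total_rows, tail, f"{tail_share:.2%}")
```

Pooling trades per-industry detail for stability; whether that trade is worth it depends on how much the tail industries matter to the question being asked.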

adoption rate by year — % of rows with Uses_AI=Yes

chart data

label	value
2020	81 %
2021	81 %
2022	81 %
2023	80 %
2024	80 %
2025	81 %

Stable at 80–81% across 2020–2025. The synthetic dataset assumes a flat adoption rate; in a real Fortune 500 panel we'd expect a clear upward trend.
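The per-year adoption rate can be sketched as a grouped mean of a boolean flag. The frame below is a toy stand-in for the CSV, and the `"No"` label for non-adopters is an assumption (the source only names the `Yes` value).

```python
import pandas as pd

# Toy stand-in for the real CSV; only the two columns this step needs.
# "No" as the non-adopter label is an assumption.
df = pd.DataFrame({
    "Year":    [2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021],
    "Uses_AI": ["Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"],
})

adoption_pct = (
    df["Uses_AI"].eq("Yes")   # boolean flag per row
      .groupby(df["Year"])    # group the flags by year
      .mean()                 # share of True values per year
      .mul(100)
)
print(adoption_pct)
```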

AI ROI distribution (Uses_AI=Yes rows only)

chart data

label	value
10.0–13.0	534 rows
13.0–16.0	497 rows
16.0–19.0	458 rows
19.0–22.0	436 rows
22.0–25.0	438 rows
25.0–28.0	464 rows
28.0–31.0	517 rows
31.0–34.0	496 rows
34.0–37.0	508 rows
37.0–40.0	500 rows

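Roughly uniform from 10% to 40% ROI, consistent with a generator that draws ROI from a near-uniform distribution. The middle bins (16–25%) sit slightly lower, at 436–458 rows against about 485 expected per bin; the dip is small and may be no more than sampling variance.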
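A rough way to check the sampling-variance reading is a chi-square goodness-of-fit statistic against equal bin counts. The sketch below uses only the ten bin counts from the table above. It assumes independent rows and an exactly uniform null, neither of which is confirmed for this generator, so treat the result as a sanity check and not a verdict.

```python
# Bin counts copied from the ROI distribution table above (10 bins, 3 points wide).
roi_bin_rows = [534, 497, 458, 436, 438, 464, 517, 496, 508, 500]

n_rows = sum(roi_bin_rows)
expected = n_rows / len(roi_bin_rows)  # equal counts under a uniform null

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((obs - expected) ** 2 / expected for obs in roi_bin_rows)

# Standard 5% critical value for a chi-square distribution with 9 degrees of freedom.
CRITICAL_95_DF9 = 16.919

print(f"n={n_rows}, expected per bin={expected:.1f}, chi2={chi2:.2f}")
print("exceeds 5% critical value:", chi2 > CRITICAL_95_DF9)
```

On these counts the statistic comes out a little above the 5% critical value, so the dip may be slightly more than pure noise under an exactly uniform null. The effect is still small in absolute terms (at most about 10% below the expected bin count).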

metric correlation · Pearson, scale −1 to +1

metric	Year	Revenue	ROI	Maturity
Year	1.00	0.00	-0.01	-0.00
Revenue	0.00	1.00	0.02	0.15
ROI	-0.01	0.02	1.00	0.61
Maturity	-0.00	0.15	0.61	1.00

chart data

pair	correlation
Year x Year	1.00
Year x Revenue	0.00
Year x ROI	-0.01
Year x Maturity	-0.00
Revenue x Year	0.00
Revenue x Revenue	1.00
Revenue x ROI	0.02
Revenue x Maturity	0.15
ROI x Year	-0.01
ROI x Revenue	0.02
ROI x ROI	1.00
ROI x Maturity	0.61
Maturity x Year	-0.00
Maturity x Revenue	0.15
Maturity x ROI	0.61
Maturity x Maturity	1.00

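The only meaningful signal: AI Maturity ↔ ROI = 0.61, which corresponds to roughly 37% of ROI variance shared linearly (0.61² ≈ 0.37). Year and Revenue are essentially uncorrelated with ROI in this synthetic dataset (|r| ≤ 0.02), and Revenue ↔ Maturity is weak at 0.15, so the model has to lean on Maturity (and the categorical Use_Case channel) to predict anything.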
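A 4×4 Pearson matrix like the one above can be produced with `DataFrame.corr`. The sketch below runs on a small generated frame, not the real CSV. The column names follow the schema, but the data-generating formula is invented purely so that ROI and Maturity come out correlated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
maturity = rng.uniform(0, 100, n)

# Toy data: ROI is loosely tied to maturity plus noise, purely for illustration.
toy = pd.DataFrame({
    "Year": rng.integers(2020, 2026, n),  # upper bound is exclusive
    "Revenue_USD": rng.lognormal(mean=22, sigma=1, size=n),
    "AI_ROI_Percent": 10 + 0.2 * maturity + rng.normal(0, 6, n),
    "AI_Maturity_Score": maturity,
})

metric_cols = ["Year", "Revenue_USD", "AI_ROI_Percent", "AI_Maturity_Score"]
corr = toy[metric_cols].corr(method="pearson")
print(corr.round(2))
```

Pearson correlation only captures linear association, so a near-zero entry rules out a linear trend but not every form of dependence.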

Analytical pipeline

four parallel branches off one filtered dataframe

step	stage	operation	detail
01	load + filter	Uses_AI == Yes	ROI is only defined for AI-using firms; the filter is the first step inside analyze.py.
02	RF feature importance	predicts AI_ROI_Percent	Top 15 importances → roi_feature_importance.png
03	K-means k=4 + PCA	clusters on (Revenue, ROI, Maturity)	Personas plotted in PCA 2D → company_clusters_pca.png
04	use-case evolution	groupby year × use case	Top-6 use cases over 2020–2025 → use_case_evolution.png
05	correlation atlas	4×4 Pearson heatmap	Year · Revenue · ROI · Maturity → metric_correlation.png

One CSV in, four PNGs out. Each branch in analyze.py is bounded by a ===== comment block; they're independent and can be re-ordered.
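The filter step and the use-case branch can be sketched on a toy frame as follows. The use-case names and the `"No"` label are made up for illustration, and counting rows per year and use case is an assumption about what analyze.py aggregates.

```python
import pandas as pd

# Toy stand-in for the CSV; use-case names here are hypothetical.
df = pd.DataFrame({
    "Year":     [2020, 2020, 2021, 2021, 2021, 2022],
    "Uses_AI":  ["Yes", "No", "Yes", "Yes", "Yes", "Yes"],
    "Use_Case": ["Chatbots", "Chatbots", "Forecasting",
                 "Chatbots", "Forecasting", "Chatbots"],
})

# Step 01: keep only AI-using rows.
df_ai = df[df["Uses_AI"] == "Yes"]

# Step 04: restrict to the six most frequent use cases, then count per year.
top_cases = df_ai["Use_Case"].value_counts().nlargest(6).index
evolution = (
    df_ai[df_ai["Use_Case"].isin(top_cases)]
    .groupby(["Year", "Use_Case"])
    .size()
    .unstack(fill_value=0)  # years as rows, use cases as columns
)
print(evolution)
```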

Modeling choices

analytical hyperparameters

preprocessing

numeric pipeline: StandardScaler
categorical pipeline: OneHotEncoder · drop='first' · sparse_output=False
unknown handling: handle_unknown='ignore'
transformer: ColumnTransformer (sklearn)

ROI predictor

model: RandomForestRegressor
n_estimators: 100
random_state: 42
n_jobs: -1 (all cores)

clustering

algorithm: K-Means
k: 4
n_init: 10
random_state: 42
clustering features: Revenue_USD · AI_ROI_Percent · AI_Maturity_Score
projection: PCA (n_components=2)

Source: analyze.py lines 33–95. Cluster profile means are printed to stdout (line 115); the PCA explained-variance ratio is annotated on the scatter.
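The clustering branch is not shown in the listing below, so here is a minimal sketch using the hyperparameters listed above. It runs on generated stand-in data, and the `StandardScaler` step is an assumption: the source does not say whether the three features are scaled before K-Means, but without scaling a dollar-valued revenue column would dominate the distances.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 400

# Toy stand-ins for Revenue_USD, AI_ROI_Percent, AI_Maturity_Score.
features = np.column_stack([
    rng.lognormal(mean=22, sigma=1, size=n),
    rng.uniform(10, 40, size=n),
    rng.uniform(0, 100, size=n),
])

# Assumption: scale first so revenue does not dominate Euclidean distances.
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(scaled)

# Project the same scaled features to 2D for plotting.
pca = PCA(n_components=2)
coords = pca.fit_transform(scaled)

print("cluster sizes:", np.bincount(labels))
print("explained variance ratio:", pca.explained_variance_ratio_.round(2))
```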

analyze.py L33–L65 python · 33 lines

print("Training Random Forest to predict AI ROI...")
# Features to use
categorical_cols = ['Industry', 'Company_Type', 'Employee_Size', 'Use_Case']
numerical_cols = ['Year', 'Revenue_USD', 'AI_Maturity_Score']

X = df_ai[categorical_cols + numerical_cols]
y = df_ai['AI_ROI_Percent']

# Preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), categorical_cols)
    ])

# Pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))
])

pipeline.fit(X, y)

# Get feature names after one-hot encoding
cat_encoder = pipeline.named_steps['preprocessor'].named_transformers_['cat']
cat_features = cat_encoder.get_feature_names_out(categorical_cols)
all_features = numerical_cols + list(cat_features)

# Get importances
importances = pipeline.named_steps['regressor'].feature_importances_
importance_df = pd.DataFrame({'Feature': all_features, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False).head(15)
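The listing above fits on every AI-using row and reads impurity-based importances, which say nothing about out-of-sample accuracy. A hedged companion check, not part of analyze.py as far as the source shows, is to hold out a test split and compute permutation importance on it. The sketch below uses generated data and a reduced column set, and the 75/25 split is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 600
maturity = rng.uniform(0, 100, n)

# Toy data: ROI depends on maturity plus noise; the other columns are pure noise.
toy = pd.DataFrame({
    "Industry": rng.choice(["Finance", "Retail", "Energy"], n),
    "Year": rng.integers(2020, 2026, n),
    "AI_Maturity_Score": maturity,
    "AI_ROI_Percent": 10 + 0.2 * maturity + rng.normal(0, 6, n),
})
X = toy[["Industry", "Year", "AI_Maturity_Score"]]
y = toy["AI_ROI_Percent"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["Year", "AI_Maturity_Score"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Industry"]),
    ])),
    ("regressor", RandomForestRegressor(n_estimators=100, random_state=42)),
])
model.fit(X_train, y_train)

# R^2 on rows the forest never saw during fitting.
r2 = model.score(X_test, y_test)

# Permutation importance shuffles one raw input column at a time on the test split.
perm = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=42)
ranking = pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False)

print(f"held-out R^2: {r2:.2f}")
print(ranking)
```

Because permutation importance is computed per raw column, it reports one number for Industry as a whole instead of one per one-hot level, which makes it easier to compare against the numeric features.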

What's distinct

The dataset is synthetic but laid out like a real Fortune-500 panel: multi-industry, multi-year, with a maturity score that crosses revenue tiers. The K-means k=4 split is the interesting part. It clusters only on Revenue_USD, AI_ROI_Percent and AI_Maturity_Score, so Industry never enters the distance calculation, and the personas that emerge read as organizational readiness archetypes (high-revenue mature deployers, high-revenue early adopters, low-revenue mature, low-revenue exploratory) regardless of which industry a row sits in.

Companion to: Student Performance EDA — same pipeline shape, very different domain.