AI Adoption · Fortune 500
EDA + clustering on a synthetic Fortune-500 AI-adoption panel. Random Forest ROI predictor, K-means personas, PCA projection, correlation atlas, use-case trajectories.
The challenge
Three questions, asked of the same Fortune-500 panel:
- What predicts AI-adoption ROI? Random Forest feature importance.
- What kinds of AI-adopters exist? K-means personas in financial × maturity space, projected with PCA.
- How have use cases evolved year over year? Temporal trajectory.
Plus a correlation atlas over the four core numerical metrics.
Data schema
ai-adoption-fortune500-synthetic-dataset-2020-2025.csv: 9 columns (4 categorical + 3 numerical + 1 filter + 1 target). analyze.py lines 35–39 partition them into the RF preprocessor.
Dataset spotlight
Nine industries are evenly represented (~640–700 rows each); three (Automotive, Consumer Goods, Industrial) are deeply underrepresented. Any cross-industry model trained naively will have variance problems on the tail.
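A quick way to see this imbalance is to count rows per industry. The sketch below is illustrative only: the toy frame and its industry labels stand in for the real CSV, and the "under half the median count" threshold is a hypothetical rule of thumb, not something taken from analyze.py.

```python
import pandas as pd


def industry_counts(df: pd.DataFrame, min_share: float = 0.5) -> pd.DataFrame:
    """Row count per industry, flagging industries far below the median count."""
    counts = df["Industry"].value_counts().rename("rows").to_frame()
    # Hypothetical threshold: flag anything under `min_share` of the median count.
    counts["underrepresented"] = counts["rows"] < min_share * counts["rows"].median()
    return counts


# Toy stand-in for the panel (labels and counts are made up for illustration).
toy = pd.DataFrame(
    {"Industry": ["Tech"] * 8 + ["Finance"] * 7 + ["Retail"] * 7 + ["Automotive"] * 2}
)
report = industry_counts(toy)
print(report)
```

On the real panel, the same call on the loaded DataFrame would show which industries sit in the thin tail before any cross-industry model is trained.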
AI adoption rate: stable at about 81% in every year from 2020 to 2025. The synthetic dataset assumes a flat adoption rate; in a real Fortune 500 panel we'd expect a clear upward trend.
ROI distribution: roughly uniform from 10% to 40%. The synthetic generator appears to have drawn ROI from a near-uniform distribution; the slight U-shape is most likely sampling variance.
The only meaningful linear signal is AI Maturity ↔ ROI (Pearson r = 0.61). Year and Revenue are essentially uncorrelated with ROI in this synthetic dataset, so the model has to lean on Maturity (and the categorical Use_Case channel) to predict anything.
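The correlation atlas behind these numbers is a plain Pearson matrix over the four numerical metrics. The following is a minimal sketch, not the code from analyze.py: the toy data is randomly generated, the 1 to 5 maturity range is an assumption, and the linear ROI rule exists only to produce a visible Maturity ↔ ROI correlation.

```python
import numpy as np
import pandas as pd

METRICS = ["Year", "Revenue_USD", "AI_ROI_Percent", "AI_Maturity_Score"]


def correlation_atlas(df: pd.DataFrame) -> pd.DataFrame:
    """4x4 Pearson correlation matrix over the core numerical metrics."""
    return df[METRICS].corr(method="pearson")


def pearson(x, y) -> float:
    """Pearson r from its definition: cov(X, Y) / (sigma_X * sigma_Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1)))


# Toy data (assumption): ROI depends on maturity only; year and revenue are noise.
rng = np.random.default_rng(0)
n = 2000
maturity = rng.uniform(1, 5, n)
toy = pd.DataFrame(
    {
        "Year": rng.integers(2020, 2026, n),
        "Revenue_USD": rng.lognormal(23, 1, n),
        "AI_Maturity_Score": maturity,
        "AI_ROI_Percent": 10 + 5 * maturity + rng.normal(0, 6, n),
    }
)
atlas = correlation_atlas(toy)
print(atlas.round(2))
```

The `pearson` helper is there only to show that the matrix entries match the textbook formula; a heatmap of `atlas` would then give the 4×4 view.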
Analytical pipeline
- 01 load + filter (Uses_AI == Yes): ROI is only defined for AI-using firms; the filter is the first step inside analyze.py.
- 02 RF feature importance (predicts AI_ROI_Percent): top 15 importances → roi_feature_importance.png
- 03 K-means k=4 + PCA (clusters on Revenue, ROI, Maturity): personas plotted in PCA 2D → company_clusters_pca.png
- 04 use-case evolution (groupby year × use case): top-6 use cases over 2020–2025 → use_case_evolution.png
- 05 correlation atlas (4×4 Pearson heatmap): Year · Revenue · ROI · Maturity → metric_correlation.png
One CSV in, four PNGs out. Each branch in analyze.py is bounded by a ===== comment block; they're independent and can be re-ordered.
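Steps 01 and 04 can be sketched in a few lines of pandas. This is a rough illustration under assumptions: the source says only "groupby year × use case", so counting rows per cell is a guess at the aggregation, the function names are hypothetical, and the use-case labels in the toy frame are invented.

```python
import pandas as pd


def filter_ai_users(df: pd.DataFrame) -> pd.DataFrame:
    """Step 01: keep only AI-using firms, since ROI is defined only for them."""
    return df[df["Uses_AI"] == "Yes"].copy()


def load_ai_users(csv_path: str) -> pd.DataFrame:
    """Load the panel from disk and apply the Uses_AI filter."""
    return filter_ai_users(pd.read_csv(csv_path))


def use_case_evolution(df_ai: pd.DataFrame, top_n: int = 6) -> pd.DataFrame:
    """Step 04: yearly row counts for the `top_n` most frequent use cases."""
    top = df_ai["Use_Case"].value_counts().head(top_n).index
    subset = df_ai[df_ai["Use_Case"].isin(top)]
    # Rows are years, columns are use cases, values are row counts (assumed metric).
    return subset.groupby(["Year", "Use_Case"]).size().unstack(fill_value=0)


# Toy stand-in for the CSV (use-case labels are made up).
toy = pd.DataFrame(
    {
        "Year": [2020, 2020, 2021, 2021, 2021, 2022],
        "Uses_AI": ["Yes", "No", "Yes", "Yes", "Yes", "Yes"],
        "Use_Case": ["Chatbots", None, "Chatbots", "Forecasting", "Chatbots", "Forecasting"],
    }
)
df_ai = filter_ai_users(toy)
trend = use_case_evolution(df_ai, top_n=2)
print(trend)
```

Calling `trend.plot()` on the result would draw one line per use case across the years, which is the shape of chart that use_case_evolution.png suggests.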
Modeling choices
- numeric pipeline: StandardScaler
- categorical pipeline: OneHotEncoder · drop='first' · sparse_output=False
- unknown handling: handle_unknown='ignore'
- transformer: ColumnTransformer (sklearn)
- model: RandomForestRegressor
- n_estimators: 100
- random_state: 42
- n_jobs: -1 (all cores)
- algorithm: K-Means
- k: 4
- n_init: 10
- random_state: 42
- clustering features: Revenue_USD · AI_ROI_Percent · AI_Maturity_Score
- projection: PCA (n_components=2)
Source: analyze.py lines 33–95. Cluster profile means are printed to stdout (line 115); the PCA explained-variance ratio is annotated on the scatter.
```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# df_ai is the panel after the Uses_AI == Yes filter (pipeline step 01).
print("Training Random Forest to predict AI ROI...")

# Features to use
categorical_cols = ['Industry', 'Company_Type', 'Employee_Size', 'Use_Case']
numerical_cols = ['Year', 'Revenue_USD', 'AI_Maturity_Score']
X = df_ai[categorical_cols + numerical_cols]
y = df_ai['AI_ROI_Percent']

# Preprocessing: scale numeric columns, one-hot encode categorical columns.
# sparse_output requires scikit-learn >= 1.2 (older releases call it sparse).
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), categorical_cols)
    ])

# Pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))
])
pipeline.fit(X, y)

# Get feature names after one-hot encoding
# (same order as the ColumnTransformer output: numeric first, then categorical)
cat_encoder = pipeline.named_steps['preprocessor'].named_transformers_['cat']
cat_features = cat_encoder.get_feature_names_out(categorical_cols)
all_features = numerical_cols + list(cat_features)

# Get importances and keep the top 15
importances = pipeline.named_steps['regressor'].feature_importances_
importance_df = pd.DataFrame({'Feature': all_features, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False).head(15)
```
What’s distinct
The dataset is synthetic but laid out like a real Fortune-500 panel — multi-industry, multi-year, with a maturity score that crosses revenue tiers. The K-means k=4 split is interesting precisely because synthetic ROI values can’t be over-fit by the model; the personas that emerge correspond to organizational readiness archetypes (high-revenue mature deployers, high-revenue early adopters, low-revenue mature, low-revenue exploratory) regardless of which industry a row sits in.
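The k=4 split and its PCA projection can be sketched as follows. This is a minimal illustration, not the code from analyze.py: standardizing the three features before clustering is an assumption (the source does not say whether the clustering branch scales its inputs), the function name is hypothetical, and the toy data is random, so its clusters carry no persona meaning.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

CLUSTER_FEATURES = ["Revenue_USD", "AI_ROI_Percent", "AI_Maturity_Score"]


def cluster_and_project(df_ai: pd.DataFrame, k: int = 4, seed: int = 42):
    """K-means on the three clustering features, plus a 2D PCA projection."""
    # Assumption: scale first so Revenue_USD does not dominate the distances.
    X = StandardScaler().fit_transform(df_ai[CLUSTER_FEATURES])
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    pca = PCA(n_components=2)
    coords = pca.fit_transform(X)
    out = df_ai.copy()
    out["Cluster"] = labels
    out["PC1"] = coords[:, 0]
    out["PC2"] = coords[:, 1]
    return out, pca.explained_variance_ratio_


# Toy stand-in for the filtered panel (value ranges are assumptions).
rng = np.random.default_rng(42)
toy = pd.DataFrame(
    {
        "Revenue_USD": rng.lognormal(23, 1, 400),
        "AI_ROI_Percent": rng.uniform(10, 40, 400),
        "AI_Maturity_Score": rng.uniform(1, 5, 400),
    }
)
clustered, evr = cluster_and_project(toy)

# Cluster profile means, the kind of table the source prints to stdout.
profile = clustered.groupby("Cluster")[CLUSTER_FEATURES].mean()
print(profile.round(2))
print("explained variance ratio:", evr.round(3))
```

Reading the persona labels off `profile` (high or low revenue, high or low maturity) is a manual step; K-means itself returns only integer cluster ids.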
Companion to: Student Performance EDA — same pipeline shape, very different domain.