Introduction: When AI Outpaces VCs
In September 2025, Decrypt published a headline that resonated across the venture and academic worlds: "AI now predicts startup success better than VCs." The article referenced VCBench, a benchmark where large language models like GPT-4o and DeepSeek-V3 outperformed top-tier venture firms in identifying which founders were likely to succeed.
But at Wharton, in collaboration with PitchBob, the focus shifted from prediction to improvement. Instead of asking whether AI could foresee startup outcomes, the experiment explored a more pragmatic question:
Can AI actually help founders raise their performance before they ever meet an investor?
Background & Case Design: Three Steps to a Pitch
Wharton’s entrepreneurship courses guide students through three structured pitch iterations:
- One-minute pitch: A single slide and a single speaker. The goal is message clarity — a crisp articulation of the problem and proposed solution.
- Two-minute pitch: The story expands — students must explain market context, competition, and early customer insights.
- Four-minute pitch: A full investor presentation, including go-to-market (GTM) strategy, financials, unit economics, and a funding ask.
PitchBob was integrated not as a grading assistant but as an AI pitch coach that students could use before their live sessions. For each stage, participants uploaded their decks, received structured scores (1–10 scale) and narrative feedback, and could iterate their slides before presenting to professors.
The feedback rubric evolved with each stage — mirroring how founders mature their storytelling:
- Stage 1 (v1): Focused on clarity, slide design, and narrative flow.
- Stage 2 (v2): Added value proposition, differentiation, and basic monetization.
- Stage 3 (v3): Included GTM, traction, team fit, fundraising logic, and risk analysis.
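For programs that want to replicate this setup, the staged rubric can be expressed as a simple configuration that drives the review prompt. The sketch below is a minimal, hypothetical encoding in Python: the criterion names come from the stages above, but the data structure, scale, and the `build_review_prompt` helper are illustrative assumptions, not PitchBob's actual implementation.

```python
# Hypothetical encoding of the three-stage rubric (criterion names taken from
# the course stages above; the structure itself is illustrative).
RUBRIC = {
    "v1_one_minute": [
        "clarity", "slide_design", "narrative_flow",
    ],
    "v2_two_minute": [
        "clarity", "slide_design", "narrative_flow",
        "value_proposition", "differentiation", "basic_monetization",
    ],
    "v3_four_minute": [
        "clarity", "slide_design", "narrative_flow",
        "value_proposition", "differentiation", "basic_monetization",
        "gtm", "traction", "team_fit", "fundraising_logic", "risk_analysis",
    ],
}

def build_review_prompt(stage: str, deck_text: str) -> str:
    """Assemble a review prompt asking for a 1-10 score per criterion."""
    criteria = ", ".join(RUBRIC[stage])
    return (
        f"Review this pitch deck for stage '{stage}'. "
        f"Score each criterion from 1 to 10 and give one actionable "
        f"recommendation per criterion: {criteria}.\n\n{deck_text}"
    )

# Usage: build_review_prompt("v1_one_minute", "Problem: ... Solution: ...")
```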
The model’s role was not to judge, but to guide — identifying weak spots before human evaluation.
Data and Methodology
Across three stages, the dataset contained:
- v1 (One-minute) — 27 submissions, 332 feedback items
- v2 (Two-minute) — 34 submissions, 186 feedback items
- v3 (Four-minute) — 31 submissions, 203 feedback items
Each dataset contained numeric scores and qualitative recommendations.
Average scores dropped over time — from 9.8 → 6.5 → 5.9 — reflecting how rubrics became stricter and expectations more complex. This isn’t a decline in performance but evidence of progressive rigor: students transitioned from basic communication skills to multi-dimensional business articulation.
Feedback texts were analyzed across 14 categories:
Problem clarity, Market sizing, Competition, GTM/Sales, Pricing/Model, Traction/Metrics, Unit Economics, Team/Founder-Market Fit, Product/Demo, Storytelling, Ask/Fundraising, Regulatory/IP, Data/AI, Risks/Assumptions.
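One lightweight way to reproduce this kind of categorization is a keyword pass over the feedback text. The sketch below assumes a simple keyword-matching approach with an illustrative (and far from exhaustive) keyword map covering a few of the 14 categories; the study's actual classification method may well differ.

```python
from collections import Counter

# Illustrative keyword map for a subset of the 14 categories.
CATEGORY_KEYWORDS = {
    "Storytelling": ["narrative", "story", "slide", "visual", "flow"],
    "Pricing/Model": ["pricing", "revenue model", "monetization", "subscription"],
    "Traction/Metrics": ["traction", "pilot", "kpi", "users", "retention"],
    "Competition": ["competitor", "competitive", "alternative", "landscape"],
    "GTM/Sales": ["go-to-market", "gtm", "sales channel", "distribution"],
}

def tag_feedback(items: list[str]) -> Counter:
    """Count how many feedback items touch each category."""
    counts = Counter()
    for text in items:
        lowered = text.lower()
        for category, keywords in CATEGORY_KEYWORDS.items():
            if any(kw in lowered for kw in keywords):
                counts[category] += 1
    return counts

# Example with two made-up feedback items:
feedback_items = [
    "Slides are overcrowded; tighten the narrative.",
    "Pricing is left implicit; state the revenue model.",
]
print(tag_feedback(feedback_items))
```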
Findings: Where Student Pitches Break Down
Most frequent problem areas
- Storytelling and visuals (47%) — Overcrowded slides, lack of clear narrative structure.
- Pricing & business model (40%) — Monetization often left vague or implicit.
- Traction (38%) — Few presented pilots, KPIs, or user validation.
- Product/Demo (33%) — Abstract ideas, few tangible interfaces.
- Competition (31%) — "No competitors" remained a common and weak claim.
- GTM (6%) — Go-to-market consistently underdeveloped, especially among first-time founders.
What drags scores down
- Overclaiming AI (Δ = -0.53) — Projects claiming "AI-powered" without proof consistently scored lower.
- Competition (Δ = -0.28) — Poorly mapped or missing landscape.
- No demo (Δ = -0.26) — Lack of product tangibility reduced credibility.
What lifts scores up
- Founder-market fit (Δ = +0.19) — Teams referencing domain experience earned higher scores.
- Market sizing (Δ = +0.13) — Clear, sourced TAM/SAM/SOM improved trust.
- Basic unit economics (Δ = +0.06) — Even rough LTV/CAC signals business awareness.
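The Δ values in the two lists above can be read as mean-score differences: the average score of submissions flagged for an issue minus the average score of those without it. A minimal sketch of that computation, assuming a flat table of per-submission records with hypothetical field names:

```python
from statistics import mean

# Hypothetical records: one row per submission, with boolean issue flags.
submissions = [
    {"score": 5.5, "overclaims_ai": True,  "has_demo": False},
    {"score": 7.0, "overclaims_ai": False, "has_demo": True},
    {"score": 6.2, "overclaims_ai": False, "has_demo": False},
]

def score_delta(rows: list[dict], flag: str) -> float:
    """Mean score with the flag set minus mean score without it."""
    with_flag = [r["score"] for r in rows if r[flag]]
    without_flag = [r["score"] for r in rows if not r[flag]]
    return mean(with_flag) - mean(without_flag)

print(score_delta(submissions, "overclaims_ai"))  # negative => the issue drags scores down
```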
Practice effect
The correlation between feedback volume and score (r ≈ 0.23) suggests that more specific, actionable AI feedback was associated with measurable improvement in later submissions. The learning loop worked.
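A minimal sketch of this practice-effect check, assuming per-student pairs of feedback volume on one submission and the score earned on the next; the pairing logic and the numbers below are illustrative, not the study's data:

```python
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Hypothetical pairs: feedback items received on one submission vs. the
# score earned on the student's next submission.
feedback_counts = [4, 9, 12, 6, 15, 8]
next_scores = [5.1, 6.0, 6.8, 5.4, 7.2, 6.1]

r = correlation(feedback_counts, next_scores)
print(f"r ≈ {r:.2f}")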
Iteration 3: Advanced Pitches and Coaching Impact
By the third round, the AI rubric included seven key dimensions:
Slide design, Market & solution clarity, Competition & model, Product focus, Experiments & next steps, Funding needs, Storytelling & visuals.
Average performance patterns:
- Visual and storytelling quality improved by +2.1 points compared to v2.
- Funding needs and next-step planning remained the weakest areas — median scores under 4.
- Market logic and business modeling became the main differentiators: teams with structured revenue slides or clear pricing models consistently scored above 7.
Qualitative feedback at this stage revealed new behaviors:
- Students started referencing data sources, pilots, and partnerships more frequently.
- AI’s early comments ("show user flow," "quantify pain point") were visibly reflected in revised decks.
- Professors later confirmed that students arrived more prepared, saving 15–20 minutes of class time per session on basic clarifications.
This indicates that AI coaching didn’t replace teaching — it shifted human focus toward higher-order discussion: feasibility, growth strategy, and investor logic.
What This Means for Universities and Accelerators
The three-round Wharton experiment shows a scalable model for AI-augmented entrepreneurial education:
- Structured iteration builds habits: Students internalize the feedback cycle and learn to think like investors, not just presenters.
- AI democratizes mentorship: Every student gets detailed, VC-level commentary, regardless of faculty bandwidth.
- Data drives curriculum refinement: Aggregated AI feedback highlights where classes struggle most (GTM, pricing, metrics), guiding professors on where to deepen instruction.
- Accelerators can pre-screen smarter: The same system can triage hundreds of early applicants, flagging which founders have strong storytelling, credible GTM, or financial realism.
- Investors gain cleaner pipelines: Startups entering demo days after AI-based coaching deliver higher-signal materials, saving time and reducing screening bias.
Key Recommendations
For universities:
- Integrate AI feedback loops before class reviews. Use the model to standardize expectations and shorten feedback cycles.
- Add reflection checkpoints where students compare their AI and professor feedback — to understand judgment criteria.
For accelerators:
- Use pre-AI-reviewed decks as the new baseline for admission.
- Track improvement over multiple cohorts to identify systemic weaknesses (e.g., GTM) and adapt curriculum focus.
For founders:
- Treat AI as a "practice investor." Revise, re-upload, and learn pattern recognition — what consistently boosts credibility.
Conclusion
The Wharton × PitchBob experiment demonstrates that AI can evolve from a static evaluator into a dynamic coaching companion.
Across three pitch cycles, students didn’t just learn how to "present better" — they learned how to think more like founders: evidence-based, structured, and investor-aware.
For universities and accelerators, this model offers a replicable framework: scalable, data-driven, and proven to improve engagement and outcomes.
In short, AI feedback transforms pitching from a one-off exam into a guided rehearsal process — making every iteration a step closer to the real investor conversation.
We are deeply grateful to Professor Sergey Netessine and the Wharton School for the opportunity to collaborate on this pioneering experiment. Universities interested in exploring similar AI-driven feedback programs for their entrepreneurship courses are warmly invited to connect with us to co-create the next generation of learning experiences.


