Job Market Paper
Analysts' Belief Formation in Their Own Words (This version: November 2025) [SSRN Link]
Invited for Dual Submission at Journal of Financial Economics
Best Dissertation Proposal in AI and Finance, sponsored by Texas A&M University, Mays Business School
First Place, Chicago Quantitative Alliance (CQA) Academic Competition
Best Paper Award, The 4th Hong Kong Conference for Fintech, AI, and Big Data in Business
Selected presentations: Oxford SMLFin Seminar, MFA 2025, Eastern Finance Association 2025, NBER Behavioral Finance 2025, The 9th MPWZ-CEPR Text-As-Data Workshop, FIRS 2025 (PhD Session), The 4th Hong Kong Conference for Fintech, AI, and Big Data in Business, The 13th Helsinki Finance Summit on Investor Behavior, European Finance Association 2025, CQA Annual Conference, Texas A&M Mays Business School, AFA 2026
Abstract:
I study how equity analysts form subjective beliefs about firms' earnings using their own written text from over 1.1 million equity research reports. Using large language models, I identify the topics discussed by analysts and represent topic-level information using textual embeddings. I introduce a novel text-instrumented Coibion-Gorodnichenko regression to uncover analysts' over- and underreaction to specific information. Using this new procedure, I find pervasive underreaction in short-term earnings forecasts across topics, whereas overreaction in long-term forecasts is concentrated in qualitative, intangible topics rather than quantitative, statistical ones. Revisions driven by qualitative information in long-term earnings forecasts strongly predict future stock returns. Finally, I use textual data to investigate the behavioral mechanisms underlying the documented misreactions. The empirical results suggest that overconfidence is an important driver of the overreaction to qualitative information, while herding appears to be important in explaining the overall underreaction observed in short-term forecasts.
Working Papers
APT or “AIPT”? The Surprising Dominance of Large Factor Models (This version: July 2025)
Coauthors: Antoine Didisheim, Bryan Kelly, and Semyon Malamud
Previously titled "Complexity in Factor Pricing Model"
Selected presentations (* by coauthor): NBER Big Data and Securities Markets 2023*, AFA 2024
Abstract:
We introduce artificial intelligence pricing theory (AIPT). In contrast with the APT's foundational assumption of a low-dimensional factor structure in returns, the AIPT conjectures that returns are driven by a large number of factors. We first verify this conjecture empirically and show that nonlinear models with an exorbitant number of factors (many more than the number of training observations or base assets) are far more successful in describing the out-of-sample behavior of asset returns than simpler standard models. We then theoretically characterize the behavior of large factor pricing models, from which we show that the AIPT's "many factors" conjecture faithfully explains our empirical findings, while the APT's "few factors" conjecture is contradicted by the data.
On the Testability of the Anchor Words Assumption in Topic Models [Online Appendix] (This version: August 2024)
Revise and Resubmit, Quantitative Economics (Special Issue on Economics and AI/ML)
Coauthors: Simon Freyaldenhoven, Dingyi Li, and José Luis Montiel Olea
Selected presentations (* by coauthor): ESIF Machine Learning Conference 2024*
Abstract:
Topic models are a simple and popular tool for the statistical analysis of textual data. Their identification and estimation are typically enabled by assuming the existence of anchor words; that is, words that are exclusive to specific topics. In this paper we show that the existence of anchor words is statistically testable: there exists a hypothesis test with correct size that has nontrivial power. This means that the anchor-word assumption cannot be viewed simply as a convenient normalization. Central to our results is a simple characterization of when a column-stochastic matrix with known nonnegative rank admits a separable factorization. We test for the existence of anchor words in two different datasets derived from the transcripts of the meetings of the Federal Open Market Committee (FOMC), the body of the Federal Reserve System that sets monetary policy in the United States, and reject the null hypothesis that anchor words exist in one of them.
Shrinkage Alignment in High-Dimensional Portfolios (This version: October 2025)
Coauthor: Mo Pourmohammadi
This paper subsumes "The Double-edged Sword of Data Mining: Implications on Asset Pricing and Information Efficiency"
Runner-up, the Kuldeep Shastri Outstanding Doctoral Student Paper, Eastern Finance Association 2024
Selected presentations: WUSTL EGSC 2023, Eastern Finance Association 2024, FMA Applied Finance Conference 2024, NFA (PhD Session) 2024
Abstract:
We study how shrinkage affects portfolio efficiency when the number of assets approaches or exceeds the sample size. Standard methods such as ridge impose uniform shrinkage, treating all assets as ex-ante identical and creating inefficiency when profitability is heterogeneous. Empirically, this “one-size-fits-all” design produces a hump-shaped relationship between model complexity and out-of-sample Sharpe ratios: adding assets can paradoxically reduce performance. We introduce shrinkage alignment, showing that efficiency requires matching shrinkage strength to each asset’s true profitability. Building on this insight, we propose Sharpe Ratio Shrinkage (SRS), a data-driven approach that aligns shrinkage intensity with empirical Sharpe ratios. SRS outperforms conventional methods under profitability heterogeneity and restores the "virtue of complexity" in high-dimensional portfolio construction.
What Drives Trading in Financial Markets? A Big Data Perspective (Updated draft coming soon)
Coauthor: Anton Lines
Selected presentations (* by coauthor): CICF 2023*, FutFinInfo 2024*, AFA 2025*
Abstract:
We train neural networks to predict the trading volume of institutional investors, allowing us to evaluate the relative importance of hundreds of public information signals. On average, macroeconomic signals and announcement date indicators contribute the most to the model’s explanatory power (58%), while classical firm-level characteristics account for only 35%. We document large heterogeneity in which information sources different institutions pay attention to, and in their responses to the same information. Our results suggest that limited attention and differences of opinion—particularly about the state of the macroeconomy—are the two most salient mechanisms driving institutional trading decisions.
Publications
The Macroeconomic Effects of Fiscal Policy Uncertainty around the World
AEA Papers and Proceedings, May 2025, Vol 115, p.182-187
Coauthors: Gee Hee Hong and Anh Dinh Minh Nguyen
Selected presentations (* by coauthor): IMF, SITE 2024*, AEA 2025, EWSC 2025*
Abstract:
How adverse is the impact of fiscal policy uncertainty on economic and financial variables? To answer this question, we construct a novel cross-country database of news-based fiscal policy uncertainty indicators. Importantly, we track fiscal events that attract global attention, which we refer to as "global" fiscal policy uncertainty. We find that heightened fiscal policy uncertainty triggers contractionary effects, lowering industrial production in both advanced and emerging market economies. It also raises sovereign borrowing costs, generates synchronous movements in global financial variables including risk aversion, and strengthens the US dollar, even after accounting for US monetary policy shocks.
Robust Machine Learning Algorithms for Text Analysis
Quantitative Economics, Nov 2024, Vol 15, Issue 4, p.939-970
Abstract:
We study the Latent Dirichlet Allocation model, a popular Bayesian algorithm for text analysis. Our starting point is the generic lack of identification of the model’s parameters, which suggests that the choice of prior matters. We then characterize by how much the posterior mean of a given functional of the model’s parameters varies in response to a change in the prior, and we suggest two algorithms to approximate this range. Both of our algorithms rely on obtaining multiple Nonnegative Matrix Factorizations of either the posterior draws of the corpus’ population term-document frequency matrix or of its sample analogue. The key idea is to maximize/minimize the functional of interest over all these nonnegative matrix factorizations. To illustrate the applicability of our results, we revisit recent work on the effects of increased transparency on discussions regarding monetary policy decisions in the United States.
Work in Progress
Motivated Reasoning and the Confirmation Bias: Evidence from Equity Analysts
Coauthors: Ben Matthies and Kaushik Vasudevan
Machine Learning Forecasts as Rational Expectation Benchmarks
Coauthor: Mo Pourmohammadi
Long-term Predictability via Machine Learning
Coauthors: Antoine Didisheim and Mo Pourmohammadi