Working Papers

The Double-edged Sword of Data Mining: Implications for Asset Pricing and Information Efficiency

Abstract:

Does data mining always increase price efficiency? Not necessarily. I incorporate data mining into a standard asset pricing model and identify a novel cost of complexity that arises endogenously from data mining. When a data miner explores alternative data, she faces a training history that is scarce relative to the set of potential predictors (increasing complexity) and growing difficulty in extracting useful signals (decreasing returns to data efficacy). Together, the cost of complexity and the decreasing returns to data efficacy imply a finite optimal level of data mining, beyond which additional data mining lowers price informativeness. Empirically, I provide evidence of decreasing returns to data efficacy in the context of the "factor zoo," and I show in a difference-in-differences setting that the release of satellite data reduces price informativeness.

Abstract:

We theoretically characterize the behavior of machine learning asset pricing models. We prove that expected out-of-sample model performance, in terms of SDF Sharpe ratio and test-asset pricing errors, is improving in model parameterization (or "complexity"). Our empirical findings verify the theoretically predicted "virtue of complexity" in the cross-section of stock returns. Models with an extremely large number of factors (more than the number of training observations or base assets) outperform simpler alternatives by a large margin.

Abstract:

We use deep Bayesian neural networks to investigate the determinants of trading activity in a large sample of institutional equity portfolios. Our methodology allows us to evaluate hundreds of potentially relevant explanatory variables, estimate arbitrary nonlinear interactions among them, and aggregate them into interpretable categories. Deep learning models predict trading decisions with up to 86% accuracy out-of-sample, with market liquidity and macroeconomic conditions together accounting for most (66-91%) of the explained variance. Stock fundamentals, firm-specific corporate news, and analyst forecasts have comparatively low explanatory power. Our results suggest that market microstructure considerations and macroeconomic risk are central to understanding financial trading patterns.

On the Testability of the Anchor Words Assumption in Topic Models (with Simon Freyaldenhoven, Dingyi Li, and José Luis Montiel Olea)

Abstract:

Topic models are a simple and popular tool for the statistical analysis of textual data. Their identification and estimation is typically enabled by assuming the existence of anchor words; that is, words that are exclusive to specific topics. In this paper we show that the existence of anchor words is statistically testable: there exists a hypothesis test with correct size that has nontrivial power. This means that the anchor-word assumption cannot be viewed simply as a convenient normalization. Central to our results is a simple characterization of when a column-stochastic matrix with known nonnegative rank admits a separable factorization. We test for the existence of anchor words in two different datasets derived from the transcripts of the meetings of the Federal Open Market Committee (FOMC), the body of the Federal Reserve System that sets monetary policy in the United States, and reject the null hypothesis that anchor words exist in one of them.

Robust Machine Learning Algorithms for Text Analysis (with José Luis Montiel Olea and James Nesbit)

Conditionally Accepted at Quantitative Economics

Abstract:

We study the Latent Dirichlet Allocation model, a popular Bayesian algorithm for text analysis. Our starting point is the generic lack of identification of the model’s parameters, which suggests that the choice of prior matters. We then characterize by how much the posterior mean of a given functional of the model’s parameters varies in response to a change in the prior, and we suggest two algorithms to approximate this range. Both of our algorithms rely on obtaining multiple Nonnegative Matrix Factorizations of either the posterior draws of the corpus’ population term-document frequency matrix or of its sample analogue. The key idea is to maximize/minimize the functional of interest over all these nonnegative matrix factorizations. To illustrate the applicability of our results, we revisit recent work on the effects of increased transparency on discussions regarding monetary policy decisions in the United States.
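The range-of-posterior-means idea above can be illustrated with a minimal sketch: re-run nonnegative matrix factorization from many starting points on a term-document frequency matrix and record the spread of a functional of interest across the resulting factorizations. Everything here is an illustrative assumption (the toy matrix, the particular functional, the number of restarts), not the paper's actual algorithm or data.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical toy term-document frequency matrix (6 terms x 8 documents),
# standing in for the corpus' term-document frequency matrix in the abstract.
rng = np.random.default_rng(0)
V = rng.random((6, 8))

def functional(W, H):
    # Illustrative functional of the factorization: the share of topic 0's
    # (column-normalized) mass placed on term 0.
    W = W / W.sum(axis=0, keepdims=True)
    return W[0, 0]

# Obtain multiple NMFs of the same matrix from different random initializations,
# then take the min/max of the functional over them.
values = []
for seed in range(20):
    model = NMF(n_components=2, init="random", random_state=seed, max_iter=500)
    W = model.fit_transform(V)
    H = model.components_
    values.append(functional(W, H))

print(f"range of the functional across factorizations: "
      f"[{min(values):.3f}, {max(values):.3f}]")
```

A nontrivial gap between the minimum and maximum reflects the non-uniqueness of the factorization, which is the same lack of identification that makes the posterior mean sensitive to the prior.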

Abstract:

This paper studies the social value of eliminating mispricing in the US stock markets. We show that the mispricing of an asset, relative to a candidate asset pricing model, equals the marginal social value in an economy where that asset pricing model provides a correct description of prices. We further show how to use traditional time-series alphas to compute mispricing and the welfare gain from eliminating it. Using an instrumented factor model, we find empirically that the mispricing relative to the CAPM translates, on average, into a welfare cost of about 6.2% of total market capitalization in the US stock market, a cost that rises to more than 15% during the Tech Bubble and the Global Financial Crisis, suggesting a large potential gain from active management that eliminates such mispricing. We find that stocks with extreme risk measurements contribute more to the total welfare gain. However, trading strategies based on a single investment style trade away only a small portion of the mispricing in the cross-section, and thus provide only a small welfare gain.

Work in Progress