Methods
Data Sources
Our analysis is based on a comprehensive collection of over 44,000 working papers from two major economic research institutions:
National Bureau of Economic Research (NBER): 28,186 working papers.
Centre for Economic Policy Research (CEPR): 16,666 working papers.
These papers span from 1980 to 2023 and cover a wide range of economics subfields, providing a broad view of the research landscape. They encompass various empirical strategies, including Randomized Controlled Trials (RCTs), Instrumental Variables (IV), Difference-in-Differences (DiD), and Regression Discontinuity Designs (RDD).
AI-Powered Information Retrieval Process
We employed a multi-stage process using a Large Language Model (LLM) with custom prompts to extract and analyze information from the working papers. This approach allowed us to process each paper's full text efficiently and to extract the detailed, structured data necessary for our analysis.
1. Qualitative Summary Extraction
In the first stage, the AI model analyzed each paper to extract a curated summary of key elements, including:
Research Questions: As presented in the abstract, introduction, and full text.
Causal Claims: Detailed descriptions of all causal claims made, including effect sizes, methods used, and identification strategies.
Data Usage and Accessibility: Details on data sources and their accessibility.
Metadata: Authors' names, institutional affiliations, fields of study, and methods used.
This initial extraction provided a structured overview of each paper, serving as a foundation for deeper analysis.
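As a minimal illustration of this stage, the sketch below shows how such a structured summary might be requested from an OpenAI-style chat model. The prompt wording, model name, and field names are illustrative assumptions rather than our exact prompt or schema.

```python
# Minimal sketch of the Stage 1 extraction call (prompt, model, and fields are illustrative).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SUMMARY_PROMPT = (
    "You are an economics research assistant. From the working paper below, "
    "return a JSON object with keys: research_questions (list of strings), "
    "causal_claims (list of objects with cause, effect, effect_size, method, "
    "identification_strategy), data (sources, accessibility), and metadata "
    "(authors, affiliations, fields, methods).\n\nPaper text:\n"
)

def extract_summary(paper_text: str, model: str = "gpt-4o-mini") -> dict:
    """Ask the LLM for a structured qualitative summary of one working paper."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": SUMMARY_PROMPT + paper_text}],
        response_format={"type": "json_object"},  # request valid JSON back
    )
    return json.loads(response.choices[0].message.content)
```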
2. Extraction of Causal Claims
Using the summaries from the first stage, we:
Identified Detailed Causal Relationships: The AI model extracted cause and effect variables as described by the authors.
Determined Types of Causal Relationships: Classified relationships as direct effects, indirect effects, mediation, confounding, etc.
Recorded Causal Inference Methods: Documented methods used to establish each causal link, such as RCTs, IV, DiD, etc.
Constructed Edge Lists: Created an edge list for each paper in which each row represents a causal claim, forming the basis for constructing causal graphs.
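The per-paper edge list can be thought of as a list of claim records, one row per causal claim. The sketch below shows one possible row layout; the field names and example rows are hypothetical, not our exact schema.

```python
# Illustrative per-paper edge list: one row per causal claim (fields are hypothetical).
from dataclasses import dataclass

@dataclass
class CausalClaim:
    cause: str                 # source variable as described by the authors
    effect: str                # sink variable as described by the authors
    relation_type: str         # e.g. "direct", "indirect", "mediation", "confounding"
    method: str                # e.g. "RCT", "IV", "DiD", "RDD"
    effect_size: str | None    # effect size as reported, if any
    null_result: bool = False  # whether the paper reports a null result for this link

edge_list = [
    CausalClaim("minimum wage increase", "teen employment", "direct", "DiD", "-1.2 pp"),
    CausalClaim("job training program", "earnings", "direct", "RCT", "+8%"),
]
```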
3. Data Usage and Accessibility Extraction
We gathered structured information regarding:
Data Ownership: Identifying whether data came from private companies, public sector entities, or researchers.
Data Accessibility: Whether the data is freely accessible or restricted.
Data Granularity and Context: Details on units of analysis, temporal and geographical context.
This information allows us to assess trends in data usage and their implications for transparency and replicability in economic research.
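For instance, with one such record per paper, trends in data provenance can be summarized by year, as in the brief sketch below; the column names and values are hypothetical placeholders.

```python
# Hypothetical aggregation of data-provenance records by year (columns are illustrative).
import pandas as pd

records = pd.DataFrame({
    "year":       [2001, 2001, 2015, 2015, 2022],
    "data_owner": ["public sector", "researchers", "private company",
                   "public sector", "private company"],
    "accessible": [True, True, False, True, False],
})

# Share of papers relying on restricted (non-accessible) data, per year.
restricted_share = 1 - records.groupby("year")["accessible"].mean()
print(restricted_share)
```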
4. Mapping Variables to Standardized Economic Concepts
To systematically analyze and aggregate the causal claims, we:
Standardized Variables: Mapped the cause and effect variables to official Journal of Economic Literature (JEL) codes.
Semantic Embeddings: Used vector embeddings to compare variable descriptions with JEL code descriptions, assigning the most relevant code to each variable.
Constructed a Knowledge Graph: Created a directed network of JEL codes representing the causal relationships, allowing us to map and document the frontier of causal evidence over time.
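A minimal sketch of this mapping step is shown below, assuming OpenAI's embeddings endpoint and a small illustrative subset of JEL codes; in practice the full JEL classification and our extracted variable descriptions would be used, and the model name is an assumption rather than our exact configuration.

```python
# Sketch: assign each variable the most similar JEL code via embedding cosine similarity,
# then aggregate the standardized causal claims into a directed knowledge graph.
import numpy as np
import networkx as nx
from openai import OpenAI

client = OpenAI()

# Tiny illustrative subset of JEL codes and descriptions.
jel_codes = {
    "J31": "Wage Level and Structure; Wage Differentials",
    "J21": "Labor Force and Employment, Size, and Structure",
    "I21": "Analysis of Education",
}

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def assign_jel(variable: str, code_vecs: np.ndarray, codes: list[str]) -> str:
    v = embed([variable])[0]
    sims = code_vecs @ v / (np.linalg.norm(code_vecs, axis=1) * np.linalg.norm(v))
    return codes[int(np.argmax(sims))]  # most similar JEL code

codes = list(jel_codes)
code_vecs = embed(list(jel_codes.values()))

# Build a directed knowledge graph over standardized concepts.
G = nx.DiGraph()
for cause, effect, method in [("years of schooling", "hourly wages", "IV")]:
    u, v = assign_jel(cause, code_vecs, codes), assign_jel(effect, code_vecs, codes)
    G.add_edge(u, v, method=method)
```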
Figure Notes: This flowchart illustrates our AI-powered approach to retrieving, assessing, and mapping causal claims and contributions from academic papers. The process begins with academic papers, from which the LLM extracts fields such as Author, Publication, Institution, Field, Method, and Data/Code Availability. These aspects feed into two main branches: Identification and Causal Claims. The Identification branch focuses on elements such as Identification Strategy and Robustness Checks. The analysis extends to the precise measurements and contexts used, as well as the extrapolated concepts and contexts, leading to insights on the contributions claimed and policy recommendations. The Causal Claims branch analyzes the causal relationships identified in the papers, consisting of arrays of source (cause) and sink (effect) variables. The analysis operates across three levels. First, for each source or sink node, we consider the variable both as claimed by the author and as measured in the paper, including the type and owner of the data used. Second, for each source-sink edge, we examine the method(s) used to evidence the claim and whether a null result was found. Third, at the graph level, we assess the number of steps from cause to effect, the descriptions of these steps, and the overall complexity of the underlying narrative.
Retrieving Concepts Using AI
Figure Notes: This diagram illustrates our AI-driven methodology for analyzing and mapping causal linkages between economic concepts, represented by JEL (Journal of Economic Literature) codes. Starting from a corpus of working papers, we use a custom prompt and a pre-trained language model to extract causal relationships, identifying source (cause) and sink (effect) variables within the text. The extracted causal claims are parsed to generate directed linkages between JEL codes, forming a knowledge graph that aggregates these relationships across the corpus. We employ OpenAI's vector embeddings to numerically represent the descriptions of JEL codes and compute cosine similarity between these descriptions and each source or sink variable, assigning the most similar JEL code to each node. This approach enables us to construct a structured representation of causal evidence in economics over time, facilitating the exploration of interconnected economic concepts and the evolution of empirical research frontiers.
Our AI-driven methodology allows us to systematically analyze a vast corpus of economic research, uncovering trends in empirical methods, causal narratives, and data usage. By mapping causal claims to standardized concepts, we can explore the interconnectedness of economic ideas and how they have evolved over time.