
NLP · Topic Modeling

Topic Analysis of UN General Assembly Speeches, 1971–2018

Applying Latent Dirichlet Allocation and TopicGPT to nearly five decades of UNGA speeches to map the evolving thematic landscape of global diplomacy across 196 countries.

Methods: LDA & TopicGPT · Corpus: 7,507 speeches · Countries: 196 · Period: 1971–2018

7,507 speeches analyzed

47 years of coverage (1971–2018)

0.2896 best silhouette score (LDA clusters)

Overview

The speeches delivered at the United Nations General Assembly form one of the richest archives of modern diplomacy: nearly five decades of member states articulating national priorities, reacting to crises, and negotiating a shared global agenda. This project applies two complementary topic-modeling approaches, Latent Dirichlet Allocation (LDA) and TopicGPT, to 7,507 UNGA speeches from 196 countries between 1971 and 2018 to trace how the themes of international diplomacy have shifted over time.

Data and Preprocessing

The raw corpus consists of speech transcripts organized by country code, session, and year. Each file was cleaned to remove paragraph numbers, salutations, non-speech annotations, and formatting artifacts. After normalization, the text was tokenized into sentences to preserve grammatical context — a step that proved especially important for TopicGPT's contextual embeddings.
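The cleaning and sentence-splitting steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the regular expressions, the salutation pattern, and the naive sentence splitter are all assumptions, and a real run would more likely use a dedicated sentence tokenizer.

```python
import re

# Hypothetical patterns; the real transcripts may need different rules.
PARAGRAPH_NUMBER = re.compile(r"^\s*\d+\.\s+", flags=re.MULTILINE)
SALUTATION = re.compile(r"\b(?:Mr\.|Madam)\s+President,?\s*")
ANNOTATION = re.compile(r"\[[^\]]*\]")  # e.g. "[Applause]"
# Naive boundary: sentence-ending punctuation, whitespace, then a capital letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def clean_speech(raw: str) -> str:
    """Strip paragraph numbers, salutations, and bracketed annotations."""
    text = PARAGRAPH_NUMBER.sub("", raw)
    text = SALUTATION.sub("", text)
    text = ANNOTATION.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split cleaned text into sentences with a simple regex heuristic."""
    return [s for s in SENTENCE_BOUNDARY.split(text) if s]


raw = (
    "1. Mr. President, Haiti welcomes this debate. [Applause]\n"
    "2. Peace requires cooperation."
)
cleaned = clean_speech(raw)
sentences = split_sentences(cleaned)
```

The salutation is removed before splitting on purpose: abbreviations such as "Mr." would otherwise trip up a punctuation-based splitter.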

A custom pipeline grouped speeches by date and extracted metadata (country, session, year) into a structured Pandas DataFrame, enabling downstream comparative analysis across time periods and geographies. The figures below show how speech length and participation vary by country, and how the number of countries represented grew steadily over the study period.

Figure: Distribution of speech length and participation frequency across member states (boxplots by country).

Figure: Number of countries represented at the UN by year, 1971–2018 (bar chart), showing a steady increase.
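The metadata extraction described in this section might look something like the sketch below. The filename layout (`<country code>_<session>_<year>.txt`), the `speeches/` directory, and the function name are assumptions made for illustration; the source only states that files are organized by country code, session, and year.

```python
import re
from pathlib import Path

# Assumed filename layout: <3-letter country code>_<session>_<year>.txt
FILENAME = re.compile(r"^(?P<country>[A-Z]{3})_(?P<session>\d+)_(?P<year>\d{4})$")


def parse_metadata(path: str) -> dict:
    """Extract country, session, and year from a speech filename."""
    match = FILENAME.match(Path(path).stem)
    if match is None:
        raise ValueError(f"unexpected filename: {path}")
    return {
        "country": match["country"],
        "session": int(match["session"]),
        "year": int(match["year"]),
    }


# Hypothetical paths, used only to exercise the parser.
paths = ["speeches/TON_73_2018.txt", "speeches/HTI_26_1971.txt"]
records = sorted(
    (parse_metadata(p) for p in paths),
    key=lambda r: (r["year"], r["country"]),
)
```

Passing `records` to `pandas.DataFrame` would then give one row per speech, ready for grouping by year or country.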

Modeling: LDA vs. TopicGPT

LDA is a probabilistic generative model that represents each document as a mixture of latent topics and each topic as a distribution over words. On this corpus it ran in minutes, required no external API, and produced interpretable word-level topic distributions, which makes it well suited to rapid, broad thematic exploration.

TopicGPT leverages GPT-based contextual embeddings to cluster documents into topics, capturing semantic relationships that the bag-of-words assumption in LDA misses. The trade-offs were significantly longer processing time, the need for an OpenAI API key, and approximately $25 in usage costs. TopicGPT's output also arrived in non-uniform formats, requiring an additional cleaning pass to extract a consistent set of five keywords per entry before meaningful comparison was possible.
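The cleaning pass for non-uniform output could be approached along these lines. The input formats shown (a labelled, mixed-delimiter line and a numbered list) are invented examples, since the source does not show what the raw TopicGPT output looked like, and `extract_keywords` is a hypothetical helper name.

```python
import re


def extract_keywords(entry: str, n: int = 5) -> list[str]:
    """Normalize one raw entry into at most n lowercase, de-duplicated keywords."""
    # Drop a leading label such as "Keywords:" if one precedes the first delimiter.
    body = re.sub(r"^[^:,;|\n]*:\s*", "", entry, count=1)
    keywords: list[str] = []
    for part in re.split(r"[,;|\n]+", body):
        # Remove list numbering like "1." or "2)" and stray punctuation.
        word = re.sub(r"^\s*\d+[.)]\s*", "", part).strip(" .\t").lower()
        if word and word not in keywords:
            keywords.append(word)
    return keywords[:n]


# Invented examples of inconsistent formatting.
labelled = "Keywords: Climate change, Sea level; small islands | adaptation, finance, Climate change, oceans"
numbered = "1. disarmament\n2. nuclear weapons\n3. treaty"

labelled_keywords = extract_keywords(labelled)
numbered_keywords = extract_keywords(numbered)
```

Entries with fewer than five usable keywords simply return what they have, so a downstream step would need to decide whether to keep or discard them.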

Figure: Top 25 most frequent topics, LDA (left) vs. TopicGPT (right), shown as bar charts.

LDA surfaces broad recurring terms: "united," "nations," "international," "peace," "security," and "development" dominate. TopicGPT resolves these into specific initiatives and issues: Millennium Development Goals (322 occurrences), Human rights (321), Disarmament (296), Climate change (226), and Sustainable development (194). The contrast suggests that LDA excels at identifying macro-level rhetorical patterns, while TopicGPT tracks the concrete agenda items raised in the Assembly.
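Occurrence counts like those above can be produced with a simple frequency tally over per-speech topic labels. The sketch below uses a few invented labels rather than the project's data, and assumes labels are normalized for case and whitespace before counting.

```python
from collections import Counter

# Toy per-speech topic labels, invented for illustration.
speech_topics = [
    ["Human rights", "Disarmament"],
    ["Climate change", "Human rights"],
    ["human rights ", "Sustainable development"],
]

# Normalize labels so that spelling variants count as the same topic.
topic_counts = Counter(
    label.strip().lower()
    for labels in speech_topics
    for label in labels
)

# The project reports the top 25; the toy data only supports a shorter list.
top_topics = topic_counts.most_common(2)
```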

Word Clouds and Cluster Analysis

To visualize thematic differences, word clouds were generated from the top 100 topics of each model. LDA's cloud is anchored by sovereignty-era language; TopicGPT's reflects the post-Cold War shift toward human-security and development frameworks.

Figure: Word cloud of the top 100 LDA topics from UNGA speeches.

Figure: Word cloud of the top 100 TopicGPT topics from UNGA speeches.

KMeans clustering with k-means++ initialization was applied to TF-IDF vectors of both models' outputs. Two clusters were identified in each case. For LDA, the best configuration used 20 initializations and achieved a silhouette score of 0.2896; TopicGPT's best configuration used 10 initializations and scored 0.2139. On the silhouette scale of -1 to 1, both values are modest and indicate partially overlapping clusters. LDA's higher cohesion reflects its tighter word-level signal, while TopicGPT's semantically broader output produces softer cluster boundaries.
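The clustering step can be sketched with scikit-learn as below. For each point, the silhouette value is (b - a) / max(a, b), where a is the mean distance to points in its own cluster and b is the mean distance to points in the nearest other cluster; the reported score is the average over all points. The toy keyword strings, the anonymous country labels, and the comparison of 10 vs. 20 initializations via `n_init` are assumptions for illustration, not the project's data or exact search procedure.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy per-country keyword strings, invented for illustration.
country_keywords = {
    "A": "peace security council disarmament treaty",
    "B": "security disarmament nuclear treaty peace",
    "C": "climate islands ocean sea vulnerability",
    "D": "climate ocean islands adaptation sea",
    "E": "peace council security nuclear",
    "F": "sea islands climate vulnerability adaptation",
}

# TF-IDF turns each keyword string into a weighted term vector.
X = TfidfVectorizer().fit_transform(list(country_keywords.values())).toarray()

scores = {}
for n_init in (10, 20):
    model = KMeans(n_clusters=2, init="k-means++", n_init=n_init, random_state=0)
    labels = model.fit_predict(X)
    # Mean silhouette over all points; higher means tighter, better-separated clusters.
    scores[n_init] = silhouette_score(X, labels)

best_n_init = max(scores, key=scores.get)
```

On real data, the same loop could also vary `n_clusters`, which is one way a two-cluster solution might be selected.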

Figure: LDA clusters, PCA projection by country (scatter plot).

Figure: TopicGPT clusters, PCA projection by country (scatter plot).

The LDA cluster plot groups major powers (USA, UK, Russia, China) together around global-governance themes, while a second cluster captures countries whose speeches center on regional or developmental concerns (Iraq, South Sudan, Haiti). TopicGPT produces a different alignment: Russia, China, and the USA cluster around international-relations discourse, while small island states (Tonga, Nauru) form a separate group centered on climate vulnerability and regional sovereignty. The two models are thus complementary — LDA maps the structural cleavages of global politics; TopicGPT maps the issue-level ones.

Key Findings

Together, LDA and TopicGPT tell a coherent story about nearly five decades of UNGA discourse. Both models point to peace, security, and international cooperation as enduring themes. TopicGPT additionally highlights the mainstreaming of climate change, human rights, and sustainable development, topics that barely registered in the 1970s speeches but dominate the post-2000 sessions. The dual-clustering approach suggests that geopolitical blocs (global powers vs. developing states) are legible from the text alone, without any prior labeling.

For policymakers, the implication is practical: automated topic modeling at this scale could help surface diplomatic shifts before they crystallize into treaty language, providing an early-signal layer for foreign-policy analysis. For researchers, the gap between LDA's broad themes and TopicGPT's specific entities points to a productive direction: combining probabilistic models with contextual embeddings can yield richer thematic maps than either approach alone.