Kapoor et al. (2024): Qualitative Insights Tool (QualIT): LLM Enhanced Topic Modeling
💡 1. Research Question and Research Gap
To what extent can LLM-based topic modeling outperform LDA and BERTopic in extracting coherent and diverse topics from large text datasets?
To answer this question, the research team proposes the Qualitative Insights Tool (QualIT), a novel LLM-based topic modeling framework, and conducts comparative experiments against LDA and BERTopic on the 20 NewsGroups dataset.
🌊 2. Introduction
Topic modeling is a representative NLP technique for automatically extracting latent topics from documents. Traditional methods like LDA generate topics based on word co-occurrence patterns, but they have limitations in capturing subtle contextual or semantic nuances within sentences. Recently, approaches leveraging large language models (LLMs) such as BERT and GPT have enabled more sophisticated semantic understanding, and embedding-based techniques like BERTopic have gained attention. However, BERTopic also has limitations, as it typically assigns only a single topic to each document, making it insufficient for capturing multiple topics within a single document. To address this issue, this paper proposes LLM-enhanced Topic Modeling, which combines the semantic understanding capabilities of LLMs with clustering techniques.
🛠️ 3. Data and Methodology
📂 Dataset
- 20 NewsGroups: A public dataset containing 20,000 news articles
- Preprocessing: Lowercasing, stopword removal, and lemmatization
🧠 Models for Comparison
- LDA (Latent Dirichlet Allocation): Implemented using Gensim with default parameters
- BERTopic: Utilizes embedding vectors and HDBSCAN-based clustering
- QualIT (LLM-based approach): Based on Claude-2.1 with parameters `top_k = 50`, `top_p = 0`
🔍 4. LLM-Enhanced Topic Modeling Method
🔹 Step 1: Extract Keyphrases
Each document is processed by an LLM to extract multiple meaningful keyphrases. Unlike traditional models (e.g., LDA, BERTopic) that assume a single topic per document, this step captures multiple thematic cues within one text. The extracted phrases serve as essential features for topic classification.
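As a minimal illustration of this step, the template below sketches the kind of instruction the LLM might receive (a hypothetical prompt; the paper uses Claude-2.1 and does not publish its exact wording):

```python
# Hypothetical prompt template for Step 1 (illustration only; the paper's
# actual prompt is not published).
def keyphrase_prompt(document: str) -> str:
    return (
        "Extract 5 to 10 key phrases that capture the distinct topics "
        "discussed in the following document. "
        "Return them as a single comma-separated line.\n\n"
        f"Document:\n{document}"
    )
```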
🔹 Step 2: Hallucination Verification
A coherence score based on cosine similarity is calculated to assess how well each keyphrase aligns with the document content. Keyphrases with low scores are identified as “AI hallucinations” and removed, leaving only contextually valid and reliable terms for further analysis.
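A minimal sketch of this verification step, assuming sentence-transformer embeddings and an illustrative similarity threshold (the paper does not specify its threshold):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def filter_hallucinations(document, keyphrases, threshold=0.10):
    """Drop keyphrases whose cosine similarity to the document is low."""
    doc_vec = encoder.encode(document, normalize_embeddings=True)
    kp_vecs = encoder.encode(keyphrases, normalize_embeddings=True)
    # With normalized embeddings, the dot product equals cosine similarity.
    return [kp for kp, vec in zip(keyphrases, kp_vecs)
            if float(np.dot(doc_vec, vec)) >= threshold]
```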
🔹 Step 3: Clustering (Main & Sub-Topics)
Refined keyphrases are grouped using a K-Means clustering algorithm.
Clustering is conducted in two stages:
- Main Topics: Documents with similar keyphrase patterns are grouped to identify overarching themes.
- Sub-Topics: Within each main topic, a second clustering step is applied to extract more detailed and specific themes.
For each sub-cluster, the LLM is prompted again to analyze compressed content and generate representative sub-topics.
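A minimal sketch of the two-stage clustering, assuming precomputed keyphrase embeddings and illustrative cluster counts (the paper does not fix these values):

```python
import numpy as np
from sklearn.cluster import KMeans

def two_stage_clusters(embeddings, n_main=10, n_sub=3, seed=42):
    """Stage 1: main topics. Stage 2: sub-topics within each main cluster."""
    main = KMeans(n_clusters=n_main, random_state=seed).fit_predict(embeddings)
    sub = np.empty_like(main)
    for m in np.unique(main):
        idx = np.where(main == m)[0]
        k = min(n_sub, len(idx))  # guard against clusters smaller than n_sub
        sub[idx] = KMeans(n_clusters=k, random_state=seed).fit_predict(embeddings[idx])
    return main, sub
```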
✅ 5. Experimental Setup & Results
🔹 Dataset and Model Configuration
- The experiments were conducted using the 20 NewsGroups dataset, which contains approximately 20,000 news articles categorized into 20 topics.
- LDA and BERTopic were run with default parameters, while the LLM-based method (QualIT) used the Claude-2.1 model.
- Key parameters: `top_k = 50`, `top_p = 0` (optimized for document-level semantic coherence)
🔹 Evaluation Metrics
- Topic Coherence: Measures how semantically similar the top-ranked words in each topic are.
- Topic Diversity: The percentage of unique words across all topic outputs (from 0 to 1).
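Topic Diversity in particular is simple to compute; a minimal sketch, assuming each topic is represented by a list of its top words:

```python
def topic_diversity(topic_words):
    """Fraction of unique words among all top words across topics."""
    all_words = [w for topic in topic_words for w in topic]
    return len(set(all_words)) / len(all_words)

# Example: two topics sharing one word -> 5 unique out of 6 total = 0.833
print(topic_diversity([["game", "play", "fun"], ["price", "game", "scam"]]))
```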
🔹 Table 3: Topic Coherence & Diversity by Number of Topics
This table compares the performance of three topic modeling methods — LDA, BERTopic, and QualIT — in terms of Topic Coherence and Topic Diversity, evaluated across different numbers of topics (10 to 50).
✅ LDA
- Best coherence: 57.0% at 20 topics
- Best diversity: 79.1% at 40 topics
- Average performance: 51.4% coherence, 72.7% diversity
- As a traditional approach, LDA shows the lowest scores overall in both coherence and diversity.
✅ BERTopic
- Best coherence: 65.0% at 20 topics
- Best diversity: 88.8% at 40 topics
- Average performance: 61.0% coherence, 86.3% diversity
- With embedding-based clustering, BERTopic outperforms LDA and offers moderate diversity and coherence.
✅ QualIT
- Best coherence: 70.0% at 20 topics
- Best diversity: 95.5% at 20 topics
- Average performance: 64.4% coherence, 93.7% diversity
- QualIT consistently performs the best across all topic counts. It is especially optimized for 20 topics, which matches the dataset’s ground-truth structure.
👥 Table 4: Human Evaluation Agreement
This table shows how often human evaluators (4 total) agreed on categorizing topic outputs from each model into one of the 20 ground-truth classes.
✅ QualIT
- 80% agreement with at least 2 evaluators
- 50% agreement with at least 3 evaluators
- 35% full agreement (all 4 evaluators)
- QualIT provides topic outputs that are much clearer and easier for humans to interpret and classify consistently.
✅ BERTopic & LDA
- LDA: 50% (2 evaluators), 25% (3 evaluators), 20% (all 4)
- BERTopic: 45%, 25%, 20% respectively
- Both LDA and BERTopic yield lower agreement scores, indicating more ambiguous or inconsistent topic groupings.
⚠️ 6. Limitations & Future Work
- Processing Time
- QualIT takes approximately 2–3 hours per run
- BERTopic completes in about 30 minutes
→ Reducing runtime is essential for large-scale deployments.
- Clustering Algorithm Improvements
- Currently uses K-Means, which requires predefined cluster numbers
- HDBSCAN, used in BERTopic, may provide more adaptive and nuanced topic boundaries
- Future work may explore replacing K-Means with HDBSCAN to improve granularity and accuracy
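As a rough illustration of that direction, HDBSCAN can be swapped in for K-Means without choosing a cluster count up front (a sketch, assuming the same keyphrase embeddings; `min_cluster_size` is an illustrative value):

```python
import hdbscan  # pip install hdbscan

def hdbscan_clusters(embeddings, min_cluster_size=15):
    """Density-based clustering: the number of clusters is inferred from the
    data, and points labeled -1 are treated as noise."""
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(embeddings)
```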
🧾 7. Conclusion
- QualIT integrates the semantic understanding of LLMs with the structural power of clustering algorithms, delivering more coherent and diverse topics than traditional methods like LDA and BERTopic.
- Especially useful in qualitative text analysis scenarios where interpretability and depth matter.
- Future directions may include:
- Multilingual support
- Efficiency improvements
- Advanced clustering
A Comparative Study in Topic Modeling
📦 1. Dataset and Preprocessing
🔹 Source
`Video_Games.jsonl` from the Amazon Reviews dataset
🔹 Preprocessing
- Kept only 1-star reviews with `helpful_votes >= 1`
- Text length between 20 and 200 characters
- Removed stopwords, applied lemmatization
- Final dataset prepared for topic modeling
- Preprocess Text Data

```python
import re

def preprocess_text(text):
    text = re.sub(r'\s+', ' ', text)         # Collapse extra whitespace
    text = re.sub(r'\S*@\S*\s?', '', text)   # Remove email addresses
    text = re.sub(r"'", '', text)            # Remove apostrophes
    text = re.sub(r'[^a-zA-Z]', ' ', text)   # Remove non-alphabet characters
    text = text.lower()                      # Convert to lowercase
    return text
```
- Tokenize and Remove Stopwords

```python
import gensim
import nltk
from nltk.corpus import stopwords

# Download NLTK stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')

# Tokenize and remove stopwords
def tokenize(text):
    tokens = gensim.utils.simple_preprocess(text, deacc=True)
    return [token for token in tokens if token not in stop_words]
```
- Lemmatize Tokens

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Download WordNet data if it is not already available
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def extract_lemmas(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]
```
- Create Dictionary and Corpus

```python
from gensim import corpora

# Create dictionary and bag-of-words corpus from the lemmatized texts
id2word = corpora.Dictionary(df['lemmas'])
texts = df['lemmas']
corpus = [id2word.doc2bow(text) for text in texts]
```
🧠 2. Topic Modeling with LDA, BERTopic, and QualIT
LDA
```python
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
```
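The scores reported below were presumably computed along these lines (a sketch, assuming gensim's `CoherenceModel` with the `c_v` measure and diversity as unique over total top words):

```python
from gensim.models import CoherenceModel

# Coherence over the lemmatized texts
coherence_model = CoherenceModel(model=lda_model, texts=texts,
                                 dictionary=id2word, coherence='c_v')
print('LDA Coherence Score:', coherence_model.get_coherence())

# Diversity: unique / total over the top 20 words of each of the 20 topics
top_words = [w for t in range(20) for w, _ in lda_model.show_topic(t, topn=20)]
print('Total words across all topics:', len(top_words))
print('Unique words across all topics:', len(set(top_words)))
print('LDA Diversity Score (N=20):', len(set(top_words)) / len(top_words))
```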
Coherence

```
LDA Coherence Score: 0.30031160118766986
```

Diversity

```
Total words across all topics: 400
Unique words across all topics: 385
LDA Diversity Score (N=20): 0.9625
```
🔍 Interpretation
Topic coherence is moderate, indicating that the most frequent words in each topic are somewhat semantically related. The high diversity score (96.25%) shows that across the 20 topics, most keywords are unique, with little redundancy. However, many keywords are still general-purpose words like game, buy, use, and play, which lack contextual nuance.
BERTopic
```python
import spacy
from bertopic import BERTopic
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm")

def spacy_tokenizer(text):
    return [token.lemma_ for token in nlp(text)
            if not token.is_stop and not token.is_punct and not token.is_space]

english_stopwords = stopwords.words('english')

vectorizer_model = CountVectorizer(tokenizer=spacy_tokenizer,
                                   stop_words=english_stopwords,
                                   min_df=5,
                                   max_df=0.9,
                                   ngram_range=(1, 1))

model = BERTopic(verbose=True,
                 embedding_model='all-MiniLM-L6-v2',  # alternatives: all-MiniLM-L12-v2, paraphrase-MiniLM-L12-v2
                 vectorizer_model=vectorizer_model,
                 language='english',
                 nr_topics=20,
                 top_n_words=20,
                 calculate_probabilities=True)

topics, probs = model.fit_transform(df.text)
```
BERTopic was configured to extract 20 topics using BERT embeddings and a CountVectorizer. It produced clusters with clear keyword groupings, and representative documents were extracted per topic. For example, "game", "play", "buy", "money" indicate general purchasing experiences; "price", "$", "scam", "worth" highlight value dissatisfaction; and "headset", "sound", "hear" point to technical product malfunctions.
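For reference, the fitted topics can be inspected with BERTopic's built-in accessors:

```python
# Topic sizes and auto-generated names
print(model.get_topic_info().head(10))
# (word, weight) pairs for a single topic
print(model.get_topic(0)[:10])
```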
```
BERTopic Coherence Score (N=20): 0.3478
BERTopic Diversity Score (N=20): 0.7211
```
🔍 Interpretation
BERTopic's coherence score was 0.3478, higher than LDA's, indicating that the key words within each topic are more semantically related. Its diversity score was 0.7211, meaning about 72% of the top words across all topics were unique. This is somewhat lower than LDA's, but the overlap of some words between topics can increase semantic association and consistency. Overall, BERTopic strikes an effective balance between the interpretability of topics and the distinction between them.
LLM
1. Extract keywords
```python
import requests

# GPT helper: api_key is assumed to be defined elsewhere
def gpt_prompt(prompt):
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    data = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": "You extract keywords for analyzing customer reviews"},
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.3  # control creativity
    }
    response = requests.post("https://api.openai.com/v1/chat/completions",
                             headers=headers, json=data)
    response.raise_for_status()
    result = response.json()["choices"][0]["message"]["content"].strip()
    result = " ".join(result.splitlines()).strip()
    # Guard against responses that echo the placeholder instead of keyphrases
    if any(bad in result.lower() for bad in ["text"]):
        return "LLM Error"
    return result
```
```python
# Prompt
def extract_keyphrases(text, index=None, total=None):
    if index is not None and total is not None:
        print(f"{index + 1} / {total}")  # progress indicator
    prompt = f"""Analyze the following customer review and extract 5 to 10 key phrases that best represent the core topics, features, sentiments, and experiences mentioned in the review.
Key phrases should capture the main subjects, specific product or service attributes, common issues, or positive aspects.
Guidelines:
- Each extracted phrase must clearly represent a specific point or idea from the review.
- Formulate them as meaningful noun phrases, not just single words or a list of adjectives or verbs.
- For example: "poor battery life", "excellent customer support", "difficult assembly process".
- Output should be a single line, with key phrases separated by commas.
- DO NOT include explanatory sentences, standalone adjectives, adverbs, verbs, or full sentences.
- DO NOT infer or hallucinate meanings not present in the review.
- Only include key phrases that are explicitly stated or strongly implied in the review text.
- The extracted key phrases will be used for subsequent document clustering and topic summarization.
[Customer Review]
{text}
Key Phrases:"""
    # return gemini_prompt(prompt)
    return gpt_prompt(prompt)
```
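A hypothetical driver loop applying the extractor to every review; the `Keywords` column name matches what the hallucination check below expects:

```python
# Hypothetical driver loop; sub_df is the filtered review DataFrame used below.
sub_df['Keywords'] = [
    extract_keyphrases(text, index=i, total=len(sub_df))
    for i, text in enumerate(sub_df['text'])
]
```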
2. Hallucination check
```python
import numpy as np
from sentence_transformers import SentenceTransformer

results = []
model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")

for idx, row in sub_df.iterrows():
    text = str(row['text'])  # original or pre-processed text
    keyphrases = str(row['Keywords']).split(',')
    # Embed the document and its keyphrases
    text_embedding = model.encode(text, normalize_embeddings=True)
    kp_embeddings = model.encode(keyphrases, normalize_embeddings=True)
    # Cosine similarity (dot product of normalized vectors)
    sims = [np.dot(text_embedding, kp) for kp in kp_embeddings]
    sims_score = np.mean(sims)
    # Keep only keyphrases above the similarity threshold
    valid_kps = [kp for kp, score in zip(keyphrases, sims) if score >= 0.10]  # control threshold
    results.append({
        'original_keyphrases': keyphrases,
        'coherence_score': sims_score,
        'valid_keyphrases': valid_kps,
    })
```
3. Clustering
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def embed_text(text):
    try:
        return model.encode(str(text), normalize_embeddings=True)
    except Exception:
        return np.zeros(model.get_sentence_embedding_dimension())

# Embed the validated keyphrases
sub_df['embedding'] = sub_df['valid_keyphrases'].apply(embed_text)
embeddings = np.vstack(sub_df['embedding'].values)

# K-Means clustering
n_clusters = 5  # number of clusters
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
sub_df['cluster'] = kmeans.fit_predict(embeddings)
sub_df['cluster'].value_counts().sort_index()

# Aggregate keyphrases by cluster
cluster_docs = (
    sub_df[sub_df["cluster"] != -1]
    .explode("valid_keyphrases")
    .groupby("cluster")["valid_keyphrases"]
    .apply(lambda s: " ".join(s.astype(str)))
    .to_dict()
)

# Top keywords by cluster (TF-IDF)
vectorizer = TfidfVectorizer(max_features=1000)
top_terms_per_cluster = {}
for cluster_id, text in cluster_docs.items():
    tfidf = vectorizer.fit_transform([text])
    feature_array = np.array(vectorizer.get_feature_names_out())
    tfidf_sorting = np.argsort(tfidf.toarray()).flatten()[::-1]
    top_terms = feature_array[tfidf_sorting][:20]
    top_terms_per_cluster[cluster_id] = top_terms.tolist()

# Print results
print("\nTop keywords by cluster:\n")
for cid, terms in top_terms_per_cluster.items():
    print(f"Cluster {cid}: {' / '.join(terms)}")
```
```
Top keywords by cluster:

Cluster 0: in / charged / plugged / pre / money / edition / fire / not / all / stars / switch / issue / work / hazard / didn / does / dlc / hot / digital / fully
Cluster 1: xbox / waste / problems / one / cent / better / of / working / way / live / hell / for / fps / frustration / get / go / ignored / mac / in / false
Cluster 2: not / headset / worth / it / issue / of / card / to / cord / disappointment / this / resolution / genitalia / connect / stick / cards / on / overspending / customization / cut
Cluster 3: game / of / time / the / games / waste / money / issue / play / not / in / experience / compatibility / playing / player / copy / online / connection / for / buy
Cluster 4: issue / not / working / disappointed / malfunction / compatibility / product / with / amazon / game / arrival / missing / be / failure / doesn / open / returning / disconnects / disk / do
```
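The scores reported below were presumably computed along these lines (a sketch reusing `results` and `top_terms_per_cluster` from the steps above):

```python
# Mean per-document coherence from the hallucination-check step
llm_coherence = np.mean([r['coherence_score'] for r in results])

# Diversity: unique / total over the top 20 keywords of each cluster
all_terms = [t for terms in top_terms_per_cluster.values() for t in terms]
llm_diversity = len(set(all_terms)) / len(all_terms)

print(f"LLM Coherence Score (avg): {llm_coherence:.4f}")
print(f"LLM Diversity Score (N=20): {llm_diversity:.4f}")
```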
```
LLM Coherence Score (avg): 0.4068
LLM Diversity Score (N=20): 0.8400
```
🔍 Interpretation
The LLM-based model achieved a coherence score of 0.4068, the highest among the three models, meaning the keywords in each topic are the most meaningfully linked to the documents. Its diversity score of 0.84 is also high, showing that keywords are well distributed across topics. Although this is somewhat lower than LDA's diversity, the LLM pipeline is designed to make meaningful use of overlapping keywords by extracting context-based keyphrases and removing hallucinations.
📊 3. Coherence & Diversity Score Comparison
| Model | Coherence Score | Diversity Score |
|---|---|---|
| LDA | 0.3003 | 0.9625 |
| BERTopic | 0.3478 | 0.7211 |
| LLM (QualIT) | 0.4068 | 0.8400 |
🔍 Interpretation
The table shows that the LLM-based approach (QualIT) outperforms the other models in topic coherence, meaning it generates more semantically consistent topics. While LDA achieves the highest diversity score, indicating broader coverage of unique words, it lacks semantic depth. BERTopic falls between the two on both metrics. Overall, the LLM-based approach is the most effective at producing topics that are both meaningful and distinguishable.