Movie Review Text Mining: Sentiment Analysis of Avengers: Endgame
1. Introduction
Avengers: Endgame was one of the most globally successful films of the last decade.
Both Korean and American audiences rated the movie extremely highly.
- 🇺🇸 Rotten Tomatoes (U.S.): 8.4 / 10
- 🇰🇷 Naver Movie (Korea): 9.5 / 10
At first glance, the conclusion seems simple: both countries loved the movie.
But here's the real question:
If both countries gave similarly high ratings, does a high rating always mean strong positive emotion in the review text?
This project investigates that question using Natural Language Processing (NLP).
2. Research Objective
The goal of this project is to analyze and compare movie reviews of Avengers: Endgame from the United States and Korea using text mining techniques.
More specifically, this study aims to:
- Compare frequently used words in U.S. and Korean reviews
- Analyze sentiment distributions in both countries
- Examine why high ratings may still produce different sentiment patterns
- Explore whether cultural communication styles influence sentiment detection
Although both countries rated the movie highly, the distribution of textual sentiment may reveal deeper structural differences.
This is not just about whether people liked the movie. It is about how emotion is linguistically encoded across cultures.
3. Theoretical Background
3.1 Star Ratings vs. Textual Sentiment
Prior research suggests that star ratings and textual sentiment do not always align.
Wan & Nakayama (2022) demonstrated that numerical ratings reflect overall satisfaction, while textual reviews often capture more nuanced emotional expression.
This means:
- Two countries may assign similar ratings
- But differ in emotional intensity within the text
So relying solely on rating scores may hide cross-cultural variation in how emotions are expressed.
3.2 Cultural Communication Styles
Cultural communication theory provides an additional lens.
According to cross-cultural communication research (e.g., Brand et al., 2022), cultures can be broadly categorized as high-context or low-context.
In high-context cultures:
- Communication tends to be indirect and subtle
- Emotional evaluation is often implied rather than explicitly stated
In low-context cultures:
- Communication is explicit and direct
- Emotional reactions are often expressed with strong evaluative words
This distinction is important because sentiment analysis models rely heavily on explicit emotional vocabulary.
If emotion is expressed indirectly, a model may struggle to detect it accurately.
4. Data Collection
To explore this question, I collected movie review data from two major platforms:
- 🇺🇸 Rotten Tomatoes
- 🇰🇷 Naver Movie
Dataset Overview
- Total reviews: 600
- 🇺🇸 300 English reviews
- 🇰🇷 300 Korean reviews
- Text-only reviews (free-text format)
- Each dataset contains a single column: review
All reviews were manually collected from publicly available sources and stored in CSV format.
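As a minimal sketch of reading such a single-column file, the snippet below uses the standard-library `csv` module so that it runs without extra dependencies. The helper name `load_reviews` and the inline sample rows are assumptions for illustration; they are not the project's actual data or loading code.

```python
import csv
import io

def load_reviews(handle):
    """Read a CSV with a single `review` column and return the non-empty reviews."""
    reader = csv.DictReader(handle)
    reviews = []
    for row in reader:
        text = (row.get("review") or "").strip()
        if text:  # skip blank rows so they do not count as reviews
            reviews.append(text)
    return reviews

# Inline stand-in for a real file (hypothetical content, not the collected data)
sample_csv = io.StringIO(
    'review\n'
    '"Best superhero movie ever, a perfect ending."\n'
    '"Three hours flew by."\n'
    '""\n'
)
reviews = load_reviews(sample_csv)
print(len(reviews), "reviews loaded")
```

Checking the row count after loading is a cheap way to confirm that each country's file really holds the expected 300 reviews.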
USA
Data Processing
1. Preprocess Text Data
import re

def clean_text(text):
    text = str(text)                          # make sure it is a string
    text = re.sub(r'\n', ' ', text)           # replace new lines with spaces
    text = re.sub(r'\r', '', text)            # remove carriage returns
    text = re.sub(r'\t', ' ', text)           # replace tabs with spaces
    text = re.sub(r'\.', '', text)            # remove dots
    text = re.sub(r"[^A-Za-z\s]", "", text)   # keep only letters and whitespace
    text = re.sub(r'\d+', '', text)           # remove numbers
    text = re.sub(r'\s+', ' ', text)          # collapse repeated whitespace
    text = text.strip().lower()               # trim and lowercase
    return text                               # return the cleaned string
2. Remove Stopwords and Domain-Specific Words
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
# keep negations and contrast markers, since they can flip sentiment
keep_words = {"not", "no", "nor", "but", "however", "though", "although", "without"}
stop_words = stop_words - keep_words
domain_stopwords = {
    "movie", "film", "movies", "films",
    "cinema", "endgame", "marvel",
    "avengers", "mcu", "infinity"
}

def remove_stopwords(text):
    tokens = [
        w for w in text.split()
        if w not in stop_words
        and w not in domain_stopwords
        and (len(w) > 2 or w in keep_words)  # so the kept negation "no" survives the length filter
    ]
    return tokens
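The effect of exempting negation words can be seen in a small, dependency-free sketch. The stopword set below is a tiny stand-in (an assumption) for NLTK's full English list, and `filter_tokens` is a hypothetical helper, not the project's function.

```python
# Stand-in stopword set (assumption): a tiny subset, not NLTK's full English list
STOP_WORDS = {"the", "was", "it", "is", "a", "and", "no", "not"}
KEEP_WORDS = {"no", "not"}                 # negations we never want to drop
ACTIVE_STOP_WORDS = STOP_WORDS - KEEP_WORDS

def filter_tokens(text, min_len=3):
    """Drop stopwords and short tokens, but always keep negation words."""
    return [
        w for w in text.lower().split()
        if w in KEEP_WORDS
        or (w not in ACTIVE_STOP_WORDS and len(w) >= min_len)
    ]

print(filter_tokens("it was not good"))       # ['not', 'good']
print(filter_tokens("no plot and no heart"))  # ['no', 'plot', 'no', 'heart']
```

Without the exemption, "it was not good" would reduce to just "good", which reverses the apparent meaning of the review.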
3. Normalization & Lemmatization
from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def normalize_and_lemmatize(tokens):
    # normalize film/films → movie
    tokens = [re.sub(r"\bfilms?\b", "movie", w) for w in tokens]
    # apply lemmatization (WordNet's default noun mode)
    tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return tokens
Word Frequency Analysis (U.S. Reviews)
The bar chart below presents the top 30 most frequent words in U.S. reviews.
Strong evaluative terms such as best, perfect, great, and amazing dominate the ranking. This indicates that American reviewers frequently rely on direct and high-intensity emotional vocabulary.
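A ranking like this can be produced with a plain frequency count over the token lists. The sketch below is one way to do it with `collections.Counter`; the function name `top_words` and the sample tokens are illustrative assumptions, not the project's data.

```python
from collections import Counter

def top_words(token_lists, n=30):
    """Count tokens across all reviews and return the n most frequent (word, count) pairs."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)

# Toy token lists standing in for preprocessed reviews (hypothetical)
sample_tokens = [
    ["best", "ending", "ever"],
    ["great", "cast", "best", "finale"],
    ["perfect", "ending", "best"],
]
print(top_words(sample_tokens, n=2))  # [('best', 3), ('ending', 2)]
```

The resulting pairs can be passed directly to a bar chart or used as the frequency input for a word cloud.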
Word Cloud Visualization
To complement the frequency distribution, a word cloud was generated to visualize important words.
The visual dominance of words like best, time, great, and perfect reinforces the observation that positivity is expressed explicitly and emphatically. While the bar chart provides precise rankings, the word cloud highlights the emotional intensity embedded in word choice.
Sentiment Analysis (VADER)
To quantify emotional polarity in U.S. reviews, I applied VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon-based sentiment analysis model designed for social media and review text.
VADER computes a compound sentiment score ranging from -1 (most negative) to +1 (most positive).
Compute Sentiment Scores
# Sentiment analysis with VADER (NLTK's implementation is assumed here)
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()

# Calculate the VADER compound score for each cleaned review
df["sentiment_score"] = df["clean_review"].apply(lambda x: sia.polarity_scores(x)["compound"])

def classify(score):
    if score > 0:
        return "positive"
    elif score < 0:
        return "negative"
    else:
        return "neutral"

# Apply classification to each sentiment score
df["sentiment"] = df["sentiment_score"].apply(classify)
Reviews were categorized based on the compound score:
- Positive: score > 0
- Negative: score < 0
- Neutral: score = 0
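The zero cut-off used here treats any nonzero compound score as polar. For comparison, the VADER authors' documentation describes ±0.05 as a typical threshold for separating neutral from polar text. The sketch below shows that alternative next to the strict rule; it is not the rule used in this project, and `classify_with_margin` is a hypothetical name.

```python
def classify_strict(score):
    """Rule used in this project: any nonzero score is polar."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def classify_with_margin(score, margin=0.05):
    """Alternative rule: scores within +/- margin of zero count as neutral."""
    if score >= margin:
        return "positive"
    if score <= -margin:
        return "negative"
    return "neutral"

# Illustrative compound scores (made up for the comparison)
for s in [0.62, 0.02, -0.01, -0.44]:
    print(s, classify_strict(s), classify_with_margin(s))
```

With a margin, weakly scored reviews move into the neutral class, so the choice of threshold directly affects the reported positive share.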
Sentiment Distribution (U.S. Reviews)
The overwhelming dominance of positive reviews indicates that American audiences expressed strong approval of the film. This result aligns closely with the earlier word frequency analysis, where highly evaluative terms such as best, perfect, great, and amazing appeared frequently. This shows that American reviewers tend to express their emotions strongly and directly.
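The class shares behind such a distribution can be computed from the label column with a simple count. This is a minimal sketch with made-up labels; `sentiment_shares` is a hypothetical helper and the toy numbers are not the study's results.

```python
from collections import Counter

def sentiment_shares(labels):
    """Return the percentage of reviews in each sentiment class, rounded to one decimal."""
    total = len(labels)
    if total == 0:
        return {}
    counts = Counter(labels)
    return {
        label: round(100 * counts[label] / total, 1)
        for label in ("positive", "neutral", "negative")
    }

# Toy labels for illustration only
toy_labels = ["positive"] * 7 + ["neutral"] * 2 + ["negative"]
print(sentiment_shares(toy_labels))  # {'positive': 70.0, 'neutral': 20.0, 'negative': 10.0}
```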
KOREA
Data Processing
1. Preprocess Text Data
def clean_korean_text(text):
    text = str(text)
    text = re.sub(r'\n', ' ', text)   # Replace line breaks with spaces
    text = re.sub(r'\r', '', text)    # Remove carriage returns
    text = re.sub(r'\t', ' ', text)   # Replace tab characters with spaces
    text = re.sub(r'[~!@#$%^&*()_+=<>?/.,:;\'\"“”‘’…★☆♥♡]', '', text)  # Remove special symbols
    text = re.sub(r'\d+', '', text)   # Remove numbers
    text = re.sub(r'[ㄱ-ㅎㅏ-ㅣ]+', '', text)  # Remove isolated Korean consonants/vowels (e.g., ㅋㅋ, ㅎㅎ)
    text = re.sub(r'\s+', ' ', text)  # Collapse repeated whitespace
    text = text.strip()
    return text                       # return the cleaned string
2. Remove Stopwords and Domain-Specific Words
stop_words = [
    # Particles, endings, and adverbs (grammatical function words)
    "은", "는", "이", "가", "을", "를", "의", "에", "에서", "으로", "로", "와", "과",
    "도", "만", "보다", "처럼", "까지", "께서", "한테", "에게", "라고",
    "그리고", "그래서", "하지만", "그러나", "또", "또한", "근데", "그런데",
    "뭔가", "좀", "너무", "정말", "진짜", "완전", "아주", "많이", "되게", "그냥",
    "이건", "저건", "그건", "우리", "내가", "이번", "이제", "나는", "다시",
    "하다", "이다",
    # Interjections and onomatopoeia (do not contribute directly to sentiment)
    "아", "와", "어", "음", "헐", "ㅋㅋ", "ㅎㅎ", "ㅋ", "ㅎ", "ㄷㄷ", "하하", "휴", "캬",
    # Miscellaneous unnecessary words
    "때문", "정도", "것", "거", "게", "수", "듯", "요", "죠", "네", "데", "중", "영화", "마블", "엔드게임",
    "건가", "인가", "거나", "라는", "거든요", "네요", "입니다", "했습니다", "봤어요", "어벤져스", "재미있다"
]
3. Normalization & Tokenization
from konlpy.tag import Okt

okt = Okt()

def normalize_and_tokenize(text):
    # morphological analysis: split the sentence into morphemes and reduce inflected forms to their stems
    tokens = okt.morphs(text, stem=True)
    # stopword removal and length filtering
    tokens = [w for w in tokens if w not in stop_words and len(w) > 1]
    return " ".join(tokens)
Word Frequency Analysis (Korean Reviews)
The bar chart below presents the top 30 most frequent words in Korean reviews.
Words such as 재밌다 (fun), 좋다 (good), 최고 (the best), and 감동 (emotion) appear frequently, indicating clear positive sentiment. At the same time, experiential words like 시간 (time), 마지막 (last), and 시리즈 (series) are also prominent. This suggests that Korean reviewers express emotion clearly, but often within a broader narrative or situational context rather than relying solely on intensifiers.
Word Cloud Visualization
To complement the frequency distribution, a word cloud was generated to visualize important words in Korean reviews.
The word cloud highlights emotional terms such as 재밌다, 좋다, and 감동, but it also emphasizes relational and experiential words like 보다 (to watch), 시리즈 (series), and 시간 (time). Compared to the U.S. reviews, Korean reviews still contain strong emotional expressions, but these words coexist with contextual and narrative terms. This pattern suggests that emotional expression in Korean reviews is often intertwined with storytelling and experiential reflection rather than expressed purely through intensifiers.
Sentiment Analysis (KNU Sentiment Lexicon)
To quantify emotional polarity in Korean reviews, I applied a lexicon-based sentiment analysis approach using the KNU Korean Sentiment Lexicon.
The KNU lexicon was constructed from a large Korean lexical database using deep learning-based classification of dictionary definitions.
Unlike VADER, this method does not generate a normalized compound score. Instead, it computes the overall sentiment score by summing the polarity values of all matched sentiment words within each review.
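To make the contrast concrete: VADER's reference implementation bounds its compound score by normalizing the summed valence x as x / sqrt(x² + α), with α = 15. The sketch below reproduces only that formula. Applying it to KNU sums is purely illustrative and not part of this study, and `normalize_score` is a hypothetical name.

```python
import math

def normalize_score(raw_sum, alpha=15):
    """Squash an unbounded summed score into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# Raw sums grow with the number of matched words; normalized values stay bounded
for raw in [0, 1, 4, -2, 20]:
    print(raw, round(normalize_score(raw), 3))
```

Because a raw sum grows with review length, long Korean reviews can reach large absolute scores while short ones stay near zero, which is one reason the two countries' scores are not directly comparable in magnitude.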
Compute Sentiment Scores
import pandas as pd

# Load the sentiment lexicon
knu_lex = pd.read_csv("knu_sentiment_lexicon.csv")

# Convert lexicon into a word -> polarity dictionary
sentiment_dict = dict(zip(knu_lex["word"], knu_lex["polarity"]))

# Sum the polarity of every token that appears in the lexicon
def get_sentiment_score(text):
    tokens = okt.morphs(text)
    score = 0
    for word in tokens:
        if word in sentiment_dict:
            score += sentiment_dict[word]
    return score

# Apply sentiment scoring
df_kor["sentiment_score"] = df_kor["clean_review"].apply(get_sentiment_score)

def classify(score):
    if score > 0:
        return "positive"
    elif score < 0:
        return "negative"
    else:
        return "neutral"

df_kor["sentiment"] = df_kor["sentiment_score"].apply(classify)
Reviews were categorized based on the summed polarity score:
- Positive: score > 0
- Negative: score < 0
- Neutral: score = 0
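Under this rule a score of zero can arise in two different ways: no token matched the lexicon at all, or positive and negative matches cancelled out. A minimal diagnostic sketch that separates the two cases is shown below. The toy lexicon uses English stand-in words (an assumption, since real KNU entries are Korean), and `explain_score` is a hypothetical helper.

```python
def explain_score(tokens, lexicon):
    """Return the summed polarity, the reason for it, and the matched (word, polarity) pairs."""
    matches = [(w, lexicon[w]) for w in tokens if w in lexicon]
    score = sum(polarity for _, polarity in matches)
    if not matches:
        reason = "no lexicon match"
    elif score == 0:
        reason = "matches cancel out"
    else:
        reason = "net polarity"
    return score, reason, matches

# Toy lexicon with English stand-ins (assumption, not actual KNU entries)
toy_lexicon = {"good": 1, "best": 2, "boring": -1, "bad": -2}

print(explain_score(["time", "flew", "by"], toy_lexicon))
print(explain_score(["good", "boring"], toy_lexicon))
print(explain_score(["best", "ending"], toy_lexicon))
```

Counting how many neutral reviews fall under "no lexicon match" would indicate whether the neutral class reflects balanced wording or simply missing lexicon coverage.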
Sentiment Distribution (Korean Reviews)
The sentiment distribution reveals a noticeably different pattern from the U.S. results. Unlike the U.S. reviews, where positive sentiment overwhelmingly dominated, Korean reviews show a much larger proportion of neutral classifications. Although positive reviews still form a substantial portion, the high neutral ratio suggests that Korean audiences often express their opinions in a more subtle and context-dependent manner rather than through strongly polarized wording.
Interpretation & Conclusion
1. Key Interpretation
Although both U.S. and Korean audiences gave Avengers: Endgame very high overall ratings, the sentiment distributions revealed a noticeable difference. In the U.S. dataset, approximately 77.7% of reviews were classified as positive. This indicates that American reviewers tended to express their approval clearly and strongly. The frequent use of highly evaluative words such as best, perfect, and amazing further supports this pattern.
In contrast, only 35.9% of Korean reviews were classified as positive, while 43.9% were classified as neutral. Despite the high overall rating in Korea (9.5/10) and the presence of many positive words in the frequency analysis, the sentiment model detected a much larger proportion of neutral expressions.
This suggests that the difference is not necessarily about how much audiences liked the movie, but rather about how they expressed their emotions linguistically.
2. Cultural Communication Style
One possible explanation for this pattern lies in cultural communication differences.
The United States is generally considered a low-context culture, where communication is explicit, direct, and emotionally transparent. Reviewers tend to use strong evaluative expressions such as:
- "It was amazing."
- "This movie was absolutely terrible."
These statements contain clear emotional signals, making it easier for sentiment analysis models to detect strong positive or negative polarity.
Korea, on the other hand, is often described as a high-context culture, where communication tends to be more subtle, indirect, and context-dependent. Instead of directly stating strong evaluations, reviewers may express their opinions through experiential descriptions, such as:
- "I lost track of time while watching it."
- "I fell asleep while watching it."
Although these sentences imply strong positive or negative reactions, they do not always contain explicit emotional keywords. As a result, lexicon-based sentiment models may classify them as neutral or low-intensity sentiment.
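The four example sentences above can be run through a toy scorer to illustrate the point. The mini-lexicon below is a hypothetical stand-in containing only explicit evaluative words; it is neither VADER nor the KNU lexicon, and real lexicons are far larger.

```python
import re

# Hypothetical mini-lexicon of explicit evaluative words (assumption, not VADER or KNU)
TOY_LEXICON = {"amazing": 2, "great": 2, "terrible": -2, "boring": -2}

def label(sentence):
    """Sum the polarity of known words and map the total to a sentiment label."""
    words = re.findall(r"[a-z']+", sentence.lower())
    score = sum(TOY_LEXICON.get(w, 0) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

examples = [
    "It was amazing.",
    "This movie was absolutely terrible.",
    "I lost track of time while watching it.",
    "I fell asleep while watching it.",
]
for sentence in examples:
    print(label(sentence), "|", sentence)
```

The two explicit sentences receive polar labels, while the two experiential sentences come out neutral even though a human reader would infer a clear reaction from each.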
3. Final Conclusion
Overall, both U.S. and Korean audiences reacted positively to Avengers: Endgame. However, the sentiment analysis results reveal that emotional intensity and linguistic expression differ across cultures. The U.S. reviews demonstrated high-intensity, explicit sentiment expression, resulting in a dominant positive classification. In contrast, Korean reviews exhibited more balanced and context-driven emotional expression, leading to a higher proportion of neutral classifications. Therefore, sentiment analysis does not only measure audience preference. It also reflects deeper cultural patterns in how emotions are communicated through language. This study highlights the importance of considering cultural communication style when interpreting cross-linguistic sentiment analysis results.