Text Mining With R


word_counts <- cleaned_austen %>%
  count(word, sort = TRUE)

word_counts %>%
  head(10)

tf_idf <- cleaned_austen %>%
  count(book, word) %>%
  bind_tf_idf(word, book, n) %>%
  arrange(desc(tf_idf))

tf_idf %>%
  group_by(book) %>%
  slice_max(tf_idf, n = 3)

4.1. N-grams (Pairs of Words)

austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

Count common bigrams:

bigram_counts <- austen_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)

4.2. Topic Modeling (Latent Dirichlet Allocation)

Using tidytext + topicmodels to discover hidden themes.
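A minimal sketch of what such a topic model could look like on the Austen corpus. It assumes the dplyr, janeaustenr, tidytext, and topicmodels packages are installed; the choice of k = 2 topics and the seed value are arbitrary illustration values, not settings taken from the tutorial.

```r
library(dplyr)
library(janeaustenr)
library(tidytext)
library(topicmodels)

# Build a document-term matrix with one document per book.
austen_dtm <- austen_books() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(book, word) %>%
  cast_dtm(book, word, n)

# Fit LDA; k = 2 and the seed are assumptions chosen for illustration.
austen_lda <- LDA(austen_dtm, k = 2, control = list(seed = 1234))

# Extract per-topic word probabilities (beta) and keep the top terms.
topic_terms <- tidy(austen_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup()

topic_terms
```

With only six books as documents, the topics are likely to be coarse; splitting each book into chapters before counting would probably give the model more to work with.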

1. Introduction

In the age of big data, most information exists as unstructured text: emails, social media posts, reviews, news articles, and research papers. Unlike numerical data, text cannot be fed directly into a statistical model. Text mining (or text analytics) is the process of transforming this free-form text into structured, quantifiable data for analysis, pattern discovery, and prediction.
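To make the idea of turning free-form text into structured data concrete, here is a small sketch using tidytext. The `reviews` table and its contents are made up for illustration and are not part of the tutorial's data.

```r
library(dplyr)
library(tidytext)

# Hypothetical example data: two short free-text reviews.
reviews <- tibble(
  id   = 1:2,
  text = c("The plot was slow, but the ending was great.",
           "Great acting and a great soundtrack!")
)

# One row per word; unnest_tokens() lowercases and strips punctuation by default.
tidy_reviews <- reviews %>%
  unnest_tokens(word, text)

# The text is now a table that can be counted like any other data.
review_counts <- tidy_reviews %>%
  count(word, sort = TRUE)

review_counts
```

Once the text is in this one-token-per-row shape, ordinary dplyr verbs such as count(), filter(), and joins apply to it directly.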

sentiment_scores

library(wordcloud)

word_counts %>%
  with(wordcloud(word, n, max.words = 100, colors = brewer.pal(8, "Dark2")))

3.7. Term Frequency – Inverse Document Frequency (TF-IDF)

TF-IDF identifies words that are important to a document within a corpus.
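A toy illustration of how TF-IDF behaves may help here. The `toy_counts` table below is invented for this sketch; bind_tf_idf() computes term frequency as a word's share of its document and inverse document frequency as the natural log of (number of documents / number of documents containing the word).

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts for three tiny documents.
toy_counts <- tibble(
  doc  = c("a", "a", "b", "b", "c", "c"),
  word = c("the", "ship", "the", "garden", "the", "letter"),
  n    = c(5, 2, 4, 3, 6, 1)
)

toy_tf_idf <- toy_counts %>%
  bind_tf_idf(word, doc, n) %>%
  arrange(desc(tf_idf))

# "the" occurs in every document, so its idf is log(3 / 3) = 0
# and its tf_idf is 0 regardless of how often it appears.
toy_tf_idf
```

Words that appear in only one document keep a positive score, which is why TF-IDF tends to surface vocabulary that distinguishes one document from the rest.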

data(stop_words)

cleaned_austen <- tidy_austen %>%
  anti_join(stop_words, by = "word")

Count most common words:

Visualize the most common words with a bar chart:
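The original plotting code is not preserved here, so the following is only a sketch of one way to draw such a bar chart. It assumes ggplot2 is installed and rebuilds the word counts from janeaustenr so the block runs on its own; the name `top_words` and the cutoff of ten words are choices made for this example.

```r
library(dplyr)
library(ggplot2)
library(janeaustenr)
library(tidytext)

# Recompute the ten most frequent non-stop-words across all Austen novels.
top_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 10)

# Horizontal bars, ordered so the most frequent word is at the top.
top_words_plot <- ggplot(top_words, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = NULL)

print(top_words_plot)
```

Putting the words on the y-axis keeps the labels readable without rotating them.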
