Wiki text cleaner in r

1/5/2024

Case study of tweets from comments on Indonesia’s biggest media.

Text mining is the process of mining data that is in text format. It can surface a lot of information, such as the topics people are talking about, their sentiment toward a given topic, or which words are used most frequently at a given time.

Twitter is one of the popular social media platforms in Indonesia. Based on data from Statcounter, 7.4% of Indonesia’s population uses it. Twitter gives people a platform where they can share their opinions and get the information they need. The tweets contain a lot of information to uncover. Therefore, Twitter is a great playground for those who want to get involved in text mining.

In this article, I will show you how to do text mining on Twitter, specifically on comments by Indonesian netizens taken from one of the largest media outlets in Indonesia, Kompas. This article only explains how to gather and clean the data using R. In the next article, I will show you how much information this text data contains through exploration, sentiment analysis, and topic modelling.

The first step is to gather the data from Twitter. Before you gather the tweets, you have to consider some aspects, such as the goals you want to achieve and where you want to take the tweets from, whether by searching with some queries or by gathering them from specific users. Note: Make sure that you have Twitter API keys for accessing the API. You have to register to get them.

In this case, we just want to know what comments Indonesian netizens give, what they are talking about, and what the sentiment of the tweets is. The reason we take the comments from news media accounts is that their comments cover broad topics and contain fewer spam tweets. By spam tweets, I mean tweets that use certain hashtags without talking about them, in other words out-of-context tweets.

Based on that goal, we will gather the tweets that mention the account with the username kompascom. If we just want the tweets that are replies to it, we can use a special keyword, which is to. In this case, we will take around 18000 tweets that reply to the username, just like the code below:

    # Import the libraries
    library(rtweet)
    library(dplyr)

    # Note: Use your own token. The values below are placeholders, and
    # create_token() is assumed here (it belongs to older rtweet releases).
    twitter_token <- create_token(
      app = "your_app_name",
      consumer_key = "your_consumer_key",
      consumer_secret = "your_consumer_secret",
      access_token = "your_access_token",
      access_secret = "your_access_secret"
    )

    # Search for tweets that reply to the account (assumed query)
    tweets <- search_tweets("to:kompascom", n = 18000)

    data_fix <- tweets %>%
      # Remove Duplicate
      distinct(text, .keep_all = TRUE) %>%
      # Take The Text Only
      select(created_at, text)

    # Create id column as the tweet identifier
    data_fix$id <- 1:nrow(data_fix)

    # Convert the created_at to date format
    data_fix$created_at <- as.Date(data_fix$created_at, format = "%Y-%m-%d")

The search returns around 16746 tweets and 90 columns. We will not use all of the columns. Instead, we pick only the date and the text of each tweet, and we remove the tweets that duplicate each other. We also make an id column as the identifier of the tweet. As a result, we have 16418 tweets and 3 columns.

Some of the gathered tweets contain words and symbols that we have to remove, such as mentions, and many more. Besides that, we have to remove words that have no impact on the semantic meaning of the tweet, which we call stop words. Examples of stop words in Indonesian are tidak, ya, bukan, karena, untuk, and many more. For the stop words, we will use a list from a GitHub repository.

When we clean the tweets, there is an additional challenge that we have to deal with. Most Indonesians shorten the words in their tweets, so there are lots of different words that have the same meaning. For example, in Indonesian, if we want to say ‘no’ we say ‘tidak’.
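
The cleaning steps this article describes (removing mentions, mapping shortened words back to their standard form, and dropping stop words) could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the function name clean_tweet, the five-word stop word vector, and the entries in slang_map are all hypothetical stand-ins, not the author's code or the full GitHub stop word list.

```r
# Hypothetical stand-in for the full Indonesian stop word list
stop_words <- c("tidak", "ya", "bukan", "karena", "untuk")

# Assumed mapping from shortened spellings to the standard word
slang_map <- c(ga = "tidak", gak = "tidak", nggak = "tidak", enggak = "tidak")

clean_tweet <- function(text, stop_words, slang_map) {
  text <- tolower(text)
  # Drop mentions such as @kompascom
  text <- gsub("@\\w+", " ", text)
  # Drop everything that is not a lowercase letter or a space
  text <- gsub("[^a-z ]", " ", text)
  # Split into words and discard empty strings left by repeated spaces
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]
  # Replace shortened words with their standard form
  is_slang <- words %in% names(slang_map)
  words[is_slang] <- unname(slang_map[words[is_slang]])
  # Remove stop words
  words <- words[!words %in% stop_words]
  paste(words, collapse = " ")
}

clean_tweet("@kompascom Beritanya gak jelas, ya!", stop_words, slang_map)
# "beritanya jelas"
```

Normalizing before removing stop words is a deliberate choice in this sketch: a shortened form such as gak would otherwise slip past a stop word list that only contains tidak.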
