SENTIMENT ANALYSIS FOR EXTRACTING STUDENT OPINION DATA ON HIGHER EDUCATION SERVICES USING THE NAIVE BAYES CLASSIFIER AND SUPPORT VECTOR MACHINE METHODS (CASE STUDY AKPRIND

Opinions are ideas, views, or the results of someone's subjective thoughts in explaining or addressing something. IST AKPRIND Yogyakarta provides comment and suggestion boxes in its learning evaluation questionnaire. The opinions collected can be used to determine the sentiment of the campus community, and this sentiment information can inform future campus development. This study designs a system that analyzes sentiment automatically and compares the Naive Bayes Classifier (NBC) method with the Support Vector Machine (SVM) method, both optimized with Information Gain (IG) feature selection. The opinion data must be prepared before it is analyzed. The text preprocessing steps used are cleaning, case folding, normalization, stemming, stopword removal, negation conversion, and tokenization. The results of this study show that the SVM method produces higher accuracy than NBC. The accuracy test shows that the highest accuracy of SVM reaches 99.09%, while that of NBC is 96.56%. The application of IG did not significantly affect the accuracy of the analysis. IG did, however, greatly influence the analysis duration of the SVM method, speeding it up by 195.71%. This is an open access article under the CC-BY-SA license.


I. INTRODUCTION
Opinion mining, or sentiment analysis, is currently a highly active research topic in the field of text mining. Sentiment analysis aims to create automated tools that can extract subjective information, such as opinions and sentiments, from natural-language text, so as to create structured knowledge that can be used in decision support systems or decision making [1]. Opinion mining is considered a combination of text mining and natural language processing. In its basic form, sentiment analysis is a binary classification into two tendencies, namely positive and negative [2]. One text mining method that can be used to solve opinion mining problems is the Naïve Bayes Classifier (NBC). NBC can be used to classify opinions into positive and negative opinions, and it works well as a text classifier [3]. In addition to NBC, the Support Vector Machine (SVM) method is also used in text classification. A standard SVM takes a set of input data and predicts, for each given input, which of two possible classes the input belongs to, which makes SVM a non-probabilistic binary linear classifier [4]. Combining the NBC and SVM methods can further improve the accuracy of text classification results [5].
In an effort to improve its services to students, AKPRIND Institute of Science & Technology (IST AKPRIND) Yogyakarta conducts a service assessment survey at the end of each semester covering public services, facilities, and the learning process delivered by lecturers. Student opinions are collected through a questionnaire form. The questionnaire is filled out by students and contains positive, negative, or neutral opinions. The results can be used as an indicator for assessing the quality of services and facilities at IST AKPRIND Yogyakarta. Because students are an important part of campus activities, their opinions can help improve quality.
In this study, a system for processing student opinion data was built using the Naive Bayes Classifier and the Support Vector Machine (SVM), each combined with Information Gain. By combining these methods, it is hoped that sentiment analysis can be carried out more quickly and easily, with a fairly high level of accuracy and effectiveness. The result shows whether student opinion tends to be positive, negative, or neutral, so that it can be used to improve services and performance.

II. RELATED WORK
Sentiment analysis is fundamentally concerned with identifying the opinion that a person expresses. Most recent work divides sentiment into two classes: positive and negative. This section reviews the literature on sentiment analysis, as well as the techniques applied to user reviews.

A. Text Mining
Text mining is a technology used to analyze unstructured data in the form of text. Text mining analysis has two main phases: (1) preprocessing and integration of unstructured data, and (2) statistical analysis of the preprocessed data to extract the content contained in the text [6]. Text mining transforms text data into numeric data, so that unstructured data is converted into structured data [7]. Sentiment analysis is a very common task in text classification. It is a process that analyzes a text input and detects whether its sentiment is positive, negative, or neutral. More recently, the sentiments that can be detected have become more diverse and are no longer limited to positive and negative; they can include happiness, sadness, anger, fear, disgust, and surprise [8]. One use of sentiment analysis is to monitor the quality or performance of an institution's products and services, so that conclusions can be drawn about whether the service is accepted or not. Research on sentiment analysis of Indonesian text has been carried out for various purposes, for example service assessment, prediction, facility assessment, and others [9]-[11]. The methods used vary, ranging from SVM, Naïve Bayes, and KNN to deep learning-based methods such as the Convolutional Neural Network (CNN).

B. Naive Bayes Classifier (NBC)
The NBC algorithm is often used for text classification problems. As an illustration, suppose the training data are grouped into k categories C = {C1, C2, C3, ..., Ck}, and the prior probability of each category is P(Cj), where j = 1, 2, 3, ..., k. A document is represented as di = (w1, w2, ..., wm), where w1, ..., wm are the words or features in document di. Classification is made by calculating, for each category, its probability given the document (the posterior probability). The posterior probability of a document in a category can be calculated by equation (1):

$$P(C_j \mid d_i) = \frac{P(d_i \mid C_j)\, P(C_j)}{P(d_i)} \tag{1}$$

In naive Bayes classification, an opinion is represented by the attributes (a1, a2, a3, ..., an), where a1 is the first word, a2 is the second word, and so on until the last word an. V is the set of classes. At classification time, the method looks for VMAP (the category or class with the highest probability value) given the attributes (a1, a2, a3, ..., an), using equation (2):

$$V_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) \tag{2}$$

By applying Bayes' theorem, equation (2) can be written as equation (3):

$$V_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} \tag{3}$$

Because P(a1, a2, ..., an) is the same for every class vj, it can be dropped, giving equation (4):

$$V_{MAP} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j) \tag{4}$$

The Naive Bayes classifier simplifies this by assuming that, within each category, the attributes are conditionally independent of one another, so equation (4) becomes:

$$V_{MAP} = \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$$

P(vj) and the probability of the word ai for each category are calculated during training using equations (5) and (6):

$$P(v_j) = \frac{|docs_j|}{|training|} \tag{5}$$

$$P(a_i \mid v_j) = \frac{n_i + 1}{n + |vocabulary|} \tag{6}$$

where docsj is the number of documents in category j and training is the number of documents used in the training process. ni is the number of occurrences of the word ai in category vj, n is the total number of words that appear in category vj, and vocabulary is the number of unique words in all training data [12], [13].
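The decision rule and the smoothed training estimates described above can be sketched in a few lines of Python. This is a minimal illustration rather than the system built in this study: the function names and the four toy documents are assumptions, and log-probabilities are summed instead of multiplying raw probabilities, purely to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict


def train_nbc(documents):
    """Estimate P(vj) and per-class word counts from (tokens, label) pairs."""
    docs_per_class = Counter(label for _, label in documents)
    word_counts = defaultdict(Counter)          # word_counts[vj][ai] = ni
    for tokens, label in documents:
        word_counts[label].update(tokens)
    vocabulary = {word for tokens, _ in documents for word in tokens}
    # P(vj) = |docs_j| / |training|
    priors = {v: count / len(documents) for v, count in docs_per_class.items()}
    return priors, word_counts, vocabulary


def classify_nbc(tokens, priors, word_counts, vocabulary):
    """Return the class with the highest (log) posterior score."""
    best_label, best_score = None, -math.inf
    for v, prior in priors.items():
        n = sum(word_counts[v].values())        # total words seen in class vj
        score = math.log(prior)
        for a in tokens:
            if a not in vocabulary:             # words never seen in training are skipped
                continue
            # P(ai | vj) = (ni + 1) / (n + |vocabulary|)
            score += math.log((word_counts[v][a] + 1) / (n + len(vocabulary)))
        if score > best_score:
            best_label, best_score = v, score
    return best_label


# Toy training data (hypothetical, for illustration only).
toy_training = [
    (["dosen", "bagus"], "positive"),
    (["pelayanan", "bagus", "ramah"], "positive"),
    (["ruang", "kotor"], "negative"),
    (["pelayanan", "lambat"], "negative"),
]
toy_model = train_nbc(toy_training)
print(classify_nbc(["dosen", "ramah"], *toy_model))    # positive
print(classify_nbc(["ruang", "lambat"], *toy_model))   # negative
```

Skipping words that never appeared in training is one possible design choice; another is to let them contribute the smoothed probability 1 / (n + |vocabulary|) for every class.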

C. Support Vector Machine
SVM is used to find the best hyperplane by maximizing the margin, that is, the distance between the hyperplane and the nearest data objects of each class. A hyperplane is a function that separates data objects according to their class. The distance between the hyperplane and the data objects varies. The outermost data objects closest to the hyperplane are called support vectors. Support vectors are the most difficult objects to classify because their positions almost overlap with those of the other class. Given their critical nature, only the support vectors are taken into account by SVM when finding the optimal hyperplane [14].
SVM receives the results of feature extraction as numerical input, together with the patterns that will be used in the labeling process. The output of the SVM method is a line (hyperplane) that separates positively labeled opinions from negatively labeled ones. The resulting hyperplane becomes the basis for labeling new opinions using the kernel function K(xi, xd) [15].
In this study, the polynomial kernel shown in equation (7) and the Gaussian Radial Basis Function (RBF) kernel shown in equation (8) are used.
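As a sketch, the polynomial and Gaussian RBF kernels are commonly written in the forms below. The symbols c (offset), d (degree), and σ (kernel width) are assumptions introduced here for illustration, and the exact parameterization used in this study may differ.

```latex
% Polynomial kernel (common form; c and d are assumed symbols)
K(x_i, x_j) = \left( x_i \cdot x_j + c \right)^{d}

% Gaussian RBF kernel (common form; \sigma is an assumed symbol)
K(x_i, x_j) = \exp\!\left( -\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}} \right)
```

Both functions return a similarity score between two feature vectors, which lets SVM separate classes that are not linearly separable in the original feature space.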
The training process uses a sequential learning algorithm with the following steps:
- Calculate the Hessian matrix using equation (9).
- Repeat the three calculations in equation (10) until the iteration limit is reached.
- Take as support vectors the data with αj > thresholdSV, then calculate the value of the bias with equation (11).
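The three training steps can be illustrated with the sketch below. It follows one commonly described form of sequential SVM training and should be read as an assumption, not as the exact equations of this study: the update rules, the parameter names (`lam`, `gamma`, `C`, `threshold_sv`), their default values, and the toy data are all hypothetical.

```python
import math


def rbf_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def train_sequential_svm(X, y, kernel=rbf_kernel, lam=0.5, gamma=0.1,
                         C=1.0, max_iter=200, threshold_sv=1e-6):
    """Train on labels y in {+1, -1}; return (alpha, support indices, bias)."""
    n = len(X)
    # Step 1: Hessian matrix D_ij = y_i * y_j * (K(x_i, x_j) + lam^2).
    D = [[y[i] * y[j] * (kernel(X[i], X[j]) + lam ** 2) for j in range(n)]
         for i in range(n)]
    alpha = [0.0] * n
    # Step 2: repeat the three calculations until the iteration limit.
    for _ in range(max_iter):
        E = [sum(alpha[j] * D[i][j] for j in range(n)) for i in range(n)]
        delta = [min(max(gamma * (1 - E[i]), -alpha[i]), C - alpha[i])
                 for i in range(n)]
        alpha = [a + d for a, d in zip(alpha, delta)]
    # Step 3: support vectors are the points whose alpha exceeds the threshold.
    support = [i for i in range(n) if alpha[i] > threshold_sv]

    def raw_score(x):
        return sum(alpha[i] * y[i] * kernel(X[i], x) for i in support)

    # Bias from the support vector with the largest alpha in each class.
    x_pos = X[max((i for i in support if y[i] == 1), key=lambda i: alpha[i])]
    x_neg = X[max((i for i in support if y[i] == -1), key=lambda i: alpha[i])]
    bias = -0.5 * (raw_score(x_pos) + raw_score(x_neg))
    return alpha, support, bias


def predict(x, X, y, alpha, support, bias, kernel=rbf_kernel):
    """Label a new point by the sign of the kernel decision function."""
    score = sum(alpha[i] * y[i] * kernel(X[i], x) for i in support) + bias
    return 1 if score >= 0 else -1


# Toy data (hypothetical): two positive and two negative points.
X_toy = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -2.0)]
y_toy = [1, 1, -1, -1]
alpha_toy, support_toy, bias_toy = train_sequential_svm(X_toy, y_toy)
print(predict((1.5, 1.5), X_toy, y_toy, alpha_toy, support_toy, bias_toy))
```

In this sketch the learning rate `gamma` must be small relative to the diagonal of the Hessian matrix for the updates to settle, and `C` caps each αj from above.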
The sentiment analysis process starts from the student opinion questionnaire dataset on the services of the IST AKPRIND campus, on which preprocessing is then carried out. The classification analysis yields the orientation of opinions, positive or negative, using the Naïve Bayes Classifier and the Support Vector Machine. Feature extraction and selection are added to the classification as a comparison of model performance. The process of this study is illustrated in Figure 1.

C. Preprocessing
Data preprocessing is the process of transforming low-quality data into high-quality data that is easier to process [6]. In this study, several data preprocessing techniques were used, including dataset dimension reduction, case folding, punctuation removal, stopword removal, lemmatization, and tokenization. Dimension reduction refers to selecting the dimensions required for the research; the dimension used in this case is the review text. Case folding is the process of converting all letters to lowercase. Only the letters 'a' through 'z' are kept; any other character is regarded as a delimiter or word separator.
Punctuation removal is the process of removing punctuation from a sentence. Tokenization is the process of dividing an input string into tokens based on the words that compose it, so that every word in a document is separated. Numbers, punctuation, and other non-letter characters are removed in this process, because they are treated as word separators and have no effect on text processing.
Stopword removal is the process of removing less important words that appear frequently in documents. It eliminates stop words such as "which," "the," and "and" to shorten the classification process.
Lemmatization is the process of converting each tokenized word into its base or root form. Every affix is removed and the word is converted into its basic word, which further optimizes later text processing. For example, lemmatization converts the words "applied," "words," and "saw" to "apply," "word," and "see." An example of preprocessing on the research dataset is shown in Table 1.
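The preprocessing steps above can be sketched as a small pipeline. This is an illustrative sketch only: the stop word set contains just the three examples from the text, the lemma lookup table is a hypothetical stand-in for a real lemmatizer, and the function name is an assumption.

```python
import re

# Only the three stop words named in the text; a real list would be far longer.
STOPWORDS = {"which", "the", "and"}

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"applied": "apply", "words": "word", "saw": "see"}


def preprocess(text):
    """Return the list of cleaned, lemmatized tokens for one review text."""
    text = text.lower()                       # case folding
    text = re.sub(r"[^a-z]+", " ", text)      # non-letters act as word separators
    tokens = text.split()                     # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [LEMMAS.get(t, t) for t in tokens]            # lemmatization


print(preprocess("The words, which I saw and applied in 2017!"))
# ['word', 'i', 'see', 'apply', 'in']
```

Tokenizing before stopword removal and lemmatization is a convenience of this sketch, since both later steps operate on individual words.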

D. Feature Selection Information Gain
Feature selection is the process of selecting the features that have a large influence on the analysis process. By using this process, it is hoped that the analysis will be efficient and its results accurate. For example, the information gain weight is calculated for the "Bagus" feature. In Table 2, out of 15 documents, the "Bagus" feature appears in 5 documents with positive sentiment and 3 documents with negative sentiment. There are 6 documents with positive sentiment, 5 of which contain the "Bagus" feature, and 9 documents with negative sentiment, 3 of which contain the "Bagus" feature. From these counts the entropy values can be calculated, and the last step is to calculate the information gain weight. The information gain weight is used to identify and discard features that do not have a major influence on the analysis process. The aim is to streamline the analysis by using only a few features that have a large impact.
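The calculation for the "Bagus" example can be sketched as follows, assuming the standard definition of information gain as the entropy of the whole collection minus the weighted entropy of the documents with and without the feature. The function names are hypothetical, and the formula is the textbook one rather than one confirmed by this study.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(pos_with, neg_with, pos_total, neg_total):
    """Information gain of a binary feature (present or absent in a document)."""
    total = pos_total + neg_total
    n_with = pos_with + neg_with
    n_without = total - n_with
    h_all = entropy([pos_total, neg_total])
    h_with = entropy([pos_with, neg_with])
    h_without = entropy([pos_total - pos_with, neg_total - neg_with])
    return h_all - (n_with / total) * h_with - (n_without / total) * h_without


# Counts from the "Bagus" example: 6 positive and 9 negative documents,
# with the feature present in 5 positive and 3 negative documents.
ig_bagus = information_gain(pos_with=5, neg_with=3, pos_total=6, neg_total=9)
print(round(ig_bagus, 3))  # roughly 0.186
```

Features would then be ranked by this weight, and those falling below a chosen cutoff would be dropped before classification.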

E. Text Analytic using NBC and SVM
The methods used are the Naive Bayes Classifier and the Support Vector Machine. The data are divided into two sets: training data and test data. The training data is a collection of data whose sentiment values are known, as in Figure 2, and it serves as the benchmark for assigning sentiment to new data. In this study, the training data consisted of 3,999 randomly selected records. The test data is a collection of data whose sentiment value is unknown. It consists of the remaining 25,760 records that had not been given a sentiment value.

IV. RESULT AND DISCUSSION
In this research, a web-based sentiment analysis application was built. The application interface can be seen in Figure 3. The application uses four types of analysis models, namely:
1. Naive Bayes Classifier
2. Naive Bayes Classifier with Information Gain
3. Support Vector Machine
4. Support Vector Machine with Information Gain

Figure 3. Sentiment analysis application interface
The results show that the accuracy of each model can exceed 90%. The analysis of 25,760 opinions shows that there are more negative sentiments than positive sentiments. The models that use information gain complete the analysis faster than those without it. Details of the results of the four analysis models can be seen in Table 2, and the comparison of model accuracy can be seen in Table 3. Based on the average accuracy test, it can be concluded that the Support Vector Machine method is more accurate and more stable than the Naive Bayes Classifier method, with an average accuracy of 97.09% and a highest value of 99.67%. The application can also be used for sentence analysis per category through the File Analysis button. An example of the results of the analysis can be seen in

V. CONCLUSIONS
Based on the results of this research, it can be concluded that the combination of the Naive Bayes Classifier with Information Gain and the Support Vector Machine with Information Gain can analyze sentiment automatically. Trials using opinion data collected from 2014 to 2017 show that there is more negative sentiment than positive sentiment. The accuracy of the analysis reached 99.67%, with an average of 97.09%. The SVM method has higher accuracy than NBC: the Support Vector Machine reached a highest accuracy of 99.67% and a lowest of 94.17%, while the Naive Bayes Classifier reached a highest accuracy of 96.56% and a lowest of 81.69%. The application of information gain does not significantly affect accuracy. However, it strongly affects the duration of the analysis, especially for the SVM method. In the test data analysis, applying information gain to SVM sped up the analysis process by 195.71%.