ML Techniques for Analysing Macedonian Restaurant Reviews
While machine learning models for natural language processing have traditionally focused on widely spoken languages such as English and Spanish, less commonly spoken languages have seen far less development. However, with the recent rise in e-commerce driven by the COVID-19 pandemic, even less commonly spoken languages like Macedonian are generating large amounts of data through online reviews. This opens an opportunity to develop and train machine learning models for sentiment analysis on Macedonian restaurant reviews, which can help businesses better understand customer sentiment and improve their services. In this study, we tackle the challenges that arise from this problem and explore and compare various sentiment analysis models for Macedonian restaurant reviews, ranging from classical random forests to modern deep learning techniques and transformers.
Contents
- Challenges and preprocessing the data
- Creating vector embeddings
– LASER embeddings
– Multilingual universal sentence encoder
– OpenAI Ada v2
- Machine learning models
– Random forest
– XGBoost
– Support vector machines
– Deep learning
– Transformers
- Results and Discussion
- Future work
- Conclusion
Language is a uniquely human communication tool, and computers cannot interpret it without appropriate processing techniques. To allow machines to analyse and understand language, we need to represent its complex semantic and lexical information in a way that can be processed computationally. One popular method for achieving this is the use of vector representations. In recent years, in addition to language-specific representation models, multilingual models have emerged. These models can capture the semantic context of text across a large number of languages.
However, for languages written in Cyrillic script, an additional challenge arises: users on the internet often express themselves in Latin script, resulting in mixed data consisting of both Latin and Cyrillic text. To address this challenge, I used a dataset of approximately 500 reviews from a local restaurant, containing both Latin and Cyrillic script. The dataset also includes a small subset of English reviews, which helps assess performance on mixed data. Additionally, online texts can contain symbols such as emojis that need to be removed. Therefore, preprocessing is a crucial step before any text embedding can be performed.
import pandas as pd
import numpy as np

# load the dataset into a dataframe
df = pd.read_csv('/content/data.tsv', sep='\t')
# see the distribution of the sentiment classes
df['sentiment'].value_counts()
# -------
# 0 337
# 1 322
# Name: sentiment, dtype: int64
The dataset contains positive and negative classes with a nearly equal distribution. To remove emojis, I used the Python library emoji, which can easily strip emojis and other symbols.
!pip install emoji
import emoji

clt = []
for comm in df['comment'].to_numpy():
    clt.append(emoji.replace_emoji(comm, replace=""))
df['comment'] = clt
df.head()
For the problem of Cyrillic and Latin text, I converted all texts into one script or the other, so that the machine learning models can be tested on both and their performance compared. I used the “cyrtranslit” library for this task; it supports most Cyrillic alphabets, including Macedonian, Bulgarian, and Ukrainian.
import cyrtranslit
latin = []
cyrillic = []
for comm in df['comment'].to_numpy():
    latin.append(cyrtranslit.to_latin(comm, "mk"))
    cyrillic.append(cyrtranslit.to_cyrillic(comm, "mk"))

df['comment_cyrillic'] = cyrillic
df['comment_latin'] = latin
df.head()
For the embedding models I used, it is generally not necessary to remove punctuation, stop words, or do other text cleaning. These models are designed to process natural language text, including punctuation marks, and often capture the meaning of sentences more accurately when it is left intact. With that, the preprocessing of the text is finished.
Currently, there are no large-scale Macedonian representation models available. However, we can use multilingual models trained on Macedonian text. There are several such models available, but for this task, I have found that LASER and Multilingual Universal Sentence Encoder would be the most suitable options.
LASER
LASER (Language-Agnostic SEntence Representations) is a language-agnostic approach for generating high-quality multilingual sentence embeddings. The LASER model is based on a two-stage process: the first stage preprocesses the text, including tokenization, lowercasing, and applying SentencePiece, and is language-specific. The second stage maps the preprocessed input text to a fixed-length embedding using a multi-layer bidirectional LSTM.
LASER has been shown to outperform other popular sentence embedding methods, such as fastText and InferSent, on a range of benchmark datasets. Additionally, the LASER model is open-source and freely available, making it easily accessible to everyone.
Creating the embeddings with LASER is a straightforward process:
!pip install laserembeddings
!python -m laserembeddings download-models

from laserembeddings import Laser
# create the embeddings
laser = Laser()
embeddings_c = laser.embed_sentences(df['comment_cyrillic'].to_numpy(),lang='mk')
embeddings_l = laser.embed_sentences(df['comment_latin'].to_numpy(),lang='mk')
# save the embeddings
np.save('/content/laser_multi_c.npy', embeddings_c)
np.save('/content/laser_multi_l.npy', embeddings_l)
Multilingual Universal Sentence Encoder
Multilingual Universal Sentence Encoder (MUSE) is a pre-trained model for generating sentence embeddings, developed by Google. MUSE is designed to encode sentences in multiple languages into a common embedding space.
The model is based on a deep neural network that uses an encoder-decoder architecture to learn a mapping between a sentence and its corresponding embedding vector in a high-dimensional space. MUSE is trained on a large-scale multilingual corpus, which includes texts from Wikipedia, news articles, and web pages.
!pip install tensorflow_text
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import tensorflow_text

# load the MUSE module
module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3"
embed = hub.load(module_url)
sentences = df['comment_cyrillic'].to_numpy()
muse_c = embed(sentences)
muse_c = np.array(muse_c)
sentences = df['comment_latin'].to_numpy()
muse_l = embed(sentences)
muse_l = np.array(muse_l)
np.save('/content/muse_c.npy', muse_c)
np.save('/content/muse_l.npy', muse_l)
OpenAI Ada v2
Towards the end of 2022, OpenAI announced its new state-of-the-art embedding model, text-embedding-ada-002. As this model is built on GPT-3, it has multilingual processing capabilities. To compare the results between the Cyrillic and Latin reviews, I ran the model on both datasets:
!pip install openai

import openai
openai.api_key = 'YOUR_KEY_HERE'
embeds_c = openai.Embedding.create(input = df['comment_cyrillic'].to_numpy().tolist(), model='text-embedding-ada-002')['data']
embeds_l = openai.Embedding.create(input = df['comment_latin'].to_numpy().tolist(), model='text-embedding-ada-002')['data']
full_arr_c = []
for e in embeds_c:
    full_arr_c.append(e['embedding'])
full_arr_c = np.array(full_arr_c)

full_arr_l = []
for e in embeds_l:
    full_arr_l.append(e['embedding'])
full_arr_l = np.array(full_arr_l)
np.save('/content/openai_ada_c.npy', full_arr_c)
np.save('/content/openai_ada_l.npy', full_arr_l)
This section explores the various machine learning models utilized to predict sentiment in Macedonian restaurant reviews. From traditional machine learning models to deep learning techniques, we’ll look into the strengths and weaknesses of each model and compare their performance on the dataset.
Before running any models, the data should be split into training and test sets for every embedding type. This can easily be done with the sklearn library.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(embeddings_c, df['sentiment'], test_size=0.2, random_state=42)
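The same split then has to be repeated for the other embedding matrices (LASER Latin, MUSE, Ada). A small helper, sketched below just for convenience (the split_embeddings name is my own), keeps those splits consistent by reusing the same random_state, so every model is evaluated on the same reviews:
# convenience sketch: split any embedding matrix with the same fixed seed,
# so all embedding types share the same train/test reviews
def split_embeddings(embeddings, labels, seed=42):
    return train_test_split(embeddings, labels, test_size=0.2, random_state=seed)

# for example, the LASER Latin and the MUSE Cyrillic embeddings created earlier
X_train_l, X_test_l, y_train_l, y_test_l = split_embeddings(embeddings_l, df['sentiment'])
X_train_m, X_test_m, y_train_m, y_test_m = split_embeddings(muse_c, df['sentiment'])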
Random Forests
Random Forests are a widely-used machine learning algorithm that uses an ensemble of decision trees to classify data points. The algorithm works by training each decision tree on a subset of the full dataset and a random subset of the features. During inference, each decision tree generates a prediction of the sentiment, and the final output is obtained by taking a majority vote of all the trees. This approach helps to prevent overfitting and can lead to more robust and accurate predictions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
print(classification_report(y_test,rfc.predict(X_test)))
print(confusion_matrix(y_test,rfc.predict(X_test)))
XGBoost
XGBoost (eXtreme Gradient Boosting) is a powerful ensemble method, mainly used on tabular data. Like Random Forests, XGBoost also uses decision trees to classify data points, but with a different approach. Instead of training all trees at once, XGBoost trains the trees sequentially, each one learning from the errors of the previous ones. This process is called boosting: combining weak models to form a stronger one. Although XGBoost primarily shines on tabular data, it is interesting to test it on vector embeddings as well.
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix

xgb = XGBClassifier(max_depth=15)
xgb.fit(X_train, y_train)
print(classification_report(y_test, xgb.predict(X_test)))
print(confusion_matrix(y_test, xgb.predict(X_test)))
Support Vector Machines
The Support Vector Machine (SVM) is a popular and powerful machine learning algorithm for classification and regression tasks. It works by finding the optimal hyperplane that separates the data into different classes while maximizing the margin between them. SVMs are particularly useful for high-dimensional data and can handle non-linear boundaries using kernel functions.
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

svm = SVC()
svm.fit(X_train, y_train)
print(classification_report(y_test, svm.predict(X_test)))
print(confusion_matrix(y_test, svm.predict(X_test)))
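As a quick sanity check, a trained classifier can also score unseen text: embed a new review with LASER and pass the vector to the model. The sketch below assumes the laser object and the fitted svm from above, and the review itself is just a made-up example:
# made-up example review, embedded with LASER and scored by the trained SVM
new_review = "Храната беше одлична, услугата брза и љубезна."
new_embedding = laser.embed_sentences([new_review], lang='mk')
print(svm.predict(new_embedding))  # 1 = positive, 0 = negative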
Deep Learning
Deep learning is an advanced machine learning method that utilizes artificial neural networks consisting of multiple layers of neurons. Deep learning networks perform very well on text and image data, and implementing them is straightforward with the Keras library.
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

model = keras.Sequential()
model.add(keras.layers.Dense(256, activation='relu', input_shape=(1024,)))
model.add(keras.layers.Dropout(0.2))
model.add(keras.layers.Dense(128, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=11, validation_data=(X_test, y_test))
test_loss, test_acc = model.evaluate(X_test, y_test)
print('Test accuracy:', test_acc)
y_pred = model.predict(X_test)
print(classification_report(y_test,y_pred.round()))
print(confusion_matrix(y_test,y_pred.round()))
Here, a neural network with two hidden layers and rectified linear unit (ReLU) activations was used. The output layer contains a single neuron with a sigmoid activation function, enabling the network to make binary predictions of positive or negative sentiment. The binary cross-entropy loss is paired with the sigmoid activation to train the model. Additionally, Dropout was used to help prevent overfitting and improve the generalization of the model. I tested various hyperparameter configurations and found that this one works best for this problem.
With the following function we can visualize the training of the models.
import matplotlib.pyplot as plt

def plot_accuracy(history):
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('Model Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend(['Train', 'Validation'], loc='upper left')
    plt.show()
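For example, calling it with the history object returned by model.fit above shows how the training and validation accuracy evolve across the epochs:
# plot the accuracy curves of the deep learning model trained above
plot_accuracy(history)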
Transformers
Fine-tuning transformers is a popular technique in natural language processing that involves adjusting pre-trained transformer models to suit specific tasks. Transformers, such as BERT, GPT-2, and RoBERTa, are pre-trained on large amounts of text data and are capable of learning complex patterns and relationships in language. However, in order to perform well on specific tasks, such as sentiment analysis or text classification, these models need to be fine-tuned on task-specific data.
For these types of models, the vector representations we created earlier are not needed, as transformers work directly on tokens extracted from the text. For this sentiment analysis task in Macedonian, I worked with bert-base-multilingual-uncased, the multilingual version of the BERT model.
Hugging Face has made fine-tuning transformers a very simple task. First, the data is loaded into a Hugging Face dataset, then the text is tokenized, and finally the model is trained.
from sklearn.model_selection import train_test_split
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from sklearn.metrics import classification_report, confusion_matrix

# create csv files of the train and test sets to be loaded by the dataset
df.rename(columns={"sentiment": "label"}, inplace=True)
train, test = train_test_split(df, test_size=0.2)
pd.DataFrame(train).to_csv('train.csv',index=False)
pd.DataFrame(test).to_csv('test.csv',index=False)
# load the dataset
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
# tokenize the text
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-uncased')
encoded_dataset = dataset.map(lambda t: tokenizer(t['comment_cyrillic'], truncation=True), batched=True,load_from_cache_file=False)
# load the pretrained model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-multilingual-uncased',num_labels =2)
# fine-tune the model
arg = TrainingArguments(
"mbert-sentiment-mk",
learning_rate=5e-5,
num_train_epochs=5,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
seed=42,
push_to_hub=True
)
trainer = Trainer(
model=model,
args=arg,
tokenizer=tokenizer,
train_dataset=encoded_dataset['train'],
eval_dataset=encoded_dataset['test']
)
trainer.train()
# get predictions
predictions = trainer.predict(encoded_dataset["test"])
preds = np.argmax(predictions.predictions, axis=-1)
# evaluate
print(classification_report(predictions.label_ids,preds))
print(confusion_matrix(predictions.label_ids,preds))
With that we have successfully fine-tuned BERT for sentiment analysis.
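As a quick sanity check, the fine-tuned model and tokenizer can also be wrapped in a Hugging Face pipeline to score new text directly. This is just a sketch, and the review below is a made-up example:
from transformers import pipeline

# wrap the fine-tuned model and tokenizer from above in a text-classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
# made-up example review; the output is a list like [{'label': ..., 'score': ...}],
# where LABEL_1 corresponds to the positive class
print(classifier("Храната беше одлична, препорачувам!"))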
The results of the sentiment analysis on Macedonian restaurant reviews are promising, with several models achieving high accuracy and F1 scores. The experiments show that deep learning models and transformers outperform traditional machine learning models like Random Forests and Support Vector Machines, although not by much. The transformers and the deep neural networks using the new OpenAI embeddings managed to break the 0.9 accuracy barrier.
The OpenAI embedding model text-embedding-ada-002 considerably boosted the results obtained even with the classical ML models, especially the Support Vector Machines. The best result in this study was achieved with this embedding on the Cyrillic text using a deep learning model.
Generally, the Latin texts performed worse than the Cyrillic texts. I initially hypothesized that they would perform better, given the prevalence of similar Latin-script words in other Slavic languages and the fact that the embedding models were trained on such data, but the findings did not support this hypothesis.
In future work, it would be valuable to collect more data to further train and test the models, especially with a larger diversity of review topics and sources. Additionally, incorporating more features such as metadata (e.g., the reviewer's age, gender, or location) or temporal information (e.g., the time of the review) might improve accuracy. Finally, it would be interesting to extend the analysis to other less commonly spoken languages and compare the performance of those models with the ones trained on the Macedonian reviews.
In conclusion, this post has demonstrated the effectiveness of various machine learning models and embedding techniques for sentiment analysis of Macedonian restaurant reviews. Several classic machine learning models were explored and compared, such as Random Forests and SVMs, alongside modern deep learning techniques including neural networks and transformers. The results show that fine-tuned transformer models and deep learning models using the newest OpenAI embeddings outperform the other methods, with validation accuracies of up to 90%.
Thanks for reading!