Abstract:
Cyber bullying, also known as cyber harassment or online bullying, is bullying or harassment carried out by electronic means. It has become increasingly common, especially among teenagers, who bully or harass others on the internet, particularly on social media sites. Its harmful effects can cause victims to suffer depression, ill health, emotional disorders and low self-esteem, and may lead to physical violence and possibly suicide.
Discussion forums can spread a message to a large population almost instantly. They are used to share views and ideas on topics such as politics and religion, but some people intentionally hurt religious or racial sentiments through malicious posts. Hence it is important to filter the posts on these forums. Datasets of cyber bullying words were obtained from kaggle.com and positivewordsresearch.com/list-of-negative-words in text and comma-separated values (CSV) formats to serve as training data. K-Nearest Neighbour, a supervised learning algorithm, used this labelled database to predict the class of a post. Similarities and dissimilarities between the words in a post and the words in the database were measured using the Hamming distance metric. Binary Logistic Regression was then used to estimate the probability of a "suspicious post" given the values of the suspicious words in the post. The model was implemented in the Python programming language. Evaluation was based on three metrics: Precision, Recall and F-measure. The model was observed to perform effectively and efficiently when compared with previous works.
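The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation. The toy lexicon, the padding of unequal-length words before computing the Hamming distance, the choice of k, and the logistic coefficients b0 and b1 are all assumptions made for illustration. In practice the coefficients would be fitted to labelled posts and the lexicon would come from the datasets named above.

```python
import math
from collections import Counter

# Hypothetical toy lexicon standing in for the labelled word lists:
# 1 = suspicious word, 0 = benign word.
LEXICON = {
    "idiot": 1, "stupid": 1, "loser": 1, "ugly": 1,
    "friend": 0, "happy": 0, "thanks": 0, "great": 0,
}

def hamming(a: str, b: str) -> int:
    """Hamming distance between two words.

    The distance is defined for equal-length strings, so the shorter word
    is padded with spaces here (an assumption, not taken from the paper).
    """
    width = max(len(a), len(b))
    a, b = a.ljust(width), b.ljust(width)
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

def knn_label(word: str, k: int = 3) -> int:
    """Majority label among the k lexicon words nearest to `word`."""
    nearest = sorted(LEXICON, key=lambda w: hamming(word, w))[:k]
    votes = Counter(LEXICON[w] for w in nearest)
    return votes.most_common(1)[0][0]

def suspicious_probability(post: str, k: int = 3,
                           b0: float = -2.0, b1: float = 1.5) -> float:
    """Binary logistic model P(suspicious | x), where x is the number of
    words in the post that k-NN labels as suspicious.

    b0 and b1 are made-up coefficients, not fitted values.
    """
    x = sum(knn_label(word, k) for word in post.lower().split())
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F-measure for binary labels (1 = suspicious)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

A post would be flagged when `suspicious_probability(post)` exceeds a chosen threshold such as 0.5, and the resulting predictions could then be scored against hand-labelled posts with `precision_recall_f1`. The sketch splits posts on whitespace only; a real filter would also need to strip punctuation and handle misspellings.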