
SEMANTIC RELATEDNESS ANALYSIS

Xiaohan Gao, Di An


1. Introduction

2. Dataset

3. Model

4. Result

5. Conclusion

A PDF version of the final report is available online.


INTRODUCTION

Semantic relatedness has many applications. After finding one information resource on the internet, we often want to find similar resources right away, and semantic relatedness makes this easier and smarter. For example, browsing Wikipedia is a better experience when related pages are already recommended by an intelligent machine, and when we create a new Wikipedia entry, we do not have to categorize the item ourselves: the machine can locate the most related area for us.

 

In this project, we decided to get hands-on experience implementing an NLP model after taking the course. We train our models by feeding the preprocessed Wikipedia corpus to word2vec, using skip-gram/CBOW to obtain the word vectors, and then compute their cosine similarity. Finally, WordSim-353 and MC-30 are used for evaluation.
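For reference, the similarity measure we use is the cosine of the angle between two word vectors. The short NumPy sketch below shows the computation on made-up toy vectors; the values and variable names are illustrative and are not output from our model.

import numpy as np

def cosine_similarity(u, v):
    # sim(u, v) = (u . v) / (||u|| * ||v||), a value in [-1, 1]
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 3-dimensional "word vectors" just to illustrate the arithmetic;
# real word2vec vectors have hundreds of dimensions.
v_god = np.array([0.8, 0.1, 0.3])
v_lord = np.array([0.7, 0.2, 0.4])
print(cosine_similarity(v_god, v_lord))  # close to 1.0 for related words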

DATASET

We used the most up-to-date English Wikipedia entries as the input data [1]. The original data set is a complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML, and is about 70 GB. We used WikiExtractor [2], an open-source Python tool, to extract the wiki pages from the XML dump and parse them into plain text.

 

To make the extracted text more useful, we replaced each link (a hyperlink to another wiki page) with the title of the linked page. WikiExtractor supports keeping the links when given the appropriate parameters.
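As an illustration of this preprocessing step, the sketch below reads the plain-text files that WikiExtractor produces and turns them into lowercased token lists suitable for word2vec. The directory layout (extracted/AA/wiki_00, ...) and the simple regex tokenizer are assumptions about a typical WikiExtractor run, not an exact record of ours.

import glob
import re

def iter_sentences(extracted_dir="extracted"):
    # Yield lowercased token lists from WikiExtractor output files.
    # Assumes the usual layout (extracted/AA/wiki_00, ...), where each
    # article is wrapped in <doc ...> ... </doc> tags.
    for path in glob.glob(f"{extracted_dir}/*/wiki_*"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                # Skip the <doc> / </doc> wrapper lines and empty lines.
                if not line or line.startswith("<doc") or line.startswith("</doc"):
                    continue
                tokens = re.findall(r"[a-z0-9']+", line.lower())
                if tokens:
                    yield tokens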


MODEL

Word2vec:

We used gensim (a Python library) to generate the word vectors and models. For CBOW, training took about 40 hours with 4 cores. To speed up training, we used negative sampling [3] when training the skip-gram model; that run took 16 hours with 4 cores, a significant speedup. A minimal gensim sketch using these settings follows the parameter lists below.

 

CBOW training parameters:

min_count = 10 (discard words that occur fewer than 10 times)

window = 10 (the maximum distance between the current and predicted word within a sentence)

 

Skip-gram training parameters:

min_count = 5 (discard words that occur fewer than 5 times); we used a smaller min_count because the vocabulary size of our CBOW run was not ideal

window = 10 (the maximum distance between the current and predicted word within a sentence)

sample = 1e-5 (downsampling rate for frequent words)
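Putting these parameters together, the following is a minimal gensim sketch of the two training configurations described above. The 200-dimensional vector size, the corpus file name, and the save path are illustrative assumptions, and keyword names follow gensim 4.x (vector_size was called size in gensim 3.x).

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# "wiki_corpus.txt" is a hypothetical file with one preprocessed,
# space-separated sentence per line (see the Dataset section).
sentences = LineSentence("wiki_corpus.txt")

# Skip-gram (sg=1) with negative sampling, using the parameters listed above.
skipgram = Word2Vec(
    sentences,
    vector_size=200,   # illustrative choice of dimension
    window=10,
    min_count=5,
    sample=1e-5,
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # negative sampling with 5 noise words per positive pair
    workers=4,
)

# CBOW run (sg=0) with min_count=10, as described above.
cbow = Word2Vec(sentences, vector_size=200, window=10, min_count=10, sg=0, workers=4)

skipgram.save("skipgram.model")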


RESULT

The following example gives an intuitive sense of the semantic similarity captured by the model:
>>> print(model.most_similar("god"))
[('lord', 0.8258302211761475), ('christ', 0.745830237865448), ('mercy', 0.7376500368118286), ('salvation', 0.7365933656692505), ('grace', 0.7206438183784485), ('truth', 0.7183467149734497), ('wisdom', 0.7103146314620972), ('spirit', 0.7078145742416382), ('righteousness', 0.7002202272415161), ('glory', 0.6989426612854004)]


The performance comparison is listed in Table 1 of our report.
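For context, scores of this kind can be computed with gensim's evaluate_word_pairs, which reports the Pearson and Spearman correlation between the model's cosine similarities and human ratings on a word-pair dataset such as WordSim-353. The sketch below is illustrative: the model path is a placeholder from the training sketch above, and the MC-30 file name is hypothetical.

from gensim.models import Word2Vec
from gensim.test.utils import datapath

model = Word2Vec.load("skipgram.model")  # placeholder path

# Correlation between model cosine similarities and human ratings on
# WordSim-353 (gensim bundles a copy of this dataset with its test data).
pearson, spearman, oov_ratio = model.wv.evaluate_word_pairs(datapath("wordsim353.tsv"))
print("WordSim-353 Pearson:", pearson)
print("WordSim-353 Spearman:", spearman)
print("Out-of-vocabulary ratio (%):", oov_ratio)

# MC-30 can be scored the same way from a tab-separated word1/word2/score file:
# model.wv.evaluate_word_pairs("mc30.tsv")  # hypothetical local file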

T-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualization. It is a nonlinear dimensionality reduction technique well suited to embedding high-dimensional data in a low-dimensional space of two or three dimensions for visualization.

We also used t-SNE to visualize the word vectors; please see our report for more detail.
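As a sketch of how such a plot can be produced with scikit-learn and matplotlib: take the vectors of the most frequent words, project them to two dimensions with t-SNE, and scatter-plot them with labels. The model path, the number of words, and the plot settings are illustrative choices rather than our exact setup.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from gensim.models import Word2Vec

model = Word2Vec.load("skipgram.model")      # placeholder path
words = model.wv.index_to_key[:300]          # 300 most frequent words (gensim 4.x)
vectors = model.wv[words]

# Reduce the high-dimensional word vectors to 2-D for plotting.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(vectors)

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y), fontsize=7)
plt.title("t-SNE projection of word2vec vectors")
plt.savefig("tsne_words.png", dpi=150)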

CONCLUSION

In this project, we used word2vec to generate numerical representations of words, and tasks such as WordSim-353 to evaluate the models. A GloVe model was also trained with 200-dimensional vectors to compare with skip-gram and CBOW.

 

The t-SNE algorithm and other online tools [4] were used for data visualization.

 

From Table 1 we can conclude that performance improves as the word-vector dimension increases, at the cost of longer training time as a trade-off. Also, GloVe seems to perform better than word2vec.

 

There is still a gap between our results and state-of-the-art performance. However, with more time to fine-tune the word2vec hyperparameters (window size, min_count, etc.), we believe we could improve the performance.

 

Wikipedia is a huge corpus, and training a model on it takes a long time, so we did not have time to cover every configuration. Time permitting, we will try other word-vector dimensions to see how they affect performance. We will also try other models such as fastText [5] and other evaluation tasks, such as analogy tasks, to gain deeper insight into this problem.

 

Our code is available in our GitHub repo: https://github.com/andear/semantic-relatedness-wikipedia


Want to read more details? Our final report is available here: https://drive.google.com/file/d/1L-nFOZGokDq49NkAPt2Yy8mxNAwpyB5e/view?usp=sharing
