
Calculation and loading embedding model in metrics. #89

Open
hyades910739 opened this issue Jan 17, 2023 · 1 comment

Comments

@hyades910739

  • OCTIS version: 1.11.0

Description

Hi, there are a couple of issues I want to discuss, mainly focusing on the classes WECoherencePairwise and WECoherenceCentroid, and other metrics that use embeddings.
If these issues sound reasonable, I am happy to create a PR for them. Thanks! :)

1. Calculation:

In the score function, the L1 normalization of the word embeddings and of the centroid (t) seems unnecessary: since cosine similarity is used later on, the score is identical without the normalization.
(See L122 and L196 in https://github.com/MIND-Lab/OCTIS/blob/master/octis/evaluation_metrics/coherence_metrics.py)
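
A minimal check illustrating the point: cosine similarity is invariant under positive rescaling of its arguments, and L1 normalization is exactly such a rescaling.

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
a, b = rng.normal(size=50), rng.normal(size=50)

# L1 normalization rescales each vector by a positive constant,
# so the cosine similarity is unchanged.
a_l1 = a / np.abs(a).sum()
b_l1 = b / np.abs(b).sum()
assert np.isclose(cosine(a, b), cosine(a_l1, b_l1))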

2.1 Default word embedding model:

Take WECoherenceCentroid for example: it loads word2vec-google-news-300 as the default model weights. Since the file is large (1662 MB), downloading it takes a long time, which may not be desirable as default behavior. Also, some research (see Table 2 in [1]) shows that GloVe (glove-wiki-gigaword-50, 65 MB) may perform better than Word2Vec.

So I think it may be a good idea to switch the default model to glove-wiki-gigaword-50, or to another word-embedding model with a smaller file size.

([1] Ding, Ran, Ramesh Nallapati, and Bing Xiang. "Coherence-aware neural topic modeling." arXiv preprint arXiv:1809.02687 (2018).)
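
For reference, the smaller model can be fetched through gensim's downloader API (a minimal sketch):

import gensim.downloader as api

# ~65 MB download, returned as KeyedVectors,
# versus ~1662 MB for word2vec-google-news-300.
wv = api.load("glove-wiki-gigaword-50")
print(wv["topic"].shape)  # (50,)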

2.2 Parameter name and function input:

Also, it seems reasonable to be able to compute WECoherence with different embedding models. Currently you can use word2vec_path to pass any model in w2v format (e.g., glove-wiki-gigaword-50 provided by gensim), but the parameter name is really confusing. So if we want to support models other than word2vec, we should rename the parameter word2vec_path to something more general.

2.3 KeyedVectors as function input:

Currently, if you want to use several embedding-related metrics (e.g., WECoherencePairwise and WECoherenceCentroid at the same time), every metric instance creates its own embedding model, which is unnecessary. To avoid this, we can take a KeyedVectors object as input directly, as in the example below.
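
For example, assuming the model parameter proposed in the sketch further below, the embeddings could be loaded once and shared across metrics:

import gensim.downloader as api

# Load the embedding model once and share it across metric instances
# (assumes the `model` parameter from the proposal below).
wv = api.load("glove-wiki-gigaword-50")
pairwise = WECoherencePairwise(model=wv)
centroid = WECoherenceCentroid(model=wv)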

2.4 Support dict-like embedding objects:

When calculating the metrics, we only need embeddings for the few words that appear in the topics, so there is no need to load a whole embedding model. In this case, we can use a dict-like object to store just the embeddings we need.

In OCTIS, KeyedVectors is only used with the in operator to check whether an embedding exists, and with __getitem__ to retrieve the embedding. So we can simply convert a dict-like object to KeyedVectors and make the KeyedVectors-related functions compatible with embeddings passed in dictionary format. Also, we only need to change code in __init__, without touching more complex code like WordEmbeddingsInvertedRBOCentroid.score.

Based on 2.1, 2.2, 2.3, and 2.4, we can change WECoherenceCentroid.__init__ to the following, and all metrics that use embeddings can load their model through the helper function _load_keyedvectors.

from typing import Dict, Optional, Union

import gensim.downloader as api
import numpy as np
from gensim.models import KeyedVectors

class WECoherenceCentroid(AbstractMetric):
    def __init__(self, topk=10, model: Union[Dict[str, np.ndarray], KeyedVectors, None] = None, model_name: Optional[str] = None):
        '''
        If model is provided, use it directly. If model is None, fall back
        to model_name or the default model name.
        '''
        if model is None and model_name is None:
            model_name = 'some_default_model_name'
        self._wv = _load_keyedvectors(model, model_name)


def _load_keyedvectors(model, model_name):
    if model is not None:
        if isinstance(model, KeyedVectors):
            pass
        elif isinstance(model, dict):
            model = _convert_dict_to_keyedvectors(model)
        else:
            raise ValueError()
    elif model_name is not None:
        # Download/load a pretrained model through gensim's downloader API.
        model = api.load(model_name)
    else:
        raise ValueError()
    return model


def _convert_dict_to_keyedvectors(dic):
    # Infer the vector size from the first embedding in the dict.
    arr = next(iter(dic.values()))
    assert len(arr.shape) == 1
    vector_size = arr.shape[0]
    kv = KeyedVectors(vector_size)
    # The metrics only use `in` (__contains__) and __getitem__, so
    # populating key_to_index and vectors is enough.
    kv.key_to_index = {k: idx for idx, k in enumerate(dic.keys())}
    kv.vectors = np.stack(list(dic.values()))
    return kv
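
For example (hypothetical usage of the sketch above), a caller could supply only the embeddings needed for the topic words:

import numpy as np

# Supply just the embeddings for the words that occur in the topics.
embeddings = {"cat": np.random.rand(50), "dog": np.random.rand(50)}
metric = WECoherenceCentroid(topk=10, model=embeddings)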
    
@hyades910739
Author

I have already forked the repo and created two branches to fix this.

  1. https://github.com/hyades910739/OCTIS/tree/issue89_remove_unnecessary_normalization_in_coherence_metrics
    This one removes the unnecessary normalization code.
  2. https://github.com/hyades910739/OCTIS/tree/issue89_change_embedding_model_loading_logic_in_metrics
    This one creates a mixin to handle embedding model loading.
