-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add more careful handling of modalities #85
add more careful handling of modalities #85
Conversation
this should fix the large part of github.com/machine-intelligence-laboratory/issues/79
Update thetaless_regularizer.py
environment.yml
Outdated
@@ -0,0 +1,12 @@ | |||
name: python_test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Этот файл нужен? @bt2901
@@ -27,7 +27,7 @@ def artm_dict2df(artm_dict): | |||
|
|||
""" | |||
dictionary_data = artm_dict._master.get_dictionary(artm_dict._name) | |||
dict_pandas = {field: getattr(dictionary_data, field) | |||
dict_pandas = {field: list(getattr(dictionary_data, field)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Да, вот с этим я недавно тоже столкнулся)
self.n_dw_matrix = None | ||
|
||
self.token2id = obtain_token2id(dataset) | ||
self._batches_path = os.path.join(dataset._internals_folder_path, "batches") | ||
|
||
def _initialize_matrices(self, batch_vectorizer, token2id): | ||
self.n_dw_matrix = _batch_vectorizer2sparse_matrix( | ||
batch_vectorizer, token2id, self.modality, self.modalities_to_use | ||
batch_vectorizer, token2id, self.modality, self.modalities_to_use, False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Не надо ли сделать remove_nans
тоже параметром конструктора Теталесс регуляризатора? Точно надо именно как False его использовать по умолчанию? (Просто тогда, по-моему, как-то не очевидно, зачем вообще было добавлять remove_nans
в функции по работе со sparse матрицами для Теталесс регуляризатора, если сейчас его и задать нельзя при создании регуляризатора, и по умолчанию он и не используется вообще 🙃)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Про захардкоженный False теперь отчасти понятно — это потому, что далее там же вшита логика по обработке NaN-ов.
@@ -147,11 +148,12 @@ def _batch_vectorizer2sparse_matrix(batch_vectorizer, token2id, modality, modali | |||
# this is needed to be in sync with artm dictionary after filtering elements out | |||
# (they need to have the same shape) | |||
ind = sparse_n_dw_matrix.sum(axis=0) | |||
nonzeros = np.ravel(ind > 0) | |||
nonzeros = np.ravel((ind > 0) | (ind != ind)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ох, мне пришлось немного понапрягать котелок, чтоб вспомнить, в чём там была проблема и как её решает проверка ind != ind
😅 Надо бы коммент хотя бы добавить, что это для того, чтоб включить np.nan
значения.
sparse_n_dw_matrix = sparse_n_dw_matrix[:, nonzeros] | ||
|
||
# re-encode values to transform NaNs to explicitly stored zeros | ||
sparse_n_dw_matrix.data = np.nan_to_num(sparse_n_dw_matrix.data) | ||
if remove_nans: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
А зачем вообще может понадобиться сохранять NaN значения?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Нет, не в принципе, наверно, это бы могло представлять какой-то интерес, но в плане работы Теталесс — ему могут когда-нибудь пригодиться NaN-ы?
) | ||
ind = self.n_dw_matrix.sum(axis=0) | ||
self.modalities_mask = np.ravel((ind == ind)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ааа, вот тут, получается, снова происходит "детект" NaN-ов, даже если они не были убраны в функции-создавальщике sparse матрицы! О-ок. Хм... а не стоит ли тогда просто сделать remove_nans=True парой строчек выше?) Или... хм... Для чего будет нужна modalities_mask
? Как потом используется информация о том, что по таким-то токенам не было NaN-ов? Замена NaN-ов на нули в строчке далее не решает всех проблем?
@@ -336,7 +342,10 @@ def grad(self, pwt, nwt): | |||
tmp = g_dt.T * self.B / (phi_matrix_tr.sum(axis=1) + EPS) | |||
n_tw += (tmp - np.einsum('ij,ji->i', phi_rev_matrix, tmp)) * phi_matrix | |||
|
|||
return self.tau * (n_tw.T - nwt) | |||
result = n_tw.T - nwt | |||
result = (result.T * self.modalities_mask).T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bt2901 Скажи плз, что тут происходит) для чего нужна modalities_mask
. Напротив NaN-овских токенов могло получиться что-то ненулевое?
@@ -262,7 +264,7 @@ def calc_A_matrix( | |||
|
|||
|
|||
class ThetalessRegularizer(BaseRegularizer): | |||
def __init__(self, name, tau, modality, dataset: Dataset): | |||
def __init__(self, name, tau, modality, dataset: Dataset, modalities_to_use=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Надо будет в докстринг описание добавить)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could probably be better, but it seems to serve its purpose.
Strong visions: I have strong visions of this place in the empty times... Far below there are wavering pines... I left the rowan elphin woods to fulminate on ancient headlands, dipping slowly into the glasen seas of evening... On the devastated peaks of hills we ease the barrenness into our thin bones like a foot into a tight shoe... The narrative of this place: other than the smashed arris of the ridge there are only sad winds and silences... I lay on the cairn one more rock... I am possessed by Time...
This reverts commit a4407e9.
@@ -44,9 +44,9 @@ Consider using TopicNet if: | |||
</div> | |||
<em> | |||
Example of the two-stage experiment scheme. | |||
At the first stage, regularizer with parameter <img src="https://render.githubusercontent.com/render/math?math=\tau"> taking values in some range <img src="https://render.githubusercontent.com/render/math?math=\{\tau_1, \tau_2, \tau_3\}"> is applied. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -5,7 +5,7 @@ cache: | |||
- $HOME/.ccache | |||
|
|||
python: | |||
- "3.7" | |||
- "3.8" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Какое-то время назад проверял на 3.8 — поэтому пусть будет 3.8 :) Хотя тесты потом всё равно ещё запущу, посмотрю. Ну, а этот Трэвис вообще того, не запускается уже (стал платным что ли).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
К тому же 3.7 уже и не поддерживается вообще (https://www.python.org/downloads/)
@@ -38,6 +38,7 @@ | |||
'numpy', | |||
'pandas', | |||
'plotly', | |||
'protobuf==3.20.3', # BigARTM dependency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Кривовато, но лучше уж так, чем потом получать ошибки после установки)
name of modality on which the inference should be based. | ||
dataset: Dataset | ||
will be transformed to n_dw_matrix. | ||
modalities_to_use: iterable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Копипастнул из функции выше 🫣
this should fix the large part of #79