
Dataset too large #6

Open
NHendrickson9616 opened this issue Jun 27, 2023 · 30 comments

Comments

@NHendrickson9616

I am using the run_mlm.py file, but I have my own copy because I changed where the tokenizer loads from, since it is a different path from the model, which is local.

While initially working with this method, I used the first two lines of my dataset and it worked just fine, but now that I have expanded the input, I am getting this error:

IndexError                                Traceback (most recent call last)
Cell In[58], line 5
      3 scorer = MaskedLM('/data/user/home/nchendri/LongRun/')
      4 text =  dsMap['test']['text']
----> 5 ppl = scorer.get_perplexity(text, batch=32)
      6 print(ppl)
      7 print(list(zip(text, ppl)))

Cell In[57], line 162, in MaskedLM.get_perplexity(self, input_texts, batch)
    159     return _e
    161 if self.max_length is not None:
--> 162     data.append([encode_mask(i) for i in range(min(self.max_length - len(self.sp_token_prefix), len(x)))])
    163 else:
    164     data.append([encode_mask(i) for i in range(len(x))])

Cell In[57], line 162, in <listcomp>(.0)
    159     return _e
    161 if self.max_length is not None:
--> 162     data.append([encode_mask(i) for i in range(min(self.max_length - len(self.sp_token_prefix), len(x)))])
    163 else:
    164     data.append([encode_mask(i) for i in range(len(x))])

Cell In[57], line 157, in MaskedLM.get_perplexity.<locals>.encode_mask(mask_position)
    155 # add the correct token id as the label
    156 label = [PAD_TOKEN_LABEL_ID] * _e['input_ids'].shape[1]
--> 157 label[mask_position + len(self.sp_token_prefix)] = masked_token_id
    158 _e['labels'] = torch.tensor([label], dtype=torch.long)
    159 return _e

IndexError: list assignment index out of range
@asahi417 (Owner)

Would it be possible to share the sentence that causes the error? I can run it locally to see if I can reproduce it.

@NHendrickson9616 (Author)

It is not a single sentence; it is a list of lines from .txt files, so it is essentially a lot of sentences in a list, and I want to evaluate them all.

@NHendrickson9616 (Author)

And unfortunately I cannot share the dataset anywhere, because it contains PHI.

@asahi417 (Owner)

Ok. It probably has something to do with the sequence length. Which model are you using to compute perplexity? Also, could you tell me the maximum number of characters in a single line of your file?

@NHendrickson9616 (Author)

I am fairly sure it is 512. I think I need help with how to batch my dataset.

@asahi417 (Owner)

If you run it on a single instance, instead of passing a list, you should be able to find the line that causes the error.

text =  dsMap['test']['text']
for t in text:
    ppl = scorer.get_perplexity([t])

@NHendrickson9616 (Author)

I am running a version of BERT, but the line that caused issues has only 129 characters in it.

@asahi417 (Owner)

Does the sentence contain only roman characters?

@NHendrickson9616 (Author)

It has dashes, commas, a period, and a colon. I don't know if those count, but other than that, yes.

@NHendrickson9616 (Author)

I have pretrained BERT on this same dataset before with the same tokenizer and model.

@asahi417 (Owner)

Could you try computing perplexity on the same files but with roberta-base? It might be a BERT-specific issue.

@NHendrickson9616 (Author)

Same error, same spot

@NHendrickson9616 (Author)

But it did take a little longer to get there, although that may be because it was loading a new model.

@asahi417 (Owner)

So RoBERTa and BERT raise the same error on the same line, correct?

@NHendrickson9616 (Author)

Correct.

@asahi417 (Owner)

If you could mask the letters in the sentence with a random character, would it be possible to share it here? For instance, if the sentence is @lmppl I have, to say this √√ is &&private 1234---, then you could convert it into @aaaa a aaaa, aa aaa aaaa √√ aa &&aaaaaaa aaaa---. Here, I replaced all the letters with a. This way there is no way I can restore the original sentence, but there is a high chance that the same issue would show up.
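A replacement like this can be done with a one-line regex; the `mask_letters` helper below is illustrative, not part of lmppl, and it keeps digits and punctuation (as in the masked line eventually shared in this thread):

```python
import re

def mask_letters(text: str) -> str:
    """Replace every ASCII letter with 'a'; digits and punctuation are kept."""
    return re.sub(r"[A-Za-z]", "a", text)

print(mask_letters("@lmppl I have, to say this is &&private 1234---"))
# → @aaaaa a aaaa, aa aaa aaaa aa &&aaaaaaa 1234---
```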

@asahi417 (Owner)

I guess the letters themselves are not the root cause of the error, which is why they are not important for debugging.

@NHendrickson9616 (Author)

Here:
Aaaaaaaaa: Aaaaaaaaa aaaaaaa aaaa aaaaaaaa - AAA10-AA A19.94, Aaaaaaaaa aaaaa - AAA10-AA A19.10, Aaaaaaa aaaaaaaaaa - AAA10-

@NHendrickson9616 (Author)

I can confirm that, in my code, running just that sentence reproduces the same error. So if it doesn't for you, it may be that I have accidentally edited the code beyond replacing the transformer.

@asahi417 (Owner)

It's working fine with the latest lmppl (0.3.1), as below.

In [1]: from lmppl import LM
In [2]: lm = LM('roberta-base')
In [3]: text = "Aaaaaaaaa: Aaaaaaaaa aaaaaaa aaaa aaaaaaaa - AAA10-AA A19.94, Aaaaaaaaa aaaaa - AAA10-AA A19.10, Aaaaaaa aaaaaaaaaa - AAA10-"
In [4]: lm.get_perplexity(text)
Out[4]: 115701099.01106478

@NHendrickson9616 (Author) commented Jun 27, 2023

So I recopied run_mlm.py and then ran it as-is with roberta-base, and here is what I got:

RuntimeError                              Traceback (most recent call last)
Cell In[115], line 5
      3 scorer = MaskedLM('roberta-base')
      4 text = dsMap['test']['text']
----> 5 ppl = scorer.get_perplexity("Aaaaaaaaa:  Aaaaaaaaa aaaaaaa aaaa aaaaaaaa - AAA10-AA A19.94, Aaaaaaaaa aaaaa - AAA10-AA A19.10, Aaaaaaa aaaaaaaaaa - AAA10-")
      6 # for t in text:
      7 #     ppl = scorer.get_perplexity([t])
      8 #     print(t)
      9 #ppl = scorer.get_perplexity(text, batch=32)
     10 print(ppl)

Cell In[113], line 155, in MaskedLM.get_perplexity(self, input_texts, batch)
    153 for s, e in tqdm(batch_id):
    154     _encode = data[s:e]
--> 155     _encode = {k: torch.cat([o[k] for o in _encode], dim=0).to(self.device) for k in _encode[0].keys()}
    156     labels = _encode.pop('labels')
    157     output = self.model(**_encode, return_dict=True)

Cell In[113], line 155, in <dictcomp>(.0)
    153 for s, e in tqdm(batch_id):
    154     _encode = data[s:e]
--> 155     _encode = {k: torch.cat([o[k] for o in _encode], dim=0).to(self.device) for k in _encode[0].keys()}
    156     labels = _encode.pop('labels')
    157     output = self.model(**_encode, return_dict=True)

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 54 but got size 53 for tensor number 5 in the list.
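For reference, the RuntimeError above can be reproduced in isolation: torch.cat requires every dimension except the concatenation one to match, so per-sentence encodings with different token lengths cannot be stacked without padding. A minimal sketch, independent of lmppl:

```python
import torch

# Two "encodings" whose tokenized lengths differ by one token, mirroring
# the sizes in the traceback above (54 vs 53 in dimension 1).
a = torch.zeros(1, 54, dtype=torch.long)
b = torch.zeros(1, 53, dtype=torch.long)

try:
    torch.cat([a, b], dim=0)  # fails: dimension 1 must match across tensors
except RuntimeError as err:
    print("cat failed:", err)

# Right-padding the shorter encoding to the common length makes the stack
# work; this is what tokenizer padding would normally guarantee.
b_padded = torch.nn.functional.pad(b, (0, 1), value=0)
print(torch.cat([a, b_padded], dim=0).shape)  # torch.Size([2, 54])
```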

@NHendrickson9616 (Author) commented Jun 27, 2023

I just ran your exact code and it worked. Can I send you my version of the code (run_mlm.py, I mean)?

@NHendrickson9616 (Author)

Ok, so with a little bit of modification I can run my whole dataset on the code you sent. Now I just need to figure out how to modify it to use a different transformer without breaking it, I think.

@NHendrickson9616 (Author) commented Jun 27, 2023

Although I do wonder why you chose the version for GPT variants rather than the one for BERT variants in your example.

@NHendrickson9616 (Author)

If I import and use MaskedLM rather than LM, it breaks; I'm not sure why, though.

@asahi417 (Owner)

Could you try the following instead?

scorer.get_perplexity(["Aaaaaaaaa: Aaaaaaaaa aaaaaaa aaaa aaaaaaaa - AAA10-AA A19.94, Aaaaaaaaa aaaaa - AAA10-AA A19.10, Aaaaaaa aaaaaaaaaa - AAA10-"])

@asahi417 (Owner)

Ah wait, you're right. I should have used MaskedLM, not LM.

@asahi417 (Owner)

Yeah, it's working without any issue.

In [1]: from lmppl import MaskedLM

In [2]: text = "Aaaaaaaaa: Aaaaaaaaa aaaaaaa aaaa aaaaaaaa - AAA10-AA A19.94, Aaaaaaaaa aaaaa - AAA10-AA A19.10, Aaaaaaa aaaaaaaaaa - AAA10-"

In [3]: lm = MaskedLM('roberta-base')

In [4]: lm.get_perplexity(text)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.13s/it]
Out[4]: 2.676919185346931

@NHendrickson9616 (Author)

Ok, so that simple example works, but then when I tried to loop through it like so:

from lmppl import MaskedLM
lm = MaskedLM('/path/to/local/model/')
text = dsMap['test']['text']
num = 0
sum1 = 0
for t in text:
    sum1 = lm.get_perplexity(t) + sum1
    num = num + 1
print(sum1/num)

it returns another tensor error on the very same text that we were just testing. It gets past the first two loops but then breaks on the third, which is the sentence we have been working on:

RuntimeError                              Traceback (most recent call last)
Cell In[14], line 8
      6 sum1 = 0
      7 for t in text:
----> 8     sum1 = lm.get_perplexity(t) + sum1
      9     num = num + 1
     10 print(sum1/num)

File ~/.conda/envs/BioBERTUAB/lib/python3.10/site-packages/lmppl/ppl_mlm.py:154, in MaskedLM.get_perplexity(self, input_texts, batch)
    152 for s, e in tqdm(batch_id):
    153     _encode = data[s:e]
--> 154     _encode = {k: torch.cat([o[k] for o in _encode], dim=0).to(self.device) for k in _encode[0].keys()}
    155     labels = _encode.pop('labels')
    156     output = self.model(**_encode, return_dict=True)

File ~/.conda/envs/BioBERTUAB/lib/python3.10/site-packages/lmppl/ppl_mlm.py:154, in <dictcomp>(.0)
    152 for s, e in tqdm(batch_id):
    153     _encode = data[s:e]
--> 154     _encode = {k: torch.cat([o[k] for o in _encode], dim=0).to(self.device) for k in _encode[0].keys()}
    155     labels = _encode.pop('labels')
    156     output = self.model(**_encode, return_dict=True)

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 13 but got size 12 for tensor number 1 in the list.

@NHendrickson9616 (Author)

It does not return any errors on the exact same code with LM, but it also returns a perplexity number that is wildly incorrect, because it isn't the right type of evaluation for the model.
