
Issues in TNTM model debugging #104

Open · williamlhy opened this issue Dec 19, 2024 · 1 comment

@williamlhy (Collaborator)
When I tried to use the TNTM model, I got the following error.
Code:

from stream_topic.models import TNTM
from stream_topic.utils import TMDataset
dataset = TMDataset()
dataset.fetch_dataset("BBC_News")
dataset.preprocess(model_type="TNTM")
model = TNTM()
model.fit(dataset)

Error:

[/usr/local/lib/python3.10/dist-packages/stream_topic/models/abstract_helper_models/base.py](https://localhost:8080/#) in prepare_embeddings(self, dataset, logger)
    226                 f"--- Creating {self.embedding_model_name} document embeddings ---"
    227             )
--> 228             embeddings = self.encode_documents(
    229                 dataset.texts, encoder_model=self.embedding_model_name, use_average=True
    230             )
AttributeError: 'TNTM' object has no attribute 'encode_documents'
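
As far as I can tell, this error comes from prepare_embeddings() in abstract_helper_models/base.py calling self.encode_documents(), which the TNTM class neither defines nor inherits. My first change was therefore roughly the following (a minimal sketch, not the exact diff; the import paths are assumptions based on where the other models appear to get these classes from):

# Sketch only: both import paths are assumptions; the point is that TNTM now
# also inherits from SentenceEncodingMixin, so encode_documents() exists when
# prepare_embeddings() calls it.
from .abstract_helper_models.base import BaseModel                # assumed path
from .abstract_helper_models.mixins import SentenceEncodingMixin  # assumed path

class TNTM(BaseModel, SentenceEncodingMixin):
    # ... rest of the class body unchanged ...
    ...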

I then added the SentenceEncodingMixin class to the TNTM model class (as sketched above) and fixed a few issues in the construction of the umap_model. Re-running the training code produced the following error:

2024-12-19 15:48:07.837 | INFO     | stream_topic.models.abstract_helper_models.base:prepare_embeddings:225 - --- Creating /hongyi/stream/sentence-transformers/all-MiniLM-L6-v2 document embeddings ---
100%|██████████| 2225/2225 [00:54<00:00, 40.89it/s]
2024-12-19 15:49:02.694 | INFO     | stream_topic.models.tntm:_initialize_datamodule:371 - --- Initializing Datamodule for TNTM ---
2024-12-19 15:49:02.964 | INFO     | stream_topic.models.tntm:_prepare_word_embeddings:335 - --- Creating /hongyi/stream/sentence-transformers/paraphrase-MiniLM-L3-v2 word embeddings ---
Batches: 100% 253/253 [00:01<00:00, 129.29it/s]
/hongyi/STREAM/stream_topic/models/neural_base_models/tntm_base.py:61: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.word_embeddings_projected = torch.tensor(word_embeddings_projected)
2024-12-19 15:49:38.776 | INFO     | stream_topic.models.tntm:_initialize_trainer:279 - --- Initializing Trainer for TNTM ---
Trainer will use only 1 of 2 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=2)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/hongyi/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
2024-12-19 15:49:38.798 | INFO     | stream_topic.models.tntm:fit:489 - --- Training TNTM topic model ---
You are using a CUDA device ('NVIDIA A800 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
/hongyi/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:652: Checkpoint directory /hongyi/STREAM/checkpoints exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name                    | Type             | Params | Mode 
---------------------------------------------------------------------
0 | model                   | TNTMBase         | 5.2 M  | train
1 | model.inference_network | InferenceNetwork | 5.2 M  | train
2 | model.mean_bn           | BatchNorm1d      | 10     | train
3 | model.logvar_bn         | BatchNorm1d      | 10     | train
4 | model.beta_batchnorm    | BatchNorm1d      | 16.1 K | train
5 | model.theta_drop        | Dropout          | 0      | train
---------------------------------------------------------------------
5.2 M     Trainable params
8.1 K     Non-trainable params
5.2 M     Total params
20.916    Total estimated model params size (MB)
Sanity Checking DataLoader 0: 0% 0/2 [00:00<?, ?it/s]
/hongyi/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=255` in the `DataLoader` to improve performance.
2024-12-19 15:49:38.955 | ERROR    | stream_topic.models.tntm:fit:496 - Error in training: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[2], line 3
      1 from stream_topic.models import KmeansTM,CEDC, ETM,DCTE,LDA,ProdLDA,NSTM,CTM,CTMNeg,CBC,BERTopicTM,TNTM
      2 model = TNTM(word_embedding_model_name="/hongyi/stream/sentence-transformers/paraphrase-MiniLM-L3-v2",embedding_model_name="/hongyi/stream/sentence-transformers/all-MiniLM-L6-v2")#
----> 3 model.fit(dataset,n_topics=5)#
      5 topics = model.get_topics()
      6 print(topics)

File ~/STREAM/stream_topic/models/tntm.py:493, in TNTM.fit(self, dataset, n_topics, val_size, lr, lr_patience, patience, factor, weight_decay, max_epochs, batch_size, shuffle, random_state, inferece_type, checkpoint_path, monitor, mode, trial, optimize, **kwargs)
    490     self._status = TrainingStatus.RUNNING
    491     # self.model.to("cuda:0")
    492     # print(self.model.device)
--> 493     self.trainer.fit(self.model, self.data_module)
    495 except Exception as e:
    496     logger.error(f"Error in training: {e}")

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:543, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    541 self.state.status = TrainerStatus.RUNNING
    542 self.training = True
--> 543 call._call_and_handle_interrupt(
    544     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    545 )

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py:44, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     42     if trainer.strategy.launcher is not None:
     43         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 44     return trainer_fn(*args, **kwargs)
     46 except _TunerExitException:
     47     _call_teardown_hook(trainer)

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:579, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    572 assert self.state.fn is not None
    573 ckpt_path = self._checkpoint_connector._select_ckpt_path(
    574     self.state.fn,
    575     ckpt_path,
    576     model_provided=True,
    577     model_connected=self.lightning_module is not None,
    578 )
--> 579 self._run(model, ckpt_path=ckpt_path)
    581 assert self.state.stopped
    582 self.training = False

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:986, in Trainer._run(self, model, ckpt_path)
    981 self._signal_connector.register_signal_handlers()
    983 # ----------------------------
    984 # RUN THE TRAINER
    985 # ----------------------------
--> 986 results = self._run_stage()
    988 # ----------------------------
    989 # POST-Training CLEAN UP
    990 # ----------------------------
    991 log.debug(f"{self.__class__.__name__}: trainer tearing down")

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:1028, in Trainer._run_stage(self)
   1026 if self.training:
   1027     with isolate_rng():
-> 1028         self._run_sanity_check()
   1029     with torch.autograd.set_detect_anomaly(self._detect_anomaly):
   1030         self.fit_loop.run()

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:1057, in Trainer._run_sanity_check(self)
   1054 call._call_callback_hooks(self, "on_sanity_check_start")
   1056 # run eval step
-> 1057 val_loop.run()
   1059 call._call_callback_hooks(self, "on_sanity_check_end")
   1061 # reset logger connector

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py:182, in _no_grad_context.<locals>._decorator(self, *args, **kwargs)
    180     context_manager = torch.no_grad
    181 with context_manager():
--> 182     return loop_run(self, *args, **kwargs)

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py:135, in _EvaluationLoop.run(self)
    133     self.batch_progress.is_last_batch = data_fetcher.done
    134     # run step hooks
--> 135     self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
    136 except StopIteration:
    137     # this needs to wrap the `*_step` call too (not just `next`) for `dataloader_iter` support
    138     break

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py:396, in _EvaluationLoop._evaluation_step(self, batch, batch_idx, dataloader_idx, dataloader_iter)
    390 hook_name = "test_step" if trainer.testing else "validation_step"
    391 step_args = (
    392     self._build_step_args_from_hook_kwargs(hook_kwargs, hook_name)
    393     if not using_dataloader_iter
    394     else (dataloader_iter,)
    395 )
--> 396 output = call._call_strategy_hook(trainer, hook_name, *step_args)
    398 self.batch_progress.increment_processed()
    400 if using_dataloader_iter:
    401     # update the hook kwargs now that the step method might have consumed the iterator

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py:311, in _call_strategy_hook(trainer, hook_name, *args, **kwargs)
    308     return None
    310 with trainer.profiler.profile(f"[Strategy]{trainer.strategy.__class__.__name__}.{hook_name}"):
--> 311     output = fn(*args, **kwargs)
    313 # restore current_fx when nested context
    314 pl_module._current_fx_name = prev_fx_name

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py:411, in Strategy.validation_step(self, *args, **kwargs)
    409 if self.model != self.lightning_module:
    410     return self._forward_redirection(self.model, self.lightning_module, "validation_step", *args, **kwargs)
--> 411 return self.lightning_module.validation_step(*args, **kwargs)

File ~/STREAM/stream_topic/models/abstract_helper_models/neural_basemodel.py:46, in NeuralBaseModel.validation_step(self, batch, batch_idx)
     45 def validation_step(self, batch, batch_idx):
---> 46     val_loss = self.model.compute_loss(batch)
     48     self.log(
     49         "val_loss",
     50         val_loss,
   (...)
     54         logger=True,
     55     )
     57     return val_loss

File ~/STREAM/stream_topic/models/neural_base_models/tntm_base.py:215, in TNTMBase.compute_loss(self, x)
    201 """
    202 Computes the loss for the model.
    203 
   (...)
    212     The computed loss.
    213 """
    214 x_bow = x['bow']
--> 215 log_recon, posterior_mean, posterior_logvar = self.forward(x)
    216 loss = self.loss_function(x_bow, log_recon, posterior_mean, posterior_logvar)
    217 return loss

File ~/STREAM/stream_topic/models/neural_base_models/tntm_base.py:143, in TNTMBase.forward(self, x)
    124 """
    125 Forward pass through the network.
    126 
   (...)
    139     The log variance of the variational posterior.
    140 """
    141 theta, posterior_mean, posterior_logvar = self.get_theta(x)
--> 143 log_beta = self.calc_log_beta()
    147 # prodLDA vs LDA
    148 # use numerical trick to compute log(beta @ theta )
    149 log_theta = torch.nn.LogSoftmax(dim=-1)(theta)        #calculate log theta = log_softmax(theta_hat)

File ~/STREAM/stream_topic/models/neural_base_models/tntm_base.py:112, in TNTMBase.calc_log_beta(self)
    109 log_probs = torch.zeros(self.n_topics, self.vocab_size)
    111 for i, dis in enumerate(normal_dis_lis):
--> 112     log_probs[i] = dis.log_prob(self.word_embeddings_projected)
    113 return log_probs

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/torch/distributions/lowrank_multivariate_normal.py:214, in LowRankMultivariateNormal.log_prob(self, value)
    212 if self._validate_args:
    213     self._validate_sample(value)
--> 214 diff = value - self.loc
    215 M = _batch_lowrank_mahalanobis(
    216     self._unbroadcasted_cov_factor,
    217     self._unbroadcasted_cov_diag,
    218     diff,
    219     self._capacitance_tril,
    220 )
    221 log_det = _batch_lowrank_logdet(
    222     self._unbroadcasted_cov_factor,
    223     self._unbroadcasted_cov_diag,
    224     self._capacitance_tril,
    225 )

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Finally, I tried moving both self.model and its parameters to "cuda:0", but it still reports the same error.
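
From the traceback, my guess is that moving self.model to the GPU is not enough because word_embeddings_projected is stored as a plain tensor attribute (so it does not move with the module) and log_probs in calc_log_beta is allocated on the CPU by default. A sketch of the kind of fix I have in mind (the names are taken from the traceback above; everything else is an assumption about the surrounding code):

# Sketch only, based on the lines visible in the traceback above.

# In TNTMBase.__init__ (tntm_base.py, around line 61): register the projected
# word embeddings as a buffer so that .to(device) / Lightning moves them
# together with the parameters. Using clone().detach() also avoids the
# torch.tensor(sourceTensor) UserWarning.
self.register_buffer(
    "word_embeddings_projected", word_embeddings_projected.clone().detach()
)

# In TNTMBase.calc_log_beta (around line 109): allocate log_probs on the same
# device as the embeddings instead of defaulting to the CPU.
log_probs = torch.zeros(
    self.n_topics, self.vocab_size, device=self.word_embeddings_projected.device
)
for i, dis in enumerate(normal_dis_lis):
    log_probs[i] = dis.log_prob(self.word_embeddings_projected)
return log_probs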

@williamlhy (Collaborator, Author)

@AnFreTh Could you take a look at this issue?
