I often use Weights & Biases for training; others might use something like TensorBoard or MLflow. These are great for tracking experiments and sweeps or monitoring logs, but sometimes you want a simple alert system that'll wake you up when a run starts or fails, with a little more control so you can quickly check on experiment status.
While SLURM allows you to set up email alerts, these end up clogging my inbox, and I can't log at a more granular level. I've found that a simple Telegram alert/logger works great for me. I can get a high-level overview of my experiments on my phone when I'm on the go and even set up alerts that'll wake me up (this has helped me when working on a rebuttal or during time crunches, where a failed run can eat up precious time).
So here's the setup.
Setup Telegram
- Install Telegram.
- Follow this guide to set up a new Telegram bot and copy the token somewhere.
- In your SLURM batch script add the line:
Shell
export TELEGRAM_BOT_TOKEN=<your_token_here>
- Then, create a new chat and invite the bot to that chat. The chat ID can be harder to find on a phone, so I open Telegram on the web and copy it from the URL (excluding the hash symbol but including the leading minus sign).

- Copy the chat ID and then add this line to your SLURM batch script:
Shell
export TELEGRAM_CHAT_ID=-12345...
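Before submitting a job, it can help to check that the two variables actually work. The sketch below is one way to do that using only the standard library: it builds a request for the Bot API's sendMessage method and posts a test message to the chat. The helper names build_send_message_request and send_test_message are made up for illustration and are not part of the logger further down.

```python
import json
import os
import urllib.parse
import urllib.request

API_ROOT = "https://api.telegram.org"


def build_send_message_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Bot API's sendMessage method."""
    url = f"{API_ROOT}/bot{token}/sendMessage"
    body = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def send_test_message(text: str = "Test message from the cluster") -> bool:
    """Send a test message using the two environment variables; return True on success."""
    token = os.environ.get("TELEGRAM_BOT_TOKEN")
    chat_id = os.environ.get("TELEGRAM_CHAT_ID")
    if not token or not chat_id:
        print("TELEGRAM_BOT_TOKEN or TELEGRAM_CHAT_ID is not set")
        return False
    request = build_send_message_request(token, chat_id, text)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return bool(json.load(response).get("ok"))
    except (OSError, ValueError) as error:  # network/HTTP errors or a non-JSON reply
        print(f"Telegram request failed: {error}")
        return False
```

Calling send_test_message() from a login node (with both variables exported) should make a message appear in the chat; if it returns False, the token or chat ID is the first thing to recheck.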
Logger Code
We'll implement the logging as a PyTorch Lightning callback. You are free to implement a callback/hook in PyTorch as well.
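For a plain PyTorch training loop, one possible shape is a context manager that wraps the loop and reports start, failure, and completion. This is only a sketch: telegram_run_alerts is a hypothetical name, and send stands for any function that delivers one HTML-formatted message (for example, a thin wrapper around sendMessage).

```python
import contextlib
import html
from typing import Callable, Iterator


@contextlib.contextmanager
def telegram_run_alerts(run_name: str, send: Callable[[str], object]) -> Iterator[None]:
    """Send start, failure, and completion alerts around a block of training code."""
    name = html.escape(run_name)  # HTML parse mode needs <, > and & escaped
    send(f"🚀 <b>Training Started</b>\n<b>Run:</b> {name}")
    try:
        yield
    except Exception as error:
        detail = html.escape(f"{type(error).__name__}: {error}")
        send(f"❌ <b>Training Failed</b>\n<b>Run:</b> {name}\n<code>{detail}</code>")
        raise  # re-raise so the job still exits with an error
    else:
        send(f"✅ <b>Training Completed</b>\n<b>Run:</b> {name}")


# Usage with a stand-in sender that just collects messages:
messages = []
with telegram_run_alerts("demo-run", messages.append):
    for epoch in range(3):
        pass  # the training and validation steps would go here
```

Note that this only catches Python exceptions; a job that the scheduler kills outright will not get a chance to send anything. The rest of this post sticks with the Lightning callback.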
Simply add it to your list of callbacks:
callbacks: list[pl.Callback] = [
    ModelCheckpoint(
        monitor="val_mse",
        mode="min",
        dirpath=checkpoint_dir,
        save_top_k=-1,
        filename="{epoch}-{val_mse:.4f}",
    ),
]

# Add Telegram notifications if configured
telegram_bot_token = os.environ.get("TELEGRAM_BOT_TOKEN")
telegram_chat_id = os.environ.get("TELEGRAM_CHAT_ID")
if telegram_bot_token and telegram_chat_id and rank == 0:
    telegram_callback = TelegramNotificationCallback(
        bot_token=telegram_bot_token,
        chat_id=telegram_chat_id,
        run_name=run_name,
    )
    callbacks.append(telegram_callback)
    print(f"Telegram notifications enabled for run: {run_name}")

# ...

trainer = pl.Trainer(
    # ...
    callbacks=callbacks,
    # ...
)

The logger class is not too difficult to understand either. It uses the Bot API's sendMessage and editMessageText methods along with some basic HTML styling. You could theoretically add buttons to allow for actions from your phone as well, but I haven't provided that here.
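Two details are worth a quick sketch. With parse_mode set to HTML, Telegram expects <, > and & in ordinary text to be escaped, so a run name such as lr<1e-3 can make a request fail. Buttons are passed through the reply_markup field as a JSON-encoded inline keyboard. The helper below, build_message_payload, is a hypothetical name and not part of the callback in this post.

```python
import html
import json
from typing import Optional


def build_message_payload(chat_id: str, run_name: str, body_html: str,
                          button_text: Optional[str] = None,
                          button_url: Optional[str] = None) -> dict:
    """Build form data for sendMessage with an escaped run name and an optional URL button."""
    text = f"<b>Run:</b> {html.escape(run_name)}\n{body_html}"
    payload = {"chat_id": chat_id, "text": text, "parse_mode": "HTML"}
    if button_text and button_url:
        # An inline keyboard is a list of rows; each row is a list of buttons.
        keyboard = {"inline_keyboard": [[{"text": button_text, "url": button_url}]]}
        # With form-encoded requests, reply_markup has to be a JSON string.
        payload["reply_markup"] = json.dumps(keyboard)
    return payload
```

The resulting dict can be posted the same way the callback does it, for example requests.post(url, data=payload, timeout=10). A URL button only opens a link; buttons that trigger actions use callback_data instead and need the bot to listen for updates, which is beyond this sketch.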
The following code updates a message with a progress bar and the current model performance (PSNR score for MRI reconstruction). Here's what it looks like while training:

And here is the code for the TelegramNotificationCallback class.
import time

import requests
import torch
import pytorch_lightning as pl  # or `import lightning.pytorch as pl` on newer installs


class TelegramNotificationCallback(pl.Callback):
    """Custom callback for Telegram notifications during training."""

    def __init__(self, bot_token: str, chat_id: str, run_name: str):
        self.bot_token = bot_token
        self.chat_id = chat_id
        self.run_name = run_name
        self.start_time = None
        self.last_epoch_update = 0
        self.progress_message_id = None
        self.start_message_id = None

    def _send_telegram_message(self, message: str):
        """Send a message via Telegram bot and return its message ID (None on failure)."""
        try:
            url = f"https://api.telegram.org/bot{self.bot_token}/sendMessage"
            data = {
                "chat_id": self.chat_id,
                "text": message,
                "parse_mode": "HTML",
            }
            response = requests.post(url, data=data, timeout=10)
            response.raise_for_status()
            return response.json()["result"]["message_id"]
        except Exception as e:
            # Never let a failed notification crash the training run
            print(f"Failed to send Telegram message: {e}")
            return None

    def _edit_telegram_message(self, message_id: int, message: str):
        """Edit an existing Telegram message."""
        try:
            url = f"https://api.telegram.org/bot{self.bot_token}/editMessageText"
            data = {
                "chat_id": self.chat_id,
                "message_id": message_id,
                "text": message,
                "parse_mode": "HTML",
            }
            response = requests.post(url, data=data, timeout=10)
            response.raise_for_status()
        except Exception as e:
            print(f"Failed to edit Telegram message: {e}")

    def _create_progress_bar(self, current: int, total: int, width: int = 20) -> str:
        """Create a visual progress bar."""
        if total <= 0:
            # No usable epoch count (e.g. max_epochs=-1): show an empty bar
            return "░" * width
        # Clamp so the bar never overflows or goes negative
        filled = max(0, min(width, int(width * current / total)))
        return "█" * filled + "░" * (width - filled)

    def _create_progress_message(self, trainer, current_epoch: int) -> str:
        """Create a progress message with visual elements."""
        if self.start_time is None:
            return ""
        max_epochs = trainer.max_epochs or 1
        progress = (current_epoch + 1) / max_epochs * 100
        elapsed_time = time.time() - self.start_time

        # Get current metrics with safe formatting
        train_loss = trainer.callback_metrics.get("train_loss", "N/A")
        val_loss = trainer.callback_metrics.get("val_loss", "N/A")
        val_mse = trainer.callback_metrics.get("val_mse", "N/A")
        val_psnr = trainer.callback_metrics.get("val_psnr", "N/A")

        # Helper function to safely format metrics
        def format_metric(metric):
            if isinstance(metric, torch.Tensor):
                return f"{metric.item():.4f}"
            elif isinstance(metric, (int, float)):
                return f"{metric:.4f}"
            else:
                return str(metric)

        # Create progress bar
        progress_bar = self._create_progress_bar(current_epoch + 1, max_epochs)

        # Pick a spinner frame based on the current time
        animation_chars = ["⠋", "⠙", "⠹", "⠸", "⠼", "⠴", "⠦", "⠧", "⠇", "⠏"]
        anim_char = animation_chars[int(time.time() * 2) % len(animation_chars)]

        message = f"{anim_char} <b>Training Progress</b>\n\n"
        message += f"<b>Run:</b> {self.run_name}\n"
        message += f"<b>Epoch:</b> {current_epoch + 1}/{max_epochs} ({progress:.1f}%)\n"
        message += f"<b>Elapsed:</b> {elapsed_time/3600:.1f}h\n\n"
        message += f"<code>{progress_bar}</code>\n\n"
        message += f"<b>Train Loss:</b> {format_metric(train_loss)}\n"
        message += f"<b>Val Loss:</b> {format_metric(val_loss)}\n"
        message += f"<b>Val MSE:</b> {format_metric(val_mse)}\n"
        message += f"<b>Val PSNR:</b> {format_metric(val_psnr)}"
        return message

    def on_fit_start(self, trainer, pl_module):
        """Called when training starts."""
        if trainer.is_global_zero:  # Only send from rank 0
            self.start_time = time.time()
            message = "🚀 <b>Training Started</b>\n\n"
            message += f"<b>Run:</b> {self.run_name}\n"
            message += f"<b>Model:</b> {pl_module.__class__.__name__}\n"
            message += f"<b>Max Epochs:</b> {trainer.max_epochs}\n"
            message += f"<b>Devices:</b> {trainer.num_devices}\n"
            message += f"<b>Nodes:</b> {trainer.num_nodes}"
            self.start_message_id = self._send_telegram_message(message)

            # Send initial progress message
            progress_message = self._create_progress_message(trainer, 0)
            self.progress_message_id = self._send_telegram_message(progress_message)

    def on_train_epoch_end(self, trainer, pl_module):
        """Called at the end of each training epoch."""
        if trainer.is_global_zero and self.progress_message_id is not None:
            current_epoch = trainer.current_epoch
            # Update the progress message in place once per epoch
            progress_message = self._create_progress_message(trainer, current_epoch)
            self._edit_telegram_message(self.progress_message_id, progress_message)
            # Update last_epoch_update for potential future use
            self.last_epoch_update = current_epoch

    def on_fit_end(self, trainer, pl_module):
        """Called when training ends."""
        if trainer.is_global_zero and self.start_time is not None:
            total_time = time.time() - self.start_time
            final_val_mse = trainer.callback_metrics.get("val_mse", "N/A")
            final_val_psnr = trainer.callback_metrics.get("val_psnr", "N/A")

            # Helper function to safely format metrics (same as in _create_progress_message)
            def format_metric(metric):
                if isinstance(metric, torch.Tensor):
                    return f"{metric.item():.4f}"
                elif isinstance(metric, (int, float)):
                    return f"{metric:.4f}"
                else:
                    return str(metric)

            # Update progress message to show completion
            if self.progress_message_id is not None:
                completion_message = "✅ <b>Training Completed</b>\n\n"
                completion_message += f"<b>Run:</b> {self.run_name}\n"
                completion_message += f"<b>Total Time:</b> {total_time/3600:.1f}h\n"
                completion_message += f"<b>Final Val MSE:</b> {format_metric(final_val_mse)}\n"
                completion_message += f"<b>Final Val PSNR:</b> {format_metric(final_val_psnr)}\n\n"
                completion_message += f"<code>{'█' * 20}</code> 100%"
                self._edit_telegram_message(self.progress_message_id, completion_message)

            # Send final completion message
            final_message = "🎉 <b>Training Successfully Completed!</b>\n\n"
            final_message += f"<b>Run:</b> {self.run_name}\n"
            final_message += f"<b>Total Time:</b> {total_time/3600:.1f}h\n"
            final_message += f"<b>Final Val MSE:</b> {format_metric(final_val_mse)}\n"
            final_message += f"<b>Final Val PSNR:</b> {format_metric(final_val_psnr)}"
            self._send_telegram_message(final_message)

You could probably also modify this code to send you a link to your wandb experiment.
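One way that might look, sketched under some assumptions: if you train with Lightning's WandbLogger, trainer.logger.experiment should be the underlying wandb run, which exposes a url attribute. The helper names get_run_url and append_run_link are hypothetical, and the lookup is written defensively in case the logger is something else.

```python
import html
from typing import Any, Optional


def get_run_url(logger: Any) -> Optional[str]:
    """Return the experiment URL if the logger exposes one, else None.

    Assumes a WandbLogger-like object whose `experiment` attribute has a `url`.
    """
    experiment = getattr(logger, "experiment", None)
    url = getattr(experiment, "url", None)
    return url if isinstance(url, str) and url else None


def append_run_link(message: str, url: Optional[str], label: str = "Open in wandb") -> str:
    """Append an HTML link to a message; return the message unchanged if there is no URL."""
    if not url:
        return message
    # Escape the URL for use inside an attribute, and the label for use as text.
    link = f'<a href="{html.escape(url, quote=True)}">{html.escape(label)}</a>'
    return f"{message}\n\n{link}"
```

In on_fit_start, the start message could then be passed through append_run_link(message, get_run_url(trainer.logger)) before it is sent.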