Python Server Health Monitoring: Real-Time Alerts with Telegram Bots
Ensuring the continuous and optimal performance of servers is a fundamental requirement for reliable IT infrastructure. Server health monitoring involves tracking various vital metrics to detect potential issues before they lead to critical failures or performance degradation. While numerous commercial and open-source monitoring systems exist, a lightweight, customizable solution can be rapidly deployed for specific needs using scripting languages like Python and readily available communication platforms such as Telegram.
This article details a practical approach for monitoring core server health indicators—such as CPU usage, memory consumption, and disk space—using Python scripts and dispatching real-time alerts via a Telegram bot. This method offers flexibility, cost-effectiveness for simple setups, and direct control over the monitoring logic and alerting mechanisms.
Essential Concepts in Server Health Monitoring
Server health monitoring focuses on tracking key performance indicators (KPIs) that reflect the operational status and resource utilization of a server. Anomalies in these metrics often indicate underlying problems.
- CPU Usage: Measures the percentage of time the CPU is busy executing processes. High CPU usage can signify heavy load, inefficient applications, or runaway processes, potentially leading to slow response times.
- Memory Usage (RAM): Tracks the amount of physical memory currently in use. Excessive memory consumption or swapping (moving data between RAM and disk) can severely impact performance and stability.
- Disk Space: Monitors the amount of available storage space on disk partitions. Running out of disk space can halt applications, prevent logging, and cause system instability.
- Network Activity: While not the primary focus of this Python script using psutil, network monitoring tracks data throughput and errors, crucial for services relying on network communication. (More advanced monitoring would involve specific network tools or libraries.)
- Running Processes: Keeping track of essential system processes or application instances ensures that critical services are operational. (This script focuses on resource metrics but can be extended to check specific processes.)
Python’s rich ecosystem, including libraries like psutil, makes it an excellent choice for accessing these system-level metrics in a cross-platform manner. psutil (process and system utilities) provides a portable interface for retrieving information on running processes and system utilization (CPU, memory, disks, network, sensors), covering much of the functionality of well-known command-line tools such as ps, top, free, and netstat.
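As a rough illustration of the two extensions mentioned above (process checks and network counters), the following sketch uses psutil.process_iter() and psutil.net_io_counters(). The function names and the example process names are hypothetical and not part of the main script; psutil is imported inside the functions only so that the pure helper can be used without it.

```python
from typing import Dict, Iterable, Set


def missing_processes(running_names: Iterable[str], required: Iterable[str]) -> Set[str]:
    """Return the required process names that are not in the running set."""
    return set(required) - set(running_names)


def running_process_names() -> Set[str]:
    """Collect the names of all currently running processes via psutil."""
    import psutil  # imported lazily; the helper above does not need it

    names = set()
    for proc in psutil.process_iter(["name"]):
        name = proc.info.get("name")
        if name:
            names.add(name)
    return names


def network_error_counts() -> Dict[str, int]:
    """Return cumulative network error and drop counters since boot."""
    import psutil

    counters = psutil.net_io_counters()
    fields = ("errin", "errout", "dropin", "dropout")
    # getattr with a default also covers the case where no counters are reported
    return {field: getattr(counters, field, 0) for field in fields}
```

A check for hypothetical services could then read missing_processes(running_process_names(), ["nginx", "postgres"]) and raise an alert when the result is non-empty.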
Telegram bots offer a convenient and accessible way to receive alerts. The Telegram Bot API allows applications to interact with Telegram users and groups programmatically, sending messages, notifications, and other content. This enables instant notifications directly to a mobile device or desktop via the Telegram application.
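To make the shape of a Bot API call concrete, here is a minimal sketch that only builds the URL and JSON payload for the sendMessage method, without sending anything. The function and constant names are illustrative assumptions. The truncation guards against Telegram's 4096-character limit on message text; counting with len() is an approximation of how Telegram measures length.

```python
from typing import Dict, Tuple

TELEGRAM_API_BASE = "https://api.telegram.org"
MAX_MESSAGE_LENGTH = 4096  # Telegram's limit for the text of one message


def build_send_message_call(token: str, chat_id: str, text: str) -> Tuple[str, Dict[str, str]]:
    """Return the (url, payload) pair for a sendMessage request."""
    if len(text) > MAX_MESSAGE_LENGTH:
        # Cut the text and mark the cut so the request is not rejected
        text = text[: MAX_MESSAGE_LENGTH - 3] + "..."
    url = f"{TELEGRAM_API_BASE}/bot{token}/sendMessage"
    payload = {"chat_id": chat_id, "text": text}
    return url, payload
```

The returned pair can be passed to an HTTP client as a POST request with a JSON body.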
Implementing Server Health Monitoring and Telegram Alerts with Python
Setting up this monitoring system involves creating a Telegram bot, writing a Python script to check server metrics and send messages, and scheduling the script to run periodically.
Step 1: Creating a Telegram Bot and Obtaining API Credentials
A Telegram bot is required to send messages to a user or group.
- Find BotFather: Open the Telegram application and search for the user @BotFather. This is the official bot used to create and manage other bots.
- Create a New Bot: Start a chat with @BotFather and use the command /newbot. Follow the instructions:
  - Choose a name for the bot (e.g., “ServerMonitorBot”).
  - Choose a unique username ending in “bot” (e.g., “MyServerMonitor_bot”).
- Obtain the API Token: Upon successful creation, @BotFather will provide an HTTP API token. This token is essential for sending messages via the bot. Keep this token secure, as it grants control over the bot. Example token format: 123456789:ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEF.
- Obtain the Chat ID: The bot needs to know where to send messages. This requires obtaining the chat ID of the conversation with the bot (either a direct message with the bot or a group where the bot has been added and given permission to send messages).
  - Start a conversation with your new bot or add it to a group.
  - Send a message to the bot (or in the group where the bot is present).
  - Access the following URL in a web browser, replacing <YourBOTToken> with your actual bot token: https://api.telegram.org/bot<YourBOTToken>/getUpdates
  - This will return a JSON object. Look for the chat object within the message entry. The id field within the chat object is the required chat ID. It will be a large number, possibly negative for group chats. Example: "id": 123456789.
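Reading the chat ID out of the getUpdates JSON by eye is error-prone when several chats are present. The sketch below walks a parsed getUpdates response and lists every chat it finds. The function name is hypothetical, and only a few common update types are inspected, so treat it as a starting point and not a complete parser.

```python
from typing import Any, Dict


def extract_chats(updates: Dict[str, Any]) -> Dict[int, str]:
    """Map each chat ID found in a getUpdates response to a readable label."""
    chats = {}
    for update in updates.get("result", []):
        # A chat can appear under several update types; check the common ones
        for key in ("message", "edited_message", "channel_post", "my_chat_member"):
            chat = update.get(key, {}).get("chat")
            if chat:
                label = chat.get("title") or chat.get("username") or chat.get("first_name") or ""
                chats[chat["id"]] = f'{chat.get("type", "unknown")}: {label}'
    return chats
```

Given the parsed JSON body of the getUpdates call, the returned dictionary shows which ID belongs to the direct chat and which to a group.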
Step 2: Installing Required Python Libraries
The Python script will utilize the psutil library to access server metrics and the requests library to interact with the Telegram Bot API.
Install these libraries using pip:
pip install psutil requests
Ensure pip is installed and updated if necessary.
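A quick way to confirm that both libraries are visible to the interpreter that will run the monitor is to ask the import system directly. This is a small optional sketch; the helper name is an assumption, and it only handles top-level module names.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling missing_modules(["psutil", "requests"]) with the same python3 that cron will use should return an empty list once installation has succeeded.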
Step 3: Writing the Python Monitoring Script
The core of the system is a Python script that performs the monitoring checks and sends alerts.
import os
import time

import psutil
import requests

# --- Configuration ---
# Set these environment variables to your actual Telegram bot token and chat ID
TELEGRAM_BOT_TOKEN = os.environ.get('TELEGRAM_BOT_TOKEN')
TELEGRAM_CHAT_ID = os.environ.get('TELEGRAM_CHAT_ID')

# Define thresholds (percentages)
CPU_THRESHOLD = 80
RAM_THRESHOLD = 90
DISK_THRESHOLD = 90  # Percentage of disk used

# Define time intervals (seconds) to avoid spamming alerts
ALERT_COOLDOWN_SECONDS = 3600  # 1 hour cooldown per alert type

# Store the last time an alert was sent for each type
# (kept in memory only, so it starts from zero on every run of the script)
last_alert_time = {
    'cpu': 0,
    'ram': 0,
    'disk': 0
}

# --- Functions ---

def send_telegram_message(message):
    """Sends a message to the configured Telegram chat."""
    if not TELEGRAM_BOT_TOKEN or not TELEGRAM_CHAT_ID:
        print("Telegram token or chat ID not configured.")
        return False

    api_url = f"https://api.telegram.org/bot{TELEGRAM_BOT_TOKEN}/sendMessage"
    try:
        # The timeout keeps a scheduled run from hanging if the API is unreachable
        response = requests.post(
            api_url,
            json={'chat_id': TELEGRAM_CHAT_ID, 'text': message},
            timeout=10,
        )
        response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
        print(f"Message sent successfully: {message}")
        return True
    except requests.exceptions.RequestException as e:
        print(f"Error sending message: {e}")
        return False

def check_threshold_and_alert(alert_type, current_value, threshold, unit="%", disk_partition="/"):
    """Checks if a value exceeds a threshold and sends an alert if cooldown allows."""
    current_time = time.time()

    if current_value > threshold:
        # Check cooldown
        if current_time - last_alert_time[alert_type] > ALERT_COOLDOWN_SECONDS:
            server_hostname = os.uname().nodename  # Get hostname for context (Unix-only call)
            # Plain text: no parse_mode is sent, so Markdown markup would be shown literally
            alert_message = f"🚨 ALERT: High {alert_type.upper()} Usage on {server_hostname}!"
            if alert_type == 'disk':
                alert_message += f" Partition: {disk_partition}."
            alert_message += f"\nCurrent Value: {current_value:.2f}{unit} (Threshold: {threshold}{unit})."
            alert_message += "\nPlease investigate immediately."

            if send_telegram_message(alert_message):
                last_alert_time[alert_type] = current_time  # Update cooldown timer
                return True  # Alert sent
        else:
            print(f"Threshold for {alert_type} exceeded ({current_value:.2f}{unit}) but still in cooldown.")
        return False  # Threshold exceeded but no alert sent (cooldown or send failure)
    else:
        print(f"{alert_type.upper()} usage is normal ({current_value:.2f}{unit}).")
        return False  # Threshold not exceeded

def monitor_server():
    """Monitors server health metrics and sends alerts if thresholds are exceeded."""
    print("Starting server health check...")

    # Check CPU Usage
    # interval=1 takes a 1-second average, instead of returning a potentially misleading instantaneous value
    cpu_percent = psutil.cpu_percent(interval=1)
    check_threshold_and_alert('cpu', cpu_percent, CPU_THRESHOLD)

    # Check RAM Usage
    mem_info = psutil.virtual_memory()
    ram_percent = mem_info.percent
    check_threshold_and_alert('ram', ram_percent, RAM_THRESHOLD)

    # Check Disk Usage (root partition '/')
    try:
        disk_info = psutil.disk_usage('/')
        disk_percent = disk_info.percent
        check_threshold_and_alert('disk', disk_percent, DISK_THRESHOLD, disk_partition="/")
    except Exception as e:
        # Handle cases where '/' might not be a valid partition or other disk errors
        print(f"Error checking disk usage for '/': {e}")
        # Potentially send an alert about monitoring failure itself?

    # You can add checks for other partitions if needed
    # Example: psutil.disk_usage('/var').percent

    print("Server health check finished.")

# --- Main execution ---
if __name__ == "__main__":
    # It's safer to load credentials from environment variables
    # export TELEGRAM_BOT_TOKEN='your_token_here'
    # export TELEGRAM_CHAT_ID='your_chat_id_here'
    if not TELEGRAM_BOT_TOKEN or not TELEGRAM_CHAT_ID:
        print("FATAL: TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID environment variables must be set.")
    else:
        monitor_server()

Code Explanation:
- Configuration: Defines the Telegram bot token and chat ID, ideally loaded from environment variables for security. Sets the thresholds for CPU, RAM, and Disk usage as percentages. Includes an ALERT_COOLDOWN_SECONDS value to prevent sending too many alerts for a persistent issue within a short period.
- send_telegram_message(message): A helper function that takes a string message, constructs the Telegram API URL, and sends a POST request using the requests library. Includes basic error handling.
- check_threshold_and_alert(...): This function checks if the current_value exceeds the defined threshold. If it does, it checks the last_alert_time for that specific alert_type. If the cooldown period has passed, it formats an alert message including the server’s hostname (using os.uname().nodename) and sends it via send_telegram_message, then updates the last_alert_time. Note that last_alert_time is held in memory, so it resets whenever the script is started afresh (for example, on each scheduled run); the cooldown only carries over between runs if this state is persisted.
- monitor_server(): The main monitoring logic. It uses psutil.cpu_percent(), psutil.virtual_memory().percent, and psutil.disk_usage('/').percent to get the current usage percentages. It then calls check_threshold_and_alert for each metric to determine if an alert is necessary.
- Main Execution Block (if __name__ == "__main__":): Ensures the monitor_server() function is called only when the script is executed directly. It includes a check to ensure environment variables are set.
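The cooldown in the script is tracked in the in-memory last_alert_time dictionary, which starts from zero each time the script is launched. When the script is run as a short-lived scheduled job, one way to keep the cooldown effective is to store the timestamps in a small JSON file between runs. The following is a minimal sketch under that assumption; the function names and the idea of a state file are not part of the original script.

```python
import json
import time
from pathlib import Path
from typing import Optional


def load_alert_state(path: Path) -> dict:
    """Read last-alert timestamps from disk; return an empty dict if unavailable."""
    try:
        return json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_alert_state(path: Path, state: dict) -> None:
    """Write last-alert timestamps to disk as JSON."""
    path.write_text(json.dumps(state))


def cooldown_elapsed(state: dict, alert_type: str, cooldown_seconds: float,
                     now: Optional[float] = None) -> bool:
    """Return True if enough time has passed since the last alert of this type."""
    if now is None:
        now = time.time()
    # Alert types never seen before default to 0, so their first alert is allowed
    return now - state.get(alert_type, 0) > cooldown_seconds
```

In the monitoring script, the state would be loaded at startup, consulted in place of last_alert_time, and saved again after an alert is sent successfully. The state file needs to live somewhere writable by the user running the job.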
Security Note: Storing sensitive information like API tokens directly in the script is not recommended, especially if the script might be shared or version-controlled. Using environment variables (as shown) or a configuration file with restricted permissions is a better practice.
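For the configuration-file option, a loader can refuse to read a file that other users could access. The sketch below assumes a JSON file with the hypothetical keys bot_token and chat_id and relies on POSIX permission bits, so it is meant for Linux-like systems only.

```python
import json
import os
import stat


def load_credentials(path: str) -> dict:
    """Load bot credentials from a JSON file accessible only by its owner."""
    mode = os.stat(path).st_mode
    # Refuse files that grant any permission to group or others
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(f"{path} must not be accessible by group or others (try chmod 600)")
    with open(path) as fh:
        config = json.load(fh)
    missing = {"bot_token", "chat_id"} - config.keys()
    if missing:
        raise KeyError(f"missing keys in {path}: {sorted(missing)}")
    return config
```

The file would be created once, restricted with chmod 600, and kept out of version control.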
Step 4: Scheduling the Script
For continuous monitoring, the Python script needs to be executed at regular intervals. On Linux systems, cron is a standard utility for scheduling tasks.
- Open crontab: Open the crontab editor for the current user:
crontab -e
- Add a cron job: Add a line to the crontab file to run the script periodically. For example, to run the script every 5 minutes:
*/5 * * * * /usr/bin/env python3 /path/to/your/script.py >> /var/log/server_monitor.log 2>&1
Explanation of the cron entry:
- */5 * * * *: The schedule. */5 in the minute field means every 5 minutes; the * in each remaining field (hour, day of the month, month, day of the week) matches every value.
- /usr/bin/env python3: Ensures the script is executed using the python3 interpreter, relying on the system’s environment to find the correct Python executable.
- /path/to/your/script.py: The absolute path to the Python script file.
- >> /var/log/server_monitor.log 2>&1: Appends standard output and standard error to a log file. This is helpful for debugging. Ensure the log file path is writable by the user running the cron job.
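Remember to make the necessary environment variables (TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID) available to the cron job. Cron runs jobs with a minimal environment and does not read the user’s .bashrc or .profile, so variables exported there are not visible to the script. Instead, define them on their own lines at the top of the crontab (supported by common cron implementations), add them directly in the crontab entry itself (though less clean), e.g., TELEGRAM_BOT_TOKEN='...' TELEGRAM_CHAT_ID='...' /usr/bin/env python3 ..., or have the job call a small wrapper script that sets them before starting Python. Keep in mind that values placed in the crontab are stored in plain text.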
On systems using systemd, systemd timers offer a more modern and robust alternative to cron.
Real-World Applications and Considerations
This Python-based monitoring system provides a flexible foundation applicable in various scenarios:
- Small-Scale Deployments: For monitoring a few critical servers in a small business or personal project where setting up a full-fledged monitoring suite is overkill.
- Specific Resource Checks: Monitoring only the most crucial metrics for a particular application (e.g., ensuring sufficient disk space for a database log partition).
- Custom Alerts: Tailoring alert messages with specific details or triggering different actions based on the type and severity of the issue.
- Ephemeral Environments: Quickly deploying basic monitoring in cloud instances or containers that might not persist long-term monitoring agents.
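As one possible take on custom alerts, a check can distinguish a warning level from a critical level and word the message accordingly. The thresholds, function names, and message wording in this sketch are illustrative assumptions and do not come from the main script.

```python
from typing import Optional


def classify_severity(value: float, warning: float, critical: float) -> Optional[str]:
    """Return 'critical', 'warning', or None for a usage percentage."""
    if warning > critical:
        raise ValueError("warning threshold must not exceed critical threshold")
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return None


def format_alert(alert_type: str, hostname: str, value: float, severity: str) -> str:
    """Build an alert line whose prefix reflects the severity."""
    prefix = {"warning": "WARNING", "critical": "CRITICAL"}[severity]
    return f"{prefix}: {alert_type.upper()} usage on {hostname} is {value:.2f}%"
```

A caller could then send warnings to one chat and critical alerts to another, or apply a shorter cooldown to critical alerts.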
Example: Monitoring a simple web server. A script like the one above can be scheduled to run every 5 minutes. If a sudden traffic surge or misconfiguration causes high CPU usage (>80%), the script detects this on its next run. Since the CPU threshold is exceeded and the cooldown has passed, an alert message like “🚨 ALERT: High CPU Usage on webserver-01! Current Value: 85.50% (Threshold: 80%). Please investigate immediately.” is sent to the Telegram recipient during that run, prompting immediate action to prevent downtime or performance issues. Similarly, if a log file fills up a disk partition, the disk usage check triggers an alert.
Limitations: While powerful for targeted checks, this simple approach has limitations compared to dedicated monitoring platforms:
- No Historical Data/Graphing: It primarily provides point-in-time checks and alerts, lacking capabilities for collecting, storing, and visualizing historical performance data for trend analysis.
- Lack of Centralization: Managing scripts on many servers becomes complex. Centralized solutions allow monitoring from a single dashboard.
- Agent Management: Scripts need to be deployed and updated on each server.
- Advanced Monitoring: Does not inherently support complex checks like application-specific metrics, log analysis, dependency mapping, or sophisticated anomaly detection.
- Alert Routing/Escalation: Lacks built-in features for routing alerts to different teams or escalating issues if unaddressed.
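The first limitation can be softened slightly without adopting a full platform: each run can append its readings to a CSV file, which can later be opened in a spreadsheet or plotted. This is only a sketch; the file layout, column names, and function name are assumptions, and it does not replace a proper time-series store.

```python
import csv
import time
from pathlib import Path
from typing import Optional

FIELDS = ["timestamp", "cpu_percent", "ram_percent", "disk_percent"]


def append_sample(path: Path, cpu: float, ram: float, disk: float,
                  timestamp: Optional[float] = None) -> None:
    """Append one row of metrics to a CSV file, writing the header on first use."""
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "timestamp": int(timestamp if timestamp is not None else time.time()),
            "cpu_percent": cpu,
            "ram_percent": ram,
            "disk_percent": disk,
        })
```

The file grows by one line per run, so on a 5-minute schedule it would need occasional rotation or trimming.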
Despite these limitations, the Python + Telegram approach offers a quick, understandable, and effective way to implement custom server health checks and ensure timely notifications for basic resource issues.
Key Takeaways
- Server health monitoring is critical for detecting issues before they cause significant impact.
- Python, with libraries like psutil, provides a flexible way to access server resource metrics programmatically.
- Telegram bots offer a free and convenient platform for receiving real-time server alerts.
- The implementation involves creating a Telegram bot, writing a Python script using psutil and requests, and scheduling the script (e.g., with cron).
- Thresholds for CPU, RAM, and Disk usage must be configured to trigger alerts based on acceptable limits.
- Implementing a cooldown period is essential to prevent alert spamming for persistent issues.
- Securing the Telegram bot token and chat ID, ideally via environment variables, is important.
- This method is suitable for simple, custom monitoring tasks on a small scale but lacks features of full monitoring systems like historical data analysis or centralized management.