
Analyzing Git Commit History and Contributor Statistics with Python#

Analyzing version control data provides valuable insights into project development, team dynamics, and individual contributions. Git, as the dominant version control system, stores a rich history of changes, authorship, and development timelines in its commit history. This data, when programmatically accessed and analyzed, can reveal patterns of activity, identify key contributors, track progress, and highlight areas of the codebase receiving the most attention.

Leveraging Python’s data processing capabilities and readily available libraries offers a powerful and flexible approach to extracting and analyzing Git repository data. This method allows for custom metrics, automated reporting, and integration with other data analysis and visualization tools.

Key terms essential to understanding Git commit history analysis include:

  • Commit: A snapshot of the repository’s state at a specific point in time, along with metadata like author, committer, date, and a message describing the changes.
  • Author: The person who originally wrote the code changes included in a commit.
  • Committer: The person who applied the commit to the repository (may be different from the author, e.g., when applying a patch).
  • Repository: The complete history of commits for a project, stored in a .git directory.
  • SHA-1 (Secure Hash Algorithm 1): A unique identifier assigned to each commit, derived from the commit’s content and metadata.

The Value of Analyzing Git History with Python#

Programmatic analysis of Git history offers several advantages over manual inspection or basic git log commands:

  • Scalability: Process large repositories with extensive histories efficiently.
  • Customization: Define and calculate specific metrics tailored to project needs.
  • Automation: Automate reporting and analysis pipelines for continuous monitoring.
  • Integration: Combine Git data with issue trackers, CI/CD systems, or other project management tools.
  • Depth of Analysis: Perform complex calculations, trend analysis, and data visualization that are difficult with command-line tools alone.

Analyzing commit history with Python can uncover insights into:

  • Project activity levels over time.
  • Distribution of contributions among team members.
  • Impact of contributions (e.g., lines of code added/deleted).
  • Bus factor (dependency on a few key contributors); a simple estimate is sketched after this list.
  • Frequency of certain types of changes (features, fixes, refactors) based on commit messages.
  • Identification of frequently modified or stable parts of the codebase.
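
As a taste of what such analysis looks like, below is a minimal sketch of a simple bus-factor estimate: the smallest number of contributors who together account for at least half of all commits. It assumes a pandas Series of per-author commit counts, similar to the one built later in this article; the names and numbers are hypothetical.

import pandas as pd

def bus_factor(commits_per_author: pd.Series, threshold: float = 0.5) -> int:
    # Smallest number of contributors responsible for at least `threshold` of all commits
    shares = commits_per_author.sort_values(ascending=False) / commits_per_author.sum()
    return int((shares.cumsum() < threshold).sum() + 1)

# Hypothetical commit counts per author
counts = pd.Series({'Alice': 150, 'Bob': 110, 'Charlie': 45, 'David': 12, 'Eve': 5})
print(bus_factor(counts))  # 2 -- half of all commits come from just two people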

Essential Concepts for Git History Analysis#

Effective analysis requires understanding the information contained within each commit object and what metrics can be derived:

  • Commit Metadata:
    • hexsha: The unique SHA-1 hash of the commit.
    • author: The git.Actor object representing the author (name and email).
    • committer: The git.Actor object representing the committer.
    • authored_date: Timestamp when the commit was authored.
    • committed_date: Timestamp when the commit was committed.
    • message: The commit message string.
    • parents: A list of parent commits (typically one, more for merges).
  • Commit Stats: Information about the changes introduced by the commit, typically including:
    • total: A dictionary containing insertions, deletions, and lines (total changes).
    • files: A dictionary mapping file paths to their individual insertions, deletions, and lines stats.

Common metrics derived from this data include (a short sketch computing two of them follows the list):

  • Commit Count: Total number of commits, or commits per author, per time period.
  • Lines Changed: Total lines added or deleted per author, per commit, or per file.
  • Commit Frequency: How often commits occur (daily, weekly, monthly) to identify development pace and trends.
  • Author Activity: Ranking contributors by commit count or lines changed.
  • File Activity: Identifying files or directories with the most frequent changes.
  • Commit Time Analysis: Examining the time of day or day of the week when commits are made.
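
To make this concrete, here is a minimal sketch that derives two of these metrics (commits per author and commit time of day) directly from commit metadata using GitPython, which is introduced in the next section; the repository path is a placeholder.

from collections import Counter

import git

repo = git.Repo('/path/to/your/repository')  # replace with a real repository path

commit_counts = Counter()
hour_counts = Counter()

for commit in repo.iter_commits('HEAD'):
    commit_counts[commit.author.name] += 1
    # committed_datetime is timezone-aware, so .hour reflects the committer's local time
    hour_counts[commit.committed_datetime.hour] += 1

print(commit_counts.most_common(5))  # top 5 contributors by commit count
print(sorted(hour_counts.items()))   # commit distribution across the day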

Tools and Libraries for Python Git Analysis#

The primary library for interacting with Git repositories in Python is GitPython. It provides an object-oriented interface to Git, exposing repository objects such as commits, trees, and blobs for programmatic access.

Installing GitPython#

GitPython can be installed using pip:

pip install GitPython

Basic GitPython Usage#

Interacting with a repository involves initializing the git.Repo object:

import git

# Path to the Git repository
repo_path = '/path/to/your/repository'

# Opening the repository raises an exception if the path is missing or not a Git repository
try:
    repo = git.Repo(repo_path)
    print(f"Successfully opened repository: {repo_path}")
except (git.exc.InvalidGitRepositoryError, git.exc.NoSuchPathError):
    print(f"Error: {repo_path} is not a valid Git repository or does not exist.")
    repo = None

Iterating through commits is a fundamental operation:

if repo:
    # Iterate through commits reachable from HEAD (the current branch)
    # You can specify a branch name or commit hash to start from instead
    print("\nRecent commits:")
    for commit in repo.iter_commits('HEAD', max_count=10):
        print(f"SHA: {commit.hexsha[:7]}")
        print(f"Author: {commit.author.name}")
        # authored_datetime is a timezone-aware datetime; authored_date is the raw Unix timestamp
        print(f"Date: {commit.authored_datetime}")
        print(f"Message: {commit.summary}")
        print("-" * 20)

Accessing commit statistics (lines added/deleted):

if repo:
    print("\nStats for a recent commit:")
    try:
        latest_commit = repo.head.commit
        print(f"Commit: {latest_commit.hexsha[:7]}")
        stats = latest_commit.stats.total
        print(f"Lines Added: {stats['insertions']}")
        print(f"Lines Deleted: {stats['deletions']}")
        print(f"Total Lines Changed: {stats['lines']}")
        # latest_commit.stats.files provides the same numbers broken down per file
    except Exception as e:
        print(f"Could not get stats for latest commit: {e}")

For data processing and aggregation, the pandas library is invaluable. Matplotlib and Seaborn are standard choices for visualization.

pip install pandas matplotlib seaborn

Step-by-Step: Analyzing Commit Data for Contributor Stats#

This section outlines a process for extracting and analyzing commit data to generate contributor statistics.

Step 1: Set Up Environment and Import Libraries#

Ensure Python is installed and the necessary third-party libraries (GitPython, pandas) are available; datetime and os are part of the standard library.

import git
import pandas as pd
from datetime import datetime
import os

Step 2: Specify Repository Path#

Define the local path to the Git repository that will be analyzed.

repo_path = '/path/to/your/repository' # Replace with the actual path

Step 3: Load the Repository#

Initialize the git.Repo object. Include error handling for invalid paths.

try:
    repo = git.Repo(repo_path)
    print(f"Loaded repository from {repo_path}")
except git.exc.NoSuchPathError:
    print(f"Error: {repo_path} does not exist.")
    repo = None
except git.exc.InvalidGitRepositoryError:
    print(f"Error: {repo_path} is not a valid Git repository.")
    repo = None
except Exception as e:
    print(f"An unexpected error occurred: {e}")
    repo = None

Step 4: Iterate Through Commits and Extract Data#

Loop through the commits and collect relevant information. Storing this in a list of dictionaries is a flexible approach before converting to a pandas DataFrame.

commit_list = []

if repo:
    print("Extracting commit data...")
    try:
        # Iterate through all commits in the history
        # Consider adding filtering arguments like max_count, since, or until if the history is very large
        for commit in repo.iter_commits('HEAD'):
            try:
                # Attempt to get stats; this can sometimes fail for merge commits or empty commits
                stats = commit.stats.total
                insertions = stats.get('insertions', 0)
                deletions = stats.get('deletions', 0)
                total_lines = stats.get('lines', 0)  # 'lines' is insertions + deletions
            except Exception:
                # Handle cases where stats might not be available or raise errors
                insertions = 0
                deletions = 0
                total_lines = 0
            commit_data = {
                'hexsha': commit.hexsha,
                'author_name': commit.author.name,
                'author_email': commit.author.email,
                'committed_date': datetime.fromtimestamp(commit.committed_date),
                'message': commit.message.strip(),
                'insertions': insertions,
                'deletions': deletions,
                'total_lines_changed': total_lines
            }
            commit_list.append(commit_data)
        print(f"Extracted {len(commit_list)} commits.")
    except git.exc.GitCommandError as e:
        print(f"Git command failed during iteration: {e}")
    except Exception as e:
        print(f"An error occurred during commit extraction: {e}")

Step 5: Create a Pandas DataFrame#

Convert the collected data into a pandas DataFrame for easier manipulation and analysis.

if commit_list:
    df = pd.DataFrame(commit_list)
    # Ensure committed_date is in datetime format for time-based analysis
    df['committed_date'] = pd.to_datetime(df['committed_date'])
    print("\nDataFrame created:")
    print(df.head())
    print(f"\nDataFrame shape: {df.shape}")
else:
    print("\nNo commit data to process.")
    df = pd.DataFrame()  # Create an empty DataFrame if no data

Step 6: Analyze and Aggregate Data#

Use pandas capabilities to calculate contributor statistics.

Example: Commits per Author#

if not df.empty:
    commits_per_author = df['author_name'].value_counts().reset_index()
    commits_per_author.columns = ['Author', 'Total Commits']
    print("\nTotal Commits per Author:")
    print(commits_per_author.head(10))  # Display top 10

Example: Lines Changed per Author#

if not df.empty:
    lines_per_author = df.groupby('author_name')[['insertions', 'deletions', 'total_lines_changed']].sum().reset_index()
    lines_per_author.columns = ['Author', 'Total Insertions', 'Total Deletions', 'Total Lines Changed']
    # The 'total_lines_changed' column from stats usually IS insertions + deletions,
    # but recalculating it explicitly ensures consistency.
    lines_per_author['Calculated Total Lines Changed'] = lines_per_author['Total Insertions'] + lines_per_author['Total Deletions']
    print("\nLines Changed per Author:")
    # Sort by calculated total lines changed for ranking
    print(lines_per_author.sort_values(by='Calculated Total Lines Changed', ascending=False).head(10))  # Display top 10

Example: Commit Frequency Over Time#

if not df.empty:
    # Group by day and count commits
    commits_over_time = df.set_index('committed_date').resample('D').size().reset_index(name='Commit Count')
    print("\nCommit Frequency (Daily):")
    print(commits_over_time.head())

    # Group by week
    commits_over_week = df.set_index('committed_date').resample('W').size().reset_index(name='Commit Count')
    print("\nCommit Frequency (Weekly):")
    print(commits_over_week.head())

Using matplotlib or seaborn to visualize the aggregated data makes insights more accessible.

import matplotlib.pyplot as plt
import seaborn as sns

# Configure plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_theme()

if not df.empty:
    # Visualization 1: Top N contributors by commit count
    top_n = 10
    if len(commits_per_author) > 0:
        plt.figure(figsize=(12, 6))
        sns.barplot(x='Total Commits', y='Author', data=commits_per_author.nlargest(top_n, 'Total Commits'))
        plt.title(f'Top {top_n} Contributors by Commit Count')
        plt.xlabel('Number of Commits')
        plt.ylabel('Author')
        plt.tight_layout()
        plt.show()
    else:
        print("\nNot enough data to plot Commits per Author.")

    # Visualization 2: Commit frequency over time (weekly)
    if len(commits_over_week) > 1:  # Need at least two points for a meaningful line plot
        plt.figure(figsize=(15, 6))
        sns.lineplot(x='committed_date', y='Commit Count', data=commits_over_week)
        plt.title('Weekly Commit Frequency')
        plt.xlabel('Week')
        plt.ylabel('Number of Commits')
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.show()
    else:
        print("\nNot enough data to plot Weekly Commit Frequency.")

    # Visualization 3: Top N contributors by lines changed
    if len(lines_per_author) > 0:
        plt.figure(figsize=(12, 6))
        sns.barplot(x='Calculated Total Lines Changed', y='Author', data=lines_per_author.nlargest(top_n, 'Calculated Total Lines Changed'))
        plt.title(f'Top {top_n} Contributors by Lines Changed (Insertions + Deletions)')
        plt.xlabel('Total Lines Changed')
        plt.ylabel('Author')
        plt.tight_layout()
        plt.show()
    else:
        print("\nNot enough data to plot Lines Changed per Author.")
else:
    print("\nNo data available for visualization.")

Concrete Example: Analyzing a Hypothetical Project#

Consider a small open-source library’s Git repository located at /home/user/my-python-library. Applying the steps above to this repository could yield the following hypothetical results:

Total Commits per Author (Example Output)

| Author              | Total Commits |
|---------------------|---------------|
| Alice Developer     | 150           |
| Bob Coder           | 110           |
| Charlie Contributor | 45            |
| David Newcomer      | 12            |
| Eve Fixer           | 5             |

Interpretation: This table quickly identifies the most active contributors by commit volume. Alice and Bob appear to be the core maintainers based on this metric.

Lines Changed per Author (Example Output - Top 5 by Total Lines)

| Author              | Total Insertions | Total Deletions | Total Lines Changed | Calculated Total Lines Changed |
|---------------------|------------------|-----------------|---------------------|--------------------------------|
| Bob Coder           | 15000            | 7000            | 22000               | 22000                          |
| Alice Developer     | 12000            | 9000            | 21000               | 21000                          |
| Charlie Contributor | 3000             | 1000            | 4000                | 4000                           |
| David Newcomer      | 800              | 50              | 850                 | 850                            |
| Eve Fixer           | 100              | 300             | 400                 | 400                            |

Interpretation: While Alice has slightly more commits, Bob has contributed a higher volume of code changes (insertions + deletions). This shows the value of looking beyond just commit counts.

Weekly Commit Frequency (Example Snippet)

| committed_date | Commit Count |
|----------------|--------------|
| 2023-01-01     | 3            |
| 2023-01-08     | 8            |
| 2023-01-15     | 15           |
| 2023-01-22     | 12           |
| 2023-01-29     | 5            |

Interpretation: A visualization of this data would show periods of high and low activity, potentially corresponding to feature sprints, release cycles, or holiday periods. Spikes might indicate intense development phases, while dips could signal periods of low activity or project stagnation.

Further analysis could involve filtering commits by date range, analyzing commit message content using text analysis techniques, or identifying contributions to specific directories or files.
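
As a rough sketch of what such follow-up analysis might look like, the snippet below filters the DataFrame built earlier to a date range and classifies commits by keywords in their messages; the date bounds and keyword rules are illustrative assumptions, not a standard.

if not df.empty:
    # Restrict the analysis to a date range (illustrative bounds)
    recent = df[(df['committed_date'] >= '2023-01-01') & (df['committed_date'] < '2024-01-01')].copy()

    # Very rough classification of commit types based on message keywords
    def classify(message: str) -> str:
        first_line = message.lower().splitlines()[0] if message else ''
        if 'fix' in first_line:
            return 'fix'
        if first_line.startswith(('feat', 'add')):
            return 'feature'
        if 'refactor' in first_line:
            return 'refactor'
        return 'other'

    recent['change_type'] = recent['message'].apply(classify)
    print(recent['change_type'].value_counts())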

Key Takeaways and Actionable Insights#

Analyzing Git commit history with Python provides powerful capabilities for understanding software development projects.

  • Identify Core Contributors: Easily rank authors by commit count or lines changed to recognize key players.
  • Monitor Project Health: Track commit frequency over time to gauge development pace and identify potential bottlenecks or periods of inactivity.
  • Assess Contribution Impact: Analyze lines added/deleted to understand the scope and nature of contributions, moving beyond simple commit counts.
  • Inform Resource Allocation: Data on who is working on what parts of the codebase (by analyzing commit file stats) can inform team structure and task assignment.
  • Recognize and Reward: Use quantitative data to acknowledge significant contributions from team members or community contributors.
  • Refine Processes: Analyze commit patterns (e.g., size of commits, frequency of merge commits, as sketched below) to identify areas for process improvement.
  • Onboard New Members: Analyze historical data to understand project evolution and highlight active areas for new team members.

By automating the extraction and analysis of Git data using Python and libraries like GitPython and pandas, organizations can gain objective, data-driven insights into their development processes and team performance. This enables more informed decision-making and a better understanding of the project’s historical trajectory.
