Analyzing Git Commit History and Contributor Statistics with Python
Analyzing version control data provides valuable insights into project development, team dynamics, and individual contributions. Git, as the dominant version control system, stores a rich history of changes, authorship, and development timelines in its commit history. This data, when programmatically accessed and analyzed, can reveal patterns of activity, identify key contributors, track progress, and highlight areas of the codebase receiving the most attention.
Leveraging Python’s data processing capabilities and readily available libraries offers a powerful and flexible approach to extracting and analyzing Git repository data. This method allows for custom metrics, automated reporting, and integration with other data analysis and visualization tools.
Key terms essential to understanding Git commit history analysis include:
- Commit: A snapshot of the repository’s state at a specific point in time, along with metadata like author, committer, date, and a message describing the changes.
- Author: The person who originally wrote the code changes included in a commit.
- Committer: The person who applied the commit to the repository (may be different from the author, e.g., when applying a patch).
- Repository: The complete history of commits for a project, stored in a .git directory.
- SHA-1 (Secure Hash Algorithm 1): A unique identifier assigned to each commit, computed from the commit’s content and metadata.
The Value of Analyzing Git History with Python
Programmatic analysis of Git history offers several advantages over manual inspection or basic git log commands:
- Scalability: Process large repositories with extensive histories efficiently.
- Customization: Define and calculate specific metrics tailored to project needs.
- Automation: Automate reporting and analysis pipelines for continuous monitoring.
- Integration: Combine Git data with issue trackers, CI/CD systems, or other project management tools.
- Depth of Analysis: Perform complex calculations, trend analysis, and data visualization that are difficult with command-line tools alone.
Analyzing commit history with Python can uncover insights into:
- Project activity levels over time.
- Distribution of contributions among team members.
- Impact of contributions (e.g., lines of code added/deleted).
- Bus factor (dependency on a few key contributors).
- Frequency of certain types of changes (features, fixes, refactors) based on commit messages.
- Identification of frequently modified or stable parts of the codebase.
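As an illustration of classifying changes by commit message, simple keyword rules go a long way. The categories and regular expressions below are hypothetical examples, not a standard taxonomy; a project following a convention such as Conventional Commits would parse the structured prefix instead:

```python
import re
from collections import Counter

# Hypothetical keyword rules for bucketing commit messages
CATEGORIES = {
    'fix': re.compile(r'\b(fix|bug|patch)\b', re.IGNORECASE),
    'feature': re.compile(r'\b(add|feature|implement)\b', re.IGNORECASE),
    'refactor': re.compile(r'\b(refactor|cleanup|rename)\b', re.IGNORECASE),
}

def classify_message(message):
    """Return the first matching category for a commit message, or 'other'."""
    for category, pattern in CATEGORIES.items():
        if pattern.search(message):
            return category
    return 'other'

messages = [
    "Fix off-by-one error in pagination",
    "Add CSV export feature",
    "Refactor config loading",
    "Update README",
]
counts = Counter(classify_message(m) for m in messages)
print(counts)
```

In a real analysis, `messages` would be the `message` strings collected from the repository's commits.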
Essential Concepts for Git History Analysis
Effective analysis requires understanding the information contained within each commit object and what metrics can be derived:
- Commit Metadata:
  - hexsha: The unique SHA-1 hash of the commit.
  - author: The git.Actor object representing the author (name and email).
  - committer: The git.Actor object representing the committer.
  - authored_date: Timestamp when the commit was authored.
  - committed_date: Timestamp when the commit was committed.
  - message: The commit message string.
  - parents: A list of parent commits (typically one; more for merges).
- Commit Stats: Information about the changes introduced by the commit, typically including:
  - total: A dictionary containing insertions, deletions, and lines (total changes).
  - files: A dictionary mapping file paths to their individual insertions, deletions, and lines stats.
Common metrics derived from this data include:
- Commit Count: Total number of commits, or commits per author, per time period.
- Lines Changed: Total lines added or deleted per author, per commit, or per file.
- Commit Frequency: How often commits occur (daily, weekly, monthly) to identify development pace and trends.
- Author Activity: Ranking contributors by commit count or lines changed.
- File Activity: Identifying files or directories with the most frequent changes.
- Commit Time Analysis: Examining the time of day or day of the week when commits are made.
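The last metric, commit time analysis, reduces to deriving hour-of-day and day-of-week columns with pandas. The timestamps below are made-up stand-ins for values that would normally come from each commit's committed_date:

```python
import pandas as pd

# Sample commit timestamps; in practice these come from commit.committed_date
timestamps = pd.to_datetime([
    '2023-01-02 09:15', '2023-01-02 14:30', '2023-01-03 09:45',
    '2023-01-07 22:10', '2023-01-09 10:05',
])
commit_times = pd.DataFrame({'committed_date': timestamps})

# Derive hour-of-day and day-of-week columns for commit-time analysis
commit_times['hour'] = commit_times['committed_date'].dt.hour
commit_times['weekday'] = commit_times['committed_date'].dt.day_name()

print(commit_times['weekday'].value_counts())
print(commit_times.groupby('hour').size())
```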
Tools and Libraries for Python Git Analysis
The primary library for interacting with Git repositories in Python is GitPython. It provides an object-oriented interface to the Git repository, allowing programmatic access to its objects like repositories, commits, trees, blobs, and more.
Installing GitPython
GitPython can be installed using pip:
```
pip install GitPython
```

Basic GitPython Usage
Interacting with a repository involves initializing the git.Repo object:
```python
import git

# Path to the Git repository
repo_path = '/path/to/your/repository'

# Opening the repository raises an exception if the path is not a valid Git repository
try:
    repo = git.Repo(repo_path)
    print(f"Successfully opened repository: {repo_path}")
except (git.exc.InvalidGitRepositoryError, git.exc.NoSuchPathError):
    print(f"Error: {repo_path} is not a valid Git repository.")
    repo = None
```

Iterating through commits is a fundamental operation:
```python
if repo:
    # Iterate through all commits in the default branch (usually 'HEAD')
    # You can specify a branch or commit hash to start from
    print("\nRecent commits:")
    for commit in repo.iter_commits('HEAD', max_count=10):
        print(f"SHA: {commit.hexsha[:7]}")
        print(f"Author: {commit.author.name}")
        print(f"Date: {commit.authored_date}")
        print(f"Message: {commit.summary}")
        print("-" * 20)
```

Accessing commit statistics (lines added/deleted):
```python
if repo:
    print("\nStats for a recent commit:")
    try:
        latest_commit = repo.head.commit
        print(f"Commit: {latest_commit.hexsha[:7]}")
        stats = latest_commit.stats.total
        print(f"Lines Added: {stats['insertions']}")
        print(f"Lines Deleted: {stats['deletions']}")
        print(f"Total Lines Changed: {stats['lines']}")
        # You can also access latest_commit.stats.files for per-file changes
    except Exception as e:
        print(f"Could not get stats for latest commit: {e}")
```

For data processing and aggregation, the pandas library is invaluable. Matplotlib and Seaborn are standard choices for visualization.
```
pip install pandas matplotlib seaborn
```

Step-by-Step: Analyzing Commit Data for Contributor Stats
This section outlines a process for extracting and analyzing commit data to generate contributor statistics.
Step 1: Set Up Environment and Import Libraries
Ensure Python is installed and that the third-party libraries (GitPython, imported as git, and pandas) are available; datetime and os come from the standard library.
```python
import git
import pandas as pd
from datetime import datetime
import os
```

Step 2: Specify Repository Path
Define the local path to the Git repository that will be analyzed.
```python
repo_path = '/path/to/your/repository'  # Replace with the actual path
```

Step 3: Load the Repository
Initialize the git.Repo object. Include error handling for invalid paths.
```python
try:
    repo = git.Repo(repo_path)
    print(f"Loading repository from {repo_path}")
except git.exc.InvalidGitRepositoryError:
    print(f"Error: {repo_path} is not a valid Git repository.")
    repo = None
except git.exc.NoSuchPathError:
    print(f"Error: {repo_path} does not exist.")
    repo = None
except Exception as e:
    print(f"An unexpected error occurred: {e}")
    repo = None
```

Step 4: Iterate Through Commits and Extract Data
Loop through the commits and collect relevant information. Storing this in a list of dictionaries is a flexible approach before converting to a pandas DataFrame.
```python
commit_list = []

if repo:
    print("Extracting commit data...")
    try:
        # Iterate through all commits in the history
        # Consider adding filtering arguments like max_count, since, until if history is very large
        for commit in repo.iter_commits('HEAD'):
            try:
                # Attempt to get stats; this can sometimes fail for merge commits or empty commits
                stats = commit.stats.total
                insertions = stats.get('insertions', 0)
                deletions = stats.get('deletions', 0)
                total_lines = stats.get('lines', 0)  # lines is insertions + deletions
            except Exception:
                # Handle cases where stats might not be available or raise errors
                insertions = 0
                deletions = 0
                total_lines = 0

            commit_data = {
                'hexsha': commit.hexsha,
                'author_name': commit.author.name,
                'author_email': commit.author.email,
                'committed_date': datetime.fromtimestamp(commit.committed_date),
                'message': commit.message.strip(),
                'insertions': insertions,
                'deletions': deletions,
                'total_lines_changed': total_lines
            }
            commit_list.append(commit_data)

        print(f"Extracted {len(commit_list)} commits.")

    except git.exc.GitCommandError as e:
        print(f"Git command failed during iteration: {e}")
    except Exception as e:
        print(f"An error occurred during commit extraction: {e}")
```

Step 5: Create a Pandas DataFrame
Convert the collected data into a pandas DataFrame for easier manipulation and analysis.
```python
if commit_list:
    df = pd.DataFrame(commit_list)

    # Ensure committed_date is in datetime format for time-based analysis
    df['committed_date'] = pd.to_datetime(df['committed_date'])

    print("\nDataFrame created:")
    print(df.head())
    print(f"\nDataFrame shape: {df.shape}")
else:
    print("\nNo commit data to process.")
    df = pd.DataFrame()  # Create an empty DataFrame if no data
```

Step 6: Analyze and Aggregate Data
Use pandas capabilities to calculate contributor statistics.
Example: Commits per Author
```python
if not df.empty:
    commits_per_author = df['author_name'].value_counts().reset_index()
    commits_per_author.columns = ['Author', 'Total Commits']
    print("\nTotal Commits per Author:")
    print(commits_per_author.head(10))  # Display top 10
```

Example: Lines Changed per Author
```python
if not df.empty:
    lines_per_author = df.groupby('author_name')[['insertions', 'deletions', 'total_lines_changed']].sum().reset_index()
    lines_per_author.columns = ['Author', 'Total Insertions', 'Total Deletions', 'Total Lines Changed']
    # The 'total_lines_changed' column from stats usually IS insertions + deletions,
    # but recalculating ensures consistency.
    lines_per_author['Calculated Total Lines Changed'] = (
        lines_per_author['Total Insertions'] + lines_per_author['Total Deletions']
    )

    print("\nLines Changed per Author:")
    # Sort by calculated total lines changed for ranking
    print(lines_per_author.sort_values(by='Calculated Total Lines Changed', ascending=False).head(10))  # Display top 10
```

Example: Commit Frequency Over Time
```python
if not df.empty:
    # Group by day and count commits
    commits_over_time = df.set_index('committed_date').resample('D').size().reset_index(name='Commit Count')
    print("\nCommit Frequency (Daily):")
    print(commits_over_time.head())

    # Group by week
    commits_over_week = df.set_index('committed_date').resample('W').size().reset_index(name='Commit Count')
    print("\nCommit Frequency (Weekly):")
    print(commits_over_week.head())
```

Step 7: Visualize Data (Optional but Recommended)
Using matplotlib or seaborn to visualize the aggregated data makes insights more accessible.
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Configure matplotlib for better display
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_theme()

if not df.empty:
    # Visualization 1: Top N contributors by commit count
    top_n = 10
    if len(commits_per_author) > 0:
        plt.figure(figsize=(12, 6))
        sns.barplot(x='Total Commits', y='Author',
                    data=commits_per_author.nlargest(top_n, 'Total Commits'))
        plt.title(f'Top {top_n} Contributors by Commit Count')
        plt.xlabel('Number of Commits')
        plt.ylabel('Author')
        plt.tight_layout()
        plt.show()
    else:
        print("\nNot enough data to plot Commits per Author.")

    # Visualization 2: Commit frequency over time (Weekly)
    if len(commits_over_week) > 1:  # Need at least two points for a meaningful line plot
        plt.figure(figsize=(15, 6))
        sns.lineplot(x='committed_date', y='Commit Count', data=commits_over_week)
        plt.title('Weekly Commit Frequency')
        plt.xlabel('Week')
        plt.ylabel('Number of Commits')
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.show()
    else:
        print("\nNot enough data to plot Weekly Commit Frequency.")

    # Visualization 3: Top N contributors by lines changed
    if len(lines_per_author) > 0:
        plt.figure(figsize=(12, 6))
        sns.barplot(x='Calculated Total Lines Changed', y='Author',
                    data=lines_per_author.nlargest(top_n, 'Calculated Total Lines Changed'))
        plt.title(f'Top {top_n} Contributors by Lines Changed (Insertions + Deletions)')
        plt.xlabel('Total Lines Changed')
        plt.ylabel('Author')
        plt.tight_layout()
        plt.show()
    else:
        print("\nNot enough data to plot Lines Changed per Author.")
else:
    print("\nNo data available for visualization.")
```

Concrete Example: Analyzing a Hypothetical Project
Consider a small open-source library’s Git repository located at /home/user/my-python-library. Applying the steps above to this repository could yield the following hypothetical results:
Total Commits per Author (Example Output)
| Author | Total Commits |
|---|---|
| Alice Developer | 150 |
| Bob Coder | 110 |
| Charlie Contributor | 45 |
| David Newcomer | 12 |
| Eve Fixer | 5 |
Interpretation: This table quickly identifies the most active contributors by commit volume. Alice and Bob appear to be the core maintainers based on this metric.
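The bus factor mentioned earlier can be estimated from exactly this kind of table: find the smallest number of top contributors accounting for a given share of commits. The sketch below applies one simple definition (several variants exist in practice) to the example numbers above:

```python
# Commit counts from the example table above
commit_counts = {
    'Alice Developer': 150,
    'Bob Coder': 110,
    'Charlie Contributor': 45,
    'David Newcomer': 12,
    'Eve Fixer': 5,
}

def bus_factor(counts, threshold=0.5):
    """Smallest number of top contributors covering `threshold` of all commits.

    One simple definition among several used in practice.
    """
    total = sum(counts.values())
    covered = 0
    for i, n in enumerate(sorted(counts.values(), reverse=True), start=1):
        covered += n
        if covered / total >= threshold:
            return i
    return len(counts)

print(bus_factor(commit_counts))  # Alice + Bob cover over half of all commits
```

A low value signals heavy dependence on a few people; raising the threshold shows how quickly coverage concentrates.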
Lines Changed per Author (Example Output - Top 5 by Total Lines)
| Author | Total Insertions | Total Deletions | Total Lines Changed | Calculated Total Lines Changed |
|---|---|---|---|---|
| Bob Coder | 15000 | 7000 | 22000 | 22000 |
| Alice Developer | 12000 | 9000 | 21000 | 21000 |
| Charlie Contributor | 3000 | 1000 | 4000 | 4000 |
| David Newcomer | 800 | 50 | 850 | 850 |
| Eve Fixer | 100 | 300 | 400 | 400 |
Interpretation: While Alice has slightly more commits, Bob has contributed a higher volume of code changes (insertions + deletions). This shows the value of looking beyond just commit counts.
Weekly Commit Frequency (Example Snippet)
| committed_date | Commit Count |
|---|---|
| 2023-01-01 | 3 |
| 2023-01-08 | 8 |
| 2023-01-15 | 15 |
| 2023-01-22 | 12 |
| 2023-01-29 | 5 |
| … | … |
Interpretation: A visualization of this data would show periods of high and low activity, potentially corresponding to feature sprints, release cycles, or holiday periods. Spikes might indicate intense development phases, while dips could signal periods of low activity or project stagnation.
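Before reading too much into individual spikes or dips, it can help to smooth the weekly series; a centered rolling mean is a common first step. The sketch below uses the sample values from the table above:

```python
import pandas as pd

# Weekly commit counts matching the example table above
weekly = pd.Series([3, 8, 15, 12, 5],
                   index=pd.date_range('2023-01-01', periods=5, freq='W'))

# A short centered rolling mean smooths week-to-week noise before judging trends
smoothed = weekly.rolling(window=3, center=True).mean()
print(smoothed)
```

The endpoints come out as NaN because a centered 3-week window has no neighbors there; a longer history makes this negligible.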
Further analysis could involve filtering commits by date range, analyzing commit message content using text analysis techniques, or identifying contributions to specific directories or files.
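As a sketch of the per-directory idea, the dictionaries below mimic the shape of commit.stats.files in GitPython (the paths and numbers are invented) and are aggregated by top-level directory:

```python
from collections import defaultdict

# Hypothetical per-file stats, shaped like commit.stats.files from GitPython
commit_file_stats = [
    {'src/core.py': {'insertions': 40, 'deletions': 5, 'lines': 45},
     'tests/test_core.py': {'insertions': 20, 'deletions': 0, 'lines': 20}},
    {'src/utils.py': {'insertions': 10, 'deletions': 2, 'lines': 12},
     'src/core.py': {'insertions': 3, 'deletions': 3, 'lines': 6}},
]

# Aggregate total lines changed per top-level directory
dir_changes = defaultdict(int)
for files in commit_file_stats:
    for path, stats in files.items():
        top_dir = path.split('/')[0]
        dir_changes[top_dir] += stats['lines']

print(dict(dir_changes))
```

Against a real repository, `commit_file_stats` would be built by collecting `commit.stats.files` during the extraction loop in Step 4.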
Key Takeaways and Actionable Insights
Analyzing Git commit history with Python provides powerful capabilities for understanding software development projects.
- Identify Core Contributors: Easily rank authors by commit count or lines changed to recognize key players.
- Monitor Project Health: Track commit frequency over time to gauge development pace and identify potential bottlenecks or periods of inactivity.
- Assess Contribution Impact: Analyze lines added/deleted to understand the scope and nature of contributions, moving beyond simple commit counts.
- Inform Resource Allocation: Data on who is working on what parts of the codebase (by analyzing commit file stats) can inform team structure and task assignment.
- Recognize and Reward: Use quantitative data to acknowledge significant contributions from team members or community contributors.
- Refine Processes: Analyze commit patterns (e.g., size of commits, frequency of merge commits) to identify areas for process improvement.
- Onboard New Members: Analyze historical data to understand project evolution and highlight active areas for new team members.
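For example, the share of merge commits is easy to derive once parent counts are collected (len(commit.parents) for each commit in GitPython). The counts below are illustrative placeholders:

```python
# Hypothetical parent counts per commit; in GitPython this is len(commit.parents)
parent_counts = [1, 1, 2, 1, 2, 1, 1, 1, 2, 1]

# Commits with more than one parent are merge commits
merge_commits = sum(1 for n in parent_counts if n > 1)
merge_ratio = merge_commits / len(parent_counts)
print(f"Merge commits: {merge_commits} ({merge_ratio:.0%} of history)")
```

A high merge ratio might indicate a branch-heavy workflow; whether that is a problem depends entirely on the team's branching strategy.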
By automating the extraction and analysis of Git data using Python and libraries like GitPython and pandas, organizations can gain objective, data-driven insights into their development processes and team performance. This enables more informed decision-making and a better understanding of the project’s historical trajectory.