PyVCS: A version control system in Python

"The goal is to peek underneath the abstraction and reimplement the abstraction from scratch, to gain deep understanding of what's being abstracted."

Introduction

A version control system is a tool for managing and tracking changes in software projects. Git, as you may know, is the version control system developers use to maintain code and collaborate. In this doc, we'll explore PyVCS, a simple version control system implemented in Python that mimics some of Git's core functionality.

How Git Works: A Brief Overview

Before diving into PyVCS, here is a quick rundown of how Git works (limited to what is relevant to this project):

  1. Git uses a content-addressable filesystem to store snapshots of your project.
  2. It maintains a series of commits, each representing a point in the project's history.
  3. Git uses objects (blobs, trees, commits) to store content and metadata.
  4. References (refs) are used to point to specific commits (e.g., branches, tags).
  5. The working directory, staging area, and repository form the three main states of Git.
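
To make point 1 concrete, here is a minimal sketch of a content-addressable store. It is an in-memory toy rather than Git's actual storage code; the class name `ContentStore` and its methods are made up for illustration, and SHA-256 is used only because that is what PyVCS uses later in this doc.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: an object's key is the hash of its bytes."""

    def __init__(self):
        self._objects = {}  # hex digest -> raw bytes

    def put(self, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        self._objects[key] = content  # storing identical content twice changes nothing
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ContentStore()
first = store.put(b"print('hello')\n")
second = store.put(b"print('hello')\n")   # same bytes, same key
third = store.put(b"print('goodbye')\n")  # different bytes, different key
```

Because the key is derived from the content, identical files are stored once, and any change to a file produces a new key instead of overwriting the old object.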

What PyVCS Implements

We will implement the following Git commands:

  • init - initializes a new repository
  • add - adds a file to the staging area
  • commit - commits the staged changes
  • status - shows untracked and modified files
  • log - shows the commit history
  • diff - compares the current version of a file with its last committed version

We will look closely at how each command works in PyVCS, how it compares to Git, and what goes on under the hood.

Implementing Commands

1. pyvcs init ( v/s Git )

When the init command is executed, PyVCS creates the following directory structure:

.pyvcs/
├── objects/
└── refs/

  • The .pyvcs directory is the root of the version control system.
  • The objects directory will store all content (files and commits) as hash-addressed objects.
  • The refs directory will store references, particularly the HEAD reference.

# Standard-library imports used by the snippets in this doc
import difflib
import json
import os
from datetime import datetime

class VersionControl:
    def __init__(self, root_dir):
        self.root_dir = os.path.abspath(root_dir)
        self.pyvcs_dir = os.path.join(self.root_dir, '.pyvcs')
        self.objects_dir = os.path.join(self.pyvcs_dir, 'objects')
        self.refs_dir = os.path.join(self.pyvcs_dir, 'refs')

Create the necessary directory structure if it doesn't exist, and return either a success message or an "already initialized" message.

# init is called on the class itself (it receives cls), so it is a classmethod
@classmethod
def init(cls, root_dir, directory_name=None):
    root_dir = cls._create_repo_directory(root_dir, directory_name)
    vc = cls(root_dir)
    if not os.path.exists(vc.pyvcs_dir):
        cls._create_pyvcs_structure(vc.pyvcs_dir, vc.objects_dir, vc.refs_dir)
        return True, f"Initialized empty PyVCS repository in {vc.pyvcs_dir}"
    return False, "PyVCS directory already initialized."

Create a new repository directory if a directory name is provided.

# Neither helper uses self or cls, so both are static methods
@staticmethod
def _create_repo_directory(root_dir, directory_name):
    if directory_name:
        new_dir = os.path.join(root_dir, directory_name)
        if not os.path.exists(new_dir):
            os.makedirs(new_dir)
        root_dir = new_dir
    return root_dir

Create the PyVCS directory structure.

@staticmethod
def _create_pyvcs_structure(pyvcs_dir, objects_dir, refs_dir):
    os.makedirs(pyvcs_dir, exist_ok=True)
    os.makedirs(objects_dir, exist_ok=True)
    os.makedirs(refs_dir, exist_ok=True)

  • It doesn't create any initial commit or branch.

2. pyvcs add ( v/s Git )

  1. The file content is read.
  2. A SHA-256 hash of the content is calculated.
  3. The content is stored in the objects directory with the hash as the filename.
  4. The file path and hash are added to an index file (.pyvcs/index).

def add(self, file_path):
    content = read_file(file_path)
    hash_value = calculate_hash(content)
    object_path = os.path.join(self.objects_dir, hash_value)
    write_file(object_path, content)
    self._update_index(file_path, hash_value)
    return hash_value

Update the index file with the new file path and hash value.

def _update_index(self, file_path, hash_value):
    index_path = os.path.join(self.pyvcs_dir, 'index')
    if os.path.exists(index_path):
        with open(index_path, 'r') as f:
            index = json.load(f)
    else:
        index = {}
    index[os.path.relpath(file_path, self.root_dir)] = hash_value
    with open(index_path, 'w') as f:
        json.dump(index, f)

  • PyVCS uses a simple JSON file as its index, storing file paths and their corresponding hashes.
  • It doesn't create tree objects or handle directories specially.
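
The snippets in this doc call a few helpers (read_file, write_file, calculate_hash, list_files) whose definitions are not shown. Below is a minimal sketch of what they might look like, inferred only from how they are used: content is handled as bytes (the other snippets call .decode() and .encode() around them) and hashes are SHA-256 hex digests. These bodies are assumptions, not the project's actual code.

```python
import hashlib
import os


def read_file(path):
    """Return the raw bytes of a file."""
    with open(path, 'rb') as f:
        return f.read()


def write_file(path, content):
    """Write raw bytes to a file, creating parent directories if needed."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, 'wb') as f:
        f.write(content)


def calculate_hash(content):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(content).hexdigest()


def list_files(root_dir):
    """Yield the path of every file under root_dir, recursively."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Working with bytes rather than text keeps the hash independent of any encoding choice and lets binary files be stored unchanged.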

3. pyvcs commit ( v/s Git )

  1. It reads the current HEAD commit (if it exists).
  2. It creates a commit object containing:

    • Commit message
    • Timestamp
    • Parent commit hash
    • A dictionary of file paths and their corresponding hashes
  3. It calculates a SHA-256 hash of the commit object.

  4. It stores the commit object in the objects directory.
  5. It updates the HEAD reference to point to the new commit.

def commit(self, message):
    parent_hash = self._get_head_commit()
    commit_obj = self._create_commit_object(message, parent_hash)
    commit_hash = self._write_commit_object(commit_obj)
    self._update_head(commit_hash)
    return commit_hash

Read and return the current HEAD commit hash.

def _get_head_commit(self):
    head_path = os.path.join(self.refs_dir, 'HEAD')
    if os.path.exists(head_path):
        return read_file(head_path).decode().strip()
    return None

Create a commit object with the given message, current timestamp, staged files, and parent commit hash.

def _create_commit_object(self, message, parent_hash):
    commit_obj = {
        'message': message,
        'timestamp': datetime.now().isoformat(),
        'files': self._get_staged_files(),
        'parent': parent_hash,
    }
    return commit_obj

Read and return the staged files from the index file.

def _get_staged_files(self):
    index_path = os.path.join(self.pyvcs_dir, 'index')
    # Before the first add there is no index file yet; treat that as "nothing staged"
    if not os.path.exists(index_path):
        return {}
    with open(index_path, 'r') as f:
        return json.load(f)

Write the commit object to the objects directory and return its hash.

def _write_commit_object(self, commit_obj):
    commit_content = json.dumps(commit_obj).encode()
    commit_hash = calculate_hash(commit_content)
    commit_path = os.path.join(self.objects_dir, commit_hash)
    write_file(commit_path, commit_content)
    return commit_hash

Update the HEAD reference to point to the new commit.

def _update_head(self, commit_hash):
    head_path = os.path.join(self.refs_dir, 'HEAD')
    write_file(head_path, commit_hash.encode())
    
  • PyVCS stores commit objects as JSON files.
  • It doesn't create separate tree objects for directories.
  • The commit hash is based on the entire commit object, including the file list.
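
As a small illustration of the last point, the sketch below hashes two commit dictionaries that differ only in their file list. The function name commit_hash, the message, the timestamp, and the file hashes are placeholders; sort_keys=True is an addition made here so that the same dictionary always serializes to the same bytes, which the snippet above does not do.

```python
import hashlib
import json


def commit_hash(commit_obj):
    # sort_keys=True (an assumption of this sketch) makes the serialization
    # independent of the order in which keys were inserted.
    payload = json.dumps(commit_obj, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


base = {
    'message': 'Initial commit',
    'timestamp': '2024-01-01T00:00:00',
    'files': {'main.py': 'aaa111'},  # placeholder hashes, not real digests
    'parent': None,
}
changed = dict(base, files={'main.py': 'bbb222'})

same = commit_hash(dict(base)) == commit_hash(base)      # identical content
differs = commit_hash(changed) != commit_hash(base)      # one file hash changed
```

Since the parent hash is part of the hashed content too, changing any earlier commit would change the hash of every commit that follows it.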

4. pyvcs status ( v/s Git )

  1. Reads the index file to get staged files.
  2. Scans the working directory for all files.
  3. Compares working directory files with the index to identify:

    • Untracked files (in working directory but not in index)
    • Modified files (content hash different from index)
  4. Returns two dictionaries: staged files (path to hash) and changed files (path to 'Untracked' or 'Modified').

def status(self):
    staged_files = self._get_staged_files()
    changed_files = self._get_changed_files(staged_files)
    return staged_files, changed_files

Identify files that have been changed or are untracked.

def _get_changed_files(self, staged_files):
    changed_files = {}
    for file_path in list_files(self.root_dir):
        rel_path = os.path.relpath(file_path, self.root_dir)
        # Skip only the .pyvcs directory itself, not files whose names merely start with ".pyvcs"
        if rel_path.split(os.sep)[0] == '.pyvcs':
            continue
        if rel_path not in staged_files:
            changed_files[rel_path] = 'Untracked'
        else:
            content = read_file(file_path)
            hash_value = calculate_hash(content)
            if hash_value != staged_files[rel_path]:
                changed_files[rel_path] = 'Modified'
    return changed_files

  • It doesn't track renames or copies.
  • It doesn't handle submodules or complex ignore rules.

5. pyvcs log ( v/s Git )

  1. Starts from the HEAD commit.
  2. For each commit:

    • Reads the commit object from the objects directory.
    • Yields the commit hash and object.
    • Moves to the parent commit.
  3. Stops when there's no parent (root commit).

def log(self):
    commit_hash = self._get_head_commit()
    while commit_hash:
        commit_obj = self._read_commit_object(commit_hash)
        yield commit_hash, commit_obj
        commit_hash = commit_obj.get('parent')

Read and return the commit object from the objects directory.

def _read_commit_object(self, commit_hash):
    commit_path = os.path.join(self.objects_dir, commit_hash)
    commit_content = read_file(commit_path)
    return json.loads(commit_content.decode())

  • It's a simple linear traversal of commits.
  • It doesn't handle branches or merges.
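
The parent-chain walk can be tried without a repository on disk. The sketch below replaces the objects directory with a plain dictionary; the function name walk_history, the keys c1 to c3, and the messages are made up for illustration.

```python
def walk_history(commits, head):
    """Yield (hash, commit) pairs from head back to the root commit."""
    current = head
    while current:
        commit = commits[current]
        yield current, commit
        current = commit.get('parent')  # None at the root ends the loop


# Hypothetical three-commit history keyed by placeholder hashes.
commits = {
    'c3': {'message': 'Add diff', 'parent': 'c2'},
    'c2': {'message': 'Add status', 'parent': 'c1'},
    'c1': {'message': 'Initial commit', 'parent': None},
}

for commit_hash, commit in walk_history(commits, 'c3'):
    print(commit_hash, commit['message'])
```

Each commit stores exactly one parent, so the history is a singly linked list and the walk visits the newest commit first.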

6. pyvcs diff ( v/s Git )

  1. Retrieves the file content from the last commit.
  2. Reads the current file content.
  3. Uses Python's difflib to generate a unified diff.

def diff(self, file_path):
    last_commit_content = self._get_last_commit_file_content(file_path)
    if last_commit_content is None:
        return f"No previous commit for file {file_path}"
    # Split without keeping line endings: with lineterm='' and '\n'.join below,
    # kept line endings would put a blank line after every content line.
    current_content = read_file(file_path).decode().splitlines()
    last_commit_content = last_commit_content.decode().splitlines()
    diff_result = difflib.unified_diff(
        last_commit_content,
        current_content,
        fromfile=f"a/{file_path}",
        tofile=f"b/{file_path}",
        lineterm=''
    )
    return '\n'.join(diff_result)

Retrieve the content of the file from the last commit.

def _get_last_commit_file_content(self, file_path):
    commit_hash = self._get_head_commit()
    if not commit_hash:
        return None
    commit_obj = self._read_commit_object(commit_hash)
    staged_files = commit_obj.get('files', {})
    rel_path = os.path.relpath(file_path, self.root_dir)
    if rel_path in staged_files:
        file_hash = staged_files[rel_path]
        object_path = os.path.join(self.objects_dir, file_hash)
        return read_file(object_path)
    return None

  • It only shows differences between the working directory and the last commit.
  • It doesn't handle staged changes separately.
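
For readers who have not used difflib before, here is a standalone sketch of difflib.unified_diff on two small in-memory inputs. The file name and contents are made up. The lines are split without their line endings and lineterm='' is passed, so joining with '\n' gives exactly one output line per diff line.

```python
import difflib

old = "hello\nworld\n".splitlines()
new = "hello\nthere\n".splitlines()

diff_lines = list(difflib.unified_diff(
    old,
    new,
    fromfile="a/greeting.txt",
    tofile="b/greeting.txt",
    lineterm='',
))
print('\n'.join(diff_lines))
```

The output starts with the two file header lines, followed by a hunk header and the changed lines, where a leading "-" marks a line only in the old version and "+" a line only in the new one.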

How does Git implement the commands?

Git's implementation is, of course, far more sophisticated and optimized. Its design choices (object model, efficient indexing, and graph-based history representation) allow it to handle large-scale projects with complex histories.

Below is a list of the key differences between PyVCS and Git. Each of these differences is also a possible improvement to this project.

Feel free to click the links to go back and forth to understand more.

1. git init ( v/s PyVCS )

Git's init command creates a more complex structure:

.git/
├── objects/
├── refs/
│   ├── heads/
│   └── tags/
├── HEAD
├── config
└── description

Functions performed:

  1. Create a HEAD file containing a symbolic reference to the default branch (ref: refs/heads/master, unless init.defaultBranch is configured otherwise).
  2. Create a config file with repository settings.
  • It doesn't create an initial commit, but it sets up the structure for branches and tags.

2. git add ( v/s PyVCS )

Functions performed:

  1. Calculate a SHA-1 hash of the file content prefixed with a short header (the word blob, the content size, and a null byte).
  2. Create a blob object in the objects database.
  3. Update the index (staging area) with the file information.
  • Git uses a binary file format for its index, which is more efficient.
  • Directory structure is represented by tree objects, which Git builds from the index when a commit is created.
  • It can handle partial file adds (hunks) using the index.
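
The blob hashing step can be reproduced in a few lines. This sketch assumes a repository using Git's default SHA-1 object format; the function name git_blob_hash is made up, and the result should match what git hash-object prints for the same bytes.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 of the header 'blob <size>' + NUL byte + content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The empty blob has the well-known ID e69de29bb2d1d6434b8b29ae775ad8c2e48c5391.
empty_id = git_blob_hash(b"")
```

The header is why the blob ID of a file differs from a plain SHA-1 of its content, and why PyVCS hashes (plain SHA-256 of the content) are not comparable to Git's.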

3. git commit ( v/s PyVCS )

Functions performed:

  1. Create a tree object representing the current state of the index.
  2. Create a commit object containing:

    • Tree hash
    • Parent commit hash(es)
    • Author and committer information
    • Timestamp
    • Commit message
  3. Calculate a SHA-1 hash of the commit object.

  4. Update the current branch reference to point to the new commit.
  • Git stores each object as a short header plus content, compressed with zlib, and can later combine many objects into packfiles, which is more space-efficient.
  • It separates content (blobs), structure (trees), and metadata (commits).
  • The commit hash is based on all this information, making it tamper-evident.
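
To show why the hash is tamper-evident, here is a simplified sketch of how a commit ID can be derived from the fields listed above. It assumes the SHA-1 object format and a plain commit with no extra headers (real commits may carry more, for example a signature); the function name git_commit_id and the author identity are placeholders.

```python
import hashlib


def git_commit_id(tree, parents, author, committer, message):
    """Serialize a commit in Git's text layout and return its SHA-1 object ID."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # zero, one, or several parents
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Placeholder identity: name, email, seconds since the epoch, timezone offset.
who = "Ada Example <ada@example.com> 1700000000 +0000"
empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # ID of Git's empty tree

original = git_commit_id(empty_tree, [], who, who, "Initial commit")
tampered = git_commit_id(empty_tree, [], who, who, "Initial commit!")
```

Changing a single character of the message, the tree hash, or a parent hash yields a different commit ID, so a rewritten history cannot keep its old hashes.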

4. git status ( v/s PyVCS )

Functions performed:

  1. Compares the index with the HEAD commit to show staged changes.
  2. Compares the working directory with the index to show unstaged changes.
  • Git uses optimized data structures and algorithms to make this operation fast, even for large repositories.
  • It identifies untracked files.
  • It handles renames, copies, and submodules.
  • It respects .gitignore rules.

5. git log ( v/s PyVCS )

  1. Traverse complex commit graphs with multiple branches and merges.
  2. Filter and format the output in numerous ways.
  3. Show the evolution of specific files or directories.
  • Git uses efficient graph traversal algorithms.
  • It can handle very large histories quickly due to its object model and indexing.

6. git diff ( v/s PyVCS )

  1. Show differences between any two commits, branches, or trees.
  2. Show staged changes (diff --cached).
  • Git uses optimized diff algorithms that can handle large files and repositories. The default is the Myers algorithm, with alternatives such as patience and histogram available through the --diff-algorithm option.
  • It can show renames and copies as such, rather than as full file additions and deletions.