PyVCS: A version control system in Python
"The goal is to peek underneath the abstraction and reimplement the abstraction from scratch, to gain deep understanding of what's being abstracted."
Introduction
A version control system is a tool for managing and tracking changes in software projects. Git, as you may know, is the version control system most developers use to maintain code and collaborate. In this doc, we'll explore PyVCS, a simple version control system implemented in Python that mimics some of Git's core functionality.
How Git Works: A Brief Overview
Before diving into PyVCS, here is a quick rundown of how Git works (limited to what is relevant to this project):
- Git uses a content-addressable filesystem to store snapshots of your project.
- It maintains a series of commits, each representing a point in the project's history.
- Git uses objects (blobs, trees, commits) to store content and metadata.
- References (refs) are used to point to specific commits (e.g., branches, tags).
- The working directory, staging area, and repository form the three main states of Git.
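To make "content-addressable" concrete, here is a minimal sketch of the idea, not how Git or PyVCS actually stores data. The `ContentStore` class is hypothetical and keeps everything in memory; the point is only that the key is derived from the content itself.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store (illustrative only)."""

    def __init__(self):
        self._objects = {}  # maps hex digest -> raw bytes

    def put(self, data: bytes) -> str:
        # The key is computed from the content, so identical content
        # always lands under the same key.
        key = hashlib.sha256(data).hexdigest()
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ContentStore()
key_a = store.put(b"hello\n")
key_b = store.put(b"hello\n")   # identical content, identical key
key_c = store.put(b"hello!\n")  # different content, different key
```

Storing the same bytes twice costs nothing extra, which is why snapshots of a mostly unchanged project stay cheap.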
What PyVCS Implements
We will implement the following Git commands:
- init - initializes a new repository
- add - adds a file to the staging area
- commit - commits the staged changes
- status - tracks the changes
- log - views the commit history
- diff - compares the current version of a file with its last committed version
We will look closely at how each command works in PyVCS, how it compares to Git, and what goes on under the hood.
Implementing Commands
1. pyvcs init
( v/s Git )
When the init command is executed, PyVCS creates a .pyvcs directory containing objects and refs subdirectories:
- The .pyvcs directory is the root of the version control system.
- The objects directory will store all content (files and commits) as hash-addressed objects.
- The refs directory will store references, particularly the HEAD reference.
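The layout described above can be reproduced with a few lines of standard-library code. This is a standalone sketch: `create_layout` and `list_layout` are hypothetical helper names, not part of PyVCS, and the repository lives in a temporary directory.

```python
import os
import tempfile


def create_layout(root_dir):
    """Create .pyvcs/objects and .pyvcs/refs under root_dir."""
    pyvcs_dir = os.path.join(root_dir, ".pyvcs")
    for sub in ("objects", "refs"):
        os.makedirs(os.path.join(pyvcs_dir, sub), exist_ok=True)
    return pyvcs_dir


def list_layout(root_dir):
    """Return sorted directory paths relative to root_dir, using '/' separators."""
    found = []
    for current, dirs, _files in os.walk(root_dir):
        for d in dirs:
            rel = os.path.relpath(os.path.join(current, d), root_dir)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    create_layout(tmp)
    layout = list_layout(tmp)
    print(layout)  # ['.pyvcs', '.pyvcs/objects', '.pyvcs/refs']
```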
import os

class VersionControl:
    def __init__(self, root_dir):
        # Resolve all repository paths once, relative to the project root.
        self.root_dir = os.path.abspath(root_dir)
        self.pyvcs_dir = os.path.join(self.root_dir, '.pyvcs')
        self.objects_dir = os.path.join(self.pyvcs_dir, 'objects')
        self.refs_dir = os.path.join(self.pyvcs_dir, 'refs')
Create the necessary directory structure if it doesn't exist, and return a success message or an "already initialized" message.
    @classmethod
    def init(cls, root_dir, directory_name=None):
        root_dir = cls._create_repo_directory(root_dir, directory_name)
        vc = cls(root_dir)
        if not os.path.exists(vc.pyvcs_dir):
            cls._create_pyvcs_structure(vc.pyvcs_dir, vc.objects_dir, vc.refs_dir)
            return True, f"Initialized empty PyVCS repository in {vc.pyvcs_dir}"
        return False, "PyVCS directory already initialized."
Create a new repository directory if a directory name is provided.
    @staticmethod
    def _create_repo_directory(root_dir, directory_name):
        if directory_name:
            new_dir = os.path.join(root_dir, directory_name)
            if not os.path.exists(new_dir):
                os.makedirs(new_dir)
            root_dir = new_dir
        return root_dir

    @staticmethod
    def _create_pyvcs_structure(pyvcs_dir, objects_dir, refs_dir):
        os.makedirs(pyvcs_dir, exist_ok=True)
        os.makedirs(objects_dir, exist_ok=True)
        os.makedirs(refs_dir, exist_ok=True)
- It doesn't create any initial commit or branch.
2. pyvcs add
( v/s Git )
- The file content is read.
- A SHA-256 hash of the content is calculated.
- The content is stored in the objects directory with the hash as the filename.
- The file path and hash are added to an index file (.pyvcs/index).
    def add(self, file_path):
        content = read_file(file_path)
        hash_value = calculate_hash(content)
        # The object is stored under its own hash, so identical content is stored once.
        object_path = os.path.join(self.objects_dir, hash_value)
        write_file(object_path, content)
        self._update_index(file_path, hash_value)
        return hash_value

    def _update_index(self, file_path, hash_value):
        index_path = os.path.join(self.pyvcs_dir, 'index')
        if os.path.exists(index_path):
            with open(index_path, 'r') as f:
                index = json.load(f)
        else:
            index = {}
        # Index keys are paths relative to the repository root.
        index[os.path.relpath(file_path, self.root_dir)] = hash_value
        with open(index_path, 'w') as f:
            json.dump(index, f)
- PyVCS uses a simple JSON file as its index, storing file paths and their corresponding hashes.
- It doesn't create tree objects or handle directories specially.
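The add code calls three helpers, read_file, write_file and calculate_hash, whose bodies are not shown here. The following is one plausible sketch of them, assuming files are handled as raw bytes (the rest of the code calls `.decode()` on what read_file returns) and that the hash is a SHA-256 hex digest as described above. The actual PyVCS helpers may differ.

```python
import hashlib
import os
import tempfile


def read_file(path):
    """Return the raw bytes of a file."""
    with open(path, "rb") as f:
        return f.read()


def write_file(path, content):
    """Write raw bytes to path, creating parent directories if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        f.write(content)


def calculate_hash(content):
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(content).hexdigest()


# Usage: store some content under its own hash and read it back.
with tempfile.TemporaryDirectory() as tmp:
    content = b"print('hi')\n"
    digest = calculate_hash(content)
    object_path = os.path.join(tmp, "objects", digest)
    write_file(object_path, content)
    round_trip = read_file(object_path)
```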
3. pyvcs commit
( v/s Git )
- It reads the current HEAD commit (if it exists).
- It creates a commit object containing:
  - Commit message
  - Timestamp
  - Parent commit hash
  - A dictionary of file paths and their corresponding hashes
- It calculates a SHA-256 hash of the commit object.
- It stores the commit object in the objects directory.
- It updates the HEAD reference to point to the new commit.
    def commit(self, message):
        parent_hash = self._get_head_commit()
        commit_obj = self._create_commit_object(message, parent_hash)
        commit_hash = self._write_commit_object(commit_obj)
        self._update_head(commit_hash)
        return commit_hash
Read and return the current HEAD commit hash.
    def _get_head_commit(self):
        head_path = os.path.join(self.refs_dir, 'HEAD')
        if os.path.exists(head_path):
            return read_file(head_path).decode().strip()
        return None
Create a commit object with the given message, current timestamp, staged files, and parent commit hash.
    def _create_commit_object(self, message, parent_hash):
        commit_obj = {
            'message': message,
            'timestamp': datetime.now().isoformat(),
            'files': self._get_staged_files(),
            'parent': parent_hash,
        }
        return commit_obj
Read and return the staged files from the index file.
    def _get_staged_files(self):
        index_path = os.path.join(self.pyvcs_dir, 'index')
        # Nothing has been staged yet if the index file does not exist.
        if not os.path.exists(index_path):
            return {}
        with open(index_path, 'r') as f:
            return json.load(f)
The two remaining helpers are _write_commit_object, which writes the commit object to the objects directory and returns its hash, and _update_head, which updates the HEAD reference to point to the new commit.
- PyVCS stores commit objects as JSON files.
- It doesn't create separate tree objects for directories.
- The commit hash is based on the entire commit object, including the file list.
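The bodies of the write and HEAD-update steps are not shown above, so here is a hedged sketch of what they could look like. It is written as standalone functions (`write_commit_object`, `update_head`) rather than methods, and the use of `sort_keys=True` for a deterministic serialization is an assumption, not something confirmed by the PyVCS source.

```python
import hashlib
import json
import os
import tempfile


def write_commit_object(objects_dir, commit_obj):
    """Serialize the commit as JSON, store it under its SHA-256 hash, return the hash."""
    # sort_keys is an assumption: it makes the serialized bytes deterministic.
    data = json.dumps(commit_obj, sort_keys=True).encode()
    commit_hash = hashlib.sha256(data).hexdigest()
    os.makedirs(objects_dir, exist_ok=True)
    with open(os.path.join(objects_dir, commit_hash), "wb") as f:
        f.write(data)
    return commit_hash


def update_head(refs_dir, commit_hash):
    """Point refs/HEAD at the given commit hash."""
    os.makedirs(refs_dir, exist_ok=True)
    with open(os.path.join(refs_dir, "HEAD"), "w") as f:
        f.write(commit_hash)


# Usage with a hypothetical commit object in a temporary repository.
with tempfile.TemporaryDirectory() as tmp:
    objects_dir = os.path.join(tmp, ".pyvcs", "objects")
    refs_dir = os.path.join(tmp, ".pyvcs", "refs")
    commit_obj = {
        "message": "first commit",
        "timestamp": "2024-01-01T00:00:00",
        "files": {"a.txt": "abc123"},
        "parent": None,
    }
    commit_hash = write_commit_object(objects_dir, commit_obj)
    update_head(refs_dir, commit_hash)
    with open(os.path.join(refs_dir, "HEAD")) as f:
        stored_head = f.read()
    with open(os.path.join(objects_dir, commit_hash), "rb") as f:
        loaded = json.loads(f.read().decode())
```

Because the file list is part of the hashed data, changing any staged file hash changes the commit hash as well.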
4. pyvcs status
( v/s Git )
- Reads the index file to get staged files.
- Scans the working directory for all files.
- Compares working directory files with the index to identify:
  - Untracked files (in working directory but not in index)
  - Modified files (content hash different from index)
- Returns the staged files and the changed files, each as a dictionary keyed by file path.
    def status(self):
        staged_files = self._get_staged_files()
        changed_files = self._get_changed_files(staged_files)
        return staged_files, changed_files
Identify files that have been changed or are untracked.
    def _get_changed_files(self, staged_files):
        changed_files = {}
        for file_path in list_files(self.root_dir):
            rel_path = os.path.relpath(file_path, self.root_dir)
            # Skip the repository's own metadata.
            if rel_path.startswith('.pyvcs'):
                continue
            if rel_path not in staged_files:
                changed_files[rel_path] = 'Untracked'
            else:
                # Re-hash the working copy and compare it with the staged hash.
                content = read_file(file_path)
                hash_value = calculate_hash(content)
                if hash_value != staged_files[rel_path]:
                    changed_files[rel_path] = 'Modified'
        return changed_files
- It doesn't track renames or copies.
- It doesn't handle submodules or complex ignore rules.
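The status code relies on a list_files helper that is not shown. A minimal sketch using os.walk follows; the `skip_dirs` parameter is a hypothetical addition that prunes the .pyvcs directory during the walk, which overlaps with the startswith check in the status code.

```python
import os
import tempfile


def list_files(root_dir, skip_dirs=(".pyvcs",)):
    """Return the full paths of all files under root_dir, skipping skip_dirs."""
    paths = []
    for current, dirs, files in os.walk(root_dir):
        # Editing dirs in place tells os.walk not to descend into them.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        for name in files:
            paths.append(os.path.join(current, name))
    return sorted(paths)


# Usage: build a tiny working directory and list it.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("a.txt", os.path.join("src", "b.txt"), os.path.join(".pyvcs", "index")):
        full = os.path.join(tmp, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "w") as f:
            f.write("x")
    found = [os.path.relpath(p, tmp).replace(os.sep, "/") for p in list_files(tmp)]
    print(found)  # ['a.txt', 'src/b.txt']
```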
5. pyvcs log
( v/s Git )
- Starts from the HEAD commit.
- For each commit:
  - Reads the commit object from the objects directory.
  - Yields the commit hash and object.
  - Moves to the parent commit.
- Stops when there's no parent (root commit).
    def log(self):
        commit_hash = self._get_head_commit()
        # Follow parent links until the root commit, whose parent is None.
        while commit_hash:
            commit_obj = self._read_commit_object(commit_hash)
            yield commit_hash, commit_obj
            commit_hash = commit_obj.get('parent')

    def _read_commit_object(self, commit_hash):
        commit_path = os.path.join(self.objects_dir, commit_hash)
        commit_content = read_file(commit_path)
        return json.loads(commit_content.decode())
- It's a simple linear traversal of commits.
- It doesn't handle branches or merges.
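The same parent-following loop can be tried without any files on disk. In this sketch the commits live in a plain dictionary, and the hashes ("c1", "c2", "c3") and the `walk_history` name are made up for illustration.

```python
def walk_history(commits, head):
    """Yield (hash, commit) pairs from head back to the root commit."""
    commit_hash = head
    while commit_hash:
        commit_obj = commits[commit_hash]
        yield commit_hash, commit_obj
        commit_hash = commit_obj.get("parent")


# Hypothetical three-commit linear history.
commits = {
    "c1": {"message": "first", "parent": None},
    "c2": {"message": "second", "parent": "c1"},
    "c3": {"message": "third", "parent": "c2"},
}

messages = [obj["message"] for _hash, obj in walk_history(commits, "c3")]
print(messages)  # ['third', 'second', 'first']
```

Because each commit stores exactly one parent, the history is a singly linked list and the newest commit always comes out first.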
6. pyvcs diff
( v/s Git )
- Retrieves the file content from the last commit.
- Reads the current file content.
- Uses Python's difflib to generate a unified diff.
    def diff(self, file_path):
        last_commit_content = self._get_last_commit_file_content(file_path)
        if last_commit_content is None:
            return f"No previous commit for file {file_path}"
        # Split without keepends: lineterm='' and '\n'.join supply the line breaks,
        # so keeping the original newlines would double them in the output.
        current_content = read_file(file_path).decode().splitlines()
        last_commit_content = last_commit_content.decode().splitlines()
        diff_result = difflib.unified_diff(
            last_commit_content,
            current_content,
            fromfile=f"a/{file_path}",
            tofile=f"b/{file_path}",
            lineterm=''
        )
        return '\n'.join(diff_result)
    def _get_last_commit_file_content(self, file_path):
        commit_hash = self._get_head_commit()
        if not commit_hash:
            return None
        commit_obj = self._read_commit_object(commit_hash)
        # Files recorded in the last commit, keyed by repository-relative path.
        staged_files = commit_obj.get('files', {})
        rel_path = os.path.relpath(file_path, self.root_dir)
        if rel_path in staged_files:
            file_hash = staged_files[rel_path]
            object_path = os.path.join(self.objects_dir, file_hash)
            return read_file(object_path)
        return None
- It only shows differences between the working directory and the last commit.
- It doesn't handle staged changes separately.
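To see what difflib produces, here is a small standalone sketch with two made-up versions of a file called notes.txt. The lines are split without their trailing newlines so that `lineterm=""` and `"\n".join` together produce exactly one line break per diff line.

```python
import difflib

# Two hypothetical versions of the same file.
old = "alpha\nbeta\ngamma\n".splitlines()
new = "alpha\nBETA\ngamma\n".splitlines()

diff_lines = list(
    difflib.unified_diff(
        old,
        new,
        fromfile="a/notes.txt",
        tofile="b/notes.txt",
        lineterm="",
    )
)
print("\n".join(diff_lines))
# --- a/notes.txt
# +++ b/notes.txt
# @@ -1,3 +1,3 @@
#  alpha
# -beta
# +BETA
#  gamma
```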
How does Git implement the commands?
Obviously, Git's implementation is far more sophisticated and optimized. Git's design choices (object model, efficient indexing, and graph-based history representation) allow it to handle large-scale projects with complex histories.
I have tried to compile a list of key differences between PyVCS and Git. Each difference is also a possible improvement for this project.
Feel free to click the links to go back and forth to understand more.
1. git init
( v/s PyVCS )
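Git's init command creates a more complex structure inside the .git directory, including HEAD, config, objects, and refs (with refs/heads and refs/tags).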
Functions performed:
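- Create a HEAD file containing a symbolic reference to the default branch (for example, ref: refs/heads/master). The branch itself does not exist until the first commit.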
- Create a config file with repository settings.
- It doesn't create an initial commit, but it sets up the structure for branches and tags.
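The indirection through HEAD can be sketched as follows. This is a simplified model, not Git's implementation: `resolve_head` is a hypothetical function, it only looks at loose ref files, and it ignores details such as packed refs. The hash "0123abc" is a placeholder.

```python
import os
import tempfile


def resolve_head(git_dir):
    """Return (branch_ref, commit_hash_or_None) for a simplified Git-style layout."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if not head.startswith("ref: "):
        return None, head  # detached HEAD: the file holds a hash directly
    ref = head[len("ref: "):]
    ref_path = os.path.join(git_dir, *ref.split("/"))
    if not os.path.exists(ref_path):
        return ref, None  # the branch has no commits yet
    with open(ref_path) as f:
        return ref, f.read().strip()


with tempfile.TemporaryDirectory() as git_dir:
    # Right after init: HEAD names a branch that does not exist yet.
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/master\n")
    fresh = resolve_head(git_dir)

    # After a first commit, the branch file holds the commit hash.
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
        f.write("0123abc\n")
    after_commit = resolve_head(git_dir)
```

PyVCS skips this indirection: its refs/HEAD file stores the commit hash directly, which is why it has no notion of branches.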
2. git add
( v/s PyVCS )
Functions performed:
- Calculate a SHA-1 hash of the file content, prefixed with a short header giving the object type and size.
- Create a blob object in the objects database.
- Update the index (staging area) with the file information.
- Git uses a binary file format for its index, which is more efficient.
- Directory structures are represented by tree objects, which Git builds from the index at commit time rather than during add.
- It can handle partial file adds (hunks) using the index.
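Git's blob hash covers a header of the form "blob <size>" followed by a NUL byte, then the file content. The function below is a sketch of that calculation for SHA-1 repositories; `git_blob_hash` is an illustrative name, not a Git API.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Compute the SHA-1 object id Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_blob = git_blob_hash(b"")
hello_blob = git_blob_hash(b"hello world\n")
```

The header is the reason a Git blob id differs from a plain SHA-1 of the file, and the reason PyVCS hashes (plain SHA-256 of the content) are not comparable with Git's.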
3. git commit
( v/s PyVCS )
Functions performed:
- Create a tree object representing the current state of the index.
- Create a commit object containing:
  - Tree hash
  - Parent commit hash(es)
  - Author and committer information
  - Timestamp
  - Commit message
- Calculate a SHA-1 hash of the commit object.
- Update the current branch reference to point to the new commit.
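- Git stores each object as a typed header plus content, compressed with zlib, and can further delta-compress objects into packfiles, which is more space-efficient than plain JSON files.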
- It separates content (blobs), structure (trees), and metadata (commits).
- The commit hash is based on all this information, making it tamper-evident.
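As a rough illustration of that layout, the sketch below assembles a commit body, prepends the "commit <size>" header, hashes it with SHA-1 and compresses it with zlib. The function names and the author string are hypothetical, and details such as encoding headers or signatures are left out.

```python
import hashlib
import zlib


def build_commit(tree, parents, author, committer, message):
    """Assemble the text body of a Git-style commit object."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


def hash_object(obj_type, body):
    """Return (sha1_hex, zlib_compressed_bytes) for header + body."""
    store = obj_type.encode() + b" " + str(len(body)).encode() + b"\0" + body
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


# Hypothetical identity and timestamp; the tree id is Git's well-known empty tree.
who = "A U Thor <author@example.com> 1700000000 +0000"
body = build_commit("4b825dc642cb6eb9a060e54bf8d69288fbee4904", [], who, who, "Initial commit")
commit_id, compressed = hash_object("commit", body)
other_id, _ = hash_object(
    "commit",
    build_commit("4b825dc642cb6eb9a060e54bf8d69288fbee4904", [], who, who, "Changed message"),
)
```

Any change to the tree, parents, identities or message yields a different id, which is what makes the history tamper-evident.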
4. git status
( v/s PyVCS )
Functions performed:
- Compares the index with the HEAD commit to show staged changes.
- Compares the working directory with the index to show unstaged changes.
- Git uses optimized data structures and algorithms to make this operation fast, even for large repositories.
- It identifies untracked files.
- It handles renames, copies, and submodules.
- It respects .gitignore rules.
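The two comparisons can be modelled with three path-to-hash mappings. This is a conceptual sketch only: the `compare` function, the file names and the short hashes are invented, and real Git also consults file metadata, ignore rules and rename detection.

```python
def compare(old, new):
    """Classify paths by comparing two {path: hash} mappings."""
    changes = {}
    for path in sorted(set(old) | set(new)):
        if path not in old:
            changes[path] = "added"
        elif path not in new:
            changes[path] = "deleted"
        elif old[path] != new[path]:
            changes[path] = "modified"
    return changes


# Hypothetical snapshots of the three states.
head = {"a.txt": "h1", "b.txt": "h2"}
index = {"a.txt": "h1", "b.txt": "h3", "c.txt": "h4"}
worktree = {"a.txt": "h9", "b.txt": "h3", "c.txt": "h4", "d.txt": "h5"}

# HEAD vs index: what the next commit would record.
staged = compare(head, index)
# Index vs working directory: edits not yet staged.
unstaged = {p: s for p, s in compare(index, worktree).items() if s != "added"}
# Files the index has never seen.
untracked = sorted(set(worktree) - set(index))

print(staged)     # {'b.txt': 'modified', 'c.txt': 'added'}
print(unstaged)   # {'a.txt': 'modified'}
print(untracked)  # ['d.txt']
```

PyVCS performs only the second comparison, which is why it cannot tell staged changes apart from changes that are already committed.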
5. git log
( v/s PyVCS )
- Traverse complex commit graphs with multiple branches and merges.
- Filter and format the output in numerous ways.
- Show the evolution of specific files or directories.
- Git uses efficient graph traversal algorithms.
- It can handle very large histories quickly due to its object model and indexing.
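Once commits can have several parents, the history is a graph and a traversal has to remember what it has already visited. The sketch below shows that idea with a depth-first walk over a made-up four-commit history; it does not reproduce Git's ordering, which by default sorts commits by date.

```python
def walk_graph(commits, head):
    """Visit every commit reachable from head exactly once (depth-first)."""
    seen = set()
    stack = [head]
    order = []
    while stack:
        commit_hash = stack.pop()
        if commit_hash in seen:
            continue  # reached again through another parent
        seen.add(commit_hash)
        order.append(commit_hash)
        stack.extend(commits[commit_hash]["parents"])
    return order


# Hypothetical history: two branches from "root" joined by a merge commit.
commits = {
    "root": {"parents": []},
    "left": {"parents": ["root"]},
    "right": {"parents": ["root"]},
    "merge": {"parents": ["left", "right"]},
}

order = walk_graph(commits, "merge")
```

Without the `seen` set, "root" would be visited twice, once through each branch.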
6. git diff
( v/s PyVCS )
- Show differences between any two commits, branches, or trees.
- Show staged changes (diff --cached).
- Git uses optimized diff algorithms (Myers by default, with patience and histogram available as options) that can handle large files and repositories.
- It can show renames and copies as such, rather than as full file additions and deletions.
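A simplified way to think about rename detection: if a path disappears and another path appears with the same content hash, treat the pair as a rename. The sketch below covers only this exact-match case; Git additionally scores partially changed files by similarity, which is not modelled here. The function name and the sample data are hypothetical.

```python
def detect_exact_renames(old, new):
    """Pair deleted and added paths that share the same content hash."""
    deleted = {p: h for p, h in old.items() if p not in new}
    added = {p: h for p, h in new.items() if p not in old}

    # Group deleted paths by hash so each can be matched at most once.
    by_hash = {}
    for path, h in sorted(deleted.items()):
        by_hash.setdefault(h, []).append(path)

    renames = []
    for path, h in sorted(added.items()):
        if by_hash.get(h):
            renames.append((by_hash[h].pop(0), path))
    return renames


# Hypothetical snapshots: README.txt was renamed, main.py is unchanged.
old = {"README.txt": "h1", "main.py": "h2"}
new = {"README.md": "h1", "main.py": "h2"}

renames = detect_exact_renames(old, new)
print(renames)  # [('README.txt', 'README.md')]
```

Since PyVCS already stores a hash per path in every commit, a check like this could be added to its diff or status without changing the storage format.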