Git For the Modern Data Scientist: 9 Git Concepts You Can’t Ignore | by Bex T.

Explained with striking visuals

Introduction

Most data scientists feel like fish out of water when it comes to Git. There are software engineers who talk about nothing but Git-things, and there are data scientists who say “Huh?” (I wish I could add a sound to this) every time.

That stops today! Since Git is an essential tool for collaboration, I will break down nine of the most critical Git concepts that data scientists must know like the back of their hand.

I can promise that you won’t be nodding your head in fake understanding the next time someone talks about Git or version control.

Let’s get started!

For the 1000th time…

You may have heard it a few hundred times already, but I will err on the side of caution and say it for the few hundred and first time:

Git is one of the most critical tools in developing ML and AI systems.

If your idea of a machine learning or data science project involves models cooked up in notebooks with creatively named files such as “notebook1”, “notebook2”, “notebook_final”, and “notebook_final_final”, then don’t bother with Git.

However, if you aim to deploy models that others can use without migraines, Git is a relatively small price to pay.

Git allows you to keep track of changes to your code and data, collaborate with others, and maintain a history of your project. With Git, you can easily revert to a previous version of your work, compare different versions, and merge changes made by multiple contributors.

Moreover, Git easily integrates with other popular MLOps tools like DVC for data version control, making it an essential tool for data scientists.

0. Repository

Basically, a repository is this:

It is a folder on your machine. It can have no files, three files, or a hundred. The only thing needed to convert that folder into a Git repository is to call git init inside it.

A machine learning repository usually has folders to store data, models, and code for loading, cleaning, and transforming data, as well as selecting, training, and saving models for deployment.

There will be other miscellaneous files, such as the .git folder for Git internals and metadata files.

All of these make up a single repository, and Git is usually enough to track them (except for data and models; for those, see this article afterward).
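To make this concrete, here is a minimal sketch of turning a folder into a repository. The project name and the data/models/src layout are only assumptions for illustration; your own project can look however you like.

```shell
# Create a throwaway project folder (hypothetical name and layout).
project="$(mktemp -d)/churn-model"
mkdir -p "$project/data" "$project/models" "$project/src"
touch "$project/src/clean.py" "$project/src/train.py"

cd "$project"
git init -q                            # turns the folder into a Git repository

ls -A                                  # .git now sits next to data, models and src
git rev-parse --is-inside-work-tree    # prints "true" inside a repository
```

The only visible change after git init is the new .git folder. Everything Git knows about the project lives in there.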

1. Tracked, untracked

When you initialize Git inside a directory, by default, any existing or new files/directories you create will be untracked by Git.

Image by me. Showcase of the `git status` command

This means that any future changes you make to them will also be untracked. Therefore, you need to put those files under Git supervision by running git add path/to/file.py.

Image by me. Tracking files and directories in Git.

After calling git add on files, they will be under Git-watch.

If you wish to add all files in the repository (although this is highly unlikely), you can call git add . (the trailing dot stands for the current directory).

There are also cases where you never want files to be tracked by Git. This is when you create a .gitignore file.

As the name suggests, files added to .gitignore won’t be tracked or indexed by Git for as long as they are there. This only applies to untracked files: a file that Git already tracks stays tracked until you remove it from the index with git rm --cached. Typical items you should add to .gitignore for data projects are large data files like CSVs, parquets, images, videos, or audio. Git has historically been terrible at handling those.

It handles the rest like a champ.

P.S. You can create a .gitignore file in the terminal with touch .gitignore and add files/folders to it with echo "filename" >> .gitignore on new lines.
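Putting this section together, the sketch below walks through untracked files, git add, and .gitignore in a scratch repository. The file names (src/train.py, data/raw.csv) are made up for the example.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q

mkdir data src
echo "print('training')" > src/train.py
echo "id,label" > data/raw.csv

git status --short                   # data/ and src/ are listed as untracked (??)

# Keep bulky data out of Git, then start tracking the code and the ignore file.
echo "data/" >> .gitignore
git add src/train.py .gitignore

git status --short                   # the two added files are staged (A); data/ is gone from the list
git check-ignore -v data/raw.csv     # reports which .gitignore rule matched the file
```

Committing the .gitignore file itself is a common choice, so everyone who works on the repository ignores the same files.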

2. Commit

A Git commit is a precious thing. The entire idea of version control is based on it.

When you call git commit inside a Git repository, you take a snapshot of every Git-tracked file for that specific point in time. Think of it like a time capsule with contents (versions) of your project from different periods.

All the commits you make will form your Git history or Git tree, as shown below.

A good Git tree organizes the linear progression of your repository. By breaking down your code changes into discrete, well-defined commits, you can map out the progress of your repository almost like a book.

Then, you can browse through the pages of this Git book through commits.

Just like a writer puts a lot of effort into writing each page of their book, you should treat your commits with care.

You shouldn’t be making commits for the sake of committing. Consider them as little pieces of history, and know that future versions of yourself and other developers should look at them with delight, rather than disgust.

Traditional advice: A good commit has an informative message describing the changes made.

Some common scenarios to commit in a typical machine learning project:

Implementing a new feature: writing code that adds a new functionality like a new function, class, class method, training a new model, new data cleaning operation, etc.
Fixing a bug: documenting bug fixes to existing functions, methods, and classes
Improving performance: writing code that enhances an existing feature like optimizing blocks of code
Updating docs and dependencies
Machine learning experiments: in a project, you will run dozens of experiments to choose and tune the best model. Each model run should be tracked as a commit.

3. Staging area

By talking about commits, we have got ahead of ourselves. Before closing the cap of the commit capsule, you have to make sure the contents within are right.

This involves telling Git exactly which changes from which files you want to commit. Sometimes, new changes might come from several files and you may only want to commit some of them and leave the rest for future commits.

This is where we lift the curtains and reveal the staging area (pun intended):

Image by me. The staging area is changed after the changes in train.py are added.

The idea is that you must have some way of double-checking, editing, or undoing the changes you want to add to your Git history before you press that commit button.

Adding the new changes to the staging area (or Git index as some kids say it) allows you to do that. The area holds the changes you want to include in the next commit.

Let’s say you changed both clean.py and train.py. If you add the changes in train.py with git add train.py to the staging area, the next commit will only include that change.

The modified clean.py will stay as is (uncommitted).

Image by me. The image above reshown for clarity.

So, here is an easy workflow for you:

Track new files with Git (only done once)
Add changes in tracked files to the staging area with git add changed_file.extension
Commit the changes in the staging area to history with git commit -m "Commit message".
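The three steps above can be sketched as follows, using the clean.py and train.py example from this section. The file contents, commit messages, and the demo identity are placeholders.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"            # placeholder identity for this scratch repo
git config user.email "demo@example.com"

# 1. Track the new files (done once) and record a first snapshot.
echo "# cleaning code" > clean.py
echo "# training code" > train.py
git add clean.py train.py
git commit -q -m "Add cleaning and training scripts"

# Edit both files.
echo "# drop duplicate rows" >> clean.py
echo "# raise n_estimators" >> train.py

# 2. Stage only the change in train.py.
git add train.py
git diff --staged --name-only    # train.py: what the next commit will contain
git diff --name-only             # clean.py: modified, but not staged

# 3. Commit what is staged.
git commit -q -m "Tune n_estimators in train.py"
git status --short               # clean.py is still listed as modified
```

Running git diff --staged before committing is the double-check this section talks about: it shows exactly what is about to go into history.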

4. Hashes and tags

Apart from messages, all Git commits have hashes so you can point to them more easily.

Image by me. Three sample commits with 7-character hashes.

A hash is a string of 40 hexadecimal characters (digits 0–9 and letters a–f) that gives each commit a unique ID, like 1a3b5c7d9e2f4a6b8c0d1e2f3a4b5c6d7e8f9a0b.

They make switching between commits (different versions of your code base) much easier with git checkout HASH. You don’t have to write the full hash when switching. Only the first few characters of the hash that make it unique are enough.

You can list all the commits you’ve made with their hashes using git log (this also shows the author, date, and message of each commit).

To list only the hash and the message without cluttering up your screen, you can use git log --oneline.

Image by me. The command to list your Git log line-by-line.

If hashes intimidate you, there are also Git tags. A Git tag is a friendly nickname you can give to some important commits (or any) to remember and refer to them even more easily.

Image by me. Four commits with two of them tagged.

You can use the command “git tag” to assign tags to specific commits that are important, such as those containing a crucial feature or a significant code base release (e.g., v1.0.0). Additionally, you can tag a commit that represents your best model, such as “random_forest_best”.

Think of tags as little human-readable milestones that stand out among all the commit hashes.

To clarify, the command git tag tag_name will only add a tag to the commit you are currently on (usually the last one). If you want to add a tag to a specific commit, you need to specify the commit hash at the end of the command, after the tag name: git tag tag_name HASH.
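Here is a small sketch of hashes and tags in action. The commit messages and the demo identity are invented; the tag names are the ones used as examples above.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"            # placeholder identity for this scratch repo
git config user.email "demo@example.com"

echo "v1" > model.txt && git add model.txt && git commit -q -m "Train baseline model"
echo "v2" > model.txt && git commit -q -am "Train random forest"
echo "v3" > model.txt && git commit -q -am "Refactor training loop"

git log --oneline                            # abbreviated hash + message, newest first

full_hash="$(git rev-parse HEAD)"            # full hash of the latest commit
rf_hash="$(git rev-parse --short HEAD~1)"    # abbreviated hash of the commit before it
echo "$full_hash"

git tag v1.0.0                               # no hash given: tags the current commit
git tag random_forest_best "$rf_hash"        # hash given: tags that specific commit

git tag                                      # lists both tags
git log --oneline --decorate                 # tags appear next to their commits
```

Once a tag exists, you can use it anywhere a hash is accepted, for example git checkout random_forest_best.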

5. Branch

After commits, branches are the bread and butter of Git. 99% of the time, you will be working inside a Git branch.

By default, the branch you are on when you initialize Git inside a folder will be named either main or master.

You can think of other branches as alternate realities of your code base.

By creating a Git branch, you can test and experiment with new features, ideas, and fixes without fearing you will mess up your code base.

For example, you can test a new algorithm for a classification task in a new branch without disrupting the main code base:

Image by me. Creating the new SGD branch.

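Git branches are very cheap. A branch is just a movable pointer to a commit, so when you call git branch new_branch_name, Git creates a new pointer at the commit you are currently on without duplicating any of the files. Note that git branch only creates the branch. To start working on it, switch to it with git switch new_branch_name (or create and switch in one step with git switch -c new_branch_name).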

After creating a new branch and experimenting with your fresh ideas, you have the option to delete the branch if the results do not seem promising. On the other hand, if you are content with the changes made in the new branch, you can merge it with the master branch.
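The sketch below plays out the SGD experiment from this section in a scratch repository. The branch name sgd, the file contents, and the demo identity are assumptions for the example.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"            # placeholder identity for this scratch repo
git config user.email "demo@example.com"

echo "model = 'logistic_regression'" > train.py
git add train.py && git commit -q -m "Add baseline classifier"
base="$(git branch --show-current)"         # main or master, depending on your Git config

git branch sgd                              # creates the branch; you are still on $base
git switch -q sgd                           # now you are on the new branch

echo "model = 'sgd_classifier'" > train.py
git commit -q -am "Try SGD classifier"

git switch -q "$base"
cat train.py                                # still the logistic regression version

# Happy with the experiment: merge it, then remove the leftover branch.
git merge -q sgd
git branch -d sgd
cat train.py                                # now the SGD version
```

If the experiment had not worked out, you would skip the merge and throw the branch away with git branch -D sgd.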

6. HEAD

A Git repository can have several branches and hundreds of commits. So you might raise the excellent question “How does Git know which branch or commit you are at?”.

The answer is a special pointer called HEAD.

Basically, the HEAD is you. Wherever you are, HEAD follows you in Git. 99% of the time, HEAD will be pointing to the latest commit in the current branch.

If you make a new commit, HEAD will move on to that. If you switch to a new or an old branch, HEAD will switch to the latest commit in that branch.

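One use-case for HEAD is when comparing changes in different commits to each other. For example, calling git diff HEAD~1 HEAD will compare the latest commit to the commit immediately before it. (Plain git diff HEAD~1 compares your current working directory, uncommitted changes included, to that earlier commit.)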

This also means that HEAD~n syntax in Git refers to the nth commit before wherever the HEAD is.

You may also go into the dreaded detached HEAD state. This doesn’t mean Git has lost track of you and doesn’t know where to point.

A detached HEAD state occurs when you use the command git checkout HASH to check out a specific commit, instead of using git checkout branch_name. This forces the HEAD to no longer point to the tip of a branch, but rather to a specific commit somewhere in the middle of the commit history.

Any commits you make in the detached HEAD state won’t belong to any branch. Once you switch away, nothing points to them anymore, so they drop out of your visible Git history and Git may eventually garbage-collect them. The reason is that HEAD is, well, the head of branches. It strongly fancies attaching itself to branch tips or heads, not its stomach or legs.

So, if you want to make changes in a detached HEAD state, you should call git switch -c new_branch to create a new branch at the current commit. This gets you out of the state and moves the HEAD.
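To see HEAD move around, here is a minimal sketch in a scratch repository with three commits. The branch name revisit-experiment-2, the commit messages, and the demo identity are made up for the example.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"            # placeholder identity for this scratch repo
git config user.email "demo@example.com"

for i in 1 2 3; do
  echo "experiment $i" > notes.txt
  git add notes.txt
  git commit -q -m "Experiment $i"
done

git rev-parse --abbrev-ref HEAD             # prints the branch name: HEAD is attached
git diff --stat HEAD~1 HEAD                 # what the latest commit changed
git log --oneline -1 HEAD~2                 # two commits back: "Experiment 1"

git checkout -q HEAD~1                      # check out a commit rather than a branch
git rev-parse --abbrev-ref HEAD             # prints "HEAD": the detached state

git switch -q -c revisit-experiment-2       # new branch here; HEAD is attached again
echo "follow-up" >> notes.txt
git commit -q -am "Follow up on experiment 2"
git log --oneline --all --graph             # the new commit sits on its own branch
```

Because the follow-up commit was made on a branch, it stays reachable after you switch back to your original branch.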

Getting the hang of the HEAD will go a long way in helping you navigate any tangled Git tree.

7. Merge

So, what happens after you create a new branch?

Do you discard it if your experiment doesn’t pan out with git branch -D branch_name (capital -D, because the lowercase -d refuses to delete a branch that hasn’t been merged)? Or do you perform a fabled Git merge?

Basically, a Git merge is a fancy party where two or even more branches come together to create a single thicker branch.

When you merge branches, Git takes the code from each branch and combines them into a single cohesive code base.

If there are overlapping changes in the branches, e.g. both branches have changed lines 5–10 in train.py, Git raises a merge conflict.

A merge conflict is as nasty as it sounds. To resolve the conflict, you have to decide which branch’s changes you want to keep, or combine the two by hand.

Solving merge conflicts without swearing and boiling from the ears is a rare skill developed over time. So, I won’t talk much about them and will refer you to this excellent article from Atlassian.
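Still, it helps to see one tiny conflict from start to finish. The sketch below is a deliberately small, made-up case: two branches change the same line of train.py. The branch name lower-lr, the values, and the demo identity are assumptions.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"            # placeholder identity for this scratch repo
git config user.email "demo@example.com"

printf 'learning_rate = 0.1\n' > train.py
git add train.py && git commit -q -m "Add training config"
base="$(git branch --show-current)"

git switch -q -c lower-lr
printf 'learning_rate = 0.01\n' > train.py
git commit -q -am "Lower the learning rate"

git switch -q "$base"
printf 'learning_rate = 0.5\n' > train.py
git commit -q -am "Raise the learning rate"

# Both branches changed the same line, so the merge stops with a conflict.
git merge lower-lr || echo "merge stopped: conflict in train.py"

git status --short      # "UU train.py" marks the file as unmerged
cat train.py            # both versions, separated by <<<<<<<, ======= and >>>>>>> markers

# Resolve by writing the content you want to keep, then stage and commit.
printf 'learning_rate = 0.01\n' > train.py
git add train.py
git commit -q -m "Merge lower-lr: keep the lower learning rate"
```

In a real project you would usually edit the file between the conflict markers in your editor rather than overwrite it, but the stage-then-commit ending is the same.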

8. Stash

I tend to screw up a lot when coding. An idea strikes me; I try it out only to realize that it is rubbish.

In the beginning, I would foolishly erase the mess into oblivion but later regret it. Even though the idea was rubbish, it doesn’t mean I couldn’t use certain code blocks in the future.

Then, I discovered Git stashes and they quickly became one of my favorite Git features.

When you call git stash, Git saves both staged and unstaged changes to tracked files and hides them from the working directory. The files go back to how they looked at the last commit.

After you stash your changes, you can continue your work as usual. When you want to retrieve them again (anywhere), you can use the git stash apply or git stash pop command. Both restore the stashed changes to the working directory. The difference is that pop also removes the entry from the stash, while apply keeps it there so you can reuse it.

Note that the git stash command only saves changes made to tracked files and not untracked files. In order to stash both tracked and untracked files, you need to use the -u flag with the git stash command. Ignored files will not be included in the stash.
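Here is a short sketch of the stash commands from this section, run in a scratch repository. The file names and the demo identity are invented for the example.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"            # placeholder identity for this scratch repo
git config user.email "demo@example.com"

echo "# training code" > train.py
git add train.py && git commit -q -m "Add training script"

echo "# half-baked idea" >> train.py        # change to a tracked file
echo "scratch" > scratch.py                 # brand-new, untracked file

git stash -q                # hides the change to train.py only
git status --short          # "?? scratch.py": untracked files are left alone
git stash pop -q            # bring the change back and drop the stash entry

git stash -q -u             # -u also stashes the untracked scratch.py
git status --short          # prints nothing: the working directory is clean
git stash list              # one entry, stash@{0}

git stash apply -q          # restore the changes but keep the stash entry
git stash list              # still one entry; pop would have removed it
```

When you no longer need an entry that apply left behind, git stash drop removes it.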

9. GitHub

So, we come to the age-old question — what is the difference between Git and GitHub?

This is like asking the difference between a burger and a cheeseburger.

Git is a version control system that tracks repositories. On the other hand, GitHub is a web-based platform used to store Git-controlled repositories online.

Git really shines when its repositories are made online and hence, open for collaboration. If a repository is only on your local machine, people can’t work on it with you.

So, think of GitHub as a remote mirror of your local repo that people can clone, fork, and open pull requests against.
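You can get a feel for the "remote mirror" idea without a GitHub account. In the sketch below, a bare repository on disk stands in for the GitHub-hosted copy; that stand-in, the folder names, and the demo identity are all assumptions. With a real GitHub repository you would use its URL where the local path appears.

```shell
workdir="$(mktemp -d)"
cd "$workdir"

# A bare repository plays the role of the hosted copy (assumption for this demo).
git init -q --bare remote.git

git init -q local
cd local
git config user.name "Demo User"            # placeholder identity for this scratch repo
git config user.email "demo@example.com"
echo "# churn model" > README.md
git add README.md && git commit -q -m "Add README"

git remote add origin "$workdir/remote.git" # with GitHub, this would be the repository URL
git push -q -u origin HEAD                  # publish the current branch to the remote

# A collaborator gets their own full copy with git clone.
cd "$workdir"
git clone -q remote.git collaborator
git -C collaborator log --oneline           # the same history, now on "their" machine
```

Forks and pull requests are GitHub features built on top of this push and clone mechanism, so they do not show up in a plain Git demo like this one.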

And if these terms sound alien to you, stick around for my next article where I explain N (I don’t know how many right now) GitHub concepts that will clear the confusion right away.

Git For the Modern Data Scientist: 9 Git Concepts You Can’t Ignore | by Bex T. | May, 2023