Manish Barnwal

...just another human

git and github for data scientists

It has been close to a year since I shifted to a start-up, which incidentally got acquired a month after I joined. Before this I used to work at WalmartLabs, where we always wanted to use a version control system like git but it never took off properly. Now that I am working in this start-up, I have learned that just taking a course on git/github doesn't make you a master of the topic. And this is true for anything you have always wanted to learn. We tend to spend too much time on tutorials and hesitate to take the next step of actually applying what we learn. I recently read an excellent article along the same lines - Tutorial purgatory. I used to think that I understood git/github, but recently I have come across so many features that I have now accepted this - You know nothing, Manish Barnwal!

In today's post I will not talk about theory in detail. This post is more of a practical introduction to git/github. If you have never heard of git, you should probably go somewhere else to learn about it. Let us get started.

What is git?

Git is a version control system that allows you to capture snapshots of your code's progress while you are developing it. Your code will go through many versions before it reaches its final form. Git allows you to capture and commit these versions so that if the code breaks later, you can go back to an earlier working version and resume development from there.
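To make the idea of snapshots concrete, here is a minimal sketch you can run in a terminal. It works in a throwaway directory; the file contents, the commit messages and the "Demo User" identity are made up for illustration.

```shell
# Placeholder identity so that git can create commits (assumed values).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

# Work in a throwaway directory so nothing real is touched.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet .

# Version 1: a working script.
echo 'print("v1 works")' > my_code.py
git add my_code.py
git commit --quiet -m 'first working version'

# Version 2: a change that breaks the script (missing closing bracket).
echo 'print("v2 is broken"' > my_code.py
git add my_code.py
git commit --quiet -m 'refactor (introduces a bug)'

# Look at the history, then bring the file back as it was one commit ago.
git log --oneline
git checkout HEAD~1 -- my_code.py
cat my_code.py
```

After the last command the file holds the first version again. The restored file is only staged at this point, so you would commit it if you want to keep it.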

What is github?

Github is the server where the various versions of the code you have written get stored. I understand it as a web-version of git. Github allows other collaborators in the project to easily get the latest version of the code. If for some reason your laptop crashes, the git files on it will be lost. Github, however, keeps a copy of every version you have pushed to it, so anyone can easily clone the code repository to their local machine (laptop).

Cloning a repository

Repository is a fancy name for the code-files and directories in a project. Cloning a repository means getting a copy of the repository into your local directory. This creates a .git folder in your local copy that maintains all the commit logs, version-changes, directories, branches and other information. You can see the tree structure of the .git folder by typing tree .git/. If you are on Mac, you might have to install tree by typing brew install tree.

When you clone a repository, you not only get the code-files but also all the versions, all the history and all the commit messages that have been generated in the project ever since it started. So cloning is not just copying the code-files to your local. You can clone a repository by typing git clone <git-repository-url>.
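Here is a small sketch showing that a clone carries the history along with the files. Since a real Github URL would need network access, a local repository named source_repo (a made-up name) stands in for the hosted one; with a real project you would pass the repository URL to git clone instead of a path.

```shell
# Placeholder identity so that git can create commits (assumed values).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work_dir="$(mktemp -d)"
cd "$work_dir"

# Stand-in for a repository hosted on Github: a local repo with two commits.
git init --quiet source_repo
cd source_repo
echo 'print("hello")' > my_code.py
git add my_code.py
git commit --quiet -m 'add my_code.py'
echo 'print("hello again")' >> my_code.py
git commit --quiet -am 'extend my_code.py'
cd ..

# Clone it into a new directory called my_clone.
git clone --quiet source_repo my_clone

# The clone has the code-file, the .git folder and both commits.
ls -a my_clone
git -C my_clone log --oneline
```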

Stages in code development using git

Let us say you made changes to a code file named my_code.py. Once you are satisfied with your changes, you would want to commit them and push them to github so that the web-version also gets updated.

After you have made your code changes, you should always check the status of your local repo by typing git status. Once you have made changes to an existing file, you need to send it to a staging area by typing git add my_code.py. To stage a file is simply to mark its current contents for inclusion in the next commit. git add tells git to compress the file, create a hash of its contents, store it as an object in the repository and record it in the staging area (the index).

Once you have your file in the staging area, you would commit it with a message about the changes you made, something like this: git commit -m 'fixed the date-time issue'. And then you push these changes by typing git push.
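The whole cycle of status, add, commit and push can be sketched as follows. This is only an illustration under a few assumptions: a local bare repository called remote.git plays the role of Github, the identity is a placeholder, and the file contents are made up.

```shell
# Placeholder identity so that git can create commits (assumed values).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work_dir="$(mktemp -d)"
cd "$work_dir"

# Stand-in for Github: a bare repository holding one commit on master.
git init --quiet seed
cd seed
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
echo 'print("start")' > my_code.py
git add my_code.py
git commit --quiet -m 'initial commit'
cd ..
git clone --quiet --bare seed remote.git

# Your local copy of the project.
git clone --quiet remote.git local
cd local

# 1. Change the file and check what git sees.
echo 'print("date-time fixed")' >> my_code.py
git status --short            # lists my_code.py as modified

# 2. Stage, 3. commit, 4. push.
git add my_code.py
git commit --quiet -m 'fixed the date-time issue'
git push --quiet

# The remote now has the new commit.
git -C ../remote.git log --oneline master
```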

If everything ran without giving an error message, you should see the changes reflected on Github. Let us dig a little more into push and its counterpart, pull.

Push and Pull

Push and pull are actions you perform on the remote. But what is a remote? git allows you to manage code versions locally, but you need a way to pass on these changes to the outside world. A remote allows you to do so. A remote is a common repository that allows team members to exchange their code changes with each other. Generally, the remote is stored on a code-hosting service like Github or on an internal server.

From this excellent stackoverflow answer.

As you probably know, git is a distributed version control system. Most operations are done locally. To communicate with the outside world, git uses what are called remotes. These are repositories other than the one on your local disk which you can push your changes into (so that other people can see them) or pull from (so that you can get others changes). The command git remote add origin git@github.com:peter/first_app.git creates a new remote called origin located at git@github.com:peter/first_app.git. Once you do this, in your push commands, you can push to origin instead of typing out the whole URL.



Push allows you to add your local changes to the remote by typing git push <name of remote> <name of branch>. Pull allows you to get the latest changes from the remote to your local by typing git pull <name of remote> <name of branch>. In practice this usually becomes git pull origin master.

origin is the default name git gives to the remote you cloned from. And since master is the default branch, you can see how the command reduces to the simple form we find everywhere: git pull origin master.
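To see push and pull working together, here is a sketch with two team members sharing one remote. The clone directories alice and bob are made-up names, a local bare repository stands in for Github, and the identity is a placeholder.

```shell
# Placeholder identity so that git can create commits (assumed values).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work_dir="$(mktemp -d)"
cd "$work_dir"

# Stand-in for Github: a bare repository holding one commit on master.
git init --quiet seed
cd seed
git symbolic-ref HEAD refs/heads/master
echo 'print("start")' > my_code.py
git add my_code.py
git commit --quiet -m 'initial commit'
cd ..
git clone --quiet --bare seed remote.git

# Two team members clone the same remote; git names it origin for both.
git clone --quiet remote.git alice
git clone --quiet remote.git bob

# Alice commits and pushes to the master branch of origin.
cd alice
git remote -v
echo 'print("from alice")' >> my_code.py
git commit --quiet -am 'add a line from alice'
git push --quiet origin master
cd ..

# Bob pulls master from origin and receives her commit.
cd bob
git pull --quiet origin master
git log --oneline
```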

Let us now talk about branches in git.

Creating a branch in git

A branch in git allows you to work independently of the main project. If there is a feature you want to add to the main project, you create a separate branch and work on it. Creating a branch gives you a separate line of development that starts from the current state of the master branch in your local (git only creates a new pointer to the same commit; no files are copied). Now you can add your features and test new changes without affecting the master branch.

Let us understand how to create a branch in git.

To give an example, recently I had to work on checking the quality of data that we use for our recommendation project. The task was to assess the quality of the input data, create a few quality metrics and if there is any problem with the data, the task should fail and generate a reminder about the same.

Assessing the quality of the data is an independent task and so I created a new branch called data_validation. To create a branch in git type,

git branch <branch-name>
#  So in my case it was `git branch data_validation`.

Once a branch is created, you can have a look at all the branches in your local by typing git branch.

There will be an asterisk next to the branch you are currently on. Note that git branch <branch-name> only creates the branch; to switch to it, type git checkout <branch-name>. The default branch is master.
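A quick sketch of creating, listing and switching branches, run in a throwaway repository. The identity and the file contents are placeholders.

```shell
# Placeholder identity so that git can create commits (assumed values).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet .
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master

# A branch needs at least one commit to start from.
echo 'print("start")' > my_code.py
git add my_code.py
git commit --quiet -m 'initial commit'

git branch data_validation             # create the branch (you stay on master)
git branch                             # the asterisk is next to master
git checkout --quiet data_validation   # switch to the new branch
git branch                             # the asterisk is now next to data_validation
```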

Once your branch is created, you add the code for the feature you wanted to build. The branch ensures that you don't break anything in the main code-base. Once you have tested the code in your local and you're confident that it is working fine, you would want to get it reviewed by others in your team. How do you get it reviewed? This is where a PR comes into the picture. But before creating a PR, you have to push your local branch to the remote. Let us see how.

Pushing new local branch to remote Git

You have already made changes in the code - you have added your new code for the new feature. You can have a look at which files you have modified by typing git status. You can add the code files to the staging area by typing git add <file-name>. You can commit them with a short message describing the changes you have made by typing git commit -m 'your message'.

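You can now push your local branch to the remote by typing git push -u origin <branch-name>. The -u flag sets the remote branch as the upstream of your local branch, so that later a plain git push or git pull on this branch knows where to go. You can now see this branch in Github. This branch is now accessible to everyone in your team. Now, you want people in your team to review your code. We do so by creating a PR.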
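Put together, pushing a new local branch can be sketched like this. A local bare repository stands in for Github, and the file validate.py, its contents and the identity are made up for the example.

```shell
# Placeholder identity so that git can create commits (assumed values).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work_dir="$(mktemp -d)"
cd "$work_dir"

# Stand-in for Github: a bare repository holding one commit on master.
git init --quiet seed
cd seed
git symbolic-ref HEAD refs/heads/master
echo 'print("start")' > my_code.py
git add my_code.py
git commit --quiet -m 'initial commit'
cd ..
git clone --quiet --bare seed remote.git

# Your local copy: create the branch and switch to it.
git clone --quiet remote.git local
cd local
git branch data_validation
git checkout --quiet data_validation

# Add the new feature (validate.py is a hypothetical file), then stage and commit it.
echo 'print("checking data quality")' > validate.py
git status --short            # lists validate.py as untracked
git add validate.py
git commit --quiet -m 'add data quality checks'

# Push the branch and set its upstream in one go.
git push --quiet -u origin data_validation

# The branch now exists on the remote, and the local branch tracks it.
git -C ../remote.git branch
git branch -vv
```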

What is a PR?

A PR (short for Pull Request) is a way to get your work checked by others in your team so that they can give you feedback if there are any changes to be made to the code you have written in the new branch. There could be multiple rounds of reviews and changes before all the reviewers are satisfied with the code in the branch. Once all the reviewers are satisfied with the changes they requested, they will approve your PR. But how do you create a PR?

Create a pull request

Go to Github and select the branch that has your commits. To the right of the Branch menu, click New pull request. Type a title and description for your pull request. You also have the option to choose who you want your code reviewers to be. Click Create pull request and your PR is created.

The reviewers will make suggestions for changes to your code, and this may take a few rounds of discussion between you and your reviewers. Once your changes are approved, you can merge the PR into the master branch.

PR is approved. What next?

1. Merge pull-request from Github

Once your PR is approved by all the reviewers, you can merge it with the master by going to that particular PR in Github and clicking on the Merge pull request option. If there are no conflicts, your branch will successfully get merged with the master - meaning whatever code changes you made in your newly created branch are now part of the master.

2. git checkout master from git

Now, go back to git on your command line and do git checkout master to move to the master branch.

3. git pull

Do a git pull to get the latest changes from the master branch on Github to your local git.
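The three steps can be tried out locally with the sketch below. Clicking Merge pull request happens on Github and cannot be reproduced from the command line here, so as a stand-in a second clone (named merger, a made-up name) merges the branch into master and pushes the result. The remote, the file validate.py and the identity are placeholders as well.

```shell
# Placeholder identity so that git can create commits (assumed values).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work_dir="$(mktemp -d)"
cd "$work_dir"

# Stand-in for Github: a bare repository holding one commit on master.
git init --quiet seed
cd seed
git symbolic-ref HEAD refs/heads/master
echo 'print("start")' > my_code.py
git add my_code.py
git commit --quiet -m 'initial commit'
cd ..
git clone --quiet --bare seed remote.git

# Your clone: a feature branch with one commit, pushed to the remote.
git clone --quiet remote.git local
cd local
git checkout --quiet -b data_validation   # create the branch and switch to it in one step
echo 'print("checking data quality")' > validate.py
git add validate.py
git commit --quiet -m 'add data quality checks'
git push --quiet -u origin data_validation
cd ..

# Stand-in for step 1 (Merge pull request on Github):
# another clone merges the branch into master and pushes it.
git clone --quiet remote.git merger
cd merger
git merge --quiet --no-ff -m 'Merge pull request: data_validation' origin/data_validation
git push --quiet origin master
cd ..

# Steps 2 and 3 on your machine.
cd local
git checkout --quiet master
git pull --quiet
ls                      # validate.py is now part of master
git log --oneline
```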

git tag

If you go to a repo on github.com you will see a tab called releases. You will see various release-numbers there. This is done so that you have versions that you can go back to if a release fails.

Typing git tag on the command line also gives you the list of all tags (releases) for the repository.

Normally a release number looks something like this: 2.11.1. So there are three sets of numbers to look at, each separated by a period: major, minor and bug-fixes. If you want to give a tag number to your release, look for the most recent tag number and update it according to the framework below.

How to update a tag (release number)?

Given a version number MAJOR.MINOR.PATCH, increment the:

  • MAJOR version when you make incompatible API changes,

  • MINOR version when you add functionality in a backwards-compatible manner, and

  • PATCH version when you make backwards-compatible bug fixes.
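The rules above can be sketched as a small shell function. bump_version is a hypothetical helper written for this post, not a git command. It also resets the lower parts to zero when a higher part is incremented, which is what the Semantic Versioning convention prescribes, and it assumes a plain MAJOR.MINOR.PATCH number without a prefix such as rel/.

```shell
# bump_version <version> <major|minor|patch>
# Hypothetical helper: prints the next version number.
bump_version() {
  version="$1"
  part="$2"
  major="${version%%.*}"   # text before the first period
  rest="${version#*.}"     # text after the first period
  minor="${rest%%.*}"
  patch="${rest#*.}"
  case "$part" in
    major) major=$((major + 1)); minor=0; patch=0 ;;
    minor) minor=$((minor + 1)); patch=0 ;;
    patch) patch=$((patch + 1)) ;;
    *) echo "unknown part: $part" >&2; return 1 ;;
  esac
  echo "$major.$minor.$patch"
}

bump_version 2.11.1 patch   # prints 2.11.2
bump_version 2.11.1 minor   # prints 2.12.0
bump_version 2.11.1 major   # prints 3.0.0
```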

So how do you add a tag?

Depending on whether you have made major, minor or patch level changes, you can add a tag by typing

git tag <tag-number>
#  For eg. `git tag rel/2.11.0`

You can additionally add a message to the tag something like this:

git tag rel/2.11.0 -m 'Code changes to include new features'
#  This message will appear on Github alongside this tag.

To push this tag to the remote repo, type git push origin <tag-number>. If you want to push all of your tags at once, git push origin --tags will work.
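Here is a sketch of creating tags and pushing them, again with a local bare repository standing in for Github. The tag names follow the rel/ pattern used above; the second tag, the messages and the identity are made up.

```shell
# Placeholder identity so that git can create commits and tags (assumed values).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work_dir="$(mktemp -d)"
cd "$work_dir"

# Stand-in for Github: an empty bare repository.
git init --quiet --bare remote.git

# A local repository with one commit, connected to that remote.
git init --quiet local
cd local
echo 'print("release me")' > my_code.py
git add my_code.py
git commit --quiet -m 'initial commit'
git remote add origin "$work_dir/remote.git"

# Tag the current commit with a message and push just this tag.
git tag -a rel/2.11.0 -m 'Code changes to include new features'
git tag                              # lists rel/2.11.0
git push --quiet origin rel/2.11.0

# Add another tag and push every tag that the remote does not have yet.
git tag -a rel/2.11.1 -m 'Bug fix'
git push --quiet origin --tags

# Both tags are now on the remote.
git ls-remote --tags origin
```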

I will keep visiting this post to add more of my notes and learnings on git/github. Thanks for reading. Keep learning!

