Manish Barnwal

...just another human

git and github for data scientists

It has been close to a year since I moved to a start-up, which incidentally got acquired a month after I joined. Before this I worked at WalmartLabs, where we always wanted to use a version control system like git but it never really took off. At the start-up I learned that just taking a course on git/github doesn't make you a master of the topic. This is true of anything you have always wanted to learn: we tend to spend too much time on tutorials and hesitate to take the next step of actually applying them. I recently read an excellent article along the same lines - Tutorial purgatory. I used to think that I understood git/github, but recently I have come across so many features I did not know that I have accepted this - You know nothing, Manish Barnwal!

In today's post I will not talk about theory in detail. This post is more of a practical introduction to git/github. If you have never heard of git, you should probably go somewhere else to learn about it first. Let us get started.

What is git?

Git is a version control system that allows you to capture snapshots of your code while you are developing it. Your code will go through many versions before you arrive at the final one. Git allows you to capture and commit these versions so that if the code breaks later, you can go back to an earlier working version and resume developing from there.

What is github?

Github is a hosting service for git repositories: the server where the various versions of the code you have written get stored. I understand it as a web-version of git.

Stages in code development using git

Let us say you made changes to a code file named my_code.py. Once you are satisfied with your changes, you would want to commit them and push them to github so that the web-version also gets updated.

After you have made your code changes, you should always check the status of your local repo by typing git status. Once you have changed a file (or created a new one), you need to send it to the staging area by typing git add my_code.py. Staging a file simply marks its current changes for inclusion in the next commit.

Once you have your file in the staging area, you commit it with a message describing the changes you made - something like this: git commit -m 'fixed the date-time issue'. The commit is recorded only in your local repo, so you then push it to github by typing git push.

If everything ran without giving an error message, you should have the changes reflected on Github.
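The steps above can be tried end to end without touching a real Github repo. The sketch below is only an illustration under some assumptions: a local bare repository stands in for Github, the sandbox paths and the "Demo User" identity are made up, and the first push uses -u so that a plain git push works afterwards.

```shell
#!/bin/sh
set -e

# Throwaway sandbox: a bare repository plays the role of the Github remote.
sandbox=$(mktemp -d)
git init --quiet --bare "$sandbox/remote.git"
git init --quiet "$sandbox/work"
cd "$sandbox/work"
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is called master
git config user.name "Demo User"           # placeholder identity, used only in this sandbox
git config user.email "demo@example.com"
git remote add origin "$sandbox/remote.git"

# Change a file, then inspect, stage, commit and push.
echo "print('hello')" > my_code.py
git status --short                         # my_code.py is listed as not yet tracked
git add my_code.py                         # send the change to the staging area
git commit --quiet -m 'fixed the date-time issue'
git push --quiet -u origin master          # -u records the upstream for later plain `git push`
```

After this runs, the commit exists both in the working repo and in the stand-in remote, which is what "reflected on Github" amounts to.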

Creating a branch in git

A branch in git allows you to work independently of the main project. If there is a feature you want to add to the main project, you create a separate branch and work on it there. Once the feature is complete, you get it reviewed by others by raising a PR (pull request).

Creating a branch gives you a separate line of development that starts from the current state of the master branch in your local repo, so you can add your features and test new changes without affecting the master branch.

What is a PR?

A PR is a way to get your work checked by others in your team so that they can give you feedback if any changes need to be made to the code you have written in the new branch. There could be several rounds of review and changes before all the reviewers are satisfied with the code in the branch. Once all the reviewers are satisfied that the changes they requested have been made, they will approve your PR.

Once the PR is approved, you can merge it with the master and the code you had written in the new branch becomes part of the master now.

To give an example, recently I had to work on checking the quality of data that we use for our recommendation project. The task was to assess the quality of the input data, create a few quality metrics and if there is any problem with the data, the task should fail.

Assessing the quality of the data is an independent task and so I created a new branch called data_validation. To create a branch in git, type git branch <branch-name>. So in my case it was git branch data_validation.

Once a branch is created, you can have a look at all the branches in your local repo by typing git branch.

There will be an asterisk next to the branch you are currently on. Creating a branch does not switch to it; to change the branch, type git checkout <branch-name>. The default branch is master.
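Here is a small sketch of the branch commands in a throwaway repo. The sandbox path, the README.md file and the "Demo User" identity are assumptions made for the illustration; only the branch name data_validation comes from my example above.

```shell
#!/bin/sh
set -e

# Throwaway repo with one commit, since a branch needs a commit to start from.
sandbox=$(mktemp -d)
git init --quiet "$sandbox/work"
cd "$sandbox/work"
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is called master
git config user.name "Demo User"           # placeholder identity, used only in this sandbox
git config user.email "demo@example.com"
echo "# recommendation project" > README.md
git add README.md
git commit --quiet -m 'initial commit'

git branch data_validation                 # create the branch; you are still on master
git branch                                 # list branches, asterisk is next to master
git checkout --quiet data_validation       # switch to the new branch
git branch                                 # asterisk is now next to data_validation
```

If you prefer a single step, git checkout -b <branch-name> creates the branch and switches to it in one go.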

PR is approved. What next?

1. Merge pull-request from Github

Once your PR is approved by all the reviewers, you can merge it with the master by going to that particular PR in Github and clicking on the Merge pull request option. If there are no conflicts, your branch will get merged with the master - meaning whatever code changes you made in your newly created branch are now part of the master.

2. git checkout master from git

Now, go back to git on your machine and do git checkout master to move to the master branch.

3. git pull

Do a git pull to get the latest changes from the master branch on Github to your local git.
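The three steps can be rehearsed locally. This is a rough sketch under stated assumptions: a bare repository stands in for Github, and because there is no Merge pull request button on the command line, a second clone (called github_side here, a made-up name) performs the merge and pushes it. The file validate.py and the "Demo User" identity are also invented for the example.

```shell
#!/bin/sh
set -e

# Stand-in for Github plus a local working repo with master pushed.
sandbox=$(mktemp -d)
git init --quiet --bare "$sandbox/remote.git"
git --git-dir="$sandbox/remote.git" symbolic-ref HEAD refs/heads/master
git init --quiet "$sandbox/work"
cd "$sandbox/work"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"           # placeholder identity, used only in this sandbox
git config user.email "demo@example.com"
git remote add origin "$sandbox/remote.git"
echo "print('hello')" > my_code.py
git add my_code.py
git commit --quiet -m 'initial commit'
git push --quiet -u origin master

# Feature work on its own branch, pushed so it can be reviewed.
git checkout --quiet -b data_validation
echo "print('quality checks')" > validate.py
git add validate.py
git commit --quiet -m 'add data quality checks'
git push --quiet -u origin data_validation

# Step 1 (simulated): the merge that the Github button would perform.
git clone --quiet "$sandbox/remote.git" "$sandbox/github_side"
git -C "$sandbox/github_side" config user.name "Demo User"
git -C "$sandbox/github_side" config user.email "demo@example.com"
git -C "$sandbox/github_side" merge --quiet --no-ff -m 'Merge pull request' origin/data_validation
git -C "$sandbox/github_side" push --quiet origin master

# Steps 2 and 3: move to master locally and bring it up to date.
git checkout --quiet master
git pull --quiet
```

After the pull, the local master contains validate.py, the file that was written on the data_validation branch.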

git tag

If you go to a repo on github.com you will see a tab called releases. You will see various release-numbers there. This is done so that you have versions that one can revert back to if a release fails.

Running git tag from the command line lists all the tags in the repository; the releases you see on Github are built on these tags.

Normally a release number looks something like this: 2.11.1. So there are three sets of numbers to look at, each separated by a period - major, minor, patch (bug-fixes). If you want to give a tag number to your release, look for the most recent tag number and update it according to the framework below.

How to update a tag (release number)?

Given a version number MAJOR.MINOR.PATCH, increment the:

  • MAJOR version when you make incompatible API changes,

  • MINOR version when you add functionality in a backwards-compatible manner, and

  • PATCH version when you make backwards-compatible bug fixes.
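To make the rule concrete, here is a small shell sketch. bump_version is a hypothetical helper I made up for illustration, not a git command. It assumes a plain MAJOR.MINOR.PATCH string with no prefix, and it resets the lower parts to zero when a higher part is incremented, as semantic versioning requires.

```shell
#!/bin/sh

# bump_version VERSION PART
# Hypothetical helper: prints VERSION with PART (major, minor or patch) incremented.
bump_version() {
    version=$1
    part=$2
    major=${version%%.*}        # text before the first period
    rest=${version#*.}          # text after the first period
    minor=${rest%%.*}
    patch=${rest#*.}
    case $part in
        major) major=$((major + 1)); minor=0; patch=0 ;;
        minor) minor=$((minor + 1)); patch=0 ;;
        patch) patch=$((patch + 1)) ;;
        *) echo "unknown part: $part" >&2; return 1 ;;
    esac
    echo "$major.$minor.$patch"
}

bump_version 2.11.1 patch   # prints 2.11.2
bump_version 2.11.1 minor   # prints 2.12.0
bump_version 2.11.1 major   # prints 3.0.0
```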

So how do you add a tag?

Depending on whether you have made a major, minor or patch level change, you can add a tag by typing git tag <tag-number>. For example, git tag rel/2.11.0

To push this tag to the remote repo, you have to type git push origin <tag-number>. If you want to push multiple tags, git push origin --tags will work.
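The tag commands can be tried in a sandbox as well. As before this is only a sketch under assumptions: a local bare repository stands in for Github, the "Demo User" identity is made up, and the second tag rel/2.11.1 is invented purely to show --tags pushing whatever has not been pushed yet.

```shell
#!/bin/sh
set -e

# Stand-in remote and a working repo with one pushed commit.
sandbox=$(mktemp -d)
git init --quiet --bare "$sandbox/remote.git"
git init --quiet "$sandbox/work"
cd "$sandbox/work"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"           # placeholder identity, used only in this sandbox
git config user.email "demo@example.com"
git remote add origin "$sandbox/remote.git"
echo "print('hello')" > my_code.py
git add my_code.py
git commit --quiet -m 'initial commit'
git push --quiet -u origin master

git tag rel/2.11.0                         # tag the current commit
git tag                                    # list the tags in the local repo
git push --quiet origin rel/2.11.0         # push this one tag

git tag rel/2.11.1                         # a second tag, only local so far
git push --quiet origin --tags             # push every tag not yet on the remote
```

Note that a tag stays in your local repo until you push it explicitly; a plain git push does not send tags.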

I will keep visiting this post to add more of my notes and learnings on git/github. Thanks for reading. Keep learning!

