Manish Barnwal

Brief introduction to SparkUI

2020-08-31T08:00:00-03:00

Recently, I've been working a lot with PySpark in AWS EMR. I have a huge data dump (~300 million users) that I needed to process and transform into the right format for further processing. The data is in an S3 bucket in AWS and I use AWS EMR to launch a cluster of r5.2xlarge instances to run the Spark application. We use the Hadoop YARN resource manager.

Once you launch a spark application, you get access to a corresponding Spark UI interface that gives you an option to look at the overall status of the jobs, tasks, executors and other details. In today's post, I will be talking about what to look for in the Spark UI when you launch a spark application. I've been using SparkUI a lot over the last few months and thought writing a post on this would be useful for others as well. Let us begin.

Spark overview

Before we proceed with the introduction to SparkUI, a quick refresher on Spark vocabulary will be useful. There are three main components of spark:

Driver program (sparkContext)
1. this is the heart of spark in a cluster - keeps live status of the application
2. if the driver process dies for some reason, say an OutOfMemory error, your spark application will be killed as well
3. contains the code that needs to be run
4. if there is only one machine/process you could monitor, it has to be the driver process
Cluster manager
1. acts as the bridge between the driver program and the worker nodes
2. also allocates resources to different spark applications
Worker instances
1. the instances/machines that form the cluster
2. they consist of executors, and these executors are the horses that do the actual code execution

Let us talk a little about the cluster managers. As I've just mentioned above, a cluster manager is the bridge between the sparkContext and the worker nodes and allocates resources to different spark applications. The four most commonly used cluster managers are:

In-built spark Standalone cluster manager
1. Good for starting but eventually you will have to graduate to other resource managers - Hadoop YARN or Apache Mesos
2. Not used by a lot of people because there are better alternatives available
Hadoop YARN (Yet Another Resource Negotiator)
1. Hadoop YARN is just a framework for job scheduling and resource management
2. Spark only uses YARN to get resources for its executors; it does not run on Hadoop MapReduce
Apache Mesos
1. Founded by the founders of Spark itself
2. Apache Mesos is considered very heavy-weight and should be used only if you already have Mesos configured in your large scale production system
Kubernetes cluster manager
1. This is open-source and is used a lot by people

SparkUI under the hood

When a spark application is launched within a sparkContext, there are a few processes running in the background that the sparkUI relies on. Some of them are:

LiveListenerBus: processes events as the application is live and kicking
JobProgressListener:
1. responsible for collecting all data that sparkUI uses to show statistics
2. the information that we see under Jobs tab in SparkUI is because of this listener
The sparkUI itself is optional: it is created only if spark.ui.enabled is set to true
EnvironmentListener, StorageStatusListener, RDDOperationsGraphListener
1. Likewise, each of the above listeners is responsible for other tabs in the SparkUI
  
I won't go deep into each of these listeners (because I do not truly understand their internals) but if you are interested you should read up on them. It is because of these SparkListeners that we are able to see the information on the jobs, executors and tasks in the Spark UI. So in essence, under the hood, Spark UI is a bunch of SparkListeners running and collecting all information for the jobs/executors.

SparkContext and spark applications

Each Spark application has its own sparkContext and its own corresponding Spark UI. The Spark UI for the first application can be accessed on port 4040, and successive applications on the same machine are given the next port numbers - 4041, 4042 and so on. If you are running Spark locally, you can access the Spark UI at http://localhost:4040. You can change the port number by setting spark.ui.port, for example in conf/spark-defaults.conf or with --conf when submitting the application. To look at the Spark UI after the application has finished, you can use the Spark history server, which rebuilds the UI from the application's event logs.

To persist the Spark UI data for the history server, set spark.eventLog.enabled to true (it defaults to false). The live Spark UI itself is controlled by spark.ui.enabled, which defaults to true.
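As a rough sketch, these properties can be passed with --conf when submitting the application. The helper below only assembles a spark-submit command and does not run it. The script name my_job.py, the port 4050 and the event-log directory are placeholders I made up for illustration, not values from my setup.

```python
def build_spark_submit(script, conf):
    """Assemble a spark-submit command with one --conf flag per property."""
    cmd = ["spark-submit"]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(script)
    return cmd


# Hypothetical values, for illustration only.
ui_conf = {
    "spark.ui.enabled": "true",                  # show the live Spark UI
    "spark.ui.port": "4050",                     # use 4050 instead of 4040
    "spark.eventLog.enabled": "true",            # write event logs for the history server
    "spark.eventLog.dir": "hdfs:///spark-logs",  # placeholder directory
}

command = build_spark_submit("my_job.py", ui_conf)
print(" ".join(command))
```

The same key/value pairs could instead go into conf/spark-defaults.conf, one property per line.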

A few examples of Spark applications:

A Spark application responsible for loading data from S3, processing and transforming it, and then saving it back to S3 for the next job to consume
Two jupyter notebooks running against your local Spark are two separate applications. Their Spark UIs will be available at http://localhost:4040 and http://localhost:4041

Applications run independently of each other. It is the responsibility of the cluster manager to allocate executors to each of these applications. A worker node can host many executors, and the same worker node can serve multiple applications, but each executor will be allocated to only one application. The cluster manager ensures that a single executor is never shared between applications.

Now before we start with the Spark UI, let us go through a few more Spark terms that will help us better understand what to look for in the Spark UI.

Spark vocabulary

Job
1. Whenever an action is called, a job is triggered
2. Examples of actions in Spark are collect, aggregations like count and sum, or writing data to S3
3. A job is broken into stages, the number of which depends on how many shuffle operations there are in the job
Stages
1. a job is broken into DAG (directed acyclic graph) of stages
2. a stage is a collection of similar tasks that can be performed together
3. stages can either depend on each other or be run independently. Stages that do not depend on each other can run in parallel
4. the driver keeps track of how the stages should interact with each other to finally get the job done
Task
1. A task is a unit of computation that is applied to a block of data (partition)
2. So, it is a combination of the operation and the block of data on which the operation will be applied
3. It is the only abstraction of job that interacts with the hardware - task is executed in an executor in a worker node
Executors
1. the processes in worker nodes that do the actual execution of the code
2. each executor is assigned tasks and keeps data in memory or on disk
3. think of the executors as the horses on the ground that do the actual code execution
Worker instances
1. the machines that form the cluster
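To make the link between shuffles and stages concrete, here is a toy sketch in plain Python. It is not how Spark's scheduler is implemented. It only assumes a linear chain of operations and a small, hand-picked set of operations that typically need a shuffle, and starts a new stage at every such operation.

```python
# Operations that typically need a shuffle (wide dependencies).
# This is an illustrative subset, not Spark's full list.
WIDE = {"groupByKey", "reduceByKey", "repartition", "distinct"}


def split_into_stages(operations):
    """Group a linear chain of operations into stages.

    A shuffle boundary starts a new stage, so a chain with
    k shuffles ends up with k + 1 stages.
    """
    stages = [[]]
    for name in operations:
        if name in WIDE and stages[-1]:
            stages.append([])  # the shuffle closes the previous stage
        stages[-1].append(name)
    return stages


job = ["textFile", "flatMap", "map", "reduceByKey", "filter", "repartition", "map"]
stages = split_into_stages(job)
for number, stage in enumerate(stages):
    print(f"stage {number}: {stage}")
```

Within each stage, Spark then runs one task per partition of the data.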

Navigating the Spark UI tabs

Now that we understand the basic vocabulary of Spark, we will go through each of the tabs in Spark UI and what to look for in them. The following are the six tabs available:

Jobs
1. This will contain all the jobs you run in your application - it also tells you how many stages and tasks were there in total in the job
Stages
1. This shows all the information around stages - both complete and running.
2. Tells you how long each stage is taking and how many tasks a stage is broken into
Storage
1. This will contain information about the data that you may have persisted/cached. If there is no data cached, then this will be empty.
Environment
1. Contains information about the spark configurations and some java configurations
Executors
1. This is where you will see detailed information on how your executors are doing with respect to task execution
2. It also gives a summary of the time taken by executors to complete the tasks
SQL
1. Here you will see the DAG of the query that you have executed (either via Dataframes API or SQL API) in the job.
  
Okay, I understand what Spark UI is but what do I do with it? Understanding Spark UI is one of the first steps toward optimising your spark job. The UI tells you if there is a particular stage which is taking a long time to run and whether you can do something about it. Having a good understanding of Spark UI is of profound importance if you want to monitor/debug your Spark jobs.
  
And with that I will end this post. Play around with the Spark UI, keep reading and this will become one of your favourite things about running spark jobs. I certainly keep an eye out for anything abnormal happening in the Spark UI when running spark jobs.

Why you should use logging instead of print statements?

2020-05-26T08:00:00-03:00

It is a common but wrong practice to add print statements inside your code to convey messages about the code/function on the standard output - at least that is what I used to do till last year. Last year, I got to know about logging and came to understand its benefits when we build platforms or work on larger projects. In this post, we will discuss logging. Let us begin with understanding a little about logging.

What is logging?

Logging, as the name suggests, is a way to display useful messages to the user of your code that add context about what is happening in the code. An example of useful logging could be this:

Imagine a scenario where you have code for model training in your project. A good logging practice is just before your model training you add a log message in your code that says - 'Beginning model training; Shape of input data: 10000, 50'. This extra logging code is really useful for the user who is running the project. And likewise once the model training is completed, another log message like 'Finished model training successfully; Time taken: 5 minutes' is super helpful.

Alright, we understand that logging is useful. The next question is how do we add these log messages? The easiest and wrong way of doing this is adding a print() statement with the message you want to display. We won't be doing it via print(); we will use Python's built-in 'logging' module. There are packages for other languages as well and the basic premise of how they work is very similar.

Let us first try to understand the different levels of logging and then we will continue to understand why logging should be preferred over print statements.

Logging in Python

We will use an inbuilt Python module called logging to do our logging in Python. Since logging module is built in, we do not need to install anything else to get started with logging in Python. Before we proceed with writing log messages using logging, let us try to understand a basic thing about logging - the different log levels.

Different levels in logging

When you write log messages there are different use cases and based on this we classify our log messages into various levels. The most common form of log message that I've come across is the one where we just want to add an informational message that a process has started or a process has run successfully. This form of log message is informational and hence comes under the INFO log level. In all there are five log levels - debug, info, warning, error, critical. Depending on what type of message and severity you want to log, you can use these log levels while logging the message.

The logging level is set up depending on the severity of the messages you want to display. There is an inherent order of severity in the logging levels. The default logging level is WARNING. Below are the logging levels in increasing order of severity and a rough template of when to use each level.

DEBUG: Set this as your logging level if you want to debug your program. All the levels above this are activated if you set DEBUG as your logging level i.e. messages in INFO, WARNING, ERROR, CRITICAL will all be logged
INFO: Set this as your logging level when you want to display normal informational logs about the completion of some tasks/processes in your program.
WARNING: Set this as your logging level if you are not interested in DEBUG and INFO logs and are only concerned about WARNING, ERROR, CRITICAL logs. WARNING messages flag things that will not stop your code from executing but may lead to errors in future, so make sure to check these warnings and understand what they are saying
ERROR: Set this as your log level if you are only interested in ERROR and CRITICAL messages, such as exceptions.
CRITICAL: Use this level for messages that are logged when something is critically wrong with the program and it cannot proceed further.
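To see this ordering in action, here is a minimal sketch. It sets a logger to WARNING and sends one message at each level. The logger name and the messages are made up for the demo, and the output goes to an in-memory buffer only so that it is easy to inspect.

```python
import io
import logging

# Send records to an in-memory buffer so the result is easy to inspect.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))

log = logging.getLogger("levels_demo")  # hypothetical logger name
log.setLevel(logging.WARNING)           # same threshold as the default
log.addHandler(handler)
log.propagate = False                   # keep the demo away from the root logger

log.debug("loading config")          # below WARNING, dropped
log.info("model training started")   # below WARNING, dropped
log.warning("column has nulls")      # logged
log.error("could not read file")     # logged
log.critical("out of memory")        # logged

print(buffer.getvalue())
```

Changing the level to logging.DEBUG would let all five messages through.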

Where to log?

Now that we have an understanding of different log levels and their potential use-cases, we will now look at where we can put these log messages. You can direct the log messages to:

standard output i.e. the messages will be displayed on the screen of the user
text files so that they can be used later for RCA (Root Cause Analysis)

I've mostly logged messages to screen but as the project becomes gigantic we should direct the logs to files to be used later for analysis.
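Both destinations can be used at the same time by attaching two handlers to one logger. The sketch below is one way to do it, not the only one. The logger name is made up and the log file is written to a temporary directory as a placeholder for a real log path.

```python
import logging
import os
import sys
import tempfile

# Placeholder location; a real project would use a fixed log path.
log_path = os.path.join(tempfile.mkdtemp(), "app.log")

formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

console = logging.StreamHandler(sys.stdout)  # messages on the screen
console.setFormatter(formatter)

to_file = logging.FileHandler(log_path)      # same messages kept for later analysis
to_file.setFormatter(formatter)

log = logging.getLogger("destinations_demo")  # hypothetical logger name
log.setLevel(logging.INFO)
log.addHandler(console)
log.addHandler(to_file)
log.propagate = False

log.info("Finished model training successfully")
```

The %(asctime)s field adds a timestamp to every line, which helps when reading the file later.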

How to set up logging in Python?

To illustrate this, I will use the below code snippet

import logging

# logging level set to INFO
logging.basicConfig(format='%(message)s',
                    level=logging.INFO)

LOG = logging.getLogger(__name__)

Note that the logger is assigned to LOG so that we can use it to log messages at various levels. You can set up a basic config for your log messages that takes inputs like:

format: the format in which you want your log messages; you can add date as well here

level: the logging level you want to set; if set to INFO, DEBUG messages won't be logged even if you have added them in your program, while INFO, WARNING, ERROR and CRITICAL messages will be

def run(self):
    with self.input().open('r') as infile:
        company_stats_embeddings = pd.read_csv(infile)
        LOG.info('--- Successfully loaded merged company stats and embeddings data ---')
        LOG.info(f'--- No. of companies in data: {company_stats_embeddings.shape[0]} ---')

    df_normalized = self.scale_df(company_stats_embeddings)

    LOG.info('--- Starting k-means clustering ---')
    kmeans_model = KMeansClustering(n_clusters=self.n_clusters)
    kmeans_model.fit(df_normalized)
    kmeans_cluster = kmeans_model.predict(df_normalized)

    company_cluster = company_stats_embeddings.copy()
    company_cluster['cluster'] = kmeans_cluster

    LOG.info('--- Distribution of clusters ---')
    LOG.info(f'--- \n{company_cluster.cluster.value_counts()} \n ---')

    LOG.info('--- Dumping generated clusters ---')
    with self.output().open('w') as outfile:
        company_cluster.to_csv(outfile, index=False)

As you can see, I've added sufficient log messages so that it is easier for the user to understand what is happening in the code while running it.

import pandas as pd
import requests as rq
from bs4 import BeautifulSoup as bs


def scrape_data(companies_list, type_data='stand_alone'):
    final_basic_stats_list = []
    for company in companies_list:
        if type_data == 'stand_alone':
            url = f'https://www.screener.in/company/{company}'
        elif type_data == 'consolidated':
            url = f'https://www.screener.in/company/{company}/{type_data}'
        try:
            response = rq.get(url)
            soup = bs(response.text, "html.parser")  # parse the html page
            basic_features_soup = soup.find_all(class_='row-full-width')
            basic_features_list = basic_features_soup[0].find_all(class_='four columns')
            basic_stats = [f.get_text() for f in basic_features_list]

            basic_stats = [f.lower().strip().replace('\n', '').replace('  ', '').replace(' ', '_') for f in basic_stats]
            company_stats_dict = dict()
            company_stats_dict['symbol'] = company
            for f in basic_stats:
                s = f.split(":")
                if len(s) == 2:
                    company_stats_dict[s[0]] = s[1]
            final_basic_stats_list.append(list(company_stats_dict.values()))
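        except IndexError:
            # LOG.exception logs at ERROR level and appends the traceback;
            # the loop then moves on to the next company
            LOG.exception(f'--- Error in scraping {company} company data. Continue to scrape. ---')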

    company_stats_df = pd.DataFrame(final_basic_stats_list,
                                    columns=company_stats_dict.keys())
    change_col_names = {'stock_p/e': 'stock_pe',
                        'sales_growth_(3yrs)': 'sales_growth_3yrs'
                        }
    company_stats_df.rename(change_col_names, axis=1, inplace=True)

    return company_stats_df

In the above snippet, I've added an exception log for when the code raises an IndexError, and the message includes the company that caused the exception.

I hope this helps in understanding how to go about logging in Python. I've just scratched the surface of the topic but this should get you comfortably started with using logging in Python. If you want to learn more about logging in Python, the official Python documentation is a great place to check out. Before I wrap up, here is a final reminder of why logging should be preferred over print statements.

Why to use logging over print statement?

Logging has different levels of severity that allow you to display log messages according to the level you want. A print statement does not give you that flexibility
Logging allows you to direct the log messages to separate files that can then be used for post analysis while the same is not easily available with print statement
You can set different log levels at individual code file level as well - some files may have INFO level while some may have DEBUG level
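The last point can be sketched as follows. The logger names project.scraper and project.model are hypothetical. Inside a real module you would call logging.getLogger(__name__) and set the level on that logger.

```python
import logging

logging.basicConfig(format="%(name)s %(levelname)s %(message)s", level=logging.INFO)

# Hypothetical module names, standing in for two files of a project.
scraper_log = logging.getLogger("project.scraper")
model_log = logging.getLogger("project.model")

scraper_log.setLevel(logging.DEBUG)  # verbose while debugging the scraper
# model_log has no level of its own and inherits the root logger's level

scraper_log.debug("fetched 1 page")   # shown
model_log.debug("intermediate loss")  # hidden
model_log.info("training finished")   # shown
```

A print statement has no equivalent switch; you would have to delete or comment out the calls file by file.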

And that is it for now on this post. I will keep visiting this page for more additions as and when I learn more about logging.

Did you find the article useful? If you did, share your thoughts in the comments. Share this post with people who you think would enjoy reading this. Let's talk more of data science.

Personal finance 101

2020-04-07T08:00:00-03:00

I am not going to talk about the power of compounding, why money is important, how can you make your money work for you, or any similar sounding cliched titles. If you are reading this blog, I assume you already understand that money is important and you would love to have more of it.

Disclaimer: I am not an expert in finance and all the viewpoints expressed here are to be used at your own discretion. But I read a lot on money and investments and adapt these learnings to my investments. Oh, if it helps, I have cleared the FRM (Financial Risk Manager) level 1 exam. This post will be about the meager basics one should have sorted with their money. Let us get started.

1. Get yourself insured

Medical insurance

First things first. Start with protecting your dependents. Insurance is something I recommend everyone gets as soon as possible. Usually your employer provides medical insurance, but if your employer doesn't, you should get it on your own. Make sure to get medical insurance for yourself and your dependents - mom, dad, wife, children - all your dependents. It is recommended to not have medical insurance tied just to your employer. This becomes all the more important during these times of a pandemic. Given the current pandemic situation, many are losing their jobs, and if you lose your job, you lose the medical insurance and that is a bad situation to be in.

Term insurance

Moving on, you should then go about getting term insurance. What is term insurance, you may ask? Term insurance is an insurance that gives your dependents a certain assured sum if you (the insured) die. If you would like to know more about term insurance, you should google and read more about it. Go for plain vanilla term insurance. While buying the term insurance, keep in mind to not buy a mixed insurance product - one that gives you returns as well on top of the insurance. ULIP (unit linked insurance plan) is an example of a mixed insurance product. ULIPs are notorious and are sold by bankers and relationship managers just for the sake of the high commissions they earn by selling them to you. Run away from the person who tries to sell you a ULIP and never talk to him again.

You should get just the plain, basic term insurance. We will worry about our returns in the next few paragraphs. One of the rules of investment that has worked well for me so far - do not mix investment products. If you are buying insurance, it should just provide you insurance. Do not buy ULIPs and other similar products that claim to also give you returns on top of the insurance.

2. Build an emergency fund

As the name suggests, this fund is to be accumulated to meet uncertainties/emergencies in your life. You decide what should be the amount you would be comfortable with. You can create this fund by putting a certain portion of your salary every month to this fund.

What would an emergency look like?

If tomorrow, you lose your job, how many months would you survive? Or may be you decide you wanted a few months off to figure out things or you want to travel for the next few months. Will you be able to afford this? Nothing holds more importance in life than security and freedom. And this fund is just for that.

How do you estimate how much money you need in this fund?

How many months do you want to survive without a job? Multiply this number by your monthly expense. Start with estimating your monthly expense - your rent, food bills, cook's salary, internet bill, entertainment expenses - every expense (no matter how small) that you incur in a typical month. Let us say that comes out to be 50,000 INR. Now, multiply this by the number of months you would want it to last (say 6 months), and you get your fuck-you money fund i.e. 50,000 * 6 = 300,000 INR
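If you like to keep this in a script or a notebook, the same arithmetic is a few lines of Python. The split of expenses below is made up purely so that it adds up to the 50,000 INR used above. Plug in your own numbers.

```python
def emergency_fund(monthly_expenses, months):
    """Target fund size: total monthly spend times the months of cover."""
    return sum(monthly_expenses.values()) * months


# Hypothetical breakdown in INR that adds up to 50,000 per month.
expenses = {
    "rent": 25_000,
    "food": 12_000,
    "cook": 5_000,
    "internet": 1_000,
    "entertainment": 7_000,
}

target = emergency_fund(expenses, months=6)
print(f"Monthly expense: {sum(expenses.values()):,} INR")
print(f"Emergency fund for 6 months: {target:,} INR")
```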

How many months should your emergency fund last?

Again, this number of months is a number you should be comfortable with. I suggest a minimum of at least six months. You can increase this to maybe ten as well. I have my emergency fund that would last for six months - I plan to increase this to ten. Again, financial freedom and peace of mind is of utmost importance when planning investments.

Where should I keep this money?

I keep it in a mix of cash (in my savings-bank account), fixed deposits, and liquid funds. If you don't know what liquid funds are, google it. It is important to keep this money as liquid as possible; that's why I insist on keeping a portion of this fund in your savings-bank account, which allows you to withdraw it whenever you need. Do not put this money into equity. We will come to equities in some time. Emergency funds will help you enjoy the little things in life, bring sanity and happiness without worrying about the next recession or layoffs.

3. Save taxes using investments under 80-C

Money saved is money earned. Do not miss out on the opportunity to save taxes. Under section 80-C, eligible investments can be deducted from your taxable income up to a limit of 150,000 INR a year. Do not let this opportunity slip.

Where should you invest this?

I recommend a mix of PPF and ELSS. Or if you want to really keep it simple and do not want to take the extra risk of ELSS, just dump it in your PPF account. Again, if you do not know what ELSS means, google it. If you still do not understand it, ping me. I will direct you to the right links.

4. Start investing

Now that we have ourselves covered - medical insurance, term insurance, tax benefits all done, we will now focus on making our monies work for us. Let us begin our investment journey.

Start with creating an alternate bank account

I call this bank account my investment account. Now, this account is not your salary account. This account is meant to be used just for investments, emergency funds and for parking additional funds. You are never going to use this account for your personal expenditures.

How much to invest?

You should have a rough estimate of your monthly expenditure. Once your salary gets credited to your salary account, keep the amount necessary for your monthly expenditures and transfer the remaining amount to your newly created alternate bank account. Say, you can invest 50k INR per month. Now that you have an alternate bank account, I will answer what to do next.

Start your SIP in mutual funds

If you do not understand what SIPs are, google it. Let me try to explain in a few sentences. SIP is a systematic investment plan that allows you to invest some amount at a regular time interval in an investment. SIPs are mostly talked about in conjunction with mutual funds. If someone says he is doing an SIP of 10k per month, it just means that he is investing 10k per month in a fund irrespective of the market going up or down. You set up an instruction that every month (say on the 5th), you want to invest 10k INR in a fund.

I will continue to add more to this. But for now, this should suffice to start your investment journey. Let me know if you have questions and I will address them in later posts. Till then, keep learning, keep writing.

Cluster NSE top 500 companies

2019-10-28T08:00:00-02:00

Idea

Each company can be represented by a few metrics that define the health of the company: metrics like EPS, revenue, market-cap, last 4 quarter earnings, RoE, PE ratio, etc. There could be other metrics or features as well. If we represent each company with these metrics and apply clustering on this data, would the clusters generated be meaningful?

I had this question one night when I was sleepless and I jumped out of my bed and inked down the basic idea and approach.

We know we have companies like Bajaj Finance that have given excellent returns in the past and are continuing to do so. There would be other companies like this. Would it be possible that companies like Bajaj Finance get clubbed into one cluster?

Objective

Find the characteristics of each cluster by looking at a few known companies mapped to each cluster and try to see which not-so-famous companies can be found in the same cluster.

Approach

Which companies to select in data?

We start with top 500 NSE companies. Get list of top 500 companies from NSE website

Where to extract the data from?

screener.in. Use requests and beautifulsoup library

What features to use for clustering?

Start with basic set of features that are easier to get like sector, market-cap, EPS, ROE, PE ratio. I will build on this later. Good now is better than perfect tomorrow.

Other features

  - debt/equity ratio

  - last 4 quarter earnings

  - Price info: mean price, std, 25th percentile price, median price, 75th percentile price, 52 week high, 52 week low

  - Volume traded

  - Some way to get user sentiments

Start modeling

Code implementation

The complete implementation and code can be found in this github repository.

This is a first version of the project. I plan to come back to this at a later time and add more features to represent a company.

Amazon's item-item Collaborative filtering recommendation algorithm [paper summary]

2019-07-16T08:00:00-03:00

This paper published in 2003 by Amazon explains how their basic recommendation algorithm works and is one of the papers that anyone interested in recommendation systems should give a read. I have been working in RS (recommender systems) domain at my present company for the last two years. Since the paper is dated almost fifteen years back, the algorithm may sound basic to you but it still works today.

The paper starts with talking about the different class of techniques we have for RS and why they do not work for Amazon and then it finally talks about their item-to-item collaborative filtering algorithm.

Some of the pain points of large e-commerce companies that stop them from using traditional algorithms are:

Amazon has tens of millions of customers and millions of items. Traditional RS mostly are good for toy-datasets but they fail as soon as your data becomes huge and scalability issues come into the picture.
A lot of their users are first-time users who have not made any transactions yet. Cold-start problems are not solved using conventional RS techniques.

Below are the conventional RS techniques and the problems associated with them.

1. Content-based (or search-based)

As the name suggests, it takes into consideration the content or the features as input to build the RS. This system relies just on the content/features of the items that the user has interacted with. It is based on keywords and so does not generate interesting/new recommendations.

The results from content-based RS are almost always obvious and there is no sense of novelty or serendipity. The recommendations from this system are narrow in scope and limited to what the user has viewed. For example, if a user has watched the Godfather movie, chances are the user will mostly be recommended movies from the same genre i.e. drama. RS are supposed to give you suggestions that are not obvious and the content-based RS fails drastically on this front. Content-based RS produce good results when the number of users and items is relatively small, which is clearly not the case with Amazon.

2. Cluster-based

In this RS, users are clustered into some number of segments based on some features. Each of these segments now represents a class of customers who have similar tastes within the same class. When a new user comes, she is mapped to the closest segment based on the affinity (distance-based or some other similarity metric). This user is then recommended items that the users in the segment have interacted with.

This technique is scalable since the cluster generation will be done offline but the recommendation generated from this is of poor quality.

Will a cluster of some number of segments (say 5) be enough to represent the behavior and tastes of all your customers? One can say that making the clusters more granular (increasing the number of clusters) will help in forming more specific clusters, but this will lead to increase in time taken to map the user to one of the segments (now that we have a huge number of segments)

How do you find the optimal number of segments? Finding the optimal number of segments and being assured that the segments generated are the best ones is one of the major problems of this method.

3. Collaborative filtering

This technique is computationally expensive. For M users and N items, its time complexity is O(M*N) in the worst case but since the data is mostly sparse i.e. a user only buys a handful of items out of the whole lot, the complexity is approximately O(M + N). Today there are dimensionality reduction methods that convert the hugely sparse interaction matrix to a dense one using techniques like matrix factorization.

Since the paper was written in 2003, a lot of these techniques were not that developed during that time. Research in RS peaked in 2006 after Netflix announced prize money of 1 million dollars for their competition. Cold start problem is obviously there with collaborative filtering. If a new user comes, collaborative filtering won’t be able to recommend an item for her.

4. Item-to-item collaborative filtering

Rather than relying on finding similar customers, item-to-item CF matches each of the user’s purchased items to similar items and then combines those similar items into a recommendation list. To determine which items are similar to an item (say seed-item), the method looks at items that were purchased together with seed-item and finds the similarity between them.

Generating item-to-item similarity table

Pseudo-code

for each item in product catalog (I1)
    for each customer who bought I1 (C1)
        for each item bought by customer (I2)
            similarity(I1, I2)

There are two issues with the above calculation of item-to-item similarity table.

Offline computation: First, building this table is very expensive. In the worst case, it is O(N²M), but since most customers interact with only a fraction of the items, the time complexity in practice is closer to O(N*M). That is why the similarity table is computed offline. For larger datasets, a scalable RS must perform expensive computations offline.
Calculating similarity between items: Second, how do we go about finding the similarity between the items? The simplest approach is to calculate the cosine similarity between each item pair’s item vectors. Each item vector is M-dimensional, with one dimension per customer, and records which customers bought that item.
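
The pseudo-code and the cosine-similarity idea above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the input format, the function name build_similarity_table and the toy data are all assumptions made for the example. Each item vector is treated as binary (bought or not), so the cosine similarity reduces to the number of shared customers divided by the square root of the product of the two customer counts.

```python
from collections import defaultdict
from math import sqrt


def build_similarity_table(purchases):
    """Build an item-to-item cosine similarity table.

    purchases: dict mapping customer -> set of items bought
               (a hypothetical input format chosen for this sketch).
    Returns a dict mapping item -> {other_item: similarity}.
    """
    # Invert the data: item -> set of customers who bought it.
    customers_of = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            customers_of[item].add(customer)

    # Count the customers shared by each pair of items bought together.
    co_counts = defaultdict(lambda: defaultdict(int))
    for i1, customers in customers_of.items():   # for each item I1
        for customer in customers:               # for each customer who bought I1
            for i2 in purchases[customer]:       # for each item bought by that customer
                if i2 != i1:
                    co_counts[i1][i2] += 1

    # Cosine similarity of binary vectors: shared / sqrt(|I1| * |I2|).
    table = {}
    for i1, others in co_counts.items():
        table[i1] = {
            i2: shared / sqrt(len(customers_of[i1]) * len(customers_of[i2]))
            for i2, shared in others.items()
        }
    return table


# Toy data, invented for illustration.
purchases = {
    "alice": {"book", "lamp"},
    "bob": {"book", "lamp", "pen"},
    "carol": {"book"},
}
similarity_table = build_similarity_table(purchases)
```

Only item pairs that were actually bought together get an entry, which is what keeps the table small when the data is sparse.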

Generating recommendations online

Once we have the similarity table, we can generate recommendations for a user by looking up the similar items for each item the user has purchased and building the final recommendation list from the most popular or most strongly correlated of those similar items. This process is not expensive at all and is done online, as it only has to look into the similarity table that was computed offline.
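
A rough sketch of this online lookup step is shown below. The similarity table is a hand-written, hypothetical stand-in for one computed offline, and summing the similarity scores is just one simple way to combine candidates; it is an assumption of this example, not the rule used by Amazon.

```python
from collections import defaultdict


def recommend(purchased_items, similarity_table, top_n=3):
    """Rank items similar to what the user bought, using a precomputed table."""
    scores = defaultdict(float)
    for item in purchased_items:
        for similar_item, score in similarity_table.get(item, {}).items():
            if similar_item not in purchased_items:  # skip items already owned
                scores[similar_item] += score
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:top_n]]


# Hypothetical similarity table, as if it had been computed offline.
similarity_table = {
    "book": {"lamp": 0.8, "pen": 0.6},
    "lamp": {"book": 0.8, "pen": 0.7, "desk": 0.5},
}
recommendations = recommend({"book", "lamp"}, similarity_table)
```

The work done here depends only on how many items the user has bought and how many neighbours each of those items has, not on the size of the catalog.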

Conclusion

The most important point to keep in mind is that the most expensive task, calculating the item-item similarity table, is done offline. Generating recommendations for a user’s purchased items is done in real time, as it depends only on the number of items purchased by the user and not at all on the number of items in Amazon’s product catalog.

This reduces the time significantly and thus makes the item-to-item collaborative filtering approach scalable. And because the algorithm recommends highly correlated similar items, recommendation quality is excellent.

Types of data in recommender systems

2018-09-27T08:00:00-03:00

There are two ways in which we can collect data for building recommender systems — explicit and implicit. In this post, we will talk about both types of data, their characteristics and the challenges with them.

Explicit feedback datasets

The dictionary meaning of explicit is to state clearly and in detail. Explicit feedback data as the name suggests is an exact number given by a user to a product. Some of the examples of explicit feedback are ratings of movies by users on Netflix, ratings of products by users on Amazon. Explicit feedback takes into consideration the input from the user about how they liked or disliked a product. Explicit feedback data are quantifiable.

But there are a few problems with explicit data. We will talk about them and then move to discuss the more abundant and easily available implicit feedback data.

Issues with explicit feedback data

When was the last time you rated a movie on Netflix? People normally rate a movie or an item on extreme feelings: either they really liked the product or they just hated it, with the former being more common. So, chances are your dataset will be filled with a lot of positive ratings but very few negative ratings.

Explicit feedback is hard to collect as it requires additional input from the users, and you need a separate system to capture this type of data. Then you have to decide whether to collect the feedback as ratings or as a like/dislike option, each having its own merits and demerits.

Explicit feedback doesn’t take into consideration the context in which you were watching the movie. Let us understand with an example. You watched a documentary, you really liked it and you rated it well. This doesn’t mean you would like to see many more documentaries, but with explicit ratings this is difficult to take into consideration. I like to binge-watch The Office TV series while having dinner and I would give it a high rating of 4.5/5, but that doesn’t mean that I would watch it at any time of the day.

There is another problem with explicit ratings. People rate movies/items differently. Some are lenient with their ratings while others are particular about what ratings they give. You need to take care of bias in ratings from users as well.

For me a good movie will be rated 3/5 but maybe, for you, a good movie is rated 4/5. Clearly our ways of rating a movie are different and this bias needs to be taken care of.
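
One common way to take care of this bias is to subtract each user's average rating from their ratings (mean-centering). The sketch below is only an illustration of that idea; the function name, the data layout and the ratings are assumptions made up for the example.

```python
def mean_center(ratings_by_user):
    """Subtract each user's average rating from that user's ratings."""
    centered = {}
    for user, ratings in ratings_by_user.items():
        mean = sum(ratings.values()) / len(ratings)
        centered[user] = {item: r - mean for item, r in ratings.items()}
    return centered


# Toy ratings: the two users agree on which movie is better,
# but one of them rates everything a point lower.
ratings_by_user = {
    "strict_rater": {"movie_a": 3.0, "movie_b": 1.0},
    "lenient_rater": {"movie_a": 4.0, "movie_b": 2.0},
}
centered = mean_center(ratings_by_user)
```

After centering, both users end up with the same values for the two movies, so the difference in rating habits no longer hides the fact that their tastes agree.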

We now understand explicit feedback and some of the issues with it. There is another type of feedback data in recommendation systems — implicit feedback. Let us talk about them.

Implicit feedback datasets

The dictionary meaning of implicit is suggested though not stated clearly. And that’s exactly what implicit feedback data represents. Implicit feedback doesn’t directly reflect the interest of the user but it acts as a proxy for a user’s interest.

Examples of implicit feedback datasets include browsing history, clicks on links, the number of times a song is played, the percentage of a webpage you have scrolled (25%, 50% or 75%) or even the movement of your mouse.

If you just browsed an item that doesn’t necessarily mean that you liked that item but if you have browsed this item multiple times that gives us some confidence that you may be interested in that item. Implicit feedback collects information about the user’s actions.

Implicit feedback data are found in abundance and are easy to collect. They don’t need any extra input from the user. Once you have the permission of the user to collect their data, your dependence on the user is reduced.

Let us talk about some unique characteristics of implicit feedback datasets:

1. No negative preference measured directly

Unlike explicit feedback, where a user gives a poor rating to an item he/she doesn’t like, implicit feedback gives us no direct way to measure a user’s dislike. Repeated action in favor of the item, for example listening to Coldplay’s Fix You many times, gives us confidence that the user likes this song and we could recommend a similar song to the user. However, the absence of a listening count for a song doesn’t mean that the user does not like it; maybe the user is not even aware that the song exists. So, there is no way to measure negative preference directly.

2. The numerical value of implicit feedback denotes confidence that the user likes the item

For example, if I’ve listened to Coldplay more times than to Pink Floyd on YouTube, the system would infer with higher confidence that I like Coldplay.

3. A lot of noisy data to take care of

You need to do a lot of filtering before you actually get data worth modeling. Just because I bought an item doesn’t mean I liked the item; maybe I bought it for a friend, or maybe I didn’t like the item at all after purchasing it. To handle such issues, we can calculate the confidence associated with the users’ preference for items. Read this excellent paper to get an idea of how to incorporate confidence in the feedback data: Collaborative Filtering For Implicit Feedback Datasets.
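
As a rough sketch of the idea in that paper: a raw count r (plays, purchases, clicks) is turned into a binary preference (1 if r > 0, else 0) and a confidence weight c = 1 + alpha * r that grows with the count. The function name below is hypothetical, and the default alpha of 40 is an assumption for this example; treat it as a value to tune on your own data.

```python
def preference_and_confidence(count, alpha=40.0):
    """Map a raw interaction count to (preference, confidence).

    preference: 1 if the user interacted with the item at all, else 0.
    confidence: 1 + alpha * count, so repeated actions increase confidence.
    alpha is a tunable constant; 40.0 is only an assumed default here.
    """
    preference = 1 if count > 0 else 0
    confidence = 1.0 + alpha * count
    return preference, confidence


no_plays = preference_and_confidence(0)
three_plays = preference_and_confidence(3)
```

Note that an item with no interactions still gets a small, non-zero confidence: the model treats it as a weak signal of no preference instead of ignoring it.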

4. Missing values

Explicit feedback datasets are difficult to capture, so many values are missing; we model with whatever ratings we have and ignore the missing values. With implicit feedback, we instead assign missing values as 0, indicating no action from the user: no purchase, or the song was not listened to.

5. Evaluation of implicit feedback models requires appropriate measures

With explicit feedback, e.g. ratings of movies in the Netflix data, one can use RMSE (root mean squared error) as a metric to see how far the predicted ratings are from the observed ratings on a test dataset. With implicit feedback there are no observed ratings to compare against, so such a metric is not available. We work with other metrics instead: decision-support metrics like precision and recall, or rank-based metrics like MRR (mean reciprocal rank), NDCG (normalized discounted cumulative gain) and precision@k.
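
To make two of these metrics concrete, here is a small sketch of precision@k and MRR. The function names and the toy song lists are assumptions for illustration; libraries for recommender evaluation may define edge cases (for example, lists shorter than k) differently.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is found."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(recommendation_lists, relevant_sets):
    """Average the reciprocal rank over several users."""
    ranks = [
        reciprocal_rank(recommended, relevant)
        for recommended, relevant in zip(recommendation_lists, relevant_sets)
    ]
    return sum(ranks) / len(ranks)


# Toy data: items the user actually interacted with in the test period.
recommended = ["song_a", "song_b", "song_c", "song_d"]
relevant = {"song_b", "song_d"}

p_at_2 = precision_at_k(recommended, relevant, k=2)
mrr = mean_reciprocal_rank([recommended, ["song_d", "song_a"]], [relevant, relevant])
```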

I know it may sound counter-intuitive when I say this, but implicit feedback datasets are in general better than explicit feedback at producing improved recommendations. I don’t have a case study to prove this but I have listened to many talks by Xavier Amatriain, a great name in the recommendation systems domain, and I’ve heard him reiterate this point many times.

So, there we have the types of data we generally deal with while building recommendation systems. Let me know your thoughts!

Handling errors with try-catch in Python

2018-06-18T08:00:00-03:00

In the previous post I discussed how to convert a string to date format in Python. I was working on a similar idea today. I had a column of object type which held dates as strings. The column name is 'signed_up_at' and I wanted to convert it to date format and then calculate the number of days since the user signed up on our site.

I tried to apply the function below to create a new column from the 'object' type column:

df['date_signed_up'] = df.signed_up_at.apply(lambda x: x.split('T'))

But I got the below error: AttributeError: 'float' object has no attribute 'split'

This implies that there is a value in the 'signed_up_at' column which is a float, and 'split' doesn't work on a float object. Most likely it is a missing value, which pandas stores as NaN, a float.

I don't know where this float value is in the column, but I also can't apply 'split' unless I bypass this error.

This is where try-except comes into the picture.

Try lets you try a piece of code and specify what to do when it encounters an error. Instead of halting completely on an error, try-except lets you handle that error and continue the code execution.

Try will run a block of code and, if it succeeds, execution won't go to except at all. However, if there is an error in the try block, the execution flow goes to except and does what is written there.

We will do something similar with our error as well.
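
A minimal sketch of how this could look for the 'split' error is given below. The helper name split_timestamp and the sample values are made up for illustration, and returning None for bad values is just one possible choice. The example uses a plain list so that it runs without pandas.

```python
def split_timestamp(value):
    """Return the date part of a timestamp string, or None if value is not a string."""
    try:
        # Keep only the part before 'T', e.g. '2018-06-17T10:15:00' -> '2018-06-17'.
        return value.split('T')[0]
    except AttributeError:
        # Missing values often show up as float NaN, which has no split method.
        return None


# Sample values, invented for this example; the middle one mimics a missing entry.
signed_up_at = ['2018-06-17T10:15:00', float('nan'), '2017-02-02T08:00:00']
date_signed_up = [split_timestamp(value) for value in signed_up_at]
```

With a DataFrame, the same helper could be passed to apply, for example df.signed_up_at.apply(split_timestamp). Catching only AttributeError, and not every exception, means other unexpected problems still surface.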

Working with dates in Python

2018-06-18T08:00:00-03:00

I cringe every time I see a date type column in the data. And you may ask why so? Date columns need some methods applied to them before they become useful.

The reason is I don't normally see date columns in the data I work with so I don't remember the functions and methods that work on dates to get meaningful columns out of the date column in the data.

from datetime import datetime

To convert a string to date:

datetime.strptime('string', 'format_of_the_string')

d1 = datetime.strptime('2017-02-02', '%Y-%m-%d')
d2 = datetime.strptime('2018-06-17', '%Y-%m-%d')

Find the number of days between two dates:

delta = d2 - d1
delta.days

Difference between today and some date:

today = datetime.today()
delta = today - d2
delta.days
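
The snippets above can be wrapped into one small helper. This is only a sketch; the function name days_between and its default format are assumptions for the example.

```python
from datetime import datetime


def days_between(start, end, fmt='%Y-%m-%d'):
    """Number of whole days from start to end, both given as strings in fmt."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.days


days = days_between('2017-02-02', '2018-06-17')  # 500
```

Passing the dates in the reverse order gives a negative number, which can be a handy sanity check on the data.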

git and github for data scientists

2018-04-18T08:00:00-03:00

It has been close to a year since I shifted to a start-up which incidentally got acquired after a month of my joining. Before this I used to work at WalmartLabs where we always wanted to use a version control system like git but it never took off properly. Now that I am working in this start-up I got to know that just taking a course on git/github doesn't make you a master of this topic. And this is true for anything you always wanted to learn. We tend to spend too much time on tutorials and hesitate to take the next step of actually applying it. I recently read this excellent article along the same lines - Tutorial purgatory. I used to think that I understand git/github but recently I have come to know about so many features that I now have accepted this - You know nothing, Manish Barnwal!

In today's post I will not talk about theory in details. This post is more of a practical introduction to git/github. If you have never heard of git, you should probably go somewhere else to learn about it. Let us get started.

What is git?

Git is a version control system that allows you to capture snapshots of your code's progress while you are developing it. You will go through many versions of the code before you have your final code written. Git allows you to capture and commit these versions so that if the code breaks later, you can go back to an earlier working version and resume developing from there.

What is github?

Github is the server where the various versions of the code you have written get stored. I understand it as a web-version of git. Github allows other collaborators in the project to easily get the latest version of the code. If your laptop crashes for some reason, all your local git files will be lost. Github keeps a copy of every version you have pushed, so anyone can easily clone the code repository to their local (laptop).

Cloning a repository

Repository is a fancy name for the code-files and directories in a project. Cloning a repository means getting a copy of the repository into your local directory. This creates a .git folder in your local copy that maintains all the commit logs, version-changes, directories, branches and other information. You can see the tree structure of the .git folder by typing tree .git/. If you are on Mac, you might have to install tree by typing brew install tree

When you clone a repository, you not only get the code-files but also all the versions, all the history, all the commit messages that have been generated in the project ever since it started. So cloning is not just copying the code-files to your local. You can clone a repository by typing git clone <git-repository-url>.

Stages in code development using git

Let us say you made changes to a code file named - my_code.py. Once you are satisfied with your changes, you would want to commit this change and push it to github so that the web-version also gets updated.

After you have made your code changes, you should always check the status of your local repo by typing git status. Once you have made changes to an existing file, you need to send it to a staging area by typing git add my_code.py. To stage a file is to simply prepare it finely before commit. git add tells git to compress the file, create a hash of the file and store it as an object in the git tree.

Once you have your file in the staging area, you would like to commit it with a message about the changes you made in your file, something like this: git commit -m 'fixed the date-time issue'. And then you push these changes by typing git push

If everything ran without giving an error message, you should have the changes reflected on Github. Let us dig a little more into push and its counterpart, pull.

Push and Pull

Push and pull are actions you perform on the remote. But what is remote? git allows you to manage code versions locally, but you need a way to pass on these changes to the outside world. Remote allows you to do so. Remote is a common repository that allows team members to exchange their code changes with others. Generally, the remote is stored on a code-hosting service like Github or on an internal server.

From this excellent stackoverflow answer.

As you probably know, git is a distributed version control system. Most operations are done locally. To communicate with the outside world, git uses what are called remotes. These are repositories other than the one on your local disk which you can push your changes into (so that other people can see them) or pull from (so that you can get others changes). The command git remote add origin git@github.com:peter/first_app.git creates a new remote called origin located at git@github.com:peter/first_app.git. Once you do this, in your push commands, you can push to origin instead of typing out the whole URL.

Push allows you to add your local changes to the remote by typing git push <name of remote> <name of branch>. Pull allows you to get the latest changes from the remote to your local by typing git pull <name of remote> <name of branch>. This gets translated to git pull origin master.

origin is the default name of the remote. And since master is the default branch, you can see how the command devolves into the simple name we find everywhere: git pull origin master.

Let us now talk about branches in git.

Creating a branch in git

A branch in git allows you to work independently of the main project. If there is a feature you want to add in to the main project, you go about creating a separate branch and you work on it. Creating a branch makes a separate copy of the master branch in your local and now you can add your features, test new changes without affecting the master branch.

Let us understand how to create a branch in git.

To give an example, recently I had to work on checking the quality of data that we use for our recommendation project. The task was to assess the quality of the input data, create a few quality metrics and if there is any problem with the data, the task should fail and generate a reminder about the same.

Assessing the quality of the data is an independent task and so I created a new branch called data_validation. To create a branch in git type,

git branch <branch-name>
#  So in my case it was `git branch data_validation`.

Once a branch is created, you can have a look at all the branches in your local by typing git branch

There will be an asterisk next to the branch you are currently on. If you want to change the branch, you can do so by typing git checkout <branch-name>. The default branch is master.

Once your branch is created, you would add your code to create the feature you wanted to create. This branch ensures that you don't break anything in the main code-base. Once you have tested the code in your local and you're confident that it is working fine, you would want to get it reviewed from others in your team. How do you get it reviewed from others? Comes into picture PR. But before creating a PR, you have to push your branch in local to remote. Let us see how?

Pushing new local branch to remote Git

You have already made changes in the code - you have added your new code for the new feature. You can have a look at all the files you have modified by typing git status. You can add the code files to the staging area by typing git add <file-name>. You can commit them with a short message describing the changes you have made by typing git commit -m 'your message'.

You can now push your local branch to remote git by typing, git push -u origin <branch-name>. You can now see this branch in Github. This branch is now accessible to everyone in your team. Now, you want people in your team to review your code. We do so by creating a PR.

What is a PR?

A PR (short for Pull Request) is a way to get your work checked by others in your team so that they can give you feedback if there are any changes to be made to the code you have written in the new branch. There could be multiple to-and-fro reviews and changes before all the reviewers are satisfied with the code in the branch. Once all the reviewers are satisfied with the requested changes, they will approve your PR. But how to generate a PR?

Create a pull request

Go to Github and select the branch that has your commits. To the right of the Branch menu, click New pull request. Type a title and description for your pull request. You also have the option to choose who you want your code reviewers to be. Click Create pull request and your PR is created.

The reviewers will make suggestions for changes to your code, and this may take to-and-fro dialogues between you and your reviewers. Once your changes are approved, you can merge the PR into the master branch.

PR is approved. What next?

1. Merge pull-request from Github

Once your PR is approved by all the reviewers, you can merge it with the master by going to that particular branch in Github and clicking on the Merge pull request option. If there are no conflicts, your branch will successfully get merged with the master, meaning whatever code changes you made in your newly created branch are now part of the master.

2. git checkout master from git

Now, go to git and do git checkout master to move to the master branch

3. git pull

Do a git pull to get the latest changes from the master branch on Github to your local git.

git tag

If you go to a repo on github.com you will see a tab called releases. You will see various release-numbers there. This is done so that you have versions that one can revert back to if a release fails.

git tag from command line also gives you the list of all releases for the repository.

Normally a release looks something like this: 2.11.1. So there are three sets of numbers to look at, each separated by a period: major, minor and patch (bug-fixes). If you want to give a tag number to your release, look for the most recent tag number and update it according to the framework below.

How to update a tag (release number)?

Given a version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you make incompatible API changes,
MINOR version when you add functionality in a backwards-compatible manner, and
PATCH version when you make backwards-compatible bug fixes.
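
The rules above can be sketched as a small helper that computes the next version number. The function name bump_version is hypothetical; the resetting of the lower parts to zero when a higher part is incremented follows the semantic versioning convention.

```python
def bump_version(version, part):
    """Return the next MAJOR.MINOR.PATCH version for the given kind of change."""
    major, minor, patch = (int(n) for n in version.split('.'))
    if part == 'major':
        return f'{major + 1}.0.0'          # incompatible API changes
    if part == 'minor':
        return f'{major}.{minor + 1}.0'    # backwards-compatible functionality
    if part == 'patch':
        return f'{major}.{minor}.{patch + 1}'  # backwards-compatible bug fixes
    raise ValueError(f'unknown part: {part!r}')


next_minor = bump_version('2.11.1', 'minor')
```

For the release 2.11.1 used as an example earlier, a minor change gives 2.12.0, a patch gives 2.11.2 and a major change gives 3.0.0.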

So how do you add a tag?

Depending on whether you have made major, minor or patch level changes, you can add a tag by typing

git tag <tag-number>
#  For eg. `git tag rel/2.11.0`

You can additionally add a message to the tag something like this:

git tag rel/2.11.0 -m 'Code changes to include new features'
#  This message will appear on the Github corresponding to this tag.

To push this tag to remote repo:

git push origin <tag-no.>
#  If you want to push multiple tags, `git push origin --tags` will work.

I will keep visiting this post to add more of my notes and learnings on git/github. Thanks for reading. Keep learning!

Creating a virtual environment in Python

2018-01-23T08:00:00-02:00

I was trying to get a virtual environment set up on Python 3 using mkvirtualenv but somehow the virtual environment was getting created on Python 2.7 (my system python).

If you already know about virtual environments and why they are useful, you may skip the next two paragraphs. I came to know about virtual environments only recently. Virtual environment helps you create an isolated space wherein whatever packages you install won't have an impact outside the environment. To give you a better understanding I will tell you why I use virtual environment.

I have two virtual environments in my computer - one for office work (say office_space) and the other for my personal learnings (personal_space). Now, office_space has all the packages that I need to get my daily office coding done. This has packages particular to the recommendation engines that we have built. Whereas the personal_space environment has just normal packages - numpy, pandas, luigi and others.

Virtual environment makes it easier to work on more than one project at a time without introducing conflicts in their dependencies. It helps to manage the package versions without much hassle. In a realistic scenario, different projects may often require different versions of the same package, and a change in the version of a package may give erratic results. I have learnt the hard way to always have your package versions pinned. Always make sure the version of the package you are using in the production code is the right one and pin that version. Virtual environments help you easily manage different package versions for different projects.

Python 2 or 3 or may be both

With Python making a change to Python 3, we always have this question - which version should I use? I started learning Python recently and one of my seniors recommended learning Python 3. You decide what you want to learn. My recommendation is Python 3. Having said that, you can easily have both the versions in your system. This is where pyenv comes to the rescue. Pyenv is a python version management tool. It lets you switch from one version to another with just a single command.

If you are on mac, consider installing with Homebrew package manager by typing:

brew update
brew install pyenv

You can read more on how to install, use and other details here.

Create a virtual env on python3

You need to install virtualenvwrapper package using:

pip install virtualenvwrapper

virtualenvwrapper helps with creating and deleting virtual environments, and I have already explained why we need virtual environments.

Once you have it installed, you can easily create an environment, say ml_learn_env, by typing

mkvirtualenv ml_learn_env

This will create and activate your virtual environment. You can now install the packages you need in this environment. Any package that you install in this env won't have any impact on anything outside this space.

Many times, you may have to pass the path of the python version along with mkvirtualenv to ensure the virtual env is created over that particular python version.

If I do pyenv versions on my terminal, this is what I get.

system
* 3.4.3 (set by /Users/manishb-imac/.python-version)
3.5.2

There are three versions of Python on my machine: system - the default Python provided by Mac which is Python 2.7. And then I have Python 3.4.3 (marked by asterisk meaning the current active version) and Python 3.5.2.

If I want to make a virtual env corresponding to a particular python version, say 3.4.3 I can pass the path explicitly like

mkvirtualenv -p /Users/Manish/.pyenv/versions/3.4.3/bin/python ml_learn_env

This will definitely have 3.4.3 as the python version on your virtual env - ml_learn_env.
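
If you want to double-check which interpreter an environment is really using, a quick look from inside Python helps. The sketch below uses only the standard sys module; the helper names are made up for this example, and the check for real_prefix is there as an assumption about older virtualenv releases.

```python
import sys


def in_virtual_env():
    """True when the running interpreter belongs to a virtual environment."""
    # venv and recent virtualenv expose base_prefix; older virtualenv set real_prefix.
    base = getattr(sys, 'base_prefix', sys.prefix)
    return hasattr(sys, 'real_prefix') or base != sys.prefix


def python_version():
    """Interpreter version as a 'major.minor.micro' string."""
    return '.'.join(str(n) for n in sys.version_info[:3])


print(sys.executable)    # path of the interpreter currently running
print(python_version(), in_virtual_env())
```

Run it once with the environment activated and once after leaving it; the interpreter path and the version should change accordingly.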

Deactivate and delete virtual environment

If you want to delete a virtual env. (say tmp_venv) you need to first get out of the virtual env. by deactivating it using: deactivate. This will get you out of the virtual env. and now if you want to delete it, type rmvirtualenv tmp_venv.

Create a Jupyter kernel over virtual env.

Once your virtual environment is created, we want to install packages specific to the project you created the virtual environment for. When writing the code on Jupyter notebook for this project we want to use these packages.

To do so we need to install a package called ipykernel that allows us to use these packages in our jupyter notebook. We then create a kernel over this virtual environment by typing the below code.

pip install ipykernel
python -m ipykernel install --user --name ml_learn_env --display-name 'ml_learn_kernel'

You should now be able to switch to ml_learn_kernel by clicking on change kernel under the Kernel tab of Jupyter notebook.

This post is not complete yet. I will keep updating this as and when I learn more on virtual environments.

Eight lessons learnt from the process of learning the guitar

2017-11-28T08:00:00-02:00

It has been close to three years since the day I picked up my brother's guitar and asked him to show me how this works. And man! What a journey it has been since then. I have learned so much from the guitar apart from learning to play this beautiful instrument.

I learn from everyone and anyone around me. You spend some time with me and I will find something to learn from you. You won't know this but I would have learned something beautiful from you. Everyone has something to offer. It is up to us to grasp it. It should not come as a surprise that I strongly believe in this adage.

You are the average of five people you spend your most time with.

I have spent approximately a thousand days with the guitar residing by my bed. Obviously, I can play this instrument, but what other lessons has this guitar taught me? I will digress a little to admire my guitar. Two of my guitar teachers have complimented my guitar something like this - your guitar has a John Mayer feeling. I am so proud of my guitar. It is beautiful, melodious, and elegant.

Without further ado, below are the lessons the guitar has taught me:

1. Start small

If you pick a guitar the first time, you will realise you can't even hold it properly, you can't press the strings, every note you press sounds pathetic. You want to play your favourite song to it but you can't. You just can't. Your dream of wading your fingers on the guitar gets shattered. You get depressed. You decide maybe this is not for me. You decide to let it go. And here is when you need to realise that Rome was not built in a day. It takes time to learn new stuff. Instead of focusing on the desired output, just start and start small.

The initial few months of learning guitar are just the boring finger exercises so that your fingers get accustomed and get the strength to press the string. Guitar has taught me to start small and not to give up thinking about the bigger picture. Just focus on the smallest portion of the humongous task and get it done. Pick up the next smallest task and get it done. With time, you will get closer to your end goal.

2. People will say things when you start something new

Obviously, you don't sound even close to music the first few months of picking the guitar. Well, it may take more than a year as well. People around you will get irritated with the cranky sound from your attempts to play the guitar. And I don't blame them. It is actually not that good a sound for the ears. But you have to fight them, ignore them, find a time when they are not around and continue your journey of learning. Don't let their voices subdue your will to learn. Keep practicing those finger exercises.

3. It takes time

Every time I talked to my tutor about not being able to play something on guitar, say the painful barre chords or perfecting the timing of a note, he had just one answer: it takes time. And that's it. After a few months, I realised he was right. I stopped complaining to him. Whenever I got stuck at something, I told myself - it takes time.

Many people give up on guitar within the first few months of learning because they don't see any progress. My friend Issac calls them 6-month guitarists. They give up and stop learning. Don't be like them. Understand that it takes time to achieve bigger goals. Learning guitar isn't a sprint; it is a never-ending marathon. You keep on learning. I kept trying without complaints.

With much practice, I realised I was improving and that was like a pat on the back. This small progress motivated me to continue the journey of learning. If there is something you suck at, just keep trying. All it takes is a little time.

4. Track your short-term goals to achieve the end goal

I used to count how many times I could change from one chord (say C) to another chord (say G) in a minute. This is called one-minute changes. I kept track of this number. I remember that in the first few months this number was somewhere around twenty: twenty changes from C to G in a minute. I used to write it down in a notebook. The next day I would do the same one-minute changes and check whether the number had increased.

Obviously, it didn't change much in a month or even two months but after six months I was doing 45 changes. That is more than double the number I started with, in just a few months. That is not bad! Guitar has taught me to track progress. It does not have to be the method I explained. See what works for you and your learning journey.

5. Set aside a fixed time for achieving a new goal

It helps to set aside a fixed time to learn the skill you want to acquire. And this is absolutely important in the starting few days of your learning journey. I used to wake up at 6.30 a.m. every day during the starting few months. I didn't have a guitar teacher in the beginning and discipline was of utmost importance. Choose a time that fits best into your routine. This time shouldn't be given to any other activity. This is the time you have to spend on your goal.

Morning worked best for me. See what works best for you. You need to stick to this time and make the best use of it. I have applied this technique to learn other new stuff as well and this has worked for me.

6. The best time to plant a tree was 20 years ago. The next best time is now

A month since I started learning guitar, I thought I had so much time in college. I should have picked guitar during that time. I would have made so much progress from then until today. And it is true that most of your college days are mostly free. But then a thought from Quora came to the rescue of the disheartened heart and eased me - The best time to plant a tree was 20 years ago. The next best time is now. This motivated me not to think of the lost time. Now if I want to learn a thing, I will start it today. I will start it now. Most of the times you have to just start without letting your mind convince you otherwise. Start it today. Start now.

7. Practise what you are not good at

Practice is extremely personal. Assess your weaknesses and practice to master them. I realised late into the journey of learning guitar that I was not making much progress. I felt stuck at one level. I was not levelling up. I talked to my guitar tutor about this, and he explained to me that it is important to practice what you are not good at. I was not good at playing barre chords and I never practiced them. I don't know why. Maybe because it would have required some extra effort, or for whatever stupid reasons I had. But then I made up my mind to practice the skills I am not good at. You improve only when you know your weaknesses. Find your gaps and start building that bridge, one day at a time.

8. If you want to learn a skill, surrounding yourself with people better than you helps

They say talking to an expert for an hour is equivalent to reading a hundred books. I don't know if this is entirely true, but there is some truth in it. Surround yourself with people who are better than you. I am lucky to have many people around me who are far better than me at playing guitar, and this has helped me enormously. First, it motivates you to keep learning to reach the other person's level. Second, you realise that you are not the best, and this pushes you to work more on it.

Thank you for reading!

An easy starter to music theory

2017-11-24T00:00:00-02:00

I started playing guitar almost three years back, and at that time I wished I had picked up this lovely instrument during my college days. Well, we always regret the things we don't do. The guitar was one such thing. But then I read this quote in one of the answers on Quora.

The best time to plant a tree was twenty years back and the second best time is NOW.

I strongly believe in the power of NOW. So without thinking much, I started learning guitar in the month of June 2014.

While I have been playing guitar for almost three years, I have also enjoyed learning music theory alongside. So I thought it would be worth attempting to explain the music theory I understand. I may not have the answer to every question, but I can cover the basics.

In a series of posts, we will talk about:

the basic theory of scales
how chords are derived from scales
what it means when we say that the song is in the key of C
and many more related questions

We will become a pro. We will talk music. Let us begin.

There are seven pure notes in music. In Indian Classical music they are Sa, Re, Ga, Ma, Pa, Dha, Ni, followed by Sa again (the same as the first note but at a higher pitch). The same holds true for Western music, where they are known as Do, Re, Mi, Fa, So, La, Ti, Do. Notes are the key ingredients of music. Notes are to music what letters are to the English language. They are the building blocks.

In this post, I will talk about Western music, and to make the concepts relatable I will use the guitar to explain the theory. There are seven pure notes, also called normal notes. They are C, D, E, F, G, A, B, followed by C again. Then there are five accidental notes. They are named accidental because they are derived from the pure notes. The accidental notes are denoted by # (read as sharp) or a superscript b (read as flat). For instance, C# is read as C-sharp, and the same note can also be called D-flat. We will cover this in the next few paragraphs.

Not all of the pure notes have an associated accidental note. Only five of the seven normal notes have one. Notes B and E have no accidental notes, i.e. B and E have no sharps.

So the accidental notes are:

C#, D#, F#, G#, A#, or they can also be called D-flat, E-flat, G-flat, A-flat, and B-flat respectively. So the notes C, D, F, G, and A have sharps, while B and E do not.

B and E have no sharps.

So in total, we have twelve notes in music.

You would agree with me when I say music is all about the ears. But there needs to be some form of measurement that tells us how far one note is from another. To elucidate, how do we measure the distance between two cities, say A and B? We use kilometers. Similarly, we have the semi-tone and the tone to measure the distance between two notes.

So what is a tone? A tone is one full-step from one note to another. As the name suggests, a semi-tone is half of a tone i.e. a semi-tone is half-step from one note to the other. C to D is a full-step and is a tone whereas C to C# is a half-step and is a semi-tone. B to C is also a half-step (remember B has no sharp) so it is a semi-tone.
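To make the idea of distance concrete, here is a minimal Python sketch that counts the semi-tones between two notes. The list of twelve notes comes from this post; the function name `semitones_between` is just something I made up for illustration.

```python
# The twelve notes in ascending order, written with sharps.
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def semitones_between(start, end):
    """Count the half-steps needed to climb from `start` up to `end`."""
    # The modulo wraps around, so going up from B lands back on C.
    return (NOTES.index(end) - NOTES.index(start)) % len(NOTES)


print(semitones_between("C", "C#"))  # 1, a semi-tone
print(semitones_between("C", "D"))   # 2, a tone
print(semitones_between("B", "C"))   # 1, a semi-tone because B has no sharp
```

Two semi-tones make one tone, so a result of 2 means the notes are a full-step apart.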

So far, we have covered notes, tone, and semi-tones.

Let us talk about scales. Have you heard of them? A lot of my friends shut down their brain the moment they hear about this word - scales. I try explaining to them that it is not as complicated as people assume. It is simple to understand. Once you have grasped the concept, you need to keep using it in your vocabulary so that you don't forget it. I do the same.

I love explaining to people what scales are or any music theory, hence this post. Another reason for writing this post is I want to use this as a reference for whatever music theory I have understood so far.

Scales - the basic building blocks of chords

A scale is a combination of notes arranged in a pattern that provides a template to compose or improvise, depending on mood or feel. The notes are arranged in a particular pattern, not any random one. We will talk about what these patterns are in the next few paragraphs.

So, why are scales important?

Scales are important because chords are derived from scales. Scales are the building blocks of chords. I will explain how to get chords from a scale in a later post. Coming back to scales, they help you get familiar with the different notes and understand how each note sounds. It is important to understand how each note on the guitar sounds. Well, you don't start by hearing each note and remembering how it sounds, but this should be your long-term goal. You can't play just any combination of notes; they won't sound nice to the ears. A scale is a group of notes which, when played together, sound like music.

Ever seen someone listening to a tune and instantly playing that tune on the guitar? How do they achieve this skill? They have practiced the scales for quite some time. They have spent time transcribing notes on their instrument. Transcribing is the process of finding the note on the guitar that you have in your mind. Well, this takes time and comes with practice. Scales help you achieve this with ease.

What are different types of scales?

There are broadly two types of scales - major and minor. Well, there are others as well, like pentatonic scales. For starters, we will just focus on major and minor scales.

Remember that there is a pattern in the combination of notes for a scale. Both major and minor scales have a certain pattern in which the notes are arranged and it is this pattern that differentiates their sound and feel. Let us have a look at the major scale.

Major scale - R T T S T T T S

There you have the major scale. All you need now is to understand what these letters mean.

R stands for root
T for tone, and
S for semi-tone

What does root mean? The root is the main note of the scale. The root and the pattern decide what other notes will be in the scale. Let us take the example of the simplest scale, the C-major scale. The root here is the C note. The C-major scale starts with a C, has intermediate notes according to the pattern, and then ends with C: C, D, E, F, G, A, B, C.

Try writing the notes for some other scale, say D-major scale. And check if you have the correct answer.
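If you want to check your answer, here is a small Python sketch that walks the R T T S T T T S pattern from any root. The names `build_scale`, `MAJOR_PATTERN`, and `STEP_SIZE` are my own, chosen for this illustration; the notes are spelled with sharps only.

```python
# The twelve notes in ascending order, written with sharps.
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# The steps that follow the root (R) in the major scale pattern.
MAJOR_PATTERN = ["T", "T", "S", "T", "T", "T", "S"]

# A tone is two semi-tones, a semi-tone is one.
STEP_SIZE = {"T": 2, "S": 1}


def build_scale(root, pattern):
    """Start at the root and climb by a tone or semi-tone for each step."""
    position = NOTES.index(root)
    scale = [root]
    for step in pattern:
        position = (position + STEP_SIZE[step]) % len(NOTES)
        scale.append(NOTES[position])
    return scale


print(build_scale("C", MAJOR_PATTERN))
# ['C', 'D', 'E', 'F', 'G', 'A', 'B', 'C']
print(build_scale("D", MAJOR_PATTERN))
# ['D', 'E', 'F#', 'G', 'A', 'B', 'C#', 'D']
```

Notice that the pattern adds up to twelve semi-tones, which is why every scale ends on the same note it started from.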

Moving on, next is minor scale.

Minor scale - R T S T T S T T

Let us take the example of the A-minor scale to see what its notes are. The image below shows the notes of both the C-major scale and the A-minor scale.
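The same two scales can also be worked out in a few lines of Python. This is only a sketch; the helper `build_scale` and the constant names are hypothetical names I picked for this example. It applies the two patterns given in this post and shows that C-major and A-minor use the same seven notes, just starting from different roots.

```python
# The twelve notes in ascending order, written with sharps.
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# The steps that follow the root (R) in each pattern.
MAJOR_PATTERN = ["T", "T", "S", "T", "T", "T", "S"]
MINOR_PATTERN = ["T", "S", "T", "T", "S", "T", "T"]

STEP_SIZE = {"T": 2, "S": 1}  # measured in semi-tones


def build_scale(root, pattern):
    """Start at the root and climb by a tone or semi-tone for each step."""
    position = NOTES.index(root)
    scale = [root]
    for step in pattern:
        position = (position + STEP_SIZE[step]) % len(NOTES)
        scale.append(NOTES[position])
    return scale


c_major = build_scale("C", MAJOR_PATTERN)
a_minor = build_scale("A", MINOR_PATTERN)

print(c_major)  # ['C', 'D', 'E', 'F', 'G', 'A', 'B', 'C']
print(a_minor)  # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'A']
print(set(c_major) == set(a_minor))  # True
```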

The moods of major and minor scale

Major and minor scales are often described in terms of the contrasting feel or mood they set. A major scale in general sets a happy and positive mood. Play a C-major scale and you will realise the mood is happy. A minor scale gives a feeling of sadness and melancholy. Play the A-minor scale to see if it sets a slightly sad mood.

We know there are 12 notes in music (seven pure notes and five accidental). Each of these twelve notes has a corresponding major and minor scale. So we have 12 major scales and 12 minor scales. Get familiar with scales. This is what my notes on scales look like.

Try working out the notes of a few of the major and minor scales, say the G-major scale, the D-minor scale, or any scale you want. Do this and I promise it will help you understand scales much better.

Now that we know what scales are, we will talk about chords in the next post. And in a few later posts, I will write more on music explaining the basic concepts.

Share this post with people you think are interested in music. Comment if you would like to discuss more on this or have any questions.

Common docker commands

2017-11-23T08:00:00-02:00

I recently got to know about Docker, and I love it. For those who don't know what Docker is, here it is: Docker helps you develop software in isolated environments.

Say you are building an application named epsilon-X. epsilon-X relies on packages like numpy, scipy, pandas, and on other services and software. Either you install each of these packages on your machine, or you create an environment that has all the required packages in suitable versions. Docker helps you create that environment so that you don't face issues like "it works in the dev environment but fails in the production environment".

Docker ensures that you don't have to worry about the versions of the packages needed for software development. Many times we face issues like "it works on my machine, not sure why it is not working on yours". The reason is often a difference in package versions between the machines. Docker helps you abate such issues.

Docker automates the repetitive tasks of setting up and configuring development environments so that developers can focus on their core task of building software.

So basically docker is like an isolated machine (read environment) that has all the required packages and services for your software to run.

Some docker associated terms

Docker file

A Dockerfile is a few lines of code that list the packages and software to install. The Dockerfile is used to build the Docker image, and containers are started from that image. Many containers can run in isolation from one another. Think of each container as a bundle of the packages and software required to run the application you want to run.

Containers

Containers are like a group of packages bundled together. Each container has its own space and does not reach outside of it. A Docker setup can have many containers, each container having a different bundle of packages.

I have already talked about the use cases of docker but if you still want to know more about docker, there are a plethora of resources available online. Let us now talk about some of the important docker commands.

Common docker commands

docker-compose up

Once you have the Dockerfile ready, you can build the images and bring up the containers by typing

docker-compose up -d --build
# Make sure you run the above command from the folder that has your docker-compose.yml file

If you just want to bring up the containers (once your images are built),
docker-compose up
docker-compose up -d  # ups the containers in background by passing -d flag

docker images

Lists all the Docker images available on your machine. Note that it lists images, not running containers; use docker ps for those.

docker images

docker ps

Lists all the containers that are currently running.

docker ps

docker-compose down

This will shut down the containers

docker-compose down

docker restart

This will restart a container. You need to pass the name of the container you want to restart, for example:

docker restart docker_worker_1

docker logs

If you want to get the logs of processes running on the worker or master, you can do so by using

docker logs -f docker_worker_1
docker logs docker_master
# The -f flag keeps following the log output as new lines arrive.

If you only want the recent logs, you can do so by passing an additional flag. Something like:

docker logs -f docker_worker_1 --since 60m
# gives you the logs from the last 60 minutes

docker exec

If you want to talk to one of the machines, say you want to get into the bash shell of docker_master, you can do so by typing

docker exec -it docker_master bash
# This will take you to the bash of the master

I recently wanted to know which Python packages were installed in my docker_worker_1 container. You can find out by typing

docker exec docker_worker_1 pip list

docker cp

If you want to copy files from a container, say docker_worker_1, to the host, you can do so using docker cp, something like this:

docker cp docker_worker_1:<file-path of source>  <folder-path of destination>

example: docker cp docker_worker_1:/tmp/some-folder/some-file.txt .

Obviously, the above list is not exhaustive. There are many more commands, but these are the ones I have used most so far. Let me know in the comments what other commands you use.

Yaadon ki chuskiya

2017-07-12T00:00:00-03:00

Recently, my friend Tushar came to Bangalore to visit Rahul and me before he left for the US for his Masters. He was here for a weekend. We intended to visit places around Bangalore, and we ended up visiting Lal Bagh. I know you are judging me right now. Who goes to Lal Bagh with friends! I had the same thought until I visited this place. Lal Bagh is beautiful and serene. This place is so under-rated. Go there in the morning and you will love it.

Without digressing further, let us talk about how we managed to write a song, play guitar to it, record it on a phone, and upload it to YouTube. On a Saturday evening, the four of us were getting bored. We had just come back from enjoying our tea at the tea-stall outside my apartment. Tea is important here. See, we were already bored. We didn't have any plans for the night either. All of us were seated in the room when we thought of writing a song. Ahh, it may sound a little weird, but I have done this before. I like writing song lyrics at random times.

So, we first decided what the theme of the song should be. Without much discussion, we all agreed on friendship. Since Tushar was leaving for the US, all our thoughts started moving in that direction. How would he feel a year from now, seated in a room far from his country? Rahul was the one mostly coming up with the important lines. Dipu and I were mostly on the guitar, deciding the music and the lead. Tushar was mostly confused amidst all this, fidgeting with his mobile. He ended up playing the egg-shaker.

The song is in the key of C, but it starts with a G chord, which gives it a different sound. You will see that most Bollywood songs in the key of C start with the C-major chord or its relative minor, A-minor. And this sounds repetitive. So I thought of starting with G-major.

This is what we came up with in four hours. Here is the recording of the song - Yaadon Ki Chuskiya.

Verse 1

Chuskiya, chai ki chuskiya...
Chutkiya, yaaron ki chutkiya...
Khushiya, chhoti chhoti khushiya
Humm once...

Chorus 1

Yaad aati hai...bas yaad aati hai...
Yaad aati hai...bas yaad aati hai...

Humming

Hmmm...la la la la la...
Hmmm...la la la...

Verse 2

Khoya...kyu hoon main khoya
Roya ...kyu hoon main roya
Akela...kyu main akela
Humm once...

Chorus 2

Savera... hai andhera
Pura... phir kyu adhura

Savera... hai andhera
Pura... phir kyu adhura

Humming

Hmmm...la la la la la...
Hmmm...la la la...

Lead

Play the lead

Chorus 1

Yaad...aati hai...bas yaad aati hai...
Yaad...aati hai...bas yaad aati hai...

Humming

Hmmm...la la la...
Hmmm...la la la...

Cheers,

Manish

Top lessons I learned from 3 years at Walmart

2017-06-05T00:00:00-03:00

Having graduated from college, I joined WalmartLabs as a Statistical Analyst in 2014. We were a batch of four freshers, all from different colleges: one from IISc Bangalore, two from ISI Kolkata, and I from IIT Kharagpur. We were the first batch of freshers to join the data science team. There were three seniors to guide us through the on-boarding formalities and help us get to know the Walmart systems.

I clearly remember the first few days at the office. I used to be super excited every day. I used to feel like there was so much to learn. There was a spark of learning that kept me excited. I worked at WalmartLabs for close to three years. People say the first company is always special, and I won’t deny this. I enjoyed my days at WalmartLabs. The stay here was packed with learnings, great friends, and wonderful experiences.

Today, as I look back on how the last three years have passed, I ask myself: What could I have done better? What were a few things that I took way too seriously? What were my learnings?

Below are my learnings from the 1000 days I have spent working in a corporate job.

Don’t hesitate to ask questions.

When I joined as a fresher, there were so many terms I had never heard of. WFH, WebEx, EDLP, EDLC were alien to me. Technology-wise, I had never worked with R. But that never discouraged me from asking questions whenever I got stuck. And the earlier you clarify these questions, the better for you. You don’t want to be asking fresher-level questions a year into your employment. Moreover, asking questions clarifies concepts. So, ask as many questions as possible.

New phase. Write a new story.

This is a new phase of life after college. We are all excited about it. This is a chance for you to become a better version of yourself. You are surrounded by new people. You have a fresh page to start with. You can write a new story or continue with the story you were living in college. If there is something you want to change in your life, this is the time, when everyone around you is new. You don’t have your old crowd to drag you down. It’s a great opportunity to overcome any challenges or weaknesses from your past.

Your 9-to-5 job is not enough for your learnings.

Do not be under the impression that your project will teach you everything. There will always be gaps in your learnings. Bridge these gaps with your personal projects. I participate in hackathons, read books, write code for personal projects, and then blog about them. This keeps me motivated. Choose what you truly want to learn. This will excite you and keep you on top of the learning curve.

First impressions are exaggerated.

First impressions are exaggerated, be it a good one or a bad one. Let time pass and then decide. The bad thing about first impressions is that people forget to update them.

Don’t get busy in 9 to 5 cycle.

It is very easy to get comfortable in the 9–5 daily cycle and the weekly Monday-to-Friday cycle. Time flies week by week and you won’t even realize that you completed a year of your job. Learn new skills — be it boxing, guitar or anything you always wanted to learn. Weekends should not be looked upon as your lazy days.

Save money. Don’t waste all your money on beers, drinks, and getting yourself pampered.

After a year on the job, I realized I had not saved enough. Saving is a practice. Inculcate it. It is up to you how much you want to save. It could be as simple as opening a recurring deposit account.

Your manager should be aware of all your work.

Make sure you and your manager connect frequently to discuss the work you are doing. Always ask for feedback, and ask whether there is any area where you need to improve. If your manager doesn’t have an answer to your questions, ask a senior you trust and respect.

You are the average of the 5 people you spend the most time with.

If there is one mantra in life I follow whole-heartedly, it is this one. You are the average of the five people you spend the most time with. So choose your group wisely.

Be the go-to person.

No matter how many blogs or books you read, if you don’t solve real problems you won’t learn. Try to solve as many problems as possible. Become an expert in your skill. You should be seen as the go-to person for this topic by people in your office. This will expose you to new problems and enhance your learnings.

Don’t get excited by hike. Let your work excite you.

This is easier said than done. I would be lying if I said that I always abide by this. I try to find peace and excitement in the work I am doing. Always ask these questions: Why am I doing this? How will this help in the bigger picture? Try to relate yourself to the learnings from the work you are doing. If you are interested in your work, a hike will come automatically as a byproduct. Never work just for hikes.

Give chance to your juniors. Always.

Once you are a senior, you have the added responsibility of nurturing your juniors. Pass on your learnings to them. Don’t look upon your juniors as a resource to do your shittiest work. Once they become seniors, they will realize this and hate you for it. When presented with an opportunity, give your juniors the chance to present the solution.

Find a mentor. The right one.

A mentor is one who knows your problems, has faced similar problems in the past, and has answers to your problems. There will be times when nothing works in your favor. You want to come out of the muddle, but you don’t know how. Your mentor will be your hero. I was fortunate to have some of my seniors, Jeeban, Pralabh, and Issac, as the best mentors. You find yours. The right ones. Not the ones who become your mentor just for the sake of the company’s policy.

This is obviously not an exhaustive list, but for now this is what I have. I will keep visiting this page to add more of my learnings. Please share your thoughts in the comments. Share it with people who you think would enjoy reading this.

The essence of music in my life

2017-05-30T08:00:00-03:00

Guitar is cool. I recall from my childhood how seeing someone play guitar in a movie inspired great admiration for the character playing it. Remember Hrithik in the movie Kaho na pyaar hai? I always wondered how one plays this instrument.

I loved listening to songs. Sonu Nigam, KK, Kunal Ganjawala, and Neeraj Sridhar (Wo chali wo chali, remember this song?) were a few singers who kept me hooked to my CD player. I used to remember even the tough lyrics of a few Punjabi songs, like Kunal Ganjawala's Channa ve ghar aa jaa ve, and then sing the song with classmates during recess the next day.

My mother had this talent of identifying a singer just by listening to the singer's voice. At that point of time, this seemed like a huge talent to me. I was in awe of it. I would play a song in the TV room, run to my mother in the kitchen, and ask her who the singer was. I don't remember a single time when she flunked my tests.

So, today when I look back I realize music was something I really enjoyed. Sadly, I was not aware of this then. The idea of music was alien to me. I had never looked at it the way I look at it today.

The closest I could get to music was singing in the shower, and that too when my father was not around. Singing in front of elders was considered rude. I don't know the reason. Even today, I don't sing in front of my father. Mum is cool. I used to listen to and enjoy songs a lot, often while solving Mathematics problems on simultaneous equations or finding the area of a hexagon. Music soothed me. Solving Mathematics problems while listening to music was heaven for the child in me. The best time of the day.

Days passed and the adolescent me got involved in other important responsibilities like attending classes, solving Irodov and DPPs (Daily Practice Problems), and preparing for JEE. I was not at home now. I had shifted to Kota, far far away from home. Now I was a fan of Atif's songs. Doorie and Pehli nazar mein were a few of the songs to whose tunes I used to solve DPPs: finding the maximum range of a projectile, identifying the equation of a plane satisfying a few constraints, or finding the entropy of a system.

Life used to start with breakfast at TTS (Tina Tea Stall) in Vigyan Nagar, continue with classes at Gaurav Tower and a masala pattiz (a type of quick snack) on the way back, and end with solving the DPPs for that day. This was the routine. My time in Kota was the learning years of my life. I learnt so much about life, responsibilities, and people from my stay at Kota. Life had only one aim: crack the IIT JEE exam.

In 2009 I got admitted to IIT Kharagpur, Maths and Computing. Mathematics was always with me. The place changed, the people changed. The world here was different. The first year passed in getting acquainted with the place. We were masters of our world. Proxy, bhaat (talking to your friends on the least important topics for hours), mid-sem exams, and most importantly peace had become part of my life.

In the second year, there was an audition for the music club in the hostel. I was naive enough to think that I sang pretty decently. So I too went for the audition. I failed miserably. There were candidates far better than me. I came back from the audition, talked to my friends, headed to the canteen for tea, and the audition chapter was closed.

It was the winter of 2013 when my brother Mukesh had a winter break and asked me to recommend what he could do for a month. Without much thought, I suggested that he get a guitar. And he did. My parents tried to intervene, saying his studies would get affected. Parents, I tell you, they care a lot about your future. I told my mother that either they get him the guitar or I would buy him one from my stipend. And they agreed.

Mukesh tried his hand at learning guitar for a month. He joined a tutor. But he didn't enjoy it much. Guitar is not everyone's cup of tea. He was more into photography. So later, I got him a DSLR. He keeps his collection of photos on Flickr. You can check out his photos at Mukesh Barnwal Flickr.

My other brother, Mitesh, had a knack for music. He used to hum songs a lot. The best one I remember him singing is the song Kaho na pyaar hai. Now, the guitar belonged to Mitesh. He took it to his college in Jalpaiguri. He joined a teacher there, and slowly he picked up the tabs and the chords. He even auditioned for the music club in his college and got selected.

In no time he started singing songs while playing guitar. I still remember how amazed I was when I first listened to his recorded audio on WhatsApp. I was proud of my brother. I was elated that he had learnt something that a lot of people would love to learn but fail to invest the time in. You can listen to some of his songs at Mitesh Barnwal Youtube.

In June 2014, after completing my 20-day trip in Uttarakhand and Chandigarh, I was at home for another 20 days waiting to join Walmart Labs, my first job. I had no idea what to do for these 20 days.

One fine morning, while Mitesh was practicing on his guitar, I sat next to him and watched him play. I asked him if I could try my hand at the instrument. He agreed. I asked him to teach me something. He smiled. Nevertheless, he explained a few things to me: what the strings are named, what each part of the guitar is called, and a few other details.

I further enquired about the effective way to learn the guitar. He had a few downloaded tutorials on learning to play guitar. During those few days, I did a few of those tutorials. My guitar sounded pathetic. The sound was not something one would call music. My brother explained, it takes time before you create something soothing. I used to listen to whatever he would say. I have a huge respect for a person who has learnt something from scratch. It shows that the person really enjoys it and has taken the pain to go through the learning curve.

I continued practicing the guitar for those 20 days. Sometimes my mother would enter the room and would ask with a smirk on her face - You too want to learn the guitar, son? Ohh then we would have two guitars!. My mother hates anything that takes up space in the room. She always had a dislike for guitar as well, because of the space issue. Not so much though, once she got impressed by my brother's performance on guitar.

After those 20 days, I landed in Bangalore to join the corporate world. Here too, I had a friend who used to play the guitar. I started using his guitar. After a month, I got my first salary credited to my account. The next day my friend Dipu and I headed to Furtados, a music shop in Koramangala, to get a guitar. I got an Epiphone.

I started taking lessons at justinguitar.com. This site is the best online resource available for a beginner. After a few months, I realized I needed a tutor to learn faster. I have been playing guitar ever since.

The one thing I used to crave after work was to sit down and practice tabs and chords on the guitar. I have been playing this instrument for the last 2.5 years now.

I am in love with the guitar. When I am happy I like to play guitar, when I am tense I like to play scales, when I am stressed I like to play something rock. It is very rare, if ever, that I don't feel like playing guitar. I hope I continue playing with the same rigor in the coming years and the love for this beauty never abates.

I hope you enjoyed reading this. I will write more on guitar and music in the coming few days.

How to choose the probability cut-off in classification problem

2017-05-18T08:00:00-03:00

Yesterday, I was taking a session on Data Science for a few of my colleagues. The aim was to give a brief overview of machine learning. There were two of us taking the session. We had a rough idea of what we wanted to cover in the two-hour session.

I started the session with what machine learning is, followed by the types of learning, supervised and unsupervised, and the examples that fall into each of these.

What is machine learning?

Machine learning helps the machine learn from the data: it picks up the patterns in the data and uses them to predict future values. There is a true function that maps the inputs to the output values. Machine learning is the process that helps to estimate that function. We try to find a proxy function that is as close as possible to the true function.

The essence of machine learning is function estimation.

If you are interested in reading more on this, see this post: The essence of machine learning is function estimation.

We then moved on to explaining the simplest technique in machine learning, linear regression. After that we covered classification problems, where you have to classify an observation into one of the pre-defined classes. We talked about logistic regression, and explained how in linear regression we regress the response variable on the independent variables, whereas in logistic regression we regress the probability that the response variable belongs to class 1.

In a binary classification problem, logistic regression gives you the probability of belonging to class 1. If you want the probability for class 0, you just subtract this probability from 1. So what you have right now is the probability of falling in class 1.

What you want is the class into which each observation falls. How do you convert these probabilities into classes? I posed this question to my colleagues. One of them said this should come from business knowledge: how strict or lenient do you want your model to be? Some said they would take any value greater than 0.5 as belonging to class 1, or else class 0. And a few others were still contemplating.

I am assuming you understand what TPR and FPR mean. If not, you may want to visit this post - TPR, FPR, ROC, and AUC.

By this time, I had already explained the ROC curve and the confusion matrix to them. We went back to the ROC curve and I explained how it gives you the true positive rate and the false positive rate corresponding to each probability cut-off. The graph looks something like below.

How do you generate the above graph?

There are functions in R that can give you this plot in a single line. However, for the sake of doing it, I have written the below code that generated the above plot.
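Here is a minimal sketch of how such a plot can be built by hand. It is an illustration on simulated data, not the exact code behind the plot above, and the names actual, prob, rates_at and roc_points are made up for this example.

```r
# Simulated example: 'actual' holds the true classes, 'prob' the predicted P(class 1)
set.seed(42)
actual <- rbinom(200, size = 1, prob = 0.4)
prob   <- ifelse(actual == 1, rbeta(200, 4, 2), rbeta(200, 2, 4))

# TPR and FPR for a single probability cut-off
rates_at <- function(cutoff, actual, prob) {
  predicted <- as.integer(prob >= cutoff)
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  c(cutoff = cutoff,
    tpr = tp / sum(actual == 1),
    fpr = fp / sum(actual == 0))
}

# One row per cut-off, from 0 to 1 in steps of 0.01
cutoffs <- seq(0, 1, by = 0.01)
roc_points <- as.data.frame(t(sapply(cutoffs, rates_at, actual = actual, prob = prob)))

plot(roc_points$fpr, roc_points$tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate", main = "ROC curve")
abline(0, 1, lty = 2)  # reference line for a random classifier
```

Each row of roc_points is one point on the curve, so you can also read off the TPR and FPR for any cut-off you are considering.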

Choosing the probability cut-off

Once you understand the ROC curve, we can use this plot to choose the probability cut-off. You pick some probability cut-offs, say from 0.5 to 0.9 in increments of 0.05, and calculate the TPR and FPR corresponding to each one.

You have to decide how much TPR and FPR you want. There is a trade-off between the two: if you want to increase TPR, your FPR will also increase. So depending on whether you want to detect all the positives (higher TPR) and are willing to incur some error in terms of FPR, you decide the probability cut-off.

Many a time you may want to choose the probability cut-off that gives you the maximum accuracy. However, care should be taken when the response column is skewed. For instance, a bank wants to predict the loan defaulters.
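To see why accuracy can mislead on a skewed response, here is a small sketch with made-up numbers: a hypothetical loan book where 5% of customers default, and a "model" that never predicts a default.

```r
# Hypothetical loan book: 1 = defaulter, 0 = non-defaulter (5% defaulters)
actual <- c(rep(1, 50), rep(0, 950))

# A "model" that labels everyone a non-defaulter
predicted <- rep(0, length(actual))

accuracy <- mean(predicted == actual)                        # share of correct labels
tpr <- sum(predicted == 1 & actual == 1) / sum(actual == 1)  # share of defaulters caught

accuracy  # 0.95
tpr       # 0
```

The accuracy looks great, yet the model does not catch a single defaulter, which is why TPR and FPR matter more than raw accuracy in such cases.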

Markdown commands I use frequently

2017-05-15T08:00:00-03:00

This post captures the list of commands I use frequently while writing a post for this site in Markdown. You can write a post both in Markdown format and in HTML format. Markdown is preferred when you want a free-flowing writing style. HTML is used when you want more control over how the post is laid out.

I normally write in Markdown format because I have to write a lot of content. There is not much to do with image placement or ordering of the content. However, many a time I have to resort to HTML as well. For instance, the image that I have embedded in the 'About' tab is written in HTML format because I want the image to float to the right of the text. And it is easy to do this using HTML's image tag. Let us get started.

Inserting an image of appropriate size

<img src="url" alt="some_text" style="width:width;height:height;">

Use the HTML <img> element to define an image
Use the HTML src attribute to define the URL of the image
Use the HTML alt attribute to define an alternate text for an image, if it cannot be displayed properly
Use the HTML width and height attributes to define the size of the image
Use the CSS width and height properties to define the size of the image (alternatively)
Use the CSS float property to let the image float

<p>
<img src="http://manishbarnwal.com/images/author/high.jpeg" alt="Photo of author" style="float:right;width:128px;height:128px;">
I am a Senior Statistical Analyst at <b>@WalmartLabs</b>. I am a graduate from IIT Kharagpur with a Masters in Mathematics and Computing. I joined @WalmartLabs in July 2014 where I have been working mostly on supply chain projects and have been deploying machine learning models across Hadoop cluster.
</p>

For instance, the above code would produce the below result.

I am a Senior Statistical Analyst at @WalmartLabs. I am a graduate from IIT Kharagpur with a Masters in Mathematics and Computing. I joined @WalmartLabs in July 2014 where I have been working mostly on supply chain projects and have been deploying machine learning models across Hadoop cluster.

Inserting an image without much change in layout

Tutorial on dplyr- a package for data manipulation in R

2017-05-15T08:00:00-03:00

R is one of the most used tools in data science. It has no dearth of packages for specific use cases. There are three packages that I feel can get most of your work done: ggplot2, dplyr, and data.table.

ggplot2- Used for visualization. Also known as grammar of graphics. This package is used to plot graphs. The syntax is intuitive and easy to learn.
dplyr- Used for data manipulation. Also known as grammar of data manipulation. Most data munging tasks get done easily using this package.
data.table- Used for large files. You can read huge files within seconds. The data manipulation library for larger datasets.

In this post we will focus on learning the dplyr package. dplyr is a fast tool for manipulating data-frame-like objects, both in memory and out of memory. Let us get started. We will use the hflights dataset to demonstrate the functions and syntax of dplyr.

We first need to install dplyr and hflights if you don't already have them in your R environment. You can install these using:

install.packages("dplyr", dependencies = T)
install.packages("hflights", dependencies = T)

dplyr is a package which is referred to as a grammar of data manipulation. hflights is a package containing the data frame hflights, which has details of all flights departing from the Houston airports (IAH and HOU) in the year 2011.

Load the installed packages and data using:

library(dplyr)
library(hflights)
data("hflights")

Get an idea of the data we will be working on by looking at its structure and printing the first few rows of the data.

str(hflights)
head(hflights)

Let me introduce tbl_df. Look at the code below.

flights = tbl_df(hflights) # newer dplyr versions deprecate tbl_df(); as_tibble(hflights) is the replacement
flights # prints only 10 rows and only as many columns as fit the screen, for easy viewing

print(flights, n=20) # you can also specify how many rows you want to display

tbl_df creates a local data frame. It is a wrapper around the original data frame that prints nicely. tbl_df introduces a new data-frame-like structure called tbl.

A tbl is still of class data frame, so any function that we use with a data frame can be used with tbl objects; it is just more convenient to print and manipulate.

I rarely use tbl objects myself. I am comfortable with the data frame class, and everything in the rest of the tutorial works on plain data frames as well. Below is a list of some of the important functions of dplyr.

data.frame(head(flights)) # convert back to data.frame to see all the columns

filter: Keep rows matching criteria

Example: View all flights on January 1

Base approach

flights[flights$Month==1 & flights$DayofMonth==1, ]

dplyr approach

filter(flights, Month==1 & DayofMonth==1) # Return rows with matching conditions.

unique(flights$UniqueCarrier)
filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="AS") # use | (vertical bar) for an OR condition

select: Pick columns by name

Example: Select data for only UniqueCarrier, Distance, AirTime columns

Base approach

flights[, c("UniqueCarrier", "Distance", "AirTime")]

dplyr approach

select(flights, UniqueCarrier, Distance, AirTime) # select() keeps only the variables you mention

To get contiguous columns, use the colon (first:last). To match columns by name, use starts_with(), ends_with() and contains()

select(flights, Year:DayOfWeek, starts_with("Taxi"), ends_with("Time"))

Chaining or Pipelining

The %>% operator lets you write multiple operations as a chain. The output of one operation becomes the input of the next.

Let us try to understand this operator using an example. Say we want to select UniqueCarrier and DepDelay columns and filter only rows having delays over 60 minutes.

Normal approach

filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)

# nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes

Chain operator approach

flights %>%
        select(UniqueCarrier, DepDelay) %>%
        filter(DepDelay > 60)

arrange: Reorder rows

Example: Select UniqueCarrier and DepDelay columns and sort by DepDelay

Base approach

flights[order(flights$DepDelay), c("UniqueCarrier", "DepDelay")] # default is increasing

dplyr approach

flights %>%
        select(UniqueCarrier, DepDelay) %>%
        arrange(DepDelay)  # default is increasing/ascending


# use `desc` for descending
flights %>%
        select(UniqueCarrier, DepDelay) %>%
        arrange(desc(DepDelay))

mutate: Add new variables

Mutate adds new variables and preserves existing
Create new variables that are functions of existing variables

Example: Add new variable named Speed

Base approach

flights$Speed = flights$Distance/flights$AirTime * 60
flights[, c("Distance", "AirTime", "Speed")]

dplyr approach

flights %>%
        select(Distance, AirTime) %>%
        mutate(Speed = Distance/AirTime * 60)

The above code doesn't add the Speed variable to the flights data frame. To keep the new variable you need to assign the result back to flights. The code below does this.

# Adding 'Speed' variable to the table
flights = flights %>% mutate(Speed = Distance/AirTime * 60)

with and within

Perform R expressions using the items (variables) contained in a list or data frame
The within function will even keep track of changes made, including adding or deleting elements, and return a new object with these revised contents.

with is a generic function that evaluates expr in a local environment constructed from data.

with(flights, mean(ArrDelay, na.rm = T))
with(flights, plot(AirTime, Year))

flightsWith = with(flights, rm(Year)) # rm() returns NULL, so flightsWith is NULL

# note that expr in with() is evaluated only in the temporary environment built from the data

# with() returns the value of expr (which you can assign), but it never modifies the data frame itself

names(flights) # flights still has the 'Year' column

within is similar, except that it examines the environment after the evaluation of expr and makes the corresponding modifications to a copy of data (this may fail in the data frame case if objects are created which cannot be stored in a data frame) and returns it.

flightsTemp = flights
flightsTemp1 = within(flightsTemp, rm(Year))
names(flightsTemp1)
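The difference is easier to see on a tiny table. The sketch below uses a made-up three-row data frame (df, avg_speed and df2 are hypothetical names) rather than the flights data.

```r
# Small hypothetical data frame standing in for flights
df <- data.frame(Distance = c(100, 250, 400), AirTime = c(30, 50, 60))

# with(): evaluate an expression using the columns; df itself is untouched
avg_speed <- with(df, mean(Distance / AirTime * 60))

# within(): evaluate the expression, then return a modified copy of the data frame
df2 <- within(df, {
  Speed <- Distance / AirTime * 60
  rm(AirTime)
})

names(df)   # still has Distance and AirTime
names(df2)  # AirTime is gone, Speed has been added
```

So with() is handy for computing a value from the columns, while within() is the one to reach for when you want a changed copy of the data.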

summarise: Reduce variables to values

Example: Create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay

dplyr approach

flights %>%
        group_by(Dest) %>%
        summarise(avg_delay = mean(ArrDelay, na.rm = T))

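Another example: For each carrier, calculate the proportion of flights cancelled and the proportion diverted (multiply by 100 for a percentage)

# older dplyr versions: summarise_each(funs(mean), Cancelled, Diverted)
flights %>%
        group_by(UniqueCarrier) %>%
        summarise(across(c(Cancelled, Diverted), mean))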

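One more example: For each carrier, calculate the minimum and maximum arrival and departure delays

# older dplyr versions: summarise_each(funs(min(., na.rm=T), max(., na.rm=T)), ArrDelay, DepDelay)
flights %>%
        group_by(UniqueCarrier) %>%
        summarise(across(c(ArrDelay, DepDelay),
                         list(min = ~min(.x, na.rm = TRUE), max = ~max(.x, na.rm = TRUE))))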

Helper function n() counts the number of rows in a group
Helper function n_distinct(vector) counts the number of unique items in that vector

Example: For each day of the year, count the total number of flights and sort in descending order

flights %>%
        group_by(Month, DayofMonth) %>%
        summarise(flight_count = n()) %>%
        arrange(desc(flight_count))

SQL Joins

inner_join(x, y)

Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.

left_join(x, y)

Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

right_join(x, y)

Return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

semi_join(x, y)

Return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

anti_join(x, y)

Return all rows from x where there are not matching values in y, keeping just columns from x.

full_join(x, y)

Return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.
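A quick way to get a feel for these joins is to run them on two tiny tables. The sketch below uses made-up data frames x and y that share a key column called carrier; the values are purely illustrative.

```r
library(dplyr)

# Two tiny hypothetical tables sharing the key column 'carrier'
x <- data.frame(carrier = c("AA", "AS", "B6"), flights = c(10, 20, 30))
y <- data.frame(carrier = c("AA", "AS", "DL"), name = c("American", "Alaska", "Delta"))

inner_join(x, y, by = "carrier")  # AA, AS: rows present in both tables
left_join(x, y, by = "carrier")   # AA, AS, B6: B6 gets NA for name
right_join(x, y, by = "carrier")  # AA, AS, DL: DL gets NA for flights
full_join(x, y, by = "carrier")   # AA, AS, B6, DL
semi_join(x, y, by = "carrier")   # AA, AS, with the columns of x only
anti_join(x, y, by = "carrier")   # B6, with the columns of x only
```

Passing by explicitly avoids the message dplyr prints when it has to guess the join columns.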

There are many other functions in dplyr. For now, I have listed down a few of these. I will update this post soon.

Did you find the article useful? If you did, share your thoughts in the comments. Share this post with people who you think would enjoy reading this. Let's talk more of data-science.

The essence of machine learning is function estimation

2017-05-12T08:00:00-03:00

Machine learning is cool. There is no denying that. In this post we will try to make it a little uncool; well, it will still be cool, but you may start looking at it differently. Machine learning is not a black box. It is intuitive and this post is just to convey that.

If I give you this function f(x) = x^2 + log(x) and ask you to tell me what f(2) is, you will first laugh at me and then run away to do something important. This is trivial for you, right? If there is a function that maps inputs to outputs, then it is very easy to get the output for any new input.
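In R this really is a one-liner. The sketch below assumes log means the natural logarithm, which the post does not spell out.

```r
# The known function from the example; log() is the natural logarithm in R
f <- function(x) x^2 + log(x)

f(2)  # 4 + log(2), roughly 4.693
```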

Machine learning helps you get a function that can map the input to the output. How does it do it? What is this function? We will try to answer such questions in the paragraphs below.

Let us try to answer the above questions using a problem that can be solved using machine learning. Assume you are a technical recruiter. You have been running a recruitment firm for the last 3 years. Being tech savvy, you follow the latest trends in technology and you have come to know about machine learning. You understand that machine learning can be used to predict the future, given that you have data from the past.

You wonder how you can use it to predict the expected salary of a candidate given other factors. The first thing that comes to your mind: do you have the data? And you hear a pleasant yes!

You have the following data collected at individual level:

Age of the candidate
Gender of the candidate
Number of years of experience
Highest level education degree
College - Top notch, Average, normal
Current salary
Sector - IT, Finance, Electronics
Salary

And a few others. For now let us assume we have just these features and we want to predict the expected salary using these features. We have 3 years of data that has approximately 10,000 rows. So your dataset looks something like the below data:

So essentially we have seven independent features, X - age, gender, years of experience, highest level of education, college, current salary, sector and corresponding salary, Y. What we want is next time when we have a candidate, we would obviously have his age, gender, years of experience, and other features. What we won't have is his salary. And, we want to estimate this value.

There would be some function, say f that would map these X to the Y values. How do we find this function? We will use the 3 years of data we have - the training data.

We won't be able to find the actual function, the true function, f because we don't have all of the data in the world. You can't collect the entire dataset available. It is impossible. What we use is a sample of data from the population. And, we use this sample as our training data.

Many a times, there are some factors that can't be captured. The set of independent features that we have captured is not an exhaustive list. There would obviously be other features that will have an impact on the salary.

Say in our example of salary prediction, some factors like exclusive and exceptional knowledge of some rare topics may land a candidate exorbitant offers from a few companies. It is difficult to capture factors like these.

Now we understand why we can't have the true f. So we will try to get an estimate of f, say f^. We want this f^ to be as close as possible to the true f, i.e. a proxy for the true function. There will obviously be some error in estimating the true function, and we want to make this error as low as possible. How do we go about getting this f^, an estimate of the true function f?

We have the data, remember the 3 years of historical data which contained the X features and the corresponding Y values. This is called the training data and there is a reason why it is called training data. Because we use this data to train the underlying algorithms to get the estimated function f^.

You get that we use training data to get the estimated function, f^. But how do we do it? We try to minimize the error between the true salary, Y and the predicted salary, Y^ from the model. For now, understand that there is a way to minimize this error and get the estimated function.

The estimated function could be simple, such as a linear relationship between the salary and the features, or it could be a complex relationship that is not linear. Techniques such as linear regression and decision trees help you get the simple estimate or a more complex one, respectively.
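Here is a minimal sketch of the idea on simulated data. In real life the true function is unknown; here we invent one (the hypothetical true_f, with made-up coefficients 30 and 2.5) purely so that we can check how close the estimate gets.

```r
# Simulated training data: salary depends on years of experience plus noise
set.seed(1)
n <- 500
experience <- runif(n, 0, 20)                        # years of experience
true_f <- function(exp_years) 30 + 2.5 * exp_years   # hypothetical "true" function
salary <- true_f(experience) + rnorm(n, sd = 5)      # noise = factors we cannot capture

# Linear regression picks the coefficients that minimise the squared error
fit <- lm(salary ~ experience)
coef(fit)  # the estimates should land near 30 and 2.5

# The estimated function applied to a new candidate with 8 years of experience
predict(fit, newdata = data.frame(experience = 8))
```

The fitted model is the estimate f^: it never sees true_f, only the noisy training data, yet its coefficients end up close to the ones we invented.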

Once you have this estimate of function,

f(age, gender, years of experience, highest level of education, college, current salary, sector) ---> salary

you just pass in the X and you should get your Y. There, you have a machine learning model. And you know what you have done - you have just come up with a nice estimate of the true function.

Once you have this estimate, there are other questions that you might want to think over. How good an estimate is this function of the true function? What assumptions did you make to estimate this function? When would this estimate not be a good choice? I will try to answer these questions in future posts. For now, I hope you get the gist that the essence of machine learning is function estimation.

Time series and forecasting using R

2017-05-03T08:00:00-03:00

Time series forecasting is a skill that few people claim to know. Machine learning is cool. And there are a lot of people interested in becoming a machine learning expert. But forecasting is something that is a little domain specific.

Retailers like Walmart and Target use forecasting systems and tools to replenish the products in their stores. An excellent forecasting system helps every other stage of the supply chain. If you are good at predicting the sale of items in the store, you can plan your inventory count well. You can plan your assortment well.

A good forecast leads to a series of wins in the other pipelines in the supply chain.

Disclaimer: The following post is my notes on forecasting, taken while reading several posts by Prof. Hyndman.

Let us get started. First things first.

What is time series?

A time series is a sequence of observations collected at some time intervals. Time plays an important role here. The observations depend on the time at which they are collected.

The sale of an item, say turkey wings, in a retail store like Walmart is a time series. The sale could be at a daily or weekly level. The number of people flying from Bangalore to Kolkata on a daily basis is a time series. Time is important here. During the Durga Puja holidays, this number would be humongous compared to other days. This is known as seasonality.

What is the difference between a time series and a normal series?

Time component is important here. The time series is dependent on the time. However a normal series say 1, 2, 3...100 has no time component to it. When the value that a series will take depends on the time it was recorded, it is a time series.

How to define a time series object in R

The ts() function is used for equally spaced time series data. The data can be at any level: daily, weekly, monthly, quarterly, yearly or even by the minute. If you wish to use unequally spaced observations, you will have to use other packages.

ts() is used for numerical observations and lets you set the frequency of the data. ts() takes a single frequency argument. Some time series have multiple frequencies; in such cases we use msts() from the forecast package, which stands for multiple seasonality time series. I will talk about msts() in a later part of the post. For now, let us define what frequency is.

Frequency

When setting the frequency, many people are confused about what the correct value should be. The definition is simple: frequency is the number of observations per cycle. Now, how do you define what a cycle is for a time series?

Say you have the electricity consumption of Bangalore at an hourly level. The cycle could be a day, a week or even a year. I will cover what the frequency would be for the different types of time series.

Before we proceed I will reiterate this.

Frequency is the number of observations per cycle.

We will see what values frequency takes for different interval time series.

Daily data: There could be a weekly cycle or an annual cycle. So the frequency could be 7 or 365.25.

Some years have 366 days (leap years). So if your time series spans many years, it is better to use frequency = 365.25. This takes care of the leap years that may come up in your data.

Weekly data: There could be an annual cycle. Use frequency = 52, or frequency = 365.25/7 if you want to take care of leap years.

Monthly data: The cycle is one year. So frequency = 12

Quarterly data: Again the cycle is one year. So frequency = 4

Yearly data: frequency = 1

How about frequency for smaller interval time series?

Hourly: The cycle could be a day, a week or a year. Corresponding frequencies would be 24, 24 X 7, 24 X 365.25

Half-hourly: The cycle could be a day, a week or a year. Corresponding frequencies would be 48, 48 X 7, 48 X 365.25

Minutes: The cycle could be an hour, a day, a week or a year. Corresponding frequencies would be 60, 60 X 24, 60 X 24 X 7, 60 X 24 X 365.25

Seconds: The cycle could be a minute, an hour, a day, a week or a year. Corresponding frequencies would be 60, 60 X 60, 60 X 60 X 24, 60 X 60 X 24 X 7, 60 X 60 X 24 X 365.25
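To make this concrete, here is a small sketch that builds a few ts objects with different frequencies. The values are random numbers and the start dates are arbitrary; only the frequency argument matters here.

```r
set.seed(10)

# Monthly series with a yearly cycle: 12 observations per cycle
monthly <- ts(rnorm(36), start = c(2015, 1), frequency = 12)

# Quarterly series with a yearly cycle: 4 observations per cycle
quarterly <- ts(rnorm(12), start = c(2015, 1), frequency = 4)

# Daily series with a weekly cycle: 7 observations per cycle
daily <- ts(rnorm(28), frequency = 7)

frequency(monthly)    # 12
frequency(quarterly)  # 4
frequency(daily)      # 7

# Hourly data: observations per day, per week and per year
c(day = 24, week = 24 * 7, year = 24 * 365.25)
```

Printing monthly or quarterly shows how R lays the values out by month or quarter once it knows the frequency.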

You might have observed, I have not included monthly cycles in any of the time series be it daily or weekly, minutes, etc. The short answer is, it is rare to have monthly seasonality in time series. To read more on this visit monthly-seasonality.

Now that we understand what a time series is and how frequency is associated with it, let us look at some practical examples.

Some useful packages

forecast: For forecasting functions
tseries: For unit root tests and GARCH models
Mcomp: Time series data from forecasting competitions
fma: For data
expsmooth: For data
fpp: For data

We will now look at three examples of forecasting. Before that we need to install and load the fpp package.


install.packages('fpp')
library(fpp)

Three examples

Sale of beer in Australia


if (dev.cur() > 1) dev.off() # close any open graphics device so the plots use default settings
ausbeer # sale of beer at quarterly level
plot(ausbeer)
beer <- aggregate(ausbeer) # Converting to sale of beer at yearly level
plot(beer, main = 'Sale of beer at yearly level')

Sales of a group of pharmaceuticals


# a10 is a group of pharmaceuticals
a10 # Sale of pharmaceuticals at monthly level from 1991 to 2008
head(a10)
plot(a10, main = 'Sale of pharmaceuticals at monthly level')

Electricity demand for a period of 12 weeks at half-hourly level

head(taylor)
plot(taylor)

Fully automated forecast

plot(forecast(beer))

The blue line is the point forecast. You can see it has picked up the trend in the annual series. With the default settings of the forecast package, the inner shade is an 80% prediction interval and the outer shade is a 95% prediction interval.

Similar forecast plots for a10 and electricity demand can be plotted using

plot(forecast(a10))
plot(forecast(taylor))

Some simple forecasting methods

These are benchmark methods. You shouldn't use them. You will see why. These are naive and basic methods.

Mean method: The forecast of all future values is equal to the mean of the historical data. Use meanf(x, h=10).
Naive method: Forecasts are equal to the last observed value. This is optimal for efficient stock markets. Use naive(x, h=10) or rwf(x, h=10); rwf stands for random walk forecast.
Seasonal Naive method: Forecasts are equal to the last historical value in the same season. Use snaive(x, h=10).
Drift method: Forecasts are equal to the last value plus the average change. This is equivalent to extrapolating the line between the first and last observations. Use rwf(x, drift = T, h=10).
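The point forecasts of these four methods are simple enough to compute by hand. The sketch below does so in base R on a made-up quarterly series; it only mirrors the definitions above and does not produce the prediction intervals that meanf(), naive(), snaive() and rwf() return.

```r
# Hypothetical quarterly series: two years of observations
x <- c(10, 14, 12, 16, 11, 15, 13, 17)
h <- 4          # forecast horizon
m <- 4          # seasonal period (quarters per year)
n <- length(x)

# Mean method: every forecast is the historical mean
mean_fc <- rep(mean(x), h)

# Naive method: every forecast is the last observed value
naive_fc <- rep(x[n], h)

# Seasonal naive: the value from the same season in the last observed cycle
snaive_fc <- x[n - m + ((seq_len(h) - 1) %% m) + 1]

# Drift method: last value plus the average change per period
drift_fc <- x[n] + seq_len(h) * (x[n] - x[1]) / (n - 1)

mean_fc    # 13.5 13.5 13.5 13.5
naive_fc   # 17 17 17 17
snaive_fc  # 11 15 13 17
drift_fc   # 18 19 20 21
```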

Forecast objects in R

Functions that output a forecast object are:

meanf()
croston(): Method used in supply chain forecasting for intermittent demand, for example to forecast the number of spare parts required over a weekend
holt(), hw()
stlf()
ses(): Simple exponential smoothing

Once you train a forecast model on a time series object, the model returns an output of forecast class that contains the following:
Original series
Point forecasts
Prediction intervals
Forecasting method used
Residuals and in-sample one-step forecasts

A simple example on the beer time series

plot(beer)
fit <- ses(beer)
attributes(fit)
plot(fit)

Measures of forecast accuracy

MAE: Mean Absolute Error
MSE or RMSE: Mean Square Error or Root Mean Square Error
MAPE: Mean Absolute Percentage Error

MAE, MSE, RMSE are scale dependent.

MAPE is scale independent but is only sensible if y_i >> 0 for all i and y has a natural zero.
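These measures are straightforward to compute by hand. The sketch below uses made-up actual and predicted values; in practice accuracy() from the forecast package reports them for you.

```r
# Hypothetical actual values and forecasts
actual    <- c(100, 120, 130, 150)
predicted <- c(110, 115, 125, 160)
e <- actual - predicted                 # forecast errors

mae  <- mean(abs(e))                    # Mean Absolute Error
mse  <- mean(e^2)                       # Mean Square Error
rmse <- sqrt(mse)                       # Root Mean Square Error
mape <- mean(abs(e / actual)) * 100     # Mean Absolute Percentage Error

# Changing the units (multiplying everything by 1000) rescales MAE but not MAPE
mae_scaled  <- mean(abs(e * 1000))
mape_scaled <- mean(abs((e * 1000) / (actual * 1000))) * 100

c(mae = mae, mae_scaled = mae_scaled)      # 7.5 and 7500
c(mape = mape, mape_scaled = mape_scaled)  # identical values
```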

Test methods on a test set

ausbeer # is at quarterly level the sale of beer in each quarter.
plot(ausbeer)
beer <- aggregate(ausbeer) # Converting to sale of beer at yearly level
plot(beer) # plot of yearly beer sales from 1956 to 2007

beer_train <- window(beer, end = 1994.99) # data from 1956 till 1994
plot(beer_train)

beer_test <- window(beer, start = 1995) # data from 1995 till 2007
plot(beer_test)

a10Train <- window(a10, end=2005.99)
a10Test <- window(a10, start = 2006)

Simple methods for the BEER data

f1 <- meanf(beer_train, h=8)
f2 <- rwf(beer_train, h=8)
f3 <- rwf(beer_train, drift = T, h = 8)

plot(f1)
plot(f2)
plot(f3)

In-sample accuracy

This gives you the in-sample accuracy, which is not of much use on its own. It only tells you how well the model fits the training data. Chances are that the model will not fit the test data as well, so we should always look at the accuracy on the test data.

accuracy(f1)
accuracy(f2)
accuracy(f3)

Out-of-sample accuracy

accuracy(f1, beer_test)
accuracy(f2, beer_test)
accuracy(f3, beer_test)

Exponential smoothing

Exponential smoothing has been around since the 1950s.

fit1 <- ets(beer_train, model = "ANN", damped = F)
fit2 <- ets(beer_train)

accuracy(fit1)
accuracy(fit2)

fcast1 <- forecast(fit1, h=8)
fcast2 <- forecast(fit2, h=8)
plot(fcast2)

accuracy(fcast1, beer_test)
accuracy(fcast2, beer_test)

General notation

ETS stands for both (Error, Trend, Seasonal) and ExponenTial Smoothing.

ETS(X, Y, Z): 'X' stands for whether you add the errors to or multiply the errors on the point forecasts (additive or multiplicative).

'Y' stands for whether the trend component is none, additive, additive damped, multiplicative or multiplicative damped

'Z' stands for whether the seasonal component is none, additive or multiplicative

Some examples

The first letter, 'A' or 'M', stands for whether you add the errors to or multiply the errors on the point forecasts.
ETS(A, N, N): Simple exponential smoothing with additive errors
ETS(A, A, N): Holt's linear method with additive errors
ETS(A, A, A): Additive Holt-Winters' method with additive errors
ETS(M, A, M): Multiplicative Holt-Winters' method with multiplicative errors

There are 30 separate models in the ETS framework. However, 11 of them are unstable, which leaves 19 usable ETS models. So when you don't specify which model to use in the model parameter, ets() fits all 19 models and comes out with the best one using an information criterion (AICc by default).
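The count of 30 is just the number of combinations in the usual ETS taxonomy. The sketch below assumes that taxonomy: two error types, five trend types (none, additive, additive damped, multiplicative, multiplicative damped) and three seasonal types.

```r
# Component options in the ETS taxonomy (N = none, A = additive, M = multiplicative, d = damped)
error    <- c("A", "M")
trend    <- c("N", "A", "Ad", "M", "Md")
seasonal <- c("N", "A", "M")

# Every combination of error, trend and seasonal component
models <- expand.grid(error = error, trend = trend, seasonal = seasonal)
nrow(models)  # 2 * 5 * 3 = 30
```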

model1 <- ets(a10Train)
model2 <- ets(a10Train, model = 'MMM', damped = F)

forecast1 <- forecast(model1, h=30)
forecast2 <- forecast(model2, h=30)

There are many other parameters in the model which I suggest not to touch unless you know what you are doing. - Prof Hyndman

If you want to have a look at the parameters that the method chose, just type in the name of your model.

model1

You will see the values of alpha, beta, gamma. Also,

sigma: the standard deviation of the residuals

AIC: Akaike Information Criterion. AIC gives you an idea of how well the model fits the data. ets() fits all 19 models, compares them on the chosen criterion and gives you the model with the lowest value. The lower the AIC, the better the model fits.

AICc: Corrected Akaike Information Criterion

BIC: Bayesian Information Criterion

ets() function

Automatically chooses a model using the AIC, AICc or BIC (AICc by default)
Can handle any combination of trend, seasonality and damping
Produces prediction intervals for every model
Ensures the parameters are admissible (equivalent to invertible)
Produces an object of class ets

ets objects
Methods: coef(), plot(), summary(), residuals(), fitted(), simulate() and forecast()
plot() function shows the time plots of the original series along with the extracted components (level, growth and seasonal)

Automatic forecasting

Why use automatic forecasting?

Most users are not very expert at fitting time series models
Most experts cannot beat the best automatic algorithms. Prof. Hyndman accepted this fact for himself as well. He has been doing forecasting for the last 20 years.
Most businesses need thousands of forecasts every week/month and they need them fast. You have to do it automatically.
Some multivariate forecasting methods depend on many univariate forecasts.

Box-Cox transformations

Transformations to stabilize the variance: if the data show different variation at different levels of the series, then a transformation can be useful.

plot(a10)

As you can see, the variation increases with the level of the series, i.e. the variation is multiplicative. If we take a log of the series, we will see that the variation becomes more stable.

plot(log(a10), xlab = 'Time', ylab = 'Log of a10', main = 'log of a10')

A Box-Cox transformation is controlled by a parameter, lambda. Based on the value of lambda you decide whether a transformation is needed and which one.

lambda = 1 ; No substantive transformation
lambda = 1/2 ; Square root plus linear transformation
lambda = 0 ; Natural logarithm
lambda = -1; Inverse plus 1

Back-transformation

We must reverse the transformation (or back transform) to obtain forecasts on the original scale.
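Here is a minimal sketch of the standard Box-Cox formula and its inverse. The function names box_cox and inv_box_cox are made up for this example; the forecast package ships its own BoxCox() and InvBoxCox() functions.

```r
# Box-Cox transformation: log for lambda = 0, power transform otherwise
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Inverse transformation, used to bring forecasts back to the original scale
inv_box_cox <- function(w, lambda) {
  if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
}

y <- c(3, 5, 9, 20)
w <- box_cox(y, lambda = 0.5)   # square root plus a linear rescaling
inv_box_cox(w, lambda = 0.5)    # recovers 3, 5, 9, 20 up to floating-point rounding

box_cox(y, lambda = 1)          # y - 1: no substantive transformation
box_cox(y, lambda = 0)          # natural logarithm
```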

a10 # Sale of pharmaceuticals at monthly level from 1991 to 2008
plot(a10)
lam <- BoxCox.lambda(a10) # 0.131
lam
fit <- ets(a10, additive.only = TRUE, lambda = lam)  # 'additive.only = TRUE' implies we only want to consider additive models
plot(forecast(fit))
plot(forecast(fit), include=60)

ARIMA forecasting

R functions

The arima() function in the stats package provides seasonal and non-seasonal ARIMA model estimation including covariates
However, it does not allow a constant unless the model is stationary
It does not return everything required for forecast()
It does not allow re-fitting a model to new data
Use the Arima() function in the forecast package which acts as a wrapper to arima()
Or use auto.arima() function in the forecast package and it will find the model for you
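For reference, this is what the plain stats route looks like. The sketch fits an AR(1) model with arima() on a simulated series (the coefficient 0.7 is made up) and uses predict() instead of forecast(); it is not the forecast-package workflow recommended above.

```r
set.seed(123)
x <- arima.sim(model = list(ar = 0.7), n = 300)  # simulated AR(1) series

fit <- arima(x, order = c(1, 0, 0))  # ARIMA(1,0,0) from the stats package
coef(fit)                            # the "ar1" estimate should be close to 0.7

fc <- predict(fit, n.ahead = 5)      # stats route: predict(), not forecast()
fc$pred                              # point forecasts
fc$se                                # standard errors of the forecasts
```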

This post was just a starter to time series. I will talk more about time series and forecasting in future posts. I plan to cover each of these methods - ses(), ets(), and Arima() in detail in future posts.

Do you believe in yourself?

2017-04-03T08:00:00-03:00

Are you a confident person? The mind asked itself. I replied with an excited yes! The mind asked again, are you sure? This time I thought for a while, but after a pause, I again said yes.

Why was there a pause the second time?
We humans doubt our capabilities a lot. And what makes us doubt our strengths? A single failure sometimes makes you cringe. You feel you are worthless. Life before this failure seems all happy and confident. If you have been confident all your life, how can a single failure shake the temple of your confidence?

You know the reason? I will tell you. We humans fear failure. A lot. The thought of failure often overpowers the strength to try. We are taught from childhood that if you try something and it doesn’t see the light of success, then it is a bad thing. Why can’t we accept and celebrate failures? Now I am not talking about just any failure. I am talking about failure where you gave your best. You tried all your means. You presented the best version of yourself.

Recently, I was talking to one of senior colleagues. He has two beautiful daughters. On mundane days, our discussions mostly revolve around deep learning and the new world it is going to present. Today was different. It was a Friday evening.

We happen to discuss about life and the lessons it teaches us. How sports shape your character? How teamwork is not to be seen just on the field playing football. Teamwork is everywhere were more than one person is involved. A person having a team game on his resume gets a plus point according to him. I instantly realized the reason behind this. It is not just about playing in a team. If you play sports, the chances of you failing increases manifold. Every match is either a win or a loss (read failure).

It is important to embrace failure gracefully. That way, you won’t fear the thought of failure. A fear can haunt you only till you run away from it. The moment you face it, the fear will be gone. The fear doesn’t know what to do once you accept it.

Do you feel low when you fail at achieving a goal or fail at delivering a task? Of course! We all do. Imagine you are famished; the first few swallows of food are divine, right? That’s what success tastes like. It is tantalizing. On the contrary, failure is bland. There is always a feeling of something missing.

Now, if you are a confident person, you would have developed your confidence from some of your accomplishments in the past. Some examples of such accomplishments could be- you stood first in your class, you presented in front of a large audience despite all the fear of stammering and dry mouth, you learnt something from scratch and you are very proud of it.

Now the above examples are personal. Try to remember the tasks and achievements you are proud of. These achievements could be as small as fixing your bathroom door without the help of a carpenter. You fixed it on your own. Know that feeling when you do something uncertain all on your own? That is the feeling I am talking about.

How do these thoughts make you feel? You feel happy about it. A sense of pride erupts in your body. The neurons in your brain all fire at once like sky shot fireworks. Remember these accomplishments and make them your fuel to light you up. Fight back at failure. Roar back at life. You have all the ingredients for the perfect recipe to success.

Treat failure as another chapter of life. And life doesn’t end at just one chapter. It is a long story with an indefinite number of pages. Your life can’t be written in one book or 10 books. It is a saga of millions of books. Don’t stop at that chapter. You are the author of the next chapter as well. Decide what you want it to be.

So next time, when failure knocks at your door and gets in your home, welcome it. Let it sit in your home for a weekend. Talk to it. Try to understand it. And then one afternoon gather your ingredients to prepare the delicious meal of success. Failure will never eat that lunch and will jump off your balcony. Now that the failure is gone, begin the new chapter. You already know the title of that chapter: Believe in yourself.

Diving into H2O with R

2017-03-28T07:00:00-03:00

Do you understand the pain when you have to train advanced machine learning algorithms like Random Forest on huge datasets? When there is a factor column that has way too many levels? When the time taken to train the model is so long that you went to your pantry for snacks and came back, you are even done browsing 9gag, but your model is still training and the code is still running? Fear no more, we will talk about these problems and how we can address them.

We will first try to understand why it takes so much time to train models on huge datasets in R. And then the solution to this - build models using H2O in R.

Our laptop has multiple cores in it. Most of the laptops these days come with at least 4 cores. R by default uses only one of the cores of your laptop. Say your laptop has 4 cores, then the remaining 3 cores are unused or may be partially used by other processes that your computer is running. Using just one core would obviously be slower than if R could use multiple cores in parallel.
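
To make the single-core point concrete, here is a small sketch using base R's parallel package. This is not how H2O works internally; it only illustrates the gap between running tasks one after another and spreading them over worker processes. The function slow_square and the choice of two workers are made up for illustration.

```r
library(parallel)

detectCores()  # number of cores R can see on this machine

# A deliberately slow toy task (hypothetical, for illustration only)
slow_square <- function(i) {
  Sys.sleep(0.2)
  i^2
}

# Default behaviour: the tasks run one after another in a single R process
serial_time <- system.time(serial_result <- lapply(1:8, slow_square))

# The same work spread over two worker processes
cl <- makeCluster(2)
parallel_time <- system.time(parallel_result <- parLapply(cl, 1:8, slow_square))
stopCluster(cl)

identical(serial_result, parallel_result)  # same answers either way
serial_time["elapsed"]
parallel_time["elapsed"]
```

On a machine with at least two free cores, the second timing should come out noticeably smaller, though the exact numbers depend on your hardware.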

What if R could use the other cores as well? This would definitely make the R code run faster. The solution is H2O.

What is H2O?

H2O is The Open Source In-Memory, Prediction Engine for Big Data Science. H2O enables R to use the other cores of the laptop as well. R then runs on multiple cores of your laptop. Your laptop is now like a standalone cluster. The power of your laptop has just been increased by H2O. You know what your laptop is thinking right now - With great power, comes great responsibility.

Do you need to worry about how H2O converts your laptop to a cluster?

Not at all. It is as simple as running this - h2o.init(). We will come to this in a later part of the post. H2O was built keeping in mind that most of its users would be data scientists, and that it should be easy for them to build models without the hassle of worrying about distributing the code across the cluster. And the developers of H2O have achieved this with flying colors.

Initializing the H2O cluster

So we now understand what h2o is for and how it can make our life easier. Let us now talk about execution. You would first need to install the h2o package using

install.packages("h2o")

Next, we load the installed package and launch h2o from R.

library(h2o)
local_h2o <- h2o.init()

The above will initialize your h2o cluster. There are other parameters in h2o.init(), but for now let's go with the default settings. A few important parameters to h2o.init() are:

nthreads: Number of cores of the computer you want to use.
max_mem_size: The total RAM size allocated to the cluster.

If you are interested in the other input parameters that can be passed, you can read the help page by typing ?h2o.init in R.

Once you have initiated the cluster, check if your connection is working using h2o.clusterInfo(). You should see this line- R is connected to the H2O cluster along with other details of your initiated cluster. Now your cluster is ready and you are good to start your coding workflow in h2o.
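
Putting the two parameters and the connection check together, a start-up might look like the sketch below. The values (2 threads, 2 GB) are illustrative assumptions and should be adapted to your machine. The block is guarded so that it only tries to start a cluster when the h2o package is installed.

```r
# Illustrative settings (assumptions); nthreads = -1 would ask for all available cores
init_args <- list(nthreads = 2, max_mem_size = "2g")

if (requireNamespace("h2o", quietly = TRUE)) {
  library(h2o)
  local_h2o <- do.call(h2o.init, init_args)  # same as h2o.init(nthreads = 2, max_mem_size = "2g")
  h2o.clusterInfo()                          # confirm that R is connected to the cluster
  h2o.shutdown(prompt = FALSE)               # stop the local cluster when you are done
} else {
  message("h2o is not installed; run install.packages(\"h2o\") first")
}
```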

Working with H2O

There are two ways you can work with h2o: either with the flow, or by writing code in an R editor. The h2o flow is a user interface that can be accessed at http://localhost:54321 after you have initiated the h2o instance using h2o.init(). The other way is to write the code in R. The serious work gets done by writing the code in R.

I normally use h2o flow just to create the first base model to see how the base model is doing. H2o flow is easy to get started with. There is no coding involved here. The image below shows the look of h2o flow. You parse in your datasets using the Data tab in the top right. The Model tab gives you a list of models you can train on your parsed dataset. The flow is pretty easy to go through.

Almost all of the popular machine learning models are supported in H2O. Some of them are Deep Learning, Generalized Linear Models (GLM), Gradient Boosting Machine (GBM), K-Means, Naive Bayes, Principal Components Analysis (PCA), Principal Components Regression (PCR), Random Forest (RF) and a few others.

Interaction of R with the cluster

The data is not saved in the R workspace. All data munging occurs in the h2o instance. By the look and feel it is easy to believe that all the processing is taking place in R but that is not true.
You are now not limited by R's ability to handle the data but by the amount of memory allocated to your h2o instance.
When the user makes a request, R queries the server via the REST API, which returns a JSON file with the relevant information that R then displays in the console.

Demonstration on a dataset

We will try to understand the working of h2o using a classification problem. The problem is simple. The government of a country wants to understand if its citizens are happy or not.

There are various independent features like WorkStatus, Residence_Region, income, Gender, Unemployed10, Alcohol_Consumption and a few others. There is a response column called Happy. Your task is to classify a citizen as happy or not happy. I will write a follow up post that will involve the code for the above problem statement.

As always, feedback and comments are welcome. Share this with people who are interested in learning machine learning and data science. Let us talk more of Data Science.

An illustrated introduction to adversarial validation part 2

2017-02-16T08:00:00-02:00

In the last post we talked about the idea of adversarial validation and how it helps when your validation set results don't agree with your test set results. In this post, I will share the R code that implements adversarial validation. The data used is from the Numerai competition.

Loading required packages


library(randomForest)
library(glmnet)
library(data.table)
library(MLmetrics)
getwd()
dir()

Reading train and test data set


train <- fread("Data/numerai_training_data.csv")
train <- as.data.frame(train)
train$target <- as.factor(train$target)

str(train)
dim(train) # has close to 136000 rows and having no missing values
head(train)

test <- fread("Data/numerai_tournament_data.csv")
test <- as.data.frame(test)
dim(test) # has close to 150000 rows and having no missing values
head(test)

Creating the target variable to distinguish between train and test data


train$isTest <- 0 # assigning 0 for train and 1 for test data
test$isTest <- 1

Combining train and test data into a single data frame


combi <- rbindlist(list(train[, -51], test[, -1])) # removing 'target' from train data and 't_id' from test data
combi$isTest <- as.factor(combi$isTest)
combi <- as.data.frame(combi)
str(combi)

Train a classifier to identify whether data comes from the train or test set


logitMod <- glm(formula = isTest~. , data = combi, family = 'binomial')
summary(logitMod)
head(logitMod$fitted.values)

Predict on the training data to see which rows most resemble the test data


pred <- predict(logitMod, newdata = train, type = 'response')
head(pred)

trainData <- train
head(trainData)
trainData$predictTest <- pred

Sort the training data by its probability of being in the test set


trainData <- trainData[order(trainData$predictTest, decreasing = TRUE), ]

valIndx <- seq_len(floor(0.2 * nrow(trainData))) # top 20% of rows, i.e. the ones most similar to the test data
colsToKeep <- names(trainData)[!names(trainData) %in% c('isTest', 'predictTest')]

trainFinal <- trainData[-valIndx, colsToKeep]
valData <- trainData[valIndx, colsToKeep]

write.csv(trainFinal, 'trainfinal.csv', row.names = F, quote = F)

Build a random forest classifier to predict the 'target' variable


set.seed(1) # setting seed for reproducibility of the result

matX <- trainFinal[, -grep('target', names(trainFinal))]
response <- trainFinal[, 'target']
table(response)

rfMod <- randomForest(x = matX, y = response, ntree = 200, mtry = 7) # training randomForest model

Prediction on validation set


rfValPreds <- predict(rfMod, newdata = valData, type = "prob")[, 2] # predicted probability of class 1
head(rfValPreds)
LogLoss(rfValPreds, as.numeric(as.character(valData$target))) # LogLoss function from MLmetrics package

The validation set gives a LogLoss of 0.699. Let us see how this comes out on the test data set. For this, we will predict on the test data and upload the predictions to the site.
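
For readers who have not met the metric before, here is a minimal base R sketch of binary log loss. The function name log_loss and the toy vectors are my own, and I am assuming the MLmetrics function computes the same standard quantity. As a reference point, predicting 0.5 for every row gives log(2), roughly 0.693, so lower values are better.

```r
# Binary log loss; eps guards against taking log(0). The name log_loss is hypothetical.
log_loss <- function(y_pred, y_true, eps = 1e-15) {
  p <- pmin(pmax(y_pred, eps), 1 - eps)
  -mean(y_true * log(p) + (1 - y_true) * log(1 - p))
}

y_true <- c(1, 0, 1, 0)                       # toy labels

informed_loss <- log_loss(c(0.9, 0.1, 0.8, 0.3), y_true)  # confident and mostly right
baseline_loss <- log_loss(rep(0.5, 4), y_true)            # always predicting 0.5 gives log(2)

c(informed = informed_loss, baseline = baseline_loss)
```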

Prediction on actual test data


testPreds <- predict(rfMod, newdata = test, type = 'prob')
testPreds <- testPreds[, 2]

submission <- data.frame(t_id = test$t_id, probability = testPreds)
head(submission)
write.csv(submission, 'submission.csv', row.names = F, quote = F)

The predictions on the test data show a LogLoss of 0.694, which is close to that of the validation set (0.699). We can now hope to see similar results on both the validation set and the test set.

Did you find the article useful? If you did, share your thoughts in the comments. Share this post with people who you think would enjoy reading this. Let's talk more of data-science.

How to use Git and Github

2017-02-15T10:00:00-02:00

I took this course - How to use git and github - some time last year. This post is an amalgamation of the course notes and other tutorials I have completed in understanding git. I will talk about the most frequently used commands. If you are already confident of your git skills and want a more practical tutorial, you should head to this post - git and github for data scientists. Let us get started.

git init

Initialises an empty git repository in the directory. Creates a hidden folder .git in your directory. If you open the .git folder, you will see many sub-directories and files inside it. But you will hardly need to know what these files are. These folders are the guts of Git where all the magic happens.

git status

Gives you the status of the snapshot of your repository. Suppose you made some changes on your local disk; these changes don’t automatically get recorded in Git. git status tells you what has been modified since the last commit. It’s healthy to run git status often. Sometimes things change and you don’t notice it. The status of a file in your repository could be one of the following:

staged: Files are ready to be committed.

unstaged: Files with changes that have not been prepared to be committed.

untracked: Files aren’t tracked by Git yet. This usually indicates a newly created file.

deleted: File has been deleted and is waiting to be removed from Git.

In a nutshell, you run git status to see if anything has been modified and/or staged since your last commit so you can decide if you want to commit a new snapshot and what will be recorded in it.

Staging Area: A place where we can group files together before we “commit” them to Git. Once you add the files using git add, they move to the staging area. Staged files are files we have told Git are ready to be committed. The files listed here are in the staging area, and they are not in our repository yet.

To store our staged changes we run the commit command with a message describing what we’ve changed. After using git add, make sure to use git status to see which files are in the staging area and be sure that these are the only files you want to commit.

You can unstage files by using the git reset command.

git reset filename.txt

git reset merely unstages the files; the changes will still be there in your local working directory. So if you also want to get rid of these unstaged changes, you will have to use git checkout to restore the files to a committed version.

git --version

Gives you the version of git you are using.

Commits

Commits are the part of Git, a version control system (VCS), that lets you save meaningful changes to your code. It is because of commits that you will be able to go back to any of the previous versions of your code.

A commit is a snapshot of every file in your repository at the time of commit. Commits are Git’s way of saving versions, so to save two different versions, you would create two commits.

Commits with multiple files

You will often work with multiple files and not just one, and these files may or may not be directly related. Any collection of such files is called a repository. Git keeps track of all the changes made to each of the files, and these are carried forward, i.e. when you save a version of a file using commit, you save a version of all the files in that repository.

Git does not rename files when you save a new commit. Instead, Git uses the commit IDs to refer to different versions of the files, and you can use git checkout to access old versions of your files.

git commit -m "added new feature RMSE to the function"

git checkout

It is sort of like restoring a previous version.

Say, the commit ID of the most recent commit is 3884eab839af1e82c44267484cf2945a766081f3. You can use this commit ID to return to the latest commit after checking out an older commit.

Format of git checkout

git checkout 3884eab839af1e82c44267484cf2945a766081f3

How often to commit

Since you can choose when to make a commit, you might be wondering how often to commit your changes. It’s usually a good idea to keep commits small. As the diff between two versions gets bigger, it gets harder to understand and less useful. However, you don’t want to make your commits too small either. If you always save a commit every time you change a line of code, your history will be harder to read since it will have a huge number of commits over a short time period.

A good rule of thumb is to make one commit per logical change. For example, if you fixed a typo, then fixed a bug in a separate part of the file, you should use one commit for each change since they are logically separate. If you do this, each commit will have one purpose that can be easily understood. Git allows you to write a short message explaining what was changed in each commit, and that message will be more useful if each commit has a single logical change.

Reflect: Manual Commits

What do you think are the pros and cons of manually choosing when to create a commit, like you do in Git, vs having versions automatically saved, like Google Docs does?

Manually choosing when to create a commit like in Git:

Pros: You don’t have to wait for a fixed amount of time or number of changed lines before a commit is made. It is up to you to decide whether the code change you have made deserves its own commit.

Cons: Deciding when to commit is a little subjective, and different people will have different answers. The size of commits will therefore differ from user to user, and if you are not used to it, understanding the differences between commits can prove problematic.

Having versions automatically saved like Google Docs does:

Pros: You don’t need to worry about manually committing. You know your work gets saved automatically.

Cons: Too many or too few versions will be generated, depending on the settings for automatic saving.

Branching

When developers are working on a feature or bug they’ll often create a copy (aka. branch) of their code they can make separate commits to. Then when they’re done they can merge this branch back into their main master branch.

Branches are what naturally happens when you want to work on multiple features at the same time. You wouldn’t want to end up with a master branch which has Feature A half done and Feature B half done.

Rather you’d separate the code base into two “snapshots” (branches) and work on and commit to them separately. As soon as one was ready, you might merge this branch back into the master branch and push it to the remote server.

Remove all the things!

The git rm command will not only remove the actual files from disk, but will also stage the removal of the files for us. Now that you’ve removed all the required files you’ll need to commit your changes. Feel free to run git status to check the changes you’re about to commit.

Removing one file is great and all, but what if you want to remove an entire folder? You can use the recursive option of git rm. This will recursively remove all folders and files from the given directory.

git rm -r folder_of_cats

git log

Think of Git’s log as a journal that remembers all the changes we’ve committed so far, in the order we committed them. It gives you the list of all commits that have been made to the code, with the commit ID and the message associated with each commit.

Exiting git log: To stop viewing git log output, press q (which stands for quit).

Getting Colored Output: To get colored diff output, run git config --global color.ui auto

git log --stat

Gives you the statistics of all the commits that have been made, with information like which file changed and whether lines were added or deleted for each commit.

git push

The push command tells Git where to put our commits when we’re ready.

git push -u origin master

git push

The -u tells Git to remember the parameters, so that next time we can simply run git push and Git will know what to do.

Pulling Remotely

Let’s pretend some time has passed. We’ve invited other people to our GitHub project who have pulled your changes, made their own commits, and pushed them. We can check for changes on our GitHub repository and pull down any new changes by running this command.

git pull origin master

HEAD

The HEAD is a pointer that holds your position within all your different commits. By default HEAD points to your most recent commit.

git diff

Gives you the difference between the commits that you have performed in your code. If you want to understand the differences between 2 commits (say comNo.1 and comNo.2) you just do this:

git diff comNo.1 comNo.2

git diff

Without any extra arguments, a simple git diff will display in unified diff format (a patch) what code or content you’ve changed in your project since the last commit that are not yet staged for the next commit snapshot.

So where git status will show you what files have changed and/or been staged since your last commit, git diff will show you what those changes actually are, line by line. It’s generally a good follow-up command to git status

What is a README?

Many projects contain a file named “README” that gives a general description of what the project does and how to use it. It’s often a good idea to read this file before doing anything with the project, so the file is given this name to make users more likely to read it.

Cloning a Repository

There is a difference between downloading and cloning a repository. When you clone a repository, you don’t just download the files i.e. the latest commit file but the entire commit history as well. To clone a repository, run git clone followed by a space and the repository URL.

git clone

Example: Use the following url to clone the Asteroids repository: https://github.com/udacity/asteroids.git

git clone https://github.com/udacity/asteroids.git

Git Errors and Warnings Solution

Should not be doing an octopus

Octopus is a strategy Git uses to combine many different versions of code together. This message can appear if you try to use this strategy in an inappropriate situation.

You are in ‘detached HEAD’ state

HEAD is what Git calls the commit you are currently on. You can “detach” the HEAD by switching to a previous commit. Despite what it sounds like, it’s actually not a bad thing to detach the HEAD. Git just warns you so that you’ll realize you’re doing it.

Panic! (the ‘impossible’ happened)

This is a real error message, but it’s not output by Git. Instead it’s output by GHC, the compiler for a programming language called Haskell. It’s reserved for particularly surprising errors!

Takeaway

I hope these errors and warnings amused you as much as they amused me! Now you know what kind of errors Git can throw.

Git command review

Compare two commits, printing each line that is present in one commit but not the other.

git diff will do this. It takes two arguments — the two commit ids to compare.

Make a copy of an entire Git repository, including the history, onto your own computer.

git clone will do this. It takes one argument — the url of the repository to copy.

Temporarily reset all files in a directory to their state at the time of a specific commit.

git checkout will do this. It takes one argument — the commit ID to restore.

Show the commits made in this repository, starting with the most recent.

git log will do this. It doesn’t take any arguments.

Behavior of git clone

If someone else gives you the location of their directory or repository, you can copy or clone it to your own computer.

This is true for both copying a directory and cloning a repository.

If you have a URL to a repository, you can copy it to your computer using git clone. For copying a directory, you weren’t expected to know this, but it is possible to copy a directory from one computer to another using the command scp, which stands for “secure copy”. The name was chosen because the scp command lets you securely copy a directory from one computer to another.

The history of changes to the directory or repository is copied.

This is true for cloning a repository, but not for copying a directory. The main reason to use git clone rather than copying the directory is because git clone will also copy the commit history of the repository. However, copying can be done on any directory, whereas git clone only works on a Git repository.

If you make changes to the copied directory or cloned repository, the original will not change.

This is true for both copying a directory and cloning a repository. In both cases, you’re making a copy that you can alter without changing the original.

The state of every file in the directory or repository is copied.

This is true for both copying a directory and cloning a repository. In both cases, all the files are copied.

An illustrated introduction to adversarial validation part 1

2017-02-15T08:00:00-02:00

You've probably heard about cross-validation - a common technique used in the data science process to avoid overfitting and, many times, to tune parameters. Overfitting is when the model does well on training data but fails drastically on test data. The reason could be one of the following:

The model is trying to map the exact findings of training data to test data instead of generalizing the patterns.
The train data and test data are significantly different from each other i.e. they have not been derived from the same population.

The problem

We will try to understand the second issue. What is the problem with the second issue? If you have participated in Kaggle-like competitions, then you know the way these competitions work. You are given a training dataset and a test dataset. You train your model on the training data, predict on the test data and upload the predictions on Kaggle to get your rank.

What we typically do is divide the training data into train and validation data set. Validation data is used to get an idea of how your model will work on the test data. Now imagine if your train data and test data are different in terms of the population from where they've been derived. You won't see the same result in validation and test data. You see the problem here?

The use of validation data is to understand how the model is expected to perform on test data. But if train and test are not identically distributed, validation and test data would show different results.

The solution

The solution is adversarial validation. I got to know about this recently when I started participating in Numerai competition and read about this technique on fastml. Here's how it is done.

1. Build a classifier to distinguish between training and test data

Combine your train and test data into one dataset. Create a response variable, say isTest, and assign it 0 for all the rows in the training data and 1 for all the rows in the test data. Now your task is to build a classification model that will distinguish between the training and test data. This could be any classification model - logistic regression or random forest.

2. Sort the predicted probabilities of training data in decreasing order

Once you have the model built, use this model to predict on the training data. You will get the fitted probabilities. Sort the probabilities in decreasing order i.e. the row having highest probability of being classified as test data comes to the top.

3. Take the starting few rows as your validation set

The starting few rows are now those rows of the training data that most resemble the test data. Take the starting few rows, say 30%, as your validation set, and the remaining rows as the data to train your model on.

Now, the accuracy metrics on your validation set should be similar to those on the test data. If your model works well on validation data, it should work well on test data. If you are interested in the implementation of what we just talked about, head over to this post for part 2, where the code is written in R.
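
The three steps can be tried out on made-up data before touching a real competition. In the sketch below everything is synthetic and assumed for illustration: two features x1 and x2, a test set whose x1 is shifted by one unit, a logistic regression as the classifier, and a 30% validation share.

```r
set.seed(42)
n_train <- 500
n_test  <- 500

# Synthetic data: the test set has a shifted x1, so it is not identically distributed
train <- data.frame(x1 = rnorm(n_train, mean = 0), x2 = rnorm(n_train))
test  <- data.frame(x1 = rnorm(n_test,  mean = 1), x2 = rnorm(n_test))

# Step 1: label the origin of each row and fit a classifier on the combined data
train$isTest <- 0
test$isTest  <- 1
combi <- rbind(train, test)
adv_model <- glm(isTest ~ x1 + x2, data = combi, family = binomial)

# Step 2: score the training rows and sort by probability of being test data
train$p_test <- predict(adv_model, newdata = train, type = "response")
train <- train[order(train$p_test, decreasing = TRUE), ]

# Step 3: the top 30% becomes the validation set, the rest stays for training
n_val     <- floor(0.3 * nrow(train))
val_set   <- train[seq_len(n_val), ]
train_set <- train[-seq_len(n_val), ]

c(validation = mean(val_set$x1), training = mean(train_set$x1))
```

The validation rows end up with a higher average x1 than the rest, i.e. they sit closer to where the shifted test data lives.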

Did you find the article useful? If you did, share your thoughts in the comments. Share this post with people who you think would enjoy reading this. Let's talk more of data-science.

The curse of bias and variance [draft]

2017-02-08T08:00:00-02:00

Statistics is the field of study where we try to draw conclusions about the population from a sample. Why do we talk about sample? Why can't we get the conclusions about the population directly from the population? Let me illustrate this by an example.

Let us say we want to understand which brand of beer the people of Bangalore prefer. An interesting question. If I ask you this question, how would you approach this problem?

Can you go around asking each and every person their favorite beer? No, you can't cover each and every individual because the 'population' is huge. One thing you can do is ask your circle of friends about their beer preference and use that to get an overall idea of the population. But we have a problem with this analysis. Do you see the problem? Your estimate suffers from bias, or we say your sample is biased. A sample is biased when it is not random, i.e. there is some form of personal preference in the choice of data points.

In your case, chances are most of your friends will be of same age as you. Intuitively, I feel that your age also decides what kind of beer you like. Say, when I was in college I'd never heard of Corona and Kingfisher was the best beer I had tasted. So we can't estimate the best brand of beer that Bangalore prefers from the sample of your friends.

So population is a broader set of data that covers all the data points in the whole universe. Sample is a subset of that data. We try to infer the characteristics of the population from the sample or try to answer questions about the population from the sample. Getting or collecting population data is tough as explained in the above example (The favorite beer example).

I hope the concept of population and sample is clear now. It's normally not feasible to get the data for the complete population so we try to estimate parameters or findings of the population from a sample.
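
A quick simulation shows how far off a biased sample can be. All numbers below are invented for illustration: a population of 100,000 people aged 18 to 60, an assumed preference for some brand A that is stronger among people under 30, and a circle of friends aged 22 to 26.

```r
set.seed(1)
n_pop <- 100000
age <- sample(18:60, n_pop, replace = TRUE)

# Assumed for illustration: younger people prefer brand A more often
prefers_a <- rbinom(n_pop, 1, ifelse(age < 30, 0.8, 0.3))
population <- data.frame(age, prefers_a)

true_share <- mean(population$prefers_a)  # what we would get by asking everyone

# A random sample of 500 people
random_sample <- population[sample(n_pop, 500), ]

# A "friends" sample: 500 people, all close to my own age
friends <- population[population$age >= 22 & population$age <= 26, ]
friends_sample <- friends[sample(nrow(friends), 500), ]

c(population = true_share,
  random     = mean(random_sample$prefers_a),
  friends    = mean(friends_sample$prefers_a))
```

The random sample lands close to the population share, while the friends sample overstates it by a wide margin.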

Another example. Not the beer but a rather completely different one - marriages.

Say, now we want to understand what are the factors that affect the age at which one gets married. Some of the factors that came to me without much thought are:

Gender
Love or arranged marriage
Plan for higher studies
Company type - Government or Private
Salary
Region
Religion

Now we don't know the true relationship between marital age and the variables listed above. There should be a true function that maps response variable to the predictors but we don't know what that function is.

Let's say that true function is f.

f: (gender, love-marriage, higher-studies, ...) --> marital age

Now we don't know the true f(). But we can estimate the true function using the data we may have from the past. It's easy to collect the data for the respective variables and the marital age. We are trying to come up with a function, say fcap, that closely resembles the true function f.

Whenever we try to estimate true f from the data in hand, we will obviously get some error. This error can be categorized in two types:

Reducible error

As the name suggests, this is something that the analyst has some control over. It can be reduced through the kind of data you collect and the models you use to estimate the true f (not all models are the same, right?). This error arises from a combination of 'bias' and 'variance'. We will talk about bias and variance in the coming few paragraphs.

Irreducible error

As the name suggests, this error is something that the analyst has no control over. There is always some information that is difficult to capture in data. There is always some randomness in the data that is difficult to explain. This error can't be reduced using any model whatsoever.
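
The split between the two kinds of error is easy to see on simulated data, where we get to pick the true f ourselves. In this sketch the linear true_f, the noise level of 2 and the sample size are all assumptions made for illustration.

```r
set.seed(7)
n <- 100000
x <- runif(n, 0, 10)

true_f   <- function(x) 20 + 1.5 * x        # the assumed "true" relationship
noise_sd <- 2
y <- true_f(x) + rnorm(n, sd = noise_sd)    # the noise is the irreducible part

mse <- function(pred) mean((y - pred)^2)

mse_true_f    <- mse(true_f(x))             # even the true f leaves about noise_sd^2
f_hat         <- lm(y ~ x)                  # our estimate of f (fcap)
mse_f_hat     <- mse(fitted(f_hat))
mse_mean_only <- mse(mean(y))               # a poor model: a lot of reducible error

c(true_f = mse_true_f, f_hat = mse_f_hat, mean_only = mse_mean_only)
```

A better model closes the gap between mean_only and f_hat, but nothing pushes the error below roughly noise_sd^2 = 4.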

Let us dig deeper into reducible error. We will talk about bias and variance here.

Bias

When we talk about estimating the true f, there are various models that can be used. Not all models are the same; each has characteristics of its own. A linear regression, the simplest model, is different from, say, a random forest model. Bias is the error that captures how far the predicted value (say, the predicted marital age given variables like gender, love or arranged marriage, etc.) is, on average, from the true value (the actual marital age).

Now you may ask, is there a relationship between the bias and the type of model used? Yes, there is. Bias tends to decrease as the complexity of the model increases, i.e. this part of the model error is expected to decrease if we use a more complex model instead of a simple one. Now this is intuitive, isn't it?

You may ask what is meant by the complexity of a model. Linear regression is a very simple model whereas random forest is a more complex one. Linear regression tries to fit just a line to the actual data and assumes a linear relationship. A random forest model is more complex in the sense that it uses an ensemble of decision trees and is able to capture non-linear relationships as well.

Variance

When you estimate the true f, you use some data to train the model; that data is called training data. It is the data the machine trains on to find the patterns. Once your model is trained, you want to use it to predict on unseen data (data that the model has not been trained on); that data is called test data.

There is always some variability between the training data and the test data. You can't expect the two to be exactly the same. It is important to note that when you train your model, the model shouldn't memorize the exact values but instead try to find patterns, so that when the same pattern is seen in test data, the model is able to predict correctly. Many complex models overfit the data, i.e. they perform well on training data but fail drastically on test data. Variance is the error that captures how much the model's predictions change when it is trained on a different sample of data.

How is variance related to the complexity of models? As the complexity increases, the chances of overfitting increase, i.e. the variance increases. Coming back to the random forest and linear regression example, a random forest's variance is expected to be higher than that of a linear regression model.

So if we talk about complexity, bias and variance together, this is the relationship between them.

As the complexity of model increases, the bias decreases but the variance increases.

So there is a trade-off between bias and variance. You can't get both low bias and low variance at the same time. You will have to accept a trade-off. Do you now understand why we call it the curse of bias and variance? I hope you do.
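To make the trade-off a little more concrete, here is a small simulation sketch in Python. Everything in it is an assumption made for illustration: the "true" function is taken to be f(x) = x², the simple model always predicts the mean response, and the complex model copies the response of the nearest training point. It is not a claim about how any particular library measures bias or variance.

```python
import random
import statistics


def true_f(x):
    # Assumed "true" relationship, chosen only for illustration.
    return x * x


def make_training_set(rng, n=30, noise=0.3):
    # Each point is (x, y) with y = f(x) + random noise (the irreducible part).
    xs = [rng.random() for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, noise)) for x in xs]


def simple_model(train, x0):
    # Very simple model: ignore x and always predict the mean response.
    return statistics.mean(y for _, y in train)


def complex_model(train, x0):
    # Very flexible model: copy the response of the nearest training point.
    return min(train, key=lambda p: abs(p[0] - x0))[1]


def bias_sq_and_variance(model, x0=0.9, repeats=500, seed=0):
    # Refit the model on many fresh training sets and study its predictions at x0.
    rng = random.Random(seed)
    preds = [model(make_training_set(rng), x0) for _ in range(repeats)]
    bias = statistics.mean(preds) - true_f(x0)
    return bias ** 2, statistics.pvariance(preds)


for name, model in [("simple", simple_model), ("complex", complex_model)]:
    b2, var = bias_sq_and_variance(model)
    print(f"{name:8s} bias^2 = {b2:.3f}  variance = {var:.3f}")
```

In this toy setup the simple model should show the larger squared bias and the complex model the larger variance, which is the pattern described above.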

Did you find the article useful? If you did, share your thoughts in the comments. Share this post with people who you think would enjoy reading this. Let's talk more of data-science.

Visualization in ML is under-rated

2017-01-27T08:00:00-02:00

Visualization is one of the most important pillars of data science. Everyone wants to learn machine learning, but if you explain to them the little tasks that make up the overall workflow, it turns them off. Everyone just wants to do the cool stuff. They want to build models and be done with it. And I was one of them. I understand that feeling when you get the data and, without much understanding of the features in the dataset, just want to throw the data at a model and hope that something good comes out.

I have participated in many hacks relating to ML and I used to just hope the trained model would do the task. Sadly, this thought always betrayed me. Even ensemble of various models may not work. Sometimes there are patterns in data which one gets to know only when one does EDA, when one plots a few graphs. One tries to understand the relationship of one feature with the other or do univariate analysis of the columns. An ML model is not always sufficient to understand the patterns in data.

After a point, the learning of a model becomes saturated. It can only perform to a certain point. If there is a pattern that you have identified, it would definitely help the model to better train on it.

I have learnt this the hard way: you can't ignore EDA and visualization if you want to be in the top 1% of the leaderboard. Anyone and everyone can run a random forest model. Tuning the parameters is a little tedious but that too can be done with a little practice. But finding the hidden patterns in the dataset, finding the relationships, understanding the little nuances in the data is an art. It's a skill. It takes practice. A lot of practice.

I have always wondered how the winners go about finding those patterns. Isn't there a course that could teach me these hacks to find the unseen patterns? There are courses that teach EDA and how to use ggplot2. Unfortunately, the thing I am talking about takes an altogether different mindset. One needs to be patient with this chunk of work. There are no defined paths to it. You just keep doing the EDA, observe the patterns, try to create new features based on them, and once you have done this a hundred times, you realize you finally understand this thing.

So how do you master the art of EDA?

The short answer is practice. But then how do you practice EDA? You take shorter assignments and try to write the code snippets for common plots like the histogram, scatterplot and bar plot. And when you plot these graphs, I would strongly recommend using ggplot2 if you are using R. In one of his talks, Hadley Wickham strongly asserted that one should start learning visualization using ggplot2. The syntax is very intuitive and makes plots interesting and beautiful. And when a person like Hadley advises a thing, you just follow it.

I have learnt more about R by looking at other people's code. Look at the Kernels shared on Kaggle by the top performers. The kind of analysis they do is just mind-blowing. And the best part is they share the code with everyone.

In this post, I would just share the code snippets for the most common visualization tasks. Below are some of the most common plots to know about your data.

Barplot

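 library(ggplot2)  # provides ggplot() and the built-in mpg dataset
 # Bar chart of the number of vehicles in each class
 ggplot(data = mpg) +
  geom_bar(mapping = aes(x = as.factor(class))) +
  ggtitle(label = 'Plot of class of vehicle') +
  xlab('Class of vehicle') + ylab('Count')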

Scatterplot

The scatterplot is useful for displaying the relationship between two continuous variables, although it can also be used with one continuous and one categorical variable, or two categorical variables.

 ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  ggtitle(label = 'Plot of engine size v/s mileage') +
  xlab('Engine size (litres)') + ylab('Highway mileage')

Line chart

 ggplot(data = mpg) +
  geom_line(mapping = aes(x = displ, y = hwy)) +
  ggtitle(label = 'Plot of engine size v/s mileage') +
  xlab('Engine size (litres)') + ylab('Highway mileage')

I will add a tutorial on visualization in R using ggplot2 soon. Did you find the article useful? If you did, share your thoughts in the comments.

The filmy Secret Santa

2016-12-22T07:00:00-02:00

Ever heard of the Secret Santa game? If not, you may not appreciate this article. Wait till you join the corporate world. But if you have, Secret Santa may ring a bell.

Dialogues define our cinema. Ever imagined this twist. What if the Secret Santa gets introduced to Bollywood? Presenting, Secret Santa meets Bollywood!

Bade bade companies me aise chhote chhote tasks milte rehte hai

Gift se dar nahi lagta saheb, task se lagta hai

Secret Santa, naam toh suna hi hoga

Rishte me toh hum tumhare Secret Santa lagte hai, naam hum hints ke through bataenge

Saara office mujhe Secret Santa ke naam se jaanta hai

Tasks pe tasks, tasks pe tasks, tasks pe tasks milti rahi hai magar hints na mila. Mile toh bas tasks

Main aaj bhi feke hue hints nahi uthata

Tasks abhi baaki hai mere dost

Babu moshai, hints interesting hone chahiye lambe nahi

It's not the gift that matters...but the task given by your Secret Santa that defines you

Secret Santa is coming

That is all for now, folks.

Is your favorite Bollywood dialogue in this list? Drop your Secret Santa dialogues in the comments.

Coldplay-Fix you | Ghar aa jaao

2016-11-24T07:00:00-02:00

Last month on a mundane weekend, I was practicing this song - Fix you by Coldplay when my friend, Dipu came to the room. He stood there for some time to understand what I was trying to play. I am not that pro on guitar yet. Chances are you may want to throw me and my guitar out of your sight.

After some trouble, Dipu finally guessed the song and randomly uttered some lines in Hindi along the tune of the original song. And when he was bored, he left the room. I continued my attempt to make the song perfect on the guitar. It was then that I thought, what if I wrote the complete song in Hindi along the tune of the original?

Without much thought, I picked up my notebook and started scribbling a few lines. Later, Dipu also joined me in this attempt and below is the final song that we came up with.

Verse 1

Jab har koshish pe ho jaati hai haar

Jab milta nahi jo ho tera pyaar

Jab saath nahin ho mera wo yaar

Akela hoon main...

Verse 2

Jab lagta hai sab gaya hai bikhar

Jab lagta hai suna, bin tere sheher

Jab aati nahi ho tum nazar

Sab kuch hai soo... oo... na

Chorus

Ghar aa jaao tum

Sath seh lenge hum

Kal fir aayega

Verse 3

Sadkon pe hoon, ab main khada

Aakhon me hain, bas chehra tera

Aur hoton pe hai, bas vaada tera

Tujh sang hamesha...aaa

Chorus

Ghar aa jaao tum

Sath seh lenge hum

Kal fir aayega

Bridge

Aasoon bahein, tere chehre pe

Aur main kuch na kar paaya

Aasoon bahein, tere chehre pe

Aur main...

Aasoo bahein tere chehre pe

Aur main bebas hoon yahan

Aasoo bahein tere chehre pe

Aur main...

Chorus

Ghar aa jaao tum

Sath seh lenge hum

Kal fir aayega

Chorus

Ghar aa jaao tum

Sath seh lenge hum

Kal fir aayega

Here is how the song sounds. Give it a listen here.

Random Forest explained intuitively

2016-10-18T08:00:00-02:00

The random forest algorithm has always fascinated me. I like how this algorithm can be easily explained to anyone without much hassle. One quick example I use very frequently to explain the working of random forests is the way a company has multiple rounds of interviews to hire a candidate. Let me elaborate.

Say, you appeared for the position of Statistical analyst at WalmartLabs. Now like most of the companies, you don't just have one round of interview. You have multiple rounds of interviews. Each one of these interviews is chaired by independent panels. Each panel assesses the candidate separately and independently. Generally, even the questions asked in these interviews differ from each other. Randomness is important here.

The other thing of utmost importance is diversity. The reason we have a panel of interviewers is that we assume a committee of people generally takes a better decision than a single individual. Now this committee is not just any collection of people. We make sure that the interview panel is a little diversified in terms of the topics to be covered in each interview, the type of questions asked, and many other details. You don't go about asking the same question in each round of interviews.

After all the rounds of interviews, the final call on whether to select or reject the candidate is based on the majority of the decisions from the panels. If out of 5 panels of interviewers, 3 recommend a hire and 2 recommend against it, we tend to go ahead with selecting the candidate. I hope you get the gist.

If you have heard about decision trees, then you are not very far from understanding what random forests are. There are two keywords here - random and forests. Let us first understand what forest means. A random forest is a collection of many decision trees. Instead of relying on a single decision tree, you build many decision trees, say 100 of them. And you know what a collection of trees is called - a forest. So you now understand why it is called a forest.

Why is it called random then?

Say our dataset has 1,000 rows and 30 columns.

There are two levels of randomness in this algorithm:

At row level: Each of these decision trees gets a random sample of the training data (say 10%), i.e. each of these trees will be trained independently on 100 randomly chosen rows out of 1,000 rows of data. (In Breiman's original formulation, each tree gets a bootstrap sample drawn with replacement.) Keep in mind that because each decision tree is trained on a different random set of rows, the trees differ from each other in terms of predictions.
At column level: The second level of randomness comes at column level. Say we want to use only 10% of the columns, i.e. out of a total of 30 columns (from our example data), only 3 columns will be randomly selected at each node of the decision tree being built. So, for the first node of the tree, maybe columns C1, C2, and C4 will be chosen, and based on some metric (Gini impurity or another metric for deciding on the best split), one of these three columns will be chosen to split on at that node.

This process repeats again for the next node of the tree. Again, we will randomly choose 3 columns, say C2, C5, C6 and the best column will be chosen for this node as well.

NOTE: Many beginners and even experts mistakenly understand that the columns are randomly selected at tree level. However, the correct concept is that the columns are randomly selected at each node of each tree. I had received an email from Prof. Adele Cutler about the same (a while back) and so I have updated this post accordingly. Prof. Adele Cutler is the co-developer of random forests and has worked with Prof. Breiman extensively.
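The two levels of randomness can be sketched in a few lines of Python. This is only an illustration of the sampling step under the assumptions of the example above (1,000 rows, 30 columns, 10% of each); the helper names are made up, no tree is actually grown, and real implementations typically draw a bootstrap sample of rows rather than a 10% subset.

```python
import random

rng = random.Random(42)   # fixed seed so the sketch is repeatable

n_rows, n_cols = 1000, 30
columns = [f"C{i}" for i in range(1, n_cols + 1)]


def sample_rows_for_tree(rng, n_rows, fraction=0.10):
    # Row-level randomness: each tree sees its own random subset of rows.
    return rng.sample(range(n_rows), round(n_rows * fraction))


def sample_columns_for_node(rng, columns, fraction=0.10):
    # Column-level randomness: a fresh random subset of columns at EVERY node.
    return rng.sample(columns, round(len(columns) * fraction))


rows_tree_1 = sample_rows_for_tree(rng, n_rows)
rows_tree_2 = sample_rows_for_tree(rng, n_rows)
node_1_candidates = sample_columns_for_node(rng, columns)
node_2_candidates = sample_columns_for_node(rng, columns)

print(len(rows_tree_1), len(rows_tree_2))      # rows given to each tree
print(node_1_candidates, node_2_candidates)    # candidate columns at two nodes
```

Note that the column sampler is called once per node, not once per tree, which is exactly the point of the note above.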

Let me draw an analogy now.

Let us now understand how interview selection process resembles a random forest algorithm. Each panel in the interview process is actually a decision tree. Each panel gives a result whether the candidate is a pass or fail and then a majority of these results is declared as final. Say there were 5 panels, 3 said yes and 2 said no. The final verdict will be yes.

Something similar happens in a random forest as well. The result from each of the trees is taken and the final result is declared accordingly. Voting is used to predict in the case of classification and averaging in the case of regression.
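Here is a minimal Python sketch of that final step. The function names and the regression numbers are made up for illustration; the list of verdicts mirrors the 5-panel example.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_predictions):
    # Classification: the class predicted by the most trees wins.
    return Counter(tree_predictions).most_common(1)[0][0]


def average_prediction(tree_predictions):
    # Regression: the forest's prediction is the mean of the trees' predictions.
    return mean(tree_predictions)


panel_verdicts = ["yes", "yes", "yes", "no", "no"]   # 5 panels from the analogy
print(majority_vote(panel_verdicts))

tree_estimates = [24.0, 26.0, 25.0, 27.0, 23.0]      # made-up regression outputs
print(average_prediction(tree_estimates))
```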

With the advent of huge computational power at our disposal, we hardly think for even a second before we apply random forests. And very conveniently our predictions are made. Let us try to understand other aspects of this algorithm.

When is a random forest a poor choice relative to other algorithms?

Random forests don't train well on smaller datasets as they fail to pick up the pattern. To simplify, say we know that 1 pen costs INR 1, 2 pens cost INR 2, and 3 pens cost INR 3. In this case linear regression will easily estimate the cost of 4 pens but random forests will fail to come up with a good estimate.
There is a problem of interpretability with random forests. You can't see or understand the relationship between the response and the independent variables. Understand that random forest is a predictive tool and not a descriptive tool. You get variable importance, but this may not suffice in many analyses of interest where the objective is to see the relationship between the response and the independent features.
The time taken to train random forests may sometimes be too long as you train multiple decision trees. Also, in the case of categorical variables, the time complexity can grow exponentially: for a categorical column with n levels, there are 2^(n-1) - 1 possible ways to split the levels into two groups when searching for the best split. However, with the power of H2O we can now train random forests pretty fast. You may want to read about H2O at H2O in R explained.
In the case of a regression problem, the range of values the predicted response can take is determined by the values already available in the training dataset. Unlike linear regression, decision trees and hence random forests can't predict values outside the range seen in the training data.
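The pen example and the point about the range of predictions can be checked with a tiny Python sketch. It assumes a price of one rupee per pen, and the "tree-like" predictor is only a stand-in for a fully grown regression tree (it returns the response of the closest training point), not an actual random forest.

```python
def fit_line(xs, ys):
    # Ordinary least squares for one predictor: y = intercept + slope * x.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def tree_like_predict(xs, ys, x_new):
    # Stand-in for a fully grown regression tree: return the response of the
    # closest training point, so predictions never leave the range of ys.
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x_new))
    return ys[nearest]


pens = [1, 2, 3]
cost = [1, 2, 3]   # INR, assuming one rupee per pen

intercept, slope = fit_line(pens, cost)
print(intercept + slope * 4)               # the linear model extrapolates to 4 pens
print(tree_like_predict(pens, cost, 4))    # the tree-like model stays at the largest cost it has seen
```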

What are the advantages of using random forest?

Since we are using multiple decision trees, the bias remains roughly the same as that of a single decision tree. However, the variance decreases and thus we decrease the chances of overfitting. I have explained bias and variance intuitively at The curse of bias and variance.
When all you care about is the predictions and want a quick and dirty way-out, random forest comes to the rescue. You don't have to worry much about the assumptions of the model or linearity in the dataset.

I will soon add R code snippets as well, to give an idea of how this is executed.

Did you find the article useful? If you did, share your thoughts in the comments. Share this post with people who you think would enjoy reading this. Let's talk more of data-science.

Improve runtime of Random Forest in R

2016-10-13T10:00:00-03:00

There are two ways one can write the code to train a random forest model in R. Both the ways are listed below.

A normal and frequent way of writing the command to train the random forest model is something like this.

rfModel <- randomForest(Survived~. , data = trainSample[, -c(6, 8, 9)])

Notice the ~ sign. We call this the formula way of writing.

Another way of writing the command to train the random forest model is shown below.

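# x must not contain the response column; the x/y interface does not need a data argument
rfModel <- randomForest(y = trainSample$Survived, x = trainSample[, setdiff(names(trainSample)[-c(6, 8, 9)], "Survived")])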

Here we explicitly mention the y-variable and the x variables.

Recently, I was working on a huge dataset where the task was to predict a variable based on some 12 independent variables. The dataset had close to 1.3 million rows. I tried to train the model using the first method, i.e. the formula way. Sadly, I had to kill the task as it was taking a lot of time.

It was then that I got to know that if you train the model using the second format, the code runs relatively faster. When I investigated further, I learnt that the reason for the difference in time is that the code for random forest is written in C, and when one uses the formula format, the x and y variables have to be converted into the proper format first, and this is what takes time. ref

Even the help page of randomForest in R says the same thing.

?randomForest

For large data sets, especially those with large number of variables, calling randomForest via the formula interface is not advised: There may be too much overhead in handling the formula.

Look just before the 'Authors' section.

The time difference is not significant for smaller datasets. However, for larger datasets I would suggest using the second format.

To illustrate the time difference, I have taken a dataset having 223,874 rows and 6 columns.

system.time( rfModel1 <- randomForest(DepDelay~. , data = subset(df, select = c(1,2,3,4,5, 13))) )

   user  system elapsed
 94.582  11.356 109.311

system.time( rfModel2 <-randomForest(y=df$DepDelay, x=df[,c(1,2,3,4,5)], data=subset(df, select = c(1,2,3,4,5, 13))) )

   user  system elapsed
 93.499  10.248 106.205

This shows a time difference of 3 seconds. Note that all of these columns were numeric and if one takes categorical columns as well, the time difference would be more.

Also, this dataset is not huge. One would appreciate this more on larger datasets.

Did you find the article useful? If you did, share your thoughts on the topic in the comments.

How to install a package of a particular version in R

2016-10-05T07:00:00-03:00

I recently tried installing the caret package in R using

install.packages('caret', dependencies = TRUE)

Normally this kind of installation works and I continue to work with the functions associated with the package. This time, when I tried loading the package using

library(caret)

I got the following error.

Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
  there is no package called 'pbkrtest'
In addition: Warning message:
package 'caret' was built under R version 3.2.5
Error: package or namespace load failed for 'caret'

R was not able to install this dependency package, pbkrtest. So I tried installing it separately, again using

install.packages('pbkrtest', dependencies = TRUE)

This too didn't work. And the error again this time was

Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
  there is no package called ‘pbkrtest’
In addition: Warning message:
package ‘pbkrtest’ was built under R version 3.2.3

So this particular package is not available for the R version that I am using.

What is the way out?

We install an older version of this package.

How do you install an older version of some package in R?

I will go ahead with package-pbkrtest to illustrate the steps.

Open this url. You will see various versions of the package listed there. Choose the version you want to install. Let's say I want to install the pbkrtest_0.4-5.tar.gz version.

We now create a variable packageUrl and assign it the URL of the chosen archive file.

packageUrl <- "https://cran.r-project.org/src/contrib/Archive/pbkrtest/pbkrtest_0.4-5.tar.gz"

You then install this version of the package using

install.packages(packageUrl, repos=NULL, type='source')

And we are done!

Now if you want an older version of some other package, just replace pbkrtest in the archive URL with the name of that package and you will be able to see all its older versions. For Rcpp the link would be url. Choose the version you want to install and follow the steps stated above.
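If you need such URLs often, the pattern is easy to generate. Below is a small Python sketch; the helper name is made up, and it simply assumes that other packages follow the same Archive layout as the pbkrtest URL shown earlier.

```python
def cran_archive_url(package, version):
    # Follows the pattern of the pbkrtest URL used above; other packages are
    # assumed to follow the same layout on CRAN.
    base = "https://cran.r-project.org/src/contrib/Archive"
    return f"{base}/{package}/{package}_{version}.tar.gz"


print(cran_archive_url("pbkrtest", "0.4-5"))
```

The printed string is what you would assign to packageUrl in R before calling install.packages with repos=NULL and type='source'.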

Did you find the article useful? If you did, share your thoughts on the topic in the comments.

Cheers,

Manish

Don’t introduce that bias in your child

2016-10-02T08:00:00-03:00

A thought crossed my mind yesterday and just when it was about to get lost in the cloud of many thoughts that scatter in the mind…I caught hold of it. I felt I had gathered the maximum out of this thought but I wanted to write about it so that the mind has to think more on it and maybe, just maybe, I will be able to know more. Here’s the thought.

Every child learns something or the other from every human it spends time with.

This statement sounds more promising when we think about it in this context. We tend to become more like our parents because we spend maximum time with them. Our characters and behaviors are imitated from theirs.

How does it happen that a baniya’s son is almost always good in calculations and Maths? Why is that? Why doesn’t it tend to behave more like a musician?

Because the child spends the maximum time with his parents. He sees the kind of work his father does or the daily chores that his mother gets involved in. He might not be paying much attention to all of this throughout his life. But the brain keeps on peeping through. The unconscious learning happens. Or what we call unsupervised learning in data-science.

In the long run, the best learning happens when it is not enforced. When the child learns on his own, at his own rate. You want your child to be interested in reading? Make your home a library. Not so fancy or anything. Just surround the room and tables with books. Read it to your baby even when you think the baby doesn’t understand a single word. Let the unsupervised learning happen.

Let the brain enjoy the feel of reading. The idea of living many lives through these books. Every character in these stories will add to the learning. And by the time the child grows, the brain is already hooked to this habit of reading. It has tasted the palatable food of reading. And it can’t survive without it now.

Let’s say, you want him to be a musician. You already know what you need to do.

Do you realize, what have you done by reading out everyday to your child?

You have introduced a bias in your child’s character. A good bias, I would say. How interesting it is that we can relate machine learning and a child’s learning behaviors. And we all know what happens to a biased model. It doesn’t work well. There is a trade-off between bias and variance, you know.

While it’s good to have the child spend time with good people and have a nurtured learning, so that the child knows about the right things, don’t stop your child from interacting with someone who is not in your good books. Let us call that someone Asura.

Now Asura may not be on good terms with you and you may have legitimate reasons for that. Maybe you feel that Asura is immoral and unscrupulous. But you have seen only a part of Asura’s character. Maybe Asura is an excellent guitar player or a seasoned writer or he could be an expert in other domains. You don’t know.

Let his learning be flexible. Let the bad data be part of his life. Let him enjoy the outliers as well, be it a good or a bad one. With the bad outlier — Asura, make sure you enforce a little of your experience and monitor the child’s learning of good and the evil. Let the supervised learning come into play.

Don’t let your child have limited experiences of just the right things. Don’t limit the datasets in his life or the variance will be huge which again is not a good thing.

Let the learning be both a mix of unsupervised and supervised. A semi-supervised learning we call it in data-science.

Did you find the article useful? If you did, share your thoughts in the comments.

Shell commands come in handy for a data scientist

2016-09-30T07:00:00-03:00

I am no expert in shell commands. I have been using them for quite some time and thought I would attempt to list the most common ones. I am writing these mostly from the perspective of a data-science guy. Let us get started.

I will use the file- ‘data.txt’ to illustrate these commands. ‘data.txt’ is a file having 200 rows and 8 columns. You can access the data here.

cat Throws the contents of the entire file at your terminal

    cat data.txt

We don’t want to bombard our terminal with the complete content of the file. Instead, if you want to have a complete look at the file, open the file in vim editor

vim Opens the file in the editor

    vim data.txt

head Gives the top 10 rows of the text file at your terminal

    head data.txt

tail Gives the bottom 10 rows of the text file at your terminal

tail -n 2 data.txt  # gives you the bottom 2 rows of the file

The piping operator

cat data.txt | head

Notice the | operator. This is called the pipe operator. Piping is a concept wherein you can perform a sequence of operations in a single command.

So what exactly is piping?

A pipe is a facility of the shell that makes it easy to chain together multiple commands. When used between two Unix commands, it means that output from the first command should become the input to the second command. Just read the | in the command as pass the data onto. More on this operator later in the post.

wc

wc is a fairly useful shell command that lets you count the number of lines (-l), words (-w) or bytes (-c) in a given file

wc -l data.txt  # gives you the number of lines in the file

wc -w data.txt  # gives you the number of words in the file

wc -c data.txt  # gives you the number of bytes in the file (use -m to count characters)

head -n 1 data.txt | tr ',' '\n' | wc -l  # gives you the number of columns in a ',' delimited file

grep

Consider ‘grep’ as a command to filter on the results you get. You may want to print all the lines in your file which have a particular phrase. Say for example you want to see people who are ‘Very Happy’. You simply pass this to grep command.

grep 'Very Happy' data.txt | head
# gives you the first 10 rows containing 'Very Happy'

Let us say, we want to count the number of users who are ‘Not Happy’

grep 'Not Happy' data.txt | wc -l
# gives you the number of rows containing 'Not Happy'

sort

If you want to sort the data based on some column, say ‘Score’, which is the 3rd column in the file data.txt

sort -t ',' -k 3 -n -r data.txt | head -5
# gives you the 5 rows with the highest Score

Explanation: -t is used to specify the delimiter; ‘,’ in this case.

If the file is delimited by whitespace, we don’t need to specify the -t argument: by default, sort treats runs of blanks (spaces or tabs) as the field separator. If the values themselves contain spaces, as in ‘Very Happy’, pass the tab explicitly with -t $'\t' (bash syntax).

-k is used to specify the column based on which you want to sort the data; 3 in this case
-n is to specify that sorting is to be done numerically
-r is to imply that the sorting is descending

cut

This command gives you only specific columns. Say you want to see only the 4th column of the file.

cut -d ',' -f 4 data.txt | head

Explanation: ‘,’ is the delimiter. 4 is the column number that you want to see.

uniq

Do not confuse this command for ‘unique’. It is slightly different. This removes sequential duplicates. So if you want to get unique values from a column, you need to first sort the data and then use this uniq command in sequence.

To get the unique values of a column, say the 2nd column

cut -d ',' -f 2 data.txt | sort | uniq

This command can be used with the argument -c to count the occurrences of each distinct value, something like a GROUP BY count in SQL.

cut -d ',' -f 2 data.txt | sort | uniq -c
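If it is not obvious why the sort is needed before uniq, the behaviour can be mimicked in a few lines of Python. This is only an analogy, not how uniq is implemented, and the column values are made up: itertools.groupby, like uniq, only collapses duplicates that sit next to each other.

```python
from collections import Counter
from itertools import groupby


def uniq(lines):
    # Like `uniq`: collapse runs of identical ADJACENT lines into one.
    return [key for key, _ in groupby(lines)]


values = ["Happy", "Not Happy", "Happy", "Happy"]   # made-up column values

print(uniq(values))            # "Happy" still appears twice
print(uniq(sorted(values)))    # sorting first brings duplicates together
print(Counter(values))         # counts per value, in the spirit of `sort | uniq -c`
```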

tr

tr stands for translate. Think of the ‘Find and Replace’ function that we have in Excel; we have that in UNIX as well, one character at a time. A typical use of this command for me is on files that I get from HIVE, which are tab delimited and which I want to convert to ‘,’ delimited, or the other way round.

You may also want to replace certain characters in a file with something else using the tr command.

cat data.txt | tr ',' '\t'  # changes ',' delimited to '\t' delimited

Save to a new file or append to an existing file

> and >> operators

Say you want to save the output of your operations to some file. You use ‘>’ to write to a new file (overwriting it if it already exists) or ‘>>’ to append to an existing file.

I will update this list as and when I see a command deserving enough to be in a data scientist’s toolbox.

Did you find the article useful? If you did, share your thoughts on the topic in the comments.

ROC and AUC - The three lettered acronyms

2016-09-26T08:00:00-03:00

I am not ashamed to confess that the ROC curve, AUC, true positives and related terms took me quite some time to understand. When I think about why I found this topic confusing, the first reason is that there are not many resources that explain intuitively what these terms mean; they just jump to the terms and the mathematical formulas for them. The second is that I had not used them in my project work. You see, project work is never enough for all your learning.

In this post, I will try to explain my understanding both intuitively and mathematically.

I will illustrate this concept with the help of an example.

There is a bank, say SBI, that wants to understand which of its future customers would default on a loan granted by the bank. The bank would already have historical data from past years that says how many of the customers have defaulted, what type of customers they were, and much other information about the past loans.

It is very rare that customers default. To give an example, say the bank has data having 1000 rows that contains information like age, gender, income, marital-status, other related columns and then a variable named default that says whether the customer has defaulted or not.

As I mentioned earlier, very few customers default on their loans. So a realistic example would be say 100 customers defaulting out of a total of 1000 customers.

The bank is interested in knowing if a customer would default or not. This is a typical binary classification problem. The bank would really be interested in the customers who are likely to default. Let us now try to understand terms like True-positive, True-negative, False-positive and other related terms.

There are two levels in the default variable that we are trying to predict: default and not-default. One has to define first whether default or not-default will be treated as positive. It is just a convention. Normally, the class of interest is treated as positive. So in the bank's case, what do you think we should take as positive? You might have guessed it correctly: we will treat default as positive and not-default as negative.

We train a binary classification model on this dataset having 1000 rows. The predictions that the model makes for the training data will either be correct or incorrect.

There will be two cases for incorrect predictions

False positive (FP): predicting positive when the actual was negative, i.e. classifying a customer as default when in reality he is not-default
False negative (FN): predicting negative when the actual was positive, i.e. classifying a customer as not-default when in reality he is default

There will be two cases for correct predictions

True positive (TP): predicting positive when the actual was positive, i.e. correctly classifying default as default
True negative (TN): predicting negative when the actual was negative, i.e. correctly classifying not-default as not-default

Now, FP and FN are incorrect predictions (notice the False in the name) and TP and TN are the correct predictions. Don't be in a hurry here. Take some time to digest these two-letter acronyms. Read them out loud. Take a notebook and write them down on your own.

Once we are comfortable with these terms, we will discuss something called the confusion matrix. Don't get confused yet. If you understood TP, TN, FP and FN, then the confusion matrix is just a matrix having these values. The diagonal elements contain the count of correct predictions (TP, TN) whereas the off-diagonal elements contain the count of incorrect predictions (FP, FN). The rows are the predicted classes and the columns are the actual classes. It looks something like this.

                      Actual positive    Actual negative
Predicted positive    TP                 FP
Predicted negative    FN                 TN

Why do we need TPR and FPR if we already have misclassification error?

The data I described above is a typical case of imbalanced data, in which one class holds the majority of observations (90% non-defaulters, the negatives, in our data) and the other class is a minority (only 10% defaulters, the positives). In such cases, the predictions on a new dataset tend to be skewed towards negatives, i.e. the model will classify a lot of defaulters (positives) as negatives. The bank can't afford such predictions. The bank wants to identify the defaulters (positives) reliably. Imagine the loss to the bank if the model classifies a probable defaulter (positive) as a non-defaulter.

In such cases, accuracy (one minus the misclassification error) alone is not an acceptable measure. A model that labels every customer as a non-defaulter would be 90% accurate on our data and still miss every defaulter. The bank would be more interested in correctly classifying the positives as positive, i.e. the bank wants to classify the defaulters as defaulters without fail.

This is where TPR, TNR, FPR and FNR come into the picture. These three-letter acronyms are nothing but the rates of TP, TN, FP and FN respectively. Below are the formulas.

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
TNR = TN / (TN + FP)
FNR = FN / (FN + TP)

To digest the formulas, let's move to our data having 1000 rows: 100 defaulters (positives) and 900 non-defaulters (negatives). Suppose we employed a logistic regression that classified 80 defaulters correctly and incorrectly classified 90 non-defaulters as defaulters.

TPR = 80 / 100 = 0.8
FPR = 90 / 900 = 0.1
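To make the arithmetic concrete, here is a minimal Python sketch of the same example. TPR is TP / (TP + FN) and FPR is FP / (FP + TN). The function name `rates` and the "always negative" baseline are my own illustrations, not part of any library; the baseline shows why accuracy alone is misleading on imbalanced data.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR, accuracy) from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)   # share of actual positives that were caught
    fpr = fp / (fp + tn)   # share of actual negatives wrongly flagged
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return tpr, fpr, accuracy

# Logistic regression from the example: 80 of 100 defaulters caught,
# 90 of 900 non-defaulters wrongly flagged as defaulters.
model = rates(tp=80, fn=20, fp=90, tn=810)

# Hypothetical baseline that predicts "not-default" for every customer.
always_negative = rates(tp=0, fn=100, fp=0, tn=900)

print("model (TPR, FPR, accuracy):", model)
print("always-negative (TPR, FPR, accuracy):", always_negative)
```

The baseline ends up with a slightly higher accuracy than the model (0.9 against 0.89) while catching no defaulters at all, which is exactly the situation the bank wants to avoid.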

How do TPR and FPR get calculated?

Whenever you do classification, the model typically gives you, for each observation, the probability of it belonging to each of the classes. Based on the cut-off you choose, you will get different predictions for the data and hence a different TPR and FPR overall. You can choose any probability cut-off in [0, 1], and each cut-off gives you a different (FPR, TPR) pair.

A TPR and an FPR are generated for each probability cut-off one chooses. These values are then plotted on an ROC curve.
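Here is a small sketch of that idea in plain Python. The labels and predicted probabilities are made up for illustration, and `roc_points` is a hypothetical helper of mine; an observation is predicted positive when its probability is at or above the cut-off.

```python
def roc_points(y_true, scores, thresholds):
    """Return one (threshold, FPR, TPR) tuple per probability cut-off."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for t in thresholds:
        # Predict positive when the score reaches the cut-off
        predicted = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
        points.append((t, fp / negatives, tp / positives))
    return points

# Made-up data: 1 = default (positive), 0 = not-default (negative)
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2, 0.2, 0.1]

points = roc_points(y_true, scores, [0.25, 0.5, 0.75])
for threshold, fpr, tpr in points:
    print(f"cut-off={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the cut-off catches more defaulters (higher TPR) at the price of flagging more good customers (higher FPR).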

One can plot these pairs on a graph, with FPR on the x-axis and TPR on the y-axis, one point per probability cut-off. You know what this graph is called? ROC, the Receiver Operating Characteristic curve. There is a trade-off between TPR and FPR. Depending on the requirement, one can choose the probability cut-off that best fulfills their purpose. For instance, in the bank's case, the bank does not want to miss a single defaulter (positive), i.e. the bank wants a higher TPR. The ROC curve looks something like this.

Alright, all this is clear to me.

What about AUC?

AUC is nothing but the area under the ROC curve. Let's say we built a logistic regression model that gave us the probabilities for each row. Now we try probability cut-offs from 0.1 to 1.0 with a step size of 0.1, i.e. we would have 10 probabilities to try, and for each of the 10 values we would have a corresponding (FPR, TPR) pair. If we plot these values, we would get a graph having 10 points. This 10-point graph is what we call an ROC curve, and the area under it is called the AUC.

The AUC is a common evaluation metric for binary classification problems. Consider a plot of the true positive rate vs the false positive rate as the threshold value for classifying an item as 0 or 1 is varied from 0 to 1: if the classifier is very good, the true positive rate will increase quickly and the area under the curve will be close to 1. If the classifier is no better than random guessing, the true positive rate will increase linearly with the false positive rate and the area under the curve will be around 0.5.

One characteristic of the AUC is that it is independent of the fraction of the test population which is class 0 or class 1: this makes the AUC useful for evaluating the performance of classifiers on unbalanced data sets.

The larger the area the better. If we have to choose between two classifiers having different AUCs, we choose the one having larger AUC.
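As a rough sketch of how the area can be computed from a handful of (FPR, TPR) points, the snippet below joins the points with straight lines and adds up the trapezoids. The helper `auc_trapezoid` and the sample points are assumptions of mine for illustration. I also add the corner points (0, 0) and (1, 1), which every ROC curve passes through.

```python
def auc_trapezoid(points):
    """Area under the piecewise-linear curve through (FPR, TPR) points."""
    # Every ROC curve starts at (0, 0) and ends at (1, 1)
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2   # one trapezoid per segment
    return area

# Illustrative (FPR, TPR) pairs from three probability cut-offs
example_auc = auc_trapezoid([(0.0, 1 / 3), (0.2, 2 / 3), (0.4, 1.0)])

# Reference cases: random guessing (the diagonal) and a perfect classifier
random_auc = auc_trapezoid([])
perfect_auc = auc_trapezoid([(0.0, 1.0)])

print(round(example_auc, 3), random_auc, perfect_auc)
```

With only a few cut-offs this is a coarse approximation; the more cut-offs you try, the closer the polygon gets to the full ROC curve.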

How do you decide what probability cut-off you should choose to classify observations into either of the classes?

Some say they take a cut-off of 0.5, i.e. an observation having a probability greater than 0.5 will be classified as positive, or else negative. Do you see the problem here? Try thinking.

This is where the ROC curve comes into the picture. Look at the ROC curve and, depending on what value of TPR or FPR you want from your model, take the probability cut-off corresponding to that point.

ROC and AUC can easily be plotted and calculated using modern analytical tools like R or Python. But for a better understanding of this topic, I would suggest you try writing your own code.

I will attempt the same soon and share it in this space. Keep learning and sharing.

Did you find the article useful? If you did, share your thoughts on the topic in the comments.

Vim/Vi editor shortcuts

2016-09-22T04:00:00-03:00

Repetitive tasks should be done using as many shortcuts as possible. You are not doing anything new and hence not even an extra minute should be spent on doing the same. This post refers to the shortcuts that come in handy when working on the vi/vim editor.

This is not an exhaustive list. These are the ones I use frequently. Feel free to comment down your favorite shortcuts.

Navigation keys

0 Moves cursor to the start of the line
$ Moves cursor to the end of the line
w Moves forward one word
b Moves backward one word
G Moves to the end of the file
1G Moves to the beginning of the file (type 1, then G)

Delete text

dw Deletes a word ahead of the cursor
db Deletes a word behind the cursor
d0 Deletes from the cursor to the beginning of the line
d$ Deletes from the cursor to the end of the line
dd Deletes the complete line
10dd Deletes 10 lines, starting from the current line
dG Deletes till the end of the file

Undo/Redo operation

u Undo the last operation
Ctrl + r Redo the last undo

Search and Replace keys

/search_text Finds 'search_text' in file going forward
?search_text Finds 'search_text' in file going backward
n Finds the next occurrence of 'search_text'
N Finds the previous occurrence of 'search_text'
:%s/replace_what/replace_with Replaces the first occurrence on every line
:%s/replace_what/replace_with/g Replaces all occurrences in the file
:%s/replace_what/replace_with/c Asks for confirmation before each replacement

Save and quit

:q! Force quit without saving
:wq Saves the changes made to the document and quits
:wq! Forcefully saves the changes made to the document and quits
:w new_file_name Saves the file to a new file named new_file_name

Command line shortcuts

Ctrl + a Brings to the beginning of the line
Ctrl + e Brings to the end of the line

Hadoop Streaming

2016-08-29T07:00:00-03:00

A few days ago, I had written a post on The Big Data Problem which attempted to understand why we need big data and what the fuss is all about. You may want to read it here.

Having understood why we need big data, let’s understand how we can go about analyzing the same. What is the way out to do analysis on big data? The solution is Streaming…Hadoop Streaming. James Bond style

To recount a personal experience, I was faced with the following Data Analysis task — Scaling a particular analytics technique on retail data from one store to all stores across the US. Sounds interesting so far, doesn’t it? The catch however, was that the program took 2 days just to compute the results for 1 department.

Imagine, the time it would take if we were restricted to a single machine/Rstudio.

Let’s understand the gravity of the problem and get an idea of a rough time estimate. So for one department the time taken was two days. Let us assume, each store has close to 100 departments and we have a total of 5000 stores. So the total time taken for all of the stores would be roughly 2 X 100 X 5000 = 1 million days = 2740 years. This is just a rough estimate assuming a linear relationship between the number of departments and the time taken.
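The same back-of-the-envelope estimate, written as a few lines of Python so you can plug in your own numbers. The variable names are mine, and the linear scaling is the same assumption stated above.

```python
days_per_department = 2
departments_per_store = 100
stores = 5000

# Assumes run time grows linearly with the number of departments
total_days = days_per_department * departments_per_store * stores
total_years = total_days / 365

print(total_days, "days, roughly", round(total_years), "years")
```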

Obviously, using RStudio standalone was not a feasible solution. We had a Hadoop cluster at our disposal. We wanted to see if we could exploit the number of machines in our cluster to solve this Herculean challenge of the Big Data world.

A fresh college graduate, I was a complete stranger to the world of Hadoop and, out of anxiety, I purchased a copy of Hadoop: The Definitive Guide. I started going through the chapters. In no time I realized this book was meant for someone with an understanding of Java. Unfortunately, I was not one of them. I was saddened and almost defeated. What other choice did I have? Then I came across this wonderful concept: Hadoop Streaming.

Wow! This looks like it could solve the problem. I started reading more about it on the internet. There were not many resources on this technique. Fortunately, a colleague of mine had worked on Hadoop Streaming earlier and with his help we were able to accomplish the task successfully.

Now that you have the context, let me try to answer some basic technical questions.

So what exactly is Hadoop streaming?

Hadoop streaming — ‘hadoop’ and ‘streaming’. We use the availability of machines in the cluster to process data in parallel. Let me elaborate. When we do our analysis on Rstudio, we just have one machine (our laptop). Now imagine we have 100s of such machines in our cluster. We can take advantage of this and distribute our input data in such a way that each machine gets some portion of your input data and they are processed in parallel.

Now, you may ask what is streaming here.

The input data is sent to each of the machines in the cluster via stdin() (standard input) and the analyzed output is thrown at stdout() (standard output). Your input data is streamed via stdin and the output gets flushed out to stdout.

What languages can I use for Hadoop streaming?

You can use any scripting language: R, Python, Ruby. Care should be taken in choosing your scripting language. The analysis you are planning to do on your data should decide your choice of language. Say we want to forecast sales at item level for a retail store. R is a suitable choice for forecasting because it has well-developed packages for it, which is not the case with Ruby. So do put some thought into choosing the language.

I hope we are now clear on the basics of Hadoop Streaming and its benefits.

Hadoop Streaming is not the answer to all your Big Data analysis problems. It can be used only for cases where each machine can independently perform its analysis with a small portion of input data that is fed to it.

Example where Hadoop Streaming can be used

Sales forecasting at item level. Say, we have weekly data for 2 years at item level. And say there are 10,000 items for which we want to do the forecasting. Each item has close to 104 rows (2 years weekly data). So our input data has close to 104 X 10,000 = 1,040,000 ~ 1 million rows.

Assume we have 100 machines in our cluster. What we do next is pretty intuitive. We distribute our data such that each machine receives 1 item at a time, and once it is done processing that item, we send the next item to it. So, in a single go, we will have 100 items processed across the cluster. In the next go, another 100 items will be processed, and this goes on until all the 10,000 items are processed.

Example where Hadoop Streaming won’t work

Clustering. In clustering we need to find the hidden patterns that are there in the complete dataset. A smaller portion of the dataset can’t be sent to a machine to do the analysis because the purpose will not be served.

I think this is worth mentioning.

In Hadoop Streaming, you are not writing an org.apache.hadoop.mapred.Mapper class! This is just a simple script that reads rows from stdin (columns separated by ‘\t’ or any delimiter) and should write rows to stdout (again, columns separated by ‘\t’ or other delimiter). It’s probably worth mentioning this again but you shouldn’t be thinking in traditional map-reduce Key Value terms, you need to think about columns.
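To give a feel for what such a script looks like, here is a minimal sketch in Python. It is not the code from my project. It assumes (my assumption, for illustration) that each input row has three tab-separated columns, item_id, week and sales, and that all rows for one item arrive next to each other. It writes one row per item with the total sales.

```python
import sys
from itertools import groupby


def parse(lines, delimiter="\t"):
    """Turn raw text rows into (item_id, week, sales) tuples."""
    for line in lines:
        line = line.rstrip("\n")
        if line:  # skip empty rows
            item_id, week, sales = line.split(delimiter)
            yield item_id, week, float(sales)


def total_sales_per_item(lines, delimiter="\t"):
    """Yield one output row per item: item_id and its total sales."""
    rows = parse(lines, delimiter)
    # groupby only merges neighbouring rows, hence the contiguity assumption
    for item_id, group in groupby(rows, key=lambda row: row[0]):
        total = sum(sales for _, _, sales in group)
        yield f"{item_id}{delimiter}{total}"


def main():
    """Read rows from stdin and write result rows to stdout."""
    for out in total_sales_per_item(sys.stdin):
        sys.stdout.write(out + "\n")

# When saved as a standalone streaming script, call main() on the last line.
```

Because the script only talks to stdin and stdout, you can test it on your laptop by piping a small tab-separated file into it before shipping it to the cluster. In a real forecasting job, the per-item sum would be replaced by the model you want to fit on each item.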

You can write your script in any language you want, but it needs to be available on all machines in the cluster. An easy way to do this is to take advantage of the Hadoop distributed cache support, and just use add file /path/to/script within Hive. The script will then be distributed and can be run as just ./script (assuming it is executable).

Enough theoretical stuff. The interesting part is the code framework of mapper and reducer that I will explain in the next post.

I hope you feel more educated on big data after reading this. Please leave a comment if you have questions/insights. I will reply as soon as possible.

When R package is not available across the cluster

2016-08-02T07:00:00-03:00

When deploying R code across the cluster, the task often fails because a particular package is not available on all nodes of the cluster. We wait for someone to get the package installed across all the nodes. This may take some days. Do we wait for them? Naah!

Here is a temporary solution that one of my colleagues came up with. I have used this technique and it works smoothly.

The following are the steps:

1. Install the package you require on one of the edge nodes into a local directory

- Create a local directory. Let's say our directory name is rPackages
```
    mkdir rPackages
```
- Install the required package, say 'randomForest', in the directory just created. Note that you need to choose the appropriate repo_name, the one that your company allows.
```
    install.packages('randomForest', repos='repo_name', lib='rPackages/')
```

2. Check if you can load the package from this local directory
```
    library(randomForest, lib.loc='rPackages/')
```

3. Create a zip file of the rPackages directory using the command
```
    zip -r rPackages.zip rPackages/
```

4. Add this zip file in your HIVE hql (or anything else). Don't forget the semicolon.
```
    add file rPackages.zip;
```

5. Unzip the file inside the R script, i.e. each reducer will now have the rPackages directory
```
    unzip('rPackages.zip', overwrite=TRUE)
```

6. Load the package now
```
    library(randomForest, lib.loc='rPackages/')
```

And you're done! Remember, you have to build the package on the same kind of system where you want to use it, because built packages are OS dependent.

Did you find the article useful? If you did, share your thoughts in the comments.

The most inspiring talk I ever attended live

2016-07-26T08:00:00-03:00

Last week WalmartLabs India was one of the sponsors of IIT Bombay's alumni meet, and I, along with a few of my colleagues, got a chance to attend the meet-up. I didn't have much idea about the speakers who were going to present. All I knew was that we wanted to get some good talent for our company, and we had set up a booth where we were presenting the work we do at WalmartLabs.

Of all the speakers that presented their talk, one was mesmerizing. I was impressed by this speaker. The way he confidently took the stage. Stood casually by the dais. He didn’t carry any presentation with him unlike the other speakers. There was a charm and calmness in his presence. It was as if he knew that we were eagerly waiting to hear him out.

He introduced himself as Ravi Venkatesan. He gave a brief background about his college, the works he had done, and the great experiences he had in his career of 30 years. Nothing unusual or impressive so far. But when he mentioned the different fields in which he had done his work, I was awestruck. I was not able to digest the fact that he had worked as someone who established Microsoft in India during its early years after having manufactured engines in his previous job. See the difference in the two kinds of work.

Many had asked him not to take this job at Microsoft. That it was not safe to make such a change from working on engines to IT sector. Even his most trusted mentors were not in favor of this change. But this guy wanted to wear this new hat and take this risk. Not to prove anyone wrong but to ensure that later in his life he never had the guilt of not trying.

People generally don't try new things because they fear the idea of failing. The moment one conquers this feeling and sees beyond it, all one can see is opportunity, learning, and experience.

He was in no hurry to finish his talk. Slowly, he went on describing his journey of ups and downs. His talk was mostly career advice. I remember him mentioning this.

The age of 40 is a magical year. Once you reach this, you start getting serious. You have in front of you, the 40 years you have already lived, you have the list of things you wanted to accomplish, and the things you have actually conquered.

The next thought that comes to your mind is you don’t have much time left. A few more years and your career will come to a halt. In the next few years, you would like to get the most out of your life. Do not wait to reach this magical year.

Do things you want to do NOW. Get out of your comfort zone. Take risks. His words were powerful and convincing.

Another anecdote from his life: he had applied to Harvard just after graduating from IIT Bombay. Sadly, he was rejected. Dejected and shattered, he called the office to ask why he had been rejected.

The person on the other side replied: you don't have anything different. Your grades are good and all, but how are you different? He had no answer. A few years passed by. Ravi liked reading about various things. He started writing about these things. His interest in writing increased exponentially and, fortunately, at the age of 26 he became the youngest man to have his article published in the Harvard Business Review. Obviously, he got admitted to Harvard the next time.

My eyes glistened with awe. Here was a guy who had not given up. He continued doing things he liked and gave an answer to how he was different from others.

The auditorium echoed with claps from the audience. The speaker opened up for questions from the audience which he patiently answered. There were many in the audience who had more questions to be answered. Time was running short and he had to leave the stage for the next speaker to come.

Did you find the post useful? If you did, share your thoughts in the comments.

Things I have learned (So Far)

2016-07-15T08:00:00-03:00

Today when I opened my blog on Quora, I saw a draft of this post that I had written some six months into my first job. It covers what I learned moving from college to corporate life.

Things I have learned (So Far)

I graduated from my college almost six months back. The last six months have been exciting in many ways. The transition from being a college student to a full-time corporate employee has been enlightening. In this post I will talk about some of the things which I've learnt... so far.

Travel-On

Just a few days after college, I traveled for some 20 days, and those were some of the good days I'll always cherish. Uttarakhand is beautiful.

Talk to people

People are interesting. Every person has something interesting to share, something that makes you feel happy. Try to live the happiness of others by experiencing how their happiness makes you feel. You would be able to relate something from their reasons for happiness.

Trekkings

Well, I won't say I am a hardcore trekker, someone who has got all his gadgets ready to hop on to the next trek. But I've trekked around Bangalore a few times. And man! You get to know so much about people. If you want to know about a person, know some of his truest qualities, go on a long trek with him.

Smile and laugh

Office gets boring sometimes, the work gets bland. You get the feeling, right? You don't hate your job, but sometimes you just feel "I won't be able to get any work done at the office today, man". At such times, the thought of just a few of your office friends makes you come to the office. Those are mostly the people who smile and talk to people.

The Guitarest oh sorry I meant The Greatest!

Guitar is happiness. Think about your happiness. What makes you happy? There would be some activity in your daily life which you truly adore doing. Know the feeling I’m talking about. Guitar is that feeling to me.

More on this soon.

The Big Data Problem

2016-06-29T07:00:00-03:00

Big data has become a sensation these days. Anyone and everyone wants to use the term in their discussions. When I was still in college and preparing for campus placements, I attended almost all the pre-placement talks that companies gave to their prospective candidates.

American Express was one such company that had talked extensively about big data and hadoop in their presentation. I remember clearly, the blank faces that all of us had. We were not able to follow a single term. Hadoop, clusters, big data, distributed environment and many other terms were just bouncers for me. I did try to google them up later but these terms were still alien to me.

I have worked at WalmartLabs for close to 2 years now and the work has mostly been on the big data side. Walmart is one of the largest companies when it comes to capturing transaction data, which we collect on a daily basis both at the stores and on the e-commerce side. Having worked on big data technologies a little, I felt I would try to make big data and related questions a little easier to understand.

Following are a few of the questions that I will try to explain:

What is Big Data? And why should I care about it?

What is the problem with a single laptop?

What is a cluster?

The data is distributed across the cluster. What is meant by this?

What are Hadoop and Spark?

Everyone is generating data. Data is the side-product of any work we do. Just like smoke is emitted when things burn, data is created when machines work or interact with each other. These machines used to emit data in the past as well, but we were not advanced enough to collect the data. The technologies weren’t capable of capturing these.

Today, every other thing is emitting data. Be it the signals from your refrigerator, the air-conditioners in your room, the cars on the street, the videos that you like or dislike on YouTube, your transactions at the nearby Super-market, the posts you make on Facebook. These are all emissions of your day-to-day work.

Companies and analysts want to understand you and your behavior using the data you generate. To give you an idea of how much data is generated, let’s look at the ridiculous amount of data some of the big giants produce on daily basis.

Facebook’s daily logs are more than 60 terabytes every day.

Google’s web index is more than 10 petabytes of information.

Millions of customers visit Walmart stores on daily basis. Imagine the size of these transactions data.

Do we need this data? Of course!

I strongly believe this.

Every data has a story. With proper analysis and techniques, we can get a lot from data. Now, there will be times when you won’t get anything insightful from it. I consider this also good. Maybe, there is something wrong with the way we are collecting the data and we need to have a look at it.

So far we know that every other thing is puking data. And they are puking in bulk. To get something good out of these pukes, we need to store them somewhere, do our analysis on them and get insights. How do we go about it?

Storage of this data has never been a problem. The disks are relatively cheap. A terabyte of disk costs ~ $35. So storing them on physical disks is not an issue. What is the problem then?

Do you have an idea how fast data can be read from a disk? It is still in MB/s (megabytes per second). Let us assume a speed of 100 MB/s. With this speed, the time taken to read 1 terabyte of data would be ~ 3 hours! Below is the calculation.

1 terabyte = 1,000,000 MB. So the total time ~ 1,000,000/100 = 10,000 seconds ~ 10,000/3600 hours ~ 3 hours
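Here is the same calculation as a tiny Python sketch. The function name is mine. The second call is an idealized what-if of my own: it assumes the terabyte is split evenly over 100 machines that all read their share at the same time, ignoring any coordination or network overhead.

```python
def read_time_seconds(size_mb, speed_mb_per_sec):
    """Time to read size_mb megabytes at a constant read speed."""
    return size_mb / speed_mb_per_sec

one_terabyte_mb = 1_000_000

# One machine reads the whole terabyte at 100 MB/s
single_machine_hours = read_time_seconds(one_terabyte_mb, 100) / 3600

# Idealized: 100 machines each read 1/100 of the data in parallel
parallel_seconds = read_time_seconds(one_terabyte_mb / 100, 100)

print(round(single_machine_hours, 2), "hours on one machine")
print(parallel_seconds, "seconds across 100 machines (idealized)")
```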

3 hours is far too long to read 1 terabyte of data, given the amount of data giants like Facebook generate on a daily basis. Also, a single machine won't be able to handle this much data. So what is the solution?

What if we could distribute the data across multiple machines? Yes, that sounds promising. Data that is too big to be handled by one machine can be distributed across multiple machines such that each machine has some portion of the actual data. These machines are interconnected, so they can communicate with each other without any issue. And that is exactly what a cluster is.

A cluster is a combination of many machines connected to each other. Think of it like this. There is a process by which, my computer and your computer can be connected to each other and we can have a cluster with two nodes or machines.

Now that you have your machines connected with each other in the cluster, you can take advantage of the computation power of each of the machines. And thus, a task which one couldn't even think of accomplishing on a single machine can now be done very easily by a cluster of machines.

What are Hadoop and Spark? And how are they different?

Hadoop is a framework that supports the storage and processing of large data sets in a distributed computing environment. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Spark is a powerful analytics engine. It is a fast and general engine for large-scale data processing.

Hadoop provides both the data storage and processing power whereas Spark is meant for doing analysis and processing of big data.

One thing to notice is that Spark can be about 10 to 100 times faster than the Hadoop MapReduce framework for some workloads, because it processes data in memory instead of writing intermediate results to disk as Hadoop MapReduce does.

Spark is a Swiss army knife of the analytics world. There are various APIs (Python, R, Scala) through which one can interact with the Spark framework. For machine learning algorithms, there is MLlib, which can be used to perform some common analyses like regression, K-means clustering, and classification.

We will talk more about Spark in future posts. I am exploring it rigorously and plan to write out my understanding of it soon.

I hope you enjoyed reading this post and feel a little more familiar with big data now. Hit the share button if you would like your friends to read this.

My first attempt at public speaking

2016-06-27T07:00:00-03:00

Having graduated from college, I was just 6 months into my new job when I was asked to give a tech-talk on Introduction to Data Science. Imagine the thrill and fright of a fresher when he is presented with a situation like this. The first thing I asked my manager was: how many people are we expecting in the audience? Notice the fright in the question. I was scared that I might screw up in front of a large audience. I might not be able to present the talk eloquently, the audience might laugh at me, and all those crappy thoughts that quickly cloud your mind with fear and uncertainty.

Without much thought and ignoring the above questions, I straight away agreed to present the talk. The plan was simple. Commit to the task and then learn how to do it. In a few days, after a lot many iterations I had the presentation ready with me. I had not prepared any script or dialogues for the talk. The reason being, I wanted to give an impromptu talk. Days passed and I got busy in routine office work and meeting project deadlines.

Writing and public speaking have always fascinated me. I had read many posts on effective public speaking. Now was the time to apply whatever I had read about it.

The day had come when I had to give the tech-talk. That morning, I woke up a little early and I looked at a few videos on dealing with the stage-fear. I found this video in particular to be helpful.

It's not that I had not practiced at all. We had a few dry runs. Frankly, I didn't have the exact lines that I intended to say. I had just planned the overall flow of the talk. As I had read on various blogs on public speaking, I reached the conference room a little early, walked around the room confidently, and talked to some of my colleagues about random topics so as to divert my attention and calm my nerves.

Within a few minutes, the room got packed with folks from various teams. I hadn't expected this many people to turn up. Nevertheless, I was happy and confident. I kept saying this to myself: you are going to rock today! So many people have come just to listen to you. You know this stuff. This is your bread and butter. I took a few deep breaths and started the talk.

After the first few minutes of the talk, I started reading some of the faces in the audience, trying to understand if they were following what was being said. Good proxies for this include people asking questions, nodding at your statements in agreement, and making notes. I could see the audience doing these. So the feedback was good and I continued with the talk.

It was not all that easy. I stammered in between, fell short of the right words, the throat felt thirsty, some of the faces in audience looked scary but somehow the confidence in me was at peak. I was happy talking about something that I enjoyed doing.

It has been almost a year since I presented this talk. Now when I think about it, I feel good. I feel accomplished. I know I could have screwed up badly. I had to start somewhere. And today I am happy that I did take that chance.

Next time, don’t let that chance pass away. Take that risk, prepare well and make it your first attempt at public speaking. We all need to start somewhere.

Did you find the article useful? If you did, share your thoughts in the comments.

Understanding HIVE for data science people

2016-06-22T04:43:00-03:00

I have been working as Statistical analyst for the last 1.5 years and fortunately I got to work on Hadoop on one of my initial projects. Hadoop sounds scary to a lot of people and I am no exception. In this post, I would make an attempt to explain HIVE-the data warehouse for Hadoop ecosystem.

What is HIVE? And why as an analyst I should care about it?

HIVE is the part of the Hadoop infrastructure where data gets stored and you can write your query to fetch data from HIVE (tables). Now when I had started working on Hadoop, I knew a little SQL and a tad little of data science. I had take a few courses on Statistics during my college but then who really studies in college! Having allocated to a project which required an understanding of Hadoop, I planned how I need to understand this system so that I can contribute my best to the project.

The first thing you need to build any model is data. Comes to picture-HIVE. As I said, I knew SQL and as I started working on HIVE I got to know that if one knows SQL then one knows HIVE, it’s just that he is not aware of it yet.

Why do I say so? Because, HIVE was built with the sole purpose that database people who are comfortable working on SQL need not require to learn a completely new language to fetch and interact with data in the Hadoop ecosystem. And I believe the founders of Hadoop have achieved this successfully.

HIVE hides the map-reduce processes from the user, and all you need to worry about is writing your SQL query. You don't need to worry at all about mappers and reducers. All of this is hidden from the user, and if you want to know more about it, there are ways to do that as well (for advanced users of HIVE).

I will cover the most common queries that we deal with when working as an analyst on Hadoop: creating a table (external/internal), getting information about a table, dropping a database, and a few others.

Creating a table

A table in HIVE can be of 2 types: internal or external. Stay with me for the next few lines and I will explain the differences. By default, the table that we create in most cases is an internal table. However, external tables come into the picture when you want to build a table over some file.

The difference between an external and an internal table lies in what happens when we drop the table.

External table:
- If you drop the table, the table definition and its metadata are dropped, but not the data.
- The data is located in hdfs (and not in the local file system), and since this data may be accessed by many tables, we don't want it to get dropped.
- Just add the keyword external to specify that we want to build an external table: CREATE EXTERNAL TABLE external_table_name

Internal table:
- If you drop the table, the table, its metadata and even the data are dropped.
- If the data for the table resides in the local file system, you should go for creating an internal table.
- Use CREATE TABLE internal_table. HIVE has no INTERNAL keyword; a table created without the external keyword is internal by default.

Once the table gets created, we want to get the data from the table for our analysis. Before we dive into that, it is a good idea to look at the overall structure of the table: column names, column types, whether it is external or internal, and the owner of the table.

All of this information can be retrieved using:

describe formatted table_name;

This gives you a lot of information in a formatted manner. If you are in a hurry and just want to see the column names and no other details, use:

desc table_name;

External tables are created over some file. This file should be located in hdfs, the file system for Hadoop. A few commands will easily move the file to hdfs.

How to move a file to hdfs?

Below are the steps:

- Create a directory where you want to move the file:
  hadoop fs -mkdir directory_name
- Check if the directory got created:
  hadoop fs -ls
  You should see your directory name here. This directory is empty right now. You can check this using:
  hadoop fs -ls directory_name
- Copy the file from local to hdfs:
  hadoop fs -copyFromLocal 'path where file is stored' directory_name
- Check if the file has been copied to the directory:
  hadoop fs -ls directory_name
  You should see the file now.
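The steps above can also be scripted. Here is a minimal Python sketch that builds the same hadoop fs commands and, by default, only prints them (a dry run). The function names and the example paths are hypothetical, and actually running the commands assumes the hadoop CLI is on your PATH.

```python
import shlex
import subprocess


def hdfs_upload_commands(local_path, hdfs_dir):
    """Return the `hadoop fs` commands from the steps above, in order."""
    return [
        ["hadoop", "fs", "-mkdir", hdfs_dir],                      # create the target directory
        ["hadoop", "fs", "-ls"],                                   # the new directory should be listed
        ["hadoop", "fs", "-ls", hdfs_dir],                         # the directory is still empty
        ["hadoop", "fs", "-copyFromLocal", local_path, hdfs_dir],  # copy the local file into hdfs
        ["hadoop", "fs", "-ls", hdfs_dir],                         # the file should be listed now
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing command
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    run(hdfs_upload_commands("/tmp/survey.tsv", "data1"))
```

Call run(..., dry_run=False) only once the printed commands look right.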

Open the vim editor and write the code below in a file. I will name my file create_table.hql:

vim create_table.hql

Creating the table. Code below:

-- drop table takes no "external" keyword, even for external tables
drop table if exists db_name.table_name;

create external table if not exists db_name.table_name (
  ID string,
  WorkStatus string,
  Score int,
  Residence_Region string,
  income string,
  Gender smallint,
  Alcohol_Consumption string,
  Happy string
)
row format delimited
fields terminated by '\t'
stored as textfile
location '/user/mbarnwa/data1';
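If you create many such tables, you can generate the create statement from a list of columns instead of typing it out each time. The sketch below is only an illustration: the helper create_table_ddl is hypothetical, and it makes the simplifying assumption that a table is external exactly when a location is given.

```python
def create_table_ddl(table, columns, location=None, delimiter="\\t"):
    """Build a HiveQL CREATE TABLE statement for a delimited text file.

    Assumption for this sketch: passing `location` makes the table external;
    without it, a plain (internal) table is created.
    """
    keyword = "EXTERNAL TABLE" if location else "TABLE"
    cols = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    lines = [
        f"CREATE {keyword} IF NOT EXISTS {table} (",
        cols,
        ")",
        "ROW FORMAT DELIMITED",
        f"FIELDS TERMINATED BY '{delimiter}'",
        "STORED AS TEXTFILE",
    ]
    if location:
        lines.append(f"LOCATION '{location}'")
    return "\n".join(lines) + ";"


if __name__ == "__main__":
    # Two of the columns from the table above, for brevity.
    print(create_table_ddl(
        "db_name.table_name",
        [("ID", "string"), ("Score", "int")],
        location="/user/mbarnwa/data1",
    ))
```

You can save the printed statement to create_table.hql and run it with hive -f create_table.hql.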

Dropping a database

Many a time, you may want to drop a database that you don't need anymore. Say the database to be dropped is userdb. You would run:

drop database userdb;

This command works only if your database is empty, i.e. userdb has no tables in it. That is rarely the case, so you can either delete each table in the database with

drop table tableName;

or add one simple keyword, CASCADE. So the final command to drop a database (even if it has tables) is:

drop database userdb cascade;

Pre-Placement Talk Snippets

2014-09-07T08:00:00-03:00

I'd attended PPTs (Pre-placement Talks) of almost all the companies I was interested in during my placement time. Many students find attending PPTs a waste of time. I was not one of them.

I liked attending PPTs. I felt they gave me continuous motivation to be consistent in my preparation. I remember the first PPT I'd attended. It was EXL and it felt like almost every final year student was in that room. Not a single chair in Kalidas Auditorium was unoccupied.

Looking at such a big crowd of students in that auditorium, my friend and I looked at each other, confused. We exchanged these words.

Me: "Rahul, there are so many people here, man. The competition is tough."

Rahul: "It'll work out, man. It's only a job, after all."

Yes, the competition is going to be tough. So be prepared. I'm not trying to scare you. I'm just saying give your best shot at preparing for the companies you're interested in. It's just a matter of a few months. I remember I'd studied more in that semester than in all the previous semesters combined. This semester is an opportunity to learn the maximum out of your college life. Make it count.

I always made it a practice to take notes during the presentations I attended. Then, when needed, I would look at them and try to understand what exactly the company was looking for. Be the boring one and take notes.

Following are the snippets for some of the companies that I could find on my phone.

FICO

Define score card for banking and lending institutions

Clients include DB, HSBC, ICICI bank, Coca-Cola, JP Morgan Chase, Healthways

Projects in diverse fields

FICO score - Number 1 score in North America

Hiring 25 to 30 people across India

Custom analytic modeling for Maths and Computing

Informal structure and informal culture

Miebach consulting

Supply chain Company

What is supply chain - Everything involving from the start of manufacture of product till delivery at right time at right price

Started in 1973, started in India in 1996, founded by Dr. Miebach

Consulting in automotive, spare parts, fashion, FMCG, retail and e-commerce

Hiring for Associate consultant

Close to 60 employees in India

You need to be a team player, result oriented, a leader

Procedure for hiring: CV submission, test, GD/case interview

Test: quantitative, analytical

Capital One

Credit cards, auto finance, home loans, consumer and commercial banking...50 percent bigger than SBI in terms of bank labs

US, UK, Canada

Fortune 500 company, ranked 127

Only 22 years old but is comparable to top banks

Credit card interest rate in India? Look for it.

Making decisions based on data

Times Internet

PLP, product leader program

Analytical ability, General aptitude

Basic aptitude test followed by pseudo algorithm test

Goldman Sachs

Market maker, provides liquidity

Advisory services eg. Merger and acquisition

Financier eg. For a growing company like Facebook

Asset manager

Principal, find the best companies to be invested in

Founded in 1869, 32400 people in 65 offices in 34 countries

Survived 2008 crisis

Financial holding company

Bangalore is global hub of GS with representatives from all divisions

Our client's interests always come first

We anticipate the rapidly changing needs of our clients

GS culture: excellence, integrity, teamwork, meritocracy, entrepreneurial spirit, client service

Quantitative roles

Investment management, automated trading, developing algorithms to run trade on electronic markets like NYSE, NASDAQ

Risk modeling: building models to quantify market risk and liquidity risk. Post-trade activities, surveillance analytics

What is risk? Why measure risk? Relevant to whom?

One possible measure: How much can we lose in one day?

Historical simulation

Value at risk
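To make the "how much can we lose in one day?" note concrete, here is a minimal Python sketch of one-day value at risk by historical simulation. It is only an illustration and was not part of the PPT: the function name historical_var, the quantile convention and the toy returns are all assumptions.

```python
import math


def historical_var(returns, confidence=0.95):
    """One-day value at risk (as a fraction of portfolio value) by historical simulation."""
    if not returns:
        raise ValueError("need at least one historical return")
    # Turn returns into losses (a -2% return is a 2% loss) and sort ascending.
    losses = sorted(-r for r in returns)
    # Pick the empirical `confidence` quantile of the losses (one simple convention).
    index = min(len(losses) - 1, max(0, math.ceil(confidence * len(losses)) - 1))
    return losses[index]


if __name__ == "__main__":
    # Hypothetical daily returns, made up for illustration only.
    toy_returns = [0.01, -0.02, 0.03, -0.04]
    print(historical_var(toy_returns, confidence=0.75))
```

The idea is to let past daily returns stand in for tomorrow's possibilities and read off the loss that was exceeded on only a small share of days.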

AmEx, American Express

Started in 1850 as courier firm

Moved to charge and credit card business in 1958

Transactions worth 900 billion dollars by volume globally

Interesting projects like shopping through Twitter, tweet to buy launched in US last year

AmEx is all three parties combined in online transaction

500 plus people based in Gurgaon and Bangalore

Decision sciences, campaign management, big data.

Newly involved in e-commerce

Generate different kinds of score

Wipro

Wipro, applying thought

WIPRO: Western India Vegetable Products Limited

Got Listed on NYSE in 2001

Provider of Consulting services, BPO, IT, R&D services

Largest independent R&D services provider in the world

Six SBUs (strategic business units)

Banking & financial services; healthcare, life science and services; retail consumer products; global media & telecom; energy, natural resources & utilities; manufacturing and hi-tech

One of the most ethical companies by Ethisphere institute

Futures First

Proprietary trading, most challenging work in financial services

You don't take calls from clients...you invest on your own

GHF group

Trading...buy low, sell high...having an expectation and trying to trade on that expectation

Discipline, risk taking ability, decision making, street smartness, flexibility and adaptability, mental toughness, analytical ability

Trades in: Interest rates, commodities, foreign exchange, equity index futures

In terms of volume based market share, we are in top 10

Group of 700 traders and analysts

Started in 1993 in London, in India in 2004

Performance bonus on an average of 1.2 lakhs

Bonus can be zero or explode to multiple times of your salary

Work week from Monday to Friday, working hours depend on your product market

The best part is "even I don't know what is the maximum amount I can make" - one of the employees in the video

Citi Bank

PPT. presented by Raj Kurucheti

Global bank, 200 years of history

Leader in the financial sector with presence globally

Our primary business in citi corp

Citi technology centre: Pune

Test: analytical, logical reasoning, language-agnostic coding

I hope you found this post helpful. Share it with your friends if you found it useful. And as always, drop a comment if you have any queries and I will respond back to you as soon as possible.

Prepare well for your placements. All the best!

The Placement Season

2014-07-31T08:00:00-03:00

The placement season is back. Nothing new, it comes every year, but maybe this is new for you, and I'm talking to those who're about to witness probably the most important phase of their college life.

First things first. I'll start with a disclaimer - "Everything written on this space is my personal experience which is definitely not a statistically significant sample size"

The very first question that comes to a typical final year student: what should I start my preparation with? You've got a lot to do and you know that, but you don't know where to start.

I'll help you out here. I'd started with the TnPedia (this is specific to just students from IIT Kharagpur; others please ignore this).

Heard of TnPedia? I'm sure you have. If not, just download it from dc++. Ohh yes! dc is used for useful stuff as well, apart from the 'general daily stuff'. Download the last three years' TnPedia and read through the companies you're targeting. This'll give you an overview of what is expected in an interview and what procedures are followed in different companies. You may also gain some insight into what kind of preparation is needed for a particular company.

Now that you have read thoroughly about your target companies, start preparing accordingly. I'll try to give a brief introduction to the preparation needed for different profiles.

CONSULTING

Victor Cheng is your friend. Super-smart guy. Download his video lectures (again dc++) to get an understanding of what solving a case study looks like. I'd suggest you subscribe to his email newsletters as well. Those will help you slowly gain an overall idea of consulting.

Preparing for this takes time and you'll have to be persistent. Practise with your wingies, your department friends, anyone...but practise!

Do the role play. Some of you become the interviewers and one of you becomes the interviewee. I know it sounds weird. But trust me, you'll learn a lot. You'll overcome your anxiety about facing interviewers. At first the role play won't go that well, but with time it'll be smooth and everyone involved will benefit.

Share your feedback about each other. I remember I used to get very aggressive while participating in mock group discussions with my friends. You sure will get to know about your weaknesses. Work on these weaknesses before it's too late. By doing this, you get a chance to commit all your mistakes before the real interview, and that's a good thing. Enough words. It's time to execute.

"Being pro-active is the key."

Resources for preparation:

Download the video lectures from dc++. Some 10 videos are there

Subscribe to case interview emails

Case in point. The pdf is easily available on dc++ or online as well

Practise your basic math skills. You can't afford to make calculation mistakes in interview. CaseInterview will improve your calculation speed and accuracy

Case-interview is all about practise. After going through some frameworks you'll feel like you've mastered this, but I guarantee you'll find out otherwise once you actually solve a case-study problem in a real interview set-up. So practise, practise and practise.

That's all for now. I'll add about other job profiles soon as time permits. No more a free bird like I used to be in Kgp. Next update would be on Coding.

CODING

Let's talk about coding preparation. Coding was one thing which I'd started learning and practicing only during my placement preparation. Prior to that I never really liked it. Actually I'd never really put much effort into it. And let me tell you this - Coding is interesting stuff to say the least.

Some of the resources that I'd found helpful.

GeeksForGeeks

Arrays Archives

Strings Archives

Linked Lists Archives

Trees Archives

Bit Magic Archives

Output Archives

GeeksQuiz

Cracking the Coding interview

Data Structures And Algorithms Made Easy. This book is available in Java, C and C++ versions. Get the one you're comfortable with. As a beginner, I read the C version.

The above should suffice for a normal coding preparation. I'll add more soon. I'll ask some of my coder friends.

That's all for now. Next update would be on puzzles and brainteasers.

All the best for the placements. Prepare well. Any help needed? Drop me an email or comment down and I will get back to you as soon as possible.

Beautiful Sadness

2014-06-03T10:00:00-03:00

As I woke up this morning, it took me a few moments to realize where I was. I looked around, gathered my senses, and a thought thundered into my mind: I was not in Kgp. Not anymore. Life at Kgp is now the past, a past which will always be cherished and remembered as the guide of all my future endeavors. I had heard that time flies, but I didn't have an iota of an idea that it would fly this fast. The five years passed by in a blink.

When you leave Kgp, you find yourself juggling thoughts. You experience mixed emotions of both happiness and sadness, or rather, the term 'beautiful sadness' perfectly captures these feelings. You feel sad that the life of peace is over, the phenomenal five years. At the same time you find yourself smiling that Kgp happened to you, that you had your best time here, that you couldn't have imagined a better college life.

Thank you Kgp for making me what I am today, and know that you'll play a prominent role in whatever I aspire to become in my future. You are one of the best things that has happened to me in the past five years. Kgp, you have introduced me to some of the most amazing people, well, some weird ones as well, and a few limited editions.

Now that we’re separated, I feel the void that has been created between us. But know that no one can ever fill this space of yours. You will always be special and I want you to know that just a thought of yours will automatically bring a smile on my lips. Thank you for the wonderful time we had together. Kgp, thou shall always be missed.

Yours forever,

--Just another Kgpian

My stay at IIT Kharagpur -A synopsis

2014-04-16T08:00:00-03:00

Mathematics has always been my favorite, and when I got Mathematics & Computing at IIT Kharagpur I was on cloud nine. I did a quick Google search and landed on the official site of our institute. That was the first and also the last time I paid close attention to each of the tabs on the site. The five years of stay on this land have been phenomenal. I know time flies, but this fast, I had no idea.

So I started my first year with full innocence and attended almost all my classes for the first few days, after which I came to know about this beautiful concept: proxy. The first year passed mostly in making new friends, understanding DC++, giving auditions for societies I had no idea about whatsoever, celebrating birthdays (GPL), and attending a few classes. Needless to say, I had screwed up my CGPA badly.

Something happened at the very beginning of 2nd year. I got a little serious about my academics, and then I met Professor G.P Raja Sekhar. Well, he needs no introduction. We had 9 credits under him, and he was such a strict professor that I managed to attend almost all his classes, even the 7:30 ones. That helped and I ended up improving my CGPA. Thank you Sir. I owe this one to you.

Third year went normal –attending a few more classes, this time in our department mostly. I had my internship at IDRBT–an RBI institute in Hyderabad. I feel this internship opportunity is highly under-rated. I urge all you juniors to have a look at this opportunity. I met some juniors and a few seniors of our department. One can find a Kgpian in almost every city. Two months internship got over like a breeze.

Fourth year, this is the time when you get to know the real competition –internship season. Some 10 students could manage to get an internship offer from TnP. The rest of us tried our luck off-campus. I landed up in Mumbai at CRISIL Limited for internship. Mumbai is a city full of life and happenings.

Then came the final year –the placement season. You end up attending more PPTs than your classes. This green site–TnP automatically becomes your homepage. You study more in these four months than all of your first four years combined. A typical day consists of getting CV reviews from your seniors, discussing puzzles, brushing up your coding skills, solving case studies and attending some PPTs. Being proactive is the key. Be well prepared and you will bag your dream job quite easily. Talk to relevant seniors –they know a lot.

I was offered the job of statistical analyst at WalmartLabs. The interview revolved mostly around Statistics, my internships and a few puzzles. Once you are placed, the last semester is a total delight. You are about to leave your institute, so all your emotional chords get struck. You wander around doing all the things you might have missed during your four years' stay at IIT Kharagpur.

I wish all the very best to my junior batch in their future endeavors.

You won't get a life like Kgp anywhere else. Make it count.

Happy Women's Day

2014-03-08T08:00:00-03:00

Today is 8th March, just another day except that today is women’s day. Women’s day started as a Socialist political movement but over time it has changed to a day when it’s celebrated as a sign of respect and love to women across the world. Women’s day is all about respecting and caring for women around you. Take some time to wish and appreciate women in your life. Not that you shouldn’t love them on any ordinary day. All I’m saying is –It’s their day. Make it special just like you do on someone’s birthday.

Give a call or pay a visit to your mother, wife, daughter, girlfriend, colleague or any woman who has stood by you and supported you through the ups and downs of your life. Don't let this pass like just another day. Wish them. Tell them that you are thankful to them; I guarantee you a beautiful smile in return.

I called my mother, wished her, and made her smile. Did you?

The Walmart Interview

2014-03-04T08:00:00-03:00

Now that you are reading this, I’ll assume that you are someone looking for an opportunity of placement or an internship. Go ahead and read. You have a long journey to cover.

I was offered a job as a Statistical Analyst at WalmartLabs. Let me guide you through their placement procedure. Sometime in October, we had the Pre-placement talk (we call it PPT here) where we were introduced to the company's fundamentals, its businesses and some awesomeness of Walmart. The guy presenting the PPT looked geeky and did not pay much heed to whether we were listening at all. But frankly, he covered almost everything.

A month later, we had the online test which was based on statistical questions covering mostly topics such as:

Hypothesis Testing

R square

Regression

Basic probability distribution functions

Confidence interval

It also had 2 coding questions out of which I somehow managed to write the logic of one of them. The duration of the test was 90 minutes. I had started with statistical questions first and later moved to the coding part. My friends who had started with the coding questions were not able to attempt some very easy statistics questions. My advice to you is, go for the objective statistics questions first. I managed to attempt some 17 questions out of a total of 23 questions. Based on the performance in the test a further shortlist of 7 students was declared. Needless to say I was one of them.

So I was selected for the personal interview of WalmartLabs. A lot had to be done. I didn’t know where to start. The first thing I did was to talk to one of my seniors, Manikanta Dinesh. His advice and suggestions helped. A lot. I kept myself surrounded with books for quite a few days. I had interviews for a few other companies as well. I couldn't put all my energy into this one company.

“Don’t keep all your fruits in one basket. Diversify.”

It was day 1, 1st December. I was selected in only one company. CGPA plays a crucial part. If you have a low CGPA, chances are that you will see students getting shortlisted and your name won’t grace any of the lists. I know that feeling. You see, CGPA plays its part. A strong one.

Next day, Day 2, the only company to appear for was WalmartLabs. So I took all the interview funda from a friend, actually a mentor, Tushar. He knew more about interviews at that time, having been placed at Credit Suisse on Day 1. Well prepared and suited up, I rode my cycle to Nalanda Complex. On my way, I kept muttering and rehearsing my answers, mostly the introduction part. You see, how you open the interview is important.

Though there were 2 HR guys from Walmart, the 1st round was a Skype interview. I had the experience of talking to my girlfriend on Skype, so the 1st round went well. That was just said in jest. The 1st round lasted for some 50 minutes. I was asked mostly about my internship at IDRBT, along with a few puzzles such as the 3 doors puzzle, the famous Monty Hall problem. I was able to answer almost all the puzzles. The interviewer was a CS graduate from IIT Bombay who had been working at WalmartLabs for the past 6 years. How do I know, you ask? Well, I had asked him about it towards the end of the interview, where you get to ask the interviewer questions. “Never miss this opportunity. Always ask questions. Period.”

So the 1st round went well. Two of my friends couldn't make it to the 2nd round. This round was a Skype interview as well. This guy, the interviewer, was a dude. We talked for some 90 minutes and it didn't feel like an interview. Some very basic concepts of statistics were talked about. Some of the questions I recall:

How would you explain probability to your grandmother?

How would you teach statistics to a layman?

What are assumptions of linear regression? What are their implications?

A few puzzles.

Explain your work at CRISIL Ltd.

Satisfied with my grasp of Statistics, the interviewer moved towards courses like Real Analysis and Measure Theory. Question: Tell me about Cauchy's law of convergence? I was blank. Total blank. A few seconds passed. I muttered a few words vaguely. Finally I said confidently, “Sir, I don't remember this as of now. I didn't like these courses much. But if you want me to know these subjects, I can be well versed in them given a day or two.” Guess what? He was impressed by my answer. He said, “Not a problem Manish! I was just curious. You are good in Statistics and that is what mostly matters here.”

The 3rd round started with only two of us in the game now. The interviewer was a senior executive from US. He asked some general questions from my CV. No more puzzles. This round too lasted for some 50 minutes. Finishing this round, I expected they would announce their result. But they had more in their basket. The other friend of mine couldn’t clear the 3rd round.

The 4th round was an HR round with one of the 2 guys who were present there. This lasted for some 15 minutes. By this time, I was determined to be selected. Cleared it with a smile. Yes, there was a 5th round: a telephonic interview with the senior HR from the Bangalore office. Again a 30-minute interview. Some of the questions I recall:

Why Walmart?

Who is your best buddy? Worst buddy?

What is your greatest weakness?

Any questions you would like to ask?

The time now was 7:00 p.m. I was waiting for the result of the longest interview of my life, walking restlessly outside the room. I waited for some 15 minutes. I could see the HR guys talking and discussing intensely over the phone. A few more minutes passed. The wait was getting heavy now. The HR guy finally came out and, with a broad smile, said, “Congratulations! Welcome to the team.”

That moment I knew what success tasted like. I was happy, excited. Overwhelmed! All of a sudden, I felt like a hero. They clicked a photo of me wearing the Walmart cap. They gave me a bag of gifts as a sign of welcome. On my way to my hall, I gave a quick call to my father and then to my maa. She said she had been praying the whole day. Her prayers had worked their charm and she was thankful to her Gods.

Reached my wing, told my friends and then they congratulated –GPL. The night was great. It called for a celebration. We headed to the bar. Time for some drinks.

One final thought:

“There is one day, there is one job, and there is you.”

Did you find the article useful? If you did, share your thoughts in the comments.