In the last post we talked about the idea of adversarial validation and how it helps when your validation set results don't agree with your test set results. In this post, I will share the R code that implements adversarial validation. The data used is from the Numerai competition.
Loading required packages
library(randomForest)
library(glmnet)
library(data.table)
library(MLmetrics)
getwd() # check the current working directory
dir() # list its contents to confirm the Data folder is present
Reading train and test data set
train <- fread("Data/numerai_training_data.csv")
train <- as.data.frame(train)
train$target <- as.factor(train$target)
str(train)
dim(train) # close to 136,000 rows and no missing values
head(train)
test <- fread("Data/numerai_tournament_data.csv")
test <- as.data.frame(test)
dim(test) # close to 150,000 rows and no missing values
head(test)
Creating the target variable to distinguish between train and test data
train$isTest <- 0 # assigning 0 for train and 1 for test data
test$isTest <- 1
Combining train and test data into a single data frame
combi <- rbindlist(list(train[, -51], test[, -1])) # removing 'target' from train data and 't_id' from test data
combi$isTest <- as.factor(combi$isTest)
combi <- as.data.frame(combi)
str(combi)
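As an aside, if the column positions ever change, the same combination can be done by name instead of by index; a small sketch, assuming the feature names are shared between the two files (as they are in the Numerai data):
trainCols <- setdiff(names(train), "target") # the 50 features plus isTest
testCols <- setdiff(names(test), "t_id")
combiByName <- rbindlist(list(train[, trainCols], test[, testCols])) # same rows and columns as combi above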
Train a classifier to identify whether data comes from the train or test set
logitMod <- glm(formula = isTest~. , data = combi, family = 'binomial')
summary(logitMod)
head(logitMod$fitted.values)
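Before using these probabilities, it helps to check how separable the two sets actually are. A minimal sketch using the AUC function from MLmetrics (already loaded above); an AUC near 0.5 means train and test look alike, while values well above 0.5 mean the test set is distinguishable and adversarial validation is likely to pay off:
AUC(y_pred = logitMod$fitted.values, y_true = as.numeric(as.character(combi$isTest)))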
Predict on the training data to see which rows resemble the test data most
pred <- predict(logitMod, newdata = train, type = 'response')
head(pred)
trainData <- train
head(trainData)
trainData$predictTest <- pred
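It is also worth glancing at the distribution of these probabilities; if most of them sit near 0.5 the training rows are hard to tell apart from the test rows. A quick look using base R:
summary(pred)
hist(pred, breaks = 50, main = "P(training row looks like test)", xlab = "predicted probability")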
Sort the training data by its probability of being in the test set
trainData <- trainData[order(trainData$predictTest, decreasing = T), ]
valIndx <- 1:(0.2*nrow(trainData)) # the top 20% most test-like rows become the validation set
colsToKeep <- names(trainData)[!names(trainData) %in% c('isTest', 'predictTest')]
trainFinal <- trainData[-valIndx, colsToKeep]
valData <- trainData[valIndx, colsToKeep]
write.csv(trainFinal, 'trainfinal.csv', row.names = F, quote = F)
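Before modelling, a quick sanity check that the split sizes and class balance look reasonable:
dim(trainFinal)
dim(valData)
prop.table(table(valData$target)) # class balance in the adversarial validation set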
Build a random forest classifier to predict the 'target' variable
set.seed(1) # setting seed for reproducibility of the result
matX <- trainFinal[, -grep('target', names(trainFinal))]
response <- trainFinal[, 'target']
table(response)
rfMod <- randomForest(x = matX, y = response, ntree = 200, mtry = 7) # training randomForest model
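Before scoring the validation set, the fitted forest itself gives a rough sense of the model; a short, optional check:
print(rfMod) # out-of-bag error estimate and confusion matrix
head(importance(rfMod)) # mean decrease in Gini for each feature
varImpPlot(rfMod) # quick visual of the most useful features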
Prediction on validation set
rfValPreds <- predict(rfMod, newdata = valData, type="prob")
head(rfValPreds)
LogLoss(y_pred = rfValPreds[, 2], y_true = as.numeric(as.character(valData$target))) # LogLoss function from MLmetrics package; column 2 holds P(target = 1)
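For reference, LogLoss is just the mean negative log-likelihood of the predicted probabilities; a hand-rolled version (with clipping to avoid log(0)) should match the MLmetrics number:
p <- pmin(pmax(rfValPreds[, 2], 1e-15), 1 - 1e-15) # clip probabilities away from 0 and 1
y <- as.numeric(as.character(valData$target))
-mean(y * log(p) + (1 - y) * log(1 - p))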
The validation set gives a LogLoss of 0.699. Let us see how this comes out on the test data set. For this we will predict on the test data and upload the predictions to the site.
Prediction on actual test data
testPreds <- predict(rfMod, newdata = test, type = 'prob')
testPreds <- testPreds[, 2] # keep the probability of class 1
submission <- data.frame(t_id = test$t_id, probability = testPreds)
head(submission)
write.csv(submission, 'submission.csv', row.names = F, quote = F)
The predictions on the test data show a LogLoss of 0.694, which is very close to that of the validation set. We can now expect the validation set and the test set to give similar results.
Did you find the article useful? If you did, share your thoughts in the comments, and share this post with people who you think would enjoy reading it. Let's talk more data science.