Manish Barnwal

...just another human

An illustrated introduction to adversarial validation part 2

In the last post we talked about the idea of adversarial validation and how it helps when your validation results don't agree with your test results. In this post, I will share the R code that implements adversarial validation. The data comes from the Numerai competition.

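Loading required packages

library(data.table)   # fread(), rbindlist()
library(randomForest) # randomForest()
library(MLmetrics)    # LogLoss()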

Reading train and test data set

train <- fread("Data/numerai_training_data.csv")
train <- as.data.frame(train) # plain data.frame (assumed; the indexing below relies on data.frame semantics)
train$target <- as.factor(train$target)

dim(train) # close to 136000 rows, with no missing values

test <- fread("Data/numerai_tournament_data.csv")
test <- as.data.frame(test) # plain data.frame (assumed, as for train)
dim(test) # close to 150000 rows, with no missing values

Creating the target variable to distinguish between train and test data

train$isTest <- 0 # assigning 0 for train and 1 for test data
test$isTest <- 1

Combining train and test data into a single data frame

combi <- rbindlist(list(train[, -51], test[, -1])) # removing 'target' from train data and 't_id' from test data
combi$isTest <- as.factor(combi$isTest)
combi <- as.data.frame(combi) # rbindlist() returns a data.table; plain data.frame assumed here

Train a classifier to identify whether data comes from the train or test set

logitMod <- glm(formula = isTest~. , data = combi, family = 'binomial')
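
Before using such a classifier, it can help to check how well it separates the two sets at all. The sketch below is not part of the original code: it uses simulated data, and the names toyTrain, toyTest, toyMod and aucRank are made up for illustration. It fits the same kind of glm and computes the AUC in base R via the rank-sum identity. An AUC near 0.5 would suggest train and test look alike; the further above 0.5, the more they differ.

```r
# Toy illustration on simulated data; every name here is hypothetical.
set.seed(42)
n <- 500
toyTrain <- data.frame(f1 = rnorm(n, mean = 0),   f2 = rnorm(n), isTest = 0)
toyTest  <- data.frame(f1 = rnorm(n, mean = 0.5), f2 = rnorm(n), isTest = 1)

toyCombi <- rbind(toyTrain, toyTest)
toyCombi$isTest <- as.factor(toyCombi$isTest)

# Same kind of model as logitMod above, fitted on the toy data
toyMod  <- glm(isTest ~ ., data = toyCombi, family = "binomial")
toyPred <- predict(toyMod, newdata = toyCombi, type = "response")

# AUC via the rank-sum (Mann-Whitney) identity; isPos is a logical vector
aucRank <- function(score, isPos) {
  r    <- rank(score)
  nPos <- sum(isPos)
  nNeg <- sum(!isPos)
  (sum(r[isPos]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

toyAuc <- aucRank(toyPred, toyCombi$isTest == "1")
print(toyAuc) # above 0.5 here, because f1 is shifted in the toy test set
```

Note that this AUC is computed on the same rows the model was fitted on, so it is an optimistic estimate; a held-out split would be the more careful choice.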

Predict on the training data to see which rows most resemble the test data

pred <- predict(logitMod, newdata = train, type = 'response')

trainData <- train
trainData$predictTest <- pred

Sort the training data by its probability of being in the test set

trainData <- trainData[order(trainData$predictTest, decreasing = TRUE), ]

valIndx <- seq_len(floor(0.2 * nrow(trainData))) # top 20% most test-like rows form the validation set
colsToKeep <- names(trainData)[!names(trainData) %in% c('isTest', 'predictTest')]

trainFinal <- trainData[-valIndx, colsToKeep]
valData <- trainData[valIndx, colsToKeep]

write.csv(trainFinal, 'trainfinal.csv', row.names = FALSE, quote = FALSE)
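
To make the sort-and-split step concrete, here is a small sketch on a made-up data frame. The data frame toy and its values are invented for illustration only; the logic mirrors the steps above: order by predicted probability of being test, then take the top 20% as validation.

```r
# Toy data frame standing in for trainData; the values are made up.
toy <- data.frame(
  id          = 1:10,
  predictTest = c(0.12, 0.91, 0.45, 0.77, 0.30, 0.66, 0.05, 0.83, 0.51, 0.28)
)

# Most test-like rows first
toy <- toy[order(toy$predictTest, decreasing = TRUE), ]

# Top 20% become the validation set, the rest stay in training
nVal     <- floor(0.2 * nrow(toy))
toyVal   <- toy[seq_len(nVal), ]
toyTrain <- toy[-seq_len(nVal), ]

print(toyVal$id) # ids 2 and 8, the two rows with the highest predictTest
```

Every row kept for training has a lower predictTest than every row in the validation set, which is the whole point of the split.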

Build a random forest classifier to predict the 'target' variable

set.seed(1) # setting seed for reproducibility of the result

matX <- trainFinal[, -grep('target', names(trainFinal))]
response <- trainFinal[, 'target']

rfMod <- randomForest(x = matX, y = response, ntree = 200, mtry = 7) # training randomForest model

Prediction on validation set

rfValPreds <- predict(rfMod, newdata = valData, type = "prob")[, 2] # probability of class '1'
LogLoss(rfValPreds, as.numeric(as.character(valData$target))) # LogLoss function from MLmetrics package

The validation set gives a LogLoss of 0.699. Let us see how this holds up on the test data set. For this, we will predict on the test data and upload the predictions to the site.
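
For readers who want to see what the metric measures, binary log loss can be written out in a few lines of base R. This is a sketch, not the MLmetrics implementation; logLossManual is a hypothetical helper, and the clipping constant eps is an assumption to avoid log(0). As a reference point, predicting 0.5 for every row gives log(2), about 0.693.

```r
# Binary log loss written out by hand; logLossManual is a hypothetical helper.
logLossManual <- function(yPred, yTrue, eps = 1e-15) {
  yPred <- pmin(pmax(yPred, eps), 1 - eps) # clip to avoid log(0)
  -mean(yTrue * log(yPred) + (1 - yTrue) * log(1 - yPred))
}

yTrue <- c(1, 0, 1, 0)

baseline  <- logLossManual(rep(0.5, 4), yTrue)           # uninformative predictions: log(2)
confident <- logLossManual(c(0.9, 0.1, 0.8, 0.2), yTrue) # confident and correct: lower loss

print(c(baseline = baseline, confident = confident))
```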

Prediction on actual test data

testPreds <- predict(rfMod, newdata = test, type = 'prob')
testPreds <- testPreds[, 2]

submission <- data.frame(t_id = test$t_id, probability = testPreds)
write.csv(submission, 'submission.csv', row.names = FALSE, quote = FALSE)

The predictions on the test data give a LogLoss of 0.694, which is close to the 0.699 on the validation set. We can now expect the validation set and the test set to give similar results.

Did you find the article useful? If you did, share your thoughts in the comments. Share this post with people who you think would enjoy reading this. Let's talk more of data-science.
