In the last post we talked about the idea of adversarial validation and how it helps when your validation set results don't agree with your test set results. In this post, I will share the R code that implements adversarial validation. The data used is from the Numerai competition.
Loading required packages
library(randomForest)
library(glmnet)
library(data.table)
library(MLmetrics)
getwd() # check the current working directory
dir() # list its contents to confirm the Data folder is present
Reading train and test data set
train <- fread("Data/numerai_training_data.csv")
train <- as.data.frame(train)
train$target <- as.factor(train$target)
str(train)
dim(train) # close to 136,000 rows and no missing values
head(train)
test <- fread("Data/numerai_tournament_data.csv")
test <- as.data.frame(test)
dim(test) # close to 150,000 rows and no missing values
head(test)
Creating the target variable to distinguish between train and test data
train$isTest <- 0 # assigning 0 for train and 1 for test data
test$isTest <- 1
Combining train and test data into a single data frame
combi <- rbindlist(list(train[, -51], test[, -1])) # removing 'target' from train data and 't_id' from test data
combi$isTest <- as.factor(combi$isTest)
combi <- as.data.frame(combi)
str(combi)
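As an aside, if the column positions ever change, the same combination can be done by name instead of by index; a small sketch, assuming the feature names are shared between the two files (as they are in the Numerai data):
trainCols <- setdiff(names(train), "target") # the 50 features plus isTest
testCols <- setdiff(names(test), "t_id")
combiByName <- rbindlist(list(train[, trainCols], test[, testCols])) # same rows and columns as combi above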
Train a classifier to identify whether data comes from the train or test set
logitMod <- glm(formula = isTest~. , data = combi, family = 'binomial')
summary(logitMod)
head(logitMod$fitted.values)
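Before using these probabilities, it helps to check how separable the two sets actually are. A minimal sketch using the AUC function from MLmetrics (already loaded above); an AUC near 0.5 means train and test look alike, while values well above 0.5 mean the test set is distinguishable and adversarial validation is likely to pay off:
AUC(y_pred = logitMod$fitted.values, y_true = as.numeric(as.character(combi$isTest)))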
Predict on the training data to see which rows resemble the test data most
pred <- predict(logitMod, newdata = train, type = 'response')
head(pred)
trainData <- train
head(trainData)
trainData$predictTest <- pred
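It is also worth glancing at the distribution of these probabilities; if most of them sit near 0.5 the training rows are hard to tell apart from the test rows. A quick look using base R:
summary(pred)
hist(pred, breaks = 50, main = "P(training row looks like test)", xlab = "predicted probability")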
Sort the training data by its probability of being in the test set
trainData <- trainData[order(trainData$predictTest, decreasing = T), ]
valIndx <- 1:(0.2*nrow(trainData)) # the top 20% most test-like rows become the validation set
colsToKeep <- names(trainData)[!names(trainData) %in% c('isTest', 'predictTest')]
trainFinal <- trainData[-valIndx, colsToKeep]
valData <- trainData[valIndx, colsToKeep]
write.csv(trainFinal, 'trainfinal.csv', row.names = F, quote = F)
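Before modelling, a quick sanity check that the split sizes and class balance look reasonable:
dim(trainFinal)
dim(valData)
prop.table(table(valData$target)) # class balance in the adversarial validation set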
Build a random forest classifier to predict the 'target' variable
set.seed(1) # setting seed for reproducibility of the result
matX <- trainFinal[, -grep('target', names(trainFinal))]
response <- trainFinal[, 'target']
table(response)
rfMod <- randomForest(x = matX, y = response, ntree = 200, mtry = 7) # training randomForest model
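Before scoring the validation set, the fitted forest itself gives a rough sense of the model; a short, optional check:
print(rfMod) # out-of-bag error estimate and confusion matrix
head(importance(rfMod)) # mean decrease in Gini for each feature
varImpPlot(rfMod) # quick visual of the most useful features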
Prediction on validation set
rfValPreds <- predict(rfMod, newdata = valData, type="prob")
head(rfValPreds)
LogLoss(y_pred = rfValPreds[, 2], y_true = as.numeric(as.character(valData$target))) # LogLoss function from MLmetrics package; column 2 holds P(target = 1)
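For reference, LogLoss is just the mean negative log-likelihood of the predicted probabilities; a hand-rolled version (with clipping to avoid log(0)) should match the MLmetrics number:
p <- pmin(pmax(rfValPreds[, 2], 1e-15), 1 - 1e-15) # clip probabilities away from 0 and 1
y <- as.numeric(as.character(valData$target))
-mean(y * log(p) + (1 - y) * log(1 - p))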
The validation set gives a LogLoss of 0.699. Let us see how this comes out on the test data set. For this we will predict on the test data and upload the predictions to the site.
Prediction on actual test data
testPreds <- predict(rfMod, newdata = test, type = 'prob')
testPreds <- testPreds[, 2] # keep the probability of class 1
submission <- data.frame(t_id = test$t_id, probability = testPreds)
head(submission)
write.csv(submission, 'submission.csv', row.names = F, quote = F)
The predictions on the test data show a LogLoss of 0.694, which is very close to that of the validation set. We can now expect the validation set and the test set to give similar results.
Did you find the article useful? If you did, share your thoughts in the comments, and share this post with people who you think would enjoy reading it. Let's talk more data science.