7 Best R Packages for Machine Learning

Machine learning is a subset of artificial intelligence that focuses on developing computer programs that learn from data and make predictions without being explicitly programmed. Machine learning is commonly divided into supervised learning, unsupervised learning, and reinforcement learning, and it covers both numerical (regression) and categorical (classification) problems when building models.
The R language is used for statistical analysis and computing by researchers around the world. R is widely used for building machine learning models thanks to its flexibility, efficient packages, and its ability to run deep learning models with cloud integration. Being an open-source language, R has packages contributed by programmers around the world that make it more user-friendly. The following R packages are widely used in industry:
- data.table
- dplyr
- ggplot2
- caret
- e1071
- xgboost
- randomForest
data.table
data.table provides a high-performance version of R's data.frame with enhanced syntax, memory efficiency, and a rich feature set. It ships a fast and friendly delimited-file reader and writer, and it is one of the top-rated R packages on GitHub. It offers low-level parallelism, scalable aggregations, and feature-rich joins and reshaping.
R
# Installing the package
install.packages("data.table")

# Loading package
library(data.table)

# Importing dataset
Gov_mortage <- fread("Government_Mortage.csv")

# Records with loan amount 114
# and county code 60
Answer <- Gov_mortage[loan_amount == 114 & county_code == 60]
Answer
Output:
   row_id loan_type property_type loan_purpose occupancy loan_amount preapproval msa_md
1:     65         1             1            3         1         114           3    344
2:   1285         1             1            1         1         114           2     -1
3:   6748         1             1            3         1         114           3    309
4:  31396         1             1            1         1         114           1    333
5:  70311         1             1            1         1         114           3    309
6: 215535         1             1            3         1         114           3    365
7: 217264         1             1            1         2         114           3    333
8: 301947         1             1            3         1         114           3     48
   state_code county_code applicant_ethnicity applicant_race applicant_sex
1:          9          60                   2              5             1
2:         25          60                   2              5             2
3:         47          60                   2              5             2
4:          6          60                   2              5             1
5:         47          60                   2              5             2
6:         21          60                   2              5             1
7:          6          60                   2              5             1
8:         14          60                   2              5             2
   applicant_income population minority_population_pct ffiecmedian_family_income
1:               68       6355                   14.844                     61840
2:               57       1538                    2.734                     58558
3:               54       4084                    5.329                     76241
4:              116       5445                   41.429                     70519
5:               50       5214                    3.141                     74094
6:               57       6951                    4.219                     56341
7:               37       2416                   18.382                     70031
8:               35       3159                    8.533                     56335
   tract_to_msa_md_income_pct number_of_owner-occupied_units
1:                    100.000                           1493
2:                    100.000                            561
3:                    100.000                           1359
4:                    100.000                           1522
5:                    100.000                           1694
6:                     97.845                           2134
7:                     65.868                            905
8:                    100.000                           1080
   number_of_1_to_4_family_units lender co_applicant
1:                          2202   3507         TRUE
2:                           694   6325        FALSE
3:                          1561   3549        FALSE
4:                          1730   2111         TRUE
5:                          2153   3229        FALSE
6:                          2993   1574         TRUE
7:                          3308   3110        FALSE
8:                          1492   6314        FALSE
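Beyond filtering, data.table's by argument makes the scalable grouped aggregations mentioned above concise. A minimal sketch, assuming the same Gov_mortage table with its loan_amount and county_code columns, computes the average loan amount and loan count per county:
R
# Grouped aggregation: average loan amount and count per county
avg_by_county <- Gov_mortage[, .(avg_loan = mean(loan_amount, na.rm = TRUE),
                                 n_loans  = .N),
                             by = county_code]
head(avg_by_county)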
dplyr
dplyr is a data manipulation package widely used in industry. It centers on five key data manipulation functions, also known as verbs: select(), filter(), arrange(), mutate(), and summarize().
R
# Installing the package
install.packages("dplyr")

# Loading packages
library(dplyr)
library(data.table) # for fread()

# Importing dataset
Gov_mortage <- fread("Government_Mortage.csv")

# Select
select(Gov_mortage, state_code)

# Mutate
m <- mutate(Gov_mortage, amount = loan_amount - applicant_income)
m

# Filter
f <- filter(Gov_mortage, county_code == 80)
f

# Arrange (sort by county code)
arrange(Gov_mortage, county_code)

# Summarize
summarise(f, max_loan = max(loan_amount))
Output:
# Filter
row_id loan_type property_type loan_purpose occupancy loan_amount preapproval msa_md
1 16 1 1 3 2 177 3 333
2 25 1 1 3 1 292 3 333
3 585 1 1 3 1 134 3 333
4 1033 1 1 1 1 349 2 333
5 1120 1 1 1 1 109 3 333
6 1303 1 1 3 2 166 3 333
7 1758 1 1 2 1 45 3 333
8 2053 3 1 1 1 267 3 333
9 3305 1 1 3 1 392 3 333
10 3555 1 1 3 1 98 3 333
11 3769 1 1 3 1 288 3 333
12 3807 1 1 1 1 185 3 333
13 3840 1 1 3 1 280 3 333
14 5356 1 1 3 1 123 3 333
15 5604 2 1 1 1 191 1 333
16 6294 1 1 2 1 102 3 333
17 7631 3 1 3 1 107 3 333
18 8222 2 1 3 1 62 3 333
19 9335 1 1 3 1 113 3 333
20 10137 1 1 1 1 204 3 333
21 10387 3 1 1 1 434 2 333
22 10601 2 1 1 1 299 2 333
23 13076 1 1 1 1 586 3 333
24 13763 1 1 3 1 29 3 333
25 13769 3 1 1 1 262 3 333
26 13818 2 1 1 1 233 3 333
27 14102 1 1 3 1 130 3 333
28 14196 1 1 2 1 3 3 333
29 15569 1 1 1 1 536 2 333
30 15863 1 1 1 1 20 3 333
31 16184 1 1 3 1 755 3 333
32 16296 1 1 1 2 123 2 333
33 16328 1 1 3 1 153 3 333
34 16486 3 1 3 1 95 3 333
35 16684 1 1 2 1 26 3 333
36 16922 1 1 1 1 160 2 333
37 17470 1 1 3 1 174 3 333
38 18336 1 2 1 1 37 3 333
39 18586 1 2 1 1 114 3 333
40 19249 1 1 3 1 422 3 333
41 19405 1 1 1 1 288 2 333
42 19421 1 1 2 1 301 3 333
43 20125 1 1 3 1 449 3 333
44 20388 1 1 1 1 494 3 333
45 21434 1 1 3 1 251 3 333
state_code county_code applicant_ethnicity applicant_race applicant_sex
1 6 80 1 5 2
2 6 80 2 3 2
3 6 80 2 5 2
4 6 80 3 6 1
5 6 80 2 5 2
6 6 80 1 5 1
7 6 80 2 5 1
8 6 80 1 5 2
9 6 80 2 5 1
10 6 80 2 5 1
11 6 80 2 5 1
12 6 80 1 5 1
13 6 80 1 5 1
14 6 80 1 5 1
15 6 80 2 5 1
16 6 80 2 5 1
17 6 80 3 6 3
18 6 80 2 5 1
19 6 80 2 5 2
20 6 80 2 5 1
21 6 80 2 5 1
22 6 80 2 5 1
23 6 80 3 6 3
24 6 80 2 5 2
25 6 80 2 5 1
26 6 80 2 5 1
27 6 80 2 5 2
28 6 80 1 6 1
29 6 80 2 5 1
30 6 80 2 5 1
31 6 80 2 5 1
32 6 80 2 5 2
33 6 80 1 5 1
34 6 80 2 5 1
35 6 80 2 5 1
36 6 80 2 5 2
37 6 80 2 5 1
38 6 80 1 5 1
39 6 80 1 5 1
40 6 80 1 5 1
41 6 80 2 2 1
42 6 80 2 5 1
43 6 80 2 5 1
44 6 80 2 5 1
45 6 80 2 5 1
applicant_income population minority_population_pct ffiecmedian_family_income
1 NA 6420 29.818 68065
2 99 4346 16.489 70745
3 46 6782 20.265 69818
4 236 9813 15.168 69691
5 49 5854 35.968 70555
6 148 4234 19.864 72156
7 231 5699 17.130 71892
8 48 6537 13.024 71562
9 219 18911 26.595 69795
10 71 8454 17.436 68727
11 94 6304 13.490 69181
12 78 9451 14.684 69337
13 74 15540 43.148 70000
14 54 16183 42.388 70862
15 73 11198 40.481 70039
16 199 12133 10.971 70023
17 43 10712 33.973 68117
18 115 8759 17.669 70526
19 59 24887 32.833 71510
20 135 25252 31.854 69602
21 108 6507 13.613 70267
22 191 9261 22.583 71505
23 430 7943 19.990 70801
24 206 7193 18.002 69973
25 150 7413 14.092 68202
26 94 7611 14.618 71260
27 81 10946 34.220 70386
28 64 10438 36.395 70141
29 387 8258 20.666 69409
30 80 7525 26.604 70104
31 NA 4525 20.299 71947
32 40 8397 32.542 68087
33 87 20083 19.750 69893
34 96 20539 19.673 72152
35 45 10497 12.920 70134
36 54 15686 26.071 70890
37 119 7558 14.710 69052
38 62 25960 32.858 68061
39 18 5790 39.450 68878
40 103 18086 26.099 69925
41 70 8689 31.467 70794
42 38 3923 30.206 68821
43 183 6522 13.795 69779
44 169 18459 26.874 69392
45 140 15954 25.330 71096
tract_to_msa_md_income_pct number_of_owner-occupied_units
1 100.000 1553
2 100.000 1198
3 100.000 1910
4 100.000 2351
5 100.000 1463
6 100.000 1276
7 100.000 1467
8 100.000 1766
9 100.000 4316
10 90.071 2324
11 100.000 1784
12 100.000 2357
13 100.000 3252
14 100.000 3319
15 79.049 2438
16 100.000 3639
17 100.000 2612
18 87.201 2345
19 100.000 6713
20 100.000 6987
21 100.000 1788
22 91.023 2349
23 100.000 1997
24 100.000 2012
25 100.000 2359
26 100.000 2304
27 100.000 2674
28 80.957 2023
29 100.000 2034
30 100.000 2343
31 77.707 1059
32 100.000 1546
33 100.000 5929
34 100.000 6017
35 100.000 3542
36 100.000 4277
37 100.000 2316
38 100.000 6989
39 56.933 1021
40 100.000 4183
41 100.000 1540
42 100.000 882
43 100.000 1774
44 100.000 4417
45 100.000 4169
number_of_1_to_4_family_units lender co_applicant
1 2001 3354 FALSE
2 1349 2458 FALSE
3 2326 4129 FALSE
4 2928 4701 TRUE
5 1914 2134 FALSE
6 1638 5710 FALSE
7 1670 3110 FALSE
8 1926 3080 TRUE
9 5241 5710 TRUE
10 3121 5710 FALSE
11 1953 933 FALSE
12 2989 186 TRUE
13 4482 2134 TRUE
14 4380 5339 TRUE
15 3495 5267 TRUE
16 4875 1831 TRUE
17 3220 5710 FALSE
18 3024 3885 TRUE
19 7980 2458 FALSE
20 7949 6240 TRUE
21 2015 542 TRUE
22 3215 2702 TRUE
23 2361 3216 FALSE
24 2482 6240 FALSE
25 2597 3970 TRUE
26 2503 3264 FALSE
27 3226 2570 TRUE
28 3044 6240 FALSE
29 2423 1928 TRUE
30 2659 5738 FALSE
31 1544 2458 FALSE
32 2316 3950 FALSE
33 7105 3143 FALSE
34 7191 4701 FALSE
35 4325 5339 FALSE
36 5188 2702 FALSE
37 2531 2458 TRUE
38 7976 2318 TRUE
39 1755 5026 FALSE
40 5159 4931 TRUE
41 2337 2352 FALSE
42 1317 2458 FALSE
43 1949 5726 FALSE
44 5055 5316 TRUE
45 5197 5726 FALSE
[ reached 'max' / getOption("max.print") -- omitted 1034 rows ]
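In practice these verbs are usually chained with the pipe operator rather than called one at a time. A minimal sketch, assuming the same Gov_mortage dataset loaded above:
R
# Chaining verbs with the pipe: filter, derive a column, then summarize
Gov_mortage %>%
  filter(county_code == 80) %>%
  mutate(amount = loan_amount - applicant_income) %>%
  summarise(max_loan = max(loan_amount, na.rm = TRUE))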
ggplot2
ggplot2, based on the Grammar of Graphics, is a free, open-source, and easy-to-use visualization package widely used in R. Written by Hadley Wickham, it is one of the most powerful R packages for visualization.
R
# Installing the packages
install.packages("dplyr")
install.packages("ggplot2")

# Loading packages
library(dplyr)
library(ggplot2)

# Data Layer
ggplot(data = mtcars)

# Aesthetic Layer
ggplot(data = mtcars, aes(x = hp, y = mpg, col = disp))

# Geometric Layer
ggplot(data = mtcars, aes(x = hp, y = mpg, col = disp)) +
  geom_point()
Output:
- Geometric layer:
- Geometric layer – Adding Size:
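The second figure extends the aesthetic layer with point size, which the code above does not show. A minimal sketch of that layer, with qsec chosen here purely for illustration as the size variable:
R
# Geometric layer with an added size aesthetic:
# point size now encodes a fourth variable
ggplot(data = mtcars,
       aes(x = hp, y = mpg, col = disp, size = qsec)) +
  geom_point()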
caret
caret, short for Classification And REgression Training, provides many functions for training and plotting classification and regression models. It is one of the most widely used packages among R developers and in machine learning competitions.
R
# Installing packages
install.packages("e1071")
install.packages("caTools")
install.packages("caret")

# Loading packages
library(e1071)
library(caTools)
library(caret)

# Loading data
data(iris)

# Splitting data into train
# and test data
set.seed(120) # Setting seed
split <- sample.split(iris$Species, SplitRatio = 0.7)
train_cl <- subset(iris, split == "TRUE")
test_cl <- subset(iris, split == "FALSE")

# Feature Scaling
# (scaled copies; naiveBayes below uses the raw split)
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4])

# Fitting Naive Bayes Model
# to training dataset
classifier_cl <- naiveBayes(Species ~ ., data = train_cl)
classifier_cl

# Predicting on test data
y_pred <- predict(classifier_cl, newdata = test_cl)

# Confusion Matrix
cm <- table(test_cl$Species, y_pred)
cm
Output:
- Model classifier_cl:
- Confusion Matrix:
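Note that the model above is fit with e1071's naiveBayes() directly; caret's own train() interface layers resampling on top of many such models. A minimal sketch, assuming the train_cl split from the code above, that fits a k-NN classifier with 10-fold cross-validation:
R
# caret's unified train() interface with 10-fold CV
set.seed(120)
fit <- train(Species ~ ., data = train_cl,
             method = "knn",
             trControl = trainControl(method = "cv", number = 10),
             preProcess = c("center", "scale"))
fit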
e1071
The e1071 package provides clustering algorithms, support vector machines (SVMs), shortest-path computations, bagged clustering, naive Bayes, and more. It is commonly used alongside the k-nearest neighbors (k-NN) algorithm (the knn() function itself comes from the class package), whose behavior depends on its k value (number of neighbors) and which finds applications in industries such as finance and healthcare. k-NN is a supervised, non-linear classification algorithm. It is also non-parametric, i.e., it makes no assumptions about the underlying data or its distribution.
R
# Installing packages
install.packages("e1071")
install.packages("caTools")
install.packages("class")

# Loading packages
library(e1071)
library(caTools)
library(class)

# Loading data
data(iris)

# Splitting data into train
# and test data
set.seed(120) # Setting seed
split <- sample.split(iris$Species, SplitRatio = 0.7)
train_cl <- subset(iris, split == "TRUE")
test_cl <- subset(iris, split == "FALSE")

# Feature Scaling
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4])

# Fitting KNN Model
# to training dataset
classifier_knn <- knn(train = train_scale,
                      test = test_scale,
                      cl = train_cl$Species,
                      k = 1)
classifier_knn

# Confusion Matrix
cm <- table(test_cl$Species, classifier_knn)
cm

# Model Evaluation - Choosing K
# Calculate out-of-sample error
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1 - misClassError))
Outputs:
- Model classifier_knn(k=1):
- Confusion Matrix:
- Model Evaluation(k=1):
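Because the model's performance depends on k, it is worth comparing a few values. A minimal sketch, assuming the scaled splits from the code above:
R
# Evaluate test accuracy for several candidate k values
for (k in c(1, 3, 5, 7, 15)) {
  pred <- knn(train = train_scale, test = test_scale,
              cl = train_cl$Species, k = k)
  acc <- mean(pred == test_cl$Species)
  print(paste('k =', k, 'Accuracy =', round(acc, 3)))
}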
XGBoost
XGBoost works only with numeric variables. It is a boosting technique, in which each new tree is fit to correct the mistakes of the trees built before it, so observations are re-weighted more intelligently at every round. XGBoost has interfaces for C++, R, Python, Julia, Java, and Scala, and it combines boosting with bagging-style row and column subsampling. The example below uses the BigMart sales dataset.
R
# Installing packages
install.packages("data.table")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("caret")
install.packages("xgboost")
install.packages("e1071")
install.packages("cowplot")

# Loading packages
library(data.table) # for reading and manipulation of data
library(dplyr)      # for data manipulation and joining
library(ggplot2)    # for plotting
library(caret)      # for modeling
library(xgboost)    # for building XGBoost model
library(e1071)      # for skewness
library(cowplot)    # for combining multiple plots

# Combining datasets
# (assumes the BigMart train and test sets have
# already been read into train and test, e.g. with fread())
# add Item_Outlet_Sales to test data
test[, Item_Outlet_Sales := NA]
combi = rbind(train, test)

# Missing Value Treatment
missing_index = which(is.na(combi$Item_Weight))
for(i in missing_index){
  item = combi$Item_Identifier[i]
  combi$Item_Weight[i] = mean(combi$Item_Weight[combi$Item_Identifier == item],
                              na.rm = T)
}

# Replacing 0 in Item_Visibility with mean
zero_index = which(combi$Item_Visibility == 0)
for(i in zero_index){
  item = combi$Item_Identifier[i]
  combi$Item_Visibility[i] = mean(combi$Item_Visibility[combi$Item_Identifier == item],
                                  na.rm = T)
}

# Derived feature assumed from its name:
# price per unit weight (needed for the skewness check below)
combi[, price_per_unit_wt := Item_MRP / Item_Weight]

# Label Encoding
# To convert categorical to numerical
combi[, Outlet_Size_num := ifelse(Outlet_Size == "Small", 0,
                           ifelse(Outlet_Size == "Medium", 1, 2))]
combi[, Outlet_Location_Type_num := ifelse(Outlet_Location_Type == "Tier 3", 0,
                                    ifelse(Outlet_Location_Type == "Tier 2", 1, 2))]
combi[, c("Outlet_Size", "Outlet_Location_Type") := NULL]

# One Hot Encoding
# To convert categorical to numerical
ohe_1 = dummyVars("~.",
                  data = combi[, -c("Item_Identifier",
                                    "Outlet_Establishment_Year",
                                    "Item_Type")],
                  fullRank = T)
ohe_df = data.table(predict(ohe_1,
                            combi[, -c("Item_Identifier",
                                       "Outlet_Establishment_Year",
                                       "Item_Type")]))
combi = cbind(combi[, "Item_Identifier"], ohe_df)

# Remove skewness
skewness(combi$Item_Visibility)
skewness(combi$price_per_unit_wt)

# log(x + 1) to avoid taking the log of zero
combi[, Item_Visibility := log(Item_Visibility + 1)]

# Scaling and Centering data
# index of numeric features
num_vars = which(sapply(combi, is.numeric))
num_vars_names = names(num_vars)
combi_numeric = combi[, setdiff(num_vars_names, "Item_Outlet_Sales"), with = F]
prep_num = preProcess(combi_numeric, method = c("center", "scale"))
combi_numeric_norm = predict(prep_num, combi_numeric)

# removing numeric independent variables
combi[, setdiff(num_vars_names, "Item_Outlet_Sales") := NULL]
combi = cbind(combi, combi_numeric_norm)

# Splitting data back to train and test
train = combi[1:nrow(train)]
test = combi[(nrow(train) + 1):nrow(combi)]

# Removing Item_Outlet_Sales
test[, Item_Outlet_Sales := NULL]

# Model Building: XGBoost
param_list = list(
  objective = "reg:linear", # "reg:squarederror" in newer xgboost versions
  eta = 0.01,
  gamma = 1,
  max_depth = 6,
  subsample = 0.8,
  colsample_bytree = 0.5
)

# Converting train and test into xgb.DMatrix format
Dtrain = xgb.DMatrix(
  data = as.matrix(train[, -c("Item_Identifier", "Item_Outlet_Sales")]),
  label = train$Item_Outlet_Sales)
Dtest = xgb.DMatrix(
  data = as.matrix(test[, -c("Item_Identifier")]))

# 5-fold cross-validation to
# find optimal value of nrounds
set.seed(112) # Setting seed
xgbcv = xgb.cv(params = param_list,
               data = Dtrain,
               nrounds = 1000,
               nfold = 5,
               print_every_n = 10,
               early_stopping_rounds = 30,
               maximize = F)

# Training XGBoost model at nrounds = 428
# (the best iteration found by xgb.cv above)
xgb_model = xgb.train(data = Dtrain,
                      params = param_list,
                      nrounds = 428)
xgb_model
Output:
- Training of Xgboost model:
- Model xgb_model:
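Once trained, the booster can score the test matrix and report which features contributed most. A minimal sketch, assuming the Dtest matrix and xgb_model from above:
R
# Predicting on the test set
pred <- predict(xgb_model, Dtest)

# Variable importance from the trained booster
var_imp <- xgb.importance(model = xgb_model)
xgb.plot.importance(var_imp)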
randomForest
Random forest in R is an ensemble of decision trees: it builds many trees and combines their predictions to obtain more accurate results than any single tree on its own. It is a non-linear classification algorithm. Each tree is grown on a bootstrap sample of the data, and the cases left out of a tree's sample are used to estimate its error; this is called the out-of-bag (OOB) error estimate and is reported as a percentage.
R
# Installing packages
# For sampling the dataset
install.packages("caTools")
# For implementing random forest algorithm
install.packages("randomForest")

# Loading packages
library(caTools)
library(randomForest)

# Loading data
data(iris)

# Splitting data in train and test data
set.seed(120) # Setting seed
split <- sample.split(iris$Species, SplitRatio = 0.7)
split
train <- subset(iris, split == "TRUE")
test <- subset(iris, split == "FALSE")

# Fitting Random Forest to the train dataset
classifier_RF = randomForest(x = train[-5],
                             y = train$Species,
                             ntree = 500)
classifier_RF

# Predicting the Test set results
y_pred = predict(classifier_RF, newdata = test[-5])

# Confusion Matrix
confusion_mtx = table(test[, 5], y_pred)
confusion_mtx
Outputs:
- Model classifier_RF:
- Confusion Matrix:
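The out-of-bag error mentioned above, along with variable importance, can be inspected directly from the fitted model. A minimal sketch, assuming the classifier_RF object from the code above:
R
# OOB error rate as the number of trees grows
plot(classifier_RF)

# Variable importance (mean decrease in Gini)
importance(classifier_RF)
varImpPlot(classifier_RF)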




