Basic Tutorial
In this tutorial, we will use the well-known flight dataset from the benchmark available on this GitHub page. It contains flight records from 2005 and 2006. The goal is to predict whether a flight will be delayed by more than 15 minutes.
Download Dataset
Navigate to the directory where the Silas executable is located; let’s call it bin for now. Create a new directory called tutorial1 in bin.
Download the training data (1 million flight records) and the testing data, and place them in the bin/tutorial1/data/ folder.
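As a quick sanity check before moving on, the folder layout described above can be created and verified with a few lines of Python. This is only a sketch: the file names train-1m.csv and test.csv are taken from the gen-all command used in the next step, and the helper name prepare_data_dir is our own, not part of Silas.

```python
from pathlib import Path

# File names as they appear in the gen-all command of this tutorial.
EXPECTED_FILES = ("train-1m.csv", "test.csv")


def prepare_data_dir(bin_dir="."):
    """Create <bin_dir>/tutorial1/data and list the expected files still missing.

    Hypothetical helper for illustration; it does not download anything.
    """
    data_dir = Path(bin_dir) / "tutorial1" / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]
```

Calling prepare_data_dir(".") from bin returns an empty list once both downloaded files are in place.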
Generate Configuration Files for Machine Learning
Go back to bin. Open a terminal from this directory, and run the following command:
silas gen-all -o tutorial1 tutorial1/data/train-1m.csv tutorial1/data/test.csv
This command automatically generates all the configuration files required for Silas machine learning, using default settings. It writes the configuration files to tutorial1 and sanitises the data files.
Note that the configuration generator automatically chooses the last feature (column in the dataset) as the outcome feature (target), which is fine for this example. For other examples, you may want to check the “outcome_feature” in settings.json and make sure that the outcome feature is not listed in “selected_features”.
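The check described above can also be scripted. The sketch below assumes only what this tutorial states: that settings.json contains an “outcome_feature” entry and a “selected_features” entry. Where these keys sit inside the file is not specified here, so the code searches the whole JSON structure for them. It also assumes that “selected_features” is a list of feature names. The function names are hypothetical helpers, not part of Silas.

```python
import json


def find_key(node, key):
    """Return the first value stored under `key` anywhere in nested JSON, or None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def outcome_is_excluded(settings):
    """True if an outcome feature is set and is not among the selected features."""
    outcome = find_key(settings, "outcome_feature")
    # Assumption: selected_features is a list of feature names.
    selected = find_key(settings, "selected_features") or []
    return outcome is not None and outcome not in selected


def check_settings_file(path="tutorial1/settings.json"):
    """Load the generated settings file and run the check on it."""
    with open(path, encoding="utf-8") as handle:
        return outcome_is_excluded(json.load(handle))
```

If check_settings_file() returns False, open settings.json and correct the two entries by hand before training.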
Run Machine Learning
To build a predictive model using Silas machine learning, run the command:
silas learn -o model/flights tutorial1/settings.json
This command will run machine learning using the parameters in tutorial1/settings.json, and store the predictive model in model/flights. It will also output some information about the performance of the predictive model against the testing dataset. With the default parameters, you will probably get an ROC-AUC above 0.75 (in other places it might be displayed on a percentage scale, as 75+).
You may want to tune the parameters of the learner to improve results. To do so, open tutorial1/settings.json with your preferred text editor and change the value of feature_proportion from “sqrt” to 0.8. Re-train a model using the above command once more: you should obtain an ROC-AUC of 0.76+.
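If you prefer to script this change rather than edit the file by hand, something like the following could work. It is a sketch under one assumption: the key is named feature_proportion, as written in this tutorial. Its position inside settings.json is not specified here, so the code updates every occurrence it finds. The helper names are our own. Note that rewriting the file with json.dump may change its indentation.

```python
import json


def set_key(node, key, value):
    """Set every occurrence of `key` in nested JSON; return the number of changes."""
    changed = 0
    if isinstance(node, dict):
        for name in node:
            if name == key:
                node[name] = value
                changed += 1
            else:
                changed += set_key(node[name], key, value)
    elif isinstance(node, list):
        for item in node:
            changed += set_key(item, key, value)
    return changed


def update_settings_file(path, key="feature_proportion", value=0.8):
    """Rewrite the settings file with `key` set to `value`, if the key is present."""
    with open(path, encoding="utf-8") as handle:
        settings = json.load(handle)
    changed = set_key(settings, key, value)
    if changed:
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(settings, handle, indent=2)
    return changed
```

For example, update_settings_file("tutorial1/settings.json") returns the number of entries it changed; a result of 0 means the key was not found and the file was left untouched.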
Use Machine Learning To Perform Prediction
Now that we have a predictive model stored in model/flights, we can use it to predict the outcome of new data samples. Since we only have two data files at hand, let’s just run prediction over the testing dataset:
silas predict -o tutorial1/predictions.csv model/flights/ tutorial1/data/clean-test.csv
This command will output the predictions in tutorial1/predictions.csv. Each row in this file corresponds to a row in test.csv. The first column gives the predicted outcome value, and the second column gives the probability of that value.
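To get a quick overview of the predictions, the file can be summarised with a short script. The sketch below relies only on the column layout described above (outcome first, probability second). Whether the file starts with a header line is not stated in this tutorial, so any row whose second column is not a number is skipped. The function names are hypothetical.

```python
import csv
from collections import Counter


def summarise_predictions(rows):
    """Count predicted outcomes and average their probabilities.

    `rows` is an iterable of CSV rows (lists of strings): the first column is
    the predicted outcome, the second is the probability of that outcome.
    """
    counts = Counter()
    total_probability = 0.0
    for row in rows:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        try:
            probability = float(row[1])
        except ValueError:
            continue  # assumed to be a header line
        counts[row[0]] += 1
        total_probability += probability
    n = sum(counts.values())
    mean_probability = total_probability / n if n else None
    return dict(counts), mean_probability


def summarise_file(path="tutorial1/predictions.csv"):
    """Summarise a predictions file written by the predict command."""
    with open(path, newline="", encoding="utf-8") as handle:
        return summarise_predictions(csv.reader(handle))
```

Running summarise_file() from bin returns a dictionary with the number of rows per predicted outcome, together with the mean probability reported for those predictions.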