Using the CS230 TensorFlow project as a reference, I put together my own base structure for machine learning projects.
Source code is here
1. Structure
ROOT_DIR
├── env.py
├── utils.py
├── logger.py
├── main.py
├── data
│   └── original
│       ├── sample_submission.csv
│       ├── test.csv
│       └── train.csv
├── models
│   ├── NN
│   │   ├── network.py
│   │   ├── processor.py
│   │   ├── train.py
│   │   └── tuning.py
│   └── RandomForestClassifier
│       ├── processor.py
│       ├── train.py
│       └── tuning.py
└── experiments
    ├── log.csv
    ├── NN
    │   └── hyperparameter_set1
    │       ├── learning_curve.png
    │       ├── log.csv
    │       ├── model
    │       │   ├── best_weight.ckpt
    │       │   │   ├── assets
    │       │   │   ├── saved_model.pb
    │       │   │   └── variables
    │       │   │       ├── variables.data-00000-of-00002
    │       │   │       ├── variables.data-00001-of-00002
    │       │   │       └── variables.index
    │       │   └── model.h5
    │       └── params.json
    └── RandomForestClassifier
        └── hyperparameter_set1
            ├── model.joblib
            └── params.json
2. Description
2.1. ‘ROOT_DIR’ directory
env.py
Python script where global (general) packages, constants, variables, and settings are declared. env.py also imports utils.py. Every other Python script imports env.py first, after adding the project root to the module search path:
import sys
sys.path.append(ROOT_DIR)
(This is a workaround and should be replaced with a cleaner approach.)
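One possible cleaner approach, as a minimal sketch: instead of hard-coding ROOT_DIR, walk upward from the script's own location until the directory containing env.py is found. The helper names find_root_dir and add_root_to_path and the marker-file idea are my own assumptions for illustration; they are not part of the original project.

```python
import sys
from pathlib import Path


def find_root_dir(start: Path, marker: str = "env.py") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    # Check `start` itself when it is a directory, then every ancestor.
    candidates = [start] if start.is_dir() else []
    candidates += list(start.parents)
    for directory in candidates:
        if (directory / marker).is_file():
            return directory
    raise FileNotFoundError(f"No directory containing {marker!r} above {start}")


def add_root_to_path(start: Path, marker: str = "env.py") -> Path:
    """Put the project root on sys.path (only once) and return it."""
    root = find_root_dir(start, marker)
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root
```

In models/NN/train.py this would be called as add_root_to_path(Path(__file__)) before importing env. Since the helper has to run before env.py is importable, it would need to sit at the top of each script; installing the project as an editable package with pip is another option that may remove the need for sys.path edits altogether.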
utils.py
Python script where general functions are declared.
logger.py
Python script where the Logger class resides. Logger records every training and validation loss together with the parameters used at that time.
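A minimal sketch of what such a Logger might look like, assuming it appends one row per training run to a CSV file. The column names and the record and best methods are assumptions made for illustration; the actual Logger in the repository may be organized differently.

```python
import csv
import json
from pathlib import Path


class Logger:
    """Append one row per training run to a CSV file (illustrative sketch)."""

    FIELDS = ["model", "params", "train_loss", "val_loss"]

    def __init__(self, path):
        self.path = Path(path)
        self.path.parent.mkdir(parents=True, exist_ok=True)
        # Write the header only once, when the file does not exist yet.
        if not self.path.exists():
            with self.path.open("w", newline="") as f:
                csv.DictWriter(f, fieldnames=self.FIELDS).writeheader()

    def record(self, model, params, train_loss, val_loss):
        """Store the losses together with the hyperparameters that produced them."""
        row = {
            "model": model,
            "params": json.dumps(params, sort_keys=True),
            "train_loss": train_loss,
            "val_loss": val_loss,
        }
        with self.path.open("a", newline="") as f:
            csv.DictWriter(f, fieldnames=self.FIELDS).writerow(row)

    def best(self):
        """Return the logged row with the lowest validation loss."""
        with self.path.open(newline="") as f:
            rows = list(csv.DictReader(f))
        return min(rows, key=lambda r: float(r["val_loss"]))
```

Keeping the hyperparameters as a JSON string in a single column means that models with different hyperparameters can share one log file.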
main.py
Python script where final evaluation takes place. Here, trained models are loaded and evaluated using test data.
2.2. ‘data’ directory
Directory that stores the training data, the test data, and any other data.
2.3. ‘models’ directory
Directory where the models to be trained are placed. Each model has train.py, tuning.py, and processor.py.
The neural network (NN) model additionally has network.py, which defines the network and other related functions.
models/MODEL/processor.py
Python script where the (pre)processing pipelines and functions for MODEL reside.
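To illustrate the idea of a per-model processing pipeline, here is a small sketch that chains preprocessing steps as plain functions. The helpers compose, fill_missing, and min_max_scale and the row-as-dict representation are assumptions for illustration; the real processor.py may well use pandas or scikit-learn pipelines instead.

```python
def compose(*steps):
    """Chain preprocessing steps; each step takes and returns a list of row dicts."""
    def run(rows):
        for step in steps:
            rows = step(rows)
        return rows
    return run


def fill_missing(column, value):
    """Step that replaces None in `column` with `value`."""
    def step(rows):
        return [
            {**r, column: value if r.get(column) is None else r[column]}
            for r in rows
        ]
    return step


def min_max_scale(column):
    """Step that rescales `column` to the range [0, 1]."""
    def step(rows):
        if not rows:
            return []
        values = [r[column] for r in rows]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1  # avoid division by zero for constant columns
        return [{**r, column: (r[column] - lo) / span} for r in rows]
    return step


# Each model can assemble its own pipeline from the shared steps.
pipeline = compose(fill_missing("age", 20), min_max_scale("age"))
```

Each step returns new rows rather than modifying its input, so the same raw data can be fed through different model-specific pipelines.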
models/MODEL/train.py
Python script where a model is trained with one specific hyperparameter set, passed in through argparse.
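As a rough sketch of how train.py might receive its hyperparameters and store them under experiments/MODEL/PARAM: the flag names (--name, --n_estimators, --max_depth) and the helper save_params are hypothetical and chosen only for illustration, and the model fitting itself is left out.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    """Parse hyperparameters from the command line (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Train one model with one hyperparameter set"
    )
    parser.add_argument("--name", default="hyperparameter_set1",
                        help="name of the PARAM directory")
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=5)
    return parser.parse_args(argv)


def save_params(args, experiments_dir, model_name):
    """Write the hyperparameters to experiments/MODEL/PARAM/params.json."""
    out_dir = Path(experiments_dir) / model_name / args.name
    out_dir.mkdir(parents=True, exist_ok=True)
    # Everything except the directory name counts as a hyperparameter.
    params = {k: v for k, v in vars(args).items() if k != "name"}
    (out_dir / "params.json").write_text(json.dumps(params, indent=2))
    return out_dir
```

Passing a list such as parse_args(["--max_depth", "8"]) makes the parser easy to call from tuning.py or from tests, while parse_args() with no argument reads the real command line.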
models/MODEL/tuning.py
Python script that searches for the best hyperparameters within a hyperparameter set.
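The search itself could look roughly like the following. This sketch assumes an exhaustive grid search and a user-supplied evaluate function that returns a validation loss; both are my assumptions, and the original tuning.py may use a different strategy such as random search.

```python
from itertools import product


def iter_param_sets(grid):
    """Yield every combination of the values in `grid` as a dict."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def search(grid, evaluate):
    """Return (best_params, best_loss); `evaluate` maps params to a validation loss."""
    best_params, best_loss = None, float("inf")
    for params in iter_param_sets(grid):
        loss = evaluate(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss
```

In practice evaluate would train the model with the given hyperparameters and return its validation loss, so the grid should stay small enough that training once per combination is affordable.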
models/NN/network.py
Python script where the deep learning models and the functions that generate Callbacks reside.
2.4. ‘experiments’ directory
Directory where training results are stored, with one subdirectory per MODEL.
experiments/log.csv
CSV file generated by Logger. It stores all training and validation losses and accuracies, along with the parameters used at that time. Model selection should be based on this log file.
experiments/MODEL/PARAM
Each MODEL directory contains many PARAM directories, each of which holds the training results for one hyperparameter set.
Each PARAM directory has a params.json file that stores its hyperparameter set.
For the NN model, the PARAM directory has log.csv, which stores the losses and accuracies per epoch, and a model directory where the weights (best_weight.ckpt) and the model (model.h5) reside.
For other models, the PARAM directory has the serialized model (model.joblib).
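Because every result lives at experiments/MODEL/PARAM, the stored hyperparameter sets can be collected with a simple glob. The function load_param_sets below is a hypothetical helper written for illustration, not something taken from the repository.

```python
import json
from pathlib import Path


def load_param_sets(experiments_dir):
    """Map 'MODEL/PARAM' to the hyperparameters stored in its params.json."""
    results = {}
    # The pattern matches exactly two directory levels: MODEL, then PARAM.
    for params_file in sorted(Path(experiments_dir).glob("*/*/params.json")):
        param_dir = params_file.parent
        key = f"{param_dir.parent.name}/{param_dir.name}"
        results[key] = json.loads(params_file.read_text())
    return results
```

Files directly under experiments, such as the top-level log.csv, are ignored because they do not match the two-level pattern.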
3. Usage
First, in the models/MODEL directory, set the base hyperparameters in train.py, then run:
$ python train.py  # with argparse
Next, search more thoroughly for the best hyperparameters by changing the hyperparameter set in tuning.py. While doing so, try out feature engineering steps tailored to that model in processor.py.
$ python tuning.py
Finally, evaluate the final model in ROOT_DIR/main.py. The final model may be an ensemble of several trained models.
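If the final model is an ensemble, one simple way to combine classifiers is majority voting over their predicted labels. The post does not say which ensembling method is used, so the function below is only a sketch of one common choice, with majority_vote being a name I made up.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    `predictions` holds one list of labels per model, all of the same
    length. On a tie, the label predicted by the earliest model wins.
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all models must predict the same number of samples")
    # zip(*predictions) groups the labels that all models gave to one sample.
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]
```

In main.py, each inner list would come from one loaded model's predictions on the test data, for example one from the NN and one from the RandomForestClassifier.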