Configuration file

The configuration file stores all relevant settings, such as paths to the datasets and Spark parameters. Edit it to match your environment before executing the code. It is written in YAML.

Spark parameters

Spark parameters are set under the spark heading. The syntax for specifying Spark parameters derives from Spark’s own property names: each dot-separated segment of the property name becomes one level of nesting. For example, to configure the parameter spark.app.name in the Cider config, we’d use

spark:
  app:
    name: "my_first_cider_app"

Here is a more complete example config. It’s not meant as an endorsement of these specific config values; optimal choices vary greatly based on your environment and use case.

spark:
  app:
    name: "my_first_cider_app"
  master: "local[*]"
  sql:
    shuffle:
      partitions: 144
  driver:
    memory: "8G"
    maxResultSize: "2G"
    supervise: true
  executor:
    memory: "8G"
  rpc:
    askTimeout: "600s"
  loglevel: "WARN"
  logConf: true
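
The mapping between the nested YAML keys and Spark’s dotted property names can be sketched as follows. This is a minimal illustration, not Cider’s actual loading code: the helper name flatten_spark_config is hypothetical, and the dict stands in for part of the parsed spark section.

```python
from typing import Any, Dict


def flatten_spark_config(section: Dict[str, Any], prefix: str = "spark") -> Dict[str, Any]:
    """Join nested keys with dots, e.g. {"app": {"name": ...}} -> "spark.app.name"."""
    flat: Dict[str, Any] = {}
    for key, value in section.items():
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            # Descend one level; the current dotted name becomes the new prefix.
            flat.update(flatten_spark_config(value, name))
        else:
            flat[name] = value
    return flat


# Stand-in for a subset of the parsed `spark` heading shown above.
spark_section = {
    "app": {"name": "my_first_cider_app"},
    "sql": {"shuffle": {"partitions": 144}},
    "driver": {"memory": "8G"},
}

for prop, value in flatten_spark_config(spark_section).items():
    print(prop, "=", value)
```

Each printed name (for instance spark.sql.shuffle.partitions) is the property name as it appears in Spark’s own documentation.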

File and folder locations

Under the path heading, we specify folder and file locations. File subpaths are given relative to a “parent” directory: either the input_data directory or the working directory. If you’d rather specify an absolute path for a file, use a leading slash. The locations of the parent directories themselves must be specified with absolute paths (with leading slashes).

Cider will not modify files under the input_data directory. It uses the working directory for program outputs, some of which may act as inputs for later steps. For example, the featurizer writes features to the working directory, and the ml module then reads those features back in (from that same directory, unless a different one is specified as input). At present, file names and subpaths written programmatically under the working directory are hard-coded and can’t be specified in config.

path:
  input_data: 
    directory_path: "/example/path/to/input/data/"
    file_paths:
      antennas: "antennas.csv"
      cdr: "cdr.csv"
      home_ground_truth: "home_locations.csv"
      labels: "labels.csv"
      mobiledata: "mobiledata.csv"
      mobilemoney: "mobilemoney.csv"
      population: "population_tgo_2019-07-01.tif"
      poverty_scores: null
      recharges: "recharges.csv"
      rwi: "relative_wealth_index.csv"
      shapefiles:
        regions: "regions.geojson"
        cantons: "cantons.geojson"
        prefectures: "prefectures.geojson"
      user_consent: null
  working: 
    directory_path: "/example/path/to/working/directory"
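
The resolution rule described above (subpaths relative to a parent directory, with a leading slash marking an absolute path) can be sketched like this. The function resolve_path and the second example path are hypothetical and only illustrate the rule; Cider’s own implementation may differ.

```python
from pathlib import PurePosixPath


def resolve_path(parent_directory: str, subpath: str) -> str:
    """Resolve a configured file subpath against its parent directory.

    Joining with PurePosixPath keeps a subpath that starts with a slash
    unchanged, which matches the absolute-path rule described above.
    """
    parent = PurePosixPath(parent_directory)
    if not parent.is_absolute():
        # Parent directories must be given with a leading slash.
        raise ValueError(f"parent directory must be absolute: {parent_directory}")
    return str(parent / subpath)


input_dir = "/example/path/to/input/data/"

# Relative subpath: resolved under the input_data directory.
print(resolve_path(input_dir, "cdr.csv"))

# Hypothetical absolute path: the parent directory is ignored.
print(resolve_path(input_dir, "/some/other/place/cdr.csv"))
```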

Column names

Cider expects certain columns to be present, and we can specify their names under the col_names heading (this is not a complete list):

col_names:
  cdr:
    txn_type: "txn_type"
    caller_id: "caller_id"
    recipient_id: "recipient_id"
    timestamp: "timestamp"
    duration: "duration"
    caller_antenna: "caller_antenna"
    recipient_antenna: "recipient_antenna"
    international: "international"
  antennas:
    antenna_id: "antenna_id"
    tower_id: "tower_id"
    latitude: "latitude"
    longitude: "longitude"
  recharges:
    caller_id: "caller_id"
    amount: "amount"
    timestamp: "timestamp"
  mobiledata:
    caller_id: "caller_id"
    volume: "volume"
    timestamp: "timestamp"
  mobilemoney:
    txn_type: "txn_type"
    caller_id: "caller_id"
    recipient_id: "recipient_id"
    timestamp: "timestamp"
    amount: "amount"
    sender_balance_before: "sender_balance_before"
    sender_balance_after: "sender_balance_after"
    recipient_balance_before: "recipient_balance_before"
    recipient_balance_after: "recipient_balance_after"
  geo: "tower_id"
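
As an illustration of how such a mapping might be used, the sketch below checks a file header against the configured names for one dataset. It assumes that each value under col_names is the column name as it appears in the data file; the helper missing_columns and the sample header are hypothetical.

```python
from typing import Dict, List


def missing_columns(expected: Dict[str, str], header: List[str]) -> List[str]:
    """Return the configured column names that are absent from a file header."""
    present = set(header)
    return [name for name in expected.values() if name not in present]


# Stand-in for col_names -> recharges from the config above.
recharges_cols = {
    "caller_id": "caller_id",
    "amount": "amount",
    "timestamp": "timestamp",
}

# Hypothetical header row of a recharges file that lacks the amount column.
header = ["caller_id", "timestamp", "value"]

print(missing_columns(recharges_cols, header))  # ['amount']
```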

Miscellaneous parameters

Under the params heading, we can specify miscellaneous parameters that affect Cider’s behavior:

params:
  cdr:
    weekend: [1, 7] # definition of weekend (Sun = 1 to Sat = 7)
    start_of_day: 7 # hour when day starts (used to define day/night)
    end_of_day: 19 # hour when night starts (used to define day/night)
  home_location:
    filter_hours: null # hours to filter out when inferring home locations
  automl: # params used by the autoML libraries
    autosklearn:
      time_left: 3600
      n_jobs: 1
      memory_limit: 3072
    autogluon:
      time_limit: 3600
      eval_metric: "r2"
      label: "label"
      sample_weight: "weight"
  opt_in_default: false # if true, opt-in is set as default, i.e. all users give their consent unless they opt out
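
To make the cdr parameters concrete, here is a minimal sketch of how a timestamp could be classified as weekend/weekday and day/night from these values. It is not Cider’s implementation. The function names are hypothetical, and treating the daytime interval as start_of_day inclusive and end_of_day exclusive is an assumption.

```python
from datetime import datetime
from typing import Any, Dict

# Stand-in for params -> cdr from the config above.
cdr_params = {"weekend": [1, 7], "start_of_day": 7, "end_of_day": 19}


def day_of_week_sunday_first(ts: datetime) -> int:
    """Map a timestamp to 1..7 with Sun = 1 and Sat = 7."""
    # isoweekday() gives Mon = 1 .. Sun = 7, so shift Sunday to the front.
    return ts.isoweekday() % 7 + 1


def is_weekend(ts: datetime, params: Dict[str, Any]) -> bool:
    return day_of_week_sunday_first(ts) in params["weekend"]


def is_daytime(ts: datetime, params: Dict[str, Any]) -> bool:
    # Assumption: daytime is the half-open interval [start_of_day, end_of_day).
    return params["start_of_day"] <= ts.hour < params["end_of_day"]


sunday_evening = datetime(2023, 1, 1, 20, 30)   # 1 January 2023 was a Sunday
wednesday_morning = datetime(2023, 1, 4, 9, 0)

print(is_weekend(sunday_evening, cdr_params), is_daytime(sunday_evening, cdr_params))
print(is_weekend(wednesday_morning, cdr_params), is_daytime(wednesday_morning, cdr_params))
```

With the values above, the Sunday evening call counts as weekend and night, and the Wednesday morning call as weekday and day.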

ML tuning parameters

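Under the hyperparams heading, we set the hyperparameter values that are tested during the grid search performed by the ml module. Each model name maps to a set of parameters, and each parameter maps to the list of candidate values to try: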

hyperparams:
  "linear":
    "dropmissing__threshold": [0.9, 1]
    "droplowvariance__threshold": [ 0, 0.01 ]
    "winsorizer__limits": [!!python/tuple [0., 1.], !!python/tuple [0.005, .995]]
  "lasso":
    "dropmissing__threshold": [ 0.9, 1 ]
    "droplowvariance__threshold": [ 0, 0.01 ]
    "winsorizer__limits": [!!python/tuple [0., 1.], !!python/tuple [0.005, .995]]
    "model__alpha": [ .001, .01, .05, .03, .1 ]
  "ridge":
    "dropmissing__threshold": [ 0.9, 1 ]
    "droplowvariance__threshold": [ 0, 0.01 ]
    "winsorizer__limits": [!!python/tuple [0., 1.], !!python/tuple [0.005, .995]]
    "model__alpha": [ .001, .01, .05, .03, .1 ]
  "randomforest":
    "dropmissing__threshold": [ 0.9, 1 ]
    "droplowvariance__threshold": [ 0, 0.01 ]
    "winsorizer__limits": [!!python/tuple [0., 1.], !!python/tuple [0.005, .995]]
    "model__max_depth": [ 2, 4, 6, 8, 10 ]
    "model__n_estimators": [ 50, 100, 200 ]
  "gradientboosting":
    "dropmissing__threshold": [ 0.99 ]
    "droplowvariance__threshold": [ 0.01 ]
    "winsorizer__limits": [!!python/tuple [0., 1.], !!python/tuple [0.005, .995]]
    "model__min_data_in_leaf": [ 10, 20, 50 ]
    "model__num_leaves": [ 5, 10, 20 ]
    "model__learning_rate": [ 0.05, 0.075, 0.1 ]
    "model__n_estimators": [ 50, 100, 200 ]
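
Because a grid search tries every combination of the listed values, the number of candidates grows multiplicatively with each parameter. The sketch below enumerates the combinations for the randomforest grid above. The helper expand_grid is hypothetical, and Python tuples stand in for the !!python/tuple entries. The step__parameter key style resembles scikit-learn’s pipeline naming convention, but how Cider consumes these keys is not shown here.

```python
from itertools import product
from typing import Any, Dict, Iterator, List


def expand_grid(grid: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per combination of candidate values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


# Stand-in for hyperparams -> randomforest from the config above.
randomforest_grid = {
    "dropmissing__threshold": [0.9, 1],
    "droplowvariance__threshold": [0, 0.01],
    "winsorizer__limits": [(0.0, 1.0), (0.005, 0.995)],
    "model__max_depth": [2, 4, 6, 8, 10],
    "model__n_estimators": [50, 100, 200],
}

candidates = list(expand_grid(randomforest_grid))
print(len(candidates))  # 2 * 2 * 2 * 5 * 3 = 120
```

Adding one more value to any list multiplies the total, so it is worth keeping the lists short when each fit is expensive.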