Note: All code, visualizations, images, insights, techniques,
and methodological notes in this document and all other documents in
this portfolio ARE NOT open source, and MUST ALWAYS be cited.
Overview:
For decades, violence against civilians in the Eastern Democratic
Republic of Congo (DRC) has been commonplace. The conflict remains one
of the world’s most complex and enduring humanitarian crises, with a
devastating impact on civilians. The interplay of local grievances,
national politics, and international support for armed groups in the
Eastern DRC makes it a challenging scenario for conflict resolution and
peace-building efforts. The conflict is deeply rooted in a mix of
historical, political, ethnic, and economic factors, and it has been
fueled by competition among a variety of armed militias over the Eastern
DRC's land and rich natural resources, including gold, diamonds, and
cobalt.
Project Goals:
The complex dynamics of the Eastern DRC conflict naturally make
forecasting violence against civilians a difficult task. This portfolio
project takes on this challenge. While many analysts would simply refer
to prior locations of violence on a Kernel Density Estimation (KDE) heat
map as indicative of where violence will likely occur in the future, I
take a machine learning approach: predicting violence even in areas
where violence may not have occurred yet. Geospatial machine learning
models combine the underlying spatial structure of the map, the
locations of past events (like civilian killings), and exposure to risk
factors to predict the latent risk of a harmful event occurring at a
given location, even if no such event has occurred there previously.
With spatial cross-validation, such models can often generalize well
across a wide variety of communities and locations, irrespective of the
unique characteristics of any single community or location.
Note: Keep in mind that the number of attacks on civilians by armed
actors - especially since I will be dividing the map into fishnet grid
cells with an area of only 10 square mi. - will be highly skewed in its
distribution. Most grid cell observations will have 0 attacks, and only
a small portion will have 1 or more attacks in the training and test
sets. For this reason, I will use dynamically weighted metrics for
hyperparameter tuning.
What are Dynamically Weighted Metrics?:
A standard metric (like RMSE, MAE, etc.) treats all errors equally
across the dataset. Each error contributes equally to the overall loss,
regardless of the true value or the importance of specific regions in
the data. E.g., if your data contains both high and low target values, a
standard metric would penalize errors the same way for both types,
without adjusting for the significance or rarity of certain events (such
as non-zero grid cells).
In contrast, a dynamically weighted metric adjusts the contribution
of each error based on the true values or other characteristics of the
data. In a dynamically weighted setup:
Errors in regions with more important or rare target values (such as
grid cells with a value greater than 0) are given more weight, meaning
the model is penalized more heavily for mistakes in these areas.
Errors in regions with more common or less important target values
(such as grid cells with a value of 0) are given less weight, so the
model is not overly focused on minimizing these errors at the expense of
others.
This dynamic weighting ensures that the model focuses on reducing
errors in the most critical parts of the dataset, such as rare events or
regions with higher target values, which might otherwise be overlooked
by a standard metric.
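To make this concrete, here is a minimal sketch in R (the language used elsewhere in this project) of what dynamically weighted versions of RMSE and MAE could look like. The weighting scheme shown - weight = 1 + the true cell count, so that non-zero cells contribute more - is an illustrative assumption, not necessarily the project's exact formula.

# Sketch of dynamically weighted metrics (assumed weighting: w = 1 + truth,
# so zero-count cells get weight 1 and cells with attacks get more).
dynamic_weights <- function(truth) {
  1 + truth
}

dynamic_rmse <- function(truth, response) {
  w <- dynamic_weights(truth)
  sqrt(sum(w * (truth - response)^2) / sum(w))   # weighted mean of squared errors
}

dynamic_mae <- function(truth, response) {
  w <- dynamic_weights(truth)
  sum(w * abs(truth - response)) / sum(w)        # weighted mean of absolute errors
}

# Example: the error in the single non-zero cell dominates the weighted scores.
truth    <- c(0, 0, 0, 0, 3)
response <- c(0.2, 0.1, 0.0, 0.1, 1.0)
dynamic_rmse(truth, response)
dynamic_mae(truth, response)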
Multi-Objective Hyperparameter Tuning Rather Than
Single-Objective Tuning:
I also will be using multi-objective tuning rather than standard
single-objective tuning for this project. Using multi-objective tuning
with dynamically weighted metrics offers several advantages over
single-objective tuning, especially in the context of highly skewed
geospatial data.
1) Balancing Different Types of Errors:
With single-objective tuning, the model may focus excessively on
optimizing one type of error at the expense of other important error
characteristics. For example, RMSE penalizes larger errors due to the
squaring of residuals and so emphasizes reducing significant
mispredictions, while MAE averages the magnitude of errors evenly
across all predictions, providing a balanced view of typical errors but
potentially underemphasizing the impact of larger deviations.
Multi-objective tuning, on the other hand, allows the model to
strike a balance between different types of errors. E.g., dynamically
weighted RMSE captures large errors by emphasizing significant
mispredictions in grid cells where rare, high target values occur, while
dynamically weighted MAE focuses on the average magnitude of errors
across all grid cells but gives additional weight to the few areas with
high target values. This ensures that both large and typical errors are
prioritized according to the significance of these rare events, despite
the overall skew towards zero in the dataset.
2) Mitigating the Impact of Imbalanced Data:
With highly imbalanced data, focusing solely on one metric (like
dynamically weighted RMSE) could cause the model to overly prioritize
reducing large errors in cells with high values, potentially at the
expense of capturing smaller but more frequent errors in the zero-valued
cells.
Multi-objective tuning helps balance predictions across this highly
skewed count data by incorporating multiple metrics. For example, a
metric like dynamically weighted MASE can help the model improve
accuracy in the few grid cells with higher counts by adjusting for
varying scales of error, while dynamically weighted Huber loss enhances
robustness by addressing both smaller errors in the many zero-count
cells and larger outliers in cells with higher counts. This approach
ensures that the model remains sensitive to rare but significant counts
without neglecting the majority of cells with zero values.
3) Improving Spatial Generalization:
My spatial data likely exhibits spatial autocorrelation, meaning
nearby locations are likely to have similar values. Multi-objective
tuning can ensure that the model generalizes well across various spatial
regions, not just fitting to one region or spatial pattern.
4) Addressing Heteroscedasticity:
My data likely exhibits heteroscedasticity, where the variance of
target values differs across regions (e.g., some regions might have more
variability in target values than others). Single-objective tuning may
not adequately capture this variability.
Multi-objective tuning using a combination of dynamic metrics can
better handle this by weighting regions differently. Dynamic Huber loss
helps address both small and large errors robustly across variable
regions, and dynamic MASE ensures errors are scaled appropriately,
adapting to local variability in the target variable.
The 5 Dynamically Weighted Metrics Used for Hyperparameter
Tuning:
1. Dynamic Quantile Loss:
Even when τ = 0.5, quantile loss can retain a subtle directional bias,
meaning it might penalize underpredictions and overpredictions slightly
differently, depending on how the residuals are split. This could result
in different regions of the data (where overpredictions or
underpredictions dominate) being prioritized differently. Although
dynamic quantile loss at τ = 0.5 and dynamic MAE are similar in nature,
creating the potential for some redundancy, including both metrics gives
the model the ability to address directional bias in error while also
keeping overall errors in check.
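For concreteness, a minimal sketch of a dynamically weighted quantile (pinball) loss, again under the illustrative assumption that the weights are 1 + the true cell count and that τ = 0.5:

# Sketch of dynamic quantile (pinball) loss; tau = 0.5 by default here.
dynamic_quantile_loss <- function(truth, response, tau = 0.5) {
  resid <- truth - response
  # Positive residuals (underpredictions) are scaled by tau,
  # negative residuals (overpredictions) by (tau - 1).
  pinball <- ifelse(resid >= 0, tau * resid, (tau - 1) * resid)
  w <- 1 + truth                    # assumed dynamic weighting by true value
  sum(w * pinball) / sum(w)
}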
2. Dynamic Huber Loss:
Huber Loss is a combination of Mean Squared Error (MSE) and Mean
Absolute Error (MAE), designed to handle both small and large errors in
a more balanced way. It is particularly useful for making a model robust
to outliers while still penalizing smaller errors effectively. The key
idea is that:
For small errors (within a defined threshold δ), it behaves like
MSE, squaring the residuals.
For larger errors (beyond δ), it behaves like MAE, penalizing the
errors linearly to prevent the loss from blowing up due to large
outliers. The dynamic aspect of Huber loss further weights errors based
on the target value, helping to focus on important regions.
In a multi-objective context, it helps - across different regions -
to balance the need for robustness against outliers with the need to
minimize small errors.
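A sketch of a dynamically weighted Huber loss under the same assumed weighting scheme, where delta (δ) is the threshold separating the quadratic and linear regimes:

# Sketch of dynamically weighted Huber loss.
dynamic_huber_loss <- function(truth, response, delta = 1) {
  resid <- truth - response
  huber <- ifelse(abs(resid) <= delta,
                  0.5 * resid^2,                        # quadratic for small errors
                  delta * (abs(resid) - 0.5 * delta))   # linear beyond delta
  w <- 1 + truth                                        # assumed dynamic weighting
  sum(w * huber) / sum(w)
}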
3. Dynamic Root Mean Squared Error (RMSE):
RMSE emphasizes larger errors more than MAE, making it suitable for
scenarios where large errors are especially problematic. In my spatial
context, this helps reduce large errors in regions with rare events
(target values > 0), which might be overlooked by simpler metrics
like MAE.
Dynamic RMSE in multi-objective tuning helps control for large
prediction errors in critical areas while balancing them with other
metrics that focus on average or scaled errors.
4. Dynamic Mean Absolute Error (MAE):
This metric is mentioned above. Also note that MAE provides a clear
measure of the average error and is less sensitive to outliers than
RMSE. It is particularly useful for understanding the typical prediction
error in geospatial data.
5. Dynamic Mean Absolute Scaled Error (MASE):
MASE scales the absolute error by the mean absolute error of a naïve
baseline forecast, which makes errors comparable across grid cells whose
target values differ in scale; the dynamic weighting again emphasizes
the rare non-zero cells.
Outlier Robustness:
- Most robust: Dynamic Huber Loss
- Moderately robust: Dynamic MAE, Dynamic MASE
- Less robust: Dynamic Quantile Loss (τ = 0.5), Dynamic RMSE
Calculating the Pareto Optimal Front With the Help of
Principal Component Analysis (PCA):
- What is the Pareto Optimal Front?
The Pareto optimal front (or Pareto frontier) represents a set of
solutions in multi-objective optimization where no individual solution
can be improved in one objective without worsening another. I.e., it is
the set of all “non-dominated” solutions, meaning that you cannot
improve one metric score without degrading at least one other metric
score.
- How Do I Calculate the Pareto Front?
In my code, the Pareto optimal front is calculated using the
fastNonDominatedSorting function from the nsga2R library. Here is a
summary of the steps:
- Normalization: The various metrics are normalized using
min-max scaling to bring them into a comparable range.
- Outlier Removal: Outliers are identified and removed
using the Interquartile Range (IQR) method to prevent them from
affecting the Pareto front calculation.
- Pareto Front Selection: The fastNonDominatedSorting
function is applied to the normalized metrics. This function sorts the
solutions into different fronts, identifying the set of non-dominated
solutions that form the Pareto front.
- Near-Optimal Solutions: Since the Pareto front typically
contains only a few solutions at most, computing PCA weights solely on
this small set - later used for ranking which models are best - might
not be very representative. Therefore, the code considers a specified
number (50) of "near-optimal" solutions, comprising both solutions that
lie on the Pareto front and those close to it. The Euclidean distance of
each solution to the Pareto front is calculated to identify these
near-optimal solutions. Including them helps make the PCA weights more
representative of the entire solution space, providing a better
understanding of the importance of each metric across a broader set of
potential hyperparameter settings.
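A condensed sketch of these steps in R. The toy tuning results, the object names, and the omission of the IQR-based outlier-removal step are all simplifications; the project's actual code may differ in its details.

library(nsga2R)

set.seed(1)
# Toy stand-in for the real tuning results: one row per hyperparameter
# configuration, one column per dynamically weighted metric (all minimized).
metric_df <- data.frame(dyn_rmse = runif(200), dyn_mae = runif(200),
                        dyn_quantile = runif(200), dyn_huber = runif(200),
                        dyn_mase = runif(200))

# Min-max normalization so the metrics share a comparable 0-1 range.
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
norm_metrics <- as.data.frame(lapply(metric_df, minmax))

# Non-dominated sorting: the first front returned is the Pareto optimal set.
fronts <- fastNonDominatedSorting(as.matrix(norm_metrics))
pareto_idx <- fronts[[1]]

# Euclidean distance of every solution to its nearest Pareto-front member,
# used to keep the 50 "near-optimal" solutions (front members have distance 0).
pareto_mat <- as.matrix(norm_metrics[pareto_idx, , drop = FALSE])
dist_to_front <- apply(as.matrix(norm_metrics), 1, function(row) {
  min(sqrt(colSums((t(pareto_mat) - row)^2)))
})
near_optimal_idx <- order(dist_to_front)[seq_len(min(50, nrow(norm_metrics)))]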
- Brief Explanation of PCA - Especially as it Relates to My
Code:
- What Role does PCA Play in My Code?
PCA is used to A) determine the level of importance of each metric
in the set of Pareto optimal and near-Pareto optimal solutions, and to
B) identify which among the multiple solutions (assuming there are more
than one) existing on the Pareto front contains the best hyperparameter
set:
1) PCA on Normalized Metrics: PCA is performed on the normalized
metrics to understand the variance captured by each metric.
2) Combining Principal Components: The code uses the first two
principal components (PC1 and PC2) because they capture almost all the
variance. The loadings (weights) of these components are combined to
calculate the importance of each metric.
3) Composite Score Calculation: These weights are then used to
compute a composite score for each Pareto optimal solution (but not the
near Pareto solutions), allowing the identification of the best
hyperparameter set based on this score.
4) Selection of Best Hyperparameters: The hyperparameter set of the
Pareto front solution with the lowest composite score is considered the
best, as it effectively balances the various metrics considered in the
multi-objective optimization.
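Continuing the sketch above (and using the objects defined there), the PCA weighting and composite scoring might look roughly like this:

# PCA on the normalized metrics of the near-optimal solutions.
pca <- prcomp(norm_metrics[near_optimal_idx, ], center = TRUE, scale. = TRUE)

# Combine the absolute loadings of PC1 and PC2, weighted by the proportion of
# variance each component explains, into one importance weight per metric.
var_share <- summary(pca)$importance["Proportion of Variance", 1:2]
metric_weights <- abs(pca$rotation[, 1:2]) %*% var_share
metric_weights <- metric_weights / sum(metric_weights)

# Composite score for the Pareto-front solutions only; the lowest score marks
# the hyperparameter set judged best across all of the metrics.
composite_score <- as.matrix(norm_metrics[pareto_idx, , drop = FALSE]) %*% metric_weights
best_config <- pareto_idx[which.min(composite_score)]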
Test Set Metrics:
For scoring performance on my test set, I use 29 additional metrics:
some are common, some are dynamically weighted, and some are asymmetric
(and dynamically weighted) in that they penalize underpredictions more
than overpredictions. In the context of attacks on civilians, it is
natural to assume that the costs of underpredicting violence are greater
than the costs of overpredicting it.
1. RMSE Family
- regr.rmse (Root Mean Square Error): Calculates the square
root of the average squared differences between predicted and actual
values. RMSE is sensitive to large errors, making it useful for
penalizing larger mistakes, but it may overemphasize outliers (e.g.,
high target values) in a skewed dataset.
- dynamic_rmse: This version adjusts RMSE by introducing
dynamic weights, which can be based on true values (e.g., grid cells
with higher target values get more weight). This makes it more suitable
for geospatial data with heterogeneous distributions, ensuring that
regions with more significant outcomes (e.g., non-zero target values)
have a larger impact on the error calculation.
- asymmetric_dynamic_rmse: Further adds asymmetry by
penalizing underestimations more than overestimations. For instance,
underestimating the likelihood of attacks (non-zero values) is penalized
more, reflecting real-world priorities in sensitive areas:
underpredicting violent attacks on civilians is more costly than
overpredicting them.
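As an illustration, an asymmetric, dynamically weighted RMSE could be sketched as below; the 1 + truth weighting and the extra 1.5 penalty factor for underpredictions are illustrative choices, not the project's actual values.

# Sketch: asymmetric dynamic RMSE - underpredictions (truth > response) get an
# extra multiplicative penalty on top of the dynamic (1 + truth) weighting.
asymmetric_dynamic_rmse <- function(truth, response, under_penalty = 1.5) {
  w <- (1 + truth) * ifelse(truth > response, under_penalty, 1)
  sqrt(sum(w * (truth - response)^2) / sum(w))
}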
2. MSE Family
- regr.mse (Mean Squared Error): A variant of RMSE without
the square root. It penalizes larger errors even more heavily since it
squares the error directly, which can be problematic in skewed data as
it amplifies the influence of outliers. Still, I use this as a standard
metric which can help spot especially problematic models.
- dynamic_mse: Dynamic MSE. Similar to Dynamic RMSE,
Dynamic MSE incorporates dynamic weights based on true values to ensure
that errors in more significant areas (non-zero target regions) are
emphasized.
- asymmetric_dynamic_mse: Asymmetric Dynamic MSE. Applies
both dynamic weights to errors and asymmetric penalties to
underpredictions, as before.
3. MAE Family
- regr.mae (Mean Absolute Error): Measures the average of
the absolute differences between predictions and actual values. It is
less sensitive to outliers compared to MSE but still treats all errors
equally - taking their absolute value - which may not be ideal for
geospatial data where target values are heavily skewed toward zero.
- dynamic_mae: This introduces dynamic weights, giving more
weight to areas with non-zero values (e.g., regions with recorded
attacks), making it more responsive to important prediction areas in
highly skewed and heterogeneous geospatial datasets.
- asymmetric_dynamic_mae: Modifies dynamic_mae by
penalizing underestimations more heavily than overestimations.
4. Quantile Loss Family
- quantile_loss: Quantile loss - as discussed above - can
capture specific percentiles (e.g., median with τ = 0.5) rather than
mean or absolute errors, offering a more flexible approach to error
measurement. It is especially useful when the error distribution is
skewed, like in my case, where most values are zero, but a small
percentage are significant. At τ = 0.5, minimizing it targets the
conditional median rather than the mean; positive residuals are
multiplied by 0.5 and negative residuals by -0.5, which splits the
residuals into positive and negative parts with equal weight. In
contrast, MAE treats both positive and negative residuals identically,
by taking their absolute value.
- dynamic_quantile_loss: Weights the quantile loss
dynamically based on true values, emphasizing the importance of higher
target areas.
- asymmetric_dynamic_quantile_loss: Applies dynamic
penalties, and also asymmetrically penalizes underpredictions.
5. MASE Family
- regr.mase (Mean Absolute Scaled Error): Scales the
absolute error by the mean absolute error of a naïve forecast. MASE can
address seasonality or spatial autocorrelation, making it useful in
geospatial data, especially when there is spatial autocorrelation within
target values (e.g., similar behavior across neighboring grid cell
locations).
- dynamic_mase: Dynamically weighted Mean Absolute Scaled
Error, introducing weights based on true values.
- asymmetric_dynamic_mase: Dynamically weights residuals,
and asymmetrically penalizes underpredictions.
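A sketch of a dynamically weighted MASE. Because the fishnet grid cells have no natural time ordering in this cross-sectional setup, the naïve baseline here is assumed to be "predict the mean of the true values"; the project's own scaling choice may differ.

# Sketch of dynamic MASE: weighted MAE scaled by the MAE of a naive baseline.
dynamic_mase <- function(truth, response) {
  w <- 1 + truth                                # assumed dynamic weighting
  naive_mae <- mean(abs(truth - mean(truth)))   # assumed naive baseline: the mean
  (sum(w * abs(truth - response)) / sum(w)) / naive_mae
}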
6. Huber Loss Family
- huber_loss: A hybrid between MAE and MSE, Huber loss
penalizes small errors similarly to MAE and larger errors like MSE, but
it is more robust to outliers. It is beneficial in my context because it
balances sensitivity to both small and large errors in skewed geospatial
data.
- dynamic_huber_loss: A dynamically weighted version of
Huber Loss.
- asymmetric_dynamic_huber_loss: A version of Huber Loss
that both dynamically weights residuals, and further penalizes
underestimations more heavily.
7. Logarithmic Median Absolute Error (MedAE)
Family
- dynamic_logarithmic_medae: Uses a logarithmic penalty
function that smooths out the influence of extreme errors. The median in
this context is important because it is more robust to outliers compared
to the mean. This is particularly useful in geospatial data where large
errors are less frequent but still important. The dynamic weighting
ensures that regions with higher target values (e.g., attacks) get more
emphasis while still controlling for the impact of outliers.
- asymmetric_dynamic_logarithmic_medae: Further adds
asymmetric penalization to underpredictions.
8. Square Root Median Absolute Error (MedAE)
Family
- dynamic_sqrt_medae: Uses a square-root penalty function
that dampens the influence of extreme errors. As with the logarithmic
variant, the median is more robust to outliers than the mean, which is
particularly useful in geospatial data where large errors are less
frequent but still important. The dynamic weighting ensures regions with
higher target values (e.g., attacks) get more emphasis while still
controlling for the impact of outliers.
- asymmetric_dynamic_sqrt_medae: Besides dynamic weighting,
this metric further adds asymmetric penalties to underpredictions. The
median ensures the primary focus remains on typical errors, ignoring
extreme outliers that could otherwise distort model performance in
highly skewed datasets.
9. Logistic Median Absolute Error (MedAE)
Family
- dynamic_logistic_medae: The logistic function penalizes
errors, offering a smoother transition between smaller and larger
errors. The median plays a crucial role in handling outliers, ensuring
that large deviations do not disproportionately influence the overall
error calculation. The dynamic weights again prioritize errors in
higher-value grid cell regions.
- asymmetric_dynamic_logistic_medae: This variant modifies
dynamic_logistic_medae by applying asymmetric penalties to
underpredictions.
10. Polynomial Median Absolute Error (MedAE)
Family
- dynamic_poly_2.3_and_1.3_medae: This metric applies a
polynomial penalty function that uses a 2/3 exponent for
underpredictions and a 1/3 exponent penalty for overpredictions. The 2/3
exponent for underpredictions results in a steeper penalty for cases
where the model underpredicts, meaning that the cost of missing critical
events (e.g., attacks) is more heavily weighted. The 1/3 exponent for
overpredictions leads to a gentler penalty, making the model more
tolerant of overestimates, which is useful when overpredicting is less
risky.
The use of the median again mitigates the influence of outliers. By
focusing on the typical errors and applying dynamic weights, this metric
ensures regions with non-zero target values (i.e., 1 or more attacks)
receive more attention, while preventing extreme values from overly
influencing the model evaluation.
- dynamic_poly_2.3_and_1.2_medae: Similar to the previous
version, this metric applies polynomial penalties but with slightly
different exponents: a 2/3 exponent for underpredictions and a 1/2
exponent for overpredictions. The 2/3 exponent for underpredictions
still imposes a steep penalty, ensuring that underestimations
(especially in critical areas) are penalized more. However, the 1/2
exponent (square root penalty) for overpredictions is slightly steeper
than the 1/3 exponent used in the previous version, meaning
overestimations are penalized more here than in the
dynamic_poly_2.3_and_1.3_medae metric above, though still less than
underestimations.
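A sketch of the first polynomial variant. Exactly how the project combines the median with the dynamic weights may differ; here the assumed 1 + truth weights are applied to the per-cell penalties before the median is taken.

# Sketch: polynomial MedAE with a 2/3 exponent for underpredictions and a
# 1/3 exponent for overpredictions, followed by a median over the grid cells.
dynamic_poly_2.3_and_1.3_medae <- function(truth, response) {
  resid <- truth - response
  penalty <- ifelse(resid > 0, abs(resid)^(2 / 3), abs(resid)^(1 / 3))
  w <- 1 + truth                  # assumed dynamic weighting
  median(w * penalty)             # median keeps extreme cells from dominating
}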
11. GMAE
- gmae (Geometric Mean Absolute Error): Instead of focusing
on additive errors like in MAE or RMSE, GMAE emphasizes multiplicative
errors. Additive errors occur when the difference between the predicted
and actual values is treated as a straightforward subtraction (i.e.,
|truth − prediction|). In contrast, multiplicative errors consider the
ratio or relative scale between predicted and actual values. This means
errors are treated in proportion to the size of the prediction or true
value, which can be especially important when the target values vary
widely in magnitude, as with my data.
In the context of data where the target value is often zero but
occasionally non-zero and potentially significant, multiplicative
errors become more relevant because they capture relative differences
better than additive errors. E.g., a small error in a low-target value
grid cell area (e.g., an error of 0.05 when predicting 0.1) can be
proportionally larger than the same error in a high-target value area
(e.g., an error of 0.05 when predicting 10). Multiplicative errors scale
the error based on the value of the target, so overpredictions or
underpredictions in critical areas (non-zero targets) will have a
greater proportional impact than they would using additive errors alone.
This approach is particularly useful for skewed data because GMAE helps
balance error contributions by minimizing the impact of both extremely
large and extremely small errors from both zero-value and non-zero-value
grid cell areas.
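A minimal sketch of GMAE; the small epsilon is an added assumption so that the geometric mean does not collapse to zero when some errors are exactly zero.

# Sketch of GMAE: the geometric mean of the absolute errors.
gmae <- function(truth, response, eps = 1e-6) {
  exp(mean(log(abs(truth - response) + eps)))
}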
12. Log Cosh Loss
- log_cosh_loss: A smooth, differentiable loss function
based on the hyperbolic cosine of the prediction error. It behaves
approximately like a squared-error (MSE-style) penalty for smaller
errors, remaining sensitive to small discrepancies between predictions
and true values, and approximately like an absolute-error (MAE-style)
penalty for larger ones, growing roughly linearly rather than
quadratically. As a result, for extreme errors the loss increases much
more slowly than it would under RMSE, which, while still registering the
presence of outliers, helps prevent them from dominating the error
metric and having a disproportionate impact.
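A one-line sketch of this loss (for very large residuals, cosh() can overflow, so a production version would typically use the numerically stable form |r| + log1p(exp(-2|r|)) - log(2)):

# Sketch of log-cosh loss.
log_cosh_loss <- function(truth, response) {
  mean(log(cosh(response - truth)))
}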
13. Others
- regr.medae (Median Absolute Error): Measures the median
of absolute errors, offering a more robust measure against outliers
compared to MAE. It is useful for datasets with extreme values, as it
focuses on the central tendency of errors rather than extreme
deviations.
- regr.medse (Median Squared Error): This calculates the
median of squared errors, emphasizing a balance between robustness (due
to the median) and sensitivity to larger errors (due to squaring).
- regr.msle (Mean Squared Logarithmic Error): This metric
calculates the mean squared difference between the logarithms of the
predicted and true values. It is useful for cases where the target
varies across orders of magnitude and penalizes underpredictions of
small values more heavily than overpredictions.
- regr.rmsle (Root Mean Squared Logarithmic Error): This is
the square root of the MSLE, making the scale of the errors easier to
interpret. It is especially effective when the target values are heavily
skewed.
- regr.rae (Relative Absolute Error): RAE compares the
absolute errors of the model’s predictions with the absolute errors of a
baseline model (such as predicting the mean of the true values). It
provides a relative measure of how well the model is performing in
comparison to this baseline. A value less than 1 indicates that the
model performs better than the baseline, while a value greater than 1
indicates worse performance. RAE focuses on absolute errors, so it gives
equal weight to all errors, regardless of their size.
- regr.rrse (Root Relative Squared Error): RRSE is similar
to RAE but uses squared errors instead of absolute errors. It compares
the squared errors of the model’s predictions with those of the baseline
model. By squaring the errors, RRSE penalizes larger errors more heavily
than smaller ones, making it more sensitive to outliers. Like RAE, a
value below 1 indicates that the model is performing better than the
baseline, but RRSE emphasizes the importance of larger errors in the
comparison.
Explanation of the Data:
The training set data I use spans from Jan. 19, 2018 to Jun. 30, 2023
(about 5.5 years). The test set data spans from Jul. 1, 2023 to
Dec. 31, 2023 (6 months). A future portfolio project will integrate
time-series data and take the added step of forecasting violence against
civilians in the Eastern DRC beyond the end of the test set period,
i.e., into the actual future.
The 33 features (i.e., explanatory variables) used in this project,
the datasets they come from, and hypotheses as to why including these
features in the model may be important are as follows:
Features:
- Mean nighttime light levels (per 10 square mile
fishnet grid cell) as a proxy for levels of economic development. These
light levels come from the Day/Night Band (DNB) at-sensor radiance layer
within the VIIRS/NPP Granular Gridded Day Night Band 500m Linear Lat.
Lon. Grid Night NRT dataset.
Hypothesis: Lower levels of economic development in a given
grid cell in the Eastern DRC increase the risk that that grid cell will
experience attacks on civilians by armed groups because:
1) Economic underdevelopment leads to a lack of essential resources,
infrastructure, and government presence and protection, creating a power
vacuum.
2) Armed groups exploit this vacuum to establish control and recruit
from impoverished populations. Economic hardship forces civilians to
depend on these groups, increasing their vulnerability to exploitation
and attacks.
3) These groups use violence to secure resources and maintain
authority.
- Mean population density, from the Gridded
Population of the World (GPW), v4 - UN WPP-Adjusted Population Density,
v4.11 (2020) dataset, by the Socioeconomic Data and Applications Center
(SEDAC) at Columbia University.
Hypothesis: Less densely populated civilian areas are more
likely to be targets for attack by armed groups than more densely
populated areas because armed groups (especially non-state armed groups)
tend to operate in more rural areas, and because these less densely
populated rural areas tend to be more isolated and far from state forces
that could come to their rescue.
- Mean elevation per fishnet grid cell.
Hypothesis: Non-State Armed Groups (NSAGs) often operate in
mountainous and high elevation regions where it is harder for state
forces to reach. Civilians living in these regions are often left
unprotected and vulnerable to violence from groups who would exploit
them.
- Mean forest height, from the Global Land Analysis
& Discovery Forest Height, 2020 dataset (pixel value: forest height
in meters) at the University of Maryland (See https://glad.umd.edu/dataset/GLCLUC2020)
Hypothesis: It is likely that NSAGs often operate in
forests where it is harder for state forces to reach. If so, it is quite
possible that civilians near where these groups operate may be subject
to violence from these groups. The taller the forest in a grid cell, the
more likely NSAGs are to seek shelter there, as taller forest cover can
better conceal their location.
- Mean travel time to the nearest city, from the
raster dataset “A suite of global accessibility indicators”, by Andy
Nelson et al., Sci Data 6, 266 (2019) (See https://www.nature.com/articles/s41597-019-0265-5). I
merged raster files in QGIS that provide travel times to cities of
population sizes 50,000 - 100,000; 100,000 - 200,000; and 200,000 -
500,000. I then clipped the merged raster file to the boundaries of
e_drc_adm3_map (which has the same boundaries of fishnet). The result is
the imported raster file. This raster file thus contains travel times
from any point on the map to the nearest city with a population between
50,000 and 500,000.
Hypothesis: Longer travel times from rural areas to cities
make it harder for civilians to escape violence when it occurs and for
state security forces to respond. Thus, armed groups may feel more
secure attacking civilians in such areas without fear of
repercussions. I require a population of at least 50,000 since such a
city - though relatively small - would be likely to have at least a
minimal state security force presence able to respond to nearby NSAG
violence.
- Minimum distance to the border with contiguous
countries. For every fishnet grid cell, I calculate the distance to the
border of each neighboring country and take the minimum of these
distances.
Hypothesis: Border areas of countries - especially those
with weak state authority in rural areas - are often especially
vulnerable to the absence of state security forces. Borders also provide
NSAGs protection from state security forces since NSAGs can simply cross
borders into neighboring countries' territories to escape retaliation for
violence. As a result, border areas are likely to be more vulnerable to
attacks on civilians as state security forces have difficulty securing
these areas.
- Total mines controlled by armed groups, from the
International Peace Information Service (IPIS) dataset “Artisanal Mining
Sites in Eastern DRC”. The dataset’s variables - “armed_group1”,
“armed_group2”, and “armed_group3” - provide the names of armed groups
that maintain a presence, if any, at any given mine. I will filter out
the categories of “None”, NA, and government forces, leaving us with
mines occupied by NSAGs.
- Weighted harmonic mean Inverse-Distance Weighted (IDW)
distance to mines controlled by armed non-state armed groups
(NSAGs) - using the distances to the 3 nearest neighboring grid cells
(i.e., k nearest neighbors = 3) containing mines controlled by armed
NSAGs, weighted by the number of such armed mines per grid cell (see the
short code sketch after this item).
Explanation:
IDW weights are made inversely proportional to the distance.
I.e.:
1) If the nearest mine-containing cells are closer or hold more mines,
the weighted harmonic mean distance will be lower, suggesting a higher
threat.
2) If the nearest mine-containing cells are further away or hold fewer
mines, the weighted harmonic mean distance will be higher, suggesting a
lower threat.
Hypothesis: The Weighted Harmonic Mean IDW distance to
mines controlled by NSAGs is a significant predictor of the spatial
distribution and frequency of attacks on civilians. Lower Weighted
Harmonic Mean IDW distance to NSAG-controlled mines is expected to
increase the likelihood and frequency of attacks. As this distance
decreases, civilians are more likely to interact with armed groups,
raising the risk of violence. The weighting applied through IDW reflects
the diminishing influence of more distant mines on violence levels, with
the greatest impact felt in areas nearest to armed groups’ activities.
This relationship is especially pronounced in regions where NSAGs use
violence as a means to exert control over both the civilian population
and economic resources.
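A base-R sketch of this weighted harmonic mean IDW calculation. The object names (centroids, mine_count), the Euclidean distances, and the toy example are illustrative stand-ins for the project's actual fishnet data.

# Sketch: weighted harmonic mean distance from every grid cell to its k = 3
# nearest cells containing at least one NSAG-controlled mine, with the mine
# counts in those cells used as the weights.
weighted_harmonic_idw <- function(centroids, mine_count, k = 3) {
  has_mine <- which(mine_count > 0)
  sapply(seq_len(nrow(centroids)), function(i) {
    d <- sqrt(rowSums((centroids[has_mine, , drop = FALSE] -
                       matrix(centroids[i, ], nrow = length(has_mine),
                              ncol = ncol(centroids), byrow = TRUE))^2))
    d[d == 0] <- .Machine$double.eps          # the cell itself contains mines
    nn <- order(d)[seq_len(min(k, length(d)))]
    w  <- mine_count[has_mine][nn]
    sum(w) / sum(w / d[nn])                   # weighted harmonic mean distance
  })
}

# Toy example: 5 cells on a line, with mines present in cells 1 and 5.
centroids  <- cbind(x = 1:5, y = rep(0, 5))
mine_count <- c(2, 0, 0, 0, 1)
weighted_harmonic_idw(centroids, mine_count)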
- Total number of mines (also from the IPIS
dataset), regardless of mine type (not necessarily documented to have
been controlled by armed groups) within each fishnet grid cell.
Hypothesis: More mines in any given location will attract
armed groups and the potential for violence against civilians,
irrespective of whether such mines were found to have armed groups
present at the exact time when the creators of the IPIS dataset happened
to visit the mines. It is possible that this variable will more
accurately reflect the potential for armed groups being present.
- Weighted harmonic mean IDW distance to the nearest
mines, regardless of the type of mine.
- Total number of 3T mines (Tin, Tantalum,
Tungsten). This feature also comes from the IPIS dataset:
Explanation:
Economic Importance: 3T minerals are crucial for electronics and
industrial applications globally.
Scale of Operations: Typically, 3T mining operations are smaller and
more dispersed than gold mining. They are often mined by artisanal and
small-scale miners.
Hypothesis:
A higher total number of 3T mines in a given location, relative to the
number of gold mines, might increase the likelihood of non-state armed
groups attacking civilians because:
A higher number of 3T mines could increase the number of sites armed
groups need to control to secure revenue, leading to widespread violence
as they establish dominance. In contrast, gold mines, being fewer but
more valuable, might draw concentrated efforts by armed groups,
resulting in intense but localized conflict.
- Weighted harmonic mean IDW distance to the nearest 3T
mines.
- Total number of gold mines in a given grid cell
(from the IPIS dataset).
- Weighted harmonic mean IDW distance to the nearest gold
mines.
- Total number of active non-state armed groups per grid
cell, from the Armed Conflict Location and Event Data (ACLED)
dataset.
- Weighted harmonic mean IDW distance to each grid cell’s k
= 3 nearest neighbors who have at least 1 active NSAG present, weighted
by the total number of NSAGs present.
- Total number of territorial seizures, either by
NSAGs or by government forces, from the ACLED dataset.
Hypothesis:
- Weighted harmonic mean IDW distance to the nearest
territorial seizures, either by NSAGs or by government forces.
This harmonic mean IDW distance is measured from each grid cell to its 3
nearest neighboring cells in which at least 1 territorial seizure
occurred, and is weighted by the number of territorial seizures within
those 3 grid cells.
- Total armed clashes in a given grid cell, from
the ACLED dataset.
Hypothesis: Armed clashes between militant groups, or between
militant groups and government forces, are likely an indicator of areas
susceptible to violence against civilians as well, either because armed
groups may seek revenge against civilians deemed loyal to their
opponents, or because one side (actor A) may attack opponents (actor B)
occupying areas where actor A believes (potentially correctly) that
civilians loyal to actor A who live there are in danger of attack by
actor B.
- Weighted harmonic mean IDW distance from each grid cell to
the 3 nearest neighboring grid cells in which at least 1 armed clash
occurred. The weights consist of the number of armed clashes in
each of these 3 grid cells.
- Number of “direct strikes” per grid cell, defined
as a combination of the following sub event types from ACLED’s
“Explosions and remote violence” event type:
- Shelling/artillery/missile attack
- Grenade
- Weighted harmonic mean IDW distance to 3 nearest grid
cells having direct strikes.
- Total number of ethnic groups in a single grid
cell, measured from the Spatially Interpolated Data on
Ethnicity (SIDE) dataset (2018). See https://icr.ethz.ch/data/side/. I specifically seek to
know the number of ethnic groups in any given grid cell which constitute
less than approximately 50 percent of the population. I.e., I seek to
know how many ethnic minority groups exist in any given location on the
map.
Hypothesis: The greater the number of ethnic groups
residing in any single location, the greater are the chances for ethnic
tensions and attacks on civilians by NSAGs belonging to rival ethnic
groups.
- Total number of ACLED events involving foreign
troops (other than UN forces) by grid cell location. I do not
restrict my analysis here to ACLED events involving violence.
Note: I consider foreign forces that are part of the East
African Community Regional Force deployed to the DRC to be included in
this analysis of foreign forces since these are not part of the UN
peacekeeping mission in the DRC (MONUSCO), and thus may not be as
restrained from using deadly force against civilians or acting in ways
towards NSAGs that may in turn result in NSAGs attacking civilians.
Hypothesis:
The total number of ACLED events involving foreign troops (other
than UN forces) in a given grid cell increases the likelihood of attacks
on civilians in that grid cell because:
1) Increased Tensions: The presence and actions of foreign troops,
especially those not part of the UN peacekeeping mission, may heighten
tensions with non-state armed groups (NSAGs), leading to retaliatory
attacks on civilians perceived as collaborators or sympathizers.
2) Less Restrained Use of Force: Foreign forces that are not part of
MONUSCO may be less restrained in their operations, potentially
resulting in civilian casualties, which could provoke NSAGs to retaliate
by targeting civilians.
3) Destabilization: The involvement of foreign troops might disrupt
local power dynamics and create instability, which NSAGs could exploit
by attacking civilians to assert control or punish perceived cooperation
with foreign forces.
4) Perceived Threat: NSAGs may view foreign troops as a significant
threat, leading them to preemptively attack civilians to weaken local
support for these forces or to send a message of resistance.
- Weighted harmonic mean IDW distance from each grid cell to
the nearest location where foreign troops have been present at
any point in time, measured in ACLED. I do not limit my analysis here to
ACLED events involving violence. I find the k=1 nearest neighbor rather
than k=3 because there are very few observations in the dataset where
foreign forces were recorded as present.
Hypothesis: The Weighted Harmonic Mean IDW distance from
each grid cell to the nearest location where foreign troops (excluding
UN forces) have been present will be inversely related to the likelihood
of attacks on civilians. Shorter distances (lower IDW values) to grid
cells with foreign troop presence will indicate a higher risk of
violence, as the presence of these forces may escalate tensions with
non-state armed groups (NSAGs).
NSAGs may retaliate against civilians perceived to be collaborating
with foreign forces or may preemptively attack civilians to weaken local
support for these troops. Longer distances (higher IDW values) suggest a
lower threat, as the influence of foreign troop presence diminishes with
distance.
- Total number of ACLED events involving MONUSCO
(the UN peacekeeping mission in the DRC) by grid cell location.
Hypothesis: More events involving the presence of MONUSCO
will lead to fewer attacks on civilians, for the following reasons:
1) Deterrence Effect: A higher number of MONUSCO-related events in a
grid cell may indicate a stronger presence of UN peacekeeping forces,
which could deter non-state armed groups (NSAGs) from attacking
civilians due to the perceived or actual risk of intervention by
MONUSCO.
2) Increased Security: The presence of MONUSCO forces may improve
overall security and stability in the area, leading to a reduction in
the frequency or severity of attacks on civilians as NSAGs find it more
difficult to operate.
- Weighted harmonic mean IDW distance from each grid cell to
the nearest 3 locations where events involving MONUSCO have
occurred, as measured by the ACLED dataset.
- Total number of violent events in any given grid cell
involving state troops or police, from the ACLED dataset.
Hypothesis:
Higher numbers of events involving national troops or police in a
given grid cell could theoretically increase or decrease the total
number of attacks on civilians depending on the context of these
interactions:
1) Increased Attacks: If state forces are abusive toward civilians, or
if their operations provoke clashes with NSAGs, civilians in the
surrounding area may face retaliatory violence or be caught in the
crossfire.
2) Decreased Attacks: Conversely, if national troops or police are
actively engaged in protecting civilians and maintaining security, their
presence might deter NSAGs from attacking civilians. Effective and
disciplined state forces could reduce the opportunities for armed groups
to commit violence by establishing greater control and providing
security in vulnerable areas.
Given these possible dynamics, the model can learn whether the
presence of national troops or police tends to correlate with increased
or decreased violence in different contexts.
- Weighted harmonic mean IDW distance from each grid cell to
the nearest location where state military or police forces have been
active in violent events.
- Distance from each grid cell to the nearest refugee or
internally displaced persons (IDP) camp.
Hypothesis: Closer distances to refugee and IDP camps could
theoretically either increase the likelihood of attacks on civilians in
a given location, or decrease the likelihood of such attacks.
1) Why close distances to camps may increase attacks:
Refugees and IDPs may be seen as targets by armed groups - either
for recruitment - or for retaliation on vulnerable civilians seen to be
connected with rival groups. As a result, it is possible that attacks
may occur near where refugees and IDPs are living.
2) Why close distances to camps may decrease attacks:
In contrast, it is also possible that refugee and IDP camps are
intentionally built in areas seen as safer in terms of distance from
where non-state armed groups are believed to be located.
- Distance from each grid cell to the nearest major road.
Hypothesis:
Longer distances to major roads may make civilians more susceptible
to NSAG presence and to violent attacks, since NSAGs are more likely to
hide in areas far from major roads that are easily accessible to state
forces, and because state forces will have more difficulty responding
rapidly to violence against civilians should NSAGs attack them, a
consideration that factors into NSAGs' decisions on whether to attack
civilians.
- Locations with Moran’s I p-values that are
hyper-statistically significant at ≤ 0.0000001.
Explanation:
Including a feature that captures whether a given grid cell contains
local Moran’s I values with hyper-significant p-values (≤ 0.0000001)
when the target variable is the number of attacks on civilians by armed
groups can serve to identify areas with extreme and statistically rare
spatial clustering patterns. This feature is particularly useful in
highlighting grid cells where the spatial distribution of attacks is
highly unusual, indicating unique underlying factors that could either
significantly increase or decrease the likelihood of attacks.
Benefits:
1) Identification of Anomalies: The feature can help the model
detect not just hotspots but also areas with unusually low clustering of
civilian attacks. High local Moran’s I values with such an extreme
p-value threshold suggest rare and distinct spatial patterns, which can
be critical in understanding both high-risk and low-risk areas.
2) Enhanced Spatial Dependency Analysis: By focusing on
hyper-significant p-values, the model can better account for extreme
spatial autocorrelation, recognizing grid cells where the spatial
relationship between attacks is exceptionally strong or weak, which may
reveal non-obvious insights about the spread or containment of
violence.
3) Refined Risk Assessment: This feature can serve as an indicator
of areas with either exceptionally high or low risks of attacks, aiding
in more precise early warning systems and targeted interventions. It can
help identify not just emerging high-risk areas but also potential safe
zones that might otherwise be overlooked.
Hypothesis:
The inclusion of local Moran’s I values with hyper-significant
p-values as a feature may reveal that grid cells with extremely rare
spatial clustering patterns (either high or low) are more likely to
experience deviations in the typical number of attacks on civilians by
armed groups. Specifically, grid cells with high clustering may be more
susceptible to attacks due to concentrated risk factors, while cells
with unusually low clustering might indicate areas of lower-than-usual
risks, possibly due to strong local control, effective protection
measures, or other stabilizing influences.
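As a rough sketch of how such a hyper-significance flag could be computed with the spdep package - the object names and the toy 5 x 5 grid below are illustrative stand-ins, and the project's own implementation may differ:

library(sf)
library(spdep)

set.seed(42)
# Toy stand-in for the real fishnet: a 5 x 5 polygon grid with Poisson counts.
fishnet_sf <- st_sf(
  attacks  = rpois(25, 0.5),
  geometry = st_make_grid(st_as_sfc("POLYGON((0 0, 5 0, 5 5, 0 5, 0 0))"),
                          n = c(5, 5))
)

# Queen-contiguity neighbours and row-standardized spatial weights.
nb <- poly2nb(fishnet_sf, queen = TRUE)
lw <- nb2listw(nb, style = "W")

# Local Moran's I; the fifth column of the result holds the p-values.
lm_res <- localmoran(fishnet_sf$attacks, lw)
fishnet_sf$hyper_sig <- as.integer(lm_res[, 5] <= 0.0000001)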
- Distance from each grid cell to the nearest
hyper-statistically significant grid cell.
Explanation:
This feature captures the proximity of a grid cell to areas with
extreme spatial clustering of attacks, which might influence the
likelihood of violence spreading or being contained within the grid cell
in question.
Benefits:
1) Spatial Influence Analysis: This feature helps the model assess
how close a grid cell is to a potential source of heightened risk.
Proximity to a grid cell with hyper-significant clustering could
increase the likelihood of attacks due to the spillover effect or the
diffusion of violence from one area to another.
2) Risk Propagation Insight: The feature allows the model to capture
the potential for violence to spread across neighboring areas. A shorter
distance might indicate a higher risk of attack due to spatial
contagion, while a longer distance could suggest a buffer zone that
mitigates the spread of violence.
3) Strategic Intervention Targeting: Identifying grid cells close to
areas of extreme clustering can guide strategic interventions, focusing
resources on cells that are at higher risk due to their proximity to
violent hotspots.
Hypothesis:
The inclusion of a feature measuring the distance to the nearest
grid cell with a hyper-statistically significant Moran’s I p-value will
reveal that shorter distances increase the likelihood of attacks on
civilians. This is because proximity to areas of extreme spatial
clustering likely exposes the grid cell to spillover effects, where
violence in one area influences and amplifies the risk of violence in
neighboring cells. Conversely, greater distances might indicate lower
risk due to reduced spatial influence from violent hotspots.