Note: All code, visualizations, images, insights, techniques, and methodological notes in this document and all other documents in this portfolio ARE NOT open source, and MUST ALWAYS be cited.






Overview:


For decades, violence against civilians in the Eastern Democratic Republic of Congo (DRC) has been commonplace. The conflict remains one of the world’s most complex and enduring humanitarian crises, with a devastating impact on civilians. The interplay of local grievances, national politics, and international support for armed groups in the Eastern DRC makes it a challenging scenario for conflict resolution and peace-building efforts. The conflict is deeply rooted in a mix of historical, political, ethnic, and economic factors, and it has been fueled by competition among a variety of armed militias over the Eastern DRC’s land and rich natural resources - including gold, diamonds, and cobalt.


As of the latest reports, there are approximately 120 armed groups active in the Eastern DRC’s Ituri, North Kivu, South Kivu, and Tanganyika provinces. These groups include fighters from neighboring countries such as Rwanda, Uganda, and Burundi. Many commanders of these groups have been implicated in war crimes, including massacres, sexual violence, recruiting children, pillaging, and attacks on schools and hospitals. Between January and late October 2023, various armed actors killed at least 2,446 civilians in South Kivu, North Kivu, and Ituri provinces.


Project Goals:


The complex dynamics of the Eastern DRC conflict naturally make forecasting violence against civilians a difficult task. This portfolio project takes on this challenge. While many analysts would simply refer to prior locations of violence on a Kernel Density Estimation (KDE) heat map as indicative of where violence will likely occur in the future, I take a machine learning approach: predicting violence even in areas where violence may not have occurred yet. Geospatial machine learning models rely on understanding the underlying spatial distribution of a map, occurrences of past events (like civilian killings), and exposure to risk factors to predict the latent risk of a harmful event occurring at a given location, even if that harmful event has never occurred there previously. Such models, validated with spatial cross-validation techniques, are often able to generalize well to a large variety of communities and locations irrespective of the unique characteristics of any given community or location.


Specifically, I focus on the performance levels of 3 machine learning algorithms - Extreme Gradient Boosting (XGBoost), Random Forest, and Support Vector Machines (SVMs) - in forecasting locations and numbers of attacks on civilians. As such, this is a regression rather than a classification problem. This portfolio project will not use time-series models, but a future project will. Assuming KDE as a default baseline, at the end of the project I will compare KDE-predicted spatial risk categories (from low to high risk) with equivalent risk categories predicted by the machine learning models, in terms of the proportion of actual test-set-period attacks on civilians that occurred in each technique’s predicted risk categories. I will then perform additional empirical evaluations to determine the overall quality of the predictions made by the optimal machine learning models.


Since the choice of cross-validation strategy for geospatial data is a less explored research area to date than the choice of algorithm, I will first take a baseline algorithm known to typically perform well - XGBoost - and compare 9 different geospatial cross-validation strategies with it. Then, to save computational resources and time, I will use the cross-validation strategy that performed best with XGBoost with the other 2 algorithms as well, to see if performance is further improved.


Note: Keep in mind that the number of attacks on civilians by armed actors - especially since I will be dividing the map into fishnet grid cells with an area of only 10 square mi. - will be highly skewed in its distribution. Most grid cell observations will have 0 attacks, and only a small portion will have 1 or more attacks in the training and test sets. For this reason, I will use dynamically weighted metrics for hyperparameter tuning.


What are Dynamically Weighted Metrics?:


A standard metric (like RMSE, MAE, etc.) treats all errors equally across the dataset. Each error contributes equally to the overall loss, regardless of the true value or the importance of specific regions in the data. E.g., if your data contains both high and low target values, a standard metric would penalize errors the same way for both types, without adjusting for the significance or rarity of certain events (such as non-zero grid cells).


In contrast, a dynamically weighted metric adjusts the contribution of each error based on the true values or other characteristics of the data. In a dynamically weighted setup:


Errors in regions with more important or rare target values (such as grid cells with a value greater than 0) are given more weight, meaning the model is penalized more heavily for mistakes in these areas.


Errors in regions with more common or less important target values (such as grid cells with a value of 0) are given less weight, so the model is not overly focused on minimizing these errors at the expense of others.


This dynamic weighting ensures that the model focuses on reducing errors in the most critical parts of the dataset, such as rare events or regions with higher target values, which might otherwise be overlooked by a standard metric.
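As a concrete illustration, a dynamically weighted metric can be sketched as a weighted average whose weights grow with the true target value. The project's own implementation is in R; the Python sketch below uses a purely illustrative weight function (base weight 1, plus 2 × the true count) - the actual weighting scheme is a modeling choice.

```python
import numpy as np

def dynamic_weights(y_true, base=1.0, scale=2.0):
    # Zero-count cells get the base weight; cells with attacks get
    # base + scale * count, so rare positive counts weigh more.
    return base + scale * np.asarray(y_true, dtype=float)

def dynamic_rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.average((y_true - y_pred) ** 2,
                                    weights=dynamic_weights(y_true))))

def dynamic_mae(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.average(np.abs(y_true - y_pred),
                            weights=dynamic_weights(y_true)))

# Mostly-zero counts with two rare attack cells:
y_true = np.array([0, 0, 0, 0, 3, 0, 0, 1, 0, 0])
y_pred = np.array([0.1, 0.0, 0.2, 0.1, 1.0, 0.0, 0.1, 0.4, 0.0, 0.1])
```

In this example both dynamic metrics come out larger than their unweighted counterparts, because the bigger errors in the two non-zero cells receive most of the weight.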


Multi-Objective Hyperparameter Tuning Rather Than Single-Objective Tuning:


I also will be using multi-objective tuning rather than standard single objective tuning for this project. Using multi-objective tuning with dynamically weighted metrics offers several advantages over single-objective tuning, especially in the context of highly skewed geospatial data.


1) Balancing Different Types of Errors:

With single-objective tuning, the model may focus excessively on optimizing a specific type of error, such as minimizing overall error (e.g., using RMSE, which penalizes larger errors due to the squaring of residuals and emphasizes reducing significant mispredictions) or absolute differences (e.g., using MAE, which averages the magnitude of errors evenly across all predictions and provides a balanced view of typical errors but may underemphasize the impact of larger deviations), at the expense of other important error characteristics.


Multi-objective tuning, on the other hand, allows the model to strike a balance between different types of errors. E.g., dynamically weighted RMSE captures large errors by emphasizing significant mispredictions in grid cells where rare, high target values occur, while dynamically weighted MAE focuses on the average magnitude of errors across all grid cells but gives additional weight to the few areas with high target values. This ensures that both large and typical errors are prioritized according to the significance of these rare events, despite the overall skew towards zero in the dataset.


2) Mitigating the Impact of Imbalanced Data:

With highly imbalanced data, focusing solely on one metric (like dynamically weighted RMSE) could cause the model to overly prioritize reducing large errors in cells with high values, potentially at the expense of capturing smaller but more frequent errors in the zero-valued cells.


Multi-objective tuning helps balance predictions across this highly skewed count data by incorporating multiple metrics. For example, a metric like dynamically weighted MASE can help the model improve accuracy in the few grid cells with higher counts by adjusting for varying scales of error, while dynamically weighted Huber loss enhances robustness by addressing both smaller errors in the many zero-count cells and larger outliers in cells with higher counts. This approach ensures that the model remains sensitive to rare but significant counts without neglecting the majority of cells with zero values.


3) Improving Spatial Generalization:

My spatial data likely exhibits spatial autocorrelation, meaning nearby locations are likely to have similar values. Multi-objective tuning can ensure that the model generalizes well across various spatial regions, not just fitting to one region or spatial pattern.


E.g., dynamic MAE ensures the model minimizes errors in a more interpretable, location-specific way, while dynamic RMSE ensures larger prediction errors (which may occur in key spatial regions) are penalized more heavily. This can allow for better overall spatial performance.


4) Addressing Heteroscedasticity:

My data likely exhibits heteroscedasticity, where the variance of target values differs across regions (e.g., some regions might have more variability in target values than others). Single-objective tuning may not adequately capture this variability.


Multi-objective tuning using a combination of dynamic metrics can better handle this by weighting regions differently. Dynamic Huber loss helps address both small and large errors robustly across variable regions, and dynamic MASE ensures errors are scaled appropriately, adapting to local variability in the target variable.


The 5 Dynamically Weighted Metrics Used for Hyperparameter Tuning:


1. Dynamic Quantile Loss:

Quantile loss is often useful when you want to penalize underestimations or overestimations differently. It can tune the model to minimize specific types of errors (e.g., more focus on underprediction, which might be more critical in certain regions). However, in my case, I set the argument τ (tau) equal to 0.5, which means underpredictions and overpredictions are penalized the same. The dynamic weighting prioritizes reducing errors in the rare regions/grid cells where the target value is greater than 0, encouraging better performance there.


Note: with this metric, positive residuals are multiplied by 0.5 and negative residuals are multiplied by -0.5, effectively splitting the residuals into positive and negative parts but with equal weight. In contrast, dynamic Mean Absolute Error (MAE) (which I will also use in tuning) treats all residuals equally by taking their absolute value. Additionally, when tau = 0.5, the quantile loss function is effectively targeting the median of the response variable. In contrast, the dynamic MAE metric does not explicitly use the median. It calculates the mean absolute error, which is the average of the absolute differences between the predicted and actual values, ensuring that both overpredictions and underpredictions are treated equally in magnitude.


Even when τ = 0.5, quantile loss retains a subtle directional bias, meaning it might penalize underpredictions and overpredictions slightly differently, depending on how the residuals are split. This could result in different regions of the data (where overpredictions or underpredictions dominate) being prioritized differently. Although the two metrics are similar in nature, and thus somewhat redundant, including both gives the model the ability to address directional bias in error while also keeping overall errors in check.
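The dynamic quantile (pinball) loss described above can be sketched as follows; the dynamic weight of 1 + true value is an illustrative assumption, not the project's exact weighting scheme.

```python
import numpy as np

def dynamic_quantile_loss(y_true, y_pred, tau=0.5):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    # At tau = 0.5 this multiplies positive residuals by 0.5 and
    # negative residuals by -0.5, as described above.
    loss = np.maximum(tau * resid, (tau - 1.0) * resid)
    w = 1.0 + y_true  # illustrative dynamic weights
    return float(np.average(loss, weights=w))
```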


2. Dynamic Huber Loss:

Huber Loss is a combination of Mean Squared Error (MSE) and Mean Absolute Error (MAE), designed to handle both small and large errors in a more balanced way. It is particularly useful for making a model robust to outliers while still penalizing smaller errors effectively. The key idea is that:


For small errors (within a defined threshold δ), it behaves like MSE, squaring the residuals.


For larger errors (beyond δ), it behaves like MAE, penalizing the errors linearly to prevent the loss from blowing up due to large outliers. The dynamic aspect of Huber loss further weights errors based on the target value, helping to focus on important regions.


In a multi-objective context, it helps - across different regions - to balance the need for robustness against outliers with the need to minimize small errors.
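The piecewise behavior described above can be sketched as below; the threshold δ = 1 and the dynamic weight of 1 + true value are illustrative assumptions.

```python
import numpy as np

def dynamic_huber_loss(y_true, y_pred, delta=1.0):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = np.abs(y_true - y_pred)
    quad = 0.5 * resid ** 2                  # MSE-like inside delta
    lin = delta * (resid - 0.5 * delta)      # MAE-like beyond delta
    loss = np.where(resid <= delta, quad, lin)
    w = 1.0 + y_true                         # illustrative dynamic weights
    return float(np.average(loss, weights=w))
```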


3. Dynamic Root Mean Squared Error (RMSE):

RMSE emphasizes larger errors more than MAE, making it suitable for scenarios where large errors are especially problematic. In my spatial context, this helps reduce large errors in regions with rare events (target values > 0), which might be overlooked by simpler metrics like MAE.


Dynamic RMSE in multi-objective tuning helps control for large prediction errors in critical areas while balancing them with other metrics that focus on average or scaled errors.


4. Dynamic Mean Absolute Error (MAE):

This metric is mentioned above. Also note that MAE provides a clear measure of the average error and is less sensitive to outliers than RMSE. It is particularly useful for understanding the typical prediction error in geospatial data.


5. Dynamic Mean Absolute Scaled Error (MASE):

Scale-independent error measurement: MASE scales errors relative to a naive forecast, making it suitable for datasets with varying target values. This helps ensure that the model is not just minimizing errors for the majority of 0-valued cells but also improving performance on the minority of non-zero cells.
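Because the data here are gridded rather than temporal, the "naive forecast" used for scaling must be chosen for cross-sectional data. The sketch below assumes a mean-prediction baseline on the training targets and an illustrative dynamic weight of 1 + true value; both are assumptions, not the project's exact choices.

```python
import numpy as np

def dynamic_mase(y_true, y_pred, y_train):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Scale: MAE of a naive mean-prediction baseline on training data.
    scale = np.mean(np.abs(y_train - y_train.mean()))
    w = 1.0 + y_true  # illustrative dynamic weights
    return float(np.average(np.abs(y_true - y_pred), weights=w) / scale)
```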


Outlier Robustness:

- Most robust: Dynamic Huber Loss

- Moderately robust: Dynamic MAE, Dynamic MASE, Dynamic Quantile Loss (at τ = 0.5 it scales absolute error, matching MAE's robustness)

- Less robust: Dynamic RMSE



Calculating the Pareto Optimal Front With the Help of Principal Component Analysis (PCA):


- What is the Pareto Optimal Front?

The Pareto optimal front (or Pareto frontier) represents a set of solutions in multi-objective optimization where no individual solution can be improved in one objective without worsening another. I.e., it is the set of all “non-dominated” solutions, meaning that you cannot improve one metric score without degrading at least one other metric score.


- How Do I Calculate the Pareto Front?

In my code, the Pareto optimal front is calculated using the fastNonDominatedSorting function from the nsga2R library. Here is a summary of the steps:


- Normalization: The various metrics are normalized using min-max scaling to bring them into a comparable range.
- Outlier Removal: Outliers are identified and removed using the Interquartile Range (IQR) method to prevent them from affecting the Pareto front calculation.
- Pareto Front Selection: The fastNonDominatedSorting function is applied to the normalized metrics. This function sorts the solutions into different fronts, identifying the set of non-dominated solutions that form the Pareto front.
- Near-Optimal Solutions: Since the Pareto front typically contains only a few solutions at most, using PCA weights solely on this small set - later to be used for ranking which models are best - might not be very representative. Therefore, the code considers a specified number (50) “near-optimal” solutions that include those both close to the Pareto front and that are on it. The Euclidean distance of each solution to the Pareto front is calculated to identify these near-optimal solutions. Including these solutions helps make the PCA weights more representative of the entire solution space, providing a better understanding of the importance of each metric across a broader set of potential hyperparameter settings.
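The project's implementation uses the fastNonDominatedSorting function from the R nsga2R library; the following is a simplified Python stand-in for the normalization and Pareto-front-selection steps (the IQR outlier-removal and near-optimal-solution steps are omitted for brevity).

```python
import numpy as np

def min_max(x):
    # Min-max scale each metric column to [0, 1]; constant columns map to 0.
    rng = x.max(axis=0) - x.min(axis=0)
    return (x - x.min(axis=0)) / np.where(rng == 0, 1.0, rng)

def pareto_front(scores):
    # Indices of non-dominated rows, with all objectives minimized.
    keep = []
    for i in range(scores.shape[0]):
        dominated = np.any(np.all(scores <= scores[i], axis=1) &
                           np.any(scores < scores[i], axis=1))
        if not dominated:
            keep.append(i)
    return keep

# Four candidate hyperparameter sets scored on two loss metrics:
scores = min_max(np.array([[1., 2.], [2., 1.], [2., 2.], [3., 3.]]))
```

Here the first two rows form the Pareto front: neither can improve on one metric without worsening the other, while the last two rows are dominated.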


- Brief Explanation of PCA - Especially as it Relates to My Code:

PCA is a statistical method used to reduce the dimensionality of data while retaining most of the variance in the dataset. It does this by transforming the data into a new set of uncorrelated variables called principal components, which are ordered by the amount of variance they capture from the data.


- What Role does PCA Play in My Code?

PCA is used to A) determine the level of importance of each metric in the set of Pareto optimal and near-Pareto optimal solutions, and to B) identify which among the multiple solutions (assuming there are more than one) existing on the Pareto front contains the best hyperparameter set:

1) PCA on Normalized Metrics: PCA is performed on the normalized metrics to understand the variance captured by each metric.
2) Combining Principal Components: The code uses the first two principal components (PC1 and PC2) because they capture almost all the variance. The loadings (weights) of these components are combined to calculate the importance of each metric.
3) Composite Score Calculation: These weights are then used to compute a composite score for each Pareto optimal solution (but not the near Pareto solutions), allowing the identification of the best hyperparameter set based on this score.
4) Selection of Best Hyperparameters: The hyperparameter set of the Pareto front solution with the lowest composite score is considered the best, as it effectively balances the various metrics considered in the multi-objective optimization.
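A minimal Python sketch of steps 1) through 3) above (the project's own code is in R). Scaling the absolute loadings of the first two components by their explained-variance shares is an assumption about the exact weighting scheme.

```python
import numpy as np

def pca_metric_weights(metrics):
    # Center the (solutions x metrics) matrix and eigendecompose its covariance.
    X = metrics - metrics.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigvals)[::-1]        # sort descending by variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Combine |loadings| of PC1 and PC2, scaled by explained-variance share.
    var_share = eigvals[:2] / eigvals.sum()
    w = np.abs(eigvecs[:, :2]) @ var_share
    return w / w.sum()                       # normalize weights to sum to 1

def composite_scores(metrics, weights):
    # Lower composite score = better, since all metrics are losses.
    return metrics @ weights
```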

Test Set Metrics:


For scoring performance on my test set, I use 29 extra metrics, some of which are common, some of which are dynamically weighted, and some of which are asymmetric (and dynamically weighted) in that they penalize underpredictions more than overpredictions. In the context of attacks on civilians, it is natural to assume that the costs of underpredicting violence are greater than the costs of overpredicting it.


1. RMSE Family

- regr.rmse (Root Mean Square Error): Calculates the square root of the average squared differences between predicted and actual values. RMSE is sensitive to large errors, making it useful for penalizing larger mistakes, but it may overemphasize outliers (e.g., high target values) in a skewed dataset.

- dynamic_rmse: This version adjusts RMSE by introducing dynamic weights, which can be based on true values (e.g., grid cells with higher target values get more weight). This makes it more suitable for geospatial data with heterogeneous distributions, ensuring that regions with more significant outcomes (e.g., non-zero target values) have a larger impact on the error calculation.

- asymmetric_dynamic_rmse: Further adds asymmetry by penalizing underestimations more than overestimations. For instance, underestimating the likelihood of attacks (non-zero values) is penalized more, reflecting real-world priorities in sensitive areas: underpredicting violent attacks on civilians is more costly than overpredicting them.
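A sketch of the asymmetric variant; the underprediction multiplier (2×) and the dynamic weight (1 + true value) are illustrative choices, not the project's exact values.

```python
import numpy as np

def asymmetric_dynamic_rmse(y_true, y_pred, under_penalty=2.0):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    # Dynamic weight grows with the true count; underpredictions
    # (positive residuals) are additionally up-weighted.
    w = (1.0 + y_true) * np.where(resid > 0, under_penalty, 1.0)
    return float(np.sqrt(np.average(resid ** 2, weights=w)))
```

Underpredicting a non-zero cell thus scores worse than overpredicting it by the same margin.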


2. MSE Family

- regr.mse (Mean Squared Error): A variant of RMSE without the square root. It penalizes larger errors even more heavily since it squares the error directly, which can be problematic in skewed data as it amplifies the influence of outliers. Still, I use this as a standard metric which can help spot especially problematic models.

- dynamic_mse: Dynamic MSE. Similar to Dynamic RMSE, Dynamic MSE incorporates dynamic weights based on true values to ensure that errors in more significant areas (non-zero target regions) are emphasized.

- asymmetric_dynamic_mse: Asymmetric Dynamic MSE. Applies both dynamic weights to errors and asymmetric penalties to underpredictions, as before.


3. MAE Family

- regr.mae (Mean Absolute Error): Measures the average of the absolute differences between predictions and actual values. It is less sensitive to outliers compared to MSE but still treats all errors equally - taking their absolute value - which may not be ideal for geospatial data where target values are heavily skewed toward zero.

- dynamic_mae: This introduces dynamic weights, giving more weight to areas with non-zero values (e.g., regions with recorded attacks), making it more responsive to important prediction areas in highly skewed and heterogeneous geospatial datasets.

- asymmetric_dynamic_mae: Modifies dynamic_mae by penalizing underestimations more heavily than overestimations.


4. Quantile Loss Family

- quantile_loss: Quantile loss - as discussed above - can capture specific percentiles (e.g., median with τ = 0.5) rather than mean or absolute errors, offering a more flexible approach to error measurement. It is especially useful when the error distribution is skewed, like in my case, where most values are zero, but a small percentage are significant. It differs from MAE by using median instead of mean, and by multiplying positive residuals by 0.5 and negative residuals by -0.5, which splits the residuals into positive and negative parts but with equal weight. In contrast, MAE treats both positive and negative residuals equally as positive values, by taking their absolute value.

- dynamic_quantile_loss: Weights the quantile loss dynamically based on true values, emphasizing the importance of higher target areas.

- asymmetric_dynamic_quantile_loss: Applies dynamic penalties, and also asymmetrically penalizes underpredictions.


5. MASE Family

- regr.mase (Mean Absolute Scaled Error): Scales the absolute error by the mean absolute error of a naïve forecast. MASE can address seasonality or spatial autocorrelation, making it useful in geospatial data, especially when there is spatial autocorrelation within target values (e.g., similar behavior across neighboring grid cell locations).

- dynamic_mase: Dynamically weighted Mean Absolute Scaled Error, introducing weights based on true values.

- asymmetric_dynamic_mase: Dynamically weights residuals, and asymmetrically penalizes underpredictions.


6. Huber Loss Family

- huber_loss: A hybrid between Mean Squared Error (MSE) and MAE, Huber loss penalizes small errors quadratically like MSE and larger errors linearly like MAE, making it more robust to outliers. It is beneficial in my context because it balances sensitivity to both small and large errors in skewed geospatial data.

- dynamic_huber_loss: A dynamically weighted version of Huber Loss.

- asymmetric_dynamic_huber_loss: A version of Huber Loss that both dynamically weights residuals, and further penalizes underestimations more heavily.


7. Logarithmic Median Absolute Error (MedAE) Family

- dynamic_logarithmic_medae: Uses a logarithmic penalty function that smooths out the influence of extreme errors. The median in this context is important because it is more robust to outliers compared to the mean. This is particularly useful in geospatial data where large errors are less frequent but still important. The dynamic weighting ensures that regions with higher target values (e.g., attacks) get more emphasis while still controlling for the impact of outliers.

- asymmetric_dynamic_logarithmic_medae: Further adds asymmetric penalization to underpredictions.


8. Square Root Median Absolute Error (MedAE) Family

- dynamic_sqrt_medae: Uses a square root penalty function that smooths out the influence of extreme errors. The median in this context is more robust to outliers compared to the mean. This is particularly useful in geospatial data where large errors are less frequent but still important. The dynamic weighting ensures regions with higher target values (e.g., attacks) get more emphasis while still controlling for the impact of outliers.

- asymmetric_dynamic_sqrt_medae: Besides dynamic weighting, this metric further adds asymmetric penalties to underpredictions. The median ensures the primary focus remains on typical errors, ignoring extreme outliers that could otherwise distort model performance in highly skewed datasets.


9. Logistic Median Absolute Error (MedAE) Family

- dynamic_logistic_medae: The logistic function penalizes errors, offering a smoother transition between smaller and larger errors. The median plays a crucial role in handling outliers, ensuring that large deviations do not disproportionately influence the overall error calculation. The dynamic weights again prioritize errors in higher-value grid cell regions.

- asymmetric_dynamic_logistic_medae: This variant modifies dynamic_logistic_medae by applying asymmetric penalties to underpredictions.


10. Polynomial Median Absolute Error (MedAE) Family

- dynamic_poly_2.3_and_1.3_medae: This metric applies a polynomial penalty function that uses a 2/3 exponent for underpredictions and a 1/3 exponent penalty for overpredictions. The 2/3 exponent for underpredictions results in a steeper penalty for cases where the model underpredicts, meaning that the cost of missing critical events (e.g., attacks) is more heavily weighted. The 1/3 exponent for overpredictions leads to a gentler penalty, making the model more tolerant of overestimates, which is useful when overpredicting is less risky.

The use of the median again mitigates the influence of outliers. By focusing on the typical errors and applying dynamic weights, this metric ensures regions with non-zero target values (i.e., 1 or more attacks) receive more attention, while preventing extreme values from overly influencing the model evaluation.

- dynamic_poly_2.3_and_1.2_medae: Similar to the previous version, this metric applies polynomial penalties but with slightly different exponents: a 2/3 exponent for underpredictions and a 1/2 exponent for overpredictions. The 2/3 exponent for underpredictions still imposes a steep penalty, ensuring that underestimations (especially in critical areas) are penalized more. However, the 1/2 exponent (square root penalty) for overpredictions is slightly steeper than the 1/3 exponent used in the previous version, meaning overestimations are penalized more here than in the dynamic_poly_2.3_and_1.3_medae metric above, though still less than underestimations.
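The polynomial penalties described above can be sketched as below; the dynamic-weighting step is omitted for brevity, and the exponents shown are the 2/3 and 1/3 pair from the first variant.

```python
import numpy as np

def poly_medae(y_true, y_pred, under_exp=2/3, over_exp=1/3):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    # Steeper 2/3-exponent penalty for underpredictions (positive
    # residuals), gentler 1/3-exponent penalty for overpredictions.
    pen = np.where(resid > 0,
                   np.abs(resid) ** under_exp,
                   np.abs(resid) ** over_exp)
    # Median keeps extreme outliers from dominating the score.
    return float(np.median(pen))
```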


11. GMAE

- gmae (Geometric Mean Absolute Error): Instead of focusing on additive errors like in MAE or RMSE, GMAE emphasizes multiplicative errors. Additive errors occur when the difference between the predicted and actual values is treated as a straightforward subtraction (i.e., |truth − prediction|). In contrast, multiplicative errors consider the ratio or relative scale between predicted and actual values. This means errors are treated in proportion to the size of the prediction or true value, which can be especially important when the target values vary widely in magnitude, as with my data.

In the context of data where the target value is often zero but occasionally non-zero and potentially significant — multiplicative errors become more relevant because they capture relative differences better than additive errors. E.g., a small error in a low-target value grid cell area (e.g., an error of 0.05 when predicting 0.1) can be proportionally larger than the same error in a high-target value area (e.g., an error of 0.05 when predicting 10). Multiplicative errors scale the error based on the value of the target, so overpredictions or underpredictions in critical areas (non-zero targets) will have a greater proportional impact than they would using additive errors alone. This approach is particularly useful for skewed data because GMAE helps balance error contributions by minimizing the impact of both extremely large and extremely small errors from both zero-value and non-zero-value grid cell areas.
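A sketch of GMAE as the geometric mean of absolute errors; the small epsilon is a practical assumption to keep the logarithm defined when a cell's error is exactly zero.

```python
import numpy as np

def gmae(y_true, y_pred, eps=1e-8):
    err = np.abs(np.asarray(y_true, dtype=float) -
                 np.asarray(y_pred, dtype=float))
    # Geometric mean via the mean of logs; eps guards against log(0).
    return float(np.exp(np.mean(np.log(err + eps))))
```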


12. Log Cosh Loss

- log_cosh_loss: A smooth, differentiable loss function based on the hyperbolic cosine of the prediction error. It behaves like MSE for smaller errors (applying a squared penalty that still does not over-penalize small discrepancies between predictions and true values) and like MAE for larger ones (applying a roughly linear penalty). Unlike RMSE, the logarithmic component of the log-cosh loss ensures that for extreme errors the penalty grows only linearly rather than quadratically, which recognizes the presence of outliers while preventing them from dominating the error metric and having a disproportionate impact.
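Log-cosh can be computed in a numerically stable way via the identity log cosh(r) = logaddexp(r, −r) − log 2:

```python
import numpy as np

def log_cosh_loss(y_true, y_pred):
    resid = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    # log(cosh(r)) ~ r**2 / 2 for small r (quadratic, MSE-like) and
    # |r| - log(2) for large r (linear, MAE-like, so outliers grow slowly).
    return float(np.mean(np.logaddexp(resid, -resid) - np.log(2.0)))
```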


13. Others

- regr.medae (Median Absolute Error): Measures the median of absolute errors, offering a more robust measure against outliers compared to MAE. It is useful for datasets with extreme values, as it focuses on the central tendency of errors rather than extreme deviations.

- regr.medse (Median Squared Error): This calculates the median of squared errors, emphasizing a balance between robustness (due to the median) and sensitivity to larger errors (due to squaring).

- regr.msle (Mean Squared Logarithmic Error): This metric calculates the mean squared difference between the logarithms of the predicted and true values. It is useful for cases where the target varies across orders of magnitude and penalizes underpredictions of small values more heavily than overpredictions.

- regr.rmsle (Root Mean Squared Logarithmic Error): This is the square root of the MSLE, making the scale of the errors easier to interpret. It is especially effective when the target values are heavily skewed.

- regr.rae (Relative Absolute Error): RAE compares the absolute errors of the model’s predictions with the absolute errors of a baseline model (such as predicting the mean of the true values). It provides a relative measure of how well the model is performing in comparison to this baseline. A value less than 1 indicates that the model performs better than the baseline, while a value greater than 1 indicates worse performance. RAE focuses on absolute errors, so it gives equal weight to all errors, regardless of their size.

- regr.rrse (Relative Root Squared Error): RRSE is similar to RAE but uses squared errors instead of absolute errors. It compares the squared errors of the model’s predictions with those of the baseline model. By squaring the errors, RRSE penalizes larger errors more heavily than smaller ones, making it more sensitive to outliers. Like RAE, a value below 1 indicates that the model is performing better than the baseline, but RRSE emphasizes the importance of larger errors in the comparison.


Explanation of the Data:


The training set data spans Jan. 19, 2018 to Jun. 30, 2023 (about 5.5 years). The test set data spans Jul. 1, 2023 to Dec. 31, 2023 (6 months). A future portfolio project will integrate time-series data and take the added step of forecasting violence against civilians literally into the future in the Eastern DRC, beyond the end of the test set period.


The 33 features (i.e., explanatory variables) used in this project, the datasets they come from, and hypotheses as to why including these features in the model may be important are as follows:


Features:


- Mean nighttime light levels (per 10 square mile fishnet grid cell) as a proxy for levels of economic development. These light levels come from the Day/Night Band (DNB) radiance measured by sensor dataset within the VIIRS/NPP Granular Gridded Day Night Band 500m Linear Lat. Lon. Grid Night NRT dataset.


Hypothesis: Lower levels of economic development in a given grid cell in the Eastern DRC increase the risk that that grid cell will experience attacks on civilians by armed groups because:

1) Economic underdevelopment leads to a lack of essential resources, infrastructure, and government presence and protection, creating a power vacuum.

2) Armed groups exploit this vacuum to establish control and recruit from impoverished populations. Economic hardship forces civilians to depend on these groups, increasing their vulnerability to exploitation and attacks.

3) These groups use violence to secure resources and maintain authority.


- Mean population density, from the Gridded Population of the World (GPW), v4 - UN WPP-Adjusted Population Density, v4.11 (2020) dataset, by the Socioeconomic Data and Applications Center (SEDAC) at Columbia University.


Hypothesis: Less densely populated civilian areas are more likely targets for attack by armed groups than more densely populated areas, because armed groups (especially non-state armed groups) tend to operate in more rural areas, and because these sparsely populated rural areas tend to be more isolated and far from state forces that could come to their rescue.


- Mean altitude, measured by DIVA-GIS (90 meter resolution). See https://diva-gis.org/.


Hypothesis: Non-State Armed Groups (NSAGs) often operate in mountainous and high elevation regions where it is harder for state forces to reach. Civilians living in these regions are often left unprotected and vulnerable to violence from groups who would exploit them.


- Mean forest height, from the Global Land Analysis & Discovery Forest Height, 2020 dataset (pixel value: forest height in meters) at the University of Maryland (See https://glad.umd.edu/dataset/GLCLUC2020)


Hypothesis: It is likely that NSAGs often operate in forests where it is harder for state forces to reach them. If so, civilians near where these groups operate may be subject to violence from them. The taller the forest, the more likely NSAGs are to seek shelter there, since taller canopy can better conceal their locations.


- Mean travel time to the nearest city, from the raster dataset “A suite of global accessibility indicators”, by Andy Nelson et al., Sci Data 6, 266 (2019) (See https://www.nature.com/articles/s41597-019-0265-5). I merged raster files in QGIS that provide travel times to cities of population sizes 50,000 - 100,000; 100,000 - 200,000; and 200,000 - 500,000. I then clipped the merged raster file to the boundaries of e_drc_adm3_map (which has the same boundaries as fishnet). The result is the imported raster file. This raster file thus contains travel times from any point on the map to the nearest city with a population between 50,000 and 500,000.


Hypothesis: Longer travel times from rural areas to cities make it hard for civilians to escape violence when it occurs and for state security forces to respond. Thus, armed groups may feel more secure attacking civilians in such areas without fear of repercussions. I require a population of at least 50,000 since such a city - though relatively small - would likely have at least a minimal state security force presence able to respond to nearby NSAG violence.


- Minimum distance to the border with contiguous countries. For every fishnet grid cell, I calculate the distance to the border with each neighboring country, and take the minimum of these distances.


Hypothesis: Border areas - especially in countries with weak state authority in rural regions - often lack a state security force presence. Borders also provide NSAGs protection from state security forces, since NSAGs can simply cross into neighboring countries’ territories to escape retaliation for violence. As a result, border areas are likely to be more vulnerable to attacks on civilians, as state security forces have difficulty securing them.
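A minimal sketch of this feature's construction, assuming per-cell distances to each neighboring country's border have already been computed (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical per-cell distances (km) to three neighboring borders.
cells = pd.DataFrame({
    "dist_rwanda_km":  [12.0, 80.0, 300.0],
    "dist_uganda_km":  [45.0, 20.0, 310.0],
    "dist_burundi_km": [90.0, 75.0, 150.0],
})

# The feature is the row-wise minimum across the border distances.
cells["min_border_dist_km"] = cells[
    ["dist_rwanda_km", "dist_uganda_km", "dist_burundi_km"]
].min(axis=1)
```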


- Total mines controlled by armed groups, from the International Peace Information Service (IPIS) dataset “Artisanal Mining Sites in Eastern DRC”. The dataset’s variables - “armed_group1”, “armed_group2”, and “armed_group3” - provide the names of armed groups that maintain a presence, if any, at any given mine. I will filter out the categories of “None”, NA, and government forces, leaving us with mines occupied by NSAGs.


Hypothesis: Illegal mining of precious minerals in the Eastern DRC by non-state armed groups (NSAGs) has long been considered one of the key drivers of the conflict, enabling armed groups to fund their activities. The IPIS dataset contains verified artisanal mining site locations across the Eastern DRC. The more mines known to have at some point been controlled by NSAGs in a given grid cell, the more likely civilians are to come into contact with NSAGs, and thus the higher the chances of violence against civilians by these NSAGs - particularly civilians deemed loyal to opposing groups or the government, or from ethnic groups deemed to be enemies. In some cases, it is documented that civilians have been forced to work for NSAGs in these mines, increasing the likelihood of violence against them.
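The filtering step described above might look like the following pandas sketch; the site rows and the "FARDC" label for government forces are illustrative assumptions, not values taken from the IPIS data:

```python
import pandas as pd

# Illustrative mining-site table with the armed_group1..3 columns named
# in the text (rows are made up).
mines = pd.DataFrame({
    "site": ["A", "B", "C", "D", "E"],
    "armed_group1": ["FDLR", "None", None, "FARDC", "FARDC"],
    "armed_group2": [None, None, None, None, "Mai-Mai"],
    "armed_group3": [None, None, None, None, None],
})

group_cols = ["armed_group1", "armed_group2", "armed_group3"]
excluded = {"None", "FARDC"}  # "None" strings and government forces

def has_nsag(row):
    # A mine counts if any of its three group columns names a group that
    # is present (non-missing) and not excluded.
    return any(pd.notna(g) and g not in excluded for g in row[group_cols])

nsag_mines = mines[mines.apply(has_nsag, axis=1)]
```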


- Weighted harmonic mean Inverse-Distance Weighted (IDW) distance to mines controlled by non-state armed groups (NSAGs) - computed from the distances to the 3 nearest neighboring grid cells (i.e., k nearest neighbors = 3) containing NSAG-controlled mines, weighted by the number of such mines per grid cell.


Explanation:

The weights are the mine counts of the neighboring cells, and the harmonic mean naturally emphasizes small distances. I.e.:

1) If the nearest mine-bearing cells are closer or contain more mines, the weighted harmonic mean distance will be lower, suggesting a higher threat.

2) If the nearest mine-bearing cells are farther away or contain fewer mines, the weighted harmonic mean distance will be higher, suggesting a lower threat.


Hypothesis: The Weighted Harmonic Mean IDW distance to mines controlled by NSAGs is a significant predictor of the spatial distribution and frequency of attacks on civilians. Lower Weighted Harmonic Mean IDW distance to NSAG-controlled mines is expected to increase the likelihood and frequency of attacks. As this distance decreases, civilians are more likely to interact with armed groups, raising the risk of violence. The weighting applied through IDW reflects the diminishing influence of more distant mines on violence levels, with the greatest impact felt in areas nearest to armed groups’ activities. This relationship is especially pronounced in regions where NSAGs use violence as a means to exert control over both the civilian population and economic resources.
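The weighting described above can be sketched as a small function; this is my reading of the feature, with toy distances and mine counts:

```python
import numpy as np

def weighted_harmonic_mean_distance(distances, weights):
    """Weighted harmonic mean of distances to the k nearest mine-bearing
    cells, weighted by the number of mines in each cell. Lower values
    (closer and/or more heavily weighted neighbors) suggest higher threat."""
    d = np.asarray(distances, dtype=float)
    w = np.asarray(weights, dtype=float)
    return w.sum() / (w / d).sum()

# k = 3 nearest mine-bearing cells at 2, 4, and 8 km, one mine each:
uniform = weighted_harmonic_mean_distance([2, 4, 8], [1, 1, 1])

# Same distances, but 4 mines in the closest cell pull the mean down,
# signaling a higher threat:
skewed = weighted_harmonic_mean_distance([2, 4, 8], [4, 1, 1])
```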


- Total number of mines (also from the IPIS dataset), regardless of mine type (not necessarily documented to have been controlled by armed groups) within each fishnet grid cell.


Hypothesis: More mines in any given location will attract armed groups and the potential for violence against civilians, irrespective of whether such mines were found to have armed groups present at the exact time when the creators of the IPIS dataset happened to visit the mines. It is possible that this variable will more accurately reflect the potential for armed groups being present.


- Weighted harmonic mean IDW distance to the nearest mines, regardless of the type of mine.


- Total number of 3T mines (Tin, Tantalum, Tungsten). This feature also comes from the IPIS dataset:


Explanation:

Economic Importance: 3T minerals are crucial for electronics and industrial applications globally.

Scale of Operations: Typically, 3T mining operations are smaller and more dispersed than gold mining operations. They are often run by artisanal and small-scale miners.

Value and Transport: 3T minerals generally have lower per-unit value than gold and are easier to transport covertly, making them attractive for informal or illegal trade.


Hypothesis:

A greater number of 3T mines in a given location (as opposed to gold mines) might increase the likelihood of non-state armed groups attacking civilians because:

A higher number of 3T mines increases the number of sites armed groups must control to secure revenue, leading to more widespread violence as they establish dominance. In contrast, gold mines - fewer but more valuable - might draw concentrated efforts by armed groups, resulting in intense but localized conflict.


- Weighted harmonic mean IDW distance to the nearest 3T mines.


- Total number of gold mines in a given grid cell (from the IPIS dataset).


Hypothesis: The total number of gold mines in a given grid cell will be positively correlated with the likelihood of non-state armed groups attacking civilians. Gold mines, due to their especially high per-unit value and the ease of converting gold into financial resources, attract more intense and concentrated efforts by armed groups to seize control. As the number of gold mines increases in a given grid cell location, competition among armed groups for control may escalate, leading to heightened conflict and violence against civilians as these groups attempt to secure revenue from these valuable resources.


- Weighted harmonic mean IDW distance to the nearest gold mines.


- Total number of active non-state armed groups per grid cell, from the Armed Conflict Location and Event Data (ACLED) dataset.


Hypothesis: The greater the number of NSAGs involved in conflict events in a given location, the more violence is likely to occur there in the future, since NSAGs will seek to outperform competing groups by demonstrating dominance and control through violence against civilians and others living in the same area. Some of this violence may target civilians deemed loyal to competing NSAGs in the same region.
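Deriving the per-cell NSAG count from event records might look like this groupby sketch; the cell IDs and actor names are illustrative placeholders for ACLED fields joined to the fishnet:

```python
import pandas as pd

# Hypothetical conflict events already assigned to fishnet cells.
events = pd.DataFrame({
    "cell_id": [1, 1, 1, 2, 2],
    "actor":   ["FDLR", "FDLR", "CODECO", "Mai-Mai", "Mai-Mai"],
})

# Count distinct armed actors active in each cell.
nsags_per_cell = events.groupby("cell_id")["actor"].nunique()
```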


- Weighted harmonic mean IDW distance to each grid cell’s k = 3 nearest neighbors that have at least 1 active NSAG present, weighted by the total number of NSAGs present.


- Total number of territorial seizures, either by NSAGs or by government forces, from the ACLED dataset.


Hypothesis:

Civilians who previously lived under the control of one side in a conflict are often perceived as loyal to that side when enemy forces conquer territory. Violence can be carried out to send a message to a broader community of who is now in charge or to punish civilians thought to be loyal to the prior occupiers, or to indirectly punish the prior occupiers via the same logic.


- Weighted harmonic mean IDW distance to the nearest territorial seizures, either by NSAGs or by government forces. This harmonic mean IDW distance is measured from each grid cell to its 3 nearest neighboring cells in which at least 1 territorial seizure occurred, and is weighted by the number of territorial seizures within those 3 grid cells.


- Total armed clashes in a given grid cell, from the ACLED dataset.


Hypothesis: Armed clashes between militant groups, or between such groups and government forces, likely indicate areas where violence against civilians is also more probable: armed groups may seek revenge against civilians deemed loyal to opponents, or one side (actor A) may attack opponents (actor B) occupying areas where actor A believes - potentially correctly - that civilians loyal to it are in danger of attacks by actor B.


- Weighted harmonic mean IDW distance from each grid cell to the 3 nearest neighboring grid cells in which at least 1 armed clash occurred. The weights consist of the number of armed clashes in each of these 3 grid cells.


- Number of “direct strikes” per grid cell, defined as a combination of the following sub event types from ACLED’s “Explosions and remote violence” event type:

- Shelling/artillery/missile attack

- Grenade

- Air/drone strike


Hypothesis: Areas which undergo attacks from one party may be subject to retaliatory actions taken out on civilians accused of being loyal to the party which initiated the attack (and potentially accused of having informed the enemy of their location for the attack to occur).


- Weighted harmonic mean IDW distance to 3 nearest grid cells having direct strikes.


- Total number of ethnic groups in a single grid cell, measured from the Spatially Interpolated Data on Ethnicity (SIDE) dataset (2018). See https://icr.ethz.ch/data/side/. I specifically seek to know the number of ethnic groups in any given grid cell which constitute less than approximately 50 percent of the population. I.e., I seek to know how many ethnic minority groups exist in any given location on the map.


Hypothesis: The greater the number of ethnic groups residing in any single location, the greater are the chances for ethnic tensions and attacks on civilians by NSAGs belonging to rival ethnic groups.
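One way to derive the minority-group count, assuming per-cell population shares interpolated from the SIDE rasters (the values below are invented):

```python
# Hypothetical interpolated population shares per ethnic group, per cell.
cell_shares = {
    "cell_1": {"group_a": 0.70, "group_b": 0.20, "group_c": 0.10},
    "cell_2": {"group_a": 0.45, "group_b": 0.40, "group_c": 0.15},
}

# Count groups constituting less than ~50 percent of the cell population.
minority_counts = {
    cell: sum(1 for share in shares.values() if share < 0.5)
    for cell, shares in cell_shares.items()
}
```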


- Total number of ACLED events involving foreign troops (other than UN forces) by grid cell location. I do not restrict my analysis here to ACLED events involving violence.


Note: I include foreign forces that are part of the East African Community Regional Force (EACRF) in the DRC in this analysis, since these are not part of the UN peacekeeping mission in the DRC (MONUSCO) and thus may be less restrained from using deadly force against civilians, or from acting in ways towards NSAGs that may in turn result in NSAGs attacking civilians.


Hypothesis:

The total number of ACLED events involving foreign troops (other than UN forces) in a given grid cell increases the likelihood of attacks on civilians in that grid cell because:

1) Increased Tensions: The presence and actions of foreign troops, especially those not part of the UN peacekeeping mission, may heighten tensions with non-state armed groups (NSAGs), leading to retaliatory attacks on civilians perceived as collaborators or sympathizers.

2) Less Restrained Use of Force: Foreign forces that are not part of MONUSCO may be less restrained in their operations, potentially resulting in civilian casualties, which could provoke NSAGs to retaliate by targeting civilians.

3) Destabilization: The involvement of foreign troops might disrupt local power dynamics and create instability, which NSAGs could exploit by attacking civilians to assert control or punish perceived cooperation with foreign forces.

4) Perceived Threat: NSAGs may view foreign troops as a significant threat, leading them to preemptively attack civilians to weaken local support for these forces or to send a message of resistance.


- Weighted harmonic mean IDW distance from each grid cell to the nearest location where foreign troops have been present at any point in time, measured in ACLED. I do not limit my analysis here to ACLED events involving violence. I find the k=1 nearest neighbor rather than k=3 because there are very few observations in the dataset where foreign forces were recorded as present.


Hypothesis: NSAGs may retaliate against civilians perceived to be collaborating with foreign forces or may preemptively attack civilians to weaken local support for these troops. Longer distances (higher IDW values) suggest a lower threat, as the influence of foreign troop presence diminishes with distance.


- Total number of ACLED events involving MONUSCO (the UN peacekeeping mission in the DRC) by grid cell location.


Hypothesis: More events involving the presence of MONUSCO will lead to fewer attacks on civilians, for the following reasons:

1) Increased Security: The presence of MONUSCO forces may improve overall security and stability in the area, leading to a reduction in the frequency or severity of attacks on civilians as NSAGs find it more difficult to operate.

2) Conflict Mitigation: MONUSCO’s involvement in conflict resolution and protection of civilians might reduce the likelihood of attacks as they work to mediate disputes and protect vulnerable populations.


- Weighted harmonic mean IDW distance from each grid cell to the nearest 3 locations where events involving MONUSCO have occurred, as measured by the ACLED dataset.


- Total number of violent events in any given grid cell involving state troops or police, from the ACLED dataset.


Hypothesis:

Higher numbers of events involving national troops or police in a given grid cell could theoretically increase or decrease the total number of attacks on civilians depending on the context of these interactions:

1) Increased Attacks: If national troops or police are involved in abusive actions, such as extrajudicial killings, arbitrary arrests, or other human rights violations, this could provoke retaliatory violence from non-state armed groups (NSAGs). Additionally, these forces might act as aggressors themselves, contributing directly to civilian casualties. The presence of state forces might also increase local tensions, leading to more attacks on civilians as armed groups seek to challenge state authority or control territory.

2) Decreased Attacks: Conversely, if national troops or police are actively engaged in protecting civilians and maintaining security, their presence might deter NSAGs from attacking civilians. Effective and disciplined state forces could reduce the opportunities for armed groups to commit violence by establishing greater control and providing security in vulnerable areas.

Given these possible dynamics, the model can learn whether the presence of national troops or police tends to correlate with increased or decreased violence in different contexts.


- Weighted harmonic mean IDW distance from each grid cell to the nearest location where state military or police forces have been active in violent events.


- Distance to the nearest refugee or Internally Displaced Person (IDP) camp, recorded from the “UNHCR People of Concern” dataset, found on https://data.unhcr.org/en/geoservices/.


Hypothesis: Closer distances to refugee and IDP camps could theoretically either increase the likelihood of attacks on civilians in a given location, or decrease the likelihood of such attacks.

1) Why close distances to camps may increase attacks:

Refugees and IDPs may be seen as targets by armed groups - either for recruitment or for retaliation against vulnerable civilians seen as connected with rival groups. As a result, attacks may occur near where refugees and IDPs are living.

2) Why close distances to camps may decrease attacks:

In contrast, it is also possible that refugee and IDP camps are intentionally built in areas seen as safer in terms of distance from where non-state armed groups are believed to be located.
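Computing the distance from a grid-cell centroid to its nearest camp could be sketched with a haversine great-circle distance; the coordinates below are illustrative, not real camp locations:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Distance from one cell centroid to the nearest of several camps.
centroid = (-1.50, 29.20)
camps = [(-1.40, 29.10), (-2.00, 28.80), (-1.55, 29.60)]
nearest_camp_km = min(haversine_km(*centroid, *camp) for camp in camps)
```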


- Distance to the nearest main road, from the “Democratic Republic of Congo (DRC) Major Roads Network (OpenStreetMap Export)” dataset, available from the OCHA Humanitarian Data Exchange at: https://data.humdata.org/dataset/democratic-republic-of-congo-drc-major-roads-network-openstreetmap-export


Hypothesis:

Longer distances to major roads may make civilians more susceptible to NSAG presence and violent attacks: NSAGs are more likely to hide in areas far from the major roads that are easily accessible to state forces, and state forces will have more difficulty rapidly responding to violence against civilians in such areas - a factor in NSAGs’ decisions on whether to attack.
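Since roads are polylines, the nearest-road distance reduces to a minimum point-to-segment distance over the road's segments; a planar sketch (assuming projected coordinates, not raw lat/lon) follows:

```python
import math

def point_segment_distance(px, py, ax, ay, bx, by):
    """Distance from point (px, py) to segment (ax, ay)-(bx, by) in the
    plane. A projected CRS is assumed; raw lat/lon would need a geodesic
    distance instead."""
    abx, aby = bx - ax, by - ay
    apx, apy = px - ax, py - ay
    denom = abx * abx + aby * aby
    # Clamp the projection of the point onto the segment to [0, 1].
    t = 0.0 if denom == 0 else max(0.0, min(1.0, (apx * abx + apy * aby) / denom))
    cx, cy = ax + t * abx, ay + t * aby
    return math.hypot(px - cx, py - cy)

# Distance from a cell centroid to the nearest segment of a toy road.
road = [(0, 0), (10, 0), (10, 10)]
centroid = (4, 3)
dist_to_road = min(
    point_segment_distance(*centroid, *road[i], *road[i + 1])
    for i in range(len(road) - 1)
)
```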


- Locations with Moran’s I p-values that are hyper-statistically significant at ≤ 0.0000001.


Explanation:

Including a feature that captures whether a given grid cell contains local Moran’s I values with hyper-significant p-values (≤ 0.0000001) when the target variable is the number of attacks on civilians by armed groups can serve to identify areas with extreme and statistically rare spatial clustering patterns. This feature is particularly useful in highlighting grid cells where the spatial distribution of attacks is highly unusual, indicating unique underlying factors that could either significantly increase or decrease the likelihood of attacks.


Benefits:

1) Identification of Anomalies: The feature can help the model detect not just hotspots but also areas with unusually low clustering of civilian attacks. High local Moran’s I values with such an extreme p-value threshold suggest rare and distinct spatial patterns, which can be critical in understanding both high-risk and low-risk areas.

2) Enhanced Spatial Dependency Analysis: By focusing on hyper-significant p-values, the model can better account for extreme spatial autocorrelation, recognizing grid cells where the spatial relationship between attacks is exceptionally strong or weak, which may reveal non-obvious insights about the spread or containment of violence.

3) Refined Risk Assessment: This feature can serve as an indicator of areas with either exceptionally high or low risks of attacks, aiding in more precise early warning systems and targeted interventions. It can help identify not just emerging high-risk areas but also potential safe zones that might otherwise be overlooked.


Hypothesis:

The inclusion of local Moran’s I values with hyper-significant p-values as a feature may reveal that grid cells with extremely rare spatial clustering patterns (either high or low) are more likely to experience deviations in the typical number of attacks on civilians by armed groups. Specifically, grid cells with high clustering may be more susceptible to attacks due to concentrated risk factors, while cells with unusually low clustering might indicate areas of lower-than-usual risks, possibly due to strong local control, effective protection measures, or other stabilizing influences.
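Once local Moran's I p-values exist per cell (in practice from a permutation test, e.g. via a library such as esda - an assumption, since the text does not name the tool), turning them into the binary feature is a simple threshold; the p-values below are invented:

```python
import numpy as np

# Invented local Moran's I p-values, one per grid cell.
p_values = np.array([0.03, 5e-9, 1e-7, 0.5])

# Flag cells at or below the hyper-significance threshold of 1e-7.
hyper_significant = (p_values <= 1e-7).astype(int)
```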


- Distance from each grid cell to the nearest hyper-statistically significant grid cell.


Explanation:

This feature captures the proximity of a grid cell to areas with extreme spatial clustering of attacks, which might influence the likelihood of violence spreading or being contained within the grid cell in question.


Benefits:

1) Spatial Influence Analysis: This feature helps the model assess how close a grid cell is to a potential source of heightened risk. Proximity to a grid cell with hyper-significant clustering could increase the likelihood of attacks due to the spillover effect or the diffusion of violence from one area to another.

2) Risk Propagation Insight: The feature allows the model to capture the potential for violence to spread across neighboring areas. A shorter distance might indicate a higher risk of attack due to spatial contagion, while a longer distance could suggest a buffer zone that mitigates the spread of violence.

3) Strategic Intervention Targeting: Identifying grid cells close to areas of extreme clustering can guide strategic interventions, focusing resources on cells that are at higher risk due to their proximity to violent hotspots.


Hypothesis:

The inclusion of a feature measuring the distance to the nearest grid cell with a hyper-statistically significant Moran’s I p-value will reveal that shorter distances increase the likelihood of attacks on civilians. This is because proximity to areas of extreme spatial clustering likely exposes the grid cell to spillover effects, where violence in one area influences and amplifies the risk of violence in neighboring cells. Conversely, greater distances might indicate lower risk due to reduced spatial influence from violent hotspots.