Regional Conflict Modelling
Heatmap of conflicts between 2007 and 2020
Overview
This project tested the hypothesis that a classification model could outperform the baseline accuracy when classifying acts of organised violence into one of four severity levels.
Introduction & Data Sourcing
This project was submitted as my final project for the Data Science Immersive course offered by General Assembly. As the capstone, the culmination of all my studies to date, I set myself several criteria for the project. Mainly, I wanted to:
Tackle a real world challenge which could provide tangible benefits if the project succeeded
Use real-world data, as opposed to a data set curated for a competition
Learn and implement at least 2 Python libraries which had not been introduced in the DSI course
In researching project ideas which would allow me to meet these three criteria, I came across the Uppsala Conflict Data Program (UCDP), which describes itself as "the world's main provider of data on organized violence and the oldest ongoing data collection project for civil war".
As part of its impressive work in collating and maintaining detailed conflict-level data, the UCDP has a wide range of datasets available to download and use. While exploring the available data, I began to wonder whether it would be possible to predict when and where conflicts would occur. Those two questions could not be answered with the data available; however, a new idea occurred to me: maybe I could create a predictive warning system which could signal when an act of organised violence producing a fatality looked likely to be the starting point of a higher-fatality clash.
A predictive model which could successfully accomplish this would be of immense value in the effort to reduce loss of civilian life. It would also allow international aid organisations, human rights groups and humanitarian agencies to better direct their resources towards civilians living in regions with high levels of conflict. It is in this spirit that I decided to take on the project, testing the hypothesis set out below.
Data Shape
The particular data set used for this project was the “UCDP Georeferenced Event Dataset (GED) Global version 21.1”, which is one of the most disaggregated data sets maintained by the UCDP.
261,864 observations
49 features
Each observation is an individual act of organised violence
Data available between years 1989 and 2020
Citations
Pettersson, Therese, Shawn Davis, Amber Deniz, Garoun Engström, Nanar Hawach, Stina Högbladh, Margareta Sollenberg & Magnus Öberg (2021). Organized violence 1989-2020, with a special emphasis on Syria. Journal of Peace Research 58(4).
Sundberg, Ralph and Erik Melander (2013) Introducing the UCDP Georeferenced Event Dataset. Journal of Peace Research 50(4).
Hypothesis
Using contextual country level data, such as national statistics and development indicators, in combination with the data available in the UCDP data set referenced above, a machine learning model will be able to outperform the baseline accuracy (25%) when classifying individual acts of organised violence as belonging to one of the following severity levels:
Low Fatality Incident - An organised incident of violence with 2 fatalities or fewer.
Moderate Fatality Incident - An organised incident of violence with more than 2 and up to 10 fatalities (inclusive).
High Fatality Incident - An organised incident of violence with more than 10 and up to 100 fatalities (inclusive).
Very High Fatality Incident - An organised incident of violence with more than 100 fatalities.
It is worth noting that these severity levels were defined entirely by me, and I would like to explore whether there are any internationally accepted standards which could be used instead.
Data Cleaning
As the core data set used for this project is very well maintained and frequently updated by the UCDP, the actual cleaning required was minimal. However, some preparation of the data was necessary, focused mainly on the areas below (a short code sketch follows the list):
Dropping features not relevant to modelling or with high percentage of missing data:
Various unique conflict identifiers
Highest granularity of conflict location (dropped due to missing data)
High and low fatality estimates (dropped due to existence of a "Best Estimate" variable for fatalities - the project target)
Conflict clarity variable (dropped as the vast majority of observations belong to the "Very Precise" class)
Source Headline variable (dropped due to missing data)
Data type conversions:
Type of conflict and active/non-active conflict variables converted from numeric codes to categorical variables for legibility
Dates converted to datetime format
Imputing missing data (kept to a minimum where possible):
Name of organisation first reporting conflict - ~10k NaN values replaced with "No Source"
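A minimal sketch of these cleaning steps is shown below; the column names (type_of_violence, active_year, date_start, date_end, source_office) are assumptions about the GED export rather than verified against it.

```python
import pandas as pd

# Sketch of the cleaning steps; column names are assumptions about the GED export
df = pd.read_csv("ged211.csv")

# Numeric conflict-type and active-conflict codes converted to categorical for legibility
df["type_of_violence"] = df["type_of_violence"].astype("category")
df["active_year"] = df["active_year"].astype("category")

# Dates converted to datetime format
df["date_start"] = pd.to_datetime(df["date_start"])
df["date_end"] = pd.to_datetime(df["date_end"])

# Name of the first reporting organisation: NaN values replaced with "No Source"
df["source_office"] = df["source_office"].fillna("No Source")
```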
Feature Creation
The target variable for the project was created using np.select, assigning each observation a severity level based on the Best Estimate fatality variable (sketched below).
Target Variable (y) = "incident_classification"
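A minimal sketch of this step, assuming the Best Estimate fatality column is named best (an assumption about the GED export):

```python
import numpy as np

# Assign each observation a severity level based on the best-estimate fatality count
conditions = [
    df["best"] <= 2,
    (df["best"] > 2) & (df["best"] <= 10),
    (df["best"] > 10) & (df["best"] <= 100),
    df["best"] > 100,
]
labels = [
    "Low Fatality Incident",
    "Moderate Fatality Incident",
    "High Fatality Incident",
    "Very High Fatality Incident",
]
df["incident_classification"] = np.select(conditions, labels)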
Feature Engineering
This stage of the project became the most challenging and time consuming, as I had to locate a wide range of country-level data (GDP, birth rates, etc.) for all 119 countries present in the core data set, with data available between the years 1989 and 2020. After several days of searching for reliable data sources, it became very clear what my main challenges were going to be:
Handling missing data. With the rate of data publication varying so much between regions and nations, there were guaranteed to be gaps in the data no matter the source.
Obtaining all the data from one source to maintain consistency
The joining of the external feature data to the core dataset
After much frustration I eventually came across TheGlobalEconomy, which had the most complete collection of economic data I could find. From this source I collected data on nearly 100 different economic, business and development indicators for each of the 119 countries present in the core UCDP data set. Although the data was now in my hands, it was exported in a wide, horizontal structure, with country and indicator names presented as columns expanding from left to right. The UCDP data set presents observations vertically, with conflicts and country names cascading from the top down, meaning some reshaping was necessary before joining the two.
Although this reshaping might not sound like a massive challenge at first, it turned out to be the most frustrating part of the entire project. Despite many attempts to reshape using stack(), unstack() and melt(), in the end the challenge was mainly resolved with elbow grease and some good old-fashioned manual Excel work. Once the reshaping was complete I could finally import the data into Pandas.
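For reference, a melt-based approach along the lines I attempted might look something like the sketch below; the file name, the "Country - Indicator" column naming scheme and the Year id column are all assumptions, which is partly why the manual route won out.

```python
import pandas as pd

# Hypothetical wide export: one row per year, one column per "Country - Indicator" pair
wide = pd.read_excel("global_economy_indicators.xlsx")

# Unpivot to long format, then split the combined column into Country and Indicator
long = wide.melt(id_vars="Year", var_name="country_indicator", value_name="value")
long[["Country", "Indicator"]] = long["country_indicator"].str.split(" - ", expand=True)

# Pivot back so each indicator becomes a column, keyed by Country and Year
tidy = (long.pivot_table(index=["Country", "Year"], columns="Indicator", values="value")
            .reset_index())
```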
Once in Pandas, it became clear that a number of the feature variables contained significant amounts of missing data the further back through the years you went. Approaching the early 1990s, many of the variables were simply too sparse to use. The question at this point was which approach was better:
Use a larger number of the external feature variables, but reduce the scope of the project by only using conflict data from the mid 2010s to 2020
Use fewer external feature variables, but include a larger amount of conflict data going back further in time
Try to strike a balance between the above options and impute data where necessary
The final decision was to take the third approach; however, I think it would be very interesting to attempt the other two for comparison. Having decided on an approach, the next step was to choose which external features to add to the UCDP data set and take forward into the modelling stages. This was done by building a correlation table comparing the various external features against the best_estimate_of_fatalities variable. From this table, the features with the strongest positive and negative correlations to the fatality variable were selected to be merged into the core data set and taken forward to modelling. Using these particular features meant reducing the UCDP conflict data to the years 2007-2020 in order to minimise missing data and imputation. Once both data sets had been reduced to this time frame, the remaining task was to merge the two on the Country and Year variables using Pandas.
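A sketch of that selection and merge step, with illustrative names (external and ucdp_columns are assumptions standing in for the reshaped indicator table and the core UCDP columns):

```python
import pandas as pd

# Join the reshaped external indicators onto the core UCDP data
combined = df.merge(external, on=["Country", "Year"], how="left")

# Correlate every numeric feature with the fatality estimate and keep the
# strongest positive and negative correlations (10 of each here, illustratively)
corr = combined.corr(numeric_only=True)["best_estimate_of_fatalities"]
corr = corr.drop("best_estimate_of_fatalities")
selected = pd.concat([corr.nlargest(10), corr.nsmallest(10)]).index.tolist()

# Restrict to 2007-2020 and carry the core columns plus the selected indicators forward
model_df = combined.loc[combined["Year"].between(2007, 2020), ucdp_columns + selected]
```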
Final List of External Features
Capital Investment as a Percent of GDP
Economic Growth: the Rate of Change of Real GDP
Savings as a Percent of GDP
Inflation: Percent Change in the Consumer Price Index
Percent of World Tourist Arrivals
Death Rate per 1000 People
Human Flight and Brain Drain Index
Government Debt as a Percent of GDP
Innovations Index
External Debt as a Percent of Gross National Income
Labor Freedom Index
Remittances as a Percent of GDP
Population Growth Percent
Banking System Z-Scores
Oil Reserves (in Billions of Barrels)
Percent of World Oil Reserves
Daily Oil Production (in Thousands of Barrels)
Trade Balance as Percent of GDP
Missing External Feature Data
With the external features merged into the core data set, I had to decide how to approach imputation where data was missing. In general, one should be careful with imputation, as the more you impute the more bias you introduce into the data. However, in order to proceed with the analysis it is a necessary evil. I approached imputation of the external feature data as a sequential three-step process (sketched in code after the list):
Using a custom function to iterate through observations and, where a NaN is encountered, fill it with the mean of the previous two observations (provided the Country variable is unchanged)
Grouping by Country and back-filling
Grouping by Country and forward-filling
This three-step process will undoubtedly have introduced some bias into the data; however, it also removed all missing values and allowed me to proceed with the project.
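A sketch of the three steps, assuming external_cols holds the merged indicator columns and the data is sorted by Country and Year; step 1 is shown here as a vectorised approximation of the custom function.

```python
# Step 1: fill a NaN with the mean of the previous two observations within the same country
df = df.sort_values(["Country", "Year"])
for col in external_cols:
    prev_two_mean = (df.groupby("Country")[col]
                       .transform(lambda s: s.rolling(2, min_periods=1).mean().shift()))
    df[col] = df[col].fillna(prev_two_mean)

# Steps 2 and 3: back-fill, then forward-fill, within each country
df[external_cols] = df.groupby("Country")[external_cols].bfill()
df[external_cols] = df.groupby("Country")[external_cols].ffill()
```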
Exploratory Data Analysis
The goal of my EDA was to get a better understanding of the data and the relationships between the variables, particularly the target variable. I also wanted to get a better sense of trends and look for outliers which might skew analysis further down the road. I began by looking at conflict fatalities to understand trends across years and regions.
Trends
These graphs clearly indicate the presence of both past and current trends in conflict fatalities. Some of the key points to take away from this analysis:
The yearly sum of conflict fatalities has been dropping sharply in the Middle East and likewise decreasing in Asia, while trending upwards in the Americas and Africa
Mean fatalities per conflict in Europe and Africa had been steadily decreasing over recent years, but reversed that trend in 2020 and are now trending upwards. Asia and the Middle East are slowly trending downwards, while the Americas have been decreasing rapidly following a significant spike in 2018
Looking at the number of incidents by Severity Class, across regions, we can see the number of Low Fatality Incidents decreased steadily between 2013 and 2017, before plateauing and beginning a slow increase. Moderate and High Fatality Incidents are slowly but steadily decreasing across the board
Correlation Table
The correlation table showed some very interesting relationships between the target variable (incident_classification) and the other features. It is worth mentioning that in the modelling stage I used both Frequency and Label encoders on the categorical predictor variables, hence the prefixes in the table above (a brief sketch of these encodings follows the correlation summary below). The target variable, incident_classification, was:
Most positively correlated with:
Population Growth in Percent
Banking System Z Scores
Oil Reserves (in Billions of Barrels)
Percent of World Oil Reserves
Daily Oil Production (in Thousands of Barrels)
Human Flight and Brain Drain Index
Most negatively correlated with:
Capital Investment as a Percent of GDP
Economic Growth: the Rate of Change of Real GDP
Savings as a Percent of GDP
Inflation: Percent Change in the Consumer Price Index
Percent of World Tourist Arrivals
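As an aside, the two encodings mentioned above work roughly as follows; the region column and the freq_/label_ prefixes here are illustrative assumptions.

```python
from sklearn.preprocessing import LabelEncoder

# Frequency encoding: replace each category with its relative frequency in the data
df["freq_region"] = df["region"].map(df["region"].value_counts(normalize=True))

# Label encoding: replace each category with an arbitrary integer code
df["label_region"] = LabelEncoder().fit_transform(df["region"])
```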
Conflict Groups
The word clouds above show which parties were involved in the various dyadic conflicts, with Side A being the groups reported by the media as initiating a particular incident of violence and Side B the groups being targeted.
Reporting Media Sources
Here we see the media sources from which conflict incidents were reported and the data sourced.
Modelling Approach
Once the EDA had been completed, it was time to move on to modelling. Given the nature of the data set, it felt important to me to use as many of the observations as possible in modelling, which proved a challenge due to the volume of data. Having carried out some research, and not wanting to delve deep into the world of cloud computing, I decided on two approaches to modelling:
- Modelling With Vaex Library
Vaex is an impressive library which allows one to process very large amounts of data on a local machine by being far more efficient with how memory is used. Using Vaex allowed me to model using all observations, as opposed to just a slice of the data. Vaex also provides wrappers for scikit-learn models, at least those which follow the usual .fit and .predict conventions.
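A minimal sketch of how such a wrapper can be used, with illustrative file, feature and target names (and assuming the severity classes have already been label-encoded to integers, as XGBoost expects):

```python
import vaex
import xgboost
from vaex.ml.sklearn import Predictor

df = vaex.open("conflicts.hdf5")                          # hypothetical exported file
df_train, df_test = df.ml.train_test_split(test_size=0.2)

features = ["population_growth_percent", "oil_reserves"]  # illustrative subset
model = Predictor(
    features=features,
    target="incident_class_encoded",                      # integer-encoded severity level
    model=xgboost.XGBClassifier(n_estimators=200),
    prediction_name="prediction",
)
model.fit(df_train)
df_test = model.transform(df_test)                        # adds a virtual "prediction" column
```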
- Traditional Modelling by Sub-sample
As Vaex was not a library covered in the course and was something I had to learn to implement on my own, I also wanted to take a more traditional approach by sub-sampling the data and running models as one usually would with scikit-learn.
As there are four incident severity levels being predicted, and (thankfully) far more lower-severity incidents in the data set than higher-severity ones, there was a class imbalance problem to solve. I addressed this using SMOTE (via the imbalanced-learn library), a technique which tackles class imbalance by over-sampling minority classes with synthetic data. For my own curiosity, and for comparison, I did run an XGBoost model in Vaex without addressing the class imbalance. The only real difference between the XGBoost run on imbalanced data and the same model run on data balanced with SMOTE was the baseline accuracy dropping from 56.8% to 25%, with the accuracy score varying only by one or two percentage points.
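A sketch of the rebalancing step with imbalanced-learn, assuming X_train and y_train are the encoded training features and target:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE

# Over-sample the minority severity classes with synthetic observations
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# With four equally represented classes, the baseline (majority-class) accuracy becomes 25%
print(pd.Series(y_resampled).value_counts(normalize=True))
```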
Model Selection
- Vaex
Although Vaex provides wrappers for a range of scikit-learn models, in the end, due to time constraints, I used only one: an XGBoost model. Why try XGBoost first, over, say, a Logistic Regression? My reasoning was:
XGBoost models work very well with larger data sets in tabular format.
More control over the model training process due to the hyperparameter fine-tuning options offered by the algorithm.
I had not used an XGBoost model in previous projects and wanted to try one.
- Sub-sample Models
The following models were selected, trained on a sub-sample of 50,000 observations and tested on 20,000 (a sketch of the grid-searched logistic regression follows the list):
Logistic Regression (+ GridSearch): The old faithful, a commonly used model for classification tasks. GridSearch was used to fine-tune the hyperparameters.
Logistic Regression (+ GridSearch, + Class Weightings): The same model and fine-tuned parameters as above, but with class weightings applied to improve classification rates for higher-severity incidents.
Multi-Layer Perceptron: Selected to compare performance of a neural network to traditional models.
Decision Tree: Selected as tree-based models tend to be well suited to classification tasks.
Bagging Model: Selected to explore model averaging.
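For the grid-searched logistic regression with class weightings referenced above, the setup looked broadly like the sketch below; the parameter grid is illustrative rather than the exact one used, and X_train/y_train stand in for the sub-sampled training split.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "solver": ["lbfgs", "saga"],
}

# class_weight="balanced" up-weights the rarer, higher-severity classes
grid = GridSearchCV(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    param_grid,
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```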
Results
The model accuracy scores were interesting for a number of reasons. Firstly, the variance of the scores was not as high as I had anticipated, with all scores centred around 60%. Secondly, the Vaex model runs differed less in score than I had expected. Lastly, the MLP performed better than I thought it would, and actually achieved the second-highest accuracy score of the models run on class-balanced data.
Key Learnings
Aside from gaining hands-on experience with a number of new libraries, as well as the XGBoost model, there were two key learnings for me with this project:
Data Science Workflow
It became very apparent with this project that the typical data science workflow is highly iterative. It is much less linear than I had first expected, or had experienced in the smaller projects I had worked on previously. It was not a case of progressing from EDA straight into modelling, for example, but rather of learning new things during the EDA which meant making tweaks to the work carried out during cleaning before finally proceeding with modelling.
Don’t Underestimate Feature Engineering
Finding and joining appropriate data from various disparate sources was a much more time consuming process than I had first expected. The cleaning and reshaping required to mould external data into a state where it could be joined was no small task and my experience during this project has highlighted to me a need to become more familiar with the Pandas reshaping functionality.