
The dataset we are using for our final project is the “Global Landslide Catalog Export” dataset, obtained from NASA’s Open Data Portal.

A summary of the dataset is available at:

The downloadable version of the dataset in the CSV format is from:

To learn more about NASA’s work on landslides, please visit the homepage:

Motivation behind using the Global Landslide Catalog Export (GLC)

Landslides cause billions of dollars in infrastructural damage and thousands of deaths worldwide. Data on past landslide events guides future disaster prevention, but we do not have a global picture of exactly when and where landslides occur.

NASA scientists have been building an open global inventory of landslides to address this problem. Knowing where and when landslides occur can help communities worldwide prepare for these disasters.

The figure below gives an overview of the landslides collected so far:

Global Landslides

Through this project, we hope to support informed decisions that could save lives and property.

About the dataset

The Global Landslide Catalog (GLC) has been compiled since 2007 at NASA Goddard Space Flight Center.
The GLC is a collection of observational studies.
The GLC was developed to identify rainfall-triggered landslide events worldwide, regardless of size, impact, or location.
The GLC considers all types of mass movements triggered by rainfall, which have been reported in the media, disaster databases, scientific reports, or other sources.

The dataset contains 31 attributes and 11,033 observations.
Each observation is a landslide.

The variables available for us to explore are listed below, along with their types:

knitr::kable(attributes, "simple", col.names = c("Attribute Names", "Attribute Type"), align = c("l", "c"), caption = "Variables of the Dataset and their Types")
Variables of the Dataset and their Types

Attribute Names              Attribute Type
---------------------------  --------------
source_name                  Text
source_link                  Website URL
event_id                     Number
event_date                   Date & Time
event_time                   Date & Time
event_title                  Text
event_description            Text
location_description         Text
location_accuracy            Text
landslide_category           Text
landslide_trigger            Text
landslide_size               Text
landslide_setting            Text
fatality_count               Number
injury_count                 Number
storm_name                   Text
photo_link                   Website URL
notes                        Text
event_import_source          Text
event_import_id              Number
country_name                 Text
country_code                 Text
admin_division_name          Text
admin_division_population    Number
gazeteer_closest_point       Text
gazeteer_distance            Number
submitted_date               Date & Time
created_date                 Date & Time
last_edited_date             Date & Time
longitude                    Number
latitude                     Number

However, we will focus only on the variables below for our project analysis:

knitr::kable(attributes_interest, "simple", col.names = c("Variable Names", "Variable Type"), align = c("l", "c"), caption = "Variables of the Dataset we will explore")
Variables of the Dataset we will explore

Variable Names               Variable Type
---------------------------  --------------
source_name                  Text
event_id                     Number
event_date                   Date & Time
event_time                   Date & Time
event_title                  Text
event_description            Text
location_description         Text
location_accuracy            Text
landslide_category           Text
landslide_trigger            Text
landslide_size               Text
fatality_count               Number
injury_count                 Number
storm_name                   Text
country_name                 Text
country_code                 Text
admin_division_name          Text
admin_division_population    Number
longitude                    Number
latitude                     Number


Through the below data analysis, we want to answer these questions:

  1. How many people were killed in the largest landslide ever recorded?
  2. Are the sizes of various landslides equally distributed in the dataset?
  3. Which countries have more than 50 recorded injuries from landslides?
  4. Is there any correlation between the numerical variables?
  5. Using hypothesis testing, is the mean fatality_count the same for two countries with the same number of landslides?
  6. Use Logistic Regression to predict the size of the landslide


Loading the Package and Data

The first step is to load the libraries required for the analysis.

If a package is not installed, please uncomment the line of R code below and execute it on your machine. The statement installs the prerequisite package.


Now that the packages are installed, we need to load the required libraries.

# Loading required libraries #

In the next step, we will create a local object that will hold our dataset.

The name of our local object will be “globalLandslide_data”.

globalLandslide_data <- as_tibble(read.csv("D:\\SJSU\\1stSem\\Study\\ISE-201\\Submissions\\ProjectProposal-2\\originaldataset\\Global_Landslide_Catalog_Export.csv"))
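
The call above reads the CSV from an absolute path on one machine. As a minimal sketch of the same loading step in a portable form, the example below writes a tiny made-up CSV to a temporary file and reads it back with base R. The file contents and the object names `default_read` and `strict_read` are hypothetical. It also shows one `read.csv()` behaviour worth knowing when counting missing values later: by default the text `NA` in any column is read as a missing value. Whether that affects this catalog is an assumption that has not been checked against the file.

```r
# Hypothetical miniature CSV standing in for the catalog export.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "event_id,country_name,country_code,fatality_count",
  "1,Namibia,NA,2",
  "2,Peru,PE,",
  "3,Nepal,NP,0"
), csv_path)

# Default behaviour: the string "NA" is treated as missing in every column.
default_read <- read.csv(csv_path, stringsAsFactors = FALSE)

# Only empty fields are treated as missing; the text "NA" is kept as-is.
strict_read <- read.csv(csv_path, stringsAsFactors = FALSE, na.strings = "")

print(default_read$country_code)
print(strict_read$country_code)
```

With a real project, placing the CSV inside the project folder and passing a relative path keeps the script runnable on other machines.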

Examining the Data

This section will use some basic steps to examine our data. Here we will also see if any changes are required in our dataset to make our work easy and smooth.

This stage can also be referred to as “Cleaning the Data.”

1. Evaluating the structure of the data

## tibble [11,033 x 31] (S3: tbl_df/tbl/data.frame)
##  $ source_name              : chr [1:11033] "AGU" "Oregonian" "CBS News" "Reuters" ...
##  $ source_link              : chr [1:11033] "" "" "" "" ...
##  $ event_id                 : int [1:11033] 684 956 973 1067 2603 4203 4290 225 236 873 ...
##  $ event_date               : chr [1:11033] "08/01/2008 12:00:00 AM" "01/02/2009 02:00:00 AM" "01/19/2007 12:00:00 AM" "07/31/2009 12:00:00 AM" ...
##  $ event_time               : logi [1:11033] NA NA NA NA NA NA ...
##  $ event_title              : chr [1:11033] "Sigou Village, Loufan County, Shanxi Province" "Lake Oswego, Oregon" "San Ramon district, 195 miles northeast of the capital, Lima, " "Dailekh district" ...
##  $ event_description        : chr [1:11033] "occurred early in morning, 11 villagers buried in 7 houses" "Hours of heavy rain are to blame for an overnight mudslide in Lake Oswego. " "(CBS/AP) At least 10 people died and as many as 80 were still missing Wednesday in central Peru after torrentia"| __truncated__ "One person was killed in Dailekh district, police said." ...
##  $ location_description     : chr [1:11033] "Sigou Village, Loufan County, Shanxi Province" "Lake Oswego, Oregon" "San Ramon district, 195 miles northeast of the capital, Lima, " "Dailekh district" ...
##  $ location_accuracy        : chr [1:11033] "unknown" "5km" "10km" "unknown" ...
##  $ landslide_category       : chr [1:11033] "landslide" "mudslide" "landslide" "landslide" ...
##  $ landslide_trigger        : chr [1:11033] "rain" "downpour" "downpour" "monsoon" ...
##  $ landslide_size           : chr [1:11033] "large" "small" "large" "medium" ...
##  $ landslide_setting        : chr [1:11033] "mine" "unknown" "unknown" "unknown" ...
##  $ fatality_count           : int [1:11033] 11 0 10 1 0 0 0 3 NA 2 ...
##  $ injury_count             : int [1:11033] NA NA NA NA NA NA NA NA NA NA ...
##  $ storm_name               : chr [1:11033] "" "" "" "" ...
##  $ photo_link               : chr [1:11033] "" "" "" "" ...
##  $ notes                    : chr [1:11033] "" "" "" "" ...
##  $ event_import_source      : chr [1:11033] "glc" "glc" "glc" "glc" ...
##  $ event_import_id          : num [1:11033] 684 956 973 1067 2603 ...
##  $ country_name             : chr [1:11033] "China" "United States" "Peru" "Nepal" ...
##  $ country_code             : chr [1:11033] "CN" "US" "PE" "NP" ...
##  $ admin_division_name      : chr [1:11033] "Shaanxi" "Oregon" "Junín" "Mid Western" ...
##  $ admin_division_population: int [1:11033] 0 36619 14708 20908 798634 2404 2126 3191 2689 0 ...
##  $ gazeteer_closest_point   : chr [1:11033] "Jingyang" "Lake Oswego" "San Ramón" "Dailekh" ...
##  $ gazeteer_distance        : num [1:11033] 41.021 0.603 0.855 0.754 2.022 ...
##  $ submitted_date           : chr [1:11033] "04/01/2014 12:00:00 AM" "04/01/2014 12:00:00 AM" "04/01/2014 12:00:00 AM" "04/01/2014 12:00:00 AM" ...
##  $ created_date             : chr [1:11033] "11/20/2017 03:17:00 PM" "11/20/2017 03:17:00 PM" "11/20/2017 03:17:00 PM" "11/20/2017 03:17:00 PM" ...
##  $ last_edited_date         : chr [1:11033] "02/15/2018 03:51:00 PM" "02/15/2018 03:51:00 PM" "02/15/2018 03:51:00 PM" "02/15/2018 03:51:00 PM" ...
##  $ longitude                : num [1:11033] 107.5 -122.7 -75.4 81.7 123.9 ...
##  $ latitude                 : num [1:11033] 32.6 45.4 -11.1 28.8 10.3 ...

The output above shows the structure of the dataset.

We have a data frame with 11,033 observations of 31 variables. The dataset mixes categorical (text) variables with discrete and continuous numerical variables.

2. Peeking at the data

Looking at the first few observations in the dataframe

## # A tibble: 6 x 31
##   source_name          source_link   event_id event_date event_time event_title 
##   <chr>                <chr>            <int> <chr>      <lgl>      <chr>       
## 1 AGU                  https://blog~      684 08/01/200~ NA         "Sigou Vill~
## 2 Oregonian            http://www.o~      956 01/02/200~ NA         "Lake Osweg~
## 3 CBS News             https://www.~      973 01/19/200~ NA         "San Ramon ~
## 4 Reuters              https://in.r~     1067 07/31/200~ NA         "Dailekh di~
## 5 The Freeman          http://www.p~     2603 10/16/201~ NA         "sitio Baki~
## 6 BusinessWorld Online http://www.b~     4203 02/16/201~ NA         "Paguite, A~
## # ... with 25 more variables: event_description <chr>,
## #   location_description <chr>, location_accuracy <chr>,
## #   landslide_category <chr>, landslide_trigger <chr>, landslide_size <chr>,
## #   landslide_setting <chr>, fatality_count <int>, injury_count <int>,
## #   storm_name <chr>, photo_link <chr>, notes <chr>, event_import_source <chr>,
## #   event_import_id <dbl>, country_name <chr>, country_code <chr>,
## #   admin_division_name <chr>, admin_division_population <int>, ...

Looking at the last few observations in the dataset

## # A tibble: 6 x 31
##   source_name   source_link       event_id event_date  event_time event_title   
##   <chr>         <chr>                <int> <chr>       <lgl>      <chr>         
## 1 St. Maries G~ http://www.gazet~    10518 03/23/2017~ NA         Mudslide abov~
## 2 The Jakarta ~ http://www.theja~    11109 04/01/2017~ NA         Major landsli~
## 3 Greater Kash~ http://www.great~    10845 03/25/2017~ NA         Barnari Sigdi~
## 4 NBC Daily     http://www.nbcda~    10973 12/15/2016~ NA         Landslide at ~
## 5 AGU Landslid~ http://blogs.agu~    10901 04/29/2017~ NA         Mayor landsli~
## 6 The Times of~ https://timesofi~    10949 03/13/2017~ NA         Kondapur Comm~
## # ... with 25 more variables: event_description <chr>,
## #   location_description <chr>, location_accuracy <chr>,
## #   landslide_category <chr>, landslide_trigger <chr>, landslide_size <chr>,
## #   landslide_setting <chr>, fatality_count <int>, injury_count <int>,
## #   storm_name <chr>, photo_link <chr>, notes <chr>, event_import_source <chr>,
## #   event_import_id <dbl>, country_name <chr>, country_code <chr>,
## #   admin_division_name <chr>, admin_division_population <int>, ...

3. Checking the summary of the Tibble

##  source_name        source_link           event_id      event_date       
##  Length:11033       Length:11033       Min.   :    1   Length:11033      
##  Class :character   Class :character   1st Qu.: 2785   Class :character  
##  Mode  :character   Mode  :character   Median : 5563   Mode  :character  
##                                        Mean   : 5599                     
##                                        3rd Qu.: 8435                     
##                                        Max.   :11221                     
##  event_time     event_title        event_description  location_description
##  Mode:logical   Length:11033       Length:11033       Length:11033        
##  NA's:11033     Class :character   Class :character   Class :character    
##                 Mode  :character   Mode  :character   Mode  :character    
##  location_accuracy  landslide_category landslide_trigger  landslide_size    
##  Length:11033       Length:11033       Length:11033       Length:11033      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##  landslide_setting  fatality_count      injury_count      storm_name       
##  Length:11033       Min.   :   0.000   Min.   :  0.000   Length:11033      
##  Class :character   1st Qu.:   0.000   1st Qu.:  0.000   Class :character  
##  Mode  :character   Median :   0.000   Median :  0.000   Mode  :character  
##                     Mean   :   3.219   Mean   :  0.752                     
##                     3rd Qu.:   1.000   3rd Qu.:  0.000                     
##                     Max.   :5000.000   Max.   :374.000                     
##                     NA's   :1385       NA's   :5674                        
##   photo_link           notes           event_import_source event_import_id 
##  Length:11033       Length:11033       Length:11033        Min.   :-111.2  
##  Class :character   Class :character   Class :character    1st Qu.:2386.5  
##  Mode  :character   Mode  :character   Mode  :character    Median :4773.0  
##                                                            Mean   :4798.6  
##                                                            3rd Qu.:7189.5  
##                                                            Max.   :9669.0  
##                                                            NA's   :1562    
##  country_name       country_code       admin_division_name
##  Length:11033       Length:11033       Length:11033       
##  Class :character   Class :character   Class :character   
##  Mode  :character   Mode  :character   Mode  :character   
##  admin_division_population gazeteer_closest_point gazeteer_distance
##  Min.   :       0          Length:11033           Min.   :  0.000  
##  1st Qu.:    1963          Class :character       1st Qu.:  2.364  
##  Median :    7365          Mode  :character       Median :  6.255  
##  Mean   :  157760                                 Mean   : 11.874  
##  3rd Qu.:   34021                                 3rd Qu.: 15.816  
##  Max.   :12691836                                 Max.   :215.449  
##  NA's   :1562                                     NA's   :1562     
##  submitted_date     created_date       last_edited_date     longitude      
##  Length:11033       Length:11033       Length:11033       Min.   :-179.98  
##  Class :character   Class :character   Class :character   1st Qu.:-107.87  
##  Mode  :character   Mode  :character   Mode  :character   Median :  19.69  
##                                                           Mean   :   2.52  
##                                                           3rd Qu.:  93.95  
##                                                           Max.   : 179.99  
##     latitude     
##  Min.   :-46.77  
##  1st Qu.: 13.92  
##  Median : 30.53  
##  Mean   : 25.88  
##  3rd Qu.: 40.87  
##  Max.   : 72.63  
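
The outputs in steps 1 to 3 are the kind produced by R's standard inspection functions, although the calls themselves are not shown above. A minimal sketch on a small made-up data frame follows; `toy` and its values are hypothetical and not taken from the catalog.

```r
# Hypothetical toy data standing in for the landslide tibble.
toy <- data.frame(
  country_name   = c("China", "Peru", "Nepal", "India"),
  landslide_size = c("large", "large", "medium", "small"),
  fatality_count = c(11L, 10L, 1L, NA),
  stringsAsFactors = FALSE
)

str(toy)                      # column types and the first few values
first_rows <- head(toy, 2)    # first observations
last_rows  <- tail(toy, 2)    # last observations
toy_summary <- summary(toy)   # per-column summary, including NA counts

print(first_rows)
print(last_rows)
print(toy_summary)
```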

Quality Check and Data Cleaning

In this section, we will be checking the quality of our data and cleaning our dataset, if required.

1. Checking the Variable Name

# Checking the names of the Columns 

Our variable names are meaningful and make it clear which feature each column describes.

2. Removing the Columns which we are not including in our analysis

Let us remove the columns which we are not going to focus on for our further analysis:

globalLandslide_df <- globalLandslide_data[!(colnames(globalLandslide_data) %in% c("source_link", "landslide_setting", "photo_link", "notes", 
                "event_import_source", "event_import_id", "gazeteer_closest_point", "gazeteer_distance", "submitted_date", "created_date",
                "last_edited_date"))]

Looking at the structure of the dataframe we will be proceeding with:

## tibble [11,033 x 20] (S3: tbl_df/tbl/data.frame)
##  $ source_name              : chr [1:11033] "AGU" "Oregonian" "CBS News" "Reuters" ...
##  $ event_id                 : int [1:11033] 684 956 973 1067 2603 4203 4290 225 236 873 ...
##  $ event_date               : chr [1:11033] "08/01/2008 12:00:00 AM" "01/02/2009 02:00:00 AM" "01/19/2007 12:00:00 AM" "07/31/2009 12:00:00 AM" ...
##  $ event_time               : logi [1:11033] NA NA NA NA NA NA ...
##  $ event_title              : chr [1:11033] "Sigou Village, Loufan County, Shanxi Province" "Lake Oswego, Oregon" "San Ramon district, 195 miles northeast of the capital, Lima, " "Dailekh district" ...
##  $ event_description        : chr [1:11033] "occurred early in morning, 11 villagers buried in 7 houses" "Hours of heavy rain are to blame for an overnight mudslide in Lake Oswego. " "(CBS/AP) At least 10 people died and as many as 80 were still missing Wednesday in central Peru after torrentia"| __truncated__ "One person was killed in Dailekh district, police said." ...
##  $ location_description     : chr [1:11033] "Sigou Village, Loufan County, Shanxi Province" "Lake Oswego, Oregon" "San Ramon district, 195 miles northeast of the capital, Lima, " "Dailekh district" ...
##  $ location_accuracy        : chr [1:11033] "unknown" "5km" "10km" "unknown" ...
##  $ landslide_category       : chr [1:11033] "landslide" "mudslide" "landslide" "landslide" ...
##  $ landslide_trigger        : chr [1:11033] "rain" "downpour" "downpour" "monsoon" ...
##  $ landslide_size           : chr [1:11033] "large" "small" "large" "medium" ...
##  $ fatality_count           : int [1:11033] 11 0 10 1 0 0 0 3 NA 2 ...
##  $ injury_count             : int [1:11033] NA NA NA NA NA NA NA NA NA NA ...
##  $ storm_name               : chr [1:11033] "" "" "" "" ...
##  $ country_name             : chr [1:11033] "China" "United States" "Peru" "Nepal" ...
##  $ country_code             : chr [1:11033] "CN" "US" "PE" "NP" ...
##  $ admin_division_name      : chr [1:11033] "Shaanxi" "Oregon" "Junín" "Mid Western" ...
##  $ admin_division_population: int [1:11033] 0 36619 14708 20908 798634 2404 2126 3191 2689 0 ...
##  $ longitude                : num [1:11033] 107.5 -122.7 -75.4 81.7 123.9 ...
##  $ latitude                 : num [1:11033] 32.6 45.4 -11.1 28.8 10.3 ...

We will be proceeding with globalLandslide_df, which has 11,033 observations on 20 variables.

3. Checking for missing data

# Calculate the total numbers of "Not Available" data 
## [1] 19656

Many cells in our dataset are missing. Let’s see which columns contain missing values:

names(which(colSums(is.na(globalLandslide_df)) > 0))
## [1] "event_time"                "fatality_count"           
## [3] "injury_count"              "country_code"             
## [5] "admin_division_population"
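
A minimal sketch of the two counts used in this step, on made-up data (`toy` and its values are hypothetical). It also counts empty strings separately, because `is.na()` does not treat `""` as missing, and several text columns in the structure output above contain `""`.

```r
# Hypothetical toy data with both NA values and empty strings.
toy <- data.frame(
  event_time     = c(NA, NA, NA),
  fatality_count = c(11L, NA, 0L),
  storm_name     = c("", "", "Agatha"),
  stringsAsFactors = FALSE
)

total_missing <- sum(is.na(toy))                      # all NA cells
na_per_column <- colSums(is.na(toy))                  # NA cells per column
cols_with_na  <- names(which(na_per_column > 0))      # columns that contain NA

# Empty strings are not NA, so they need their own count.
empty_per_column <- sapply(toy, function(col) sum(!is.na(col) & col == ""))

print(total_missing)
print(cols_with_na)
print(empty_per_column)
```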

4. Handling Missing Data

Let us address the above columns with missing data one-by-one

# Handling missing data in "injury_count" column
globalLandslide_df$injury_count[is.na(globalLandslide_df$injury_count)] <- 0

# Handling missing data in "admin_division_population" column
globalLandslide_df$admin_division_population[is.na(globalLandslide_df$admin_division_population)] <- 0

# Handling missing data in "fatality_count" column
## # A tibble: 9,648 x 20
##    source_name          event_id event_date event_time event_title event_descripti~
##    <chr>                   <int> <chr>      <lgl>      <chr>       <chr>           
##  1 AGU                       684 08/01/200~ NA         "Sigou Vil~ "occurred early~
##  2 Oregonian                 956 01/02/200~ NA         "Lake Oswe~ "Hours of heavy~
##  3 CBS News                  973 01/19/200~ NA         "San Ramon~ "(CBS/AP) At le~
##  4 Reuters                  1067 07/31/200~ NA         "Dailekh d~ "One person was~
##  5 The Freeman              2603 10/16/201~ NA         "sitio Bak~ "Another landsl~
##  6 BusinessWorld Online     4203 02/16/201~ NA         "Paguite, ~ "Thursday’s l~
##  7 The Spokesman-Review     4290 03/30/201~ NA         "Pend Orei~ "In Pend Oreill~
##  8 Crónica Diaria           225 09/02/200~ NA         "3 killed ~ "3 killed, incl~
##  9 UPI                       873 11/01/200~ NA         "Lincang C~ "The report sai~
## 10 BBC News                  874 11/01/200~ NA         "Kunming, ~ "Yunnan has so ~
## # ... with 9,638 more rows, and 14 more variables: location_description <chr>,
## #   location_accuracy <chr>, landslide_category <chr>, landslide_trigger <chr>,
## #   landslide_size <chr>, fatality_count <int>, injury_count <dbl>,
## #   storm_name <chr>, country_name <chr>, country_code <chr>,
## #   admin_division_name <chr>, admin_division_population <dbl>,
## #   longitude <dbl>, latitude <dbl>
# Handling missing data for "country_name"
globalLandslide_df <- globalLandslide_df[grep('^[A-Za-z]', globalLandslide_df$country_name),]
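
The code for the fatality_count step is not shown above. Its output has 9,648 rows, which equals 11,033 minus the 1,385 missing fatality counts reported in the summary, so it is consistent with dropping those rows. The sketch below illustrates both strategies used in this section on made-up data; `toy` and `toy_clean` are hypothetical names, and treating a missing injury count as zero is an assumption about what "not reported" means.

```r
# Hypothetical toy data with missing counts.
toy <- data.frame(
  fatality_count = c(11L, NA, 0L, 3L),
  injury_count   = c(NA, NA, 2L, NA)
)

# Strategy 1: replace missing injuries with 0 (assumes "not reported" means none).
toy$injury_count[is.na(toy$injury_count)] <- 0

# Strategy 2: drop every row whose fatality_count is missing.
toy_clean <- toy[!is.na(toy$fatality_count), ]

print(toy_clean)
```

Zero-filling keeps every row but can bias averages downward, while dropping rows keeps only observed values at the cost of sample size.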

5. Checking for outliers

To check whether our dataset contains outliers, we will use boxplots, which make them easy to spot.

## Warning in (function (z, notch = FALSE, width = NULL, varwidth = FALSE, : some
## notches went outside hinges ('box'): maybe set notch=FALSE

The plot above shows that the fatality count of almost every landslide is below 1,000, but one event recorded 5,000 deaths.
It must have been an enormous disaster.

This answers Question 1: 5,000 people were killed in the largest landslide disaster recorded in the dataset.
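
The same check can be made numerically. The sketch below uses `boxplot.stats()`, the base R function that computes the statistics a boxplot draws, to list the points a boxplot would flag as outliers. The vector `fatalities` is made up for illustration and is not taken from the catalog.

```r
# Hypothetical fatality counts with one extreme event.
fatalities <- c(0, 0, 0, 1, 1, 2, 3, 5, 11, 5000)

stats    <- boxplot.stats(fatalities)
outliers <- stats$out          # points beyond 1.5 times the hinge spread
worst    <- max(fatalities)    # the single largest count

print(outliers)
print(worst)
```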

Data Transformation

1. Separating the event_date into Date and Time

We saw in the above section that one of the columns with missing values is “event_time”.

Let’s check how many cells of “event_time” are empty.

## [1] 9471

The event_time variable contains no data: all 9,471 rows remaining after cleaning are empty.

This transformation step will get the time value from the “event_date” variable.

# Checking the values of the event_date column
head(globalLandslide_df$event_date, 5)
## [1] "08/01/2008 12:00:00 AM" "01/02/2009 02:00:00 AM" "01/19/2007 12:00:00 AM"
## [4] "07/31/2009 12:00:00 AM" "10/16/2010 12:00:00 PM"

The first five records of “event_date” show that it also contains the time, so we can recover the time from this column.

Let’s separate the time from the “event_date” and store it into the “event_time” column

# Splitting the event_date at the first space (assumes the stringr package)
ev_dates <- as.data.frame(stringr::str_split_fixed(globalLandslide_df$event_date, " ", 2))
colnames(ev_dates) <- c("DATE", "TIME")
head(ev_dates, 10)
##          DATE        TIME
## 1  08/01/2008 12:00:00 AM
## 2  01/02/2009 02:00:00 AM
## 3  01/19/2007 12:00:00 AM
## 4  07/31/2009 12:00:00 AM
## 5  10/16/2010 12:00:00 PM
## 6  02/16/2012 12:00:00 AM
## 7  03/30/2012 12:00:00 AM
## 8  09/02/2007 12:00:00 AM
## 9  09/05/2007 12:00:00 AM
## 10 11/01/2008 12:00:00 AM

Now we store the “TIME” column of ev_dates in “event_time” and the “DATE” column of ev_dates in “event_date”.

globalLandslide_df$event_date <- ev_dates$DATE
globalLandslide_df$event_time <- ev_dates$TIME

Now, let’s check the missing data in the event_time

## [1] 0

In the above step, we performed data transformation for the “event_date” and “event_time” variables.
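
For comparison, the same split can be sketched in base R without extra packages. The three timestamps are copied from the output above; the names `date_part`, `time_part`, and `parsed` are hypothetical. Parsing the date text with `as.Date()` is an optional extra step that the analysis above does not perform.

```r
event_date <- c("08/01/2008 12:00:00 AM",
                "01/02/2009 02:00:00 AM",
                "10/16/2010 12:00:00 PM")

# Everything before the first space is the date, everything after it is the time.
date_part <- sub(" .*$", "", event_date)
time_part <- sub("^[^ ]* ", "", event_date)

# Optional: turn the month/day/year text into Date objects.
parsed <- as.Date(date_part, format = "%m/%d/%Y")

print(data.frame(DATE = date_part, TIME = time_part))
print(parsed)
```

Keeping dates as `Date` objects rather than text makes later steps such as sorting or grouping by year straightforward.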

# Removing the landslide size which is 'unknown'
globalLandslide_df <- globalLandslide_df[!grepl('unknown',globalLandslide_df$landslide_size),]

Visual and Descriptive Analysis

1. Analyzing single Categorical Variables

Now, let us start by analyzing our 2 Categorical Variables - “landslide_size” and “landslide_category.”

The query below returns a frequency table of the “landslide_size” categories.

##      large     medium      small very_large 
##        632       6039       1908         86

The output shows that our data is not equally distributed among the different landslide sizes. There are also nine landslides in our dataset whose size is not determined.

Let’s see this in a Visual format.

# Create a bar graph of landslide sizes
ggplot(data = globalLandslide_df) + 
  geom_bar(mapping = aes(x = landslide_size))

The above bar graph makes it clear that medium-sized landslides are by far the most common in our dataset, followed by small, large, and very_large landslides.

This answers Question 2. No, the landslide sizes are not equally distributed in our dataset.
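
The visual impression can be backed by a formal check. As a sketch that is not part of the original analysis, a chi-square goodness-of-fit test compares the observed counts with the counts expected if all four sizes were equally likely. The counts are copied from the frequency table above; `size_counts` and `fit` are hypothetical names.

```r
# Counts copied from the landslide_size frequency table above.
size_counts <- c(large = 632, medium = 6039, small = 1908, very_large = 86)

# With a plain vector, chisq.test() tests against equal probabilities by default.
fit <- chisq.test(size_counts)

print(fit)
print(round(size_counts / sum(size_counts), 3))   # observed shares per size
```

A very small p-value means the counts are not compatible with an equal split across sizes.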

In the above plot, we can see that the number of very_large landslides is very small, so we can merge the large and very_large landslides into a single “large” category.

globalLandslide_df$landslide_size[globalLandslide_df$landslide_size == "very_large"] <- "large"
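
A minimal sketch of the frequency table and the merge on a made-up vector (`size` and its values are hypothetical), showing the counts before and after recoding:

```r
# Hypothetical landslide sizes.
size <- c("medium", "small", "large", "medium", "very_large", "medium", "small")

counts <- table(size)                       # frequency per category
shares <- round(prop.table(counts), 2)      # the same as proportions

# Merge "very_large" into "large", as done above.
size[size == "very_large"] <- "large"
merged_counts <- table(size)

print(counts)
print(shares)
print(merged_counts)
```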

2. Analyzing the spread of landslides based on the category

Now, let us plot the landslides by category

# Create a bar graph of landslide categories

p <- ggplot(data = globalLandslide_df) + 
  geom_bar(mapping = aes(x = landslide_category))

p + theme(axis.text.x = element_text(color="#993333", 
                           size=10, angle=45),
          axis.text.y = element_text(color="#993333", 
                           size=10, angle=45))

From the above plot, we can see that most events are categorized as plain landslides. Mudslides and rock falls also make up a significant share of our dataset.

3. Analyzing the number of injuries

# Create a boxplot of injury counts
boxplot(
  x = globalLandslide_df$injury_count,
  xlab = "Number of Injuries",
  horizontal = TRUE,
  main = "BoxPlot of Injury Count recorded on the Landslide dataset"
)

The above Box Plot shows the spread of the injury count across our data frame

4. Plotting only those countries with more than 50 total injuries on a pie chart

landslide_distribution <- globalLandslide_df %>% 
  group_by(country_name) %>% 
  summarise(sum_injuries = sum(injury_count)) %>% 
  arrange(desc(sum_injuries))  # sort descending, matching the output below

## # A tibble: 141 x 2
##    country_name    sum_injuries
##    <chr>                  <dbl>
##  1 Guatemala                408
##  2 China                    318
##  3 Peru                     277
##  4 Nepal                    257
##  5 Myanmar [Burma]          224
##  6 India                    217
##  7 Bangladesh               170
##  8 Philippines              138
##  9 Indonesia                129
## 10 Brazil                   103
## # ... with 131 more rows
# Keep only the countries whose total injury count exceeds 50
graph_data <- landslide_distribution[landslide_distribution$sum_injuries > 50, ]
x <- graph_data$sum_injuries
yy <- graph_data$country_name

pie(x, yy, main = "Countries with more than 50 total injuries", edges = 10) 

The above plot shows the countries whose total injury_count, summed over all their landslides, is more than 50. This answers Question 3.
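
The totals above are sums over all landslides in a country, which is not the same as the largest injury count in a single landslide. The sketch below shows both quantities in base R on made-up data, using `aggregate()` as a stand-in for the `group_by()` and `summarise()` pipeline. The data frame `toy` and the names `totals`, `over_50`, and `max_single` are hypothetical.

```r
# Hypothetical per-landslide injury counts.
toy <- data.frame(
  country_name = c("Guatemala", "Guatemala", "Peru", "Peru", "Nepal"),
  injury_count = c(40, 30, 20, 10, 60),
  stringsAsFactors = FALSE
)

# Total injuries per country, sorted from largest to smallest.
totals <- aggregate(injury_count ~ country_name, data = toy, FUN = sum)
names(totals)[2] <- "sum_injuries"
totals <- totals[order(-totals$sum_injuries), ]

# Countries whose total exceeds 50.
over_50 <- totals[totals$sum_injuries > 50, ]

# Largest single-landslide injury count per country, for comparison.
max_single <- aggregate(injury_count ~ country_name, data = toy, FUN = max)

print(over_50)
print(max_single)
```

In this toy data one country passes the threshold of 50 only through its total, not through any single event, which is why the two readings of the question can give different answers.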

Hypothesis testing

In this section, we will move on to the correlation and hypothesis questions listed above.

First, let’s isolate the variables with character data types

data_char<-globalLandslide_df %>% dplyr::select(where(is.character))

for(i in colnames(data_char)){
Studying correlation

In order to study the correlation in the dataset, let’s take the numerical variables of importance

numerical_data <- globalLandslide_df[, c('fatality_count', 'injury_count', 'longitude', 'latitude')] # Numerical variables

# Removing na values
numerical_data <- na.omit(numerical_data)
## corrplot 0.90 loaded
corr <- cor(numerical_data)
corrplot(corr, type="upper", order="hclust",
         col=brewer.pal(n=8, name="RdYlBu"))

The above plot answers Question 4. The correlation between fatality_count and injury_count is very weak; the only noticeable correlation is between latitude and longitude.
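
The plot is drawn from a plain correlation matrix, which can also be read as numbers. A minimal sketch with made-up values follows; `toy` and `complete_rows` are hypothetical names, and `cor()` computes Pearson correlation by default.

```r
# Hypothetical numerical columns, with one missing latitude.
toy <- data.frame(
  fatality_count = c(0, 1, 3, 10, 2, 0),
  injury_count   = c(0, 0, 5, 2, 1, 0),
  longitude      = c(-122.7, -75.4, 81.7, 107.5, 123.9, 19.7),
  latitude       = c(45.4, -11.1, 28.8, 32.6, 10.3, NA)
)

complete_rows <- na.omit(toy)   # cor() returns NA for columns that contain NA
corr <- cor(complete_rows)      # symmetric matrix with values between -1 and 1

print(round(corr, 2))
```

Values near 0 indicate little linear relationship, and values near 1 or -1 indicate a strong one.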

Hypothesis 1: Is the mean fatality_count in one country equal to the mean fatality_count in another country with the same number of landslides?
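
For orientation, the comparison this hypothesis calls for can be sketched with a two-sample t-test. The test choice is an assumption, since the source does not name the test it uses, and the two vectors below are made-up fatality counts rather than catalog data. The null hypothesis is that the two means are equal.

```r
# Hypothetical fatality counts for two countries with the same number of landslides.
country_a <- c(0, 1, 0, 3, 2, 0, 5, 1)
country_b <- c(0, 0, 1, 0, 2, 1, 0, 0)

# H0: the two means are equal. H1: the two means differ.
# t.test() runs Welch's test by default (var.equal = FALSE).
result <- t.test(country_a, country_b)

print(result)

# Reject H0 at the 5% level only if the p-value is below 0.05.
reject_h0 <- result$p.value < 0.05
print(reject_h0)
```

Fatality counts are heavily skewed, so a rank-based alternative such as `wilcox.test()` may be worth considering alongside the t-test.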

First, let’s take the frequency table of the various countries

country_tbl <- table(data_char$country_name)
country_tbl <- sort(country_tbl, decreasing = TRUE)
##                    United States                            India 
##                             2224                             1261 
##                      Philippines                            Nepal 
##                              669                              479 
##                            China                        Indonesia 
##                              425                              350 
##                   United Kingdom                           Brazil 
##                              225                              214 
##                           Canada                         Malaysia 
##                              173                              166 
##                         Pakistan                          Vietnam 
##                              141                              116 
##                      New Zealand                        Australia 
##                              106                              105 
##                         Colombia                           Mexico 
##                              101                               86 
##                        Guatemala                            Japan 
##                               82                               82 
##                         Thailand                       Costa Rica 
##                               77                               76 
##                        Sri Lanka                           Taiwan 
##                               75                               66 
##              Trinidad and Tobago                       Bangladesh 
##                               65                               58 
##                             Peru                            Italy 
##                               58                               55 
##                            Kenya                           Uganda 
##                               53                               45 
##                           Panama                  Myanmar [Burma] 
##                               44                               43 
##                          Georgia                         Honduras 
##                               39                               39 
##                          Jamaica                         Bulgaria 
##                               37                               35 
##                             Fiji                          Ecuador 
##                               34                               32 
##                       Kyrgyzstan                        Nicaragua 
##                               32                               31 
##                          Ireland                      El Salvador 
##                               23                               22 
##                           France                            Haiti 
##                               22                               22 
##                           Norway                       Azerbaijan 
##                               22                               21 
##                 Papua New Guinea                     South Africa 
##                               21                               21 
##                           Turkey                        Venezuela 
##                               21                               21 
##                           Bhutan                       Tajikistan 
##                               20                               20 
##                      Switzerland               Dominican Republic 
##                               19                               17 
##                         Dominica                      Afghanistan 
##                               16                               15 
##                          Nigeria                           Brunei 
##                               15                               14 
##                            Chile                            Spain 
##                               14                               14 
##                           Russia           Bosnia and Herzegovina 
##                               13                               12 
##                          Bolivia                      South Korea 
##                               11                               11 
##                          Austria                          Lebanon 
##                               10                                9 
##                      Saint Lucia                        Argentina 
##                                8                                7 
##                            Ghana                      Ivory Coast 
##                                7                                7 
##                       Madagascar                      Puerto Rico 
##                                7                                7 
##                     Sierra Leone                            Yemen 
##                                7                                7 
##                             Iran                         Portugal 
##                                6                                6 
##                           Rwanda                   American Samoa 
##                                6                                5 
##                           Greece Saint Vincent and the Grenadines 
##                                5                                5 
##                     Saudi Arabia                           Serbia 
##                                5                                5 
##                  Solomon Islands                         Tanzania 
##                                5                                5 
##                          Armenia                         Cameroon 
##                                4                                4 
##                           Guinea                          Iceland 
##                                4                                4 
##                      Isle of Man                             Laos 
##                                4                                4 
##                        Macedonia                      North Korea 
##                                4                                4 
##                           Angola                             Cuba 
##                                3                                3 
## Democratic Republic of the Congo                          Germany 
##                                3                                3 
##                          Romania                          Ukraine 
##                                3                                3 
##                          Bermuda                          Croatia 
##                                2                                2 
##                          Czechia                       East Timor 
##                                2                                2 
##                         Ethiopia                          Grenada 
##                                2                                2 
##                        Hong Kong                           Israel 
##                                2                                2 
##                          Liberia                       Luxembourg 
##                                2                                2 
##                          Namibia                           Poland 
##                                2                                2 
##                         Slovakia              U.S. Virgin Islands 
##                                2                                2 
##                          Vanuatu                          Albania 
##                                2                                1 
##                          Algeria                         Barbados 
##                                1                                1 
##                           Belize                     Burkina Faso 
##                                1                                1 
##                          Burundi                         Cambodia 
##                                1                                1 
##                   Czech Republic                            Egypt 
##                                1                                1 
##                            Gabon                             Guam 
##                                1                                1 
##                           Jersey                           Jordan 
##                                1                                1 
##                       Kazakhstan                           Malawi 
##                                1                                1 
##                        Mauritius                         Mongolia 
##                                1                                1 
##                       Montenegro                          Morocco 
##                                1                                1 
##                             Oman                         Paraguay 
##                                1                                1 
##            Republic of the Congo            Saint Kitts and Nevis 
##                                1                                1 
##                        Singapore                         Slovenia 
##                                1                                1 
##                            Sudan                        Swaziland 
##                                1                                1 
##             United Arab Emirates                       Uzbekistan 
##                                1                                1 
##                           Zambia 
##                                1

Above is the frequency count of how many times a country had a landslide recorded.
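
A table like the one above, together with the groups of countries that share the same count, could be produced roughly as follows. This is a minimal sketch on invented toy data; `toy`, `country_counts`, and `tied` are hypothetical names, and only the column name `country_name` comes from the dataset.

```r
# Toy stand-in for globalLandslide_df (values invented for illustration)
toy <- data.frame(
  country_name = c("Japan", "Guatemala", "Japan", "Fiji", "Guatemala", "Nigeria")
)

# Number of recorded landslides per country, largest first
country_counts <- sort(table(toy$country_name), decreasing = TRUE)

# Group country names by their count and keep counts shared by 2+ countries
tied <- split(names(country_counts), as.integer(country_counts))
tied <- tied[lengths(tied) > 1]

print(country_counts)
print(tied)
```

Countries that appear in the same element of `tied` have the same number of records, which makes them candidates for the pairwise comparison that follows.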


Looking at the above distribution, we see that Guatemala and Japan had the same number of landslides. Now, let’s see whether their mean fatality counts differ.

data_globalLandslide_Japan <- filter(globalLandslide_df,country_name== "Japan")
data_globalLandslide_Guatemala <- filter(globalLandslide_df, country_name =="Guatemala")

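Now, let’s perform the t-test. The four values printed first are the mean fatality_count for Japan and for Guatemala, followed by the variance for Japan and for Guatemala.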

## [1] 3.097222
## [1] 9.525641
## [1] 99.52563
## [1] 1789.707
(t.test(data_globalLandslide_Guatemala$fatality_count,data_globalLandslide_Japan$fatality_count, alternative = "two.sided", var.equal = FALSE))
##  Welch Two Sample t-test
## data:  data_globalLandslide_Guatemala$fatality_count and data_globalLandslide_Japan$fatality_count
## t = 1.3033, df = 86.218, p-value = 0.1959
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.376251 16.233089
## sample estimates:
## mean of x mean of y 
##  9.525641  3.097222

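The sample means of X (Guatemala, 9.53) and Y (Japan, 3.10) look very different even though the number of landslides was the same. However, the p-value of 0.1959 is above 0.05 and the 95 percent confidence interval (-3.38 to 16.23) includes 0, so we cannot reject the null hypothesis that the two means are equal. The difference is not statistically significant.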
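
The Welch statistic reported above can be reproduced by hand from the printed means and variances, which helps when reading the output. The sketch below assumes 78 and 72 non-missing fatality counts for Guatemala and Japan. These sample sizes are not printed in the report; they are inferred from the reported means and degrees of freedom, so treat them as an assumption.

```r
# Summary statistics as printed in the output above
mean_x <- 9.525641; var_x <- 1789.707   # Guatemala
mean_y <- 3.097222; var_y <- 99.52563   # Japan

# ASSUMPTION: non-missing sample sizes, inferred rather than reported
n_x <- 78
n_y <- 72

# Squared standard errors of each sample mean
se2_x <- var_x / n_x
se2_y <- var_y / n_y

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat   <- (mean_x - mean_y) / sqrt(se2_x + se2_y)
welch_df <- (se2_x + se2_y)^2 /
  (se2_x^2 / (n_x - 1) + se2_y^2 / (n_y - 1))

# Two-sided p-value and 95 percent confidence interval
p_value <- 2 * pt(-abs(t_stat), df = welch_df)
ci <- (mean_x - mean_y) +
  c(-1, 1) * qt(0.975, df = welch_df) * sqrt(se2_x + se2_y)

print(c(t = t_stat, df = welch_df, p = p_value))
print(ci)
```

A p-value above the chosen significance level (commonly 0.05) means the null hypothesis of equal means is not rejected, however far apart the two sample means appear.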


Now, let’s take another two countries with the same number of landslides: Afghanistan and Nigeria, with 15 landslides each.

data_globalLandslide_Afghanistan <- filter(globalLandslide_df,country_name== "Afghanistan")
data_globalLandslide_Nigeria <- filter(globalLandslide_df, country_name =="Nigeria")

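Now, let’s perform the t-test on the fatality_count. The four values printed first are the mean fatality_count for Afghanistan and for Nigeria, followed by the variance for Afghanistan and for Nigeria.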

## [1] 191.1667
## [1] 2.272727
## [1] 362483.2
## [1] 6.418182
(t.test(data_globalLandslide_Afghanistan$fatality_count,data_globalLandslide_Nigeria$fatality_count, alternative = "two.sided", var.equal = FALSE))
##  Welch Two Sample t-test
## data:  data_globalLandslide_Afghanistan$fatality_count and data_globalLandslide_Nigeria$fatality_count
## t = 1.0868, df = 11, p-value = 0.3004
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -193.6423  571.4302
## sample estimates:
##  mean of x  mean of y 
## 191.166667   2.272727

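Here as well, the sample means of fatality_count differ greatly (191.17 for Afghanistan against 2.27 for Nigeria). However, the variance for Afghanistan is very large, the p-value of 0.3004 is above 0.05, and the 95 percent confidence interval (-193.64 to 571.43) includes 0, so again we cannot reject the null hypothesis that the two means are equal.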

Fitting a Logistic Regression model to predict landslide_size based on Country name and fatality_count

sizeTbl <- table(data_char$landslide_size)
sizeTbl <- sort(sizeTbl, decreasing = T)
## medium  small  large 
##   6039   1908    718

As we have 3 categories in our landslide_size, we will fit a multinomial logistic regression using the multinom() function from the nnet package.

## Loading required package: nnet
globalLandslide_df$landslide_size <- as.factor(globalLandslide_df$landslide_size)

globalLandslide_df$landslide_size <- relevel(globalLandslide_df$landslide_size, ref = "small")

(test <- multinom(landslide_size ~ fatality_count+country_name, data = globalLandslide_df))

The residual deviance reported for this model measures how much of the variation in landslide_size the fitted model leaves unexplained, and here it is very high.

In this case, the logistic regression did not fit the data well. Using fatality_count and country_name, we were not able to predict whether a landslide is medium or large relative to the baseline category, small.
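
One way to judge whether a residual deviance is "high" is to compare it with the deviance of an intercept-only model and to compare the model's accuracy with the share of the most common class. The following is a minimal sketch on simulated data, not the fit reported above. The data frame `toy` and its values are invented, and the predictors are deliberately unrelated to the outcome, so little improvement is expected.

```r
library(nnet)  # recommended package that ships with standard R distributions

set.seed(42)
n <- 300

# Simulated stand-in for globalLandslide_df (values invented for illustration)
toy <- data.frame(
  fatality_count = rpois(n, lambda = 3),
  country_name   = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  landslide_size = factor(sample(c("small", "medium", "large"), n,
                                 replace = TRUE, prob = c(0.2, 0.7, 0.1)))
)
toy$landslide_size <- relevel(toy$landslide_size, ref = "small")

# Intercept-only model versus the model with both predictors
null_fit <- multinom(landslide_size ~ 1, data = toy, trace = FALSE)
full_fit <- multinom(landslide_size ~ fatality_count + country_name,
                     data = toy, trace = FALSE)

# How much the predictors reduce the residual deviance
deviance_drop <- null_fit$deviance - full_fit$deviance

# In-sample accuracy versus always predicting the most common class
pred     <- predict(full_fit, newdata = toy, type = "class")
accuracy <- mean(pred == toy$landslide_size)
baseline <- max(prop.table(table(toy$landslide_size)))

print(c(deviance_drop = deviance_drop, accuracy = accuracy, baseline = baseline))
```

If the drop in deviance is small and the accuracy is no better than the baseline, the predictors carry little information about the size category.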

Summary of findings and questions for further analysis

  1. We answered Question-1 (What was the landslide’s maximum death count so far?) by plotting the fatality_count and looking at the maximum outlier. The record could be erroneous, but if it is correct, the maximum death count is 5000.

  2. We answered Question-2 (Are the sizes of various landslides equally distributed in the dataset?)
    No, the dataset is heavily skewed towards “medium” landslides, which account for the largest number of records (6039, against 1908 small and 718 large). Hence the various landslide sizes are not equally distributed in the dataset.

  3. We answered Question-3 (What are the countries with more than 50 injured recorded in any landslide?) and used the pie chart above to show the countries where more than 50 people were injured in a single landslide.

  4. We answered Question-4 (Is there any correlation between the numerical variables?)
    We only took the fatality_count, injury_count, latitude, and longitude to study the correlation between the numerical variables. We found no significant correlation between the location (given by latitude and longitude) and fatality_count or injury_count. We also found only a very weak correlation between injury_count and fatality_count.

  5. For our Question-5 (Perform hypothesis testing to see if the mean of the fatality_count of any two countries with the same number of landslides will be the same or not), we performed a Welch t-test on two pairs of countries with the same number of landslides. In both cases the sample means of fatality_count were far apart (9.53 against 3.10 for Guatemala and Japan, and 191.17 against 2.27 for Afghanistan and Nigeria), but the p-values (0.1959 and 0.3004) were above 0.05. So although one country in each pair recorded more deaths on average, the differences are not statistically significant and we could not reject the hypothesis of equal means.

  6. For our Question-6 (Use Logistic Regression to predict the size of the landslide), we used multinomial logistic regression to predict the size of the landslide based on the country name and fatality_count. The model fit poorly, so we could not predict the size of the landslide and could not build an effective model from these two variables.
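
The correlation check described in item 4 could be sketched roughly as follows. The toy values are invented for illustration; only the four column names come from the analysis above. Missing values are handled with `use = "pairwise.complete.obs"`, which is one reasonable choice and not necessarily the one used in this report.

```r
# Toy stand-in for the numerical columns (values invented for illustration)
toy <- data.frame(
  fatality_count = c(0, 2, NA, 5, 1, 0, 3, 1),
  injury_count   = c(1, NA, 0, 3, 0, 2, 4, 0),
  latitude       = c(35.7, 14.6, 9.1, 34.5, 27.5, -17.7, 46.8, 60.5),
  longitude      = c(139.7, -90.5, 7.4, 69.2, 90.4, 178.1, 8.2, 8.5)
)

# Pearson correlation matrix, using every pair of rows where both values exist
cor_mat <- cor(toy, use = "pairwise.complete.obs")

print(round(cor_mat, 2))
```

Entries close to 0 indicate little linear association between the two variables.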


With this study of the GLC dataset, we identified several relevant questions that can be asked of this global data, and we were able to answer some of them. One result was that neither the landslide size nor the landslide count predicts the number of deaths. From the records, we also found that many medium-size landslides caused a high number of fatalities. Unfortunately, we could not create a useful model for predicting landslide size from the country and fatality count.


During this project, I kept going back to the course materials to clarify my understanding of how to analyze an extensive dataset. I also found that my knowledge of R programming is limited. There were some objectives for which I had to write many lines of code by hand, though I’m sure there are R libraries that could do the same work in a single line. With this project, I also got practical exposure to working with a large dataset. This GLC dataset can be used to create predictive models to identify potential landslide regions and their impact.