MC3 Kickstarter
Overview
By the end of this hands-on exercise, you will be able to:
- import the Mini Case 3 data file into R as an R object,
- split the knowledge graph into nodes and edges tibble data frames,
- tidy the nodes and edges tibble data frames so that they conform to the requirements of tidygraph,
- create a tidygraph object by using the tidied nodes and edges, and
- visualise the tidygraph object.
Getting Started
For the purpose of this exercise, five R packages will be used. They are tidyverse, jsonlite, tidygraph, ggraph and SmartEDA.
You are required to install the R packages above, if necessary, before continuing to the next step.
In the code chunk below, p_load() of the pacman package is used to load the R packages into the R environment.
pacman::p_load(tidyverse, jsonlite,
               tidygraph, ggraph, SmartEDA)
Importing Knowledge Graph Data
For the purpose of this exercise, the MC3_graph.json and MC3_schema.json files will be used. Before getting started, you should have the data set in the data sub-folder.
In the code chunk below, fromJSON() of the jsonlite package is used to import the two files into R and save the output objects as MC3 and MC3_schema.
MC3 <- fromJSON("data/MC3_graph.json")
MC3_schema <- fromJSON("data/MC3_schema.json")
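To see what fromJSON() returns for node-link data, the sketch below parses a small inline JSON string instead of the real file. The contents of toy_json are made up for illustration; only the overall shape (directed, nodes, edges) mirrors the MC3 file.

```r
library(jsonlite)

# Hypothetical miniature of a node-link JSON file (not the real MC3 data)
toy_json <- '{
  "directed": true,
  "nodes": [
    {"id": "A", "type": "Entity"},
    {"id": "B", "type": "Event"}
  ],
  "edges": [
    {"source": "A", "target": "B", "type": "sent"}
  ]
}'

# fromJSON() simplifies arrays of records into data frames by default
toy <- fromJSON(toy_json)

names(toy)        # top-level elements of the parsed list
class(toy$nodes)  # the nodes come back as a data frame
class(toy$edges)  # and so do the edges
```

The parsed object is an ordinary R list, which is why the nodes and edges can later be pulled out with the $ operator.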
Inspecting knowledge graph structure
Before preparing the data, it is good practice to examine the structure of the MC3 knowledge graph.
In the code chunk below, glimpse() is used to reveal the structure of the MC3 knowledge graph.
glimpse(MC3)
List of 5
$ directed : logi TRUE
$ multigraph: logi FALSE
$ graph :List of 4
..$ mode : chr "static"
..$ edge_default: Named list()
..$ node_default: Named list()
..$ name : chr "VAST_MC3_Knowledge_Graph"
$ nodes :'data.frame': 1159 obs. of 31 variables:
..$ type : chr [1:1159] "Entity" "Entity" "Entity" "Entity" ...
..$ label : chr [1:1159] "Sam" "Kelly" "Nadia Conti" "Elise" ...
..$ name : chr [1:1159] "Sam" "Kelly" "Nadia Conti" "Elise" ...
..$ sub_type : chr [1:1159] "Person" "Person" "Person" "Person" ...
..$ id : chr [1:1159] "Sam" "Kelly" "Nadia Conti" "Elise" ...
..$ timestamp : chr [1:1159] NA NA NA NA ...
..$ monitoring_type : chr [1:1159] NA NA NA NA ...
..$ findings : chr [1:1159] NA NA NA NA ...
..$ content : chr [1:1159] NA NA NA NA ...
..$ assessment_type : chr [1:1159] NA NA NA NA ...
..$ results : chr [1:1159] NA NA NA NA ...
..$ movement_type : chr [1:1159] NA NA NA NA ...
..$ destination : chr [1:1159] NA NA NA NA ...
..$ enforcement_type : chr [1:1159] NA NA NA NA ...
..$ outcome : chr [1:1159] NA NA NA NA ...
..$ activity_type : chr [1:1159] NA NA NA NA ...
..$ participants : int [1:1159] NA NA NA NA NA NA NA NA NA NA ...
..$ thing_collected :'data.frame': 1159 obs. of 2 variables:
.. ..$ type: chr [1:1159] NA NA NA NA ...
.. ..$ name: chr [1:1159] NA NA NA NA ...
..$ reference : chr [1:1159] NA NA NA NA ...
..$ date : chr [1:1159] NA NA NA NA ...
..$ time : chr [1:1159] NA NA NA NA ...
..$ friendship_type : chr [1:1159] NA NA NA NA ...
..$ permission_type : chr [1:1159] NA NA NA NA ...
..$ start_date : chr [1:1159] NA NA NA NA ...
..$ end_date : chr [1:1159] NA NA NA NA ...
..$ report_type : chr [1:1159] NA NA NA NA ...
..$ submission_date : chr [1:1159] NA NA NA NA ...
..$ jurisdiction_type: chr [1:1159] NA NA NA NA ...
..$ authority_level : chr [1:1159] NA NA NA NA ...
..$ coordination_type: chr [1:1159] NA NA NA NA ...
..$ operational_role : chr [1:1159] NA NA NA NA ...
$ edges :'data.frame': 3226 obs. of 5 variables:
..$ id : chr [1:3226] "2" "3" "5" "3013" ...
..$ is_inferred: logi [1:3226] TRUE FALSE TRUE TRUE TRUE TRUE ...
..$ source : chr [1:3226] "Sam" "Sam" "Sam" "Sam" ...
..$ target : chr [1:3226] "Relationship_Suspicious_217" "Event_Communication_370" "Event_Assessment_600" "Relationship_Colleagues_430" ...
..$ type : chr [1:3226] NA "sent" NA NA ...
Notice that the thing_collected field of the nodes table is itself a data frame (a nested column) rather than an atomic vector. In general, this data type is not accepted by tbl_graph() of tidygraph. To avoid errors when building the tidygraph object, it is wiser to exclude this field from the nodes data table. However, it might still be useful in subsequent analysis.
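A quick way to spot columns that are not plain vectors is to test each column with is.atomic(). The sketch below uses a made-up two-row data frame (toy_nodes) whose column names only imitate the MC3 nodes table; it is an illustration, not part of the original exercise.

```r
# Toy data frame with one nested data-frame column (values are made up)
toy_nodes <- data.frame(id = c("A", "B"), stringsAsFactors = FALSE)
toy_nodes$thing_collected <- data.frame(
  type = c(NA, "sample"),
  name = c(NA, "water"),
  stringsAsFactors = FALSE
)

# Flag every column that is not a plain atomic vector
is_nested <- !vapply(toy_nodes, is.atomic, logical(1))

# Names of the columns that may need to be dropped or flattened
names(toy_nodes)[is_nested]
```

Running the same check on the real nodes table should, under the structure shown above, point at thing_collected.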
Extracting the edges and nodes tables
Next, as_tibble() of the tibble package is used to extract the nodes and edges data frames from the MC3 list object and save them as two separate tibble data frames called mc3_nodes and mc3_edges respectively.
mc3_nodes <- as_tibble(MC3$nodes)
mc3_edges <- as_tibble(MC3$edges)
Initial EDA
It is time for us to apply appropriate EDA methods to examine the data.
In the code chunk below, ExpCatViz() of the SmartEDA package is used to reveal the frequency distribution of all categorical fields in the mc3_nodes tibble data frame.
ExpCatViz(data=mc3_nodes,
col="lightblue")
[Output: a list of 14 plots, one frequency chart per categorical field in mc3_nodes]
What useful discoveries can you obtain from the visualisation above?
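If you want the numbers behind one of these charts, a frequency table is a simple complement. The sketch below uses count() of dplyr on a made-up tibble (toy_nodes); the type values only imitate those seen in the glimpse() output.

```r
library(dplyr)

# Made-up stand-in for the nodes table
toy_nodes <- tibble(
  id   = c("A", "B", "C", "D"),
  type = c("Entity", "Event", "Event", "Relationship")
)

# Frequency of each category, most common first
type_counts <- toy_nodes %>%
  count(type, sort = TRUE)

type_counts
```

Replacing toy_nodes with mc3_nodes, and type with any other categorical field, gives the corresponding table for the real data.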
On the other hand, the code chunk below uses ExpCatViz() of the SmartEDA package to reveal the frequency distribution of all categorical fields in the mc3_edges tibble data frame.
ExpCatViz(data=mc3_edges,
col="lightblue")
[Output: a list of 1 plot, the frequency chart of the categorical field in mc3_edges]
What useful discoveries can you obtain from the visualisation above?
Data Cleaning and Wrangling
Cleaning and wrangling nodes
The code chunk below performs the following data cleaning tasks:
- convert values in the id field into character data type,
- exclude records whose id value is NA,
- exclude records with duplicate id values,
- exclude the thing_collected field, and
- save the cleaned tibble data frame into a new tibble data frame called mc3_nodes_cleaned.
mc3_nodes_cleaned <- mc3_nodes %>%
  mutate(id = as.character(id)) %>%
  filter(!is.na(id)) %>%
  distinct(id, .keep_all = TRUE) %>%
  select(-thing_collected)
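The effect of these steps is easier to see on a tiny table. The sketch below applies the same mutate(), filter() and distinct() calls to a made-up tibble (toy_nodes) that contains one duplicated id and one missing id.

```r
library(dplyr)

# Made-up nodes: "B" appears twice and one id is missing
toy_nodes <- tibble(
  id   = c("A", "B", "B", NA),
  type = c("Entity", "Event", "Event", "Entity")
)

toy_nodes_cleaned <- toy_nodes %>%
  mutate(id = as.character(id)) %>%     # make sure id is character
  filter(!is.na(id)) %>%                # drop rows without an id
  distinct(id, .keep_all = TRUE)        # keep the first row per id

nrow(toy_nodes)          # rows before cleaning
nrow(toy_nodes_cleaned)  # rows after cleaning
```

Comparing the two row counts on the real data is a cheap way to find out how many records the cleaning actually removed.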
Cleaning and wrangling edges
Next, the code chunk below will be used to:
- rename the source and target fields to from_id and to_id respectively,
- convert values in the from_id and to_id fields to character data type,
- exclude records whose from_id or to_id value is not found in the id field of mc3_nodes_cleaned,
- exclude records whose from_id and/or to_id values are missing, and
- save the cleaned tibble data frame as mc3_edges_cleaned.
mc3_edges_cleaned <- mc3_edges %>%
  rename(from_id = source,
         to_id = target) %>%
  mutate(across(c(from_id, to_id),
                as.character)) %>%
  filter(from_id %in% mc3_nodes_cleaned$id,
         to_id %in% mc3_nodes_cleaned$id) %>%
  filter(!is.na(from_id), !is.na(to_id))
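Before discarding edges, it can be useful to look at which ones would be lost. The sketch below separates kept and dangling edges on made-up tables (toy_nodes, toy_edges); the object names are hypothetical and not part of the original exercise.

```r
library(dplyr)

# Made-up nodes and edges; "Z" and "C" do not exist in the node table
toy_nodes <- tibble(id = c("A", "B"))
toy_edges <- tibble(
  from_id = c("A", "A", "C"),
  to_id   = c("B", "Z", "A")
)

# Edges whose two endpoints both exist in the node table
kept <- toy_edges %>%
  filter(from_id %in% toy_nodes$id,
         to_id %in% toy_nodes$id)

# Edges with at least one endpoint missing from the node table
dangling <- toy_edges %>%
  filter(!(from_id %in% toy_nodes$id) |
           !(to_id %in% toy_nodes$id))

kept
dangling
```

Note that NA %in% x evaluates to FALSE when x contains no NA, so once the node ids have been cleaned, the %in% filter already removes edges with missing endpoints; the explicit is.na() filter then acts as a safeguard.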
Next, the code chunk below will be used to create a mapping from the character id in mc3_nodes_cleaned to the row index.
node_index_lookup <- mc3_nodes_cleaned %>%
  mutate(.row_id = row_number()) %>%
  select(id, .row_id)
Next, the code chunk below will be used to join and convert from_id and to_id to integer indices. At the same time, we also drop rows with unmatched nodes.
mc3_edges_indexed <- mc3_edges_cleaned %>%
  left_join(node_index_lookup,
            by = c("from_id" = "id")) %>%
  rename(from = .row_id) %>%
  left_join(node_index_lookup,
            by = c("to_id" = "id")) %>%
  rename(to = .row_id) %>%
  select(from, to, is_inferred, type) %>%
  filter(!is.na(from) & !is.na(to))
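The same id-to-row-index conversion can also be sketched with match() of base R, which returns the position of each id in the node table. This is an alternative shown on made-up data (toy_nodes, toy_edges), not the method used in this exercise.

```r
library(dplyr)

# Made-up nodes and edges keyed by character id
toy_nodes <- tibble(id = c("A", "B", "C"))
toy_edges <- tibble(
  from_id = c("A", "C"),
  to_id   = c("B", "A"),
  type    = c("sent", NA)
)

toy_edges_indexed <- toy_edges %>%
  mutate(from = match(from_id, toy_nodes$id),  # row index of the source node
         to   = match(to_id, toy_nodes$id)) %>% # row index of the target node
  filter(!is.na(from), !is.na(to)) %>%          # drop edges with unmatched nodes
  select(from, to, type)

toy_edges_indexed
```

match() returns NA for ids that are absent from the node table, so the same is.na() filter removes unmatched edges.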
Next, the code chunk below is used to subset nodes to only those referenced by edges.
used_node_indices <- sort(
  unique(c(mc3_edges_indexed$from,
           mc3_edges_indexed$to)))

mc3_nodes_final <- mc3_nodes_cleaned %>%
  slice(used_node_indices) %>%
  mutate(new_index = row_number())
We will then use the code chunk below to rebuild the lookup from old index to new index.
old_to_new_index <- tibble(
  old_index = used_node_indices,
  new_index = seq_along(
    used_node_indices))
Lastly, the code chunk below will be used to update the edge indices to match the new node table.
mc3_edges_final <- mc3_edges_indexed %>%
  left_join(old_to_new_index,
            by = c("from" = "old_index")) %>%
  rename(from_new = new_index) %>%
  left_join(old_to_new_index,
            by = c("to" = "old_index")) %>%
  rename(to_new = new_index) %>%
  select(from = from_new, to = to_new,
         is_inferred, type)
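The re-indexing logic can be traced on a four-node toy example in which one node is never referenced by an edge. The sketch below uses match() as a compact stand-in for the lookup join and ends with a sanity check; all object names and values are made up.

```r
library(dplyr)

# Four made-up nodes; node "B" (row 2) is not used by any edge
toy_nodes <- tibble(id = c("A", "B", "C", "D"))
toy_edges_indexed <- tibble(from = c(1L, 4L), to = c(3L, 1L))

# Old row indices that appear in at least one edge
used <- sort(unique(c(toy_edges_indexed$from, toy_edges_indexed$to)))

# Keep only the referenced nodes
toy_nodes_final <- toy_nodes %>% slice(used)

# Translate old indices into positions within the reduced node table
toy_edges_final <- toy_edges_indexed %>%
  mutate(from = match(from, used),
         to   = match(to, used))

# Sanity check: every edge endpoint must point at an existing node row
stopifnot(all(c(toy_edges_final$from, toy_edges_final$to) %in%
                seq_len(nrow(toy_nodes_final))))

toy_edges_final
```

A check of this kind on mc3_edges_final and mc3_nodes_final is worth running before building the graph, because out-of-range indices are a common cause of errors at that step.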
Building the tidygraph object
Now we are ready to build the tidygraph object by using the code chunk below.
mc3_graph <- tbl_graph(
  nodes = mc3_nodes_final,
  edges = mc3_edges_final,
  directed = TRUE
)
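As a minimal illustration of what tbl_graph() expects, the sketch below builds a three-node directed graph from made-up tables in which from and to are integer row indices into the node table. The toy names and values are assumptions for illustration only.

```r
library(tidygraph)

# Made-up node table; row position is what the edge indices refer to
toy_nodes <- data.frame(
  id   = c("A", "B", "C"),
  type = c("Entity", "Event", "Entity")
)

# Made-up edge table; from/to hold row indices of toy_nodes
toy_edges <- data.frame(
  from = c(1L, 3L),
  to   = c(2L, 2L),
  type = c("sent", "received")
)

toy_graph <- tbl_graph(
  nodes = toy_nodes,
  edges = toy_edges,
  directed = TRUE
)

# Printing shows the node and edge tables side by side
toy_graph
```

Because the edges refer to nodes by position, any reordering or filtering of the node table must be followed by re-indexing of the edges, which is what the preceding steps take care of.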
After the tidygraph object is created, it is always good practice to examine the object by using str().
str(mc3_graph)
Classes 'tbl_graph', 'igraph' hidden list of 10
$ : num 1159
$ : logi TRUE
$ : num [1:3226] 0 0 0 0 0 0 0 1 1 1 ...
$ : num [1:3226] 1137 356 746 894 875 ...
$ : NULL
$ : NULL
$ : NULL
$ : NULL
$ :List of 4
..$ : num [1:3] 1 0 1
..$ : Named list()
..$ :List of 31
.. ..$ type : chr [1:1159] "Entity" "Entity" "Entity" "Entity" ...
.. ..$ label : chr [1:1159] "Sam" "Kelly" "Nadia Conti" "Elise" ...
.. ..$ name : chr [1:1159] "Sam" "Kelly" "Nadia Conti" "Elise" ...
.. ..$ sub_type : chr [1:1159] "Person" "Person" "Person" "Person" ...
.. ..$ id : chr [1:1159] "Sam" "Kelly" "Nadia Conti" "Elise" ...
.. ..$ timestamp : chr [1:1159] NA NA NA NA ...
.. ..$ monitoring_type : chr [1:1159] NA NA NA NA ...
.. ..$ findings : chr [1:1159] NA NA NA NA ...
.. ..$ content : chr [1:1159] NA NA NA NA ...
.. ..$ assessment_type : chr [1:1159] NA NA NA NA ...
.. ..$ results : chr [1:1159] NA NA NA NA ...
.. ..$ movement_type : chr [1:1159] NA NA NA NA ...
.. ..$ destination : chr [1:1159] NA NA NA NA ...
.. ..$ enforcement_type : chr [1:1159] NA NA NA NA ...
.. ..$ outcome : chr [1:1159] NA NA NA NA ...
.. ..$ activity_type : chr [1:1159] NA NA NA NA ...
.. ..$ participants : int [1:1159] NA NA NA NA NA NA NA NA NA NA ...
.. ..$ reference : chr [1:1159] NA NA NA NA ...
.. ..$ date : chr [1:1159] NA NA NA NA ...
.. ..$ time : chr [1:1159] NA NA NA NA ...
.. ..$ friendship_type : chr [1:1159] NA NA NA NA ...
.. ..$ permission_type : chr [1:1159] NA NA NA NA ...
.. ..$ start_date : chr [1:1159] NA NA NA NA ...
.. ..$ end_date : chr [1:1159] NA NA NA NA ...
.. ..$ report_type : chr [1:1159] NA NA NA NA ...
.. ..$ submission_date : chr [1:1159] NA NA NA NA ...
.. ..$ jurisdiction_type: chr [1:1159] NA NA NA NA ...
.. ..$ authority_level : chr [1:1159] NA NA NA NA ...
.. ..$ coordination_type: chr [1:1159] NA NA NA NA ...
.. ..$ operational_role : chr [1:1159] NA NA NA NA ...
.. ..$ new_index : int [1:1159] 1 2 3 4 5 6 7 8 9 10 ...
..$ :List of 2
.. ..$ is_inferred: logi [1:3226] TRUE FALSE TRUE TRUE TRUE TRUE ...
.. ..$ type : chr [1:3226] NA "sent" NA NA ...
$ :<environment: 0x000002835e7a4a20>
- attr(*, "active")= chr "nodes"
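The str() output is fairly low-level. As a complementary sketch, the node and edge tables of a tidygraph object can be pulled back out as tibbles with activate() and as_tibble(). The example uses a made-up graph (toy_graph) rather than mc3_graph.

```r
library(tidygraph)

# Made-up graph with three nodes and two directed edges
toy_graph <- tbl_graph(
  nodes = data.frame(id = c("A", "B", "C")),
  edges = data.frame(from = c(1L, 3L), to = c(2L, 2L)),
  directed = TRUE
)

# Node table of the graph
node_tbl <- toy_graph %>%
  activate(nodes) %>%
  as_tibble()

# Edge table of the graph, including the from/to index columns
edge_tbl <- toy_graph %>%
  activate(edges) %>%
  as_tibble()

node_tbl
edge_tbl
```

Inspecting the two tibbles this way makes it easy to confirm that the node attributes and edge attributes survived the conversion.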
Visualising the knowledge graph
Several of the ggraph layouts involve randomisation. In order to ensure reproducibility, it is necessary to set the seed value before plotting by using the code chunk below.
set.seed(1234)
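The effect of fixing the seed can be checked with any random draw. The short sketch below uses runif() as a stand-in for the random initial positions a layout might use; it does not call ggraph itself.

```r
# Two draws made after setting the same seed are identical
set.seed(1234)
first_draw <- runif(3)

set.seed(1234)
second_draw <- runif(3)

identical(first_draw, second_draw)  # TRUE
```

For the same reason, the seed should be set again immediately before each plot whose layout you want to reproduce.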
In the code chunk below, ggraph functions are used to create the whole graph.
ggraph(mc3_graph,
       layout = "fr") +
  geom_edge_link(alpha = 0.3,
                 colour = "gray") +
  geom_node_point(aes(color = `type`),
                  size = 4) +
  geom_node_text(aes(label = type),
                 repel = TRUE,
                 size = 2.5) +
  theme_void()
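The plot above labels every node with its type, which repeats what the colour already encodes. One possible variation, sketched below on a made-up graph (toy_graph), labels nodes with their id instead. This is a suggestion under the assumption that an id column is present, not part of the original exercise.

```r
library(tidygraph)
library(ggraph)

# Made-up graph with three nodes and two directed edges
toy_graph <- tbl_graph(
  nodes = data.frame(
    id   = c("A", "B", "C"),
    type = c("Entity", "Event", "Entity")
  ),
  edges = data.frame(from = c(1L, 3L), to = c(2L, 2L)),
  directed = TRUE
)

# Fix the seed so the force-directed layout is reproducible
set.seed(1234)

p <- ggraph(toy_graph, layout = "fr") +
  geom_edge_link(alpha = 0.3, colour = "gray") +
  geom_node_point(aes(colour = type), size = 4) +  # colour encodes node type
  geom_node_text(aes(label = id),                  # label encodes node id
                 repel = TRUE, size = 2.5) +
  theme_void()

p
```

On the full MC3 graph, labelling all 1159 nodes is likely to be cluttered, so labels are usually more readable after filtering to a subgraph of interest.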
The examples below are not model answers. They are used to show you how to apply the mantra of "overview first, details on demand" in visual investigation.