MC2 Kickstarter
Overview
By the end of this hands-on exercise, you will be able to:
- import the Mini Case 2 data file into R as an R object,
- split the knowledge graph into nodes and edges tibble data frames,
- tidy the nodes and edges tibble data frames so that they conform to the requirements of tidygraph,
- create a tidygraph object by using the tidied nodes and edges, and
- visualise the tidygraph object.
Getting Started
For the purpose of this exercise, five R packages will be used. They are tidyverse, jsonlite, tidygraph, ggraph and SmartEDA.
You are required to install the R packages above, if necessary, before continuing to the next step.
In the code chunk below, p_load() of the pacman package is used to load the R packages into the R environment. It also installs any of the listed packages that are missing.
pacman::p_load(tidyverse, jsonlite,
               tidygraph, ggraph, SmartEDA)
Importing Knowledge Graph Data
For the purpose of this exercise, the FILAH.json file will be used. Before getting started, you should have the data set in the data sub-folder.
In the code chunk below, fromJSON() of the jsonlite package is used to import the FILAH.json file into R and save the output as an object called filah.
filah <- fromJSON("data/FILAH.json")
Inspecting knowledge graph structure
Before preparing the data, it is always good practice to examine the structure of the filah knowledge graph.
In the code chunk below, glimpse() is used to reveal the structure of the filah knowledge graph.
glimpse(filah)
List of 5
$ directed : logi TRUE
$ multigraph: logi TRUE
$ graph : Named list()
$ nodes :'data.frame': 396 obs. of 17 variables:
..$ type : chr [1:396] "meeting" "meeting" "meeting" "meeting" ...
..$ date : chr [1:396] "Meeting 1" "Meeting 2" "Meeting 3" "Meeting 4" ...
..$ label : chr [1:396] "Meeting 1" "Meeting 2" "Meeting 3" "Meeting 4" ...
..$ id : chr [1:396] "Meeting_1" "Meeting_2" "Meeting_3" "Meeting_4" ...
..$ name : chr [1:396] NA NA NA NA ...
..$ role : chr [1:396] NA NA NA NA ...
..$ short_topic: chr [1:396] NA NA NA NA ...
..$ long_topic : chr [1:396] NA NA NA NA ...
..$ short_title: chr [1:396] NA NA NA NA ...
..$ long_title : chr [1:396] NA NA NA NA ...
..$ plan_type : chr [1:396] NA NA NA NA ...
..$ lat : num [1:396] NA NA NA NA NA NA NA NA NA NA ...
..$ lon : num [1:396] NA NA NA NA NA NA NA NA NA NA ...
..$ zone : chr [1:396] NA NA NA NA ...
..$ zone_detail: chr [1:396] NA NA NA NA ...
..$ start : chr [1:396] NA NA NA NA ...
..$ end : chr [1:396] NA NA NA NA ...
$ links :'data.frame': 765 obs. of 9 variables:
..$ role : chr [1:765] "part_of" "part_of" "part_of" "part_of" ...
..$ source : chr [1:765] "Meeting_1" "Meeting_1" "Meeting_1" "Meeting_1" ...
..$ target : chr [1:765] "fish_vacuum_Meeting_1_Introduction_Discussion" "fish_vacuum_Meeting_1_Introduction" "seafood_festival_Meeting_1_Discussion" "seafood_festival_Meeting_1_Feasibility" ...
..$ key : int [1:765] 0 0 0 0 0 0 0 0 0 0 ...
..$ sentiment: num [1:765] NA NA NA NA NA NA NA NA NA NA ...
..$ reason : chr [1:765] NA NA NA NA ...
..$ industry :List of 765
.. ..$ : NULL
.. ..$ : NULL
.. ..$ : NULL
.. .. [list output truncated]
..$ status : chr [1:765] NA NA NA NA ...
..$ time : chr [1:765] NA NA NA NA ...
Notice that the industry field is a list column. In general, this data type is not accepted by tbl_graph() of tidygraph. To avoid errors when building the tidygraph object, it is wiser to exclude this field from the edges data table. However, it might still be useful in subsequent analysis.
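As a minimal sketch of the point above, the base R code below shows one way to detect list columns in a data frame and then either drop them or flatten them into plain character strings so that the information is kept for later analysis. The toy edges table and its industry values are made up for illustration, and flatten_list_col() is a hypothetical helper, not part of any package.

```r
# Toy stand-in for the edges data frame (values are invented for illustration)
edges <- data.frame(
  source = c("Meeting_1", "Meeting_1", "Meeting_2"),
  target = c("topic_a", "topic_b", "topic_a"),
  stringsAsFactors = FALSE
)
# Add a list column similar in shape to the industry field
edges$industry <- list(NULL, c("tourism", "fishing"), "tourism")

# Identify which columns are list columns
is_list_col <- vapply(edges, is.list, logical(1))
names(edges)[is_list_col]

# Option 1: drop the list columns before building the graph
edges_plain <- edges[, !is_list_col, drop = FALSE]

# Option 2: flatten each list element into a single string
# (hypothetical helper; empty elements become NA)
flatten_list_col <- function(x) {
  vapply(
    x,
    function(v) if (length(v) == 0) NA_character_ else paste(v, collapse = "; "),
    character(1)
  )
}
edges_flat <- edges
edges_flat$industry <- flatten_list_col(edges$industry)
edges_flat
```

Option 1 mirrors what this exercise does later when only from, to and role are selected. Option 2 may be worth considering if the industry values are needed as an edge attribute.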
Extracting the edges and nodes tables
Next, as_tibble() of the tibble package is used to extract the nodes and links data frames from the filah list into two separate tibble data frames called filah_nodes and filah_edges respectively.
filah_nodes <- as_tibble(filah$nodes)
filah_edges <- as_tibble(filah$links)
Initial EDA
It is time for us to apply appropriate EDA methods to examine the data.
In the code chunk below, ExpCatViz() of the SmartEDA package is used to reveal the frequency distribution of all categorical fields in the filah_nodes tibble data frame.
ExpCatViz(data=filah_nodes,
col="lightblue")
[Figure: five bar charts showing the frequency distributions of categorical fields in filah_nodes]
What useful discoveries can you obtain from the visualisation above?
On the other hand, the code chunk below uses ExpCatViz() of the SmartEDA package to reveal the frequency distribution of all categorical fields in the filah_edges tibble data frame.
ExpCatViz(data=filah_edges,
col="lightblue")
[Figure: three bar charts showing the frequency distributions of categorical fields in filah_edges]
What useful discoveries can you obtain from the visualisation above?
ExpNumViz(filah_nodes)
[Figure: two distribution plots of numeric fields in filah_nodes]
What useful discoveries can you obtain from the visualisation above?
ExpNumViz(filah_edges)
[Figure: one distribution plot of a numeric field in filah_edges]
What useful discoveries can you obtain from the visualisation above?
Data Cleaning and Wrangling
Cleaning and wrangling nodes
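filah_nodes_cleaned <- filah_nodes %>%
  mutate(id = as.character(id)) %>%   # ensure IDs are character strings
  filter(!is.na(id)) %>%              # drop nodes without an ID
  distinct(id, .keep_all = TRUE) %>%  # keep one row per ID
  select(id, type, label)             # keep only the fields needed for the graph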
Cleaning and wrangling edges
filah_edges_cleaned <- filah_edges %>%
  rename(from = source, to = target) %>%     # column names that tidygraph looks for
  mutate(across(c(from, to), as.character)) %>%
  filter(from %in% filah_nodes_cleaned$id,   # keep only edges whose endpoints exist
         to %in% filah_nodes_cleaned$id)
# Remove problematic columns from the edge table for graph building
filah_edges_min <- filah_edges_cleaned %>%
  select(from, to, role)  # Only basic fields needed for graph structure
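The filter() step in the edge cleaning code silently discards every edge whose source or target is missing from the cleaned nodes table. Before relying on the result, it may be worth checking how many edges are affected. The base R sketch below illustrates the idea on a toy nodes and edges pair. The data are invented, and the same logic could be applied to the from and to columns of the renamed edges table and the id column of filah_nodes_cleaned before the filter() step.

```r
# Toy data (invented for illustration): nodes D and Z do not exist
nodes <- data.frame(id = c("A", "B", "C"), stringsAsFactors = FALSE)
edges <- data.frame(
  from = c("A", "B", "C", "A"),
  to   = c("B", "C", "D", "Z"),
  stringsAsFactors = FALSE
)

# An edge is dangling when either endpoint is absent from the nodes table
dangling <- !(edges$from %in% nodes$id & edges$to %in% nodes$id)

# Inspect the edges that the filter would remove, and count them
edges[dangling, ]
sum(dangling)
```

A large count here would suggest a mismatch between node IDs and edge endpoints that deserves a closer look before the graph is built.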
Building the tidygraph object
filah_graph <- tbl_graph(
  nodes = filah_nodes_cleaned,
  edges = filah_edges_min,
  directed = TRUE)
Since similar steps will be used to clean and wrangle TROUT.json and journalist.json, you might want to consider converting the above code chunks into R function(s).
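One possible way to package the steps is sketched below. build_kg() is a hypothetical helper name, and the sketch assumes that the other JSON files share the same structure as FILAH.json (a nodes table with id, type and label, and a links table with source, target and role). This has not been verified here. The toy list at the end only stands in for the output of fromJSON().

```r
library(dplyr)
library(tidygraph)

# Hypothetical helper: takes the list returned by fromJSON() and
# returns a tidygraph object, repeating the cleaning steps used for filah
build_kg <- function(kg) {
  nodes <- as_tibble(kg$nodes) %>%
    mutate(id = as.character(id)) %>%
    filter(!is.na(id)) %>%
    distinct(id, .keep_all = TRUE) %>%
    select(id, type, label)

  edges <- as_tibble(kg$links) %>%
    rename(from = source, to = target) %>%
    mutate(across(c(from, to), as.character)) %>%
    filter(from %in% nodes$id, to %in% nodes$id) %>%
    select(from, to, role)

  tbl_graph(nodes = nodes, edges = edges, directed = TRUE)
}

# Toy input (invented) with a duplicated ID, a missing ID and a dangling edge
toy <- list(
  nodes = data.frame(
    id    = c("A", "B", "B", NA),
    type  = c("meeting", "topic", "topic", "topic"),
    label = c("Meeting A", "Topic B", "Topic B", "Unknown")
  ),
  links = data.frame(
    source = c("A", "A"),
    target = c("B", "X"),
    role   = c("part_of", "part_of")
  )
)
toy_graph <- build_kg(toy)
toy_graph
```

With the real data, the call would look something like build_kg(fromJSON("data/TROUT.json")), assuming the file sits in the same data sub-folder.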
Visualising the knowledge graph
In this section, we will use ggraph’s functions to visualise and analyse the graph object.
The example below is not a model answer. It is used to show you how to apply the visual investigation mantra of "overview first, details on demand".
Visualising the whole graph
Several of the ggraph layouts involve randomisation. In order to ensure reproducibility, it is necessary to set the seed value before plotting by using the code chunk below.
set.seed(1234)
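The effect of set.seed() can be seen with a small base R sketch that is independent of ggraph. Resetting the seed makes the random number generator replay the same sequence, whereas drawing again without resetting it continues the sequence and gives different values.

```r
set.seed(1234)
first_draw <- runif(3)

# Resetting the seed replays the same random numbers
set.seed(1234)
second_draw <- runif(3)
identical(first_draw, second_draw)

# Without resetting, the generator continues and the values differ
third_draw <- runif(3)
identical(first_draw, third_draw)
```

For the same reason, it is safer to call set.seed() immediately before each plot that uses a randomised layout, rather than once at the top of the document.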
In the code chunk below, ggraph functions are used to create the whole graph.
ggraph(filah_graph,
       layout = "fr") +
  geom_edge_link(alpha = 0.3,
                 colour = "gray") +
  geom_node_point(aes(color = `type`),
                  size = 4) +
  geom_node_text(aes(label = type),
                 repel = TRUE,
                 size = 2.5) +
  theme_void()
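For the "details on demand" part of the mantra, one option is to narrow the graph down to a subset of edges and the nodes they touch before plotting. The sketch below is only an illustration on a toy graph. The node IDs, the "person" type and the "comments_on" role are invented, while "meeting" and "part_of" are values that appear in the glimpse() output above. It uses activate(), filter() and node_is_isolated() from tidygraph.

```r
library(dplyr)
library(tidygraph)

# Toy graph (invented for illustration)
nodes <- data.frame(
  id   = c("Meeting_1", "topic_a", "topic_b", "person_x"),
  type = c("meeting", "topic", "topic", "person")
)
edges <- data.frame(
  from = c("Meeting_1", "Meeting_1", "person_x"),
  to   = c("topic_a", "topic_b", "topic_a"),
  role = c("part_of", "part_of", "comments_on")
)
toy_graph <- tbl_graph(nodes = nodes, edges = edges, directed = TRUE)

# Keep only "part_of" edges, then drop nodes left without any edge
part_of_graph <- toy_graph %>%
  activate(edges) %>%
  filter(role == "part_of") %>%
  activate(nodes) %>%
  filter(!node_is_isolated())

part_of_graph
```

The resulting object is still a tidygraph object, so it can be passed to ggraph() in the same way as filah_graph.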