MC1 Kickstarter

Published

May 12, 2025

Modified

May 18, 2025

Overview

By the end of this handson exercise, you will be able to:

  • import Mini case 1 data file R object,
  • split the knowledge graph into nodes and edges tibble data frames,
  • tidy nodes and edges tibble data frames for conforming to the requirements of tidygraph,
  • create a tidygrpah object by using the tidied nides and edges, and
  • visualise the tidygraph

Getting Started

For the purpose of this exercise, four R packages will be used. They are tidyverse, jsonlite, tidygraph and ggraph.

Note

You are required to install the R packages above, if necessary, before continue to the next step.

In the code chunk below, p_load() of pacman package is used to load the R packages into R environemnt.

pacman::p_load(tidyverse, jsonlite,
               tidygraph, ggraph)

Importing Kownledge Graph Data

For the purpose of this exercise, MC1_graph.json file will be used. Before getting started, you should have the data set in the data sub-folder.

In the code chunk below, fromJSON() of jsonlite package is used to import MC1_graph.json file into R and save the output object

kg <- fromJSON("data/MC1_graph.json")

Inspect structure

Before preparing the data, it is always a good practice to examine the structure of kg object.

In the code chunk below str() is used to reveal the structure of kg object.

str(kg, max.level = 1)
List of 5
 $ directed  : logi TRUE
 $ multigraph: logi TRUE
 $ graph     :List of 2
 $ nodes     :'data.frame': 17412 obs. of  10 variables:
 $ links     :'data.frame': 37857 obs. of  4 variables:

Extracting the edges and nodes tables

Next, as_tibble() of tibble package package is used to extract the nodes and links tibble data frames from kg object into two separate tibble data frames called nodes_tbl and edges_tbl respectively.

nodes_tbl <- as_tibble(kg$nodes)
edges_tbl <- as_tibble(kg$links) 

Initial EDA

It is time for us to apply appropriate EDA methods to examine the data.

In this code chunk below, ggplot2 functions are used the reveal the frequency distribution of Edge Type field of edges_tbl.

ggplot(data = edges_tbl,
       aes(y = `Edge Type`)) +
  geom_bar()

On the other hands, code chunk below uses ggplot2 functions to reveal the frequency distribution of Node Type field of nodes_tbl.

ggplot(data = nodes_tbl,
       aes(y = `Node Type`)) +
  geom_bar()

Creating Knowledge Graph

Mapping from node id to row index

Before we can go ahead to build the tidygraph object, it is important for us to ensures each id from the node list is mapped to the correct row number. This requirement can be achive by using the code chunk below.

id_map <- tibble(id = nodes_tbl$id,
                 index = seq_len(
                   nrow(nodes_tbl)))

Map source and target IDs to row indices

Next, we will map the source and the target IDs to row indices by using the code chunk below.

edges_tbl <- edges_tbl %>%
  left_join(id_map, by = c("source" = "id")) %>%
  rename(from = index) %>%
  left_join(id_map, by = c("target" = "id")) %>%
  rename(to = index)
Note

To better understand the changes before and after the process, it is to take a screenshot of edges_tbl before and after this process and examine the differences.

Filter out any unmatched (invalid) edges

Lastly, the code chunk below will be used to exclude the unmatch edges.

edges_tbl <- edges_tbl %>%
  filter(!is.na(from), !is.na(to))

Creating tidygraph

Lastly, tbl_graph() is used to create tidygraph’s graph object by using the code chunk below.

graph <- tbl_graph(nodes = nodes_tbl, 
                   edges = edges_tbl, 
                   directed = kg$directed)

You might want to confirm the output object is indeed in tidygraph format by using the code chunk below.

class(graph)
[1] "tbl_graph" "igraph"   

Visualising the knowledge graph

In this section, we will use ggraph’s functions to visualise and analyse the graph object.

Warning

The two examples below are not model answers, they are examples to show you how to use the mantra of Overview first, details on demand of visual investigation.

Several of the ggraph layouts involve randomisation. In order to ensure reproducibility, it is necessary to set the seed value before plotting by using the code chunk below.

set.seed(1234)

Visualising the whole graph

In the code chunk below, ggraph functions are used to visualise the whole graph.

ggraph(graph, layout = "fr") +
  geom_edge_link(alpha = 0.3, 
                 colour = "gray") +
  geom_node_point(aes(color = `Node Type`), 
                  size = 4) +
  geom_node_text(aes(label = name), 
                 repel = TRUE, 
                 size = 2.5) +
  theme_void()

Notice that the whole graph is very messy and we can hardy discover any useful patterns. This is always the case in graph visualisation and analysis. In order to gain meaningful visual discovery, it is always useful for us to looking into the details, for example by plotting sub-graphs.

Visualising the sub-graph

In this section, we are interested to create a sub-graph base on MemberOf value in Edge Type column of the edges data frame.

Filtering edges to only “MemberOf”

graph_memberof <- graph %>%
  activate(edges) %>%
  filter(`Edge Type` == "MemberOf")

Extracting only connected nodes (i.e., used in these edges)

used_node_indices <- graph_memberof %>%
  activate(edges) %>%
  as_tibble() %>%
  select(from, to) %>%
  unlist() %>%
  unique()

Keeping only those nodes

graph_memberof <- graph_memberof %>%
  activate(nodes) %>%
  mutate(row_id = row_number()) %>%
  filter(row_id %in% used_node_indices) %>%
  select(-row_id)  # optional cleanup

Plotting the sub-graph

ggraph(graph_memberof, 
       layout = "fr") +
  geom_edge_link(alpha = 0.5, 
                 colour = "gray") +
  geom_node_point(aes(color = `Node Type`), 
                  size = 1) +
  geom_node_text(aes(label = name), 
                 repel = TRUE, 
                 size = 2.5) +
  theme_void()

Notice that the sub-graph above is very clear and the relationship between musical group and person can be visualise easily.