{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Complex IO\n",
    "\n",
    "`topologic` contains extensive `io` and `projection` packages for loading data in many ways.\n",
    "\n",
    "Some sources of data are multigraphs, and you may need to make some hard decisions about how you want to convert a multigraph into an undirected or directed simple graph.\n",
    "\n",
    "On edge duplication, do you want to:\n",
    "- Sum the weights?\n",
    "- Average them?\n",
    "- Take the latest?\n",
    "- Exclude edges based on some other attribute criteria?\n",
    "\n",
    "You can always pre-process your data to answer this question for you and write everything you need from scratch. This is a valid strategy, and if you feel most comfortable with it, you can either create your own `networkx` Graph objects or use the simple `topologic.io.from_file` function to create a graph for you.\n",
    "\n",
    "`topologic` also contains a number of utility functions and a general, opinionated paradigm for operating over input files when building a graph, which may help you quickly transform all of your various source files into the exact graph you wish to analyze with the rest of `topologic`'s capabilities. This notebook will show how to use it. The main reason to do so is if you expect to perform similar sorts of projections, possibly with minor configuration differences, across many different source files. Building a corpus of convenient projections could save you a lot of time in the future, especially in an enterprise environment.\n",
    "\n",
    "# Data\n",
    "The data we are using is located in `test_data/`, colocated in the same directory as this notebook. It is a directed multigraph from <a href=\"https://snap.stanford.edu/data/\">Stanford's Large Network Dataset Collection</a>. Specifically, we are going to use the <a href=\"https://snap.stanford.edu/data/soc-RedditHyperlinks.html\">Social Network: Reddit Hyperlink Network</a>. This dataset was generated for the paper:\n",
    "\n",
    "- S. Kumar, W.L. Hamilton, J. Leskovec, D. Jurafsky. \n",
    "<a href=\"https://cs.stanford.edu/~srijan/pubs/conflict-paper-www18.pdf\">Community Interaction and Conflict on the \n",
    "Web.</a> World Wide Web Conference, 2018.\n",
    "\n",
    "The file has been cut down to size by removing the `POST_PROPERTIES` column of the original data file, primarily to speed up cloning the repository. To do this, we executed\n",
    "\n",
    "```bash\n",
    "cut -f 1,2,3,4 soc-redditHyperlinks-body.tsv > test_data/smaller-redditHyperlinks-body.tsv\n",
    "```\n",
    "\n",
    "The **only** reason this file was modified with any CLI utilities is to reduce its size within our git repository, for faster cloning and to remove the `git-lfs` requirement. `topologic` would be quite happy to process a file of this size under normal circumstances.\n",
    "\n",
    "The tab-separated file has the format:\n",
    "```\n",
    "SOURCE_SUBREDDIT TARGET_SUBREDDIT POST_ID TIMESTAMP POST_LABEL\n",
    "```\n",
    "\n",
    "# Scenario\n",
    "\n",
    "We want to load this graph, filtering out any record before a given timestamp, then aggregating the implied weight (1) of each record with any existing `source` to `destination` link between the same two vertices, using the \n",
    "`topologic.io` and `topologic.projection` packages.\n",
    "\n",
    "We will then make some edge weight cuts using the `topologic.statistics` package.\n",
    "\n",
    "Finally, we will show nominal usage of the `topologic.embedding.node2vec_embedding` function.\n",
    "\n",
    "# Disclaimers\n",
    "\n",
    "As we are trying to show the **capability** of the library, we are not necessarily going to make the best decisions with regard to cut dates or weight-based edge cuts, and we shouldn't expect to glean any useful information from our node2vec embedding."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Initial Setup and Projection Function Creation\n",
    "First we're going to import the libraries we're going to use, including `topologic`, and set some of the constants we \n",
    "expect to use later.\n",
    "\n",
    "Then we're going to create our projection function. This projection function is simply responsible\n",
    "for processing a single row of data from our csv parser and optionally modifying the `networkx` graph according to\n",
    "some business rules we've put in place.\n",
    "\n",
    "The `topologic.io.from_dataset` function expects to be provided a function with the signature \n",
    "`Callable[[nx.Graph], Callable[[List[str]], None]]`.\n",
    "\n",
    "Counting the outer configuration function we will write to produce it, this is the definition of a function that returns a function that returns a function.\n",
    "\n",
    "The first function lets us specify some configuration properties that we can use later: things like\n",
    "the cutoff date we want to use when we build the graph, or the (0-based) index at which we expect a given\n",
    "column's data to appear. We will call it with our actual parameters after we define it.\n",
    "\n",
    "The first inner function will be called by the `topologic.io.from_dataset` function, which will pass in the `networkx`\n",
    "Graph object we will be using.\n",
    "\n",
    "The second inner function will be called once per CSV record, and will be responsible for actually applying\n",
    "our business rules and deciding whether to update the graph or not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import networkx as nx\n",
    "import topologic as tc\n",
    "from typing import Callable, List\n",
    "\n",
    "reddit_hyperlinks_path = \"test_data/smaller-redditHyperlinks-body.tsv\"\n",
    "\n",
    "# data file contains edges from Jan 2014 - April 2017 - we'll only take a year's worth\n",
    "# timestamps are in `YYYY-MM-DD HH:mm:SS` format (e.g. `2013-12-31 16:39:58`)\n",
    "# this means we can do a reasonably fast string comparison to determine whether the edge should be taken into\n",
    "# consideration or not\n",
    "\n",
    "date_cutoff = \"2016-05-01\"\n",
    "\n",
    "\n",
    "def sum_after_date(\n",
    "    keep_after_date: str,\n",
    "    source_index: int,\n",
    "    target_index: int,\n",
    "    date_index: int\n",
    ") -> Callable[[nx.Graph], Callable[[List[str]], None]]:\n",
    "    def _csv_parser_setup(\n",
    "        graph: nx.Graph\n",
    "    ) -> Callable[[List[str]], None]:\n",
    "        def _process_row(row: List[str]):\n",
    "            # this processes the current row\n",
    "            # in our case right now, we expect to be able to drop any record before May 1st of 2016\n",
    "            # and sum up any weights if they currently exist in the graph\n",
    "            source = row[source_index]\n",
    "            target = row[target_index]\n",
    "            date = row[date_index]\n",
    "            if date >= keep_after_date:\n",
    "                original_weight = graph[source][target][\"weight\"] if source in graph and target in graph[source] else 0\n",
    "                weight = original_weight + 1\n",
    "                graph.add_edge(source, target, weight=weight)\n",
    "        return _process_row\n",
    "    return _csv_parser_setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# topologic.io.CsvDataset Setup\n",
    "\n",
    "Now that we've defined our projection function, we need to define our CsvDataset.\n",
    "\n",
    "As we mentioned earlier, we are using a tab-separated file. It also uses standard Unix `\\n` line terminators. We use\n",
    "the built-in `csv` Python module to parse our files, and this information is useful\n",
    "for defining the csv `Dialect` type that we will be using. The file also includes a header row, which we will want to skip.\n",
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "digraph = nx.DiGraph()\n",
    "projection_function = sum_after_date(date_cutoff, 0, 1, 3)\n",
    "\n",
    "with open(reddit_hyperlinks_path, \"r\") as data_input:\n",
    "    dataset = tc.io.CsvDataset(\n",
    "        source_iterator=data_input,\n",
    "        has_headers=True,\n",
    "        dialect=\"excel-tab\"\n",
    "    )\n",
    "\n",
    "    digraph = tc.io.from_dataset(\n",
    "        csv_dataset=dataset,\n",
    "        projection_function_generator=projection_function,\n",
    "        graph=digraph\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "def print_graph(graph: nx.Graph):\n",
    "    print(f\"Number of Graph Vertices: {len(graph)}\")\n",
    "    print(f\"Number of Graph Edges: {len(graph.edges())}\")\n",
    "    print(f\"Maximum Edge Weight: {max(weight for _, _, weight in graph.edges(data='weight'))}\")\n",
    "    print(f\"Minimum Edge Weight: {min(weight for _, _, weight in graph.edges(data='weight'))}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of Graph Vertices: 19696\n",
      "Number of Graph Edges: 52903\n",
      "Maximum Edge Weight: 184\n",
      "Minimum Edge Weight: 1\n"
     ]
    }
   ],
   "source": [
    "print_graph(digraph)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Pruning Graphs by Edge Weight\n",
    "As we can see from the above print statements, we have a graph of 19696 nodes and 52903 edges. The actual edge count\n",
    "in the file is `286561` (`wc -l test_data/soc-redditHyperlinks-body.tsv`, minus 1 for the header).\n",
    "\n",
    "We also know that the initial edge list did not contain a weight - we just counted the number of source-to-target\n",
    "relationships within our time window and used that as our weight.\n",
    "\n",
    "Now we want to explore some of the tools available for making graph cuts. In this specific case, we're going to make\n",
    "cuts via the edge weight attribute (`topologic` also supports degree centrality and betweenness centrality cuts\n",
    "using an almost identical API).\n",
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DefinedHistogram(histogram=array([52539, 242, 78, 22, 10, 4, 3, 1, 2,\n",
       " 2]), bin_edges=array([ 1. , 19.3, 37.6, 55.9, 74.2, 92.5, 110.8, 129.1, 147.4,\n",
       " 165.7, 184. ]))"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# histogram of weights\n",
    "tc.statistics.histogram_edge_weight(digraph)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of Graph Vertices: 19696\n",
      "Number of Graph Edges: 52891\n",
      "Maximum Edge Weight: 92\n",
      "Minimum Edge Weight: 1\n"
     ]
    }
   ],
   "source": [
    "# let's cut around weight <= 92.5\n",
    "cut_graph = tc.statistics.cut_edges_by_weight(\n",
    "    digraph,\n",
    "    92.5,\n",
    "    tc.statistics.MakeCuts.LARGER_THAN_EXCLUSIVE,\n",
    "    prune_isolates=False\n",
    ")\n",
    "\n",
    "print_graph(cut_graph)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It may be surprising that we have precisely the same number of vertices as before. This is because we did not\n",
    "tell our edge cut to ALSO prune our isolated vertices. To address this, set `prune_isolates=True` and run it again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of Graph Vertices: 19694\n",
      "Number of Graph Edges: 52891\n",
      "Maximum Edge Weight: 92\n",
      "Minimum Edge Weight: 1\n"
     ]
    }
   ],
   "source": [
    "cut_graph = tc.statistics.cut_edges_by_weight(\n",
    "    digraph,\n",
    "    92.5,\n",
    "    tc.statistics.MakeCuts.LARGER_THAN_EXCLUSIVE,\n",
    "    prune_isolates=True\n",
    ")\n",
    "\n",
    "print_graph(cut_graph)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "We've now pruned a pair of unused nodes!\n",
    "\n",
    "Now let's extract the largest connected component from our graph; multiple connected components skew our resulting\n",
    "embeddings more than we would like, so we'll work over discrete connected components. Note: it may make sense to create\n",
    "embeddings for each connected component if the sizes are roughly equivalent.\n",
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of Graph Vertices: 18601\n",
      "Number of Graph Edges: 52243\n",
      "Maximum Edge Weight: 92\n",
      "Minimum Edge Weight: 1\n"
     ]
    }
   ],
   "source": [
    "lcc = tc.largest_connected_component(cut_graph)\n",
    "\n",
    "print_graph(lcc)\n",
    "\n",
    "embedding_container = tc.embedding.node2vec_embedding(lcc)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}