Code Documentation

basics

class crawlers.declarable.Declarable

Subclasses of this class can be specified by their declarations. Each declaration uniquely corresponds to a filename. Allowed class parameters are primitives (bool, int, float, str, tuple, list, set, dict) or Declarable. To pass a complex object to a class constructor, either make it extend Declarable or represent it with primitive types (e.g. the class name as a string and its parameters as a dict). A subclass of Declarable can define its short name via a static variable 'short'.

For example, let's define 3 classes. Note that the parameters 'graph' and 'param' in Crawler are not passed to the superclass and are thus ignored (undeclared).

>>> class Crawler(Declarable):
...     def __init__(self, graph, a_list, b_dict, pred1, pred2, param, **kwargs):
...         self.graph = graph  # undeclared parameter
...         self.param = param  # undeclared parameter
...         super(Crawler, self).__init__(a_list=a_list, b_dict=b_dict, pred1=pred1, pred2=pred2, **kwargs)
...
>>> class Predictor(Declarable):
...     def __init__(self, x, param, **kwargs):
...         self.param = param  # undeclared parameter
...         super(Predictor, self).__init__(x=x, **kwargs)
...
>>> class GNN(Declarable): pass

Then create an object with two Declarable parameters, where the first one has its own Declarable parameter. Print its declaration and filename.

>>> g = GraphCollections.get('example')
>>> c = Crawler(graph=g,
...             a_list=[1, 'a'],
...             b_dict={'b': 2},
...             pred1=Predictor(x=100, param=11, gnn=GNN(conv="SAGE")),
...             pred2=Predictor(x=200, param=22),
...             param="this is undeclared")
>>> print(c.declaration)
(<class '__main__.Crawler'>, {'a_list': [1, 'a'], 'b_dict': {'b': 2}, 'pred1': (<class '__main__.Predictor'>, {'x': 100, 'gnn': (<class '__main__.GNN'>, {'conv': 'SAGE'})}), 'pred2': (<class '__main__.Predictor'>, {'x': 200})})
>>> f = declaration_to_filename(c.declaration)
>>> print(f)
Crawler(a_list=[1, 'a'];b_dict={'b': 2};pred1=Predictor@;pred2=Predictor@)/Predictor(gnn=GNN@;x=100)/GNN(conv=SAGE)/Predictor(x=200)

In order to construct the same object as c, we should pass the undeclared arguments:

>>> c_ = Declarable.from_declaration(
...     filename_to_declaration(f),
...     [(Crawler, {'graph': g, 'param': "redeclared"}),  # for Crawler
...      (Predictor, {'param': 11}),  # for 'pred1' parameter
...      (Predictor, {'param': 22})])  # for 'pred2' parameter

Note that if several parameters have the same type, the order in the list should correspond to the order of parameters in the constructor.

The declaration of an object recursively contains the declarations of all its embedded Declarable parameters.

A filename is built from a complex declaration using depth-first search. Filenames of embedded declarations are separated with '/', i.e. correspond to subfolders. '@' after a class name marks an embedded declaration.

NOTE: a problem can occur if a single filename is longer than the OS limit (usually 255 characters).

__init__

List only those parameters you want to constitute the filename. If you need to ignore parameters like 'name', 'observed_set', etc. in file naming, don't pass them.

Args:
**kwargs: parameters that will be used in file naming
declaration

Get the declaration of this instance.

static extract_declaration()

Parse a complex object and return its structure as-is, replacing Declarables with their declarations. The object can be an arbitrary combination of allowed types.

static from_declaration()

Build a class object from its declaration and auxiliary undeclared parameters, e.g. graph. The declaration can contain inner declarations, which will be converted recursively. For convenience, aux_kwargs are the aux_declarations for the outermost class. NOTE: aux_declarations can contain any arguments, including non-Declarable ones.

Parameters:
  • declaration – pair (class, kwargs)
  • aux_declarations – list of declarations in DFS order.
  • aux_kwargs – auxiliary keyword arguments for the outermost declaration.
Returns:

the constructed class instance

static is_declaration()

Check if the object is a declaration, i.e. a tuple (class, kwargs).

crawlers.declarable.declaration_to_filename()

Convert a crawler declaration into a filename. Uniqueness is maintained.

crawlers.declarable.filename_to_declaration()

Convert a filename into a crawler declaration. Uniqueness is maintained.
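
Since the two functions are mutually inverse, the declaration from the example above survives a round trip through its filename:

>>> declaration_to_filename(filename_to_declaration(f)) == f
True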

class base.cgraph.MyGraph

Graph object representing nodes and undirected unweighted edges. Uses the SNAP library for graph operations. The graph is stored in edgelist format.

Fast operations via snap:

  • IO operations
  • nodes/edges addition/iteration
  • maximal degree, clustering coefficient, extraction of giant component
  • getting random nodes, neighbours

Other methods:

  • graph statistics are available via index access; once computed, a statistic is saved to file.
  • converting to networkit (for computing some statistics) and networkx (for drawing) graphs

NOTES:

  • When the graph is modified, all computed statistics are removed. If you want to keep the statistics file and modify the graph, make a copy of it.
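
A minimal usage sketch of the operations listed above (the graph name and printed values are illustrative):

>>> g = GraphCollections.get('konect', 'dolphins')
>>> g.nodes(), g.edges()
(62, 159)
>>> seed = g.random_node()
>>> g.deg(seed) <= g.max_deg()
True
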
__init__
Parameters:
  • path – where the graph contents is stored
  • full_name – string or tuple ([collection], [subcollection], … , name) containing at least one element; the last one is treated as the graph name. By default, name='noname', which means the graph will be stored in tmp/noname_timestamp
  • directed – ignored: undirected only
  • weighted – ignored: unweighted only
  • format – ignored: ‘ij’ only
  • not_load – if True, do not load the graph (useful for exploring stats). Note: any graph modification will then lead to a segfault
attributes()

Return a set of available attribute names.

clustering()

Get clustering of a node.

copy()

Create and return a copy of this graph. Attributes are ignored.

deg()

Get node degree.

edges()

Number of edges

full_name

Get the graph's full name prepended with all collections it belongs to.

get_attribute()

Get node attribute value.

Parameters:
  • node – node id
  • attr – attribute name
  • keys – inner keys for nested attributes, e.g. for VK graphs attr='personal', keys='alcohol'
Returns:

attribute value or None if attribute is not defined for this node
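
For instance, a nested attribute lookup following the VK example above might look like this (the node id is hypothetical):

>>> value = g.get_attribute(42, 'personal', 'alcohol')  # returns the value or None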

giant_component()

Return a new graph containing the giant component of this graph.

Parameters:inplace – if True, this graph is overwritten, all computed and saved stats will be removed.
Returns:graph
max_deg()

Get maximal node degree.

name

Get graph name.

neighbors()

Generator of neighbors of the given node in this graph.

nodes()

Number of nodes

random_neighbor()

Return a random neighbor of the given node in this graph.

random_node()

Return a random node. O(1)

random_nodes()

Return a vector of random nodes without repetition. O(N)

save()

Write current edge list of snap graph into file.

set_attributes()

Set node attribute values and save to file.

Parameters:
  • attr – attribute name
  • attr_values – dict {node id -> value}
to_dgl()

Get dgl graph representation; create a node ids mapping (dgl_node_id -> snap_node_id) and write it to the node_map variable.

Parameters:node_map – this variable will store the {dgl_node_id -> snap_node_id} mapping
Returns:undirected (all edges are reciprocal) dgl graph

to_networkx()

Get networkx graph representation.

Parameters:with_attributes – if True, add all attributes to the networkx graph; absent attributes are stored as Nones.
Returns:undirected networkx graph
class base.cgraph.StatDecoder
class graph_io.GraphCollections[source]

Manager of graph data. By calling the method get(graph_full_name), it loads the graph from file if present or downloads it from an online graph collection. graph_full_name is a string or tuple ([collection], [subcollection], … , name) containing at least one element; the last one is treated as the graph name. The corresponding graph file is stored at collection/subcollection/../name.format.

networkrepository collection is available.

Example:

>>> graph = GraphCollections.get('konect', 'dolphins')

static get(*full_name, directed=False, giant_only=True, self_loops=False, not_load=False) → base.cgraph.MyGraph[source]

Read a graph from storage or download it from the specified collection. To re-apply giant_only and self_loops to an already downloaded graph, remove its file manually.

Parameters:
  • full_name – string or sequence [collection], [subcollection], … , name containing at least one element; the last one is treated as the graph name. In case of the konect collection, the graph name can be any of e.g. 'CL', 'Actor collaborations', or 'actor-collaborations'.
  • directed – undirected by default
  • giant_only – giant component instead of full graph. Component extraction is applied only once when the graph is downloaded.
  • self_loops – self loops are removed by default. Applied only once when the graph is downloaded.
  • not_load – if True, do not load the graph (useful for exploring stats). Note: any graph modification will then lead to a segfault
Returns:

MyGraph object

static get_by_path(path: str, not_load=False, store=True) → base.cgraph.MyGraph[source]

Create and load a graph from the specified file path. If the path is <GRAPHS_DIR>/a/b/name.ij, the full_name will be ('a', 'b', 'name').
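
A sketch of the path-to-name correspondence (the path is illustrative; <GRAPHS_DIR> stands for the graphs root directory):

>>> g = GraphCollections.get_by_path('<GRAPHS_DIR>/konect/dolphins.ij')
>>> g.full_name
('konect', 'dolphins')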

static register_new_graph(*full_name) → base.cgraph.MyGraph[source]

Create a new MyGraph object, define its path corresponding to the specified full_name.

NOTE: by default the graph is not loaded; call load() if you want to use this object.

Parameters:full_name – string or sequence [collection], [subcollection], … , name containing at least one element, the last one is treated as graph name.
Returns:new MyGraph

predictors

class search.predictors.simple_predictors.Predictor(name=None, **kwargs)[source]

Parent class for node property prediction based on graph neighborhood.

__init__(name=None, **kwargs)[source]

List only those parameters you want to constitute the filename. If you need to ignore parameters like 'name', 'observed_set', etc. in file naming, don't pass them.

Args:
**kwargs: parameters that will be used in file naming
reset()[source]

Reset the model parameters. Makes the model untrained.

extract_features(node, crawler_helper: crawlers.cadvanced.NodeFeaturesUpdatableCrawlerHelper)[source]

Extract feature vector for a node using NodeFeaturesUpdatableCrawlerHelper.

train(Xs, ys)[source]

Train the model on data samples (Xs, ys).

predict_score(X)[source]

Compute the score for feature vector X. It should preferably lie in the interval [0, 1] and can be interpreted as the probability of the target class.

class search.predictors.simple_predictors.SklearnPredictor(model: str, feature_extractor: search.feature_extractors.NeighborsFeatureExtractor, max_train_samples=None, name=None, **model_kwargs)[source]

Predictor based on a scikit-learn model.

__init__(model: str, feature_extractor: search.feature_extractors.NeighborsFeatureExtractor, max_train_samples=None, name=None, **model_kwargs)[source]
Parameters:
  • model – name of sklearn model, e.g. ‘GradientBoostingClassifier’
  • max_train_samples – upper limit on the amount of training data (relevant for XGB, RF)
  • model_kwargs – keyword args for the model
  • feature_extractor – callable (neighborhood, index, graph, oracle) -> (Xs, ys)
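
For example, a hedged construction sketch (the NeighborsFeatureExtractor arguments are assumptions; n_estimators is a standard GradientBoostingClassifier keyword):

>>> fe = NeighborsFeatureExtractor()  # assumed default-constructible
>>> pred = SklearnPredictor('GradientBoostingClassifier', fe,
...                         max_train_samples=10000, n_estimators=100)
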
reset()[source]

Reset the model parameters. Makes the model untrained.

train(Xs, ys)[source]

Train the model on data samples (Xs, ys).

predict_score(X)[source]

Compute the score for feature vector X. It should preferably lie in the interval [0, 1] and can be interpreted as the probability of the target class.

class search.predictors.simple_predictors.MaximumTargetNeighborsPredictor(name=None, **kwargs)[source]

Predictor that favors nodes with the maximum number of target neighbors.

__init__(name=None, **kwargs)[source]
extract_features(node, crawler_helper: crawlers.cadvanced.NodeFeaturesUpdatableCrawlerHelper)[source]

Extract feature vector for a node using NodeFeaturesUpdatableCrawlerHelper.

predict_score(X)[source]

Assumes X = number of target neighbors.

class search.predictors.gnn_predictors.GNNet(conv_class: str, layer_sizes: tuple, activation='torch.relu', merge='mean', **conv_kwargs)[source]

GNN network based on torch.nn.Module

__init__(conv_class: str, layer_sizes: tuple, activation='torch.relu', merge='mean', **conv_kwargs)[source]
Parameters:
  • conv_class – name of convolution class, e.g. ‘SAGEConv’.
  • layer_sizes – layers’ sizes, except output (always = GNNet.out_dim).
  • activation – activation between layers, default is ‘torch.relu’.
  • conv_kwargs – additional arguments to DGL convolution layer.
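
A construction sketch (layer sizes are illustrative; aggregator_type is a DGL SAGEConv argument passed through conv_kwargs):

>>> net = GNNet(conv_class='SAGEConv', layer_sizes=(64, 32),
...             activation='torch.relu', aggregator_type='mean')
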
forward(g, inputs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class search.predictors.gnn_predictors.GNNPredictor(conv_class: str, layer_sizes: tuple, activation='torch.relu', attributes=None, epochs=300, batch=100, learn_rate=0.01, name=None, **conv_kwargs)[source]
__init__(conv_class: str, layer_sizes: tuple, activation='torch.relu', attributes=None, epochs=300, batch=100, learn_rate=0.01, name=None, **conv_kwargs)[source]
Parameters:
  • conv_class
  • layer_sizes – hidden layers' sizes, except input (defined at the first training step) and output (always = GNNet.out_dim). An empty layer_sizes corresponds to a 1-layer net.
  • activation
  • attributes – list of graph nodes attributes to include into feature vectors
  • epochs – number of training epochs
  • batch – number of dgl graphs to unite in 1 training batch
  • learn_rate – optimizer learning rate
  • name – name to use in plots
  • conv_kwargs – additional arguments to GNN convolution
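
For example (all values are illustrative; the attribute name is hypothetical):

>>> gnn = GNNPredictor('SAGEConv', layer_sizes=(64,), attributes=['age'],
...                    epochs=300, batch=100, learn_rate=0.01,
...                    aggregator_type='mean')
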
reset()[source]

Reset the model parameters. Makes the model untrained.

train(Xs, ys)[source]

Train the model on sequence of graphs with their inputs (Xs) and classification answers (ys).

extract_features(node, crawler_helper: crawlers.cadvanced.NodeFeaturesUpdatableCrawlerHelper) → tuple[source]

Extract feature vector for a node using NodeFeaturesUpdatableCrawlerHelper.

predict_score(X)[source]

Compute the score for feature vector X. It should preferably lie in the interval [0, 1] and can be interpreted as the probability of the target class.

crawlers

simple

class crawlers.cbasic.Crawler

The root class for all our crawlers. Keeps fields:

  • observed_graph - observed sample of all crawled and visible nodes and edges,
  • crawled_set - set of all crawled nodes,
  • observed_set - set of observed but not yet crawled nodes.

And defines the core interface:

  • int next_seed (self) - choose the next seed among the observed ones; each crawler defines its own strategy.
  • vector[int] crawl (self, int seed) - crawl the specified node and return a list (vector) of newly observed nodes.
  • int crawl_budget (self, int budget) - crawl budget number of nodes according to the strategy.

A Crawler is initialized with the original graph (which must NOT be modified) and, optionally, with an observed_graph, crawled_set, and observed_set to start with.

A Crawler has a declaration that determines the instance. The declaration can be uniquely transformed into a string filename and back, in order to store the results of measurements.
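
A typical crawling loop, using one of the crawlers documented below (the graph name is illustrative; we assume BreadthFirstSearchCrawler accepts initial_seed via InitialSeedCrawlerHelper):

>>> g = GraphCollections.get('konect', 'dolphins')
>>> crawler = BreadthFirstSearchCrawler(g, initial_seed=1)
>>> crawler.crawl_budget(20)
>>> len(crawler.crawled_set)
20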

__init__
Parameters:
  • graph – original graph, must remain unchanged
  • name – specify to use in pictures, by default name == filename generated from declaration
  • observed_graph – optionally use a given observed graph, NOTE: the object will be modified, make a copy if needed
  • crawled_set – optionally use a given crawled set, NOTE: the object will be modified, make a copy if needed
  • observed_set – optionally use a given observed set, NOTE: the object will be modified, make a copy if needed
  • kwargs – all additional parameters (needed in subclasses) - they will be encoded into string declaration
crawl()

Crawl the specified node. The observed graph is updated, as are the crawled and observed sets.

Parameters:seed – node id to crawl
Returns:vector (list) of newly seen nodes
crawl_budget()

Perform budget number of crawls according to the algorithm.

Parameters:
  • budget – the number of nodes to be crawled. If no more nodes can be crawled, NoNextSeedError is raised
  • args – customizable additional args for subclasses
Returns:

crawled_set

Get ids of crawled nodes.

next_seed()

Core of the crawler - the algorithm to choose the next seed to be crawled. Seed must be a node of the original graph.

Returns:node id as int
nodes_set

Get nodes’ ids of observed graph (crawled and observed).

observe()

Add the node to observed set and observed graph.

observed_set

Get ids of observed but not yet crawled nodes.

class crawlers.cbasic.InitialSeedCrawlerHelper

Crawler helper interface. The starting seed type is specified in the constructor. Options:

  • None - randomly chosen from the original graph
  • <integer> - start from this specific node
  • <string> - some choosing strategy, e.g. 'target'

NOTE: when extending multiple Crawler helpers, this one should probably go before the others since it changes the observed set.

__init__
Parameters:initial_seed – if the observed set is empty, the crawler will start from the given initial node or use the specified strategy. (If it is not empty, the starting seed is determined by the next_seed() call.) If None (the default), a random node of the original graph is used as the initial one.
choose_initial_seed()

Choose an initial seed depending on the specified parameter. The options are:

  • None - randomly choose from all nodes
  • <integer> - start from this specific node
  • <string> - some choosing strategy, e.g. 'target', expected to be overridden in a subclass
class crawlers.cbasic.RandomCrawler

Crawls a random node from the observed ones.

__init__

Initialize self. See help(type(self)) for accurate signature.

crawl()

Crawl the specified node. The observed graph is updated, as are the crawled and observed sets.

Parameters:seed – node id to crawl
Returns:vector (list) of newly seen nodes
next_seed()

Core of the crawler - the algorithm to choose the next seed to be crawled. Seed must be a node of the original graph.

Returns:node id as int
class crawlers.cbasic.RandomWalkCrawler

Crawls a random neighbor of the previously crawled node. If it cannot crawl, it goes to a random neighbor and proceeds from there.

__init__

Initialize self. See help(type(self)) for accurate signature.

next_seed()

Core of the crawler - the algorithm to choose the next seed to be crawled. Seed must be a node of the original graph.

Returns:node id as int
class crawlers.cbasic.BreadthFirstSearchCrawler

Crawls in a breadth first manner.

__init__

Initialize self. See help(type(self)) for accurate signature.

crawl()

Crawl the specified node. The observed graph is updated, as are the crawled and observed sets.

Parameters:seed – node id to crawl
Returns:vector (list) of newly seen nodes
next_seed()

Core of the crawler - the algorithm to choose the next seed to be crawled. Seed must be a node of the original graph.

Returns:node id as int
class crawlers.cbasic.DepthFirstSearchCrawler

Crawls in a depth first manner.

__init__

Initialize self. See help(type(self)) for accurate signature.

crawl()

Crawl the specified node. The observed graph is updated, as are the crawled and observed sets.

Parameters:seed – node id to crawl
Returns:vector (list) of newly seen nodes
next_seed()

Core of the crawler - the algorithm to choose the next seed to be crawled. Seed must be a node of the original graph.

Returns:node id as int
class crawlers.cbasic.SnowBallCrawler

A variant of BFS that takes each neighbor into the queue with probability p. See http://www.soundarajan.org/papers/CrawlingAnalysis.pdf and https://arxiv.org/pdf/1004.1729.pdf

__init__
Parameters:p – probability of taking neighbor into queue
crawl()

Crawl the specified node. The observed graph is updated, as are the crawled and observed sets.

Parameters:seed – node id to crawl
Returns:vector (list) of newly seen nodes
next_seed()

Core of the crawler - the algorithm to choose the next seed to be crawled. Seed must be a node of the original graph.

Returns:node id as int
class crawlers.cbasic.MaximumObservedDegreeCrawler

Crawls a node with maximal observed degree.

__init__
Parameters:batch – batch size
crawl()

Crawl the specified node and update the observed ND_Set.

next_seed()

Next node with highest degree

update()

Update priority structures with the specified nodes (assuming their degrees have changed).

class crawlers.cbasic.PreferentialObservedDegreeCrawler

Selects for crawling one of the observed nodes with probability proportional to its observed degree.

__init__
Parameters:batch – batch size
crawl()

Crawl the specified node and update the observed ND_Set.

next_seed()

Core of the crawler - the algorithm to choose the next seed to be crawled. Seed must be a node of the original graph.

Returns:node id as int
update()

Update priority structures with the specified nodes (assuming their degrees have changed).

class crawlers.cbasic.MaximumExcessDegreeCrawler

Benchmark crawler - greedily selects the next node with maximal excess (real - observed) degree.

__init__

Initialize self. See help(type(self)) for accurate signature.

crawl()

Crawl the specified node and add newly observed nodes to the ND_Set.

next_seed()

Next node with highest real (unknown) degree

class crawlers.cbasic.MaximumRealDegreeCrawler

Benchmark crawler - greedily selects the next node with maximal real degree.

__init__

Initialize self. See help(type(self)) for accurate signature.

crawl()

Crawl the specified node and add newly observed nodes to the ND_Set.

next_seed()

Next node with highest real (unknown) degree

single-predictor

class search.predictor_based_crawlers.predictor_based.PredictorBasedCrawler(graph: base.cgraph.MyGraph, predictor: search.predictors.simple_predictors.Predictor, oracle: search.oracles.Oracle, training_strategy=None, re_estimate='after_train', initial_seed=None, name=None, **kwargs)[source]

Parent class for crawlers based on a predictor that can estimate the score (probability) that an observed node is a target.

__init__(graph: base.cgraph.MyGraph, predictor: search.predictors.simple_predictors.Predictor, oracle: search.oracles.Oracle, training_strategy=None, re_estimate='after_train', initial_seed=None, name=None, **kwargs)[source]
Parameters:
  • graph
  • predictor – node predictor model (or list of predictors).
  • oracle – target node detector, returns 1/0/None.
  • training_strategy – strategy to train predictor, one of ‘boost’, ‘online’.
  • re_estimate – which nodes to re-estimate after each seed crawling: either all observed nodes (precise) or only the seed's neighbors (faster but less precise).
  • initial_seed – one of [None, <integer>, ‘target’]
  • kwargs – args for subclass
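
A wiring sketch under assumptions (oracle construction is not documented here; the predictor is from the predictors section above):

>>> crawler = PredictorBasedCrawler(g, predictor=pred, oracle=oracle,
...                                 training_strategy='online',
...                                 initial_seed='target')
>>> crawler.crawl_budget(1000)
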
estimate_node(node) → float[source]

Estimate class=1 score (probability) by the predictor.

crawl(seed) → list[source]

Crawl the specified node. The observed graph is updated, as are the crawled and observed sets.

Parameters:seed – node id to crawl
Returns:vector (list) of newly seen nodes
re_estimate_nodes(nodes)[source]

Recompute predictor estimations for the given nodes.

train_predictor(Xs, ys)[source]

Train predictor using the training data; update estimations for all observed nodes.

next_seed()[source]

Choose the next seed as one with maximal score.

multi-predictor

class search.predictor_based_crawlers.mab.MultiPredictorCrawler(graph: base.cgraph.MyGraph, predictor: search.predictor_based_crawlers.mab.MultiPredictor, oracle, name=None, **kwargs)[source]

Uses several predictors and a MAB (multi-armed bandit) strategy over them.

__init__(graph: base.cgraph.MyGraph, predictor: search.predictor_based_crawlers.mab.MultiPredictor, oracle, name=None, **kwargs)[source]
Parameters:predictor – multi-predictor
re_estimate_nodes(nodes)[source]

Same as in superclass, but manages both _node_score and _node_scores.

estimate_node(node) → float[source]

Estimate class=1 score (probability) by the predictor.

train_predictor(Xs, ys)[source]

Train predictor using the training data; update estimations for all observed nodes.

class search.predictor_based_crawlers.mab.MABCrawler(graph: base.cgraph.MyGraph, predictor: search.predictor_based_crawlers.mab.MultiPredictor, name=None, **kwargs)[source]
__init__(graph: base.cgraph.MyGraph, predictor: search.predictor_based_crawlers.mab.MultiPredictor, name=None, **kwargs)[source]
Parameters:predictor – multi-predictor
estimate_node(node) → float[source]

Estimate class=1 score (probability) by the predictor.

class search.predictor_based_crawlers.mab.ExponentialDynamicWeightsMultiPredictorCrawler(graph: base.cgraph.MyGraph, uniform_distribution=False, name=None, **kwargs)[source]
__init__(graph: base.cgraph.MyGraph, uniform_distribution=False, name=None, **kwargs)[source]
Parameters:uniform_distribution – if True, use a uniform additive term for the weights at each step.
crawl(seed) → list[source]

Update weights based on whether the seed to be crawled is a target.

class search.predictor_based_crawlers.mab.FollowLeaderMABCrawler(graph: base.cgraph.MyGraph, name=None, **kwargs)[source]
__init__(graph: base.cgraph.MyGraph, name=None, **kwargs)[source]
Parameters:predictor – multi-predictor
crawl(seed) → list[source]

Update weights based on whether the seed to be crawled is a target.

class search.predictor_based_crawlers.mab.BetaDistributionMultiPredictorCrawler(graph: base.cgraph.MyGraph, beta_distr_param_thr=2, name=None, **kwargs)[source]
__init__(graph: base.cgraph.MyGraph, beta_distr_param_thr=2, name=None, **kwargs)[source]
Parameters:beta_distr_param_thr
crawl(seed) → list[source]

Update weights based on whether the seed to be crawled is a target.

running

class running.history_runner.SmartCrawlersRunner(graph_full_names, crawler_decls, metric_decls, budget: int = -1)[source]

Runs several crawlers and measures several metrics for a given graph. Saves the measurement history to disk. Can run several instances of a crawler in parallel. The step sequence grows exponentially and is independent of the graph.

__init__(graph_full_names, crawler_decls, metric_decls, budget: int = -1)[source]

Initialize self. See help(type(self)) for accurate signature.
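
A construction sketch showing the declaration format (values are illustrative; the method that launches the runs is omitted here):

>>> graph_full_names = [('konect', 'dolphins')]
>>> crawler_decls = [
...     (MaximumObservedDegreeCrawler, {'batch': 1}),
...     (RandomCrawler, {}),
... ]
>>> metric_decls = [...]  # metric declarations, built analogously
>>> runner = SmartCrawlersRunner(graph_full_names, crawler_decls,
...                              metric_decls, budget=1000)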

class running.merger.ResultsMerger(graph_full_names, crawler_decls, metric_decls, budget, n_instances=None, x_lims=None, result_dir=PosixPath('/home/misha/workspace/crawling-framework.github.io/results'), numeric_only=True)[source]

ResultsMerger can aggregate and plot results saved in files. It processes all combinations of G graphs x C crawlers x M metrics, averaging over n instances of each. All missing instances are simply ignored.

Plotting functions:

  • draw_by_crawler - Draw M x G table of plots with C lines each. Ox - crawling step, Oy - metric value.
  • draw_by_metric_crawler - Draw G plots with C x M lines each. Ox - crawling step, Oy - metric value.
  • draw_by_metric - Draw C x G table of plots with M lines each. Ox - crawling step, Oy - metric value.
  • draw_aggregated - Draw G plots with M lines. Ox - C crawlers, Oy - (w)AUCC value (M curves with error bars).
  • draw_winners - Draw C stacked bars (each of M elements). Ox - C crawlers, Oy - number of wins (among G) by (w)AUCC value.

Additional functions:

  • missing_instances - Calculate how many instances of all configurations are missing.
  • move_folders - Move/remove/copy saved instances for current graphs, crawlers, metrics.

NOTES:

  • x values must be the same for all files; they are the ones generated by exponential_batch_generator() from running/runner.py
  • it is assumed that for all instances the value lists are of equal lengths (i.e. budgets); otherwise normalisation and aggregation may fail. If so, use the x_lims parameter for control.
__init__(graph_full_names, crawler_decls, metric_decls, budget, n_instances=None, x_lims=None, result_dir=PosixPath('/home/misha/workspace/crawling-framework.github.io/results'), numeric_only=True)[source]
Parameters:
  • graph_full_names – list of graphs full names.
  • crawler_decls – list of crawlers declarations.
  • metric_decls – list of metrics declarations. Non-numeric metrics will be ignored.
  • budget – results with this budget will be taken.
  • n_instances – number of instances to average over, None for all.
  • x_lims – use only specified x-limits for all plots unless another value is specified in plotting function.
  • result_dir – specify if want to use non-default directory where results are stored.
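
Typical usage once results are saved (the declaration lists are the same as for the runner above):

>>> crm = ResultsMerger(graph_full_names, crawler_decls, metric_decls, budget=1000)
>>> crm.draw_by_crawler(x_normalize=True)
>>> crm.draw_aggregated(aggregator='AUCC')
>>> crm.show_plots()
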
static names_to_path(graph_full_name: tuple, crawler_name: str, metric_name: str, budget: int, result_dir=PosixPath('/home/misha/workspace/crawling-framework.github.io/results'))[source]

Returns a file pattern, e.g. '/home/misha/workspace/crawling/results/ego-gplus/POD(batch=1)/TopK(centrality=BtwDistr,measure=Re,part=crawled,top=0.01)/*.json'

move_folders(path_from=None, path_to=None, copy=False)[source]

Move/remove/copy all saved instances for current [graphs X crawlers X metrics]. Specify the path_to parameter to move files instead of removing them.

Parameters:
  • path_from – this folder is the root for all folders to be (re)moved; it must be contained in the paths to those folders
  • path_to – this folder is the destination for all folders to be moved. If None (the default), all folders will be removed.
  • copy – set to True if you want to copy folders
missing_instances() → dict[source]

Return dict of instances where computed < n_instances.

Returns:result[graph][crawler][metric] -> missing count
draw_by_crawler(x_lims=None, x_normalize=True, sharey=True, draw_error=True, draw_each_instance=False, scale=3, title='By crawler')[source]

Draw M x G table of plots with C lines each, where M - num of metrics, G - num of graphs, C - num of crawlers. Ox - crawling step, Oy - metric value.

Parameters:
  • x_lims – x-limits for plots. Overrides x_lims passed in constructor
  • x_normalize – if True, x values are normalized to be from 0 to 1
  • draw_error – if True, fill standard deviation area around the averaged crawling curve
  • draw_each_instance – if True, show each instance
  • scale – size of plots (default 3)
  • title – figure title
draw_by_metric(x_lims=None, x_normalize=True, sharey=True, draw_error=True, scale=3, title='By metric')[source]

Draw C x G table of plots with M lines each, where M - num of metrics, G - num of graphs, C - num of crawlers. Ox - crawling step, Oy - metric value.

draw_by_metric_crawler(x_lims=None, x_normalize=True, sharey=True, swap_coloring_scheme=False, draw_error=True, scale=3, title='By metric and crawler')[source]

Draw G plots with CxM lines each, where M - num of metrics, G - num of graphs, C - num of crawlers. Ox - crawling step, Oy - metric value.

Parameters:
  • x_lims – x-limits for plots. Overrides x_lims passed in constructor
  • x_normalize – if True, x values are normalized to be from 0 to 1
  • sharey – if True, share the y axis among plots
  • swap_coloring_scheme – by default metrics differ in linestyle, crawlers differ in color. Set True to swap
  • draw_error – if True, fill standard deviation area around the averaged crawling curve
  • scale – size of plots (default 3)
  • title – figure title
get_aggregated(aggregator='AUCC', x_lims=None, median=False, print_results=False)[source]

Get results according to an aggregator (AUCC, wAUCC, TC).

Parameters:
  • x_lims – x-limits passed to the aggregator. Overrides x_lims passed in constructor
  • median – if True, compute median instead of mean
  • print_results – if True, print results
Returns:

list of results as tuples (num_instances, Graph, Crawler, Metric, mean, error)

draw_aggregated(aggregator='AUCC', x_lims=None, scale=3, sharey=True, boxplot=True, xticks_rotation=90, title=None, draw_count=True)[source]

Draw G plots with M lines. Ox - C crawlers, Oy - AUCC value (M curves with error bars). M - num of metrics, G - num of graphs, C - num of crawlers

Parameters:
  • aggregator – function translating crawling curve into 1 number. AUCC (default) or wAUCC
  • x_lims – x-limits passed to aggregator. Overrides x_lims passed in constructor
  • scale – size of plots (default 3)
  • sharey – if True, share the y axis among plots
  • xticks_rotation – rotate x-ticks (default 90 degrees)
  • title – figure title
  • draw_count – if True, prepend number of instances to label
draw_winners(aggregator='AUCC', x_lims=None, scale=8, xticks_rotation=90, title=None)[source]

Draw C stacked bars (each of M elements). Ox - C crawlers, Oy - number of wins (among G) by (w)AUCC value. Graphs where not all configurations are present are skipped.

Parameters:
  • aggregator – function translating crawling curve into 1 number. AUCC (default) or wAUCC
  • x_lims – x-limits passed to aggregator. Overrides x_lims passed in constructor
  • scale – size of plots (default 8)
  • xticks_rotation – rotate x-ticks (default 90 degrees)
  • title – figure title
show_plots()[source]

Show drawn matplotlib plots

static next_file(folder: pathlib.Path)[source]

Return a path with the smallest number not present in the folder. E.g. if the folder has 0.json and 2.json, it returns the path for 1.json.

static merge_folders(*path, not_earlier_than=None, not_later_than=None, check_identical=False, copy=False)[source]

Merge all results into one folder: path[1], path[2], etc. are merged into path[0]. Name collisions are resolved by assigning new smallest numbers, e.g. when 0.json is added to a folder with 0.json and 2.json, it becomes 1.json.

Args:
*path: list of paths, each structured like the original results/ directory.
not_earlier_than: look only for files with modification datetime not earlier than specified.
not_later_than: look only for files with modification datetime not later than specified.
check_identical: before renaming, check whether equally named files are identical.
copy: if True, copy all moved elements.
class running.metrics.Metric(is_numeric, name=None, callback=None, **kwargs)[source]

Base class for metrics on the crawling process. A Metric has a callback function that takes a crawler and returns some value. The metric filename is constructed only from the class name and kwargs, so they should fully identify the metric. To create a custom metric, implement a subclass with a proper callback(). As an example, see OracleBasedMetric.

__init__(is_numeric, name=None, callback=None, **kwargs)[source]

NOTE: is_numeric, name, and callback are not used in filename.

Parameters:
  • is_numeric – if True, the metric must return a number that ResultsMerger can operate with, and the kwargs are used in the folder name. If False, the metric can return any structure, only the class is used in the folder name, and ResultsMerger ignores it.
  • name – name to use in plots
  • callback – metric function: callback(crawler, **kwargs) -> number
  • kwargs – additional arguments to the callback function
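
A sketch of a hypothetical custom metric subclass that measures the fraction of crawled nodes:

>>> class CrawledFraction(Metric):
...     def __init__(self, graph, name=None):
...         n = graph.nodes()  # total number of nodes in the original graph
...         super().__init__(is_numeric=True, name=name or 'crawled fraction',
...                          callback=lambda crawler, **kw: len(crawler.crawled_set) / n)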

static from_declaration(declaration, **aux_kwargs)[source]

Build a Metric instance from its declaration