kg_covid_19 package

Subpackages

Submodules

kg_covid_19.download module

kg_covid_19.download.download(yaml_file: str, output_dir: str, ignore_cache: bool = False) → None

Downloads data files from a list of URLs (default: download.yaml) into a data directory (default: data/).

Parameters

yaml_file – path to the YAML file that lists the data files to download
output_dir – directory to download data to
ignore_cache – ignore the cache and download files even if they already exist [False]

Returns

None.
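The caching behaviour described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper name download_all, the injectable fetch argument, and the rule of naming each local file after the last URL path component are all assumptions, and the YAML parsing step is left out so that the sketch needs only the standard library.

```python
import os
from typing import Callable, Iterable, List
from urllib.request import urlretrieve


def download_all(urls: Iterable[str],
                 output_dir: str,
                 ignore_cache: bool = False,
                 fetch: Callable[[str, str], object] = urlretrieve) -> List[str]:
    """Fetch each URL into output_dir, skipping files that already exist.

    Returns the local paths that were actually (re)downloaded.
    """
    os.makedirs(output_dir, exist_ok=True)
    fetched = []
    for url in urls:
        # Hypothetical naming rule: last path component of the URL.
        local_path = os.path.join(output_dir, url.rstrip("/").split("/")[-1])
        if os.path.exists(local_path) and not ignore_cache:
            continue  # a cached copy exists, so it is reused
        fetch(url, local_path)
        fetched.append(local_path)
    return fetched
```

Passing ignore_cache=True forces every file to be fetched again, which mirrors the flag documented for download().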

kg_covid_19.make_holdouts module

kg_covid_19.make_holdouts.df_to_tsv(df: pandas.core.frame.DataFrame, outfile: str, sep='\t', index=False) → None

kg_covid_19.make_holdouts.make_holdouts(nodes: str, edges: str, output_dir: str, train_fraction: float, validation: bool, seed=42) → None

Prepare positive and negative edges for testing and training (see the run.py holdouts command for documentation).

Parameters

nodes – nodes of the input graph, in KGX TSV format [data/merged/nodes.tsv]
edges – edges of the input graph, in KGX TSV format [data/merged/edges.tsv]
output_dir – directory to output edges and the new graph [data/edges/]
train_fraction – fraction of edges to emit as training
validation – whether to also make validation edges [False]
seed – random seed [42]

Returns

None.

kg_covid_19.make_holdouts.make_negative_edges(nodes_df: pandas.core.frame.DataFrame, edges_df: pandas.core.frame.DataFrame, edge_label: str = 'negative_edge', relation: str = 'negative_edge') → pandas.core.frame.DataFrame

Given a graph (as nodes and edges pandas dataframes), select num_edges holdouts that are NOT present in the graph.

Parameters

nodes_df – pandas dataframe containing node info
edges_df – pandas dataframe containing edge info
relation – string to put in relation column
edge_label – string to put in edge_label column

Returns

pandas dataframe of negative edges
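The idea of picking node pairs that are not connected in the graph can be sketched in plain Python. This is an illustration under stated assumptions, not the package's code: the function name sample_negative_edges, the explicit num_edges and seed arguments, the undirected treatment of edges, and the subject/object column names are all hypothetical. It enumerates every unconnected pair, which is only practical for small graphs.

```python
import random
from itertools import combinations
from typing import Dict, List, Set, Tuple


def sample_negative_edges(nodes: List[str],
                          edges: Set[Tuple[str, str]],
                          num_edges: int,
                          edge_label: str = "negative_edge",
                          relation: str = "negative_edge",
                          seed: int = 42) -> List[Dict[str, str]]:
    """Return num_edges node pairs that are not edges of the graph."""
    rng = random.Random(seed)
    # Assumption: the graph is undirected, so (a, b) equals (b, a).
    existing = {frozenset(edge) for edge in edges}
    # Every pair of distinct nodes that is not already connected.
    candidates = [
        (a, b)
        for a, b in combinations(sorted(set(nodes)), 2)
        if frozenset((a, b)) not in existing
    ]
    if num_edges > len(candidates):
        raise ValueError("not enough unconnected node pairs")
    chosen = rng.sample(candidates, num_edges)
    # Column names "subject" and "object" are assumed for illustration.
    return [
        {"subject": a, "edge_label": edge_label,
         "object": b, "relation": relation}
        for a, b in chosen
    ]
```

For a large graph, drawing random pairs and rejecting those that already exist would avoid the quadratic enumeration used here.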

kg_covid_19.make_holdouts.make_positive_edges(nodes_df: pandas.core.frame.DataFrame, edges_df: pandas.core.frame.DataFrame, train_fraction: float) → List[pandas.core.frame.DataFrame]

Positive edges are randomly selected from the edges in the graph, if and only if both nodes participating in the edge have a degree greater than min_degree (to avoid creating disconnected components). Each selected edge is then removed from the output graph. Negative edges are selected by randomly choosing pairs of nodes that are not connected by an edge.

Parameters

nodes_df – pandas dataframe with node info, generated from KGX TSV file
edges_df – pandas dataframe with edge info, generated from KGX TSV file
train_fraction – fraction of input edges to keep as training edges; the remainder are emitted as test (and optionally validation) edges

Returns

pandas dataframes:

training_edges_df: a dataframe of training edges, i.e. the graph with the positive edges selected for test removed

test_edges_df: a dataframe of positive test edges
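The degree-guarded selection described above can be sketched without pandas. This is a minimal sketch under assumptions, not the package's implementation: the name split_positive_edges, the min_degree default of 1, the seed argument, and the reading of train_fraction as the share of edges kept for training are all hypothetical choices made for illustration.

```python
import random
from collections import Counter
from typing import List, Tuple

Edge = Tuple[str, str]


def split_positive_edges(edges: List[Edge],
                         train_fraction: float,
                         min_degree: int = 1,
                         seed: int = 42) -> Tuple[List[Edge], List[Edge]]:
    """Split edges into (training, test) lists with a degree guard."""
    rng = random.Random(seed)
    # Current degree of every node in the remaining (training) graph.
    degree: Counter = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    num_test = int(round(len(edges) * (1 - train_fraction)))
    shuffled = list(edges)
    rng.shuffle(shuffled)
    train: List[Edge] = []
    test: List[Edge] = []
    for a, b in shuffled:
        eligible = degree[a] > min_degree and degree[b] > min_degree
        if len(test) < num_test and eligible:
            # Hold the edge out and remove it from the training graph.
            test.append((a, b))
            degree[a] -= 1
            degree[b] -= 1
        else:
            train.append((a, b))
    return train, test
```

Because degrees are updated after each removal, no node loses its last remaining edge. If too few edges pass the guard, the test list comes out shorter than requested.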

kg_covid_19.make_holdouts.tsv_to_df(tsv_file: str, *args, **kwargs) → pandas.core.frame.DataFrame

Read in a TSV file and return a pandas dataframe.

Parameters: tsv_file – file to read in
Returns: pandas dataframe

kg_covid_19.query module

kg_covid_19.query.parse_query_rq(rq_file) → dict

Parameters: rq_file – SPARQL query in grlc rq format
Returns: dict with parsed info about the SPARQL query
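In the grlc rq format, metadata is written as comment lines that start with #+ (for example #+ summary: or #+ endpoint:), followed by the SPARQL query itself. The sketch below shows one way such a file could be split into a dict. It is not the package's parser: the function name parse_rq_text, the "query" key, and the choice to handle only single-line key: value decorators are assumptions, and the keys returned by parse_query_rq are not documented here.

```python
from typing import Dict


def parse_rq_text(rq_text: str) -> Dict[str, str]:
    """Split grlc-style text into decorator values and the query body."""
    parsed: Dict[str, str] = {}
    query_lines = []
    for line in rq_text.splitlines():
        if line.startswith("#+"):
            # Split on the first colon only, so URLs in values stay intact.
            key, sep, value = line[2:].partition(":")
            if sep:
                parsed[key.strip()] = value.strip()
        else:
            query_lines.append(line)
    # Hypothetical key name for the remaining SPARQL text.
    parsed["query"] = "\n".join(query_lines).strip()
    return parsed


example = """#+ summary: List ten triples
#+ endpoint: http://example.org/sparql
SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10
"""
info = parse_rq_text(example)
```

Multi-line decorators, such as a tags list, would need a real YAML parser and are deliberately out of scope for this sketch.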

kg_covid_19.query.result_dict_to_tsv(result_dict: dict, outfile: str) → None

kg_covid_19.query.run_query(query: str, endpoint: str, return_format='json') → dict

kg_covid_19.transform module

kg_covid_19.transform.transform(input_dir: str, output_dir: str, sources: List[str] = None) → None

Call scripts in kg_covid_19/transform/[source name]/ to transform each source into a graph format that KGX can ingest directly, in either TSV or JSON format: https://github.com/NCATS-Tangerine/kgx/blob/master/data-preparation.md

Parameters

input_dir – directory to import data from
output_dir – directory to output data to
sources – list of sources to transform

Returns

None.
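The per-source dispatch described above can be pictured as a lookup from source name to a transform callable. The sketch below is hypothetical throughout: the run_transforms helper, the registry argument, the per-source output subdirectory, and the behaviour of running every registered source when sources is None are assumptions made for illustration and are not taken from the package.

```python
import os
from typing import Callable, Dict, List, Optional

TransformFn = Callable[[str, str], None]


def run_transforms(input_dir: str,
                   output_dir: str,
                   registry: Dict[str, TransformFn],
                   sources: Optional[List[str]] = None) -> List[str]:
    """Run the transform registered for each requested source name."""
    if sources is None:
        # Assumed default: transform every registered source.
        sources = list(registry)
    unknown = [name for name in sources if name not in registry]
    if unknown:
        raise ValueError("unknown sources: " + ", ".join(unknown))
    for name in sources:
        # Each source writes into its own subdirectory (an assumption).
        registry[name](input_dir, os.path.join(output_dir, name))
    return sources
```

Rejecting unknown source names before running anything means a typo in the sources list fails early instead of after some transforms have already written output.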

Module contents

kg_covid_19.download(yaml_file: str, output_dir: str, ignore_cache: bool = False) → None

Downloads data files from a list of URLs (default: download.yaml) into a data directory (default: data/).

Parameters

yaml_file – path to the YAML file that lists the data files to download
output_dir – directory to download data to
ignore_cache – ignore the cache and download files even if they already exist [False]

Returns

None.

kg_covid_19.transform(input_dir: str, output_dir: str, sources: List[str] = None) → None

Call scripts in kg_covid_19/transform/[source name]/ to transform each source into a graph format that KGX can ingest directly, in either TSV or JSON format: https://github.com/NCATS-Tangerine/kgx/blob/master/data-preparation.md

Parameters

input_dir – directory to import data from
output_dir – directory to output data to
sources – list of sources to transform

Returns

None.