kg_covid_19 package
Submodules
kg_covid_19.download module
kg_covid_19.download.download(yaml_file: str, output_dir: str, ignore_cache: bool = False) → None
Downloads data files from a list of URLs (default: download.yaml) into a data directory (default: data/).
- Args:
yaml_file: A string pointing to the YAML file that lists the data to download.
output_dir: A string pointing to the location to download data to.
ignore_cache: Ignore cache and download files even if they exist [false]
- Returns:
None.
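The effect of ignore_cache can be illustrated with a minimal sketch. This is not the package's implementation: `download_items`, the `(url, filename)` item pairs, and the injectable `fetch` callable are hypothetical names chosen for the example, and the YAML parsing step is left out.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def download_items(
    items: Iterable[Tuple[str, str]],
    output_dir: str,
    fetch: Callable[[str], bytes],
    ignore_cache: bool = False,
) -> List[str]:
    """Write each (url, filename) item into output_dir, skipping cached files.

    Hypothetical helper: `fetch` stands in for whatever performs the request.
    Returns the filenames that were actually fetched.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    fetched = []
    for url, filename in items:
        target = out / filename
        # A file already on disk counts as cached unless the caller opts out.
        if target.exists() and not ignore_cache:
            continue
        target.write_bytes(fetch(url))
        fetched.append(filename)
    return fetched
```

Passing the fetch function in as an argument keeps the sketch free of network access, so the cache logic can be exercised on its own.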
kg_covid_19.make_holdouts module
kg_covid_19.make_holdouts.df_to_tsv(df: pandas.core.frame.DataFrame, outfile: str, sep='\t', index=False) → None
kg_covid_19.make_holdouts.make_holdouts(nodes: str, edges: str, output_dir: str, train_fraction: float, validation: bool, seed=42) → None
Prepare positive and negative edges for testing and training (see the run.py holdouts command for documentation).
- Args:
nodes: nodes of input graph, in KGX TSV format [data/merged/nodes.tsv]
edges: edges of input graph, in KGX TSV format [data/merged/edges.tsv]
output_dir: directory to output edges and new graph [data/edges/]
train_fraction: fraction of edges to emit as training
validation: should we make validation edges? [False]
seed: random seed [42]
- Returns:
None.
kg_covid_19.make_holdouts.make_negative_edges(nodes_df: pandas.core.frame.DataFrame, edges_df: pandas.core.frame.DataFrame, edge_label: str = 'negative_edge', relation: str = 'negative_edge') → pandas.core.frame.DataFrame
Given a graph (as nodes and edges pandas dataframes), select num_edges holdouts that are NOT present in the graph.
- Parameters
nodes_df – pandas dataframe containing node info
edges_df – pandas dataframe containing edge info
edge_label – string to put in edge_label column
relation – string to put in relation column
- Returns
pandas dataframe containing the negative edges
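Selecting edges that are not present in a graph can be sketched with rejection sampling. The version below is an illustration under stated assumptions, not the package's code: it uses plain tuples instead of pandas dataframes, treats the graph as undirected, and `sample_negative_edges` with its explicit `num_edges` and `seed` arguments is a hypothetical helper.

```python
import random
from typing import List, Sequence, Tuple


def sample_negative_edges(
    nodes: Sequence[str],
    edges: Sequence[Tuple[str, str]],
    num_edges: int,
    seed: int = 42,
) -> List[Tuple[str, str]]:
    """Rejection-sample distinct node pairs that are not connected by an edge."""
    unique_nodes = sorted(set(nodes))
    node_set = set(unique_nodes)
    # Assumption: the graph is undirected, so (a, b) and (b, a) are the same edge.
    existing = {frozenset(e) for e in edges}
    # Existing edges between two distinct known nodes reduce the available pairs.
    blocked = {p for p in existing if len(p) == 2 and p <= node_set}
    n = len(unique_nodes)
    max_possible = n * (n - 1) // 2 - len(blocked)
    if num_edges > max_possible:
        raise ValueError("not enough unconnected node pairs in the graph")

    rng = random.Random(seed)
    chosen = set()
    negatives = []
    while len(negatives) < num_edges:
        u, v = rng.sample(unique_nodes, 2)  # two distinct nodes
        pair = frozenset((u, v))
        if pair in existing or pair in chosen:
            continue  # reject pairs that are real edges or already picked
        chosen.add(pair)
        negatives.append((u, v))
    return negatives
```

Rejection sampling is simple and works well on sparse graphs, where most random pairs are unconnected; on dense graphs, enumerating the unconnected pairs first would be the safer choice.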
kg_covid_19.make_holdouts.make_positive_edges(nodes_df: pandas.core.frame.DataFrame, edges_df: pandas.core.frame.DataFrame, train_fraction: float) → List[pandas.core.frame.DataFrame]
Positive edges are randomly selected from the edges in the graph, if and only if both nodes participating in the edge have a degree greater than min_degree (to avoid creating disconnected components). Each selected edge is then removed from the output graph. Negative edges are selected by randomly choosing pairs of nodes that are not connected by an edge.
- Parameters
nodes_df – pandas dataframe with node info, generated from KGX TSV file
edges_df – pandas dataframe with edge info, generated from KGX TSV file
train_fraction – fraction of input edges to keep as training edges; the remainder are emitted as test (and optionally validation) edges
- Returns
pandas dataframes:
training_edges_df: a dataframe of training edges, with the positive edges selected for test removed from the graph
test_edges_df: a dataframe of positive test edges
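The degree-guarded selection described above can be sketched without pandas. This is an illustration only: `split_positive_edges` is a hypothetical helper, edges are plain tuples, and the choice to decrement degrees as edges are removed is an assumption of this sketch rather than documented behaviour of the package.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Edge = Tuple[str, str]


def split_positive_edges(
    edges: Sequence[Edge],
    train_fraction: float,
    min_degree: int = 1,
    seed: int = 42,
) -> Tuple[List[Edge], List[Edge]]:
    """Split edges into (training, test) lists.

    An edge may move to the test set only if both endpoints currently have
    a degree greater than min_degree.
    """
    rng = random.Random(seed)
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    n_test = round(len(edges) * (1 - train_fraction))
    candidates = list(range(len(edges)))
    rng.shuffle(candidates)

    test_idx = set()
    for i in candidates:
        if len(test_idx) >= n_test:
            break
        u, v = edges[i]
        if degree[u] > min_degree and degree[v] > min_degree:
            test_idx.add(i)
            # Assumption: track the shrinking degrees so that no node falls
            # below min_degree in the training graph.
            degree[u] -= 1
            degree[v] -= 1

    training = [e for i, e in enumerate(edges) if i not in test_idx]
    test = [edges[i] for i in sorted(test_idx)]
    return training, test
```

If too few edges pass the degree check, this sketch returns fewer test edges than requested instead of raising an error.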
kg_covid_19.make_holdouts.tsv_to_df(tsv_file: str, *args, **kwargs) → pandas.core.frame.DataFrame
Read in a TSV file and return a pandas dataframe.
- Parameters
tsv_file – file to read in
- Returns
pandas dataframe
kg_covid_19.query module
kg_covid_19.query.parse_query_rq(rq_file) → dict
- Args:
rq_file: SPARQL query in grlc rq format
- Returns:
dict with parsed info about the SPARQL query
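In the grlc rq format, lines beginning with `#+` carry metadata about the query and the remaining lines hold the SPARQL text. A rough, stdlib-only sketch of splitting the two is shown below. It is not the package's parser: `parse_rq_text` is a hypothetical helper, the `query` key in the result is an assumption, and only flat `key: value` metadata lines are handled (nested metadata such as lists is skipped).

```python
from typing import Dict


def parse_rq_text(rq_text: str) -> Dict[str, str]:
    """Split grlc-style '#+' metadata lines from the SPARQL query body."""
    parsed: Dict[str, str] = {}
    query_lines = []
    for line in rq_text.splitlines():
        if line.startswith("#+"):
            # Split on the first colon only, so URLs in values stay intact.
            key, sep, value = line[2:].partition(":")
            if sep and value.strip():
                parsed[key.strip()] = value.strip()
        else:
            query_lines.append(line)
    # Assumption: the query text is stored under a 'query' key.
    parsed["query"] = "\n".join(query_lines).strip()
    return parsed
```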
kg_covid_19.query.result_dict_to_tsv(result_dict: dict, outfile: str) → None
kg_covid_19.query.run_query(query: str, endpoint: str, return_format='json') → dict
kg_covid_19.transform module
kg_covid_19.transform.transform(input_dir: str, output_dir: str, sources: List[str] = None) → None
Call scripts in kg_covid_19/transform/[source name]/ to transform each source into a graph format that KGX can ingest directly, in either TSV or JSON format: https://github.com/NCATS-Tangerine/kgx/blob/master/data-preparation.md
- Args:
input_dir: A string pointing to the directory to import data from.
output_dir: A string pointing to the directory to output data to.
sources: A list of sources to transform.
- Returns:
None.
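Dispatching to one transform per source name can be sketched as a lookup in a registry. Everything here is hypothetical and only illustrates the pattern: `run_transforms`, the `registry` mapping, and the rule that `sources=None` means "run every registered source" are assumptions, not documented behaviour of the package.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical signature for a per-source transform: (input_dir, output_dir).
TransformFn = Callable[[str, str], None]


def run_transforms(
    registry: Dict[str, TransformFn],
    input_dir: str,
    output_dir: str,
    sources: Optional[List[str]] = None,
) -> List[str]:
    """Run the transform registered for each requested source name."""
    # Assumption: no explicit source list means all registered sources.
    selected = list(registry) if sources is None else list(sources)
    unknown = [name for name in selected if name not in registry]
    if unknown:
        raise KeyError(f"unknown sources: {unknown}")
    for name in selected:
        registry[name](input_dir, output_dir)
    return selected
```

Validating the names before running anything means a typo in one source name fails fast instead of after earlier sources have already been processed.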
Module contents
kg_covid_19.download(yaml_file: str, output_dir: str, ignore_cache: bool = False) → None
Downloads data files from a list of URLs (default: download.yaml) into a data directory (default: data/).
- Args:
yaml_file: A string pointing to the YAML file that lists the data to download.
output_dir: A string pointing to the location to download data to.
ignore_cache: Ignore cache and download files even if they exist [false]
- Returns:
None.
kg_covid_19.transform(input_dir: str, output_dir: str, sources: List[str] = None) → None
Call scripts in kg_covid_19/transform/[source name]/ to transform each source into a graph format that KGX can ingest directly, in either TSV or JSON format: https://github.com/NCATS-Tangerine/kgx/blob/master/data-preparation.md
- Args:
input_dir: A string pointing to the directory to import data from.
output_dir: A string pointing to the directory to output data to.
sources: A list of sources to transform.
- Returns:
None.