kg_microbe.utils package

Submodules

kg_microbe.utils.biohub_converter module

Converter utils for Bio Term Hub.

kg_microbe.utils.biohub_converter.parse(input_filename, output_filename) None

Parse KGX nodes TSV into Bio Term Hub format for OGER compatibility.

Mapping of columns from KGX format to Bio Term Hub format:

  0. ‘CUI-less’ -> UMLS CUI

  1. ‘N/A’ -> resource from which it comes

  2. CURIE -> native ID

  3. name -> term (this is the field that is tokenized)

  4. name -> preferred form

  5. category -> type

Parameters
  • input_filename – Input file path (str)

  • output_filename – Output file path (str)

Returns

None.
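
A minimal usage sketch, assuming a hypothetical KGX nodes file; both file names below are placeholders:

    from kg_microbe.utils.biohub_converter import parse

    # Convert a KGX nodes TSV into a Bio Term Hub style term list for OGER.
    parse("nodes.tsv", "bio_term_hub_terms.tsv")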

kg_microbe.utils.biohub_converter.parse_header(elements) Dict[str, int]

Parse headers from nodes TSV.

Parameters

elements – The header record (list)

Returns

A dictionary mapping node header names to their column index

kg_microbe.utils.biohub_converter.write_line(elements, outstream) None

Write line to outstream.

Parameters
  • elements – The record to write (list).

  • outstream – File handle to the output file.

kg_microbe.utils.download_utils module

Download utilities.

kg_microbe.utils.download_utils.download_from_api(yaml_item, outfile) None

Download from an Elasticsearch API.

Args:

yaml_item: Item to be downloaded, parsed from the yaml.
outfile: Where to write the output file.

Returns:

None.

kg_microbe.utils.download_utils.download_from_yaml(yaml_file: str, output_dir: str, ignore_cache: bool = False) None

Download files specified in an input yaml.

Given download info from a download.yaml file, download all files.

Args:

yaml_file: A string pointing to the download.yaml file, to be parsed for things to download.
output_dir: A string pointing to where to write out downloaded files.
ignore_cache: Ignore cache and download files even if they exist [false].

Returns:

None.
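
A minimal sketch of a typical call; the yaml file name and output directory below are assumptions, not fixed values from kg_microbe:

    from kg_microbe.utils.download_utils import download_from_yaml

    # Download everything listed in download.yaml into data/raw,
    # re-downloading files even if cached copies already exist.
    download_from_yaml("download.yaml", "data/raw", ignore_cache=True)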

kg_microbe.utils.download_utils.elastic_search_query(es_connection, index, query, scroll: str = '1m', request_timeout: int = 60, preserve_order: bool = True)

Fetch records from the given Elasticsearch connection using the query parameters.

Args:

es_connection: Elasticsearch connection
index: the Elasticsearch index to query
query: the query body
scroll: scroll parameter passed to Elasticsearch
request_timeout: timeout parameter passed to Elasticsearch
preserve_order: preserve order parameter passed to Elasticsearch

Returns:

All records for query
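
A hedged sketch of a call using the elasticsearch Python client; the host URL, index name, and query body are assumptions:

    from elasticsearch import Elasticsearch

    from kg_microbe.utils.download_utils import elastic_search_query

    # Hypothetical connection, index, and match-all query.
    es_connection = Elasticsearch("http://localhost:9200")
    query = {"query": {"match_all": {}}}

    records = elastic_search_query(
        es_connection,
        index="my_index",
        query=query,
        scroll="1m",
        request_timeout=60,
        preserve_order=True,
    )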

kg_microbe.utils.nlp_utils module

Utilities for natural language processing.

kg_microbe.utils.nlp_utils.assign_string_match_rating(dfrow)

Assign a column categorizing match between TokenizedTerm and PreferredTerm.

  • Exact

  • Partial

  • Synonym

Parameters

dfrow – each row of the OGER output

Returns

Same dataframe with an extra ‘matchRating’ column

kg_microbe.utils.nlp_utils.create_settings_file(path: str, ont: str = 'ALL') None

Create the settings.ini file for OGER to get parameters.

Parameters
  • path – Path of the ‘nlp’ folder

  • ont – The ontology to be used as dictionary [‘ALL’, ‘ENVO’, ‘CHEBI’]

Returns

None.

  • The ‘Shared’ section declares global variables that can be used in other sections, e.g. the data root: root = location of the working directory, accessed in other sections via ${Shared:root}/.

  • Input formats accepted: txt, txt_json, bioc_xml, bioc_json, conll, pubmed, pxml, pxml.gz, pmc, nxml, pubtator, pubtator_fbk, becalmabstracts, becalmpatents

  • Two iter-modes are available: ‘collection’ or ‘document’. document: ‘n’ input files yield ‘n’ output files (provided every file has ontology terms); collection: ‘n’ input files yield 1 output file.

  • Export formats possible: tsv, txt, text_tsv, xml, text_xml, bioc_xml, bioc_json, bionlp, bionlp.ann, brat, brat.ann, conll, pubtator, pubanno_json, pubtator_fbk, europepmc, europepmc.zip, odin, becalm_tsv, becalm_json. These can be passed as a list for multiple outputs too.

  • Multiple Termlists can be declared in separate sections, e.g. [Termlist1], [Termlist2] … [Termlistn], each with its own path (see the usage sketch after this list).
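
A minimal sketch of a call, assuming the ‘nlp’ working directory used below exists:

    from kg_microbe.utils.nlp_utils import create_settings_file

    # Write settings.ini under the 'nlp' working directory,
    # using CHEBI as the dictionary ontology.
    create_settings_file(path="nlp", ont="CHEBI")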

kg_microbe.utils.nlp_utils.create_termlist(path: str, ont: str) None

Create termlist.tsv files from ontology JSON files for NLP.

TODO: Replace this code once runNER is installed and remove ‘kg_microbe/utils/biohub_converter.py’

kg_microbe.utils.nlp_utils.prep_nlp_input(path: str, columns: list, dic: str) str

Create a tsv which forms the input for OGER.

Parameters
  • path – Path to the file which has text to be analyzed

  • columns – The columns to include; the first column HAS to be an id column.

  • dic – The Ontology to be used as a dictionary for NLP

Returns

Filename (str)
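
A minimal sketch, assuming a hypothetical input TSV whose first column is an id column:

    from kg_microbe.utils.nlp_utils import prep_nlp_input

    # Build the OGER input TSV from an id column plus a free-text column.
    nlp_input = prep_nlp_input(
        path="data/raw/traits.tsv",      # hypothetical input file
        columns=["id", "description"],   # the first column must be the id
        dic="CHEBI",
    )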

kg_microbe.utils.nlp_utils.process_oger_output(path: str, input_file_name: str) DataFrame

Process output TSV from OGER.

The OGER output is a TSV which is imported; only the terms that occurred in the text file are considered, and a dataframe of relevant information is returned.

Parameters
  • path – Path to the folder containing relevant files

  • input_file_name – OGER output (tsv file)

Returns

Pandas DataFrame containing required data for further analyses.

kg_microbe.utils.nlp_utils.run_oger(path: str, input_file_name: str, n_workers: int = 1) DataFrame

Run OGER using the settings.ini file created previously.

Parameters
  • path – Path of the input file.

  • input_file_name – Filename.

  • n_workers – Number of threads to run (default: 1).

Returns

Pandas DataFrame containing the output of OGER analysis.
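
A sketch of a call, assuming settings.ini has already been created under an ‘nlp’ working directory; the input file name below is hypothetical:

    from kg_microbe.utils.nlp_utils import run_oger

    # Run OGER with 4 worker threads and collect the results as a DataFrame.
    oger_df = run_oger(path="nlp", input_file_name="nlp_input.tsv", n_workers=4)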

kg_microbe.utils.robot_utils module

Utilities for working with ROBOT.

kg_microbe.utils.robot_utils.convert_to_json(path: str, ont: str)

Convert owl to JSON using ROBOT and the subprocess library.

Parameters
  • path – Path to ROBOT and the input OWL files.

  • ont – Name of the ontology

Returns

None
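
A minimal sketch, assuming ROBOT is available under the given path and the OWL file for the ontology has already been downloaded there; the path and ontology name are assumptions:

    from kg_microbe.utils.robot_utils import convert_to_json

    # Convert the ENVO OWL file under data/raw/ to JSON via ROBOT.
    convert_to_json(path="data/raw/", ont="ENVO")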

kg_microbe.utils.robot_utils.extract_convert_to_json(path: str, ont_name: str, terms: str, mode: str)

Extract all children of a provided CURIE.

Parameters
  • path – path of file to be converted

  • ont_name – Name of the ontology

  • terms – Either a CURIE or a file containing a list of CURIEs

  • mode – Method options as listed below.

Returns

None

ROBOT method options:

  • STAR: The STAR-module contains mainly the terms in the seed and the inter-relations between them (not necessarily sub- and super-classes).

  • TOP: The TOP-module contains mainly the terms in the seed, plus all their sub-classes and the inter-relations between them.

  • BOT: The BOT, or BOTTOM, -module contains mainly the terms in the seed, plus all their super-classes and the inter-relations between them.

  • MIREOT: The MIREOT method preserves the hierarchy of the input ontology (subclass and subproperty relationships), but does not try to preserve the full set of logical entailments.
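
For example, a hedged sketch extracting a module for a single CURIE with the STAR method; the path, ontology name, and CURIE are assumptions:

    from kg_microbe.utils.robot_utils import extract_convert_to_json

    # Extract the module seeded by one CHEBI term and convert it to JSON.
    extract_convert_to_json(
        path="data/raw/ontologies/",
        ont_name="chebi",
        terms="CHEBI:27594",
        mode="STAR",
    )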

kg_microbe.utils.robot_utils.initialize_robot(path: str) list

Initialize ROBOT with necessary configuration.

Parameters

path – Path to ROBOT files.

Returns

A list consisting of robot shell script name and environment variables.

kg_microbe.utils.transform_utils module

Utilities for assisting data transformations.

exception kg_microbe.utils.transform_utils.ItemInDictNotFoundError

Bases: TransformError

Raised when an expected item is not found in the input dictionary.

exception kg_microbe.utils.transform_utils.TransformError

Bases: Exception

Base class for other exceptions.

kg_microbe.utils.transform_utils.collapse_uniprot_curie(uniprot_curie: str) str

Collapse a UniProtKB isoform ID to a parent ID.

Given a UniProtKB curie for an isoform such as UniprotKB:P63151-1 or UniprotKB:P63151-2, collapse it to the parent protein (UniprotKB:P63151).

Parameters

uniprot_curie – The UniProtKB CURIE to collapse

Returns

collapsed UniProtKB ID
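
For example (values taken from the description above):

    from kg_microbe.utils.transform_utils import collapse_uniprot_curie

    collapse_uniprot_curie("UniprotKB:P63151-1")  # -> "UniprotKB:P63151"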

kg_microbe.utils.transform_utils.data_to_dict(these_keys, these_values) dict

Zip up two lists to make a dict.

Parameters
  • these_keys – keys for new dict

  • these_values – values for new dict

Returns

A dictionary mapping these_keys to these_values
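
For example, with placeholder keys and values:

    from kg_microbe.utils.transform_utils import data_to_dict

    # Zip a header row and a data row into one record.
    data_to_dict(["id", "name"], ["X:1", "example"])  # -> {"id": "X:1", "name": "example"}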

kg_microbe.utils.transform_utils.get_header_items(table_data: Any) List

Get header from (first page of) a table.

Args:

table_data: Data, as list of dicts from tabula.io.read_pdf().

Returns:

header_items: An array of header items.

kg_microbe.utils.transform_utils.get_item_by_priority(items_dict: dict, keys_by_priority: list) str

Retrieve item from a dict using a list of keys.

Keys should be in descending order of priority.

Parameters
  • items_dict – Dictionary to search

  • keys_by_priority – list of keys to use to find values

Returns

str: first value in dict for first item in keys_by_priority that isn’t blank, or None
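
A sketch with a hypothetical record; the keys shown are placeholders:

    from kg_microbe.utils.transform_utils import get_item_by_priority

    record = {"preferred_name": "", "name": "Escherichia coli"}

    # Returns "Escherichia coli": "preferred_name" has higher priority but is blank.
    get_item_by_priority(record, ["preferred_name", "name"])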

kg_microbe.utils.transform_utils.guess_bl_category(identifier: str) str

Guess Biolink category for a given identifier.

Note: This is a temporary solution and should not be used long term.

Args:

identifier: A CURIE

Returns:

The category for the given CURIE

kg_microbe.utils.transform_utils.multi_page_table_to_list(multi_page_table: Any) List[Dict]

Convert multi-page tables to lists of dicts.

Method to turn table data returned from tabula.io.read_pdf(), possibly broken over several pages, into a list of dicts, one dict for each row.

Args:

multi_page_table: Table data returned from tabula.io.read_pdf().

Returns:

table_data: A list of dicts, where each dict is an item from one row.
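
A hedged sketch of the intended pipeline; the PDF path and the tabula read options are assumptions:

    from tabula.io import read_pdf

    from kg_microbe.utils.transform_utils import multi_page_table_to_list

    # Read all pages of a (hypothetical) PDF as raw JSON table data,
    # then flatten the pages into one list of row dicts.
    table_data = read_pdf("tables.pdf", pages="all", output_format="json")
    rows = multi_page_table_to_list(table_data)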

kg_microbe.utils.transform_utils.parse_header(header_string: str, sep: str = '\t') List

Parse header data from a file.

Args:

header_string: A string containing header items.
sep: A string containing a delimiter.

Returns:

A list of header items.

kg_microbe.utils.transform_utils.parse_line(this_line: str, header_items: List, sep=',') Dict

Process a line of text from the csv file.

Parameters
  • this_line – A string containing a line of text.

  • header_items – A list of header items.

  • sep – A string containing a delimiter.

Returns

item_dict: A dictionary mapping header items to the processed values from this line of the dataset.
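
A sketch combining parse_header and parse_line on a hypothetical CSV:

    from kg_microbe.utils.transform_utils import parse_header, parse_line

    with open("organisms.csv") as f:  # hypothetical file
        header_items = parse_header(f.readline(), sep=",")
        for line in f:
            # item_dict maps each header item to the value in this row
            item_dict = parse_line(line, header_items, sep=",")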

kg_microbe.utils.transform_utils.ungzip_to_tempdir(gzipped_file: str, tempdir: str) str

Decompress a GZIP file into a temp directory.

kg_microbe.utils.transform_utils.uniprot_make_name_to_id_mapping(dat_gz_file: str) dict

Convert UniProtKB id maps to dict of maps.

Given a Uniprot dat.gz file such as HUMAN_9606_idmapping.dat.gz (from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/), make a dict with the name to id mapping.

Parameters

dat_gz_file – Path to the UniProt idmapping dat.gz file

Returns

dict with mapping

kg_microbe.utils.transform_utils.uniprot_name_to_id(name_to_id_map: dict, name: str) Optional[str]

Set up Uniprot name to ID mapping.

Parameters
  • name_to_id_map – mapping dict[name] -> id

  • name – name

Returns

id string, or None
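
A sketch using both UniProt helpers; the idmapping file name follows the FTP layout described above, and the name looked up is a placeholder:

    from kg_microbe.utils.transform_utils import (
        uniprot_make_name_to_id_mapping,
        uniprot_name_to_id,
    )

    # Build the name -> id map once, then look names up in it.
    name_to_id_map = uniprot_make_name_to_id_mapping("HUMAN_9606_idmapping.dat.gz")
    uniprot_id = uniprot_name_to_id(name_to_id_map, "SOME_PROTEIN_NAME")  # None if absent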

kg_microbe.utils.transform_utils.unzip_to_tempdir(zip_file_name: str, tempdir: str) None

Decompress a zip file into a temp directory.

kg_microbe.utils.transform_utils.write_node_edge_item(fh: Any, header: List, data: List, sep: str = '\t')

Write out a single line for a node or an edge in *.tsv.

Parameters
  • fh – file handle of node or edge file

  • header – list of header items

  • data – data for line to write out

  • sep – separator [default: tab]
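
A minimal sketch writing one node row to a nodes TSV; the header and values below are hypothetical:

    from kg_microbe.utils.transform_utils import write_node_edge_item

    header = ["id", "name", "category"]

    with open("nodes.tsv", "w") as nodes_fh:
        # Write a single node line, tab-separated by default.
        write_node_edge_item(
            nodes_fh,
            header,
            ["NCBITaxon:562", "Escherichia coli", "biolink:OrganismTaxon"],
        )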

Module contents

Initialize utilities.

kg_microbe.utils.download_from_yaml(yaml_file: str, output_dir: str, ignore_cache: bool = False) None

Download files specified in an input yaml.

Given download info from a download.yaml file, download all files.

Args:

yaml_file: A string pointing to the download.yaml file, to be parsed for things to download.
output_dir: A string pointing to where to write out downloaded files.
ignore_cache: Ignore cache and download files even if they exist [false].

Returns:

None.

kg_microbe.utils.multi_page_table_to_list(multi_page_table: Any) List[Dict]

Convert multi-page tables to lists of dicts.

Method to turn table data returned from tabula.io.read_pdf(), possibly broken over several pages, into a list of dicts, one dict for each row.

Args:

multi_page_table: Table data returned from tabula.io.read_pdf().

Returns:

table_data: A list of dicts, where each dict is an item from one row.

kg_microbe.utils.write_node_edge_item(fh: Any, header: List, data: List, sep: str = '\t')

Write out a single line for a node or an edge in *.tsv.

Parameters
  • fh – file handle of node or edge file

  • header – list of header items

  • data – data for line to write out

  • sep – separator [default: tab]