kg_microbe.utils package
Submodules
kg_microbe.utils.biohub_converter module
Converter utils for Bio Term Hub.
- kg_microbe.utils.biohub_converter.parse(input_filename, output_filename) None
Parse KGX nodes TSV into Bio Term Hub format for OGER compatibility.
Columns are mapped from KGX format to Bio Term Hub format as follows:
0. ‘CUI-less’ -> UMLS CUI
1. ‘N/A’ -> resource from which it comes
2. CURIE -> native ID
3. name -> term (this is the field that is tokenized)
4. name -> preferred form
5. category -> type
- Parameters
input_filename – Input file path (str)
output_filename – Output file path (str)
- Returns
None.
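The column mapping can be sketched as follows. This is a minimal illustration rather than the module's actual code: the helper name `kgx_row_to_bth`, the assumption that the KGX header uses the column names `id`, `name`, and `category`, and the example record are all hypothetical.

```python
from typing import Dict, List


def kgx_row_to_bth(row: List[str], header: Dict[str, int]) -> List[str]:
    """Map one KGX node record to the six Bio Term Hub columns (hypothetical helper)."""
    curie = row[header["id"]]
    name = row[header["name"]]
    category = row[header["category"]]
    return [
        "CUI-less",  # 0. UMLS CUI
        "N/A",       # 1. resource from which it comes
        curie,       # 2. native ID
        name,        # 3. term (the field that is tokenized)
        name,        # 4. preferred form
        category,    # 5. type
    ]


# Made-up header positions and record, for illustration only.
header = {"id": 0, "category": 1, "name": 2}
record = ["EX:0001", "biolink:OntologyClass", "example term"]
bth_row = kgx_row_to_bth(record, header)
print("\t".join(bth_row))
```

Note that the name is written twice: once as the term that OGER tokenizes and once as the preferred form.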
- kg_microbe.utils.biohub_converter.parse_header(elements) Dict[str, int]
Parse headers from nodes TSV.
- Parameters
elements – The header record (list)
- Return dict
A dictionary of node header names to index
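A header-to-index dictionary of this kind can be built in one pass with `enumerate`. The sketch below assumes the header record has already been split into a list; the function name `header_to_index` is hypothetical and is not part of the package.

```python
from typing import Dict, List


def header_to_index(elements: List[str]) -> Dict[str, int]:
    """Return a mapping of each header name to its column position."""
    return {name: index for index, name in enumerate(elements)}


header_line = "id\tcategory\tname"
index_by_name = header_to_index(header_line.split("\t"))
print(index_by_name)
```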
- kg_microbe.utils.biohub_converter.write_line(elements, outstream) None
Write line to outstream.
- Parameters
elements – The record to write (list).
outstream – File handle to the output file.
kg_microbe.utils.download_utils module
Download utilities.
- kg_microbe.utils.download_utils.download_from_api(yaml_item, outfile) None
Download from an Elasticsearch API.
- Args:
yaml_item: Item to be downloaded, parsed from YAML.
outfile: Where to write the output file.
- Returns:
None.
- kg_microbe.utils.download_utils.download_from_yaml(yaml_file: str, output_dir: str, ignore_cache: bool = False) None
Download files specified in an input yaml.
Given download info from a download.yaml file, download all files.
- Args:
yaml_file: A string pointing to the download.yaml file, to be parsed for things to download.
output_dir: A string pointing to where to write out downloaded files.
ignore_cache: Ignore the cache and download files even if they already exist (default: False).
- Returns:
None.
- kg_microbe.utils.download_utils.elastic_search_query(es_connection, index, query, scroll: str = '1m', request_timeout: int = 60, preserve_order: bool = True)
Fetch all records matching the given query from an Elasticsearch index.
- Args:
es_connection: Elasticsearch connection.
index: The Elasticsearch index to query.
query: The query.
scroll: Scroll parameter passed to Elasticsearch.
request_timeout: Timeout parameter passed to Elasticsearch.
preserve_order: Preserve-order parameter passed to Elasticsearch.
- Returns:
All records for query
kg_microbe.utils.nlp_utils module
Utilities for natural language processing.
- kg_microbe.utils.nlp_utils.assign_string_match_rating(dfrow)
Assign a column categorizing match between TokenizedTerm and PreferredTerm.
The rating is one of:
Exact
Partial
Synonym
- Parameters
dfrow – Each row of the OGER output
- Returns
Same dataframe with an extra ‘matchRating’ column
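The three ratings could be assigned along the following lines. The comparison rules here are assumptions made for illustration, since the documentation does not state them: a case-insensitive equal string counts as Exact, containment in either direction counts as Partial, and anything else is treated as a Synonym match. The function name `rate_string_match` is hypothetical.

```python
def rate_string_match(tokenized_term: str, preferred_term: str) -> str:
    """Categorize how a matched term relates to its preferred form (assumed rules)."""
    tokenized = tokenized_term.strip().lower()
    preferred = preferred_term.strip().lower()
    if tokenized == preferred:
        return "Exact"
    # Assumed rule: one string contained in the other is a partial match.
    if tokenized in preferred or preferred in tokenized:
        return "Partial"
    # Assumed rule: no textual overlap means the match came from a synonym.
    return "Synonym"


ratings = [
    rate_string_match("Soil", "soil"),
    rate_string_match("soil", "forest soil"),
    rate_string_match("table salt", "sodium chloride"),
]
print(ratings)
```

With pandas, a function like this would typically be applied row by row, for example via `DataFrame.apply(..., axis=1)`, to fill the ‘matchRating’ column.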
- kg_microbe.utils.nlp_utils.create_settings_file(path: str, ont: str = 'ALL') None
Create the settings.ini file for OGER to get parameters.
- Parameters
path – Path of the ‘nlp’ folder
ont – The ontology to be used as a dictionary; one of ‘ALL’, ‘ENVO’, ‘CHEBI’
- Returns
None.
The ‘Shared’ section declares global variables that can be used in other sections, e.g. the data root. root = location of the working directory, accessed in other sections using ${Shared:root}/
Input formats accepted: txt, txt_json, bioc_xml, bioc_json, conll, pubmed, pxml, pxml.gz, pmc, nxml, pubtator, pubtator_fbk, becalmabstracts, becalmpatents
Two iter-modes are available: collection or document.
document: ‘n’ input files = ‘n’ output files (provided every file has ontology terms)
collection: ‘n’ input files = 1 output file
Export formats possible: tsv, txt, text_tsv, xml, text_xml, bioc_xml, bioc_json, bionlp, bionlp.ann, brat, brat.ann, conll, pubtator, pubanno_json, pubtator_fbk, europepmc, europepmc.zip, odin, becalm_tsv, becalm_json. These can also be passed as a list to produce multiple outputs.
Multiple termlists can be declared in separate sections, e.g. [Termlist1], [Termlist2] … [Termlistn], each with its own path.
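The `${Shared:root}` syntax matches the extended interpolation supported by Python's `configparser`, so a settings file with this layout can be sketched as below. The section names ‘Shared’ and ‘Termlist1’ come from the description above; the ‘Main’ section, the option names, and the file paths are illustrative assumptions, not a verified list of OGER settings.

```python
import configparser
import io


def build_settings(root: str) -> str:
    """Return the text of an OGER-style settings file (option names are illustrative)."""
    config = configparser.ConfigParser(
        interpolation=configparser.ExtendedInterpolation()
    )
    # Global variables that other sections can reference.
    config["Shared"] = {"root": root}
    # Hypothetical section and option names.
    config["Main"] = {
        "input-directory": "${Shared:root}/input",
        "output-directory": "${Shared:root}/output",
        "iter-mode": "collection",
        "export_format": "tsv",
    }
    config["Termlist1"] = {"path": "${Shared:root}/terms/envo_termlist.tsv"}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


settings_text = build_settings("/data/nlp")

# Reading the text back resolves the ${Shared:root} references.
parsed = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
parsed.read_string(settings_text)
resolved_input = parsed["Main"]["input-directory"]
resolved_termlist = parsed["Termlist1"]["path"]
print(resolved_input, resolved_termlist)
```

Keeping the root in one place means only the ‘Shared’ section changes when the working directory moves.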
- kg_microbe.utils.nlp_utils.create_termlist(path: str, ont: str) None
Create termlist.tsv files from ontology JSON files for NLP.
TODO: Replace this code once runNER is installed and remove ‘kg_microbe/utils/biohub_converter.py’
- kg_microbe.utils.nlp_utils.prep_nlp_input(path: str, columns: list, dic: str) str
Create a tsv which forms the input for OGER.
- Parameters
path – Path to the file which has text to be analyzed
columns – The first column HAS to be an id column.
dic – The Ontology to be used as a dictionary for NLP
- Returns
Filename (str)
- kg_microbe.utils.nlp_utils.process_oger_output(path: str, input_file_name: str) DataFrame
Process output TSV from OGER.
The OGER output is a TSV, which is imported; only the terms that occurred in the text file are considered, and a dataframe of relevant information is returned.
- Parameters
path – Path to the folder containing relevant files
input_file_name – OGER output (tsv file)
- Returns
Pandas DataFrame containing required data for further analyses.
- kg_microbe.utils.nlp_utils.run_oger(path: str, input_file_name: str, n_workers: int = 1) DataFrame
Run OGER using the settings.ini file created previously.
- Parameters
path – Path of the input file.
input_file_name – Filename.
n_workers – Number of threads to run (default: 1).
- Returns
Pandas DataFrame containing the output of OGER analysis.
kg_microbe.utils.robot_utils module
Utilities for working with ROBOT.
- kg_microbe.utils.robot_utils.convert_to_json(path: str, ont: str)
Convert OWL to JSON using ROBOT and the subprocess library.
- Parameters
path – Path to ROBOT and the input OWL files.
ont – Ontology
- Returns
None
- kg_microbe.utils.robot_utils.extract_convert_to_json(path: str, ont_name: str, terms: str, mode: str)
Extract all children of a provided CURIE.
- Parameters
path – path of file to be converted
ont_name – Name of the ontology
terms – Either a CURIE or a file containing a list of CURIEs
mode – Method options as listed below.
- Returns
None
ROBOT method options:
- STAR: The STAR-module contains mainly the terms in the seed and the inter-relations between them (not necessarily sub- and super-classes).
- TOP: The TOP-module contains mainly the terms in the seed, plus all their sub-classes and the inter-relations between them.
- BOT: The BOT, or BOTTOM, -module contains mainly the terms in the seed, plus all their super-classes and the inter-relations between them.
- MIREOT: The MIREOT method preserves the hierarchy of the input ontology (subclass and subproperty relationships), but does not try to preserve the full set of logical entailments.
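As a rough sketch of how such an extraction might be assembled for the subprocess library, the helper below builds a ROBOT command line that extracts a module for a single CURIE and chains a `convert` step to write JSON. The helper name, argument names, and file names are hypothetical, and the command is only built, not executed. The sketch covers the STAR, TOP, and BOT methods; MIREOT is left out because ROBOT takes different term options for it.

```python
from typing import List

# Methods that take a plain seed term in this sketch.
SLME_METHODS = {"STAR", "TOP", "BOT"}


def build_extract_command(
    robot: str, input_owl: str, term: str, method: str, output_json: str
) -> List[str]:
    """Build (but do not run) a ROBOT extract-then-convert command."""
    method = method.upper()
    if method not in SLME_METHODS:
        raise ValueError(f"Unsupported method for this sketch: {method}")
    return [
        robot, "extract",
        "--method", method,
        "--input", input_owl,
        "--term", term,
        # Chained command: the extracted module is converted and written as JSON.
        "convert",
        "--output", output_json,
    ]


command = build_extract_command(
    "robot", "example.owl", "EX:0001", "bot", "example_extract.json"
)
print(" ".join(command))
```

A list like this can be passed to `subprocess.run(command, check=True)` once ROBOT is available on the path.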
- kg_microbe.utils.robot_utils.initialize_robot(path: str) list
Initialize ROBOT with necessary configuration.
- Parameters
path – Path to ROBOT files.
- Returns
A list consisting of robot shell script name and environment variables.
kg_microbe.utils.transform_utils module
Utilities for assisting data transformations.
- exception kg_microbe.utils.transform_utils.ItemInDictNotFoundError
Bases: TransformError
Raised when a requested item is not found in a dict.
- exception kg_microbe.utils.transform_utils.TransformError
Bases: Exception
Base class for other exceptions.
- kg_microbe.utils.transform_utils.collapse_uniprot_curie(uniprot_curie: str) str
Collapse a UniProtKB isoform ID to a parent ID.
Given a UniProtKB CURIE for an isoform, such as UniprotKB:P63151-1 or UniprotKB:P63151-2, collapse it to the parent protein (UniprotKB:P63151 in both cases).
- Parameters
uniprot_curie – UniProtKB CURIE, possibly for an isoform
- Returns
Collapsed UniProtKB ID
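One way to express this collapse is to strip a trailing hyphen-plus-digits isoform suffix with a regular expression. The sketch below is an assumption about how the behavior could be reproduced, not the package's code; `collapse_isoform` is a hypothetical name.

```python
import re

# An isoform suffix is a hyphen followed by digits at the end of the CURIE.
ISOFORM_SUFFIX = re.compile(r"-\d+$")


def collapse_isoform(curie: str) -> str:
    """Strip a trailing isoform suffix, leaving the parent protein CURIE."""
    return ISOFORM_SUFFIX.sub("", curie)


collapsed = [
    collapse_isoform("UniprotKB:P63151-1"),
    collapse_isoform("UniprotKB:P63151-2"),
    collapse_isoform("UniprotKB:P63151"),
]
print(collapsed)
```

CURIEs without an isoform suffix pass through unchanged.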
- kg_microbe.utils.transform_utils.data_to_dict(these_keys, these_values) dict
Zip up two lists to make a dict.
- Parameters
these_keys – keys for new dict
these_values – values for new dict
- Returns
dictionary
- kg_microbe.utils.transform_utils.get_header_items(table_data: Any) List
Get header from (first page of) a table.
- Args:
table_data: Data, as list of dicts from tabula.io.read_pdf().
- Returns:
header_items: An array of header items.
- kg_microbe.utils.transform_utils.get_item_by_priority(items_dict: dict, keys_by_priority: list) str
Retrieve item from a dict using a list of keys.
Keys should be in descending order of priority.
- Parameters
items_dict – Dict to retrieve the item from
keys_by_priority – List of keys to use to find values
- Returns
str: first value in the dict for the first key in keys_by_priority that isn’t blank, or None
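The priority lookup can be sketched as a simple loop over the keys. This version is an illustration under stated assumptions: the name `first_non_blank` is hypothetical, and "blank" is taken to mean a missing key, None, or a string containing only whitespace.

```python
from typing import Optional


def first_non_blank(items: dict, keys_by_priority: list) -> Optional[str]:
    """Return the value for the highest-priority key that has a non-blank value."""
    for key in keys_by_priority:
        value = items.get(key)
        # Skip missing keys, None, and whitespace-only strings.
        if value is not None and str(value).strip():
            return value
    return None


record = {"name": "", "label": "example label", "synonym": "alt"}
best = first_non_blank(record, ["name", "label", "synonym"])
missing = first_non_blank(record, ["title"])
print(best, missing)
```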
- kg_microbe.utils.transform_utils.guess_bl_category(identifier: str) str
Guess Biolink category for a given identifier.
Note: This is a temporary solution and should not be used long term.
- Args:
identifier: A CURIE
- Returns:
The category for the given CURIE
- kg_microbe.utils.transform_utils.multi_page_table_to_list(multi_page_table: Any) List[Dict]
Convert multi-page tables to lists of dicts.
Turn table data returned from tabula.io.read_pdf(), possibly broken over several pages, into a list of dicts, one dict for each row.
- Args:
multi_page_table: Table data returned from tabula.io.read_pdf().
- Returns:
table_data: A list of dicts, where each dict holds the items from one row.
- kg_microbe.utils.transform_utils.parse_header(header_string: str, sep: str = '\t') List
Parse header data from a file.
- Args:
header_string: A string containing header items.
sep: A string containing a delimiter.
- Returns:
A list of header items.
- kg_microbe.utils.transform_utils.parse_line(this_line: str, header_items: List, sep=',') Dict
Process a line of text from the csv file.
- Parameters
this_line – A string containing a line of text.
header_items – A list of header items.
sep – A string containing a delimiter.
- Return item_dict
A dictionary mapping each header item to the corresponding processed item from the line.
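Header parsing and line parsing fit together as sketched below: the header is split once, and each data line is then zipped against it. The function names are hypothetical, and the splitting is deliberately naive; it does not handle quoted fields that contain the delimiter, for which the standard `csv` module would be a better fit.

```python
from typing import Dict, List


def split_header(header_string: str, sep: str = "\t") -> List[str]:
    """Split a header line into a list of header items."""
    return [item.strip() for item in header_string.rstrip("\r\n").split(sep)]


def line_to_dict(this_line: str, header_items: List[str], sep: str = ",") -> Dict[str, str]:
    """Pair each header item with the value in the same position of the line."""
    values = [value.strip() for value in this_line.rstrip("\r\n").split(sep)]
    return dict(zip(header_items, values))


header_items = split_header("id,name,category\n", sep=",")
item_dict = line_to_dict("EX:0001,example term,biolink:OntologyClass\n", header_items)
print(item_dict)
```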
- kg_microbe.utils.transform_utils.ungzip_to_tempdir(gzipped_file: str, tempdir: str) str
Decompress a GZIP file into a temp directory.
- kg_microbe.utils.transform_utils.uniprot_make_name_to_id_mapping(dat_gz_file: str) dict
Convert UniProtKB id maps to dict of maps.
Given a UniProt dat.gz file, such as ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping.dat.gz, make a dict with a name-to-ID mapping.
- Parameters
dat_gz_file – Path to the dat.gz file
- Returns
dict with mapping
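UniProt idmapping files are tab-separated with three columns: UniProtKB accession, ID type, and ID. A name-to-ID mapping could be built from them roughly as follows. The shape of the returned dict (one inner dict per ID type) is an assumption, the function name `read_id_mapping` is hypothetical, and the two rows written to the temporary file are illustrative sample data.

```python
import gzip
import os
import tempfile
from collections import defaultdict
from typing import Dict


def read_id_mapping(dat_gz_file: str) -> Dict[str, Dict[str, str]]:
    """Build {id_type: {name: accession}} from a gzipped idmapping file."""
    mapping: Dict[str, Dict[str, str]] = defaultdict(dict)
    with gzip.open(dat_gz_file, "rt", encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue  # skip malformed lines
            accession, id_type, name = parts
            # Keep the first accession seen for each name.
            mapping[id_type].setdefault(name, accession)
    return dict(mapping)


# Illustrative sample rows written to a temporary gzip file.
rows = "P63151\tUniProtKB-ID\t2ABA_HUMAN\nP63151\tGene_Name\tPPP2R2A\n"
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "example_idmapping.dat.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as out:
        out.write(rows)
    name_to_id = read_id_mapping(sample_path)

print(name_to_id["Gene_Name"]["PPP2R2A"])
```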
- kg_microbe.utils.transform_utils.uniprot_name_to_id(name_to_id_map: dict, name: str) Optional[str]
Set up Uniprot name to ID mapping.
- Parameters
name_to_id_map – mapping dict[name] -> id
name – name
- Returns
id string, or None
- kg_microbe.utils.transform_utils.unzip_to_tempdir(zip_file_name: str, tempdir: str) None
Decompress a zip file into a temp directory.
Module contents
Initialize utilities.
- kg_microbe.utils.download_from_yaml(yaml_file: str, output_dir: str, ignore_cache: bool = False) None
Download files specified in an input yaml.
Given download info from a download.yaml file, download all files.
- Args:
yaml_file: A string pointing to the download.yaml file, to be parsed for things to download.
output_dir: A string pointing to where to write out downloaded files.
ignore_cache: Ignore the cache and download files even if they already exist (default: False).
- Returns:
None.
- kg_microbe.utils.multi_page_table_to_list(multi_page_table: Any) List[Dict]
Convert multi-page tables to lists of dicts.
Turn table data returned from tabula.io.read_pdf(), possibly broken over several pages, into a list of dicts, one dict for each row.
- Args:
multi_page_table: Table data returned from tabula.io.read_pdf().
- Returns:
table_data: A list of dicts, where each dict holds the items from one row.