kg_covid_19.transform_utils.scibite_cord package¶
Submodules¶
kg_covid_19.transform_utils.scibite_cord.scibite_cord module¶
-
class
kg_covid_19.transform_utils.scibite_cord.scibite_cord.
ScibiteCordTransform
(input_dir: str = None, output_dir: str = None)¶ Bases:
kg_covid_19.transform_utils.transform.Transform
ScibiteCordTransform parses the SciBite annotations on the CORD-19 dataset to extract concept-to-publication annotations and co-occurrences.
-
contract_uri
(iri) → str¶ Contract a given IRI.
Contract a given IRI, with special parsing and transformations depending on the nature of the IRI.
- Args:
iri: IRI as string
- Returns:
str.
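The contraction step can be illustrated with a minimal sketch. This is not the implementation used by ScibiteCordTransform: the function name `contract_iri_sketch` and the two-entry `PREFIX_MAP` are assumptions made for illustration, and the "special parsing" the docstring mentions is not reproduced here.

```python
from typing import Dict

# Hypothetical prefix map for illustration only; the real transform may
# use a different (and much larger) set of namespaces.
PREFIX_MAP: Dict[str, str] = {
    "http://purl.obolibrary.org/obo/GO_": "GO",
    "http://purl.obolibrary.org/obo/CHEBI_": "CHEBI",
}


def contract_iri_sketch(iri: str, prefix_map: Dict[str, str] = PREFIX_MAP) -> str:
    """Return a CURIE if a known namespace matches, else the IRI unchanged."""
    # Try longer namespaces first so the most specific match wins.
    for namespace in sorted(prefix_map, key=len, reverse=True):
        if iri.startswith(namespace):
            local_id = iri[len(namespace):]
            return f"{prefix_map[namespace]}:{local_id}"
    return iri
```

Returning the input unchanged when no namespace matches is a design choice of this sketch; the actual method may handle unknown IRIs differently.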
-
extract_termite_hits
(data: Dict) → Set¶ Parse TERMite hits.
- Args:
data: A dictionary.
- Returns:
Set.
-
static
get_identifier_by_prefix
(record, prefix)¶ Get identifier from a list based on prefix.
- Args:
record: Record from NCBI gene_info.
prefix: Prefix of the identifier to extract.
- Returns:
str
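A prefix lookup of this kind might look like the sketch below. It assumes the record is a dictionary with a pipe-delimited `dbXrefs` field, which is how cross-references appear in NCBI gene_info files; the function name and the record shape are assumptions, not the method's confirmed behaviour.

```python
from typing import Dict, Optional


def identifier_by_prefix_sketch(record: Dict[str, str], prefix: str) -> Optional[str]:
    """Return the first cross-reference that starts with `prefix:`, if any."""
    # Assumption: cross-references live in a pipe-delimited "dbXrefs" field,
    # e.g. "MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410".
    for xref in record.get("dbXrefs", "").split("|"):
        if xref.startswith(prefix + ":"):
            return xref
    return None


example_record = {"dbXrefs": "MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410"}
ensembl_id = identifier_by_prefix_sketch(example_record, "Ensembl")
```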
-
static
is_curie
(s: str) → bool¶ Check if a given string is a CURIE.
- Args:
s: string
- Returns:
bool.
-
static
is_iri
(s) → bool¶ Check if a given string is an IRI.
- Args:
s: string
- Returns:
bool.
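The two checks above (is_curie and is_iri) can be approximated as follows. This is a rough sketch under stated assumptions: the names `looks_like_iri` and `looks_like_curie`, the restriction to http(s) schemes, and the regular expression are all illustrative and may not match the rules the class actually applies.

```python
import re

# Assumed shape of a CURIE: a prefix, a colon, then a local part that
# contains no whitespace and does not start with "/".
CURIE_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_.\-]*:[^\s/][^\s]*$")


def looks_like_iri(s: str) -> bool:
    """Treat only http(s) strings as IRIs (a simplifying assumption)."""
    return s.startswith(("http://", "https://"))


def looks_like_curie(s: str) -> bool:
    """Return True for prefix:local strings that are not http(s) IRIs."""
    return not looks_like_iri(s) and CURIE_PATTERN.match(s) is not None
```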
-
load_country_code
(input_dir: str, output_dir: str) → None¶
-
load_gene_info
(input_dir: str, output_dir: str, species_id: List = None) → None¶ Load mappings from NCBI gene_info (gene_info.gz).
- Args:
input_dir: A string pointing to the directory to import data from.
output_dir: A string pointing to the directory to output data to.
species_id: A list with the species IDs.
- Returns:
None.
-
parse_annotation_doc
(node_handle, edge_handle, doc: Dict) → None¶ Parse a JSON document corresponding to a publication.
- Args:
node_handle: File handle for nodes.csv.
edge_handle: File handle for edges.csv.
doc: JSON document as dict.
- Returns:
None.
-
parse_annotations
(node_handle: IO, edge_handle: IO, data_file1: str, data_file2: str, data_file3: str) → None¶ Parse annotations from CORD-19_1_5.zip.
- Args:
node_handle: File handle for nodes.csv.
edge_handle: File handle for edges.csv.
data_file1: Path to pdf_json_part_1.zip.
data_file2: Path to pdf_json_part_2.zip.
data_file3: Path to pmc_json.zip.
- Returns:
None.
-
parse_cooccurrence
(node_handle: Any, edge_handle: Any, data_file: str) → None¶ Parse term co-occurrences from cv19_scc.zip.
- Args:
node_handle: File handle for nodes.csv.
edge_handle: File handle for edges.csv.
data_file: Path to cv19_scc.zip.
- Returns:
None.
-
parse_cooccurrence_record
(node_handle: Any, edge_handle: Any, record: Dict) → None¶ Parse term co-occurrences.
- Args:
node_handle: File handle for nodes.csv.
edge_handle: File handle for edges.csv.
record: A dictionary corresponding to a row from a table.
- Returns:
None.
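The handle-based pattern shared by the parse methods (write nodes to one open file, edges to another, return nothing) can be sketched as below. Everything specific here is hypothetical: the `terms` key, the tab-separated layout, and the `co-occurs_with` label are placeholders, since the actual record fields and output columns are not documented on this page.

```python
import csv
import io
from typing import Dict, IO, List


def write_cooccurrence_sketch(
    node_handle: IO[str], edge_handle: IO[str], record: Dict[str, List[str]]
) -> None:
    """Write one node row per term and one edge row per unordered term pair."""
    node_writer = csv.writer(node_handle, delimiter="\t", lineterminator="\n")
    edge_writer = csv.writer(edge_handle, delimiter="\t", lineterminator="\n")
    # Assumption: the record lists its co-occurring terms under "terms".
    terms = record["terms"]
    for term in terms:
        node_writer.writerow([term])
    # Emit each pair once: (terms[i], terms[j]) with i < j.
    for i, subject in enumerate(terms):
        for obj in terms[i + 1:]:
            edge_writer.writerow([subject, "co-occurs_with", obj])
```

In-memory `io.StringIO` objects can stand in for the file handles when trying this out, which avoids touching the file system.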
-
run
(pdf_zipfile_1: Optional[str] = None, pdf_zipfile_2: Optional[str] = None, pmc_zipfile: Optional[str] = None, co_occur_zipfile: Optional[str] = None) → None¶ Run the transformations needed to process the SciBite CORD-19 annotations.
- Args:
pdf_zipfile_1: PDF zip file part 1 [pdf_json_part_1.zip].
pdf_zipfile_2: PDF zip file part 2 [pdf_json_part_2.zip].
pmc_zipfile: PMC zip file [pmc_json.zip].
co_occur_zipfile: Co-occurrence data zip file [cv19_scc_1_2.zip].
- Returns:
None.
-
Module contents¶
-
class
kg_covid_19.transform_utils.scibite_cord.
ScibiteCordTransform
(input_dir: str = None, output_dir: str = None)¶ Bases:
kg_covid_19.transform_utils.transform.Transform
ScibiteCordTransform parses the SciBite annotations on the CORD-19 dataset to extract concept-to-publication annotations and co-occurrences.
-
contract_uri
(iri) → str¶ Contract a given IRI.
Contract a given IRI, with special parsing and transformations depending on the nature of the IRI.
- Args:
iri: IRI as string
- Returns:
str.
-
extract_termite_hits
(data: Dict) → Set¶ Parse TERMite hits.
- Args:
data: A dictionary.
- Returns:
Set.
-
static
get_identifier_by_prefix
(record, prefix)¶ Get identifier from a list based on prefix.
- Args:
record: Record from NCBI gene_info.
prefix: Prefix of the identifier to extract.
- Returns:
str
-
static
is_curie
(s: str) → bool¶ Check if a given string is a CURIE.
- Args:
s: string
- Returns:
bool.
-
static
is_iri
(s) → bool¶ Check if a given string is an IRI.
- Args:
s: string
- Returns:
bool.
-
load_country_code
(input_dir: str, output_dir: str) → None¶
-
load_gene_info
(input_dir: str, output_dir: str, species_id: List = None) → None¶ Load mappings from NCBI gene_info (gene_info.gz).
- Args:
input_dir: A string pointing to the directory to import data from.
output_dir: A string pointing to the directory to output data to.
species_id: A list with the species IDs.
- Returns:
None.
-
parse_annotation_doc
(node_handle, edge_handle, doc: Dict) → None¶ Parse a JSON document corresponding to a publication.
- Args:
node_handle: File handle for nodes.csv.
edge_handle: File handle for edges.csv.
doc: JSON document as dict.
- Returns:
None.
-
parse_annotations
(node_handle: IO, edge_handle: IO, data_file1: str, data_file2: str, data_file3: str) → None¶ Parse annotations from CORD-19_1_5.zip.
- Args:
node_handle: File handle for nodes.csv.
edge_handle: File handle for edges.csv.
data_file1: Path to pdf_json_part_1.zip.
data_file2: Path to pdf_json_part_2.zip.
data_file3: Path to pmc_json.zip.
- Returns:
None.
-
parse_cooccurrence
(node_handle: Any, edge_handle: Any, data_file: str) → None¶ Parse term co-occurrences from cv19_scc.zip.
- Args:
node_handle: File handle for nodes.csv.
edge_handle: File handle for edges.csv.
data_file: Path to cv19_scc.zip.
- Returns:
None.
-
parse_cooccurrence_record
(node_handle: Any, edge_handle: Any, record: Dict) → None¶ Parse term co-occurrences.
- Args:
node_handle: File handle for nodes.csv.
edge_handle: File handle for edges.csv.
record: A dictionary corresponding to a row from a table.
- Returns:
None.
-
run
(pdf_zipfile_1: Optional[str] = None, pdf_zipfile_2: Optional[str] = None, pmc_zipfile: Optional[str] = None, co_occur_zipfile: Optional[str] = None) → None¶ Run the transformations needed to process the SciBite CORD-19 annotations.
- Args:
pdf_zipfile_1: PDF zip file part 1 [pdf_json_part_1.zip].
pdf_zipfile_2: PDF zip file part 2 [pdf_json_part_2.zip].
pmc_zipfile: PMC zip file [pmc_json.zip].
co_occur_zipfile: Co-occurrence data zip file [cv19_scc_1_2.zip].
- Returns:
None.
-