Reference
moodys_datahub.tools
Sftp
Bases: _Process
A class to manage SFTP connections and file operations for data transfer.
__init__(hostname=None, username=None, port=22, privatekey=None, data_product_template=None, local_repo=None)
Constructor Method
Constructor Parameters
- hostname (str, optional): Hostname of the SFTP server (default is CBS SFTP server).
- username (str, optional): Username for authentication (default is CBS SFTP server).
- port (int, optional): Port number for the SFTP connection (default is 22).
- privatekey (str, optional): Path to the private key file for authentication (required for SFTP access).
- data_product_template (str, optional): Template for managing data products during SFTP operations.
- local_repo (str, optional): Path to a folder containing previously downloaded data products.
Object Attributes:
- connection (pysftp.Connection or None): Represents the current SFTP connection, initially set to None.
- hostname (str): Hostname for the SFTP server.
- username (str): Username for SFTP authentication.
- privatekey (str or None): Path to the private key file for SFTP authentication.
- port (int): Port number for SFTP connection (default is 22).
File Handling Attributes:
- output_format (list of str): Supported output formats for files (default is ['.csv']).
- file_size_mb (int): Maximum file size in MB before splitting files (default is 500 MB).
- delete_files (bool): Flag indicating whether to delete processed files (default is False).
- concat_files (bool): Flag indicating whether to concatenate processed files (default is True).
- query (str, function, or None): Query string or function for filtering data (default is None).
- query_args (list or None): List of arguments for the query string or function (default is None).
- dfs (DataFrame or None): Stores concatenated DataFrames if concatenation is enabled.
copy_obj()
Create a deep copy of the current instance and initialize its defaults.
This method creates a deep copy of the instance, calls the
_object_defaults() method to set default values, and then
invokes the select_data() method to prepare the copied object
for use.
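The copy-then-reset pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the `Demo` class, its attributes, and the bodies of `_object_defaults()` and `select_data()` are hypothetical stand-ins, and returning the copy is also an assumption. Only the order of the calls is taken from the description.

```python
import copy


class Demo:
    """Hypothetical stand-in for an Sftp-like object (not the real class)."""

    def __init__(self, local_repo=None):
        self.local_repo = local_repo
        self.query = None
        self.selected = False

    def _object_defaults(self):
        # Assumed behaviour: reset per-run state to its default values.
        self.query = None
        self.selected = False

    def select_data(self):
        # Assumed behaviour: prepare the object for use.
        self.selected = True

    def copy_obj(self):
        # Deep copy so the clone shares no mutable state with the original.
        clone = copy.deepcopy(self)
        clone._object_defaults()
        clone.select_data()
        return clone


original = Demo(local_repo=["products"])
original.query = "country == 'DK'"
clone = original.copy_obj()
```

In this sketch the clone starts from default settings while the original keeps its own query, which is the main reason to prefer a deep copy over a plain assignment.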
get_column_names(save_to=False, files=None)
Retrieve column names from a DataFrame or dictionary and save them to a file.
Input Variables:
- self: Implicit reference to the instance.
- save_to (str, optional): Format in which to save the results, such as CSV (the signature default is False).
- files (list, optional): List of files to retrieve column names from.
Returns:
- List of column names, or None if no valid source is provided.
orbis_to_moodys(file)
Match headings from an Orbis output file to headings in Moody's DataHub.
This method reads headings from an Orbis output file and matches them to headings in Moody's DataHub. The function returns a DataFrame with matched headings and a list of headings not found in Moody's DataHub.
Input Variables:
- file (str): Path to the Orbis output file.
Returns:
- tuple: A tuple where:
  - The first element is a DataFrame containing matched headings.
  - The second element is a list of headings that were not found.
Notes:
- Headings from the Orbis file are processed to remove any extra lines and to ensure uniqueness.
- The DataFrame is sorted based on the number of unique headings for each 'Data Product'.
- If no headings are found, an empty DataFrame is returned.
Example
matched_df, not_found_list = self.orbis_to_moodys('orbis_output.xlsx')
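The heading clean-up and matching steps can be illustrated with a small standard-library sketch. It is not the library's code: `clean_headings`, `match_headings`, and the `known` lookup table are hypothetical, plain tuples stand in for the returned DataFrame, and "removing extra lines" is assumed here to mean keeping only the first line of a multi-line heading.

```python
def clean_headings(raw_headings):
    """Keep the first line of each heading and drop duplicates, preserving order."""
    seen = set()
    cleaned = []
    for heading in raw_headings:
        lines = heading.strip().splitlines()
        if not lines:
            continue  # skip empty headings
        first_line = lines[0].strip()
        if first_line and first_line not in seen:
            seen.add(first_line)
            cleaned.append(first_line)
    return cleaned


def match_headings(raw_headings, known_headings):
    """Split cleaned headings into (matched rows, not-found list).

    known_headings maps a heading to a (hypothetical) data product name.
    """
    matched, not_found = [], []
    for heading in clean_headings(raw_headings):
        if heading in known_headings:
            matched.append((heading, known_headings[heading]))
        else:
            not_found.append(heading)
    return matched, not_found


# Illustrative data only; these headings and product names are made up.
known = {"Company name": "Firmographics", "Total assets": "Financials"}
raw = ["Company name\nLatin alphabet", "Total assets", "Total assets", "Unknown field"]
matched, not_found = match_headings(raw, known)
```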
search_bvd_changes(bvd_list, num_workers=-1)
Search for changes in BvD IDs based on the provided list.
This method retrieves changes in BvD IDs by processing the provided list of BvD IDs. It utilizes concurrent processing for efficiency and returns the new IDs, the newest IDs, and a filtered DataFrame containing relevant change information.
Input Variables:
- bvd_list (list): A list of BvD IDs to check for changes.
- num_workers (int, optional): The number of worker processes to use for concurrent operations. If set to -1 (default), the maximum number of available workers minus two is used, to avoid overloading the system.
Returns:
- tuple: A tuple containing:
  - new_ids: A list of newly identified BvD IDs.
  - newest_ids: A list of the most recent BvD IDs.
  - filtered_df: A DataFrame with relevant change information.
Examples:
new_ids, newest_ids, changes_df = obj.search_bvd_changes(['BVD123', 'BVD456'])
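The "-1 means available workers minus two" rule can be sketched as below. This is an assumption-laden illustration, not the library's logic: `resolve_workers` is a hypothetical helper, "available workers" is taken to mean the CPU count reported by `os.cpu_count()`, and flooring the result at one worker is an added safeguard.

```python
import os


def resolve_workers(num_workers=-1, reserve=2):
    """Return a positive worker count; -1 means 'available CPUs minus reserve'."""
    if num_workers == -1:
        available = os.cpu_count() or 1  # cpu_count() may return None
        return max(1, available - reserve)
    return max(1, int(num_workers))


default_workers = resolve_workers()   # depends on the machine
explicit_workers = resolve_workers(4)
```

Leaving a couple of cores free keeps the machine responsive while the search runs; passing an explicit value bypasses the rule.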
search_company_names(names, num_workers=-1, cut_off=90.1, company_suffixes=None)
Search for company names and find the best matches based on a fuzzy query.
This method performs a search for company names using fuzzy matching techniques, leveraging concurrent processing to improve performance. It processes the provided names against a dataset of firmographic data and returns the best matches based on the specified cut-off score.
Input Variables:
- names (list): A list of company names to search for.
- num_workers (int, optional): The number of worker processes to use for concurrent operations. If set to -1 (default), the maximum number of available workers minus two is used, to avoid overloading the system.
- cut_off (float, optional): The cut-off score for considering a match as valid. Default is 90.1.
- company_suffixes (list, optional): A list of valid company suffixes to consider when searching for names.
Returns:
- pandas.DataFrame: A DataFrame containing the best matches for the searched company names, including associated scores and other relevant information.
Examples:
results = obj.search_company_names(['Example Inc', 'Sample Ltd'], num_workers=4)
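To make the role of the cut-off score and the suffix list concrete, here is a minimal sketch of fuzzy name matching. It does not reproduce the library's matcher: it swaps in `difflib.SequenceMatcher` from the standard library, scales its ratio to a 0-100 score, and uses hypothetical helpers (`normalize`, `best_match`) with a made-up suffix list.

```python
from difflib import SequenceMatcher


def normalize(name, suffixes):
    """Lower-case a name and drop one trailing company suffix, if present."""
    words = name.lower().replace(".", "").replace(",", "").split()
    if len(words) > 1 and words[-1] in suffixes:
        words = words[:-1]
    return " ".join(words)


def best_match(name, candidates, cut_off=90.1, suffixes=("inc", "ltd", "llc")):
    """Return (candidate, score) for the top candidate, or None below cut_off."""
    target = normalize(name, suffixes)
    best = None
    for candidate in candidates:
        ratio = SequenceMatcher(None, target, normalize(candidate, suffixes)).ratio()
        score = 100 * ratio  # scale to 0-100 so it is comparable with cut_off
        if best is None or score > best[1]:
            best = (candidate, score)
    if best is not None and best[1] >= cut_off:
        return best
    return None


# Illustrative candidates only; a real search runs against firmographic data.
candidates = ["Example Incorporated", "Example Inc.", "Sample Limited"]
hit = best_match("Example Inc", candidates)
miss = best_match("Unrelated Holdings", candidates)
```

Stripping suffixes before scoring means that "Example Inc" and "Example Inc." compare as the same core name, while a high cut-off rejects loosely similar names rather than returning a weak match.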