perceval package

Subpackages

Submodules

perceval.archive module

class perceval.archive.Archive(archive_path)

Bases: object

Basic class for archiving raw items fetched by Perceval.

This class archives raw items, usually HTML pages or JSON documents, for later recovery. These raw items are fetched, stored and retrieved by a backend.

Each stored item has a hash code used as its unique identifier. Hash codes are generated from the URI and the other parameters needed to fetch the raw item.

When an instance of Archive is initialized, it expects to access an existing archive file. To create a new, empty archive, use the create class method instead. After creating a new archive, metadata must be initialized by calling the init_metadata method.

Parameters

archive_path – path where this archive is stored

Raises

ArchiveError – when the archive does not exist or is invalid

ARCHIVE_CREATE_STMT = 'CREATE TABLE archive ( id INTEGER PRIMARY KEY AUTOINCREMENT, hashcode VARCHAR(256) UNIQUE NOT NULL, uri TEXT, payload BLOB, headers BLOB, data BLOB)'
ARCHIVE_TABLE = 'archive'
METADATA_CREATE_STMT = 'CREATE TABLE metadata ( origin TEXT, backend_name TEXT, backend_version TEXT, category TEXT, backend_params BLOB, created_on TEXT)'
METADATA_TABLE = 'metadata'
classmethod create(archive_path)

Create a brand new archive.

Call this method to create a new and empty archive. It will initialize the storage file in the path defined by archive_path.

Parameters

archive_path – absolute path where the archive file will be created

Raises

ArchiveError – when the archive file already exists

init_metadata(origin, backend_name, backend_version, category, backend_params)

Init metadata information.

Metadata is composed of the basic information needed to identify where archived data came from and how it can be retrieved and built into Perceval items.

Parameters
  • origin – identifier of the repository

  • backend_name – name of the backend

  • backend_version – version of the backend

  • category – category of the items fetched

  • backend_params – dict representation of the fetch parameters

Raises

ArchiveError – when an error occurs initializing the metadata

static make_hashcode(uri, payload, headers)

Generate a SHA1 based on the given arguments.

Hashcodes created by this method will be used as unique identifiers for the raw items or resources stored by this archive.

Parameters
  • uri – URI to the resource

  • payload – payload of the request needed to fetch the resource

  • headers – headers of the request needed to fetch the resource

Returns

a SHA1 hash code
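
As a rough illustration of how such a hash code could be derived, the sketch below joins the URI with the JSON-serialized payload and headers and hashes the result with SHA1. The function name make_hashcode_sketch, the JSON serialization and the ':' separator are assumptions made for this example; this reference does not specify how the arguments are combined.

```python
import hashlib
import json


def make_hashcode_sketch(uri, payload, headers):
    """Hypothetical stand-in for Archive.make_hashcode."""
    # Sorting keys keeps the hash independent of dict ordering (assumption).
    parts = [
        uri,
        json.dumps(payload, sort_keys=True),
        json.dumps(headers, sort_keys=True),
    ]
    content = ':'.join(parts)
    return hashlib.sha1(content.encode('utf-8')).hexdigest()


code = make_hashcode_sketch(
    'https://example.com/api',
    {'page': 1, 'per_page': 100},
    {'Accept': 'application/json'},
)
print(len(code))  # a SHA1 hex digest is 40 characters long
```

Because the same request parameters always produce the same digest, a later request for the same resource can be matched to the archived entry.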

retrieve(uri, payload, headers)

Retrieve a raw item from the archive.

The method will return the data content corresponding to the hashcode derived from the given parameters.

Parameters
  • uri – request URI

  • payload – request payload

  • headers – request headers

Returns

the archived data

Raises

ArchiveError – when an error occurs retrieving data

store(uri, payload, headers, data)

Store a raw item in this archive.

The method will store data content in this archive. The unique identifier for that item will be generated using the rest of the parameters.

Parameters
  • uri – request URI

  • payload – request payload

  • headers – request headers

  • data – data to store in this archive

Raises

ArchiveError – when an error occurs storing the given data
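
The store and retrieve cycle can be sketched with an in-memory SQLite database and the ARCHIVE_CREATE_STMT statement listed above. This is a minimal sketch, not the Archive implementation: the helper names, the hashing scheme and the use of pickle for the BLOB columns are assumptions.

```python
import hashlib
import json
import pickle
import sqlite3

# Statement copied from ARCHIVE_CREATE_STMT above.
ARCHIVE_CREATE_STMT = (
    "CREATE TABLE archive ( id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "hashcode VARCHAR(256) UNIQUE NOT NULL, uri TEXT, "
    "payload BLOB, headers BLOB, data BLOB)"
)


def hashcode(uri, payload, headers):
    # Hypothetical hashing scheme used only for this sketch.
    content = ':'.join([uri,
                        json.dumps(payload, sort_keys=True),
                        json.dumps(headers, sort_keys=True)])
    return hashlib.sha1(content.encode('utf-8')).hexdigest()


def store(conn, uri, payload, headers, data):
    # BLOB columns are pickled here; the real format is an assumption.
    conn.execute(
        "INSERT INTO archive (hashcode, uri, payload, headers, data) "
        "VALUES (?, ?, ?, ?, ?)",
        (hashcode(uri, payload, headers), uri,
         pickle.dumps(payload), pickle.dumps(headers), pickle.dumps(data)),
    )
    conn.commit()


def retrieve(conn, uri, payload, headers):
    row = conn.execute(
        "SELECT data FROM archive WHERE hashcode = ?",
        (hashcode(uri, payload, headers),),
    ).fetchone()
    if row is None:
        raise LookupError("entry not found in archive")
    return pickle.loads(row[0])


conn = sqlite3.connect(':memory:')
conn.execute(ARCHIVE_CREATE_STMT)
store(conn, 'https://example.com/items', {'page': 1}, {}, '{"id": 7}')
restored = retrieve(conn, 'https://example.com/items', {'page': 1}, {})
```

The UNIQUE constraint on hashcode means that storing the same request twice fails, which is one way an error when storing data can arise.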

class perceval.archive.ArchiveManager(dirpath)

Bases: object

Manager for handling archives in Perceval.

This class manages the creation, deletion and access of Archive objects. Archives are stored under the dirpath directory, using a random SHA1 for each file. The first byte of the hashcode will be the name of the subdirectory; the remaining bytes, the archive name.

Parameters

dirpath – path where the archives are stored

STORAGE_EXT = '.sqlite3'
create_archive()

Create a new archive.

The method creates a brand new archive in the filesystem, with a random SHA1 as its name. The first byte of the hashcode will be the name of the subdirectory; the remaining bytes, the archive name.

Returns

a new Archive object

Raises

ArchiveManagerError – when an error occurs creating the new archive
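
The naming scheme can be illustrated as follows. This is a sketch under assumptions: the helper name archive_path_sketch and the use of uuid4 as the source of randomness are hypothetical; only the directory layout and the '.sqlite3' extension come from the description above.

```python
import hashlib
import os
import uuid

STORAGE_EXT = '.sqlite3'


def archive_path_sketch(dirpath):
    """Build a path of the form <dirpath>/<first byte>/<remaining bytes>.sqlite3."""
    # Any random input would do; uuid4 is just a convenient source (assumption).
    digest = hashlib.sha1(uuid.uuid4().bytes).hexdigest()
    # One byte corresponds to two hexadecimal characters.
    return os.path.join(dirpath, digest[:2], digest[2:] + STORAGE_EXT)


path = archive_path_sketch('archives')
print(path)
```

Spreading files across subdirectories named after the first byte keeps any single directory from growing too large.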

remove_archive(archive_path)

Remove an archive.

This method deletes from the filesystem the archive stored in archive_path.

Parameters

archive_path – path to the archive

Raises

ArchiveManagerError – when an error occurs removing the archive

search(origin, backend_name, category, archived_after)

Search archives.

Get the archives which store data based on the given parameters. These parameters define what the origin was (origin), how data was fetched (backend_name) and the data type (category). Only those archives created on or after archived_after will be returned.

The method returns a list with the file paths to those archives. The list is sorted by the date of creation of each archive.

Parameters
  • origin – data origin

  • backend_name – backend used to fetch data

  • category – type of the items fetched by the backend

  • archived_after – get archives created on or after this date

Returns

a list with archive names which match the search criteria

perceval.backend module

class perceval.backend.Backend(origin, tag=None, archive=None, blacklist_ids=None, ssl_verify=True)

Bases: object

Abstract class for backends.

Base class to fetch data from a repository. This repository will be named as ‘origin’. During the initialization, an Archive object can be provided for archiving raw data from the repositories.

To avoid a NotImplementedError, derived classes have to implement or define:

For more information on the details of implementing these methods, please see the docs on each method.

The fetched items can be tagged using the tag parameter, which is useful to trace data. When it is set to None or to an empty string, the tag will be the same as the origin attribute.

To track which version of the backend was used during the fetching process, this class provides a version attribute that each backend may override.

Each fetch operation generates a summary, available via the property summary. By default, it includes the last UUID generated, number of items fetched, skipped and their sum, plus the min, max and last updated_on times. Furthermore, for backends using offsets, the corresponding summary contains the min and max offsets retrieved. Finally, the summary also includes some extra fields, which can be used by any backend to include fetch-specific information.

Backends also produce a set of search fields, exposed in the search_fields attribute of each item returned by a call to fetch(). These contain the item_id, as well as any number of backend-specific fields.

Parameters
  • origin – identifier of the repository

  • tag – tag items using this label

  • archive – archive to store/retrieve data

  • ssl_verify – enable/disable SSL verification

Raises

ValueError – raised when archive is not an instance of Archive class

CATEGORIES = []

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

CLASSIFIED_FIELDS = []

A list of fields that should be considered sensitive or confidential.

Fields listed here will be hidden from fetched items, when this behaviour is requested.

Each classified field is represented as a list of strings. Because returned items are dicts that may contain nested dicts, the list stores the “path” of nested dict keys leading to the field to remove. For example, ['my', 'classified', 'field'] will remove field from the item['data']['my']['classified'] dict.

Classified data filtering and archiving are not compatible to prevent data leaks or security issues.
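
The path-based removal described above can be sketched like this. The function name filter_classified_sketch is hypothetical, and silently ignoring paths that do not exist in the item is an assumption of this example.

```python
def filter_classified_sketch(item, classified_fields):
    """Remove each classified field, given as a path of nested dict keys."""
    for path in classified_fields:
        node = item['data']
        # Walk down to the dict that holds the final key.
        for key in path[:-1]:
            node = node.get(key)
            if not isinstance(node, dict):
                break  # path not present in this item; nothing to remove
        else:
            node.pop(path[-1], None)
    return item


item = {'data': {'my': {'classified': {'field': 'secret', 'other': 1}}}}
filtered = filter_classified_sketch(item, [['my', 'classified', 'field']])
print(filtered)  # {'data': {'my': {'classified': {'other': 1}}}}
```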

EXTRA_SEARCH_FIELDS = {}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

    {
        'key-1': value-1,
        'key-2': value-2,
        'key-3': value-3
    }

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

    {
        'project_id': ['fields', 'project', 'id'],
        'project_key': ['fields', 'project', 'key'],
        'project_name': ['fields', 'project', 'name']
    }

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
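
A minimal sketch of how such paths could be resolved against an item is given below. The helper name search_fields_sketch is hypothetical, and returning None for a path that is missing from the item is an assumption of this example.

```python
EXTRA_SEARCH_FIELDS = {
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
}


def search_fields_sketch(item_data, item_id):
    """Build the search fields dict for one item."""
    fields = {'item_id': item_id}
    for name, path in EXTRA_SEARCH_FIELDS.items():
        value = item_data
        # Follow the path one key at a time through the nested dicts.
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
        fields[name] = value
    return fields


issue = {'fields': {'project': {'id': '10', 'key': 'PRJ'}}}
result = search_fields_sketch(issue, '42')
print(result)  # {'item_id': '42', 'project_id': '10', 'project_key': 'PRJ'}
```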

ORIGIN_UNIQUE_FIELD = None

A field unique to a given origin for items produced by this backend.

If ORIGIN_UNIQUE_FIELD is defined, users can pass a list of blocked values which should not be included in the results, if the field defined here contains them. For example, if ORIGIN_UNIQUE_FIELD were set to post_id, then users could pass a list of post ids that should be excluded from the results.

If set to None, blacklisting will be disabled completely. Otherwise, this should be set to an OriginUniqueField containing the name and data type of the field.

Note: Origin in this context refers to one site, API, or other remote that contains several repositories, each consisting of many items of several categories. For example, for the backend GitLab, an origin would be one instance of GitLab, such as gitlab.com or opensource.ieee.org, each of which contains many repositories, which in turn contain items such as issues and merge requests.

To access this field, please prefer origin_unique_field().

property archive
property categories

See CATEGORIES.

property classified_fields

A list of fields to be hidden from results.

Fields are represented as a list of strings, where each string is a period delimited field. For example, ‘attributes.author_info.secret_info’ would hide the secret info of the author in the attributes dict.

fetch(category, filter_classified=False, **kwargs)

Fetch items from the repository.

The method retrieves items from a repository.

To remove classified fields from the resulting items, set the parameter filter_classified. Take into account that this parameter is incompatible with archiving items. Raw client data are archived before any other processing takes place, so classified data are stored within the archive. To prevent possible data leaks or security issues when users do not need these fields, archiving and filtering are not compatible.

Parameters
  • category – the category of the items fetched

  • filter_classified – remove classified fields from the resulting items

  • kwargs – a list of other parameters (e.g., from_date, offset, etc.) specific for each backend

Returns

a generator of items

Raises

BackendError – either when the category is not valid or ‘filter_classified’ and ‘archive’ are active at the same time.

fetch_from_archive()

Fetch the items from an archive.

It returns the items stored within an archive. If this method is called but no archive was provided, the method will raise an ArchiveError exception.

Returns

a generator of items

Raises

ArchiveError – raised when an error occurs accessing an archive

fetch_items(category, **kwargs)

Retrieve raw data from the repository.

This method is to be implemented by implementors of Backend, and is intended for internal use. Developers hoping to retrieve processed results should use the fetch() method.

This method receives a category of items to fetch from the repository. This will be one of the categories defined in the CATEGORIES class variable. The method also receives a list of keyword arguments. These arguments include any command-line variables defined by the corresponding BackendCommand.

The method is then responsible for retrieving all items matching the criteria defined by the keyword args and the given category, then returning them as a generator of dicts. The structure of the dicts is irrelevant, but each dict should represent exactly one item.

Parameters
  • category – the category of items to retrieve from the repository

  • kwargs – additional arguments to assist or specify retrieval

Returns

a generator producing items

filter_classified_data(item)

Remove classified or confidential data from an item.

It removes those fields that contain data considered as classified. Classified fields are defined in the CLASSIFIED_FIELDS class attribute.

Parameters

item – fields will be removed from this item

Returns

the same item but with confidential data filtered

classmethod has_archiving()

Whether or not this backend supports archiving requests.

For implementors, this means whether _init_client() can be called with from_archive=True and whether the backend will respect that. If the client used by the backend is an HttpClient, and _init_client() passes from_archive on to the HttpClient initializer, this should be true.

Classified data filtering and archiving are not compatible to prevent data leaks or security issues.

classmethod has_resuming()

Whether this backend supports resuming interrupted collections.

When interrupted, some backends may support resuming the collection by setting the from_date parameter on fetch_items() or fetch() to the date of the last item retrieved from the repository.

However, for some backends, this cannot be done, for example because results are retrieved from newest to oldest. If resuming was attempted on a backend like this, then some items would be missed.

For example, if the backend was in the middle of retrieving items from January 5th through 1st, but was interrupted when retrieving items from the 3rd, then it would be missing items for the 2nd and 1st. If this backend was resumed by setting from_date to the most recent item (the 5th), these missing items would not be retrieved, since they are earlier than the from_date.

This method is used to indicate that this backend can be resumed in this manner without missing any items. If a backend declares that it supports resuming, then from_date should be set to the date of the most recent item from the last collection, even if it failed. Otherwise, from_date should be set to the most recent item of the last successful collection. Resuming in this manner should not leave any holes in the collected items.

This can be used to speed up collections by skipping network IO for items that have already been downloaded and added to the database.

Additionally, from_date may be set regardless of this setting if the last collection did not fail, or if the user is not interested in items earlier than the provided datetime.

Implementers should return a constant True if their backend supports resuming collections in this manner, or False otherwise.
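
The January example above can be reproduced with a small simulation. It is only a sketch: the data source is a plain list standing in for a hypothetical backend that serves items from newest to oldest, and no Perceval API is involved.

```python
from datetime import date

# Items dated January 1st to 5th, served newest first by a hypothetical backend.
all_days = [date(2024, 1, d) for d in range(5, 0, -1)]

# The collection is interrupted after the items of the 5th, 4th and 3rd.
collected = all_days[:3]

# Naive resume: set from_date to the most recent item collected (the 5th).
from_date = max(collected)
resumed = [d for d in all_days if d >= from_date]

# Days that neither the interrupted run nor the resumed run retrieved.
missing = sorted(set(all_days) - set(collected) - set(resumed))
print(missing)  # the 1st and 2nd are never retrieved
```

A backend that behaves like this one should return False from has_resuming().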

metadata(item, filter_classified=False)

Add metadata to an item.

It adds metadata to a given item such as how and when it was fetched. The contents from the original item will be stored under the ‘data’ keyword.

Parameters
  • item – an item fetched by a backend

  • filter_classified – sets if classified fields were filtered

static metadata_category(item)

Identify the category of a given item.

Every item returned by fetch_items() should belong to exactly one of the categories listed in the CATEGORIES class member. This method should determine which category the item belongs to, given one of the items returned by fetch_items() method.

Note that for all items returned by a call to fetch_items(), this method is expected to return a category equivalent to the one passed as the category argument.

Returns

One of the strings in CATEGORIES

static metadata_id(item)

Produce a unique identifier for an item.

Given one of the items produced by fetch_items(), produce a unique identifier for that item. Typically, this is an identifier given by the repository itself, such as a commit hash or post id.

The id should be represented by a string.

static metadata_updated_on(item)

Determine the last time an item was updated.

Given one of the items produced by fetch_items(), attempt to identify the last time this item was modified.

Returns

The timestamp of the last modification, represented as epoch seconds (a UNIX timestamp)

property origin
property origin_unique_field

See ORIGIN_UNIQUE_FIELD.

search_fields(item)

Add search fields to an item.

It adds the values of the fields defined in the EXTRA_SEARCH_FIELDS class attribute with their corresponding keys.

Parameters

item – the item from which to extract the search field values

Returns

a dict of search fields

property ssl_verify
property summary
version = '0.12.0'
class perceval.backend.BackendCommand(*args, debug=False)

Bases: object

Abstract class to run backends from the command line.

When the class is initialized, it parses the given arguments using the argument parser defined in the setup_cmd_parser method. Those arguments will be stored in the attribute parsed_args.

The arguments will be used to initialize and run the Backend object assigned to this command. The backend used to run the command is stored under the BACKEND class attribute. Any class derived from this one must set its own Backend class.

Moreover, the method setup_cmd_parser must be implemented to execute the backend.

Parameters

debug – boolean flag to check if application is running in debug mode

BACKEND = None
run()

Fetch and write items.

This method runs the backend to fetch the items from the given origin. Items are converted to JSON objects and written to the defined output. A summary with the result is written to the log.

If fetch-archive parameter was given as an argument during the initialization of the instance, the items will be retrieved using the archive manager.

classmethod setup_cmd_parser()
class perceval.backend.BackendCommandArgumentParser(backend, from_date=False, to_date=False, offset=False, basic_auth=False, token_auth=False, archive=False, aliases=None, blacklist=False, ssl_verify=False)

Bases: object

Manage and parse backend command arguments.

This class defines and parses a set of arguments common to backend commands. Some parameters, like archive or the different types of authentication, can be set during the initialization of the instance.

Parameters
  • backend – backend object

  • from_date – set from_date argument

  • to_date – set to_date argument

  • offset – set offset argument

  • basic_auth – set basic authentication arguments

  • token_auth – set token/key authentication arguments

  • archive – set archiving arguments

  • aliases – define aliases for parsed arguments

  • ssl_verify – set SSL verify argument

Raises

AttributeError – when both from_date and offset are set to True

parse(*args)

Parse a list of arguments.

Parse argument strings needed to run a backend command. The result will be an argparse.Namespace object populated with the values obtained after the validation of the parameters.

Parameters

args – argument strings

Returns

an object with the parsed values

class perceval.backend.BackendItemsGenerator(backend_class, backend_args, category, filter_classified=False, manager=None, fetch_archive=False, archived_after=None)

Bases: object

BackendItemsGenerator class.

This class provides a generator through the items attribute that will fetch items from any data source and/or archive in a transparent way. A summary with the result of the process can be accessed via the attribute summary.

To initialize an instance, it is necessary to pass the backend that will be used to fetch data, its parameters and other useful data, such as the category of the items to retrieve and the archive options.

This object can also be used as a context manager.

Parameters
  • backend_class – backend class to fetch items

  • backend_args – dict of arguments needed to fetch the items

  • category – category of the items to retrieve. If None, it will use the default backend category

  • filter_classified – remove classified fields from the resulting items. Note that filter classified is not supported for archived items.

  • manager – archive manager where the items will be retrieved

  • fetch_archive – If enabled, items are fetched from archives

  • archived_after – return items archived after this date

property summary

Return the summary object of the last fetch execution

class perceval.backend.OriginUniqueField(name, type)

Bases: tuple

property name

Alias for field number 0

property type

Alias for field number 1

class perceval.backend.Summary

Bases: object

Summary class for fetch executions.

This class models the summary of a fetch execution. It includes the last UUID, number of items fetched, skipped and their sum, plus the minimum, maximum and last updated_on times.

Furthermore, for backends using offsets, the corresponding summary contains the minimum, maximum and last offsets retrieved.

Finally, the summary also includes some extra fields, which can be used by any backend to include fetch-specific information.

property total

Number of items retrieved. This includes fetched and skipped items.

update(item)

Update the summary attributes by accessing the item data.

Parameters

item – a Perceval item

perceval.backend.fetch(backend_class, backend_args, category, filter_classified=False, manager=None)

Fetch items using the given backend.

Generator to get items using the given backend class. When an archive manager is given, this function will store the fetched items in an Archive. If an exception is raised, this archive will be removed to avoid corrupted archives.

The parameters needed to initialize the backend class and get the items are given using backend_args dict parameter.

Parameters
  • backend_class – backend class to fetch items

  • backend_args – dict of arguments needed to fetch the items

  • category – category of the items to retrieve. If None, it will use the default backend category

  • filter_classified – remove classified fields from the resulting items

  • manager – archive manager needed to store the items

Returns

a generator of items

perceval.backend.fetch_from_archive(backend_class, backend_args, manager, category, archived_after)

Fetch items from an archive manager.

Generator to get the items of a category (previously fetched by the given backend class) from an archive manager. Only those items archived after the given date will be returned.

The parameters needed to initialize the backend and get the items are given using the backend_args dict parameter.

Parameters
  • backend_class – backend class to retrieve items

  • backend_args – dict of arguments needed to retrieve the items

  • manager – archive manager where the items will be retrieved

  • category – category of the items to retrieve

  • archived_after – return items archived after this date

Returns

a generator of archived items

perceval.backend.find_backends(top_package)

Find available backends.

Look for the Perceval backends and commands under top_package and its sub-packages. When top_package defines a namespace, backends under that same namespace will be found too.

Parameters

top_package – package storing backends

Returns

a tuple with two dicts: one with Backend classes and one with BackendCommand classes

perceval.backend.uuid(*args)

Generate a UUID based on the given parameters.

The UUID will be the SHA1 of the concatenation of the values from the list. The separator between these values is ‘:’. Each value must be a non-empty string, otherwise, the function will raise an exception.

Parameters

*args

list of arguments used to generate the UUID

Returns

a universal unique identifier

Raises

ValueError – when any one of the values is not a string, is empty or None.
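
The behaviour described above can be sketched with hashlib. The function name uuid_sketch and the UTF-8 encoding of the joined string are assumptions of this example.

```python
import hashlib


def uuid_sketch(*args):
    """SHA1 of the ':'-joined values; every value must be a non-empty string."""
    for value in args:
        if not isinstance(value, str) or not value:
            raise ValueError("every value must be a non-empty string")
    joined = ':'.join(args)
    return hashlib.sha1(joined.encode('utf-8')).hexdigest()


item_uuid = uuid_sketch('https://example.com/repo', '12345')
print(len(item_uuid))  # 40 hexadecimal characters
```

Since the identifier depends only on its inputs, fetching the same item from the same origin twice yields the same UUID.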

perceval.client module

class perceval.client.HttpClient(base_url, max_retries=5, sleep_time=1, extra_headers=None, extra_status_forcelist=None, extra_retry_after_status=None, archive=None, from_archive=False, ssl_verify=True)

Bases: object

Abstract class for HTTP clients.

Base class to query data sources, taking care of retrying requests in case of connection issues. If the data source does not send back a response after retrying a request, a RetryError exception is thrown.

Subclasses can use the fetch method to obtain data from the data source.

To track which version of the client was used during the fetching process, this class provides a version attribute that each client may override.

Parameters
  • base_url – base URL of the data source

  • max_retries – number of max retries to a data source before raising a RetryError exception

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • extra_headers – extra headers to be included in the requests

  • extra_status_forcelist – a set of HTTP status codes that will force a retry

  • extra_retry_after_status – a set of HTTP status codes that will perform a retry respecting the Retry-After header

  • archive – archive to store/retrieve items

  • from_archive – if True the data is fetched from an archive

  • ssl_verify – enable/disable SSL verification

DEFAULT_HEADERS = {'User-Agent': 'Perceval/0.17.12'}
DEFAULT_METHOD_WHITELIST = False
DEFAULT_RAISE_ON_REDIRECT = True
DEFAULT_RAISE_ON_STATUS = True
DEFAULT_RESPECT_RETRY_AFTER_HEADER = True
DEFAULT_RETRY_AFTER_STATUS_CODES = [413, 429, 503]
DEFAULT_SLEEP_TIME = 1
DEFAULT_STATUS_FORCE_LIST = [408, 423, 504]
GET = 'GET'
MAX_RETRIES = 5
MAX_RETRIES_ON_CONNECT = 5
MAX_RETRIES_ON_READ = 5
MAX_RETRIES_ON_REDIRECT = 5
MAX_RETRIES_ON_STATUS = 5
POST = 'POST'
fetch(url, payload=None, headers=None, method='GET', stream=False, auth=None)

Fetch the data from a given URL.

Parameters
  • url – link to the resource

  • payload – payload of the request

  • headers – headers of the request

  • method – type of request call (GET or POST)

  • stream – defer downloading the response body until the response content is available

  • auth – auth of the request

Returns

a response object

static sanitize_for_archive(url, headers, payload)

Sanitize the URL, headers and payload of an HTTP request before storing/retrieving items. By default, this method does not modify url, headers and payload. The modifications take place within the specific backends that redefine sanitize_for_archive.

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and payload sanitized

version = '0.3.0'
class perceval.client.RateLimitHandler

Bases: object

Class to handle rate limit for HTTP clients.

Parameters
  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • rate_limit_header – header to know the current rate limit

  • rate_limit_reset_header – header to know the next rate limit reset

MAX_RATE_LIMIT = 500
MIN_RATE_LIMIT = 10
RATE_LIMIT_HEADER = 'X-RateLimit-Remaining'
RATE_LIMIT_RESET_HEADER = 'X-RateLimit-Reset'
calculate_time_to_reset()

Calculate the seconds to reset the token requests.

setup_rate_limit_handler(sleep_for_rate=False, min_rate_to_sleep=10, rate_limit_header='X-RateLimit-Remaining', rate_limit_reset_header='X-RateLimit-Reset')

Setup the rate limit handler.

Parameters
  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to make the fetching process sleep

  • rate_limit_header – header from which to extract the rate limit data

  • rate_limit_reset_header – header from which to extract the rate limit reset data

sleep_for_rate_limit()

The fetching process sleeps until the rate limit is restored or raises a RateLimitError exception if sleep_for_rate flag is disabled.
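
The decision described above can be sketched as a standalone function. Everything here is hypothetical: the function and exception names are stand-ins, and the rule that the pause applies once the remaining rate drops to min_rate_to_sleep or below is an assumption made for this example.

```python
import time


class RateLimitErrorSketch(Exception):
    """Hypothetical stand-in for perceval.errors.RateLimitError."""


def sleep_for_rate_limit_sketch(rate_limit, min_rate_to_sleep, seconds_to_reset,
                                sleep_for_rate=False, sleep=time.sleep):
    """Return the number of seconds slept (0 when no pause was needed)."""
    # Assumption: nothing happens while enough requests remain.
    if rate_limit is None or rate_limit > min_rate_to_sleep:
        return 0
    if not sleep_for_rate:
        raise RateLimitErrorSketch(
            "rate limit exhausted; %s seconds to rate reset" % seconds_to_reset)
    sleep(seconds_to_reset)
    return seconds_to_reset


# The sleep function is injected so the example does not actually wait.
slept = sleep_for_rate_limit_sketch(5, 10, 30, sleep_for_rate=True,
                                    sleep=lambda seconds: None)
print(slept)  # 30
```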

update_rate_limit(response)

Update the rate limit and the time to reset from the response headers.

Parameters

response – the response object

version = '0.2'

perceval.errors module

exception perceval.errors.ArchiveError(**kwargs)

Bases: perceval.errors.BaseError

Generic error for archive objects

message = '%(cause)s'
exception perceval.errors.ArchiveManagerError(**kwargs)

Bases: perceval.errors.BaseError

Generic error for archive manager

message = '%(cause)s'
exception perceval.errors.BackendCommandArgumentParserError(**kwargs)

Bases: perceval.errors.BaseError

Generic error for BackendCommandArgumentParser

message = '%(cause)s'
exception perceval.errors.BackendError(**kwargs)

Bases: perceval.errors.BaseError

Generic error for backends

message = '%(cause)s'
exception perceval.errors.BaseError(**kwargs)

Bases: Exception

Base class for Perceval exceptions.

Derived classes can overwrite the error message declaring message property.

message = 'Perceval base error'
exception perceval.errors.HttpClientError(**kwargs)

Bases: perceval.errors.BaseError

Generic error for HTTP Client

message = '%(cause)s'
exception perceval.errors.ParseError(**kwargs)

Bases: perceval.errors.BaseError

Exception raised when a parsing error occurs

message = '%(cause)s'
exception perceval.errors.RateLimitError(**kwargs)

Bases: perceval.errors.BaseError

Exception raised when the rate limit is exceeded

message = '%(cause)s; %(seconds_to_reset)s seconds to rate reset'
property seconds_to_reset
exception perceval.errors.RepositoryError(**kwargs)

Bases: perceval.errors.BaseError

Generic error for repositories

message = '%(cause)s'
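
The message attributes listed for these exceptions are printf-style templates keyed by name. The sketch below shows one way keyword arguments could be applied to such a template; the class names ending in Sketch are hypothetical stand-ins, and the constructor body is an assumption, not the documented implementation.

```python
class BaseErrorSketch(Exception):
    """Hypothetical stand-in for perceval.errors.BaseError."""

    message = 'Perceval base error'

    def __init__(self, **kwargs):
        super().__init__()
        # The class-level template is filled with the keyword arguments.
        self.msg = self.message % kwargs

    def __str__(self):
        return self.msg


class RateLimitErrorSketch(BaseErrorSketch):
    message = '%(cause)s; %(seconds_to_reset)s seconds to rate reset'


error = RateLimitErrorSketch(cause='API rate limit exceeded', seconds_to_reset=30)
print(str(error))  # API rate limit exceeded; 30 seconds to rate reset
```

With this design a derived class only needs to declare its own message template.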

perceval.perceval module

class perceval.perceval.ListBackends(option_strings, dest, **kwargs)

Bases: argparse.Action

perceval.perceval.configure_logging(debug=False)

Configure Perceval logging

The function configures the log messages produced by Perceval. By default, log messages are sent to stderr. Set the parameter debug to activate the debug mode.

Parameters

debug – set the debug mode

perceval.perceval.main()
perceval.perceval.parse_args(perceval_cmds)

Parse command line arguments

perceval.utils module

perceval.utils.check_compressed_file_type(filepath)

Check if filename is a compressed file supported by the tool.

This function uses magic numbers (first four bytes) to determine the type of the file. Supported types are ‘gz’ and ‘bz2’. When the filetype is not supported, the function returns None.

Parameters

filepath – path to the file

Returns

‘gz’ or ‘bz2’; None if the type is not supported
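
Detection by magic number can be sketched as follows. The function name compressed_type_sketch is hypothetical; the signatures used are the standard ones for gzip (bytes 0x1f 0x8b) and bzip2 (the ASCII letters "BZh").

```python
import bz2
import gzip
import os
import tempfile


def compressed_type_sketch(filepath):
    """Return 'gz', 'bz2' or None based on the first bytes of the file."""
    with open(filepath, 'rb') as f:
        magic = f.read(4)
    if magic[:2] == b'\x1f\x8b':   # gzip signature
        return 'gz'
    if magic[:3] == b'BZh':        # bzip2 signature
        return 'bz2'
    return None


tmpdir = tempfile.mkdtemp()
gz_path = os.path.join(tmpdir, 'sample.gz')
bz2_path = os.path.join(tmpdir, 'sample.bz2')
txt_path = os.path.join(tmpdir, 'sample.txt')

with gzip.open(gz_path, 'wb') as f:
    f.write(b'hello')
with bz2.open(bz2_path, 'wb') as f:
    f.write(b'hello')
with open(txt_path, 'wb') as f:
    f.write(b'hello')

print(compressed_type_sketch(gz_path))   # gz
print(compressed_type_sketch(bz2_path))  # bz2
print(compressed_type_sketch(txt_path))  # None
```

Reading the signature instead of trusting the file extension means a renamed file is still identified correctly.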

perceval.utils.message_to_dict(msg)

Convert an email message into a dictionary.

This function transforms an email.message.Message object into a dictionary. Headers are stored as key:value pairs while the body of the message is stored inside the body key. Body may have two other keys inside, ‘plain’, for plain body messages and ‘html’, for HTML encoded messages.

The returned dictionary has the type requests.structures.CaseInsensitiveDict because the same headers can appear with different case formats in the same message.

Parameters

msg – email message of type email.message.Message

Returns

dictionary of type requests.structures.CaseInsensitiveDict

Raises

ParseError – when an error occurs transforming the message to a dictionary

perceval.utils.months_range(from_date, to_date)

Generate a months range.

Generator of months starting on from_date until to_date. Each returned item is a tuple of two datetime objects like in (month, month+1). Thus, the result will follow the sequence:

((fd, fd+1), (fd+1, fd+2), …, (td-2, td-1), (td-1, td))

Parameters
  • from_date – generate dates starting on this month

  • to_date – generate dates until this month

Returns

a generator of months range
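
The sequence above can be sketched with the standard library. The function name months_range_sketch is hypothetical, and normalizing both dates to the first day of their month is an assumption of this example.

```python
from datetime import datetime


def months_range_sketch(from_date, to_date):
    """Yield (month, month+1) pairs from from_date's month to to_date's month."""
    # Normalize to the first day of each month (assumption).
    current = datetime(from_date.year, from_date.month, 1)
    end = datetime(to_date.year, to_date.month, 1)
    while current < end:
        # Roll over to January of the next year after December.
        year = current.year + current.month // 12
        month = current.month % 12 + 1
        following = datetime(year, month, 1)
        yield current, following
        current = following


pairs = list(months_range_sketch(datetime(2023, 11, 15), datetime(2024, 2, 1)))
for start, end in pairs:
    print(start.strftime('%Y-%m'), end.strftime('%Y-%m'))
```

Each pair can be used directly as the lower and upper bound of a monthly query.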

perceval.utils.remove_invalid_xml_chars(raw_xml)

Remove control and invalid characters from an XML stream.

Looks for invalid characters and substitutes them with whitespace. This solution is based on these two posts: Olemis Lang’s response on StackOverflow (http://stackoverflow.com/questions/1707890) and lawlesst’s on GitHub Gist (https://gist.github.com/lawlesst/4110923), which is based on the previous answer.

Parameters

raw_xml – XML stream

Returns

a purged XML stream
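
A minimal sketch of this kind of clean-up using a regular expression is shown below. The function name and the exact character set are assumptions: the pattern covers the control characters that XML 1.0 forbids plus U+FFFE and U+FFFF, which may differ from the set handled by the actual function.

```python
import re

# Control characters not allowed by XML 1.0 (tab, LF and CR are valid),
# plus the non-characters U+FFFE and U+FFFF.
_ILLEGAL_XML_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]')


def remove_invalid_xml_chars_sketch(raw_xml):
    """Replace each invalid character with a single space."""
    return _ILLEGAL_XML_CHARS.sub(' ', raw_xml)


purged = remove_invalid_xml_chars_sketch('<a>bad\x00char\x1f</a>')
print(purged)  # <a>bad char </a>
```

Replacing with a space rather than deleting keeps the length of the stream unchanged.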

perceval.utils.xml_to_dict(raw_xml)

Convert an XML stream into a dictionary.

This function transforms an XML stream into a dictionary. The attributes are stored as single elements while child nodes are stored into lists. The text node is stored using the special key ‘__text__’.

This code is based on Winston Ewert’s solution to this problem. See http://codereview.stackexchange.com/questions/10400/convert-elementtree-to-dict for more info. The code was licensed as CC BY-SA 3.0.

Parameters

raw_xml – XML stream

Returns

a dict with the XML data

Raises

ParseError – raised when an error occurs parsing the given XML stream
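
The resulting structure can be illustrated with xml.etree.ElementTree. This is a sketch of the mapping described above, not the function itself: the helper name node_to_dict is hypothetical, and stripping surrounding whitespace from text nodes is an assumption.

```python
import xml.etree.ElementTree as ET


def node_to_dict(node):
    """Convert one element: attributes as values, children as lists."""
    d = dict(node.attrib)              # attributes become single values
    text = (node.text or '').strip()
    if text:
        d['__text__'] = text           # text node under the special key
    for child in node:
        # Children with the same tag are collected in one list.
        d.setdefault(child.tag, []).append(node_to_dict(child))
    return d


raw_xml = '<bug id="1"><comment>first</comment><comment>second</comment></bug>'
result = node_to_dict(ET.fromstring(raw_xml))
print(result)
# {'id': '1', 'comment': [{'__text__': 'first'}, {'__text__': 'second'}]}
```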

Module contents