ncdiff.reader package

Submodules

ncdiff.reader.base module

Factory functions and base classes for the reader framework.

since

2012-02-10

ncdiff.reader.base.create_reader_from_module(reader_name, module_name, *args, **kwargs)

Try to load the reader class from the specified module.

Parameters
  • reader_name (str) – The type of reader to create.

  • module_name – The module to load the reader class from.

  • *args – Positional arguments passed to the constructor of the reader class.

  • **kwargs – Keyword arguments passed to the constructor of the reader class.

ncdiff.reader.base.create_old_reader(target_configuration)

Create the reader for the old input file according to the passed configuration.

Note: Row filters must be passed to the reader if filtering of rows is supported!

Parameters

target_configuration (TargetConfiguration) – The configuration of the diff target.

Returns

A reader implementation to read the old input file.

Return type

BaseReader

ncdiff.reader.base.create_new_reader(target_configuration)

Create the reader for the new input file according to the passed configuration.

Note: Row filters must be passed to the reader if filtering of rows is supported!

Parameters

target_configuration (TargetConfiguration) – The configuration of the diff target.

Returns

A reader implementation to read the new input file.

Return type

BaseReader

ncdiff.reader.base.create_old_sorted_file_reader(target_configuration)

Create the reader for the old sorted file.

Parameters

target_configuration (TargetConfiguration) – The configuration of the diff target.

ncdiff.reader.base.create_new_sorted_file_reader(target_configuration)

Create the reader for the new sorted file.

Parameters

target_configuration (TargetConfiguration) – The configuration of the diff target.

class ncdiff.reader.base.BaseReader(has_header, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: object

Abstract base class that defines the reader interface.

All readers support the context manager protocol (with statement).

close()

Close the location.

This method never fails.

get_information()

Get information from data.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names and the number of columns:

  • 'column_names': None

  • 'columnCount': 11

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the location.

reader()

Return a reader.

A reader is expected to return all cell values in string format with any leading and trailing white space characters removed.

Returns

A reader supporting the iterator protocol.

Return type

iterable
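The interface described above can be illustrated with a small stand-in class. The sketch below is not ncdiff code: ListReader is a hypothetical in-memory reader that only mimics the documented open(), close(), has_header() and reader() methods and the context manager protocol, assuming that reader() yields rows of stripped strings.

```python
class ListReader:
    """Hypothetical in-memory reader; an illustration, not part of ncdiff."""

    def __init__(self, rows, has_header=False):
        self._rows = rows
        self._has_header = has_header
        self.is_open = False

    def open(self):
        self.is_open = True

    def close(self):
        # Mirrors the documented contract: closing never fails.
        self.is_open = False

    def has_header(self):
        return self._has_header

    def reader(self):
        # Yield each row with every cell converted to a stripped string.
        for row in self._rows:
            yield [str(cell).strip() for cell in row]

    def __enter__(self):
        self.open()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions


with ListReader([[" a ", 1], ["b", " 2"]]) as source:
    rows = list(source.reader())
```

Leaving the with block calls close(), so the caller does not need an explicit try/finally.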

class ncdiff.reader.base.FileReader(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.BaseReader

Abstract base class implementing support to read files.

close()

Close the input file.

get_information()

Get information from a file.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns, and the size of the input file in bytes:

  • 'column_names': None

  • 'columnCount': 11

  • 'size': 123456

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the file for reading in universal newlines mode.

reader()

Return a reader.

A reader is expected to return all cell values in string format with any leading and trailing white space characters removed.

Returns

A reader supporting the iterator protocol.

Return type

iterable

ncdiff.reader.csv module

Reader classes for CSV format.

since

2012-02-10

class ncdiff.reader.csv.CSVReader(has_header, input_path, delimiter=';', replace_delimiters=None, quoting='QUOTE_MINIMAL', quotechar='"', doublequote=True, escapechar=None, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.FileReader

Read csv files.

close()

Close the input file.

get_information()

Get information from a file.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns, and the size of the input file in bytes:

  • 'column_names': None

  • 'columnCount': 11

  • 'size': 123456

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the file for reading in universal newlines mode.

reader()

Return the CSV reader.

Returns

A CSV reader for the CSV file.

Return type

csv.reader

class ncdiff.reader.csv.CSVVARReader(*args, **kwargs)

Bases: ncdiff.reader.csv.CSVReader

A csv reader that can cope with columns of different length.
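How CSVVARReader handles such input is not documented here. As a rough illustration only, and assuming that "columns of different length" means rows with a varying number of cells, the following sketch pads short rows to the widest row using the standard csv module. The function name read_ragged_csv and the fill value are hypothetical.

```python
import csv
import io


def read_ragged_csv(text, delimiter=";", fill=""):
    """Hypothetical helper: parse CSV text and pad short rows with `fill`."""
    rows = [
        [cell.strip() for cell in row]
        for row in csv.reader(io.StringIO(text), delimiter=delimiter)
    ]
    # Determine the widest row, then right-pad every shorter row.
    width = max((len(row) for row in rows), default=0)
    return [row + [fill] * (width - len(row)) for row in rows]


padded = read_ragged_csv("a;b;c\n1;2\n3\n")
```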

close()

Close the input file.

get_information()

Get information from a file.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns, and the size of the input file in bytes:

  • 'column_names': None

  • 'columnCount': 11

  • 'size': 123456

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the file for reading in universal newlines mode.

reader()

Return the CSV reader.

Returns

A CSV reader for the CSV file.

Return type

csv.reader

ncdiff.reader.dir module

Reader classes for DIR format to compare file system structures.

since

2012-02-10

class ncdiff.reader.dir.DirReader(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.BaseReader

Reader for recursively comparing filesystem structures.

This reader generically extracts information about filesystem objects (files, directories, links) and forwards it in the form of structured data records that can be written into CSV files. Since all the functionality for comparing CSV files is already in place, the existing features such as filters, tolerances, result filters, and data type mapping can be used to compare the filesystem data. Since the primary key of the filesystem comparison always needs to be ABSPATH + NAME, different kinds of comparison can be achieved.

close()

Empty close method.

Since the directory reader does not handle single files this function has no functionality within the DIR reader.

get_information()

General information/statistics that might have been gathered by this reader.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns, and the size of the input file in bytes.

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Empty open method.

Since the directory reader does not handle single files this function has no functionality within the DIR reader.

reader()

Gather all the information about the filesystem objects.

This method contains the main functionality of the DIR reader. It recursively loops through all the directories underneath the defined BaseDirectory. For each file, link, or directory that is found on the way down to the last directory, a corresponding FileInfo object is created and its information is finally forwarded to a CSV writer.

Returns

A list of strings representing the complete filesystem object information

Return type

list
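The recursive walk described for reader() can be sketched with os.walk from the standard library. This is an illustration under assumptions, not the DirReader implementation: the function list_tree and the chosen record fields (relative path, name, type, size) are hypothetical, and the 'Directory' and 'File' labels are borrowed from the Type helper documented in this package.

```python
import os
import tempfile


def list_tree(base_directory):
    """Hypothetical helper: one record per filesystem object below the base."""
    records = []
    for current, dirnames, filenames in os.walk(base_directory):
        dirnames.sort()  # makes the traversal order deterministic
        for name in dirnames + sorted(filenames):
            full_path = os.path.join(current, name)
            kind = "Directory" if os.path.isdir(full_path) else "File"
            size = os.path.getsize(full_path) if kind == "File" else 0
            rel_path = os.path.relpath(full_path, base_directory)
            # All cells are strings, as the reader interface expects.
            records.append([rel_path, name, kind, str(size)])
    return records


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as base:
    os.mkdir(os.path.join(base, "sub"))
    with open(os.path.join(base, "sub", "a.txt"), "w") as handle:
        handle.write("hi")
    records = list_tree(base)
```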

ncdiff.reader.fixed_width module

Reader classes for fixed width text format.

since

2012-02-10

class ncdiff.reader.fixed_width.FixedWidthReader(has_header, input_path, columns, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.FileReader

Implementation of reading text files with columns defined by the number of characters.
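Splitting a line into columns by character count can be sketched as follows. This is a minimal illustration, assuming the column definition is a plain list of widths; the actual format of the columns argument is not documented here, and split_fixed_width is a hypothetical name.

```python
def split_fixed_width(line, widths):
    """Hypothetical helper: cut `line` into cells of the given widths."""
    cells = []
    start = 0
    for width in widths:
        # Slice the next column and strip the padding around the value.
        cells.append(line[start:start + width].strip())
        start += width
    return cells


# Build a sample line with columns of 10, 12 and 3 characters.
line = "Alice".ljust(10) + "Wonderland".ljust(12) + "8".ljust(3)
record = split_fixed_width(line, [10, 12, 3])
```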

close()

Close the input file.

get_information()

Get information from a file.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns, and the size of the input file in bytes:

  • 'column_names': None

  • 'columnCount': 11

  • 'size': 123456

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the file for reading in universal newlines mode.

reader()

Iterate through the passed file stream and return a single line on every call.

Returns

iterable with column entries

Return type

array

ncdiff.reader.json module

Reader classes for json format.

since

2012-02-10

class ncdiff.reader.json.JSON2CSVReader(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.FileReader

JSON2CSVReader transforms JSON messages into a CSV file.

The goal was to find a generic implementation that would transform a JSON file into CSV in order to use the already existing (CSV) process for comparing files instead of having to write a new complex algorithm to diff structured hierarchical data.

This basic implementation is able to handle JSON where data is stored on different hierarchical levels. Arrays of JSON objects are only supported on the topmost level, but nested objects (aka dictionaries of dictionaries) are fully supported. It is a generic implementation, where the name of the element will be the name of the resulting CSV column. For nested objects the dot syntax is used to build the column name. For instance:

{"name": "Alice", "location": {"country": "Wonderland", "street": "Rabbit Hole"}, "age": "8"}

will result in the columns:

name | location.country | location.street | age
Alice | Wonderland | Rabbit Hole | 8

Since it is completely integrated into the existing NCDiff framework it fully supports the filter, tolerance, data type features of the framework.
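The dot syntax for nested objects described above can be sketched with a short recursive function. This is an illustration of the naming scheme only, not the JSON2CSVReader implementation; the function name flatten is hypothetical.

```python
import json


def flatten(obj, prefix=""):
    """Hypothetical helper: map nested dictionaries to dotted column names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested object: recurse and extend the dotted prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = str(value).strip()
    return flat


message = json.loads(
    '{"name": "Alice", "location": {"country": "Wonderland", '
    '"street": "Rabbit Hole"}, "age": "8"}'
)
row = flatten(message)
columns = list(row)
```

Here columns holds the dotted names in document order, and row maps each name to its value.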

close()

Close the input file.

get_information()

Get generic information about the given file, such as file size, number of columns, and column names.

Returns

A dictionary containing the column names and the number of columns

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the file and determine the column names.

reader()

Get the flat/table structured data of the JSON file.

ncdiff.reader.sql module

Reader classes for SQL databases.

since

2012-02-10

class ncdiff.reader.sql.SQLReader(has_header, connection_string, database_driver, query_string, fetch_size=500, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.BaseReader

Read SQL tables.

close()

Close the database connection.

get_information()

Get information from data.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names and the number of columns:

  • 'column_names': None

  • 'columnCount': 11

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the database connection.

reader()

Return a SQL reader.

Returns

A CSV reader for the CSV file.

Return type

csv.reader
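The fetch_size constructor argument suggests that rows are fetched from the database in batches. A minimal sketch of batched fetching, assuming a DB-API 2.0 cursor and using sqlite3 purely as a stand-in driver, could look like this. The function iter_rows and the conversion of NULL to an empty string are assumptions, not documented SQLReader behaviour.

```python
import sqlite3


def iter_rows(connection, query, fetch_size=500):
    """Hypothetical helper: yield result rows as lists of stripped strings."""
    cursor = connection.execute(query)
    while True:
        batch = cursor.fetchmany(fetch_size)  # at most fetch_size rows
        if not batch:
            break
        for row in batch:
            # Assumption: NULL values become empty strings.
            yield ["" if cell is None else str(cell).strip() for cell in row]


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE people (name TEXT, age INTEGER)")
connection.executemany(
    "INSERT INTO people VALUES (?, ?)", [(" Alice ", 8), ("Bob", None)]
)
rows = list(
    iter_rows(connection, "SELECT name, age FROM people ORDER BY name", fetch_size=1)
)
connection.close()
```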

ncdiff.reader.swift module

Reader classes for swift messages.

since

2012-02-10

class ncdiff.reader.swift.SWIFTReader(has_header, input_path, quotechar='"', doublequote=True, delimiter=';', replace_delimiters=None, escapechar=None, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.FileReader

Read swift files.

close()

Close the input file.

get_information()

Get information from a file.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns, and the size of the input file in bytes:

  • 'column_names': None

  • 'columnCount': 11

  • 'size': 123456

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the file for reading in universal newlines mode.

reader()

Return a SWIFT reader.

Returns

A SWIFT reader for the SWIFT file.

Return type

swift.reader

ncdiff.reader.tar module

Reader classes for TAR format to compare file system structures in *.tar files.

since

2012-02-10

class ncdiff.reader.tar.TARReader(has_header, input_path, compression_type, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.FileReader

The TAR file reader.

The main purpose of this class is to read a TAR file and write an overview of its content into a CSV file that can be compared to a similar TAR content file. The background of this reader is that many "packages" are either TAR or ZIP containers that hold lots of other files. To get an overview of the content of such a container, a corresponding reader had to be written. There is currently support for 3 different kinds of TAR files:

  • plain/uncompressed: tar -> use TAR as <new|old>FileFormat

  • GZIP compressed tar: tar.gz -> use TAR.GZ as <new|old>FileFormat

  • BZIP2 compressed tar: tar.bz2 -> use TAR.BZ2 as <new|old>FileFormat

class TarMemberInfo(tar_info_object)

Bases: ncdiff.utils.FileUtils.FileInfo

TarMemberInfo is a specialization of the FileUtils.FileInfo object.

The main difference here is that the information about the files/directories are not directly retrieved from the file system, they are read out of the tar container itself.

class Type

Bases: object

Little helper class for distinguishing different filesystem object types.

DIRECTORY = 'Directory'

FILE = 'File'

get_abs_path()

Get the absolute (full) path of the filesystem object within the filesystem.

Returns

Absolute path of the FileInfo object

Return type

str

get_access_time(dt_format='%Y-%m-%d %H:%M:%S')

Get the timestamp when the filesystem object was accessed the last time.

Parameters

dt_format (str) – The optional timestamp format of the resulting string

Returns

A string representing the filesystem object's last access time.

Return type

str
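The default dt_format shown in the signature is a strftime pattern. As a small illustration of how such a pattern turns a timestamp into a string, using only the standard datetime module (format_timestamp is a hypothetical name, not an ncdiff function):

```python
from datetime import datetime

DEFAULT_FORMAT = "%Y-%m-%d %H:%M:%S"  # same pattern as the documented default


def format_timestamp(moment, dt_format=DEFAULT_FORMAT):
    """Hypothetical helper: render a datetime with a strftime pattern."""
    return moment.strftime(dt_format)


stamp = format_timestamp(datetime(2012, 2, 10, 14, 30, 5))
date_only = format_timestamp(datetime(2012, 2, 10, 14, 30, 5), "%Y-%m-%d")
```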

get_creation_time(dt_format='%Y-%m-%d %H:%M:%S')

Get the timestamp when the filesystem object was created.

Parameters

dt_format (str) – The optional timestamp format of the resulting string

Returns

A string representing the filesystem object's creation time

Return type

str

get_depth()

Get the depth of the filesystem object relative to the BaseDirectory.

Returns

A number representing the depth level relative to the root/BaseDirectory

Return type

int

get_extension()

Get the file's extension.

All the characters to the right of the very last dot '.' within the file's base name.

Returns

The file's extension

Return type

str

get_group()

Get the ID of the group that owns the file system object.

Returns

ID of the group that owns the file

Return type

int

get_hash_value()

Get the hash digest representing the hash value of the filesystem object.

Returns

A string representing the hash digest

Return type

str

get_mode()

Get the file system permission bits.

Returns

File system permission info.

Return type

int

get_modification_time(dt_format='%Y-%m-%d %H:%M:%S')

Get the timestamp when the filesystem object was modified the last time.

Parameters

dt_format (str) – The optional timestamp format of the resulting string.

Returns

A string representing the filesystem object's last modification time.

Return type

str

get_name()

Return the basename of the filesystem object.

Returns

Basename of the FileInfo object

Return type

str

get_permissions()

Get a string representing the file permissions in Unix format, for example "rwxr-x--x".

Returns

A string that represents the file permissions in POSIX format.

Return type

str
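For illustration, the standard library can render permission bits in this style. The sketch below is not the ncdiff implementation; it assumes the mode is an integer as returned by get_mode() and uses stat.filemode, dropping the leading file-type character. The name permission_string is hypothetical.

```python
import stat


def permission_string(mode):
    """Hypothetical helper: render permission bits as nine rwx characters."""
    # stat.filemode() returns ten characters; the first one encodes the
    # file type, the remaining nine encode user, group and other permissions.
    return stat.filemode(mode)[1:]


text = permission_string(0o751)
read_only = permission_string(0o444)
```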

get_rel_path()

Get the path of the filesystem object relative to the given BaseDirectory.

Returns

Relative path of the FileInfo object

Return type

str

get_size()

Get the size of the file.

Returns

Size of the file

Return type

int

get_type()

Get the file system object type this object represents.

Returns

The type of filesystem object

Return type

FileInfo.Type

get_user()

Get the id of the user owning the file system object.

Returns

ID of the user that owns the file.

Return type

int

is_directory()

Check if this FileInfo is a directory.

Returns

True if the FileInfo represents a directory, False in all other cases.

Return type

bool

is_file()

Check if this FileInfo is a file.

Returns

True if the FileInfo represents a file, False in all other cases.

Return type

bool

Check if this FileInfo is a link.

Returns

True if the FileInfo represents a link, False in all other cases.

Return type

bool

set_access_time(access_time)

Set the last accessed time of the file system object.

Parameters

access_time (datetime) – The last access timestamp of the filesystem object

set_creation_time(creation_time)

Set the creation time of the file system object.

Parameters

creation_time (datetime) – The creation timestamp of the filesystem object

set_depth(depth)

Set the depth of the filesystem object relative to the BaseDirectory specified in the constructor.

Parameters

depth (int) – The depth/level of the filesystem objects relative to the root/BaseDirectory

set_file_system_object(tar_info_object, base_directory=None)

Setter for the TarInfo file descriptor.

Parameters

tar_info_object (tarfile.TarInfo) – Kind of file descriptor object from a TarFile.

set_group(group_id)

Set the id of the group the owner belongs to.

Parameters

group_id (int) – The group id within the filesystem.

set_hash_value(hash_value)

Set the hash digest that represents this FileInfo object.

Parameters

hash_value (str) – A string preferably extracted with FileUtils.calc_hash

set_mode(mode)

Set the file system permission bits of the current filesystem object.

Parameters

mode (int) – File system permission

set_modification_time(modification_time)

Set the last modification time of the file system object.

Parameters

modification_time (datetime) – The last modification timestamp of the filesystem object.

set_name(file_path, base_directory=None)

Set the name of the TarInfo from a file path.

Parameters
  • file_path (str) – location/path within the tarfile

  • base_directory (str) – optional base path; this has to be stripped from the absolute path to get the relative one.

set_size(size)

Set the size of the file.

Parameters

size (int) – Size of the filesystem object

set_type(type_)

Set the type of the file system object this FileInfo object represents. File, Directory or Link.

Parameters

type_ (FileInfo.Type) – Type of the FileInfo object

set_user(user_id)

Set the id of the user that owns the filesystem object.

Parameters

user_id (int) – The user's ID within the filesystem.

close()

Close the input file.

get_information()

Get information from a file.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns, and the size of the input file in bytes:

  • 'column_names': None

  • 'columnCount': 11

  • 'size': 123456

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

TAR file open/read functionality.

This method has some fallbacks in case the actual compression format does not match the specified one. The fallback sequence is:

  • [1] generic transparent compression

  • [2] GZIP compression

  • [3] 'explicit' NO compression

  • [4] BZIP2 compression

If the file cannot be opened, an error is raised.

Raises

IOError – either in case the file is already being read by another process or in case none of the compression formats match the file.
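The fallback sequence described for open() can be sketched with the standard tarfile module, whose mode strings "r:*", "r:gz", "r:" and "r:bz2" correspond to transparent, GZIP, uncompressed and BZIP2 reading. This is an illustration under assumptions, not the TARReader code: open_tar_with_fallback is a hypothetical helper, and which exceptions the real reader catches is not documented here.

```python
import os
import tarfile
import tempfile

# Modes tried in the documented order: transparent, gzip, plain, bzip2.
FALLBACK_MODES = ("r:*", "r:gz", "r:", "r:bz2")


def open_tar_with_fallback(path):
    """Hypothetical helper: try each mode until one of them succeeds."""
    for mode in FALLBACK_MODES:
        try:
            return tarfile.open(path, mode)
        except tarfile.TarError:
            continue  # wrong format for this mode, try the next one
    raise IOError(f"{path!r} matches none of the supported tar formats")


# Demonstration: create a small gzip-compressed archive and reopen it.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "a.txt")
    with open(source, "w") as handle:
        handle.write("hi")
    archive = os.path.join(workdir, "demo.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname="a.txt")
    with open_tar_with_fallback(archive) as tar:
        names = tar.getnames()
```

Note that "r:*" already detects the compression transparently, so in this sketch the later modes only matter if the first attempt fails.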

reader()

Gather all the information about the tar files objects.

This method contains the main functionality of the TAR reader. It recursively loops through all the directories underneath the root. For each file, link, or directory that is found on the way down to the last directory, a corresponding TarFileInfo object is created and its information is finally forwarded to a CSV writer.

Returns

A list of strings representing the complete filesystem object information

Return type

list
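Gathering one record per archive member can be sketched with tarfile.TarFile.getmembers() from the standard library. The sketch is only an illustration: tar_records and the chosen fields (name, type, size) are hypothetical, and the archive is built in memory so that the example is self-contained.

```python
import io
import tarfile


def tar_records(tar):
    """Hypothetical helper: one list of strings per member of an open tar."""
    records = []
    for member in tar.getmembers():
        kind = "Directory" if member.isdir() else "File"
        records.append([member.name, kind, str(member.size)])
    return records


# Build a tiny uncompressed archive in memory.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    payload = b"hello"
    info = tarfile.TarInfo("docs/readme.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tar:
    records = tar_records(tar)
```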

ncdiff.reader.xls module

Reader classes for XLS and XLSX format.

since

2012-02-10

class ncdiff.reader.xls.XLSReader(has_header, input_path, worksheet_name=None, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.BaseReader

Implementation of reading the XLS files.

close()

Free the worksheet.

get_information()

Get XLS information.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns, and the size of the input file in bytes:

  • 'column_names': None

  • 'columnCount': 11

  • 'size': 123456

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the xls file for reading.

reader()

Iterate through the passed worksheet and return a single line on every call.

Returns

iterable with column entries

Return type

array

ncdiff.reader.xml module

Reader classes for XML format.

since

2012-02-10

class ncdiff.reader.xml.XML2CSVReader(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.FileReader

XML2CSVReader is a kind of mini XSLT transformation of an XML file into a CSV file.

The goal was to find a generic implementation that would transform XML into CSV in order to use the already existing (CSV) process for comparing files instead of having to write a new complex algorithm to diff structured hierarchical data.

This basic implementation is able to handle XML files where data is stored on different hierarchical levels. It is a generic implementation, where the name of the element/tag will be the name of the resulting CSV column.

Since it is completely integrated into the existing NCDiff framework it fully supports the filter, tolerance, data type features of the framework.
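The idea that element names become column names can be illustrated with xml.etree.ElementTree from the standard library. This is a deliberately simplified sketch, not the XML2CSVReader algorithm: it assumes that every child of the root element is one record and that leaf tag names are unique within a record. The function leaf_values is hypothetical.

```python
import xml.etree.ElementTree as ET


def leaf_values(element):
    """Hypothetical helper: map each leaf tag below `element` to its text."""
    row = {}
    for node in element.iter():
        # A leaf is an element without child elements.
        if node is not element and len(node) == 0:
            row[node.tag] = (node.text or "").strip()
    return row


root = ET.fromstring(
    "<people>"
    "<person><name> Alice </name>"
    "<address><country>Wonderland</country></address></person>"
    "<person><name>Bob</name></person>"
    "</people>"
)
rows = [leaf_values(person) for person in root]
```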

class AttributeColumn(matrix, key, level, index=0)

Bases: ncdiff.reader.xml.XML2CSVReader.ElementColumn

Represents an XML attribute column.

A further specialization on top of the ElementColumn is the AttributeColumn that would represent a single XML attribute of an XML element. In addition to all the information that is available on the element/tag level, there is also the index of the attribute within the element.

This feature is currently not fully supported/implemented.

add_vector(vector)

Add a vector to the axis (row/column).

Parameters

vector (Vector) – A vector that should be added to the list

Returns

True in case the vector was added to the axis, False (default) in case the given vector was not a valid object.

Return type

bool

cleanup()

Functionality that should release all the resources currently connected to the row.

This is basically a functionality to avoid unnecessary memory consumption

get_index()

Get the index of the underlying XML attribute within the XML element.

Returns

The attribute index within the XML element

Return type

int

get_key()

Get the unique id/name of the column.

Returns

The column's ID

Return type

str

get_level()

Get the hierarchical level of the underlying XML element within the XML file.

Returns

The hierarchical level of the underlying XML element

Return type

int

get_matrix()

Return the matrix of the axis.

Returns

Matrix of the Axis

Return type

XML2CSVReader.Matrix

has_values()

Indicate if any of the given vectors of the axis contains valid (size>0) strings.

Returns

True in case at least a single vector contains valid data, False (default).

Return type

bool

remove_vector(vector)

Remove a given vector from the list of vectors (in case it is found).

Parameters

vector (Vector) – Instance of the vector that should be removed from this axis

Returns

True in case the vector was removed, False (default) in case the vector was not found/removed.

Return type

bool

class Axis(matrix, key)

Bases: object

Base class of all the rows and columns that belong to the mapping matrix.

Its main purpose is to provide the ID/Key of a given axis (Column/Row) and a list of all the items/vectors that belong to this axis.

add_vector(vector)

Add a vector to the axis (row/column).

Parameters

vector (Vector) – A vector that should be added to the list

Returns

True in case the vector was added to the axis, False (default) in case the given vector was not a valid object.

Return type

bool

get_key()

Return the unique key/id of the axis (column/row).

Returns

A string representing the unique key/ID of the axis

Return type

str

get_matrix()

Return the matrix of the axis.

Returns

Matrix of the Axis

Return type

XML2CSVReader.Matrix

has_values()

Indicate if any of the given vectors of the axis contains valid (size>0) strings.

Returns

True in case at least a single vector contains valid data, False (default).

Return type

bool

remove_vector(vector)

Remove a given vector from the list of vectors (in case it is found).

Parameters

vector (Vector) – Instance of the vector that should be removed from this axis

Returns

True in case the vector was removed, False (default) in case the vector was not found/removed.

Return type

bool

class CellVector(row, column, value, line_number=-1)

Bases: ncdiff.reader.xml.XML2CSVReader.Vector

Represent the data that belongs to a single coordinate (combination of row and column) within the matrix.

add_line_number(line_number)

Add a line number the current vector originates from.

Parameters

line_number (int) – Line number from within the XML file

add_line_numbers(line_numbers)

Add all the line numbers the current vector originates from.

Parameters

line_numbers (int) – Line numbers from within the XML file

cleanup()

Cleanup, release all existing references, so that the garbage collector can pick up this object.

get_column()

Get the column this vector belongs to.

Returns

The vector's column

Return type

Column

get_context()

Get the context of the vector.

Returns

The column this vector belongs to

Return type

Column

get_data()

Get the data that represents the matrix cell's value.

Returns

The value of the vector

Return type

str

get_line_numbers()

Get all the line numbers the current vector originates from.

Returns

The line numbers from the originating XML file

Return type

list of int

get_max_line_number()

Get the highest line number the current vector originates from.

Returns

The line number from the originating XML file

Return type

int

get_min_line_number()

Get the lowest line number the current vector originates from.

Returns

The line number from the originating XML file

Return type

int

get_row()

Get the row this vector belongs to.

Returns

The vector's row

Return type

Row

class Column(matrix, key)

Bases: ncdiff.reader.xml.XML2CSVReader.Axis

Columns from an XML.

Class that represents, on the one hand, the collection of all the tags from the XML file and, on the other hand, the list of columns that will define the structure of the CSV file. Since it forms the horizontal axis of the matrix, it is derived from the Axis class.

add_vector(vector)

Add a vector to the axis (row/column).

Parameters

vector (Vector) – A vector that should be added to the list

Returns

True in case the vector was added to the axis, False (default) in case the given vector was not a valid object.

Return type

bool

cleanup()

Functionality that should release all the resources currently connected to the row.

This is basically a functionality to avoid unnecessary memory consumption

get_key()

Return the unique key/id of the axis (column/row).

Returns

A string representing the unique key/ID of the axis

Return type

str

get_matrix()

Return the matrix of the axis.

Returns

Matrix of the Axis

Return type

XML2CSVReader.Matrix

has_values()

Indicate if any of the given vectors of the axis contains valid (size>0) strings.

Returns

True in case at least a single vector contains valid data, False (default).

Return type

bool

remove_vector(vector)

Remove a given vector from the list of vectors (in case it is found).

Parameters

vector (Vector) – Instance of the vector that should be removed from this axis

Returns

True in case the vector was removed, False (default) in case the vector was not found/removed.

Return type

bool

class ElementColumn(matrix, key, level)

Bases: ncdiff.reader.xml.XML2CSVReader.Column

Represents an XML element column.

A specialization of the basic Column is the ElementColumn. It also contains information (level) about its hierarchical depth. This is especially important because XML allows the reuse of tags with the same name throughout the whole structure.

add_vector(vector)

Add a vector to the axis (row/column).

Parameters

vector (Vector) – A vector that should be added to the list

Returns

True in case the vector was added to the axis, False (default) in case the given vector was not a valid object.

Return type

bool

cleanup()

Functionality that should release all the resources currently connected to the row.

This is basically a functionality to avoid unnecessary memory consumption

get_key()

Get the unique id/name of the column.

Returns

The column's ID

Return type

str

get_level()

Get the hierarchical level of the underlying XML element within the XML file.

Returns

The hierarchical level of the underlying XML element

Return type

int

get_matrix()

Return the matrix of the axis.

Returns

Matrix of the Axis

Return type

XML2CSVReader.Matrix

has_values()

Indicate if any of the given vectors of the axis contains valid (size>0) strings.

Returns

True in case at least a single vector contains valid data, False (default).

Return type

bool

remove_vector(vector)

Remove a given vector from the list of vectors (in case it is found).

Parameters

vector (Vector) – Instance of the vector that should be removed from this axis

Returns

True in case the vector was removed, False (default) in case the vector was not found/removed.

Return type

bool

class Matrix(columns=None)

Bases: object

Basic container that holds 2 Axis objects.

The horizontal axis forms all the columns of the CSV file, whereas the vertical axis forms all the rows of the CSV file.

The object itself acts as a pure data storage. When parsing the XML, all the information about a given element (or attribute in the future) is fed into the matrix. The matrix itself knows how to map the given information into its internal structure to finally transform the hierarchical data of the XML into the table-based data structure of the CSV.

A further feature of the matrix is that it is designed for lazy loading/writing. This means that, to save memory, it will not contain all the rows of the XML/CSV. Whenever a row is fully populated it can be removed from the matrix. Therefore the matrix contains a list of finished rows that can be extracted from outside. This is by design, since on the input side the matrix gets line-by-line data from an XML parser and on the output side there is a file writer that creates a CSV.

add_attributes(tag, attributes, level, xml_file_line_number)

Map a list of XML attributes into the CSV structure.

Parameters
  • tag (str) – The XML element that the attributes belong to

  • attributes (str) – All the XML element attributes, key value pairs

  • level (int) – The hierarchical level of the XML element

  • xml_file_line_number (int) – The originating XML file line number of the XML element

add_data(tag, data, level, xml_file_line_number, index=0)

Map the value of an XML element into the CSV structure.

Parameter
  • tag (str) – The XML element that the value belongs to

  • data (str) – The XML element's value

  • level (int) – The hierarchical level of the XML element

  • xml_file_line_number (int) – The line number of the XML element in the originating XML file

  • index (int) –

cleanup()

Release all the resources currently connected to the row.

This mainly serves to avoid unnecessary memory consumption.

get_column_max_level()

Return the maximum column level.

get_columns()

Get all columns belonging to the matrix.

Returns

All columns of the matrix

Return type

list of Column

get_new_row()

Create a new Row object.

Returns

A new row object

Return type

Row

get_rows()

Get all rows belonging to the matrix.

Returns

All rows of the matrix

Return type

list of Row

class Row(matrix, key)

Bases: ncdiff.reader.xml.XML2CSVReader.Axis

Class that contains all the items/vectors of a single CSV line.

It is derived from the Axis class because it represents the vertical axis of the matrix. Since we are trying to map hierarchical data, there can be siblings of the current row that logically belong together.

Those siblings can share RowVectors, each of which shares a single value across multiple siblings. Furthermore, a row records the line numbers within the XML file from which its data originates. This information is primarily used when a difference is found, to give the user a hint where to look within the XML file.

add_sibling(sibling)

Add a row that logically is related to the current row.

Parameter

sibling (Row) – The row that is the sibling of the current row

add_vector(vector)

Add a vector to the axis (row/column).

Parameter

vector (Vector) – A vector that should be added to the list

Returns

True if the vector was added to the axis, False (default) if the given vector was not a valid object.

Return type

bool

cleanup()

Release all the resources currently connected to the row.

This mainly serves to avoid unnecessary memory consumption.

clone(level, xml_file_line_number)

Clone the current row.

Parameter
  • level

  • xml_file_line_number

Returns

A cloned row object

Return type

Row

get_key()

Return the unique key/id of the axis (column/row).

Returns

A string representing the unique key/ID of the axis

Return type

str

get_matrix()

Return the matrix of the axis.

Returns

The matrix of the axis

Return type

XML2CSVReader.Matrix

get_reference_lines()

Return a string representing the starting and closing XML lines from which the current row originates.

Returns

A string of the form [<start>:<stop>] that represents the starting and closing XML element

Return type

str

get_vector(column)

Return the vector that is assigned to the given column (within the current row).

Parameter

column (Column) – The column whose vector is requested

Returns

The vector that represents the given column, or None if the column could not be found

Return type

bool

has_values()

Indicate whether any vector of the axis contains a valid (non-empty) string.

Returns

True if at least one vector contains valid data, False otherwise (default).

Return type

bool

is_completly_handled()

Indicate whether every column within the current row has been handled (set).

Returns

True if all columns have been set, False if columns are still missing.

Return type

bool

remove_vector(vector)

Remove the given vector from the list of vectors, if it is found.

Parameter

vector (Vector) – Instance of the vector that should be removed from this axis

Returns

True if the vector was removed, False (default) if the vector was not found or not removed.

Return type

bool

set_column_handled(column, state=True)

Set the 'handled' status of the given column.

Parameter
  • column (Column) – The column whose status has to be updated

  • state (bool) – Flag indicating whether the column has already been handled (True: yes, False: no)

set_data(column, data, level, xml_file_line_number, index)

Map a hierarchical XML value/string into the CSV structure.

Parameter
  • column (Column) – The column the data belongs to

  • data (str) – The data that belongs to the column

  • level (int) – The hierarchical level of the column within the XML

  • xml_file_line_number (int) – The line number within the XML the data originated from

  • index (int) – The attribute index in case the value is from an attribute

was_column_already_handled(column, data=None)

Indicate whether the given column (within the current row) was already handled.

Was the value of that column already set? If so, this indicates that the row has to be cloned.

Parameter
  • column (str) – The name/ID of the column that has to be checked

  • data (str) – The optional value of the cell that belongs to that row/column

Returns

True if the column was already handled and the data matches, False (default) if the column was not handled.

Return type

bool

class RowVector(row, column, value, line_number=-1)

Bases: ncdiff.reader.xml.XML2CSVReader.Vector

A row vector represents data that is shared by multiple rows for a single column/tag/element.

add_line_number(line_number)

Add a line number the current vector originates from.

Parameter

line_number (int) – Line number from within the XML file

add_line_numbers(line_numbers)

Add all the line numbers the current vector originates from.

Parameter

line_numbers (int) – Line numbers from within the XML file

add_row(row)

Add a row to the vector's list of rows.

Parameter

row (Row) – A row that should be added

cleanup()

Clean up by releasing all existing references so that the garbage collector can pick up this object.

get_column()

Get the column this vector belongs to.

Returns

The vector's column

Return type

Column

get_context()

Get the context of the vector.

Returns

The column this vector belongs to

Return type

Column

get_data()

Get the data that represents the matrix cell's value.

Returns

The value of the vector

Return type

str

get_line_numbers()

Get all the line numbers the current vector originates from.

Returns

The line numbers from the originating XML file

Return type

list of int

get_max_line_number()

Get the highest line number the current vector originates from.

Returns

The line number from the originating XML file

Return type

int

get_min_line_number()

Get the lowest line number the current vector originates from.

Returns

The line number from the originating XML file

Return type

int

get_rows()

Get the rows this vector belongs to.

Returns

The vector's rows

Return type

list of Row

class Vector(value, line_number)

Bases: object

Base class representing a vector within a matrix.

In its simplest form it represents the data of a single coordinate within the matrix. A further feature of the vector is that it contains information about the origin of the data it represents from within the XML file (line number).
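A minimal sketch of this idea follows. `SimpleVector` is a hypothetical, simplified class that only mirrors the method names listed below; the internal list of line numbers and the `[<start>:<stop>]` formatting are assumptions, not the actual ncdiff code.

```python
class SimpleVector:
    """Hypothetical vector: one cell value plus the XML lines it came from."""

    def __init__(self, value, line_number):
        self._value = value
        self._line_numbers = [line_number]

    def add_line_number(self, line_number):
        self._line_numbers.append(line_number)

    def get_data(self):
        return self._value

    def get_min_line_number(self):
        return min(self._line_numbers)

    def get_max_line_number(self):
        return max(self._line_numbers)


vector = SimpleVector("Wonderland", 12)
vector.add_line_number(19)
# Assumed formatting of a start/stop reference built from the stored line numbers.
span = "[{}:{}]".format(vector.get_min_line_number(), vector.get_max_line_number())
```

Keeping the line numbers next to the value is what makes it possible to point the user at the right place in the XML file when a difference is reported.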

add_line_number(line_number)

Add a line number the current vector originates from.

Parameter

line_number (int) – Line number from within the XML file

add_line_numbers(line_numbers)

Add all the line numbers the current vector originates from.

Parameter

line_numbers (int) – Line numbers from within the XML file

get_data()

Get the data that represents the matrix cell's value.

Returns

The value of the vector

Return type

str

get_line_numbers()

Get all the line numbers the current vector originates from.

Returns

The line numbers from the originating XML file

Return type

list of int

get_max_line_number()

Get the highest line number the current vector originates from.

Returns

The line number from the originating XML file

Return type

int

get_min_line_number()

Get the lowest line number the current vector originates from.

Returns

The line number from the originating XML file

Return type

int

class Xml2CsvAnalyser

Bases: xml.sax.handler.ContentHandler

Analyse an XML structure before the CSV translation.

The Xml2CsvAnalyser is a minimal version of the XML2CSV translation that is run up front to analyse the data/element/tag structure of a given XML file. This is needed because of the lazy-writing feature of the matrix: since certain elements/columns may well occur only at the very end of an XML file, their existence has to be known before the CSV file is written.

The analyser therefore makes a quick pass through the XML to collect the list of all existing elements. It does not consider any data, but since it knows all the elements, the columns of the XML2CSV matrix can be initialised before the matrix is fed with the actual data when the XML is parsed again by the translator.
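The first pass can be sketched with the standard library SAX API. `ElementNameCollector` is a hypothetical handler written for illustration; how the real analyser names columns (for example for nested elements or attributes) is not described here, so plain tag names in first-seen order are an assumption.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ElementNameCollector(ContentHandler):
    """Hypothetical first-pass handler: records element names, ignores data."""

    def __init__(self):
        super().__init__()
        self.column_names = []

    def startElement(self, tag, attributes):
        # Keep the first-seen order and skip names that are already known.
        if tag not in self.column_names:
            self.column_names.append(tag)


document = (
    b"<people>"
    b"<person><name>Alice</name></person>"
    b"<person><name>Bob</name><age>8</age></person>"
    b"</people>"
)
collector = ElementNameCollector()
xml.sax.parseString(document, collector)
column_names = collector.column_names
```

The `age` element only appears in the second record, which is exactly the case that makes a separate analysis pass necessary before any row is written.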

characters(content)

Receive notification of character data.

The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.

endDocument()

Receive notification of the end of a document.

The SAX parser will invoke this method only once, and it will be the last method invoked during the parse. The parser shall not invoke this method until it has either abandoned parsing (because of an unrecoverable error) or reached the end of input.

endElement(tag)

SAX 2 parser interface method that is called for every closing XML element found during XML parsing.

Parameter

tag (str) – The name of the XML element

endElementNS(name, qname)

Signals the end of an element in namespace mode.

The name parameter contains the name of the element type, just as with the startElementNS event.

endPrefixMapping(prefix)

End the scope of a prefix-URI mapping.

See startPrefixMapping for details. This event will always occur after the corresponding endElement event, but the order of endPrefixMapping events is not otherwise guaranteed.

get_column_names()

Return a list of all the columns found during analysis.

Returns

The list of found columns

Return type

list of Column

ignorableWhitespace(whitespace)

Receive notification of ignorable whitespace in element content.

Validating Parsers must use this method to report each chunk of ignorable whitespace (see the W3C XML 1.0 recommendation, section 2.10): non-validating parsers may also use this method if they are capable of parsing and using content models.

SAX parsers may return all contiguous whitespace in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.

processingInstruction(target, data)

Receive notification of a processing instruction.

The Parser will invoke this method once for each processing instruction found: note that processing instructions may occur before or after the main document element.

A SAX parser should never report an XML declaration (XML 1.0, section 2.8) or a text declaration (XML 1.0, section 4.3.1) using this method.

setDocumentLocator(locator)

Called by the parser to give the application a locator for locating the origin of document events.

SAX parsers are strongly encouraged (though not absolutely required) to supply a locator: if it does so, it must supply the locator to the application by invoking this method before invoking any of the other methods in the DocumentHandler interface.

The locator allows the application to determine the end position of any document-related event, even if the parser is not reporting an error. Typically, the application will use this information for reporting its own errors (such as character content that does not match an application’s business rules). The information returned by the locator is probably not sufficient for use with a search engine.

Note that the locator will return correct information only during the invocation of the events in this interface. The application should not attempt to use it at any other time.

skippedEntity(name)

Receive notification of a skipped entity.

The Parser will invoke this method once for each entity skipped. Non-validating processors may skip entities if they have not seen the declarations (because, for example, the entity was declared in an external DTD subset). All processors may skip external entities, depending on the values of the http://xml.org/sax/features/external-general-entities and the http://xml.org/sax/features/external-parameter-entities properties.

startDocument()

Receive notification of the beginning of a document.

The SAX parser will invoke this method only once, before any other methods in this interface or in DTDHandler (except for setDocumentLocator).

startElement(tag, attributes)

SAX 2 parser interface method that is called for every starting XML element found during parsing.

Parameter
  • tag (str) – The name of the XML element

  • attributes (dict) – Key-value pairs representing all the XML attributes of the XML element

startElementNS(name, qname, attrs)

Signals the start of an element in namespace mode.

The name parameter contains the name of the element type as a (uri, localname) tuple, the qname parameter the raw XML 1.0 name used in the source document, and the attrs parameter holds an instance of the Attributes class containing the attributes of the element.

The uri part of the name tuple is None for elements which have no namespace.

startPrefixMapping(prefix, uri)

Begin the scope of a prefix-URI Namespace mapping.

The information from this event is not necessary for normal Namespace processing: the SAX XML reader will automatically replace prefixes for element and attribute names when the http://xml.org/sax/features/namespaces feature is true (the default).

There are cases, however, when applications need to use prefixes in character data or in attribute values, where they cannot safely be expanded automatically; the start/endPrefixMapping event supplies the information to the application to expand prefixes in those contexts itself, if necessary.

Note that start/endPrefixMapping events are not guaranteed to be properly nested relative to each-other: all startPrefixMapping events will occur before the corresponding startElement event, and all endPrefixMapping events will occur after the corresponding endElement event, but their order is not guaranteed.

class Xml2CsvTranslator(reader, xml_file_name, columns, csv_file_name=None)

Bases: xml.sax.handler.ContentHandler

The XML2CSV translator is responsible for feeding all the XML/element/tag data into the mapping matrix.

The data itself is enriched with information about its location within the XML structure (line number, hierarchical level, element it belongs to, …). The translator is built around the event-based XML SAX parser framework, which makes it possible to parse huge XML files without consuming much memory. The mapping matrix follows the same low-memory approach: by design, finished CSV rows can already be retrieved while data is still being read from the XML file.

CLMN_XML_LINENUMBERS = 'TOIGNORE:XmlFileLines'
TRANSLATION = {9: None, 10: None}
characters(data)

SAX 2 parser interface method that is called for every value belonging to an XML element.

Note: This method might be called multiple times for a single XML element value! Therefore all the data has to be concatenated.

Parameter

data (str) – String representing either a part or the whole value of the currently handled XML element.
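The need to concatenate can be shown with a small handler based on the standard library SAX API. `TextCollector` is a hypothetical class for illustration only; it buffers every chunk passed to `characters` and joins the chunks when the element ends.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler that joins the chunks of one element value."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self.values = []

    def startElement(self, tag, attributes):
        self._chunks = []

    def characters(self, data):
        # May be called several times for one value, so only buffer here.
        self._chunks.append(data)

    def endElement(self, tag):
        # The value is complete only once the closing tag is reached.
        self.values.append((tag, "".join(self._chunks)))
        self._chunks = []


handler = TextCollector()
xml.sax.parseString(b"<name>Alice &amp; Bob</name>", handler)
values = handler.values
```

An entity reference such as `&amp;` is a typical point at which a parser may split the value into several chunks, so treating a single `characters` call as the whole value would lose data.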

close()

SAX 2 parser interface method that is called when the end of the XML document is reached.

endDocument()

Receive notification of the end of a document.

The SAX parser will invoke this method only once, and it will be the last method invoked during the parse. The parser shall not invoke this method until it has either abandoned parsing (because of an unrecoverable error) or reached the end of input.

endElement(tag)

SAX 2 parser interface method that is called for every closing XML element found during parsing.

This method triggers the XML2CSV transformation for the value and attributes of this XML element.

Parameter

tag (str) – The name of the XML element

endElementNS(name, qname)

Signals the end of an element in namespace mode.

The name parameter contains the name of the element type, just as with the startElementNS event.

endPrefixMapping(prefix)

End the scope of a prefix-URI mapping.

See startPrefixMapping for details. This event will always occur after the corresponding endElement event, but the order of endPrefixMapping events is not otherwise guaranteed.

get_buffered_row()

Retrieve the buffered rows that have already been transformed into a CSV-ready structure.

Returns

A list of strings representing the transformed XML data

Return type

list of str

has_buffered_rows()

Show whether the algorithm has already produced new rows that can be forwarded to the CSV file writer.

Returns

Number of rows that are ready for processing

Return type

bool

ignorableWhitespace(whitespace)

Receive notification of ignorable whitespace in element content.

Validating Parsers must use this method to report each chunk of ignorable whitespace (see the W3C XML 1.0 recommendation, section 2.10): non-validating parsers may also use this method if they are capable of parsing and using content models.

SAX parsers may return all contiguous whitespace in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.

processingInstruction(target, data)

Receive notification of a processing instruction.

The Parser will invoke this method once for each processing instruction found: note that processing instructions may occur before or after the main document element.

A SAX parser should never report an XML declaration (XML 1.0, section 2.8) or a text declaration (XML 1.0, section 4.3.1) using this method.

setDocumentLocator(locator)

Called by the parser to give the application a locator for locating the origin of document events.

SAX parsers are strongly encouraged (though not absolutely required) to supply a locator: if it does so, it must supply the locator to the application by invoking this method before invoking any of the other methods in the DocumentHandler interface.

The locator allows the application to determine the end position of any document-related event, even if the parser is not reporting an error. Typically, the application will use this information for reporting its own errors (such as character content that does not match an application’s business rules). The information returned by the locator is probably not sufficient for use with a search engine.

Note that the locator will return correct information only during the invocation of the events in this interface. The application should not attempt to use it at any other time.

set_input_line_number(line_number)

Set the line number, within the XML file being read, of the currently handled XML element.

Parameter

line_number (int) – The XML file line number

skippedEntity(name)

Receive notification of a skipped entity.

The Parser will invoke this method once for each entity skipped. Non-validating processors may skip entities if they have not seen the declarations (because, for example, the entity was declared in an external DTD subset). All processors may skip external entities, depending on the values of the http://xml.org/sax/features/external-general-entities and the http://xml.org/sax/features/external-parameter-entities properties.

startDocument()

Receive notification of the beginning of a document.

The SAX parser will invoke this method only once, before any other methods in this interface or in DTDHandler (except for setDocumentLocator).

startElement(tag, attributes)

SAX 2 parser interface method that is called for every starting XML element found during parsing.

All the data provided is buffered and handled when the corresponding XML end call is reached.

Parameter
  • tag (str) – The name of the XML element

  • attributes (dict) – Key-value pairs representing all the XML attributes of the XML element

startElementNS(name, qname, attrs)

Signals the start of an element in namespace mode.

The name parameter contains the name of the element type as a (uri, localname) tuple, the qname parameter the raw XML 1.0 name used in the source document, and the attrs parameter holds an instance of the Attributes class containing the attributes of the element.

The uri part of the name tuple is None for elements which have no namespace.

startPrefixMapping(prefix, uri)

Begin the scope of a prefix-URI Namespace mapping.

The information from this event is not necessary for normal Namespace processing: the SAX XML reader will automatically replace prefixes for element and attribute names when the http://xml.org/sax/features/namespaces feature is true (the default).

There are cases, however, when applications need to use prefixes in character data or in attribute values, where they cannot safely be expanded automatically; the start/endPrefixMapping event supplies the information to the application to expand prefixes in those contexts itself, if necessary.

Note that start/endPrefixMapping events are not guaranteed to be properly nested relative to each-other: all startPrefixMapping events will occur before the corresponding startElement event, and all endPrefixMapping events will occur after the corresponding endElement event, but their order is not guaranteed.

close()

Close the input file.

get_information()

Get generic information about the given file, such as file size, number of columns, and column names.

Returns

A dictionary containing the column names and the number of columns

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the file for reading in universal newlines mode.

reader()

Get the flat/table structured data of the XML file.

This method returns complete, CSV-compatible data rows created by the XML data transformation. Because the rows are yielded, they can be consumed row by row instead of all at once at the end, which keeps the memory consumption of the whole XML-to-CSV transformation very low.

Returns

A list of strings that represents a CSV data row

Return type

list

Module contents

Reader package.

Defines the publicly visible API by specifying the __all__ list.

since

2020-04-16

ncdiff.reader.create_reader_from_module(reader_name, module_name, *args, **kwargs)

Try to load the reader class from the specified module.

Parameter
  • reader_name (str) – The type of reader to create.

  • *args

    arguments passed to the reader constructor of the reader class.

  • **kwargs

    keyword arguments passed to the constructor of the reader class.
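Loading a class by name from a module is commonly done with `importlib`. The sketch below shows that general pattern only; `load_class` and `create_from_module` are hypothetical helpers, and how ncdiff maps `reader_name` to an actual class name is not stated here, so the example uses a standard library class instead of a reader.

```python
import importlib


def load_class(class_name, module_name):
    """Import module_name and return the attribute named class_name."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def create_from_module(class_name, module_name, *args, **kwargs):
    # Positional and keyword arguments are forwarded to the constructor unchanged.
    return load_class(class_name, module_name)(*args, **kwargs)


# Stand-in example: build a collections.Counter instead of a reader.
counter = create_from_module("Counter", "collections", "abca")
```

`importlib.import_module` raises `ImportError` for an unknown module and `getattr` raises `AttributeError` for an unknown class, which matches the "try to load" wording of the description.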

ncdiff.reader.create_old_reader(target_configuration)

Create the reader for the old input file according to the passed configuration.

Note: Row filters must be passed to the reader if filtering of rows is supported!

Parameter

target_configuration (TargetConfiguration) – The configuration of the diff target.

Returns

A reader implementation to read the old input file.

Return type

BaseReader

ncdiff.reader.create_new_reader(target_configuration)

Create the reader for the new input file according to the passed configuration.

Note: Row filters must be passed to the reader if filtering of rows is supported!

Parameter

target_configuration (TargetConfiguration) – The configuration of the diff target.

Returns

A reader implementation to read the new input file.

Return type

BaseReader

ncdiff.reader.create_old_sorted_file_reader(target_configuration)

Create the reader for the old sorted file.

Parameter

target_configuration (TargetConfiguration) – The configuration of the diff target.

ncdiff.reader.create_new_sorted_file_reader(target_configuration)

Create the reader for the new sorted file.

Parameter

target_configuration (TargetConfiguration) – The configuration of the diff target.

class ncdiff.reader.BaseReader(has_header, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: object

Abstract base class that defines the reader interface.

All readers support the context manager protocol (with statement).
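A minimal sketch of what the context manager protocol means for a reader is shown below. `DemoReader` is a hypothetical class, not part of ncdiff, and the assumption that entering the `with` block calls `open()` and leaving it calls `close()` is an illustration of the usual pattern rather than a statement about the BaseReader internals.

```python
class DemoReader:
    """Hypothetical reader; only the open/close/with wiring matters here."""

    def __init__(self, rows):
        self._rows = rows
        self.is_open = False

    def open(self):
        self.is_open = True

    def close(self):
        self.is_open = False

    def reader(self):
        return iter(self._rows)

    def __enter__(self):
        self.open()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions raised inside the block


with DemoReader([["a", "1"], ["b", "2"]]) as demo:
    rows = list(demo.reader())
    was_open = demo.is_open
```

The benefit of the `with` form is that `close()` runs even if processing a row raises an exception.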

close()

Close the location.

This method never fails.

get_information()

Get information from data.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names and the number of columns:

  • 'column_names': None

  • 'columnCount': 11

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the location.

reader()

Return a reader.

A reader is expected to return all cell values in string format with any leading and trailing white space characters removed.

Returns

A reader supporting the iterator protocol.

Return type

iterable

class ncdiff.reader.FileReader(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.BaseReader

Abstract base class implementing support to read files.

close()

Close the input file.

get_information()

Get information from a file.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns, and the size of the input file in bytes:

  • 'column_names': None

  • 'columnCount': 11

  • 'size': 123456

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the file for reading in universal newlines mode.

reader()

Return a reader.

A reader is expected to return all cell values in string format with any leading and trailing white space characters removed.

Returns

A reader supporting the iterator protocol.

Return type

iterable

class ncdiff.reader.CSVReader(has_header, input_path, delimiter=';', replace_delimiters=None, quoting='QUOTE_MINIMAL', quotechar='"', doublequote=True, escapechar=None, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.FileReader

Read csv files.
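The constructor defaults map naturally onto the standard library `csv` module. The following sketch uses `csv.reader` directly with the same defaults (semicolon delimiter, double-quote quote character, minimal quoting); translating the string `'QUOTE_MINIMAL'` to `csv.QUOTE_MINIMAL` and stripping the cells afterwards are assumptions about what the class does internally.

```python
import csv
import io

# In-memory sample instead of a real input file.
sample = io.StringIO('id;name\n1;"Alice; the first"\n2; Bob \n')

rows = [
    [cell.strip() for cell in row]  # readers return cells without surrounding blanks
    for row in csv.reader(
        sample,
        delimiter=";",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,
        doublequote=True,
    )
]
```

The quoted cell keeps its embedded semicolon, which is why a plain `str.split(";")` would not be a safe substitute.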

close()

Close the input file.

get_information()

Get information from a file.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns, and the size of the input file in bytes:

  • 'column_names': None

  • 'columnCount': 11

  • 'size': 123456

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the file for reading in universal newlines mode.

reader()

Return the CSV reader.

Returns

A CSV reader for the CSV file.

Return type

csv.reader

class ncdiff.reader.CSVVARReader(*args, **kwargs)

Bases: ncdiff.reader.csv.CSVReader

A csv reader that can cope with columns of different length.

close()

Close the input file.

get_information()

Get information from a file.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns, and the size of the input file in bytes:

  • 'column_names': None

  • 'columnCount': 11

  • 'size': 123456

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the file for reading in universal newlines mode.

reader()

Return the CSV reader.

Returns

A CSV reader for the CSV file.

Return type

csv.reader

class ncdiff.reader.DirReader(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.BaseReader

Reader for recursively comparing filesystem structures.

This reader generically extracts information about filesystem objects (files, directories, links) and forwards it as structured data records that can be written into CSV files. Since all the functionality for comparing CSV files is already in place, existing features such as filters, tolerances, result filters, and data type mapping can be used to compare the filesystem data. Since the primary key of the filesystem comparison does always need to be the ABSPATH + NAME, different kinds of comparison can be achieved.

close()

Empty close method.

Since the directory reader does not handle single files this function has no functionality within the DIR reader.

get_information()

General information/statistics that might have been gathered by this reader.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns, and the size of the input file in bytes.

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Empty open method.

Since the directory reader does not handle single files this function has no functionality within the DIR reader.

reader()

Gather all the information about the filesystem objects.

This method contains the main functionality of the DIR reader. It recursively loops through all the directories underneath the defined BaseDirectory. For each file, link, or directory found on the way down to the last directory, a corresponding FileInfo object is created and its information is finally forwarded to a CSV writer.

Returns

A list of strings representing the complete information of a filesystem object

Return type

list
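A recursive walk of this kind can be sketched with `os.walk`. The function `walk_records` and the three-field record layout (absolute path, name, kind) are hypothetical; the fields of the real FileInfo object are not listed here, so this only illustrates how each filesystem object turns into one flat record.

```python
import os
import tempfile


def walk_records(base_directory):
    """Yield [absolute path, name, kind] for everything below base_directory."""
    for current, directories, files in os.walk(base_directory):
        for name in sorted(directories):
            yield [os.path.abspath(current), name, "dir"]
        for name in sorted(files):
            yield [os.path.abspath(current), name, "file"]


# Build a tiny throwaway tree so the example is self-contained.
with tempfile.TemporaryDirectory() as base:
    os.mkdir(os.path.join(base, "sub"))
    with open(os.path.join(base, "sub", "a.txt"), "w") as handle:
        handle.write("x")
    # Drop the absolute path, which differs on every machine.
    records = [record[1:] for record in walk_records(base)]
```

Because every object becomes one row keyed by path and name, two directory trees can then be compared with the same machinery that compares two CSV files.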

class ncdiff.reader.FixedWidthReader(has_header, input_path, columns, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.FileReader

Implementation of reading text files with columns defined by the number of characters.
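Splitting a line by character counts can be sketched as follows. The format of the `columns` argument is not documented here, so a plain list of widths is an assumption, and `split_fixed_width` is a hypothetical helper rather than the reader's actual code.

```python
def split_fixed_width(line, widths):
    """Cut line into consecutive fields of the given character widths."""
    cells = []
    start = 0
    for width in widths:
        cells.append(line[start:start + width].strip())
        start += width
    return cells


# Build the sample with ljust so the padding matches the widths exactly.
line = "Alice".ljust(10) + "Wonderland".ljust(12) + "8".ljust(3)
cells = split_fixed_width(line, [10, 12, 3])
```

Stripping each slice follows the general expectation that readers return cell values without leading or trailing white space.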

close()

Close the input file.

get_information()

Get information from a file.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns, and the size of the input file in bytes:

  • 'column_names': None

  • 'columnCount': 11

  • 'size': 123456

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the file for reading in universal newlines mode.

reader()

Iterate through the passed file stream and return a single line on each call.

Returns

An iterable with the column entries

Return type

array

class ncdiff.reader.JSON2CSVReader(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.FileReader

JSON2CSVReader transforms JSON messages into a CSV file.

The goal was to find a generic implementation that transforms a JSON file into CSV, so that the existing (CSV) comparison process can be reused instead of writing a new, complex algorithm to diff structured hierarchical data.

This basic implementation is able to handle JSON where data is stored on different hierarchical levels. Arrays of JSON objects are only supported on the topmost level, but nested objects (i.e. dictionaries of dictionaries) are fully supported. It is a generic implementation, where the name of the element becomes the name of the resulting CSV column. For nested objects the dot syntax is used to build the column name. For instance:

{"name": "Alice", "location": {"country": "Wonderland", "street": "Rabbit Hole"}, "age": "8"}

will result in columns:

name | location.country | location.street | age
Alice | Wonderland | Rabbit Hole | 8
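The dot-syntax naming can be sketched in a few lines of Python. This is only a minimal illustration of the idea, assuming plain nested dictionaries; the function name `flatten` is hypothetical and this is not the reader's actual implementation.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dictionaries into one level using dot-separated keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into the nested object and prefix its keys.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


message = json.loads(
    '{"name": "Alice", "location": {"country": "Wonderland", '
    '"street": "Rabbit Hole"}, "age": "8"}'
)
row = flatten(message)
print(" | ".join(row.keys()))
print(" | ".join(row.values()))
```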

Since it is completely integrated into the existing NCDiff framework, it fully supports the filter, tolerance, and data type features of the framework.

close()

Close the input file.

get_information()

Get generic information about the given file, such as file size, number of columns, and column names.

Returns

A dictionary containing the column names, the number of columns

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the file and determine the column names.

reader()

Get the flat/table structured data of the JSON file.

class ncdiff.reader.SQLReader(has_header, connection_string, database_driver, query_string, fetch_size=500, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.BaseReader

Read SQL tables.

close()

Close the database connection.

get_information()

Get information from data.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names and the number of columns:

  • 'column_names': None

  • 'columnCount': 11

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the database connection.

reader()

Return a SQL reader.

Returns

A CSV reader for the csv file.

Return type

csv.reader

class ncdiff.reader.SWIFTReader(has_header, input_path, quotechar='"', doublequote=True, delimiter=';', replace_delimiters=None, escapechar=None, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.FileReader

Read swift files.

close()

Close the input file.

get_information()

Get information from a file.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns and the size of the input file in bytes:

  • 'column_names': None

  • 'columnCount': 11

  • 'size': 123456

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the file for reading in universal newlines mode.

reader()

Return a SWIFT reader.

Returns

A SWIFT reader for the swift file.

Return type

swift.reader

class ncdiff.reader.TARReader(has_header, input_path, compression_type, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.FileReader

The TAR file reader.

The main purpose of this class is to read a TAR file and write an overview of its content into a CSV file that can be compared to a similar TAR content file. The background of this reader is that many "packages" are either TAR or ZIP containers that hold lots of other files. To get an overview of the content of such a container, a corresponding reader had to be written.

There is currently support for 3 different kinds of TAR files:

  • plain/uncompressed: tar -> use TAR as <new|old>FileFormat

  • GZIP compressed tar: tar.gz -> use TAR.GZ as <new|old>FileFormat

  • BZIP2 compressed tar: tar.bz2 -> use TAR.BZ2 as <new|old>FileFormat

class TarMemberInfo(tar_info_object)

Bases: ncdiff.utils.FileUtils.FileInfo

TarMemberInfo is a specialization of the FileUtils.FileInfo object.

The main difference here is that the information about the files/directories are not directly retrieved from the file system, they are read out of the tar container itself.

class Type

Bases: object

Little helper class for distinguishing different filesystem object types.

DIRECTORY = 'Directory'
FILE = 'File'

get_abs_path()

Get the absolute (full) path of the filesystem object within the filesystem.

Returns

Absolute path of the FileInfo object

Return type

str

get_access_time(dt_format='%Y-%m-%d %H:%M:%S')

Get the timestamp when the filesystem object was accessed the last time.

Parameter

dt_format (str) – The optional timestamp format of the resulting string

Returns

A string representing the filesystem object's last access time.

Return type

str

get_creation_time(dt_format='%Y-%m-%d %H:%M:%S')

Get the timestamp when the filesystem object was created.

Parameter

dt_format (str) – The optional timestamp format of the resulting string

Returns

A string representing the filesystem object's creation time

Return type

str

get_depth()

Get the depth of the filesystem object relative to the BaseDirectory.

Returns

A number representing the depth level relative to the root/BaseDirectory

Return type

int

get_extension()

Get the file's extension.

All the characters to the right of the very last dot '.' within the file's base name.

Returns

The file's extension

Return type

str

get_group()

Get the ID of the user group owning the filesystem object.

Returns

ID of the group that owns the file

Return type

int

get_hash_value()

Get the hashdigest representing the hash value of the filesystem object.

Returns

A string representing the hashdigest

Return type

str

get_mode()

Get the file system permissions bits.

Returns

File system permission info.

Return type

int

get_modification_time(dt_format='%Y-%m-%d %H:%M:%S')

Get the timestamp when the filesystem object was modified the last time.

Parameter

dt_format (str) – The optional timestamp format of the resulting string.

Returns

A string representing the filesystem object's last modification time.

Return type

str

get_name()

Return the basename of the filesystem object.

Returns

Basename of the FileInfo object

Return type

str

get_permissions()

Get a string representing the file permissions in Unix format, e.g. "rwx-r-x--x".

Returns

A string that represents the file permissions in POSIX format.

Return type

str
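For orientation, a POSIX-style permission string can be derived from the numeric mode bits with the standard library. The sketch below is only an assumption about how such a string might be produced; the helper `permission_string` is hypothetical, and the exact layout returned by `get_permissions()` may differ.

```python
import stat


def permission_string(mode):
    """Render the permission bits of a mode as nine rwx characters."""
    # stat.filemode() prefixes a file-type character, which is dropped here.
    return stat.filemode(mode)[1:]


print(permission_string(0o751))
```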

get_rel_path()

Get the path of the filesystem object relative to the given BaseDirectory.

Returns

Relative path of the FileInfo object

Return type

str

get_size()

Get the size of the file.

Returns

Size of the file

Return type

int

get_type()

Get the file system object type this object represents.

Returns

The type of filesystem object

Return type

FileInfo.Type

get_user()

Get the id of the user owning the file system object.

Returns

ID of the user that owns the file.

Return type

int

is_directory()

Check if this FileInfo is a directory.

Returns

True if the FileInfo represents a directory, False in all other cases.

Return type

bool

is_file()

Check if this FileInfo is a file.

Returns

True if the FileInfo represents a file, False in all other cases.

Return type

bool

Check if this FileInfo is a link.

Returns

True if the FileInfo represents a link, False in all other cases.

Return type

bool

set_access_time(access_time)

Set the last accessed time of the file system object.

Parameter

access_time (datetime) – The last access timestamp of the filesystem object

set_creation_time(creation_time)

Set the creation time of the file system object.

Parameter

creation_time (datetime) – The creation timestamp of the filesystem object

set_depth(depth)

Set the depth of the filesystem object relative to the BaseDirectory specified in the constructor.

Parameter

depth (int) – The depth/level of the filesystem objects relative to the root/BaseDirectory

set_file_system_object(tar_info_object, base_directory=None)

Setter for the TarInfo file descriptor.

Parameter

tar_info_object (tarfile.TarInfo) – Kind of file descriptor object from a TarFile.

set_group(group_id)

Set the id of the group the owner belongs to.

Parameter

group_id (int) – The group id within the filesystem.

set_hash_value(hash_value)

Set the hashdigest that represents this FileInfo object.

Parameter

hash_value (str) – A string preferably extracted with FileUtils.calc_hash

set_mode(mode)

Set the file system permissions bits of the current filesystem object.

Parameter

mode (int) – File system permission

set_modification_time(modification_time)

Set the last modification time of the file system object.

Parameter

modification_time (datetime) – The last modification timestamp of the filesystem object.

set_name(file_path, base_directory=None)

Set the name of the TarInfo from a file path.

Parameter
  • file_path (str) – location/path within the tarfile

  • base_directory (str) – optional base path; this has to be stripped from the absolute path to get the relative one.

set_size(size)

Set the size of the file.

Parameter

size (int) – Size of the filesystem object

set_type(type_)

Set the type of the file system object this FileInfo object represents. File, Directory or Link.

Parameter

type_ (FileInfo.Type) – Type of the FileInfo object

set_user(user_id)

Set the id of the user that owns the filesystem object.

Parameter

user_id (int) – The user's ID within the filesystem.

close()

Close the input file.

get_information()

Get information from a file.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns and the size of the input file in bytes:

  • 'column_names': None

  • 'columnCount': 11

  • 'size': 123456

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

TAR file open/read functionality.

This method has some fallbacks in case the actual compression format does not match the specified one. The fallback sequence is:

  • [1] generic transparent compression

  • [2] GZIP compression

  • [3] 'explicit' NO compression

  • [4] BZIP2 compression

If the file cannot be opened, an error is raised.

Raises

IOError – either in case the file is already being read by another process or in case none of the compression formats match the file.
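A fallback sequence of this kind can be sketched with the standard `tarfile` module, whose mode strings `r:*`, `r:gz`, `r:` and `r:bz2` correspond to the four steps. The function `open_tar_with_fallback` and the constant `FALLBACK_MODES` are hypothetical names; this is an illustration under those assumptions, not the reader's actual code.

```python
import tarfile

# Modes tried in order: transparent detection, gzip, explicitly uncompressed, bzip2.
FALLBACK_MODES = ("r:*", "r:gz", "r:", "r:bz2")


def open_tar_with_fallback(path):
    """Try each mode in turn and return the first TarFile that opens."""
    for mode in FALLBACK_MODES:
        try:
            return tarfile.open(path, mode)
        except (tarfile.TarError, OSError, EOFError):
            continue  # this mode does not match the file, try the next one
    raise IOError(f"Cannot open {path!r} with any supported compression format")
```

Because the transparent mode already detects most archives, the later modes only matter when that first attempt fails.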

reader()

Gather all the information about the objects in the tar file.

This method contains the main functionality of the TAR reader. It recursively walks all directories beneath the root. For each file, link, or directory found on the way down to the last directory, a corresponding TarFileInfo object is created and its information is finally forwarded to a CSV writer.

Returns

A list of strings representing the complete filesystem object information

Return type

list
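To give an idea of what such an overview looks like, the sketch below lists the members of an archive using the standard `tarfile` module and turns each one into a row of strings. The function `list_tar_members`, the selected columns, and the "Link" label are assumptions made for illustration; the columns actually written by the reader are not specified here.

```python
import io
import tarfile
from datetime import datetime


def list_tar_members(archive):
    """Return one row (list of strings) per member of an open TarFile."""
    rows = []
    for member in archive.getmembers():
        if member.isdir():
            kind = "Directory"
        elif member.issym() or member.islnk():
            kind = "Link"  # assumed label, not confirmed by the documentation
        else:
            kind = "File"
        modified = datetime.fromtimestamp(member.mtime).strftime("%Y-%m-%d %H:%M:%S")
        rows.append([member.name, kind, str(member.size), oct(member.mode), modified])
    return rows


# Build a tiny in-memory archive to demonstrate the listing.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    folder = tarfile.TarInfo("docs")
    folder.type = tarfile.DIRTYPE
    folder.mtime = 1_000_000_000
    archive.addfile(folder)
    payload = b"hello"
    entry = tarfile.TarInfo("docs/readme.txt")
    entry.size = len(payload)
    entry.mtime = 1_000_000_000
    archive.addfile(entry, io.BytesIO(payload))

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as archive:
    rows = list_tar_members(archive)
for row in rows:
    print(";".join(row))
```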

class ncdiff.reader.XLSReader(has_header, input_path, worksheet_name=None, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.BaseReader

Implementation of reading the XLS files.

close()

Free the worksheet.

get_information()

Get XLS information.

Note: The dictionary values may be None if the file is empty or has no header.

Returns

A dictionary containing the column names, the number of columns and the size of the input file in bytes:

  • 'column_names': None

  • 'columnCount': 11

  • 'size': 123456

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the xls file for reading.

reader()

Iterate through the passed worksheet and return a single line on each call.

Returns

An iterable with column entries

Return type

array

class ncdiff.reader.XML2CSVReader(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)

Bases: ncdiff.reader.base.FileReader

XML2CSVReader is a kind of mini XSLT transformation of an XML file into a CSV file.

The goal was to find a generic implementation that transforms an XML file into CSV, so that the existing (CSV) comparison process can be reused instead of writing a new, complex algorithm to diff structured hierarchical data.

This basic implementation is able to handle XML files where data is stored on different hierarchical levels. It is a generic implementation, where the name of the element/tag becomes the name of the resulting CSV column.

Since it is completely integrated into the existing NCDiff framework, it fully supports the filter, tolerance, and data type features of the framework.

class AttributeColumn(matrix, key, level, index=0)

Bases: ncdiff.reader.xml.XML2CSVReader.ElementColumn

Represents an XML attribute column.

A further specialization on top of the ElementColumn is the AttributeColumn that would represent a single XML attribute of an XML element. In addition to all the information that is available on the element/tag level, there is also the index of the attribute within the element.

This feature is currently not fully supported/implemented.

add_vector(vector)

Add a vector to the axis (row/column).

Parameter

vector (Vector) – A vector that should be added to the list

Returns

True in case the vector was added to the axis, False (default) in case the given vector was not a valid object.

Return type

bool

cleanup()

Functionality that should release all the resources currently connected to the row.

This is basically a functionality to avoid unnecessary memory consumption

get_index()

Get the index of the underlying XML attribute within the XML element.

Returns

The attribute index within the XML element

Return type

int

get_key()

Get the unique id/name of the column.

Returns

The column's ID

Return type

str

get_level()

Get the hierarchical level of the underlying XML element within the XML file.

Returns

The hierarchical level of the underlying XML element

Return type

int

get_matrix()

Return the matrix of the axis.

Returns

Matrix of the Axis

Return type

XML2CSVReader.Matrix

has_values()

Indicate if any of the given vectors of the axis contains valid (size>0) strings.

Returns

True in case at least a single vector contains valid data, False (default) otherwise.

Return type

bool

remove_vector(vector)

Remove a given vector from the list of vectors (in case it is found).

Parameter

vector (Vector) – Instance of the vector that should be removed from this axis

Returns

True in case the vector was removed, False (default) in case the vector was not found/removed.

Return type

bool

class Axis(matrix, key)

Bases: object

Base class of all the rows and columns that belong to the mapping matrix.

Its main purpose is to provide the ID/Key of a given axis (Column/Row) and a list of all the items/vectors that belong to this axis.

add_vector(vector)

Add a vector to the axis (row/column).

Parameter

vector (Vector) – A vector that should be added to the list

Returns

True in case the vector was added to the axis, False (default) in case the given vector was not a valid object.

Return type

bool

get_key()

Return the unique key/id of the axis (column/row).

Returns

A string representing the unique key/ID of the axis

Return type

str

get_matrix()

Return the matrix of the axis.

Returns

Matrix of the Axis

Return type

XML2CSVReader.Matrix

has_values()

Indicate if any of the given vectors of the axis contains valid (size>0) strings.

Returns

True in case at least a single vector contains valid data, False (default) otherwise.

Return type

bool

remove_vector(vector)

Remove a given vector from the list of vectors (in case it is found).

Parameter

vector (Vector) – Instance of the vector that should be removed from this axis

Returns

True in case the vector was removed, False (default) in case the vector was not found/removed.

Return type

bool

class CellVector(row, column, value, line_number=-1)

Bases: ncdiff.reader.xml.XML2CSVReader.Vector

Represent the data that belongs to a single coordinate (combination of row and column) within the matrix.

add_line_number(line_number)

Add all the linenumbers the current vector originates from.

Parameter

line_number (int) – Line number from within the XML file

add_line_numbers(line_numbers)

Add all the linenumbers the current vector originates from.

Parameter

line_numbers (int) – Line number from within the XML file

cleanup()

Cleanup, release all existing references, so that the garbage collector can pick up this object.

get_column()

Get the column this vector belongs to.

Returns

The vector's column

Return type

Column

get_context()

Get the context of the vector.

Returns

The column this vector belongs to

Return type

Column

get_data()

Get the data that represents the matrix cells value.

Returns

The value of the vector

Return type

str

get_line_numbers()

Get all the linenumbers the current vector originates from.

Returns

The line numbers from the originating XML file

Return type

list of int

get_max_line_number()

Get the highest linenumber the current vector originates from.

Returns

The line number from the originating XML file

Return type

int

get_min_line_number()

Get the lowest linenumber the current vector originates from.

Returns

The line number from the originating XML file

Return type

int

get_row()

Get the row this vector belongs to.

Returns

The vector's row

Return type

Row

class Column(matrix, key)

Bases: ncdiff.reader.xml.XML2CSVReader.Axis

Columns from an XML.

Class that represents, on the one hand, the collection of all the tags from the XML file and, on the other hand, the list of columns that will define the structure of the CSV file. Since it forms the horizontal axis of the matrix, it is derived from the Axis class.

add_vector(vector)

Add a vector to the axis (row/column).

Parameter

vector (Vector) – A vector that should be added to the list

Returns

True in case the vector was added to the axis, False (default) in case the given vector was not a valid object.

Return type

bool

cleanup()

Functionality that should release all the resources currently connected to the row.

This is basically a functionality to avoid unnecessary memory consumption

get_key()

Return the unique key/id of the axis (column/row).

Returns

A string representing the unique key/ID of the axis

Return type

str

get_matrix()

Return the matrix of the axis.

Returns

Matrix of the Axis

Return type

XML2CSVReader.Matrix

has_values()

Indicate if any of the given vectors of the axis contains valid (size>0) strings.

Returns

True in case at least a single vector contains valid data, False (default) otherwise.

Return type

bool

remove_vector(vector)

Remove a given vector from the list of vectors (in case it is found).

Parameter

vector (Vector) – Instance of the vector that should be removed from this axis

Returns

True in case the vector was removed, False (default) in case the vector was not found/removed.

Return type

bool

class ElementColumn(matrix, key, level)

Bases: ncdiff.reader.xml.XML2CSVReader.Column

Represents an XML element column.

A specialization of the basic Column is the ElementColumn. It also contains information (level) about its hierarchical depth. This is especially important because XML allows the reuse of tags with the same name throughout the whole structure.

add_vector(vector)

Add a vector to the axis (row/column).

Parameter

vector (Vector) – A vector that should be added to the list

Returns

True in case the vector was added to the axis, False (default) in case the given vector was not a valid object.

Return type

bool

cleanup()

Functionality that should release all the resources currently connected to the row.

This is basically a functionality to avoid unnecessary memory consumption

get_key()

Get the unique id/name of the column.

Returns

The column's ID

Return type

str

get_level()

Get the hierarchical level of the underlying XML element within the XML file.

Returns

The hierarchical level of the underlying XML element

Return type

int

get_matrix()

Return the matrix of the axis.

Returns

Matrix of the Axis

Return type

XML2CSVReader.Matrix

has_values()

Indicate if any of the given vectors of the axis contains valid (size>0) strings.

Returns

True in case at least a single vector contains valid data, False (default) otherwise.

Return type

bool

remove_vector(vector)

Remove a given vector from the list of vectors (in case it is found).

Parameter

vector (Vector) – Instance of the vector that should be removed from this axis

Returns

True in case the vector was removed, False (default) in case the vector was not found/removed.

Return type

bool

class Matrix(columns=None)

Bases: object

Basic container that holds two axes.

The horizontal axis forms all the columns of the CSV file, whereas the vertical axis forms all the rows of the CSV file.

The object itself acts as a pure data storage. When parsing the XML, all the information about a given element (or attribute in the future) is fed into the matrix. The matrix itself knows how to map the given information into its internal structure to finally transform the hierarchical data of the XML into the table-based data structure of the CSV.

A further feature of the matrix is that it is designed for lazy loading/writing. This means that, to save memory, it will not contain all the rows of the XML/CSV. Whenever a row is fully populated, it can be removed from the matrix. Therefore the matrix contains a list of finished rows that can be extracted from outside. This is by design, since on the input side the matrix gets line-by-line data from an XML parser and on the output side there is a file writer that creates a CSV.

add_attributes(tag, attributes, level, xml_file_line_number)

Map a list of XML attributes into the CSV structure.

Parameter
  • tag (str) – The XML element that attributes belong to

  • attributes (str) – All the XML element attributes, key value pairs

  • level (int) – The hierarchical level of the XML element

  • xml_file_line_number (int) – The originating XML file linenumber of the XML element

add_data(tag, data, level, xml_file_line_number, index=0)

Map the value of an XML element into the CSV structure.

Parameter
  • tag (str) – The XML element that attributes belong to

  • data (str) – The XML element's value

  • level (int) – The hierarchical level of the XML element

  • xml_file_line_number (int) – The originating XML file linenumber of the XML element

  • index (int) –

cleanup()

Functionality that should release all the resources currently connected to the row.

This is basically a functionality to avoid unnecessary memory consumption.

get_column_max_level()

Return the maximum column level.

get_columns()

Get all columns belonging to the matrix.

Returns

All columns of the matrix

Return type

list of Column

get_new_row()

Create a new Row object.

Returns

A new row object

Return type

Row

get_rows()

Get all rows belonging to the matrix.

Returns

All rows of the matrix

Return type

list of Row

class Row(matrix, key)

Bases: ncdiff.reader.xml.XML2CSVReader.Axis

Class that contains all the items/vectors of a single CSV line.

It is derived from the Axis class because it represents the vertical axis of the matrix. Since we are trying to map hierarchical data, there can be siblings of the current row that logically belong together.

Those siblings can share RowVectors, which hold a single value across multiple siblings. Furthermore, there is also the information (line number) about where within the XML file the data of a given row originates. This information is primarily used when a difference is found, to give the user a hint where to look within the XML file.

add_sibling(sibling)

Add a row that logically is related to the current row.

Parameter

sibling (Row) – The row that is the sibling of the current row

add_vector(vector)

Add a vector to the axis (row/column).

Parameter

vector (Vector) – A vector that should be added to the list

Returns

True in case the vector was added to the axis, False (default) in case the given vector was not a valid object.

Return type

bool

cleanup()

Functionality that should release all the resources currently connected to the row.

This is basically a functionality to avoid unnecessary memory consumption

clone(level, xml_file_line_number)

Clone the current row.

Parameter
  • level

  • xml_file_line_number

Returns

A cloned row object

Return type

Row

get_key()

Return the unique key/id of the axis (column/row).

Returns

A string representing the unique key/ID of the axis

Return type

str

get_matrix()

Return the matrix of the axis.

Returns

Matrix of the Axis

Return type

XML2CSVReader.Matrix

get_reference_lines()

Return a string representing the starting and closing XML line from where the current XML row originates.

Returns

String [<start>:<stop>] that represents the starting and stopping XML element

Return type

str

get_vector(column)

Return the vector that is assigned to the given column (within the current row).

Parameter

column (Column) – The column whose vector should be returned

Returns

The vector that represents the given column, or None in case the column could not be found

Return type

bool

has_values()

Indicate if any of the given vectors of the axis contains valid (size>0) strings.

Returns

True in case at least a single vector contains valid data, False (default) otherwise.

Return type

bool

is_completly_handled()

Indicate if every column within the current row has been handled (set).

Returns

True in case all columns have been set, False in case columns are still missing

Return type

bool

remove_vector(vector)

Remove a given vector from the list of vectors (in case it is found).

Parameter

vector (Vector) – Instance of the vector that should be removed from this axis

Returns

True in case the vector was removed, False (default) in case the vector was not found/removed.

Return type

bool

set_column_handled(column, state=True)

Set the 'handled' status of the given column.

Parameter
  • column (Column) – The column whose status has to be updated

  • state (bool) – Flag to indicate whether the column has already been handled (True: yes, False: no)

set_data(column, data, level, xml_file_line_number, index)

Map a hierarchical XML value/string into the CSV structure.

Parameter
  • column (Column) – The column the data belongs to

  • data (str) – The data that belongs to the column

  • level (int) – The hierarchical level of the column within the XML

  • xml_file_line_number (int) – The linenumber within the XML the data originated from

  • index (int) – The attribute index in case the value is from an attribute

was_column_already_handled(column, data=None)

Indicate if the given column (within the current row) was already handled.

Was the value of that column already set? If that is the case, this is an indicator that this row has to be cloned.

Parameter
  • column (str) – The name/id of the column that has to be checked

  • data (str) – The optional value of the cell that belongs to that row/column

Returns

True in case the column was already handled and the data matches, False (default) in case the column was not handled

Return type

bool

class RowVector(row, column, value, line_number=-1)

Bases: ncdiff.reader.xml.XML2CSVReader.Vector

A row vector represents data that is shared by multiple rows for a single column/tag/element.

add_line_number(line_number)

Add all the linenumbers the current vector originates from.

Parameter

line_number (int) – Line number from within the XML file

add_line_numbers(line_numbers)

Add all the linenumbers the current vector originates from.

Parameter

line_numbers (int) – Line number from within the XML file

add_row(row)

Add a Row to the vectors list of rows.

Parameter

row (Row) – A row that should be added

cleanup()

Cleanup, release all existing references, so that the garbage collector can pick up this object.

get_column()

Get the column this vector belongs to.

Returns

The vector's column

Return type

Column

get_context()

Get the context of the vector.

Returns

The column this vector belongs to

Return type

Column

get_data()

Get the data that represents the matrix cells value.

Returns

The value of the vector

Return type

str

get_line_numbers()

Get all the linenumbers the current vector originates from.

Returns

The line numbers from the originating XML file

Return type

list of int

get_max_line_number()

Get the highest linenumber the current vector originates from.

Returns

The line number from the originating XML file

Return type

int

get_min_line_number()

Get the lowest linenumber the current vector originates from.

Returns

The line number from the originating XML file

Return type

int

get_rows()

Get the rows this vector belongs to.

Returns

The vector's rows

Return type

list of Row

class Vector(value, line_number)

Bases: object

Base class representing a vector within a matrix.

In its simplest form it represents the data of a single coordinate within the matrix. A further feature of the vector is that it contains information about the origin of the data it represents from within the XML file (line number).

add_line_number(line_number)

Add all the linenumbers the current vector originates from.

Parameter

line_number (int) – Line number from within the XML file

add_line_numbers(line_numbers)

Add all the linenumbers the current vector originates from.

Parameter

line_numbers (int) – Line number from within the XML file

get_data()

Get the data that represents the matrix cells value.

Returns

The value of the vector

Return type

str

get_line_numbers()

Get all the linenumbers the current vector originates from.

Returns

The line numbers from the originating XML file

Return type

list of int

get_max_line_number()

Get the highest linenumber the current vector originates from.

Returns

The line number from the originating XML file

Return type

int

get_min_line_number()

Get the lowest linenumber the current vector originates from.

Returns

The line number from the originating XML file

Return type

int

class Xml2CsvAnalyser

Bases: xml.sax.handler.ContentHandler

Analyse an XML structure before the CSV translation.

The XML2CSVAnalyser represents a mini implementation of the XML2CSVTranslation that is up-front used to analyse the data/element/tag-structure of a given XML. This is needed because of the lazy-writing feature of the matrix. Since it is very possible that certain elements/columns may occur at the very end of an XML we have to know their existence before starting to write the CSV file.

Therefore the analyser will quickly rush through the XML to collect the list of all existing elements. It does not consider any data, but since it has the information about all the elements the columns of the XML2CSV matrix can be initialised before starting to feed the matrix with the actual data when parsing the XML again with the translator.
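The idea of such a first pass can be sketched with the standard `xml.sax` module: a content handler that ignores character data and only records element names. The class `ElementCollector` is hypothetical, and treating every element name as a column is a simplifying assumption; the real analyser may well restrict columns to certain elements.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ElementCollector(ContentHandler):
    """First pass: record every element name in order of first appearance."""

    def __init__(self):
        super().__init__()
        self.column_names = []

    def startElement(self, name, attrs):
        # Data is not inspected here, only the existence of the element.
        if name not in self.column_names:
            self.column_names.append(name)


document = (
    b"<people>"
    b"<person><name>Alice</name></person>"
    b"<person><name>Bob</name><age>8</age></person>"
    b"</people>"
)
collector = ElementCollector()
xml.sax.parseString(document, collector)
print(collector.column_names)
```

Note that `age` only appears in the second `person` element, which is exactly the situation that makes a separate analysis pass necessary before any rows are written.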

characters(content)

Receive notification of character data.

The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.

endDocument()

Receive notification of the end of a document.

The SAX parser will invoke this method only once, and it will be the last method invoked during the parse. The parser shall not invoke this method until it has either abandoned parsing (because of an unrecoverable error) or reached the end of input.

endElement(tag)

SAX 2 parser interface method that is called for every closing XML element that is found during XML parsing.

Parameters

tag (str) – The name of the XML element

endElementNS(name, qname)

Signals the end of an element in namespace mode.

The name parameter contains the name of the element type, just as with the startElementNS event.

endPrefixMapping(prefix)

End the scope of a prefix-URI mapping.

See startPrefixMapping for details. This event will always occur after the corresponding endElement event, but the order of endPrefixMapping events is not otherwise guaranteed.

get_column_names()

Return a list of all the columns found during analysis.

Returns

The list of found columns

Return type

list of Column

ignorableWhitespace(whitespace)

Receive notification of ignorable whitespace in element content.

Validating Parsers must use this method to report each chunk of ignorable whitespace (see the W3C XML 1.0 recommendation, section 2.10): non-validating parsers may also use this method if they are capable of parsing and using content models.

SAX parsers may return all contiguous whitespace in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.

processingInstruction(target, data)

Receive notification of a processing instruction.

The Parser will invoke this method once for each processing instruction found: note that processing instructions may occur before or after the main document element.

A SAX parser should never report an XML declaration (XML 1.0, section 2.8) or a text declaration (XML 1.0, section 4.3.1) using this method.

setDocumentLocator(locator)

Called by the parser to give the application a locator for locating the origin of document events.

SAX parsers are strongly encouraged (though not absolutely required) to supply a locator: if it does so, it must supply the locator to the application by invoking this method before invoking any of the other methods in the DocumentHandler interface.

The locator allows the application to determine the end position of any document-related event, even if the parser is not reporting an error. Typically, the application will use this information for reporting its own errors (such as character content that does not match an application’s business rules). The information returned by the locator is probably not sufficient for use with a search engine.

Note that the locator will return correct information only during the invocation of the events in this interface. The application should not attempt to use it at any other time.
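
As an illustration of how a handler might use the locator, the following sketch stores it and queries the line number while an event is being delivered. The class name LineTrackingHandler and the sample document are assumptions; only setDocumentLocator, startElement and the locator's getLineNumber method come from the standard xml.sax interface.

```python
import xml.sax
import xml.sax.handler


class LineTrackingHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler that records the line reported for each element."""

    def __init__(self):
        super().__init__()
        self._locator = None
        self.lines = []  # (tag, line number) pairs

    def setDocumentLocator(self, locator):
        # Called by the parser before any other event is reported.
        self._locator = locator

    def startElement(self, tag, attributes):
        # The locator is only queried while an event is being delivered.
        line = self._locator.getLineNumber() if self._locator else None
        self.lines.append((tag, line))


SAMPLE = b"<root>\n  <a/>\n  <b/>\n</root>"

handler = LineTrackingHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.lines)
```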

skippedEntity(name)

Receive notification of a skipped entity.

The Parser will invoke this method once for each entity skipped. Non-validating processors may skip entities if they have not seen the declarations (because, for example, the entity was declared in an external DTD subset). All processors may skip external entities, depending on the values of the http://xml.org/sax/features/external-general-entities and the http://xml.org/sax/features/external-parameter-entities properties.

startDocument()

Receive notification of the beginning of a document.

The SAX parser will invoke this method only once, before any other methods in this interface or in DTDHandler (except for setDocumentLocator).

startElement(tag, attributes)

SAX 2 parser interface method that is called for every starting XML element that is found during parsing.

Parameters
  • tag (str) – The name of the XML element

  • attributes (dict) – Key-value pairs representing all attributes of the XML element

startElementNS(name, qname, attrs)

Signals the start of an element in namespace mode.

The name parameter contains the name of the element type as a (uri, localname) tuple, the qname parameter the raw XML 1.0 name used in the source document, and the attrs parameter holds an instance of the Attributes class containing the attributes of the element.

The uri part of the name tuple is None for elements which have no namespace.

startPrefixMapping(prefix, uri)

Begin the scope of a prefix-URI Namespace mapping.

The information from this event is not necessary for normal Namespace processing: the SAX XML reader will automatically replace prefixes for element and attribute names when the http://xml.org/sax/features/namespaces feature is true (the default).

There are cases, however, when applications need to use prefixes in character data or in attribute values, where they cannot safely be expanded automatically; the start/endPrefixMapping event supplies the information to the application to expand prefixes in those contexts itself, if necessary.

Note that start/endPrefixMapping events are not guaranteed to be properly nested relative to each-other: all startPrefixMapping events will occur before the corresponding startElement event, and all endPrefixMapping events will occur after the corresponding endElement event, but their order is not guaranteed.
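
The namespace-mode events can be observed with a short sketch. It assumes a parser obtained from xml.sax.make_parser and switches the namespaces feature on explicitly rather than relying on a default. NamespaceRecorder and the sample document are hypothetical names used only for this example.

```python
import io
import xml.sax
import xml.sax.handler


class NamespaceRecorder(xml.sax.handler.ContentHandler):
    """Hypothetical handler that records prefix mappings and element names."""

    def __init__(self):
        super().__init__()
        self.mappings = []  # (prefix, uri) pairs
        self.elements = []  # (uri, localname) tuples

    def startPrefixMapping(self, prefix, uri):
        self.mappings.append((prefix, uri))

    def startElementNS(self, name, qname, attrs):
        # name is a (uri, localname) tuple; uri is None without a namespace.
        self.elements.append(name)


SAMPLE = b'<doc xmlns:x="urn:example"><x:item/><plain/></doc>'

recorder = NamespaceRecorder()
parser = xml.sax.make_parser()
# Enable namespace processing explicitly for this sketch.
parser.setFeature(xml.sax.handler.feature_namespaces, True)
parser.setContentHandler(recorder)
parser.parse(io.BytesIO(SAMPLE))

print(recorder.mappings)
print(recorder.elements)
```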

class Xml2CsvTranslator(reader, xml_file_name, columns, csv_file_name=None)

Bases: xml.sax.handler.ContentHandler

The XML2CSV Translator is responsible for feeding all the XML/element/tag data into the mapping matrix.

The data itself is enriched with information about its location within the XML structure (line number, hierarchical level, element it belongs to, …). It is basically built around an XML SAX parser event framework that allows huge XML files to be parsed without consuming much memory. The mapping matrix follows the same low-memory approach: by design, finished CSV rows can already be retrieved while data is still being read from the XML file and fed into the matrix.

CLMN_XML_LINENUMBERS = 'TOIGNORE:XmlFileLines'

TRANSLATION = {9: None, 10: None}

characters(data)

SAX 2 parser interface method that is called for every value belonging to an XML element.

Note: This method might be called multiple times for a single XML element value! Therefore all the data has to be concatenated.

Parameters

data (str) – String representing either a part or the whole of the value of the currently handled XML element.
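
A minimal sketch of the concatenation this requires, assuming the chunks are collected in a list and joined when the element ends. The class name TextBufferingHandler and the sample document are hypothetical and not part of ncdiff.

```python
import xml.sax
import xml.sax.handler


class TextBufferingHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler that joins all character chunks of an element."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self.values = []  # (tag, text) pairs

    def startElement(self, tag, attributes):
        # Start a fresh buffer for every opening element.
        self._chunks = []

    def characters(self, data):
        # May be called several times for one value, so only collect here.
        self._chunks.append(data)

    def endElement(self, tag):
        # The value is complete only once the closing element is reached.
        self.values.append((tag, "".join(self._chunks)))
        self._chunks = []


# Entity references are a typical reason for a value arriving in pieces.
SAMPLE = b"<row><name>Smith &amp; Sons</name></row>"

handler = TextBufferingHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.values[0])
```

Joining the chunks in endElement keeps the result independent of how the parser happens to split the character data.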

close()

SAX 2 parser interface method that is called when the end of the XML document is reached.

endDocument()

Receive notification of the end of a document.

The SAX parser will invoke this method only once, and it will be the last method invoked during the parse. The parser shall not invoke this method until it has either abandoned parsing (because of an unrecoverable error) or reached the end of input.

endElement(tag)

SAX 2 parser interface method that is called for every closing XML element that is found during parsing.

This method triggers the XML2CSV transformation for the value and attributes of this XML element.

Parameters

tag (str) – The name of the XML element

endElementNS(name, qname)

Signals the end of an element in namespace mode.

The name parameter contains the name of the element type, just as with the startElementNS event.

endPrefixMapping(prefix)

End the scope of a prefix-URI mapping.

See startPrefixMapping for details. This event will always occur after the corresponding endElement event, but the order of endPrefixMapping events is not otherwise guaranteed.

get_buffered_row()

Retrieve all the buffered rows that have already been transformed into a CSV-ready structure.

Returns

A list of strings representing the transformed XML data

Return type

list of str

has_buffered_rows()

Indicate whether the algorithm has already produced new rows that can be forwarded to the CSV file writer.

Returns

Amount of rows that are ready for processing

Return type

bool

ignorableWhitespace(whitespace)

Receive notification of ignorable whitespace in element content.

Validating Parsers must use this method to report each chunk of ignorable whitespace (see the W3C XML 1.0 recommendation, section 2.10): non-validating parsers may also use this method if they are capable of parsing and using content models.

SAX parsers may return all contiguous whitespace in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.

processingInstruction(target, data)

Receive notification of a processing instruction.

The Parser will invoke this method once for each processing instruction found: note that processing instructions may occur before or after the main document element.

A SAX parser should never report an XML declaration (XML 1.0, section 2.8) or a text declaration (XML 1.0, section 4.3.1) using this method.

setDocumentLocator(locator)

Called by the parser to give the application a locator for locating the origin of document events.

SAX parsers are strongly encouraged (though not absolutely required) to supply a locator: if it does so, it must supply the locator to the application by invoking this method before invoking any of the other methods in the DocumentHandler interface.

The locator allows the application to determine the end position of any document-related event, even if the parser is not reporting an error. Typically, the application will use this information for reporting its own errors (such as character content that does not match an application’s business rules). The information returned by the locator is probably not sufficient for use with a search engine.

Note that the locator will return correct information only during the invocation of the events in this interface. The application should not attempt to use it at any other time.

set_input_line_number(line_number)

Set the line number, within the XML file being read, of the currently handled XML element.

Parameters

line_number (int) – The XML file line number

skippedEntity(name)

Receive notification of a skipped entity.

The Parser will invoke this method once for each entity skipped. Non-validating processors may skip entities if they have not seen the declarations (because, for example, the entity was declared in an external DTD subset). All processors may skip external entities, depending on the values of the http://xml.org/sax/features/external-general-entities and the http://xml.org/sax/features/external-parameter-entities properties.

startDocument()

Receive notification of the beginning of a document.

The SAX parser will invoke this method only once, before any other methods in this interface or in DTDHandler (except for setDocumentLocator).

startElement(tag, attributes)

SAX 2 parser interface method that is called for every starting XML element that is found during parsing.

Basically, all the data that is provided is buffered to be handled when the corresponding XML end call is reached.

Parameters
  • tag (str) – The name of the XML element

  • attributes (dict) – Key-value pairs representing all attributes of the XML element

startElementNS(name, qname, attrs)

Signals the start of an element in namespace mode.

The name parameter contains the name of the element type as a (uri, localname) tuple, the qname parameter the raw XML 1.0 name used in the source document, and the attrs parameter holds an instance of the Attributes class containing the attributes of the element.

The uri part of the name tuple is None for elements which have no namespace.

startPrefixMapping(prefix, uri)

Begin the scope of a prefix-URI Namespace mapping.

The information from this event is not necessary for normal Namespace processing: the SAX XML reader will automatically replace prefixes for element and attribute names when the http://xml.org/sax/features/namespaces feature is true (the default).

There are cases, however, when applications need to use prefixes in character data or in attribute values, where they cannot safely be expanded automatically; the start/endPrefixMapping event supplies the information to the application to expand prefixes in those contexts itself, if necessary.

Note that start/endPrefixMapping events are not guaranteed to be properly nested relative to each-other: all startPrefixMapping events will occur before the corresponding startElement event, and all endPrefixMapping events will occur after the corresponding endElement event, but their order is not guaranteed.

close()

Close the input file.

get_information()

Get generic information about the given file, such as file size, number of columns and column names.

Returns

A dictionary containing the column names and the number of columns

Return type

dict

has_header()

Check if data has a header line.

Returns

True if the first line of a reader is a header.

Return type

bool

open()

Open the file for reading in universal newlines mode.

reader()

Get the flat/table structured data of the XML file.

This method returns complete CSV-compatible data rows that have been created by the XML data transformation. Thanks to yield, those data rows can be accessed on a "row by row" basis instead of a "give me all at the end" basis. This has the advantage that the memory consumption of the whole XML-to-CSV transformation is very low.

Returns

A list of strings that represent a CSV data row

Return type

list
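
The row-by-row access can be sketched with a generator that feeds the parser incrementally and yields rows as soon as the handler has buffered them. This is only an illustration under assumptions: the method names has_buffered_rows and get_buffered_row are modelled on the ones documented above, but their bodies, the class RowBufferingHandler, the function iter_rows and the row_tag parameter are hypothetical and not the ncdiff implementation.

```python
import io
import xml.sax
import xml.sax.handler


class RowBufferingHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler that turns each row element into a list of str."""

    def __init__(self, row_tag):
        super().__init__()
        self._row_tag = row_tag
        self._in_row = False
        self._chunks = []
        self._row = []
        self._rows = []  # finished rows not yet handed out

    def startElement(self, tag, attributes):
        if tag == self._row_tag:
            self._in_row = True
            self._row = []
        self._chunks = []

    def characters(self, data):
        self._chunks.append(data)

    def endElement(self, tag):
        if tag == self._row_tag:
            self._rows.append(self._row)
            self._in_row = False
        elif self._in_row:
            self._row.append("".join(self._chunks).strip())
        self._chunks = []

    def has_buffered_rows(self):
        return bool(self._rows)

    def get_buffered_row(self):
        # Hand out everything buffered so far and start a new buffer.
        rows, self._rows = self._rows, []
        return rows


def iter_rows(stream, row_tag):
    """Yield finished rows while the XML is still being read."""
    handler = RowBufferingHandler(row_tag)
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for line in stream:
        parser.feed(line)  # incremental parsing, one input line at a time
        if handler.has_buffered_rows():
            yield from handler.get_buffered_row()
    parser.close()
    # Rows that the parser only reported while finishing the document.
    yield from handler.get_buffered_row()


SAMPLE = (
    b"<rows>\n"
    b"<row><id>1</id><name>a</name></row>\n"
    b"<row><id>2</id><name>b</name></row>\n"
    b"</rows>\n"
)

rows = list(iter_rows(io.BytesIO(SAMPLE), "row"))
print(rows)
```

Only the rows that have been completed but not yet consumed are held in memory at any time. Exactly when the parser reports a finished row may vary with the underlying parser version.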