ncdiff.reader package
Submodules
ncdiff.reader.base module
Factory functions and base classes for the reader framework.
- since
2012-02-10
-
ncdiff.reader.base.
create_reader_from_module
(reader_name, module_name, *args, **kwargs)¶ Try to load the reader class from the specified module.
- Parameter
reader_name (
str
) – The type of reader to create.
*args –
positional arguments passed to the constructor of the reader class.
**kwargs –
keyword arguments passed to the constructor of the reader class.
-
ncdiff.reader.base.
create_old_reader
(target_configuration)¶ Create the reader for the old input file according to the passed configuration.
Note: Row filters must be passed to the reader if filtering of rows is supported.
- Parameter
target_configuration (
TargetConfiguration
) – The configuration of the diff target.
- Returns
A reader implementation to read the old input file.
- Return type
BaseReader
-
ncdiff.reader.base.
create_new_reader
(target_configuration)¶ Create the reader for the new input file according to the passed configuration.
Note: Row filters must be passed to the reader if filtering of rows is supported.
- Parameter
target_configuration (
TargetConfiguration
) – The configuration of the diff target.
- Returns
A reader implementation to read the new input file.
- Return type
BaseReader
-
ncdiff.reader.base.
create_old_sorted_file_reader
(target_configuration)¶ Create the reader for the old sorted file.
- Parameter
target_configuration (
TargetConfiguration
) – The configuration of the diff target.
-
ncdiff.reader.base.
create_new_sorted_file_reader
(target_configuration)¶ Create the reader for the new sorted file.
- Parameter
target_configuration (
TargetConfiguration
) – The configuration of the diff target.
-
class
ncdiff.reader.base.
BaseReader
(has_header, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
object
Abstract base class that defines the reader interface.
All readers support the context manager protocol (with statement).
-
close
()¶ Close the location.
This method never fails.
-
get_information
()¶ Get information from data.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names and the number of columns:
'column_names': None
'columnCount': 11
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the location.
-
reader
()¶ Return a reader.
A reader is expected to return all cell values as strings with any leading and trailing whitespace removed.
- Returns
A reader supporting the iterator protocol.
- Return type
iterable
-
-
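The interface described above can be illustrated with a minimal stand-in. The class below is not part of ncdiff: `ListReader` and its constructor arguments are hypothetical, and the sketch only mirrors the documented method names (`open`, `close`, `reader`, `has_header`, `get_information`) together with the context manager behaviour.

```python
class ListReader:
    """Hypothetical stand-in that mirrors the documented reader interface."""

    def __init__(self, has_header, rows):
        self._has_header = has_header
        self._rows = rows
        self._is_open = False

    def open(self):
        self._is_open = True

    def close(self):
        # Mirrors the documented contract: closing never fails.
        self._is_open = False

    def __enter__(self):
        self.open()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions

    def has_header(self):
        return self._has_header

    def get_information(self):
        # Values are None when there is no data or no header.
        header = self._rows[0] if self._has_header and self._rows else None
        return {
            "column_names": header,
            "columnCount": len(self._rows[0]) if self._rows else None,
        }

    def reader(self):
        # All cell values are returned as stripped strings.
        for row in self._rows:
            yield [str(cell).strip() for cell in row]


with ListReader(True, [["id", "name"], [1, " Alice "]]) as source:
    rows = list(source.reader())
```

Using the `with` statement guarantees that `close()` runs even if iterating over `reader()` raises.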
class
ncdiff.reader.base.
FileReader
(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.BaseReader
Abstract base class implementing support to read files.
-
close
()¶ Close the input file.
-
get_information
()¶ Get information from a file.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns and the size of the input file in bytes:
'column_names': None
'columnCount': 11
'size': 123456
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the file for reading in universal newlines mode.
-
reader
()¶ Return a reader.
A reader is expected to return all cell values as strings with any leading and trailing whitespace removed.
- Returns
A reader supporting the iterator protocol.
- Return type
iterable
-
ncdiff.reader.csv module
Reader classes for CSV format.
- since
2012-02-10
-
class
ncdiff.reader.csv.
CSVReader
(has_header, input_path, delimiter=';', replace_delimiters=None, quoting='QUOTE_MINIMAL', quotechar='"', doublequote=True, escapechar=None, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.FileReader
Read csv files.
-
close
()¶ Close the input file.
-
get_information
()¶ Get information from a file.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns and the size of the input file in bytes:
'column_names': None
'columnCount': 11
'size': 123456
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the file for reading in universal newlines mode.
-
reader
()¶ Return the CSV reader.
- Returns
A CSV reader for the CSV file.
- Return type
csv.reader
-
-
class
ncdiff.reader.csv.
CSVVARReader
(*args, **kwargs)¶ Bases:
ncdiff.reader.csv.CSVReader
A csv reader that can cope with columns of different length.
-
close
()¶ Close the input file.
-
get_information
()¶ Get information from a file.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns and the size of the input file in bytes:
'column_names': None
'columnCount': 11
'size': 123456
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the file for reading in universal newlines mode.
-
reader
()¶ Return the CSV reader.
- Returns
A CSV reader for the CSV file.
- Return type
csv.reader
-
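How CSVVARReader copes with varying lengths is not documented here. One plausible approach, shown purely as an assumption, is to pad short rows with empty strings so that every row has the same number of cells. The helper name `read_padded_rows` is hypothetical; the `;` delimiter and `"` quote character are the defaults from the CSVReader signature.

```python
import csv
import io


def read_padded_rows(stream, delimiter=";", quotechar='"'):
    """Read delimited rows and pad short rows with empty strings (assumed strategy)."""
    rows = [
        [cell.strip() for cell in row]
        for row in csv.reader(stream, delimiter=delimiter, quotechar=quotechar)
    ]
    # The widest row defines the number of columns.
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]


sample = io.StringIO("a;b;c\n1;2\n3\n")
padded = read_padded_rows(sample)
```

This sketch loads all rows into memory to find the widest one; a streaming reader would need to know the column count up front, for example from a header.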
ncdiff.reader.dir module
Reader classes for DIR format to compare file system structures.
- since
2012-02-10
-
class
ncdiff.reader.dir.
DirReader
(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.BaseReader
Reader for recursively comparing filesystem structures.
This reader generically extracts information about filesystem objects (files, directories, links) and forwards it as structured data records that can be written into CSV files. Since the functionality for comparing CSV files is already in place, all existing features such as filters, tolerances, result filters, and data type mapping can be used to compare the filesystem data. Since the primary key of the filesystem comparison always needs to be ABSPATH + NAME, different kinds of comparison can be achieved.
-
close
()¶ Empty close method.
Because the directory reader does not handle single files, this method does nothing in the DIR reader.
-
get_information
()¶ General information/statistics that might have been gathered by this reader.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns and the size of the input file in bytes.
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Empty open method.
Because the directory reader does not handle single files, this method does nothing in the DIR reader.
-
reader
()¶ Gather all the information about the filesystem objects.
This method contains the main functionality of the DIR reader. It recursively loops through all the directories underneath the defined BaseDirectory. For each file, link, or directory found on the way down to the last directory, a corresponding
FileInfo
object is created and its information is finally forwarded to a CSV writer.
- Returns
A list of strings representing the complete information about the filesystem objects
- Return type
list
-
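The recursive walk described above can be sketched with the standard library. This is not the DirReader implementation: `walk_records` and the chosen record fields (relative path, name, type, size) are assumptions made for illustration, with the type names borrowed from the documented `Type` helper values.

```python
import os
import tempfile


def walk_records(base_directory):
    """Yield one record (a list of strings) per filesystem object."""
    for current, directories, files in os.walk(base_directory):
        for name in sorted(directories) + sorted(files):
            path = os.path.join(current, name)
            if os.path.islink(path):
                kind = "Link"
            elif os.path.isdir(path):
                kind = "Directory"
            else:
                kind = "File"
            rel_path = os.path.relpath(path, base_directory)
            # lstat does not follow links, so a link reports its own size.
            size = os.lstat(path).st_size
            yield [rel_path, name, kind, str(size)]


# Small demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as base:
    os.mkdir(os.path.join(base, "sub"))
    with open(os.path.join(base, "sub", "a.txt"), "w") as handle:
        handle.write("abc")
    records = sorted(walk_records(base))
```

Each record is a flat list of strings, so it can be handed directly to `csv.writer.writerow`.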
ncdiff.reader.fixed_width module
Reader classes for fixed width text format.
- since
2012-02-10
-
class
ncdiff.reader.fixed_width.
FixedWidthReader
(has_header, input_path, columns, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.FileReader
Implementation of reading text files with columns defined by the number of characters.
-
close
()¶ Close the input file.
-
get_information
()¶ Get information from a file.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns and the size of the input file in bytes:
'column_names': None
'columnCount': 11
'size': 123456
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the file for reading in universal newlines mode.
-
reader
()¶ Iterate through the passed file stream and return a single line on every call.
- Returns
iterable with column entries
- Return type
array
-
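Reading columns defined by a number of characters amounts to slicing each line at fixed offsets. The format of the `columns` argument is not documented here, so the sketch below assumes a plain list of column widths; `split_fixed_width` and `read_fixed_width` are hypothetical helper names.

```python
import io


def split_fixed_width(line, widths):
    """Cut one line into cells of the given widths and strip each cell."""
    cells = []
    start = 0
    for width in widths:
        cells.append(line[start:start + width].strip())
        start += width
    return cells


def read_fixed_width(stream, widths):
    """Yield one list of cell values per input line."""
    for line in stream:
        yield split_fixed_width(line.rstrip("\n"), widths)


sample = io.StringIO("ID  NAME      \n1   Alice     \n")
rows = list(read_fixed_width(sample, [4, 10]))
```

Lines shorter than the sum of the widths simply produce empty trailing cells, because slicing past the end of a string returns an empty string.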
ncdiff.reader.json module
Reader classes for JSON format.
- since
2012-02-10
-
class
ncdiff.reader.json.
JSON2CSVReader
(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.FileReader
JSON2CSVReader transforms JSON messages into a CSV file.
The goal was to find a generic implementation that transforms a JSON file into CSV, so that the existing CSV comparison process can be used instead of writing a new, complex algorithm to diff structured hierarchical data.
This basic implementation is able to handle JSON where data is stored on different hierarchical levels. Arrays of JSON objects are only supported on the topmost level, but nested objects (dictionaries of dictionaries) are fully supported. It is a generic implementation in which the name of the element becomes the name of the resulting CSV column. For nested objects, dot syntax is used to build the column name. For instance:
{"name": "Alice", "location": {"country": "Wonderland", "street": "Rabbit Hole"}, "age": "8"}
will result in the columns:
name | location.country | location.street | age
Alice | Wonderland | Rabbit Hole | 8
Since it is completely integrated into the existing NCDiff framework, it fully supports the filter, tolerance, and data type features of the framework.
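The dot-syntax column naming can be reproduced in a few lines. This is a sketch of the idea only, not the JSON2CSVReader code; the function name `flatten` is hypothetical.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dictionaries into a single level with dotted keys."""
    columns = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested objects contribute their keys below the parent name.
            columns.update(flatten(value, name))
        else:
            columns[name] = value
    return columns


message = json.loads(
    '{"name": "Alice", "location": {"country": "Wonderland", '
    '"street": "Rabbit Hole"}, "age": "8"}'
)
row = flatten(message)
```

The keys of `row` serve as the CSV header and its values as the data line, in the order in which the elements appear in the JSON message.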
-
close
()¶ Close the input file.
-
get_information
()¶ Get generic information about the given file, such as file size, number of columns, and column names.
- Returns
A dictionary containing the column names, the number of columns
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the file and determine the column names.
-
reader
()¶ Get the flat/table structured data of the JSON file.
ncdiff.reader.sql module
Reader classes for SQL databases.
- since
2012-02-10
-
class
ncdiff.reader.sql.
SQLReader
(has_header, connection_string, database_driver, query_string, fetch_size=500, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.BaseReader
Read SQL tables.
-
close
()¶ Close the database connection.
-
get_information
()¶ Get information from data.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names and the number of columns:
'column_names': None
'columnCount': 11
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the database connection.
-
reader
()¶ Return a SQL reader.
- Returns
A CSV reader for the csv file.
- Return type
csv.reader
-
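The `fetch_size` argument suggests that rows are pulled from the database in batches. A minimal sketch of that pattern, using the standard-library `sqlite3` module as a stand-in for whichever driver SQLReader is configured with, could look like this. The helper `iter_rows` and the mapping of NULL to an empty string are assumptions, not documented behaviour.

```python
import sqlite3


def iter_rows(cursor, fetch_size=500):
    """Yield rows as lists of stripped strings, fetching in batches."""
    while True:
        batch = cursor.fetchmany(fetch_size)
        if not batch:
            break
        for row in batch:
            # Assumption: NULL values are represented as empty strings.
            yield ["" if cell is None else str(cell).strip() for cell in row]


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE person (id INTEGER, name TEXT)")
connection.executemany(
    "INSERT INTO person VALUES (?, ?)", [(1, "Alice "), (2, None)]
)
cursor = connection.execute("SELECT id, name FROM person ORDER BY id")
# Column names are available from the DB-API cursor description.
column_names = [description[0] for description in cursor.description]
rows = list(iter_rows(cursor, fetch_size=1))
connection.close()
```

Fetching with `fetchmany` keeps memory usage bounded by the batch size instead of the size of the whole result set.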
ncdiff.reader.swift module
Reader classes for SWIFT messages.
- since
2012-02-10
-
class
ncdiff.reader.swift.
SWIFTReader
(has_header, input_path, quotechar='"', doublequote=True, delimiter=';', replace_delimiters=None, escapechar=None, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.FileReader
Read swift files.
-
close
()¶ Close the input file.
-
get_information
()¶ Get information from a file.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns and the size of the input file in bytes:
'column_names': None
'columnCount': 11
'size': 123456
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the file for reading in universal newlines mode.
-
reader
()¶ Return a SWIFT reader.
- Returns
A SWIFT reader for the SWIFT file.
- Return type
swift.reader
-
ncdiff.reader.tar module
Reader classes for TAR format to compare file system structures in *.tar files.
- since
2012-02-10
-
class
ncdiff.reader.tar.
TARReader
(has_header, input_path, compression_type, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.FileReader
The TAR file reader.
The main purpose of this class is to read a TAR file and write an overview of its content into a CSV file that can be compared to a similar TAR content file. The background of this reader is that many "packages" are either TAR or ZIP containers that hold many other files. To get an overview of the content of such a container, a corresponding reader had to be written. Three kinds of TAR files are currently supported:
plain/uncompressed: tar -> use TAR as <new|old>FileFormat
GZIP compressed tar: tar.gz -> use TAR.GZ as <new|old>FileFormat
BZIP2 compressed tar: tar.bz2 -> use TAR.BZ2 as <new|old>FileFormat
-
class
TarMemberInfo
(tar_info_object)¶ Bases:
ncdiff.utils.FileUtils.FileInfo
TarMemberInfo is a specialization of the FileUtils.FileInfo object.
The main difference here is that the information about the files/directories are not directly retrieved from the file system, they are read out of the tar container itself.
-
class
Type
¶ Bases:
object
Little helper class for distinguishing different filesystem object types.
-
DIRECTORY
= 'Directory'¶
-
FILE
= 'File'¶
-
LINK
= 'Link'¶
-
-
get_abs_path
()¶ Get the absolute (full) path of the filesystem object within the filesystem.
- Returns
Absolute path of the
FileInfo
object
- Return type
str
-
get_access_time
(dt_format='%Y-%m-%d %H:%M:%S')¶ Get the timestamp when the filesystem object was accessed the last time.
- Parameter
dt_format (
str
) – The optional timestamp format of the resulting string
- Returns
A string representing the filesystem object's last access time.
- Return type
str
-
get_creation_time
(dt_format='%Y-%m-%d %H:%M:%S')¶ Get the timestamp when the filesystem object was created.
- Parameter
dt_format (
str
) – The optional timestamp format of the resulting string
- Returns
A string representing the filesystem object's creation time
- Return type
str
-
get_depth
()¶ Get the depth of the filesystem object relative to the BaseDirectory.
- Returns
A number representing the depth level relative to the root/BaseDirectory
- Return type
int
-
get_extension
()¶ Get the file's extension.
All the characters to the right of the very last dot '.' within the file's base name.
- Returns
The file's extension
- Return type
str
-
get_group
()¶ Get the id of the group owning the file system object.
- Returns
Id of the group that owns the file
- Return type
int
-
get_hash_value
()¶ Get the hash digest representing the hash value of the filesystem object.
- Returns
A string representing the hash digest
- Return type
str
-
get_mode
()¶ Get the file system permission bits.
- Returns
File system permission info.
- Return type
int
-
get_modification_time
(dt_format='%Y-%m-%d %H:%M:%S')¶ Get the timestamp when the filesystem object was modified the last time.
- Parameter
dt_format (
str
) – The optional timestamp format of the resulting string.
- Returns
A string representing the filesystem object's last modification time.
- Return type
str
-
get_name
()¶ Return the basename of the filesystem object.
- Returns
Basename of the
FileInfo
object
- Return type
str
-
get_permissions
()¶ Get a string representing the file permissions in Unix format "rwx-r-x--x".
- Returns
A string that represents the file permissions in POSIX format.
- Return type
str
-
get_rel_path
()¶ Get the path of the filesystem object relative to the given BaseDirectory.
- Returns
Relative path of the
FileInfo
object
- Return type
str
-
get_size
()¶ Get the size of the file.
- Returns
Size of the file
- Return type
int
-
get_type
()¶ Get the file system object type this object represents.
- Returns
The type of filesystem object
- Return type
FileInfo.Type
-
get_user
()¶ Get the id of the user owning the file system object.
- Returns
Id of the user that owns the file.
- Return type
int
-
is_directory
()¶ Check if this
FileInfo
is a directory.
- Returns
True if the
FileInfo
represents a directory, False in all other cases.
- Return type
bool
-
is_file
()¶ Check if this
FileInfo
is a file.
- Returns
True if the
FileInfo
represents a file, False in all other cases.
- Return type
bool
-
is_link
()¶ Check if this
FileInfo
is a link.
- Returns
True if the
FileInfo
represents a link, False in all other cases.
- Return type
bool
-
set_access_time
(access_time)¶ Set the last accessed time of the file system object.
- Parameter
access_time (
datetime
) – The last access timestamp of the filesystem object
-
set_creation_time
(creation_time)¶ Set the creation time of the file system object.
- Parameter
creation_time (
datetime
) – The creation timestamp of the filesystem object
-
set_depth
(depth)¶ Set the depth of the filesystem object relative to the BaseDirectory specified in the constructor.
- Parameter
depth (
int
) – The depth/level of the filesystem objects relative to the root/BaseDirectory
-
set_file_system_object
(tar_info_object, base_directory=None)¶ Setter for the TarInfo file descriptor.
- Parameter
tar_info_object (
tarfile.TarInfo
) – Kind of file descriptor object from a TarFile.
-
set_group
(group_id)¶ Set the id of the group the owner belongs to.
- Parameter
group_id (
int
) – The group id within the filesystem.
-
set_hash_value
(hash_value)¶ Set the hash digest that represents this
FileInfo
object.
- Parameter
hash_value (
str
) – A string preferably extracted with FileUtils.calc_hash
-
set_mode
(mode)¶ Set the file system permissions bits of the current filesystem object.
- Parameter
mode (
int
) – File system permission
-
set_modification_time
(modification_time)¶ Set the last modification time of the file system object.
- Parameter
modification_time (
datetime
) – The last modification timestamp of the filesystem object.
-
set_name
(file_path, base_directory=None)¶ Set the name of the TarInfo from a file path.
- Parameter
file_path (
str
) – Location/path within the tar file
base_directory (
str
) – Optional base path; this has to be stripped from the absolute path to get the relative one.
-
set_size
(size)¶ Set the size of the file.
- Parameter
size (
int
) – Size of the filesystem object
-
set_type
(type_)¶ Set the type of the file system object this
FileInfo
object represents: File, Directory or Link.
- Parameter
type_ (
FileInfo.Type
) – Type of the FileInfo object
-
set_user
(user_id)¶ Set the id of the user that owns the filesystem object.
- Parameter
user_id (
int
) – The user's id within the filesystem.
-
close
()¶ Close the input file.
-
get_information
()¶ Get information from a file.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns and the size of the input file in bytes:
'column_names': None
'columnCount': 11
'size': 123456
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ TAR file open/read functionality.
This method has fallbacks in case the actual compression format does not match the specified one. The fallback sequence is:
[1] generic transparent compression
[2] GZIP compression
[3] explicit NO compression
[4] BZIP2 compression
If the file cannot be opened, an error is raised.
- Raises
IOError – either if the file is already being read by another process or if none of the compression formats match the file.
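The fallback sequence maps naturally onto the mode strings of the standard-library `tarfile` module. The sketch below is an illustration under that assumption, not the TARReader code; `open_tar` and `FALLBACK_MODES` are hypothetical names.

```python
import tarfile

# Same order as the documented fallback sequence:
# transparent compression, GZIP, explicitly uncompressed, BZIP2.
FALLBACK_MODES = ("r:*", "r:gz", "r:", "r:bz2")


def open_tar(path):
    """Try each mode in turn and return the first TarFile that opens."""
    for mode in FALLBACK_MODES:
        try:
            return tarfile.open(path, mode)
        except (tarfile.ReadError, tarfile.CompressionError):
            # The format did not match this mode; try the next one.
            continue
    raise IOError(f"Cannot open {path!r} with any supported compression")
```

With a current `tarfile`, the transparent mode `r:*` already detects GZIP and BZIP2 archives, so the later entries mainly mirror the documented order.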
-
reader
()¶ Gather all the information about the tar files objects.
This method contains the main functionality of the TAR reader. It recursively loops through all the directories underneath the root. For each file, link, or directory found on the way down to the last directory, a corresponding
TarFileInfo
object is created and its information is finally forwarded to a CSV writer.
- Returns
A list of strings representing the complete information about the filesystem objects
- Return type
list
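Listing the members of a tar container and turning each one into a flat record can be sketched with `tarfile` and `stat`. The helper `member_record` and the selected fields are assumptions for illustration; the type names follow the documented `Type` helper values.

```python
import io
import stat
import tarfile


def member_record(member):
    """Build a list of strings describing one tar member."""
    if member.isdir():
        kind = "Directory"
    elif member.issym() or member.islnk():
        kind = "Link"
    else:
        kind = "File"
    # Combine with S_IFREG so filemode() sees a known file type,
    # then drop the leading type character to keep only "rwxrwxrwx".
    permissions = stat.filemode(stat.S_IFREG | member.mode)[1:]
    return [member.name, kind, str(member.size), permissions]


# Build a small archive in memory for the demonstration.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    info = tarfile.TarInfo("docs/readme.txt")
    payload = b"hello"
    info.size = len(payload)
    info.mode = 0o644
    archive.addfile(info, io.BytesIO(payload))

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:") as archive:
    records = [member_record(member) for member in archive.getmembers()]
```

All values come from the tar headers, so nothing is read from the local filesystem.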
ncdiff.reader.xls module
Reader classes for XLS and XLSX format.
- since
2012-02-10
-
class
ncdiff.reader.xls.
XLSReader
(has_header, input_path, worksheet_name=None, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.BaseReader
Implementation of reading the XLS files.
-
close
()¶ Free the worksheet.
-
get_information
()¶ Get XLS information.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns and the size of the input file in bytes:
'column_names': None
'columnCount': 11
'size': 123456
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the xls file for reading.
-
reader
()¶ Iterate through the passed worksheet and return a single line on every call.
- Returns
iterable with column entries
- Return type
array
-
ncdiff.reader.xml module
Reader classes for XML format.
- since
2012-02-10
-
class
ncdiff.reader.xml.
XML2CSVReader
(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.FileReader
XML2CSVReader is a kind of mini XSLT transformation of an XML file into a CSV file.
The goal was to find a generic implementation that transforms an XML file into CSV, so that the existing CSV comparison process can be used instead of writing a new, complex algorithm to diff structured hierarchical data.
This basic implementation is able to handle XML files where data is stored on different hierarchical levels. It is a generic implementation in which the name of the element/tag becomes the name of the resulting CSV column.
Since it is completely integrated into the existing NCDiff framework, it fully supports the filter, tolerance, and data type features of the framework.
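The tag-to-column idea can be illustrated with `xml.etree.ElementTree`. This is a simplified sketch, not the matrix-based implementation documented below: `rows_from_xml` and its `record_tag` parameter are hypothetical, and the sketch assumes that one repeated element corresponds to one CSV row.

```python
import xml.etree.ElementTree as ET


def rows_from_xml(text, record_tag):
    """Yield one dict per record element, keyed by the tags that carry text."""
    root = ET.fromstring(text)
    for record in root.iter(record_tag):
        row = {}
        for element in record.iter():
            value = (element.text or "").strip()
            # Skip the record element itself and pure container elements.
            if element is not record and value:
                row[element.tag] = value
        yield row


document = """
<people>
  <person><name>Alice</name><location><country>Wonderland</country></location></person>
  <person><name>Bob</name></person>
</people>
"""
rows = list(rows_from_xml(document, "person"))
```

Elements from different hierarchical levels end up side by side in the same row, which is the flattening the description refers to.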
-
class
AttributeColumn
(matrix, key, level, index=0)¶ Bases:
ncdiff.reader.xml.XML2CSVReader.ElementColumn
Represents an XML attribute column.
A further specialization on top of the ElementColumn is the AttributeColumn that would represent a single XML attribute of an XML element. In addition to all the information that is available on the element/tag level, there is also the index of the attribute within the element.
This feature is currently not fully supported/implemented.
-
add_vector
(vector)¶ Add a vector to the axis (row/column).
- Parameter
vector (
Vector
) – A vector that should be added to the list
- Returns
True
in case the vector was added to the axis,
False
(default) in case the given vector was not a valid object.
- Return type
bool
-
cleanup
()¶ Functionality that should release all the resources currently connected to the row.
This is basically a functionality to avoid unnecessary memory consumption
-
get_index
()¶ Get the index of the underlying XML attribute within the XML element.
- Returns
The attribute index within the XML element
- Return type
int
-
get_key
()¶ Get the unique id/name of the column.
- Returns
The column's id
- Return type
str
-
get_level
()¶ Get the hierarchical level of the underlying XML element within the XML file.
- Returns
The hierarchical level of the underlying XML element
- Return type
int
-
get_matrix
()¶ Return the matrix of the axis.
- Returns
Matrix of the Axis
- Return type
Matrix
-
has_values
()¶ Indicate if any of the given vectors of the axis contains valid (size>0) strings.
- Returns
True
in case at least a single vector contains valid data,
False
(default).
- Return type
bool
-
remove_vector
(vector)¶ Remove a given vector from the list of vectors (in case it is found).
- Parameter
vector (
Vector
) – Instance of the vector that should be removed from this axis
- Returns
True
in case the vector was removed,
False
(default) in case the vector was not found/removed.
- Return type
bool
-
-
class
Axis
(matrix, key)¶ Bases:
object
Base class of all the rows and columns that belong to the mapping matrix.
Its main purpose is to provide the ID/Key of a given axis (Column/Row) and a list of all the items/vectors that belong to this axis.
-
add_vector
(vector)¶ Add a vector to the axis (row/column).
- Parameter
vector (
Vector
) – A vector that should be added to the list
- Returns
True
in case the vector was added to the axis,
False
(default) in case the given vector was not a valid object.
- Return type
bool
-
get_key
()¶ Return the unique key/id of the axis (column/row).
- Returns
A string representing the unique key/id of the axis
- Return type
str
-
get_matrix
()¶ Return the matrix of the axis.
- Returns
Matrix of the Axis
- Return type
Matrix
-
has_values
()¶ Indicate if any of the given vectors of the axis contains valid (size>0) strings.
- Returns
True
in case at least a single vector contains valid data,
False
(default).
- Return type
bool
-
remove_vector
(vector)¶ Remove a given vector from the list of vectors (in case it is found).
- Parameter
vector (
Vector
) – Instance of the vector that should be removed from this axis
- Returns
True
in case the vector was removed,
False
(default) in case the vector was not found/removed.
- Return type
bool
-
-
class
CellVector
(row, column, value, line_number=-1)¶ Bases:
ncdiff.reader.xml.XML2CSVReader.Vector
Represent the data that belongs to a single coordinate (combination of row and column) within the matrix.
-
add_line_number
(line_number)¶ Add a line number the current vector originates from.
- Parameter
line_number (
int
) – Line number from within the XML file
-
add_line_numbers
(line_numbers)¶ Add all the line numbers the current vector originates from.
- Parameter
line_numbers (
int
) – Line numbers from within the XML file
-
cleanup
()¶ Cleanup, release all existing references, so that the garbage collector can pick up this object.
-
get_column
()¶ Get the column this vector belongs to.
- Returns
The vector's column
- Return type
Column
-
get_context
()¶ Get the context of the vector.
- Returns
The column this vector belongs to
- Return type
Column
-
get_data
()¶ Get the data that represents the matrix cell's value.
- Returns
The value of the vector
- Return type
str
-
get_line_numbers
()¶ Get all the line numbers the current vector originates from.
- Returns
The line numbers from the originating XML file
- Return type
list
of int
-
get_max_line_number
()¶ Get the highest line number the current vector originates from.
- Returns
The line number from the originating XML file
- Return type
int
-
get_min_line_number
()¶ Get the lowest line number the current vector originates from.
- Returns
The line number from the originating XML file
- Return type
int
-
get_row
()¶ Get the row this vector belongs to.
- Returns
The vector's row
- Return type
Row
-
-
class
Column
(matrix, key)¶ Bases:
ncdiff.reader.xml.XML2CSVReader.Axis
Columns from an XML.
Class that represents, on the one hand, the collection of all the tags from the XML file and, on the other hand, the list of columns that define the structure of the CSV file. Since it forms the horizontal axis of the matrix, it is derived from the Axis class.
-
add_vector
(vector)¶ Add a vector to the axis (row/column).
- Parameter
vector (
Vector
) – A vector that should be added to the list
- Returns
True
in case the vector was added to the axis,
False
(default) in case the given vector was not a valid object.
- Return type
bool
-
cleanup
()¶ Functionality that should release all the resources currently connected to the row.
This is basically a functionality to avoid unnecessary memory consumption
-
get_key
()¶ Return the unique key/id of the axis (column/row).
- Returns
A string representing the unique key/id of the axis
- Return type
str
-
get_matrix
()¶ Return the matrix of the axis.
- Returns
Matrix of the Axis
- Return type
Matrix
-
has_values
()¶ Indicate if any of the given vectors of the axis contains valid (size>0) strings.
- Returns
True
in case at least a single vector contains valid data,
False
(default).
- Return type
bool
-
remove_vector
(vector)¶ Remove a given vector from the list of vectors (in case it is found).
- Parameter
vector (
Vector
) – Instance of the vector that should be removed from this axis
- Returns
True
in case the vector was removed,
False
(default) in case the vector was not found/removed.
- Return type
bool
-
-
class
ElementColumn
(matrix, key, level)¶ Bases:
ncdiff.reader.xml.XML2CSVReader.Column
Represents an XML element column.
A specialization of the basic Column is the ElementColumn. It also contains information (level) about its hierarchical depth. This is especially important because XML allows tags with the same name to be reused throughout the whole structure.
-
add_vector
(vector)¶ Add a vector to the axis (row/column).
- Parameter
vector (
Vector
) – A vector that should be added to the list
- Returns
True
in case the vector was added to the axis,
False
(default) in case the given vector was not a valid object.
- Return type
bool
-
cleanup
()¶ Functionality that should release all the resources currently connected to the row.
This is basically a functionality to avoid unnecessary memory consumption
-
get_key
()¶ Get the unique id/name of the column.
- Returns
The column's id
- Return type
str
-
get_level
()¶ Get the hierarchical level of the underlying XML element within the XML file.
- Returns
The hierarchical level of the underlying XML element
- Return type
int
-
get_matrix
()¶ Return the matrix of the axis.
- Returns
Matrix of the Axis
- Return type
Matrix
-
has_values
()¶ Indicate if any of the given vectors of the axis contains valid (size>0) strings.
- Returns
True
in case at least a single vector contains valid data,
False
(default).
- Return type
bool
-
remove_vector
(vector)¶ Remove a given vector from the list of vectors (in case it is found).
- Parameter
vector (
Vector
) – Instance of the vector that should be removed from this axis
- Returns
True
in case the vector was removed,
False
(default) in case the vector was not found/removed.
- Return type
bool
-
-
class
Matrix
(columns=None)¶ Bases:
object
Basic container that holds two axes.
The horizontal axis forms all the columns of the CSV file, whereas the vertical axis forms all the rows of the CSV file.
The object itself acts as a pure data storage. When parsing the XML, all the information about a given element (or attribute in the future) is fed into the matrix. The matrix itself knows how to map the given information into its internal structure to finally transform the hierarchical data of the XML into the table-based data structure of the CSV.
A further feature of the matrix is that it is designed for lazy loading/writing. This means that, to save memory, it will not contain all the rows of the XML/CSV. Whenever a row is fully populated, it can be removed from the matrix. Therefore the matrix contains a list of finished rows that can be extracted from outside. This is by design, since on the input side the matrix gets line-by-line data from an XML parser and on the output side a file writer creates the CSV.
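The lazy-loading idea, emitting a row as soon as it is complete and releasing its data, can be sketched with `xml.etree.ElementTree.iterparse`. This is not the Matrix implementation; `stream_rows` and its `record_tag` parameter are hypothetical, and the sketch assumes that each record element maps to one row.

```python
import io
import xml.etree.ElementTree as ET


def stream_rows(stream, record_tag):
    """Yield each finished row and free its element afterwards."""
    for event, element in ET.iterparse(stream, events=("end",)):
        if element.tag == record_tag:
            # The "end" event means the row is fully populated.
            yield {child.tag: (child.text or "").strip() for child in element}
            element.clear()  # release the children of the finished row


sample = io.BytesIO(
    b"<rows><row><a>1</a><b>2</b></row><row><a>3</a></row></rows>"
)
rows = list(stream_rows(sample, "row"))
```

Because rows are yielded one at a time, a consumer such as a CSV writer can process them without the whole document being held as populated rows.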
-
add_attributes
(tag, attributes, level, xml_file_line_number)¶ Map a list of XML attributes into the CSV structure.
- Parameters
tag (
str
) – The XML element that the attributes belong to
attributes (
str
) – All the XML element attributes, as key-value pairs
level (
int
) – The hierarchical level of the XML element
xml_file_line_number (
int
) – The line number of the XML element in the originating XML file
-
add_data
(tag, data, level, xml_file_line_number, index=0)¶ Map the value of an XML element into the CSV structure.
- Parameters
tag (
str
) – The XML element that the data belongs to
data (
str
) – The value of the XML element
level (
int
) – The hierarchical level of the XML element
xml_file_line_number (
int
) – The line number of the XML element in the originating XML file
index (
int
) –
-
cleanup
()¶ Functionality that should release all the resources currently connected to the row.
This is basically a functionality to avoid unnecessary memory consumption.
-
get_column_max_level
()¶ Return the maximum column level.
-
get_columns
()¶ Get all columns belonging to the matrix.
- Returns
All columns of the matrix
- Return type
list
of Column
-
get_new_row
()¶ Create a new
Row
object.
- Returns
A new row object
- Return type
Row
-
get_rows
()¶ Get all rows belonging to the matrix.
- Returns
All rows of the matrix
- Return type
list
of Row
-
-
class
Row
(matrix, key)¶ Bases:
ncdiff.reader.xml.XML2CSVReader.Axis
Class that contains all the items/vectors of a single CSV line.
It is derived from the Axis class because it represents the vertical axis of the matrix. Since we are trying to map hierarchical data, there can be siblings of the current row that logically belong together.
Those siblings can share RowVectors, which share a single value across multiple siblings. Furthermore, there is also the information (line number) about where within the XML file the data of a given row originates. This information is primarily used when a difference is found, to give the user a hint where to look within the XML file.
-
add_sibling
(sibling)¶ Add a row that logically is related to the current row.
- Parameters
sibling (
Row
) – The row that is the sibling of the current row
-
add_vector
(vector)¶ Add a vector to the axis (row/column).
- Parameters
vector (
Vector
) – A vector that should be added to the list
- Returns
True
in case the vector was added to the axis, False
(default) in case the given vector was not a valid object.
- Return type
bool
-
cleanup
()¶ Functionality that should release all the resources currently connected to the row.
This is basically a functionality to avoid unnecessary memory consumption
-
clone
(level, xml_file_line_number)¶ Clone the current row.
- Parameters
level –
xml_file_line_number –
- Returns
A cloned row object
- Return type
Row
-
get_key
()¶ Return the unique key/id of the axis (column/row).
- Returns
A string representing the unique key/id of the axis
- Return type
str
-
get_matrix
()¶ Return the matrix of the axis.
- Returns
Matrix of the axis
- Return type
-
get_reference_lines
()¶ Return a string representing the starting and closing XML lines from which the current XML row originates.
- Returns
String [<start>:<stop>] that represents the starting and stopping XML element
- Return type
str
-
get_vector
(column)¶ Return the vector that is assigned to the given column (within the current row).
- Parameters
column (
Column
) – The column whose status has to be updated
- Returns
The vector that represents the given column, or None in case the column could not be found
- Return type
bool
-
has_values
()¶ Indicate if any of the given vectors of the axis contains valid (size>0) strings.
- Returns
True
in case at least a single vector contains valid data, False
(default) otherwise.
- Return type
bool
-
is_completly_handled
()¶ Indicate if every column within the current row has been handled (set).
- Returns
True in case all columns have been set, False in case columns are still missing
- Return type
bool
-
remove_vector
(vector)¶ Remove a given vector from the list of vectors (in case it is found).
- Parameters
vector (
Vector
) – Instance of the vector that should be removed from this axis
- Returns
True
in case the vector was removed, False
(default) in case the vector was not found/removed.
- Return type
bool
-
set_column_handled
(column, state=True)¶ Set the 'handled' status of the given column.
- Parameters
column (
Column
) – The column whose status has to be updated
state (
bool
) – Flag to indicate if the column has already been handled: True for yes, False for no
-
set_data
(column, data, level, xml_file_line_number, index)¶ Map a hierarchical XML value/string into the CSV structure.
- Parameters
column (
Column
) – The column the data belongs to
data (
str
) – The data that belongs to the column
level (
int
) – The hierarchical level of the column within the XML
xml_file_line_number (
int
) – The line number within the XML the data originated from
index (
int
) – The attribute index in case the value is from an attribute
-
was_column_already_handled
(column, data=None)¶ Indicate if the given column (within the current row) was already handled.
Was the value of that column already set? If that is the case, this is an indicator that this row has to be cloned.
- Parameters
column (
str
) – The name/id of the column that has to be checked
data (
str
) – The optional value of the cell that belongs to that row/column
- Returns
True in case the column was already handled and the data matches, False (default) in case the column was not handled
- Return type
bool
-
-
class
RowVector
(row, column, value, line_number=-1)¶ Bases:
ncdiff.reader.xml.XML2CSVReader.Vector
A row vector represents data that is shared by multiple rows for a single column/tag/element.
-
add_line_number
(line_number)¶ Add a line number the current vector originates from.
- Parameters
line_number (
int
) – Line number from within the XML file
-
add_line_numbers
(line_numbers)¶ Add all the line numbers the current vector originates from.
- Parameters
line_numbers (
int
) – Line numbers from within the XML file
-
add_row
(row)¶ Add a
Row
to the vector's list of rows.
- Parameters
row (
Row
) – A row that should be added
-
cleanup
()¶ Cleanup, release all existing references, so that the garbage collector can pick up this object.
-
get_column
()¶ Get the column this vector belongs to.
- Returns
The vector's column
- Return type
Column
-
get_context
()¶ Get the context of the vector.
- Returns
The column this vector belongs to
- Return type
Column
-
get_data
()¶ Get the data that represents the matrix cell's value.
- Returns
The value of the vector
- Return type
str
-
get_line_numbers
()¶ Get all the line numbers the current vector originates from.
- Returns
The line numbers from the originating XML file
- Return type
list
of int
-
get_max_line_number
()¶ Get the highest line number the current vector originates from.
- Returns
The line number from the originating XML file
- Return type
int
-
get_min_line_number
()¶ Get the lowest line number the current vector originates from.
- Returns
The line number from the originating XML file
- Return type
int
-
get_rows
()¶ Get the rows this vector belongs to.
- Returns
The vector's rows
- Return type
list
of Row
-
-
class
Vector
(value, line_number)¶ Bases:
object
Base class representing a vector within a matrix.
In its simplest form it represents the data of a single coordinate within the matrix. A further feature of the vector is that it contains information about the origin of the data it represents from within the XML file (line number).
-
add_line_number
(line_number)¶ Add a line number the current vector originates from.
- Parameters
line_number (
int
) – Line number from within the XML file
-
add_line_numbers
(line_numbers)¶ Add all the line numbers the current vector originates from.
- Parameters
line_numbers (
int
) – Line numbers from within the XML file
-
get_data
()¶ Get the data that represents the matrix cell's value.
- Returns
The value of the vector
- Return type
str
-
get_line_numbers
()¶ Get all the line numbers the current vector originates from.
- Returns
The line numbers from the originating XML file
- Return type
list
of int
-
get_max_line_number
()¶ Get the highest line number the current vector originates from.
- Returns
The line number from the originating XML file
- Return type
int
-
get_min_line_number
()¶ Get the lowest line number the current vector originates from.
- Returns
The line number from the originating XML file
- Return type
int
-
-
class
Xml2CsvAnalyser
¶ Bases:
xml.sax.handler.ContentHandler
Analyse an XML structure before the CSV translation.
The XML2CSVAnalyser represents a mini implementation of the XML2CSVTranslation that is used up front to analyse the data/element/tag structure of a given XML. This is needed because of the lazy-writing feature of the matrix. Since it is quite possible that certain elements/columns occur only at the very end of an XML, we have to know of their existence before starting to write the CSV file.
Therefore the analyser quickly runs through the XML to collect the list of all existing elements. It does not consider any data, but since it has the information about all the elements, the columns of the XML2CSV matrix can be initialised before the matrix is fed with the actual data when the XML is parsed again with the translator.
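A first pass of this kind can be sketched with the standard library's `xml.sax` module. The handler below is a minimal illustration, not the ncdiff analyser: the names `ElementNameCollector` and `collect_column_names` are hypothetical, and using the bare tag name as the column name is an assumption.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ElementNameCollector(ContentHandler):
    """Hypothetical analyser: records each element name once, ignoring data."""

    def __init__(self):
        super().__init__()
        self.column_names = []

    def startElement(self, tag, attributes):
        # Only the existence of the element matters in this pass.
        if tag not in self.column_names:
            self.column_names.append(tag)


def collect_column_names(xml_text):
    # Run the whole document through the SAX parser once.
    handler = ElementNameCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.column_names
```

The resulting list would then be available to initialise the columns before a second parse feeds the actual data.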
-
characters
(content)¶ Receive notification of character data.
The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.
-
endDocument
()¶ Receive notification of the end of a document.
The SAX parser will invoke this method only once, and it will be the last method invoked during the parse. The parser shall not invoke this method until it has either abandoned parsing (because of an unrecoverable error) or reached the end of input.
-
endElement
(tag)¶ SAX 2 parser interface method that is called for every closing XML element that is found during XML parsing.
- Parameters
tag (
str
) – The name of the XML element
-
endElementNS
(name, qname)¶ Signals the end of an element in namespace mode.
The name parameter contains the name of the element type, just as with the startElementNS event.
-
endPrefixMapping
(prefix)¶ End the scope of a prefix-URI mapping.
See startPrefixMapping for details. This event will always occur after the corresponding endElement event, but the order of endPrefixMapping events is not otherwise guaranteed.
-
get_column_names
()¶ Return a list of all the columns found during analysis.
- Returns
The list of found columns
- Return type
list
of Column
-
ignorableWhitespace
(whitespace)¶ Receive notification of ignorable whitespace in element content.
Validating Parsers must use this method to report each chunk of ignorable whitespace (see the W3C XML 1.0 recommendation, section 2.10): non-validating parsers may also use this method if they are capable of parsing and using content models.
SAX parsers may return all contiguous whitespace in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.
-
processingInstruction
(target, data)¶ Receive notification of a processing instruction.
The Parser will invoke this method once for each processing instruction found: note that processing instructions may occur before or after the main document element.
A SAX parser should never report an XML declaration (XML 1.0, section 2.8) or a text declaration (XML 1.0, section 4.3.1) using this method.
-
setDocumentLocator
(locator)¶ Called by the parser to give the application a locator for locating the origin of document events.
SAX parsers are strongly encouraged (though not absolutely required) to supply a locator: if it does so, it must supply the locator to the application by invoking this method before invoking any of the other methods in the DocumentHandler interface.
The locator allows the application to determine the end position of any document-related event, even if the parser is not reporting an error. Typically, the application will use this information for reporting its own errors (such as character content that does not match an application’s business rules). The information returned by the locator is probably not sufficient for use with a search engine.
Note that the locator will return correct information only during the invocation of the events in this interface. The application should not attempt to use it at any other time.
-
skippedEntity
(name)¶ Receive notification of a skipped entity.
The Parser will invoke this method once for each entity skipped. Non-validating processors may skip entities if they have not seen the declarations (because, for example, the entity was declared in an external DTD subset). All processors may skip external entities, depending on the values of the http://xml.org/sax/features/external-general-entities and the http://xml.org/sax/features/external-parameter-entities properties.
-
startDocument
()¶ Receive notification of the beginning of a document.
The SAX parser will invoke this method only once, before any other methods in this interface or in DTDHandler (except for setDocumentLocator).
-
startElement
(tag, attributes)¶ SAX 2 parser interface method that is called for every starting XML element that is found during parsing.
- Parameters
tag (
str
) – The name of the XML element
attributes (
dict
) – Key-value pairs representing all the XML attributes of the XML element
-
startElementNS
(name, qname, attrs)¶ Signals the start of an element in namespace mode.
The name parameter contains the name of the element type as a (uri, localname) tuple, the qname parameter the raw XML 1.0 name used in the source document, and the attrs parameter holds an instance of the Attributes class containing the attributes of the element.
The uri part of the name tuple is None for elements which have no namespace.
-
startPrefixMapping
(prefix, uri)¶ Begin the scope of a prefix-URI Namespace mapping.
The information from this event is not necessary for normal Namespace processing: the SAX XML reader will automatically replace prefixes for element and attribute names when the http://xml.org/sax/features/namespaces feature is true (the default).
There are cases, however, when applications need to use prefixes in character data or in attribute values, where they cannot safely be expanded automatically; the start/endPrefixMapping event supplies the information to the application to expand prefixes in those contexts itself, if necessary.
Note that start/endPrefixMapping events are not guaranteed to be properly nested relative to each-other: all startPrefixMapping events will occur before the corresponding startElement event, and all endPrefixMapping events will occur after the corresponding endElement event, but their order is not guaranteed.
-
-
class
Xml2CsvTranslator
(reader, xml_file_name, columns, csv_file_name=None)¶ Bases:
xml.sax.handler.ContentHandler
The XML2CSV Translator is responsible for feeding all the XML/element/tag data into the mapping matrix.
The data itself is enriched with information about its location within the XML structure (line number, hierarchical level, element it belongs to, …). It is basically built around an XML SAX parser event framework that allows huge XML files to be parsed without consuming much memory. This low-memory approach is also followed in the mapping Matrix itself, since by design it is already possible to retrieve finished CSV rows while still reading/feeding data from the XML file.
-
CLMN_XML_LINENUMBERS
= 'TOIGNORE:XmlFileLines'¶
-
TRANSLATION
= {9: None, 10: None}¶
-
characters
(data)¶ SAX 2 parser interface method that is called for every value belonging to an XML element.
Note: This method might be called multiple times for a single XML element value! Therefore all the data has to be concatenated.
- Parameters
data (
str
) – String representing either parts or the whole value of the currently handled XML element.
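The concatenation requirement noted above can be shown with a small SAX handler. This is a sketch under assumptions, not the translator's code: `TextCollector` and its `values` list are hypothetical names, and whitespace stripping is chosen only for the example.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler that joins all character chunks of an element."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self.values = []   # (tag, text) pairs in closing order

    def startElement(self, tag, attributes):
        # A new element starts with an empty chunk buffer.
        self._chunks = []

    def characters(self, data):
        # May be called several times per element value, so only buffer here.
        self._chunks.append(data)

    def endElement(self, tag):
        # The complete value is known only when the element closes.
        self.values.append((tag, "".join(self._chunks).strip()))
        self._chunks = []
```

Handling the value in `endElement` rather than in `characters` keeps the result independent of how the parser splits the text.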
-
close
()¶ SAX 2 parser interface method that is called when the end of the XML document is reached.
-
endDocument
()¶ Receive notification of the end of a document.
The SAX parser will invoke this method only once, and it will be the last method invoked during the parse. The parser shall not invoke this method until it has either abandoned parsing (because of an unrecoverable error) or reached the end of input.
-
endElement
(tag)¶ SAX 2 parser interface method that is called for every closing XML element that is found during parsing.
This method triggers the XML2CSV transformation for the value and attributes of this XML element.
- Parameters
tag (
str
) – The name of the XML element
-
endElementNS
(name, qname)¶ Signals the end of an element in namespace mode.
The name parameter contains the name of the element type, just as with the startElementNS event.
-
endPrefixMapping
(prefix)¶ End the scope of a prefix-URI mapping.
See startPrefixMapping for details. This event will always occur after the corresponding endElement event, but the order of endPrefixMapping events is not otherwise guaranteed.
-
get_buffered_row
()¶ Retrieve all the buffered rows that have already been transformed into a CSV-ready structure.
- Returns
A list of strings representing the transformed XML data
- Return type
list
of str
-
has_buffered_rows
()¶ Show if the algorithm already produced new rows that can already be forwarded to the CSV file writer.
- Returns
Number of rows that are ready for processing
- Return type
bool
-
ignorableWhitespace
(whitespace)¶ Receive notification of ignorable whitespace in element content.
Validating Parsers must use this method to report each chunk of ignorable whitespace (see the W3C XML 1.0 recommendation, section 2.10): non-validating parsers may also use this method if they are capable of parsing and using content models.
SAX parsers may return all contiguous whitespace in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.
-
processingInstruction
(target, data)¶ Receive notification of a processing instruction.
The Parser will invoke this method once for each processing instruction found: note that processing instructions may occur before or after the main document element.
A SAX parser should never report an XML declaration (XML 1.0, section 2.8) or a text declaration (XML 1.0, section 4.3.1) using this method.
-
setDocumentLocator
(locator)¶ Called by the parser to give the application a locator for locating the origin of document events.
SAX parsers are strongly encouraged (though not absolutely required) to supply a locator: if it does so, it must supply the locator to the application by invoking this method before invoking any of the other methods in the DocumentHandler interface.
The locator allows the application to determine the end position of any document-related event, even if the parser is not reporting an error. Typically, the application will use this information for reporting its own errors (such as character content that does not match an application’s business rules). The information returned by the locator is probably not sufficient for use with a search engine.
Note that the locator will return correct information only during the invocation of the events in this interface. The application should not attempt to use it at any other time.
-
set_input_line_number
(line_number)¶ Set the line number, within the XML that is read, of the currently handled XML element.
- Parameters
line_number (
int
) – The XML file line number
-
skippedEntity
(name)¶ Receive notification of a skipped entity.
The Parser will invoke this method once for each entity skipped. Non-validating processors may skip entities if they have not seen the declarations (because, for example, the entity was declared in an external DTD subset). All processors may skip external entities, depending on the values of the http://xml.org/sax/features/external-general-entities and the http://xml.org/sax/features/external-parameter-entities properties.
-
startDocument
()¶ Receive notification of the beginning of a document.
The SAX parser will invoke this method only once, before any other methods in this interface or in DTDHandler (except for setDocumentLocator).
-
startElement
(tag, attributes)¶ SAX 2 parser interface method that is called for every starting XML element that is found during parsing.
Basically, all the data that is provided is buffered to be handled when reaching the corresponding XML end call.
- Parameters
tag (
str
) – The name of the XML element
attributes (
dict
) – Key-value pairs representing all the XML attributes of the XML element
-
startElementNS
(name, qname, attrs)¶ Signals the start of an element in namespace mode.
The name parameter contains the name of the element type as a (uri, localname) tuple, the qname parameter the raw XML 1.0 name used in the source document, and the attrs parameter holds an instance of the Attributes class containing the attributes of the element.
The uri part of the name tuple is None for elements which have no namespace.
-
startPrefixMapping
(prefix, uri)¶ Begin the scope of a prefix-URI Namespace mapping.
The information from this event is not necessary for normal Namespace processing: the SAX XML reader will automatically replace prefixes for element and attribute names when the http://xml.org/sax/features/namespaces feature is true (the default).
There are cases, however, when applications need to use prefixes in character data or in attribute values, where they cannot safely be expanded automatically; the start/endPrefixMapping event supplies the information to the application to expand prefixes in those contexts itself, if necessary.
Note that start/endPrefixMapping events are not guaranteed to be properly nested relative to each-other: all startPrefixMapping events will occur before the corresponding startElement event, and all endPrefixMapping events will occur after the corresponding endElement event, but their order is not guaranteed.
-
-
close
()¶ Close the input file.
-
get_information
()¶ Get generic information about the given file, such as file size, number of columns, and column names.
- Returns
A dictionary containing the column names and the number of columns
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the file for reading in universal newlines mode.
-
reader
()¶ Get the flat/table structured data of the XML file.
This method returns complete CSV-compatible data rows that have been created by the XML data transformation. Because it uses yield, those data rows can be accessed on a "row by row" basis instead of a "give me all at the end" basis. This has the advantage that the memory consumption of the whole XML transformation to CSV is very low.
- Returns
A list of strings that represent a CSV data row
- Return type
list
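The "row by row" behaviour can be sketched with a generator that feeds an incremental SAX parser in chunks and yields rows as soon as the handler has buffered them. This is an illustration only: `RowBufferingHandler`, `iter_rows`, and the assumption that every `<row>` element maps to one CSV row are hypothetical and much simpler than the matrix-based mapping described in this module.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class RowBufferingHandler(ContentHandler):
    """Hypothetical handler: each <row> element becomes one list of cells."""

    def __init__(self):
        super().__init__()
        self.buffered_rows = []
        self._cells = None
        self._chunks = []

    def startElement(self, tag, attributes):
        if tag == "row":
            self._cells = []
        self._chunks = []

    def characters(self, data):
        self._chunks.append(data)

    def endElement(self, tag):
        if tag == "row":
            self.buffered_rows.append(self._cells)
            self._cells = None
        elif self._cells is not None:
            self._cells.append("".join(self._chunks).strip())
        self._chunks = []


def iter_rows(stream, chunk_size=64):
    """Yield finished rows while the XML is still being read."""
    handler = RowBufferingHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
        # Hand over every row that was completed by this chunk.
        while handler.buffered_rows:
            yield handler.buffered_rows.pop(0)
    parser.close()
    while handler.buffered_rows:
        yield handler.buffered_rows.pop(0)
```

Because rows leave the buffer as soon as they are complete, memory use depends on the chunk size and the largest row rather than on the size of the XML file.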
-
Module contents¶
Reader package.
Defines the publicly visible API by specifying the __all__ list.
- since
2020-04-16
-
ncdiff.reader.
create_reader_from_module
(reader_name, module_name, *args, **kwargs)¶ Try to load the reader class from the specified module.
- Parameters
reader_name (
str
) – The type of reader to create.
*args – arguments passed to the constructor of the reader class.
**kwargs – keyword arguments passed to the constructor of the reader class.
-
ncdiff.reader.
create_old_reader
(target_configuration)¶ Create the reader for the old input file according to the passed configuration.
Note: Row filters must be passed to the reader if filtering of rows is supported!
- Parameters
target_configuration (
TargetConfiguration
) – The configuration of the diff target.
- Returns
A reader implementation to read the old input file.
- Return type
BaseReader
-
ncdiff.reader.
create_new_reader
(target_configuration)¶ Create the reader for the new input file according to the passed configuration.
Note: Row filters must be passed to the reader if filtering of rows is supported!
- Parameters
target_configuration (
TargetConfiguration
) – The configuration of the diff target.
- Returns
A reader implementation to read the new input file.
- Return type
BaseReader
-
ncdiff.reader.
create_old_sorted_file_reader
(target_configuration)¶ Create the reader for the old sorted file.
- Parameters
target_configuration (
TargetConfiguration
) – The configuration of the diff target.
-
ncdiff.reader.
create_new_sorted_file_reader
(target_configuration)¶ Create the reader for the new sorted file.
- Parameters
target_configuration (
TargetConfiguration
) – The configuration of the diff target.
-
class
ncdiff.reader.
BaseReader
(has_header, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
object
Abstract base class that defines the reader interface.
All readers support the context manager protocol (with statement).
-
close
()¶ Close the location.
This method never fails.
-
get_information
()¶ Get information from data.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names and the number of columns:
'column_names': None
'columnCount': 11
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the location.
-
reader
()¶ Return a reader.
A reader is expected to return all cell values in string format with any leading and trailing white space characters removed.
- Returns
A reader supporting the iterator protocol.
- Return type
iterable
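The reader interface listed above (open, close, reader, plus the context manager protocol) can be pictured with a small stand-in class. This is a sketch and not the ncdiff `BaseReader`: `SketchReader`, its `rows` argument, and the `is_open` flag are hypothetical names introduced only for illustration.

```python
class SketchReader:
    """Hypothetical reader following the open/close/reader interface."""

    def __init__(self, rows):
        self._rows = rows
        self.is_open = False

    def open(self):
        self.is_open = True

    def close(self):
        # Mirrors the documented contract that closing never fails.
        self.is_open = False

    def reader(self):
        # Yield each row with leading and trailing white space removed.
        for row in self._rows:
            yield [cell.strip() for cell in row]

    def __enter__(self):
        self.open()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False   # do not swallow exceptions raised inside the block
```

With such a class, `with SketchReader(rows) as source:` opens the location on entry and closes it on exit, even if iterating over `source.reader()` raises.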
-
-
class
ncdiff.reader.
FileReader
(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.BaseReader
Abstract base class implementing support to read files.
-
close
()¶ Close the input file.
-
get_information
()¶ Get information from a file.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns and the size of the input file in bytes:
'column_names': None
'columnCount': 11
'size': 123456
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the file for reading in universal newlines mode.
-
reader
()¶ Return a reader.
A reader is expected to return all cell values in string format with any leading and trailing white space characters removed.
- Returns
A reader supporting the iterator protocol.
- Return type
iterable
-
-
class
ncdiff.reader.
CSVReader
(has_header, input_path, delimiter=';', replace_delimiters=None, quoting='QUOTE_MINIMAL', quotechar='"', doublequote=True, escapechar=None, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.FileReader
Read csv files.
-
close
()¶ Close the input file.
-
get_information
()¶ Get information from a file.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns and the size of the input file in bytes:
'column_names': None
'columnCount': 11
'size': 123456
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the file for reading in universal newlines mode.
-
reader
()¶ Return the CSV reader.
- Returns
A CSV reader for the CSV file.
- Return type
csv.reader
-
-
class
ncdiff.reader.
CSVVARReader
(*args, **kwargs)¶ Bases:
ncdiff.reader.csv.CSVReader
A csv reader that can cope with columns of different length.
-
close
()¶ Close the input file.
-
get_information
()¶ Get information from a file.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns and the size of the input file in bytes:
'column_names': None
'columnCount': 11
'size': 123456
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the file for reading in universal newlines mode.
-
reader
()¶ Return the CSV reader.
- Returns
A CSV reader for the CSV file.
- Return type
csv.reader
-
-
class
ncdiff.reader.
DirReader
(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.BaseReader
Reader for recursively comparing filesystem structures.
This reader generically extracts information about filesystem objects (files, directories, links) and forwards this information in the form of structured data records that can be written into CSV files. Since all the functionality for comparing CSV files is already in place, existing features such as filters, tolerances, result filters, and data type mapping can be used to compare the filesystem data. Since the primary key of the filesystem comparison always needs to be ABSPATH + NAME, different kinds of comparison can be achieved.
-
close
()¶ Empty close method.
Since the directory reader does not handle single files this function has no functionality within the DIR reader.
-
get_information
()¶ General information/statistics that might have been gathered by this reader.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns and the size of the input file in bytes.
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Empty open method.
Since the directory reader does not handle single files this function has no functionality within the DIR reader.
-
reader
()¶ Gather all the information about the filesystem objects.
This method contains the main functionality of the DIR reader. It recursively loops through all the directories underneath the defined BaseDirectory. For each file, link, or directory that is found on the way down to the last directory, an according
FileInfo
object is created and its information is finally forwarded to a CSV writer.
- Returns
A list of strings representing the complete information of the filesystem objects
- Return type
list
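A recursive traversal of this kind can be sketched with `os.walk` from the standard library. The function below is an illustration, not the DIR reader: `iter_filesystem_records` and the record layout (absolute path, name, kind, size) are assumptions, and the real `FileInfo` object may carry different fields.

```python
import os


def iter_filesystem_records(base_directory):
    """Yield one list of strings per file, link, or directory found."""
    for dirpath, dirnames, filenames in os.walk(base_directory):
        dirnames.sort()   # makes the traversal order deterministic
        for name in sorted(dirnames + filenames):
            path = os.path.join(dirpath, name)
            # Check for links first so symlinked directories count as links.
            if os.path.islink(path):
                kind = "link"
            elif os.path.isdir(path):
                kind = "directory"
            else:
                kind = "file"
            size = os.lstat(path).st_size
            yield [os.path.abspath(path), name, kind, str(size)]
```

Each yielded record is already a flat row of strings, so it can be passed directly to a CSV writer.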
-
-
class
ncdiff.reader.
FixedWidthReader
(has_header, input_path, columns, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.FileReader
Implementation of reading text files with columns defined by the number of characters.
-
close
()¶ Close the input file.
-
get_information
()¶ Get information from a file.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns and the size of the input file in bytes:
'column_names': None
'columnCount': 11
'size': 123456
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the file for reading in universal newlines mode.
-
reader
()¶ Iterate through the passed file stream and return a single line on every call.
- Returns
Iterable with column entries
- Return type
array
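Splitting a line into columns by character count can be sketched as follows. This is an illustration only: `split_fixed_width` is a hypothetical helper, and passing the column definition as a list of widths is an assumption, since the format of the `columns` argument is not described here.

```python
def split_fixed_width(line, widths):
    """Cut a text line into cells, each width given in characters."""
    cells = []
    start = 0
    for width in widths:
        # Slice out the column and drop the padding around the value.
        cells.append(line[start:start + width].strip())
        start += width
    return cells
```

For a line built from "Alice" padded to 10 characters, "Wonderland" padded to 12, and "8", calling the helper with widths `[10, 12, 3]` gives the three unpadded values.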
-
-
class
ncdiff.reader.
JSON2CSVReader
(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.FileReader
JSON2CSVReader transforms JSON messages into a CSV file.
The goal was to find a generic implementation that transforms a JSON file into CSV in order to use the already existing (CSV) process for comparing files instead of having to write a new complex algorithm to diff structured hierarchical data.
This basic implementation is able to handle JSON where data is stored on different hierarchical levels. Arrays of JSON objects are only supported on the topmost level, but nested objects (that is, dictionaries of dictionaries) are fully supported. It is a generic implementation, where the name of the element becomes the name of the resulting CSV column. For nested objects the dot syntax is used to build the column name. For instance:
{"name": "Alice", "location": {"country": "Wonderland", "street": "Rabbit Hole"}, "age": "8"}
will result in the columns:
name | location.country | location.street | age
Alice | Wonderland | Rabbit Hole | 8
Since it is completely integrated into the existing NCDiff framework it fully supports the filter, tolerance, data type features of the framework.
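The dot syntax for nested objects can be illustrated with a small sketch. The `flatten` function below is a hypothetical helper written for this example; it is not the reader's actual implementation and ignores arrays.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dictionaries into one dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into the nested object, extending the dotted name.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


record = json.loads(
    '{"name": "Alice", "location": {"country": "Wonderland", '
    '"street": "Rabbit Hole"}, "age": "8"}'
)
row = flatten(record)
print(" | ".join(row))           # column names
print(" | ".join(row.values()))  # column values
```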
-
close
()¶ Close the input file.
-
get_information
()¶ Get generic information about the given file, such as file size, number of columns, and column names.
- Returns
A dictionary containing the column names and the number of columns
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the file and determine the column names.
-
reader
()¶ Get the flat/table structured data of the JSON file.
-
class
ncdiff.reader.
SQLReader
(has_header, connection_string, database_driver, query_string, fetch_size=500, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.BaseReader
Read SQL tables.
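How a `fetch_size` parameter is typically used can be sketched with the standard DB-API `fetchmany` call. This is an assumption about the meaning of `fetch_size` (rows fetched per round trip); the sketch uses an in-memory `sqlite3` database instead of the `connection_string` and `database_driver` arguments, and `iter_rows` is a hypothetical helper.

```python
import sqlite3


def iter_rows(cursor, fetch_size=500):
    """Yield rows from a DB-API cursor, fetching them in batches."""
    while True:
        batch = cursor.fetchmany(fetch_size)
        if not batch:
            break
        for row in batch:
            yield list(row)


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE person (name TEXT, age INTEGER)")
connection.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("Alice", 8), ("Bob", 9), ("Carol", 10)],
)
cursor = connection.execute("SELECT name, age FROM person ORDER BY name")
# The cursor description carries the column names of the result set.
column_names = [description[0] for description in cursor.description]
rows = list(iter_rows(cursor, fetch_size=2))
connection.close()
print(column_names, rows)
```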
-
close
()¶ Close the database connection.
-
get_information
()¶ Get information from data.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names and the number of columns, for example:
'column_names': None
'columnCount': 11
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the database connection.
-
reader
()¶ Return a SQL reader.
- Returns
A CSV reader for the CSV file.
- Return type
csv.reader
-
-
class
ncdiff.reader.
SWIFTReader
(has_header, input_path, quotechar='"', doublequote=True, delimiter=';', replace_delimiters=None, escapechar=None, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.FileReader
Read SWIFT files.
-
close
()¶ Close the input file.
-
get_information
()¶ Get information from a file.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns, and the size of the input file in bytes, for example:
'column_names': None
'columnCount': 11
'size': 123456
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the file for reading in universal newlines mode.
-
reader
()¶ Return a SWIFT reader.
- Returns
A SWIFT reader for the SWIFT file.
- Return type
swift.reader
-
-
class
ncdiff.reader.
TARReader
(has_header, input_path, compression_type, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.FileReader
The TAR file reader.
The main purpose of this class is to read a TAR file and write an overview of its content into a CSV file that can be compared to a similar TAR content file. The background of this reader is that many "packages" are either TAR or ZIP containers that contain lots of other files. To get an overview of the content of such a container, a corresponding reader had to be written. There is currently support for 3 different kinds of TAR files:
plain/uncompressed: tar -> use TAR as <new|old>FileFormat
GZIP compressed tar: tar.gz -> use TAR.GZ as <new|old>FileFormat
BZIP2 compressed tar: tar.bz2 -> use TAR.BZ2 as <new|old>FileFormat
-
class
TarMemberInfo
(tar_info_object)¶ Bases:
ncdiff.utils.FileUtils.FileInfo
TarMemberInfo is a specialization of the FileUtils.FileInfo object.
The main difference here is that the information about the files/directories is not retrieved directly from the file system; it is read out of the tar container itself.
-
class
Type
¶ Bases:
object
Little helper class for distinguishing different filesystem object types.
-
DIRECTORY
= 'Directory'¶
-
FILE
= 'File'¶
-
LINK
= 'Link'¶
-
-
get_abs_path
()¶ Get the absolute (full) path of the filesystem object within the filesystem.
- Returns
Absolute path of the
FileInfo
object
- Return type
str
-
get_access_time
(dt_format='%Y-%m-%d %H:%M:%S')¶ Get the timestamp when the filesystem object was accessed the last time.
- Parameter
dt_format (
str
) – The optional timestamp format of the resulting string
- Returns
A string representing the filesystem object's last access time.
- Return type
str
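The `dt_format` argument uses the usual `strftime` directives. A minimal sketch of the formatting step, with a hypothetical `format_timestamp` helper standing in for the getter:

```python
from datetime import datetime

DEFAULT_FORMAT = "%Y-%m-%d %H:%M:%S"  # same default as in the signature above


def format_timestamp(timestamp, dt_format=DEFAULT_FORMAT):
    """Render a datetime as a string using strftime directives."""
    return timestamp.strftime(dt_format)


accessed = datetime(2012, 2, 10, 13, 5, 9)
default_text = format_timestamp(accessed)
date_only = format_timestamp(accessed, "%Y-%m-%d")
print(default_text, date_only)
```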
-
get_creation_time
(dt_format='%Y-%m-%d %H:%M:%S')¶ Get the timestamp when the filesystem object was created.
- Parameter
dt_format (
str
) – The optional timestamp format of the resulting string
- Returns
A string representing the filesystem object's creation time
- Return type
str
-
get_depth
()¶ Get the depth of the filesystem object relative to the BaseDirectory.
- Returns
A number representing the depth level relative to the root/BaseDirectory
- Return type
int
-
get_extension
()¶ Get the file's extension.
All characters to the right of the very last dot '.' within the file's base name.
- Returns
The file's extension
- Return type
str
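The rule "everything to the right of the last dot" can be sketched as below. `file_extension` is a hypothetical stand-in; how the real method treats names without a dot is not documented, so returning an empty string is an assumption.

```python
def file_extension(base_name):
    """Return the characters to the right of the last dot in a base name."""
    if "." not in base_name:
        return ""  # assumption: no dot means no extension
    return base_name.rsplit(".", 1)[1]


print(file_extension("archive.tar.gz"))
print(file_extension("README"))
```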
-
get_group
()¶ Get the ID of the group that owns the file system object.
- Returns
ID of the group that owns the file
- Return type
int
-
get_hash_value
()¶ Get the hash digest representing the hash value of the filesystem object.
- Returns
A string representing the hash digest
- Return type
str
-
get_mode
()¶ Get the file system permission bits.
- Returns
File system permission info.
- Return type
int
-
get_modification_time
(dt_format='%Y-%m-%d %H:%M:%S')¶ Get the timestamp when the filesystem object was modified the last time.
- Parameter
dt_format (
str
) – The optional timestamp format of the resulting string.
- Returns
A string representing the filesystem object's last modification time.
- Return type
str
-
get_name
()¶ Return the basename of the filesystem object.
- Returns
Basename of the
FileInfo
object
- Return type
str
-
get_permissions
()¶ Get a string representing the file permissions in Unix format, e.g. "rwxr-x--x".
- Returns
A string that represents the file permissions in POSIX format.
- Return type
str
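The relation between the integer mode from `get_mode` and the permission string can be sketched as follows. `permission_string` is a hypothetical helper that only looks at the nine lowest permission bits; the real method may differ in detail.

```python
def permission_string(mode):
    """Translate the nine permission bits of a mode into 'rwxrwxrwx' form."""
    flags = "rwxrwxrwx"
    return "".join(
        # Bit 8 is owner-read, bit 0 is other-execute.
        flag if mode & (1 << (8 - position)) else "-"
        for position, flag in enumerate(flags)
    )


print(permission_string(0o751))
print(permission_string(0o644))
```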
-
get_rel_path
()¶ Get the path of the filesystem object relative to the given BaseDirectory.
- Returns
Relative path of the
FileInfo
object
- Return type
str
-
get_size
()¶ Get the size of the file.
- Returns
Size of the file
- Return type
int
-
get_type
()¶ Get the file system object type this object represents.
- Returns
The type of filesystem object
- Return type
FileInfo.Type
-
get_user
()¶ Get the id of the user owning the file system object.
- Returns
ID of the user that owns the file.
- Return type
int
-
is_directory
()¶ Check if this
FileInfo
is a directory.
- Returns
True if the
FileInfo
represents a directory, False in all other cases
- Return type
bool
-
is_file
()¶ Check if this
FileInfo
is a file.
- Returns
True if the
FileInfo
represents a file, False in all other cases
- Return type
bool
-
is_link
()¶ Check if this
FileInfo
is a link.
- Returns
True if the
FileInfo
represents a link, False in all other cases
- Return type
bool
-
set_access_time
(access_time)¶ Set the last accessed time of the file system object.
- Parameter
access_time (
datetime
) – The last access timestamp of the filesystem object
-
set_creation_time
(creation_time)¶ Set the creation time of the file system object.
- Parameter
creation_time (
datetime
) – The creation timestamp of the filesystem object
-
set_depth
(depth)¶ Set the depth of the filesystem object relative to the BaseDirectory specified in the constructor.
- Parameter
depth (
int
) – The depth/level of the filesystem objects relative to the root/BaseDirectory
-
set_file_system_object
(tar_info_object, base_directory=None)¶ Setter for the TarInfo file descriptor.
- Parameter
tar_info_object (
tarfile.TarInfo
) – Kind of file descriptor object from a TarFile.
-
set_group
(group_id)¶ Set the id of the group the owner belongs to.
- Parameter
group_id (
int
) – The group id within the filesystem.
-
set_hash_value
(hash_value)¶ Set the hashdigest that represents this
FileInfo
object.
- Parameter
hash_value (
str
) – A string, preferably computed with FileUtils.calc_hash
-
set_mode
(mode)¶ Set the file system permissions bits of the current filesystem object.
- Parameter
mode (
int
) – File system permission
-
set_modification_time
(modification_time)¶ Set the last modification time of the file system object.
- Parameter
modification_time (
datetime
) – The last modification timestamp of the filesystem object.
-
set_name
(file_path, base_directory=None)¶ Set the name of the TarInfo from a file path.
- Parameter
file_path (
str
) – Location/path within the tar file
base_directory (
str
) – Optional base path; this has to be stripped from the absolute path to get the relative one.
-
set_size
(size)¶ Set the size of the file.
- Parameter
size (
int
) – Size of the filesystem object
-
set_type
(type_)¶ Set the type of the file system object this
FileInfo
object represents: File, Directory, or Link.
- Parameter
type_ (
FileInfo.Type
) – Type of the FileInfo object
-
set_user
(user_id)¶ Set the id of the user that owns the filesystem object.
- Parameter
user_id (
int
) – The users id within the filesystem.
-
-
close
()¶ Close the input file.
-
get_information
()¶ Get information from a file.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns, and the size of the input file in bytes, for example:
'column_names': None
'columnCount': 11
'size': 123456
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ TAR file open/read functionality.
This method has some fallbacks in case the actual compression format does not match the specified one. The fallback sequence is:
[1] generic transparent compression
[2] GZIP compression
[3] 'explicit' NO compression
[4] BZIP2 compression
If the file cannot be opened, an error is raised.
- Raises
IOError – either if the file is already being read by another process or if none of the compression formats match the file.
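A fallback sequence of this kind can be sketched with the standard `tarfile` module. The mapping of the four steps to the `tarfile` modes `r:*`, `r:gz`, `r:` and `r:bz2` is an assumption, and `open_tar_with_fallback` is a hypothetical helper rather than the reader's own code.

```python
import os
import tarfile
import tempfile

# Assumed mapping of the documented fallback steps to tarfile modes.
FALLBACK_MODES = ("r:*", "r:gz", "r:", "r:bz2")


def open_tar_with_fallback(path):
    """Try each mode in turn and return the first TarFile that opens."""
    for mode in FALLBACK_MODES:
        try:
            return tarfile.open(path, mode)
        except tarfile.TarError:
            continue  # wrong compression for this mode, try the next one
    raise IOError("Cannot open %r with any known TAR compression" % path)


with tempfile.TemporaryDirectory() as tmp:
    payload = os.path.join(tmp, "hello.txt")
    with open(payload, "w") as handle:
        handle.write("hello")
    archive = os.path.join(tmp, "sample.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(payload, arcname="hello.txt")

    with open_tar_with_fallback(archive) as tar:
        names = tar.getnames()

    # A file that is not a TAR archive at all exhausts every fallback.
    bogus = os.path.join(tmp, "bogus.tar")
    with open(bogus, "w") as handle:
        handle.write("not a tar file")
    try:
        open_tar_with_fallback(bogus)
        failed = False
    except IOError:
        failed = True

print(names, failed)
```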
-
reader
()¶ Gather all the information about the objects in the tar file.
This method contains the main functionality of the TAR reader. It recursively walks through all the directories underneath the root. For each file, link, or directory found on the way down to the last directory, a corresponding
TarFileInfo
object is created and its information is finally forwarded to a CSV writer.
- Returns
A list of strings representing the complete filesystem object information
- Return type
list
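Collecting one row per archive member can be sketched with `tarfile` alone. The columns chosen here (name, type, size) and the `describe_members` helper are assumptions for illustration; the real reader records more attributes through its info objects.

```python
import io
import tarfile


def describe_members(tar):
    """Build one list of strings per member of an open TarFile."""
    rows = []
    for member in tar.getmembers():
        if member.isdir():
            kind = "Directory"
        elif member.issym() or member.islnk():
            kind = "Link"
        else:
            kind = "File"
        rows.append([member.name, kind, str(member.size)])
    return rows


# Build a small uncompressed archive in memory: one directory, one file.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    directory = tarfile.TarInfo("docs")
    directory.type = tarfile.DIRTYPE
    tar.addfile(directory)
    data = b"hello"
    member = tarfile.TarInfo("docs/readme.txt")
    member.size = len(data)
    tar.addfile(member, io.BytesIO(data))

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:") as tar:
    rows = describe_members(tar)
print(rows)
```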
-
class
ncdiff.reader.
XLSReader
(has_header, input_path, worksheet_name=None, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.BaseReader
Implementation of reading the XLS files.
-
close
()¶ Free the worksheet.
-
get_information
()¶ Get XLS information.
Note: The dictionary values may be None if the file is empty or has no header.
- Returns
A dictionary containing the column names, the number of columns, and the size of the input file in bytes, for example:
'column_names': None
'columnCount': 11
'size': 123456
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the xls file for reading.
-
reader
()¶ Iterate through the passed worksheet and return a single line on each call.
- Returns
iterable with column entries
- Return type
array
-
-
class
ncdiff.reader.
XML2CSVReader
(has_header, input_path, row_filter=None, append_row_nr=False, encoding=None, column_names=None)¶ Bases:
ncdiff.reader.base.FileReader
XML2CSVReader is a kind of mini XSLT transformation of an XML file into a CSV file.
The goal was to find a generic implementation that transforms an XML file into CSV, so that the already existing (CSV) process for comparing files can be used instead of writing a new, complex algorithm to diff structured hierarchical data.
This basic implementation is able to handle XML files where data is stored on different hierarchical levels. It is a generic implementation, where the name of the element/tag will be the name of the resulting CSV column.
Since it is completely integrated into the existing NCDiff framework it fully supports the filter, tolerance, data type features of the framework.
-
class
AttributeColumn
(matrix, key, level, index=0)¶ Bases:
ncdiff.reader.xml.XML2CSVReader.ElementColumn
Represents an XML attribute column.
A further specialization on top of the ElementColumn is the AttributeColumn that would represent a single XML attribute of an XML element. In addition to all the information that is available on the element/tag level, there is also the index of the attribute within the element.
This feature is currently not fully supported/implemented.
-
add_vector
(vector)¶ Add a vector to the axis (row/column).
- Parameter
vector (
Vector
) – A vector that should be added to the list
- Returns
True
if the vector was added to the axis,
False
(default) if the given vector was not a valid object.
- Return type
bool
-
cleanup
()¶ Functionality that should release all the resources currently connected to the row.
This is basically a functionality to avoid unnecessary memory consumption
-
get_index
()¶ Get the index of the underlying XML attribute within the XML element.
- Returns
The attribute index within the XML element
- Return type
int
-
get_key
()¶ Get the unique id/name of the column.
- Returns
The column's id
- Return type
str
-
get_level
()¶ Get the hierarchical level of the underlying XML element within the XML file.
- Returns
The hierarchical level of the underlying XML element
- Return type
int
-
get_matrix
()¶ Return the matrix of the axis.
- Returns
Matrix of the Axis
- Return type
-
has_values
()¶ Indicate if any of the given vectors of the axis contains valid (size>0) strings.
- Returns
True
if at least one vector contains valid data,
False
(default) otherwise.
- Return type
bool
-
remove_vector
(vector)¶ Remove a given vector from the list of vectors (in case it is found).
- Parameter
vector (
Vector
) – Instance of the vector that should be removed from this axis
- Returns
True
if the vector was removed,
False
(default) if the vector was not found/removed.
- Return type
bool
-
-
class
Axis
(matrix, key)¶ Bases:
object
Base class of all the rows and columns that belong to the mapping matrix.
Its main purpose is to provide the ID/Key of a given axis (Column/Row) and a list of all the items/vectors that belong to this axis.
-
add_vector
(vector)¶ Add a vector to the axis (row/column).
- Parameter
vector (
Vector
) – A vector that should be added to the list
- Returns
True
if the vector was added to the axis,
False
(default) if the given vector was not a valid object.
- Return type
bool
-
get_key
()¶ Return the unique key/id of the axis (column/row).
- Returns
A string representing the unique key/id of the axis
- Return type
str
-
get_matrix
()¶ Return the matrix of the axis.
- Returns
Matrix of the Axis
- Return type
-
has_values
()¶ Indicate if any of the given vectors of the axis contains valid (size>0) strings.
- Returns
True
if at least one vector contains valid data,
False
(default) otherwise.
- Return type
bool
-
remove_vector
(vector)¶ Remove a given vector from the list of vectors (in case it is found).
- Parameter
vector (
Vector
) – Instance of the vector that should be removed from this axis
- Returns
True
if the vector was removed,
False
(default) if the vector was not found/removed.
- Return type
bool
-
-
class
CellVector
(row, column, value, line_number=- 1)¶ Bases:
ncdiff.reader.xml.XML2CSVReader.Vector
Represent the data that belongs to a single coordinate (combination of row and column) within the matrix.
-
add_line_number
(line_number)¶ Add a line number the current vector originates from.
- Parameter
line_number (
int
) – Line number from within the XML file
-
add_line_numbers
(line_numbers)¶ Add all the line numbers the current vector originates from.
- Parameter
line_numbers (
int
) – Line number from within the XML file
-
cleanup
()¶ Cleanup, release all existing references, so that the garbage collector can pick up this object.
-
get_column
()¶ Get the column this vector belongs to.
- Returns
The vector's column
- Return type
Column
-
get_context
()¶ Get the context of the vector.
- Returns
The column this vector belongs to
- Return type
Column
-
get_data
()¶ Get the data that represents the matrix cells value.
- Returns
The value of the vector
- Return type
str
-
get_line_numbers
()¶ Get all the line numbers the current vector originates from.
- Returns
The line numbers from the originating XML file
- Return type
list of int
-
get_max_line_number
()¶ Get the highest line number the current vector originates from.
- Returns
The line number from the originating XML file
- Return type
int
-
get_min_line_number
()¶ Get the lowest line number the current vector originates from.
- Returns
The line number from the originating XML file
- Return type
int
-
get_row
()¶ Get the row this vector belongs to.
- Returns
The vector's row
- Return type
Row
-
-
class
Column
(matrix, key)¶ Bases:
ncdiff.reader.xml.XML2CSVReader.Axis
Columns from an XML.
Class that represents, on the one hand, the collection of all the tags from the XML file and, on the other hand, the list of columns that define the structure of the CSV file. Since it forms the horizontal axis of the matrix, it is derived from the Axis class.
-
add_vector
(vector)¶ Add a vector to the axis (row/column).
- Parameter
vector (
Vector
) – A vector that should be added to the list
- Returns
True
if the vector was added to the axis,
False
(default) if the given vector was not a valid object.
- Return type
bool
-
cleanup
()¶ Functionality that should release all the resources currently connected to the row.
This is basically a functionality to avoid unnecessary memory consumption
-
get_key
()¶ Return the unique key/id of the axis (column/row).
- Returns
A string representing the unique key/id of the axis
- Return type
str
-
get_matrix
()¶ Return the matrix of the axis.
- Returns
Matrix of the Axis
- Return type
-
has_values
()¶ Indicate if any of the given vectors of the axis contains valid (size>0) strings.
- Returns
True
if at least one vector contains valid data,
False
(default) otherwise.
- Return type
bool
-
remove_vector
(vector)¶ Remove a given vector from the list of vectors (in case it is found).
- Parameter
vector (
Vector
) – Instance of the vector that should be removed from this axis
- Returns
True
if the vector was removed,
False
(default) if the vector was not found/removed.
- Return type
bool
-
-
class
ElementColumn
(matrix, key, level)¶ Bases:
ncdiff.reader.xml.XML2CSVReader.Column
Represents an XML element column.
A specialization of the basic Column is the ElementColumn. It also contains information (level) about its hierarchical depth. This is especially important because XML allows tags with the same name to be reused throughout the whole structure.
-
add_vector
(vector)¶ Add a vector to the axis (row/column).
- Parameter
vector (
Vector
) – A vector that should be added to the list
- Returns
True
if the vector was added to the axis,
False
(default) if the given vector was not a valid object.
- Return type
bool
-
cleanup
()¶ Functionality that should release all the resources currently connected to the row.
This is basically a functionality to avoid unnecessary memory consumption
-
get_key
()¶ Get the unique id/name of the column.
- Returns
The column's id
- Return type
str
-
get_level
()¶ Get the hierarchical level of the underlying XML element within the XML file.
- Returns
The hierarchical level of the underlying XML element
- Return type
int
-
get_matrix
()¶ Return the matrix of the axis.
- Returns
Matrix of the Axis
- Return type
-
has_values
()¶ Indicate if any of the given vectors of the axis contains valid (size>0) strings.
- Returns
True
if at least one vector contains valid data,
False
(default) otherwise.
- Return type
bool
-
remove_vector
(vector)¶ Remove a given vector from the list of vectors (in case it is found).
- Parameter
vector (
Vector
) – Instance of the vector that should be removed from this axis
- Returns
True
if the vector was removed,
False
(default) if the vector was not found/removed.
- Return type
bool
-
-
class
Matrix
(columns=None)¶ Bases:
object
Basic container that holds two axes.
The horizontal axis forms all the columns of the CSV file, whereas the vertical axis forms all the rows of the CSV file.
The object itself acts as pure data storage. When parsing the XML, all the information about a given element (or attribute in the future) is fed into the matrix. The matrix itself knows how to map the given information into its internal structure to finally transform the hierarchical data of the XML into the table-based data structure of the CSV.
A further feature of the matrix is that it is designed for lazy loading/writing. This means that, to save memory, it will not contain all the rows of the XML/CSV. Whenever a row is fully populated, it can be removed from the matrix. Therefore the matrix contains a list of finished rows that can be extracted from outside. This is by design, since on the input side the matrix gets line-by-line data from an XML parser and on the output side there is a file writer that creates a CSV.
-
add_attributes
(tag, attributes, level, xml_file_line_number)¶ Map a list of XML attributes into the CSV structure.
- Parameter
tag (
str
) – The XML element the attributes belong to
attributes (
str
) – All the XML element attributes as key-value pairs
level (
int
) – The hierarchical level of the XML element
xml_file_line_number (
int
) – The originating XML file line number of the XML element
-
add_data
(tag, data, level, xml_file_line_number, index=0)¶ Map the value of an XML element into the CSV structure.
- Parameter
tag (
str
) – The XML element the data belongs to
data (
str
) – The XML element's value
level (
int
) – The hierarchical level of the XML element
xml_file_line_number (
int
) – The originating XML file line number of the XML element
index (
int
) –
-
cleanup
()¶ Functionality that should release all the resources currently connected to the row.
This is basically a functionality to avoid unnecessary memory consumption.
-
get_column_max_level
()¶ Return the maximum column level.
-
get_columns
()¶ Get all columns belonging to the matrix.
- Returns
All columns of the matrix
- Return type
list of Column
-
get_new_row
()¶ Create a new
Row
object.
- Returns
A new row object
- Return type
Row
-
get_rows
()¶ Get all rows belonging to the matrix.
- Returns
All rows of the matrix
- Return type
list of Row
-
-
class
Row
(matrix, key)¶ Bases:
ncdiff.reader.xml.XML2CSVReader.Axis
Class that contains all the items/vectors of a single CSV line.
It is derived from the Axis class because it represents the vertical axis of the matrix. Since we are trying to map hierarchical data, there can be siblings of the current row that logically belong together.
Those siblings can share RowVectors, which hold a single value across multiple siblings. Furthermore, there is also the information (line number) about where within the XML file the data of a given row originates. This information is primarily used when a difference is found, to give the user a hint where to look within the XML file.
-
add_sibling
(sibling)¶ Add a row that logically is related to the current row.
- Parameter
sibling (
Row
) – The row that is the sibling of the current row
-
add_vector
(vector)¶ Add a vector to the axis (row/column).
- Parameter
vector (
Vector
) – A vector that should be added to the list
- Returns
True
if the vector was added to the axis,
False
(default) if the given vector was not a valid object.
- Return type
bool
-
cleanup
()¶ Functionality that should release all the resources currently connected to the row.
This is basically a functionality to avoid unnecessary memory consumption
-
clone
(level, xml_file_line_number)¶ Clone the current row.
- Parameter
level –
xml_file_line_number –
- Returns
A cloned row object
- Return type
Row
-
get_key
()¶ Return the unique key/id of the axis (column/row).
- Returns
A string representing the unique key/id of the axis
- Return type
str
-
get_matrix
()¶ Return the matrix of the axis.
- Returns
Matrix of the Axis
- Return type
-
get_reference_lines
()¶ Return a string representing the starting and closing XML lines from which the current row originates.
- Returns
String [<start>:<stop>] that represents the starting and stopping XML element
- Return type
str
-
get_vector
(column)¶ Return the vector that is assigned to the given column (within the current row).
- Parameter
column (
Column
) – The column whose vector should be returned
- Returns
The vector that represents the given column, or None if the column could not be found
- Return type
bool
-
has_values
()¶ Indicate if any of the given vectors of the axis contains valid (size>0) strings.
- Returns
True
if at least one vector contains valid data,
False
(default) otherwise.
- Return type
bool
-
is_completly_handled
()¶ Indicate if every column within the current row has been handled (set).
- Returns
True if all columns have been set, False if columns are still missing
- Return type
bool
-
remove_vector
(vector)¶ Remove a given vector from the list of vectors (in case it is found).
- Parameter
vector (
Vector
) – Instance of the vector that should be removed from this axis
- Returns
True
if the vector was removed,
False
(default) if the vector was not found/removed.
- Return type
bool
-
set_column_handled
(column, state=True)¶ Set the 'handled' status of the given column.
- Parameter
column (
Column
) – The column whose status has to be updated
state (
bool
) – Flag to indicate whether the column has already been handled (True: yes, False: no)
-
set_data
(column, data, level, xml_file_line_number, index)¶ Map a hierarchical XML value/string into the CSV structure.
- Parameter
column (
Column
) – The column the data belongs to
data (
str
) – The data that belongs to the column
level (
int
) – The hierarchical level of the column within the XML
xml_file_line_number (
int
) – The line number within the XML the data originated from
index (
int
) – The attribute index in case the value is from an attribute
-
was_column_already_handled
(column, data=None)¶ Indicate if the given column (within the current row) was already handled.
Was the value of that column already set? If so, this indicates that the row has to be cloned.
- Parameter
column (
str
) – The name/id of the column that has to be checked
data (
str
) – The optional value of the cell that belongs to that row/column
- Returns
True if the column was already handled and the data matches, False (default) if the column was not handled
- Return type
bool
-
-
class
RowVector
(row, column, value, line_number=- 1)¶ Bases:
ncdiff.reader.xml.XML2CSVReader.Vector
A row vector represents data that is shared by multiple rows for a single column/tag/element.
-
add_line_number
(line_number)¶ Add a line number the current vector originates from.
- Parameter
line_number (
int
) – Line number from within the XML file
-
add_line_numbers
(line_numbers)¶ Add all the line numbers the current vector originates from.
- Parameter
line_numbers (
int
) – Line number from within the XML file
-
add_row
(row)¶ Add a
Row
to the vector's list of rows.
- Parameter
row (
Row
) – A row that should be added
-
cleanup
()¶ Cleanup, release all existing references, so that the garbage collector can pick up this object.
-
get_column
()¶ Get the column this vector belongs to.
- Returns
The vector's column
- Return type
Column
-
get_context
()¶ Get the context of the vector.
- Returns
The column this vector belongs to
- Return type
Column
-
get_data
()¶ Get the data that represents the matrix cells value.
- Returns
The value of the vector
- Return type
str
-
get_line_numbers
()¶ Get all the line numbers the current vector originates from.
- Returns
The line numbers from the originating XML file
- Return type
list of int
-
get_max_line_number
()¶ Get the highest line number the current vector originates from.
- Returns
The line number from the originating XML file
- Return type
int
-
get_min_line_number
()¶ Get the lowest line number the current vector originates from.
- Returns
The line number from the originating XML file
- Return type
int
-
get_rows
()¶ Get the rows this vector belongs to.
- Returns
The vector's rows
- Return type
list of Row
-
-
class
Vector
(value, line_number)¶ Bases:
object
Base class representing a vector within a matrix.
In its simplest form it represents the data of a single coordinate within the matrix. A further feature of the vector is that it contains information about the origin of the data it represents from within the XML file (line number).
-
add_line_number
(line_number)¶ Add a line number the current vector originates from.
- Parameter
line_number (
int
) – Line number from within the XML file
-
add_line_numbers
(line_numbers)¶ Add all the line numbers the current vector originates from.
- Parameter
line_numbers (
int
) – Line number from within the XML file
-
get_data
()¶ Get the data that represents the matrix cells value.
- Returns
The value of the vector
- Return type
str
-
get_line_numbers
()¶ Get all the line numbers the current vector originates from.
- Returns
The line numbers from the originating XML file
- Return type
list of int
-
get_max_line_number
()¶ Get the highest line number the current vector originates from.
- Returns
The line number from the originating XML file
- Return type
int
-
get_min_line_number
()¶ Get the lowest line number the current vector originates from.
- Returns
The line number from the originating XML file
- Return type
int
-
-
class
Xml2CsvAnalyser
¶ Bases:
xml.sax.handler.ContentHandler
Analyse an XML structure before the CSV translation.
The XML2CSVAnalyser is a mini implementation of the XML2CSVTranslation that is used up front to analyse the data/element/tag structure of a given XML. This is needed because of the lazy-writing feature of the matrix. Since certain elements/columns may occur only at the very end of an XML, we have to know of their existence before starting to write the CSV file.
Therefore the analyser will quickly rush through the XML to collect the list of all existing elements. It does not consider any data, but since it has the information about all the elements the columns of the XML2CSV matrix can be initialised before starting to feed the matrix with the actual data when parsing the XML again with the translator.
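Such a first pass can be sketched with a small SAX handler that records each distinct element together with its level and ignores all character data. `ElementCollector` and its `(tag, level)` column representation are assumptions made for this illustration, not the analyser's actual data model.

```python
from xml.sax import ContentHandler, parseString


class ElementCollector(ContentHandler):
    """Collect every distinct (tag, level) pair in document order."""

    def __init__(self):
        super().__init__()
        self._level = 0
        self.columns = []

    def startElement(self, tag, attributes):
        column = (tag, self._level)
        if column not in self.columns:
            self.columns.append(column)
        self._level += 1

    def endElement(self, tag):
        self._level -= 1


document = (
    b"<people>"
    b"<person><name>Alice</name><age>8</age></person>"
    b"<person><name>Bob</name><city>Wonderland</city></person>"
    b"</people>"
)
collector = ElementCollector()
parseString(document, collector)
print(collector.columns)
```

Note that `city` only appears in the second record, which is the situation the up-front analysis is meant to catch.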
-
characters
(content)¶ Receive notification of character data.
The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.
-
endDocument
()¶ Receive notification of the end of a document.
The SAX parser will invoke this method only once, and it will be the last method invoked during the parse. The parser shall not invoke this method until it has either abandoned parsing (because of an unrecoverable error) or reached the end of input.
-
endElement
(tag)¶ SAX 2 parser interface method that is called for every closing XML element that is found during XML parsing.
- Parameters
tag (str) – The name of the XML element
-
endElementNS
(name, qname)¶ Signals the end of an element in namespace mode.
The name parameter contains the name of the element type, just as with the startElementNS event.
-
endPrefixMapping
(prefix)¶ End the scope of a prefix-URI mapping.
See startPrefixMapping for details. This event will always occur after the corresponding endElement event, but the order of endPrefixMapping events is not otherwise guaranteed.
-
get_column_names
()¶ Return a list of all the columns found during analysis.
- Returns
The list of found columns
- Return type
list of Column
-
ignorableWhitespace
(whitespace)¶ Receive notification of ignorable whitespace in element content.
Validating Parsers must use this method to report each chunk of ignorable whitespace (see the W3C XML 1.0 recommendation, section 2.10): non-validating parsers may also use this method if they are capable of parsing and using content models.
SAX parsers may return all contiguous whitespace in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.
-
processingInstruction
(target, data)¶ Receive notification of a processing instruction.
The Parser will invoke this method once for each processing instruction found: note that processing instructions may occur before or after the main document element.
A SAX parser should never report an XML declaration (XML 1.0, section 2.8) or a text declaration (XML 1.0, section 4.3.1) using this method.
-
setDocumentLocator
(locator)¶ Called by the parser to give the application a locator for locating the origin of document events.
SAX parsers are strongly encouraged (though not absolutely required) to supply a locator: if it does so, it must supply the locator to the application by invoking this method before invoking any of the other methods in the DocumentHandler interface.
The locator allows the application to determine the end position of any document-related event, even if the parser is not reporting an error. Typically, the application will use this information for reporting its own errors (such as character content that does not match an application’s business rules). The information returned by the locator is probably not sufficient for use with a search engine.
Note that the locator will return correct information only during the invocation of the events in this interface. The application should not attempt to use it at any other time.
-
skippedEntity
(name)¶ Receive notification of a skipped entity.
The Parser will invoke this method once for each entity skipped. Non-validating processors may skip entities if they have not seen the declarations (because, for example, the entity was declared in an external DTD subset). All processors may skip external entities, depending on the values of the http://xml.org/sax/features/external-general-entities and the http://xml.org/sax/features/external-parameter-entities properties.
-
startDocument
()¶ Receive notification of the beginning of a document.
The SAX parser will invoke this method only once, before any other methods in this interface or in DTDHandler (except for setDocumentLocator).
-
startElement
(tag, attributes)¶ SAX 2 parser interface method that is called for every starting XML element that is found during parsing.
- Parameters
tag (str) – The name of the XML element
attributes (dict) – Key-value pairs representing all the XML attributes of the XML element
-
startElementNS
(name, qname, attrs)¶ Signals the start of an element in namespace mode.
The name parameter contains the name of the element type as a (uri, localname) tuple, the qname parameter the raw XML 1.0 name used in the source document, and the attrs parameter holds an instance of the Attributes class containing the attributes of the element.
The uri part of the name tuple is None for elements which have no namespace.
-
startPrefixMapping
(prefix, uri)¶ Begin the scope of a prefix-URI Namespace mapping.
The information from this event is not necessary for normal Namespace processing: the SAX XML reader will automatically replace prefixes for element and attribute names when the http://xml.org/sax/features/namespaces feature is true (the default).
There are cases, however, when applications need to use prefixes in character data or in attribute values, where they cannot safely be expanded automatically; the start/endPrefixMapping event supplies the information to the application to expand prefixes in those contexts itself, if necessary.
Note that start/endPrefixMapping events are not guaranteed to be properly nested relative to each-other: all startPrefixMapping events will occur before the corresponding startElement event, and all endPrefixMapping events will occur after the corresponding endElement event, but their order is not guaranteed.
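The namespace-mode callbacks can be observed with a small experiment. The sketch below uses only the standard `xml.sax` module and switches the namespaces feature on explicitly; the handler name `NamespaceLogger` and the sample document are assumptions made for illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class NamespaceLogger(ContentHandler):
    """Hypothetical handler that records namespace-related events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startPrefixMapping(self, prefix, uri):
        self.events.append(("prefix", prefix, uri))

    def startElementNS(self, name, qname, attrs):
        # name is a (uri, localname) tuple; uri is None without a namespace.
        self.events.append(("start", name))


handler = NamespaceLogger()
parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)  # enable namespace mode explicitly
parser.setContentHandler(handler)
parser.parse(io.BytesIO(b'<d:doc xmlns:d="urn:demo"><plain/></d:doc>'))

for event in handler.events:
    print(event)
```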
-
-
class
Xml2CsvTranslator
(reader, xml_file_name, columns, csv_file_name=None)¶ Bases:
xml.sax.handler.ContentHandler
The XML2CSV Translator is responsible for feeding all the XML/element/tag data into the mapping matrix.
The data itself is enriched with information about its location within the XML structure (line number, hierarchical level, element it belongs to, …). It is basically built around an XML SAX parser event framework that makes it possible to parse huge XML files without consuming much memory. The mapping matrix follows the same low-memory approach: by design, finished CSV rows can be retrieved while data is still being read from the XML file and fed into the matrix.
-
CLMN_XML_LINENUMBERS
= 'TOIGNORE:XmlFileLines'¶
-
TRANSLATION
= {9: None, 10: None}¶
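The documentation does not state how TRANSLATION is used. Its shape matches a `str.translate` table, where code points mapped to `None` are deleted; 9 is the tab character and 10 is the line feed. Assuming that use, a minimal sketch looks like this (the sample value is made up):

```python
# Same mapping as the documented constant: delete tab (9) and line feed (10).
TRANSLATION = {9: None, 10: None}

raw_value = "first\tsecond\nthird"
cleaned = raw_value.translate(TRANSLATION)
print(cleaned)  # firstsecondthird
```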
-
characters
(data)¶ SAX 2 parser interface method that is called for every value belonging to an XML element.
NOTE: This method might be called multiple times for a single XML element value! Therefore all the data has to be concatenated.
- Parameters
data (str) – String representing either parts or the whole value of the currently handled XML element.
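The concatenation requirement can be met by collecting the chunks and joining them when the element ends. The following is a minimal sketch with the standard `xml.sax` module, not the translator's actual code; the class `TextCollector` is hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler that joins character chunks per element."""

    def __init__(self):
        super().__init__()
        self._chunks = []  # pieces of the value currently being read
        self.values = []   # (tag, complete value) pairs

    def startElement(self, tag, attributes):
        self._chunks = []

    def characters(self, data):
        # May be called several times for one value, so only collect here.
        self._chunks.append(data)

    def endElement(self, tag):
        # The value is complete only once the closing tag is reached.
        self.values.append((tag, "".join(self._chunks)))
        self._chunks = []


handler = TextCollector()
xml.sax.parseString(b"<a><b>x &amp; y</b></a>", handler)
print(handler.values[0])
```

Text around an entity reference such as `&amp;` is a typical case in which a parser may deliver the value in several pieces.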
-
close
()¶ SAX 2 parser interface method that is called when the end of the XML document is reached.
-
endDocument
()¶ Receive notification of the end of a document.
The SAX parser will invoke this method only once, and it will be the last method invoked during the parse. The parser shall not invoke this method until it has either abandoned parsing (because of an unrecoverable error) or reached the end of input.
-
endElement
(tag)¶ SAX 2 parser interface method that is called for every closing XML element that is found during parsing.
This method triggers the XML2CSV transformation for the value and attributes of this XML element.
- Parameters
tag (str) – The name of the XML element
-
endElementNS
(name, qname)¶ Signals the end of an element in namespace mode.
The name parameter contains the name of the element type, just as with the startElementNS event.
-
endPrefixMapping
(prefix)¶ End the scope of a prefix-URI mapping.
See startPrefixMapping for details. This event will always occur after the corresponding endElement event, but the order of endPrefixMapping events is not otherwise guaranteed.
-
get_buffered_row
()¶ Retrieve all the buffered rows that have already been transformed into a CSV-ready structure.
- Returns
A list of strings representing the transformed XML data
- Return type
list of str
-
has_buffered_rows
()¶ Show whether the algorithm has already produced new rows that can be forwarded to the CSV file writer.
- Returns
True if rows are ready for processing
- Return type
bool
-
ignorableWhitespace
(whitespace)¶ Receive notification of ignorable whitespace in element content.
Validating Parsers must use this method to report each chunk of ignorable whitespace (see the W3C XML 1.0 recommendation, section 2.10): non-validating parsers may also use this method if they are capable of parsing and using content models.
SAX parsers may return all contiguous whitespace in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.
-
processingInstruction
(target, data)¶ Receive notification of a processing instruction.
The Parser will invoke this method once for each processing instruction found: note that processing instructions may occur before or after the main document element.
A SAX parser should never report an XML declaration (XML 1.0, section 2.8) or a text declaration (XML 1.0, section 4.3.1) using this method.
-
setDocumentLocator
(locator)¶ Called by the parser to give the application a locator for locating the origin of document events.
SAX parsers are strongly encouraged (though not absolutely required) to supply a locator: if it does so, it must supply the locator to the application by invoking this method before invoking any of the other methods in the DocumentHandler interface.
The locator allows the application to determine the end position of any document-related event, even if the parser is not reporting an error. Typically, the application will use this information for reporting its own errors (such as character content that does not match an application’s business rules). The information returned by the locator is probably not sufficient for use with a search engine.
Note that the locator will return correct information only during the invocation of the events in this interface. The application should not attempt to use it at any other time.
-
set_input_line_number
(line_number)¶ Set the line number, within the XML file being read, of the currently handled XML element.
- Parameters
line_number (int) – The XML file line number
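The documentation does not say where the caller gets the line number from. One possible source is the SAX locator passed to setDocumentLocator. The sketch below shows that approach with the standard `xml.sax` module; the class `LineNumberHandler` is hypothetical and this is not necessarily how ncdiff does it.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class LineNumberHandler(ContentHandler):
    """Hypothetical handler that records the line of each start tag."""

    def __init__(self):
        super().__init__()
        self._locator = None
        self.lines = []  # (tag, line number) pairs

    def setDocumentLocator(self, locator):
        # Keep the locator; it is only valid while events are being delivered.
        self._locator = locator

    def startElement(self, tag, attributes):
        self.lines.append((tag, self._locator.getLineNumber()))


handler = LineNumberHandler()
xml.sax.parseString(b"<a>\n  <b/>\n</a>", handler)
print(handler.lines)
```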
-
skippedEntity
(name)¶ Receive notification of a skipped entity.
The Parser will invoke this method once for each entity skipped. Non-validating processors may skip entities if they have not seen the declarations (because, for example, the entity was declared in an external DTD subset). All processors may skip external entities, depending on the values of the http://xml.org/sax/features/external-general-entities and the http://xml.org/sax/features/external-parameter-entities properties.
-
startDocument
()¶ Receive notification of the beginning of a document.
The SAX parser will invoke this method only once, before any other methods in this interface or in DTDHandler (except for setDocumentLocator).
-
startElement
(tag, attributes)¶ SAX 2 parser interface method that is called for every starting XML element that is found during parsing.
All the data that is provided is buffered and handled when the corresponding XML end call is reached.
- Parameters
tag (str) – The name of the XML element
attributes (dict) – Key-value pairs representing all the XML attributes of the XML element
-
startElementNS
(name, qname, attrs)¶ Signals the start of an element in namespace mode.
The name parameter contains the name of the element type as a (uri, localname) tuple, the qname parameter the raw XML 1.0 name used in the source document, and the attrs parameter holds an instance of the Attributes class containing the attributes of the element.
The uri part of the name tuple is None for elements which have no namespace.
-
startPrefixMapping
(prefix, uri)¶ Begin the scope of a prefix-URI Namespace mapping.
The information from this event is not necessary for normal Namespace processing: the SAX XML reader will automatically replace prefixes for element and attribute names when the http://xml.org/sax/features/namespaces feature is true (the default).
There are cases, however, when applications need to use prefixes in character data or in attribute values, where they cannot safely be expanded automatically; the start/endPrefixMapping event supplies the information to the application to expand prefixes in those contexts itself, if necessary.
Note that start/endPrefixMapping events are not guaranteed to be properly nested relative to each-other: all startPrefixMapping events will occur before the corresponding startElement event, and all endPrefixMapping events will occur after the corresponding endElement event, but their order is not guaranteed.
-
-
close
()¶ Close the input file.
-
get_information
()¶ Get generic information about the given file, such as file size, number of columns, and column names.
- Returns
A dictionary containing the column names and the number of columns
- Return type
dict
-
has_header
()¶ Check if data has a header line.
- Returns
True if the first line of a reader is a header.
- Return type
bool
-
open
()¶ Open the file for reading in universal newlines mode.
-
reader
()¶ Get the flat/table structured data of the XML file.
This method returns complete, CSV-compatible data rows that have been created by the XML data transformation. Because the rows are yielded, they can be accessed on a "row by row" basis instead of a "give me all at the end" basis. This keeps the memory consumption of the whole XML-to-CSV transformation very low.
- Returns
A list of strings that represent a CSV data row
- Return type
list
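The row-by-row idea can be sketched with a generator that feeds the XML to an incremental SAX parser in pieces and yields finished rows as soon as a handler has buffered them. This is an assumption-laden illustration, not the ncdiff code: `RowBuffer`, `read_rows`, and the `<row>` layout are hypothetical, and the two handler methods merely borrow the names `has_buffered_rows` and `get_buffered_row` from the translator documented above.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class RowBuffer(ContentHandler):
    """Hypothetical handler: turns each <row> element into a list of strings."""

    def __init__(self):
        super().__init__()
        self._rows = []    # finished rows not yet handed out
        self._row = None   # row currently being built
        self._chunks = []  # character chunks of the current value

    def startElement(self, tag, attributes):
        if tag == "row":
            self._row = []
        self._chunks = []

    def characters(self, data):
        self._chunks.append(data)

    def endElement(self, tag):
        if tag == "row":
            self._rows.append(self._row)
            self._row = None
        elif self._row is not None:
            self._row.append("".join(self._chunks))
        self._chunks = []

    def has_buffered_rows(self):
        return bool(self._rows)

    def get_buffered_row(self):
        # Hand out all finished rows and empty the buffer.
        rows, self._rows = self._rows, []
        return rows


def read_rows(chunks):
    """Yield rows while the XML is still being fed to the parser."""
    handler = RowBuffer()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)
        if handler.has_buffered_rows():
            yield from handler.get_buffered_row()
    parser.close()
    # Rows completed only when the parser was closed are yielded last.
    if handler.has_buffered_rows():
        yield from handler.get_buffered_row()


chunks = [b"<rows><row><a>1</a><b>x</b></row>", b"<row><a>2</a><b>y</b></row></rows>"]
rows = list(read_rows(chunks))
```

Only the rows that are finished but not yet consumed are held in memory, which is the property the documentation attributes to the yield-based reader.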
-
class