ArrayProcessing#
Tools for parallel processing of large arrays.
Note
This module provides an interface for working with large numpy arrays and speeds up numpy routines that become very slow for data arrays above 100-500 GB in size.
The implementation builds on the buffer interface used by Cython.
- apply_lut(source, lut, sink=None, blocks=None, processes=None, verbose=False)[source]#
Transforms the source via a lookup table.
Arguments
- source : array
The source array.
- lut : array
The lookup table.
- sink : array or None
The result array; if None, an array is created.
- blocks : int or None
Number of blocks to use; if None, use blocks = default_blocks_per_process * processes.
- processes : None or int
Number of processes to use; if None, use the number of CPUs.
- verbose : bool
If True, print progress information.
Returns
- sink : array
The source transformed via the lookup table.
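A minimal usage sketch, assuming the module is importable as ArrayProcessing (the import alias and data values are illustrative):

```python
import numpy as np
import ArrayProcessing as ap  # import path is an assumption; adjust to your installation

# A label image with values 0..3; the LUT maps each label to a new value.
source = np.random.randint(0, 4, size=(1000, 1000), dtype='uint8')
lut = np.array([0, 10, 20, 30], dtype='uint8')

# Block-wise, parallel equivalent of lut[source].
result = ap.apply_lut(source, lut, processes=4, verbose=True)
assert np.array_equal(result, lut[source])
```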
- apply_lut_to_index(source, kernel, lut, sink=None, processes=None, verbose=False)[source]#
Correlates the source with an index kernel and returns the corresponding values of the lookup table.
Arguments
- source : array
The source array.
- kernel : array
The correlation kernel.
- lut : array
The lookup table.
- sink : array or None
The result array; if None, an array is created.
- processes : None or int
Number of processes to use; if None, use the number of CPUs.
Returns
- sink : array
The source transformed via the lookup table.
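Conceptually this computes lut[correlate(source, kernel)] block-wise. A plain numpy/scipy sketch of the equivalent computation (the dtypes and kernel layout expected by the parallel routine may differ):

```python
import numpy as np
from scipy import ndimage as ndi

binary = (np.random.rand(20, 20, 20) > 0.7).astype('int64')

# Index kernel: each voxel of the 3x3x3 neighbourhood contributes one bit,
# so the correlation encodes every neighbourhood configuration as an integer index.
kernel = (2 ** np.arange(27, dtype='int64')).reshape(3, 3, 3)

# Lookup table over all possible configurations (contents are illustrative).
lut = np.zeros(2 ** 27, dtype='uint8')

index = ndi.correlate(binary, kernel, mode='constant')
reference = lut[index]  # apply_lut_to_index computes this in parallel blocks
```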
- block_ranges(source, blocks=None, processes=None)[source]#
Ranges of evenly spaced blocks in array.
Arguments
- source : array
Source to divide into blocks.
- blocks : int or None
Number of blocks to split the array into.
- processes : None or int
Number of processes; if None, use the number of CPUs.
Returns
- block_ranges : array
List of the range boundaries.
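A small sketch of inspecting the block layout (import path is an assumption):

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

data = np.zeros(1000000)
ranges = ap.block_ranges(data, blocks=8)
# Start/stop boundaries of the 8 blocks, e.g. covering 0..125000, 125000..250000, ...
```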
- block_sums(source, blocks=None, processes=None)[source]#
Sums of evenly spaced blocks in array.
Arguments
- source : array
Array to perform the block sums on.
- blocks : int or None
Number of blocks to split the array into.
- processes : None or int
Number of processes; if None, use the number of CPUs.
Returns
- block_sums : array
Sums of the values in the different blocks.
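A sketch of block-wise summation (import path is an assumption); the per-block sums add up to the global sum:

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

data = np.ones(1000000, dtype='uint8')
sums = ap.block_sums(data, blocks=8, processes=4)
assert sums.sum() == data.sum()  # per-block sums add up to the global sum
```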
- correlate1d(source, kernel, sink=None, axis=0, processes=None, verbose=False)[source]#
Correlates the source along the given axis with a 1d kernel.
Arguments
- source : array
The source array.
- kernel : array
The 1d correlation kernel.
- sink : array or None
The result array; if None, an array is created.
- axis : int
The axis along which to correlate.
- processes : None or int
Number of processes to use; if None, use the number of CPUs.
- verbose : bool
If True, print progress information.
Returns
- sink : array
The source correlated with the kernel along the given axis.
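A sketch of applying a 1d box filter along one axis (import path is an assumption):

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

source = np.random.rand(200, 200, 200).astype('float32')
kernel = np.ones(5, dtype='float32') / 5.0  # simple 1d box filter

# Parallel analogue of scipy.ndimage.correlate1d along axis 0.
smoothed = ap.correlate1d(source, kernel, axis=0, processes=4, verbose=True)
```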
- finalize_processing(verbose=False, function=None, timer=None)[source]#
Finalize parallel array processing.
Arguments
- verbose : bool
If True, print progress information.
- function : str or None
The name of the function.
- timer : Timer or None
A processing timer.
- index_neighbours(indices, offset, processes=None)[source]#
Returns all pairs of indices that are separated by a specified offset.
Arguments
- indices : array
List of indices.
- offset : int
The offset to check for.
- processes : None or int
Number of processes; if None, use the number of CPUs.
- initialize_processing(processes=None, verbose=False, function=None, blocks=None, return_blocks=False)[source]#
Initialize parallel array processing.
Arguments
- processes : int, ‘serial’ or None
The number of processes to use. If None, use the number of CPUs.
- verbose : bool
If True, print progress information.
- function : str or None
The name of the function.
Returns
- processes : int
The number of processes.
- timer : Timer
A timer for the processing.
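A sketch of the typical initialize/finalize pattern used around a parallel routine (the function name is illustrative, the import path is an assumption):

```python
import ArrayProcessing as ap  # assumed import path

# Resolve the number of processes and start a timer for progress reporting.
processes, timer = ap.initialize_processing(processes=None, verbose=True, function='my_routine')

# ... perform the block-wise work here ...

# Report the elapsed time for the named function.
ap.finalize_processing(verbose=True, function='my_routine', timer=timer)
```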
- initialize_sink(sink=None, shape=None, dtype=None, order=None, memory=None, location=None, mode=None, source=None, return_buffer=True, as_1d=False, return_shape=False, return_strides=False)[source]#
Initialize or create a sink.
Arguments
- sink : sink specification
The sink to initialize.
- shape : tuple of int
Optional shape of the sink. If None, inferred from the source.
- dtype : str, type or None
Optional dtype of the sink. If None, inferred from the source.
- order : ‘C’, ‘F’ or None
Optional order of the sink. If None, inferred from the source.
- memory : ‘shared’ or None
If ‘shared’, create a shared memory sink.
- location : str
Optional location specification of the sink.
- source : Source or None
Optional source to infer sink specifications from.
- return_buffer : bool
If True, also return a buffer compatible with Cython memory views.
- return_shape : bool
If True, also return the shape of the sink.
- return_strides : bool
If True, also return the element strides of the sink.
Returns
- sink : Source
The initialized sink.
- buffer (optional) : array
Buffer of the sink.
- shape (optional) : tuple of int
Shape of the sink.
- strides (optional) : tuple of int
Element strides of the sink.
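A sketch of creating a sink that matches an existing source, assuming the return order follows the listing above (import path is an assumption):

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

source = np.random.rand(100, 100).astype('float32')

# Create a sink with the same shape and dtype as the source and obtain a
# buffer that Cython kernels can write into.
sink, sink_buffer = ap.initialize_sink(shape=source.shape, dtype=source.dtype,
                                       source=source, return_buffer=True)
```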
- initialize_source(source, return_buffer=True, as_1d=False, return_shape=False, return_strides=False, return_order=False)[source]#
Initialize a source buffer for parallel array processing.
Arguments
- source : source specification
The source to initialize.
- return_buffer : bool
If True, return a buffer compatible with Cython memory views.
- return_shape : bool
If True, also return the shape of the source.
- return_strides : bool
If True, also return the element strides of the source.
- return_order : bool
If True, also return the order of the source.
Returns
- source : Source
The initialized source.
- source_buffer
The initialized source as a buffer.
- shape : tuple of int
Shape of the source.
- strides : tuple of int
Element strides of the source.
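A sketch of wrapping an existing array for the parallel kernels, assuming the return order follows the listing above (import path is an assumption):

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

data = np.random.rand(100, 100)

# Obtain the source object, its Cython-compatible buffer, shape and strides.
source, buffer, shape, strides = ap.initialize_source(
    data, return_buffer=True, return_shape=True, return_strides=True)
```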
- neighbours(indices, offset, processes=None, verbose=False)[source]#
Returns all pairs in a list of indices that are separated by a specified offset.
Arguments
- indices : array
List of indices.
- offset : int
The offset to search for.
- processes : None or int
Number of processes; if None, use the number of CPUs.
- verbose : bool
If True, print progress.
Returns
- neighbours : array
List of pairs of neighbours.
Note
This function can be used to create graphs from binary images.
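A sketch of building graph edges from a binary image, as the note suggests (import path is an assumption; the exact encoding of the returned pairs may differ):

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

binary = np.random.rand(50, 50, 50) > 0.9
indices = np.flatnonzero(binary)  # sorted flat indices of the foreground voxels

# Voxels that are direct neighbours along the first axis differ by one
# plane in the flat (C-order) index.
offset = binary.shape[1] * binary.shape[2]
pairs = ap.neighbours(indices, offset, processes=4)  # candidate graph edges
```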
- read(source, sink=None, slicing=None, memory=None, blocks=None, processes=None, verbose=False, **kwargs)[source]#
Read a large array into memory in parallel.
Arguments
- source : str or Source
The source on disk to load.
- slicing : slice, tuple, or None
Optional sub-slice to read.
- memory : ‘shared’ or None
If ‘shared’, read into shared memory.
- blocks : int or None
Number of blocks to split the array into for parallel processing.
- processes : None or int
Number of processes; if None, use the number of CPUs.
- verbose : bool
If True, print information about the file to be loaded.
Returns
- sink : Source class
The read source in memory.
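A sketch of loading a large on-disk array into shared memory (the file name is illustrative, the import path is an assumption):

```python
import ArrayProcessing as ap  # assumed import path

# Read the whole file in parallel blocks into shared memory.
data = ap.read('large_volume.npy', memory='shared', verbose=True)
```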
- where(source, sink=None, blocks=None, cutoff=None, processes=None, verbose=False)[source]#
Returns the indices of the non-zero entries of the array.
Arguments
- source : array
Array to search for non-zero indices.
- sink : array or None
If not None, the result is written into this array.
- blocks : int or None
Number of blocks to split the array into for parallel processing.
- cutoff : int
Number of elements below which to switch to numpy.where.
- processes : None or int
Number of processes; if None, use the number of CPUs.
Returns
- where : array
Positions of the non-zero entries of the input array.
Note
Falls back to numpy.where if no parallel implementation matches the array dimension.
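A sketch of finding foreground positions in a large binary array (import path is an assumption):

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

binary = np.random.rand(300, 300, 300) > 0.95

# Parallel analogue of np.where / np.argwhere; arrays smaller than
# default_cutoff fall back to numpy.where internally.
positions = ap.where(binary, processes=4)
```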
- write(sink, source, slicing=None, overwrite=True, blocks=None, processes=None, verbose=False)[source]#
Write a large array to disk in parallel.
Arguments
- sink : str or Source
The sink on disk to write to.
- source : array or Source
The data to write to disk.
- slicing : slicing or None
Optional slicing of the sink to write to.
- overwrite : bool
If True, create a new file if the source specifications do not match.
- blocks : int or None
Number of blocks to split the array into for parallel processing.
- processes : None or int
Number of processes; if None, use the number of CPUs.
- verbose : bool
If True, print information about the file to be written.
Returns
- sink : Source class
The sink to which the source was written.
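A sketch of writing an in-memory array to disk in parallel blocks (the file name is illustrative, the import path is an assumption):

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

data = np.random.rand(500, 500, 100).astype('float32')

# Write the array block-wise; returns the sink that was written to.
sink = ap.write('large_volume.npy', data, processes=4, verbose=True)
```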
- default_blocks_per_process = 10#
Default number of blocks per process to split the data.
Note
10 blocks per process is a good choice.
- default_cutoff = 20000000#
Default size of array below which ordinary numpy is used.
Note
Ideally test this on your machine for different array sizes.
- default_processes = 12#
Default number of processes to use.