ArrayProcessing#
Tools for parallel processing of large arrays.
Note
This module provides an interface for working with large numpy arrays and speeds up numpy routines that become very slow for data arrays above 100-500 GB in size.
The implementation builds on the buffer interface used by Cython.
- apply_lut(source, lut, sink=None, blocks=None, processes=None, verbose=False)[source]#
Transforms the source via a lookup table.
Arguments
- source : array
The source array.
- lut : array
The lookup table.
- sink : array or None
The result array; if None, an array is created.
- blocks : int or None
Number of blocks to use; if None, use blocks = default_blocks_per_process * processes.
- processes : None or int
Number of processes to use; if None, use the number of CPUs.
- verbose : bool
If True, print progress information.
Returns
- sink : array
The source transformed via the lookup table.
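A minimal usage sketch, assuming the module is importable as ArrayProcessing (the import alias and data values are illustrative):

```python
import numpy as np
import ArrayProcessing as ap  # import path is an assumption; adjust to your installation

# A label image with values 0..3; the LUT maps each label to a new value.
source = np.random.randint(0, 4, size=(1000, 1000), dtype='uint8')
lut = np.array([0, 10, 20, 30], dtype='uint8')

# Block-wise, parallel equivalent of lut[source].
result = ap.apply_lut(source, lut, processes=4, verbose=True)
assert np.array_equal(result, lut[source])
```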
- apply_lut_to_index(source, kernel, lut, sink=None, processes=None, verbose=False)[source]#
Correlates the source with an index kernel and returns the corresponding values of the lookup table.
Arguments
- source : array
The source array.
- kernel : array
The correlation kernel.
- lut : array
The lookup table.
- sink : array or None
The result array; if None, an array is created.
- processes : None or int
Number of processes to use; if None, use the number of CPUs.
Returns
- sink : array
The source transformed via the lookup table.
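Conceptually this computes lut[correlate(source, kernel)] block-wise. A plain numpy/scipy sketch of the equivalent computation (the dtypes and kernel layout expected by the parallel routine may differ):

```python
import numpy as np
from scipy import ndimage as ndi

binary = (np.random.rand(20, 20, 20) > 0.7).astype('int64')

# Index kernel: each voxel of the 3x3x3 neighbourhood contributes one bit,
# so the correlation encodes every neighbourhood configuration as an integer index.
kernel = (2 ** np.arange(27, dtype='int64')).reshape(3, 3, 3)

# Lookup table over all possible configurations (contents are illustrative).
lut = np.zeros(2 ** 27, dtype='uint8')

index = ndi.correlate(binary, kernel, mode='constant')
reference = lut[index]  # apply_lut_to_index computes this in parallel blocks
```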
- block_ranges(source, blocks=None, processes=None)[source]#
Ranges of evenly spaced blocks in array.
Arguments
- source : array
Source to divide into blocks.
- blocks : int or None
Number of blocks to split the array into.
- processes : None or int
Number of processes; if None, use the number of CPUs.
Returns
- block_ranges : array
List of the range boundaries.
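A small sketch of inspecting the block layout (import path is an assumption):

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

data = np.zeros(1000000)
ranges = ap.block_ranges(data, blocks=8)
# Start/stop boundaries of the 8 blocks, e.g. covering 0..125000, 125000..250000, ...
```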
- block_sums(source, blocks=None, processes=None)[source]#
Sums of evenly spaced blocks in array.
Arguments
- source : array
Array to perform the block sums on.
- blocks : int or None
Number of blocks to split the array into.
- processes : None or int
Number of processes; if None, use the number of CPUs.
Returns
- block_sums : array
Sums of the values in the different blocks.
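A sketch of block-wise summation (import path is an assumption); the per-block sums add up to the global sum:

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

data = np.ones(1000000, dtype='uint8')
sums = ap.block_sums(data, blocks=8, processes=4)
assert sums.sum() == data.sum()  # per-block sums add up to the global sum
```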
- correlate1d(source, kernel, sink=None, axis=0, processes=None, verbose=False)[source]#
Correlates the source along the given axis with a 1d kernel.
Arguments
- source : array
The source array.
- kernel : array
The 1d correlation kernel.
- sink : array or None
The result array; if None, an array is created.
- axis : int
The axis along which to correlate.
- processes : None or int
Number of processes to use; if None, use the number of CPUs.
- verbose : bool
If True, print progress information.
Returns
- sink : array
The source correlated with the kernel along the given axis.
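A sketch of applying a 1d box filter along one axis (import path is an assumption):

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

source = np.random.rand(200, 200, 200).astype('float32')
kernel = np.ones(5, dtype='float32') / 5.0  # simple 1d box filter

# Parallel analogue of scipy.ndimage.correlate1d along axis 0.
smoothed = ap.correlate1d(source, kernel, axis=0, processes=4, verbose=True)
```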
- finalize_processing(verbose=False, function=None, timer=None)[source]#
Finalize parallel array processing.
Arguments
- verbose : bool
If True, print progress information.
- function : str or None
The name of the function.
- timer : Timer or None
A processing timer.
- index_neighbours(indices, offset, processes=None)[source]#
Returns all pairs of indices that are separated by a specified offset.
Arguments
- indices : array
List of indices.
- offset : int
The offset to check for.
- processes : None or int
Number of processes; if None, use the number of CPUs.
- initialize_processing(processes=None, verbose=False, function=None, blocks=None, return_blocks=False)[source]#
Initialize parallel array processing.
Arguments
- processes : int, ‘serial’ or None
The number of processes to use. If None, use the number of CPUs.
- verbose : bool
If True, print progress information.
- function : str or None
The name of the function.
Returns
- processes : int
The number of processes.
- timer : Timer
A timer for the processing.
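A sketch of the typical initialize/finalize pattern used around a parallel routine (the function name is illustrative, the import path is an assumption):

```python
import ArrayProcessing as ap  # assumed import path

# Resolve the number of processes and start a timer for progress reporting.
processes, timer = ap.initialize_processing(processes=None, verbose=True, function='my_routine')

# ... perform the block-wise work here ...

# Report the elapsed time for the named function.
ap.finalize_processing(verbose=True, function='my_routine', timer=timer)
```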
- initialize_sink(sink=None, shape=None, dtype=None, order=None, memory=None, location=None, mode=None, source=None, return_buffer=True, as_1d=False, return_shape=False, return_strides=False)[source]#
Initialize or create a sink.
Arguments
- sink : sink specification
The sink to initialize.
- shape : tuple of int
Optional shape of the sink. If None, inferred from the source.
- dtype : str, type or None
Optional dtype of the sink. If None, inferred from the source.
- order : ‘C’, ‘F’ or None
Optional order of the sink. If None, inferred from the source.
- memory : ‘shared’ or None
If ‘shared’, create a shared memory sink.
- location : str
Optional location specification of the sink.
- source : Source or None
Optional source to infer sink specifications from.
- return_buffer : bool
If True, also return a buffer compatible with Cython memory views.
- return_shape : bool
If True, also return the shape of the sink.
- return_strides : bool
If True, also return the element strides of the sink.
Returns
- sink : Source
The initialized sink.
- buffer (optional) : array
Buffer of the sink.
- shape (optional) : tuple of int
Shape of the sink.
- strides (optional) : tuple of int
Element strides of the sink.
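A sketch of creating a sink that matches an existing source, assuming the return order follows the listing above (import path is an assumption):

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

source = np.random.rand(100, 100).astype('float32')

# Create a sink with the same shape and dtype as the source and obtain a
# buffer that Cython kernels can write into.
sink, sink_buffer = ap.initialize_sink(shape=source.shape, dtype=source.dtype,
                                       source=source, return_buffer=True)
```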
- initialize_source(source, return_buffer=True, as_1d=False, return_shape=False, return_strides=False, return_order=False)[source]#
Initialize a source buffer for parallel array processing.
Arguments
- source : source specification
The source to initialize.
- return_buffer : bool
If True, return a buffer compatible with Cython memory views.
- return_shape : bool
If True, also return the shape of the source.
- return_strides : bool
If True, also return the element strides of the source.
- return_order : bool
If True, also return the order of the source.
Returns
- source : Source
The initialized source.
- source_buffer
The initialized source as a buffer.
- shape : tuple of int
Shape of the source.
- strides : tuple of int
Element strides of the source.
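A sketch of wrapping an existing array for the parallel kernels, assuming the return order follows the listing above (import path is an assumption):

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

data = np.random.rand(100, 100)

# Obtain the source object, its Cython-compatible buffer, shape and strides.
source, buffer, shape, strides = ap.initialize_source(
    data, return_buffer=True, return_shape=True, return_strides=True)
```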
- neighbours(indices, offset, processes=None, verbose=False)[source]#
Returns all pairs in a list of indices that are separated by a specified offset.
Arguments
- indices : array
List of indices.
- offset : int
The offset to search for.
- processes : None or int
Number of processes; if None, use the number of CPUs.
- verbose : bool
If True, print progress.
Returns
- neighbours : array
List of pairs of neighbours.
Note
This function can be used to create graphs from binary images.
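A sketch of building graph edges from a binary image, as the note suggests (import path is an assumption; the exact encoding of the returned pairs may differ):

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

binary = np.random.rand(50, 50, 50) > 0.9
indices = np.flatnonzero(binary)  # sorted flat indices of the foreground voxels

# Voxels that are direct neighbours along the first axis differ by one
# plane in the flat (C-order) index.
offset = binary.shape[1] * binary.shape[2]
pairs = ap.neighbours(indices, offset, processes=4)  # candidate graph edges
```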
- read(source, sink=None, slicing=None, memory=None, blocks=None, processes=None, verbose=False, **kwargs)[source]#
Read a large array into memory in parallel.
Arguments
- source : str or Source
The source on disk to load.
- slicing : slice, tuple, or None
Optional sub-slice to read.
- memory : ‘shared’ or None
If ‘shared’, read into shared memory.
- blocks : int or None
Number of blocks to split the array into for parallel processing.
- processes : None or int
Number of processes; if None, use the number of CPUs.
- verbose : bool
If True, print information about the file to be loaded.
Returns
- sink : Source class
The read source in memory.
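A sketch of loading a large on-disk array into shared memory (the file name is illustrative, the import path is an assumption):

```python
import ArrayProcessing as ap  # assumed import path

# Read the whole file in parallel blocks into shared memory.
data = ap.read('large_volume.npy', memory='shared', verbose=True)
```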
- where(source, sink=None, blocks=None, cutoff=None, processes=None, verbose=False)[source]#
Returns the indices of the non-zero entries of the array.
Arguments
- source : array
Array to search for non-zero indices.
- sink : array or None
If not None, the result is written into this array.
- blocks : int or None
Number of blocks to split the array into for parallel processing.
- cutoff : int
Number of elements below which to switch to numpy.where.
- processes : None or int
Number of processes; if None, use the number of CPUs.
Returns
- where : array
Positions of the non-zero entries of the input array.
Note
Falls back to numpy.where if no parallel implementation matches the array dimension.
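A sketch of finding foreground positions in a large binary array (import path is an assumption):

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

binary = np.random.rand(300, 300, 300) > 0.95

# Parallel analogue of np.where / np.argwhere; arrays smaller than
# default_cutoff fall back to numpy.where internally.
positions = ap.where(binary, processes=4)
```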
- write(sink, source, slicing=None, overwrite=True, blocks=None, processes=None, verbose=False)[source]#
Write a large array to disk in parallel.
Arguments
- sink : str or Source
The sink on disk to write to.
- source : array or Source
The data to write to disk.
- slicing : slicing or None
Optional slicing of the sink to write to.
- overwrite : bool
If True, create a new file if the source specifications do not match.
- blocks : int or None
Number of blocks to split the array into for parallel processing.
- processes : None or int
Number of processes; if None, use the number of CPUs.
- verbose : bool
If True, print information about the file to be written.
Returns
- sink : Source class
The sink to which the source was written.
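A sketch of writing an in-memory array to disk in parallel blocks (the file name is illustrative, the import path is an assumption):

```python
import numpy as np
import ArrayProcessing as ap  # assumed import path

data = np.random.rand(500, 500, 100).astype('float32')

# Write the array block-wise; returns the sink that was written to.
sink = ap.write('large_volume.npy', data, processes=4, verbose=True)
```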
- default_blocks_per_process = 10#
Default number of blocks per process to split the data.
Note
10 blocks per process is a good choice.
- default_cutoff = 20000000#
Default size of array below which ordinary numpy is used.
Note
Ideally test this on your machine for different array sizes.
- default_processes = 12#
Default number of processes to use.