collection

SiteCollection for batch operations on multiple sites.

Chunking Utilities

Internal utilities for splitting large site collections into manageable chunks.


source

chunk_items


def chunk_items(
    items: Any, chunk_size: int
) -> Iterator[List]:

Yield chunks of items from an iterable.

Args:

  • items: Iterable of items to chunk (list, generator, etc.)
  • chunk_size: Maximum items per chunk

Yields: Lists of up to chunk_size items
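A minimal sketch of such a chunker, using itertools.islice so it works for generators as well as lists (chunk_items_sketch is an illustrative re-implementation, not the package's source):

```python
from itertools import islice
from typing import Any, Iterator, List

def chunk_items_sketch(items: Any, chunk_size: int) -> Iterator[List]:
    """Yield lists of up to chunk_size items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:  # iterable exhausted
            return
        yield chunk
```

Because it consumes an iterator lazily, only one chunk of sites needs to be in memory at a time.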


source

calculate_chunk_size


def calculate_chunk_size(
    n_years: int, n_sites: int, target_rows: int = 40000
) -> int:

Calculate optimal chunk size based on expected data volume.

Heuristics:

  • Target ~40,000 result rows per getInfo() call (conservative for GEE)
  • Each site-year produces ~10-50 class rows for categorical data
  • For continuous data, each site-year produces 1 row per band

Args:

  • n_years: Number of years being extracted
  • n_sites: Total number of sites
  • target_rows: Target maximum rows per chunk (default 40,000)

Returns: Recommended number of sites per chunk
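One way the heuristic above could be implemented; the 50-rows-per-site-year worst case and the helper name are assumptions based on the description, not the actual source:

```python
def calculate_chunk_size_sketch(
    n_years: int, n_sites: int, target_rows: int = 40_000,
    rows_per_site_year: int = 50,
) -> int:
    """Pick sites-per-chunk so a chunk stays under target_rows result rows."""
    rows_per_site = max(1, n_years) * rows_per_site_year  # worst-case rows one site emits
    chunk = max(1, target_rows // rows_per_site)          # always at least one site per chunk
    return min(chunk, n_sites)                            # never exceed the collection size
```

For example, 10 years at 50 rows per site-year gives 500 rows per site, so a chunk of 80 sites stays at the 40,000-row target.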

ChunkedResult

A result container that tracks both successful extractions and any errors that occurred.


source

ChunkedResult


def ChunkedResult(
    data: pd.DataFrame, errors: List[dict] = <factory>
) -> None:

Result from chunked extraction with error tracking.

Attributes:

  • data: DataFrame containing successful extractions
  • errors: List of dicts with site_id, chunk_idx, and error message
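The <factory> default in the signature is how a dataclass field with a default_factory renders; a minimal equivalent sketch (ChunkedResultSketch is illustrative, and data is typed loosely here to stay self-contained):

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class ChunkedResultSketch:
    data: Any                                         # a pandas DataFrame in the real class
    errors: List[dict] = field(default_factory=list)  # fresh list per instance, never shared
```

Using default_factory rather than `errors=[]` avoids the classic mutable-default pitfall where all instances would share one error list.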

SiteCollection

The SiteCollection class enables batch operations on multiple sites. It supports:

  • Eager mode: All Site objects created upfront (default for <1000 sites)
  • Lazy mode: Site objects created on-demand to save memory for large collections

Use SiteCollection when you need to extract data from hundreds to thousands of sites efficiently.


source

SiteCollection


def SiteCollection(
    sites: Optional[List[Site]] = None, feature_dicts: Optional[List[dict]] = None, source_crs: str = 'EPSG:4326',
    metadata: Optional[dict] = None
):

A collection of Sites for batch operations.

Supports two modes:

  • Eager: All Site objects created upfront (default for <1000 sites)
  • Lazy: Site objects created on-demand from stored feature dicts

Example:

# Load and extract from many sites
sites = SiteCollection.from_geojson('restoration_sites.geojson')

# Interactive batch extraction (Path A)
result = sites.extract_categorical(MAPBIOMAS_LULC, years=range(2010, 2024))
df = result.data

# Export for large collections (Path B)
task = sites.export_categorical(
    MAPBIOMAS_LULC,
    years=range(2010, 2024),
    destination=ExportDestination(type='drive', folder='exports')
)

Batch Extraction Methods (Path A)

These methods extract data interactively, returning pandas DataFrames. Best for collections under ~5000 sites.


source

SiteCollection.extract_categorical


def extract_categorical(
    layer: 'CategoricalLayer', years: List[int], chunk_size: Optional[int] = None, max_pixels: int = 1000000000,
    progress: bool = True
) -> ChunkedResult:

Extract categorical data from all sites.

Batches sites into chunks to avoid GEE timeout/memory limits. Each chunk is processed with a single getInfo() call.

Args:

  • layer: CategoricalLayer to extract from
  • years: List of years to extract
  • chunk_size: Sites per chunk (auto-calculated if None)
  • max_pixels: Max pixels per reduceRegion call
  • progress: Show progress bar (requires tqdm)

Returns: ChunkedResult with DataFrame and any errors

Example:

result = sites.extract_categorical(MAPBIOMAS_LULC, years=[2020, 2021, 2022])
print(result)  # ChunkedResult(sites=500, errors=2, success_rate=99.6%)
df = result.data


source

SiteCollection.extract_continuous


def extract_continuous(
    layer: 'ContinuousLayer', start_date: str, end_date: str, reducer: str = 'mean', frequency: str = 'yearly',
    chunk_size: Optional[int] = None, max_pixels: int = 1000000000, progress: bool = True
) -> ChunkedResult:

Extract continuous data from all sites.

Batches sites into chunks to avoid GEE timeout/memory limits.

Args:

  • layer: ContinuousLayer to extract from
  • start_date: Start date (YYYY-MM-DD)
  • end_date: End date (YYYY-MM-DD)
  • reducer: Spatial reducer ('mean', 'median', 'min', 'max')
  • frequency: Temporal grouping ('all', 'monthly', 'yearly')
  • chunk_size: Sites per chunk (auto-calculated if None)
  • max_pixels: Max pixels per reduceRegion call
  • progress: Show progress bar

Returns: ChunkedResult with DataFrame and any errors
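Since extract_continuous returns the same ChunkedResult as extract_categorical, failures can be inspected the same way; a minimal sketch assuming the error-dict keys documented above (failed_site_ids is a hypothetical helper, not part of the package):

```python
def failed_site_ids(errors: list) -> set:
    """Distinct site IDs present in a ChunkedResult's errors list, e.g. for a retry pass."""
    return {e["site_id"] for e in errors}
```

A site that failed in several chunks or years appears once, so the set can be fed straight into a filtered re-extraction.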

Export Methods (Path B)

These methods export data to Google Drive or Cloud Storage using GEE batch tasks. Best for large collections (>5000 sites).


source

SiteCollection.export_categorical


def export_categorical(
    layer: 'CategoricalLayer', years: List[int], destination: 'ExportDestination', config: Optional['ExportConfig'] = None,
    max_pixels: int = 1000000000
) -> 'ExportTask':

Export categorical extraction to Google Drive or Cloud Storage.

For collections larger than ~5000 sites, this is more reliable than interactive extraction. Results are exported as CSV or GeoJSON files, one per chunk.

Args:

  • layer: CategoricalLayer to extract from
  • years: List of years to extract
  • destination: Where to export (Drive or GCS)
  • config: Export configuration (chunk size, concurrency)
  • max_pixels: Max pixels per reduceRegion call

Returns: ExportTask for monitoring progress

Example:

from gee_polygons.export import ExportDestination, ExportConfig

task = sites.export_categorical(
    layer=MAPBIOMAS_LULC,
    years=range(2010, 2024),
    destination=ExportDestination(type='drive', folder='exports'),
    config=ExportConfig(chunk_size=50, max_concurrent=15)
)

# Monitor progress
print(task.status())

# Wait for completion
task.wait(timeout_minutes=180)

# Get result file locations
print(task.results_info())

source

SiteCollection.export_continuous


def export_continuous(
    layer: 'ContinuousLayer', start_date: str, end_date: str, destination: 'ExportDestination', reducer: str = 'mean',
    frequency: str = 'yearly', config: Optional['ExportConfig'] = None, max_pixels: int = 1000000000
) -> 'ExportTask':

Export continuous extraction to Google Drive or Cloud Storage.

Args:

  • layer: ContinuousLayer to extract from
  • start_date: Start date (YYYY-MM-DD)
  • end_date: End date (YYYY-MM-DD)
  • destination: Where to export (Drive or GCS)
  • reducer: Spatial reducer ('mean', 'median', 'min', 'max')
  • frequency: Temporal grouping ('monthly', 'yearly')
  • config: Export configuration
  • max_pixels: Max pixels per reduceRegion call

Returns: ExportTask for monitoring progress
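The wait-for-completion pattern shown for export_categorical can be approximated by a generic polling loop; wait_until, its parameters, and the state names here are assumptions for illustration, not the actual ExportTask API:

```python
import time
from typing import Callable

def wait_until(
    status_fn: Callable[[], str],
    done_states=("COMPLETED", "FAILED"),
    poll_seconds: float = 30,
    timeout_minutes: float = 180,
) -> str:
    """Poll status_fn until it reports a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout_minutes * 60
    while time.monotonic() < deadline:
        state = status_fn()
        if state in done_states:
            return state
        time.sleep(poll_seconds)  # avoid hammering the task-status endpoint
    raise TimeoutError("task did not reach a terminal state in time")
```

A real ExportTask.wait() presumably does something similar internally, which is why a generous timeout matters for large continuous exports.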

Internal Extraction Functions

These functions handle the actual GEE operations for each chunk.
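The per-chunk flow described above (one getInfo() call per chunk, failures recorded rather than raised) can be sketched generically; run_chunks and its extract callback are illustrative names standing in for the package's internals:

```python
from typing import Callable, Iterable, List, Tuple

def run_chunks(
    chunks: Iterable[list],
    extract: Callable[[list], list],
) -> Tuple[list, List[dict]]:
    """Run extract() per chunk of site IDs; collect rows and per-site error records."""
    rows: list = []
    errors: List[dict] = []
    for idx, chunk in enumerate(chunks):
        try:
            rows.extend(extract(chunk))  # one GEE round-trip per chunk in the real code
        except Exception as exc:
            # Attribute the chunk-level failure to every site it contained
            errors.extend(
                {"site_id": site_id, "chunk_idx": idx, "error": str(exc)}
                for site_id in chunk
            )
    return rows, errors
```

Catching per chunk, not per call, is what lets one bad geometry cost only its chunk while the rest of the collection still succeeds.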

Export Helper Functions

Example Usage

# Initialize Earth Engine
ee.Authenticate()
ee.Initialize(project="your-project-id")
# Load a collection of sites
sites = SiteCollection.from_geojson('../data/restoration_sites_subset.geojson')
print(sites)
# Extract categorical data (Path A - interactive)
from gee_polygons.datasets.mapbiomas import MAPBIOMAS_LULC

result = sites.extract_categorical(MAPBIOMAS_LULC, years=range(2018, 2023))
print(result)
result.data.head()