## Initialize Earth Engine

```python
import ee

ee.Authenticate()
ee.Initialize(project="your-project-id")
```

This document covers `SiteCollection`, a class for batch operations on multiple sites.
## Chunking Utilities

Internal utilities for splitting large site collections into manageable chunks.
### chunk_items

```python
def chunk_items(
    items: Any, chunk_size: int
) -> Iterator[List]:
```

Yield chunks of items from an iterable.

Args:
- `items`: Iterable of items to chunk (list, generator, etc.)
- `chunk_size`: Maximum items per chunk

Yields:
- Lists of up to `chunk_size` items
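A generator with this contract can be sketched in a few lines (a minimal sketch based on `itertools.islice`; the library's actual implementation may differ):

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List

def chunk_items(items: Iterable[Any], chunk_size: int) -> Iterator[List[Any]]:
    """Yield lists of up to chunk_size items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, chunk_size))  # consumes at most chunk_size items
        if not chunk:
            return
        yield chunk

# Works with lists and generators alike:
print(list(chunk_items(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Because the input is consumed lazily, this works for generators whose length is unknown in advance.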
### calculate_chunk_size

```python
def calculate_chunk_size(
    n_years: int, n_sites: int, target_rows: int = 40000
) -> int:
```

Calculate an optimal chunk size based on expected data volume.

Heuristics:
- Target ~40,000 result rows per `getInfo()` call (conservative for GEE)
- Each site-year produces ~10-50 class rows for categorical data
- For continuous data, each site-year produces 1 row per band

Args:
- `n_years`: Number of years being extracted
- `n_sites`: Total number of sites
- `target_rows`: Target maximum rows per chunk (default 40,000)

Returns:
- Recommended number of sites per chunk
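To make the heuristic concrete, here is a sketch of the arithmetic under one set of assumptions (the ~25 rows per site-year midpoint and the clamping behavior are illustrative, not the library's exact constants):

```python
def calculate_chunk_size(n_years: int, n_sites: int, target_rows: int = 40_000) -> int:
    """Sketch: assume ~25 class rows per site-year (midpoint of 10-50),
    then cap each chunk so one getInfo() call stays under target_rows rows."""
    rows_per_site = max(1, n_years * 25)          # expected rows one site contributes
    chunk = max(1, target_rows // rows_per_site)  # sites that fit under the target
    return min(chunk, n_sites)                    # a chunk can't exceed the collection

# 14 years of categorical data -> 350 rows/site -> 40000 // 350 = 114 sites/chunk
print(calculate_chunk_size(n_years=14, n_sites=500))  # 114
print(calculate_chunk_size(n_years=1, n_sites=100))   # 100 (capped by n_sites)
```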
### ChunkedResult

A result container that tracks both successful extractions and any errors that occurred.

```python
def ChunkedResult(
    data: pd.DataFrame, errors: List[dict] = <factory>
) -> None:
```

Result from chunked extraction with error tracking.

Attributes:
- `data`: DataFrame containing successful extractions
- `errors`: List of dicts with `site_id`, `chunk_idx`, and error message
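The `<factory>` default in the generated signature indicates a dataclass field with a `default_factory`. A minimal equivalent sketch (the real class likely adds reporting helpers such as a success-rate summary):

```python
from dataclasses import dataclass, field
from typing import List

import pandas as pd

@dataclass
class ChunkedResult:
    """Result from chunked extraction with error tracking."""
    data: pd.DataFrame                                # rows from successful chunks
    errors: List[dict] = field(default_factory=list)  # site_id, chunk_idx, error

result = ChunkedResult(data=pd.DataFrame({"site_id": [1, 2], "class": [3, 15]}))
print(len(result.data), len(result.errors))  # 2 0
```

Using `default_factory=list` avoids the classic mutable-default bug where all instances would share one error list.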
## SiteCollection

The SiteCollection class enables batch operations on multiple sites. It supports:

- Eager mode: all Site objects created upfront (default for <1000 sites)
- Lazy mode: Site objects created on demand to save memory for large collections

Use SiteCollection when you need to extract data from hundreds to thousands of sites efficiently.
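Lazy mode trades a little CPU for memory: only the raw GeoJSON-style feature dicts are stored, and Site objects are materialized one at a time during iteration. A toy sketch of the idea (hypothetical names; the real Site class is richer):

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class Site:
    site_id: int
    geometry: dict

class LazyCollection:
    """Store raw feature dicts; build Site objects only while iterating."""

    def __init__(self, feature_dicts: List[dict]):
        self._features = feature_dicts  # cheap: no Site objects yet

    def __len__(self) -> int:
        return len(self._features)

    def __iter__(self) -> Iterator[Site]:
        for i, f in enumerate(self._features):
            yield Site(site_id=i, geometry=f["geometry"])  # built on demand

coll = LazyCollection([{"geometry": {"type": "Point", "coordinates": [0, 0]}}] * 3)
print(len(coll), sum(1 for _ in coll))  # 3 3
```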
### SiteCollection

```python
def SiteCollection(
    sites: Optional[List[Site]] = None, feature_dicts: Optional[List[dict]] = None, source_crs: str = 'EPSG:4326',
    metadata: Optional[dict] = None
):
```

A collection of Sites for batch operations.

Supports two modes:
- Eager: all Site objects created upfront (default for <1000 sites)
- Lazy: Site objects created on demand from stored feature dicts

Example:

```python
# Load and extract from many sites
sites = SiteCollection.from_geojson('restoration_sites.geojson')

# Interactive batch extraction (Path A)
result = sites.extract_categorical(MAPBIOMAS_LULC, years=range(2010, 2024))
df = result.data

# Export for large collections (Path B)
task = sites.export_categorical(
    MAPBIOMAS_LULC,
    years=range(2010, 2024),
    destination=ExportDestination(type='drive', folder='exports')
)
```
## Batch Extraction Methods (Path A)

These methods extract data interactively, returning pandas DataFrames. Best for collections under ~5,000 sites.
### SiteCollection.extract_categorical

```python
def extract_categorical(
    layer: 'CategoricalLayer', years: List[int], chunk_size: Optional[int] = None, max_pixels: int = 1000000000,
    progress: bool = True
) -> ChunkedResult:
```

Extract categorical data from all sites.

Batches sites into chunks to avoid GEE timeout/memory limits. Each chunk is processed with a single `getInfo()` call.

Args:
- `layer`: CategoricalLayer to extract from
- `years`: List of years to extract
- `chunk_size`: Sites per chunk (auto-calculated if None)
- `max_pixels`: Max pixels per reduceRegion call
- `progress`: Show progress bar (requires tqdm)

Returns:
- ChunkedResult with DataFrame and any errors

Example:

```python
result = sites.extract_categorical(MAPBIOMAS_LULC, years=[2020, 2021, 2022])
print(result)  # ChunkedResult(sites=500, errors=2, success_rate=99.6%)
df = result.data
```
### SiteCollection.extract_continuous

```python
def extract_continuous(
    layer: 'ContinuousLayer', start_date: str, end_date: str, reducer: str = 'mean', frequency: str = 'yearly',
    chunk_size: Optional[int] = None, max_pixels: int = 1000000000, progress: bool = True
) -> ChunkedResult:
```

Extract continuous data from all sites.

Batches sites into chunks to avoid GEE timeout/memory limits.

Args:
- `layer`: ContinuousLayer to extract from
- `start_date`: Start date (YYYY-MM-DD)
- `end_date`: End date (YYYY-MM-DD)
- `reducer`: Spatial reducer ('mean', 'median', 'min', 'max')
- `frequency`: Temporal grouping ('all', 'monthly', 'yearly')
- `chunk_size`: Sites per chunk (auto-calculated if None)
- `max_pixels`: Max pixels per reduceRegion call
- `progress`: Show progress bar

Returns:
- ChunkedResult with DataFrame and any errors
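The `frequency` parameter determines how the date range is bucketed before the spatial reducer is applied. A sketch of that grouping using pandas (illustrative only, not the library's internals):

```python
import pandas as pd

def period_starts(start_date: str, end_date: str, frequency: str) -> list:
    """Sketch of the temporal grouping implied by `frequency`:
    'yearly' and 'monthly' split the range into period start dates,
    while 'all' reduces the whole range at once."""
    if frequency == "all":
        return [start_date]
    freq = {"yearly": "YS", "monthly": "MS"}[frequency]  # year/month start aliases
    return [d.strftime("%Y-%m-%d") for d in pd.date_range(start_date, end_date, freq=freq)]

print(period_starts("2020-01-01", "2022-12-31", "yearly"))
# ['2020-01-01', '2021-01-01', '2022-01-01']
```

Each resulting period would then get one reduced value per site and band, which is why longer ranges at 'monthly' frequency multiply the row count.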
## Export Methods (Path B)

These methods export data to Google Drive or Cloud Storage using GEE batch tasks. Best for large collections (>5,000 sites).
### SiteCollection.export_categorical

```python
def export_categorical(
    layer: 'CategoricalLayer', years: List[int], destination: 'ExportDestination', config: Optional['ExportConfig'] = None,
    max_pixels: int = 1000000000
) -> 'ExportTask':
```

Export categorical extraction to Google Drive or Cloud Storage.

For collections larger than ~5,000 sites, this is more reliable than interactive extraction. Results are exported as CSV or GeoJSON files, one per chunk.

Args:
- `layer`: CategoricalLayer to extract from
- `years`: List of years to extract
- `destination`: Where to export (Drive or GCS)
- `config`: Export configuration (chunk size, concurrency)
- `max_pixels`: Max pixels per reduceRegion call

Returns:
- ExportTask for monitoring progress

Example:

```python
from gee_polygons.export import ExportDestination, ExportConfig

task = sites.export_categorical(
    layer=MAPBIOMAS_LULC,
    years=range(2010, 2024),
    destination=ExportDestination(type='drive', folder='exports'),
    config=ExportConfig(chunk_size=50, max_concurrent=15)
)

# Monitor progress
print(task.status())

# Wait for completion
task.wait(timeout_minutes=180)

# Get result file locations
print(task.results_info())
```
### SiteCollection.export_continuous

```python
def export_continuous(
    layer: 'ContinuousLayer', start_date: str, end_date: str, destination: 'ExportDestination', reducer: str = 'mean',
    frequency: str = 'yearly', config: Optional['ExportConfig'] = None, max_pixels: int = 1000000000
) -> 'ExportTask':
```

Export continuous extraction to Google Drive or Cloud Storage.

Args:
- `layer`: ContinuousLayer to extract from
- `start_date`: Start date (YYYY-MM-DD)
- `end_date`: End date (YYYY-MM-DD)
- `destination`: Where to export (Drive or GCS)
- `reducer`: Spatial reducer ('mean', 'median', 'min', 'max')
- `frequency`: Temporal grouping ('monthly', 'yearly')
- `config`: Export configuration
- `max_pixels`: Max pixels per reduceRegion call

Returns:
- ExportTask for monitoring progress
## Internal Extraction Functions

These functions handle the actual GEE operations for each chunk.
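The per-chunk control flow that feeds ChunkedResult can be sketched without any GEE calls; the extractor below is a mock standing in for the real `getInfo()`-based functions, and the names are illustrative:

```python
from typing import Callable, List, Tuple

def extract_in_chunks(
    site_ids: List[int],
    chunk_size: int,
    extract_chunk: Callable[[List[int]], List[dict]],
) -> Tuple[List[dict], List[dict]]:
    """Run extract_chunk over successive chunks; a failed chunk is recorded
    in `errors` instead of aborting the whole extraction."""
    rows: List[dict] = []
    errors: List[dict] = []
    for start in range(0, len(site_ids), chunk_size):
        chunk = site_ids[start:start + chunk_size]
        try:
            rows.extend(extract_chunk(chunk))
        except Exception as exc:  # e.g. a GEE timeout on one chunk
            errors.extend(
                {"site_id": s, "chunk_idx": start // chunk_size, "error": str(exc)}
                for s in chunk
            )
    return rows, errors

# Mock extractor that fails on the chunk containing site 3:
def mock_extract(chunk: List[int]) -> List[dict]:
    if 3 in chunk:
        raise RuntimeError("GEE timeout")
    return [{"site_id": s, "value": s * 10} for s in chunk]

rows, errors = extract_in_chunks([1, 2, 3, 4, 5], chunk_size=2, extract_chunk=mock_extract)
print(len(rows), len(errors))  # 3 2
```

Recording failures per chunk rather than raising is what allows a 500-site run to finish with a handful of errors instead of losing everything.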
## Export Helper Functions
## Example Usage
```python
# Load a collection of sites
sites = SiteCollection.from_geojson('../data/restoration_sites_subset.geojson')
print(sites)

# Extract categorical data (Path A - interactive)
from gee_polygons.datasets.mapbiomas import MAPBIOMAS_LULC

result = sites.extract_categorical(MAPBIOMAS_LULC, years=range(2018, 2023))
print(result)
result.data.head()
```