
Hypermap-Registry

Hypermap Registry, remote map services made easy for your SDI

Hypermap Registry Architecture

Integration within a Django project

Make sure the registry module is installed in your working environment and that Elasticsearch version 2.4.0 is available:

pip install django-registry

Add these variables to your settings.py:

from hypermap.settings import REGISTRY_PYCSW

REGISTRY = True # ensure the value is True
REGISTRY_SKIP_CELERY = False
REGISTRY_SEARCH_URL = os.getenv('REGISTRY_SEARCH_URL',
                                'elasticsearch+%s' % ES_URL)

# Check layers every 24 hours
REGISTRY_CHECK_PERIOD = int(os.environ.get('REGISTRY_CHECK_PERIOD', '1440'))
# Index cached layers every minute
REGISTRY_INDEX_CACHED_LAYERS_PERIOD = int(os.environ.get('REGISTRY_INDEX_CACHED_LAYERS_PERIOD', '1'))

CELERYBEAT_SCHEDULE = {
    'Check All Services': {
        'task': 'hypermap.aggregator.tasks.check_all_services',
        'schedule': timedelta(minutes=REGISTRY_CHECK_PERIOD)
    },
    'Index Cached Layers': {
        'task': 'hypermap.aggregator.tasks.index_cached_layers',
        'schedule': timedelta(minutes=REGISTRY_INDEX_CACHED_LAYERS_PERIOD)
    }
}

# -1 Disables limit.
REGISTRY_LIMIT_LAYERS = int(os.getenv('REGISTRY_LIMIT_LAYERS', '-1'))

FILE_CACHE_DIRECTORY = '/tmp/mapproxy/'
REGISTRY_MAPPING_PRECISION = os.getenv('REGISTRY_MAPPING_PRECISION', '500m')

REGISTRY_PYCSW['server']['url'] = SITE_URL.rstrip('/') + '/registry/search/csw'

REGISTRY_PYCSW['metadata:main'] = {
    'identification_title': 'Registry Catalogue',
    'identification_abstract': 'Registry, a Harvard Hypermap project, is an application that manages ' \
    'OWS, Esri REST, and other types of map service harvesting, and maintains uptime statistics for ' \
    'services and layers.',
    'identification_keywords': 'sdi,catalogue,discovery,metadata,registry,HHypermap',
    'identification_keywords_type': 'theme',
    'identification_fees': 'None',
    'identification_accessconstraints': 'None',
    'provider_name': 'Organization Name',
    'provider_url': SITE_URL,
    'contact_name': 'Lastname, Firstname',
    'contact_position': 'Position Title',
    'contact_address': 'Mailing Address',
    'contact_city': 'City',
    'contact_stateorprovince': 'Administrative Area',
    'contact_postalcode': 'Zip or Postal Code',
    'contact_country': 'Country',
    'contact_phone': '+xx-xxx-xxx-xxxx',
    'contact_fax': '+xx-xxx-xxx-xxxx',
    'contact_email': 'Email Address',
    'contact_url': 'Contact URL',
    'contact_hours': 'Hours of Service',
    'contact_instructions': 'During hours of service. Off on weekends.',
    'contact_role': 'Point of Contact'
}

Add the following Django applications to the main project's INSTALLED_APPS setting:

'djmp',
'hypermap.aggregator',
'hypermap.dynasty',
'hypermap.search',
'hypermap.search_api',
'rest_framework',
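In settings.py this can look like the following sketch (the django.contrib entries shown are illustrative placeholders for whatever is already in your list):

```python
INSTALLED_APPS = [
    # ...apps already present in your project...
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    # Registry applications
    'djmp',
    'hypermap.aggregator',
    'hypermap.dynasty',
    'hypermap.search',
    'hypermap.search_api',
    'rest_framework',
]
```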

Test CSW transactions

curl -v "http://(admin_username):(admin_password)@(host)/registry/(catalog)/csw?service=CSW&request=Transaction&version=2.0.2" -d "@(document_body_path)"

As an example, using the template found in the data folder:

curl -v "http://admin:admin@localhost/registry/hypermap/csw?service=CSW&request=Transaction&version=2.0.2" -d "@data/cswt_insert.xml"

Verify that the layers have been added to the database.
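Beyond the database check, you can ask the catalogue itself how many records it now holds with a CSW GetRecords "hits" request. A sketch that builds such a request URL (the localhost host is an assumption; the endpoint path matches the REGISTRY_PYCSW['server']['url'] setting above):

```python
from urllib.parse import urlencode

# Base CSW endpoint as configured via REGISTRY_PYCSW['server']['url'];
# the localhost host below is an assumption for illustration.
base = 'http://localhost/registry/search/csw'
params = {
    'service': 'CSW',
    'version': '2.0.2',
    'request': 'GetRecords',
    'typenames': 'csw:Record',
    'elementsetname': 'brief',
    'resulttype': 'hits',  # ask only for the matched-record count
}
url = base + '?' + urlencode(params)
print(url)
```

The response's `numberOfRecordsMatched` attribute should reflect the layers inserted by the transaction.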

Important notes

See the django.env file for an example.

Hypermap Registry troubleshooting

1. When I add a URL to the database, services are not created

2. Services and layers are created, but layers are not indexed into search backend

As an administrator, verify in the Periodic Tasks section that the index_cached_layers task is set.

Performance considerations

The Hypermap architecture depends on 6 main components:

+-------------------+      +----------------------+
|                   |      |                      |
|                   |      |    postgres          |
|    django app     <--+--->                      |
|                   |  |   |                      |
|                   |  |   |                      |
+--------^----------+  |   +----------------------+
         |             |
+--------v----------+  |   +----------------------+
|                   |  |   |                      |
|                   |  |   |                      |
|     rabbitmq      |  +--->    elastic search    |
|                   |  |   |                      |
|                   |  |   |                      |
+---------^---------+  |   +----------------------+
          |            |
+---------v--------+   |   +----------------------+
|                  |   |   |                      |
|                  |   |   |                      |
|      celery      <---+--->      memcached       |
|   & celery beats |       |                      |
|                  |       |                      |
+------------------+       +----------------------+

If you want to see how to install those services, refer to “Manual Installations” in the developers documentation.

Django app

The application layer [#TODO: provide more info here]

The app can be hosted via the WSGI application located at hypermap/wsgi.py. For production environments it is recommended to serve it with the uWSGI application server. Refer to https://uwsgi-docs.readthedocs.io/en/latest/ for more documentation.

How to start?
Development:
python manage.py runserver
Production:
uwsgi --module=hypermap.wsgi:application --env DJANGO_SETTINGS_MODULE=hypermap.settings

Read more about Configuring and starting the uWSGI server for Django

Docker:
make start

Rabbit MQ, Celery and Memcached

The queue/task layer. It performs the operations described below using dedicated async workers, which can run on local or remote machines connected to a RabbitMQ instance.

Harvesting and Indexing

Metadata is downloaded from the Internet: each time an Endpoint, Service, or Layer is created, a worker starts async jobs to fetch the information from the remote services.

Perform Periodic/Scheduled Tasks (AKA beats)

Celery beat kicks off tasks at regular intervals. Two important periodic tasks are defined in the settings file:

Once Layers are created and checked by hypermap.aggregator.tasks.check_all_services, they are inserted into Memcached as a buffer for the task hypermap.aggregator.tasks.index_cached_layers, which sends a batch call to the search engine to index them.

Important settings

REGISTRY_CHECK_PERIOD (in minutes) defines the interval at which the task check_all_services is executed by the available workers to check the status of Services and Layers.

REGISTRY_INDEX_CACHED_LAYERS_PERIOD (in minutes) defines the interval at which the task index_cached_layers is executed by the available workers to send the memcached layer buffer to the search backend.
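Both settings are plain minute counts that are turned into timedelta intervals, so a daily check is simply 1440 minutes. A quick sanity check (the values are the illustrative defaults used throughout this page):

```python
from datetime import timedelta

# Illustrative values: 1440 minutes = 24 hours for checking,
# 1 minute for flushing the Memcached layer buffer.
REGISTRY_CHECK_PERIOD = 1440
REGISTRY_INDEX_CACHED_LAYERS_PERIOD = 1

check_interval = timedelta(minutes=REGISTRY_CHECK_PERIOD)
index_interval = timedelta(minutes=REGISTRY_INDEX_CACHED_LAYERS_PERIOD)

assert check_interval == timedelta(days=1)
```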

The setting CELERYBEAT_SCHEDULE registers the creation of those periodic tasks:

CELERYBEAT_SCHEDULE = {
    'Check All Services': {
        'task': 'hypermap.aggregator.tasks.check_all_services',
        'schedule': timedelta(minutes=REGISTRY_CHECK_PERIOD)
    },
    'Index Cached Layers': {
        'task': 'hypermap.aggregator.tasks.index_cached_layers',
        'schedule': timedelta(minutes=REGISTRY_INDEX_CACHED_LAYERS_PERIOD)
    }
}

These two periodic tasks should be created automatically in the admin site when the celery workers start. One way to check this is to go to the admin site and verify the presence of 3 tasks on the “Periodic Tasks” page:

How to start?
celery worker --app=hypermap.celeryapp:app --concurrency 4 -B -l INFO
Docker:
make start
How to check periodic tasks created by Celery?

You have to ensure only a single scheduler is running for a schedule at a time, otherwise you would end up with duplicate tasks. Using a centralized approach means the schedule does not have to be synchronized, and the service can operate without using locks.

Why REGISTRY_CHECK_PERIOD should be an extended period of time

check_all_services opens connections to the registered services in order to run checks and download information. If the check period is too short, it can generate massive numbers of connections to those services, causing high incoming traffic and workload that can look like a denial-of-service attack. The recommended value for REGISTRY_CHECK_PERIOD is 60*24 (1440 minutes) to perform a daily check.
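For instance, the daily default can be expressed via the environment (a sketch; the fallback value is the recommended 60*24):

```python
import os

# 60 minutes * 24 hours = 1440 minutes, i.e. one check per day.
DAILY_MINUTES = 60 * 24
REGISTRY_CHECK_PERIOD = int(os.getenv('REGISTRY_CHECK_PERIOD', str(DAILY_MINUTES)))
print(REGISTRY_CHECK_PERIOD)
```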

One way to avoid remote connections to service servers that are not required/needed for harvesting is to set Service.is_monitored=False.

As shown in the admin site:

Worker quantity (--concurrency N)

The recommended number of concurrent workers running on a machine is close to the number of CPU cores.
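One simple heuristic is to derive the concurrency value from the machine's CPU count (a sketch; the fallback of 4 is an arbitrary assumption, not a Hypermap default):

```python
import os

# Use the number of CPU cores as the worker concurrency,
# falling back to 4 if the count cannot be determined.
concurrency = os.cpu_count() or 4
print(f"celery worker --app=hypermap.celeryapp:app --concurrency {concurrency} -B -l INFO")
```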

Scaling up/down: registering celery nodes

Deploy the hypermap code on a different machine in the same cluster and use the same BROKER_URL. Tasks will automatically start running on this node. Do not start this node with beats (-B), to ensure only a single scheduler is running at a time; otherwise you will end up with duplicate tasks.

Starting celery without beats

Just remove the -B from the start command:

celery worker --app=hypermap.celeryapp:app --concurrency 4 -l INFO

How to purge active and pending tasks from celery

This is unrecoverable: the tasks will be deleted from the messaging server.

First stop all workers and run:

celery --app=hypermap.celeryapp:app purge -f

Elasticsearch

REGISTRY_MAPPING_PRECISION

This parameter may be used instead of tree_levels. The value specifies the desired precision, and Elasticsearch will calculate the best tree_levels value to honor it. The value should be a number followed by a distance unit, such as m (for example, 500m).
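To build intuition for how precision maps to tree depth, here is a rough back-of-the-envelope estimate for a quadtree: each level halves the cell edge, so the required depth grows logarithmically as the precision shrinks. This mirrors the idea only; Elasticsearch performs its own internal calculation.

```python
import math

EARTH_CIRCUMFERENCE_M = 40_075_016  # approximate, at the equator

def quadtree_levels(precision_m: float) -> int:
    """Levels needed so a quadtree cell edge is <= precision_m.

    Each quadtree level halves the cell edge length, so we need
    ceil(log2(circumference / precision)) levels. Illustrative only.
    """
    return math.ceil(math.log2(EARTH_CIRCUMFERENCE_M / precision_m))

print(quadtree_levels(500))  # REGISTRY_MAPPING_PRECISION = '500m' -> 17
print(quadtree_levels(50))   # Elasticsearch's ~50m default -> 20
```

Note how moving from 500m down to the ~50m default costs several extra levels, and each additional level multiplies the number of terms generated per shape.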

Performance considerations

Elasticsearch uses the paths in the prefix tree as terms in the index and in queries. The higher the tree level (and thus the precision), the more terms are generated. Calculating the terms, keeping them in memory, and storing them on disk all have a price. Especially at higher tree levels, indices can become extremely large even with a modest amount of data. Additionally, the size of the features matters: big, complex polygons can take up a lot of space at higher tree levels. Which setting is right depends on the use case; generally one trades off accuracy against index size and query performance.

The defaults in Elasticsearch for both implementations are a compromise between index size and a reasonable level of precision of 50m at the equator. This allows for indexing tens of millions of shapes without unduly bloating the resulting index relative to the input size.

So take care when setting REGISTRY_MAPPING_PRECISION to a low value, because sending Layers to Elasticsearch will become slow.