Syncing Roster Data at Scale
As your product integrates with more schools, you'll need a data synchronization strategy that is performant, reliable, and cost-effective. This guide outlines our recommended best practices for syncing roster data from dozens or hundreds of Edlink integrations.
Technical Architecture for Syncing
The single most important principle for syncing at scale is to always process data on a per-integration basis.
Data in the Edlink platform is sharded by `integration_id`. Different integrations may reside in different physical data regions. Architecting your sync process around the `integration_id` is essential for performance and reliability.
We recommend using a queuing system like AWS SQS or Google Pub/Sub to manage and distribute sync jobs to a number of worker nodes.
- Create Jobs by Integration: Each "job" placed into the queue should represent a single `integration_id` that needs to be synced.
- Process in Parallel: Use workers (e.g., Kubernetes pods, serverless functions) to pull jobs from the queue and process them in parallel. This allows you to sync multiple districts simultaneously.
Note that a persistent VM may be cheaper than high-memory, long-duration serverless functions, and some large schools or universities generate too much data to process in a standard serverless environment at all.
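As a rough sketch of this pattern, the shape below shows one job per integration and workers that drain the queue in parallel. The `Queue` interface, `SyncJob` shape, and `syncIntegration` function are placeholders, not part of the Edlink API; swap in your real SQS or Pub/Sub client.

```typescript
// Sketch only: `Queue` stands in for your real queue client (SQS, Pub/Sub, etc.)
// and `syncIntegration` for your own per-integration sync logic.
interface SyncJob {
  integration_id: string;
}

interface Queue {
  send(job: SyncJob): Promise<void>;
  receive(): Promise<SyncJob | null>;
}

declare function syncIntegration(integrationId: string): Promise<void>;

// Producer: enqueue one job per integration that needs to be synced.
async function enqueueSyncJobs(queue: Queue, integrationIds: string[]): Promise<void> {
  for (const integration_id of integrationIds) {
    await queue.send({ integration_id });
  }
}

// Worker: pull jobs and sync one integration at a time. Run many workers in
// parallel to sync multiple districts simultaneously.
async function runWorker(queue: Queue): Promise<void> {
  let job: SyncJob | null;
  while ((job = await queue.receive()) !== null) {
    await syncIntegration(job.integration_id);
  }
}
```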
Use Top-Level Endpoints
A common mistake is to iterate through schools and then fetch data for each school. This is inefficient and will not scale.
- Always use top-level Graph API endpoints like `/v2/graph/people` or `/v2/graph/enrollments` for a given integration. These are more performant and require fewer API calls than sub-object endpoints like `/v2/graph/schools/:school_id/people`.
- You can easily reconstruct school-specific data locally. For example, each `Person` object returned from `/v2/graph/people` contains a `school_ids` array, allowing you to associate people with their schools after fetching the complete dataset, as sketched below.
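For illustration, here is a minimal sketch of fetching the complete dataset and regrouping it by school. The base URL and the `$data`/`$next` response envelope are assumptions to verify against the Edlink API reference; the `$first` page-size parameter is covered further below.

```typescript
// Sketch only: the base URL and the $data / $next response envelope are
// assumptions to verify against the Edlink API reference.
interface Person {
  id: string;
  school_ids: string[];
  [key: string]: unknown;
}

async function fetchAllPeople(accessToken: string): Promise<Person[]> {
  const people: Person[] = [];
  let url: string | null = 'https://ed.link/api/v2/graph/people?$first=10000';

  while (url) {
    const response = await fetch(url, {
      headers: { Authorization: `Bearer ${accessToken}` },
    });
    const body = await response.json();
    people.push(...(body.$data as Person[]));
    url = body.$next ?? null; // follow the cursor until the dataset is exhausted
  }

  return people;
}

// Reconstruct school-specific rosters locally from each person's school_ids array.
function groupPeopleBySchool(people: Person[]): Map<string, Person[]> {
  const bySchool = new Map<string, Person[]>();
  for (const person of people) {
    for (const schoolId of person.school_ids) {
      const list = bySchool.get(schoolId) ?? [];
      list.push(person);
      bySchool.set(schoolId, list);
    }
  }
  return bySchool;
}
```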
Scheduling Your Syncs
Use a cron-based system, like the open-source `cron` utility or a cloud-native tool like Google Cloud Scheduler, to trigger your sync jobs.
Roster data does not change constantly. For most providers, Edlink syncs with the source of truth once every 24 hours. Scheduling your syncs to run every 6-12 hours is a reasonable starting point.
Be mindful of the load on your system. Increasing sync frequency from every 6 hours to every 4 hours represents a 50% increase in the computational load.
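For example, if your stack is Node-based, a scheduler as simple as the node-cron package can trigger the enqueue step every 6 hours; the helpers below are placeholders for your own code, and Google Cloud Scheduler or a plain crontab entry works equally well.

```typescript
import cron from 'node-cron';

// Placeholders for your own code (see the queueing sketch above).
declare function listIntegrationIds(): Promise<string[]>;
declare function enqueueSyncJob(integrationId: string): Promise<void>;

// At minute 0 of every 6th hour, enqueue one sync job per integration.
cron.schedule('0 */6 * * *', async () => {
  for (const integrationId of await listIntegrationIds()) {
    await enqueueSyncJob(integrationId);
  }
});
```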
Choosing a Sync Strategy: Full Sync vs. Events
While the Edlink Events API can seem like an efficient way to get data deltas, it introduces significant complexity. We almost always recommend a full sync strategy for its simplicity and robustness.
The Challenges of an Events-Based System
- State Management: Events are carefully ordered by dependency. Missing a single event or processing it out of order can lead to a corrupted state in your local database.
- Volume: There are often far more events than developers anticipate. Sequentially processing a high volume of events can become a performance bottleneck.
A Hybrid Approach for Higher Frequency Syncs
If you need to know if data has changed more frequently without the overhead of a full sync every time, consider this hybrid model:
- Perform a full sync for an integration. As part of this, fetch the ID of the most recent event and store it.
- On your next scheduled run, query the `/v2/graph/events` endpoint using the `$after` parameter with the last event ID you stored.
- If the query returns no new events, you can skip the full sync for this cycle.
- If the query returns one or more new events, trigger a new full sync and store the ID of the newest event.
This approach lets you check for changes frequently while only committing to the resource cost of a full sync when necessary.
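A minimal sketch of that check might look like the following. The `$data` envelope and oldest-first ordering of events are assumptions to verify against the Events API reference; `runFullSync` stands in for your own full-sync routine.

```typescript
// Sketch only: the $data envelope and oldest-first ordering of events are
// assumptions to verify against the Events API reference.
declare function runFullSync(accessToken: string): Promise<void>;

async function maybeSync(
  accessToken: string,
  lastEventId: string | null,
): Promise<string | null> {
  const url = new URL('https://ed.link/api/v2/graph/events');
  if (lastEventId) url.searchParams.set('$after', lastEventId);

  const response = await fetch(url, {
    headers: { Authorization: `Bearer ${accessToken}` },
  });
  const events: Array<{ id: string }> = (await response.json()).$data;

  if (events.length === 0) {
    return lastEventId; // nothing changed: skip the full sync this cycle
  }

  await runFullSync(accessToken); // something changed: run a full sync
  return events[events.length - 1].id; // store the newest event ID for next run
}
```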
Optimizing Compute: Database vs. Virtual Machine
Database computation is significantly more expensive than VM compute. Perform data comparison and transformation logic within your application workers, not in the database.
Our recommended workflow for a single integration job:
- Download all relevant data from the Edlink API (e.g., all people, classes, enrollments).
- Download all corresponding existing data from your database.
- In memory on the worker, compare the two datasets. Create three lists: entities to be created, entities to be updated, and entities to be deleted.
- Apply these changes to your database using batched writes.
We recommend that you do not use a queue to process individual database changes as this tends to be highly inefficient. Databases perform best when you batch operations. Sending 5,000 `UPDATE` statements in a single transaction is far more performant than processing 5,000 individual queue messages.
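The in-memory comparison in step 3 can be as simple as the sketch below. `Entity` is a placeholder for your own record shape, and the naive JSON-string comparison is exactly what the hashing technique in the next section improves on.

```typescript
// Sketch only: `Entity` is a stand-in for your own record shape, and the
// JSON-string comparison is deliberately naive.
interface Entity {
  id: string;
  [key: string]: unknown;
}

interface Diff {
  toCreate: Entity[];
  toUpdate: Entity[];
  toDelete: string[]; // ids present locally but no longer returned by Edlink
}

function diffEntities(remote: Entity[], local: Entity[]): Diff {
  const localById = new Map(local.map((e): [string, Entity] => [e.id, e]));
  const remoteIds = new Set(remote.map((e) => e.id));

  const toCreate: Entity[] = [];
  const toUpdate: Entity[] = [];

  for (const entity of remote) {
    const existing = localById.get(entity.id);
    if (!existing) {
      toCreate.push(entity);
    } else if (JSON.stringify(existing) !== JSON.stringify(entity)) {
      toUpdate.push(entity);
    }
  }

  const toDelete = local.filter((e) => !remoteIds.has(e.id)).map((e) => e.id);

  // Apply each list with batched writes (e.g., one multi-row INSERT/UPDATE/DELETE
  // per list, inside a single transaction) rather than row-by-row statements.
  return { toCreate, toUpdate, toDelete };
}
```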
Squeezing More Performance Out of Syncs
For applications with very large integrations, you can further optimize the process:
- Parallelize Data Types: For a single integration, you can sync different data types (people, classes, etc.) in parallel, so long as your worker has enough memory to hold each dataset for comparison.
- Use Large Page Sizes: When making paginated requests to the Edlink API, use the maximum page size (`?$first=10000`) to minimize the number of HTTP round-trips.
- Implement Object Hashing: To reduce the amount of data you pull from your own database, you can:
  - When you sync data from Edlink, compute a stable hash (e.g., a SHA-256 hash of the JSON-stringified object) for each entity.
  - Store this hash in a column next to the record in your database.
  - On the next sync, query only the `id` and `hash` for each record from your database, not the entire object.
  - Compute the hash for the incoming Edlink objects and compare it to the stored hash. If they match, you can skip the `UPDATE` operation for that record, as sketched below.
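A minimal sketch of that hashing step, using Node's built-in `crypto` module; the recursive key sorting is only there to make the hash stable regardless of key order.

```typescript
import { createHash } from 'node:crypto';

// Recursively sort object keys so the same data always stringifies the same way.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>)
        .sort(([a], [b]) => a.localeCompare(b))
        .map(([k, v]) => [k, canonicalize(v)]),
    );
  }
  return value;
}

// Stable SHA-256 hash of an entity, suitable for storing next to the record.
function hashEntity(entity: Record<string, unknown>): string {
  return createHash('sha256')
    .update(JSON.stringify(canonicalize(entity)))
    .digest('hex');
}

// During a sync, compare against the hash stored in your database and skip
// the UPDATE for any record whose hash has not changed.
function needsUpdate(
  entity: { id: string } & Record<string, unknown>,
  storedHashes: Map<string, string>, // id -> previously stored hash
): boolean {
  return storedHashes.get(entity.id) !== hashEntity(entity);
}
```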