Language Models and Data Integration within SchoolJoy

Language models know a lot about the world, but they are not all-knowing. For a model to answer questions about a specific district (e.g. individual students, school policies, district strategic plans), applications like SchoolJoy must supply that context, and must do so in a safe and secure environment with non-negotiable measures that keep interactions between the model and SchoolJoy’s data private. SchoolJoy ensures that no PII is released or exposed to the open web through its integration with OpenAI, which includes a data protection agreement that explicitly prevents OpenAI from using the data for anything other than the fulfillment of the services. As an added precaution, we also scrub student names before prompts are sent to OpenAI.
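To illustrate the name-scrubbing step, here is a minimal sketch in Python. It is not SchoolJoy's actual implementation: the function names `scrub_names` and `restore_names`, the `[STUDENT_n]` placeholder format, and the assumption that the list of student names is known in advance are all hypothetical choices made for this example.

```python
import re


def scrub_names(prompt: str, student_names: list[str]) -> tuple[str, dict[str, str]]:
    """Replace known student names with placeholder tokens (hypothetical scheme).

    Returns the scrubbed prompt and a token-to-name mapping that stays
    on our side and is never sent to the model provider.
    """
    mapping: dict[str, str] = {}
    scrubbed = prompt
    # Longest names first, so a full name is replaced before any shorter
    # name that happens to be contained in it.
    ordered = sorted(set(student_names), key=lambda n: (-len(n), n))
    for i, name in enumerate(ordered, start=1):
        token = f"[STUDENT_{i}]"
        pattern = re.compile(rf"\b{re.escape(name)}\b", flags=re.IGNORECASE)
        scrubbed, count = pattern.subn(token, scrubbed)
        if count:
            mapping[token] = name
    return scrubbed, mapping


def restore_names(response: str, mapping: dict[str, str]) -> str:
    """Put the real names back into the model's response before display."""
    for token, name in mapping.items():
        response = response.replace(token, name)
    return response
```

In this sketch the mapping is kept locally, so only the placeholder tokens leave the application and the real names are restored after the response comes back. A production version would also need to handle nicknames, misspellings, and names that collide with ordinary words.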

Within these security measures, we have engineered a series of tools and processes that extract and compile the information about students and schools needed to answer each question. The most exciting component of our product is its ability to understand the context of the question and the context of the data, and to bring the two together into a thoughtful, relevant, and comprehensive response.
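One way to picture "bringing the two together" is a small prompt-assembly step. The sketch below is illustrative only, under stated assumptions: the `ContextSnippet` type, the `build_prompt` function, the character budget, and the instruction wording are hypothetical and do not describe SchoolJoy's actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class ContextSnippet:
    source: str  # hypothetical label for where the record came from
    text: str    # record text, assumed already scrubbed of student names


def build_prompt(question: str, snippets: list[ContextSnippet], max_chars: int = 2000) -> str:
    """Combine a user question with compiled data into one prompt string."""
    lines: list[str] = []
    used = 0
    for snippet in snippets:
        entry = f"- ({snippet.source}) {snippet.text}"
        if used + len(entry) > max_chars:
            break  # stop before exceeding the (assumed) context budget
        lines.append(entry)
        used += len(entry)
    context = "\n".join(lines) if lines else "(no relevant records found)"
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The snippets are assumed to arrive ordered by relevance, so the budget cuts off the least relevant records first. Labeling each record with its source is one simple way to let the response stay traceable to the underlying data.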

The significance of a prompt-driven user interface is in its simplicity, and the simplest things are often the hardest. To produce a useful answer, we must solve four challenges:

  1. Data Completeness
  2. Data Recency
  3. Data Accuracy
  4. Data Relevancy

Optimizing each of these four dimensions requires a combination of integrations and native applications, so that we can bring together all the context necessary to answer the broadest possible range of questions. The scope of data we needed to capture extends beyond what already exists in legacy systems. It falls into three groups:

  1. Large swaths of data in legacy systems, which are challenging to extract and synchronize.
  2. A large amount of operational and student data in spreadsheets and documents.
  3. Even more data about students that is undocumented and exists only in the minds of parents and teachers.

Centralizing this information requires not only deep integration with systems, but also behavioral changes and incentives that lead stakeholders to document what they know, so that the data becomes more useful to all other stakeholders.

That is why a full-stack application must exist in parallel with any LLM integration. We scoped and prioritized our full-stack application by identifying critical and mandatory processes within a given district (or even a state), and enabling those processes to be managed within SchoolJoy (e.g. specialized graduation requirements). Because these processes are mandated, the data they generate gives us a strong answer to all four of the data challenges (completeness, recency, accuracy, and relevancy). Once the data hurdle is addressed, the integration with the LLM itself can be considered the "easy" part.

That said, we should not underestimate the ongoing challenge of monitoring and adjusting our models to ensure the highest quality of response for every question. It is also encouraging that we are maximizing the value of the model we already use: once we gain access to more powerful models, the marginal cost of bringing the value of a new model to our users is almost negligible.