Extending Open Source Models to 150K Context: The Technicalities
Releasing our Updated 70B and 180B Language Models with Faster Inference: All built on Open-Source

Falcon and Llama getting bigger!!
We're excited to announce the release of our next-generation language models, AiCon V2 and AiCon V3. Over the past few months, the open-source ecosystem for Large Language Models (LLMs) has advanced rapidly, with releases such as the Falcon models from TII (Technology Innovation Institute, Abu Dhabi) and Llama-3 from Meta. In this post, we share how we built long-context models on top of these open-source foundations, fine-tuning them to expand their context windows and make them suitable for enterprise use cases. The fine-tuning process used legally obtained, carefully filtered, and processed enterprise data. We're confident these advancements will unlock new possibilities and improve performance across a wide range of applications, especially in the enterprise sector, and we're eager to see how businesses and developers use AiCon V2 and AiCon V3 to build innovative solutions across industries.
Our smaller model, AiCon V2, is built on Meta's Llama-3 70B, while our larger model, AiCon V3, is built on TII's Falcon 180B. Both models have been further fine-tuned to reach a 150K-token context window using Position Interpolation together with a set of proprietary data recipes and system optimizations, including FlashAttention-2. Applying these techniques across both the training and inference stacks makes fine-tuning and inference with a 150K context window efficient, and it also lets users create their own fine-tuned 150K-context models and run inference on them efficiently. Access to these custom models is currently limited to select enterprise customers.
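To give a concrete sense of how a base checkpoint can be prepared for a longer window, here is a minimal sketch using the linear RoPE-scaling option in Hugging Face transformers; the model name, base context length, and scaling factor are illustrative placeholders, not our production training stack.

```python
# Minimal sketch: preparing a RoPE-based checkpoint for a longer context
# window via linear position interpolation, using the `rope_scaling`
# option in Hugging Face transformers. The model name, base context
# length, and scaling factor are illustrative placeholders.
from transformers import AutoConfig, AutoModelForCausalLM

BASE_CONTEXT = 8_192       # native context of the base checkpoint
TARGET_CONTEXT = 150_000   # desired extended context window

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-70B")
config.rope_scaling = {
    "type": "linear",
    "factor": TARGET_CONTEXT / BASE_CONTEXT,  # positions compressed by this factor
}
config.max_position_embeddings = TARGET_CONTEXT

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    config=config,
    torch_dtype="auto",
)
# The model is then further pre-trained / fine-tuned on long sequences so it
# adapts to the interpolated positions rather than using them zero-shot.
```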
These techniques allow us to develop models with significantly extended context windows that achieve a 96% recall rate when processing long-form content and conversations, without compromising time to first token, a crucial latency metric for language models. Our optimized models also demonstrate a 3x speed improvement over most open-source LLM providers, setting a new standard in the industry. By pushing the boundaries of context window size while maintaining high recall and fast processing, we're empowering developers and businesses to tackle more complex and nuanced language tasks with greater efficiency. This opens up new possibilities for applications such as document analysis, conversational AI, and content generation, where understanding and retaining long-range context is essential.
Our commitment to delivering state-of-the-art language models that surpass industry standards reflects our dedication to innovation and our mission to provide advanced tools for natural language processing. With these improved capabilities, users can anticipate quicker, more accurate, and contextually rich results, ultimately leading to more intelligent and seamless language interactions. Long-context models are already proving essential for document understanding, summarization, and enhanced generation. We're thrilled to share this pioneering work with the community and contribute to ongoing progress towards superior, longer-context models. You can sign up for API access at https://worqhat.com or try out our Chat Interface at https://worqhat.com/playground.

SuperLlama at WorqHat
Extending our Open Source Models to 150K Context: The Technicalities
Most open-source models have a context length limited to 4K tokens. To extend this to a remarkable 150K context, three crucial aspects must be addressed: modelling, data, and system optimizations.
Modelling Approach:
Here, we follow the approach outlined in Meta's recent Position Interpolation paper, using linear interpolation of position indices to extend the context length. This technique provides a powerful way to expand the context window of models that use rotary positional embeddings (RoPE). We take the currently available checkpoints and continue pre-training/fine-tuning them with linear interpolation applied.
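The core idea can be sketched in a few lines: positions in the extended window are rescaled into the range the model was originally trained on before the rotary angles are computed. The lengths, head dimension, and helper function below are illustrative, not our training code.

```python
# Sketch of linear position interpolation for rotary positional embeddings
# (RoPE): positions in the extended window are rescaled into the original
# training range before the rotary angles are computed.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles for a vector of (possibly fractional) positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions, inv_freq)  # shape: (seq_len, dim // 2)

orig_len, extended_len, head_dim = 4_096, 150_000, 128

positions = torch.arange(extended_len).float()
# Linear interpolation: squeeze extended positions back into [0, orig_len)
scaled_positions = positions * (orig_len / extended_len)

angles = rope_angles(scaled_positions, head_dim)
cos, sin = angles.cos(), angles.sin()  # applied to queries/keys as usual
```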
Data Selection:
Modelling alone isn't enough. The selection of data used to enhance the base model is crucial. Instead of simply fine-tuning with generic language datasets like Pile and RedPajama, as suggested in Meta's recent recipe, we recognize two vital factors that require careful consideration. Firstly, we require generic long-context language data for the model to effectively handle the interpolated positional embeddings. Secondly, we need instructional data to encourage the models to capitalize on the information within the long context. Additionally, we've gathered hundreds of thousands of multi-turn conversations from various enterprises. These conversations have undergone a rigorous PII (Personally Identifiable Information) redaction process and private data processing to ensure data privacy and compliance.
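As a toy illustration of the kind of redaction pass applied to such transcripts, the snippet below masks a few common PII patterns with regexes; the patterns and placeholder tokens are ours for illustration, and the production redaction pipeline is considerably more thorough.

```python
# Toy illustration of a PII redaction pass over conversation transcripts.
# The patterns and placeholder tokens are illustrative only.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach me at jane.doe@example.com or +1 (555) 123-4567."))
# -> "Reach me at [EMAIL] or [PHONE]."
```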
System Optimizations:
To support the extended context length, system optimizations play a vital role. We've implemented various techniques to enhance the efficiency and performance of the models when dealing with such large contexts. These optimizations ensure that the models can process and utilize the long context effectively, without compromising on speed or resource utilization.
The combination of these three elements—modelling, carefully curated data, and system optimizations—appears to be the key to building a long-context window model suitable for enterprise use cases. By leveraging open-source models, extending their context length, and fine-tuning them with a mix of generic long-context language data, instructional data, and enterprise-specific conversations, we can create powerful models that can handle complex, multi-turn interactions while maintaining data privacy and compliance.
Our Current Data Recipe:
Phase 1: Continued Pre-training: In the first phase of continued pre-training, our data mixture comprises 25% RedPajama Book, 25% RedPajama ArXiv (including abstracts), 25% other data from RedPajama, and 25% from the UL2 Oscar Data, which is part of OIG (Open-Instruction-Generalist). During this phase, we task the model with filling in missing chunks or completing the text. To enhance the model's long-context capabilities, we exclude sequences shorter than 2K tokens. The inclusion of UL2 Oscar Data encourages the model to effectively capture and model long-range dependencies.
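A minimal sketch of how such a weighted mixture with a minimum-length filter might be sampled is shown below; the dataset keys, the iterator interface, and the tokenizer are placeholders rather than our actual data pipeline.

```python
# Sketch of a Phase-1-style weighted data mixture with a minimum-length
# filter. The dataset keys, `datasets` iterators, and `tokenize` are
# placeholders, not our actual loaders.
import random

MIXTURE = {
    "redpajama_book": 0.25,
    "redpajama_arxiv": 0.25,
    "redpajama_other": 0.25,
    "ul2_oscar_oig": 0.25,
}
MIN_TOKENS = 2_048  # drop sequences shorter than 2K tokens

def sample_example(datasets, tokenize):
    """Pick a source by mixture weight, then resample until the example is long enough."""
    while True:
        source = random.choices(list(MIXTURE), weights=list(MIXTURE.values()))[0]
        example = next(datasets[source])
        if len(tokenize(example["text"])) >= MIN_TOKENS:
            return source, example
```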
Phase 2: Fine-tuning: In the second phase, we fine-tune the model to focus on its few-shot capacity with long contexts. The data mixture for this phase includes 20% Natural Instructions (NI), 20% Public Pool of Prompts (P3), 20% of the Collected Dataset Pile, 20% RedPajama Book, and 20% RedPajama ArXiv with abstracts. The incorporation of RedPajama Book and RedPajama ArXiv data helps mitigate forgetting during the fine-tuning process. We have decontaminated all data against HELM core scenarios, following a precise protocol to ensure data integrity and relevance.
To teach the model to leverage in-context examples effectively, we pack as many examples as possible into one 150K-token sequence. This approach allows the model to learn from a diverse set of examples and adapt to various contexts.
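One simple way to sketch this packing step is to greedily concatenate tokenized examples, separated by an end-of-sequence token, until the 150K budget is full; the budget and separator id below are illustrative.

```python
# Sketch of greedily packing tokenized few-shot examples into long training
# sequences, as described above. The budget and separator id are illustrative.
SEQ_LEN = 150_000

def pack_examples(tokenized_examples, eos_token_id):
    """Concatenate examples into sequences of at most SEQ_LEN tokens."""
    packed, current = [], []
    for tokens in tokenized_examples:
        if current and len(current) + len(tokens) + 1 > SEQ_LEN:
            packed.append(current)
            current = []
        current.extend(tokens)
        current.append(eos_token_id)  # separator between packed examples
    if current:
        packed.append(current)
    return packed
```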
System Optimizations: In addition to the data recipe, we have integrated FlashAttention-2 into the inference stack. This integration provides up to a 3x improvement in inference throughput compared to state-of-the-art models. By optimizing the inference process, we can achieve faster and more efficient model performance.
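For reference, one common way to enable FlashAttention-2 when serving a model through Hugging Face transformers is shown below (it requires the flash-attn package and a supported GPU); the model name is a placeholder, and this is not our exact serving code.

```python
# One common way to enable FlashAttention-2 in a transformers-based
# inference stack. Requires the flash-attn package and a supported GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B"  # placeholder model name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

prompt = "Summarize the following contract:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```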
The combination of the data recipe, which includes a mix of pre-training and fine-tuning data, along with the integration of FlashAttention-2, enables our models to excel in long-context scenarios while maintaining high performance and efficiency. This approach allows us to build models that can effectively handle complex, multi-turn interactions and provide accurate and contextually relevant responses.

The Comp scientist at work
Get started today!
To get started with WorqHat's LLM, follow these steps:
First, visit app.worqhat.com and sign up for an account. This will grant you access to the platform and its features. Once you have an account, explore the documentation available at docs.worqhat.com. Here, you'll find comprehensive guides and instructions for using both the inference and fine-tuning APIs. The documentation will provide you with the necessary information to integrate WorqHat's LLM into your applications and workflows.
If you're an enterprise customer with specific requirements, we offer additional options. We can create custom fine-tuned models based on both AiCon V2 and AiCon V3 deployments, letting you tailor the LLM to your specific domain or use case and improve its performance and relevance. For enterprises requiring dedicated resources and enhanced security, we also offer dedicated instances on the WorqHat Cloud, so your data and models stay isolated and you retain greater control over your deployment.
If you're interested in custom fine-tuned models or dedicated instances, reach out to our sales team; they'll be happy to discuss your requirements, provide more information, and guide you through the process.
We can’t wait to see all the amazing things you can build with WorqHat AI.
Till then, Happy Learning!!! Pika pika 。^‿^。
Sagnik
Co-Founder @ WorqHat