
Chat API Endpoint Quick Guide

Welcome to our guide on the Paradigm public API. This powerful interface provides developers with flexible access to our services, enabling seamless integration of our advanced features into your applications.

Last Updated: April 15, 2025

The Paradigm API complements the platform's visual interface. It enables the creation of intelligent applications that draw on the power of LLMs to interact with users in an intuitive way. It is an additional tool for developers who want to customize their LLM calls by tuning the parameters used, to call external tools, or to embed automatic LLM calls within existing business applications.

This article is an introduction to LLM calls using the Paradigm API. It relies on the open source OpenAI Python client, with which the API is compatible.

Getting started

Set up

First, set up your client with your API key and our endpoint URL. This connects your requests to our service.

Authentication: Please note that while exploring the documentation doesn't require authentication, you'll need valid API credentials to make actual requests to our services. If you do not have the rights to create an API key, ask your company admin to grant them to you; if you do, create a key from your Paradigm profile.

from openai import OpenAI as OpenAICompatibleClient
import os

# Get the API key from an environment variable
# (the variable name PARADIGM_API_KEY is only an example; use whichever name you set)
api_key = os.getenv("PARADIGM_API_KEY")
# Our API base URL
base_url = "https://paradigm.lighton.ai/api/v2"

# Configure the OpenAI-compatible client
client = OpenAICompatibleClient(api_key=api_key, base_url=base_url)

Crafting Messages

The core of your request is the messages array, where each item has a role ("system", "user", "assistant" or "tool") and the message content.

  • System: Sets the conversation's context or provides instructions to the AI, guiding its responses. Ideal for initializing the chat with necessary background or rules.
  • User: Inputs from the human user engaging with the AI. This role is used for the questions, statements, or commands that users input into the chat.
  • Assistant: Responses from the AI designed to assist, answer, or interact with the user. This role is for the AI-generated content in reply to the user or system prompts.
  • Tool: Outputs from external tools, used to provide the LLM with external information (see the sketch after the conversation example below).

You can have follow-up conversations by providing a list of messages, such as in the code below:

messages = [
    {"role": "system", "content": "You are a helpful AI assistant answering to user's question"},
    {"role": "user", "content": "Hello, my name is Tom, how are you?"},
    {"role": "assistant", "content": "I'm Tom, I'm here to help. What can I do for you?"},
    {"role": "user", "content": "What is my name?"},
]
response = client.chat.completions.create(
    model="alfred-4",
    messages=messages,
    temperature=0.7,
    max_tokens=150,
    stream=False
)
assistant_reply = response.choices[0].message.content
print(assistant_reply)
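
To illustrate the tool role mentioned above, here is a minimal sketch of a tool-calling round trip. The get_weather function, its schema and the returned JSON are purely hypothetical, and it assumes the model deployed on your instance supports tool calls; adapt it to the tools you actually expose.

# Hypothetical tool definition: the name, schema and result below are illustrative only
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What is the weather in Paris?"}]

# First call: the model may reply with a tool call instead of plain text
first = client.chat.completions.create(model="alfred-4", messages=messages, tools=tools)
tool_call = first.choices[0].message.tool_calls[0]  # may be None if the model answers directly

# Run the external tool yourself, then send its result back with the "tool" role
messages.append(first.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": '{"city": "Paris", "temperature_c": 18}',  # placeholder tool result
})

# Second call: the model now answers using the tool output
final = client.chat.completions.create(model="alfred-4", messages=messages, tools=tools)
print(final.choices[0].message.content)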

When using the Paradigm API, developers can customize several key parameters of LLMs to adapt the model's behavior to specific use cases.

These include:

  • temperature: controls the randomness of the output. Lower values (e.g. 0.2) make the response more focused and predictable, while higher values (e.g. 0.8 or 1.0) encourage more creative and diverse responses.
  • max_tokens: sets the maximum number of tokens the model can generate in the response. Useful to control the output length and avoid overly long answers.
  • top_p (nucleus sampling): limits the next-token choices to a cumulative probability p. For example, with top_p=0.9, only the most likely tokens that add up to 90% probability are considered. Often used as an alternative or complement to temperature.
  • frequency_penalty: applies a penalty to tokens that have already appeared, reducing repetition. Values range from 0.0 (no penalty) to 2.0 (strong penalty). Useful for preventing redundant answers.
  • presence_penalty: encourages the model to introduce new topics by penalizing tokens that have already been mentioned, increasing content diversity. Also ranges from 0.0 to 2.0.
  • stop: a list of token sequences where generation should stop. For example, setting stop=["\n\n", "User:"] can prevent the model from continuing into a new prompt or section.
  • Other parameters are also available; consult the OpenAI chat completions parameters for more information. A short example combining several of these options is shown below.
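
As a quick illustration, the request below combines several of the parameters above in a single call; the values are arbitrary and only meant to show how the options are passed.

# Illustrative only: arbitrary parameter values showing how the options above are combined
response = client.chat.completions.create(
    model="alfred-4",
    messages=[{"role": "user", "content": "List three ideas for a team name."}],
    temperature=0.4,          # fairly focused output
    max_tokens=100,           # cap the answer length
    top_p=0.9,                # nucleus sampling
    frequency_penalty=0.5,    # discourage repeated tokens
    presence_penalty=0.3,     # nudge toward new topics
    stop=["\n\n"],            # stop at the first blank line
)
print(response.choices[0].message.content)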

Streaming example

Streaming allows for immediate, incremental delivery of responses, perfect for live interactions. With `stream=True`, the API sends parts of the response as they're generated. The example below prints each part upon arrival.

response_with_streaming = client.chat.completions.create(
    model="alfred-4",
    messages=messages,
    temperature=0.7,
    max_tokens=150,
    stream=True
)

for chunk in response_with_streaming:
    # Each chunk carries an incremental piece of the answer; the final chunk has no content
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)
print("\nend of generation")

Best Practices

  • Keep Context Relevant: Only include messages that help the conversation.
  • Use Streaming for Live Chats: Set `stream=True` for ongoing interactions.
  • Match the Model to Your Infrastructure: The model name you request should be one actually deployed on your infrastructure.
  • Balance Temperature and Tokens: Adjust for creativity vs. precision and response length.

Conclusion

This guide is your starting point for integrating chat functionalities. With the right settings and understanding, you can craft engaging AI conversations. Happy coding!