API server tutorial#


AI Navigator is in beta testing. This documentation is currently under construction.

In this tutorial, you’ll learn to build a simple chatbot in Python that utilizes AI Navigator’s built-in API server to process natural language queries. You will use conda to establish a working environment to develop the chatbot, build the API calls for the chatbot application from snippets provided, interact with the chatbot at the command line, and view the API server logs to verify the application is functioning properly.


Setting up your environment#

When working on a new conda project, it is recommended that you create a new environment for development. Follow these steps to set up an environment for your chatbot:

  1. Open a terminal (Anaconda Prompt for Windows).


    This terminal can be opened from within an IDE (JupyterLab, PyCharm, VSCode, Spyder), if preferred.

  2. Create the conda environment for your chatbot development and install the packages you’ll need by running the following command:

    conda create -n chataconda python requests
  3. Activate your newly created conda environment by running the following command:

    conda activate chataconda

For more information and best practices for managing environments, see Environments.

Building the chatbot#

Below, you’ll find the necessary code snippets to build your chatbot, along with an explanation of each snippet to help you understand the functionality of the code.

Using your preferred IDE, create a new file on your machine, and name it chatter-max.py.

Importing libraries#

The application we are building is simple, so we are only importing the requests package, which enables Python to make HTTP requests to the API server and receive responses.

Make this the first line of code in your chatter-max.py file:

import requests

Setting the base_url#

For your application to process natural language inputs, run server health checks, and perform other actions programmatically, you must structure it to interact with the API server and its endpoints.

The URLs for these API endpoints are constructed by combining a base_url with a specific /endpoint for each function. The base_url is built from the Server Address and Server Port specified in AI Navigator, like this: http://<SERVER_ADDRESS>:<SERVER_PORT>.

Set the base_url to point to the default server address by adding the following line to your file.

base_url = 'http://localhost:8080'


localhost and 127.0.0.1 are semantically identical.
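As a quick illustration, the base URL can be assembled from the two values shown in AI Navigator. The helper name here is hypothetical and not part of the tutorial's code:

```python
def make_base_url(server_address: str, server_port: int) -> str:
    # Combine the Server Address and Server Port from AI Navigator
    # into the http://<SERVER_ADDRESS>:<SERVER_PORT> form used below.
    return f"http://{server_address}:{server_port}"

base_url = make_base_url("localhost", 8080)
print(base_url)  # http://localhost:8080
```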

Adding the API calls#

The most common API endpoints are described in this tutorial. For a full list of API endpoints and detailed information on how to use them effectively, see the official llama.cpp HTTP server documentation.

To enable your application to communicate with the API server, you must implement functions that make API calls in a way that the server can understand.

GET /health#

Before sending any requests to the server, it’s wise to verify that the server is operational. This function sends a GET request to the /health endpoint and returns a JSON response that tells you the server’s status.

Add the following lines to your chatter-max.py file:

def get_server_health():
    response = requests.get(f'{base_url}/health')
    return response.json()
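Note that requests.get raises an exception if nothing is listening on the port yet. If you want the health check to fail gracefully while the server is still starting up, a defensive variant might look like this (a sketch, not part of the tutorial's code; the 'unreachable' status is an assumption of this example, not a value returned by the server):

```python
import requests

base_url = 'http://localhost:8080'

def get_server_health_safe():
    # Variant of get_server_health that reports an unreachable server
    # instead of raising when the connection fails.
    try:
        response = requests.get(f'{base_url}/health', timeout=5)
        return response.json()
    except requests.exceptions.RequestException:
        return {'status': 'unreachable'}
```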

POST /completion#

To interact with the model, you need a function that calls the server’s /completion endpoint. This function sends the user input to the model loaded into the API server and receives a generated response.

The prompt construction here provides context that sets the tone for how you would like the model to respond to your users. In essence, this is the model’s initial prompt. We’ll revisit this later.

The separation of User: and Assistant: inputs onto new lines, delineated by their respective labels, helps the model distinguish between the parts of the dialogue. Without this distinction, the model assumes that the user wants it to complete their input rather than respond to it.
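To see this layout concretely, here is a standalone snippet (the sample strings are illustrative):

```python
# Each turn sits on its own line, labeled so the model responds to the
# user rather than completing the user's sentence.
context = "You are a friendly AI assistant."
user_input = "What is conda?"
prompt = f"{context}\nUser: {user_input}\nAssistant:"
print(prompt)
# You are a friendly AI assistant.
# User: What is conda?
# Assistant:
```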

The data dictionary is a structured collection of parameters that control how the AI model generates responses based on the user’s input. These parameters dictate the model’s behavior during the completion process. This is converted to JSON and sent as the body of the request.

Add the following lines to your chatter-max.py file:

def post_completion(context, user_input):
    prompt = f"{context}\nUser: {user_input}\nAssistant:"
    data = {
        'prompt': prompt,
        'temperature': 0.8,
        'top_k': 35,
        'top_p': 0.95,
        'n_predict': 400,
        'stop': ["</s>", "Assistant:", "User:"]
    }
    headers = {'Content-Type': 'application/json'}
    response = requests.post(f'{base_url}/completion', json=data, headers=headers)
    if response.status_code == 200:
        return response.json()['content'].strip()
    else:
        return "Error processing your request. Please try again."

After each interaction, you’ll want to update the conversation’s context to help the model produce coherent dialogue. This function updates the value of context by appending the latest user input and the assistant’s response, keeping the model engaged in the conversation.

Add the following lines to your chatter-max.py file:

def update_context(context, user_input, assistant_response):
    return f"{context}\nUser: {user_input}\nAssistant: {assistant_response}"
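A quick standalone check (the sample strings are illustrative) shows how the transcript grows with each turn:

```python
def update_context(context, user_input, assistant_response):
    return f"{context}\nUser: {user_input}\nAssistant: {assistant_response}"

# Each call appends one full exchange to the running transcript.
ctx = "You are a friendly AI assistant."
ctx = update_context(ctx, "Hi!", "Hello! How can I help you today?")
print(ctx)
# You are a friendly AI assistant.
# User: Hi!
# Assistant: Hello! How can I help you today?
```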

Constructing the chat function#

The main function initiates the chatbot, handles user inputs, and manages the flow of the conversation. This is where you set the initial value for context.


Play around with the context to see how it impacts the responses you receive from your model!

Add the following lines to your chatter-max.py file:

def main():
    context = "You are a friendly AI assistant designed to provide helpful, succinct, and accurate information."

    health = get_server_health()
    print('Server Health:', health)

    if health.get('status') == 'ok':
        while True:
            user_input = input("Enter a prompt or type 'exit' to quit: ")
            if user_input.lower() == 'exit':
                break
            assistant_response = post_completion(context, user_input)
            print('Assistant:', assistant_response)

            context = update_context(context, user_input, assistant_response)
    else:
        print("Server is not ready for requests.")

if __name__ == "__main__":
    main()

Interacting with the API server#

With your chatbot constructed, it’s time to take your model for a test run!

  1. Open AI Navigator and load a model into the API server.

  2. Leave the Server Address and Server Port at the default values and click Start.

  3. Open a terminal and navigate to the directory where you stored your chatter-max.py file.

  4. Initiate the chatbot by running the following command:

    python chatter-max.py
  5. View the AI Navigator API server logs. If everything is set up correctly, the server logs will populate with traffic from your chatbot application, starting with a health check and the initial context prompt for the model.

Having some fun with the model#

Try adjusting the following parameters for the /completion endpoint’s data dictionary to see how they affect the output from the model.


Adjusting the temperature of your model increases or decreases the randomness of the responses you receive from your prompts. Higher values (for example, 1.0) make the output more free-flowing and creative. Lower values (for example, 0.2) make the output more deterministic and focused. Defaults to 0.8.


Limiting the top_k parameter confines the model’s response to the k most probable tokens. Lowering the available tokens is like limiting the words the model can choose from when guessing which word comes next. top_k defaults to 40. Try setting top_k to higher and lower values to see how the model responds to the same prompt.


Limits token selection to the subset of tokens whose cumulative probability exceeds a threshold, balancing creativity with coherence. Higher values allow the model to provide more creative responses, while lower values enhance focus. Adjust top_p to see how it affects the model’s descriptiveness for the same prompt. Defaults to 0.95.


Set stream to true to see the model’s response as it arrives, token by token. Streaming is set to false by default.
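One way to experiment is to collect these parameters in a small helper so you can compare settings side by side. The build_payload helper below is a hypothetical name for this sketch, not part of the tutorial's code:

```python
def build_payload(prompt, temperature=0.8, top_k=40, top_p=0.95, stream=False):
    # Gather the tunable sampling parameters into one request body,
    # mirroring the data dictionary used by post_completion.
    return {
        'prompt': prompt,
        'temperature': temperature,
        'top_k': top_k,
        'top_p': top_p,
        'stream': stream,
        'n_predict': 400,
        'stop': ["</s>", "Assistant:", "User:"],
    }

# More random and wide-ranging vs. more deterministic and focused:
creative = build_payload("User: Tell me a story.\nAssistant:", temperature=1.0, top_k=100)
focused = build_payload("User: Tell me a story.\nAssistant:", temperature=0.2, top_k=10)
```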

Next steps#

You can continue to develop and extend this chatbot by including other endpoints for more advanced usage, like tokenization or slot management, or you can delete this file and clean up your conda environment by running the following commands:

conda deactivate
conda remove -n chataconda --all