API References
LLMlight.
Name: LLMlight.py
Author: E. Taskesen
Contact: erdogant@gmail.com
Github: https://github.com/erdogant/LLMlight
Licence: See licences
- class LLMlight.LLMlight.LLMlight(model: str = None, retrieval_method: (None, <class 'str'>) = 'naive_rag', embedding: (<class 'str'>, <class 'dict'>) = {'context': 'bert', 'memory': 'memvid'}, context_strategy: str = None, alpha: float = 0.05, top_chunks: int = 5, temperature: (<class 'int'>, <class 'float'>) = 0.7, top_p: (<class 'int'>, <class 'float'>) = 1.0, chunks: dict = {'method': 'chars', 'overlap': 200, 'size': 1000}, n_ctx: int = 4096, file_path: str = None, endpoint: str = 'http://localhost:1234/v1/chat/completions', verbose: (<class 'str'>, <class 'int'>) = 'info')
Large Language Model Light.
Run your LLM models locally and with minimal dependencies.
1. Go to LM-studio.
2. Go to the left panel and select developer mode.
3. At the top, select your model of interest.
4. Go to settings in the top bar.
5. Enable "server on local network" if needed.
6. Enable Running.
How LLMlight Works
LLMlight processes text through several key stages to generate intelligent responses:
1. Context strategy
The input context can be processed in different ways:
- No context strategy: uses the raw context directly
- Chunk-wise processing: breaks the context into manageable chunks, processes each chunk independently, and combines the results
- Global reasoning: creates a global summary of the context before processing
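A minimal sketch of selecting a context strategy at initialization; the values follow the context_strategy parameter documented below:

>>> from LLMlight import LLMlight
>>> # No context strategy: the raw context is passed through unchanged
>>> client = LLMlight(context_strategy=None)
>>> # Chunk-wise: analyze each chunk independently and combine the results
>>> client = LLMlight(context_strategy='chunk-wise')
>>> # Global reasoning: summarize the context per chunk before answering
>>> client = LLMlight(context_strategy='global-reasoning')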
2. Retrieval Method Stage
Three main approaches for retrieving relevant information:
- Naive RAG: splits text into chunks and uses similarity scoring to find the most relevant sections
- RSE (Relevant Segment Extraction): identifies and extracts complete relevant text segments
- No retrieval: uses the entire context directly
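The approach is selected with the retrieval_method parameter; a brief sketch based on the values documented below:

>>> from LLMlight import LLMlight
>>> # Naive RAG: chunk the context and rank chunks by cosine similarity
>>> client = LLMlight(retrieval_method='naive_rag', top_chunks=5)
>>> # RSE: extract entire relevant segments
>>> client = LLMlight(retrieval_method='RSE')
>>> # No retrieval: the entire context is used directly
>>> client = LLMlight(retrieval_method=None)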
3. Embedding Stage
Multiple embedding options for text representation:
- TF-IDF: best for structured documents with matching query terms
- Bag of Words: simple word-frequency approach
- BERT: advanced contextual embeddings for free-form text
- BGE-small: efficient embedding model for general use
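The embedding is set with the embedding parameter, either as a single string or as a dictionary that assigns separate embeddings to the context and the video memory; a short sketch:

>>> from LLMlight import LLMlight
>>> # TF-IDF for structured documents with matching query terms
>>> client = LLMlight(embedding='tfidf')
>>> # BERT for free-form text
>>> client = LLMlight(embedding='bert')
>>> # Separate embeddings for context and video memory
>>> client = LLMlight(embedding={'context': 'bert', 'memory': 'memvid'})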
4. Prompting Stage
The system constructs prompts by combining:
- System message: defines the AI's role and behavior
- Context: processed and retrieved relevant information
- User query: the specific question or request
- Instructions: additional guidance for response generation
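These ingredients map directly onto the arguments of the prompt() method documented below; a minimal sketch:

>>> from LLMlight import LLMlight
>>> client = LLMlight()
>>> response = client.prompt(query='What is the capital of France?',
...                          context='Paris is the capital and most populous city of France.',
...                          instructions='Answer the question strictly based on the provided context.',
...                          system='You are a helpful assistant.')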
5. Response Generation
The model generates responses using:
- Temperature control: adjusts response randomness (default 0.7)
- Top-p sampling: controls response diversity
- Context window management: handles token limits efficiently

The system can be configured through these parameters to optimize for different use cases, from simple Q&A to complex document analysis.
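These settings can be given at initialization or overridden per prompt() call; a brief sketch using the documented defaults:

>>> from LLMlight import LLMlight
>>> # Defaults: temperature=0.7, top_p=1.0, n_ctx=4096
>>> client = LLMlight(temperature=0.7, top_p=1.0, n_ctx=4096)
>>> # Override per call: lower temperature for a more deterministic answer
>>> response = client.prompt('Summarize the context in one sentence.', temperature=0.2, top_p=0.9)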
Processing Flow
The system follows a sequential processing flow where each stage builds upon the previous one. First, the input context undergoes the context strategy, where it is either used as-is or transformed into chunks for more manageable processing. These chunks are then passed through the retrieval method stage, which determines how relevant information is extracted and organized.

During the embedding stage, the text is converted into numerical representations that capture its semantic meaning. This is crucial for the system to understand and process the content effectively, and the embedding method chosen can significantly impact the system's ability to match queries with relevant content. The prompting stage brings together all the processed information, combining it with the user's query and any specific instructions into a comprehensive prompt that guides the model. The final response generation stage uses this prompt to create a coherent and relevant output, with parameters like temperature and top-p sampling controlling the response's characteristics.

Throughout this process, the system maintains flexibility through various configuration options, allowing it to adapt to different types of queries and contexts. This modular approach enables the system to handle everything from simple questions to complex document analysis tasks efficiently.
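Putting the stages together, a minimal end-to-end sketch; the document file name is a placeholder:

>>> from LLMlight import LLMlight
>>> # Configure retrieval, embedding, context strategy, and generation
>>> client = LLMlight(retrieval_method='naive_rag',
...                   embedding='bert',
...                   context_strategy=None,
...                   temperature=0.7,
...                   top_p=1.0,
...                   n_ctx=4096)
>>> large_text = open('my_document.txt').read()  # placeholder document
>>> response = client.prompt('What are the key findings?',
...                          context=large_text,
...                          instructions='Answer strictly based on the provided context.',
...                          system='You are a helpful assistant.')
>>> print(response)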
- param model:
Name of the model, e.g. 'mistralai/mistral-small-3.2', 'qwen/qwen3-coder-30b', 'openai/gpt-oss-20b'.
- type model:
str
- param system:
String of the system message. "I am a helpful assistant"
- type system:
str
- param retrieval_method:
None: No processing; the entire context is used for the query.
'naive_rag': The context is processed using the naive RAG approach. Ideal for chats and for answering specific questions: chunks of text are created and ranked using cosine similarity. The top-scoring chunks (top_chunks) are combined and used, together with the prompt, as input.
'RSE': The context is processed using the RSE approach: identify and extract entire segments of relevant text.
- type retrieval_method:
str (default: ‘naive_rag’)
- param embedding:
Specify the embedding. When using both video memory and context, it can be specified with a dictionary: {'memory': 'memvid', 'naive_rag': 'bert'}.
None: No embedding is performed.
'automatic': {'memory': 'memvid', 'naive_rag': 'bert'}
'memvid': Can only be applied when using video memory in the retrieval method.
'tfidf': Best used for structured documents where the words in the query match those in the document.
'bow': Bag-of-words approach. Best used when you expect the words in the document and the query to match.
'bert': Best used when the document is free text and the query may not exactly match the words or sentences in the document.
'bge-small': Efficient embedding model for general use.
- type embedding:
str, dict (default: ‘automatic’)
- param context_strategy:
None: No pre-processing. The original context is used in the pipeline of retrieval_method, embedding, and the response.
'chunk-wise': The input context is analyzed chunk-wise based on the query, instructions, and system. The total set of answered chunks is then returned. The normal pipeline proceeds for the query, instructions, system, etc.
'global-reasoning': The input context is summarized per chunk globally. The total set of summarized context is then returned. The normal pipeline proceeds for the query, instructions, system, etc.
- type context_strategy:
str (default: None)
- param temperature:
Sampling temperature. 0: deterministic; 0.7: default; 1: maximally stochastic.
- type temperature:
float, optional
- param top_p:
Top-p (nucleus) sampling parameter (default is 1.0, no filtering).
- type top_p:
float, optional
- param chunks:
'method': 'chars' or 'words'. Chunks are created using characters or words.
'size': Chunk length in chars or words. Accuracy increases with smaller chunk sizes, but smaller chunks also reduce the input context for the LLM. Estimates: 1000 words (~10,000 chars) cost ~3000 tokens. With a context window (n_ctx) of 4096 you can, for example, keep size=1000 chars with top_chunks=5 (roughly 1500 tokens of context) and leave room for the instructions, system message, and query.
'overlap': Overlap between chunks.
'top_chunks': Number of top chunks retrieved when performing RAG analysis.
- type chunks:
dict : {‘method’: ‘chars’, ‘size’: 1000, ‘overlap’: 250, ‘top_chunks’: 5}
- param n_ctx:
The context window length is determined by the maximum number of tokens. A larger number of tokens requires more CPU/GPU resources. Estimates: 1000 words (~10,000 chars) cost ~3000 tokens.
- type n_ctx:
int, default: 4096
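A back-of-the-envelope sketch of the token budget implied by these estimates; the per-token figures are the rough estimates from this docstring, not exact tokenizer counts:

>>> # Rough estimate: 1000 words (~10,000 chars) ~ 3000 tokens, i.e. ~300 tokens per 1000 chars
>>> n_ctx = 4096                        # context window
>>> tokens_per_1000_chars = 300
>>> chunk_size, top_chunks = 1000, 5    # default chunk settings ('chars')
>>> context_tokens = top_chunks * chunk_size // 1000 * tokens_per_1000_chars
>>> remaining = n_ctx - context_tokens  # left for instructions, system, query, and the answer
>>> print(context_tokens, remaining)
1500 2596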
- param file_path:
Local or absolute path to your (video) memory file, e.g. 'knowledge_base.mp4'.
- type file_path:
str
- param endpoint:
Endpoint of the LLM API.
- type endpoint:
str
Examples
>>> # Examples
>>> from LLMlight import LLMlight
>>> client = LLMlight()
>>> client.prompt('hello, who are you?')
>>> system_message = "You are a helpful assistant."
>>> response = client.prompt('What is the capital of France?', system=system_message, top_p=0.9)
>>> print(response)
- check_logger()
Check the verbosity.
- chunk_wise(query, context, instructions, system, top_chunks=0, return_per_chunk=False, stream=False)
Chunk-wise.
1. Break the document into chunks with overlapping parts to make sure nothing is missed.
2. Include the last two results in the prompt as context.
3. Analyze each chunk separately, following the instructions and system messages, together with the last two results.
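A hedged usage sketch; the query, instructions, and document are placeholders:

>>> from LLMlight import LLMlight
>>> client = LLMlight(context_strategy='chunk-wise')
>>> large_text = open('report.txt').read()  # placeholder document
>>> answers = client.chunk_wise(query='List all action points.',
...                             context=large_text,
...                             instructions='Extract the action points from this chunk only.',
...                             system='You are a helpful assistant.',
...                             return_per_chunk=True)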
- compute_context_strategy(query, context, instructions, system)
- compute_probability(query, scores, embedding, n=5000)
- fit_transform(query, chunks, embedding=None)
Converts context chunks and query into vector space representations based on the selected embedding method.
- get_available_models(validate=False)
Retrieve available models from the configured API endpoint.
Optionally validates each model by sending a test prompt and filtering out models that return a 404 error or similar failure response.
- Parameters:
validate (bool, optional) – If True, each model is tested with a prompt to ensure it can respond correctly. Models that fail validation (e.g., return a 404 error) are excluded from the result.
- Returns:
A list of model identifiers (e.g., “llama3”, “gpt-4”) that are available and valid.
- Return type:
list of str
Examples
>>> # Import library
>>> from LLMlight import LLMlight
>>> # Initialize
>>> client = LLMlight(endpoint='http://localhost:1234/v1/chat/completions')
>>> # Get models
>>> models = client.get_available_models(validate=False)
>>> # Print
>>> print(models)
['llama3', 'mistral-7b']
Notes
Requires an accessible endpoint and valid API response.
Relies on the LLMlight class for validation (must be importable).
- get_full_path(filepath: str) → str | None
- global_reasoning(query, context, instructions, system, return_per_chunk=False, rewrite_query=False, stream=False)
Global Reasoning.
1. Rewrite the input user question into something like: "Based on the extracted summaries, does the document explain the societal relevance of the research? Justify your answer."
2. Break the document into manageable chunks with overlapping parts to make sure nothing is missed.
3. Create a global reasoning question based on the input user question.
4. Take the summarized outputs and aggregate them.
>>> prompt = "Is the proposal well thought out?"
>>> instructions = "Your task is to rewrite questions for global reasoning. As an example, if there is a question like: 'Does this document section explain the societal relevance of the research?', the desired output would be: 'Does this document section explain the societal relevance of the research? If so, summarize it. If not, return: No societal relevance found.'"
>>> response = model.llm.prompt(query=prompt, instructions=instructions, task='Task')
- memory_add(text: str | List[str] = None, files: str | List[str] = None, dirpath: str = None, filetypes: List[str] = ['.pdf', '.txt', '.epub', '.md', '.doc', '.docx', '.rtf', '.html', '.htm'], chunk_size: int = 512, chunk_overlap: int = 100, overwrite=True)
Add chunks to memory.
- Parameters:
files ((str, list)) – Path to file(s).
- memory_chunks(n=10, return_type='disk')
Return the top n memory stack.
- Parameters:
n (int, optional) – Top n chunks to be returned. The default is 10.
return_type (str, optional) – Retrieve chunks from 'memory' or from 'disk'. The default is 'disk'.
- Returns:
chunks – Top n returned chunks.
- Return type:
list
- memory_init(file_path: str = None, config: dict = None, embedding=None)
Build QR code video and index from chunks with unified codec handling.
- Parameters:
file_path (str) – Path to output video memory file.
config (dict) – Dictionary containing configuration parameters.
- memory_load(file_path: str = None, config: dict = None)
- memory_save(file_path: str = None, codec: str = 'mp4v', auto_build_docker: bool = False, allow_fallback: bool = True, overwrite: bool = True, show_progress: bool = True)
Build QR code video and index from chunks with unified codec handling.
- Parameters:
file_path (str, optional) – Path to the output video memory file. Defaults to the memory path set at initialization.
codec (str, optional) – Video codec ('mp4v', 'h265', 'h264', etc.). The default is 'mp4v'.
auto_build_docker (bool (default: False)) – Whether to auto-build Docker if needed.
allow_fallback (bool) – Whether to fall back to MP4V if the advanced codec fails.
show_progress (bool (default: True)) – Whether to show progress while building.
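A hedged sketch of a possible video-memory workflow built from the memory_* methods above; the call order and file names are assumptions, not a verified recipe:

>>> from LLMlight import LLMlight
>>> client = LLMlight(embedding={'context': 'bert', 'memory': 'memvid'})
>>> # Initialize the video memory file (path is a placeholder)
>>> client.memory_init(file_path='knowledge_base.mp4')
>>> # Add documents and free text to the memory
>>> client.memory_add(files=['report.pdf'], text=['Some additional notes.'])
>>> # Build the QR-code video and index on disk
>>> client.memory_save(codec='mp4v', overwrite=True)
>>> # Inspect the top chunks currently stored
>>> chunks = client.memory_chunks(n=10, return_type='disk')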
- prompt(query: str, instructions: str = None, system: str = None, context: str = None, response_format=None, temperature: (<class 'int'>, <class 'float'>) = None, top_p: (<class 'int'>, <class 'float'>) = None, stream: bool = False, return_type: str = 'string', verbose=None)
Run the model with the provided parameters.
The final prompt is created based on the query, instructions, and the context
- Parameters:
query (str) – The question or query. "What is the capital of France?"
context (str) – Large text string that will be chunked, and embedded. The answer for the query is based on the chunks.
instructions (str) – Set your instructions. “Answer the question strictly based on the provided context.”
system (str, optional) – Optional system message to set context for the AI (default is None). "You are a helpful assistant."
temperature (float, optional) – Sampling temperature (default is 0.7).
top_p (float, optional) – Top-p (nucleus) sampling parameter (default is 1.0, no filtering).
stream (bool, optional) – Whether to enable streaming (default is False).
return_type (str, optional) – Format of the returned result.
'max': Return the full json output.
'dict': Convert the json into a dictionary.
'string': Return only the string answer (thinking strings are removed using the <think> </think> tags).
'string_with_thinking': Return the full response, which includes the thinking process (if available).
Examples
>>> # Examples
>>> from LLMlight import LLMlight
>>> client = LLMlight()
>>> client.prompt('hello, who are you?')
>>> system_message = "You are a helpful assistant."
>>> response = client.prompt('What is the capital of France?', system=system_message, top_p=0.9)
>>> print(response)
- Returns:
The model’s response or an error message if the request fails.
- Return type:
str
- read_pdf(file_path, title_pages=[1, 2], body_pages=[], reference_pages=[-1], return_type='str')
Reads a PDF file and extracts its text content as a string.
- Args:
file_path (str): Path to the PDF file.
- Returns:
str: Extracted text from the PDF (when return_type='str'). dict: Dictionary with the extracted text (when return_type='dict').
- relevant_context_retrieval(query, context: str, return_type='list')
- relevant_memory_retrieval(query: str, return_type='list')
- requests_post(headers, data, stream=False, return_type='string')
Create the request to the LLM.
- requests_post_gguf(prompt, system, temperature=0.8, top_p=1, headers=None, task='max', stream=False, return_type='string')
- requests_post_http(prompt, system, temperature=0.8, top_p=1, headers=None, task='max', stream=False, return_type='string', max_tokens=None)
- search(query: str, chunks: list, return_type: str = 'string', top_chunks: int = None, embedding: str = None)
Splits large text into chunks and finds the most relevant ones.
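A brief usage sketch, assuming the chunks were created beforehand; the chunk texts are placeholders:

>>> from LLMlight import LLMlight
>>> client = LLMlight()
>>> chunks = ['Paris is the capital of France.', 'The Rhine flows through Germany.']  # placeholder chunks
>>> best = client.search('What is the capital of France?', chunks=chunks, top_chunks=1, return_type='string')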
- set_prompt(query: str, instructions: str, context: (<class 'str'>, <class 'list'>), response_format: str = None)
- summarize(query='Extract key insights while maintaining coherence of the previous summaries.', instructions='Extract key insights from the **new text chunk** while maintaining coherence with **Previous summaries', system='You are a professional summarizer with over two decades of experience. Your strength is that you know how to deal with partial and incomplete texts but you do not make up new stuff. Keep the focus on the original input.', response_format='**Make a comprehensive, structured document covering all key insights**', context=None, return_type='string')
Summarize large documents iteratively while maintaining coherence across text chunks.
This function splits the input text into smaller chunks and processes each part in sequence. For every chunk, it generates a partial summary while incorporating the context of the previous summaries. After all chunks have been processed, the function combines the partial results into a final, coherent, and structured summary.
- Parameters:
query (str, optional) – The guiding task or question for summarization (default extracts key insights).
instructions (str, optional) – Additional instructions for the summarizer, tailored to each chunk.
system (str) – System message that sets the role and behavior of the summarizer.
response_format (str, optional) – Defines the format of the final output (default is a structured document).
context (str or dict, optional) – Input text or structured content to be summarized. If None, uses self.context.
return_type (str, optional) – Format of the returned result (default “string”).
- Returns:
A comprehensive, coherent summary that integrates insights across all chunks.
- Return type:
str
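A minimal usage sketch; the input file name is a placeholder:

>>> from LLMlight import LLMlight
>>> client = LLMlight()
>>> large_text = open('long_report.txt').read()  # placeholder document
>>> summary = client.summarize(context=large_text)
>>> print(summary)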
- LLMlight.LLMlight.compute_max_tokens(used_tokens, n_ctx=4096, task='max')
Compute the maximum number of tokens that can be generated for a given task, taking into account the number of tokens already used and the model’s context window.
- Parameters:
used_tokens (int) – Number of tokens already consumed in the current context.
n_ctx (int, optional) – Total context window size of the model (default is 4096 tokens).
task (str, optional) – Type of generation task. Determines the proportion of the remaining tokens to use. Options are:
- 'summarization': Use up to 50% of the context window, minimum 128 tokens.
- 'chat': Use up to 60% of the context window, minimum 128 tokens.
- 'code': Use up to 75% of the context window, minimum 128 tokens.
- 'longform': Use up to 90% of the context window, minimum 256 tokens.
- 'max': Use all remaining tokens.
Any unrecognized task defaults to a safe fallback using 50% of the context window.
- Returns:
max_tokens – Maximum number of tokens that can be generated for the specified task, ensuring at least a minimum number of tokens as defined per task type.
- Return type:
int
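A brief usage sketch based on the documented signature; the exact rounding inside the function is not shown here:

>>> from LLMlight.LLMlight import compute_max_tokens
>>> # 3000 tokens already consumed in a 4096-token window
>>> compute_max_tokens(used_tokens=3000, n_ctx=4096, task='max')   # all remaining tokens
>>> compute_max_tokens(used_tokens=3000, n_ctx=4096, task='chat')  # up to 60% of the window, minimum 128 tokens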
- LLMlight.LLMlight.compute_tokens(string, n_ctx=4096, task='max')
- LLMlight.LLMlight.convert_messages_to_model(messages, model='llama', add_assistant_start=True)
Builds a prompt in the appropriate format for different models (LLaMA, Grok, Mistral).
- Args:
messages (list of dict): Each dict must have 'role' ('system', 'user', 'assistant') and 'content'.
model (str): The type of model to generate the prompt for ('llama', 'grok', or 'mistral').
add_assistant_start (bool): Whether to add the assistant start (default True).
add_bos_token (bool): Helps models know it is a fresh conversation. Useful for llama/mistral/hermes-style models.
- Returns:
str: The final prompt string in the correct format for the given model.
- Example:
>>> messages = [
...     {"role": "system", "content": "You are a helpful assistant."},
...     {"role": "user", "content": "What is the capital of France?"}
... ]
>>> prompt = convert_messages_to_model(messages, model='llama')
>>> print(prompt)
- LLMlight.LLMlight.convert_verbose_to_new(verbose)
Convert old verbosity to the new.
- LLMlight.LLMlight.disable_tqdm()
Disable the tqdm progress bar.
- LLMlight.LLMlight.get_embeddings()
- LLMlight.LLMlight.get_logger()
- LLMlight.LLMlight.load_local_gguf_model(model_path: str, n_ctx: int = 4096, n_threads: int = 8, n_gpu_layers: int = 0, verbose: bool = True) → Llama
Loads a local GGUF model using llama-cpp-python.
- Args:
model_path (str): Path to the .gguf model file.
n_ctx (int): Maximum context length. Default is 4096.
n_threads (int): Number of CPU threads to use. Default is 8.
n_gpu_layers (int): Number of layers to offload to GPU (if available). Default is 0.
verbose (bool): Whether to print status info.
- Returns:
Llama: The loaded Llama model object.
- Example:
>>> model_path = r'C://Users//beeld//.lmstudio//models//NousResearch//Hermes-3-Llama-3.2-3B-GGUF//Hermes-3-Llama-3.2-3B.Q4_K_M.gguf'
>>> llm = load_local_gguf_model(model_path, verbose=True)
>>> prompt = "<start_of_turn>user\nWhat is 2 + 2?\n<end_of_turn>\n<start_of_turn>model\n"
>>> response = llm(prompt=prompt, max_tokens=20, stop=["<end_of_turn>"])
>>> print(response["choices"][0]["text"].strip())
'4'
- LLMlight.LLMlight.set_logger(verbose: [<class 'str'>, <class 'int'>] = 'info')
Set the logger for verbosity messages.
- Parameters:
verbose ([str, int], default is 'info' or 20) – Set the verbose messages using string or integer values.
- [0, 60, None, 'silent', 'off', 'no']: No message.
- [10, 'debug']: Messages from debug level and higher.
- [20, 'info']: Messages from info level and higher.
- [30, 'warning']: Messages from warning level and higher.
- [50, 'critical', 'error']: Messages from critical level and higher.
- Returns:
None.
>>> # Set the logger to warning
>>> set_logger(verbose='warning')
>>> # Test with different messages
>>> logger.debug("Hello debug")
>>> logger.info("Hello info")
>>> logger.warning("Hello warning")
>>> logger.critical("Hello critical")
- LLMlight.LLMlight.set_system_message(system)
- class LLMlight.LLMlight.wget
Retrieve file from url.
- download(writepath)
Download.
- Parameters:
url (str) – Internet source.
writepath (str) – Directory to write the file.
- Return type:
None.
- filename_from_url(ext=True)
Return filename.