GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval

Disclosure: Some links in this article are affiliate links. AI Maestro may earn a commission if you make a purchase, at no…

By AI Maestro, June 23, 2026

The GLM-5.2 model now offers an OpenAI-compatible API that supports reasoning effort controls, function calling, and long-context retrieval. This guide demonstrates how to configure the client, manage costs, and test these features in a practical workflow.

Setting Up the Client and Chat Wrapper

The tutorial begins by defining five provider options: Z.ai, OpenRouter, Together, Requesty, and Hugging Face. It loads the API key without hard-coding it and creates a reusable `chat` wrapper that handles standard chat, thinking mode, streaming, and tool calling. Token usage is tracked by a separate helper.

The code sets up the OpenAI client with the base URL of the selected provider. It defines a `load_api_key` function that checks Google Colab userdata first, then the system environment variables, and finally prompts the user for input if neither source has the key.

Cost tracking is implemented via a `_track` function that accumulates input tokens, output tokens, and call counts. A helper function, `get_reasoning`, extracts hidden reasoning traces from the model’s response, checking for `reasoning_content` attributes in the message, extra fields, or the dictionary representation.

The main `chat` function accepts parameters for reasoning effort, thinking mode, streaming, and tool selection. It constructs an `extra_body` dictionary to pass GLM-specific settings like `thinking` and `reasoning_effort` alongside standard OpenAI arguments.

```python
import sys, subprocess

# Install or upgrade the OpenAI SDK (notebook-style bootstrap; failures are ignored).
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U", "openai"], check=False)

import os, re, json, time, getpass
from openai import OpenAI

# OpenAI-compatible endpoints serving GLM-5.2, each with its model ID and the
# environment variable that holds its API key.
PROVIDERS = {
    "zai":         {"base_url": "https://api.z.ai/api/paas/v4/",   "model": "glm-5.2",        "env": "ZAI_API_KEY"},
    "openrouter":  {"base_url": "https://openrouter.ai/api/v1",    "model": "z-ai/glm-5.2",   "env": "OPENROUTER_API_KEY"},
    "together":    {"base_url": "https://api.together.xyz/v1",     "model": "zai-org/GLM-5.2","env": "TOGETHER_API_KEY"},
    "requesty":    {"base_url": "https://router.requesty.ai/v1",   "model": "zai/glm-5.2",    "env": "REQUESTY_API_KEY"},
    "huggingface": {"base_url": "https://router.huggingface.co/v1","model": "zai-org/GLM-5.2","env": "HF_TOKEN"},
}
PROVIDER = "zai"
CFG   = PROVIDERS[PROVIDER]
MODEL = CFG["model"]


def load_api_key(env_name):
    """Look up the key in Colab userdata, then the environment, then prompt for it."""
    try:
        from google.colab import userdata
        v = userdata.get(env_name)
        if v:
            return v
    except Exception:
        pass  # not running in Colab, or the secret is unavailable
    if os.environ.get(env_name):
        return os.environ[env_name]
    return getpass.getpass(f"Enter your {env_name}: ")


client = OpenAI(api_key=load_api_key(CFG["env"]), base_url=CFG["base_url"])

# Prices per million tokens, plus running totals across all calls.
PRICE_IN_PER_M, PRICE_OUT_PER_M = 1.40, 4.40
_USAGE = {"in": 0, "out": 0, "calls": 0}


def _track(usage):
    """Add one response's token counts to the running totals."""
    if usage:
        _USAGE["in"]    += getattr(usage, "prompt_tokens", 0) or 0
        _USAGE["out"]   += getattr(usage, "completion_tokens", 0) or 0
        _USAGE["calls"] += 1


def get_reasoning(obj):
    """Pull GLM's hidden reasoning trace from a message/delta (a provider-extra field)."""
    val = getattr(obj, "reasoning_content", None)
    if val:
        return val
    extra = getattr(obj, "model_extra", None) or {}
    if extra.get("reasoning_content"):
        return extra["reasoning_content"]
    try:
        return obj.to_dict().get("reasoning_content")
    except Exception:
        return None


def chat(messages, effort=None, thinking=True, tools=None, tool_choice="auto",
         stream=False, max_tokens=2048, temperature=1.0, tool_stream=False):
    """
    effort:   None | "high" | "max"   (GLM-5.2 thinking-effort level; max is the model default)
    thinking: True -> deep thinking on; False -> off (fast, cheap, low-latency)
    GLM-specific params go through extra_body so any OpenAI client works.
    """
    extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
    if effort and thinking:
        extra["reasoning_effort"] = effort  # only meaningful when thinking is on
    if tool_stream:
        extra["tool_stream"] = True
    kwargs = dict(model=MODEL, messages=messages, max_tokens=max_tokens,
                  temperature=temperature, stream=stream, extra_body=extra)
    if tools:
        kwargs.update(tools=tools, tool_choice=tool_choice)
    if stream:
        # Ask for a final usage chunk so streamed calls can be tracked too.
        kwargs["stream_options"] = {"include_usage": True}
    return client.chat.completions.create(**kwargs)
```
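To see which GLM-specific settings the wrapper actually sends, it can help to look at the `extra_body` payload on its own. The sketch below copies that logic from `chat` into a standalone function so it runs without an API key; `build_extra_body` is a hypothetical name introduced here for illustration, not part of the tutorial.

```python
def build_extra_body(thinking=True, effort=None, tool_stream=False):
    """Mirror of the extra_body logic inside chat() (hypothetical helper)."""
    extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
    if effort and thinking:
        # The effort level is dropped when thinking is disabled.
        extra["reasoning_effort"] = effort
    if tool_stream:
        extra["tool_stream"] = True
    return extra


for kw in (dict(thinking=False),
           dict(thinking=True, effort="high"),
           dict(thinking=False, effort="max"),
           dict(thinking=True, effort="max", tool_stream=True)):
    print(kw, "->", build_extra_body(**kw))
```

Note that passing `effort` together with `thinking=False` silently yields a payload without `reasoning_effort`, which matches the wrapper's behavior.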

Testing Reasoning Effort and Streaming

The guide moves to practical testing, starting with a basic sanity check. It then compares the model’s performance across three modes: thinking off, effort high, and effort max. This comparison measures latency and output token counts for the same train problem.
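When comparing the three modes, it is useful to have the expected answer on hand. The sketch below works the train problem from `demo_effort` directly: Train A covers 30 km before Train B departs, and the remaining 390 km closes at 150 km/h. `meeting_time` is a hypothetical helper, not part of the tutorial, and the calendar date is arbitrary.

```python
from datetime import datetime, timedelta


def meeting_time(distance_km, speed_a, start_a, speed_b, start_b):
    """Meeting time of two trains heading toward each other (A departs first)."""
    head_start_h = (start_b - start_a).total_seconds() / 3600
    # Gap left when the second train departs.
    gap_km = distance_km - speed_a * head_start_h
    # Both trains now close the gap at the sum of their speeds.
    hours_to_meet = gap_km / (speed_a + speed_b)
    return start_b + timedelta(hours=hours_to_meet)


t = meeting_time(420, 60, datetime(2000, 1, 1, 9, 0),
                 90, datetime(2000, 1, 1, 9, 30))
print(t.strftime("%H:%M"))  # prints 12:06
```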

Streaming is demonstrated by separating the reasoning channel from the final answer. The code iterates through chunks to print the thinking process first, followed by the main content.

```python
def demo_basic():
    print("\n=== 1. BASIC CHAT / SANITY CHECK =========================")
    resp = chat([{"role": "system", "content": "You are a concise technical assistant."},
                 {"role": "user",   "content": "In one sentence, what is GLM-5.2 best at?"}],
                thinking=False, max_tokens=200)
    _track(resp.usage)
    print(resp.choices[0].message.content.strip())


def demo_effort():
    print("\n=== 2. THINKING-EFFORT CONTROL (off / high / max) ========")
    problem = ("Train A leaves city A at 9:00 going 60 km/h toward city B. "
               "Train B leaves B (420 km away) at 9:30 going 90 km/h toward A. "
               "At what clock time do they meet? Show the key steps briefly.")
    # Same prompt in all three modes so latency and token counts are comparable.
    for label, kw in [("thinking OFF", dict(thinking=False)),
                      ("effort=high",  dict(thinking=True, effort="high")),
                      ("effort=max",   dict(thinking=True, effort="max"))]:
        t0 = time.time()
        resp = chat([{"role": "user", "content": problem}], max_tokens=2000, **kw)
        dt = time.time() - t0
        _track(resp.usage)
        msg, u = resp.choices[0].message, resp.usage
        print(f"\n--- {label} | {dt:0.1f}s | out_tokens={getattr(u, 'completion_tokens', 0)} ---")
        r = get_reasoning(msg)
        if r:
            print("  [reasoning, first 220 chars]: " + " ".join(r.split())[:220] + " ...")
        print("  [answer]: " + " ".join((msg.content or "").split())[:350])


def demo_streaming():
    print("\n=== 3. STREAMING: reasoning channel vs answer channel ====")
    stream = chat([{"role": "user", "content":
                    "Explain why the sky is blue, then give a one-line TL;DR."}],
                  thinking=True, effort="high", stream=True, max_tokens=1200)
    saw_r = saw_a = False
    usage = None
    for chunk in stream:
        if getattr(chunk, "usage", None):
            usage = chunk.usage  # usage arrives on a chunk of its own
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        r = get_reasoning(delta)
        if r:
            if not saw_r:
                print("\n[thinking] ", end="", flush=True)
                saw_r = True
            print(r, end="", flush=True)
        if getattr(delta, "content", None):
            if not saw_a:
                print("\n\n[answer] ", end="", flush=True)
                saw_a = True
            print(delta.content, end="", flush=True)
    print()
    _track(usage)
```
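The channel separation in `demo_streaming` can be exercised without a network call. The following is a minimal sketch that feeds hand-built stand-in chunks through the same loop shape and collects the two channels into strings instead of printing them. `split_channels` and `_chunk` are hypothetical names, and the stub objects only imitate the attributes the loop reads (`choices`, `delta`, `reasoning_content`, `content`, `usage`).

```python
from types import SimpleNamespace


def split_channels(chunks):
    """Collect reasoning text, answer text, and usage from streamed chunks."""
    reasoning, answer, usage = [], [], None
    for chunk in chunks:
        if getattr(chunk, "usage", None):
            usage = chunk.usage
        if not getattr(chunk, "choices", None):
            continue  # e.g. the usage-only chunk at the end of the stream
        delta = chunk.choices[0].delta
        r = getattr(delta, "reasoning_content", None)
        if r:
            reasoning.append(r)
        c = getattr(delta, "content", None)
        if c:
            answer.append(c)
    return "".join(reasoning), "".join(answer), usage


def _chunk(reasoning=None, content=None):
    """Build a stand-in for one streamed chunk (illustration only)."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)], usage=None)


fake_stream = [
    _chunk(reasoning="Rayleigh scattering "),
    _chunk(reasoning="favours short wavelengths."),
    _chunk(content="Blue light scatters most."),
    SimpleNamespace(choices=[], usage=SimpleNamespace(prompt_tokens=12,
                                                      completion_tokens=30)),
]

thinking_text, answer_text, usage = split_channels(fake_stream)
print("[thinking]", thinking_text)
print("[answer]", answer_text)
print("output tokens:", usage.completion_tokens)
```

Returning strings rather than printing makes the logic easy to unit-test, at the cost of losing the live, token-by-token display that the original demo provides.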

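Costs are estimated from the `PRICE_IN_PER_M` and `PRICE_OUT_PER_M` constants: 1.40 per million input tokens and 4.40 per million output tokens (the code does not state a currency; presumably US dollars). Each demonstration passes its `usage` object to `_track`, so the totals accumulate across calls.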
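The code shown so far accumulates token counts but never converts them into a price. A minimal sketch of that last step follows, assuming the two rates are per million tokens as the constant names suggest. `estimate_cost` is a hypothetical helper, and the usage figures are made-up example numbers, not measured results.

```python
PRICE_IN_PER_M, PRICE_OUT_PER_M = 1.40, 4.40  # rates from the tutorial's constants


def estimate_cost(usage, price_in=PRICE_IN_PER_M, price_out=PRICE_OUT_PER_M):
    """Convert accumulated token counts into an estimated cost."""
    return (usage["in"] / 1_000_000 * price_in
            + usage["out"] / 1_000_000 * price_out)


# Same shape as the tutorial's _USAGE dictionary; the values are illustrative.
example_usage = {"in": 250_000, "out": 100_000, "calls": 12}
cost = estimate_cost(example_usage)
print(f"{example_usage['calls']} calls, estimated cost: {cost:.2f}")
```

With these example counts the input side contributes 0.35 and the output side 0.44, so output tokens dominate the bill even though there are fewer of them.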

Function Calling and Tool Use

The final section builds a multi-step agent using function calling. It defines two tools: a calculator for arithmetic and a city population lookup. The agent runs in a loop: it executes each tool call the model requests, appends the result to the conversation, and calls the model again until it returns a final answer.
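The tutorial's agent code is not reproduced here, so the following is only a sketch of how such a loop could look under the standard OpenAI tool-calling message format. Every name in it (`calculator`, `city_population`, `run_tool`, `run_agent`, `TOOLS`) is an assumption, the population figures are rounded placeholders rather than authoritative data, and `chat_fn` stands in for the `chat` wrapper defined earlier.

```python
import ast
import json
import operator

# Placeholder figures, only to make the sketch runnable.
_POPULATIONS = {"tokyo": 37_000_000, "paris": 11_000_000}
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.USub: operator.neg}


def calculator(expression):
    """Evaluate +, -, *, / arithmetic by walking the AST instead of using eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)


def city_population(city):
    """Return the placeholder population for a city, or None if unknown."""
    return _POPULATIONS.get(city.strip().lower())


TOOL_IMPLS = {"calculator": calculator, "city_population": city_population}


def _schema(name, description, arg):
    """Build a one-string-argument tool definition in the OpenAI tools format."""
    return {"type": "function", "function": {
        "name": name, "description": description,
        "parameters": {"type": "object",
                       "properties": {arg: {"type": "string"}},
                       "required": [arg]}}}


TOOLS = [_schema("calculator", "Evaluate an arithmetic expression.", "expression"),
         _schema("city_population", "Look up a city's population.", "city")]


def run_tool(name, arguments_json):
    """Run one tool call; errors come back as JSON so the model can react to them."""
    try:
        result = TOOL_IMPLS[name](**json.loads(arguments_json or "{}"))
        return json.dumps({"result": result})
    except Exception as exc:
        return json.dumps({"error": str(exc)})


def run_agent(chat_fn, question, max_steps=6):
    """Call the model until it answers without requesting any tool."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        msg = chat_fn(messages, tools=TOOLS).choices[0].message
        calls = getattr(msg, "tool_calls", None) or []
        if not calls:
            return msg.content
        # Echo the assistant turn, then answer every requested call by id.
        messages.append({"role": "assistant", "content": msg.content or "",
                         "tool_calls": [{"id": c.id, "type": "function",
                                         "function": {"name": c.function.name,
                                                      "arguments": c.function.arguments}}
                                        for c in calls]})
        for c in calls:
            messages.append({"role": "tool", "tool_call_id": c.id,
                             "content": run_tool(c.function.name, c.function.arguments)})
    return None  # step budget exhausted without a final answer
```

With the earlier wrapper in scope, a call such as `run_agent(chat, "What is the population of Tokyo divided by 4?")` would drive the loop against the live API. The `max_steps` cap is a design choice to keep a confused model from looping indefinitely.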
