Tool parser

GitHub Copilot

Copilot CLI and VS Code transcript ingestion, model inference, and tool normalization.

Copilot has two supported on-disk layouts: the legacy CLI agent under ~/.copilot/ and VS Code Copilot Chat transcripts under workspace storage. tokenuse reads both through src/tools/copilot/.

Status: implemented.

Where the Data Lives

Legacy CLI Agent

~/.copilot/session-state/<session-id>/
    events.jsonl
    workspace.yaml

workspace.yaml is parsed for a scalar cwd: line and used as the project path. events.jsonl is the timeline.
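The scalar lookup described above fits in a few lines of std-only Rust. This is a sketch, not the project's actual API; the function name is illustrative:

```rust
// Illustrative sketch of the minimal workspace.yaml reader: scan line by
// line for a scalar `cwd:` entry and return its trimmed value.
fn parse_workspace_cwd(yaml: &str) -> Option<String> {
    yaml.lines().find_map(|line| {
        line.trim_start()
            .strip_prefix("cwd:")
            .map(|rest| rest.trim().trim_matches('"').to_string())
    })
}

fn main() {
    let yaml = "version: 1\ncwd: /Users/me/Code/tokens\n";
    assert_eq!(
        parse_workspace_cwd(yaml).as_deref(),
        Some("/Users/me/Code/tokens")
    );
}
```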

VS Code Extension

| Platform | Workspace storage |
|----------|-------------------|
| macOS    | ~/Library/Application Support/Code/User/workspaceStorage/<hash>/ |
| Linux    | ~/.config/Code/User/workspaceStorage/<hash>/ |
| Windows  | %APPDATA%/Code/User/workspaceStorage/<hash>/ |

Inside each workspace hash directory:

GitHub.copilot-chat/transcripts/<session>.jsonl

A transcript file is parsed as Copilot only when its first line has type == "session.start" and data.producer == "copilot-agent". When that session.start event includes data.context.cwd, that cwd is the authoritative project path. If it is absent, tokenuse falls back to workspace.yaml and then to the discovered source project.
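The validation and the project-path fallback chain can be sketched against an already-deserialized first event. The struct and function names below are illustrative, not the project's real types:

```rust
// Illustrative shape of the first transcript line after JSON deserialization;
// field names mirror the event payload, not tokenuse's actual structs.
struct SessionStart {
    event_type: String,       // "session.start"
    producer: Option<String>, // data.producer
    cwd: Option<String>,      // data.context.cwd
}

// A transcript is treated as Copilot only when both checks pass.
fn is_copilot_transcript(first: &SessionStart) -> bool {
    first.event_type == "session.start"
        && first.producer.as_deref() == Some("copilot-agent")
}

// Project path precedence: data.context.cwd, then workspace.yaml,
// then the discovered source project.
fn project_path(first: &SessionStart, yaml_cwd: Option<&str>, discovered: &str) -> String {
    first
        .cwd
        .clone()
        .or_else(|| yaml_cwd.map(|s| s.to_string()))
        .unwrap_or_else(|| discovered.to_string())
}

fn main() {
    let first = SessionStart {
        event_type: "session.start".into(),
        producer: Some("copilot-agent".into()),
        cwd: None,
    };
    assert!(is_copilot_transcript(&first));
    assert_eq!(project_path(&first, Some("/w"), "/d"), "/w");
    assert_eq!(project_path(&first, None, "/d"), "/d");
}
```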

flowchart TD
    A["legacy session-state dir"] --> B["events.jsonl"]
    A --> C["workspace.yaml cwd"]
    D["VS Code workspaceStorage"] --> E["transcripts/*.jsonl"]
    E --> F["first line data.producer == copilot-agent"]
    B --> G["legacy parser"]
    F --> H["transcript parser"]
    C --> G
    C --> H
    G --> I["ParsedCall output"]
    H --> I

Record Format

Legacy events.jsonl

Legacy events store their payload under data. A legacy assistant.message emits a ParsedCall only when the current model has been set by a prior session.model_change event and data.outputTokens is positive.

{ "type": "session.model_change",
  "timestamp": "2026-04-26T10:00:00Z",
  "data": { "newModel": "claude-sonnet-4-5" } }

{ "type": "user.message",
  "timestamp": "2026-04-26T10:00:01Z",
  "data": { "content": "fix the typo in README" } }

{ "type": "assistant.message",
  "timestamp": "2026-04-26T10:00:02Z",
  "data": {
    "messageId": "m1",
    "outputTokens": 220,
    "toolRequests": [
      { "toolCallId": "tooluse_xyz", "name": "bash",
        "arguments": "{\"command\":\"ls -la | wc -l\"}" },
      { "toolCallId": "tooluse_yyy", "name": "edit_file" }
    ]
  } }
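The emission rule above reduces to a small guard. This sketch assumes the running model state and the event's data.outputTokens have already been extracted; the function name is illustrative:

```rust
// Emit a (model, output_tokens) pair only when a session.model_change has
// already set the current model AND the event carries positive outputTokens.
fn legacy_call(current_model: Option<&str>, output_tokens: i64) -> Option<(String, i64)> {
    match (current_model, output_tokens) {
        (Some(model), n) if n > 0 => Some((model.to_string(), n)),
        _ => None, // no model yet, or a non-positive token count: skip
    }
}

fn main() {
    assert_eq!(
        legacy_call(Some("claude-sonnet-4-5"), 220),
        Some(("claude-sonnet-4-5".to_string(), 220))
    );
    assert_eq!(legacy_call(None, 220), None);
    assert_eq!(legacy_call(Some("claude-sonnet-4-5"), 0), None);
}
```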

VS Code Transcripts

VS Code transcript payloads also live under data. The parser validates the first session.start line, uses data.context.cwd for the project path, and estimates tokens from message text.

{ "type": "session.start",
  "data": {
    "sessionId": "x",
    "producer": "copilot-agent",
    "model": "gpt-5",
    "context": { "cwd": "/Users/me/Code/tokens" }
  } }

{ "type": "user.message",
  "data": { "content": "hello world" } }

{ "type": "assistant.message",
  "data": {
    "messageId": "abc",
    "content": "sure thing",
    "reasoningText": "let me think",
    "toolRequests": [
      { "toolCallId": "toolu_bdrk_01ZZ", "name": "read_file" },
      { "toolCallId": "toolu_bdrk_02YY", "name": "edit_file" }
    ]
  } }

The current transcript parser does not use data.model for pricing. It infers one model alias per transcript from tool-call id prefixes.

Token & Cost Mapping

| ParsedCall field | Legacy source | VS Code transcript source |
|------------------|---------------|---------------------------|
| input_tokens | 0 | latest data.content.len() / 4, rounded up |
| output_tokens | data.outputTokens | data.content.len() / 4, rounded up, unless an explicit data.outputTokens exists |
| reasoning_tokens | 0 | data.reasoningText.len() / 4, rounded up |
| cache_creation_input_tokens | 0 | 0 |
| cache_read_input_tokens | 0 | 0 |
| model | latest session.model_change.data.newModel | inferred alias from tool-call ids |
| timestamp | top-level timestamp, parsed as RFC3339 | top-level timestamp when present; otherwise None |
| project | workspace.yaml cwd:, then discovered source | session.start.data.context.cwd, then workspace.yaml, then discovered source |

Transcript reasoning tokens are preserved in reasoning_tokens for future breakouts. They are not folded into output_tokens by the current Copilot parser.
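The len() / 4, rounded-up estimate used for transcript fields is plain integer ceiling division over byte length. A minimal sketch (illustrative function name):

```rust
// Rounded-up len()/4 estimate applied to transcript content and
// reasoning text, as described in the mapping table above.
fn estimate_tokens(text: &str) -> u64 {
    (text.len() as u64 + 3) / 4 // integer ceiling division by 4
}

fn main() {
    assert_eq!(estimate_tokens(""), 0);
    assert_eq!(estimate_tokens("hello world"), 3); // 11 bytes -> ceil(11/4) = 3
    assert_eq!(estimate_tokens("let me think"), 3); // 12 bytes -> 3
}
```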

Model Inference

When parsing VS Code transcripts, count recognized data.toolRequests[].toolCallId prefixes across the whole transcript and use the most common alias:

| Prefix | Alias | Pricing target |
|--------|-------|----------------|
| toolu_bdrk_ | anthropic-auto | Sonnet alias |
| toolu_vrtx_ | anthropic-auto | Sonnet alias |
| tooluse_ | anthropic-auto | Sonnet alias |
| call_ | openai-auto | GPT-5 alias |

If no recognized prefix appears, the parser uses copilot-auto, which currently falls through pricing lookup to the snapshot fallback.
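Counting recognized prefixes across the whole transcript and taking the most common alias can be sketched as follows (function names are illustrative, not the project's API):

```rust
use std::collections::HashMap;

// Map one tool-call id prefix to its model alias, per the table above.
fn alias_for_id(id: &str) -> Option<&'static str> {
    if id.starts_with("toolu_bdrk_") || id.starts_with("toolu_vrtx_") || id.starts_with("tooluse_")
    {
        Some("anthropic-auto")
    } else if id.starts_with("call_") {
        Some("openai-auto")
    } else {
        None
    }
}

// The most common recognized alias wins; "copilot-auto" when nothing matches.
fn infer_model_alias<'a>(ids: impl IntoIterator<Item = &'a str>) -> &'static str {
    let mut counts: HashMap<&'static str, usize> = HashMap::new();
    for id in ids {
        if let Some(alias) = alias_for_id(id) {
            *counts.entry(alias).or_insert(0) += 1;
        }
    }
    counts
        .into_iter()
        .max_by_key(|&(_, n)| n)
        .map(|(alias, _)| alias)
        .unwrap_or("copilot-auto")
}

fn main() {
    assert_eq!(
        infer_model_alias(["toolu_bdrk_01ZZ", "toolu_bdrk_02YY", "call_1"]),
        "anthropic-auto"
    );
    assert_eq!(infer_model_alias(Vec::<&str>::new()), "copilot-auto");
}
```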

Deduplication

  • Legacy: copilot:<session_id>:<message_id>, where session_id is the parent directory name and message_id is data.messageId.
  • VS Code: copilot:<session_id>:<message_id>, where session_id is the transcript file stem and message_id is data.messageId.
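Both layouts share the same key shape, so a single formatter covers them; only the source of session_id differs. A sketch with an illustrative name:

```rust
// copilot:<session_id>:<message_id> -- session_id is the parent directory
// name (legacy) or the transcript file stem (VS Code); message_id is
// data.messageId in both layouts.
fn dedup_key(session_id: &str, message_id: &str) -> String {
    format!("copilot:{session_id}:{message_id}")
}

fn main() {
    assert_eq!(dedup_key("s1", "m1"), "copilot:s1:m1");
}
```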

Tools / Bash Extraction

Walk data.toolRequests[] and normalize each name:

| Copilot name | Normalized |
|--------------|------------|
| bash, run_in_terminal, kill_terminal | Bash |
| read_file | Read |
| edit_file, write_file, replace_string_in_file, apply_patch | Edit |
| create_file | Write |
| delete_file | Delete |
| search_files, file_search | Grep |
| find_files | Glob |
| list_directory, list_dir | LS |
| web_search | WebSearch |
| fetch_webpage | WebFetch |
| github_repo | GitHub |
| memory | Memory |

For Bash-class calls, parse arguments as a JSON string and split command or cmd with tools::jsonl::split_bash_commands.

flowchart LR
    A["data.toolRequests array"] --> B["normalize tool name"]
    A -->|bash class| C["parse arguments JSON"]
    C --> D["command or cmd"]
    D --> E["split_bash_commands"]
    B --> F["tools"]
    E --> G["bash_commands"]
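The name mapping in the table above is a direct match on the raw tool name. A std-only sketch (illustrative function, not the project's actual signature), returning None for unrecognized names:

```rust
// Normalize a Copilot tool name to its canonical label, following the
// mapping table above. Unknown names come back as None.
fn normalize_tool(name: &str) -> Option<&'static str> {
    Some(match name {
        "bash" | "run_in_terminal" | "kill_terminal" => "Bash",
        "read_file" => "Read",
        "edit_file" | "write_file" | "replace_string_in_file" | "apply_patch" => "Edit",
        "create_file" => "Write",
        "delete_file" => "Delete",
        "search_files" | "file_search" => "Grep",
        "find_files" => "Glob",
        "list_directory" | "list_dir" => "LS",
        "web_search" => "WebSearch",
        "fetch_webpage" => "WebFetch",
        "github_repo" => "GitHub",
        "memory" => "Memory",
        _ => return None,
    })
}

fn main() {
    assert_eq!(normalize_tool("run_in_terminal"), Some("Bash"));
    assert_eq!(normalize_tool("apply_patch"), Some("Edit"));
    assert_eq!(normalize_tool("unknown"), None);
}
```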

Known Limitations

  • Legacy events without a positive data.outputTokens value are skipped.
  • Legacy input tokens are currently recorded as 0 because the legacy format only exposes output tokens in the supported path.
  • VS Code transcript token counts are estimates based on chars / 4.0; treat Copilot totals as approximate.
  • VS Code data.model is currently ignored for pricing; tool-call id inference picks one model alias for the whole transcript.
  • workspace.yaml parsing reads only the scalar cwd: line used by Copilot session-state files. If Copilot starts writing richer YAML, replace the small parser with a YAML crate.