mohsen1 40 minutes ago

Added some benchmarking to show how fast it is:

Here is a benchmark comparing it to [Repomix][1] serializing the Next.js project:

      time yek
      Executed in    5.19 secs    fish           external
         usr time    2.85 secs   54.00 micros    2.85 secs
         sys time    6.31 secs  629.00 micros    6.31 secs



      time repomix
      Executed in   22.24 mins    fish           external
         usr time   21.99 mins    0.18 millis   21.99 mins
         sys time    0.23 mins    1.72 millis    0.23 mins


yek is roughly 257x faster than repomix here (5.19 seconds vs. 22.24 minutes)

[1] https://github.com/yamadashy/repomix

mg an hour ago

I think this is where the future of coding is. It is still useful to be a coder, and the more experienced the better, but you will not write or edit many lines yourself anymore. You will organize the codebase in a way the AI can handle, make architectural decisions, and organize the workflow around the AI doing the actual coding.

The way I currently do this: I wrote a small Python script that I start with

    llmcode.py /path/to/repo
This serves a simple web interface at localhost:8080 where I can select the files to serialize and describe a task.

It then creates a prompt like this:

    Look at the code files below and do the following:

    {task_description}

    Output all files that you need to change in full again,
    including your changes. In the same format as I provide
    the files below, that means each file starts with
    filename: and ends with :filename
    Under no circumstances output any other text, no additional
    infos, no code formatting chars. Only the code in the
    given format.

    Here are the files:

    somefile.py:
    ...code of somefile.py...
    :somefile.py

    someotherfile.py:
    ...code of someotherfile.py...
    :someotherfile.py

    assets/css/somestyles.css:
    ...code of somestyles.css...
    :assets/css/somestyles.css

    etc
Then llmcode.py sends it to an LLM, parses the output, and writes the files back to disk.
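
The parse-and-write step can be as simple as a regex over that format (a rough sketch of the idea; the actual llmcode.py may differ):

    import re
    from pathlib import Path

    # Matches blocks of the form "name:\n...code...\n:name"
    BLOCK_RE = re.compile(
        r"^(?P<name>\S+):\n(?P<body>.*?)\n:(?P=name)$",
        re.MULTILINE | re.DOTALL,
    )

    def write_files(llm_output: str, repo_root: str) -> None:
        for m in BLOCK_RE.finditer(llm_output):
            path = Path(repo_root) / m.group("name")
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(m.group("body") + "\n")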

I then look at the changes via "git diff".

It's quite fascinating. I often make only minor changes before accepting the "pull request" the LLM made. Sometimes I have to make no changes at all.

  • shatrov1 39 minutes ago

    Would you be so kind as to share your script? Thanks!

mkagenius an hour ago

I am doing something similar for my gitpodcast project:

    import openai
    from pydantic import BaseModel

    # Assumed definition: the original snippet references FileListFormat
    # without showing it; structured-output parsing expects a model like this.
    class FileListFormat(BaseModel):
        file_list: list[str]

    def get_important_files(self, file_tree):
        # file_tree = "api/backend/main.py  api.py"
        # Send the prompt to Azure OpenAI for structured parsing
        response = openai.beta.chat.completions.parse(
            model=self.model_name,
            messages=[
                # Initial system prompt
                {"role": "system", "content": (
                    "Give a list of up to 10 of the most important file paths "
                    "in this file tree for understanding the code architecture, "
                    "high-level decisions, and what the repository is about, to "
                    "include in the podcast I am creating. Return them as a "
                    "list; do not write any file paths that are not listed below."
                )},
                {"role": "user", "content": file_tree},
            ],
            response_format=FileListFormat,
        )
        try:
            parsed = response.choices[0].message.parsed
            return parsed.file_list
        except Exception as e:
            print("Error processing file tree:", e)
            return []


1. https://gitpodcast.com - Convert any GitHub repo into a podcast.

ycombiredd 40 minutes ago

I guess I shouldn't be surprised that many of us have approached this in different ways. It's neat to see multiple replies of the sort I'm about to make, too: sharing the approach I've been taking, which is to concatenate or "summarize" the code, with particular attention to dependency resolution.

[chimeracat](https://github.com/scottvr/chimeracat)

It took the shape it has because it started as a tool to concatenate a library I had been working on into a single ipynb file, so that I didn't need to install the library on the remote Colab. That's how the dependency graph was born (as was the ASCII graph plotter 'phart' that it uses). Once I realized this could also be useful for sharing code with an LLM, I started adding the summarization capabilities, and, in some sort of meta-recursive irony, worked with Claude to do so. :-)

I've put a collection of ancillary tools I use to aid in the LLM-pairing process up at https://github.com/scottvr/LLMental

CGamesPlay 44 minutes ago

What is the use-case here? What is a "chunk"? It looks like it's just an arbitrary group of files, where "more important" files get put at the end. Why is that useful for LLMs? Also, I see it can chunk based on token count but... what's a token? ChatGPT? Llama?

Note, I understand why code context is important for LLMs. I don't understand what this chunking is or how it helps me get better code context.

  • mohsen1 35 minutes ago

    Token counting is done by the crate I'm using. I agree that not all LLMs use the same tokenizer, but they are mostly similar.

    Chunking is useful because in chat mode you can feed in more than the max context size if you split the content across multiple USER messages.

    LLMs pay more attention to the last part of the conversation/message, which is why sorting is so important. The last sentence in a very long prompt matters much more than the first.

    Use case: I use this to run an "AI loop" with DeepSeek to fix bugs or implement features. The loop steers the LLM and keeps it from going astray down various rabbit holes; every prompt reiterates what the objective is. By loop I mean: serialize the repo, run the tests, feed the test failure and the repo to the LLM, get a diff, apply the diff, and repeat until the objective is achieved.
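
    A rough sketch of that loop (hypothetical helper names, not my actual tooling):

        import subprocess

        def ai_loop(repo: str, objective: str, max_iters: int = 10) -> bool:
            for _ in range(max_iters):
                # Repo-specific test command; pytest is just an example
                tests = subprocess.run(["pytest"], cwd=repo,
                                       capture_output=True, text=True)
                if tests.returncode == 0:
                    return True  # objective achieved
                context = serialize_repo(repo)   # e.g. yek's serialized output
                prompt = f"{objective}\n\nTest failure:\n{tests.stdout}\n\n{context}"
                diff = ask_llm(prompt)           # hypothetical LLM call
                apply_diff(repo, diff)           # e.g. via `git apply`
            return False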

    • CGamesPlay 28 minutes ago

      Got it, thanks.

      > in chat mode you can feed more than context max size if you feed in multiple USER messages

      Just so you know, this is false. You might be using a system that automatically deletes or summarizes older messages, which would make it feel that way, and would also explain why you feel the sorting is so important (it is important! But possibly not critically so).

      For future work, you might be interested in seeing how tools like Aider do their "repo serializing" (they call it a repomap), which tries to be more intelligent by only including "important lines" (like function definitions but not bodies).
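
      For Python, a crude version of that idea can be built with the ast module (a sketch of the general approach, not Aider's actual implementation):

          import ast

          def file_map(source: str) -> str:
              # Keep only function/class signatures; drop the bodies.
              lines = []
              for node in ast.walk(ast.parse(source)):
                  if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                      args = ", ".join(a.arg for a in node.args.args)
                      lines.append(f"def {node.name}({args}): ...")
                  elif isinstance(node, ast.ClassDef):
                      lines.append(f"class {node.name}: ...")
              return "\n".join(lines)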

pagekicker 2 hours ago

Error from Homebrew:

    Error: yek: SHA256 mismatch
    Expected: 34896ad65e8ae7c5e93d90e87f15656b67ed5b7596492863d1da80e548ba7301
    Actual: 353f4f7467af25b5bceb66bb29d9591ffe8d620d17bf40f6e0e4ec16cd4bd7e7
    File: /Users/... Library/Caches/Homebrew/downloads/0308e13c088cb787ece0e33a518cd211773daab9b427649303d79e27bf723e0d--yek-x86_64-apple-darwin.tar.gz
    To retry an incomplete download, remove the file above.

I removed the file and tried again; this was the result. Is the SHA256 mismatch a security concern?

  • mohsen1 2 hours ago

    Oh, I totally forgot about the Homebrew installer. I'll fix it ASAP. Sorry about that.

    Edit: Working on a fix here https://github.com/bodo-run/yek/pull/14

    You can use the bash installer on macOS for now. You can read the installer file before executing it if you're not sure it is safe.

wiradikusuma an hour ago

Sorry if this is obvious, but where does Yek fit in with existing coding assistants such as Copilot or Continue.dev?

Is it purpose-built for code, or would any text (e.g., an Obsidian vault) work?

  • mohsen1 33 minutes ago

    This can be a piece of your own AI automation. Every task has different needs, so being able to program your own AI automation is great for programmers. Any text-based document works with this tool. It's rather simple: just stitching files together with a dash of priority sorting.
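
    In spirit it is something like this (a toy sketch, not yek's actual logic):

        from pathlib import Path

        # Hypothetical scoring: higher score = more important. Important
        # files go last so they sit closest to the end of the prompt.
        def priority(path: Path) -> int:
            if path.name.lower().startswith("readme"):
                return 2
            if "test" in path.parts:
                return 0
            return 1

        def serialize(root: str) -> str:
            files = sorted((f for f in Path(root).rglob("*") if f.is_file()),
                           key=priority)
            # ">>>> path" is an arbitrary header for this sketch
            return "\n\n".join(f">>>> {f}\n{f.read_text()}" for f in files)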

linschn 3 hours ago

That's neat! I've built a transient UI to do this manually[0] within Emacs, but with context windows getting bigger and bigger, being more systematic may be the way to go.

The prioritization mentioned in the README is especially interesting.

[0] https://rdklein.fr/bites/MyTransientUIForLocalLLMs.html

hbornfree 2 hours ago

Thanks for this! I have the exact use-case and have been using a Python script to do this for a while.

TheTaytay 5 hours ago

This has some interesting ideas that I hadn’t seen in the other similar projects, especially around trying to sort files according to importance.

(I’ve been using RepoPrompt for this sort of thing lately.)

awestroke an hour ago

This looks promising. Hopefully much faster and less naive than Repomix

msoad 3 hours ago

This is really fast! It serialized 50k lines in 500ms on my Mac.