Instead of learning the latest workarounds for the kinks and quirks of a beta AI product, I'm going to wait 3 weeks for the advice to become completely obsolete
There was a debate over whether to integrate Stable Diffusion into the curriculum at a local art school here.
Personally, while I consider AI a useful tool, I think it's quite pointless to teach it in school, because whatever you learn will be obsolete next month.
Of course some people might argue that the whole art school (it's already quite a "job-seeking" type, mostly digital painting/Adobe After Effects) will be obsolete anyway...
The skill that's worth learning is how to investigate, experiment and think about these kinds of tools.
A "Stable Diffusion" class might be a waste of time, but a "Generative art" class where students are challenged to explore what's available, share their own experiments and discuss under what circumstances these tools could be useful, harmful, productive, misleading etc feels like it would be very relevant to me, no matter where the technology goes next.
Very true regarding the subjects of a hypothetical AI art class.
What's also important is teaching how commercial art, or art in general, is conceptualized, in other words:
What is important, and why? Design thinking. I know that phrase might sound dated, but that's the work where humans should fear being replaced, and the skill they should foster.
That's also the line that at first seems to be blurred when using generative text-to-image AI, or LLMs in general.
The seemingly magical connection between prompt and result appears to human users like the work of a creative entity distilling and developing an idea.
That's the most important aspect of all creative work.
If you read my reply: thanks, Simon, your blog's an amazing companion in the boom of generative AI. I was a regular reader in 2022/2023 and should revisit! I think you guided me through my first local LLaMA setup.
All knowledge degrades with time. Medical books from the 1800s wouldn't be a lot of use today.
There is just a different decay curve for different topics.
Part of 'knowing' a field is to learn it and then keep up with the field.
> whatever you learn will be obsolete next month
This is exactly the kind of attitude that turns university courses into dinosaurs with far less connection to the "real world" industry than ideal. Frankly, it's an excuse for laziness and Luddism at this point. Much of what I learned about food groups and economics and politics and writing in school is obsolete at this point; should my teachers not have bothered at all? Out of what? Fear?
The way Stable Diffusion works hasn't really changed; in fact, people have just built ComfyUI layers and workflows on top of it in the ensuing 3 years. The more you stick your head in the sand because you've already predetermined the outcome, the more you pile up the debt your students will have to pay off on their own, because you were too insecure to make a call and trust that your students can adjust as needed.
Integrating it into the curriculum is strange. They should do one-time introductory lectures instead.
To be fair, the article basically says "ask the LLM for what you want in detail"
Great advice, but difficult to apply given the very small context window of the o1 models.
The churn is real. I wonder if so much churn from innovation in a space can suppress adoption enough that it actually reduces innovation.
It’s churn because every new model may or may not break strategies that worked before.
Nobody is designing how to prompt models. Prompting behavior is an emergent property of these models, so it can change entirely from one generation of any model to the next.
IMO the lack of real version control and lack of reliable programmability have been significant impediments to impact and adoption. The control surfaces are more brittle than say, regex, which isn’t a good place to be.
I would quibble that there is a modicum of design in prompting; RLHF, DPO and ORPO are explicitly designing the models to be more promptable. But the methods don’t yet adequately scale to the variety of user inputs, especially in a customer-facing context.
My preference would be for the field to put more emphasis on control over LLMs, but it seems like the momentum is again on training LLM-based AGIs. Perhaps the Bitter Lesson has struck again.
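For what it's worth, the "modicum of design" mentioned above, e.g. DPO, boils down to a preference loss over chosen/rejected completions. A minimal sketch, assuming the per-completion log-probabilities have already been computed under the trainable policy and a frozen reference model (PyTorch used purely for illustration):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each tensor holds the summed token log-probabilities of the chosen or
    rejected completion under the trainable policy or the frozen reference.
    """
    # Implicit rewards are log-probability ratios against the reference model.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Push the policy to rank the human-preferred completion above the other.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

That is the sense in which promptability is "designed": the objective rewards responses humans preferred, but it doesn't guarantee robustness to the full variety of user inputs.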
A constantly changing "API" coupled with an inherently unreliable output is not conducive to stable business.
It's interesting that, despite all these real issues you're pointing out, a lot of people nevertheless are drawn to interact with this technology.
It looks as if it touches some deep psychological lever: have an assistant that can help to carry out tasks, so that you don't have to bother learning the boring details of a craft.
Unfortunately lead cannot yet be turned into gold
> a lot of people nevertheless are drawn to interact with this technology.
To look at this statement cynically, a lot of people are drawn to anything with billions of dollars behind it… like literally anything.
Not to mention the amount companies spend on marketing AI products.
> It looks as if it touches some deep psychological lever: have an assistant that can help to carry out tasks
That deep lever is “make value more cheaply and with less effort”
From what I’ve seen, most of the professional interest in AI is based on cost cutting.
There are a few (what I would call degenerate) groups who believe there is some consciousness behind these AIs, but they're a very small group.
Unless your business is customer service reps, with no ability to do anything but read scripts, who have no real knowledge of how things actually work.
Then current AI is basically the same, for cheap.
Many service reps do have some expertise in the systems they support.
Once you get past the tier 1 incoming calls, support is pretty specialized.
Modern AI both shortens the useful lifespan of software and increases the importance of development speed. Waiting around doesn’t seem optimal right now.
The reality is that o1 is a step away from general intelligence and back towards narrow AI. It is great for solving the kinds of math, coding and logic puzzles it has been designed for, but for many kinds of tasks, including chat and creative writing, it is actually worse than 4o. It is good at the specific kinds of reasoning tasks it was built for, much like AlphaGo is great at playing Go, but that does not actually mean it is more generally intelligent.
LLMs will not give us "artificial general intelligence", whatever that means.
An AGI will be able to do any task any human can do. Or all tasks any human can do. An AGI will be able to get any college degree.
So it’s not an AGI if it can’t create an AGI?
Humans might create AGI without fully understanding how.
Thus a machine can solve tasks without "understanding" them
AGI currently is an intentionally vague and undefined goal. This allows businesses to operate towards a goal, define the parameters, and relish in the “rocket launches”-esque hype without leaving the vague umbrella of AI. It allows businesses to claim a double pursuit. Not only are they building AGI but all their work will surely benefit AI as well. How noble. Right?
Its vagueness is intentional and allows you to stay blind to the truth and fill in the gaps yourself. You just have to believe it's right around the corner.
"If the human brain were so simple that we could understand it, we would be so simple that we couldn’t." - without trying to defend such business practice, it appears very difficult to define what are necessary and sufficient properties that make AGI.
In my opinion it's probably closer to real AGI than not. I think the missing piece is learning after the pretraining phase.
It must be wonderful to live life with such supreme unfounded confidence. Really, no sarcasm, I wonder what that is like: to be so sure of something when many smarter people are not, and when we don't know how our own intelligence fully works or evolved, and don't know if ANY lessons from our own intelligence even apply to artificial ones.
And yet, so confident. So secure. Interesting.
I think it means a self-sufficient mind, which LLMs inherently are not.
So-so general intelligence is a lot harder to sell than narrow competence.
Yes, I don't understand their ridiculous AGI hype. I get it, you need to raise a lot of money.
We need to crack the code for updating the base model on the fly or daily / weekly. Where is the regular learning by doing?
Not over the course of a year, spending untold billions to do it.
Technically, the models can already learn on the fly. Just that the knowledge it can learn is limited to the context length. It cannot, to use the trendy word, "grok" it and internally adjust the weights in its neural network yet.
To change this you would either need to let the model retrain itself every time it receives new information, or to have such a long context that there is no effective difference. I suspect even meat models like our brains still struggle to do this effectively and need a long rest cycle (i.e. sleep) to handle it. So the problem is inherently more difficult to solve than just "thinking". We may even need an entirely new architecture, different from the neural network, to achieve this.
> Technically, the models can already learn on the fly. Just that the knowledge it can learn is limited to the context length.
Isn't that just improving the prompt to the non-learning model?
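For illustration, that on-the-fly "learning" is just information placed in the prompt. A minimal sketch with the OpenAI Python client (the notes and model name below are made-up examples):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical information the model was never trained on; it exists only in the prompt.
notes = (
    "Internal release notes (example):\n"
    "- The 'atlas-cache' service now defaults to a 15 minute TTL.\n"
    "- The old 'atlas-store' endpoint is deprecated."
)

response = client.chat.completions.create(
    model="gpt-4o",  # example model name
    messages=[
        {"role": "system", "content": "Answer using only the notes provided."},
        {"role": "user", "content": notes + "\n\nWhat is the default TTL for atlas-cache?"},
    ],
)
print(response.choices[0].message.content)

# The "knowledge" lives entirely in the context window: start a fresh conversation
# without the notes and the weights are unchanged, so the model no longer knows it.
```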
The only small problem is that the models are neither thinking nor understanding; I am not sure how this kind of wording is allowed for these models.
I understand the hype. I think most humans understand why a machine responding to a query like never before in the history of mankind is amazing.
What you're going through is hype overdose. You're numb to it. Like, I can get it if someone disagrees, but it's a next-level lack of understanding of human behavior if you don't get the hype at all.
There exist living human beings, still children or people with brain damage, whose intelligence is comparable to an LLM's, and we classify those humans as conscious but we don't with LLMs.
I'm not trying to say LLMs are conscious, just that the creation of LLMs marks a significant turning point. We crossed a barrier 2 years ago somewhat equivalent to landing on the moon, and I am just dumbfounded that someone doesn't understand why there is hype around this.
The first plane ever flies, and people think "we can fly to the moon soon!".
Yet powered flight has nothing to do with space travel, no connection at all. Gliding in the air via low/high pressure doesn't mean you'll get near space, ever, with that tech. No matter how you try.
AI and AGI are like this.
And yet, the moon was reached a mere 66 years after the first powered flight. Perhaps it's a better heuristic than you are insinuating...
In all honesty, there are lots of connections between powered flight and space travel. Two obvious ones are "light and strong metallurgy" and "a solid mathematical theory of thermodynamics". Once you can build lightweight and efficient combustion chambers, a lot becomes possible...
Similarly, with LLMs, it's clear we've hit some kind of phase shift in what's possible - we now have enough compute, enough data, and enough know-how to be able to copy human symbolic thought by sheer brute-force. At the same time, through algorithms as "unconnected" as airplanes and spacecraft, computers can now synthesize plausible images, plausible music, plausible human speech, plausible anything you like really. Our capabilities have massively expanded in a short timespan - we have cracked something. Something big, like lightweight combustion chambers.
The status quo ante is useless to predict what will happen next.
That’s not true. There was not endless hype about flying to the moon when the first plane flew.
People are well aware of the limits of LLMs.
As slow as the progress is, we now have metrics and measurable progress towards AGI, even when there are clear signs of limitations on LLMs. We never had this before and everyone is aware of this. No one is delusional about it.
The delusion is more around people who think other people are making claims of going to the moon in a year or something. I can see it in 10 to 30 years.
Which sounds like... a very good thing?
I have a lot of luck using 4o to build and iterate on context and then carry that into o1. I’ll ask 4o to break down concepts, make outlines, identify missing information and think of more angles and options. Then at the end, switch on o1 which can use all that context.
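A minimal sketch of that two-stage workflow with the OpenAI Python client (the model names, prompts and topic are illustrative placeholders):

```python
from openai import OpenAI

client = OpenAI()
topic = "a rate limiter for a multi-tenant API"  # example task

# Stage 1: use the cheaper chat model to build up context.
scoping = client.chat.completions.create(
    model="gpt-4o",  # example model name
    messages=[{
        "role": "user",
        "content": f"For {topic}: break down the key concepts, draft an outline, "
                   "list missing information, and suggest more angles and options.",
    }],
)
context = scoping.choices[0].message.content

# Stage 2: hand all of that accumulated context to the reasoning model in one go.
final = client.chat.completions.create(
    model="o1",  # example model name
    messages=[{
        "role": "user",
        "content": context + "\n\nUsing everything above, produce a complete, concrete design.",
    }],
)
print(final.choices[0].message.content)
```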
FWIW: OpenAI provides advice on how to prompt o1 (https://platform.openai.com/docs/guides/reasoning/advice-on-...). Their first bit of advice is to, “Keep prompts simple and direct: The models excel at understanding and responding to brief, clear instructions without the need for extensive guidance.”
The article links out to OpenAI's advice on prompting, but it also claims:
> OpenAI does publish advice on prompting o1, but we find it incomplete, and in a sense you can view this article as a "Missing Manual" to lived experience using o1 and o1 pro in practice.
To that end, the article does seem to contradict some of the advice OpenAI gives. E.g., the article recommends stuffing the model with as much context as possible... while OpenAI's docs note to include only the most relevant information to prevent the model from overcomplicating its response.
I haven't used o1 enough to have my own opinion.
Those are contradictory. OpenAI claims that you don't need a manual, since o1 performs best with simple prompts. The author claims it performs better with more complex prompts, but provides no evidence.
In case you missed it:
> OpenAI does publish advice on prompting o1, but we find it incomplete, and in a sense you can view this article as a "Missing Manual" to lived experience using o1 and o1 pro in practice.
The last line is important.
I think there is a distinction between "instructions", "guidance" and "knowledge/context". I tend to provide o1 pro with a LOT of knowledge/context, a simple instruction, and no guidance. I think TFA is advocating the same.
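A sketch of that prompt shape: a large block of knowledge/context, one plain instruction, and no step-by-step guidance (the file paths here are hypothetical placeholders):

```python
from pathlib import Path

# Knowledge/context: dump everything that might be relevant, verbatim.
context_files = ["docs/requirements.md", "src/billing.py", "logs/failed_run.txt"]  # placeholders
context = "\n\n---\n\n".join(Path(p).read_text() for p in context_files)

# Instruction: one sentence saying what you want, not how to think about it.
instruction = "Find the cause of the failed run described in the log and propose a fix."

# No "think step by step", no role-play preamble, no worked examples:
# the reasoning model supplies its own procedure.
prompt = f"{context}\n\n---\n\n{instruction}"
```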
So in a sense, being an early adopter for the previous models makes you worse at this one?
The advice is wrong
But the way they did their PR for o1 made it sound like it was the next step, while in reality it was a side step: a branch off the current direction towards AGI.
I made a tool for manually collecting context. I use it when copying and pasting multiple files is cumbersome: https://pypi.org/project/ggrab/
I created thisismy.franzai.com for the same reason.
coauthor/editor here!
we recorded a followup conversation after the surprise popularity of this article breaking down some more thoughts and behind the scenes: https://youtu.be/NkHcSpOOC60?si=3KvtpyMYpdIafK3U
Thanks for sharing this video, swyx. I learned a lot from listening to it. I hadn’t considered checking prompts for a project into source control. This video has also changed my approach to prompting in the future.
thanks for watching!
"Prompts in source control" is kinda like "configs in source control" for me: recommended for small projects, but at scale you eventually want to abstract it out into some kind of prompt manager software for others to use, and even for yourself to track and manage over time. Git isn't the right database for everything.
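At the small-project end of that spectrum, "prompts in source control" can be as simple as a directory of templates tracked alongside the code. A minimal sketch (the file layout and template name are hypothetical):

```python
from pathlib import Path
from string import Template

PROMPT_DIR = Path("prompts")  # e.g. prompts/summarize_ticket.md, reviewed like any other code

def load_prompt(name: str, **params: str) -> str:
    """Load a git-tracked prompt template and substitute its $placeholders."""
    template = Template((PROMPT_DIR / f"{name}.md").read_text())
    return template.substitute(**params)

# Prompt changes now show up in diffs, blame and code review, which is much of
# what heavier prompt-manager tooling buys you early on.
prompt = load_prompt("summarize_ticket", ticket_id="ABC-123", audience="support team")
```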
People are agreeing and disagreeing with the central thesis of the article, which is fine, because I enjoy the discussion...
No matter where you stand in the specific o1/o3 discussion, the concept of "question entropy" is very enlightening.
What is the question of theoretical minimum complexity that still solves your problem adequately? Or, for a specific model, are its users capable of supplying the minimum intellectual complexity the model needs?
It would be interesting to quantify these two and see if our models are close to converging on certain task domains.
This echoes my experience. I often use ChatGPT to help with D&D module design, and I found that o1 did best when I told it exactly what I required, dumped in a large amount of info, and did not expect to use it to iterate multiple times.
Can you provide prompt/response pairs? I'd like to test how other models perform using the same technique.
I'd love to see some examples, of good and bad prompting of o1
I'll admit I'm probably not using O1 well, but I'd learn best from examples.
Work with chat bots like a junior dev, work with o1 like a senior dev.
One thing I'd like to experiment with is "prompt to service". I want to take an existing microservice of about 3-5kloc and see if I can write a prompt to get o1 to generate the entire service, proper structure, all files, all tests, compiles and passes etc. o1 certainly has the context window to do this at 200k input and 100k output - code is ~10 tokens per line of code, so you'd need like 100k input and 50k output tokens.
My approach would be:
- take an exemplar service, dump it in the context
- provide examples explaining specific things in the exemplar service
- write a detailed formal spec
- ask for the output in JSON to simplify writing the code - [{"filename":"./src/index.php", "contents":"<?php...."}]
The first try would inevitably fail, so I'd provide errors and feedback and ask for new code (i.e. the complete service, not diffs or explanations), plus have o1 update and rewrite the spec based on my feedback and the errors.
Curious if anyone's tried something like this.
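For the output-handling side of that experiment, a minimal sketch that assumes the model really does return the JSON array of files described above:

```python
import json
from pathlib import Path

def write_generated_service(model_output: str, root: str = "generated") -> None:
    """Write a response shaped like
    [{"filename": "./src/index.php", "contents": "<?php ..."}, ...] to disk."""
    files = json.loads(model_output)
    root_dir = Path(root).resolve()
    for entry in files:
        target = (root_dir / entry["filename"]).resolve()
        # Refuse any path that escapes the output directory.
        if not target.is_relative_to(root_dir):
            raise ValueError(f"unsafe path in model output: {entry['filename']}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(entry["contents"])

# After each iteration: write the files, run the build and tests,
# and feed the errors back into the next prompt as described above.
```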
this is hilarious
Oh god, using an LLM for medical advice? And maybe getting 3/5 right? Barely above a coin flip.
And that Warning section? "Do not be wrong. Give the correct names." That this is necessary to include is an idiotic product "choice", since it implies that without the warning the bot is liable to be wrong and give the wrong names. This is not engineering.
Not if you're selecting out of 10s or 100s of possible diagnoses
It's hard to characterize the entropy of the distribution of potential diseases given a presentation: even if there are in theory many potential diagnoses, in practice a few will be a lot more common.
It doesn't really matter how much better the model is than random chance on a sample size of 5, though. There's a reason medicine is so heavily licensed: people die when they get uninformed advice. Asking o1 if you have skin cancer is gambling with your life.
That's not to say AI can't be useful in medicine: not everyone has a dermatologist friend, after all, and I'm sure for many underserved people basic advice is better than nothing. Tools could make the current medical system more efficient. But you would need to do so much more work than whatever this post did to ascertain whether that would do more good than harm. Can o1 properly direct people to a medical expert if there's a potentially urgent problem that can't be ruled out? Can it effectively disclaim its own advice when asked about something it doesn't know about, the way human doctors refer to specialists?
?????? What?
> Just for fun, I started asking o1 in parallel. It’s usually shockingly close to the right answer — maybe 3/5 times. More useful for medical professionals — it almost always provides an extremely accurate differential diagnosis.
THIS IS DANGEROUS TO TELL PEOPLE TO DO. OpenAI is not a medical professional. Stop using chatbots for medical diagnoses. 60% is not "almost always extremely accurate." This whole post, because of this bullet point, shows the author doesn't actually know the limitations of the product they're using and is instead passing along misinformation.
Go to a doctor, not your chatbot.
I honestly think trusting exclusively your own doctor is a dangerous thing to do as well. Doctors are not infallible.
It's worth putting in some extra effort yourself, which may include consulting with LLMs provided you don't trust those blindly and are sensible about how you incorporate hints they give you into your own research.
Nobody is as invested in your own health as you are.
This is a bug, and a regression, not a feature.
It's odd to see it recast as "you need to give better instructions [because it's different]" -- you could drop the "because it's different" part, and it'd apply to failure modes in all models.
It also raises the question of how it's different, and that's where the rationale gets circular: you have to prompt it differently because it's different, because you have to prompt it differently.
And where that really gets into trouble is the "and that's the point" part -- as the other comment notes, it's expressly against OpenAI's documentation and thus intent.
I'm a yuge AI fan. Models like this are a clear step forward. But it does a disservice to readers to leave the impression that the same techniques don't apply to other models, and recasts a significant issue as design intent.
Looking at o1's behavior, it seems there's a key architectural limitation: while it can see chat history, it doesn't seem able to access its own reasoning steps after outputting them. This is particularly significant because it breaks the computational expressivity that made chain-of-thought prompting work in the first place—the ability to build up complex reasoning through iterative steps.
This will only improve when o1's context windows grow large enough to maintain all its intermediate thinking steps, and we're talking orders of magnitude beyond current limits. Until then, this isn't just a UX quirk; it's a fundamental constraint on the model's ability to develop thoughts over time.
> This will only improve when o1's context windows grow large enough to maintain all its intermediate thinking steps, and we're talking orders of magnitude beyond current limits.
Rather than retaining all those steps, what about just retaining a summary of them? Or put them in a vector DB so on follow-up it can retrieve the most relevant subset of them to the follow-up question?
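A rough sketch of that second idea: keep summaries of earlier steps and retrieve only the relevant ones for a follow-up, using the OpenAI embeddings endpoint (note that o1's raw reasoning tokens aren't exposed, so in practice this would operate on whatever summaries or notes you do have; the step summaries below are hypothetical):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Hypothetical summaries of earlier intermediate steps from the session.
step_summaries = [
    "Step 1: ruled out the scheduler; the latency spike correlates with GC pauses.",
    "Step 2: the heap profile shows unbounded growth in the image cache.",
]
step_vectors = embed(step_summaries)

def relevant_steps(follow_up: str, k: int = 1) -> list[str]:
    """Return the k stored summaries most similar to the follow-up question."""
    q = embed([follow_up])[0]
    sims = step_vectors @ q / (np.linalg.norm(step_vectors, axis=1) * np.linalg.norm(q))
    return [step_summaries[i] for i in np.argsort(sims)[::-1][:k]]

# Prepend only the retrieved subset to the follow-up prompt instead of the full trace.
print(relevant_steps("Why does memory keep climbing after the cache fix?"))
```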
Is that relevant here? The post discussed writing a long prompt to get a good answer, not issues with, e.g., step #2 forgetting what was done in step #1.
I wouldn't be so harsh: you could have a 4o-style LLM turn vague user queries into precise constraints for an o1-style AI. This is how a lot of Stable Diffusion image generators work already.
Correct, you get it: it's turtles all the way down, not "it's different intentionally".
> To justify the $200/mo price tag, it just has to provide 1-2 Engineer hours a month
> Give a ton of context. Whatever you think I mean by a “ton” — 10x that.
One step forward. Two steps back.
It does seem like individual prompting styles greatly affect the performance of these models. Which makes sense, of course, but the disparity is a lot larger than I would have expected. As an example, I'd say I see far more people in the HN comments preferring Claude over everything else. This is in stark contrast to my experience, where ChatGPT has been and continues to be my go-to for everything. And that's on a range of problems: general questions, coding tasks, visual understanding, and creative writing. I use these AIs all day, every day as part of my research, so my experience is quite extensive. Yet in all cases Claude has performed significantly worse for me. Perhaps it just comes down to the way that I prompt versus the average HN user? Very odd.
But yeah, o1 has been a _huge_ leap in my experience. One huge thing, which OpenAI's announcement mentions as well, is that o1 is more _consistently_ strong. 4o is a great model, but sometimes you have to spin the wheel a few times. I much more rarely need to spin o1's wheel, which mostly makes up for its thinking time. (Which is much less these days compared to o1-preview). It also has much stronger knowledge. So far it has solved a number of troubleshooting tasks that there were _no_ fixes for online. One of them was an obscure bug in libjpeg.
It's also better at just general questions, like wanting to know the best/most reputable store for something. 4o is too "everything is good! everything is happy!" to give helpful advice here. It'll say Temu is a "great store for affordable options." That kind of stuff. Whereas o1 will be more honest and thus helpful. o1 is also significantly better at following instructions overall, and inferring meaning behind instructions. 4o will be very literal about examples that you give it whereas o1 can more often extrapolate.
One surprising thing that o1 does that 4o has never done is that it _pushes back_. It tells me when I'm wrong (and is often right!). Again, part of that is being less happy and compliant. I have had scenarios where it's wrong and it's harder to convince it otherwise, so it's a double-edged sword, but overall it has been an improvement in the bot's usefulness.
I also find it interesting that o1 is less censored. It refuses far less than 4o, even without coaxing, despite its supposed ability to "reason" about its guidelines :P What's funny is that the "inner thoughts" that it shows says that it's refusing, but its response doesn't.
Is it worth $200? I don't think it is, in general. It's not really an "engineer" replacement yet, in that if you don't have the knowledge to ask o1 the right questions it won't really be helpful. So you have to be an engineer for it to work at the level of one. Maybe $50/mo?
I haven't found o1-pro to be useful for anything; it's never really given better responses than o1 for me.
(As an aside, Gemini 2.0 Flash Experimental is _very_ good. It's been trading blows with even o1 for some tasks. It's a bit chaotic, since its training isn't done, but I rank it at about #2 among all SOTA models. A 2.0 Pro model would likely be tied with o1 if Google's trajectory here continues.)