Has LLM killed traditional NLP?

154 points by vietthangif 8 months ago

axegon_ 8 months ago

No, it has not and will not in the foreseeable future. This is one of my responsibilities at work. LLMs are not feasible when you have a dataset of 10 million items that you need to classify relatively fast and at a reasonable cost. LLMs are great at mid-level complexity tasks given a reasonable volume of data - they can take away the tedious job of figuring out what you are looking at or even come up with some basic mapping. But anything at large volumes.. Na. Real life example: "is '20 bottles of ferric chloride' a service or a product?"

One prompt? Fair. 10? Still ok. 100? You're pushing it. 10M - get help.

segmondy 8 months ago

You are not pushing it at 100. I can classify "is '20 bottles of ferric chloride' a service or a product?" in probably 2 seconds with a 4090. Something that most people don't realize is that you can run multiple inferences at once. So with something like a 4090 and some solid few-shot examples, instead of having it classify one example at a time, you can do 5. We can probably run 100 parallel inferences at 5 at a time, for a rate of about 250 a second on a 4090. So in 11 hours I'll be done. I'm going with a 7-8B model too. Some of the 1.5-3B models are great and will run even faster. Take a competent developer who knows Python and how to use an OpenAI-compatible API: they can put this together in 10-15 minutes, with no data science/scikit-learn or other NLP toolchain experience.
So for personal, medium or even large workloads, I think it has killed it. The workload needs to be extremely large before that changes. If you are classifying or segmenting comments on a social media platform where you need to deal with billions a day, then an LLM would be a very inefficient approach, but for 90+% of use cases I think it wins.
I'm assuming you are going to run it locally because everyone is paranoid about their data. It's even cheaper if you use a cloud API.
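A quick check of the throughput arithmetic in this comment. This is only a sketch: it reads "100 parallel inference at 5 at a time" as 100 concurrent requests of 5 items each, every request taking about 2 seconds. That reading is an assumption, and the function name is made up for illustration.

```python
def hours_to_finish(num_items: int, items_per_second: float) -> float:
    """Wall-clock hours needed to process num_items at a steady rate."""
    return num_items / items_per_second / 3600


# Figures taken from the comment (interpretation assumed, see lead-in).
parallel_requests = 100
items_per_request = 5
seconds_per_request = 2.0

rate = parallel_requests * items_per_request / seconds_per_request  # items/s
print(f"rate: {rate:.0f} items/s")
print(f"10M items batched: {hours_to_finish(10_000_000, rate):.1f} hours")

# For comparison: one item at a time, 2 seconds each.
print(f"10M items sequential: {hours_to_finish(10_000_000, 0.5) / 24:.1f} days")
```

Whether a single 4090 actually sustains that rate is a separate, empirical question; the sketch only shows that the 11-hour figure follows from the stated numbers.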
- griomnib 8 months ago
  
  Or you can build a DistilBERT model and get your egregiously inefficient 2 seconds down to tens of milliseconds.
- mikeocool 8 months ago
  
  If you have to classify user input as they’re inputting it to provide a response — so it can’t be batched - 2 seconds could potentially be really slow.
  Though LLMs sure have made creating training data to train old school models for those cases a lot easier.
  - griomnib 8 months ago
    
    Yeah, that’s what I do: use LLM to help make training data for small models. It’s so much more efficient, fast, and ergo, scalable.
- magic_hamster 8 months ago
  
  Yes and no. Having used these tools extensively I think it will be some time before LLMs are truly performant. Even smaller models can't be compared to running optimized code with efficient data structures. And smaller models (in general) do reduce the quality of your results in most cases. Maybe LLMs will kill off NLP and other pursuits pretty soon, but at the moment, each have their tradeoffs.
- WildGreenLeave 8 months ago
  
  Correct me if I'm wrong, but if you run multiple inferences at the same time on the same GPU, you will need to load multiple models into the vram and the models will fight for resources, right? So running 10 parallel inferences will slow everything down 5 times, right? Or am I missing something?
  - Palmik 8 months ago
    
    Inference for a single example is memory bound. By doing batch inference, you can interleave computation with memory loads, without losing much speed (up until you cross the compute-bound threshold).
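The memory-bound versus compute-bound point can be illustrated with a very simplified roofline-style estimate. Everything numeric here is an illustrative placeholder (an 8B-parameter model in 16-bit weights, round-number bandwidth and FLOP rates), not a measured spec, and the model ignores attention and KV-cache traffic.

```python
def decode_step_seconds(batch_size, weight_bytes, mem_bandwidth,
                        flops_per_token, peak_flops):
    """Simplified roofline estimate of one decoding step for a whole batch.

    The weights are streamed from memory once per step regardless of batch
    size, while the arithmetic grows linearly with the batch.
    """
    memory_time = weight_bytes / mem_bandwidth
    compute_time = batch_size * flops_per_token / peak_flops
    return max(memory_time, compute_time)


# Placeholder numbers, chosen only to make the shape of the curve visible.
PARAMS = 8e9
WEIGHT_BYTES = PARAMS * 2        # 16-bit weights
MEM_BANDWIDTH = 1e12             # bytes/s (assumed)
FLOPS_PER_TOKEN = 2 * PARAMS     # rough rule of thumb for a forward pass
PEAK_FLOPS = 1.6e14              # FLOP/s (assumed)

for batch in (1, 8, 64, 160, 320):
    step = decode_step_seconds(batch, WEIGHT_BYTES, MEM_BANDWIDTH,
                               FLOPS_PER_TOKEN, PEAK_FLOPS)
    print(f"batch {batch:>3}: {step * 1000:5.1f} ms/step, "
          f"{batch / step:8.0f} tokens/s")
```

Under these assumptions the step time stays flat until the batch is large enough for compute to dominate, which is why batching raises throughput almost for free up to that threshold.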
  - bavell 8 months ago
    
    You will most likely be using the same model so just 1 to load into vram.
  - aeternum 8 months ago
    
    No, the key is to use the full context window so you structure the prompt as something like: For each line below, repeat the line, add a comma then output whether it most closely represents a product or service:
    20 bottles of ferric chloride
    salesforce
    ...
    
    e12e 8 months ago
    
    Appreciate the concrete advice in this response. Thank you.
- rldjbpin 8 months ago
  
  even more naive way - just club several classification requests into one prompt. in practice, this is not production-ready as the llm output does not always contain results for the same number of inputs (sometimes more than inputted even!)
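One way to live with the mismatch described here is to match response lines back to inputs by their repeated text rather than by position, and retry whatever is missing. The sketch below assumes the "repeat the line, add a comma, then the label" prompt format suggested upthread; the function names and the fake response are made up for illustration, and no real API call is made.

```python
INSTRUCTION = (
    "For each line below, repeat the line, add a comma, then output "
    "whether it most closely represents a product or service:"
)


def build_prompt(items):
    """Pack several items into one prompt, one item per line."""
    return INSTRUCTION + "\n" + "\n".join(items)


def parse_response(items, response_text):
    """Match response lines back to the inputs by their repeated text.

    Returns (labels, missing). labels maps item -> "product" or "service";
    missing lists inputs with no usable answer, which the caller can retry.
    Extra or malformed lines from the model are ignored.
    """
    wanted = set(items)
    labels = {}
    for line in response_text.splitlines():
        text, sep, label = line.rpartition(",")
        text, label = text.strip(), label.strip().lower()
        if sep and text in wanted and label in ("product", "service"):
            labels.setdefault(text, label)  # keep the first answer per item
    missing = [item for item in items if item not in labels]
    return labels, missing


items = [
    "20 bottles of ferric chloride",
    "salesforce",
    "alarm installation in two locations",
]
# Invented model output: one invented extra line, one input left unanswered.
fake_response = (
    "20 bottles of ferric chloride, product\n"
    "salesforce, service\n"
    "some line the model made up, product\n"
)
labels, missing = parse_response(items, fake_response)
print(labels)
print("retry:", missing)
```

This does not make the approach accurate, it only makes the failure mode visible so that dropped or hallucinated lines do not silently shift labels onto the wrong rows.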
- vrighter 8 months ago
  
  two seconds is a VERY VERY VERY long time. That is mind-bogglingly, insanely slow.
- mystified5016 8 months ago
  
  At 2s per query for 10M entries, that's about 231 days to run through the database.
- axegon_ 8 months ago
  
  FFS... "Lots of writers, few readers". Read again and do the math: 2 seconds, multiply that by 10 million records which contain this, as well as "alarm installation in two locations" and a whole bunch of other crap with little to no repetition (<2%) and where does that get you? 2 * 10,000,000 = 20,000,000 SECONDS!!!! A day has 86,400 seconds (24 * 3600 = 86,400). The data pipeline needs to finish in <24 hours. Everyone needs to get this into their heads somehow: LLM's are not a silver bullet. They will not cure cancer anytime soon, nor will they be effective or cheap enough to run at massive scale. And I don't mean cheap as in "oh, just get openai subscription hurr durr". Throwing money mindlessly into something is never an effective way to solve a problem.
  - why_only_15 8 months ago
    
    Assuming the 10M records is ~2000M input tokens + 200M output tokens, this would cost $300 to classify using llama-3.3-70b[1]. If using llama lets you do this in say one day instead of two days for a traditional NLP pipeline, it's worthwhile.
    [1]: https://openrouter.ai/meta-llama/llama-3.3-70b-instruct
    
    sangnoir 8 months ago
    
    > ...two days for a traditional NLP pipeline
    Why 2 days? Machine Learning took over the NLP space 10-15 years ago, so the comparison is between small, performant task-specific models versus LLMs. There is no reason to believe the "traditional" NLP pipelines are inherently slower than Large Language Models, and they aren't.
    
    why_only_15 8 months ago
    
    my claim is not that it would take two days for such a pipeline to run but that it would take two days to make an NLP pipeline whereas an LLM pipeline would be faster to make.
  - gbnwl 8 months ago
    
    Why are you using 2 seconds? The commenter you are responding to hypothesized being able to do 250/s based on "100 parallel inference at 5 at a time". Not speaking to the validity of that, but find it strange that you ran with the 2 seconds number after seemingly having stopped reading after that line, while yourself lamenting people don't read and telling them to "read again".
    
    axegon_ 8 months ago
    
    Ok, let me dumb it down for you: you have a cockroach in your bathroom and you want to kill it. You have an RPG and you have a slipper. Are you gonna use the RPG or are you going to use the slipper? Even if your bathroom is somehow fine after getting shot with an RPG, isn't this overkill? What if you can code and train a binary classifier in 2 hours that uses nearly 0 resources and gives you good enough results (in my case way above what my targets were) without having to use a ton of resources, libraries, RAGs, hardware and hell, even electricity? I mean how hard is this to comprehend really?
    https://deviq.com/antipatterns/shiny-toy
    
    Vampiero 8 months ago
    
    This thread is chock full of people who have no clue about what traditional AI even is. I'm sorry you have to deal with literal children
    
    gbnwl 8 months ago
    
    Sure, but this doesn't answer my question nor tie into your last comment at all. It's Saturday evening in much of the world, are you sober?
    
    jazzyjackson 8 months ago
    
    OP said 2 seconds as if that wasn't an eternity...
    
    gbnwl 8 months ago
    
    But then they said 250/second when running multiple inference? Again I don't know if their assertions about running multiple inference are correct but why focus on the wrong number instead of addressing the actual claim?
    
    Vampiero 8 months ago
    
    250/s is still nothing when compared to an actual NLP pipeline that takes a few ms per it, because you can parallelize that too.
    I know it's hard to understand, but you can achieve a throughput that is a few orders of magnitude higher.
    
    EvgeniyZh 8 months ago
    
    250/s is few (4) ms per it
alexwebb2 8 months ago

I think your intuition on this might be lagging a fair bit behind the current state of LLMs.
System message: answer with just "service" or "product"
User message (variable): 20 bottles of ferric chloride
Response: product
Model: OpenAI GPT-4o-mini
$0.075/1Mt batch input * 27 input tokens * 10M jobs = $20.25
$0.300/1Mt batch output * 1 output token * 10M jobs = $3.00
It's a sub-$25 job.
You'd need to be doing 20 times that volume every single day to even start to justify hiring an NLP engineer instead.
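The cost arithmetic above can be reproduced in a few lines. The prices and token counts are the ones quoted in this comment, not independently checked against a current price list, and the function name is made up for illustration.

```python
def batch_cost_usd(jobs, tokens_per_job, usd_per_million_tokens):
    """Cost of one side (input or output) of a batch job, in dollars."""
    return jobs * tokens_per_job * usd_per_million_tokens / 1_000_000


JOBS = 10_000_000
input_cost = batch_cost_usd(JOBS, tokens_per_job=27, usd_per_million_tokens=0.075)
output_cost = batch_cost_usd(JOBS, tokens_per_job=1, usd_per_million_tokens=0.300)

print(f"input:  ${input_cost:.2f}")
print(f"output: ${output_cost:.2f}")
print(f"total:  ${input_cost + output_cost:.2f}")
```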
- simonw 8 months ago
  
  You might be able to use an even cheaper model. Google Gemini 1.5 Flash 8B is Input: $0.04 / Output: $0.15 per 1M tokens.
  17 input tokens and 2 output tokens * 10 million jobs = 170,000,000 input tokens, 20,000,000 output tokens... which at those prices costs about $6.80 for input plus $3.00 for output, roughly $9.80 in total https://tools.simonwillison.net/llm-prices
  As for rate limits, https://ai.google.dev/pricing#1_5flash-8B says 4,000 requests per minute and 4 million tokens per minute - so you could run those 10 million jobs in about 2500 minutes or 42 hours. I imagine you could pull a trick like sending 10 items in a single prompt to help speed that up, but you'd have to test carefully to check the accuracy effects of doing that.
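The rate-limit estimate can be sketched as the slower of two bounds: requests per minute and tokens per minute. The limits and token counts are the ones quoted in this comment; the function name is hypothetical, and the batched case ignores the extra instruction tokens a multi-item prompt would need.

```python
import math


def minutes_under_rate_limits(jobs, tokens_per_job, items_per_request,
                              requests_per_minute, tokens_per_minute):
    """Minimum minutes to push all jobs through, given both rate limits."""
    requests = math.ceil(jobs / items_per_request)
    request_bound = requests / requests_per_minute
    token_bound = jobs * tokens_per_job / tokens_per_minute
    return max(request_bound, token_bound)


JOBS = 10_000_000
TOKENS_PER_JOB = 17 + 2  # input + output tokens, as quoted above

one_per_prompt = minutes_under_rate_limits(JOBS, TOKENS_PER_JOB, 1, 4_000, 4_000_000)
ten_per_prompt = minutes_under_rate_limits(JOBS, TOKENS_PER_JOB, 10, 4_000, 4_000_000)

print(f"1 item per prompt:   {one_per_prompt:.0f} min ({one_per_prompt / 60:.1f} h)")
print(f"10 items per prompt: {ten_per_prompt:.0f} min ({ten_per_prompt / 60:.1f} h)")
```

With prompts this short, the request limit is the binding constraint, which is why packing several items into one prompt helps so much on paper.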
- w10-1 8 months ago
  
  The question is not average cost but marginal cost of quality - same as voice recognition, which had relatively low uptake even at ~2-4% error rates due to context switching costs for error correction.
  So you'd have to account for the work of catching the residue of 2-8%+ error from LLMs. I believe the premise is for NLP, that's just incremental work, but for LLM's that could be impossible to correct (i.e., cost per next-percentage-correction explodes), for lack of easily controllable (or even understandable) models.
  But it's most rational in business to focus on the easy majority with lower costs, and ignore hard parts that don't lead to dramatically larger TAM.
  - gf000 8 months ago
    
    I am absolutely not an expert in NLP, but I wouldn't be surprised if for many kinds of problems LLMs would have far less error rate, than any NLP software.
    Like, lemmatization is pretty damn dumb in NLP, while a better LLM will be orders of magnitude more correct.
- griomnib 8 months ago
  
  This assumes you don’t care about our rapidly depleting carbon budget.
  No matter how much energy you save personally, running your jobs on Sam A’s earth killer ten thousand cluster of GPUs is literally against your own self interest of delaying climate disasters.
  LLM have huge negative externalities, there is a moral argument to only use them when other tools won’t work.
  - amanaplanacanal 8 months ago
    
    It's digging fossil carbon out of the ground that's the problem, not using electricity. Switch to electricity not from fossil carbon and you're golden.
    
    griomnib 8 months ago
    
    Drowning isn’t the problem; just the water.
  - renewiltord 8 months ago
    
    Haha, this is pretty good. I’m going to take a plane to SF while I laugh at this.
- elicksaur 8 months ago
  
  How do you validate these classifications?
  - bugglebeetle 8 months ago
    
    The same way you check performance for any problem like this: by creating one or more manually-labeled test datasets, randomly sampled from the target data and looking at the resulting precision, recall, f-scores etc. LLMs change pretty much nothing about evaluation for most NLP tasks.
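A minimal sketch of that evaluation in plain Python, assuming you already have a manually labeled sample and the classifier's predictions for the same rows. The tiny label lists are invented purely to exercise the function.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one class treated as the positive label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented example: hand labels vs. model output on five sampled rows.
gold = ["product", "product", "service", "service", "product"]
pred = ["product", "service", "service", "product", "product"]

for label in ("product", "service"):
    p, r, f = precision_recall_f1(gold, pred, positive=label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The same function works whether `pred` came from an LLM, a BERT-style model or a bag-of-words classifier, which is the point being made.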
  - segmondy 8 months ago
    
    The same way you validate it if you didn't use an LLM.
  - jeswin 8 months ago
    
    Isn't it easier and cheaper to validate than to classify (requires expensive engineers)? I mean the skill is not as expensive - many companies do this at scale.
  - scarface_74 8 months ago
    
    You need a domain expert either way. I mentioned in another reply that one of my niches is implementing call centers with Amazon Connect and Amazon Lex (the NLP engine).
    https://news.ycombinator.com/item?id=42748189
    I don’t know the domain beforehand they are working in, I do validation testing with them.
- axegon_ 8 months ago
  
  Yeah... Let's talk time needed for 10M prompts and how that fits into a daily pipeline. Enlighten us, please.
  - FloorEgg 8 months ago
    
    Run them all in parallel with a cloud function in less than a minute?
    
    hnfong 8 months ago
    
    Obviously all the LLM API providers have a rate limit. Not a fan of GP's sarcastic tone, but I suppose many of us would like to know roughly what that limit would be for a small business using such APIs.
    
    jdietrich 8 months ago
    
    The rate limits for Gemini 1.5 Flash are 2000 requests per minute and 4 million tokens per minute. Higher limits are available on request.
    https://ai.google.dev/pricing#1_5flash
    4o-mini's rate limits scale based on your account history, from 500RPM/200,000TPM to 30,000RPM/150,000,000TPM.
    https://platform.openai.com/docs/guides/rate-limits
    
    simonw 8 months ago
    
    Surprisingly, DeepSeek doesn't have a rate limit: https://api-docs.deepseek.com/quick_start/rate_limit
    I've heard from people running 100+ prompts in parallel against it.
    
    axegon_ 8 months ago
    
    Yes, how did I not think of throwing more money at cloud providers on top of feeding OpenAI, when I could have just coded a simple binary classifier and run everything on something as insignificant as an 8th-gen, quad-core i5....
    
    FloorEgg 8 months ago
    
    Did I mention openai?
    
    FloorEgg 8 months ago
    
    Ah my bad someone further up thread did.
    Really it boils down to balance of time and cost, and the skill set of the person getting the job done.
    But you seem really anti establishment (hung up over $25 cloud spend), so you do you.
    Just don't expect everyone else to agree with you.
    
    rlt 8 months ago
    
    Also can’t you just combine multiple classification requests into a single prompt?
    
    FloorEgg 8 months ago
    
    Yes, for such a simple labelling task request rate limits are more likely the bottleneck than token rate limits.
- LeafItAlone 8 months ago
  
  >You'd need to be doing 20 times that volume every single day to even start to justify hiring an NLP engineer instead.
  How much for the “prompt engineer”? Who is going to be doing the work and validating the output?
  - blindriver 8 months ago
    
    You do not need a prompt engineer to create: “answer with just "service" or "product"”
    Most classification prompts can be extremely easy and intuitive. The idea you have to hire a completely different prompt engineer is kind of funny. In fact you might be able to get the llm itself to help revise the prompt.
  - alexwebb2 8 months ago
    
    All software engineers are (or can be) prompt engineers, at least to the level of trivial jobs like this. It's just an API call and a one-liner instruction. Odds are very good at most companies that they have someone on staff who can knock this out in short order. No specialized hiring required.
    
    otabdeveloper4 8 months ago
    
    > ..and validating the output?
    You glossed over the meat of the question.
    
    alexwebb2 8 months ago
    
    Your validation approach doesn't really change based on the classification method (LLM vs NLP).
    At that volume you're going to use automated tests with known correct answers + random sampling for human validation.
  - IanCal 8 months ago
    
    Prompt engineering is less and less of an issue the simpler the job is and the more powerful the model is. You also don't need someone with deep nlp knowledge to measure and understand the output.
    
    LeafItAlone 8 months ago
    
    >less and less of an issue the simpler the job
    Correct, everything is easy and simple if you make it simple and easy…
    
    IanCal 8 months ago
    
    Plenty of simple jobs required people with deeper knowledge of AI in the past, now for many tasks in businesses you can skip over a lot of that and use a llm.
    Simple things were not always easy. Many of them are, now.
vlovich123 8 months ago

That’s the argument the article makes but the reasoning is a little questionable on a few fronts:
- It uses f16 for the data format whereas quantization can reduce the memory burden without a meaningful drop in accuracy, especially as compared with traditional NLP techniques.
- The quality of LLMs typically outperforms OpenCV + NER.
- You can choose to replace just part of the pipeline instead of using the LLM for everything (e.g. using text-only 3B or 1B models to replace the NER model while keeping OpenCV)
- The (LLM compute / quality) / watt is constantly decreasing. Meaning even if it’s too expensive today, the system you’ve spent time building, tuning and maintaining today is quickly becoming obsolete.
- Talking with new grads in NLP programs, all the focus is basically on LLMs.
- The capability + quality out of models / size of model keeps increasing. That means your existing RAM & performance budget keeps absorbing problems that seemed previously out of reach
Now of course traditional techniques are valuable because they can be an important tool in bringing down costs (fixed function accelerator vs general purpose compute), but it’s going to become more niche and specialized with most tasks transitioning to LLMs I think.
The “bitter lesson” paper is really relevant to these kinds of discussions.
- vlovich123 8 months ago
  
  Not an independent player so obviously important to be critical of papers like this [1], but it’s claiming a ~10x cost reduction in LLM inference every year. This lines up with the technical papers I’m seeing that are continually improving performance + the related HW improvements.
  That’s obviously not sustainable indefinitely, but these kinds of exponentials are precisely why people often draw incorrect conclusions about how long change will take to happen. Just a reminder: CPUs got 2x more performant every 18 months and for 20 years kept upending software companies that weren’t in tune with this cycle (i.e. focusing on performance instead of features). For example, even if you’re spending $10k/month for LLM vs $100/month to process the 10M items, it can still be more beneficial to go the LLM route, as you can buy cheaper expertise to put together your LLM pipeline than the NLP route to make up the ~100k/year difference (assuming the performance otherwise works and the improved quality and robustness of the LLM solution isn’t providing extra revenue to offset).
  [1] https://a16z.com/llmflation-llm-inference-cost/
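To put numbers on the exponential argument: taking the comment's hypothetical $10k/month versus $100/month and the claimed ~10x yearly drop in inference cost at face value, the gap closes quickly. This is a sketch of that arithmetic only; the function name is made up and the 10x rate is the cited claim, not something verified here.

```python
def years_until_at_or_below(llm_monthly_usd, target_monthly_usd,
                            yearly_cost_divisor=10.0):
    """Whole years until the LLM bill falls to the target, at a fixed yearly divisor."""
    if yearly_cost_divisor <= 1:
        raise ValueError("cost must shrink each year for this to terminate")
    years = 0
    cost = llm_monthly_usd
    while cost > target_monthly_usd:
        cost /= yearly_cost_divisor
        years += 1
    return years


llm_monthly, nlp_monthly = 10_000, 100
yearly_gap_today = (llm_monthly - nlp_monthly) * 12

print(f"gap today: ${yearly_gap_today:,}/year")
print(f"years to parity at 10x/year: {years_until_at_or_below(llm_monthly, nlp_monthly)}")
print(f"years to parity at 2x/year:  {years_until_at_or_below(llm_monthly, nlp_monthly, 2.0)}")
```

The second line shows how sensitive the conclusion is to the assumed rate of decline.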
blindriver 8 months ago

That’s sort of like asking a horse and buggy driver whether automobiles are going to put them out of business.
I think for the most part, casual nlp is dead because of LLMs. And LLM costs are going to plummet soon, so large scale nlp that you’re talking about is probably dead within 5 years or less. The fact that you can replace programmers with prompts is huge in my opinion so no one needs to learn an NLP API anymore, just stuff it into a prompt. Once costs to power LLMs decrease to meet the cost of programmers it’s game over.
- dartos 8 months ago
  
  > LLM costs
  Inference costs, not training costs.
  > The fact that you can replace programmers
  You can’t… not for any real project. For quick mockups they’re serviceable
  > That’s sort of like asking a horse and buggy driver whether automobiles
  Kind of an insult to OP, no? Horse and buggy drivers were not highly educated experts in their field.
  Maybe take the word of domain experts rather than AI company marketing teams.
  - blindriver 8 months ago
    
    > Maybe take the word of domain experts rather than AI company marketing teams.
    Appeal to authority is a well known logical fallacy.
    I know how dead NLP is personally because I’ve never been able to get NLP working but once ChatGPT came around, I was able to classify texts extremely easily. It’s transformational.
    I was able to get ChatGPT to classify posts based on how political it was from a scale of 1 to 10 and which political leaning they were and then classify the persons likely political affiliations.
    All of this without needing to learn any APIs or anything about NLPs. Sorry but given my experience, NLPs are dead in the water right now, except in terms of cost. And cost will go down exponentially as they always do. Right now I’m waiting for the RTX 5090 so I can just do it myself with open source LLM.
    
    vunderba 8 months ago
    
    > NLPs are dead in the water right now, except in terms of cost.
    False.
    With all due respect, the fact that you're referring to natural language parsing as "NLPs" makes me question whether you have any experience or modest knowledge around this topic, so it's rather bold of you to make such sweeping generalizations.
    It works for your use case because you're just one person running it on your home computer with consumer hardware. Some of us have to run NLP related processing (POS taggers, keyword extraction, etc) in a professional environment at tremendous scale, and reaching for an LLM would absolutely kill our performance.
    
    gf000 8 months ago
    
    My understanding is that inference models can absolutely scale down, we are only at the beginning of these getting minimized, and they are trivial to parallelize. That's not a good combo to be against them, their price/performance/efficiency will quickly drop/grow/grow.
    
    FridgeSeal 8 months ago
    
    “I couldn’t be bothered learning something, and now I don’t have to! Checkmate!”
    While LLM’s can have their uses, let’s not get carried away.
    
    scarface_74 8 months ago
    
    That’s true. I did avoid learning traditional NLP techniques because for my use case - call centers - LLMs do a much better job.
    Context for the problem space:
    https://dl.acm.org/doi/fullHtml/10.1145/3442381.3449870
    
    thaw13579 8 months ago
    
    Performance and cost are trade-offs though. You could just as well say that LLMs are dead in the water, except in terms of performance.
    It does seem likely we’ll soon have cheap enough LLM inference to displace traditional NLP entirely, although not quite yet.
    
    dartos 8 months ago
    
    > Appeal to authority is a well known logical fallacy.
    I did not make an appeal to authority. I made an appeal to expertise.
    It’s why you’d trust a doctor’s medical opinion over a child’s.
    I’m not saying “listen to this guy because they’re captain of NLP” I’m saying listen because experts have spent years of hands on experience with things like getting NLP working at all.
    > I know how dead NLP is personally because I’ve never been able to get NLP working
    So you’re not an expert in the field. Barely know anything about it, but you’re okay hand waving away expertise bc you got a toy NLP Demo working…
    That’s great, dude.
    > I was able to get ChatGPT to classify posts based on how political it was from a scale of 1 to 10
    And I know you didn’t compare the results against classic NLP to see if there was any improvements because you don’t know how…
    
    blindriver 8 months ago
    
    > I did not make an appeal to authority. I made an appeal to expertise.
    Lol
    > I’m saying listen because experts have spent years of hands on experience with things like getting NLP working at all.
    “It is difficult to get a man to understand something, when his salary depends on his not understanding it.”
    Upton Sinclair
    > Barely know anything about it, but you’re okay hand waving away expertise bc you got a toy NLP Demo working…
    Yes that’s my point. I don’t know anything about implementing an NLP but got something that works pretty well using an LLM extremely quickly and easily.
    > And I know you didn’t compare the results against classic NLP to see if there was any improvements because you don’t know NLP…
    Do you cross reference all your Google searches to make sure they are giving you the best results vs Bing and DDG?
    Do you cross reference the results from your NLP with LLMs to see if there were any improvements?
    
    dartos 8 months ago
    
    > Lol
    Great argument
    > “It is difficult to get a man to understand something, when his salary depends on his not understanding it.”
    NLP professionals are also LLM professionals. LLMs are tools in an NLP toolkit. LLMs don’t make the NLP professional obsolete the way it makes handwritten spam obsolete.
    I was going to explain this further but you literally wouldn’t understand.
    > Do you cross reference all your Google searches to make sure they are giving you the best results vs Bing and DDG?
    …Yes I do…
    That’s why I cancelled my kagi subscription. It was just as good as DDG.
    > Do you cross reference the results from your NLP with LLMs to see if there were any improvements?
    Yes I do… because I want to use the best tool for the job. Not just the first one I was able to get working…
    
    elicksaur 8 months ago
    
    I haven’t understood these types of uses. How do you validate the score that the LLM gives?
    
    blindriver 8 months ago
    
    The same way you validate scores given by NLPs I assume. You run various tests and look at the results and see if they match what you would expect.
  - chaos_emergent 8 months ago
    
    > Inference costs, not training costs.
    Why does training cost matter if you have a general intelligence that can do the task for you, that’s getting cheaper to run the task on?
    > for quick mockups they’re serviceable
    I know multiple startups that use LLMs as their core bread-and-butter intelligence platform instead of tuned but traditional NLP models
    > take the word of domain experts
    I guess? I wouldn’t call myself an expert by any means but I’ve been working on NLP problems for about 5 years. Most people I know in NLP-adjacent fields have converged around LLMs being good for most (but obviously not all) problems.
    > kind of an insult
    Depends on whether you think OP intended to offend, ig
    
    evolve7942 8 months ago
    
    > I know multiple startups that use LLMs as their core bread-and-butter intelligence platform instead of tuned but traditional NLP models
    It seems like LLMs would be perfect for start-ups that are iterating quickly. As the business, problem, and data mature though I would expect those LLMs to be consolidated into simpler models. This makes sense from a cost and reliability perspective. I wonder also about the impact of making your core IP a set of prompts beholden to the behavior of someone else’s model.
    
    dartos 8 months ago
    
    > Why does training cost matter if you have a general intelligence that can do the task for you, that’s getting cheaper to run the task on?
    Assuming we didn’t need to train it ever again, it wouldn’t. But we don’t have that, so…
    > I know multiple startups that use LLMs as their core bread-and-butter intelligence platform instead of tuned but traditional NLP models
    Okay? Did that system write itself entirely? Did it replace the programmers that actually made it?
    If so, they should pivot into a Devin competitor.
    > Most people I know in NLP-adjacent fields have converged around LLMs being good for most (but obviously not all) problems.
    Yeah LLMs are quite good at common NLP tasks, but AFAIK are not SOTA at any specific task.
    Either way, LLMs obviously don’t kill the need for the NLP field.
  - elwebmaster 8 months ago
    
    Reply didn’t say that the expert is uneducated, just that their tool is obsolete. Better look at facts the way they are, sugar coating doesn’t serve anyone.
- otabdeveloper4 8 months ago
  
  > The fact that you can replace programmers with prompts
  No, you can't. The only thing LLM's replace is internet commentators.
  - blindriver 8 months ago
    
    As I explained below, I avoided having to learn anything about ML, PyTorch or any other APIs when trying to classify posts based on how political they were and which affiliation they were. That was holding me back and it was easily replaced by an llm and a prompt. Literally took me minutes what would have taken days or weeks and the results are more than good enough.
    
    otabdeveloper4 8 months ago
    
    > what would have taken days or weeks
    Nah, searching Stackoverflow and Github doesn't take "weeks".
    That said, due to how utterly broken internet search is nowadays, using an LLM as a search engine proxy is viable.
    
    datadrivenangel 8 months ago
    
    GPT 3.5 is more accurate at classifying tweets as liberal than it is at identifying posts that are conservative.
    If you're going for rough approximation, LLMs are great, and good enough. More care and conventional ML methods are appropriate as the stakes increase though.
    
    alexwebb2 8 months ago
    
    GPT 3.5 has been very, very obsolete in terms of price-per-performance for over a year. Bit of a straw man.
  - portaouflop 8 months ago
    
    No you can’t; LLMs are dog shit at internet banter, too neutered
- arandomhuman 8 months ago
  
  >The fact that you can replace programmers with prompts
  this is how you end up with 1000s of lines of slop that you have no idea how it functions.
simonw 8 months ago

What NLP approaches are you using to solve the "is '20 bottles of ferric chloride' a service or a product?" problem?
- pona-a 8 months ago
  
  How about a naive Bayesian Bag of Words? Just find/scrape/generate with an LLM a large enough corpus of products/services, build the term frequency matrix, calculate class priors and P(term|class) and inference with straightforward application of Bayes theorem.
  This particular problem, at least to me, seems trivial, and to use an LLM for anything like this for more than a hundred cases seems incredibly wasteful.
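For readers who have not built one, here is roughly what that looks like in plain Python. It is a sketch only: the six training strings are invented, tokenization is a bare whitespace split, and add-one (Laplace) smoothing is an assumption the comment does not mention.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(examples):
    """examples: iterable of (text, label). Returns the counts needed for Bayes."""
    class_docs = Counter()               # documents per class (for priors)
    term_counts = defaultdict(Counter)   # term frequencies per class
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for term in tokenize(text):
            term_counts[label][term] += 1
            vocab.add(term)
    return class_docs, term_counts, vocab


def predict(model, text):
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for term in tokenize(text):
            if term not in vocab:
                continue  # never seen in training: carries no evidence
            # P(term | class) with add-one smoothing
            p = (term_counts[label][term] + 1) / (total_terms + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy corpus; a real one would be scraped or LLM-generated as suggested.
model = train([
    ("20 bottles of ferric chloride", "product"),
    ("box of nitrile gloves", "product"),
    ("5 litres of acetone", "product"),
    ("alarm installation in two locations", "service"),
    ("annual boiler maintenance", "service"),
    ("office cleaning service", "service"),
])
print(predict(model, "10 bottles of acetone"))
print(predict(model, "boiler installation"))
```

Inference is a handful of dictionary lookups per word, which is where the orders-of-magnitude speed difference discussed in this thread comes from. Accuracy on real data depends entirely on the corpus.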
devjab 8 months ago

While I agree with both you and the article I also think it'll depend on more than just the volume of your data. We have quite a lot of documents that we classify. It's around 10-100k a month, some rather large others simple invoices. We used to have a couple of AI specialists who handled the classification with local NLP models, but when they left we had to find alternatives. For us this was the AI services in the cloud we use and the result has been a document warehouse which is both easier for the business to manage and a "pipeline" which is much cheaper than having those AI specialists on the payroll.
I imagine this wouldn't be the case if we were to do more classification projects, but we aren't. We did try to find replacements first, but it was impossible for us to attract any talent, which isn't too much of a surprise considering it's mainly maintenance. Using external consultants for that maintenance proved to be almost more expensive than having two full time employees.
bloomingkales 8 months ago

I suspect any solution like that will be wholesale thrown away in a year or two. Unless the damn thing is going to make money in the next 2-3 years, we are all mostly going to write throwaway code.
Things are such an opportunity cost nowadays. It’s like trying to capture value out of a transient amorphous cloud, you can’t hold any of it in your hand but the phenomenon is clearly occurring.
MasterScrat 8 months ago

Can you talk about the main non-LLM NLP tools you use? e.g. BERT models?
> One prompt? Fair. 10? Still ok. 100? You're pushing it. 10M - get help.
Assuming you could do 10M+ LLM calls for this task at trivial cost and time, would you do it? i.e. is the only thing keeping you away from LLM the fact they're currently too cumbersome to use?
gf000 8 months ago

Why not just run a local LLM for practically free? You can even trivially parallelize it with multiple instances.
I would believe that many NLP problems can be easily solved even by smaller LLM models.
scarface_74 8 months ago

http://www.incompleteideas.net/IncIdeas/BitterLesson.html
specproc 8 months ago

I see LLMs best used as part of a more traditional NLP pipeline.
For example, an approach that does me well is clustering, then using LLMs on representative docs. Tools like BERTopic are great for this.
I also don't see a clear cut difference between the two in certain areas. Embeddings are critical in LLM pipelines but, for me anyway, are also "old school" tools.
I think NLP as described in the article is certainly under threat, but the tools and approaches complement LLM use well, are far more efficient, and distinguish the pros from the neophytes.
If you're using LLMs for NLP-type tasks, but don't know the NLP tools, you're missing out.
sireat 8 months ago

So what would you use to classify whether a document is a critique or something else in 1M documents in a non-English language?
This is a real problem I am dealing with at a library project.
Each document is between 100 and 10k tokens.
Most top (read most expensive) LLMs available in OpenRouter work great, it is the cost (and speed) that is the issue.
If I could come up with something locally runnable that would be fantastic.
Presumably BERT based classifiers would work if I had one properly trained for the language.
- rahimnathwani 8 months ago
  
  I guess you've already seen https://huggingface.co/collections/answerdotai/modernbert-67... ?
HarHarVeryFunny 8 months ago

10M items @ 10 tokens each ("20 bottles of ferric chloride" etc) plus 10M tokens out (category) is 100M tokens in 10M tokens out.
Claude Haiku is $0.25 per 1M tokens in, $1.05 per 1M out, so cost would be ~$35.
GPT-4o mini is even cheaper at $0.15 per 1M in.
Of course if your volume justifies the hardware cost you could always run Llama locally, for the cost of the electricity used.
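The arithmetic in the comment above can be checked with a few lines of Python. This is only a sketch: the per-million-token prices are the ones quoted in the comment (not verified against any current price list), one output token per item is implied by "10M tokens out", and `batch_cost` is an illustrative helper name.

```python
def batch_cost(items, tokens_in_per_item, tokens_out_per_item,
               usd_per_m_in, usd_per_m_out):
    """Total USD cost for a batch, given per-million-token prices."""
    tokens_in = items * tokens_in_per_item
    tokens_out = items * tokens_out_per_item
    return tokens_in / 1e6 * usd_per_m_in + tokens_out / 1e6 * usd_per_m_out


# 10M items, 10 input tokens and 1 output token each,
# at the prices quoted in the comment ($0.25 in, $1.05 out per 1M tokens).
haiku_cost = batch_cost(10_000_000, 10, 1, 0.25, 1.05)
print(f"${haiku_cost:.2f}")
```

Real prompts usually carry instructions and few-shot examples, so the input token count per item is the parameter most likely to be underestimated here.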
llmsolutions 8 months ago

You can use embeddings to build classification models using various methods. Not sure what qualifies as "get help" level of cost/throughput, but certainly most providers offer large embedding APIs at much lower cost/higher throughput than their completion APIs.
WhitneyLand 8 months ago

For context, 10M would cost ~$27.
Say Gemini Flash 8B, allowing ~28 tokens for prompt input at $0.075/1M tokens, plus 2 output tokens at $0.30/1M. Works out to $0.0000027 per classification. Or in other words, for 1 penny you could do this classification about 3,700 times.
hulitu 8 months ago

That was also my impression. LLMs can "describe" but not classify. They hallucinate but give nothing precise.
Kuinox 8 months ago

Prompt caching would lower the cost, and later similar tech would lower the inference cost too. You have fewer than 25 tokens, so that's between $1 and $5.
There may be some use case but I'm not convinced with the one you gave.
- minimaxir 8 months ago
  
  So there's a bit of an issue with prompt caching implementations: for both OpenAI API and Claude's API, you need a minimum of 1024 tokens to build the cache for whatever reason. For simple problems, that can be hard to hit and may require padding the system prompt a bit.
crystal_revenge 8 months ago

> LLMs are not feasible when you have a dataset of 10 million items that you need to classify relatively fast and at a reasonable cost.
What? That's simply not true.
Current embedding models are incredibly fast and cheap and will, in the vast majority of NLP tasks, get you far better results than any local set of features you can develop yourself.
I've also done this at work numerous times, and have been working on various NLP tasks for over a decade now. For all future traditional NLP tasks the first pass is going to be to fetch LLM embeddings and stick a fairly simple classification model on top.
> One prompt? Fair. 10? Still ok. 100? You're pushing it. 10M - get help.
"Prompting" is not how you use LLMs for classification tasks. Sure you can build 0-shot classifiers for some tricky tasks, but if you're doing classification for documents today and you're not starting with an embedding model you're missing some easy gains.
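To make "embeddings plus a fairly simple classification model" concrete, here is a minimal sketch of a nearest-centroid classifier in plain Python. The three-dimensional vectors are made-up stand-ins for what an embedding model would return (real embeddings have hundreds or thousands of dimensions), and the function names are illustrative. Nearest-centroid is just one simple choice; logistic regression on the same vectors would be another.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit_centroids(vectors, labels):
    """Average the embedding vectors of each class into one centroid."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(vec, centroids):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Made-up vectors standing in for embeddings of labeled training texts.
TRAIN_VECTORS = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1],
                 [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
TRAIN_LABELS = ["product", "product", "service", "service"]
centroids = fit_centroids(TRAIN_VECTORS, TRAIN_LABELS)
print(predict([0.7, 0.3, 0.0], centroids))
```

The expensive step (computing embeddings) happens once per item; classification afterwards is a handful of dot products, which is why this scales to millions of items.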
- anon373839 8 months ago
  
  Embedding models are not LLMs in the sense that the term is being used in the title of this post. They are “traditional NLP.”
fud101 8 months ago

Can you recommend a way to classify a small number of objects? Local only and Python preferably.
diggan 8 months ago

So TLDR: You agree with the author, but not for the same reasons?

scarface_74 8 months ago

For my use case, definitely.

I have worked on AWS Connect (online call center) and Amazon Lex (the backing NLP engine) projects.

Before LLMs, it was a tedious process of trying to figure out all of the different “utterances” that people could say and the various languages you had to support. With LLMs, it’s just prompting.

https://chatgpt.com/share/678bab08-f3a0-8010-82e0-32cff9c0b4...

I used something like this with Amazon Bedrock and a Lambda hook for Amazon Lex. Of course it wasn’t booking a flight; it was another system.

The above is a simplified version. In the real world, I gave it a list of intents (book flights, reserve a room, rent a car) and the properties (“slots”) I needed for each intent.

elicksaur 8 months ago

Thank you for sharing an actual prompt thread. So much of the LLM debate is washed in biases, and it is very helpful to share concrete examples of outputs.
- scarface_74 8 months ago
  
  The “cordele GA” example surprised me. I was expecting to get a value of “null” for the airport code since I knew that city had a population of 12K and no airport within its metropolitan statistical area. It returned an airport that was close.
  Having world knowledge is a godsend. I also just tried a prompt with “Alpharetta, GA”, a city north of Atlanta, and it returned ATL. A traditional NLP system could never do that without a lot more work.
LeafItAlone 8 months ago

That’s a great example and I understand it was intentionally simple but highlighted how LLMs need care with use. Not that this example is very related to NLP:
My prompt: `<<I want a flight from portland to cuba after easter>>`
The response: ``` { "origin": ["PDX"], "destination": ["HAV"], "date": "2025-04-01", "departure_time": null, "preferences": null } ```
Of course I meant Portland, Maine (PWM), there is more than one airport option in Cuba besides HAV, and it got the date wrong, since Easter is April 20 this year.
- scarface_74 8 months ago
  
  If the business stakeholders came out with that scenario, I would modify the prompt like this. You would know the user's address if they had an account.
  https://chatgpt.com/share/678c1708-639c-8010-a6be-9ce1055703...
  - LeafItAlone 8 months ago
    
    OK, but that only fixed one of the three issues.
    
    scarface_74 8 months ago
    
    The first one is easy: you could give it a list of holidays and dates. For the rest, you would just ask the user to confirm the information and say “is this correct?” If they say “No”, ask them which part isn’t correct and let them correct it.
    I would definitely assume someone wanted to leave from an airport close by if they didn’t say anything.
    You don’t want the prompt to grow too much. But you do have analytics that you can use to improve your prompt.
    In the case of Connect, you define your logic using a GUI flowchart builder called a contact flow.
    BTW: with my new prompt, it did assume the correct airport “<<I want to go to Cuba after Easter>>”
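For the Easter case specifically, the "list of holidays and dates" can be computed instead of maintained by hand. Below is a minimal sketch using the anonymous Gregorian computus (often credited to Meeus, Jones and Butcher), assuming Western (Gregorian) Easter is what the user means. The function names are illustrative, and treating "after Easter" as "the day after Easter Sunday at the earliest" is an assumption. The resulting date could be passed to the model in the prompt, or used to check the date it returns.

```python
from datetime import date, timedelta


def easter_sunday(year):
    """Gregorian Easter Sunday via the anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def earliest_departure_after_easter(year):
    # Assumption: "after Easter" means no earlier than the following day.
    return easter_sunday(year) + timedelta(days=1)


# For 2025 this gives April 20, matching the date mentioned in the thread.
print(easter_sunday(2025), earliest_departure_after_easter(2025))
```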
    
    LeafItAlone 8 months ago
    
    Sure, all the problems are “easy” once you identify them. As with most products. But the majority of Show HN posts here relying on LLMs that I see don’t account for simple things like my example. Flight finders in particular have been pretty bad.
    >BTW: with my new prompt, it did assume the correct airport “<<I want to go to Cuba after Easter>>”
    Not really. It chose the airport you put basically in the prompt. But I don’t live in MA, I live closer to PDX. And it didn’t suggest the multiple other Cuba airports. So you’ll end up with a lot of guiding rules.
    
    scarface_74 8 months ago
    
    If you said “Portland”, a human would first assume you meant PDX, unless they looked up your address, in which case they would assume Maine.
    Just like if I said I wanted to fly to Albany, they would think I meant New York and not my parents’ city in south GA (ABY), which only has three commercial flights a day.
    Even with a human agent, you ask for confirmation.
    Also, I ask people on the ground (in this case it would be CSRs) to try to break it.
    That’s another reason I think “side projects” are useless and they don’t have any merit on resumes. I want them to talk about real world implementations.
gtirloni 8 months ago

How about the costs?
- scarface_74 8 months ago
  
  We measure savings in terms of call deflections. Clients we work with say that each time a customer talks to an agent it costs $2-$5. That’s not even taking into account call abandonments
  - IanCal 8 months ago
    
    My base thing while advising people is that if anyone you pay needs to read the output, or you are directly replacing any kind of work, then even frontier LLM inference costs are irrelevant. Of course you need to work out if that's truly the case, but people worry about the cost in places where it's just irrelevant. If it's $2 when you get to an agent, each case that's avoided there could pay for around a million words read/generated. That's expensive compared to most API calls but irrelevant when counting human costs.
vpribish 8 months ago

link is a 404, sadly. what did it say before?
- scarface_74 8 months ago
  
  The link works for me even in incognito mode.
  The prompt:
  you are a chatbot that helps users book flights. Please extract the origin city, destination city, travel date, and any additional preferences (e.g., time of day, class of service). If any of the details are missing, make the value “null”. If the date is relative (e.g., "tomorrow", "next week"), convert it to a specific date.
  User Input: "<User's Query>"
  Output (JSON format): { "origin": list of airport codes "destination": list of airport codes, "date": "<Extracted Date>", "departure_time": "<Extracted Departure Time (if applicable)>", "preferences": "<Any additional preferences like class of service (optional)>" }
  The users request will be surrounded by <<>>
  Always return JSON with any missing properties having a value of null. Always return English. Return a list of airport codes for the city. For instance New York has two airports give both
  Always return responses in English
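A prompt like the one above still needs code around it to wrap the user's text and to check what comes back. The following is a minimal sketch of that glue in Python, with no model call included. The key names come from the prompt; the helper names (`wrap_request`, `parse_response`) and the rule that airport codes are three uppercase letters are assumptions made for the demo.

```python
import json

# Keys the prompt asks the model to return.
EXPECTED_KEYS = ("origin", "destination", "date",
                 "departure_time", "preferences")


def wrap_request(user_text):
    # The prompt states that the user's request is surrounded by << >>.
    return f"<<{user_text}>>"


def parse_response(raw):
    """Parse the model's JSON reply, filling missing keys with None."""
    data = json.loads(raw)
    result = {key: data.get(key) for key in EXPECTED_KEYS}
    for key in ("origin", "destination"):
        codes = result[key]
        if codes is None:
            continue
        if isinstance(codes, str):
            codes = [codes]  # tolerate a single code instead of a list
        # Assumption: keep only strings that look like 3-letter airport codes.
        result[key] = [c for c in codes
                       if isinstance(c, str) and len(c) == 3
                       and c.isalpha() and c.isupper()]
    return result


# A reply shaped like the one quoted earlier in the thread, with two keys missing.
SAMPLE = '{"origin": ["PDX"], "destination": ["HAV"], "date": "2025-04-01"}'
parsed = parse_response(SAMPLE)
print(wrap_request("I want a flight from portland to cuba after easter"))
print(parsed)
```

Validating the reply before acting on it is also the natural place to trigger the "is this correct?" confirmation step discussed in the thread.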

DebtDeflation 8 months ago

The question seems malformed to me.

Text classification, clustering, named entity recognition, etc. are NLP tasks. LLMs can perform these tasks. ML models that are not LLMs (or even not deep learning models) can also perform these tasks. Is the author perhaps asking if the concept of a "completion" has replaced all of these tasks?

When I hear "traditional NLP" I think not of the above types of tasks but rather the methodology employed for performing them. For example, building a pipeline to do stemming/lemmatization, part of speech tagging, coreference resolution, etc. before the text gets fed to a classifier model. This was SOTA 10 years ago but I don't think many people are still doing it today.

lyu07282 8 months ago

I understand the term the same way, but I don't think "traditional NLP" died with LLMs, traditional NLP already died with deep learning 7ish years ago, with LSTMs and CNNs, the gains from such models trained end-to-end are so huge it just doesn't make sense to chop up the tasks like that.

thangalin 8 months ago

I created an NLP library to help curl straight quotes into curly quotes. Last I checked, LLMs struggled to curl the following straight quotation marks:

    ''E's got a 'ittle box 'n a big 'un,' she said, 'wit' th' 'ittle 'un 'bout 2'×6". An' no, y'ain't cryin' on th' "soap box" to me no mo, y'hear. 'Cause it 'tweren't ever a spec o' fun!' I says to my frien'.

The library is integrated into my Markdown editor, KeenWrite (https://keenwrite.com/), to correctly curl quotation marks into entities before passing them over to ConTeXt for typesetting. While there are other ways to indicate opening and closing quotation marks, none are as natural to type in plain text as straight quotes. I would not trust an LLM to curl quotation marks accurately.

For the curious, you can try it at:

https://whitemagicsoftware.com/keenquotes/

If you find any edge cases that don't work, do let me know. The library correctly curls my entire novel. There are a few edge cases that are completely ambiguous, however, that require semantic knowledge (part-of-speech tagging), which I haven't added. PoS tagging would be a heavy operation that could prevent real-time quote curling for little practical gain.

The lexer, parser, and test cases are all open source.

https://gitlab.com/DaveJarvis/KeenQuotes/-/tree/main/src/mai...

jcheng 8 months ago

Great example. I just tried it with a few LLMs and got horrible results. GPT-4o got a ton of them wrong, o1 got them all correct AFAICT but took 1m50s to do so, and Claude 3.5 Sonnet said “Here's the text with straight quotes converted to curly quotes” but then returned the text with all the straight quotes intact.
I’m very surprised all three models didn’t nail it immediately.
gf000 8 months ago

I would be interested in how well even a smaller LLM would work after fine-tuning. Apart from the overhead of an LLM, I would assume it would do a much better job in the edge cases (where contextual understanding is required).

darepublic 8 months ago

I remember using the open NLP library from Stanford around 2016. It would do part-of-speech tagging of words in a sentence (labelling the words with their grammatical function). It was pretty good but reliably failed on certain words where context determined the tag. When GPT-3 came out, the first thing I tested it on was part-of-speech tagging, in particular those sentences the NLP library had trouble with. It aced everything; I was impressed.

RancheroBeans 8 months ago

NLP is an important part of upcoming RAG frameworks like Microsoft’s LazyGraphRAG. So I think it’s more like NLP is a tool used when the time is right.

https://www.microsoft.com/en-us/research/blog/lazygraphrag-s...

politelemon 8 months ago

I could use some help understanding: is this a set of tools or techniques to answer questions? The name made me think it's related to creating embeddings, but it seems to be much more?

derbaum 8 months ago

One of the things I'm still struggling with when using LLMs over NLP is classification against a large corpus of data. If I get a new text and I want to find the most similar text out of a million others, semantically speaking, how would I do this with an LLM? Apart from choosing certain pre-defined categories (such as "friendly", "political", ...) and then letting the LLM rate each text on each category, I can't see a simple solution yet except using embeddings (which I think could just be done using BERT and does not count as LLM usage?).

macNchz 8 months ago

I've used embeddings to define clusters, then passed sampled documents from each cluster to an LLM to create labels for each grouping. I had pretty impressive results from this approach when creating category/subcategory labels for a collection of texts I worked on recently.
- derbaum 8 months ago
  
  That's interesting, it sounds a bit like those cluster graph visualisation techniques. Unfortunately, my texts seem to fall into clusters that really don't match the ones that I had hoped to get out of these methods. I guess it's just a matter of fine-tuning now.
thaumasiotes 8 months ago

Take two documents.
Feed one through an LLM, one word at a time, and keep track of words that experience greatly inflated probabilities of occurrence, compared to baseline English. "For" is probably going to maintain a level of likelihood close to baseline. "Engine" is not.
Do the same thing for the other one.
See how much overlap you get.
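The comparison step of this idea can be sketched without a model. In the sketch below, the per-word probabilities that the LLM would assign after reading each document are replaced by made-up numbers, as is the baseline English distribution; obtaining real values would require a model that exposes token probabilities. The function names, the ratio threshold of 5, and the use of Jaccard overlap are all assumptions, since the comment does not specify how "greatly inflated" or "overlap" should be measured.

```python
def inflated_terms(model_probs, baseline_probs, min_ratio=5.0, floor=1e-6):
    """Words whose model probability is at least min_ratio times baseline."""
    return {
        word for word, p in model_probs.items()
        if p / max(baseline_probs.get(word, 0.0), floor) >= min_ratio
    }


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up numbers: baseline English frequencies and the probabilities
# a model might assign after reading each of two documents.
BASELINE = {"for": 0.01, "engine": 0.0001, "car": 0.0002,
            "piston": 0.00005, "recipe": 0.0001}
DOC_A = {"for": 0.011, "engine": 0.004, "car": 0.003, "piston": 0.001}
DOC_B = {"for": 0.009, "engine": 0.002, "car": 0.004, "recipe": 0.0001}

terms_a = inflated_terms(DOC_A, BASELINE)
terms_b = inflated_terms(DOC_B, BASELINE)
print(sorted(terms_a), sorted(terms_b), jaccard(terms_a, terms_b))
```

As in the comment's example, "for" stays near baseline in both documents and is ignored, while "engine" is strongly inflated in both and counts toward the overlap.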
- derbaum 8 months ago
  
  Wouldn't a simple comparison of the word frequency in my text against a list of usual word frequencies do the trick here without an LLM? Sort of a BM25?
  - thaumasiotes 8 months ago
    
    It might; it's not going to do the same thing. The LLM will tell you words that would likely appear in a similar text. Word frequency will tell you words that have actually appeared in your text. I'm postulating that the first kind of list is much more likely to show strong overlap between two similar documents than the second kind of list.
    Vocabulary style matters a lot to what words are actually used, but much less to what words are likely to be used. If I'm following a style guide that says to use "automobile" instead of "car", appearance probabilities for "automobile" will be greatly inflated. And appearance probabilities for "car" will also be greatly inflated, just to a lesser extent than for "automobile". Whereas actual usage of "car" will be pegged at zero.
    Determining how similar two texts are is something that an LLM should be good at. It should be better than a simple comparison of word frequency. Whether it's better enough to justify the extra compute is a different question.

michaelsbradley 8 months ago

https://archive.is/J53CE

freefaler 8 months ago

If archive links aren't working this works:

https://freedium.cfd/https://medium.com/altitudehq/is-tradit...

vletal 8 months ago

The idea that we can solve "language" by breaking down and understanding sentences is naive and funny with the benefit of hindsight, is it not?

An equivalently funny attitude seems to be the "natural language will replace programming languages". Let's see how that one will work out when the hype is over.

Vampiero 8 months ago

It's not naive, it's how languages work.
That grammar doesn't necessarily convey 100% of semantics is a problem of natural language. Or rather, of people being poor at communicating unambiguously.
Programming languages can also be ambiguous sometimes, but that ambiguity is resolved before execution by essentially following a priority list or throwing an error if no combination of rules fits.
Pompidou 8 months ago

They all fall into the pitfall that Martin Heidegger suggested avoiding, particularly in the introduction of his text What Is a Thing?.

itissid 8 months ago

LLM Design/Use has only about as much to do with engineering as building a plane has to do with actually flying it.

Every business is kind of a unicorn in its problems; NLP is a small part of it. Even if an LLM did perform cheaply enough to do NLP, how would you replace parts like: 1. an evaluation system that uses calibration (human labels), 2. ground truth collection (human, sometimes semi-automated), 3. QA testing by end users?

Even if LLMs made it easier to do NLP, your NLP process is so entangled with the items above that you still need an engineer. If you have an engineer who does only NLP and nothing else, you are quite hyper-specialized, to the extent that you are only building planes: 0.01% of the engineering work out there.

arcknighttech 8 months ago

They also have their own uses depending on the use case, so it depends on what someone is working with. NLP in some cases, like advanced marketing, can still have great benefit without an LLM, and AIs in general are streamlining certain "speed of information processing" tasks but still struggle with complex "systems thinking".

FYI - If anyone doesn't know the difference between the two or has no idea what NLP or an LLM is, this has a good breakdown: https://medium.com/@melindaboone80722/nlp-vs-llm-b339abdc651...

ein0p 8 months ago

You have to be more specific with that question. "Traditional" NLP has by now been killed twice: first by classical machine learning (which significantly reduced the need for linguists), and now by deep learning (which has all but eliminated it). So "traditional" NLP was killed back in the late 00s. You can't kill that which is not alive, so it follows that LLMs have not, in fact, killed traditional NLP.

antonvs 8 months ago

One datapoint: we were using NLP to translate natural language instructions into an executable form that could drive our product. It was part of a product we sold to enterprises.

We've completely replaced that with LLMs. We still use our own DNNs for certain tasks, but not for NLP.

oliwary 8 months ago

This article seems to be paywalled unfortunately. While LLMs are very useful when the tasks are complex and/or there is not a lot of training data, I still think traditional NLP pipelines have a very important role to play, including when:

- Depending on the complexity of the task and the required results, SVMs or BERT can be enough in many cases and take much lower resources, especially if there is a lot of training data available. Training these models with LLM outputs could also be an interesting approach to achieve this.

- When resources are constrained or latency is important.

- In some cases, there may be labeled data in certain classes that have no semantic connection between them, so explaining the classes to an LLM could be tricky.

99catmaster 8 months ago

https://archive.is/J53CE
eminent101 8 months ago

> This article seems to be paywalled unfortunately.
I am no fan of Medium paywalled articles but if it helps you, here's the article on archive - https://archive.is/J53CE

vedant 8 months ago

The title of this article feels like "has electricity killed oil lamps"?

retinaros 8 months ago

Can't read the article. Do they consider BERT an LLM? There are still tasks in NLP where BERT is better than a GPT-like model.

selimthegrim 8 months ago

Like?
- Olfurm 8 months ago
  
  Like named entity recognition or relations recognition. Check https://github.com/urchade/GLiNER
  - fbilhaut 8 months ago
    
    Indeed. What's interesting (and not so common) with models like GLiNER is that they are way lighter than LLMs while preserving good (if not better) quality and some zero-shot ability (in this case with respect to the entity classes). This feature is very significant, as most "traditional" (including Transformer-based) approaches are more or less supervised during fine-tuning. Icing on the cake, you don't have to deal with all the problems that arise when you want to get structured output from an LLM (in this case structured NER output).

leobg 8 months ago

There are AI bros that will call an LLM to do what you could do with a regex. I’ve seen people do the chunking for RAG using an LLM…

tossandthrow 8 months ago

If you think about chunking as "take x characters" then using LLMs is a poor idea.
But syntactic chunking also works really poorly for any serious application, as you lose basically all context.
Semantic chunking, however, is a task you absolutely would use LLMs for.
- leobg 8 months ago
  
  If by LLM you mean embeddings I agree. Though you can often get away with using much smaller models for that.
  I was talking about people who actually make a call to a completion endpoint and then have the LLM repeat the input text token for token just to get the split.
  - tossandthrow 8 months ago
    
    How do you do semantic chunking using embeddings?
    And yes, I know perfectly well what you are talking about. And yes, that is a perfect strategy to chunk large texts so you can index them.
    It does not sound like you are familiar with chunking and its current issues?
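One common answer to the question of how embeddings can drive semantic chunking is to embed each sentence and start a new chunk wherever the similarity between neighbouring sentences drops. The sketch below illustrates that idea only. `toy_embed` is a bag-of-words stand-in so the example runs without a model; a real embedding model would be swapped in (with `cosine` adapted to its vector type), and the sentence splitter and the 0.25 threshold are arbitrary assumptions.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text, embed=toy_embed, threshold=0.25):
    """Group consecutive sentences; break where similarity falls below threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sentence in sentences[1:]:
        cur = embed(sentence)
        if cosine(prev, cur) < threshold:
            chunks.append([])  # topic shift: start a new chunk
        chunks[-1].append(sentence)
        prev = cur
    return [" ".join(chunk) for chunk in chunks]


TEXT = ("The engine needs new oil. Change the engine oil every year. "
        "Bake the bread for an hour. Let the bread cool before slicing.")
chunks = semantic_chunks(TEXT)
for chunk in chunks:
    print(chunk)
```

This compares only adjacent sentences, so it never asks a model to repeat the input text; whether it preserves enough context for a given application is exactly the disagreement in the thread.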
