It also made the web a less than ideal source for training. And yet LLMs were still fed articles written for Googlebot, not humans. ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.
Blog spam was generally written by humans. While it sucked for other reasons, it seemed fine for measuring basic word frequencies in human-written text. The frequencies are probably biased in some ways, but this is true for most text. A textbook on carburetor maintenance is going to have the word "carburetor" at way above the baseline. As long as you have a healthy mix of varied books, news articles, and blogs, you're fine.
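That kind of baseline measurement can be sketched in a few lines of Python (a toy illustration only; the wordfreq project's actual pipeline is far more careful about tokenization and corpus balance):

```python
from collections import Counter
import re

def word_frequencies(text):
    """Relative word frequencies in a blob of text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# A specialist text: "carburetor" sits far above its everyday baseline
freqs = word_frequencies(
    "the carburetor needs cleaning because the carburetor is old"
)
print(freqs["carburetor"])  # 2 of 9 words
```

Averaged over a healthy mix of books, news, and blogs, the domain-specific spikes wash out and you get something close to the true baseline.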
In contrast, LLM content is just a serpent eating its own tail - you're trying to build a statistical model of word distribution off the output of a (more sophisticated) model of word distribution.
SEO text carefully tuned to tf-idf metrics and keyword-stuffed right up to the empirically determined threshold Google still allows should have unnatural word frequencies.
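As a rough illustration of why stuffing shows up in the numbers, here's a toy tf-idf score (one common formulation, with add-one smoothing on the document frequency; the example documents and the exact formula are assumptions for illustration, not anything from a real SEO tool):

```python
import math

def tf_idf(term, doc, corpus):
    """Toy tf-idf: term frequency in one doc times a smoothed
    inverse document frequency over the corpus."""
    words = doc.lower().split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in corpus if term in d.lower().split())
    idf = math.log(len(corpus) / (1 + df))
    return tf * idf

docs = [
    "best trail shoes review shoes cheap shoes deal shoes",  # keyword-stuffed
    "i went for a run in the park yesterday",
    "my new shoes arrived in the mail today",
    "the weather was lovely for a walk",
]
stuffed = tf_idf("shoes", docs[0], docs)
normal = tf_idf("shoes", docs[2], docs)
```

A page tuned to maximize this score ends up repeating its keyword at rates no human writer would produce, which is exactly the unnatural frequency signature described above.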
LLM content should just enhance and cement the status quo word frequencies.
Outliers like the word "delve" could just be sentinels, carefully placed like trap streets on a map.
So it's classic positive feedback. LLM uses delve more, delve appears in training data more, LLM uses delve more...
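That loop can be sketched as a toy simulation (all numbers made up for illustration): a model that over-uses a word by some factor, whose output is blended back into the next training corpus, pushes the word's frequency above its human baseline generation after generation.

```python
def feedback_generations(p_human, boost, mix, generations):
    """Track one word's frequency across training generations.

    p_human: the word's true frequency in human-written text
    boost:   how much the model over-uses it (1.5 = 50% too often)
    mix:     fraction of each new training corpus that is model output
    """
    p = p_human
    history = [p]
    for _ in range(generations):
        p_model = min(1.0, p * boost)             # model over-uses the word
        p = (1 - mix) * p_human + mix * p_model   # blend human + model text
        history.append(p)
    return history

# "delve"-style drift: a rare word, modest over-use, half-synthetic corpora
history = feedback_generations(p_human=0.0001, boost=1.5, mix=0.5,
                               generations=10)
```

With a fixed human/model mix the frequency settles at an elevated fixed point; if the model's share of the corpus keeps growing, the drift keeps compounding.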
Who knows what other semantic quirks are being amplified like this. It could be something much more subtle, like cadence or sentence structure. I already notice that GPT has a "tone" and Claude has a "tone" and they're all sort of "GPT-like." I've read comments online that stop and make me question whether they're coming from a bot, just because their word choice and structure echoes GPT. It will sink into human writing too, since everyone is learning in high school and college that the way you write is by asking GPT for a first draft and then tweaking it (or not).
Unfortunately, I think human and machine generated text are entirely miscible. There is no "baseline" outside the machines, other than from pre-2022 text. Like pre-atomic steel.
Far from saying the pool of language is now polluted, I think we now have a great data set to begin to discern authentic from inauthentic human language. Although sure, people on the fringes could get caught in a false positive for being bots, like you or me.
The biggest LLM of them all is the daily driver of all new linguistic innovation: Human society, in all its daily interactions. The quintillions of daily phrases exchanged and forever mutating around the globe - each mutation of phrase interacting with its interlocutor, and each drawing not from the last 500,000 tokens but from the entire multi-modal, if you will, experience of each human to date in their entire lives - vastly eclipse anything any hardware could ever emulate given the current energy constraints. Software LLMs are just a state machine stuck in a moment in time. At best they will always lag, the way Stalinist language lagged years behind the patois of average Russians, who invented daily linguistic dodges to subvert and mock the regime. The same process takes place anywhere there is a dominant official or uncool accent or phrasing. The ghetto invents new words, new rhythm, and then it becomes cool in the middle class. The authorities never catch up, precisely because the use of subversive language is humanity's immune system against authority.
If there is one distinctly human trait, it's sniffing out anyone who sounds suspiciously inauthentic. (Sadly, it's also the trait that leads to every kind of conspiracy theorizing imaginable; but this too probably confers in some cases an evolutionary advantage). Sniffing out the sound of a few LLMs is already happening, and will accelerate geometrically, much faster than new models can be trained.
Some day we may view this as the beginnings of machine culture.
Have you ever seen someone use their smartphone? They're not "here," they are "there." Forming themselves in cyberspace -- or being formed, by the machine.
2. Given how LLMs work, a prompt is a bias — they're one and the same. You can't ask an LLM to write you a mystery novel without it somewhat adopting the writing quirks common to the particular mystery novels it has "read." Even the writing style you use in your prompt influences this bias. (It's common advice among "AI character" chatbot authors to write the "character card" describing a character in the style you want the character to speak in, for exactly this reason.) Whatever prompt the developer uses is going to bias the bot away from the statistical norm, toward the writing-style elements that exist within whatever hypersphere of association-space contains plausible completions of the prompt.
3. Bot authors do SEO too! They take the tf-idf metrics and keyword stuffing, and turn it into training data to fine-tune models, in effect creating "automated SEO experts" that write in the SEO-compatible style by default. (And in so doing, they introduce unintentional further bias, given that the SEO-optimized training dataset likely is not an otherwise-perfect representative sampling of writing style for the target language.)
TFA mentions this hasn't been the case.
It’s the same origin. On Slashdot (the HN of the early 00’s) people would admonish others to RTFA. Then they started using it as a referent: TFA was the thing you were supposed to have read.
At least in Googles case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight. They're not even filtering the comically low-hanging fruit like those YouTube channels which post a new "product review" every 10 minutes, with an AI generated thumbnail and AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet, and is of course always a glowing recommendation since the point is to get the viewer to click an affiliate link.
Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?
Google has been _monetizing_ the SEO game forever. They chose not to act against many notorious actors because the metric they optimize for is ad revenue, and those sites were loaded with ads. As long as advertisers didn't stop buying, they didn't feel much pressure to make big changes.
A smaller company without that inherent conflict of interest in its business model can do better because they work on a fundamentally different problem.
The problem is that, of the signals you mention,
• the highly-informative ones (posting a new review every 10 minutes, having affiliate links in the description) are contextual — i.e. they're heuristics that only work on a site-specific basis. If the point is to create a training pipeline that consumes "every video on the Internet" while automatically rejecting the videos that are botspam, then contextual heuristics of this sort won't scale. (And Google "doesn't do things that don't scale.")
• and, conversely, the context-free signals you mention (thumbnail looks AI-generated, voice is synthesized) aren't actually highly correlated with the script being LLM-barf rather than something a human wrote.
Why? One of the primary causes is TikTok (because TikTok content gets cross-posted to YouTube a lot.) TikTok has a built-in voiceover tool; and many people don't like their voice, or don't have a good microphone, or can't speak fluent/unaccented English, or whatever else — so they choose to sit there typing out a script on their phone, and then have the AI read the script, rather than reading the script themselves.
And then, when these videos get cross-posted, usually they're being cross-posted in some kind of compilation, through some tool that picks an AI-generated thumbnail for the compilation.
Yet, all the content in these is real stuff that humans wrote, and so not something Google would want to throw away! (And in fact, such content is frequently a uniquely-good example of the "gen-alpha vernacular writing style", which otherwise doesn't often appear in the corpus due to people of that age not doing much writing in public-web-scrapeable places. So Google really wants to sample it.)
I’m guessing that the association between “pagers” and “Hezbollah” ended up creating the latter two tabs, but who knows. Maybe some AI video out there did a product review of Hezbollah.
I've noticed that lately. It used to be the top google result was almost always what you needed. Now at the top is an AI summary that is pretty consistently wrong, often in ways that aren't immediately obvious if you aren't familiar with the topic.
A useful Search would ideally send a user to the site with the most signal and the least noise. Meanwhile, ads are inherently noise; they're extra pieces of information inserted into a webpage that at best tangentially correlate to the subject of a page.
Up until ~5 years ago, Google was able to strike a balance between the two; you'd get results with some ads, but the signal generally outweighed the noise. Unfortunately, from what I can tell from anecdotes and courtroom documents, somewhere in 2018-2019 the Ads team at Google essentially hijacked every other part of the company, threatening that yearly bonuses wouldn't be paid unless everyone kowtowed to the Ads team's wish to optimize ad revenue, and there's been no sign of it stopping since there's no effective competition to Google. (There's like, Bing and Kagi? Nobody uses Bing though, and Kagi is only used by tech enthusiasts. The problem with copying Google is that you need a ton of computing resources upfront, and you're going up against a company with infinitely more money and the ability to ensure users don't leave its ecosystem; go ahead and abandon Search, but good luck convincing others to give up, say, their Gmail account, which keeps them locked to Google, and Search will be there, enticing the average user.)
Google has absolutely zero incentive to filter generative AI junk out of its search results beyond the amount that's damaging its PR, since most of the SEO spam is also running Google Ads (unless you're hosting adult content, Google's ad network is practically the only option). Their solution therefore isn't to remove the AI junk, but to reduce it just enough that a user won't get the same type of AI junk twice.
A search engine isn't a two-sided market in itself but the ad network that supports it is. A better search engine is a technological problem, but a decently paying ad network is a technological problem and a hard marketing problem.
Is it of your own coinage? When the AI sifts through the digital wreckage of the brief human empire, they may give you the credit.
Increasingly I find that for in-depth explanations or tutorials Youtube is the only place to go, but even there the search results can lead to loads of videos which just seem… off. But at least those are still made by humans.
It seems to me that the fact it's so cheap and relatively easy for people with dreams of becoming wealthy influencers to put stuff out there has more to do with the flood of often mediocre content than AI does.
Of course the vast majority don't have much real success and get on with life and the crank turns and a new generation perpetuates the cycle.
LLMs etc. may make things marginally easier but there's no shortage of twenty somethings with lots of time imagining riches while making pennies.
It sucks, because sharing recipes seemed like one of those things the internet could be really good at.
I read Spanish and Italian fluently and stumble my way through Japanese (with translation). It's easier to find a good recipe in these languages, provided you can find the ingredients or substitutes.
The only remaining reliable source - now that many newspapers are axing the remaining staff in favour of LLMs - is pre-2020 print cookbooks. Anything online or printed later must be assumed to be tainted, full of untested sewage and potentially dangerous suggestions.
I joke with folks that my assumption with "one clove of garlic" is that they really mean "one head of garlic" if you want any flavour. (And if the recipe title has "garlic" in it and you are using one clove, you’re lying.)
When I saw them, they blew my mind. Short to store and easy to understand.
That is, when baking, you can usually (again, exceptions for creaming the sugar in butter, etc.) take all of your dry ingredients and mix/sift them together, and then you pour your wet ingredients in a well you’ve made in the dry ingredients (these can also usually be mixed together).
But for shortbread or fork biscuits those three could indeed all go in the bowl in one go (but that one admittedly doesn't really need a bracket because the recipe is "put in bowl, mix with hands, bake").
The top of search results is consistently crowded by pages that obviously game ranking metrics instead of offering any value to humans.
For a while I thought email as a medium was doomed, but spammers mostly lost that arms race. One interesting difference is that with spam, the large tech companies were basically all fighting against it. But here, many of the large tech companies are either providing tools to spammers (LLMs) or actively encouraging spammy behaviors (by integrating LLMs in ways that encourage people to send out text that they didn't write).
There's probably an analogy to be made about the open decentralised internet in the age of AI here, if it gets to the point that search engines have to assume all sites are spam by default until proven otherwise, much like how an email server is assumed guilty until proven innocent.
However, the frontline of the email war has shifted lately. Now the most important part of the war is being fought over emails that look just like ham, but aren't. Business frauds where someone convinces you that they are the CEO or CFO or some VP and they need you to urgently buy this or that for them right now no time to talk is big business right now, and before you get too high-and-mighty about how immune you are to that, they are now extremely good at looking official. This war has not been won yet, and to a large degree, isn't something you necessarily win by AI either.
I think there's an analogy here to the war on content slop. Since what the content-slop producers want is just for you to see it so they can serve you ads, it doesn't need anything else that our algorithms could trip on, like links to malware or calls to action to be defrauded. It looks just like the real stuff, and telling that it isn't could require rather vast amounts of input from a human just to be mostly sure. And we don't have the ability to authenticate where it came from. (There is no content authentication solution that will work at scale. No matter how you try to get humans to "sign their work," people will always work out how to automate it, and then it's done.) So the one good and solid signal that helps in email is gone for general web content.
I don't judge this as a winning scenario for the defenders here. It's not a total victory for the attackers either, but I'd hesitate to even call an advantage for one side or the other. Fighting AI slop is not going to be easy.
I'm not saying this is impossible but that's going to be an uphill sell for me as a concept. According to some quick stats I checked I'm getting roughly 600 emails per day, about 550 of which go directly to spam filtering, and of the remaining 50, I'd say about 6 are actually emails I want to be receiving. That's an impressive amount overall for whoever built this particular filter, but it's also still a ton of chaff to sort wheat from and as a result I don't use email much for anything apart from when I have to.
Like, I guess that's technically usable, I'm much happier filtering 44 emails than 594 emails? But that's like saying I solved the problem of a flat tire by installing a wooden cart wheel.
It's also worth noting that if I do have an email that's flagged as spam when it shouldn't be, I then have to wade through a much deeper pond of shit to find it. So again, better, but IMO not even remotely solved.
Looking at the last days’ spam¹ I have three 419-style scams (widows wanting to give away their dead husbands’ grand piano or multi-million euro estate) and three phishing attempts. There are duplicate messages in each category.
About fifteen years ago, I did a purge of mailing list subscriptions and there’s very little that comes in that I don’t want, most notably a writer who’s a nice guy, but who interpreted my question about a comment he made on a podcast as an invitation to be added to his manually managed email list and given that it’s only four or five messages a year, I guess I can live with that.
⸻
1. I cleaned out spam yesterday while checking for a confirmation message from a purchase.
Based on the process above, naturally, the third iteration then is LLMs writing for corporate bots, neither for humans nor for other LLMs.
I don't see how Google's SEO rules being written or unwritten has any bearing. Spammers will always find a way.
How do we know what content LLMs were fed? Isn't that a highly guarded secret?
Won't the quality of the training content be paramount for the quality of the generated output, or does it not work that way?
2017: Invention of transformer architecture
June 2018: GPT-1
February 2019: GPT-2
June 2020: GPT-3
March 2022: GPT-3.5
November 2022: ChatGPT
You may want to add kiwix archives from before whatever date you choose. You can find them on the Internet Archive, and they're available for Wikipedia, Stack Overflow, Wikisource, Wikibooks, and various other wikis.

The analogy is that data is now contaminated with AI like steel is now contaminated with nuclear fallout.
https://en.wikipedia.org/wiki/Low-background_steel
>Low-background steel, also known as pre-war steel[1] and pre-atomic steel,[2] is any steel produced prior to the detonation of the first nuclear bombs in the 1940s and 1950s. Typically sourced from ships (either as part of regular scrapping or shipwrecks) and other steel artifacts of this era, it is often used for modern particle detectors because more modern steel is contaminated with traces of nuclear fallout.[3][4]
From the same wiki you linked:
"Since the end of atmospheric nuclear testing, background radiation has decreased to very near natural levels, making special low-background steel no longer necessary for most radiation-sensitive uses, as brand-new steel now has a low enough radioactive signature"
and
"For the most demanding items even low-background steel can be too radioactive and other materials like high-purity copper may be used"
Another solution, and one I'd pursue if I weren't so lazy, is ocean-based carbon binding. You can run electricity directly through ocean water and precipitate the carbon out as calcium carbonate, which is both useful to humans as-is and after processing, and useful to the coral reefs and crustaceans/mollusks or whatever in the oceans.
If anyone wants to kick me about a million US dollars, I can make a POC on a used barge with solar panels and as much recycled material as possible, and have it just run off the coast of Florida or something. I figure the total cost to get a barge is around a quarter million, all-in[1]; the electronics and seawater gear are about another $150-200 thousand, and the rest is mine for the idea and the lawyers' to get this approved and left alone to do the research.
[0] burning it for heat is fine, as the net CO2 levels will remain constant, but i mean things like houses and boardwalks and boats, furniture, and so on.
[1] Could be more now; the last time I was researching seaworthy barge costs it was between $100,000 and $200,000. I'm hoping someone can donate the barge so I can make the rest more fit for purpose - redundancy, better solar, better MPPT, better batteries, better materials for the electrodes (it takes platinum and titanium IIRC; I haven't looked at my documents in a long while).
For applications that need to avoid the background radiation (like physics research), pre atomic age steel is extracted, like from old shipwrecks.
> Low Background Steel (and lead) is a type of metal uncontaminated by radioactive isotopes from nuclear testing. That steel and lead is usually recovered from ships that sunk before the Trinity Test in 1945.
This is a desired quality, increasingly less present in IT work environments. People afraid of being shamed for stating knowledge gaps are not the folks you want to work with.
LLMs pollute the internet like atomic bombs polluted the environment.
Making resources like wordfreq more visible won't exacerbate any of these concerns.
The new stuff generated does (and this is honestly already captured).
This author doesn't generate content. They analyze data from humans. That "from humans" part can no longer be reliably discerned, and thus the project can't continue.
Their research and projects are great.
I see a lot of people who are upset about AI still using AI image generation, because it's not in their field, so they feel less strongly about it - and they can't create art themselves anyway. That's hypocritical: either use it or don't, but don't fuss over it and then use it for something that's convenient for you.
Another example is how data on humans after 2020 or so can't be separated by sex because gender activists fought to stop recording sex in statistics on crime, medicine, etc.
Edit: just the first one
And now a hopefully new comment: having a word frequency measure of the internet as we're going into AI being more used would be IMMENSELY useful specifically _because_ more of the internet is being AI generated! I could see such a dataset being immensely useful to researchers who are looking for the impacts of AI on language, and to test empirically a lot of claims the author has made in this very post! What a shame that they stopped measuring.
Also: as to the claims that AI will cause stagnation and a reduction of the variance of English vocabulary used, this is a trend in English that's been happening for over 100 years ( https://evoandproud.blogspot.com/2019/09/why-is-vocabulary-s... ). I believe the opposite will happen: AI will increase the average person's vocabulary, since chat AIs tend to be more professionally written than a lot of the internet. It's like being able to chat with someone that has an infinite vocabulary. It also makes it possible for people to read complicated documents well out of their domain, since they can ask not just for definitions but for more in-depth explanations of what words/sections mean.
Here's to a comment that will never be read because of all the noise in this thread :/
The complaint about pollution of the Web with artificial content is timely, and it's not even the first time due to spam farms intended to game PageRank, among other nonsense. This may just mean there is new value in hand-curated lists of high-quality Web sites (some people use the term "small Web").
Each generation of the Web needs techniques to overcome its particular generation of adversarial mechanisms, and the current Web stage is no exception.
When Eric Arthur Blair wrote 1984 (under his pen name "George Orwell"), he anticipated people consuming auto-generated content to keep the masses away from critical thinking. This is now happening (he even anticipated auto-generated porn in the novel), but the technologies criticized can also be used for good, and that is what I try to do in my NLP research team. Good will prevail in the end.
Every content system seems to get polluted by noise once it hits mainstream usage: IRC, Usenet, reddit, Facebook, geocities, Yahoo, webrings, etc. Once-small curated selections eventually grow big enough to become victims of their own successes and taken over by spam.
It's always an arms race of quality vs quantity, and eventually the curators can't keep up with the sheer volume anymore.
You ask on HN, one of the highest quality sites I've ever visited in any age of the Internet.
IRC is still alive and well among pretty much the same audience as always. I'm not sure it's fair to compare that with the others.
But if they ever include other topics, they risk becoming more mainstream and noisy. Even within adjacent fields (like the various Stacks) it gets pretty bad.
Maybe the trick is to stay within a single small sphere then and not become a general purpose discussion site? And to have a low enough volume of submissions where good moderation is still possible? (Thank you dang and HN staff)
I wonder if it is more to do with the community itself. HN users tend to have very intelligent discussions on pretty much anything, and discourages shitty, unnuanced, one-line takes. This, coupled with a healthy moderation system, makes it hard for the lower quality discussion to break in and override the good stuff.
A good example of the generalization problem you discuss is reddit.
You have to unsubscribe from all the defaults and find the small, niche, communities about specific topics. If not, it's the same stuff, reposted, over and over, across different subs and/or social sites.
My experience here is that it's pretty good for things outside of tech (at least better than the average internet) but definitely not great.
A vaping discussion brought up that the glycerin used was safe and the same thing used in smoke machines, and someone else brought up a study showing that smoke machines are an occasional safety issue. Nowhere near every discussion goes that well, but stick around and you'll see in-depth discussion.
Go to a public health website by comparison and you'll see warnings without context and possibly a positive spin compared to smoking. https://www.cdc.gov/tobacco/e-cigarettes/index.html I suspect most people get basically nothing from looking at it.
Sometimes I try and engage, but honestly, mostly I think it's not worth it. Otherwise you end up doing this with your life: https://xkcd.com/386/
After doing some healthcare work I ended up understanding that some topics are not well known even by the professionals dedicating their whole lives to that because there are big gaps in the human knowledge on the topics.
I agree that people that think they can reason in two minutes about anything are a problem, but it's not a healthcare only issue (same happens for politics, economics, environment, etc.)
Engineers have the luck of working in a field where many things have a clear, known explanation (although, try to make an estimate of how long a team will take to implement a feature, and everybody will come up with something else).
And yes, this is true for many other areas of discussion at HN. It's just that it is most obvious to me in the area that my wife specializes in, because I pick up enough via osmosis from her to know when other people don't even have my limited level of understanding.
1: Or at least were 15 years ago when my wife told me about it- the argument might have been largely concluded and she just never updated me since I don't keep up with the medical literature the way she does.
2: Two decades ago there was a huge push for the "human genome project" under the basis that this would be "reading the blueprints for human life" and that would give us all of these medical breakthroughs. Basically none of those breakthroughs happened because we've spent the past 20 years learning all of the different ways that it is NOT a blueprint and that cells do things very differently from human engineers.
As someone with domain expertise here, I wholeheartedly disagree. HN is very bad at percolating accurate information about topics outside its wheelhouse, like clinical medicine, public health, or the natural sciences. It is also, simultaneously, extremely prone to overestimating its own collective competency at understanding technical knowledge outside its domain. In tandem, those two make for a rather dangerous combination.
Anytime I see a post about a topic within my area of specialty, I know to expect articulate, lengthy, and completely misguided or inaccurate comments dominating the discussion. It's enough of a problem that trying to wade in and correct them is a losing battle; I rarely even bother these days.
It's kind of funny that XKCD #793[0] is written about physicists, because the effect is way worse with software engineers.
And to hold up discussions about MS as an example of 'extremely' low quality discussion is, ah, interesting. Do you have any recent examples of such discussions?
relative to what? reddit?
also there's a trade off between entropy and "quality". too much "quality" and everyone gets bored and goes somewhere more entertaining
https://news.ycombinator.com/item?id=41499957
https://news.ycombinator.com/item?id=41408124
https://news.ycombinator.com/item?id=41335757
https://news.ycombinator.com/item?id=41327379
None of which fit your description of “neckbeardy tropes about their products being garbage spyware, switch to Linux, they're stealing your data, the OS is trash”.
And it isn’t just me, because if you look at those comments, I was talking to other people who weren’t invoking those “neckbeardy tropes” either
1. build a userbase, free product
2. once userbase get big enough, any new account requires a monthly fee, maybe $1
3. keep raising the fee higher and higher, until you get to the point that the userbase is manageable.
no ads, simple.
The people who stay away from critical thinking were doing that already and will continue to do so, 'AI' content or not.
(If children are never taught to think critically, then...)
You imply that thousands of years ago everybody was thinking critically?
Thinking critically is hard, stressful and might take some joy from your life.
Even if so, this is a dangerous thought that discourages the decisive action likely to be necessary for this to happen.
Sci-fi author:
I created the Torment Nexus to serve as a cautionary tale...
Tech Company:
Alas, we have created the Torment Nexus from the classic Sci-fi novel "Don't Create the Torment Nexus"
1. https://www.marxists.org/archive/marx/works/1894-c3/ch25.htm
As a random example: just trying to find a particular popular set of wireless earbuds takes me at least 10 minutes, when I already know the company, the company's website, other vendors that sell the company's goods, etc. It's just buried under tons of dreck. And my laptop is "old" (an 8-core i7 processor with 16GB of RAM) so it struggles to push through graphics-intense "modern" websites like the vendor's. Their old website was plain and worked great, letting me quickly search through their products and quickly purchase them. Last night I literally struggled to add things to cart and check out; it was actually harrowing.
Fuck the web, fuck web browsers, web design, SEO, searching, advertising, and all the schlock that comes with it. I'm done. If I can in any way purchase something without the web, I'mma do that. I don't hate technology (entirely...) but the web is just a rotten egg now.
This is going to be the thing that makes me quit Amazon. If I'm missing something and there's still a way to do a direct search, please tell me.
Product page (copy the identifier at the end): https://www.amazon.com/Long-Thanks-Hitchhikers-Guide-Galaxy-...
Review page (paste the identifier at the end): https://www.amazon.com/product-reviews/B001OF5F1E/
This seems to bypass all of the LLM stuff for now.
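If that URL pattern holds (it's undocumented, and Amazon could change it at any time), the copy-paste step is easy to script. A minimal sketch, assuming the product identifier (ASIN) is the last path segment as in the example above:

```python
def review_url(product_url):
    """Build the review-page URL from a product-page URL, assuming
    the product identifier (ASIN) is the last path segment."""
    asin = product_url.rstrip("/").rsplit("/", 1)[-1]
    return f"https://www.amazon.com/product-reviews/{asin}/"

print(review_url("https://www.amazon.com/Some-Product-Title/dp/B001OF5F1E"))
# → https://www.amazon.com/product-reviews/B001OF5F1E/
```

Real product URLs often carry extra `ref=` segments after the ASIN, so this naive last-segment split is only a starting point.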
On the other hand, what you call "The Web" seems to be just what you can get at through search engines. There's still the old web, the thing that's mediated by relationships and reputation rather than aggregation services with billions of users. Like the link I shared above. Or this heroically moderated site we're using right now.
LEEZWOO 15.6" Laptop - 16GB RAM 512GB SSD PC Laptop, Quad-Core N95 Processor Up to 3.1GHz, Laptop Computers with Touch ID, WiFi, BT4.2, for Students/Business
Name rolls off the tongue doesn’t it
I used to be able to search for "Trek bike derailleur hanger" and the first result would be what I wanted. Now I have to scroll past 5 ads trying to sell me a new bike, one broken link to a third party, and, if I'm really lucky, at the bottom of page 1 there will be a link to that part's page.
The shitification of the web is real.
Hey, who cares about making services that work when we can give people a cool chatbot assistant and a 1800 number with no real-person alternative to the decision tree
To get to the milk you'll have to walk by 3 rows of chips and soda.
We've been past the tipping point when it comes to text for some time, but for video I feel we are living through the watershed moment right now.
Smaller children especially don't have a good intuition for what is real and what is not. When I get asked if the person in a video is real, I still feel pretty confident answering, but I get less and less confident every day.
The technology is certainly there, but the majority of video content is still not affected by it. I expect this to change very soon.
https://www.nytimes.com/interactive/2024/09/09/technology/ai...
https://www.nytimes.com/interactive/2024/01/19/technology/ar...
These are a little bit unfair, in that we're comparing handpicked examples, but I don't think many experts will pass a test like this. Technology only moves forward (and seemingly, at an accelerating pace).
What's a little shocking to me is the speed of progress. Humanity is almost 3 million years old. Homo sapiens is around 300,000 years old. Cities, agriculture, and civilization are around 10,000. Metal is around 4,000. The industrial revolution is 500. Democracy? 200. Computation? 50-100.
The revolutions shorten in time, seemingly exponentially.
Comparing the world of today to that of my childhood....
One revolution I'm still coming to grips with is automated manufacturing. Going on aliexpress, so much stuff is basically free. I bought a 5-port 120W (total) charger for less than 2 minutes of my time. It literally took less time to find it than to earn the money to buy it.
I'm not quite sure where this is all headed.
It really isn't. Have a look at daily median income statistics for the rest of the planet:
https://ourworldindata.org/grapher/daily-median-income?tab=t...
$2.48 Eastern and Southern Africa (PIP)
$2.78 Sub-Saharan Africa (PIP)
$3.22 Western and Central Africa (PIP)
$3.72 India (rural)
$4.22 South Asia (PIP)
$4.60 India (urban)
$5.40 Indonesia (rural)
$6.54 Indonesia (urban)
$7.50 Middle East and North Africa (PIP)
$8.05 China (rural)
$10.00 East Asia and Pacific (PIP)
$11.60 Latin America and the Caribbean (PIP)
$12.52 China (urban)
And more generally: $7.75 World
I looked around on Ali, and the cheapest charger that doesn't look too dangerous costs around five bucks. So it's roughly equal to one day's income for at least half the population of our planet.

There is the very real possibility that everything just stalls and plateaus where we are. You know, like our population growth: it should have grown exponentially, but it did not. Actually, quite the reverse.
Flashlights? Sure, bring on aliexpress. USB cables with pop-off magnetically attached heads, no problem. But power supplies? Welp, to each their own!
But seriously, it's harder to accidentally make a USB cable that fries your equipment. The more common failure mode is it fails to work, or wears out too fast. Chargers on the other hand, handle a lot of voltage, generate a lot of heat, and output to sensitive equipment. More room to mess up, and more room for mistakes to cause damage.
Is there a big recent qualitative change here? Or is this a continuation of manufacturing trends (also shocking, not trying to minimize it all, just curious if there’s some new manufacturing tech I wasn’t aware of).
For some reason, your comment got me thinking of a fully automated system, like: you go to a website, pick and choose charger capabilities (ports, whether it has a battery, that sort of stuff). Then an automated factory makes you a bespoke device (software picks an appropriate shell, regulators, etc.). I bet we'll see it in our lifetimes at least.
The Technological Singularity - https://en.wikipedia.org/wiki/Technological_singularity
I don't. I mean, I can identify the bad ones, sure, but how do I know I'm not getting fooled by the good ones?
I see a lot of outrage around fake posts already. People want to believe bad things from the other tribes.
And we are going to feed them with it, endlessly.
It's relatively trivial to photoshop misinformation in a really powerful and undetectable way- but I don't see (legitimate) instances of groundbreaking news over a fake photo of the president or a CEO etc doing something nefarious. Why is AI different just because it's audio/video?
And the groundbreaking stuff isn't the problem; it's the little constant lies.
Last week a photoshopped Musk tweet was going around, people getting all up in arms against it despite the fact it was very easy to spot as a fabricated one.
People didn't care, they hate the guy, they just wanted to fuel their hate more.
The whole planet runs on fake content: magazine covers, food packaging, Instagram pics of places that never look that way...
And now, with AI, you can automate it and scale it up.
People are not ready. And in fact, they don't want to be.
Even what's free & open source in the special effects community is astonishing lately.
I'm certain you'd be shocked to see the amount of CG that's in some of your favorite movies made in the last ~10-20 years that you didn't notice because it's undetectable
But, yeah, I do think it is some kind of bias. Maybe not survivorship, though… maybe it is a generalized sort of Malmquist bias? Like the measurement is not skewed by the tendency of movies with good CGI to go away. It is skewed by the fact that bad CGI sticks out.
And it already happened, and no one pushed back while it was happening.
It's by Language Jones, a YouTube linguist. Title: "The AI Apocalypse is Here"
I don't share your confidence in identifying real people anymore.
I often flag as "false-ish" a lot of things from genuinely real people, but who have adopted the behaviors of the TikTok/Insta/YouTube creator. Hell, my beard is grey and even I poked fun at "YouTube Thumbnail Face" back in 2020 in a video talk I gave. AI twigs into these "semi-human" behavioral patterns super fast and super hard.
There is a video floating around with pairs of young ladies holding "This is real"/"This is not real" signs. They could be completely lying about both, and I really can't tell the difference. All of them have behavioral patterns that seem a little "off" but are consistent with the small number of "influencer" videos I have exposure to.
Fair and accurate. In the best cases the person running the model didn't write this stuff and word salad doesn't communicate whatever they meant to say. In many cases though, content is simply pumped out for SEO with no intention of being valuable to anyone.
In my opinion, the internet can be considered the equivalent of a natural environment like the Earth: it's a space where people share, meet, talk, etc.
I find it astonishing that after polluting our natural environment, we have now polluted the internet.
If we haven't already, we will be very soon. I'm sure there are people working on this problem, but I think we're hitting a very imminent feedback-loop moment. Most of humanity's recorded information is digitized, and non-human content is being generated from it at an incredible pace. We've injected a whole lot of noise into our usable data.
I don't know if the answer is more human content (I'm doing my part!) or novel generative content but this interim period is going to cause some medium-term challenges.
I like to think the LLM more-tokens-equals-better era is fading and we're getting into better use of existing data, but there's a very real inflection point we're facing.
Corporations did that, not humans.
"few people recognize that we already share our world with artificial creatures that participate as intelligent agents in our society: corporations" - https://arxiv.org/abs/1204.4116
(I initially wanted to say 'paid for by the government' but that'd be socialising losses and we've had quite enough of that in the past.)
Next token-seeking is a solved problem. Novel thinking can be solved by humans and possibly by AI soon, but adding more garbage to the data won't improve things.
Maybe we actually need to preserve all the old movies / documentaries / books in all languages and mark them as pre-LLM / non-LLM.
But I hazard a guess this won't happen, as it's a common good that could only be funded by left-leaning taxation policies - no one can make money doing this, unlike burning carbon chains to power LLMs.
>> Generative AI has polluted the data
Just like low-background steel marks the break in history from before and after the nuclear age, these types of data mark the distinction from before and after AI.
Future models will continue to amplify certain statistical properties of their training data, and that amplified output will in turn pollute the public space from which future training data is drawn. Meanwhile, certain low-frequency data will be selected by these models less and less, becoming suppressed and possibly eliminated. We know from classic NLP techniques that low-frequency words are often among the highest in information content and descriptive power.
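A toy simulation of that amplification loop (all numbers and vocabulary invented for illustration; raising probabilities to a power with temperature below 1 stands in for a model's tendency to favor already-common tokens):

```python
# Toy model-collapse simulation: repeatedly "retrain" a word-frequency
# model on its own sharpened output. Each generation, probabilities are
# raised to 1/T with T < 1 and renormalized, mimicking a model that
# over-selects common tokens and under-selects rare ones.
def sharpen(freqs, temperature=0.8):
    powered = {w: p ** (1 / temperature) for w, p in freqs.items()}
    total = sum(powered.values())
    return {w: p / total for w, p in powered.items()}

freqs = {"common": 0.6, "typical": 0.3, "rare": 0.09, "vanishing": 0.01}
for generation in range(10):
    freqs = sharpen(freqs)

# After a handful of generations, the low-frequency words are effectively gone
# and the distribution has collapsed onto the most common token.
print({w: round(p, 4) for w, p in freqs.items()})
```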
Bitrot will continue to act as the agent of Entropy further reducing pre-AI datasets.
These feedback loops will persist, language will be ground down, neologisms will be prevented, and... society, no longer having the mental tools to describe changing circumstances, its new thoughts unable to be realized, will cease to advance and then regress.
Soon there will be no new low frequency ideas being removed from the data, only old low frequency ideas. Language's descriptive power is further eliminated and only the AIs seem able to produce anything that might represent the shadow of novelty. But it ends when the machines can only produce unintelligible pages of particles and articles, language is lost, civilization is lost when we no longer know what to call its downfall.
The glimmer of hope is that humanity figured out how to rise from the dreamstate of the world of animals once. Future humans will be able to climb from the ashes again. There used to be a word, the name of a bird, that encoded this ability to die and return again, but that name is already lost to the machines that will take our tongues.
That's why on FB I mark my own writing as AI generated, and the AI generated slop as genuine. Because what is disguised as "transparency disclaimer" is just flagging content of what's a potential dataset to train from and what isn't.
Apropos of nothing in particular, see LinkedIn now admitting [1] it is training its AI models on "all users by default"
What would it take for the OpenAI overlords to inject words into their models and will new words into usage? Few have ever had that kind of power. Through its popular GPT platform, OpenAI now has the potential to dictate the evolution of human language.
This is novel and scary.
Or we'll be fine, because inbreeding isn't actually sustainable, either economically or technologically, and to most of the world the Silicon Valley "AI" crowd is more an obnoxious gang of socially stunted and predatory weirdos than some unstoppable omnipotent force.
On the one hand, I completely agree with Robyn Speer. The open web is dead, and the web is in a really sad state. The other day I decided to publish my personal blog on gopher, just 'cause; there's a lot less crap on gopher (and no, gopher is not the answer).
But...
A couple of weeks ago, I had to send a video file to my wife's grandfather, who is 97, lives in another country, and doesn't use computers or mobile phones. Eventually we determined that he has a DVD player, so I turned to x264 to convert this modern 4K HDR video into a form that can be played by any ancient DVD player, while preserving as much visual fidelity as possible.
The thing about x264 is, it doesn't have any docs. Unlike x265 which had a corporate sponsor who could spend money on writing proper docs, x264 was basically developed through trial and error by members of the doom9 forum. There are hundreds of obscure flags, some of which now operate differently to what they did 20 years ago. I could spend hours going through dozens of 20 year old threads on doom9 to figure out what each flag did, or I could do what I did and ask a LLM (in this case Claude).
Claude wasn't perfect. It mixed up a few ffmpeg flags with x264 ones (easy mistake), but combined with some old fashioned searching and some trial and error, I could get the job done in about half an hour. I was quite happy with the quality of the end product, and the video did play on that very old DVD player.
Back in pre-LLM days, it's not like I would have hired an x264 expert to do this job for me. I would have either had to spend hours more on this task, or, more likely, this 97-year-old man would never have seen his great-granddaughter's dance, which apparently brought a massive smile to his face.
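For reference, ffmpeg's `-target` presets bundle most of the DVD constraints (MPEG-2 video, DVD frame size and bitrates, AC-3 audio) into a single flag, which sidesteps a lot of the obscure-flag archaeology. A sketch, assuming a PAL player and a hypothetical input filename:

```python
# Sketch of a DVD-compatible transcode command. "pal-dvd" assumes a PAL
# player; use "ntsc-dvd" for NTSC. The input filename is hypothetical, and
# HDR-to-SDR tone mapping is left to ffmpeg's defaults here.
cmd = [
    "ffmpeg",
    "-i", "dance_4k_hdr.mp4",  # hypothetical 4K HDR source file
    "-target", "pal-dvd",      # preset: MPEG-2, DVD resolution/bitrate, AC-3
    "out.mpg",
]
print(" ".join(cmd))  # run it with subprocess.run(cmd, check=True)
```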
Like everything before them, LLMs are just tools. Neither inherently good nor bad. It's what we do with them and how we use them that matters.
Didn't most DVD burning software include video transcoding as a standard feature? Back in the day, you'd have used Nero Burning ROM, or Handbrake - granted, the quality may not have been optimized to your standards, but the result would have been a watchable video (especially to 97 year-old eyes)
Everything is “seamless” nowadays. Like I am seamlessly commenting here.
Arguably, the meaning of these words evolve due to misuse too.
Many of my searches nowadays include suffixes like "site:reddit.com" (or similar havens of, hopefully, still mostly human-generated content) to produce reasonably useful results. There's so much spam pollution by sites like Medium.com that it's disheartening. It feels as if internet humanity is already retreating into its last comely homes, which are more closed than open to the outside.
On the positive side:
1. Self-managed blogs (like: not on Substack or Medium) by individuals have become a strong indicator for interesting content. If the blog runs on Hugo, Zola, Astro, you-name-it, there's hope.
2. As a result of (1), I have started to use an RSS reader again. Who would have thought!
I am still torn about what to make of Discord. On the one hand, the closed-by-design nature of the thousands of Discord servers, where content is locked in forever without a chance of being indexed by a search engine, has many downsides in my opinion. On the other hand, the servers I do frequent are populated by humans, not content-generating bots camouflaged as users.
As mentioned, we have heuristics like frequency of the word "delve", and simple techniques such as measuring perplexity. I'd like to see a GAN style approach to this problem. It could potentially help improve the "humanness" of AI-generated content.
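A toy version of the perplexity heuristic (unigram model over an invented mini-corpus; a real detector would score the text under an actual LLM, but the mechanics are the same):

```python
import math
from collections import Counter

# Toy perplexity check: score a text under a smoothed unigram model built
# from a reference corpus. Suspiciously low perplexity - the model finds
# the text very predictable - is one weak signal of machine generation.
def unigram_perplexity(text, reference, smoothing=1.0):
    ref_counts = Counter(reference.lower().split())
    vocab = len(ref_counts) + 1          # +1 for unseen words
    total = sum(ref_counts.values())
    words = text.lower().split()
    log_prob = 0.0
    for w in words:
        p = (ref_counts[w] + smoothing) / (total + smoothing * vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(words))

reference = "the cat sat on the mat and the dog sat on the rug"
print(unigram_perplexity("the cat sat on the mat", reference))    # low: predictable
print(unigram_perplexity("quantum zebra refactoring", reference)) # high: surprising
```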
It's actually not. It's rather difficult for humans as well. We can see verbose text that is confused and call it AI, but it could just be a human as well.
To borrow an older model training method, "Generative adversarial network". If we can distinguish AI from humans... We can use it to improve AI and close the gap.
So, it becomes an arms race that constantly evolves.
But we do know that now it's a lot more, with a big LOT.
If we add linguistics to NLP I can see an argument, but if we define NLP as the research of enabling a computer to process language, then it seems to me that LLMs/generative AI is the only research an NLP practitioner should focus on, and everything else is moot. Is there any other paradigm that we think can enable a computer to understand language, other than training a large deep-learning model on a lot of data?
Compare the frequency of words to those used in natural human writing and you can spot the computer from the human.
Brains aren't nearly as good at slightly adjusting the statistical properties of a text corpus as computers are.
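The frequency-comparison idea above can be sketched like this (both mini-corpora are invented stand-ins; a real baseline would come from something like wordfreq's pre-2022 data):

```python
from collections import Counter

# For each word, compare its rate in a suspect text against a human-written
# baseline. Words heavily over-represented relative to the baseline - the
# "delve" effect - bubble to the top.
def overuse_ratios(suspect, baseline, smoothing=1.0):
    s = Counter(suspect.lower().split())
    b = Counter(baseline.lower().split())
    s_total, b_total = sum(s.values()), sum(b.values())
    vocab = set(s) | set(b)
    return {
        w: ((s[w] + smoothing) / s_total) / ((b[w] + smoothing) / b_total)
        for w in vocab
    }

baseline = "we looked into the results and found a clear pattern"
suspect = "let us delve into the results and delve into the pattern"
ratios = overuse_ratios(suspect, baseline)

# "delve" stands out: frequent in the suspect text, absent from the baseline.
print(sorted(ratios.items(), key=lambda kv: -kv[1])[:3])
```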
Hmm I don’t disagree but I think it will be valuable skill going forward to write text that doesn’t read like it was written by an LLM
This is an arms race that I’m not sure we can win though. It’s almost like a GAN.
But that's a losing endeavor: if you can do that, you can immediately ask your LLM to fix its output so that it passes that test (and many others). It can introduce typos, make small errors on purpose, and anything you can think of to make it look human.
I'm sure this has occurred to them already. Apart from the near-impossibility of continuing the task in the same way they've always done it, it seems like the other reason they're not updating wordfreq is to stick a thumb in the eye of OpenAI and Google. While I appreciate the sentiment, I recognize that those corporations' eyes will never be sufficiently thumbed to satisfy anybody, so I would not let that anger change the course of my life's work, personally.
'OpenAI and Google can collect their own damn data. I hope they have to pay a very high price for it, and I hope they're constantly cursing the mess that they made themselves.'
really does betray some real naivete. OpenAI and Google could literally burn $10 million per day (okay, maybe not OpenAI - but Google surely could) and reasonably fail to notice. Whatever costs those companies have to pay to collect training data will be well worth it to them. Any messes made in the course of obtaining that data will be dealt with by an army of employees either manually cleaning up the data, or by algorithms Google has its own LLM write for itself.
I do find the general sense of impending dystopian inhumanity arising out of the explosion of LLMs to be super fascinating (and completely understandable).
Maybe this is because I’m European, but what is partisan about calling X invariably worthless drivel? Seems a lot like facts to me considering what has been going on with the platform moderation since Elon Musk bought it. It’s so bad that the EU consider it a platform for misinformation these days.
Or potentially even more dystopian would be that AI slop would be dictating/driving human communication going forward.
So I would still state noun-phrase frequency in LLM output would tend to reflect noun-phrase frequency in training data in a similar context (disregarding enforced bias induced through RLHF and other tuning at the moment)
I'm sure there will be cross-fertilization from LLM to Human and back, but I'm not seeing the data yet that the influence on word-frequency is that outspoken.
The author seems to have some other objections to the rise of LLM's, which I fully understand.
I guess that the same way scientists had to account for the bomb pulse in order to provide accurate carbon-14 dating, wordfreq would need a magic way to account for non-human content.
Saying magic because, unfortunately, it was much easier to detect nuclear testing in the atmosphere than it will be to detect AI-generated content.
Might even change the tool name.
Funny fact: it doesn't result in an increase in search results for "delve".