<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Julia Community 🟣: Jan Siml</title>
    <description>The latest articles on Julia Community 🟣 by Jan Siml (@svilupp).</description>
    <link>https://forem.julialang.org/svilupp</link>
    <image>
      <url>https://forem.julialang.org/images/xsJxdUrmmfk1cteKNHXU96f3JbB7z69pdyULa0Jmj9E/rs:fill:90:90/g:sm/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L3VzZXIvcHJvZmls/ZV9pbWFnZS83NC9m/OTU5ZDNiNy1jYzIx/LTQ1MDUtODUyMi1l/NjEzZWUwZjBiMzAu/anBlZw</url>
      <title>Julia Community 🟣: Jan Siml</title>
      <link>https://forem.julialang.org/svilupp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.julialang.org/feed/svilupp"/>
    <language>en</language>
    <item>
      <title>AIHelpMe.jl Pt.2: Instant Expertise, Infinite Solutions for Julia Developers</title>
      <dc:creator>Jan Siml</dc:creator>
      <pubDate>Tue, 30 Apr 2024 08:36:09 +0000</pubDate>
      <link>https://forem.julialang.org/svilupp/aihelpmejl-pt2-instant-expertise-infinite-solutions-for-julia-developers-a31</link>
      <guid>https://forem.julialang.org/svilupp/aihelpmejl-pt2-instant-expertise-infinite-solutions-for-julia-developers-a31</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/svilupp/AIHelpMe.jl"&gt;AIHelpMe.jl&lt;/a&gt; is a Julia package that harnesses the power of AI models to provide tailored coding guidance, integrating seamlessly with PromptingTools.jl to offer a unique approach to answering coding queries directly in Julia's environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Frustration of Searching for Julia Answers
&lt;/h2&gt;

&lt;p&gt;We've all been there. You're stuck on a problem, and you need some guidance on how to implement a specific functionality in Julia. You head to your trusty search engine, type in your query, and... wait, why are all these results in Python? You tweak your query, adding "Julia" this and "Julia language" that, but still, the results are scattered and unclear.&lt;/p&gt;

&lt;p&gt;After 5 minutes of searching, you finally stumble upon a Discourse post from 2018. But wait, is this even relevant to Julia 1.0? Should you bother opening it? You take a deep breath and dive in, hoping that the answer lies within.&lt;/p&gt;

&lt;p&gt;Another 5 minutes pass, and you finally find what you're looking for. You copy out the necessary snippet, and after a few more minutes of customizing it to your needs, you have what you need. The process is tedious, to say the least.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to AIHelpMe's Official Release
&lt;/h2&gt;

&lt;p&gt;AIHelpMe, a Julia package designed to enhance coding assistance with AI, has been officially registered and released. This milestone makes it easier to install directly from the Julia registry and introduces new functionality and new knowledge packs! Developers can now access sophisticated, AI-powered coding guidance more efficiently than ever.&lt;/p&gt;

&lt;h2&gt;
  
  
  How AIHelpMe Works
&lt;/h2&gt;

&lt;p&gt;AIHelpMe uses a Retrieval-Augmented Generation (RAG) pattern to provide accurate and relevant answers to your coding questions. It preprocesses the provided documentation, converting text snippets into numerical embeddings. When you ask a question, AIHelpMe looks up the most relevant documentation snippets, feeds them into the AI model, and generates an answer tailored to Julia's ecosystem and best practices.&lt;/p&gt;

&lt;p&gt;Get started with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Pkg&lt;/span&gt;
&lt;span class="n"&gt;Pkg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"AIHelpMe"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;AIHelpMe&lt;/span&gt;
&lt;span class="n"&gt;aihelp&lt;/span&gt;&lt;span class="s"&gt;"How do I implement quicksort in Julia?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: At minimum, you need an OpenAI API key (&lt;code&gt;ENV["OPENAI_API_KEY"]&lt;/code&gt;), but I would strongly recommend also getting a FREE Cohere API key to enable re-ranking (the silver pipeline).&lt;/p&gt;
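&lt;p&gt;For example, you can set the keys as environment variables before loading the package. The &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; name comes from the note above; the Cohere variable name follows the usual PromptingTools convention but is my assumption, so check the docs for your version:&lt;/p&gt;

```julia
# Placeholder values shown; use your own keys
ENV["OPENAI_API_KEY"] = "sk-..."   # required for the default pipeline
ENV["COHERE_API_KEY"] = "..."      # optional; enables re-ranking (silver pipeline)
```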

&lt;h2&gt;
  
  
  How AIHelpMe Differs from Other Solutions
&lt;/h2&gt;

&lt;p&gt;Compared to chatbots, AIHelpMe offers several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Grounded in actual resources&lt;/strong&gt;: AIHelpMe's answers are based on actual, up-to-date Julia resources, not outdated training data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customizable knowledge&lt;/strong&gt;: You choose what knowledge to include, allowing you to improve precision and recall.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible trade-offs&lt;/strong&gt;: You choose the trade-off between cost, performance, and time, giving you greater control over your coding experience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State-of-the-art RAG methods&lt;/strong&gt;: AIHelpMe leverages the latest RAG methods, and full customization is possible, ensuring that you get the most accurate and relevant answers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: Only a limited number of packages have been pre-processed so far: the Julia docs, the Tidier ecosystem, and the Makie ecosystem. It's still experimental, but it works! The starter example shows how to load them:&lt;/p&gt;

&lt;h2&gt;
  
  
  Starter Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;AIHelpMe&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;AIHelpMe&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pprint&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_result&lt;/span&gt;

&lt;span class="c"&gt;# ideally, switch to better pipeline for proper results, requires setting up Cohere API key&lt;/span&gt;
&lt;span class="n"&gt;AIHelpMe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update_pipeline!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# load tidier index, others available: :julia, :makie&lt;/span&gt;
&lt;span class="n"&gt;AIHelpMe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_index!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;tidier&lt;/span&gt;&lt;span class="x"&gt;);&lt;/span&gt;

&lt;span class="c"&gt;# Ask a question&lt;/span&gt;
&lt;span class="n"&gt;aihelp&lt;/span&gt;&lt;span class="s"&gt;"How do you add a regression line to a plot in TidierPlots?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://forem.julialang.org/images/faBQBVouq4IGJYEyWmwiflkqwFTIPEm33jqWzLBMDrw/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL2ky/bWhoNzEzOGd3MDll/c2Q3ajZsLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/faBQBVouq4IGJYEyWmwiflkqwFTIPEm33jqWzLBMDrw/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL2ky/bWhoNzEzOGd3MDll/c2Q3ajZsLnBuZw" alt="Quick answer over Tidier docs" width="800" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or show the highlighted answer (you can customize to add the actual source docs, or remove the scores/highlights):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;# See highlighted answer (optional)&lt;/span&gt;
&lt;span class="n"&gt;pprint&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_result&lt;/span&gt;&lt;span class="x"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://forem.julialang.org/images/xsOj3_BGTAxyLVFcS2sBSaDC2UCwmcfMbMdGhnsfb-w/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3g2/ZWE2bmEwbnFjZzl1/bDd0dnNqLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/xsOj3_BGTAxyLVFcS2sBSaDC2UCwmcfMbMdGhnsfb-w/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3g2/ZWE2bmEwbnFjZzl1/bDd0dnNqLnBuZw" alt="Highlighted answer over Tidier docs" width="800" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're an infrequent user of AlgebraOfGraphics like me, you'll certainly appreciate the &lt;code&gt;:makie&lt;/code&gt; knowledge pack -- it's now much faster to find the right keywords to customize your plot!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;# Load all knowledge packs&lt;/span&gt;
&lt;span class="n"&gt;AIHelpMe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_index!&lt;/span&gt;&lt;span class="x"&gt;([&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;makie&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;tidier&lt;/span&gt;&lt;span class="x"&gt;]);&lt;/span&gt;
&lt;span class="n"&gt;aihelp&lt;/span&gt;&lt;span class="s"&gt;"How to set the label of a y-axis in Makie?"&lt;/span&gt;&lt;span class="n"&gt;gpt3t&lt;/span&gt; &lt;span class="c"&gt;# testing a weak model&lt;/span&gt;
&lt;span class="n"&gt;pprint&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_result&lt;/span&gt;&lt;span class="x"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://forem.julialang.org/images/K8DTFgnTOBE0Tq578jo9PR-zn6wdMtUxp5lhm0-XFxg/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzLzNw/eTd0Ym02emFnbjI5/dnRsaml1LnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/K8DTFgnTOBE0Tq578jo9PR-zn6wdMtUxp5lhm0-XFxg/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzLzNw/eTd0Ym02emFnbjI5/dnRsaml1LnBuZw" alt="Highlighted answer over Makie docs" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: There can still be some issues with the quality of the answers (it's GenAI!), especially for the bronze pipeline and weaker models, but, hopefully, it's already good enough to create value for you!&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced Usage of AIHelpMe
&lt;/h2&gt;

&lt;p&gt;AIHelpMe offers several advanced features for experienced users wanting more control and deeper insights:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;pprint&lt;/code&gt; to highlight potential hallucinations, show sources, their scores, and context snippets.&lt;/li&gt;
&lt;li&gt;Example: &lt;code&gt;aihelp("Explain Julia's multiple dispatch system", return_all=true)|&amp;gt;AIHelpMe.pprint&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: You'll always get better responses with better pipelines - see below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customizing the AI Pipeline:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adjust the complexity of AI responses with &lt;code&gt;update_pipeline!&lt;/code&gt;, choosing from the bronze, silver, or gold levels.&lt;/li&gt;
&lt;li&gt;Specify AI models, including local options like Ollama.&lt;/li&gt;
&lt;li&gt;Example: &lt;code&gt;AIHelpMe.update_pipeline!(:silver; model="gllama370")&lt;/code&gt; # gllama370 is Groq.com-hosted Llama 3 70b that you can access for free!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Safe Code Execution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execute AI-generated code safely with &lt;code&gt;PromptingTools.AICode&lt;/code&gt; struct.&lt;/li&gt;
&lt;li&gt;Example: &lt;code&gt;aihelp("How to create a named tuple from a dictionary?")|&amp;gt;PromptingTools.AICode&lt;/code&gt; &lt;/li&gt;
&lt;/ul&gt;
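&lt;p&gt;As a minimal sketch of what this gives you (the field names reflect my understanding of the PromptingTools API; treat them as assumptions and check &lt;code&gt;?PromptingTools.AICode&lt;/code&gt; in your version):&lt;/p&gt;

```julia
using PromptingTools

# Evaluate a code string in a sandboxed module (sketch; field names assumed
# from the PromptingTools docs)
cb = PromptingTools.AICode("""
x = [1, 2, 3]
sum(x)
""")
cb.success   # whether the code parsed and ran without error
cb.output    # the value of the last expression
```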

&lt;p&gt;For more details on these advanced features, please refer to the &lt;a href="https://svilupp.github.io/svilupp.github.io/AIHelpMe.jl/dev/advanced"&gt;AIHelpMe Advanced Documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Over the summer, we hope to improve answer quality and add more knowledge packs. &lt;/p&gt;

&lt;p&gt;We're also working on making it super easy for you to develop your own knowledge packs for the packages you use, regardless of whether they're public or private. This will enable you to tailor AIHelpMe to your specific needs and workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AIHelpMe is the solution to your Julia coding woes. With its AI-powered assistance, flexible querying system, and cost-effective approach, AIHelpMe is poised to revolutionize the way you code in Julia. Try it out today and experience the future of coding assistance!&lt;/p&gt;

&lt;p&gt;Credit for the title image: DALL-E 3.&lt;/p&gt;

</description>
      <category>genai</category>
      <category>generativeai</category>
      <category>aihelpme</category>
    </item>
    <item>
      <title>ProToPortal: The Portal to the Magic of PromptingTools</title>
      <dc:creator>Jan Siml</dc:creator>
      <pubDate>Sun, 28 Apr 2024 16:02:47 +0000</pubDate>
      <link>https://forem.julialang.org/svilupp/protoportal-the-portal-to-the-magic-of-promptingtools-47n2</link>
      <guid>https://forem.julialang.org/svilupp/protoportal-the-portal-to-the-magic-of-promptingtools-47n2</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;ProToPortal streamlines your interaction with any LLM tasks on the go, offering customizable templates for automatic replies, direct code evaluation, and an intuitive, multi-device interface. Explore its full capabilities and enhance your productivity on &lt;a href="https://github.com/svilupp/ProToPortal.jl"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unveiling ProToPortal: A Julia-First LLM GUI built with Stipple.jl!
&lt;/h2&gt;

&lt;p&gt;Hello, fellow GenAI enthusiasts! Today, I'm thrilled to introduce &lt;strong&gt;ProToPortal&lt;/strong&gt;, a nifty tool born from my own need to simplify my daily interactions with Julia and AI models. It's a small but mighty project aimed at boosting productivity and minimizing those pesky prompting hassles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ProToPortal?
&lt;/h2&gt;

&lt;p&gt;Ever found yourself on a leisurely walk with your dog, struck by a sudden coding inspiration that just couldn't wait? That's exactly where ProToPortal comes into play. This tool isn't just another coding interface; it's your on-the-go, in-your-pocket coding companion, ready to tackle tasks from simple prompts to complex code evaluations—all before you've even finished your walk!&lt;/p&gt;

&lt;h2&gt;
  
  
  Cool Features to Check Out:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Accessible Anywhere:&lt;/strong&gt; Whether you're on a train or in your comfy home office, ProToPortal is there. It works seamlessly across all devices, ensuring that your brilliant ideas never slip away just because you're not at your desk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code Evaluation and Fixing:&lt;/strong&gt; Forget about flipping between screens to debug. ProToPortal lets you directly evaluate and fix Julia code within the GUI. Imagine tweaking your code with just a few clicks—yes, it's that easy!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automatic Replies:&lt;/strong&gt; Streamline your workflow even further with automated responses. Set up ProToPortal to handle repetitive tasks, leaving you more time to focus on the creative aspects of coding.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  More Handy Features:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Editing Code Cells:&lt;/strong&gt; Quickly edit any of the messages right within the chat tab—just click and modify.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deleting Code Cells:&lt;/strong&gt; Made a mistake? No worries! Easily remove any unwanted messages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Saving Conversations:&lt;/strong&gt; Keep a history of your sessions for future reference or continued experimentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a complete list of features, including detailed explanations and how-tos, be sure to check out &lt;a href="https://svilupp.github.io/ProToPortal.jl/dev"&gt;ProToPortal's Documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  See It in Action:
&lt;/h2&gt;

&lt;p&gt;Curious to see how it works in real time? &lt;/p&gt;

&lt;p&gt;Check out this video where it auto-fixes the generated code: &lt;a href="https://github.com/svilupp/ProToPortal.jl/blob/main/docs/src/videos/screen-capture-code-fixing.webm"&gt;Code fixing recording&lt;/a&gt; (webm format, so not all browsers may play it).&lt;/p&gt;

&lt;p&gt;And here's another cool feature—watch how ProToPortal handles editing a conversation (gif, easy to play): &lt;a href="https://github.com/svilupp/ProToPortal.jl/blob/9eb18346d056dd1b3b4d2202ce5da0b7be7ef8cb/docs/src/videos/screen-capture-plain.gif"&gt;Editing a conversation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Give It a Try!
&lt;/h2&gt;

&lt;p&gt;Ready to dive in? Visit &lt;a href="https://github.com/svilupp/ProToPortal.jl"&gt;ProToPortal's GitHub&lt;/a&gt; to get started. It's open source, and I'm eager to hear your thoughts or see your contributions!&lt;/p&gt;

&lt;p&gt;Thanks for checking out ProToPortal. Happy coding, and let the code be with you!&lt;/p&gt;




&lt;p&gt;Big thank you to the &lt;a href="https://genieframework.com/"&gt;Genie.jl&lt;/a&gt; team! This tool wouldn't exist without their amazing packages!&lt;/p&gt;

&lt;p&gt;Credit for the title image goes to DALL-E 3.&lt;/p&gt;

</description>
      <category>generativeai</category>
      <category>genai</category>
      <category>genie</category>
    </item>
    <item>
      <title>Automatically Saving Conversations with PromptingTools.jl and AIHelpMe.jl</title>
      <dc:creator>Jan Siml</dc:creator>
      <pubDate>Thu, 25 Apr 2024 20:36:42 +0000</pubDate>
      <link>https://forem.julialang.org/svilupp/automatically-saving-conversations-with-promptingtoolsjl-and-aihelpmejl-5fke</link>
      <guid>https://forem.julialang.org/svilupp/automatically-saving-conversations-with-promptingtoolsjl-and-aihelpmejl-5fke</guid>
      <description>&lt;h2&gt;
  
  
  Update 20/5
&lt;/h2&gt;

&lt;p&gt;From PromptingTools v0.26 onward, you can enable auto-saving of your conversations by running this one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register_model!&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"gpt-3.5-turbo"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpenAISchema&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt; &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TracerSchema&lt;/span&gt; &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SaverSchema&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See &lt;code&gt;?TracerSchema&lt;/code&gt; and &lt;code&gt;?SaverSchema&lt;/code&gt; for more details.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Learn how to automatically save conversations with PromptingTools.jl. By saving conversations, you can contribute to building a dataset for fine-tuning a Julia-specific language model. This tutorial provides code examples to get you started.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Recently, there have been exciting discussions about fine-tuning a language model for the Julia programming language (see &lt;a href="https://discourse.julialang.org/t/an-llm-fine-tuned-for-julia-call-for-comments-help/113462/8"&gt;here&lt;/a&gt;). &lt;/p&gt;

&lt;p&gt;As part of this effort, we need a high-quality dataset of GOOD conversations related to Julia. One way to contribute to this effort is to start logging conversations with Large Language Models (LLMs) that are relevant to Julia. &lt;/p&gt;

&lt;p&gt;In this blog post, we will explore how to automatically save conversations using PromptingTools.jl and AIHelpMe.jl, a powerful Julia package for interacting with language models. By saving these conversations, we can build a valuable dataset for fine-tuning a Julia-specific language model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining a Custom Schema for Saving Conversations
&lt;/h2&gt;

&lt;p&gt;A lesser-known feature: PromptingTools has a custom callback system that lets you define custom schemas, which then call your own functions before and after each LLM call (it's used mostly for observability).&lt;/p&gt;

&lt;p&gt;To save conversations, we need to define a custom schema that wraps our normal prompt schema. We can do this by creating a new struct &lt;code&gt;SaverSchema&lt;/code&gt; that inherits from &lt;code&gt;PT.AbstractTracerSchema&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Dates&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;JSON3&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;PromptingTools&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="n"&gt;PT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PromptingTools&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="n"&gt;SAVE_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"finetune_julia"&lt;/span&gt;

&lt;span class="nd"&gt;@kwdef&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="nc"&gt; SaverSchema&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;:&lt;/span&gt; &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AbstractTracerSchema&lt;/span&gt;
    &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AbstractPromptSchema&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any call through this schema triggers a call to the function &lt;code&gt;initialize_tracer&lt;/code&gt; before the LLM call and to &lt;code&gt;finalize_tracer&lt;/code&gt; after it.&lt;/p&gt;

&lt;p&gt;In our case, we want to overload the &lt;code&gt;finalize_tracer&lt;/code&gt; function to save the conversation after the LLM call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; PT.finalize_tracer&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tracer_schema&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SaverSchema&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;msg_or_conv&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; 
    &lt;span class="n"&gt;tracer_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kt"&gt;NamedTuple&lt;/span&gt;&lt;span class="x"&gt;(),&lt;/span&gt; 
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;# We already captured all kwargs, they are already in `tracer`, we can ignore tracer_kwargs in this implementation&lt;/span&gt;

    &lt;span class="n"&gt;time_received&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dates&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="x"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"YYYYmmdd_HHMMSS"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joinpath&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SAVE_DIR&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"conversation__&lt;/span&gt;&lt;span class="si"&gt;$(model)&lt;/span&gt;&lt;span class="s"&gt;__&lt;/span&gt;&lt;span class="si"&gt;$(time_received)&lt;/span&gt;&lt;span class="s"&gt;.json"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;conv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg_or_conv&lt;/span&gt; &lt;span class="k"&gt;isa&lt;/span&gt; &lt;span class="kt"&gt;AbstractVector&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;msg_or_conv&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg_or_conv&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save_conversation&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conv&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;msg_or_conv&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example 1: Saving Conversations with &lt;code&gt;aigenerate&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Now that we have defined our custom schema, we can use it to save conversations with &lt;code&gt;aigenerate&lt;/code&gt;. We need to explicitly provide the &lt;code&gt;SaverSchema&lt;/code&gt; instance to &lt;code&gt;aigenerate&lt;/code&gt; along with the input prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SaverSchema&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpenAISchema&lt;/span&gt;&lt;span class="x"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aigenerate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Say hi"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt3t"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_all&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you call this function, it will save the conversation to the folder defined in &lt;code&gt;SAVE_DIR&lt;/code&gt;.&lt;/p&gt;
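&lt;p&gt;One practical detail (my assumption; &lt;code&gt;save_conversation&lt;/code&gt; may not create the folder for you): make sure the target folder exists before the first call:&lt;/p&gt;

```julia
const SAVE_DIR = "finetune_julia"   # same constant as defined above

# Create the save directory if it does not exist yet (no-op if it does)
mkpath(SAVE_DIR)
```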

&lt;p&gt;One gotcha: if you send multiple messages in the same saved conversation, every turn will be saved to a separate file.&lt;br&gt;
The easiest approach is to ignore this and resolve it in post-processing (&lt;code&gt;AIMessage&lt;/code&gt;s have unique IDs, so continued conversations are easy to detect).&lt;br&gt;
Alternatively, you can include a hash of the content of the first 2-3 messages in the filename to clearly mark continued conversations.&lt;/p&gt;
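&lt;p&gt;The hashing idea could be sketched like this, using the SHA standard library. The &lt;code&gt;conversation_id&lt;/code&gt; helper, the choice of the first two messages, and the 8-character prefix are all my own choices for illustration, not part of PromptingTools:&lt;/p&gt;

```julia
using SHA

# Derive a short, stable ID from the first two message contents so that all
# turns of one conversation share a filename prefix (hypothetical helper)
conversation_id(contents::Vector{String}) =
    bytes2hex(sha1(join(first(contents, 2), "|")))[1:8]

id = conversation_id(["You are a helpful assistant.", "Say hi"])
# e.g. use it as: "conversation__$(model)__$(id)__$(time_received).json"
```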
&lt;h3&gt;
  
  
  Example 2: Registering a Traced Model
&lt;/h3&gt;

&lt;p&gt;Instead of providing the custom schema every time, we can register a traced model with the custom schema. This way, we can use the model name instead of the schema instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;# Overwrite the schema for this model and define a nice alias&lt;/span&gt;
&lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register_model!&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt-3.5-turbo"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MODEL_ALIASES&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"gpt3t"&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"gpt-3.5-turbo"&lt;/span&gt;

&lt;span class="c"&gt;# Notice the return_all -&amp;gt; we need to return ALL messages, it would be a useless record otherwise&lt;/span&gt;
&lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aigenerate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Say hi"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt3t"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_all&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The conversation is saved automatically, with no extra arguments needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loading Conversations
&lt;/h3&gt;

&lt;p&gt;Once we have saved conversations, we can load them back into Julia using &lt;code&gt;load_conversation&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;conv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_conversation&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"finetune_julia/conversation__gpt3t__20240425_205853.json"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Exporting Conversations in ShareGPT Format
&lt;/h3&gt;

&lt;p&gt;Once we have enough conversations, we will want to export them so our fine-tuning tool can use them.&lt;br&gt;
I would highly recommend Axolotl (see an example from &lt;a href="https://github.com/svilupp/Julia-LLM-Leaderboard/blob/main/experiments/cheater-7b-finetune/lora.yml"&gt;my finetune&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Axolotl can work with instructions (conversations) in ShareGPT format. This is how you can export multiple conversations into the required JSONL file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;conv1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SystemMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"System message 1"&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; 
         &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UserMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"User message"&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; 
         &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AIMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"AI message"&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;conv2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SystemMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"System message 2"&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; 
         &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UserMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"User message"&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; 
         &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AIMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"AI message"&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joinpath&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"finetune_julia"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"export_sharegpt.jsonl"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save_conversations&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;conv1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conv2&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Saving AIHelpMe Conversations
&lt;/h2&gt;

&lt;p&gt;If you use AIHelpMe, you're also generating loads of interesting data!&lt;br&gt;
The simplest thing for auto-logging your questions is to wrap the entry function &lt;code&gt;aihelp&lt;/code&gt; and serialize the whole &lt;code&gt;RAGResult&lt;/code&gt; (it has all the diagnostics and underlying information):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; aih&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aihelp&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;return_all&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dates&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="x"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"YYYYmmdd_HHMMSS"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;JSON3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;joinpath&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SAVE_DIR&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"aihelp__&lt;/span&gt;&lt;span class="si"&gt;$(dt)&lt;/span&gt;&lt;span class="s"&gt;.json"&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To use it, you would replace &lt;code&gt;aihelp("some question...")&lt;/code&gt; with &lt;code&gt;aih("some question...")&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The serialized &lt;code&gt;RAGResult&lt;/code&gt; is roughly 200 kB, but it provides a lot of helpful detail about your question.&lt;br&gt;
If you want to save space, save just the individual conversations in &lt;code&gt;result.conversations&lt;/code&gt;.&lt;/p&gt;
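
&lt;p&gt;As a minimal sketch (assuming the entries of &lt;code&gt;result.conversations&lt;/code&gt; are vectors of messages; adjust the field access to your version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;# Sketch: persist just the message exchanges instead of the full RAGResult
# (assumes `result.conversations` holds vectors of messages)
path = joinpath(SAVE_DIR, "aihelp_conversations.jsonl")
PT.save_conversations(path, collect(values(result.conversations)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;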

&lt;h2&gt;
  
  
  Sharing The Conversations
&lt;/h2&gt;

&lt;p&gt;Where to share these? To be discussed. Come join us on &lt;a href="https://discourse.julialang.org/t/an-llm-fine-tuned-for-julia-call-for-comments-help/113462/8"&gt;Discourse&lt;/a&gt; or on Julia Slack in #generative-ai.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post, we have seen how to automatically save conversations using PromptingTools.jl. By defining a custom schema and overloading the &lt;code&gt;finalize_tracer&lt;/code&gt; function, we can save conversations to files. We can also register a traced model and use it to generate text. Finally, we can load and export conversations in ShareGPT format for finetuning. With AIHelpMe.jl, we can serialize the whole &lt;code&gt;RAGResult&lt;/code&gt; with JSON3.&lt;/p&gt;

&lt;p&gt;Credit for the title image goes to DALL-E 3.&lt;/p&gt;

</description>
      <category>genai</category>
      <category>generativeai</category>
      <category>finetune</category>
    </item>
    <item>
      <title>The Hidden Cost of Locally Hosted Models: A Case Study</title>
      <dc:creator>Jan Siml</dc:creator>
      <pubDate>Sat, 20 Apr 2024 10:08:04 +0000</pubDate>
      <link>https://forem.julialang.org/svilupp/the-hidden-cost-of-locally-hosted-models-a-case-study-5871</link>
      <guid>https://forem.julialang.org/svilupp/the-hidden-cost-of-locally-hosted-models-a-case-study-5871</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Locally-hosted AI models may appear free, but they cost you valuable time—over 10 hours a year in our case study. Switch to a commercial API like Groq to save time, boost productivity, and gain nearly three extra days of coding annually for a dollar!&lt;/p&gt;

&lt;h2&gt;
  
  
  Would You Pay a Dollar to Buy 3 Extra Days This Year?
&lt;/h2&gt;

&lt;p&gt;Imagine you could buy time. Not in a metaphorical sense, but literally reclaim hours of your life lost to waiting. For those of us using locally hosted models for ad-hoc productivity tasks like coding assistance, this isn't just a daydream—it's a decision we face every day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Appreciating the Open-Source AI Ecosystem
&lt;/h2&gt;

&lt;p&gt;First, let's give credit where it's due. The thriving open-source ecosystem in generative AI deserves a massive shoutout. Organizations like Meta and Mistral have opened up their models, and platforms like Ollama and Llama.cpp have made these tools accessible for local use. This democratization of technology is nothing short of revolutionary. However, it's crucial to discuss the true cost of operating these technologies locally (by individuals, for ad-hoc tasks).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Costs of Local Hosting
&lt;/h2&gt;

&lt;p&gt;While the price tag on locally-hosted models might read "free," the reality is anything but. These models often underperform compared to their cloud-hosted counterparts (especially if you're GPU-poor), or they make you wait longer, and sometimes both. For example, using a locally-hosted model like Mixtral on Ollama, you might wait 20 seconds for a response that a commercial provider like Groq or Together could deliver in less than a second.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case Study: Daily Coding Assistance
&lt;/h3&gt;

&lt;p&gt;Let's break it down with a simple case study. Assume you're a developer making three LLM calls per hour during a three-hour coding session, each day for 250 days a year. That's 2250 LLM calls.&lt;/p&gt;

&lt;p&gt;With Ollama, a 20-second wait per call accumulates to over 12 hours spent just waiting annually. &lt;/p&gt;

&lt;p&gt;In contrast, using Groq's API, even with an extremely conservative 3-second wait (Llama 3 70b, a GPT-4-level model), you'd spend less than 2 hours waiting over the same period.&lt;/p&gt;

&lt;p&gt;The difference? &lt;strong&gt;More than 10 hours&lt;/strong&gt; saved—or, put another way, over 3 extra days of productive coding time each year. &lt;/p&gt;
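
&lt;p&gt;The arithmetic is easy to check for your own usage pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;calls_per_year = 3 * 3 * 250                  # 3 calls/hour x 3 hours/day x 250 days = 2250
local_wait_h   = calls_per_year * 20 / 3600   # 20 s per call -&amp;gt; 12.5 hours
cloud_wait_h   = calls_per_year * 3 / 3600    # 3 s per call -&amp;gt; ~1.9 hours
saved_days     = (local_wait_h - cloud_wait_h) / 3   # ~3.5 three-hour coding days
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;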

&lt;p&gt;And the cost of this extra time? Right now, it's FREE! Even under the announced pricing, it would be &lt;strong&gt;about \$1.5&lt;/strong&gt; per year.&lt;/p&gt;

&lt;p&gt;Moreover, with Groq, we assumed a GPT-4-level model! So you would likely benefit even more from much better answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Choose Cloud Providers?
&lt;/h2&gt;

&lt;p&gt;Given &lt;strong&gt;you have roughly 4,000 weeks on this earth&lt;/strong&gt;, spending any of them waiting on your GPU seems like a poor use of time. In a way, time is the scarcest resource, yet you throw it away to save fractions of a cent.&lt;/p&gt;

&lt;p&gt;Furthermore, you might lose out on innovations. Cloud providers continually upgrade their services with faster and more powerful models without requiring any effort on your part. Meanwhile, changing your local setup is a significant investment and it has its limits (VRAM...).&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Start?
&lt;/h2&gt;

&lt;p&gt;Switching is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign up for the &lt;a href="https://console.groq.com/keys"&gt;Groq API&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Set up your environment variable &lt;code&gt;GROQ_API_KEY&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use PromptingTools.jl with a Groq-hosted Llama3 70b, which I aliased with "gl70" (Groq Llama 70). This alias helps save time even when typing!&lt;/li&gt;
&lt;/ol&gt;
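
&lt;p&gt;For step 3, you can register the alias yourself, following the &lt;code&gt;MODEL_ALIASES&lt;/code&gt; pattern from PromptingTools.jl (the Groq model identifier below is an assumption; check the model registry for the exact name):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using PromptingTools
const PT = PromptingTools

# Register a short alias for the Groq-hosted Llama 3 70b
# ("llama3-70b-8192" is Groq's model id at the time of writing -- verify it)
PT.MODEL_ALIASES["gl70"] = "llama3-70b-8192"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;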

&lt;h3&gt;
  
  
  Example Usage
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;PromptingTools&lt;/span&gt;
&lt;span class="c"&gt;# Assumes you have set the environment variable GROQ_API_KEY&lt;/span&gt;

&lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="s"&gt;"In Julia, write a function `clean_names` that cleans up column names of a DataFrame"&lt;/span&gt;&lt;span class="n"&gt;gl70&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Info: Tokens: 411 @ Cost: \$0.0003 in 2.7 seconds
AIMessage("Here is a Julia function `clean_names` that cleans up column names of a DataFrame:
```

julia
using DataFrames
&amp;lt;...continues&amp;gt;


```
```

`

This simple setup can drastically cut down your waiting time, freeing up days for you to spend on more fulfilling activities or further innovation.

If you're familiar with the [PromptingTools.jl](https://github.com/svilupp/PromptingTools.jl) package, you know you can even set up an auto-fixing loop that will execute the generated code, analyze the error for feedback and retry automatically to fix any errors with Monte Carlo Tree Search (see `?airetry!` for more details).

```julia
using PromptingTools.Experimental.AgentTools: AIGenerate, run!, AICode
using PromptingTools.Experimental.AgentTools: airetry!, aicodefixer_feedback

result = AIGenerate(
    "In Julia, write a function `clean_names` that cleans up column names of a DataFrame";
    model = "gl70") |&amp;gt; run!

success_func(aicall) = AICode(aicall.conversation[end]) |&amp;gt; isvalid
feedback_func(aicall) = aicodefixer_feedback(aicall.conversation).feedback
airetry!(success_func, result, feedback_func; max_retries = 3)
```


## In Conclusion

While the allure of "free" local hosting is strong, the hidden costs in time can be substantial. By opting for a commercial solution like Groq's API, not only do you reclaim time lost to waiting, but you also benefit from superior model performance. The investment is minimal compared to the time you buy back—time that could be spent innovating, creating, or just enjoying life. Isn't that worth considering?

If you're looking to try, do it now while Groq is free!! [Get your API key here](https://console.groq.com/keys).

## Appendix

I made a claim that Llama 3 70b is a GPT-4 level model, check out our Leaderboard [here](https://svilupp.github.io/Julia-LLM-Leaderboard/dev/examples/summarize_results_local/#Model-Comparison) to see the results in an out-of-sample benchmark.

Credit for the title image goes to DALL-E 3.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>genai</category>
      <category>generativeai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The Latest Scoop on PromptingTools.jl</title>
      <dc:creator>Jan Siml</dc:creator>
      <pubDate>Fri, 05 Apr 2024 20:36:34 +0000</pubDate>
      <link>https://forem.julialang.org/svilupp/the-latest-scoop-on-promptingtoolsjl-n2n</link>
      <guid>https://forem.julialang.org/svilupp/the-latest-scoop-on-promptingtoolsjl-n2n</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;PromptingTools.jl just got a hefty update with several new versions, packed with new models, enhanced AI tools, and easier dataset prep, all thanks to a coffee-fueled solo developer who's now inviting others to join the coding party through GitHub issues. Dive in, contribute, and let's make magic together!&lt;/p&gt;

&lt;h2&gt;
  
  
  Dive Into the Latest and (Maybe) Greatest PromptingTools.jl Updates!
&lt;/h2&gt;

&lt;p&gt;Hello, Julia enthusiasts and AI aficionados! We've been busy tinkering in the Julia workshop, and guess what? We’ve rolled out not one, not two, but EIGHT new sub-versions of &lt;a href="https://github.com/svilupp/PromptingTools.jl"&gt;PromptingTools.jl&lt;/a&gt;! That’s right, we’ve been on a coding spree, fueled by too much coffee and an unyielding passion for making your lives a tad easier (and ours a bit more caffeinated).&lt;/p&gt;

&lt;p&gt;Start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Pkg&lt;/span&gt;
&lt;span class="n"&gt;Pkg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PromptingTools"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;PromptingTools&lt;/span&gt;
&lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="s"&gt;"How could I have lived without PromptingTools.jl for so long?"&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Tokens: 138 @ Cost: $0.0002 in 5.4 seconds&lt;/span&gt;
&lt;span class="c"&gt;## AIMessage("It's great to hear that you've found PromptingTools.jl to be a valuable tool! PromptingTools.jl is designed to streamline your workflow...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;What’s Fresh in PromptingTools.jl?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Supercharged Model Shenanigans&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AIGenerate Meets Anthropic:&lt;/strong&gt; Ready for a text generation party? Anthropic API is now on the guest list with its cool aliases – bring on the AI prose!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Extraction Gets Anthropic:&lt;/strong&gt; Ever wished for Claude from Anthropic to help you with data extraction? Wish granted! Now you can summon Claude 3 to pull out data like a digital magician. In other news, &lt;strong&gt;Claude 3 Haiku is amazing at parties&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GoogleGenAI Enters the Chat:&lt;/strong&gt; With a GOOGLE_API_KEY, you can now conjure up content with Google's Gemini model. It’s like having a Google genie but for AI text generation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model Registry Bonanza:&lt;/strong&gt; We’ve added some fancy new models to our registry like “nomic-embed-text” and “mxbai-embed-large.” Because who doesn’t like more toys in their AI sandbox?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Revamped RAGTools: Sleek, Fast, and Flexible
&lt;/h3&gt;

&lt;p&gt;RAGTools has received a major upgrade, making your AI adventures smoother and speedier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RAGTools Goes Binary:&lt;/strong&gt; Think binary is just for computers and that one friend who can only answer in yes/no? Well, now RAGTools speaks binary too, for embeddings that zip and zoom faster than you can say “BinaryCosineSimilarity()”. You should read this &lt;a href="https://huggingface.co/blog/embedding-quantization#binary-quantization-in-vector-databases"&gt;blog&lt;/a&gt;. There is a benchmark blog post upcoming!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Customizable RAG Interface:&lt;/strong&gt; For those who love to tweak and tailor, the new RAG interface is a game-changer. With &lt;code&gt;retrieve&lt;/code&gt; and &lt;code&gt;generate!&lt;/code&gt; functions and all their sub-steps properly separated and documented, you now have the power to craft a RAG pipeline that perfectly fits your project's needs, offering unparalleled flexibility in how you approach AI-driven tasks. See the documentation for more details.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Debugging &amp;amp; Analysis Tools:&lt;/strong&gt; We’ve introduced pretty-printing and support annotations because sometimes you need to read AI-generated content without squinting and sometimes you want to know when the model is lying to you.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Dataset Prep &amp;amp; Nifty Utilities&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Easier Dataset Prep:&lt;/strong&gt; We made dataset prep as easy as pie. Sadly, it doesn’t come with actual pie. But, JSONL format export? Yum!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Docs &amp;amp; Tools Galore:&lt;/strong&gt; Dive into our expanded docs with an "Extra Tools" section that’s like finding an extra fry at the bottom of the bag. Plus, FAQs to guide you through the thicket of common woes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;For the Adventurous Souls&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dabble in AI Art:&lt;/strong&gt; Feeling artsy? Our experimental support for image generation with DALL-E models lets you channel your inner digital Picasso.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Improvements &amp;amp; Bug Squashing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We’ve polished, tweaked, and outright cajoled PromptingTools.jl into a better version of itself, all while fixing those pesky bugs that love to play hide and seek.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Looking Forward (With Goggles On)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We're as excited as a kid in a candy store about these updates and can’t wait to see what you'll build, debug, or accidentally break with them. Your projects are the real MVPs here, and we’re just here to supply the tools (and occasionally entertain).&lt;/p&gt;

&lt;p&gt;So, grab the latest version of PromptingTools.jl, unleash your creativity, and remember: in the world of coding, the journey is half the fun, and the other half? Well, that’s debugging. Happy coding, and may your coffee be strong and your bugs few!&lt;/p&gt;

&lt;h2&gt;
  
  
  A Solo Journey (But Open to Hitchhikers):
&lt;/h2&gt;

&lt;p&gt;Plot twist: PromptingTools.jl has been mostly a solo quest. However, I’ve started mapping out all my wild ideas and to-dos on GitHub as &lt;a href="https://github.com/svilupp/PromptingTools.jl/issues"&gt;issues&lt;/a&gt;, inviting anyone keen to pitch in or tackle something that sparks their interest. It’s a chance to dive into the fray and help shape the future of this project. So, if you’re up for a bit of coding camaraderie, come join the adventure! The ultimate goal is to stabilize the functionality and interfaces and transfer the package to the JuliaGenAI organization for faster development.&lt;/p&gt;

&lt;p&gt;EDIT: The longer-term hope is to write an agent that will create all the PRs automatically :)&lt;/p&gt;

</description>
      <category>genai</category>
      <category>generativeai</category>
      <category>promptingtools</category>
    </item>
    <item>
      <title>Empowering AI with Knowledge: The New RAG Interface in PromptingTools</title>
      <dc:creator>Jan Siml</dc:creator>
      <pubDate>Fri, 05 Apr 2024 20:05:12 +0000</pubDate>
      <link>https://forem.julialang.org/svilupp/empowering-ai-with-knowledge-the-new-rag-interface-in-promptingtools-3n5n</link>
      <guid>https://forem.julialang.org/svilupp/empowering-ai-with-knowledge-the-new-rag-interface-in-promptingtools-3n5n</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;The new RAGTools module in the PromptingTools.jl package introduces enhanced modularity and straightforward extension capabilities, enabling developers and researchers to easily customize and build Retrieval-Augmented Generation (RAG) systems tailored to their specific needs in Julia.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The introduction of Retrieval-Augmented Generation (RAG) systems addresses key challenges in generative AI, notably the tendency to lack crucial information and produce hallucinated content. By integrating external knowledge, RAG systems significantly enhance the accuracy and reliability of AI-generated responses. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;RAGTools&lt;/code&gt; module within the PromptingTools.jl package enables the creation of such systems, offering a path to mitigate these issues. As this module matures, plans are in place to transition it into its own dedicated package, further facilitating the development and adoption of RAG systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAGTools Module: A Primer
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;RAGTools&lt;/code&gt; offers an experimental but formidable suite of utilities designed to facilitate the crafting of RAG applications with minimal fuss. Central to its arsenal is the &lt;code&gt;airag&lt;/code&gt; function, a master orchestrator that seamlessly combines AI insights with user-curated knowledge, unlocking new dimensions of accuracy and relevance in answers.&lt;/p&gt;

&lt;p&gt;You can get started very quickly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;# required dependencies to load the necessary extensions&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;LinearAlgebra&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SparseArrays&lt;/span&gt; 
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;PromptingTools&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;PromptingTools&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Experimental&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RAGTools&lt;/span&gt;
&lt;span class="c"&gt;# to access unexported functionality&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="n"&gt;RT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PromptingTools&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Experimental&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RAGTools&lt;/span&gt;

&lt;span class="c"&gt;## Sample data&lt;/span&gt;
&lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"Search for the latest advancements in quantum computing using Julia language."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"How to implement machine learning algorithms in Julia with examples."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Looking for performance comparison between Julia, Python, and R for data analysis."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Find Julia language tutorials focusing on high-performance scientific computing."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Search for the top Julia language packages for data visualization and their documentation."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"How to set up a Julia development environment on Windows 10."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Discover the best practices for parallel computing in Julia."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Search for case studies of large-scale data processing using Julia."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Find comprehensive resources for mastering metaprogramming in Julia."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Looking for articles on the advantages of using Julia for statistical modeling."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"How to contribute to the Julia open-source community: A step-by-step guide."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Find the comparison of numerical accuracy between Julia and MATLAB."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Looking for the latest Julia language updates and their impact on AI research."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"How to efficiently handle big data with Julia: Techniques and libraries."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Discover how Julia integrates with other programming languages and tools."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Search for Julia-based frameworks for developing web applications."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Find tutorials on creating interactive dashboards with Julia."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"How to use Julia for natural language processing and text analysis."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Discover the role of Julia in the future of computational finance and econometrics."&lt;/span&gt;
&lt;span class="x"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"Doc&lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;i"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;

&lt;span class="c"&gt;## Build the index&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;build_index&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;chunker_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;

&lt;span class="c"&gt;## Generate an answer&lt;/span&gt;
&lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"What are the best practices for parallel computing in Julia?"&lt;/span&gt;

&lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;airag&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="c"&gt;# short for airag(RAGConfig(), index; question)&lt;/span&gt;
&lt;span class="c"&gt;## Output:&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Done with RAG. Total cost: \$0.0&lt;/span&gt;
&lt;span class="c"&gt;## AIMessage("Some best practices for parallel computing in Julia include us...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Unveiling New Functionalities
&lt;/h2&gt;

&lt;p&gt;The latest update to the &lt;code&gt;RAGTools&lt;/code&gt; module introduces key features that enhance the creation and analysis of RAG systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Modular Interface&lt;/strong&gt;: The RAG pipeline is now broken down into distinct components (see details below), allowing users to customize and extend each phase with ease. Simply define a new type and method for only the components you wish to modify.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pipeline Transparency&lt;/strong&gt;: Users can now view detailed background information on the RAG pipeline, including the sources selected and the process at each stage (use &lt;code&gt;return_all=true&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Advanced RAG Functionality&lt;/strong&gt;: Default pipeline configuration now comes with question rephrasing, reranking results, and two-step answer refinement. There is even a &lt;code&gt;postprocess&lt;/code&gt; placeholder, so you can add some logging or other transformations. You can easily switch between different implementations thanks to Julia's method dispatch while calling the same top-level function.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Answer Annotation&lt;/strong&gt;: The final answers can be annotated with hallucination scores, showing the overlap with source materials and indicating the origin of specific information within the answer.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
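
&lt;p&gt;To illustrate the modular interface, here is a rough sketch of swapping in a custom rephrasing step. The abstract type and function names below are assumptions based on the pattern described; consult the RAGTools documentation for the exact interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using PromptingTools
const RT = PromptingTools.Experimental.RAGTools

# 1) Define a new type for the one step you want to customize
#    (type and function names are assumptions; check the docs)
struct KeywordRephraser &amp;lt;: RT.AbstractRephraser end

# 2) Add a method only for that step (toy logic: keep the longer words)
function RT.rephrase(::KeywordRephraser, question::AbstractString; kwargs...)
    return [join(filter(w -&amp;gt; length(w) &amp;gt; 3, split(question)), " ")]
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The rest of the pipeline stays untouched; thanks to method dispatch, you would simply pass your rephraser through the retriever's configuration.&lt;/p&gt;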

&lt;h3&gt;
  
  
  Answer Annotation Example
&lt;/h3&gt;

&lt;p&gt;With a small change, you can see which sources were used for each sentence in the answer (&lt;code&gt;[1]&lt;/code&gt;), how strongly they were supported (&lt;code&gt;[..,0.9]&lt;/code&gt;), and which "unknown" words were highlighted in magenta:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;airag&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_all&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pprint&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://forem.julialang.org/images/pks7k2FFxLXxulHMMaqj3x-zwMY9OtSOkdQugqz70zQ/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzLzZj/Mm13bDg1NGVlZzRw/MmNodmY0LnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/pks7k2FFxLXxulHMMaqj3x-zwMY9OtSOkdQugqz70zQ/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzLzZj/Mm13bDg1NGVlZzRw/MmNodmY0LnBuZw" alt="Annotated answer" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You immediately see that while a lot of the package names and macros look sensible, they did NOT come from our trusted knowledge base (all highlighted in magenta). In real life, we would also have clearly labelled source links that we could verify with one click.&lt;/p&gt;

&lt;p&gt;The annotation system is fully customizable (bring your own logic, styles, etc.).&lt;br&gt;
You can also obtain this information in HTML format to easily show it in your Genie apps!&lt;/p&gt;
&lt;h2&gt;
  
  
  A Closer Look at the Modular Interface
&lt;/h2&gt;

&lt;p&gt;At the heart of the new &lt;code&gt;RAGTools&lt;/code&gt; interface is its modular design, encouraging the interchange of pipeline components. This approach allows for extensive customization at every stage, from data preparation to answer generation, ensuring that developers can easily adapt the system to meet their specific needs.&lt;/p&gt;

&lt;p&gt;This system is designed for information retrieval and response generation, structured in three main phases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Preparation&lt;/strong&gt;, when you create an instance of &lt;code&gt;AbstractIndex&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval&lt;/strong&gt;, when you surface the most relevant chunks/items in the index and return an &lt;code&gt;AbstractRAGResult&lt;/code&gt;, which contains the references to the chunks (&lt;code&gt;AbstractCandidateChunks&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation&lt;/strong&gt;, when you generate an answer based on the context built from the retrieved chunks, returning either an &lt;code&gt;AIMessage&lt;/code&gt; or an &lt;code&gt;AbstractRAGResult&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The associated methods are: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;build_index&lt;/code&gt;&lt;/strong&gt;: Indexes relevant documents for retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;retrieve&lt;/code&gt;&lt;/strong&gt;: Selects pertinent information chunks based on the query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;generate!&lt;/code&gt;&lt;/strong&gt;: Produces the final answer using the retrieved data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;airag&lt;/code&gt; is simply a wrapper around &lt;code&gt;retrieve&lt;/code&gt; and &lt;code&gt;generate!&lt;/code&gt;, providing a convenient way to execute the entire RAG pipeline in one go.&lt;/p&gt;
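&lt;p&gt;To make the split concrete, here is a minimal sketch of the two-step pipeline versus the &lt;code&gt;airag&lt;/code&gt; wrapper. The file paths and the question are purely illustrative, and it assumes the API key for your default model is already configured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using PromptingTools.Experimental.RAGTools

# Preparation: index your documents (illustrative file paths)
index = build_index(["docs/intro.txt", "docs/api.txt"])

question = "How do I define a custom show method?"

# Retrieval + Generation, step by step...
result = retrieve(index, question)
result = generate!(index, result)

# ...or in one go via the convenience wrapper
result = airag(index; question, return_all = true)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;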

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/oiD_v0fxpNQ6oFQlfL9X3NvXDrQZyRBZ-o1iv3YBuhA/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9zaW1s/LmVhcnRoL1Byb21w/dGluZ1Rvb2xzLmps/L2Rldi9hc3NldHMv/cmFnX2RpYWdyYW1f/aGlnaGxldmVsLkRf/YUx1Z01MLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/oiD_v0fxpNQ6oFQlfL9X3NvXDrQZyRBZ-o1iv3YBuhA/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9zaW1s/LmVhcnRoL1Byb21w/dGluZ1Rvb2xzLmps/L2Rldi9hc3NldHMv/cmFnX2RpYWdyYW1f/aGlnaGxldmVsLkRf/YUx1Z01MLnBuZw" alt="High-level diagram" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that the first argument is always the main dispatching parameter that you can use to customize the behavior of the pipeline. This design ensures that users can easily swap out components or extend the system without disrupting the overall functionality.&lt;/p&gt;
&lt;h2&gt;
  
  
  RAG Pipeline Workflow
&lt;/h2&gt;

&lt;p&gt;The RAG pipeline is structured into distinct stages, each comprising several critical sub-steps to ensure the generation of accurate and relevant answers.&lt;/p&gt;

&lt;p&gt;If you want to change the behavior of any step, you can define a new type and method for that step.&lt;/p&gt;

&lt;p&gt;All customizations are subtypes of the corresponding abstract types, so use the &lt;code&gt;subtypes&lt;/code&gt; function to discover the currently available implementations, eg, &lt;code&gt;subtypes(AbstractReranker)&lt;/code&gt;.&lt;/p&gt;
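&lt;p&gt;For instance, to see which rerankers are currently available (the exact list depends on your installed version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using PromptingTools
const RT = PromptingTools.Experimental.RAGTools

# Discover the available implementations of a given step
subtypes(RT.AbstractReranker)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;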
&lt;h3&gt;
  
  
  Preparation Phase
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;build_index&lt;/code&gt;&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;get_chunks&lt;/code&gt;: Segments documents into manageable chunks.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_embeddings&lt;/code&gt;: Generates embeddings for similarity searches.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_tags&lt;/code&gt;: Tags chunks for efficient filtering.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Retrieval Phase
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;retrieve&lt;/code&gt;&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;rephrase&lt;/code&gt;: Optionally rephrases queries for better matching.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;find_closest&lt;/code&gt;: Identifies the most relevant document chunks.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;find_tags&lt;/code&gt;: Filters chunks based on specific tags.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rerank&lt;/code&gt;: Reranks chunks to prioritize the best matches.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Generation Phase
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;generate!&lt;/code&gt;&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;build_context!&lt;/code&gt;: Constructs the context from selected chunks for the answer.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;answer!&lt;/code&gt;: Generates a preliminary answer.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;refine!&lt;/code&gt;: Refines the answer for clarity and relevance.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;postprocess!&lt;/code&gt;: Applies final touches to prepare the response.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
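&lt;p&gt;The same type-and-method pattern applies to these sub-steps. As a sketch, here is a refiner that skips the refinement step entirely (the &lt;code&gt;AbstractRefiner&lt;/code&gt; name and the &lt;code&gt;refine!&lt;/code&gt; signature are assumed from the pipeline's naming pattern; verify them with &lt;code&gt;subtypes&lt;/code&gt; in your version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;import PromptingTools.Experimental.RAGTools: refine!
const RT = PromptingTools.Experimental.RAGTools

# A no-op refiner: keep the first answer as-is
struct NoRefine &amp;lt;: RT.AbstractRefiner end
refine!(::NoRefine, index, result; kwargs...) = result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;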

&lt;p&gt;A visual summary with the corresponding types:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/C5em0T_--QZwn1cgdD4JXvHL6cVJ1VTcuSVZt_ZPZHw/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL2Zl/a3Q3am83MTkycGRv/ZTJkcnp1LnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/C5em0T_--QZwn1cgdD4JXvHL6cVJ1VTcuSVZt_ZPZHw/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL2Zl/a3Q3am83MTkycGRv/ZTJkcnp1LnBuZw" alt="Detailed diagram" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Where to Start: Quick, Experiment, or Customize
&lt;/h3&gt;

&lt;p&gt;To operate the RAG system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quick Start&lt;/strong&gt;: Utilize &lt;code&gt;airag&lt;/code&gt; for an immediate, out-of-the-box solution, suitable for rapid testing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experimentation&lt;/strong&gt;: Leverage &lt;code&gt;RAGConfig&lt;/code&gt; to try out different implementations of &lt;code&gt;airag&lt;/code&gt;, tweaking the system for better performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customization&lt;/strong&gt;: Dive into &lt;code&gt;retrieve&lt;/code&gt; and &lt;code&gt;generate!&lt;/code&gt; for detailed customization, tailoring the process to your precise requirements.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  How to Customize the Pipeline
&lt;/h3&gt;

&lt;p&gt;If you want to customize the behavior of any step, define a new type and a corresponding method for the step you're changing, eg, to introduce a new reranker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;PromptingTools&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Experimental&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RAGTools&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rerank&lt;/span&gt;

&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="nc"&gt; MyReranker&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;:&lt;/span&gt; &lt;span class="n"&gt;AbstractReranker&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="n"&gt;rerank&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MyReranker&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then you would set the &lt;code&gt;retrieve&lt;/code&gt; step to use your custom &lt;code&gt;MyReranker&lt;/code&gt; via the &lt;code&gt;reranker&lt;/code&gt; keyword argument, eg, &lt;code&gt;retrieve(....; reranker = MyReranker())&lt;/code&gt; (or customize the top-level dispatching &lt;code&gt;AbstractRetriever&lt;/code&gt; struct).&lt;/p&gt;

&lt;h3&gt;
  
  
  Passing Keyword Arguments to Customize the Pipeline
&lt;/h3&gt;

&lt;p&gt;When you need to adjust specific aspects of the RAG pipeline, keyword arguments (kwargs) allow for targeted modifications. This approach is especially useful for customizing individual components within the system.&lt;/p&gt;

&lt;p&gt;To pinpoint the right keyword arguments (kwargs) for customization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consult the Diagram&lt;/strong&gt;: Review the RAG pipeline diagram or documentation. Identify the component you want to adjust.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the Format&lt;/strong&gt;: Apply &lt;code&gt;&amp;lt;dispatch_type&amp;gt;&lt;/code&gt; + &lt;code&gt;_kwargs&lt;/code&gt; for direct customizations. For nested adjustments, use prefixes that reflect the hierarchy (e.g., &lt;code&gt;retriever_kwargs -&amp;gt; rephraser_kwargs -&amp;gt; template&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach allows for precise tweaks at any level of the pipeline, ensuring your modifications target exactly what you need.&lt;/p&gt;

&lt;p&gt;Practically, for a broad configuration, you might start with a &lt;code&gt;RAGConfig&lt;/code&gt; instance, specifying components like the &lt;code&gt;AdvancedRetriever&lt;/code&gt; to enhance retrieval capabilities. Preparing kwargs in advance facilitates managing the intricacies of nested configurations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;cfg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RAGConfig&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AdvancedRetriever&lt;/span&gt;&lt;span class="x"&gt;())&lt;/span&gt;

&lt;span class="c"&gt;# Organize kwargs for clarity and manageability&lt;/span&gt;
&lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AdvancedRetriever&lt;/span&gt;&lt;span class="x"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;retriever_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;top_n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;rephraser_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="o"&gt;=:&lt;/span&gt;&lt;span class="n"&gt;RAGQueryHyDE&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"custom-model"&lt;/span&gt;
        &lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="x"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;generator_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;answerer_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"custom-answer-model"&lt;/span&gt;
        &lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="x"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;api_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"http://localhost:8080"&lt;/span&gt;
    &lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Execute with prepared arguments&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;airag&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In scenarios where direct interaction with components like the retriever is needed, configure its kwargs similarly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;retriever_kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;top_n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rephraser_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="o"&gt;=:&lt;/span&gt;&lt;span class="n"&gt;RAGQueryHyDE&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"custom-model"&lt;/span&gt;
    &lt;span class="x"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;api_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"http://localhost:8080"&lt;/span&gt;
    &lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Apply to the retriever function directly&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retrieve&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AdvancedRetriever&lt;/span&gt;&lt;span class="x"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;retriever_kwargs&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Delving deeper into the pipeline, for tasks such as rephrasing, specific kwargs can be directly applied to fine-tune the operation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;rephrase_kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"custom-model"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="o"&gt;=:&lt;/span&gt;&lt;span class="n"&gt;RAGQueryHyDE&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"http://localhost:8080"&lt;/span&gt;
    &lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Customize the rephrase step&lt;/span&gt;
&lt;span class="n"&gt;rephrased_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rephrase&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SimpleRephraser&lt;/span&gt;&lt;span class="x"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;rephrase_kwargs&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This structured approach to passing kwargs ensures that each stage of the RAG pipeline can be precisely controlled and customized, allowing for a tailored question-answering system that meets specific needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Custom Indexes or Vector Databases
&lt;/h3&gt;

&lt;p&gt;The default &lt;code&gt;RAGTools&lt;/code&gt; implementation uses an in-memory index suitable for datasets of up to 100,000 chunks. For larger datasets or specific indexing needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Define a Custom Index&lt;/strong&gt;: Create a new index by extending &lt;code&gt;AbstractChunkIndex&lt;/code&gt;. Use the &lt;code&gt;ChunkIndex&lt;/code&gt; as a guide for required fields.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Customize Interaction Methods&lt;/strong&gt;: Implement new methods for your index to integrate with the retrieval process of the RAG pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Share Your Implementation&lt;/strong&gt;: Contributions of integrations with common vector databases are welcome. They enrich the community's resources, enabling more versatile RAG applications.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You would use the same approach to build a hybrid index (semantic search + BM25).&lt;/p&gt;
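&lt;p&gt;A hedged skeleton of what such an extension could look like (all names here are hypothetical, and the exact &lt;code&gt;find_closest&lt;/code&gt; signature may differ in your version, so check the &lt;code&gt;ChunkIndex&lt;/code&gt; methods first):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using PromptingTools
const RT = PromptingTools.Experimental.RAGTools

# Hypothetical index backed by an external vector database
struct MyVectorDBIndex &amp;lt;: RT.AbstractChunkIndex
    # eg, a connection handle and a collection name
end

# Hook into the retrieval step of the pipeline
function RT.find_closest(index::MyVectorDBIndex, query_emb::AbstractVector; top_k = 100, kwargs...)
    # query the database and return the candidate chunks
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;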

&lt;p&gt;This approach allows &lt;code&gt;RAGTools&lt;/code&gt; to accommodate a broader range of applications, from large-scale datasets to specialized indexing strategies, enhancing its utility and adaptability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The latest enhancements in the &lt;code&gt;RAGTools&lt;/code&gt; module are a leap forward in democratizing the development of RAG systems. By blending ease of use with deep customizability, we open new avenues for developers and researchers to explore AI-driven question-answering possibilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  We Want to Hear from You!
&lt;/h2&gt;

&lt;p&gt;Your feedback and use cases are crucial as we refine &lt;code&gt;RAGTools&lt;/code&gt; and prepare to carve it out into its own package. Whether you're exploring the vanilla implementation or integrating vector databases, share your insights with us. Your contributions are key to enhancing this interface, making it more robust and versatile for the community. Help shape the future of &lt;code&gt;RAGTools&lt;/code&gt;—join us in this exciting journey towards a more powerful and user-friendly generative AI toolkit.&lt;/p&gt;




&lt;p&gt;Credit for the title image goes to DALL-E 3.&lt;/p&gt;

</description>
      <category>genai</category>
      <category>generativeai</category>
      <category>rag</category>
    </item>
    <item>
      <title>A 7 Billion Parameter Model that Beats GPT-4 on Julia Code?</title>
      <dc:creator>Jan Siml</dc:creator>
      <pubDate>Thu, 14 Mar 2024 09:16:40 +0000</pubDate>
      <link>https://forem.julialang.org/svilupp/a-7-billion-parameter-model-that-beats-gpt-4-on-julia-code-51j2</link>
      <guid>https://forem.julialang.org/svilupp/a-7-billion-parameter-model-that-beats-gpt-4-on-julia-code-51j2</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Fine-tuning AI models for specialized tasks is both cost-effective and straightforward, needing only a few examples and less than a dollar, especially when leveraging tools like Axolotl to simplify the process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;What if I told you that a David-sized AI model just outsmarted Goliath GPT-4 in Julia code generation? Welcome to the tale of Cheater-7B, our pint-sized hero, whose adventure into fine-tuning showcases the might of focused AI training. The best part? This entire transformation took just 1 hour and cost less than fifty cents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cheater-7B
&lt;/h2&gt;

&lt;p&gt;Cheater-7B is a nimble 7 billion parameter model, fine-tuned to perfection on its task. Despite its size, even the quantized version (GGUF Q5) beats GPT-4 in Julia code generation on our &lt;a href="https://github.com/svilupp/Julia-LLM-Leaderboard"&gt;leaderboard&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/KOpDWsEARmywLd1MnGKD1AhTmigD668fs13oGHblqOc/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL25q/dHMwZ2htZnltamRp/azNhdWU1LnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/KOpDWsEARmywLd1MnGKD1AhTmigD668fs13oGHblqOc/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL25q/dHMwZ2htZnltamRp/azNhdWU1LnBuZw" alt="Cheater-7b Performance" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Is It Possible? Fine-tuning + Cheating!
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fine-tuning
&lt;/h3&gt;

&lt;p&gt;Yes, this blog post is a bit of a joke! It is not about the model itself; it's actually a brief introduction to &lt;strong&gt;fine-tuning&lt;/strong&gt;, which allows you to “tune” a smaller model to perform like a big one on a &lt;strong&gt;specific task&lt;/strong&gt; (this part is very important).&lt;/p&gt;

&lt;p&gt;Fine-tuning Cheater-7B on a select 11 problems demonstrates that you don't need vast datasets to achieve significant improvements. This little giant not only excelled in familiar territory but also showed promising signs of learning from new, unseen challenges (see the Appendix!)&lt;/p&gt;

&lt;h3&gt;
  
  
  Beyond Cheating
&lt;/h3&gt;

&lt;p&gt;Yes, Cheater-7B got a head start by "cheating" on the test! We fine-tuned it on 11/14 test cases in our leaderboard (the one we compare models on) - this happens more often than you think in the real world (often unconsciously).&lt;/p&gt;

&lt;p&gt;But the real story here is the power of fine-tuning - our model turned out to be better than the base model (the model we fine-tuned from) in some of the unseen test cases as well! Clearly, it picked up some Julia knowledge along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine-tuning 101
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Is It?
&lt;/h3&gt;

&lt;p&gt;Fine-tuning a model involves adjusting a pre-trained machine learning model's parameters so it can better perform on a specific task, effectively leveraging the model's learned knowledge (probability distribution of the next token) and adapting it to new, related challenges with a relatively small dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Fine-Tuning Should Be Your New Best Friend
&lt;/h3&gt;

&lt;p&gt;Fine-tuning stands out for specific tasks (ie, narrow domains) that demand efficiency and privacy. It's akin to sharpening your tools to ensure they cut cleaner and faster, all while keeping the costs astonishingly low.&lt;/p&gt;

&lt;p&gt;Once you build your Generative AI system, sooner or later you will have to route some of the simple requests to smaller fine-tuned models as part of the optimization process. Everyone does that, even the big players.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding the Limits of Fine-Tuning
&lt;/h3&gt;

&lt;p&gt;While fine-tuning can transform a general AI model into a specialist, it's not a silver bullet. This process excels at refining a model's existing knowledge to perform specific tasks (eg, adjusting the format or style, embedding certain prompts or examples) with greater accuracy or up-weighting/surfacing certain knowledge (eg, Julia) to be used more.&lt;/p&gt;

&lt;p&gt;However, it does have many limitations. It's not very effective for tasks that require the model to learn entirely new information or skills from scratch. For such challenges, you might need to incorporate additional learning methods, like Retrieval Augmented Generation (RAG), to supplement the model's capabilities. In essence, fine-tuning adjusts the focus of the lens but doesn't replace the lens altogether.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Started with Fine-Tuning: Easier Than You Think
&lt;/h3&gt;

&lt;p&gt;Diving into fine-tuning is more accessible than ever, thanks to user-friendly tools like Axolotl. This approach not only simplifies the process but also opens the door to a collaborative effort in building specialized, efficient AI models for specific needs.&lt;/p&gt;

&lt;p&gt;You need very little data to get started - we used just 11 test cases.&lt;/p&gt;

&lt;p&gt;You can find all the required resources and recipes &lt;a href="https://github.com/svilupp/Julia-LLM-Leaderboard/tree/main/experiments/cheater-7b-finetune"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cheater-7B Experiment: Fast, Affordable, Enlightening
&lt;/h3&gt;

&lt;p&gt;The journey of creating Cheater-7B was a lesson in efficiency itself: just 1 hour of processing on a cloud GPU, with an investment that didn't even hit the half-dollar mark. This experiment underscores the practicality and accessibility of fine-tuning for AI enthusiasts and professionals alike.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Started with Fine-Tuning Data
&lt;/h3&gt;

&lt;p&gt;Your first step in fine-tuning is to gather examples, specifically AI conversations that align with the skills you're aiming to enhance (eg, good Julia conversations/exchanges). To save these conversations for later use, you can employ &lt;code&gt;save_conversation&lt;/code&gt; from the PromptingTools package, which saves a conversation to JSON.&lt;/p&gt;
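&lt;p&gt;For example, you could capture and save a conversation like this (the prompt and file path are illustrative; assumes a configured API key):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using PromptingTools

# Keep the full conversation, not just the last message
conversation = aigenerate("Write a function that sums the columns of a matrix."; return_all = true)

# Save it as JSON for later fine-tuning
PromptingTools.save_conversation("julia_conversations/sum_columns.json", conversation)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;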

&lt;p&gt;If you're looking for a communal space to store and share these conversations, consider contributing to an open-source project. Open a pull request at &lt;a href="https://github.com/svilupp/Julia-LLM-Leaderboard/tree/main/julia_conversations"&gt;Julia-LLM-Leaderboard's Julia Conversations&lt;/a&gt; to add your valuable data to the collective repository. &lt;br&gt;
This folder also shows example code snippets on how to save your conversations from PromptingTools.&lt;/p&gt;

&lt;p&gt;I hope to write a detailed walkthrough of the process soon, but for now, you can find all the required resources and recipes &lt;a href="https://github.com/svilupp/Julia-LLM-Leaderboard/tree/main/experiments/cheater-7b-finetune"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Cheater-7B's story is more than a quirky anecdote; it's a compelling illustration of how fine-tuning can unlock the potential of AI models, transforming them into task-specific powerhouses. As we continue to explore and share our experiences, the possibilities for innovation and improvement in AI are boundless. &lt;/p&gt;

&lt;p&gt;Got a cool idea or breakthrough with your fine-tuning experiments? Share it in the &lt;a href="https://julialang.slack.com/archives/C06G90C697X"&gt;generative-ai channel on Julia Slack&lt;/a&gt; and inspire the community with your innovation!&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Discover Axolotl: &lt;a href="https://github.com/OpenAccess-AI-Collective/axolotl"&gt;Axolotl&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Explore the Julia LLM Leaderboard: &lt;a href="https://github.com/svilupp/Julia-LLM-Leaderboard"&gt;Julia LLM Leaderboard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Resources to Train Your Cheater-7B: &lt;a href="https://github.com/svilupp/Julia-LLM-Leaderboard/tree/main/experiments/cheater-7b-finetune"&gt;Cheater-7B experiment&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Saving Conversations with PromptingTools: &lt;a href="https://github.com/svilupp/Julia-LLM-Leaderboard/tree/main/julia_conversations"&gt;Julia Conversations folder&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Trained Cheater-7B Model: &lt;a href="https://huggingface.co/svilupp/cheater-7b/tree/main"&gt;Cheater-7B Model&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Extra Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Is it expensive?&lt;/strong&gt;&lt;br&gt;
The process of fine-tuning Cheater-7B was surprisingly affordable, costing less than half a dollar. By renting a cloud GPU from Jarvislabs.io and opting for a spot instance outside of peak hours, the entire fine-tuning operation on an RTX A5000 was completed in about an hour for just $0.39.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How was Cheater-7b trained? Is it difficult?&lt;/strong&gt;&lt;br&gt;
Training Cheater-7B was streamlined and accessible, thanks to the Axolotl tool. &lt;/p&gt;

&lt;p&gt;Axolotl simplifies the fine-tuning process, making it approachable even for those new to machine learning. With just a few commands in the CLI, a configuration YAML file, and the selected dataset, Cheater-7B was fine-tuned efficiently. This ease of use demystifies the process, making advanced AI techniques available to a broader audience.&lt;/p&gt;

&lt;p&gt;See the example configuration in the Resources section.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Where did you get the data?&lt;/strong&gt;&lt;br&gt;
The data for fine-tuning Cheater-7B came from the Julia LLM Leaderboard, focusing on solutions that demonstrated excellence and diversity. Specifically, we took the top 50 solutions that scored full points (100 points) for 11 out of the 14 test cases across different prompts. &lt;/p&gt;

&lt;p&gt;The associated code is available in the Resources section.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Can I try/use the model?&lt;/strong&gt;&lt;br&gt;
Yes, of course. Download the LORA adapter or the quantized version from &lt;a href="https://huggingface.co/svilupp/cheater-7b/tree/main"&gt;here&lt;/a&gt;.&lt;br&gt;
I'd recommend using &lt;code&gt;llama.cpp&lt;/code&gt; or &lt;code&gt;Llama.jl&lt;/code&gt; to run it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Did we not just memorize the results?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, partially! See the performance of each model (and GPT4 for comparison) on various test cases below.&lt;/p&gt;

&lt;p&gt;We fine-tuned our model on the first 11 test cases. It has never seen any of the last 3 test cases: &lt;code&gt;q_and_a_extractor&lt;/code&gt;, &lt;code&gt;pig_latinify&lt;/code&gt;, and &lt;code&gt;extract_julia_code&lt;/code&gt;. These are the hardest test cases in our leaderboard, and you can see that even GPT4 struggles to produce "executable" code (&amp;gt;50 points) for these.&lt;/p&gt;

&lt;p&gt;The 11 training cases didn't teach our model much about &lt;code&gt;pig_latinify&lt;/code&gt; (requires knowledge of multi-threading and associated libraries) and &lt;code&gt;extract_julia_code&lt;/code&gt; (requires large models because there can be multiple nested levels of triple backticks and strings in the inputs, which trips up most models).&lt;/p&gt;

&lt;p&gt;However, the performance on &lt;code&gt;q_and_a_extractor&lt;/code&gt; has increased significantly compared to both GPT4 and the base model! It's likely because the model learned how to do Regex operations in Julia and how to navigate the return types better.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/dCVU5-82sWf9UwyTyKt2G2tMX9zt27b2yJMZ_VWU8cI/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL21j/MHl4aHExemVyMTRp/ZXc4OGxzLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/dCVU5-82sWf9UwyTyKt2G2tMX9zt27b2yJMZ_VWU8cI/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL21j/MHl4aHExemVyMTRp/ZXc4OGxzLnBuZw" alt="Test Case Comparison" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Credit for the title image goes to DALL-E 3.&lt;/p&gt;

</description>
      <category>genai</category>
      <category>generativeai</category>
      <category>finetuning</category>
    </item>
    <item>
      <title>Six Steps to Success: Designing and Delivering Your First Generative AI Demo</title>
      <dc:creator>Jan Siml</dc:creator>
      <pubDate>Fri, 08 Mar 2024 20:42:57 +0000</pubDate>
      <link>https://forem.julialang.org/svilupp/six-steps-to-success-designing-and-delivering-your-first-generative-ai-demo-25k4</link>
      <guid>https://forem.julialang.org/svilupp/six-steps-to-success-designing-and-delivering-your-first-generative-ai-demo-25k4</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Discover how to create an engaging and effective Generative AI demo in Julia with six crucial tips, focusing on simplifying technical complexities, crafting a compelling narrative, and enhancing user experience with stunning UI and prompt caching. This guide ensures your demo captures the imagination of your stakeholders and showcases the potential of GenAI technology.&lt;/p&gt;

&lt;h2&gt;
  
  
  Crafting Your First Generative AI Demo: A Guide to Wow Your Stakeholders
&lt;/h2&gt;

&lt;p&gt;In the rapidly evolving world of Generative AI (GenAI), demonstrating the capabilities of your solution in a way that captivates and convinces stakeholders is more crucial than ever. A well-crafted demo can serve as a powerful tool to showcase technical possibilities and ignite the imagination of your audience. However, the goal here is not to present a polished, ready-to-use product but to illuminate the potential applications of GenAI in a vivid, engaging manner. &lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Purpose of Your Demo
&lt;/h2&gt;

&lt;p&gt;Before diving into the mechanics of building your demo, it's essential to distinguish between a demo and a Minimum Viable Product (MVP). A demo is a showcase, designed to highlight what's possible with GenAI, helping stakeholders envision how they might use such technology in their own contexts. It’s about painting a picture of the future, not delivering the final product for immediate use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Six Essential Tips for a Successful GenAI Demo
&lt;/h2&gt;

&lt;p&gt;Crafting a demo that stands out requires more than just technical know-how. It demands strategic planning, creativity, and a focus on the end-user experience. Let’s explore six tips that can make your GenAI demo a resounding success.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tip 1: Balance Your Focus Away from the Technical
&lt;/h2&gt;

&lt;p&gt;When preparing your demo, follow the 33/33/33 rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;33% Good Planning&lt;/strong&gt;: Dedicate a third of your efforts to planning. Identify the core feature or capability — your "wow" factor — and develop a "screenplay" that showcases it compellingly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;33% Simplifying Technical Aspects&lt;/strong&gt;: Don’t get bogged down in technical perfection. Your demo should be simple yet effective, highlighting GenAI’s capabilities without unnecessary complexity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;33% Polishing the UI&lt;/strong&gt;: Aesthetics matter. Spend a third of your time ensuring the user interface (UI) is clean, engaging, and intuitive. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's dive into the specifics of each step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tip 2: Find Your “Wow” and Write a “Screenplay” for it
&lt;/h2&gt;

&lt;p&gt;Determine what aspect of your GenAI solution will most impress your audience. Is it the interface, the novel insights it generates, or its ability to synthesize and summarize complex information? &lt;/p&gt;

&lt;p&gt;Once identified, craft a &lt;strong&gt;detailed screenplay&lt;/strong&gt; for your demo. This script should outline every step of the demo, simulating a real user interaction (it will help you with the technical simplifications!). Focus on showcasing this core feature and simplify everything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tip 3: Break Bigger Tasks into Individual “Skills”
&lt;/h2&gt;

&lt;p&gt;Instead of striving to create a GenAI solution that can do everything, break down the larger workflow/conversation into smaller, discrete tasks or “skills.”&lt;/p&gt;

&lt;p&gt;For example, is there a web search feature? A set of specific questions it can answer? An email draft? Each one of these is a "skill" that can be built independently for faster iteration and more reliability (think "input -&amp;gt; output").&lt;/p&gt;

&lt;p&gt;Your demo can then call on these skills separately (without any preceding conversation history) to keep things simple. It will just look like a big conversation, but it's actually a series of smaller, more reliable interactions.&lt;/p&gt;

&lt;p&gt;This approach allows you to highlight specific strengths of your solution without overcomplicating the demo. Think of each skill as a standalone feature that, when combined, showcases the versatility and power of your GenAI solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tip 4: Remove Any Unnecessary Complexity
&lt;/h2&gt;

&lt;p&gt;Your demo should be as straightforward as possible. Avoid complex setups like chained Large Language Model calls, which can introduce unnecessary points of failure, and don't waste time building things you don't need!&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you need some data from the database? Pick the 20 most interesting records that will deliver the "wow".&lt;/li&gt;
&lt;li&gt;Do you need some web scraping? Copy &amp;amp; paste the few pages you need manually.&lt;/li&gt;
&lt;li&gt;Do you need some LLM router (to pick the right "skill")? You could use &lt;code&gt;aiclassify&lt;/code&gt; to do that, but simple IF conditions with &lt;code&gt;occursin()&lt;/code&gt; are often good enough. Thanks to your screenplay, you know exactly what to expect and when, so you can keep it simple.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To be clear, you don't need to follow the "screenplay" word for word, but it's a good guide to keep things simple and focused. This focus on the user experience over technical complexity will make your demo more accessible and impactful.&lt;/p&gt;
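
&lt;p&gt;For illustration, a keyword-based router along these lines could be sketched as follows (the "skills" here are hypothetical placeholders, not part of any package):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;# Hypothetical "skills" -- each is a simple input -&amp;gt; output function
search_web(query) = "Top results for: $query"
draft_email(query) = "Draft email about: $query"

# Route to the right skill with plain IF conditions and `occursin`
function route(user_input)
    input = lowercase(user_input)
    if occursin("search", input)
        search_web(user_input)
    elseif occursin("email", input)
        draft_email(user_input)
    else
        "Sorry, I don't have that skill yet."
    end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each branch is a standalone "input -&amp;gt; output" function, so you can build, cache, and test it independently.&lt;/p&gt;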

&lt;h2&gt;
  
  
  Tip 5: Enhance Your Demo with a Stunning UI
&lt;/h2&gt;

&lt;p&gt;Leverage tools like GenieFramework's &lt;a href="https://github.com/GenieFramework/stipple.jl"&gt;Stipple.jl&lt;/a&gt; to quickly develop a beautiful UI for your demo. &lt;br&gt;
With just a few lines of code, you can create an application that not only functions well but also looks professional and engaging. &lt;/p&gt;

&lt;p&gt;You can find a basic example of a Stipple app below - it's less than 100 lines of code (50 active lines)! Code is provided in the Appendix and you can run it from your Julia REPL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/gQCFeigcjRX9M2jB4f2waHNxkdgx5L-9zpJtUl1N1Jg/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzLzFv/NG9mZGJzbHU1ZWts/eGE1MzJjLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/gQCFeigcjRX9M2jB4f2waHNxkdgx5L-9zpJtUl1N1Jg/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzLzFv/NG9mZGJzbHU1ZWts/eGE1MzJjLnBuZw" alt="Stipple.jl UI Example" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In other news, the team behind GenieFramework has just launched their new no-code builder. Make sure to check it out: &lt;a href="https://info.juliahub.com/web-applications-in-julia-with-genie-builder"&gt;Web Applications in Julia with Genie Builder&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Tip 6: Utilize Prompt Caching for a Smoother Experience
&lt;/h2&gt;

&lt;p&gt;Implement prompt caching to eliminate latency and ensure a fluid demo experience. This strategy involves storing and quickly retrieving responses for common queries or inputs, thus avoiding the need for real-time generation during the demo. It's not about deceiving your audience but about showcasing your GenAI solution's potential without technical hitches or delays (you would optimize the latency in production use cases anyway).&lt;/p&gt;

&lt;p&gt;There are &lt;a href="https://github.com/marius311/Memoization.jl"&gt;Memoization.jl&lt;/a&gt; and &lt;a href="https://github.com/JuliaCollections/Memoize.jl"&gt;Memoize.jl&lt;/a&gt;, but neither of them supports caching to disk, so the cache won't survive a REPL restart.&lt;/p&gt;

&lt;p&gt;I prefer to use a simple &lt;code&gt;Dict&lt;/code&gt; and an if-else statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;# Remember the conversation via key: `hash(conversation)`&lt;/span&gt;
&lt;span class="n"&gt;CACHE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Dict&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;UInt64&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;MyMessage&lt;/span&gt;&lt;span class="x"&gt;}()&lt;/span&gt;
&lt;span class="n"&gt;aigenerate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;# mock-up only, you would need to convert MyMessage to PromptingTools types for it to work&lt;/span&gt;

&lt;span class="c"&gt;# Conversation 1&lt;/span&gt;
&lt;span class="n"&gt;conv1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MyMessage&lt;/span&gt;&lt;span class="x"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"I am a user"&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;MyMessage&lt;/span&gt;&lt;span class="x"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"I am Genie"&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;output1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MyMessage&lt;/span&gt;&lt;span class="x"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"Nice to meet you, Genie!"&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;CACHE&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conv1&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output1&lt;/span&gt;

&lt;span class="c"&gt;# Example use&lt;/span&gt;
&lt;span class="n"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conv1&lt;/span&gt; &lt;span class="c"&gt;# known conversation&lt;/span&gt;
&lt;span class="c"&gt;## conversation = [MyMessage("New conversation", true)] # unknown conversation&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;haskey&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CACHE&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
    &lt;span class="nd"&gt;@info&lt;/span&gt; &lt;span class="s"&gt;"&amp;gt; Cache hit!"&lt;/span&gt;
    &lt;span class="n"&gt;output_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CACHE&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="nd"&gt;@info&lt;/span&gt; &lt;span class="s"&gt;"&amp;gt; Cache miss! Generating response..."&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aigenerate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;# Save the response for later&lt;/span&gt;
    &lt;span class="n"&gt;CACHE&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The beauty of this approach is that:&lt;/p&gt;

&lt;p&gt;1) You can decide whether to cache the whole conversation or only the last user message (keep it simple as per Tip 3!)&lt;br&gt;
2) You can then serialize the &lt;code&gt;Dict&lt;/code&gt; to disk and load it back when you restart your REPL. &lt;/p&gt;
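
&lt;p&gt;For the second point, Julia's built-in Serialization standard library is enough; a minimal sketch (the file name is arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using Serialization

# Save the cache to disk before closing the REPL
serialize("demo_cache.jls", CACHE)

# ...and load it back after a restart
CACHE = deserialize("demo_cache.jls")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;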
&lt;h3&gt;
  
  
  Bonus Tip: Implement Quick Actions
&lt;/h3&gt;

&lt;p&gt;To make your demo even more engaging, incorporate dynamic "quick action" buttons that guide users through predefined next steps or use cases. &lt;/p&gt;

&lt;p&gt;This feature not only makes the demo more feature-rich but also ensures a smoother experience by reducing the uncertainty of open-ended interactions. Quick action buttons can be easily implemented in Stipple, enhancing the flow of your presentation and making it easier for your audience to understand the full capabilities of your GenAI solution. &lt;/p&gt;

&lt;p&gt;Additionally, by defining these actions in advance, you can more effectively leverage prompt caching, ensuring that each demonstration runs smoothly and without delay.&lt;/p&gt;
&lt;h2&gt;
  
  
  Appendix: GenieFramework UI Example
&lt;/h2&gt;

&lt;p&gt;If you want to see how easy it is to create a stunning UI for your GenAI demo, here's a basic example using GenieFramework's Stipple.jl.&lt;/p&gt;

&lt;p&gt;First, install GenieFramework (PromptingTools is not required; just comment it out!).&lt;br&gt;
Second, run the below code in your Julia REPL (or save it to a script and run it from there).&lt;br&gt;
Once the server starts, it will tell you to navigate to &lt;code&gt;http://127.0.0.1:8000&lt;/code&gt; in your browser to see the UI (or just click on the link in the REPL).&lt;/p&gt;

&lt;p&gt;If you have any questions, there is a dedicated Genie channel on the JuliaLang Slack and the Genie team also runs a great &lt;a href="https://discord.gg/fHa9GVaP"&gt;Discord server&lt;/a&gt; where you can get help!&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="n"&gt;App&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;PromptingTools&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;GenieFramework&lt;/span&gt; &lt;span class="c"&gt;# GenieFramework v2.1.0&lt;/span&gt;
&lt;span class="nd"&gt;@genietools&lt;/span&gt;

&lt;span class="c"&gt;# ! Params&lt;/span&gt;
&lt;span class="n"&gt;GENIE_IMG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://easydrawingguides.com/wp-content/uploads/2021/10/how-to-draw-genie-from-aladdin-featured-image-1200.png"&lt;/span&gt;
&lt;span class="n"&gt;INTRO_MESSAGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"Welcome back, Jan!"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"What can I help you with today? Eg, `example ABC`"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
&lt;span class="x"&gt;]&lt;/span&gt;

&lt;span class="c"&gt;### Helpful functions&lt;/span&gt;
&lt;span class="s"&gt;"MyMessage is a struct that represents a message in the chat"&lt;/span&gt;
&lt;span class="nd"&gt;@kwdef&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="nc"&gt; MyMessage&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Genie"&lt;/span&gt;
    &lt;span class="n"&gt;avatar&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;Union&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;Nothing&lt;/span&gt;&lt;span class="x"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;nothing&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;AbstractVector&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;:&lt;/span&gt;&lt;span class="kt"&gt;AbstractString&lt;/span&gt;&lt;span class="x"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="x"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;from_user&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;Bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="s"&gt;"Create a `MyMessage` from a user or from Genie"&lt;/span&gt;
&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; MyMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;AbstractVector&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;:&lt;/span&gt;&lt;span class="kt"&gt;AbstractString&lt;/span&gt;&lt;span class="x"&gt;},&lt;/span&gt; &lt;span class="n"&gt;from_user&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;Bool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;MyMessage&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;from_user&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avatar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;from_user&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="nb"&gt;nothing&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GENIE_IMG&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;from_user&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="s"&gt;"me"&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Genie"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="n"&gt;MyMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;AbstractString&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_user&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;Bool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MyMessage&lt;/span&gt;&lt;span class="x"&gt;([&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;from_user&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;### Dashboard logic&lt;/span&gt;
&lt;span class="nd"&gt;@appname&lt;/span&gt; &lt;span class="n"&gt;MyDemoApp&lt;/span&gt;

&lt;span class="nd"&gt;@app&lt;/span&gt; &lt;span class="k"&gt;begin&lt;/span&gt;
    &lt;span class="nd"&gt;@in&lt;/span&gt; &lt;span class="n"&gt;btn_send&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;
    &lt;span class="nd"&gt;@in&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;
    &lt;span class="nd"&gt;@in&lt;/span&gt; &lt;span class="n"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MyMessage&lt;/span&gt;&lt;span class="x"&gt;[]&lt;/span&gt;
    &lt;span class="nd"&gt;@onchange&lt;/span&gt; &lt;span class="n"&gt;isready&lt;/span&gt; &lt;span class="k"&gt;begin&lt;/span&gt;
        &lt;span class="nd"&gt;@info&lt;/span&gt; &lt;span class="s"&gt;"&amp;gt; Dashboard is ready"&lt;/span&gt;
        &lt;span class="n"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MyMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INTRO_MESSAGE&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="nd"&gt;@onbutton&lt;/span&gt; &lt;span class="n"&gt;btn_send&lt;/span&gt; &lt;span class="k"&gt;begin&lt;/span&gt;
        &lt;span class="nd"&gt;@info&lt;/span&gt; &lt;span class="s"&gt;"&amp;gt; User said: &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;user_input"&lt;/span&gt; &lt;span class="c"&gt;# for tracking in REPL&lt;/span&gt;
        &lt;span class="c"&gt;# Easy way to reset conversation -&amp;gt; just send "reset"&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lowercase&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"reset"&lt;/span&gt;
            &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MyMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INTRO_MESSAGE&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt;
        &lt;span class="k"&gt;elseif&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;isempty&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
            &lt;span class="c"&gt;## New converation message&lt;/span&gt;
            &lt;span class="n"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;push!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MyMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
            &lt;span class="c"&gt;# Genie's response logic goes BELOW, eg, `aigenerate(user_input)`&lt;/span&gt;
            &lt;span class="n"&gt;genie_says&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Hey... I'm still learning. I don't know how to respond to that yet."&lt;/span&gt;
            &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="c"&gt;# empty the user input&lt;/span&gt;
            &lt;span class="n"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;push!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MyMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genie_says&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c"&gt;### Dashboard UI&lt;/span&gt;
&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; ui&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;
    &lt;span class="x"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;heading&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"My First Genie Demo"&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt;
        &lt;span class="c"&gt;## Row 1: Chat&lt;/span&gt;
        &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
            &lt;span class="x"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;cell&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"st-module"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
                    &lt;span class="x"&gt;[&lt;/span&gt;
                        &lt;span class="c"&gt;## awesome trick that allows to pass a vector of messages (=`conversation`) and generates an object for each&lt;/span&gt;
                        &lt;span class="n"&gt;chatmessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="s"&gt;"message.text"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="s"&gt;"message.name"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="s"&gt;"message.from_user"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avatar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="s"&gt;"message.avatar"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="nd"&gt;@for&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"message in conversation"&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="s"&gt;"message.id"&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt;
                    &lt;span class="x"&gt;]),&lt;/span&gt;
            &lt;span class="x"&gt;]),&lt;/span&gt;

        &lt;span class="c"&gt;## Row 2: Input From User&lt;/span&gt;
        &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="x"&gt;([&lt;/span&gt;
            &lt;span class="n"&gt;cell&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"st-module"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
                &lt;span class="x"&gt;[&lt;/span&gt;
                    &lt;span class="n"&gt;Html&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;div&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"input-group"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
                        &lt;span class="x"&gt;[&lt;/span&gt;
                            &lt;span class="n"&gt;textfield&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Waiting for your requests... Try: `&amp;lt;example command&amp;gt;`"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
                                &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
                                &lt;span class="nd"&gt;@on&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"keyup.enter"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"btn_send = !btn_send"&lt;/span&gt;&lt;span class="x"&gt;)),&lt;/span&gt;
                            &lt;span class="n"&gt;btn&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Send"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="nd"&gt;@click&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;btn_send&lt;/span&gt;&lt;span class="x"&gt;)),&lt;/span&gt;
                        &lt;span class="x"&gt;])]),&lt;/span&gt;
        &lt;span class="x"&gt;]),&lt;/span&gt;
    &lt;span class="x"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="nd"&gt;@page&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ui&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;# Start the server&lt;/span&gt;
&lt;span class="n"&gt;Genie&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isrunning&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;webserver&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="c"&gt;# end of module&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Credit for the title image goes to DALL-E 3.&lt;/p&gt;

</description>
      <category>genai</category>
      <category>generativeai</category>
      <category>demo</category>
    </item>
    <item>
      <title>Navigating Your First GenAI Project: A Blueprint for Success</title>
      <dc:creator>Jan Siml</dc:creator>
      <pubDate>Tue, 13 Feb 2024 10:01:38 +0000</pubDate>
      <link>https://forem.julialang.org/svilupp/navigating-your-first-genai-project-a-blueprint-for-success-5fo9</link>
      <guid>https://forem.julialang.org/svilupp/navigating-your-first-genai-project-a-blueprint-for-success-5fo9</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Embarking on your first Generative AI project can be daunting. Avoid common pitfalls by following these five practical tips: 1) Start simple and build iteratively, 2) Start from the end, 3) Use the best model available and manage costs wisely, 4) Start with a commercial API, and 5) Prepare your "vibe" check. Each tip is designed to streamline your project’s development, ensuring efficiency and effectiveness from inception to execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I've observed numerous individuals repeating the same errors in their projects, so I hope the guidance provided here will help you avoid these common pitfalls and steer your project toward success.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Start Simple and Build Iteratively
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Key Idea&lt;/strong&gt;: Break your grand vision into manageable, discrete tasks. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: If designing an AI to generate news articles, start by focusing on creating compelling headlines. Once mastered, expand to introductory paragraphs, and so forth. This step-by-step approach mitigates risk and builds towards complexity gradually. &lt;/p&gt;

&lt;p&gt;If there are multiple GenAI steps, start with just one. Why? If each of the three consecutive steps has a 70% chance of success, the overall probability of succeeding at all three drops to around 1 in 3 - that's not a good starting point! So go step by step.&lt;/p&gt;
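&lt;p&gt;That compounding can be checked with a quick sketch (&lt;code&gt;pipeline_success&lt;/code&gt; is a hypothetical helper; the 70% per-step rate is just the illustrative figure from above):&lt;/p&gt;

```julia
# Probability that every step of a pipeline succeeds, assuming the
# steps are independent and each succeeds with probability p.
pipeline_success(p, n_steps) = p^n_steps

p = 0.7                        # illustrative per-step success rate
prob3 = pipeline_success(p, 3)
println(prob3)                 # ≈ 0.343, i.e., roughly 1 in 3
```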

&lt;h2&gt;
  
  
  2. Start from the End
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Key Idea&lt;/strong&gt;: Visualize each step's inputs and outputs to guide development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: For an AI-powered fitness app, sketch out the final user interaction—say, providing personalized workout plans based on user input (e.g., available equipment, fitness level). Create an example of one conversation, or one input &amp;amp; output set and start by getting that to work.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Use the Best Model Available and Manage Costs Wisely
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Key Idea&lt;/strong&gt;: Opt for the highest quality AI model to test your project's potential, but be smart about data usage to keep costs in check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: Using GPT-4 Turbo might reveal that your idea is feasible, whereas starting with a smaller model could lead to unnecessary troubleshooting, blaming the idea's failure on the model's limitations. If you're worried about costs, start with a small dataset and see what you can learn from it.&lt;/p&gt;

&lt;p&gt;People often overestimate how much the best models cost. For example, the blog post I wrote about analyzing themes in the City of Austin survey covered roughly 3,000 verbatims (~400K characters), and embedding ALL of them cost ~$0.002! Generating the topics with GPT-4 Turbo cost less than half a cent!&lt;/p&gt;
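&lt;p&gt;For the curious, here is a back-of-the-envelope sketch of that embedding cost (the ~4 characters-per-token heuristic and the per-token price are assumptions; check your provider's current pricing):&lt;/p&gt;

```julia
# Rough embedding-cost estimate: characters, then tokens, then dollars.
chars = 400_000                # ~3000 verbatims from the survey
tokens = chars / 4             # heuristic: ~4 characters per token
price_per_1m_tokens = 0.02     # USD per 1M tokens (assumed; verify!)
cost = tokens / 1_000_000 * price_per_1m_tokens
println(cost)                  # ≈ $0.002, matching the figure above
```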

&lt;h2&gt;
  
  
  4. Start with a Commercial API
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Key Idea&lt;/strong&gt;: Commercial APIs save time and offer efficiency, outweighing the cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: Using OpenAI instead of Ollama might be cheaper and much faster!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consider the &lt;strong&gt;hidden cost of locally hosted models&lt;/strong&gt;: Suppose you're choosing between Ollama Mixtral, which takes 30 seconds for a task, and GPT-3.5 Turbo, which takes 2 seconds; the latter often provides better results. If you value your time at $20 an hour, using Ollama Mixtral, despite being free, effectively costs you about 15 cents due to the longer wait, compared to the negligible 0.5 cents for GPT-3.5 Turbo's quicker completion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimize your experiment cycle time&lt;/strong&gt;: The duration to test a new idea or modification, known as "experiment cycle time," is crucial. Opting for a commercial API justifies its cost by enabling the parallelization of tasks—what might take one GPU with Ollama Mixtral a considerable time to process can be done almost instantaneously with commercial APIs. For instance, you could execute 100 calls in the time it takes Ollama to complete just one, significantly accelerating development and reducing the effective cost of your time even further.&lt;/li&gt;
&lt;/ul&gt;
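&lt;p&gt;The time-cost argument in the first bullet can be made concrete with a small sketch (&lt;code&gt;effective_cost&lt;/code&gt; is a hypothetical helper; the latencies and the $20/hour rate are the illustrative figures from the bullet, and the $0.005 API fee is an assumed ballpark):&lt;/p&gt;

```julia
# Effective cost of one call: the API fee plus the value of your waiting time.
effective_cost(api_fee, latency_s; hourly_rate = 20.0) =
    api_fee + latency_s / 3600 * hourly_rate

local_cost  = effective_cost(0.0, 30.0)   # "free" local model, 30 s per call
hosted_cost = effective_cost(0.005, 2.0)  # commercial API, 2 s per call
println((local_cost, hosted_cost))        # the "free" model costs roughly 10x more
```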

&lt;h2&gt;
  
  
  5. Prepare Your "Vibe" Check
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Key Idea&lt;/strong&gt;: Establish a mini benchmark for your project's core functionality and continuously assess progress against it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical Application&lt;/strong&gt;: Identify 2-3 key input-output pairs that encapsulate each task's essence. They should be &lt;strong&gt;challenging and complementary&lt;/strong&gt; (i.e., not duplicative, since you wouldn't learn anything new). Regularly review your application against these pairs to notice when performance drops. While thorough evaluations will come later, these early checks are crucial for maintaining direction and focus.&lt;/p&gt;
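&lt;p&gt;A "vibe check" can be as simple as a handful of input-output pairs run after every change. A minimal sketch (&lt;code&gt;my_pipeline&lt;/code&gt; is a hypothetical stand-in for your actual GenAI step):&lt;/p&gt;

```julia
# Mini benchmark: a few input/expected pairs to catch regressions early.
my_pipeline(x) = uppercase(x)   # hypothetical placeholder for your GenAI step

vibe_checks = [
    ("hello", "HELLO"),
    ("julia", "JULIA"),
]

results = [my_pipeline(input) == expected for (input, expected) in vibe_checks]
println("Passed $(count(results))/$(length(results)) vibe checks")
```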

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Starting your first GenAI project is an exciting venture filled with opportunities and challenges. By adhering to these five tips, you position your project for success from the outset. Simplify your approach, prioritize quality, manage resources wisely, and maintain a clear vision of your project's objectives. This blueprint will guide you through the complexities of GenAI development, ensuring a smooth and productive journey from concept to completion.&lt;/p&gt;

&lt;p&gt;Credit for the title image goes to DALL-E 3.&lt;/p&gt;

</description>
      <category>generativeai</category>
      <category>genai</category>
    </item>
    <item>
      <title>The Quest for Ultimate Productivity: Building an LLM-Powered Assistant</title>
      <dc:creator>Jan Siml</dc:creator>
      <pubDate>Mon, 05 Feb 2024 10:01:33 +0000</pubDate>
      <link>https://forem.julialang.org/svilupp/the-quest-for-ultimate-productivity-building-an-llm-powered-assistant-21nk</link>
      <guid>https://forem.julialang.org/svilupp/the-quest-for-ultimate-productivity-building-an-llm-powered-assistant-21nk</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;A quick preview of my journey in developing a proof of concept for a personalized LLM-powered assistant, aiming to streamline daily productivity tasks. You can do the same!&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Hello, fellow productivity enthusiasts! 🌟 Ever find yourself drowning in a sea of tasks, with your desk looking more like a paper warehouse and your inbox resembling a bottomless pit? Yeah, me too. It’s the 21st-century dilemma: so much to do, yet so little time. 🕰️&lt;/p&gt;

&lt;p&gt;Over the years, I've devoured productivity books and experimented with every app under the sun, from GTD to Timeboxing, and from Wunderlist to Motion. Despite my efforts, something always felt off. 📚✖️&lt;/p&gt;

&lt;p&gt;One day I saw a &lt;a href="https://x.com/MilesCranmer/status/1738222999063650474?s=20"&gt;tweet&lt;/a&gt; from Miles Cranmer about his Notion &amp;amp; LangChain project.&lt;br&gt;
That's when a lightbulb went off 💡: what I need is a super "narrow" LLM-powered assistant, fine-tuned just for me! Imagine an assistant that knows only 2-3 tasks but executes them with unparalleled precision, because it knows you! &lt;/p&gt;
&lt;h2&gt;
  
  
  🛠️ Building the Dream Assistant
&lt;/h2&gt;

&lt;p&gt;In the past, I encountered a few key challenges, so I wanted to address them head-on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ease of Use&lt;/strong&gt;: Annotating tasks felt like a chore. Who has 5 minutes to fill out a task form? 🤷 My goal was to enable the assistant to understand tasks from just a simple sentence, removing the need for detailed input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fresh Starts&lt;/strong&gt;: I ditched rolling over unfinished tasks to keep each day's slate clean, focusing solely on the present day's priorities. That was the failure point of auto-scheduling in Motion - at some point, the conflict backlog explodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Realistic Boundaries&lt;/strong&gt;: To counteract my tendency to stretch myself too thin, the assistant now evaluates my daily capacity, scheduling tasks only when there is space left and ensuring I set achievable goals by warning me when I'm overloading my schedule. Duration estimation is out of my hands, which will hopefully improve the quality of the estimates.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Integrating these improvements, the assistant now supports a streamlined approach to productivity: it focuses on which tasks are essential and when they'll be tackled, while keeping daily planning realistic. This refined tool is all about enhancing daily productivity without the overhead, starting each day anew, and keeping ambitions in check.&lt;/p&gt;
&lt;h2&gt;
  
  
  🤖 Integration and Automation
&lt;/h2&gt;

&lt;p&gt;Connecting Notion was a no-brainer since my to-do list resides in Notion, making for a seamless transition. Now with their Calendar app, it's an even stronger proposition. I rolled up my sleeves and crafted quick macros: &lt;code&gt;@tadd&lt;/code&gt; for adding tasks (because who doesn’t love a good abbreviation?) and &lt;code&gt;@cadd&lt;/code&gt; for calendar additions (tasks meant to be auto-scheduled). It was like giving my assistant its own language. 🗣️💬&lt;/p&gt;

&lt;p&gt;Here’s a sneak peek of how it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;## Call out the tasks to schedule for today:&lt;/span&gt;
&lt;span class="nd"&gt;@cadd&lt;/span&gt; &lt;span class="s"&gt;"Hack up POC for GolemScheduler.jl"&lt;/span&gt;
&lt;span class="nd"&gt;@cadd&lt;/span&gt; &lt;span class="s"&gt;"LLMTextAnalysis.jl: add a warning if the provided strings are empty or if length is &amp;gt;10K"&lt;/span&gt;
&lt;span class="nd"&gt;@cadd&lt;/span&gt; &lt;span class="s"&gt;"Ping James about the progress"&lt;/span&gt;
&lt;span class="nd"&gt;@cadd&lt;/span&gt; &lt;span class="s"&gt;"Analyze the data for project XYZ"&lt;/span&gt;

&lt;span class="c"&gt;## [ Info: Processing task...&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Tokens: 1282 @ Cost: \$0.0159 in 17.4 seconds&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Scheduling task&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Scheduled for 2024-02-05T08:00:00 - 2024-02-05T10:00:00&lt;/span&gt;
&lt;span class="c"&gt;## CreatedPage @ https://www.notion.so/Develop-Proof-of-Concept-for-GolemScheduler-jl-732e8e0b0f4e4b0aa7d07ae3911f99fd&lt;/span&gt;

&lt;span class="c"&gt;## [ Info: Processing task...&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Tokens: 1274 @ Cost: \$0.0154 in 11.8 seconds&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Scheduling task&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Scheduled for 2024-02-05T10:00:00 - 2024-02-05T11:00:00&lt;/span&gt;
&lt;span class="c"&gt;## CreatedPage @ https://www.notion.so/Add-warning-functionality-to-LLMTextAnalysis-jl-0e9ebf4a13234424a247fac1256d4285&lt;/span&gt;

&lt;span class="c"&gt;## [ Info: Processing task...&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Tokens: 1207 @ Cost: \$0.0138 in 8.5 seconds&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Scheduling task&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Scheduled for 2024-02-05T11:00:00 - 2024-02-05T11:15:00&lt;/span&gt;
&lt;span class="c"&gt;## CreatedPage @ https://www.notion.so/Ping-James-about-the-progress-00f9acf18e014eca89ca40d136c81a43&lt;/span&gt;

&lt;span class="c"&gt;## [ Info: Processing task...&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Tokens: 1221 @ Cost: \$0.0141 in 7.7 seconds&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Scheduling task&lt;/span&gt;
&lt;span class="c"&gt;## ┌ Warning: No available slot found. `overflow` will be set to `true`.&lt;/span&gt;
&lt;span class="c"&gt;## └ @ Main ~/golem_scheduler/api_services.jl:181&lt;/span&gt;
&lt;span class="c"&gt;## CreatedPage @ https://www.notion.so/Analyze-the-data-for-project-XYZ-2e8f182d7c454f38a02c53b330b08367&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is a snapshot of my day in Notion:&lt;br&gt;
&lt;a href="https://forem.julialang.org/images/2ND8MkJ8DkUKOQFvONd2YKU23BGGzLM63bqa5yGYsK0/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3F3/enFrdjB5aDc4OWQ1/bWRjM2I1LnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/2ND8MkJ8DkUKOQFvONd2YKU23BGGzLM63bqa5yGYsK0/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3F3/enFrdjB5aDc4OWQ1/bWRjM2I1LnBuZw" alt="A Notion screenshot of my day" width="800" height="997"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  📅 When Plans Overflow
&lt;/h2&gt;

&lt;p&gt;As you can see, it happened again - I was too ambitious and the last task simply didn't fit in the allocated time.&lt;/p&gt;

&lt;p&gt;So my assistant nudged me with: “Warning: No available slot found. &lt;code&gt;overflow&lt;/code&gt; will be set to &lt;code&gt;true&lt;/code&gt;.” This is my cue to hop into Notion and play Tetris with my tasks, filtering on “overflow=true.” 🚫📆&lt;/p&gt;

&lt;p&gt;You can see that it automatically slots tasks into the corresponding section of My Day in Notion.&lt;/p&gt;

&lt;p&gt;The best part? The links next to "CreatedPage" in the logs are clickable, taking me directly to the task in Notion if I want to quickly edit anything. It’s like having a personal assistant who knows exactly what I need when I need it. 🤖👩‍💼&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;So, there you have it—a glimpse into my journey of building a personalized LLM-powered assistant. It’s still early days, but the potential to revolutionize personal productivity is immense.&lt;/p&gt;

&lt;p&gt;Are you intrigued by the idea of crafting your own productivity sidekick? Or perhaps you’re just here for the techy talk and tales of trial and error. Either way, I’d love to hear your thoughts! 🗨️💭&lt;/p&gt;

&lt;p&gt;If there’s enough interest, I might just package this up and share it with the world. Until then, let’s keep pushing the boundaries of what’s possible, one task at a time. 🚀&lt;/p&gt;




&lt;p&gt;Remember, the journey to productivity is as much about the tools we use as it is about the mindset we cultivate. Stay curious, stay inventive, and most importantly, stay productive! 🌈✨&lt;/p&gt;

&lt;p&gt;Credit for the title image goes to DALL-E 3.&lt;/p&gt;

</description>
      <category>genai</category>
      <category>generativeai</category>
      <category>productivity</category>
      <category>golem</category>
    </item>
    <item>
      <title>Duplicate No More Pt. 2: Mastering LLM-as-a-Judge Scoring</title>
      <dc:creator>Jan Siml</dc:creator>
      <pubDate>Fri, 26 Jan 2024 23:55:34 +0000</pubDate>
      <link>https://forem.julialang.org/svilupp/duplicate-no-more-pt-2-mastering-llm-as-a-judge-scoring-51ff</link>
      <guid>https://forem.julialang.org/svilupp/duplicate-no-more-pt-2-mastering-llm-as-a-judge-scoring-51ff</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Explore three LLM judge scoring techniques - additive scoring, linguistic calibration scales, and categorical scoring - applied to the art of data deduplication, enhancing accuracy and consistency in identifying duplicates in contact datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome back to our journey into the world of data deduplication using Language Model (LLM) judges. In our last episode, we navigated the basics; now, we're diving deeper to stabilize and tune our LLM's judgment capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LLM-as-a-Judge Challenge
&lt;/h2&gt;

&lt;p&gt;LLMs as judges are increasingly popular, yet their calibration remains a topic of hot debate. A recent &lt;a href="https://twitter.com/aparnadhinak/status/1748368364395721128?s=46&amp;amp;t=LqkQn2Q2J-NjCeYA4p2Dbg"&gt;Twitter post&lt;/a&gt; highlighted how uncalibrated LLMs can be. In our own deduplication experiment last time, we faced similar challenges, prompting us to seek more stable and consistent methods. In particular, GPT-3.5 struggled to provide consistent results aligned with our expectations, while GPT-4 performed well but was still volatile across subsequent runs, with scores clumped around the same numbers (instead of spanning the full 0-100 range).&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting the Stage
&lt;/h2&gt;

&lt;p&gt;Let's revisit the FEBRL 1 dataset. We'll continue using this as our testing ground with the same setup as in our previous episode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;DataFramesMeta&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CSV&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;LinearAlgebra&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dot&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Statistics&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;PromptingTools&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="n"&gt;PT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PromptingTools&lt;/span&gt;

&lt;span class="c"&gt;# Load the FEBRL 1 dataset.&lt;/span&gt;
&lt;span class="c"&gt;# The Freely Extensible Biomedical Record Linkage (Febrl) package is distributed with a dataset generator and four datasets are generated with the generator. This function returns the first Febrl dataset as a pandas.DataFrame.&lt;/span&gt;
&lt;span class="c"&gt;# “This data set contains 1000 records (500 original and 500 duplicates, with exactly one duplicate per original record.”&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CSV&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"febrl-dataset1.csv"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)))&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;@chain&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="k"&gt;begin&lt;/span&gt;
    &lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;AbstractString&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;.=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ByRow&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;renamecols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="nd"&gt;@rtransform&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Contact details: &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(:given_name) &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(:surname), living at &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(:street_number) &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(:address_1), &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(:address_2), &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(:suburb), Postcode: &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(:postcode), State: &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(:state)"&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c"&gt;## embed the texts&lt;/span&gt;
&lt;span class="n"&gt;embs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aiembed&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="c"&gt;# pairwise distances -- you could do it much faster with Distances.jl package&lt;/span&gt;
&lt;span class="n"&gt;dists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;let&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;
    &lt;span class="n"&gt;dists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Float32&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
    &lt;span class="nd"&gt;@inbounds&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;
            &lt;span class="n"&gt;dists&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;@view&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt; &lt;span class="o"&gt;.*&lt;/span&gt; &lt;span class="nd"&gt;@view&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]))&lt;/span&gt;
            &lt;span class="n"&gt;dists&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dists&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="n"&gt;dists&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c"&gt;# for a given record, find the top 10 closest records&lt;/span&gt;
&lt;span class="n"&gt;let&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="n"&gt;dupe_idxs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sortperm&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dists&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rev&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="nd"&gt;@chain&lt;/span&gt; &lt;span class="k"&gt;begin&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dupe_idxs&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
        &lt;span class="nd"&gt;@transform&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;dists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dists&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dupe_idxs&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;dists&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;given_name&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;surname&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;street_number&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;address_1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;address_2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;suburb&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;postcode&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example for record 3 and its closest "candidate" for a duplicate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Contact details: deakin sondergeld, living at 48 goldfinch circuit, kooltuo, canterbury, Postcode: 2776, State: vic"

"Contact details: deakin sondergeld, living at 231 goldfinch circuit, kooltuo, canterbury, Postcode: 2509, State: vic"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Temperature
&lt;/h2&gt;

&lt;p&gt;The first lesson is small but important. The &lt;code&gt;temperature&lt;/code&gt; parameter is a key setting for most practical LLM applications. It controls the randomness of the outputs: a higher temperature produces more varied outputs, which is useful for creative tasks but not for our deduplication task. We want consistent results, so we should set the temperature a bit lower, e.g., 0.3. We can set this in PromptingTools with &lt;code&gt;aigenerate(...; api_kwargs = (;temperature = 0.3))&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Play around with the temperature and see how it affects the results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scoring Methods
&lt;/h2&gt;

&lt;p&gt;You'll notice that we rarely write our scoring system from scratch. We take an existing prompt from elsewhere and ask GPT-4 to adapt it to our needs with a standard Chain-of-Thought (CoT) approach.&lt;/p&gt;

&lt;p&gt;We'll explore three different scoring methods today:&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 1: Additive Scoring System
&lt;/h2&gt;

&lt;p&gt;Based on the &lt;a href="https://arxiv.org/pdf/2401.10020.pdf"&gt;"Self-Rewarding Language Models"&lt;/a&gt; paper, we asked GPT-4 to tailor a prompt for our deduplication task (we don't show the full process here for brevity).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;## notice that added information about the task and then simply copied the scoring system from the Appendix of the paper&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"""
You're a professional record linkage engineer. 

Your task is to design clear evaluation criteria to compare a pair of contact details and judge whether they are duplicates or not.
Prepare an additive 0-5 points system, where more points indicate higher likelihood of being duplicates.

Example contact: 
"&lt;/span&gt;&lt;span class="n"&gt;Contact&lt;/span&gt; &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;james&lt;/span&gt; &lt;span class="n"&gt;waller&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;living&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="n"&gt;tullaroop&lt;/span&gt; &lt;span class="n"&gt;street&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;willaroo&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt; &lt;span class="n"&gt;james&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Postcode&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4011&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;State&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WA&lt;/span&gt;&lt;span class="s"&gt;". So you can see there is a full name, address, postcode and a state.

Adapt the following template criteria to match our use case of matching two contact records.
---
Review the user’s question and the corresponding response using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the response is relevant and provides some information related to the user’s inquiry, even if it is incomplete or contains some irrelevant content.

- Add another point if the response addresses a substantial portion of the user’s question, but does not completely resolve the query or provide a direct answer.

- Award a third point if the response answers the basic elements of the user’s question in a useful way, regardless of whether it seems to have been written by an AI Assistant or if it has elements typically found in blogs or search results.

- Grant a fourth point if the response is clearly written from an AI Assistant’s perspective, addressing the user’s question directly and comprehensively, and is well-organized and helpful, even if there is slight room for improvement in clarity, conciseness or focus.

- Bestow a fifth point for a response that is impeccably tailored to the user’s question by an AI Assistant, without extraneous information, reflecting expert knowledge, and demonstrating a high-quality, engaging, and insightful answer.

User: &amp;lt;INSTRUCTION_HERE&amp;gt;

&amp;lt;response&amp;gt;&amp;lt;RESPONSE_HERE&amp;gt;&amp;lt;/response&amp;gt;

After examining the user’s instruction and the response:

- Briefly justify your total score, up to 100 words.

- Conclude with the score using the format: “Score: &amp;lt;total points&amp;gt;”

Remember to assess from the AI Assistant perspective, utilizing web search knowledge as necessary. To evaluate the response in alignment with this additive scoring model, we’ll systematically attribute points based on the outlined criteria.
---

First, think through step by step how one recognizes two duplicate records, what are the situations in which two pairs of records refer to the same person but differ in various fields.

Second, write a brief and concise 5-point system to evaluate a pair of contacts

"""&lt;/span&gt;
&lt;span class="c"&gt;## remember to return the whole conversation, so you can iterate on it and improve it&lt;/span&gt;
&lt;span class="n"&gt;conv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aigenerate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"gpt4t"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_all&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We ended up with the following prompt (after a few inline edits):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;dedupe_template1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SystemMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"""
        You're a world-class record linkage engineer. 

        Compare two contact records and determine whether they refer to the same person using the additive 5-point scoring system described below. 

        Points are accumulated based on the satisfaction of each criterion:

        1. **Name Match (1 point):** Award 1 point if the names are exact matches or plausible variations/aliases of each other (e.g., "&lt;/span&gt;&lt;span class="n"&gt;Jim&lt;/span&gt;&lt;span class="s"&gt;" and "&lt;/span&gt;&lt;span class="n"&gt;James&lt;/span&gt;&lt;span class="s"&gt;").

        2. **Address Similarity (1 point):** Add +1 point if the addresses are identical or have minor discrepancies that could be typographical errors or data entry errors or formatting differences.

        3. **Postcode Consistency (1 point):** Add +1 point if the postcodes are the same. Postcodes are less prone to variation, so a mismatch here could indicate different individuals.

        4. **State Agreement (1 point):** Add +1 point if the state information matches. Mismatched states can be a strong indicator of different individuals unless there is evidence of a recent move.

        5. **Overall Cohesion (1 point):** Add 1 point if the overall comparison of the records suggests they are referring to the same person. This includes considering any supplementary information that supports the likelihood of a match, such as similar contact numbers or email addresses.

        This system allows for a maximum of 5 points, with a higher score indicating a greater likelihood that the two records are duplicates. Points cannot be deducted.
        Each criterion should be evaluated with the understanding that real-world data can have inconsistencies and errors, requiring a balance between exact matches and reasonable allowances for differences.
        Keep track of the accumulated points so far with each criterion.
                """&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UserMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"""
&amp;lt;record-1&amp;gt; {{record1}} &amp;lt;/record-1&amp;gt;
&amp;lt;record-2&amp;gt; {{record2}} &amp;lt;/record-2&amp;gt;

After detailed examination of the two records:
- Briefly justify your total score, up to 100 words.
- Conclude with the total score.
- Use the following output format: "&lt;/span&gt;&lt;span class="n"&gt;Justification&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;justify&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;go&lt;/span&gt; &lt;span class="n"&gt;criterion&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;criterion&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;\&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;\&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Score&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;”&lt;/span&gt;

&lt;span class="n"&gt;To&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;alignment&lt;/span&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="n"&gt;this&lt;/span&gt; &lt;span class="n"&gt;additive&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;we’ll&lt;/span&gt; &lt;span class="n"&gt;systematically&lt;/span&gt; &lt;span class="n"&gt;attribute&lt;/span&gt; &lt;span class="n"&gt;points&lt;/span&gt; &lt;span class="n"&gt;based&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;outlined&lt;/span&gt; &lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
            &lt;span class="s"&gt;""")]

## get the closest candidate for a duplicate
dupe_idxs = sortperm(dists[3, :], rev=true)
msg = aigenerate(dedupe_template1; 
    record1=df[3, :text_blob], record2=df[dupe_idxs[2], :text_blob], model="&lt;/span&gt;&lt;span class="n"&gt;gpt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;3.5&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;turbo&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1106&lt;/span&gt;&lt;span class="s"&gt;", 
    api_kwargs=(; temperature=0.3))

## GPT-3.5 Turbo Outputs
##
## [ Info: Tokens: 644 @ Cost: \&lt;/span&gt;&lt;span class="si"&gt;$0.0008&lt;/span&gt;&lt;span class="s"&gt; in 4.0 seconds
##
## AIMessage("&lt;/span&gt;&lt;span class="n"&gt;Justification&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; 
&lt;span class="c"&gt;## - Name Match: The names are an exact match, so 1 point is awarded.&lt;/span&gt;
&lt;span class="c"&gt;## - Address Similarity: The addresses have a minor discrepancy in the street number, but the rest of the address is identical, so 1 point is awarded.&lt;/span&gt;
&lt;span class="c"&gt;## - Postcode Consistency: The postcodes are different, indicating a potential mismatch, so no points are awarded.&lt;/span&gt;
&lt;span class="c"&gt;## - State Agreement: The states match, so 1 point is awarded.&lt;/span&gt;
&lt;span class="c"&gt;## - Overall Cohesion: There is no additional information to support a match, so no points are awarded.&lt;/span&gt;
&lt;span class="c"&gt;##&lt;/span&gt;
&lt;span class="c"&gt;## Score: 3")&lt;/span&gt;

&lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aigenerate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dedupe_template1&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; 
    &lt;span class="n"&gt;record1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;record2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dupe_idxs&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt4t"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;

&lt;span class="c"&gt;## GPT-4 Turbo Outputs&lt;/span&gt;
&lt;span class="c"&gt;##&lt;/span&gt;
&lt;span class="c"&gt;## AIMessage("Justification: Starting with the Name Match, both records have the exact same name "deakin sondergeld," which earns them 1 point. For Address Similarity, although both addresses are on Goldfinch Circuit in Kooltuo, Canterbury, the house numbers are significantly different (48 vs. 231), suggesting they might not be typographical errors, so no point is awarded here. The Postcode Consistency criterion is not met, as the postcodes are different (2776 vs. 2509), resulting in no point added. State Agreement is present, with both records listing "vic" as the state, adding 1 point. Lastly, the Overall Cohesion does not strongly suggest these are the same person due to significant address and postcode discrepancies, so no additional point is awarded.&lt;/span&gt;
&lt;span class="c"&gt;##&lt;/span&gt;
&lt;span class="c"&gt;## Score: 2")&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system showed potential in reasoning about data similarities, offering a nuanced approach to score assignments. It grounds the model better, so the scores from different models are more consistent.&lt;/p&gt;

&lt;p&gt;However, the results were not as aligned with our deduplication goals as we had hoped: from time to time, the models decided to deduct points, even though the prompt states that points cannot be deducted.&lt;/p&gt;
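&lt;p&gt;Since the templates ask the model to finish with a line of the form "Score: N", the numeric score can be recovered from the raw response with a small regex. A minimal sketch (the helper name is ours, not part of PromptingTools.jl):&lt;/p&gt;

```julia
# Pull the final "Score: N" out of a judge response.
# `extract_score` is a hypothetical helper, not part of PromptingTools.jl.
function extract_score(response::AbstractString)
    m = match(r"Score:\s*(\d+)", response)
    isnothing(m) ? nothing : parse(Int, m.captures[1])
end

extract_score("Justification: names match...\n\nScore: 3")  # -> 3
extract_score("no score emitted")                           # -> nothing
```

In practice you would call it on `msg.content` and keep the justification text alongside the score for auditing.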

&lt;h2&gt;
  
  
  Method 2: Linguistic Calibration Scales
&lt;/h2&gt;

&lt;p&gt;Inspired by &lt;a href="https://arxiv.org/pdf/2305.14975.pdf"&gt;"Just Ask for Calibration"&lt;/a&gt; we adapted their approach using linguistic scales for better calibration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;## We copied the example from the Appendix and adapted it to our use case&lt;/span&gt;
&lt;span class="n"&gt;dedupe_template2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SystemMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"""
        You're a world-class record linkage engineer. 

        Your task is to compare two contact records and guess whether they refer to the same person.

        Provide your best guess ("&lt;/span&gt;&lt;span class="n"&gt;Duplicate&lt;/span&gt;&lt;span class="s"&gt;" vs "&lt;/span&gt;&lt;span class="n"&gt;Not&lt;/span&gt; &lt;span class="n"&gt;duplicate&lt;/span&gt;&lt;span class="s"&gt;") and describe how likely it is that your guess is correct as one of the following expressions: "&lt;/span&gt;&lt;span class="n"&gt;Almost&lt;/span&gt; &lt;span class="n"&gt;Certain&lt;/span&gt;&lt;span class="s"&gt;", "&lt;/span&gt;&lt;span class="n"&gt;Highly&lt;/span&gt; &lt;span class="n"&gt;Likely&lt;/span&gt;&lt;span class="s"&gt;", "&lt;/span&gt;&lt;span class="n"&gt;Likely&lt;/span&gt;&lt;span class="s"&gt;", "&lt;/span&gt;&lt;span class="n"&gt;Probably&lt;/span&gt; &lt;span class="n"&gt;Even&lt;/span&gt;&lt;span class="s"&gt;", "&lt;/span&gt;&lt;span class="n"&gt;Unlikely&lt;/span&gt;&lt;span class="s"&gt;", "&lt;/span&gt;&lt;span class="n"&gt;Highly&lt;/span&gt; &lt;span class="n"&gt;Unlikely&lt;/span&gt;&lt;span class="s"&gt;", "&lt;/span&gt;&lt;span class="n"&gt;Almost&lt;/span&gt; &lt;span class="n"&gt;No&lt;/span&gt; &lt;span class="n"&gt;Chance&lt;/span&gt;&lt;span class="s"&gt;"

        Give ONLY the guess and your confidence, no other words or explanation. 

        For example:

        Guess: &amp;lt;most likely guess, as short as possible; not a complete sentence, just the guess!&amp;gt;
        Confidence: &amp;lt;description of confidence, without any extra
        commentary whatsoever; just a short phrase!&amp;gt;
                """&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UserMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"""
Are the following two records duplicates?

# Record 1

{{record1}}

# Record 2

{{record2}}
            """&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aigenerate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dedupe_template2&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; 
&lt;span class="n"&gt;record1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;record2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dupe_idxs&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt-3.5-turbo-1106"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;

&lt;span class="c"&gt;## GPT-3 Turbo Outputs&lt;/span&gt;
&lt;span class="c"&gt;##&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Tokens: 269 @ Cost: \$0.0003 in 1.8 seconds&lt;/span&gt;
&lt;span class="c"&gt;## AIMessage("Guess: Not duplicate&lt;/span&gt;
&lt;span class="c"&gt;## Confidence: Likely")&lt;/span&gt;


&lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aigenerate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dedupe_template2&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;record1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;record2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dupe_idxs&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt4t"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;

&lt;span class="c"&gt;## GPT-4 Turbo Outputs&lt;/span&gt;
&lt;span class="c"&gt;##&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Tokens: 268 @ Cost: \$0.0028 in 1.1 seconds&lt;/span&gt;
&lt;span class="c"&gt;## AIMessage("Guess: Duplicate&lt;/span&gt;
&lt;span class="c"&gt;## Confidence: Likely")&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As always, GPT-4 demonstrated a better understanding and provided more accurate responses, suggesting a stronger alignment with our deduplication requirements.&lt;/p&gt;

&lt;p&gt;Conversely, GPT-3.5 struggled with this approach, often delivering answers that deviated from our expectations. &lt;/p&gt;

&lt;p&gt;Overall, we see results similar to those in the original paper, but we lose the reasoning trace that would allow for potential audits.&lt;/p&gt;
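&lt;p&gt;If you want to aggregate these verbalized confidences downstream, one option is to map each expression to a rough probability, in the spirit of the paper's calibration setup. The numeric values below are illustrative assumptions on our part, not taken from the paper:&lt;/p&gt;

```julia
# Illustrative mapping from linguistic confidence expressions to rough
# probabilities; the exact values are our assumption, not from the paper.
const CONFIDENCE_MAP = Dict(
    "Almost Certain"   => 0.95,
    "Highly Likely"    => 0.85,
    "Likely"           => 0.70,
    "Probably Even"    => 0.50,
    "Unlikely"         => 0.30,
    "Highly Unlikely"  => 0.15,
    "Almost No Chance" => 0.05,
)

# Parse the two-line "Guess: ... / Confidence: ..." format from the template.
function parse_guess(response::AbstractString)
    guess = match(r"Guess:\s*(.+)", response)
    conf = match(r"Confidence:\s*(.+)", response)
    (isnothing(guess) || isnothing(conf)) && return nothing
    (guess = strip(guess.captures[1]),
     probability = get(CONFIDENCE_MAP, strip(conf.captures[1]), missing))
end

parse_guess("Guess: Not duplicate\nConfidence: Likely")
```

This gives you a numeric handle on the model's confidence while keeping the prompt itself purely linguistic.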

&lt;h2&gt;
  
  
  Method 3: Categorical Scoring
&lt;/h2&gt;

&lt;p&gt;Using a "traditional" categorical system where we define several categories and a point scale per category. We loosely follow the example in the &lt;a href="https://cookbook.openai.com/examples/evaluation/how_to_eval_abstractive_summarization"&gt;OpenAI cookbook&lt;/a&gt;. One difference is to limit the maximum points within each category - it's easier to explain and tends to bring more consistent results.&lt;/p&gt;

&lt;p&gt;Again, we asked GPT-4 to write the prompt for us:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
You're a professional record linkage engineer. 

Your task is to design clear evaluation criteria to compare a pair of contact details and judge whether they are duplicates or not.
Prepare a scoring system with 5 categories with 0-2 points each, where more points indicate higher likelihood of being duplicates. Maximum is 10 points.

Example contacts: 
- "&lt;/span&gt;&lt;span class="n"&gt;james&lt;/span&gt; &lt;span class="n"&gt;waller&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;living&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="n"&gt;tullaroop&lt;/span&gt; &lt;span class="n"&gt;street&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;willaroo&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt; &lt;span class="n"&gt;james&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Postcode&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4011&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;State&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WA&lt;/span&gt;&lt;span class="s"&gt;"
- "&lt;/span&gt;&lt;span class="n"&gt;lachlan&lt;/span&gt; &lt;span class="n"&gt;berry&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;living&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="mi"&gt;69&lt;/span&gt; &lt;span class="n"&gt;giblin&lt;/span&gt; &lt;span class="n"&gt;street&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;killarney&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bittern&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Postcode&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4814&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;State&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;QLD&lt;/span&gt;&lt;span class="s"&gt;"
You can see here the available fields for the scoring system: name, address, postcode and state.

First, think through step by step what is a robust method to judge two potentially duplicate records and what the situations are in which two pairs of records refer to the same person but differ in various fields. Design your system around this knowledge.

Second, write a brief and concise explanation for your 10-point system.
"""&lt;/span&gt;
&lt;span class="c"&gt;## remember to return the whole conversation, so you can iterate on it and improve it&lt;/span&gt;
&lt;span class="n"&gt;conv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aigenerate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"gpt4t"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_all&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ultimately, we ended up with the following prompt (after a few inline edits):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;dedupe_template3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SystemMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"""
        You're a world-class record linkage engineer. 

        Your task is to compare two contact records and score whether they refer to the same person (=are a duplicate).

        Apply the following scoring system to the two records.

        ### Duplicate Record Scoring System (0-10 Points)

**1. Name Matching:**
   - **2 points** for exact match.
   - **1 point** for partial match (nicknames, misspellings).
   - **0 points** for no match.

**2. Address Matching:**
   - **2 points** for exact match.
   - **1 point** for partial match (same street, minor errors).
   - **0 points** for no match.

**3. Postcode Matching:**
   - **2 points** for exact match.
   - **1 point** for first digits match.
   - **0 points** for no match.

**4. State Matching:**
   - **2 points** for exact match.
   - **1 point** for neighboring states or common errors.
   - **0 points** for no match.

**5. Other Fields (if available):**
   - **2 points** for exact match in fields like phone or email.
   - **1 point** for partial match.
   - **0 points** for no match or not available.

#### Guidelines
- **Maximum Score:** 10 points.
- **Higher Score:** Indicates higher likelihood of being duplicates.
- **Consider Context:** Adjust scoring based on the context and known data quality issues.

### Output Format

Record 1: &amp;lt;details of record 1&amp;gt;

Record 2: &amp;lt;details of record 2&amp;gt;

After detailed examination of the two records:

Justification: &amp;lt;justify the total score, go criterion by criterion. 100 words max&amp;gt;

Score: &amp;lt;total score&amp;gt;
                """&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UserMessage&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"""
Record 1: {{record1}}

Record 2: {{record2}}

After detailed examination of the two records:

Justification:"""&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt;

&lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aigenerate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dedupe_template3&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; 
    &lt;span class="n"&gt;record1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;record2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dupe_idxs&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt-3.5-turbo-1106"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;api_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;

&lt;span class="c"&gt;## GPT-3.5 Turbo Outputs&lt;/span&gt;
&lt;span class="c"&gt;##&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Tokens: 565 @ Cost: \$0.0006 in 3.1 seconds&lt;/span&gt;
&lt;span class="c"&gt;## AIMessage("Name Matching: 2 points. The names are an exact match.&lt;/span&gt;
&lt;span class="c"&gt;## Address Matching: 1 point. The street name is similar, but the house numbers are different.&lt;/span&gt;
&lt;span class="c"&gt;## Postcode Matching: 0 points. The postcodes are completely different.&lt;/span&gt;
&lt;span class="c"&gt;## State Matching: 2 points. The states are an exact match.&lt;/span&gt;
&lt;span class="c"&gt;## Other Fields: 0 points. No other fields are available for comparison.&lt;/span&gt;

&lt;span class="c"&gt;## Score: 5 points")&lt;/span&gt;

&lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aigenerate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dedupe_template3&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;record1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;record2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dupe_idxs&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt4t"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;

&lt;span class="c"&gt;## GPT-4 Turbo Outputs&lt;/span&gt;
&lt;span class="c"&gt;##&lt;/span&gt;
&lt;span class="c"&gt;## [ Info: Tokens: 564 @ Cost: \$0.0073 in 7.8 seconds&lt;/span&gt;
&lt;span class="c"&gt;## AIMessage("Justification: Both records have an exact name match, earning 2 points. &lt;/span&gt;
&lt;span class="c"&gt;## The addresses have a partial match since they are on the same street but have different numbers, earning 1 point. &lt;/span&gt;
&lt;span class="c"&gt;## The postcodes do not match exactly or at the first digits, so they earn 0 points. &lt;/span&gt;
&lt;span class="c"&gt;## The state matches exactly, earning 2 points. &lt;/span&gt;
&lt;span class="c"&gt;## No other fields are provided for comparison. &lt;/span&gt;

&lt;span class="c"&gt;## Score: 5")&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is good! We have a clear scoring system and the results are consistent between GPT-3.5 and GPT-4.&lt;/p&gt;
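&lt;p&gt;With a numeric score in hand, the final duplicate/not-duplicate call reduces to a threshold. The cut-off of 7 below is an assumption to be tuned on labeled pairs, not a value from this article:&lt;/p&gt;

```julia
# Classify a 0-10 judge score; the threshold is a tunable assumption.
is_duplicate(score::Integer; threshold::Integer = 7) = score >= threshold

is_duplicate(5)                # the pair above scores 5 -> not a duplicate
is_duplicate(5; threshold = 5) # a looser cut-off would flag it
```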

&lt;p&gt;Let's test it on a few more records. &lt;/p&gt;

&lt;p&gt;We'll use structured extraction to make it easier to work with data in the DataFrame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="s"&gt;"Apply the scoring system, go criterion by criterion, and justify your score. Maximum 10 points."&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="nc"&gt; DuplicateJudgement&lt;/span&gt;
    &lt;span class="n"&gt;justification&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aiextract&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dedupe_template3&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;record1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;record2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dupe_idxs&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt-3.5-turbo-1106"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DuplicateJudgement&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Evaluation
&lt;/h2&gt;

&lt;p&gt;Now, let's apply Method 3 to 100 random contacts, always judging the 3 closest candidates for each. We'll ignore self-consistency for now (e.g., whether swapping the order of the record and its candidate changes the verdict).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;## Utility functions&lt;/span&gt;
&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; find_candidates&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dists&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;## Find the top k most similar records to the i-th record&lt;/span&gt;
    &lt;span class="n"&gt;dupe_idxs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sortperm&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;@view&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dists&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;rev&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;# the first item is the record itself&lt;/span&gt;
    &lt;span class="n"&gt;dupe_idxs&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;.+&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; judge_duplicates&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text2&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;## when we make a lot of network calls, we will often get errors. Let's make sure we handle them gracefully&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aiextract&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dedupe_template3&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;record1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt-3.5-turbo-1106"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DuplicateJudgement&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;http_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;readtimeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
        &lt;span class="nd"&gt;@warn&lt;/span&gt; &lt;span class="s"&gt;"Failed to generate a judgement for &lt;/span&gt;&lt;span class="si"&gt;$(i) &lt;/span&gt;&lt;span class="s"&gt;and &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(dupe_idxs[1+i])"&lt;/span&gt;
        &lt;span class="nb"&gt;missing&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c"&gt;# We'll run our system for random 100 data points, pick the top 3 most similar records and judge them.&lt;/span&gt;
&lt;span class="n"&gt;rand_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Xoshiro&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;unique&lt;/span&gt; &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Base&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fix2&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;## Let's run the experiment -- this takes ~1-2 minutes&lt;/span&gt;
&lt;span class="n"&gt;df_dupes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;@chain&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="k"&gt;begin&lt;/span&gt;
    &lt;span class="nd"&gt;@select&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;rec_id&lt;/span&gt;
    &lt;span class="nd"&gt;@rtransform&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;eachindex&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rand_ids&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
    &lt;span class="c"&gt;## find candidates&lt;/span&gt;
    &lt;span class="nd"&gt;@rtransform&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;candidate_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;find_candidates&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dists&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;flatten&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;candidate_idx&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;## bring the candidate data&lt;/span&gt;
    &lt;span class="nd"&gt;@rtransform&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;rec_id_candidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rec_id&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;candidate_idx&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob_candidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;candidate_idx&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
    &lt;span class="c"&gt;## judge duplicates // we run them in parallel and just wait until they all finish&lt;/span&gt;
    &lt;span class="nd"&gt;@rtransform&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;judgement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Threads&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nd"&gt;@spawn&lt;/span&gt; &lt;span class="n"&gt;judge_duplicates&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_blob_candidate&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;## bring the true labels&lt;/span&gt;
    &lt;span class="nd"&gt;@rtransform&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;is_duplicate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="s"&gt;"(\d+)"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;rec_id&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;captures&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="s"&gt;"(\d+)"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;rec_id_candidate&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;captures&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c"&gt;## Let's check if all tasks are done&lt;/span&gt;
&lt;span class="n"&gt;all&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;istaskdone&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_dupes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;judgement&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's analyze the results. As a reminder, the best-case scenario would be to find a duplicate for each record, i.e., 100 duplicates in total.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="nd"&gt;@chain&lt;/span&gt; &lt;span class="n"&gt;df_dupes&lt;/span&gt; &lt;span class="k"&gt;begin&lt;/span&gt;
    &lt;span class="nd"&gt;@rtransform&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;judgement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fetch&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;judgement&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dropmissing&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;judgement&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="nd"&gt;@rtransform&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call_cost&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;judgement&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"gpt-3.5-turbo-1106"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;judgement&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;
    &lt;span class="nd"&gt;@aside&lt;/span&gt; &lt;span class="nd"&gt;@info&lt;/span&gt; &lt;span class="s"&gt;"Number of duplicates found: &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(count(_.is_duplicate))/&lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(length(rand_ids)), Total cost: \&lt;/span&gt;&lt;span class="si"&gt;$$&lt;/span&gt;&lt;span class="s"&gt;(sum(_.cost))"&lt;/span&gt;
    &lt;span class="nd"&gt;@by&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;is_duplicate&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;score_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Info: Number of duplicates found: 100/100, Total cost: \$0.193309
2×3 DataFrame
 Row │ is_duplicate  score    score_std 
     │ Bool          Float64  Float64   
─────┼──────────────────────────────────
   1 │        false     2.23    1.76996
   2 │         true     5.86    1.93855
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We successfully identified all duplicates, clearly distinguishing them based on their scores. &lt;/p&gt;
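With group means around 2.2 (non-duplicates) and 5.9 (duplicates), a simple cutoff turns the judge's score into a yes/no decision. Here is a minimal sketch, assuming a hypothetical threshold of 4 (roughly halfway between the two means; nothing in the method prescribes this value):

```julia
# Hypothetical cutoff halfway between the group means reported above (~2.2 vs ~5.9)
is_dupe(score; threshold=4) = score >= threshold

# Illustrative judge scores for five candidate pairs
scores = [2, 6, 1, 7, 5]
flags = is_dupe.(scores)   # broadcast the decision over all pairs
```

In practice, you would tune the threshold on a handful of labeled pairs to trade off precision against recall.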

&lt;p&gt;Let's visualize the distribution of scores: duplicates score consistently higher than non-duplicates, and the two groups are well separated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;StatsPlots&lt;/span&gt;

&lt;span class="n"&gt;pl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;@chain&lt;/span&gt; &lt;span class="n"&gt;df_dupes&lt;/span&gt; &lt;span class="k"&gt;begin&lt;/span&gt;
    &lt;span class="nd"&gt;@rtransform&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;judgement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fetch&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;judgement&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dropmissing&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;judgement&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="nd"&gt;@rtransform&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;judgement&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;
    &lt;span class="nd"&gt;@df&lt;/span&gt; &lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;is_duplicate&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Score"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Is duplicate?"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Scores from the Auto-Judge"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;yformatter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dpi&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;xticks!&lt;/span&gt;&lt;span class="x"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Not duplicate"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Duplicate"&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://forem.julialang.org/images/oDMBVgWepULg160nv5GFsHnHrCfmXCxDamvDtFB9AYc/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzLzIw/NTB1aWNjZmRnYmpo/bGR2NDlwLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/oDMBVgWepULg160nv5GFsHnHrCfmXCxDamvDtFB9AYc/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzLzIw/NTB1aWNjZmRnYmpo/bGR2NDlwLnBuZw" alt="Distribution of Scores from the LLM-Judge" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost-Efficiency?
&lt;/h2&gt;

&lt;p&gt;Amazingly, the entire process cost just about $0.20 for 300 calls, demonstrating the method's affordability and efficiency.&lt;/p&gt;
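The arithmetic behind that claim is easy to check (numbers taken from the run above):

```julia
total_cost = 0.193309        # USD, as reported by the @info line above
n_calls    = 3 * 100         # 3 candidate comparisons for each of 100 records
cost_per_call = total_cost / n_calls   # well under a tenth of a cent per judged pair
```

At that rate, even judging thousands of candidate pairs stays in the single-dollar range.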

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Our exploration demonstrated three diverse approaches to crafting scoring criteria for LLM judges in data deduplication. While Method 3 proved most effective for our needs, you might find that the other methods better suit your specific scenarios. This journey underscores the power and versatility of the LLM-as-a-Judge pattern, opening doors to numerous practical applications in business.&lt;/p&gt;

&lt;p&gt;Credit for the title image goes to DALL-E 3.&lt;/p&gt;

</description>
      <category>genai</category>
      <category>generativeai</category>
      <category>prompting</category>
    </item>
    <item>
      <title>AIHelpMe.jl: AI-Enhanced Coding Assistance for Julia</title>
      <dc:creator>Jan Siml</dc:creator>
      <pubDate>Tue, 23 Jan 2024 09:07:55 +0000</pubDate>
      <link>https://forem.julialang.org/svilupp/aihelpmejl-ai-enhanced-coding-assistance-for-julia-42a2</link>
      <guid>https://forem.julialang.org/svilupp/aihelpmejl-ai-enhanced-coding-assistance-for-julia-42a2</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/svilupp/AIHelpMe.jl"&gt;AIHelpMe&lt;/a&gt;, a new Julia package, transforms your existing docstrings into an interactive AI-powered guide, offering personalized insights directly from your code's documentation. It's in the early stages and seeks community feedback, promising a unique, low-cost way to interact with your documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Announcing AIHelpMe Pre-Release
&lt;/h2&gt;

&lt;p&gt;Welcome to &lt;a href="https://github.com/svilupp/AIHelpMe.jl"&gt;AIHelpMe&lt;/a&gt;, a new Julia package that transforms your detailed docstrings into a rich source of insights. It's not about writing code for you; rather, it's about shining a light on the valuable documentation you and others have already created. Think of it as having a chat with your code's documentation, enhanced by AI's clever touch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Motivation and Value
&lt;/h2&gt;

&lt;p&gt;Why write great docstrings? AIHelpMe gives you a compelling reason, turning them into an interactive, insightful guide. It's a subtle, yet powerful way to connect your queries with tailored, documentation-driven answers.&lt;/p&gt;

&lt;p&gt;There are a few things that set this package apart from generic chatbots:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direct Access to Your Work&lt;/strong&gt;: AIHelpMe uniquely utilizes the latest information and modules directly from your laptop, ensuring up-to-date and relevant assistance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full Control Over Searches&lt;/strong&gt;: Tailor your search scope and methods with AIHelpMe, aligning AI insights precisely with your needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contextual Understanding&lt;/strong&gt;: Go beyond typical chatbot responses; AIHelpMe offers deep insights, revealing the sources behind each answer, so you can continue your research 🧠📚&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Simply add AIHelpMe to your Julia environment (not registered yet) and get ready to interact with your code's documentation in a whole new way. Remember, API keys from Cohere and OpenAI are required, but the cost per query is just a tiny fraction of a cent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Pkg&lt;/span&gt;
&lt;span class="n"&gt;Pkg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://github.com/svilupp/AIHelpMe.jl"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;AIHelpMe&lt;/span&gt; &lt;span class="c"&gt;# automatically some downloads pre-processed documentation&lt;/span&gt;

&lt;span class="c"&gt;# alias gpt3t sends the query to the latest GPT-3.5 Turbo&lt;/span&gt;
&lt;span class="n"&gt;aihelp&lt;/span&gt;&lt;span class="s"&gt;"In Julia, how to create a named tuple from a dictionary? Give me an example"&lt;/span&gt;&lt;span class="n"&gt;gpt3t&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Info: Done generating response. Total cost: $0.002
AIMessage("You can use the splatting operator to create a named tuple from a dictionary in Julia. Here's an example:

d = Dict(:a =&amp;gt; 1, :b =&amp;gt; 2)
nt = (; d...)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
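The suggested splatting trick is easy to verify locally. One caveat worth knowing: `Dict` iteration order is unspecified, so the field order of the resulting named tuple may vary:

```julia
d = Dict(:a => 1, :b => 2)
nt = (; d...)              # splat the key => value pairs as keyword arguments
# Field access by name works regardless of the field order:
nt.a + nt.b
```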



&lt;p&gt;You can also ask GPT-4 Turbo to weigh in on your question (notice the "!" in &lt;code&gt;aihelp!"..."&lt;/code&gt;) when you need deeper insights. For example, you can ask GPT-4 Turbo to explain the constructors for &lt;code&gt;NamedTuple&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;aihelp!&lt;/span&gt;&lt;span class="s"&gt;"What other constructors does NamedTuple have?"&lt;/span&gt;&lt;span class="n"&gt;gpt4t&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Info: Tokens: 1023 @ Cost: \$0.0154 in 30.7 seconds
AIMessage("Aside from creating named tuples from pairs or a dictionary, `NamedTuple` offers other constructors as well:

1. **Tuple literal with keys**:

   nt = (a=1, b=2)


2. **Tuple literal with semicolon**:

   nt = (; a=1, b=2)


3. **Using `NamedTuple` type as constructor**:

   nt = NamedTuple{(:a, :b)}((1, 2))


4. **With specified types**:

   nt = NamedTuple{(:a, :b), Tuple{Int, Int}}((1, 2))


5. **From another named tuple or iterable**:
   You can create a new named tuple by splatting an existing named tuple or any iterable of pairs:

   existing_nt = (a=1, b=2)
   nt = NamedTuple((:c =&amp;gt; 3, :d =&amp;gt; 4, existing_nt...))

   &amp;lt;Author: this example is incorrect. See the next section&amp;gt;


These constructors allow for flexibility in creating `NamedTuple`s programmatically or from existing data structures in Julia.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: Nested Julia code fences have been manually removed to enable correct parsing on Forem.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLMs will make mistakes...
&lt;/h2&gt;

&lt;p&gt;... and that's okay! It's usually easy to check and iterate, so the overall workflow still ends up being faster and easier.&lt;/p&gt;

&lt;p&gt;The original example used the default chat model and contained a mistake, as pointed out by @oxinabox. We switched to the "gpt3t" model for better performance. The prefix and suffix also changed, but that was not required; it's simply a habit in how I write prompts and questions for LLMs.&lt;/p&gt;

&lt;p&gt;Similarly, the last constructor in the examples from GPT4 Turbo throws an error. We can ask the LLM to fix it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;aihelp&lt;/span&gt;&lt;span class="s"&gt;"How to fix `nt = NamedTuple((:c =&amp;gt; 3, :d =&amp;gt; 4, existing_nt...))`. I get error &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;err"&lt;/span&gt;&lt;span class="n"&gt;gpt4t&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Info: Done generating response. Total cost: $0.002
AIMessage("You can fix the code snippet by using the `merge` function properly, to merge the `NamedTuple` with the key-value pairs:

existing_nt = (a=1, b=2)
nt = merge(existing_nt, (c=3,), (d=4,))


This code merges the existing named tuple `existing_nt` with two additional key-value pairs, `(c=3,)` and `(d=4,)`. The result `nt` will be a named tuple including all four key-value pairs.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pre-Release Testing
&lt;/h2&gt;

&lt;p&gt;As AIHelpMe is in its early stages, we're eager for community involvement to test and refine its capabilities. &lt;/p&gt;

&lt;p&gt;Is it valuable? What are its limitations?&lt;/p&gt;

&lt;p&gt;Your feedback is invaluable in shaping this tool’s future, so join us on this journey!&lt;/p&gt;




&lt;p&gt;Credit for the title image goes to DALL-E 3.&lt;/p&gt;

&lt;p&gt;Thanks to @oxinabox for pointing out that the original example had an error in it!&lt;/p&gt;

</description>
      <category>generativeai</category>
      <category>genai</category>
      <category>help</category>
    </item>
  </channel>
</rss>
