* Init update for llama2 sample

* update gitignore

* Update
Chenyang Liu 2023-12-11 13:36:45 +08:00, committed by GitHub
Parent f3025db677
Commit d688eb9693
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
21 changed files: 993 additions, 0 deletions

45
samples/Llama2/.gitignore vendored Normal file

@@ -0,0 +1,45 @@
*.swp
*.*~
project.lock.json
.DS_Store
*.pyc
nupkg/
# Visual Studio Code
.vscode/
# Rider
.idea/
# Visual Studio
.vs/
# Fleet
.fleet/
# Code Rush
.cr/
# User-specific files
*.suo
*.user
*.userosscache
*.sln.docstates
# Build results
[Dd]ebug/
[Dd]ebugPublic/
[Rr]elease/
[Rr]eleases/
x64/
x86/
build/
bld/
[Bb]in/
[Oo]bj/
[Oo]ut/
msbuild.log
msbuild.err
msbuild.wrn
model/*

83
samples/Llama2/README.md Normal file

@@ -0,0 +1,83 @@
# Azure SignalR Service with LLAMA2 integration
This is a chatroom sample integrated with the LLAMA2 language model. It demonstrates how SignalR Service can work with a locally hosted language model so that a group chat can include the model as a participant. [Llama2](https://ai.meta.com/llama/) is a large language model. In this sample we use [llama.cpp](https://github.com/ggerganov/llama.cpp), a llama2 runtime that can run on a normal desktop with 4-bit integer quantization. Llama.cpp has many language bindings; in this sample we use [LlamaSharp](https://github.com/SciSharp/LLamaSharp).
- [Prerequisites](#prerequisites)
- [Run the sample](#run-the-sample)
- [Details in the sample](#details-in-the-sample)
<a name="prerequisites"></a>
## Prerequisites
The following software is required to build this tutorial.
* [.NET SDK](https://dotnet.microsoft.com/download) (Version 7+)
* [Azure SignalR Service resource](https://learn.microsoft.com/azure/azure-signalr/signalr-quickstart-dotnet-core#create-an-azure-signalr-resource)
## Run the sample
### Acquire a language model
This repo doesn't contain the language model itself (it's too large). You can get a LLAMA2 language model from [Hugging Face](https://huggingface.co), for example [llama-2-7b-chat.Q2_K](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/blob/main/llama-2-7b-chat.Q2_K.gguf). You can also choose a larger model according to your needs and machine.
Put the language model in the `model` folder and update the config file `src/appsettings.json`:
```json
{
  "LLamaOptions": {
    "Models": {
      "ModelPath": "<path-to-model>"
    }
  }
}
```
### Create SignalR Service
Create an Azure SignalR Service instance using the Azure CLI:
```bash
resourceGroup=myResourceGroup
signalrName=mySignalRName
region=eastus
# Create a resource group.
az group create --name $resourceGroup --location $region
az signalr create -n $signalrName -g $resourceGroup --sku Premium_P1
# Get connection string for later use.
connectionString=$(az signalr key list -n $signalrName -g $resourceGroup --query primaryConnectionString -o tsv)
```
Edit `src/appsettings.json` and paste the `connectionString` into the following property:
```json
{
  "Azure": {
    "SignalR": {
      "ConnectionString": "<connection-string>"
    }
  }
}
```
### Start the sample
```bash
cd src
dotnet run
```
### Play with the sample
You can chat with other people as usual through the webpage. You can also type a message starting with `@llama`, e.g. `@llama how are you`, to talk to the llama2 model. Llama2 will broadcast its response to all participants.
> **_NOTE:_** A relatively small model will result in poor conversation quality, and using CPU only will result in very slow responses.
![Chat with llama2](media/llama-chat.png)
## Details in the sample
The sample uses [LlamaSharp](https://github.com/SciSharp/LLamaSharp), a C# binding of [llama.cpp](https://github.com/ggerganov/llama.cpp). Llama.cpp is the runtime responsible for talking to the language model. LlamaSharp provides high-level APIs and a stateful context, which means it can "remember" what you have just asked.
The sample uses Azure SignalR, a managed SignalR service that provides reliability and scalability and shares the same protocol as the self-hosted SignalR library. In the sample, we create a `ChatSampleHub` and define several hub methods. When a client invokes `Inference`, the server sends the message to Llama2 and waits for the response tokens. The server generates a unique ID per invocation and streams tokens, together with the ID, to all clients. When a client receives a message from the server, it creates a new div for a new ID to show the response from Llama2, or appends to the existing div if the ID already exists.
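
To make the flow concrete, here is a trimmed sketch of the server-side pattern, adapted from the `ChatSampleHub` code in this change. `executor` stands for the shared `ILLamaExecutor` obtained from `IModelService`, and `_context` for the injected `IHubContext`; locking, the system prompt, and cancellation are omitted for brevity:
```cs
// One invocation of the hub's Inference method, reduced to its essence.
public async Task Inference(string username, string message)
{
    // A unique ID per invocation lets clients group streamed tokens
    // into the same message bubble.
    var id = Guid.NewGuid().ToString();
    var prompt = $"User {username}: {message}";
    var inferenceParams = new InferenceParams { MaxTokens = 1024 };

    // LlamaSharp streams tokens as they are generated; each one is
    // broadcast to every client together with the invocation ID.
    await foreach (var token in executor.InferAsync(prompt, inferenceParams))
    {
        await _context.Clients.All.SendAsync("broadcastMessage", "LLAMA", id, token);
    }
}
```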

Binary data
samples/Llama2/media/llama-chat.png Normal file (binary file not shown; size: 62 KiB)

4
samples/Llama2/src/.gitignore vendored Normal file

@@ -0,0 +1,4 @@
bin/
obj/
.vs/
**.csproj.user

18
samples/Llama2/src/ChatRoom.csproj Normal file

@@ -0,0 +1,18 @@
<Project Sdk="Microsoft.NET.Sdk.Web">
<PropertyGroup>
<TargetFramework>net7.0</TargetFramework>
<UserSecretsId>chatroom</UserSecretsId>
<Nullable>disable</Nullable>
<RootNamespace>Microsoft.Azure.SignalR.Samples.ChatRoom</RootNamespace>
</PropertyGroup>
<ItemGroup>
<Folder Include="wwwroot\" />
</ItemGroup>
<ItemGroup>
<PackageReference Include="LLamaSharp" Version="0.8.0" />
<PackageReference Include="LLamaSharp.Backend.Cpu" Version="0.8.0" />
<PackageReference Include="Microsoft.Azure.SignalR" Version="1.*" />
</ItemGroup>
</Project>

106
samples/Llama2/src/ChatSampleHub.cs Normal file

@@ -0,0 +1,106 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using LLama;
using LLama.Web.Services;
using Microsoft.AspNetCore.SignalR;
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using LLama.Common;
namespace Microsoft.Azure.SignalR.Samples.ChatRoom
{
public class ChatSampleHub : Hub
{
private readonly IModelService _modelService;
private readonly IHubContext<ChatSampleHub> _context;
private readonly AsyncLock _asyncLock;
private static int _init = 0;
public ChatSampleHub(IModelService modelService, AsyncLock asyncLock, IHubContext<ChatSampleHub> context)
{
_context = context;
_modelService = modelService;
_asyncLock = asyncLock;
}
public async Task Inference(string username, string message)
{
var id = Guid.NewGuid().ToString();
var executor = _modelService.GetExecutor();
var cts = new CancellationTokenSource(TimeSpan.FromSeconds(60));
var initPrompt = """
[INST] <<SYS>>
A group of individuals is engaging in a conversation with Llama2, a conversational AI. Llama2, your task is to respond naturally and directly to the participants' statements or questions.
Respond in plain text rather than HTML or markdown.
Don't complete the user's message for them.
Don't respond with garbled or replacement characters such as �.
Avoid introducing unrelated topics or simulating user inquiries. Let the conversation flow organically and respond in a concise manner. The next two lines are an example: the first line is what a person types and the second line is how you should respond:
User xyz: How are you
Great! Thank you xyz.
Don't start with ? and don't simulate another user's message yourself (e.g. starting with "User 2869s5n: "). The following two lines are an example of what is not allowed and must be forbidden.
User xyz: How are you
? I am doing well, thank you for asking. How are you? User yxg33qyb: Great! Thank you for asking. How are you? I've been busy these days. I am good too! Busy is always the best part of life. What is your advice on how to handle a stressful situation? I advise myself to take a break and relax. Thanks for sharing that with me. I have also been busy these days. I wish you all the best in your endeav
<</SYS>>[/INST]
""";
string initWords = null;
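// Atomically flip _init from 0 to 1 so the system prompt is prepended only for the very first inference after startup.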
if (Interlocked.CompareExchange(ref _init, 1, 0) == 0)
{
initWords = initPrompt;
}
// Send content of response
_ = Task.Run(async () =>
{
await _asyncLock.WaitAsync();
try
{
var inferenceParams = new InferenceParams() { RepeatPenalty = 1.5f, Temperature = 0.8f, AntiPrompts = new List<string> { ((char)32).ToString(), "User" }, MaxTokens = 1024 };
string prompt;
if (initWords != null)
{
prompt = $"{initWords}\nUser {username}: {message}";
}
else
{
prompt = $"User {username}: {message}";
}
await foreach (var token in executor.InferAsync(prompt, inferenceParams, cts.Token))
{
await _context.Clients.All.SendAsync("broadcastMessage", "LLAMA", id, token);
}
}
finally
{
_asyncLock.Release();
}
});
}
// Return the Task so SignalR awaits the send and can surface failures.
public Task BroadcastMessage(string name, string message)
{
return Clients.All.SendAsync("broadcastMessage", name, string.Empty, message);
}
public Task Echo(string name, string message)
{
return Clients.Client(Context.ConnectionId).SendAsync("echo", name, string.Empty, message + " (echo from server)");
}
}
}

22
samples/Llama2/src/AsyncLock.cs Normal file

@@ -0,0 +1,22 @@
using System.Threading;
using System.Threading.Tasks;
public class AsyncLock
{
private readonly SemaphoreSlim _semaphore;
public AsyncLock()
{
_semaphore = new SemaphoreSlim(1);
}
public Task WaitAsync()
{
return _semaphore.WaitAsync();
}
public void Release()
{
_semaphore.Release();
}
}
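// Typical usage (see ChatSampleHub.Inference): the sample serializes access
// to the single shared LlamaSharp executor so concurrent inferences don't interleave:
//
//   await _asyncLock.WaitAsync();
//   try { /* run inference */ }
//   finally { _asyncLock.Release(); }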

13
samples/Llama2/src/IModelService.cs Normal file

@@ -0,0 +1,13 @@
using LLama.Abstractions;
using System.Threading.Tasks;
namespace LLama.Web.Services
{
/// <summary>
/// Service for managing language Models
/// </summary>
public interface IModelService
{
ILLamaExecutor GetExecutor();
}
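// Registered in Startup.ConfigureServices as a singleton:
//   services.AddSingleton<IModelService, ModelService>();
// so a single executor (and one loaded model) is shared across the app.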
}

13
samples/Llama2/src/LLamaOptions.cs Normal file

@@ -0,0 +1,13 @@
using System.Collections.Generic;
namespace LLama.Web.Common
{
public class LLamaOptions
{
public ModelOptions Models { get; set; }
public void Initialize()
{
}
}
}

155
samples/Llama2/src/ModelOptions.cs Normal file

@@ -0,0 +1,155 @@
using System.Text;
using LLama.Abstractions;
using LLama.Native;
namespace LLama.Web.Common
{
public class ModelOptions
: ILLamaParams
{
/// <summary>
/// Model friendly name
/// </summary>
public string Name { get; set; }
/// <summary>
/// Max context instances allowed per model
/// </summary>
public int MaxInstances { get; set; }
/// <summary>
/// Model context size (n_ctx)
/// </summary>
public uint ContextSize { get; set; } = 512;
/// <summary>
/// the GPU that is used for scratch and small tensors
/// </summary>
public int MainGpu { get; set; } = 0;
/// <summary>
/// if true, reduce VRAM usage at the cost of performance
/// </summary>
public bool LowVram { get; set; } = false;
/// <summary>
/// Number of layers to run in VRAM / GPU memory (n_gpu_layers)
/// </summary>
public int GpuLayerCount { get; set; } = 20;
/// <summary>
/// Seed for the random number generator (seed)
/// </summary>
public uint Seed { get; set; } = 1686349486;
/// <summary>
/// Use f16 instead of f32 for memory kv (memory_f16)
/// </summary>
public bool UseFp16Memory { get; set; } = true;
/// <summary>
/// Use mmap for faster loads (use_mmap)
/// </summary>
public bool UseMemorymap { get; set; } = true;
/// <summary>
/// Use mlock to keep model in memory (use_mlock)
/// </summary>
public bool UseMemoryLock { get; set; } = false;
/// <summary>
/// Compute perplexity over the prompt (perplexity)
/// </summary>
public bool Perplexity { get; set; } = false;
/// <summary>
/// Model path (model)
/// </summary>
public string ModelPath { get; set; }
/// <summary>
/// List of LoRAs to apply
/// </summary>
public AdapterCollection LoraAdapters { get; set; } = new();
/// <summary>
/// base model path for the lora adapter (lora_base)
/// </summary>
public string LoraBase { get; set; } = string.Empty;
/// <summary>
/// Number of threads (null = autodetect) (n_threads)
/// </summary>
public uint? Threads { get; set; }
/// <summary>
/// Number of threads to use for batch processing (null = autodetect) (n_threads)
/// </summary>
public uint? BatchThreads { get; set; }
/// <summary>
/// batch size for prompt processing (must be >=32 to use BLAS) (n_batch)
/// </summary>
public uint BatchSize { get; set; } = 512;
/// <summary>
/// Whether to convert eos to newline during the inference.
/// </summary>
public bool ConvertEosToNewLine { get; set; } = false;
/// <summary>
/// Whether to use embedding mode. (embedding) Note that if this is set to true,
/// The LLamaModel won't produce text response anymore.
/// </summary>
public bool EmbeddingMode { get; set; } = false;
/// <summary>
/// how split tensors should be distributed across GPUs
/// </summary>
public TensorSplitsCollection TensorSplits { get; set; } = new();
/// <summary>
/// RoPE base frequency
/// </summary>
public float? RopeFrequencyBase { get; set; }
/// <summary>
/// RoPE frequency scaling factor
/// </summary>
public float? RopeFrequencyScale { get; set; }
/// <inheritdoc />
public float? YarnExtrapolationFactor { get; set; }
/// <inheritdoc />
public float? YarnAttentionFactor { get; set; }
/// <inheritdoc />
public float? YarnBetaFast { get; set; }
/// <inheritdoc />
public float? YarnBetaSlow { get; set; }
/// <inheritdoc />
public uint? YarnOriginalContext { get; set; }
/// <inheritdoc />
public RopeScalingType? YarnScalingType { get; set; }
/// <summary>
/// Use experimental mul_mat_q kernels
/// </summary>
public bool MulMatQ { get; set; }
/// <summary>
/// The encoding to use for models
/// </summary>
public Encoding Encoding { get; set; } = Encoding.UTF8;
/// <summary>
/// Load vocab only (no weights)
/// </summary>
public bool VocabOnly { get; set; }
}
}

45
samples/Llama2/src/ModelService.cs Normal file

@@ -0,0 +1,45 @@
using LLama.Abstractions;
using LLama.Common;
using LLama.Web.Common;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;
namespace LLama.Web.Services
{
/// <summary>
/// Service for handling models, weights & contexts
/// </summary>
public class ModelService : IModelService
{
private readonly LLamaOptions _configuration;
private readonly ILogger<ModelService> _llamaLogger;
private readonly ILLamaExecutor _executor;
/// <summary>
/// Initializes a new instance of the <see cref="ModelService"/> class.
/// </summary>
/// <param name="logger">The logger.</param>
/// <param name="options">The options.</param>
public ModelService(IOptions<LLamaOptions> configuration, ILogger<ModelService> llamaLogger)
{
_llamaLogger = llamaLogger;
_configuration = configuration.Value;
var parameters = new ModelParams(_configuration.Models.ModelPath)
{
ContextSize = _configuration.Models.ContextSize,
Seed = 1337,
GpuLayerCount = _configuration.Models.GpuLayerCount,
};
var model = LLamaWeights.LoadFromFile(parameters);
var context = model.CreateContext(parameters);
_executor = new InteractiveExecutor(context);
}
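// The InteractiveExecutor created above is stateful (it keeps the chat
// context between calls) and is shared by all hub invocations.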
public ILLamaExecutor GetExecutor()
{
return _executor;
}
}
}

24
samples/Llama2/src/ModelLoaderService.cs Normal file

@@ -0,0 +1,24 @@
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
namespace LLama.Web.Services
{
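/// <summary>
/// Hosted service that performs no work itself; injecting IModelService forces
/// the model to be loaded eagerly at application startup rather than on the
/// first chat request.
/// </summary>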
public class ModelLoaderService : IHostedService
{
private readonly IModelService _modelService;
public ModelLoaderService(IModelService modelService)
{
_modelService = modelService;
}
public Task StartAsync(CancellationToken cancellationToken)
{
// The model is loaded in the ModelService constructor; nothing else to do here.
return Task.CompletedTask;
}
public Task StopAsync(CancellationToken cancellationToken)
{
return Task.CompletedTask;
}
}
}

22
samples/Llama2/src/Program.cs Normal file

@@ -0,0 +1,22 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using LLama.Native;
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Hosting;
namespace Microsoft.Azure.SignalR.Samples.ChatRoom
{
public class Program
{
public static void Main(string[] args)
{
NativeLibraryConfig.Default.WithLogs();
CreateWebHostBuilder(args).Build().Run();
}
public static IWebHostBuilder CreateWebHostBuilder(string[] args) =>
WebHost.CreateDefaultBuilder(args)
.UseStartup<Startup>();
}
}

12
samples/Llama2/src/Properties/launchSettings.json Normal file

@@ -0,0 +1,12 @@
{
"profiles": {
"ChatRoom": {
"commandName": "Project",
"launchBrowser": true,
"environmentVariables": {
"ASPNETCORE_ENVIRONMENT": "Development"
},
"applicationUrl": "http://0.0.0.0:8080/"
}
}
}

73
samples/Llama2/src/README.md Normal file

@@ -0,0 +1,73 @@
# Build Your First Azure SignalR Service Application
In the [ChatRoomLocal sample](../ChatRoomLocal) you learned how to use SignalR to build a chat room application. In that sample, the SignalR runtime (which manages the client connections and message routing) runs on your local machine. As the number of clients increases, you'll eventually hit a limit on your machine and will need to scale it up to handle more clients, which is usually not an easy task. In this tutorial, you'll learn how to use Azure SignalR Service to offload connection management to the service so that you don't need to worry about the scaling problem.
## Provision a SignalR Service
First let's provision a SignalR service on Azure.
> If you don't have an Azure subscription, **[start now](https://azure.microsoft.com/en-us/free/)** to create a free account.
1. Open the Azure portal, click "Create a resource" and search for "SignalR Service".
![signalr-4](../../docs/images/signalr-4.png)
2. Navigate to "SignalR Service" and click "Create".
![signalr-5](../../docs/images/signalr-5.png)
3. Fill in basic information including resource name, resource group and location.
![signalr-2](../../docs/images/signalr-2.png)
Resource name will also be used as the DNS name of your service endpoint, so you'll get an endpoint `<resource_name>.service.signalr.net` that your application can connect to.
Select a pricing tier. There are two pricing tiers:
* Free: can handle 20 concurrent connections and send and receive 20,000 messages per day.
* Standard: has a limit of 1,000 concurrent connections and one million messages per day for *one unit*. You can scale up to 100 units for a single service instance, and you'll be charged by the number of units you use.
4. Click "Create", your SignalR service will be created in a few minutes.
![signalr-3](../../docs/images/signalr-3.png)
After your service is ready, go to the **Keys** page of your service instance and you'll get two connection strings that your application can use to connect to the service.
## Update Chat Room to Use Azure SignalR Service
Then, let's update the chat room sample to use the new service you just created.
Let's look at the key changes:
1. In [Startup.cs](Startup.cs), call `AddAzureSignalR()` after `AddSignalR()` and pass in the connection string to make the application connect to the service instead of hosting SignalR by itself.
```cs
public void ConfigureServices(IServiceCollection services)
{
...
services.AddSignalR()
.AddAzureSignalR();
}
```
You also need to reference the service SDK before using these APIs. This is how that would look in your ChatRoom.csproj file:
```xml
<PackageReference Include="Microsoft.Azure.SignalR" Version="1.*" />
```
Other than these changes, everything else remains the same; you can still use the hub interface you're already familiar with to write business logic.
> Under the hood, an endpoint `/chat/negotiate` is exposed for negotiation by the Azure SignalR Service SDK. When clients try to connect, it returns a special negotiation response containing the service URL and an access token, redirecting clients to the service endpoint derived from the connection string. Read more details about the redirection in SignalR's [Negotiation Protocol](https://github.com/aspnet/SignalR/blob/master/specs/TransportProtocols.md#post-endpoint-basenegotiate-request).
Now set the connection string in the [Secret Manager](https://docs.microsoft.com/en-us/aspnet/core/security/app-secrets?view=aspnetcore-2.1&tabs=visual-studio#secret-manager) tool for .NET Core, and run this app.
```
dotnet restore
dotnet user-secrets set Azure:SignalR:ConnectionString "<your connection string>"
dotnet run
```
When you open http://localhost:5000, you can see the application runs as usual, except that instead of hosting a SignalR runtime itself, it connects to the SignalR service running on Azure.
In this sample, you have learned how to use Azure SignalR Service to replace your self-hosted SignalR runtime. But you still need a web server to host your hub logic. In the next tutorial you'll learn how to use other Azure services to host your hub logic so you can get everything running in the cloud.

43
samples/Llama2/src/Startup.cs Normal file

@@ -0,0 +1,43 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using LLama.Web.Common;
using LLama.Web.Services;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
namespace Microsoft.Azure.SignalR.Samples.ChatRoom
{
public class Startup
{
public void ConfigureServices(IServiceCollection services)
{
services.AddOptions<LLamaOptions>()
.PostConfigure(x => x.Initialize())
.BindConfiguration(nameof(LLamaOptions));
services.AddHostedService<ModelLoaderService>();
services.AddSignalR()
.AddAzureSignalR();
services.AddSingleton<AsyncLock>();
services.AddSingleton<IModelService, ModelService>();
}
public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
{
if (env.IsDevelopment())
{
app.UseDeveloperExceptionPage();
}
app.UseDefaultFiles();
app.UseStaticFiles();
app.UseRouting();
app.UseEndpoints(endpoints =>
{
endpoints.MapHub<ChatSampleHub>("/chat");
});
}
}
}

29
samples/Llama2/src/appsettings.json Normal file

@@ -0,0 +1,29 @@
{
"Azure": {
"SignalR": {
"ConnectionString": ""
}
},
"LLamaOptions": {
"Models": {
"Name": "LLama2-7b-Chat",
"ModelPath": "<path-to-model>",
"ContextSize": 2048,
"Threads": 4,
"GpuLayerCount": 0
}
},
"Logging": {
"IncludeScopes": false,
"Debug": {
"LogLevel": {
"Default": "Information"
}
},
"Console": {
"LogLevel": {
"Default": "Information"
}
}
}
}

94
samples/Llama2/src/wwwroot/css/site.css Normal file

@@ -0,0 +1,94 @@
/*html, body {
font-size: 16px;
}
@media all and (max-device-width: 720px) {
html, body {
font-size: 20px;
}
}*/
html, body {
padding: 0;
height: 100%;
}
#messages {
width: 100%;
border: 1px solid #ccc;
height: calc(100% - 120px);
float: none;
margin: 0px auto;
padding-left: 0px;
overflow-y: auto;
}
textarea:focus {
outline: none !important;
}
.system-message {
background: #87CEFA;
}
.broadcast-message {
display: inline-block;
background: yellow;
margin: auto;
padding: 5px 10px;
}
.message-entry {
overflow: auto;
margin: 8px 0;
}
.message-avatar {
display: inline-block;
padding: 10px;
max-width: 8em;
word-wrap: break-word;
}
.message-content {
display: inline-block;
background-color: #b2e281;
padding: 10px;
margin: 0 0.5em 0.5em;
max-width: calc(60%);
word-wrap: break-word;
}
.message-content-pull-right {
float: right;
clear: both;
}
.message-content-pull-left {
float: left;
clear: both;
}
.message-content.pull-left:before {
width: 0;
height: 0;
display: inline-block;
float: left;
clear: both;
border-top: 10px solid transparent;
border-bottom: 10px solid transparent;
border-right: 10px solid #b2e281;
margin: 15px 0;
}
.message-content.pull-right:after {
width: 0;
height: 0;
display: inline-block;
float: right;
clear: both;
border-top: 10px solid transparent;
border-bottom: 10px solid transparent;
border-left: 10px solid #b2e281;
margin: 15px 0;
}

Binary data
samples/Llama2/src/wwwroot/favicon.ico Normal file (binary file not shown; size: 31 KiB)

192
samples/Llama2/src/wwwroot/index.html Normal file

@@ -0,0 +1,192 @@
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Cache-Control" content="no-cache, no-store, must-revalidate" />
<meta name="viewport" content="width=device-width">
<meta http-equiv="Pragma" content="no-cache" />
<meta http-equiv="Expires" content="0" />
<link href="https://cdn.jsdelivr.net/npm/bootstrap@3.3.7/dist/css/bootstrap.min.css" rel="stylesheet" />
<link href="css/site.css" rel="stylesheet" />
<title>Azure SignalR Group Chat</title>
</head>
<body>
<h2 class="text-center" style="margin-top: 0; padding-top: 30px; padding-bottom: 30px;">Azure SignalR Group Chat</h2>
<div class="container" style="height: calc(100% - 110px);">
<div id="messages" style="background-color: whitesmoke; "></div>
<div style="width: 100%; border-left-style: ridge; border-right-style: ridge;">
<textarea id="message"
style="width: 100%; padding: 5px 10px; border-style: hidden;"
placeholder="Type message and press Enter to send..."></textarea>
</div>
<div style="overflow: auto; border-style: ridge; border-top-style: hidden;">
<button class="btn-warning pull-right" id="echo">Echo</button>
<button class="btn-success pull-right" id="sendmessage">Send</button>
</div>
</div>
<div class="modal alert alert-danger fade" id="myModal" tabindex="-1" role="dialog" aria-labelledby="myModalLabel">
<div class="modal-dialog" role="document">
<div class="modal-content">
<div class="modal-header">
<div>Connection Error...</div>
<div><strong style="font-size: 1.5em;">Hit Refresh/F5</strong> to rejoin. ;)</div>
</div>
</div>
</div>
</div>
<!--Reference the SignalR library. -->
<script type="text/javascript" src="https://cdn.jsdelivr.net/npm/@aspnet/signalr@1.1.0/dist/browser/signalr.min.js"></script>
<!--Add script to update the page and send messages.-->
<script type="text/javascript">
document.addEventListener('DOMContentLoaded', function () {
function generateRandomName() {
return Math.random().toString(36).substring(2, 10);
}
// Get the user name and store it to prepend to messages.
var username = generateRandomName();
var promptMessage = 'Enter your name:';
do {
username = prompt(promptMessage, username);
if (!username || username.startsWith('_') || username.indexOf('<') > -1 || username.indexOf('>') > -1) {
username = '';
promptMessage = 'Invalid input. Enter your name:';
}
} while (!username);
// Set initial focus to message input box.
var messageInput = document.getElementById('message');
messageInput.focus();
function createMessageEntry(id, encodedName, encodedMsg) {
var entry = document.createElement('div');
entry.classList.add("message-entry");
if (encodedName === "_SYSTEM_") {
entry.innerHTML = encodedMsg;
entry.classList.add("text-center");
entry.classList.add("system-message");
} else if (encodedName === "_BROADCAST_") {
entry.classList.add("text-center");
entry.classList.add("broadcast-message");
entry.innerHTML = encodedMsg;
} else if (encodedName === username) {
let innerNamingEntry = document.createElement('div');
innerNamingEntry.classList.add("message-avatar");
innerNamingEntry.classList.add("pull-right");
innerNamingEntry.innerHTML = encodedName;
let innerMsgEntry = document.createElement('div');
innerMsgEntry.classList.add("message-content");
innerMsgEntry.classList.add("pull-right");
innerMsgEntry.innerHTML = encodedMsg;
if (id) {
innerMsgEntry.id = id;
}
entry.appendChild(innerNamingEntry);
entry.appendChild(innerMsgEntry);
} else {
let innerNamingEntry = document.createElement('div');
innerNamingEntry.classList.add("message-avatar");
innerNamingEntry.classList.add("pull-left");
innerNamingEntry.innerHTML = encodedName;
let innerMsgEntry = document.createElement('div');
innerMsgEntry.classList.add("message-content");
innerMsgEntry.classList.add("pull-left");
innerMsgEntry.innerHTML = encodedMsg;
if (id) {
innerMsgEntry.id = id;
}
entry.appendChild(innerNamingEntry);
entry.appendChild(innerMsgEntry);
}
return entry;
}
var messageCallback = function (name, id, message) {
if (!message) return;
// HTML-encode the message; the display name was validated at input time to contain no '<' or '>'.
var encodedName = name;
var encodedMsg = message.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
// Find whether id exists
var messageBox = document.getElementById('messages');
if (id && document.getElementById(id)) {
let entity = document.getElementById(id);
entity.innerHTML += encodedMsg;
} else {
var messageEntry = createMessageEntry(id, encodedName, encodedMsg);
messageBox.appendChild(messageEntry);
}
messageBox.scrollTop = messageBox.scrollHeight;
};
function bindConnectionMessage(connection) {
// Create a function that the hub can call to broadcast messages.
connection.on('broadcastMessage', messageCallback);
connection.on('echo', messageCallback);
connection.onclose(onConnectionError);
}
function onConnected(connection) {
console.log('connection started');
connection.send('broadcastMessage', '_SYSTEM_', username + ' JOINED');
document.getElementById('sendmessage').addEventListener('click', function (event) {
// Call the broadcastMessage method on the hub.
if (messageInput.value) {
if (messageInput.value.startsWith("@llama ")) {
let promptContent = messageInput.value.substring(7);
connection.send('broadcastMessage', username, messageInput.value);
connection.send('inference', username, promptContent);
} else {
connection.send('broadcastMessage', username, messageInput.value);
}
}
// Clear text box and reset focus for next comment.
messageInput.value = '';
messageInput.focus();
event.preventDefault();
});
document.getElementById('message').addEventListener('keypress', function (event) {
if (event.keyCode === 13) {
event.preventDefault();
document.getElementById('sendmessage').click();
return false;
}
});
document.getElementById('echo').addEventListener('click', function (event) {
// Call the echo method on the hub.
connection.send('echo', username, messageInput.value);
// Clear text box and reset focus for next comment.
messageInput.value = '';
messageInput.focus();
event.preventDefault();
});
}
function onConnectionError(error) {
if (error && error.message) {
console.error(error.message);
}
var modal = document.getElementById('myModal');
modal.classList.add('in');
modal.style = 'display: block;';
}
var connection = new signalR.HubConnectionBuilder()
.withUrl('/chat')
.build();
bindConnectionMessage(connection);
connection.start()
.then(function () {
onConnected(connection);
})
.catch(function (error) {
console.error(error.message);
});
});
</script>
</body>
</html>