
Second look at Microsoft 365 Copilot

A few weeks ago, I posted a “first look” blog about my early experiences testing Microsoft 365 Copilot. I approached Copilot from the perspective of an average and untrained user who just wanted to get his questions answered. I purposefully didn’t read any of the documentation or try any specific prompt engineering… just as an average user would, I jumped in and started doing stuff, then I wrote up the results.

That first look generated a lot of attention after it was posted on Hacker News, which was briefly fun, and it led to a running series of conversations and tests with my old friend Scott Schnoll from Microsoft. Based on those discussions, and what I learned from them, I wanted to revisit the tests and see whether any of the results had changed.

Before I do that, let’s talk a little about the Copilot “system” itself. I put “system” in quotes to emphasize that Copilot is really a system of systems built on top of, and into, various existing parts of the Microsoft 365 substrate. What most people think of as the AI piece of Copilot (the LLM) is only one part of the overall system. Microsoft has integrated the LLM with tenant data, web search, and (some of) the Microsoft 365 Apps suite. The Copilot experience you see is delivered by this combination, not just by the LLM alone. The LLM is based on the GPT family of engines from OpenAI; GPT can perform a wide range of content-creation tasks (which is why it’s called a “generative” AI tool), including generating original content that looks like it was written or drawn by a human. Copilot uses private instances of this LLM in Microsoft datacenters that mimic natural language based on patterns from large amounts of training data.

Completely apart from the work done to train the underlying global and tenant models, Microsoft has added its own proprietary code. Copilot orchestrates multiple components and iteratively processes their output: your user prompt is combined with grounding data, history from your chat or prompt sessions, and a default system prompt customized for Microsoft 365. This default prompt outlines rules for interacting responsibly with the user, but it also gives the GPT models context.
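To make that orchestration a bit more concrete, here's a minimal sketch (in Python) of how a grounded request might be assembled. It's purely illustrative: the function and field names are my own invention, not Microsoft's actual implementation.

```python
# Purely illustrative sketch of grounded prompt assembly; this is not
# Microsoft's orchestration code, and all names here are invented.

def build_grounded_prompt(user_prompt, grounding_snippets, chat_history, system_prompt):
    """Combine the pieces described above into a single request payload."""
    messages = [{"role": "system", "content": system_prompt}]
    # Prior turns from the chat session give the model conversational context.
    messages.extend(chat_history)
    # Grounding data (document excerpts, search results, etc.) is stitched
    # into the user turn so the model can draw on it and cite it.
    grounding = "\n\n".join(f"[source {i + 1}] {s}" for i, s in enumerate(grounding_snippets))
    messages.append({"role": "user", "content": f"{grounding}\n\nQuestion: {user_prompt}"})
    return messages
```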

My point in saying this is to highlight that there are a lot of moving parts in the keyboard-to-output pipeline, many of which may not be obvious and some of which are not broadly documented.

Before we dig into the testing: two notes.

Note 1 is that to get good results, you have to enable web search. The Copilot for Microsoft 365 chat experience in Bing, Edge, and the Copilot app in Teams can incorporate web searches and, if you don’t enable this, you may not see results for some queries. This enablement is done in two places: at the tenant level (where it’s on by default) and at the plugin level, by each individual user. If the tenant setting is off, the end-user toggle will be disabled. I didn’t realize that plugin-level enablement was required at first, and having it off negatively affected my first set of results.

Scott pointed out that Microsoft strongly believes that allowing Copilot to reference web content improves the user experience (it certainly did for me!) and increases productivity with Copilot, which is why the feature is automatically enabled at the tenant level. In fact, allowing Copilot to reference web content decreases its chances of hallucinating by giving it an additional source of grounding data.

Keep in mind that, when web search is enabled, customer content will be sent to Bing, outside the tenant boundary, which is why there is also a user-level toggle. It’s also why some customers might choose to disable it altogether, which a Global admin can do from the Search & intelligence page in the Microsoft 365 admin center.

To enable this:

  1. Open the Copilot web app
  2. Click the plugin icon, on the right of the chat bar (it’s marked with a red square in the screenshot below)
  3. Make sure that the “Web content” toggle is turned on.

Note 2 is that you can give Copilot feedback on the quality of the results. I didn’t do this in my first tests because I assumed the feedback went to /dev/null. As it turns out, Scott tells me that the feedback from Copilot users is reviewed by humans, and that Microsoft takes this feedback very seriously, with regular reviews and action based on user feedback. I can say that I saw several improvements in result quality during the period of this testing.

For that reason, the more detail you include in the feedback, the more useful it is– so a thumbs-down isn’t as useful as a thumbs-down with a short explanation.

Throughout these tests, the benchmark I’ve used is to give a thumbs-up if the result is correct, complete, and useful. If it’s incorrect, incomplete, or not useful, it gets a thumbs-down. It also gets a thumbs-down if there’s something about it that I think needs further explanation or is not quite awesome. The feedback isn’t applied directly to Copilot, so the humans who read it must decide what changes to make. Detailed feedback helps them do that.

Test 1: Find and summarize something in a OneDrive document

I’ll use the same query I did before: “Can you summarize how the glideslope coupler on a Century III autopilot operates?” The result is spot on: correct and complete, including a citation to the source document, which is in my OneDrive. (To see the reference(s), click the little chevron next to the “1 reference” text, as shown in the image below). This one got a thumbs-up.

Test 2: Answer some fact-based questions from a document

In my previous test, I asked some questions about my airplane that I thought should be easy to answer. The first one was “what’s the single-engine service ceiling of a Baron 55?” Now, there’s a trick to answering this, which I didn’t fully realize when I asked it. The actual SE service ceiling for a specific Baron 55 will vary according to the aircraft weight, the temperature and density of the surrounding air, and the altitude you’re currently flying at. There’s a chart in the pilot operating handbook (POH) that you use to calculate this. I didn’t take that into account when asking (and, in case you’re wondering, Copilot isn’t currently able to read and interpolate graphs, sorry!)
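For the curious, here's roughly what that chart lookup amounts to when you do it by hand: find the lines for your conditions and interpolate between them. This is a minimal sketch with made-up placeholder numbers, not figures from the actual POH, so don't use it for flight planning.

```python
# Illustrative sketch of the kind of chart lookup a pilot does by hand.
# The numbers below are hypothetical placeholders, NOT values from the
# Baron 55 POH.
import numpy as np

# Hypothetical chart data: aircraft weight (lbs) vs. single-engine
# service ceiling (ft) at a given temperature.
weights = np.array([4400, 4700, 5000, 5300])   # lbs (placeholder values)
ceilings = np.array([9500, 8500, 7500, 6500])  # ft  (placeholder values)

def se_service_ceiling(weight_lbs: float) -> float:
    """Linearly interpolate the chart for an intermediate weight."""
    return float(np.interp(weight_lbs, weights, ceilings))

print(se_service_ceiling(4850))  # a weight that falls between two chart lines
```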

When I did this test with Scott, the results were pretty good. When I repeated the test for this post, the result was different. Here’s what I got from “what’s the single-engine service ceiling of a Baron 55?”

This got a thumbs-down because of the reference to the Colemill-conversion aircraft. That conversion puts a (delightfully) more powerful engine on the Baron airframe, which raises climb performance and the single-engine service ceiling. The wording of the result implies that the ceiling only varies with weight on non-Colemill Barons, which isn’t right.

My test with Scott only included the second citation, and correctly answered the question for the specific airplane mentioned in that cited article. The last sentence of the result is not wrong but it’s also not precisely correct– I’d say it’s “usefully wrong” but the answer still got a thumbs-down.

I tried again with a more refined query: “what’s the single-engine service ceiling of a 1968 Baron B55?” This gave me basically the second half of the previous answer: 7000′ at 5100lbs. Unfortunately, the article cited quotes numbers for a 1973 Baron 55. Now, here’s the thing: the 1973 and 1968 models have the same engines and wing area, and the same weight, so the answer is correct but the framing could use some improvement. I gave this one a thumbs-up (remember: complete, correct, useful) but that was a little bit of a charity call.

In my testing, I discovered something that probably has an impact on result quality. Some of the PDF files I uploaded are scanned images and some are text. By that, I mean that (e.g.) the Baron POH I uploaded is basically several hundred pages of images. You can’t select or search for text in that file using the Edge PDF viewer. Asking simple questions like “what’s the wing area of a Baron 55?” didn’t produce a result even though I could see the correct answer in the document– but neither did searching for “wing area” in the PDF. I suspect there’s still some more work to be done here.
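If you want to check whether a PDF has a real text layer or is just scanned page images, a quick test is to see whether any text can be extracted from it. Here's a minimal sketch using the pypdf library; the file name is just an example.

```python
# Quick check for an extractable text layer vs. scanned page images.
from pypdf import PdfReader

def has_text_layer(path: str, sample_pages: int = 5) -> bool:
    reader = PdfReader(path)
    for page in reader.pages[:sample_pages]:
        text = (page.extract_text() or "").strip()
        if text:
            return True   # at least one sampled page has selectable text
    return False          # likely a scan; it would need OCR before indexing

print(has_text_layer("baron-poh.pdf"))  # example file name
```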

I also belatedly figured out that the POH doesn’t make any reference to model years– just serial numbers! The POH covers serial numbers TC-957 through TC-1607, which roughly corresponds to the 1964-1975 production years.

Both these examples were significantly better than in my first round of testing.

I continued by asking “what Vle is for a 1968 Beech Baron 55”. I did this even though I knew that the POH wasn’t indexed by model year, because I wanted to fairly replicate the previous tests. Copilot got the correct answer (yay! thumbs up!) but it didn’t cite my own checklist as a source, which puzzled me a little bit. Here’s the first result it produced:

Scott taught me that you can include specific files as a source or reference with the “/” key (a feature called Context IQ, or CIQ), so I decided to ask again using my specific checklist. The answer was correct and complete: “According to the N421BJ checklist, the Vle (maximum landing gear extended speed) for this aircraft is 143 knots.” I’m still not sure why my own document wasn’t prioritized over a similar document from the web in the earlier attempt.

Another question from round 1: “what is the minimum controllable airspeed for a Baron 55?” This speed (abbreviated Vmca) is incredibly important because if you go slower, you’re at risk of a fatal crash. That search produced no results. Changing it to “what is vmca for a Baron 55?” also produced no results. Issuing either of these queries to vanilla Bing.com resulted in finding the “Baron 55-A55 Checklist” file referenced in the previous question as the first result–so something was amiss here. When I repeated the test later, I got correct results.

Test 3: Document summarization

Summarization is one of my big interests; I want an assistant that can absorb and abstract complex content, as long as the abstract is complete, correct, and useful. (I have another post in the works covering the use of Copilot in Word for document summarization, too.) In the previous tests, I asked for a summary of a document from my OneDrive.

Careful readers will notice that this summary is slightly different from the one in round 1 of the tests. This is an important point to note: if you ask Copilot the same question 10 times, you won’t get exactly the same answer each time. Transformer-based machine learning systems are not deterministic, so this variation in answers is normal and expected, but it may also be a little off-putting if you’re not expecting it. Interestingly, this response also includes some sugar: “Is there anything else you would like to know?” is just frosting to make Copilot seem more polite and human; I didn’t think it added any actual value but, as it turns out, treating LLMs more humanely improves their output.
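The variation isn't mysterious: responses are sampled rather than computed deterministically, so any sampling temperature above zero will produce different wording on repeated runs. Here's a minimal sketch against the OpenAI API that demonstrates the effect (the model name is just an example, and this is obviously not how Copilot itself is invoked).

```python
# Demonstrates sampling variation: the same prompt, asked three times with a
# nonzero temperature, comes back worded differently each time.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompt = "Summarize the major problems with reusable passwords in one sentence."

for _ in range(3):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                                 # example model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,                                      # sampling adds variation
    )
    print(resp.choices[0].message.content)
```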

I then asked a more generic question: “can you summarize the major problems with reusable passwords?” The Copilot response here was excellent: it included some “company” documents plus a web search. Because I have multiple “company” documents that are relevant, I think it could be argued that the company summary could be richer or more detailed, but this is a quite good response overall. Thumbs up.

Test 4: Drafting email with Copilot for Outlook

I wasn’t going to repeat this test from the original (using the prompt “Draft an email to Adam Henry. Thank him for joining us at our customer dinner. Ask him if it would be possible to have a technical meeting with him during the week of 20 December to discuss our turbo encabulator.”) because I didn’t expect the results to be meaningfully different. Just to be sure, I did repeat it and, sure enough, the result was as before: correct and complete, but also bland and blah. No score. That said, I’ve shared my feedback with Microsoft, and they shared with me (under NDA) some of their plans to improve this experience. Once their work is complete and my NDA is lifted, I’ll have more to say about this experience.

Test 5: Coaching with Copilot for Outlook

Based on my conversations with Scott, I decided to wait to redo these tests until I do a deeper dive into Copilot for Outlook. I wanted to make sure I had found some realistic use cases for coaching; as someone who drafts dozens of emails per day, I have a pretty high bar for what would be useful in a feature like this.

Who’s training who?

It’s fair to say that Copilot performed much better in this second round of tests than it did in the first round. Some of that is because Microsoft fixed issues with various parts of the Copilot system; some of it is because I have a more refined understanding of what I should be expecting (e.g. Copilot won’t interpolate graph data for me!)

As with getting a new pet, the training process is actually bidirectional, and you should expect your users’ behavior to change as they learn more about how to get the desired results from Copilot. I have a lot more I want to say about this bidirectional training, because it’s one of the most interesting areas around adoption of generative AI in the workplace. Stay tuned for more… but next, Outlook!



First look at Microsoft 365 Copilot

Microsoft has made tons of noise about their Microsoft 365 Copilot toolset. They’ve made a lot of promises about its capabilities to enhance individual and team productivity, but access to it has been very limited so far. This week, the product team gave Microsoft 365 MVPs access to a test environment with Copilot in it. This sandbox doesn’t have any data, so we had to bring our own. The product team said we could test the tools however we liked; this is the experience I’ve had so far.

To set up my sandbox, I uploaded some “work”-related documents. One large folder contains manuals and documents from my future second career as a commercial pilot; the second is an archive of Practical 365 articles I wrote. I wanted to see how well summarization and Q&A worked. The Copilot family includes different assistant modalities for different applications– for example, Copilot for Outlook can do some things unique to Outlook, as can Copilot for PowerPoint, and so on. I haven’t tested each of the individual modalities yet.

In my early tests so far, the results are… mixed. Some of these issues are probably due to user error and/or misunderstanding of what results I can reasonably expect. Some of them, I fear, are not.

Test 1: Find and summarize something in a OneDrive document

One of the documents I uploaded is the operating handbook for the Century III autopilot in my plane. I asked “Can you summarize how the glideslope coupler on a Century III autopilot operates?” The answer I got was “I’m sorry, but I don’t have any information about the glideslope coupler on a Century III autopilot in my internal knowledge or in your OneDrive.” There are 4 pages in one of the documents in my OneDrive that detail the operation, but it’s possible that Copilot hasn’t indexed things properly yet. I’ll try some other tests after a decent interval has passed.

Test 2: Answer some fact-based questions from a document

Here’s what should be a simple fact-based question: “what’s the single-engine service ceiling of a Baron 55?”

As you can see, the first answer is precise… but it is completely incorrect. I supplied a URL that has a better answer, which Copilot read and summarized. Then I asked the question again, with different and more precise wording. Different models, and model years, of Baron have different engines, which will affect their single-engine service ceilings.

This answer is also completely incorrect, but at least it’s different from the previous incorrect answer. Anyone who plans a flight with a single-engine service ceiling of 15000′ or more in this aircraft will be in for a very rude surprise if they fly into mountainous terrain. Copilot did correctly explain what the single-engine service ceiling actually is, which is nice.

Here’s another example:

The original answer of “154 knots” is wrong, so I tried being helpful again by pointing Copilot directly to a better source. The specific page I referenced contains a table that lists the correct Vle speed. I realized after taking the screenshot that the table is a little confusing, because it lists two speeds for two different serial number ranges. But neither of those speeds is “154”.

Then I tried a different question, still related to the same document: I asked what the minimum controllable airspeed was for a Baron 55. Copilot’s answer was again not awesome:

According to the Pilot’s Operating Handbook for the Baron 95-B55 and 95-B55A, the minimum controllable airspeed for a Baron 55 is not explicitly stated. [PR: false; it’s shown in the table on 2-3] However, the handbook does provide information on the stall speed, [PR: also false; these speeds are not shown in the POH] which is the minimum speed at which the aircraft can maintain controlled flight. The stall speed for a Baron 55 with flaps up is 73 knots and with flaps down is 64 knots. [PR: this is correct but there is no citation shown]

As I said earlier, I thought it was possible that Copilot needed more time to read documents. I repeated some of these tests a day later and, surprisingly, the quality of the answers got worse. I thought maybe giving the exact serial number of the aircraft might help produce a more precise answer… but instead, Copilot forgot the answer it was able to find the previous day. Repeating the previous query produced the same results as before.

In this case, Copilot subtracted value compared to a simple Internet search. Google Search produced the correct answer as its second result (after an ad). Bing Search didn’t show the correct answer on the first page of results and then I got tired of looking… I suspect this performance is a clue to Copilot’s inability to discover the correct answer.

Test 3: Document summarization

I alluded to this in test #1, but I’m prepared to say that the autopilot handbook hadn’t been indexed yet. I gave Copilot an easy one this time, and it did quite well:

Accurate and complete summary, plus a reference… very nice!

I tried again with a different question, to test whether Copilot would prefer my own work documents over general Internet searches. The first attempt produced a correct and useful summary, but my “company” data wasn’t included until I explicitly told it to:

I plan to run some more summarization tests after uploading some more documents (including the book chapters from Office 365 for IT Pros). So far, though, I am not seeing the improvement in results quality I had hoped for given that I am providing “work” data for the model to use. This may be 100% due to delays in ingesting the source material though.

Test 4: Drafting email with Copilot for Outlook

Copilot for Outlook can operate in two modes: you can ask Copilot to draft emails for you, and you can get coaching as you write. Drafting mode works reasonably well.

I first tried a very simple prompt: “Draft an email to Adam Henry. Thank him for joining us at our customer dinner. Ask him if it would be possible to have a technical meeting with him during the week of 20 December to discuss our turbo encabulator.”

The result was decent, if very bland. It’s essentially the same thing as the prompt I gave it, so I saved zero time by asking Copilot to do the work for me.

Dear Adam, I wanted to thank you for joining us at our customer dinner. We appreciate you taking the time to be with us. I hope you found the event informative and enjoyable. We would like to schedule a technical meeting with you during the week of December 20th to discuss our turbo encabulator. Would it be possible to arrange this at your convenience? Thank you and best regards, Paul.

I wanted to test whether Copilot for Outlook could create an email that summarized a document. Sadly, it can’t. My first test was asking it to summarize one of the Practical 365 columns from my OneDrive, but Copilot responded that it didn’t see the file. I then tried attaching the file to the draft email and asking for a summary.

This is really unfortunate, and I hope it’s something that Microsoft is working on.

In fairness, there are several things that Copilot for Outlook supports that I didn’t test. Because this sandbox environment doesn’t have any real data in it, I didn’t try to test summarizing of email threads. Copilot for Outlook is also supposed to be able to read a meeting summary generated by Intelligent meeting recap and then attach it to an email.

Test 5: Coaching with Copilot for Outlook

I selected “Coaching by Copilot” and started typing. After a few sentences, nothing happened. This is probably user error on my part. I then created a new draft, typed some text, and chose “Coaching by Copilot” again. This is the result I got.

Fair enough; I guess Copilot coaching isn’t meant to talk people out of sending emotionally-laden rants. I then fed it an edited version of an email I sent to a customer the other day, asking for a technical meeting. My source text of “Is there any way you might be able to squeeze me in to your calendar? If not, no worries—we will catch you in early January—but if possible I’d really like to get your technical input sooner rather than later.” generated this result:

My own personal opinion, supported by nothing, is that the original text wasn’t demanding or impatient, but then again this is why I don’t usually use writing assistants; I don’t usually agree with their suggestions on tone and voice. It’s interesting that Copilot didn’t attempt to help with the rudeness of the first email but was willing to suggest improvements in the second one.

A good first step…

LLM-based generative AIs have a lot of potential, as Gartner, AI vendors, and the press never tire of telling us. The Copilot tools in Microsoft 365 and Outlook are quite well integrated; they’re easy to find and easy to interact with. Thus far, I have to say that I am not super impressed with the output or utility of these tools though. I can see the potential for productivity improvements, and I’ll keep testing as the product evolves and improves.

Update 1…

After I posted this on Twitter, Microsoft’s Abram Jackson pointed out that I needed to enable per-user web search. There are two settings that control this: the first is at the tenant level, and it was on. However, individual users also have to opt in to including web results. I did that, with some excitement, and repeated the tests. Sadly, this change made some of my results worse. Microsoft’s Scott Schnoll also pointed out that you can provide result feedback with the thumbs-up/thumbs-down icons (see above screenshots), so I’ll be smashing those with a vengeance.

