# Aphrodite Engine
Aphrodite is the open-source large-scale inference engine designed to serve thousands of users on the PygmalionAI website.
- Attention mechanism by vLLM for fast throughput and low latencies
- Support for many SOTA sampling methods
- Exllamav2 GPTQ kernels for better throughput at lower batch sizes
This notebook goes over how to use an LLM with LangChain and Aphrodite.

To use, you should have the `aphrodite-engine` python package installed.
## Installing the langchain packages needed to use the integration
```python
%pip install -qU langchain-community
%pip install --upgrade --quiet aphrodite-engine==0.4.2
# %pip list | grep aphrodite
```
```python
from langchain_community.llms import Aphrodite

llm = Aphrodite(
    model="PygmalionAI/pygmalion-2-7b",
    trust_remote_code=True,  # mandatory for hf models
    max_tokens=128,
    temperature=1.2,
    min_p=0.05,
    mirostat_mode=0,  # change to 2 to use mirostat
    mirostat_tau=5.0,
    mirostat_eta=0.1,
)

print(
    llm.invoke(
        '<|system|>Enter RP mode. You are Ayumu "Osaka" Kasuga.<|user|>Hey Osaka. Tell me about yourself.<|model|>'
    )
)
```
API Reference: Aphrodite

```output
INFO 12-15 11:52:48 aphrodite_engine.py:73] Initializing the Aphrodite Engine with the following config:
INFO 12-15 11:52:48 aphrodite_engine.py:73] Model = 'PygmalionAI/pygmalion-2-7b'
INFO 12-15 11:52:48 aphrodite_engine.py:73] Tokenizer = 'PygmalionAI/pygmalion-2-7b'
INFO 12-15 11:52:48 aphrodite_engine.py:73] tokenizer_mode = auto
INFO 12-15 11:52:48 aphrodite_engine.py:73] revision = None
INFO 12-15 11:52:48 aphrodite_engine.py:73] trust_remote_code = True
INFO 12-15 11:52:48 aphrodite_engine.py:73] DataType = torch.bfloat16
INFO 12-15 11:52:48 aphrodite_engine.py:73] Download Directory = None
INFO 12-15 11:52:48 aphrodite_engine.py:73] Model Load Format = auto
INFO 12-15 11:52:48 aphrodite_engine.py:73] Number of GPUs = 1
INFO 12-15 11:52:48 aphrodite_engine.py:73] Quantization Format = None
INFO 12-15 11:52:48 aphrodite_engine.py:73] Sampler Seed = 0
INFO 12-15 11:52:48 aphrodite_engine.py:73] Context Length = 4096
INFO 12-15 11:54:07 aphrodite_engine.py:206] # GPU blocks: 3826, # CPU blocks: 512
```

```output
Processed prompts: 100%|██████████| 1/1 [00:02<00:00, 2.91s/it]
```

```output
I'm Ayumu "Osaka" Kasuga, and I'm an avid anime and manga fan! I'm pretty introverted, but I've always loved reading books, watching anime and manga, and learning about Japanese culture. My favourite anime series would be My Hero Academia, Attack on Titan, and Sword Art Online. I also really enjoy reading the manga series One Piece, Naruto, and the Gintama series.
```
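Beyond a single `invoke` call, the same `Aphrodite` instance can be composed with a prompt template like any other LangChain LLM. The sketch below is not part of the original notebook output: the template text and input variable names are illustrative assumptions, and it reuses the `llm` defined above (it also assumes `langchain-core` is available, which is installed alongside `langchain-community`).

```python
from langchain_core.prompts import PromptTemplate

# Hypothetical prompt template mirroring the persona-style format used above.
template = '<|system|>Enter RP mode. You are {character}.<|user|>{question}<|model|>'
prompt = PromptTemplate.from_template(template)

# Compose prompt -> LLM into a runnable chain (reusing the `llm` defined earlier).
chain = prompt | llm

print(
    chain.invoke(
        {
            "character": 'Ayumu "Osaka" Kasuga',
            "question": "Hey Osaka. Tell me about yourself.",
        }
    )
)
```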
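For multi-GPU serving, the `Aphrodite` wrapper follows the same pattern as the vLLM integration and can shard the model across GPUs with tensor parallelism. The snippet below is a sketch under the assumption that the wrapper accepts a `tensor_parallel_size` argument and that at least two GPUs are visible; check the Aphrodite documentation for the exact parameters supported by your version.

```python
from langchain_community.llms import Aphrodite

# Assumed parameter: tensor_parallel_size splits the model weights across GPUs
# (set it to the number of GPUs you want to shard the model over).
multi_gpu_llm = Aphrodite(
    model="PygmalionAI/pygmalion-2-7b",
    trust_remote_code=True,  # mandatory for hf models
    tensor_parallel_size=2,  # assumes 2 visible GPUs
    max_tokens=128,
)

print(multi_gpu_llm.invoke("What is the future of AI?"))
```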